You are right, changing the figure size would fix the issue (updated the notebook). In practice, I think the issue becomes choosing a good aspect ratio such that
a) the general proportions of the plot look OK, and
b) the proportions of the boxes with respect to the arrows look OK.

It's all possible for a user to do, but for my use cases (e.g., making a quick graphic for a presentation / meeting) it was just quicker with graphviz. On the other hand, I would prefer/recommend the plot_tree function just because it is based on matplotlib ... In any case, I haven't had a chance to look at the plot_tree function, but I guess this could potentially be relatively easy to address. I guess it would just require finding and setting a good default value for
a) the XOR case, where a user provides either feature names or class label names, and
b) the AND case, where a user provides both feature names and class label names.
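(A minimal sketch of the figure-size workaround discussed above; the iris dataset, max_depth=3, and figsize=(16, 8) are illustrative assumptions, not taken from the linked notebooks.)

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# A larger (and wider-than-tall) figure reduces the overlap between boxes
# when both feature_names and class_names are passed.
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
plt.show()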
> On Oct 6, 2019, at 9:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>
> Thanks!
> I'll double check that issue. Generally you have to set the figure size to get good results.
> We should probably add some code to set the figure size automatically (if we create a figure?).
>
> On 10/6/19 10:40 AM, Sebastian Raschka wrote:
>> Sure, I just ran an example I made with graphviz via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. I made a reproducible example here so that you can take a look:
>> https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb
>>
>> Happy to add this to the sklearn issue list if there's no issue filed for that yet.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 6, 2019, at 9:10 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>
>>> On 10/4/19 11:28 PM, Sebastian Raschka wrote:
>>>> The docs show a way such that you don't need to write it as a png file, using tree.plot_tree:
>>>> https://scikit-learn.org/stable/modules/tree.html#classification
>>>>
>>>> I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but I don't remember), which is why I still stick to graphviz.
>>> Can you give me examples that don't look as nice? I would love to improve it.
>>>
>>>> For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do
>>>>
>>>> conda install pydotplus
>>>> conda install graphviz
>>>>
>>>> Coincidentally, I just made an example for a lecture I was teaching on Tue:
>>>> https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>> On Oct 4, 2019, at 10:09 PM, C W <tmrs...@gmail.com> wrote:
>>>>>
>>>>> On a separate note, what do you use for plotting?
>>>>>
>>>>> I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird) -- something like that wouldn't be mentioned in textbooks.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>> On Oct 4, 2019, at 6:33 PM, C W <tmrs...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks Sebastian, I think I get it.
>>>>>>
>>>>>> It's just that I have never seen it this way. Quite different from what I'm used to in Elements of Statistical Learning.
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>>> Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do something like
>>>>>>
>>>>>> if x == 1 then right child node
>>>>>> else left child node
>>>>>>
>>>>>> Instead, what it does is
>>>>>>
>>>>>> if x >= 0.5 then right child node
>>>>>> else left child node
>>>>>>
>>>>>> These are basically equivalent, as you can see when you just plug in the values 0 and 1 for x.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>> On Oct 4, 2019, at 5:34 PM, C W <tmrs...@gmail.com> wrote:
>>>>>>>
>>>>>>> I don't understand your answer.
>>>>>>>
>>>>>>> Why, after one-hot encoding, does it still output greater than 0.5 or less than 0.5? Does the sklearn website have a working example on categorical input?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>>>> Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is use the following conversion:
>>>>>>>
>>>>>>> treat as car_Audi=1 if car_Audi >= 0.5
>>>>>>> treat as car_Audi=0 if car_Audi < 0.5
>>>>>>>
>>>>>>> or, it may be
>>>>>>>
>>>>>>> treat as car_Audi=1 if car_Audi > 0.5
>>>>>>> treat as car_Audi=0 if car_Audi <= 0.5
>>>>>>>
>>>>>>> (I forgot which one sklearn uses, but either way, it will be fine.)
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> But, the decision tree is still mistaking the one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>>>>>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
>>>>>>>>
>>>>>>>> This is why we do one-hot encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.
>>>>>>>>
>>>>>>>> Nicolas
>>>>>>>>
>>>>>>>> On 10/4/19 2:01 PM, C W wrote:
>>>>>>>>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, a typo on my part.
>>>>>>>>>
>>>>>>>>> Looks like I did the one-hot encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
>>>>>>>>>
>>>>>>>>> But, the decision tree is still mistaking the one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>>>>>>>
>>>>>>>>> Is there a good toy example on the sklearn website? I only see this:
>>>>>>>>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
>>>>>>>>> That's not a one-hot encoding then.
>>>>>>>>>
>>>>>>>>> For an Audi data point, it should be
>>>>>>>>>
>>>>>>>>> BMW=0
>>>>>>>>> Toyota=0
>>>>>>>>> Audi=1
>>>>>>>>>
>>>>>>>>> for BMW
>>>>>>>>>
>>>>>>>>> BMW=1
>>>>>>>>> Toyota=0
>>>>>>>>> Audi=0
>>>>>>>>>
>>>>>>>>> and for Toyota
>>>>>>>>>
>>>>>>>>> BMW=0
>>>>>>>>> Toyota=1
>>>>>>>>> Audi=0
>>>>>>>>>
>>>>>>>>> The split threshold should then be at 0.5 for any of these features.
>>>>>>>>>
>>>>>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>>> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I'm getting some funny results. I am doing a regression decision tree; the response variables are assigned to levels.
>>>>>>>>>>
>>>>>>>>>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories.
>>>>>>>>>>
>>>>>>>>>> The tree splits at 0.5 and 1.5. Am I doing the one-hot encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>>>>>>>>>
>>>>>>>>>> In R, for instance, you do as.factor(), which explicitly states the data type.
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>>>>>>>>>>> Thanks, Guillaume.
>>>>>>>>>>> ColumnTransformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
>>>>>>>>>>>
>>>>>>>>>>> It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see whether we can improve something there.
>>>>>>>>>>> Do you mean that you would like to automatically detect which type each feature is (categorical/numerical) and apply a default encoder/scaling, such as discussed there:
>>>>>>>>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>>>>>>>>
>>>>>>>>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box, which might be dangerous.
>>>>>>>>>> Also see
>>>>>>>>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>>>>>>>>> which basically does that.
>>>>>>>>>>
>>>>>>>>>>> Javier,
>>>>>>>>>>> Actually, you guessed right. My real data has only one numerical variable; it looks more like this:
>>>>>>>>>>>
>>>>>>>>>>> Gender   Date       Income  Car     Attendance
>>>>>>>>>>> Male     2019/3/01  10000   BMW     Yes
>>>>>>>>>>> Female   2019/5/02  9000    Toyota  No
>>>>>>>>>>> Male     2019/7/15  12000   Audi    Yes
>>>>>>>>>>>
>>>>>>>>>>> I am predicting income using all the other, categorical variables. Maybe it is catboost!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> M
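(A small sketch of the one-hot-encoding point made above: after encoding, the tree still splits on numeric thresholds, and for the dummy columns those thresholds land at 0.5. The toy rows, income values, and the car_* column names are invented for illustration.)

import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi", "Toyota"],
    "income": [10000, 9000, 12000, 10500, 11800, 8900],
})

# One-hot encode by hand; the decision tree does not do this internally.
X = pd.get_dummies(df[["car"]])          # columns: car_Audi, car_BMW, car_Toyota
y = df["income"]

reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# Every split on a dummy column has the form "car_Audi <= 0.5",
# which is equivalent to asking car_Audi == 0 vs. car_Audi == 1.
print(export_text(reg, feature_names=list(X.columns)))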
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>>>>>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look at catboost [1], which has an sklearn-compatible API.
>>>>>>>>>>>
>>>>>>>>>>> J
>>>>>>>>>>>
>>>>>>>>>>> [1] https://catboost.ai/
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>>>>>>>>>>> Hello all,
>>>>>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
>>>>>>>>>>>
>>>>>>>>>>> For example,
>>>>>>>>>>>
>>>>>>>>>>> Gender   Age  Income  Car     Attendance
>>>>>>>>>>> Male     30   10000   BMW     Yes
>>>>>>>>>>> Female   35   9000    Toyota  No
>>>>>>>>>>> Male     50   12000   Audi    Yes
>>>>>>>>>>>
>>>>>>>>>>> According to the documentation,
>>>>>>>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
>>>>>>>>>>> it cannot!
>>>>>>>>>>>
>>>>>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now".
>>>>>>>>>>>
>>>>>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Lemaitre
>>>>>>>>>>> INRIA Saclay - Parietal team
>>>>>>>>>>> Center for Data Science Paris-Saclay
>>>>>>>>>>> https://glemaitre.github.io/
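(Finally, a hedged sketch of the ColumnTransformer approach Guillaume mentions above, applied to a table shaped like the Gender/Age/Income/Car example; the toy rows, the column lists, and the choice of DecisionTreeRegressor are assumptions for illustration only.)

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income": [10000, 9000, 12000],
})

categorical = ["Gender", "Car", "Attendance"]
numerical = ["Age"]

# One-hot encode the categorical columns, pass the numerical ones through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(df[categorical + numerical], df["Income"])
print(model.predict(df[categorical + numerical]))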