Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees, so it will use a binary threshold for splitting along a feature. In other words, it cannot do something like

    if x == 1 then right child node, else left child node

Instead, what it does is

    if x >= 0.5 then right child node, else left child node

These are basically equivalent, as you can see when you plug in the values 0 and 1 for x.
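To make this concrete, here is a minimal sketch (toy data; the car_* column names are just for illustration) showing that a fitted tree splits a one-hot encoded column at 0.5:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # one-hot encoded cars: columns car_Audi, car_BMW, car_Toyota
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
    y = np.array([1, 0, 0, 1, 0, 0])  # label depends only on "Audi or not"

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["car_Audi", "car_BMW", "car_Toyota"]))

which prints something like

    |--- car_Audi <= 0.50
    |   |--- class: 0
    |--- car_Audi >  0.50
    |   |--- class: 1

i.e., the x >= 0.5 test from above, applied to the 0/1 indicator column.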
Best,
Sebastian

> On Oct 4, 2019, at 5:34 PM, C W <tmrs...@gmail.com> wrote:
>
> I don't understand your answer.
>
> Why, after one-hot-encoding, does it still output greater than 0.5 or less than?
> Does the sklearn website have a working example on categorical input?
>
> Thanks!
>
> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is use the following conversion:
>
> treat as car_Audi=1 if car_Audi >= 0.5
> treat as car_Audi=0 if car_Audi < 0.5
>
> or, it may be
>
> treat as car_Audi=1 if car_Audi > 0.5
> treat as car_Audi=0 if car_Audi <= 0.5
>
> (I forget which one sklearn uses, but either way, it will be fine.)
>
> Best,
> Sebastian
>
>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:
>>
>>> But, the decision tree is still mistaking the one-hot-encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>
>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
>>
>> This is why we do one-hot-encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.
>>
>> Nicolas
>>
>> On 10/4/19 2:01 PM, C W wrote:
>>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, typo on my part.
>>>
>>> Looks like I did the one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
>>>
>>> But, the decision tree is still mistaking the one-hot-encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>
>>> Is there a good toy example on the sklearn website? I only see this:
>>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>>>
>>> Thanks!
>>>
>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>> Hi,
>>>
>>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
>>>
>>> That's not a one-hot encoding then.
>>>
>>> For an Audi data point, it should be
>>>
>>> BMW=0
>>> Toyota=0
>>> Audi=1
>>>
>>> for BMW
>>>
>>> BMW=1
>>> Toyota=0
>>> Audi=0
>>>
>>> and for Toyota
>>>
>>> BMW=0
>>> Toyota=1
>>> Audi=0
>>>
>>> The split threshold should then be at 0.5 for any of these features.
>>>
>>> Based on your email, I think you were assuming that the decision tree does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
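>>> For example, a minimal sketch of that preprocessing step (assuming pandas; the column name "car" is just for illustration):
>>>
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>>
>>> df = pd.DataFrame({"car": ["BMW", "Toyota", "Audi"]})
>>>
>>> enc = OneHotEncoder()  # categories are inferred from the data
>>> X = enc.fit_transform(df[["car"]]).toarray()
>>>
>>> print(enc.categories_)  # [array(['Audi', 'BMW', 'Toyota'], dtype=object)]
>>> print(X)
>>> # [[0. 1. 0.]   <- BMW
>>> #  [0. 0. 1.]   <- Toyota
>>> #  [1. 0. 0.]]  <- Audi
>>>
>>> It is this X, not the raw 0/1/2 codes, that you pass to the tree.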
>>> Best,
>>> Sebastian
>>>
>>>> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>>>>
>>>> I'm getting some funny results. I am doing a regression decision tree, and the response variables are assigned to levels.
>>>>
>>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories.
>>>>
>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>>>
>>>> In R, for instance, you do as.factor(), which explicitly states the data type.
>>>>
>>>> Thank you!
>>>>
>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>>>>
>>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>>
>>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>>>>> Thanks, Guillaume.
>>>>> Column transformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
>>>>>
>>>>> It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see whether we can improve something there.
>>>>> Do you mean that you would like to automatically detect which type each feature is (categorical/numerical) and apply a default encoder/scaling, such as discussed here:
>>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>>
>>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box, which might be dangerous.
>>>> Also see
>>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>>> which basically does that.
>>>>
>>>>> Javier,
>>>>> Actually, you guessed right. My real data has only one numerical variable; it looks more like this:
>>>>>
>>>>> Gender   Date        Income   Car      Attendance
>>>>> Male     2019/3/01   10000    BMW      Yes
>>>>> Female   2019/5/02    9000    Toyota   No
>>>>> Male     2019/7/15   12000    Audi     Yes
>>>>>
>>>>> I am predicting income using all the other, categorical variables. Maybe it is catboost!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> M
>>>>>
>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look at catboost [1], which has an sklearn-compatible API.
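>>>>> A rough sketch with the toy data from above (assuming the catboost package is installed; the Date column is left out and the parameters are only illustrative):
>>>>>
>>>>> import pandas as pd
>>>>> from catboost import CatBoostRegressor
>>>>>
>>>>> df = pd.DataFrame({
>>>>>     "Gender": ["Male", "Female", "Male"],
>>>>>     "Car": ["BMW", "Toyota", "Audi"],
>>>>>     "Attendance": ["Yes", "No", "Yes"],
>>>>>     "Income": [10000, 9000, 12000],
>>>>> })
>>>>>
>>>>> # cat_features marks the categorical columns, so catboost handles
>>>>> # them natively -- no one-hot encoding needed beforehand
>>>>> model = CatBoostRegressor(iterations=50, verbose=0)
>>>>> model.fit(df[["Gender", "Car", "Attendance"]], df["Income"],
>>>>>           cat_features=["Gender", "Car", "Attendance"])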
>>>>> J
>>>>>
>>>>> [1] https://catboost.ai/
>>>>>
>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>>>>> Hello all,
>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
>>>>>
>>>>> For example,
>>>>>
>>>>> Gender   Age   Income   Car      Attendance
>>>>> Male     30    10000    BMW      Yes
>>>>> Female   35     9000    Toyota   No
>>>>> Male     50    12000    Audi     Yes
>>>>>
>>>>> According to the documentation
>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>>> it cannot!
>>>>>
>>>>> It says: "scikit-learn implementation does not support categorical variables for now".
>>>>>
>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
>>>>>
>>>>> Thank you very much!

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn