Hi,

> The funny part is: the tree is taking one-hot encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
That's not a one-hot encoding then. With one-hot encoding, an Audi datapoint
should be BMW=0, Toyota=0, Audi=1; a BMW datapoint BMW=1, Toyota=0, Audi=0;
and a Toyota datapoint BMW=0, Toyota=1, Audi=0. The split threshold should
then be at 0.5 for any of these features.

Based on your email, I think you were assuming that the decision tree does
the one-hot encoding internally, which it doesn't. In practice it is hard to
guess what is a nominal and what is an ordinal variable, so you have to do
the one-hot encoding yourself before you give the data to the decision tree.
(Two short sketches are appended at the bottom of this message, below the
quoted thread: one doing the encoding by hand with OneHotEncoder, and one
wiring it into a ColumnTransformer + Pipeline.)

Best,
Sebastian

> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>
> I'm getting some funny results. I am doing a regression decision tree; the
> response variables are assigned to levels.
>
> The funny part is: the tree is taking one-hot encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not categories.
>
> The tree splits at 0.5 and 1.5. Am I doing one-hot encoding wrong? How does
> sklearn know internally that 0 vs. 1 is categorical, not numerical?
>
> In R, for instance, you do as.factor(), which explicitly states the data
> type.
>
> Thank you!
>
>
> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>
> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>
>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>> Thanks, Guillaume.
>> Column transformer looks pretty neat. I've also heard, though, that this
>> pipeline can be tedious to set up? Specifying what you want for every
>> feature is a pain.
>>
>> It would be interesting for us to know which part of the pipeline is
>> tedious to set up, so we can see whether we can improve something there.
>> Do you mean that you would like to automatically detect the type of each
>> feature (categorical/numerical) and apply a default encoder/scaling, as
>> discussed there:
>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>
>> IMO, from a user perspective, it would be cleaner in some cases, at the
>> cost of blindly applying a black box, which might be dangerous.
> Also see
> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> which basically does that.
>
>
>> Javier,
>> Actually, you guessed right. My real data has only one numerical variable
>> and looks more like this:
>>
>> Gender  Date       Income  Car     Attendance
>> Male    2019/3/01  10000   BMW     Yes
>> Female  2019/5/02  9000    Toyota  No
>> Male    2019/7/15  12000   Audi    Yes
>>
>> I am predicting income using all the other categorical variables. Maybe
>> it is catboost!
>>
>> Thanks,
>>
>> M
>>
>>
>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>> If you have datasets with many categorical features, and perhaps many
>> categories, the tools in sklearn are quite limited, but there are
>> alternative implementations of boosted trees that are designed with
>> categorical features in mind. Take a look at catboost [1], which has an
>> sklearn-compatible API.
>>
>> J
>>
>> [1] https://catboost.ai/
>>
>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>> Hello all,
>> I'm very confused. Can the decision tree module handle both continuous
>> and categorical features in the dataset?
>> In this case, it's just CART (Classification and Regression Trees).
>>
>> For example,
>>
>> Gender  Age  Income  Car     Attendance
>> Male    30   10000   BMW     Yes
>> Female  35   9000    Toyota  No
>> Male    50   12000   Audi    Yes
>>
>> According to the documentation
>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>> it cannot!
>>
>> It says: "scikit-learn implementation does not support categorical
>> variables for now".
>>
>> Is this true? If not, can someone point me to an example? If yes, what do
>> people do?
>>
>> Thank you very much!
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
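
P.S. Here is the first sketch mentioned above: doing the one-hot encoding by
hand before fitting the tree. The data and column names are made up just for
illustration, and the encoder settings are only one possible choice; treat it
as a sketch, not a definitive recipe.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data, invented for illustration: "car" is nominal, "income" is the target.
df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi"],
    "income": [10000, 9000, 12000, 11000, 12500],
})

# One-hot encode *before* fitting: one 0/1 column per category,
# not a single column with codes 0/1/2.
enc = OneHotEncoder(sparse=False)
X = enc.fit_transform(df[["car"]])
print(enc.categories_)  # [array(['Audi', 'BMW', 'Toyota'], dtype=object)]
print(X)                # each row has exactly one 1

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, df["income"])

# With 0/1 features, every internal split lands at 0.5
# ("is it this brand or not"); leaf nodes are stored with threshold -2.
print(tree.tree_.threshold)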
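
And the second sketch: the same idea wired into a ColumnTransformer +
Pipeline for a mixed table like the one further down in the thread. The
column names come from that example; handle_unknown="ignore" and passing the
numeric column through untouched are just assumptions for the sketch, pick
whatever suits your data.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy frame mirroring the example in the thread; Income is the target.
df = pd.DataFrame({
    "Gender":     ["Male", "Female", "Male"],
    "Age":        [30, 35, 50],
    "Car":        ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income":     [10000, 9000, 12000],
})

categorical = ["Gender", "Car", "Attendance"]
numerical = ["Age"]

preprocess = ColumnTransformer([
    # one-hot encode the nominal columns, pass the numeric one through as-is
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", numerical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

model.fit(df[categorical + numerical], df["Income"])
print(model.predict(df[categorical + numerical]))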
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn