Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

Sebastian Raschka Fri, 04 Oct 2019 18:43:24 -0700

Yeah, think of it more as a computational workaround for achieving the same 
thing more efficiently (although it looks inelegant/weird). Something like 
that wouldn't be mentioned in textbooks.
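
As a minimal sketch of this (the toy car/income values and variable names 
below are made up for illustration, and it assumes NumPy and scikit-learn are 
installed), one can fit a DecisionTreeRegressor on hand-built one-hot columns 
and inspect the thresholds the tree learned:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data, invented for illustration only.
cars = ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"]
income = np.array([10000.0, 9000.0, 12000.0, 10500.0, 9200.0, 11800.0])

# Manual one-hot encoding: one 0/1 column per category
# (columns correspond to car_Audi, car_BMW, car_Toyota).
categories = sorted(set(cars))
X = np.array([[1.0 if c == cat else 0.0 for cat in categories] for c in cars])

tree = DecisionTreeRegressor(random_state=0).fit(X, income)

# Internal (split) nodes have a feature index >= 0; leaves are marked with -2.
is_split = tree.tree_.feature >= 0
thresholds = tree.tree_.threshold[is_split]
print("split thresholds:", thresholds)

# On 0/1 columns, "x > 0.5" selects exactly the same rows as "x == 1".
equivalent = np.array_equal(X == 1, X > 0.5)
print("threshold test matches equality test:", equivalent)
```

For a 0/1 column, scikit-learn places the threshold at the midpoint between 
the two observed values and sends samples with value <= threshold to the left 
child, so the split separates the same rows as an equality test on the 
category would.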


Best,
Sebastian

> On Oct 4, 2019, at 6:33 PM, C W <[email protected]> wrote:
> 
> Thanks Sebastian, I think I get it.
> 
> I've just never seen it this way. It's quite different from what I'm used to in 
> Elements of Statistical Learning.
> 
> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <[email protected]> 
> wrote:
> Not sure if there's a website for that. In any case, to explain this 
> differently, as discussed earlier sklearn assumes continuous features for 
> decision trees. So, it will use a binary threshold for splitting along a 
> feature attribute. In other words, it cannot do something like
> 
> if x == 1 then right child node
> else left child node
> 
> Instead, what it does is
> 
> if x >= 0.5 then right child node
> else left child node
> 
> These are basically equivalent as you can see when you just plug in values 0 
> and 1 for x.
> 
> Best,
> Sebastian
> 
> > On Oct 4, 2019, at 5:34 PM, C W <[email protected]> wrote:
> > 
> > I don't understand your answer.
> > 
> > Why, after one-hot encoding, does it still output greater than or less than 0.5? 
> > Does the sklearn website have a working example with categorical input?
> > 
> > Thanks!
> > 
> > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka 
> > <[email protected]> wrote:
> > Like Nicolas said, the 0.5 is just a workaround but will do the right thing 
> > on the one-hot encoded variables, here. You will find that the threshold is 
> > always at 0.5 for these variables. I.e., what it will do is to use the 
> > following conversion:
> > 
> > treat as car_Audi=1 if car_Audi >= 0.5
> > treat as car_Audi=0 if car_Audi < 0.5
> > 
> > or, it may be
> > 
> > treat as car_Audi=1 if car_Audi > 0.5
> > treat as car_Audi=0 if car_Audi <= 0.5
> > 
> > (Forgot which one sklearn is using, but either way, it will be fine.)
> > 
> > Best,
> > Sebastian
> > 
> > 
> >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <[email protected]> wrote:
> >> 
> >> 
> >>> But the decision tree is still treating the one-hot encoding as numerical 
> >>> input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
> >> 
> >> You're not doing anything wrong, and neither is the tree. Trees don't 
> >> support categorical variables in sklearn, so everything is treated as 
> >> numerical.
> >> 
> >> This is why we do one-hot-encoding: so that a set of numerical (one hot 
> >> encoded) features can be treated as if they were just one categorical 
> >> feature.
> >> 
> >> 
> >> 
> >> Nicolas
> >> 
> >> On 10/4/19 2:01 PM, C W wrote:
> >>> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo on my 
> >>> part.
> >>> 
> >>> Looks like I did one-hot-encoding correctly. My new variable names are: 
> >>> car_Audi, car_BMW, etc.
> >>> 
> >>> But the decision tree is still treating the one-hot encoding as numerical 
> >>> input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
> >>> 
> >>> Is there a good toy example on the sklearn website? I only see this: 
> >>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
> >>> 
> >>> Thanks!
> >>> 
> >>> 
> >>> 
> >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka 
> >>> <[email protected]> wrote:
> >>> Hi,
> >>> 
> >>>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, 
> >>>> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5
> >>> 
> >>> That's not a one-hot encoding then.
> >>> 
> >>> For an Audi datapoint, it should be
> >>> 
> >>> BMW=0
> >>> Toyota=0
> >>> Audi=1
> >>> 
> >>> for BMW
> >>> 
> >>> BMW=1
> >>> Toyota=0
> >>> Audi=0
> >>> 
> >>> and for Toyota
> >>> 
> >>> BMW=0
> >>> Toyota=1
> >>> Audi=0
> >>> 
> >>> The split threshold should then be at 0.5 for any of these features.
> >>> 
> >>> Based on your email, I think you were assuming that the DT does the 
> >>> one-hot encoding internally, which it doesn't. In practice, it is hard to 
> >>> guess what is a nominal and what is an ordinal variable, so you have to do 
> >>> the one-hot encoding before you give the data to the decision tree.
> >>> 
> >>> Best,
> >>> Sebastian
> >>> 
> >>>> On Oct 4, 2019, at 11:48 AM, C W <[email protected]> wrote:
> >>>> 
> >>>> I'm getting some funny results. I am doing a regression decision tree, 
> >>>> the response variables are assigned to levels.
> >>>> 
> >>>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, 
> >>>> Audi=2) as numerical values, not categories.
> >>>> 
> >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot encoding wrong? How 
> >>>> does sklearn know internally that 0 vs. 1 is categorical, not numerical? 
> >>>> 
> >>>> In R for instance, you do as.factor(), which explicitly states the data 
> >>>> type.
> >>>> 
> >>>> Thank you!
> >>>> 
> >>>> 
> >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <[email protected]> 
> >>>> wrote:
> >>>> 
> >>>> 
> >>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
> >>>>> 
> >>>>> 
> >>>>> On Sat, 14 Sep 2019 at 20:59, C W <[email protected]> wrote:
> >>>>> Thanks, Guillaume. 
> >>>>> Column transformer looks pretty neat. I've also heard, though, that this 
> >>>>> pipeline can be tedious to set up? Specifying what you want for every 
> >>>>> feature is a pain.
> >>>>> 
> >>>>> It would be interesting for us to know which part of the pipeline is 
> >>>>> tedious to set up, so we can see whether something can be improved there.
> >>>>> Do you mean that you would like to automatically detect the type 
> >>>>> of each feature (categorical/numerical) and apply a
> >>>>> default encoder/scaling, as discussed here: 
> >>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
> >>>>> 
> >>>>> IMO, from a user perspective, it would be cleaner in some cases, at the 
> >>>>> cost of blindly applying a black box,
> >>>>> which might be dangerous.
> >>>> Also see 
> >>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> >>>> Which basically does that.
> >>>> 
> >>>> 
> >>>>>  
> >>>>> 
> >>>>> Javier,
> >>>>> Actually, you guessed right. My real data has only one numerical 
> >>>>> variable, looks more like this:
> >>>>> 
> >>>>> Gender  Date        Income  Car     Attendance
> >>>>> Male    2019/3/01   10000   BMW     Yes
> >>>>> Female  2019/5/02   9000    Toyota  No
> >>>>> Male    2019/7/15   12000   Audi    Yes
> >>>>> 
> >>>>> I am predicting income using all other categorical variables. Maybe it 
> >>>>> is catboost!
> >>>>> 
> >>>>> Thanks,
> >>>>> 
> >>>>> M
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <[email protected]> wrote:
> >>>>> If you have datasets with many categorical features, and perhaps many 
> >>>>> categories, the tools in sklearn are quite limited, 
> >>>>> but there are alternative implementations of boosted trees that are 
> >>>>> designed with categorical features in mind. Take a look
> >>>>> at catboost [1], which has an sklearn-compatible API.
> >>>>> 
> >>>>> J
> >>>>> 
> >>>>> [1] https://catboost.ai/
> >>>>> 
> >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <[email protected]> wrote:
> >>>>> Hello all,
> >>>>> I'm very confused. Can the decision tree module handle both continuous 
> >>>>> and categorical features in the dataset? In this case, it's just CART 
> >>>>> (Classification and Regression Trees).
> >>>>> 
> >>>>> For example,
> >>>>> Gender  Age  Income  Car     Attendance
> >>>>> Male    30   10000   BMW     Yes
> >>>>> Female  35   9000    Toyota  No
> >>>>> Male    50   12000   Audi    Yes
> >>>>> 
> >>>>> According to the documentation 
> >>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
> >>>>> it cannot! 
> >>>>> 
> >>>>> It says: "scikit-learn implementation does not support categorical 
> >>>>> variables for now". 
> >>>>> 
> >>>>> Is this true? If not, can someone point me to an example? If yes, what 
> >>>>> do people do?
> >>>>> 
> >>>>> Thank you very much!
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> _______________________________________________
> >>>>> scikit-learn mailing list
> >>>>> [email protected]
> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
> >>>>> 
> >>>>> 
> >>>>> -- 
> >>>>> Guillaume Lemaitre
> >>>>> INRIA Saclay - Parietal team
> >>>>> Center for Data Science Paris-Saclay
> >>>>> https://glemaitre.github.io/
> >>>>> 
> >>>>> 
> >>>> 
> >>> 
> > 
> 

