Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees, so it will use a binary threshold for splitting along a feature. In other words, it cannot do something like

    if x == 1 then right child node, else left child node

Instead, what it does is

    if x >= 0.5 then right child node, else left child node

These are basically equivalent, as you can see when you plug in the values 0 and 1 for x.
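To make this concrete, here is a minimal sketch (toy data; the car_* column names are just for illustration) showing that a fitted tree splits a one-hot encoded column at 0.5:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # one-hot encoded cars: columns car_Audi, car_BMW, car_Toyota
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
    y = np.array([1, 0, 0, 1, 0, 0])  # label depends only on "Audi or not"

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["car_Audi", "car_BMW", "car_Toyota"]))

which prints something like

    |--- car_Audi <= 0.50
    |   |--- class: 0
    |--- car_Audi >  0.50
    |   |--- class: 1

i.e., the x >= 0.5 test from above, applied to the 0/1 indicator column.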
Best,
Sebastian

> On Oct 4, 2019, at 5:34 PM, C W <tmrs...@gmail.com> wrote:
>
> I don't understand your answer.
>
> Why, after one-hot-encoding, does it still output greater than 0.5 or less than?
> Does the sklearn website have a working example on categorical input?
>
> Thanks!
>
> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is use the following conversion:
>
> treat as car_Audi=1 if car_Audi >= 0.5
> treat as car_Audi=0 if car_Audi < 0.5
>
> or, it may be
>
> treat as car_Audi=1 if car_Audi > 0.5
> treat as car_Audi=0 if car_Audi <= 0.5
>
> (I forget which one sklearn uses, but either way, it will be fine.)
>
> Best,
> Sebastian
>
>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:
>>
>>> But, the decision tree is still mistaking the one-hot-encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>
>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
>>
>> This is why we do one-hot-encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.
>>
>> Nicolas
>>
>> On 10/4/19 2:01 PM, C W wrote:
>>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, typo on my part.
>>>
>>> Looks like I did the one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
>>>
>>> But, the decision tree is still mistaking the one-hot-encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>
>>> Is there a good toy example on the sklearn website? I only see this:
>>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>>>
>>> Thanks!
>>>
>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>> Hi,
>>>
>>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
>>>
>>> That's not a one-hot encoding then.
>>>
>>> For an Audi data point, it should be
>>>
>>> BMW=0
>>> Toyota=0
>>> Audi=1
>>>
>>> for BMW
>>>
>>> BMW=1
>>> Toyota=0
>>> Audi=0
>>>
>>> and for Toyota
>>>
>>> BMW=0
>>> Toyota=1
>>> Audi=0
>>>
>>> The split threshold should then be at 0.5 for any of these features.
>>>
>>> Based on your email, I think you were assuming that the decision tree does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
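>>> For example, a minimal sketch of that preprocessing step (assuming pandas; the column name "car" is just for illustration):
>>>
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>>
>>> df = pd.DataFrame({"car": ["BMW", "Toyota", "Audi"]})
>>>
>>> enc = OneHotEncoder()  # categories are inferred from the data
>>> X = enc.fit_transform(df[["car"]]).toarray()
>>>
>>> print(enc.categories_)  # [array(['Audi', 'BMW', 'Toyota'], dtype=object)]
>>> print(X)
>>> # [[0. 1. 0.]   <- BMW
>>> #  [0. 0. 1.]   <- Toyota
>>> #  [1. 0. 0.]]  <- Audi
>>>
>>> It is this X, not the raw 0/1/2 codes, that you pass to the tree.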
>>> Best,
>>> Sebastian
>>>
>>>> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>>>>
>>>> I'm getting some funny results. I am doing a regression decision tree, and the response variables are assigned to levels.
>>>>
>>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories.
>>>>
>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>>>
>>>> In R, for instance, you do as.factor(), which explicitly states the data type.
>>>>
>>>> Thank you!
>>>>
>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>>>>
>>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>>
>>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>>>>> Thanks, Guillaume.
>>>>> Column transformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
>>>>>
>>>>> It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see whether we can improve something there.
>>>>> Do you mean that you would like to automatically detect which type each feature is (categorical/numerical) and apply a default encoder/scaling, such as discussed here:
>>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>>
>>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box, which might be dangerous.
>>>> Also see
>>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>>> which basically does that.
>>>>
>>>>> Javier,
>>>>> Actually, you guessed right. My real data has only one numerical variable; it looks more like this:
>>>>>
>>>>> Gender   Date        Income   Car      Attendance
>>>>> Male     2019/3/01   10000    BMW      Yes
>>>>> Female   2019/5/02    9000    Toyota   No
>>>>> Male     2019/7/15   12000    Audi     Yes
>>>>>
>>>>> I am predicting income using all the other, categorical variables. Maybe it is catboost!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> M
>>>>>
>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look at catboost [1], which has an sklearn-compatible API.
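>>>>> A rough sketch with the toy data from above (assuming the catboost package is installed; the Date column is left out and the parameters are only illustrative):
>>>>>
>>>>> import pandas as pd
>>>>> from catboost import CatBoostRegressor
>>>>>
>>>>> df = pd.DataFrame({
>>>>>     "Gender": ["Male", "Female", "Male"],
>>>>>     "Car": ["BMW", "Toyota", "Audi"],
>>>>>     "Attendance": ["Yes", "No", "Yes"],
>>>>>     "Income": [10000, 9000, 12000],
>>>>> })
>>>>>
>>>>> # cat_features marks the categorical columns, so catboost handles
>>>>> # them natively -- no one-hot encoding needed beforehand
>>>>> model = CatBoostRegressor(iterations=50, verbose=0)
>>>>> model.fit(df[["Gender", "Car", "Attendance"]], df["Income"],
>>>>>           cat_features=["Gender", "Car", "Attendance"])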
>>>>> J
>>>>>
>>>>> [1] https://catboost.ai/
>>>>>
>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>>>>> Hello all,
>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
>>>>>
>>>>> For example,
>>>>>
>>>>> Gender   Age   Income   Car      Attendance
>>>>> Male     30    10000    BMW      Yes
>>>>> Female   35     9000    Toyota   No
>>>>> Male     50    12000    Audi     Yes
>>>>>
>>>>> According to the documentation
>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>>> it cannot!
>>>>>
>>>>> It says: "scikit-learn implementation does not support categorical variables for now".
>>>>>
>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
>>>>>
>>>>> Thank you very much!

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn