Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. That was a typo on my part.

Looks like I did the one-hot encoding correctly. My new variable names are car_Audi, car_BMW, etc. But the decision tree is still treating the one-hot-encoded columns as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?

Is there a good toy example on the sklearn website? I only see this one:
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

Thanks!

On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:

> Hi,
>
> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5
>
>
> that's not a onehot encoding then.
>
> For an Audi datapoint, it should be
>
> BMW=0
> Toyota=0
> Audi=1
>
> for BMW
>
> BMW=1
> Toyota=0
> Audi=0
>
> and for Toyota
>
> BMW=0
> Toyota=1
> Audi=0
>
> The split threshold should then be at 0.5 for any of these features.
>
> Based on your email, I think you were assuming that the DT does the
> one-hot encoding internally, which it doesn't. In practice, it is hard to
> guess what is a nominal and what is a ordinal variable, so you have to do
> the onehot encoding before you give the data to the decision tree.
>
> Best,
> Sebastian
>
> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>
> I'm getting some funny results. I am doing a regression decision tree, the
> response variables are assigned to levels.
>
> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not category.
>
> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How
> does the sklearn know internally 0 vs. 1 is categorical, not numerical?
>
> In R for instance, you do as.factor(), which explicitly states the data
> type.
>
> Thank you!
>
> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>
>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>
>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>>
>>> Thanks, Guillaume.
>>> Column transformer looks pretty neat. I've also heard though, this
>>> pipeline can be tedious to set up? Specifying what you want for every
>>> feature is a pain.
>>>
>>
>> It would be interesting for us which part of the pipeline is tedious to
>> set up to know if we can improve something there.
>> Do you mean, that you would like to automatically detect of which type of
>> feature (categorical/numerical) and apply a
>> default encoder/scaling such as discuss there:
>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>
>> IMO, one a user perspective, it would be cleaner in some cases at the
>> cost of applying blindly a black box
>> which might be dangerous.
>>
>> Also see
>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>> Which basically does that.
>>
>>>
>>> Jaiver,
>>> Actually, you guessed right. My real data has only one numerical
>>> variable, looks more like this:
>>>
>>> Gender  Date       Income  Car     Attendance
>>> Male    2019/3/01  10000   BMW     Yes
>>> Female  2019/5/02  9000    Toyota  No
>>> Male    2019/7/15  12000   Audi    Yes
>>>
>>> I am predicting income using all other categorical variables. Maybe it
>>> is catboost!
>>>
>>> Thanks,
>>>
>>> M
>>>
>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>>>
>>>> If you have datasets with many categorical features, and perhaps many
>>>> categories, the tools in sklearn are quite limited,
>>>> but there are alternative implementations of boosted trees that are
>>>> designed with categorical features in mind. Take a look
>>>> at catboost [1], which has an sklearn-compatible API.
>>>>
>>>> J
>>>>
>>>> [1] https://catboost.ai/
>>>>
>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>> I'm very confused. Can the decision tree module handle both continuous
>>>>> and categorical features in the dataset? In this case, it's just CART
>>>>> (Classification and Regression Trees).
>>>>>
>>>>> For example,
>>>>> Gender  Age  Income  Car     Attendance
>>>>> Male    30   10000   BMW     Yes
>>>>> Female  35   9000    Toyota  No
>>>>> Male    50   12000   Audi    Yes
>>>>>
>>>>> According to the documentation
>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>>> it can not!
>>>>>
>>>>> It says: "scikit-learn implementation does not support categorical
>>>>> variables for now".
>>>>>
>>>>> Is this true? If not, can someone point me to an example? If yes, what
>>>>> do people do?
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn@python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>>
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
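
As a toy example for the question raised in this thread, here is a minimal sketch, assuming pandas and scikit-learn are installed. The data frame, its `car` and `income` columns, and the income values are made up for illustration; only the car_Audi / car_BMW naming follows the message above. It one-hot encodes the car brand before fitting and then prints the fitted tree so the split thresholds can be inspected.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data, loosely modelled on the table quoted in the thread.
df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"],
    "income": [10000, 9000, 12000, 10500, 9200, 11800],
})

# One 0/1 indicator column per brand: car_Audi, car_BMW, car_Toyota.
# The encoding is done by hand here because the tree does not do it internally.
X = pd.get_dummies(df[["car"]], dtype=int)
y = df["income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Thresholds of the internal (non-leaf) nodes; leaves carry a negative feature index.
thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]

# Text rendering of the fitted tree, with the indicator column names.
print(export_text(tree, feature_names=list(X.columns)))
print(thresholds)
```

A rule such as `car_Audi <= 0.5` sends the rows with car_Audi = 0 to one branch and the rows with car_Audi = 1 to the other, so on a 0/1 indicator column a threshold of 0.5 amounts to asking "Audi or not". The tree still treats the column as a number, but with only two possible values that is the same partition a categorical split on that brand would give.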