Hi,

> The funny part is: the tree is taking one-hot encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
That's not a one-hot encoding then. With one-hot encoding, an Audi datapoint
should be BMW=0, Toyota=0, Audi=1; a BMW datapoint BMW=1, Toyota=0, Audi=0;
and a Toyota datapoint BMW=0, Toyota=1, Audi=0. The split threshold should
then be at 0.5 for any of these features.

Based on your email, I think you were assuming that the decision tree does
the one-hot encoding internally, which it doesn't. In practice it is hard to
guess what is a nominal and what is an ordinal variable, so you have to do
the one-hot encoding yourself before you give the data to the decision tree.
(Two short sketches are appended at the bottom of this message, below the
quoted thread: one doing the encoding by hand with OneHotEncoder, and one
wiring it into a ColumnTransformer + Pipeline.)

Best,
Sebastian

> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>
> I'm getting some funny results. I am doing a regression decision tree; the
> response variables are assigned to levels.
>
> The funny part is: the tree is taking one-hot encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not categories.
>
> The tree splits at 0.5 and 1.5. Am I doing one-hot encoding wrong? How does
> sklearn know internally that 0 vs. 1 is categorical, not numerical?
>
> In R, for instance, you do as.factor(), which explicitly states the data
> type.
>
> Thank you!
>
>
> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>
> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>
>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>> Thanks, Guillaume.
>> Column transformer looks pretty neat. I've also heard, though, that this
>> pipeline can be tedious to set up? Specifying what you want for every
>> feature is a pain.
>>
>> It would be interesting for us to know which part of the pipeline is
>> tedious to set up, so we can see whether we can improve something there.
>> Do you mean that you would like to automatically detect the type of each
>> feature (categorical/numerical) and apply a default encoder/scaling, as
>> discussed there:
>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>
>> IMO, from a user perspective, it would be cleaner in some cases, at the
>> cost of blindly applying a black box, which might be dangerous.
> Also see
> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> which basically does that.
>
>
>> Javier,
>> Actually, you guessed right. My real data has only one numerical variable
>> and looks more like this:
>>
>> Gender  Date       Income  Car     Attendance
>> Male    2019/3/01  10000   BMW     Yes
>> Female  2019/5/02  9000    Toyota  No
>> Male    2019/7/15  12000   Audi    Yes
>>
>> I am predicting income using all the other categorical variables. Maybe
>> it is catboost!
>>
>> Thanks,
>>
>> M
>>
>>
>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>> If you have datasets with many categorical features, and perhaps many
>> categories, the tools in sklearn are quite limited, but there are
>> alternative implementations of boosted trees that are designed with
>> categorical features in mind. Take a look at catboost [1], which has an
>> sklearn-compatible API.
>>
>> J
>>
>> [1] https://catboost.ai/
>>
>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>> Hello all,
>> I'm very confused. Can the decision tree module handle both continuous
>> and categorical features in the dataset?
>> In this case, it's just CART (Classification and Regression Trees).
>>
>> For example,
>>
>> Gender  Age  Income  Car     Attendance
>> Male    30   10000   BMW     Yes
>> Female  35   9000    Toyota  No
>> Male    50   12000   Audi    Yes
>>
>> According to the documentation
>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>> it cannot!
>>
>> It says: "scikit-learn implementation does not support categorical
>> variables for now".
>>
>> Is this true? If not, can someone point me to an example? If yes, what do
>> people do?
>>
>> Thank you very much!
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
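
P.S. Here is the first sketch mentioned above: doing the one-hot encoding by
hand before fitting the tree. The data and column names are made up just for
illustration, and the encoder settings are only one possible choice; treat it
as a sketch, not a definitive recipe.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data, invented for illustration: "car" is nominal, "income" is the target.
df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi"],
    "income": [10000, 9000, 12000, 11000, 12500],
})

# One-hot encode *before* fitting: one 0/1 column per category,
# not a single column with codes 0/1/2.
enc = OneHotEncoder(sparse=False)
X = enc.fit_transform(df[["car"]])
print(enc.categories_)  # [array(['Audi', 'BMW', 'Toyota'], dtype=object)]
print(X)                # each row has exactly one 1

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, df["income"])

# With 0/1 features, every internal split lands at 0.5
# ("is it this brand or not"); leaf nodes are stored with threshold -2.
print(tree.tree_.threshold)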
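
And the second sketch: the same idea wired into a ColumnTransformer +
Pipeline for a mixed table like the one further down in the thread. The
column names come from that example; handle_unknown="ignore" and passing the
numeric column through untouched are just assumptions for the sketch, pick
whatever suits your data.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy frame mirroring the example in the thread; Income is the target.
df = pd.DataFrame({
    "Gender":     ["Male", "Female", "Male"],
    "Age":        [30, 35, 50],
    "Car":        ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income":     [10000, 9000, 12000],
})

categorical = ["Gender", "Car", "Attendance"]
numerical = ["Age"]

preprocess = ColumnTransformer([
    # one-hot encode the nominal columns, pass the numeric one through as-is
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", numerical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

model.fit(df[categorical + numerical], df["Income"])
print(model.predict(df[categorical + numerical]))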
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn