You are right, changing the figure size would fix the issue (updated the notebook). In practice, I think the issue becomes choosing a good aspect ratio such that
a) the general proportions of the plot look OK, and
b) the proportions of the boxes with respect to the arrows look OK.

It's all possible for a user to do, but for my use cases (e.g., making a quick graphic for a presentation / meeting) it was just quicker with graphviz. On the other hand, I would prefer/recommend the plot_tree function just because it is based on matplotlib ... In any case, I haven't had a chance to look at the plot_tree function, but I guess this could potentially be relatively easy to address. I guess it would just require finding and setting a good default value for
a) the XOR case, where a user provides either feature names or class label names, and
b) the AND case, where a user provides both feature names and class label names.
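(A minimal sketch of the figure-size workaround discussed above; the iris dataset, max_depth=3, and figsize=(16, 8) are illustrative assumptions, not taken from the linked notebooks.)

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# A larger (and wider-than-tall) figure reduces the overlap between boxes
# when both feature_names and class_names are passed.
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
plt.show()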
> On Oct 6, 2019, at 9:55 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>
> Thanks!
> I'll double check that issue. Generally you have to set the figure size to get good results.
> We should probably add some code to set the figure size automatically (if we create a figure?).
>
> On 10/6/19 10:40 AM, Sebastian Raschka wrote:
>> Sure, I just ran an example I made with graphviz via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. I made a reproducible example here so that you can take a look:
>> https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb
>>
>> Happy to add this to the sklearn issue list if there's no issue filed for that yet.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 6, 2019, at 9:10 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>
>>> On 10/4/19 11:28 PM, Sebastian Raschka wrote:
>>>> The docs show a way such that you don't need to write it as a png file, using tree.plot_tree:
>>>> https://scikit-learn.org/stable/modules/tree.html#classification
>>>>
>>>> I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but I don't remember), which is why I still stick to graphviz.
>>> Can you give me examples that don't look as nice? I would love to improve it.
>>>
>>>> For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do
>>>>
>>>> conda install pydotplus
>>>> conda install graphviz
>>>>
>>>> Coincidentally, I just made an example for a lecture I was teaching on Tue:
>>>> https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>> On Oct 4, 2019, at 10:09 PM, C W <tmrs...@gmail.com> wrote:
>>>>>
>>>>> On a separate note, what do you use for plotting?
>>>>>
>>>>> I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird) -- something like that wouldn't be mentioned in textbooks.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>> On Oct 4, 2019, at 6:33 PM, C W <tmrs...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks Sebastian, I think I get it.
>>>>>>
>>>>>> It's just that I have never seen it this way. Quite different from what I'm used to in Elements of Statistical Learning.
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>>> Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do something like
>>>>>>
>>>>>> if x == 1 then right child node
>>>>>> else left child node
>>>>>>
>>>>>> Instead, what it does is
>>>>>>
>>>>>> if x >= 0.5 then right child node
>>>>>> else left child node
>>>>>>
>>>>>> These are basically equivalent, as you can see when you just plug in the values 0 and 1 for x.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>> On Oct 4, 2019, at 5:34 PM, C W <tmrs...@gmail.com> wrote:
>>>>>>>
>>>>>>> I don't understand your answer.
>>>>>>>
>>>>>>> Why, after one-hot encoding, does it still output greater than 0.5 or less than 0.5? Does the sklearn website have a working example on categorical input?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>>>> Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is use the following conversion:
>>>>>>>
>>>>>>> treat as car_Audi=1 if car_Audi >= 0.5
>>>>>>> treat as car_Audi=0 if car_Audi < 0.5
>>>>>>>
>>>>>>> or, it may be
>>>>>>>
>>>>>>> treat as car_Audi=1 if car_Audi > 0.5
>>>>>>> treat as car_Audi=0 if car_Audi <= 0.5
>>>>>>>
>>>>>>> (I forgot which one sklearn uses, but either way, it will be fine.)
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> But, the decision tree is still mistaking the one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>>>>>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
>>>>>>>>
>>>>>>>> This is why we do one-hot encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.
>>>>>>>>
>>>>>>>> Nicolas
>>>>>>>>
>>>>>>>> On 10/4/19 2:01 PM, C W wrote:
>>>>>>>>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, a typo on my part.
>>>>>>>>>
>>>>>>>>> Looks like I did the one-hot encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
>>>>>>>>>
>>>>>>>>> But, the decision tree is still mistaking the one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>>>>>>>
>>>>>>>>> Is there a good toy example on the sklearn website? I only see this:
>>>>>>>>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
>>>>>>>>> That's not a one-hot encoding then.
>>>>>>>>>
>>>>>>>>> For an Audi data point, it should be
>>>>>>>>>
>>>>>>>>> BMW=0
>>>>>>>>> Toyota=0
>>>>>>>>> Audi=1
>>>>>>>>>
>>>>>>>>> for BMW
>>>>>>>>>
>>>>>>>>> BMW=1
>>>>>>>>> Toyota=0
>>>>>>>>> Audi=0
>>>>>>>>>
>>>>>>>>> and for Toyota
>>>>>>>>>
>>>>>>>>> BMW=0
>>>>>>>>> Toyota=1
>>>>>>>>> Audi=0
>>>>>>>>>
>>>>>>>>> The split threshold should then be at 0.5 for any of these features.
>>>>>>>>>
>>>>>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>>> On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I'm getting some funny results. I am doing a regression decision tree; the response variables are assigned to levels.
>>>>>>>>>>
>>>>>>>>>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories.
>>>>>>>>>>
>>>>>>>>>> The tree splits at 0.5 and 1.5. Am I doing the one-hot encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>>>>>>>>>
>>>>>>>>>> In R, for instance, you do as.factor(), which explicitly states the data type.
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
>>>>>>>>>>> Thanks, Guillaume.
>>>>>>>>>>> ColumnTransformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
>>>>>>>>>>>
>>>>>>>>>>> It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see whether we can improve something there.
>>>>>>>>>>> Do you mean that you would like to automatically detect which type each feature is (categorical/numerical) and apply a default encoder/scaling, such as discussed there:
>>>>>>>>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>>>>>>>>
>>>>>>>>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box, which might be dangerous.
>>>>>>>>>> Also see
>>>>>>>>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>>>>>>>>> which basically does that.
>>>>>>>>>>
>>>>>>>>>>> Javier,
>>>>>>>>>>> Actually, you guessed right. My real data has only one numerical variable; it looks more like this:
>>>>>>>>>>>
>>>>>>>>>>> Gender   Date       Income  Car     Attendance
>>>>>>>>>>> Male     2019/3/01  10000   BMW     Yes
>>>>>>>>>>> Female   2019/5/02  9000    Toyota  No
>>>>>>>>>>> Male     2019/7/15  12000   Audi    Yes
>>>>>>>>>>>
>>>>>>>>>>> I am predicting income using all the other, categorical variables. Maybe it is catboost!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> M
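(A small sketch of the one-hot-encoding point made above: after encoding, the tree still splits on numeric thresholds, and for the dummy columns those thresholds land at 0.5. The toy rows, income values, and the car_* column names are invented for illustration.)

import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi", "Toyota"],
    "income": [10000, 9000, 12000, 10500, 11800, 8900],
})

# One-hot encode by hand; the decision tree does not do this internally.
X = pd.get_dummies(df[["car"]])          # columns: car_Audi, car_BMW, car_Toyota
y = df["income"]

reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# Every split on a dummy column has the form "car_Audi <= 0.5",
# which is equivalent to asking car_Audi == 0 vs. car_Audi == 1.
print(export_text(reg, feature_names=list(X.columns)))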
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
>>>>>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look at catboost [1], which has an sklearn-compatible API.
>>>>>>>>>>>
>>>>>>>>>>> J
>>>>>>>>>>>
>>>>>>>>>>> [1] https://catboost.ai/
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
>>>>>>>>>>> Hello all,
>>>>>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
>>>>>>>>>>>
>>>>>>>>>>> For example,
>>>>>>>>>>>
>>>>>>>>>>> Gender   Age  Income  Car     Attendance
>>>>>>>>>>> Male     30   10000   BMW     Yes
>>>>>>>>>>> Female   35   9000    Toyota  No
>>>>>>>>>>> Male     50   12000   Audi    Yes
>>>>>>>>>>>
>>>>>>>>>>> According to the documentation,
>>>>>>>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
>>>>>>>>>>> it cannot!
>>>>>>>>>>>
>>>>>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now".
>>>>>>>>>>>
>>>>>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Lemaitre
>>>>>>>>>>> INRIA Saclay - Parietal team
>>>>>>>>>>> Center for Data Science Paris-Saclay
>>>>>>>>>>> https://glemaitre.github.io/
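(Finally, a hedged sketch of the ColumnTransformer approach Guillaume mentions above, applied to a table shaped like the Gender/Age/Income/Car example; the toy rows, the column lists, and the choice of DecisionTreeRegressor are assumptions for illustration only.)

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income": [10000, 9000, 12000],
})

categorical = ["Gender", "Car", "Attendance"]
numerical = ["Age"]

# One-hot encode the categorical columns, pass the numerical ones through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(df[categorical + numerical], df["Income"])
print(model.predict(df[categorical + numerical]))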