Thanks!
I'll double-check that issue. Generally you have to set the figure size to get good results. We should probably add some code to set the figure size automatically (if we create a figure?).
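For example, a minimal sketch of the workaround (the dataset and the figure size are just placeholders):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# create the figure yourself with an explicit size and pass the axes in;
# a larger canvas keeps the boxes from overlapping
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(clf, ax=ax, filled=True)
plt.show()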


On 10/6/19 10:40 AM, Sebastian Raschka wrote:
Sure, I just re-ran an example I had made with graphviz, this time via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. I made a reproducible example here so that you can take a look:
https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb

Happy to add this to the sklearn issue list if there's no issue filed for that 
yet.

Best,
Sebastian

On Oct 6, 2019, at 9:10 AM, Andreas Mueller <t3k...@gmail.com> wrote:



On 10/4/19 11:28 PM, Sebastian Raschka wrote:
The docs show a way to do this without writing a PNG file, using tree.plot_tree:
https://scikit-learn.org/stable/modules/tree.html#classification

I don't remember exactly why, but I think I had problems with that in the past (I think it didn't look as nice visually, but I don't remember the details), which is why I still stick to graphviz.
Can you give me examples that don't look as nice? I would love to improve it.

For my use cases, it's not much hassle -- it used to be a bit of a pain to get GraphViz working, but now you can do

conda install pydotplus
conda install graphviz
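With those installed, the usual pattern is to export the tree to DOT and render it inline, e.g. in a notebook -- a minimal sketch (iris and the parameters are just placeholders):

import pydotplus
from IPython.display import Image
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# with out_file=None, export_graphviz returns the DOT source as a string
dot_data = export_graphviz(clf, out_file=None, filled=True)
Image(pydotplus.graph_from_dot_data(dot_data).create_png())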

Coincidentally, I just made an example for a lecture I was teaching on Tue: 
https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb

Best,
Sebastian


On Oct 4, 2019, at 10:09 PM, C W <tmrs...@gmail.com> wrote:

On a separate note, what do you use for plotting?

I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib?

Thanks!

On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <m...@sebastianraschka.com> 
wrote:
Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird) -- something like that wouldn't be mentioned in textbooks.

Best,
Sebastian

On Oct 4, 2019, at 6:33 PM, C W <tmrs...@gmail.com> wrote:

Thanks Sebastian, I think I get it.

It's just that I have never seen it this way. It's quite different from what I'm used to from Elements of Statistical Learning.

On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <m...@sebastianraschka.com> 
wrote:
Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature. In other words, it cannot do something like

if x == 1 then right child node
else left child node

Instead, what it does is

if x >= 0.5 then right child node
else left child node

These are basically equivalent, as you can see when you plug in the values 0 and 1 for x.
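You can verify this with a quick sketch (made-up data with a single 0/1 feature):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# a single one-hot-style feature that only takes the values 0 and 1
X = np.array([[0], [0], [1], [1]])
y = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier().fit(X, y)
print(export_text(clf))  # the printed rule splits at "feature_0 <= 0.50"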

Best,
Sebastian

On Oct 4, 2019, at 5:34 PM, C W <tmrs...@gmail.com> wrote:

I don't understand your answer.

Why, after one-hot-encoding, does it still output greater than or less than 0.5? Does the sklearn website have a working example with categorical input?

Thanks!

On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <m...@sebastianraschka.com> 
wrote:
Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is use the following conversion:

treat as car_Audi=1 if car_Audi >= 0.5
treat as car_Audi=0 if car_Audi < 0.5

or, it may be

treat as car_Audi=1 if car_Audi > 0.5
treat as car_Audi=0 if car_Audi <= 0.5

(I forgot which one sklearn uses, but either way it will be fine.)
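For what it's worth, sklearn sends samples with a feature value <= threshold to the left child, so it's the second variant above. You can also inspect the learned thresholds directly -- a minimal sketch, assuming a fitted estimator named clf:

# thresholds of the fitted tree's internal nodes (leaves are marked with -2)
print(clf.tree_.threshold)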

Best,
Sebastian


On Oct 4, 2019, at 1:44 PM, Nicolas Hug <nio...@gmail.com> wrote:


But the decision tree is still treating the one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
You're not doing anything wrong, and neither is the tree. Trees don't support 
categorical variables in sklearn, so everything is treated as numerical.

This is why we do one-hot-encoding: so that a set of numerical (one-hot-encoded) features can be treated as if they were just one categorical feature.
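For reference, a minimal sketch of that encoding step (the values are made up):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

car = np.array([["BMW"], ["Toyota"], ["Audi"]])
enc = OneHotEncoder()
# one 0/1 column per category, in the order given by enc.categories_
print(enc.fit_transform(car).toarray())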



Nicolas

On 10/4/19 2:01 PM, C W wrote:
Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo on my part.

Looks like I did one-hot-encoding correctly. My new variable names are: 
car_Audi, car_BMW, etc.

But the decision tree is still treating the one-hot encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?

Is there a good toy example on the sklearn website? I only see this: 
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.

Thanks!



On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> 
wrote:
Hi,

The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
That's not a one-hot encoding, then.

For an Audi datapoint, it should be

BMW=0
Toyota=0
Audi=1

for BMW

BMW=1
Toyota=0
Audi=0

and for Toyota

BMW=0
Toyota=1
Audi=0

The split threshold should then be at 0.5 for any of these features.

Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
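A minimal sketch of doing the encoding up front, with made-up data (pd.get_dummies produces columns named like car_Audi, car_BMW):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({"car": ["BMW", "Toyota", "Audi", "BMW"],
                   "attendance": [1, 0, 1, 1]})
X = pd.get_dummies(df[["car"]])  # -> columns car_Audi, car_BMW, car_Toyota
clf = DecisionTreeClassifier().fit(X, df["attendance"])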

Best,
Sebastian

On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:

I'm getting some funny results. I am doing a regression decision tree; the response variables are assigned to levels.

The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories.

The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?

In R, for instance, you do as.factor(), which explicitly states the data type.

Thank you!


On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:


On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
Thanks, Guillaume.
Column transformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up -- specifying what you want for every feature is a pain.

It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see whether we can improve something there.
Do you mean that you would like to automatically detect the type of each feature (categorical/numerical) and apply a default encoder/scaling, such as discussed here:
https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127

IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box, which might be dangerous.
Also see
https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
which basically does that.
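For the manual route, the setup is typically just a ColumnTransformer that lists the columns explicitly -- a minimal sketch (the column names echo the toy example in this thread and are otherwise made up):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Car", "Attendance"]),
    ("num", "passthrough", ["Age"]),  # trees don't need scaling
])
model = make_pipeline(preprocess, DecisionTreeRegressor())
# model.fit(X_train, y_train) with X_train a DataFrame containing those columns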


Javier,
Actually, you guessed right. My real data has only one numerical variable; it looks more like this:

Gender  Date       Income  Car     Attendance
Male    2019/3/01  10000   BMW     Yes
Female  2019/5/02  9000    Toyota  No
Male    2019/7/15  12000   Audi    Yes

I am predicting income using all other categorical variables. Maybe catboost is the answer!

Thanks,

M






On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
If you have datasets with many categorical features, and perhaps many 
categories, the tools in sklearn are quite limited,
but there are alternative implementations of boosted trees that are designed 
with categorical features in mind. Take a look
at catboost [1], which has an sklearn-compatible API.

J

[1] https://catboost.ai/
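For reference, a minimal sketch of the catboost route (made-up data echoing the toy example in this thread; catboost is told which columns are categorical via cat_features, so no one-hot encoding is needed -- assuming I remember the API correctly):

import pandas as pd
from catboost import CatBoostRegressor

X = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
                  "Car": ["BMW", "Toyota", "Audi"]})
y = [10000, 9000, 12000]

# name the categorical columns instead of encoding them yourself
model = CatBoostRegressor(iterations=10, cat_features=["Gender", "Car"], verbose=0)
model.fit(X, y)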

On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
Hello all,
I'm very confused. Can the decision tree module handle both continuous and 
categorical features in the dataset? In this case, it's just CART 
(Classification and Regression Trees).

For example,
Gender  Age  Income  Car     Attendance
Male    30   10000   BMW     Yes
Female  35   9000    Toyota  No
Male    50   12000   Audi    Yes

According to the documentation
https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
it cannot!

It says: "scikit-learn implementation does not support categorical variables for now".

Is this true? If not, can someone point me to an example? If yes, what do 
people do?

Thank you very much!





--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

