Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo on my part.
Looks like I did the one-hot encoding correctly. My new variable names are:
car_Audi, car_BMW, etc.
But the decision tree still seems to treat the one-hot encoded columns as
numerical input and splits at 0.5. That doesn't look right to me. Perhaps I'm
doing something wrong?
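For reference, here is roughly what I am doing (a minimal sketch with made-up
data; my real column names differ):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi"],
    "income": [10000, 9000, 12000, 11000, 12500],
})

# One-hot encode the car column: car_Audi, car_BMW, car_Toyota (0/1 each).
X = pd.get_dummies(df[["car"]])
y = df["income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
# The printed tree splits these 0/1 columns at 0.5.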
Is there a good toy example on the sklearn website? I can only see this:
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
Thanks!
On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com>
wrote:
Hi,
> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.

That's not a one-hot encoding then.
For an Audi datapoint, it should be
BMW=0
Toyota=0
Audi=1
for a BMW
BMW=1
Toyota=0
Audi=0
and for a Toyota
BMW=0
Toyota=1
Audi=0
The split threshold should then be at 0.5 for any of these features.
Based on your email, I think you were assuming that the DT does the one-hot
encoding internally, which it doesn't. In practice, it is hard to guess what is
a nominal and what is an ordinal variable, so you have to do the one-hot
encoding before you give the data to the decision tree.
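A minimal sketch of doing the encoding yourself before fitting (made-up data,
using scikit-learn's OneHotEncoder):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

cars = np.array([["BMW"], ["Toyota"], ["Audi"], ["BMW"]])
income = np.array([10000.0, 9000.0, 12000.0, 11000.0])

# Encode before fitting; the tree never sees the strings themselves.
# (Newer scikit-learn versions use sparse_output=False instead of sparse=False.)
enc = OneHotEncoder(sparse=False)
X = enc.fit_transform(cars)  # columns: Audi, BMW, Toyota, each 0/1

tree = DecisionTreeRegressor().fit(X, income)
# Any split on these columns will have a threshold of 0.5.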
Best,
Sebastian
On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:
I'm getting some funny results. I am fitting a regression decision tree, and
the categorical variables are assigned to levels.
The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1,
Audi=2) as numerical values, not categories.
The tree splits at 0.5 and 1.5. Am I doing the one-hot encoding wrong? How does
sklearn know internally that 0 vs. 1 is categorical, not numerical?
In R, for instance, you call as.factor(), which explicitly declares the data type.
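The closest pandas equivalent I have found is astype('category'), but it does
not seem to change what the tree does:

import pandas as pd

df = pd.DataFrame({"Car": ["BMW", "Toyota", "Audi"]})

# Rough pandas equivalent of R's as.factor():
df["Car"] = df["Car"].astype("category")

# scikit-learn's tree estimators coerce their input to a plain float array,
# so this dtype alone does not make the feature categorical for the tree.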
Thank you!
On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:
On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:
Thanks, Guillaume.
ColumnTransformer looks pretty neat. I've also heard, though, that this
pipeline can be tedious to set up? Specifying what you want for every feature
is a pain.
It would be interesting for us to know which part of the pipeline is tedious
to set up, so we can see whether we can improve something there.
Do you mean that you would like to automatically detect the type of each
feature (categorical/numerical) and apply a default encoder/scaler, as
discussed here:
https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
IMO, from a user perspective, it would be cleaner in some cases, at the cost
of blindly applying a black box, which might be dangerous.
Also see
https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
which basically does that.
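For comparison, setting the ColumnTransformer up by hand is usually only a few
lines. A minimal sketch, using the example columns from later in this thread:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Income": [10000, 9000, 12000],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
})

categorical = ["Gender", "Car", "Attendance"]
numerical = ["Age"]

# Each column gets an explicit transformer; unlisted columns are dropped.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=0)),
])
model.fit(df[categorical + numerical], df["Income"])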
Javier,
Actually, you guessed right. My real data has only one numerical variable; it
looks more like this:
Gender  Date       Income  Car     Attendance
Male    2019/3/01  10000   BMW     Yes
Female  2019/5/02   9000   Toyota  No
Male    2019/7/15  12000   Audi    Yes
I am predicting Income using all the other (categorical) variables. Maybe
catboost is the answer!
Thanks,
M
On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlo...@ende.cc> wrote:
If you have datasets with many categorical features, and perhaps many
categories, the tools in sklearn are quite limited,
but there are alternative implementations of boosted trees that are designed
with categorical features in mind. Take a look
at catboost [1], which has an sklearn-compatible API.
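For a concrete picture, a minimal sketch (the data is made up; catboost
consumes the raw categorical columns once you tell it their indices):

from catboost import CatBoostRegressor

X = [["Male", "BMW", "Yes"],
     ["Female", "Toyota", "No"],
     ["Male", "Audi", "Yes"]]
y = [10000, 9000, 12000]

# Tell catboost which columns are categorical; no manual
# one-hot encoding is needed.
model = CatBoostRegressor(iterations=50, verbose=False)
model.fit(X, y, cat_features=[0, 1, 2])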
J
[1] https://catboost.ai/
On Sat, Sep 14, 2019 at 3:40 AM C W <tmrs...@gmail.com> wrote:
Hello all,
I'm very confused. Can the decision tree module handle both continuous and
categorical features in the dataset? In this case, it's just CART
(Classification and Regression Trees).
For example,
Gender  Age  Income  Car     Attendance
Male    30   10000   BMW     Yes
Female  35    9000   Toyota  No
Male    50   12000   Audi    Yes
According to the documentation
https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
it cannot!
It says: "scikit-learn implementation does not support categorical variables for
now".
Is this true? If not, can someone point me to an example? If yes, what do
people do?
Thank you very much!
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn