But the decision tree is still treating the one-hot-encoded features as numerical input and splitting at 0.5. This doesn't seem right. Perhaps I'm doing something wrong?

You're not doing anything wrong, and neither is the tree. Trees in sklearn don't support categorical variables, so every feature is treated as numerical.

This is why we do one-hot encoding: so that a set of numerical (one-hot-encoded) features can be treated as if they were a single categorical feature. A split at 0.5 on a 0/1 indicator column is exactly the intended behavior: it separates the rows where the category is present from those where it is absent.
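
A minimal sketch of what I mean (toy data; the car_* column names and income values are made up): fit a tree on one-hot columns and print the splits, and every threshold comes out at 0.5:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Three one-hot indicator columns, one per car brand.
X = pd.DataFrame({
    "car_Audi":   [1, 0, 0, 1],
    "car_BMW":    [0, 1, 0, 0],
    "car_Toyota": [0, 0, 1, 0],
})
y = [12000, 10000, 9000, 11500]  # made-up incomes

tree = DecisionTreeRegressor().fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
# Every threshold is 0.5, e.g. "car_Audi <= 0.5", which simply
# separates the Audi rows from the non-Audi rows.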


Nicolas

On 10/4/19 2:01 PM, C W wrote:
Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo on my part.

Looks like I did the one-hot encoding correctly. My new variable names are car_Audi, car_BMW, etc.

But the decision tree is still treating the one-hot-encoded features as numerical input and splitting at 0.5. This doesn't seem right. Perhaps I'm doing something wrong?

Is there a good toy example on the sklearn website? I only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.

Thanks!



On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <m...@sebastianraschka.com> wrote:

    Hi,

    The funny part is: the tree is taking the one-hot encoding (BMW=0,
    Toyota=1, Audi=2) as numerical values, not categories. The tree
    splits at 0.5 and 1.5.

    that's not a one-hot encoding then.

    For an Audi datapoint, it should be

    BMW=0
    Toyota=0
    Audi=1

    for BMW

    BMW=1
    Toyota=0
    Audi=0

    and for Toyota

    BMW=0
    Toyota=1
    Audi=0

    The split threshold should then be at 0.5 for any of these features.

    Based on your email, I think you were assuming that the DT does
    the one-hot encoding internally, which it doesn't. In practice, it
    is hard to guess which variables are nominal and which are ordinal,
    so you have to do the one-hot encoding yourself before you give the
    data to the decision tree.
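
    For example, a minimal sketch (made-up toy data) using pandas
    get_dummies to create the 0/1 columns above before fitting the tree:

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({"car": ["Audi", "BMW", "Toyota"],
                       "income": [12000, 10000, 9000]})

    # get_dummies expands "car" into car_Audi, car_BMW, car_Toyota,
    # each a 0/1 indicator column as described above.
    X = pd.get_dummies(df[["car"]])
    tree = DecisionTreeRegressor().fit(X, df["income"])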

    Best,
    Sebastian

    On Oct 4, 2019, at 11:48 AM, C W <tmrs...@gmail.com> wrote:

    I'm getting some funny results. I am fitting a regression decision
    tree, and the categorical variables are assigned to integer levels.

    The funny part is: the tree is taking the one-hot encoding (BMW=0,
    Toyota=1, Audi=2) as numerical values, not categories.

    The tree splits at 0.5 and 1.5. Am I doing the one-hot encoding
    wrong? How does sklearn know internally that 0 vs. 1 is
    categorical, not numerical?

    In R, for instance, you use as.factor(), which explicitly declares
    the data type.

    Thank you!


    On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3k...@gmail.com> wrote:



        On 9/15/19 8:16 AM, Guillaume Lemaître wrote:


        On Sat, 14 Sep 2019 at 20:59, C W <tmrs...@gmail.com> wrote:

            Thanks, Guillaume.
            ColumnTransformer looks pretty neat. I've also heard,
            though, that this pipeline can be tedious to set up?
            Specifying what you want for every feature is a pain.


        It would be interesting for us to know which part of the
        pipeline is tedious to set up, so we can see whether there is
        something to improve there. Do you mean that you would like to
        automatically detect the type of each feature
        (categorical/numerical) and apply a default encoder/scaler,
        as discussed here:
https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127

        IMO, from a user perspective, it would be cleaner in some
        cases, at the cost of blindly applying a black box, which
        might be dangerous.
        Also see
https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
        which basically does that.
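
        For reference, a minimal sketch (a hypothetical setup;
        make_column_selector is available in newer scikit-learn
        versions) that selects columns by dtype instead of listing
        every feature by hand:

        from sklearn.compose import make_column_transformer, make_column_selector
        from sklearn.preprocessing import OneHotEncoder, StandardScaler

        # String columns get one-hot encoded, numeric columns get
        # scaled; both groups are picked automatically by dtype.
        preprocess = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore"),
             make_column_selector(dtype_include=object)),
            (StandardScaler(),
             make_column_selector(dtype_include="number")),
        )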



            Javier,
            Actually, you guessed right. My real data has only one
            numerical variable; it looks more like this:

            Gender  Date       Income  Car     Attendance
            Male    2019/3/01  10000   BMW     Yes
            Female  2019/5/02   9000   Toyota  No
            Male    2019/7/15  12000   Audi    Yes

            I am predicting income using all the other, categorical
            variables. Maybe catboost is the answer!

            Thanks,

            M






            On Sat, Sep 14, 2019 at 9:25 AM Javier López
            <jlo...@ende.cc> wrote:

                If you have datasets with many categorical features,
                and perhaps many categories, the tools in sklearn
                are quite limited,
                but there are alternative implementations of boosted
                trees that are designed with categorical features in
                mind. Take a look
                at catboost [1], which has an sklearn-compatible API.

                J

                [1] https://catboost.ai/

                On Sat, Sep 14, 2019 at 3:40 AM C W
                <tmrs...@gmail.com> wrote:

                    Hello all,
                    I'm very confused. Can the decision tree module
                    handle both continuous and categorical features
                    in the dataset? In this case, it's just CART
                    (Classification and Regression Trees).

                    For example,
                    Gender  Age  Income  Car     Attendance
                    Male    30   10000   BMW     Yes
                    Female  35    9000   Toyota  No
                    Male    50   12000   Audi    Yes

                    According to the documentation
                    
https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
                    it cannot!

                    It says: "scikit-learn implementation does not
                    support categorical variables for now".

                    Is this true? If not, can someone point me to an
                    example? If yes, what do people do?

                    Thank you very much!








        --
        Guillaume Lemaitre
        INRIA Saclay - Parietal team
        Center for Data Science Paris-Saclay
        https://glemaitre.github.io/






