Hi, Georg,
I bring this up every time here on the mailing list :), and you are probably
aware of this issue, but it makes a difference whether your categorical data is
nominal or ordinal. For instance, if you have an ordinal variable with values
like {small, medium, large}, you probably want to encode it as {1, 2, 3} or
{1, 20, 100} or whatever is appropriate based on your domain knowledge
regarding the variable. If you have something like {blue, red, green}, it may
make more sense to do a one-hot encoding so that the classifier doesn't assume
a relationship between the values like blue > red > green or something like
that.
Now, the DictVectorizer and OneHotEncoder both do one-hot encoding. The
LabelEncoder does convert a variable to integer values, but if you have
something like {small, medium, large}, it wouldn't know the order (if that's an
ordinal variable); it would just assign integers following the sorted
(alphabetical) order of the labels, i.e., large=0, medium=1, small=2.
Thus, if you are dealing with ordinal variables, there's no way around doing
this manually; for example, you could create mapping dictionaries for that
(most conveniently done in pandas).
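
A minimal sketch of the mapping-dictionary idea, assuming pandas and a made-up
toy DataFrame (the column names "size" and "color" and the values 1/2/3 are
only for illustration; pick whatever fits your domain knowledge):

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["blue", "red", "green", "red"],
})

# Ordinal variable: spell out the order by hand with a mapping dictionary.
size_mapping = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_mapping)

# Nominal variable: one-hot encode it so that no order is implied.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```

Note that Series.map returns NaN for any value that is missing from the
dictionary, so unseen categories show up as NaN and could then be imputed in
a separate step if that is what you want.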
Best,
Sebastian
> On Aug 5, 2017, at 5:10 AM, Georg Heiler <[email protected]> wrote:
>
> Hi,
>
> the LabelEncoder is only meant for a single column, i.e. the target
> variable. Is the DictVectorizer or a manual chaining of multiple
> LabelEncoders (one per categorical column) the desired way to get values
> which can be fed into a subsequent classifier?
>
> Is there some way I have overlooked which works better and possibly also can
> handle unseen values by applying most frequent imputation?
>
> regards,
> Georg
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn