Hi,

I think there are many reasons that have led to the current situation.
One is that scikit-learn is based on numpy arrays, which do not offer
categorical data types (yet; ideas are being discussed:
https://numpy.org/neps/nep-0041-improved-dtype-support.html). Pandas already
has a categorical data type:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
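
To make the difference concrete, here is a minimal sketch of what the pandas
categorical type stores. The column values are a made-up toy example, not
anything from your data:

```python
import pandas as pd

# Hypothetical toy column, made up for illustration.
colors = pd.Series(["red", "green", "blue", "green", "red"], dtype="category")

# The distinct levels are stored once; each row holds a small integer code.
categories = list(colors.cat.categories)  # levels inferred from the data, sorted
codes = colors.cat.codes.tolist()         # one integer code per row

# Converting to a plain numpy array drops the categorical dtype,
# which is why the information does not survive a numpy-based pipeline.
as_numpy = colors.to_numpy()

print(categories, codes, as_numpy.dtype)
```

This is roughly what a "factor" is in R: a list of levels plus integer codes.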

For algorithms like random forests, native support for categorical variables
would be absolutely great.

Another reason might be that different communities have traditionally
handled categorical data in different ways. One-hot encoding is more common
on the ML side than on the stats side, for instance.

To your point:
> One-hot-encoding results in large number of features. This really blows
> up quickly. And I have to fight curse of dimensionality with PCA reduction.
> That's not cool!

Depending on the algorithm, a categorical variable may or may not need to be
expanded into a one-hot encoding under the hood, so the potential gain from
a dedicated categorical data type depends heavily on the algorithm used.
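
As a rough illustration of the feature-count issue, the sketch below compares
two encoders that scikit-learn already ships. The data is a made-up toy
column, and whether integer codes are acceptable is an assumption that depends
on the model: they impose an arbitrary order on the levels, which tree-based
models can often tolerate but linear models generally cannot.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy feature with 4 distinct levels, made up for illustration.
X = np.array([["a"], ["b"], ["c"], ["d"], ["a"], ["c"]])

# One column per level (returned as a sparse matrix by default).
one_hot = OneHotEncoder().fit_transform(X)

# A single column of integer codes, one code per level.
ordinal = OrdinalEncoder().fit_transform(X)

print(one_hot.shape)  # (6, 4)
print(ordinal.shape)  # (6, 1)
```

With many levels the one-hot matrix grows by one column per level, while the
integer-coded version stays at one column per original feature.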

Hope this helps!
Michael

On Thu, Apr 30, 2020 at 3:57 PM C W <tmrs...@gmail.com> wrote:

> Hello everyone,
>
> I am frustrated with the one-hot-encoding requirement for categorical
> feature. Why?
>
> I've used R and Stata software, none needs such transformation. They have
> a data type called "factors", which is different from "numeric".
>
> My problem with OHE:
> One-hot-encoding results in large number of features. This really blows up
> quickly. And I have to fight curse of dimensionality with PCA reduction.
> That's not cool!
>
> Can sklearn have a "factor" data type in the future? It would make life so
> much easier.
>
> Thanks a lot!
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
