Hi,

I think several factors have led to the current situation. One is that scikit-learn is based on numpy arrays, which do not offer a categorical data type yet (ideas are being discussed: https://numpy.org/neps/nep-0041-improved-dtype-support.html). Pandas already has a categorical data type: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
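
As a rough illustration of the pandas categorical type mentioned above, here is a minimal sketch. The toy column and its values are made up for the example. It compares the single-column categorical representation with the one-hot expansion of the same data.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration only.
colors = pd.Series(["red", "green", "blue", "green", "red"], dtype="category")

# A categorical column is stored as integer codes plus a lookup table
# of categories (string categories are sorted when inferred).
categories = colors.cat.categories.tolist()
codes = colors.cat.codes.tolist()
print(categories)
print(codes)

# One-hot encoding the same column produces one indicator column
# per category, which is where the feature count grows.
one_hot = pd.get_dummies(colors)
print(one_hot.shape)
```

With three categories the difference is small, but the one-hot width grows with the number of distinct values, while the categorical column stays a single column.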
For algorithms like random forests, having categorical variables would be absolutely great.

Another reason might be that different communities have traditionally handled categorical data in different ways. One-hot encoding is more common on the ML side than on the stats side, for instance.

To your point:

> One-hot-encoding results in large number of features. This really blows up
> quickly. And I have to fight curse of dimensionality with PCA reduction.
> That's not cool!

Depending on the algorithm being used, a categorical variable may or may not need to be expanded into a one-hot encoding under the hood, so the potential gain of having such a data encoding depends heavily on the algorithm.

Hope this helps!

Michael

On Thu, Apr 30, 2020 at 3:57 PM C W <tmrs...@gmail.com> wrote:

> Hello everyone,
>
> I am frustrated with the one-hot-encoding requirement for categorical
> feature. Why?
>
> I've used R and Stata software, none needs such transformation. They have
> a data type called "factors", which is different from "numeric".
>
> My problem with OHE:
> One-hot-encoding results in large number of features. This really blows up
> quickly. And I have to fight curse of dimensionality with PCA reduction.
> That's not cool!
>
> Can sklearn have a "factor" data type in the future? It would make life so
> much easier.
>
> Thanks a lot!
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn