OrdinalEncoder is the equivalent of pd.factorize and will work in the
scikit-learn ecosystem.
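
To make the comparison concrete, here is a minimal sketch on a made-up "color" column (the data and variable names are only for illustration). One difference worth knowing: pd.factorize numbers categories in order of first appearance, whereas OrdinalEncoder by default sorts the categories and stores the mapping, so the same encoding can be reapplied to new data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data, for illustration only
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pd.factorize: integer codes assigned in order of first appearance
codes, uniques = pd.factorize(df["color"])

# OrdinalEncoder: categories are sorted, and the fitted mapping is kept
# in `categories_` so it can be reused with `transform` on new data
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(df[["color"]])

print(codes)                # codes from pandas
print(encoder.categories_)  # learned categories, one array per column
print(encoded.ravel())      # codes from scikit-learn (floats)
```

Because the encoder is a regular transformer, it can be placed in a Pipeline or ColumnTransformer, which is not the case for pd.factorize.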

However, be aware that you should not swap OneHotEncoder for
OrdinalEncoder at will.
It depends on your machine learning pipeline.

As mentioned by Gael, tree-based algorithms will be fine with
OrdinalEncoder. If you have a linear model,
then you need to use OneHotEncoder if the categories do not have any
order, because the model would otherwise treat the integer codes as
ordered magnitudes.
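
As a rough sketch of what this looks like in practice (the toy data, the column name "city" and the choice of estimators are my own assumptions, not something taken from the notebook below), the encoder is simply chosen to match the model inside a pipeline:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: one unordered categorical feature
X = pd.DataFrame({"city": ["Paris", "Lyon", "Nice", "Lyon", "Paris", "Nice"]})
y = [1, 0, 1, 0, 1, 1]

# Linear model: one binary column per category, so no order is implied
linear_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)

# Tree-based model: integer codes are enough, since the tree only
# splits on thresholds and can isolate categories with several splits
tree_model = make_pipeline(
    OrdinalEncoder(),
    DecisionTreeClassifier(random_state=0),
)

linear_model.fit(X, y)
tree_model.fit(X, y)

print(linear_model.predict(X))
print(tree_model.predict(X))
```

With real data you would usually wrap the encoders in a ColumnTransformer so that only the categorical columns are encoded.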

I will just refer to a notebook that we taught at EuroSciPy last year:
https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb

On Fri, 1 May 2020 at 05:11, C W <tmrs...@gmail.com> wrote:

> Hermes,
>
> That's an interesting function. Does it work with sklearn after
> factorize?  Is there any example? Thanks!
>
> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales <paisanoher...@hotmail.com>
> wrote:
>
>> Perhaps pd.factorize could help?
>>
>> ------------------------------
>> *From:* scikit-learn <scikit-learn-bounces+paisanohermes=
>> hotmail....@python.org> on behalf of Gael Varoquaux <
>> gael.varoqu...@normalesup.org>
>> *Sent:* Thursday, April 30, 2020 5:12:06 PM
>> *To:* Scikit-learn mailing list <scikit-learn@python.org>
>> *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding
>> for categorical features? Can we have a "factor" data type?
>>
>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
>> > I've used R and Stata software, none needs such transformation. They have
>> > a data type called "factors", which is different from "numeric".
>>
>> > My problem with OHE:
>> > One-hot-encoding results in a large number of features. This really
>> > blows up quickly. And I have to fight the curse of dimensionality with
>> > PCA reduction. That's not cool!
>>
>> Most statistical models still do one-hot encoding under the hood. So, R
>> and Stata do it too.
>>
>> Typically, tree-based models can be adapted to work directly on
>> categorical data. Ours don't. It's work in progress.
>>
>> G
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>>
>> https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/