Thank you for the link, Guillaume. In my particular case, I am working on random forest classification.
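For reference, here is a minimal sketch of what an OrdinalEncoder in front of a random forest could look like. The toy DataFrame, the column names ("color", "size"), and the labels are invented purely for illustration, and the hyperparameters are arbitrary; this is not taken from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for this sketch only.
X = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 2.0, 1.5, 3.5],
})
y = [0, 1, 1, 1, 0, 1]

# Encode the categorical column as integer codes; pass numeric columns through.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["color"])],
    remainder="passthrough",
)

# Tree-based model on top of the ordinal codes (one column per feature,
# so no one-hot blow-up in the number of features).
model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)

# For comparison: pd.factorize numbers categories in order of appearance,
# whereas OrdinalEncoder (with default settings) sorts the categories first.
codes, uniques = pd.factorize(X["color"])
```

Keeping the encoder inside the pipeline means the same category-to-integer mapping learned during fit is reused at predict time, which is harder to guarantee when calling pd.factorize by hand on separate train and test frames.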
The notebook seems great. I will have to go through it in detail. I'm still fairly new at using sklearn.

Thank you for everyone's quick response, always feeling loved on here! :)

On Fri, May 1, 2020 at 4:00 AM Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:

> OrdinalEncoder is the equivalent of pd.factorize and will work in the
> scikit-learn ecosystem.
>
> However, be aware that you should not just swap OneHotEncoder for
> OrdinalEncoder at will.
> It depends on your machine learning pipeline.
>
> As mentioned by Gael, tree-based algorithms will be fine with
> OrdinalEncoder. If you have a linear model,
> then you need to use the OneHotEncoder if the categories do not have any
> order.
>
> I will just refer to one notebook that we taught at EuroSciPy last year:
>
> https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb
>
> On Fri, 1 May 2020 at 05:11, C W <tmrs...@gmail.com> wrote:
>
>> Hermes,
>>
>> That's an interesting function. Does it work with sklearn after
>> factorize? Is there any example? Thanks!
>>
>> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales <paisanoher...@hotmail.com>
>> wrote:
>>
>>> Perhaps pd.factorize could help?
>>>
>>> ------------------------------
>>> *From:* scikit-learn <scikit-learn-bounces+paisanohermes=hotmail....@python.org>
>>> on behalf of Gael Varoquaux <gael.varoqu...@normalesup.org>
>>> *Sent:* Thursday, April 30, 2020 5:12:06 PM
>>> *To:* Scikit-learn mailing list <scikit-learn@python.org>
>>> *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding
>>> for categorical features? Can we have a "factor" data type?
>>>
>>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
>>> > I've used R and Stata software, none needs such transformation. They have a
>>> > data type called "factors", which is different from "numeric".
>>>
>>> > My problem with OHE:
>>> > One-hot-encoding results in large number of features. This really blows up
>>> > quickly. And I have to fight curse of dimensionality with PCA reduction. That's
>>> > not cool!
>>>
>>> Most statistical models still do one-hot encoding under the hood. So, R
>>> and Stata do it too.
>>>
>>> Typically, tree-based models can be adapted to work directly on
>>> categorical data. Ours don't. It's work in progress.
>>>
>>> G
>
> --
> Guillaume Lemaitre
> Scikit-learn @ Inria Foundation
> https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn