Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

Gael Varoquaux Tue, 20 Nov 2018 13:18:39 -0800

On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote:
> I would love to see the TargetEncoder ported to scikit-learn.
> The CountFeaturizer is pretty stalled:
> https://github.com/scikit-learn/scikit-learn/pull/9614


So would I. But there are several ways of doing it:

- the naive way is not the right one: just computing the average of y
  for each category leads to overfitting quite fast

- it can be done cross-validated, splitting the train data, in a
  "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)

- it can be done using empirical-Bayes shrinkage, which is what we
  currently do in dirty_cat.
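
To make the three options above concrete, here is a minimal pure-Python
sketch. It is not the dirty_cat implementation: the function names and
the smoothing strength `m` are hypothetical, and the fixed `m` only
stands in for an empirical-Bayes estimate, which would derive the amount
of shrinkage from the data.

```python
from collections import defaultdict


def category_stats(categories, y):
    """Return {category: (count, mean of y)} for the given samples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    return {c: (counts[c], sums[c] / counts[c]) for c in counts}


def naive_encode(categories, y):
    """Naive target encoding: each sample gets its category's mean of y.

    A sample's own target leaks into its encoding, so rare categories
    end up (almost) memorizing y.
    """
    stats = category_stats(categories, y)
    return [stats[c][1] for c in categories]


def shrunk_encode(categories, y, m=1.0):
    """Shrink each category mean toward the global mean of y.

    `m` is a hypothetical, fixed smoothing strength: a category seen
    n times puts weight n / (n + m) on its own mean.
    """
    global_mean = sum(y) / len(y)
    stats = category_stats(categories, y)
    return [
        (stats[c][0] * stats[c][1] + m * global_mean) / (stats[c][0] + m)
        for c in categories
    ]


def cross_fit_encode(categories, y, n_folds=2):
    """Encode each fold with means computed on the other folds only."""
    n = len(y)
    global_mean = sum(y) / n
    encoded = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        stats = category_stats(
            [categories[i] for i in rest], [y[i] for i in rest]
        )
        for i in held_out:
            c = categories[i]
            # Categories unseen in the other folds fall back to the
            # global mean of y.
            encoded[i] = stats[c][1] if c in stats else global_mean
    return encoded


cats = ["a", "a", "a", "b"]
y = [1.0, 0.0, 1.0, 0.0]
naive = naive_encode(cats, y)
shrunk = shrunk_encode(cats, y, m=1.0)
cross = cross_fit_encode(cats, y, n_folds=2)
```

On this toy input the naive encoding maps the single "b" sample exactly
to its own target, whereas the shrunk and cross-fit variants pull it
toward the global mean of y.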

We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.

> Have you benchmarked the other encoders in the category_encoding lib?
> I would be really curious to know when/how they help.

We did (part of the results are in the publication), and we didn't
have great success.

Gaël

> On 11/20/18 3:58 PM, Gael Varoquaux wrote:
> > Hi scikit-learn friends,

> > As you might have seen on twitter, my lab -with a few friends- has
> > embarked on research to ease machine learning on "dirty data". We are
> > experimenting with new encoding methods for non-curated string categories.
> > For this, we are developing a small software project called "dirty_cat":
> > https://dirty-cat.github.io/stable/

> > dirty_cat is a test bed for new ideas on "dirty categories". It is a
> > research project, though we still try to do decent software engineering
> > :). Rather than contributing to existing codebases (such as the great
> > categorical-encoding project in scikit-learn-contrib), we spun it out
> > as a separate software project to have the freedom to try out ideas that
> > we might give up after gaining insight.

> > We hope that it is a useful tool: if you have non-curated string
> > categories, please give it a try. Understanding what works and what does
> > not is important to know what to consolidate. Hopefully one day we can
> > develop a tool that is of wide-enough interest that it can go in
> > scikit-learn-contrib, or maybe even scikit-learn.

> > Also, if you have suggestions of publicly available databases that we
> > could try it on, we would love to hear from you.

> > Cheers,

> > Gaël

> > PS: if you want to work on dirty-data problems in Paris as a post-doc or
> > an engineer, drop me a line
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn


-- 
    Gael Varoquaux
    Senior Researcher, INRIA Parietal
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-79-68
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
