Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-21 Thread Andreas Mueller
On 11/21/18 10:34 AM, Gael Varoquaux wrote: Joris has just accepted to help with benchmarking. We can have preliminary results earlier. The question really is: out of the different variants that exist, which one should we choose. I think that it is a legitimate question that arises on many of

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-21 Thread Gael Varoquaux
On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote: > The PR is over a year old already, and you hadn't voiced any opposition > there. My bad, sorry. Given the name, I had not guessed the link between the PR and encoding of categorical features. I find myself very much in agreement wi

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-21 Thread Andreas Mueller
On 11/21/18 12:38 AM, Gael Varoquaux wrote: On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote: On 11/20/18 4:43 PM, Gael Varoquaux wrote: We are planning to do heavy benchmarking of those strategies, to figure out tradeoff. But we won't get to it before February, I am afraid.

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote: > On 11/20/18 4:43 PM, Gael Varoquaux wrote: > > We are planning to do heavy benchmarking of those strategies, to figure > > out tradeoff. But we won't get to it before February, I am afraid. > Does that mean you'd be opposed to addi

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
On 11/20/18 4:43 PM, Gael Varoquaux wrote: We are planning to do heavy benchmarking of those strategies, to figure out tradeoff. But we won't get to it before February, I am afraid. Does that mean you'd be opposed to adding the leave-one-out TargetEncoder before you do this? I would really li

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote: > > - it can be done cross-validated, splitting the train data, in a > > "cross-fit" strategy > > (see https://github.com/dirty-cat/dirty_cat/issues/53) > This is called leave-one-out in the category_encoding library, I think, > an

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
On 11/20/18 4:16 PM, Gael Varoquaux wrote: - the naive way is not the right one: just computing the average of y for each category leads to overfitting quite fast - it can be done cross-validated, splitting the train data, in a "cross-fit" strategy (see https://github.com/dirty-cat/dirty
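
The two strategies quoted in the message above (the naive per-category average of y, and the cross-validated "cross-fit" variant) can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the fold assignment, and the fallback for categories unseen in the other folds are hypothetical, and none of it is the dirty_cat, category_encoding, or scikit-learn API.

```python
import numpy as np


def naive_target_encode(categories, y):
    """Naive variant: encode each row by the mean of y over ALL rows sharing
    its category, including the row itself (hence the overfitting risk)."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    encoded = np.empty(len(y))
    for cat in np.unique(categories):
        mask = categories == cat
        encoded[mask] = y[mask].mean()
    return encoded


def cross_fit_target_encode(categories, y, n_splits=2, seed=0):
    """Cross-fit variant: encode each row using category means computed on
    the other folds only, so a row never sees its own target."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Hypothetical fold assignment: shuffle row indices, then deal them out.
    folds = rng.permutation(len(y)) % n_splits
    encoded = np.empty(len(y))
    for k in range(n_splits):
        held_out = folds == k
        for cat in np.unique(categories[held_out]):
            same_cat = categories == cat
            train = ~held_out & same_cat
            # Assumed fallback: if the category is absent from the other
            # folds, use the mean of y over those other folds.
            value = y[train].mean() if train.any() else y[~held_out].mean()
            encoded[held_out & same_cat] = value
    return encoded


if __name__ == "__main__":
    cats = ["a", "a", "b", "b"]
    target = [1, 0, 1, 1]
    print(naive_target_encode(cats, target))
    print(cross_fit_target_encode(cats, target, n_splits=2))
```

With as many folds as rows (n_splits equal to len(y)), each row is encoded from every other row of its category, which is one way to read the "leave-one-out" naming discussed in this thread.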

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote: > I would love to see the TargetEncoder ported to scikit-learn. > The CountFeaturizer is pretty stalled: > https://github.com/scikit-learn/scikit-learn/pull/9614 So would I. But there are several ways of doing it: - the naive way is

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
I would love to see the TargetEncoder ported to scikit-learn. The CountFeaturizer is pretty stalled: https://github.com/scikit-learn/scikit-learn/pull/9614 :-/ Have you benchmarked the other encoders in the category_encoding lib? I would be really curious to know when/how they help. On 11/20/1

[scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
Hi scikit-learn friends, As you might have seen on twitter, my lab -with a few friends- has embarked on research to ease machine learning on "dirty data". We are experimenting with new encoding methods for non-curated string categories. For this, we are developing a small software project called "dirty_cat":