Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote: > On 11/20/18 4:43 PM, Gael Varoquaux wrote: > > We are planning to do heavy benchmarking of those strategies, to figure > > out tradeoff. But we won't get to it before February, I am afraid. > Does that mean you'd be opposed to addi

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
On 11/20/18 4:43 PM, Gael Varoquaux wrote: We are planning to do heavy benchmarking of those strategies, to figure out tradeoff. But we won't get to it before February, I am afraid. Does that mean you'd be opposed to adding the leave-one-out TargetEncoder before you do this? I would really li

Re: [scikit-learn] make all new parameters keyword-only?

2018-11-20 Thread Olivier Grisel
+1 on the idea in general (and to enforce this on new classes / params). +1 to be conservative and not break existing code. On Tue, Nov 20, 2018 at 21:09, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote: > On Sun, Nov 18, 2018 at 11:14, Joel Nothman wrote: > >> I think we're all ag

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote: > > - it can be done cross-validated, splitting the train data, in a > > "cross-fit" strategy > > (see https://github.com/dirty-cat/dirty_cat/issues/53) > This is called leave-one-out in the category_encoding library, I think, > an

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
On 11/20/18 4:16 PM, Gael Varoquaux wrote: - the naive way is not the right one: just computing the average of y for each category leads to overfitting quite fast - it can be done cross-validated, splitting the train data, in a "cross-fit" strategy (see https://github.com/dirty-cat/dirty
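
For readers skimming the thread, the difference between the "naive" and the "cross-fit" strategies mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the round-robin fold assignment, and the fallback to the global mean for unseen categories are all hypothetical choices, not the dirty_cat or category_encoding implementation.

```python
from collections import defaultdict


def category_means(categories, y):
    """Mean of y for each category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, y):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def naive_target_encode(categories, y):
    """Naive strategy: each row is encoded with a mean that includes its own target."""
    means = category_means(categories, y)
    return [means[c] for c in categories]


def cross_fit_target_encode(categories, y, n_folds=2):
    """Cross-fit strategy: each row is encoded with means computed on the other folds only.

    Folds are assigned round-robin by row index (an assumption made for brevity).
    Categories absent from the other folds fall back to the global mean of y.
    """
    n = len(categories)
    global_mean = sum(y) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        means = category_means([categories[i] for i in rest],
                               [y[i] for i in rest])
        for i in held_out:
            encoded[i] = means.get(categories[i], global_mean)
    return encoded


cats = ["a", "a", "b", "b", "c"]
y = [1.0, 0.0, 1.0, 1.0, 0.0]
naive = naive_target_encode(cats, y)
cross = cross_fit_target_encode(cats, y, n_folds=2)
```

In this toy data the category "c" appears once, so the naive encoding of that row is exactly its own target value, which is the kind of leakage that leads to overfitting. The cross-fit version never uses a row's own target to encode it.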

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote: > I would love to see the TargetEncoder ported to scikit-learn. > The CountFeaturizer is pretty stalled: > https://github.com/scikit-learn/scikit-learn/pull/9614 So would I. But there are several ways of doing it: - the naive way is

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
I would love to see the TargetEncoder ported to scikit-learn. The CountFeaturizer is pretty stalled: https://github.com/scikit-learn/scikit-learn/pull/9614 :-/ Have you benchmarked the other encoders in the category_encoding lib? I would be really curious to know when/how they help. On 11/20/1

[scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Gael Varoquaux
Hi scikit-learn friends, As you might have seen on twitter, my lab -with a few friends- has embarked on research to ease machine learning on "dirty data". We are experimenting on new encoding methods for non-curated string categories. For this, we are developing a small software project called "dirty_cat":

Re: [scikit-learn] make all new parameters keyword-only?

2018-11-20 Thread Joris Van den Bossche
On Sun, Nov 18, 2018 at 11:14, Joel Nothman wrote: > I think we're all agreed that this change would be a good thing. > > What we're not agreed on is how much risk we take by breaking legacy code > that relied on argument order. > I think that, in principle, it could be possible to do this with a
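
As background for this thread, here is a minimal sketch of what keyword-only parameters look like in Python and why switching to them can break callers that relied on argument order. The class and its parameters are hypothetical and are not taken from scikit-learn. The sketch does not attempt to show the migration mechanism discussed in the truncated message above.

```python
class HypotheticalEstimator:
    """Toy class: parameters after the bare ``*`` must be passed by keyword."""

    def __init__(self, n_components=2, *, tol=1e-4, max_iter=100):
        self.n_components = n_components
        self.tol = tol
        self.max_iter = max_iter


# Passing tol by keyword works.
est = HypotheticalEstimator(3, tol=1e-3)

# Legacy-style positional use of tol raises TypeError.
try:
    HypotheticalEstimator(3, 1e-3)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Code written as `HypotheticalEstimator(3, 1e-3)` before such a change would stop working afterwards, which is the legacy-code risk raised in the quoted message.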

Re: [scikit-learn] Next Sprint

2018-11-20 Thread Gael Varoquaux
On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > We can also do Paris in April / May or June if that's ok with Joel and better > for Andreas. Absolutely. My thoughts here are that I want to minimize transportation, partly because flying has a large carbon footprint. Also, for per

Re: [scikit-learn] Next Sprint

2018-11-20 Thread Olivier Grisel
We can also do Paris in April / May or June if that's ok with Joel and better for Andreas. I am teaching on Fridays from end of January to March. But I can miss half a day of sprint to teach my class. -- Olivier