On 11/21/18 10:34 AM, Gael Varoquaux wrote:
Joris has just accepted to help with benchmarking. We can have
preliminary results earlier. The question really is: out of the different
variants that exist, which one should we choose? I think that it is a
legitimate question that arises on many of…
On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote:
> The PR is over a year old already, and you hadn't voiced any opposition
> there.
My bad, sorry. Given the name, I had not guessed the link between the PR
and encoding of categorical features. I find myself very much in
agreement with…
On 11/21/18 12:38 AM, Gael Varoquaux wrote:
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote:
On 11/20/18 4:43 PM, Gael Varoquaux wrote:
We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote:
> On 11/20/18 4:43 PM, Gael Varoquaux wrote:
> > We are planning to do heavy benchmarking of those strategies, to figure
> > out the tradeoffs. But we won't get to it before February, I am afraid.
> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder…
On 11/20/18 4:43 PM, Gael Varoquaux wrote:
We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.
Does that mean you'd be opposed to adding the leave-one-out TargetEncoder
before you do this? I would really like…
On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote:
> > - it can be done cross-validated, splitting the train data, in a
> > "cross-fit" strategy
> > (see https://github.com/dirty-cat/dirty_cat/issues/53)
> This is called leave-one-out in the category_encoding library, I think,
> an…
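For reference, leave-one-out target encoding (the variant Andreas refers to) replaces each sample's category by the mean of y over all *other* training samples of that category, so a sample's own target never leaks into its own feature. A minimal hand-rolled sketch on toy data, not the category_encoding library's actual API:

```python
import numpy as np

# Toy data: two categories and a numeric target.
categories = np.array([0, 0, 0, 1, 1])
y = np.array([1.0, 2.0, 3.0, 10.0, 20.0])

encoded = np.empty_like(y)
for i, c in enumerate(categories):
    mask = categories == c
    mask[i] = False          # leave the current sample out
    encoded[i] = y[mask].mean()

# encoded -> [2.5, 2.0, 1.5, 20.0, 10.0]
# e.g. sample 0 (category 0) gets mean(2.0, 3.0) = 2.5
```

In practice a singleton category would leave `y[mask]` empty here (mean of nothing is NaN), so real implementations fall back to something like the global target mean in that case.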
On 11/20/18 4:16 PM, Gael Varoquaux wrote:
- the naive way is not the right one: just computing the average of y
for each category leads to overfitting quite fast
- it can be done cross-validated, splitting the train data, in a
"cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)…
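To make the contrast concrete, the two strategies Gael lists might be sketched roughly like this on toy data (plain NumPy/scikit-learn, purely illustrative; this is not the dirty_cat or scikit-learn API under discussion):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
categories = rng.randint(0, 5, size=100)   # one categorical column
y = rng.normal(size=100)                   # the target

# Naive target encoding: mean of y per category over the FULL training
# set. Each sample's own target leaks into its feature, so downstream
# models overfit quickly.
overall_mean = y.mean()
naive_means = {c: y[categories == c].mean() for c in np.unique(categories)}
naive = np.array([naive_means[c] for c in categories])

# "Cross-fit" target encoding: split the training data and encode each
# fold with per-category means computed on the OTHER folds only, so no
# sample ever sees its own target.
crossfit = np.empty_like(y)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(y):
    fold_means = {
        c: y[train][categories[train] == c].mean()
        for c in np.unique(categories[train])
    }
    # fall back to the global mean for categories unseen in these folds
    crossfit[test] = [fold_means.get(c, overall_mean) for c in categories[test]]
```

Leave-one-out encoding, mentioned elsewhere in the thread, is the limiting case of this cross-fit idea where every sample is its own held-out fold.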
On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote:
> I would love to see the TargetEncoder ported to scikit-learn.
> The CountFeaturizer is pretty stalled:
> https://github.com/scikit-learn/scikit-learn/pull/9614
So would I. But there are several ways of doing it:
- the naive way is not the right one: just computing the average of y
for each category leads to overfitting quite fast…
I would love to see the TargetEncoder ported to scikit-learn.
The CountFeaturizer is pretty stalled:
https://github.com/scikit-learn/scikit-learn/pull/9614
:-/
Have you benchmarked the other encoders in the category_encoding lib?
I would be really curious to know when/how they help.
On 11/20/1…
Hi scikit-learn friends,
As you might have seen on twitter, my lab -with a few friends- has
embarked on research to ease machine learning on "dirty data". We are
experimenting on new encoding methods for non-curated string categories.
For this, we are developing a small software project called "dirty_cat": …