On Fri, Dec 14, 2018 at 16:46 Andreas Mueller <t3k...@gmail.com> wrote:

> As far as I understand, the open PR is not a leave-one-out TargetEncoder?
>
> I would want it to be :-/
>
> I also did not yet add the CountFeaturizer from that scikit-learn PR,
> because it is actually quite different (e.g it doesn't work for regression
> tasks, as it counts conditional on y). But for classification it could be
> easily added to the benchmarks.
>
> I'm confused now. That's what TargetEncoder and leave-one-out
> TargetEncoder do as well, right?
>

As far as I understand, that is not exactly what those do. The
TargetEncoder (as implemented in dirty_cat, category_encoders and
hccEncoders) calculates, for each category, the expected value of the
target conditional on that category. For binary classification this indeed
comes down to counting the 0's and 1's, and the information contained in
the result might be similar to the sklearn PR, but the format is different:
those packages compute the probability (a value between 0 and 1, the number
of 1's divided by the number of samples in that category) and return it as
a single column, instead of returning two columns with the counts of the
0's and 1's.
And for regression this is no longer related to counting, but is simply the
average of the target per category (in practice, the TargetEncoder computes
the same thing for regression and binary classification: the average of the
target per category. The CountFeaturizer, however, doesn't work for
regression, since there are no discrete values in the target to count).
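
To make that format difference concrete, here is a small toy illustration
in plain pandas (my own sketch, not the code of any of those packages):

# A minimal sketch of the difference in output format, for a binary target.
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "A", "B", "B"],
                   "y":    [1,   1,   0,   0,   0]})

# TargetEncoder-style: a single column with the mean of y per category
# (for binary y this is the empirical probability of class 1).
target_encoded = df.groupby("city")["y"].mean()
print(target_encoded)
# city
# A    0.666667
# B    0.000000

# CountFeaturizer-style: one column per class with the counts of y
# conditional on the category.
counts = pd.crosstab(df["city"], df["y"])
print(counts)
# y     0  1
# city
# A     1  2
# B     2  0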

Furthermore, the implementations in the three packages mentioned above all
have some kind of regularization (empirical Bayes shrinkage, or KFold or
leave-one-out cross-validation), while this is also not present in the
CountFeaturizer PR (although this aspect is of course exactly what we want
to test in the benchmarks).
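
For reference, the leave-one-out flavour of that regularization boils down
to something like the following (again just a rough sketch, not the actual
implementation of any of those packages): each row is encoded with the mean
of the target over the *other* rows of its category, so the row's own label
cannot leak into its encoding.

import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "A", "B", "B"],
                   "y":    [1,   1,   0,   0,   1]})

grp = df.groupby("city")["y"]
sums = grp.transform("sum")
counts = grp.transform("count")
# leave-one-out mean: (sum of y in category - own y) / (count in category - 1)
# (categories with a single sample would need a fallback, e.g. the global mean)
df["city_loo"] = (sums - df["y"]) / (counts - 1)
print(df)
#   city  y  city_loo
# 0    A  1       0.5
# 1    A  1       0.5
# 2    A  0       1.0
# 3    B  0       1.0
# 4    B  1       0.0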

Another thing I noticed in the CountFeaturizer implementation is that the
behaviour differs depending on whether y is passed or not. First, I find
that a bit strange, because the two behaviours are quite different:
counting the categories (just encoding the categorical variable with a
notion of its frequency in the training set) versus counting the target
conditional on the category. But also, when using a transformer in a
Pipeline, you don't control whether y is passed, I think? So in that case
you always get the behaviour of counting the target.
I would find it more logical to have those two things in two separate
transformers (if we think the "frequency encoder" is useful enough); a
rough sketch of what I mean follows below.
(I still need to give this feedback on the PR, but that will be for after
the holidays.)
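
To illustrate the first behaviour on its own, a pure "frequency encoder"
could be as simple as the following (a hypothetical FrequencyEncoder class,
not something that exists in the PR or in scikit-learn); it only counts how
often each category occurs in the training set and ignores y entirely:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # X is assumed to be a single categorical column here,
        # just to keep the sketch short.
        values, counts = np.unique(np.asarray(X).ravel(), return_counts=True)
        self.counts_ = dict(zip(values, counts))
        return self

    def transform(self, X):
        # unseen categories get a count of 0
        return np.array([[self.counts_.get(v, 0)]
                         for v in np.asarray(X).ravel()])


X = np.array([["A"], ["A"], ["B"], ["A"], ["C"]])
print(FrequencyEncoder().fit_transform(X))
# [[3]
#  [3]
#  [1]
#  [3]
#  [1]]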

Joris

