On Fri, 14 Dec 2018 at 16:46, Andreas Mueller <[email protected]> wrote:
> As far as I understand, the open PR is not a leave-one-out TargetEncoder?
>
> I would want it to be :-/
>
> I also did not yet add the CountFeaturizer from that scikit-learn PR,
> because it is actually quite different (e.g. it doesn't work for
> regression tasks, as it counts conditional on y). But for classification
> it could be easily added to the benchmarks.
>
> I'm confused now. That's what TargetEncoder and leave-one-out
> TargetEncoder do as well, right?

As far as I understand, that is not exactly what those do.

The TargetEncoder (as implemented in dirty_cat, category_encoders and
hccEncoders) calculates, for each category, the expected value of the
target conditional on that category. For binary classification this indeed
comes down to counting the 0's and 1's, so the information contained in
the result is similar to that of the sklearn PR, but the format is
different: those packages compute the probability (a value between 0 and
1, the number of 1's divided by the number of samples in that category)
and return it as a single column, instead of returning two columns with
the counts of the 0's and 1's (see the small sketch in the PS below).

For regression this is no longer related to counting, but is simply the
average of the target per category. In practice the TargetEncoder computes
the same thing for regression and binary classification (the average of
the target per category), whereas the CountFeaturizer doesn't work for
regression, since there are no discrete target values to count.

Furthermore, the implementations in the 3 mentioned packages all apply
some kind of regularization (empirical Bayes shrinkage, or KFold or
leave-one-out cross-validation), which is also not present in the
CountFeaturizer PR (but that aspect is of course something we want to
actually test in the benchmarks).

Another thing I noticed in the CountFeaturizer implementation is that its
behaviour differs depending on whether y is passed or not. First, I find
that a bit strange, because the two behaviours are quite different:
counting the categories (just encoding the categorical variable with a
notion of its frequency in the training set) is not the same as counting
the target conditional on the category. But also, when using a transformer
in a Pipeline, you don't control the passing of y, I think? So in that
case you always get the behaviour of counting the target. I would find it
more logical to have those two things in two separate transformers (if we
think the "frequency encoder" is useful enough).

(I need to give this feedback on the PR, but that will be for after the
holidays)

Joris
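
PS: to make the difference in output format concrete, here is a minimal
pandas sketch with made-up toy data (just for illustration; none of the
mentioned packages is necessarily implemented this way):

    import pandas as pd

    # Toy training data: one categorical feature and a binary target.
    df = pd.DataFrame({
        "city":   ["A", "A", "A", "B", "B", "C", "C"],
        "target": [ 1,   0,   1,   0,   0,   1,   0 ],
    })

    # Target encoding (dirty_cat / category_encoders style, without any
    # shrinkage): one column with the mean of the target per category.
    # For a binary target this is the empirical P(y=1 | category); for a
    # regression target it is simply the per-category average.
    target_encoding = df.groupby("city")["target"].mean()
    print(target_encoding)
    # city
    # A    0.666667
    # B    0.000000
    # C    0.500000

    # Count featurization (in the spirit of the CountFeaturizer PR): one
    # column per class with the count of that class within each category.
    # Similar information for binary classification, but a different
    # format, and not defined for a continuous regression target.
    count_features = pd.crosstab(df["city"], df["target"])
    print(count_features)
    # target  0  1
    # city
    # A       1  2
    # B       2  0
    # C       1  1

    # One possible regularization mentioned above, leave-one-out encoding:
    # encode each row with the per-category mean computed over the *other*
    # rows of that category (assumes every category has at least 2 rows).
    sums = df.groupby("city")["target"].transform("sum")
    counts = df.groupby("city")["target"].transform("count")
    loo_encoding = (sums - df["target"]) / (counts - 1)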
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
