Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

Andreas Mueller Tue, 20 Nov 2018 13:38:16 -0800



On 11/20/18 4:16 PM, Gael Varoquaux wrote:

- the naive way is not the right one: just computing the average of y
   for each category leads to overfitting quite fast

- it can be done cross-validated, splitting the train data, in a
   "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)

This is called leave-one-out in the category_encoders library, I think,
and that's what my first implementation would be.
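
To make the difference between the first two strategies concrete, here is a
minimal NumPy sketch of naive target encoding next to two held-out variants
(leave-one-out and K-fold cross-fitting). The function names, the fallback
for unseen or singleton categories, and the fold assignment are all
assumptions made for illustration; this is not the code of dirty_cat or of
any encoding library.

```python
import numpy as np


def _codes(cat):
    """Map category labels to integer codes 0..n_categories-1."""
    _, codes = np.unique(np.asarray(cat), return_inverse=True)
    return codes.ravel()


def naive_target_encode(cat, y):
    """Encode each row by the mean of y over its category (own row included)."""
    codes = _codes(cat)
    y = np.asarray(y, dtype=float)
    sums = np.bincount(codes, weights=y)
    counts = np.bincount(codes)
    return (sums / counts)[codes]


def leave_one_out_encode(cat, y):
    """Encode each row by its category mean of y, excluding the row itself.

    Assumes at least two rows. A category seen only once has no other rows,
    so it falls back to the mean of y over all other rows (an assumption).
    """
    codes = _codes(cat)
    y = np.asarray(y, dtype=float)
    sums = np.bincount(codes, weights=y)[codes]
    counts = np.bincount(codes)[codes]
    fallback = (y.sum() - y) / (len(y) - 1)
    others = np.maximum(counts - 1, 1)  # avoid dividing by zero for singletons
    return np.where(counts > 1, (sums - y) / others, fallback)


def cross_fit_encode(cat, y, n_splits=5, seed=0):
    """Encode each fold using category means fitted on the other folds.

    Assumes 2 <= n_splits <= number of rows.
    """
    codes = _codes(cat)
    y = np.asarray(y, dtype=float)
    n_cat = codes.max() + 1
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(y)) % n_splits  # balanced random fold labels
    out = np.empty(len(y))
    for k in range(n_splits):
        held_out = fold == k
        sums = np.bincount(codes[~held_out], weights=y[~held_out], minlength=n_cat)
        counts = np.bincount(codes[~held_out], minlength=n_cat)
        # Categories absent from the fitting folds get the fitting-fold mean.
        fallback = y[~held_out].mean()
        means = np.where(counts > 0, sums / np.maximum(counts, 1), fallback)
        out[held_out] = means[codes[held_out]]
    return out
```

In both held-out variants a row's encoding never uses that row's own target,
which is what limits the overfitting of the naive version on rare categories.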


- it can be done using empirical-Bayes shrinkage, which is what we
   currently do in dirty_cat.

Reference / explanation?
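
For readers of the thread, one common form of shrinkage pulls each category
mean toward the global mean, with more pull for categories that have few
samples. The sketch below uses a fixed pseudo-count `m`, which is a
hypothetical parameter chosen for illustration; an empirical-Bayes variant
would instead estimate the amount of shrinkage from the data. The thread does
not say which formula dirty_cat uses, so this should not be read as its
implementation. A frequently cited description of this family of encoders is
Micci-Barreca, "A preprocessing scheme for high-cardinality categorical
attributes in classification and prediction problems" (SIGKDD Explorations,
2001).

```python
import numpy as np


def shrunk_target_encode(cat, y, m=10.0):
    """Category means of y, shrunk toward the global mean.

    m is a hypothetical smoothing strength: a category with n samples keeps
    a weight n / (n + m) on its own mean and puts the rest on the global mean.
    """
    _, codes = np.unique(np.asarray(cat), return_inverse=True)
    codes = codes.ravel()
    y = np.asarray(y, dtype=float)
    counts = np.bincount(codes)
    means = np.bincount(codes, weights=y) / counts
    prior = y.mean()
    weight = counts / (counts + m)
    shrunk = weight * means + (1.0 - weight) * prior
    return shrunk[codes]
```

With `m=0` this reduces to the naive per-category mean, and as `m` grows every
category is encoded close to the global mean.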


We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.

aww ;)
