Hi Sean.
We were talking about adding topical how-tos to the website, and that indeed looks like
a good candidate, as the question pops up a lot.
If you want to write one, I'm sure you'll get valuable feedback :)

Cheers,
Andy

On 08/29/2013 10:20 AM, Sean Violante wrote:
I was wondering whether documentation exists, or could be prepared, on the workflow for handling unbalanced data.

I am looking at website click-through data, but I am sure similar issues occur in other cases.

The steps below seem to be one approach. I would value corrections and comments, and it seems useful to develop a tutorial around a specific use case, since handling unbalanced data involves changes to a number of different steps.

a) undersample the most frequent class [assuming you have plenty of data]

b) we are still interested in the "true" probabilities, so use logistic regression and reweight the classes to adjust for the undersampling [rankSVM? Are there any existing models in sklearn?]

c) use AUC [or other metric for unbalanced classes]

d) on validation/test sets, either do not undersample, or undersample and then reweight for the metric calculation.
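Steps a) to d) can be sketched roughly as follows. This is only an illustration under assumptions that are not in the original post: synthetic data from `make_classification`, a hypothetical `keep_fraction` of 0.1 for the frequent class, the modern `sklearn.model_selection` module, and `class_weight` as the reweighting mechanism (each kept negative is weighted by `1 / keep_fraction` so that it stands in for the negatives that were dropped).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic unbalanced data (assumption: roughly 5% positives).
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# a) undersample the frequent (negative) class in the training set only.
keep_fraction = 0.1  # hypothetical value, chosen for illustration
neg_idx = np.flatnonzero(y_train == 0)
pos_idx = np.flatnonzero(y_train == 1)
kept_neg = rng.choice(neg_idx, size=int(keep_fraction * len(neg_idx)),
                      replace=False)
idx = np.concatenate([pos_idx, kept_neg])
X_under, y_under = X_train[idx], y_train[idx]

# b) logistic regression, with negatives upweighted to undo the undersampling.
clf = LogisticRegression(class_weight={0: 1.0 / keep_fraction, 1: 1.0},
                         max_iter=1000)
clf.fit(X_under, y_under)

# c) + d) evaluate AUC on the test set, which was NOT undersampled.
proba_test = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba_test)

# Sanity check on calibration: the mean predicted probability should be
# close to the observed positive rate of the untouched test set.
mean_proba = proba_test.mean()
base_rate = y_test.mean()
print(auc, mean_proba, base_rate)
```

Without the `class_weight` correction, the model would be fit to the undersampled class ratio and its predicted probabilities would be inflated relative to the original data.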

This "workflow" maintains the correct probabilities. Another approach, e.g. in order to use a standard SVM, is:

a) fit the model on EITHER rebalanced data [if there is "too much" of the frequent class] OR reweighted classes

b) use AUC as the cross-validation metric [in v0.14]

c) use the decision_function output [can the predict_proba output of an SVM be used together with a reweighting scheme?]
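The SVM variant might look like the sketch below. It takes the "reweight classes" branch of step a) via `class_weight="balanced"`, uses the `scoring="roc_auc"` option for cross-validation, and ranks test points by the SVM's signed distance to the separating boundary, which scikit-learn exposes as `decision_function`. The synthetic data, the RBF kernel, and the 5-fold setting are assumptions made for illustration, not part of the original post.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic unbalanced data (assumption: roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# a) reweight classes inversely to their frequency instead of resampling.
svm = SVC(kernel="rbf", class_weight="balanced")

# b) AUC as the cross-validation metric.
cv_auc = cross_val_score(svm, X_train, y_train, cv=5, scoring="roc_auc")

# c) rank test points by the decision function; AUC only needs a ranking,
#    so no probabilities are required here.
svm.fit(X_train, y_train)
scores = svm.decision_function(X_test)
test_auc = roc_auc_score(y_test, scores)
print(cv_auc.mean(), test_auc)
```

Because AUC depends only on the ordering of the scores, this route sidesteps probabilities entirely, and the scores should not be read as probabilities.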

Lastly, a tutorial on SVM predict_proba, or on other ways of generating probabilities from distance functions, would be useful in itself!
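One way to turn decision-function values into probabilities is Platt scaling, i.e. fitting a sigmoid to the scores on held-out folds. A minimal sketch with `CalibratedClassifierCV` follows; the synthetic data, the choice of `LinearSVC`, and the 5-fold setting are assumptions for illustration, and this is not necessarily the approach the original post had in mind.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic unbalanced data (assumption: roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# LinearSVC has no predict_proba of its own; the wrapper fits a sigmoid
# that maps its decision_function values to probabilities, using
# cross-validation so the sigmoid is fit on held-out scores.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000),
                                    method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Columns follow calibrated.classes_; each row sums to 1.
proba = calibrated.predict_proba(X_test)
print(proba[:5])
```

If the training data were undersampled or the classes reweighted before fitting, the calibrated probabilities would reflect that altered class ratio and would still need a correction to recover the original base rate.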

Sean


------------------------------------------------------------------------------
Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
Discover the easy way to master current and previous Microsoft technologies
and advance your career. Get an incredible 1,500+ hours of step-by-step
tutorial videos with LearnDevNow. Subscribe today and save!
http://pubads.g.doubleclick.net/gampad/clk?id=58040911&iu=/4140/ostg.clktrk


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
