Hi Sean.
We were talking about adding topical how-tos to the website, and this
looks like a good candidate, as the question comes up a lot.
If you want to write one, I'm sure you'll get valuable feedback :)
Cheers,
Andy
On 08/29/2013 10:20 AM, Sean Violante wrote:
I was wondering whether documentation is available, or could be
prepared, on the workflow for handling unbalanced data.
I am looking at website click-through data, but I am sure similar
issues occur in other cases.
The steps below seem to be one approach. I would value corrections and
comments, and it seems useful to develop a tutorial for a specific
use case, since handling imbalance involves changes to several
different steps.
a) Undersample the most frequent class [assuming you have plenty of data].
b) We are still interested in the "true" probabilities, so use logistic
regression and reweight the classes to adjust for the undersampling.
[rankSVM? Are there any such models in sklearn?]
c) Use AUC [or another metric suited to unbalanced classes].
d) On validation/test sets, either do not undersample, or undersample
but then reweight for the metric calculation.
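Steps a) to d) could be sketched roughly as follows. This is only an
illustration under assumptions: the data is synthetic (make_classification
stands in for click-through data), the keep fraction `keep` is an
arbitrary choice, and the reweighting is done through `sample_weight`,
giving each retained negative a weight of 1/keep.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for click data: roughly 5% positives (assumption).
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# a) Keep every positive, but only a fraction `keep` of the negatives.
rng = np.random.RandomState(0)
keep = 0.2  # hypothetical undersampling rate
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
neg_kept = rng.choice(neg, size=int(keep * len(neg)), replace=False)
idx = np.concatenate([pos, neg_kept])
X_sub, y_sub = X_train[idx], y_train[idx]

# b) Each retained negative stands in for 1/keep original negatives,
#    so the weighted fit targets the original class proportions.
w_sub = np.where(y_sub == 0, 1.0 / keep, 1.0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_sub, y_sub, sample_weight=w_sub)

# c) + d) Evaluate on the test set, which was NOT undersampled.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
mean_pred = proba.mean()    # should sit close to the true base rate
base_rate = y_test.mean()
print("AUC:", auc, "mean predicted:", mean_pred, "base rate:", base_rate)
```

If the test set were undersampled as well, the same 1/keep weights could
be passed to roc_auc_score through its sample_weight argument.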
This "workflow" maintains the correct probabilities. Another approach,
e.g. in order to use a standard SVM, is:
a) Fit the model on EITHER rebalanced data [if there is "too much" of
the frequent class] OR with reweighted classes.
b) Use AUC as the cross-validation metric [in v0.14].
c) Use the decision_function output... [Can the predict_proba output
of an SVM be used together with a reweighting scheme?]
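A minimal sketch of this second route, assuming the "reweight classes"
option and synthetic data. Here class_weight="balanced" is one possible
reweighting, and scoring="roc_auc" ranks samples by the SVM's
decision_function, since SVC has no predict_proba by default.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic unbalanced data: roughly 10% positives (assumption).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# a) Reweight classes instead of resampling: errors on the rare class
#    are penalised more heavily.
svm = SVC(class_weight="balanced")

# b) AUC as the cross-validation metric.
scores = cross_val_score(svm, X, y, cv=5, scoring="roc_auc")
print("AUC per fold:", scores)

# c) Signed distances to the decision boundary; useful for ranking,
#    but not probabilities.
svm.fit(X, y)
margins = svm.decision_function(X[:5])
print("decision values:", margins)
```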
Lastly, a tutorial on SVM predict_proba, or on other ways of
generating probabilities from decision function values, would be
useful in itself!
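One common way to turn decision values into probabilities is a
Platt-style sigmoid fit: train a one-dimensional logistic regression on
the SVM's decision values for held-out data. The sketch below is a
simplified version of that idea (it skips Platt's target smoothing), on
synthetic data, and the helper name `svm_proba` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# Hold out a calibration set that the SVM never sees during fitting.
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

svm = SVC(class_weight="balanced").fit(X_fit, y_fit)

# Fit a sigmoid mapping decision value -> P(y = 1) on the held-out set.
# The default L2 penalty has little effect with a single feature.
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)
sigmoid = LogisticRegression().fit(scores_cal, y_cal)

def svm_proba(X_new):
    """Hypothetical helper: calibrated P(y = 1) for new samples."""
    s = svm.decision_function(X_new).reshape(-1, 1)
    return sigmoid.predict_proba(s)[:, 1]

p = svm_proba(X_cal[:10])
print(p)
```

Because the calibration set here keeps the original class proportions,
the sigmoid is fitted against the original base rate even though the
SVM itself was trained with reweighted classes.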
Sean
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general