I don't know how the SMOTE algorithm works, but you may have some luck with a classifier that can deal with class weights. SVC in sklearn has this (the class_weight parameter). I think the naive Bayes classifiers offer something similar.
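A minimal sketch of the class-weight idea, assuming a recent scikit-learn in which SVC accepts class_weight="balanced" (older releases spelled this option "auto"). The data here are synthetic and made up for illustration, with a class ratio loosely like the one in the question; they are not the poster's data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (assumption): 1500 majority vs 30 minority points.
rng = np.random.RandomState(0)
X_major = rng.normal(loc=0.0, scale=1.0, size=(1500, 2))
X_minor = rng.normal(loc=2.5, scale=1.0, size=(30, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 1500 + [1] * 30)

# Stratified split so both parts keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" scales the penalty C for each class inversely to
# its frequency, so mistakes on the minority class cost more during training.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

No rows are added or dropped here; only the loss is reweighted, so this can be tried before any resampling.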
Also, one general comment about under/over sampling: be sure to apply such techniques to the training set only. Resample after the split, so that the test set keeps the original class distribution. I have recently read papers that oversampled both the training and the test set. Maybe this is your case.

--
Flavio

On Sat, Sep 14, 2013 at 7:31 AM, ChungHung Liu <[email protected]> wrote:
> I encounter an imbalanced dataset problem, with the minority class around
> 0.3k and the majority class around 15k. I read some documents saying down
> sampling or over sampling can be applied to such a problem. After testing,
> it shows that with down sampling the dataset needs to be reduced to around
> 700 before the confusion matrix looks ok. Although the result looks ok, the
> size is too small.
>
> confusion matrix:
> preds      A     B
> actual
> A          8    79
> B         73    15
>
> However, with over sampling (replicating the minority class), no matter how
> much the minority class is over sampled, e.g. 11%, 30%, 50% (where
> percentage = # minority / total rows in the dataset), the confusion matrix
> doesn't look good (many data are misclassified).
>
> confusion matrix:
> preds      A     B
> actual
> A         50  2707
> B       2549    44
>
> Then, I find that
> http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 can be used to
> perform sampling by SMOTE. But the result is similar to over sampling (many
> misclassified samples). The sampling is done by
>
> # simplified steps
> down_sampled_majority_samples = shuffle(majority_samples) * 70/100
> # testing percentage includes 100, 2*100, 5*100 but the result is similar
> synthetic_minority = SMOTE(minority_samples, 12*100, 5)
> train_data = synthetic_minority + minority_samples + down_sampled_majority_samples
>
> Generally, what procedure should be followed, or what should be paid
> attention to, when working on an imbalanced dataset?
>
> Thanks for any advice
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
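On the point about resampling only the training set, here is a rough sketch of the order of operations, using NumPy only. The feature matrix is random noise with the class sizes from the question (an assumption for illustration), and plain replication stands in for SMOTE; a SMOTE call would go in the same place, after the split.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical data with the class sizes from the question: 15000 vs 300.
X = rng.normal(size=(15300, 4))
y = np.array([0] * 15000 + [1] * 300)

# 1. Split first, so the test set keeps the original class ratio.
idx = rng.permutation(len(y))
n_test = len(y) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2. Oversample the minority class inside the training split only
#    (replication with replacement until both classes have equal counts).
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])
rng.shuffle(keep)
X_train_bal, y_train_bal = X_train[keep], y_train[keep]

# 3. Fit on (X_train_bal, y_train_bal); evaluate on the untouched test split.
print("train class counts:", np.bincount(y_train_bal))
print("test class counts: ", np.bincount(y_test))
```

Because the split happens before any replication, no copy of a test row can end up in the training data, and the confusion matrix on the test split reflects the real class imbalance.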
