I don't know how the SMOTE algorithm works, but you may have some luck
using a classifier that supports class weights. SVC in sklearn has this
(the class_weight parameter). I think something similar is possible with
the naive Bayes classifiers.
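
A minimal sketch of the class-weight idea, assuming a recent scikit-learn
where the option is spelled class_weight="balanced" (older releases used
"auto"). The toy data, class sizes, and linear kernel are made up for
illustration and are not taken from the original poster's dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Toy imbalanced data (hypothetical): 1500 majority points vs 30 minority points.
X_major = rng.normal(loc=0.0, scale=1.0, size=(1500, 2))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(30, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 1500 + [1] * 30)

# Stratify so that both splits keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "balanced" weights each class inversely to its frequency in y_train,
# so errors on the minority class are penalised more heavily.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

No rows are dropped or duplicated here; only the misclassification penalty
changes per class, so the full 15k-row majority class could stay in training.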

Also, one general comment about under/over sampling: be sure to apply
such techniques only to the training set. I have recently read papers
that oversampled both the training and the test set. Maybe this is
your case.
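
A rough sketch of "resample the training set only", assuming plain random
oversampling with NumPy. The data, the 80/20 split, and all variable names
are hypothetical; the point is only the order of operations: split first,
then resample inside the training indices.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data (hypothetical): 50 minority rows (label 1), 950 majority rows (label 0).
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# 1) Split first: hold out 20% of the rows as an untouched test set.
idx = rng.permutation(len(y))
n_test = len(y) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# 2) Oversample the minority class using training indices only.
train_minor = train_idx[y[train_idx] == 1]
train_major = train_idx[y[train_idx] == 0]
extra = rng.choice(
    train_minor, size=len(train_major) - len(train_minor), replace=True
)
balanced_train_idx = np.concatenate([train_idx, extra])

X_train, y_train = X[balanced_train_idx], y[balanced_train_idx]
# The test set keeps the original class ratio and shares no rows with training.
X_test, y_test = X[test_idx], y[test_idx]
```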
--
Flavio


On Sat, Sep 14, 2013 at 7:31 AM, ChungHung Liu <[email protected]> wrote:
> I have an imbalanced dataset, with a minority class of around 0.3k samples
> and a majority class of around 15k. I have read that down sampling or over
> sampling can be applied to such a problem. In my tests, down sampling
> requires reducing the dataset to around 700 rows before the confusion
> matrix looks OK. Although the result looks OK, the dataset is then too small.
>
> confusion matrix:
>  preds  A  B
> actual
> A         8   79
> B        73   15
>
> However, with over sampling (replicating the minority class), no matter how
> much the minority class is over sampled, e.g. to 11%, 30%, or 50% (where
> percentage = # minority rows / total rows in the dataset), the confusion
> matrix does not look good (many samples are misclassified).
>
> confusion matrix:
>  preds  A   B
> actual
> A        50  2707
> B      2549    44
>
> Then I found that the code at
> http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 can be used to
> perform sampling with SMOTE. But the result is similar to over sampling (many
> samples misclassified). The sampling is done as follows:
>     # simplified steps: keep a random 70% of the majority class
>     shuffled = shuffle(majority_samples)
>     down_sampled_majority_samples = shuffled[:len(shuffled) * 70 // 100]
>     # tested percentages include 100, 2*100, 5*100, but the result is similar
>     synthetic_minority = SMOTE(minority_samples, 12*100, 5)
>     train_data = synthetic_minority + minority_samples + down_sampled_majority_samples
>
> In general, what procedure should be followed, or what should one pay
> attention to, when working with an imbalanced dataset?
>
> Thanks for any advice.
>
> ------------------------------------------------------------------------------
> LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
> 1,500+ hours of tutorials including VisualStudio 2012, Windows 8, SharePoint
> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack includes
> Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/22/13.
> http://pubads.g.doubleclick.net/gampad/clk?id=64545871&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
