[
https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1168:
------------------------------------
Description:
From [1], here is the motivation behind balancing datasets:
“Most classification algorithms will only perform optimally when the number of
samples of each class is roughly the same. Highly skewed datasets, where the
minority is heavily outnumbered by one or more classes, have proven to be a
challenge while at the same time becoming more and more common.
One way of addressing this issue is by re-sampling the dataset as to offset
this imbalance with the hope of arriving at a more robust and fair decision
boundary than you would otherwise.
Re-sampling techniques can be divided in these categories:
* Under-sampling the majority class(es).
* Over-sampling the minority class.
* Combining over- and under-sampling.
* Create ensemble balanced sets.”
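To make the first two re-sampling categories above concrete, here is a minimal
sketch using the imbalanced-learn package from [1]. It assumes a recent release
of that package (RandomUnderSampler/RandomOverSampler with fit_resample; older
releases exposed fit_sample instead):
{code:python}
# Minimal sketch of random under-/over-sampling with imbalanced-learn [1].
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Build a 90/10 imbalanced toy dataset.
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)
print(Counter(y))          # roughly {0: 900, 1: 100}

# Under-sample the majority class down to the minority class size.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))    # both classes at the minority size

# Over-sample the minority class up to the majority class size.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))     # both classes at the majority size
{code}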
There is an extensive literature on balancing datasets. The plan for MADlib is
to offer basic functionality in an initial phase and to extend it in later
phases based on user feedback.
Please see attached document for proposed scope of this story.
References
[1] imbalanced-learn Python project
http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
https://github.com/scikit-learn-contrib/imbalanced-learn
was:
Given a table with a varying number of records for each class label, this
function will create an output table in which each class label has
approximately the same number of records.
Approach TBD
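Purely as an illustration of the target behavior described above (the actual
approach and any MADlib function signature are TBD), here is a small pandas
sketch that under-samples every class to the size of the smallest one; the
helper name balance_by_undersampling is hypothetical:
{code:python}
# Illustration only -- not the MADlib API. Each value of class_col ends up
# with the same number of rows (the size of the smallest class).
import pandas as pd

def balance_by_undersampling(df, class_col, random_state=0):
    n_min = df[class_col].value_counts().min()
    return (df.groupby(class_col, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=random_state)))

# Example: 3 rows of class 'a' and 1 row of class 'b' -> 1 row of each.
df = pd.DataFrame({'label': ['a', 'a', 'a', 'b'], 'x': [1, 2, 3, 4]})
print(balance_by_undersampling(df, 'label')['label'].value_counts())
{code}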
> Balance datasets
> ----------------
>
> Key: MADLIB-1168
> URL: https://issues.apache.org/jira/browse/MADLIB-1168
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Fix For: v2.0
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)