[ https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297329#comment-16297329 ]
Orhan Kislal edited comment on MADLIB-1168 at 12/19/17 7:55 PM:
----------------------------------------------------------------

PR implements the aforementioned idea for undersampling without replacement. It seems having both row numbers and ordering will slow down the process quite a bit. An alternate approach would be handling each class differently. We can create a view for a given class (e.g. view_cl1), and use a query like:
{code}
select * from view_cl1 order by random() limit min_count;
{code}
and then return a union of these subqueries. I am not sure if this will actually improve performance, since we will have multiple queries instead of a single one, but it might be worth exploring. [~riyer] any thoughts?


was (Author: okislal):
    The PR above implements idea above for undersampling without replacement. It seems having both row numbers and ordering will slow down the process quite a bit. An alternate approach would be handling each class differently. We can create a view for a given class (i.e. view_cl1), and use a query like:
{code}
select * from view_cl1 order by random() limit min_count;
{code}
and then return a union of these subqueries. I am not sure if this will actually improve performance since we will have multiple queries instead of a single one but it might be worth exploring. [~riyer] any thoughts?

> Balance datasets
> ----------------
>
>                 Key: MADLIB-1168
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1168
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>         Attachments: MADlib Balance Datasets Requirements.pdf, 
> MADlib_Balance_Datasets_Requirements_v2.pdf
>
>
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number 
> of samples of each class is roughly the same. Highly skewed datasets, where 
> the minority is heavily outnumbered by one or more classes, have proven to be 
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset 
> this imbalance with the hope of arriving at a more robust and fair decision 
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets.  The plan for MADlib 
> in the initial phase is to offer basic functionality that can be extended in 
> later phases based on feedback from users.
> Please see attached document for proposed scope of this story.
> References
> [1] imbalanced-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn
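
For reference, the per-class alternative floated in the comment (one `order by random() limit min_count` query per class, then a union of the results) can be sketched as below. This is only an illustration under stated assumptions, not the code in the PR and not a MADlib API: it runs against an in-memory SQLite database through Python's `sqlite3` module rather than PostgreSQL/Greenplum, it uses filtered subqueries in place of the `view_cl1`-style views, and the `undersample` function, the `samples` table, and the `cls` column are hypothetical names. It says nothing about the performance question raised in the comment.

```python
import sqlite3


def undersample(conn, table, class_col):
    """Cut every class down to the size of the smallest class, without replacement.

    Hypothetical helper for illustration; table and column names are
    interpolated directly, so they must come from trusted input.
    """
    cur = conn.cursor()
    # Row count per class; the smallest count plays the role of min_count.
    counts = cur.execute(
        f"SELECT {class_col}, COUNT(*) FROM {table} GROUP BY {class_col}"
    ).fetchall()
    min_count = min(n for _, n in counts)

    # One "order by random() limit min_count" subquery per class.
    # Each filtered subquery stands in for a per-class view such as view_cl1.
    parts, params = [], []
    for value, _ in counts:
        parts.append(
            f"SELECT * FROM (SELECT * FROM {table} WHERE {class_col} = ? "
            "ORDER BY random() LIMIT ?)"
        )
        params += [value, min_count]

    # UNION ALL keeps every sampled row; the per-class subqueries are disjoint.
    return cur.execute(" UNION ALL ".join(parts), params).fetchall()


# Toy data: classes a, b, c with 6, 2 and 4 rows respectively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, cls TEXT)")
conn.executemany(
    "INSERT INTO samples (cls) VALUES (?)",
    [("a",)] * 6 + [("b",)] * 2 + [("c",)] * 4,
)
balanced = undersample(conn, "samples", "cls")
```

With the toy data, `balanced` holds two randomly chosen rows from each class. The sketch uses `UNION ALL` where the comment says "union"; because each row belongs to exactly one class there are no duplicates to remove, and a plain `UNION` would only add a deduplication step.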