[ https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297329#comment-16297329 ]
Orhan Kislal edited comment on MADLIB-1168 at 12/19/17 7:55 PM:
----------------------------------------------------------------
The PR implements the aforementioned idea for undersampling without
replacement. Computing row numbers and ordering the rows in the same query
seems to slow the process down quite a bit. An alternative approach would be
to handle each class separately: we can create a view for a given class
(e.g. view_cl1) and use a query like:
{code}
select * from view_cl1 order by random() limit min_count;
{code}
and then return a union of these subqueries. I am not sure whether this will
actually improve performance, since we would run multiple queries instead of
a single one, but it might be worth exploring. [~riyer], any thoughts?
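For concreteness, the per-class approach could be sketched roughly as below.
This is only an illustration, not the code in the PR: it uses Python's
built-in sqlite3 module as a stand-in for PostgreSQL/Greenplum so that it runs
without a server, it filters with a WHERE clause instead of creating one view
per class, and the names undersample, obs and label are hypothetical.

```python
import sqlite3
from collections import Counter


def undersample(conn, table, class_col):
    """Sample every class down to the size of the smallest class.

    Sampling is without replacement: each class gets its own
    "ORDER BY random() LIMIT min_count" subquery, and the subqueries
    are combined with UNION ALL. Table and column names are
    interpolated directly, so they must come from trusted input.
    """
    cur = conn.cursor()
    # Row count per class; the smallest count becomes the sample size.
    counts = cur.execute(
        f"SELECT {class_col}, count(*) FROM {table} GROUP BY {class_col}"
    ).fetchall()
    if not counts:
        return []
    min_count = min(n for _, n in counts)

    parts, params = [], []
    for cls, _ in counts:
        # Inline equivalent of "select * from view_clN order by random()
        # limit min_count"; the outer SELECT lets SQLite accept ORDER BY
        # and LIMIT inside a compound query.
        parts.append(
            f"SELECT * FROM (SELECT * FROM {table} WHERE {class_col} = ? "
            f"ORDER BY random() LIMIT ?)"
        )
        params.extend([cls, min_count])
    return cur.execute(" UNION ALL ".join(parts), params).fetchall()


# Toy data: class sizes 5, 2 and 3, so min_count is 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO obs (label) VALUES (?)",
    [("a",)] * 5 + [("b",)] * 2 + [("c",)] * 3,
)
balanced = undersample(conn, "obs", "label")
per_class = Counter(label for _, label in balanced)
print(per_class)
```

With the toy data each label ends up with two rows. UNION ALL is used rather
than UNION so that rows with identical values are not silently merged, which
would change the class counts. Whether this beats the single row-number query
on a large distributed table would still need to be measured.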
> Balance datasets
> ----------------
>
> Key: MADLIB-1168
> URL: https://issues.apache.org/jira/browse/MADLIB-1168
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Fix For: v2.0
>
> Attachments: MADlib Balance Datasets Requirements.pdf,
> MADlib_Balance_Datasets_Requirements_v2.pdf
>
>
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number
> of samples of each class is roughly the same. Highly skewed datasets, where
> the minority is heavily outnumbered by one or more classes, have proven to be
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset
> this imbalance with the hope of arriving at a more robust and fair decision
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets. The plan for MADlib
> in the initial phase is to offer basic functionality that can be extended in
> later phases based on feedback from users.
> Please see the attached document for the proposed scope of this story.
> References
> [1] imbalanced-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)