GitHub user Swatisoni opened a pull request: https://github.com/apache/madlib/pull/223
Balance datasets : re-sampling technique

JIRA: MADLIB-1168

Additional Authors:
Orhan Kislal okis...@pivotal.io
Jingyi Mei j...@pivotal.io

Balanced datasets Phase 1 and Phase 2 implementation, which performs
balanced sampling using the following re-sampling techniques:
1. Under-sampling the majority class(es), with and without replacement
2. Over-sampling the minority class
3. Combining over- and under-sampling
   - Uniform sampling of all classes (default case)
4. Create ensemble balanced sets
   - Re-sampling given a comma-delimited string of specific classes and
     their respective sample sizes
5. IC tests

Balanced sampling with grouping functionality will be implemented in Phase 3.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Swatisoni/madlib balanced_sets_final

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/223.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #223

----
commit 3b2d1f18b9cf5ef8f78669678d82dc29cd11812b
Author: Swatisoni <soniswati.2010@...>
Date:   2018-01-10T20:07:36Z

    Balance datasets : re-sampling technique

----
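
For readers unfamiliar with the re-sampling techniques listed in the description, the following is a minimal in-memory sketch in plain Python. It is not the code in this pull request: the function name `resample_balanced`, its parameters, and the choice to leave unlisted classes at the strategy's default size are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def resample_balanced(rows, class_of, strategy="uniform", class_sizes=None, seed=None):
    """Hypothetical helper: return a re-sampled copy of `rows`.

    strategy:
      "undersample" - shrink every class to the size of the smallest class
      "oversample"  - grow every class to the size of the largest class
      "uniform"     - give every class len(rows) // number_of_classes rows
    class_sizes: optional dict {class_value: target_size} that overrides the
      strategy's target for the listed classes (assumption: unlisted classes
      keep the strategy's default target).
    """
    if not rows:
        return []
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)
    counts = {c: len(members) for c, members in by_class.items()}

    if strategy == "undersample":
        default_target = min(counts.values())
    elif strategy == "oversample":
        default_target = max(counts.values())
    elif strategy == "uniform":
        default_target = len(rows) // len(counts)
    else:
        raise ValueError("unknown strategy: %r" % (strategy,))

    overrides = class_sizes or {}
    result = []
    for label, members in by_class.items():
        target = overrides.get(label, default_target)
        if target <= len(members):
            # Under-sample without replacement.
            result.extend(rng.sample(members, target))
        else:
            # Over-sample: keep every original row, then draw the
            # shortfall with replacement.
            result.extend(members)
            result.extend(rng.choices(members, k=target - len(members)))
    return result


if __name__ == "__main__":
    data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
    balanced = resample_balanced(data, class_of=lambda r: r[0], seed=0)
    print(len(balanced))
```

Under-sampling here draws without replacement; a with-replacement variant would call `rng.choices` in that branch instead of `rng.sample`.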