[jira] [Commented] (MADLIB-986) Stratified sampling
[ https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067397#comment-16067397 ]

Frank McQuillan commented on MADLIB-986:

Looks good. I put a Jupyter notebook that demonstrates stratified sampling on
https://github.com/apache/incubator-madlib-site/blob/asf-site/community-artifacts/stratified-sampling-v1.ipynb

> Stratified sampling
> ---
>
>                 Key: MADLIB-986
>                 URL: https://issues.apache.org/jira/browse/MADLIB-986
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>            Assignee: Orhan Kislal
>              Labels: starter
>             Fix For: v1.12
>
>
> Story
> As a data scientist, I want to sample a data table in proportion to the
> number of rows in each group, so that I can do model building on the sampled
> data sets.
>
> The MVP for this story is:
> * sample proportion is global, i.e., a single fractional value between 0 and 1
> * allow option to sample without replacement (default) and sample with
>   replacement
> * allow option to output a subset of columns to the output table
>
> Proposed Interface
> {code}
> stratified_sample (
>     source_table,
>     output_table,
>     proportion,
>     grouping_col,      -- optional
>     with_replacement,  -- optional
>     target_cols        -- optional
> )
>
> source_table
> TEXT. The name of the table containing the input data.
>
> output_table
> TEXT. Name of output table that contains the sampled data.
> The output table contains all the columns present in the source table
> unless otherwise specified in the 'target_cols' parameter below.
>
> proportion
> FLOAT8 in the range (0,1). The size of the sample in each stratum will
> be taken in proportion to the size of the stratum.
>
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
> that defines how to stratify. When this parameter is NULL,
> no grouping is used so the sampling is non-stratified.
>
> with_replacement (optional)
> BOOLEAN, default FALSE. Determines whether to sample with replacement
> or without replacement (default).
>
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the
> 'output_table'.
> If NULL, all columns from the 'source_table' will appear in the
> 'output_table'.
> {code}
>
> Other notes
> PDL tools is one example implementation of stratified sampling to review [2].
>
> Please review existing MADlib sample functions [3] to see if these can be
> used as a basis, or built on, for this stratified sample story.
>
> References
> [2] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
> [3] Existing MADlib sample function
> http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
> [4] Pandas/Selecting Random Samples
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
> [5] General
> https://en.wikipedia.org/wiki/Stratified_sampling

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
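
The semantics described in the proposed interface can be illustrated with a small in-memory sketch. This is not MADlib code: it is plain Python over a list of dicts standing in for 'source_table', the function and argument names simply mirror the proposal, the `seed` argument is added for repeatability, and rounding each stratum's sample size up is an assumption, since the issue does not say how fractional sizes are rounded.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, proportion, grouping_cols=None,
                      with_replacement=False, target_cols=None, seed=None):
    """Return a sample holding `proportion` of the rows of each stratum.

    rows is a list of dicts (a stand-in for source_table); the returned
    list of dicts is a stand-in for output_table.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion must lie in the open interval (0, 1)")
    rng = random.Random(seed)

    # Bucket rows by stratum key. Without grouping columns every row falls
    # into one stratum, so the sampling is non-stratified.
    strata = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in grouping_cols) if grouping_cols else ()
        strata[key].append(row)

    sample = []
    for members in strata.values():
        # Assumption: round the per-stratum size up. The small epsilon keeps
        # float noise (e.g. 0.1 * 30) from adding a spurious extra row.
        k = math.ceil(proportion * len(members) - 1e-9)
        if with_replacement:
            picked = rng.choices(members, k=k)   # rows may repeat
        else:
            picked = rng.sample(members, k)      # rows are distinct
        sample.extend(picked)

    # Optionally project onto the requested subset of columns.
    if target_cols:
        sample = [{c: row[c] for c in target_cols} for row in sample]
    return sample
```

For instance, calling `stratified_sample(rows, 0.2, grouping_cols=["grp"])` on a table with 10 rows in one group and 5 in another would draw 2 and 1 rows respectively under the round-up assumption.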
[jira] [Commented] (MADLIB-986) Stratified sampling
[ https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065765#comment-16065765 ]

ASF GitHub Bot commented on MADLIB-986:

GitHub user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/143
[jira] [Commented] (MADLIB-986) Stratified sampling
[ https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058421#comment-16058421 ]

ASF GitHub Bot commented on MADLIB-986:

GitHub user orhankislal opened a pull request:

    https://github.com/apache/incubator-madlib/pull/143

    Sample: Add stratified sampling

    JIRA: MADLIB-986

    Add stratified sampling with the following options.
    - With or without grouping
    - With or without replacement
    - A specific set of target columns or all of them

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/orhankislal/incubator-madlib feature/strs_take2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #143

commit 6ef23fc00cf06ac027f69229d7cf0cf444a7f456
Author: Orhan Kislal
Date:   2017-06-21T23:07:08Z

    Sample: Add stratified sampling

    JIRA: MADLIB-986

    Add stratified sampling with the following options.
    - With or without grouping
    - With or without replacement
    - A specific set of target columns or all of them
[jira] [Commented] (MADLIB-986) Stratified sampling
[ https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049830#comment-16049830 ]

Orhan Kislal commented on MADLIB-986:

I think the PDL tools implementation could be ported for the without-replacement case. It uses the following three SQL statements (assume v1 is the source_table and id is the target_column):

Give a random label to every record:
{code}
CREATE TABLE __samp_aux_tab AS (
    SELECT id, grp, random() AS __samp_out_label
    FROM v1
);
{code}

Find the cut-off point for the desired percentage:
{code}
CREATE TABLE __samp_thresh_tab AS (
    SELECT grp,
           percentile_disc(0.2) WITHIN GROUP (ORDER BY __samp_out_label) AS __samp_out_label
    FROM __samp_aux_tab
    GROUP BY grp
);
{code}

Select the records that fall into the sampled section:
{code}
CREATE TABLE out_tab AS (
    SELECT id, __samp_thresh_tab.grp
    FROM __samp_thresh_tab, __samp_aux_tab
    WHERE __samp_thresh_tab.grp = __samp_aux_tab.grp
      AND __samp_thresh_tab.__samp_out_label >= __samp_aux_tab.__samp_out_label
);
{code}

I don't think we need a new table for the step 2 output since it is just a single value for each group.
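
The three steps in Orhan Kislal's comment (random label, per-group cut-off, filter) can be traced in a few lines of plain Python. This is only a sketch of the logic, not the PDL tools or MADlib implementation: the function name `sample_by_random_label` and the list-of-dicts input are hypothetical, and the cut-off index follows the PostgreSQL definition of `percentile_disc` as the first ordered value whose position reaches the requested fraction.

```python
import math
import random
from collections import defaultdict


def sample_by_random_label(rows, group_col, proportion, seed=None):
    """Without-replacement stratified sample via random labels and a cut-off."""
    rng = random.Random(seed)

    # Step 1: give a random label to every record.
    labelled = [(rng.random(), row) for row in rows]

    # Step 2: per group, find the label at the desired discrete percentile.
    labels_by_group = defaultdict(list)
    for label, row in labelled:
        labels_by_group[row[group_col]].append(label)
    threshold = {}
    for grp, labels in labels_by_group.items():
        labels.sort()
        # 1-based position ceil(p * n), clamped to at least 1; the epsilon
        # guards against float noise in p * n.
        position = max(math.ceil(proportion * len(labels) - 1e-9), 1)
        threshold[grp] = labels[position - 1]

    # Step 3: keep the records whose label is at or below the group cut-off.
    return [row for label, row in labelled
            if label <= threshold[row[group_col]]]
```

Here the per-group thresholds live in an ordinary dictionary, which matches the remark that the step 2 output is just a single value per group and may not need a table of its own.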