[jira] [Commented] (MADLIB-986) Stratified sampling

2017-06-28 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067397#comment-16067397
 ] 

Frank McQuillan commented on MADLIB-986:


Looks good. I posted a Jupyter notebook that demonstrates stratified sampling at
https://github.com/apache/incubator-madlib-site/blob/asf-site/community-artifacts/stratified-sampling-v1.ipynb


> Stratified sampling
> ---
>
> Key: MADLIB-986
> URL: https://issues.apache.org/jira/browse/MADLIB-986
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: Orhan Kislal
> Labels: starter
> Fix For: v1.12
>
>
> Story
> As a data scientist, I want to sample a data table in proportion to the 
> number of rows in each group, so that I can do model building on the sampled 
> data sets.
> The MVP for this story is:
> * sample proportion is global, i.e., single fractional value between 0 and 1
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> Proposed Interface
> {code}
> stratified_sample (
>     source_table,
>     output_table,
>     proportion,
>     grouping_col,      -- optional
>     with_replacement,  -- optional
>     target_cols        -- optional
> )
>
> source_table
>     TEXT. The name of the table containing the input data.
> output_table
>     TEXT. Name of the output table that contains the sampled data.
>     The output table contains all the columns present in the source table
>     unless otherwise specified in the 'target_cols' parameter below.
> proportion
>     FLOAT8 in the range (0,1). The size of the sample in each stratum is
>     taken in proportion to the size of the stratum.
> grouping_col (optional)
>     TEXT, default NULL. A single column or a comma-separated list of columns
>     that defines how to stratify. When this parameter is NULL,
>     no grouping is used, so the sampling is non-stratified.
> with_replacement (optional)
>     BOOLEAN, default FALSE. Determines whether to sample with replacement
>     or without replacement (default).
> target_cols (optional)
>     TEXT, default NULL. A comma-separated list of columns to appear in the
>     'output_table'.
>     If NULL, all columns from the 'source_table' will appear in the
>     'output_table'.
> {code}
> Other notes
> PDL tools is one example implementation of stratified sampling to review [2]. 
>  
> Please review existing MADlib sample functions [3] to see if these can be 
> used as a basis, or built on, for this stratified sample story. 
> References
> [2] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
> [3] Existing MADlib sample function
> http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
> [4] Pandas/Selecting Random Samples
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
> [5] General
> https://en.wikipedia.org/wiki/Stratified_sampling
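
To make the semantics of the proposed parameters concrete, here is a small in-memory sketch in Python. It is not the MADlib implementation: the function works on a list of dicts instead of tables, takes `grouping_col` and `target_cols` as lists of names rather than comma-separated text, adds a hypothetical `seed` argument for repeatability, and assumes the per-stratum sample size is rounded up, which the issue does not specify.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, proportion, grouping_col=None,
                      with_replacement=False, target_cols=None, seed=None):
    """Illustrative in-memory analogue of the proposed interface."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    rng = random.Random(seed)

    # grouping_col=None puts every row in one stratum (non-stratified sampling).
    strata = defaultdict(list)
    for row in rows:
        key = None if grouping_col is None else tuple(row[c] for c in grouping_col)
        strata[key].append(row)

    sample = []
    for members in strata.values():
        # Assumption: the per-stratum sample size is rounded up.
        size = math.ceil(proportion * len(members))
        if with_replacement:
            drawn = rng.choices(members, k=size)  # a row may be drawn repeatedly
        else:
            drawn = rng.sample(members, k=size)   # each row is drawn at most once
        sample.extend(drawn)

    # target_cols=None keeps every column, as in the proposal.
    if target_cols is None:
        return [dict(row) for row in sample]
    return [{c: row[c] for c in target_cols} for row in sample]
```

With 8 rows in one group, 4 in another, and `proportion=0.25`, this sketch draws 2 rows from the first group and 1 from the second, so each stratum keeps its share of the sample.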



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-986) Stratified sampling

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065765#comment-16065765
 ] 

ASF GitHub Bot commented on MADLIB-986:
---

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/143




[jira] [Commented] (MADLIB-986) Stratified sampling

2017-06-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058421#comment-16058421
 ] 

ASF GitHub Bot commented on MADLIB-986:
---

GitHub user orhankislal opened a pull request:

https://github.com/apache/incubator-madlib/pull/143

Sample: Add stratified sampling

JIRA: MADLIB-986

Add stratified sampling with the following options.
- With or without grouping
- With or without replacement
- A specific set of target columns or all of them

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/incubator-madlib feature/strs_take2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #143


commit 6ef23fc00cf06ac027f69229d7cf0cf444a7f456
Author: Orhan Kislal 
Date:   2017-06-21T23:07:08Z

Sample: Add stratified sampling

JIRA: MADLIB-986

Add stratified sampling with the following options.
- With or without grouping
- With or without replacement
- A specific set of target columns or all of them






[jira] [Commented] (MADLIB-986) Stratified sampling

2017-06-14 Thread Orhan Kislal (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049830#comment-16049830
 ] 

Orhan Kislal commented on MADLIB-986:
-

I think the PDL Tools implementation could be ported for the
without-replacement case. It uses the following three SQL statements (assume
v1 is the source_table, grp is the grouping column, and id is the target column):

Give a random label to every record:
{code}
-- Step 1: attach a uniform random label in [0, 1) to every row of v1.
CREATE TABLE __samp_aux_tab AS (
    SELECT id, grp, random() AS __samp_out_label
    FROM v1
);
{code}

Find the cut-off point for the desired percentage:
{code}
-- Step 2: per group, find the label value at the desired fraction (0.2 here).
CREATE TABLE __samp_thresh_tab AS (
    SELECT grp,
           percentile_disc(0.2) WITHIN GROUP (ORDER BY __samp_out_label)
               AS __samp_out_label
    FROM __samp_aux_tab
    GROUP BY grp
);
{code}

Select the records that fall into the sampled section:

{code}
-- Step 3: keep the rows whose label is at or below their group's cut-off.
CREATE TABLE out_tab AS (
    SELECT id, __samp_thresh_tab.grp
    FROM __samp_thresh_tab, __samp_aux_tab
    WHERE __samp_thresh_tab.grp = __samp_aux_tab.grp
      AND __samp_thresh_tab.__samp_out_label >= __samp_aux_tab.__samp_out_label
);
{code}

I don't think we need a new table for the step 2 output since it is just a 
single value for each group.
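
For readers who want to try the idea outside the database, the three steps above can be sketched in plain Python. This is only an illustration under assumptions: the function name, the list-of-dicts row layout, and the `seed` argument are hypothetical and come from neither MADlib nor PDL Tools. The cut-off imitates a discrete percentile by taking the label at rank ceil(proportion * n) within each group, and it is held in a local variable instead of a separate table.

```python
import math
import random
from collections import defaultdict


def sample_by_random_threshold(rows, group_key, proportion, seed=None):
    """Sample about `proportion` of each group's rows, without replacement."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    rng = random.Random(seed)

    # Step 1: give every row a random label, bucketed by group.
    labelled = defaultdict(list)
    for row in rows:
        labelled[row[group_key]].append((rng.random(), row))

    sample = []
    for pairs in labelled.values():
        # Step 2: find the group's cut-off, the label at rank ceil(p * n).
        pairs.sort(key=lambda pair: pair[0])
        rank = max(1, math.ceil(proportion * len(pairs)))
        threshold = pairs[rank - 1][0]

        # Step 3: keep the rows whose label does not exceed the cut-off.
        sample.extend(row for label, row in pairs if label <= threshold)
    return sample
```

Because every row receives exactly one label and is compared once against its group's cut-off, no row can be selected twice, which is why this scheme only covers the without-replacement case.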
