[ https://issues.apache.org/jira/browse/PIG-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863497#comment-13863497 ]
Aniket Mokashi edited comment on PIG-3648 at 1/6/14 10:13 PM:
--------------------------------------------------------------

+1. There is a typo in the comments (RandomeSampleLoader); otherwise, the patch looks good.

> volatility in the variance of the results increases when sorting a large
> number of rows

Can you give an example of when this happens? The sampling algorithm looks good to me. Also, we want to keep the number of samples small, so that we can replace this mechanism in the future if needed.

> Make the sample size for RandomSampleLoader configurable
> --------------------------------------------------------
>
>                 Key: PIG-3648
>                 URL: https://issues.apache.org/jira/browse/PIG-3648
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>            Priority: Minor
>             Fix For: 0.13.0
>
>         Attachments: PIG-3648-1.patch
>
>
> Pig uses RandomSampleLoader for range partitioning in order-by. But since the
> sample size is hardcoded as 100, volatility in the variance of the results
> increases when sorting a large number of rows (e.g. 10M+ per task).
> It would be nice if the sample size could be configurable via Pig properties.
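
As a rough illustration of the concern in the description, the sketch below simulates how the sample size might affect how evenly range-partition boundaries split the data. It is not Pig code. It assumes a single task, uniformly distributed sort keys, a one-pass reservoir sample, and boundaries taken as evenly spaced quantiles of the sample. All function names and parameters are hypothetical, and the row count is kept far below 10M so that it runs quickly.

```python
import bisect
import random


def reservoir_sample(rows, k, rng):
    """Algorithm R: uniform sample of k items from an iterable in one pass."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def partition_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 evenly spaced cut points from the sorted sample."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * p // num_partitions]
            for p in range(1, num_partitions)]


def max_load_ratio(rows, boundaries):
    """Largest partition size divided by the ideal (perfectly even) size."""
    counts = [0] * (len(boundaries) + 1)
    for row in rows:
        counts[bisect.bisect_right(boundaries, row)] += 1
    return max(counts) / (len(rows) / len(counts))


def average_skew(sample_size, num_rows=20_000, num_partitions=10,
                 trials=10, seed=0):
    """Average max-load ratio over several simulated sorts (assumed model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        rows = [rng.random() for _ in range(num_rows)]
        sample = reservoir_sample(rows, sample_size, rng)
        boundaries = partition_boundaries(sample, num_partitions)
        total += max_load_ratio(rows, boundaries)
    return total / trials


if __name__ == "__main__":
    for size in (100, 1000, 10000):
        print(size, round(average_skew(size), 3))
```

A ratio of 1.0 would mean perfectly balanced partitions. Under these assumptions, a larger sample should tend to give a ratio closer to 1.0, at the cost of holding more samples, which is the trade-off raised in the comment above.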