[ https://issues.apache.org/jira/browse/PIG-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863730#comment-13863730 ]
Aniket Mokashi commented on PIG-3648:
-------------------------------------

We are using reservoir sampling here, on the assumption that the samples fit in memory. My only question/concern is how much benefit increasing the sample size provides. In your example, 100 samples on 13M rows gave 10x skew. Do 200 samples bring that down to 5x skew? If they do, this change definitely makes sense.

> Make the sample size for RandomSampleLoader configurable
> --------------------------------------------------------
>
>                 Key: PIG-3648
>                 URL: https://issues.apache.org/jira/browse/PIG-3648
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>            Priority: Minor
>             Fix For: 0.13.0
>
>         Attachments: PIG-3648-1.patch
>
>
> Pig uses RandomSampleLoader for range partitioning in order-by. But since the
> sample size is hardcoded as 100, volatility in the variance of the results
> increases when sorting a large number of rows (e.g. 10M+ per task).
> It would be nice if the sample size could be configurable via Pig properties.
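
For readers unfamiliar with the technique mentioned in the comment, here is a minimal sketch of reservoir sampling with a configurable sample size. It is not the RandomSampleLoader code from Pig or from the attached patch. The function name `reservoir_sample` and its parameters are hypothetical and chosen only for illustration. The sketch assumes, as the comment does, that the whole sample fits in memory.

```python
import random


def reservoir_sample(rows, sample_size, rng=None):
    """Return up to sample_size rows drawn uniformly at random from rows.

    Hypothetical helper for illustration only; this is not Pig's implementation.
    Only the reservoir (at most sample_size rows) is held in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for seen, row in enumerate(rows):
        if seen < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, seen]; the new row is kept only if the slot
            # falls inside the reservoir, i.e. with probability
            # sample_size / (seen + 1).
            slot = rng.randint(0, seen)
            if slot < sample_size:
                reservoir[slot] = row
    return reservoir


if __name__ == "__main__":
    # Compare a sample of 100 rows with a sample of 200 rows over the same input.
    for size in (100, 200):
        sample = reservoir_sample(range(1_000_000), size)
        print(size, len(sample), min(sample), max(sample))
```

Memory use depends only on `sample_size` and not on the number of input rows, which is why raising the sample size stays cheap as long as the samples still fit in memory.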