[ https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125768#comment-14125768 ]
Cheolsoo Park commented on PIG-4148:
------------------------------------
Another interesting observation is that all the samples received by the
sample aggregate vertex have row number {{376}}. This is odd because the row
number in POReservoirSample ranges from 1 to 52M.
> Tez order-by is often skewed because FindQuantiles UDF is called with small
> number
> ----------------------------------------------------------------------------------
>
> Key: PIG-4148
> URL: https://issues.apache.org/jira/browse/PIG-4148
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: generate_sample.py, metric_retention.explain,
> popackage.log, samples_logs.tar.gz
>
>
> In Tez, the FindQuantiles UDF is called with fewer samples than in MR,
> resulting in skew in range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since
> each task samples 100 records, the total sample should be 30K. But the
> FindQuantiles UDF is called with only 300 samples.
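
The arithmetic in the description, and the reason a small sample skews range partitions, can be illustrated with a rough sketch. This is not Pig's FindQuantiles implementation. The helper names, the uniform key distribution, the 200K key count, and the partition count of 100 are all assumptions made for illustration. Only the figures 300 tasks, 100 samples per task, and 300 observed samples come from the issue.

```python
import random
from bisect import bisect_right
from collections import Counter


def quantile_boundaries(samples, num_partitions):
    """Pick num_partitions - 1 evenly spaced cut points from the sorted samples."""
    ordered = sorted(samples)
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


def partition_skew(keys, boundaries):
    """Size of the largest range partition divided by the ideal (even) size."""
    counts = Counter(bisect_right(boundaries, k) for k in keys)
    ideal = len(keys) / (len(boundaries) + 1)
    return max(counts.values()) / ideal


# Figures from the issue description.
SAMPLING_TASKS = 300
SAMPLES_PER_TASK = 100
expected_samples = SAMPLING_TASKS * SAMPLES_PER_TASK  # 30K in total
observed_samples = 300  # what FindQuantiles reportedly receives in Tez

# Hypothetical workload: uniform random keys and 100 range partitions.
NUM_PARTITIONS = 100
rng = random.Random(0)
keys = [rng.random() for _ in range(200_000)]

small = quantile_boundaries(rng.sample(keys, observed_samples), NUM_PARTITIONS)
large = quantile_boundaries(rng.sample(keys, expected_samples), NUM_PARTITIONS)

skew_small = partition_skew(keys, small)
skew_large = partition_skew(keys, large)
print(f"expected samples: {expected_samples}, observed: {observed_samples}")
print(f"largest partition vs. ideal with {observed_samples} samples: {skew_small:.2f}x")
print(f"largest partition vs. ideal with {expected_samples} samples: {skew_large:.2f}x")
```

With only 300 samples spread over 100 partitions, each boundary rests on about three sampled keys, so the cut points are noisy and the largest partition ends up well above its even share. With 30K samples the boundaries sit much closer to the true quantiles.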