[ 
https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395588#comment-14395588
 ] 

Hao Zhu commented on PIG-4485:
------------------------------

Thanks Daniel for the clarification.
For 2:
If we set pig.random.sampler.sample.size=0, what is the algorithm to decide the 
key range for each reducer? 

For 3:
We tested not just parquet files, but also other formats of files(CSV), the 
finding is , the sampler job reads the same amount of bytes as the real 
"sorting" job. The tests are done in both CDH5.3 and MapR 4.0.x.
If the sampler job reads all bytes of all files, then why not merge the 
algorithm to the next sorting job?





> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
>                 Key: PIG-4485
>                 URL: https://issues.apache.org/jira/browse/PIG-4485
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Hao Zhu
>            Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c
> {code}
> Pig spawns a Sampler job always in the begining:
> {code}
> Job Stats (time in seconds):
> JobId Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      
> MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   
> MedianReducetime        Alias   Feature Outputs
> job_1426804645147_1270        1       1       8       8       8       8       
> 4       4       4       4       b       SAMPLER
> job_1426804645147_1271        1       1       10      10      10      10      
> 4       4       4       4       b       ORDER_BY,COMBINER
> job_1426804645147_1272        1       1       2       2       2       2       
> 4       4       4       4       b               hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is when reading lots of files, the first sampler job can take a 
> long time to finish.
> The ask is:
> 1. Is the sampler job a must to implement "order by"?
> 2. If no, is there any way to disable RandomSampleLoader manually?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to