[ 
https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390961#comment-14390961
 ] 

Hao Zhu commented on PIG-4485:
------------------------------

Note:
Even if we manually set the number of reducers for the "ORDER_BY" MR job by:
{code}
grunt> SET default_parallel 10;
{code}
the Sampler job is still launched; the setting does not disable or avoid it.
Since this Sampler job adds nothing to the following job, I believe we should
either disable it or have an option to disable it manually.

{code}
Job Stats (time in seconds):
JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature            Outputs
job_1426804645147_1310  1     1        3           3           3           3              2              2              2              2                 b      SAMPLER
job_1426804645147_1311  1     10       3           3           3           3              4              2              3              3                 b      ORDER_BY,COMBINER
job_1426804645147_1312  1     1        4           4           4           4              2              2              2              2                 b                         hdfs:/tmp/temp-1039851978/tmp-937290529,
{code}
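
For comparison, the reducer count can also be requested on the ORDER operator itself with the PARALLEL clause, rather than through the session-wide default. A minimal sketch, assuming the same placeholder path, loader, and column name as the script in the issue description; it only illustrates the syntax and makes no claim about whether the Sampler job is skipped:

{code}
-- Sketch only: the path, loader, and column name are the placeholders
-- from the issue description, not a real dataset.
a = LOAD '/xxx/xxx/parquet/xxx.parquet' USING ParquetLoader();
-- PARALLEL requests the reducer count for this operator only.
b = ORDER a BY col1 PARALLEL 10;
c = LIMIT b 100;
DUMP c;
{code}

The Job Stats table printed after the run shows whether a SAMPLER job was still scheduled.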

> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
>                 Key: PIG-4485
>                 URL: https://issues.apache.org/jira/browse/PIG-4485
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Hao Zhu
>            Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c;
> {code}
> Pig always spawns a Sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature            Outputs
> job_1426804645147_1270  1     1        8           8           8           8              4              4              4              4                 b      SAMPLER
> job_1426804645147_1271  1     1        10          10          10          10             4              4              4              4                 b      ORDER_BY,COMBINER
> job_1426804645147_1272  1     1        2           2           2           2              4              4              4              4                 b                         hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is that, when reading many files, the first Sampler job can take a
> long time to finish.
> The questions are:
> 1. Is the Sampler job required to implement "order by"?
> 2. If not, is there any way to disable RandomSampleLoader manually?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
