[
https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395143#comment-14395143
]
Hao Zhu commented on PIG-4485:
------------------------------
Hi Daniel,
Thanks for responding.
We have run some tests on 3000+ Parquet files (each file is about 500 MB), and
the sampler job typically takes more than one hour.
By default, the sampler job samples 100 records per Parquet file, so by the end
of the Pig job the sampler has sampled 3000*100 records.
However, here are three questions:
1. If the Hadoop admin already has a good idea of how many reducers should be
used, why not let the admin decide the number of reducers for the "real" MR
job directly?
2. If we "set pig.random.sampler.sample.size 0", the sampler samples 0 rows.
Why not simply disable the sampler entirely in that case?
3. Per our in-house tests, the sampler job reads all the bytes of all the
files, so the "HDFS reads" stat for the sampler job is the same as for the
"real" MR job.
This could be a separate issue: why does the sampler job need to read every
byte of every file? My assumption is that it should read 100 records (by
default) from each file and then stop reading that file, right?
Thanks.
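For reference, a minimal sketch of the tuning discussed in question 2, assuming the pig.random.sampler.sample.size property behaves as described above (worth verifying against your Pig version, since the exact effect of very small values is not documented here):

{code}
-- hypothetical sketch: shrink the per-file sample size before an ORDER BY
-- (default is 100 records sampled per input file)
set pig.random.sampler.sample.size 10;

a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
b = order a by col1;
c = limit b 100;
dump c;
{code}

This only reduces how many records the sampler keeps per file; per question 3, it may not reduce the bytes the sampler job actually reads from HDFS.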
> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
> Key: PIG-4485
> URL: https://issues.apache.org/jira/browse/PIG-4485
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Hao Zhu
> Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c;
> {code}
> Pig always spawns a sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature            Outputs
> job_1426804645147_1270  1     1        8           8           8           8              4              4             4              4                 b      SAMPLER
> job_1426804645147_1271  1     1        10          10          10          10             4              4             4              4                 b      ORDER_BY,COMBINER
> job_1426804645147_1272  1     1        2           2           2           2              4              4             4              4                 b                         hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is that when reading lots of files, the first sampler job can take
> a long time to finish.
> The ask is:
> 1. Is the sampler job required in order to implement "order by"?
> 2. If not, is there any way to disable RandomSampleLoader manually?
> Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)