[jira] [Commented] (PIG-4485) Can Pig disable RandomSampleLoader when doing "Order by"

Hao Zhu (JIRA) Sat, 04 Apr 2015 13:59:19 -0700

    [ https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395940#comment-14395940 ]


Hao Zhu commented on PIG-4485:
------------------------------

So "The sampling job does a full scan to make the sample random" means that, 
in theory, the sampler reads every byte of every file?
Consider a 4-node cluster holding 3000 Parquet files (500M each), about 1.5T 
of data in total, and the script below:

a = load '/xxx/xxx/parquet/' using ParquetLoader();
b = order a by col1 ;
c = limit b 100 ;
dump c;

According to the MR job statistics, the sampler job above reads the whole 
1.5T of data, and it normally takes more than 1 hour to finish.
The actual sorting job takes about 6 hours.
Of course, the job time also depends on how many mappers run concurrently.
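
As a rough back-of-the-envelope check of the figures above, the Python sketch 
below works out the data volumes involved. It assumes decimal units 
(1T = 1,000,000M) and rounds "more than 1 hour" down to exactly one hour, so 
the throughput it prints is an upper bound and not a measurement. The 100 
records per file is the default sample size cited in this comment.

```python
# Figures taken from the comment; units and the 1-hour rounding are assumptions.
FILES = 3000
FILE_SIZE_MB = 500
NODES = 4
SAMPLES_PER_FILE = 100   # default sample size cited in the comment
SAMPLER_HOURS = 1.0      # assumption: "more than 1 hour" rounded down

total_mb = FILES * FILE_SIZE_MB
total_tb = total_mb / 1_000_000          # data set size in T (decimal)
read_tb = 2 * total_tb                   # sampler pass + sort pass
sampled_records = FILES * SAMPLES_PER_FILE

# Aggregate and per-node read rate of the sampler job (upper bound).
sampler_mb_per_s = total_mb / (SAMPLER_HOURS * 3600)
per_node_mb_per_s = sampler_mb_per_s / NODES

print(f"data set size:        {total_tb:.1f}T")
print(f"read by both jobs:    {read_tb:.1f}T")
print(f"records the sampler keeps: {sampled_records}")
print(f"sampler read rate:    <= {sampler_mb_per_s:.0f}M/s total, "
      f"<= {per_node_mb_per_s:.0f}M/s per node")
```

Under these assumptions the sampler scans the full 1.5T in order to keep only 
300,000 records.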

But the main concern is this: if the sampler only needs 100 records per file 
(the default), it should not have to read the whole data set. 
It makes no sense to read the 1.5T of data twice. For Parquet files 
especially, the sampler should skip most of the rows and fetch only what it 
needs.
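
To illustrate the difference in work, here is a minimal Python sketch. It is 
not Pig's RandomSampleLoader code; both function names are hypothetical. It 
contrasts a reservoir sample, which must visit every record because the record 
count is not known up front, with a positional sample that visits only the 
records it keeps. The second approach assumes the record count is known in 
advance and records can be addressed by position, which a format like Parquet 
could in principle support because its footer stores row counts per row group.

```python
import random

def reservoir_sample(records, k, rng):
    """Uniform sample of k records from an iterable of unknown length.

    Every record is visited once, which is the full-scan cost.
    """
    sample = []
    visited = 0
    for i, rec in enumerate(records):
        visited += 1
        if i < k:
            sample.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = rec
    return sample, visited

def positional_sample(records, k, rng):
    """Uniform sample of k records when the count is known up front.

    Only the chosen positions are visited.
    """
    n = len(records)
    positions = rng.sample(range(n), min(k, n))
    return [records[p] for p in positions], len(positions)

rng = random.Random(0)
file_records = list(range(100_000))  # stand-in for the rows of one file

full, full_visited = reservoir_sample(file_records, 100, rng)
skip, skip_visited = positional_sample(file_records, 100, rng)
print("reservoir sample visited: ", full_visited)
print("positional sample visited:", skip_visited)
```

Both functions return 100 records, but the reservoir version touches all rows 
of the file while the positional version touches only the 100 it returns.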






> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
>                 Key: PIG-4485
>                 URL: https://issues.apache.org/jira/browse/PIG-4485
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Hao Zhu
>            Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c;
> {code}
> Pig always spawns a Sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
> job_1426804645147_1270        1       1       8       8       8       8       4       4       4       4       b       SAMPLER
> job_1426804645147_1271        1       1       10      10      10      10      4       4       4       4       b       ORDER_BY,COMBINER
> job_1426804645147_1272        1       1       2       2       2       2       4       4       4       4       b               hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is that when reading lots of files, the first sampler job can 
> take a long time to finish.
> The questions are:
> 1. Is the sampler job required to implement "order by"?
> 2. If not, is there any way to disable RandomSampleLoader manually?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
