[ https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395940#comment-14395940 ]
Hao Zhu commented on PIG-4485:
------------------------------
So "The sampling job does a full scan to make the sample random." means that
the sampler will, in theory, read every byte of every file?
On a 4-node cluster with 3000 Parquet files (500 MB each, 1.5 TB in total), if
we run the script below:
a = load '/xxx/xxx/parquet/' using ParquetLoader();
b = order a by col1;
c = limit b 100;
dump c;
According to the MR job statistics, the sampler above reads the whole 1.5 TB
of data, and it normally takes more than 1 hour to finish.
The actual sorting job takes about 6 hours.
Of course, the job time also depends on how many Mappers are running
concurrently.
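To put the numbers above in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 1,000,000 MB) and treats "more than 1 hour" as exactly 1 hour, so the scan rate it prints is an upper bound. The function names are illustrative only.

```python
# Figures quoted in the comment; everything else is derived from them.
NUM_FILES = 3000
FILE_SIZE_MB = 500      # "500M each"
NODES = 4
SAMPLER_HOURS = 1       # "more than 1 hour", so rates below are upper bounds
SORT_HOURS = 6          # "about 6 hours"


def total_tb(num_files, file_size_mb):
    """Total input size in TB (decimal units assumed)."""
    return num_files * file_size_mb / 1_000_000


def aggregate_mb_per_s(num_files, file_size_mb, hours):
    """Aggregate read rate if the whole input is scanned once in `hours`."""
    return num_files * file_size_mb / (hours * 3600)


def sampler_share(sampler_hours, sort_hours):
    """Fraction of the combined runtime spent in the sampler job."""
    return sampler_hours / (sampler_hours + sort_hours)


if __name__ == "__main__":
    rate = aggregate_mb_per_s(NUM_FILES, FILE_SIZE_MB, SAMPLER_HOURS)
    print(f"input size:        {total_tb(NUM_FILES, FILE_SIZE_MB)} TB")
    print(f"cluster scan rate: {rate:.1f} MB/s")
    print(f"per-node rate:     {rate / NODES:.1f} MB/s")
    print(f"sampler share:     {sampler_share(SAMPLER_HOURS, SORT_HOURS):.1%}")
```

Under these assumptions the sampler accounts for roughly one seventh of the total runtime, even though it only needs a small number of records from each file.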
But the main concern here is that if the sampler only needs 100 records per
file (by default), it should be improved so that it does not read the whole
data set. It makes no sense for the 1.5 TB of data to be read twice. For
Parquet files especially, the sampler should skip most of the rows and read
only what it needs.
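The trade-off behind this concern can be illustrated with a small sketch. It uses reservoir sampling (Algorithm R), a standard way to draw a uniform sample of k records from a stream of unknown length. Whether RandomSampleLoader works exactly like this is an assumption here, and `reservoir_sample` and `head_sample` are hypothetical helpers, not Pig APIs. The point is only that a uniform sample over a stream has to look at every record, while a cheaper "first k records" sample reads very little but is biased.

```python
import random
from itertools import islice


def reservoir_sample(records, k, rng=None):
    """Uniform sample of k records from an iterable (Algorithm R).

    Returns (sample, records_read) to make the cost visible:
    records_read always equals the full length of the input.
    """
    rng = rng or random.Random()
    it = iter(records)
    reservoir = list(islice(it, k))   # fill the reservoir with the first k
    seen = len(reservoir)
    for rec in it:
        seen += 1
        j = rng.randrange(seen)       # uniform index in [0, seen)
        if j < k:                     # keep the new record with probability k/seen
            reservoir[j] = rec
    return reservoir, seen


def head_sample(records, k):
    """Cheap alternative: read only the first k records (not random)."""
    return list(islice(records, k))


if __name__ == "__main__":
    n, k = 1_000_000, 100
    sample, read = reservoir_sample(range(n), k, random.Random(0))
    print(f"reservoir: kept {len(sample)} records, read {read}")
    print(f"head:      kept {len(head_sample(iter(range(n)), k))} records, read {k}")
```

A sampler that skips most rows, as suggested above, would sit somewhere between these two: far less I/O than the full scan, at the price of a sample that is no longer strictly uniform.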
> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
> Key: PIG-4485
> URL: https://issues.apache.org/jira/browse/PIG-4485
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Hao Zhu
> Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1;
> c = limit b 100;
> dump c;
> {code}
> Pig always spawns a Sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
> job_1426804645147_1270	1	1	8	8	8	8	4	4	4	4	b	SAMPLER
> job_1426804645147_1271	1	1	10	10	10	10	4	4	4	4	b	ORDER_BY,COMBINER
> job_1426804645147_1272	1	1	2	2	2	2	4	4	4	4	b		hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is that when reading lots of files, the first sampler job can take
> a long time to finish.
> The questions are:
> 1. Is the sampler job required to implement "order by"?
> 2. If not, is there any way to disable RandomSampleLoader manually?
> Thanks.
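On the first question in the description, the following sketch shows why a distributed total-order sort commonly relies on a sample: the sampled keys are used to pick range boundaries so that each reducer receives a contiguous, roughly equal slice of the key space, and the reducers' sorted outputs can simply be concatenated. This is a generic illustration of range partitioning, not Pig's actual partitioner; `partition_boundaries` and `reducer_for` are hypothetical names.

```python
from bisect import bisect_right


def partition_boundaries(sample_keys, num_reducers):
    """Pick num_reducers - 1 split points from the sampled keys (quantiles)."""
    s = sorted(sample_keys)
    n = len(s)
    if n == 0 or num_reducers < 2:
        return []                     # nothing to split: everything goes to reducer 0
    return [s[(i * n) // num_reducers] for i in range(1, num_reducers)]


def reducer_for(key, boundaries):
    """Index of the reducer whose key range contains `key`."""
    return bisect_right(boundaries, key)


if __name__ == "__main__":
    sample = list(range(100))         # stand-in for sampled values of col1
    bounds = partition_boundaries(sample, 4)
    print("boundaries:", bounds)
    for key in (0, 24, 25, 60, 99):
        print(f"key {key:>2} -> reducer {reducer_for(key, bounds)}")
```

With a single reducer no boundaries are needed at all, which is one reason a sampling pass can look redundant for small jobs. With many reducers, a poor or skipped sample would risk badly unbalanced key ranges.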