[
https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385583#comment-14385583
]
Hao Zhu commented on PIG-4485:
------------------------------
BTW: I have confirmed this behavior on Pig 0.12 on CDH 5.3 and also Pig 0.13 on
MapR 4.0.1.
> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
> Key: PIG-4485
> URL: https://issues.apache.org/jira/browse/PIG-4485
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Hao Zhu
> Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c;
> {code}
> Pig always spawns a Sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature            Outputs
> job_1426804645147_1270  1     1        8           8           8           8              4              4              4              4                 b      SAMPLER
> job_1426804645147_1271  1     1        10          10          10          10             4              4              4              4                 b      ORDER_BY,COMBINER
> job_1426804645147_1272  1     1        2           2           2           2              4              4              4              4                 b                         hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The problem is that when reading a large number of files, the initial
> sampler job can take a long time to finish.
> Questions:
> 1. Is the sampler job required in order to implement "order by"?
> 2. If not, is there a way to disable RandomSampleLoader manually?
> Thanks.