[ 
https://issues.apache.org/jira/browse/PIG-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090601#comment-13090601
 ] 

Daniel Dai commented on PIG-2237:
---------------------------------

This happens because SampleOptimizer changes the parallelism of "order by" 
according to the input size, but by that time LimitAdjuster has already decided 
whether or not to add an additional limit job. We need to run LimitAdjuster 
after SampleOptimizer.

> LIMIT generates wrong number of records if pig determines no of reducers as 
> more than 1
> ---------------------------------------------------------------------------------------
>
>                 Key: PIG-2237
>                 URL: https://issues.apache.org/jira/browse/PIG-2237
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Anitha Raju
>            Assignee: Daniel Dai
>             Fix For: 0.9.1, 0.10
>
>
> Hi,
> For a script
> ========
> A = load 'test.txt' using PigStorage() as (a:int,b:int);
> B = order A by a ;
> C = limit B 2;
> store C into 'op1' using PigStorage();
> ========
> LIMIT and ORDER BY are done in the same MR job if no explicit PARALLEL is 
> specified.
> In this case, the number of reducers is determined by Pig and is sometimes 
> calculated to be greater than 1.
> Since the limit happens on the reduce side, each reduce task applies the 
> limit separately, generating n*2 records, where n is the number of reduce 
> tasks calculated by Pig.
> If the number of reduce tasks is specified explicitly on ORDER BY using the 
> PARALLEL keyword,
> ==========
> B = order A by a PARALLEL 4;
> ==========
> another MR job is created with 1 reduce task, where the limit is applied. 
> In short, the issue occurs when the number of reducers calculated by Pig is 
> greater than 1 and a limit is involved in the MR job.
> The issue can be reproduced by specifying
> ==========
> -Dpig.exec.reducers.bytes.per.reducer
> ==========
> The issue is seen in versions 0.8 and 0.9. It works correctly in 0.7.
> Regards,
> Anitha
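
The n*2 effect described in the report above can be illustrated with a small 
toy model. This is only a sketch in Python, not Pig code: the function names 
(range_partition, limit_per_reducer, limit_with_extra_job) and the 12-row 
input are made up for illustration, and the partitioning is a simplified 
stand-in for what a range-partitioned ORDER BY does.

```python
# Toy model only: this is not Pig code. It mimics a range-partitioned
# ORDER BY followed by a LIMIT that is applied inside each reducer.

def range_partition(sorted_rows, num_reducers):
    """Split already-sorted rows into contiguous chunks, one per reducer."""
    size = max(1, -(-len(sorted_rows) // num_reducers))  # ceiling division
    return [sorted_rows[i:i + size] for i in range(0, len(sorted_rows), size)]

def limit_per_reducer(partitions, n):
    """Each reducer applies LIMIT n to its own output (the plan in the report)."""
    return [row for part in partitions for row in part[:n]]

def limit_with_extra_job(partitions, n):
    """Per-reducer LIMIT followed by a single-reducer job that limits again."""
    return limit_per_reducer(partitions, n)[:n]

rows = sorted(range(1, 13))        # hypothetical values of column `a`
parts = range_partition(rows, 4)   # 4 reducers, as if chosen automatically

buggy = limit_per_reducer(parts, 2)
fixed = limit_with_extra_job(parts, 2)
print(len(buggy), buggy)
print(len(fixed), fixed)
```

With 4 reducers and a limit of 2, the first plan emits 4*2 records, while the 
plan with the extra single-reducer job emits exactly 2. With 1 reducer both 
plans agree, which matches the observation that the problem only shows up 
when more than one reducer is chosen.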

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
