LIMIT generates the wrong number of records if Pig determines the number of
reducers to be more than 1
---------------------------------------------------------------------------------------
Key: PIG-2237
URL: https://issues.apache.org/jira/browse/PIG-2237
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0, 0.8.0
Reporter: Anitha Raju
Hi,
For a script such as the following:
========
A = load 'test.txt' using PigStorage() as (a:int,b:int);
B = order A by a;
C = limit B 2;
store C into 'op1' using PigStorage();
========
LIMIT and ORDER BY are executed in the same MR job if no explicit PARALLEL
clause is specified.
In this case, the number of reducers is determined by Pig, and it is sometimes
calculated to be greater than 1.
Since the limit is applied on the reduce side, each reduce task applies the
limit separately, generating n*2 records, where n is the number of reduce
tasks calculated by Pig. For example, with 3 reducers, the script above
produces up to 6 records instead of 2.
If the number of reduce tasks is specified explicitly on the ORDER BY using
the PARALLEL keyword,
==========
B = order A by a PARALLEL 4;
==========
another MR job is created with a single reduce task, in which the limit is
applied correctly.
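For reference, a full workaround script would look like the following (a
sketch based on the script above; the reducer count 4 is carried over from
the PARALLEL example, and any explicit value triggers the extra
single-reducer job):
==========
A = load 'test.txt' using PigStorage() as (a:int, b:int);
-- explicit PARALLEL forces a separate 1-reducer MR job for the limit
B = order A by a PARALLEL 4;
C = limit B 2;
store C into 'op1' using PigStorage();
==========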
In short, the issue occurs when the number of reducers calculated by Pig is
greater than 1 and a LIMIT is involved in the same MR job.
The issue can be replicated by setting
==========
-Dpig.exec.reducers.bytes.per.reducer
==========
to a value small enough that Pig estimates more than one reducer.
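For example, from the command line (the value 1000 and the script name
test.pig are illustrative; the value just needs to be smaller than the input
size so Pig estimates more than one reducer, and the -D property is assumed
to be passed before other Pig arguments):
==========
pig -Dpig.exec.reducers.bytes.per.reducer=1000 test.pig
==========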
The issue is seen in the 0.8 and 0.9 versions. It works correctly in 0.7.
Regards,
Anitha