LIMIT generates wrong number of records if Pig determines the number of reducers to be more than 1
---------------------------------------------------------------------------------------------------
Key: PIG-2237
URL: https://issues.apache.org/jira/browse/PIG-2237
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0, 0.8.0
Reporter: Anitha Raju

Hi,

For the script

==========
A = load 'test.txt' using PigStorage() as (a:int,b:int);
B = order A by a;
C = limit B 2;
store C into 'op1' using PigStorage();
==========

LIMIT and ORDER BY are done in the same MR job if no explicit PARALLEL is specified. In this case the number of reducers is determined by Pig, and it is sometimes calculated to be greater than 1. Since the limit happens on the reduce side, each reduce task applies the limit separately, generating n*2 records, where n is the number of reduce tasks calculated by Pig.

If the number of reduce tasks is specified explicitly with the PARALLEL keyword on the ORDER BY,

==========
B = order A by a PARALLEL 4;
==========

another MR job with 1 reduce task is created, in which the limit is done correctly.

In short, the issue occurs when the number of reducers calculated by Pig is greater than 1 and a LIMIT is involved in that MR job.

The issue can be replicated by setting

==========
-Dpig.exec.reducers.bytes.per.reducer
==========

to a value smaller than the input size, so that Pig calculates more than one reducer.

The issue is seen in the 0.8 and 0.9 versions. It works correctly in 0.7.
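For illustration, a minimal verification sketch (not part of the original report; the relation names D, E, F are mine): after running the failing script, count the records actually stored in 'op1'.

==========
-- Verification sketch: count the records the failing script stored in 'op1'.
-- With the bug, the count comes out as n*2, where n is the number of reduce
-- tasks Pig calculated; the correct result is 2.
D = load 'op1' using PigStorage() as (a:int,b:int);
E = group D all;
F = foreach E generate COUNT(D);
dump F;
==========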
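And a complete workaround script, assuming the same test.txt input as above (a sketch based on the PARALLEL behaviour described in the report; PARALLEL 4 is taken from the example):

==========
-- Workaround sketch: the explicit PARALLEL on the ORDER BY makes Pig run the
-- LIMIT in a separate MR job with a single reduce task, so 'op1' receives
-- exactly 2 records regardless of how many reducers Pig would have calculated.
A = load 'test.txt' using PigStorage() as (a:int,b:int);
B = order A by a PARALLEL 4;
C = limit B 2;
store C into 'op1' using PigStorage();
==========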