[ https://issues.apache.org/jira/browse/PIG-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-2652:
----------------------------

    Attachment: PIG-2652_2.patch

I agree that SampleOptimizer does more than it is supposed to; it would be better to separate it into two rules. As Bill observes, the rule does not proceed because some of its precondition checks fail. For now we can adjust the precondition check to solve part of the problem, and I have attached a patch that does so. The patch solves Dmitriy's test case; however, Bill's test case is more involved, because it is also related to the plan merge done by MultiQuery. If I rewrite the query to get rid of the alias reuse, I can make it work:

{code}
L = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
LN = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
R = LOAD '2.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
L2 = FILTER L BY ((int)age > 0);
UNIONED = UNION LN, L2;
JOINED = JOIN UNIONED BY owner, R BY owner USING 'skewed';
dump JOINED;
{code}

> Skew join and order by don't trigger reducer estimation
> -------------------------------------------------------
>
>                 Key: PIG-2652
>                 URL: https://issues.apache.org/jira/browse/PIG-2652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>             Fix For: 0.10.0, 0.9.3, 0.11
>
>         Attachments: PIG-2652_1.patch, PIG-2652_2.patch
>
>
> If neither PARALLEL, default parallel or {{mapred.reduce.tasks}} are set, the
> number of reducers is not estimated based on input size for skew joins or
> order by. Instead, these jobs get only 1 reducer.