[
https://issues.apache.org/jira/browse/PIG-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Dai updated PIG-2652:
----------------------------
Attachment: PIG-2652_2.patch
I agree that SampleOptimizer does more than it is supposed to; it would be
better to split it into two rules.
As Bill observes, the rule does not proceed because some of its preconditions
fail. For now we can adjust the precondition check to solve part of the
problem. I have attached a patch for this. It fixes Dmitriy's test case; Bill's
test case, however, is more involved, because it is also related to the plan
merge done by MultiQuery. If I rewrite the query to get rid of the alias reuse,
I can make it work:
{code}
L = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
LN = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
R = LOAD '2.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
L2 = FILTER L BY ((int)age > 0);
UNIONED = UNION LN, L2;
JOINED = JOIN UNIONED BY owner, R BY owner USING 'skewed';
dump JOINED;
{code}
> Skew join and order by don't trigger reducer estimation
> -------------------------------------------------------
>
> Key: PIG-2652
> URL: https://issues.apache.org/jira/browse/PIG-2652
> Project: Pig
> Issue Type: Bug
> Reporter: Bill Graham
> Assignee: Bill Graham
> Fix For: 0.10.0, 0.9.3, 0.11
>
> Attachments: PIG-2652_1.patch, PIG-2652_2.patch
>
>
> If none of PARALLEL, default parallel, or {{mapred.reduce.tasks}} is set, the
> number of reducers is not estimated based on input size for skew joins or
> order by. Instead, these jobs get only 1 reducer.
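
For reference, here is a minimal sketch of the kind of input-size-based estimate the description above expects to be applied. It is not the code from either attached patch. The class and method names are hypothetical, and the two defaults are assumptions taken from Pig's documented {{pig.exec.reducers.bytes.per.reducer}} (about 1 GB) and {{pig.exec.reducers.max}} (999) properties.

```java
// Hypothetical illustration only; not taken from PIG-2652_1.patch or PIG-2652_2.patch.
public class ReducerEstimateSketch {

    // Assumed default for pig.exec.reducers.bytes.per.reducer (roughly 1 GB).
    static final long DEFAULT_BYTES_PER_REDUCER = 1000L * 1000L * 1000L;

    // Assumed default for pig.exec.reducers.max.
    static final int DEFAULT_MAX_REDUCERS = 999;

    /**
     * Estimates a reducer count from the total input size:
     * one reducer per bytesPerReducer of input (rounded up),
     * never fewer than 1 and never more than maxReducers.
     */
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        int byInputSize = (int) Math.ceil((double) totalInputBytes / bytesPerReducer);
        return Math.min(maxReducers, Math.max(1, byInputSize));
    }

    public static void main(String[] args) {
        // 2.5 GB of input at 1 GB per reducer rounds up to 3 reducers.
        System.out.println(
            estimateReducers(2_500_000_000L, DEFAULT_BYTES_PER_REDUCER, DEFAULT_MAX_REDUCERS));
    }
}
```

With an estimate like this in effect, a skew join or order by over a few gigabytes of input would get several reducers instead of the single reducer reported in this issue.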