[ 
https://issues.apache.org/jira/browse/PIG-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2652:
----------------------------

    Attachment: PIG-2652_2.patch

I agree that SampleOptimizer does more than it is supposed to; it would be 
better to separate it into two rules. 

As Bill observes, the rule does not proceed because some preconditions fail. 
For now, we can adjust the precondition check to solve some of the problems, 
and I have attached a patch that does so. It fixes Dmitriy's test case; 
however, Bill's test case is more involved, as it is also related to the plan 
merge of MultiQuery. If I rewrite the query to get rid of the alias reuse, I 
can make it work:

{code}
L = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
LN = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
R = LOAD '2.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);

L2 = FILTER L BY ((int)age > 0);
UNIONED = UNION LN, L2;
JOINED = JOIN UNIONED BY owner, R BY owner USING 'skewed';

dump JOINED;
{code}
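
Until the rule is fixed, one possible stopgap is to set the parallelism 
explicitly, since the issue description says the single-reducer behavior 
appears only when no parallelism is configured. This is a minimal sketch, not 
part of the attached patch; the value 10 is an arbitrary placeholder, and the 
SORTED alias is introduced here only for illustration:

{code}
-- Assumption: 10 is an illustrative reducer count, not a tuned value.
L = LOAD '1.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);
R = LOAD '2.txt' AS (owner:chararray,pet:chararray,age:int,phone:chararray);

-- Explicit PARALLEL on the skewed join, so no reducer estimation is needed.
JOINED = JOIN L BY owner, R BY owner USING 'skewed' PARALLEL 10;

-- The same clause applies to order by.
SORTED = ORDER L BY age PARALLEL 10;

DUMP JOINED;
DUMP SORTED;
{code}

Putting SET default_parallel 10; at the top of the script should have the same 
effect for every reduce-side operator that has no PARALLEL clause of its own.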
                
> Skew join and order by don't trigger reducer estimation
> -------------------------------------------------------
>
>                 Key: PIG-2652
>                 URL: https://issues.apache.org/jira/browse/PIG-2652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>             Fix For: 0.10.0, 0.9.3, 0.11
>
>         Attachments: PIG-2652_1.patch, PIG-2652_2.patch
>
>
> If none of PARALLEL, default parallel, or {{mapred.reduce.tasks}} is set, the 
> number of reducers is not estimated based on input size for skew joins or 
> order by. Instead, these jobs get only 1 reducer.


        
