[ https://issues.apache.org/jira/browse/PIG-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258089#comment-13258089 ]
Daniel Dai commented on PIG-2652:
---------------------------------
It seems this still does not solve the LimitAdjuster issue. Imagine the
following script:
{code}
A = load '1.txt' as (a0);
B = order A by a0;
C = limit B 100;
dump C;
{code}
This generates two jobs: a sampler job and an order-by job. When we launch the
first job, we check the size of the input file, realize we need N>1 reducers,
and adjust both jobs to set #reducers to N. But since there is a limit operator
downstream, a third job with #reducers=1 is then needed to enforce the limit of
100. LimitAdjuster is supposed to add that third job, but it runs before
JobControlCompiler, so it cannot see #reducers=N.
The test case TestEvalPipeline2.testLimitAutoReducer fails for exactly this
reason.
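For context, the reducer count the first job computes is roughly the total input size divided by a bytes-per-reducer threshold, capped at a configured maximum. A minimal sketch of that arithmetic, with defaults assumed from Pig's {{pig.exec.reducers.bytes.per.reducer}} (1 GB) and {{pig.exec.reducers.max}} (999) properties (function name is illustrative, not Pig's actual API):
{code}
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """Sketch of input-size-based reducer estimation:
    one reducer per bytes_per_reducer of input, at least 1,
    capped at max_reducers."""
    reducers = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(reducers, max_reducers))
{code}
So a 3.5 GB input would be estimated at 4 reducers, which is why the downstream limit job must still be added with #reducers=1.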
> Skew join and order by don't trigger reducer estimation
> -------------------------------------------------------
>
> Key: PIG-2652
> URL: https://issues.apache.org/jira/browse/PIG-2652
> Project: Pig
> Issue Type: Bug
> Reporter: Bill Graham
> Assignee: Bill Graham
> Fix For: 0.10.0, 0.9.3, 0.11
>
> Attachments: PIG-2652_1.patch, PIG-2652_2.patch, PIG-2652_3.patch,
> PIG-2652_3_10.patch, PIG-2652_4.patch, PIG-2652_5.patch
>
>
> If none of PARALLEL, default parallel, or {{mapred.reduce.tasks}} is set, the
> number of reducers is not estimated from the input size for skew joins or
> order by. Instead, these jobs get only one reducer.