[
https://issues.apache.org/jira/browse/PIG-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255363#comment-13255363
]
Dmitriy V. Ryaboy commented on PIG-2652:
----------------------------------------
Spent some time debugging my refactoring and I now think there may be a bug in
your patch, Daniel. As written, we look at the inputs to the sampling job and
estimate reducers for the successor based on those inputs. However, the
successor actually has two inputs -- the sampled dataset, and the second joined
relation. That means the earlier estimate is incorrect.
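
To make the size of the error concrete, here is a minimal sketch of a
size-based estimate. It is not the patch's code: the class and method names are
hypothetical, the input sizes are made up, and the two constants are assumed to
mirror the defaults of pig.exec.reducers.bytes.per.reducer and
pig.exec.reducers.max.

```java
// Hypothetical sketch, not Pig's actual estimator.
class ReducerEstimateSketch {
    // Assumed defaults (1 GB per reducer, at most 999 reducers).
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    // Estimate reducers from the total size of all inputs a job reads.
    static int estimate(long... inputSizesInBytes) {
        long total = 0;
        for (long size : inputSizesInBytes) {
            total += size;
        }
        // Integer ceiling of total / BYTES_PER_REDUCER.
        long reducers = (total + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        // Clamp to the range [1, MAX_REDUCERS].
        return (int) Math.max(1, Math.min(MAX_REDUCERS, reducers));
    }
}
```

With a made-up 2 GB sampled relation joined against a made-up 50 GB relation,
estimate(2_000_000_000L) describes what the sampling job's inputs suggest, while
estimate(2_000_000_000L, 50_000_000_000L) describes what the successor really
reads, so the two results differ widely.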
I tried running the estimator on the post-sample job, but there doesn't seem to
be a way to connect the plan to its predecessor -- the plan passed in is
already trimmed at the top. I'll try the following instead: identify a sampling
job's children, and set them aside somewhere; then check against the saved list
of known post-sample jobs and re-run the estimator for them if parallelism is
set to 1.
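
A rough sketch of that bookkeeping, under stated assumptions: Job is a
hypothetical stand-in for a compiled MapReduce job rather than a Pig class, the
tracker and its method names are invented for illustration, and the estimator
is the simple bytes-per-reducer rule with assumed limits (1 GB per reducer, at
most 999 reducers).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "remember post-sample jobs" idea, not Pig code.
class PostSampleTrackerSketch {
    // Stand-in for a job in the compiled plan (illustrative only).
    static class Job {
        final String name;
        final boolean isSamplingJob;
        final long inputBytes;      // total size of everything this job reads
        int parallelism;            // requested number of reducers
        final List<Job> successors = new ArrayList<>();

        Job(String name, boolean isSamplingJob, long inputBytes, int parallelism) {
            this.name = name;
            this.isSamplingJob = isSamplingJob;
            this.inputBytes = inputBytes;
            this.parallelism = parallelism;
        }
    }

    // Jobs known to run directly after a sampling job.
    private final Set<Job> postSampleJobs = new HashSet<>();

    // Step 1: when a sampling job is seen, set its children aside.
    void recordSuccessors(Job job) {
        if (job.isSamplingJob) {
            postSampleJobs.addAll(job.successors);
        }
    }

    // Step 2: when a saved job shows up with parallelism 1, re-run the
    // estimator against that job's own inputs.
    void maybeReestimate(Job job) {
        if (postSampleJobs.contains(job) && job.parallelism == 1) {
            job.parallelism = estimate(job.inputBytes);
        }
    }

    // Assumed size-based rule: ceil(bytes / 1 GB), clamped to [1, 999].
    static int estimate(long totalInputBytes) {
        long bytesPerReducer = 1_000_000_000L;
        long reducers = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1, Math.min(999, reducers));
    }
}
```

The point of keeping the saved set is that the trimmed plan no longer links a
job back to its predecessor, so the relationship has to be recorded while the
sampling job is still in hand.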
> Skew join and order by don't trigger reducer estimation
> -------------------------------------------------------
>
> Key: PIG-2652
> URL: https://issues.apache.org/jira/browse/PIG-2652
> Project: Pig
> Issue Type: Bug
> Reporter: Bill Graham
> Assignee: Bill Graham
> Fix For: 0.10.0, 0.9.3, 0.11
>
> Attachments: PIG-2652_1.patch, PIG-2652_2.patch, PIG-2652_3.patch,
> PIG-2652_3_10.patch
>
>
> If none of PARALLEL, default parallel, or {{mapred.reduce.tasks}} is set, the
> number of reducers is not estimated based on input size for skew joins or
> order by. Instead, these jobs get only 1 reducer.