[ https://issues.apache.org/jira/browse/PIG-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255363#comment-13255363 ]
Dmitriy V. Ryaboy commented on PIG-2652:
----------------------------------------

I spent some time debugging my refactoring and now think there may be a bug in your patch, Daniel.

As written, we look at the inputs to the sampling job and estimate reducers for the successor based on those inputs. However, the successor actually has two inputs -- the sampled dataset, and the second joined relation. That means the earlier estimate is incorrect.

I tried running the estimator on the post-sample job, but there doesn't seem to be a way to connect the plan to its predecessor -- the plan passed in is already trimmed at the top.

I'll try the following instead: identify each sampling job's children and set them aside somewhere; then, for every job on that saved list of known post-sample jobs, re-run the estimator if its parallelism is set to 1.

> Skew join and order by don't trigger reducer estimation
> -------------------------------------------------------
>
>                 Key: PIG-2652
>                 URL: https://issues.apache.org/jira/browse/PIG-2652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>             Fix For: 0.10.0, 0.9.3, 0.11
>
>         Attachments: PIG-2652_1.patch, PIG-2652_2.patch, PIG-2652_3.patch, PIG-2652_3_10.patch
>
>
> If neither PARALLEL, default parallel or {{mapred.reduce.tasks}} are set, the
> number of reducers is not estimated based on input size for skew joins or
> order by. Instead, these jobs get only 1 reducer.
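
The approach proposed in the comment (remember each sampling job's children, then re-run the estimator on those children when their parallelism is still 1) could be sketched roughly as below. This is not Pig code and not the patch itself: `Job`, `collectPostSampleJobs`, `estimateReducers`, and `reestimate` are hypothetical stand-ins invented for illustration, and the bytes-per-reducer and maximum-reducer values are passed in as plain parameters rather than read from any real configuration. The one point the sketch tries to capture is that the estimate for a post-sample job is computed from all of that job's own inputs, not from the sampler's inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: none of these types or methods exist in Pig.
class PostSampleEstimation {

    /** Stand-in for one MapReduce job in a compiled plan. */
    static class Job {
        final String name;
        final boolean isSampler;        // true for the sampling job
        final long[] inputSizes;        // size in bytes of each input this job reads
        final List<Job> successors = new ArrayList<>();
        int parallelism;                // 1 is treated as "not estimated yet"

        Job(String name, boolean isSampler, int parallelism, long... inputSizes) {
            this.name = name;
            this.isSampler = isSampler;
            this.parallelism = parallelism;
            this.inputSizes = inputSizes;
        }
    }

    /** Step 1: set aside the children of every sampling job. */
    static Set<Job> collectPostSampleJobs(List<Job> plan) {
        Set<Job> postSample = new HashSet<>();
        for (Job job : plan) {
            if (job.isSampler) {
                postSample.addAll(job.successors);
            }
        }
        return postSample;
    }

    /** Estimate from the sum of ALL inputs of the job, rounded up, clamped to [1, maxReducers]. */
    static int estimateReducers(Job job, long bytesPerReducer, int maxReducers) {
        long totalBytes = 0;
        for (long size : job.inputSizes) {
            totalBytes += size;
        }
        long needed = (totalBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1L, Math.min((long) maxReducers, needed));
    }

    /** Step 2: re-run the estimator only for saved post-sample jobs still at parallelism 1. */
    static void reestimate(List<Job> plan, Set<Job> postSample,
                           long bytesPerReducer, int maxReducers) {
        for (Job job : plan) {
            if (postSample.contains(job) && job.parallelism == 1) {
                job.parallelism = estimateReducers(job, bytesPerReducer, maxReducers);
            }
        }
    }

    public static void main(String[] args) {
        Job sampler = new Job("sampler", true, 1, 2500L);
        // The successor reads two inputs: the sampled relation and the second joined relation.
        Job join = new Job("skew-join", false, 1, 2500L, 1500L);
        sampler.successors.add(join);

        List<Job> plan = Arrays.asList(sampler, join);
        reestimate(plan, collectPostSampleJobs(plan), 1000L, 999);
        System.out.println(join.name + " reducers: " + join.parallelism);
    }
}
```

Keeping the saved set separate from the plan sidesteps the problem noted in the comment, where the plan handed to the estimator is already trimmed and cannot be walked back to its predecessor. Jobs whose parallelism was set explicitly to something other than 1 are left alone.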