[ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069162#comment-13069162 ]
jirapos...@reviews.apache.org commented on MAPREDUCE-2324: ---------------------------------------------------------- ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1164/#review1157 ----------------------------------------------------------- branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2386> I think a default of 0.8 or so would probably make more sense -- just in case one of the TTs is in some bad state where it isn't heartbeating, we don't want to wait forever. branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2387> since this is a set of trackers, not attempts, a better name might be: failedReduceSchedulingTrackers, or something? branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2388> this key should probably be defined as a constant in MRJobConfig, right? branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2389> refactor this into a new method? branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2391> I think the operator precedence is off here. (int)reduceInputAttemptFactor is higher precedence, so it will end up rounding anything < 1.0 down to 0. branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2390> style: add space between if and ( branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2392> I think we mostly avoid the 1st person in error messages. Change to "Tried to schedule..." rather than "We tried"... branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2393> StringUtils.humanReadableInt might be useful here. branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java <https://reviews.apache.org/r/1164/#comment2394> jobId is the job, not the task, right? branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java <https://reviews.apache.org/r/1164/#comment2395> typos: "failes", "then that" branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java <https://reviews.apache.org/r/1164/#comment2396> isn't the default input limit unlimited? why do we need this? branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java <https://reviews.apache.org/r/1164/#comment2397> we should check that the failure info of the job has the correct type of error message (ie that it didn't fail due to some other error) - Todd On 2011-07-21 18:49:31, Robert Evans wrote: bq. bq. ----------------------------------------------------------- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/1164/ bq. ----------------------------------------------------------- bq. bq. (Updated 2011-07-21 18:49:31) bq. bq. bq. Review request for hadoop-mapreduce, Todd Lipcon, Tom Graves, and Jeffrey Naisbitt. bq. bq. bq. Summary bq. ------- bq. bq. Job should fail if a reduce task can't be scheduled anywhere. V2 of the patch. bq. bq. bq. This addresses bug MAPREDUCE-2324. bq. https://issues.apache.org/jira/browse/MAPREDUCE-2324 bq. bq. bq. Diffs bq. ----- bq. bq. branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java 1148035 bq. branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/TaskTracker.java 1148035 bq. branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/MiniMRCluster.java 1148035 bq. branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java 1148035 bq. bq. Diff: https://reviews.apache.org/r/1164/diff bq. bq. bq. Testing bq. ------- bq. bq. Unit tests and ran manual tests on a single node cluster. bq. bq. bq. Thanks, bq. bq. Robert bq. bq. > Job should fail if a reduce task can't be scheduled anywhere > ------------------------------------------------------------ > > Key: MAPREDUCE-2324 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324 > Project: Hadoop Map/Reduce > Issue Type: Bug > Affects Versions: 0.20.2, 0.20.205.0 > Reporter: Todd Lipcon > Assignee: Robert Joseph Evans > Fix For: 0.20.205.0 > > Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt > > > If there's a reduce task that needs more disk space than is available on any > mapred.local.dir in the cluster, that task will stay pending forever. For > example, we produced this in a QA cluster by accidentally running terasort > with one reducer - since no mapred.local.dir had 1T free, the job remained in > pending state for several days. The reason for the "stuck" task wasn't clear > from a user perspective until we looked at the JT logs. > Probably better to just fail the job if a reduce task goes through all TTs > and finds that there isn't enough space. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira