[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks
[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147747#comment-15147747 ]

Bikas Saha commented on TEZ-3072:

This may be partly relevant when we handle node decommissioning, e.g. in that case we could send inputfailed to consumers. However, that's a discussion for the future.

I am +1 for the changes in TaskAttemptImpl. Blacklisting a node should not arbitrarily rerun all completed attempts on that node because downstream consumers may have already finished processing. We should probably rename the config to signify this aspect, e.g. bad_node_rerun_attempts, and give it a default of false.

However, I would like to be cautious about the changes in TaskImpl. If a task has been marked as failed retroactively, then it implies that consumers have reported enough errors against it. After this, that attempt will also be retried. So informing the node about this seems the right thing to do. It is likely that a number of such errors indicates issues with that node, some of which may be temporary. With TEZ-3075, which would temporarily decommission the nodes, we should be able to handle the temporary cases. But getting the information about failures (including fetch failures) is important for making decisions at the node level. Hence, IMO we should not make the change proposed in TaskImpl. If such a change is needed, then it could be made in the AMNode/AMNodeTracker logic that handles AMNodeEventTaskAttemptEnded. There we could filter attempt failures by type and ignore fetch failures (based on a separate config). Or we could postpone that change in preference to TEZ-3075.

Separately, AMNodeEventTaskAttemptEnded seems to be sent from TaskScheduler and TaskImpl, whereas it could be sent from a single source in TaskAttemptImpl. The current approach is open to getting out of sync.
> Node blacklisting always reruns completed non-leaf tasks
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: TEZ-3072.001.patch
>
> Recently a user ran a job with many vertices, and there was a bug in the
> user's code that caused a problem in one of the trailing vertices in the
> task. On some nodes enough tasks failed that the AM thought it needed to
> blacklist those nodes. That blacklisting then caused many completed vertices
> to re-run because it thought it needed to re-execute the non-leaf tasks that
> had completed on those nodes. This wasted a lot of cluster resources and job
> time for no benefit.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
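
For illustration only, the suggestion in the comment above (filter attempt failures by type in the node-tracking logic and ignore fetch failures based on a separate config) might look roughly like the sketch below. Every name here (`NodeFailureFilter`, `FailureKind`, `ignoreFetchFailures`, `maxFailuresPerNode`) is hypothetical; none of them is an actual Tez class, event, or configuration key, and the real AMNode/AMNodeTracker code is not reproduced.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: counts attempt failures per node, optionally skipping
// fetch failures. These are NOT real Tez classes or config names.
class NodeFailureFilter {
    /** Illustrative failure categories (an assumption, not Tez's own types). */
    enum FailureKind { TASK_ERROR, FETCH_FAILURE }

    private final int maxFailuresPerNode;
    private final boolean ignoreFetchFailures;
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    NodeFailureFilter(int maxFailuresPerNode, boolean ignoreFetchFailures) {
        this.maxFailuresPerNode = maxFailuresPerNode;
        this.ignoreFetchFailures = ignoreFetchFailures;
    }

    /** Records a failed attempt; returns true once the node has reached the threshold. */
    boolean onAttemptFailed(String nodeId, FailureKind kind) {
        boolean counted = !(ignoreFetchFailures && kind == FailureKind.FETCH_FAILURE);
        if (counted) {
            failuresByNode.merge(nodeId, 1, Integer::sum);
        }
        return failureCount(nodeId) >= maxFailuresPerNode;
    }

    /** Number of failures counted against the node so far. */
    int failureCount(String nodeId) {
        return failuresByNode.getOrDefault(nodeId, 0);
    }
}
```

With `ignoreFetchFailures` set, a fetch failure still reaches the tracker but does not move the node closer to the blacklisting threshold, so the filtering decision lives in one place rather than at each sender.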
[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks
[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117350#comment-15117350 ]

Jason Lowe commented on TEZ-3072:

In this particular case even blacklisting the node was the wrong thing to do because the node was irrelevant to the task failures.

I noticed that the code treats a node removed by YARN and a node blacklisted due to attempt failures equally. I could see that being problematic in practice, because a node that is failing tasks could serve up data from shuffle just fine. Re-running completed tasks would only help if the shuffle were actually problematic. I suspect that in most cases, re-running completed tasks to avoid the theoretical possibility of shuffle problems (without even trying to verify the problem exists) makes the job slower than just assuming the shuffle might work and letting normal fetch failure handling take care of the problem. Yes, there are going to be pathological cases where predictive re-execution would drastically speed up the job, but we're seeing plenty of cases where this preemptive strike against potential shuffle problems is causing much more harm. I saw another case of this yesterday where a job re-ran dozens of tasks from upstream completed vertices for no benefit.
[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks
[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117668#comment-15117668 ]

Bikas Saha commented on TEZ-3072:

Agree. Which is why I am suggesting that we stop doing this in the short term (the regular read-error based path is going to provide protection in case the machine is really down). The current logic in there is mostly derived from MR and may be getting triggered more often due to more notifications being sent from other parts of the Tez code that the node handling logic is not prepared for. Opened TEZ-3075 for a longer term revamp of that logic. But for now, I think, not re-running all completed work may be a good enough fix for the common cases we are seeing in this jira. Is that correct? Or should the larger changes in TEZ-3075 be done now?
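
As a rough sketch of the short-term behaviour discussed in this comment (keep blacklisting the node so no new work lands there, but do not re-run work that already completed on it unless explicitly asked to), something like the following could apply. `BadNodePolicy` and its `rerunCompletedAttempts` flag are made-up names for illustration and are not part of Tez; the real change would live in the existing task attempt and node handling code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: blacklisting stops new scheduling on a node, while
// re-running completed attempts is a separate opt-in. Not real Tez code.
class BadNodePolicy {
    private final boolean rerunCompletedAttempts; // assumed flag, off by default
    private final Set<String> blacklistedNodes = new HashSet<>();

    BadNodePolicy() {
        this(false);
    }

    BadNodePolicy(boolean rerunCompletedAttempts) {
        this.rerunCompletedAttempts = rerunCompletedAttempts;
    }

    /**
     * Blacklists the node and returns the completed attempts that should be
     * re-run: none unless the opt-in flag is set.
     */
    List<String> blacklist(String nodeId, List<String> completedAttemptsOnNode) {
        blacklistedNodes.add(nodeId);
        if (!rerunCompletedAttempts) {
            return Collections.emptyList();
        }
        return new ArrayList<>(completedAttemptsOnNode);
    }

    /** New work is never scheduled on a blacklisted node. */
    boolean canSchedule(String nodeId) {
        return !blacklistedNodes.contains(nodeId);
    }
}
```

With the flag off, output from completed attempts is only regenerated if consumers actually report read errors through the normal fetch-failure path.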
[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks
[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117885#comment-15117885 ]

TezQA commented on TEZ-3072:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12784463/TEZ-3072.001.patch
  against master revision 2bf27de.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in :
                  org.apache.tez.test.TestFaultTolerance

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1432//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1432//console

This message is automatically generated.
[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks
[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115814#comment-15115814 ]

Jason Lowe commented on TEZ-3072:

We also have issues with temporary fetch failures on a node causing all completed tasks from that node to re-run. In many ways the blacklisting logic is causing more problems than it is solving, at least with respect to fetch-failure related processing. It would be nice if we could configure blacklisting to ignore node effects involving shuffle (e.g. fetch failures are not reported to the blacklisting logic, and blacklisted nodes don't cause completed tasks to re-run).
[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks
[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116156#comment-15116156 ]

Bikas Saha commented on TEZ-3072:

A short term fix could be disabling the rerun of completed tasks (but continuing to blacklist the node to avoid scheduling more work there). A longer term effort towards putting machines on probation for a few cycles before giving up on them might help prevent cliffs like this, especially those due to temporary glitches.
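
One possible reading of the probation idea in this comment, sketched under assumptions: a node that hits the failure threshold is first parked for a fixed number of cycles and only given up on if it keeps failing after its probation allowance is used up. The class `NodeProbation`, its states, and the parameters `probationCycles` and `maxProbations` are invented for this sketch; the comment does not specify how a cycle is defined or how many probations a node should get.

```java
// Hypothetical sketch of a per-node probation state machine. The names and
// the exact policy are assumptions, not the TEZ-3075 design.
class NodeProbation {
    enum State { ACTIVE, PROBATION, BLACKLISTED }

    private final int probationCycles; // how long a node sits out each time
    private final int maxProbations;   // how many times it may be put on probation
    private State state = State.ACTIVE;
    private int cyclesLeft = 0;
    private int probationsUsed = 0;

    NodeProbation(int probationCycles, int maxProbations) {
        this.probationCycles = probationCycles;
        this.maxProbations = maxProbations;
    }

    /** Called when the node's failure count crosses the blacklisting threshold. */
    void onFailureThresholdReached() {
        if (state == State.BLACKLISTED) {
            return;
        }
        if (probationsUsed >= maxProbations) {
            state = State.BLACKLISTED; // allowance used up: give up on the node
            return;
        }
        probationsUsed++;
        state = State.PROBATION;
        cyclesLeft = probationCycles;
    }

    /** Called once per scheduling cycle; a quiet probation ends in reinstatement. */
    void onCycle() {
        if (state == State.PROBATION && --cyclesLeft <= 0) {
            state = State.ACTIVE;
        }
    }

    State state() {
        return state;
    }
}
```

A temporary glitch then costs the node a few cycles of scheduling instead of permanently removing it, which is the cliff the comment is trying to avoid.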