[ https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572344#comment-15572344 ]
Imran Rashid commented on SPARK-17911:
--------------------------------------

bq. In other words, handling a ResubmitFailedStages event should be quick, and causes failedStages to be cleared, allowing the next ResubmitFailedStages event to be posted from the handling of another FetchFailed. If there are the expected lot of fetch failures for a single stage, and there is no RESUBMIT_TIMEOUT, then it is quite likely that there will be a burst of resubmit events (and corresponding log messages) and submitStage calls made in rapid succession.

Lemme rephrase your comment to make sure I understand it. The {{messageScheduler}} and the delay do *not* affect correctness, or even what actually gets resubmitted. When we resubmit {{mapStage}}, we always resubmit all tasks corresponding to shuffle map output on the failed executor. And when we resubmit the {{failedStage}}, there is probably a long enough delay from {{mapStage}} that waiting 200ms is relatively inconsequential.

However, it *does* affect the logging. Without the delay, if the scheduler event queue is relatively empty, then as the fetch failures trickle in, we'd post a Resubmit event for each one, and each would get handled relatively quickly. So each fetch failure would trigger another resubmit event and more logging. That is also undesirable, both because of the noise in the logs and b/c it would create an unnecessary flood of events on the scheduler event queue.

Is that a fair summary? I agree with everything said there, but then I'd request we take one of two actions:

1) Change the resubmit logic -- in addition to checking failed stages, you can also check {{waitingStages}} and {{runningStages}}. That is what happens eventually anyway inside {{resubmitFailedStages}}. This would actually be even better for decreasing noise in the logs etc.
I know this may seem like a small thing to make such a big deal about, but I honestly think this is confusing enough that it's worth cleaning up -- eliminating an unneeded event queue is, I think, a significant win.

2) If we don't do that, let's at least add a better comment explaining the purpose. (Maybe just a pointer to this jira at this point.)

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -----------------------------------------------------------------
>
>                 Key: SPARK-17911
>                 URL: https://issues.apache.org/jira/browse/SPARK-17911
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.0.0
>            Reporter: Imran Rashid
>
> It's not totally clear what the purpose of the {{messageScheduler}} is in
> {{DAGScheduler}}. It can perhaps be eliminated completely; or perhaps we
> should just clearly document its purpose.
> This comes from a long discussion w/ [~markhamstra] on an unrelated PR here:
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But it's tricky, so I'm breaking it out here to archive the discussion.
> Note: this issue requires a decision on what to do before a code change, so
> let's just discuss it on jira first.
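To make the trade-off in the comment above concrete, here is a toy simulation of the event handling it describes. This is a sketch under stated assumptions, not Spark code: the function `count_resubmits`, the single stage name, the event names, and the 10ms spacing of fetch failures are all invented for illustration, and the "active" set merely stands in for {{waitingStages}} plus {{runningStages}} (a resubmitted stage is assumed to stay active for the rest of the run). It counts how many resubmit events get handled with the delay, without it, and without it but with the extra check proposed in option 1.

```python
import heapq

STAGE = "mapStage"  # hypothetical: one stage whose map output was lost


def count_resubmits(failure_times_ms, delay_ms, check_active=False):
    """Count handled resubmit events in a toy single-threaded event loop.

    failure_times_ms: arrival times of fetch failures for STAGE.
    delay_ms: how long to wait before posting the resubmit event
              (plays the role of RESUBMIT_TIMEOUT; 0 means no delay).
    check_active: if True, ignore fetch failures for a stage that has
                  already been resubmitted (assumed reading of option 1).
    """
    # Events are (time, sequence number, kind); the sequence number keeps
    # ordering deterministic for events that share a timestamp.
    events = [(t, i, "fetch_failed") for i, t in enumerate(failure_times_ms)]
    heapq.heapify(events)
    seq = len(events)

    failed = set()   # stands in for failedStages
    active = set()   # stands in for waitingStages + runningStages
    resubmits = 0

    while events:
        now, _, kind = heapq.heappop(events)
        if kind == "fetch_failed":
            if check_active and STAGE in active:
                continue  # already resubmitted, nothing new to post
            if not failed:
                # Only post when no resubmit is pending already.
                heapq.heappush(events, (now + delay_ms, seq, "resubmit"))
                seq += 1
            failed.add(STAGE)
        else:
            # Handling the resubmit is quick and clears the failed set.
            resubmits += 1
            active |= failed
            failed.clear()
    return resubmits


if __name__ == "__main__":
    trickle = [0, 10, 20, 30, 40]  # fetch failures 10ms apart (made up)
    print(count_resubmits(trickle, delay_ms=200))                   # 1
    print(count_resubmits(trickle, delay_ms=0))                     # 5
    print(count_resubmits(trickle, delay_ms=0, check_active=True))  # 1
```

In this model the delay and the extra check both collapse the burst into a single resubmit, which matches the claim that the delay affects only log and event noise, not what gets resubmitted. Whether the real scheduler behaves the same way depends on details the toy leaves out, such as stages finishing and being retried.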