[ https://issues.apache.org/jira/browse/SPARK-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930063#comment-15930063 ]
Imran Rashid commented on SPARK-13369:
--------------------------------------

Thanks for fixing this [~sitalke...@gmail.com].

I just noticed that earlier in this ticket there was a discussion about the need to set this config for streaming. I don't believe that is true; the way this works, it should actually be fine for occasional fetch failures in a long-lived streaming job. The maximum number of fetch failures is per-stage, and the count is reset when the stage runs successfully. Can you explain why you'd need to modify this config for a streaming job?

(The large cluster case at Facebook makes sense to me, as we discussed on the PR, and I updated the JIRA description accordingly.)

> Number of consecutive fetch failures for a stage before the job is aborted should be configurable
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13369
>                 URL: https://issues.apache.org/jira/browse/SPARK-13369
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Sital Kedia
>            Assignee: Sital Kedia
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> The previously hardcoded maximum of 4 retries per stage is not suitable for all
> cluster configurations. Since Spark retries a stage at the first sign of a
> fetch failure, you can easily end up with many stage retries before all
> the failures are discovered. In particular, two scenarios in which this value
> should change are (1) if there are more than 4 executors per node; in that case,
> it may take 4 retries to discover the problem with each executor on the node, and
> (2) during cluster maintenance on large clusters, where multiple machines are
> serviced at once but you cannot afford total cluster downtime. By making this
> value configurable, cluster managers can tune it to something more
> appropriate for their cluster configuration.
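
To illustrate the per-stage counting described in the comment above, here is a minimal Python sketch. It is a toy model, not Spark's scheduler code: the class `StageFailureTracker`, the function `run`, and the exact abort condition (abort once the consecutive failure count reaches the limit) are assumptions made for illustration. The default of 4 mirrors the previously hardcoded value mentioned in the issue description.

```python
class StageFailureTracker:
    """Toy model of a per-stage consecutive fetch-failure counter.

    Hypothetical helper for illustration only; not Spark's implementation.
    """

    def __init__(self, max_consecutive_attempts=4):
        self.max_consecutive_attempts = max_consecutive_attempts
        self.failures = {}  # stage id -> consecutive failed attempts

    def record_fetch_failure(self, stage_id):
        """Count one failed attempt; return True if the stage should abort."""
        count = self.failures.get(stage_id, 0) + 1
        self.failures[stage_id] = count
        return count >= self.max_consecutive_attempts

    def record_success(self, stage_id):
        """A successful run resets the count for that stage."""
        self.failures.pop(stage_id, None)


def run(events, max_consecutive_attempts=4):
    """Replay (stage_id, succeeded) events; return the aborting stage id or None."""
    tracker = StageFailureTracker(max_consecutive_attempts)
    for stage_id, succeeded in events:
        if succeeded:
            tracker.record_success(stage_id)
        elif tracker.record_fetch_failure(stage_id):
            return stage_id
    return None


# Occasional failures, each followed by a success, never reach the limit.
occasional = [(1, False), (1, True)] * 100
# Four failures in a row on the same stage hit the default limit.
consecutive = [(1, False)] * 4
```

Under this model, `run(occasional)` never aborts however long the job lives, while `run(consecutive)` aborts on the fourth failure. Raising the limit (for example `max_consecutive_attempts=8`) corresponds to the large-cluster case in the issue description, where several retries may be needed to discover every failed executor.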