[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503647#comment-16503647 ]
Jiang Xingbo commented on SPARK-24375:
--------------------------------------

The major problem is that tasks in the same stage of an MPI workload may rely on the intermediate results of other tasks running in parallel to compute their final results. When one task fails, the other tasks in the same stage may therefore produce incorrect results or even hang, so it seems straightforward to simply retry the whole stage on task failure.

> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
>                 Key: SPARK-24375
>                 URL: https://issues.apache.org/jira/browse/SPARK-24375
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP
> discussion. It doesn't need to be a complete design before the vote. But it
> should at least cover both Scala/Java and PySpark.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
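
As an illustration of the retry policy proposed in the comment above, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API: the names `TaskFailed`, `run_barrier_stage`, and `max_stage_attempts` are hypothetical, and the tasks run sequentially purely to keep the example short (in a real barrier stage they would run in parallel). The only point it shows is that a single task failure discards every task's partial result and reruns the whole stage, rather than retrying the failed task alone.

```python
from typing import Callable, List


class TaskFailed(Exception):
    """Hypothetical signal that one task in the stage has failed."""


def run_barrier_stage(
    tasks: List[Callable[[int], int]],
    max_stage_attempts: int = 3,
) -> List[int]:
    """Run all tasks; if any task fails, retry the entire stage."""
    for attempt in range(max_stage_attempts):
        results = []
        try:
            for task in tasks:
                results.append(task(attempt))
        except TaskFailed:
            # Results from the other tasks may depend on the failed one,
            # so they are discarded and every task is rerun.
            continue
        return results
    raise RuntimeError(f"stage failed after {max_stage_attempts} attempts")


def steady(attempt: int) -> int:
    return 1


def flaky(attempt: int) -> int:
    # Simulated failure on the first stage attempt only.
    if attempt == 0:
        raise TaskFailed("simulated failure")
    return 2


results = run_barrier_stage([steady, flaky])
print(results)  # [1, 2]
```

Here `steady` succeeds on the first attempt, but its result is thrown away because `flaky` fails in the same attempt; both tasks run again on the second attempt.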