[ https://issues.apache.org/jira/browse/SPARK-42577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693753#comment-17693753 ]
Tengfei Huang commented on SPARK-42577:
---------------------------------------

I am working on this. Thanks. [~Ngone51]

> A large stage could run indefinitely due to executor lost
> ---------------------------------------------------------
>
>                 Key: SPARK-42577
>                 URL: https://issues.apache.org/jira/browse/SPARK-42577
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.3, 3.1.3, 3.2.3, 3.3.2
>            Reporter: wuyi
>            Priority: Major
>
> When a stage is extremely large and Spark runs on spot instances or
> problematic clusters with frequent worker/executor loss, the stage can run
> indefinitely because tasks are rerun after each executor loss. With the
> external shuffle service enabled and a large stage taking hours to
> complete, when Spark tries to submit a child stage it finds that the
> parent stage (the large one) is missing some shuffle partitions, so the
> large stage has to rerun. When it completes again, it finds new missing
> partitions for the same reason.
> We should add an attempt limit for this kind of scenario.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
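The failure mode described above can be illustrated with a small simulation. This is a hypothetical sketch, not Spark's scheduler code: the function name, parameters, and retry logic are all illustrative. It models a stage that, each time it "completes", may discover that executor loss invalidated some of its shuffle partitions, forcing a full rerun; without a cap on attempts the loop can run indefinitely, which is the scenario the issue proposes to bound.

```python
# Hypothetical sketch (NOT Spark's actual DAGScheduler code): simulates a
# parent stage being resubmitted whenever executor loss invalidates shuffle
# output, and shows how a max-attempts cap bounds the otherwise endless loop.
import random

def run_stage_with_retries(num_partitions, loss_probability, max_attempts):
    """Return the number of attempts taken, or raise once the cap is hit."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        # After the stage "finishes", each partition's shuffle output may
        # have been lost (e.g., its executor's host was reclaimed). Any
        # missing partition forces the whole stage to run again.
        missing = [p for p in range(num_partitions)
                   if random.random() < loss_probability]
        if not missing:
            return attempts
    # With a cap, the job fails fast instead of rerunning forever.
    raise RuntimeError(f"stage aborted after {max_attempts} attempts")
```

With `loss_probability = 0` the stage succeeds on the first attempt; as the per-partition loss probability grows, reruns become likely and the cap is what prevents an unbounded loop.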