wuyi created SPARK-42577:
----------------------------

             Summary: A large stage could run indefinitely due to executor lost
                 Key: SPARK-42577
                 URL: https://issues.apache.org/jira/browse/SPARK-42577
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.2, 3.2.3, 3.1.3, 3.0.3
            Reporter: wuyi


When a stage is extremely large and Spark runs on spot instances or on 
problematic clusters with frequent worker/executor loss, the stage can run 
indefinitely because executor loss keeps forcing its tasks to rerun. This 
happens when the external shuffle service is on and the large stage takes 
hours to complete. When Spark tries to submit a child stage, it finds that 
the parent stage (the large one) is missing some partitions, so the large 
stage has to rerun. When it completes again, it finds new missing partitions 
for the same reason.

We should add an attempt limit for this kind of scenario.
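
As a rough illustration of the idea, the sketch below simulates a stage that is 
rerun whenever some of its partitions are found missing, and aborts once a 
maximum number of attempts is reached. It is not Spark scheduler code: 
run_stage_with_attempt_limit, lost_after_attempt, max_attempts and 
StageAbortedError are hypothetical names introduced only for this example, and 
task execution is assumed to always succeed.

```python
from typing import Callable, Set


class StageAbortedError(RuntimeError):
    """Raised when a stage exceeds its allowed number of attempts (hypothetical)."""


def run_stage_with_attempt_limit(
    num_partitions: int,
    lost_after_attempt: Callable[[int], Set[int]],
    max_attempts: int,
) -> int:
    """Rerun a simulated stage until no partition is missing, or abort.

    lost_after_attempt(n) returns the partitions whose output has been lost
    by the time the child stage is submitted after attempt n.
    Returns the number of attempts that were needed.
    """
    all_partitions = set(range(num_partitions))
    missing = set(all_partitions)  # the first attempt computes every partition

    for attempt in range(1, max_attempts + 1):
        # Tasks for the missing partitions run here (simulated as succeeding).
        # Executor loss may then remove some outputs before the child starts.
        missing = lost_after_attempt(attempt) & all_partitions
        if not missing:
            return attempt

    # Without this limit the loop above could repeat forever.
    raise StageAbortedError(
        f"stage aborted after {max_attempts} attempts; "
        f"{len(missing)} partitions still missing"
    )
```

With a limit in place, a stage that keeps losing output fails with an explicit 
error instead of being resubmitted forever. What the limit should be, and 
whether it should be configurable, is left open here.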



