[ 
https://issues.apache.org/jira/browse/SPARK-50288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Ting LIN updated SPARK-50288:
--------------------------------
    Summary: Executors of failed stage still alive even though stage has been 
retried  (was: Executors of stage failed still alive even that stage has been 
retried)

> Executors of failed stage still alive even though stage has been retried
> ------------------------------------------------------------------------
>
>                 Key: SPARK-50288
>                 URL: https://issues.apache.org/jira/browse/SPARK-50288
>             Project: Spark
>          Issue Type: Question
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.3.0, 3.5.2
>            Reporter: Yu-Ting LIN
>            Priority: Major
>         Attachments: stage19_stage19retry1.png, trigger_duplicate_process.png
>
>
> We are executing the Spark DataFrame API with foreachPartition, and we 
> observed behavior that we are not able to explain.
> In the attached figure, you can see that stage 19 was retried as stage 19 
> (retry 1) due to a ShuffleOutputNotFound error.
> However, we found that some executors are still allocated to the original 
> stage 19 with tasks in the RUNNING state, e.g. for partitions 371 and 200. 
> In addition, since those tasks had not yet finished, partitions 371 and 200 
> were resubmitted again in stage 19 (retry 1), so they are processed twice.
> Is there any configuration that can help us control this phenomenon?
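A minimal plain-Python sketch (not actual Spark code; the sink classes and row data are hypothetical) of the phenomenon described above: when a stage is retried while tasks from the original attempt are still running, the same partitions are executed twice, and any non-idempotent side effect in the foreachPartition body is duplicated. One common mitigation, regardless of configuration, is to make the sink idempotent so replays are absorbed:

```python
def process_partition(partition_id, rows, sink):
    # Stand-in for a foreachPartition body: write each row to an external sink.
    for row in rows:
        sink.write(partition_id, row)

class AppendOnlySink:
    """Non-idempotent sink: every write appends, so a replayed task duplicates data."""
    def __init__(self):
        self.records = []
    def write(self, partition_id, row):
        self.records.append((partition_id, row))

class IdempotentSink:
    """Idempotent sink: writes are keyed by (partition, row), so replays are harmless."""
    def __init__(self):
        self.records = {}
    def write(self, partition_id, row):
        self.records[(partition_id, row)] = row

# Hypothetical partition data, echoing the partition ids from the report.
data = {371: ["a", "b"], 200: ["c"]}

append_sink = AppendOnlySink()
idem_sink = IdempotentSink()
for sink in (append_sink, idem_sink):
    # Original attempt of stage 19: tasks for partitions 371 and 200 run...
    for pid, rows in data.items():
        process_partition(pid, rows, sink)
    # ...but the stage is retried while they are still RUNNING, so the same
    # partitions are executed again in stage 19 (retry 1).
    for pid, rows in data.items():
        process_partition(pid, rows, sink)

print(len(append_sink.records))  # 6: the 3 rows were written twice
print(len(idem_sink.records))    # 3: the replay was absorbed
```

This only illustrates why exactly-once output cannot be assumed for side effects inside foreachPartition; it does not answer which Spark configuration (if any) prevents the original attempt's tasks from running to completion.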



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
