[ https://issues.apache.org/jira/browse/SPARK-50288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yu-Ting LIN updated SPARK-50288:
--------------------------------
    Summary: Executors of failed stage still alive even though stage has been retried  (was: Executors of stage failed still alive even that stage has been retried)

> Executors of failed stage still alive even though stage has been retried
> -------------------------------------------------------------------------
>
>                 Key: SPARK-50288
>                 URL: https://issues.apache.org/jira/browse/SPARK-50288
>             Project: Spark
>          Issue Type: Question
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.3.0, 3.5.2
>            Reporter: Yu-Ting LIN
>            Priority: Major
>         Attachments: stage19_stage19retry1.png, trigger_duplicate_process.png
>
> We are running a Spark DataFrame job that uses foreachPartition, and we
> observed behavior we cannot explain.
> As shown in the attached figures, stage 19 was retried as stage 19
> (retry 1) after a ShuffleOutputNotFound error.
> However, some tasks of the original stage 19 (e.g. partitions 371 and
> 200) were still in state RUNNING on their executors. Because those tasks
> had not finished, partitions 371 and 200 were also resubmitted in
> stage 19 (retry 1), so the same partitions were processed twice.
> Is there any configuration that can help us control this behavior?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
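[Editorial note on the reported behavior: Spark task attempts can legitimately run more than once (stage retries, speculative execution), which is why side effects inside foreachPartition are generally expected to be idempotent. The sketch below is a minimal plain-Python simulation of such an idempotency guard, not Spark API; the names (`processed`, `sink`, `process_partition_idempotently`) are hypothetical stand-ins for an external commit-marker store and output sink.]

```python
def process_partition_idempotently(partition_id, rows, processed, sink):
    """Skip the work if an earlier task attempt already committed this partition.

    `processed` stands in for an external deduplication store (e.g. a table
    keyed by partition id); `sink` stands in for the external output.
    """
    if partition_id in processed:
        return  # a previous attempt (original stage or retry) already wrote it
    for row in rows:
        sink.append((partition_id, row))
    processed.add(partition_id)  # commit marker, written only after all rows

# Simulate the reported scenario: partition 371 runs in stage 19 and is
# resubmitted in stage 19 (retry 1); the second run becomes a no-op.
processed, sink = set(), []
process_partition_idempotently(371, ["a", "b"], processed, sink)  # original attempt
process_partition_idempotently(371, ["a", "b"], processed, sink)  # retried attempt
print(len(sink))  # rows written once, not twice -> 2
```

In a real job the commit marker would have to live in storage visible to all executors (a database, object store marker file, etc.), since the two attempts may run on different machines.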