Aaruna Godthi created SPARK-34109:
-------------------------------------

             Summary: Killing executors excluded on failure results in additional executors being marked as excluded due to fetch failures
                 Key: SPARK-34109
                 URL: https://issues.apache.org/jira/browse/SPARK-34109
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Shuffle, Spark Core
    Affects Versions: 3.0.1, 3.0.0
            Reporter: Aaruna Godthi


Configuration:

```
spark.excludeOnFailure.enabled: true # aka deprecated spark.blacklist.enabled
spark.excludeOnFailure.application.fetchFailure.enabled: true # aka deprecated spark.blacklist.application.fetchFailure.enabled
spark.excludeOnFailure.killExcludedExecutors: true # aka deprecated spark.blacklist.killBlacklistedExecutors
```
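
For anyone trying to reproduce this, here is a minimal sketch of how the three settings above could be turned into `spark-submit --conf` arguments. The helper `conf_flags` and the dictionary name are hypothetical, made up for illustration; only the configuration keys come from this report. As far as I know the `spark.excludeOnFailure.*` names only exist from Spark 3.1.0 onwards, so on the affected 3.0.x versions the deprecated `spark.blacklist.*` names would be the ones to pass.

```python
# New setting names from this report, mapped to the deprecated names it mentions.
EXCLUDE_ON_FAILURE_CONF = {
    "spark.excludeOnFailure.enabled":
        "spark.blacklist.enabled",
    "spark.excludeOnFailure.application.fetchFailure.enabled":
        "spark.blacklist.application.fetchFailure.enabled",
    "spark.excludeOnFailure.killExcludedExecutors":
        "spark.blacklist.killBlacklistedExecutors",
}


def conf_flags(use_deprecated_names=False):
    """Hypothetical helper: build `--conf key=true` arguments for all three settings."""
    flags = []
    for new_name, old_name in EXCLUDE_ON_FAILURE_CONF.items():
        key = old_name if use_deprecated_names else new_name
        flags += ["--conf", f"{key}=true"]
    return flags


if __name__ == "__main__":
    # Print the arguments using the deprecated names (assumed to apply to 3.0.x).
    print(" ".join(conf_flags(use_deprecated_names=True)))
```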

 

With this configuration, we have noticed that when a few executors are
excluded due to task failures (possibly caused by host issues), those
executors are killed after being excluded.

However, when other executors then try to fetch shuffle blocks from these
killed executors, the fetches fail, and these other executors also end up
excluded because of `spark.excludeOnFailure.application.fetchFailure.enabled`.

Instead, fetch failures against executors that have already been excluded
should not be counted when excluding executors based on
`spark.excludeOnFailure.application.fetchFailure.enabled`.
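
To make the request concrete, here is a toy model of the behaviour described above and of the proposed change. It is only a sketch that follows the description in this report; it is not Spark's scheduler code, and the class `ExclusionModel`, its methods, and the executor ids are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExclusionModel:
    """Toy model of application-level exclusion as described in this report."""
    # Proposed behaviour: ignore fetch failures whose source is already excluded.
    skip_fetch_failures_from_excluded: bool
    excluded: set = field(default_factory=set)

    def exclude_for_task_failures(self, executor_id):
        # The executor is excluded (and, with killExcludedExecutors, killed).
        self.excluded.add(executor_id)

    def on_fetch_failure(self, fetching_executor, source_executor):
        # With the proposed check, a failed fetch from an already excluded
        # executor does not count against the fetching executor.
        if self.skip_fetch_failures_from_excluded and source_executor in self.excluded:
            return
        # Observed behaviour per this report: the fetching executor is excluded.
        self.excluded.add(fetching_executor)


def simulate(skip):
    model = ExclusionModel(skip_fetch_failures_from_excluded=skip)
    model.exclude_for_task_failures("exec-1")
    # Two healthy executors fail to fetch shuffle blocks from the killed one.
    for fetcher in ("exec-2", "exec-3"):
        model.on_fetch_failure(fetcher, "exec-1")
    return sorted(model.excluded)


if __name__ == "__main__":
    print(simulate(skip=False))  # ['exec-1', 'exec-2', 'exec-3']
    print(simulate(skip=True))   # ['exec-1']
```

Without the check, a single bad executor cascades into every executor that reads its shuffle output; with it, only the executor that actually had task failures stays excluded.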



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
