Aaruna Godthi created SPARK-34109:
-------------------------------------

             Summary: Killing executors excluded on failure, results in additional executors being marked as excluded due to fetch failures
                 Key: SPARK-34109
                 URL: https://issues.apache.org/jira/browse/SPARK-34109
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Shuffle, Spark Core
    Affects Versions: 3.0.1, 3.0.0
            Reporter: Aaruna Godthi

Configuration:
```
spark.excludeOnFailure.enabled: true  # aka deprecated spark.blacklist.enabled
spark.excludeOnFailure.application.fetchFailure.enabled: true  # aka deprecated spark.blacklist.application.fetchFailure.enabled
spark.excludeOnFailure.killExcludedExecutors: true  # aka deprecated spark.blacklist.killBlacklistedExecutors
```

With this configuration, we have noticed that when a few executors are excluded due to task failures (possibly caused by host issues), those executors are killed after being excluded. However, when other executors then try to fetch shuffle blocks from the killed executors, these other executors also end up being excluded because of `spark.excludeOnFailure.application.fetchFailure.enabled`.

Instead, fetch failures that occur while fetching from already-excluded executors should not be counted when excluding executors based on `spark.excludeOnFailure.application.fetchFailure.enabled`.
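
To make the requested behaviour concrete, here is a minimal sketch in Python. It is a toy model only, not Spark's scheduler code: the class `ExclusionTracker`, its methods, and the executor IDs are all hypothetical names introduced for illustration. It also assumes, as described above, that a fetch failure would otherwise cause the fetching executor to be excluded.

```python
from dataclasses import dataclass, field


@dataclass
class ExclusionTracker:
    """Toy model of the proposed behaviour (hypothetical, not Spark's API)."""

    excluded: set = field(default_factory=set)
    killed: set = field(default_factory=set)

    def exclude_for_task_failures(self, executor_id, kill=True):
        # Models excludeOnFailure.enabled plus killExcludedExecutors:
        # the executor is excluded and, optionally, killed.
        self.excluded.add(executor_id)
        if kill:
            self.killed.add(executor_id)

    def on_fetch_failure(self, fetching_executor, source_executor):
        """Return True if this fetch failure causes a new exclusion."""
        # Proposed change: a failed fetch from an executor that was already
        # excluded (and possibly killed) is expected, so it is not counted.
        if source_executor in self.excluded:
            return False
        # Assumption taken from the report: otherwise the fetching executor
        # is the one that ends up excluded.
        self.excluded.add(fetching_executor)
        return True


tracker = ExclusionTracker()
tracker.exclude_for_task_failures("exec-1")

# exec-2 fails to fetch shuffle blocks from the killed exec-1: ignored.
ignored = tracker.on_fetch_failure("exec-2", "exec-1")

# exec-3 fails to fetch from exec-4, which was never excluded: counted.
counted = tracker.on_fetch_failure("exec-3", "exec-4")
```

In this sketch only the second fetch failure leads to a new exclusion; the first is skipped because its source executor had already been excluded and killed, which is the case this issue asks to be ignored.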