[ https://issues.apache.org/jira/browse/SPARK-53151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anthony Broguy updated SPARK-53151:
-----------------------------------
Description:
h3. Summary

The Spark property "spark.executor.maxNumFailures" was introduced in Spark 3.5.0. It is supposed to define the maximum number of executor failures before the application is failed; the default value is "numExecutors * 2, with minimum of 3".

However, the property does not seem to be taken into account: executors killed due to OOM are recreated, killed again due to OOM, and so on, far exceeding the limit defined by the property. This results in an infinite loop.

This impacts all Spark 3.5.x versions on K8S. In Spark 4.0.0 on K8S, the property works as expected.

h3. Expected Behavior

When the maximum number of executor failures is reached, the Spark job should stop and raise this error message:

{code:bash}
ERROR ExecutorPodsLifecycleManager: Max number of executor failures (4) reached
{code}

h3. Actual Behavior

Below are some Spark 3.5.6 driver logs showing that executors failed 950 times, far more than the limit, which was 4. The job ran for 1 day because of this infinite loop before I stopped it manually.

{code:bash}
2025-08-06 16:54:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for x.x.x.x:46272
2025-08-06 16:54:38 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.x:46284) with ID 950, ResourceProfileId 0
2025-08-06 16:54:38 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.x:38291 with 117.0 MiB RAM, BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:54:38 INFO TaskSetManager: Starting task 1.474 in stage 7.0 (TID 961) (x.x.x.x, executor 950, partition 1, ANY, 9818 bytes)
2025-08-06 16:54:38 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on x.x.x.x:38291 (size: 18.6 KiB, free: 116.9 MiB)
2025-08-06 16:54:39 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on x.x.x.x:38291 (size: 38.3 KiB, free: 116.9 MiB)
2025-08-06 16:55:02 INFO BlockManagerInfo: Added rdd_21_1 in memory on x.x.x.x:38291 (size: 55.6 MiB, free: 61.3 MiB)
2025-08-06 16:55:10 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 950.
2025-08-06 16:55:10 INFO DAGScheduler: Executor lost: 950 (epoch 950)
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 950 from BlockManagerMaster.
2025-08-06 16:55:10 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_1 !
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:55:10 INFO BlockManagerMaster: Removed 950 successfully in removeExecutor
2025-08-06 16:55:10 INFO DAGScheduler: Shuffle files lost for executor: 950 (epoch 950)
2025-08-06 16:55:25 WARN HeartbeatReceiver: Removing executor 949 with no recent heartbeats: 135723 ms exceeds timeout 120000 ms
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Requesting to kill executor(s) 949
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Actual list of executor(s) to be killed is 949
2025-08-06 16:55:25 INFO TaskSchedulerImpl: Executor 949 on x.x.x.x killed by driver.
2025-08-06 16:55:25 INFO TaskSetManager: task 0.474 in stage 7.0 (TID 960) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
2025-08-06 16:55:25 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 949
2025-08-06 16:55:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 949 from BlockManagerMaster.
2025-08-06 16:55:30 INFO KubernetesClusterSchedulerBackend: Forcefully deleting 1 pods (out of 1) that are still running after graceful shutdown period.
2025-08-06 16:55:31 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 949 from BlockManagerMaster.
2025-08-06 16:55:31 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 949
2025-08-06 16:55:32 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 1, sharedSlotFromPendingPods: 2147483647.
{code}
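On the K8S side, the same loop is visible as a stream of executor pods being OOM-killed and replaced. A hedged way to watch it while the job runs (the namespace and pod name are placeholders; "spark-role" is the label the Spark K8S backend puts on its pods):

{code:bash}
# Watch executor pods come and go; OOM-killed executors show up as terminated pods.
kubectl get pods -n <namespace> -l spark-role=executor --watch

# Inspect one of the dead executors; its last state should report Reason: OOMKilled.
kubectl describe pod <executor-pod-name> -n <namespace>
{code}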
h3. Steps to Reproduce

To reproduce this behavior, simply run the following Spark code using the Docker image [3.5.6-scala2.12-java17-python3-r-ubuntu|https://hub.docker.com/layers/apache/spark/3.5.6-scala2.12-java17-python3-r-ubuntu/images/sha256-96d4508680e09543ee1c6b2705c67176d4e0e1023b21c198a43a8c846eb264b7]:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("<input>")
df.coalesce(1).write.parquet("<output>")
{code}

The Spark job must run on K8S (client mode or cluster mode). The input data must be large enough to produce an OOM exception on executors. For instance, I used a 600 MB CSV file as input with the following low Spark resources:

{code:bash}
--conf "spark.driver.cores=1" \
--conf "spark.driver.memory=2g" \
--conf "spark.executor.instances=2" \
--conf "spark.executor.cores=1" \
--conf "spark.executor.memory=512m" \
--conf "spark.executor.memoryOverhead=64m" \
{code}
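For completeness, here is a sketch of the kind of spark-submit invocation used for the reproduction. The API server address, namespace and application path are placeholders rather than the exact values from my environment, and the explicit "spark.executor.maxNumFailures" line is only there to show where the limit would normally be set (4 is the default for 2 executors):

{code:bash}
# Illustrative submit command; adjust the placeholders to your cluster.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-53151-repro \
  --conf "spark.kubernetes.namespace=<namespace>" \
  --conf "spark.kubernetes.container.image=apache/spark:3.5.6-scala2.12-java17-python3-r-ubuntu" \
  --conf "spark.driver.cores=1" \
  --conf "spark.driver.memory=2g" \
  --conf "spark.executor.instances=2" \
  --conf "spark.executor.cores=1" \
  --conf "spark.executor.memory=512m" \
  --conf "spark.executor.memoryOverhead=64m" \
  --conf "spark.executor.maxNumFailures=4" \
  local:///path/to/repro.py
{code}

With a working limit, this job should abort after about 4 executor failures with the ExecutorPodsLifecycleManager error shown above; on 3.5.6 it keeps re-requesting executors instead.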
was:
h3. Summary

The following Spark property "spark.executor.maxNumFailures" has been introduced in Spark 3.5.0. This property should define the maximum number of executor failures before failing the application, default value is "numExecutors * 2, with minimum of 3".

But it seems it's not taken into account, so when executors are killed due to OOM, they are recreated, again killed due to OOM, and so on... and it exceeds the limit defined by this property. This results in an infinite loop.

It impacts all Spark versions 3.5.x on K8S. However in Spark 4.0.0 on K8S, this property works as expected.

h3. Actual Behavior

Below some driver logs where you can see that executors have been recreated 950 times. The Spark 3.5.6 job ran for 1 day cause of this infinite loop before I stopped it manually.

{code:bash}
2025-08-06 16:54:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for x.x.x.x:46272
2025-08-06 16:54:38 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.x:46284) with ID 950, ResourceProfileId 0
2025-08-06 16:54:38 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.x:38291 with 117.0 MiB RAM, BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:54:38 INFO TaskSetManager: Starting task 1.474 in stage 7.0 (TID 961) (x.x.x.x, executor 950, partition 1, ANY, 9818 bytes)
2025-08-06 16:54:38 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on x.x.x.x:38291 (size: 18.6 KiB, free: 116.9 MiB)
2025-08-06 16:54:39 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on x.x.x.x:38291 (size: 38.3 KiB, free: 116.9 MiB)
2025-08-06 16:55:02 INFO BlockManagerInfo: Added rdd_21_1 in memory on x.x.x.x:38291 (size: 55.6 MiB, free: 61.3 MiB)
2025-08-06 16:55:10 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 950.
2025-08-06 16:55:10 INFO DAGScheduler: Executor lost: 950 (epoch 950)
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 950 from BlockManagerMaster.
2025-08-06 16:55:10 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_1 !
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:55:10 INFO BlockManagerMaster: Removed 950 successfully in removeExecutor
2025-08-06 16:55:10 INFO DAGScheduler: Shuffle files lost for executor: 950 (epoch 950)
2025-08-06 16:55:25 WARN HeartbeatReceiver: Removing executor 949 with no recent heartbeats: 135723 ms exceeds timeout 120000 ms
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Requesting to kill executor(s) 949
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Actual list of executor(s) to be killed is 949
2025-08-06 16:55:25 INFO TaskSchedulerImpl: Executor 949 on x.x.x.x killed by driver.
2025-08-06 16:55:25 INFO TaskSetManager: task 0.474 in stage 7.0 (TID 960) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
2025-08-06 16:55:25 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 949
2025-08-06 16:55:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 949 from BlockManagerMaster.
2025-08-06 16:55:30 INFO KubernetesClusterSchedulerBackend: Forcefully deleting 1 pods (out of 1) that are still running after graceful shutdown period.
2025-08-06 16:55:31 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 949 from BlockManagerMaster.
2025-08-06 16:55:31 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 949
2025-08-06 16:55:32 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 1, sharedSlotFromPendingPods: 2147483647.
{code}
h3. Steps to Reproduce

To reproduce this behavior, simply run the following Spark code using the Docker image [3.5.6-scala2.12-java17-python3-r-ubuntu|https://hub.docker.com/layers/apache/spark/3.5.6-scala2.12-java17-python3-r-ubuntu/images/sha256-96d4508680e09543ee1c6b2705c67176d4e0e1023b21c198a43a8c846eb264b7] :

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("<input>")
df.coalesce(1).write.parquet("<output>")
{code}

The Spark job must run on K8S (client mode or cluster mode)
The input data must be large enough to produce an OOM exception on executors. For instance, I used a 600MB CSV file as input with the following low Spark resources:

{code:bash}
--conf "spark.driver.cores=1" \
--conf "spark.driver.memory=2g" \
--conf "spark.executor.instances=2" \
--conf "spark.executor.cores=1" \
--conf "spark.executor.memory=512m" \
--conf "spark.executor.memoryOverhead=64m" \
{code}


> "spark.executor.maxNumFailures" not working / OOM exception on executors results in an infinite loop
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-53151
>                 URL: https://issues.apache.org/jira/browse/SPARK-53151
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.6
>        Environment: Spark 3.5.6 on Kubernetes - Docker image "spark:3.5.6-scala2.12-java17-python3-r-ubuntu"
>            Reporter: Anthony Broguy
>            Priority: Major