[ https://issues.apache.org/jira/browse/SPARK-53151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anthony Broguy updated SPARK-53151:
-----------------------------------
    Description: 
h3. Summary

The Spark property "spark.executor.maxNumFailures" was introduced in Spark 
3.5.0. It defines the maximum number of executor failures before the 
application fails; the default value is "numExecutors * 2, with a minimum of 3".

However, the property does not seem to be taken into account: when executors 
are killed due to OOM, they are recreated, killed again due to OOM, and so on, 
far exceeding the limit defined by this property.

This results in an infinite loop.
It impacts all Spark 3.5.x versions on K8S.
In Spark 4.0.0 on K8S, however, the property works as expected.
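
For reference, the limit can also be set explicitly rather than relying on the 
default. A minimal sketch (the value 4 simply mirrors the default for 2 
executors and is only illustrative):
{code:python}
from pyspark.sql import SparkSession

# Illustrative only: pin the executor-failure limit explicitly instead of
# relying on the default (numExecutors * 2, with a minimum of 3).
spark = (
    SparkSession.builder
    .config("spark.executor.maxNumFailures", "4")
    .getOrCreate()
)
{code}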

h3. Expected Behavior

When the maximum number of executor failures is reached, the Spark job should 
stop and raise this error message:
{code:bash}
ERROR ExecutorPodsLifecycleManager: Max number of executor failures (4) reached
{code}

h3. Actual Behavior

Below are some Spark 3.5.6 driver logs showing that executors failed 950 times 
(far more than the limit, which was 4); the logs show the latest executor 
registering with ID 950 at epoch 950.
The job ran for a day in this infinite loop before I stopped it manually.
{code:bash}
2025-08-06 16:54:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for x.x.x.x:46272
2025-08-06 16:54:38 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (x.x.x.x:46284) with ID 950,  ResourceProfileId 0
2025-08-06 16:54:38 INFO BlockManagerMasterEndpoint: Registering block manager x.x.x.x:38291 with 117.0 MiB RAM, BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:54:38 INFO TaskSetManager: Starting task 1.474 in stage 7.0 (TID 961) (x.x.x.x, executor 950, partition 1, ANY, 9818 bytes)
2025-08-06 16:54:38 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on x.x.x.x:38291 (size: 18.6 KiB, free: 116.9 MiB)
2025-08-06 16:54:39 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on x.x.x.x:38291 (size: 38.3 KiB, free: 116.9 MiB)
2025-08-06 16:55:02 INFO BlockManagerInfo: Added rdd_21_1 in memory on x.x.x.x:38291 (size: 55.6 MiB, free: 61.3 MiB)
2025-08-06 16:55:10 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 950.
2025-08-06 16:55:10 INFO DAGScheduler: Executor lost: 950 (epoch 950)
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 950 from BlockManagerMaster.
2025-08-06 16:55:10 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_21_1 !
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:55:10 INFO BlockManagerMaster: Removed 950 successfully in removeExecutor
2025-08-06 16:55:10 INFO DAGScheduler: Shuffle files lost for executor: 950 (epoch 950)
2025-08-06 16:55:25 WARN HeartbeatReceiver: Removing executor 949 with no recent heartbeats: 135723 ms exceeds timeout 120000 ms
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Requesting to kill executor(s) 949
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Actual list of executor(s) to be killed is 949
2025-08-06 16:55:25 INFO TaskSchedulerImpl: Executor 949 on x.x.x.x killed by driver.
2025-08-06 16:55:25 INFO TaskSetManager: task 0.474 in stage 7.0 (TID 960) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
2025-08-06 16:55:25 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 949
2025-08-06 16:55:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 949 from BlockManagerMaster.
2025-08-06 16:55:30 INFO KubernetesClusterSchedulerBackend: Forcefully deleting 1 pods (out of 1) that are still running after graceful shutdown period.
2025-08-06 16:55:31 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 949 from BlockManagerMaster.
2025-08-06 16:55:31 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 949
2025-08-06 16:55:32 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 1, sharedSlotFromPendingPods: 2147483647.
{code}

h3. Steps to Reproduce

To reproduce this behavior, simply run the following Spark code using the 
Docker image 
[3.5.6-scala2.12-java17-python3-r-ubuntu|https://hub.docker.com/layers/apache/spark/3.5.6-scala2.12-java17-python3-r-ubuntu/images/sha256-96d4508680e09543ee1c6b2705c67176d4e0e1023b21c198a43a8c846eb264b7]:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# coalesce(1) funnels the whole dataset through a single partition, so a
# sufficiently large input overwhelms the small executors and triggers OOM.
df = spark.read.csv("<input>")
df.coalesce(1).write.parquet("<output>")
{code}
The Spark job must run on K8S (client mode or cluster mode).
The input data must be large enough to produce an OOM exception on the executors.
For instance, I used a 600 MB CSV file as input with the following low Spark 
resources:
{code:bash}
--conf "spark.driver.cores=1" \
--conf "spark.driver.memory=2g" \
--conf "spark.executor.instances=2" \
--conf "spark.executor.cores=1" \
--conf "spark.executor.memory=512m" \
--conf "spark.executor.memoryOverhead=64m" \
{code}
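
For completeness, here is a sketch of the same reproduction driven entirely 
from PySpark in client mode. The K8S master URL, namespace, input and output 
paths are placeholders; in practice the same options can be passed to 
spark-submit instead:
{code:python}
from pyspark.sql import SparkSession

# Sketch of the reproduction in client mode; "k8s://https://<api-server>",
# "<namespace>", "<input>" and "<output>" are placeholders.
spark = (
    SparkSession.builder
    .master("k8s://https://<api-server>")
    .config("spark.kubernetes.namespace", "<namespace>")
    .config("spark.kubernetes.container.image",
            "apache/spark:3.5.6-scala2.12-java17-python3-r-ubuntu")
    .config("spark.driver.cores", "1")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "1")
    .config("spark.executor.memory", "512m")
    .config("spark.executor.memoryOverhead", "64m")
    .getOrCreate()
)

df = spark.read.csv("<input>")
df.coalesce(1).write.parquet("<output>")
{code}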

> "spark.executor.maxNumFailures" not working / OOM exception on executors 
> results in an infinite loop
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-53151
>                 URL: https://issues.apache.org/jira/browse/SPARK-53151
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.6
>         Environment: Spark 3.5.6 on Kubernetes - Docker image 
> "spark:3.5.6-scala2.12-java17-python3-r-ubuntu"
>            Reporter: Anthony Broguy
>            Priority: Major
>


