Anthony Broguy created SPARK-53151:
--------------------------------------

             Summary: "spark.executor.maxNumFailures" not working / OOM 
exception on executors results in an infinite loop
                 Key: SPARK-53151
                 URL: https://issues.apache.org/jira/browse/SPARK-53151
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.6
         Environment: Spark 3.5.6 on Kubernetes - Docker image 
"spark:3.5.6-scala2.12-java17-python3-r-ubuntu"
            Reporter: Anthony Broguy


h3. Summary

The Spark property "spark.executor.maxNumFailures" was introduced in Spark 3.5.0. It defines the maximum number of executor failures before the application fails; its default value is "numExecutors * 2, with minimum of 3".

However, the property does not seem to be taken into account: when executors are killed due to OOM, they are recreated, killed again due to OOM, and so on, exceeding the limit defined by this property.

This results in an infinite loop.
It impacts all Spark 3.5.x versions on K8S.
In Spark 4.0.0 on K8S, however, the property works as expected.
h3. Actual Behavior

Below are some driver logs showing that executors were recreated 950 times. Because of this infinite loop, the Spark 3.5.6 job ran for 1 day before I stopped it manually.
{code:bash}
2025-08-06 16:54:37 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found 
for x.x.x.x:46272
2025-08-06 16:54:38 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor 
NettyRpcEndpointRef(spark-client://Executor) (x.x.x.x:46284) with ID 950,  
ResourceProfileId 0
2025-08-06 16:54:38 INFO BlockManagerMasterEndpoint: Registering block manager 
x.x.x.x:38291 with 117.0 MiB RAM, BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:54:38 INFO TaskSetManager: Starting task 1.474 in stage 7.0 (TID 
961) (x.x.x.x, executor 950, partition 1, ANY, 9818 bytes)
2025-08-06 16:54:38 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory 
on x.x.x.x:38291 (size: 18.6 KiB, free: 116.9 MiB)
2025-08-06 16:54:39 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory 
on x.x.x.x:38291 (size: 38.3 KiB, free: 116.9 MiB)
2025-08-06 16:55:02 INFO BlockManagerInfo: Added rdd_21_1 in memory on 
x.x.x.x:38291 (size: 55.6 MiB, free: 61.3 MiB)
2025-08-06 16:55:10 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 
950.
2025-08-06 16:55:10 INFO DAGScheduler: Executor lost: 950 (epoch 950)
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 
950 from BlockManagerMaster.
2025-08-06 16:55:10 WARN BlockManagerMasterEndpoint: No more replicas available 
for rdd_21_1 !
2025-08-06 16:55:10 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(950, x.x.x.x, 38291, None)
2025-08-06 16:55:10 INFO BlockManagerMaster: Removed 950 successfully in 
removeExecutor
2025-08-06 16:55:10 INFO DAGScheduler: Shuffle files lost for executor: 950 
(epoch 950)
2025-08-06 16:55:25 WARN HeartbeatReceiver: Removing executor 949 with no 
recent heartbeats: 135723 ms exceeds timeout 120000 ms
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Requesting to kill 
executor(s) 949
2025-08-06 16:55:25 INFO KubernetesClusterSchedulerBackend: Actual list of 
executor(s) to be killed is 949
2025-08-06 16:55:25 INFO TaskSchedulerImpl: Executor 949 on x.x.x.x killed by 
driver.
2025-08-06 16:55:25 INFO TaskSetManager: task 0.474 in stage 7.0 (TID 960) 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.
2025-08-06 16:55:25 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:25 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 949
2025-08-06 16:55:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 
949 from BlockManagerMaster.
2025-08-06 16:55:30 INFO KubernetesClusterSchedulerBackend: Forcefully deleting 
1 pods (out of 1) that are still running after graceful shutdown period.
2025-08-06 16:55:31 INFO BlockManagerMaster: Removal of executor 949 requested
2025-08-06 16:55:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 
949 from BlockManagerMaster.
2025-08-06 16:55:31 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 949
2025-08-06 16:55:32 INFO ExecutorPodsAllocator: Going to request 1 executors 
from Kubernetes for ResourceProfile Id: 0, target: 2, known: 1, 
sharedSlotFromPendingPods: 2147483647.
2025-08-06 16:55:32 INFO KubernetesClientUtils: Spark configuration files 
loaded from Some(/opt/spark/conf) : log4j2.properties,metrics.properties
{code}
h3. Steps to Reproduce

To reproduce this behavior, simply run the following Spark code using the Docker image [3.5.6-scala2.12-java17-python3-r-ubuntu|https://hub.docker.com/layers/apache/spark/3.5.6-scala2.12-java17-python3-r-ubuntu/images/sha256-96d4508680e09543ee1c6b2705c67176d4e0e1023b21c198a43a8c846eb264b7]:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("<input>")
# coalesce(1) funnels the write through a single task, which OOMs with the
# small executor memory configured below
df.coalesce(1).write.parquet("<output>")
{code}
The Spark job must run on K8S (client mode or cluster mode), and the input data must be large enough to produce an OOM exception on the executors. For instance, I used a 600 MB CSV file as input with the following low Spark resource settings:
{code:bash}
--conf "spark.driver.cores=1" \
--conf "spark.driver.memory=2g" \
--conf "spark.executor.instances=2" \
--conf "spark.executor.cores=1" \
--conf "spark.executor.memory=512m" \
--conf "spark.executor.memoryOverhead=64m" \
{code}
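For completeness, a full submit command along these lines should trigger the issue; the K8s API server URL, namespace, and script path are placeholders for your environment, and the image tag is the one linked above:
{code:bash}
spark-submit \
  --master "k8s://https://<k8s-apiserver>:<port>" \
  --deploy-mode cluster \
  --conf "spark.kubernetes.namespace=<namespace>" \
  --conf "spark.kubernetes.container.image=apache/spark:3.5.6-scala2.12-java17-python3-r-ubuntu" \
  --conf "spark.driver.cores=1" \
  --conf "spark.driver.memory=2g" \
  --conf "spark.executor.instances=2" \
  --conf "spark.executor.cores=1" \
  --conf "spark.executor.memory=512m" \
  --conf "spark.executor.memoryOverhead=64m" \
  local:///path/to/repro.py
{code}
With 2 executor instances, the default limit should be numExecutors * 2 = 4, yet the driver logs above show executor ID 950 still being registered.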


