Hi all,
I'm running Spark on Kubernetes on AWS, using only spot instances for executors,
with dynamic allocation enabled. This particular job is triggered by Airflow,
and it has hit this bug [1] six times in a row. I had recently switched to
PersistentVolumeClaim-backed local storage with
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
but kept spark.dynamicAllocation.shuffleTracking.enabled=true. Upon review, I
see in the notes for spark.dynamicAllocation.enabled [2] that these settings
are alternatives ("or"), not meant to be combined ("and"). However, when I set
spark.dynamicAllocation.shuffleTracking.enabled=false, the job crashes with:
org.apache.spark.SparkException: Dynamic allocation of executors requires one of the following conditions:
1) enabling external shuffle service through spark.shuffle.service.enabled.
2) enabling shuffle tracking through spark.dynamicAllocation.shuffleTracking.enabled.
3) enabling shuffle blocks decommission through spark.decommission.enabled and spark.storage.decommission.shuffleBlocks.enabled.
4) (Experimental) configuring spark.shuffle.sort.io.plugin.class to use a custom ShuffleDataIO who's ShuffleDriverComponents supports reliable storage.
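For reference, as far as I can tell this check happens in
ExecutorAllocationManager at startup, and condition 4 comes down to whether the
configured plugin's ShuffleDriverComponents reports
supportsReliableStorage() == true. A minimal sketch of the validation as I read
it (the flag names and exception type here are mine, not Spark's):

object DynAllocValidationSketch {
  def main(args: Array[String]): Unit = {
    // Each flag mirrors one of the four conditions in the error message.
    val shuffleServiceEnabled  = false // spark.shuffle.service.enabled
    val shuffleTrackingEnabled = false // spark.dynamicAllocation.shuffleTracking.enabled
    val decommissionEnabled    = false // spark.decommission.enabled &&
                                       //   spark.storage.decommission.shuffleBlocks.enabled
    val reliableShuffleStorage = false // ShuffleDriverComponents.supportsReliableStorage()

    // If none of the four holds, the driver aborts
    // (Spark throws the SparkException quoted above).
    if (!shuffleServiceEnabled && !shuffleTrackingEnabled &&
        !decommissionEnabled && !reliableShuffleStorage) {
      throw new IllegalStateException(
        "Dynamic allocation of executors requires one of the following conditions ...")
    }
  }
}

If I'm reading that right, my crash means KubernetesLocalDiskShuffleDataIO's
driver components aren't being treated as reliable storage in my setup, which
is why I'm wondering whether that needs to be switched on somewhere.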
Am I hitting this bug unavoidably? Or is there a configuration I'm missing that
would let
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
stand in for
spark.dynamicAllocation.shuffleTracking.enabled=true?
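For completeness, the only other shuffle-service-free route I can see in that
list is option 3. I assume it would look like this in spark-defaults.conf
(untested on my side; I believe spark.storage.decommission.enabled is also
needed for the shuffleBlocks setting to take effect):

spark.decommission.enabled true
spark.storage.decommission.enabled true
spark.storage.decommission.shuffleBlocks.enabled true

Though decommission-based migration seems like a different mechanism from what
the PVC-backed plugin is meant to provide.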
I'm using Spark 3.5.1; here's my full spark-defaults.conf just in case:
spark.checkpoint.compress true
spark.driver.cores 1
spark.driver.maxResultSize 2g
spark.driver.memory 5140m
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorAllocationRatio 0.33
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 30
spark.eventLog.enabled true
spark.executor.cores 3
spark.executor.logs.rolling.enableCompression true
spark.executor.logs.rolling.maxRetainedFiles 48
spark.executor.logs.rolling.strategy time
spark.executor.logs.rolling.time.interval hourly
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.fast.upload true
spark.kryo.registrationRequired false
spark.kryo.unsafe false
spark.kryoserializer.buffer 1m
spark.kryoserializer.buffer.max 1g
spark.kubernetes.driver.limit.cores 750m
spark.kubernetes.driver.ownPersistentVolumeClaim true
spark.kubernetes.driver.request.cores 750m
spark.kubernetes.driver.reusePersistentVolumeClaim true
spark.kubernetes.driver.waitToReusePersistentVolumeClaim true
spark.kubernetes.executor.limit.cores 3700m
spark.kubernetes.executor.request.cores 3700m
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path /data/spark-x/executor-x
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit 20Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass ebs-sc
spark.kubernetes.namespace spark
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
spark.sql.orc.compression.codec zlib
spark.sql.pyspark.jvmStacktrace.enabled true
spark.sql.sources.partitionOverwriteMode dynamic
spark.sql.streaming.kafka.useDeprecatedOffsetFetching false
spark.submit.deployMode cluster
Thanks,
Aaron
[1] https://issues.apache.org/jira/browse/SPARK-45858
[2] https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation