minor correction:
>> (hence our *ReadWriteOnce* Storage should be sufficient right?...

On Wed, Mar 16, 2022 at 11:33 AM Andreas Weise <andreas.we...@gmail.com>
wrote:

> Hi,
>
> when using dynamic allocation on k8s with dynamic PVC reuse, I see that
> only a few executors are running. 2 of 4 are stuck in 'ContainerCreating'
> with events like:
> spark-medium-1x-38b7c47f92340e9e-exec-3 : Multi-Attach error for volume
> "pvc-c184e264-4a6d-406f-8d95-c59ff9e074d8" Volume is already used by pod(s)
> spark-medium-1x-38b7c47f92340e9e-exec-2
>
> According to the documentation, only PVCs of deleted executors should be
> reused (hence our ReadOnlyMany Storage should be sufficient right?). But
> the executor of the reused PVC is still running. Is this expected?
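
(To double-check which pod still holds the claim, something like the
following sketch against the kubernetes Python client should work; the
namespace is an assumption, the claim name is taken from the log below:)

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()
    ns = "spark"  # assumption: namespace the executor pods run in

    # Access modes and phase of all PVCs in the namespace
    for pvc in v1.list_namespaced_persistent_volume_claim(ns).items:
        print(pvc.metadata.name, pvc.spec.access_modes, pvc.status.phase)

    # Pods that still mount the executor claim from the log excerpt below
    claim = "spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0"
    for pod in v1.list_namespaced_pod(ns).items:
        for vol in pod.spec.volumes or []:
            src = vol.persistent_volume_claim
            if src and src.claim_name == claim:
                print(pod.metadata.name, pod.status.phase)

If the pod from the Multi-Attach event still shows up as Running there,
the claim is clearly not free yet.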
>
> Config:
>
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.maxExecutors=4
> spark.dynamicAllocation.minExecutors=1
> spark.dynamicAllocation.executorIdleTimeout=60s
> spark.dynamicAllocation.shuffleTracking.enabled=true
> spark.kubernetes.driver.ownPersistentVolumeClaim=true
> spark.kubernetes.driver.reusePersistentVolumeClaim=true
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=sc-openshift-default
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=1Gi
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data/
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
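
(For completeness, roughly how the options above are set from PySpark in
our case; the app name below is a placeholder, the property keys are
exactly the ones listed above:)

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark-medium-1x")  # placeholder name
        .config("spark.dynamicAllocation.enabled", "true")
        # ... remaining spark.dynamicAllocation.* keys as listed above ...
        .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
        .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
        .config("spark.kubernetes.executor.volumes.persistentVolumeClaim"
                ".data.options.claimName", "OnDemand")
        .config("spark.kubernetes.executor.volumes.persistentVolumeClaim"
                ".data.options.storageClass", "sc-openshift-default")
        .config("spark.kubernetes.executor.volumes.persistentVolumeClaim"
                ".data.options.sizeLimit", "1Gi")
        .config("spark.kubernetes.executor.volumes.persistentVolumeClaim"
                ".data.mount.path", "/tmp/data/")
        .config("spark.kubernetes.executor.volumes.persistentVolumeClaim"
                ".data.mount.readOnly", "false")
        .getOrCreate()
    )

(The same keys can of course be passed via --conf on spark-submit.)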
>
> Log excerpt (full log attached):
>
> INFO [2022-03-16 11:09:21,678] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Going to request 1 executors from Kubernetes for 
> ResourceProfile Id: 0, target: 1, known: 0, sharedSlotFromPendingPods: 
> 2147483647.
> INFO [2022-03-16 11:09:21,684] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Found 0 reusable PVCs from 0 PVCs
> INFO [2022-03-16 11:09:21,686] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Spark configuration files loaded from 
> Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : 
> log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:21,687] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:21,689] 
> ({FIFOScheduler-interpreter_510428346-Worker-1} Logging.scala[logInfo]:57) - 
> Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, 
> spark.dynamicAllocation.minExecutors and spark.executor.instances
> WARN [2022-03-16 11:09:21,690] 
> ({FIFOScheduler-interpreter_510428346-Worker-1} Logging.scala[logWarning]:69) 
> - Dynamic allocation without a shuffle service is an experimental feature.
> INFO [2022-03-16 11:09:21,775] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim 
> spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0 with StorageClass 
> sc-openshift-default
> INFO [2022-03-16 11:09:37,220] ({dispatcher-CoarseGrainedScheduler} 
> Logging.scala[logInfo]:57) - Registered executor 
> NettyRpcEndpointRef(spark-client://Executor) (10.128.6.67:57144) with ID 2,  
> ResourceProfileId 0
> INFO [2022-03-16 11:09:37,225] ({spark-listener-group-executorManagement} 
> Logging.scala[logInfo]:57) - New executor 2 has registered (new total is 1)
>
> ...
>
> INFO [2022-03-16 11:09:51,708] ({kubernetes-executor-snapshots-subscribers-1} 
> Logging.scala[logInfo]:57) - Going to request 1 executors from Kubernetes for 
> ResourceProfile Id: 0, target: 2, known: 1, sharedSlotFromPendingPods: 
> 2147483647.
> INFO [2022-03-16 11:09:51,709] ({spark-dynamic-executor-allocation} 
> Logging.scala[logInfo]:57) - Requesting 1 new executor because tasks are 
> backlogged (new desired total will be 2 for resource profile id: 0)
> INFO [2022-03-16 11:09:51,717] ({kubernetes-executor-snapshots-subscribers-1} 
> Logging.scala[logInfo]:57) - Found 1 reusable PVCs from 1 PVCs
> INFO [2022-03-16 11:09:51,719] ({kubernetes-executor-snapshots-subscribers-1} 
> Logging.scala[logInfo]:57) - Spark configuration files loaded from 
> Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : 
> log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:51,721] ({kubernetes-executor-snapshots-subscribers-1} 
> Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:51,726] ({kubernetes-executor-snapshots-subscribers-1} 
> Logging.scala[logInfo]:57) - Reuse PersistentVolumeClaim 
> spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0
> INFO [2022-03-16 11:09:52,713] ({spark-dynamic-executor-allocation} 
> Logging.scala[logInfo]:57) - Requesting 2 new executors because tasks are 
> backlogged (new desired total will be 4 for resource profile id: 0)
> INFO [2022-03-16 11:09:52,813] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Going to request 2 executors from Kubernetes for 
> ResourceProfile Id: 0, target: 4, known: 2, sharedSlotFromPendingPods: 
> 2147483646.
> INFO [2022-03-16 11:09:52,820] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Found 0 reusable PVCs from 1 PVCs
> INFO [2022-03-16 11:09:52,821] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Spark configuration files loaded from 
> Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : 
> log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:52,822] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:52,855] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim 
> spark-medium-1x-38b7c47f92340e9e-exec-4-pvc-0 with StorageClass 
> sc-openshift-default
> INFO [2022-03-16 11:09:52,866] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Spark configuration files loaded from 
> Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : 
> log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:52,867] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:52,899] ({kubernetes-executor-snapshots-subscribers-0} 
> Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim 
> spark-medium-1x-38b7c47f92340e9e-exec-5-pvc-0 with StorageClass 
> sc-openshift-default
>
>
> Best regards
>
> Andreas
>
> ...
>
>
