Minor correction: the sentence in my mail below should read

>> (hence our *ReadWriteOnce* Storage should be sufficient right?) ...

i.e. the storage in question is ReadWriteOnce, not ReadOnlyMany.
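A rough sketch of how one could confirm which access mode the on-demand claim actually got, and which pods still mount it when the Multi-Attach error appears, using the kubernetes Python client. The namespace is a placeholder and the claim name is taken from the log excerpt quoted below; adjust both for your cluster.

from kubernetes import client, config

NAMESPACE = "spark"  # placeholder: the namespace the executor pods run in
CLAIM = "spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0"  # from the log excerpt below

config.load_kube_config()
v1 = client.CoreV1Api()

# Read the claim and show what the storage class actually provisioned.
pvc = v1.read_namespaced_persistent_volume_claim(name=CLAIM, namespace=NAMESPACE)
print("access modes :", pvc.spec.access_modes)        # e.g. ['ReadWriteOnce']
print("storage class:", pvc.spec.storage_class_name)
print("phase        :", pvc.status.phase)

# A ReadWriteOnce volume can be attached to only one node at a time, so a claim
# that is reused while the previous executor pod still mounts it (possibly on
# another node) matches the Multi-Attach error seen in the pod events.
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    for vol in pod.spec.volumes or []:
        pvc_src = vol.persistent_volume_claim
        if pvc_src and pvc_src.claim_name == CLAIM:
            print("mounted by   :", pod.metadata.name, pod.status.phase)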
On Wed, Mar 16, 2022 at 11:33 AM Andreas Weise <andreas.we...@gmail.com> wrote:

> Hi,
>
> when using dynamic allocation on k8s with dynamic PVC reuse, I see that
> only a few executors are running. 2 of 4 are stuck in 'ContainerCreating'
> with events like:
>
> spark-medium-1x-38b7c47f92340e9e-exec-3 : Multi-Attach error for volume
> "pvc-c184e264-4a6d-406f-8d95-c59ff9e074d8" Volume is already used by pod(s)
> spark-medium-1x-38b7c47f92340e9e-exec-2
>
> According to the documentation, only PVCs of deleted executors should be
> reused (hence our ReadOnlyMany Storage should be sufficient, right?). But
> the executor of the reused PVC is still running. Is this expected?
>
> Config:
>
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.maxExecutors=4
> spark.dynamicAllocation.minExecutors=1
> spark.dynamicAllocation.executorIdleTimeout=60s
> spark.dynamicAllocation.shuffleTracking.enabled=true
> spark.kubernetes.driver.ownPersistentVolumeClaim=true
> spark.kubernetes.driver.reusePersistentVolumeClaim=true
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=sc-openshift-default
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=1Gi
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data/
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
>
> Log excerpt (full log attached):
>
> INFO [2022-03-16 11:09:21,678] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 1, known: 0, sharedSlotFromPendingPods: 2147483647.
> INFO [2022-03-16 11:09:21,684] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Found 0 reusable PVCs from 0 PVCs
> INFO [2022-03-16 11:09:21,686] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:21,687] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:21,689] ({FIFOScheduler-interpreter_510428346-Worker-1} Logging.scala[logInfo]:57) - Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
> WARN [2022-03-16 11:09:21,690] ({FIFOScheduler-interpreter_510428346-Worker-1} Logging.scala[logWarning]:69) - Dynamic allocation without a shuffle service is an experimental feature.
> INFO [2022-03-16 11:09:21,775] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0 with StorageClass sc-openshift-default
> INFO [2022-03-16 11:09:37,220] ({dispatcher-CoarseGrainedScheduler} Logging.scala[logInfo]:57) - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.128.6.67:57144) with ID 2, ResourceProfileId 0
> INFO [2022-03-16 11:09:37,225] ({spark-listener-group-executorManagement} Logging.scala[logInfo]:57) - New executor 2 has registered (new total is 1)
>
> ...
>
> INFO [2022-03-16 11:09:51,708] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 1, sharedSlotFromPendingPods: 2147483647.
> INFO [2022-03-16 11:09:51,709] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:57) - Requesting 1 new executor because tasks are backlogged (new desired total will be 2 for resource profile id: 0)
> INFO [2022-03-16 11:09:51,717] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Found 1 reusable PVCs from 1 PVCs
> INFO [2022-03-16 11:09:51,719] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:51,721] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:51,726] ({kubernetes-executor-snapshots-subscribers-1} Logging.scala[logInfo]:57) - Reuse PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-2-pvc-0
> INFO [2022-03-16 11:09:52,713] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:57) - Requesting 2 new executors because tasks are backlogged (new desired total will be 4 for resource profile id: 0)
> INFO [2022-03-16 11:09:52,813] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 4, known: 2, sharedSlotFromPendingPods: 2147483646.
> INFO [2022-03-16 11:09:52,820] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Found 0 reusable PVCs from 1 PVCs
> INFO [2022-03-16 11:09:52,821] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:52,822] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:52,855] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-4-pvc-0 with StorageClass sc-openshift-default
> INFO [2022-03-16 11:09:52,866] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Spark configuration files loaded from Some(/opt/conda/lib/python3.9/site-packages/pyspark/conf) : log4j.properties,hive-site.xml
> INFO [2022-03-16 11:09:52,867] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Adding decommission script to lifecycle
> INFO [2022-03-16 11:09:52,899] ({kubernetes-executor-snapshots-subscribers-0} Logging.scala[logInfo]:57) - Trying to create PersistentVolumeClaim spark-medium-1x-38b7c47f92340e9e-exec-5-pvc-0 with StorageClass sc-openshift-default
>
> Best regards
> Andreas
>
> ...
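For anyone trying to reproduce the behaviour, the configuration quoted above maps roughly to the PySpark session setup sketched here. This is only a sketch: the app name and master URL are placeholders rather than values from the mail, and cluster-specific settings such as the container image are omitted.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pvc-reuse-test")                           # placeholder
    .master("k8s://https://kubernetes.default.svc:443")  # placeholder
    # Dynamic allocation, as in the quoted config.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "4")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Driver owns and reuses the on-demand executor PVCs.
    .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
    .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
    # On-demand PVC per executor, mounted read-write at /tmp/data/.
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "OnDemand")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass", "sc-openshift-default")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit", "1Gi")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/tmp/data/")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly", "false")
    .getOrCreate()
)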