Resurfacing this question to get more attention.

Hello,

I'm running Spark 2.3 jobs on a Kubernetes cluster:

    kubectl version
    Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3",
      GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean",
      BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc",
      Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3",
      GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean",
      BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc",
      Platform:"linux/amd64"}

When I run spark-submit against the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state. When this happens I have to manually kill the driver pod and submit a new job, and then it works. How can this be handled in production? The same thing happens with executor pods.
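Right now the only workaround I have for production is a small watchdog that deletes driver pods stuck in PodInitializing so the job can be resubmitted. A rough sketch of what I mean (the namespace, the 10-minute threshold, and the use of GNU date are assumptions about the environment):

    #!/usr/bin/env bash
    # Delete Spark driver pods that have been Pending (PodInitializing)
    # for longer than MAX_AGE_SECONDS so the submitter can retry them.
    NAMESPACE=spark        # assumption: Spark jobs run in the "spark" namespace
    MAX_AGE_SECONDS=600    # assumption: 10 minutes counts as "stuck"

    kubectl get pods -n "$NAMESPACE" -l spark-role=driver \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}' |
    while IFS=$'\t' read -r name phase created; do
      [ "$phase" = "Pending" ] || continue
      # Age of the pod in seconds; "date -d" assumes GNU date (as on Ubuntu).
      age=$(( $(date +%s) - $(date -d "$created" +%s) ))
      if [ "$age" -gt "$MAX_AGE_SECONDS" ]; then
        echo "deleting $name, Pending for ${age}s"
        kubectl delete pod -n "$NAMESPACE" "$name"
      fi
    done

This only automates the manual kill; it doesn't explain why the init container never starts.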
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128

This happens when I submit the jobs almost in parallel, i.e. five jobs one after the other in quick succession.
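Since the problem shows up mainly under parallel submissions, my current mitigation is to serialize them and wait for each driver to get past Pending before submitting the next. A rough sketch; submit_job.sh is a hypothetical wrapper around spark-submit, and it relies on the spark-role=driver label that the driver pods carry (see the pod description further down) and on a kubectl new enough to support --field-selector:

    #!/usr/bin/env bash
    # Submit jobs one at a time, waiting for each driver pod to leave
    # the Pending phase before launching the next submission.
    for job in job1 job2 job3 job4 job5; do        # placeholder job names
      ./submit_job.sh "$job"                       # hypothetical spark-submit wrapper
      # Block while any driver pod in the namespace is still Pending.
      while kubectl get pods -n spark -l spark-role=driver \
              --field-selector status.phase=Pending -o name | grep -q .; do
        sleep 5
      done
    done

This avoids the trigger but obviously serializes throughput, which is not what I want long term.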
I'm running the Spark jobs on 20 nodes, each with the configuration below. I ran kubectl describe node on the node where the driver pod is running, and the output is below. I do see overcommit on resources there, but I expected the Kubernetes scheduler not to schedule onto a node whose resources are overcommitted, or onto a node in NotReady state. In this case the node is Ready, but I observe the same behaviour when a node is NotReady.

    Name:               **********
    Roles:              worker
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=****
                        node-role.kubernetes.io/worker=true
    Annotations:        node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
    Taints:             <none>
    CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
    Conditions:
      Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
      ----            ------  -----------------                ------------------               ------                      -------
      OutOfDisk       False   Tue, 14 Aug 2018 09:31:20 -0400  Tue, 31 Jul 2018 09:59:24 -0400  KubeletHasSufficientDisk    kubelet has sufficient disk space available
      MemoryPressure  False   Tue, 14 Aug 2018 09:31:20 -0400  Tue, 31 Jul 2018 09:59:24 -0400  KubeletHasSufficientMemory  kubelet has sufficient memory available
      DiskPressure    False   Tue, 14 Aug 2018 09:31:20 -0400  Tue, 31 Jul 2018 09:59:24 -0400  KubeletHasNoDiskPressure    kubelet has no disk pressure
      Ready           True    Tue, 14 Aug 2018 09:31:20 -0400  Sat, 11 Aug 2018 00:41:27 -0400  KubeletReady                kubelet is posting ready status. AppArmor enabled
    Addresses:
      InternalIP:  *****
      Hostname:    ******
    Capacity:
      cpu:     16
      memory:  125827288Ki
      pods:    110
    Allocatable:
      cpu:     16
      memory:  125724888Ki
      pods:    110
    System Info:
      Machine ID:                 *************
      System UUID:                **************
      Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
      Kernel Version:             4.4.0-1062-aws
      OS Image:                   Ubuntu 16.04.4 LTS
      Operating System:           linux
      Architecture:               amd64
      Container Runtime Version:  docker://Unknown
      Kubelet Version:            v1.8.3
      Kube-Proxy Version:         v1.8.3
    PodCIDR:     ******
    ExternalID:  **************
    Non-terminated Pods:  (11 in total)
      Namespace    Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------    ----                                                          ------------  ----------  ---------------  -------------
      kube-system  calico-node-gj5mb                                             250m (1%)     0 (0%)      0 (0%)           0 (0%)
      kube-system  kube-proxy-****************************************           100m (0%)     0 (0%)      0 (0%)           0 (0%)
      kube-system  prometheus-prometheus-node-exporter-9cntq                     100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
      logging      elasticsearch-elasticsearch-data-69df997486-gqcwg             400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
      logging      fluentd-fluentd-elasticsearch-tj7nd                           200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
      rook         rook-agent-6jtzm                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
      rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j  0 (0%)        0 (0%)      0 (0%)           0 (0%)
      spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1     2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
      spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5  2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
      spark        accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
      spark        accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ------------  ----------  ---------------  -------------
      7050m (44%)   1200m (7%)  33410Mi (27%)    45874Mi (37%)
    Events:  <none>
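For what it's worth, my understanding is that the kube-scheduler places pods based on resource requests rather than limits, so the "(Total limits may be over 100 percent...)" note is normal, and from the numbers above requests on this node are only at 44% CPU / 27% memory anyway. This is roughly how I double-check what is actually requested on a node versus its allocatable capacity; a sketch only, where NODE_NAME is a placeholder and a kubectl supporting --field-selector is assumed:

    # Allocatable CPU on the node, as seen by the scheduler.
    NODE_NAME=my-worker-node   # placeholder: node from the describe output above
    kubectl get node "$NODE_NAME" \
      -o jsonpath='{.status.allocatable.cpu}{" cpu allocatable\n"}'

    # Per-pod container CPU requests for every pod scheduled on that node;
    # summing these is what the scheduler compares against allocatable.
    kubectl get pods --all-namespaces \
      --field-selector spec.nodeName="$NODE_NAME" \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'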
kubectl describe pod on the stuck driver gives the message below:

    Name:         accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
    Namespace:    spark
    Node:         ****
    Start Time:   Mon, 13 Aug 2018 16:18:34 -0400
    Labels:       launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
                  spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
                  spark-role=driver
    Annotations:  spark-app-name=accelerate-testing-2
    Status:       Pending
    IP:
    Init Containers:
      spark-init:
        Container ID:
        Image:          ****:v2.3.0
        Image ID:
        Port:           <none>
        Args:
          init
          /etc/spark-init/spark-init.properties
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /etc/spark-init from spark-init-properties (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
          /var/spark-data/spark-files from download-files-volume (rw)
          /var/spark-data/spark-jars from download-jars-volume (rw)
    Containers:
      spark-kubernetes-driver:
        Container ID:
        Image:          ******:v2.3.0
        Image ID:
        Port:           <none>
        Args:
          driver
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Limits:
          memory:  2432Mi
        Requests:
          cpu:     1
          memory:  2Gi
        Environment:
          SPARK_DRIVER_MEMORY:        2g
          SPARK_DRIVER_CLASS:         com.myclass
          SPARK_DRIVER_BIND_ADDRESS:  (v1:status.podIP)
          SPARK_MOUNTED_CLASSPATH:    /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
          SPARK_MOUNTED_FILES_DIR:    /var/spark-data/spark-files
          SPARK_JAVA_OPT_0:           -Dspark.kubernetes.container.image=***
          SPARK_JAVA_OPT_1:           -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
          SPARK_JAVA_OPT_2:           -Dspark.submit.deployMode=cluster
          SPARK_JAVA_OPT_3:           -Dspark.driver.blockManager.port=7079
          SPARK_JAVA_OPT_4:           -Dspark.executor.memory=10g
          SPARK_JAVA_OPT_5:           -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
          SPARK_JAVA_OPT_6:           -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
          SPARK_JAVA_OPT_7:           -Dspark.master=k8s://https://kubernetes.default
          SPARK_JAVA_OPT_8:           -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
          SPARK_JAVA_OPT_9:           -Dspark.executor.cores=2
          SPARK_JAVA_OPT_10:          -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
          SPARK_JAVA_OPT_11:          -Dspark.driver.port=7078
          SPARK_JAVA_OPT_12:          -Dspark.kubernetes.namespace=spark
          SPARK_JAVA_OPT_13:          -Dspark.executor.memoryOverhead=2g
          SPARK_JAVA_OPT_14:          -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
          SPARK_JAVA_OPT_15:          -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
          SPARK_JAVA_OPT_16:          -Dspark.executor.instances=10
          SPARK_JAVA_OPT_17:          -Dspark.memory.fraction=0.6
          SPARK_JAVA_OPT_18:          -Dspark.driver.memory=2g
          SPARK_JAVA_OPT_19:          -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
          SPARK_JAVA_OPT_20:          -Dspark.app.name=accelerate-testing-2
          SPARK_JAVA_OPT_21:          -Dspark.kubernetes.driver.label.launch-id=********
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
          /var/spark-data/spark-files from download-files-volume (rw)
          /var/spark-data/spark-jars from download-jars-volume (rw)
    Conditions:
      Type          Status
      Initialized   False
      Ready         False
      PodScheduled  True
    Volumes:
      spark-init-properties:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
        Optional:  false
      download-jars-volume:
        Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
      download-files-volume:
        Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
      spark-token-mj86g:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  spark-token-mj86g
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     <none>
    Events:
      Type     Reason          Age                  From                                  Message
      ----     ------          ----                 ----                                  -------
      Normal   SandboxChanged  44m (x518 over 18h)  kubelet, ****************************  Pod sandbox changed, it will be killed and re-created.
      Warning  FailedSync      19s (x540 over 18h)  kubelet, ****************************  Error syncing pod
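The repeated SandboxChanged / FailedSync events make it look like the kubelet keeps tearing down and recreating the pod sandbox. These are the checks I run when this happens; a sketch that assumes a systemd-managed kubelet and the Docker runtime (which matches the Ubuntu 16.04 nodes above):

    # On the affected node: kubelet log entries around the sandbox churn.
    journalctl -u kubelet --since "1 hour ago" | grep -iE "sandbox|FailedSync"

    # Docker-side view of the (pause) sandbox containers for the driver pod.
    docker ps -a --filter "name=accelerate-testing-2" \
      --format "{{.ID}}\t{{.Status}}\t{{.Names}}"

    # From anywhere with kubectl access: init-container logs, if it ever started.
    kubectl logs -n spark \
      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver -c spark-init

So far these haven't pointed me at a root cause, which is why I'm resurfacing the question. Any pointers would be appreciated.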