I'm running Spark 2.3 jobs on a Kubernetes cluster. `kubectl version` reports:

```
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
```

When I run spark-submit against the k8s master, the driver pod gets stuck in the `Waiting: PodInitializing` state. When that happens I have to manually kill the driver pod and submit a new job, and then it works. This only happens when I submit several jobs almost in parallel, i.e. five jobs one right after the other. I'm running the Spark jobs on 20 nodes, each with the configuration shown in the `kubectl describe node` output below.
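For reference, each job is submitted roughly like this (a sketch reconstructed from the `SPARK_JAVA_OPT_*` values in the driver pod spec further down; the image, jar paths and main class are masked placeholders):

```
bin/spark-submit \
  --master k8s://https://kubernetes.default \
  --deploy-mode cluster \
  --name accelerate-testing-2 \
  --class com.myclass \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=***:v2.3.0 \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=10g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.memory=2g \
  --conf spark.jars=s3a://my/my.jar,s3a://my/my1.jar \
  s3a://my/my.jar
```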
I ran `kubectl describe node` on the node where the driver pod is running, and this is what I got. I do see that the limit totals are overcommitted, but I expected the Kubernetes scheduler not to schedule a pod onto a node whose resources are overcommitted, or onto a node in the NotReady state. In this case the node is in the Ready state, but I observe the same behaviour when a node is NotReady:

```
Name:               **********
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=****
                    node-role.kubernetes.io/worker=true
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  *****
  Hostname:    ******
Capacity:
 cpu:     16
 memory:  125827288Ki
 pods:    110
Allocatable:
 cpu:     16
 memory:  125724888Ki
 pods:    110
System Info:
 Machine ID:                 *************
 System UUID:                **************
 Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
 Kernel Version:             4.4.0-1062-aws
 OS Image:                   Ubuntu 16.04.4 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://Unknown
 Kubelet Version:            v1.8.3
 Kube-Proxy Version:         v1.8.3
PodCIDR:     ******
ExternalID:  **************
Non-terminated Pods:         (11 in total)
  Namespace    Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                                                          ------------  ----------  ---------------  -------------
  kube-system  calico-node-gj5mb                                             250m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  kube-proxy-****************************************           100m (0%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  prometheus-prometheus-node-exporter-9cntq                     100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
  logging      elasticsearch-elasticsearch-data-69df997486-gqcwg             400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
  logging      fluentd-fluentd-elasticsearch-tj7nd                           200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
  rook         rook-agent-6jtzm                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1     2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5  2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
  spark        accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  7050m (44%)   1200m (7%)  33410Mi (27%)    45874Mi (37%)
Events:         <none>
```
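If I read that output correctly, only the *limit* totals can exceed 100%; the scheduler compares container *requests* against the node's allocatable capacity, and here requests are well below it (7050m of 16 CPUs, 33410Mi of ~120Gi memory), so the scheduler has no reason to refuse this node. If I wanted to cap what the spark namespace can claim, a ResourceQuota seems to be the mechanism, e.g. (a sketch; the quota name and values are arbitrary):

```
kubectl create quota spark-quota -n spark \
  --hard=requests.cpu=12,requests.memory=100Gi,limits.cpu=14,limits.memory=110Gi
```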
`kubectl describe pod` on the stuck driver gives the message below:

```
Name:           accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
Namespace:      spark
Node:           ****
Start Time:     Mon, 13 Aug 2018 16:18:34 -0400
Labels:         launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
                spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
                spark-role=driver
Annotations:    spark-app-name=accelerate-testing-2
Status:         Pending
IP:
Init Containers:
  spark-init:
    Container ID:
    Image:          ****:v2.3.0
    Image ID:
    Port:           <none>
    Args:
      init
      /etc/spark-init/spark-init.properties
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/spark-init from spark-init-properties (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
      /var/spark-data/spark-files from download-files-volume (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
Containers:
  spark-kubernetes-driver:
    Container ID:
    Image:          ******:v2.3.0
    Image ID:
    Port:           <none>
    Args:
      driver
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2432Mi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      SPARK_DRIVER_MEMORY:        2g
      SPARK_DRIVER_CLASS:         com.myclass
      SPARK_DRIVER_BIND_ADDRESS:  (v1:status.podIP)
      SPARK_MOUNTED_CLASSPATH:    /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
      SPARK_MOUNTED_FILES_DIR:    /var/spark-data/spark-files
      SPARK_JAVA_OPT_0:           -Dspark.kubernetes.container.image=***
      SPARK_JAVA_OPT_1:           -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
      SPARK_JAVA_OPT_2:           -Dspark.submit.deployMode=cluster
      SPARK_JAVA_OPT_3:           -Dspark.driver.blockManager.port=7079
      SPARK_JAVA_OPT_4:           -Dspark.executor.memory=10g
      SPARK_JAVA_OPT_5:           -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
      SPARK_JAVA_OPT_6:           -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
      SPARK_JAVA_OPT_7:           -Dspark.master=k8s://https://kubernetes.default
      SPARK_JAVA_OPT_8:           -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
      SPARK_JAVA_OPT_9:           -Dspark.executor.cores=2
      SPARK_JAVA_OPT_10:          -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
      SPARK_JAVA_OPT_11:          -Dspark.driver.port=7078
      SPARK_JAVA_OPT_12:          -Dspark.kubernetes.namespace=spark
      SPARK_JAVA_OPT_13:          -Dspark.executor.memoryOverhead=2g
      SPARK_JAVA_OPT_14:          -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
      SPARK_JAVA_OPT_15:          -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
      SPARK_JAVA_OPT_16:          -Dspark.executor.instances=10
      SPARK_JAVA_OPT_17:          -Dspark.memory.fraction=0.6
      SPARK_JAVA_OPT_18:          -Dspark.driver.memory=2g
      SPARK_JAVA_OPT_19:          -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
      SPARK_JAVA_OPT_20:          -Dspark.app.name=accelerate-testing-2
      SPARK_JAVA_OPT_21:          -Dspark.kubernetes.driver.label.launch-id=********
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
      /var/spark-data/spark-files from download-files-volume (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
Conditions:
  Type           Status
  Initialized    False
  Ready          False
  PodScheduled   True
Volumes:
  spark-init-properties:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
    Optional:  false
  download-jars-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  download-files-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  spark-token-mj86g:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spark-token-mj86g
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason          Age                  From                                   Message
  ----     ------          ----                 ----                                   -------
  Normal   SandboxChanged  44m (x518 over 18h)  kubelet, ****************************  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedSync      19s (x540 over 18h)  kubelet, ****************************  Error syncing pod
```
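So the pod is scheduled (`PodScheduled True`), but the `spark-init` init container never leaves `PodInitializing`, and the kubelet has been looping on `SandboxChanged`/`FailedSync` for 18 hours. The places I know to look next are roughly these (a sketch; `spark-init` is the init container shown above):

```
# logs of the init container that downloads the jars/files, if it ever started
kubectl logs accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver -c spark-init -n spark

# recent events in the spark namespace, to correlate the SandboxChanged/FailedSync loop
kubectl get events -n spark --sort-by=.lastTimestamp

# kubelet logs on the node that hosts the pod (run on the node itself)
journalctl -u kubelet --since "1 hour ago"
```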