Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

2018-08-17 Thread purna pradeep
Resurfacing the question to get more attention.

Hello,
>
> I'm running a Spark 2.3 job on a Kubernetes cluster.
>>
>> kubectl version
>>
>> Client Version: version.Info{Major:"1", Minor:"9",
>> GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b",
>> GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z",
>> GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
>>
>> Server Version: version.Info{Major:"1", Minor:"8",
>> GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
>> GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z",
>> GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
>>
>>
>>
>> When I run spark-submit against the k8s master, the driver pod gets stuck in the
>> Waiting: PodInitializing state.
>> I had to manually kill the driver pod and submit a new job in this case;
>> then it works. How can this be handled in production?
>>
> This happens with executor pods as well
>

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128
>
>
>>
>> This happens when I submit the jobs almost in parallel, i.e. submit 5 jobs
>> one after the other in quick succession.
>>
>> I'm running Spark jobs on 20 nodes, each with the configuration below.
>>
>> I ran kubectl describe node on the node where the driver pod is
>> running, and this is what I got. I do see that resources on the node are overcommitted, but I
>> expected the Kubernetes scheduler not to schedule a pod if the node's resources are
>> overcommitted or the node is in Not Ready state. In this case the node is in Ready
>> state, but I observe the same behaviour if the node is in "Not Ready" state.
>>
>>
>>
>> Name:   **
>>
>> Roles:  worker
>>
>> Labels: beta.kubernetes.io/arch=amd64
>>
>> beta.kubernetes.io/os=linux
>>
>> kubernetes.io/hostname=
>>
>> node-role.kubernetes.io/worker=true
>>
>> Annotations:node.alpha.kubernetes.io/ttl=0
>>
>>
>> volumes.kubernetes.io/controller-managed-attach-detach=true
>>
>> Taints: 
>>
>> CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
>>
>> Conditions:
>>
>>   Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
>>   ----             ------  -----------------                 ------------------                ------                       -------
>>   OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
>>   MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
>>   DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
>>   Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
>>
>> Addresses:
>>
>>   InternalIP:  *
>>
>>   Hostname:**
>>
>> Capacity:
>>
>>  cpu: 16
>>
>>  memory:  125827288Ki
>>
>>  pods:110
>>
>> Allocatable:
>>
>>  cpu: 16
>>
>>  memory:  125724888Ki
>>
>>  pods:110
>>
>> System Info:
>>
>>  Machine ID: *
>>
>>  System UUID:**
>>
>>  Boot ID:1493028d-0a80-4f2f-b0f1-48d9b8910e9f
>>
>>  Kernel Version: 4.4.0-1062-aws
>>
>>  OS Image:   Ubuntu 16.04.4 LTS
>>
>>  Operating System:   linux
>>
>>  Architecture:   amd64
>>
>>  Container Runtime Version:  docker://Unknown
>>
>>  Kubelet Version:v1.8.3
>>
>>  Kube-Proxy Version: v1.8.3
>>
>> PodCIDR: **
>>
>> ExternalID:  **
>>
>> Non-terminated Pods: (11 in total)
>>
>>   Namespace    Name                                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits
>>   ---------    ----                                                 ------------  ----------  ---------------  -------------
>>   kube-system  calico-node-gj5mb                                    250m (1%)     0 (0%)      0 (0%)           0 (0%)
>>   kube-system  kube-proxy-                                          100m (0%)     0 (0%)      0 (0%)           0 (0%)
>>   kube-system  prometheus-prometheus-node-exporter-9cntq            100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
>>   logging      elasticsearch-elasticsearch-data-69df997486-gqcwg    400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
>>   logging

Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

2018-08-16 Thread purna pradeep
Hello,

I'm running a Spark 2.3 job on a Kubernetes cluster.
>
> kubectl version
>
> Client Version: version.Info{Major:"1", Minor:"9",
> GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b",
> GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z",
> GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
>
> Server Version: version.Info{Major:"1", Minor:"8",
> GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
> GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z",
> GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
>
>
>
> When I run spark-submit against the k8s master, the driver pod gets stuck in the
> Waiting: PodInitializing state.
> I had to manually kill the driver pod and submit a new job in this case;
> then it works. How can this be handled in production?
>

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128


>
> This happens when I submit the jobs almost in parallel, i.e. submit 5 jobs
> one after the other in quick succession.
>
> I'm running Spark jobs on 20 nodes, each with the configuration below.
>
> I ran kubectl describe node on the node where the driver pod is
> running, and this is what I got. I do see that resources on the node are overcommitted, but I
> expected the Kubernetes scheduler not to schedule a pod if the node's resources are
> overcommitted or the node is in Not Ready state. In this case the node is in Ready
> state, but I observe the same behaviour if the node is in "Not Ready" state.
>
>
>
> Name:   **
>
> Roles:  worker
>
> Labels: beta.kubernetes.io/arch=amd64
>
> beta.kubernetes.io/os=linux
>
> kubernetes.io/hostname=
>
> node-role.kubernetes.io/worker=true
>
> Annotations:node.alpha.kubernetes.io/ttl=0
>
>
> volumes.kubernetes.io/controller-managed-attach-detach=true
>
> Taints: 
>
> CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
>
> Conditions:
>
>   Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
>   ----             ------  -----------------                 ------------------                ------                       -------
>   OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
>   MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
>   DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
>   Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
>
> Addresses:
>
>   InternalIP:  *
>
>   Hostname:**
>
> Capacity:
>
>  cpu: 16
>
>  memory:  125827288Ki
>
>  pods:110
>
> Allocatable:
>
>  cpu: 16
>
>  memory:  125724888Ki
>
>  pods:110
>
> System Info:
>
>  Machine ID: *
>
>  System UUID:**
>
>  Boot ID:1493028d-0a80-4f2f-b0f1-48d9b8910e9f
>
>  Kernel Version: 4.4.0-1062-aws
>
>  OS Image:   Ubuntu 16.04.4 LTS
>
>  Operating System:   linux
>
>  Architecture:   amd64
>
>  Container Runtime Version:  docker://Unknown
>
>  Kubelet Version:v1.8.3
>
>  Kube-Proxy Version: v1.8.3
>
> PodCIDR: **
>
> ExternalID:  **
>
> Non-terminated Pods: (11 in total)
>
>   Namespace    Name                                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits
>   ---------    ----                                                 ------------  ----------  ---------------  -------------
>   kube-system  calico-node-gj5mb                                    250m (1%)     0 (0%)      0 (0%)           0 (0%)
>   kube-system  kube-proxy-                                          100m (0%)     0 (0%)      0 (0%)           0 (0%)
>   kube-system  prometheus-prometheus-node-exporter-9cntq            100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
>   logging      elasticsearch-elasticsearch-data-69df997486-gqcwg    400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
>   logging      fluentd-fluentd-elasticsearch-tj7nd                  200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
>   rook         rook-agent-6jtzm                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)

spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

2018-08-15 Thread purna pradeep
I'm running a Spark 2.3 job on a Kubernetes cluster.

kubectl version

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3",
GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean",
BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc",
Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3",
GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean",
BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc",
Platform:"linux/amd64"}



When I run spark-submit against the k8s master, the driver pod gets stuck in the
Waiting: PodInitializing state.
I had to manually kill the driver pod and submit a new job in this case;
then it works.
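
The workaround I use today is roughly the following (the pod name and namespace are
placeholders here, not the actual values; the real driver pod name comes from the
spark-submit / kubectl get pods output):

    # List the pods and confirm the driver is stuck in init
    kubectl get pods -n <namespace>
    # kubectl describe shows the container state as Waiting, reason PodInitializing
    kubectl describe pod <driver-pod-name> -n <namespace>
    # Manual workaround: delete the stuck driver pod, then resubmit the job
    kubectl delete pod <driver-pod-name> -n <namespace>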


This happens when I submit the jobs almost in parallel, i.e. submit 5 jobs
one after the other in quick succession.
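
For context, each job is submitted roughly like this (the API server address, image,
class and jar path below are placeholders, not the actual values), and five of these
are fired off back to back:

    bin/spark-submit \
      --master k8s://https://<api-server-host>:<port> \
      --deploy-mode cluster \
      --name <job-name> \
      --class <main-class> \
      --conf spark.executor.instances=<n> \
      --conf spark.kubernetes.container.image=<spark-2.3-image> \
      local:///path/to/<job>.jar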

I'm running Spark jobs on 20 nodes, each with the configuration below.

I ran kubectl describe node on the node where the driver pod is
running, and this is what I got. I do see that resources on the node are overcommitted, but I
expected the Kubernetes scheduler not to schedule a pod if the node's resources are
overcommitted or the node is in Not Ready state. In this case the node is in Ready
state, but I observe the same behaviour if the node is in "Not Ready" state.
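
To locate the node the driver was scheduled on, I use the commands below (pod name and
namespace are again placeholders):

    # -o wide adds a NODE column showing where the driver pod landed
    kubectl get pod <driver-pod-name> -n <namespace> -o wide
    # Then dump that node; its output is pasted below
    kubectl describe node <node-name>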



Name:   **

Roles:  worker

Labels: beta.kubernetes.io/arch=amd64

beta.kubernetes.io/os=linux

kubernetes.io/hostname=

node-role.kubernetes.io/worker=true

Annotations:node.alpha.kubernetes.io/ttl=0


volumes.kubernetes.io/controller-managed-attach-detach=true

Taints: 

CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400

Conditions:

  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled

Addresses:

  InternalIP:  *

  Hostname:**

Capacity:

 cpu: 16

 memory:  125827288Ki

 pods:110

Allocatable:

 cpu: 16

 memory:  125724888Ki

 pods:110

System Info:

 Machine ID: *

 System UUID:**

 Boot ID:1493028d-0a80-4f2f-b0f1-48d9b8910e9f

 Kernel Version: 4.4.0-1062-aws

 OS Image:   Ubuntu 16.04.4 LTS

 Operating System:   linux

 Architecture:   amd64

 Container Runtime Version:  docker://Unknown

 Kubelet Version:v1.8.3

 Kube-Proxy Version: v1.8.3

PodCIDR: **

ExternalID:  **

Non-terminated Pods: (11 in total)

  Namespace    Name                                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                                                           ------------  ----------  ---------------  -------------
  kube-system  calico-node-gj5mb                                              250m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  kube-proxy-                                                    100m (0%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  prometheus-prometheus-node-exporter-9cntq                      100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
  logging      elasticsearch-elasticsearch-data-69df997486-gqcwg              400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
  logging      fluentd-fluentd-elasticsearch-tj7nd                            200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
  rook         rook-agent-6jtzm                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1      2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5   2 (12%)       0 (0%)      10Gi (8%)        12Gi