[jira] [Updated] (SPARK-28949) Kubernetes CGroup leaking leads to Spark Pods hang in Pending status

2019-09-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-28949:
-
Attachment: describe-driver-pod.txt

> Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
> 
>
> Key: SPARK-28949
> URL: https://issues.apache.org/jira/browse/SPARK-28949
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: describe-driver-pod.txt, describe-executor-pod.txt
>
>
> After running Spark on k8s for a few days, some kubelets fail to create pods,
> logging warning messages like
> {code:java}
> \"mkdir 
> /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3ea20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e:
>  no space left on device\"
> {code}
> The k8s cluster and the kubelet node both have plenty of free resources.
> These pods stay zombied for days before we manually notice and terminate them.
> It is relatively easy to identify zombied driver pods, but quite inconvenient
> to identify zombied executor pods when Spark applications scale out.
> This is probably related to
> [https://github.com/kubernetes/kubernetes/issues/70324]
> Do we need a timeout, retry, or failover mechanism in Spark to handle these
> kinds of underlying k8s/kernel issues?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28949) Kubernetes CGroup leaking leads to Spark Pods hang in Pending status

2019-09-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-28949:
-
Attachment: describe-executor-pod.txt

> Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
> 
>
> Key: SPARK-28949
> URL: https://issues.apache.org/jira/browse/SPARK-28949
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: describe-driver-pod.txt, describe-executor-pod.txt
>
>
> After running Spark on k8s for a few days, some kubelets fail to create pods,
> logging warning messages like
> {code:java}
> \"mkdir 
> /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3ea20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e:
>  no space left on device\"
> {code}
> The k8s cluster and the kubelet node both have plenty of free resources.
> These pods stay zombied for days before we manually notice and terminate them.
> It is relatively easy to identify zombied driver pods, but quite inconvenient
> to identify zombied executor pods when Spark applications scale out.
> This is probably related to
> [https://github.com/kubernetes/kubernetes/issues/70324]
> Do we need a timeout, retry, or failover mechanism in Spark to handle these
> kinds of underlying k8s/kernel issues?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28949) Kubernetes CGroup leaking leads to Spark Pods hang in Pending status

2019-09-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-28949:
-
Description: 
After running Spark on k8s for a few days, some kubelets fail to create pods,
logging warning messages like
{code:java}
\"mkdir 
/sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3ea20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e:
 no space left on device\"
{code}
The k8s cluster and the kubelet node both have plenty of free resources.
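
For context, the "no space left on device" above is reported by the cgroup filesystem,
not by the disk. A rough node-side probe, sketched here under the assumption of cgroup v1
mounted at /sys/fs/cgroup and root access (the probe directory name is illustrative only),
is to attempt the same mkdir directly and count the memory cgroups that are actually visible:
{code:scala}
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object CgroupProbe {
  def main(args: Array[String]): Unit = {
    val memoryRoot = Paths.get("/sys/fs/cgroup/memory")

    // Count the memory cgroup directories that are visible in the filesystem.
    val visibleDirs = Files.walk(memoryRoot).iterator().asScala
      .count(p => Files.isDirectory(p))
    println(s"visible memory cgroups: $visibleDirs")

    // Creating a directory under the memory hierarchy creates a new cgroup.
    // If this fails with "No space left on device" while `df` shows free disk
    // and the visible count above is small, the kernel has likely run out of
    // memory cgroup IDs (leaked "dying" cgroups, as in the linked k8s issue).
    val probe = memoryRoot.resolve("cgroup-leak-probe")
    try {
      Files.createDirectory(probe)
      println("mkdir succeeded; the memory cgroup hierarchy is not exhausted here")
    } catch {
      case e: java.io.IOException => println(s"mkdir failed: ${e.getMessage}")
    } finally {
      Files.deleteIfExists(probe)
    }
  }
}
{code}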

These pods stay zombied for days before we manually notice and terminate them.
It is relatively easy to identify zombied driver pods, but quite inconvenient to
identify zombied executor pods when Spark applications scale out.

This is probably related to [https://github.com/kubernetes/kubernetes/issues/70324]

Do we need a timeout, retry, or failover mechanism in Spark to handle these
kinds of underlying k8s/kernel issues?
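
One possible shape for such a mechanism, sketched outside of Spark's own code
(the namespace and the 10-minute threshold are assumptions, not anything Spark ships today;
the spark-role=executor label is what the k8s scheduler backend puts on executor pods):
a small watchdog built on the fabric8 client that deletes executor pods stuck in Pending,
so the driver's allocator can request replacements instead of waiting on a pod the kubelet
will never start.
{code:scala}
import java.time.{Duration, Instant}

import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

// Standalone watchdog, not part of Spark: reap executor pods that have been
// Pending longer than a timeout (e.g. because the node cannot create cgroups).
object PendingPodReaper {
  def main(args: Array[String]): Unit = {
    val namespace = "spark"                      // assumption: namespace Spark pods run in
    val pendingTimeout = Duration.ofMinutes(10)  // assumption: acceptable scheduling delay
    val client = new DefaultKubernetesClient()

    try {
      val executors = client.pods().inNamespace(namespace)
        .withLabel("spark-role", "executor")
        .list().getItems.asScala

      val now = Instant.now()
      executors
        .filter(p => Option(p.getStatus).map(_.getPhase).contains("Pending"))
        .foreach { p =>
          val created = Instant.parse(p.getMetadata.getCreationTimestamp)
          if (Duration.between(created, now).compareTo(pendingTimeout) > 0) {
            // The pod has been unschedulable/uncreatable for too long; delete it
            // so it does not linger as a zombie and the driver can replace it.
            println(s"Deleting pod ${p.getMetadata.getName}, Pending since $created")
            client.pods().inNamespace(namespace).withName(p.getMetadata.getName).delete()
          }
        }
    } finally {
      client.close()
    }
  }
}
{code}
A similar timeout inside the driver's executor pod allocator would avoid the need for an
external process, but that would be a change to Spark itself.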

 

 

  was:
After running Spark on k8s for a few days, some kubelet fails to create pod 
caused by warning message like
{code:java}
\"mkdir 
/sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3ea20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e:
 no space left on device\"
{code}
The k8s cluster and the kubelet node are free.

These pods zombie over days before we manually notify and terminate them. Maybe 
it

is a little bit 

This probably related to [https://github.com/kubernetes/kubernetes/issues/70324]

Do we need a timeout, retry or failover mechanism for Spark to handle these 
kinds of k8s kernel issues?

 

 


> Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
> 
>
> Key: SPARK-28949
> URL: https://issues.apache.org/jira/browse/SPARK-28949
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: describe-driver-pod.txt, describe-executor-pod.txt
>
>
> After running Spark on k8s for a few days, some kubelets fail to create pods,
> logging warning messages like
> {code:java}
> \"mkdir 
> /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3ea20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e:
>  no space left on device\"
> {code}
> The k8s cluster and the kubelet node both have plenty of free resources.
> These pods stay zombied for days before we manually notice and terminate them.
> It is relatively easy to identify zombied driver pods, but quite inconvenient
> to identify zombied executor pods when Spark applications scale out.
> This is probably related to
> [https://github.com/kubernetes/kubernetes/issues/70324]
> Do we need a timeout, retry, or failover mechanism in Spark to handle these
> kinds of underlying k8s/kernel issues?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org