[ https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-28949:
-----------------------------
    Attachment: describe-executor-pod.txt

> Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
> --------------------------------------------------------------------
>
>                 Key: SPARK-28949
>                 URL: https://issues.apache.org/jira/browse/SPARK-28949
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.3, 2.4.4
>            Reporter: Kent Yao
>            Priority: Major
>         Attachments: describe-driver-pod.txt, describe-executor-pod.txt
>
>
> After running Spark on k8s for a few days, some kubelet fails to create pod 
> caused by warning message like
> {code:java}
> mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device
> {code}
> The k8s cluster and the kubelet node otherwise have free capacity.
> These pods stay zombied for days before we manually notice and terminate them.
> This is probably related to
> [https://github.com/kubernetes/kubernetes/issues/70324]
> Do we need a timeout, retry, or failover mechanism for Spark to handle these
> kinds of k8s kernel issues?
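One shape such a mechanism could take (a sketch only; Spark 2.3/2.4 has no such setting, and the threshold name below is hypothetical): the driver-side pod allocator tracks how long each executor pod has been in the Pending phase, flags pods stuck past a timeout, and deletes and re-requests them. The pure timeout check might look like:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; not an existing Spark configuration.
PENDING_TIMEOUT = timedelta(minutes=15)

def is_stuck_pending(phase: str, started: datetime,
                     now: Optional[datetime] = None) -> bool:
    """Return True if a pod has sat in the Pending phase longer than
    PENDING_TIMEOUT, i.e. it is a candidate for deletion and replacement."""
    now = now or datetime.now(timezone.utc)
    return phase == "Pending" and (now - started) > PENDING_TIMEOUT
```

A watchdog loop in the allocator would then call this per pod and, for stuck ones, delete the pod via the Kubernetes API and request a replacement, sidestepping the node whose cgroup hierarchy is exhausted.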



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
