Kent Yao created SPARK-28949:
--------------------------------

             Summary: Kubernetes cgroup leak leaves Spark pods hanging in Pending status
                 Key: SPARK-28949
                 URL: https://issues.apache.org/jira/browse/SPARK-28949
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 2.4.4, 2.3.3
            Reporter: Kent Yao


After running Spark on k8s for a few days, some kubelets fail to create pods, logging warnings like
{code:java}
\"mkdir 
/sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e:
 no space left on device\"
{code}
The k8s cluster is healthy, and the affected kubelet node still has plenty of free disk and memory, so the 'no space left on device' message is misleading.
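For anyone who wants to confirm this condition on a node, a rough check is to compare the memory controller's cgroup count against the per-hierarchy ID limit. Below is a minimal Scala sketch (not part of Spark; the 65535 limit is an assumption based on the kernel's 16-bit memcg ID space and should be verified for your kernel) that reads /proc/cgroups:
{code:scala}
import scala.io.Source

object CgroupCheck {
  // Assumed limit: many kernels use a 16-bit memcg ID space, so the memory
  // hierarchy runs out of IDs around 65535; verify for your kernel version.
  val AssumedMemcgIdLimit = 65535

  def main(args: Array[String]): Unit = {
    val src = Source.fromFile("/proc/cgroups")
    try {
      // /proc/cgroups columns: subsys_name  hierarchy  num_cgroups  enabled
      src.getLines()
        .filterNot(_.startsWith("#"))
        .map(_.trim.split("\\s+"))
        .find(_.headOption.contains("memory"))
        .foreach { cols =>
          val numCgroups = cols(2).toInt
          println(s"memory cgroups in use: $numCgroups / $AssumedMemcgIdLimit")
          if (numCgroups > AssumedMemcgIdLimit * 0.9)
            println("WARNING: memory cgroup ID space nearly exhausted; " +
              "pod creation on this node may fail with 'no space left on device'")
        }
    } finally src.close()
  }
}
{code}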

These pods stay Pending as zombies for days until we notice them manually and terminate them.
This is probably related to [https://github.com/kubernetes/kubernetes/issues/70324], where memory cgroups leak as pods churn until the node's cgroup ID space is exhausted.

Do we need a timeout, retry, or failover mechanism in Spark to handle these kinds of k8s/kernel issues?
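As a straw man, such a mechanism could look like the sketch below: a periodic check, written here against the fabric8 Kubernetes client that Spark already depends on, that deletes executor pods stuck in Pending beyond a timeout so the executor allocator can request replacements. The PendingTimeout value and the standalone-reaper shape are assumptions for illustration, not an existing Spark config:
{code:scala}
import java.time.{Duration, Instant}
import scala.collection.JavaConverters._
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object PendingPodReaper {
  // Hypothetical timeout for illustration; not an existing Spark setting.
  val PendingTimeout: Duration = Duration.ofMinutes(15)

  def main(args: Array[String]): Unit = {
    val namespace = args.headOption.getOrElse("default")
    val client = new DefaultKubernetesClient()
    try {
      val now = Instant.now()
      // Spark labels its executor pods with spark-role=executor.
      val executors = client.pods()
        .inNamespace(namespace)
        .withLabel("spark-role", "executor")
        .list().getItems.asScala

      for (pod <- executors if pod.getStatus.getPhase == "Pending") {
        val created = Instant.parse(pod.getMetadata.getCreationTimestamp)
        if (Duration.between(created, now).compareTo(PendingTimeout) > 0) {
          // Delete the stuck pod; Spark's executor allocator will then
          // request a replacement that can land on a healthy node.
          println(s"Deleting pod ${pod.getMetadata.getName}, Pending since $created")
          client.pods()
            .inNamespace(namespace)
            .withName(pod.getMetadata.getName)
            .delete()
        }
      }
    } finally client.close()
  }
}
{code}
Deleting the stuck pod alone does not heal the node (ideally the node would also be cordoned), but even a simple timeout like this would stop applications from hanging indefinitely.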
