Kent Yao created SPARK-28949:
--------------------------------

             Summary: Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
                 Key: SPARK-28949
                 URL: https://issues.apache.org/jira/browse/SPARK-28949
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 2.4.4, 2.3.3
            Reporter: Kent Yao
After running Spark on k8s for a few days, some kubelets fail to create pods, logging warnings like

{code:java}
"mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device"
{code}

even though the k8s cluster and the affected kubelet nodes have plenty of free resources. The affected pods stay in Pending as zombies for days until we notice them and terminate them manually.

This is probably related to [https://github.com/kubernetes/kubernetes/issues/70324]

Do we need a timeout, retry or failover mechanism in Spark to handle these kinds of k8s kernel issues?
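As a rough illustration of what such a timeout/failover mechanism could look like (this is only a sketch, not an existing Spark feature): a watchdog could periodically list the application's executor pods, find those stuck in Pending beyond a configurable timeout, and delete them so replacements get scheduled, hopefully on a healthy node. The code below uses the fabric8 Kubernetes client that Spark's k8s backend is built on; the object name, the 10-minute timeout, and the reapStuckExecutors method are hypothetical, while the spark-app-selector and spark-role labels are the ones Spark sets on executor pods.

{code:java}
import java.time.{Duration, Instant}
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

// Hypothetical watchdog sketch: delete executor pods stuck in Pending longer
// than a timeout so the scheduler can request replacements.
object PendingPodWatchdog {
  // Assumed timeout value, not an existing Spark configuration.
  val pendingTimeout: Duration = Duration.ofMinutes(10)

  def reapStuckExecutors(namespace: String, appId: String): Unit = {
    val client = new DefaultKubernetesClient()
    try {
      val pods = client.pods()
        .inNamespace(namespace)
        .withLabel("spark-app-selector", appId)
        .withLabel("spark-role", "executor")
        .list().getItems.asScala

      val now = Instant.now()
      pods.filter(_.getStatus.getPhase == "Pending").foreach { pod =>
        val created = Instant.parse(pod.getMetadata.getCreationTimestamp)
        if (Duration.between(created, now).compareTo(pendingTimeout) > 0) {
          // Delete the stuck pod; the executor allocator will notice the
          // shortfall and request a new one.
          client.pods().inNamespace(namespace)
            .withName(pod.getMetadata.getName).delete()
        }
      }
    } finally {
      client.close()
    }
  }
}
{code}

A retry alone would not help here, since the kubelet on the leaking node keeps failing with the same cgroup error; deleting the pod and letting it be rescheduled (ideally elsewhere) seems closer to a failover.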