[ https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kent Yao updated SPARK-28949:
-----------------------------
    Attachment: describe-executor-pod.txt

> Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
> --------------------------------------------------------------------
>
>                 Key: SPARK-28949
>                 URL: https://issues.apache.org/jira/browse/SPARK-28949
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.3, 2.4.4
>            Reporter: Kent Yao
>            Priority: Major
>         Attachments: describe-driver-pod.txt, describe-executor-pod.txt
>
> After running Spark on K8s for a few days, some kubelets fail to create pods, with warnings like:
> {code:java}
> "mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device"
> {code}
> The K8s cluster and the affected kubelet nodes otherwise have plenty of free resources; the error comes from leaked memory cgroups on the node, not from genuine resource exhaustion.
> The affected pods stay in Pending for days before we manually notice and terminate them.
> This is probably related to https://github.com/kubernetes/kubernetes/issues/70324
> Do we need a timeout, retry, or failover mechanism in Spark to handle these kinds of K8s kernel issues?
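> One possible shape for such a mechanism, sketched below, is a watchdog that deletes executor pods stuck in Pending beyond a timeout, so the allocator can request replacements that the scheduler may place on a healthy node. This is only an illustration of the idea, not existing Spark behavior: it uses the fabric8 Kubernetes client, which Spark's K8s backend already depends on, and the namespace, label selector, and timeout value are assumptions made for the example.
> {code:scala}
> import java.time.{Duration, Instant}
>
> import scala.collection.JavaConverters._
>
> import io.fabric8.kubernetes.api.model.Pod
> import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClient}
>
> object PendingPodReaper {
>
>   // Illustrative values; a real patch would read these from SparkConf.
>   val Namespace = "spark"
>   val PendingTimeout: Duration = Duration.ofMinutes(15)
>
>   def reapStuckExecutors(client: KubernetesClient): Unit = {
>     val pods = client.pods()
>       .inNamespace(Namespace)
>       .withLabel("spark-role", "executor")
>       .list()
>       .getItems
>       .asScala
>
>     val now = Instant.now()
>     pods.filter(isStuckPending(_, now)).foreach { pod =>
>       val name = pod.getMetadata.getName
>       // Deleting the stuck pod frees the allocator to request a
>       // replacement, which may land on a node without the cgroup leak.
>       client.pods().inNamespace(Namespace).withName(name).delete()
>     }
>   }
>
>   private def isStuckPending(pod: Pod, now: Instant): Boolean = {
>     val pending = pod.getStatus != null && pod.getStatus.getPhase == "Pending"
>     // creationTimestamp is an RFC 3339 string set by the API server.
>     val created = Instant.parse(pod.getMetadata.getCreationTimestamp)
>     pending && Duration.between(created, now).compareTo(PendingTimeout) > 0
>   }
>
>   def main(args: Array[String]): Unit = {
>     val client = new DefaultKubernetesClient()
>     try reapStuckExecutors(client) finally client.close()
>   }
> }
> {code}
> A real fix would likely live inside the executor allocation loop and distinguish node-level failures like this cgroup leak from ordinary scheduling delays, but even a coarse pending-timeout would unblock jobs that currently hang for days.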