[jira] [Updated] (SPARK-28949) Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
[ https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-28949:
-----------------------------
    Attachment: describe-driver-pod.txt
[jira] [Updated] (SPARK-28949) Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
[ https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-28949:
-----------------------------
    Attachment: describe-executor-pod.txt
[jira] [Updated] (SPARK-28949) Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
[ https://issues.apache.org/jira/browse/SPARK-28949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-28949:
-----------------------------
    Description: 

After running Spark on k8s for a few days, some kubelets fail to create pods, with a warning message like
{code:java}
"mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3ea20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device"
{code}
The k8s cluster and the kubelet nodes otherwise have free resources; the error appears to come from leaked cgroups rather than from real disk or memory pressure.
These pods stay zombied for days before we manually notice and terminate them. It may be fairly easy to identify a zombied driver pod, but it is quite inconvenient to identify executor pods when Spark applications scale out.
This is probably related to [https://github.com/kubernetes/kubernetes/issues/70324]
Do we need a timeout, retry, or failover mechanism for Spark to handle these kinds of k8s kernel issues?


> Kubernetes CGroup leaking leads to Spark Pods hang in Pending status
> ---------------------------------------------------------------------
>
>                 Key: SPARK-28949
>                 URL: https://issues.apache.org/jira/browse/SPARK-28949
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.3, 2.4.4
>            Reporter: Kent Yao
>            Priority: Major
>         Attachments: describe-driver-pod.txt, describe-executor-pod.txt
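Before touching Spark, it is worth confirming the leak on an affected node. A minimal diagnostic sketch follows, assuming a cgroup v1 layout with the memory controller mounted at /sys/fs/cgroup/memory (the path in the mkdir error above): it prints the kernel's memory-cgroup count from /proc/cgroups and the number of directories still present under the kubepods hierarchy. On a leaking node, the kernel count keeps climbing toward the memory controller's 65535-ID cap even while few pods are running, which is what eventually turns mkdir into "no space left on device".
{code:scala}
import java.nio.file.{Files, Paths}

import scala.collection.JavaConverters._

// Diagnostic sketch for a node hit by the cgroup leak: compares the
// kernel's memory-cgroup count with the directories actually present
// under the kubepods memory hierarchy (cgroup v1 layout assumed).
object CgroupLeakCheck {
  def main(args: Array[String]): Unit = {
    // /proc/cgroups columns: subsys_name  hierarchy  num_cgroups  enabled
    val memoryLine = Files.readAllLines(Paths.get("/proc/cgroups")).asScala
      .find(_.startsWith("memory"))
    println(s"kernel memory controller: ${memoryLine.getOrElse("not found")}")

    val kubepods = Paths.get("/sys/fs/cgroup/memory/kubepods")
    if (Files.isDirectory(kubepods)) {
      val walk = Files.walk(kubepods)
      try {
        val dirs = walk.iterator().asScala.count(p => Files.isDirectory(p))
        println(s"cgroup directories under kubepods: $dirs")
      } finally {
        walk.close()
      }
    }
  }
}
{code}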
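On the Spark side, the timeout asked about above could be prototyped outside Spark first. Below is a minimal sketch of such a reaper against the fabric8 kubernetes-client (the client library Spark's k8s backend uses); the "spark" namespace and the 15-minute cutoff are assumptions rather than existing Spark configuration, while spark-role=executor is the label Spark sets on executor pods. It only covers the detection-and-delete half: a real in-Spark failover would also have to request replacement executors from the scheduler backend.
{code:scala}
import java.time.{Duration, Instant}

import scala.collection.JavaConverters._

import io.fabric8.kubernetes.client.DefaultKubernetesClient

// Sketch of a pending-pod timeout: delete executor pods that have been
// stuck in Pending longer than a cutoff so replacements can land on a
// healthy node. Not part of Spark; namespace and cutoff are assumptions.
object PendingPodReaper {
  private val pendingTimeout = Duration.ofMinutes(15) // hypothetical cutoff

  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient() // kubeconfig or in-cluster config
    try {
      val now = Instant.now()
      val stuck = client.pods()
        .inNamespace("spark")                  // assumed namespace
        .withLabel("spark-role", "executor")   // label Spark sets on executors
        .list().getItems.asScala
        .filter(p => Option(p.getStatus).exists(_.getPhase == "Pending"))
        .filter { p =>
          val created = Instant.parse(p.getMetadata.getCreationTimestamp)
          Duration.between(created, now).compareTo(pendingTimeout) > 0
        }

      stuck.foreach { p =>
        println(s"Deleting ${p.getMetadata.getName}, Pending since ${p.getMetadata.getCreationTimestamp}")
        client.pods().inNamespace("spark").withName(p.getMetadata.getName).delete()
      }
    } finally {
      client.close()
    }
  }
}
{code}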