Denis Krivenko created SPARK-37856:
--------------------------------------

Summary: Executor pods keep existing if driver container was restarted
Key: SPARK-37856
URL: https://issues.apache.org/jira/browse/SPARK-37856
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 3.2.0, 3.1.2
Environment:
* Kubernetes 1.20
* Spark 3.1.2
* Hadoop 3.2.0
* Java 11
* Scala 2.12

and
* Kubernetes 1.20
* Spark 3.2.0
* Hadoop 3.3.1
* Java 11
* Scala 2.12

Reporter: Denis Krivenko

I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs continuously and creates and manages the executor pods.

From time to time an OOM issue occurs on the driver pod or on executor pods. When it happens on
* executor - the executor pod is deleted and the driver creates a new executor pod to replace it. This works as expected.
* driver - Kubernetes restarts the driver container and the driver creates new executor pods. All previous executors stop, but their pods still exist, in *Error* state for Spark 3.1.2 or in *Completed* state for Spark 3.2.0.

The behavior can be reproduced by restarting a pod container with the command
{code:java}
kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5
{code}
Property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* by default.

If I delete the driver pod, all executor pods (in any state) are also deleted completely.

+Pod list+
{code:java}
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
{code}
+Driver pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-85cf5d689b-vvrwd
  uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
    image: xxx/spark:3.2.0
    lastState:
      terminated:
        containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
        exitCode: 143
        finishedAt: "2022-01-09T16:09:50Z"
        reason: OOMKilled
        startedAt: "2022-01-07T00:32:21Z"
    name: spark-thrift-server
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2022-01-09T16:09:51Z"
{code}
+Executor pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-1a9aee7e31f36eea-exec-17
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: spark-thrift-server-85cf5d689b-vvrwd
    uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
    image: xxx/spark:3.2.0
    lastState: {}
    name: spark-kubernetes-executor
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
        exitCode: 0
        finishedAt: "2022-01-09T16:08:57Z"
        reason: Completed
        startedAt: "2022-01-09T01:39:15Z"
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
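
Until the leftover pods are cleaned up automatically, the executor pods described in the report can be picked out of the pod listing by hand. The following is a minimal Python sketch of that idea, not a fix for the underlying bug. It assumes the plain-text layout of the pod list shown in the report, assumes executor pods can be recognised by "-exec-" in their name (as in that listing), and treats the *Completed* and *Error* states named in the report as "leftover". The function name leftover_executor_pods is hypothetical. The script only prints kubectl delete pod commands for review and does not run them.

```python
from typing import List

# Pod listing copied from the report (output of the pod list command).
POD_LISTING = """
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
"""

# States in which the report observed stopped executors after a driver restart.
LEFTOVER_STATES = ("Completed", "Error")


def leftover_executor_pods(pod_listing: str) -> List[str]:
    """Return the names of executor pods that have stopped but still exist."""
    leftovers = []
    for line in pod_listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and any row that lacks the five expected columns.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        # Assumption: "-exec-" in the name marks an executor pod,
        # as in the listing from the report.
        if "-exec-" in name and status in LEFTOVER_STATES:
            leftovers.append(name)
    return leftovers


if __name__ == "__main__":
    # Print the delete commands instead of running them, so they can be reviewed.
    for pod in leftover_executor_pods(POD_LISTING):
        print(f"kubectl delete pod {pod}")
```

Run against the listing above, the sketch selects only the four exec pods from the earlier driver run (the 1a9aee7e31f36eea prefix) and leaves the driver pod and the running executors untouched.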