[ https://issues.apache.org/jira/browse/SPARK-37856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Krivenko updated SPARK-37856:
-----------------------------------
    Environment:
Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12

Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12

  was:
* Kubernetes 1.20
* Spark 3.1.2
* Hadoop 3.2.0
* Java 11
* Scala 2.12

and
* Kubernetes 1.20
* Spark 3.2.0
* Hadoop 3.3.1
* Java 11
* Scala 2.12

> Executor pods keep existing if driver container was restarted
> -------------------------------------------------------------
>
>                 Key: SPARK-37856
>                 URL: https://issues.apache.org/jira/browse/SPARK-37856
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.2, 3.2.0
>         Environment: Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12
>                      Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12
>            Reporter: Denis Krivenko
>            Priority: Minor
>
> I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs
> continuously and creates and manages executor pods. From time to time an
> OOM error occurs on the driver pod or on executor pods.
> When it happens on
> * executor - the executor pod is deleted and the driver creates a new
> executor pod to replace it. This works as expected.
> * driver - Kubernetes restarts the driver container and the driver creates
> new executor pods. All previous executors stop, but their pods remain, in
> *Error* state for Spark 3.1.2 or in *Completed* state for Spark 3.2.0.
> The behavior can be reproduced by restarting a pod container with the command
> {code:java}
> kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
> Property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* by
> default.
> If I delete the driver pod, all executor pods (in any state) are deleted
> as well.
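While the pods are left behind, they can at least be spotted from the plain pod listing. The following is a minimal sketch, not part of Spark or kubectl: it assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...) and the `-exec-` naming seen in this report. The function name `stale_executor_pods` and the `SAMPLE` text (a shortened copy of the listing in this report) are illustrative only.

```python
from typing import List


def stale_executor_pods(listing: str) -> List[str]:
    """Return names of executor pods whose STATUS is Error or Completed.

    `listing` is the text printed by `kubectl get pods` (assumed default
    columns: NAME, READY, STATUS, RESTARTS, AGE).
    """
    stale = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, status = fields[0], fields[2]
        # Assumption: executor pods are recognisable by "-exec-" in the name.
        if "-exec-" in name and status in ("Error", "Completed"):
            stale.append(name)
    return stale


SAMPLE = """\
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
"""

if __name__ == "__main__":
    # Print the delete commands instead of running them, so the result
    # can be reviewed before anything is removed.
    for pod in stale_executor_pods(SAMPLE):
        print(f"kubectl delete pod {pod}")
```

The sketch only prints `kubectl delete pod` commands rather than executing them; whether deleting the pods by hand is appropriate depends on the deployment.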
> +Pod list+
> {code:java}
> NAME                                           READY   STATUS      RESTARTS   AGE
> spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
> spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
> spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
> spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
> spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
> spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
> spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
> {code}
> +Driver pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-85cf5d689b-vvrwd
>   uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
>     image: xxx/spark:3.2.0
>     lastState:
>       terminated:
>         containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
>         exitCode: 143
>         finishedAt: "2022-01-09T16:09:50Z"
>         reason: OOMKilled
>         startedAt: "2022-01-07T00:32:21Z"
>     name: spark-thrift-server
>     ready: true
>     restartCount: 1
>     started: true
>     state:
>       running:
>         startedAt: "2022-01-09T16:09:51Z" {code}
> +Executor pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-1a9aee7e31f36eea-exec-17
>   ownerReferences:
>   - apiVersion: v1
>     controller: true
>     kind: Pod
>     name: spark-thrift-server-85cf5d689b-vvrwd
>     uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>     image: xxx/spark:3.2.0
>     lastState: {}
>     name: spark-kubernetes-executor
>     ready: false
>     restartCount: 0
>     started: false
>     state:
>       terminated:
>         containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>         exitCode: 0
>         finishedAt: "2022-01-09T16:08:57Z"
>         reason: Completed
>         startedAt: "2022-01-09T01:39:15Z" {code}
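The two manifests show that the terminated executor still carries an `ownerReferences` entry pointing at the driver pod's `uid`, which was unchanged by the container restart. A minimal sketch of how that relationship could be used to list leftover executors, assuming pod objects have been loaded as dictionaries (for example from `kubectl get pods -o json`): the helper names and the `EXEC_10` entry are hypothetical, and the `EXEC_10` fields in particular are illustrative rather than taken from the report.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]


def owned_by(pod: Pod, owner_uid: str) -> bool:
    """True if any ownerReferences entry of `pod` points at `owner_uid`."""
    refs = pod.get("metadata", {}).get("ownerReferences", [])
    return any(ref.get("uid") == owner_uid for ref in refs)


def all_containers_terminated(pod: Pod) -> bool:
    """True if the pod reports container statuses and all are terminated."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return bool(statuses) and all(
        "terminated" in status.get("state", {}) for status in statuses
    )


def leftover_executors(driver: Pod, pods: List[Pod]) -> List[str]:
    """Names of pods owned by `driver` whose containers have all terminated."""
    uid = driver["metadata"]["uid"]
    return [
        pod["metadata"]["name"]
        for pod in pods
        if owned_by(pod, uid) and all_containers_terminated(pod)
    ]


# Trimmed copies of the driver and executor manifests from the report.
DRIVER: Pod = {
    "metadata": {
        "name": "spark-thrift-server-85cf5d689b-vvrwd",
        "uid": "b69a7c68-a767-4e3b-939c-061347b1c25e",
    },
}
EXEC_17: Pod = {
    "metadata": {
        "name": "spark-thrift-server-1a9aee7e31f36eea-exec-17",
        "ownerReferences": [{"uid": "b69a7c68-a767-4e3b-939c-061347b1c25e"}],
    },
    "status": {
        "containerStatuses": [
            {"state": {"terminated": {"exitCode": 0, "reason": "Completed"}}}
        ]
    },
}
# Illustrative only: a running executor; its fields are assumed, not reported.
EXEC_10: Pod = {
    "metadata": {
        "name": "spark-thrift-server-198cc57e3f9a7400-exec-10",
        "ownerReferences": [{"uid": "b69a7c68-a767-4e3b-939c-061347b1c25e"}],
    },
    "status": {"containerStatuses": [{"state": {"running": {}}}]},
}

if __name__ == "__main__":
    print(leftover_executors(DRIVER, [EXEC_17, EXEC_10]))
```

Matching on the owner `uid` rather than on pod names avoids relying on the naming scheme, at the cost of needing the full pod objects instead of the tabular listing.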