[ https://issues.apache.org/jira/browse/SPARK-37856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Krivenko updated SPARK-37856:
-----------------------------------
    Environment: 
Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12

Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12

  was:
* Kubernetes 1.20
 * Spark 3.1.2
 * Hadoop 3.2.0
 * Java 11
 * Scala 2.12

and
 * Kubernetes 1.20
 * Spark 3.2.0
 * Hadoop 3.3.1
 * Java 11
 * Scala 2.12


> Executor pods keep existing if driver container was restarted
> -------------------------------------------------------------
>
>                 Key: SPARK-37856
>                 URL: https://issues.apache.org/jira/browse/SPARK-37856
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.2, 3.2.0
>         Environment: Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12
> Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12
>            Reporter: Denis Krivenko
>            Priority: Minor
>
> I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs 
> continuously, creating and managing executor pods. From time to time an OOM 
> issue occurs on the driver pod or on an executor pod.
> When it happens on
>  * an executor - the executor pod is deleted and the driver creates a new 
> executor pod to replace it. This works as expected.
>  * the driver - Kubernetes restarts the driver container and the driver 
> creates new executor pods. All previous executors stop, but their pods still 
> exist, in *Error* state for Spark 3.1.2 or in *Completed* state for Spark 3.2.0.
> The behavior can be reproduced by restarting the driver pod's container with 
> the command
> {code:java}
> kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
> Property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* by 
> default.
> If I delete the driver pod, all executor pods (in any state) are deleted as 
> well.
> +Pod list+
> {code:java}
> NAME                                           READY   STATUS      RESTARTS   AGE
> spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
> spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
> spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
> spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
> spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
> spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
> spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
>  {code}
> +Driver pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-85cf5d689b-vvrwd
>   uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
>     image: xxx/spark:3.2.0
>     lastState:
>       terminated:
>         containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
>         exitCode: 143
>         finishedAt: "2022-01-09T16:09:50Z"
>         reason: OOMKilled
>         startedAt: "2022-01-07T00:32:21Z"
>     name: spark-thrift-server
>     ready: true
>     restartCount: 1
>     started: true
>     state:
>       running:
>         startedAt: "2022-01-09T16:09:51Z" {code}
> +Executor pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-1a9aee7e31f36eea-exec-17
>   ownerReferences:
>   - apiVersion: v1
>     controller: true
>     kind: Pod
>     name: spark-thrift-server-85cf5d689b-vvrwd
>     uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>     image: xxx/spark:3.2.0
>     lastState: {}
>     name: spark-kubernetes-executor
>     ready: false
>     restartCount: 0
>     started: false
>     state:
>       terminated:
>         containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>         exitCode: 0
>         finishedAt: "2022-01-09T16:08:57Z"
>         reason: Completed
>         startedAt: "2022-01-09T01:39:15Z" {code}
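
The leftover executor pods in the listing above can be picked out mechanically. What follows is a minimal Python sketch, not part of the original report. The names leftover_executors, EXECUTOR_NAME, TERMINAL_STATUSES and SAMPLE are illustrative only. It assumes the default "kubectl get pods" column order (NAME, READY, STATUS, ...) and the "-exec-<n>" naming seen in the listing, and it treats a pod as leftover when its STATUS is Completed or Error, the two states mentioned in the report.

```python
import re

# Assumption: executor pods are named "<prefix>-exec-<n>", as in the listing above.
EXECUTOR_NAME = re.compile(r"-exec-\d+$")
# States the report observed on leftover executors
# (Error on Spark 3.1.2, Completed on Spark 3.2.0).
TERMINAL_STATUSES = {"Completed", "Error"}

# Rows copied from the pod list in the report.
SAMPLE = """\
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
"""


def leftover_executors(listing):
    """Return names of executor pods whose STATUS column is Completed or Error."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, status = fields[0], fields[2]
        if EXECUTOR_NAME.search(name) and status in TERMINAL_STATUSES:
            names.append(name)
    return names


if __name__ == "__main__":
    for pod in leftover_executors(SAMPLE):
        print(pod)
```

On the sample rows this selects the four 1a9aee7e31f36eea executors and skips the driver pod and the running 198cc57e3f9a7400 executors. Each name could then be passed to "kubectl delete pod" by hand; whether that is an acceptable workaround for a given deployment is not something the report addresses.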



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
