[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478478#comment-17478478 ]
Petri commented on SPARK-37910:
-------------------------------

In deployment.yaml we have:

    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: SPARK_DRIVER_BIND_ADDRESS
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: K8S_NS
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace

We are setting the following confs for spark-submit:

    DRIVER_HOSTNAME=$(echo $SPARK_DRIVER_BIND_ADDRESS | sed 's/\./-/g')
    --conf spark.kubernetes.driver.pod.name=$POD_NAME \
    --conf spark.driver.host=$DRIVER_HOSTNAME.$K8S_NS.pod.cluster.local \

So we are using the pod DNS name; is that OK, or should we use a headless service? Your documentation is not clear about it. What we are missing in our confs is spark.driver.port. Is that conf mandatory? Can you give exact steps for checking the pod network status?

We have a quite similar setup in another microservice, which works fine (Spark 3.2.0 and Java 11), but for some reason this microservice has the problem.

> Spark executor self-exiting due to driver disassociated in Kubernetes with
> client deploy-mode
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37910
>                 URL: https://issues.apache.org/jira/browse/SPARK-37910
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Petri
>            Priority: Major
>
> I have a Spark driver running in a Kubernetes pod with client deploy-mode,
> and it tries to start an executor. The executor fails with this error:
>
> {"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS",
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC",
> "class":"dispatcher-Executor",
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)",
> "log":"Executor self-exiting due to : Driver
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting
> down.\n"}
>
> The driver then attempts to start another executor, which fails with the
> same error, and this goes on and on.
> In the driver pod, I see only the following errors:
>
> 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
> 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
> 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
> 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
> 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
> 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
> 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
> 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
> 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
> 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
> 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
>
> What is wrong? And how can I get the executors running correctly?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
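As a hedged sketch of the headless-service alternative raised in the comment: the Spark on Kubernetes documentation for client mode suggests making the driver routable from the executors, typically via a headless service, and because spark.driver.port defaults to a random port, pinning it makes the driver's RPC endpoint predictable. All names and port numbers below are illustrative assumptions, not taken from the reporter's manifests:

    # Hypothetical headless Service fronting the driver pod.
    apiVersion: v1
    kind: Service
    metadata:
      name: spark-driver-svc          # illustrative name
      namespace: mni-system           # assumption: namespace seen in the error log
    spec:
      clusterIP: None                 # headless: DNS resolves directly to the pod IP
      selector:
        app: spark-driver             # must match the driver pod's labels
      ports:
        - name: driver-rpc
          port: 7078
        - name: blockmanager
          port: 7079

The corresponding spark-submit confs under those assumptions would pin the ports and point spark.driver.host at the service DNS name instead of the pod DNS name:

    --conf spark.driver.host=spark-driver-svc.mni-system.svc.cluster.local \
    --conf spark.driver.port=7078 \
    --conf spark.driver.blockManager.port=7079 \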
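The DRIVER_HOSTNAME derivation from the comment above can be sketched in isolation. The example values below are assumptions taken from the error log (the real values would come from the Downward API fields status.podIP and metadata.namespace):

```shell
# Reproduce the dashed-IP pod DNS name derivation from the comment above.
# Example values are illustrative, not read from a live cluster.
SPARK_DRIVER_BIND_ADDRESS="192.168.39.71"   # would come from status.podIP
K8S_NS="mni-system"                          # would come from metadata.namespace

# Replace every dot with a dash to form the pod's DNS label.
DRIVER_HOSTNAME=$(echo "$SPARK_DRIVER_BIND_ADDRESS" | sed 's/\./-/g')

echo "$DRIVER_HOSTNAME.$K8S_NS.pod.cluster.local"
# -> 192-168-39-71.mni-system.pod.cluster.local
```

This matches the driver address shown in the executor's "disassociated" error, so the name itself is being formed as intended; whether that name actually resolves and is reachable from the executor pods is a separate question.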