[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478478#comment-17478478 ]

Petri commented on SPARK-37910:
-------------------------------

In deployment.yaml we have:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: SPARK_DRIVER_BIND_ADDRESS
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: K8S_NS
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace

 

We are setting the following confs for spark-submit:

DRIVER_HOSTNAME=$(echo $SPARK_DRIVER_BIND_ADDRESS | sed 's/\./-/g')

--conf spark.kubernetes.driver.pod.name=$POD_NAME \
--conf spark.driver.host=$DRIVER_HOSTNAME.$K8S_NS.pod.cluster.local \

 

So we are using the pod DNS name; is that OK, or should we use a headless 
service? Your documentation is not clear about this. What we are missing in 
our confs is spark.driver.port. Is that a mandatory conf?
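
If a headless service is what the docs intend, the sketch below is roughly 
what we would try, following the client-mode networking section; the service 
name, label and port numbers are placeholders, not from our actual manifests:

    apiVersion: v1
    kind: Service
    metadata:
      name: spark-driver-svc          # placeholder name
    spec:
      clusterIP: None                 # headless service
      selector:
        app: spark-driver             # must match only the driver pod's labels
      ports:
        - name: driver-rpc
          port: 7078
          targetPort: 7078
        - name: blockmanager
          port: 7079
          targetPort: 7079

and then in spark-submit:

    --conf spark.driver.host=spark-driver-svc.$K8S_NS.svc.cluster.local \
    --conf spark.driver.port=7078 \
    --conf spark.driver.blockManager.port=7079 \

With something like that, spark.driver.host would point at the service instead 
of the pod DNS name, and the driver RPC and block manager ports would be pinned 
to the ports exposed by the service instead of being chosen at random.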

Can you give exact steps on how to check the pod network status?
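
To make sure we are checking the right things, this is roughly what we can run 
ourselves (pod names are placeholders, and nslookup/nc are only available if 
the images include them):

    # driver and executor pod IPs, nodes and status
    kubectl get pods -o wide -n mni-system

    # recent events for one of the failing executor pods
    kubectl describe pod <executor-pod> -n mni-system

    # DNS resolution of the driver pod DNS name from another pod in the cluster
    kubectl exec -n mni-system <some-pod> -- nslookup 192-168-39-71.mni-system.pod.cluster.local

    # TCP connectivity to the driver RPC port from the same pod
    kubectl exec -n mni-system <some-pod> -- nc -zv 192-168-39-71.mni-system.pod.cluster.local 40752

Is this what you mean by checking the pod network status, or are there other steps?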

 

We have quite a similar setup in another microservice, which is working OK 
(with Spark 3.2.0 and Java 11), but for some reason the microservice in 
question has this problem. 

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37910
>                 URL: https://issues.apache.org/jira/browse/SPARK-37910
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Petri
>            Priority: Major
>
> I have a Spark driver running in a Kubernetes pod with client deploy-mode, and 
> it tries to start an executor.
> The executor fails with the following error:
>     \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", 
> "class":"dispatcher-Executor", 
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
> "log":"Executor self-exiting due to : Driver 
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting 
> down.\n"}
> Then the driver will attempt to start another executor, which fails with the 
> same error, and this goes on and on.
> In the driver pod, I see only the following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 
> 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 
> 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 
> 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 
> 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 
> 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 
> 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 
> 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 
> 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 
> 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 
> 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 
> 192.168.43.231:
> What is wrong? And how can I get executors running correctly?



