Petri created SPARK-37999:
-----------------------------

             Summary: Spark executor self-exiting due to driver disassociated in Kubernetes
                 Key: SPARK-37999
                 URL: https://issues.apache.org/jira/browse/SPARK-37999
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.2.0
            Reporter: Petri


I have a Spark driver running in a Kubernetes pod in client deploy mode. I have 
created a headless K8S service named 'lola' on port 7077 that targets the 
driver pod.
The driver pod launches successfully and tries to start an executor, but 
the executor eventually fails with this error:
{code:java}
Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! 
Shutting down.{code}
The driver then stays up and attempts to start another executor, which fails 
with the same error. This repeats indefinitely, with the driver spawning new 
executors that keep failing.
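
For reference, a minimal sketch of the client-mode networking settings that 
this setup implies. Only the service name 'lola', the host lola.mni-system and 
port 7077 come from this report. Treating 'mni-system' as the namespace, the 
bind address 0.0.0.0, and the helper names driver_network_conf and 
to_submit_args are assumptions made for illustration, not the configuration 
actually used.

```python
# Hypothetical helpers; they only assemble configuration strings.
# spark.driver.host, spark.driver.port and spark.driver.bindAddress are real
# Spark properties, but the values below are assumptions based on the report.

def driver_network_conf(service="lola", namespace="mni-system", port=7077):
    """Return the driver networking properties for a headless-service setup."""
    host = f"{service}.{namespace}"  # DNS name that executors connect back to
    return {
        "spark.submit.deployMode": "client",
        "spark.driver.host": host,
        "spark.driver.port": str(port),
        # Assumption: bind on all interfaces inside the driver pod.
        "spark.driver.bindAddress": "0.0.0.0",
    }


def to_submit_args(conf):
    """Turn a property dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = driver_network_conf()
print(" ".join(to_submit_args(conf)))
```

The printed arguments can be compared against what the driver is really 
started with, to check that the advertised host and port match the service.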

In the driver pod, I see only the following errors (when using 'grep ERROR'):
{code:java}
22/01/24 13:41:12 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.82.105:
22/01/24 13:41:56 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.82.106:
22/01/24 13:42:12 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.47.80: 
The executor with ID 7 (registered at 1643031697505 ms) was not found in the 
cluster at the polling time (1643031731509 ms) which is after the accepted 
detect delta time (30000 ms) configured by 
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been 
deleted but the driver missed the deletion event. Marking this executor as 
failed.
22/01/24 13:42:38 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.82.103:
22/01/24 13:45:30 ERROR TaskSchedulerImpl: Lost executor 4 on 
192.168.50.220:{code}
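
The message for executor 7 can be checked by hand. A small sketch of the 
arithmetic, using only the three values printed in that log line (the variable 
names are illustrative):

```python
# Values copied from the "Lost executor 7" log line.
registered_ms = 1643031697505    # "registered at ... ms"
polled_ms = 1643031731509        # "polling time (... ms)"
detect_delta_ms = 30000          # spark.kubernetes.executor.missingPodDetectDelta

# Time between executor registration and the poll that did not find the pod.
elapsed_ms = polled_ms - registered_ms
exceeded = elapsed_ms > detect_delta_ms

print(elapsed_ms, exceeded)  # prints: 34004 True
```

So the executor pod was missing from the poll about 34 seconds after it 
registered, roughly 4 seconds past the configured delta.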
 

Full log from the executor:
{code:java}
+ source /opt/spark/bin/common.sh
+ cp /etc/group /tmp/group
+ cp /etc/passwd /tmp/passwd
++ id -u
+ myuid=1501
++ id -g
+ mygid=0
+ myuname=cspk
+ fsgid=
+ fsgrpname=cspk
+ set +e
++ getent passwd 1501
+ uidentry=
++ cat /etc/machine-id
cat: /etc/machine-id: No such file or directory
+ export SYSTEMID=
+ SYSTEMID=
+ set -e
+ '[' -z '' ']'
+ '[' -w /tmp/group ']'
+ echo cspk:x::
+ cp /etc/passwd /tmp/passwd.template
+ '[' -z '' ']'
+ '[' -w /tmp/passwd.template ']'
+ echo 'cspk:x:1501:0:anonymous uid:/opt/spark:/bin/false'
+ envsubst
+ export LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ export NSS_WRAPPER_PASSWD=/tmp/passwd
+ NSS_WRAPPER_PASSWD=/tmp/passwd
+ export NSS_WRAPPER_GROUP=/tmp/group
+ NSS_WRAPPER_GROUP=/tmp/group
+ SPARK_K8S_CMD=executor
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH='/var/local/streaming_engine/*:/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ env
+ sort -t_ -k4 -n
+ grep SPARK_AUTH_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_AUTH_OPTS
+ env
+ grep SPARK_NET_CRYPTO_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_NET_CRYPTO_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ set +x
TLS Not enabled for WebServer
+ CMD=(${JAVA_HOME}/bin/java $EXTRAJAVAOPTS "${SPARK_EXECUTOR_JAVA_OPTS[@]}" 
"${SPARK_AUTH_OPTS[@]}" "${SPARK_NET_CRYPTO_OPTS[@]}" 
-Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
$SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores 
$SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname 
$SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /etc/alternatives/jre_openjdk//bin/java 
-Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 
-Dlog4j.configurationFile=http://192.168.80.89:8888/log4j2.xml 
-Dlog4j.configuration=http://192.168.80.89:8888/log4j2.xml 
-Dcom.nokia.rtna.app=LolaStreamingApp -Dspark.driver.port=7077 -Xms4096m 
-Xmx4096m -cp '/var/local/streaming_engine/*:/opt/spark/jars/*' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@lola.mni-system:7077 --executor-id 10 --cores 3 
--app-id spark-application-1643031611044 --hostname 192.168.82.121
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/var/local/streaming_engine/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
"time":"2022-01-24T13:49:16.606Z", "timezone":"UTC", 
"class":"dispatcher-Executor", 
"method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", 
"log":"Executor self-exiting due to : Driver lola.mni-system:7077 
disassociated! Shutting down.\n"}
 {code}
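
One detail in the executor log above: it loads spark-unsafe_2.12-3.1.2.jar from 
/var/local/streaming_engine, while this issue lists 3.2.0 as the affected 
version and /opt/spark/jars/* is also on the classpath. Whether this is related 
to the failure is not established here. A minimal sketch for listing the Spark 
jar versions mentioned in such log text (the regular expression and the 
function name spark_jar_versions are assumptions for illustration):

```python
import re

# Matches names like spark-unsafe_2.12-3.1.2.jar and captures the
# artifact name, the Scala version, and the Spark version.
JAR_RE = re.compile(r"(spark-[a-z0-9-]+)_(\d+\.\d+)-(\d+\.\d+\.\d+)\.jar")


def spark_jar_versions(log_text):
    """Return the set of (artifact, spark_version) pairs found in log_text."""
    return {(m.group(1), m.group(3)) for m in JAR_RE.finditer(log_text)}


# Sample taken from the illegal-reflective-access warning in the log.
sample = "(file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor"
print(spark_jar_versions(sample))
```

Running the same function over a listing of /opt/spark/jars would show whether 
more than one Spark version ends up on the executor classpath.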



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
