In-Ho Yi created SPARK-37587:
--------------------------------

             Summary: Forever-running streams get completed with finished status when k8s worker is lost
                 Key: SPARK-37587
                 URL: https://issues.apache.org/jira/browse/SPARK-37587
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Structured Streaming
    Affects Versions: 3.1.1
            Reporter: In-Ho Yi


We have forever-running streaming jobs, defined as:

{code:scala}
spark.readStream
  .format("avro")
  .option("maxFilesPerTrigger", 10)
  .schema(schema)
  .load(loadPath)
  .as[T]
  .writeStream
  .option("checkpointLocation", checkpointPath)
{code}

We have several of these running, and at the end of main I call

{{spark.streams.awaitAnyTermination()}}

so that the whole job fails if any of the streams fail.
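
For completeness, the overall shape of main is roughly the following (a sketch only: the Record type, schema, paths, and avro sink below are illustrative placeholders, not our real values):

{code:scala}
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.StructType

object OurStreamingJob {
  // Placeholder record type for illustration.
  case class Record(id: Long, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ourstreamingjob").getOrCreate()
    import spark.implicits._

    val schema: StructType = Encoders.product[Record].schema

    // One of several forever-running streams; each has its own checkpoint dir.
    spark.readStream
      .format("avro")                          // needs the external spark-avro module
      .option("maxFilesPerTrigger", 10)
      .schema(schema)
      .load("s3a://bucket/input")              // illustrative loadPath
      .as[Record]
      .writeStream
      .format("avro")                          // illustrative sink
      .option("path", "s3a://bucket/output")   // illustrative output path
      .option("checkpointLocation", "s3a://bucket/checkpoints/stream1")
      .start()

    // Blocks until any active query terminates; if a query failed, this rethrows
    // its StreamingQueryException so the driver process exits non-zero.
    spark.streams.awaitAnyTermination()
  }
}
{code}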

Now, we are running on a k8s runner with the following arguments (showing only the relevant ones):

{code:yaml}
- /opt/spark/bin/spark-submit
# From `kubectl cluster-info`
- --master
- k8s://https://SOMEADDRESS.gr7.us-east-1.eks.amazonaws.com:443
- --deploy-mode
- client
- --name
- ourstreamingjob
# Driver connection via headless service
- --conf
- spark.driver.host=spark-driver-service
- --conf
- spark.driver.port=31337
# Move ivy temp dirs under /tmp. See https://stackoverflow.com/a/55921242/760482
- --conf
- spark.jars.ivy=/tmp/.ivy
# JVM settings for ivy and log4j.
- --conf
- spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp -Dlog4j.configuration=file:///opt/spark/log4j.properties
- --conf
- spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j.properties
# Spark on k8s settings.
- --conf
- spark.executor.instances=10
- --conf
- spark.executor.memory=8g
- --conf
- spark.executor.memoryOverhead=2g
- --conf
- spark.kubernetes.container.image=xxx.dkr.ecr.us-east-1.amazonaws.com/ourstreamingjob:latesthash
- --conf
- spark.kubernetes.container.image.pullPolicy=Always
{code}



The container image is built from the provided spark:3.1.1-hadoop3.2 image, with our jar added to it.

Our monitoring flagged that a couple of these streams were no longer running, and we found that they had been deemed "complete" with FINISHED status. I couldn't find any error logs from the streams themselves. We did find that the time at which these jobs completed coincides with the time when one of the k8s workers was lost and a new worker was added to the cluster.

From the console, it says:
{quote}Executor 7
Removed at 2021/12/07 22:26:12
Reason: The executor with ID 7 (registered at 1638858647668 ms) was not found 
in the cluster at the polling time (1638915971948 ms) which is after the 
accepted detect delta time (30000 ms) configured by 
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been 
deleted but the driver missed the deletion event. Marking this executor as 
failed.
{quote}
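(For reference, the 30000 ms window in that message comes from that conf; something along these lines would widen it, though I assume that only delays the executor-lost detection and would not change the FINISHED status behaviour:)

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: widen the missing-pod detection window beyond the 30s
// seen in the log above. This delays when a lost executor pod is marked failed;
// it is not expected to stop the streaming query from being marked FINISHED.
val spark = SparkSession.builder()
  .config("spark.kubernetes.executor.missingPodDetectDelta", "120s")
  .getOrCreate()
{code}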
I think the correct behavior would be to continue the stream from the last known checkpoint. If there really is an error, it should be thrown and propagated so that the pipeline eventually terminates.
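
To narrow this down, a listener along these lines could be added to the driver (a sketch, not something in our current job) to log whether each query terminates with or without an exception when a worker is lost:

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Diagnostic sketch: log every query termination so we can see whether the
// "FINISHED" queries end with an exception or without one.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: id=${event.id}, name=${event.name}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = ()

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    // exception is None when Spark treats the termination as a clean finish.
    println(s"Query terminated: id=${event.id}, exception=${event.exception}")
})
{code}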

I've looked through the changes since 3.1.1 but couldn't find a fix for this specific issue. Also, is there any workaround you could recommend?

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
