In-Ho Yi created SPARK-37587:
--------------------------------

             Summary: Forever-running streams get completed with finished status when k8s worker is lost
                 Key: SPARK-37587
                 URL: https://issues.apache.org/jira/browse/SPARK-37587
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Structured Streaming
    Affects Versions: 3.1.1
            Reporter: In-Ho Yi
We have forever-running streaming jobs, defined as:

{code:scala}
spark.readStream
  .format("avro")
  .option("maxFilesPerTrigger", 10)
  .schema(schema)
  .load(loadPath)
  .as[T]
  .writeStream
  .option("checkpointLocation", checkpointPath)
{code}

(A fuller, self-contained sketch of this setup is included at the end of this description.)

We have several of these running, and at the end of main I call

{code:scala}
spark.streams.awaitAnyTermination()
{code}

so that the whole job fails if any of the streams fails.

We are running on a k8s runner with the following arguments (showing only the relevant ones):

{code}
- /opt/spark/bin/spark-submit
# From `kubectl cluster-info`
- --master
- k8s://https://SOMEADDRESS.gr7.us-east-1.eks.amazonaws.com:443
- --deploy-mode
- client
- --name
- ourstreamingjob
# Driver connection via headless service
- --conf
- spark.driver.host=spark-driver-service
- --conf
- spark.driver.port=31337
# Move ivy temp dirs under /tmp. See https://stackoverflow.com/a/55921242/760482
- --conf
- spark.jars.ivy=/tmp/.ivy
# JVM settings for ivy and log4j.
- --conf
- spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp -Dlog4j.configuration=file:///opt/spark/log4j.properties
- --conf
- spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j.properties
# Spark on k8s settings.
- --conf
- spark.executor.instances=10
- --conf
- spark.executor.memory=8g
- --conf
- spark.executor.memoryOverhead=2g
- --conf
- spark.kubernetes.container.image=xxx.dkr.ecr.us-east-1.amazonaws.com/ourstreamingjob:latesthash
- --conf
- spark.kubernetes.container.image.pullPolicy=Always
{code}

The container image is built from the provided spark:3.1.1-hadoop3.2 image, with our jar added on top.

Our monitoring found that a couple of these streams were no longer running: the queries had been deemed "complete" with FINISHED status. I couldn't find any error logs from the streams themselves. We did find that the time at which these queries completed coincides with the time at which one of the k8s workers was lost and a new worker was added to the cluster.

From the console, it says:
{quote}
Executor 7
Removed at 2021/12/07 22:26:12
Reason: The executor with ID 7 (registered at 1638858647668 ms) was not found in the cluster at the polling time (1638915971948 ms) which is after the accepted detect delta time (30000 ms) configured by `spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been deleted but the driver missed the deletion event. Marking this executor as failed.
{quote}

I think the correct behavior would be for the streams to continue from the last known checkpoint. If there really is an error, it should be thrown and propagated so that the pipeline eventually terminates.

I've looked through the changes since 3.1.1 but couldn't find a fix for this specific issue. Also, is there any workaround you could recommend?

Thanks!
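For reference, here is a minimal, self-contained sketch of how the streams are defined and awaited. The {{Event}} case class, the paths, the parquet sink, and the app name are illustrative placeholders rather than our production values:

{code:scala}
import org.apache.spark.sql.{Encoders, SparkSession}

// Placeholder record type; the real jobs use domain-specific case classes.
case class Event(id: String, ts: Long)

object StreamingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ourstreamingjob")
      .getOrCreate()
    import spark.implicits._

    // Placeholder paths; the real jobs read from and checkpoint to S3.
    val loadPath       = "s3a://bucket/input/events"
    val checkpointPath = "s3a://bucket/checkpoints/events"

    val schema = Encoders.product[Event].schema

    // One of several forever-running file-source streams.
    // Reading avro requires the spark-avro module on the classpath.
    spark.readStream
      .format("avro")
      .option("maxFilesPerTrigger", 10)
      .schema(schema)
      .load(loadPath)
      .as[Event]
      .writeStream
      .format("parquet") // placeholder sink; the real jobs write elsewhere
      .option("path", "s3a://bucket/output/events")
      .option("checkpointLocation", checkpointPath)
      .start()

    // Block until any query terminates. A failed query should surface its
    // exception here and fail the whole job; in the behavior reported above,
    // the queries instead end as FINISHED and this call returns without error.
    spark.streams.awaitAnyTermination()
  }
}
{code}

The expectation is that awaitAnyTermination() only returns (or throws) when a query genuinely terminates; losing an executor pod should not cause the query itself to end as FINISHED.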