I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The
job sources data from a Kafka topic, performs a variety of filters and
transformations, and sinks data back into a different Kafka topic.

Once per day, we stop the query in order to merge the namenode edit logs
with the fsimage, because Structured Streaming creates and destroys a
significant number of HDFS files, and EMR doesn't support a secondary or HA
namenode for fsimage compaction (AWS support directed us to do this, as
Namenode edit logs were filling the disk).

Occasionally, the Structured Streaming query will not restart because the
most recent file in the "commits" or "offsets" checkpoint subdirectory is
empty. This seems like an undesirable behavior, as it requires manual
intervention to remove the empty files in order to force the job to fall
back onto the last good values. Has anyone run into this behavior? The only
similar issue I can find is SPARK-21760
<https://issues.apache.org/jira/browse/SPARK-21760>, which appears to have
no fix or workaround.

Any assistance would be greatly appreciated!

Regards,
Will

Reply via email to