Hi

I have a Spark Structured Streaming job that reads from Kafka and writes
Parquet files to Hive/HDFS. The files are not very large, but the Kafka
source is noisy, so each Spark job takes a long time to complete. That
leaves a significant window during which the Parquet files are incomplete,
and other tools, including PrestoDB, hit errors when they try to read them.

I wrote to this list and posted on Stack Overflow about the problem last summer:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill-tt29043.html
https://stackoverflow.com/questions/47337030/not-a-parquet-file-too-small-from-presto-during-spark-structured-streaming-r/47339805#47339805

After hesitating for a while, I wrote a custom commit protocol to solve the
problem. It combines HadoopMapReduceCommitProtocol's behavior of writing to
a temp file first with ManifestFileCommitProtocol's manifest tracking. From
what I can tell, ManifestFileCommitProtocol is required for the normal
Structured Streaming behavior of being able to resume streams from a known
point.
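
To make the idea concrete, here is a stripped-down sketch of the
temp-file-then-rename half. This is not the linked commit: the class and
staging-directory names are made up, it targets the Spark 2.x
FileCommitProtocol API, and it leaves out the manifest bookkeeping that
ManifestFileCommitProtocol does in commitJob.

import java.util.UUID

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}

import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage

// Sketch only: stage each task's output under a dot-prefixed directory and
// rename it into place on task commit. Rename is atomic and cheap on HDFS,
// so readers see either no file or a complete one, never a truncated one.
class TempFileThenRenameProtocol(jobId: String, path: String)
  extends FileCommitProtocol with Serializable {

  // final path -> staged temp path, for the current task only
  @transient private var stagedFiles: Map[String, String] = _

  override def setupJob(jobContext: JobContext): Unit = ()

  override def setupTask(taskContext: TaskAttemptContext): Unit = {
    stagedFiles = Map.empty
  }

  override def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = {
    val filename =
      f"part-${taskContext.getTaskAttemptID.getTaskID.getId}%05d-${UUID.randomUUID}$ext"
    val finalPath = dir
      .map(d => new Path(new Path(path, d), filename))
      .getOrElse(new Path(path, filename))
    // Hand the writer a path in a hidden staging dir instead of the final one.
    val tempPath = new Path(new Path(path, s".spark-staging-$jobId"), filename)
    stagedFiles += finalPath.toString -> tempPath.toString
    tempPath.toString
  }

  override def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String =
    throw new UnsupportedOperationException("absolute output paths not handled in this sketch")

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
    val fs = new Path(path).getFileSystem(taskContext.getConfiguration)
    stagedFiles.foreach { case (finalPath, tempPath) =>
      val target = new Path(finalPath)
      fs.mkdirs(target.getParent)            // partition dir may not exist yet
      fs.rename(new Path(tempPath), target)  // atomic on HDFS
    }
    // Report the final paths so the driver can build its manifest entries.
    new TaskCommitMessage(stagedFiles.keys.toSeq)
  }

  override def abortTask(taskContext: TaskAttemptContext): Unit = {
    val fs = new Path(path).getFileSystem(taskContext.getConfiguration)
    stagedFiles.values.foreach(p => fs.delete(new Path(p), false))
  }

  // In the real protocol this is where the ManifestFileCommitProtocol side
  // kicks in: write the batch's file list to the FileStreamSinkLog so the
  // stream can be resumed from a known point.
  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = ()

  override def abortJob(jobContext: JobContext): Unit = ()
}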

I think this commit protocol could be generally useful. Writing to a temp
file and then moving it to the final location is cheap on HDFS and is the
standard behavior for non-streaming jobs, as implemented in
HadoopMapReduceCommitProtocol. At the same time, ManifestFileCommitProtocol
is needed for Structured Streaming jobs. We have been running this in
production for a few months without problems.
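
For reference, this is roughly how we wire it in. The class name, topic,
and paths below are placeholders, and spark.sql.streaming.commitProtocolClass
is an internal setting used to pick the streaming file sink's commit
protocol, so treat this as a sketch of our setup rather than a supported API:

// Placeholder names throughout; the commit protocol class is the one linked below.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("kafka-to-parquet")
  .config("spark.sql.streaming.commitProtocolClass",
    "com.example.TempFileThenRenameProtocol")
  .getOrCreate()

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "noisy-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///warehouse/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .start()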

Here is the code (at the moment not up to Spark standards, admittedly):
https://github.com/davcamer/spark/commit/361f1c69851f0f94cfd974ce720c694407f9340b

Did I miss a better approach? Does anyone else think this would be useful?

Thanks for reading,
Dave
