Hi
I have a Spark Structured Streaming job that reads from Kafka and writes Parquet files to Hive/HDFS. The files are not very large, but the Kafka source is noisy, so each Spark job takes a long time to complete. This leaves a significant window during which the Parquet files are incomplete, and other tools, including PrestoDB, encounter errors while trying to read them.

I wrote to this list and posted on Stack Overflow about the problem last summer:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill-tt29043.html
https://stackoverflow.com/questions/47337030/not-a-parquet-file-too-small-from-presto-during-spark-structured-streaming-r/47339805#47339805

After hesitating for a while, I wrote a custom commit protocol to solve the problem. It combines HadoopMapReduceCommitProtocol's behavior of writing to a temporary file first with ManifestFileCommitProtocol. From what I can tell, ManifestFileCommitProtocol is required for the normal Structured Streaming behavior of being able to resume streams from a known point.

I think this commit protocol could be generally useful. Writing to a temporary file and then moving it to the final location is cheap on HDFS and is the standard behavior for non-streaming jobs, as implemented in HadoopMapReduceCommitProtocol, while ManifestFileCommitProtocol is still needed for Structured Streaming jobs. We have been running this for a few months in production without problems.

Here is the code (at the moment not up to Spark standards, admittedly):
https://github.com/davcamer/spark/commit/361f1c69851f0f94cfd974ce720c694407f9340b

Did I miss a better approach? Does anyone else think this would be useful?

Thanks for reading,
Dave
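P.S. For anyone who hasn't seen the pattern, here is a minimal sketch of the write-to-temp-then-rename idea, in plain Python against a local filesystem. The real implementation works through Hadoop's FileSystem API inside a FileCommitProtocol, so the function name and paths here are purely illustrative:

```python
import os
import tempfile

def write_atomically(final_path, data):
    """Write data to a temp file in the same directory, then rename it
    into place. The rename is atomic on POSIX filesystems (and a rename
    within HDFS is similarly a single metadata operation), so readers
    never observe a partially written file at final_path."""
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix="._tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are durable before the rename
        os.replace(tmp_path, final_path)  # atomic move to the final location
    except Exception:
        os.unlink(tmp_path)  # on failure, leave nothing at the final path
        raise

# Illustrative usage: a reader polling this directory sees either no file
# or a complete one, never a truncated "Not a Parquet file" stub.
write_atomically("part-00000.parquet", b"...parquet bytes...")
```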