Mihaly Toth created SPARK-25331:
-----------------------------------

             Summary: Structured Streaming File Sink duplicates records in case of driver failure
                 Key: SPARK-25331
                 URL: https://issues.apache.org/jira/browse/SPARK-25331
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.3.1
            Reporter: Mihaly Toth
Let's assume {{FileStreamSink.addBatch}} is called, an appropriate job has been started by {{FileFormatWriter.write}}, and the resulting task sets have completed, but in the meantime the driver dies. In such a case, repeating {{FileStreamSink.addBatch}} results in the data being written a second time. In other words, if the driver fails after the executors have started processing the job, the processed batch will be written twice.

Steps needed to reproduce:
1. call {{FileStreamSink.addBatch}}
2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} (see the sketch below for one way to simulate this)
3. call {{FileStreamSink.addBatch}} again with the same data
4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
5. verify the file output - according to the {{Sink.addBatch}} documentation, the rdd should be written only once

I have created a WIP PR with a unit test: https://github.com/apache/spark/pull/22331
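For reference, here is a minimal sketch of one way step 2 could be simulated. The {{FlakyManifestCommitProtocol}} class and the {{FlakyCommitState}} object are hypothetical test helpers (not the actual code in the WIP PR), and the sketch assumes the commit protocol can be swapped in through the existing {{spark.sql.streaming.commitProtocolClass}} configuration:

{code:scala}
import org.apache.hadoop.mapreduce.JobContext

import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol

// Driver-side flag so the protocol fails exactly once across instantiations.
object FlakyCommitState {
  @volatile var failNextCommit = true
}

// Hypothetical test helper: behaves like ManifestFileCommitProtocol, but the
// first commitJob throws, simulating a driver that dies after the tasks have
// finished writing their files but before the batch manifest is committed.
class FlakyManifestCommitProtocol(jobId: String, path: String)
    extends ManifestFileCommitProtocol(jobId, path) {

  override def commitJob(
      jobContext: JobContext,
      taskCommits: Seq[TaskCommitMessage]): Unit = {
    if (FlakyCommitState.failNextCommit) {
      FlakyCommitState.failNextCommit = false
      throw new RuntimeException(
        "simulated driver failure before the manifest commit")
    }
    super.commitJob(jobContext, taskCommits)
  }
}
{code}

With a protocol like this configured, the first {{addBatch}} call fails in {{commitJob}} after the tasks have already written their files, the retried {{addBatch}} writes a second set of files, and reading the output directory directly (ignoring the metadata log) shows every record twice - which is how the duplication described above surfaces.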