[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mihaly Toth updated SPARK-25331:
--------------------------------

Structured Streaming File Sink duplicates records in case of driver failure
----------------------------------------------------------------------------

                Key: SPARK-25331
                URL: https://issues.apache.org/jira/browse/SPARK-25331
            Project: Spark
         Issue Type: Bug
         Components: Structured Streaming
   Affects Versions: 2.3.1
           Reporter: Mihaly Toth
           Priority: Major

Description:

Suppose {{FileStreamSink.addBatch}} is called, the corresponding job is started by {{FileFormatWriter.write}}, and the resulting task sets complete, but in the meantime the driver dies. Repeating {{FileStreamSink.addBatch}} for the same batch then writes the data a second time: whenever the driver fails after the executors have started processing the job, the processed batch ends up in the output twice.

Steps to reproduce:
# call {{FileStreamSink.addBatch}}
# make {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} (a sketch of this step follows below)
# call {{FileStreamSink.addBatch}} again with the same data
# let {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
# verify the file output - according to the {{Sink.addBatch}} documentation, the data should be written only once

I have created a WIP PR with a unit test: https://github.com/apache/spark/pull/22331
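For illustration, steps 2 and 4 might be arranged in a test roughly as follows. This is a minimal sketch against Spark 2.3.x internals, not code taken from the linked PR; {{FailFirstCommitProtocol}} and its {{failedOnce}} flag are hypothetical test helpers.

{code:scala}
// Hypothetical test helper (not from the linked PR): a commit protocol that
// fails its first commitJob, simulating a driver that dies after the tasks
// have written their files but before the batch is added to the manifest.
import org.apache.hadoop.mapreduce.JobContext

import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol

class FailFirstCommitProtocol(jobId: String, path: String)
    extends ManifestFileCommitProtocol(jobId, path) {

  override def commitJob(
      jobContext: JobContext,
      taskCommits: Seq[TaskCommitMessage]): Unit = {
    if (!FailFirstCommitProtocol.failedOnce) {
      FailFirstCommitProtocol.failedOnce = true
      // The task outputs are already on disk at this point; throwing here
      // leaves them behind uncommitted, which is the state a driver crash
      // between task completion and manifest commit would produce (step 2).
      throw new RuntimeException("simulated driver failure in commitJob")
    }
    // On the retried batch, commit normally (step 4).
    super.commitJob(jobContext, taskCommits)
  }
}

object FailFirstCommitProtocol {
  // Tracks whether the simulated failure has already happened, so the
  // second addBatch attempt can complete its commit successfully.
  @volatile var failedOnce = false
}
{code}

Since {{FileStreamSink}} instantiates its commit protocol via the internal SQL conf {{spark.sql.streaming.commitProtocolClass}} (default: {{ManifestFileCommitProtocol}}), a test could set that conf to {{classOf[FailFirstCommitProtocol].getName}} before running the query, retry the batch, and then assert that each input row appears in the output exactly once.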