[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mihaly Toth updated SPARK-25331:
--------------------------------

Structured Streaming File Sink duplicates records in case of driver failure
----------------------------------------------------------------------------

                Key: SPARK-25331
                URL: https://issues.apache.org/jira/browse/SPARK-25331
            Project: Spark
         Issue Type: Bug
         Components: Structured Streaming
   Affects Versions: 2.3.1
           Reporter: Mihaly Toth
           Priority: Major

Description:

Suppose {{FileStreamSink.addBatch}} is called, the corresponding job is started by {{FileFormatWriter.write}}, and the resulting task sets complete, but in the meantime the driver dies. Repeating {{FileStreamSink.addBatch}} for the same batch then writes the data a second time: whenever the driver fails after the executors have started processing the job, the processed batch ends up in the output twice.

Steps to reproduce:
# call {{FileStreamSink.addBatch}}
# make {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} (a sketch of this step follows below)
# call {{FileStreamSink.addBatch}} again with the same data
# let {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
# verify the file output - according to the {{Sink.addBatch}} documentation, the data should be written only once

I have created a WIP PR with a unit test: https://github.com/apache/spark/pull/22331
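For illustration, steps 2 and 4 might be arranged in a test roughly as follows. This is a minimal sketch against Spark 2.3.x internals, not code taken from the linked PR; {{FailFirstCommitProtocol}} and its {{failedOnce}} flag are hypothetical test helpers.

{code:scala}
// Hypothetical test helper (not from the linked PR): a commit protocol that
// fails its first commitJob, simulating a driver that dies after the tasks
// have written their files but before the batch is added to the manifest.
import org.apache.hadoop.mapreduce.JobContext

import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol

class FailFirstCommitProtocol(jobId: String, path: String)
    extends ManifestFileCommitProtocol(jobId, path) {

  override def commitJob(
      jobContext: JobContext,
      taskCommits: Seq[TaskCommitMessage]): Unit = {
    if (!FailFirstCommitProtocol.failedOnce) {
      FailFirstCommitProtocol.failedOnce = true
      // The task outputs are already on disk at this point; throwing here
      // leaves them behind uncommitted, which is the state a driver crash
      // between task completion and manifest commit would produce (step 2).
      throw new RuntimeException("simulated driver failure in commitJob")
    }
    // On the retried batch, commit normally (step 4).
    super.commitJob(jobContext, taskCommits)
  }
}

object FailFirstCommitProtocol {
  // Tracks whether the simulated failure has already happened, so the
  // second addBatch attempt can complete its commit successfully.
  @volatile var failedOnce = false
}
{code}

Since {{FileStreamSink}} instantiates its commit protocol via the internal SQL conf {{spark.sql.streaming.commitProtocolClass}} (default: {{ManifestFileCommitProtocol}}), a test could set that conf to {{classOf[FailFirstCommitProtocol].getName}} before running the query, retry the batch, and then assert that each input row appears in the output exactly once.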