Hi Team, I have been using Structured Streaming with the S3 data source, but I am seeing it duplicate data intermittently. A new run seems to fix it, but the duplication happens roughly 10% of the time, and the ratio increases with the number of files in the source. Digging in further, this looks like an issue with S3's eventual consistency: Spark ends up running a task twice because it cannot verify that the output of the already-completed task was successfully written, and the re-run produces the duplicate rows.
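For context, below is a minimal sketch of the kind of query that hits this; the bucket names, schema, paths, and trigger interval are placeholders for illustration, not the actual job or code attached to the ticket.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object S3DuplicateRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-structured-streaming-repro")
      .getOrCreate()

    // Placeholder schema for the incoming JSON files.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("payload", StringType)
    ))

    // Source: an S3 prefix that new files are continuously dropped into.
    val input = spark.readStream
      .schema(schema)
      .json("s3a://my-bucket/input/")   // placeholder bucket/prefix

    // Sink: another S3 prefix, with the checkpoint also on S3.
    // Intermittently (~10% of runs) some input rows show up twice in the
    // output, more often as the number of source files grows.
    val query = input.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/output/")             // placeholder
      .option("checkpointLocation", "s3a://my-bucket/chk/")  // placeholder
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```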
I have added the full details of the investigation, along with code and error logs, in the ticket below. Is there a way we can address this issue, and is there anything I can help out with? https://issues.apache.org/jira/browse/SPARK-23050 Cheers