Hi Team, I have been using Structured Streaming with the S3 data source, but I am seeing it duplicate data intermittently. A new run seems to fix it, but the duplication happens roughly 10% of the time, and the ratio increases with the number of files in the source. Digging in further, this looks like an issue with S3's eventual consistency: Spark ends up running a task twice because it cannot verify that the output of the already-completed task was successfully written, and the re-run produces the duplicate rows.
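For context, below is a minimal sketch of the kind of query that hits this; the bucket names, schema, paths, and trigger interval are placeholders for illustration, not the actual job or code attached to the ticket.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object S3DuplicateRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-structured-streaming-repro")
      .getOrCreate()

    // Placeholder schema for the incoming JSON files.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("payload", StringType)
    ))

    // Source: an S3 prefix that new files are continuously dropped into.
    val input = spark.readStream
      .schema(schema)
      .json("s3a://my-bucket/input/")   // placeholder bucket/prefix

    // Sink: another S3 prefix, with the checkpoint also on S3.
    // Intermittently (~10% of runs) some input rows show up twice in the
    // output, more often as the number of source files grows.
    val query = input.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/output/")             // placeholder
      .option("checkpointLocation", "s3a://my-bucket/chk/")  // placeholder
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```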
I have added the full details of the investigation, along with code and error logs, in the ticket below. Is there a way we can address this issue, and is there anything I can help out with? https://issues.apache.org/jira/browse/SPARK-23050 Cheers