[ https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895999#comment-16895999 ]
Steve Loughran commented on BAHIR-213:
--------------------------------------

BTW, because of the delay between an S3 change and the event being processed, there's a risk of changes in the store happening before the stream handler sees them:

1. POST path
2. event #1 queued
3. DELETE path
4. event #2 queued
5. event #1 received
6. FNFE when querying the file

Also: double update

1. POST path
2. event #1 queued
3. POST path
4. event #2 queued
5. event #1 received
6. contents of path are at state (3)
7. event #2 received even though the state hasn't changed

There are also two other issues:

* the risk of events arriving out of order.
* the risk of a previous state of the file (contents or tombstone) being seen while processing event #1.

What does that mean? I think it means that you need to handle:

* the file potentially missing when you receive the event... but you still need to handle the possibility that a tombstone was cached before the POST #1 operation, so you may want to spin a bit awaiting its arrival (see the second sketch at the end of this message).
* the file details at processing time differing from those in the event data.

The best thing to do here is demand that every file uploaded MUST have a unique name, while making sure that the new stream source is resilient to changes (i.e. it downgrades if the source file isn't there...), without offering any guarantees of correctness.

> Faster S3 file Source for Structured Streaming with SQS
> -------------------------------------------------------
>
>                 Key: BAHIR-213
>                 URL: https://issues.apache.org/jira/browse/BAHIR-213
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark Structured Streaming Connectors
>    Affects Versions: Spark-2.4.0
>            Reporter: Abhishek Dixit
>            Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems in terms of both costs and latency:
> * *Latency:* Listing all the files in S3 buckets every microbatch can be both slow and resource intensive.
> * *Costs:* Making List API requests to S3 every microbatch can be costly.
>
> The solution is to use Amazon Simple Queue Service (SQS), which lets you find new files written to an S3 bucket without the need to list all the files every microbatch.
>
> S3 buckets can be configured to send notifications to an Amazon SQS queue on Object Create / Object Delete events. For details, see the AWS documentation: [Configuring S3 Event Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>
> Spark can leverage this to find new files written to an S3 bucket by reading notifications from the SQS queue instead of listing files every microbatch (a minimal polling sketch follows at the end of this message).
>
> I hope to contribute the changes proposed in [this pull request|https://github.com/apache/spark/pull/24934] to Apache Bahir, as suggested by @[gaborgsomogyi|https://github.com/gaborgsomogyi] [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130]
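The description above proposes replacing per-microbatch LIST calls with S3-to-SQS event notifications. As a minimal sketch of that mechanism (assuming the AWS SDK for Java v1 and json4s on the classpath; the object name {{SqsFileDiscovery}}, the queue URL parameter, and the batch/wait values are illustrative, not the proposed connector's actual API):

{code:scala}
import scala.collection.JavaConverters._

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical helper: long-poll an SQS queue fed by S3 ObjectCreated
// notifications and return the object keys named in each message.
// Error handling and deleting messages after a successful batch are omitted.
object SqsFileDiscovery {
  implicit val formats: Formats = DefaultFormats

  def newKeys(queueUrl: String): Seq[String] = {
    val sqs = AmazonSQSClientBuilder.defaultClient()
    val request = new ReceiveMessageRequest(queueUrl)
      .withMaxNumberOfMessages(10)   // batch size: illustrative value
      .withWaitTimeSeconds(20)       // long polling instead of listing the bucket
    for {
      msg <- sqs.receiveMessage(request).getMessages.asScala
      record <- (parse(msg.getBody) \ "Records").children
    } yield (record \ "s3" \ "object" \ "key").extract[String]
  }
}
{code}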
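And per the comment's advice to spin briefly for a missing file and then downgrade rather than fail, a sketch of a defensive lookup for each discovered key. Again hypothetical: the object name, retry count, and backoff are placeholders, built on the Hadoop FileSystem API a Spark file source would already be using:

{code:scala}
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper: resolve a path named in an SQS event, retrying a
// few times in case an older tombstone (or the delayed POST) is still
// visible, and returning None so the caller can skip the entry instead
// of failing the batch. No correctness guarantee, as the comment notes.
object ResilientLookup {
  def statusOrNone(
      fs: FileSystem,
      path: Path,
      attempts: Int = 3,
      backoffMs: Long = 1000L): Option[FileStatus] = {
    var remaining = attempts
    while (remaining > 0) {
      try {
        return Some(fs.getFileStatus(path))  // throws FNFE if absent
      } catch {
        case _: FileNotFoundException =>
          remaining -= 1
          if (remaining > 0) Thread.sleep(backoffMs)
      }
    }
    None  // file gone between event and processing: downgrade, don't fail
  }
}
{code}

With unique upload names, as the comment demands, a None here most likely means the file was deleted after the event was queued, so dropping the entry is the safe downgrade.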