[
https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895999#comment-16895999
]
Steve Loughran commented on BAHIR-213:
--------------------------------------
BTW, because of the delay between an S3 change and the event being processed,
there's a risk of further changes happening in the store before the stream
handler sees the event:
1. POST path
2. event #1 queued
3. DELETE path
4. event #2 queued
5. event #1 received
6. FNFE (FileNotFoundException) when querying the file
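A minimal sketch of tolerating that, assuming the source probes the store
through the Hadoop FileSystem API (the helper name is hypothetical):
{code:scala}
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// The event may describe a file which was deleted before we got here;
// map FileNotFoundException to None so the batch can skip the event
// rather than fail.
def statusIfPresent(fs: FileSystem, path: Path): Option[FileStatus] =
  try {
    Some(fs.getFileStatus(path))
  } catch {
    case _: FileNotFoundException => None
  }
{code}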
Also: double update
1. POST path
2. event #1 queued
3. POST path
4. event #2 queued
5. event #1 received
6. contents of the path are already at state (3)
7. event #2 received even though the state hasn't changed
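One way to make that duplicate harmless (a sketch, not anything the connector
does today) is to process events idempotently, e.g. by remembering the last
etag handled per key; the map and helper below are hypothetical, and the state
is process-local rather than durable:
{code:scala}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

import scala.collection.mutable

val s3 = AmazonS3ClientBuilder.defaultClient()
// Last etag processed per key.
val lastSeenETag = mutable.Map.empty[String, String]

// True only if the object's current etag differs from what was last
// processed, so event #2 for an unchanged object becomes a no-op.
def shouldProcess(bucket: String, key: String): Boolean = {
  val etag = s3.getObjectMetadata(bucket, key).getETag
  val changed = !lastSeenETag.get(key).contains(etag)
  if (changed) lastSeenETag(key) = etag
  changed
}
{code}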
There are also two other issues:
* the risk of events arriving out of order (see the sketch below).
* the risk of a previous state of the file (contents or tombstone) being seen
when processing event #1.
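For the ordering risk: S3 event notification records carry a per-key
{{sequencer}} field, and AWS documents that, for the same key, the event with
the larger value happened later, after right-padding the shorter hex string
with zeros. A sketch of that comparison (helper name hypothetical):
{code:scala}
// True iff the event carrying seqA happened after the one carrying seqB,
// for the same object key: right-pad the shorter hex string with zeros,
// then compare lexicographically.
def isNewer(seqA: String, seqB: String): Boolean = {
  val len = math.max(seqA.length, seqB.length)
  seqA.padTo(len, '0').compareTo(seqB.padTo(len, '0')) > 0
}
{code}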
What does that mean? I think it means that you need to handle:
* the file potentially missing when you receive the event... but you still
need to handle the possibility that a tombstone was cached before the POST #1
operation, so you may want to spin a bit awaiting its arrival (see the sketch
after this list).
* the file details seen while processing the event differing from those in
the event data.
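For the first of those, a sketch of the "spin a bit" tactic against the
Hadoop FileSystem API; the helper name, attempt count and delay are
hypothetical:
{code:scala}
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Poll briefly for the file before concluding it is really gone, in case
// the create event has raced ahead of a cached tombstone.
def awaitFile(fs: FileSystem, path: Path,
    attempts: Int = 5, delayMs: Long = 500): Option[FileStatus] = {
  var remaining = attempts
  while (remaining > 0) {
    try {
      return Some(fs.getFileStatus(path))
    } catch {
      case _: FileNotFoundException =>
        remaining -= 1
        if (remaining > 0) Thread.sleep(delayMs)
    }
  }
  None
}
{code}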
The best thing to do here is demand that every uploaded file MUST have a
unique name, while making sure that the new stream source is resilient to
changes (i.e. downgrades gracefully if the source file isn't there...),
without offering any guarantees of correctness.
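A sketch of what that unique-name demand could look like on the writer side
(the helper and key layout are hypothetical):
{code:scala}
import java.util.UUID

// Never reuse a key: with unique names there is no delete-after-post or
// double-update race on a single path left to get wrong.
def uniqueKey(prefix: String, suffix: String): String =
  s"$prefix/${System.currentTimeMillis()}-${UUID.randomUUID()}$suffix"

// e.g. uniqueKey("incoming", ".json") => "incoming/<millis>-<uuid>.json"
{code}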
> Faster S3 file Source for Structured Streaming with SQS
> -------------------------------------------------------
>
> Key: BAHIR-213
> URL: https://issues.apache.org/jira/browse/BAHIR-213
> Project: Bahir
> Issue Type: New Feature
> Components: Spark Structured Streaming Connectors
> Affects Versions: Spark-2.4.0
> Reporter: Abhishek Dixit
> Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems in terms
> of both cost and latency:
> * *Latency:* Listing all the files in S3 buckets every microbatch can be both
> slow and resource-intensive.
> * *Costs:* Making List API requests to S3 every microbatch can be costly.
> The solution is to use Amazon Simple Queue Service (SQS), which lets you find
> new files written to an S3 bucket without the need to list all the files
> every microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on
> Object Create / Object Delete events. For details, see the AWS documentation:
> [Configuring S3 Event
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
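>
> A minimal sketch of such a configuration with the AWS Java SDK v1 (the
> bucket name and queue ARN are placeholders; only Object Create events are
> wired up here):
> {code:scala}
> import java.util.EnumSet
>
> import com.amazonaws.services.s3.AmazonS3ClientBuilder
> import com.amazonaws.services.s3.model.{BucketNotificationConfiguration, QueueConfiguration, S3Event}
>
> val s3 = AmazonS3ClientBuilder.defaultClient()
> // Route every ObjectCreated event on the bucket to the SQS queue.
> val config = new BucketNotificationConfiguration().addConfiguration(
>   "new-object-events",
>   new QueueConfiguration("arn:aws:sqs:us-east-1:123456789012:s3-events",
>     EnumSet.of(S3Event.ObjectCreated)))
> s3.setBucketNotificationConfiguration("my-bucket", config)
> {code}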
>
> Spark can leverage this to find new files written to an S3 bucket by reading
> notifications from the SQS queue instead of listing files every microbatch.
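>
> A sketch of that polling loop with the AWS Java SDK v1 (the queue URL is a
> placeholder; each message body is the S3 event notification JSON from which
> the new object keys would be extracted):
> {code:scala}
> import com.amazonaws.services.sqs.AmazonSQSClientBuilder
> import com.amazonaws.services.sqs.model.ReceiveMessageRequest
>
> import scala.collection.JavaConverters._
>
> val sqs = AmazonSQSClientBuilder.defaultClient()
> val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"
> // Long-poll for up to 10 notifications at a time.
> val request = new ReceiveMessageRequest(queueUrl)
>   .withMaxNumberOfMessages(10)
>   .withWaitTimeSeconds(20)
> for (msg <- sqs.receiveMessage(request).getMessages.asScala) {
>   println(msg.getBody) // JSON with Records[].s3.object.key
>   sqs.deleteMessage(queueUrl, msg.getReceiptHandle)
> }
> {code}
>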
> I hope to contribute changes proposed in [this pull
> request|https://github.com/apache/spark/pull/24934] to Apache Bahir as
> suggested by @[gaborgsomogyi|https://github.com/gaborgsomogyi]
> [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130]