I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures:
Have Spark Streaming watch an S3 prefix and pick up new objects as they come in Stream data from S3 to a Kinesis stream (through a Lambda function triggered as new S3 objects are created by DMS) and use the stream as input for the Spark application. While the second solution will work, the first solution is simpler. But are there any pitfalls? Looking at this guide, I'm concerned about two specific points: > *The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.* We will be keeping the S3 data indefinitely. So the number of objects under the prefix being monitored is going to increase very quickly. > *“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.* I'm not sure if this applies to S3, since to my understanding objects are created atomically and cannot be updated afterwards as is the case with ordinary files (unless deleted and recreated, which I don't believe DMS does) Thanks for any help!