On 25 Jun 2018, at 23:59, Farshid Zavareh <fhzava...@gmail.com> wrote:
> I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures:
>
> 1. Have Spark Streaming watch an S3 prefix and pick up new objects as they come in
> 2. Stream data from S3 to a Kinesis stream (through a Lambda function triggered as new S3 objects are created by DMS) and use the stream as input for the Spark application.
>
> While the second solution will work, the first solution is simpler. But are there any pitfalls? Looking at this guide, I'm concerned about two specific points:
>
> > The more files under a directory, the longer it will take to scan for changes, even if no files have been modified.
>
> We will be keeping the S3 data indefinitely. So the number of objects under the prefix being monitored is going to increase very quickly.

There's a slightly more optimised streaming source for cloud streams here:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala

Even so, the cost of scanning S3 is one LIST request per 5000 objects; I'll leave it to you to work out how many there will be in your application, and how much it will cost. And of course, the more LIST calls there are, the longer things take, and the bigger your window needs to be.

> > “Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
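For the LIST cost mentioned above, a rough back-of-the-envelope sketch in Python may help. Everything numeric here is an assumption for illustration: the object count, the batch interval and the request price are placeholders, and the function names are hypothetical. The page size of 5000 is the figure from this thread; the S3 API itself returns at most 1000 keys per LIST response, so the effective page size may well be smaller and is left as a parameter.

```python
def list_requests_per_scan(num_objects: int, page_size: int) -> int:
    """Number of paged LIST calls needed to enumerate num_objects keys once."""
    if num_objects < 0 or page_size <= 0:
        raise ValueError("num_objects must be >= 0 and page_size must be > 0")
    # Integer ceiling division: a partly filled last page still costs a request.
    return -(-num_objects // page_size)


def daily_list_cost(num_objects: int, page_size: int,
                    batch_interval_s: float,
                    price_per_1000_requests: float) -> float:
    """Estimated daily LIST spend if every batch interval rescans the prefix."""
    scans_per_day = 86_400 / batch_interval_s
    requests_per_day = list_requests_per_scan(num_objects, page_size) * scans_per_day
    return requests_per_day * price_per_1000_requests / 1000


if __name__ == "__main__":
    # Placeholder inputs: 10 million objects, 60 s batches, made-up price.
    per_scan = list_requests_per_scan(10_000_000, 5000)
    cost = daily_list_cost(10_000_000, 5000, 60, price_per_1000_requests=0.005)
    print(f"{per_scan} LIST requests per scan, about {cost:.2f} per day")
```

Since the data is kept indefinitely, the per-scan request count grows linearly with the number of objects under the prefix, which is the point being made about window size.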
> I'm not sure if this applies to S3, since to my understanding objects are created atomically and cannot be updated afterwards as is the case with ordinary files (unless deleted and recreated, which I don't believe DMS does)

Objects written to S3 aren't visible until the upload completes, in an atomic operation. You can write in place and not worry.

The timestamp on S3 artifacts comes from the PUT time. On multipart uploads of many MB or many GB, that's when the first POST to initiate the MPU is kicked off. So if the upload starts in time window t1 and completes in window t2, the object won't be visible until t2, but the timestamp will be of t1. Bear that in mind.

The Lambda callback probably does have better scalability and resilience; I've not tried it myself.

> Thanks for any help!
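To make the t1/t2 point about multipart uploads concrete, here is a minimal Python sketch. It assumes fixed-length windows counted from time zero, with times in seconds; the function names are hypothetical, and it does not model how any particular streaming source treats objects with older timestamps.

```python
def window_index(t_s: float, window_s: float) -> int:
    """Index of the fixed-length window that contains time t_s."""
    if window_s <= 0:
        raise ValueError("window_s must be > 0")
    return int(t_s // window_s)


def straddles_windows(upload_start_s: float, upload_end_s: float,
                      window_s: float) -> bool:
    """True if an object is stamped in one window (upload start, t1)
    but only becomes visible in a later one (upload completion, t2)."""
    if upload_end_s < upload_start_s:
        raise ValueError("upload cannot end before it starts")
    return window_index(upload_start_s, window_s) < window_index(upload_end_s, window_s)


if __name__ == "__main__":
    # 60 s windows: an upload from t=55 to t=70 is stamped in window 0
    # but first appears in a listing during window 1.
    print(straddles_windows(55, 70, 60))
    # An upload that starts and finishes inside one window is unaffected.
    print(straddles_windows(10, 20, 60))
```

The longer the uploads are relative to the window, the more objects fall into the first case, so a scan that only selects timestamps from the current window would need some allowance for late arrivals.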