On 25 Jun 2018, at 23:59, Farshid Zavareh <fhzava...@gmail.com> wrote:

I'm writing a Spark Streaming application where the input data is put into an 
S3 bucket in small batches (using Database Migration Service - DMS). The Spark 
application is the only consumer. I'm considering two possible architectures:

1. Have Spark Streaming watch an S3 prefix and pick up new objects as they come in.
2. Stream data from S3 to a Kinesis stream (through a Lambda function triggered as
new S3 objects are created by DMS) and use the stream as input for the Spark
application.
While the second solution will work, the first solution is simpler. But are 
there any pitfalls? Looking at this guide, I'm concerned about two specific 
points:

> The more files under a directory, the longer it will take to scan for changes 
> — even if no files have been modified.

We will be keeping the S3 data indefinitely. So the number of objects under the 
prefix being monitored is going to increase very quickly.


There's a slightly-more-optimised streaming source for cloud streams here:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
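
For comparison, the stock directory-watching approach is just the built-in file
stream source pointed at the prefix. A minimal sketch, assuming the s3a:// connector
is on the classpath and that the bucket/prefix names are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object S3PrefixStream {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("S3PrefixStream")
      // A generous batch interval keeps the number of prefix scans (LIST calls) down.
      val ssc = new StreamingContext(conf, Seconds(60))

      // textFileStream() re-scans the prefix every batch and picks up objects
      // whose timestamps land in the current window.
      val lines = ssc.textFileStream("s3a://my-bucket/dms-output/")
      lines.count().print()

      ssc.start()
      ssc.awaitTermination()
    }
  }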


Even so, the cost of scanning S3 is one LIST request per 5000 objects; I'll
leave it to you to work out how many there will be in your application, and how
much it will cost. And of course, the more LIST calls there are, the longer
things take, and the bigger your window needs to be.
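
To make that arithmetic concrete, a back-of-envelope sketch; the object count
and per-request price below are made-up numbers, so plug in your own:

  // Rough cost of scanning the monitored prefix, per day.
  val objectCount     = 10L * 1000 * 1000   // hypothetical: 10M objects retained indefinitely
  val objectsPerList  = 5000L               // figure quoted above
  val pricePer1kLists = 0.005               // USD, assumed; check current S3 pricing
  val scansPerDay     = 24 * 60             // one scan per one-minute batch

  val listsPerScan = math.ceil(objectCount.toDouble / objectsPerList).toLong
  val dailyCostUsd = listsPerScan * scansPerDay * pricePer1kLists / 1000
  println(s"$listsPerScan LIST calls per scan, ~$dailyCostUsd USD per day")

With those invented numbers that's 2000 LIST calls every scan; the time each
scan takes is usually what hurts before the dollar cost does.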


> “Full” Filesystems such as HDFS tend to set the modification time on their 
> files as soon as the output stream is created. When a file is opened, even 
> before data has been completely written, it may be included in the DStream - 
> after which updates to the file within the same window will be ignored. That 
> is: changes may be missed, and data omitted from the stream.

I'm not sure if this applies to S3, since to my understanding objects are
created atomically and cannot be updated afterwards, as is the case with
ordinary files (unless deleted and recreated, which I don't believe DMS does).


Objects written to S3 aren't visible until the upload completes, in an atomic
operation. You can write in place and not worry.

The timestamp on S3 artifacts comes from the PUT time. On multipart uploads of
many MB/many GB, that's when the first POST to initiate the MPU is kicked off.
So if the upload starts in time window t1 and completes in window t2, the object
won't be visible until t2, but the timestamp will be that of t1. Bear that in
mind.
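
If you stay with the directory-watching approach, one mitigation (a sketch, not
something I've measured) is to widen how far back the file scanner remembers
timestamps, so an object stamped in t1 but only visible in t2 is still picked up:

  val conf = new SparkConf()
    .setAppName("S3PrefixStream")
    // Treat files up to 15 minutes old as "new"; size this to comfortably
    // exceed the longest multipart upload you expect from DMS.
    .set("spark.streaming.fileStream.minRememberDuration", "900s")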

The Lambda callback probably does have better scalability and resilience; I
haven't tried it myself.
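
If you do go the Lambda-to-Kinesis route, the Spark side comes from the
spark-streaming-kinesis-asl module. A minimal sketch, with placeholder stream,
region and app names, and assuming ssc is an existing StreamingContext:

  import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.kinesis.KinesisUtils

  // The stream name is whatever the Lambda writes its records to.
  val kinesisStream = KinesisUtils.createStream(
    ssc,
    "spark-dms-consumer",                       // KCL app name (backs a DynamoDB checkpoint table)
    "dms-objects",                              // Kinesis stream name
    "https://kinesis.us-east-1.amazonaws.com",  // endpoint URL
    "us-east-1",                                // region
    InitialPositionInStream.TRIM_HORIZON,
    Seconds(60),                                // checkpoint interval
    StorageLevel.MEMORY_AND_DISK_2)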





Thanks for any help!
