Thanks. A workaround I can think of is to rename/move the objects which have been processed to a different prefix (which is not monitored), But with StreamingContext. textFileStream method there doesn't seem to be a way to know where each record is coming from. Is there another way to do this?
On Wed, Jun 27, 2018 at 12:26 AM Steve Loughran <ste...@hortonworks.com> wrote: > > On 25 Jun 2018, at 23:59, Farshid Zavareh <fhzava...@gmail.com> wrote: > > I'm writing a Spark Streaming application where the input data is put into > an S3 bucket in small batches (using Database Migration Service - DMS). The > Spark application is the only consumer. I'm considering two possible > architectures: > > Have Spark Streaming watch an S3 prefix and pick up new objects as they > come in > Stream data from S3 to a Kinesis stream (through a Lambda function > triggered as new S3 objects are created by DMS) and use the stream as input > for the Spark application. > While the second solution will work, the first solution is simpler. But > are there any pitfalls? Looking at this guide, I'm concerned about two > specific points: > > > *The more files under a directory, the longer it will take to scan for > changes — even if no files have been modified.* > > We will be keeping the S3 data indefinitely. So the number of objects > under the prefix being monitored is going to increase very quickly. > > > > Theres a slightly-more-optimised streaming source for cloud streams here > > > https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala > > > Even so, the cost of scanning S3 is one LIST request per 5000 objects; > I'll leave it to you to work out how many there will be in your application > —and how much it will cost. And of course, the more LIST calls tehre are, > the longer things take, the bigger your window needs to be. > > > > *“Full” Filesystems such as HDFS tend to set the modification time on > their files as soon as the output stream is created. When a file is opened, > even before data has been completely written, it may be included in the > DStream - after which updates to the file within the same window will be > ignored. That is: changes may be missed, and data omitted from the stream.* > > I'm not sure if this applies to S3, since to my understanding objects are > created atomically and cannot be updated afterwards as is the case with > ordinary files (unless deleted and recreated, which I don't believe DMS > does) > > > Objects written to S3 are't visible until the upload completes, in an > atomic operation. You can write in place and not worry. > > The timestamp on S3 artifacts comes from the PUT tim. On multipart uploads > of many MB/many GB uploads, thats when the first post to initiate the MPU > is kicked off. So if the upload starts in time window t1 and completed in > window t2, the object won't be visible until t2, but the timestamp will be > of t1. Bear that in mind. > > The lambda callback probably does have better scalability and resilience; > not tried it myself. > > > > > > Thanks for any help! > > >