Thanks.

A workaround I can think of is to rename/move the objects which have been
processed to a different prefix (one that is not monitored). But with the
StreamingContext.textFileStream method there doesn't seem to be a way to
know which object each record came from. Is there another way to do this?
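One alternative I'm considering, since the Spark app is the only consumer: do the move independently of the stream. After each batch completes (e.g. from a StreamingListener), list the monitored prefix with the AWS SDK and archive anything whose last-modified time is older than the batch boundary. A rough sketch, assuming the aws-java-sdk-s3 dependency is on the classpath and using hypothetical bucket/prefix names (note S3 has no rename, so a "move" is copy + delete):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

// Hypothetical layout; adjust to the real bucket/prefixes.
val bucket    = "my-dms-bucket"
val incoming  = "incoming/"
val processed = "processed/"

val s3 = AmazonS3ClientBuilder.defaultClient()

// Move every object whose last-modified time is before the batch
// boundary out of the monitored prefix. S3 has no rename, so each
// move is a copy followed by a delete.
def archiveOlderThan(cutoffMillis: Long): Unit = {
  val req = new ListObjectsV2Request()
    .withBucketName(bucket)
    .withPrefix(incoming)
  var listing = s3.listObjectsV2(req)
  var more = true
  while (more) {
    for (obj <- listing.getObjectSummaries.asScala
         if obj.getLastModified.getTime < cutoffMillis) {
      val newKey = processed + obj.getKey.stripPrefix(incoming)
      s3.copyObject(bucket, obj.getKey, bucket, newKey)
      s3.deleteObject(bucket, obj.getKey)
    }
    if (listing.isTruncated) {
      req.setContinuationToken(listing.getNextContinuationToken)
      listing = s3.listObjectsV2(req)
    } else more = false
  }
}
```

This sidesteps the need for per-record provenance entirely, at the cost of extra LIST/COPY/DELETE traffic after each batch.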

On Wed, Jun 27, 2018 at 12:26 AM Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 25 Jun 2018, at 23:59, Farshid Zavareh <fhzava...@gmail.com> wrote:
>
> I'm writing a Spark Streaming application where the input data is put into
> an S3 bucket in small batches (using Database Migration Service - DMS). The
> Spark application is the only consumer. I'm considering two possible
> architectures:
>
> Have Spark Streaming watch an S3 prefix and pick up new objects as they
> come in
> Stream data from S3 to a Kinesis stream (through a Lambda function
> triggered as new S3 objects are created by DMS) and use the stream as input
> for the Spark application.
> While the second solution will work, the first solution is simpler. But
> are there any pitfalls? Looking at this guide, I'm concerned about two
> specific points:
>
> > *The more files under a directory, the longer it will take to scan for
> changes — even if no files have been modified.*
>
> We will be keeping the S3 data indefinitely. So the number of objects
> under the prefix being monitored is going to increase very quickly.
>
>
>
> There's a slightly-more-optimised streaming source for cloud stores here
>
>
> https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
>
>
> Even so, the cost of scanning S3 is one LIST request per 5000 objects;
> I'll leave it to you to work out how many there will be in your application
> —and how much it will cost. And of course, the more LIST calls there are,
> the longer things take and the bigger your window needs to be.
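>
> To make that concrete, a back-of-envelope sketch (assuming, hypothetically,
> one million retained objects, a 30-second polling interval, and AWS's
> roughly $0.005 per 1,000 LIST requests):
>
> ```scala
> val objects       = 1000000L
> val perList       = 5000L                       // figure quoted above
> val listsPerScan  = objects / perList           // 200
> val scansPerDay   = 24L * 60 * 60 / 30          // 2880
> val listsPerDay   = listsPerScan * scansPerDay  // 576,000
> val dollarsPerDay = listsPerDay * 0.005 / 1000  // ≈ $2.88 per day
> ```
>
> And since the data is kept indefinitely, that cost grows linearly with the
> object count.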
>
>
> > *“Full” Filesystems such as HDFS tend to set the modification time on
> their files as soon as the output stream is created. When a file is opened,
> even before data has been completely written, it may be included in the
> DStream - after which updates to the file within the same window will be
> ignored. That is: changes may be missed, and data omitted from the stream.*
>
> I'm not sure if this applies to S3, since to my understanding objects are
> created atomically and cannot be updated afterwards as is the case with
> ordinary files (unless deleted and recreated, which I don't believe DMS
> does)
>
>
> Objects written to S3 aren't visible until the upload completes, in an
> atomic operation. You can write in place and not worry.
>
> The timestamp on S3 artifacts comes from the PUT time. On multipart uploads
> of many MB/many GB, that's when the first POST to initiate the MPU
> is kicked off. So if the upload starts in time window t1 and completes in
> window t2, the object won't be visible until t2, but the timestamp will be
> of t1. Bear that in mind.
>
> The Lambda callback probably does have better scalability and resilience,
> though I've not tried it myself.
>
>
>
>
>
> Thanks for any help!
>
>
>
