Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-28 Thread Farshid Zavareh
Thanks.

A workaround I can think of is to rename/move the objects which have been
processed to a different prefix (which is not monitored), But with
StreamingContext. textFileStream method there doesn't seem to be a way to
know where each record is coming from. Is there another way to do this?

On Wed, Jun 27, 2018 at 12:26 AM Steve Loughran 
wrote:

>
> On 25 Jun 2018, at 23:59, Farshid Zavareh  wrote:
>
> I'm writing a Spark Streaming application where the input data is put into
> an S3 bucket in small batches (using Database Migration Service - DMS). The
> Spark application is the only consumer. I'm considering two possible
> architectures:
>
> Have Spark Streaming watch an S3 prefix and pick up new objects as they
> come in
> Stream data from S3 to a Kinesis stream (through a Lambda function
> triggered as new S3 objects are created by DMS) and use the stream as input
> for the Spark application.
> While the second solution will work, the first solution is simpler. But
> are there any pitfalls? Looking at this guide, I'm concerned about two
> specific points:
>
> > *The more files under a directory, the longer it will take to scan for
> changes — even if no files have been modified.*
>
> We will be keeping the S3 data indefinitely. So the number of objects
> under the prefix being monitored is going to increase very quickly.
>
>
>
> Theres a slightly-more-optimised streaming source for cloud streams here
>
>
> https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
>
>
> Even so, the cost of scanning S3 is one LIST request per 5000 objects;
> I'll leave it to you to work out how many there will be in your application
> —and how much it will cost. And of course, the more LIST calls tehre are,
> the longer things take, the bigger your window needs to be.
>
>
> > *“Full” Filesystems such as HDFS tend to set the modification time on
> their files as soon as the output stream is created. When a file is opened,
> even before data has been completely written, it may be included in the
> DStream - after which updates to the file within the same window will be
> ignored. That is: changes may be missed, and data omitted from the stream.*
>
> I'm not sure if this applies to S3, since to my understanding objects are
> created atomically and cannot be updated afterwards as is the case with
> ordinary files (unless deleted and recreated, which I don't believe DMS
> does)
>
>
> Objects written to S3 are't visible until the upload completes, in an
> atomic operation. You can write in place and not worry.
>
> The timestamp on S3 artifacts comes from the PUT tim. On multipart uploads
> of many MB/many GB uploads, thats when the first post to initiate the MPU
> is kicked off. So if the upload starts in time window t1 and completed in
> window t2, the object won't be visible until t2, but the timestamp will be
> of t1. Bear that  in mind.
>
> The lambda callback probably does have better scalability and resilience;
>  not tried it myself.
>
>
>
>
>
> Thanks for any help!
>
>
>


Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-26 Thread Steve Loughran


On 25 Jun 2018, at 23:59, Farshid Zavareh 
mailto:fhzava...@gmail.com>> wrote:

I'm writing a Spark Streaming application where the input data is put into an 
S3 bucket in small batches (using Database Migration Service - DMS). The Spark 
application is the only consumer. I'm considering two possible architectures:

Have Spark Streaming watch an S3 prefix and pick up new objects as they come in
Stream data from S3 to a Kinesis stream (through a Lambda function triggered as 
new S3 objects are created by DMS) and use the stream as input for the Spark 
application.
While the second solution will work, the first solution is simpler. But are 
there any pitfalls? Looking at this guide, I'm concerned about two specific 
points:

> The more files under a directory, the longer it will take to scan for changes 
> — even if no files have been modified.

We will be keeping the S3 data indefinitely. So the number of objects under the 
prefix being monitored is going to increase very quickly.


Theres a slightly-more-optimised streaming source for cloud streams here

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala


Even so, the cost of scanning S3 is one LIST request per 5000 objects; I'll 
leave it to you to work out how many there will be in your application —and how 
much it will cost. And of course, the more LIST calls tehre are, the longer 
things take, the bigger your window needs to be.


> “Full” Filesystems such as HDFS tend to set the modification time on their 
> files as soon as the output stream is created. When a file is opened, even 
> before data has been completely written, it may be included in the DStream - 
> after which updates to the file within the same window will be ignored. That 
> is: changes may be missed, and data omitted from the stream.

I'm not sure if this applies to S3, since to my understanding objects are 
created atomically and cannot be updated afterwards as is the case with 
ordinary files (unless deleted and recreated, which I don't believe DMS does)


Objects written to S3 are't visible until the upload completes, in an atomic 
operation. You can write in place and not worry.

The timestamp on S3 artifacts comes from the PUT tim. On multipart uploads of 
many MB/many GB uploads, thats when the first post to initiate the MPU is 
kicked off. So if the upload starts in time window t1 and completed in window 
t2, the object won't be visible until t2, but the timestamp will be of t1. Bear 
that  in mind.

The lambda callback probably does have better scalability and resilience;  not 
tried it myself.





Thanks for any help!



[Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-25 Thread Farshid Zavareh
I'm writing a Spark Streaming application where the input data is put into
an S3 bucket in small batches (using Database Migration Service - DMS). The
Spark application is the only consumer. I'm considering two possible
architectures:

Have Spark Streaming watch an S3 prefix and pick up new objects as they
come in
Stream data from S3 to a Kinesis stream (through a Lambda function
triggered as new S3 objects are created by DMS) and use the stream as input
for the Spark application.
While the second solution will work, the first solution is simpler. But are
there any pitfalls? Looking at this guide, I'm concerned about two specific
points:

> *The more files under a directory, the longer it will take to scan for
changes — even if no files have been modified.*

We will be keeping the S3 data indefinitely. So the number of objects under
the prefix being monitored is going to increase very quickly.

> *“Full” Filesystems such as HDFS tend to set the modification time on
their files as soon as the output stream is created. When a file is opened,
even before data has been completely written, it may be included in the
DStream - after which updates to the file within the same window will be
ignored. That is: changes may be missed, and data omitted from the stream.*

I'm not sure if this applies to S3, since to my understanding objects are
created atomically and cannot be updated afterwards as is the case with
ordinary files (unless deleted and recreated, which I don't believe DMS
does)

Thanks for any help!