Nezih, this looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way for the Spark Streaming job to be notified with the new file information and act on it? That would reduce the overhead and cost of polling S3. I could also use notifications to kick off a Lambda function that processes new data files and makes them ready for Spark Streaming to consume. I would just need to configure notifications on all incoming folders for Lambda and on all outgoing folders for Spark Streaming. This sounds like a better setup than what we have now.
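To make it concrete, here is the rough (untested) sketch I have in mind for the streaming side: the bucket publishes s3:ObjectCreated events to an SQS queue, and a small custom receiver polls that queue instead of listing the bucket every batch. The class name, queue URL, and event handling are just placeholders on my end, and it assumes the AWS SDK for Java is on the classpath:

    import scala.collection.JavaConverters._
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest

    // Hypothetical receiver: reads S3 event notifications from an SQS queue.
    // The bucket would need to be configured to publish s3:ObjectCreated:* events to it.
    class S3NotificationReceiver(queueUrl: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        new Thread("S3 Notification Receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = {}

      private def poll(): Unit = {
        val sqs = new AmazonSQSClient() // credentials from the default provider chain
        while (!isStopped()) {
          val request = new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20)
          val messages = sqs.receiveMessage(request).getMessages.asScala
          messages.foreach { msg =>
            store(msg.getBody) // body is the S3 event JSON (bucket name, object key, size)
            sqs.deleteMessage(queueUrl, msg.getReceiptHandle)
          }
        }
      }
    }

    // Usage (queue URL is made up):
    // val events = ssc.receiverStream(new S3NotificationReceiver(
    //   "https://sqs.us-east-1.amazonaws.com/123456789012/s3-new-files"))

The stream would then carry one JSON event per new object rather than the file contents, and the job could fetch or hand off each object as it arrives.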
Thanks,
Ben

> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>
> While it is doable in Spark, S3 also supports notifications:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>
> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
> Hi Benjamin,
>
> I have done it. The critical configuration items are the ones below:
>
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>     "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>     AccessKeyId)
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>     AWSSecretAccessKey)
>
>   val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
>
> This code will probe for new S3 files created in that folder every batch
> interval.
>
> Thanks,
> Natu
>
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and
> pulled any new files to process? If so, can you provide basic Scala coding
> help on this?
>
> Thanks,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org