This was easy! I just created a notification on a source S3 bucket to kick off a Lambda function that decompresses the dropped file and saves it to another S3 bucket. In turn, that S3 bucket has a notification that sends an SNS message to me via email. I can just as easily set up SQS as the endpoint of this notification, which would then convey the new file's information to a listening Spark Streaming job to download. I like this!
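For anyone curious, here is a rough, untested sketch of what that listening side could look like: a custom Spark Streaming receiver that long-polls the queue with the AWS Java SDK and feeds the raw S3 event JSON into a DStream. The queue URL in the usage comment is just a placeholder.

import com.amazonaws.services.sqs.AmazonSQSClient
import com.amazonaws.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.JavaConverters._

// Receives S3 event notifications from an SQS queue and feeds the raw
// JSON bodies (which carry the bucket/key of each new file) into a DStream.
class SqsReceiver(queueUrl: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("SQS Receiver") {
      override def run(): Unit = poll()
    }.start()
  }

  def onStop(): Unit = {} // the polling thread exits via isStopped()

  private def poll(): Unit = {
    val sqs = new AmazonSQSClient() // default credential provider chain
    while (!isStopped()) {
      val result = sqs.receiveMessage(
        new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20))
      for (msg <- result.getMessages.asScala) {
        store(msg.getBody) // S3 event JSON; parse bucket/key downstream
        sqs.deleteMessage(new DeleteMessageRequest(queueUrl, msg.getReceiptHandle))
      }
    }
  }
}

// Usage (placeholder queue URL):
// val fileEvents = ssc.receiverStream(
//   new SqsReceiver("https://sqs.us-east-1.amazonaws.com/123456789012/new-files"))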
Cheers,
Ben

> On Apr 9, 2016, at 9:54 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> This is awesome! I have someplace to start from.
>
> Thanks,
> Ben
>
>> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
>>
>> Someone please correct me if I am wrong, as I am still rather green with Spark; however, it appears that through the S3 notification mechanism described below, you can publish events to SQS and use SQS as a streaming source into Spark. The project at https://github.com/imapi/spark-sqs-receiver appears to provide libraries for doing this.
>>
>> Hope this helps.
>>
>> Sent from my iPhone
>>
>> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Nezih,
>>>
>>> This looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way to have the Spark Streaming job get notified with the new file information and act upon it? This would reduce the overhead and cost of polling S3. Plus, I could use this to notify and kick off Lambda to process new data files and make them ready for Spark Streaming to consume, which would also be triggered by notifications. I just need to have all incoming folders configured with notifications for Lambda and all outgoing folders with notifications for Spark Streaming. This sounds like a better setup than what we have now.
>>>
>>> Thanks,
>>> Ben
>>>
>>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>>>>
>>>> While it is doable in Spark, S3 also supports notifications:
>>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>>>>
>>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
>>>>
>>>> Hi Benjamin,
>>>>
>>>> I have done it. The critical configuration items are the ones below:
>>>>
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>>>   "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>>>   AccessKeyId)
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>>>   AWSSecretAccessKey)
>>>>
>>>> val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")
>>>>
>>>> This code will probe for new S3 files created in the folder every batch interval.
>>>>
>>>> Thanks,
>>>> Natu
>>>>
>>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>> Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled any new files to process? If so, can you provide basic Scala coding help on this?
>>>>
>>>> Thanks,
>>>> Ben