This was easy! I just created a notification on a source S3 bucket to kick off a Lambda function that decompresses the dropped file and saves it to another S3 bucket. In turn, that S3 bucket has a notification that sends an SNS message to me via email. I can just as easily set up SQS as the endpoint of this notification, which would then convey the new file's information to a listening Spark Streaming job to download. I like this!
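For anyone curious, here is a rough, untested sketch of what that listening side could look like: a custom Spark Streaming receiver that long-polls the queue with the AWS Java SDK and feeds the raw S3 event JSON into a DStream. The queue URL in the usage comment is just a placeholder.

import com.amazonaws.services.sqs.AmazonSQSClient
import com.amazonaws.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.JavaConverters._

// Receives S3 event notifications from an SQS queue and feeds the raw
// JSON bodies (which carry the bucket/key of each new file) into a DStream.
class SqsReceiver(queueUrl: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("SQS Receiver") {
      override def run(): Unit = poll()
    }.start()
  }

  def onStop(): Unit = {} // the polling thread exits via isStopped()

  private def poll(): Unit = {
    val sqs = new AmazonSQSClient() // default credential provider chain
    while (!isStopped()) {
      val result = sqs.receiveMessage(
        new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20))
      for (msg <- result.getMessages.asScala) {
        store(msg.getBody) // S3 event JSON; parse bucket/key downstream
        sqs.deleteMessage(new DeleteMessageRequest(queueUrl, msg.getReceiptHandle))
      }
    }
  }
}

// Usage (placeholder queue URL):
// val fileEvents = ssc.receiverStream(
//   new SqsReceiver("https://sqs.us-east-1.amazonaws.com/123456789012/new-files"))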
Cheers,
Ben

> On Apr 9, 2016, at 9:54 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> This is awesome! I have someplace to start from.
>
> Thanks,
> Ben
>
>> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
>>
>> Someone please correct me if I am wrong, as I am still rather green with Spark; however, it appears that through the S3 notification mechanism described below, you can publish events to SQS and use SQS as a streaming source into Spark. The project at https://github.com/imapi/spark-sqs-receiver appears to provide libraries for doing this.
>>
>> Hope this helps.
>>
>> Sent from my iPhone
>>
>> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Nezih,
>>>
>>> This looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way to have the Spark Streaming job get notified with the new file information and act upon it? This would reduce the overhead and cost of polling S3. Plus, I could use this to notify and kick off Lambda to process new data files and make them ready for Spark Streaming to consume, which would also be triggered by notifications. I just need to have all incoming folders configured with notifications for Lambda and all outgoing folders with notifications for Spark Streaming. This sounds like a better setup than what we have now.
>>>
>>> Thanks,
>>> Ben
>>>
>>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>>>>
>>>> While it is doable in Spark, S3 also supports notifications:
>>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>>>>
>>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
>>>>
>>>> Hi Benjamin,
>>>>
>>>> I have done it. The critical configuration items are the ones below:
>>>>
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>>>   "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>>>   AccessKeyId)
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>>>   AWSSecretAccessKey)
>>>>
>>>> val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")
>>>>
>>>> This code will probe for new S3 files created in the folder every batch interval.
>>>>
>>>> Thanks,
>>>> Natu
>>>>
>>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>> Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled any new files to process? If so, can you provide basic Scala coding help on this?
>>>>
>>>> Thanks,
>>>> Ben