Nezih, this looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way for the Spark Streaming job to be notified with the new file information and act on it? That would reduce the overhead and cost of polling S3. I could also use notifications to kick off a Lambda function that processes new data files and makes them ready for Spark Streaming to consume. I would just need to configure notifications on all incoming folders for Lambda and on all outgoing folders for Spark Streaming. This sounds like a better setup than what we have now.
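To make it concrete, here is the rough (untested) sketch I have in mind for the streaming side: the bucket publishes s3:ObjectCreated events to an SQS queue, and a small custom receiver polls that queue instead of listing the bucket every batch. The class name, queue URL, and event handling are just placeholders on my end, and it assumes the AWS SDK for Java is on the classpath:

    import scala.collection.JavaConverters._
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest

    // Hypothetical receiver: reads S3 event notifications from an SQS queue.
    // The bucket would need to be configured to publish s3:ObjectCreated:* events to it.
    class S3NotificationReceiver(queueUrl: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        new Thread("S3 Notification Receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = {}

      private def poll(): Unit = {
        val sqs = new AmazonSQSClient() // credentials from the default provider chain
        while (!isStopped()) {
          val request = new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20)
          val messages = sqs.receiveMessage(request).getMessages.asScala
          messages.foreach { msg =>
            store(msg.getBody) // body is the S3 event JSON (bucket name, object key, size)
            sqs.deleteMessage(queueUrl, msg.getReceiptHandle)
          }
        }
      }
    }

    // Usage (queue URL is made up):
    // val events = ssc.receiverStream(new S3NotificationReceiver(
    //   "https://sqs.us-east-1.amazonaws.com/123456789012/s3-new-files"))

The stream would then carry one JSON event per new object rather than the file contents, and the job could fetch or hand off each object as it arrives.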
Thanks,
Ben

> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>
> While it is doable in Spark, S3 also supports notifications:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>
> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
> Hi Benjamin,
>
> I have done it. The critical configuration items are the ones below:
>
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>     "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>     AccessKeyId)
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>     AWSSecretAccessKey)
>
>   val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
>
> This code will probe for new S3 files created in that folder every batch
> interval.
>
> Thanks,
> Natu
>
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and
> pulled any new files to process? If so, can you provide basic Scala coding
> help on this?
>
> Thanks,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org