Natu,

Do you know if textFileStream can detect new files created anywhere 
underneath a whole bucket? For example, if the bucket name is incoming and 
new files underneath it are 2016/04/09/00/00/01/data.csv and 
2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will 
Spark Streaming skip these files on the following run, knowing that it 
already picked them up, or do we have to store state somewhere ourselves, 
such as the last run's date and time to compare against?
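
In case it helps frame the question, here is roughly the shape of what I 
have so far. The bucket layout and paths are made up, and checkpointing is 
my guess at how Spark remembers already-processed files across restarts:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object IncomingBucketMonitor {

        // Hypothetical paths -- substitute the real bucket and prefix
        val checkpointDir = "s3n://incoming/_checkpoints"
        val watchedDir    = "s3n://incoming/2016/04/09/00/00/01"

        def createContext(): StreamingContext = {
          val conf = new SparkConf().setAppName("IncomingBucketMonitor")
          val ssc  = new StreamingContext(conf, Seconds(60))

          // Metadata checkpointing records which files each completed batch
          // consumed, so a restarted driver should not reprocess them
          ssc.checkpoint(checkpointDir)

          val lines = ssc.textFileStream(watchedDir)
          lines.foreachRDD(rdd => println(s"new records: ${rdd.count()}"))
          ssc
        }

        def main(args: Array[String]): Unit = {
          // Recover from the checkpoint if one exists, else build fresh
          val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
          ssc.start()
          ssc.awaitTermination()
        }
      }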

Thanks,
Ben

> On Apr 8, 2016, at 9:15 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> I have done it. The critical configuration items are the ones below:
> 
>       // Use the native S3 filesystem and pass in AWS credentials; the
>       // fs.s3n.* keys must match the URI scheme used below
>       ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>         "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>       ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
>       ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", AWSSecretAccessKey)
> 
>       // s3n:// (not s3://) so the NativeS3FileSystem configured above is used
>       val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
> 
> This code will probe for new files created under the monitored S3 folder 
> once every batch interval.
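> 
> For completeness, hooking the stream up and starting the context looks 
> something like this (the foreachRDD body is just a placeholder for your 
> real processing):
> 
>       inputS3Stream.foreachRDD { rdd =>
>         rdd.take(10).foreach(println)
>       }
> 
>       ssc.start()
>       ssc.awaitTermination()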
> 
> Thanks,
> Natu
> 
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and 
> pulled any new files to process? If so, can you provide basic Scala coding 
> help on this?
> 
> Thanks,
> Ben
> 
