Re: Monitoring S3 Bucket with Spark Streaming

2016-04-12 Thread Benjamin Kim
All, I have more of a general Scala JSON question. I have set up a notification on the S3 source bucket that triggers a Lambda function to unzip each new file placed there. It then saves the unzipped CSV file into a destination bucket, where a notification is sent to an SQS queue. The

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Ah, I spoke too soon. I thought the SQS part was going to be a Spark package. It looks like it has to be compiled into a jar before use. Am I right? Can someone help with this? I tried to compile it using SBT, but I’m stuck with a "SonatypeKeys not found" error. If there’s an easier alternative,

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This was easy! I just created a notification on a source S3 bucket to kick off a Lambda function that decompresses the dropped file and saves it to another S3 bucket. In turn, this second S3 bucket has a notification that sends an SNS message to me via email. I can just as easily set up SQS to be the
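A minimal sketch of the decompression step only, assuming the dropped files are gzip-compressed (the thread does not say which format is used). The object name `GzipStep` and both helpers are hypothetical; reading from and writing to S3 is left out, and `gzip` exists only to make the round trip easy to try locally.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper: the compression format (gzip) is an assumption.
object GzipStep {
  /** Decompress gzip bytes into the original bytes. */
  def gunzip(compressed: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val out = new ByteArrayOutputStream()
    val buffer = new Array[Byte](8192)
    try {
      // Copy until the stream signals end of data with -1.
      var n = in.read(buffer)
      while (n != -1) {
        out.write(buffer, 0, n)
        n = in.read(buffer)
      }
    } finally in.close()
    out.toByteArray
  }

  /** Compress bytes with gzip; only used here to build test input. */
  def gzip(plain: Array[Byte]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bytes)
    try out.write(plain) finally out.close()
    bytes.toByteArray
  }
}
```

In a real function the input bytes would come from the source bucket object and the result would be written to the destination bucket.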

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Gourav Sengupta
Why not use AWS Lambda? Regards, Gourav
On Fri, Apr 8, 2016 at 8:14 PM, Benjamin Kim wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and
> pulled any new files to process? If so, can you provide basic Scala coding
> help on this?
>
> Thanks,

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
Natu, Benjamin, With this mechanism you can configure notifications for *buckets* (if you only care about some key prefixes you can take a look at object key name filtering, see the docs) for various event types, and then these events can be published to SNS, SQS or Lambdas. I think using SQS as

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
> Do you know if textFileStream can see if new files are created underneath a whole bucket?
Only at the level of the folder that you specify. It does not look into subfolders. So your approach would be detecting everything under a path such as s3://bucket/path/2016040902_data.csv
> Also, will Spark Streaming not
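To make the "folder level only" point concrete, here is a small sketch in plain Scala that models the rule as described in this reply: a key counts only if it sits directly inside the monitored directory. It does not call Spark; `DirectChildFilter` and `isDirectChild` are hypothetical names, and the sample keys are taken from the question about the `incoming` bucket.

```scala
// Hypothetical model of the behaviour described in the thread, not Spark code.
object DirectChildFilter {
  /** True when `key` names a file directly inside `dir`, with no deeper "/" below it. */
  def isDirectChild(dir: String, key: String): Boolean = {
    val prefix = if (dir.endsWith("/")) dir else dir + "/"
    key.startsWith(prefix) && {
      val rest = key.substring(prefix.length)
      rest.nonEmpty && !rest.contains("/")
    }
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(
      "2016/04/09/00/00/01/data.csv",
      "2016/04/09/00/00/02/data.csv"
    )
    // Monitoring the leaf directory matches its file; monitoring a parent matches nothing.
    for (dir <- Seq("2016/04/09/00/00/01", "2016/04/09"); key <- keys)
      println(s"dir=$dir key=$key direct=${isDirectChild(dir, key)}")
  }
}
```

Under this model, files nested by date and time are only seen if the stream points at the exact leaf directory, which is one reason the notification-based approach discussed elsewhere in the thread may fit deeply nested layouts better.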

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This is awesome! I have someplace to start from. Thanks, Ben
> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
>
> Someone please correct me if I am wrong as I am still rather green to Spark,
> however it appears that through the S3 notification mechanism described
> below,

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread programminggeek72
Someone please correct me if I am wrong, as I am still rather green to Spark; however, it appears that through the S3 notification mechanism described below, you can publish events to SQS and use SQS as a streaming source into Spark. The project at https://github.com/imapi/spark-sqs-receiver

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Nezih, This looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way to have the Spark Streaming job get notified with the new file information and act upon it? This can reduce the overhead and cost of polling S3. Plus, I can

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Natu, Do you know if textFileStream can see if new files are created underneath a whole bucket? For example, if the bucket name is incoming and new files underneath it are 2016/04/09/00/00/01/data.csv and 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will Spark Streaming

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Can you elaborate a bit more on your approach using S3 notifications? Just curious. I am dealing with a similar issue right now that might benefit from this.
On 09 Apr 2016 9:25 AM, "Nezih Yigitbasi" wrote:
> While it is doable in Spark, S3 also supports notifications:
>

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
While it is doable in Spark, S3 also supports notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande wrote:
> Hi Benjamin,
>
> I have done it. The critical configuration items are the ones below

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Natu Lauchande
Hi Benjamin, I have done it. The critical configuration items are the ones below:
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
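For anyone wanting a starting point, the configuration above might fit into a complete job roughly as follows. This is a sketch under assumptions, not the setup from this message: the secret key property, the environment variable names, the 60-second batch interval, and the `s3n://incoming/2016/04/09/` path are all illustrative and do not come from the thread. It needs Spark Streaming and the Hadoop S3 native filesystem on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3DirectoryMonitor {
  def main(args: Array[String]): Unit = {
    // Assumption: credentials are read from environment variables.
    val accessKeyId = sys.env("AWS_ACCESS_KEY_ID")
    val secretAccessKey = sys.env("AWS_SECRET_ACCESS_KEY")

    val conf = new SparkConf().setAppName("S3DirectoryMonitor")
    // Assumption: a 60-second batch interval, chosen only for illustration.
    val ssc = new StreamingContext(conf, Seconds(60))

    // The first two settings are the ones quoted in the thread;
    // the secret key setting is an assumed companion to the access key.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", accessKeyId)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", secretAccessKey)

    // Hypothetical bucket and directory; new files appearing here become lines in the stream.
    val lines = ssc.textFileStream("s3n://incoming/2016/04/09/")

    // Report how many lines arrived in each non-empty batch.
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) println(s"New lines in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch lists the monitored directory again, so this approach polls S3 on every interval.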

Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Benjamin Kim
Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled any new files to process? If so, can you provide basic Scala coding help on this? Thanks, Ben