We have a custom version/build of Spark Streaming that does the nested S3 lookups faster (it uses the native S3 APIs). You can find the source code here: https://github.com/sigmoidanalytics/spark-modified. In particular, the changes start here: https://github.com/sigmoidanalytics/spark-modified/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206. The binary jars are here: https://github.com/sigmoidanalytics/spark-modified/tree/master/lib
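To show how this is meant to be wired up end to end, here is a minimal driver sketch built around the s3FileStream call from the instructions below. Everything apart from that call (the app name, batch interval, the choice of TextInputFormat import, and the output operation) is an assumption for illustration, so treat it as a sketch rather than the fork's reference usage:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3StreamExample {
  def main(args: Array[String]): Unit = {
    // ACCESS_KEY and SECRET_KEY must be set in the environment,
    // and the jars listed in the instructions below must be on the SPARK_CLASSPATH.
    val conf = new SparkConf().setAppName("s3-nested-stream")
    val ssc = new StreamingContext(conf, Seconds(60))

    // s3FileStream exists only in the sigmoidanalytics/spark-modified fork;
    // by default it picks up files recursively under the given bucket/prefix.
    // TextInputFormat here is assumed to be the new-API
    // org.apache.hadoop.mapreduce.lib.input.TextInputFormat, as with fileStream.
    val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")

    // Illustrative output operation: print the text portion of each record.
    lines.map(_._2.toString).print()

    ssc.start()
    ssc.awaitTermination()
  }
}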
Here are the instructions to use it. This is how you create your stream:

val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")

You need ACCESS_KEY and SECRET_KEY in the environment for this to work. Also, by default it is recursive. You also need these jars (https://github.com/sigmoidanalytics/spark-modified/tree/master/lib) in the SPARK_CLASSPATH:

aws-java-sdk-1.8.3.jar
httpclient-4.2.5.jar
aws-java-sdk-1.9.24.jar
httpcore-4.3.2.jar
aws-java-sdk-core-1.9.24.jar
joda-time-2.6.jar
aws-java-sdk-s3-1.9.24.jar
spark-streaming_2.10-1.2.0.jar

Let me know if you need any more clarification/information on this, and feel free to suggest changes.

Thanks
Best Regards

On Sat, Apr 4, 2015 at 3:30 AM, Tathagata Das <t...@databricks.com> wrote:

> Yes, it definitely can be added; I just haven't gotten around to doing it :)
> There are proposals for this that you can try:
> https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
> some point?
>
> On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter <adamge...@gmail.com> wrote:
>
>> That doesn't seem like a good solution, unfortunately, as I would need
>> this to work in a production environment. Do you know why the limitation
>> exists for FileInputDStream in the first place? Unless I'm missing
>> something important about how some of the internals work, I don't see
>> why this feature couldn't be added at some point.
>>
>> On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> A sort-of-hacky workaround is to use a queueStream where you can
>>> manually create RDDs (using sparkContext.hadoopFile) and insert them
>>> into the queue. Note that this is for testing only, as queueStream does
>>> not work with driver fault recovery.
>>>
>>> TD
>>>
>>> On Fri, Apr 3, 2015 at 12:23 PM, adamgerst <adamge...@gmail.com> wrote:
>>>
>>>> So after pulling my hair out for a bit trying to convert one of my
>>>> standard Spark jobs to streaming, I found that FileInputDStream does
>>>> not support nested folders (see the brief mention here:
>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
>>>> the fileStream method returns a FileInputDStream). Before, for my
>>>> standard job, I was reading from, say,
>>>>
>>>> s3n://mybucket/2015/03/02/*log
>>>>
>>>> and could also modify it to simply get an entire month's worth of logs.
>>>> Since the logs are split up based on their date, when the batch ran
>>>> for the day I simply passed in the date as a parameter to make sure I
>>>> was reading the correct data.
>>>>
>>>> But since I want to turn this job into a streaming job, I need to do
>>>> something like
>>>>
>>>> s3n://mybucket/*log
>>>>
>>>> This would work fine if it were a standard Spark application, but it
>>>> fails for streaming. Is there any way I can get around this limitation?
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>
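For anyone who wants to try the queueStream workaround TD mentions above in the meantime, here is a rough sketch. As TD notes, this is for testing only since queueStream does not work with driver fault recovery; the paths, batch interval, and key/value types below are assumptions for illustration, not part of his suggestion:

import scala.collection.mutable
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("queue-stream-workaround")
    val ssc = new StreamingContext(conf, Seconds(60))
    val sc = ssc.sparkContext

    // queueStream pulls one RDD off the queue per batch interval.
    val rddQueue = new mutable.Queue[RDD[String]]()
    val stream = ssc.queueStream(rddQueue)
    stream.print()

    // Manually build RDDs from nested S3 paths with sparkContext.hadoopFile
    // and push them into the queue (example paths, one per day of logs).
    Seq("s3n://mybucket/2015/03/01/*log", "s3n://mybucket/2015/03/02/*log").foreach { path =>
      val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](path).map(_._2.toString)
      rddQueue += rdd
    }

    ssc.start()
    ssc.awaitTermination()
  }
}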