We have a custom version/build of Spark Streaming that does the nested S3
lookups faster (it uses the native S3 APIs). You can find the source code
here:
https://github.com/sigmoidanalytics/spark-modified
In particular, the changes are here:
https://github.com/sigmoidanalytics/spark-modified/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206
And the binary jars are here:
https://github.com/sigmoidanalytics/spark-modified/tree/master/lib

Here are the instructions to use it:

This is how you create your stream:

val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")


You need ACCESS_KEY and SECRET_KEY set in the environment for this to work.
Also, the bucket listing is recursive by default.
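
Here's a fuller sketch of how it fits together (a minimal sketch only: the
StreamingContext setup, batch interval, and the map over the records are
illustrative, and assume s3FileStream yields (key, value) pairs the same way
the stock fileStream does):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3NestedStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3NestedStreamExample")
    // Batch interval is illustrative; pick what suits your log volume.
    val ssc = new StreamingContext(conf, Seconds(60))

    // ACCESS_KEY and SECRET_KEY must be set in the environment, and the
    // bucket listing is recursive by default.
    val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")

    // Assuming the stream yields (byte offset, line) pairs like fileStream.
    lines.map(_._2.toString).print()

    ssc.start()
    ssc.awaitTermination()
  }
}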

You also need these jars
<https://github.com/sigmoidanalytics/spark-modified/tree/master/lib> on the
SPARK_CLASSPATH:


aws-java-sdk-1.8.3.jar
aws-java-sdk-1.9.24.jar
aws-java-sdk-core-1.9.24.jar
aws-java-sdk-s3-1.9.24.jar
httpclient-4.2.5.jar
httpcore-4.3.2.jar
joda-time-2.6.jar
spark-streaming_2.10-1.2.0.jar



Let me know if you need any more clarification or information on this, and
feel free to suggest changes.




Thanks
Best Regards

On Sat, Apr 4, 2015 at 3:30 AM, Tathagata Das <t...@databricks.com> wrote:

> Yes, this can definitely be added. Just haven't gotten around to doing it :)
> There are proposals for this that you can try -
> https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
> some point?
>
> On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter <adamge...@gmail.com> wrote:
>
>> That doesn't seem like a good solution, unfortunately, as I would need this
>> to work in a production environment.  Do you know why the limitation exists
>> for FileInputDStream in the first place?  Unless I'm missing something
>> important about how some of the internals work, I don't see why this
>> feature couldn't be added at some point.
>>
>> On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> A sort-of-hacky workaround is to use a queueStream, where you can manually
>>> create RDDs (using sparkContext.hadoopFile) and insert them into the queue.
>>> Note that this is for testing only, as queueStream does not work with
>>> driver fault recovery.
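>>>
>>> (A minimal sketch of that queueStream approach, assuming an existing
>>> StreamingContext named ssc; the glob path, textFile in place of
>>> hadoopFile, and the enqueue timing are placeholders for illustration:)
>>>
>>> import scala.collection.mutable
>>> import org.apache.spark.rdd.RDD
>>>
>>> // Testing-only workaround: build RDDs yourself (textFile/hadoopFile can
>>> // read nested S3 globs) and push them onto a queue the stream drains.
>>> val rddQueue = new mutable.Queue[RDD[String]]()
>>> val stream = ssc.queueStream(rddQueue)
>>> stream.print()
>>> ssc.start()
>>>
>>> // Enqueue new data as it becomes available; the glob is illustrative.
>>> rddQueue += ssc.sparkContext.textFile("s3n://mybucket/2015/03/02/*log")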
>>>
>>> TD
>>>
>>> On Fri, Apr 3, 2015 at 12:23 PM, adamgerst <adamge...@gmail.com> wrote:
>>>
>>>> So after pulling my hair out for a bit trying to convert one of my
>>>> standard Spark jobs to streaming, I found that FileInputDStream does not
>>>> support nested folders (see the brief mention at
>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources;
>>>> the fileStream method returns a FileInputDStream).  So before, for my
>>>> standard job, I was reading from, say,
>>>>
>>>> s3n://mybucket/2015/03/02/*log
>>>>
>>>> And I could also modify it to simply get an entire month's worth of logs.
>>>> Since the logs are split up based upon their date, when the batch ran for
>>>> the day, I simply passed in the date as a parameter to make sure I was
>>>> reading the correct data.
>>>>
>>>> But since I want to turn this job into a streaming job, I need to simply
>>>> do something like
>>>>
>>>> s3n://mybucket/*log
>>>>
>>>> This would totally work fine if it were a standard Spark application, but
>>>> it fails for streaming.  Is there any way I can get around this
>>>> limitation?
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
