In the past I have worked around this problem by avoiding sc.textFile().
Instead I read the data directly inside a Spark job: start with an RDD
where each entry is the path of a file in S3, then flatMap it with
something that reads each file and returns its lines.

Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe

Using this class you can do something like:

sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file2" ... ::
Nil).flatMap(new ReadLinesSafe(_))
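
For reference, here's a rough sketch of the shape of that class (the gist
above is the real version; this simplified one reads each file eagerly
through the Hadoop FileSystem API and skips the per-file error handling
the gist adds):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class ReadLinesSafe(path: String) extends Iterable[String] with Serializable {
  def iterator: Iterator[String] = {
    val fs = FileSystem.get(new URI(path), new Configuration())
    val in = fs.open(new Path(path))
    try {
      // Read eagerly so the stream can be closed before returning.
      scala.io.Source.fromInputStream(in).getLines().toList.iterator
    } finally {
      in.close()
    }
  }
}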

You can also build up the list of files by running a Spark job:
https://gist.github.com/marmbrus/15e72f7bc22337cf6653
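
The core idea there is roughly this (a sketch, untested; one listing task
per prefix, using the Hadoop FileSystem API):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def listFilesParallel(sc: SparkContext, prefixes: Seq[String]): Seq[String] =
  sc.parallelize(prefixes, prefixes.size).flatMap { prefix =>
    // Each task lists one prefix, so large listings happen in parallel.
    val fs = FileSystem.get(new URI(prefix), new Configuration())
    fs.listStatus(new Path(prefix)).map(_.getPath.toString)
  }.collect().toSeq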

Michael

On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Spark has a known problem where it makes a serial metadata pass over a
> large number of small files in order to find the partition information
> before starting the job. Switching the FS impl will probably not fix
> this.
>
> However, you can change the FS being used like so (prior to the first
> usage):
> sc.hadoopConfiguration.set("fs.s3n.impl",
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>
> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <tomer....@gmail.com>
> wrote:
>
>> Thanks Lalit; setting the access + secret keys in the configuration
>> works even when calling sc.textFile. Is there a way to select which
>> Hadoop S3 native filesystem implementation is used at runtime, via the
>> Hadoop configuration?
>>
>> Thanks,
>> Tomer
>>
>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com>
>> wrote:
>>
>>>
>>> You can try creating a Hadoop Configuration and setting the S3
>>> configuration on it, i.e. the access keys etc. Then, to read files from
>>> S3, use newAPIHadoopFile and pass that config object along with the key
>>> and value classes, as in the sketch below.
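>>>
>>> (Rough sketch, untested; the credential key names are the standard s3n
>>> ones, and the path and key values are made-up placeholders):
>>>
>>> import org.apache.hadoop.conf.Configuration
>>> import org.apache.hadoop.io.{LongWritable, Text}
>>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>>>
>>> val conf = new Configuration()
>>> conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
>>> conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
>>>
>>> val lines = sc.newAPIHadoopFile(
>>>   "s3n://mybucket/path",
>>>   classOf[TextInputFormat],
>>>   classOf[LongWritable],
>>>   classOf[Text],
>>>   conf
>>> ).map(_._2.toString)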
>>>
>>> -----
>>> Lalit Yadav
>>> la...@sigmoidanalytics.com
>>>
>>
>
