Here is how I have dealt with many small text files in the past (on S3, though
this should generalize):
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E




> From: Michael Armbrust <mich...@databricks.com>
> Subject: Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
> Date: Thu, 27 Nov 2014 03:20:14 GMT
>
> In the past I have worked around this problem by avoiding sc.textFile().
> Instead I read the data directly inside of a Spark job.  Basically, you
> start with an RDD where each entry is a file in S3 and then flatMap that
> with something that reads the files and returns the lines.
>
> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>
> Using this class you can do something like:
>
> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file2" ... ::
> Nil).flatMap(new ReadLinesSafe(_))
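
(For reference, here is a rough sketch of what a ReadLinesSafe-style reader
could look like. This is my own approximation using the Hadoop FileSystem API,
not the contents of the gist above, which may differ:)

    import java.io.{BufferedReader, InputStreamReader}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Wraps one path and exposes its lines, treating unreadable files as
    // empty, so it can be used directly in flatMap as shown above.
    class ReadLinesSafe(path: String) extends Iterable[String] with Serializable {
      def iterator: Iterator[String] = {
        try {
          val p = new Path(path)
          // Resolves the right FileSystem (S3, HDFS, local) from the path scheme.
          val fs = p.getFileSystem(new Configuration())
          val reader = new BufferedReader(new InputStreamReader(fs.open(p)))
          try {
            // Read eagerly so the stream can be closed before returning.
            Iterator.continually(reader.readLine()).takeWhile(_ != null).toList.iterator
          } finally {
            reader.close()
          }
        } catch {
          case _: Exception => Iterator.empty // skip files that fail to read
        }
      }
    }
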
>
> You can also build up the list of files by running a Spark job:
> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
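
(Again just a sketch of the idea, not the gist itself: list each directory on
the executors with the Hadoop FileSystem API; bucket and dir names are made up:)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hypothetical directories; the real list would come from wherever they are tracked.
    val dirs = Seq("s3n://mybucket/dir1", "s3n://mybucket/dir2")

    // Distribute the directory names, then list each one in parallel.
    val files: Array[String] = sc.parallelize(dirs).flatMap { dir =>
      val p = new Path(dir)
      val fs = p.getFileSystem(new Configuration())
      fs.listStatus(p).filter(_.isFile).map(_.getPath.toString)
    }.collect()

    // files can then feed the flatMap(new ReadLinesSafe(_)) pattern above.
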
>
> Michael
>
>
> On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel <p...@occamsmachete.com>
> wrote:
>
>> It’s a long story, but there are many dirs with smallish part-xxxx files
>> in them, so we create a list of the individual files as input to
>> sparkContext.textFile(fileList). I suppose we could move them and rename
>> them to be contiguous part-xxxx files in one dir. Would that be better than
>> passing in a long list of individual filenames? We could also make the part
>> files much larger by collecting the smaller ones. But would any of this
>> make a difference in IO speed?
>>
>> I ask because reading what amounts to a not-very-large data set via the
>> long file list seems rather slow. If it were all in large part files in
>> one dir I’d expect it to go much faster, but this is just intuition.
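
(On the idea of collecting the small files into larger ones: one simple way to
try it is to read everything once and rewrite it with fewer partitions. Paths
and the partition count here are placeholders:)

    // Read all the small part files via a glob and rewrite them as a
    // smaller number of larger files.
    val small = sc.textFile("hdfs:///data/many-small-dirs/*/part-*")
    small.coalesce(32).saveAsTextFile("hdfs:///data/compacted")
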
>>
>>
>> On Mar 14, 2015, at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> Why can you not put them in a directory and read them as one input? You
>> will get a task per file, but Spark is very fast at executing many tasks
>> (it’s not a JVM per task).
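
(In code, that suggestion is just pointing textFile at a directory or a glob
instead of an explicit file list; the paths here are made up:)

    // One input spanning a whole directory...
    val lines = sc.textFile("hdfs:///data/all-parts/")
    // ...or a glob across many directories.
    // val lines = sc.textFile("hdfs:///data/*/part-*")
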
>>
>> On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel <p...@occamsmachete.com>
>> wrote:
>>
>>> Any advice on dealing with a large number of separate input files?
>>>
>>>
>>> On Mar 13, 2015, at 4:06 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>
>>> We have many text files that we need to read in parallel. We can create
>>> a comma delimited list of files to pass in to
>>> sparkContext.textFile(fileList). The list can get very large (maybe 10000)
>>> and is all on hdfs.
>>>
>>> The question is: what is the most performant way to read them? Should
>>> they be broken up and read in groups, appending the resulting RDDs, or
>>> should we just pass in the entire list at once? In effect I’m asking
>>> whether Spark does some optimization here or whether we should do it
>>> explicitly. If the latter, what rule might we use, depending on our
>>> cluster setup?
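
(For concreteness, the two options being compared look roughly like this; the
paths and the group size of 1000 are placeholders:)

    // Stand-in for the real list of ~10000 HDFS paths.
    val fileList = Seq("hdfs:///data/dir1/part-00000", "hdfs:///data/dir2/part-00000")

    // Option 1: pass the entire list at once as a comma-delimited string.
    val all = sc.textFile(fileList.mkString(","))

    // Option 2: read in groups and union the resulting RDDs.
    val grouped = fileList.grouped(1000).map(g => sc.textFile(g.mkString(",")))
    val unioned = sc.union(grouped.toSeq)
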
>>>
>>
>>
>
