I agree it would be better if Spark handled this automatically, though doing
so is probably a non-trivial amount of work.  My code is certainly worse if
you have only a few very large text files, for example, so I'd generally
encourage people to try the built-in options first.

However, one of the nice things about Spark is the flexibility it gives you,
and when you are trying to read 100,000s of tiny files this approach works
pretty well.  I'll also note that it does not create a task per file, which
is another reason it's faster for the many-small-files case.  Of course that
comes at the expense of locality (which doesn't matter for my use case on S3
anyway)...
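
To make the shape of the approach concrete, here is a minimal sketch (the
paths and slice count are made up, and a plain local-file reader is inlined
instead of the ReadLinesSafe class from the gist below); the point is just
that the task count comes from the numSlices argument to parallelize, not
from the number of files:

import org.apache.spark.{SparkConf, SparkContext}

object ManySmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("many-small-files"))

    // Hypothetical input: in practice this could be 100,000s of paths.
    val paths = Seq("/data/in/file1.txt", "/data/in/file2.txt")

    // Task count is fixed by numSlices, independent of how many files there are.
    val numSlices = 200
    val lines = sc.parallelize(paths, numSlices).flatMap { path =>
      val src = scala.io.Source.fromFile(path)
      try src.getLines().toVector finally src.close()
    }

    println(lines.count())
    sc.stop()
  }
}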

On Tue, Mar 17, 2015 at 8:16 AM, Imran Rashid <iras...@cloudera.com> wrote:

> Interesting. On another thread, I was just arguing that users should
> *not* open and read the files themselves, because then they lose all the
> other goodies we have in HadoopRDD, e.g. the metric tracking.
>
> I think this strengthens Pat's argument that we might actually need better
> support for this in SparkContext itself?
>
> On Sat, Mar 14, 2015 at 1:11 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>>
>> Here is how I have dealt with many small text files (on S3, though this
>> should generalize) in the past:
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
>>
>>
>>
>>
>>> From: Michael Armbrust <mich...@databricks.com>
>>> Subject: Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
>>> Date: Thu, 27 Nov 2014 03:20:14 GMT
>>>
>>> In the past I have worked around this problem by avoiding sc.textFile().
>>> Instead, I read the data directly inside a Spark job.  Basically, you
>>> start with an RDD where each entry is the path of a file in S3, and then
>>> flatMap that with something that reads each file and returns its lines.
>>>
>>> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>>>
>>> Using this class you can do something like:
>>>
>>> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file2" ... ::
>>> Nil).flatMap(new ReadLinesSafe(_))
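>>>
>>> The gist has the real implementation; purely as a sketch of its likely
>>> shape (an assumption, not the gist itself), a ReadLinesSafe-style class
>>> that fits the usage above could be built on the Hadoop FileSystem API,
>>> skipping files that fail to read:
>>>
>>> import java.io.{BufferedReader, InputStreamReader}
>>> import org.apache.hadoop.conf.Configuration
>>> import org.apache.hadoop.fs.Path
>>>
>>> // Sketch only: reads one file eagerly and swallows per-file failures so a
>>> // single bad object does not fail the whole job.
>>> class ReadLinesSafe(pathStr: String) extends Iterable[String] with Serializable {
>>>   private val lines: Vector[String] =
>>>     try {
>>>       val path = new Path(pathStr)
>>>       val fs = path.getFileSystem(new Configuration())
>>>       val reader = new BufferedReader(new InputStreamReader(fs.open(path)))
>>>       try Iterator.continually(reader.readLine()).takeWhile(_ != null).toVector
>>>       finally reader.close()
>>>     } catch {
>>>       case _: Exception => Vector.empty  // "safe": skip unreadable files
>>>     }
>>>
>>>   override def iterator: Iterator[String] = lines.iterator
>>> }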
>>>
>>> You can also build up the list of files by running a Spark job:
>>> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
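>>>
>>> Again the gist is the real thing, but the rough idea (prefixes here are
>>> invented) is to parallelize a set of prefixes and list them on the
>>> executors:
>>>
>>> import org.apache.hadoop.conf.Configuration
>>> import org.apache.hadoop.fs.Path
>>>
>>> // Sketch: list many prefixes in parallel, collect the file paths to the driver.
>>> val prefixes = Seq("s3n://mybucket/logs/01/", "s3n://mybucket/logs/02/")
>>> val fileList = sc.parallelize(prefixes, prefixes.size).flatMap { prefix =>
>>>   val path = new Path(prefix)
>>>   val fs = path.getFileSystem(new Configuration())
>>>   fs.listStatus(path).filterNot(_.isDirectory).map(_.getPath.toString)
>>> }.collect().toSeq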
>>>
>>> Michael
>>>
>>>
>>> On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel <p...@occamsmachete.com>
>>> wrote:
>>>
>>>> It’s a long story, but there are many dirs with smallish part-xxxx files
>>>> in them, so we create a list of the individual files as input
>>>> to sparkContext.textFile(fileList). I suppose we could move and rename
>>>> them to be contiguous part-xxxx files in one dir. Would that be better than
>>>> passing in a long list of individual filenames? We could also make the part
>>>> files much larger by merging the smaller ones. But would any of this
>>>> make a difference in IO speed?
>>>>
>>>> I ask because reading via the long file list pulls in what amounts to a
>>>> not-very-large data set rather slowly. If it were all in large part files
>>>> in one dir I’d expect it to go much faster, but this is just intuition.
>>>>
>>>>
>>>> On Mar 14, 2015, at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>> Why can you not put them in a directory and read them as one input? You
>>>> will get a task per file, but Spark is very fast at executing many tasks
>>>> (it's not a JVM per task).
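>>>>
>>>> For example (paths invented), either a single directory or a
>>>> comma-separated list of globs should work, since textFile goes through
>>>> Hadoop's FileInputFormat:
>>>>
>>>> // One directory as a single input (a task per file):
>>>> val all = sc.textFile("hdfs:///data/mydataset/")
>>>>
>>>> // Or several dirs / globs at once (comma-separated paths):
>>>> val some = sc.textFile("hdfs:///data/run1/part-*,hdfs:///data/run2/part-*")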
>>>>
>>>> On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel <p...@occamsmachete.com>
>>>> wrote:
>>>>
>>>>> Any advice on dealing with a large number of separate input files?
>>>>>
>>>>>
>>>>> On Mar 13, 2015, at 4:06 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>> We have many text files that we need to read in parallel. We can
>>>>> create a comma-delimited list of files to pass in to
>>>>> sparkContext.textFile(fileList). The list can get very large (maybe 10,000)
>>>>> and is all on HDFS.
>>>>>
>>>>> The question is: what is the most performant way to read them? Should
>>>>> they be broken up and read in groups, unioning the resulting RDDs, or
>>>>> should we just pass in the entire list at once? In effect I’m asking
>>>>> whether Spark does this optimization for us or whether we should do it
>>>>> explicitly. If the latter, what rule might we use, depending on our
>>>>> cluster setup?
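>>>>>
>>>>> Concretely, the two options I mean look something like this (sketch;
>>>>> the paths and group size are invented):
>>>>>
>>>>> val fileList = Seq("hdfs:///data/a/part-00000", "hdfs:///data/b/part-00000")  // really ~10,000 paths
>>>>>
>>>>> // Option 1: one call with the whole comma-delimited list.
>>>>> val rdd1 = sc.textFile(fileList.mkString(","))
>>>>>
>>>>> // Option 2: read in groups and union the pieces ourselves.
>>>>> val rdd2 = sc.union(
>>>>>   fileList.grouped(1000).map(group => sc.textFile(group.mkString(","))).toSeq)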
>>>>>
>>>>
>>>>
>>>
>>
>
