Here is how I have dealt with many small text files in the past (on S3, though this should generalize): http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
> From: Michael Armbrust <mich...@databricks.com>
> Subject: Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
> Date: Thu, 27 Nov 2014 03:20:14 GMT
>
> In the past I have worked around this problem by avoiding sc.textFile().
> Instead I read the data directly inside of a Spark job. Basically, you
> start with an RDD where each entry is a file in S3 and then flatMap that
> with something that reads the files and returns the lines.
>
> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>
> Using this class you can do something like:
>
> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... ::
> Nil).flatMap(new ReadLinesSafe(_))
>
> You can also build up the list of files by running a Spark job:
> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
>
> Michael
>
> On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> It’s a long story, but there are many dirs with smallish part-xxxx files
>> in them, so we create a list of the individual files as input to
>> sparkContext.textFile(fileList). I suppose we could move them and rename
>> them to be contiguous part-xxxx files in one dir. Would that be better than
>> passing in a long list of individual filenames? We could also make the part
>> files much larger by collecting the smaller ones. But would any of this
>> make a difference in IO speed?
>>
>> I ask because using the long file list seems to read what amounts to a
>> not very large data set rather slowly. If it were all in large part files
>> in one dir I’d expect it to go much faster, but this is just intuition.
>>
>> On Mar 14, 2015, at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> why can you not put them in a directory and read them as one input? you
>> will get a task per file, but spark is very fast at executing many tasks
>> (it's not a JVM per task).
>>
>> On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>>> Any advice on dealing with a large number of separate input files?
>>>
>>> On Mar 13, 2015, at 4:06 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>
>>> We have many text files that we need to read in parallel. We can create
>>> a comma-delimited list of files to pass in to sparkContext.textFile(fileList).
>>> The list can get very large (maybe 10000) and is all on HDFS.
>>>
>>> The question is: what is the most performant way to read them? Should
>>> they be broken up and read in groups, appending the resulting RDDs, or should
>>> we just pass in the entire list at once? In effect I’m asking whether Spark does
>>> some optimization or whether we should do it explicitly. If the latter, what
>>> rule might we use depending on our cluster setup?
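For reference, here is a minimal sketch of the pattern Michael describes: parallelize the list of paths, then flatMap a reader over them so the files are opened inside the job rather than by sc.textFile(). The readLines helper, the example paths, and the partition count below are hypothetical stand-ins (the actual ReadLinesSafe class is in the linked gist); this sketch goes through the Hadoop FileSystem API so the same code should work for hdfs:// as well as s3n://(or s3a://) paths, assuming the relevant filesystem connector is on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object ManySmallFiles {

  // Hypothetical helper (stands in for the ReadLinesSafe gist): opens one
  // file through the Hadoop FileSystem API and returns its lines,
  // swallowing per-file failures so one bad file doesn't kill the job.
  def readLines(pathStr: String): Seq[String] = {
    try {
      val path = new Path(pathStr)
      val fs   = path.getFileSystem(new Configuration())
      val in   = fs.open(path)
      try Source.fromInputStream(in).getLines().toList
      finally in.close()
    } catch {
      case _: Exception => Seq.empty // skip unreadable files
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("many-small-files"))

    // The (possibly very long) list of individual part files; paths are made up.
    val fileList: Seq[String] = Seq(
      "hdfs:///data/dir1/part-00000",
      "hdfs:///data/dir2/part-00000"
      // ... up to ~10000 paths
    )

    // Parallelize the path list and read each file inside the job.
    // numSlices controls how many tasks the paths are batched into,
    // instead of getting one task per file.
    val lines = sc.parallelize(fileList, numSlices = 200)
      .flatMap(readLines)

    println(lines.count())
    sc.stop()
  }
}
```

The choice of numSlices is the knob that answers the original question: rather than concatenating files or appending RDDs, you batch many small files into each task by keeping the number of slices well below the number of files.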