Interesting. On another thread, I was just arguing that users should *not* open the files themselves and read them, because then they lose all the other goodies we have in HadoopRDD, e.g. the metric tracking.
I think this supports Pat's argument that we might actually need better support for this in SparkContext itself.

On Sat, Mar 14, 2015 at 1:11 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> Here is how I have dealt with many small text files (on s3 though this
> should generalize) in the past:
>
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
>
>> From: Michael Armbrust <mich...@databricks.com>
>> Subject: Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
>> Date: Thu, 27 Nov 2014 03:20:14 GMT
>>
>> In the past I have worked around this problem by avoiding sc.textFile().
>> Instead I read the data directly inside of a Spark job. Basically, you
>> start with an RDD where each entry is a file in S3 and then flatMap that
>> with something that reads the files and returns the lines.
>>
>> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>>
>> Using this class you can do something like:
>>
>> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... ::
>> Nil).flatMap(new ReadLinesSafe(_))
>>
>> You can also build up the list of files by running a Spark job:
>> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
>>
>> Michael
>>
>>
>> On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel <p...@occamsmachete.com>
>> wrote:
>>
>>> It’s a long story but there are many dirs with smallish part-xxxx files
>>> in them so we create a list of the individual files as input
>>> to sparkContext.textFile(fileList). I suppose we could move them and rename
>>> them to be contiguous part-xxxx files in one dir. Would that be better than
>>> passing in a long list of individual filenames? We could also make the part
>>> files much larger by collecting the smaller ones. But would any of this
>>> make a difference in IO speed?
>>>
>>> I ask because using the long file list seems to read what amounts to a
>>> not very large data set rather slowly. If it were all in large part files
>>> in one dir I’d expect it to go much faster but this is just intuition.
>>>
>>>
>>> On Mar 14, 2015, at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> why can you not put them in a directory and read them as one input? you
>>> will get a task per file, but spark is very fast at executing many tasks
>>> (it's not a jvm per task).
>>>
>>> On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel <p...@occamsmachete.com>
>>> wrote:
>>>
>>>> Any advice on dealing with a large number of separate input files?
>>>>
>>>>
>>>> On Mar 13, 2015, at 4:06 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>
>>>> We have many text files that we need to read in parallel. We can create
>>>> a comma delimited list of files to pass in to
>>>> sparkContext.textFile(fileList). The list can get very large (maybe 10000)
>>>> and is all on hdfs.
>>>>
>>>> The question is: what is the most performant way to read them? Should
>>>> they be broken up and read in groups appending the resulting RDDs or should
>>>> we just pass in the entire list at once? In effect I’m asking if Spark does
>>>> some optimization or whether we should do it explicitly. If the latter, what
>>>> rule might we use depending on our cluster setup?
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
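
For concreteness, here is a minimal sketch of the two approaches discussed in this thread: passing the comma-delimited list straight to textFile, and the flatMap workaround Michael describes. This is not the ReadLinesSafe class from his gist. The names ManySmallFiles, viaTextFile, viaFlatMap and the numSlices parameter are made up for illustration, and the second method assumes the files are UTF-8 text reachable through the default Hadoop configuration on the executors.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object ManySmallFiles {

  // Approach 1: hand the whole comma-delimited list to textFile.
  // The read stays inside HadoopRDD, so its metric tracking is kept.
  def viaTextFile(sc: SparkContext, files: Seq[String]): RDD[String] =
    sc.textFile(files.mkString(","))

  // Approach 2: distribute the file names and open each file inside a task.
  // This bypasses HadoopRDD, so the extras it provides are lost.
  def viaFlatMap(sc: SparkContext, files: Seq[String], numSlices: Int): RDD[String] =
    sc.parallelize(files, numSlices).flatMap { name =>
      val path = new Path(name)
      // Configuration is not serializable, so it is built on the executor
      // (assumption: the default configuration there is sufficient).
      val fs = path.getFileSystem(new Configuration())
      val reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))
      try {
        // Materialize the lines before the stream is closed.
        Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
      } finally {
        reader.close()
      }
    }
}
```

Which of the two is faster for roughly 10000 small files will depend on the cluster and the file system, so it is probably worth timing both on a sample of the real input before committing to either.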