Interesting. On another thread, I was just arguing that users should
*not* open and read the files themselves, because then they lose all the
other goodies we have in HadoopRDD, e.g. the metric tracking.
I think this supports Pat's argument that we might actually need better
support for this.
I agree that it would be better if Spark did a better job automatically
here, though doing so is probably a non-trivial amount of work. My code is
certainly worse if you have only a few very large text files, for example,
so I'd generally encourage people to try the built-in options first.
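For concreteness, the built-in options I have in mind are roughly these (a minimal sketch from the spark-shell; the paths and partition counts are made up):

    // Glob over many part- files; Hadoop expands the pattern, so you don't
    // have to enumerate the files yourself.
    val byGlob = sc.textFile("hdfs:///data/*/part-*", 400)  // 400 = minPartitions hint

    // For lots of small files, read (path, contents) pairs in a single call.
    val byFile = sc.wholeTextFiles("hdfs:///data/small-files")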
There are no doubt many things that feed into the right way to read a lot of
files into Spark. But why force users to learn all of those factors instead of
putting an optimizer layer into the read inside Spark?
BTW I realize your method is not one task per file, it’s chunked and done in
Hi,
Internally, Spark uses the HDFS API to handle file data. Have a look at HAR
files and the SequenceFile input format. There is more information in this Cloudera blog post:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
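For example, one way to pack many small files into a single SequenceFile once and read that back afterwards (a rough sketch, paths are illustrative):

    // Pack (path, contents) pairs from many small files into one SequenceFile.
    sc.wholeTextFiles("hdfs:///input/small-files")
      .saveAsSequenceFile("hdfs:///packed/seq")

    // Later reads then hit one large file instead of thousands of small ones.
    import org.apache.hadoop.io.Text
    val packed = sc.sequenceFile("hdfs:///packed/seq", classOf[Text], classOf[Text])
      .map { case (path, contents) => (path.toString, contents.toString) }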
Regards,
Madhukara Phatak
http://datamantra.io/
On Sun, Mar 15, 2015 at 9:59 PM, Pat
Ah most interesting—thanks.
So it seems sc.textFile(longFileList) has to read all of the file metadata for
partitioning purposes before starting the read, so what you do is not use it?
You create a task per file that reads one file (in parallel) per task, without
scanning for _all_ the metadata up front. Can't argue with that.
It's a long story, but there are many dirs with smallish part- files in them,
so we create a list of the individual files as input to
sparkContext.textFile(fileList). I suppose we could move and rename them to be
contiguous part- files in one dir. Would that be better than passing in the
long file list?
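For context, what we do today looks roughly like this (the directory layout below is made up, just to illustrate):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Today: enumerate every part- file and pass one big comma-delimited string to textFile.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val fileList = fs.globStatus(new Path("hdfs:///jobs/*/out/part-*")).map(_.getPath.toString)
    val rdd = sc.textFile(fileList.mkString(","))

    // Alternative: hand the glob straight to textFile and let Hadoop expand it.
    val rdd2 = sc.textFile("hdfs:///jobs/*/out/part-*")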
Any advice on dealing with a large number of separate input files?
On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote:
We have many text files that we need to read in parallel. We can create a
comma-delimited list of files to pass in to sparkContext.textFile(fileList).
Here is how I have dealt with many small text files (on S3, though this
should generalize) in the past:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
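Roughly, the idea there is to parallelize the file names and then read each file inside mapPartitions with the Hadoop FileSystem API, so each task handles a chunk of files rather than one file per task. A minimal sketch of that pattern (a paraphrase, not the exact code from the linked post; fileList and the chunk count are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import scala.io.Source

    val numChunks = 100  // tune so each task reads a batch of files
    val lines = sc.parallelize(fileList, numChunks).mapPartitions { paths =>
      val conf = new Configuration()  // created on the executor, once per partition
      paths.flatMap { p =>
        val path = new Path(p)
        val in = path.getFileSystem(conf).open(path)
        // Materialize the lines before closing the stream; fine for smallish files.
        try Source.fromInputStream(in).getLines().toList
        finally in.close()
      }
    }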
From: Michael Armbrust
We have many text files that we need to read in parallel. We can create a
comma-delimited list of files to pass in to sparkContext.textFile(fileList). The list
can get very large (maybe 1) and is all on HDFS.
The question is: what is the most performant way to read them? Should they be