Re: Need Advice about reading lots of text files

2015-03-17 Thread Imran Rashid
Interesting, on another thread, I was just arguing that the user should *not* open the files themselves and read them, because then they lose all the other goodies we have in HadoopRDD, e.g. the metric tracking. I think this strengthens Pat's argument that we might actually need better support for this.

Re: Need Advice about reading lots of text files

2015-03-17 Thread Michael Armbrust
I agree that it would be better if Spark did a better job automatically here, though doing so is probably a non-trivial amount of work. My code is certainly worse if you have only a few very large text files, for example, so I'd generally encourage people to try the built-in options first.
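
A minimal spark-shell sketch of the built-in options in question, assuming the predefined sc and made-up HDFS paths (not code from this thread):

  // Option 1: a glob or comma-delimited list, read through Spark's HadoopRDD.
  val lines = sc.textFile("hdfs:///data/2015/*/part-*")

  // Option 2: wholeTextFiles pairs each (path, contents) and is aimed at
  // directories containing many small files.
  val perFile = sc.wholeTextFiles("hdfs:///data/2015/03")
  val smallFileLines = perFile.flatMap { case (_, contents) => contents.split("\n") }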

Re: Need Advice about reading lots of text files

2015-03-17 Thread Pat Ferrel
There are no doubt many things that feed into the right way to read a lot of files into Spark. But why force users to learn all of those factors instead of putting an optimizer layer into the read inside Spark? BTW, I realize your method is not one task per file; it's chunked and done in
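
The chunked pattern referred to here could look roughly like the following spark-shell sketch; the file names, chunk count, and HDFS-reading helper are illustrative assumptions, not code from this thread:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val fileList = (0 until 10000).map(i => f"hdfs:///data/part-$i%05d")
  val numChunks = 200  // far fewer tasks than files

  // Each task handles a chunk of files, opening each one through the HDFS client.
  val lines = sc.parallelize(fileList, numChunks).flatMap { p =>
    val fs = FileSystem.get(new URI(p), new Configuration())
    val in = fs.open(new Path(p))
    try scala.io.Source.fromInputStream(in).getLines().toList
    finally in.close()
  }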

Re: Need Advice about reading lots of text files

2015-03-16 Thread madhu phatak
Hi, internally Spark uses the HDFS API to handle file data. Have a look at HAR files and the SequenceFile input format. There is more information in this Cloudera blog post: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Regards, Madhukara Phatak http://datamantra.io/
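
A hedged sketch of the SequenceFile route mentioned above: pack many small files into one splittable SequenceFile keyed by path, then read it back in bulk. The paths are hypothetical and sc is the spark-shell's SparkContext:

  import org.apache.hadoop.io.Text

  // Pack: one record per small file, keyed by its original path.
  sc.wholeTextFiles("hdfs:///input/small-files")
    .map { case (path, contents) => (new Text(path), new Text(contents)) }
    .saveAsSequenceFile("hdfs:///input/packed")

  // Read back: a few splittable inputs instead of thousands of tiny files.
  // Copy to String right away, since Hadoop reuses the Text objects.
  val restored = sc.sequenceFile("hdfs:///input/packed", classOf[Text], classOf[Text])
    .map { case (k, v) => (k.toString, v.toString) }
  val lines = restored.flatMap { case (_, contents) => contents.split("\n") }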

Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
Ah, most interesting, thanks. So it seems sc.textFile(longFileList) has to read all the metadata before starting the read, for partitioning purposes, so what you do is not use it? You create a task per file that reads one file (in parallel) per task, without scanning for _all_ the metadata. Can't argue
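
As a rough illustration of the one-file-per-task pattern being described (not the exact code from the linked thread), with hypothetical paths and the spark-shell's sc:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val files = Seq("hdfs:///events/a/part-00000", "hdfs:///events/b/part-00000")

  // One partition per file: each task opens and reads exactly one file, and the
  // driver never gathers splits/metadata for the whole list up front.
  val lines = sc.parallelize(files, files.size).flatMap { p =>
    val fs = FileSystem.get(new URI(p), new Configuration())
    val in = fs.open(new Path(p))
    try scala.io.Source.fromInputStream(in).getLines().toList
    finally in.close()
  }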

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
It's a long story, but there are many dirs with smallish part- files in them, so we create a list of the individual files as input to sparkContext.textFile(fileList). I suppose we could move them and rename them to be contiguous part- files in one dir. Would that be better than passing

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
Any advice on dealing with a large number of separate input files?

Re: Need Advice about reading lots of text files

2015-03-14 Thread Michael Armbrust
Here is how I have dealt with many small text files (on S3, though this should generalize) in the past: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E

Need Advice about reading lots of text files

2015-03-13 Thread Pat Ferrel
We have many text files that we need to read in parallel. We can create a comma-delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large (maybe 1) and is all on HDFS. The question is: what is the most performant way to read them? Should they be
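
For context, a minimal spark-shell sketch of the pattern in the question, with a made-up directory layout and assuming the default filesystem is HDFS:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Build the list of part- files on the driver from many directories.
  val fs = FileSystem.get(new Configuration())
  val fileList = fs.globStatus(new Path("/events/*/part-*")).map(_.getPath.toString)

  // Pass them to textFile as one comma-delimited string.
  val lines = sc.textFile(fileList.mkString(","))
  lines.count()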