Re: Strategies for reading large numbers of files

2014-11-19 Thread soojin
in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p19314.html

Re: Strategies for reading large numbers of files

2014-10-21 Thread Landon Kuhn
)) } else { list += file } }) } else { list += srcDir } list.toList }
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p15835.html

Re: Strategies for reading large numbers of files

2014-10-07 Thread deenar.toraskar
} }) } else { list += srcDir } list.toList }
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p15835.html

Re: Strategies for reading large numbers of files

2014-10-06 Thread Nicholas Chammas
Unfortunately not. Again, I wonder if adding support targeted at this small files problem would make sense for Spark core, as it is a common problem in our space. Right now, I don't know of any other options.
Nick
On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn lan...@janrain.com wrote:

Strategies for reading large numbers of files

2014-10-02 Thread Landon Kuhn
Hello, I'm trying to use Spark to process a large number of files in S3. I'm running into an issue that I believe is related to the high number of files, and the resources required to build the listing within the driver program. If anyone in the Spark community can provide insight or guidance, it

Re: Strategies for reading large numbers of files

2014-10-02 Thread Nicholas Chammas
I believe this is known as the Hadoop Small Files Problem, and it affects Spark as well. The best approach I've seen to merging small files like this is by using s3distcp, as suggested here http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/, as a pre-processing