import java.io.File
import scala.collection.mutable.ListBuffer

// Recursively collect every file under srcDir; a plain file is returned
// as a single-element list. (The head of this function was truncated in
// the archive; the signature and the directory branch are reconstructed
// from the surviving fragment.)
def listFiles(srcDir: File): List[File] = {
  val list = new ListBuffer[File]()
  if (srcDir.isDirectory) {
    srcDir.listFiles().foreach(file => {
      if (file.isDirectory) {
        list.appendAll(listFiles(file))
      } else {
        list += file
      }
    })
  } else {
    list += srcDir
  }
  list.toList
}
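A sketch of how the listing might feed into Spark (not from the original
message; runnable in spark-shell, where sc is provided, and the input
path is a placeholder):

import java.io.File
// Assumes the listFiles helper above is in scope.
val files = listFiles(new File("/data/input")) // placeholder path
// Joining the paths with commas makes sc.textFile build a single RDD
// instead of a union of one small RDD per file.
val lines = sc.textFile(files.map(_.getAbsolutePath).mkString(","))
println(lines.count())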
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p15835.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Unfortunately not. Again, I wonder if adding support targeted at this
small files problem would make sense for Spark core, as it is a common
problem in our space.
Right now, I don't know of any other options.
Nick
On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn lan...@janrain.com wrote:
Hello, I'm trying to use Spark to process a large number of files in S3.
I'm running into an issue that I believe is related to the high number of
files, and the resources required to build the listing within the driver
program. If anyone in the Spark community can provide insight or guidance,
it would be greatly appreciated.
I believe this is known as the Hadoop Small Files Problem, and it affects
Spark as well. The best approach I've seen to merging small files like this
is by using s3distcp, as suggested here
http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/,
as a pre-processing step.
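For illustration, one Spark-side option sometimes suggested for this
problem is SparkContext.wholeTextFiles, which returns (path, content)
pairs and combines many small files into each partition instead of
creating one split per file. A minimal sketch for spark-shell; the
bucket path and partition hint are placeholders:

// "s3n://my-bucket/events/" is a placeholder prefix; 64 is only a
// suggested minimum partition count.
val pairs = sc.wholeTextFiles("s3n://my-bucket/events/", 64)
// Each element is (filePath, fileContents).
println(pairs.count())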