Hi there,
I was wondering if anybody could help me find an efficient way to write a
MapReduce program with the following properties:
1) Each map function needs to access some huge files, which are around
6 GB.
2) These files are READ-ONLY. They are essentially huge look-up
tables that will not change for 2~3 years.
I tried two ways to make the program work, but neither of them is efficient:
1) The first approach I tried is to let each map function load those files
independently, like this:
map (...) { load(files); DoMapTask(...)}
2) The second approach I tried is to load the files before RDD.map(...) and
broadcast them. However, because the files are so large, the
broadcast overhead is 30 minutes to 1 hour.
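
To make the two attempts concrete, here is a minimal sketch in plain Python. It is only an illustration, not my actual job: load_files and do_map_task are hypothetical stand-ins, the tiny dict stands in for the ~6 GB look-up tables, and the Spark calls are mentioned only in comments. The load counter shows the difference between the two shapes.

```python
# Hypothetical stand-ins for illustration; none of these names come from a real job.
LOAD_COUNT = 0


def load_files():
    """Pretend to read the huge look-up tables; here it is just a tiny dict."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"a": 1, "b": 2, "c": 3}


def do_map_task(record, table):
    """The real per-record work would go here; this only does a look-up."""
    return table.get(record, 0)


# Approach 1: every map call loads the tables again.
def map_with_load(record):
    table = load_files()
    return do_map_task(record, table)


# Approach 2: load once up front, then share the result with every map call.
# With Spark this is roughly table_bc = sc.broadcast(table) followed by
# rdd.map(lambda r: do_map_task(r, table_bc.value)); shipping the value
# to the workers is where the broadcast overhead comes from.
def run_shared(records):
    table = load_files()
    return [do_map_task(r, table) for r in records]


records = ["a", "b", "c", "a"]

LOAD_COUNT = 0
result_1 = [map_with_load(r) for r in records]
loads_1 = LOAD_COUNT  # one load per record

LOAD_COUNT = 0
result_2 = run_shared(records)
loads_2 = LOAD_COUNT  # a single load, paid before any map call runs
```

Both shapes give the same results. The first pays the load cost once per map call, and the second pays it once but then has to ship the whole table to every worker.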
Could anybody help me find an efficient way to solve it?
Thanks very much.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/What-if-there-are-large-read-only-variables-shared-by-all-map-functions-tp10435.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.