Have a look at broadcast variables.
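
In case it helps, here is a rough sketch of what using a broadcast variable might look like in PySpark. The tiny dictionary stands in for your look-up files, and the names `enrich`, `lookup_table`, and `shared` are made up for illustration; only `SparkContext`, `broadcast`, `.value`, `parallelize`, and `mapPartitions` are Spark's own. It only shows the pattern, and I have not measured whether it avoids the 30min ~ 1 hour overhead you saw with 6GB of data.

```python
# Sketch only: the table contents and all helper names are hypothetical.

def enrich(records, table):
    """Look up each record's key in the shared read-only table."""
    return [(key, table.get(key)) for key in records]


def main():
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-sketch")
    try:
        # Stand-in for the large look-up table; load it once on the driver.
        lookup_table = {"a": 1, "b": 2, "c": 3}

        # Ship the table to each executor once, rather than with every task.
        shared = sc.broadcast(lookup_table)

        keys = sc.parallelize(["a", "c", "x"])
        # Tasks read the executor-local copy through .value.
        result = keys.mapPartitions(
            lambda part: enrich(list(part), shared.value)
        ).collect()
        print(result)
    finally:
        sc.stop()


if __name__ == "__main__":
    main()
```

The point of the pattern is that the closure passed to `mapPartitions` captures only the small broadcast handle, so the table itself is not re-serialized with each task.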

On Tuesday, July 22, 2014, Parthus <peng.wei....@gmail.com> wrote:

> Hi there,
>
> I was wondering if anybody could help me find an efficient way to make a
> MapReduce program like this:
>
> 1) For each map function, it need access some huge files, which is around
> 6GB
>
> 2) These files are READ-ONLY. Actually they are like some huge look-up
> table, which will not change during 2~3 years.
>
> I tried two ways to make the program work, but neither of them is
> efficient:
>
> 1) The first approach I tried is to let each map function load those files
> independently, like this:
>
> map (...) { load(files); DoMapTask(...)}
>
> 2) The second approach I tried is to load the files before RDD.map(...) and
> broadcast the files. However, because the files are too large, the
> broadcasting overhead is 30min ~ 1 hour.
>
> Could anybody help me find an efficient way to solve it?
>
> Thanks very much.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-if-there-are-large-read-only-variables-shared-by-all-map-functions-tp10435.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Sent from Gmail Mobile