Spark provides an abstraction called broadcast variables. They have multiple underlying implementations and can be much more convenient than Hadoop's DistributedCache.
http://spark.incubator.apache.org/docs/0.7.3/scala-programming-guide.html#broadcast-variables

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


On Wed, Sep 11, 2013 at 7:11 PM, Konstantin Abakumov <[email protected]> wrote:

> Hello everyone!
>
> I am solving a task in which every cluster node executing a Spark job
> needs access to a big external file. The file is the MaxMind GeoIP
> database, and its size is around 15 megabytes. MaxMind's library reads
> it continuously with random access. Of course, it could simply be
> stored in HDFS, but random-access reads from HDFS would be quite
> inefficient.
>
> Hadoop MapReduce has a DistributedCache module dedicated to this
> purpose: we can specify files in HDFS that will be required during job
> execution, and they are copied to the worker nodes before the job
> starts, so the job can efficiently access local copies.
>
> I haven't found a simple and effective way of doing the same thing in
> Spark. Is there a preferred way to do so?
>
> --
> Best regards,
> Konstantin Abakumov
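The broadcast-variable approach suggested above could look roughly like the sketch below. This is a hedged illustration, not code from the thread: the `GeoIpReader` class is a hypothetical stand-in for the MaxMind lookup library, the database path is made up, and the import assumes the Spark 0.7.x package layout (`spark.SparkContext` rather than the later `org.apache.spark` namespace).

```scala
import spark.SparkContext  // Spark 0.7.x package layout (assumption)

// Hypothetical stand-in for the MaxMind reader, which performs
// random-access lookups against the database bytes.
class GeoIpReader(db: Array[Byte]) {
  def countryOf(ip: String): String = "??"  // placeholder lookup
}

object GeoIpBroadcast {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "geoip-broadcast")

    // Read the ~15 MB database once on the driver.
    val dbBytes = java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("GeoIP.dat"))  // hypothetical path

    // Broadcast it: Spark ships the bytes to each worker node once,
    // rather than once per task, much like DistributedCache.
    val dbBroadcast = sc.broadcast(dbBytes)

    val ips = sc.parallelize(Seq("1.2.3.4", "5.6.7.8"))

    // Build one reader per partition from the local broadcast copy,
    // instead of one per record.
    val countries = ips.mapPartitions { iter =>
      val reader = new GeoIpReader(dbBroadcast.value)
      iter.map(reader.countryOf)
    }

    countries.collect().foreach(println)
    sc.stop()
  }
}
```

The key point is that `dbBroadcast.value` resolves to the worker's local copy of the bytes, so each node pays the transfer cost only once regardless of how many tasks it runs.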
