Spark provides an abstraction called broadcast variables. They have multiple underlying implementations and can be much more convenient than Hadoop's DistributedCache.
http://spark.incubator.apache.org/docs/0.7.3/scala-programming-guide.html#broadcast-variables

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


On Wed, Sep 11, 2013 at 7:11 PM, Konstantin Abakumov <[email protected]> wrote:

> Hello everyone!
>
> I am solving a task in which every cluster node executing a Spark job
> needs access to a big external file. The file is the MaxMind GeoIP
> database, and its size is around 15 megabytes. MaxMind's library reads
> it continuously with random access. Of course, it could simply be
> stored in HDFS, but random-access reads from HDFS would be quite
> inefficient.
>
> Hadoop MapReduce has a DistributedCache module dedicated to this
> purpose: we can specify files in HDFS that will be required during job
> execution, and they are copied to the worker nodes before the job
> starts, so the job can efficiently access local copies.
>
> I haven't found a simple and effective way of doing the same thing in
> Spark. Is there a preferred way to do so?
>
> --
> Best regards,
> Konstantin Abakumov
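The broadcast-variable approach suggested above could look roughly like the sketch below. This is a hedged illustration, not code from the thread: the `GeoIpReader` class is a hypothetical stand-in for the MaxMind lookup library, the database path is made up, and the import assumes the Spark 0.7.x package layout (`spark.SparkContext` rather than the later `org.apache.spark` namespace).

```scala
import spark.SparkContext  // Spark 0.7.x package layout (assumption)

// Hypothetical stand-in for the MaxMind reader, which performs
// random-access lookups against the database bytes.
class GeoIpReader(db: Array[Byte]) {
  def countryOf(ip: String): String = "??"  // placeholder lookup
}

object GeoIpBroadcast {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "geoip-broadcast")

    // Read the ~15 MB database once on the driver.
    val dbBytes = java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("GeoIP.dat"))  // hypothetical path

    // Broadcast it: Spark ships the bytes to each worker node once,
    // rather than once per task, much like DistributedCache.
    val dbBroadcast = sc.broadcast(dbBytes)

    val ips = sc.parallelize(Seq("1.2.3.4", "5.6.7.8"))

    // Build one reader per partition from the local broadcast copy,
    // instead of one per record.
    val countries = ips.mapPartitions { iter =>
      val reader = new GeoIpReader(dbBroadcast.value)
      iter.map(reader.countryOf)
    }

    countries.collect().foreach(println)
    sc.stop()
  }
}
```

The key point is that `dbBroadcast.value` resolves to the worker's local copy of the bytes, so each node pays the transfer cost only once regardless of how many tasks it runs.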
