Theodore,

Broadcast variables do that when using the DataSet API -
http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/

See the following lines in the article-
To support the above presented algorithm efficiently we had to improve
Flinkā€™s broadcasting mechanism since it easily becomes the bottleneck of
the implementation. The enhanced Flink version can share broadcast
variables among multiple tasks running on the same machine. *Sharing avoids
having to keep for each task an individual copy of the broadcasted variable
on the heap. This increases the memory efficiency significantly, especially
if the broadcasted variables can grow up to several GBs of size.*

If you are using in the DataStream API then side-inputs (not yet
implemented) would achieve the same as broadcast variables.  (
https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#)
. I use keyed Connected Streams in situation where I need them for one of
my use-cases (propagating rule changes to the data) where I could have used
side-inputs.

Sameer




On Thu, Aug 4, 2016 at 8:56 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:

> Hello all,
>
> for a prototype we are looking into we would like to read a big matrix
> from HDFS, and for every element that comes in a stream of vectors do on
> multiplication with the matrix. The matrix should fit in the memory of one
> machine.
>
> We can read in the matrix using a RichMapFunction, but that would mean
> that a copy of the matrix is made for each Task Slot AFAIK, if the
> RichMapFunction is instantiated once per Task Slot.
>
> So I'm wondering how should we try address this problem, is it possible to
> have just one copy of the object in memory per TM?
>
> As a follow-up if we have more than one TM per node, is it possible to
> share memory between them? My guess is that we have to look at some
> external store for that.
>
> Cheers,
> Theo
>

Reply via email to