You mean "Connected Streams"? I use that for the same requirement. The way it 
works, it looks like it creates multiple copies per co-map operation. I use the 
keyed version to match side inputs with the data. 
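
Roughly the pattern I mean (untested sketch; the Vector/MatrixUpdate types and
the multiply() helper are placeholders). For a single global matrix the key is
just a constant, so the matrix and every vector end up on the same task:

    DataStream<Vector> vectors = ...;        // the data stream
    DataStream<MatrixUpdate> matrices = ...; // the "side input" stream

    vectors
        .keyBy(new KeySelector<Vector, Integer>() {
            @Override
            public Integer getKey(Vector v) { return 0; }
        })
        .connect(matrices.keyBy(new KeySelector<MatrixUpdate, Integer>() {
            @Override
            public Integer getKey(MatrixUpdate m) { return 0; }
        }))
        .flatMap(new RichCoFlatMapFunction<Vector, MatrixUpdate, Vector>() {
            private transient double[][] matrix; // latest matrix seen by this task

            @Override
            public void flatMap1(Vector v, Collector<Vector> out) {
                if (matrix != null) {
                    out.collect(multiply(matrix, v)); // placeholder multiply
                }
            }

            @Override
            public void flatMap2(MatrixUpdate update, Collector<Vector> out) {
                matrix = update.toArray(); // replace the side input
            }
        });

With a real key (e.g. a rule id in my rule-change use case) the same pattern
spreads the side input across tasks instead of sending everything to one.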

Sent from my iPhone

> On Aug 5, 2016, at 12:36 PM, Theodore Vasiloudis 
> <theodoros.vasilou...@gmail.com> wrote:
> 
> Yes this is a streaming use case, so broadcast is not an option.
> 
> If I get it correctly, with connected streams I would emulate a side input by 
> "streaming" the matrix with a key that all incoming vector records match on?
> 
> Wouldn't that create multiple copies of the matrix in memory?
> 
>> On Thu, Aug 4, 2016 at 6:56 PM, Sameer W <sam...@axiomine.com> wrote:
>> Theodore,
>> 
>> Broadcast variables do that when using the DataSet API - 
>> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
>> 
>> See the following lines in the article:
>> To support the above presented algorithm efficiently we had to improve 
>> Flink's broadcasting mechanism since it easily becomes the bottleneck of the 
>> implementation. The enhanced Flink version can share broadcast variables 
>> among multiple tasks running on the same machine. Sharing avoids having to 
>> keep for each task an individual copy of the broadcasted variable on the 
>> heap. This increases the memory efficiency significantly, especially if the 
>> broadcasted variables can grow up to several GBs of size.
>> 
>> If you are using the DataStream API, then side inputs (not yet implemented) 
>> would achieve the same as broadcast variables 
>> (https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#). 
>> In one of my use cases (propagating rule changes to the data) I use keyed 
>> Connected Streams where I could have used side inputs.
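>> 
>> A rough sketch of the broadcast-variable pattern in the DataSet API (untested;
>> the Vector type and the multiply() helper are placeholders):
>> 
>>     DataSet<double[]> matrixRows = ...; // e.g. read and parse the matrix from HDFS
>>     DataSet<Vector> vectors = ...;
>> 
>>     vectors
>>         .map(new RichMapFunction<Vector, Vector>() {
>>             private List<double[]> matrix;
>> 
>>             @Override
>>             public void open(Configuration parameters) {
>>                 // look up the broadcast set by name on each task
>>                 matrix = getRuntimeContext().getBroadcastVariable("matrix");
>>             }
>> 
>>             @Override
>>             public Vector map(Vector v) {
>>                 return multiply(matrix, v); // placeholder
>>             }
>>         })
>>         .withBroadcastSet(matrixRows, "matrix");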
>> 
>> Sameer
>> 
>>> On Thu, Aug 4, 2016 at 8:56 PM, Theodore Vasiloudis 
>>> <theodoros.vasilou...@gmail.com> wrote:
>>> Hello all,
>>> 
>>> for a prototype we are looking into, we would like to read a big matrix from 
>>> HDFS, and for every element that comes in on a stream of vectors, do a 
>>> multiplication with the matrix. The matrix should fit in the memory of one 
>>> machine.
>>> 
>>> We can read in the matrix using a RichMapFunction, but that would mean
>>> that a copy of the matrix is made for each Task Slot AFAIK, if the 
>>> RichMapFunction is instantiated once per Task Slot.
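>>> 
>>> Roughly what we have now (untested sketch; loadMatrixFromHdfs() and multiply()
>>> are placeholders):
>>> 
>>>     vectors.map(new RichMapFunction<Vector, Vector>() {
>>>         private transient double[][] matrix;
>>> 
>>>         @Override
>>>         public void open(Configuration parameters) throws Exception {
>>>             // open() runs once per parallel subtask, so every Task Slot
>>>             // ends up holding its own copy of the matrix
>>>             matrix = loadMatrixFromHdfs();
>>>         }
>>> 
>>>         @Override
>>>         public Vector map(Vector v) {
>>>             return multiply(matrix, v);
>>>         }
>>>     });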
>>> 
>>> So I'm wondering how we should address this problem: is it possible to 
>>> have just one copy of the object in memory per TM?
>>> 
>>> As a follow-up: if we have more than one TM per node, is it possible to 
>>> share memory between them? My guess is that we would have to look at some 
>>> external store for that.
>>> 
>>> Cheers,
>>> Theo
> 
