using updating shared data

2019-01-01 Thread Avi Levi
Hi,
I have a list (a couple of thousand text lines) that I need to use in my map
function. I read this article about broadcast variables, and about using the
distributed cache; however, I need to update this list from time to time, and
if I understood correctly that is not possible with broadcast variables or the
distributed cache without restarting the job. Is there an idiomatic way to
achieve this? A DB seems to be overkill for that, and I want to keep IO/network
calls to a minimum.

Cheers
Avi


Re: using updating shared data

2019-01-01 Thread miki haiat
I'm trying to understand your use case.
What is the source of the data? A filesystem, Kafka, or something else?




Re: using updating shared data

2019-01-02 Thread Till Rohrmann
Hi Avi,

you could use Flink's broadcast state pattern [1]. You would need to use the
DataStream API, but it lets you have two streams (an input stream and a
control stream), where the control stream is broadcast to all subtasks. By
ingesting messages into the control stream you can send model updates to all
subtasks.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html

Cheers,
Till
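
The pattern can be illustrated with a small, Flink-free sketch (plain Python
with made-up names; in Flink itself this would be a BroadcastProcessFunction
backed by a MapStateDescriptor, not hand-rolled classes like these):

```python
# Sketch: each subtask keeps its own copy of the shared list ("broadcast
# state") and applies control-stream updates to it, so data elements are
# always mapped against the current version of the list.

class Subtask:
    def __init__(self):
        self.blacklist = set()  # the broadcast state: the shared list

    def on_control(self, op, line):
        # A control-stream element: a CRUD update to the shared list.
        if op == "add":
            self.blacklist.add(line)
        elif op == "remove":
            self.blacklist.discard(line)

    def on_data(self, line):
        # A data-stream element: the map function consults the current list.
        return "blocked" if line in self.blacklist else "ok"

def broadcast(subtasks, op, line):
    # A broadcast stream delivers every control message to every subtask.
    for s in subtasks:
        s.on_control(op, line)

subtasks = [Subtask() for _ in range(3)]
broadcast(subtasks, "add", "bad-line")
results = [s.on_data("bad-line") for s in subtasks]  # every subtask sees it
```

The key property the sketch shows is that no external store is needed: the
control stream itself carries the updates, and each parallel instance holds
the list locally.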



Re: using updating shared data

2019-01-02 Thread Avi Levi
Thanks Till, I will definitely check it out. Just to make sure I got you
correctly: you are suggesting that the list I want to broadcast will be sent
via the control stream and then kept in the relevant operator's state,
correct? And updates (CRUD) on that list will be performed via the control
stream, correct?
BR
Avi



Re: using updating shared data

2019-01-02 Thread Till Rohrmann
Yes exactly Avi.

Cheers,
Till



Re: using updating shared data

2019-01-02 Thread Elias Levy
One thing you must be careful of: if you are using event-time processing, and
the control stream only receives messages sporadically, event time will stop
moving forward in the operator joining the streams while the control stream
is idle. You can get around this by using a periodic watermark extractor on
the control stream that bounds the event-time delay relative to processing
time, or by defining your own low-level operator that ignores watermarks from
the control stream.
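
A rough sketch of the first workaround (plain Python, not the Flink API;
BOUND_MS is an assumed value), showing why a processing-time-bounded watermark
keeps event time moving even when the control stream is idle:

```python
import time

BOUND_MS = 5_000  # assumed maximum event-time lag behind processing time

def control_watermark(last_event_ts_ms):
    # Periodic extractor for the (mostly idle) control stream: the watermark
    # tracks processing time minus a bound, so it keeps advancing even when
    # no control messages arrive.
    now_ms = int(time.time() * 1000)
    return max(last_event_ts_ms, now_ms - BOUND_MS)

# The joining operator's event time is the minimum of its input watermarks.
data_wm = 1_546_300_800_000  # data-stream watermark, epoch millis
joined_wm = min(data_wm, control_watermark(last_event_ts_ms=0))
assert joined_wm > 0  # the idle control stream no longer pins event time at 0
```

Without the bound, `control_watermark(0)` would stay at 0 forever and the
joined watermark would never advance.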




Re: using updating shared data

2019-01-03 Thread Avi Levi
Thanks for the tip Elias!



Re: using updating shared data

2019-01-04 Thread David Anderson
Another solution to the watermarking issue is to write an
AssignerWithPeriodicWatermarks for the control stream that always returns
Watermark.MAX_WATERMARK as the current watermark. This produces watermarks
for the control stream that will effectively be ignored.
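
Sketched in plain Python (Watermark.MAX_WATERMARK corresponds to Java's
Long.MAX_VALUE in Flink; the function names here are only illustrative):

```python
MAX_WATERMARK = 2**63 - 1  # Flink's Watermark.MAX_WATERMARK (Long.MAX_VALUE)

def control_stream_watermark():
    # Assigner for the control stream: always emit the maximum watermark,
    # so this stream never holds back the combined event time.
    return MAX_WATERMARK

# The joining operator takes the minimum of its inputs' watermarks, so with
# the control stream pinned at MAX_WATERMARK, event time is driven entirely
# by the data stream.
data_wm = 1_546_300_800_000  # e.g. a data-stream watermark in epoch millis
joined_wm = min(data_wm, control_stream_watermark())
assert joined_wm == data_wm
```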



Re: using updating shared data

2019-01-06 Thread Avi Levi
Sounds like a good idea, because in the control stream the time doesn't
really matter. Thanks!!!



Re: using updating shared data

2019-01-06 Thread Elias Levy
That is not fully correct. While in practice it may not matter, ignoring the
timestamps of control messages may result in non-deterministic behavior:
after a restart, a control message may be processed in a different order
relative to the other stream, so the output of multiple runs may differ
depending on the relative order in which the messages are processed.
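
The caveat can be seen in a toy replay (plain Python, hypothetical events):
the same set of messages produces different output depending only on the
interleaving of the two streams.

```python
def run(events):
    # Replay a merged stream of data and control events in the given order;
    # the output for each data event depends on the list contents at that
    # moment, so the interleaving determines the result.
    blocked, out = set(), []
    for kind, value in events:
        if kind == "control":
            blocked.add(value)
        else:
            out.append("blocked" if value in blocked else "ok")
    return out

# Same events, different interleaving (as could happen after a restart):
run_a = run([("control", "x"), ("data", "x")])  # control arrives first
run_b = run([("data", "x"), ("control", "x")])  # data arrives first
assert run_a != run_b  # the outputs diverge
```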
