Re: Broadcast state before events stream consumption

2019-02-21 Thread Dawid Wysakowicz
Hi Averell,

BroadcastState is a special case of OperatorState. Operator state is
always kept in memory at runtime (so it must fit into memory), no matter
which state backend you use. Nevertheless, it is snapshotted and thus
fault tolerant.

Best,

Dawid

On 21/02/2019 11:50, Averell wrote:
> Hi Konstantin,
>
> The statement below is mentioned at the end of the page 
> broadcast_state.html#important-considerations
> 
>   
> /"No RocksDB state backend: Broadcast state is kept in-memory at runtime and
> memory provisioning should be done accordingly. This holds for all operator
> states."/
>
> I am using the RocksDB state backend, and am confused by that statement and
> yours.
>
> Could you please help clarify?
>
> Thanks and regards,
> Averell
>  
>
>
>
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/





Re: Broadcast state before events stream consumption

2019-02-21 Thread Averell
Hi Konstantin,

The statement below is mentioned at the end of the page 
broadcast_state.html#important-considerations

  
/"No RocksDB state backend: Broadcast state is kept in-memory at runtime and
memory provisioning should be done accordingly. This holds for all operator
states."/

I am using the RocksDB state backend, and am confused by that statement and
yours.

Could you please help clarify?

Thanks and regards,
Averell
 





Re: Broadcast state before events stream consumption

2019-02-14 Thread Konstantin Knauf
Hi Chirag,

Broadcast state is checkpointed, hence the savepoint would contain it.

Best,

Konstantin

On Wed, Feb 13, 2019 at 4:04 PM Chirag Dewan 
wrote:

> Hi Konstantin,
>
> For the second solution, would a savepoint persist the broadcast state in
> the state backend? As far as I know, broadcast state is not checkpointed.
>
> Is that correct?
>
> Thanks,
>
> Chirag
>
>
> On Mon, 11 Feb 2019 at 2:39 PM, Konstantin Knauf
>  wrote:
> Hi Chirag, Hi Vadim,
>
> Off the top of my head, I see two options here:
>
> * Buffer the "fast" stream inside the KeyedBroadcastProcessFunction until
> relevant (whatever this means in your use case) broadcast events have
> arrived. Advantage: operationally easy, events are emitted as early as
> possible. Disadvantage: state size might become very large, and depending
> on the nature of the broadcast stream it might be hard to know when the
> "relevant" broadcast events have arrived.
>
> * Start your job and only consume the broadcast stream (by configuration).
> Once the stream is "fully processed", i.e. has caught up, take a savepoint.
> Finally, start the job from this savepoint with the correct "fast" stream.
> There is a small race condition between taking the savepoint and restarting
> the job, which might matter (or not) depending on your use case.
>
> This topic is related to event-time alignment in sources, which has been
> actively discussed in the community in the past and we might be able to
> solve this in a similar way in the future.
>
> Cheers,
>
> Konstantin
>
> On Fri, Feb 8, 2019 at 5:48 PM Chirag Dewan 
> wrote:
>
> Hi Vadim,
>
> I would be interested in this too.
>
> Presently, I have to read my lookup source in the *open* method and keep
> it in a cache. By doing that, I cannot make use of the broadcast state
> until, of course, the first element arrives on the *Broadcast* stream.
>
> The problem with making the event stream wait is that I cannot know when I
> have read all the data from the lookup source. In my use case it is also
> not possible to add a special marker to the data.
>
> So preloading the data seems to be the only option right now.
>
> Thanks,
>
> Chirag
>
>
>
> On Friday, 8 February, 2019, 7:45:37 pm IST, Vadim Vararu <
> vadim.var...@adswizz.com> wrote:
>
>
> Hi all,
>
> I need to use the broadcast state mechanism (
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html)
> for the following scenario.
>
> I have a reference data stream (slow) and an events stream (fast running)
> and I want to do a kind of lookup in the reference stream for each
> event. The broadcast state mechanism seems to fit the scenario perfectly.
>
> From the documentation:
> *As an example where broadcast state can emerge as a natural fit, one can
> imagine a low-throughput stream containing a set of rules which we want to
> evaluate against all elements coming from another stream.*
>
> However, I am not sure what is the correct way to delay the consumption of
> the fast running stream until the slow one is fully read (in case of a
> file) or until a marker is emitted (in case of some other source). Is there
> any way to accomplish that? It doesn't seem to be a rare use case.
>
> Thanks, Vadim.
>
>
>
> --
>
> Konstantin Knauf | Solutions Architect
>
> +49 160 91394525
>
> 
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>
>



Re: Broadcast state before events stream consumption

2019-02-13 Thread Chirag Dewan
Hi Konstantin,
For the second solution, would a savepoint persist the broadcast state in the
state backend? As far as I know, broadcast state is not checkpointed.
Is that correct?
Thanks,
Chirag





Re: Broadcast state before events stream consumption

2019-02-11 Thread Konstantin Knauf
Hi Chirag, Hi Vadim,

Off the top of my head, I see two options here:

* Buffer the "fast" stream inside the KeyedBroadcastProcessFunction until
relevant (whatever this means in your use case) broadcast events have
arrived. Advantage: operationally easy, events are emitted as early as
possible. Disadvantage: state size might become very large, and depending
on the nature of the broadcast stream it might be hard to know when the
"relevant" broadcast events have arrived.

* Start your job and only consume the broadcast stream (by configuration).
Once the stream is "fully processed", i.e. has caught up, take a savepoint.
Finally, start the job from this savepoint with the correct "fast" stream.
There is a small race condition between taking the savepoint and restarting
the job, which might matter (or not) depending on your use case.
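
A minimal sketch of the buffering idea from the first option, written in plain
Java with no Flink dependency so that it runs on its own. The class
BufferUntilReady, its method names, and the explicit "reference complete" signal
are assumptions made for illustration only; they are not Flink APIs, and how to
detect that the relevant broadcast events have arrived remains specific to the
use case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Plain-Java model (hypothetical, not a Flink operator) of "buffer until ready". */
class BufferUntilReady {
    // Stands in for the broadcast state: reference data used for lookups.
    private final Map<String, String> reference = new HashMap<>();
    // Stands in for state that holds back "fast" events; it can grow large.
    private final Queue<String> buffered = new ArrayDeque<>();
    // Collects emitted results so the behaviour is easy to inspect.
    private final List<String> output = new ArrayList<>();
    private boolean ready = false;

    /** Broadcast side: store one reference entry. */
    void onReference(String key, String value) {
        reference.put(key, value);
    }

    /** Assumed signal that the reference data has caught up; flushes the buffer in arrival order. */
    void onReferenceComplete() {
        ready = true;
        while (!buffered.isEmpty()) {
            emit(buffered.poll());
        }
    }

    /** Fast side: emit immediately once ready, otherwise hold the event back. */
    void onEvent(String key) {
        if (ready) {
            emit(key);
        } else {
            buffered.add(key);
        }
    }

    private void emit(String key) {
        output.add(key + "=" + reference.getOrDefault(key, "unknown"));
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        BufferUntilReady op = new BufferUntilReady();
        op.onEvent("user-1");              // held back: no reference data yet
        op.onReference("user-1", "gold");  // slow stream delivers a rule
        op.onReferenceComplete();          // buffered event is flushed here
        op.onEvent("user-2");              // emitted directly from now on
        System.out.println(op.output());
    }
}
```

In an actual KeyedBroadcastProcessFunction the buffer would presumably live in
keyed state, and the broadcast side would have to reach it through something
like the context's applyToKeyedState; the sketch leaves that wiring out.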

This topic is related to event-time alignment in sources, which has been
actively discussed in the community in the past and we might be able to
solve this in a similar way in the future.

Cheers,

Konstantin

On Fri, Feb 8, 2019 at 5:48 PM Chirag Dewan  wrote:

> Hi Vadim,
>
> I would be interested in this too.
>
> Presently, I have to read my lookup source in the *open* method and keep
> it in a cache. By doing that, I cannot make use of the broadcast state
> until, of course, the first element arrives on the *Broadcast* stream.
>
> The problem with making the event stream wait is that I cannot know when I
> have read all the data from the lookup source. In my use case it is also
> not possible to add a special marker to the data.
>
> So preloading the data seems to be the only option right now.
>
> Thanks,
>
> Chirag
>
>
>
> On Friday, 8 February, 2019, 7:45:37 pm IST, Vadim Vararu <
> vadim.var...@adswizz.com> wrote:
>
>
> Hi all,
>
> I need to use the broadcast state mechanism (
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html)
> for the following scenario.
>
> I have a reference data stream (slow) and an events stream (fast running)
> and I want to do a kind of lookup in the reference stream for each
> event. The broadcast state mechanism seems to fit the scenario perfectly.
>
> From the documentation:
> *As an example where broadcast state can emerge as a natural fit, one can
> imagine a low-throughput stream containing a set of rules which we want to
> evaluate against all elements coming from another stream.*
>
> However, I am not sure what is the correct way to delay the consumption of
> the fast running stream until the slow one is fully read (in case of a
> file) or until a marker is emitted (in case of some other source). Is there
> any way to accomplish that? It doesn't seem to be a rare use case.
>
> Thanks, Vadim.
>




Re: Broadcast state before events stream consumption

2019-02-08 Thread Chirag Dewan
Hi Vadim,
I would be interested in this too.
Presently, I have to read my lookup source in the open method and keep it in a
cache. By doing that, I cannot make use of the broadcast state until, of course,
the first element arrives on the Broadcast stream.
The problem with making the event stream wait is that I cannot know when I have
read all the data from the lookup source. In my use case it is also not possible
to add a special marker to the data.
So preloading the data seems to be the only option right now.
Thanks,
Chirag



Broadcast state before events stream consumption

2019-02-08 Thread Vadim Vararu
Hi all,

I need to use the broadcast state mechanism 
(https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html)
for the following scenario.

I have a reference data stream (slow) and an events stream (fast running) and I
want to do a kind of lookup in the reference stream for each event. The
broadcast state mechanism seems to fit the scenario perfectly.

From the documentation:
As an example where broadcast state can emerge as a natural fit, one can 
imagine a low-throughput stream containing a set of rules which we want to 
evaluate against all elements coming from another stream.

However, I am not sure what is the correct way to delay the consumption of the 
fast running stream until the slow one is fully read (in case of a file) or 
until a marker is emitted (in case of some other source). Is there any way to 
accomplish that? It doesn't seem to be a rare use case.

Thanks, Vadim.