Re: global state and single stream

2020-09-24 Thread Matthias Pohl
Hi Adam,
sorry for the late reply. Introducing global state is something that
should be avoided, as it creates bottlenecks and/or concurrency and
ordering issues. Broadcasting state between subtasks also comes at a
performance cost, since every state change has to be shared with every
other subtask. Ideally, you might be able to reconsider the design of
your pipeline.

Is there a specific reason that prevents you from doing the merging on a
single instance?
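
For illustration, here is a minimal sketch of what I mean by merging on a
single instance (assuming a Price POJO with getSymbol()/getPrice() and a
known set of expected symbols; class and field names are made up):

// All events are routed to a single key, so one parallel subtask sees
// every symbol and can hold the complete picture in its keyed state.
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class PriceMerger extends KeyedProcessFunction<Integer, Price, Map<String, Double>> {

    private final Set<String> expectedSymbols;  // e.g. "A" .. "Z"
    private transient MapState<String, Double> latestPrices;

    public PriceMerger(Set<String> expectedSymbols) {
        this.expectedSymbols = expectedSymbols;
    }

    @Override
    public void open(Configuration parameters) {
        latestPrices = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("latestPrices", String.class, Double.class));
    }

    @Override
    public void processElement(Price price, Context ctx,
                               Collector<Map<String, Double>> out) throws Exception {
        latestPrices.put(price.getSymbol(), price.getPrice());

        // Build the current snapshot and emit it only once every expected
        // symbol has been seen at least once; later updates re-emit it.
        Map<String, Double> snapshot = new HashMap<>();
        for (Map.Entry<String, Double> entry : latestPrices.entries()) {
            snapshot.put(entry.getKey(), entry.getValue());
        }
        if (snapshot.keySet().containsAll(expectedSymbols)) {
            out.collect(snapshot);
        }
    }
}

// Usage: prices.keyBy(p -> 0).process(new PriceMerger(allSymbols));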

Best,
Matthias

On Wed, Sep 16, 2020 at 11:21 PM Adam Atrea  wrote:

>
> Hi,
>
> I am pretty new to Flink and I'm trying to implement what seems to me a
> pretty basic pattern:
>
> Say I have a single stream of *Price* objects, each encapsulating a price
> value and a symbol (for example, A to Z). They are emitted at irregular
> intervals throughout the day; a given symbol might arrive once a day or
> once a week.
>
> *...(price=.22, symbol='B')...(price=.12, symbol='C')...(price=.12,
> symbol='A')...(price=.22, symbol='Z')...*
>
> I want to define an operator that emits a single object only once ALL
> symbols have been received (not before); the object will include all the
> received prices.
> If a price arrives again for a symbol that was already seen, the operator
> should re-emit the same object with that price updated and all other
> prices unchanged.
>
> If the stream is keyed by symbol, I understand that using MapState will
> not help, because keyed state is local to its partition, while each
> parallel task would need to know that all symbols have been received.
>
> Basically, I'm looking for global state for a single keyed stream: state
> accessible by all parallel tasks. I have used the broadcast state
> pattern, but I understand that it applies when connecting two streams.
> Is there a way to do this without forcing the parallelism to 1?
>
> Thanks in advance for your assistance
>
> Adam
>
>
>


global state and single stream

2020-09-16 Thread Adam Atrea
Hi,

I am pretty new to Flink and I'm trying to implement what seems to me a
pretty basic pattern:

Say I have a single stream of *Price* objects, each encapsulating a price
value and a symbol (for example, A to Z). They are emitted at irregular
intervals throughout the day; a given symbol might arrive once a day or
once a week.

*...(price=.22, symbol='B')...(price=.12, symbol='C')...(price=.12,
symbol='A')...(price=.22, symbol='Z')...*

I want to define an operator that emits a single object only once ALL
symbols have been received (not before); the object will include all the
received prices.
If a price arrives again for a symbol that was already seen, the operator
should re-emit the same object with that price updated and all other
prices unchanged.

If the stream is keyed by symbol, I understand that using MapState will
not help, because keyed state is local to its partition, while each
parallel task would need to know that all symbols have been received.
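
To make this concrete, here is a simplified sketch of the kind of keyed
setup I mean (Price and its getters are my own POJO; class and variable
names here are just illustrative):

// The MapState below is keyed state: with keyBy(symbol) it is scoped to
// the current symbol only, so it can never tell a single subtask that
// ALL symbols have arrived.
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class PerSymbolState extends KeyedProcessFunction<String, Price, Price> {

    private transient MapState<String, Double> pricesForThisKey;

    @Override
    public void open(Configuration parameters) {
        pricesForThisKey = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("prices", String.class, Double.class));
    }

    @Override
    public void processElement(Price price, Context ctx, Collector<Price> out) throws Exception {
        // Only ever holds entries for ctx.getCurrentKey(), i.e. one symbol.
        pricesForThisKey.put(price.getSymbol(), price.getPrice());
        out.collect(price);
    }

    public static DataStream<Price> wire(DataStream<Price> prices) {
        return prices.keyBy(Price::getSymbol).process(new PerSymbolState());
    }
}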

Basically, I'm looking for global state for a single keyed stream: state
accessible by all parallel tasks. I have used the broadcast state
pattern, but I understand that it applies when connecting two streams.
Is there a way to do this without forcing the parallelism to 1?
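
For reference, this is roughly the shape of my broadcast attempt
(simplified sketch; Price, allSymbols and the class names are
placeholders). It needs a second, broadcast stream connected back to the
keyed one, and every parallel subtask ends up with its own copy of the
broadcast state:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.flink.api.common.state.BroadcastState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastPriceMerge {

    private static final MapStateDescriptor<String, Double> ALL_PRICES =
            new MapStateDescriptor<>("allPrices", String.class, Double.class);

    public static DataStream<Map<String, Double>> wire(DataStream<Price> prices,
                                                       Set<String> allSymbols) {
        // The same stream is broadcast and connected back to its keyed version.
        BroadcastStream<Price> broadcast = prices.broadcast(ALL_PRICES);

        return prices
                .keyBy(Price::getSymbol)
                .connect(broadcast)
                .process(new KeyedBroadcastProcessFunction<String, Price, Price, Map<String, Double>>() {

                    @Override
                    public void processBroadcastElement(Price price, Context ctx,
                                                        Collector<Map<String, Double>> out) throws Exception {
                        BroadcastState<String, Double> state = ctx.getBroadcastState(ALL_PRICES);
                        state.put(price.getSymbol(), price.getPrice());

                        Map<String, Double> snapshot = new HashMap<>();
                        for (Map.Entry<String, Double> e : state.entries()) {
                            snapshot.put(e.getKey(), e.getValue());
                        }
                        if (snapshot.keySet().containsAll(allSymbols)) {
                            // Emitted by every parallel subtask, which is part of my problem.
                            out.collect(snapshot);
                        }
                    }

                    @Override
                    public void processElement(Price price, ReadOnlyContext ctx,
                                               Collector<Map<String, Double>> out) {
                        // Keyed side unused in this sketch.
                    }
                });
    }
}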

Thanks in advance for your assistance

Adam