Hi 
I have few question regarding Flink's state.

Lets say we have:

Case 1.
stream.keybBy(...).process(myProcessFunction).parallelism(3).

MyProcessFucntion uses a managed state (mapState, ListState etc). I'm using
state checkpoints.

Flink will redistribute events across 3 instances of myProcessFunction
according to keyby function.
When job is restarted with the same parallelism level, state is recovered
from last checkpoint and traffic is redistributed across process Function
with the same manner. 

What will happen though, if I will increase the parallelism level to 4. 
The traffic will be distributed across 4 instances now, so key that was
originally going to operator 3, now can go to operator 4. What will happen
with managed state that originally was builder for this key. Will it be
accessible from new operator instance now? From [1] where I can read " Flink
is able to automatically redistribute state when the parallelism is changed,
and also do better memory management." I will assume YES.

Case 2:
Lets assume I have two operators, where each of them is using a managed
state. From documentation I can read that managed state can be used only on
a keyed stream. This means that I will have to key my stream twice (or more)
if I want to use managed stream in all of my operators? What if the actual
keyBy function will be the same for all pipeline. Each keyBy function hit
performance right?


Case 3:
Is there a possibility to use managed state on non keyed stream? For example
I have a process function that has a "map of key value mappings" This map
can be delivered/build using a broadcast state pattern and can be quite big.
Sounds like a good place to use MapState, but the stream is not keyed. 
How can I approach this?

Lets assume for all cases that I'm using a RocksDB state backend 

Thanks,


[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html







--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply via email to