Hi Peter,

State initialization with historic data is a use case that's coming up
more and more often.
Unfortunately, there's no good solution for this yet, only a couple of
workarounds that require careful design and don't work in all cases.
There was a talk about exactly this problem and some ideas for addressing
it at Flink Forward a month ago [1]. The slides and video of the talk are
available online [2].

Your idea of initializing keyed state during startup (in the open() method)
doesn't work.
Keyed state is automatically scoped to the key of the currently processed
record.
Since no records are being processed during initialization, one would have
to manually set the key for which the state should be initialized.
The challenge here is that the keys are partitioned / sharded across the
parallel instances, so one would also need to know which keys must be
initialized on which instance. This is not trivial.

Best,
Fabian

[1]
https://sf-2018.flink-forward.org/kb_sessions/bootstrapping-state-in-apache-flink/
[2]
https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink

2018-05-04 19:47 GMT+02:00 Tao Xia <t...@udacity.com>:

> Also would like to know how to do this if it is possible.
>
> On Fri, May 4, 2018 at 9:31 AM, Peter Zende <peter.ze...@gmail.com> wrote:
>
>> Hi,
>>
>> We use RocksDB with FsStateBackend (HDFS) to store state used by the
>> mapWithState operator. Is it possible to initialize / populate this state
>> during the streaming application startup?
>>
>> Our intention is to reprocess the historical data from HDFS in a batch
>> job and save the latest state of the records onto HDFS. Thus when we
>> restart the streaming job we can just build up or load the most recent view
>> of this store.
>>
>> Many thanks,
>> Peter
>>
>
>
