Hi Joern,

Many thanks for sharing the detailed scenario! It is very inspiring.

If I understand correctly, could it be summarized as follows?
1. A batch job first initializes the state, which is then used in stream
mode, and the stream pipeline is different from the batch job.
2. Currently this is implemented by extracting the state and writing it to a
sink, then loading it on startup, but this is inconvenient because of the
additional development effort and possible performance issues (the state is
large).
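
To check that I have the workaround right, here is a minimal sketch of it in
plain Java, without any Flink API. All class and method names
(StateHandoverSketch, VesselTracker, exportState, importState) are
hypothetical and only stand in for the real stateful function and sink; the
comma-separated row format is also just an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateHandoverSketch {

    /** Hypothetical per-key state; in the real job this would be Flink keyed state. */
    static final class VesselState {
        final String name;
        final int updates;

        VesselState(String name, int updates) {
            this.name = name;
            this.updates = updates;
        }
    }

    /** Hypothetical stateful function, standing in for the keyed process function. */
    static final class VesselTracker {
        private final Map<String, VesselState> stateByKey = new HashMap<>();

        /** Processes one update and returns the regular (small) output record. */
        String process(String key, String name) {
            VesselState old = stateByKey.get(key);
            int updates = (old == null) ? 1 : old.updates + 1;
            stateByKey.put(key, new VesselState(name, updates));
            return key + ":" + updates;
        }

        /** Batch side: the extra code that dumps the full state to a separate sink. */
        List<String> exportState() {
            List<String> rows = new ArrayList<>();
            for (Map.Entry<String, VesselState> e : stateByKey.entrySet()) {
                // Simplification: assumes keys and names contain no commas.
                rows.add(e.getKey() + "," + e.getValue().name + "," + e.getValue().updates);
            }
            return rows;
        }

        /** Stream side: the extra code that reloads the dumped state on startup. */
        void importState(List<String> rows) {
            for (String row : rows) {
                String[] fields = row.split(",");
                stateByKey.put(fields[0], new VesselState(fields[1], Integer.parseInt(fields[2])));
            }
        }
    }

    public static void main(String[] args) {
        // Batch phase: build up state, then dump it.
        VesselTracker batch = new VesselTracker();
        batch.process("mmsi-1", "A");
        batch.process("mmsi-1", "A");
        List<String> dumped = batch.exportState();

        // Stream phase: a fresh instance restores the state before processing.
        VesselTracker stream = new VesselTracker();
        stream.importState(dumped);
        System.out.println(stream.process("mmsi-1", "A")); // continues counting from the batch state
    }
}
```

The point of the sketch is that exportState and importState have to be
written and maintained for every stateful function, in addition to the
normal processing logic, and that the full state has to be written out even
though the regular outputs are much smaller.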

We will give this scenario some more thought.

Best,
Yun



 ------------------Original Mail ------------------
Sender:Joern Kottmann <kottm...@gmail.com>
Send Date:Tue Dec 7 16:58:03 2021
Recipients:Yun Gao <yungao...@aliyun.com>
CC:vtygoss <vtyg...@126.com>, Alexander Preuß <alexanderpre...@ververica.com>, 
user@flink.apache.org <user@flink.apache.org>
Subject:Re: Re: Re: how to run streaming process after batch process is completed?

Hello, 

One of the applications Spire [1] uses Flink for is processing AIS [2] data 
collected by our satellites and from other sources. AIS transmits a ship's 
static and dynamic information, such as its name, callsign, or position. 
One of the challenges in processing AIS data is that there are no unique keys, 
since the MMSI or IMO number can be spoofed or is sometimes shared between 
vessels.

To deal with multiple vessels per MMSI, we use a KeyedProcessFunction that 
keeps state per detected vessel. Data about the vessel is stored in the state 
of the function and is hard to transfer out of the batch processing. Batch 
processing really helps to collect data about a vessel and is therefore 
necessary for us before we can switch to stream mode.
Since the state and the outputs are not the same, the state for stream mode 
can't be reconstructed by feeding the outputs into the pipeline via some 
source. Therefore we need code in our batch job just to deal with extracting 
the state.

A vessel is usually emitted for each update that is received for it, but 
emitting it together with its entire state is not desirable for performance 
reasons in batch mode. Also, some vessels should never be emitted but still 
need to be restored.

The pipeline has a couple of stateful functions, and the more we add, the 
harder it gets to restore the state.

Best,
Jörn
