Hi all,

I was wondering what approaches people usually take to reprocessing data
with Flink - specifically the case where you want to upgrade a Flink job
and make it reprocess historical data before it continues processing the
live stream.

I'm wondering if we can do something similar to the 'simple rewind' or
'parallel rewind' approaches that Samza uses to solve this problem,
discussed here:
https://samza.apache.org/learn/documentation/0.10/jobs/reprocessing.html

Having used Flink over the past couple of months, the main issue I've had
involves Flink's internal state - in my experience it's easy to break the
state when upgrading a job or changing the parallelism of operators, and
there's no easy way to view or access an internal key-value state from
outside Flink.

For an example of what I mean, consider a Flink job which consumes a stream
of 'updates' to items, and maintains a key-value store of items within
Flink's internal state (e.g. in RocksDB). The job also writes the updated
items to a Kafka topic:

http://oi64.tinypic.com/34q5opf.jpg
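
In code, the job looks roughly like the sketch below (written against the
DataStream API and the Kafka 0.9 connector). The topic names, the
"itemId|payload" record format, and the merge logic are all placeholders
rather than our real code:

import java.util.Properties;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class ItemUpdateJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
        kafkaProps.setProperty("group.id", "item-job");

        // Records are assumed to look like "itemId|payload".
        DataStream<String> updates = env.addSource(
                new FlinkKafkaConsumer09<>("item-updates",
                        new SimpleStringSchema(), kafkaProps));

        DataStream<String> items = updates
                // Key by itemId so each item's state lives with one operator instance.
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String update) {
                        return update.split("\\|", 2)[0];
                    }
                })
                .flatMap(new RichFlatMapFunction<String, String>() {

                    private transient ValueState<String> itemState;

                    @Override
                    public void open(Configuration parameters) {
                        // Keyed state - kept in RocksDB when the RocksDB state
                        // backend is configured for the job.
                        itemState = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("item", String.class, null));
                    }

                    @Override
                    public void flatMap(String update, Collector<String> out)
                            throws Exception {
                        String[] parts = update.split("\\|", 2);
                        // Merge the update into the stored item; merge() is a
                        // placeholder for the real update logic.
                        String merged = merge(itemState.value(), parts[1]);
                        itemState.update(merged);
                        out.collect(parts[0] + "|" + merged);
                    }
                });

        // Write the updated items back out to Kafka.
        items.addSink(new FlinkKafkaProducer09<>("items",
                new SimpleStringSchema(), kafkaProps));

        env.execute("item-update-job");
    }

    // Placeholder merge: last write wins, ignoring the stored item.
    private static String merge(String current, String update) {
        return update;
    }
}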

My worry with this is that the state in RocksDB could be lost or become
incompatible with an updated version of the job. If this happens, we need
to be able to rebuild Flink's internal key-value store in RocksDB. So I'd
like to be able to do something like this (which I believe is the Samza
solution):

http://oi67.tinypic.com/219ri95.jpg
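
For the input side, the closest thing I can see with the Kafka connector is
the sketch below: if the full update history is retained in Kafka (e.g. in a
log-compacted topic), then 'replay then switch over' could just mean starting
the upgraded job's consumer from the earliest offset, so it rebuilds its
state from history and then carries straight on with the live stream. The
topic name and group id are the same placeholders as above:

Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
// Fresh group id, so there are no committed offsets to resume from.
kafkaProps.setProperty("group.id", "item-job-v2");
// With no committed offsets, start from the beginning of the log: the job
// first replays the history (rebuilding its RocksDB state) and then keeps
// consuming the live stream, with no explicit switchover step.
kafkaProps.setProperty("auto.offset.reset", "earliest");

DataStream<String> updates = env.addSource(
        new FlinkKafkaConsumer09<>("item-updates",
                new SimpleStringSchema(), kafkaProps));

That only covers the input though - I guess the upgraded job would also need
to write to a fresh output topic that downstream consumers switch to once
the replay has caught up, which is where the Samza-style parallel rewind
comes in.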

Has anyone done something like this already with Flink? If so, are there any
examples of how to do this replay & switchover (rebuild state by consuming
from a historical log, then switch over to processing the live stream)?

Thanks for any insights,
Josh
