Hi,

We’re currently thinking about releasing StateFun 2.2.1, to address a
critical bug that causes restores from checkpoints / savepoints to fail
under certain circumstances [1].

To provide a bit more context, the full fix for this issue is two-fold:

   1. *Fix restoring from checkpoints / savepoints taken with the same
   StateFun version:* this has already been fixed in StateFun, with changes
   backported to `flink-statefun/release-2.2`.
   2. *Allow restoring from older savepoints taken with StateFun <= 2.2.0:*
   this requires a few fixes to Flink around restoring heap-based timers [2]
   and iterating through key groups in restored raw keyed state streams [3].
   These fixes will be included in Flink 1.11.3 [4], meaning that to fix this,
   StateFun will need to wait until Flink 1.11.3 is out and upgrade its Flink
   dependency.

The main discussion point here is whether or not it makes sense for
StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problem,
1) and 2), can be solved together in a single hotfix release.

The other option is to release StateFun 2.2.1 right away with the fix for
problem 1) only, and follow up with another hotfix release, 2.2.2, once
Flink 1.11.3 is available.

I propose to keep a close eye on the progress of Flink 1.11.3 (you can
track progress on the 1.11.3 discussion thread [4]), and *make a decision
here mid-week on Wednesday, Nov. 4th*.
If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
because it could take a while, we can start with a StateFun 2.2.1 RC right
away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
wait for a few more days.

What do you think?

Cheers,
Gordon

[1] https://issues.apache.org/jira/browse/FLINK-19692
[2] https://github.com/apache/flink/pull/13761
[3] https://github.com/apache/flink/pull/13772
[4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
