Hi,

We’re currently thinking about releasing StateFun 2.2.1 to address a critical bug that causes restores from checkpoints / savepoints to fail under certain circumstances [1].

To provide a bit more context, the full fix for this issue is two-fold:

1. *Fix restoring from checkpoints / savepoints taken with the same StateFun version:* this has already been fixed in StateFun, with changes backported to `flink-statefun/release-2.2`.
2. *Allow restoring from older savepoints taken with StateFun <= 2.2.0:* this requires a few fixes to Flink around restoring heap-based timers [2] and iterating through key groups in restored raw keyed state streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning that to fix this, StateFun will need to wait until Flink 1.11.3 is out and then upgrade its Flink dependency.

The main discussion point here is whether it makes sense for StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problem, 1) and 2), can be solved together in a single hotfix release. The other option is to release StateFun 2.2.1 now with the fix for problem 1) only, and follow up with another hotfix release, 2.2.2, once Flink 1.11.3 is available.

I propose that we keep a close eye on the progress of Flink 1.11.3 (you can track it on the 1.11.3 discussion thread [4]), and *make a decision here mid-week on Wednesday, Nov. 4th*. If by then we decide not to let StateFun 2.2.1 wait for Flink 1.11.3 because it could take a while, we can start with a StateFun 2.2.1 RC right away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can wait a few more days.

What do you think?

Cheers,
Gordon

[1] https://issues.apache.org/jira/browse/FLINK-19692
[2] https://github.com/apache/flink/pull/13761
[3] https://github.com/apache/flink/pull/13772
[4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
