Hi Robert,

So far we've only seen a single user report the issue, but the severity of
FLINK-19692 is actually pretty huge.
TL;DR: Restoring from a checkpoint / savepoint that contains feedback events
(which is normal under typical StateFun operation) always fails.

That's why we started the discussion about releasing a "partial" solution
with StateFun 2.2.1 right away, so that at least there is a StateFun release
available that works properly with failure recovery, and then following up
with another hotfix release, StateFun 2.2.2, which would include Flink
1.11.3, to address the remaining part of the problem.

BR,
Gordon

On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger <rmetz...@apache.org> wrote:

> Thanks a lot for starting this thread.
> How many users are affected by the problem? Is it somebody else besides
> the initial issue reporter?
> If it is just one person, I would rather suggest helping to push the
> 1.11.3 release over the line or working on more StateFun features ;)
>
> On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman <i...@ververica.com> wrote:
>
>> Hi Gordon,
>> Thanks for driving this discussion!
>>
>> I would go with the second suggestion - having two consecutive StateFun
>> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release might take a
>> while, and this hot-fix release is important enough to get out as early
>> as possible.
>>
>> Cheers,
>> Igal.
>>
>> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
>> wrote:
>>
>> > Hi,
>> >
>> > We’re currently thinking about releasing StateFun 2.2.1, to address a
>> > critical bug that causes restores from checkpoints / savepoints to fail
>> > under certain circumstances [1].
>> >
>> > To provide a bit more context, the full fix for this issue is two-fold:
>> >
>> >    1. *Fix restoring from checkpoints / savepoints taken with the same
>> >    StateFun version:* this has already been fixed in StateFun, with
>> >    changes backported to `flink-statefun/release-2.2`.
>> >    2. *Allow restoring from older savepoints taken with StateFun <=
>> >    2.2.0:* this requires a few fixes to Flink around restoring
>> >    heap-based timers [2] and iterating through key groups in restored
>> >    raw keyed state streams [3]. These fixes will be included in Flink
>> >    1.11.3 [4], meaning that to fix this, StateFun will need to wait
>> >    until Flink 1.11.3 is out and upgrade its Flink dependency.
>> >
>> > The main discussion point here is whether or not it makes sense for
>> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the
>> > problem, 1) and 2), can be solved together in a single hotfix release.
>> >
>> > The other option is to release StateFun 2.2.1 already with fixes for
>> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
>> > Flink 1.11.3 is available.
>> >
>> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
>> > track progress on the 1.11.3 discussion thread [4]), and *make a
>> > decision here mid-week on Wednesday, Nov. 4th*.
>> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
>> > because it could take a while, we can start with a StateFun 2.2.1 RC
>> > right away; otherwise, if Flink 1.11.3 seems to be just around the
>> > corner, we can wait for a few more days.
>> >
>> > What do you think?
>> >
>> > Cheers,
>> > Gordon
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-19692
>> > [2] https://github.com/apache/flink/pull/13761
>> > [3] https://github.com/apache/flink/pull/13772
>> > [4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>> >
>>
>
