Hi Gordon,
thanks a lot for this clarification.

In this case I would vote for releasing StateFun 2.2.1 asap and not waiting
for Flink 1.11.3.

Thanks a lot for your efforts!


On Tue, Nov 3, 2020 at 3:38 PM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
wrote:

> Hi Robert,
>
> So far we've only seen a single user report the issue, but the severity of
> FLINK-19692 is actually pretty huge.
> TL;DR: Any attempt to restore from a checkpoint / savepoint that contains
> feedback events (which is considered normal under typical StateFun
> operations) will always fail.
>
> That's why we came up with the discussion to potentially release a
> "partial" solution with StateFun 2.2.1 already so that at least there is a
> StateFun release available that works properly with failure recoveries,
> and then after that release another follow-up StateFun hotfix release
> 2.2.2, which would include Flink 1.11.3, to address the remaining part of
> the problem.
>
> BR,
> Gordon
>
> On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Thanks a lot for starting this thread.
>> How many users are affected by the problem? Is it somebody else besides
>> the initial issue reporter?
>> If it is just one person, I would suggest rather helping to push the
>> 1.11.3 release over the line or working on more StateFun features ;)
>>
>> On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman <i...@ververica.com> wrote:
>>
>>> Hi Gordon,
>>> Thanks for driving this discussion!
>>>
>>> I would go with the second suggestion - having two consecutive StateFun
>>> releases 2.2.1 and 2.2.2, since the Flink 1.11.3 release might take a
>>> while, and this hotfix release is important enough to get out as early
>>> as possible.
>>>
>>> Cheers,
>>> Igal.
>>>
>>>
>>>
>>>
>>> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We’re currently thinking about releasing StateFun 2.2.1, to address a
>>> > critical bug that causes restores from checkpoints / savepoints to fail
>>> > under certain circumstances [1].
>>> >
>>> > To provide a bit more context, the full fix for this issue is two-fold:
>>> >
>>> >    1. *Fix restoring from checkpoints / savepoints taken with the same
>>> >    StateFun version:* this has already been fixed in StateFun, with
>>> >    changes backported to `flink-statefun/release-2.2`.
>>> >    2. *Allow restoring from older savepoints taken with StateFun <=
>>> >    2.2.0:* this requires a few fixes to Flink around restoring
>>> >    heap-based timers [2] and iterating through key groups in restored
>>> >    raw keyed state streams [3]. These fixes will be included in Flink
>>> >    1.11.3 [4], meaning that to fix this, StateFun will need to wait
>>> >    until Flink 1.11.3 is out and upgrade its Flink dependency.
>>> >
>>> > The main discussion point here is whether or not it makes sense for
>>> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the
>>> > problem, 1) and 2), can be solved together in a single hotfix release.
>>> >
>>> > The other option is to release StateFun 2.2.1 already with fixes for
>>> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
>>> > Flink 1.11.3 is available.
>>> >
>>> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
>>> > track progress on the 1.11.3 discussion thread [4]), and *make a
>>> > decision here mid-week on Wednesday, Nov. 4th*.
>>> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
>>> > because it could take a while, we can start with a StateFun 2.2.1 RC
>>> > right away; otherwise, if Flink 1.11.3 seems to be just around the
>>> > corner, we can wait for a few more days.
>>> >
>>> > What do you think?
>>> >
>>> > Cheers,
>>> > Gordon
>>> >
>>> > [1] https://issues.apache.org/jira/browse/FLINK-19692
>>> > [2] https://github.com/apache/flink/pull/13761
>>> > [3] https://github.com/apache/flink/pull/13772
>>> > [4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>>> >
>>>
>>
