Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

2020-11-04 Thread Tzu-Li (Gordon) Tai
Thanks everyone for the feedback.

I posted an update on the status of Flink 1.11.3 earlier, in its corresponding
discussion thread [1].

From the looks of it, it makes sense to proceed with StateFun
2.2.1 without waiting for Flink 1.11.3.
Since this is also the consensus we've reached here, I have proceeded to
create RC1 for StateFun 2.2.1 [2].

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-StateFun-hotfix-version-2-2-1-td46239.html

On Tue, Nov 3, 2020 at 10:42 PM Robert Metzger wrote:

> Hi Gordon,
> thanks a lot for this clarification.
>
> In this case I would vote for releasing StateFun 2.2.1 asap and not wait
> for 1.11.3.
>
> Thanks a lot for your efforts!
>
>
> On Tue, Nov 3, 2020 at 3:38 PM Tzu-Li (Gordon) Tai 
> wrote:
>
>> Hi Robert,
>>
>> So far we've only seen a single user report the issue, but the severity
>> of FLINK-19692 is actually pretty huge.
>> TL;DR: If a checkpoint / savepoint that contains feedback events (which
>> is considered normal under typical StateFun operations) is attempted to be
>> restored from, the restore would always fail.
>>
>> That's why we came up with the discussion to potentially release a
>> "partial" solution with StateFun 2.2.1 already so that at least there is a
>> StateFun release available that works properly with failure recoveries,
>> and then after that release another follow-up StateFun hotfix release
>> 2.2.2, which would include Flink 1.11.3, to address the remaining part of
>> the problem.
>>
>> BR,
>> Gordon
>>
>> On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger 
>> wrote:
>>
>>> Thanks a lot for starting this thread.
>>> How many users are affected by the problem? Is it somebody else besides
>>> the initial issue reporter?
>>> If it is just one person, I would suggest to rather help pushing the
>>> 1.11.3 release over the line or work on more StateFun features ;)
>>>
>>> On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman wrote:
>>>
>>>> Hi Gordon,
>>>> Thanks for driving this discussion!
>>>>
>>>> I would go with the second suggestion - having two consecutive StateFun
>>>> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release
>>>> might take a while, and this hot-fix release is important enough to get
>>>> out as early as possible.
>>>>
>>>> Cheers,
>>>> Igal.
>>>>
>>>> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
>>>> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > We’re currently thinking about releasing StateFun 2.2.1, to address a
>>>> > critical bug that causes restores from checkpoints / savepoints to fail
>>>> > under certain circumstances [1].
>>>> >
>>>> > To provide a bit more context, the full fix for this issue is two-fold:
>>>> >
>>>> >    1. *Fix restoring from checkpoints / savepoints taken with the same
>>>> >    StateFun version:* this has already been fixed in StateFun, with
>>>> >    changes backported to `flink-statefun/release-2.2`.
>>>> >    2. *Allow restoring from older savepoints taken with StateFun <=
>>>> >    2.2.0:* this requires a few fixes to Flink around restoring heap-based
>>>> >    timers [2] and iterating through key groups in restored raw keyed state
>>>> >    streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning
>>>> >    that to fix this, StateFun will need to wait until Flink 1.11.3 is out
>>>> >    and upgrade its Flink dependency.
>>>> >
>>>> > The main discussion point here is whether or not it makes sense for
>>>> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problems
>>>> > 1) and 2) can be solved together in a single hotfix release.
>>>> >
>>>> > The other option is to release StateFun 2.2.1 already with fixes for
>>>> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
>>>> > Flink 1.11.3 is available.
>>>> >
>>>> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
>>>> > track progress on the 1.11.3 discussion thread [4]), and *make a decision
>>>> > here mid-week on Wednesday, Nov. 4th*.
>>>> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
>>>> > because it could take a while, we can start with a StateFun 2.2.1 RC right
>>>> > away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
>>>> > wait for a few more days.
>>>> >
>>>> > What do you think?
>>>> >
>>>> > Cheers,
>>>> > Gordon
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/FLINK-19692
>>>> > [2] https://github.com/apache/flink/pull/13761
>>>> > [3] https://github.com/apache/flink/pull/13772
>>>> > [4]
>>>> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>>>> >
>>>>
>>>


Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

2020-11-03 Thread Robert Metzger
Hi Gordon,
thanks a lot for this clarification.

In this case I would vote for releasing StateFun 2.2.1 asap and not wait
for 1.11.3.

Thanks a lot for your efforts!


On Tue, Nov 3, 2020 at 3:38 PM Tzu-Li (Gordon) Tai 
wrote:

> Hi Robert,
>
> So far we've only seen a single user report the issue, but the severity of
> FLINK-19692 is actually pretty huge.
> TL;DR: If a checkpoint / savepoint that contains feedback events (which is
> considered normal under typical StateFun operations) is attempted to be
> restored from, the restore would always fail.
>
> That's why we came up with the discussion to potentially release a
> "partial" solution with StateFun 2.2.1 already so that at least there is a
> StateFun release available that works properly with failure recoveries,
> and then after that release another follow-up StateFun hotfix release
> 2.2.2, which would include Flink 1.11.3, to address the remaining part of
> the problem.
>
> BR,
> Gordon
>
> On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger wrote:
>
>> Thanks a lot for starting this thread.
>> How many users are affected by the problem? Is it somebody else besides
>> the initial issue reporter?
>> If it is just one person, I would suggest to rather help pushing the
>> 1.11.3 release over the line or work on more StateFun features ;)
>>
>> On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman wrote:
>>
>>> Hi Gordon,
>>> Thanks for driving this discussion!
>>>
>>> I would go with the second suggestion - having two consecutive StateFun
>>> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release
>>> might take a while, and this hot-fix release is important enough to get
>>> out as early as possible.
>>>
>>> Cheers,
>>> Igal.
>>>
>>>
>>>
>>>
>>> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We’re currently thinking about releasing StateFun 2.2.1, to address a
>>> > critical bug that causes restores from checkpoints / savepoints to fail
>>> > under certain circumstances [1].
>>> >
>>> > To provide a bit more context, the full fix for this issue is two-fold:
>>> >
>>> >    1. *Fix restoring from checkpoints / savepoints taken with the same
>>> >    StateFun version:* this has already been fixed in StateFun, with
>>> >    changes backported to `flink-statefun/release-2.2`.
>>> >    2. *Allow restoring from older savepoints taken with StateFun <=
>>> >    2.2.0:* this requires a few fixes to Flink around restoring heap-based
>>> >    timers [2] and iterating through key groups in restored raw keyed state
>>> >    streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning
>>> >    that to fix this, StateFun will need to wait until Flink 1.11.3 is out
>>> >    and upgrade its Flink dependency.
>>> >
>>> > The main discussion point here is whether or not it makes sense for
>>> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problems
>>> > 1) and 2) can be solved together in a single hotfix release.
>>> >
>>> > The other option is to release StateFun 2.2.1 already with fixes for
>>> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
>>> > Flink 1.11.3 is available.
>>> >
>>> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
>>> > track progress on the 1.11.3 discussion thread [4]), and *make a decision
>>> > here mid-week on Wednesday, Nov. 4th*.
>>> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
>>> > because it could take a while, we can start with a StateFun 2.2.1 RC right
>>> > away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
>>> > wait for a few more days.
>>> >
>>> > What do you think?
>>> >
>>> > Cheers,
>>> > Gordon
>>> >
>>> > [1] https://issues.apache.org/jira/browse/FLINK-19692
>>> > [2] https://github.com/apache/flink/pull/13761
>>> > [3] https://github.com/apache/flink/pull/13772
>>> > [4]
>>> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>>> >
>>>
>>


Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

2020-11-03 Thread Tzu-Li (Gordon) Tai
Hi Robert,

So far we've only seen a single user report the issue, but the severity of
FLINK-19692 is actually pretty huge.
TL;DR: If a checkpoint / savepoint that contains feedback events (which is
considered normal under typical StateFun operations) is attempted to be
restored from, the restore would always fail.

That's why we came up with the discussion to potentially release a
"partial" solution with StateFun 2.2.1 already so that at least there is a
StateFun release available that works properly with failure recoveries,
and then after that release another follow-up StateFun hotfix release
2.2.2, which would include Flink 1.11.3, to address the remaining part of
the problem.

BR,
Gordon

On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger wrote:

> Thanks a lot for starting this thread.
> How many users are affected by the problem? Is it somebody else besides
> the initial issue reporter?
> If it is just one person, I would suggest to rather help pushing the
> 1.11.3 release over the line or work on more StateFun features ;)
>
> On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman wrote:
>
>> Hi Gordon,
>> Thanks for driving this discussion!
>>
>> I would go with the second suggestion - having two consecutive StateFun
>> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release
>> might take a while, and this hot-fix release is important enough to get
>> out as early as possible.
>>
>> Cheers,
>> Igal.
>>
>>
>>
>>
>> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai 
>> wrote:
>>
>> > Hi,
>> >
>> > We’re currently thinking about releasing StateFun 2.2.1, to address a
>> > critical bug that causes restores from checkpoints / savepoints to fail
>> > under certain circumstances [1].
>> >
>> > To provide a bit more context, the full fix for this issue is two-fold:
>> >
>> >    1. *Fix restoring from checkpoints / savepoints taken with the same
>> >    StateFun version:* this has already been fixed in StateFun, with
>> >    changes backported to `flink-statefun/release-2.2`.
>> >    2. *Allow restoring from older savepoints taken with StateFun <=
>> >    2.2.0:* this requires a few fixes to Flink around restoring heap-based
>> >    timers [2] and iterating through key groups in restored raw keyed state
>> >    streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning
>> >    that to fix this, StateFun will need to wait until Flink 1.11.3 is out
>> >    and upgrade its Flink dependency.
>> >
>> > The main discussion point here is whether or not it makes sense for
>> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problems
>> > 1) and 2) can be solved together in a single hotfix release.
>> >
>> > The other option is to release StateFun 2.2.1 already with fixes for
>> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
>> > Flink 1.11.3 is available.
>> >
>> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
>> > track progress on the 1.11.3 discussion thread [4]), and *make a decision
>> > here mid-week on Wednesday, Nov. 4th*.
>> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
>> > because it could take a while, we can start with a StateFun 2.2.1 RC right
>> > away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
>> > wait for a few more days.
>> >
>> > What do you think?
>> >
>> > Cheers,
>> > Gordon
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-19692
>> > [2] https://github.com/apache/flink/pull/13761
>> > [3] https://github.com/apache/flink/pull/13772
>> > [4]
>> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>> >
>>
>


Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

2020-11-03 Thread Robert Metzger
Thanks a lot for starting this thread.
How many users are affected by the problem? Is it somebody else besides the
initial issue reporter?
If it is just one person, I would suggest to rather help pushing the 1.11.3
release over the line or work on more StateFun features ;)

On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman wrote:

> Hi Gordon,
> Thanks for driving this discussion!
>
> I would go with the second suggestion - having two consecutive StateFun
> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release
> might take a while, and this hot-fix release is important enough to get out
> as early as possible.
>
> Cheers,
> Igal.
>
>
>
>
> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai 
> wrote:
>
> > Hi,
> >
> > We’re currently thinking about releasing StateFun 2.2.1, to address a
> > critical bug that causes restores from checkpoints / savepoints to fail
> > under certain circumstances [1].
> >
> > To provide a bit more context, the full fix for this issue is two-fold:
> >
> >    1. *Fix restoring from checkpoints / savepoints taken with the same
> >    StateFun version:* this has already been fixed in StateFun, with
> >    changes backported to `flink-statefun/release-2.2`.
> >    2. *Allow restoring from older savepoints taken with StateFun <=
> >    2.2.0:* this requires a few fixes to Flink around restoring heap-based
> >    timers [2] and iterating through key groups in restored raw keyed state
> >    streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning
> >    that to fix this, StateFun will need to wait until Flink 1.11.3 is out
> >    and upgrade its Flink dependency.
> >
> > The main discussion point here is whether or not it makes sense for
> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problems
> > 1) and 2) can be solved together in a single hotfix release.
> >
> > The other option is to release StateFun 2.2.1 already with fixes for
> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
> > Flink 1.11.3 is available.
> >
> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
> > track progress on the 1.11.3 discussion thread [4]), and *make a decision
> > here mid-week on Wednesday, Nov. 4th*.
> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
> > because it could take a while, we can start with a StateFun 2.2.1 RC right
> > away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
> > wait for a few more days.
> >
> > What do you think?
> >
> > Cheers,
> > Gordon
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-19692
> > [2] https://github.com/apache/flink/pull/13761
> > [3] https://github.com/apache/flink/pull/13772
> > [4]
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
> >
>


Re: [DISCUSS] Releasing StateFun hotfix version 2.2.1

2020-11-03 Thread Igal Shilman
Hi Gordon,
Thanks for driving this discussion!

I would go with the second suggestion - having two consecutive StateFun
releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release
might take a while, and this hot-fix release is important enough to get out
as early as possible.

Cheers,
Igal.




On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai 
wrote:

> Hi,
>
> We’re currently thinking about releasing StateFun 2.2.1, to address a
> critical bug that causes restores from checkpoints / savepoints to fail
> under certain circumstances [1].
>
> To provide a bit more context, the full fix for this issue is two-fold:
>
>    1. *Fix restoring from checkpoints / savepoints taken with the same
>    StateFun version:* this has already been fixed in StateFun, with
>    changes backported to `flink-statefun/release-2.2`.
>    2. *Allow restoring from older savepoints taken with StateFun <=
>    2.2.0:* this requires a few fixes to Flink around restoring heap-based
>    timers [2] and iterating through key groups in restored raw keyed state
>    streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning
>    that to fix this, StateFun will need to wait until Flink 1.11.3 is out
>    and upgrade its Flink dependency.
>
> The main discussion point here is whether or not it makes sense for
> StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problems
> 1) and 2) can be solved together in a single hotfix release.
>
> The other option is to release StateFun 2.2.1 already with fixes for
> problem 1) only, and have another follow-up hotfix release 2.2.2 after
> Flink 1.11.3 is available.
>
> I propose to keep a close eye on the progress of Flink 1.11.3 (you can
> track progress on the 1.11.3 discussion thread [4]), and *make a decision
> here mid-week on Wednesday, Nov. 4th*.
> If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
> because it could take a while, we can start with a StateFun 2.2.1 RC right
> away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can
> wait for a few more days.
>
> What do you think?
>
> Cheers,
> Gordon
>
> [1] https://issues.apache.org/jira/browse/FLINK-19692
> [2] https://github.com/apache/flink/pull/13761
> [3] https://github.com/apache/flink/pull/13772
> [4]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
>