Re: [VOTE] Release 1.10.0, release candidate #1

Gary Yao Wed, 05 Feb 2020 11:13:19 -0800

It is indeed unfortunate that these issues were discovered only now. I think
Thomas has a valid point, and we might be risking the trust of our users
here.


What are our options?

    1. Thoroughly document this behavior and how to work around it in the
       release notes [1]
    2. Try to restore the previous behavior
    3. Change the default value of jobmanager.scheduler to "legacy" and roll
       out the feature in 1.11
    4. Change the default value of jobmanager.scheduler to "legacy" and roll
       out the feature in 1.10.1 at the earliest
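
For reference, the workaround that option 1 would document, as described in
the quoted discussion below, amounts to roughly the following flink-conf.yaml
fragment (a sketch using only the two keys named in this thread, not a
complete configuration):

```yaml
# flink-conf.yaml: fall back to the pre-1.10 scheduling behavior
jobmanager.scheduler: legacy

# Leave the failover strategy unset, i.e. remove or comment out this line
# if it is present in your existing configuration:
# jobmanager.execution.failover-strategy: region
```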

[1] https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86

On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen wrote:

> Should we make these a blocker? I am not sure - we could also clearly
> state in the release notes how to restore the old behavior, if your setup
> assumes that behavior.
>
> Release candidates for this release have been out since mid-December; it
> is a bit unfortunate that these things have been raised so late.
> Making these rather open-ended tickets (how to redefine the existing
> metrics in the new scheduler/failover handling) release blockers now
> would mean that 1.10 is delayed indefinitely.
>
> Are we sure we want to do that?
>
> On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise wrote:
>
>> Hi Gary,
>>
>> Thanks for the clarification!
>>
>> When we upgrade to a new Flink release, we don't start with a default
>> flink-conf.yaml but upgrade our existing tooling and configuration.
>> Therefore we noticed this issue as part of the upgrade to 1.10, and not
>> when we upgraded to 1.9.
>>
>> I would expect many other users to be in the same camp. Should we
>> therefore consider making these regressions a blocker for 1.10?
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao wrote:
>>
>> > > I also notice that the exception causing a restart is no longer
>> > > displayed in the UI, which is probably related?
>> >
>> > Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
>> > to track this. Moreover, I created a ticket about the uptime metric not
>> > resetting [2]. Both issues already exist in 1.9 if
>> > "jobmanager.execution.failover-strategy" is set to "region", which is the
>> > case in the default flink-conf.yaml.
>> >
>> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
>> > fall back to the previous behavior.
>> >
>> > In 1.10, you can still fall back to the previous behavior by setting
>> > "jobmanager.scheduler: legacy" and unsetting
>> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml.
>> >
>> > I would not consider these issues blockers since there is a workaround for
>> > them, but of course we would like to see the new scheduler getting some
>> > production exposure. More detailed release notes about the caveats of the
>> > new scheduler will be added to the user documentation.
>> >
>> >
>> > > The watermark issue was
>> > > https://issues.apache.org/jira/browse/FLINK-14470
>> >
>> > This should be fixed now [3].
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-15917
>> > [2] https://issues.apache.org/jira/browse/FLINK-15918
>> > [3] https://issues.apache.org/jira/browse/FLINK-8949
>> >
>> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise wrote:
>> >
>> >> Hi Gary,
>> >>
>> >> Thanks for the reply.
>> >>
>> >> -->
>> >>
>> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao wrote:
>> >>
>> >> > Hi Thomas,
>> >> >
>> >> > > 2) Was there a change in how job recovery reflects in the uptime
>> >> > > metric? Didn't uptime previously reset to 0 on recovery (now it just
>> >> > > keeps increasing)?
>> >> >
>> >> > The uptime is the difference between the current time and the time when
>> >> > the job transitioned to the RUNNING state. By default we no longer
>> >> > transition the job out of the RUNNING state when restarting. This is
>> >> > related to the new scheduler, which enables pipelined region failover by
>> >> > default [1]. We actually already enabled pipelined region failover in the
>> >> > binary distribution of Flink 1.9 by setting:
>> >> >
>> >> >     jobmanager.execution.failover-strategy: region
>> >> >
>> >> > in the default flink-conf.yaml. Unless you have removed this config option
>> >> > or you are using a custom yaml, you should be seeing this behavior in Flink
>> >> > 1.9. If you do not want region failover, set
>> >> >
>> >> >     jobmanager.execution.failover-strategy: full
>> >> >
>> >> >
>> >> We are using the default (the jobmanager.execution.failover-strategy
>> >> setting is not present in our flink config).
>> >>
>> >> The change in behavior I see is between the 1.9-based deployment and the
>> >> 1.10 RC.
>> >>
>> >> Our 1.9 branch is here:
>> >> https://github.com/lyft/flink/tree/release-1.9-lyft
>> >>
>> >> I also notice that the exception causing a restart is no longer displayed
>> >> in the UI, which is probably related?
>> >>
>> >>
>> >> >
>> >> > > 1) Is the low watermark display in the UI still broken?
>> >> >
>> >> > I was not aware that this is broken. Is there an issue tracking this
>> >> > bug?
>> >> >
>> >>
>> >> The watermark issue was
>> >> https://issues.apache.org/jira/browse/FLINK-14470
>> >>
>> >> (I don't have a good way to verify it is fixed at the moment.)
>> >>
>> >> Another problem with this 1.10 RC is that the checkpointAlignmentTime
>> >> metric is missing. (I have not been able to investigate this further yet.)
>> >>
>> >>
>> >> >
>> >> > Best,
>> >> > Gary
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/FLINK-14651
>> >> >
>> >> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise wrote:
>> >> >
>> >> >> I opened a PR for FLINK-15868
>> >> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
>> >> >> https://github.com/apache/flink/pull/11006
>> >> >>
>> >> >> With that change, I was able to run an application that consumes from
>> >> >> Kinesis.
>> >> >>
>> >> >> I should have data tomorrow regarding the performance.
>> >> >>
>> >> >> Two questions/observations:
>> >> >>
>> >> >> 1) Is the low watermark display in the UI still broken?
>> >> >> 2) Was there a change in how job recovery reflects in the uptime
>> >> >> metric? Didn't uptime previously reset to 0 on recovery (now it just
>> >> >> keeps increasing)?
>> >> >>
>> >> >> Thanks,
>> >> >> Thomas
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise wrote:
>> >> >>
>> >> >> > I found another issue with the Kinesis connector:
>> >> >> >
>> >> >> > https://issues.apache.org/jira/browse/FLINK-15868
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao wrote:
>> >> >> >
>> >> >> >> Hi everyone,
>> >> >> >>
>> >> >> >> I am hereby canceling the vote due to:
>> >> >> >>
>> >> >> >>     FLINK-15837
>> >> >> >>     FLINK-15840
>> >> >> >>
>> >> >> >> Another RC will be created later today.
>> >> >> >>
>> >> >> >> Best,
>> >> >> >> Gary
>> >> >> >>
>> >> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao wrote:
>> >> >> >>
>> >> >> >> > Hi everyone,
>> >> >> >> > Please review and vote on the release candidate #1 for the version
>> >> >> >> > 1.10.0, as follows:
>> >> >> >> > [ ] +1, Approve the release
>> >> >> >> > [ ] -1, Do not approve the release (please provide specific comments)
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > The complete staging area is available for your review, which includes:
>> >> >> >> > * JIRA release notes [1],
>> >> >> >> > * the official Apache source release and binary convenience releases to
>> >> >> >> >   be deployed to dist.apache.org [2], which are signed with the key with
>> >> >> >> >   fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
>> >> >> >> > * all artifacts to be deployed to the Maven Central Repository [4],
>> >> >> >> > * source code tag "release-1.10.0-rc1" [5].
>> >> >> >> >
>> >> >> >> > The announcement blog post is in the works. I will update this voting
>> >> >> >> > thread with a link to the pull request soon.
>> >> >> >> >
>> >> >> >> > The vote will be open for at least 72 hours. It is adopted by majority
>> >> >> >> > approval, with at least 3 PMC affirmative votes.
>> >> >> >> >
>> >> >> >> > Thanks,
>> >> >> >> > Yu & Gary
>> >> >> >> >
>> >> >> >> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>> >> >> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
>> >> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >> >> >> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1325
>> >> >> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>
