Couldn't we treat a missing option as legacy, but set the new scheduler as
the default value in every newly shipped flink-conf.yaml?

In this way, existing users keep the old behavior (either implicitly or
explicitly) unless they explicitly opt in to the new scheduler.
New users benefit from the new scheduler.
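
To make the proposal concrete, here is a minimal sketch of the resolution
rule I have in mind. It is not Flink's actual code: the function name and the
dict-based config are hypothetical, and "ng" is only my assumption for the
value that selects the new scheduler.

```python
# Hypothetical sketch of the proposed rule; not Flink's implementation.
SCHEDULER_KEY = "jobmanager.scheduler"
LEGACY = "legacy"
NEW = "ng"  # assumed value that selects the new scheduler


def resolve_scheduler(conf: dict) -> str:
    """Return the configured scheduler; a missing option means legacy."""
    return conf.get(SCHEDULER_KEY, LEGACY)


# An existing setup that carries over its old config without the option.
upgraded_conf = {"jobmanager.execution.failover-strategy": "full"}

# A newly shipped flink-conf.yaml that sets the new scheduler explicitly.
shipped_conf = {SCHEDULER_KEY: NEW}
```

With this rule, resolve_scheduler(upgraded_conf) yields the legacy scheduler,
while resolve_scheduler(shipped_conf) yields the new one.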

On Wed, Feb 5, 2020 at 8:13 PM Gary Yao <g...@apache.org> wrote:

> It is indeed unfortunate that these issues are discovered only now. I think
> Thomas has a valid point, and we might be risking the trust of our users
> here.
>
> What are our options?
>
>     1. Document this behavior and how to work around it copiously in the
>        release notes [1]
>     2. Try to restore the previous behavior
>     3. Change the default value of jobmanager.scheduler to "legacy" and roll
>        out the feature in 1.11
>     4. Change the default value of jobmanager.scheduler to "legacy" and roll
>        out the feature in 1.10.1 at the earliest
>
> [1] https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86
>
> On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen <se...@apache.org> wrote:
>
> > Should we make these a blocker? I am not sure - we could also clearly
> > state in the release notes how to restore the old behavior, if your setup
> > assumes that behavior.
> >
> > Release candidates for this release have been out since mid-December; it
> > is a bit unfortunate that these things have been raised so late.
> > Having these rather open-ended tickets (how to redefine the existing
> > metrics in the new scheduler/failover handling) now as release blockers
> > would mean that 1.10 is indefinitely delayed.
> >
> > Are we sure we want to do that?
> >
> > On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise <t...@apache.org> wrote:
> >
> >> Hi Gary,
> >>
> >> Thanks for the clarification!
> >>
> >> When we upgrade to a new Flink release, we don't start with a default
> >> flink-conf.yaml but upgrade our existing tooling and configuration.
> > >> Therefore we notice this issue as part of the upgrade to 1.10, and not
> > >> when we upgraded to 1.9.
> >>
> > >> I would expect many other users to be in the same camp. Should we
> > >> therefore consider making these regressions a blocker for 1.10?
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <g...@apache.org> wrote:
> >>
> > >> > > also notice that the exception causing a restart is no longer
> > >> > > displayed in the UI, which is probably related?
> >> >
> > >> > Yes, this is also related to the new scheduler. I created FLINK-15917
> > >> > [1] to track this. Moreover, I created a ticket about the uptime metric
> > >> > not resetting [2]. Both issues already exist in 1.9 if
> > >> > "jobmanager.execution.failover-strategy" is set to "region", which is
> > >> > the case in the default flink-conf.yaml.
> >> >
> > >> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough
> > >> > to fall back to the previous behavior.
> >> >
> >> > In 1.10, you can still fall back to the previous behavior by setting
> >> > "jobmanager.scheduler: legacy" and unsetting
> >> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml
> >> >
> > >> > I would not consider these issues blockers since there is a workaround
> > >> > for them, but of course we would like to see the new scheduler getting
> > >> > some production exposure. More detailed release notes about the caveats
> > >> > of the new scheduler will be added to the user documentation.
> >> >
> >> >
> > >> > > The watermark issue was
> > >> > > https://issues.apache.org/jira/browse/FLINK-14470
> >> >
> >> > This should be fixed now [3].
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/FLINK-15917
> >> > [2] https://issues.apache.org/jira/browse/FLINK-15918
> >> > [3] https://issues.apache.org/jira/browse/FLINK-8949
> >> >
> >> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <t...@apache.org> wrote:
> >> >
> >> >> Hi Gary,
> >> >>
> >> >> Thanks for the reply.
> >> >>
> >> >> -->
> >> >>
> >> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <g...@apache.org> wrote:
> >> >>
> >> >> > Hi Thomas,
> >> >> >
> > >> >> > > 2) Was there a change in how job recovery reflects in the uptime
> > >> >> > > metric? Didn't uptime previously reset to 0 on recovery (now it
> > >> >> > > just keeps increasing)?
> >> >> >
> > >> >> > The uptime is the difference between the current time and the time
> > >> >> > when the job transitioned to RUNNING state. By default we no longer
> > >> >> > transition the job out of the RUNNING state when restarting. This
> > >> >> > has something to do with the new scheduler which enables pipelined
> > >> >> > region failover by default [1]. Actually we enabled pipelined region
> > >> >> > failover already in the binary distribution of Flink 1.9 by setting:
> >> >> >
> >> >> >     jobmanager.execution.failover-strategy: region
> >> >> >
> > >> >> > in the default flink-conf.yaml. Unless you have removed this config
> > >> >> > option or you are using a custom yaml, you should be seeing this
> > >> >> > behavior in Flink 1.9. If you do not want region failover, set
> >> >> >
> >> >> >     jobmanager.execution.failover-strategy: full
> >> >> >
> >> >> >
> >> >> We are using the default (the jobmanager.execution.failover-strategy
> >> >> setting is not present in our flink config).
> >> >>
> > >> >> The change in behavior I see is between the 1.9 based deployment and
> > >> >> the 1.10 RC.
> >> >>
> >> >> Our 1.9 branch is here:
> >> >> https://github.com/lyft/flink/tree/release-1.9-lyft
> >> >>
> > >> >> I also notice that the exception causing a restart is no longer
> > >> >> displayed in the UI, which is probably related?
> >> >>
> >> >>
> >> >> >
> >> >> > > 1) Is the low watermark display in the UI still broken?
> >> >> >
> > >> >> > I was not aware that this is broken. Is there an issue tracking
> > >> >> > this bug?
> >> >> >
> >> >>
> > >> >> The watermark issue was
> > >> >> https://issues.apache.org/jira/browse/FLINK-14470
> >> >>
> >> >> (I don't have a good way to verify it is fixed at the moment.)
> >> >>
> > >> >> Another problem with this 1.10 RC is that the checkpointAlignmentTime
> > >> >> metric is missing. (I have not been able to investigate this further
> > >> >> yet.)
> >> >>
> >> >>
> >> >> >
> >> >> > Best,
> >> >> > Gary
> >> >> >
> >> >> > [1] https://issues.apache.org/jira/browse/FLINK-14651
> >> >> >
> > >> >> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <t...@apache.org> wrote:
> >> >> >
> >> >> >> I opened a PR for FLINK-15868
> >> >> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
> >> >> >> https://github.com/apache/flink/pull/11006
> >> >> >>
> > >> >> >> With that change, I was able to run an application that consumes
> > >> >> >> from Kinesis.
> >> >> >>
> >> >> >> I should have data tomorrow regarding the performance.
> >> >> >>
> >> >> >> Two questions/observations:
> >> >> >>
> >> >> >> 1) Is the low watermark display in the UI still broken?
> > >> >> >> 2) Was there a change in how job recovery reflects in the uptime
> > >> >> >> metric? Didn't uptime previously reset to 0 on recovery (now it
> > >> >> >> just keeps increasing)?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Thomas
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> > >> >> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <t...@apache.org> wrote:
> >> >> >>
> >> >> >> > I found another issue with the Kinesis connector:
> >> >> >> >
> >> >> >> > https://issues.apache.org/jira/browse/FLINK-15868
> >> >> >> >
> >> >> >> >
> > >> >> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <g...@apache.org> wrote:
> >> >> >> >
> >> >> >> >> Hi everyone,
> >> >> >> >>
> >> >> >> >> I am hereby canceling the vote due to:
> >> >> >> >>
> >> >> >> >>     FLINK-15837
> >> >> >> >>     FLINK-15840
> >> >> >> >>
> >> >> >> >> Another RC will be created later today.
> >> >> >> >>
> >> >> >> >> Best,
> >> >> >> >> Gary
> >> >> >> >>
> > >> >> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <g...@apache.org> wrote:
> >> >> >> >>
> >> >> >> >> > Hi everyone,
> > >> >> >> >> > Please review and vote on the release candidate #1 for the
> > >> >> >> >> > version 1.10.0, as follows:
> > >> >> >> >> > [ ] +1, Approve the release
> > >> >> >> >> > [ ] -1, Do not approve the release (please provide specific
> > >> >> >> >> > comments)
> >> >> >> >> >
> >> >> >> >> >
> > >> >> >> >> > The complete staging area is available for your review, which
> > >> >> >> >> > includes:
> > >> >> >> >> > * JIRA release notes [1],
> > >> >> >> >> > * the official Apache source release and binary convenience
> > >> >> >> >> > releases to be deployed to dist.apache.org [2], which are
> > >> >> >> >> > signed with the key with fingerprint
> > >> >> >> >> > BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> > >> >> >> >> > * all artifacts to be deployed to the Maven Central Repository
> > >> >> >> >> > [4],
> > >> >> >> >> > * source code tag "release-1.10.0-rc1" [5],
> >> >> >> >> >
> > >> >> >> >> > The announcement blog post is in the works. I will update this
> > >> >> >> >> > voting thread with a link to the pull request soon.
> > >> >> >> >> >
> > >> >> >> >> > The vote will be open for at least 72 hours. It is adopted by
> > >> >> >> >> > majority approval, with at least 3 PMC affirmative votes.
> >> >> >> >> >
> >> >> >> >> > Thanks,
> >> >> >> >> > Yu & Gary
> >> >> >> >> >
> >> >> >> >> > [1]
> >> >> >> >> >
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> >> >> >> >> > [2]
> >> >> https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> >> >> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> >> >> >> > [4]
> >> >> >> >>
> >> >>
> https://repository.apache.org/content/repositories/orgapacheflink-1325
> >> >> >> >> > [5]
> >> >> https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>
