Couldn't we treat a missing option as legacy, but set the new scheduler as the default value in every newly shipped flink-conf.yaml?
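A minimal sketch of that resolution rule, in Python only for brevity. The value "ng" for the new scheduler and the helper name resolve_scheduler are assumptions made for illustration; this is not Flink's actual configuration code:

```python
from typing import Mapping

LEGACY = "legacy"
NEW_SCHEDULER = "ng"  # assumed value for the new scheduler

def resolve_scheduler(conf: Mapping[str, str]) -> str:
    """Return the scheduler to use; a missing option means legacy."""
    return conf.get("jobmanager.scheduler", LEGACY)

# An existing, carried-over config that never mentions the option.
upgraded_conf = {"jobmanager.execution.failover-strategy": "region"}

# A freshly shipped flink-conf.yaml would set the new scheduler explicitly.
shipped_conf = {"jobmanager.scheduler": NEW_SCHEDULER}

print(resolve_scheduler(upgraded_conf))
print(resolve_scheduler(shipped_conf))
```

The point of the sketch is only that the code-level fallback and the shipped file's value are decided separately.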
In this way, old users get the old behavior (either implicitly or explicitly) unless they explicitly upgrade. New users benefit from the new scheduler.

On Wed, Feb 5, 2020 at 8:13 PM Gary Yao <g...@apache.org> wrote:

> It is indeed unfortunate that these issues are discovered only now. I think
> Thomas has a valid point, and we might be risking the trust of our users
> here.
>
> What are our options?
>
> 1. Document this behavior and how to work around it copiously in the
>    release notes [1]
> 2. Try to restore the previous behavior
> 3. Change default value of jobmanager.scheduler to "legacy" and rollout
>    the feature in 1.11
> 4. Change default value of jobmanager.scheduler to "legacy" and rollout
>    the feature earliest in 1.10.1
>
> [1] https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86
>
> On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen <se...@apache.org> wrote:
>
> > Should we make these a blocker? I am not sure - we could also clearly
> > state in the release notes how to restore the old behavior, if your setup
> > assumes that behavior.
> >
> > Release candidates for this release have been out since mid December, it
> > is a bit unfortunate that these things have been raised so late.
> > Having these rather open ended tickets (how to re-define the existing
> > metrics in the new scheduler/failover handling) now as release blockers
> > would mean that 1.10 is indefinitely delayed.
> >
> > Are we sure we want to do that?
> >
> > On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise <t...@apache.org> wrote:
> >
> >> Hi Gary,
> >>
> >> Thanks for the clarification!
> >>
> >> When we upgrade to a new Flink release, we don't start with a default
> >> flink-conf.yaml but upgrade our existing tooling and configuration.
> >> Therefore we notice this issue as part of the upgrade to 1.10, and not
> >> when we upgraded to 1.9.
> >>
> >> I would expect many other users to be in the same camp, and therefore
> >> consider making these regressions a blocker for 1.10?
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <g...@apache.org> wrote:
> >>
> >> > > also notice that the exception causing a restart is no longer
> >> > > displayed in the UI, which is probably related?
> >> >
> >> > Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
> >> > to track this. Moreover, I created a ticket about the uptime metric not
> >> > resetting [2]. Both issues already exist in 1.9 if
> >> > "jobmanager.execution.failover-strategy" is set to "region", which is
> >> > the case in the default flink-conf.yaml.
> >> >
> >> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
> >> > fall back to the previous behavior.
> >> >
> >> > In 1.10, you can still fall back to the previous behavior by setting
> >> > "jobmanager.scheduler: legacy" and unsetting
> >> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml
> >> >
> >> > I would not consider these issues blockers since there is a workaround
> >> > for them, but of course we would like to see the new scheduler getting
> >> > some production exposure. More detailed release notes about the caveats
> >> > of the new scheduler will be added to the user documentation.
> >> >
> >> >
> >> > > The watermark issue was
> >> > > https://issues.apache.org/jira/browse/FLINK-14470
> >> >
> >> > This should be fixed now [3].
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/FLINK-15917
> >> > [2] https://issues.apache.org/jira/browse/FLINK-15918
> >> > [3] https://issues.apache.org/jira/browse/FLINK-8949
> >> >
> >> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <t...@apache.org> wrote:
> >> >
> >> >> Hi Gary,
> >> >>
> >> >> Thanks for the reply.
> >> >>
> >> >> -->
> >> >>
> >> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <g...@apache.org> wrote:
> >> >>
> >> >> > Hi Thomas,
> >> >> >
> >> >> > > 2) Was there a change in how job recovery reflects in the uptime
> >> >> > > metric? Didn't uptime previously reset to 0 on recovery (now it
> >> >> > > just keeps increasing)?
> >> >> >
> >> >> > The uptime is the difference between the current time and the time
> >> >> > when the job transitioned to RUNNING state. By default we no longer
> >> >> > transition the job out of the RUNNING state when restarting. This has
> >> >> > something to do with the new scheduler which enables pipelined region
> >> >> > failover by default [1]. Actually we enabled pipelined region failover
> >> >> > already in the binary distribution of Flink 1.9 by setting:
> >> >> >
> >> >> >     jobmanager.execution.failover-strategy: region
> >> >> >
> >> >> > in the default flink-conf.yaml. Unless you have removed this config
> >> >> > option or you are using a custom yaml, you should be seeing this
> >> >> > behavior in Flink 1.9. If you do not want region failover, set
> >> >> >
> >> >> >     jobmanager.execution.failover-strategy: full
> >> >> >
> >> >> >
> >> >> We are using the default (the jobmanager.execution.failover-strategy
> >> >> setting is not present in our flink config).
> >> >>
> >> >> The change in behavior I see is between the 1.9 based deployment and
> >> >> the 1.10 RC.
> >> >>
> >> >> Our 1.9 branch is here:
> >> >> https://github.com/lyft/flink/tree/release-1.9-lyft
> >> >>
> >> >> I also notice that the exception causing a restart is no longer
> >> >> displayed in the UI, which is probably related?
> >> >>
> >> >>
> >> >> >
> >> >> > > 1) Is the low watermark display in the UI still broken?
> >> >> >
> >> >> > I was not aware that this is broken. Is there an issue tracking this
> >> >> > bug?
> >> >> >
> >> >>
> >> >> The watermark issue was
> >> >> https://issues.apache.org/jira/browse/FLINK-14470
> >> >>
> >> >> (I don't have a good way to verify it is fixed at the moment.)
> >> >>
> >> >> Another problem with this 1.10 RC is that the checkpointAlignmentTime
> >> >> metric is missing. (I have not been able to investigate this further
> >> >> yet.)
> >> >>
> >> >>
> >> >> >
> >> >> > Best,
> >> >> > Gary
> >> >> >
> >> >> > [1] https://issues.apache.org/jira/browse/FLINK-14651
> >> >> >
> >> >> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <t...@apache.org> wrote:
> >> >> >
> >> >> >> I opened a PR for FLINK-15868
> >> >> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
> >> >> >> https://github.com/apache/flink/pull/11006
> >> >> >>
> >> >> >> With that change, I was able to run an application that consumes
> >> >> >> from Kinesis.
> >> >> >>
> >> >> >> I should have data tomorrow regarding the performance.
> >> >> >>
> >> >> >> Two questions/observations:
> >> >> >>
> >> >> >> 1) Is the low watermark display in the UI still broken?
> >> >> >> 2) Was there a change in how job recovery reflects in the uptime
> >> >> >> metric? Didn't uptime previously reset to 0 on recovery (now it just
> >> >> >> keeps increasing)?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Thomas
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <t...@apache.org> wrote:
> >> >> >>
> >> >> >> > I found another issue with the Kinesis connector:
> >> >> >> >
> >> >> >> > https://issues.apache.org/jira/browse/FLINK-15868
> >> >> >> >
> >> >> >> >
> >> >> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <g...@apache.org> wrote:
> >> >> >> >
> >> >> >> >> Hi everyone,
> >> >> >> >>
> >> >> >> >> I am hereby canceling the vote due to:
> >> >> >> >>
> >> >> >> >> FLINK-15837
> >> >> >> >> FLINK-15840
> >> >> >> >>
> >> >> >> >> Another RC will be created later today.
> >> >> >> >>
> >> >> >> >> Best,
> >> >> >> >> Gary
> >> >> >> >>
> >> >> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <g...@apache.org> wrote:
> >> >> >> >>
> >> >> >> >> > Hi everyone,
> >> >> >> >> > Please review and vote on the release candidate #1 for the
> >> >> >> >> > version 1.10.0, as follows:
> >> >> >> >> > [ ] +1, Approve the release
> >> >> >> >> > [ ] -1, Do not approve the release (please provide specific
> >> >> >> >> > comments)
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > The complete staging area is available for your review, which
> >> >> >> >> > includes:
> >> >> >> >> > * JIRA release notes [1],
> >> >> >> >> > * the official Apache source release and binary convenience
> >> >> >> >> > releases to be deployed to dist.apache.org [2], which are signed
> >> >> >> >> > with the key with fingerprint
> >> >> >> >> > BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> >> >> >> >> > * all artifacts to be deployed to the Maven Central Repository
> >> >> >> >> > [4],
> >> >> >> >> > * source code tag "release-1.10.0-rc1" [5],
> >> >> >> >> >
> >> >> >> >> > The announcement blog post is in the works. I will update this
> >> >> >> >> > voting thread with a link to the pull request soon.
> >> >> >> >> >
> >> >> >> >> > The vote will be open for at least 72 hours. It is adopted by
> >> >> >> >> > majority approval, with at least 3 PMC affirmative votes.
> >> >> >> >> >
> >> >> >> >> > Thanks,
> >> >> >> >> > Yu & Gary
> >> >> >> >> >
> >> >> >> >> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> >> >> >> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> >> >> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> >> >> >> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1325
> >> >> >> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1