+1

Re examples:

That the examples fail is new in 1.11; it was introduced in https://issues.apache.org/jira/browse/FLINK-16655.

In prior versions, calls to print()/count()/etc. were simply treated as an execute(), whereas with 1.11 we outright fail the submission because these do not work in detached submissions (which jar submissions always are). This is generally /fine/, and may save users some headaches, but we should add this to the release notes and, in a follow-up, ensure a proper error message is shown in the UI (I'll take care of that). At the moment you just get an "Internal Server Error." and have to check the JobManager logs for details.
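
For anyone wondering what trips the new check, here is a minimal sketch (not one of the bundled examples, just an illustration) of such an eager call in the DataSet API:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class EagerCallExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<String> words = env.fromElements("to", "be", "or", "not", "to", "be");
            // count() (like print()/collect()) triggers a job execution on its own and
            // needs the result back in the submitting process, which a detached jar
            // submission cannot provide. That is why 1.11 now rejects the submission.
            long distinctWords = words.distinct().count();
            System.out.println("distinct words: " + distinctWords);
            // Note: there is no env.execute() at all; the eager call takes its place.
        }
    }

From an attached CLI submission this should work fine, which is why the command-line runs pass while the same examples fail through the web UI.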

On 02/07/2020 15:47, Stephan Ewen wrote:
+1 (binding) from my side

   - legal files (license, notice) look correct
   - no binaries in the release
   - ran examples from command line
   - ran some examples from web ui
   - log files look sane
   - RocksDB, incremental checkpoints, savepoints, moving savepoints
all work as expected.

There are some friction points, which have also been mentioned. However, I
am not sure they need to block the release.
   - Some batch examples in the web UI have not been working in 1.10. We
should fix that asap, because it impacts the "getting started" experience,
but I personally don't vote against the release based on that
   - Same for the CDC bug. It is unfortunate, but I would not hold the
release at such a late stage for one special issue in a new connector.
Let's work on a timely 1.11.1.


I would withdraw my vote if we find a fundamental issue in the network
system that causes the increased checkpoint delays behind the job regression
Thomas mentioned.
Such a core bug would be a deal-breaker for a large fraction of users.




On Thu, Jul 2, 2020 at 11:35 AM Zhijiang <wangzhijiang...@aliyun.com.invalid>
wrote:

I also agree with Till and Robert's proposals.

In general I think we should not block the release based on the current
estimation. Otherwise we keep postponing the release, new blocker bugs will
probably surface in the meantime, and we might get stuck in a cycle that
never delivers a final release to users in time. But that does not mean RC4
has to be the final one; we can reevaluate as we go based on the accumulated
issues.

Regarding the performance regression, if possible we can try to reproduce it
based on Thomas's feedback and analyze the cause, then evaluate its effect.
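
To make sure we measure the same thing, below is a rough sketch of the kind of job I would use to reproduce it. The real pipeline is Kinesis to Kinesis; the generator source and discarding sink here are only stand-ins, the checkpoint interval and state backend are assumptions that Thomas would need to confirm, and the RocksDB backend dependency (flink-statebackend-rocksdb) is assumed to be on the classpath:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

    public class CheckpointRegressionReproSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Assumed settings: exactly-once checkpoints every 10s, incremental RocksDB.
            env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
            env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints", true));

            env.generateSequence(0, Long.MAX_VALUE)   // stand-in for the Kinesis source
                .rebalance()                          // network shuffle between the two vertices
                .addSink(new DiscardingSink<>());     // stand-in for the Kinesis sink

            env.execute("checkpoint alignment repro sketch");
        }
    }

Comparing the checkpoint alignment time of the sink vertex before and after the upgrade should tell us whether the regression already shows up in such a stripped-down topology.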

Regarding FLINK-18461, after syncing with Jark offline: the bug affects one
of the three scenarios for using the CDC feature, and the affected scenario
is actually the most commonly used one.

My suggestion is to merge the fix into release-1.11 for now, since the PR is
already open for review, and finalize the conclusion later. If this issue is
the only one remaining after RC4 goes through, another option is to cover it
in the next release, 1.11.1, as Robert suggested, since we can prepare that
minor release soon. If other blocker issues turn up during voting and need
to be resolved soon, then we should of course cover all of them in the next
RC5.
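
For reference, the affected CDC scenario mentioned above is roughly the following pattern. This is only a sketch with illustrative table names and connector options, assuming a Debezium changelog topic in Kafka synced into an Elasticsearch upsert sink:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CdcToUpsertSinkSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

            // Changelog (CDC) source: Debezium change events read from Kafka.
            tEnv.executeSql(
                "CREATE TABLE orders_cdc (order_id BIGINT, amount DOUBLE) WITH ("
                    + " 'connector' = 'kafka',"
                    + " 'topic' = 'orders',"
                    + " 'properties.bootstrap.servers' = 'localhost:9092',"
                    + " 'scan.startup.mode' = 'earliest-offset',"
                    + " 'format' = 'debezium-json')");

            // Upsert sink keyed by order_id, e.g. Elasticsearch 7.
            tEnv.executeSql(
                "CREATE TABLE orders_es (order_id BIGINT, amount DOUBLE,"
                    + " PRIMARY KEY (order_id) NOT ENFORCED) WITH ("
                    + " 'connector' = 'elasticsearch-7',"
                    + " 'hosts' = 'http://localhost:9200',"
                    + " 'index' = 'orders')");

            // This is the pattern FLINK-18461 breaks: writing a changelog source into an upsert sink.
            tEnv.executeSql("INSERT INTO orders_es SELECT order_id, amount FROM orders_cdc");
        }
    }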

Best,
Zhijiang


------------------------------------------------------------------
From:Till Rohrmann <trohrm...@apache.org>
Send Time: July 2, 2020 (Thursday) 16:46
To:dev <dev@flink.apache.org>
Cc:Zhijiang <wangzhijiang...@aliyun.com>
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

I agree with Robert.

@Chesnay: The problem has probably already existed in Flink 1.10 and
before because we cannot run jobs with eager execution calls from the web
ui. I agree with Robert that we can/should improve our documentation in
this regard, though.

@Thomas:
1. I will update the release notes to add a short section describing that
one needs to configure the JobManager memory (a minimal flink-conf.yaml
sketch follows below).
2. Concerning the performance regression we should look into it. I believe
Zhijiang is very eager to learn more about your exact setup to further
debug it. Again I agree with Robert to not block the release on it at the
moment.
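
For the release notes entry on point 1, the gist would be a snippet like this in flink-conf.yaml (the value is only an example; as Thomas noted, either the new jobmanager.memory.heap.size key or the legacy jobmanager.heap.size key can be set):

    # The JobManager memory now has to be configured explicitly, e.g.:
    jobmanager.memory.heap.size: 1024m
    # or, via the legacy option:
    # jobmanager.heap.size: 1024m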

@Jark: How much of a problem is FLINK-18461? Will it make the CDC feature
completely unusable, or will it only make a subset of the use cases not
work? If it is the latter, then I believe that we can document the
limitations and try to fix it asap. Depending on the remaining testing, the
fix might make it into the 1.11.0 or the 1.11.1 release.

Cheers,
Till
On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger <rmetz...@apache.org>
wrote:
Thanks a lot for the thorough testing Thomas! This is really helpful!

  @Chesnay: I would not block the release on this. The web submission does
  not seem to be the documented / preferred way of job submission. It is
  unlikely to harm the beginner's experience (and beginners would not read
  the release notes anyway). I mention the beginner experience because they
  are the primary audience of the examples.

  Regarding FLINK-18461 / Jark's issue: I would not block the release on
  that, but still try to get it fixed asap. It is likely that this RC doesn't
  go through (given the rate at which we are finding issues), and even if it
  goes through, we can document it as a known issue in the release
  announcement and immediately release 1.11.1.
  Blocking the release on this causes quite a bit of work for the release
  managers, who would have to roll a new RC. Until we have understood the
  performance regression Thomas is reporting, I would keep this RC open and
  keep testing.


  On Thu, Jul 2, 2020 at 8:34 AM Jark Wu <imj...@gmail.com> wrote:

  > Hi,
  >
  > I'm very sorry but we just found a blocker issue FLINK-18461 [1] in the
  > new feature of changelog source (CDC).
  > This bug means that queries on a changelog source can't be inserted into
  > an upsert sink (e.g. ES, JDBC, HBase), which is a common case in
  > production. CDC is one of the important features of Table/SQL in this
  > release, so from my side, I hope we can have this fix in 1.11.0;
  > otherwise, this is a broken feature...
  >
  > Again, I am terribly sorry for delaying the release...
  >
  > Best,
  > Jark
  >
  > [1]: https://issues.apache.org/jira/browse/FLINK-18461
  >
  > On Thu, 2 Jul 2020 at 12:02, Zhijiang <wangzhijiang...@aliyun.com.invalid>
  > wrote:
  >
  > > Hi Thomas,
  > >
  > > Thanks for the efficient feedback.
  > >
  > > Regarding the suggestion of adding the release notes document, I agree
  > > with your point. Maybe we should adjust the vote template accordingly in
  > > the respective wiki to guide the following release processes.
  > >
  > > Regarding the performance regression, could you provide some more
  > > details so that we can measure or reproduce it on our side?
  > > E.g. I guess the topology only includes two vertices, source and sink?
  > > What is the parallelism of each vertex?
  > > Does the upstream shuffle data to the downstream via the rebalance
  > > partitioner or something else?
  > > Is the checkpoint mode exactly-once with the RocksDB state backend?
  > > Did backpressure happen in this case?
  > > How large is the regression in percentage terms?
  > >
  > > Best,
  > > Zhijiang
  > >
  > >
  > >
  > > ------------------------------------------------------------------
  > > From:Thomas Weise <t...@apache.org>
  > > Send Time: July 2, 2020 (Thursday) 09:54
  > > To:dev <dev@flink.apache.org>
  > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
  > >
  > > Hi Till,
  > >
  > > Yes, we don't have the setting in flink-conf.yaml.
  > >
  > > Generally, we carry forward the existing configuration and any change to
  > > default configuration values would impact the upgrade.
  > >
  > > Yes, since it is an incompatible change I would state it in the release
  > > notes.
  > >
  > > Thanks,
  > > Thomas
  > >
  > > BTW I found a performance regression while trying to upgrade another
  > > pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't able
  > > to pin it down yet; symptoms include increased checkpoint alignment time.
  > >
  > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <trohrm...@apache.org>
  > > wrote:
  > >
  > > > Hi Thomas,
  > > >
  > > > just to confirm: When starting the image in local mode, then you don't
  > > > have any of the JobManager memory configuration settings configured in
  > > > the effective flink-conf.yaml, right? Does this mean that you have
  > > > explicitly removed `jobmanager.heap.size: 1024m` from the default
  > > > configuration? If this is the case, then I believe it was more of an
  > > > unintentional artifact that it worked before and it has been corrected
  > > > now so that one needs to specify the memory of the JM process
  > > > explicitly. Do you think it would help to explicitly state this in the
  > > > release notes?
  > > >
  > > > Cheers,
  > > > Till
  > > >
  > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <t...@apache.org> wrote:
  > > >
  > > > > Thanks for preparing another RC!
  > > > >
  > > > > As mentioned in the previous RC thread, it would be super helpful if
  > > > > the release notes that are part of the documentation can be included
  > > > > [1]. It's a significant time-saver to have read those first.
  > > > >
  > > > > I found one more non-backward compatible change that would be worth
  > > > > addressing/mentioning:
  > > > >
  > > > > It is now necessary to configure the jobmanager heap size in
  > > > > flink-conf.yaml (with either jobmanager.heap.size or
  > > > > jobmanager.memory.heap.size). Why would I not want to do that
  > > > > anyways? Well, we set it dynamically for a cluster deployment via the
  > > > > flinkk8soperator, but the container image can also be used for
  > > > > testing with local mode (./bin/jobmanager.sh start-foreground local).
  > > > > That will fail if the heap wasn't configured and that's how I
  > > > > noticed it.
  > > > >
  > > > > Thanks,
  > > > > Thomas
  > > > >
  > > > > [1]
  > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
  > > > >
  > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <wangzhijiang...@aliyun.com.invalid>
  > > > > wrote:
  > > > >
  > > > > > Hi everyone,
  > > > > >
  > > > > > Please review and vote on the release candidate #4 for the version
  > > > > > 1.11.0, as follows:
  > > > > > [ ] +1, Approve the release
  > > > > > [ ] -1, Do not approve the release (please provide specific comments)
  > > > > >
  > > > > > The complete staging area is available for your review, which includes:
  > > > > > * JIRA release notes [1],
  > > > > > * the official Apache source release and binary convenience releases to
  > > > > > be deployed to dist.apache.org [2], which are signed with the key with
  > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
  > > > > > * all artifacts to be deployed to the Maven Central Repository [4],
  > > > > > * source code tag "release-1.11.0-rc4" [5],
  > > > > > * website pull request listing the new release and adding announcement
  > > > > > blog post [6].
  > > > > >
  > > > > > The vote will be open for at least 72 hours. It is adopted by majority
  > > > > > approval, with at least 3 PMC affirmative votes.
  > > > > >
  > > > > > Thanks,
  > > > > > Release Manager
  > > > > >
  > > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
  > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
  > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
  > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
  > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
  > > > > > [6] https://github.com/apache/flink-web/pull/352