Re: [VOTE] Release 1.11.0, release candidate #4

2020-06-30 Thread Chesnay Schepler

- source does not contain binaries
- started a local cluster, logs are fine, examples run
- web submission works _in general_

However, a number of batch examples fail when submitted through the 
WebUI with the following error:


Caused by: org.apache.flink.api.common.InvalidProgramException:
Job was submitted in detached mode. Results of job execution, such as 
accumulators, runtime, etc. are not available.
Please make sure your program doesn't call an eager execution function 
[collect, print, printToErr, count].


I could not find mention of this in the release notes (nor in 1.10; not 
quite sure when this change was introduced...).


IIRC this change was intentional, and it isn't necessarily a deal 
breaker, but we should ensure that our examples are compatible with all 
submission methods.
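
For illustration, a minimal DataSet sketch (not one of the actual Flink examples; the class name and data are made up) of the pattern that triggers this error when submitted through the WebUI:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class EagerExampleSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> words = env.fromElements("to", "be", "or", "not", "to", "be");

        // print() is an "eager execution function": it submits the job and then
        // fetches the result back to the client. A WebUI submission is always
        // detached, so there is nowhere to deliver the result, and the
        // InvalidProgramException quoted above is thrown.
        words.print();
    }
}
```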


I'm still undecided as to whether to block the release on it.

On 30/06/2020 12:17, Zhijiang wrote:

Hi everyone,

Please review and vote on the release candidate #4 for the version 1.11.0, as 
follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be 
deployed to dist.apache.org [2], which are signed with the key with fingerprint 
2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.11.0-rc4" [5],
* website pull request listing the new release and adding announcement blog 
post [6].

The vote will be open for at least 72 hours. It is adopted by majority 
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager

[1] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
[5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
[6] https://github.com/apache/flink-web/pull/352





Re: [VOTE] Release 1.11.0, release candidate #4

2020-06-30 Thread Thomas Weise
Thanks for preparing another RC!

As mentioned in the previous RC thread, it would be super helpful if the
release notes that are part of the documentation could be included [1]. It's
a significant time-saver to read those first.

I found one more non-backward compatible change that would be worth
addressing/mentioning:

It is now necessary to configure the jobmanager heap size in
flink-conf.yaml (with either jobmanager.heap.size
or jobmanager.memory.heap.size). Why would I not want to do that anyway?
Well, we set it dynamically for a cluster deployment via the
flinkk8soperator, but the container image can also be used for testing with
local mode (./bin/jobmanager.sh start-foreground local). That will fail if
the heap wasn't configured, and that's how I noticed it.
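
For reference, a minimal hypothetical flink-conf.yaml excerpt with the two options mentioned above (the 1024m value is only a placeholder):

```yaml
# Hypothetical excerpt: set one of these so the JobManager can start.

# Legacy option:
jobmanager.heap.size: 1024m

# Or the newer option:
# jobmanager.memory.heap.size: 1024m
```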

Thanks,
Thomas

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html

On Tue, Jun 30, 2020 at 3:18 AM Zhijiang 
wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #4 for the version 1.11.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-1.11.0-rc4" [5],
> * website pull request listing the new release and adding announcement
> blog post [6].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> [6] https://github.com/apache/flink-web/pull/352
>
>


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-01 Thread Till Rohrmann
Hi Thomas,

just to confirm: when starting the image in local mode, you don't have
any of the JobManager memory configuration settings in the effective
flink-conf.yaml, right? Does this mean that you have explicitly
removed `jobmanager.heap.size: 1024m` from the default configuration? If
this is the case, then I believe it was more of an unintentional artifact
that it worked before, and it has now been corrected so that one needs to
specify the memory of the JM process explicitly. Do you think it would help
to state this explicitly in the release notes?

Cheers,
Till

On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise  wrote:

> Thanks for preparing another RC!
>
> As mentioned in the previous RC thread, it would be super helpful if the
> release notes that are part of the documentation can be included [1]. It's
> a significant time-saver to have read those first.
>
> I found one more non-backward compatible change that would be worth
> addressing/mentioning:
>
> It is now necessary to configure the jobmanager heap size in
> flink-conf.yaml (with either jobmanager.heap.size
> or jobmanager.memory.heap.size). Why would I not want to do that anyways?
> Well, we set it dynamically for a cluster deployment via the
> flinkk8soperator, but the container image can also be used for testing with
> local mode (./bin/jobmanager.sh start-foreground local). That will fail if
> the heap wasn't configured and that's how I noticed it.
>
> Thanks,
> Thomas
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>
> On Tue, Jun 30, 2020 at 3:18 AM Zhijiang  .invalid>
> wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #4 for the version
> 1.11.0,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-1.11.0-rc4" [5],
> > * website pull request listing the new release and adding announcement
> > blog post [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> > [1]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> > https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > [6] https://github.com/apache/flink-web/pull/352
> >
> >
>


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-01 Thread Thomas Weise
Hi Till,

Yes, we don't have the setting in flink-conf.yaml.

Generally, we carry forward the existing configuration, so any change to
default configuration values impacts the upgrade.

Yes, since it is an incompatible change, I would state it in the release
notes.

Thanks,
Thomas

BTW, I found a performance regression while trying to upgrade another
pipeline with this RC. It is a simple Kinesis-to-Kinesis job. I wasn't able
to pin it down yet; symptoms include increased checkpoint alignment time.

On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann  wrote:

> Hi Thomas,
>
> just to confirm: When starting the image in local mode, then you don't have
> any of the JobManager memory configuration settings configured in the
> effective flink-conf.yaml, right? Does this mean that you have explicitly
> removed `jobmanager.heap.size: 1024m` from the default configuration? If
> this is the case, then I believe it was more of an unintentional artifact
> that it worked before and it has been corrected now so that one needs to
> specify the memory of the JM process explicitly. Do you think it would help
> to explicitly state this in the release notes?
>
> Cheers,
> Till
>
> On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise  wrote:
>
> > Thanks for preparing another RC!
> >
> > As mentioned in the previous RC thread, it would be super helpful if the
> > release notes that are part of the documentation can be included [1].
> It's
> > a significant time-saver to have read those first.
> >
> > I found one more non-backward compatible change that would be worth
> > addressing/mentioning:
> >
> > It is now necessary to configure the jobmanager heap size in
> > flink-conf.yaml (with either jobmanager.heap.size
> > or jobmanager.memory.heap.size). Why would I not want to do that anyways?
> > Well, we set it dynamically for a cluster deployment via the
> > flinkk8soperator, but the container image can also be used for testing
> with
> > local mode (./bin/jobmanager.sh start-foreground local). That will fail
> if
> > the heap wasn't configured and that's how I noticed it.
> >
> > Thanks,
> > Thomas
> >
> > [1]
> >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> >
> > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang  > .invalid>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #4 for the version
> > 1.11.0,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > The complete staging area is available for your review, which includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "release-1.11.0-rc4" [5],
> > > * website pull request listing the new release and adding announcement
> > > blog post [6].
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > Release Manager
> > >
> > > [1]
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [4]
> > >
> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > [6] https://github.com/apache/flink-web/pull/352
> > >
> > >
> >
>


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-01 Thread Zhijiang
Hi Thomas,

Thanks for the prompt feedback.

Regarding the suggestion of adding the release notes document, I agree with
your point. Maybe we should adjust the vote template accordingly in the
respective wiki to guide future release processes.

Regarding the performance regression, could you provide some more details so
we can measure or reproduce it on our side? E.g. (a sketch of such a setup
follows below):
Does the topology only include two vertices, source and sink?
What is the parallelism of each vertex?
Does the upstream shuffle data to the downstream via the rebalance partitioner or something else?
Is the checkpointing mode exactly-once with the RocksDB state backend?
Did backpressure occur in this case?
How much of a regression, percentage-wise?
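
For context, here is a hypothetical sketch of the kind of job these questions describe (this is not Thomas's actual pipeline; stream names, region, parallelism, and checkpoint settings are invented placeholders):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

import java.util.Properties;

public class KinesisPassThroughSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The dimensions asked about above (all values are placeholders):
        env.setParallelism(4);                                            // parallelism per vertex
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // exactly-once checkpoints
        env.setStateBackend(new RocksDBStateBackend("s3://bucket/checkpoints", true));

        Properties props = new Properties();
        props.setProperty(ConsumerConfigConstants.AWS_REGION, "us-west-2");

        FlinkKinesisProducer<String> producer =
                new FlinkKinesisProducer<>(new SimpleStringSchema(), props);
        producer.setDefaultStream("output-stream");
        producer.setDefaultPartition("0");

        // Source vertex -> (rebalance) -> sink vertex, i.e. a two-vertex topology.
        env.addSource(new FlinkKinesisConsumer<>("input-stream", new SimpleStringSchema(), props))
           .rebalance()
           .addSink(producer);

        env.execute("kinesis-to-kinesis");
    }
}
```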

Best,
Zhijiang



--
From:Thomas Weise 
Send Time: Thu, Jul 2, 2020, 09:54
To:dev 
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

Hi Till,

Yes, we don't have the setting in flink-conf.yaml.

Generally, we carry forward the existing configuration and any change to
default configuration values would impact the upgrade.

Yes, since it is an incompatible change I would state it in the release
notes.

Thanks,
Thomas

BTW I found a performance regression while trying to upgrade another
pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't able
to pin it down yet, symptoms include increased checkpoint alignment time.

On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann  wrote:

> Hi Thomas,
>
> just to confirm: When starting the image in local mode, then you don't have
> any of the JobManager memory configuration settings configured in the
> effective flink-conf.yaml, right? Does this mean that you have explicitly
> removed `jobmanager.heap.size: 1024m` from the default configuration? If
> this is the case, then I believe it was more of an unintentional artifact
> that it worked before and it has been corrected now so that one needs to
> specify the memory of the JM process explicitly. Do you think it would help
> to explicitly state this in the release notes?
>
> Cheers,
> Till
>
> On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise  wrote:
>
> > Thanks for preparing another RC!
> >
> > As mentioned in the previous RC thread, it would be super helpful if the
> > release notes that are part of the documentation can be included [1].
> It's
> > a significant time-saver to have read those first.
> >
> > I found one more non-backward compatible change that would be worth
> > addressing/mentioning:
> >
> > It is now necessary to configure the jobmanager heap size in
> > flink-conf.yaml (with either jobmanager.heap.size
> > or jobmanager.memory.heap.size). Why would I not want to do that anyways?
> > Well, we set it dynamically for a cluster deployment via the
> > flinkk8soperator, but the container image can also be used for testing
> with
> > local mode (./bin/jobmanager.sh start-foreground local). That will fail
> if
> > the heap wasn't configured and that's how I noticed it.
> >
> > Thanks,
> > Thomas
> >
> > [1]
> >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> >
> > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang  > .invalid>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #4 for the version
> > 1.11.0,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > The complete staging area is available for your review, which includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "release-1.11.0-rc4" [5],
> > > * website pull request listing the new release and adding announcement
> > > blog post [6].
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > Release Manager
> > >
> > > [1]
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [4]
> > >
> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > [6] https://github.com/apache/flink-web/pull/352
> > >
> > >
> >
>



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-01 Thread Jark Wu
Hi,

I'm very sorry, but we just found a blocker issue, FLINK-18461 [1], in the
new changelog source (CDC) feature.
This bug means that queries on a changelog source can't be inserted into an
upsert sink (e.g. ES, JDBC, HBase),
which is a common case in production. CDC is one of the important features
of Table/SQL in this release,
so from my side I hope we can have this fix in 1.11.0; otherwise this is
a broken feature...

Again, I am terribly sorry for delaying the release...

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-18461
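
For readers unfamiliar with the feature, a hypothetical Flink SQL sketch of the affected pattern (table definitions, topic, and endpoints are made up for illustration):

```sql
-- A changelog (CDC) source: a Kafka topic carrying Debezium change events.
CREATE TABLE orders_cdc (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'debezium-json'
);

-- An upsert sink: Elasticsearch, keyed by the declared primary key.
CREATE TABLE orders_es (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'elasticsearch-7',
  'hosts' = 'http://localhost:9200',
  'index' = 'orders'
);

-- Per the FLINK-18461 description, this kind of INSERT is what breaks in RC4:
INSERT INTO orders_es SELECT order_id, amount FROM orders_cdc;
```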

On Thu, 2 Jul 2020 at 12:02, Zhijiang 
wrote:

> Hi Thomas,
>
> Thanks for the efficient feedback.
>
> Regarding the suggestion of adding the release notes document, I agree
> with your point. Maybe we should adjust the vote template accordingly in
> the respective wiki to guide the following release processes.
>
> Regarding the performance regression, could you provide some more details
> for our better measurement or reproducing on our sides?
> E.g. I guess the topology only includes two vertexes source and sink?
> What is the parallelism for every vertex?
> The upstream shuffles data to the downstream via rebalance partitioner or
> other?
> The checkpoint mode is exactly-once with rocksDB state backend?
> The backpressure happened in this case?
> How much percentage regression in this case?
>
> Best,
> Zhijiang
>
>
>
> --
> From:Thomas Weise 
> Send Time: Thu, Jul 2, 2020, 09:54
> To:dev 
> Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>
> Hi Till,
>
> Yes, we don't have the setting in flink-conf.yaml.
>
> Generally, we carry forward the existing configuration and any change to
> default configuration values would impact the upgrade.
>
> Yes, since it is an incompatible change I would state it in the release
> notes.
>
> Thanks,
> Thomas
>
> BTW I found a performance regression while trying to upgrade another
> pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't able
> to pin it down yet, symptoms include increased checkpoint alignment time.
>
> On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann 
> wrote:
>
> > Hi Thomas,
> >
> > just to confirm: When starting the image in local mode, then you don't
> have
> > any of the JobManager memory configuration settings configured in the
> > effective flink-conf.yaml, right? Does this mean that you have explicitly
> > removed `jobmanager.heap.size: 1024m` from the default configuration? If
> > this is the case, then I believe it was more of an unintentional artifact
> > that it worked before and it has been corrected now so that one needs to
> > specify the memory of the JM process explicitly. Do you think it would
> help
> > to explicitly state this in the release notes?
> >
> > Cheers,
> > Till
> >
> > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise  wrote:
> >
> > > Thanks for preparing another RC!
> > >
> > > As mentioned in the previous RC thread, it would be super helpful if
> the
> > > release notes that are part of the documentation can be included [1].
> > It's
> > > a significant time-saver to have read those first.
> > >
> > > I found one more non-backward compatible change that would be worth
> > > addressing/mentioning:
> > >
> > > It is now necessary to configure the jobmanager heap size in
> > > flink-conf.yaml (with either jobmanager.heap.size
> > > or jobmanager.memory.heap.size). Why would I not want to do that
> anyways?
> > > Well, we set it dynamically for a cluster deployment via the
> > > flinkk8soperator, but the container image can also be used for testing
> > with
> > > local mode (./bin/jobmanager.sh start-foreground local). That will fail
> > if
> > > the heap wasn't configured and that's how I noticed it.
> > >
> > > Thanks,
> > > Thomas
> > >
> > > [1]
> > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > >
> > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang  > > .invalid>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #4 for the version
> > > 1.11.0,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "release-1.11.0-rc4" [5],
> > > > * website pull request listing the new release and adding
> announcement
> > > > blog post [6].
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> 

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Robert Metzger
Thanks a lot for the thorough testing, Thomas! This is really helpful!

@Chesnay: I would not block the release on this. The web submission does
not seem to be the documented / preferred way of job submission. It is
unlikely to harm the beginner's experience (and beginners would anyway not
read the release notes). I mention the beginner experience because they are
the primary audience of the examples.

Regarding FLINK-18461 / Jark's issue: I would not block the release on
that, but still try to get it fixed asap. It is likely that this RC won't
go through (given the rate at which we are finding issues), and even if it
does, we can document the bug as a known issue in the release
announcement and immediately release 1.11.1.
Blocking the release on this would cause quite a bit of work for the release
managers, who would have to roll a new RC. Until we have understood the
performance regression Thomas is reporting, I would keep this RC open and
keep testing.


On Thu, Jul 2, 2020 at 8:34 AM Jark Wu  wrote:

> Hi,
>
> I'm very sorry but we just found a blocker issue FLINK-18461 [1] in the new
> feature of changelog source (CDC).
> This bug will result in queries on changelog source can’t be inserted into
> upsert sink (e.g. ES, JDBC, HBase),
> which is a common case in production. CDC is one of the important features
> of Table/SQL in this release,
> so from my side, I hope we can have this fix in 1.11.0, otherwise, this is
> a broken feature...
>
> Again, I am terribly sorry for delaying the release...
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/FLINK-18461
>
> On Thu, 2 Jul 2020 at 12:02, Zhijiang 
> wrote:
>
> > Hi Thomas,
> >
> > Thanks for the efficient feedback.
> >
> > Regarding the suggestion of adding the release notes document, I agree
> > with your point. Maybe we should adjust the vote template accordingly in
> > the respective wiki to guide the following release processes.
> >
> > Regarding the performance regression, could you provide some more details
> > for our better measurement or reproducing on our sides?
> > E.g. I guess the topology only includes two vertexes source and sink?
> > What is the parallelism for every vertex?
> > The upstream shuffles data to the downstream via rebalance partitioner or
> > other?
> > The checkpoint mode is exactly-once with rocksDB state backend?
> > The backpressure happened in this case?
> > How much percentage regression in this case?
> >
> > Best,
> > Zhijiang
> >
> >
> >
> > --
> > From:Thomas Weise 
> > Send Time: Thu, Jul 2, 2020, 09:54
> > To:dev 
> > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >
> > Hi Till,
> >
> > Yes, we don't have the setting in flink-conf.yaml.
> >
> > Generally, we carry forward the existing configuration and any change to
> > default configuration values would impact the upgrade.
> >
> > Yes, since it is an incompatible change I would state it in the release
> > notes.
> >
> > Thanks,
> > Thomas
> >
> > BTW I found a performance regression while trying to upgrade another
> > pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't able
> > to pin it down yet, symptoms include increased checkpoint alignment time.
> >
> > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann 
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > just to confirm: When starting the image in local mode, then you don't
> > have
> > > any of the JobManager memory configuration settings configured in the
> > > effective flink-conf.yaml, right? Does this mean that you have
> explicitly
> > > removed `jobmanager.heap.size: 1024m` from the default configuration?
> If
> > > this is the case, then I believe it was more of an unintentional
> artifact
> > > that it worked before and it has been corrected now so that one needs
> to
> > > specify the memory of the JM process explicitly. Do you think it would
> > help
> > > to explicitly state this in the release notes?
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise  wrote:
> > >
> > > > Thanks for preparing another RC!
> > > >
> > > > As mentioned in the previous RC thread, it would be super helpful if
> > the
> > > > release notes that are part of the documentation can be included [1].
> > > It's
> > > > a significant time-saver to have read those first.
> > > >
> > > > I found one more non-backward compatible change that would be worth
> > > > addressing/mentioning:
> > > >
> > > > It is now necessary to configure the jobmanager heap size in
> > > > flink-conf.yaml (with either jobmanager.heap.size
> > > > or jobmanager.memory.heap.size). Why would I not want to do that
> > anyways?
> > > > Well, we set it dynamically for a cluster deployment via the
> > > > flinkk8soperator, but the container image can also be used for
> testing
> > > with
> > > > local mode (./bin/jobmanager.sh start-foreground local). That will
> fail
> > > if
> > > > the heap wasn't configured and that's how I noticed it.
> > >

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Till Rohrmann
I agree with Robert.

@Chesnay: The problem has probably already existed in Flink 1.10 and before,
because we cannot run jobs with eager execution calls from the web UI. I
agree with Robert that we can/should improve our documentation in this
regard, though.

@Thomas:
1. I will update the release notes to add a short section describing that
one needs to configure the JobManager memory.
2. Concerning the performance regression, we should look into it. I believe
Zhijiang is very eager to learn more about your exact setup to further
debug it. Again, I agree with Robert that we should not block the release on
it at the moment.

@Jark: How much of a problem is FLINK-18461? Will it make the CDC feature
completely unusable, or will it only make a subset of the use cases not
work? If it is the latter, then I believe that we can document the
limitations and try to fix it asap. Depending on the remaining testing, the
fix might make it into the 1.11.0 or the 1.11.1 release.

Cheers,
Till

On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger  wrote:

> Thanks a lot for the thorough testing Thomas! This is really helpful!
>
> @Chesnay: I would not block the release on this. The web submission does
> not seem to be the documented / preferred way of job submission. It is
> unlikely to harm the beginner's experience (and they would anyways not read
> the release notes). I mention the beginner experience, because they are the
> primary audience of the examples.
>
> Regarding FLINK-18461 / Jark's issue: I would not block the release on
> that, but still try to get it fixed asap. It is likely that this RC doesn't
> go through (given the rate at which we are finding issues), and even if it
> goes through, we can document it as a known issue in the release
> announcement and immediately release 1.11.1.
> Blocking the release on this causes quite a bit of work for the release
> managers for rolling a new RC. Until we have understood the performance
> regression Thomas is reporting, I would keep this RC open, and keep
> testing.
>
>
> On Thu, Jul 2, 2020 at 8:34 AM Jark Wu  wrote:
>
> > Hi,
> >
> > I'm very sorry but we just found a blocker issue FLINK-18461 [1] in the
> new
> > feature of changelog source (CDC).
> > This bug will result in queries on changelog source can’t be inserted
> into
> > upsert sink (e.g. ES, JDBC, HBase),
> > which is a common case in production. CDC is one of the important
> features
> > of Table/SQL in this release,
> > so from my side, I hope we can have this fix in 1.11.0, otherwise, this
> is
> > a broken feature...
> >
> > Again, I am terribly sorry for delaying the release...
> >
> > Best,
> > Jark
> >
> > [1]: https://issues.apache.org/jira/browse/FLINK-18461
> >
> > On Thu, 2 Jul 2020 at 12:02, Zhijiang  .invalid>
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > Thanks for the efficient feedback.
> > >
> > > Regarding the suggestion of adding the release notes document, I agree
> > > with your point. Maybe we should adjust the vote template accordingly
> in
> > > the respective wiki to guide the following release processes.
> > >
> > > Regarding the performance regression, could you provide some more
> details
> > > for our better measurement or reproducing on our sides?
> > > E.g. I guess the topology only includes two vertexes source and sink?
> > > What is the parallelism for every vertex?
> > > The upstream shuffles data to the downstream via rebalance partitioner
> or
> > > other?
> > > The checkpoint mode is exactly-once with rocksDB state backend?
> > > The backpressure happened in this case?
> > > How much percentage regression in this case?
> > >
> > > Best,
> > > Zhijiang
> > >
> > >
> > >
> > > --
> > > From:Thomas Weise 
> > > Send Time: Thu, Jul 2, 2020, 09:54
> > > To:dev 
> > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >
> > > Hi Till,
> > >
> > > Yes, we don't have the setting in flink-conf.yaml.
> > >
> > > Generally, we carry forward the existing configuration and any change
> to
> > > default configuration values would impact the upgrade.
> > >
> > > Yes, since it is an incompatible change I would state it in the release
> > > notes.
> > >
> > > Thanks,
> > > Thomas
> > >
> > > BTW I found a performance regression while trying to upgrade another
> > > pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't
> able
> > > to pin it down yet, symptoms include increased checkpoint alignment
> time.
> > >
> > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann 
> > > wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > just to confirm: When starting the image in local mode, then you
> don't
> > > have
> > > > any of the JobManager memory configuration settings configured in the
> > > > effective flink-conf.yaml, right? Does this mean that you have
> > explicitly
> > > > removed `jobmanager.heap.size: 1024m` from the default configuration?
> > If
> > > > this is the case, then I believe it was more of an unintentional
> > artifact

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Zhijiang
I also agree with Till and Robert's proposals.

In general, I think we should not block the release based on the current estimation.
If we continuously postpone the release, new blocker bugs will probably surface, and
we might get stuck in a cycle where we never deliver a final release to users in
time. But that does not mean RC4 would be the final one; we can reevaluate the
effects in progress as issues accumulate.

Regarding the performance regression, if possible we can reproduce it and analyze
the cause based on Thomas's feedback; then we can evaluate its effect.

Regarding FLINK-18461: after syncing with Jark offline, the bug would
affect one of three scenarios for using the CDC feature, and this affected scenario
is actually the one most commonly used by users.
My suggestion is to merge the fix into release-1.11 ATM, since the PR is already open
for review, and then further finalize the conclusion later. If this issue is
the only one remaining after RC4 goes through, then another option is to
cover it in release 1.11.1, as Robert suggested, as we can prepare the next bugfix
release soon. If there are other blocker issues during voting that need to
be resolved soon, then no doubt we cover all of them in the next RC5.

Best,
Zhijiang


--
From:Till Rohrmann 
Send Time: Thu, Jul 2, 2020, 16:46
To:dev 
Cc:Zhijiang 
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

I agree with Robert.

@Chesnay: The problem has probably already existed in Flink 1.10 and before 
because we cannot run jobs with eager execution calls from the web ui. I agree 
with Robert that we can/should improve our documentation in this regard, though.

@Thomas: 
1. I will update the release notes to add a short section describing that one 
needs to configure the JobManager memory. 
2. Concerning the performance regression we should look into it. I believe 
Zhijiang is very eager to learn more about your exact setup to further debug 
it. Again I agree with Robert to not block the release on it at the moment.

@Jark: How much of a problem is FLINK-18461? Will it make the CDC feature 
completely unusable or will only make a subset of the use cases to not work? If 
it is the latter, then I believe that we can document the limitations and try 
to fix it asap. Depending on the remaining testing the fix might make it into 
the 1.11.0 or the 1.11.1 release.

Cheers,
Till
On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger  wrote:
Thanks a lot for the thorough testing Thomas! This is really helpful!

 @Chesnay: I would not block the release on this. The web submission does
 not seem to be the documented / preferred way of job submission. It is
 unlikely to harm the beginner's experience (and they would anyways not read
 the release notes). I mention the beginner experience, because they are the
 primary audience of the examples.

 Regarding FLINK-18461 / Jark's issue: I would not block the release on
 that, but still try to get it fixed asap. It is likely that this RC doesn't
 go through (given the rate at which we are finding issues), and even if it
 goes through, we can document it as a known issue in the release
 announcement and immediately release 1.11.1.
 Blocking the release on this causes quite a bit of work for the release
 managers for rolling a new RC. Until we have understood the performance
 regression Thomas is reporting, I would keep this RC open, and keep testing.


 On Thu, Jul 2, 2020 at 8:34 AM Jark Wu  wrote:

 > Hi,
 >
 > I'm very sorry but we just found a blocker issue FLINK-18461 [1] in the new
 > feature of changelog source (CDC).
 > This bug will result in queries on changelog source can’t be inserted into
 > upsert sink (e.g. ES, JDBC, HBase),
 > which is a common case in production. CDC is one of the important features
 > of Table/SQL in this release,
 > so from my side, I hope we can have this fix in 1.11.0, otherwise, this is
 > a broken feature...
 >
 > Again, I am terribly sorry for delaying the release...
 >
 > Best,
 > Jark
 >
 > [1]: https://issues.apache.org/jira/browse/FLINK-18461
 >
 > On Thu, 2 Jul 2020 at 12:02, Zhijiang 
 > wrote:
 >
 > > Hi Thomas,
 > >
 > > Thanks for the efficient feedback.
 > >
 > > Regarding the suggestion of adding the release notes document, I agree
 > > with your point. Maybe we should adjust the vote template accordingly in
 > > the respective wiki to guide the following release processes.
 > >
 > > Regarding the performance regression, could you provide some more details
 > > for our better measurement or reproducing on our sides?
 > > E.g. I guess the topology only includes two vertexes source and sink?
 > > What is the parallelism for every vertex?
 > > The upstream shuffles data to the downstream via rebalance partitioner or
 > > other?
 > > The checkpoint mode is exactly-once with rocksDB state backend?
 > > The backpressure happened in this case?
 > > How much percentage regression in this case?

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Stephan Ewen
+1 (binding) from my side

  - legal files (license, notice) look correct
  - no binaries in the release
  - ran examples from command line
  - ran some examples from web ui
  - log files look sane
  - RocksDB, incremental checkpoints, savepoints, moving savepoints
all work as expected.

There are some friction points, which have also been mentioned. However, I
am not sure they need to block the release.
  - Some batch examples in the web UI have not been working in 1.10. We
should fix that asap, because it impacts the "getting started" experience,
but I personally don't vote against the release based on that.
  - Same for the CDC bug. It is unfortunate, but I would not hold the
release at such a late stage for one special issue in a new connector.
Let's work on a timely 1.11.1.


I would withdraw my vote if we find a fundamental issue in the network
system that causes the increased checkpoint delays and the job regression
Thomas mentioned.
Such a core bug would be a deal-breaker for a large fraction of users.




On Thu, Jul 2, 2020 at 11:35 AM Zhijiang 
wrote:

> I also agree with Till and Robert's proposals.
>
> In general I think we should not block the release based on current
> estimation. Otherwise we continuously postpone the release, it might
> probably occur new bugs for blockers, then we might probably
> get stuck in such cycle to not give a final release for users in time. But
> that does not mean RC4 would be the final one, and we can reevaluate the
> effects in progress with the accumulated issues.
>
> Regarding the performance regression, if possible we can reproduce to
> analysis the reason based on Thomas's feedback, then we can evaluate its
> effect.
>
> Regarding the FLINK-18461, after syncing with Jark offline, the bug would
> effect one of three scenarios for using CDC feature, and this effected
> scenario is actually the most commonly used way by users.
> My suggestion is to merge it into release-1.11 ATM since the PR already
> open for review, then let's further finalize the conclusion later. If this
> issue is the only one after RC4 going through, then another option is to
> cover it in next release-1.11.1 as Robert suggested, as we can prepare for
> the next minor release soon. If there are other blockers issues during
> voting and necessary to be resolved soon, then it is no doubt to cover all
> of them in next RC5.
>
> Best,
> Zhijiang
>
>
> --
> From:Till Rohrmann 
> Send Time: Thu, Jul 2, 2020, 16:46
> To:dev 
> Cc:Zhijiang 
> Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>
> I agree with Robert.
>
> @Chesnay: The problem has probably already existed in Flink 1.10 and
> before because we cannot run jobs with eager execution calls from the web
> ui. I agree with Robert that we can/should improve our documentation in
> this regard, though.
>
> @Thomas:
> 1. I will update the release notes to add a short section describing that
> one needs to configure the JobManager memory.
> 2. Concerning the performance regression we should look into it. I believe
> Zhijiang is very eager to learn more about your exact setup to further
> debug it. Again I agree with Robert to not block the release on it at the
> moment.
>
> @Jark: How much of a problem is FLINK-18461? Will it make the CDC feature
> completely unusable or will only make a subset of the use cases to not
> work? If it is the latter, then I believe that we can document the
> limitations and try to fix it asap. Depending on the remaining testing the
> fix might make it into the 1.11.0 or the 1.11.1 release.
>
> Cheers,
> Till
> On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger 
> wrote:
> Thanks a lot for the thorough testing Thomas! This is really helpful!
>
>  @Chesnay: I would not block the release on this. The web submission does
>  not seem to be the documented / preferred way of job submission. It is
>  unlikely to harm the beginner's experience (and they would anyways not
> read
>  the release notes). I mention the beginner experience, because they are
> the
>  primary audience of the examples.
>
>  Regarding FLINK-18461 / Jark's issue: I would not block the release on
>  that, but still try to get it fixed asap. It is likely that this RC
> doesn't
>  go through (given the rate at which we are finding issues), and even if it
>  goes through, we can document it as a known issue in the release
>  announcement and immediately release 1.11.1.
>  Blocking the release on this causes quite a bit of work for the release
>  managers for rolling a new RC. Until we have understood the performance
>  regression Thomas is reporting, I would keep this RC open, and keep
> testing.
>
>
>  On Thu, Jul 2, 2020 at 8:34 AM Jark Wu  wrote:
>
>  > Hi,
>  >
>  > I'm very sorry but we just found a blocker issue FLINK-18461 [1] in the
> new
>  > feature of changelog source (CDC).
>  > This bug will result in queries on changelog source can’t be inserted
> into
> upsert sink (e.g. ES, JDBC, HBase), which is a common case in production.

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Chesnay Schepler

+1

Re examples:

The failing examples are a new issue in 1.11 (not 1.10), introduced by
https://issues.apache.org/jira/browse/FLINK-16655.


In prior versions, calls to print()/count()/etc. were simply
treated as an execute(), whereas with 1.11 we outright fail the
submission because these do not work in detached submissions (which jar
submissions always are).
This is generally fine, and may save users some headaches, but we
should add this to the release notes and, in a follow-up, ensure a proper
error message is shown in the UI (I'll take care of that). At the moment
you just get an "Internal Server Error." and have to check the
JobManager logs for details.
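
As a sketch of what a submission-mode-agnostic example could look like (the class name and output path are illustrative), the eager call is replaced by a sink plus an explicit execute():

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class DetachedFriendlySketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Long> numbers = env.generateSequence(1, 100);

        // Instead of numbers.print() or numbers.count(), both eager calls that
        // fail when the submission is detached, write to a sink...
        numbers.writeAsText("/tmp/numbers-out");

        // ...and trigger execution explicitly. Nothing is fetched back to the
        // client, so this works for attached and detached submission alike.
        env.execute("detached-friendly example");
    }
}
```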


On 02/07/2020 15:47, Stephan Ewen wrote:

+1 (binding) from my side

   - legal files (license, notice) looks correct
   - no binaries in the release
   - ran examples from command line
   - ran some examples from web ui
   - log files look sane
   - RocksDB, incremental checkpoints, savepoints, moving savepoints
all works as expected.

There are some friction points, which have also been mentioned. However, I
am not sure they need to block the release.
   - Some batch examples in the web UI have not been working in 1.10. We
should fix that asap, because it impacts the "getting started" experience,
but I personally don't vote against the release based on that
   - Same for the CDC bug. It is unfortunate, but I would not hold the
release at such a late stage for one special issue in a new connector.
Let's work on a timely 1.11.1.


I would withdraw my vote, if we find a fundamental issue in the network
system causing the increased checkpoint delays, causing the job regression
Thomas mentioned.
Such a core bug would be a deal-breaker for a large fraction of users.




On Thu, Jul 2, 2020 at 11:35 AM Zhijiang 
wrote:


I also agree with Till and Robert's proposals.

In general I think we should not block the release based on current
estimation. Otherwise we continuously postpone the release, it might
probably occur new bugs for blockers, then we might probably
get stuck in such cycle to not give a final release for users in time. But
that does not mean RC4 would be the final one, and we can reevaluate the
effects in progress with the accumulated issues.

Regarding the performance regression, if possible we can reproduce to
analysis the reason based on Thomas's feedback, then we can evaluate its
effect.

Regarding the FLINK-18461, after syncing with Jark offline, the bug would
effect one of three scenarios for using CDC feature, and this effected
scenario is actually the most commonly used way by users.
My suggestion is to merge it into release-1.11 ATM since the PR already
open for review, then let's further finalize the conclusion later. If this
issue is the only one after RC4 going through, then another option is to
cover it in next release-1.11.1 as Robert suggested, as we can prepare for
the next minor release soon. If there are other blockers issues during
voting and necessary to be resolved soon, then it is no doubt to cover all
of them in next RC5.

Best,
Zhijiang


--
From:Till Rohrmann 
Send Time: Thu, Jul 2, 2020, 16:46
To:dev 
Cc:Zhijiang 
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

I agree with Robert.

@Chesnay: The problem has probably already existed in Flink 1.10 and
before because we cannot run jobs with eager execution calls from the web
ui. I agree with Robert that we can/should improve our documentation in
this regard, though.

@Thomas:
1. I will update the release notes to add a short section describing that
one needs to configure the JobManager memory.
2. Concerning the performance regression we should look into it. I believe
Zhijiang is very eager to learn more about your exact setup to further
debug it. Again I agree with Robert to not block the release on it at the
moment.

@Jark: How much of a problem is FLINK-18461? Will it make the CDC feature
completely unusable or will only make a subset of the use cases to not
work? If it is the latter, then I believe that we can document the
limitations and try to fix it asap. Depending on the remaining testing the
fix might make it into the 1.11.0 or the 1.11.1 release.

Cheers,
Till
On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger 
wrote:
Thanks a lot for the thorough testing Thomas! This is really helpful!

  @Chesnay: I would not block the release on this. The web submission does
  not seem to be the documented / preferred way of job submission. It is
  unlikely to harm the beginner's experience (and they would anyways not
read
  the release notes). I mention the beginner experience, because they are
the
  primary audience of the examples.

  Regarding FLINK-18461 / Jark's issue: I would not block the release on
  that, but still try to get it fixed asap. It is likely that this RC
doesn't
  go through (given the rate at which we are finding issues), and even if it
  goes through, we can document it as a known issue in the release
announcement and immediately release 1.11.1.

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Robert Metzger
Issues found:
-
https://repository.apache.org/content/repositories/orgapacheflink-1377/org/apache/flink/flink-runtime_2.12/1.11.0/flink-runtime_2.12-1.11.0.jar
./META-INF/NOTICE lists "org.uncommons.maths:uncommons-maths:1.2.2a" as a
bundled dependency. However, it seems it is not actually bundled. I'm waiting
with my vote until we've discussed this issue. I'm leaning towards
continuing the release vote (
https://issues.apache.org/jira/browse/FLINK-18471).

Checks:
- source archive compiles
- checked artifacts in staging repo
  - flink-azure-fs-hadoop-1.11.0.jar seems to have a correct NOTICE file
  - versions in pom seem correct
  - checked some other jars
- ... I will continue later ...

On Thu, Jul 2, 2020 at 3:47 PM Stephan Ewen  wrote:

> +1 (binding) from my side
>
>   - legal files (license, notice) looks correct
>   - no binaries in the release
>   - ran examples from command line
>   - ran some examples from web ui
>   - log files look sane
>   - RocksDB, incremental checkpoints, savepoints, moving savepoints
> all works as expected.
>
> There are some friction points, which have also been mentioned. However, I
> am not sure they need to block the release.
>   - Some batch examples in the web UI have not been working in 1.10. We
> should fix that asap, because it impacts the "getting started" experience,
> but I personally don't vote against the release based on that
>   - Same for the CDC bug. It is unfortunate, but I would not hold the
> release at such a late stage for one special issue in a new connector.
> Let's work on a timely 1.11.1.
>
>
> I would withdraw my vote, if we find a fundamental issue in the network
> system causing the increased checkpoint delays, causing the job regression
> Thomas mentioned.
> Such a core bug would be a deal-breaker for a large fraction of users.
>
>
>
>
> On Thu, Jul 2, 2020 at 11:35 AM Zhijiang  .invalid>
> wrote:
>
> > I also agree with Till and Robert's proposals.
> >
> > In general I think we should not block the release based on current
> > estimation. Otherwise we continuously postpone the release, it might
> > probably occur new bugs for blockers, then we might probably
> > get stuck in such cycle to not give a final release for users in time.
> But
> > that does not mean RC4 would be the final one, and we can reevaluate the
> > effects in progress with the accumulated issues.
> >
> > Regarding the performance regression, if possible we can reproduce to
> > analysis the reason based on Thomas's feedback, then we can evaluate its
> > effect.
> >
> > Regarding the FLINK-18461, after syncing with Jark offline, the bug would
> > effect one of three scenarios for using CDC feature, and this effected
> > scenario is actually the most commonly used way by users.
> > My suggestion is to merge it into release-1.11 ATM since the PR already
> > open for review, then let's further finalize the conclusion later. If
> this
> > issue is the only one after RC4 going through, then another option is to
> > cover it in next release-1.11.1 as Robert suggested, as we can prepare
> for
> > the next minor release soon. If there are other blockers issues during
> > voting and necessary to be resolved soon, then it is no doubt to cover
> all
> > of them in next RC5.
> >
> > Best,
> > Zhijiang
> >
> >
> > --
> > From:Till Rohrmann 
> > Send Time: Thu, Jul 2, 2020, 16:46
> > To:dev 
> > Cc:Zhijiang 
> > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >
> > I agree with Robert.
> >
> > @Chesnay: The problem has probably already existed in Flink 1.10 and
> > before because we cannot run jobs with eager execution calls from the web
> > ui. I agree with Robert that we can/should improve our documentation in
> > this regard, though.
> >
> > @Thomas:
> > 1. I will update the release notes to add a short section describing that
> > one needs to configure the JobManager memory.
> > 2. Concerning the performance regression we should look into it. I
> believe
> > Zhijiang is very eager to learn more about your exact setup to further
> > debug it. Again I agree with Robert to not block the release on it at the
> > moment.
> >
> > @Jark: How much of a problem is FLINK-18461? Will it make the CDC feature
> > completely unusable or will only make a subset of the use cases to not
> > work? If it is the latter, then I believe that we can document the
> > limitations and try to fix it asap. Depending on the remaining testing
> the
> > fix might make it into the 1.11.0 or the 1.11.1 release.
> >
> > Cheers,
> > Till
> > On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger 
> > wrote:
> > Thanks a lot for the thorough testing Thomas! This is really helpful!
> >
> >  @Chesnay: I would not block the release on this. The web submission does
> >  not seem to be the documented / preferred way of job submission. It is
> >  unlikely to harm the beginner's experience (and they would anyways not
> > read
> >  the release notes). I mention the beginner experience, because they are
the primary audience of the examples.

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Chesnay Schepler
Listing more than we need to (especially if it is Apache-licensed) isn't
a big problem, since nothing changes from a user's perspective with regard
to licensing.


On 02/07/2020 17:08, Robert Metzger wrote:

Issues found:
-
https://repository.apache.org/content/repositories/orgapacheflink-1377/org/apache/flink/flink-runtime_2.12/1.11.0/flink-runtime_2.12-1.11.0.jar
./META-INF/NOTICE lists "org.uncommons.maths:uncommons-maths:1.2.2a" as a
bundled dependency. However, it seems they are not bundled. I'm waiting
with my vote until we've discussed this issue. I'm leaning towards
continuing the release vote (
https://issues.apache.org/jira/browse/FLINK-18471).

Checks:
- source archive compiles
- checked artifacts in staging repo
   - flink-azure-fs-hadoop-1.11.0.jar seems to have a correct NOTICE file
   - versions in pom seem correct
   - checked some other jars
- ... I will continue later ...

On Thu, Jul 2, 2020 at 3:47 PM Stephan Ewen  wrote:


+1 (binding) from my side

   - legal files (license, notice) looks correct
   - no binaries in the release
   - ran examples from command line
   - ran some examples from web ui
   - log files look sane
   - RocksDB, incremental checkpoints, savepoints, moving savepoints
all works as expected.

There are some friction points, which have also been mentioned. However, I
am not sure they need to block the release.
   - Some batch examples in the web UI have not been working in 1.10. We
should fix that asap, because it impacts the "getting started" experience,
but I personally don't vote against the release based on that
   - Same for the CDC bug. It is unfortunate, but I would not hold the
release at such a late stage for one special issue in a new connector.
Let's work on a timely 1.11.1.


I would withdraw my vote, if we find a fundamental issue in the network
system causing the increased checkpoint delays, causing the job regression
Thomas mentioned.
Such a core bug would be a deal-breaker for a large fraction of users.




On Thu, Jul 2, 2020 at 11:35 AM Zhijiang 
wrote:


I also agree with Till and Robert's proposals.

In general I think we should not block the release based on current
estimation. Otherwise we continuously postpone the release, it might
probably occur new bugs for blockers, then we might probably
get stuck in such cycle to not give a final release for users in time.

But

that does not mean RC4 would be the final one, and we can reevaluate the
effects in progress with the accumulated issues.

Regarding the performance regression, if possible we can reproduce to
analysis the reason based on Thomas's feedback, then we can evaluate its
effect.

Regarding the FLINK-18461, after syncing with Jark offline, the bug would
effect one of three scenarios for using CDC feature, and this effected
scenario is actually the most commonly used way by users.
My suggestion is to merge it into release-1.11 ATM since the PR already
open for review, then let's further finalize the conclusion later. If

this

issue is the only one after RC4 going through, then another option is to
cover it in next release-1.11.1 as Robert suggested, as we can prepare

for

the next minor release soon. If there are other blockers issues during
voting and necessary to be resolved soon, then it is no doubt to cover

all

of them in next RC5.

Best,
Zhijiang


--
From:Till Rohrmann 
Send Time: Thu, Jul 2, 2020, 16:46
To:dev 
Cc:Zhijiang 
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

I agree with Robert.

@Chesnay: The problem has probably already existed in Flink 1.10 and
before because we cannot run jobs with eager execution calls from the web
ui. I agree with Robert that we can/should improve our documentation in
this regard, though.

@Thomas:
1. I will update the release notes to add a short section describing that
one needs to configure the JobManager memory.
2. Concerning the performance regression we should look into it. I

believe

Zhijiang is very eager to learn more about your exact setup to further
debug it. Again I agree with Robert to not block the release on it at the
moment.

@Jark: How much of a problem is FLINK-18461? Will it make the CDC feature
completely unusable or will only make a subset of the use cases to not
work? If it is the latter, then I believe that we can document the
limitations and try to fix it asap. Depending on the remaining testing

the

fix might make it into the 1.11.0 or the 1.11.1 release.

Cheers,
Till
On Thu, Jul 2, 2020 at 10:33 AM Robert Metzger 
wrote:
Thanks a lot for the thorough testing Thomas! This is really helpful!

  @Chesnay: I would not block the release on this. The web submission does
  not seem to be the documented / preferred way of job submission. It is
  unlikely to harm the beginner's experience (and they would anyways not
read
  the release notes). I mention the beginner experience, because they are
the
  primary audience of the examples.

  Regarding FLINK-18461 / Jark's issue: I would not block the release on
that, but still try to get it fixed asap.

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Till Rohrmann
- verified checksums and signatures
- built Flink from the source release with Scala 2.12
- executed some example jobs successfully
- verified license and notice files

I found the following issues with some NOTICE files:

* flink-connector-hive: org.apache.parquet:parquet-format:1.10.0 ->
org.apache.parquet:parquet-format:2.4.0
* flink-connector-kinesis:
  com.amazonaws:aws-java-sdk-dynamodb:jar:1.11.754 ->
com.amazonaws:aws-java-sdk-dynamodb:jar:1.11.603
  com.amazonaws:aws-java-sdk-s3:jar:1.11.754 ->
com.amazonaws:aws-java-sdk-s3:jar:1.11.603
  com.amazonaws:aws-java-sdk-kms:jar:1.11.754 ->
com.amazonaws:aws-java-sdk-kms:jar:1.11.603
* flink-sql-parquet: org.apache.commons:commons-compress:1.20 not used

So these three modules report wrong versions for their dependencies in the
NOTICE files. I would argue that this is not a big problem since the
license did not change and we are not required to list ASL 2.0
dependencies. Hence, I would suggest continuing with the release voting. I
will open a PR to fix these problems soon.

Given that this is not a blocking problem, and provided we don't find a
problem in the network stack, +1 for this release candidate.

Cheers,
Till

On Thu, Jul 2, 2020 at 5:29 PM Chesnay Schepler  wrote:

> Listing more than we need to (especially if it is apache licensed) isn't
> a big problem, since nothing changes from a user's perspective in regards
> to licensing.
>
> On 02/07/2020 17:08, Robert Metzger wrote:
> > Issues found:
> > -
> >
> https://repository.apache.org/content/repositories/orgapacheflink-1377/org/apache/flink/flink-runtime_2.12/1.11.0/flink-runtime_2.12-1.11.0.jar
> > ./META-INF/NOTICE lists "org.uncommons.maths:uncommons-maths:1.2.2a" as a
> > bundled dependency. However, it seems they are not bundled. I'm waiting
> > with my vote until we've discussed this issue. I'm leaning towards
> > continuing the release vote (
> > https://issues.apache.org/jira/browse/FLINK-18471).
> >
> > Checks:
> > - source archive compiles
> > - checked artifacts in staging repo
> >- flink-azure-fs-hadoop-1.11.0.jar seems to have a correct NOTICE file
> >- versions in pom seem correct
> >- checked some other jars
> > - ... I will continue later ...
> >
> > On Thu, Jul 2, 2020 at 3:47 PM Stephan Ewen  wrote:
> >
> >> +1 (binding) from my side
> >>
> >>    - legal files (license, notice) look correct
> >>- no binaries in the release
> >>- ran examples from command line
> >>- ran some examples from web ui
> >>- log files look sane
> >>- RocksDB, incremental checkpoints, savepoints, moving savepoints
> >> all works as expected.
> >>
> >> There are some friction points, which have also been mentioned.
> However, I
> >> am not sure they need to block the release.
> >>- Some batch examples in the web UI have not been working in 1.10. We
> >> should fix that asap, because it impacts the "getting started"
> experience,
> >> but I personally don't vote against the release based on that
> >>- Same for the CDC bug. It is unfortunate, but I would not hold the
> >> release at such a late stage for one special issue in a new connector.
> >> Let's work on a timely 1.11.1.
> >>
> >>
> >> I would withdraw my vote, if we find a fundamental issue in the network
> >> system causing the increased checkpoint delays, causing the job
> regression
> >> Thomas mentioned.
> >> Such a core bug would be a deal-breaker for a large fraction of users.

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Till Rohrmann
I've opened a PR for fixing the NOTICE file problems [1].

[1] https://github.com/apache/flink/pull/12811

Cheers,
Till


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Thomas Weise
Hi Zhijiang,

The performance degradation manifests as backpressure, which leads to a
growing backlog in the source. I switched a few times between 1.10 and 1.11,
and the behavior is consistent.

The DAG is:

KinesisConsumer -> (Flat Map, Flat Map, Flat Map) --[forward]--> KinesisProducer

Parallelism: 160
No shuffle/rebalance.

Checkpointing config:

Checkpointing Mode Exactly Once
Interval 10s
Timeout 10m 0s
Minimum Pause Between Checkpoints 10s
Maximum Concurrent Checkpoints 1
Persist Checkpoints Externally Enabled (delete on cancellation)

State backend: RocksDB (the filesystem backend leads to the same symptoms)
Checkpoint size is tiny (500 KB)

An interesting difference to another job that I had upgraded successfully
is the low checkpointing interval.

Thanks,
Thomas
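
For anyone trying to reproduce this, the topology and settings above
translate roughly into the skeleton below. This is a sketch only, assuming
placeholder stream names, AWS region, and checkpoint path; the real record
types and operator logic are not described in the thread:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
    import org.apache.flink.util.Collector;

    public class KinesisPassthroughSkeleton {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(160);

            // Exactly-once, 10s interval, 10m timeout, 10s min pause,
            // 1 concurrent, externalized (delete on cancellation), as above.
            env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
            CheckpointConfig checkpointConfig = env.getCheckpointConfig();
            checkpointConfig.setCheckpointTimeout(600_000L);
            checkpointConfig.setMinPauseBetweenCheckpoints(10_000L);
            checkpointConfig.setMaxConcurrentCheckpoints(1);
            checkpointConfig.enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
            env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints", false));

            Properties kinesisProps = new Properties();
            kinesisProps.setProperty(ConsumerConfigConstants.AWS_REGION, "us-west-2"); // placeholder

            FlinkKinesisProducer<String> sink =
                    new FlinkKinesisProducer<>(new SimpleStringSchema(), kinesisProps);
            sink.setDefaultStream("output-stream"); // placeholder
            sink.setDefaultPartition("0");

            env.addSource(new FlinkKinesisConsumer<>(
                            "input-stream", new SimpleStringSchema(), kinesisProps))
                    // Three chained flat maps; forward partitioning, no shuffle.
                    .flatMap((String v, Collector<String> out) -> out.collect(v)).returns(String.class)
                    .flatMap((String v, Collector<String> out) -> out.collect(v)).returns(String.class)
                    .flatMap((String v, Collector<String> out) -> out.collect(v)).returns(String.class)
                    .addSink(sink);

            env.execute("kinesis-passthrough-skeleton");
        }
    }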


On Wed, Jul 1, 2020 at 9:02 PM Zhijiang 
wrote:

> Hi Thomas,
>
> Thanks for the efficient feedback.
>
> Regarding the suggestion of adding the release notes document, I agree
> with your point. Maybe we should adjust the vote template accordingly in
> the respective wiki to guide the following release processes.
>
> Regarding the performance regression, could you provide some more details
> so we can better measure or reproduce it on our side?
> E.g. I guess the topology only includes two vertices, source and sink?
> What is the parallelism of each vertex?
> Does the upstream shuffle data to the downstream via the rebalance
> partitioner or something else?
> Is the checkpoint mode exactly-once with the RocksDB state backend?
> Did backpressure happen in this case?
> What percentage regression do you see in this case?
>
> Best,
> Zhijiang
>
>
>
> --
> From:Thomas Weise 
> Send Time: Thursday, July 2, 2020, 09:54
> To:dev 
> Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>
> Hi Till,
>
> Yes, we don't have the setting in flink-conf.yaml.
>
> Generally, we carry forward the existing configuration and any change to
> default configuration values would impact the upgrade.
>
> Yes, since it is an incompatible change I would state it in the release
> notes.
>
> Thanks,
> Thomas
>
> BTW I found a performance regression while trying to upgrade another
> pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't able
> to pin it down yet, symptoms include increased checkpoint alignment time.
>
> On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann 
> wrote:
>
> > Hi Thomas,
> >
> > just to confirm: When starting the image in local mode, then you don't have
> > any of the JobManager memory configuration settings configured in the
> > effective flink-conf.yaml, right? Does this mean that you have explicitly
> > removed `jobmanager.heap.size: 1024m` from the default configuration? If
> > this is the case, then I believe it was more of an unintentional artifact
> > that it worked before and it has been corrected now so that one needs to
> > specify the memory of the JM process explicitly. Do you think it would help
> > to explicitly state this in the release notes?
> >
> > Cheers,
> > Till

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Robert Metzger
+1 (binding)

Checks:
- source archive compiles
- checked artifacts in staging repo
  - flink-azure-fs-hadoop-1.11.0.jar seems to have a correct NOTICE file
  - versions in pom seem correct
  - checked some other jars
- deployed Flink on YARN on Azure HDInsight (which uses Hadoop 3.1.1)
  - Reported some tiny log sanity issue:
https://issues.apache.org/jira/browse/FLINK-18474
  - Wordcount against HDFS works



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Kostas Kloudas
Hi all,

As for the issue Chesnay mentioned, which leads to a "Caused by:
org.apache.flink.api.common.InvalidProgramException" for DataSet
examples with print(), collect(), or count() as sinks: this was a
semi-intentional side effect of the application mode. Before, in these
cases, the output was simply ignored. Now we have the same behavior as
in "detached" mode. I already opened a PR for the release notes
(sorry for not doing it earlier, although this was a known change in
behavior, as mentioned in the PR here:
https://github.com/apache/flink/pull/11460 ) and I will merge it
today.

Cheers,
Kostas
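
To make the behavior change concrete, here is a hedged sketch of both
patterns; the eager calls are exactly the ones named in the error message,
and the output path is a placeholder:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.core.fs.FileSystem;

    public class DetachedSubmissionExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<String> words = env.fromElements("to", "be", "or", "not", "to", "be");

            // Eager execution functions trigger the job and pull results back
            // to the client; in detached mode (e.g. WebUI submission) they
            // throw InvalidProgramException:
            // words.print();
            // long n = words.count();

            // Detached-safe alternative: declare a sink, then execute explicitly.
            words.writeAsText("/tmp/words-out", FileSystem.WriteMode.OVERWRITE); // placeholder path
            env.execute("detached-safe example");
        }
    }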


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Aljoscha Krettek

+1

 - verified hash of source release
 - verified signature of source release
 - source release compiles (with Scala 2.11)
 - examples run without spurious log output (errors, exceptions)

I can confirm that log scrolling doesn't work on Firefox, though it 
never has.


I would also feel better if we can find the source of the performance 
regression that Thomas mentioned. It might be that we have to solve that 
in a .1 patch release.


Best,
Aljoscha


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Dawid Wysakowicz
+1 (binding)

Checks:

- built from sources

- verified signatures & no binaries in the source archive

- ran all tests locally (mvn clean install)

      Here I had a couple of problems:

        * https://issues.apache.org/jira/browse/FLINK-18476

        * https://issues.apache.org/jira/browse/FLINK-18470

        * UnsignedTypeConversionITCase failed because it requires
libncurses5 to be installed

  None of the issues should be blockers imo, as all three fail because
the tests assume a certain configuration of the environment.

- started a local cluster & ran a couple of Table examples

    The ChangelogSocketExample did not work for me. I think it would
make sense to only bundle examples that work out of the box in the dist.
Nevertheless, as it is a new example in the release and only an example,
I would not block the release because of it.
(https://issues.apache.org/jira/browse/FLINK-18477)

- started the sql-client and ran a few very simple queries

- verified a couple of license files:

    Here I have more of a question: if we bundle an artifact with a
classifier, shall we include the classifier as part of the entry in the
LICENSE file?

    We bundle org.apache.orc:orc-core:jar:nohive:1.4.3 in
flink-sql-connector-hive-1.2.2, but in the LICENSE file we list it
without the nohive classifier.

Side note: we do bundle some Python files as part of the distribution. I
have not seen anyone trying that out in the thread so far. Shall we ask
somebody more familiar with the Python module to check that?

Best,

Dawid


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-02 Thread Zhijiang
Hi Thomas,

Thanks for your reply with the rich information!

We are trying to reproduce your case in our cluster to further verify it, and
@Yingjie Cao is working on it now. As we do not have a Kinesis consumer and
producer internally, we will construct a comparable source and sink instead
for the backpressure case.

Firstly, we can dismiss the RocksDB factor in this release, since you also
mentioned that "filesystem leads to same symptoms".

Secondly, if my understanding is right, you emphasized that the regression
only exists for jobs with a low checkpoint interval (10s).
Based on that, I have two suspicions about the network-related changes in
this release:
- [1]: Limiting the maximum backlog value (default 10) in the subpartition
queue.
- [2]: Delaying the buffers that follow a checkpoint barrier on the upstream
side until barrier alignment on the downstream side.

These changes are motivated by reducing the number of in-flight buffers to
speed up checkpoints, especially in the case of backpressure.
In theory they should have a very minor performance effect, and we also
tested them in a cluster before merging to verify they stayed within
expectations, but maybe there are other corner cases we have not thought of.

Until the testing results on our side come out for your respective job case,
I have some other questions to confirm for further analysis:
- What percentage regression did you observe after switching to 1.11?
- Is there any network bottleneck in your cluster? E.g. is the network
bandwidth saturated by other jobs? If so, [2] above might have a larger
effect.
- Did you adjust the default network buffer settings? E.g.
"taskmanager.network.memory.floating-buffers-per-gate" or
"taskmanager.network.memory.buffers-per-channel".
- I guess the topology has three vertices, "KinesisConsumer -> Chained
FlatMap -> KinesisProducer", and the partition mode for "KinesisConsumer ->
FlatMap" and "FlatMap -> KinesisProducer" is "forward" in both cases? If so,
the edge connections are one-to-one, not all-to-all, and [1] and [2] above
should have no effect in theory with the default network buffer settings.
- With slot sharing, I guess the parallel tasks of these three vertices would
probably be deployed into the same slot, so the data shuffle goes through an
in-memory queue, not the network stack. If so, [2] above should have no
effect.
- I also saw some JIRA changes for Kinesis in this release; could you confirm
that these changes do not affect the performance?

Best,
Zhijiang
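
As a side note for reproduction attempts, the two settings named above can be
varied in a local experiment roughly as follows. A sketch only; the values
shown are assumed to be the 1.11 defaults:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class NetworkBufferExperiment {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Exclusive buffers per incoming channel (assumed default: 2).
            conf.setString("taskmanager.network.memory.buffers-per-channel", "2");
            // Floating buffers shared across an input gate (assumed default: 8).
            conf.setString("taskmanager.network.memory.floating-buffers-per-gate", "8");

            // A local environment picks the configuration up for quick A/B runs.
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment(4, conf);
            env.fromElements(1, 2, 3).print();
            env.execute("network-buffer-experiment");
        }
    }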



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Till Rohrmann
@Dawid I think it would be correct to also include the classifier for the
org.apache.orc:orc-core:jar:nohive:1.4.3 dependency because it is different
from the non-classified artifact. I would not block the release on it,
though, because it is an ASL 2.0 dependency which we are not required to
list. Can you open a PR to fix this problem?

Concerning the Python module I believe that Jincheng could help us with the
verification process.

Cheers,
Till


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Dawid Wysakowicz
Thanks Till for the clarification. I opened
https://github.com/apache/flink/pull/12816


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Yingjie Cao
Hi Thomas,

I tried to reproduce the regression by constructing a job with the same
topology, parallelism, and checkpoint interval (the Kinesis source and sink
were replaced, since we do not have that test environment). Unfortunately, no
regression is observed, either with or without backpressure.

Maybe we need more information to investigate the case further. Besides what
Zhijiang has asked, I have one more question: how many records/bytes does
each vertex process per second, and is there any imbalance, for example, some
tasks being slower or processing more records/bytes than others?

Best,
Yingjie


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Leonard Xu
+1 (non-binding)

- checked/verified signatures and hashes
- built from source using Scala 2.11 successfully
- went through all issues whose "fixVersion" is 1.11.0; there is no blocker
- checked that there are no missing artifacts
- tested the Elasticsearch7/JDBC/HBase/Kafka SQL connectors (new connector)
with some queries in the SQL Client; they work well and the results are as
expected
- started a cluster, the WebUI was accessible, submitted a wordcount job
which ran successfully, no suspicious log output
- the web PR looks good

Best,
Leonard Xu
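
For reference, the kind of smoke test Leonard describes looks roughly as
follows when driven from the Table API instead of the SQL Client shell. A
sketch only, assuming the new 1.11-style Kafka connector options and
placeholder topic/broker addresses:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class SqlConnectorSmokeTest {
        public static void main(String[] args) {
            EnvironmentSettings settings = EnvironmentSettings.newInstance()
                    .useBlinkPlanner()
                    .inStreamingMode()
                    .build();
            TableEnvironment tEnv = TableEnvironment.create(settings);

            // Source table using the new ('connector' = 'kafka') option style.
            tEnv.executeSql(
                    "CREATE TABLE orders (" +
                    "  order_id STRING," +
                    "  amount DOUBLE" +
                    ") WITH (" +
                    "  'connector' = 'kafka'," +
                    "  'topic' = 'orders'," +                            // placeholder
                    "  'properties.bootstrap.servers' = 'kafka:9092'," + // placeholder
                    "  'scan.startup.mode' = 'earliest-offset'," +
                    "  'format' = 'json'" +
                    ")");

            // Print sink for eyeballing results.
            tEnv.executeSql(
                    "CREATE TABLE order_totals (" +
                    "  order_id STRING," +
                    "  total DOUBLE" +
                    ") WITH ('connector' = 'print')");

            tEnv.executeSql(
                    "INSERT INTO order_totals " +
                    "SELECT order_id, SUM(amount) FROM orders GROUP BY order_id");
        }
    }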

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread jincheng sun
+1 (binding)

Checks:
   - checked wheel package consistency
   - tested the build from the wheel package
   - checked the signature and checksum
   - pip installed the Python package
`apache_flink-1.11.0-cp37-cp37m-macosx_10_9_x86_64.whl` successfully and
ran a simple word count example successfully
   - verified the performance of PyFlink UDFs
   - tested Python UDTF support

Best,
Jincheng




Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Dian Fu
+1 (non-binding)

- built from source with Scala 2.11 successfully
- checked the signature and checksum of the binary packages
- installed PyFlink on macOS, Windows, and Linux successfully
- tested the functionality of Pandas UDF and the conversion between PyFlink 
Table and Pandas DataFrame
- verified that the Python dependency management functionality works well

Regards,
Dian




Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Xingbo Huang
+1 (non-binding)

- checked wheel package consistency against the build from source
- tested the build from the wheel package on macOS with Python 3.6
- verified the performance of PyFlink UDFs, including Python general UDFs
and Pandas UDFs
- tested Python UDTF

Best,
Xingbo
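
A small sketch of the Python UDTF feature exercised in these checks (new in
1.11); the splitting function and the input table are illustrative:

from pyflink.table import BatchTableEnvironment, DataTypes, EnvironmentSettings
from pyflink.table.udf import udtf

settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
t_env = BatchTableEnvironment.create(environment_settings=settings)

@udtf(input_types=DataTypes.STRING(), result_types=DataTypes.STRING())
def split(line):
    # A table function may yield any number of rows per input row.
    for word in line.split():
        yield word

t_env.register_function("split", split)
lines = t_env.from_elements([("to be or not to be",)], ["line"])
result = lines.join_lateral("split(line) as word").select("line, word")
print(result.to_pandas())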



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Yingjie Cao
Hi Thomas,

Thanks a lot for offering this information.

We have decided to try to reproduce the regression on AWS. We would really
appreciate it if you could share some demo code with us; if that is not
convenient, could you give us some more information about the record type
and size, and the processing logic of each operator? That would help us
write a similar job and reproduce the regression.

Besides, I have a question: does increasing the checkpoint interval help to
reduce the regression? I am asking because I wonder whether the regression
is really related to the checkpoint interval.

Best,
Yingjie
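
(If it helps with testing that hypothesis: for a Table API job the
checkpoint interval can be varied per job through the TableConfig; a hedged
sketch, assuming the execution.checkpointing.* options are accepted this
way in 1.11. Double-check the keys against the 1.11 configuration docs.)

from pyflink.table import EnvironmentSettings, StreamTableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=settings)

# Raise the interval (e.g., from 10s to 60s) and compare throughput between runs.
conf = t_env.get_config().get_configuration()
conf.set_string("execution.checkpointing.mode", "EXACTLY_ONCE")
conf.set_string("execution.checkpointing.interval", "60s")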

Thomas Weise wrote on Fri, Jul 3, 2020 at 1:07 AM:

> Hi Zhijiang,
>
> The performance degradation manifests in backpressure which leads to
> growing backlog in the source. I switched a few times between 1.10 and 1.11
> and the behavior is consistent.
>
> The DAG is:
>
> KinesisConsumer -> (Flat Map, Flat Map, Flat Map)    forward
> -> KinesisProducer
>
> Parallelism: 160
> No shuffle/rebalance.
>
> Checkpointing config:
>
> Checkpointing Mode Exactly Once
> Interval 10s
> Timeout 10m 0s
> Minimum Pause Between Checkpoints 10s
> Maximum Concurrent Checkpoints 1
> Persist Checkpoints Externally Enabled (delete on cancellation)
>
> State backend: rocksdb  (filesystem leads to same symptoms)
> Checkpoint size is tiny (500KB)
>
> An interesting difference to another job that I had upgraded successfully
> is the low checkpointing interval.
>
> Thanks,
> Thomas
>
>
> On Wed, Jul 1, 2020 at 9:02 PM Zhijiang wrote:
>
> > Hi Thomas,
> >
> > Thanks for the efficient feedback.
> >
> > Regarding the suggestion of adding the release notes document, I agree
> > with your point. Maybe we should adjust the vote template accordingly in
> > the respective wiki to guide the following release processes.
> >
> > Regarding the performance regression, could you provide some more details
> > so we can better measure or reproduce it on our side?
> > E.g., I guess the topology only includes two vertices, source and sink?
> > What is the parallelism of every vertex?
> > Does the upstream shuffle data to the downstream via the rebalance
> > partitioner or another one?
> > Is the checkpoint mode exactly-once with the RocksDB state backend?
> > Did backpressure happen in this case?
> > What is the percentage regression in this case?
> >
> > Best,
> > Zhijiang
> >
> >
> >
> > --
> > From: Thomas Weise
> > Send Time: Thu, Jul 2, 2020, 09:54
> > To: dev
> > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >
> > Hi Till,
> >
> > Yes, we don't have the setting in flink-conf.yaml.
> >
> > Generally, we carry forward the existing configuration and any change to
> > default configuration values would impact the upgrade.
> >
> > Yes, since it is an incompatible change I would state it in the release
> > notes.
> >
> > Thanks,
> > Thomas
> >
> > BTW, I found a performance regression while trying to upgrade another
> > pipeline with this RC. It is a simple Kinesis-to-Kinesis job. I wasn't able
> > to pin it down yet; symptoms include increased checkpoint alignment time.
> >
> > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann wrote:
> >
> > > Hi Thomas,
> > >
> > > just to confirm: When starting the image in local mode, then you don't
> > have
> > > any of the JobManager memory configuration settings configured in the
> > > effective flink-conf.yaml, right? Does this mean that you have
> explicitly
> > > removed `jobmanager.heap.size: 1024m` from the default configuration?
> If
> > > this is the case, then I believe it was more of an unintentional
> artifact
> > > that it worked before and it has been corrected now so that one needs
> to
> > > specify the memory of the JM process explicitly. Do you think it would
> > help
> > > to explicitly state this in the release notes?
> > >
> > > Cheers,
> > > Till
> > >
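
For anyone who runs into the same local-mode startup failure: the fix
discussed above is a one-line entry in flink-conf.yaml, using either the
legacy or the new key (1024m is just the old default from the shipped
configuration, not a recommendation):

# flink-conf.yaml
jobmanager.heap.size: 1024m           # legacy key
# or, under the new JobManager memory model:
jobmanager.memory.heap.size: 1024m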

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-03 Thread Thomas Weise
Hi Zhijiang,

It will probably be best if we connect next week and discuss the issue
directly since this could be quite difficult to reproduce.

> Before the testing results on our side come out for your respective job
> case, I have some other questions to confirm for further analysis:
> - What percentage regression did you find after switching to 1.11?

~40% throughput decline

> - Is there any network bottleneck in your cluster? E.g., is the network
> bandwidth saturated by other jobs? If so, [2] above might have more
> effect.

The test runs on a k8s cluster that is also used for other production jobs.
There is no reason to believe the network is the bottleneck.

> - Did you adjust the default network buffer settings? E.g.
> "taskmanager.network.memory.floating-buffers-per-gate" or
> "taskmanager.network.memory.buffers-per-channel"

The job is using the defaults, i.e. we don't configure these settings. If
you want me to try specific settings in the hope that it will help to
isolate the issue, please let me know.

> - I guess the topology has three vertices, "KinesisConsumer -> Chained
> FlatMap -> KinesisProducer", and the partition mode for "KinesisConsumer ->
> FlatMap" and "FlatMap -> KinesisProducer" is "forward" in both cases? If
> so, the edge connection is one-to-one, not all-to-all, and then [1] and
> [2] above should have no effect in theory with the default network buffer
> settings.

There are only 2 vertices and the edge is "forward".

> - With slot sharing, I guess these parallel tasks would probably be
> deployed into the same slot, so the data shuffle goes through an in-memory
> queue, not the network stack. If so, [2] above should have no effect.

Yes, vertices share slots.

> - I also saw some Jira changes for Kinesis in this release; could you
> confirm that these changes do not affect the performance?

I will need to take a look. 1.10 already had a regression introduced by the
Kinesis producer update.


Thanks,
Thomas


On Thu, Jul 2, 2020 at 11:46 PM Zhijiang wrote:

> Hi Thomas,
>
> Thanks for your reply with rich information!
>
> We are trying to reproduce your case in our cluster to further verify it,
> and @Yingjie Cao is working on it now.
> As we do not have a Kinesis consumer and producer internally, we will
> construct a common source and sink instead for the backpressure case.
>
> Firstly, we can dismiss the RocksDB factor in this release, since you also
> mentioned that "filesystem leads to same symptoms".
>
> Secondly, if my understanding is right, you emphasized that the regression
> only exists for jobs with a low checkpoint interval (10s).
> Based on that, I have two suspicions about the network related changes in
> this release:
> - [1]: Limited the maximum backlog value (default 10) in subpartition
> queue.
> - [2]: Delay send the following buffers after checkpoint barrier on
> upstream side until barrier alignment on downstream side.
>
> These changes are motivated by reducing the in-flight buffers to speed up
> checkpointing, especially in the case of backpressure.
> In theory they should have a very minor performance effect, and we also
> tested them in a cluster before merging to verify they stay within
> expectation, but maybe there are corner cases we have not thought of.
>
> Before the testing results on our side come out for your respective job
> case, I have some other questions to confirm for further analysis:
> - What percentage regression did you find after switching to 1.11?
> - Is there any network bottleneck in your cluster? E.g., is the network
> bandwidth saturated by other jobs? If so, [2] above might have more
> effect.
> - Did you adjust the default network buffer settings? E.g.
> "taskmanager.network.memory.floating-buffers-per-gate" or
> "taskmanager.network.memory.buffers-per-channel"
> - I guess the topology has three vertices, "KinesisConsumer -> Chained
> FlatMap -> KinesisProducer", and the partition mode for "KinesisConsumer ->
> FlatMap" and "FlatMap -> KinesisProducer" is "forward" in both cases? If
> so, the edge connection is one-to-one, not all-to-all, and then [1] and
> [2] above should have no effect in theory with the default network buffer
> settings.
> - With slot sharing, I guess these parallel tasks would probably be
> deployed into the same slot, so the data shuffle goes through an in-memory
> queue, not the network stack. If so, [2] above should have no effect.
> - I also saw some Jira changes for Kinesis in this release; could you
> confirm that these changes do not affect the performance?
>
> Best,
> Zhijiang
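
For reference, the two buffer-related keys discussed above live in
flink-conf.yaml; to the best of my knowledge these are their defaults,
worth double-checking against the 1.11 configuration docs before tuning:

# flink-conf.yaml (assumed defaults, not tuning advice)
taskmanager.network.memory.buffers-per-channel: 2
taskmanager.network.memory.floating-buffers-per-gate: 8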

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-04 Thread Zhijiang
Hi Thomas,

Thanks for the further updates.

I guess we can dismiss the network stack changes, since in your case the
downstream and upstream would probably be deployed in the same slot,
bypassing the network data shuffle.
I also guess release-1.11 does not bring a general performance regression
in the runtime engine, as we did performance testing for all general cases
via [1] in a real cluster beforehand, and the results matched expectations.
But we indeed did not test the specific source and sink connectors yet, as
far as I know.

Regarding your ~40% performance regression, I suspect it is related to
specific source/sink changes (e.g., Kinesis) or to environment issues in a
corner case.
If possible, it would be helpful to further determine whether the
regression is caused by Kinesis, by replacing the Kinesis source & sink and
keeping everything else the same.

As you said, it would be most efficient to get in contact directly next
week to discuss this issue further. We are eager to provide any help to
resolve it soon.

Besides that, I guess this issue should not block the release, since it is
probably a corner case based on the current analysis.
If we conclude that anything needs to be resolved after the final release,
we can bring out the next bugfix release, 1.11.1, soon.

[1] https://issues.apache.org/jira/browse/FLINK-18433

Best,
Zhijiang



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-04 Thread Steven Wu
+1 (non-binding)

- rolled out to thousands of router jobs in our test env
- tested with a large-state job. Did simple resilience and
checkpoint/savepoint tests. General performance metrics look on par.
- tested with a high-parallelism stateless transformation job. General
performance metrics look on par.


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-04 Thread Congxian Qiu
+1 (non-binding), modulo some NOTICE files missing

I found that some modules do not contain a NOTICE file (I'm not sure
whether these modules do not need a NOTICE file or whether the file is
missing; I report the findings here).
The pom.xml comparison is based on the diff between
release-1.10.0..release-1.11.0-rc4 [1]; the dependencies for the modules
below have been uploaded to a gist [2]:
- flink-connector-elasticsearch6
- flink-connector-elasticsearch7
- flink-connector-hbase
- flink-hcatalog
- flink-orc
- flink-parquet
- flink-sequence-file


Checked:
- sha512 & gpg check, ok
- build from source, ok
- all poms point to 1.11.0
- ran some demos on a real cluster, ok
- manually tested that savepoints are relocatable, ok
- some license checks; apart from the NOTICE files, looks good to me.

[1] https://github.com/apache/flink/compare/release-1.10.0..release-1.11.0-rc4
[2] https://gist.github.com/klion26/026a79897334fdeefec381cf7cdd5d93

Best,
Congxian
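
A quick way to repeat that NOTICE scan over built artifacts is to check
each jar for a bundled META-INF/NOTICE; a sketch (the path glob is
illustrative and depends on where the build output lives):

import pathlib
import zipfile

for jar in sorted(pathlib.Path("build-target").rglob("flink-*.jar")):
    with zipfile.ZipFile(jar) as zf:
        names = set(zf.namelist())
        has_notice = bool(names & {"META-INF/NOTICE", "META-INF/NOTICE.txt"})
        print(f"{jar.name}: {'NOTICE present' if has_notice else 'no NOTICE'}")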



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-04 Thread Thomas Weise
Hi Zhijiang,

Could you please point me to more details regarding: "[2]: Delay send the
following buffers after checkpoint barrier on upstream side until barrier
alignment on downstream side."

In this case, the downstream task has a high average checkpoint duration
(~30s, sync part). If there was a change to hold buffers depending on
downstream performance, could this possibly apply to this case (even when
there is no shuffle that would require alignment)?

Thanks,
Thomas



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-04 Thread Jark Wu
+1

- started a cluster and ran some examples; verified web UI and log output,
nothing unexpected, except ChangelogSocketExample, which has already been
reported [2] by Dawid.
- started a cluster to run e2e SQL queries over millions of records with
Kafka, MySQL, and Elasticsearch as sources/lookups/sinks. Works well and
the results are as expected.
- used the SQL CLI to SELECT from a Kafka source with Debezium data; the
result is as expected.
- reviewed the release PR, left a comment [1] to add a note for
the FLINK-18461 issue.

Regarding the CDC issue FLINK-18461, the fix has been merged into the
release-1.11 branch.
I think we have a conclusion not to block the RC on this issue. We can
quickly launch the next release, 1.11.1, to cover it, as Robert suggested.

Best,
Jark

[1]: https://github.com/apache/flink-web/pull/352/files#r449830524
[2]: https://issues.apache.org/jira/browse/FLINK-18477
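
For anyone who wants to repeat the Debezium check outside the SQL CLI, a
minimal sketch using the executeSql entry point added in 1.11 (topic,
server address, and schema below are made up for illustration):

from pyflink.table import EnvironmentSettings, StreamTableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=settings)

# 'debezium-json' is one of the new CDC formats in 1.11.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'testGroup',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")
# SELECT queries against 'orders' can then be issued here or from the SQL Client.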


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-04 Thread Zhijiang
Hi Thomas,

Regarding [2], there are more details in the Jira description
(https://issues.apache.org/jira/browse/FLINK-16404).

I can also give some basic explanations here to dismiss the concern.
1. In the past, the buffers following the barrier were cached on the
downstream side before alignment.
2. In 1.11, the upstream does not send the buffers after the barrier. When
the downstream finishes the alignment, it notifies the upstream to continue
sending the following buffers, since it can process them after alignment.
3. The only difference is whether the temporarily blocked buffers are
cached on the downstream side or on the upstream side before alignment.
4. The side effect is the additional notification cost for every barrier
alignment. If the downstream and upstream are deployed in separate
TaskManagers, the cost is a network round trip (the effect can be ignored
based on our testing with a 1s checkpoint interval). With slot sharing, as
in your case, the cost is only one method call in the processor and can be
ignored as well.

You mentioned "In this case, the downstream task has a high average
checkpoint duration (~30s, sync part)." This duration does not reflect the
changes above; it only indicates the time spent in `Operator.snapshotState`.
If this duration is beyond your expectation, you can check or debug whether
the source/sink operators take more time to finish `snapshotState` in
practice. E.g., you can make the implementation of this method empty to
further verify the effect.

Best,
Zhijiang



Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-05 Thread Benchao Li
+1 (non-binding)

Checks:
- verified signature and shasum of release files [OK]
- build from source [OK]
- started standalone cluster, sql-client [mostly OK except one issue]
  - played with sql-client
  - played with new features: LIKE / Table Options
  - checked Web UI functionality
  - canceled job from UI

While playing with the new table factories, I found one issue [1] which
surprised me.
I don't think this should be a blocker, hence I still vote +1.

[1] https://issues.apache.org/jira/browse/FLINK-18487
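
Since the LIKE clause is one of the new 1.11 features exercised above, a
small sketch of the kind of statement involved (table names and options are
made up; same Python entry point as elsewhere in this thread):

from pyflink.table import EnvironmentSettings, StreamTableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(environment_settings=settings)

t_env.execute_sql("""
    CREATE TABLE base_events (
        user_id BIGINT,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'testGroup',
        'format' = 'json'
    )
""")
# Derive a second table that inherits schema and options, overriding only the topic.
t_env.execute_sql("""
    CREATE TABLE events_copy
    WITH ('topic' = 'events-copy')
    LIKE base_events (OVERWRITING OPTIONS)
""")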


Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-05 Thread Xintong Song
+1 (non-binding)

- verified signature and checksum
- build from source
- checked log sanity
- checked webui
- played with memory configurations
- played with binding addresses/ports

Thank you~

Xintong Song




Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-05 Thread Jingsong Li
+1 (non-binding)

- verified signature and checksum
- build from source
- checked webui and log sanity
- played with filesystem and new connectors
- played with Hive connector

Best,
Jingsong

On Mon, Jul 6, 2020 at 9:50 AM Xintong Song  wrote:

> +1 (non-binding)
>
> - verified signature and checksum
> - build from source
> - checked log sanity
> - checked webui
> - played with memory configurations
> - played with binding addresses/ports
>
> Thank you~
>
> Xintong Song
>
>
>
> On Sun, Jul 5, 2020 at 9:41 PM Benchao Li  wrote:
>
> > +1 (non-binding)
> >
> > Checks:
> > - verified signature and shasum of release files [OK]
> > - build from source [OK]
> > - started standalone cluster, sql-client [mostly OK except one issue]
> >   - played with sql-client
> >   - played with new features: LIKE / Table Options
> >   - checked Web UI functionality
> >   - canceled job from UI
> >
> > While I'm playing with the new table factories, I found one issue[1]
> which
> > surprises me.
> > I don't think this should be a blocker, hence I'll still vote my +1.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-18487
> >
> > Zhijiang  于2020年7月5日周日 下午1:10写道:
> >
> > > Hi Thomas,
> > >
> > > Regarding [2], it has more detail infos in the Jira description (
> > > https://issues.apache.org/jira/browse/FLINK-16404).
> > >
> > > I can also give some basic explanations here to dismiss the concern.
> > > 1. In the past, the following buffers after the barrier will be cached
> on
> > > downstream side before alignment.
> > > 2. In 1.11, the upstream would not send the buffers after the barrier.
> > > When the downstream finishes the alignment, it will notify the
> downstream
> > > of continuing sending following buffers, since it can process them
> after
> > > alignment.
> > > 3. The only difference is that the temporary blocked buffers are cached
> > > either on downstream side or on upstream side before alignment.
> > > 4. The side effect would be the additional notification cost for every
> > > barrier alignment. If the downstream and upstream are deployed in
> > separate
> > > TaskManager, the cost is network transport delay (the effect can be
> > ignored
> > > based on our testing with 1s checkpoint interval). For sharing slot in
> > your
> > > case, the cost is only one method call in processor, can be ignored
> also.
> > >
> > > You mentioned "In this case, the downstream task has a high average
> > > checkpoint duration(~30s, sync part)." This duration is not reflecting
> > the
> > > changes above, and it is only indicating the duration for calling
> > > `Operation.snapshotState`.
> > > If this duration is beyond your expectation, you can check or debug
> > > whether the source/sink operations might take more time to finish
> > > `snapshotState` in practice. E.g. you can
> > > make the implementation of this method as empty to further verify the
> > > effect.
> > >
> > > Best,
> > > Zhijiang
> > >
> > >
> > > --
> > > From:Thomas Weise 
> > > Send Time:2020年7月5日(星期日) 12:22
> > > To:dev ; Zhijiang 
> > > Cc:Yingjie Cao 
> > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >
> > > Hi Zhijiang,
> > >
> > > Could you please point me to more details regarding: "[2]: Delay send
> the
> > > following buffers after checkpoint barrier on upstream side until
> barrier
> > > alignment on downstream side."
> > >
> > > In this case, the downstream task has a high average checkpoint
> duration
> > > (~30s, sync part). If there was a change to hold buffers depending on
> > > downstream performance, could this possibly apply to this case (even
> when
> > > there is no shuffle that would require alignment)?
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Sat, Jul 4, 2020 at 7:39 AM Zhijiang  > > .invalid>
> > > wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > Thanks for the further updates.
> > > >
> > > > I guess we can rule out the network stack changes, since in your
> > > > case the downstream and upstream would probably be deployed in the
> > > > same slot, bypassing the network data shuffle.
> > > > I also do not expect release-1.11 to bring a general performance
> > > > regression in the runtime engine, as we ran performance tests for
> > > > all general cases via [1] in a real cluster beforehand, and the
> > > > results met expectations. But as far as I know, we have indeed not
> > > > yet tested the specific source and sink connectors.
> > > >
> > > > Regarding your 40% performance regression, I suspect it is related
> > > > to specific source/sink changes (e.g. Kinesis) or to environment
> > > > issues in a corner case.
> > > > If possible, it would be helpful to further narrow down whether the
> > > > regression is caused by Kinesis, by replacing the Kinesis source &
> > > > sink and keeping everything else the same (e.g. along the lines of
> > > > the sketch below).
> > > >
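> > > > Something like this, for example (stream name, region, and schema
> > > > are placeholders; everything else in the job stays untouched):
> > > >
> > > >     import java.util.Properties;
> > > >     import org.apache.flink.api.common.serialization.SimpleStringSchema;
> > > >     import org.apache.flink.streaming.api.datastream.DataStream;
> > > >     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> > > >     import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
> > > >     import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
> > > >
> > > >     public class IsolateKinesis {
> > > >         public static void main(String[] args) throws Exception {
> > > >             StreamExecutionEnvironment env =
> > > >                     StreamExecutionEnvironment.getExecutionEnvironment();
> > > >
> > > >             Properties props = new Properties();
> > > >             props.setProperty("aws.region", "us-west-2"); // example
> > > >
> > > >             boolean useKinesis = false; // flip between the two runs
> > > >
> > > >             // A/B: suspected connector vs. synthetic source, with
> > > >             // the rest of the pipeline kept identical
> > > >             DataStream<String> source = useKinesis
> > > >                     ? env.addSource(new FlinkKinesisConsumer<>(
> > > >                             "my-stream", new SimpleStringSchema(), props))
> > > >                     : env.generateSequence(0, Long.MAX_VALUE)
> > > >                             .map(l -> String.valueOf(l));
> > > >
> > > >             // discard output so the sink cannot skew the comparison
> > > >             source.addSink(new DiscardingSink<>());
> > > >
> > > >             env.execute("kinesis-isolation-probe");
> > > >         }
> > > >     }
> > > >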
> > > > As you said, it would be efficient to contact you directly next
> 

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-05 Thread Zhijiang
Hi all,

The vote has already lasted for more than 72 hours. Thanks everyone for
helping test and verify the release.
I will finalize the vote result soon in a separate email.

Best,
Zhijiang


--
From: Jingsong Li
Send Time: Monday, July 6, 2020 12:11
To: dev
Subject: Re: [VOTE] Release 1.11.0, release candidate #4

+1 (non-binding)

- verified signature and checksum
- built from source
- checked WebUI and log sanity
- played with filesystem and new connectors
- played with the Hive connector

Best,
Jingsong

On Mon, Jul 6, 2020 at 9:50 AM Xintong Song  wrote:

> +1 (non-binding)
>
> - verified signature and checksum
> - built from source
> - checked log sanity
> - checked WebUI
> - played with memory configurations (see the sketch below)
> - played with binding addresses/ports
>
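> A minimal sketch of the kind of flink-conf.yaml settings this
> exercises (the values are examples only, not recommendations):
>
>     # unified JM memory model, new in 1.11 (FLIP-116)
>     jobmanager.memory.process.size: 1600m
>     # TM memory model from FLIP-49 (since 1.10)
>     taskmanager.memory.process.size: 1728m
>     taskmanager.memory.managed.fraction: 0.4
>
>     # REST endpoint bind address/port
>     rest.bind-address: 0.0.0.0
>     rest.port: 8081
>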
> Thank you~
>
> Xintong Song
>
>
>
> On Sun, Jul 5, 2020 at 9:41 PM Benchao Li  wrote:
>
> > +1 (non-binding)
> >
> > Checks:
> > - verified signature and shasum of release files [OK]
> > > - built from source [OK]
> > > - started standalone cluster, sql-client [mostly OK except one issue]
> > >   - played with sql-client
> > >   - played with new features: LIKE / Table Options (see the sketch below)
> > >   - checked Web UI functionality
> > >   - canceled job from UI
> >
> > > While playing with the new table factories, I found one issue [1]
> > > that surprised me.
> > > I don't think this should be a blocker, hence I'll still give my +1.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-18487
> >

Re: [VOTE] Release 1.11.0, release candidate #4

2020-07-05 Thread Yang Wang
+1 (non-binding)

- Verified building from source
- Ran Flink locally, submitted jobs via CLI and WebUI
- Ran Flink on YARN
   - Tested per-job, session, and application modes
   - Tested the provided Flink lib
   - Tested a remote user jar
- Ran Flink on K8s
   - Standalone yaml submission, including session and application mode
   - Native submission, including session and application mode


Best,
Yang

Zhijiang wrote on Monday, July 6, 2020 at 2:43 PM:

> Hi all,
>
> The vote has already lasted for more than 72 hours. Thanks everyone for
> helping test and verify the release.
> I will finalize the vote result soon in a separate email.
>
> Best,
> Zhijiang
>