Re: [VOTE] Release 1.11.0, release candidate #4

Zhijiang Thu, 02 Jul 2020 23:47:07 -0700

Hi Thomas,

Thanks for your reply with rich information!


We are trying to reproduce your case in our cluster to further verify it, and  
@Yingjie Cao is working on it now.
 As we have not kinesis consumer and producer internally, so we will construct 
the common source and sink instead in the case of backpressure. 

Firstly, we can dismiss the rockdb factor in this release, since you also 
mentioned that "filesystem leads to same symptoms".

Secondly, if my understanding is right, you emphasis that the regression only 
exists for the jobs with low checkpoint interval (10s). 
Based on that, I have two suspicions with the network related changes in this 
release:
    - [1]: Limited the maximum backlog value (default 10) in subpartition 
queue. 
    - [2]: Delay send the following buffers after checkpoint barrier on 
upstream side until barrier alignment on downstream side.

These changes are motivated for reducing the in-flight buffers to speedup 
checkpoint especially in the case of backpressure. 
In theory they should have very minor performance effect and actually we also 
tested in cluster to verify within expectation before merging them,
 but maybe there are other corner cases we have not thought of before.

Before the testing result on our side comes out for your respective job case, I 
have some other questions to confirm for further analysis:
    -  How much percentage regression you found after switching to 1.11?
    -  Are there any network bottleneck in your cluster? E.g. the network 
bandwidth is full caused by other jobs? If so, it might have more effects by 
above [2]
    -  Did you adjust the default network buffer setting? E.g. 
"taskmanager.network.memory.floating-buffers-per-gate" or 
"taskmanager.network.memory.buffers-per-channel"
    -  I guess the topology has three vertexes "KinesisConsumer -> Chained 
FlatMap -> KinesisProducer", and the partition mode for "KinesisConsumer -> 
FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so, the edge 
connection is one-to-one, not all-to-all, then the above [1][2] should no 
effects in theory with default network buffer setting.
    - By slot sharing, I guess these three vertex parallelism task would 
probably be deployed into the same slot, then the data shuffle is by memory 
queue, not network stack. If so, the above [2] should no effect.
    - I also saw some Jira changes for kinesis in this release, could you 
confirm that these changes would not effect the performance?

Best,
Zhijiang


------------------------------------------------------------------
From:Thomas Weise <t...@apache.org>
Send Time:2020年7月3日(星期五) 01:07
To:dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

Hi Zhijiang,

The performance degradation manifests in backpressure which leads to
growing backlog in the source. I switched a few times between 1.10 and 1.11
and the behavior is consistent.

The DAG is:

KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   -------- forward
---------> KinesisProducer

Parallelism: 160
No shuffle/rebalance.

Checkpointing config:

Checkpointing Mode Exactly Once
Interval 10s
Timeout 10m 0s
Minimum Pause Between Checkpoints 10s
Maximum Concurrent Checkpoints 1
Persist Checkpoints Externally Enabled (delete on cancellation)

State backend: rocksdb  (filesystem leads to same symptoms)
Checkpoint size is tiny (500KB)

An interesting difference to another job that I had upgraded successfully
is the low checkpointing interval.

Thanks,
Thomas


On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <wangzhijiang...@aliyun.com.invalid>
wrote:

> Hi Thomas,
>
> Thanks for the efficient feedback.
>
> Regarding the suggestion of adding the release notes document, I agree
> with your point. Maybe we should adjust the vote template accordingly in
> the respective wiki to guide the following release processes.
>
> Regarding the performance regression, could you provide some more details
> for our better measurement or reproducing on our sides?
> E.g. I guess the topology only includes two vertexes source and sink?
> What is the parallelism for every vertex?
> The upstream shuffles data to the downstream via rebalance partitioner or
> other?
> The checkpoint mode is exactly-once with rocksDB state backend?
> The backpressure happened in this case?
> How much percentage regression in this case?
>
> Best,
> Zhijiang
>
>
>
> ------------------------------------------------------------------
> From:Thomas Weise <t...@apache.org>
> Send Time:2020年7月2日(星期四) 09:54
> To:dev <dev@flink.apache.org>
> Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>
> Hi Till,
>
> Yes, we don't have the setting in flink-conf.yaml.
>
> Generally, we carry forward the existing configuration and any change to
> default configuration values would impact the upgrade.
>
> Yes, since it is an incompatible change I would state it in the release
> notes.
>
> Thanks,
> Thomas
>
> BTW I found a performance regression while trying to upgrade another
> pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't able
> to pin it down yet, symptoms include increased checkpoint alignment time.
>
> On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Hi Thomas,
> >
> > just to confirm: When starting the image in local mode, then you don't
> have
> > any of the JobManager memory configuration settings configured in the
> > effective flink-conf.yaml, right? Does this mean that you have explicitly
> > removed `jobmanager.heap.size: 1024m` from the default configuration? If
> > this is the case, then I believe it was more of an unintentional artifact
> > that it worked before and it has been corrected now so that one needs to
> > specify the memory of the JM process explicitly. Do you think it would
> help
> > to explicitly state this in the release notes?
> >
> > Cheers,
> > Till
> >
> > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <t...@apache.org> wrote:
> >
> > > Thanks for preparing another RC!
> > >
> > > As mentioned in the previous RC thread, it would be super helpful if
> the
> > > release notes that are part of the documentation can be included [1].
> > It's
> > > a significant time-saver to have read those first.
> > >
> > > I found one more non-backward compatible change that would be worth
> > > addressing/mentioning:
> > >
> > > It is now necessary to configure the jobmanager heap size in
> > > flink-conf.yaml (with either jobmanager.heap.size
> > > or jobmanager.memory.heap.size). Why would I not want to do that
> anyways?
> > > Well, we set it dynamically for a cluster deployment via the
> > > flinkk8soperator, but the container image can also be used for testing
> > with
> > > local mode (./bin/jobmanager.sh start-foreground local). That will fail
> > if
> > > the heap wasn't configured and that's how I noticed it.
> > >
> > > Thanks,
> > > Thomas
> > >
> > > [1]
> > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > >
> > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <wangzhijiang...@aliyun.com
> > > .invalid>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #4 for the version
> > > 1.11.0,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "release-1.11.0-rc4" [5],
> > > > * website pull request listing the new release and adding
> announcement
> > > > blog post [6].
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > [4]
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > > [6] https://github.com/apache/flink-web/pull/352
> > > >
> > > >
> > >
> >
>
>

Re: [VOTE] Release 1.11.0, release candidate #4

Reply via email to