Thanks Thomas for reporting the problem, analysing which commit caused it, and now for verifying that it was fixed :) Much appreciated.
Piotrek

On Thu, 13 Aug 2020 at 18:18, Thomas Weise <t...@apache.org> wrote:

> Hi Roman,
>
> Thanks for working on this! I deployed the change and it appears to be working as expected.
>
> Will monitor over a period of time to compare the checkpoint counts and get back to you if there are still issues.
>
> Thomas
>
> On Thu, Aug 13, 2020 at 3:41 AM Roman Khachatryan <ro...@data-artisans.com> wrote:
>
> > Hi Thomas,
> >
> > The fix is now merged to master and to release-1.11.
> > So if you'd like, you can check whether it solves your problem (it would be helpful for us too).
> >
> > On Sat, Aug 8, 2020 at 9:26 AM Roman Khachatryan <ro...@data-artisans.com> wrote:
> >
> >> Hi Thomas,
> >>
> >> Thanks a lot for the detailed information.
> >>
> >> I think the problem is in CheckpointCoordinator. It stores the last checkpoint completion time after checking queued requests.
> >> I've created a ticket to fix this: https://issues.apache.org/jira/browse/FLINK-18856
> >>
> >> On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise <t...@apache.org> wrote:
> >>
> >>> Just another update:
> >>>
> >>> The duration of snapshotState is capped by the Kinesis producer's "RecordTtl" setting (default 30s). The sleep time in flushSync does not contribute to the observed behavior.
> >>>
> >>> I guess the open question is why, with the same settings, 1.11 since commit 355184d69a8519d29937725c8d85e8465d7e3a90 is processing more checkpoints?
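Roman's diagnosis above (FLINK-18856) can be illustrated with a small model. This is a hypothetical simplification, not actual Flink code — the class and method names (`MinPauseModel`, `completeBuggy`, `completeFixed`) are invented for illustration. The point: if the coordinator examines queued checkpoint requests *before* recording the completion timestamp, the min-pause gate compares against a stale timestamp and lets the queued checkpoint fire immediately.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the FLINK-18856 ordering issue; not actual Flink code.
public class MinPauseModel {
    private final long minPauseMs;
    private long lastCompletionTime = Long.MIN_VALUE / 2; // "long ago"
    final Deque<Long> queuedRequests = new ArrayDeque<>();

    public MinPauseModel(long minPauseMs) { this.minPauseMs = minPauseMs; }

    private boolean minPauseElapsed(long now) {
        return now - lastCompletionTime >= minPauseMs;
    }

    // Buggy order: examine queued requests first, record completion time last.
    // The min-pause check still sees the *previous* completion time.
    public boolean completeBuggy(long now) {
        boolean triggerQueued = !queuedRequests.isEmpty() && minPauseElapsed(now);
        lastCompletionTime = now;
        return triggerQueued;
    }

    // Fixed order: record completion time first, then examine the queue,
    // so a queued request has to wait out the min pause.
    public boolean completeFixed(long now) {
        lastCompletionTime = now;
        return !queuedRequests.isEmpty() && minPauseElapsed(now);
    }

    public static void main(String[] args) {
        MinPauseModel buggy = new MinPauseModel(10_000);
        buggy.queuedRequests.add(1L);
        System.out.println("buggy fires queued checkpoint at once: " + buggy.completeBuggy(100_000));

        MinPauseModel fixed = new MinPauseModel(10_000);
        fixed.queuedRequests.add(1L);
        System.out.println("fixed fires queued checkpoint at once: " + fixed.completeFixed(100_000));
    }
}
```

With a 10s min pause, the buggy ordering triggers the queued checkpoint immediately on completion, while the fixed ordering defers it — which matches the extra checkpoints Thomas observed with a 10s min pause and a ~30s sync phase.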
> >>>
> >>> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <t...@apache.org> wrote:
> >>>
> >>>> Hi Roman,
> >>>>
> >>>> Here are the checkpoint summaries for both commits:
> >>>>
> >>>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
> >>>>
> >>>> The config:
> >>>>
> >>>> CheckpointConfig checkpointConfig = env.getCheckpointConfig();
> >>>> checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
> >>>> checkpointConfig.setCheckpointInterval(*10_000*);
> >>>> checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
> >>>> checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
> >>>> checkpointConfig.setCheckpointTimeout(600_000);
> >>>> checkpointConfig.setMaxConcurrentCheckpoints(1);
> >>>> checkpointConfig.setFailOnCheckpointingErrors(true);
> >>>>
> >>>> The values marked bold, when changed to *60_000*, make the symptom disappear. I meanwhile also verified that with the 1.11.0 release commit.
> >>>>
> >>>> I will take a look at the sleep time issue.
> >>>>
> >>>> Thanks,
> >>>> Thomas
> >>>>
> >>>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <ro...@data-artisans.com> wrote:
> >>>>
> >>>>> Hi Thomas,
> >>>>>
> >>>>> Thanks for your reply!
> >>>>>
> >>>>> I think you are right, we can remove this sleep and improve KinesisProducer.
> >>>>> Probably, its snapshotState can also be sped up by forcing records to flush more often.
> >>>>> Do you see that the 30s checkpointing duration is caused by KinesisProducer (or maybe other operators)?
> >>>>>
> >>>>> I'd also like to understand the reason behind this increase in checkpoint frequency.
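For context, the interaction between the config quoted above (10s interval, 10s min pause) and the ~30s sync phase can be sketched with back-of-the-envelope arithmetic. This is an illustrative model only (the class and method names are invented, not Flink API), assuming max-concurrent-checkpoints = 1 so a new checkpoint starts only after the previous one completes and the min pause has elapsed:

```java
// Illustrative model (not Flink code): effective time between checkpoint
// starts when max-concurrent-checkpoints = 1 and min pause is honored.
public class CheckpointSpacing {
    // A checkpoint starts no more often than the configured interval, and no
    // earlier than previous-duration + min-pause after the previous start.
    public static long effectivePeriodMs(long intervalMs, long minPauseMs, long durationMs) {
        return Math.max(intervalMs, durationMs + minPauseMs);
    }

    public static void main(String[] args) {
        // Thomas's setup: 10s interval, 10s min pause, ~30s (mostly sync) duration
        System.out.println(effectivePeriodMs(10_000, 10_000, 30_000)); // 40000
        // The 60s workaround: 60s interval, 60s min pause
        System.out.println(effectivePeriodMs(60_000, 60_000, 30_000)); // 90000
    }
}
```

Under correct min-pause semantics the 10s settings should still yield roughly one checkpoint per ~40s; checkpoints arriving noticeably faster than that is what points at the min-pause gate being bypassed.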
> >>>>> Can you please share these values:
> >>>>> - execution.checkpointing.min-pause
> >>>>> - execution.checkpointing.max-concurrent-checkpoints
> >>>>> - execution.checkpointing.timeout
> >>>>>
> >>>>> And what is the "new" observed checkpoint frequency (or how many checkpoints are created) compared to older versions?
> >>>>>
> >>>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <t...@apache.org> wrote:
> >>>>>
> >>>>>> Hi Roman,
> >>>>>>
> >>>>>> Indeed there are more frequent checkpoints with this change! The application was configured to checkpoint every 10s. With 1.10 ("good commit"), that leads to fewer completed checkpoints compared to 1.11 ("bad commit"). Just to be clear, the only difference between the two runs was the commit 355184d69a8519d29937725c8d85e8465d7e3a90
> >>>>>>
> >>>>>> Since the sync part of checkpoints with the Kinesis producer always takes ~30 seconds, the 10s configured checkpoint frequency really had no effect before 1.11. I confirmed that both commits perform comparably by setting the checkpoint frequency and min pause to 60s.
> >>>>>>
> >>>>>> I still have to verify with the final 1.11.0 release commit.
> >>>>>>
> >>>>>> It's probably good to take a look at the Kinesis producer. Is it really necessary to have a 500ms sleep time? What's responsible for the ~30s duration in snapshotState?
> >>>>>>
> >>>>>> As things stand, it doesn't make sense to use checkpoint intervals < 30s when using the Kinesis producer.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Thomas
> >>>>>>
> >>>>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <ro...@data-artisans.com> wrote:
> >>>>>>
> >>>>>> > Hi Thomas,
> >>>>>> >
> >>>>>> > Thanks a lot for the analysis.
> >>>>>> >
> >>>>>> > The first thing that I'd check is whether checkpoints became more frequent with this commit (as each of them adds at least 500ms if there is at least one unsent record, according to FlinkKinesisProducer.snapshotState).
> >>>>>> >
> >>>>>> > Can you share checkpointing statistics (1.10 vs 1.11, or last "good" vs first "bad" commits)?
> >>>>>> >
> >>>>>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <thomas.we...@gmail.com> wrote:
> >>>>>> >
> >>>>>> > > I ran git bisect and the first commit that shows the regression is:
> >>>>>> > >
> >>>>>> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
> >>>>>> > >
> >>>>>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <ykt...@gmail.com> wrote:
> >>>>>> > >
> >>>>>> > > > From my experience, Java profilers are sometimes not accurate enough to find the root cause of a performance regression. In this case, I would suggest you try out Intel VTune Amplifier to watch more detailed metrics.
> >>>>>> > > >
> >>>>>> > > > Best,
> >>>>>> > > > Kurt
> >>>>>> > > >
> >>>>>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <t...@apache.org> wrote:
> >>>>>> > > >
> >>>>>> > > > > The cause of the issue is anything but clear.
> >>>>>> > > > >
> >>>>>> > > > > Previously I had mentioned that there is no suspect change to the Kinesis connector and that I had reverted the AWS SDK change to no effect.
> >>>>>> > > > >
> >>>>>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another regression in the previous release and is present before and after.
> >>>>>> > > > >
> >>>>>> > > > > I repeated the run with 1.11.0 core and downgraded the entire Kinesis connector to 1.10.1: nothing changes, i.e. the regression is still present. Therefore we will need to look elsewhere for the root cause.
> >>>>>> > > > >
> >>>>>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a wide range for both versions, 1.10 and 1.11. So again this is nothing pointing to a root cause.
> >>>>>> > > > >
> >>>>>> > > > > At this point, I have no ideas remaining other than doing a bisect to find the culprit. Any other suggestions?
> >>>>>> > > > >
> >>>>>> > > > > Thomas
> >>>>>> > > > >
> >>>>>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> >>>>>> > > > >
> >>>>>> > > > > > Hi Thomas,
> >>>>>> > > > > >
> >>>>>> > > > > > Thanks for your further profiling information, and glad to see we have already narrowed down the location causing the regression.
> >>>>>> > > > > > Actually I was also suspicious of #snapshotState in previous discussions, since it indeed costs much time and blocks normal operator processing.
> >>>>>> > > > > >
> >>>>>> > > > > > Based on your feedback below, the sleep time during #snapshotState might be the main concern, and I also dug into the implementation of FlinkKinesisProducer#snapshotState:
> >>>>>> > > > > >
> >>>>>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
> >>>>>> > > > > >     producer.flush();
> >>>>>> > > > > >     try {
> >>>>>> > > > > >         Thread.sleep(500);
> >>>>>> > > > > >     } catch (InterruptedException e) {
> >>>>>> > > > > >         LOG.warn("Flushing was interrupted.");
> >>>>>> > > > > >         break;
> >>>>>> > > > > >     }
> >>>>>> > > > > > }
> >>>>>> > > > > >
> >>>>>> > > > > > It seems that the sleep time is mainly affected by the internal operations inside the KinesisProducer implementation provided by amazonaws, which I am not quite familiar with.
> >>>>>> > > > > > But I noticed there were two related upgrades in release-1.11.0. One upgraded amazon-kinesis-producer to 0.14.0 [1] and the other upgraded the aws-sdk-version to 1.11.754 [2].
> >>>>>> > > > > > You mentioned that you already reverted the SDK upgrade and verified no changes. Did you also revert [1] to verify?
> >>>>>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> >>>>>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> >>>>>> > > > > >
> >>>>>> > > > > > Best,
> >>>>>> > > > > > Zhijiang
> >>>>>> > > > > >
> >>>>>> > > > > > ------------------------------------------------------------------
> >>>>>> > > > > > From: Thomas Weise <t...@apache.org>
> >>>>>> > > > > > Send Time: Friday, 17 Jul 2020, 05:29
> >>>>>> > > > > > To: dev <dev@flink.apache.org>
> >>>>>> > > > > > Cc: Zhijiang <wangzhijiang...@aliyun.com>; Stephan Ewen <se...@apache.org>; Arvid Heise <ar...@ververica.com>; Aljoscha Krettek <aljos...@apache.org>
> >>>>>> > > > > > Subject: Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)
> >>>>>> > > > > >
> >>>>>> > > > > > Sorry for the delay.
> >>>>>> > > > > >
> >>>>>> > > > > > I confirmed that the regression is due to the sink (unsurprising, since another job with the same consumer, but not the producer, runs as expected).
> >>>>>> > > > > >
> >>>>>> > > > > > As promised, I did CPU profiling on the problematic application, which gives more insight into the regression [1]
> >>>>>> > > > > >
> >>>>>> > > > > > The screenshots show that the average time for snapshotState increases from ~9s to ~28s. The data also shows the increase in sleep time during snapshotState.
> >>>>>> > > > > >
> >>>>>> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
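The flush loop quoted above polls `getOutstandingRecordsCount()` every 500 ms, so the sync phase is quantized to 500 ms steps even when the backlog drains quickly — which is what Roman's suggestion to "remove this sleep" targets. A minimal sketch of a shorter-polling variant, against a stand-in interface (`KinesisLikeProducer` is an assumption for illustration, not the real KPL API):

```java
// Sketch only: KinesisLikeProducer is a stand-in, not the actual KPL API.
public class FlushLoop {
    public interface KinesisLikeProducer {
        long getOutstandingRecordsCount();
        void flush();
    }

    // Same structure as the quoted loop, but with a configurable (shorter)
    // poll interval; returns the number of flush() calls for observability.
    public static int flushSync(KinesisLikeProducer producer, long pollMs) {
        int flushes = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            flushes++;
            if (producer.getOutstandingRecordsCount() == 0) {
                break; // drained: skip the final sleep
            }
            try {
                Thread.sleep(pollMs); // e.g. 50 ms instead of 500 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        // Fake producer whose backlog shrinks by one record per flush.
        final long[] outstanding = {3};
        KinesisLikeProducer fake = new KinesisLikeProducer() {
            public long getOutstandingRecordsCount() { return outstanding[0]; }
            public void flush() { outstanding[0]--; }
        };
        System.out.println("flushes: " + flushSync(fake, 1));
    }
}
```

Note that, as Thomas later found, the ~30s sync duration was dominated by the KPL's "RecordTtl" (default 30s) rather than this sleep, so shortening the poll alone would not have removed the 30s cap.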
> >>>>>> > > > > >
> >>>>>> > > > > > I had previously looked at the changes to the Kinesis connector and also reverted the SDK upgrade, which did not change the situation.
> >>>>>> > > > > >
> >>>>>> > > > > > It will likely be necessary to drill into the sink / checkpointing details to understand the cause of the problem.
> >>>>>> > > > > >
> >>>>>> > > > > > Let me know if anyone has specific questions that I can answer from the profiling results.
> >>>>>> > > > > >
> >>>>>> > > > > > Thomas
> >>>>>> > > > > >
> >>>>>> > > > > > [1] https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> >>>>>> > > > > >
> >>>>>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <t...@apache.org> wrote:
> >>>>>> > > > > >
> >>>>>> > > > > > > + dev@ for visibility
> >>>>>> > > > > > >
> >>>>>> > > > > > > I will investigate further today.
> >>>>>> > > > > > >
> >>>>>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <aljos...@apache.org> wrote:
> >>>>>> > > > > > >
> >>>>>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> >>>>>> > > > > > >> > - Did sink checkpoint notifications change in a relevant way, for example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >>>>>> > > > > > >>
> >>>>>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in Kafka, and the one bug I discovered on the way was about the Task reaper.
> >>>>>> > > > > > >>
> >>>>>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
> >>>>>> > > > > > >> > Sorry for my misunderstanding of the previous information, Thomas. I was assuming that the sync checkpoint duration increased after the upgrade, as was mentioned before.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > If I remember correctly, the memory state backend also has the same issue? If so, we can dismiss the RocksDB state changes. As slot sharing is enabled, the downstream and upstream should probably be deployed into the same slot, so there is no network shuffle effect.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I think we need to find out whether other symptoms changed besides the performance regression, to further narrow the scope.
> >>>>>> > > > > > >> > E.g. any metrics changes, or changes in the number of TaskManagers and the number of slots per TaskManager from the deployment.
> >>>>>> > > > > > >> > A 40% regression is really big; I guess the changes should also be reflected in other places.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I am not sure whether we can reproduce the regression in our AWS environment by writing some Kinesis jobs, since as Thomas mentioned there are also Kinesis jobs that behave normally after the upgrade.
> >>>>>> > > > > > >> > So it probably looks like it touches some corner case.
> >>>>>> > > > > > >> > I am very willing to provide any help for debugging if possible.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Zhijiang
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > ------------------------------------------------------------------
> >>>>>> > > > > > >> > From: Thomas Weise <t...@apache.org>
> >>>>>> > > > > > >> > Send Time: Tuesday, 7 Jul 2020, 23:01
> >>>>>> > > > > > >> > To: Stephan Ewen <se...@apache.org>
> >>>>>> > > > > > >> > Cc: Aljoscha Krettek <aljos...@apache.org>; Arvid Heise <ar...@ververica.com>; Zhijiang <wangzhijiang...@aliyun.com>
> >>>>>> > > > > > >> > Subject: Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have one job that works as expected after the upgrade, and the one discussed here that has the performance regression.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > "The performance regression is obviously caused by the long duration of the sync checkpoint process in the Kinesis sink operator, which would block normal data processing until back pressure reaches the source."
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > That's a constant. Before (1.10) and after the upgrade have the same sync checkpointing time. The question is what change came in with the upgrade.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <se...@apache.org> wrote:
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > @Thomas Just one thing real quick: Are you using the standalone setup scripts (like start-cluster.sh, and the former "slaves" file)?
> >>>>>> > > > > > >> > Be aware that this is now called "workers", to avoid sensitive names.
> >>>>>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown initially, before seeing that the cluster was not a distributed cluster any more ;-)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <wangzhijiang...@aliyun.com> wrote:
> >>>>>> > > > > > >> > Thanks for this kickoff and the help with analysis, Stephan!
> >>>>>> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > The performance regression is obviously caused by the long duration of the sync checkpoint process in the Kinesis sink operator, which would block normal data processing until back pressure reaches the source.
> >>>>>> > > > > > >> > Maybe we could dig into the sync part of the checkpoint process, e.g. break down the steps inside the respective operator#snapshotState to measure which operation costs most of the time; then we might find the root cause of that cost.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Looking forward to the further progress. :)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Zhijiang
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > ------------------------------------------------------------------
> >>>>>> > > > > > >> > From: Stephan Ewen <se...@apache.org>
> >>>>>> > > > > > >> > Send Time: Tuesday, 7 Jul 2020, 14:52
> >>>>>> > > > > > >> > To: Thomas Weise <t...@apache.org>
> >>>>>> > > > > > >> > Cc: Stephan Ewen <se...@apache.org>; Zhijiang <wangzhijiang...@aliyun.com>; Aljoscha Krettek <aljos...@apache.org>; Arvid Heise <ar...@ververica.com>
> >>>>>> > > > > > >> > Subject: Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Thank you for digging so deeply.
> >>>>>> > > > > > >> > Mysterious thing, this regression.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <t...@apache.org> wrote:
> >>>>>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged between 1.10 and 1.11 for the specific pipeline).
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I verified that increasing the checkpointing interval does not make a difference.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1 and don't see anything that could cause this.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Another pipeline that is using the Kinesis consumer (but not the producer) performs as expected.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I tried reverting the AWS SDK version change; symptoms remain unchanged:
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> >>>>>> > > > > > >> > index a6abce23ba..741743a05e 100644
> >>>>>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> >>>>>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> >>>>>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
> >>>>>> > > > > > >> >  	<artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> >>>>>> > > > > > >> >  	<name>flink-connector-kinesis</name>
> >>>>>> > > > > > >> >  	<properties>
> >>>>>> > > > > > >> > -		<aws.sdk.version>1.11.754</aws.sdk.version>
> >>>>>> > > > > > >> > +		<aws.sdk.version>1.11.603</aws.sdk.version>
> >>>>>> > > > > > >> >  		<aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> >>>>>> > > > > > >> >  		<aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> >>>>>> > > > > > >> >  		<aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I'm planning to take a look with a profiler next.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Thomas
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <se...@apache.org> wrote:
> >>>>>> > > > > > >> > Hi all!
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Forking this thread out of the release vote thread.
> >>>>>> > > > > > >> > From what Thomas describes, it really sounds like a sink-specific issue.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > @Thomas: When you say the sink has a long synchronous checkpoint time, do you mean the time that is shown as "sync time" in the metrics and web UI? That does not include any network buffer related operations; it is purely the operator's time.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Can we dig into the changes we did in sinks:
> >>>>>> > > > > > >> >   - Kinesis version upgrade, AWS library updates
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   - Could it be that some call (checkpoint complete) that was previously (1.10) in a separate thread is now in the mailbox, and this simply reduces the number of threads that do the work?
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   - Did sink checkpoint notifications change in a relevant way, for example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Stephan
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> >>>>>> > > > > > >> > Hi Thomas,
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Regarding [2], there are more details in the Jira description (https://issues.apache.org/jira/browse/FLINK-16404).
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I can also give some basic explanations here to dismiss the concern.
> >>>>>> > > > > > >> > 1. In the past, the buffers following the barrier were cached on the downstream side before alignment.
> >>>>>> > > > > > >> > 2. In 1.11, the upstream does not send the buffers after the barrier. When the downstream finishes the alignment, it notifies the upstream to continue sending the following buffers, since it can process them after alignment.
> >>>>>> > > > > > >> > 3. The only difference is that the temporarily blocked buffers are cached either on the downstream side or on the upstream side before alignment.
> >>>>>> > > > > > >> > 4. The side effect is the additional notification cost for every barrier alignment. If the downstream and upstream are deployed in separate TaskManagers, the cost is network transport delay (the effect can be ignored based on our testing with a 1s checkpoint interval). For slot sharing, as in your case, the cost is only one method call in the processor and can be ignored as well.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > You mentioned "In this case, the downstream task has a high average checkpoint duration (~30s, sync part)."
> >>>>>> > > > > > >> > This duration does not reflect the changes above; it only indicates the duration of calling `Operation.snapshotState`.
> >>>>>> > > > > > >> > If this duration is beyond your expectation, you can check or debug whether the source/sink operations might take more time to finish `snapshotState` in practice. E.g. you can make the implementation of this method empty to further verify the effect.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Zhijiang
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > ------------------------------------------------------------------
> >>>>>> > > > > > >> > From: Thomas Weise <t...@apache.org>
> >>>>>> > > > > > >> > Send Time: Sunday, 5 Jul 2020, 12:22
> >>>>>> > > > > > >> > To: dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
> >>>>>> > > > > > >> > Cc: Yingjie Cao <kevin.ying...@gmail.com>
> >>>>>> > > > > > >> > Subject: Re: [VOTE] Release 1.11.0, release candidate #4
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Hi Zhijiang,
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Could you please point me to more details regarding: "[2]: Delay sending the buffers following a checkpoint barrier on the upstream side until barrier alignment on the downstream side."
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > In this case, the downstream task has a high average checkpoint duration (~30s, sync part). If there was a change to hold buffers depending on downstream performance, could this possibly apply to this case (even when there is no shuffle that would require alignment)?
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Thanks,
> >>>>>> > > > > > >> > Thomas
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > > Hi Thomas,
> >>>>>> > > > > > >> > >
> >>>>>> > > > > > >> > > Thanks for the further update.
> >>>>>> > > > > > >> > >
> >>>>>> > > > > > >> > > I guess we can dismiss the network stack changes, since in your case the downstream and upstream would probably be deployed in the same slot, bypassing the network data shuffle.
> >>>>>> > > > > > >> > > Also, I guess release-1.11 does not bring a general performance regression in the runtime engine, as we did performance testing for all general cases via [1] in a real cluster before, and the results met expectations. But we indeed did not test the specific source and sink connectors yet, as far as I know.
> >>>>>> > > > > > >> > >
> >>>>>> > > > > > >> > > Regarding your performance regression of 40%, I wonder whether it is related to specific source/sink changes (e.g. Kinesis) or to environment issues in a corner case.
> >>>>>> > > > > > >> > > If possible, it would be helpful to further locate whether the regression is caused by Kinesis, by replacing the Kinesis source & sink and keeping the others the same.
> >>>>>> > > > > > >> > >
> >>>>>> > > > > > >> > > As you said, it would be efficient to contact you directly next week to further discuss this issue. And we are willing/eager to provide any help to resolve this issue soon.
> >>>>>> > > > > > >> > >
> >>>>>> > > > > > >> > > Besides that, I guess this issue should not be a blocker for the release, since it is probably a corner case based on the current analysis.
> >>>>>> > > > > > >> > > If we really conclude that anything needs to be resolved after the final release, then we can also make the next minor release-1.11.1 come soon.
> >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > [1] > >>>>>> https://issues.apache.org/jira/browse/FLINK-18433 > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > Best, > >>>>>> > > > > > >> > > Zhijiang > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > > >>>>>> > > > > > >>>>>> ------------------------------------------------------------------ > >>>>>> > > > > > >> > > From:Thomas Weise <t...@apache.org> > >>>>>> > > > > > >> > > Send Time:2020年7月4日(星期六) 12:26 > >>>>>> > > > > > >> > > To:dev <dev@flink.apache.org>; Zhijiang < > >>>>>> > > > > > wangzhijiang...@aliyun.com > >>>>>> > > > > > >> > > >>>>>> > > > > > >> > > Cc:Yingjie Cao <kevin.ying...@gmail.com> > >>>>>> > > > > > >> > > Subject:Re: [VOTE] Release 1.11.0, release > >>>>>> candidate #4 > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > Hi Zhijiang, > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > It will probably be best if we connect next week > >>>>>> and > >>>>>> > discuss > >>>>>> > > > the > >>>>>> > > > > > >> issue > >>>>>> > > > > > >> > > directly since this could be quite difficult to > >>>>>> reproduce. > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > Before the testing result on our side comes out > >>>>>> for your > >>>>>> > > > > > respective > >>>>>> > > > > > >> job > >>>>>> > > > > > >> > > case, I have some other questions to confirm for > >>>>>> further > >>>>>> > > > > analysis: > >>>>>> > > > > > >> > > - How much percentage regression you found > >>>>>> after > >>>>>> > > > switching > >>>>>> > > > > to > >>>>>> > > > > > >> 1.11? > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > ~40% throughput decline > >>>>>> > > > > > >> > > > >>>>>> > > > > > >> > > - Are there any network bottleneck in your > >>>>>> cluster? > >>>>>> > > E.g. > >>>>>> > > > > the > >>>>>> > > > > > >> network > >>>>>> > > > > > >> > > bandwidth is full caused by other jobs? 
> If so, it might have more effects from [2] above.

The test runs on a k8s cluster that is also used for other production jobs.
There is no reason to believe the network is the bottleneck.

> - Did you adjust the default network buffer settings? E.g.
> "taskmanager.network.memory.floating-buffers-per-gate" or
> "taskmanager.network.memory.buffers-per-channel"

The job is using the defaults, i.e. we don't configure these settings. If
you want me to try specific settings in the hope that it will help to
isolate the issue, please let me know.

> - I guess the topology has three vertexes "KinesisConsumer -> Chained
> FlatMap -> KinesisProducer", and the partition modes for "KinesisConsumer
> -> FlatMap" and "FlatMap -> KinesisProducer" are both "forward"? If so,
> the edge connection is one-to-one, not all-to-all, so the above [1][2]
> should have no effect in theory with the default network buffer settings.

There are only 2 vertices and the edge is "forward".
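For context, "the defaults" here can be spelled out as flink-conf.yaml entries. The values below are the documented 1.11 defaults as far as I can tell, so treat this as a sketch and verify against the configuration reference:

```yaml
# Network buffer settings left at their defaults in this job.
# Values believed to be the Flink 1.11 defaults -- double-check the docs.

# Exclusive buffers per outgoing/incoming channel:
taskmanager.network.memory.buffers-per-channel: 2
# Extra floating buffers shared per input gate:
taskmanager.network.memory.floating-buffers-per-gate: 8
```

Lowering these reduces in-flight data (and thus alignment time) at the cost of throughput under backpressure, which is why they come up in this discussion.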
> - By slot sharing, I guess these vertex tasks would probably be deployed
> into the same slot, so the data shuffle goes through the in-memory queue,
> not the network stack. If so, the above [2] should have no effect.

Yes, vertices share slots.

> - I also saw some Jira changes for Kinesis in this release, could you
> confirm that these changes do not affect performance?

I will need to take a look. 1.10 already had a regression introduced by the
Kinesis producer update.

Thanks,
Thomas


On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:

Hi Thomas,

Thanks for your reply with rich information!

We are trying to reproduce your case in our cluster to further verify it,
and @Yingjie Cao is working on it now. As we don't have a Kinesis consumer
and producer internally, we will construct a common source and sink instead
for the backpressure case.

Firstly, we can dismiss the RocksDB factor in this release, since you also
mentioned that "filesystem leads to same symptoms".

Secondly, if my understanding is right, you emphasize that the regression
only exists for jobs with a low checkpoint interval (10s). Based on that, I
have two suspicions related to the network changes in this release:
- [1]: Limited the maximum backlog value (default 10) in the subpartition
  queue.
- [2]: Delayed sending the buffers that follow a checkpoint barrier on the
  upstream side until barrier alignment on the downstream side.

These changes are motivated by reducing the in-flight buffers to speed up
checkpoints, especially in the case of backpressure. In theory they should
have a very minor performance effect, and we tested them in a cluster to
verify they were within expectation before merging, but maybe there are
corner cases we have not thought of.

Before the testing result on our side comes out for your respective job
case, I have some other questions to confirm for further analysis:
- How much percentage regression did you find after switching to 1.11?
- Are there any network bottlenecks in your cluster? E.g. is the network
  bandwidth saturated by other jobs? If so, it might have more effects from
  [2] above.
- Did you adjust the default network buffer settings? E.g.
  "taskmanager.network.memory.floating-buffers-per-gate" or
  "taskmanager.network.memory.buffers-per-channel"
- I guess the topology has three vertexes "KinesisConsumer -> Chained
  FlatMap -> KinesisProducer", and the partition modes for "KinesisConsumer
  -> FlatMap" and "FlatMap -> KinesisProducer" are both "forward"? If so,
  the edge connection is one-to-one, not all-to-all, so the above [1][2]
  should have no effect in theory with the default network buffer settings.
- By slot sharing, I guess these vertex tasks would probably be deployed
  into the same slot, so the data shuffle goes through the in-memory queue,
  not the network stack. If so, the above [2] should have no effect.
- I also saw some Jira changes for Kinesis in this release, could you
  confirm that these changes do not affect performance?

Best,
Zhijiang

------------------------------------------------------------------
From: Thomas Weise <t...@apache.org>
Send Time: Fri, Jul 3, 2020 01:07
To: dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
Subject: Re: [VOTE] Release 1.11.0, release candidate #4

Hi Zhijiang,

The performance degradation manifests in backpressure, which leads to a
growing backlog in the source.
I switched a few times between 1.10 and 1.11 and the behavior is consistent.

The DAG is:

KinesisConsumer -> (Flat Map, Flat Map, Flat Map) ---- forward ----> KinesisProducer

Parallelism: 160
No shuffle/rebalance.

Checkpointing config:

Checkpointing Mode: Exactly Once
Interval: 10s
Timeout: 10m 0s
Minimum Pause Between Checkpoints: 10s
Maximum Concurrent Checkpoints: 1
Persist Checkpoints Externally: Enabled (delete on cancellation)

State backend: rocksdb (filesystem leads to the same symptoms)
Checkpoint size is tiny (500KB)

An interesting difference to another job that I had upgraded successfully
is the low checkpointing interval.
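For completeness, roughly the same settings expressed as configuration entries. The execution.checkpointing.* key names are the 1.11-era options as far as I can tell; double-check them against the configuration docs before relying on them:

```yaml
# Sketch of the checkpointing setup above as flink-conf.yaml entries
# (key names assumed from the 1.11 execution.checkpointing.* options).
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.interval: 10s
execution.checkpointing.timeout: 10min
execution.checkpointing.min-pause: 10s
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
state.backend: rocksdb
```

The 10s interval combined with the 10s minimum pause is the combination that distinguishes this job from the one that upgraded cleanly.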
Thanks,
Thomas


On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:

Hi Thomas,

Thanks for the efficient feedback.

Regarding the suggestion of adding the release notes document, I agree with
your point. Maybe we should adjust the vote template accordingly in the
respective wiki to guide the following release processes.

Regarding the performance regression, could you provide some more details
for better measurement or reproduction on our side? E.g.:
- I guess the topology only includes two vertexes, source and sink?
- What is the parallelism of every vertex?
- Does the upstream shuffle data to the downstream via the rebalance
  partitioner or another one?
- Is the checkpoint mode exactly-once with the RocksDB state backend?
- Did backpressure happen in this case?
- How much percentage regression in this case?

Best,
Zhijiang

------------------------------------------------------------------
From: Thomas Weise <t...@apache.org>
Send Time: Thu, Jul 2, 2020 09:54
To: dev <dev@flink.apache.org>
Subject: Re: [VOTE] Release 1.11.0, release candidate #4

Hi Till,

Yes, we don't have the setting in flink-conf.yaml.

Generally, we carry forward the existing configuration, and any change to
default configuration values would impact the upgrade.

Yes, since it is an incompatible change I would state it in the release
notes.

Thanks,
Thomas

BTW I found a performance regression while trying to upgrade another
pipeline with this RC. It is a simple Kinesis to Kinesis job. I wasn't able
to pin it down yet; symptoms include increased checkpoint alignment time.

On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Thomas,

just to confirm: When starting the image in local mode, you don't have any
of the JobManager memory configuration settings in the effective
flink-conf.yaml, right? Does this mean that you have explicitly removed
`jobmanager.heap.size: 1024m` from the default configuration? If this is
the case, then I believe it was more of an unintentional artifact that it
worked before, and it has been corrected now so that one needs to specify
the memory of the JM process explicitly. Do you think it would help to
explicitly state this in the release notes?
Cheers,
Till

On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <t...@apache.org> wrote:

Thanks for preparing another RC!

As mentioned in the previous RC thread, it would be super helpful if the
release notes that are part of the documentation can be included [1]. It's
a significant time-saver to have read those first.

I found one more non-backward compatible change that would be worth
addressing/mentioning:

It is now necessary to configure the jobmanager heap size in
flink-conf.yaml (with either jobmanager.heap.size or
jobmanager.memory.heap.size). Why would I not want to do that anyways?
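A minimal sketch of the two alternatives in flink-conf.yaml, with jobmanager.heap.size as the legacy key and jobmanager.memory.heap.size as the key from the reworked 1.11 JobManager memory model (the 1024m value is the old default mentioned in this thread, not a recommendation):

```yaml
# Option 1: legacy key (pre-1.11 style), previously shipped in the default config:
jobmanager.heap.size: 1024m

# Option 2: new key from the 1.11 JobManager memory model (use one or the other):
# jobmanager.memory.heap.size: 1024m
```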
Well, we set it dynamically for a cluster deployment via the
flinkk8soperator, but the container image can also be used for testing with
local mode (./bin/jobmanager.sh start-foreground local). That will fail if
the heap wasn't configured, and that's how I noticed it.

Thanks,
Thomas

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html

On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <wangzhijiang...@aliyun.com.invalid> wrote:

Hi everyone,

Please review and vote on the release candidate #4 for the version 1.11.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
  deployed to dist.apache.org [2], which are signed with the key with
  fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.11.0-rc4" [5],
* website pull request listing the new release and adding the announcement
  blog post [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
[5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
[6] https://github.com/apache/flink-web/pull/352

--
Regards,
Roman