Re: [ANNOUNCE] Performance Daily Monitoring Moved from Ververica to Apache Flink Slack Channel

Martijn Visser Tue, 17 Jan 2023 05:53:19 -0800

Hi Yanfei,

Thanks for the proposal! Like Yuan mentioned, let's start a new discussion
thread to get a clean discussion of your proposal, but it already sounds
good to me.


Best regards,

Martijn

On Tue, Jan 17, 2023 at 10:41 AM Yuan Mei <[email protected]> wrote:

> Hey Yanfei,
>
> Thanks so much for the efforts driving the whole process. It's great to see
> that the performance benchmarks are indeed useful to help find regressions.
>
> 1. Regarding the procedure of how to use and understand the notification
> reported from the slack channel #flink-dev-benchmarks, the instructions
> read reasonably to me, and we can iterate over it gradually. Once you've
> done the wiki change, please ping me and I can help review it.
>
> 2. It also sounds to me reasonable to incorporate the performance-watching
> procedure into the release managers' daily/weekly monitors. But since it
> involves a change to the standard routine of releasing, we need to discuss
> and vote on the change.
>
> My suggestion is to start a new discussion thread for the instructions and
> proposed change so that more people are aware of the proposal and join the
> discussion (this is an announcement thread :-)).
>
>
> Best
> Yuan
>
> On Mon, Jan 16, 2023 at 4:52 PM Qingsheng Ren <[email protected]> wrote:
>
> > Thanks for making this detailed guide, Yanfei! This is quite helpful for
> > release managers to monitor and manage performance regressions.
> >
> > I think it will be great to also document the threshold of alerts sent to
> > the Slack channel, and some related formula used in the test, either in
> > the wiki page or in the README of flink-benchmarks repo. This could help
> > other maintainers to interpret the result.
> >
> > Also we can add this to release managers' daily monitors, similar to CI
> > instabilities. We can start operating with the process proposed by
> > Yanfei, and complete it gradually once we find something to add.
> >
> > Best regards,
> > Qingsheng
> >
> > On Mon, Jan 16, 2023 at 12:08 PM Yanfei Lei <[email protected]> wrote:
> >
> > > Hi devs,
> > >
> > > Flink benchmarks are periodically executed on
> > > http://codespeed.dak8s.net:8080 to monitor Flink performance. In late
> > > Oct'22, a new slack channel #flink-dev-benchmarks was created for
> > > notifications of performance regressions. It helped us find 2 build
> > > failures[1,2] and 5 performance regressions[3,4,5,6,7] in the past 3
> > > months, which is very meaningful for ensuring the quality of the code.
> > > I am checking the slack notifications once a week now, and if more
> > > people come to monitor together, we can check once a day in the future
> > > to find out regressions in a timely manner.
> > >
> > > According to some contributors and my own experience, I have
> > > summarized a document on how to handle performance regressions. The
> > > following is just a draft, which can be continuously iterated and
> > > improved later.
> > >
> > > When a benchmark regression is detected, the following steps will help
> > > to deal with regressions:
> > >
> > > 1. Create a Jira ticket (one per group of related benchmarks). Set the
> > > affects and fix versions to the current Flink version,
> > > component=Benchmarks, type=Bug.
> > >
> > > 2. Post the ticket in the Slack channel (replying in a thread).
> > >
> > > 3. Verify that the regression is real and investigate the cause. Take
> > > FLINK-30623[5] as an example:
> > >
> > >     3.1 Inspect the timeline by following the link
> > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=checkpointSingleInput.UNALIGNED&extr=on&quarts=on&equid=off&env=2&revs=200)
> > > from the notification. Suspicious commit ranges can be read from the
> > > figure; for this example, the suspicious range is
> > > 13ef498172b...fb272d2cdebf.
> > >
> > >     3.2 Narrow down the commit range via git log. You can either locate
> > > a specific commit directly based on experience, or compare the benchmark
> > > results of each commit in this range; if the regression is real, a
> > > responsible commit will be found. See the instructions for using
> > > benchmark-request; you can also try to benchmark locally. The
> > > http://codespeed.dak8s.net:8080 benchmarking infrastructure is hosted
> > > using resources provided by Ververica(Alibaba) and maintained by PMCs
> > > and Ververica; please contact one of the Apache Flink PMC members to get
> > > access. For example, two benchmark requests were submitted to verify
> > > whether FLINK-30533 caused the regression.
> > >
> > > > Before FLINK-30533:
> > > > http://codespeed.dak8s.net:8080/job/flink-benchmark-request/177
> > > >
> > > > - checkpointSingleInput.UNALIGNED: 333.635178(+-8.169488)
> > > > - checkpointSingleInput.UNALIGNED_1: 213.837107(+-7.282883)
> > > >
> > > > After FLINK-30533:
> > > > http://codespeed.dak8s.net:8080/job/flink-benchmark-request/178
> > > >
> > > > - checkpointSingleInput.UNALIGNED: 61.536982(+-3.581509)
> > > > - checkpointSingleInput.UNALIGNED_1: 38.207438(+-2.937051)
> > >
> > >     3.3 Changes in flink-benchmarks[8] may also cause a regression,
> > > don't forget to check if flink-benchmarks have changed recently.
> > >
> > >     3.4 If a regression cannot be reproduced reliably, for example
> > > because it was caused by errors in the results or by issues with the
> > > physical machine (like FLINK-18614[9]), the regression is not real.
> > >
> > > 4. Post benchmark results under the Jira ticket, and ping the authors
> > > of the commit (or relevant developers) to investigate the regression if
> > > the regression is real. Otherwise, set the resolution of the Jira ticket
> > > to "Not a bug", post the conclusion and close the ticket.
> > >
> > > 5. If a regression is not fixed within a week of confirming that one
> > > commit is the root cause of the regression, contact the release
> > > manager to revert it (after confirming that reverting the changes
> > > resolves the issue using benchmark-request[10]).
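
A minimal sketch of the before/after comparison in step 3.2, using the two benchmark-request results quoted above. The helper name `relative_change` and the 10% cut-off are assumptions made for illustration; they are not taken from flink-benchmarks or codespeed. The scores are treated as higher-is-better because the drop was reported as a regression.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after`."""
    if before == 0:
        raise ValueError("baseline score must be non-zero")
    return (after - before) / before


# Scores copied from the two benchmark requests quoted above
# (before and after FLINK-30533).
runs = {
    "checkpointSingleInput.UNALIGNED": (333.635178, 61.536982),
    "checkpointSingleInput.UNALIGNED_1": (213.837107, 38.207438),
}

# Hypothetical cut-off: flag drops larger than 10% of the baseline.
THRESHOLD = -0.10

changes = {name: relative_change(b, a) for name, (b, a) in runs.items()}
for name, change in changes.items():
    verdict = "regression" if change < THRESHOLD else "ok"
    print(f"{name}: {change:+.1%} ({verdict})")
```

Both benchmarks drop by a little over 80%, far more than the reported error margins, which is consistent with treating this regression as real.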
> > >
> > > If the above process is considered acceptable, I can draft a version
> > > and put it in the community wiki[10]. @Matthias had proposed to
> > > incorporate performance regression monitoring into the release
> > > management, and to have the regression tests monitored regularly by
> > > release managers or volunteers. I'm glad to be one of the volunteers.
> > >
> > > Hope to hear your advice and opinions!
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > [8] https://github.com/apache/flink-benchmarks
> > > [9] https://issues.apache.org/jira/browse/FLINK-18614
> > > [10] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > >
> > > Best regards,
> > > Yanfei
> > > Ververica(Alibaba)
> > >
> > > On Thu, Jan 12, 2023 at 5:46 PM Yanfei Lei <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Thanks for the reminder.
> > > >
> > > > @Matthias
> > > >
> > > > any updates on the performance tests? ...or more specifically, any
> > > > updates on the script for alerting on performance regressions?
> > > >
> > > > I created a PR for FLINK-27571[1] but it's still under review; would
> > > > you like to help take a look?
> > > >
> > > > FLINK-27571 only covers the new benchmarks. For the existing
> > > > benchmarks, the information is stored in codespeed's database, which
> > > > can't be updated by URL request, so I also logged into the Jenkins
> > > > master and modified codespeed's database. Currently "less is better"
> > > > is displayed correctly on the timeline[2].
> > > >
> > > >
> > > > Does it make sense to formalize/document the process?
> > > >
> > > > Certainly, I'm preparing a draft to share my experience of finding
> > > > commits that caused regressions. Originally, I wanted to wait for
> > > > FLINK-27571 to be merged before starting a discussion; I will post a
> > > > draft of the document later.
> > > >
> > > >
> > > > This Slack channel can only provide notice of regressions and some
> > > > experience on how to locate them, but we also need people to take
> > > > action after a regression happens. So far it is mainly a few
> > > > volunteers who do this, as with FLINK-30015[3] and FLINK-30623[4];
> > > > many thanks for Martijn's contribution.
> > > >
> > > > As for whether to add these responsibilities to the release manager, I
> > > > think we need to hear other people's opinions.
> > > >
> > > > @Martijn
> > > >
> > > > Thanks for creating these tickets. For FLINK-30623 and FLINK-30624[5],
> > > > @Hangxiang and I have located the corresponding commit and pinged the
> > > > corresponding submitter. Regressions may not be avoidable, and I
> > > > totally agree that this work needs to be formalized as soon as
> > > > possible to fix regressions.
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-27571
> > > >
> > > > [2] http://codespeed.dak8s.net:8000/timeline/#/?ben=createScheduler.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200&exe=1,3,5,6,8,9
> > > >
> > > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > >
> > > > [4] https://issues.apache.org/jira/browse/FLINK-30623
> > > >
> > > > [5] https://issues.apache.org/jira/browse/FLINK-30624
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Yanfei
> > > >
> > > >
> > > > On Wed, Jan 11, 2023 at 1:11 AM Martijn Visser <[email protected]> wrote:
> > > >>
> > > >> Hi all,
> > > >>
> > > >> Related to Matthias' email, I've checked the notifications in the
> > > >> Slack channel and noticed three major benchmark regressions. In the
> > > >> end, I've decided to create Jira tickets for them [1] [2] [3], but I
> > > >> do agree that this work needs to be formalized as soon as possible to
> > > >> avoid regressions. It would also be great to include a process on how
> > > >> these regressions will be fixed, because I have no idea who to
> > > >> ping/notify that these regressions have occurred.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Martijn
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-30623
> > > >> [2] https://issues.apache.org/jira/browse/FLINK-30624
> > > >> [3] https://issues.apache.org/jira/browse/FLINK-30625
> > > >>
> > > >> On Tue, Jan 10, 2023 at 1:56 PM Matthias Pohl
> > > >> <[email protected]> wrote:
> > > >>
> > > >> > Hi Yanfei,
> > > >> > any updates on the performance tests? ...or more specifically, any
> > > >> > updates on the script for alerting on performance regressions?
> > > >> >
> > > >> > Does it make sense to formalize/document the process? Currently, the
> > > >> > release management doesn't do anything in terms of performance
> > > >> > test monitoring. Therefore, performance regressions are not
> > > >> > necessarily identified actively (in contrast to CI instabilities).
> > > >> > Or is this covered by the PMC? It would be interesting to know
> > > >> > whether there's someone to reach out to who's monitoring the
> > > >> > regression tests regularly. Would it make sense for this person to
> > > >> > join the release calls?
> > > >> >
> > > >> > Or shall we work on formalizing/documenting the process and
> > > >> > integrating this responsibility into what the release manager(s) are
> > > >> > in charge of? My concern with that approach is that contributors
> > > >> > might be less willing to volunteer in the release management if we
> > > >> > collect everything in one role. Alternatively, we could split the
> > > >> > release manager role up into sub-roles that contributors can
> > > >> > volunteer for in a release (e.g. CI monitoring, performance test
> > > >> > monitoring, Jira maintenance, ... just coming up with random tasks
> > > >> > here).
> > > >> >
> > > >> > Alternatively, we could leave everything as is and just respond if
> > > >> > there's some complaint. I'm curious about your (and others') opinions.
> > > >> >
> > > >> > Matthias
> > > >> >
> > > >> > On Tue, Nov 29, 2022 at 2:13 PM Yanfei Lei <[email protected]> wrote:
> > > >> >
> > > >> > > Hi Martijn,
> > > >> > >
> > > >> > > Thanks for bringing this up.
> > > >> > >
> > > >> > > In the past two months, this channel has helped us find many
> > > >> > > benchmark failures, like FLINK-29883[1], FLINK-29886[2],
> > > >> > > FLINK-30015[3] and FLINK-30181[4]. I have also tried investigating
> > > >> > > several of the frequently reported regressions and replied under
> > > >> > > the notifications in the Slack channel (copied here):
> > > >> > >
> > > >> > >    1. serializerHeavyString
> > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>:
> > > >> > >    It has been unstable for a long time, see [5] for possible reasons.
> > > >> > >    2. Regressions are detected by a simple script which may have
> > > >> > >    false positives and false negatives, especially for benchmarks
> > > >> > >    with small absolute values, where small value changes cause large
> > > >> > >    percentage changes, see [6] for details. Maybe slidingWindow
> > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > >> > >    (value~=600), stateBackends.ROCKS
> > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > >> > >    (value~=260) and serializerHeavyString
> > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > >> > >    (value~=170) are not true regressions.
> > > >> > >    3. For deployAllTasks.STREAMING
> > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200>,
> > > >> > >    the benchmark result is how much time it takes to deploy a job;
> > > >> > >    the lower the value, the better the performance, see [7] for
> > > >> > >    details. FLINK-27571[8] would fix this problem.
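
The two pitfalls described in the list above (percentage swings on small scores, and benchmarks where a lower value is better) can be illustrated with a small sketch. It is not the code of regression_report.py: the function `is_regression`, its parameters and the 5% threshold are hypothetical, and only the approximate scores 600, 260 and 170 come from the message.

```python
def is_regression(baseline: float, current: float,
                  threshold: float = 0.05,
                  lower_is_better: bool = False) -> bool:
    """Flag a result that is worse than `baseline` by more than `threshold`."""
    change = (current - baseline) / baseline
    if lower_is_better:
        # For time-like scores an increase is the bad direction.
        change = -change
    return change < -threshold


# The same absolute drop of 10 is a much larger share of a small score.
drop = 10.0
flags = {score: is_regression(score, score - drop)
         for score in (600.0, 260.0, 170.0)}

# A deploy-time style score that rises by 10% is worse, not better.
deploy_worse = is_regression(100.0, 110.0, lower_is_better=True)
# Read with the wrong direction, the same change is not flagged.
deploy_misread = is_regression(100.0, 110.0)
```

With a fixed percentage threshold, only the smallest score is flagged for the same absolute change, and a time-based benchmark is only judged correctly once its direction is known to the check.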
> > > >> > >
> > > >> > >
> > > >> > > As mentioned before, regressions are detected by a simple script
> > > >> > > that is not very stable; FLINK-29825[9] was created to improve the
> > > >> > > benchmark's stability. I planned to invite more volunteers to
> > > >> > > monitor it after the regression checking became more stable, but
> > > >> > > I've been stuck with something else lately, sorry for the late
> > > >> > > response. Any suggestions on handling benchmark
> > > >> > > regressions/failures are welcome.
> > > >> > >
> > > >> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > >> > >
> > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-29886
> > > >> > >
> > > >> > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > >> > >
> > > >> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > >> > >
> > > >> > > [5] https://issues.apache.org/jira/browse/FLINK-27165
> > > >> > >
> > > >> > > [6]
> > > >> > > https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136
> > > >> > >
> > > >> > > [7]
> > > >> > > https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58
> > > >> > >
> > > >> > > [8] https://issues.apache.org/jira/browse/FLINK-27571
> > > >> > >
> > > >> > > [9] https://issues.apache.org/jira/browse/FLINK-29825
> > > >> > >
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Yanfei
> > > >> > >
> > > >> > > On Tue, Nov 29, 2022 at 3:54 PM Martijn Visser <[email protected]> wrote:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > >
> > > >> > > > Is there any update to be expected on the benchmark? I see
> > > >> > > > results of the benchmark being posted to Slack, but it appears
> > > >> > > > that it's not being monitored and no follow-up actions are being
> > > >> > > > taken. I think it's currently lacking a process on how to
> > > >> > > > interpret the results and what action should be taken and by whom.
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > >
> > > >> > > > Martijn
> > > >> > > >
> > > >> > > > On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <[email protected]> wrote:
> > > >> > > >
> > > >> > > > > Thanks yanfei for driving this!
> > > >> > > > >
> > > >> > > > > Looking forward to further discussion w.r.t. the workflow.
> > > >> > > > >
> > > >> > > > > Best regards,
> > > >> > > > > Jing
> > > >> > > > >
> > > >> > > > > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <[email protected]> wrote:
> > > >> > > > >
> > > >> > > > > > +1, thanks for driving this!
> > > >> > > > > >
> > > >> > > > > > On a side note, can we also ensure that a performance summary
> > > >> > > > > > report for Flink major version upgrades is in release notes,
> > > >> > > > > > once this infrastructure becomes mature? From the user
> > > >> > > > > > perspective, it would be nice to know what the expected (or
> > > >> > > > > > unexpected) regressions in a major version upgrade are. I've
> > > >> > > > > > seen the community do something like this before (e.g. the
> > > >> > > > > > major rocksdb version bump in 1.14?) and it was quite valuable
> > > >> > > > > > to know that upfront!
> > > >> > > > > >
> > > >> > > > > > Best,
> > > >> > > > > > Mason
> > > >> > > > > >
> > > >> > > > > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <[email protected]> wrote:
> > > >> > > > > >
> > > >> > > > > > > Thanks Yanfei for driving this.
> > > >> > > > > > >
> > > >> > > > > > > It allows us to easily find performance regressions.
> > > >> > > > > > > Especially since I have recently made some improvements to
> > > >> > > > > > > the scheduling related parts, your work is very important to
> > > >> > > > > > > ensure that these changes do not cause unexpected problems.
> > > >> > > > > > >
> > > >> > > > > > > Best regards,
> > > >> > > > > > >
> > > >> > > > > > > Weijie
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Fri, Oct 28, 2022 at 4:03 PM Congxian Qiu <[email protected]> wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Thanks for driving this and making the performance
> > > >> > > > > > > > monitoring public; this lets us learn about and resolve
> > > >> > > > > > > > performance problems quickly.
> > > >> > > > > > > >
> > > >> > > > > > > > Looking forward to the workflow and detailed descriptions
> > > >> > > > > > > > of flink-dev-benchmarks.
> > > >> > > > > > > >
> > > >> > > > > > > > Best,
> > > >> > > > > > > > Congxian
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Thu, Oct 27, 2022 at 12:41 PM Yun Tang <[email protected]> wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Thanks, Yanfei for driving this to monitor the performance
> > > >> > > > > > > > > in the Apache Flink Slack Channel.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Look forward to the workflow and detailed descriptions of
> > > >> > > > > > > > > flink-dev-benchmarks.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Best
> > > >> > > > > > > > > Yun Tang
> > > >> > > > > > > > > ________________________________
> > > >> > > > > > > > > From: Hangxiang Yu <[email protected]>
> > > >> > > > > > > > > Sent: Thursday, October 27, 2022 10:59
> > > >> > > > > > > > > To: [email protected] <[email protected]>
> > > >> > > > > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring Moved
> > > >> > > > > > > > > from Ververica to Apache Flink Slack Channel
> > > >> > > > > > > > >
> > > >> > > > > > > > > Hi, Yanfei.
> > > >> > > > > > > > > Thanks for driving this.
> > > >> > > > > > > > > It could help us to detect and resolve the regression
> > > >> > > > > > > > > problem quickly and officially.
> > > >> > > > > > > > > I'd like to join as a maintainer.
> > > >> > > > > > > > > Looking forward to the workflow.
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <[email protected]> wrote:
> > > >> > > > > > > > >
> > > >> > > > > > > > > > Thanks, Yanfei, for driving this and making the
> > > >> > > > > > > > > > performance monitoring publicly available.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Looking forward to seeing the workflow, and more details
> > > >> > > > > > > > > > as Martijn mentioned.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Best
> > > >> > > > > > > > > > Yuan
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <[email protected]> wrote:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > > Hi Yanfei Lei,
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Thanks for setting this up! It would be interesting to
> > > >> > > > > > > > > > > also know which aspects of Flink are monitored for
> > > >> > > > > > > > > > > "performance". I'm assuming there are specific pieces of
> > > >> > > > > > > > > > > functionality that are performance tested, but it would
> > > >> > > > > > > > > > > be great if this would be written down somewhere (next
> > > >> > > > > > > > > > > to a procedure how to detect a regression and what
> > > >> > > > > > > > > > > should be next steps).
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Best regards,
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Martijn
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <[email protected]> wrote:
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > > Hi yanfei,
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Thanks for driving this! It's a great help.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > I would like to join as a maintainer.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Best,
> > > >> > > > > > > > > > > > Zakelly
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <[email protected]> wrote:
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > Hi everyone,
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > As discussed earlier, we planned to create a benchmark
> > > >> > > > > > > > > > > > > channel in Apache Flink slack[1], but the plan was
> > > >> > > > > > > > > > > > > shelved for a while[2]. So I went on with this work, and
> > > >> > > > > > > > > > > > > created the #flink-dev-benchmarks channel for
> > > >> > > > > > > > > > > > > performance regression notifications.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > We have a regression report script[3] that runs daily,
> > > >> > > > > > > > > > > > > and a notification is sent to the slack channel when the
> > > >> > > > > > > > > > > > > last few benchmark results are significantly worse than
> > > >> > > > > > > > > > > > > the baseline.
> > > >> > > > > > > > > > > > > Note, regressions are detected by a simple script which
> > > >> > > > > > > > > > > > > may have false positives and false negatives. Also, all
> > > >> > > > > > > > > > > > > benchmarks are executed on one physical machine[4],
> > > >> > > > > > > > > > > > > which is provided by Ververica(Alibaba)[5], so hardware
> > > >> > > > > > > > > > > > > issues might affect performance, as in "[FLINK-18614]
> > > >> > > > > > > > > > > > > Performance regression 2020.07.13"[6].
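
The daily check described above (the last few results compared against a baseline) could look roughly like the sketch below. It is not the logic of regression_report.py: the function name `detect_regression`, the window sizes and the 10% threshold are all hypothetical, and it assumes that higher scores are better.

```python
from statistics import median


def detect_regression(scores, baseline_size=20, recent_size=3, threshold=0.10):
    """Compare the median of the newest results with an older baseline window.

    `scores` is ordered oldest to newest. Returns True when the recent median
    is more than `threshold` below the baseline median.
    """
    if len(scores) < baseline_size + recent_size:
        return False  # not enough history to judge
    baseline = median(scores[-(baseline_size + recent_size):-recent_size])
    recent = median(scores[-recent_size:])
    return (recent - baseline) / baseline < -threshold


# Ordinary run-to-run noise around a stable score.
stable = detect_regression([100.0] * 20 + [99.0, 101.0, 100.0])
# A sustained drop across all recent runs.
dropped = detect_regression([100.0] * 20 + [80.0, 82.0, 79.0])
# A single bad run, e.g. a hardware hiccup on the one benchmark machine.
noisy = detect_regression([100.0] * 20 + [100.0, 60.0, 101.0])
```

Using medians over a short window is one way to keep a single outlier from triggering a notification, at the cost of reacting a run or two later; it does not remove false positives or false negatives entirely.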
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > After the migration, we need a procedure to watch over
> > > >> > > > > > > > > > > > > the entire performance of Flink code together. For
> > > >> > > > > > > > > > > > > example, if a regression occurs, investigating the cause
> > > >> > > > > > > > > > > > > and resolving the problem are needed. In the past, this
> > > >> > > > > > > > > > > > > procedure was maintained internally within Ververica,
> > > >> > > > > > > > > > > > > but we think making the procedure public would benefit
> > > >> > > > > > > > > > > > > all. I volunteer to serve as one of the initial
> > > >> > > > > > > > > > > > > maintainers, and would be glad if more contributors can
> > > >> > > > > > > > > > > > > join me. I'd also prepare some guidelines to help others
> > > >> > > > > > > > > > > > > get familiar with the workflow. I will start a new
> > > >> > > > > > > > > > > > > thread to discuss the workflow soon.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > [1] https://www.mail-archive.com/[email protected]/msg58666.html
> > > >> > > > > > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28468
> > > >> > > > > > > > > > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > >> > > > > > > > > > > > > [4] http://codespeed.dak8s.net:8080
> > > >> > > > > > > > > > > > > [5] https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
> > > >> > > > > > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-18614
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > --
> > > >> > > > > > > > > Best,
> > > >> > > > > > > > > Hangxiang.
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >
> > > >
> > >
> > >
> > > Yanfei Lei <[email protected]> 于2023年1月12日周四 17:46写道：
> > >
> > >
> > > Yanfei Lei <[email protected]> 于2023年1月12日周四 17:46写道：
> > > >
> > > > Hi all,
> > > >
> > > > Thanks for the reminder.
> > > >
> > > > @Matthias
> > > >
> > > > any updates on the performance tests? ...or more specifically, any
> > > updates
> > > > on the script for alerting on performance regressions?
> > > >
> > > >
> > > > I create a PR for FLINK-27571[1] but it's still under review, would
> you
> > > like to help take a look?
> > > >
> > > > FLINK-27571 is just for the new benchmarks, for the old existing
> > > benchmarks, their information is stored
> > > >
> > > > in codespeed's database which can't be updated by URL request, so I
> > also
> > > logged into the Jenkins master
> > > >
> > > > and modified the codespeed's database, currently "less is better" can
> > be
> > > displayed normally on the timeline[2].
> > > >
> > > >
> > > > Does it make sense to formalize/document the process?
> > > >
> > > > Certainly, I'm preparing a draft to share my experience of finding
> > > commits that caused regressions.
> > > >
> > > > Originally, I wanted to wait for FLINK-27571 to be merged before
> > > starting a discussion, and I will put
> > > >
> > > > a draft of the document later.
> > > >
> > > >
> > > > This slack channel can only provide notice of regression and some
> > > experience on how to locate regression,
> > > >
> > > > but we also need some people to take action after the regression
> > > happens. It is mainly a few people who volunteer to do these things,
> > > >
> > > > like FLINK-30015[3] and FLINK-30623[4], many thanks for Martijn's
> > > contribution.
> > > >
> > > > As for whether to add the responsibilities to the release manager, I
> > > think it needs to see other people's opinions.
> > > >
> > > > @Martijn
> > > >
> > > > Thanks for creating these tickets. For FLINK-30623 and
> FLINK-30624[5],
> > > @Hangxiang and I have located the corresponding commit
> > > >
> > > > and pinged the corresponding submitter. Regression may not be
> avoided,
> > I
> > > totally do agree that this work needs to be formalized as soon as
> > possible
> > > to fix regressions.
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-27571
> > > >
> > > > [2]
> > >
> >
> http://codespeed.dak8s.net:8000/timeline/#/?ben=createScheduler.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200&exe=1,3,5,6,8,9
> > > >
> > > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > >
> > > > [4] https://issues.apache.org/jira/browse/FLINK-30623
> > > >
> > > > [5] https://issues.apache.org/jira/browse/FLINK-30624
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Yanfei
> > > >
> > > >
> > > > > Martijn Visser <[email protected]> wrote on Wed, Jan 11, 2023 at 01:11:
> > > >>
> > > >> Hi all,
> > > >>
> > > > >> Related to Matthias' email, I've checked the notifications in the Slack
> > > > >> channel and noticed three major benchmark regressions. In the end, I've
> > > > >> decided to create Jira tickets for them [1] [2] [3], but I do agree that
> > > > >> this work needs to be formalized as soon as possible to avoid
> > > > >> regressions. It would also be great to include a process on how these
> > > > >> regressions will be fixed, because I have no idea who to ping/notify that
> > > > >> these regressions have occurred.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Martijn
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-30623
> > > >> [2] https://issues.apache.org/jira/browse/FLINK-30624
> > > >> [3] https://issues.apache.org/jira/browse/FLINK-30625
> > > >>
> > > >> On Tue, Jan 10, 2023 at 1:56 PM Matthias Pohl
> > > >> <[email protected]> wrote:
> > > >>
> > > >> > Hi Yanfei,
> > > > >> > any updates on the performance tests? ...or more specifically, any
> > > > >> > updates on the script for alerting on performance regressions?
> > > >> >
> > > > >> > Does it make sense to formalize/document the process? Currently, the
> > > > >> > release management doesn't do anything in terms of performance
> > > > >> > test monitoring. Therefore, performance regressions are not necessarily
> > > > >> > identified actively (in contrast to CI instabilities). Or is this
> > > > >> > covered by the PMC? It would be interesting to know whether there's
> > > > >> > someone to reach out to who's monitoring the regression tests
> > > > >> > regularly. Would it make sense for this person to join the release
> > > > >> > calls?
> > > >> >
> > > > >> > Or shall we work on formalizing/documenting the process and integrating
> > > > >> > this responsibility into what the release manager(s) are in charge of?
> > > > >> > My concern with that approach is that contributors might be less
> > > > >> > willing to volunteer in the release management if we collect everything
> > > > >> > in one role. Alternatively, we could split the release manager role up
> > > > >> > into sub-roles that contributors can volunteer for in a release (e.g.
> > > > >> > CI monitoring, performance test monitoring, Jira maintenance, ... just
> > > > >> > coming up with random tasks here).
> > > >> >
> > > > >> > Alternatively, we could leave everything as is and just respond if
> > > > >> > there's some complaint. I'm curious about your (and others') opinions.
> > > >> >
> > > >> > Matthias
> > > >> >
> > > > >> > On Tue, Nov 29, 2022 at 2:13 PM Yanfei Lei <[email protected]> wrote:
> > > >> >
> > > >> > > Hi Martijn,
> > > >> > >
> > > >> > > Thanks for bringing this up.
> > > >> > >
> > > > >> > > In the past two months, this channel has helped us find many benchmark
> > > > >> > > failures, like FLINK-29883[1], FLINK-29886[2], FLINK-30015[3] and
> > > > >> > > FLINK-30181[4]. I have also tried investigating several of the
> > > > >> > > frequently reported regressions and replied under the notifications in
> > > > >> > > the Slack channel (copied here):
> > > >> > >
> > > > >> > >    1. serializerHeavyString
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>:
> > > > >> > >    It has been unstable for a long time; see [5] for possible reasons.
> > > > >> > >    2. Regressions are detected by a simple script which may have false
> > > > >> > >    positives and false negatives. Especially for benchmarks with small
> > > > >> > >    absolute values, small value changes cause large percentage changes;
> > > > >> > >    see [6] for details. slidingWindow
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >    (value~=600), stateBackends.ROCKS
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >    (value~=260) and serializerHeavyString
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >    (value~=170) are probably not true regressions.
> > > >> > >
> > > > >> > >    3. For deployAllTasks.STREAMING
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200>,
> > > > >> > >    the benchmark result is how much time it takes to deploy a job, so
> > > > >> > >    the lower the value, the better the performance; see [7] for details.
> > > > >> > >    FLINK-27571
> > > > >> > >    <https://issues.apache.org/jira/browse/FLINK-27571>[8] would fix this
> > > > >> > >    problem.
> > > >> > >
> > > >> > >
> > > > >> > > As mentioned before, regressions are detected by a simple script that is
> > > > >> > > not very stable; FLINK-29825
> > > > >> > > <https://issues.apache.org/jira/browse/FLINK-29825>[9] was created to
> > > > >> > > improve the benchmark's stability. I planned to invite more volunteers
> > > > >> > > to monitor it once the regression checks became more stable, but I've
> > > > >> > > been stuck with something else lately, sorry for the late response. Any
> > > > >> > > suggestions on handling benchmark regressions/failures are welcome.
> > > >> > >
> > > >> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > >> > >
> > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-29886
> > > >> > >
> > > >> > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > >> > >
> > > >> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > >> > >
> > > >> > > [5] https://issues.apache.org/jira/browse/FLINK-27165
> > > >> > >
> > > >> > > [6]
> > > >> > >
> > > >> > >
> > > >> >
> > >
> >
> https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136
> > > >> > >
> > > >> > > [7]
> > > >> > >
> > > >> > >
> > > >> >
> > >
> >
> https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58
> > > >> > >
> > > >> > > [8] https://issues.apache.org/jira/browse/FLINK-27571
> > > >> > >
> > > >> > > [9] https://issues.apache.org/jira/browse/FLINK-29825
> > > >> > >
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Yanfei
> > > >> > >
> > > > >> > > Martijn Visser <[email protected]> wrote on Tue, Nov 29, 2022 at 15:54:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > >
> > > > >> > > > Is there any update to be expected on the benchmark? I see results of the
> > > > >> > > > benchmark being posted to Slack, but it appears that it's not being
> > > > >> > > > monitored and no follow-up actions are being taken. I think it's currently
> > > > >> > > > lacking a process on how to interpret the results and what action should
> > > > >> > > > be taken and by whom.
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > >
> > > >> > > > Martijn
> > > >> > > >
> > > > >> > > > On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <[email protected]> wrote:
> > > >> > > >
> > > >> > > > > Thanks yanfei for driving this!
> > > >> > > > >
> > > >> > > > > Looking forward to further discussion w.r.t. the workflow.
> > > >> > > > >
> > > >> > > > > Best regards,
> > > >> > > > > Jing
> > > >> > > > >
> > > > >> > > > > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <[email protected]> wrote:
> > > >> > > > >
> > > >> > > > > > +1, thanks for driving this!
> > > >> > > > > >
> > > > >> > > > > > On a side note, can we also ensure that a performance summary report for
> > > > >> > > > > > Flink major version upgrades is in the release notes, once this
> > > > >> > > > > > infrastructure becomes mature? From the user perspective, it would be nice
> > > > >> > > > > > to know what the expected (or unexpected) regressions in a major version
> > > > >> > > > > > upgrade are. I've seen the community do something like this before (e.g.
> > > > >> > > > > > the major rocksdb version bump in 1.14?) and it was quite valuable to know
> > > > >> > > > > > that upfront!
> > > >> > > > > >
> > > >> > > > > > Best,
> > > >> > > > > > Mason
> > > >> > > > > >
> > > > >> > > > > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <[email protected]> wrote:
> > > >> > > > > >
> > > >> > > > > > > Thanks Yanfei for driving this.
> > > >> > > > > > >
> > > > >> > > > > > > It allows us to easily find performance regressions. I have recently made
> > > > >> > > > > > > some improvements to the scheduling-related parts, so your work is very
> > > > >> > > > > > > important to ensure that these changes do not cause unexpected problems.
> > > >> > > > > > >
> > > >> > > > > > > Best regards,
> > > >> > > > > > >
> > > >> > > > > > > Weijie
> > > >> > > > > > >
> > > >> > > > > > >
> > > > >> > > > > > > Congxian Qiu <[email protected]> wrote on Fri, Oct 28, 2022 at 16:03:
> > > >> > > > > > >
> > > > >> > > > > > > > Thanks for driving this and making the performance monitoring public; this
> > > > >> > > > > > > > helps us notice and resolve performance problems quickly.
> > > >> > > > > > > >
> > > > >> > > > > > > > Looking forward to the workflow and detailed descriptions of
> > > > >> > > > > > > > flink-dev-benchmarks.
> > > >> > > > > > > >
> > > >> > > > > > > > Best,
> > > >> > > > > > > > Congxian
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > > >> > > > > > > > Yun Tang <[email protected]> wrote on Thu, Oct 27, 2022 at 12:41:
> > > >> > > > > > > >
> > > > >> > > > > > > > > Thanks, Yanfei, for driving this to monitor the performance in the Apache
> > > > >> > > > > > > > > Flink Slack Channel.
> > > >> > > > > > > > >
> > > > >> > > > > > > > > Look forward to the workflow and detailed descriptions of
> > > > >> > > > > > > > > flink-dev-benchmarks.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Best
> > > >> > > > > > > > > Yun Tang
> > > >> > > > > > > > > ________________________________
> > > >> > > > > > > > > From: Hangxiang Yu <[email protected]>
> > > >> > > > > > > > > Sent: Thursday, October 27, 2022 10:59
> > > >> > > > > > > > > To: [email protected] <[email protected]>
> > > > >> > > > > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring Moved from Ververica
> > > > >> > > > > > > > > to Apache Flink Slack Channel
> > > >> > > > > > > > >
> > > >> > > > > > > > > Hi, Yanfei.
> > > >> > > > > > > > > Thanks for driving this.
> > > > >> > > > > > > > > It could help us detect and resolve regression problems quickly and
> > > > >> > > > > > > > > officially.
> > > >> > > > > > > > > I'd like to join as a maintainer.
> > > >> > > > > > > > > Looking forward to the workflow.
> > > >> > > > > > > > >
> > > > >> > > > > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <[email protected]> wrote:
> > > >> > > > > > > > >
> > > > >> > > > > > > > > > Thanks, Yanfei, for driving this and making the performance monitoring
> > > > >> > > > > > > > > > publicly available.
> > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Looking forward to seeing the workflow, and more details as Martijn
> > > > >> > > > > > > > > > mentioned.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Best
> > > >> > > > > > > > > > Yuan
> > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <[email protected]> wrote:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > > Hi Yanfei Lei,
> > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thanks for setting this up! It would be interesting to also know which
> > > > >> > > > > > > > > > > aspects of Flink are monitored for "performance". I'm assuming there are
> > > > >> > > > > > > > > > > specific pieces of functionality that are performance tested, but it would
> > > > >> > > > > > > > > > > be great if this would be written down somewhere (next to a procedure on
> > > > >> > > > > > > > > > > how to detect a regression and what the next steps should be).
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Best regards,
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Martijn
> > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <[email protected]> wrote:
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > > Hi yanfei,
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Thanks for driving this! It's a great help.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > I would like to join as a maintainer.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Best,
> > > >> > > > > > > > > > > > Zakelly
> > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <[email protected]> wrote:
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > Hi everyone,
> > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > As discussed earlier, we plan to create a benchmark channel in Apache Flink
> > > > >> > > > > > > > > > > > > slack[1], but the plan was shelved for a while[2]. So I went on with this
> > > > >> > > > > > > > > > > > > work, and created the #flink-dev-benchmarks channel for performance
> > > > >> > > > > > > > > > > > > regression notifications.
> > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > We have a regression report script[3] that runs daily, and a notification
> > > > >> > > > > > > > > > > > > is sent to the Slack channel when the last few benchmark results are
> > > > >> > > > > > > > > > > > > significantly worse than the baseline.
> > > > >> > > > > > > > > > > > > Note that regressions are detected by a simple script which may have false
> > > > >> > > > > > > > > > > > > positives and false negatives. Also, all benchmarks are executed on one
> > > > >> > > > > > > > > > > > > physical machine[4] provided by Ververica(Alibaba)[5], so hardware issues
> > > > >> > > > > > > > > > > > > may affect performance, as in "[FLINK-18614
> > > > >> > > > > > > > > > > > > <https://issues.apache.org/jira/browse/FLINK-18614>] Performance regression
> > > > >> > > > > > > > > > > > > 2020.07.13"[6].
> > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > After the migration, we need a procedure to watch over the entire
> > > > >> > > > > > > > > > > > > performance of Flink code together. For example, if a regression occurs,
> > > > >> > > > > > > > > > > > > investigating the cause and resolving the problem are needed. In the past,
> > > > >> > > > > > > > > > > > > this procedure was maintained internally within Ververica, but we think
> > > > >> > > > > > > > > > > > > making the procedure public would benefit all. I volunteer to serve as one
> > > > >> > > > > > > > > > > > > of the initial maintainers, and would be glad if more contributors can join
> > > > >> > > > > > > > > > > > > me. I'd also prepare some guidelines to help others get familiar with the
> > > > >> > > > > > > > > > > > > workflow. I will start a new thread to discuss the workflow soon.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > [1]
> > > >> > > > > > > > >
> > > >> > > https://www.mail-archive.com/[email protected]/msg58666.html
> > > >> > > > > > > > > > > > > [2]
> > > >> > https://issues.apache.org/jira/browse/FLINK-28468
> > > >> > > > > > > > > > > > > [3]
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > >
> >
> https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > >> > > > > > > > > > > > > [4] http://codespeed.dak8s.net:8080
> > > >> > > > > > > > > > > > > [5]
> > > >> > > > > > > > >
> > > >> > >
> https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > [6]
> > > >> > https://issues.apache.org/jira/browse/FLINK-18614
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > --
> > > >> > > > > > > > > Best,
> > > >> > > > > > > > > Hangxiang.
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Yanfei
> > >
> > >
> >
>
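
For readers following the thread, the detection caveats discussed in the
quoted messages (comparing the last few results against a baseline, large
percentage swings on benchmarks with small absolute values, and "less is
better" benchmarks such as deployAllTasks.STREAMING) can be sketched roughly
as below. This is only an illustration and not the logic of
regression_report.py: the function names, the median-based baseline, the 5%
threshold, and all sample numbers are assumptions made for the example.

```python
from statistics import median


def relative_change(baseline, recent, less_is_better=False):
    """Signed fractional change, normalized so that negative means 'worse'.

    For throughput-style benchmarks a lower value is worse; for time-style
    benchmarks ("less is better") the sign is flipped.
    """
    change = (recent - baseline) / baseline
    return -change if less_is_better else change


def is_regression(history, recent_count=3, threshold=0.05,
                  less_is_better=False):
    """Flag a regression when the last few results are worse than baseline.

    Hypothetical rule: baseline = median of the older results, recent =
    median of the last `recent_count` results, flagged when the normalized
    change drops below -threshold.
    """
    if len(history) <= recent_count:
        raise ValueError("need more results than recent_count")
    baseline = median(history[:-recent_count])
    recent = median(history[-recent_count:])
    return relative_change(baseline, recent, less_is_better) < -threshold


# Throughput-style benchmark (higher is better): a drop from ~600 to ~560.
print(is_regression([600, 602, 598, 601, 599, 560, 565, 558]))

# Small absolute values: a change of only 0.3 is already a 7.5% swing.
print(is_regression([4.0, 4.1, 3.9, 4.0, 3.7, 3.7, 3.7]))

# Time-style benchmark: only caught when the direction is declared.
deploy_times = [100, 101, 99, 100, 120, 121, 119]
print(is_regression(deploy_times))
print(is_regression(deploy_times, less_is_better=True))
```

A purely percentage-based rule like this one treats a 0.3 change on a value
near 4 the same as a 45 change on a value near 600, which is one way the
false positives mentioned in the thread can arise. The less_is_better flag
shows why the direction of each benchmark has to be recorded: without it,
a slower deployment goes unnoticed and a faster one would be reported as a
regression.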
