Hi Yanfei,

Thanks for the proposal! As Yuan mentioned, let's start a new discussion thread so that your proposal gets a clean discussion, but it already sounds good to me.
Best regards,

Martijn

On Tue, Jan 17, 2023 at 10:41, Yuan Mei <yuanmei.w...@gmail.com> wrote:

> Hey Yanfei,
>
> Thanks so much for the efforts driving the whole process. It's great to
> see that the performance benchmarks are indeed useful for finding
> regressions.
>
> 1. Regarding the procedure for using and understanding the notifications
> reported in the Slack channel #flink-dev-benchmarks, the instructions
> read reasonably to me, and we can iterate on them gradually. Once you've
> made the wiki change, please ping me and I can help review it.
>
> 2. It also sounds reasonable to me to incorporate the
> performance-watching procedure into the release managers' daily/weekly
> monitoring. But since it involves a change to the standard release
> routine, we need to discuss and vote on the change.
>
> My suggestion is to start a new discussion thread for the instructions
> and the proposed change so that more people are aware of the proposal
> and join the discussion (this is an announcement thread :-)).
>
> Best
> Yuan
>
> On Mon, Jan 16, 2023 at 4:52 PM Qingsheng Ren <renqs...@gmail.com> wrote:
>
> > Thanks for making this detailed guide, Yanfei! This is quite helpful
> > for release managers to monitor and manage performance regressions.
> >
> > I think it would be great to also document the threshold for alerts
> > sent to the Slack channel, and the related formulas used in the test,
> > either in the wiki page or in the README of the flink-benchmarks repo.
> > This could help other maintainers interpret the results.
> >
> > Also, we can add this to the release managers' daily monitoring,
> > similar to CI instabilities. We can start operating with the process
> > proposed by Yanfei and complete it gradually once we find something
> > to add.
> >
> > Best regards,
> > Qingsheng
> >
> > On Mon, Jan 16, 2023 at 12:08 PM Yanfei Lei <fredia...@gmail.com> wrote:
> >
> > > Hi devs,
> > >
> > > Flink benchmarks are periodically executed on
> > > http://codespeed.dak8s.net:8080 to monitor Flink performance. In late
> > > Oct'22, a new Slack channel, #flink-dev-benchmarks, was created for
> > > notifications of performance regressions. It helped us find 2 build
> > > failures[1,2] and 5 performance regressions[3,4,5,6,7] in the past 3
> > > months, which is very valuable for ensuring the quality of the code.
> > > I am checking the Slack notifications once a week now; if more
> > > people join the monitoring, we can check once a day in the future
> > > to catch regressions in a timely manner.
> > >
> > > Based on input from some contributors and my own experience, I have
> > > put together a document on how to handle performance regressions. The
> > > following is just a draft, which can be iterated on and improved
> > > later.
> > >
> > > When a benchmark regression is detected, the following steps will
> > > help to deal with it:
> > >
> > > 1. Create a Jira ticket (one per group of related benchmarks). Set
> > > the affects and fix versions to the current Flink version,
> > > component=Benchmarks, type=Bug.
> > >
> > > 2. Post the ticket in the Slack channel (replying in a thread).
> > >
> > > 3. Verify that the regression is real and investigate the cause. Take
> > > FLINK-30623[5] as an example:
> > >
> > > 3.1 Inspect the timeline by following the link
> > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=checkpointSingleInput.UNALIGNED&extr=on&quarts=on&equid=off&env=2&revs=200)
> > > from the notification. Suspicious commit ranges can be read off the
> > > figure; for this example, the suspicious range is
> > > 13ef498172b...fb272D2cdebf.
> > >
> > > 3.2 Narrow down the commit range via git log. You can either locate
> > > a specific commit directly based on experience, or compare the
> > > benchmark results of each commit in this range; if the regression is
> > > real, a responsible commit will be found. See the instructions for
> > > using benchmark-request; you can also try to benchmark locally. The
> > > http://codespeed.dak8s.net:8080 benchmarking infrastructure is hosted
> > > using resources provided by Ververica(Alibaba) and maintained by PMCs
> > > and Ververica; please contact one of the Apache Flink PMCs to get
> > > access. For example, two benchmark requests were submitted to verify
> > > whether FLINK-30533 caused the regression.
> > >
> > >   Before FLINK-30533:
> > >   http://codespeed.dak8s.net:8080/job/flink-benchmark-request/177
> > >   - checkpointSingleInput.UNALIGNED: 333.635178(+-8.169488)
> > >   - checkpointSingleInput.UNALIGNED_1: 213.837107(+-7.282883)
> > >
> > >   After FLINK-30533:
> > >   http://codespeed.dak8s.net:8080/job/flink-benchmark-request/178
> > >   - checkpointSingleInput.UNALIGNED: 61.536982(+-3.581509)
> > >   - checkpointSingleInput.UNALIGNED_1: 38.207438(+-2.937051)
> > >
> > > 3.3 Changes in flink-benchmarks[8] may also cause a regression, so
> > > don't forget to check whether flink-benchmarks has changed recently.
> > >
> > > 3.4 If a regression cannot be reproduced reliably, for example
> > > because it stems from noise in the results or from issues with the
> > > physical machine (like FLINK-18614[9]), the regression is not real.
> > >
> > > 4. Post the benchmark results under the Jira ticket, and ping the
> > > authors of the commit (or relevant developers) to investigate the
> > > regression if it is real. Otherwise, set the resolution of the Jira
> > > ticket to "Not a bug", post the conclusion and close the ticket.
> > >
> > > 5. If a regression is not fixed within a week of confirming that one
> > > commit is the root cause of the regression, contact the release
> > > manager to revert it (after confirming that reverting the changes
> > > resolves the issue using benchmark-request[10]).
> > >
> > > If the above process is considered acceptable, I can draft a version
> > > and put it in the community wiki[10]. @Matthias had proposed to
> > > incorporate performance regression monitoring into the release
> > > management, and to have the regression testing monitored regularly by
> > > release managers or volunteers. I'm glad to be one of the volunteers.
> > >
> > > Hope to hear your advice and opinions!
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > [8] https://github.com/apache/flink-benchmarks
> > > [9] https://issues.apache.org/jira/browse/FLINK-18614
> > > [10] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > >
> > > Best regards,
> > > Yanfei
> > > Ververica(Alibaba)
> > >
> > > On Thu, Jan 12, 2023 at 17:46, Yanfei Lei <fredia...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks for the reminder.
> > > >
> > > > @Matthias
> > > >
> > > > > any updates on the performance tests? ...or more specifically,
> > > > > any updates on the script for alerting on performance regressions?
> > > >
> > > > I created a PR for FLINK-27571[1], but it's still under review.
> > > > Would you like to help take a look?
> > > >
> > > > FLINK-27571 only covers the new benchmarks. For the existing
> > > > benchmarks, the information is stored in codespeed's database, which
> > > > can't be updated via URL request, so I also logged into the Jenkins
> > > > master and modified codespeed's database. Currently, "less is
> > > > better" is displayed correctly on the timeline[2].
> > > >
> > > > > Does it make sense to formalize/document the process?
> > > >
> > > > Certainly, I'm preparing a draft to share my experience of finding
> > > > commits that caused regressions. Originally, I wanted to wait for
> > > > FLINK-27571 to be merged before starting a discussion; I will post
> > > > a draft of the document later.
> > > >
> > > > This Slack channel can only provide notice of regressions and some
> > > > experience on how to locate them, but we also need people to take
> > > > action after a regression happens. So far it is mainly a few
> > > > volunteers who do these things, as with FLINK-30015[3] and
> > > > FLINK-30623[4]; many thanks for Martijn's contribution. As for
> > > > whether to add these responsibilities to the release manager, I
> > > > think we need to hear other people's opinions.
> > > >
> > > > @Martijn
> > > >
> > > > Thanks for creating these tickets. For FLINK-30623 and
> > > > FLINK-30624[5], @Hangxiang and I have located the corresponding
> > > > commit and pinged the corresponding submitter. Regressions may not
> > > > be avoidable, and I totally agree that this work needs to be
> > > > formalized as soon as possible so that regressions get fixed.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-27571
> > > > [2] http://codespeed.dak8s.net:8000/timeline/#/?ben=createScheduler.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200&exe=1,3,5,6,8,9
> > > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > > [4] https://issues.apache.org/jira/browse/FLINK-30623
> > > > [5] https://issues.apache.org/jira/browse/FLINK-30624
> > > >
> > > > Best regards,
> > > > Yanfei
> > > >
> > > > On Wed, Jan 11, 2023 at 01:11, Martijn Visser <martijnvis...@apache.org> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Related to Matthias' email, I've checked the notifications in the
> > > > > Slack channel and noticed three major benchmark regressions. In
> > > > > the end, I've decided to create Jira tickets for them [1] [2] [3],
> > > > > but I do agree that this work needs to be formalized as soon as
> > > > > possible to avoid regressions. It would also be great to include a
> > > > > process on how these regressions will be fixed, because I have no
> > > > > idea who to ping/notify that these regressions have occurred.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-30623
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-30624
> > > > > [3] https://issues.apache.org/jira/browse/FLINK-30625
> > > > >
> > > > > On Tue, Jan 10, 2023 at 1:56 PM Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
> > > > >
> > > > > > Hi Yanfei,
> > > > > > any updates on the performance tests? ...or more specifically,
> > > > > > any updates on the script for alerting on performance
> > > > > > regressions?
> > > > > >
> > > > > > Does it make sense to formalize/document the process? Currently,
> > > > > > the release management doesn't do anything in terms of
> > > > > > performance test monitoring. Therefore, performance regressions
> > > > > > are not necessarily identified actively (in contrast to CI
> > > > > > instabilities). Or is this covered by the PMC? It would be
> > > > > > interesting to know whether there's someone to reach out to
> > > > > > who's monitoring the regression tests regularly. Would it make
> > > > > > sense for this person to join the release calls?
> > > > > >
> > > > > > Or shall we work on formalizing/documenting the process and
> > > > > > integrating this responsibility into what the release manager(s)
> > > > > > are in charge of? My concern with that approach is that
> > > > > > contributors might be less willing to volunteer for release
> > > > > > management if we collect everything in one role. Alternatively,
> > > > > > we could split the release manager role up into sub-roles that
> > > > > > contributors can volunteer for in a release (e.g. CI monitoring,
> > > > > > performance test monitoring, Jira maintenance, ... just coming
> > > > > > up with random tasks here).
> > > > > >
> > > > > > Alternatively, we could leave everything as is and just respond
> > > > > > if there's some complaint. I'm curious about your (and others')
> > > > > > opinions.
> > > > > >
> > > > > > Matthias
> > > > > >
> > > > > > On Tue, Nov 29, 2022 at 2:13 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Martijn,
> > > > > > >
> > > > > > > Thanks for bringing this up.
> > > > > > >
> > > > > > > In the past two months, this channel has helped us find many
> > > > > > > benchmark failures, like FLINK-29883[1], FLINK-29886[2],
> > > > > > > FLINK-30015[3] and FLINK-30181[4].
> > > > > > > I have also tried investigating several of the frequently
> > > > > > > reported regressions and replied under the notifications in
> > > > > > > the Slack channel (copied here):
> > > > > > >
> > > > > > > 1. serializerHeavyString
> > > > > > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200):
> > > > > > > it has been unstable for a long time; see FLINK-27165[5] for
> > > > > > > possible reasons.
> > > > > > >
> > > > > > > 2. Regressions are detected by a simple script which may have
> > > > > > > false positives and false negatives, especially for benchmarks
> > > > > > > with small absolute values, where small value changes cause
> > > > > > > large percentage changes. See [6] for details. Maybe
> > > > > > > slidingWindow
> > > > > > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200,
> > > > > > > value~=600), stateBackends.ROCKS
> > > > > > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200,
> > > > > > > value~=260) and serializerHeavyString
> > > > > > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200,
> > > > > > > value~=170) are not true regressions.
> > > > > > >
> > > > > > > 3. For deployAllTasks.STREAMING
> > > > > > > (http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200),
> > > > > > > the benchmark result is the time it takes to deploy a job, so
> > > > > > > a lower value means better performance; see [7] for details.
> > > > > > > FLINK-27571[8] would fix this problem.
> > > > > > >
> > > > > > > As mentioned before, regressions are detected by a simple
> > > > > > > script that is not very stable; FLINK-29825[9] was created to
> > > > > > > improve the benchmark's stability. I planned to invite more
> > > > > > > volunteers to monitor it once the regression checking became
> > > > > > > more stable, but I've been stuck with something else lately.
> > > > > > > Sorry for the late response. Any suggestions on handling
> > > > > > > benchmark regressions/failures are welcome.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-29886
> > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-27165
> > > > > > > [6] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136
> > > > > > > [7] https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58
> > > > > > > [8] https://issues.apache.org/jira/browse/FLINK-27571
> > > > > > > [9] https://issues.apache.org/jira/browse/FLINK-29825
> > > > > > >
> > > > > > > Best,
> > > > > > > Yanfei
> > > > > > >
> > > > > > > On Tue, Nov 29, 2022 at 15:54, Martijn Visser <martijnvis...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Is there any update to be expected on the benchmark? I see
> > > > > > > > results of the benchmark being posted to Slack, but it
> > > > > > > > appears that it's not being monitored and no follow-up
> > > > > > > > actions are being taken. I think it's currently lacking a
> > > > > > > > process on how to interpret the results and what action
> > > > > > > > should be taken and by whom.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Martijn
> > > > > > > >
> > > > > > > > On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <j...@ververica.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks Yanfei for driving this!
> > > > > > > > >
> > > > > > > > > Looking forward to further discussion w.r.t. the workflow.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Jing
> > > > > > > > >
> > > > > > > > > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <mas.chen6...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > +1, thanks for driving this!
> > > > > > > > > >
> > > > > > > > > > On a side note, can we also ensure that a performance summary report for Flink major version upgrades is in the release notes, once this infrastructure becomes mature? From the user perspective, it would be nice to know what the expected (or unexpected) regressions in a major version upgrade are. I've seen the community do something like this before (e.g. the major rocksdb version bump in 1.14?) and it was quite valuable to know that upfront!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Mason
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <guoweijieres...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Yanfei for driving this.
> > > > > > > > > > >
> > > > > > > > > > > It allows us to easily find performance regressions. I have recently made some improvements to the scheduling-related parts, so your work is very important to ensure that these changes do not cause unexpected problems.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > >
> > > > > > > > > > > Weijie
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 28, 2022 at 16:03, Congxian Qiu <qcx978132...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for driving this and making the performance monitoring public; this helps us notice and resolve performance problems quickly.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to the workflow and detailed descriptions of flink-dev-benchmarks.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Congxian
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Oct 27, 2022 at 12:41, Yun Tang <myas...@live.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks, Yanfei, for driving this to monitor the performance in the Apache Flink Slack channel.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking forward to the workflow and detailed descriptions of flink-dev-benchmarks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best
> > > > > > > > > > > > > Yun Tang
> > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > From: Hangxiang Yu <master...@gmail.com>
> > > > > > > > > > > > > Sent: Thursday, October 27, 2022 10:59
> > > > > > > > > > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > > > > > > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring Moved from Ververica to Apache Flink Slack Channel
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi, Yanfei.
> > > > > > > > > > > > > Thanks for driving this.
> > > > > > > > > > > > > It could help us to detect and resolve regression problems quickly and officially.
> > > > > > > > > > > > > I'd like to join as a maintainer.
> > > > > > > > > > > > > Looking forward to the workflow.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks, Yanfei, for driving this and making the performance monitoring publicly available.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Looking forward to seeing the workflow, and more details as Martijn mentioned.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best
> > > > > > > > > > > > > > Yuan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Yanfei Lei,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for setting this up! It would be interesting to also know which aspects of Flink are monitored for "performance". I'm assuming there are specific pieces of functionality that are performance tested, but it would be great if this were written down somewhere (next to a procedure for how to detect a regression and what the next steps should be).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Martijn
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Yanfei,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for driving this! It's a great help.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would like to join as a maintainer.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Zakelly
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <fredia...@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > As discussed earlier, we planned to create a benchmark channel in the Apache Flink Slack[1], but the plan was shelved for a while[2].
> > > > > > > > > > > > > > > > > So I went on with this work and created the #flink-dev-benchmarks channel for performance regression notifications.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > We have a regression report script[3] that runs daily, and a notification is sent to the Slack channel when the last few benchmark results are significantly worse than the baseline.
> > > > > > > > > > > > > > > > > Note that regressions are detected by a simple script which may have false positives and false negatives. All benchmarks are executed on one physical machine[4], which is provided by Ververica(Alibaba)[5], so hardware issues may affect performance, as in "[FLINK-18614] Performance regression 2020.07.13"[6].
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > After the migration, we need a procedure to watch over the overall performance of the Flink code together. For example, if a regression occurs, the cause needs to be investigated and the problem resolved. In the past, this procedure was maintained internally within Ververica, but we think making the procedure public would benefit all. I volunteer to serve as one of the initial maintainers, and would be glad if more contributors could join me. I'll also prepare some guidelines to help others get familiar with the workflow. I will start a new thread to discuss the workflow soon.
> > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > [1] > > > >> > > > > > > > > > > > >> > > https://www.mail-archive.com/dev@flink.apache.org/msg58666.html > > > >> > > > > > > > > > > > > [2] > > > >> > https://issues.apache.org/jira/browse/FLINK-28468 > > > >> > > > > > > > > > > > > [3] > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > https://github.com/apache/flink-benchmarks/blob/master/regression_report.py > > > >> > > > > > > > > > > > > [4] http://codespeed.dak8s.net:8080 > > > >> > > > > > > > > > > > > [5] > > > >> > > > > > > > > > > > >> > > > https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > [6] > > > >> > https://issues.apache.org/jira/browse/FLINK-18614 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > -- > > > >> > > > > > > > > Best, > > > >> > > > > > > > > Hangxiang. > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > Yanfei Lei <fredia...@gmail.com> 于2023年1月12日周四 17:46写道: > > > > > > > > > Yanfei Lei <fredia...@gmail.com> 于2023年1月12日周四 17:46写道: > > > > > > > > Hi all, > > > > > > > > Thanks for the reminder. > > > > > > > > @Matthias > > > > > > > > any updates on the performance tests? ...or more specifically, any > > > updates > > > > on the script for alerting on performance regressions? > > > > > > > > > > > > I create a PR for FLINK-27571[1] but it's still under review, would > you > > > like to help take a look? 
> > > > > > > > FLINK-27571 is just for the new benchmarks, for the old existing > > > benchmarks, their information is stored > > > > > > > > in codespeed's database which can't be updated by URL request, so I > > also > > > logged into the Jenkins master > > > > > > > > and modified the codespeed's database, currently "less is better" can > > be > > > displayed normally on the timeline[2]. > > > > > > > > > > > > Does it make sense to formalize/document the process? > > > > > > > > Certainly, I'm preparing a draft to share my experience of finding > > > commits that caused regressions. > > > > > > > > Originally, I wanted to wait for FLINK-27571 to be merged before > > > starting a discussion, and I will put > > > > > > > > a draft of the document later. > > > > > > > > > > > > This slack channel can only provide notice of regression and some > > > experience on how to locate regression, > > > > > > > > but we also need some people to take action after the regression > > > happens. It is mainly a few people who volunteer to do these things, > > > > > > > > like FLINK-30015[3] and FLINK-30623[4], many thanks for Martijn's > > > contribution. > > > > > > > > As for whether to add the responsibilities to the release manager, I > > > think it needs to see other people's opinions. > > > > > > > > @Martijn > > > > > > > > Thanks for creating these tickets. For FLINK-30623 and > FLINK-30624[5], > > > @Hangxiang and I have located the corresponding commit > > > > > > > > and pinged the corresponding submitter. Regression may not be > avoided, > > I > > > totally do agree that this work needs to be formalized as soon as > > possible > > > to fix regressions. 
> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-27571 > > > > > > > > [2] > > > > > > http://codespeed.dak8s.net:8000/timeline/#/?ben=createScheduler.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200&exe=1,3,5,6,8,9 > > > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-30015 > > > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30623 > > > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-30624 > > > > > > > > > > > > Best regards, > > > > > > > > Yanfei > > > > > > > > > > > > Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三 01:11写道: > > > >> > > > >> Hi all, > > > >> > > > >> Related to Matthias' email, I've checked the notifications in the > > Slack > > > >> channel and noticed three major benchmark regressions. In the end, > > I've > > > >> decided to create Jira tickets for it [1] [2] [3] but I do agree > that > > > this > > > >> work needs to be formalized as soon as possible to avoid > regressions. > > It > > > >> would also be great to include a process on how these regressions > will > > > be > > > >> fixed, because I have no idea who to ping/notify that these > > regressions > > > >> have occurred. > > > >> > > > >> Best regards, > > > >> > > > >> Martijn > > > >> > > > >> [1] https://issues.apache.org/jira/browse/FLINK-30623 > > > >> [2] https://issues.apache.org/jira/browse/FLINK-30624 > > > >> [3] https://issues.apache.org/jira/browse/FLINK-30625 > > > >> > > > >> On Tue, Jan 10, 2023 at 1:56 PM Matthias Pohl > > > >> <matthias.p...@aiven.io.invalid> wrote: > > > >> > > > >> > Hi Yanfei, > > > >> > any updates on the performance tests? ...or more specifically, any > > > updates > > > >> > on the script for alerting on performance regressions? > > > >> > > > > >> > Does it make sense to formalize/document the process? Currently, > the > > > >> > release management doesn't do anything in terms of performance > > > >> > test monitoring. 
> > > >> > Therefore, performance regressions are not necessarily identified actively (in contrast to CI instabilities). Or is this covered by the PMC? It would be interesting to know whether there's someone to reach out to who's monitoring the regression tests regularly. Would it make sense for this person to join the release calls?
> > > >> >
> > > >> > Or shall we work on formalizing/documenting the process and integrating this responsibility into what the release manager(s) are in charge of? My concern with that approach is that contributors might be less willing to volunteer in the release management if we collect everything in one role. Alternatively, we could split the release manager role up into sub-roles that contributors can volunteer for in a release (e.g. CI monitoring, performance test monitoring, Jira maintenance, ... just coming up with random tasks here).
> > > >> >
> > > >> > Alternatively, we could leave everything as is and just respond if there's some complaint. I'm curious about your (and others') opinions.
> > > >> >
> > > >> > Matthias
> > > >> >
> > > >> > On Tue, Nov 29, 2022 at 2:13 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > >> >
> > > >> > > Hi Martijn,
> > > >> > >
> > > >> > > Thanks for bringing this up.
> > > >> > >
> > > >> > > In the past two months, this channel has helped us find many benchmark failure issues, like FLINK-29883[1], FLINK-29886[2], FLINK-30015[3] and FLINK-30181[4].
> > > >> > > I have also investigated several of the frequently reported regressions and replied under the notifications in the Slack channel (copied here):
> > > >> > >
> > > >> > > 1. serializerHeavyString <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>: It has been unstable for a long time; see FLINK-27165[5] for possible reasons.
> > > >> > > 2. Regressions are detected by a simple script which may have false positives and false negatives, especially for benchmarks with small absolute values, where small value changes cause large percentage changes. See [6] for details.
> > > >> > >
> > > >> > > Maybe slidingWindow <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200> (value~=600), stateBackends.ROCKS <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200> (value~=260) and serializerHeavyString <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200> (value~=170) are not true regressions.
> > > >> > >
> > > >> > > 3. For deployAllTasks.STREAMING <http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200>, the benchmark result is how much time it takes to deploy a job, so the lower the value, the better the performance; see [7] for details. FLINK-27571[8] would fix this problem.
> > > >> > >
> > > >> > > As mentioned before, regressions are detected by a simple script that is not very stable; FLINK-29825[9] was created to improve the benchmark's stability. I planned to invite more volunteers to monitor it after the regression checking became more stable, but I've been stuck with something else lately, sorry for the late response. Any suggestions on handling benchmark regressions/failures are welcome.
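The two pitfalls described above (small absolute values inflating percentage changes, and "less is better" benchmarks being read the wrong way round) can be illustrated with a minimal sketch. This is not the logic of regression_report.py; the function names, the `less_is_better` flag and the 5% threshold are assumptions made up for illustration.

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative change of current versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def is_regression(baseline: float, current: float,
                  threshold_pct: float = 5.0,      # hypothetical threshold
                  less_is_better: bool = False) -> bool:
    """Flag a regression, taking the benchmark's direction into account."""
    change = percent_change(baseline, current)
    if less_is_better:
        # e.g. a deployment-time benchmark: a higher value is worse
        return change > threshold_pct
    # throughput-style benchmark: a lower value is worse
    return change < -threshold_pct


# The same absolute drop of 10 units trips the threshold on a small score
# (around 170) but is negligible on a large one.
small = is_regression(170.0, 160.0)
large = is_regression(6000.0, 5990.0)
# A 20% increase is a regression only when less is better.
time_based = is_regression(100.0, 120.0, less_is_better=True)
throughput = is_regression(100.0, 120.0)
```

With a fixed percentage threshold, low-valued benchmarks will tend to produce more alerts for the same absolute noise, which is one possible reason such alerts need a manual check before a ticket is filed.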
> > > >> > >
> > > >> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-29886
> > > >> > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > >> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > >> > > [5] https://issues.apache.org/jira/browse/FLINK-27165
> > > >> > > [6] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136
> > > >> > > [7] https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58
> > > >> > > [8] https://issues.apache.org/jira/browse/FLINK-27571
> > > >> > > [9] https://issues.apache.org/jira/browse/FLINK-29825
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Yanfei
> > > >> > >
> > > >> > > Martijn Visser <martijnvis...@apache.org> wrote on Tue, Nov 29, 2022 at 15:54:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > >
> > > >> > > > Is there any update to be expected on the benchmark? I see results of the benchmark being posted to Slack, but it appears that they are not being monitored and no follow-up actions are being taken. I think it currently lacks a process on how to interpret the results, what action should be taken, and by whom.
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > >
> > > >> > > > Martijn
> > > >> > > >
> > > >> > > > On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <j...@ververica.com> wrote:
> > > >> > > >
> > > >> > > > > Thanks Yanfei for driving this!
> > > >> > > > >
> > > >> > > > > Looking forward to further discussion w.r.t. the workflow.
> > > >> > > > >
> > > >> > > > > Best regards,
> > > >> > > > > Jing
> > > >> > > > >
> > > >> > > > > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <mas.chen6...@gmail.com> wrote:
> > > >> > > > >
> > > >> > > > > > +1, thanks for driving this!
> > > >> > > > > >
> > > >> > > > > > On a side note, can we also ensure that a performance summary report for Flink major version upgrades is in release notes, once this infrastructure becomes mature? From the user perspective, it would be nice to know what the expected (or unexpected) regressions in a major version upgrade are. I've seen the community do something like this before (e.g. the major rocksdb version bump in 1.14?) and it was quite valuable to know that upfront!
> > > >> > > > > >
> > > >> > > > > > Best,
> > > >> > > > > > Mason
> > > >> > > > > >
> > > >> > > > > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <guoweijieres...@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > > > Thanks Yanfei for driving this.
> > > >> > > > > > >
> > > >> > > > > > > It allows us to easily find the problem of performance regression. Especially recently, I have made some improvements to the scheduling related parts, your work is very important to ensure that these changes do not cause some unexpected problems.
> > > >> > > > > > >
> > > >> > > > > > > Best regards,
> > > >> > > > > > >
> > > >> > > > > > > Weijie
> > > >> > > > > > >
> > > >> > > > > > > Congxian Qiu <qcx978132...@gmail.com> wrote on Fri, Oct 28, 2022 at 16:03:
> > > >> > > > > > >
> > > >> > > > > > > > Thanks for driving this and making the performance monitoring public; this helps us notice and resolve performance problems quickly.
> > > >> > > > > > > >
> > > >> > > > > > > > Looking forward to the workflow and detailed descriptions of flink-dev-benchmarks.
> > > >> > > > > > > >
> > > >> > > > > > > > Best,
> > > >> > > > > > > > Congxian
> > > >> > > > > > > >
> > > >> > > > > > > > Yun Tang <myas...@live.com> wrote on Thu, Oct 27, 2022 at 12:41:
> > > >> > > > > > > >
> > > >> > > > > > > > > Thanks, Yanfei, for driving this to monitor the performance in the Apache Flink Slack Channel.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Looking forward to the workflow and detailed descriptions of flink-dev-benchmarks.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Best
> > > >> > > > > > > > > Yun Tang
> > > >> > > > > > > > > ________________________________
> > > >> > > > > > > > > From: Hangxiang Yu <master...@gmail.com>
> > > >> > > > > > > > > Sent: Thursday, October 27, 2022 10:59
> > > >> > > > > > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > >> > > > > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring Moved from Ververica to Apache Flink Slack Channel
> > > >> > > > > > > > >
> > > >> > > > > > > > > Hi, Yanfei.
> > > >> > > > > > > > > Thanks for driving this.
> > > >> > > > > > > > > It could help us to detect and resolve regression problems quickly and officially.
> > > >> > > > > > > > > I'd like to join as a maintainer.
> > > >> > > > > > > > > Looking forward to the workflow.
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> > > >> > > > > > > > >
> > > >> > > > > > > > > > Thanks, Yanfei, for driving this and making the performance monitoring publicly available.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Looking forward to seeing the workflow, and more details as Martijn mentioned.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Best
> > > >> > > > > > > > > > Yuan
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <martijnvis...@apache.org> wrote:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > > Hi Yanfei Lei,
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Thanks for setting this up! It would be interesting to also know which aspects of Flink are monitored for "performance".
> > > >> > > > > > > > > > > I'm assuming there are specific pieces of functionality that are performance tested, but it would be great if this would be written down somewhere (next to a procedure how to detect a regression and what should be next steps).
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Best regards,
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > Martijn
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > > Hi yanfei,
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Thanks for driving this! It's a great help.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > I would like to join as a maintainer.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Best,
> > > >> > > > > > > > > > > > Zakelly
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <fredia...@gmail.com> wrote:
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > > Hi everyone,
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > As discussed earlier, we plan to create a benchmark channel in Apache Flink slack[1], but the plan was shelved for a while[2].
> > > >> > > > > > > > > > > > > So I went on with this work, and created the #flink-dev-benchmarks channel for performance regression notifications.
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > We have a regression report script[3] that runs daily, and a notification is sent to the slack channel when the last few benchmark results are significantly worse than the baseline.
> > > >> > > > > > > > > > > > > Note that regressions are detected by a simple script which may have false positives and false negatives. Also, all benchmarks are executed on one physical machine[4], which is provided by Ververica(Alibaba)[5], so hardware issues may affect performance, as in "[FLINK-18614] Performance regression 2020.07.13"[6].
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > After the migration, we need a procedure to watch over the entire performance of Flink code together.
> > > >> > > > > > > > > > > > > For example, if a regression occurs, the cause needs to be investigated and the problem resolved. In the past, this procedure was maintained internally within Ververica, but we think making the procedure public would benefit all. I volunteer to serve as one of the initial maintainers, and would be glad if more contributors can join me. I'll also prepare some guidelines to help others get familiar with the workflow. I will start a new thread to discuss the workflow soon.
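The announcement above describes alerts firing when "the last few benchmark results are significantly worse than the baseline". A rough sketch of that idea follows, assuming a throughput-style benchmark (higher is better) and comparing medians so that a single noisy run does not trigger an alert. The window sizes, the 5% threshold and the function name are hypothetical and are not taken from regression_report.py.

```python
from statistics import median


def recent_results_regressed(results: list[float],
                             recent: int = 3,        # hypothetical window
                             baseline: int = 20,     # hypothetical window
                             threshold_pct: float = 5.0) -> bool:
    """Compare the median of the newest runs with the median of earlier runs.

    `results` is ordered oldest to newest; higher scores are better.
    """
    if len(results) <= recent:
        return False  # not enough history to form a baseline
    recent_vals = results[-recent:]
    base_vals = results[-(recent + baseline):-recent]
    base = median(base_vals)
    if base == 0:
        return False  # a percentage change is undefined for a zero baseline
    change_pct = (median(recent_vals) - base) / base * 100.0
    return change_pct < -threshold_pct


# A sustained drop of about 10% is reported.
sustained_drop = recent_results_regressed([100.0] * 10 + [90.0, 91.0, 89.0])
# A single outlier among the newest runs is ignored by the median.
single_outlier = recent_results_regressed([100.0] * 10 + [100.0, 60.0, 100.0])
```

Using medians trades detection latency for fewer false positives: a genuine regression is only reported once it shows up in most of the recent window.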
> > > >> > > > > > > > > > > > >
> > > >> > > > > > > > > > > > > [1] https://www.mail-archive.com/dev@flink.apache.org/msg58666.html
> > > >> > > > > > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28468
> > > >> > > > > > > > > > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > >> > > > > > > > > > > > > [4] http://codespeed.dak8s.net:8080
> > > >> > > > > > > > > > > > > [5] https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
> > > >> > > > > > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-18614
> > > >> > > > > > > > >
> > > >> > > > > > > > > --
> > > >> > > > > > > > > Best,
> > > >> > > > > > > > > Hangxiang.
> > >
> > > --
> > > Best,
> > > Yanfei