Thanks for all the feedback and suggestions.

@Piotr:
>> I was setting the priority to a blocker and I would propose to add this to 
>> the instructions and general convention.

Thanks for sharing your experience; I will add this to the document.
Your suggestion of leveraging the two-sample Kolmogorov-Smirnov test
to detect regressions in FLINK-29825[1] is a great idea. I'm curious
about two things:

(1) It's not easy to construct the first distribution: "all latest
samples, from the latest one (1st one), until N'th latest sample". If
a regression occurred earlier in that window, it will "distort" the
first distribution and possibly lead to false positives; likewise, an
optimization can also cause such "distortion" and possibly lead to
false negatives, which is relatively acceptable. Maybe we can filter
out outliers to avoid this problem.

(2) Also, the Kolmogorov-Smirnov statistic is an absolute value, so by
itself it cannot tell whether a change is an improvement or a
regression.
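To make both concerns concrete, here is a rough sketch of how the
KS-test idea could look (the window sizes, the outlier filter and the
alpha value are placeholders I made up; comparing medians is just one
way to recover the direction that the KS statistic alone does not
give):

from statistics import median
from scipy.stats import ks_2samp

def detect_change(scores, n_recent=10, n_baseline=50, alpha=0.01):
    # scores: one benchmark's per-build scores, newest first
    recent = scores[:n_recent]
    baseline = scores[n_recent:n_recent + n_baseline]

    # crude outlier filtering, so that an older regression/optimization
    # distorts the first distribution less
    m = median(baseline)
    baseline = [s for s in baseline if 0.5 * m <= s <= 1.5 * m]

    stat, p_value = ks_2samp(recent, baseline)
    if p_value >= alpha:
        return "no significant change"
    # the KS statistic is unsigned, so compare medians to get the direction
    return "regression" if median(recent) < m else "improvement"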

My plan is/was to use the error to adjust each threshold: each
benchmark run has a "Score Error (99.9%)" value in the result[2], so
let threshold = 0.75 * (max error of the N latest samples), and keep
the rest of the logic the same as in the existing detection script.
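A minimal sketch of that plan, assuming the last N scores and their
"Score Error (99.9%)" values have already been parsed out of
jmh-result.csv[2] (the median comparison mirrors the existing script
only loosely):

from statistics import median

def is_regression(scores, score_errors, latest_score):
    # threshold = 0.75 * (max error of the N latest samples)
    threshold = 0.75 * max(score_errors)
    baseline = median(scores)
    return latest_score < baseline - threshold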


@Dong Lin:
The current detection script[3] is similar to the
average-throughput-deviation method you described. We use the median
of the last 100 days as the baseline and compare it with the median of
the last 20 days, which may delay regression alerts. Detecting by
deviation is a good approach; since each benchmark run already has a
"Score Error (99.9%)" value in the result, maybe we can also factor
"Score Error (99.9%)" into the deviation.

In addition, because regressions/improvements do occur, the median of
the baseline is itself unstable. I think RobustScaler[4] of Flink ML
is very suitable here; maybe RobustScaler can be used to compute the
baseline median.
The problem faced by this automatic detection tool is very similar to
novelty and outlier detection problems, so maybe some methods in
scikit-learn[5] could also be considered.
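Just to illustrate the direction with scikit-learn's counterparts
(RobustScaler there is the median/IQR analogue of the Flink ML
operator, and LocalOutlierFactor is one of the detectors listed in
[5]; which estimator and parameters actually fit our benchmark data is
an open question):

import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import LocalOutlierFactor

def robust_baseline(scores):
    # scores: one benchmark's score history
    x = np.asarray(scores, dtype=float).reshape(-1, 1)
    # centre on the median and scale by the IQR before outlier detection
    x_scaled = RobustScaler().fit_transform(x)
    # fit_predict returns 1 for inliers and -1 for outliers
    inliers = LocalOutlierFactor(n_neighbors=min(20, len(scores) - 1)).fit_predict(x_scaled) == 1
    # baseline median computed only over the inliers
    return float(np.median(x[inliers]))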

[1] 
https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
[2] 
http://codespeed.dak8s.net:8080/job/flink-benchmark-request/lastSuccessfulBuild/artifact/jmh-result.csv/*view*/
[3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
[4] 
https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/feature/robustscaler/
[5] https://scikit-learn.org/stable/modules/outlier_detection.html


Dong Lin <lindon...@gmail.com> wrote on Tue, Jan 31, 2023 at 10:23:
>
> Hi Piotr,
>
> Yes, the challenge of developing such an automatic tool is indeed to handle
> noise and achieve a balance between false positive and false negative. It
> is great to know that we already have scripts that can access historical
> benchmark data and generate alerts.
>
> There are some heuristics that we can use to considerably reduce
> false positive alerts. For example, we can measure the
> average-throughput-deviation of such a benchmark over 5 runs and only
> suppress alerts for benchmarks whose deviation is too high (i.e. the
> benchmark is too noisy and requires tuning or removal). And we can compare
> the deviation-from-last-commit with the average-throughput-deviation of
> this benchmark and report an error if these two values differ too much (i.e.
> the performance regression is observably higher than its typical noise).
>
> I implemented this benchmark framework
> <https://github.com/tensorflow/benchmarks/tree/master/perfzero> for
> TensorFlow three years ago and re-designed a regression detection algorithm
> internally at Google using the heuristics described above. I recall that
> the script worked reasonably well and most of the regression alerts were
> actionable. Maybe I can take a shot at implementing a similar algorithm for
> Flink benchmark sometime later and see how it works.
>
> Anyway, the development of the algorithm is probably not the focus of this
> topic. +1 for incorporating performance regression monitoring into routine
> process.
>
> Best,
> Dong
>
>
> On Mon, Jan 30, 2023 at 10:56 PM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
> > Hi Dong,
> >
> > The main issue with an automatic tool at the moment is that some benchmarks
> > are quite noisy and performance regressions are often within the noise of a
> > given benchmark. Our currently existing tooling can not handle those cases.
> > Until we address this issue, I think it will have to remain a manual
> > process. There is a ticket mentioned by Yuan [1] where I have written a
> > comment and a proposal on how to improve the automatic performance
> > regression detection.
> >
> > Best,
> > Piotrek
> >
> > [1]
> >
> > https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> >
> >
> > On Mon, Jan 30, 2023 at 15:31, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hi Yanfei,
> > >
> > > Thanks for driving the benchmark monitoring effort! The Google doc and
> > the
> > > community wiki looks pretty good.
> > >
> > > According to Yuan's comment, it seems that we currently manually watch
> > the
> > > benchmark results to detect regression. Have we considered automating
> > this
> > > process by e.g. exporting the nightly benchmark results to a database and
> > > using scripts to detect regression based on pre-defined rules?
> > >
> > > This approach is probably more scalable and accurate in the long term.
> > And
> > > I had a good experience working with such a regression detection tool in
> > my
> > > past job.
> > >
> > > Thanks,
> > > Dong
> > >
> > >
> > >
> > > On Thu, Jan 19, 2023 at 4:02 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > I'd like to start a discussion about incorporating performance
> > > > regression monitoring into the routine process. Flink benchmarks are
> > > > periodically executed on http://codespeed.dak8s.net:8080 to monitor
> > > > Flink performance. In late Oct'22, a new slack channel
> > > > #flink-dev-benchmarks was created for notifications of performance
> > > > regressions. It helped us find 2 build failures[1,2] and 5 performance
> > > > regressions[3,4,5,6,7] in the past 3 months, which is very meaningful
> > > > to ensuring the quality of the code.
> > > >
> > > > There are some release managers( cc @Matthias, @Martijn, @Qingsheng)
> > > > proposing to incorporate performance regression monitoring into the
> > > > release management, I think it makes sense for performance stabilities
> > > > (like CI stabilities), since almost every release has some tickets
> > > > about performance optimizations, the performance monitoring can
> > > > effectively avoid performance regression and track the performance
> > > > improvement of each release. So I start this discussion to pick
> > > > everyone’s brain for some suggestions.
> > > >
> > > > In the past, I checked the slack notifications once a week, and I have
> > > > summarized a draft[8](
> > > >
> > >
> > https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > )
> > > > on how to deal with performance regressions according to some
> > > > contributors and my own experience. If the above proposal is
> > > > considered acceptable, I’d like to put it in the community wiki[9].
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > [8]
> > > >
> > >
> > https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > [9]
> > > >
> > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > >
> > > > Best,
> > > > Yanfei
> > > >
> > >
> >



-- 
Best,
Yanfei
