[ https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679077#comment-17679077 ]
Piotr Nowojski edited comment on FLINK-29825 at 1/20/23 9:17 AM:
-----------------------------------------------------------------

Thanks for picking this up [~Yanfei Lei]. Are you planning to improve the stability of the benchmarks themselves (1), or only the regression-detection script (2)?

For (2), I have been playing for some time with the idea of how to more reliably detect regressions in noisy benchmarks like this one: http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=fireProcessingTimers&extr=on&quarts=on&equid=off&env=2&revs=200

My assumption is/was that this must be a pretty well-known problem, and that it's only a matter of finding the right algorithm, for example the [two-sample Kolmogorov-Smirnov test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test]. We could split the samples into two groups, for example:
1. the latest samples, from the most recent one (the 1st) up to the N'th latest sample
2. the samples from the N'th latest to the M'th latest (M > N)

That would give us two distributions, and we could use the Kolmogorov-Smirnov test to measure how similar they are, reporting a regression if the difference is greater than some threshold (see the sketch below). There are probably other ways as well.

Could you share your thoughts/plans on this issue [~Yanfei Lei]?


> Improve benchmark stability
> ---------------------------
>
>                 Key: FLINK-29825
>                 URL: https://issues.apache.org/jira/browse/FLINK-29825
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce false positives and false negatives, especially for benchmarks with small absolute values, where small value changes cause large percentage changes; see [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136] for details.
> In addition, all benchmarks are executed on a single physical machine, so hardware issues may affect performance, as in "[FLINK-18614] Performance regression 2020.07.13".
>
> This ticket aims to improve the precision and recall of the regression-check script.
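
Below is a minimal, hypothetical sketch of the windowed two-sample Kolmogorov-Smirnov check described above, assuming {{scipy}} is available. The window sizes {{n}} and {{m}}, the significance level {{alpha}}, and the helper name {{looks_like_regression}} are illustrative assumptions and are not taken from flink-benchmarks.

{code:python}
# Hypothetical sketch only: window sizes, threshold, and names are illustrative.
import numpy as np
from scipy.stats import ks_2samp


def looks_like_regression(scores, n=20, m=60, alpha=0.01):
    """Compare the N latest benchmark samples against the older N..M samples.

    `scores` is assumed to be ordered newest-first, with higher values meaning
    better performance (e.g. records/ms). Returns True if the two windows come
    from significantly different distributions *and* the recent window is worse.
    """
    recent = np.asarray(scores[:n], dtype=float)
    baseline = np.asarray(scores[n:m], dtype=float)
    if len(recent) < n or len(baseline) < (m - n) // 2:
        return False  # not enough history to make a call

    _statistic, p_value = ks_2samp(recent, baseline)
    # Flag only shifts in the "worse" direction, so improvements are not reported.
    return p_value < alpha and np.median(recent) < np.median(baseline)


# Example: a benchmark that was steady around ~100 records/ms and recently dropped ~10%.
rng = np.random.default_rng(0)
history = list(rng.normal(90, 3, size=20)) + list(rng.normal(100, 3, size=40))
print(looks_like_regression(history))  # expected: True
{code}

Such a check could be re-run on each new build as a sliding window, complementing or replacing the percentage-based threshold in regression_report.py; suitable values for the windows and the significance level would have to be tuned against historical data.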