[ https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682413#comment-17682413 ]
Yanfei Lei commented on FLINK-29825:
------------------------------------

Sorry for the late reply. I'm planning to improve the stability of the regression detection script.

Leveraging the [Two sample Kolmogorov-Smirnov test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test] to detect regressions is a good idea, but I have two concerns:

(1) I'm concerned that it's not easy to construct the first distribution: "all latest samples, from the latest one (1st one), until N'th latest sample". If some regressions have occurred before, they will "distort" the first distribution, possibly leading to false positives. Optimizations can cause the same kind of "distortion", possibly leading to false negatives, though that is relatively acceptable. Maybe we can filter out outliers to avoid this problem.

(2) Also, the [Kolmogorov-Smirnov test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test] statistic is an absolute value, so it cannot tell whether a change is an improvement or a regression.

My plan is/was to use the error to adjust each threshold. Each benchmark run has an error value in its result, so let thr = 0.75 * (max error among the N latest samples), and keep the other logic the same as in the existing detection script.

> Improve benchmark stability
> ---------------------------
>
>                 Key: FLINK-29825
>                 URL: https://issues.apache.org/jira/browse/FLINK-29825
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Minor
>
> Currently, regressions are detected by a simple script which may have false
> positives and false negatives, especially for benchmarks with small absolute
> values, small value changes would cause large percentage changes. see
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
> for details.
> And all benchmarks are executed on one physical machine, it might happen that
> hardware issues affect performance, like "[FLINK-18614] Performance
> regression 2020.07.13".
>
> This ticket aims to improve the precision and recall of the regression-check
> script.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
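
For illustration only, the two ideas discussed in the comment above could be sketched roughly as follows. This is not the code in regression_report.py. All function names are hypothetical, the KS statistic is computed by hand with the standard library, and the sketch assumes that a higher score is better (e.g. throughput) and that the per-run error is in the same unit as the score.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def direction(baseline, recent):
    """The KS statistic is unsigned, so compare medians to get a direction.

    Assumption: a higher score is better.
    """
    return "regression" if median(recent) < median(baseline) else "improvement"


def error_threshold(errors, factor=0.75):
    """thr = 0.75 * (max error among the N latest samples)."""
    return factor * max(errors)


def is_regression(scores, errors, latest):
    """Flag `latest` if it is below the baseline median by more than thr.

    Assumption: the baseline is the median of the N latest scores.
    """
    return median(scores) - latest > error_threshold(errors)
```

With scores of 100, 102, 98, 101, 99 and a maximum error of 4.0, the threshold is 3.0, so a new score of 95 would be flagged while 98 would not. How the threshold plugs into the existing script is left open here.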