[jira] [Commented] (FLINK-29825) Improve benchmark stability

Yanfei Lei (Jira) Mon, 30 Jan 2023 23:21:18 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682413#comment-17682413
 ]


Yanfei Lei commented on FLINK-29825:
------------------------------------

Sorry for the late reply. I'm planning to improve the stability of the 
regression detection script.

Using the [two-sample Kolmogorov-Smirnov 
test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test]
 to detect regressions is a good idea, but I have two concerns:

(1) I'm concerned that it's not easy to construct the first distribution: "all 
latest samples, from the latest one (1st one), until N'th latest sample". If 
some regressions have already occurred, they will "distort" the first 
distribution, possibly leading to false positives. Optimizations can also 
"distort" it, possibly leading to false negatives, which is relatively 
acceptable. Maybe we can filter out outliers to avoid this problem. 

(2) Also, the statistic of the [Kolmogorov-Smirnov 
test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test]
 is an absolute value, so it cannot tell us whether a detected change is an 
improvement or a regression.
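
To illustrate point (2), here is a minimal pure-Python sketch of the two-sample 
KS statistic that also keeps the two one-sided components, so the direction of 
a shift stays visible. The names signed_ks, d_lower and d_higher are 
hypothetical and not part of the existing script, and no p-value is computed. 
Whether "lower" means regression depends on the benchmark: it does for 
throughput-style scores where higher is better.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Empirical CDF of an already sorted sample, evaluated at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def signed_ks(baseline, recent):
    """Return (d, d_lower, d_higher) for two non-empty samples.

    d        : the usual two-sided KS statistic, max |F_recent - F_baseline|
    d_lower  : max of F_recent - F_baseline, large when recent values are smaller
    d_higher : max of F_baseline - F_recent, large when recent values are larger
    """
    a = sorted(baseline)
    b = sorted(recent)
    # Both ECDFs are step functions that only change at sample points,
    # so checking every observed value is enough to find the extremes.
    diffs = [ecdf(b, x) - ecdf(a, x) for x in a + b]
    d_lower = max(0.0, max(diffs))
    d_higher = max(0.0, -min(diffs))
    return max(d_lower, d_higher), d_lower, d_higher


if __name__ == "__main__":
    baseline = [100, 101, 99, 102, 98]  # made-up scores, not real results
    recent = [90, 91, 89, 92, 88]
    print(signed_ks(baseline, recent))
```

Comparing d_lower with d_higher tells which way the recent samples moved, 
which the two-sided value d alone cannot.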

 

My plan is/was to use the error to adjust each threshold: each benchmark run 
reports an error value in its result, so let 
thr = 0.75 * (max error of the N latest samples), and keep the rest of the 
logic the same as in the existing detection script.
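
A rough sketch of that threshold, under several assumptions: the error is in 
the same units as the score, a higher score is better, and the median 
comparison below is only a stand-in for the existing logic in 
regression_report.py, which is not reproduced here. The function names and the 
"recent" parameter are hypothetical.

```python
from statistics import median


def error_based_threshold(errors, n, factor=0.75):
    """thr = factor * (max error among the n latest runs).

    errors is ordered from oldest to newest.
    """
    if n <= 0 or not errors:
        raise ValueError("need n > 0 and at least one error value")
    return factor * max(errors[-n:])


def is_regression(scores, errors, n, recent=3):
    """Flag a regression when the median of the last `recent` runs drops
    below the median of the earlier runs in the window by more than thr.

    Assumption: higher score is better. This comparison is illustrative only.
    """
    window = scores[-n:]
    if len(window) <= recent:
        raise ValueError("window must be longer than the recent segment")
    thr = error_based_threshold(errors, n)
    baseline = median(window[:-recent])
    current = median(window[-recent:])
    return (baseline - current) > thr


if __name__ == "__main__":
    scores = [100, 101, 99, 100, 102, 90, 91, 89]  # made-up values
    errors = [2, 3, 2, 4, 2, 3, 2, 2]
    print(error_based_threshold(errors, 8), is_regression(scores, errors, 8))
```

Tying the threshold to the measured error means noisy benchmarks get a wider 
tolerance than stable ones, instead of sharing one fixed percentage.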

> Improve benchmark stability
> ---------------------------
>
>                 Key: FLINK-29825
>                 URL: https://issues.apache.org/jira/browse/FLINK-29825
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Minor
>
> Currently, regressions are detected by a simple script which may have false 
> positives and false negatives, especially for benchmarks with small absolute 
> values, where small value changes cause large percentage changes. See 
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
>  for details.
> In addition, all benchmarks are executed on one physical machine, so hardware 
> issues might affect performance, as in "[FLINK-18614] Performance 
> regression 2020.07.13".
>  
> This ticket aims to improve the precision and recall of the regression-check 
> script.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
