[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-03-09 Thread Dong Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698784#comment-17698784
 ] 

Dong Lin commented on FLINK-29825:
--

Merged to apache/flink-benchmarks master branch: 
7d2013a9f401366bc9073857175f434882867bfe

> Improve benchmark stability
> ---
>
> Key: FLINK-29825
> URL: https://issues.apache.org/jira/browse/FLINK-29825
> Project: Flink
>  Issue Type: Improvement
>  Components: Benchmarks
>Affects Versions: 1.17.0
>Reporter: Yanfei Lei
>Assignee: Yanfei Lei
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, regressions are detected by a simple script that can produce false 
> positives and false negatives, especially for benchmarks with small absolute 
> values, where small value changes cause large percentage changes; see 
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
>  for details.
> In addition, all benchmarks are executed on a single physical machine, so 
> hardware issues can affect performance, as in "[FLINK-18614] Performance 
> regression 2020.07.13".
> This ticket aims to improve the precision and recall of the regression-check 
> script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-12 Thread Yanfei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687668#comment-17687668
 ] 

Yanfei Lei commented on FLINK-29825:


Thanks for taking the time to review the evaluation results. Writing a blog is 
a good idea, and I also intend to implement Dong's algorithm completely (only 
the max-based algorithm under "moreisbetter" was implemented during the 
evaluation) to replace the median-based algorithm.



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-10 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687143#comment-17687143
 ] 

Piotr Nowojski commented on FLINK-29825:


Yes, that's a good idea :)



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-10 Thread Dong Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687116#comment-17687116
 ] 

Dong Lin commented on FLINK-29825:
--

Thanks [~Yanfei Lei] for the detailed evaluation results! Maybe we can write a 
blog together based on your evaluation results.



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-10 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687098#comment-17687098
 ] 

Piotr Nowojski commented on FLINK-29825:


Thanks a lot for the very detailed comparison, [~Yanfei Lei]. Let's go with 
[~lindong]'s proposal!



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-10 Thread Yanfei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687082#comment-17687082
 ] 

Yanfei Lei commented on FLINK-29825:


[~pnowojski] I tried to use Hunter to detect regressions, and 
[here|https://docs.google.com/document/d/1coI4eJsauBtrlS1Z77bhGf-hNtDEXbzuwacG5ZPCMc8/edit?usp=sharing]
 are some evaluation results of the three algorithms. I'm not sure I fully 
understand the usage of Hunter; it looks like it can only detect regressions in 
the history sequence, so I modified it a little to detect regressions in the 
latest commit. Correct me if something is wrong in the document :D



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-07 Thread Dong Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685641#comment-17685641
 ] 

Dong Lin commented on FLINK-29825:
--

Thanks [~Yanfei Lei] for implementing and evaluating the algorithm!

[~pnowojski] Cool, I think we have agreed to make incremental improvements and 
to use the algorithm proposed in the above doc to detect regressions for Flink 
benchmarks.

We probably still have different understandings regarding the pros/cons of 
these alternative choices. It will be great if you or someone else can help 
implement an alternative choice and show that it can do better than the one we 
are going to use. I probably won't have time to try the Hunter algorithm myself 
in the near future.






[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-07 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685367#comment-17685367
 ] 

Piotr Nowojski commented on FLINK-29825:


Thanks for the investigation [~Yanfei Lei]. As I said, I'm pretty sure we 
should be able to find a better, more sophisticated solution, but I cannot dive 
deeper into this myself. I would encourage one of you to take a look at the 
Hunter tool that I mentioned above, and maybe include it in the comparison. At 
the same time, if you are strongly inclined towards [~lindong]'s idea, I 
wouldn't block it, as it's indeed most likely an improvement over what we have 
right now.



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-07 Thread Yanfei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685205#comment-17685205
 ] 

Yanfei Lei commented on FLINK-29825:


[~lindong] 
Thanks for the algorithm you proposed. I wrote a 
[script|https://github.com/fredia/flink-benchmarks/blob/FLINK-29825/check_regression.py]
 to test it briefly; the new algorithm shows better sensitivity than the 
existing median-based method.
I did two kinds of tests:
1. For benchmarks where a regression has occurred:
    a. Under appropriate parameters, the new algorithm has higher precision 
and recall on most benchmarks.
    b. The new algorithm finds the regression faster; the current algorithm 
needs to wait until the median window slides into the corresponding interval, 
which means the regression may have existed for several days.
2. For noisy benchmarks:
    a. The new algorithm produces fewer false positives on most benchmarks, 
like fireProcessingTimers of Flink (Java11) and fireProcessingTimers of Flink.
    b. For benchmarks with a regression hidden in the noise (like 
serializerTuple of Flink (Java11)), the new algorithm can still detect it, 
while the existing median-based method cannot.

In my opinion, the new algorithm is concise and efficient, and it also avoids 
the effects of a baseline distorted by a past regression.



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-06 Thread Dong Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684999#comment-17684999
 ] 

Dong Lin commented on FLINK-29825:
--

[~pnowojski] I think one drawback of your proposal is that it compares two 
distributions and depends on having a large enough number of samples in both. 
This means that when a regression happens, we need to run enough commit points 
for the recent distribution to become considerably different from the previous 
distribution according to the Kolmogorov-Smirnov test, which would considerably 
delay time-to-regression-detection. My proposal would not suffer from this 
issue, since it lets users specify how many commit points must repeat the 
regression before an alert is sent, and this number can be as low as 1-3 
commit points.

Regarding the drawback of not detecting "there is visible performance 
regression within benchmark noise": my proposal is to either exclude noisy 
benchmarks completely, or require the regression to be 2X the noise (the ratio 
is also tunable). These sound like reasonable practical solutions, right?
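To make the discussion concrete, here is a toy sketch of the max-based check 
with the 2X-the-noise rule and the repeat-over-commit-points rule described 
above. The parameter names, default values, and exact comparison are my own 
illustrative guesses, not the actual specification from the doc:

```python
def is_regression(history, noise, ratio=2.0, repeat=2):
    """history: per-commit benchmark scores for a "more is better" metric,
    oldest first. noise: the benchmark's typical fluctuation.
    Flag a regression only if each of the last `repeat` commits falls short
    of the best past score by more than ratio * noise."""
    baseline = max(history[:-repeat])  # best performance seen before
    recent = history[-repeat:]
    return all(baseline - v > ratio * noise for v in recent)

# A sustained drop well beyond the noise band is flagged...
print(is_regression([100, 102, 101, 90, 89], noise=3))   # True
# ...while a dip within 2x the noise is not.
print(is_regression([100, 102, 101, 100, 99], noise=3))  # False
```

Requiring the drop to persist over `repeat` commits is what keeps a single 
noisy run from triggering an alert.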

BTW, I don't think we will be able to have perfect regression detection without 
any drawbacks (e.g. 0 false positives and 0 false negatives). The question is 
whether the proposed solution is useful enough (i.e. low false positive and low 
false negative rates) and whether it is the best among all available choices. 
So it can be OK if some regression goes undetected, like the one mentioned 
above.


BTW, regarding the noisy benchmark mentioned above, I am curious how the 
Kolmogorov-Smirnov test can address this issue. Maybe I can update my proposal 
to re-use the idea. Can you help explain it?

I will take a look at the tooling mentioned above later to see if we can learn 
from it or re-use it.



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-06 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684677#comment-17684677
 ] 

Piotr Nowojski commented on FLINK-29825:


I have responded on the dev mailing list, but let's maybe move the discussion 
here.

[~lindong], the Kolmogorov-Smirnov test was just the result of a quick Google 
search for relevant mathematical concepts. I have a feeling it could be adapted 
into something that would work for us. For example, instead of checking the 
supremum between the two empirical distribution functions (EDFs), we could add 
up the differences between those distribution functions. If the new EDF sits on 
lower values, the sum of differences would be negative, which would point 
toward a regression. But maybe there are better approaches.
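As a rough illustration of that sum-of-differences variant, here is a minimal 
pure-Python sketch. It is not a vetted statistic; the function names and the 
sign convention (baseline EDF minus recent EDF, assuming a "more is better" 
metric) are my own:

```python
import bisect

def edf(sorted_xs, t):
    # Empirical distribution function: fraction of samples <= t.
    return bisect.bisect_right(sorted_xs, t) / len(sorted_xs)

def signed_edf_diff(recent, baseline):
    """Sum of (baseline EDF - recent EDF) over all observed points.
    For a "more is better" benchmark, a regression shifts the recent
    samples lower, which raises the recent EDF at each point, so the
    sum goes negative; an improvement makes it positive."""
    recent, baseline = sorted(recent), sorted(baseline)
    points = recent + baseline
    return sum(edf(baseline, t) - edf(recent, t) for t in points)

print(signed_edf_diff([900, 910, 890], [1000, 1010, 990]))   # negative: regression
print(signed_edf_diff([1000, 1010, 990], [900, 910, 890]))   # positive: improvement
```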

I think the drawback of your proposal is that it wouldn't detect a visible 
performance regression hidden within the benchmark noise, even though this 
should be doable with a large enough number of samples: for example, when the 
results oscillate randomly around 1000 (+/- 150) and a performance regression 
changes them to 900 (+/- 135). And we have quite a lot of noisy benchmarks, 
like 
[this|http://codespeed.dak8s.net:8000/timeline/?ben=fireProcessingTimers=2] 
or [this|http://codespeed.dak8s.net:8000/timeline/?ben=serializerTuple=2]. 

I was also informed about some tooling created exactly for detecting 
performance regressions from benchmark results: 

> fork of  Hunter - a perf change detection tool, originally from DataStax:
> Blog post - 
> [https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4]
> Paper - [https://arxiv.org/pdf/2301.03034.pdf]
> Our fork - [https://github.com/ge/hunter]

The algorithm used underneath, "E-Divisive Means", sounds promising. 



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-02-04 Thread Dong Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684183#comment-17684183
 ] 

Dong Lin commented on FLINK-29825:
--

According to the wiki, the two-sample Kolmogorov-Smirnov test determines 
whether two distributions (i.e. collections of values) are close enough. 
Regression detection, on the other hand, is more about determining whether a 
single value (e.g. the latest performance) is observably worse than the best 
performance in the past. These are two quite different problems.

Is there any success story of using Kolmogorov-Smirnov to detect regressions 
in practice?

I drafted this doc 
([https://docs.google.com/document/d/1Bvzvq79Ll5yxd1UtC0YzczgFbZPAgPcN3cI0MjVkIag])
 to explain the algorithm I would like to try for detecting Flink regressions. 
It is not exactly the same as the one I used before for TensorFlow (because I 
lost that doc), but the ideas are pretty much the same. Using the heuristics 
described in this doc, I am confident it should have a much lower false 
positive rate than the relatively simple formula used in the existing script 
([https://github.com/apache/flink-benchmarks/blob/master/regression_report.py]).

The parameters of this algorithm (e.g. the threshold for regression detection) 
need to be tuned based on the benchmark data.

Hopefully I can find time to implement and evaluate this algorithm in the 
coming two weeks. The main issue I don't know how to address yet is how to 
update the script to get the maximum and deviation of throughput across 
multiple runs for a given commit point.



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-01-30 Thread Yanfei Lei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682413#comment-17682413
 ] 

Yanfei Lei commented on FLINK-29825:


Sorry for the late reply. I'm planning to improve the regression detection 
script.

Leveraging the [Two-sample Kolmogorov-Smirnov 
test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test]
 to detect regressions is a good idea, with two caveats:

(1) I'm concerned that it's not easy to construct the first distribution: "all 
latest samples, from the latest one (1st one), until the N'th latest sample". 
If some regressions occurred before, they will "distort" the first 
distribution, possibly leading to false positives; meanwhile, an optimization 
can also cause "distortion", which possibly leads to false negatives (that is 
relatively acceptable). Maybe we can filter out outliers to avoid this problem.

(2) Also, the [Kolmogorov-Smirnov 
test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test]
 statistic is an absolute value, so by itself it cannot tell whether a change 
is an improvement or a regression.

My plan is/was to use the error to adjust each threshold. Each benchmark run 
has an error value in its result: let thr = 0.75 * (max error of the N latest 
samples), with the other logic the same as the existing detection script.
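A small sketch of that thr formula in Python. The surrounding median-based 
check is my guess at "the same as the existing detection script", and the 
function and parameter names are illustrative, not from the actual script:

```python
import statistics

def error_adjusted_check(values, errors, factor=0.75):
    """values: recent benchmark scores, newest last ("more is better");
    errors: the error value reported with each run, same order.
    thr = 0.75 * (max error among the recent runs); flag a regression
    when the latest score drops below the baseline median by more than thr."""
    thr = factor * max(errors)
    baseline = statistics.median(values[:-1])
    return (baseline - values[-1]) > thr

# thr = 0.75 * 3 = 2.25; a drop of 10 is flagged, a drop of 1 is not.
print(error_adjusted_check([100, 101, 99, 90], [2, 2, 3, 2]))  # True
print(error_adjusted_check([100, 101, 99, 99], [2, 2, 3, 2]))  # False
```

The point of scaling by the per-run error is that noisy benchmarks get a wider 
tolerance automatically instead of a fixed percentage.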



[jira] [Commented] (FLINK-29825) Improve benchmark stability

2023-01-20 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679077#comment-17679077
 ] 

Piotr Nowojski commented on FLINK-29825:


Thanks for picking this up [~Yanfei Lei]. Are you planning to improve benchmark 
stability itself (1), or only the regression detection script (2)?

For (2), for some time I have been playing with the idea of how to more 
reliably detect regressions in noisy benchmarks like this one:
http://codespeed.dak8s.net:8000/timeline/#/?exe=1=fireProcessingTimers=on=on=off=2=200
My assumption is/was that this must be a pretty well-known problem, and that 
it's only a matter of finding the right algorithm, for example maybe the [Two-sample 
Kolmogorov-Smirnov 
test|https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test].
 It looks like we could split the samples into two distributions, for example:
1. all latest samples, from the latest one (1st one), until the N'th latest sample 
2. samples from the N'th latest to the M'th latest (M > N)

We would then have two distributions and could use the Kolmogorov-Smirnov test 
to compare how similar they are, and report a change if the difference is 
greater than some threshold.
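The two-window comparison described above can be prototyped in a few lines of 
pure Python. This is only a sketch: the window sizes, the threshold, and the 
function names are placeholders, and in practice something like 
scipy.stats.ks_2samp would also give a proper p-value:

```python
import bisect

def ecdf(sorted_xs, t):
    # Fraction of samples <= t (empirical CDF).
    return bisect.bisect_right(sorted_xs, t) / len(sorted_xs)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the supremum of the
    absolute difference between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in a + b)

def looks_changed(samples, n, m, threshold):
    # samples ordered newest first, matching the windows above:
    # window 1 is the latest n samples, window 2 is samples n..m.
    recent, baseline = samples[:n], samples[n:m]
    return ks_statistic(recent, baseline) > threshold

# The latest 3 samples sit clearly below the 6 baseline samples.
print(looks_changed([90, 91, 89, 100, 101, 99, 100, 102, 98], 3, 9, 0.8))  # True
# With the latest samples inside the baseline's range, nothing is flagged.
print(looks_changed([100, 101, 99, 100, 102, 98, 101, 99, 100], 3, 9, 0.8))  # False
```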

Probably there are also other ways. 
