[ https://issues.apache.org/jira/browse/BEAM-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17549099#comment-17549099 ]

Danny McCormick commented on BEAM-11431:
----------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/20707

> Automated Release Performance Benchmark Regression/Improvement comparison
> -------------------------------------------------------------------------
>
>                 Key: BEAM-11431
>                 URL: https://issues.apache.org/jira/browse/BEAM-11431
>             Project: Beam
>          Issue Type: Improvement
>          Components: testing
>            Reporter: Robert Burke
>            Priority: P3
>
> While running the release, we have a step that has us check for performance 
> regressions:
> [https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions]
>  
> However, all we're able to check is the measured graphs over time. We have no 
> clear indication of what these metrics were for the last release; we can only 
> see vague trends in the line graph. To be clear, the line graph is excellent 
> for spotting large sudden changes, or small changes over a long period of 
> time, but it doesn't serve the release manager very well.
> For one, infra might have changed in the meantime (compilers, test machine 
> hardware, load variables, and the benchmarking code itself), which makes 
> comparing any two points in those graphs very difficult. Worse, each point is 
> only a single run, which puts it at the mercy of variance. That makes it hard 
> to tell whether a change is genuinely an improvement in all cases.
> This Jira proposes that we make it possible to reproducibly performance test 
> and compare two releases. In addition, we should be able to publish the 
> results of our benchmarks, and the comparison to the previous release, along 
> with the rest of the release artifacts.
> Obvious caveat: if there are new tests that can't run on the previous release 
> (or old tests that can't run on the new release), they're free to be 
> excluded. This can be automatic, by tagging the tests somehow, or done by 
> publishing explicit manual exclusions or inclusions. This implies that the 
> tests live on the user side and rely on a given set of released SDK or Runner 
> artifacts for execution.
> Ideally the release manager can run a good chunk of these tests on their 
> local machine, or on a host in the cloud. Any such cloud resources should be 
> identical for the before and after comparisons. E.g. if one is comparing 
> Flink performance, then the same machine types should be used to compare Beam 
> version X and X+1.
> As inspiration, a Go tool called benchstat does what I'm talking about for 
> the Go benchmark format. See the description in the documentation here: 
> [https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme]
>  
> It takes the results from one or more runs of a given benchmark (measuring 
> time per operation, memory throughput, allocs per operation, etc.) on the 
> old system, and the same from the new system, and produces averages and 
> deltas, presented in a compact tabular format,
> e.g.
> {{$ benchstat old.txt new.txt}}
> {{name        old time/op  new time/op  delta}}
> {{GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)}}
> {{JSONEncode  32.1ms ± 1%  31.8ms ± 1%     ~     (p=0.286 n=4+5)}}
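> For reference, benchstat consumes the standard Go benchmark output format: 
> one line per run, of the form {{Benchmark<Name> <iterations> <value> 
> <unit>}}. A minimal sketch of a producer using only the stock {{testing}} 
> package (the gob workload here is a stand-in, not a Beam benchmark; a Beam 
> harness would time a pipeline or transform instead):
> {code:go}
> // gobencode_bench_test.go: a minimal benchmark whose output is already in
> // the text format benchstat consumes.
> package bench
> 
> import (
>     "bytes"
>     "encoding/gob"
>     "testing"
> )
> 
> // BenchmarkGobEncode is a placeholder workload only.
> func BenchmarkGobEncode(b *testing.B) {
>     payload := map[string]int{"a": 1, "b": 2, "c": 3}
>     for i := 0; i < b.N; i++ {
>         var buf bytes.Buffer
>         if err := gob.NewEncoder(&buf).Encode(payload); err != nil {
>             b.Fatal(err)
>         }
>     }
> }
> {code}
> Running {{go test -bench . -count 5}} against the old release and redirecting 
> the output to old.txt, then repeating against the new release to produce 
> new.txt, yields exactly the two files benchstat compares above.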
> This would be a valuable way to produce and present results for users, and to 
> more easily validate what performance characteristics have changed between 
> versions.
> Given the size, breadth, and distributed nature of Beam and its associated 
> infrastructure, this is something we likely only wish to do along with the 
> release. It will likely be time consuming and, for larger scale load tests on 
> cloud resources, expensive. In order to make meaningful comparisons, as much 
> as possible needs to be invariant between the releases under comparison.
> In particular, if running on a distributed set of resources (e.g. a cloud 
> cluster), the machine type and number of machines should remain invariant 
> (Spark and Flink clusters should be the same size; Dataflow being different 
> is trickier, but should be unrestricted, as that's the point). Local tests on 
> a single machine are comparable by themselves as well.
> The published results should include the specifics of the machine(s) the 
> tests ran on: CPU, clock speed, amount of RAM, number of machines if 
> distributed, and the official cloud designation if using cloud provider VMs 
> (i.e. machine types like e2-standard-4, n2-highcpu-32, c6g.4xlarge, or D8d 
> v4).
> The overall goal is to be able to run the comparisons on a local machine, and 
> to be able to send jobs to clusters in clouds. Actual provisioning of cloud 
> resources is a non-goal of this proposal.
> Given a test (or set of tests), we should be able to generate a text file 
> with the results, for collation similar to what Go's benchstat does. Bonus 
> points if we can have benchstat handle the task for us without modification.
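> One way that could look, as a sketch only: any harness that can print its 
> measurements as {{Benchmark<Name> <iterations> <value> <unit>}} lines can be 
> collated by an unmodified benchstat. The {{Result}} struct and the metric 
> names below are hypothetical; only the output format is fixed.
> {code:go}
> // benchfmt_emit.go: writes arbitrary measured metrics in the Go benchmark
> // text format so an unmodified benchstat can collate them.
> package main
> 
> import (
>     "fmt"
>     "os"
> )
> 
> // Result is a hypothetical record produced by one run of a load test.
> type Result struct {
>     Name       string  // e.g. "GroupByKey_Flink"
>     Iterations int     // timed iterations in this run
>     NsPerOp    float64 // average wall time per operation, in nanoseconds
> }
> 
> func main() {
>     results := []Result{
>         {Name: "GroupByKey_Flink", Iterations: 10, NsPerOp: 1.26e9},
>         {Name: "ParDo_Flink", Iterations: 10, NsPerOp: 3.1e8},
>     }
>     for _, r := range results {
>         // One line per run; repeated runs give benchstat a sample from
>         // which to compute variance and p-values.
>         fmt.Fprintf(os.Stdout, "Benchmark%s %d %.0f ns/op\n",
>             r.Name, r.Iterations, r.NsPerOp)
>     }
> }
> {code}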
> Similar to our release validation scripts, a release manager (or any user) 
> should be able to access and compare results,
> e.g. {{./release_comparison.sh $OLD_VERSION $NEW_VERSION}}
> It must be able to support Release Candidate versions.
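> A sketch of what the driver behind such a script might do, assuming a 
> hypothetical {{run_benchmarks.sh}} that runs the suite against a given 
> released (or RC) SDK version and prints results in the Go benchmark format:
> {code:go}
> // release_comparison.go: runs the benchmark suite against two Beam
> // versions and collates the results with an unmodified benchstat.
> package main
> 
> import (
>     "log"
>     "os"
>     "os/exec"
> )
> 
> // runSuite runs the suite against one version and captures its output.
> // The script name is a hypothetical placeholder.
> func runSuite(version, outFile string) error {
>     out, err := os.Create(outFile)
>     if err != nil {
>         return err
>     }
>     defer out.Close()
>     cmd := exec.Command("./run_benchmarks.sh", version)
>     cmd.Stdout = out
>     cmd.Stderr = os.Stderr
>     return cmd.Run()
> }
> 
> func main() {
>     if len(os.Args) != 3 {
>         log.Fatalf("usage: %s OLD_VERSION NEW_VERSION", os.Args[0])
>     }
>     // An RC is just another version string here, so release candidates
>     // are supported for free.
>     pairs := []struct{ version, file string }{
>         {os.Args[1], "old.txt"},
>         {os.Args[2], "new.txt"},
>     }
>     for _, p := range pairs {
>         if err := runSuite(p.version, p.file); err != nil {
>             log.Fatalf("benchmarks for %s failed: %v", p.version, err)
>         }
>     }
>     cmp := exec.Command("benchstat", "old.txt", "new.txt")
>     cmp.Stdout = os.Stdout
>     cmp.Stderr = os.Stderr
>     if err := cmp.Run(); err != nil {
>         log.Fatal(err)
>     }
> }
> {code}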
> Adding this kind of infrastructure will improve trust in Beam and in Beam 
> releases, and allow others to more consistently compare performance results.
> This Jira stands as a proposal and, if accepted, a place for discussion and 
> for hanging subtasks and specifics.
> A side task that could be useful would be generating these text-file 
> versions of the benchmarks by querying the metrics database. The comparison 
> could then be made between a few datapoints around one point in time and a 
> few around another, which would at least make the release manager's job a 
> little easier, though it doesn't compare two releases directly.


