[ https://issues.apache.org/jira/browse/BEAM-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17549099#comment-17549099 ]
Danny McCormick commented on BEAM-11431:
----------------------------------------
This issue has been migrated to https://github.com/apache/beam/issues/20707
> Automated Release Performance Benchmark Regression/Improvement comparison
> -------------------------------------------------------------------------
>
> Key: BEAM-11431
> URL: https://issues.apache.org/jira/browse/BEAM-11431
> Project: Beam
> Issue Type: Improvement
> Components: testing
> Reporter: Robert Burke
> Priority: P3
>
> While running the release, we have a step that asks us to check for
> performance regressions:
> [https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions]
>
> However, all we're able to check are the measured graphs over time. We have
> no clear indication of what these metrics were at the last release; we can
> only see vague trends in the line graphs. To be clear, line graphs are
> excellent for spotting large sudden changes, or small changes over a long
> period of time, but they don't serve the release manager very well.
> For one, infrastructure might have changed in the meantime (compilers, test
> machine hardware, load variables, and the benchmarking code itself), which
> makes comparing any two points in those graphs very difficult. Worse, the
> points are only ever single runs, which puts them at the mercy of variance.
> That makes it hard to tell which changes are genuinely improvements in all
> cases.
> This Jira proposes that we make it possible to reproducibly performance test
> and compare two releases. In addition, we should be able to publish the
> results of our benchmarks, together with the comparison to the previous
> release, alongside the rest of the release artifacts.
> Obvious caveat: if there are new tests that can't run on the previous
> release (or old tests that can't run on the new release), they're free to be
> excluded. This can be automatic, by tagging the tests somehow, or done by
> publishing explicit manual exclusions or inclusions. This implies that the
> tests are user-side, and rely on a given set of released SDK or Runner
> artifacts for execution.
> Ideally the release manager can run a good chunk of these tests on their
> local machine, or on a host in the cloud. Any such cloud resources should be
> identical for before-and-after comparisons. E.g. if one is comparing Flink
> performance, then the same machine types should be used for Beam version X
> and X+1.
> As inspiration, a Go tool called benchstat does what I'm describing for the
> Go benchmark format. See the description in its documentation here:
> [https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme]
>
> It takes the results from one or more runs of a given benchmark (measuring
> time per operation, memory throughput, allocations per operation, etc.) on
> the old system, plus the same from the new system, and produces averages and
> deltas in a readable text format, e.g.
> {{$ benchstat old.txt new.txt}}
> {{name        old time/op  new time/op  delta}}
> {{GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)}}
> {{JSONEncode  32.1ms ± 1%  31.8ms ± 1%     ~     (p=0.286 n=4+5)}}
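> For reference, the input benchstat consumes is the plain Go benchmark text
> format: one line per run, with repeated names meaning repeated runs of the
> same benchmark. A couple of illustrative lines (the numbers are made up for
> the example):
> {{BenchmarkGobEncode-8   100   13552735 ns/op   163744 B/op   2 allocs/op}}
> {{BenchmarkGobEncode-8   100   13576012 ns/op   163744 B/op   2 allocs/op}}
> If we can emit our results in that shape, benchstat's aggregation and
> statistics come for free.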
> This would be a valuable way to produce and present results for users, and to
> more easily validate what performance characteristics have changed between
> versions.
> Given the size, breadth, and distributed nature of Beam and its associated
> infrastructure, this is something we likely only wish to do alongside a
> release. It will likely be time consuming and, for larger-scale load tests
> on cloud resources, expensive. In order to make meaningful comparisons, as
> much as possible needs to be invariant between the releases under
> comparison.
> In particular, if running on a distributed set of resources (e.g. a cloud
> cluster), the machine types and counts should remain invariant (Spark and
> Flink clusters should be the same size; Dataflow differing is trickier, but
> it should be left unrestricted, as that's the point). Local tests on a
> single machine are comparable by themselves as well.
> The published results should include the specifics of the machine(s) the
> tests were run on: CPU, clock speed, amount of RAM, number of machines if
> distributed, and the official cloud designation if using cloud provider VMs
> (i.e. machine types such as e2-standard-4, n2-highcpu-32, c6g.4xlarge, or
> D8d v4).
> The overall goal is to be able to run the comparisons on a local machine,
> and to be able to send jobs to clusters in clouds. Actual provisioning of
> cloud resources is a non-goal of this proposal.
> Given a set of tests, we should be able to generate a text file with the
> results, for collation similar to what Go's benchstat does. Bonus points if
> we can have benchstat handle the task for us without modification.
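> As a minimal sketch of what such an exporter could look like (the result
> struct and names below are hypothetical, not existing Beam code), emitting
> benchstat-compatible lines is only a formatting concern:
> {code:go}
> package main
> 
> import (
>     "fmt"
>     "io"
>     "os"
>     "time"
> )
> 
> // Result is a hypothetical single run of one load test; it stands in for
> // whatever our harnesses actually record.
> type Result struct {
>     Name    string        // e.g. "TextIOIT_Write"
>     Elapsed time.Duration // wall time for the run
> }
> 
> // writeBenchstatLines prints one Go-benchmark-format line per run, so that
> // "benchstat old.txt new.txt" works on the files without modification.
> // Repeated runs of the same test become repeated lines with the same name,
> // which is exactly what benchstat aggregates into a mean and variation.
> func writeBenchstatLines(w io.Writer, results []Result) {
>     for _, r := range results {
>         fmt.Fprintf(w, "Benchmark%s 1 %d ns/op\n", r.Name, r.Elapsed.Nanoseconds())
>     }
> }
> 
> func main() {
>     writeBenchstatLines(os.Stdout, []Result{
>         {Name: "TextIOIT_Write", Elapsed: 93 * time.Second},
>         {Name: "TextIOIT_Write", Elapsed: 95 * time.Second},
>     })
> }
> {code}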
> Similar to our release validation scripts, a release manager (or any user)
> should be able to access and compare results.
> e.g. ./release_comparison.sh ${OLD_VERSION} ${NEW_VERSION}
> It must be able to support Release Candidate versions.
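> A rough sketch of such a driver, with the inner benchmark script and its
> interface left as placeholders (nothing below is existing tooling), could
> simply shell out to the suite once per version and then to benchstat:
> {code:go}
> package main
> 
> import (
>     "fmt"
>     "log"
>     "os"
>     "os/exec"
> )
> 
> // runSuite runs the benchmark suite against one released version and writes
> // its output (assumed to be Go-benchmark-format lines) to outFile.
> // "./run_release_benchmarks.sh" is a placeholder name.
> func runSuite(version, outFile string) error {
>     out, err := os.Create(outFile)
>     if err != nil {
>         return err
>     }
>     defer out.Close()
>     cmd := exec.Command("./run_release_benchmarks.sh", version)
>     cmd.Stdout = out
>     cmd.Stderr = os.Stderr
>     return cmd.Run()
> }
> 
> func main() {
>     if len(os.Args) != 3 {
>         log.Fatal("usage: release_comparison OLD_VERSION NEW_VERSION")
>     }
>     if err := runSuite(os.Args[1], "old.txt"); err != nil {
>         log.Fatal(err)
>     }
>     if err := runSuite(os.Args[2], "new.txt"); err != nil {
>         log.Fatal(err)
>     }
>     // Let benchstat compute the statistics and print the delta table.
>     delta, err := exec.Command("benchstat", "old.txt", "new.txt").CombinedOutput()
>     if err != nil {
>         log.Fatal(err)
>     }
>     fmt.Print(string(delta))
> }
> {code}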
> Adding this kind of infrastructure will improve trust in Beam and in Beam
> releases, and will allow others to compare performance results more
> consistently.
> This Jira stands as a proposal and, if accepted, a place for discussion and
> for hanging subtasks and specifics.
> A useful side task would be the ability to generate these text-file versions
> of the benchmarks by querying the metrics database. The comparison could
> then be between a few datapoints around one point in time and a few around
> another, which would at least make the release manager's job a little
> easier, even though it doesn't compare two releases directly.
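> A sketch of that side task, assuming a SQL-style metrics store (the table,
> columns, and driver are placeholders; the real metrics backend may look
> quite different), reusing the same Go benchmark output format:
> {code:go}
> package metricsexport
> 
> import (
>     "database/sql"
>     "fmt"
>     "io"
>     "time"
>     // A concrete driver would be registered with a blank import here; which
>     // one depends on what the metrics database actually is.
> )
> 
> // emitWindow prints, in Go benchmark format, every datapoint recorded in
> // [start, end). The table and column names are placeholders.
> func emitWindow(db *sql.DB, w io.Writer, start, end time.Time) error {
>     rows, err := db.Query(
>         `SELECT test_name, runtime_ms FROM load_test_metrics
>          WHERE ts >= ? AND ts < ?`, start, end)
>     if err != nil {
>         return err
>     }
>     defer rows.Close()
>     for rows.Next() {
>         var name string
>         var ms int64
>         if err := rows.Scan(&name, &ms); err != nil {
>             return err
>         }
>         // One line per datapoint; benchstat then aggregates datapoints that
>         // share a name, giving mean ± variation for each time window.
>         fmt.Fprintf(w, "Benchmark%s 1 %d ns/op\n", name, ms*int64(time.Millisecond))
>     }
>     return rows.Err()
> }
> {code}
> Running this for a window around each of the two time points of interest
> would give two files benchstat can compare directly.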