Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
> > > How would we handle commits that break the integration tests? Would
> > > we revert commits on trunk, or fix-forward?
> >
> > This is currently up to committer discretion, and I don't think that
> > would change if we were to re-tool the PR builds. In the presence of
> > flaky failures, we can't reliably blame failures on particular commits
> > without running much more expensive statistical tests.

I was thinking more about integration regressions than flaky failures.
However, if we're running a module's tests as well as the "affected
modules" tests, then I guess we should be running the correct integration
tests for each PR build. I guess this gets back to Chris's point about how
this approach favors the downstream modules. Maybe that's unavoidable.

Come to think of it, a similar problem exists with system tests. We don't
run these for each PR (or each trunk commit, or even nightly, AFAIK) since
they are prohibitively costly and lengthy to run. However, they do
sometimes find integration regressions that the JUnit suite missed. In
those cases, fixing forward is our only option.

On Tue, Jun 13, 2023 at 12:19 PM Greg Harris wrote:
Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
David,

Thanks for finding that gradle plugin. The `changedModules` mode is
exactly what I had in mind for fairness to modules earlier in the
dependency graph.

> if we moved to a policy where PRs only need some of the tests to pass
> to merge, when would we run the full CI? On each trunk commit (i.e., PR
> merge)?

In a world where the PR run includes only the changed modules and their
dependencies, the full suite should be run for each commit on trunk and
on release branches. I don't think that optimizing the trunk build
runtime is of great benefit, and the current behavior seems reasonable to
continue.

> How would we handle commits that break the integration tests? Would
> we revert commits on trunk, or fix-forward?

This is currently up to committer discretion, and I don't think that
would change if we were to re-tool the PR builds. In the presence of
flaky failures, we can't reliably blame failures on particular commits
without running much more expensive statistical tests.

One place I often see flakiness appear is in new tests, where someone has
chosen timeouts which work for them locally and in the PR build. After
some 10s or 100s of runs, the flakiness becomes evident and someone looks
into a fix-forward.

I don't necessarily think I would advocate for a hard revert of an entire
feature if one of the added tests is flaky, but that's my discretion. We
can adopt a project policy of reverting whatever we can, but I don't
think that would make for a more welcoming or productive project than we
have now.

Greg

On Tue, Jun 13, 2023 at 7:24 AM David Arthur wrote:
Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
Hey folks, interesting discussion.

I came across a Gradle plugin that calculates a DAG of modules based on
the diff and can run only the affected module's tests, or the affected +
downstream tests.

https://github.com/dropbox/AffectedModuleDetector

I tested it out locally, and it seems to work as advertised.

Greg, if we moved to a policy where PRs only need some of the tests to
pass to merge, when would we run the full CI? On each trunk commit (i.e.,
PR merge)? How would we handle commits that break the integration tests?
Would we revert commits on trunk, or fix-forward?

-David

On Thu, Jun 8, 2023 at 2:02 PM Greg Harris wrote:
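The "affected + downstream" mode amounts to a transitive walk of the module dependency graph: re-run a changed module's tests, plus the tests of everything that depends on it. A minimal sketch of that selection, using a made-up reverse-dependency map rather than Kafka's real module graph:

```python
from collections import deque

# Hypothetical reverse-dependency map: module -> modules that depend on it.
# This is NOT the actual Kafka module graph, just an illustration.
dependents = {
    "clients": ["core", "streams", "connect"],
    "core": ["streams", "connect"],
    "streams": [],
    "connect": [],
}

def affected(changed):
    """Return the changed modules plus all of their transitive dependents."""
    seen, queue = set(changed), deque(changed)
    while queue:
        for dep in dependents.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

print(affected({"core"}))     # → ['connect', 'core', 'streams']
print(affected({"streams"}))  # → ['streams']
```

This is the shape of the tradeoff discussed later in the thread: a change in `clients` fans out to nearly every module, while a change in a leaf module like `streams` selects only itself.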
Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
Gaurav,

The target-determinator is certainly the "off-the-shelf" solution I
expected would be out there. If the project migrates to Bazel, I think
that would make the partial builds much easier to implement. I think we
should look into the other benefits of migrating to Bazel to see if it is
worth it even if the partial builds feature is decided against, or after
it is reverted.

Chris,

> Do you think we should aim to disable
> merges without a full suite of passing CI runs (allowing for
> administrative override in an emergency)? If so, what would the path be
> from our current state to there? What can we do to ensure that we don't
> get stuck relying on a once-temporary aid that becomes effectively
> permanent?

Yes, I think it would be nice to require a green build to merge, without
the excessive merge-queue infrastructure that is more common in
high-volume monorepos.

1. We would decide on a sunset date for the partial build indicator, at
which point it will be disabled on trunk and all release branches. I
suspect that a sunset date could be set for 1-3 releases in the future.
2. We would enable partial builds. The merge button would remain as-is,
with the partial build and full-suite indicators both being visible.
3. Communication would be sent out to explain that the partial build
indicators are a temporary flakiness-reduction tool.
4. Committers would continue to rely on the full-suite indicators, and
watch the partial build to get a sense of the level of flakiness in each
module.
5. On new contributions which have failing partial builds, committers can
ask that contributors investigate and follow up on flaky failures that
appeared in their PR, explaining that they can help make that indicator
green for future contributions.
6. After the sunset date passes, the partial build indicators will be
disabled.
7. If the main trunk build is sufficiently reliable by some criteria
(e.g. 20 passes in a row), we can discuss requiring a green build for the
GitHub merge button.

> We probably
> want to build awareness of this dependency graph into any partial CI
> logic we add, but if we do opt for that, then this change would
> disproportionately benefit downstream modules (Streams, Connect, MM2),
> and have little to no benefit for upstream ones (clients and at least
> some core modules).

I considered that, and I think it is more beneficial to provide equitable
partial builds than ones with perfect coverage. If you want to know about
cross-module failures, I think that the full suite is still the
appropriate tool to detect those.

From another perspective, if this change is only useful to `core` after
all other dependent modules have addressed their flakiness, then it
delays `core` contributors from the reward for addressing their
flakiness.

And if anything, making the partial build cover fewer failure modes, and
reminding people of that fact, will keep the full-suite builds relevant.

> but people should already be making
> sure that tests are running locally before they push changes

I agree, and I don't think that partial builds are in any danger of
replacing local testing for fast synchronous iteration. I also agree that
it is appropriate to give personal feedback to repeat contributors to
improve the quality of their PRs.

CI seems to be for enforcing the lowest common denominator on
contributions, and benefiting (potentially first-time) contributors who
do not have the familiarity, mindfulness, or resources to pre-test each
of their contributions. It is in those situations that I think partial
builds can be beneficial: if there is something mechanical that all
contributors should be doing locally, why not have it done automatically
on their behalf?

> Finally, since there are a number of existing flaky tests on trunk,
> what would the strategy be for handling those? Do we try to get to a
> green state on a per-module basis (possibly with awareness of
> downstream modules) as quickly as possible, and then selectively enable
> partial builds once we feel confident that flakiness has been
> addressed?

For modules which receive regular PRs, the partial builds would surface
those flakes, incentivising fixing them. Once the per-module flakiness is
under control, committers for those modules can start to require green
partial builds, pushing back on PRs which have obvious failures.

For modules which do not receive regular PRs, those flakes will appear in
all of the full-suite builds, and wouldn't be addressed by the partial
builds. Those would require either another incentive structure, or
volunteers from other modules to help fix them. I'm not sure which
modules receive the least PR traffic, but I can see that the currently
most stale modules are log4j-appender (1 year), streams:examples (6
months), connect:file (5 months), storage:api (5 months), and
streams-scala (5 months).

Thanks for the feedback, all,

Greg

On Wed, Jun 7, 2023 at 7:55 AM Chris Egerton wrote:
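The "20 passes in a row" criterion in step 7 implies a fairly demanding per-build pass rate. Some back-of-the-envelope arithmetic (the pass rates below are assumptions for illustration, not measured Kafka numbers):

```python
# Chance of `streak` consecutive green builds, assuming each build
# passes independently with probability `per_build_pass_rate`.
def streak_probability(per_build_pass_rate, streak=20):
    return per_build_pass_rate ** streak

# 0.99 → ~0.82, 0.95 → ~0.36, 0.90 → ~0.12
for rate in (0.99, 0.95, 0.90):
    print(rate, round(streak_probability(rate), 2))
```

Even a 99%-reliable build only has about an 82% chance of producing the streak, so reaching this criterion would require nearly all flakiness to be gone.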
Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
Hi Greg,

I can see the point about enabling partial runs as a temporary measure to
fight flakiness, and it does carry some merit. In that case, though, we
should have an idea of what the desired end state is once we've stopped
relying on any temporary measures. Do you think we should aim to disable
merges without a full suite of passing CI runs (allowing for
administrative override in an emergency)? If so, what would the path be
from our current state to there? What can we do to ensure that we don't
get stuck relying on a once-temporary aid that becomes effectively
permanent?

With partial builds, we also need to be careful to make sure to correctly
handle cross-module dependencies. A tweak to broker or client logic may
only affect files in one module and pass all tests for that module, but
have far-reaching consequences for Streams, Connect, and MM2. We probably
want to build awareness of this dependency graph into any partial CI
logic we add, but if we do opt for that, then this change would
disproportionately benefit downstream modules (Streams, Connect, MM2),
and have little to no benefit for upstream ones (clients and at least
some core modules).

With regards to faster iteration times: I agree that it would be nice if
our CI builds didn't take 2-3 hours, but people should already be making
sure that tests are running locally before they push changes (or, if they
really want, they can run tests locally after pushing changes). And if
rapid iteration is necessary, it's always (or at least for the
foreseeable future) going to be faster to run whatever specific tests or
build tasks you need locally, instead of pushing to GitHub and waiting
for Jenkins to check for you.

Finally, since there are a number of existing flaky tests on trunk, what
would the strategy be for handling those? Do we try to get to a green
state on a per-module basis (possibly with awareness of downstream
modules) as quickly as possible, and then selectively enable partial
builds once we feel confident that flakiness has been addressed?

Cheers,

Chris

On Wed, Jun 7, 2023 at 5:09 AM Gaurav Narula wrote:
RE: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
Hey Greg,

Thanks for sharing this idea!

The idea of building and testing a relevant subset of code certainly
seems interesting. Perhaps this is a good fit for Bazel [1], where
target-determinator [2] can be used to find the subset of targets that
have changed between two commits.

Even without [2], Bazel builds can benefit immensely from distributing
builds to a set of remote nodes [3], with support for caching previously
built targets [4].

We've seen a few other ASF projects adopt Bazel as well:

* https://github.com/apache/rocketmq
* https://github.com/apache/brpc
* https://github.com/apache/trafficserver
* https://github.com/apache/ws-axiom

I wonder how the Kafka community feels about experimenting with Bazel and
exploring whether it helps us offer faster build times without
compromising on the correctness of the targets that need to be built and
tested?

Thanks,
Gaurav

[1]: https://bazel.build
[2]: https://github.com/bazel-contrib/target-determinator
[3]: https://bazel.build/remote/rbe
[4]: https://bazel.build/remote/caching

On 2023/06/05 17:47:07 Greg Harris wrote:
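For concreteness, target selection of the kind Gaurav describes can be expressed directly in Bazel's query language. The target labels below are invented, and this presumes a Bazel-ified Kafka tree, so treat it as a sketch rather than working configuration:

```shell
# All targets that transitively depend on a (hypothetical) clients library:
bazel query "rdeps(//..., //clients:clients)"

# Restrict that to test targets and run only those:
bazel test $(bazel query "kind(test, rdeps(//..., //clients:clients))")
```

target-determinator [2] refines this further by diffing the resolved target graphs of two commits, so unaffected targets are skipped even when a shared file changes.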
Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
David,

Thanks for your thoughts!

> Indeed, they will be more likely to pass but the
> downside is that folks may start to only rely on that signal and
> commit without looking at the full test suite. This seems dangerous to
> me.

I completely agree with you that it is not desirable for committers to
become overly reliant on the smaller subset of tests. Instead, the
partial builds can be a new merge requirement, rather than a replacement
for the existing full-suite spot-check. For PRs which fail the partial
test run, committers won't need to examine the full suite to know that
the contributor needs to make further changes, and can spend their
attention on something else. We can make this clear with a dev-list
announcement to explain the meaning and interpretation of the new build
result. We can also call for flakiness-reduction contributions at the
same time.

> I would rather focus on trying to address this first. If
> we can stabilize them, I wonder if we should also enforce a green build
> to merge.

The reason I'm interested in this change is that this project already
appears to have adopted this policy of bottom-up flakiness reduction, and
I don't think it has been effective enough for us to be able to enforce a
green build. I think that it is better to improve the incentives for
flakiness reduction, wait for flakiness to improve, and then later
enforce a green build to merge. In that context, the partial builds are a
temporary change to help us get to the desired end-goal.

Thanks,
Greg

On Tue, Jun 6, 2023 at 2:51 AM David Jacot wrote:
Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
Hey Greg,

Thanks for bringing this up.

I am not sure I understand the benefit of triggering a subset of the
tests in parallel. Indeed, they will be more likely to pass, but the
downside is that folks may start to rely only on that signal and commit
without looking at the full test suite. This seems dangerous to me.

However, I agree that we have an issue with our builds. We have way too
many flaky tests. I would rather focus on trying to address this first.
If we can stabilize them, I wonder if we should also enforce a green
build to merge.

Best,
David

On Mon, Jun 5, 2023 at 7:47 PM Greg Harris wrote:
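The "more likely to pass" point David raises is suite-size arithmetic: with independently flaky tests, the chance of a fully green run decays exponentially in the number of flaky tests included. A sketch with assumed (not measured) flake rates and test counts:

```python
# With n independent flaky tests, each failing a given run with
# probability `flake_rate`, the chance the whole run is green is
# (1 - flake_rate) ** n. All numbers here are illustrative assumptions.
def pass_probability(num_flaky_tests, flake_rate=0.001):
    return (1 - flake_rate) ** num_flaky_tests

print(round(pass_probability(300), 2))  # full suite, ~300 flaky tests → 0.74
print(round(pass_probability(20), 2))   # one module's subset → 0.98
```

The same arithmetic explains the danger: the subset's green checkmark says nothing about the tests it excluded, which is why the full-suite signal still matters.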
[DISCUSS] Partial CI builds - Reducing flakiness with fewer tests
Hey all,

I've been working on test flakiness recently, and I've been trying to
come up with ways to tackle the issue top-down as well as bottom-up, and
I'm interested to hear your thoughts on an idea.

In addition to the current full-suite runs, can we in parallel trigger a
smaller test run which has only a relevant subset of tests? For example,
if someone is working on one sub-module, the CI would only run tests in
that module.

I think this would be more likely to pass than the full suite, due to
fewer tests failing probabilistically, and would improve the
signal-to-noise ratio of the summary pass/fail marker on GitHub. This
should also be shorter to execute than the full suite, allowing for a
faster cycle time than the current full suite encourages.

This would also strengthen the incentive for contributors specializing in
a module to de-flake tests, as they are rewarded with a tangible
improvement within their area of the project. Currently, even the modules
with the most reliable tests receive consistent CI failures from other,
less reliable modules.

I believe this is possible, even if there isn't an off-the-shelf solution
for it. We can learn of the changed files via a git diff, map that to the
modules containing those files, and then execute the tests just for those
modules with gradle. GitHub also permits showing multiple "checks", so
that we can emit both the full-suite and partial test results.

Thanks,
Greg
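The diff-to-modules step in the last paragraph can be sketched in a few lines. The file paths and module names below are invented for illustration; the real input would come from `git diff --name-only` against trunk, and the real mapping from the project's actual module directories:

```python
# Map changed file paths (as from `git diff --name-only trunk...HEAD`)
# to the set of top-level modules they touch. Each selected module would
# then be tested with something like `./gradlew :<module>:test`.
def changed_modules(changed_files):
    return sorted({path.split("/", 1)[0] for path in changed_files})

files = [
    "connect/runtime/src/main/java/Worker.java",
    "connect/runtime/src/test/java/WorkerTest.java",
    "streams/src/main/java/KafkaStreams.java",
]
print(changed_modules(files))  # → ['connect', 'streams']
```

A real implementation would also need to handle files outside any module (e.g. top-level build scripts), probably by falling back to the full suite.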