> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
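
A rough sketch of the kind of statistical check alluded to above,
assuming a simple binomial model of flakiness; the baseline rate, rerun
counts, and function names below are all illustrative, not project
tooling:

    # Decide whether a commit made a test *more* flaky than trunk's
    # historical baseline, via a one-sided binomial tail probability.
    # Illustrative sketch only; the 5% baseline and 20 reruns are made up.
    from math import comb

    def tail_prob(failures: int, runs: int, baseline_rate: float) -> float:
        """P(at least `failures` failures in `runs` runs) if the true
        failure rate is still `baseline_rate`."""
        return sum(
            comb(runs, k) * baseline_rate**k * (1 - baseline_rate) ** (runs - k)
            for k in range(failures, runs + 1)
        )

    # Trunk historically fails this test 5% of the time; after the suspect
    # commit, 6 of 20 reruns fail. The tiny tail probability (~0.0003)
    # suggests a real regression rather than pre-existing flakiness.
    print(tail_prob(failures=6, runs=20, baseline_rate=0.05))

The expense comes from needing those 20 (or 100) reruns per suspect
commit before the blame is trustworthy.
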
I was thinking more about integration regressions than about flaky
failures. However, if we're running a module's tests as well as the
"affected modules" tests, then I guess we should be running the correct
integration tests for each PR build. That gets back to Chris's point
about how this approach favors the downstream modules. Maybe that's
unavoidable.

Come to think of it, a similar problem exists with system tests. We
don't run these for each PR (or each trunk commit, or even nightly,
AFAIK) since they are prohibitively costly/lengthy to run. However,
they do sometimes find integration regressions that the JUnit suite
missed. In those cases, our only choice is to fix forward.

On Tue, Jun 13, 2023 at 12:19 PM Greg Harris
<greg.har...@aiven.io.invalid> wrote:
>
> David,
>
> Thanks for finding that Gradle plugin. The `changedModules` mode is
> exactly what I had in mind for fairness to modules earlier in the
> dependency graph.
>
> > if we moved to a policy where PRs only need some of the tests to pass
> > to merge, when would we run the full CI? On each trunk commit (i.e., PR
> > merge)?
>
> In a world where the PR runs include only the changed modules and
> their dependencies, the full suite should be run for each commit on
> trunk and on release branches. I don't think that optimizing the trunk
> build runtime is of great benefit, and the current behavior seems
> reasonable to continue.
>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>
> One place where I often see flakiness appear is in new tests, where
> someone has chosen timeouts which work for them locally and in the PR
> build. After some 10s or 100s of runs, the flakiness becomes evident
> and someone looks into a fix-forward.
> I don't think I would necessarily advocate a hard revert of an entire
> feature if one of the added tests is flaky, but that's my discretion.
> We could adopt a project policy of reverting whatever we can, but I
> don't think that makes for a more welcoming or productive project
> than we have now.
>
> Greg
>
> On Tue, Jun 13, 2023 at 7:24 AM David Arthur <mum...@gmail.com> wrote:
> >
> > Hey folks, interesting discussion.
> >
> > I came across a Gradle plugin that calculates a DAG of modules based
> > on the diff and can run only the affected modules' tests, or the
> > affected + downstream tests:
> >
> > https://github.com/dropbox/AffectedModuleDetector
> >
> > I tested it out locally, and it seems to work as advertised.
> >
> > Greg, if we moved to a policy where PRs only need some of the tests
> > to pass to merge, when would we run the full CI? On each trunk
> > commit (i.e., PR merge)? How would we handle commits that break the
> > integration tests? Would we revert commits on trunk, or fix-forward?
> >
> > -David
> >
> > On Thu, Jun 8, 2023 at 2:02 PM Greg Harris
> > <greg.har...@aiven.io.invalid> wrote:
> >
> > > Gaurav,
> > >
> > > The target-determinator is certainly the "off-the-shelf" solution I
> > > expected would be out there. If the project migrates to Bazel, I
> > > think that would make the partial builds much easier to implement.
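
To make the trunk-vs-PR policy above concrete, here is one way a CI
entry point could choose between the partial and full runs. This is a
sketch under assumptions: the Gradle task and property names are taken
from the AffectedModuleDetector README as I recall it, and BRANCH_NAME
is Jenkins-specific, so verify both before relying on them:

    # Hypothetical CI entry point: affected-module tests for PR builds,
    # the full suite for trunk and release branches.
    import os
    import subprocess

    def gradle_command(branch: str) -> list[str]:
        # Kafka release branches are named like "3.5"; an assumption here.
        if branch == "trunk" or branch[:1].isdigit():
            return ["./gradlew", "test"]  # full suite, per the policy above
        return [
            "./gradlew",
            "runAffectedUnitTests",               # task added by the plugin
            "-Paffected_module_detector.enable",  # opt-in flag per its README
        ]

    branch = os.environ.get("BRANCH_NAME", "trunk")  # Jenkins sets BRANCH_NAME
    subprocess.run(gradle_command(branch), check=True)
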
> > > I think we should look into the other benefits of migrating to
> > > Bazel to see if it is worth it even if the partial builds feature
> > > is decided against, or after it is reverted.
> > >
> > > Chris,
> > >
> > > > Do you think we should aim to disable
> > > > merges without a full suite of passing CI runs (allowing for
> > > > administrative override in an emergency)? If so, what would the
> > > > path be from our current state to there? What can we do to ensure
> > > > that we don't get stuck relying on a once-temporary aid that
> > > > becomes effectively permanent?
> > >
> > > Yes, I think it would be nice to require a green build to merge,
> > > without the excessive merge-queue infrastructure that is more
> > > common in high-volume monorepos.
> > >
> > > 1. We would decide on a sunset date for the partial build
> > > indicator, at which point it will be disabled on trunk and all
> > > release branches. I suspect that a sunset date could be set for
> > > 1-3 releases in the future.
> > > 2. We would enable partial builds. The merge button would remain
> > > as-is, with the partial build and full-suite indicators both being
> > > visible.
> > > 3. Communication would be sent out to explain that the partial
> > > build indicators are a temporary flakiness-reduction tool.
> > > 4. Committers would continue to rely on the full-suite indicators,
> > > and watch the partial build to get a sense of the level of
> > > flakiness in each module.
> > > 5. On new contributions which have failing partial builds,
> > > committers can ask that contributors investigate and follow up on
> > > flaky failures that appeared in their PR, explaining that they can
> > > help make that indicator green for future contributions.
> > > 6. After the sunset date passes, the partial build indicators will
> > > be disabled.
> > > 7. If the main trunk build is sufficiently reliable by some
> > > criterion (e.g. 20 passes in a row), we can discuss requiring a
> > > green build for the GitHub merge button.
> > >
> > > > We probably
> > > > want to build awareness of this dependency graph into any partial
> > > > CI logic we add, but if we do opt for that, then this change would
> > > > disproportionately benefit downstream modules (Streams, Connect,
> > > > MM2), and have little to no benefit for upstream ones (clients
> > > > and at least some core modules).
> > >
> > > I considered that, and I think it is more beneficial to provide
> > > equitable partial builds than ones with perfect coverage. If you
> > > want to know about cross-module failures, I think that the full
> > > suite is still the appropriate tool to detect those.
> > > From another perspective, if this change is only useful to `core`
> > > after all other dependent modules have addressed their flakiness,
> > > then it delays `core` contributors' reward for addressing their
> > > own flakiness.
> > > And if anything, making the partial build cover fewer failure
> > > modes, and reminding people of that fact, will keep the full-suite
> > > builds relevant.
> > >
> > > > but people should already be making
> > > > sure that tests are running locally before they push changes
> > >
> > > I agree, and I don't think that partial builds are in any danger of
> > > replacing local testing for fast synchronous iteration. I also
> > > agree that it is appropriate to give personal feedback to repeat
> > > contributors to improve the quality of their PRs.
> > > CI seems to be for enforcing the lowest common denominator on
> > > contributions, and for benefiting (potentially first-time)
> > > contributors who do not have the familiarity, mindfulness, or
> > > resources to pre-test each of their contributions.
> > > It is in those situations that I think partial builds can be
> > > beneficial: if there is something mechanical that all contributors
> > > should be doing locally, why not have it done automatically on
> > > their behalf?
> > >
> > > > Finally, since there are a number of existing flaky tests on
> > > > trunk, what would the strategy be for handling those? Do we try
> > > > to get to a green state on a per-module basis (possibly with
> > > > awareness of downstream modules) as quickly as possible, and then
> > > > selectively enable partial builds once we feel confident that
> > > > flakiness has been addressed?
> > >
> > > For modules which receive regular PRs, the partial builds would
> > > surface those flakes, incentivising fixing them. Once the
> > > per-module flakiness is under control, committers for those
> > > modules can start to require green partial builds, pushing back on
> > > PRs which have obvious failures.
> > > For modules which do not receive regular PRs, those flakes will
> > > appear in all of the full-suite builds, and wouldn't be addressed
> > > by the partial builds. Those would require either another
> > > incentive structure or volunteers from other modules to help fix
> > > them.
> > > I'm not sure which modules receive the least PR traffic, but I can
> > > see that the most stale modules at the moment are log4j-appender
> > > (1 year ago), streams:examples (6 months ago), connect:file
> > > (5 months ago), storage:api (5 months ago), and streams-scala
> > > (5 months ago).
> > >
> > > Thanks for the feedback all,
> > > Greg
> > >
> > > On Wed, Jun 7, 2023 at 7:55 AM Chris Egerton <chr...@aiven.io.invalid>
> > > wrote:
> > > >
> > > > Hi Greg,
> > > >
> > > > I can see the point about enabling partial runs as a temporary
> > > > measure to fight flakiness, and it does carry some merit. In that
> > > > case, though, we should have an idea of what the desired end
> > > > state is once we've stopped relying on any temporary measures. Do
> > > > you think we should aim to disable merges without a full suite of
> > > > passing CI runs (allowing for administrative override in an
> > > > emergency)? If so, what would the path be from our current state
> > > > to there? What can we do to ensure that we don't get stuck
> > > > relying on a once-temporary aid that becomes effectively
> > > > permanent?
> > > >
> > > > With partial builds, we also need to be careful to correctly
> > > > handle cross-module dependencies. A tweak to broker or client
> > > > logic may only affect files in one module and pass all tests for
> > > > that module, but have far-reaching consequences for Streams,
> > > > Connect, and MM2. We probably want to build awareness of this
> > > > dependency graph into any partial CI logic we add, but if we do
> > > > opt for that, then this change would disproportionately benefit
> > > > downstream modules (Streams, Connect, MM2) and have little to no
> > > > benefit for upstream ones (clients and at least some core
> > > > modules).
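
The cross-module concern above is mechanical to address once the module
graph is known. A toy sketch of the expansion, using a hypothetical
fragment of the dependency graph rather than Kafka's real module list:

    # Expand a set of changed modules to include all transitive dependents,
    # so an upstream change (e.g. clients) also runs downstream tests.
    from collections import defaultdict, deque

    # (module, dependency) pairs; a hypothetical fragment, not the real graph.
    EDGES = [
        ("core", "clients"),
        ("streams", "clients"),
        ("connect", "clients"),
        ("streams-scala", "streams"),
    ]
    DEPENDENTS = defaultdict(set)  # dependency -> modules that depend on it
    for module, dependency in EDGES:
        DEPENDENTS[dependency].add(module)

    def affected(changed: set[str]) -> set[str]:
        """Breadth-first walk over the reverse dependency edges."""
        result, queue = set(changed), deque(changed)
        while queue:
            for dependent in DEPENDENTS[queue.popleft()]:
                if dependent not in result:
                    result.add(dependent)
                    queue.append(dependent)
        return result

    # A clients-only diff also triggers core, streams, connect, streams-scala.
    print(affected({"clients"}))

Under this expansion a clients change runs nearly everything, which is
exactly the asymmetry Chris describes.
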
> > > > With regards to faster iteration times--I agree that it would be
> > > > nice if our CI builds didn't take 2-3 hours, but people should
> > > > already be making sure that tests are running locally before they
> > > > push changes (or, if they really want, they can run tests locally
> > > > after pushing changes). And if rapid iteration is necessary, it's
> > > > always (or at least for the foreseeable future) going to be
> > > > faster to run whatever specific tests or build tasks you need
> > > > locally, instead of pushing to GitHub and waiting for Jenkins to
> > > > check for you.
> > > >
> > > > Finally, since there are a number of existing flaky tests on
> > > > trunk, what would the strategy be for handling those? Do we try
> > > > to get to a green state on a per-module basis (possibly with
> > > > awareness of downstream modules) as quickly as possible, and then
> > > > selectively enable partial builds once we feel confident that
> > > > flakiness has been addressed?
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Wed, Jun 7, 2023 at 5:09 AM Gaurav Narula <ka...@gnarula.com> wrote:
> > > >
> > > > > Hey Greg,
> > > > >
> > > > > Thanks for sharing this idea!
> > > > >
> > > > > The idea of building and testing a relevant subset of code
> > > > > certainly seems interesting.
> > > > >
> > > > > Perhaps this is a good fit for Bazel [1], where
> > > > > target-determinator [2] can be used to find a subset of targets
> > > > > that have changed between two commits.
> > > > >
> > > > > Even without [2], Bazel builds can benefit immensely from
> > > > > distributing builds to a set of remote nodes [3], with support
> > > > > for caching previously built targets [4].
> > > > >
> > > > > We've seen a few other ASF projects adopt Bazel as well:
> > > > >
> > > > > * https://github.com/apache/rocketmq
> > > > > * https://github.com/apache/brpc
> > > > > * https://github.com/apache/trafficserver
> > > > > * https://github.com/apache/ws-axiom
> > > > >
> > > > > I wonder how the Kafka community feels about experimenting with
> > > > > Bazel and exploring whether it helps us offer faster build
> > > > > times without compromising on the correctness of the targets
> > > > > that need to be built and tested?
> > > > >
> > > > > Thanks,
> > > > > Gaurav
> > > > >
> > > > > [1]: https://bazel.build
> > > > > [2]: https://github.com/bazel-contrib/target-determinator
> > > > > [3]: https://bazel.build/remote/rbe
> > > > > [4]: https://bazel.build/remote/caching
> > > > >
> > > > > On 2023/06/05 17:47:07 Greg Harris wrote:
> > > > > > Hey all,
> > > > > >
> > > > > > I've been working on test flakiness recently, and I've been
> > > > > > trying to come up with ways to tackle the issue top-down as
> > > > > > well as bottom-up, and I'm interested to hear your thoughts
> > > > > > on an idea.
> > > > > >
> > > > > > In addition to the current full-suite runs, can we trigger,
> > > > > > in parallel, a smaller test run which has only a relevant
> > > > > > subset of tests? For example, if someone is working on one
> > > > > > sub-module, the CI would only run the tests in that module.
> > > > > >
> > > > > > I think this would be more likely to pass than the full
> > > > > > suite, due to fewer tests failing probabilistically, and
> > > > > > would improve the signal-to-noise ratio of the summary
> > > > > > pass/fail marker on GitHub.
> > > > > > This should also be shorter to execute than the full suite,
> > > > > > allowing for faster cycle times than the current full suite
> > > > > > encourages.
> > > > > >
> > > > > > This would also strengthen the incentive for contributors
> > > > > > specializing in a module to de-flake tests, as they are
> > > > > > rewarded with a tangible improvement within their area of the
> > > > > > project. Currently, even the modules with the most reliable
> > > > > > tests receive consistent CI failures from other, less
> > > > > > reliable modules.
> > > > > >
> > > > > > I believe this is possible, even if there isn't an
> > > > > > off-the-shelf solution for it. We can learn of the changed
> > > > > > files via a git diff, map those to the modules containing
> > > > > > them, and then execute the tests just for those modules with
> > > > > > Gradle. GitHub also permits showing multiple "checks", so we
> > > > > > can emit both the full-suite and partial test results.
> > > > > >
> > > > > > Thanks,
> > > > > > Greg
> >
> > --
> > David Arthur

--
-David
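
As a postscript to Greg's original proposal quoted above, the
diff-to-modules mechanism he describes might look roughly like the
following. The top-level-directory-equals-module mapping is a
simplifying assumption (a real version would read settings.gradle), and
files outside any module should probably trigger the full suite:

    # Sketch: find the modules touched by a PR and run only their tests.
    import subprocess

    def changed_modules(base: str = "origin/trunk") -> set[str]:
        # Diff the PR head against its merge-base with trunk.
        diff = subprocess.run(
            ["git", "diff", "--name-only", f"{base}...HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        # "connect/runtime/src/.../Foo.java" -> "connect"; root-level files
        # are ignored here, though they should really trigger everything.
        return {path.split("/", 1)[0] for path in diff if "/" in path}

    def run_module_tests(modules: set[str]) -> None:
        if modules:
            tasks = [f":{module}:test" for module in sorted(modules)]
            subprocess.run(["./gradlew", *tasks], check=True)

    run_module_tests(changed_modules())

Publishing this run as a second GitHub check alongside the full suite,
as Greg suggests, keeps both signals visible on the PR.
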