Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-14 Thread David Arthur
>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>

I was thinking more about integration regressions than flaky failures.
However, if we're running a module's tests as well as the "affected
modules" tests, then we should be running the correct integration tests
for each PR build. This gets back to Chris's point about how this
approach favors the downstream modules. Maybe that's unavoidable.

Come to think of it, a similar problem exists with system tests. We don't
run these for each PR (or each trunk commit, or even nightly AFAIK) since
they are prohibitively costly/lengthy to run.
However, they do sometimes find integration regressions that the JUnit
suite missed. In those cases, fixing forward is our only option.


On Tue, Jun 13, 2023 at 12:19 PM Greg Harris 
wrote:

> David,
>
> Thanks for finding that gradle plugin. The `changedModules` mode is
> exactly what I had in mind for fairness to modules earlier in the
> dependency graph.
>
> > if we moved to a policy where PRs only need some of the tests to pass
> > to merge, when would we run the full CI? On each trunk commit (i.e., PR
> > merge)?
>
> In a world where the PR runs include only the changed modules and
> their dependencies, the full suite should be run for each commit on
> trunk and on release branches. I don't think that optimizing the trunk
> build runtime is of great benefit, and the current behavior seems
> reasonable to continue.
>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>
> One place I often see flakiness is in new tests, where someone has
> chosen timeouts which work for them locally and in the PR
> build. After some 10s or 100s of runs, the flakiness becomes evident
> and someone looks into a fix-forward.
> I don't necessarily think I would advocate for a hard revert of an
> entire feature if one of the added tests is flaky, but that's my
> discretion. We can adopt a project policy of reverting whatever we
> can, but I don't think that's a more welcoming or productive project
> than we have now.
>
> Greg
>
> On Tue, Jun 13, 2023 at 7:24 AM David Arthur  wrote:
> >
> > Hey folks, interesting discussion.
> >
> > I came across a Gradle plugin that calculates a DAG of modules based on
> the
> > diff and can run only the affected module's tests or the affected +
> > downstream tests.
> >
> > https://github.com/dropbox/AffectedModuleDetector
> >
> > I tested it out locally, and it seems to work as advertised.
> >
> > Greg, if we moved to a policy where PRs only need some of the tests to
> pass
> > to merge, when would we run the full CI? On each trunk commit (i.e., PR
> > merge)? How would we handle commits that break the integration tests?
> Would
> > we revert commits on trunk, or fix-forward?
> >
> > -David
> >
> >
> > On Thu, Jun 8, 2023 at 2:02 PM Greg Harris
> > wrote:
> >
> > > Gaurav,
> > >
> > > The target-determinator is certainly the "off-the-shelf" solution I
> > > expected would be out there. If the project migrates to Bazel I think
> > > that would make the partial builds much easier to implement.
> > > I think we should look into the other benefits of migrating to Bazel
> > > to see if it is worth it even if the partial builds feature is decided
> > > against, or after it is reverted.
> > >
> > > Chris,
> > >
> > > > Do you think we should aim to disable
> > > > merges without a full suite of passing CI runs (allowing for
> > > administrative
> > > > override in an emergency)? If so, what would the path be from our
> current
> > > > state to there? What can we do to ensure that we don't get stuck
> relying
> > > on
> > > > a once-temporary aid that becomes effectively permanent?
> > >
> > > Yes I think it would be nice to require a green build to merge,
> > > without excessive merge-queue infrastructure that is more common in
> > > high-volume monorepos.
> > >
> > > 1. We would decide on a sunset date on the partial build indicator, at
> > > which point it will be disabled on trunk and all release branches. I
> > > suspect that a sunset date could be set for 1-3 releases in the
> > > future.
> > > 2. We would enable partial builds. The merge-button would remain
> > > as-is, with the partial build and full-suite indicators both being
> > > visible.
> > > 3. Communication would be sent out to explain that the partial build
> > > indicators are a temporary flakiness reduction tool.

Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-13 Thread Greg Harris
David,

Thanks for finding that gradle plugin. The `changedModules` mode is
exactly what I had in mind for fairness to modules earlier in the
dependency graph.

> if we moved to a policy where PRs only need some of the tests to pass
> to merge, when would we run the full CI? On each trunk commit (i.e., PR
> merge)?

In a world where the PR runs include only the changed modules and
their dependencies, the full suite should be run for each commit on
trunk and on release branches. I don't think that optimizing the trunk
build runtime is of great benefit, and the current behavior seems
reasonable to continue.

> How would we handle commits that break the integration tests? Would
> we revert commits on trunk, or fix-forward?

This is currently up to committer discretion, and I don't think that
would change if we were to re-tool the PR builds. In the presence of
flaky failures, we can't reliably blame failures on particular commits
without running much more expensive statistical tests.

One place I often see flakiness is in new tests, where someone has
chosen timeouts which work for them locally and in the PR
build. After some 10s or 100s of runs, the flakiness becomes evident
and someone looks into a fix-forward.
I don't necessarily think I would advocate for a hard revert of an
entire feature if one of the added tests is flaky, but that's my
discretion. We can adopt a project policy of reverting whatever we
can, but I don't think that's a more welcoming or productive project
than we have now.
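
On the timeout point above: the usual fix-forward is to poll for the
condition with a generous deadline rather than sleeping for a fixed
amount that happens to work on one machine. Purely as an illustration
of the pattern (not any specific test utility of ours):

    import time

    def wait_for(condition, timeout_s=30.0, poll_s=0.1):
        # Retry until the condition holds or the deadline passes, instead
        # of assuming one hard-coded sleep is long enough on every machine.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if condition():
                return
            time.sleep(poll_s)
        raise AssertionError(f"condition not met within {timeout_s}s")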

Greg

On Tue, Jun 13, 2023 at 7:24 AM David Arthur  wrote:
>
> Hey folks, interesting discussion.
>
> I came across a Gradle plugin that calculates a DAG of modules based on the
> diff and can run only the affected module's tests or the affected +
> downstream tests.
>
> https://github.com/dropbox/AffectedModuleDetector
>
> I tested it out locally, and it seems to work as advertised.
>
> Greg, if we moved to a policy where PRs only need some of the tests to pass
> to merge, when would we run the full CI? On each trunk commit (i.e., PR
> merge)? How would we handle commits that break the integration tests? Would
> we revert commits on trunk, or fix-forward?
>
> -David
>
>
> On Thu, Jun 8, 2023 at 2:02 PM Greg Harris 
> wrote:
>
> > Gaurav,
> >
> > The target-determinator is certainly the "off-the-shelf" solution I
> > expected would be out there. If the project migrates to Bazel I think
> > that would make the partial builds much easier to implement.
> > I think we should look into the other benefits of migrating to Bazel
> > to see if it is worth it even if the partial builds feature is decided
> > against, or after it is reverted.
> >
> > Chris,
> >
> > > Do you think we should aim to disable
> > > merges without a full suite of passing CI runs (allowing for
> > administrative
> > > override in an emergency)? If so, what would the path be from our current
> > > state to there? What can we do to ensure that we don't get stuck relying
> > on
> > > a once-temporary aid that becomes effectively permanent?
> >
> > Yes I think it would be nice to require a green build to merge,
> > without excessive merge-queue infrastructure that is more common in
> > high-volume monorepos.
> >
> > 1. We would decide on a sunset date on the partial build indicator, at
> > which point it will be disabled on trunk and all release branches. I
> > suspect that a sunset date could be set for 1-3 releases in the
> > future.
> > 2. We would enable partial builds. The merge-button would remain
> > as-is, with the partial build and full-suite indicators both being
> > visible.
> > 3. Communication would be sent out to explain that the partial build
> > indicators are a temporary flakiness reduction tool.
> > 4. Committers would continue to rely on the full-suite indicators, and
> > watch the partial build to get a sense of the level of flakiness in
> > each module.
> > 5. On new contributions which have failing partial builds, Committers
> > can ask that contributors investigate and follow up on flaky failures
> > that appeared in their PR, explaining that they can help make that
> > indicator green for future contributions.
> > 6. After the sunset date passes, the partial build indicators will be
> > disabled.
> > 7. If the main trunk build is sufficiently reliable by some criteria
> > (e.g. 20 passes in a row) we can discuss requiring a green build for
> > the GitHub merge button.
> >
> > > We probably
> > > want to build awareness of this dependency graph into any partial CI
> > logic
> > > we add, but if we do opt for that, then this change would
> > > disproportionately benefit downstream modules (Streams, Connect, MM2),
> > and
> > > have little to no benefit for upstream ones (clients and at least some
> > core
> > > modules).
> >
> > I considered that, and I think it is more beneficial to provide
> > equitable partial builds than ones with perfect coverage. If you want
> > to know about cross-module failures, I think that the full suite is
> > still the appropriate tool to detect those.

Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-13 Thread David Arthur
Hey folks, interesting discussion.

I came across a Gradle plugin that calculates a DAG of modules based on the
diff and can run only the affected module's tests or the affected +
downstream tests.

https://github.com/dropbox/AffectedModuleDetector

I tested it out locally, and it seems to work as advertised.

Greg, if we moved to a policy where PRs only need some of the tests to pass
to merge, when would we run the full CI? On each trunk commit (i.e., PR
merge)? How would we handle commits that break the integration tests? Would
we revert commits on trunk, or fix-forward?

-David


On Thu, Jun 8, 2023 at 2:02 PM Greg Harris 
wrote:

> Gaurav,
>
> The target-determinator is certainly the "off-the-shelf" solution I
> expected would be out there. If the project migrates to Bazel I think
> that would make the partial builds much easier to implement.
> I think we should look into the other benefits of migrating to Bazel
> to see if it is worth it even if the partial builds feature is decided
> against, or after it is reverted.
>
> Chris,
>
> > Do you think we should aim to disable
> > merges without a full suite of passing CI runs (allowing for
> administrative
> > override in an emergency)? If so, what would the path be from our current
> > state to there? What can we do to ensure that we don't get stuck relying
> on
> > a once-temporary aid that becomes effectively permanent?
>
> Yes I think it would be nice to require a green build to merge,
> without excessive merge-queue infrastructure that is more common in
> high-volume monorepos.
>
> 1. We would decide on a sunset date on the partial build indicator, at
> which point it will be disabled on trunk and all release branches. I
> suspect that a sunset date could be set for 1-3 releases in the
> future.
> 2. We would enable partial builds. The merge-button would remain
> as-is, with the partial build and full-suite indicators both being
> visible.
> 3. Communication would be sent out to explain that the partial build
> indicators are a temporary flakiness reduction tool.
> 4. Committers would continue to rely on the full-suite indicators, and
> watch the partial build to get a sense of the level of flakiness in
> each module.
> 5. On new contributions which have failing partial builds, Committers
> can ask that contributors investigate and follow up on flaky failures
> that appeared in their PR, explaining that they can help make that
> indicator green for future contributions.
> 6. After the sunset date passes, the partial build indicators will be
> disabled.
> 7. If the main trunk build is sufficiently reliable by some criteria
> (e.g. 20 passes in a row) we can discuss requiring a green build for
> the GitHub merge button.
>
> > We probably
> > want to build awareness of this dependency graph into any partial CI
> logic
> > we add, but if we do opt for that, then this change would
> > disproportionately benefit downstream modules (Streams, Connect, MM2),
> and
> > have little to no benefit for upstream ones (clients and at least some
> core
> > modules).
>
> I considered that, and I think it is more beneficial to provide
> equitable partial builds than ones with perfect coverage. If you want
> to know about cross-module failures, I think that the full suite is
> still the appropriate tool to detect those.
> From another perspective, if this change is only useful to `core`
> after all other dependent modules have addressed their flakiness, then
> it delays `core` contributors from the reward for addressing their
> flakiness.
> And if anything, making the partial build cover fewer failure modes
> and reminding people of that fact will keep the full-suite builds
> relevant.
>
> > but people should already be making
> > sure that tests are running locally before they push changes
>
> I agree, and I don't think that partial builds are in any danger of
> replacing local testing for fast synchronous iteration. I also agree
> that it is appropriate to give personal feedback to repeat
> contributors to improve the quality of their PRs.
> CI seems to be for enforcing the lowest-common-denominator on
> contributions, and benefiting (potentially first-time) contributors
> who do not have the familiarity, mindfulness, or resources to
> pre-test each of their contributions.
> It is in those situations that I think partial builds can be
> beneficial: if there is something mechanical that all contributors
> should be doing locally, why not have it done automatically on their
> behalf?
>
> > Finally, since there are a number of existing flaky tests on trunk, what
> > would the strategy be for handling those? Do we try to get to a green
> state
> > on a per-module basis (possibly with awareness of downstream modules) as
> > quickly as possible, and then selectively enable partial builds once we
> > feel confident that flakiness has been addressed?
>
> For modules which receive regular PRs, the partial builds would
> surface those flakes, incentivising fixing them. Once the per-module
> flakiness is under control, committers for those modules can start to
> require green partial builds, pushing back on PRs which have obvious
> failures.

Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-08 Thread Greg Harris
Gaurav,

The target-determinator is certainly the "off-the-shelf" solution I
expected would be out there. If the project migrates to Bazel I think
that would make the partial builds much easier to implement.
I think we should look into the other benefits of migrating to Bazel
to see if it is worth it even if the partial builds feature is decided
against, or after it is reverted.

Chris,

> Do you think we should aim to disable
> merges without a full suite of passing CI runs (allowing for administrative
> override in an emergency)? If so, what would the path be from our current
> state to there? What can we do to ensure that we don't get stuck relying on
> a once-temporary aid that becomes effectively permanent?

Yes I think it would be nice to require a green build to merge,
without excessive merge-queue infrastructure that is more common in
high-volume monorepos.

1. We would decide on a sunset date on the partial build indicator, at
which point it will be disabled on trunk and all release branches. I
suspect that a sunset date could be set for 1-3 releases in the
future.
2. We would enable partial builds. The merge-button would remain
as-is, with the partial build and full-suite indicators both being
visible.
3. Communication would be sent out to explain that the partial build
indicators are a temporary flakiness reduction tool.
4. Committers would continue to rely on the full-suite indicators, and
watch the partial build to get a sense of the level of flakiness in
each module.
5. On new contributions which have failing partial builds, Committers
can ask that contributors investigate and follow up on flaky failures
that appeared in their PR, explaining that they can help make that
indicator green for future contributions.
6. After the sunset date passes, the partial build indicators will be disabled.
7. If the main trunk build is sufficiently reliable by some criteria
(e.g. 20 passes in a row) we can discuss requiring a green build for
the GitHub merge button.

> We probably
> want to build awareness of this dependency graph into any partial CI logic
> we add, but if we do opt for that, then this change would
> disproportionately benefit downstream modules (Streams, Connect, MM2), and
> have little to no benefit for upstream ones (clients and at least some core
> modules).

I considered that, and I think it is more beneficial to provide
equitable partial builds than ones with perfect coverage. If you want
to know about cross-module failures, I think that the full suite is
still the appropriate tool to detect those.
From another perspective, if this change is only useful to `core`
after all other dependent modules have addressed their flakiness, then
it delays `core` contributors from the reward for addressing their
flakiness.
And if anything, making the partial build cover fewer failure modes
and reminding people of that fact will keep the full-suite builds
relevant.

> but people should already be making
> sure that tests are running locally before they push changes

I agree, and I don't think that partial builds are in any danger of
replacing local testing for fast synchronous iteration. I also agree
that it is appropriate to give personal feedback to repeat
contributors to improve the quality of their PRs.
CI seems to be for enforcing the lowest-common-denominator on
contributions, and benefiting (potentially first-time) contributors
who do not have the familiarity, mindfulness, or resources to
pre-test each of their contributions.
It is in those situations that I think partial builds can be
beneficial: if there is something mechanical that all contributors
should be doing locally, why not have it done automatically on their
behalf?

> Finally, since there are a number of existing flaky tests on trunk, what
> would the strategy be for handling those? Do we try to get to a green state
> on a per-module basis (possibly with awareness of downstream modules) as
> quickly as possible, and then selectively enable partial builds once we
> feel confident that flakiness has been addressed?

For modules which receive regular PRs, the partial builds would
surface those flakes, incentivising fixing them. Once the per-module
flakiness is under control, committers for those modules can start to
require green partial builds, pushing back on PRs which have obvious
failures.
For modules which do not receive regular PRs, those flakes will appear
in all of the full-suite builds, and wouldn't be addressed by the
partial builds. Those would require either another incentive
structure or volunteers from other modules to help fix them.
I'm not sure which modules receive the least PR traffic, but I can see
that the current most stale modules are log4j-appender 1 year ago,
streams:examples 6 months ago, connect:file 5 months ago, storage:api
5 months ago, and streams-scala 5 months ago.
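
For anyone who wants to repeat that staleness check, something along
these lines should get you close; treating every top-level directory
with a src/ folder as a module is only a rough approximation of our
Gradle layout:

    import pathlib
    import subprocess

    def last_touched(module_dir):
        # %cr = committer date of the last commit touching the path,
        # as a relative string like "5 months ago"
        out = subprocess.run(
            ["git", "log", "-1", "--format=%cr", "--", module_dir],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or "no commits"

    # Heuristic: any top-level directory containing a src/ folder
    for module in sorted(p.name for p in pathlib.Path(".").iterdir()
                         if p.is_dir() and (p / "src").is_dir()):
        print(f"{module}: {last_touched(module)}")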

Thanks for the feedback all,
Greg

On Wed, Jun 7, 2023 at 7:55 AM Chris Egerton  wrote:
>
> Hi Greg,
>
> I can see the point about enabling partial runs as a temporary measure
> to fight flakiness, and it does carry some merit.

Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-07 Thread Chris Egerton
Hi Greg,

I can see the point about enabling partial runs as a temporary measure to
fight flakiness, and it does carry some merit. In that case, though, we
should have an idea of what the desired end state is once we've stopped
relying on any temporary measures. Do you think we should aim to disable
merges without a full suite of passing CI runs (allowing for administrative
override in an emergency)? If so, what would the path be from our current
state to there? What can we do to ensure that we don't get stuck relying on
a once-temporary aid that becomes effectively permanent?

With partial builds, we also need to be careful to make sure to correctly
handle cross-module dependencies. A tweak to broker or client logic may
only affect files in one module and pass all tests for that module, but
have far-reaching consequences for Streams, Connect, and MM2. We probably
want to build awareness of this dependency graph into any partial CI logic
we add, but if we do opt for that, then this change would
disproportionately benefit downstream modules (Streams, Connect, MM2), and
have little to no benefit for upstream ones (clients and at least some core
modules).
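
To make that concrete, the selection logic would need something like a
reverse-dependency walk over the module graph. A minimal sketch, where
the module names and edges are made up for illustration rather than
taken from our actual Gradle build:

    from collections import defaultdict, deque

    # Direct dependencies of each module (illustrative only).
    DEPENDS_ON = {
        "clients": [],
        "core": ["clients"],
        "streams": ["clients"],
        "connect": ["clients"],
        "mirror": ["connect", "clients"],
    }

    def modules_to_test(changed):
        # Invert the edges so a changed module pulls in its dependents too.
        dependents = defaultdict(set)
        for module, deps in DEPENDS_ON.items():
            for dep in deps:
                dependents[dep].add(module)
        selected, queue = set(changed), deque(changed)
        while queue:
            for downstream in dependents[queue.popleft()]:
                if downstream not in selected:
                    selected.add(downstream)
                    queue.append(downstream)
        return selected

    print(sorted(modules_to_test({"clients"})))
    # -> ['clients', 'connect', 'core', 'mirror', 'streams']

Without that walk, a clients-only PR would skip exactly the downstream
tests that are most likely to catch the regression.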

With regards to faster iteration times--I agree that it would be nice if
our CI builds didn't take 2-3 hours, but people should already be making
sure that tests are running locally before they push changes (or, if they
really want, they can run tests locally after pushing changes). And if
rapid iteration is necessary, it's always (or at least for the foreseeable
future) going to be faster to run whatever specific tests or build tasks
you need to run locally, instead of pushing to GitHub and waiting for
Jenkins to check for you.

Finally, since there are a number of existing flaky tests on trunk, what
would the strategy be for handling those? Do we try to get to a green state
on a per-module basis (possibly with awareness of downstream modules) as
quickly as possible, and then selectively enable partial builds once we
feel confident that flakiness has been addressed?

Cheers,

Chris

On Wed, Jun 7, 2023 at 5:09 AM Gaurav Narula  wrote:

> Hey Greg,
>
> Thanks for sharing this idea!
>
> The idea of building and testing a relevant subset of code certainly seems
> interesting.
>
> Perhaps this is a good fit for Bazel [1] where
> target-determinator [2] can be used to find a subset of targets that
> have changed between two commits.
>
> Even without [2], Bazel builds can benefit immensely from distributing
> builds
> to a set of remote nodes [3] with support for caching previously built
> targets [4].
>
> We've seen a few other ASF projects adopt Bazel as well:
>
> * https://github.com/apache/rocketmq
> * https://github.com/apache/brpc
> * https://github.com/apache/trafficserver
> * https://github.com/apache/ws-axiom
>
> I wonder how the Kafka community feels about experimenting with Bazel and
> exploring if it helps us offer faster build times without compromising on
> the
> correctness of the targets that need to be built and tested?
>
> Thanks,
> Gaurav
>
> [1]: https://bazel.build
> [2]: https://github.com/bazel-contrib/target-determinator
> [3]: https://bazel.build/remote/rbe
> [4]: https://bazel.build/remote/caching
>
> On 2023/06/05 17:47:07 Greg Harris wrote:
> > Hey all,
> >
> > I've been working on test flakiness recently, and I've been trying to
> > come up with ways to tackle the issue top-down as well as bottom-up,
> > and I'm interested to hear your thoughts on an idea.
> >
> > In addition to the current full-suite runs, can we in parallel trigger
> > a smaller test run which has only a relevant subset of tests? For
> > example, if someone is working on one sub-module, the CI would only
> > run tests in that module.
> >
> > I think this would be more likely to pass than the full suite due to
> > the fewer tests failing probabilistically, and would improve the
> > signal-to-noise ratio of the summary pass/fail marker on GitHub. This
> > should also be shorter to execute than the full suite, allowing for
> > faster cycle-time than the current full suite encourages.
> >
> > This would also strengthen the incentive for contributors specializing
> > in a module to de-flake tests, as they are rewarded with a tangible
> > improvement within their area of the project. Currently, even the
> > modules with the most reliable tests receive consistent CI failures
> > from other less reliable modules.
> >
> > I believe this is possible, even if there isn't an off-the-shelf
> > solution for it. We can learn of the changed files via a git diff, map
> > that to modules containing those files, and then execute the tests
> > just for those modules with gradle. GitHub also permits showing
> > multiple "checks" so that we can emit both the full-suite and partial
> > test results.
> >
> > Thanks,
> > Greg
> >


RE: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-07 Thread Gaurav Narula
Hey Greg,

Thanks for sharing this idea!

The idea of building and testing a relevant subset of code certainly seems 
interesting.

Perhaps this is a good fit for Bazel [1] where
target-determinator [2] can be used to find a subset of targets that have
changed between two commits.

Even without [2], Bazel builds can benefit immensely from distributing builds
to a set of remote nodes [3] with support for caching previously built
targets [4].

We've seen a few other ASF projects adopt Bazel as well:

* https://github.com/apache/rocketmq
* https://github.com/apache/brpc
* https://github.com/apache/trafficserver
* https://github.com/apache/ws-axiom

I wonder how the Kafka community feels about experimenting with Bazel and
exploring if it helps us offer faster build times without compromising on the
correctness of the targets that need to be built and tested?

Thanks,
Gaurav

[1]: https://bazel.build
[2]: https://github.com/bazel-contrib/target-determinator
[3]: https://bazel.build/remote/rbe
[4]: https://bazel.build/remote/caching

On 2023/06/05 17:47:07 Greg Harris wrote:
> Hey all,
> 
> I've been working on test flakiness recently, and I've been trying to
> come up with ways to tackle the issue top-down as well as bottom-up,
> and I'm interested to hear your thoughts on an idea.
> 
> In addition to the current full-suite runs, can we in parallel trigger
> a smaller test run which has only a relevant subset of tests? For
> example, if someone is working on one sub-module, the CI would only
> run tests in that module.
> 
> I think this would be more likely to pass than the full suite due to
> the fewer tests failing probabilistically, and would improve the
> signal-to-noise ratio of the summary pass/fail marker on GitHub. This
> should also be shorter to execute than the full suite, allowing for
> faster cycle-time than the current full suite encourages.
> 
> This would also strengthen the incentive for contributors specializing
> in a module to de-flake tests, as they are rewarded with a tangible
> improvement within their area of the project. Currently, even the
> modules with the most reliable tests receive consistent CI failures
> from other less reliable modules.
> 
> I believe this is possible, even if there isn't an off-the-shelf
> solution for it. We can learn of the changed files via a git diff, map
> that to modules containing those files, and then execute the tests
> just for those modules with gradle. GitHub also permits showing
> multiple "checks" so that we can emit both the full-suite and partial
> test results.
> 
> Thanks,
> Greg
>

Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-06 Thread Greg Harris
David,

Thanks for your thoughts!

> Indeed, they will be more likely to pass but the
> downside is that folks may start to only rely on that signal and commit
> without looking at the full test suite. This seems dangerous to me.

I completely agree with you that it is not desirable for committers to
become overly reliant on the smaller subset of tests. Rather, the
partial builds can be an additional merge requirement, not a
replacement for the existing full-suite spot-check.
For PRs which fail the partial test run, committers won't need to
examine the full suite to know that the contributor needs to make
further changes, and can spend their attention on something else.
We can make this clear with a dev-list announcement to explain the
meaning and interpretation of the new build result. We can also call
for flakiness reduction contributions at the same time.

> I would rather focus on trying to address this first. If
> we can stabilize them, I wonder if we should also enforce a green build to
> merge.

The reason I'm interested in this change is that the project already
appears to follow this policy of bottom-up flakiness reduction, and I
don't think it has been effective enough for us to enforce requiring a
green build.
I think that it is better to improve the incentives for flakiness
reduction, wait for flakiness to improve, and then later enforce a
green build to merge. In that context, the partial builds are a
temporary change to help us get to the desired end-goal.

Thanks,
Greg

On Tue, Jun 6, 2023 at 2:51 AM David Jacot  wrote:
>
> Hey Greg,
>
> Thanks for bringing this up.
>
> I am not sure I understand the benefit of triggering a subset of the
> tests in parallel. Indeed, they will be more likely to pass but the
> downside is that folks may start to only rely on that signal and commit
> without looking at the full test suite. This seems dangerous to me.
>
> However, I agree that we have an issue with our builds. We have way too
> many flaky tests. I would rather focus on trying to address this first. If
> we can stabilize them, I wonder if we should also enforce a green build to
> merge.
>
> Best,
> David
>
>
>
> On Mon, Jun 5, 2023 at 7:47 PM Greg Harris 
> wrote:
>
> > Hey all,
> >
> > I've been working on test flakiness recently, and I've been trying to
> > come up with ways to tackle the issue top-down as well as bottom-up,
> > and I'm interested to hear your thoughts on an idea.
> >
> > In addition to the current full-suite runs, can we in parallel trigger
> > a smaller test run which has only a relevant subset of tests? For
> > example, if someone is working on one sub-module, the CI would only
> > run tests in that module.
> >
> > I think this would be more likely to pass than the full suite due to
> > the fewer tests failing probabilistically, and would improve the
> > signal-to-noise ratio of the summary pass/fail marker on GitHub. This
> > should also be shorter to execute than the full suite, allowing for
> > faster cycle-time than the current full suite encourages.
> >
> > This would also strengthen the incentive for contributors specializing
> > in a module to de-flake tests, as they are rewarded with a tangible
> > improvement within their area of the project. Currently, even the
> > modules with the most reliable tests receive consistent CI failures
> > from other less reliable modules.
> >
> > I believe this is possible, even if there isn't an off-the-shelf
> > solution for it. We can learn of the changed files via a git diff, map
> > that to modules containing those files, and then execute the tests
> > just for those modules with gradle. GitHub also permits showing
> > multiple "checks" so that we can emit both the full-suite and partial
> > test results.
> >
> > Thanks,
> > Greg
> >


Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-06 Thread David Jacot
Hey Greg,

Thanks for bringing this up.

I am not sure I understand the benefit of triggering a subset of the
tests in parallel. Indeed, they will be more likely to pass but the
downside is that folks may start to only rely on that signal and commit
without looking at the full test suite. This seems dangerous to me.

However, I agree that we have an issue with our builds. We have way too
many flaky tests. I would rather focus on trying to address this first. If
we can stabilize them, I wonder if we should also enforce a green build to
merge.

Best,
David



On Mon, Jun 5, 2023 at 7:47 PM Greg Harris 
wrote:

> Hey all,
>
> I've been working on test flakiness recently, and I've been trying to
> come up with ways to tackle the issue top-down as well as bottom-up,
> and I'm interested to hear your thoughts on an idea.
>
> In addition to the current full-suite runs, can we in parallel trigger
> a smaller test run which has only a relevant subset of tests? For
> example, if someone is working on one sub-module, the CI would only
> run tests in that module.
>
> I think this would be more likely to pass than the full suite due to
> the fewer tests failing probabilistically, and would improve the
> signal-to-noise ratio of the summary pass/fail marker on GitHub. This
> should also be shorter to execute than the full suite, allowing for
> faster cycle-time than the current full suite encourages.
>
> This would also strengthen the incentive for contributors specializing
> in a module to de-flake tests, as they are rewarded with a tangible
> improvement within their area of the project. Currently, even the
> modules with the most reliable tests receive consistent CI failures
> from other less reliable modules.
>
> I believe this is possible, even if there isn't an off-the-shelf
> solution for it. We can learn of the changed files via a git diff, map
> that to modules containing those files, and then execute the tests
> just for those modules with gradle. GitHub also permits showing
> multiple "checks" so that we can emit both the full-suite and partial
> test results.
>
> Thanks,
> Greg
>


[DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-05 Thread Greg Harris
Hey all,

I've been working on test flakiness recently, and I've been trying to
come up with ways to tackle the issue top-down as well as bottom-up,
and I'm interested to hear your thoughts on an idea.

In addition to the current full-suite runs, can we in parallel trigger
a smaller test run which has only a relevant subset of tests? For
example, if someone is working on one sub-module, the CI would only
run tests in that module.

I think this would be more likely to pass than the full suite due to
the fewer tests failing probabilistically, and would improve the
signal-to-noise ratio of the summary pass/fail marker on GitHub. This
should also be shorter to execute than the full suite, allowing for
faster cycle-time than the current full suite encourages.
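
To put rough numbers on the "more likely to pass" point: if each flaky
test fails independently with probability p, a run that includes n of
them only comes back green with probability (1 - p)^n. With purely
illustrative values (none of these rates are measured):

    p = 0.01                      # assumed per-test flake rate
    for n_flaky in (5, 20, 100):  # flaky tests included in the run
        print(n_flaky, round((1 - p) ** n_flaky, 2))
    # 5 -> 0.95, 20 -> 0.82, 100 -> 0.37

So a module-scoped run that drags in a handful of flaky tests passes
far more often than a full-suite run that drags in every flaky test in
the project.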

This would also strengthen the incentive for contributors specializing
in a module to de-flake tests, as they are rewarded with a tangible
improvement within their area of the project. Currently, even the
modules with the most reliable tests receive consistent CI failures
from other less reliable modules.

I believe this is possible, even if there isn't an off-the-shelf
solution for it. We can learn of the changed files via a git diff, map
that to modules containing those files, and then execute the tests
just for those modules with gradle. GitHub also permits showing
multiple "checks" so that we can emit both the full-suite and partial
test results.
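
As a rough sketch of that mechanical part (the merge-base ref, the
top-level-directory-to-module mapping, and the per-module test tasks
are all assumptions about how we would wire it up, not a finished
design):

    import subprocess

    # Files changed relative to the assumed base ref on trunk.
    changed = subprocess.run(
        ["git", "diff", "--name-only", "origin/trunk...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    # Naive mapping: treat the top-level directory as the Gradle module.
    modules = {path.split("/", 1)[0] for path in changed if "/" in path}

    if modules:
        # e.g. ./gradlew :clients:test :streams:test
        tasks = [f":{m}:test" for m in sorted(modules)]
        subprocess.run(["./gradlew", *tasks], check=True)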

Thanks,
Greg