On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw <rober...@google.com> wrote:

> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik <lc...@google.com> wrote:
> >
> >> If Brian's: it does not result in a redundant build (if the plugin
> works), since it would be one Gradle build process. But it does do a full
> build if you touch something at the root of the ancestry tree, like the
> core SDK or the model. I would like to avoid automatically testing
> descendants if we can, since things like Nexmark and most IOs are not
> sensitive to the vast majority of model or core SDK changes. Runners are
> borderline.
> >
> > I believe that an issue found later, once a test starts failing because
> it wasn't run as part of the PR, is an order of magnitude more costly to
> triage and fix. Mostly due to loss of context for the PR author/reviewer,
> and, if the culprit PR can't be found, for whoever ends up trying to fix
> it.
>
> Huge +1 to this.
>

Totally agree. This abstract statement is clearly true. I suggest
considering things more concretely.

> Ideally we could count on the build system (and good caching) to only
> test what actually needs to be tested, and with work being done on
> runners and IOs this would be a small subset of our entire suite. When
> working lower in the stack (as I am prone to do) I think it's
> acceptable to have longer wait times--and would *much* rather pay that
> price than discover things later. Perhaps some things could be
> surgically removed (it would be interesting to mine data on how often
> test failures in the "leaves" catch real issues), but I would do that
> with care. That being said, flakiness is really an issue (and it
> seems these days I have to re-run tests, often multiple times, to get
> a PR to green; splitting up jobs could help that as well).
>

Agree with your sentiment that a longer wait for core changes is generally
fine; my phrasing above overemphasized this case. Anecdotally, without
mining data, leaf modules do sometimes catch bugs in core changes that are
(by definition) not adequately tested at the core level. That is a good
measure of how much we have to improve our engineering practices.

But anyhow this is one very special case. Coming back to the overall issue,
what we actually do today is run all leaf/middle/root builds whenever
anything in any leaf/middle/root layer is changed. And we track greenness
and flakiness at this same level of granularity.

Recall my (non-binding) starting point, guessing at which tests should or
should not run in a few scenarios (this tangent is just about the third
one, where I explicitly said maybe we run all the same tests, and then we
want to focus on separating signals, as Luke pointed out):

> - changing an IO or runner would not trigger the 20 minutes of core SDK
tests
> - changing a runner would not trigger the long IO local integration tests
> - changing the core SDK could potentially not run as many tests in
presubmit, but maybe it would and they would be separately reported results
with clear flakiness signal

And let's consider even more concrete examples:

 - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
 - when changing JdbcIO, how important is it to run the Java SDK
needsRunnerTests? RabbitMqIOTest?
 - when changing the FlinkRunner, how important is it to make sure that
Nexmark queries still match their models when run on direct runner?

I chose these examples so that running the tests has zero value, of course.
And I've deliberately included an example of a core change paired with a
leaf test; not all (core change, leaf test) pairs are equally important. The
vast majority of the tests we run literally cannot be affected by the
changes that triggered them. That is why enabling the Gradle cache, or using
a plugin like the one Brian found, could address part of the issue but not
the whole issue, as Luke reminded us.
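
For what it's worth, the caching piece is mostly configuration. A minimal
sketch of what I mean, assuming we stood up a shared cache node somewhere;
the URL and settings below are made up, not anything we actually run:

    # gradle.properties -- turn on task output caching for every build
    org.gradle.caching=true

    // settings.gradle -- reuse task outputs locally and, on CI, from a
    // shared node; "push" means only CI is allowed to populate the cache
    buildCache {
      local {
        enabled = true
      }
      remote(HttpBuildCache) {
        url = 'https://gradle-cache.example.com/cache/'
        push = System.getenv('CI') != null
      }
    }

Even with that working, an edit at the root of the tree still invalidates
everything downstream, so it addresses the redundant-work half but not the
"RabbitMqIOTest runs on a Fn API proto change" half.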

We make these tradeoffs all the time, of course, by putting some tests in
*IT (run in postCommit) and some in *Test (implicitly preCommit). But I am
imagining a future where we decouple the test suite definitions (very
stable, not depending on project context) from the decision of where and
when to run them (less stable, changing as the project changes).
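
To make that less abstract: the stable half could be nothing more than
aggregate tasks whose membership lists are the only thing that changes over
time. A sketch, with hypothetical names and membership rather than how the
build is wired today:

    // root build.gradle -- the suites themselves live in the leaf modules
    // and stay put; these aggregates are just edges we add or remove as
    // the project evolves
    tasks.register('ioPreCommit') {
      dependsOn ':sdks:java:io:jdbc:test'
      dependsOn ':sdks:java:io:rabbitmq:test'
    }
    tasks.register('runnersPreCommit') {
      dependsOn ':runners:direct-java:test'
      dependsOn ':runners:flink:1.10:test'
    }

Then the where-and-when decision lives entirely on the CI side, which picks
an aggregate based on what changed (ioPreCommit for an IO-only change,
everything for a core change) and reports each one as a separate greenness
and flakiness signal.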

My assumption is that the project will only grow, and all these problems
(flakiness, runtime, false coupling) will continue to get worse. I raised
this now so we could consider a steady-state approach that can scale,
before it becomes an emergency. I take it as a given that it is harder to
change culture than to change infra/code, so I am not counting on more
attention to flaky tests, more attention to testing the core properly, more
attention to making tests snappy, or more careful choices between *IT and
*Test (unless we build infra that forces more attention to these things).

Incidentally, SQL is not actually fully factored out. If you edit SQL, it
runs a limited subset defined by :sqlPreCommit. If you edit core, then
:javaPreCommit still includes the SQL tests.
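
Finishing that factoring would look something like the following; a hedged
sketch, since I have not checked exactly how :javaPreCommit aggregates
per-project test tasks on master:

    // root build.gradle -- hypothetical: assumes javaPreCommit pulls in the
    // SQL project's tests via dependsOn, so carving SQL out is just
    // dropping those edges and letting :sqlPreCommit (and postcommit)
    // cover them
    tasks.named('javaPreCommit').configure {
      setDependsOn(getDependsOn().findAll {
        !it.toString().contains(':extensions:sql')
      })
    }

The judgment call is the same one as in the examples above: how often does
a core-only edit actually break SQL, and is catching that in postcommit
soon enough?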

Kenn
