On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik <[email protected]> wrote:

> The budget would represent some criteria that we need from tests (e.g., percent passed, max number of skipped tests, test execution time, ...). If we fail the criteria then there must be actionable work (such as fixing tests), followed by something that prevents the status quo from continuing (such as preventing releases or features from being merged) until the criteria are satisfied again.
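For concreteness, a budget like this could be enforced as a post-test check over the JUnit XML reports that Gradle's 'test' task writes by default. Below is a minimal sketch; the task name, report location, and thresholds are hypothetical placeholders, not anything Beam has today:

    // Hypothetical sketch: fail the build when aggregate JUnit results
    // violate a "test health budget". Thresholds are placeholders.
    tasks.register('checkTestBudget') {
        doLast {
            int tests = 0, failures = 0, skipped = 0
            // Parse the XML reports the 'test' task writes by default.
            fileTree(dir: "${buildDir}/test-results", include: '**/TEST-*.xml').each { f ->
                def suite = new XmlSlurper().parse(f)
                tests    += suite.@tests.text().toInteger()
                failures += suite.@failures.text().toInteger() + suite.@errors.text().toInteger()
                skipped  += suite.@skipped.text().toInteger()
            }
            def passRate = tests == 0 ? 1.0 : (tests - failures) / tests
            if (passRate < 0.99 || skipped > 50) {
                throw new GradleException("Test budget violated: passRate=${passRate}, skipped=${skipped}")
            }
        }
    }

A CI job could then make such a task a required step, which gives the "prevents the status quo from continuing" property: the build stays red until the budget is met again.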
+1. This is aligned with "CI as monitoring/alerting of the health of the machine that is your evolving codebase", which I very much subscribe to. Alert when something is wrong (another missing piece: have a quick way to ack and suppress false alarms in those cases where you really want a sensitive alert). Do you know of good implementation choices in Gradle/JUnit/Jenkins? (asking before searching for it myself; a couple of possible starting points are sketched at the end of this mail)

Kenn

> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles <[email protected]> wrote:
>
>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw <[email protected]> wrote:
>>
>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik <[email protected]> wrote:
>>> >
>>> >> If Brian's: it does not result in a redundant build (if the plugin works), since it would be one Gradle build process. But it does do a full build if you touch something at the root of the ancestry tree, like the core SDK or the model. I would like to avoid automatically testing descendants if we can, since things like Nexmark and most IOs are not sensitive to the vast majority of model or core SDK changes. Runners are borderline.
>>> >
>>> > I believe that the cost of fixing an issue that is found later, once a test starts failing because it wasn't run as part of the PR, is an order of magnitude higher to triage and fix. This is mostly due to loss of context for the PR author/reviewer and, if the culprit PR can't be found, for whoever ends up trying to fix it.
>>>
>>> Huge +1 to this.
>>>
>>
>> Totally agree. This abstract statement is clearly true. I suggest considering things more concretely.
>>
>>> Ideally we could count on the build system (and good caching) to only test what actually needs to be tested, and with work being done on runners and IOs this would be a small subset of our entire suite. When working lower in the stack (as I am prone to do) I think it's acceptable to have longer wait times--and would *much* rather pay that price than discover things later. Perhaps some things could be surgically removed (it would be interesting to mine data on how often test failures in the "leaves" catch real issues), but I would do that with care. That being said, flakiness is really an issue (and it seems these days I have to re-run tests, often multiple times, to get a PR to green; splitting up jobs could help with that as well).
>>
>> Agree with your sentiment that a longer wait for core changes is generally fine; my phrasing above overemphasized this case. Anecdotally, without mining data, leaf modules do sometimes catch bugs in core changes, precisely when (by definition) the core is not adequately tested by its own suite. This is a good measure of how much we have to improve our engineering practices.
>>
>> But anyhow, this is one very special case. Coming back to the overall issue, what we actually do today is run all leaf/middle/root builds whenever anything in any leaf/middle/root layer is changed. And we track greenness and flakiness at this same level of granularity.
>> Recall my (non-binding) starting point, guessing at what tests should or should not run in some scenarios (this tangent is just about the third one, where I explicitly said maybe we run all the same tests, and then we want to focus on separating signals, as Luke pointed out):
>>
>> > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>> > - changing a runner would not trigger the long IO local integration tests
>> > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with a clear flakiness signal
>>
>> And let's consider even more concrete examples:
>>
>> - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>> - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>> - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on the direct runner?
>>
>> I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test: not all (core change, leaf test) pairs are equally important. The vast majority of the tests we run are literally unable to be affected by the changes that trigger them. That is why enabling the Gradle cache, or using a plugin like the one Brian found, could help with part of the issue but not the whole issue, as Luke reminded us.
>>
>> We make these tradeoffs all the time, of course, by putting some tests in *IT (run in postCommit) and some in *Test (implicitly run in preCommit). But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the decision of where and when to run them (less stable, changing as the project changes).
>>
>> My assumption is that the project will only grow and all these problems (flakiness, runtime, false coupling) will continue to get worse. I raised this now so we can consider a steady-state approach that scales, before it becomes an emergency. I take it as a given that it is harder to change culture than to change infra/code, so I am not counting on more attention to flaky tests, to testing the core properly, to making tests snappy, or to more careful separation of *IT and *Test (unless we build infra that forces more attention to these things).
>>
>> Incidentally, SQL is not actually fully factored out. If you edit SQL, it runs a limited subset defined by :sqlPreCommit. If you edit core, then :javaPreCommit still includes the SQL tests.
>>
>> Kenn
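On the Gradle/JUnit/Jenkins implementation question above: two pieces of this already exist in the Gradle ecosystem, sketched below. The build cache lets Gradle skip test tasks whose inputs have not changed, and the test-retry plugin retries failing tests, which both reduces spurious reds and gives a place to observe flakiness (the plugin version shown is approximate; check the plugin portal for the current one):

    # gradle.properties: reuse outputs of tasks whose inputs are unchanged
    org.gradle.caching=true

and in a module's build.gradle:

    plugins {
        id 'org.gradle.test-retry' version '1.5.0'  // version approximate
    }

    test {
        retry {
            maxRetries = 2    // retry each failed test up to twice
            maxFailures = 10  // give up retrying when many tests fail (likely real breakage)
            failOnPassedAfterRetry = false  // a test that eventually passes does not fail the build
        }
    }

Neither of these is the whole budget/alerting story, but they address the "only test what changed" and "separate flaky signal from real signal" halves of it.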
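And for the decoupling itself, since Jenkins pipelines are also Groovy: the stable suite definitions can stay as Gradle tasks (like :sqlPreCommit and :javaPreCommit today), while the less stable "where and when" decision lives in CI as a mapping from changed paths to suites. A hypothetical Jenkinsfile fragment; all paths and task names other than those two existing tasks are made up for illustration:

    // Hypothetical: choose which Gradle suites to run from the changed paths.
    def suitesFor(List<String> changedFiles) {
        def suites = [] as Set
        changedFiles.each { path ->
            if (path.startsWith('sdks/java/extensions/sql/')) {
                suites << ':sqlPreCommit'
            } else if (path.startsWith('runners/flink/')) {
                suites << ':flinkRunnerPreCommit'  // hypothetical task name
            } else {
                suites << ':javaPreCommit'         // core or unmapped: run the broad suite
            }
        }
        return suites
    }

    // In the job body (scripted pipeline), roughly:
    //   def changed = sh(script: 'git diff --name-only origin/master...HEAD',
    //                    returnStdout: true).trim().split('\n') as List
    //   sh "./gradlew ${suitesFor(changed).join(' ')}"

The point being that editing the mapping does not touch the suite definitions, so the suites stay stable while the trigger policy evolves with the project.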
