No, not without doing the research myself to see what tooling is currently available.
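That said, here is a rough sketch of how the "budget" check below could be home-grown in Gradle, by parsing the JUnit XML reports the `test` task already writes. To be clear, this is a hypothetical sketch, not an existing plugin; the task name (checkTestBudget), the report path, and the threshold values are all made up:

    // Hypothetical sketch, not an existing plugin. Reads the JUnit XML
    // reports Gradle's `test` task writes under build/test-results/test.
    task checkTestBudget {
        dependsOn test
        doLast {
            int maxSkipped = 10        // assumed budget: max skipped tests
            double maxSeconds = 1200   // assumed budget: max total test time
            int skipped = 0
            double seconds = 0
            fileTree(dir: "${buildDir}/test-results/test",
                     include: '**/TEST-*.xml').each { report ->
                // Each <testsuite> element carries skipped/time attributes.
                def suite = new XmlSlurper().parse(report)
                skipped += (suite.@skipped.toString() ?: '0') as int
                seconds += (suite.@time.toString() ?: '0') as double
            }
            if (skipped > maxSkipped || seconds > maxSeconds) {
                throw new GradleException("Test budget exceeded: " +
                    "skipped=${skipped} (budget ${maxSkipped}), " +
                    "time=${seconds}s (budget ${maxSeconds}s)")
            }
        }
    }

Percent-passed could be computed the same way from the failures/tests attributes, and wiring such a task into the Jenkins jobs as a required step would supply the "prevents the status quo from continuing" part.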
On Thu, Jul 9, 2020 at 1:17 PM Kenneth Knowles <k...@apache.org> wrote:
>
> On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik <lc...@google.com> wrote:
>
>> The budget would represent some criteria that we need from tests
>> (e.g. percent passed, max num skipped tests, test execution time,
>> ...). If we fail the criteria then there must be actionable work
>> (such as fixing tests) followed by something that prevents the
>> status quo from continuing (such as preventing releases/features
>> from being merged) until the criteria are satisfied again.
>
> +1. This is aligned with "CI as monitoring/alerting of the health of
> the machine that is your evolving codebase", which I very much
> subscribe to. Alert when something is wrong (another missing piece:
> have a quick way to ack and suppress false alarms in those cases
> where you really want a sensitive alert).
>
> Do you know of good implementation choices in Gradle/JUnit/Jenkins?
> (asking before searching for them myself)
>
> Kenn
>
>> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles <k...@apache.org>
>> wrote:
>>
>>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw
>>> <rober...@google.com> wrote:
>>>
>>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik <lc...@google.com> wrote:
>>>> >
>>>> >> If Brian's: it does not result in a redundant build (if the
>>>> >> plugin works) since it would be one Gradle build process. But
>>>> >> it does do a full build if you touch something at the root of
>>>> >> the ancestry tree, like the core SDK or the model. I would like
>>>> >> to avoid automatically testing descendants if we can, since
>>>> >> things like Nexmark and most IOs are not sensitive to the vast
>>>> >> majority of model or core SDK changes. Runners are borderline.
>>>> >
>>>> > I believe that the cost of fixing an issue that is found later,
>>>> > once the test starts failing because it wasn't run as part of
>>>> > the PR, is an order of magnitude higher to triage and fix. This
>>>> > is mostly due to the loss of context from the PR
>>>> > author/reviewer, and if the culprit PR can't be found, the
>>>> > burden falls on whoever is trying to fix it.
>>>>
>>>> Huge +1 to this.
>>>
>>> Totally agree. This abstract statement is clearly true. I suggest
>>> considering things more concretely.
>>>
>>>> Ideally we could count on the build system (and good caching) to
>>>> only test what actually needs to be tested, and with work being
>>>> done on runners and IOs this would be a small subset of our entire
>>>> suite. When working lower in the stack (as I am prone to do) I
>>>> think it's acceptable to have longer wait times--and I would
>>>> *much* rather pay that price than discover things later. Perhaps
>>>> some things could be surgically removed (it would be interesting
>>>> to mine data on how often test failures in the "leaves" catch real
>>>> issues), but I would do that with care. That being said, flakiness
>>>> is really an issue (and it seems these days I have to re-run
>>>> tests, often multiple times, to get a PR to green; splitting up
>>>> jobs could help with that as well).
>>>
>>> Agree with your sentiment that a longer wait for core changes is
>>> generally fine; my phrasing above overemphasized this case.
>>> Anecdotally, without mining data, leaf modules do sometimes catch
>>> bugs in core changes when (by definition) those changes are not
>>> adequately tested. This is a good measure of how much we have to
>>> improve our engineering practices.
>>>
>>> But anyhow, this is one very special case.
>>> Coming back to the overall issue: what we actually do today is run
>>> all leaf/middle/root builds whenever anything in any
>>> leaf/middle/root layer is changed. And we track greenness and
>>> flakiness at this same level of granularity.
>>>
>>> Recall my (non-binding) starting point guessing at what tests
>>> should or should not run in some scenarios (this tangent is just
>>> about the third one, where I explicitly said maybe we run all the
>>> same tests and then we want to focus on separating signals, as Luke
>>> pointed out):
>>>
>>> > - changing an IO or runner would not trigger the 20 minutes of
>>> > core SDK tests
>>> > - changing a runner would not trigger the long IO local
>>> > integration tests
>>> > - changing the core SDK could potentially not run as many tests
>>> > in presubmit, but maybe it would and they would be separately
>>> > reported results with a clear flakiness signal
>>>
>>> And let's consider even more concrete examples:
>>>
>>> - when changing a Fn API proto, how important is it to run
>>> RabbitMqIOTest?
>>> - when changing JdbcIO, how important is it to run the Java SDK
>>> needsRunnerTests? RabbitMqIOTest?
>>> - when changing the FlinkRunner, how important is it to make sure
>>> that Nexmark queries still match their models when run on the
>>> direct runner?
>>>
>>> I chose these examples to all have zero value, of course. And I've
>>> deliberately included an example of a core change and a leaf test.
>>> Not all (core change, leaf test) pairs are equally important. The
>>> vast majority of the tests we run are literally unable to be
>>> affected by the changes triggering them. That's why enabling the
>>> Gradle cache or using a plugin like the one Brian found could help
>>> with part of the issue, but not the whole issue, again as Luke
>>> reminded us.
>>>
>>> We make these tradeoffs all the time, of course, by putting some
>>> tests in *IT and postCommit runs and some in *Test, implicitly
>>> preCommit. But I am imagining a future where we can decouple the
>>> test suite definitions (very stable, not depending on the project
>>> context) from the decision of where and when to run them (less
>>> stable, changing as the project changes).
>>>
>>> My assumption is that the project will only grow and all these
>>> problems (flakiness, runtime, false coupling) will continue to get
>>> worse. I raised this now so we could consider a steady-state
>>> approach that could scale, before it becomes an emergency. I take
>>> it as a given that it is harder to change culture than it is to
>>> change infra/code, so I am not considering any possibility of more
>>> attention to flaky tests, more attention to testing the core
>>> properly, more attention to making tests snappy, or more careful
>>> consideration of *IT and *Test (unless we build infra that forces
>>> more attention to these things).
>>>
>>> Incidentally, SQL is not actually fully factored out. If you edit
>>> SQL, it runs a limited subset defined by :sqlPreCommit. If you edit
>>> core, then :javaPreCommit still includes SQL tests.
>>>
>>> Kenn
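P.S. On the "decouple the test suite definitions from the decision of where and when to run them" idea above: one low-tech way to picture it is a seed script that maps changed paths to the Gradle tasks worth triggering. The path prefixes and task names below are purely illustrative, not a proposal for an actual mapping:

    // Illustrative sketch; the prefix -> task mapping is hypothetical.
    // Given the files a PR touches, pick the Gradle test tasks to trigger.
    def tasksFor(List<String> changedFiles) {
        def triggers = [
            'model/'            : [':javaPreCommit'],   // core: run broadly
            'sdks/java/core/'   : [':javaPreCommit'],
            'sdks/java/io/jdbc/': [':sdks:java:io:jdbc:test'],
            'runners/flink/'    : [':runners:flink:test']
        ]
        def tasks = [] as Set
        changedFiles.each { file ->
            triggers.each { prefix, taskNames ->
                if (file.startsWith(prefix)) { tasks.addAll(taskNames) }
            }
        }
        // Unmapped paths fall back to the full suite, never to nothing.
        return tasks ? tasks.toList() : [':javaPreCommit']
    }

A Jenkins job could feed this the output of `git diff --name-only master...HEAD` and invoke only the returned tasks, while the suite definitions themselves stay put in the build files.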