On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik <[email protected]> wrote:

> The budget would represent some criteria that we need from tests (e.g., percent passed, max number of skipped tests, test execution time, ...). If we fail the criteria then there must be actionable work (such as fixing tests), followed by something that prevents the status quo from continuing (such as preventing releases or features from being merged) until the criteria are satisfied again.
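For concreteness, a budget like this could be enforced as a post-test check over the JUnit XML reports that Gradle's 'test' task writes by default. Below is a minimal sketch; the task name, report location, and thresholds are hypothetical placeholders, not anything Beam has today:

    // Hypothetical sketch: fail the build when aggregate JUnit results
    // violate a "test health budget". Thresholds are placeholders.
    tasks.register('checkTestBudget') {
        doLast {
            int tests = 0, failures = 0, skipped = 0
            // Parse the XML reports the 'test' task writes by default.
            fileTree(dir: "${buildDir}/test-results", include: '**/TEST-*.xml').each { f ->
                def suite = new XmlSlurper().parse(f)
                tests    += suite.@tests.text().toInteger()
                failures += suite.@failures.text().toInteger() + suite.@errors.text().toInteger()
                skipped  += suite.@skipped.text().toInteger()
            }
            def passRate = tests == 0 ? 1.0 : (tests - failures) / tests
            if (passRate < 0.99 || skipped > 50) {
                throw new GradleException("Test budget violated: passRate=${passRate}, skipped=${skipped}")
            }
        }
    }

A CI job could then make such a task a required step, which gives the "prevents the status quo from continuing" property: the build stays red until the budget is met again.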
+1. This is aligned with "CI as monitoring/alerting of the health of the machine that is your evolving codebase", which I very much subscribe to. Alert when something is wrong (another missing piece: have a quick way to ack and suppress false alarms in those cases where you really want a sensitive alert). Do you know of good implementation choices in Gradle/JUnit/Jenkins? (asking before searching for it myself; a couple of possible starting points are sketched at the end of this mail)

Kenn

> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles <[email protected]> wrote:
>
>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw <[email protected]> wrote:
>>
>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik <[email protected]> wrote:
>>> >
>>> >> If Brian's: it does not result in a redundant build (if the plugin works), since it would be one Gradle build process. But it does do a full build if you touch something at the root of the ancestry tree, like the core SDK or the model. I would like to avoid automatically testing descendants if we can, since things like Nexmark and most IOs are not sensitive to the vast majority of model or core SDK changes. Runners are borderline.
>>> >
>>> > I believe that the cost of fixing an issue that is found later, once a test starts failing because it wasn't run as part of the PR, is an order of magnitude higher to triage and fix. This is mostly due to loss of context for the PR author/reviewer and, if the culprit PR can't be found, for whoever ends up trying to fix it.
>>>
>>> Huge +1 to this.
>>>
>>
>> Totally agree. This abstract statement is clearly true. I suggest considering things more concretely.
>>
>>> Ideally we could count on the build system (and good caching) to only test what actually needs to be tested, and with work being done on runners and IOs this would be a small subset of our entire suite. When working lower in the stack (as I am prone to do) I think it's acceptable to have longer wait times--and would *much* rather pay that price than discover things later. Perhaps some things could be surgically removed (it would be interesting to mine data on how often test failures in the "leaves" catch real issues), but I would do that with care. That being said, flakiness is really an issue (and it seems these days I have to re-run tests, often multiple times, to get a PR to green; splitting up jobs could help with that as well).
>>
>> Agree with your sentiment that a longer wait for core changes is generally fine; my phrasing above overemphasized this case. Anecdotally, without mining data, leaf modules do sometimes catch bugs in core changes, precisely when (by definition) the core is not adequately tested by its own suite. This is a good measure of how much we have to improve our engineering practices.
>>
>> But anyhow, this is one very special case. Coming back to the overall issue, what we actually do today is run all leaf/middle/root builds whenever anything in any leaf/middle/root layer is changed. And we track greenness and flakiness at this same level of granularity.
>> Recall my (non-binding) starting point, guessing at what tests should or should not run in some scenarios (this tangent is just about the third one, where I explicitly said maybe we run all the same tests, and then we want to focus on separating signals, as Luke pointed out):
>>
>> > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>> > - changing a runner would not trigger the long IO local integration tests
>> > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with a clear flakiness signal
>>
>> And let's consider even more concrete examples:
>>
>> - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>> - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>> - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on the direct runner?
>>
>> I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test: not all (core change, leaf test) pairs are equally important. The vast majority of the tests we run are literally unable to be affected by the changes that trigger them. That is why enabling the Gradle cache, or using a plugin like the one Brian found, could help with part of the issue but not the whole issue, as Luke reminded us.
>>
>> We make these tradeoffs all the time, of course, by putting some tests in *IT (run in postCommit) and some in *Test (implicitly run in preCommit). But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the decision of where and when to run them (less stable, changing as the project changes).
>>
>> My assumption is that the project will only grow and all these problems (flakiness, runtime, false coupling) will continue to get worse. I raised this now so we can consider a steady-state approach that scales, before it becomes an emergency. I take it as a given that it is harder to change culture than to change infra/code, so I am not counting on more attention to flaky tests, to testing the core properly, to making tests snappy, or to more careful separation of *IT and *Test (unless we build infra that forces more attention to these things).
>>
>> Incidentally, SQL is not actually fully factored out. If you edit SQL, it runs a limited subset defined by :sqlPreCommit. If you edit core, then :javaPreCommit still includes the SQL tests.
>>
>> Kenn
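On the Gradle/JUnit/Jenkins implementation question above: two pieces of this already exist in the Gradle ecosystem, sketched below. The build cache lets Gradle skip test tasks whose inputs have not changed, and the test-retry plugin retries failing tests, which both reduces spurious reds and gives a place to observe flakiness (the plugin version shown is approximate; check the plugin portal for the current one):

    # gradle.properties: reuse outputs of tasks whose inputs are unchanged
    org.gradle.caching=true

and in a module's build.gradle:

    plugins {
        id 'org.gradle.test-retry' version '1.5.0'  // version approximate
    }

    test {
        retry {
            maxRetries = 2    // retry each failed test up to twice
            maxFailures = 10  // give up retrying when many tests fail (likely real breakage)
            failOnPassedAfterRetry = false  // a test that eventually passes does not fail the build
        }
    }

Neither of these is the whole budget/alerting story, but they address the "only test what changed" and "separate flaky signal from real signal" halves of it.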
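And for the decoupling itself, since Jenkins pipelines are also Groovy: the stable suite definitions can stay as Gradle tasks (like :sqlPreCommit and :javaPreCommit today), while the less stable "where and when" decision lives in CI as a mapping from changed paths to suites. A hypothetical Jenkinsfile fragment; all paths and task names other than those two existing tasks are made up for illustration:

    // Hypothetical: choose which Gradle suites to run from the changed paths.
    def suitesFor(List<String> changedFiles) {
        def suites = [] as Set
        changedFiles.each { path ->
            if (path.startsWith('sdks/java/extensions/sql/')) {
                suites << ':sqlPreCommit'
            } else if (path.startsWith('runners/flink/')) {
                suites << ':flinkRunnerPreCommit'  // hypothetical task name
            } else {
                suites << ':javaPreCommit'         // core or unmapped: run the broad suite
            }
        }
        return suites
    }

    // In the job body (scripted pipeline), roughly:
    //   def changed = sh(script: 'git diff --name-only origin/master...HEAD',
    //                    returnStdout: true).trim().split('\n') as List
    //   sh "./gradlew ${suitesFor(changed).join(' ')}"

The point being that editing the mapping does not touch the suite definitions, so the suites stay stable while the trigger policy evolves with the project.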
