No, not without doing the research myself to see what tooling is currently available.
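That said, here is a rough sketch of how the "budget" check below could be home-grown in Gradle, by parsing the JUnit XML reports the `test` task already writes. To be clear, this is a hypothetical sketch, not an existing plugin; the task name (checkTestBudget), the report path, and the threshold values are all made up:

    // Hypothetical sketch, not an existing plugin. Reads the JUnit XML
    // reports Gradle's `test` task writes under build/test-results/test.
    task checkTestBudget {
        dependsOn test
        doLast {
            int maxSkipped = 10        // assumed budget: max skipped tests
            double maxSeconds = 1200   // assumed budget: max total test time
            int skipped = 0
            double seconds = 0
            fileTree(dir: "${buildDir}/test-results/test",
                     include: '**/TEST-*.xml').each { report ->
                // Each <testsuite> element carries skipped/time attributes.
                def suite = new XmlSlurper().parse(report)
                skipped += (suite.@skipped.toString() ?: '0') as int
                seconds += (suite.@time.toString() ?: '0') as double
            }
            if (skipped > maxSkipped || seconds > maxSeconds) {
                throw new GradleException("Test budget exceeded: " +
                    "skipped=${skipped} (budget ${maxSkipped}), " +
                    "time=${seconds}s (budget ${maxSeconds}s)")
            }
        }
    }

Percent-passed could be computed the same way from the failures/tests attributes, and wiring such a task into the Jenkins jobs as a required step would supply the "prevents the status quo from continuing" part.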
On Thu, Jul 9, 2020 at 1:17 PM Kenneth Knowles <k...@apache.org> wrote:
>
> On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik <lc...@google.com> wrote:
>
>> The budget would represent some criteria that we need from tests
>> (e.g. percent passed, max num skipped tests, test execution time,
>> ...). If we fail the criteria then there must be actionable work
>> (such as fixing tests) followed by something that prevents the
>> status quo from continuing (such as preventing releases/features
>> from being merged) until the criteria are satisfied again.
>
> +1. This is aligned with "CI as monitoring/alerting of the health of
> the machine that is your evolving codebase", which I very much
> subscribe to. Alert when something is wrong (another missing piece:
> have a quick way to ack and suppress false alarms in those cases
> where you really want a sensitive alert).
>
> Do you know of good implementation choices in Gradle/JUnit/Jenkins?
> (asking before searching for them myself)
>
> Kenn
>
>> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles <k...@apache.org>
>> wrote:
>>
>>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw
>>> <rober...@google.com> wrote:
>>>
>>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik <lc...@google.com> wrote:
>>>> >
>>>> >> If Brian's: it does not result in a redundant build (if the
>>>> >> plugin works) since it would be one Gradle build process. But
>>>> >> it does do a full build if you touch something at the root of
>>>> >> the ancestry tree, like the core SDK or the model. I would like
>>>> >> to avoid automatically testing descendants if we can, since
>>>> >> things like Nexmark and most IOs are not sensitive to the vast
>>>> >> majority of model or core SDK changes. Runners are borderline.
>>>> >
>>>> > I believe that the cost of fixing an issue that is found later,
>>>> > once the test starts failing because it wasn't run as part of
>>>> > the PR, is an order of magnitude higher to triage and fix. This
>>>> > is mostly due to the loss of context from the PR
>>>> > author/reviewer, and if the culprit PR can't be found, the
>>>> > burden falls on whoever is trying to fix it.
>>>>
>>>> Huge +1 to this.
>>>
>>> Totally agree. This abstract statement is clearly true. I suggest
>>> considering things more concretely.
>>>
>>>> Ideally we could count on the build system (and good caching) to
>>>> only test what actually needs to be tested, and with work being
>>>> done on runners and IOs this would be a small subset of our entire
>>>> suite. When working lower in the stack (as I am prone to do) I
>>>> think it's acceptable to have longer wait times--and I would
>>>> *much* rather pay that price than discover things later. Perhaps
>>>> some things could be surgically removed (it would be interesting
>>>> to mine data on how often test failures in the "leaves" catch real
>>>> issues), but I would do that with care. That being said, flakiness
>>>> is really an issue (and it seems these days I have to re-run
>>>> tests, often multiple times, to get a PR to green; splitting up
>>>> jobs could help with that as well).
>>>
>>> Agree with your sentiment that a longer wait for core changes is
>>> generally fine; my phrasing above overemphasized this case.
>>> Anecdotally, without mining data, leaf modules do sometimes catch
>>> bugs in core changes when (by definition) those changes are not
>>> adequately tested. This is a good measure of how much we have to
>>> improve our engineering practices.
>>>
>>> But anyhow, this is one very special case.
>>> Coming back to the overall issue: what we actually do today is run
>>> all leaf/middle/root builds whenever anything in any
>>> leaf/middle/root layer is changed. And we track greenness and
>>> flakiness at this same level of granularity.
>>>
>>> Recall my (non-binding) starting point guessing at what tests
>>> should or should not run in some scenarios (this tangent is just
>>> about the third one, where I explicitly said maybe we run all the
>>> same tests and then we want to focus on separating signals, as Luke
>>> pointed out):
>>>
>>> > - changing an IO or runner would not trigger the 20 minutes of
>>> > core SDK tests
>>> > - changing a runner would not trigger the long IO local
>>> > integration tests
>>> > - changing the core SDK could potentially not run as many tests
>>> > in presubmit, but maybe it would and they would be separately
>>> > reported results with a clear flakiness signal
>>>
>>> And let's consider even more concrete examples:
>>>
>>> - when changing a Fn API proto, how important is it to run
>>> RabbitMqIOTest?
>>> - when changing JdbcIO, how important is it to run the Java SDK
>>> needsRunnerTests? RabbitMqIOTest?
>>> - when changing the FlinkRunner, how important is it to make sure
>>> that Nexmark queries still match their models when run on the
>>> direct runner?
>>>
>>> I chose these examples to all have zero value, of course. And I've
>>> deliberately included an example of a core change and a leaf test.
>>> Not all (core change, leaf test) pairs are equally important. The
>>> vast majority of the tests we run are literally unable to be
>>> affected by the changes triggering them. That's why enabling the
>>> Gradle cache or using a plugin like the one Brian found could help
>>> with part of the issue, but not the whole issue, again as Luke
>>> reminded us.
>>>
>>> We make these tradeoffs all the time, of course, by putting some
>>> tests in *IT and postCommit runs and some in *Test, implicitly
>>> preCommit. But I am imagining a future where we can decouple the
>>> test suite definitions (very stable, not depending on the project
>>> context) from the decision of where and when to run them (less
>>> stable, changing as the project changes).
>>>
>>> My assumption is that the project will only grow and all these
>>> problems (flakiness, runtime, false coupling) will continue to get
>>> worse. I raised this now so we could consider a steady-state
>>> approach that could scale, before it becomes an emergency. I take
>>> it as a given that it is harder to change culture than it is to
>>> change infra/code, so I am not considering any possibility of more
>>> attention to flaky tests, more attention to testing the core
>>> properly, more attention to making tests snappy, or more careful
>>> consideration of *IT and *Test (unless we build infra that forces
>>> more attention to these things).
>>>
>>> Incidentally, SQL is not actually fully factored out. If you edit
>>> SQL, it runs a limited subset defined by :sqlPreCommit. If you edit
>>> core, then :javaPreCommit still includes SQL tests.
>>>
>>> Kenn
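P.S. On the "decouple the test suite definitions from the decision of where and when to run them" idea above: one low-tech way to picture it is a seed script that maps changed paths to the Gradle tasks worth triggering. The path prefixes and task names below are purely illustrative, not a proposal for an actual mapping:

    // Illustrative sketch; the prefix -> task mapping is hypothetical.
    // Given the files a PR touches, pick the Gradle test tasks to trigger.
    def tasksFor(List<String> changedFiles) {
        def triggers = [
            'model/'            : [':javaPreCommit'],   // core: run broadly
            'sdks/java/core/'   : [':javaPreCommit'],
            'sdks/java/io/jdbc/': [':sdks:java:io:jdbc:test'],
            'runners/flink/'    : [':runners:flink:test']
        ]
        def tasks = [] as Set
        changedFiles.each { file ->
            triggers.each { prefix, taskNames ->
                if (file.startsWith(prefix)) { tasks.addAll(taskNames) }
            }
        }
        // Unmapped paths fall back to the full suite, never to nothing.
        return tasks ? tasks.toList() : [':javaPreCommit']
    }

A Jenkins job could feed this the output of `git diff --name-only master...HEAD` and invoke only the returned tasks, while the suite definitions themselves stay put in the build files.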