Some context links for the benefit of the thread & archive:

Beam issue mentioning a Jenkins plugin that caches on the Jenkins master:
https://issues.apache.org/jira/browse/BEAM-4400

Beam's request to infra: https://issues.apache.org/jira/browse/INFRA-16630

Denied and reasoning on prior request:
https://issues.apache.org/jira/browse/INFRA-16060
Because the Jenkins master / S3 are not good choices for where to cache. Hosting the actual Gradle build cache server, as in the thread linked above, would be different. Prototyping on the Beam ticket was not successful.

(Ongoing) thread on builds@ asking about an existing service:
https://lists.apache.org/thread.html/ae40734e34dcf1d3bd8c65dfea3094709d9d8eb97bfb9ab92149e97c%40%3Cbuilds.apache.org%3E

A few rough Gradle sketches of the pieces discussed below (remote build cache, task inputs/outputs, test retries, a smoke-test subset) are appended after the quoted thread.

Kenn

On Mon, Jul 13, 2020 at 10:26 AM Kenneth Knowles <k...@apache.org> wrote:

> Having thought this over a bit, I think there are a few goals and they are interfering with each other.
>
> 1. Clear signal for module / test suite health. This is a post-commit concern. Post-commit jobs already all run as cron jobs with no dependency-driven stuff.
> 2. Making the precommit test signal stay non-flaky as modules, tests, and flakiness increase.
> 3. Making precommit stay fast as modules, tests, and flakiness increase.
>
> Noting the interdependence of pre-commit and post-commit:
>
> - you can phrase-trigger post-commit jobs
> - pre-commit jobs are run as post-commits also
>
> Summarizing a bit:
>
> 1. Clear per-module/suite greenness and flakiness signal
> - it would be nice if we could do this at the Gradle job level, but right now it is at the Jenkins job level
> - on the other hand, most Gradle jobs do not represent a module, so that could be too fine-grained and Jenkins jobs are better
> - if we have a ton of Jenkins jobs, we need some new automation or amortized management
> - we don't want to overwhelm the Jenkins executors, especially not by causing precommit queueing
>
> 2. Making precommit stay non-flaky, robustly
> - we can fix flakes, but we can't count on that long term; we could build something that forces us to treat flakes as P0
> - we can add a retry budget to tests where deflaking cannot be prioritized
> - there is a lot of anxiety that testing less in pre-commit will cause painful post-commit debugging
> - there is a lot of overlap with making it faster, since the flakes are often caused by irrelevant tests
>
> 3. Making precommit stay fast, robustly
> - we could improve per-worker incremental builds
> - we could use a distributed build cache
> - we have tasks that don't declare their inputs/outputs correctly, and those will have problems
>
> I care most about #1 and then also #2. The only reason I care about #3 is because of #2: once a pre-commit takes more than a couple of minutes, I always go and do something else and come back in an hour or two. So if it flakes just a few times, it costs a day. Fix #2 and I don't think #3 is urgent yet.
>
> A distributed build cache seems to be fairly low effort to set up, makes #2 and #3 better, and may unlock approaches to #1, if we can fix our Gradle configs. We can ask ASF infra if they have something already or can set it up.
>
> That will still leave open how to get a better and more visible greenness and flakiness signal at a more meaningful granularity.
>
> Kenn
>
> On Fri, Jul 10, 2020 at 6:38 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>>> I wonder how hard it would be to track greenness and flakiness at the level of gradle project (or even lower), viewed hierarchically.
>>>
>>
>> Looks like this is part of the Gradle Enterprise Tests Dashboard offering: https://gradle.com/blog/flaky-tests/
>>
>> Kenn
>>
>>> > Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals, as Luke pointed out):
>>> >
>>> > > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>>> > > - changing a runner would not trigger the long IO local integration tests
>>> > > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with a clear flakiness signal
>>> >
>>> > And let's consider even more concrete examples:
>>> >
>>> > - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>>> > - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>>> > - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on the direct runner?
>>> >
>>> > I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of the tests we run are literally unable to be affected by the changes triggering them. So that's why enabling the Gradle cache or using a plugin like the one Brian found could help with part of the issue, but not the whole issue, again as Luke reminded us.
>>>
>>> For (2) and (3), I would hope that the build dependency graph could exclude them. You're right about (1) (and I've hit that countless times), but I would rather err on the side of accidentally running too many tests than not enough. If we make manual edits to what can be inferred from the build graph, let's make it a blacklist rather than an allow list, to avoid accidentally losing coverage.
>>>
>>> > We make these tradeoffs all the time, of course, by putting some tests in *IT and postCommit runs and some in *Test, implicitly preCommit. But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the decision of where and when to run them (less stable, changing as the project changes).
>>> >
>>> > My assumption is that the project will only grow and all these problems (flakiness, runtime, false coupling) will continue to get worse. I raised this now so we can consider a steady-state approach that could scale, before it becomes an emergency. I take it as a given that it is harder to change culture than it is to change infra/code, so I am not considering any possibility of more attention to flaky tests, more attention to testing the core properly, more attention to making tests snappy, or more careful consideration of *IT and *Test (unless we build infra that forces more attention to these things).
>>> >
>>> > Incidentally, SQL is not actually fully factored out. If you edit SQL, it runs a limited subset defined by :sqlPreCommit. If you edit core, then :javaPreCommit still includes the SQL tests.
>>>
>>> I think running SQL tests when you edit core is not actually that bad. Possibly better than not running any of them. (Maybe, as cost becomes more of a concern, adding the notion of "smoke tests" that are a cheap subset run when upstream projects change would be a good compromise.)
>>>
>>
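For concreteness, here is roughly what the client side of the distributed build cache would look like, assuming someone (ASF infra or a donated service) hosts a Gradle remote cache reachable over HTTP. The URL and environment variable names below are placeholders, not an existing service:

    // settings.gradle (sketch) -- also needs org.gradle.caching=true in
    // gradle.properties, or running builds with --build-cache.
    buildCache {
        local {
            enabled = true
        }
        remote(HttpBuildCache) {
            // Placeholder endpoint; no such service exists today.
            url = 'https://gradle-cache.example.apache.org/cache/'
            // Only trusted (post-merge) CI builds should push; PR builds and
            // developers only pull, so an untrusted change cannot poison the cache.
            push = System.getenv('GRADLE_CACHE_PUSH') == 'true'
            credentials {
                username = System.getenv('GRADLE_CACHE_USER')
                password = System.getenv('GRADLE_CACHE_PASSWORD')
            }
        }
    }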
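The cache (local or remote) only helps for tasks that declare their inputs and outputs completely; that is the "tasks that don't declare their inputs/outputs correctly" problem above, and fixing those declarations is the real work. A sketch of what a well-behaved, cacheable task looks like; GenerateDocs is invented for illustration and is not a task in the Beam build:

    // build.gradle (sketch)
    @CacheableTask
    abstract class GenerateDocs extends DefaultTask {
        // Every file the task reads must be declared, or cache hits can be wrong.
        @InputFiles
        @PathSensitive(PathSensitivity.RELATIVE) // keep cache keys machine-independent
        abstract ConfigurableFileCollection getSources()

        // Everything the task writes must be declared so it can be restored from the cache.
        @OutputDirectory
        abstract DirectoryProperty getOutputDir()

        @TaskAction
        void run() {
            def outDir = outputDir.get().asFile
            outDir.mkdirs()
            new File(outDir, 'index.txt').text =
                sources.files.collect { it.name }.sort().join('\n')
        }
    }

    tasks.register('generateDocs', GenerateDocs) {
        sources.from(fileTree('src/main/java'))
        outputDir = layout.buildDirectory.dir('docs')
    }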
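The "retry budget" for tests whose deflaking cannot be prioritized could be prototyped with Gradle's test-retry plugin rather than anything custom. The plugin version and the limits below are illustrative, not a recommendation:

    // build.gradle (sketch)
    plugins {
        // Check the Gradle plugin portal for the current version.
        id 'org.gradle.test-retry' version '1.5.0'
    }

    test {
        retry {
            maxRetries = 2                  // per-test retry budget for known flakes
            maxFailures = 10                // bail out early if a change is genuinely broken
            failOnPassedAfterRetry = false  // flip to true to keep flakes visible as failures
        }
    }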
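And the "smoke tests" compromise at the end of the thread could be as simple as a category-driven subset that jobs triggered by upstream-only changes run instead of the full suite. The SmokeTest category below is a hypothetical marker interface, not something that exists in Beam today:

    // build.gradle (sketch) -- assumes the 'java' plugin's test source set.
    tasks.register('smokeTest', Test) {
        testClassesDirs = sourceSets.test.output.classesDirs
        classpath = sourceSets.test.runtimeClasspath
        useJUnit {
            // Hypothetical JUnit 4 category, e.g. org.apache.beam.sdk.testing.SmokeTest.
            includeCategories 'org.apache.beam.sdk.testing.SmokeTest'
        }
    }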