On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik <lc...@google.com> wrote:

> On Wed, Jul 8, 2020 at 9:22 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> I like your use of "ancestor" and "descendant". I will adopt it.
>>
>> On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > I'm not sure that breaking it up will be significantly faster, since
>>> each module needs to build its ancestors and run the tests of itself and
>>> all of its descendants, which isn't a trivial amount of work. We have only
>>> so many executors, and with the increased number of jobs, won't we just be
>>> waiting for queued jobs to start?
>>
>>
>>
>>> I think that depends on how many fewer tests we could run (or rerun)
>>> for the average PR. (It would also be nice if we could share build
>>> artifacts across executors (is there something like ccache for
>>> javac?), but maybe that's too far-fetched?)
>>>
>>
>> Robert: The Gradle cache should remain valid across runs, I think... my
>> latest understanding is that it does a robust up-to-date check (i.e., not
>> like `make`). We may have messed this up, as I am not seeing as much
>> caching as I would expect, nor as much as I see locally. We had to do some
>> tweaking in the Maven days to put the .m2 directory outside of the realm
>> wiped for each new build. Maybe we are clobbering the Gradle cache too.
>> Fixing that might actually make most builds so fast that we do not care
>> about my proposal.
>>
>
The Gradle cache relies on our inputs/outputs being specified correctly.
> It's great that this has been fixed, since I was under the impression that
> it was disabled and/or that we used --rerun-tasks everywhere.
>

Sorry, when I said *should* I meant that if it is not currently being used,
we should do what it takes to use it. Based on the scans, I don't think
test results are being cached, but I could have read things wrong...
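
For example, something like this in settings.gradle would turn on a local
cache that survives the per-build workspace wipe. This is just a minimal
sketch from me, not our actual configuration, and the cache location is
illustrative:

    // settings.gradle -- minimal sketch; requires org.gradle.caching=true
    // in gradle.properties (or running with --build-cache).
    buildCache {
        local {
            // Keep the cache outside the directory Jenkins wipes per build,
            // the same trick we used for .m2 in the Maven days.
            directory = new File(System.getProperty('user.home'), '.beam-build-cache')
            removeUnusedEntriesAfterDays = 7
        }
    }

If I recall correctly, Test tasks are cacheable out of the box once caching
is enabled, so the scans should then show cached test results.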


>> Luke: I am not sure if you are replying to my email or to Brian's.
>>
>> If Brian's: it does not result in redundant builds (if the plugin works),
>> since it would be one Gradle build process. But it does do a full build if
>> you touch something at the root of the ancestry tree, like the core SDK or
>> the model. I would like to avoid automatically testing descendants if we
>> can, since things like Nexmark and most IOs are not sensitive to the vast
>> majority of model or core SDK changes. Runners are borderline.
>>
>
> I believe that an issue found later, once a test starts failing because it
> wasn't run as part of the PR, costs an order of magnitude more to triage
> and fix. That is mostly due to loss of context for the PR author/reviewer,
> and, if the culprit PR can't be found, for whoever ends up trying to fix it.
>
> If we are not willing to separate things out into individual jobs, then we
> are really just trying to make the one job faster.
>

It would also reduce flakiness, which was a key motivator for this thread.
The point about separate signals is a good one, which I somehow forgot
between emails. So an approach based on separate jobs is not strictly worse,
since it has that benefit.


> How much digging have folks done into the build scans? They show a lot of
> details that are useful for seeing what is slow in a specific job. Take the
> Java Precommit for example:
> * The timeline of what tasks ran when:
> https://scans.gradle.com/s/u2rkcnww2fs24/timeline (looks like Nexmark
> testing is 30 mins long and is the straggler)
>

I did a bit of this digging the other day. Separating Nexmark out from the
Java job (as we did with SQL) would be a mitigation that addresses job speed;
I planned on doing that today. Separating out each of the ten-minute IO and
runner runs would also improve speed and reduce flakiness, but then this
turns into a longer task. Doing it with include/exclude patterns in the job
files is simple [1] but will get harder to keep consistent. I would guess
they are already inconsistent.

Here's a sketch of one way this could scale: have the metadata that defines
trigger patterns and test targets live next to the modules. Then it scales
just as well as authoring modules does. You need some code to assemble the
appropriate job triggers from the declared ancestry. This also has the
benefit that the signal is for a module and not for a job: changing the
triggers or refactoring how different things run would not reset the meaning
of the signal, as it does now.
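
To make that concrete, here is a purely hypothetical sketch; nothing like
this exists in the repo today and all the names are made up:

    // sdks/java/io/foo/precommit.gradle -- hypothetical metadata living
    // next to the module it describes.
    precommitMetadata {
        // Paths that should trigger this module's precommit job.
        triggerPathPatterns = ['^sdks/java/io/foo/.*$']
        // The Gradle target(s) the generated Jenkins job would run.
        testTargets = [':sdks:java:io:foo:build']
        // Opt-in list of descendants to also test when this module changes.
        testedDescendants = []
    }

A small generator would walk these declarations and emit the Jenkins job
triggers, so the job files would stop being hand-maintained duplicates of
the dependency graph.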


> * It looks like our build cache (
> https://scans.gradle.com/s/u2rkcnww2fs24/performance/buildCache) is
> saving about 5% of total cpu time, should we consider setting up a remote
> build cache?
>
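
For what it's worth, the configuration side of a remote cache is small. A
sketch, assuming we stood up an HTTP cache backend (the URL and the CI check
are placeholders):

    // settings.gradle -- sketch only; assumes an HTTP build cache backend.
    buildCache {
        remote(HttpBuildCache) {
            url = 'https://cache.example.com/cache/'  // placeholder
            // Only CI populates the cache; developers just read from it.
            push = System.getenv('CI') != null        // hypothetical switch
        }
    }

The real work would be operating the cache node itself.
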
>> If mine: you could assume my proposal is like Brian's but with fully
>> isolated Jenkins builds. This would be strictly worse, since it would add
>> redundant builds of ancestors. I am assuming that you always run a separate
>> Jenkins job for every descendant. Still, many modules have fewer
>> descendants, and they do not trigger all the way up to the root and down to
>> all descendants of the root.
>>
>>
> I was replying to yours, since differentiated jobs are what give the
> visibility. I agree that Brian's approach would make the build faster, if
> it could easily figure out everything that needs to run and be easy to
> maintain.
>
>> From a community perspective, extensions and IOs are the most likely use
>> case for newcomers. For the person who comes to add or improve FooIO, it is
>> not a good experience to hit a flake in RabbitMqIO or JdbcIO or
>> DataflowRunner or FlinkRunner.
>>
>
> If flakes had a very low failure budget then as a community this would be
> a non-issue.
>

What does this mean? I like the sound of "as a community this would be a
non-issue".

Kenn

[1] You have to keep matching patterns in the ancestor and child job
triggers, as with SQL here:
https://github.com/apache/beam/blob/daca13c48befac0795eeb5e9b90449b79a7c6ffe/.test-infra/jenkins/job_PreCommit_Java.groovy#L34
and then you have to make corresponding changes to the top-level Gradle
targets that define what each job does.
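
Roughly, from memory (parameter names may be slightly off from the actual
file), the carve-out looks like:

    // .test-infra/jenkins/job_PreCommit_Java.groovy (paraphrased)
    PrecommitJobBuilder builder = new PrecommitJobBuilder(
        scope: this,
        nameBase: 'Java',
        gradleTask: ':javaPreCommit',
        // SQL is excluded here; the SQL job must carry a matching
        // include pattern...
        excludePathPatterns: ['^sdks/java/extensions/sql/.*$'])
    builder.build()

...and the :javaPreCommit aggregate target in the top-level build.gradle has
to exclude the SQL modules as well, which is the consistency burden I mean.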



>
>
>> I think the plugin Brian mentioned is only a start. It would be even
>> better for each module to have an opt-in list of descendants to test on
>> precommit. This works well with a rollback-first strategy on post-commit.
>> We can then replay the PR while triggering the postcommits that failed.
>>
>> > I agree that we would have better visibility in github and also in
>>> Jenkins, though.
>>>
>>> I do have to say that having to scroll through a huge number of github
>>> checks is not always an improvement.
>>>
>>
>> +1, but OTOH the Gradle scan is sometimes too fine-grained or associates
>> logs oddly (I almost always skip the Jenkins status page)
>>
>>
>>> > Fixing flaky tests would help improve our test signal as well. Not
>>> many willing people here, though, but it could be less work than building
>>> and maintaining so many different jobs.
>>>
>>> +1
>>>
>>
>> I agree with fixing flakes, but I want to treat the occurrence and
>> resolution of flakiness as standard operations. Just as bug counts increase
>> continuously as a project grows, so will overall flakiness. Separating
>> flakiness signals will help to prioritize which flakes to address.
>>
>>
> Kenn
>>
>>
>>> > On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles <k...@apache.org>
>>> wrote:
>>> >>
>>> >> That's a good start. It is new enough, and has few enough commits,
>>> that I'd want to do some thorough experimentation. Our build is complex
>>> enough, with a lot of ad hoc coding, that we might end up maintaining
>>> whatever we choose...
>>> >>
>>> >> In my ideal scenario the list of "what else to test" would be
>>> manually editable, or even strictly opt-in. Automatically testing
>>> everything that might be affected quickly runs into scaling problems too.
>>> It could make sense in post-commit but less so in pre-commit.
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette <bhule...@google.com>
>>> wrote:
>>> >>>
>>> >>> > We could have one "test the things" Jenkins job if the underlying
>>> tool (Gradle) could resolve what needs to be run.
>>> >>>
>>> >>> I think this would be much better. Otherwise it seems our Jenkins
>>> definitions are just duplicating information that's already stored in the
>>> build.gradle files, which seems error-prone, especially for tests that
>>> validate combinations of artifacts. I did some quick searching and came
>>> across [1]. It doesn't look like the project has had a lot of recent
>>> activity, but it claims to do what we need:
>>> >>>
>>> >>> > The plugin will generate new tasks on the root project for each
>>> task provided on the configuration with the following pattern
>>> ${taskName}ChangedModules.
>>> >>> > These generated tasks will run the changedModules task to get the
>>> list of changed modules and for each one will call the given task.
>>> >>>
>>> >>> Of course this would only really help us with Java tests, as Gradle
>>> doesn't know much about the structure of dependencies within the Python
>>> (and Go?) SDKs.
>>> >>>
>>> >>> Brian
>>> >>>
>>> >>> [1] https://github.com/ismaeldivita/change-tracker-plugin
>>> >>>
>>> >>> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles <k...@apache.org>
>>> wrote:
>>> >>>>
>>> >>>> Hi all,
>>> >>>>
>>> >>>> I wanted to start a discussion about getting finer-grained test
>>> execution, more focused on particular artifacts/modules. In particular, I
>>> want to gather the downsides and impossibilities, so I will make a
>>> proposal that people can disagree with easily.
>>> >>>>
>>> >>>> Context: job_PreCommit_Java is a monolithic job that...
>>> >>>>
>>> >>>>  - takes 40-50 minutes
>>> >>>>  - runs tests of maybe a bit under 100 modules
>>> >>>>  - executes over 10k tests
>>> >>>>  - runs on any change to model/, sdks/java/, runners/,
>>> examples/java/, examples/kotlin/, release/ (only exception is SQL)
>>> >>>>  - is pretty flaky (because it conflates so many independent test
>>> flakes, mostly runners and IOs)
>>> >>>>
>>> >>>> See a scan at
>>> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>>> >>>>
>>> >>>> Proposal: Eliminate monolithic job and break into finer-grained
>>> jobs that operate on two principles:
>>> >>>>
>>> >>>> 1. Test run should be focused on validating one artifact or a
>>> specific integration of other artifacts.
>>> >>>> 2. Test run should trigger only on things that could affect the
>>> validity of that artifact.
>>> >>>>
>>> >>>> For example, a starting point is to separate:
>>> >>>>
>>> >>>>  - core SDK
>>> >>>>  - runner helper libs
>>> >>>>  - each runner
>>> >>>>  - each extension
>>> >>>>  - each IO
>>> >>>>
>>> >>>> Benefits:
>>> >>>>
>>> >>>>  - changing an IO or runner would not trigger the 20 minutes of
>>> core SDK tests
>>> >>>>  - changing a runner would not trigger the long IO local
>>> integration tests
>>> >>>>  - changing the core SDK could potentially not run as many tests in
>>> presubmit, but even if it did, the results would be reported separately,
>>> with a clear flakiness signal
>>> >>>>
>>> >>>> There are 72 build.gradle files under sdks/java/ and 30 under
>>> runners/. They don't all require a separate job, but there are still
>>> enough that automation is worthwhile. Does anyone know what options we
>>> might have? It does not even have to be in Jenkins. We could have one
>>> "test the things" Jenkins job if the underlying tool (Gradle) could
>>> resolve what needs to be run. Caching is not sufficient in my experience.
>>> >>>>
>>> >>>> (there are other quick-fix alternatives for shrinking this time, but
>>> I want to focus on the bigger picture)
>>> >>>>
>>> >>>> Kenn
>>>
>>
