Re: Finer-grained test runs?

2020-07-13 Thread Kenneth Knowles
Some context links for the benefit of the thread & archive:

Beam issue mentioning a Jenkins plugin that caches on the Jenkins master:
https://issues.apache.org/jira/browse/BEAM-4400
Beam's request to infra: https://issues.apache.org/jira/browse/INFRA-16630
Denied and reasoning on prior request:
https://issues.apache.org/jira/browse/INFRA-16060

The request was denied because the Jenkins master / S3 are not good choices
for where to cache. Hosting an actual Gradle build cache server, as in the
thread linked above, would be different. Prototyping on the Beam ticket was
not successful.

(Ongoing) thread on builds@ asking about existing service:
https://lists.apache.org/thread.html/ae40734e34dcf1d3bd8c65dfea3094709d9d8eb97bfb9ab92149e97c%40%3Cbuilds.apache.org%3E

Kenn

On Mon, Jul 13, 2020 at 10:26 AM Kenneth Knowles  wrote:

> Having thought this over a bit, I think there are a few goals and they are
> interfering with each other.
>
> 1. Clear signal for module / test suite health. This is a post-commit
> concern. Post-commit jobs already all run as cronjobs with no
> dependency-driven stuff.
> 2. Making precommit test signal stay non-flaky as modules, tests, and
> flakiness increase.
> 3. Making precommit stay fast as modules, tests, and flakiness increase.
>
> Noting the interdependence of pre-commit and post-commit:
>
>  - you can phrase-trigger post-commit jobs (by commenting on the PR)
>  - pre-commit jobs are run as post-commits also
>
> Summarizing a bit:
>
> 1. Clear per-module/suite greenness and flakiness signal
>  - it would be nice if we could do this at the Gradle task level, but right
> now it is at the Jenkins job level
>  - on the other hand, most Gradle tasks do not represent a module, so that
> could be too fine-grained and Jenkins jobs are better
>  - if we have a ton of Jenkins jobs, we need some new automation or
> amortized management
>  - don't want to overwhelm the Jenkins executors, especially not causing
> precommit queueing
>
> 2. Making precommit stay non-flaky, robustly
>  - we can fix flakes, but we can't count on that long term; we could build
> something that forces us to solve flakes at P0
>  - we can add a retry budget to tests where deflaking cannot be prioritized
>  - a lot of anxiety that testing less in pre-commit will cause painful
> post-commit debugging
>  - a lot of overlap with making it faster, since the flakes are often
> caused by irrelevant tests
>
> 3. Making precommit stay fast, robustly
>  - we could improve per-worker incremental build
>  - we could use a distributed build cache
>  - we have tasks that don't declare their inputs/outputs correctly, and
> those will have problems
>
> I care most about #1 and then also #2. The only reason I care about #3 is
> because of #2: Once a pre-commit is more than a couple minutes, I always go
> and do something else and come back in an hour or two. So if it flakes just
> a few times, it costs a day. Fix #2 and I don't think #3 is urgent yet.
>
> A distributed build cache seems fairly low effort to set up, makes #2 and
> #3 better, and may unlock approaches to #1, provided we can fix our Gradle
> configs. We can ask ASF infra whether they have something already or can
> set it up.
>
> That will still leave open how to get better and more visible greenness
> and flakiness signal at a more meaningful granularity.
>
> Kenn
>
> On Fri, Jul 10, 2020 at 6:38 AM Kenneth Knowles  wrote:
>
>> On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw 
>> wrote:
>>
>>> I wonder how hard it would be to track greenness and flakiness at the
>>> level of gradle project (or even lower), viewed hierarchically.
>>>
>>
>> Looks like this is part of the Gradle Enterprise Tests Dashboard
>> offering: https://gradle.com/blog/flaky-tests/
>>
>> Kenn
>>
>> > Recall my (non-binding) starting point guessing at what tests should or
>>> should not run in some scenarios: (this tangent is just about the third
>>> one, where I explicitly said maybe we run all the same tests and then we
>>> want to focus on separating signals as Luke pointed out)
>>> >
>>> > > - changing an IO or runner would not trigger the 20 minutes of core
>>> SDK tests
>>> > > - changing a runner would not trigger the long IO local integration
>>> tests
>>> > > - changing the core SDK could potentially not run as many tests in
>>> presubmit, but maybe it would and they would be separately reported results
>>> with clear flakiness signal
>>> >
>>> > And let's consider even more concrete examples:
>>> >
>>> >  - when changing a Fn API proto, how important is it to run
>>> RabbitMqIOTest?
>>> >  - when changing JdbcIO, how important is it to run the Java SDK
>>> needsRunnerTests? RabbitMqIOTest?
>>> >  - when changing the FlinkRunner, how important is it to make sure
>>> that Nexmark queries still match their models when run on direct runner?
>>> >
>>> > I chose these examples to all have zero value, of course. And I've
>>> deliberately included an example of a core change and a leaf test. Not all
>>> (core change, leaf test) pairs are equally important. The vast 

Re: Finer-grained test runs?

2020-07-13 Thread Kenneth Knowles
Having thought this over a bit, I think there are a few goals and they are
interfering with each other.

1. Clear signal for module / test suite health. This is a post-commit
concern. Post-commit jobs already all run as cronjobs with no
dependency-driven stuff.
2. Making precommit test signal stay non-flaky as modules, tests, and
flakiness increase.
3. Making precommit stay fast as modules, tests, and flakiness increase.

Noting the interdependence of pre-commit and post-commit:

 - you can phrase-trigger post-commit jobs (by commenting on the PR)
 - pre-commit jobs are run as post-commits also

Summarizing a bit:

1. Clear per-module/suite greenness and flakiness signal
 - it would be nice if we could do this at the Gradle task level, but right
now it is at the Jenkins job level
 - on the other hand, most Gradle tasks do not represent a module, so that
could be too fine-grained and Jenkins jobs are better
 - if we have a ton of Jenkins jobs, we need some new automation or
amortized management
 - don't want to overwhelm the Jenkins executors, especially not causing
precommit queueing

2. Making precommit stay non-flaky, robustly
 - we can fix flakes, but we can't count on that long term; we could build
something that forces us to solve flakes at P0
 - we can add a retry budget to tests where deflaking cannot be prioritized
(see the sketch after this list)
 - a lot of anxiety that testing less in pre-commit will cause painful
post-commit debugging
 - a lot of overlap with making it faster, since the flakes are often
caused by irrelevant tests
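
A retry budget could look roughly like the following, using the Gradle
test-retry plugin (a sketch only; the plugin version and the numbers are
placeholders, not a proposal):

    // build.gradle sketch: retry individual failing tests a bounded number
    // of times, but stop retrying when a suite is genuinely broken.
    plugins {
        id 'org.gradle.test-retry' version '1.1.6'  // placeholder version
    }

    tasks.withType(Test).configureEach {
        retry {
            maxRetries = 2                  // retry budget per failing test
            maxFailures = 5                 // too many failures = real breakage
            failOnPassedAfterRetry = false  // flip to true once deflaked
        }
    }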

3. Making precommit stay fast, robustly
 - we could improve per-worker incremental build
 - we could use a distributed build cache
 - we have tasks that don't declare their inputs/outputs correctly, and
those will have problems (sketch below)
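
For the input/output point, "declaring it correctly" means roughly this for
an ad hoc task (illustrative names only, not one of our real tasks):

    // A task is only up-to-date-checkable and cacheable if Gradle knows
    // exactly what it reads and writes, so declare both explicitly.
    task generateDocs {
        inputs.files(fileTree('src/main/java'))   // everything the task reads
        outputs.dir("$buildDir/generated-docs")   // everything it writes
        outputs.cacheIf { true }                  // opt in to the build cache
        doLast {
            // ... actual work ...
        }
    }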

I care most about #1 and then also #2. The only reason I care about #3 is
because of #2: Once a pre-commit is more than a couple minutes, I always go
and do something else and come back in an hour or two. So if it flakes just
a few times, it costs a day. Fix #2 and I don't think #3 is urgent yet.

A distributed build cache seems fairly low effort to set up, makes #2 and #3
better, and may unlock approaches to #1, provided we can fix our Gradle
configs. We can ask ASF infra whether they have something already or can set
it up.
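
For reference, the wiring is roughly this in settings.gradle (a sketch; the
endpoint is a placeholder for whatever infra or a hosted service would give
us):

    // settings.gradle sketch: keep the local cache and add a shared remote
    // cache that CI populates and both CI and local builds can read.
    buildCache {
        local {
            enabled = true
        }
        remote(HttpBuildCache) {
            url = 'https://build-cache.example.org/cache/'  // placeholder
            push = System.getenv('JENKINS_HOME') != null    // only CI writes
            credentials {
                username = System.getenv('CACHE_USER')
                password = System.getenv('CACHE_PASSWORD')
            }
        }
    }

With that, local builds and new Jenkins workers would get cache hits for
anything CI has already built, assuming the input/output declarations above
are right.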

That will still leave open how to get better and more visible greenness and
flakiness signal at a more meaningful granularity.

Kenn

On Fri, Jul 10, 2020 at 6:38 AM Kenneth Knowles  wrote:

> On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw 
> wrote:
>
>> I wonder how hard it would be to track greenness and flakiness at the
>> level of gradle project (or even lower), viewed hierarchically.
>>
>
> Looks like this is part of the Gradle Enterprise Tests Dashboard offering:
> https://gradle.com/blog/flaky-tests/
>
> Kenn
>
> > Recall my (non-binding) starting point guessing at what tests should or
>> should not run in some scenarios: (this tangent is just about the third
>> one, where I explicitly said maybe we run all the same tests and then we
>> want to focus on separating signals as Luke pointed out)
>> >
>> > > - changing an IO or runner would not trigger the 20 minutes of core
>> SDK tests
>> > > - changing a runner would not trigger the long IO local integration
>> tests
>> > > - changing the core SDK could potentially not run as many tests in
>> presubmit, but maybe it would and they would be separately reported results
>> with clear flakiness signal
>> >
>> > And let's consider even more concrete examples:
>> >
>> >  - when changing a Fn API proto, how important is it to run
>> RabbitMqIOTest?
>> >  - when changing JdbcIO, how important is it to run the Java SDK
>> needsRunnerTests? RabbitMqIOTest?
>> >  - when changing the FlinkRunner, how important is it to make sure that
>> Nexmark queries still match their models when run on direct runner?
>> >
>> > I chose these examples to all have zero value, of course. And I've
>> deliberately included an example of a core change and a leaf test. Not all
>> (core change, leaf test) pairs are equally important. The vast majority of
>> all tests we run are literally unable to be affected by the changes
>> triggering the test. So that's why enabling Gradle cache or using a plugin
>> like Brian found could help part of the issue, but not the whole issue,
>> again as Luke reminded.
>>
>> For (2) and (3), I would hope that the build dependency graph could
>> exclude them. You're right about (1) (and I've hit that countless
>> times), but would rather err on the side of accidentally running too
>> many tests than not enough. If we make manual edits to what can be
>> inferred by the build graph, let's make it a blacklist rather than an
>> allow list to avoid accidental lost coverage.
>>
>> > We make these tradeoffs all the time, of course, via putting some tests
>> in *IT and postCommit runs and some in *Test, implicitly preCommit. But I
>> am imagining a future where we can decouple the test suite definitions
>> (very stable, not depending on the project context) from the 

Re: Finer-grained test runs?

2020-07-10 Thread Kenneth Knowles
On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw  wrote:

> I wonder how hard it would be to track greenness and flakiness at the
> level of gradle project (or even lower), viewed hierarchically.
>

Looks like this is part of the Gradle Enterprise Tests Dashboard offering:
https://gradle.com/blog/flaky-tests/
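
If we had an instance to point at, opting a build into it would be roughly
this in settings.gradle (sketch only; the plugin version and server URL are
placeholders):

    // settings.gradle sketch: publish build scans to a Gradle Enterprise
    // server so its Tests dashboard can aggregate pass/fail/flaky history
    // per test class rather than per Jenkins job.
    plugins {
        id 'com.gradle.enterprise' version '3.3.4'  // placeholder version
    }

    gradleEnterprise {
        server = 'https://ge.example.org'           // placeholder instance
        buildScan {
            publishAlways()
            tag System.getenv('JOB_NAME') ?: 'local'
        }
    }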

Kenn

> Recall my (non-binding) starting point guessing at what tests should or
> should not run in some scenarios: (this tangent is just about the third
> one, where I explicitly said maybe we run all the same tests and then we
> want to focus on separating signals as Luke pointed out)
> >
> > > - changing an IO or runner would not trigger the 20 minutes of core
> SDK tests
> > > - changing a runner would not trigger the long IO local integration
> tests
> > > - changing the core SDK could potentially not run as many tests in
> presubmit, but maybe it would and they would be separately reported results
> with clear flakiness signal
> >
> > And let's consider even more concrete examples:
> >
> >  - when changing a Fn API proto, how important is it to run
> RabbitMqIOTest?
> >  - when changing JdbcIO, how important is it to run the Java SDK
> needsRunnerTests? RabbitMqIOTest?
> >  - when changing the FlinkRunner, how important is it to make sure that
> Nexmark queries still match their models when run on direct runner?
> >
> > I chose these examples to all have zero value, of course. And I've
> deliberately included an example of a core change and a leaf test. Not all
> (core change, leaf test) pairs are equally important. The vast majority of
> all tests we run are literally unable to be affected by the changes
> triggering the test. So that's why enabling Gradle cache or using a plugin
> like Brian found could help part of the issue, but not the whole issue,
> again as Luke reminded.
>
> For (2) and (3), I would hope that the build dependency graph could
> exclude them. You're right about (1) (and I've hit that countless
> times), but would rather err on the side of accidentally running too
> many tests than not enough. If we make manual edits to what can be
> inferred by the build graph, let's make it a blacklist rather than an
> allow list to avoid accidental lost coverage.
>
> > We make these tradeoffs all the time, of course, via putting some tests
> in *IT and postCommit runs and some in *Test, implicitly preCommit. But I
> am imagining a future where we can decouple the test suite definitions
> (very stable, not depending on the project context) from the decision of
> where and when to run them (less stable, changing as the project changes).
> >
> > My assumption is that the project will only grow and all these problems
> (flakiness, runtime, false coupling) will continue to get worse. I raised
> this now so we could consider what is a steady state approach that could
> scale, before it becomes an emergency. I take it as a given that it is
> harder to change culture than it is to change infra/code, so I am not
> considering any possibility of more attention to flaky tests or more
> attention to testing the core properly or more attention to making tests
> snappy or more careful consideration of *IT and *Test. (unless we build
> infra that forces more attention to these things)
> >
> > Incidentally, SQL is not actually fully factored out. If you edit SQL it
> runs a limited subset defined by :sqlPreCommit. If you edit core, then
> :javaPreCommit still includes SQL tests.
>
> I think running SQL tests when you edit core is not actually that bad.
> Possibly better than not running any of them. (Maybe, as cost becomes
> more of a concern, adding the notion of "smoke tests" that are a cheap
> subset run when upstream projects change would be a good compromise.)
>


Re: Finer-grained test runs?

2020-07-09 Thread Robert Bradshaw
It does sound like we're generally on the same page. Minor comments below.

On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles  wrote:
>
> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw  wrote:
>>
>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:
>> >
>> >> If Brian's: it does not result in redundant build (if plugin works) since 
>> >> it would be one Gradle build process. But it does do a full build if you 
>> >> touch something at the root of the ancestry tree like core SDK or model. 
>> >> I would like to avoid automatically testing descendants if we can, since 
>> >> things like Nexmark and most IOs are not sensitive to the vast majority 
>> >> of model or core SDK changes. Runners are borderline.
>> >
>> > I believe that the cost of fixing an issue that is found later once the 
>> > test starts failing because the test wasn't run as part of the PR has a 
>> > much higher order of magnitude of cost to triage and fix. Mostly due to 
>> > loss of context from the PR author/reviewer and if the culprit PR can't be 
>> > found then whoever is trying to fix it.
>>
>> Huge +1 to this.
>
>
> Totally agree. This abstract statement is clearly true. I suggest considering 
> things more concretely.
>
>> Ideally we could count on the build system (and good caching) to only
>> test what actually needs to be tested, and with work being done on
>> runners and IOs this would be a small subset of our entire suite. When
>> working lower in the stack (and I am prone to do) I think it's
>> acceptable to have longer wait times--and would *much* rather pay that
>> price than discover things later. Perhaps some things could be
>> surgically removed (it would be interesting to mine data on how often
>> test failures in the "leaves" catch real issues), but I would do that
>> with care. That being said, flakiness is really an issue (and it
>> seems these days I have to re-run tests, often multiple times, to get
>> a PR to green; splitting up jobs could help that as well).
>
> Agree with your sentiment that a longer wait for core changes is generally 
> fine; my phrasing above overemphasized this case. Anecdotally, without mining 
> data, leaf modules do catch bugs in core changes sometimes when (by 
> definition) they are not adequately tested. This is a good measure for how 
> much we have to improve our engineering practices.
>
> But anyhow this is one very special case. Coming back to the overall issue, 
> what we actually do today is run all leaf/middle/root builds whenever 
> anything in any leaf/middle/root layer is changed. And we track greenness and 
> flakiness at this same level of granularity.

I wonder how hard it would be to track greenness and flakiness at the
level of gradle project (or even lower), viewed hierarchically.

> Recall my (non-binding) starting point guessing at what tests should or 
> should not run in some scenarios: (this tangent is just about the third one, 
> where I explicitly said maybe we run all the same tests and then we want to 
> focus on separating signals as Luke pointed out)
>
> > - changing an IO or runner would not trigger the 20 minutes of core SDK 
> > tests
> > - changing a runner would not trigger the long IO local integration tests
> > - changing the core SDK could potentially not run as many tests in 
> > presubmit, but maybe it would and they would be separately reported results 
> > with clear flakiness signal
>
> And let's consider even more concrete examples:
>
>  - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>  - when changing JdbcIO, how important is it to run the Java SDK 
> needsRunnerTests? RabbitMqIOTest?
>  - when changing the FlinkRunner, how important is it to make sure that 
> Nexmark queries still match their models when run on direct runner?
>
> I chose these examples to all have zero value, of course. And I've 
> deliberately included an example of a core change and a leaf test. Not all 
> (core change, leaf test) pairs are equally important. The vast majority of 
> all tests we run are literally unable to be affected by the changes 
> triggering the test. So that's why enabling Gradle cache or using a plugin 
> like Brian found could help part of the issue, but not the whole issue, again 
> as Luke reminded.

For (2) and (3), I would hope that the build dependency graph could
exclude them. You're right about (1) (and I've hit that countless
times), but would rather err on the side of accidentally running too
many tests than not enough. If we make manual edits to what can be
inferred by the build graph, let's make it a blacklist rather than an
allow list to avoid accidental lost coverage.

> We make these tradeoffs all the time, of course, via putting some tests in 
> *IT and postCommit runs and some in *Test, implicitly preCommit. But I am 
> imagining a future where we can decouple the test suite definitions (very 
> stable, not depending on the project context) from the decision of where and 
> when to run them (less stable, 

Re: Finer-grained test runs?

2020-07-09 Thread Luke Cwik
No, not without doing the research myself to see what current tooling is
available.

On Thu, Jul 9, 2020 at 1:17 PM Kenneth Knowles  wrote:

>
>
> On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik  wrote:
>
>> The budget would represent some criteria that we need from tests (e.g.
>> percent passed, max num skipped tests, test execution time, ...). If we
>> fail the criteria then there must be actionable work (such as fix tests)
>> followed with something that prevents the status quo from continuing (such
>> as preventing releases/features being merged) until the criteria is
>> satisfied again.
>>
>
> +1 . This is aligned with "CI as monitoring/alerting of the health of the
> machine that is your evolving codebase", which I very much subscribe to.
> Alert when something is wrong (another missing piece: have a quick way to
> ack and suppress false alarms in those cases you really want a sensitive
> alert).
>
> Do you know good implementation choices in Gradle/JUnit/Jenkins? (asking
> before searching for it myself)
>
> Kenn
>
>
>> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw 
>>> wrote:
>>>
 On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:
 >
 >> If Brian's: it does not result in redundant build (if plugin works)
 since it would be one Gradle build process. But it does do a full build if
 you touch something at the root of the ancestry tree like core SDK or
 model. I would like to avoid automatically testing descendants if we can,
 since things like Nexmark and most IOs are not sensitive to the vast
 majority of model or core SDK changes. Runners are borderline.
 >
 > I believe that the cost of fixing an issue that is found later once
 the test starts failing because the test wasn't run as part of the PR has a
 much higher order of magnitude of cost to triage and fix. Mostly due to
 loss of context from the PR author/reviewer and if the culprit PR can't be
 found then whoever is trying to fix it.

 Huge +1 to this.

>>>
>>> Totally agree. This abstract statement is clearly true. I suggest
>>> considering things more concretely.
>>>
>>> Ideally we could count on the build system (and good caching) to only
 test what actually needs to be tested, and with work being done on
 runners and IOs this would be a small subset of our entire suite. When
 working lower in the stack (and I am prone to do) I think it's
 acceptable to have longer wait times--and would *much* rather pay that
 price than discover things later. Perhaps some things could be
 surgically removed (it would be interesting to mine data on how often
 test failures in the "leaves" catch real issues), but I would do that
 with care. That being said, flakiness is really an issue (and it
 seems these days I have to re-run tests, often multiple times, to get
 a PR to green; splitting up jobs could help that as well).

>>>
>>> Agree with your sentiment that a longer wait for core changes is
>>> generally fine; my phrasing above overemphasized this case. Anecdotally,
>>> without mining data, leaf modules do catch bugs in core changes sometimes
>>> when (by definition) they are not adequately tested. This is a good measure
>>> for how much we have to improve our engineering practices.
>>>
>>> But anyhow this is one very special case. Coming back to the overall
>>> issue, what we actually do today is run all leaf/middle/root builds
>>> whenever anything in any leaf/middle/root layer is changed. And we track
>>> greenness and flakiness at this same level of granularity.
>>>
>>> Recall my (non-binding) starting point guessing at what tests should or
>>> should not run in some scenarios: (this tangent is just about the third
>>> one, where I explicitly said maybe we run all the same tests and then we
>>> want to focus on separating signals as Luke pointed out)
>>>
>>> > - changing an IO or runner would not trigger the 20 minutes of core
>>> SDK tests
>>> > - changing a runner would not trigger the long IO local integration
>>> tests
>>> > - changing the core SDK could potentially not run as many tests in
>>> presubmit, but maybe it would and they would be separately reported results
>>> with clear flakiness signal
>>>
>>> And let's consider even more concrete examples:
>>>
>>>  - when changing a Fn API proto, how important is it to run
>>> RabbitMqIOTest?
>>>  - when changing JdbcIO, how important is it to run the Java SDK
>>> needsRunnerTests? RabbitMqIOTest?
>>>  - when changing the FlinkRunner, how important is it to make sure that
>>> Nexmark queries still match their models when run on direct runner?
>>>
>>> I chose these examples to all have zero value, of course. And I've
>>> deliberately included an example of a core change and a leaf test. Not all
>>> (core change, leaf test) pairs are equally important. The vast majority of
>>> all tests we run are literally unable to be 

Re: Finer-grained test runs?

2020-07-09 Thread Kenneth Knowles
On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik  wrote:

> The budget would represent some criteria that we need from tests (e.g.
> percent passed, max num skipped tests, test execution time, ...). If we
> fail the criteria then there must be actionable work (such as fix tests)
> followed with something that prevents the status quo from continuing (such
> as preventing releases/features being merged) until the criteria is
> satisfied again.
>

+1 . This is aligned with "CI as monitoring/alerting of the health of the
machine that is your evolving codebase", which I very much subscribe to.
Alert when something is wrong (another missing piece: have a quick way to
ack and suppress false alarms in those cases you really want a sensitive
alert).
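
For the "ack" part, the lowest-tech option I can think of is a temporary,
tracked exclusion at the build level while a JIRA owns the real fix
(illustrative sketch, not an existing convention of ours):

    // build.gradle sketch: acknowledge a known flake by excluding it from
    // the signal, with a pointer to the tracking issue so it is not lost.
    test {
        // TODO(BEAM-XXXX): re-enable once the flake is fixed.
        exclude '**/SomeFlakyIOTest*'
    }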

Do you know good implementation choices in Gradle/JUnit/Jenkins? (asking
before searching for it myself)

Kenn


> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw 
>> wrote:
>>
>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:
>>> >
>>> >> If Brian's: it does not result in redundant build (if plugin works)
>>> since it would be one Gradle build process. But it does do a full build if
>>> you touch something at the root of the ancestry tree like core SDK or
>>> model. I would like to avoid automatically testing descendants if we can,
>>> since things like Nexmark and most IOs are not sensitive to the vast
>>> majority of model or core SDK changes. Runners are borderline.
>>> >
>>> > I believe that the cost of fixing an issue that is found later once
>>> the test starts failing because the test wasn't run as part of the PR has a
>>> much higher order of magnitude of cost to triage and fix. Mostly due to
>>> loss of context from the PR author/reviewer and if the culprit PR can't be
>>> found then whoever is trying to fix it.
>>>
>>> Huge +1 to this.
>>>
>>
>> Totally agree. This abstract statement is clearly true. I suggest
>> considering things more concretely.
>>
>>> Ideally we could count on the build system (and good caching) to only
>>> test what actually needs to be tested, and with work being done on
>>> runners and IOs this would be a small subset of our entire suite. When
>>> working lower in the stack (and I am prone to do) I think it's
>>> acceptable to have longer wait times--and would *much* rather pay that
>>> price than discover things later. Perhaps some things could be
>>> surgically removed (it would be interesting to mine data on how often
>>> test failures in the "leaves" catch real issues), but I would do that
>>> with care. That being said, flakiness is really an issue (and it
>>> seems these days I have to re-run tests, often multiple times, to get
>>> a PR to green; splitting up jobs could help that as well).
>>>
>>
>> Agree with your sentiment that a longer wait for core changes is
>> generally fine; my phrasing above overemphasized this case. Anecdotally,
>> without mining data, leaf modules do catch bugs in core changes sometimes
>> when (by definition) they are not adequately tested. This is a good measure
>> for how much we have to improve our engineering practices.
>>
>> But anyhow this is one very special case. Coming back to the overall
>> issue, what we actually do today is run all leaf/middle/root builds
>> whenever anything in any leaf/middle/root layer is changed. And we track
>> greenness and flakiness at this same level of granularity.
>>
>> Recall my (non-binding) starting point guessing at what tests should or
>> should not run in some scenarios: (this tangent is just about the third
>> one, where I explicitly said maybe we run all the same tests and then we
>> want to focus on separating signals as Luke pointed out)
>>
>> > - changing an IO or runner would not trigger the 20 minutes of core SDK
>> tests
>> > - changing a runner would not trigger the long IO local integration
>> tests
>> > - changing the core SDK could potentially not run as many tests in
>> presubmit, but maybe it would and they would be separately reported results
>> with clear flakiness signal
>>
>> And let's consider even more concrete examples:
>>
>>  - when changing a Fn API proto, how important is it to run
>> RabbitMqIOTest?
>>  - when changing JdbcIO, how important is it to run the Java SDK
>> needsRunnerTests? RabbitMqIOTest?
>>  - when changing the FlinkRunner, how important is it to make sure that
>> Nexmark queries still match their models when run on direct runner?
>>
>> I chose these examples to all have zero value, of course. And I've
>> deliberately included an example of a core change and a leaf test. Not all
>> (core change, leaf test) pairs are equally important. The vast majority of
>> all tests we run are literally unable to be affected by the changes
>> triggering the test. So that's why enabling Gradle cache or using a plugin
>> like Brian found could help part of the issue, but not the whole issue,
>> again as Luke reminded.
>>
>> We make these tradeoffs all the time, of course, 

Re: Finer-grained test runs?

2020-07-09 Thread Luke Cwik
The budget would represent some criteria that we need from tests (e.g.
percent passed, max num skipped tests, test execution time, ...). If we
fail the criteria then there must be actionable work (such as fix tests)
followed with something that prevents the status quo from continuing (such
as preventing releases/features being merged) until the criteria is
satisfied again.
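
As a rough illustration of what enforcing such a budget could look like (a
hypothetical sketch, not an existing task; the thresholds and the results
location are made up):

    // Hypothetical build.gradle sketch: aggregate JUnit XML results and
    // fail the build when the suite blows the agreed budget, so the
    // actionable work cannot be deferred indefinitely.
    task enforceTestBudget {
        dependsOn 'test'
        doLast {
            def reports = fileTree("$buildDir/test-results/test") {
                include '**/*.xml'
            }
            int total = 0, failures = 0, skipped = 0
            reports.each { f ->
                def suite = new XmlSlurper().parse(f)
                total    += (suite.@tests.toString() ?: '0').toInteger()
                failures += (suite.@failures.toString() ?: '0').toInteger()
                skipped  += (suite.@skipped.toString() ?: '0').toInteger()
            }
            def passRate = total == 0 ? 1.0 : (total - failures) / total
            if (passRate < 0.99 || skipped > 20) {
                throw new GradleException(
                    "Test budget exceeded: passRate=$passRate, skipped=$skipped")
            }
        }
    }

The same check could run in postcommit and gate releases, which is the
"prevents the status quo from continuing" piece.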

On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles  wrote:

>
>
> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw 
> wrote:
>
>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:
>> >
>> >> If Brian's: it does not result in redundant build (if plugin works)
>> since it would be one Gradle build process. But it does do a full build if
>> you touch something at the root of the ancestry tree like core SDK or
>> model. I would like to avoid automatically testing descendants if we can,
>> since things like Nexmark and most IOs are not sensitive to the vast
>> majority of model or core SDK changes. Runners are borderline.
>> >
>> > I believe that the cost of fixing an issue that is found later once the
>> test starts failing because the test wasn't run as part of the PR has a
>> much higher order of magnitude of cost to triage and fix. Mostly due to
>> loss of context from the PR author/reviewer and if the culprit PR can't be
>> found then whoever is trying to fix it.
>>
>> Huge +1 to this.
>>
>
> Totally agree. This abstract statement is clearly true. I suggest
> considering things more concretely.
>
>> Ideally we could count on the build system (and good caching) to only
>> test what actually needs to be tested, and with work being done on
>> runners and IOs this would be a small subset of our entire suite. When
>> working lower in the stack (and I am prone to do) I think it's
>> acceptable to have longer wait times--and would *much* rather pay that
>> price than discover things later. Perhaps some things could be
>> surgically removed (it would be interesting to mine data on how often
>> test failures in the "leaves" catch real issues), but I would do that
>> with care. That being said, flakiness is really an issue (and it
>> seems these days I have to re-run tests, often multiple times, to get
>> a PR to green; splitting up jobs could help that as well).
>>
>
> Agree with your sentiment that a longer wait for core changes is generally
> fine; my phrasing above overemphasized this case. Anecdotally, without
> mining data, leaf modules do catch bugs in core changes sometimes when (by
> definition) they are not adequately tested. This is a good measure for how
> much we have to improve our engineering practices.
>
> But anyhow this is one very special case. Coming back to the overall
> issue, what we actually do today is run all leaf/middle/root builds
> whenever anything in any leaf/middle/root layer is changed. And we track
> greenness and flakiness at this same level of granularity.
>
> Recall my (non-binding) starting point guessing at what tests should or
> should not run in some scenarios: (this tangent is just about the third
> one, where I explicitly said maybe we run all the same tests and then we
> want to focus on separating signals as Luke pointed out)
>
> > - changing an IO or runner would not trigger the 20 minutes of core SDK
> tests
> > - changing a runner would not trigger the long IO local integration tests
> > - changing the core SDK could potentially not run as many tests in
> presubmit, but maybe it would and they would be separately reported results
> with clear flakiness signal
>
> And let's consider even more concrete examples:
>
>  - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>  - when changing JdbcIO, how important is it to run the Java SDK
> needsRunnerTests? RabbitMqIOTest?
>  - when changing the FlinkRunner, how important is it to make sure that
> Nexmark queries still match their models when run on direct runner?
>
> I chose these examples to all have zero value, of course. And I've
> deliberately included an example of a core change and a leaf test. Not all
> (core change, leaf test) pairs are equally important. The vast majority of
> all tests we run are literally unable to be affected by the changes
> triggering the test. So that's why enabling Gradle cache or using a plugin
> like Brian found could help part of the issue, but not the whole issue,
> again as Luke reminded.
>
> We make these tradeoffs all the time, of course, via putting some tests in
> *IT and postCommit runs and some in *Test, implicitly preCommit. But I am
> imagining a future where we can decouple the test suite definitions (very
> stable, not depending on the project context) from the decision of where
> and when to run them (less stable, changing as the project changes).
>
> My assumption is that the project will only grow and all these problems
> (flakiness, runtime, false coupling) will continue to get worse. I raised
> this now so we could consider what is a steady state approach that could
> scale, before it 

Re: Finer-grained test runs?

2020-07-09 Thread Kenneth Knowles
On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw  wrote:

> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:
> >
> >> If Brian's: it does not result in redundant build (if plugin works)
> since it would be one Gradle build process. But it does do a full build if
> you touch something at the root of the ancestry tree like core SDK or
> model. I would like to avoid automatically testing descendants if we can,
> since things like Nexmark and most IOs are not sensitive to the vast
> majority of model or core SDK changes. Runners are borderline.
> >
> > I believe that the cost of fixing an issue that is found later once the
> test starts failing because the test wasn't run as part of the PR has a
> much higher order of magnitude of cost to triage and fix. Mostly due to
> loss of context from the PR author/reviewer and if the culprit PR can't be
> found then whoever is trying to fix it.
>
> Huge +1 to this.
>

Totally agree. This abstract statement is clearly true. I suggest
considering things more concretely.

> Ideally we could count on the build system (and good caching) to only
> test what actually needs to be tested, and with work being done on
> runners and IOs this would be a small subset of our entire suite. When
> working lower in the stack (and I am prone to do) I think it's
> acceptable to have longer wait times--and would *much* rather pay that
> price than discover things later. Perhaps some things could be
> surgically removed (it would be interesting to mine data on how often
> test failures in the "leaves" catch real issues), but I would do that
> with care. That being said, flakiness is really an issue (and it
> seems these days I have to re-run tests, often multiple times, to get
> a PR to green; splitting up jobs could help that as well).
>

Agree with your sentiment that a longer wait for core changes is generally
fine; my phrasing above overemphasized this case. Anecdotally, without
mining data, leaf modules do catch bugs in core changes sometimes when (by
definition) they are not adequately tested. This is a good measure for how
much we have to improve our engineering practices.

But anyhow this is one very special case. Coming back to the overall issue,
what we actually do today is run all leaf/middle/root builds whenever
anything in any leaf/middle/root layer is changed. And we track greenness
and flakiness at this same level of granularity.

Recall my (non-binding) starting point guessing at what tests should or
should not run in some scenarios: (this tangent is just about the third
one, where I explicitly said maybe we run all the same tests and then we
want to focus on separating signals as Luke pointed out)

> - changing an IO or runner would not trigger the 20 minutes of core SDK
tests
> - changing a runner would not trigger the long IO local integration tests
> - changing the core SDK could potentially not run as many tests in
presubmit, but maybe it would and they would be separately reported results
with clear flakiness signal

And let's consider even more concrete examples:

 - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
 - when changing JdbcIO, how important is it to run the Java SDK
needsRunnerTests? RabbitMqIOTest?
 - when changing the FlinkRunner, how important is it to make sure that
Nexmark queries still match their models when run on direct runner?

I chose these examples to all have zero value, of course. And I've
deliberately included an example of a core change and a leaf test. Not all
(core change, leaf test) pairs are equally important. The vast majority of
all tests we run are literally unable to be affected by the changes
triggering the test. So that's why enabling Gradle cache or using a plugin
like Brian found could help part of the issue, but not the whole issue,
again as Luke reminded.

We make these tradeoffs all the time, of course, via putting some tests in
*IT and postCommit runs and some in *Test, implicitly preCommit. But I am
imagining a future where we can decouple the test suite definitions (very
stable, not depending on the project context) from the decision of where
and when to run them (less stable, changing as the project changes).

My assumption is that the project will only grow and all these problems
(flakiness, runtime, false coupling) will continue to get worse. I raised
this now so we could consider what is a steady state approach that could
scale, before it becomes an emergency. I take it as a given that it is
harder to change culture than it is to change infra/code, so I am not
considering any possibility of more attention to flaky tests or more
attention to testing the core properly or more attention to making tests
snappy or more careful consideration of *IT and *Test. (unless we build
infra that forces more attention to these things)

Incidentally, SQL is not actually fully factored out. If you edit SQL it
runs a limited subset defined by :sqlPreCommit. If you edit core, then
:javaPreCommit still includes SQL 

Re: Finer-grained test runs?

2020-07-09 Thread Robert Bradshaw
On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:
>
>> If Brian's: it does not result in redundant build (if plugin works) since it 
>> would be one Gradle build process. But it does do a full build if you touch 
>> something at the root of the ancestry tree like core SDK or model. I would 
>> like to avoid automatically testing descendants if we can, since things like 
>> Nexmark and most IOs are not sensitive to the vast majority of model or core 
>> SDK changes. Runners are borderline.
>
> I believe that the cost of fixing an issue that is found later once the test 
> starts failing because the test wasn't run as part of the PR has a much 
> higher order of magnitude of cost to triage and fix. Mostly due to loss of 
> context from the PR author/reviewer and if the culprit PR can't be found then 
> whoever is trying to fix it.

Huge +1 to this.

Ideally we could count on the build system (and good caching) to only
test what actually needs to be tested, and with work being done on
runners and IOs this would be a small subset of our entire suite. When
working lower in the stack (and I am prone to do) I think it's
acceptable to have longer wait times--and would *much* rather pay that
price than discover things later. Perhaps some things could be
surgically removed (it would be interesting to mine data on how often
test failures in the "leaves" catch real issues), but I would do that
with care. That being said, flakiness is really an issue (and it
seems these days I have to re-run tests, often multiple times, to get
a PR to green; splitting up jobs could help that as well).


Re: Finer-grained test runs?

2020-07-09 Thread Kenneth Knowles
On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik  wrote:

> On Wed, Jul 8, 2020 at 9:22 PM Kenneth Knowles  wrote:
>
>> I like your use of "ancestor" and "descendant". I will adopt it.
>>
>> On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik  wrote:
>>> >
>>> > I'm not sure that breaking it up will be significantly faster since
>>> each module needs to build its ancestors and run tests of itself and all of
>>> its descendants which isn't a trivial amount of work. We have only so many
>>> executors and with the increased number of jobs, won't we just be waiting
>>> for queued jobs to start?
>>
>>
>>
>>> I think that depends on how many fewer tests we could run (or rerun)
>>> for the average PR. (It would also be nice if we could share build
>>> artifacts across executors (is there something like ccache for
>>> javac?), but maybe that's too far-fetched?)
>>>
>>
>> Robert: The gradle cache should remain valid across runs, I think... my
>> latest understanding was that it was a robust up-to-date check (aka not
>> `make`). We may have messed this up, as I am not seeing as much caching as
>> I would expect nor as much as I see locally. We had to do some tweaking in
>> the maven days to put the .m2 directory outside of the realm wiped for each
>> new build. Maybe we are clobbering the Gradle cache too. That might
>> actually make most builds so fast we do not care about my proposal.
>>
>
> The gradle cache relies on our inputs/outputs to be specified correctly.
> It's great that this has been fixed since I was under the impression that
> it was disabled and/or that we used --rerun-tasks everywhere.
>

Sorry, when I said *should* I mean that if it is not currently being used,
we should do what it takes to use it. Based on the scans, I don't think
test results are being cached. But I could have read things wrong...


>> Luke: I am not sure if you are replying to my email or to Brian's.
>>
>> If Brian's: it does not result in redundant build (if plugin works) since
>> it would be one Gradle build process. But it does do a full build if you
>> touch something at the root of the ancestry tree like core SDK or model. I
>> would like to avoid automatically testing descendants if we can, since
>> things like Nexmark and most IOs are not sensitive to the vast majority of
>> model or core SDK changes. Runners are borderline.
>>
>
> I believe that the cost of fixing an issue that is found later once the
> test starts failing because the test wasn't run as part of the PR has a
> much higher order of magnitude of cost to triage and fix. Mostly due to
> loss of context from the PR author/reviewer and if the culprit PR can't be
> found then whoever is trying to fix it.
>
> If we are willing to not separate out into individual jobs then we are
> really trying to make the job faster.
>

It would also reduce flakiness, which was a key motivator for this thread.
It is a good point about separate signals, which I somehow forgot in
between emails. So an approach based on separate jobs is not strictly
worse, since it has this benefit.


> How much digging have folks done into the build scans since they show a lot
> of details that are useful around what is slow for a specific job. Take the
> Java Precommit for example:
> * The timeline of what tasks ran when:
> https://scans.gradle.com/s/u2rkcnww2fs24/timeline (looks like nexmark
> testing is 30 mins long and is the straggler)
>

I did a bit of this digging the other day. Separating Nexmark out from Java
(as we did with SQL) would be a mitigation that addresses job speed. I
planned on doing this today. Separating out each of the 10 minute IO and
runner runs would also improve speed and reduce flakiness but then this is
turning into a longer task. Doing this with include/exclude patterns in job
files is simple [1] but will get harder to keep consistent. I would guess
they are already inconsistent.

Here's a sketch of one way that this can scale: have the metadata that
defines trigger patterns and test targets live next to the modules. Then it
scales just as well as authoring modules does. You need some code to
assemble the appropriate job triggers from the declared ancestry. This
could have the benefit that the signal is for a module and not for a job.
Changing the triggers or refactoring how different things run would not
reset the meaning of the signal, as it does now.
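
Something like this, to make it concrete (the property names and values are
made up for illustration; nothing like this exists today):

    // Hypothetical block in a module's build.gradle, e.g. sdks/java/io/jdbc.
    // A generator in .test-infra would read this from every module and emit
    // one Jenkins precommit job per module with matching trigger paths, so
    // the greenness/flakiness signal belongs to the module, not the job.
    ext.precommitMetadata = [
        // changes to these paths trigger this module's precommit job
        triggerPaths: ['sdks/java/io/jdbc/**', 'sdks/java/core/**'],
        // what that job runs
        targets     : [':sdks:java:io:jdbc:build'],
        // opt-in descendants worth re-validating when this module changes
        alsoTest    : []
    ]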


> * It looks like our build cache (
> https://scans.gradle.com/s/u2rkcnww2fs24/performance/buildCache) is
> saving about 5% of total cpu time, should we consider setting up a remote
> build cache?
>
> If mine: you could assume my proposal is like Brian's but with full
>> isolated Jenkins builds. This would be strictly worse, since it would add
>> redundant builds of ancestors. I am assuming that you always run a separate
>> Jenkins job for every descendant. Still, many modules have fewer
>> descendants. And they do not trigger all the way up to the root and down to
>> all 

Re: Finer-grained test runs?

2020-07-09 Thread Luke Cwik
On Wed, Jul 8, 2020 at 9:22 PM Kenneth Knowles  wrote:

> I like your use of "ancestor" and "descendant". I will adopt it.
>
> On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw 
> wrote:
>
>> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik  wrote:
>> >
>> > I'm not sure that breaking it up will be significantly faster since
>> each module needs to build its ancestors and run tests of itself and all of
>> its descendants which isn't a trivial amount of work. We have only so many
>> executors and with the increased number of jobs, won't we just be waiting
>> for queued jobs to start?
>
>
>
> I think that depends on how many fewer tests we could run (or rerun)
>> for the average PR. (It would also be nice if we could share build
>> artifacts across executors (is there something like ccache for
>> javac?), but maybe that's too far-fetched?)
>>
>
> Robert: The gradle cache should remain valid across runs, I think... my
> latest understanding was that it was a robust up-to-date check (aka not
> `make`). We may have messed this up, as I am not seeing as much caching as
> I would expect nor as much as I see locally. We had to do some tweaking in
> the maven days to put the .m2 directory outside of the realm wiped for each
> new build. Maybe we are clobbering the Gradle cache too. That might
> actually make most builds so fast we do not care about my proposal.
>

The gradle cache relies on our inputs/outputs to be specified correctly.
It's great that this has been fixed since I was under the impression that
it was disabled and/or that we used --rerun-tasks everywhere.

> Luke: I am not sure if you are replying to my email or to Brian's.
>
> If Brian's: it does not result in redundant build (if plugin works) since
> it would be one Gradle build process. But it does do a full build if you
> touch something at the root of the ancestry tree like core SDK or model. I
> would like to avoid automatically testing descendants if we can, since
> things like Nexmark and most IOs are not sensitive to the vast majority of
> model or core SDK changes. Runners are borderline.
>

I believe that the cost of fixing an issue that is found later once the
test starts failing because the test wasn't run as part of the PR has a
much higher order of magnitude of cost to triage and fix. Mostly due to
loss of context from the PR author/reviewer and if the culprit PR can't be
found then whoever is trying to fix it.

If we are willing to not separate out into individual jobs then we are
really trying to make the job faster. How much digging have folks done into
the build scans since they show a lot of details that are useful around
what is slow for a specific job. Take the Java Precommit for example:
* The timeline of what tasks ran when:
https://scans.gradle.com/s/u2rkcnww2fs24/timeline (looks like nexmark
testing is 30 mins long and is the straggler)
* It looks like our build cache (
https://scans.gradle.com/s/u2rkcnww2fs24/performance/buildCache) is saving
about 5% of total cpu time, should we consider setting up a remote build
cache?


> If mine: you could assume my proposal is like Brian's but with full
> isolated Jenkins builds. This would be strictly worse, since it would add
> redundant builds of ancestors. I am assuming that you always run a separate
> Jenkins job for every descendant. Still, many modules have fewer
> descendants. And they do not trigger all the way up to the root and down to
> all descendants of the root.
>
>
I was replying to yours since differentiated jobs is what gives visibility.
I agree that Brian's approach would make the build faster if it could
figure out everything that needs to run easily and be easy to maintain.


> From a community perspective, extensions and IOs are the most likely use
> case for newcomers. For the person who comes to add or improve FooIO, it is
> not a good experience to hit a flake in RabbitMqIO or JdbcIO or
> DataflowRunner or FlinkRunner flakes.
>

If flakes had a very low failure budget then as a community this would be a
non-issue.


> I think the plugin Brian mentioned is only a start. It would be even
> better for each module to have an opt-in list of descendants to test on
> precommit. This works well with a rollback-first strategy on post-commit.
> We can then replay the PR while triggering the postcommits that failed.
>
> > I agree that we would have better visibility though in github and also
>> in Jenkins.
>>
>> I do have to say having to scroll through a huge number of github
>> checks is not always an improvement.
>>
>
> +1 but OTOH the gradle scan is sometimes too fine grained or associates
> logs oddly (I skip the Jenkins status page almost always)
>
>
>> > Fixing flaky tests would help improve our test signal as well. Not many
>> willing people here though but could be less work then building and
>> maintaining so many different jobs.
>>
>> +1
>>
>
> I agree with fixing flakes, but I want to treat the occurrence and
> resolution of flakiness as standard operations. Just as bug 

Re: Finer-grained test runs?

2020-07-08 Thread Kenneth Knowles
I like your use of "ancestor" and "descendant". I will adopt it.

On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw  wrote:

> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik  wrote:
> >
> > I'm not sure that breaking it up will be significantly faster since each
> module needs to build its ancestors and run tests of itself and all of its
> descendants which isn't a trivial amount of work. We have only so many
> executors and with the increased number of jobs, won't we just be waiting
> for queued jobs to start?



> I think that depends on how many fewer tests we could run (or rerun)
> for the average PR. (It would also be nice if we could share build
> artifacts across executors (is there something like ccache for
> javac?), but maybe that's too far-fetched?)
>

Robert: The gradle cache should remain valid across runs, I think... my
latest understanding was that it was a robust up-to-date check (aka not
`make`). We may have messed this up, as I am not seeing as much caching as
I would expect nor as much as I see locally. We had to do some tweaking in
the maven days to put the .m2 directory outside of the realm wiped for each
new build. Maybe we are clobbering the Gradle cache too. That might
actually make most builds so fast we do not care about my proposal.

Luke: I am not sure if you are replying to my email or to Brian's.

If Brian's: it does not result in redundant build (if plugin works) since
it would be one Gradle build process. But it does do a full build if you
touch something at the root of the ancestry tree like core SDK or model. I
would like to avoid automatically testing descendants if we can, since
things like Nexmark and most IOs are not sensitive to the vast majority of
model or core SDK changes. Runners are borderline.

If mine: you could assume my proposal is like Brian's but with full
isolated Jenkins builds. This would be strictly worse, since it would add
redundant builds of ancestors. I am assuming that you always run a separate
Jenkins job for every descendant. Still, many modules have fewer
descendants. And they do not trigger all the way up to the root and down to
all descendants of the root.

From a community perspective, extensions and IOs are the most likely use
case for newcomers. For the person who comes to add or improve FooIO, it is
not a good experience to hit a flake in RabbitMqIO or JdbcIO or
DataflowRunner or FlinkRunner flakes.

I think the plugin Brian mentioned is only a start. It would be even better
for each module to have an opt-in list of descendants to test on precommit.
This works well with a rollback-first strategy on post-commit. We can then
replay the PR while triggering the postcommits that failed.

> I agree that we would have better visibility though in github and also in
> Jenkins.
>
> I do have to say having to scroll through a huge number of github
> checks is not always an improvement.
>

+1 but OTOH the gradle scan is sometimes too fine grained or associates
logs oddly (I skip the Jenkins status page almost always)


> > Fixing flaky tests would help improve our test signal as well. Not many
> willing people here though but could be less work than building and
> maintaining so many different jobs.
>
> +1
>

I agree with fixing flakes, but I want to treat the occurrence and
resolution of flakiness as standard operations. Just as bug counts increase
continuously as a project grows, so will overall flakiness. Separating
flakiness signals will help to prioritize which flakes to address.

Kenn


> > On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles  wrote:
> >>
> >> That's a good start. It is new enough and with few enough commits that
> I'd want to do some thorough experimentation. Our build is complex enough
> with a lot of ad hoc coding that we might end up maintaining whatever we
> choose...
> >>
> >> In my ideal scenario the list of "what else to test" would be manually
> editable, or even strictly opt-in. Automatically testing everything that
> might be affected quickly runs into scaling problems too. It could make
> sense in post-commit but less so in pre-commit.
> >>
> >> Kenn
> >>
> >> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette 
> wrote:
> >>>
> >>> > We could have one "test the things" Jenkins job if the underlying
> tool (Gradle) could resolve what needs to be run.
> >>>
> >>> I think this would be much better. Otherwise it seems our Jenkins
> definitions are just duplicating information that's already stored in the
> build.gradle files which seems error-prone, especially for tests validating
> combinations of artifacts. I did some quick searching and came across [1].
> It doesn't look like the project has had a lot of recent activity, but it
> claims to do what we need:
> >>>
> >>> > The plugin will generate new tasks on the root project for each task
> provided on the configuration with the following pattern
> ${taskName}ChangedModules.
> >>> > These generated tasks will run the changedModules task to get the
> list of changed modules and for each one 

Re: Finer-grained test runs?

2020-07-08 Thread Robert Bradshaw
On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik  wrote:
>
> I'm not sure that breaking it up will be significantly faster since each 
> module needs to build its ancestors and run tests of itself and all of its 
> descendants which isn't a trivial amount of work. We have only so many 
> executors and with the increased number of jobs, won't we just be waiting for 
> queued jobs to start?

I think that depends on how many fewer tests we could run (or rerun)
for the average PR. (It would also be nice if we could share build
artifacts across executors (is there something like ccache for
javac?), but maybe that's too far-fetched?)

> I agree that we would have better visibility though in github and also in 
> Jenkins.

I do have to say having to scroll through a huge number of github
checks is not always an improvement.

> Fixing flaky tests would help improve our test signal as well. Not many
> willing people here though but could be less work than building and
> maintaining so many different jobs.

+1

> On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles  wrote:
>>
>> That's a good start. It is new enough and with few enough commits that I'd 
>> want to do some thorough experimentation. Our build is complex enough with a 
>> lot of ad hoc coding that we might end up maintaining whatever we choose...
>>
>> In my ideal scenario the list of "what else to test" would be manually 
>> editable, or even strictly opt-in. Automatically testing everything that 
>> might be affected quickly runs into scaling problems too. It could make 
>> sense in post-commit but less so in pre-commit.
>>
>> Kenn
>>
>> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette  wrote:
>>>
>>> > We could have one "test the things" Jenkins job if the underlying tool 
>>> > (Gradle) could resolve what needs to be run.
>>>
>>> I think this would be much better. Otherwise it seems our Jenkins 
>>> definitions are just duplicating information that's already stored in the 
>>> build.gradle files which seems error-prone, especially for tests validating 
>>> combinations of artifacts. I did some quick searching and came across [1]. 
>>> It doesn't look like the project has had a lot of recent activity, but it 
>>> claims to do what we need:
>>>
>>> > The plugin will generate new tasks on the root project for each task 
>>> > provided on the configuration with the following pattern 
>>> > ${taskName}ChangedModules.
>>> > These generated tasks will run the changedModules task to get the list of 
>>> > changed modules and for each one will call the given task.
>>>
>>> Of course this would only really help us with java tests as gradle doesn't 
>>> know much about the structure of dependencies within the python (and go?) 
>>> SDK.
>>>
>>> Brian
>>>
>>> [1] https://github.com/ismaeldivita/change-tracker-plugin
>>>
>>> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:

 Hi all,

 I wanted to start a discussion about getting finer grained test execution 
 more focused on particular artifacts/modules. In particular, I want to 
 gather the downsides and impossibilities. So I will make a proposal that 
 people can disagree with easily.

 Context: job_PreCommit_Java is a monolithic job that...

  - takes 40-50 minutes
  - runs tests of maybe a bit under 100 modules
  - executes over 10k tests
  - runs on any change to model/, sdks/java/, runners/, examples/java/, 
 examples/kotlin/, release/ (only exception is SQL)
  - is pretty flaky (because it conflates so many independent test flakes, 
 mostly runners and IOs)

 See a scan at 
 https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest

 Proposal: Eliminate monolithic job and break into finer-grained jobs that 
 operate on two principles:

 1. Test run should be focused on validating one artifact or a specific 
 integration of other artifacts.
 2. Test run should trigger only on things that could affect the validity 
 of that artifact.

 For example, a starting point is to separate:

  - core SDK
  - runner helper libs
  - each runner
  - each extension
  - each IO

 Benefits:

  - changing an IO or runner would not trigger the 20 minutes of core SDK 
 tests
  - changing a runner would not trigger the long IO local integration tests
  - changing the core SDK could potentially not run as many tests in 
 presubmit, but maybe it would and they would be separately reported 
 results with clear flakiness signal

 There are 72 build.gradle files under sdks/java/ and 30 under runners/. 
 They don't all require a separate job. But still there are enough that it 
 is worth automation. Does anyone know of what options we might have? It 
 does not even have to be in Jenkins. We could have one "test the things" 
 Jenkins job if the underlying tool (Gradle) could resolve what needs to be 
 run. Caching is not sufficient in my experience.

Re: Finer-grained test runs?

2020-07-08 Thread Luke Cwik
I'm not sure that breaking it up will be significantly faster since each
module needs to build its ancestors and run tests of itself and all of its
descendants which isn't a trivial amount of work. We have only so many
executors and with the increased number of jobs, won't we just be waiting
for queued jobs to start? I agree that we would have better visibility
though in github and also in Jenkins.

Fixing flaky tests would help improve our test signal as well. Not many
willing people here though, but it could be less work than building and
maintaining so many different jobs.


On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles  wrote:

> That's a good start. It is new enough and with few enough commits that I'd
> want to do some thorough experimentation. Our build is complex enough with
> a lot of ad hoc coding that we might end up maintaining whatever we
> choose...
>
> In my ideal scenario the list of "what else to test" would be manually
> editable, or even strictly opt-in. Automatically testing everything that
> might be affected quickly runs into scaling problems too. It could make
> sense in post-commit but less so in pre-commit.
>
> Kenn
>
> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette  wrote:
>
>> > We could have one "test the things" Jenkins job if the underlying tool
>> (Gradle) could resolve what needs to be run.
>>
>> I think this would be much better. Otherwise it seems our Jenkins
>> definitions are just duplicating information that's already stored in the
>> build.gradle files which seems error-prone, especially for tests validating
>> combinations of artifacts. I did some quick searching and came across [1].
>> It doesn't look like the project has had a lot of recent activity, but it
>> claims to do what we need:
>>
>> > The plugin will generate new tasks on the root project for each task
>> provided on the configuration with the following pattern
>> ${taskName}ChangedModules.
>> > These generated tasks will run the changedModules task to get the list
>> of changed modules and for each one will call the given task.
>>
>> Of course this would only really help us with java tests as gradle
>> doesn't know much about the structure of dependencies within the python
>> (and go?) SDK.
>>
>> Brian
>>
>> [1] https://github.com/ismaeldivita/change-tracker-plugin
>>
>> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> I wanted to start a discussion about getting finer grained test
>>> execution more focused on particular artifacts/modules. In particular, I
>>> want to gather the downsides and impossibilities. So I will make a proposal
>>> that people can disagree with easily.
>>>
>>> Context: job_PreCommit_Java is a monolithic job that...
>>>
>>>  - takes 40-50 minutes
>>>  - runs tests of maybe a bit under 100 modules
>>>  - executes over 10k tests
>>>  - runs on any change to model/, sdks/java/, runners/, examples/java/,
>>> examples/kotlin/, release/ (only exception is SQL)
>>>  - is pretty flaky (because it conflates so many independent test
>>> flakes, mostly runners and IOs)
>>>
>>> See a scan at
>>> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>>>
>>> Proposal: Eliminate monolithic job and break into finer-grained jobs
>>> that operate on two principles:
>>>
>>> 1. Test run should be focused on validating one artifact or a specific
>>> integration of other artifacts.
>>> 2. Test run should trigger only on things that could affect the validity
>>> of that artifact.
>>>
>>> For example, a starting point is to separate:
>>>
>>>  - core SDK
>>>  - runner helper libs
>>>  - each runner
>>>  - each extension
>>>  - each IO
>>>
>>> Benefits:
>>>
>>>  - changing an IO or runner would not trigger the 20 minutes of core SDK
>>> tests
>>>  - changing a runner would not trigger the long IO local integration
>>> tests
>>>  - changing the core SDK could potentially not run as many tests in
>>> presubmit, but maybe it would and they would be separately reported results
>>> with clear flakiness signal
>>>
>>> There are 72 build.gradle files under sdks/java/ and 30 under runners/.
>>> They don't all require a separate job. But still there are enough that it
>>> is worth automation. Does anyone know of what options we might have? It
>>> does not even have to be in Jenkins. We could have one "test the things"
>>> Jenkins job if the underlying tool (Gradle) could resolve what needs to be
>>> run. Caching is not sufficient in my experience.
>>>
>>> (there are other quick fix alternatives to shrinking this time, but I
>>> want to focus on bigger picture)
>>>
>>> Kenn
>>>
>>


Re: Finer-grained test runs?

2020-07-08 Thread Kenneth Knowles
That's a good start. It is new enough and with few enough commits that I'd
want to do some thorough experimentation. Our build is complex enough with
a lot of ad hoc coding that we might end up maintaining whatever we
choose...

In my ideal scenario the list of "what else to test" would be manually
editable, or even strictly opt-in. Automatically testing everything that
might be affected quickly runs into scaling problems too. It could make
sense in post-commit but less so in pre-commit.

Kenn

On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette  wrote:

> > We could have one "test the things" Jenkins job if the underlying tool
> (Gradle) could resolve what needs to be run.
>
> I think this would be much better. Otherwise it seems our Jenkins
> definitions are just duplicating information that's already stored in the
> build.gradle files which seems error-prone, especially for tests validating
> combinations of artifacts. I did some quick searching and came across [1].
> It doesn't look like the project has had a lot of recent activity, but it
> claims to do what we need:
>
> > The plugin will generate new tasks on the root project for each task
> provided on the configuration with the following pattern
> ${taskName}ChangedModules.
> > These generated tasks will run the changedModules task to get the list
> of changed modules and for each one will call the given task.
>
> Of course this would only really help us with java tests as gradle doesn't
> know much about the structure of dependencies within the python (and go?)
> SDK.
>
> Brian
>
> [1] https://github.com/ismaeldivita/change-tracker-plugin
>
> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> I wanted to start a discussion about getting finer grained test execution
>> more focused on particular artifacts/modules. In particular, I want to
>> gather the downsides and impossibilities. So I will make a proposal that
>> people can disagree with easily.
>>
>> Context: job_PreCommit_Java is a monolithic job that...
>>
>>  - takes 40-50 minutes
>>  - runs tests of maybe a bit under 100 modules
>>  - executes over 10k tests
>>  - runs on any change to model/, sdks/java/, runners/, examples/java/,
>> examples/kotlin/, release/ (only exception is SQL)
>>  - is pretty flaky (because it conflates so many independent test flakes,
>> mostly runners and IOs)
>>
>> See a scan at
>> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>>
>> Proposal: Eliminate monolithic job and break into finer-grained jobs that
>> operate on two principles:
>>
>> 1. Test run should be focused on validating one artifact or a specific
>> integration of other artifacts.
>> 2. Test run should trigger only on things that could affect the validity
>> of that artifact.
>>
>> For example, a starting point is to separate:
>>
>>  - core SDK
>>  - runner helper libs
>>  - each runner
>>  - each extension
>>  - each IO
>>
>> Benefits:
>>
>>  - changing an IO or runner would not trigger the 20 minutes of core SDK
>> tests
>>  - changing a runner would not trigger the long IO local integration tests
>>  - changing the core SDK could potentially not run as many tests in
>> presubmit, but maybe it would and they would be separately reported results
>> with clear flakiness signal
>>
>> There are 72 build.gradle files under sdks/java/ and 30 under runners/.
>> They don't all require a separate job. But still there are enough that it
>> is worth automation. Does anyone know of what options we might have? It
>> does not even have to be in Jenkins. We could have one "test the things"
>> Jenkins job if the underlying tool (Gradle) could resolve what needs to be
>> run. Caching is not sufficient in my experience.
>>
>> (there are other quick fix alternatives to shrinking this time, but I
>> want to focus on bigger picture)
>>
>> Kenn
>>
>


Re: Finer-grained test runs?

2020-07-08 Thread Brian Hulette
> We could have one "test the things" Jenkins job if the underlying tool
(Gradle) could resolve what needs to be run.

I think this would be much better. Otherwise it seems our Jenkins
definitions are just duplicating information that's already stored in the
build.gradle files which seems error-prone, especially for tests validating
combinations of artifacts. I did some quick searching and came across [1].
It doesn't look like the project has had a lot of recent activity, but it
claims to do what we need:

> The plugin will generate new tasks on the root project for each task
provided on the configuration with the following pattern
${taskName}ChangedModules.
> These generated tasks will run the changedModules task to get the list of
changed modules and for each one will call the given task.

Of course this would only really help us with java tests as gradle doesn't
know much about the structure of dependencies within the python (and go?)
SDK.

Brian

[1] https://github.com/ismaeldivita/change-tracker-plugin
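
If the plugin does what its README describes, wiring it up would presumably be
a small block in the root build.gradle. The following is only a guess at its
shape based on the excerpt quoted above; the extension and property names are
assumptions, not the plugin's verified DSL:

    // Root build.gradle -- hypothetical sketch inferred from the README excerpt
    // above; the extension and property names are guesses, not verified API.
    // (The plugin itself would first need to be applied per its README.)
    changeTracker {
        tasks = ['test']          // would generate a root-level `testChangedModules` task
        branch = 'origin/master'  // guessed name for the ref to diff against
    }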

On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:

> Hi all,
>
> I wanted to start a discussion about getting finer grained test execution
> more focused on particular artifacts/modules. In particular, I want to
> gather the downsides and impossibilities. So I will make a proposal that
> people can disagree with easily.
>
> Context: job_PreCommit_Java is a monolithic job that...
>
>  - takes 40-50 minutes
>  - runs tests of maybe a bit under 100 modules
>  - executes over 10k tests
>  - runs on any change to model/, sdks/java/, runners/, examples/java/,
> examples/kotlin/, release/ (only exception is SQL)
>  - is pretty flaky (because it conflates so many independent test flakes,
> mostly runners and IOs)
>
> See a scan at
> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>
> Proposal: Eliminate monolithic job and break into finer-grained jobs that
> operate on two principles:
>
> 1. Test run should be focused on validating one artifact or a specific
> integration of other artifacts.
> 2. Test run should trigger only on things that could affect the validity
> of that artifact.
>
> For example, a starting point is to separate:
>
>  - core SDK
>  - runner helper libs
>  - each runner
>  - each extension
>  - each IO
>
> Benefits:
>
>  - changing an IO or runner would not trigger the 20 minutes of core SDK
> tests
>  - changing a runner would not trigger the long IO local integration tests
>  - changing the core SDK could potentially not run as many tests in
> presubmit, but maybe it would and they would be separately reported results
> with clear flakiness signal
>
> There are 72 build.gradle files under sdks/java/ and 30 under runners/.
> They don't all require a separate job. But still there are enough that it
> is worth automation. Does anyone know of what options we might have? It
> does not even have to be in Jenkins. We could have one "test the things"
> Jenkins job if the underlying tool (Gradle) could resolve what needs to be
> run. Caching is not sufficient in my experience.
>
> (there are other quick fix alternatives to shrinking this time, but I want
> to focus on bigger picture)
>
> Kenn
>


Finer-grained test runs?

2020-07-08 Thread Kenneth Knowles
Hi all,

I wanted to start a discussion about getting finer grained test execution
more focused on particular artifacts/modules. In particular, I want to
gather the downsides and impossibilities. So I will make a proposal that
people can disagree with easily.

Context: job_PreCommit_Java is a monolithic job that...

 - takes 40-50 minutes
 - runs tests of maybe a bit under 100 modules
 - executes over 10k tests
 - runs on any change to model/, sdks/java/, runners/, examples/java/,
examples/kotlin/, release/ (only exception is SQL)
 - is pretty flaky (because it conflates so many independent test flakes,
mostly runners and IOs)

See a scan at https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest

Proposal: Eliminate monolithic job and break into finer-grained jobs that
operate on two principles:

1. Test run should be focused on validating one artifact or a specific
integration of other artifacts.
2. Test run should trigger only on things that could affect the validity of
that artifact.

For example, a starting point is to separate:

 - core SDK
 - runner helper libs
 - each runner
 - each extension
 - each IO

Benefits:

 - changing an IO or runner would not trigger the 20 minutes of core SDK
tests
 - changing a runner would not trigger the long IO local integration tests
 - changing the core SDK could potentially not run as many tests in
presubmit, but maybe it would and they would be separately reported results
with clear flakiness signal

There are 72 build.gradle files under sdks/java/ and 30 under runners/.
They don't all require a separate job. But still there are enough that it
is worth automation. Does anyone know of what options we might have? It
does not even have to be in Jenkins. We could have one "test the things"
Jenkins job if the underlying tool (Gradle) could resolve what needs to be
run. Caching is not sufficient in my experience.
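
As a strawman for what "let Gradle resolve it" could look like even without a
plugin, here is a rough sketch of a root-level task that maps the files changed
on a branch to the subprojects that own them. It only prints the affected
project paths (a wrapper script or Jenkins step would then invoke the
corresponding :module:test tasks), and it deliberately ignores dependents to
keep the example small:

    // Root build.gradle -- illustrative sketch, not our actual configuration.
    // Lists the subprojects whose directories contain files changed relative
    // to origin/master; a wrapper would then run `test` for each printed path.
    tasks.register('listChangedModules') {
        doLast {
            def proc = new ProcessBuilder('git', 'diff', '--name-only',
                    'origin/master...HEAD').directory(rootDir).start()
            def changedFiles = proc.text.readLines()
            def affected = subprojects.findAll { sub ->
                def rel = rootDir.toPath()
                        .relativize(sub.projectDir.toPath()).toString() + '/'
                changedFiles.any { it.startsWith(rel) }
            }
            affected.each { println it.path }
        }
    }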

(there are other quick fix alternatives to shrinking this time, but I want
to focus on bigger picture)

Kenn