Re: Finer-grained test runs?
Some context links for the benefit of the thread & archive:

- Beam issue mentioning a Jenkins plugin that caches on the Jenkins master: https://issues.apache.org/jira/browse/BEAM-4400
- Beam's request to infra: https://issues.apache.org/jira/browse/INFRA-16630
- Denial of a prior request, with reasoning: https://issues.apache.org/jira/browse/INFRA-16060 (the Jenkins master / S3 are not good choices for where to cache; hosting an actual Gradle build cache server, as in the thread linked above, would be a different matter)
- Prototyping on the Beam ticket was unsuccessful.
- (Ongoing) thread on builds@ asking about an existing service: https://lists.apache.org/thread.html/ae40734e34dcf1d3bd8c65dfea3094709d9d8eb97bfb9ab92149e97c%40%3Cbuilds.apache.org%3E

Kenn

On Mon, Jul 13, 2020 at 10:26 AM Kenneth Knowles wrote:
> Having thought this over a bit, I think there are a few goals and they are interfering with each other.
>
> 1. Clear signal for module / test suite health. This is a post-commit concern. Post-commit jobs already all run as cron jobs with no dependency-driven logic.
> 2. Making the precommit test signal stay non-flaky as modules, tests, and flakiness increase.
> 3. Making precommit stay fast as modules, tests, and flakiness increase.
>
> Noting the interdependence of pre-commit and post-commit:
>
> - you can phrase-trigger post-commit jobs
> - pre-commit jobs are run as post-commits also
>
> Summarizing a bit:
>
> 1. Clear per-module/suite greenness and flakiness signal
> - it would be nice if we could do this at the Gradle job level, but right now it is per Jenkins job
> - on the other hand, most Gradle jobs do not represent a module, so that could be too fine-grained and Jenkins jobs are better
> - if we have a ton of Jenkins jobs, we need some new automation or amortized management
> - we don't want to overwhelm the Jenkins executors, especially not causing precommit queueing
>
> 2. Making precommit stay non-flaky, robustly
> - we can fix flakes, but can't count on that long term; we could build something that forces us to solve them at P0
> - we can add a retry budget to tests where deflaking cannot be prioritized
> - there is a lot of anxiety that testing less in pre-commit will cause painful post-commit debugging
> - there is a lot of overlap with making it faster, since the flakes are often caused by irrelevant tests
>
> 3. Making precommit stay fast, robustly
> - we could improve per-worker incremental build
> - we could use a distributed build cache
> - we have tasks that don't declare their inputs/outputs correctly, and those will have problems
>
> I care most about #1 and then also #2. The only reason I care about #3 is because of #2: once a pre-commit takes more than a couple of minutes, I always go and do something else and come back in an hour or two. So if it flakes just a few times, it costs a day. Fix #2 and I don't think #3 is urgent yet.
>
> A distributed build cache seems to be fairly low effort to set up, makes #2 and #3 better, and may unlock approaches to #1, provided we can fix our Gradle configs. We can ask ASF infra whether they have something already or can set it up.
>
> That will still leave open how to get a better and more visible greenness and flakiness signal at a more meaningful granularity.
>
> Kenn
>
> On Fri, Jul 10, 2020 at 6:38 AM Kenneth Knowles wrote:
>> On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw wrote:
>>> I wonder how hard it would be to track greenness and flakiness at the level of gradle project (or even lower), viewed hierarchically.
>>
>> Looks like this is part of the Gradle Enterprise Tests Dashboard offering: https://gradle.com/blog/flaky-tests/
>>
>> Kenn
>>
>>> Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out)
>>>
>>> > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>>> > - changing a runner would not trigger the long IO local integration tests
>>> > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal
>>>
>>> And let's consider even more concrete examples:
>>>
>>> - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>>> - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>>> - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner?
>>>
>>> I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast
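Since a Gradle remote build cache comes up repeatedly in this thread, here is a minimal sketch of what enabling one would look like. The cache host URL and the CI check are purely illustrative assumptions; ASF infra would have to provide (or approve) an actual Gradle build cache node.

```kotlin
// settings.gradle.kts -- sketch only; the cache URL is a hypothetical host,
// not an existing ASF service.
buildCache {
    local {
        isEnabled = true
    }
    remote<HttpBuildCache> {
        url = uri("https://gradle-cache.example.org/cache/")  // hypothetical
        // Only CI builds should populate the shared cache; developer
        // machines (and untrusted PR builds) only read from it.
        isPush = System.getenv("JENKINS_HOME") != null
    }
}
```

With this in place, a task whose inputs are unchanged since some earlier CI build is restored from the cache instead of re-executing, which is what would make "touch one IO, rebuild everything" runs cheap.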
Re: Finer-grained test runs?
Having thought this over a bit, I think there are a few goals and they are interfering with each other.

1. Clear signal for module / test suite health. This is a post-commit concern. Post-commit jobs already all run as cron jobs with no dependency-driven logic.
2. Making the precommit test signal stay non-flaky as modules, tests, and flakiness increase.
3. Making precommit stay fast as modules, tests, and flakiness increase.

Noting the interdependence of pre-commit and post-commit:

- you can phrase-trigger post-commit jobs
- pre-commit jobs are run as post-commits also

Summarizing a bit:

1. Clear per-module/suite greenness and flakiness signal
- it would be nice if we could do this at the Gradle job level, but right now it is per Jenkins job
- on the other hand, most Gradle jobs do not represent a module, so that could be too fine-grained and Jenkins jobs are better
- if we have a ton of Jenkins jobs, we need some new automation or amortized management
- we don't want to overwhelm the Jenkins executors, especially not causing precommit queueing

2. Making precommit stay non-flaky, robustly
- we can fix flakes, but can't count on that long term; we could build something that forces us to solve them at P0
- we can add a retry budget to tests where deflaking cannot be prioritized
- there is a lot of anxiety that testing less in pre-commit will cause painful post-commit debugging
- there is a lot of overlap with making it faster, since the flakes are often caused by irrelevant tests

3. Making precommit stay fast, robustly
- we could improve per-worker incremental build
- we could use a distributed build cache
- we have tasks that don't declare their inputs/outputs correctly, and those will have problems

I care most about #1 and then also #2. The only reason I care about #3 is because of #2: once a pre-commit takes more than a couple of minutes, I always go and do something else and come back in an hour or two. So if it flakes just a few times, it costs a day. Fix #2 and I don't think #3 is urgent yet.

A distributed build cache seems to be fairly low effort to set up, makes #2 and #3 better, and may unlock approaches to #1, provided we can fix our Gradle configs. We can ask ASF infra whether they have something already or can set it up.

That will still leave open how to get a better and more visible greenness and flakiness signal at a more meaningful granularity.

Kenn

On Fri, Jul 10, 2020 at 6:38 AM Kenneth Knowles wrote:
> On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw wrote:
>> I wonder how hard it would be to track greenness and flakiness at the level of gradle project (or even lower), viewed hierarchically.
>
> Looks like this is part of the Gradle Enterprise Tests Dashboard offering: https://gradle.com/blog/flaky-tests/
>
> Kenn
>
>> > Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out)
>> >
>> > > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>> > > - changing a runner would not trigger the long IO local integration tests
>> > > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal
>> >
>> > And let's consider even more concrete examples:
>> >
>> > - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>> > - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>> > - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner?
>> >
>> > I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of all tests we run are literally unable to be affected by the changes triggering the test. So that's why enabling Gradle cache or using a plugin like Brian found could help part of the issue, but not the whole issue, again as Luke reminded.
>>
>> For (2) and (3), I would hope that the build dependency graph could exclude them. You're right about (1) (and I've hit that countless times), but would rather err on the side of accidentally running too many tests than not enough. If we make manual edits to what can be inferred by the build graph, let's make it a blacklist rather than an allow list to avoid accidental lost coverage.
>>
>> > We make these tradeoffs all the time, of course, via putting some tests in *IT and postCommit runs and some in *Test, implicitly preCommit. But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the
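The "retry budget" idea from the message above can be sketched with Gradle's test-retry plugin. This is an illustrative configuration, not Beam's actual setup; the plugin version and thresholds are assumptions.

```kotlin
// build.gradle.kts -- sketch of a per-suite retry budget using the
// org.gradle.test-retry plugin (version number is illustrative).
plugins {
    java
    id("org.gradle.test-retry") version "1.5.8"
}

tasks.test {
    retry {
        maxRetries.set(2)    // a failing test may be retried up to twice
        maxFailures.set(10)  // but give up early when the run is truly broken
        // Keep the flaky history visible: a test that passes on retry
        // still shows its earlier failures in the report.
        failOnPassedAfterRetry.set(false)
    }
}
```

The point of the budget framing is that the retry counts are a consumable resource: when a suite keeps spending its budget, that is the signal to prioritize deflaking rather than raising the limits.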
Re: Finer-grained test runs?
On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw wrote:
> I wonder how hard it would be to track greenness and flakiness at the level of gradle project (or even lower), viewed hierarchically.

Looks like this is part of the Gradle Enterprise Tests Dashboard offering: https://gradle.com/blog/flaky-tests/

Kenn

> > Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out)
> >
> > > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
> > > - changing a runner would not trigger the long IO local integration tests
> > > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal
> >
> > And let's consider even more concrete examples:
> >
> > - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
> > - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
> > - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner?
> >
> > I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of all tests we run are literally unable to be affected by the changes triggering the test. So that's why enabling Gradle cache or using a plugin like Brian found could help part of the issue, but not the whole issue, again as Luke reminded.
>
> For (2) and (3), I would hope that the build dependency graph could exclude them. You're right about (1) (and I've hit that countless times), but would rather err on the side of accidentally running too many tests than not enough. If we make manual edits to what can be inferred by the build graph, let's make it a blacklist rather than an allow list to avoid accidental lost coverage.
>
> > We make these tradeoffs all the time, of course, via putting some tests in *IT and postCommit runs and some in *Test, implicitly preCommit. But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the decision of where and when to run them (less stable, changing as the project changes).
> >
> > My assumption is that the project will only grow and all these problems (flakiness, runtime, false coupling) will continue to get worse. I raised this now so we could consider what is a steady state approach that could scale, before it becomes an emergency. I take it as a given that it is harder to change culture than it is to change infra/code, so I am not considering any possibility of more attention to flaky tests or more attention to testing the core properly or more attention to making tests snappy or more careful consideration of *IT and *Test. (unless we build infra that forces more attention to these things)
> >
> > Incidentally, SQL is not actually fully factored out. If you edit SQL it runs a limited subset defined by :sqlPreCommit. If you edit core, then :javaPreCommit still includes SQL tests.
>
> I think running SQL tests when you edit core is not actually that bad. Possibly better than not running any of them. (Maybe, as cost becomes more of a concern, adding the notion of "smoke tests" that are a cheap subset run when upstream projects change would be a good compromise.)
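The caching discussed above is only safe for tasks that declare their inputs and outputs completely; the thread notes Beam has tasks that do not. A hypothetical sketch of a correctly declared, cacheable task (the task and property names are invented for illustration):

```kotlin
// buildSrc sketch -- a task only benefits from the (local or remote) build
// cache when inputs and outputs are fully declared. Names are hypothetical.
import org.gradle.api.DefaultTask
import org.gradle.api.file.DirectoryProperty
import org.gradle.api.tasks.*

@CacheableTask
abstract class GenerateSources : DefaultTask() {
    // Every file that affects the output must be declared as an input;
    // an undeclared input means the cache can serve stale results.
    @get:InputDirectory
    @get:PathSensitive(PathSensitivity.RELATIVE)
    abstract val protoDir: DirectoryProperty

    @get:OutputDirectory
    abstract val generatedDir: DirectoryProperty

    @TaskAction
    fun generate() {
        // ... code generation from protoDir into generatedDir ...
    }
}
```

Gradle hashes the declared inputs to form the cache key, so a task with missing input declarations is exactly the "will have problems" case mentioned earlier in the thread.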
Re: Finer-grained test runs?
It does sound like we're generally on the same page. Minor comments below.

On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles wrote:
> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw wrote:
>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote:
>> >> If Brian's: it does not result in a redundant build (if the plugin works) since it would be one Gradle build process. But it does do a full build if you touch something at the root of the ancestry tree like the core SDK or model. I would like to avoid automatically testing descendants if we can, since things like Nexmark and most IOs are not sensitive to the vast majority of model or core SDK changes. Runners are borderline.
>> >
>> > I believe that the cost of fixing an issue that is found later, once the test starts failing because it wasn't run as part of the PR, is an order of magnitude higher to triage and fix, mostly due to loss of context for the PR author/reviewer and, if the culprit PR can't be found, for whoever is trying to fix it.
>>
>> Huge +1 to this.
>
> Totally agree. This abstract statement is clearly true. I suggest considering things more concretely.
>
>> Ideally we could count on the build system (and good caching) to only test what actually needs to be tested, and with work being done on runners and IOs this would be a small subset of our entire suite. When working lower in the stack (as I am prone to do) I think it's acceptable to have longer wait times, and I would *much* rather pay that price than discover things later. Perhaps some things could be surgically removed (it would be interesting to mine data on how often test failures in the "leaves" catch real issues), but I would do that with care. That being said, flakiness is really an issue (and it seems these days I have to re-run tests, often multiple times, to get a PR to green; splitting up jobs could help that as well).
>
> Agree with your sentiment that a longer wait for core changes is generally fine; my phrasing above overemphasized this case. Anecdotally, without mining data, leaf modules do sometimes catch bugs in core changes when (by definition) those are not adequately tested. This is a good measure of how much we have to improve our engineering practices.
>
> But anyhow this is one very special case. Coming back to the overall issue, what we actually do today is run all leaf/middle/root builds whenever anything in any leaf/middle/root layer is changed. And we track greenness and flakiness at this same level of granularity.

I wonder how hard it would be to track greenness and flakiness at the level of gradle project (or even lower), viewed hierarchically.

> Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out)
>
> > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
> > - changing a runner would not trigger the long IO local integration tests
> > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal
>
> And let's consider even more concrete examples:
>
> - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
> - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
> - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner?
>
> I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of all tests we run are literally unable to be affected by the changes triggering the test. So that's why enabling Gradle cache or using a plugin like Brian found could help part of the issue, but not the whole issue, again as Luke reminded.

For (2) and (3), I would hope that the build dependency graph could exclude them. You're right about (1) (and I've hit that countless times), but would rather err on the side of accidentally running too many tests than not enough. If we make manual edits to what can be inferred by the build graph, let's make it a blacklist rather than an allow list to avoid accidental lost coverage.

> We make these tradeoffs all the time, of course, via putting some tests in *IT and postCommit runs and some in *Test, implicitly preCommit. But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the decision of where and when to run them (less stable,
Re: Finer-grained test runs?
No, not without doing the research myself to see what tooling is currently available.

On Thu, Jul 9, 2020 at 1:17 PM Kenneth Knowles wrote:
> On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik wrote:
>> The budget would represent some criteria that we need from tests (e.g. percent passed, max number of skipped tests, test execution time, ...). If we fail the criteria then there must be actionable work (such as fixing tests) followed by something that prevents the status quo from continuing (such as preventing releases/features from being merged) until the criteria are satisfied again.
>
> +1. This is aligned with "CI as monitoring/alerting of the health of the machine that is your evolving codebase", which I very much subscribe to. Alert when something is wrong (another missing piece: have a quick way to ack and suppress false alarms in those cases where you really want a sensitive alert).
>
> Do you know good implementation choices in Gradle/JUnit/Jenkins? (asking before searching for it myself)
>
> Kenn
>
>> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles wrote:
>>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw wrote:
>>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote:
>>>> >> If Brian's: it does not result in a redundant build (if the plugin works) since it would be one Gradle build process. But it does do a full build if you touch something at the root of the ancestry tree like the core SDK or model. I would like to avoid automatically testing descendants if we can, since things like Nexmark and most IOs are not sensitive to the vast majority of model or core SDK changes. Runners are borderline.
>>>> >
>>>> > I believe that the cost of fixing an issue that is found later, once the test starts failing because it wasn't run as part of the PR, is an order of magnitude higher to triage and fix, mostly due to loss of context for the PR author/reviewer and, if the culprit PR can't be found, for whoever is trying to fix it.
>>>>
>>>> Huge +1 to this.
>>>
>>> Totally agree. This abstract statement is clearly true. I suggest considering things more concretely.
>>>
>>>> Ideally we could count on the build system (and good caching) to only test what actually needs to be tested, and with work being done on runners and IOs this would be a small subset of our entire suite. When working lower in the stack (as I am prone to do) I think it's acceptable to have longer wait times, and I would *much* rather pay that price than discover things later. Perhaps some things could be surgically removed (it would be interesting to mine data on how often test failures in the "leaves" catch real issues), but I would do that with care. That being said, flakiness is really an issue (and it seems these days I have to re-run tests, often multiple times, to get a PR to green; splitting up jobs could help that as well).
>>>
>>> Agree with your sentiment that a longer wait for core changes is generally fine; my phrasing above overemphasized this case. Anecdotally, without mining data, leaf modules do sometimes catch bugs in core changes when (by definition) those are not adequately tested. This is a good measure of how much we have to improve our engineering practices.
>>>
>>> But anyhow this is one very special case. Coming back to the overall issue, what we actually do today is run all leaf/middle/root builds whenever anything in any leaf/middle/root layer is changed. And we track greenness and flakiness at this same level of granularity.
>>>
>>> Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out)
>>>
>>> > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>>> > - changing a runner would not trigger the long IO local integration tests
>>> > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal
>>>
>>> And let's consider even more concrete examples:
>>>
>>> - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>>> - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>>> - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner?
>>>
>>> I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of all tests we run are literally unable to be
Re: Finer-grained test runs?
On Thu, Jul 9, 2020 at 1:10 PM Luke Cwik wrote:
> The budget would represent some criteria that we need from tests (e.g. percent passed, max number of skipped tests, test execution time, ...). If we fail the criteria then there must be actionable work (such as fixing tests) followed by something that prevents the status quo from continuing (such as preventing releases/features from being merged) until the criteria are satisfied again.

+1. This is aligned with "CI as monitoring/alerting of the health of the machine that is your evolving codebase", which I very much subscribe to. Alert when something is wrong (another missing piece: have a quick way to ack and suppress false alarms in those cases where you really want a sensitive alert).

Do you know good implementation choices in Gradle/JUnit/Jenkins? (asking before searching for it myself)

Kenn

> On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles wrote:
>> On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw wrote:
>>> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote:
>>> >> If Brian's: it does not result in a redundant build (if the plugin works) since it would be one Gradle build process. But it does do a full build if you touch something at the root of the ancestry tree like the core SDK or model. I would like to avoid automatically testing descendants if we can, since things like Nexmark and most IOs are not sensitive to the vast majority of model or core SDK changes. Runners are borderline.
>>> >
>>> > I believe that the cost of fixing an issue that is found later, once the test starts failing because it wasn't run as part of the PR, is an order of magnitude higher to triage and fix, mostly due to loss of context for the PR author/reviewer and, if the culprit PR can't be found, for whoever is trying to fix it.
>>>
>>> Huge +1 to this.
>>
>> Totally agree. This abstract statement is clearly true. I suggest considering things more concretely.
>>
>>> Ideally we could count on the build system (and good caching) to only test what actually needs to be tested, and with work being done on runners and IOs this would be a small subset of our entire suite. When working lower in the stack (as I am prone to do) I think it's acceptable to have longer wait times, and I would *much* rather pay that price than discover things later. Perhaps some things could be surgically removed (it would be interesting to mine data on how often test failures in the "leaves" catch real issues), but I would do that with care. That being said, flakiness is really an issue (and it seems these days I have to re-run tests, often multiple times, to get a PR to green; splitting up jobs could help that as well).
>>
>> Agree with your sentiment that a longer wait for core changes is generally fine; my phrasing above overemphasized this case. Anecdotally, without mining data, leaf modules do sometimes catch bugs in core changes when (by definition) those are not adequately tested. This is a good measure of how much we have to improve our engineering practices.
>>
>> But anyhow this is one very special case. Coming back to the overall issue, what we actually do today is run all leaf/middle/root builds whenever anything in any leaf/middle/root layer is changed. And we track greenness and flakiness at this same level of granularity.
>>
>> Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out)
>>
>> > - changing an IO or runner would not trigger the 20 minutes of core SDK tests
>> > - changing a runner would not trigger the long IO local integration tests
>> > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal
>>
>> And let's consider even more concrete examples:
>>
>> - when changing a Fn API proto, how important is it to run RabbitMqIOTest?
>> - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest?
>> - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner?
>>
>> I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of all tests we run are literally unable to be affected by the changes triggering the test. So that's why enabling Gradle cache or using a plugin like Brian found could help part of the issue, but not the whole issue, again as Luke reminded.
>>
>> We make these tradeoffs all the time, of course,
Re: Finer-grained test runs?
The budget would represent some criteria that we need from tests (e.g. percent passed, max num skipped tests, test execution time, ...). If we fail the criteria then there must be actionable work (such as fix tests) followed with something that prevents the status quo from continuing (such as preventing releases/features being merged) until the criteria is satisfied again. On Thu, Jul 9, 2020 at 1:00 PM Kenneth Knowles wrote: > > > On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw > wrote: > >> On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote: >> > >> >> If Brian's: it does not result in redundant build (if plugin works) >> since it would be one Gradle build process. But it does do a full build if >> you touch something at the root of the ancestry tree like core SDK or >> model. I would like to avoid automatically testing descendants if we can, >> since things like Nexmark and most IOs are not sensitive to the vast >> majority of model or core SDK changes. Runners are borderline. >> > >> > I believe that the cost of fixing an issue that is found later once the >> test starts failing because the test wasn't run as part of the PR has a >> much higher order of magnitude of cost to triage and fix. Mostly due to >> loss of context from the PR author/reviewer and if the culprit PR can't be >> found then whoever is trying to fix it. >> >> Huge +1 to this. >> > > Totally agree. This abstract statement is clearly true. I suggest > considering things more concretely. > > Ideally we could count on the build system (and good caching) to only >> test what actually needs to be tested, and with work being done on >> runners and IOs this would be a small subset of our entire suite. When >> working lower in the stack (and I am prone to do) I think it's >> acceptable to have longer wait times--and would *much* rather pay that >> price than discover things later. 
Perhaps some things could be >> surgically removed (it would be interesting to mine data on how often >> test failures in the "leaves" catch real issues), but I would do that >> with care. That being said, flakiness is really an issues (and it >> seems these days I have to re-run tests, often multiple times, to get >> a PR to green; splitting up jobs could help that as well). >> > > Agree with your sentiment that a longer wait for core changes is generally > fine; my phrasing above overemphasized this case. Anecdotally, without > mining data, leaf modules do catch bugs in core changes sometimes when (by > definition) they are not adequately tested. This is a good measure for how > much we have to improve our engineering practices. > > But anyhow this is one very special case. Coming back to the overall > issue, what we actually do today is run all leaf/middle/root builds > whenever anything in any leaf/middle/root layer is changed. And we track > greenness and flakiness at this same level of granularity. > > Recall my (non-binding) starting point guessing at what tests should or > should not run in some scenarios: (this tangent is just about the third > one, where I explicitly said maybe we run all the same tests and then we > want to focus on separating signals as Luke pointed out) > > > - changing an IO or runner would not trigger the 20 minutes of core SDK > tests > > - changing a runner would not trigger the long IO local integration tests > > - changing the core SDK could potentially not run as many tests in > presubmit, but maybe it would and they would be separately reported results > with clear flakiness signal > > And let's consider even more concrete examples: > > - when changing a Fn API proto, how important is it to run RabbitMqIOTest? > - when changing JdbcIO, how important is it to run the Java SDK > needsRunnerTests? RabbitMqIOTest? 
> - when changing the FlinkRunner, how important is it to make sure that > Nexmark queries still match their models when run on direct runner? > > I chose these examples to all have zero value, of course. And I've > deliberately included an example of a core change and a leaf test. Not all > (core change, leaf test) pairs are equally important. The vast majority of > all tests we run are literally unable to be affected by the changes > triggering the test. So that's why enabling Gradle cache or using a plugin > like Brian found could help part of the issue, but not the whole issue, > again as Luke reminded. > > We make these tradeoffs all the time, of course, via putting some tests in > *IT and postCommit runs and some in *Test, implicitly preCommit. But I am > imagining a future where we can decouple the test suite definitions (very > stable, not depending on the project context) from the decision of where > and when to run them (less stable, changing as the project changes). > > My assumption is that the project will only grow and all these problems > (flakiness, runtime, false coupling) will continue to get worse. I raised > this now so we could consider what is a steady state approach that could > scale, before it becomes an emergency.
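The test-health "budget" described at the top of this message could be sketched as a simple gate. This is only an illustration: the report fields and the thresholds are hypothetical, chosen to show the shape of the idea, not actual project policy.

```python
# Sketch of a test-health "budget" gate. The criteria mirror the message
# above (percent passed, max skipped tests, execution time); all values
# here are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class SuiteReport:
    passed: int
    failed: int
    skipped: int
    duration_secs: float


def within_budget(report, min_pass_rate=0.99, max_skipped=10, max_duration_secs=3600):
    """Return (ok, reasons). An empty reasons list means the suite is healthy;
    otherwise there is actionable work before releases/merges proceed."""
    reasons = []
    total = report.passed + report.failed
    pass_rate = report.passed / total if total else 1.0
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2%} is below {min_pass_rate:.0%}")
    if report.skipped > max_skipped:
        reasons.append(f"{report.skipped} skipped tests exceed the limit of {max_skipped}")
    if report.duration_secs > max_duration_secs:
        reasons.append(f"runtime {report.duration_secs:.0f}s exceeds {max_duration_secs}s")
    return (len(reasons) == 0, reasons)
```

The interesting part is not the thresholds but that failing the gate produces a concrete list of reasons, i.e. the "actionable work" the message asks for.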
Re: Finer-grained test runs?
On Thu, Jul 9, 2020 at 11:47 AM Robert Bradshaw wrote: > On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote: > > > >> If Brian's: it does not result in redundant build (if plugin works) > since it would be one Gradle build process. But it does do a full build if > you touch something at the root of the ancestry tree like core SDK or > model. I would like to avoid automatically testing descendants if we can, > since things like Nexmark and most IOs are not sensitive to the vast > majority of model or core SDK changes. Runners are borderline. > > > > I believe that the cost of fixing an issue that is found later once the > test starts failing because the test wasn't run as part of the PR has a > much higher order of magnitude of cost to triage and fix. Mostly due to > loss of context from the PR author/reviewer and if the culprit PR can't be > found then whoever is trying to fix it. > > Huge +1 to this. > Totally agree. This abstract statement is clearly true. I suggest considering things more concretely. Ideally we could count on the build system (and good caching) to only > test what actually needs to be tested, and with work being done on > runners and IOs this would be a small subset of our entire suite. When > working lower in the stack (and I am prone to do) I think it's > acceptable to have longer wait times--and would *much* rather pay that > price than discover things later. Perhaps some things could be > surgically removed (it would be interesting to mine data on how often > test failures in the "leaves" catch real issues), but I would do that > with care. That being said, flakiness is really an issue (and it > seems these days I have to re-run tests, often multiple times, to get > a PR to green; splitting up jobs could help that as well). > Agree with your sentiment that a longer wait for core changes is generally fine; my phrasing above overemphasized this case.
Anecdotally, without mining data, leaf modules do catch bugs in core changes sometimes when (by definition) they are not adequately tested. This is a good measure for how much we have to improve our engineering practices. But anyhow this is one very special case. Coming back to the overall issue, what we actually do today is run all leaf/middle/root builds whenever anything in any leaf/middle/root layer is changed. And we track greenness and flakiness at this same level of granularity. Recall my (non-binding) starting point guessing at what tests should or should not run in some scenarios: (this tangent is just about the third one, where I explicitly said maybe we run all the same tests and then we want to focus on separating signals as Luke pointed out) > - changing an IO or runner would not trigger the 20 minutes of core SDK tests > - changing a runner would not trigger the long IO local integration tests > - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal And let's consider even more concrete examples: - when changing a Fn API proto, how important is it to run RabbitMqIOTest? - when changing JdbcIO, how important is it to run the Java SDK needsRunnerTests? RabbitMqIOTest? - when changing the FlinkRunner, how important is it to make sure that Nexmark queries still match their models when run on direct runner? I chose these examples to all have zero value, of course. And I've deliberately included an example of a core change and a leaf test. Not all (core change, leaf test) pairs are equally important. The vast majority of all tests we run are literally unable to be affected by the changes triggering the test. So that's why enabling Gradle cache or using a plugin like Brian found could help part of the issue, but not the whole issue, again as Luke reminded. 
We make these tradeoffs all the time, of course, via putting some tests in *IT and postCommit runs and some in *Test, implicitly preCommit. But I am imagining a future where we can decouple the test suite definitions (very stable, not depending on the project context) from the decision of where and when to run them (less stable, changing as the project changes). My assumption is that the project will only grow and all these problems (flakiness, runtime, false coupling) will continue to get worse. I raised this now so we could consider what is a steady state approach that could scale, before it becomes an emergency. I take it as a given that it is harder to change culture than it is to change infra/code, so I am not considering any possibility of more attention to flaky tests or more attention to testing the core properly or more attention to making tests snappy or more careful consideration of *IT and *Test. (unless we build infra that forces more attention to these things) Incidentally, SQL is not actually fully factored out. If you edit SQL it runs a limited subset defined by :sqlPreCommit. If you edit core, then :javaPreCommit still includes SQL
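The SQL observation above (edits to core still run SQL tests because the aggregate precommit task pulls them in) can be made concrete with a toy task graph. The task names follow the message, but the dependency map is a simplified hypothetical, not Beam's real build configuration.

```python
# Toy model of the aggregate-task coupling described above: :javaPreCommit
# still depends on SQL test tasks, so a core edit runs SQL tests even
# though SQL has its own :sqlPreCommit. The dependency map is illustrative.
TASK_DEPS = {
    ":javaPreCommit": [":coreTests", ":sqlTests", ":ioTests"],
    ":sqlPreCommit": [":sqlTests"],
}


def tasks_run(entry_task):
    """Flatten the toy task graph: every task executed by entry_task."""
    stack, executed = [entry_task], set()
    while stack:
        task = stack.pop()
        if task not in executed:
            executed.add(task)
            stack.extend(TASK_DEPS.get(task, []))
    return executed
```

Fully factoring SQL out would mean removing `:sqlTests` from `:javaPreCommit`'s dependencies, at which point the two jobs give genuinely separate signals.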
Re: Finer-grained test runs?
On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote: > >> If Brian's: it does not result in redundant build (if plugin works) since it >> would be one Gradle build process. But it does do a full build if you touch >> something at the root of the ancestry tree like core SDK or model. I would >> like to avoid automatically testing descendants if we can, since things like >> Nexmark and most IOs are not sensitive to the vast majority of model or core >> SDK changes. Runners are borderline. > > I believe that the cost of fixing an issue that is found later once the test > starts failing because the test wasn't run as part of the PR has a much > higher order of magnitude of cost to triage and fix. Mostly due to loss of > context from the PR author/reviewer and if the culprit PR can't be found then > whoever is trying to fix it. Huge +1 to this. Ideally we could count on the build system (and good caching) to only test what actually needs to be tested, and with work being done on runners and IOs this would be a small subset of our entire suite. When working lower in the stack (and I am prone to do) I think it's acceptable to have longer wait times--and would *much* rather pay that price than discover things later. Perhaps some things could be surgically removed (it would be interesting to mine data on how often test failures in the "leaves" catch real issues), but I would do that with care. That being said, flakiness is really an issue (and it seems these days I have to re-run tests, often multiple times, to get a PR to green; splitting up jobs could help that as well).
Re: Finer-grained test runs?
On Thu, Jul 9, 2020 at 8:40 AM Luke Cwik wrote: > On Wed, Jul 8, 2020 at 9:22 PM Kenneth Knowles wrote: > >> I like your use of "ancestor" and "descendant". I will adopt it. >> >> On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw >> wrote: >> >>> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik wrote: >>> > >>> > I'm not sure that breaking it up will be significantly faster since >>> each module needs to build its ancestors and run tests of itself and all of >>> its descendants which isn't a trivial amount of work. We have only so many >>> executors and with the increased number of jobs, won't we just be waiting >>> for queued jobs to start? >> >> >> >> I think that depends on how many fewer tests we could run (or rerun) >>> for the average PR. (It would also be nice if we could share build >>> artifacts across executors (is there something like ccache for >>> javac?), but maybe that's too far-fetched?) >>> >> >> Robert: The gradle cache should remain valid across runs, I think... my >> latest understanding was that it was a robust up-to-date check (aka not >> `make`). We may have messed this up, as I am not seeing as much caching as >> I would expect nor as much as I see locally. We had to do some tweaking in >> the maven days to put the .m2 directory outside of the realm wiped for each >> new build. Maybe we are clobbering the Gradle cache too. That might >> actually make most builds so fast we do not care about my proposal. >> > > The gradle cache relies on our inputs/outputs to be specified correctly. > It's great that this has been fixed since I was under the impression that > it was disabled and/or that we used --rerun-tasks everywhere. > Sorry, when I said *should* I mean that if it is not currently being used, we should do what it takes to use it. Based on the scans, I don't think test results are being cached. But I could have read things wrong... Luke: I am not sure if you are replying to my email or to Brian's. 
>> > If Brian's: it does not result in redundant build (if plugin works) since >> it would be one Gradle build process. But it does do a full build if you >> touch something at the root of the ancestry tree like core SDK or model. I >> would like to avoid automatically testing descendants if we can, since >> things like Nexmark and most IOs are not sensitive to the vast majority of >> model or core SDK changes. Runners are borderline. >> > > I believe that the cost of fixing an issue that is found later once the > test starts failing because the test wasn't run as part of the PR has a > much higher order of magnitude of cost to triage and fix. Mostly due to > loss of context from the PR author/reviewer and if the culprit PR can't be > found then whoever is trying to fix it. > > If we are willing to not separate out into individual jobs then we are > really trying to make the job faster. > It would also reduce flakiness, which was a key motivator for this thread. It is a good point about separate signals, which I somehow forgot in between emails. So an approach based on separate jobs is not strictly worse, since it has this benefit. How much digging have folks done into the build scans since they show a lot > of details that are useful around what is slow for a specific job. Take the > Java Precommit for example: > * The timeline of what tasks ran when: > https://scans.gradle.com/s/u2rkcnww2fs24/timeline (looks like nexmark > testing is 30 mins long and is the straggler) > I did a bit of this digging the other day. Separating Nexmark out from Java (as we did with SQL) would be a mitigation that addresses job speed. I planned on doing this today. Separating out each of the 10 minute IO and runner runs would also improve speed and reduce flakiness but then this is turning into a longer task. Doing this with include/exclude patterns in job files is simple [1] but will get harder to keep consistent. I would guess they are already inconsistent. 
Here's a sketch of one way that this can scale: have the metadata that defines trigger patterns and test targets live next to the modules. Then it scales just as well as authoring modules does. You need some code to assemble the appropriate job triggers from the declared ancestry. This could have the benefit that the signal is for a module and not for a job. Changing the triggers or refactoring how different things run would not reset the meaning of the signal, as it does now. * It looks like our build cache ( > https://scans.gradle.com/s/u2rkcnww2fs24/performance/buildCache) is > saving about 5% of total cpu time, should we consider setting up a remote > build cache? > > If mine: you could assume my proposal is like Brian's but with full >> isolated Jenkins builds. This would be strictly worse, since it would add >> redundant builds of ancestors. I am assuming that you always run a separate >> Jenkins job for every descendant. Still, many modules have fewer >> descendants. And they do not trigger all the way up to the root and down to >> all descendants of the root.
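The "metadata next to the modules" idea sketched in this message might look roughly like the following: each module declares its direct dependencies, and the triggers for a change are assembled by walking the declared ancestry to find the module plus its descendants. The module names and the dependency map here are hypothetical placeholders.

```python
# Sketch: per-module metadata declares direct dependencies; job triggers
# are assembled so a change re-tests the module and its descendants only.
# Module names and dependencies are hypothetical, for illustration.
MODULE_DEPS = {
    "sdks/java/core": [],
    "runners/flink": ["sdks/java/core"],
    "sdks/java/io/jdbc": ["sdks/java/core"],
    "sdks/java/testing/nexmark": ["sdks/java/core", "runners/flink"],
}


def modules_to_test(changed_module):
    """Return the changed module plus every module that transitively
    depends on it -- the 'descendants' in this thread's terminology."""
    affected = {changed_module}
    grew = True
    while grew:
        grew = False
        for mod, deps in MODULE_DEPS.items():
            if mod not in affected and affected.intersection(deps):
                affected.add(mod)
                grew = True
    return sorted(affected)
```

Note the asymmetry this gives: a leaf change like the JDBC IO triggers only itself, while a core change fans out to everything, matching the trade-off discussed above.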
Re: Finer-grained test runs?
On Wed, Jul 8, 2020 at 9:22 PM Kenneth Knowles wrote: > I like your use of "ancestor" and "descendant". I will adopt it. > > On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw > wrote: > >> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik wrote: >> > >> > I'm not sure that breaking it up will be significantly faster since >> each module needs to build its ancestors and run tests of itself and all of >> its descendants which isn't a trivial amount of work. We have only so many >> executors and with the increased number of jobs, won't we just be waiting >> for queued jobs to start? > > > > I think that depends on how many fewer tests we could run (or rerun) >> for the average PR. (It would also be nice if we could share build >> artifacts across executors (is there something like ccache for >> javac?), but maybe that's too far-fetched?) >> > > Robert: The gradle cache should remain valid across runs, I think... my > latest understanding was that it was a robust up-to-date check (aka not > `make`). We may have messed this up, as I am not seeing as much caching as > I would expect nor as much as I see locally. We had to do some tweaking in > the maven days to put the .m2 directory outside of the realm wiped for each > new build. Maybe we are clobbering the Gradle cache too. That might > actually make most builds so fast we do not care about my proposal. > The gradle cache relies on our inputs/outputs to be specified correctly. It's great that this has been fixed since I was under the impression that it was disabled and/or that we used --rerun-tasks everywhere. Luke: I am not sure if you are replying to my email or to Brian's. > If Brian's: it does not result in redundant build (if plugin works) since > it would be one Gradle build process. But it does do a full build if you > touch something at the root of the ancestry tree like core SDK or model. 
I > would like to avoid automatically testing descendants if we can, since > things like Nexmark and most IOs are not sensitive to the vast majority of > model or core SDK changes. Runners are borderline. > I believe that the cost of fixing an issue that is found later once the test starts failing because the test wasn't run as part of the PR has a much higher order of magnitude of cost to triage and fix. Mostly due to loss of context from the PR author/reviewer and if the culprit PR can't be found then whoever is trying to fix it. If we are willing to not separate out into individual jobs then we are really trying to make the job faster. How much digging have folks done into the build scans since they show a lot of details that are useful around what is slow for a specific job. Take the Java Precommit for example: * The timeline of what tasks ran when: https://scans.gradle.com/s/u2rkcnww2fs24/timeline (looks like nexmark testing is 30 mins long and is the straggler) * It looks like our build cache ( https://scans.gradle.com/s/u2rkcnww2fs24/performance/buildCache) is saving about 5% of total cpu time, should we consider setting up a remote build cache? > If mine: you could assume my proposal is like Brian's but with full > isolated Jenkins builds. This would be strictly worse, since it would add > redundant builds of ancestors. I am assuming that you always run a separate > Jenkins job for every descendant. Still, many modules have fewer > descendants. And they do not trigger all the way up to the root and down to > all descendants of the root. > > I was replying to yours since differentiated jobs is what gives visibility. I agree that Brian's approach would make the build faster if it could figure out everything that needs to run easily and be easy to maintain. > From a community perspective, extensions and IOs are the most likely use > case for newcomers. 
For the person who comes to add or improve FooIO, it is > not a good experience to hit a flake in RabbitMqIO or JdbcIO or > DataflowRunner or FlinkRunner. > If flakes had a very low failure budget then as a community this would be a non-issue. > I think the plugin Brian mentioned is only a start. It would be even > better for each module to have an opt-in list of descendants to test on > precommit. This works well with a rollback-first strategy on post-commit. > We can then replay the PR while triggering the postcommits that failed. > > > I agree that we would have better visibility though in github and also >> in Jenkins. >> >> I do have to say having to scroll through a huge number of github >> checks is not always an improvement. >> > > +1 but OTOH the gradle scan is sometimes too fine grained or associates > logs oddly (I skip the Jenkins status page almost always) > > >> > Fixing flaky tests would help improve our test signal as well. Not many >> willing people here though but could be less work than building and >> maintaining so many different jobs. >> >> +1 >> > > I agree with fixing flakes, but I want to treat the occurrence and > resolution of flakiness as standard operations. Just as bug counts increase continuously as a project grows, so will overall flakiness.
Re: Finer-grained test runs?
I like your use of "ancestor" and "descendant". I will adopt it. On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw wrote: > On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik wrote: > > > > I'm not sure that breaking it up will be significantly faster since each > module needs to build its ancestors and run tests of itself and all of its > descendants which isn't a trivial amount of work. We have only so many > executors and with the increased number of jobs, won't we just be waiting > for queued jobs to start? I think that depends on how many fewer tests we could run (or rerun) > for the average PR. (It would also be nice if we could share build > artifacts across executors (is there something like ccache for > javac?), but maybe that's too far-fetched?) > Robert: The gradle cache should remain valid across runs, I think... my latest understanding was that it was a robust up-to-date check (aka not `make`). We may have messed this up, as I am not seeing as much caching as I would expect nor as much as I see locally. We had to do some tweaking in the maven days to put the .m2 directory outside of the realm wiped for each new build. Maybe we are clobbering the Gradle cache too. That might actually make most builds so fast we do not care about my proposal. Luke: I am not sure if you are replying to my email or to Brian's. If Brian's: it does not result in redundant build (if plugin works) since it would be one Gradle build process. But it does do a full build if you touch something at the root of the ancestry tree like core SDK or model. I would like to avoid automatically testing descendants if we can, since things like Nexmark and most IOs are not sensitive to the vast majority of model or core SDK changes. Runners are borderline. If mine: you could assume my proposal is like Brian's but with full isolated Jenkins builds. This would be strictly worse, since it would add redundant builds of ancestors. I am assuming that you always run a separate Jenkins job for every descendant. 
Still, many modules have fewer descendants. And they do not trigger all the way up to the root and down to all descendants of the root. From a community perspective, extensions and IOs are the most likely use case for newcomers. For the person who comes to add or improve FooIO, it is not a good experience to hit a flake in RabbitMqIO or JdbcIO or DataflowRunner or FlinkRunner. I think the plugin Brian mentioned is only a start. It would be even better for each module to have an opt-in list of descendants to test on precommit. This works well with a rollback-first strategy on post-commit. We can then replay the PR while triggering the postcommits that failed. > I agree that we would have better visibility though in github and also in > Jenkins. > > I do have to say having to scroll through a huge number of github > checks is not always an improvement. > +1 but OTOH the gradle scan is sometimes too fine grained or associates logs oddly (I skip the Jenkins status page almost always) > > Fixing flaky tests would help improve our test signal as well. Not many > willing people here though but could be less work than building and > maintaining so many different jobs. > > +1 > I agree with fixing flakes, but I want to treat the occurrence and resolution of flakiness as standard operations. Just as bug counts increase continuously as a project grows, so will overall flakiness. Separating flakiness signals will help to prioritize which flakes to address. Kenn > > On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles wrote: > >> > >> That's a good start. It is new enough and with few enough commits that > I'd want to do some thorough experimentation. Our build is complex enough > with a lot of ad hoc coding that we might end up maintaining whatever we > choose... > >> > >> In my ideal scenario the list of "what else to test" would be manually > editable, or even strictly opt-in. Automatically testing everything that > might be affected quickly runs into scaling problems too.
It could make > sense in post-commit but less so in pre-commit. > >> > >> Kenn > >> > >> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette > wrote: > >>> > >>> > We could have one "test the things" Jenkins job if the underlying > tool (Gradle) could resolve what needs to be run. > >>> > >>> I think this would be much better. Otherwise it seems our Jenkins > definitions are just duplicating information that's already stored in the > build.gradle files which seems error-prone, especially for tests validating > combinations of artifacts. I did some quick searching and came across [1]. > It doesn't look like the project has had a lot of recent activity, but it > claims to do what we need: > >>> > >>> > The plugin will generate new tasks on the root project for each task > provided on the configuration with the following pattern > ${taskName}ChangedModules. > >>> > These generated tasks will run the changedModules task to get the > list of changed modules and for each one will call the given task.
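"Separating flakiness signals will help to prioritize which flakes to address," as the message above puts it, presupposes a per-suite flake metric. One minimal way to derive it from run history is sketched below; the record format (suite, commit, passed) and the definition of a flake (same commit both failed and passed) are assumptions for illustration.

```python
# Sketch: separate flakiness signal per suite so flakes can be prioritized.
# A (suite, commit) pair that both failed and passed counts as one flaky
# commit for that suite. The run-history format is hypothetical.
from collections import defaultdict


def flake_rates(runs):
    """runs: iterable of (suite, commit, passed) tuples.
    Returns {suite: fraction of that suite's commits that flaked}."""
    outcomes = defaultdict(set)
    for suite, commit, passed in runs:
        outcomes[(suite, commit)].add(passed)
    per_suite = defaultdict(lambda: [0, 0])  # suite -> [flaky, total] commits
    for (suite, _commit), seen in outcomes.items():
        per_suite[suite][1] += 1
        if seen == {True, False}:
            per_suite[suite][0] += 1
    return {s: flaky / total for s, (flaky, total) in per_suite.items()}
```

With the monolithic job, all of these suites collapse into one noisy signal; splitting jobs (or at least reporting) per suite is what makes a ranking like this possible.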
Re: Finer-grained test runs?
On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik wrote: > > I'm not sure that breaking it up will be significantly faster since each > module needs to build its ancestors and run tests of itself and all of its > descendants which isn't a trivial amount of work. We have only so many > executors and with the increased number of jobs, won't we just be waiting for > queued jobs to start? I think that depends on how many fewer tests we could run (or rerun) for the average PR. (It would also be nice if we could share build artifacts across executors (is there something like ccache for javac?), but maybe that's too far-fetched?) > I agree that we would have better visibility though in github and also in > Jenkins. I do have to say having to scroll through a huge number of github checks is not always an improvement. > Fixing flaky tests would help improve our test signal as well. Not many > willing people here though but could be less work than building and > maintaining so many different jobs. +1 > On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles wrote: >> >> That's a good start. It is new enough and with few enough commits that I'd >> want to do some thorough experimentation. Our build is complex enough with a >> lot of ad hoc coding that we might end up maintaining whatever we choose... >> >> In my ideal scenario the list of "what else to test" would be manually >> editable, or even strictly opt-in. Automatically testing everything that >> might be affected quickly runs into scaling problems too. It could make >> sense in post-commit but less so in pre-commit. >> >> Kenn >> >> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette wrote: >>> >>> > We could have one "test the things" Jenkins job if the underlying tool >>> > (Gradle) could resolve what needs to be run. >>> >>> I think this would be much better.
Otherwise it seems our Jenkins >>> definitions are just duplicating information that's already stored in the >>> build.gradle files which seems error-prone, especially for tests validating >>> combinations of artifacts. I did some quick searching and came across [1]. >>> It doesn't look like the project has had a lot of recent activity, but it >>> claims to do what we need: >>> >>> > The plugin will generate new tasks on the root project for each task >>> > provided on the configuration with the following pattern >>> > ${taskName}ChangedModules. >>> > These generated tasks will run the changedModules task to get the list of >>> > changed modules and for each one will call the given task. >>> >>> Of course this would only really help us with java tests as gradle doesn't >>> know much about the structure of dependencies within the python (and go?) >>> SDK. >>> >>> Brian >>> >>> [1] https://github.com/ismaeldivita/change-tracker-plugin >>> >>> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles wrote: Hi all, I wanted to start a discussion about getting finer grained test execution more focused on particular artifacts/modules. In particular, I want to gather the downsides and impossibilities. So I will make a proposal that people can disagree with easily. Context: job_PreCommit_Java is a monolithic job that... - takes 40-50 minutes - runs tests of maybe a bit under 100 modules - executes over 10k tests - runs on any change to model/, sdks/java/, runners/, examples/java/, examples/kotlin/, release/ (only exception is SQL) - is pretty flaky (because it conflates so many independent test flakes, mostly runners and IOs) See a scan at https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest Proposal: Eliminate monolithic job and break into finer-grained jobs that operate on two principles: 1. Test run should be focused on validating one artifact or a specific integration of other artifacts. 2. 
Test run should trigger only on things that could affect the validity of that artifact. For example, a starting point is to separate: - core SDK - runner helper libs - each runner - each extension - each IO Benefits: - changing an IO or runner would not trigger the 20 minutes of core SDK tests - changing a runner would not trigger the long IO local integration tests - changing the core SDK could potentially not run as many tests in presubmit, but maybe it would and they would be separately reported results with clear flakiness signal There are 72 build.gradle files under sdks/java/ and 30 under runners/. They don't all require a separate job. But still there are enough that it is worth automation. Does anyone know of what options we might have? It does not even have to be in Jenkins. We could have one "test the things" Jenkins job if the underlying tool (Gradle) could resolve what needs to be run. Caching is not sufficient in my experience.
Re: Finer-grained test runs?
I'm not sure that breaking it up will be significantly faster since each module needs to build its ancestors and run tests of itself and all of its descendants which isn't a trivial amount of work. We have only so many executors and with the increased number of jobs, won't we just be waiting for queued jobs to start? I agree that we would have better visibility though in github and also in Jenkins. Fixing flaky tests would help improve our test signal as well. Not many willing people here though but could be less work than building and maintaining so many different jobs. On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles wrote: > That's a good start. It is new enough and with few enough commits that I'd > want to do some thorough experimentation. Our build is complex enough with > a lot of ad hoc coding that we might end up maintaining whatever we > choose... > > In my ideal scenario the list of "what else to test" would be manually > editable, or even strictly opt-in. Automatically testing everything that > might be affected quickly runs into scaling problems too. It could make > sense in post-commit but less so in pre-commit. > > Kenn > > On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette wrote: > >> > We could have one "test the things" Jenkins job if the underlying tool >> (Gradle) could resolve what needs to be run. >> >> I think this would be much better. Otherwise it seems our Jenkins >> definitions are just duplicating information that's already stored in the >> build.gradle files which seems error-prone, especially for tests validating >> combinations of artifacts. I did some quick searching and came across [1]. >> It doesn't look like the project has had a lot of recent activity, but it >> claims to do what we need: >> >> > The plugin will generate new tasks on the root project for each task >> provided on the configuration with the following pattern >> ${taskName}ChangedModules.
>> > These generated tasks will run the changedModules task to get the list >> of changed modules and for each one will call the given task. >> >> Of course this would only really help us with java tests as gradle >> doesn't know much about the structure of dependencies within the python >> (and go?) SDK. >> >> Brian >> >> [1] https://github.com/ismaeldivita/change-tracker-plugin >> >> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles wrote: >> >>> Hi all, >>> >>> I wanted to start a discussion about getting finer grained test >>> execution more focused on particular artifacts/modules. In particular, I >>> want to gather the downsides and impossibilities. So I will make a proposal >>> that people can disagree with easily. >>> >>> Context: job_PreCommit_Java is a monolithic job that... >>> >>> - takes 40-50 minutes >>> - runs tests of maybe a bit under 100 modules >>> - executes over 10k tests >>> - runs on any change to model/, sdks/java/, runners/, examples/java/, >>> examples/kotlin/, release/ (only exception is SQL) >>> - is pretty flaky (because it conflates so many independent test >>> flakes, mostly runners and IOs) >>> >>> See a scan at >>> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest >>> >>> Proposal: Eliminate monolithic job and break into finer-grained jobs >>> that operate on two principles: >>> >>> 1. Test run should be focused on validating one artifact or a specific >>> integration of other artifacts. >>> 2. Test run should trigger only on things that could affect the validity >>> of that artifact. 
>>> >>> For example, a starting point is to separate: >>> >>> - core SDK >>> - runner helper libs >>> - each runner >>> - each extension >>> - each IO >>> >>> Benefits: >>> >>> - changing an IO or runner would not trigger the 20 minutes of core SDK >>> tests >>> - changing a runner would not trigger the long IO local integration >>> tests >>> - changing the core SDK could potentially not run as many tests in >>> presubmit, but maybe it would and they would be separately reported results >>> with clear flakiness signal >>> >>> There are 72 build.gradle files under sdks/java/ and 30 under runners/. >>> They don't all require a separate job. But still there are enough that it >>> is worth automation. Does anyone know of what options we might have? It >>> does not even have to be in Jenkins. We could have one "test the things" >>> Jenkins job if the underlying tool (Gradle) could resolve what needs to be >>> run. Caching is not sufficient in my experience. >>> >>> (there are other quick fix alternatives to shrinking this time, but I >>> want to focus on bigger picture) >>> >>> Kenn >>> >>
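Conceptually, what the change-tracker plugin quoted earlier in this thread does (resolve the changed modules, then run the given task per module) could be approximated as below. The module layout is a hypothetical stand-in, and real resolution would also need the dependency graph; this only shows the file-to-module mapping step.

```python
# Sketch: map changed file paths to the modules that own them, by longest
# matching module-directory prefix. The module list is hypothetical.
MODULES = [
    "sdks/java/core",
    "sdks/java/io/jdbc",
    "sdks/java/extensions/sql",
    "runners/flink",
]


def owning_module(path):
    """Return the most specific module whose directory contains path."""
    matches = [m for m in MODULES if path.startswith(m + "/")]
    return max(matches, key=len) if matches else None


def changed_modules(changed_files):
    """Distinct modules touched by a change; files outside any module
    (e.g. top-level docs) are ignored here."""
    mods = {owning_module(f) for f in changed_files}
    mods.discard(None)
    return sorted(mods)
```

A `${taskName}ChangedModules`-style task would then invoke the given task once per returned module, which is the behavior the plugin's README describes.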
Re: Finer-grained test runs?
That's a good start. It is new enough, and has few enough commits, that I'd want to do some thorough experimentation. Our build is complex enough, with a lot of ad hoc coding, that we might end up maintaining whatever we choose...

In my ideal scenario, the list of "what else to test" would be manually editable, or even strictly opt-in. Automatically testing everything that might be affected quickly runs into scaling problems too. It could make sense in post-commit, but less so in pre-commit.

Kenn

On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette wrote:

> > We could have one "test the things" Jenkins job if the underlying tool
> > (Gradle) could resolve what needs to be run.
>
> I think this would be much better. Otherwise it seems our Jenkins
> definitions are just duplicating information that's already stored in the
> build.gradle files, which seems error-prone, especially for tests
> validating combinations of artifacts. I did some quick searching and came
> across [1]. It doesn't look like the project has had a lot of recent
> activity, but it claims to do what we need:
>
> > The plugin will generate new tasks on the root project for each task
> > provided on the configuration, with the pattern
> > ${taskName}ChangedModules. These generated tasks will run the
> > changedModules task to get the list of changed modules, and for each
> > one will call the given task.
>
> Of course this would only really help us with Java tests, as Gradle
> doesn't know much about the structure of dependencies within the Python
> (and Go?) SDKs.
>
> Brian
>
> [1] https://github.com/ismaeldivita/change-tracker-plugin
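The "manually editable, or even strictly opt-in" list mentioned above could be as small as a checked-in mapping from a module to the suites that have explicitly opted in to run when it changes. A rough sketch of the lookup, assuming a hand-curated mapping (the entries and function name here are hypothetical, not an existing Beam file):

```python
# Sketch of a strictly opt-in "what else to test" lookup.
# The mapping contents are illustrative; a real file would be hand-curated
# and reviewed like any other source change.
OPT_IN_DOWNSTREAM = {
    "sdks/java/core": ["runners/direct-java", "sdks/java/io/google-cloud-platform"],
    "runners/flink": [],  # nothing has opted in; only the module itself runs
}

def suites_to_run(changed_modules):
    """Return the modules to test: each changed module, plus any suite
    that explicitly opted in to run when that module changes."""
    to_run = set(changed_modules)
    for module in changed_modules:
        to_run.update(OPT_IN_DOWNSTREAM.get(module, []))
    return sorted(to_run)
```

Because the mapping is opt-in rather than computed, a module that changes and has no entry triggers only its own tests, which is what keeps this from running into the scaling problems of "test everything that might be affected."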
Re: Finer-grained test runs?
> We could have one "test the things" Jenkins job if the underlying tool
> (Gradle) could resolve what needs to be run.

I think this would be much better. Otherwise it seems our Jenkins definitions are just duplicating information that's already stored in the build.gradle files, which seems error-prone, especially for tests validating combinations of artifacts. I did some quick searching and came across [1]. It doesn't look like the project has had a lot of recent activity, but it claims to do what we need:

> The plugin will generate new tasks on the root project for each task
> provided on the configuration, with the pattern ${taskName}ChangedModules.
> These generated tasks will run the changedModules task to get the list of
> changed modules, and for each one will call the given task.

Of course this would only really help us with Java tests, as Gradle doesn't know much about the structure of dependencies within the Python (and Go?) SDKs.

Brian

[1] https://github.com/ismaeldivita/change-tracker-plugin
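For the archive: the core of what a changedModules-style task has to do is map changed file paths back to their nearest enclosing Gradle module. The plugin's actual implementation isn't shown here; this is a hypothetical Python sketch of that mapping, with the module list and changed files supplied by the caller (in real use they would come from the settings.gradle and `git diff --name-only`):

```python
def changed_modules(changed_files, module_dirs):
    """Map changed file paths to the most specific enclosing module.

    changed_files: paths relative to the repo root,
                   e.g. the output of `git diff --name-only`.
    module_dirs:   directories that contain a build.gradle file.
    """
    modules = set()
    for path in changed_files:
        # Pick the deepest module whose directory is a prefix of the path,
        # so sdks/java/io/kafka wins over a hypothetical sdks/java parent.
        candidates = [m for m in module_dirs if path.startswith(m + "/")]
        if candidates:
            modules.add(max(candidates, key=len))
    return sorted(modules)
```

Files outside any module (e.g. a top-level README) simply map to nothing, which is the behavior you'd want for a precommit trigger.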
Finer-grained test runs?
Hi all,

I wanted to start a discussion about getting finer-grained test execution, more focused on particular artifacts/modules. In particular, I want to gather the downsides and impossibilities, so I will make a proposal that people can disagree with easily.

Context: job_PreCommit_Java is a monolithic job that...

- takes 40-50 minutes
- runs tests of maybe a bit under 100 modules
- executes over 10k tests
- runs on any change to model/, sdks/java/, runners/, examples/java/, examples/kotlin/, or release/ (the only exception is SQL)
- is pretty flaky (because it conflates so many independent test flakes, mostly in runners and IOs)

See a scan at https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest

Proposal: Eliminate the monolithic job and break it into finer-grained jobs that operate on two principles:

1. A test run should be focused on validating one artifact or a specific integration of other artifacts.
2. A test run should trigger only on things that could affect the validity of that artifact.

For example, a starting point is to separate:

- the core SDK
- the runner helper libs
- each runner
- each extension
- each IO

Benefits:

- changing an IO or runner would not trigger the 20 minutes of core SDK tests
- changing a runner would not trigger the long IO local integration tests
- changing the core SDK could potentially not run as many tests in presubmit; but maybe it would, and they would be separately reported results with a clear flakiness signal

There are 72 build.gradle files under sdks/java/ and 30 under runners/. They don't all require a separate job, but there are still enough that automation is worth it. Does anyone know what options we might have? It does not even have to be in Jenkins. We could have one "test the things" Jenkins job if the underlying tool (Gradle) could resolve what needs to be run. Caching alone has not been sufficient in my experience.

(There are other quick-fix alternatives for shrinking this time, but I want to focus on the bigger picture.)

Kenn
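Principle 2 above amounts to a table of trigger path patterns per job, which is what the monolithic job already does with one giant pattern. A minimal sketch of that trigger resolution (the job names and patterns here are made up for illustration, not the actual Jenkins configuration):

```python
import fnmatch

# Hypothetical trigger table: job -> path patterns that should wake it up.
# Note fnmatch's "*" matches across "/" as well, so one pattern covers a subtree.
TRIGGERS = {
    "java_core_precommit": ["model/*", "sdks/java/core/*"],
    "flink_runner_precommit": ["runners/flink/*", "runners/core-java/*"],
    "kafka_io_precommit": ["sdks/java/io/kafka/*"],
}

def jobs_for_change(changed_files):
    """Return the jobs whose trigger patterns match any changed file."""
    return sorted(
        job for job, patterns in TRIGGERS.items()
        if any(fnmatch.fnmatch(f, p) for f in changed_files for p in patterns)
    )
```

The maintenance question in the thread is exactly whether a table like this is written by hand per Jenkins job, generated from the Gradle dependency graph, or replaced by Gradle resolving the work itself.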