No, this isn't the kind of thing that should require a vote (unless someone really wants a vote).
On Thu, Jun 7, 2018 at 9:29 AM Udi Meiri <eh...@google.com> wrote:
> Would I need a vote on installing this plugin, or can I just open a ticket
> to infra?
>
> On Wed, Jun 6, 2018, 16:18 Robert Bradshaw <rober...@google.com> wrote:
>
>> Even if it's not perfect, it seems like it'd surely be a net win (and
>> probably a large one). Also, the build cache should look back at more than
>> just the single previous build, so if any previous job (up to the cache
>> size limit) built/tested artifacts unchanged by the current PR, the
>> results would live in the cache.
>>
>> I would look at (a) and (b) only if this isn't already good enough.
>>
>> On Wed, Jun 6, 2018 at 3:50 PM Udi Meiri <eh...@google.com> wrote:
>>
>>> To follow up on the Jenkins Job Cacher Plugin: the idea is to use a
>>> Jenkins plugin to save and reuse the Gradle cache across successive
>>> precommit jobs.
>>> The problem with this approach is that the precommit runs on a Jenkins
>>> server are unrelated to each other.
>>> Say you have 2 PRs, A and B, and the precommit job for B reuses the
>>> cache left by the job for A.
>>> The diff between the two will cause tests affected by both A and B to be
>>> rerun (at least).
>>> If A modifies Python code, then the job for B must rerun ALL Python
>>> tests (since Gradle doesn't do dependency tracking for Python).
>>>
>>> Proposal:
>>> a. The cache plugin is still useful for successive Java precommit jobs,
>>> but not for Python. (Go, I have no idea.)
>>> We could use it exclusively for Java precommits.
>>> b. To avoid running precommit jobs for code not touched by a PR, look at
>>> the paths of the changed files.
>>> For example, a PR touching only files under sdks/python/... need only
>>> run Python precommit tests.
>>>
>>> On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> I've been having a separate discussion on the proposal doc, which is
>>>> ready for another round of reviews.
>>>> Change summary:
>>>> - Changed the fast requirement to < 30 minutes and simplified the check
>>>> to an aggregate for each precommit job type.
>>>> - Updated slowness notification methods to include automated methods:
>>>> as a precommit check result type on GitHub, or as a bug.
>>>> - Merged in the metrics design doc.
>>>> - Added a detailed design section.
>>>> - Added a list of deliverables.
>>>>
>>>> What I would like is consensus regarding:
>>>> - How fast we want precommit runs to be. I propose 30m.
>>>> - The deadline for fixing a slow test before it is temporarily removed
>>>> from precommit. I propose 24 hours.
>>>>
>>>> Replying to the thread:
>>>>
>>>> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
>>>> unaffected tests (BEAM-4400).
>>>>
>>>> 2. Java precommit tests include integration tests (example
>>>> <https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_GradleBuild/lastCompletedBuild/testReport/org.apache.beam.examples/>).
>>>> We could split these out to get much faster results, i.e., a separate
>>>> precommit just for basic integration tests (which would still need to
>>>> run in <30m).
>>>> Perhaps lint checks for Python could be split out as well.
>>>>
>>>> I'll add these suggestions to the doc tomorrow.
>>>>
>>>> On Thu, May 24, 2018 at 9:25 AM Scott Wegner <sweg...@google.com> wrote:
>>>>
>>>>> So, it sounds like there's agreement that we should improve precommit
>>>>> times by only running necessary tests, and that configuring Jenkins Job
>>>>> Caching + the Gradle build cache is a path to get there. I've filed
>>>>> BEAM-4400 [1] to follow up on this.
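>>>>>
>>>>> To make that concrete, the Gradle side of a shared cache could look
>>>>> roughly like this in settings.gradle (an untested sketch: the URL is a
>>>>> placeholder, and whether we'd use an HTTP backend at all is still open):
>>>>>
>>>>>     // settings.gradle -- illustrative sketch only; assumes an
>>>>>     // HTTP-backed remote cache, which we don't have today.
>>>>>     buildCache {
>>>>>         local {
>>>>>             enabled = true
>>>>>         }
>>>>>         remote(HttpBuildCache) {
>>>>>             url = 'https://gradle-cache.example.org/cache/'  // placeholder URL
>>>>>             // Only CI should write to a shared cache; dev machines just read.
>>>>>             push = System.getenv('JENKINS_HOME') != null
>>>>>         }
>>>>>     }
>>>>>
>>>>> The Job Cacher Plugin route instead persists the local cache directory
>>>>> between builds, which avoids standing up a cache server at all.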
>>>>>
>>>>> Getting back to Udi's original proposal [2]: I see value in defining a
>>>>> metric and target for overall pre-commit timing. The proposed initial
>>>>> "2 hour" target is helpful as a guardrail: we're already hitting it,
>>>>> but if we drift to a point where we're not, that should trigger action
>>>>> to get back to a healthy state.
>>>>>
>>>>> I wouldn't mind separately setting a more aspirational goal of getting
>>>>> the pre-commits even faster (i.e., 15-30 mins), but I suspect that
>>>>> would require a concerted effort to evaluate and improve existing tests
>>>>> across the codebase. One idea would be to ensure the metric reporting
>>>>> can show the trend, and which tests are responsible for the most
>>>>> walltime, so that we know where to invest any efforts to improve tests.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>>>>> [2]
>>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>>>
>>>>> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles <k...@google.com> wrote:
>>>>>
>>>>>> With regard to the Job Cacher Plugin: I think it is an infra ticket
>>>>>> to install? And I guess we need it longer term when we move to
>>>>>> containerized builds anyhow? One thing I've experienced with the
>>>>>> Travis-CI cache is that the time spent uploading & downloading the
>>>>>> remote cache - in that case, all of the pip-installed dependencies -
>>>>>> negated the benefits. For Beam it will probably have a greater
>>>>>> benefit, since we can skip most of the build.
>>>>>>
>>>>>> Regarding integration tests in precommit: I think it is OK to run
>>>>>> maybe one Dataflow job in precommit, but it should be in parallel with
>>>>>> the unit tests and just a smoke test that takes 5 minutes, not a suite
>>>>>> that takes 35 minutes. So IMO that is low-hanging fruit. If this would
>>>>>> make postcommit unstable, then it also means precommit is unstable.
>>>>>> Both are troublesome.
>>>>>>
>>>>>> More short term, some possible hacks:
>>>>>>
>>>>>> - Point Gradle to a cache outside the git workspace. We already did
>>>>>> this for .m2 and it helped a lot.
>>>>>> - Intersect touched files with projects. Our nonstandard project
>>>>>> names might be a pain here. Not sure if fixing that is on the roadmap.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Wed, May 23, 2018 at 9:31 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>
>>>>>>> I second Robert's idea of ‘intelligently’ running only the affected
>>>>>>> tests; there is probably no need to run Java for a Go fix (and any
>>>>>>> issue that slips through can be caught in postcommit), and the same
>>>>>>> goes for a dev who just fixed something in KafkaIO and has to wait
>>>>>>> for the other IO tests to pass. I suppose that languages, IOs and
>>>>>>> extensions are ‘easy’ to isolate, so maybe we can start with those.
>>>>>>>
>>>>>>> Earlier signals are also definitely great to have, but I'm not sure
>>>>>>> how we can get those with the current infra.
>>>>>>>
>>>>>>> From a quick look, the biggest chunk of time is consumed by the
>>>>>>> examples module, probably because the examples run on Dataflow with
>>>>>>> real IOs, no? That module alone takes ~35 minutes, so maybe moving it
>>>>>>> to postcommit will gain us some quick improvement.
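>>>>>>>
>>>>>>> As a strawman for the affected-tests idea, the mapping from changed
>>>>>>> paths to precommit suites could start as simple as this (an untested
>>>>>>> sketch; the path prefixes are illustrative and would need to match
>>>>>>> our real layout):
>>>>>>>
>>>>>>>     // Sketch: decide which precommit suites a change needs.
>>>>>>>     // The path prefixes are examples, not a vetted mapping.
>>>>>>>     def suitesFor = { List<String> changedFiles ->
>>>>>>>         def suites = [] as Set
>>>>>>>         changedFiles.each { path ->
>>>>>>>             if (path.startsWith('sdks/python/')) suites << 'Python'
>>>>>>>             else if (path.startsWith('sdks/go/')) suites << 'Go'
>>>>>>>             else if (path.startsWith('sdks/java/') || path.startsWith('runners/')) suites << 'Java'
>>>>>>>             // Anything unclassified (build files, etc.) runs everything.
>>>>>>>             else suites.addAll(['Java', 'Python', 'Go'])
>>>>>>>         }
>>>>>>>         suites
>>>>>>>     }
>>>>>>>
>>>>>>>     def changed = 'git diff --name-only origin/master...HEAD'.execute().text.readLines()
>>>>>>>     println suitesFor(changed)
>>>>>>>
>>>>>>> A PR touching only sdks/python/... would then trigger just the Python
>>>>>>> precommit.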
>>>>>>> On the other hand, we should probably not dismiss the consequences
>>>>>>> of moving more stuff to postcommit, given that our current postcommit
>>>>>>> is not the most stable, or the quickest; the Dataflow suite alone
>>>>>>> takes 1h30!
>>>>>>>
>>>>>>> On Tue, May 22, 2018 at 12:01 AM Mikhail Gryzykhin <mig...@google.com> wrote:
>>>>>>>
>>>>>>> > What we can do here is estimate how much effort we want to put in
>>>>>>> and set a remote target.
>>>>>>> > Such as:
>>>>>>> > Third quarter 2018 -- 1hr SLO
>>>>>>> > Fourth quarter 2018 -- 30min SLO,
>>>>>>> > etc.
>>>>>>>
>>>>>>> > Combined with a policy for newly added tests, this can give us some
>>>>>>> goal to aim for.
>>>>>>>
>>>>>>> > --Mikhail
>>>>>>>
>>>>>>> > Have feedback?
>>>>>>>
>>>>>>> > On Mon, May 21, 2018 at 2:06 PM Scott Wegner <sweg...@google.com> wrote:
>>>>>>>
>>>>>>> >> Thanks for the proposal, I left comments in the doc. Overall I
>>>>>>> think it's a great idea.
>>>>>>>
>>>>>>> >> I've seen other projects with much faster pre-commits, and it
>>>>>>> requires strict guidelines on unit test design and keeping tests
>>>>>>> isolated in-memory as much as possible. That's not currently the case
>>>>>>> in Java; we have pre-commits which submit pipelines to the Dataflow
>>>>>>> service.
>>>>>>>
>>>>>>> >> I don't know if it's feasible to get Java down to 15-20 mins in
>>>>>>> the short term, but a good starting point would be to document the
>>>>>>> requirements for a test to run as a pre-commit, and start enforcing
>>>>>>> them for new tests.
>>>>>>>
>>>>>>> >> On Fri, May 18, 2018 at 3:25 PM Henning Rohde <hero...@google.com> wrote:
>>>>>>>
>>>>>>> >>> Good proposal. I think it should be considered in tandem with
>>>>>>> the "No commit on red post-commit" proposal and could be far more
>>>>>>> ambitious than 2 hours. For example, something in the <15-20 mins
>>>>>>> range, say, would be much less of an inconvenience to the development
>>>>>>> effort. Go takes ~3 mins, which means that it is practical to wait
>>>>>>> until a PR is green before asking anyone to look at it. If I need to
>>>>>>> wait for a Java or Python pre-commit, I task switch and come back
>>>>>>> later. If the post-commits are enforced to be green, we could
>>>>>>> possibly gain a much more productive flow at the cost of the
>>>>>>> occasional post-commit break, compared to now. Maybe IOs can be less
>>>>>>> extensively tested in pre-commit, for example, or only if actually
>>>>>>> changed?
>>>>>>>
>>>>>>> >>> I also like Robert's suggestion of splitting up pre-commits into
>>>>>>> something more fine-grained to get a clear partial signal quicker.
>>>>>>> If we have an adequate number of Jenkins slots, it might also speed
>>>>>>> things up overall.
>>>>>>>
>>>>>>> >>> Thanks,
>>>>>>> >>> Henning
>>>>>>>
>>>>>>> >>> On Fri, May 18, 2018 at 12:30 PM Scott Wegner <sweg...@google.com> wrote:
>>>>>>>
>>>>>>> >>>> re: intelligently skipping tests for code that doesn't change
>>>>>>> (i.e., Java tests on a Python PR): this should be possible. We
>>>>>>> already have build caching enabled in Gradle, but I believe it is
>>>>>>> local to the git workspace and doesn't persist between Jenkins runs.
>>>>>>>
>>>>>>> >>>> With a quick search, I see there is a Jenkins Job Cacher Plugin
>>>>>>> [1] that hooks into the Gradle build cache and does exactly what we
>>>>>>> need.
>>>>>>> Does anybody know whether we could get this enabled on our Jenkins?
>>>>>>>
>>>>>>> >>>> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>>>>>>>
>>>>>>> >>>> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>
>>>>>>> >>>>> [somehow my email got garbled...]
>>>>>>>
>>>>>>> >>>>> Now that we're using Gradle, perhaps we could be more
>>>>>>> intelligent about only running the affected tests? E.g., when you
>>>>>>> touch Python (or Go) you shouldn't need to run the Java precommit at
>>>>>>> all, which would reduce the latency for those PRs and also the time
>>>>>>> spent in the queue. Presumably this could even be applied per-module
>>>>>>> for the Java tests. (Maybe a large, shared build cache could help
>>>>>>> here as well...)
>>>>>>>
>>>>>>> >>>>> I also wouldn't be opposed to a quicker immediate signal, plus
>>>>>>> more extensive tests before actually merging. It's also nice to not
>>>>>>> have to wait an hour to see that you have a lint error; quick stuff
>>>>>>> like that could be signaled quickly, before a contributor loses
>>>>>>> context.
>>>>>>>
>>>>>>> >>>>> - Robert
>>>>>>>
>>>>>>> >>>>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles <k...@google.com> wrote:
>>>>>>>
>>>>>>> >>>>>> I like the idea. I think it is a good time for the project to
>>>>>>> start tracking this and keeping it usable.
>>>>>>>
>>>>>>> >>>>>> Certainly 2 hours is more than enough, is that not so? The
>>>>>>> Java precommit seems to take <=40 minutes, while Python takes ~20 and
>>>>>>> Go is so fast it doesn't matter. Do we have enough stragglers that we
>>>>>>> don't make it in the 95th percentile? Is the time spent in the
>>>>>>> Jenkins queue?
>>>>>>>
>>>>>>> >>>>>> For our current coverage, I'd be willing to go for:
>>>>>>> >>>>>> - 1 hr hard cap (someone better at stats could choose the %ile)
>>>>>>> >>>>>> - roll back or remove a test from precommit if the fix looks
>>>>>>> like more than 1 week (roll back if it is a perf degradation; remove
>>>>>>> the test from precommit if it is additional coverage that just
>>>>>>> doesn't fit in the time)
>>>>>>>
>>>>>>> >>>>>> There's a longer-term issue: doing a full build each time is
>>>>>>> expected to scale up linearly with the size of our repo (it is the
>>>>>>> monorepo problem, but for a minirepo), so no cap is feasible until we
>>>>>>> have effective cross-build caching. And my long-term goal would be
>>>>>>> <30 minutes. At the latency of opening a pull request and then
>>>>>>> checking your email that's not burdensome, but an hour is.
>>>>>>>
>>>>>>> >>>>>> Kenn
>>>>>>>
>>>>>>> >>>>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri <eh...@google.com> wrote:
>>>>>>>
>>>>>>> >>>>>>> Hi,
>>>>>>> I have a proposal to improve contributor experience by keeping
>>>>>>> precommit times low.
>>>>>>>
>>>>>>> >>>>>>> I'm looking to get community consensus and approval about:
>>>>>>> 1. How long precommits should take. 2 hours @ the 95th percentile
>>>>>>> over the past 4 weeks is the current proposal.
>>>>>>> 2. The process for dealing with slowness. Do we fix, roll back, or
>>>>>>> remove a test from precommit?
>>>>>>> Rolling back if a fix is estimated to take longer than 2 weeks is
>>>>>>> the current proposal.
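>>>>>>>
>>>>>>> To make the target measurable, something like this could pull recent
>>>>>>> build durations from the Jenkins JSON API (a rough sketch: the job
>>>>>>> name is just one example, and it looks at whatever builds Jenkins
>>>>>>> retains rather than an exact 4-week window):
>>>>>>>
>>>>>>>     import groovy.json.JsonSlurper
>>>>>>>
>>>>>>>     // Sketch: 95th-percentile duration of recent successful runs of one job.
>>>>>>>     def job = 'beam_PreCommit_Java_GradleBuild'  // repeat per precommit suite
>>>>>>>     def url = "https://builds.apache.org/job/${job}/api/json?tree=builds[duration,result]"
>>>>>>>     def builds = new JsonSlurper().parse(url.toURL()).builds
>>>>>>>     def durations = builds.findAll { it.result == 'SUCCESS' }*.duration.sort()
>>>>>>>     // Index of the 95th percentile; assumes at least one completed build.
>>>>>>>     def p95 = durations[(int) Math.ceil(durations.size() * 0.95) - 1]
>>>>>>>     println "95th percentile: ${p95 / 60000} minutes"  // duration is in ms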
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>>>>