No, this isn't the kind of thing that should require a vote (unless someone
really wants a vote).

On Thu, Jun 7, 2018 at 9:29 AM Udi Meiri <eh...@google.com> wrote:

> Would I need a vote on installing this plugin, or can I just open a ticket
> to infra?
>
> On Wed, Jun 6, 2018, 16:18 Robert Bradshaw <rober...@google.com> wrote:
>
>> Even if it's not perfect, seems like it'd surely be a net win (and
>> probably a large one). Also, the build cache should look back at more than
>> just the single previous build, so if any previous jobs (up to the cache
>> size limit) built/tested artifacts unchanged by the current PR, the results
>> would live in the cache.
>>
>> I would look at (a) and (b) only if this isn't already good enough.
>>
>> On Wed, Jun 6, 2018 at 3:50 PM Udi Meiri <eh...@google.com> wrote:
>>
>>> To follow up on the Jenkins Job Cacher Plugin:
>>>
>>> The idea is to use a Jenkins plugin to save and reuse the Gradle cache
>>> across successive precommit jobs.
>>> The problem with this approach is that successive precommit runs on a
>>> Jenkins server are for unrelated PRs.
>>> Say you have 2 PRs, A and B, and the precommit job for B reuses the
>>> cache left by the job for A.
>>> The diff between the two will cause tests affected by either A or B to
>>> be rerun (at least).
>>> If A modifies Python code, then the job for B must rerun ALL Python
>>> tests (since Gradle doesn't do dependency tracking for Python).
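>>>
>>> (One partial mitigation: Gradle's up-to-date checks can still skip the
>>> whole Python suite when nothing under sdks/python changed, if we declare
>>> coarse-grained inputs on the task. A sketch -- the task name and wrapper
>>> script here are made up:
>>>
>>>   task pythonPreCommit(type: Exec) {
>>>     // Rerun only when some Python source changed.
>>>     inputs.files(fileTree('sdks/python') { include '**/*.py' })
>>>     outputs.file("${buildDir}/python-precommit.stamp")
>>>     commandLine './run_python_precommit.sh'  // hypothetical wrapper
>>>     doLast {
>>>       def stamp = file("${buildDir}/python-precommit.stamp")
>>>       stamp.parentFile.mkdirs()
>>>       stamp.text = 'ok'
>>>     }
>>>   }
>>>
>>> It doesn't help the cross-PR case above, though: any Python change in A
>>> still invalidates the inputs for B.)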
>>>
>>> Proposal:
>>> a. The cache plugin is still useful for successive Java precommit jobs,
>>> but not for Python ones (for Go, I have no idea).
>>> We could use it exclusively for Java precommits.
>>> b. To avoid running precommit jobs for code not touched by a PR, look
>>> at the paths of the files changed (rough sketch below).
>>> For example, a PR touching only files under sdks/python/... need only
>>> run Python precommit tests.
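>>>
>>> What (b) could look like in a Jenkins pipeline (the branch name and
>>> path prefixes are illustrative, and this assumes the PR has been merged
>>> onto master in the workspace):
>>>
>>>   def changed = sh(returnStdout: true,
>>>       script: 'git diff --name-only origin/master...HEAD').trim().split('\n')
>>>   // Run the Python precommit only when something under sdks/python changed.
>>>   def needPython = changed.any { it.startsWith('sdks/python/') }
>>>   // Anything outside the Python and website trees still needs the Java precommit.
>>>   def needJava = changed.any {
>>>     !it.startsWith('sdks/python/') && !it.startsWith('website/')
>>>   }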
>>>
>>> On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> I've been having a separate discussion on the proposal doc, which is
>>>> ready for another round of reviews.
>>>> Change summary:
>>>> - Changed the speed requirement to < 30 minutes and simplified the
>>>> check to an aggregate for each precommit job type.
>>>> - Updated slowness notification methods to include automated ones: a
>>>> precommit check result on GitHub and a filed bug.
>>>> - Merged in the metrics design doc.
>>>> - Added detailed design section.
>>>> - Added list of deliverables.
>>>>
>>>> What I would like is consensus regarding:
>>>> - How fast we want precommit runs to be. I propose 30m.
>>>> - Deadline for fixing a slow test before it is temporarily removed from
>>>> precommit. I propose 24 hours.
>>>>
>>>>
>>>> Replying to the thread:
>>>>
>>>> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
>>>> unaffected tests (BEAM-4400).
>>>>
>>>> 2. Java Precommit tests include integration tests (example
>>>> <https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_GradleBuild/lastCompletedBuild/testReport/org.apache.beam.examples/>
>>>> ).
>>>> We could split these out to get much faster results, i.e., a separate
>>>> precommit just for basic integration tests (which will still need to run in
>>>> <30m).
>>>> Perhaps lint checks for Python could be split out as well.
>>>>
>>>> I'll add these suggestions to the doc tomorrow.
>>>>
>>>> On Thu, May 24, 2018 at 9:25 AM Scott Wegner <sweg...@google.com>
>>>> wrote:
>>>>
>>>>> So, it sounds like there's agreement that we should improve precommit
>>>>> times by only running necessary tests, and configuring Jenkins Job
>>>>> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
>>>>> [1] to follow-up on this.
>>>>>
>>>>> Getting back to Udi's original proposal [2]: I see value in defining a
>>>>> metric and target for overall pre-commit timing. The proposal for an
>>>>> initial "2 hour" target is helpful as a guardrail: we're already hitting
>>>>> it, but if we drift to a point where we're not, that should trigger some
>>>>> action to be taken to get back to a healthy state.
>>>>>
>>>>> I wouldn't mind separately setting a more aspirational goal of getting
>>>>> the pre-commits even faster (i.e. 15-30 mins), but I suspect that would
>>>>> require a concerted effort to evaluate and improve existing tests across
>>>>> the codebase. One idea would be to ensure the metric reporting can show
>>>>> the trend, and which tests are responsible for the most wall time, so
>>>>> that we know where to invest any efforts to improve tests.
>>>>>
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>>>>> [2]
>>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>>>
>>>>>
>>>>> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles <k...@google.com>
>>>>> wrote:
>>>>>
>>>>>> With regard to the Job Cacher Plugin: I think it is an infra ticket
>>>>>> to install? And I guess we need it longer term when we move to
>>>>>> containerized builds anyhow? One thing I've experienced with the
>>>>>> Travis-CI cache is that the time spent uploading & downloading the
>>>>>> remote cache (in that case, all the pip-installed dependencies)
>>>>>> negated the benefits. Probably for Beam it will have a greater benefit
>>>>>> if we can skip most of the build.
>>>>>>
>>>>>> Regarding integration tests in precommit: I think it is OK to run
>>>>>> maybe one Dataflow job in precommit, but it should be in parallel with 
>>>>>> the
>>>>>> unit tests and just a smoke test that takes 5 minutes, not a suite that
>>>>>> takes 35 minutes. So IMO that is low-hanging fruit. If this would make
>>>>>> postcommit unstable, then it also means precommit is unstable. Both are
>>>>>> troublesome.
>>>>>>
>>>>>> More short term, some possible hacks:
>>>>>>
>>>>>>  - Point Gradle's caches outside the git workspace (see the
>>>>>> settings.gradle sketch below). We already did this for .m2 and it
>>>>>> helped a lot.
>>>>>>  - Intersect touched files with projects. Our nonstandard project
>>>>>> names might be a pain here. Not sure if fixing that is on the roadmap.
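>>>>>>
>>>>>> For the first hack, something like this in settings.gradle might do
>>>>>> it (a sketch; the directory is illustrative):
>>>>>>
>>>>>>   buildCache {
>>>>>>     local {
>>>>>>       // Lives on the agent, outside the workspace, so it survives
>>>>>>       // workspace wipes between builds.
>>>>>>       directory = new File('/home/jenkins/gradle-build-cache')
>>>>>>     }
>>>>>>   }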
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Wed, May 23, 2018 at 9:31 AM Ismaël Mejía <ieme...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I second Robert's idea of ‘intelligently’ running only the affected
>>>>>>> tests: there is probably no need to run Java for a Go fix (and if
>>>>>>> any issue slips through, it can be caught in postcommit), and the
>>>>>>> same goes for a dev who just fixed something in KafkaIO and has to
>>>>>>> wait for other IO tests to pass. I suppose that languages, IOs and
>>>>>>> extensions are ‘easy’ to isolate, so maybe we can start with those.
>>>>>>>
>>>>>>> Earlier signals are definitely great to have too, but I'm not sure
>>>>>>> how we can get those with the current infra.
>>>>>>>
>>>>>>> From a quick look, the biggest chunk of time is consumed by the
>>>>>>> examples module, probably because those tests run in Dataflow with
>>>>>>> real IOs, no? That module alone takes ~35 minutes, so maybe moving
>>>>>>> it to postcommit will gain us some quick improvement.
>>>>>>> On the other hand, we should probably not dismiss the consequences
>>>>>>> of moving more stuff to postcommit, given that our current
>>>>>>> postcommit is not the most stable, or the quickest: the Dataflow
>>>>>>> suite alone takes 1h30!
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 22, 2018 at 12:01 AM Mikhail Gryzykhin <
>>>>>>> mig...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > What we can do here is estimate how much effort we want to put in
>>>>>>> and set a remote target, such as:
>>>>>>> > Third quarter 2018 -- 1hr SLO
>>>>>>> > Fourth quarter 2018 -- 30min SLO
>>>>>>> > etc.
>>>>>>>
>>>>>>> > Combined with a policy for newly added tests, this can give us
>>>>>>> some goal to aim for.
>>>>>>>
>>>>>>> > --Mikhail
>>>>>>>
>>>>>>>
>>>>>>> > On Mon, May 21, 2018 at 2:06 PM Scott Wegner <sweg...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> >> Thanks for the proposal, I left comments in the doc. Overall I
>>>>>>> think
>>>>>>> it's a great idea.
>>>>>>>
>>>>>>> >> I've seen other projects with much faster pre-commits, and it
>>>>>>> requires
>>>>>>> strict guidelines on unit test design and keeping tests isolated
>>>>>>> in-memory
>>>>>>> as much as possible. That's not currently the case in Java; we have
>>>>>>> pre-commits which submit pipelines to the Dataflow service.
>>>>>>>
>>>>>>> >> I don't know if it's feasible to get Java down to 15-20 mins in
>>>>>>> the
>>>>>>> short term, but a good starting point would be to document the
>>>>>>> requirements
>>>>>>> for a test to run as pre-commit, and start enforcing it for new
>>>>>>> tests.
>>>>>>>
>>>>>>>
>>>>>>> >> On Fri, May 18, 2018 at 3:25 PM Henning Rohde <hero...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> >>> Good proposal. I think it should be considered in tandem with
>>>>>>> the "No
>>>>>>> commit on red post-commit" proposal and could be far more ambitious
>>>>>>> than 2
>>>>>>> hours. For example, something in the <15-20 mins range, say, would
>>>>>>> be much
>>>>>>> less of an inconvenience to the development effort. Go takes ~3
>>>>>>> mins, which
>>>>>>> means that it is practical to wait until a PR is green before asking
>>>>>>> anyone
>>>>>>> to look at it. If I need to wait for a Java or Python pre-commit, I
>>>>>>> task
>>>>>>> switch and come back later. If the post-commits are enforced to be
>>>>>>> green,
>>>>>>> we could possibly gain a much more productive flow at the cost of the
>>>>>>> occasional post-commit break, compared to now. Maybe IOs can be less
>>>>>>> extensively tested pre-commit, for example, or only if actually
>>>>>>> changed?
>>>>>>>
>>>>>>> >>> I also like Robert's suggestion of splitting up pre-commits into
>>>>>>> something more fine-grained to get a clear partial signal quicker.
>>>>>>> If we
>>>>>>> have an adequate number of Jenkins slots, it might also speed things
>>>>>>> up
>>>>>>> overall.
>>>>>>>
>>>>>>> >>> Thanks,
>>>>>>> >>>    Henning
>>>>>>>
>>>>>>> >>> On Fri, May 18, 2018 at 12:30 PM Scott Wegner <
>>>>>>> sweg...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> >>>> re: intelligently skipping tests for code that doesn't change
>>>>>>> (i.e.
>>>>>>> Java tests on Python PR): this should be possible. We already have
>>>>>>> build-caching enabled in Gradle, but I believe it is local to the git
>>>>>>> workspace and doesn't persist between Jenkins runs.
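>>>>>>>
>>>>>>> If we want build results to persist across runs, Gradle also
>>>>>>> supports a shared remote build cache configured in settings.gradle.
>>>>>>> A sketch (the URL is hypothetical):
>>>>>>>
>>>>>>>   buildCache {
>>>>>>>     remote(HttpBuildCache) {
>>>>>>>       url = 'https://beam-build-cache.example.org/cache/'
>>>>>>>       // Let CI populate the cache; everyone else only reads from it
>>>>>>>       // (JENKINS_URL is set in the environment of Jenkins builds).
>>>>>>>       push = System.getenv('JENKINS_URL') != null
>>>>>>>     }
>>>>>>>   }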
>>>>>>>
>>>>>>> >>>> With a quick search, I see there is a Jenkins Build Cacher
>>>>>>> Plugin [1]
>>>>>>> that hooks into Gradle build cache and does exactly what we need.
>>>>>>> Does
>>>>>>> anybody know whether we could get this enabled on our Jenkins?
>>>>>>>
>>>>>>> >>>> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>>>>>>>
>>>>>>> >>>> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw <
>>>>>>> rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> >>>>> [somehow  my email got garbled...]
>>>>>>>
>>>>>>> >>>>> Now that we're using gradle, perhaps we could be more
>>>>>>> intelligent
>>>>>>> about only running the affected tests? E.g. when you touch Python
>>>>>>> (or Go)
>>>>>>> you shouldn't need to run the Java precommit at all, which would
>>>>>>> reduce the
>>>>>>> latency for those PRs and also the time spent in queue. Presumably
>>>>>>> this
>>>>>>> could even be applied per-module for the Java tests. (Maybe a large,
>>>>>>> shared
>>>>>>> build cache could help here as well...)
>>>>>>>
>>>>>>> >>>>> I also wouldn't be opposed to a quicker immediate signal, plus
>>>>>>> more
>>>>>>> extensive tests before actually merging. It's also nice to not have
>>>>>>> to wait
>>>>>>> an hour to see that you have a lint error; quick stuff like that
>>>>>>> could be
>>>>>>> signaled quickly before a contributor loses context.
>>>>>>>
>>>>>>> >>>>> - Robert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> >>>>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles <
>>>>>>> k...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> >>>>>> I like the idea. I think it is a good time for the project to
>>>>>>> start
>>>>>>> tracking this and keeping it usable.
>>>>>>>
>>>>>>> >>>>>> Certainly 2 hours is more than enough, is that not so? The
>>>>>>> Java
>>>>>>> precommit seems to take <=40 minutes while Python takes ~20 and Go
>>>>>>> is so
>>>>>>> fast it doesn't matter. Do we have enough stragglers that we don't
>>>>>>> make it
>>>>>>> in the 95th percentile? Is the time spent in the Jenkins queue?
>>>>>>>
>>>>>>> >>>>>> For our current coverage, I'd be willing to go for:
>>>>>>>
>>>>>>> >>>>>>    - 1 hr hard cap (someone better at stats could choose %ile)
>>>>>>> >>>>>>    - roll back or remove test from precommit if fix looks
>>>>>>> like more
>>>>>>> than 1 week (roll back if it is perf degradation, remove test from
>>>>>>> precommit if it is additional coverage that just doesn't fit in the
>>>>>>> time)
>>>>>>>
>>>>>>> >>>>>> There's a longer-term issue that doing a full build each time
>>>>>>> is
>>>>>>> expected to linearly scale up with the size of our repo (it is the
>>>>>>> monorepo
>>>>>>> problem but for a minirepo) so there is no cap that is feasible
>>>>>>> until we
>>>>>>> have effective cross-build caching. And my long-term goal would be
>>>>>>> <30
>>>>>>> minutes. At the latency of opening a pull request and then checking
>>>>>>> your
>>>>>>> email that's not burdensome, but an hour is.
>>>>>>>
>>>>>>> >>>>>> Kenn
>>>>>>>
>>>>>>> >>>>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri <eh...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> >>>>>>> Hi,
>>>>>>> >>>>>>> I have a proposal to improve contributor experience by
>>>>>>> keeping
>>>>>>> precommit times low.
>>>>>>>
>>>>>>> >>>>>>> I'm looking to get community consensus and approval about:
>>>>>>> >>>>>>> 1. How long precommits should take. 2 hours @ the 95th
>>>>>>> percentile over the past 4 weeks is the current proposal.
>>>>>>> >>>>>>> 2. The process for dealing with slowness. Do we fix, roll
>>>>>>> back, or remove a test from precommit?
>>>>>>> Rolling back if a fix is estimated to take longer than 2 weeks is
>>>>>>> the current proposal.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>>>>>
>>>>>>
