Mick, this is fantastic!

I'll wait another day to see if anyone else chimes in. (I would also love to
hear from CassCI folks, or really anyone else who has wrestled with this, even
for internal forks.)

On Tue, Feb 4, 2020 at 10:37 AM Mick Semb Wever <m...@apache.org> wrote:

> Nate, I leave it to you to forward what-you-chose to the board@'s thread.
>
>
> > Are there still troubles and what are they?
>
>
> TL;DR
>   the ASF could provide the Cassandra community with an isolated Jenkins
> installation, so that we can manage and control the Jenkins master, as
> well as ensure that all hardware donated for Jenkins agents is dedicated
> and isolated to us.
>
>
> The long writeup…
>
> For Cassandra's use of ASF's Jenkins I see the following problems.
>
> ** Lack of trust (aka reliability)
>
> The Jenkins agents re-use their workspaces, as opposed to using new
> containers per test run, leading to broken agents, disks, git clones, etc.
> One broken test run, or a broken agent, too easily affects subsequent test
> executions.
>
> The complexity (and flakiness) around our tests is a real problem. CI on
> a project like Cassandra is a beast, and the community is very limited in
> what it can do; it really needs the help of larger companies. Effort is
> required in fixing the broken, the flaky, and the ignored tests.
> Parallelising the tests will help by better isolating failures, but tests
> (and their execution scripts) also need to be better at cleaning up after
> themselves, or a more container-based approach needs to be taken.
>
> Another issue is that other projects sometimes use the agents, and Infra
> sometimes edits our build configurations (out of necessity).
>
>
> ** Lack of resources (throughput and response)
>
> Having only 9 agents, none of which can run the large dtests, is a
> problem. All 9 are from Instaclustr, much kudos! Three companies have
> recently said they will donate resources; this is a work in progress.
>
> We have four release branches where we would like to provide per-commit
> post-commit testing. Each complete test execution currently takes 24hr+.
> Parallelising tests atm won't help much as the agents are generally
> saturated (with the pipelines doing the top-level parallelisation). Once we
> get more hardware in place, it will make sense to look into parallelising
> the tests more, for the sake of improving throughput.
>
> The throughput of tests will also improve with effort put into
> removing/rewriting long-running and inefficient tests. Also, and I think
> this is LHF, throughput could be improved by using (or taking inspiration
> from) Apache Yetus so as to only run tests on what is relevant in the
> patch/commit. Ref:
> http://yetus.apache.org/documentation/0.11.1/precommit-basic/
>
>
> ** Difficulty in use
>
> Jenkins is clumsy to use compared to the CI systems we use more often
> today: Travis, CircleCI, GH Actions.
>
> One of the complaints has been that only committers can kick off CI for
> patches (i.e. pre-commit CI runs). But I don't believe this to be a crucial
> issue, for a number of reasons.
>
> 1. Thorough CI testing of a patch only needs to happen during the review
> process, in which a committer needs to be involved anyway.
> 2. We don't have enough Jenkins agents to handle the amount of throughput
> that automated branch/patch/pull-request testing would require.
> 3. Our tests could allow unknown contributors to take ownership of the
> agent servers (e.g. via the execution of bash scripts).
> 4. We have CircleCI working that provides basic testing for
> work-in-progress patches.
>
>
> Focusing on post-commit CI and having canonical results for our release
> branches, I think it then boils down to the stability and throughput of
> tests, and the persistence and permanence of results.
>
> The persistence and permanence of results is a bugbear for me. It has
> been partially addressed by posting the build results to the builds@
> ML. But this only provides a (pretty raw) summary of the results. I'm keen
> to take the next step of posting CI results back to committed jira
> tickets (but am waiting to see Jenkins run stably for a while). If we
> had our own Jenkins master we could then look into retaining more/all build
> results. Being able to see the longer-term trends of test results as well
> as execution times would, I hope, add the incentive to get more folk
> involved.
>
> Looping back to the ASF and what they could do: providing us with an
> isolated Jenkins would help us a lot in improving the stability and
> usability issues. Having our own master would simplify the setup, use, and
> debugging of Jenkins. It would still require some sunk cost but hopefully
> we'd end up with something better tailored to our needs. And isolated
> agents would help restore confidence.
>
> regards,
> Mick
>
> PS I really want to hear from those that were involved in the past with
> cassci; your skills and experience on this topic surpass anything I've got.
>
>
>
> On Sun, 2 Feb 2020, at 22:51, Nate McCall wrote:
> > Hi folks,
> > The board is looking for feedback on CI infrastructure. I'm happy to take
> > some (constructive) comments back. (Shuler, Mick and David Capwell
> > specifically as folks who've most recently wrestled with this a fair
> > bit).
> >
> > Thanks,
> > -Nate
> >
> > ---------- Forwarded message ---------
> > From: Dave Fisher <w...@apache.org>
> > Date: Mon, Feb 3, 2020 at 8:58 AM
> > Subject: [CI] What are the troubles projects face with CI and Infra
> > To: Apache Board <bo...@apache.org>
> >
> >
> > Hi -
> >
> > It has come to the attention of the board through looking at past board
> > reports that some projects are having problems with CI infrastructure.
> >
> > Are there still troubles and what are they?
> >
> > Regards,
> > Dave
> >
>
