Only have a moment to respond, but Mick hit the highlights with containerization and parallelization; these help solve cleanup, speed, and cascading failures. Dynamic disposable slaves would be icing on that cake, which may require a dedicated master.

One more note on jobs, or more correctly unnecessary jobs: pipelines have a `changeset` build condition we should tinker with. There is zero reason to run a job with no actual code diff. For instance, I committed to 2.1 this morning and merged it `-s ours` into the newer branches, which changes nothing there, so running those jobs just takes up valuable resources.
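
As a rough sketch of the "no diff, no job" idea (this is not the Jenkins `changeset` condition itself, just the same check done by hand), something like the following could gate a job. The function names and the injectable `run` parameter are my own, purely for illustration. It relies on `git diff --quiet` exiting 0 when two revisions have identical trees and 1 when they differ, which is exactly the case after a `-s ours` merge.

```python
import subprocess
from typing import Callable, Sequence


def _git_exit_code(args: Sequence[str], repo: str = ".") -> int:
    """Run a git command inside `repo` and return its exit code."""
    return subprocess.run(["git", "-C", repo, *args], check=False).returncode


def has_code_diff(
    old_rev: str,
    new_rev: str,
    run: Callable[[Sequence[str]], int] = _git_exit_code,
) -> bool:
    """Return True when the trees at old_rev and new_rev differ.

    `git diff --quiet` exits 0 for "no difference" and 1 for "differences";
    any other exit code is treated as a git failure.
    """
    code = run(["diff", "--quiet", old_rev, new_rev])
    if code not in (0, 1):
        raise RuntimeError(f"git diff failed with exit code {code}")
    return code == 1
```

A job wrapper could call `has_code_diff(last_built_rev, "HEAD")` and exit early when it returns False; how `last_built_rev` gets recorded is left open here.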


Mick, this is fantastic!

I'll wait another day to see if anyone else chimes in. (Would also love to
hear from CassCI folks, anyone else really who has wrestled with this even
for internal forks).

Nate, I leave it to you to forward what-you-chose to the board@'s thread.

Are there still troubles and what are they?

   the ASF could provide the Cassandra community with an isolated Jenkins
installation, so that we can manage and control the Jenkins master, as
well as ensure all donated hardware for Jenkins agents is dedicated and
isolated to us.

The long writeup…

For Cassandra's use of ASF's Jenkins I see the following problems.

** Lack of trust (aka reliability)

The Jenkins agents re-use their workspaces, as opposed to using new
containers per test run, leading to broken agents, disks, git clones, etc.
One broken test run, or a broken agent, too easily affects subsequent test
runs.
The complexity (and flakiness) around our tests is a real problem.  CI on
a project like Cassandra is a beast and the community is very limited in
what it can do; it really needs the help of larger companies. Effort is
required in fixing the broken, the flaky, and the ignored tests.
Parallelising the tests will help by better isolating failures, but tests
(and their execution scripts) also need to be better at cleaning up after
themselves, or a more container-based approach needs to be taken.
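
To make the "nothing is re-used between runs" idea concrete, here is a minimal sketch. It is not a container, only a lighter stand-in: each run gets a fresh copy of the checkout in a temporary directory that is deleted afterwards, so leftovers from one run cannot leak into the next. The function name and the copy-based approach are assumptions of mine, not how the agents are set up today.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Sequence


def run_in_fresh_workspace(source_dir: str, command: Sequence[str]) -> int:
    """Run `command` in a throwaway copy of `source_dir`.

    The copy lives in a new temporary directory that is removed when the
    run finishes, whether the command succeeded or not. Returns the
    command's exit code.
    """
    with tempfile.TemporaryDirectory(prefix="ci-run-") as tmp:
        workspace = Path(tmp) / "workspace"
        shutil.copytree(source_dir, workspace)
        return subprocess.run(command, cwd=workspace, check=False).returncode
```

This only isolates files on disk; stray processes, ports, and anything written outside the workspace would still need a real container or a disposable agent.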

Another issue is that other projects sometimes use the agents, and Infra
sometimes edits our build configurations (out of necessity).

** Lack of resources (throughput and response)

Having only 9 agents, none of which can run the large dtests, is a
problem. All 9 are from Instaclustr, much kudos! Three companies have
recently said they will donate resources; this is a work in progress.

We have four release branches where we would like to provide per-commit
post-commit testing. Each complete test execution currently takes 24hr+.
Parallelising tests atm won't help much as the agents are generally
saturated (with the pipelines doing the top-level parallelisation). Once we
get more hardware in place, it will make sense to look into parallelising
the tests more for the sake of improving throughput.

The throughput of tests will also improve with effort put into
removing/rewriting long-running and inefficient tests. Also, and I think
this is LHF, throughput could be improved by using (or taking inspiration
from) Apache Yetus so as to only run tests on what is relevant in the
patch/commit. Ref:
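
A rough sketch of "only run what is relevant" follows. This is not Yetus's API, just the general idea: map the paths touched by a commit to the test suites worth running. The path patterns, suite names, and the run-everything fallback are illustrative assumptions.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Sequence, Tuple

# Hypothetical rules: the first matching pattern decides which suites a path needs.
RULES: Sequence[Tuple[str, Sequence[str]]] = (
    ("doc/*", ()),                      # docs only: nothing to test
    ("test/unit/*", ("unit",)),         # unit test changes: unit suite only
    ("src/java/*", ("unit", "dtest")),  # production code: everything
)

# Paths no rule knows about are treated conservatively: run everything.
FALLBACK: Sequence[str] = ("unit", "dtest")


def select_suites(
    changed_paths: Iterable[str],
    rules: Sequence[Tuple[str, Sequence[str]]] = RULES,
    fallback: Sequence[str] = FALLBACK,
) -> List[str]:
    """Return the sorted list of suites needed for the given changed paths."""
    selected = set()
    for path in changed_paths:
        for pattern, suites in rules:
            if fnmatch(path, pattern):
                selected.update(suites)
                break
        else:
            # No rule matched this path.
            selected.update(fallback)
    return sorted(selected)
```

With these rules a docs-only commit selects no suites at all, while any change to an unrecognised path still triggers the full set, which errs on the side of testing too much rather than too little.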

** Difficulty in use

Jenkins is clumsy to use compared to the CI systems we use more often
today: Travis, CircleCI, GH Actions.

One of the complaints has been that only committers can kick off CI for
patches (i.e. pre-commit CI runs).  But I don't believe this to be a crucial
issue, for a number of reasons.

1. Thorough CI testing of a patch only needs to happen during the review
process, in which a committer needs to be involved anyway.
2. We don't have enough Jenkins agents to handle the amount of throughput
that automated branch/patch/pull-request testing would require.
3. Our tests could allow unknown contributors to take ownership of the
agent servers (eg via the execution of bash scripts).
4. We have CircleCI working that provides basic testing for
work-in-progress patches.

Focusing on post-commit CI and having canonical results for our release
branches, I think it then boils down to the stability and throughput of
tests, and the persistence and permanence of results.

The persistence and permanence of results is a bugbear for me. It has
been partially addressed by posting the build results to the builds@
ML. But this only provides a (pretty raw) summary of the results. I'm keen
to take the next step of posting CI results back to committed jira
tickets (but am waiting on seeing Jenkins run stable for a while).  If we
had our own Jenkins master we could then look into retaining more/all build
results. Being able to see the longer term trends of test results as well
as execution times would, I hope, add the incentive to get more folk
involved.

Looping back to the ASF and what they could do: providing us with an
isolated Jenkins would help us a lot in improving stability and usability.
Having our own master would simplify the setup, use, and debugging of
Jenkins. It would still require some sunk cost but hopefully we'd end up
with something better tailored to our needs. And isolated agents would help
restore confidence.


PS I really want to hear from those that were involved in the past with
cassci; your skills and experience on this topic surpass anything I've got.

Hi folks,
The board is looking for feedback on CI infrastructure. I'm happy to take
some (constructive) comments back. (Shuler, Mick and David Capwell
specifically as folks who've most recently wrestled with this a fair


Hi -

It has come to the attention of the board through looking at past board
reports that some projects are having problems with CI infrastructure.

Are there still troubles and what are they?


