Happy 2023 everyone!

With only four months in front of us before the first 5.0 release, I'm
hoping we can re-energize our focus on CI and Stable Trunk.

This post covers the following
 * Recap of CI improvements
 * State of Affairs
 * The Butler (Build Lead)
 * Proposal for a Repeatable Containerised CI

and it calls for the following actions
 ** we need you to sign up for a week's rotation as Build Lead!
 ** please reply in-thread with any CI issues I've forgotten,
 ** does CASSANDRA-18137 warrant a CEP?


*** Recap of CI improvements

It's been over two years since my last CI Status post, with Adam and
Josh covering much of it in their general Status emails (which are
deeply appreciated).  I'm hoping we can continue with both, given
their importance to a successful 5.0 release and the debt cost we face
otherwise going from the initial alpha release to the eventual GA.


We have made good efforts on moving towards a Stable Trunk.
Special mentions to
 - improving parity between CircleCI and ci-cassandra.a.o (CASSANDRA-17930)
 - introducing Butler and the Build Lead role
 - pre-commit workflow, and automated multiplexing, in CircleCI
(CASSANDRA-16625)
 - single-digit flaky failures per build on 4.0, 4.1 and trunk in
ci-cassandra.a.o !!
 - CircleCI is as stable on Large as XLarge containers (CASSANDRA-18127)


*** State of Affairs

None of our CI systems are consistently green yet.  Flakies occur in
both CircleCI and ci-cassandra.a.o.  We had to lower the 4.1 release
CI criteria to accept three consecutive green runs on CircleCI, as
it would have been unlikely to achieve the same on ci-cassandra.a.o.
While the flaky rate is lower than in 4.0, the higher number of tests
we run is making it harder to get those green runs.
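
To make that last point concrete, here is a rough back-of-envelope
sketch.  It assumes every test flakes independently at the same
per-test rate, which is a simplification and not how our suites really
behave.  The function names and the numbers in the usage note are
illustrative only; they are not measurements from CircleCI or
ci-cassandra.a.o.

```python
def green_run_probability(per_test_flake_rate: float, num_tests: int) -> float:
    """Chance that every test passes in a single run.

    Assumes each test fails spuriously with the same probability and
    independently of the others (a simplifying assumption).
    """
    return (1.0 - per_test_flake_rate) ** num_tests


def consecutive_green_probability(per_test_flake_rate: float,
                                  num_tests: int,
                                  runs: int = 3) -> float:
    """Chance of `runs` fully green runs in a row, under the same assumption."""
    return green_run_probability(per_test_flake_rate, num_tests) ** runs
```

With made-up inputs, green_run_probability(0.0008, 2000) comes out
lower than green_run_probability(0.001, 1000): a lower per-test flaky
rate can still give fewer green runs once the test count doubles.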

Despite the overhead we continue to face with flakies and getting
major releases out, 4.1 saw fewer releases to GA than 4.0, and I think
all will agree things are improving.  But the challenge in front of us
up to the 5.0 release is huge, with nine CEPs slated to land.
Pre-commit and post-commit CI need investment if we want our stable
trunk efforts to continue to improve.


*** The Butler (Build Lead)

The introduction of Butler and the Build Lead was a wonderful
improvement to our CI efforts.  It has brought a lot of hygiene in
listing out flakies as they happened.  Admittedly this has in turn
increased the burden of getting our major releases out, but that's to
be seen as a one-off cost.  This initiative lost traction and
volunteers in the middle of last year.

We really need you to take part in the Build Lead weekly rotation.

I've signed myself up for this week; please jump in and sign yourself
up for the weeks ahead.  If you are a coach/manager for a team, please
permit and encourage your engineers to be involved in this activity.
It shouldn't be more than an hour over the week.  Further instructions
can be found at
https://cwiki.apache.org/confluence/display/CASSANDRA/Build+Lead

If it's your first time being a Build Lead the community is here to
help you, just reach out.  It's also a great way into our community
for newcomers!

When it comes to Butler, its UX for history is a bit clumsy.  TIL that
you can indeed list the full history of failures per test: see 'Full
History' under a test page*.  Please use this information to help
create jira tickets on flakies, specifically the versions each applies
to and the rough rate of failure observed so far.

*) e.g. 
https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/snapshot_test/TestArchiveCommitlog/test_archive_commitlog_point_in_time_ln


*** Proposal for a Repeatable Containerised CI

Building on what Josh writes in his "Cassandra project status, Year in
Review Holiday Edition" post, and many discussions offline with many
folk, I've written up the ticket epic for creating a reproducible
containerised ci-cassandra.a.o.

Please read https://issues.apache.org/jira/browse/CASSANDRA-18137

The tl;dr of it is to create a script that, using the jenkins k8s
operator, can set up a ci-cassandra.a.o clone in your k8s context.

The ticket is lengthy, despite being in bullet form.  I don't believe
it warrants a CEP; speak up if you disagree.  The idea is to provide
us a turnkey solution:
 - a jenkins k8s operator based script (create ci-cassandra.a.o clone,
   run pipeline, save results, tear down clone),
 - our existing build and test scripts (including their docker images)
   brought from cassandra-builds to be in-tree,
 - a declarative jenkins pipeline that (in a simple intuitive manner)
   maps stages to CI-agnostic build and test scripts (that can be run
   locally without a CI system if you so desire),
 - all branch specific testing context (jdks, pythons, dists) defined
   outside of the CI code.

Its success depends upon providing a CI system that is stable and fast
for pre-commit testing.
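
As a purely illustrative sketch of the "stages map to CI-agnostic
scripts, branch context lives outside the CI code" idea, something
along these lines is what I have in mind.  Every name here (the
BranchContext fields, the script paths, the --jdk flag, the
"example-branch" entry and its values) is a hypothetical placeholder
and not taken from the ticket or from cassandra-builds.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BranchContext:
    """Branch specific testing context, kept apart from any CI code."""
    jdks: Tuple[str, ...]
    pythons: Tuple[str, ...]
    dists: Tuple[str, ...]


# Hypothetical placeholder values, not the real matrix for any branch.
CONTEXTS: Dict[str, BranchContext] = {
    "example-branch": BranchContext(
        jdks=("jdk-a", "jdk-b"),
        pythons=("python-x",),
        dists=("dist-y",),
    ),
}

# Hypothetical stage -> script mapping; the scripts know nothing about CI.
STAGE_SCRIPTS: Dict[str, str] = {
    "build": "./build.sh",
    "unit-test": "./test-unit.sh",
}


def plan(branch: str, stages: List[str]) -> List[List[str]]:
    """Expand stages into plain command lines, one per stage and JDK."""
    context = CONTEXTS[branch]
    commands: List[List[str]] = []
    for stage in stages:
        script = STAGE_SCRIPTS[stage]
        for jdk in context.jdks:
            commands.append([script, "--jdk", jdk])
    return commands
```

The point of the shape is that plan() returns ordinary command lines:
a jenkins pipeline, CircleCI, or a developer's shell could each run
them without the scripts caring which one it is.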
