Happy 2023 everyone! With only four months in front of us before the first 5.0 release I'm hoping we can re-energize our focus on CI and Stable Trunk.
This post covers the following
* Recap of CI improvements
* State of Affairs
* The Butler (Build Lead)
* Proposal for a Repeatable Containerised CI

and it calls for the following actions
** we need you to sign up for a week's rotation as Build Lead!
** please reply in-thread with any CI issues I've forgotten,
** does CASSANDRA-18137 warrant a CEP?


*** Recap of CI improvements

It's been over two years since my last CI Status post, with Adam and Josh covering much of it in their general Status emails (which are deeply appreciated). I'm hoping we can continue with both, given their importance to a successful 5.0 release and the debt cost we face otherwise going from the initial alpha release to the eventual GA.

We have made good efforts on moving towards a Stable Trunk. Special mentions to
- improving parity between CircleCI and ci-cassandra.a.o (CASSANDRA-17930)
- introducing Butler and the Build Lead role
- pre-commit workflow, and automated multiplexing, in CircleCI (CASSANDRA-16625)
- single digit flaky failures per build on 4.0, 4.1 and trunk ci-cassandra.a.o !!
- CircleCI is as stable on Large as XLarge containers (CASSANDRA-18127)


*** State of Affairs

None of our CI systems are consistently green yet. Flakies occur in both CircleCI and ci-cassandra.a.o. We had to lower the 4.1 release CI criteria to accept three consecutive green runs on CircleCI, as it would have been unlikely to achieve the same on ci-cassandra.a.o. While the flaky rate is lower than 4.0, the higher number of tests we run is making it harder to get those green runs.

Despite the overhead we continue to face with flakies and getting major releases out, 4.1 saw fewer releases to GA than 4.0, and I think all will agree things are improving. But the challenge in front of us up to the 5.0 release is huge, with nine CEPs slated to land. Pre-commit and post-commit CI need investment if we want our stable trunk efforts to continue to improve.
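To illustrate why a lower flaky rate can still mean fewer green runs once the test count grows, here is a minimal sketch. It assumes every test flakes independently with the same per-test probability, which is a simplification and not how our real suites behave. The function names and the rates and test counts in the usage lines are hypothetical, chosen only to show the shape of the effect, not measured from any branch.

```python
def green_run_probability(flaky_rate: float, num_tests: int) -> float:
    """Chance that a single CI run is fully green.

    Assumes each of num_tests tests fails spuriously with the same
    probability flaky_rate, independently of the others (a simplification).
    """
    return (1.0 - flaky_rate) ** num_tests


def consecutive_green_probability(flaky_rate: float, num_tests: int, runs: int = 3) -> float:
    """Chance of `runs` green runs in a row, treating the runs as independent."""
    return green_run_probability(flaky_rate, num_tests) ** runs


if __name__ == "__main__":
    # Hypothetical numbers only: the second suite has half the per-test
    # flaky rate but four times as many tests.
    older = consecutive_green_probability(flaky_rate=2e-4, num_tests=5_000)
    newer = consecutive_green_probability(flaky_rate=1e-4, num_tests=20_000)
    print(f"older suite, three green runs in a row: {older:.4f}")
    print(f"newer suite, three green runs in a row: {newer:.4f}")
```

Under these assumptions the second suite is less likely to go green even though each individual test is more reliable, because the exponent (the number of tests) grows faster than the per-test rate shrinks.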
*** The Butler (Build Lead)

The introduction of Butler and the Build Lead was a wonderful improvement to our CI efforts. It has brought a lot of hygiene in listing out flakies as they happened. Note that this has in turn increased the burden in getting our major releases out, but that's to be seen as a one-off cost.

This initiative lost traction and volunteers in the middle of last year. We really need you to take part in the Build Lead weekly rotation. I've signed myself up for this week; please jump in and sign yourself up for the weeks ahead. If you are a coach/manager for a team, please permit and encourage your engineers to be involved in this activity; it shouldn't take more than an hour over the week.

Further instructions can be found at https://cwiki.apache.org/confluence/display/CASSANDRA/Build+Lead

If it's your first time being a Build Lead, the community is here to help you, just reach out. It's also a great way into our community for newcomers!

When it comes to Butler, its UX for history is a bit clumsy. TIL that you can indeed list the full history of failures per test: see 'Full History' under a test page*. Please use this information to help create jira tickets on flakies, specifically the versions a flaky applies to and the rough rate of failure observed so far.

*) e.g. https://butler.cassandra.apache.org/#/ci/upstream/workflow/Cassandra-trunk/failure/snapshot_test/TestArchiveCommitlog/test_archive_commitlog_point_in_time_ln


*** Proposal for a Repeatable Containerised CI

Building on what Josh writes in his "Cassandra project status, Year in Review Holiday Edition" post, and many discussions offline with many folk, I've written up the ticket epic for creating a reproducible containerised ci-cassandra.a.o

Please read https://issues.apache.org/jira/browse/CASSANDRA-18137

The tl;dr of it is to create a script that, using the jenkins k8s operator, can set up a ci-cassandra.a.o clone in your k8s context. The ticket is lengthy, despite being in bullet form.
I don't believe it warrants a CEP; speak up if you disagree.

The idea is to provide us with a turnkey solution: the jenkins k8s operator based script (create ci-cassandra.a.o clone, run pipeline, save results, tear down clone). It also means bringing our existing build and test scripts (including their docker images) from cassandra-builds in-tree, to give us a declarative jenkins pipeline that (in a simple intuitive manner) maps stages to CI-agnostic build and test scripts (that can be run locally without a CI system if you so desire), where all branch specific testing context (jdks, pythons, dists) is defined outside of the CI code.

Its success depends upon providing a CI system that is stable and fast for pre-commit testing.
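The turnkey flow (create clone, run pipeline, save results, tear down clone) could be sketched roughly as below. This is not the script from CASSANDRA-18137: the function name `run_ephemeral_ci` and its four callables are hypothetical stand-ins for whatever the jenkins k8s operator based script ends up doing, and the stubs in the usage section do no real work. Tearing the clone down even when the pipeline fails is my assumption, so that a failed run does not leave resources behind in your k8s context.

```python
from typing import Any, Callable


def run_ephemeral_ci(
    create_clone: Callable[[], Any],
    run_pipeline: Callable[[Any], Any],
    save_results: Callable[[Any], None],
    tear_down_clone: Callable[[Any], None],
) -> Any:
    """Run one throwaway CI cycle: create, run, save, tear down.

    All four callables are hypothetical placeholders; the real steps
    would talk to the cluster and to jenkins.
    """
    clone = create_clone()
    try:
        results = run_pipeline(clone)
        save_results(results)
        return results
    finally:
        # Assumption: always tear down, even if the pipeline or the
        # saving of results raised an exception.
        tear_down_clone(clone)


if __name__ == "__main__":
    # Stub steps that only print, to show the order of operations.
    outcome = run_ephemeral_ci(
        create_clone=lambda: "ci-clone",
        run_pipeline=lambda clone: {"clone": clone, "green": True},
        save_results=lambda results: print("saving", results),
        tear_down_clone=lambda clone: print("tearing down", clone),
    )
    print(outcome)
```

Passing the steps in as callables keeps the flow itself CI-agnostic, in the same spirit as mapping pipeline stages to scripts that can also be run locally.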