Michael Semb Wever created CASSANDRA-18137:
----------------------------------------------

             Summary: Repeatable ci-cassandra.a.o 
                 Key: CASSANDRA-18137
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18137
             Project: Cassandra
          Issue Type: Epic
          Components: CI
            Reporter: Michael Semb Wever
            Assignee: Michael Semb Wever


Goals
- Reproducible reference ASF CI environment so contributors can clone it.
- An accepted “test result output” format that will certify a commit regardless 
of CI env.
- Turnaround times as fast as circleci (cloned environment scales to capacity).
- Intuitive CI implementation accessible to new contributors. 



Existing Problems
- Many unknown flakies due to infrequent failure rates and limited test history,
- time-consuming to identify flakies as infra-related, test-related, 
code-related,
- ci-cassandra.a.o is hard to debug (donated heterogenous servers around the 
world, ASF controlled with limited physical access, two executors per agent 
[noisy neighbour]),
- slow turnaround times compared to circleci, also very variable times as the 
fixed resource pool running both post- and pre-commit CI becomes easily 
saturated,
- difficult to pre-commit test jenkins and cassandra-build changes,
- CI development efforts is split between ci-cassandra and circleci, despite 
ci-cassandra being our canonical and non-commercial CI,
- lacking parity of what is tested between ci-cassandra and circleci
- circleci is restricted to those with access to premium commercial circleci, 
creating classes of engineers in the community and an exclusive OSS culture,
- cassandra-builds as a separate repo without release branches matching in-tree 
adds complexity to changing matrix values (jdks, pythons, dist)
- mixture of jenkins dsl groovy, declarative and scripting pipeline.
- different pre-commit and post-commit jenkins pipelines are used.


Additional Goals
- Identify all remaining test flakies.
- Thin CI implementation that builds on a common set of CI-agnostic build and 
test scripts. Contributors are free to add/maintain other CI solutions while 
remaining aligned on what and how we test.
- Extendable by downstream codebases (designed for re-use and extension). 


Proposal
- Provide a jenkins k8s operator based script that with one command-line spawns 
a ci-cassandra.a.o clone on the k8s cluster in context, runs the Jenkinsfile 
pipeline, saves the test result, and tears down the ci-cassandra.a.o clone.  
[turnkey solution]
- Parameters make spawn and tear-down optional, so ci-cassandra.a.o clones can 
be re-usable.
- Bring build and test scripts (including their docker images) from 
cassandra-builds in-tree
- Provide a declarative jenkins pipeline that maps stages to CI-agnostic build 
and test scripts.
- CI-agnostic build and test scripts can be run with docker, and without any 
CI, on any machine.
- Branch specific testing context is defined outside of the CI code.


Unknowns
- with the known pipeline steps, the matrix involved in each (jdk, python, 
dist, arch), and the parallelisation possible, what is the theoretical fastest 
run on an unlimited capacity k8s cluster,
- what parameterisation to the script is required for typical developer testing 
pre-commit,
- what is the cost of a single pipeline run, what is the expected cost of a 
year's post-commit CI,
- what is the size of the test result and how can it be saved and shared,
- what is the ideal stable agent resource specifications, can we use 
heterogenous environments if different test types have different minimum 
requirements,
- is multiplexing testing in jenkins a requirement to this epic
- how does the community provide CI to contributors that cannot afford the k8s 
cluster costs


Non Goals
- deciding what to do with ci-cassandra.a.o if a ci-cassandra.a.o clone is 
donated and provides a more stable post-commit environment than the original 
ci-cassandra.a.o
- any work/improvements on circleci
- discussing or changing our branching and merging strategies
- introduction of a develop or staging branch for final pass pre-commit CI 
optimisation
- jira integration (automatic bot comments of test results)
- introduction/support of additional tests: code coverage, jmh benchmarking, or 
larger performance testing



Timing and Priorisation
- Both 4.0 and 4.1 major releases have been delayed (and significant amounts of 
engineering time spent) on addressing an unknown number of flakies.
- Test failures are still being caught months after commit and merge.
- 5.0 promises a significant increase in large contributions, increasing the 
risk and cost of failing to do stable trunk development.
- Downstream users are requesting closer alignment to upstream, and convergence 
to upstream's CI approach (including extending it for additional QA).


History and background context can be read in the previous 'Cassandra CI 
Status' dev@ threads: 
https://lists.apache.org/list?d...@cassandra.apache.org:lte=5y:%22Cassandra%20CI%20Status%22
 





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to