> Not everyone will have access to such resources, if all you have is 1 such 
> pod you'll be waiting a long time (in theory one month, and you actually need 
> a few bigger pods for some of the more extensive tests, e.g. large upgrade 
> tests)….   
One thing worth calling out: I believe we have *a lot* of low-hanging fruit in 
the domain of "find long running tests and speed them up". In early 2022 I was 
poking around at our unit tests on CASSANDRA-17371 and found that *2.62% of our 
tests made up 20.4% of our runtime* 
(https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592).
This kind of finding is pretty consistent; I remember Carl Yeksigian at NGCC 
back in like 2015 axing an hour plus of aggregate runtime by just devoting an 
afternoon to looking at a few badly behaving tests.
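
For anyone who wants to repeat that kind of analysis, here is a minimal sketch of the calculation. It assumes you already have per-test durations in seconds in a dict; the test names and numbers below are made up for illustration and are not taken from CASSANDRA-17371 or the spreadsheet.

```python
def runtime_concentration(durations, top_fraction):
    """Share of total runtime consumed by the slowest `top_fraction` of tests.

    `durations` maps a test name to its runtime in seconds.
    """
    if not durations:
        raise ValueError("no test durations supplied")
    ordered = sorted(durations.values(), reverse=True)
    # Always look at at least one test, even for tiny fractions.
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


def slowest(durations, n):
    """The `n` slowest tests as (name, seconds) pairs, slowest first."""
    return sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical data: nine quick tests and one slow one.
example = {f"FastTest{i}": 1.0 for i in range(9)}
example["SlowTest"] = 11.0

share = runtime_concentration(example, 0.10)
print(f"slowest 10% of tests use {share:.0%} of the runtime")
print(slowest(example, 1))
```

The output of `slowest` is the short list worth spending an afternoon on.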

I'd like to see us move from "1 pod 1 month" down to something a lot more 
manageable. :)
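
As a rough sanity check on the "1 pod 1 month" figure, here is a back-of-envelope sketch. It assumes the numbers from Mick's email quoted below (about 1000 containers, each running for at most 30 minutes) and ignores setup overhead, retries, and the larger pods needed for some tests, so treat it as an order-of-magnitude estimate only.

```python
import math


def serial_days(num_jobs, minutes_per_job):
    """Days needed to run every job back to back on a single pod."""
    return num_jobs * minutes_per_job / 60 / 24


def wall_clock_minutes(num_jobs, minutes_per_job, num_pods):
    """Wall-clock minutes when jobs run in waves across `num_pods` pods."""
    waves = math.ceil(num_jobs / num_pods)
    return waves * minutes_per_job


# Worst case: all 1000 jobs take the full 30 minutes.
print(f"1 pod:     {serial_days(1000, 30):.1f} days")
print(f"1000 pods: {wall_clock_minutes(1000, 30, 1000)} minutes")
```

Under these assumptions a single pod needs roughly three weeks, which is the same ballpark as the "in theory one month" estimate once overhead is included.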

Shout-out to Berenger's work on CASSANDRA-16951 for dtest cluster reuse (not 
yet merged), and I have CASSANDRA-15196 to remove the CDC vs. non-CDC segment 
allocator distinction and axe the test-cdc target entirely.

Ok. Enough of that. Don't want to derail us, just wanted to call out that the 
state of things today isn't the way it has to be.

On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote:
>>> - There are hw constraints, is there any approximation on how long it will 
>>> take to run all tests? Or is there a stated goal that we will strive to 
>>> reach as a project?
>> Have to defer to Mick on this; I don't think the changes outlined here will 
>> materially change the runtime on our currently donated nodes in CI. 
> 
> 
> A recent comparison between CircleCI and the Jenkins code underneath 
> ci-cassandra.a.o was done (not yet shared) to see whether a 'repeatable CI' 
> can be both lower cost and have the same turnaround time.  The exercise 
> uncovered that there's a lot of waste in our Jenkins builds, and once the 
> Jenkinsfile becomes standalone it can stash and unstash the build results.  
> From this, a conservative estimate was that even if we only brought the build 
> time down to double that of CircleCI it would still be significantly lower 
> cost while still using on-demand ec2 instances. (The goal is to use spot 
> instances.)
> 
> The real problem here is that our CI pipeline uses ~1000 containers. 
> ci-cassandra.a.o only has 100 executors (and a few of these at any time are 
> often down for disk self-cleaning).   The idea with 'repeatable CI', and to a 
> broader extent Josh's opening email, is that no one will need to use 
> ci-cassandra.a.o for pre-commit work anymore.  For post-commit we don't care 
> if it takes 7 hours (we care about stability of results, which 'repeatable 
> CI' also helps us with).
> 
> While pre-commit testing will be more accessible to everyone, it will still 
> depend on the resources you have access to.  For the fastest turn-around 
> times you will need a k8s cluster that can spawn 1000 pods (4cpu, 8GB ram) 
> which will run for up to 1-30 minutes, or the equivalent.  Not everyone will 
> have access to such resources, if all you have is 1 such pod you'll be 
> waiting a long time (in theory one month, and you actually need a few bigger 
> pods for some of the more extensive tests, e.g. large upgrade tests)….   
