Ultimately I think we have to invest in two directions: first, choose
a consistent, representative subset of stable tests that we feel give
us a reasonable level of confidence in return for a reasonable amount
of runtime. Second, we need to invest in figuring out why certain
tests fail. I strongly dislike the term "flaky" because it suggests
that it's some inconsequential issue causing problems. The truth is
that a test that fails is either a bug in the service code or a bug in
the test. I've come to realize that the CI and build framework is way
too complex for me to be able to help with much, but I would love to
start chipping away at failing-test bugs. I'm getting settled into my
new job and should be able to commit regular time each week to
triaging and fixing, starting in August. If any other folks are
interested, let me know.
Cheers,
Derek
On Mon, Jul 3, 2023, 12:30 PM Josh McKenzie <jmcken...@apache.org> wrote:
Instead of running all the tests through the available CI agents
every time, we can have presets of tests:
Back when I joined the project in 2014, unit tests took ~5
minutes to run on a local machine. We had a pre-commit vs.
post-commit test distinction as well, but also had flakes in
both batches. I'd love to see us get back to a unit test
regime like that.
The challenge we've always had is flaky tests showing up in either
the pre-commit or post-commit groups, and the difficulty of
attributing a flaky failure to the change that introduced it (not
to lay blame, but to educate, learn, and prevent recurrence).
While historically further-reduced smoke-testing suites would just
mean more flakes showing up downstream, the rule of multiplexing
new or changed tests might go a long way toward mitigating that.
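A minimal sketch of what that multiplexing rule could look like; the
test names, repeat count, and the pluggable `run_test` callable here
are all hypothetical stand-ins for a real test runner:

```python
def multiplex(test_names, run_test, runs=100):
    """Run each new/changed test `runs` times; record any failures.

    `run_test(name)` returns True on a passing run. A real harness
    would shell out to the project's test runner instead.
    """
    flaky = {}
    for name in test_names:
        failures = sum(1 for _ in range(runs) if not run_test(name))
        if failures:
            flaky[name] = failures
    return flaky

# Toy runner: 'testIntermittent' fails every third invocation.
calls = {"n": 0}
def toy_runner(name):
    if name == "testIntermittent":
        calls["n"] += 1
        return calls["n"] % 3 != 0
    return True

result = multiplex(["testStable", "testIntermittent"], toy_runner, runs=30)
# result flags only the intermittent test, with its failure count.
```

The point of the repeat count is statistical: a test that fails even
once in N repeated runs gets fixed before merge rather than becoming a
downstream flake.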
Should we mention in this concept how we will build the
sub-projects (e.g. Accord) alongside Cassandra?
I think it's an interesting question, but I also think there's no
real process dependency between primary mainline branches and
feature branches. My intuition is that holding feature branches to
the same bar (green CI, multiplexing, don't introduce flakes,
smart smoke-suite tiering) would be a good idea, so there's no
death march right before merge, squashing flakes while
multiplexing hundreds of tests before merging to mainline (since
presumably a feature branch would impact a lot of tests).
Now that I write that all out it does sound Painful. =/
On Mon, Jul 3, 2023, at 10:38 AM, Maxim Muzafarov wrote:
For me, the biggest benefit of keeping the build scripts and CI
configurations in the same project is that these files are
versioned in the same way as the main sources. This ensures that
we can build past releases without hitting annoying errors in the
scripts, so I would say this is a pretty necessary change.

I'd like to mention an approach that could work for projects with
a huge number of tests. Instead of running all the tests through
the available CI agents every time, we can have presets of tests:
- base tests (to make sure that your design basically works; the set
will not run longer than 30 min);
- pre-commit tests (enough tests to make sure that we can safely
commit new changes, fitting the run into a 1-2 hour build
timeframe);
- nightly builds (a scheduled task to build everything we have once a
day and notify the ML if the build fails).
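One way such presets could be expressed; the tier names, suite names,
and time budgets below are hypothetical, purely to illustrate the
shape of the idea:

```python
# Hypothetical test-tier presets: each tier widens the set of suites
# it runs and has a rough wall-clock budget (None = uncapped nightly).
TIERS = {
    "base":       {"suites": ["unit"],                              "budget_min": 30},
    "pre-commit": {"suites": ["unit", "jvm-dtest"],                 "budget_min": 120},
    "nightly":    {"suites": ["unit", "jvm-dtest", "dtest", "upgrade"], "budget_min": None},
}

def suites_for(tier):
    """Return the suites a CI run should execute for the given preset."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return TIERS[tier]["suites"]
```

A CI job would then pick its preset from the trigger (push, merge
request, cron) rather than always running everything.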
My question here is:
Should we mention in this concept how we will build the sub-projects
(e.g. Accord) alongside Cassandra?
On Fri, 30 Jun 2023 at 23:19, Josh McKenzie
<jmcken...@apache.org> wrote:
>
> Not everyone will have access to such resources; if all you
have is one such pod you'll be waiting a long time (in theory one
month, and you actually need a few bigger pods for some of the
more extensive tests, e.g. large upgrade tests)….
>
> One thing worth calling out: I believe we have a lot of
low-hanging fruit in the domain of "find long-running tests and
speed them up". In early 2022 I was poking around at our unit
tests on CASSANDRA-17371 and found that 2.62% of our tests made
up 20.4% of our runtime
(https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592).
This kind of finding is pretty consistent; I remember Carl
Yeksigian at NGCC back in 2015 axing an hour-plus of aggregate
runtime just by devoting an afternoon to a few badly behaving
tests.
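> That kind of long-tail analysis is a few lines of scripting: sort
tests by runtime and see how small a slice of them accounts for a
fifth of total wall-clock. The timing data below is invented, just to
show the mechanics:

```python
def long_tail(timings, share=0.20):
    """Smallest set of tests accounting for `share` of total runtime.

    `timings` maps test name -> runtime in seconds (sample data here
    is made up, not real Cassandra numbers).
    """
    total = sum(timings.values())
    chosen, acc = [], 0.0
    for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        if acc >= share * total:
            break
        chosen.append(name)
        acc += secs
    return chosen

sample = {"testA": 300, "testB": 40, "testC": 35, "testD": 30, "testE": 20}
hogs = long_tail(sample)
# One test out of five already exceeds 20% of the total runtime here,
# mirroring the "tiny fraction of tests, big fraction of time" shape.
```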
>
> I'd like to see us move from "1 pod 1 month" down to something
a lot more manageable. :)
>
> Shout-out to Berenger's work on CASSANDRA-16951 for dtest
cluster reuse (not yet merged), and I have CASSANDRA-15196 to
remove the CDC vs. non-CDC segment allocator distinction and axe
the test-cdc target entirely.
>
> Ok. Enough of that. Don't want to derail us, just wanted to
call out that the state of things today isn't the way it has to be.
>
> On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote:
>
> - There are hardware constraints; is there any approximation of
how long it will take to run all tests? Or is there a stated goal
that we will strive to reach as a project?
>
> Have to defer to Mick on this; I don't think the changes
outlined here will materially change the runtime on our currently
donated nodes in CI.
>
>
>
> A recent comparison between CircleCI and the Jenkins code
underneath ci-cassandra.a.o was done (not yet shared) to see
whether a 'repeatable CI' can be both lower cost and offer the
same turnaround time. The exercise uncovered that there's a lot
of waste in our Jenkins builds, and once the Jenkinsfile becomes
standalone it can stash and unstash the build results. From this,
a conservative estimate was that even if we only brought the
build time down to double that of CircleCI, it would still be
significantly lower cost while still using on-demand EC2
instances. (The goal is to use spot instances.)
>
> The real problem here is that our CI pipeline uses ~1000
containers, while ci-cassandra.a.o only has 100 executors (and at
any given time a few of these are often down for disk
self-cleaning). The idea with 'repeatable CI', and more broadly
with Josh's opening email, is that no one will need to use
ci-cassandra.a.o for pre-commit work anymore. For post-commit we
don't care if it takes 7 hours (we care about stability of
results, which 'repeatable CI' also helps us with).
>
> While pre-commit testing will be more accessible to everyone,
it will still depend on the resources you have access to. For
the fastest turnaround you will need a k8s cluster that can
spawn 1000 pods (4 CPU, 8 GB RAM), each running for 1-30
minutes, or the equivalent. Not everyone will have access to
such resources; if all you have is one such pod you'll be
waiting a long time (in theory one month, and you actually need
a few bigger pods for some of the more extensive tests, e.g.
large upgrade tests)….
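> A rough back-of-the-envelope check on that "one month" figure,
assuming the ~1000 containers above each run for up to ~30 minutes
(the bigger pods some suites need would only add to this):

```python
# Serialize ~1000 pod-runs of up to ~30 minutes onto a single pod.
containers = 1000
max_minutes_each = 30

total_pod_hours = containers * max_minutes_each / 60  # 500 pod-hours
days_on_one_pod = total_pod_hours / 24                # ~21 days
# Roughly three weeks of non-stop serial runtime, i.e. on the order
# of a month once queueing and the larger pods are factored in.
```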
>
>