Ultimately I think we have to invest in two directions: first, choose
a consistent, representative subset of stable tests that we feel give
us a reasonable level of confidence in return for a reasonable amount
of runtime. Second, we need to invest in figuring out why certain
tests fail. I strongly dislike the term "flaky" because it suggests
that it's some inconsequential issue causing problems. The truth is
that a test that fails is either a bug in the service code or a bug in
the test. I've come to realize that the CI and build framework is way
too complex for me to be able to help with much, but I would love to
start chipping away at failing-test bugs. I'm getting settled into my
new job and should be able to commit regular time each week to
triaging and fixing, starting in August. If any other folks are
interested, let me know.
Cheers,
Derek
On Mon, Jul 3, 2023, 12:30 PM Josh McKenzie <jmcken...@apache.org> wrote:
Instead of running all the tests through the available CI agents
every time, we can have presets of tests:
Back when I joined the project in 2014, unit tests took ~5
minutes to run on a local machine. We had a pre-commit vs.
post-commit test distinction as well, but also had flakes in
both batches. I'd love to see us get back to a unit test
regime like that.
The challenge we've always had is flaky tests showing up in either
the pre-commit or post-commit groups, and the difficulty of
attributing a flaky failure to the change that introduced it (not
to lay blame, but to educate, learn, and prevent recurrence).
While historically further-reduced smoke-testing suites would just
mean more flakes showing up downstream, the rule of multiplexing
new or changed tests might go a long way toward mitigating that.
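A minimal sketch of what that multiplexing rule could look like; the
test names, repeat count, and the pluggable `run_test` callable here
are all hypothetical stand-ins for a real test runner:

```python
def multiplex(test_names, run_test, runs=100):
    """Run each new/changed test `runs` times; record any failures.

    `run_test(name)` returns True on a passing run. A real harness
    would shell out to the project's test runner instead.
    """
    flaky = {}
    for name in test_names:
        failures = sum(1 for _ in range(runs) if not run_test(name))
        if failures:
            flaky[name] = failures
    return flaky

# Toy runner: 'testIntermittent' fails every third invocation.
calls = {"n": 0}
def toy_runner(name):
    if name == "testIntermittent":
        calls["n"] += 1
        return calls["n"] % 3 != 0
    return True

result = multiplex(["testStable", "testIntermittent"], toy_runner, runs=30)
# result flags only the intermittent test, with its failure count.
```

The point of the repeat count is statistical: a test that fails even
once in N repeated runs gets fixed before merge rather than becoming a
downstream flake.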
Should we mention in this concept how we will build the
sub-projects (e.g. Accord) alongside Cassandra?
I think it's an interesting question, but I also think there's no
real process dependency between primary mainline branches and
feature branches. My intuition is that holding feature branches to
the same bar (green CI, multiplexing, don't introduce flakes,
smart smoke-suite tiering) would be a good idea, so there's no
death march right before merge, squashing flakes while
multiplexing hundreds of tests before merging to mainline (since
presumably a feature branch would impact a lot of tests).
Now that I write that all out it does sound Painful. =/
On Mon, Jul 3, 2023, at 10:38 AM, Maxim Muzafarov wrote:
For me, the biggest benefit of keeping the build scripts and CI
configurations in the same project is that these files are
versioned in the same way as the main sources. This ensures that
we can build past releases without hitting annoying errors in the
scripts, so I would say this is a pretty necessary change.

I'd like to mention an approach that could work for projects with
a huge number of tests. Instead of running all the tests through
the available CI agents every time, we can have presets of tests:
- base tests (to make sure that your design basically works; the set
will not run longer than 30 min);
- pre-commit tests (enough tests to make sure that we can safely
commit new changes, fitting the run into a 1-2 hour build
timeframe);
- nightly builds (a scheduled task to build everything we have once a
day and notify the ML if the build fails).
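One way such presets could be expressed; the tier names, suite names,
and time budgets below are hypothetical, purely to illustrate the
shape of the idea:

```python
# Hypothetical test-tier presets: each tier widens the set of suites
# it runs and has a rough wall-clock budget (None = uncapped nightly).
TIERS = {
    "base":       {"suites": ["unit"],                              "budget_min": 30},
    "pre-commit": {"suites": ["unit", "jvm-dtest"],                 "budget_min": 120},
    "nightly":    {"suites": ["unit", "jvm-dtest", "dtest", "upgrade"], "budget_min": None},
}

def suites_for(tier):
    """Return the suites a CI run should execute for the given preset."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return TIERS[tier]["suites"]
```

A CI job would then pick its preset from the trigger (push, merge
request, cron) rather than always running everything.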
My question here is:
Should we mention in this concept how we will build the sub-projects
(e.g. Accord) alongside Cassandra?
On Fri, 30 Jun 2023 at 23:19, Josh McKenzie
<jmcken...@apache.org> wrote:
>
> Not everyone will have access to such resources; if all you
have is one such pod you'll be waiting a long time (in theory one
month, and you actually need a few bigger pods for some of the
more extensive tests, e.g. large upgrade tests)….
>
> One thing worth calling out: I believe we have a lot of
low-hanging fruit in the domain of "find long-running tests and
speed them up". In early 2022 I was poking around at our unit
tests on CASSANDRA-17371 and found that 2.62% of our tests made
up 20.4% of our runtime
(https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592).
This kind of finding is pretty consistent; I remember Carl
Yeksigian at NGCC back in 2015 axing an hour-plus of aggregate
runtime just by devoting an afternoon to a few badly behaving
tests.
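> That kind of long-tail analysis is a few lines of scripting: sort
tests by runtime and see how small a slice of them accounts for a
fifth of total wall-clock. The timing data below is invented, just to
show the mechanics:

```python
def long_tail(timings, share=0.20):
    """Smallest set of tests accounting for `share` of total runtime.

    `timings` maps test name -> runtime in seconds (sample data here
    is made up, not real Cassandra numbers).
    """
    total = sum(timings.values())
    chosen, acc = [], 0.0
    for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        if acc >= share * total:
            break
        chosen.append(name)
        acc += secs
    return chosen

sample = {"testA": 300, "testB": 40, "testC": 35, "testD": 30, "testE": 20}
hogs = long_tail(sample)
# One test out of five already exceeds 20% of the total runtime here,
# mirroring the "tiny fraction of tests, big fraction of time" shape.
```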
>
> I'd like to see us move from "1 pod 1 month" down to something
a lot more manageable. :)
>
> Shout-out to Berenger's work on CASSANDRA-16951 for dtest
cluster reuse (not yet merged), and I have CASSANDRA-15196 to
remove the CDC vs. non-CDC segment allocator distinction and axe
the test-cdc target entirely.
>
> Ok. Enough of that. Don't want to derail us, just wanted to
call out that the state of things today isn't the way it has to be.
>
> On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote:
>
> - There are hardware constraints; is there any approximation of
how long it will take to run all tests? Or is there a stated goal
that we will strive to reach as a project?
>
> Have to defer to Mick on this; I don't think the changes
outlined here will materially change the runtime on our currently
donated nodes in CI.
>
>
>
> A recent comparison between CircleCI and the Jenkins code
underneath ci-cassandra.a.o was done (not yet shared) to see
whether a 'repeatable CI' can be both lower cost and offer the
same turnaround time. The exercise uncovered that there's a lot
of waste in our Jenkins builds, and once the Jenkinsfile becomes
standalone it can stash and unstash the build results. From this,
a conservative estimate was that even if we only brought the
build time down to double that of CircleCI, it would still be
significantly lower cost while still using on-demand EC2
instances. (The goal is to use spot instances.)
>
> The real problem here is that our CI pipeline uses ~1000
containers, while ci-cassandra.a.o only has 100 executors (and at
any given time a few of these are often down for disk
self-cleaning). The idea with 'repeatable CI', and more broadly
with Josh's opening email, is that no one will need to use
ci-cassandra.a.o for pre-commit work anymore. For post-commit we
don't care if it takes 7 hours (we care about stability of
results, which 'repeatable CI' also helps us with).
>
> While pre-commit testing will be more accessible to everyone,
it will still depend on the resources you have access to. For
the fastest turnaround you will need a k8s cluster that can
spawn 1000 pods (4 CPU, 8 GB RAM), each running for 1-30
minutes, or the equivalent. Not everyone will have access to
such resources; if all you have is one such pod you'll be
waiting a long time (in theory one month, and you actually need
a few bigger pods for some of the more extensive tests, e.g.
large upgrade tests)….
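> A rough back-of-the-envelope check on that "one month" figure,
assuming the ~1000 containers above each run for up to ~30 minutes
(the bigger pods some suites need would only add to this):

```python
# Serialize ~1000 pod-runs of up to ~30 minutes onto a single pod.
containers = 1000
max_minutes_each = 30

total_pod_hours = containers * max_minutes_each / 60  # 500 pod-hours
days_on_one_pod = total_pod_hours / 24                # ~21 days
# Roughly three weeks of non-stop serial runtime, i.e. on the order
# of a month once queueing and the larger pods are factored in.
```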
>
>