Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-06-30 Thread Josh McKenzie
> Not everyone will have access to such resources; if all you have is 1 such 
> pod you'll be waiting a long time (in theory one month, and you actually need 
> a few bigger pods for some of the more extensive tests, e.g. large upgrade 
> tests)….   
One thing worth calling out: I believe we have *a lot* of low-hanging fruit in 
the domain of "find long running tests and speed them up". In early 2022 I was 
poking around at our unit tests on CASSANDRA-17371 and found that *2.62% of our 
tests made up 20.4% of our runtime* 
(https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592).
 This kind of finding is pretty consistent; I remember Carl Yeksigian at NGCC 
back in like 2015 axing an hour plus of aggregate runtime by just devoting an 
afternoon to looking at a few badly behaving tests.
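
(For anyone who wants to go digging, a minimal sketch of how you might rank the
slowest tests from ant's JUnit XML reports; the report path and the
name-before-time attribute order are assumptions about a typical local setup:)

```
# Hedged sketch: list the 20 slowest unit tests by parsing JUnit XML reports.
# Assumes reports land under build/test/output/ as TEST-*.xml.
grep -ho '<testcase[^>]*' build/test/output/TEST-*.xml \
  | sed -En 's/.*name="([^"]*)".*time="([^"]*)".*/\2 \1/p' \
  | sort -rn | head -20
```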

I'd like to see us move from "1 pod 1 month" down to something a lot more 
manageable. :)

Shout-out to Berenger's work on CASSANDRA-16951 for dtest cluster reuse (not 
yet merged), and I have CASSANDRA-15196 to remove the CDC vs. non-CDC segment 
allocator distinction and axe the test-cdc target entirely.

Ok. Enough of that. Don't want to derail us, just wanted to call out that the 
state of things today isn't the way it has to be.

On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote:
>>> - There are hw constraints, is there any approximation on how long it will 
>>> take to run all tests? Or is there a stated goal that we will strive to 
>>> reach as a project?
>> Have to defer to Mick on this; I don't think the changes outlined here will 
>> materially change the runtime on our currently donated nodes in CI. 
> 
> 
> A recent comparison between CircleCI and the jenkins code underneath 
> ci-cassandra.a.o was done (not yet shared) to see whether a 'repeatable CI' 
> can be both lower cost and have the same turn-around time.  The exercise 
> uncovered that there's a lot of waste in our jenkins builds, and once the 
> jenkinsfile becomes standalone it can stash and unstash the build results.  
> From this, a conservative estimate was that even if we only brought the build 
> time down to double that of CircleCI, it would still be significantly lower 
> cost while still using on-demand EC2 instances. (The goal is to use spot 
> instances.)
> 
> The real problem here is that our CI pipeline uses ~1000 containers. 
> ci-cassandra.a.o only has 100 executors (and at any time a few of these are 
> down for disk self-cleaning).   The idea with 'repeatable CI', and to a 
> broader extent Josh's opening email, is that no one will need to use 
> ci-cassandra.a.o for pre-commit work anymore.  For post-commit we don't care 
> if it takes 7 hours (we care about stability of results, which 'repeatable 
> CI' also helps us with).
> 
> While pre-commit testing will be more accessible to everyone, it will still 
> depend on the resources you have access to.  For the fastest turn-around 
> times you will need a k8s cluster that can spawn 1000 pods (4cpu, 8GB ram), 
> each running for 1-30 minutes, or the equivalent.  Not everyone will have 
> access to such resources; if all you have is 1 such pod you'll be waiting a 
> long time (in theory one month, and you actually need a few bigger pods for 
> some of the more extensive tests, e.g. large upgrade tests)….   


Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-06-30 Thread Mick Semb Wever
>
> - There are hw constraints, is there any approximation on how long it will
> take to run all tests? Or is there a stated goal that we will strive to
> reach as a project?
>
> Have to defer to Mick on this; I don't think the changes outlined here
> will materially change the runtime on our currently donated nodes in CI.
>


A recent comparison between CircleCI and the jenkins code underneath
ci-cassandra.a.o was done (not yet shared) to see whether a 'repeatable CI' can
be both lower cost and have the same turn-around time.  The exercise uncovered
that there's a lot of waste in our jenkins builds, and once the jenkinsfile
becomes standalone it can stash and unstash the build results.  From this, a
conservative estimate was that even if we only brought the build time down to
double that of CircleCI, it would still be significantly lower cost while
still using on-demand EC2 instances. (The goal is to use spot instances.)

The real problem here is that our CI pipeline uses ~1000 containers.
ci-cassandra.a.o only has 100 executors (and at any time a few of these are
down for disk self-cleaning).   The idea with 'repeatable CI', and to a
broader extent Josh's opening email, is that no one will need to use
ci-cassandra.a.o for pre-commit work anymore.  For post-commit we don't
care if it takes 7 hours (we care about stability of results, which
'repeatable CI' also helps us with).

While pre-commit testing will be more accessible to everyone, it will still
depend on the resources you have access to.  For the fastest
turn-around times you will need a k8s cluster that can spawn 1000 pods
(4cpu, 8GB ram), each running for 1-30 minutes, or the equivalent.
Not everyone will have access to such resources; if all you have is 1 such
pod you'll be waiting a long time (in theory one month, and you actually
need a few bigger pods for some of the more extensive tests, e.g. large
upgrade tests)….
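
(The "one month" figure checks out on the back of an envelope. A sketch, taking
the 30-minute upper bound above as the per-container cost, which is an
assumption:)

```
# ~1000 containers x ~30 minutes each, run serially on a single pod:
echo "$(( 1000 * 30 / 60 )) hours"      # => 500 hours
echo "$(( 1000 * 30 / 60 / 24 )) days"  # => 20 days, i.e. roughly a month
```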


Re: CASSANDRA-18554 - mTLS based client and internode authenticators

2023-06-30 Thread Jeremiah Jordan
I don’t think users necessarily need to be able to update their own
identities.  I just don’t want to have to use the super user role.  The
super user role has power over everything in the database.  I don’t
want to give that much power to the person who manages identities;
I just want to give them the power to manage identities.

Jeremiah Jordan
e. jerem...@datastax.com
w. www.datastax.com



On Jun 30, 2023 at 1:35:41 PM, Dinesh Joshi  wrote:

> Yuki, Jeremiah, both are fair points. The mental model we're using for
> mTLS authentication is slightly different.
>
> In your model you're treating the TLS identity itself as similar to
> the password. The password is the 'shared secret' that currently needs
> to be rotated by the user that owns the account, therefore necessitating
> the permission to update their password. But that is not the case with
> TLS certificates and mTLS identities.
>
> The model we're going for is different. The identity is provisioned for
> an account by a super user. This is more locked down and the user can
> still rotate their own certificates but not change the identity
> associated with their account without a super user.
>
> Once provisioned, a user does not need to rotate the identity itself. They
> only need to obtain fresh certificates as their certificates near
> expiry. This requires no updates to the database, unlike passwords.
>
> We could extend this functionality in the future to allow users to
> change their own identity. Nothing here prevents that.
>
> thanks,
>
> Dinesh
>
>
>
> On 6/29/23 08:16, Jeremiah Jordan wrote:
> > I like the idea of extending CREATE ROLE rather than adding a brand new
> > ADD IDENTITY syntax.  Not sure how that can line up with one to many
> > relationships for an identity, but maybe that can just be done through
> > role hierarchy?
> >
> > In either case, I don’t think IDENTITY related operations should be tied
> > to the super user flag. They should be tied to either existing role
> > permissions, or a brand new permission for IDENTITY.  We should not
> > require that end users give the account allowed to make IDENTITY changes
> > super user permission to do whatever they want across the whole database.
> >
> > On Jun 28, 2023 at 11:48:02 PM, Yuki Morishita wrote:
> > > Thinking more about "CREATE ROLE" permission, if we can extend CREATE
> > > ROLE/ALTER ROLE statements, it may look streamlined:
> > >
> > > I don't have a good example, but something like:
> > > ```
> > > CREATE ROLE dev WITH LOGIN = true AND IDENTITIES = {'spiffe://xxx'};
> > > ALTER ROLE dev ADD IDENTITY 'xxx';
> > > LIST ROLES;
> > > ```
> > >
> > > This requires a role-to-identities table as well as the current
> > > identity-to-role table though.
> > >
> > > On Thu, Jun 29, 2023 at 12:34 PM Yuki Morishita wrote:
> > >
> > > Hi Jyothsna,
> > >
> > > I think for the *initial* commit, the description looks fine to me.
> > > I'd like to see/contribute to the future improvement though:
> > >
> > > * ADD IDENTITY requires SUPERUSER; this means that a brand new
> > >   cluster needs to start with
> > >   PasswordAuthenticator/CassandraAuthorizer first, and then change
> > >   to the mTLS one.
> > >   * For this, I'd really like to see Cassandra use password
> > >     authn and authz by default.
> > > * Cassandra allows a user with "CREATE ROLE" permission to
> > >   create roles without superuser privilege. Maybe it is natural to
> > >   allow them to add identities also?
> > >
> > > On Thu, Jun 29, 2023 at 7:35 AM Jyothsna Konisa
> > > <jyothsna1...@gmail.com> wrote:
> > >
> > > Hi Yuki,
> > >
> > > I have added cassandra docs for the CQL syntax that we are adding
> > > and how to get started with using mTLS authenticators, along
> > > with the migration plan. Please review it and let me know if
> > > it looks good.
> > >
> > > Thanks,
> > > Jyothsna Konisa.
> > >
> > > On Wed, Jun 21, 2023 at 10:46 AM Jyothsna Konisa
> > > <jyothsna1...@gmail.com> wrote:
> > >
> > > Hi Yuki!
> > >
> > > Thanks for the questions.
> > >
> > > Here are the steps for the initial setup.
> > >
> > > 1. Since only super users can add/remove identities from
> > >    the `identity_to_roles` table, operators should use that
> > >    role to add authorized identities to the table. Note that
> > >    the authenticator is not an mTLS authenticator yet.
> > >    EX: ADD IDENTITY 'spiffe://testdomain.com/testIdentifier/testValue'
> > >        TO ROLE 'read_only_user'
> > >
> > > 2. Change authenticator 

Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-06-30 Thread Josh McKenzie
All great questions I don't have answers to, Ekaterina. :) Thoughts though:

> - Currently we run at most two parallel CI runs in Jenkins-dev; I guess you 
> will try to improve that limitation?
If we get to using cloud-based resources for CI instead of our donated hardware 
with a budget, we could theoretically run more jobs at a time on ASF infra. 
Managing a monthly recurring spend on CI for a bunch of committers around the 
world with different sponsors is outside the scope of what we're targeting, but 
the work we're doing now will enable us to pursue that as a potential option in 
the future.

> - There are hw constraints, is there any approximation on how long it will 
> take to run all tests? Or is there a stated goal that we will strive to reach 
> as a project?
Have to defer to Mick on this; I don't think the changes outlined here will 
materially change the runtime on our currently donated nodes in CI. It'd be 
faster if we spun up cloud resources; we've gone back and forth on that topic 
too: using spot instances, being more resilient in the face of them, etc. But 
we're keeping that path separate so we can bite off manageable chunks at a 
time.

> - Bringing scripts in-tree will make it easier to add a multiplexer, which we 
> are missing at the moment; that’s great. (Running jobs in a loop helps a lot 
> with flaky tests.) It also makes it easier to add any new test suites
Definitely; this should have been in the doc (and is in a few other docs on 
related topics). I'll add a bullet about multiplexing changed or newly added 
tests.

On Fri, Jun 30, 2023, at 2:38 PM, Ekaterina Dimitrova wrote:
> Thank you, Josh and Mick
> 
> Immediate questions on my mind:
> - Currently we run at most two parallel CI runs in Jenkins-dev; I guess you 
> will try to improve that limitation?
> - There are hw constraints, is there any approximation on how long it will 
> take to run all tests? Or is there a stated goal that we will strive to reach 
> as a project?
> - Bringing scripts in-tree will make it easier to add a multiplexer, which we 
> are missing at the moment; that’s great. (Running jobs in a loop helps a lot 
> with flaky tests.) It also makes it easier to add any new test suites
> 
> On Fri, 30 Jun 2023 at 13:35, Derek Chen-Becker  wrote:
>> Thanks Josh, this looks great! I think the constraints you've outlined are 
>> reasonable for an initial attempt. We can always evolve if we run into 
>> issues.
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie  wrote:
>>> Context: we're looking to get away from having split CircleCI and ASF CI as 
>>> well
>>> as getting ASF CI to a stable state. There's a variety of reasons why it's 
>>> flaky
>>> (orchestration, heterogeneous hardware, hardware failures, flaky tests,
>>> non-deterministic runs, noisy neighbors, etc), many of which Mick has been
>>> making great headway on starting to address.
>>> 
>>> If you're curious see:
>>> - Mick's 2023/01/09 email thread on CI:
>>> https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
>>> - Mick's 2023/04/26 email thread on CI:
>>> https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
>>> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o":
>>> https://issues.apache.org/jira/browse/CASSANDRA-18137
>>> - CASSANDRA-18133: In-tree build scripts:
>>> https://issues.apache.org/jira/browse/CASSANDRA-18133
>>> 
>>> What's fallen out from this: the new reference CI will have the following 
>>> logical layers:
>>> 1. ant
>>> 2. build/test scripts that set up the env. See run-tests.sh and
>>> run-python-dtests.sh here:
>>> 
>>> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
>>> 3. dockerized build/test scripts that have containerized the flow of 1 and 
>>> 2. See:
>>> 
>>> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
>>> 4. CI integrations. See generation of unified test report in build.xml:
>>> 
>>> https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817)
>>> 5. Optional full CI lifecycle w/Jenkins running in a container (full stack
>>> setup, run, teardown, pending)
>>> 
>>> **I want to let everyone know the high level structure of how this is 
>>> shaping up, as this is a change that will directly impact the work of 
>>> *all of us* on the project.**
>>> 
>>> In terms of our goals, the chief goals I'd like to call out in this context 
>>> are:
>>> * ASF CI needs to be and remain consistent
>>> * contributors need a turnkey way to validate their work before merging that
>>> they can accelerate by throwing resources at it.
>>> 
>>> We as a project need to determine what is *required* to run in a CI 
>>> environment
>>> to consider that run certified for merge. Where Mick and I landed 
>>> through a lot
>>> of back and forth is that the following would be re

Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-06-30 Thread Ekaterina Dimitrova
Thank you, Josh and Mick

Immediate questions on my mind:
- Currently we run at most two parallel CI runs in Jenkins-dev; I guess you
will try to improve that limitation?
- There are hw constraints, is there any approximation on how long it will
take to run all tests? Or is there a stated goal that we will strive to
reach as a project?
- Bringing scripts in-tree will make it easier to add a multiplexer, which we
are missing at the moment; that’s great. (Running jobs in a loop helps a lot
with flaky tests.) It also makes it easier to add any new test suites

On Fri, 30 Jun 2023 at 13:35, Derek Chen-Becker 
wrote:

> Thanks Josh, this looks great! I think the constraints you've outlined are
> reasonable for an initial attempt. We can always evolve if we run into
> issues.
>
> Cheers,
>
> Derek
>
> On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie 
> wrote:
>
>> Context: we're looking to get away from having split CircleCI and ASF CI
>> as well
>> as getting ASF CI to a stable state. There's a variety of reasons why
>> it's flaky
>> (orchestration, heterogeneous hardware, hardware failures, flaky tests,
>> non-deterministic runs, noisy neighbors, etc), many of which Mick has been
>> making great headway on starting to address.
>>
>> If you're curious see:
>> - Mick's 2023/01/09 email thread on CI:
>> https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
>> - Mick's 2023/04/26 email thread on CI:
>> https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
>> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o":
>> https://issues.apache.org/jira/browse/CASSANDRA-18137
>> - CASSANDRA-18133: In-tree build scripts:
>> https://issues.apache.org/jira/browse/CASSANDRA-18133
>>
>> What's fallen out from this: the new reference CI will have the following
>> logical layers:
>> 1. ant
>> 2. build/test scripts that set up the env. See run-tests.sh and
>> run-python-dtests.sh here:
>>
>> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
>> 3. dockerized build/test scripts that have containerized the flow of 1
>> and 2. See:
>>
>> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
>> 4. CI integrations. See generation of unified test report in build.xml:
>>
>> https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817
>> )
>> 5. Optional full CI lifecycle w/Jenkins running in a container (full stack
>> setup, run, teardown, pending)
>>
>>
>> *I want to let everyone know the high level structure of how this is
>> shaping up, as this is a change that will directly impact the work of
>> *all of us* on the project.*
>>
>> In terms of our goals, the chief goals I'd like to call out in this
>> context are:
>> * ASF CI needs to be and remain consistent
>> * contributors need a turnkey way to validate their work before merging
>> that
>> they can accelerate by throwing resources at it.
>>
>> We as a project need to determine what is *required* to run in a CI
>> environment
>> to consider that run certified for merge. Where Mick and I landed
>> through a lot
>> of back and forth is that the following would be required:
>> 1. used ant / pytest to build and run tests
>> 2. used the reference scripts being changed in CASSANDRA-18133 (in-tree
>> .build/)
>> to setup and execute your test environment
>> 3. constrained your runtime environment to the same hardware and time
>> constraints we use in ASF CI, within reason (CPU count independent of
>> speed,
>> memory size and disk size independent of hardware specs, etc)
>> 4. reported test results in a unified fashion that has all the
>> information we
>> normally get from a test run
>> 5. (maybe) Parallelized the tests across the same split lines as upstream
>> ASF
>> (i.e. no weird env specific neighbor / scheduling flakes)
>>
>> Last but not least is the "What do we do with CircleCI?" angle. The
>> current
>> thought is we allow people to continue using it with the stated goal of
>> migrating the circle config over to using the unified build scripts as
>> well and
>> get it in compliance with the above requirements.
>>
>> For reference, here's a gdoc where we've hashed this out:
>>
>> https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing
>>
>> So my questions for the community here:
>> 1. What's missing from the above conceptualization of the problem?
>> 2. Are the constraints too strong? Too weak? Just right?
>>
>> Thanks everyone, and happy Friday. ;)
>>
>> ~Josh
>>
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+

Re: CASSANDRA-18554 - mTLS based client and internode authenticators

2023-06-30 Thread Dinesh Joshi
Yuki, Jeremiah, both are fair points. The mental model we're using for
mTLS authentication is slightly different.

In your model you're treating the TLS identity itself as similar to
the password. The password is the 'shared secret' that currently needs
to be rotated by the user that owns the account, therefore necessitating
the permission to update their password. But that is not the case with
TLS certificates and mTLS identities.

The model we're going for is different. The identity is provisioned for
an account by a super user. This is more locked down and the user can
still rotate their own certificates but not change the identity
associated with their account without a super user.

Once provisioned, a user does not need to rotate the identity itself. They
only need to obtain fresh certificates as their certificates near
expiry. This requires no updates to the database, unlike passwords.

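(To make that flow concrete, a hedged sketch using the ADD IDENTITY syntax
quoted later in this thread; the spiffe identity, role name, and credentials
are illustrative:)

```
# One-time provisioning, done by a super user (names are illustrative):
cqlsh -u cassandra -p cassandra -e \
  "ADD IDENTITY 'spiffe://testdomain.com/testIdentifier/testValue' TO ROLE 'read_only_user'"
# From then on the client simply presents a fresh certificate carrying that
# identity; certificate rotation requires no further database updates.
```
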
We could extend this functionality in the future to allow users to
change their own identity. Nothing here prevents that.

thanks,

Dinesh



On 6/29/23 08:16, Jeremiah Jordan wrote:
> I like the idea of extending CREATE ROLE rather than adding a brand new
> ADD IDENTITY syntax.  Not sure how that can line up with one to many
> relationships for an identity, but maybe that can just be done through
> role hierarchy?
> 
> In either case, I don’t think IDENTITY related operations should be tied
> to the super user flag. They should be tied to either existing role
> permissions, or a brand new permission for IDENTITY.  We should not
> require that end users give the account allowed to make IDENTITY changes
> super user permission to do whatever they want across the whole database.
> 
> On Jun 28, 2023 at 11:48:02 PM, Yuki Morishita wrote:
>> Thinking more about "CREATE ROLE" permission, if we can extend CREATE
>> ROLE/ALTER ROLE statements, it may look streamlined:
>>
>> I don't have a good example, but something like:
>> ```
>> CREATE ROLE dev WITH LOGIN = true AND IDENTITIES = {'spiffe://xxx'};
>> ALTER ROLE dev ADD IDENTITY 'xxx'; 
>> LIST ROLES;
>> ```
>>
>> This requires a role-to-identities table as well as the current
>> identity-to-role table though.
>>
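
(A hedged sketch of the two lookup tables implied here; the thread elsewhere
names an `identity_to_roles` table, but these exact table and column
definitions are assumptions:)

```
# Hypothetical schema sketch; names and types are assumptions.
cqlsh -e "
CREATE TABLE IF NOT EXISTS system_auth.identity_to_roles (
    identity text PRIMARY KEY,
    role text
);
CREATE TABLE IF NOT EXISTS system_auth.roles_to_identities (
    role text PRIMARY KEY,
    identities set<text>
);"
```
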
>> On Thu, Jun 29, 2023 at 12:34 PM Yuki Morishita wrote:
>>
>> Hi Jyothsna,
>>
>> I think for the *initial* commit, the description looks fine to me.
>> I'd like to see/contribute to the future improvement though:
>>
>> * ADD IDENTITY requires SUPERUSER; this means that a brand new
>>   cluster needs to start with
>>   PasswordAuthenticator/CassandraAuthorizer first, and then change
>>   to the mTLS one.
>>   * For this, I'd really like to see Cassandra use password
>>     authn and authz by default.
>> * Cassandra allows a user with "CREATE ROLE" permission to
>>   create roles without superuser privilege. Maybe it is natural to
>>   allow them to add identities also?
>>
>>
>> On Thu, Jun 29, 2023 at 7:35 AM Jyothsna Konisa
>> <jyothsna1...@gmail.com> wrote:
>>
>> Hi Yuki,
>>
>> I have added cassandra docs for the CQL syntax that we are adding
>> and how to get started with using mTLS authenticators, along
>> with the migration plan. Please review it and let me know if
>> it looks good.
>>
>> Thanks,
>> Jyothsna Konisa.
>>
>> On Wed, Jun 21, 2023 at 10:46 AM Jyothsna Konisa
>> <jyothsna1...@gmail.com> wrote:
>>
>> Hi Yuki!
>>
>> Thanks for the questions.
>>
>> Here are the steps for the initial setup.
>>
>> 1. Since only super users can add/remove identities from
>> the `identity_to_roles` table, operators should use that
>> role to add authorized identities to the table. Note that
>> the authenticator is not an mTLS authenticator yet.
>> EX: ADD IDENTITY 'spiffe://testdomain.com/testIdentifier/testValue'
>>     TO ROLE 'read_only_user'
>>
>> 2. Change the authenticator configuration in cassandra.yaml to
>>    use the mTLS authenticator
>>    EX: authenticator:
>>          class_name: org.apache.cassandra.auth.MutualTlsAuthenticator
>>          parameters:
>>            validator_class_name: org.apache.cassandra.auth.SpiffeCertificateValidator
>> 3. Restart the cluster so that the newly configured mTLS
>>    authenticator is used
>>
>> What will be the op's first step to set up the roles and
>> identities?
>> -> Yes, the op should set up roles & identities first.
>>
>> Is default cassandra / cassandra superuser login still
>> required to set up other ro

Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-06-30 Thread Derek Chen-Becker
Thanks Josh, this looks great! I think the constraints you've outlined are
reasonable for an initial attempt. We can always evolve if we run into
issues.

Cheers,

Derek

On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie  wrote:

> Context: we're looking to get away from having split CircleCI and ASF CI
> as well
> as getting ASF CI to a stable state. There's a variety of reasons why it's
> flaky
> (orchestration, heterogeneous hardware, hardware failures, flaky tests,
> non-deterministic runs, noisy neighbors, etc), many of which Mick has been
> making great headway on starting to address.
>
> If you're curious see:
> - Mick's 2023/01/09 email thread on CI:
> https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
> - Mick's 2023/04/26 email thread on CI:
> https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o":
> https://issues.apache.org/jira/browse/CASSANDRA-18137
> - CASSANDRA-18133: In-tree build scripts:
> https://issues.apache.org/jira/browse/CASSANDRA-18133
>
> What's fallen out from this: the new reference CI will have the following
> logical layers:
> 1. ant
> 2. build/test scripts that set up the env. See run-tests.sh and
> run-python-dtests.sh here:
>
> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
> 3. dockerized build/test scripts that have containerized the flow of 1 and
> 2. See:
>
> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
> 4. CI integrations. See generation of unified test report in build.xml:
>
> https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817
> )
> 5. Optional full CI lifecycle w/Jenkins running in a container (full stack
> setup, run, teardown, pending)
>
>
> *I want to let everyone know the high level structure of how this is
> shaping up, as this is a change that will directly impact the work of
> *all of us* on the project.*
>
> In terms of our goals, the chief goals I'd like to call out in this
> context are:
> * ASF CI needs to be and remain consistent
> * contributors need a turnkey way to validate their work before merging
> that
> they can accelerate by throwing resources at it.
>
> We as a project need to determine what is *required* to run in a CI
> environment
> to consider that run certified for merge. Where Mick and I landed
> through a lot
> of back and forth is that the following would be required:
> 1. used ant / pytest to build and run tests
> 2. used the reference scripts being changed in CASSANDRA-18133 (in-tree
> .build/)
> to setup and execute your test environment
> 3. constrained your runtime environment to the same hardware and time
> constraints we use in ASF CI, within reason (CPU count independent of
> speed,
> memory size and disk size independent of hardware specs, etc)
> 4. reported test results in a unified fashion that has all the information
> we
> normally get from a test run
> 5. (maybe) Parallelized the tests across the same split lines as upstream
> ASF
> (i.e. no weird env specific neighbor / scheduling flakes)
>
> Last but not least is the "What do we do with CircleCI?" angle. The current
> thought is we allow people to continue using it with the stated goal of
> migrating the circle config over to using the unified build scripts as
> well and
> get it in compliance with the above requirements.
>
> For reference, here's a gdoc where we've hashed this out:
>
> https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing
>
> So my questions for the community here:
> 1. What's missing from the above conceptualization of the problem?
> 2. Are the constraints too strong? Too weak? Just right?
>
> Thanks everyone, and happy Friday. ;)
>
> ~Josh
>


-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


[DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-06-30 Thread Josh McKenzie
Context: we're looking to get away from having split CircleCI and ASF CI as well
as getting ASF CI to a stable state. There's a variety of reasons why it's flaky
(orchestration, heterogeneous hardware, hardware failures, flaky tests,
non-deterministic runs, noisy neighbors, etc), many of which Mick has been
making great headway on starting to address.

If you're curious see:
- Mick's 2023/01/09 email thread on CI:
https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
- Mick's 2023/04/26 email thread on CI:
https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
- CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o":
https://issues.apache.org/jira/browse/CASSANDRA-18137
- CASSANDRA-18133: In-tree build scripts:
https://issues.apache.org/jira/browse/CASSANDRA-18133

What's fallen out from this: the new reference CI will have the following 
logical layers:
1. ant
2. build/test scripts that set up the env. See run-tests.sh and
run-python-dtests.sh here:

https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
3. dockerized build/test scripts that have containerized the flow of 1 and 2. 
See:

https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
4. CI integrations. See generation of unified test report in build.xml:

https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817)
5. Optional full CI lifecycle w/Jenkins running in a container (full stack
setup, run, teardown, pending)
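
(To make the layers concrete, a hedged sketch of the intended contributor flow;
run-tests.sh and run-python-dtests.sh exist on the linked branch, but the
docker script path and all arguments here are assumptions:)

```
# Hedged sketch of layers 1-3; invocations are assumptions, not a final interface.
git clone https://github.com/apache/cassandra.git && cd cassandra
.build/run-tests.sh              # layers 1-2: env setup, ant build, unit tests
.build/run-python-dtests.sh      # layer 2: python dtests
.build/docker/run-tests.sh       # layer 3: the same flow inside the container
```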

**I want to let everyone know the high level structure of how this is shaping
up, as this is a change that will directly impact the work of *all of us* on
the project.**

In terms of our goals, the chief goals I'd like to call out in this context are:
* ASF CI needs to be and remain consistent
* contributors need a turnkey way to validate their work before merging that
they can accelerate by throwing resources at it.

We as a project need to determine what is *required* to run in a CI environment
to consider that run certified for merge. Where Mick and I landed through a 
lot
of back and forth is that the following would be required:
1. used ant / pytest to build and run tests
2. used the reference scripts being changed in CASSANDRA-18133 (in-tree .build/)
to setup and execute your test environment
3. constrained your runtime environment to the same hardware and time
constraints we use in ASF CI, within reason (CPU count independent of speed,
memory size and disk size independent of hardware specs, etc; see the sketch
after this list)
4. reported test results in a unified fashion that has all the information we
normally get from a test run
5. (maybe) Parallelized the tests across the same split lines as upstream ASF
(i.e. no weird env specific neighbor / scheduling flakes)
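
(For requirement 3, a hedged sketch of pinning a local run to the ASF CI
resource shape; the 4cpu/8GB figures come from Mick's reply elsewhere in this
thread, and the image name and mount point are assumptions based on the
existing CircleCI setup:)

```
# Hedged sketch: constrain a local run to the ASF CI hardware shape.
docker run --rm --cpus=4 --memory=8g \
  -v "$PWD":/home/cassandra/cassandra \
  apache/cassandra-testing-ubuntu2004-java11-w-dependencies \
  bash -c "cd /home/cassandra/cassandra && .build/run-tests.sh"
```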

Last but not least is the "What do we do with CircleCI?" angle. The current
thought is we allow people to continue using it with the stated goal of
migrating the circle config over to using the unified build scripts as well and
get it in compliance with the above requirements.

For reference, here's a gdoc where we've hashed this out:

https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing

So my questions for the community here:
1. What's missing from the above conceptualization of the problem?
2. Are the constraints too strong? Too weak? Just right?

Thanks everyone, and happy Friday. ;)

~Josh