All great questions I don't have answers to, Ekaterina. :) Thoughts, though:

> - Currently we run at most two parallel CI runs in Jenkins-dev; I guess you 
> will try to address that limitation?
If we get to using cloud-based resources for CI with a budget, instead of our 
donated hardware, we could theoretically run more jobs at a time on ASF infra. 
Managing a monthly recurring CI spend for a bunch of committers around the 
world with different sponsors is outside the scope of what we're targeting, 
but the work we're doing now will enable us to pursue that as a potential 
option in the future.

> - There are hw constraints; is there any estimate of how long it will take 
> to run all tests? Or is there a stated goal that we will strive to reach as 
> a project?
Have to defer to Mick on this; I don't think the changes outlined here will 
materially change the runtime on our currently donated CI nodes. It'd be 
faster if we spun up cloud resources; we've gone back and forth on that topic 
too: using spot instances, building more resilience in the face of them, etc. 
But we're keeping that path separate so we can bite off manageable chunks at 
a time.

> - Bringing scripts in-tree will make it easier to add a multiplexer, which 
> we lack at the moment; that’s great. (Running jobs in a loop helps a lot 
> with flaky tests.) It also makes it easier to add any new test suites.
Definitely; this should have been in the doc (and is in a few other related 
docs on the topic). I'll add a bullet about multiplexing changed or newly 
added tests; a rough sketch of the idea is below.
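
To make it concrete: a multiplexer is little more than a loop that reruns a 
single test target many times and tallies the failures. A minimal sketch in 
shell (the multiplex.sh name and the run-tests.sh interface here are 
illustrative assumptions, not the final CASSANDRA-18133 scripts):

    #!/usr/bin/env bash
    # multiplex.sh -- rerun one test target N times to surface flakiness.
    # Assumes a hypothetical in-tree entry point: .build/run-tests.sh <suite> <target>
    set -euo pipefail

    target="${1:?usage: multiplex.sh <test-target> [iterations]}"
    iterations="${2:-100}"
    failures=0

    for i in $(seq 1 "$iterations"); do
        # Keep a per-run log so any failing iteration can be inspected later.
        if ! .build/run-tests.sh test "$target" > "multiplex-$i.log" 2>&1; then
            failures=$((failures + 1))
            echo "run $i: FAILED (see multiplex-$i.log)"
        fi
    done

    echo "$failures/$iterations runs failed for $target"
    exit $(( failures > 0 ))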

On Fri, Jun 30, 2023, at 2:38 PM, Ekaterina Dimitrova wrote:
> Thank you, Josh and Mick
> 
> Immediate questions on my mind:
> - Currently we run at most two parallel CI runs in Jenkins-dev; I guess you 
> will try to address that limitation?
> - There are hw constraints; is there any estimate of how long it will take 
> to run all tests? Or is there a stated goal that we will strive to reach as 
> a project?
> - Bringing scripts in-tree will make it easier to add a multiplexer, which 
> we lack at the moment; that’s great. (Running jobs in a loop helps a lot 
> with flaky tests.) It also makes it easier to add any new test suites.
> 
> On Fri, 30 Jun 2023 at 13:35, Derek Chen-Becker <de...@chen-becker.org> wrote:
>> Thanks, Josh; this looks great! I think the constraints you've outlined are 
>> reasonable for an initial attempt. We can always evolve if we run into 
>> issues.
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> Context: we're looking to get away from having split CircleCI and ASF CI, 
>>> as well as getting ASF CI to a stable state. There are a variety of reasons 
>>> why it's flaky (orchestration, heterogeneous hardware, hardware failures, 
>>> flaky tests, non-deterministic runs, noisy neighbors, etc.), many of which 
>>> Mick has been making great headway on addressing.
>>> 
>>> If you're curious, see:
>>> - Mick's 2023/01/09 email thread on CI:
>>>     https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
>>> - Mick's 2023/04/26 email thread on CI:
>>>     https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
>>> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o":
>>>     https://issues.apache.org/jira/browse/CASSANDRA-18137
>>> - CASSANDRA-18133: In-tree build scripts:
>>>     https://issues.apache.org/jira/browse/CASSANDRA-18133
>>> 
>>> What's fallen out from this: the new reference CI will have the following 
>>> logical layers:
>>> 1. ant
>>> 2. build/test scripts that set up the env. See run-tests.sh and
>>>     run-python-dtests.sh here:
>>>     https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
>>> 3. dockerized build/test scripts that have containerized the flow of 1 and
>>>     2; a rough sketch follows this list. See:
>>>     https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
>>> 4. CI integrations. See generation of unified test report in build.xml:
>>>     https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817
>>> 5. optional full CI lifecycle w/Jenkins running in a container (full-stack
>>>     setup, run, teardown; pending)
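>>>
>>> To make the layering concrete, layer 3 is essentially layer 2 wrapped in a
>>> container. Hypothetical invocations (the exact script names and arguments
>>> live in CASSANDRA-18133 and may differ):
>>>
>>>     # Layer 2: run a test suite directly on the host via the in-tree scripts
>>>     .build/run-tests.sh test
>>>
>>>     # Layer 3: the same flow, but executed inside the build/test container
>>>     # so the environment matches what ASF CI uses
>>>     .build/docker/run-tests.sh test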
>>> 
>>> **I want to let everyone know the high level structure of how this is
>>> shaping up, as this is a change that will directly impact the work of
>>> *all of us* on the project.**
>>> 
>>> The chief goals I'd like to call out in this context are:
>>> * ASF CI needs to be and remain consistent
>>> * contributors need a turnkey way to validate their work before merging,
>>>     one they can accelerate by throwing resources at it
>>> 
>>> We as a project need to determine what is *required* to run in a CI
>>> environment to consider that run certified for merge. Where Mick and I
>>> landed through a lot of back and forth is that the following would be
>>> required:
>>> 1. used ant / pytest to build and run tests
>>> 2. used the reference scripts being changed in CASSANDRA-18133 (in-tree
>>>     .build/) to set up and execute your test environment
>>> 3. constrained your runtime environment to the same hardware and time
>>>     constraints we use in ASF CI, within reason (CPU count independent of
>>>     speed, memory size and disk size independent of hardware specs, etc.);
>>>     see the sketch after this list
>>> 4. reported test results in a unified fashion that has all the information
>>>     we normally get from a test run
>>> 5. (maybe) parallelized the tests across the same split lines as upstream
>>>     ASF (i.e. no weird env-specific neighbor / scheduling flakes)
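>>>
>>> As a sketch of what constraint 3 could look like in practice (the image
>>> name and limits below are made up; the real values would come from the
>>> .build/docker scripts):
>>>
>>>     # Hypothetical: pin a local containerized run to a CI-like hardware
>>>     # shape (4 CPUs, 16g RAM) regardless of how big your workstation is
>>>     docker run --cpus=4 --memory=16g \
>>>         -v "$(pwd)":/home/cassandra -w /home/cassandra \
>>>         cassandra-build-env \
>>>         .build/run-tests.sh test
>>>     # (disk caps depend on the storage driver, e.g. --storage-opt size=...)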
>>> 
>>> Last but not least is the "What do we do with CircleCI?" angle. The current
>>> thought is that we allow people to continue using it, with the stated goal
>>> of migrating the Circle config over to the unified build scripts as well
>>> and bringing it into compliance with the above requirements.
>>> 
>>> For reference, here's a gdoc where we've hashed this out:
>>>     https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing
>>> 
>>> So my questions for the community here:
>>> 1. What's missing from the above conceptualization of the problem?
>>> 2. Are the constraints too strong? Too weak? Just right?
>>> 
>>> Thanks everyone, and happy Friday. ;)
>>> 
>>> ~Josh
