Hello everyone,

In the last 24 hours, the NuttX project exceeded the limit of 25 daily runners
for GitHub CI that the Apache infrastructure team enforces. Our usage reached
around 70 runners.

This has happened before and we were warned by the Apache infrastructure team,
leading to several improvements to our GitHub CI. Many of these improvements
came from Lup, but we need some more attention from other contributors to
resolve this issue.

You can see the discussion around this here: 
https://github.com/apache/nuttx/issues/17914

I'm opening this mailing list thread so that we can hopefully discuss some
potential solutions here.

I think we first need to clarify what exactly our testing goals are with the CI,
outside of minimum requirements of checking linting/style compatibility. What do
we want to catch in our CI runs on new PRs? Once this is narrowed down, we can
hopefully start altering our CI system to run the bare minimum checks to achieve
the testing we desire.

Some suggestions from the issue were:

- Allow maintainers to manually select which workflows to run
- Prevent CI from running until PRs receive some approvals
- Check the Apache infra API to stop CI runs when the daily limit has been
  exceeded
- Use the GitHub labels on PRs to choose which parts of the CI to run
- Have one large CI configuration (say, on the simulator) which can be run for
  PRs that affect general code and not board-specific logic (e.g. modifications
  to the scheduler)
- Run a small CI run for new PRs and then run a nightly full-build to check for
  any failures that were not caught
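
To make a few of these suggestions concrete, here is a rough sketch of how
label-based selection and approval gating could be combined. Everything in it is
an assumption for illustration: the label names, the job group names, and the
`plan_ci` helper are hypothetical and do not describe our current workflows.

```python
# Hypothetical mapping from PR labels to CI job groups (illustrative only).
LABEL_TO_JOBS = {
    "Arch: arm": ["arm"],
    "Arch: risc-v": ["risc-v"],
    "Area: Documentation": ["docs"],
}


def plan_ci(labels, approvals, min_approvals=1):
    """Return the job groups to run for a PR.

    Lint/style checks always run. Build jobs are only scheduled once the PR
    has at least `min_approvals` approvals, and only for labelled areas.
    """
    jobs = ["lint"]
    if approvals < min_approvals:
        return jobs
    for label in sorted(labels):
        for job in LABEL_TO_JOBS.get(label, []):
            if job not in jobs:
                jobs.append(job)
    return jobs
```

With this sketch, an unapproved PR labelled `Arch: arm` would only run the lint
checks, and the ARM builds would be added once it has an approval.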

From what I can tell, much of our CI usage is spent on compile-testing every
configuration for every board under the architecture being modified (e.g. all
ARM boards for ARM changes). I think we can start reducing CI usage by picking
a representative board + configuration combo for each chip. This means a change
to the RP2040 chip logic will build only one configuration in CI, and a change
to the ARM Cortex-M0 logic will build one configuration per Cortex-M0 chip. I
think this would drastically reduce our CI usage, although it will take a good
amount of work to implement.
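
As a minimal sketch of the representative-build idea, assuming the usual
`arch/<arch>/src/<chip>/` and `boards/<arch>/<chip>/<board>/` layout: the two
tables and the `select_builds` helper below are hypothetical, and the
board:config entries are illustrative picks rather than agreed choices.

```python
from pathlib import PurePosixPath

# Hypothetical: one representative "board:config" per chip (illustrative).
REPRESENTATIVE = {
    "rp2040": "raspberrypi-pico:nsh",
    "stm32f0l0g0": "nucleo-f091rc:nsh",
}

# Hypothetical: shared core directories and the chips that depend on them.
FAMILY = {
    "armv6-m": ["rp2040", "stm32f0l0g0"],
}


def select_builds(changed_paths):
    """Map the files changed in a PR to the representative builds to run."""
    builds = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if len(parts) >= 4 and parts[0] == "arch" and parts[2] == "src":
            name = parts[3]
            if name in REPRESENTATIVE:
                # Chip-specific change: build that chip's one configuration.
                builds.add(REPRESENTATIVE[name])
            elif name in FAMILY:
                # Shared core change: one configuration per dependent chip.
                builds.update(REPRESENTATIVE[c] for c in FAMILY[name])
        elif len(parts) >= 3 and parts[0] == "boards":
            chip = parts[2]
            if chip in REPRESENTATIVE:
                builds.add(REPRESENTATIVE[chip])
    return sorted(builds)
```

In this sketch, a change under `arch/arm/src/rp2040/` selects a single build,
while a change under `arch/arm/src/armv6-m/` selects one build per listed chip.
Most of the implementation work would be in choosing and maintaining the tables.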

Please share your thoughts about what CI _should_ be testing and any
suggestions you have on where NuttX can cut down CI usage. We need more people
than just Lup working on this now, since exceeding our limits:

a) frustrates Apache infra and may lead to us losing workflow privileges
b) forces us to stop workflow runs and merges for incoming PRs until we are
   below the limit again

This isn't sustainable, so we need to come up with some solutions and implement
them soon!

-- 
Matteo Golin
