To slightly expand on the above: we do have tests with a fixed random seed (build.yml) and with a variable random seed (ci-meson.yml).

I was not aware that only meson used a random seed. That's good to know and explains why meson seems to fail more often in CI. As a reviewer/developer it's helpful to know which workflows are the most stable, and which could fail for unrelated reasons. This fact should be written down somewhere, probably in the CI documentation if you end up having time to write it.
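For concreteness, here is a minimal Python sketch of the two seeding policies; this is not Sage's actual harness, and the environment variable name is made up. The key point is that a pinned seed makes runs reproducible, while a fresh seed per run widens coverage but produces failures that look random unless the seed is logged. (If I remember correctly, a failure from a variable-seed run can be replayed locally with `sage -t --random-seed=<seed>`.)

```python
import os
import random
import secrets

def pick_seed() -> int:
    """Pick the doctest random seed.

    A fixed-seed workflow (like build.yml) would pin the seed so every
    run is reproducible; a variable-seed workflow (like ci-meson.yml)
    would draw a fresh one each run, which catches more seed-dependent
    bugs but fails "randomly".  DOCTEST_RANDOM_SEED is a hypothetical
    override for replaying a reported failure locally.
    """
    pinned = os.environ.get("DOCTEST_RANDOM_SEED")
    if pinned is not None:
        return int(pinned)
    return secrets.randbits(64)

seed = pick_seed()
print(f"using random seed {seed}")  # always log the seed so failures can be replayed
random.seed(seed)
```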

A related question would be: shall we temporarily disable tests that are known to randomly fail?
Advantage: less noise due to random failures
Disadvantage: less coverage

I don't think tests need to be disabled; rather, the CI should not report a PR as failing if the same failure occurs on develop. In other words, still run the flaky tests, but only report the workflow as failing for failures that do not also occur on develop. I think we already have something like this for the fixed-seed tests, but not for the random-seed ones.

On the other hand, it would be nice if the CI highlighted a test which passes in a PR but fails on develop.
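To make both ideas concrete, here is a rough sketch of comparing a PR's failing tests against a develop baseline; the file names and JSON format are invented for illustration, not our actual CI output. Only failures new in the PR fail the job, and tests that pass in the PR but fail on develop are highlighted:

```python
import json
import sys

def load_failures(path: str) -> set[str]:
    """Load the set of failing test names from a JSON list, e.g. as
    produced by an earlier workflow step.  The format is hypothetical."""
    with open(path) as f:
        return set(json.load(f))

pr_failures = load_failures("pr_failures.json")
develop_failures = load_failures("develop_failures.json")

new_failures = sorted(pr_failures - develop_failures)    # regressions in the PR
fixed_on_pr = sorted(develop_failures - pr_failures)     # pass in PR, fail on develop
known_failures = sorted(pr_failures & develop_failures)  # broken on develop too

for name in fixed_on_pr:
    print(f"FIXED relative to develop: {name}")
for name in known_failures:
    print(f"known failure (also fails on develop, not blocking): {name}")
for name in new_failures:
    print(f"NEW failure: {name}")

# Fail the workflow only for failures that do not also occur on develop.
sys.exit(1 if new_failures else 0)
```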

>  I wonder if the stranger/unreproducible failures might be caused by some faulty caching on the CI server, but I don't know enough about how the CI server is configured and what is cached between builds to say if that might be the case.

From my experience, these issues are almost never specific to CI (i.e. the same error could in principle be reproduced by running the same commands locally on a developer's machine). The only exceptions are the issues related to "docker pull/push" that you sometimes see. Those come from the design decision to run the CI in a new docker container. Fixing those issues by redesigning the corresponding workflows would be desirable (see below).
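Until such a redesign happens, a cheap stopgap for transient "docker pull" failures would be to retry with backoff. A minimal sketch, where the image name, attempt count, and delays are made up for illustration:

```python
import subprocess
import time

def pull_with_retry(image: str, attempts: int = 3, delay: float = 30.0) -> None:
    """Retry `docker pull` to paper over transient registry/network errors."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["docker", "pull", image])
        if result.returncode == 0:
            return
        if attempt < attempts:
            time.sleep(delay * attempt)  # simple linear backoff between attempts
    raise RuntimeError(f"docker pull {image} failed after {attempts} attempts")

if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    pull_with_retry("ghcr.io/sagemath/sage/sage-dev:latest")
```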

This failure in code unrelated to the PR is due neither to a random test nor to a docker issue, and I cannot reproduce it: https://github.com/sagemath/sage/actions/runs/17134650987/job/48607483558#step:15:8858

I agree that failures like this are very rare, though. We do run a lot of CI jobs every month, so I would not be surprised to learn that the expected number of monthly random hardware glitches (or solar flares, or whatever your favourite explanation is for strange computer phenomena) for our CI setup is non-negligible.

>  It would be nice to have a GitHub label for these kinds of issues so they can be found more easily. I'm not sure who has permissions to add new labels.

Good idea! This needs to be done by one of the GitHub org admins.
Would whoever has the permissions consider adding two new labels to GitHub? One called "CI" (or something similar) for issues/PRs relating to the CI (we have "CI fix", but that is reserved for CI fixes that should be merged before other PRs). And one called "random seed failure" (or something similar) for issues that report, or PRs that fix, tests that consistently fail for specific random seeds.
