To expand slightly on the above: we do have tests with a fixed seed
(build.yml) and with a variable random seed (ci-meson.yml).
I was not aware that only the Meson workflow uses a random seed. That's
good to know, and it explains why Meson seems to fail more often in CI.
As a reviewer/developer it's helpful to know which workflows are the
most stable and which can fail for reasons unrelated to a PR. This fact
should be written down somewhere, probably in the CI documentation if
you end up having time to write it.
A related question would be: shall we temporarily disable tests that
are known to randomly fail?
Advantage: less noise due to random failures
Disadvantage: less coverage
I don't think tests need to be disabled; rather, the CI should not
report a PR as failing if the same failure occurs on develop. So still
run the failing tests, but compare the results against develop before
reporting the workflow as failed. I think we already have something like
this for the fixed-seed tests, but not for the random seeds.
On the other hand, it would be nice if the CI highlighted a test which
passes in a PR but fails on develop.
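The comparison described above could be sketched roughly as follows
(all function names and file paths here are hypothetical, purely for
illustration; a real implementation would extract the failure lists
from the CI logs of the PR run and a baseline develop run):

```python
# Hypothetical sketch: classify the test failures of a PR run relative
# to a baseline run on develop.  Inputs are lists of failing test files;
# where those lists come from (log parsing, a JSON report, ...) is left
# open here.

def classify_failures(pr_failures, develop_failures):
    """Split PR failures into those new in the PR and those already
    failing on develop, and also report tests the PR fixes."""
    pr, dev = set(pr_failures), set(develop_failures)
    return {
        "new_in_pr": sorted(pr - dev),        # should fail the workflow
        "also_on_develop": sorted(pr & dev),  # report, but don't fail
        "fixed_by_pr": sorted(dev - pr),      # nice to highlight
    }

result = classify_failures(
    pr_failures=["src/sage/rings/foo.py", "src/sage/graphs/bar.py"],
    develop_failures=["src/sage/graphs/bar.py", "src/sage/combinat/baz.py"],
)
print(result["new_in_pr"])       # only these should mark the PR red
print(result["fixed_by_pr"])     # these could be highlighted positively
```

With random seeds there is the extra wrinkle that the develop baseline
would have to be run with the same seed as the PR run for the comparison
to be meaningful.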
> I wonder if the stranger/unreproducible failures might be caused by
some faulty caching on the CI server, but I don't know enough about
how the CI server is configured and what is cached between builds to
say if that might be the case.
In my experience, these issues are almost never specific to CI (i.e.
the same error could in principle be reproduced by running the same
commands locally on a developer's machine). The only exceptions are the
"docker pull/push" issues that you sometimes see; those come from the
design decision to run the CI in a fresh docker container each time.
Fixing those issues by redesigning the corresponding workflows would be
desirable (see below).
This failure in code unrelated to the PR is due neither to a random
test nor to a docker issue, and I cannot reproduce it:
https://github.com/sagemath/sage/actions/runs/17134650987/job/48607483558#step:15:8858
I agree that failures like this are very rare though. We do use a lot of
CI every month, so I would not be surprised to learn that the expected
number of monthly random hardware glitches (or solar flares, or whatever
your favourite explanation is for strange computer phenomena) for our CI
setup is non-negligible.
> It would be nice to have a GitHub label for these kinds of issues
so they can be found more easily. I'm not sure who has permissions to
add new labels.
Good idea! This needs to be done by one of the GitHub org admins.
Would whoever has the permissions for this consider adding two new
labels on GitHub? One called "CI" (or something similar), to be used
for issues/PRs relating to the CI (we have "CI fix", but that is for CI
fixes that should be merged before other PRs). And one called "random
seed failure" (or something similar), for issues that report, or PRs
that fix, tests that consistently fail for specific random seeds.
--
You received this message because you are subscribed to the Google Groups
"sage-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion visit
https://groups.google.com/d/msgid/sage-devel/9f9089bd-5e20-4b1c-ac9d-7d0a0f760d64%40ucalgary.ca.