On 2025-08-25 6:56 p.m., Kwankyu Lee wrote:
Perhaps this helps you understand the current pitiful state of the CI
infrastructure. Since the CI infrastructure itself is broken and
unstable, I think that documenting it now is premature.
Even if formal documentation isn't possible, I still think it would be
good to have something informal explaining what people are working on
for the CI, how it works, what needs to be improved, etc. Maybe this
could be a use case for https://github.com/sagemath/sage/discussions ?
It would be nice to have something a bit more centralized than various
GitHub issues and PRs, even if it's just a list of the relevant issues
and PRs (GitHub's projects feature might also be useful for this:
https://docs.github.com/en/issues/planning-and-tracking-with-projects).
My worry is that it seems there are very few people who are
knowledgeable about our CI. To put it crudely, the bus factor
<https://en.wikipedia.org/wiki/Bus_factor> for our CI infrastructure
seems to be low.
On 2025-08-26 7:55 a.m., '[email protected]' via sage-devel wrote:
The biggest issue with the reliability of the CI is a deep design
decision in the way the tests are set up. Many doctests have an
inherent random element, and this is mostly on purpose to increase the
surface of code paths that are tested and thereby discover new bugs.
The disadvantage is that unfortunately some test runs will produce
failures that are not connected to the changes of the PR. I don't really
see anything that can be done on the level of the CI infrastructure
to improve the situation, but would be happy to get new ideas.
We also occasionally have PRs fail for issues other than random inputs.
Sometimes there is a networking issue (not sure if there's anything to
be done about that). I've sometimes seen tests fail that did not involve
any randomness and the failure was in code unrelated to the PR and could
not be reproduced (for example,
https://github.com/sagemath/sage/actions/runs/17134650987/job/48607483558#step:15:8858).
I wonder if the stranger/unreproducible failures might be caused by some
faulty caching on the CI server, but I don't know enough about how the
CI server is configured and what is cached between builds to say if that
might be the case.
On 2025-08-26 7:55 a.m., '[email protected]' via sage-devel wrote:
work on such issues (searching for 'random' or 'flaky' or 'CI' in the
github issues should bring up most of them, eg
https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22).
It would be nice to have a GitHub label for these kinds of issues so
they can be found more easily. I'm not sure who has permissions to add
new labels.
On 2025-08-26 7:55 a.m., '[email protected]' via sage-devel wrote:
I also have a half-working notebook that extracts the failing tests
from the CI runs at https://github.com/sagemath/sage/pull/39100
Getting this from half-working to working and reporting the results
somewhere easy to find sounds like it would be a worthwhile endeavour,
and would be a good start to collecting information about the CI
somewhere where it can be found easily. I'll open a GitHub issue for
this (I'm not saying I am going to try to work on it myself anytime
soon, so if anyone else reading this wants to work on it feel free).
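To make the idea a bit more concrete, here is a minimal sketch of the
extraction step. It is not the notebook from the PR above, and it assumes
that a failing file shows up in the log as a summary line of the form
"sage -t [options] path  # reason"; the pattern and the names
failing_files and tally are my own placeholders and would need checking
against real CI logs.

```python
import re
from collections import Counter

# Assumed shape of a failure summary line (possibly preceded by a timestamp):
#   sage -t --long --random-seed=123 src/sage/foo.py  # 2 doctests failed
# This format is an assumption; adjust the patterns to the actual logs.
SUMMARY = re.compile(r"sage -t\b(?P<opts>.*?)\s(?P<path>\S+)\s+#\s+(?P<reason>.+)$")
SEED = re.compile(r"--random-seed=(\d+)")


def failing_files(log_text):
    """Return (path, seed, reason) for every failure summary line in a log."""
    found = []
    for line in log_text.splitlines():
        match = SUMMARY.search(line)
        if match is None:
            continue  # not a failure summary line
        seed = SEED.search(match.group("opts"))
        found.append((
            match.group("path"),
            int(seed.group(1)) if seed else None,  # None if no seed was logged
            match.group("reason").strip(),
        ))
    return found


def tally(logs):
    """Count in how many logs each file failed (at most once per log)."""
    counts = Counter()
    for text in logs:
        counts.update({path for path, _, _ in failing_files(text)})
    return counts
```

Files that fail in many unrelated runs would be the first candidates for
the "random"/"flaky" issues mentioned above.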
On 2025-08-26 8:54 a.m., Dylan Thurston wrote:
I know nothing about the internal implementation here, but just this
description suggests a change in practice: when a test on a PR fails,
rerun that test on the existing branch (without the pull request) with
the same random seed to see if it also fails there. If it does, then
automatically file a separate issue.
That sounds like a good idea that is probably technically possible, but I
have no idea how it would be implemented. I do think we have something
to ignore failures for doctests that failed on the last commit to the
develop branch, which is similar enough that I think this should be
possible.
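As a rough sketch of the decision logic only (not of how it would hook
into our CI), the rerun-on-base idea could look like the following. The
function classify_failure and the run_on_base hook are hypothetical
names; the hook stands in for whatever would actually rerun one test on
the base branch with a given seed.

```python
from typing import Callable


def classify_failure(
    test_id: str,
    seed: int,
    run_on_base: Callable[[str, int], bool],
) -> str:
    """Classify a doctest failure that was seen on a PR.

    run_on_base(test_id, seed) is a hypothetical hook: it reruns the test
    on the base branch (without the PR) with the same random seed and
    returns True if the test passes there.
    """
    if run_on_base(test_id, seed):
        # The base branch passes with this seed, so the PR is the likely cause.
        return "pr-related"
    # The base branch fails too: a candidate for a separate issue.
    return "pre-existing"
```

A pass on the base branch only makes the PR the likely cause. Failures
that do not depend on the seed (networking, timing) could still be
misclassified this way.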
One issue with this suggestion is that it will mean CI takes longer, which
would aggravate the problem of potentially waiting a long time for a CI
server to be available for other tests. For setups with limited CI
minutes I think it's common to run a limited test suite on PRs and the
full test suite on develop. We sort of do this: more distros are tested
on develop than on PRs, for instance. Another possibility is to run the
most important/stable/relevant tests first and only run the full test
suite after those pass. Obviously this has the drawback of waiting
longer for things to finish if we don't start jobs until the first jobs
succeed, but it's something to consider if we reach a point where we are
running more CI tests than our CI servers can keep up with. One way to
do this would be to run tests on our "most stable" systems first (Linux)
and only test the less stable systems (Mac, Windows) if those pass. Or
only test the PDF docbuild after the HTML docbuild passes. We'd probably
want to have some label to override this for PRs that are trying to fix
something like a Windows-specific issue. If our CI usage right now isn't
a big problem then I don't think we need to do this; I'm just pointing out
that there are options if it becomes an issue in the future.
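For illustration, the staged approach with an override label could be
sketched like this. The job names, the label name and the next_jobs
function are all made up for the example and do not correspond to our
actual workflows.

```python
# Hypothetical job names, grouped into stages that run in order.
STAGES = [
    ["linux-tests", "html-docbuild"],                  # "most stable", run first
    ["macos-tests", "windows-tests", "pdf-docbuild"],  # only after stage 1 passes
]

# Hypothetical PR label that disables staging (e.g. for a Windows-specific fix).
OVERRIDE_LABEL = "ci: run all"


def next_jobs(results: dict[str, bool], labels: set[str]) -> list[str]:
    """Return the unfinished jobs that are allowed to run.

    results maps each finished job to True (passed) or False (failed).
    """
    if OVERRIDE_LABEL in labels:
        # Staging is disabled: every unfinished job may run.
        return [job for stage in STAGES for job in stage if job not in results]
    for stage in STAGES:
        pending = [job for job in stage if job not in results]
        if pending:
            return pending  # finish this stage before looking at later ones
        if not all(results[job] for job in stage):
            return []  # this stage failed, so later stages are skipped
    return []  # everything has finished
```

For example, with both stage 1 jobs passed, next_jobs returns the three
stage 2 jobs; if "linux-tests" failed and the PR has no override label,
it returns nothing.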