Re: [sage-devel] Re: Documentation and state of Sage CI

Vincent Macri Tue, 26 Aug 2025 15:58:13 -0700

On 2025-08-25 6:56 p.m., Kwankyu Lee wrote:

Perhaps this helps you understand the current pitiful state of the CI infrastructure. Since the CI infrastructure itself is broken and unstable, I think that documenting it now is premature.

Even if formal documentation isn't possible, I still think it would be good to have something informal explaining what people are working on for the CI, how it works, what needs to be improved, etc. Maybe this could be a use case for https://github.com/sagemath/sage/discussions ? It would be nice to have something a bit more centralized than various GitHub issues and PRs, even if it's just a list of the relevant issues and PRs (GitHub's projects feature might also be useful for this: https://docs.github.com/en/issues/planning-and-tracking-with-projects). My worry is that it seems there are very few people who are knowledgeable about our CI. To put it crudely, the bus factor <https://en.wikipedia.org/wiki/Bus_factor> for our CI infrastructure seems to be low.


On 2025-08-26 7:55 a.m., '[email protected]' via sage-devel wrote:

The biggest issue with the reliability of the CI is a deep design decision in the way the tests are set up. Many doctests have an inherent random element, and this is mostly on purpose to increase the surface of code paths that are tested and thereby discover new bugs. The disadvantage is that unfortunately some test runs will produce failures that are not connected to the changes of the PR. I don't see really anything that can be done on the level of the CI infrastructure to improve the situation, but would be happy to get new ideas.

We also occasionally have PRs fail for issues other than random inputs. Sometimes there is a networking issue (not sure if there's anything to be done about that). I've sometimes seen tests fail that did not involve any randomness and the failure was in code unrelated to the PR and could not be reproduced (for example, https://github.com/sagemath/sage/actions/runs/17134650987/job/48607483558#step:15:8858). I wonder if the stranger/unreproducible failures might be caused by some faulty caching on the CI server, but I don't know enough about how the CI server is configured and what is cached between builds to say if that might be the case.


On 2025-08-26 7:55 a.m., '[email protected]' via sage-devel wrote:

work on such issues (searching for 'random' or 'flaky' or 'CI' in the GitHub issues should bring up most of them, e.g. https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22).

It would be nice to have a GitHub label for these kinds of issues so they can be found more easily. I'm not sure who has permissions to add new labels.
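
Until there is a label, the keyword searches mentioned above can at least be generated consistently. A minimal sketch, assuming the same query shape as the example URL in the quoted message; the function name is just illustrative:

```python
from urllib.parse import quote

REPO_ISSUES = "https://github.com/sagemath/sage/issues"


def issue_search_url(keyword: str) -> str:
    """Build a search URL for open issues that mention the quoted keyword."""
    query = f'is:issue state:open "{keyword}"'
    # safe="" percent-encodes ':' and '"' as well, matching the URL quoted above.
    return f"{REPO_ISSUES}?q={quote(query, safe='')}"


for keyword in ("random", "flaky", "CI"):
    print(issue_search_url(keyword))
```

Once a label exists, the query string could presumably be swapped for a label filter instead of a keyword.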


On 2025-08-26 7:55 a.m., '[email protected]' via sage-devel wrote:

I also have a half-working notebook that extracts the failing tests from the CI runs at https://github.com/sagemath/sage/pull/39100

Getting this from half-working to working and reporting the results somewhere easy to find sounds like it would be a worthwhile endeavour, and would be a good start to collecting information about the CI somewhere where it can be found easily. I'll open a GitHub issue for this (I'm not saying I am going to try to work on it myself anytime soon, so if anyone else reading this wants to work on it feel free).
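
To make the idea concrete, here is a rough sketch of the extraction step. It is not taken from the notebook in that PR. It assumes the CI log contains summary lines of the form "sage -t [options] path/to/file.py  # N doctests failed"; that format and the names below are my assumptions and would need checking against real logs.

```python
import re
from collections import Counter

# Assumed summary-line shape (may be preceded by a log timestamp):
#   sage -t --long src/sage/foo.py  # 2 doctests failed
FAILED_LINE = re.compile(
    r"sage -t\b.*?\s(\S+\.(?:pyx|py|rst))\s+# (\d+) doctests? failed"
)


def failing_files(log_text: str) -> Counter:
    """Count failed doctests per file in the text of one CI log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_LINE.search(line)
        if match:
            counts[match.group(1)] += int(match.group(2))
    return counts


sample_log = """\
sage -t --long --random-seed=123 src/sage/rings/integer.pyx  # 1 doctest failed
2025-08-26T12:00:00Z sage -t --long src/sage/graphs/graph.py  # 2 doctests failed
some unrelated log line
"""
print(failing_files(sample_log).most_common())
```

Summing these counters over many runs would give a ranked list of the files that fail most often, which is the kind of thing that could be posted somewhere easy to find.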


On 2025-08-26 8:54 a.m., Dylan Thurston wrote:

I know nothing about the internal implementation here, but just this
description suggests a change in practice: when a test on a PR fails,
rerun that test on the existing branch (without the pull request) with
the same random seed to see if it also fails there. If it does, then
automatically file a separate issue.

That sounds like a good idea that is probably technically possible, but I have no idea how it would be implemented. I do think we have something to ignore failures for doctests that failed on the last commit to the develop branch, which is similar enough that I think this should be possible.
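
Ignoring the CI plumbing, the decision logic of the suggestion is small. A sketch under stated assumptions: `run` is a hypothetical callable that checks out a branch and runs one test file with a fixed seed (I believe the doctester accepts a --random-seed option, but I have not checked how CI would pass it), and the result labels are made up for illustration.

```python
from typing import Callable

# Hypothetical runner: (branch, test_file, seed) -> True if the test passes.
Runner = Callable[[str, str, int], bool]


def triage_failure(run: Runner, test_file: str, seed: int,
                   base_branch: str = "develop") -> str:
    """Classify a failure seen on a PR by replaying it on the base branch."""
    if run(base_branch, test_file, seed):
        # Passes without the PR's changes, so the PR remains the suspect.
        return "suspect-pr"
    # Fails on the base branch too: candidate for a separate issue.
    return "pre-existing"


def fake_run(branch: str, test_file: str, seed: int) -> bool:
    """Stand-in runner: one file fails with one seed on every branch."""
    return not (test_file == "src/sage/example.py" and seed == 12345)


print(triage_failure(fake_run, "src/sage/example.py", 12345))  # pre-existing
```

Note that "suspect-pr" is only a suspicion: a failure that does not depend on the seed (like the unreproducible ones I mentioned above) could pass on the rerun by luck.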

One issue with this suggestion is that it will mean CI takes longer, which would aggravate the problem of potentially waiting a long time for a CI server to be available for other tests. For setups with limited CI minutes I think it's common to run a limited test suite on PRs and the full test suite on develop. We sort of do this: more distros are tested on develop than on PRs, for instance. Another possibility is to run the most important/stable/relevant tests first and only run the full test suite after those pass. Obviously this has the drawback of waiting longer for things to finish if we don't start jobs until the first jobs succeed, but it's something to consider if we reach a point where we are running more CI tests than our CI servers can keep up with. One way to do this would be to run tests on our "most stable" systems first (Linux) and only test the less stable systems (Mac, Windows) if those pass. Or only test the PDF docbuild after the HTML docbuild passes. We'd probably want to have some label to override this for PRs that are trying to fix something like a Windows-specific issue. If our CI usage right now isn't a big problem then I don't think we need to do this; I'm just pointing out that there are options if it becomes an issue in the future.
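
For what it's worth, the gating described above amounts to a dependency check plus an override. A minimal sketch; the job names and the "ci: full" label are hypothetical and not taken from our actual workflows:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    needs: tuple = ()  # names of jobs that must pass before this one starts


# Hypothetical job graph mirroring the examples in the text.
JOBS = [
    Job("linux-tests"),
    Job("macos-tests", needs=("linux-tests",)),
    Job("windows-tests", needs=("linux-tests",)),
    Job("html-docs"),
    Job("pdf-docs", needs=("html-docs",)),
]


def jobs_to_start(jobs, passed, labels=frozenset(), override_label="ci: full"):
    """Return names of jobs that may start now, given the jobs that passed."""
    pending = [job for job in jobs if job.name not in passed]
    if override_label in labels:
        # The PR asked for everything at once, e.g. a Windows-specific fix.
        return [job.name for job in pending]
    return [job.name for job in pending if all(n in passed for n in job.needs)]


print(jobs_to_start(JOBS, passed=set()))
print(jobs_to_start(JOBS, passed={"html-docs"}))
print(jobs_to_start(JOBS, passed=set(), labels={"ci: full"}))
```

In GitHub Actions itself the dependency part would presumably be expressed with the `needs` key on each job rather than in Python; the sketch is only meant to show the logic, including the label override.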


--
To view this discussion visit 
https://groups.google.com/d/msgid/sage-devel/28a42df7-a1b3-4c68-9148-cc0f165497f9%40ucalgary.ca.
