Hey,

> Yeah, this is a ridiculous situation.  We should do a hackathon to get
> better monitoring of useful metrics (machine load,
> time-of-push-to-time-to-build-completion, etc.), to clearly identify the
> bottlenecks (crashes? inefficient protocol? scheduling issues? Cuirass
> or offload or guix-daemon issue?), and to address as many of them as we
> can.
>
> Any volunteers?  :-)

I'd really like to improve the situation! A hackathon seems like a nice
idea. As a matter of fact, I already spent some time improving the
stability of the Cuirass web interface[1].

Now I can see multiple topics that could be approached in parallel:

* Add metrics to Cuirass as you suggested. There's an open ticket about
  that here[2].

* Investigate offloading issues[3].

* Fix database contention[4].

* Fix guix-daemon deadlocking[5].

* Monitor closely what's happening on Berlin and decide whether it makes
  sense to add a build scheduler mechanism somewhere. See what Hydra is
  doing[6] and what Chris is proposing[7].

Since most of the issues are only observed on the Berlin machines, to
which access is restricted, we will also have to find a way to reproduce
them locally.

Anyway, if some people are motivated, we could try to plan a day or a
weekend to work on those topics :).

Thanks,

Mathieu

[1]: https://issues.guix.gnu.org/42548
[2]: https://issues.guix.gnu.org/32548
[3]: https://issues.guix.gnu.org/34033
[4]: https://issues.guix.gnu.org/42001
[5]: https://issues.guix.gnu.org/31785
[6]: https://github.com/NixOS/hydra/blob/master/src/hydra-queue-runner/dispatcher.cc
[7]: https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00323.html