Hello,

When a colleague and I tried to run a guix build agent in a docker
container on openshift, assigning it 8 of the 128 cores of the physical
machine, the agent was completely choked, since it started all builds
with commands such as "make -j 128". The value 128 is determined by a
call to the guile function current-processor-count, which calls nproc
from coreutils (see "man nproc"). This works on bare metal and virtual
machines, but not in containers or, more generally, when cgroups are
used to limit the number of cores. Additionally, but less crucially,
this probably renders the max-1min-load-average parameter of
guix-build-coordinator-agent-configuration useless: in the example, the
machine could have a load of 120 on the other cores while the part
attached to the build agent is idle.

This can be worked around by passing extra arguments such as "--cores=8"
by hand to the guix daemon service, and by adapting max-parallel-builds
of the build agent service. Still, it would be nice to have a more
automated approach (for instance, when changing the number of assigned
cores in openshift, one does not want to recreate a docker container with
new manual parameters).

Here is how far we got concerning a potential solution.

When cgroups are available, the file
   /sys/fs/cgroup/cpu.pressure
contains some measure of load congestion:
   some avg10=8.28 avg60=5.50 avg300=2.11 total=365519361
   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Its contents are described here:
   https://www.kernel.org/doc/html/latest/accounting/psi.html#psi
The "full" line does not seem to be meaningful for the CPU. I am not
exactly sure what is measured by the "some" line: it is not the load,
but a percentage of time during which "some tasks are stalled on a
given resource". It looks like the max-1min-load-average of the build
agent service could be replaced by a threshold for the avg60 value of
this file.
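
To make the threshold idea concrete, here is a minimal sketch in Python
(not guile, and not using libcgroup) that parses the file format shown
above and compares the avg60 value of the "some" line against a limit.
The function names and the threshold values are made up for
illustration; only the file format is taken from the kernel
documentation.

```python
# Sample contents, copied from the example above.
SAMPLE = """\
some avg10=8.28 avg60=5.50 avg300=2.11 total=365519361
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""


def parse_pressure(text):
    """Parse PSI file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        # Each remaining field has the form key=value.
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def cpu_is_congested(text, threshold):
    """True if the 60 s "some" average (a percentage) exceeds threshold."""
    return parse_pressure(text)["some"]["avg60"] > threshold


def read_pressure(path="/sys/fs/cgroup/cpu.pressure"):
    """Read the pressure file; the default path is the one quoted above."""
    with open(path) as f:
        return f.read()
```

An agent could then call cpu_is_congested(read_pressure(), threshold)
before accepting a new build; which threshold is sensible would have to
be determined by experiment.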

To obtain the current value, the libcgroup library, which is already
available in guix, can be used; we may need to write guile bindings.

I suppose that the number of available cores can be determined in a
similar manner.
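
As a rough illustration of what this could look like, the following
Python sketch derives a core count from the cgroup v2 file cpu.max,
which holds "<quota> <period>" in microseconds, or "max <period>" when
no limit is set. This is an assumption about the setup: it only covers
limits imposed through a CPU quota on cgroup v2, not cpuset
restrictions, and we have not checked which mechanism openshift uses in
our case. The function names are invented for the example.

```python
import math
import os


def cores_from_cpu_max(text, fallback):
    """Derive a core count from the contents of a cgroup v2 cpu.max file.

    Returns fallback when the quota is "max", i.e. no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return fallback
    # Round up so that a fractional quota still yields at least one core.
    return max(1, math.ceil(int(quota) / int(period)))


def available_cores(path="/sys/fs/cgroup/cpu.max"):
    """Best-effort count: cgroup quota if readable, else os.cpu_count()."""
    fallback = os.cpu_count() or 1
    try:
        with open(path) as f:
            return min(fallback, cores_from_cpu_max(f.read(), fallback))
    except (OSError, ValueError):
        # File missing (no cgroup v2) or in an unexpected format.
        return fallback
```

With a quota of 800000 and a period of 100000, this yields 8, which is
the value one would want to see in "make -j" in the example above.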

What do you think?

Andreas

