Hello!

Andreas Enge <andr...@enge.fr> writes:

> On Thu, Jun 06, 2024 at 07:48:27PM +0200, Andreas Enge wrote:
>> Could the graph on
>>    https://ci.guix.gnu.org/metrics
>> be augmented by the number of packages to be built for the different
>> architectures?

That would be nice, I agree (I haven’t looked much at that part of the
code).

> In that direction, the metrics now show that very few packages were built
> in the last 24 hours, except maybe for ARM (where we anyway build few
> packages). But the number of waiting builds stalls at around 280000.
>
> Are these all for ARM now? Should we cancel builds a bit more aggressively
> to make sure that recent packages are favoured?

In the meantime, here’s me doing stats-as-a-service:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo -u cuirass psql cuirass
cuirass=> select count(*) from builds where status = -2 ;
 count  
--------
 284314
(1 row)

Time: 635.478 ms
cuirass=> select count(*) from builds where status = -2 and system = 'x86_64-linux';
 count 
-------
     0
(1 row)

Time: 761.333 ms
cuirass=> select count(*) from builds where status = -2 and system = 'aarch64-linux';
 count  
--------
 160847
(1 row)

Time: 661.968 ms
cuirass=> select count(*) from builds where status = -2 and system = 'powerpc64le-linux';
 count  
--------
 119124
(1 row)

Time: 589.800 ms
cuirass=> select count(*) from builds where status = -2 and system = 'armhf-linux';
 count 
-------
  4343
(1 row)

Time: 549.242 ms
cuirass=> select count(*) from builds where status = -2 and system = 'i686-linux';
 count 
-------
     0
(1 row)

Time: 1088.130 ms (00:01.088)
--8<---------------cut here---------------end--------------->8---

So the whole backlog is AArch64 (160,847), POWER9 (119,124), and
armhf (4,343) builds; nothing is waiting for x86_64 or i686.
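
For reference, the per-system counts above could probably be obtained in
one go with a grouped query rather than one query per system.  Below is
a minimal sketch of that idea; it runs against a made-up in-memory
SQLite table, not the real Cuirass database, and only assumes the
‘system’ and ‘status’ columns used in the queries above.  The rows are
invented for illustration.

```python
import sqlite3

# Toy stand-in for the 'builds' table: only the two columns used in the
# queries above.  The rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (system TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO builds VALUES (?, ?)",
    [("aarch64-linux", -2)] * 3
    + [("powerpc64le-linux", -2)] * 2
    + [("armhf-linux", -2)]
    + [("x86_64-linux", 0)] * 4,  # status 0: not part of the waiting set
)

# One grouped query instead of one query per system.
QUERY = """
SELECT system, count(*) FROM builds
WHERE status = -2
GROUP BY system
ORDER BY count(*) DESC, system
"""
pending = conn.execute(QUERY).fetchall()
for system, count in pending:
    print(f"{count:7d}  {system}")
```

Note that systems with no waiting builds simply do not show up in the
result, instead of being listed with a count of 0.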

Executive summary:

  1. Of all the AArch64 build machines we have, only ‘overdrive1’ is
     actually contributing build power at the moment;

  2. AArch64 build machines ‘pankow’, ‘grunewald’, and ‘kreuzberg’
     (HoneyCombs) need on-site intervention so we can reconfigure them
     and reboot them;

  3. Some other AArch64 build machines (‘lieserl’ and ‘monokuma’) have
     been off for months and we’re discussing on guix-sysadmin ways to
     turn them back on;

  4. POWER9, I’m not sure;

  5. ‘cuirass remote-server’ may be too slow at handling incoming
     messages from workers, leading to redundant builds and the
     impression on https://ci.guix.gnu.org/workers that workers are
     idle, even when they’re in fact busy building stuff.


Investigation details:

I noticed that ‘cuirass remote-server’ on berlin would all too often
consider workers as “unresponsive” (meaning that it hasn’t received a
‘ping’ message from them in the past 2 minutes):

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep unresponsive /var/log/cuirass-remote-server.log |tail -10
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
--8<---------------cut here---------------end--------------->8---

As shown in this log, the effect is that some builds get restarted, even
though they are still being built by a worker that was wrongly
considered unresponsive.
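
To get a rough idea of the scale, the restart counts in such log lines
can be summed up with a few lines of code.  This is only a sketch: the
regular expression is derived from the ten lines shown above, and
‘count_restarts’ is a hypothetical helper, not part of Cuirass.

```python
import re

# Pattern derived from the log excerpt above.
RESTART = re.compile(r"restarted (\d+) builds that were on unresponsive workers")

def count_restarts(lines):
    """Sum the number of restarted builds reported in LINES."""
    total = 0
    for line in lines:
        match = RESTART.search(line)
        if match:
            total += int(match.group(1))
    return total

# The ten log lines quoted above.
log = [
    "2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers",
    "2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers",
    "2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers",
    "2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers",
    "2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers",
    "2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers",
    "2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers",
    "2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers",
    "2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers",
    "2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers",
]
total = count_restarts(log)
print(total)
```

For this excerpt, that amounts to 18 restarted builds between 12:44 and
13:33.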

This needs further investigation.  The SQL query for
‘db-get-pending-build’ fixed by Cuirass commit
17338588d4862b04e9e405c1244a2ea703b50d98 is no longer at fault: it’s now
reasonably fast (there’s a warning in ‘cuirass-remote-server.log’ if it
ever takes more than 10s).  It could be that the backlog of incoming
messages in ‘remote-server’ still keeps increasing though, since workers
send pings every minute no matter what.
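
To illustrate how a backlog could produce such false positives, here is
a toy model of the timeout check.  It is not the actual Cuirass code:
the worker names and timestamps are invented, and it assumes the server
only updates a worker’s “last seen” time once it has actually processed
the ping.

```python
TIMEOUT = 120  # seconds; the 2-minute window mentioned above

def unresponsive(last_seen, now, timeout=TIMEOUT):
    """Return the workers whose last *processed* ping is older than TIMEOUT."""
    return sorted(w for w, t in last_seen.items() if now - t > timeout)

# Hypothetical situation at t = 180 s.  Both workers ping every 60 s, but
# the pings 'worker-b' sent at t = 60, 120 and 180 are still sitting
# unprocessed in the server's queue, so its last processed ping is at t = 0.
last_seen = {"worker-a": 170, "worker-b": 0}
flagged = unresponsive(last_seen, now=180)
print(flagged)
```

In this model ‘worker-b’ gets flagged, and its builds would be
restarted, even though it never stopped sending pings.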

A further problem is that we’re unable to retrieve binaries from a
couple of build machines:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep error: /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:17:29 error: failed to add /gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1 to store: path `/gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1' does not exist and cannot be created
2024-06-17 13:17:29 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:24:03 error: failed to add /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0 to store: path `/gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0' does not exist and cannot be created
2024-06-17 13:24:03 error: The remote-worker signing key might be unauthorized.
--8<---------------cut here---------------end--------------->8---

By picking store items from these error messages, we can determine that
at least ‘pankow’ (10.0.0.8, AArch64) and ‘grunewald’ (10.0.0.10,
AArch64) are at fault:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix gc --derivers /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0
/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv
ludo@berlin ~$ sudo grep 8yc7j6q169f8312wx6jxs7g0z4xy5l5l /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:21:50 10.0.0.8 (uUTl7MVR): build started: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'.
2024-06-17 13:24:03 fetching 1 outputs of '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv' from http://10.0.0.8:5558
2024-06-17 13:24:03 build succeeded: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'
ludo@berlin ~$ guix gc --derivers /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b
/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv
ludo@berlin ~$ sudo grep ygrgwp9jyksjpnd76b83ifdskbcdjbhh /var/log/cuirass-remote-server.log  |tail -10
2024-06-17 13:05:21 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.8:5558
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:34:39 build failed: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
--8<---------------cut here---------------end--------------->8---
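
The manual lookup above (derivation, then the worker it was fetched
from) could be scripted along these lines.  It is a sketch under
assumptions: the regular expression is inferred from the “fetching …
from http://…” lines quoted above, and ‘fetch_sources’ is a hypothetical
helper name, not an existing tool.

```python
import re
from collections import defaultdict

# Pattern inferred from the "fetching N outputs of '...' from http://..." lines.
FETCH = re.compile(r"fetching \d+ outputs of '([^']+)' from http://([\d.]+):\d+")

def fetch_sources(lines):
    """Map each derivation to the set of worker addresses it was fetched from."""
    sources = defaultdict(set)
    for line in lines:
        match = FETCH.search(line)
        if match:
            sources[match.group(1)].add(match.group(2))
    return dict(sources)

UUCP = "/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv"
GUIX = "/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv"

# A few of the log lines quoted above.
log = [
    f"2024-06-17 13:24:03 fetching 1 outputs of '{UUCP}' from http://10.0.0.8:5558",
    f"2024-06-17 13:05:21 fetching 1 outputs of '{GUIX}' from http://10.0.0.8:5558",
    f"2024-06-17 13:41:08 fetching 1 outputs of '{GUIX}' from http://10.0.0.10:5558",
    f"2024-06-17 13:41:09 build succeeded: '{GUIX}'",
]
sources = fetch_sources(log)
for drv, hosts in sources.items():
    print(drv, sorted(hosts))
```

On these lines it reports 10.0.0.8 for the ocaml-uucp derivation, and
both 10.0.0.8 and 10.0.0.10 for the guix one.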

The signing key of ‘grunewald’ is definitely registered:

--8<---------------cut here---------------start------------->8---
$ ssh grunewald cat /etc/guix/signing-key.pub
(public-key 
 (ecc 
  (curve Ed25519)
  (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
  )
 )
$ grep -rl 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 ~/src/guix-maintenance/hydra/
$ ssh berlin grep 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 /etc/guix/acl
    (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
--8<---------------cut here---------------end--------------->8---

That of ‘pankow’ I can’t say because I cannot log in.  Most likely, it
rebooted and generated a new signing key different from the one that’s
registered.  So in effect, ‘pankow’ is not contributing any builds.

The third machine of the HoneyComb family is ‘kreuzberg’: it’s been off
for a few days, after I rebooted it and it didn’t come back.

Thanks,
Ludo’.

PS: I’m traveling this week so I won’t be very responsive.
