On Wednesday, February 23rd, 2022 at 9:49 AM, Christopher Baines <m...@cbaines.net> wrote:
> Ricardo Wurmus <rek...@elephly.net> writes:
>
> > Ricardo Wurmus <rek...@elephly.net> writes:
> >
> > > Hi Guix,
> > >
> > > I had to manually run the build of llvm 11 on aarch64, because it
> > > would keep timing out:
> > >
> > > time guix build /gnu/store/0hc7inxqcczb8mq2wcwrcw0vd3i2agkv-llvm-11.0.0.drv --timeout=999999 --max-silent-time=999999
> > >
> > > After more than two days it finally built. This seems a little
> > > excessive. Towards the end of the build I saw a 1-percentage-point
> > > progress increase for every hour that passed.
> > >
> > > Is there something wrong with the build nodes, are we building
> > > llvm 11 wrong, or is this just the way it is on aarch64 systems?
> >
> > I now see that gfortran 10 also takes a very long time to build. It’s
> > on kreuzberg (10.0.0.9), and I see that out of the 16 cores only one
> > is really busy. Other cores sometimes come in with a tiny bit of
> > work, but you might miss it if you blink.
> >
> > Guix ran “make -j 16” at the top level, but the other make processes
> > that have been spawned as children do not have “-j 16”. There are
> > probably 16 or so invocations of cc1plus, but only CPU0 seems to be
> > busy at 100% while the others are at 0.
> >
> > What’s up with that?
>
> Regarding the llvm derivation you mentioned [1], it looks like for
> bordeaux.guix.gnu.org the build completed in around a couple of hours,
> though that was on the 4-core Overdrive machine.
>
> 1: https://data.guix.gnu.org/gnu/store/0hc7inxqcczb8mq2wcwrcw0vd3i2agkv-llvm-11.0.0.drv
>
> On the subject of the HoneyComb machines, I haven't noticed anything
> like you describe with the one (hatysa) running behind
> bordeaux.guix.gnu.org. Most cores are fully occupied most of the time,
> with the 15m load average sitting around 16.
>
> Some things to check, though: what does the load average look like
> when you think the system should be using all its cores? If it's high
> but there's not much CPU utilisation, that suggests there's a
> bottleneck somewhere else.
>
> Also, what does the memory and swap usage look like? Hatysa has 32GB
> of memory and swap, and ideally it would actually have 64GB, since
> that would make it swap less often.

One thing I remember about building LLVM a number of years ago, when I
was working on it through my job (though only for x86-64, not aarch64),
is that the build is very memory intensive. In particular, linking the
various binaries would each be quite slow and consume a lot of memory,
causing significant, intense swapping with less than 64GB of memory in
a parallel build (and sometimes eventually triggering the OOM killer).
As I recall, using ld.bfd for the build was by far the slowest, ld.gold
was noticeably better, and ld.lld was showing promise for doing better
than ld.gold.
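In case it's useful as a starting point, the knobs I remember for this
live in LLVM's own CMake configuration. This is a sketch from memory
rather than a recipe tested against LLVM 11 (or against the Guix
package definition), so the option names and values deserve a check
against the LLVM 11 CMake docs:

    # Illustrative configure step for a standalone LLVM build.
    # LLVM_USE_LINKER selects the linker used for LLVM's own binaries
    # ("gold" or "lld" instead of the default BFD ld).
    # LLVM_PARALLEL_LINK_JOBS caps how many link jobs run at once
    # (honoured with the Ninja generator), bounding peak memory use
    # during the link-heavy tail of the build.
    cmake -G Ninja ../llvm \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_USE_LINKER=gold \
        -DLLVM_PARALLEL_LINK_JOBS=2
    ninja

Capping the link jobs while leaving compilation fully parallel was, in
my experience, the cheapest way to keep a parallel LLVM build from
swapping itself to death.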
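On the earlier observation that the child make processes lack “-j 16”:
GNU make normally hands parallelism down to sub-makes through MAKEFLAGS
and its jobserver, so the children not carrying an explicit -j is not
necessarily wrong by itself. One way to check (with <pid> standing in
for one of the child make processes picked out of ps) is to inspect
the inherited environment:

    # A sub-make under a parallel build inherits MAKEFLAGS from its
    # parent, typically containing "-j" plus a jobserver handshake
    # such as --jobserver-auth=R,W (--jobserver-fds on older makes).
    tr '\0' '\n' < /proc/<pid>/environ | grep MAKEFLAGS

If MAKEFLAGS carries the jobserver option but the build log contains
“warning: jobserver unavailable: using -j1”, the usual culprit is a
makefile rule that invokes “make” directly instead of “$(MAKE)”; the
parent then withholds the jobserver file descriptors and that child
falls back to running serially, which would match only CPU0 being
busy.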
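And on the load-average questions above: telling CPU starvation apart
from swapping or an I/O bottleneck usually takes nothing more exotic
than the stock procps/sysstat tools:

    uptime         # load average relative to the core count
    free -h        # memory and swap actually in use
    vmstat 5       # sustained "si"/"so" columns mean swapping;
                   # a high "wa" column means blocked on I/O
    iostat -x 5    # per-device utilisation (from the sysstat package)

A load average around 16 combined with low CPU utilisation and a high
"wa" would point at the disk (or at swapping) rather than at the build
itself.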
Just my $0.02 of past experiences, in case it helps to explain the
slow aarch64 build of LLVM 11.

Cheers,
Kaelyn

> One problem I have observed with hatysa is storage
> instability/performance issues. Looking in /var/log/messages, I see
> things like the following. Maybe check /var/log/messages for anything
> similar?
>
> nvme nvme0: I/O 0 QID 6 timeout, aborting
> nvme nvme0: I/O 1 QID 6 timeout, aborting
> nvme nvme0: I/O 2 QID 6 timeout, aborting
> nvme nvme0: I/O 3 QID 6 timeout, aborting
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
>
> Lastly, I'm not quite sure what thermal problems look like on ARM,
> but maybe check the CPU temperatures. I see between 60 and 70 degrees
> Celsius as reported by the sensors command, though this is with a
> different CPU cooler.
>
> Chris