Re: advice debugging lockups with swap-thrashing symptoms?
> You are probably haunted by a bad issue with DMA memory and running out
> of it. Your top is missing -SH since then you would probably see the
> pagedaemon go bananas.
>
> The problem is you have not enough memory below 4G but the pagedaemon is
> not able to properly free memory there since it has no proper tracking
> for that condition. It only knows memory is short and tries to drop as
> much as possible over and over again. As a result your system becomes
> unresponsive.
>
> The boot loader can print the memory map. Which should show you how much
> memory is below 4G (I think the command is machine mem).
>
> This is a known issue and there is some work going on to fix the problem.
> --
> :wq Claudio

Thanks, Claudio. Let me know if I can help by testing anything.

In case it's useful, here is the output of "machine memory" at the boot
prompt. Transcribed by hand so there are probably errors.

boot> machine memory
Region 0: type 1 at 0x0 for 609KB
Region 1: type 2 at 0xf for 64KB
Region 2: type 2 at 0xfec0 for 20480KB
Region 3: type 2 at 0xe000 for 262144KB
Region 4: type 2 at 0x98400 for 31KB
Region 5: type 2 at 0xcfdf for 64KB
Region 6: type 1 at 0x10 for 3404292KB
Region 7: type 3 at 0xcfde3000 for 52KB
Region 8: type 4 at 0xcfde for 12KB
Region 9: type 1 at 0x1 for 13369344KB
Low ram: 609KB High ram: 3404292KB
Total free memory: 16774245KB

--
James
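One way to answer the "how much memory is below 4G" question from a map like this is to sum the type 1 regions on each side of the 4 GiB boundary. The sketch below is only an illustration. It assumes type 1 means usable RAM (the usual BIOS E820 convention), and because the hand-transcribed addresses above look truncated, the start addresses in `SAMPLE` for regions 6 and 9 (0x100000 and 0x100000000) are guesses, not values read from the machine. The name `split_usable_kb` is made up for this example.

```python
import re

FOUR_GIB_KB = 4 * 1024 * 1024  # the 4 GiB boundary, expressed in KB

# Matches lines such as "Region 6: type 1 at 0x100000 for 3404292KB".
REGION_RE = re.compile(r"Region \d+: type (\d+) at (0x[0-9a-fA-F]+) for (\d+)KB")


def split_usable_kb(lines):
    """Return (KB below 4 GiB, KB at or above 4 GiB) for type 1 regions."""
    below = above = 0
    for line in lines:
        match = REGION_RE.search(line)
        if match is None:
            continue
        region_type = int(match.group(1))
        start_kb = int(match.group(2), 16) // 1024
        size_kb = int(match.group(3))
        if region_type != 1:  # assumption: only type 1 is usable RAM
            continue
        end_kb = start_kb + size_kb
        # Portion of this region that lies under the 4 GiB boundary.
        low_part = max(0, min(end_kb, FOUR_GIB_KB) - start_kb)
        below += low_part
        above += size_kb - low_part
    return below, above


# Sizes are from the transcribed output; the addresses of regions 6 and 9
# are ASSUMED (1 MiB and 4 GiB), since the transcription looks truncated.
SAMPLE = [
    "Region 0: type 1 at 0x0 for 609KB",
    "Region 6: type 1 at 0x100000 for 3404292KB",
    "Region 7: type 3 at 0xcfde3000 for 52KB",
    "Region 9: type 1 at 0x100000000 for 13369344KB",
]

below_kb, above_kb = split_usable_kb(SAMPLE)
print(f"below 4G: {below_kb}KB ({below_kb / 1024 / 1024:.2f} GiB)")
print(f"above 4G: {above_kb}KB ({above_kb / 1024 / 1024:.2f} GiB)")
```

With those assumed addresses the split is 3404901KB below 4G and 13369344KB above it, which adds up to the 16774245KB total that the boot loader reported.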
Re: advice debugging lockups with swap-thrashing symptoms?
On Thu, May 23, 2024 at 03:37:24PM +, James Cook wrote:
> On Thu, May 23, 2024 at 08:00:37AM GMT, Nick Holland wrote:
> > On 5/23/24 03:18, Stuart Henderson wrote:
> > > On 2024-05-22, James Cook wrote:
> > > > One of my OpenBSD boxes sometimes gets in a weird locked-up or
> > > > almost-locked-up state. I'm wondering what I can do to debug it
> > > > further next time it happens.
> > > ...
> > > > I would also expect the cache number to be much higher. E.g. on
> > > > this occasion, I was running "git annex fsck", which reads plenty
> > > > of data from disk.
> > >
> > > Heavy filesystem access can result in this sort of thing, I used to
> > > have unpacked ports source on one of my machines for grepping over,
> > > the machine was pretty much unusable for anything else while that was
> > > running.
> > >
> > > Might be worth trying some noatime mount flags if you don't already have
> > > them, at least then you can avoid turning some reads into writes.
> >
> > Definitely a possibility. Long time ago, I think I asked about the
> > possibility of a "disknice" to throttle disk access on individual
> > tasks. TedU@ came through for me with something that definitely solved
> > my problem, and I use it from time to time since -- basically, it just
> > suspends a particular program occasionally, which lets other programs
> > have a chance to get disk access. I saved it (and made a tiny update
> > that is needed now) and put it here:
> >
> > https://holland-consulting.net/scripts/disknice.html
> >
> > Also...
> > I've seen disks "fail" where they get super-slow. The failure mode
> > seems to be difficulty reading data...but after enough retries, it
> > succeeds, resetting the retry counter back to zero, and then the next
> > read encounters the same problem. You may be able to hear lots of
> > activity on the drive with little obvious progress. I'm not convinced
> > this is your problem, but ... something to consider.
> >
> > Nick.
>
> Thanks for the pointers. disknice sounds useful. However I am skeptical that
> this can be explained away as a normal consequence of intense filesystem
> access, for a few reasons.
>
> 1. In the past, even the mouse pointer has frozen. (I'm 95% sure of this
> from memory. Will note it more carefully next time this happens.) Surely
> that shouldn't depend on disk access? See also tmux/xterm updating very
> slowly; does that depend on the filesystem?
>
> 2. The low 165M cache number makes me suspicious. With 14G free and plenty
> of data being read, shouldn't that grow? E.g. right now it's at 11G (and I'm
> running git annex fsck like I was before; I have a lot of data to fsck). I
> believe I've seen similar small cache numbers in the past.
>
> 3. The git annex fsck was running on a different hard disk. (Normally it
> sits in a cupboard; I've hooked it up temporarily.) Swap, /, /home etc are
> all on a different SSD. I am running the same thing now (different disk) and
> perceive no impact on performance. That's not to say there wasn't intense
> access to the SSD, though; Firefox is a suspect here.
>
> Nonetheless, if I can't make any other progress, I'll look into noatime
> and/or disknice. (I really wish I could reliably reproduce this, but
> unfortunately it just happens every few days or weeks with no apparent
> pattern other than the system being under some load when it happens.)
>
> (I'll note one other thing, just in case: I also experience random crashes
> and restarts with this machine that seem to be hardware-related. Very
> different from what I'm describing here; has even happened during BIOS POST,
> and with no disks inside the machine. I just mention it because it opens the
> possibility of unreliable hardware involved, in case that changes things.)
>

You are probably haunted by a bad issue with DMA memory and running out
of it. Your top was run without -SH; with those flags you would probably
see the pagedaemon go bananas.

The problem is that you do not have enough memory below 4G, but the
pagedaemon is not able to properly free memory there since it has no
proper tracking for that condition. It only knows memory is short and
tries to drop as much as possible over and over again. As a result your
system becomes unresponsive.

The boot loader can print the memory map, which should show you how much
memory is below 4G (I think the command is machine mem).

This is a known issue and there is some work going on to fix the problem.
--
:wq Claudio
Re: advice debugging lockups with swap-thrashing symptoms?
On Thu, May 23, 2024 at 08:00:37AM GMT, Nick Holland wrote:
> On 5/23/24 03:18, Stuart Henderson wrote:
> > On 2024-05-22, James Cook wrote:
> > > One of my OpenBSD boxes sometimes gets in a weird locked-up or
> > > almost-locked-up state. I'm wondering what I can do to debug it
> > > further next time it happens.
> > ...
> > > I would also expect the cache number to be much higher. E.g. on
> > > this occasion, I was running "git annex fsck", which reads plenty
> > > of data from disk.
> >
> > Heavy filesystem access can result in this sort of thing, I used to
> > have unpacked ports source on one of my machines for grepping over,
> > the machine was pretty much unusable for anything else while that was
> > running.
> >
> > Might be worth trying some noatime mount flags if you don't already have
> > them, at least then you can avoid turning some reads into writes.
>
> Definitely a possibility. Long time ago, I think I asked about the
> possibility of a "disknice" to throttle disk access on individual
> tasks. TedU@ came through for me with something that definitely solved
> my problem, and I use it from time to time since -- basically, it just
> suspends a particular program occasionally, which lets other programs
> have a chance to get disk access. I saved it (and made a tiny update
> that is needed now) and put it here:
>
> https://holland-consulting.net/scripts/disknice.html
>
> Also...
> I've seen disks "fail" where they get super-slow. The failure mode
> seems to be difficulty reading data...but after enough retries, it
> succeeds, resetting the retry counter back to zero, and then the next
> read encounters the same problem. You may be able to hear lots of
> activity on the drive with little obvious progress. I'm not convinced
> this is your problem, but ... something to consider.
>
> Nick.

Thanks for the pointers. disknice sounds useful. However I am skeptical
that this can be explained away as a normal consequence of intense
filesystem access, for a few reasons.

1. In the past, even the mouse pointer has frozen. (I'm 95% sure of this
from memory. Will note it more carefully next time this happens.) Surely
that shouldn't depend on disk access? See also tmux/xterm updating very
slowly; does that depend on the filesystem?

2. The low 165M cache number makes me suspicious. With 14G free and
plenty of data being read, shouldn't that grow? E.g. right now it's at
11G (and I'm running git annex fsck like I was before; I have a lot of
data to fsck). I believe I've seen similar small cache numbers in the
past.

3. The git annex fsck was running on a different hard disk. (Normally it
sits in a cupboard; I've hooked it up temporarily.) Swap, /, /home etc.
are all on a different SSD. I am running the same thing now (different
disk) and perceive no impact on performance. That's not to say there
wasn't intense access to the SSD, though; Firefox is a suspect here.

Nonetheless, if I can't make any other progress, I'll look into noatime
and/or disknice. (I really wish I could reliably reproduce this, but
unfortunately it just happens every few days or weeks with no apparent
pattern other than the system being under some load when it happens.)

(I'll note one other thing, just in case: I also experience random
crashes and restarts with this machine that seem to be hardware-related.
Very different from what I'm describing here; has even happened during
BIOS POST, and with no disks inside the machine. I just mention it
because it opens the possibility of unreliable hardware involved, in
case that changes things.)

--
James
Re: advice debugging lockups with swap-thrashing symptoms?
On 5/23/24 03:18, Stuart Henderson wrote:
> On 2024-05-22, James Cook wrote:
> > One of my OpenBSD boxes sometimes gets in a weird locked-up or
> > almost-locked-up state. I'm wondering what I can do to debug it
> > further next time it happens.
> ...
> > I would also expect the cache number to be much higher. E.g. on
> > this occasion, I was running "git annex fsck", which reads plenty
> > of data from disk.
>
> Heavy filesystem access can result in this sort of thing, I used to
> have unpacked ports source on one of my machines for grepping over,
> the machine was pretty much unusable for anything else while that was
> running.
>
> Might be worth trying some noatime mount flags if you don't already have
> them, at least then you can avoid turning some reads into writes.

Definitely a possibility. A long time ago, I think I asked about the
possibility of a "disknice" to throttle disk access on individual
tasks. TedU@ came through for me with something that definitely solved
my problem, and I use it from time to time since -- basically, it just
suspends a particular program occasionally, which lets other programs
have a chance to get disk access. I saved it (and made a tiny update
that is needed now) and put it here:

https://holland-consulting.net/scripts/disknice.html

Also...
I've seen disks "fail" where they get super-slow. The failure mode
seems to be difficulty reading data...but after enough retries, it
succeeds, resetting the retry counter back to zero, and then the next
read encounters the same problem. You may be able to hear lots of
activity on the drive with little obvious progress. I'm not convinced
this is your problem, but ... something to consider.

Nick.
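The "suspend a program occasionally" idea can be sketched roughly as below. This is not the disknice script linked above, only a minimal illustration that assumes the throttling is done by alternating SIGSTOP and SIGCONT on a process ID. The function name `throttle` and all of its parameters are hypothetical; the `kill` and `sleep` arguments exist only so the loop can be exercised without touching a real process.

```python
import os
import signal
import time


def throttle(pid, run_seconds=1.0, pause_seconds=0.5, cycles=None,
             kill=os.kill, sleep=time.sleep):
    """Alternately let a process run and suspend it.

    Each cycle lets the process run for run_seconds, stops it for
    pause_seconds, then resumes it. With cycles=None the loop continues
    until the process goes away. Returns the number of completed cycles.
    """
    completed = 0
    try:
        while cycles is None or completed < cycles:
            sleep(run_seconds)             # process runs normally
            kill(pid, signal.SIGSTOP)      # suspend it so others get the disk
            try:
                sleep(pause_seconds)
            finally:
                kill(pid, signal.SIGCONT)  # always resume, even if interrupted
            completed += 1
    except ProcessLookupError:
        # The target process has exited; nothing left to throttle.
        pass
    return completed
```

Calling `throttle(pid)` from a small wrapper with the PID of a disk-heavy job would keep that job paused for about a third of the time with these default values. The ratio is arbitrary and would need tuning for a real workload.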
Re: advice debugging lockups with swap-thrashing symptoms?
On 2024-05-22, James Cook wrote:
> One of my OpenBSD boxes sometimes gets in a weird locked-up or
> almost-locked-up state. I'm wondering what I can do to debug it
> further next time it happens.
...
> I would also expect the cache number to be much higher. E.g. on
> this occasion, I was running "git annex fsck", which reads plenty
> of data from disk.

Heavy filesystem access can result in this sort of thing. I used to
have unpacked ports source on one of my machines for grepping over, and
the machine was pretty much unusable for anything else while that was
running.

Might be worth trying some noatime mount flags if you don't already have
them; at least then you can avoid turning some reads into writes.