Re: advice debugging lockups with swap-thrashing symptoms?

2024-05-23 Thread James Cook

> You are probably haunted by a bad issue with DMA memory and running out of
> it. Your top output is missing -SH; with it you would probably see the
> pagedaemon go bananas. The problem is you have not enough memory below 4G
> but the pagedaemon is not able to properly free memory there since it has
> no proper tracking for that condition. It only knows memory is short and
> tries to drop as much as possible over and over again. As a result your
> system becomes unresponsive.
>
> The boot loader can print the memory map, which should show you
> how much memory is below 4G (I think the command is machine mem).
>
> This is a known issue and there is some work going on to fix the problem.
> --
> :wq Claudio


Thanks, Claudio. Let me know if I can help by testing anything.

In case it's useful, here is the output of "machine memory" at the 
boot prompt. Transcribed by hand so there are probably errors.


boot> machine memory
Region 0: type 1 at 0x0 for 609KB
Region 1: type 2 at 0xf for 64KB
Region 2: type 2 at 0xfec0 for 20480KB
Region 3: type 2 at 0xe000 for 262144KB
Region 4: type 2 at 0x98400 for 31KB
Region 5: type 2 at 0xcfdf for 64KB
Region 6: type 1 at 0x10 for 3404292KB
Region 7: type 3 at 0xcfde3000 for 52KB
Region 8: type 4 at 0xcfde for 12KB
Region 9: type 1 at 0x1 for 13369344KB
Low ram: 609KB  High ram: 3404292KB
Total free memory: 16774245KB
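
For anyone wanting to do the arithmetic: the sketch below sums the type 1
(usable) memory that lies below the 4G mark. It is only a rough illustration,
not an OpenBSD tool. The function name and the regular expression are my own,
and because the start addresses above look truncated, the 0x100000 and
0x100000000 addresses in the sample are assumptions; only the sizes and the
Region 7 line come from the output above.

```python
import re

FOUR_GIB_KB = 4 * 1024 * 1024  # the 4G boundary, expressed in KB

# Matches lines such as "Region 6: type 1 at 0x100000 for 3404292KB".
REGION = re.compile(r"Region \d+: type (\d+) at (0x[0-9a-fA-F]+) for (\d+)KB")


def usable_below_4g_kb(memmap_text):
    """Return the KB of type 1 memory located below the 4G boundary."""
    total = 0
    for match in REGION.finditer(memmap_text):
        region_type = int(match.group(1))
        start_kb = int(match.group(2), 16) // 1024
        size_kb = int(match.group(3))
        if region_type != 1:
            continue  # only type 1 regions are counted as usable RAM
        end_kb = start_kb + size_kb
        # Clamp at 4G so a region crossing the boundary is counted partially.
        total += max(0, min(end_kb, FOUR_GIB_KB) - start_kb)
    return total


# ASSUMPTION: start addresses of regions 6 and 9 are guesses, see above.
SAMPLE = """\
Region 0: type 1 at 0x0 for 609KB
Region 6: type 1 at 0x100000 for 3404292KB
Region 7: type 3 at 0xcfde3000 for 52KB
Region 9: type 1 at 0x100000000 for 13369344KB
"""

print(usable_below_4g_kb(SAMPLE))  # 3404901
```

With those assumed addresses, that is 3404901KB (about 3.2G) of usable memory
below 4G; the remaining 13369344KB sits above it.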

--
James



Re: advice debugging lockups with swap-thrashing symptoms?

2024-05-23 Thread Claudio Jeker
On Thu, May 23, 2024 at 03:37:24PM +, James Cook wrote:
> On Thu, May 23, 2024 at 08:00:37AM GMT, Nick Holland wrote:
> > On 5/23/24 03:18, Stuart Henderson wrote:
> > > > On 2024-05-22, James Cook wrote:
> > > > One of my OpenBSD boxes sometimes gets in a weird locked-up or
> > > > almost-locked-up state. I'm wondering what I can do to debug it
> > > > further next time it happens.
> > > ...
> > > > I would also expect the cache number to be much higher. E.g. on
> > > > this occasion, I was running "git annex fsck", which reads plenty
> > > > of data from disk.
> > > 
> > > Heavy filesystem access can result in this sort of thing, I used to
> > > have unpacked ports source on one of my machines for grepping over,
> > > the machine was pretty much unusable for anything else while that was
> > > running.
> > > 
> > > Might be worth trying some noatime mount flags if you don't already have
> > > them, at least then you can avoid turning some reads into writes.
> > > 
> > 
> > Definitely a possibility.  Long time ago, I think I asked about the
> > possibility of a "disknice" to throttle disk access on individual
> > tasks.  TedU@ came through for me with something that definitely solved
> > my problem, and I use it from time to time since -- basically, it just
> > suspends a particular program occasionally, which lets other programs
> > have a chance to get disk access.  I saved it (and made a tiny update
> > that is needed now) and put it here:
> > 
> > https://holland-consulting.net/scripts/disknice.html
> > 
> > 
> > Also...
> > I've seen disks "fail" where they get super-slow.  The failure mode
> > seems to be difficulty reading data...but after enough retries, it
> > succeeds, resetting the retry counter back to zero, and then the next
> > read encounters the same problem.  You may be able to hear lots of
> > activity on the drive with little obvious progress.   I'm not convinced
> > this is your problem, but ... something to consider.
> > 
> > Nick.
> 
> Thanks for the pointers. disknice sounds useful. However I am skeptical that
> this can be explained away as a normal consequence of intense filesystem
> access, for a few reasons.
> 
> 1. In the past, even the mouse pointer has frozen. (I'm 95% sure of this
> from memory. Will note it more carefully next time this happens.) Surely
> that shouldn't depend on disk access? See also tmux/xterm updating very
> slowly; does that depend on the filesystem?
> 
> 2. The low 165M cache number makes me suspicious. With 14G free and plenty
> of data being read, shouldn't that grow? E.g. right now it's at 11G (and I'm
> running git annex fsck like I was before; I have a lot of data to fsck). I
> believe I've seen similar small cache numbers in the past.
> 
> 3. The git annex fsck was running on a different hard disk. (Normally it
> sits in a cupboard; I've hooked it up temporarily.) Swap, /, /home etc are
> all on a different SSD. I am running the same thing now (different disk) and
> perceive no impact on performance. That's not to say there wasn't intense
> access to the SSD, though; Firefox is a suspect here.
> 
> Nonetheless, if I can't make any other progress, I'll look into noatime
> and/or disknice. (I really wish I could reliably reproduce this, but
> unfortunately it just happens every few days or weeks with no apparent
> pattern other than the system being under some load when it happens.)
> 
> (I'll note one other thing, just in case: I also experience random crashes
> and restarts with this machine that seem to be hardware-related. Very
> different from what I'm describing here; has even happened during BIOS POST,
> and with no disks inside the machine. I just mention it because it opens the
> possibility of unreliable hardware involved, in case that changes things.)
> 

You are probably haunted by a bad issue with DMA memory and running out of
it. Your top output is missing -SH; with it you would probably see the
pagedaemon go bananas. The problem is you have not enough memory below 4G
but the pagedaemon is not able to properly free memory there since it has
no proper tracking for that condition. It only knows memory is short and
tries to drop as much as possible over and over again. As a result your
system becomes unresponsive.

The boot loader can print the memory map, which should show you
how much memory is below 4G (I think the command is machine mem).

This is a known issue and there is some work going on to fix the problem.
-- 
:wq Claudio



Re: advice debugging lockups with swap-thrashing symptoms?

2024-05-23 Thread James Cook

On Thu, May 23, 2024 at 08:00:37AM GMT, Nick Holland wrote:
> On 5/23/24 03:18, Stuart Henderson wrote:
> > On 2024-05-22, James Cook wrote:
> > > One of my OpenBSD boxes sometimes gets in a weird locked-up or
> > > almost-locked-up state. I'm wondering what I can do to debug it
> > > further next time it happens.
> > ...
> > > I would also expect the cache number to be much higher. E.g. on
> > > this occasion, I was running "git annex fsck", which reads plenty
> > > of data from disk.
> >
> > Heavy filesystem access can result in this sort of thing, I used to
> > have unpacked ports source on one of my machines for grepping over,
> > the machine was pretty much unusable for anything else while that was
> > running.
> >
> > Might be worth trying some noatime mount flags if you don't already have
> > them, at least then you can avoid turning some reads into writes.
> >
>
> Definitely a possibility.  Long time ago, I think I asked about the
> possibility of a "disknice" to throttle disk access on individual
> tasks.  TedU@ came through for me with something that definitely solved
> my problem, and I use it from time to time since -- basically, it just
> suspends a particular program occasionally, which lets other programs
> have a chance to get disk access.  I saved it (and made a tiny update
> that is needed now) and put it here:
>
> https://holland-consulting.net/scripts/disknice.html
>
>
> Also...
> I've seen disks "fail" where they get super-slow.  The failure mode
> seems to be difficulty reading data...but after enough retries, it
> succeeds, resetting the retry counter back to zero, and then the next
> read encounters the same problem.  You may be able to hear lots of
> activity on the drive with little obvious progress.   I'm not convinced
> this is your problem, but ... something to consider.
>
> Nick.


Thanks for the pointers. disknice sounds useful. However I am 
skeptical that this can be explained away as a normal consequence 
of intense filesystem access, for a few reasons.


1. In the past, even the mouse pointer has frozen. (I'm 95% sure 
of this from memory. Will note it more carefully next time this 
happens.) Surely that shouldn't depend on disk access? See also 
tmux/xterm updating very slowly; does that depend on the filesystem?


2. The low 165M cache number makes me suspicious. With 14G free 
and plenty of data being read, shouldn't that grow? E.g. right now 
it's at 11G (and I'm running git annex fsck like I was before; I 
have a lot of data to fsck). I believe I've seen similar small cache 
numbers in the past.


3. The git annex fsck was running on a different hard disk. (Normally 
it sits in a cupboard; I've hooked it up temporarily.) Swap, /, /home 
etc are all on a different SSD. I am running the same thing now 
(different disk) and perceive no impact on performance. That's not 
to say there wasn't intense access to the SSD, though; Firefox is 
a suspect here.


Nonetheless, if I can't make any other progress, I'll look into 
noatime and/or disknice. (I really wish I could reliably reproduce 
this, but unfortunately it just happens every few days or weeks 
with no apparent pattern other than the system being under some 
load when it happens.)


(I'll note one other thing, just in case: I also experience random 
crashes and restarts with this machine that seem to be hardware-related. 
Very different from what I'm describing here; has even happened 
during BIOS POST, and with no disks inside the machine. I just 
mention it because it opens the possibility of unreliable hardware 
involved, in case that changes things.)


--
James



Re: advice debugging lockups with swap-thrashing symptoms?

2024-05-23 Thread Nick Holland

On 5/23/24 03:18, Stuart Henderson wrote:
> On 2024-05-22, James Cook wrote:
> > One of my OpenBSD boxes sometimes gets in a weird locked-up or
> > almost-locked-up state. I'm wondering what I can do to debug it
> > further next time it happens.
> ...
> > I would also expect the cache number to be much higher. E.g. on
> > this occasion, I was running "git annex fsck", which reads plenty
> > of data from disk.
>
> Heavy filesystem access can result in this sort of thing, I used to
> have unpacked ports source on one of my machines for grepping over,
> the machine was pretty much unusable for anything else while that was
> running.
>
> Might be worth trying some noatime mount flags if you don't already have
> them, at least then you can avoid turning some reads into writes.



Definitely a possibility.  Long time ago, I think I asked about the
possibility of a "disknice" to throttle disk access on individual
tasks.  TedU@ came through for me with something that definitely solved
my problem, and I use it from time to time since -- basically, it just
suspends a particular program occasionally, which lets other programs
have a chance to get disk access.  I saved it (and made a tiny update
that is needed now) and put it here:

https://holland-consulting.net/scripts/disknice.html
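
The idea itself is simple enough to sketch. What follows is not the script at
the URL above, just a minimal Python illustration of suspending and resuming a
process with SIGSTOP and SIGCONT. The function name, parameters, and intervals
are made up for the example.

```python
import os
import signal
import time


def throttle(pid, run_secs=1.0, pause_secs=1.0, max_cycles=None):
    """Alternately let `pid` run and suspend it.

    Stops when the process no longer exists or after `max_cycles`
    run/pause cycles. Returns the number of completed cycles.
    """
    cycles = 0
    try:
        while max_cycles is None or cycles < max_cycles:
            time.sleep(run_secs)            # let the process do its work
            os.kill(pid, signal.SIGSTOP)    # suspend it
            time.sleep(pause_secs)          # others get a turn at the disk
            os.kill(pid, signal.SIGCONT)    # resume it
            cycles += 1
    except ProcessLookupError:
        pass  # the target has gone away; nothing left to throttle
    finally:
        # Never leave the target suspended if we are interrupted mid-pause.
        try:
            os.kill(pid, signal.SIGCONT)
        except ProcessLookupError:
            pass
    return cycles
```

Something like throttle(12345, run_secs=2, pause_secs=1) would let a
(hypothetical) PID 12345 run for two seconds out of every three. You need
permission to signal the target, so it has to be your own process or you
have to be root.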


Also...
I've seen disks "fail" where they get super-slow.  The failure mode
seems to be difficulty reading data...but after enough retries, it
succeeds, resetting the retry counter back to zero, and then the next
read encounters the same problem.  You may be able to hear lots of
activity on the drive with little obvious progress.   I'm not convinced
this is your problem, but ... something to consider.

Nick.



Re: advice debugging lockups with swap-thrashing symptoms?

2024-05-23 Thread Stuart Henderson
On 2024-05-22, James Cook wrote:
> One of my OpenBSD boxes sometimes gets in a weird locked-up or
> almost-locked-up state. I'm wondering what I can do to debug it
> further next time it happens.
...
> I would also expect the cache number to be much higher. E.g. on
> this occasion, I was running "git annex fsck", which reads plenty
> of data from disk.

Heavy filesystem access can result in this sort of thing, I used to
have unpacked ports source on one of my machines for grepping over,
the machine was pretty much unusable for anything else while that was
running.

Might be worth trying some noatime mount flags if you don't already have
them, at least then you can avoid turning some reads into writes.