On 2022-Apr-23, at 10:26, Pete Wright <p...@nomadlogic.org> wrote:

> On 4/22/22 18:46, Mark Millard wrote:
>> On 2022-Apr-22, at 16:42, Pete Wright <p...@nomadlogic.org> wrote:
>> 
>>> On 4/21/22 21:18, Mark Millard wrote:
>>>> Messages in the console out would be appropriate
>>>> to report. Messages might also be available via
>>>> the following at appropriate times:
>>> That is what is frustrating. I will get notifications that the processes
>>> are killed:
>>> Apr 22 09:55:15 topanga kernel: pid 76242 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:19 topanga kernel: pid 76288 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:20 topanga kernel: pid 76259 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:22 topanga kernel: pid 76252 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:23 topanga kernel: pid 76267 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:24 topanga kernel: pid 76234 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:26 topanga kernel: pid 76275 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>> Those messages are not reporting being out of swap
>> as such. They are reporting sustained low free RAM
>> despite a number of less drastic attempts to gain
>> back free RAM (to above some threshold).
>> 
>> FreeBSD does not swap out the kernel stacks for
>> processes that stay in a runnable state: it just
>> continues to page. Thus just one large process
>> that has a huge working set of active pages can
>> lead to OOM kills in a context where no other set
>> of processes would be enough to gain the free
>> RAM required. Such contexts are not really a
>> swap issue.
> 
> Thank you for this clarification/explanation - that totally makes sense!
> 
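
A side note, in case it is useful to watch the condition described
above while it is happening: the base-system tools can show the free
page count and the paging activity in real time. Nothing below is
specific to this report, and the one-second interval is an arbitrary
choice.

# vmstat -w 1
# top -o res
# sysctl vm.stats.vm.v_free_count vm.v_free_target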
>> 
>> Based on there being only 1 "killed:" reason,
>> I have a suggestion that should allow delaying
>> such kills for a long time. That in turn may
>> help with investigating without actually
>> suffering the kills during the activity: more
>> time with low free RAM to observe.
> 
> Great idea, thank you! And thanks for the example settings and descriptions
> as well.
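
As a rough sketch, for anyone reading along: the example settings
referred to above are not quoted here, but they amount to setting the
one tunable persistently in either of the usual places. The value 120
is the one used later in this message; larger values delay the kills
even longer.

# in /etc/sysctl.conf (applied during boot, once the kernel is up):
vm.pageout_oom_seq=120

# or in /boot/loader.conf (applied as a loader tunable):
vm.pageout_oom_seq=120
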
>> But those are large but finite activities. If
>> you want to leave something running for days,
>> weeks, months, or whatever that produces the
>> sustained low free RAM conditions, the problem
>> will eventually happen. Ultimately one may have
>> to exit and restart such processes once in a
>> while, exiting enough of them to give a little
>> time with sufficient free RAM.
> Perfect. Since this is a workstation, my run-time for these processes is
> probably a week: I update my system and pkgs over the weekend, then dogfood
> CURRENT during the work week.
> 
>>> Yes, I have 2 GB of swap that resides on an NVMe device.
>> I assume a partition-style setup. Otherwise there are other
>> issues involved, ones that likely should be avoided by
>> switching to partition style.
> 
> So I kind of lied: initially I had just a 2G swap, but I added a second 20G
> swap a while ago to have enough space to capture some cores while testing
> drm-kmod work. Based on this comment I am going to use only the 20G
> file-backed swap and see how that goes.
> 
> This is my current fstab entry for the file-backed swap:
> md99 none swap sw,file=/root/swap1,late 0 0

I think you may have taken my suggestion backwards . . .

Unfortunately, vnode-backed (file-based) swap space should be *avoided*;
partitions are what should be used in order to avoid deadlocks:

On 2017-Feb-13, at 7:20 PM, Konstantin Belousov <kostikbel at gmail.com> wrote
on the freebsd-arm list:

QUOTE
swapfile write requires the write request to come through the filesystem
write path, which might require the filesystem to allocate more memory
and read some data. E.g. it is known that any ZFS write request
allocates memory, and that write request on large UFS file might require
allocating and reading an indirect block buffer to find the block number
of the written block, if the indirect block was not yet read.

As result, swapfile swapping is more prone to the trivial and unavoidable
deadlocks where the pagedaemon thread, which produces free memory, needs
more free memory to make a progress.  Swap write on the raw partition over
simple partitioning scheme directly over HBA are usually safe, while e.g.
zfs over geli over umass is the worst construction.
END QUOTE

The Developers' Handbook has a section on debugging deadlocks that he
referenced in a response to another report (on freebsd-hackers):

https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/#kerneldebug-deadlocks
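
For reference, a partition-backed swap entry is a plain device line in
/etc/fstab, with no md(4) device involved. The device name below is
only an illustration (it assumes a freebsd-swap partition already
exists as the third partition of an NVMe drive; adjust to the actual
layout):

/dev/nvd0p3	none	swap	sw	0	0

# after editing fstab, enable and verify:
# swapon -a
# swapinfo -h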

>>>> ZFS (so with ARC)? UFS? Both?
>>> I am using ZFS and am setting vfs.zfs.arc.max to 10G. I have also
>>> experienced this crash with that set to the default (unlimited) value.
>> I use ZFS on systems with at least 8 GiBytes of RAM,
>> but I've never tuned ZFS. So I'm not much help for
>> that side of things.
> 
> Since we started this thread I've gone ahead and removed the vfs.zfs.arc.max
> setting, since it's cruft at this point. I initially added it to test a
> configuration I deployed to a server hosting a bunch of VMs.
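
For completeness, since the tunable came up: vfs.zfs.arc.max takes a
byte count, so a 10G cap like the one mentioned above would normally
look something like the following (loader.conf form shown; on recent
OpenZFS the sysctl is also writable at runtime):

# in /boot/loader.conf (10 GiB expressed in bytes):
vfs.zfs.arc.max="10737418240"

# the current ARC size can be checked at runtime with:
# sysctl kstat.zfs.misc.arcstats.size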
> 
>> I'm hoping that vm.pageout_oom_seq=120 (or more) makes it
>> so you do not have to have identified everything up front
>> and can explore more easily.
>> 
>> 
>> Note that vm.pageout_oom_seq is both a loader tunable
>> and a writeable runtime tunable:
>> 
>> # sysctl -T vm.pageout_oom_seq
>> vm.pageout_oom_seq: 120
>> # sysctl -W vm.pageout_oom_seq
>> vm.pageout_oom_seq: 120
>> 
>> So you can use it to extend the time when the
>> machine is already running.
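
In other words, once the system is up, something like the following
raises the threshold in place, with no reboot needed (the default is
12 on the systems I have looked at; 120 or more just makes the page
daemon try much longer before resorting to kills):

# sysctl vm.pageout_oom_seq=120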
> 
> Fantastic. Thanks again for taking your time and sharing your knowledge and
> experience with me, Mark!
> 
> These types of journeys are why I run CURRENT on my daily driver; it really
> helps me better understand the OS so that I can be a better admin on the
> "real" servers I run for work. It's also just fun to learn stuff, heh.
> 


===
Mark Millard
marklmi at yahoo.com

