> On Mar 13, 2023, at 9:36 AM, Joe Conway <m...@joeconway.com> wrote:
>
> On 3/13/23 13:21, Israel Brewster wrote:
>> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit
>> more memory constrained than I would like, such that every week or so the
>> various processes running on the machine will align badly and the OOM killer
>> will kick in, killing off postgresql, as per the following journalctl output:
>>
>> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
>> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
>> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.
>>
>> And the service is no longer running.
>>
>> When this happens, I go in and restart the postgresql service, and
>> everything is happy again for the next week or two.
>>
>> Obviously this is not a good situation. Which leads to two questions:
>>
>> 1) Is there some tweaking I can do in the postgresql config itself to
>> prevent the situation from occurring in the first place?
>>
>> 2) My first thought was to simply have systemd restart postgresql whenever
>> it is killed like this, which is easy enough. Then I looked at the default
>> unit file, and found these lines:
>>
>> # prevent OOM killer from choosing the postmaster (individual backends will
>> # reset the score to 0)
>> OOMScoreAdjust=-900
>> # restarting automatically will prevent "pg_ctlcluster ... stop" from working,
>> # so we disable it here. Also, the postmaster will restart by itself on most
>> # problems anyway, so it is questionable if one wants to enable external
>> # automatic restarts.
>> #Restart=on-failure
>>
>> This seems to imply that the OOM killer should only be killing off
>> individual backends, not the entire cluster, to begin with - which should be
>> fine. It also suggests that adding the Restart=on-failure option is probably
>> not the greatest idea. Which makes me wonder what is really going on?
>
> First, are you running with a cgroup memory.limit set (e.g. in a container)?
Not sure, actually. I *think* I set it up as a full VM, though, not a container. I’ll have to double-check that.

> Assuming no, see:
>
> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> That will tell you:
>
> 1/ Turn off memory overcommit: "Although this setting will not prevent the
> OOM killer from being invoked altogether, it will lower the chances
> significantly and will therefore lead to more robust system behavior."
>
> 2/ Set /proc/self/oom_score_adj to -1000 rather than -900
> (OOMScoreAdjust=-1000): the value -1000 is important as it is a "magic" value
> which prevents the process from being selected by the OOM killer (see:
> https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/oom.h#L6)
> whereas -900 just makes it less likely.

...and that answers the question I just sent about the above linked page 😄 Thanks!

> All that said, even if the individual backend gets killed, the postmaster
> will still go into crash recovery. So while technically postgres does not
> restart, the effect is much the same. So see #1 above as your best protection.

Interesting. Makes sense though. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

> HTH,
>
> Joe
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
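
P.S. For my own notes (and anyone else who hits this later), here is roughly what those two suggestions translate to on a stock Ubuntu/Debian install. This is just a sketch: the sysctl file name is arbitrary, and the unit name is simply what my 13/main cluster uses, so adjust as needed.

  # 1/ Turn off memory overcommit, persistently (per the linked kernel-resources
  #    page, vm.overcommit_ratio may also need tuning for the RAM/swap on the box)
  echo "vm.overcommit_memory = 2" | sudo tee /etc/sysctl.d/90-pg-overcommit.conf
  sudo sysctl --system

  # 2/ Raise the postmaster's protection to the "magic" -1000 via a systemd
  #    drop-in instead of editing the packaged unit file
  sudo systemctl edit postgresql@13-main.service
  #    ...then add in the editor:
  #    [Service]
  #    OOMScoreAdjust=-1000
  sudo systemctl daemon-reload
  sudo systemctl restart postgresql@13-main.service

And if it turns out this really is a container with a cgroup memory limit, I gather the host-wide overcommit setting won't do much against a per-cgroup limit, so I'll check that first.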