On 9/15/2020 1:18 AM, Michal Hocko wrote:
On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
On 9/14/2020 7:33 AM, Michal Hocko wrote:
On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
When memory is hotplug added or removed the min_free_kbytes must be
recalculated based on what is expected by khugepaged. Currently
after hotplug, min_free_kbytes will be set to a lower default and higher
default set when THP enabled is lost. This leaves the system with small
min_free_kbytes which isn't suitable for systems especially with network
intensive loads. Typical failure symptoms include HW WATCHDOG reset,
soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
kills.
Care to explain some more please? The whole point of increasing
min_free_kbytes for THP is to get a larger free memory with a hope that
huge pages will be more likely to appear. While this might help for
other users that need a high order pages it is definitely not the
primary reason behind it. Could you provide an example with some more
data?
Thanks Michal. I haven't looked into THP as part of my investigation, so I
cannot comment.
In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
during normal reboot/shutdown. This memory is hotplug hot-added as movable
type via systemd late service during start-of-day.
In our stress test first we ran into HW WATCHDOG recovery, on enabling
kernel watchdog we started seeing soft lockup hung task notices, failure
symptons varied, where stack trace of hung tasks sometimes trying to
allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
timeouts, OOM process kills etc., During investigation we reran stress test
without hotplug use case. Surprisingly this run didn't encounter the said
problems. This led to comparing what is different between the two runs,
while looking at various globals, studying hotplug code I uncovered the
issue of failing to restore min_free_kbytes. In particular on our 8GB SoC
min_free_kbytes went down to 8703 from 22528 after hotplug add.
Did you try to increase min_free_kbytes manually after hot remove? Btw.
No, in our use case memory hot remove done during shutdown.
I would consider oom killer invocation due to min_free_kbytes really
weird behavior. If anything the higher value would cause more memory
reclaim and potentially oom rather than smaller one.
Yes, we wondered about it too. One panic stack trace (after many OOM kills)
[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G
O 5.4.51-xxx #1
[330321.196900] Hardware name: Overlake (DT)
[330321.201038] Call trace:
[330321.203660] dump_backtrace+0x0/0x1d0
[330321.207533] show_stack+0x20/0x2c
[330321.211048] dump_stack+0xe8/0x150
[330321.214656] panic+0x18c/0x3b4
[330321.217901] out_of_memory+0x4c0/0x6e4
[330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722] alloc_pages_current+0x21c/0x2b0
[330321.231220] alloc_slab_page+0x1e0/0x7d8
[330321.235361] new_slab+0x2e8/0x2f8
[330321.238874] ___slab_alloc+0x45c/0x59c
[330321.242835] kmem_cache_alloc+0x2d4/0x360
[330321.247065] getname_flags+0x6c/0x2a8
[330321.250938] user_path_at_empty+0x3c/0x68
[330321.255168] do_readlinkat+0x7c/0x17c
[330321.259039] __arm64_sys_readlinkat+0x5c/0x70
[330321.263627] el0_svc_handler+0x1b8/0x32c
[330321.267767] el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!
Then while searching I came across documented warning below. In above
instance panic after OOM kills happened after 3+ days of stress run (a
mixure of ttcp, cpuloadgen and fio).
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
Warning
Extreme values can damage your system. Setting min_free_kbytes to an
extremely low value prevents the system from reclaiming memory, which
can result in system hangs and OOM-killing processes. However, setting
min_free_kbytes too high (for example, to 5–10% of total system memory)
causes the system to enter an out-of-memory state immediately, resulting
in the system spending too much time reclaiming memory.