As the memory layout or the number of CPUs change, a loaded kdump
capture kernel must also be updated. By having an accurate
representation of the memory and CPU layout, the resulting kdump
capture in response to a kernel panic will be complete and accurate.
Currently, the memory and CPU layout is recorded only when the kdump
capture kernel is loaded via the kexec userspace utility. If the kexec
utility is passing a non-signed kernel, then kexec also passes in the
memory and CPU layout (which is obtained via sysfs). If the kernel is
signed, then the kernel constructs the memory and CPU layout during
the syscall (also from sysfs resources). In either case, the memory
and CPU layout is recorded in a vmcoreinfo structure at kdump capture
kernel load time; and the capture kernel, capture kernel initrd,
capture kernel cmdline and the vmcoreinfo struct are all located in
crash kernel memory (reserved via the crashkernel= parameter to the
current kernel), ready to go upon a panic.
When memory and/or CPU changes do occur, these changes cause kernel
and subsequently udev events. The handling of the udev events allows
for the onlining/offlining of the associated memory or CPUs. The udev
event handling is also utilized to update the kdump capture kernel, by
reloading it.
In the Red Hat distributions, for example, the udev rule
98-kexec.rules handles the memory and CPU udev events so that kdump
can be updated. In RHEL7, it looks like:
SUBSYSTEM=="cpu", ACTION=="add", GOTO="kdump_reload"
SUBSYSTEM=="cpu", ACTION=="remove", GOTO="kdump_reload"
SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload"
SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload"
GOTO="kdump_reload_end"
LABEL="kdump_reload"
# If kdump is not loaded, calling "kdumpctl reload" will end up
# doing nothing, but it and systemd-run will always generate
# extra logs for each call, so trigger the "kdumpctl reload"
# only if kdump service is active to avoid unnecessary logs
RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run
--quiet /usr/bin/kdumpctl reload'"
LABEL="kdump_reload_end"
On any change to memory or CPUs (ie. ACTION=="add" or
ACTION=="remove"), the "kdump_reload" label invokes kdumpctl to reload
the kdump kernel. This happens for each CPU and/or memory block added
or removed to/from the system.
When memory is added, for example, the memory subsystem breaks the
memory into smaller memblocks, typically 128MiB. Then for each of
memblock, a udev event is triggered, which in turns causes the current
kdump kernel to be unloaded, then reloaded again (this time with an
updated memory layout including the just onlined memblock). Thus, the
act of adding the 1GiB actually causes the kdump capture kernel to be
unloaded then reloaded 8 times.
The actual problem is that there is a window of time between when the
kdump capture kernel is unloaded and before it is reloaded again,
where kdump fails if a panic occurs. It fails simply because there
isn't a valid kdump kernel resident.
Now, as long as the number of events is small, for example going from
8 to 16 CPUs, or adding a single 1GiB, this unloading-then-reloading
behavior has been, for the most part, tolerable. But in cloud
environments, in particular, with guest memory being added and removed
by the tens or hundreds of gigabytes, this behavior easily exposes
the race window(s) which can cause kdump to fail.
As a more concrete example, on a guest with 32GiB memory, and sizing
up the guest to 512GiB memory, this results in 3840 128MiB memblocks,
and associated udev events! It took the guest nearly 400 seconds to
process all the events. Clearly this is a waste of time and energy,
but more importantly it creates 3839 un-necessary race windows in
which kdump can fail.
The excessive kdump capture kernel unload-then-reload activity did not
go un-noticed and, in RHEL8, the 98-kexec.rules invokes a new script
named kdump-udev-throttle that looks like this:
#!/bin/bash
# This util helps to reduce the workload of kdump service restarting
# on udev event. When hotplugging memory / CPU, multiple udev
# events may be triggered concurrently, and obviously, we don't want
# to restart kdump service for each event.
# This script will be called by udev, and make sure kdump service is
# restart after all events we are watching are settled.
# On each call, this script will update try to aquire the $throttle_lock
# The first instance acquired the file lock will keep waiting for events
# to settle and then reload kdump. Other instances will just exit
# In this way, we can make sure kdump service is restarted immediately
# and for exactly once after udev events are settled.
throttle_lock="/var/lock/kdump-udev-throttle"
exec 9>$throttle_lock
if [ $? -ne 0 ]; then
echo "Failed to create the lock file! Fallback to non-throttled kdump
service restart"
/bin/kdumpctl reload
exit 1
fi