Re: [RFC]: kexec: change to handle memory/cpu changes

2020-12-21 Thread Baoquan He
On 12/14/20 at 10:50am, Eric DeVolder wrote:
...
> The cell contents show the number of seconds it took for the system to
> process all of the 3840 memblocks. The value in parenthesis is the
> number of kdump unload-then-reload operations per second.
> 
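>          |  1 480GiB DIMM  |  480 1GiB DIMMs |
>   -------+-----------------+-----------------+
>    RHEL7 | 181s (21.2 ops) | 389s (9.8 ops)  |
>   -------+-----------------+-----------------+
>    RHEL8 |  86s (44.7 ops) | 419s (9.2 ops)  |
>   -------+-----------------+-----------------+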
> 
> The scenario of adding 480 1GiB virtual DIMMs takes more time given
> the larger number of round trips of QEMU -> kernel -> udev -> kernel ->
> QEMU; both systems take roughly 400s.
> 
> The RHEL7 system processes all 3840 memblocks individually and performs
> 3840 kdump unload-then-reload operations.
> 
> However, RHEL8 data in the best case scenario (1 480GiB DIMM) suggests
> that approximately 86/4 = 21 kdump unload-then-reload operations
> happened, and in the worst case scenario (480 1GiB DIMMs), the data
> suggests that approximately 419/4 = 105 kdump unload-then-reload
> operations happened. For RHEL8, the final number of kdump
> unload-then-reload operations is 0.5% (21 of 3840) and 2.7% (105 of
> 3840), respectively, of that of the RHEL7 system.
> 
> The throttle approach is quite effective in reducing the number of
> kdump unload-then-reload operations. However, the kdump capture kernel
> is still reloaded multiple times, and each kdump capture kernel reload
> is a race window in which kdump can fail.
> 
> A quick peek at Ubuntu 20.04 LTS reveals it has 50-kdump-tools.rules
> that looks like:
> 
>   SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/usr/sbin/kdump-config try-reload"
>   SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/usr/sbin/kdump-config try-reload"
>   SUBSYSTEM=="cpu", ACTION=="add", PROGRAM="/usr/sbin/kdump-config try-reload"
>   SUBSYSTEM=="cpu", ACTION=="remove", PROGRAM="/usr/sbin/kdump-config try-reload"
>   SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="/usr/sbin/kdump-config try-reload"
> 
> which produces the equivalent behavior to RHEL7 whereby every event
> results in a kdump capture kernel reload.
> 
> Fedora 33 and CentOS 8-stream behave the same as RHEL8.
> 
> Perhaps a better solution is to rewrite the vmcoreinfo structure that
> contains the memory and CPU layout information, as those changes to
> memory and CPUs occur. Rewriting vmcoreinfo is an in-kernel activity
> and would certainly avoid the relatively large unload-then-reload
> times of the kdump capture kernel. The pointer to the vmcoreinfo
> structure is provided to the capture kernel via the elfcorehdr=
> parameter to the capture kernel cmdline. Rewriting the vmcoreinfo
> structure as well as rewriting the capture kernel cmdline parameter is
> needed to utilize this approach.

Great investigation and conclusion, and a very nice idea below. When I
read the first half of this mail, I thought maybe we could add a new
option to the kexec-tools utility that updates only elfcorehdr when
hotplug udev events are detected. Having come to this part, I would say
yes, doing it inside the kernel looks better. Special handling for
hotplug looks necessary, as you have said. I will check what we can do
and get back with some details. Thanks for doing this.

Thanks
Baoquan

> 
> Based upon some amount of examining code, I think the challenges
> involved in updating the CPU and memory layout in-kernel are:
> 
>  - adding call-outs on the add_memory()/try_remove_memory() and
>    cpu_up()/cpu_down() paths for notifying the kdump subsystem of
>    memory and/or CPU changes.
> 
>  - updating the struct kimage with the memory or CPU changes
> 
>  - Rewriting the vmcoreinfo structure from the data contained
>    in struct kimage, eg crash_prepare_elf64_headers()
> 
>  - Installing the updated vmcoreinfo struct via
>    kimage_crash_copy_vmcoreinfo() and rewriting the kdump kernel
>    cmdline in order to update parameter elfcorehdr= with the
>    new address
> 
> As I am not overly familiar with all the code paths involved, yet, I'm
> sure the devil is in the details. However, due to the kexec_file_load
> syscall, it appears most of the infrastructure is already in place,
> and we essentially need to tap into it again for memory and cpu
> changes.
> 
> It appears that this change could be applicable to both kexec_load and
> kexec_file_load. It has the potential to (eventually) simplify the
> userland kexec utility for kexec_load, and would eliminate the need
> for 98-kexec.rules and the associated churn.
> 
> Comments please!
> eric



[RFC]: kexec: change to handle memory/cpu changes

2020-12-14 Thread Eric DeVolder



As the memory layout or the number of CPUs change, a loaded kdump
capture kernel must also be updated. By having an accurate
representation of the memory and CPU layout, the resulting kdump
capture in response to a kernel panic will be complete and accurate.

Currently, the memory and CPU layout is recorded only when the kdump
capture kernel is loaded via the kexec userspace utility. If the kexec
utility is passing a non-signed kernel, then kexec also passes in the
memory and CPU layout (which is obtained via sysfs). If the kernel is
signed, then the kernel constructs the memory and CPU layout during
the syscall (also from sysfs resources). In either case, the memory
and CPU layout is recorded in a vmcoreinfo structure at kdump capture
kernel load time; and the capture kernel, capture kernel initrd,
capture kernel cmdline and the vmcoreinfo struct are all located in
crash kernel memory (reserved via the crashkernel= parameter to the
current kernel), ready to go upon a panic.

When memory and/or CPU changes do occur, these changes cause kernel
and subsequently udev events. The handling of the udev events allows
for the onlining/offlining of the associated memory or CPUs. The udev
event handling is also utilized to update the kdump capture kernel, by
reloading it.

In the Red Hat distributions, for example, the udev rule
98-kexec.rules handles the memory and CPU udev events so that kdump
can be updated. In RHEL7, it looks like:

  SUBSYSTEM=="cpu", ACTION=="add", GOTO="kdump_reload"
  SUBSYSTEM=="cpu", ACTION=="remove", GOTO="kdump_reload"
  SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload"
  SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload"

  GOTO="kdump_reload_end"

  LABEL="kdump_reload"

  # If kdump is not loaded, calling "kdumpctl reload" will end up
  # doing nothing, but it and systemd-run will always generate
  # extra logs for each call, so trigger the "kdumpctl reload"
  # only if kdump service is active to avoid unnecessary logs
  RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --quiet /usr/bin/kdumpctl reload'"


  LABEL="kdump_reload_end"

On any CPU add or remove, or any memory online or offline event, the
rules jump to the "kdump_reload" label, which invokes kdumpctl to reload
the kdump kernel. This happens for each CPU and/or memory block added
to or removed from the system.
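
To make the consequence concrete, the sketch below counts how many
reloads the RHEL7 rules above would trigger for a stream of events.
The input format (one "SUBSYSTEM ACTION" pair per line) and the
function name count_kdump_reloads are assumptions made purely for
illustration; real events are delivered by udev, not by a text stream.

```shell
#!/bin/bash
# Hypothetical helper: count the events that would reach the
# kdump_reload label under the RHEL7 rules shown above.
# Input: lines of "SUBSYSTEM ACTION" on stdin (illustrative format).
count_kdump_reloads() {
    local n=0 subsystem action
    while read -r subsystem action; do
        case "$subsystem:$action" in
            cpu:add|cpu:remove|memory:online|memory:offline)
                n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Two memory events and one cpu event match; the block event does not.
printf '%s\n' "memory online" "memory online" "cpu add" "block change" \
    | count_kdump_reloads
```

Every matching event maps to one reload, which is the one-to-one
behavior described above.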

When memory is added, for example, the memory subsystem breaks the
memory into smaller memblocks, typically 128MiB. Then for each
memblock, a udev event is triggered, which in turn causes the current
kdump kernel to be unloaded, then reloaded again (this time with an
updated memory layout including the just onlined memblock). Thus,
adding 1GiB of memory (eight 128MiB memblocks) actually causes the
kdump capture kernel to be unloaded then reloaded 8 times.

The actual problem is that there is a window of time, between when the
kdump capture kernel is unloaded and when it is reloaded again, in
which kdump fails if a panic occurs. It fails simply because there
isn't a valid kdump kernel resident.
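
Whether a capture kernel is currently resident can be observed from
userspace through /sys/kernel/kexec_crash_loaded, which reads 1 when a
crash kernel is loaded and 0 otherwise. A minimal sketch follows; the
function name kdump_loaded and its optional path argument (there only
to make the check easy to exercise) are illustrative and not part of
any existing tool.

```shell
#!/bin/bash
# Succeeds (exit status 0) if a kdump capture kernel is loaded.
# The optional path argument is an assumption for testing; by default
# the real sysfs file is consulted.
kdump_loaded() {
    local f=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

if kdump_loaded; then
    echo "capture kernel resident"
else
    echo "no capture kernel resident: a panic now would not produce a dump"
fi
```

Polling this during a hotplug storm would show the file flipping to 0
for the duration of each unload-then-reload.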

Now, as long as the number of events is small, for example going from
8 to 16 CPUs, or adding just 1GiB of memory, this unloading-then-reloading
behavior has been, for the most part, tolerable. But in cloud
environments, in particular, with guest memory being added and removed
by the tens or hundreds of gigabytes, this behavior easily exposes
the race window(s) which can cause kdump to fail.

As a more concrete example, sizing a guest up from 32GiB to 512GiB of
memory adds 480GiB, which results in 3840 128MiB memblocks and as many
udev events!  It took the guest nearly 400 seconds to process all the
events. Clearly this is a waste of time and energy, but more
importantly it creates 3839 unnecessary race windows in which kdump
can fail.
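
The arithmetic behind these counts can be sketched in shell. The
function name memblock_count is made up for illustration, and 128MiB
is the typical block size mentioned above; on a running system the
actual size is reported (in hex) by
/sys/devices/system/memory/block_size_bytes.

```shell
#!/bin/bash
# Hypothetical helper: number of memblocks, and hence udev memory
# events, produced when added_gib GiB of memory is hot-added.
# block_mib defaults to 128, the typical block size assumed above.
memblock_count() {
    local added_gib=$1
    local block_mib=${2:-128}
    echo $(( added_gib * 1024 / block_mib ))
}

memblock_count 1               # adding 1GiB: prints 8
memblock_count $((512 - 32))   # growing 32GiB -> 512GiB: prints 3840
```

With one reload per memblock, only the last of the 3840 reloads is
actually needed, which is where the 3839 unnecessary race windows come
from.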

The excessive kdump capture kernel unload-then-reload activity did not
go unnoticed and, in RHEL8, the 98-kexec.rules invokes a new script
named kdump-udev-throttle that looks like this:

  #!/bin/bash
  # This util helps to reduce the workload of kdump service restarting
  # on udev event. When hotplugging memory / CPU, multiple udev
  # events may be triggered concurrently, and obviously, we don't want
  # to restart kdump service for each event.

  # This script will be called by udev, and make sure kdump service is
  # restart after all events we are watching are settled.

  # On each call, this script will try to acquire the $throttle_lock
  # The first instance acquired the file lock will keep waiting for events
  # to settle and then reload kdump. Other instances will just exit
  # In this way, we can make sure kdump service is restarted immediately
  # and for exactly once after udev events are settled.

  throttle_lock="/var/lock/kdump-udev-throttle"

  exec 9>$throttle_lock
  if [ $? -ne 0 ]; then
      echo "Failed to create the lock file! Fallback to non-throttled kdump service restart"
      /bin/kdumpctl reload
      exit 1
  fi