Re: [PATCHv2 0/3] kexec: Add support for UKI format kernel
Hi Pingfan, Hi Simon, I know I'm a little bit too late, but I finally had time to take a look. All in all the series looks fine to me. But I have two questions. 1) Is there a reason why you don't enable the UKI for x86 as well? The way I see it patch 1 and 2 cover x86 as well. So you only should need a patch similar to patch 3. Or am I missing something? 2) Would it make sense to "reuse" the mechanism you introduced for the initrd for the command line as well. I.e. if it's a UKI with a .cmdline section and the user didn't provide any of the --command-line, --reuse-cmdline, etc. then use the command line from the UKI. Thanks Philipp On Wed, 16 Oct 2024 19:34:12 +0800 Pingfan Liu wrote: > As a UEFI PE format kernel image becomes more popular, there is a need > for kexec to reboot those kinds of images. > > After the introduction of the UKI (another PE), at present, there are > three competitive methods to support that goal, but all of them have > pros and cons.[1] It seems that none of them can be accepted in the near > future. Therefore, we are resorting to the user space kexec-tools to > parse the UKI format for the time being. > > By parsing the UKI, systemd-stub is stepped around and PCM will not affect > the boot up of the second system. > > [1]: > https://github.com/rhkdump/kexec_uefi/blob/main/overview.md#the-competitive-solutions > > Cc: Simon Horman > Cc: Eric Biederman > Cc: Baoquan He > Cc: Dave Young > Cc: Ard Biesheuvel > Cc: Jan Hendrik Farr > Cc: Philipp Rudo > Cc: Lennart Poettering > Cc: kexec@lists.infradead.org > --- > v1 -> v2: > Add include/pe.h in makefile to fix 'make dist' > Rename default_initrd_fd to implicit_initrd_fd > > > Pingfan Liu (3): > kexec: Introduce implicit_initrd_fd to pass internal initrd > information > kexec: Introduce UKI image parser > arm64: Support UKI image format > > include/Makefile | 1 + > include/pe.h | 104 +++ > kexec/Makefile | 1 + > kexec/arch/arm64/kexec-arm64.c | 1 + > kexec/arch/arm64/kexec-image-arm64.c | 3 +- > kexec/arch/x86_64/kexec-bzImage64.c | 3 +- > kexec/kexec-uki.c| 122 +++ > kexec/kexec.c| 2 + > kexec/kexec.h| 5 ++ > 9 files changed, 240 insertions(+), 2 deletions(-) > create mode 100644 include/pe.h > create mode 100644 kexec/kexec-uki.c > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] Submit physical address to is_phys_addr()
Hi Alexander, On Fri, 13 Sep 2024 16:35:00 +0200 Alexander Gordeev wrote: > ELF program table virtual address is submitted to is_phys_addr() > function, most likely as a leftover of replaced is_vmalloc_addr() > function, as result of commit 2e452d75fa78 ("[PATCH v3] Enable > --mem-usage for s390x."). Submit the physical address instead. is the patch supposed to fix the bug with /proc/kcore I reported? I gave it a try on a KVM VM and it's still failing... Thanks Philipp > Signed-off-by: Alexander Gordeev > --- > elf_info.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/elf_info.c b/elf_info.c > index bc24083..7f3d053 100644 > --- a/elf_info.c > +++ b/elf_info.c > @@ -765,7 +765,7 @@ int get_kcore_dump_loads(void) > for (i = 0; i < num_pt_loads; ++i) { > struct pt_load_segment *p = &pt_loads[i]; > if (p->phys_start == NOT_PADDR > - || !is_phys_addr(p->virt_start)) > + || !is_phys_addr(p->phys_start)) > continue; > loads++; > } > @@ -786,7 +786,7 @@ int get_kcore_dump_loads(void) > for (i = 0, j = 0; i < num_pt_loads; ++i) { > struct pt_load_segment *p = &pt_loads[i]; > if (p->phys_start == NOT_PADDR > - || !is_phys_addr(p->virt_start)) > + || !is_phys_addr(p->phys_start)) > continue; > if (j >= loads) { > free(pls); ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFCv2 0/9] UEFI emulator for kexec
Hi Jarkko, On Sat, 07 Sep 2024 14:41:38 +0300 "Jarkko Sakkinen" wrote: > On Sat Sep 7, 2024 at 2:31 PM EEST, Jarkko Sakkinen wrote: > > On Sat Sep 7, 2024 at 2:27 PM EEST, Jarkko Sakkinen wrote: > > > On Fri Sep 6, 2024 at 1:54 PM EEST, Philipp Rudo wrote: > > > > Let me throw an other wild idea in the ring. Instead of implementing > > > > a EFI runtime we could also include a eBPF version of the stub into the > > > > images. kexec could then extract the eBPF program and let it run just > > > > like any other eBPF program with all the pros (and cons) that come with > > > > it. That won't be as generic as the EFI runtime, e.g. you couldn't > > > > simply kexec any OS installer. On the other hand it would make it > > > > easier to port UKIs et al. to non-EFI systems. What do you think? > > > > > > BPF would have some guarantees that are favorable such as programs > > > always end, even faulty ones. It always has implicit "ExitBootServices". > > > > > > Just a remark. > > > > Some days ago I was thinking could some of the kernel functionality be > > eBPF at least like in formal theory because most of it is amortized, > > i.e. does a fixed chunk of work. Not going into that rabbit hole but > > I really like this idea and could be good experimentation ground for > > such innovation. > > E.g. let's imagine there would imaginary eBPF-TPM driver framework. > > How I would go doing that would be to take the existing TPM driver > functionality and provide extra functions and resources available for > subsystem specific BPF environment, and have the orhestration code as > eBPF. I pretty much concluded that there is a chance that such could > work out. > > Not something in my immediate table but it is still really interesting > idea, as instead of using language to separate "safe" and unsafe" > regions you would use "VM" environments to create the walls. In the > end of the day that would also great venture for Rust in kernel, i.e. > compile that BPF from Rust. > > Sorry going of the hook the comment triggered me ;-) I'm glad you like the idea :-) Sounds like an interesting idea you are having there! Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFCv2 0/9] UEFI emulator for kexec
Hi Lennart, Hi Jan, On Mon, 9 Sep 2024 12:42:45 +0200 Jan Hendrik Farr wrote: > On 09 11:48:30, Lennart Poettering wrote: > > On Fr, 06.09.24 12:54, Philipp Rudo (pr...@redhat.com) wrote: > > > > > I mostly agree on what you have wrote. But I see a big problem in > > > running the EFI emulator in user space when it comes to secure boot. > > > The chain of trust ends in the kernel. So it's the kernel that needs to > > > verify that the image to be loaded can be trusted. But when the EFI > > > runtime is in user space the kernel simply cannot do that. Which means, > > > if we want to go this way, we would need to extend the chain of trust > > > to user space. Which will be a whole bucket of worms, not just a > > > can. > > > > May it would be nice to have a way to "zap" userspace away, i.e. allow > > the kernel to get rid of all processes in some way, reliable. And then > > simply start a new userspace, from a trusted definition. Or in other > > words: if you don't want to trust the usual userspace, then let's > > maybe just terminate it, and create it anew, with a clean, pristine > > definition the old userspace cannot get access to. > > Well, this is an interesting idea! > > However, I'm sceptical if this could be done in a secure way. How do we > ensure that nothing the old userspace did with the various interfaces to > the kernel has no impact on the new userspace? Maybe others can chime in > on this? Does kernel_lockdown give more guarantees related to this? > > Even if this is possible in a secure way, there is a problem with doing > this for kernels that are to be kexec'd on kernel panic. In this > approach we can't pre-run them until EBS(), so we would rely on the old > kernel to still be intact when we want to kexec reboot. I don't believe there's a way to do that on running kernels. As Jan pointed out, this cannot be done during reboot, as for kdump that would mean to run after a panic. So it would need to run when the new image is loaded. But at that time your user space is running. Plus you also always have a user space component that triggers kexec. So you cannot simply "zap" user space but have to somehow stash it away, run your trusted user space and, then restore the old user space again. That sounds pretty error prone to me. Plus it will tank your performance every time you do a kexec, which for kdump is every boot... > You could do a system where you kexec into an intermediate kernel. That > kernel get's kexec'd with a signed initrd that can use the normal > kexec_load syscall to load do any kind of preparation in userspace. > Problem: For that intermediate enviroment we already need a format > that combines kernel image, initrd, cmdline all signed in one package > aka UKI. Was it the chicken or the egg? > > But this shows that if we implemented UKIs the easy way (kernel simply > checks signature, extracts the pieces, and kexecs them like normal), > this approach could always be used to support kexec for other future > formats. They could use the kernels UKI support to boot into an > intermediate kernel with UEFI implemented in userspace in the initrd. > > So basically support UKIs the easy way and use them to be able to > securely zap away userspace and start with a fresh kernel and signed > userspace as a way to support other UEFI formats that are not UKI. Well, in theory that should work. But I see several problems: 1) How does the first kernel tell the intermediate kernel which file(s) with wich command line to load? In fact, how does the first kernel get the information itself? You would need a new system call that takes two kernel images, one for the intermediate and one for the kernel to load,for that. Of course you could also build the intermediate UKI during kernel build and include it into the image. Similar to what is done with the purgatory. But that would totally bloat the kernel image. 2) I expect that to be extremely painful to debug, if the intermediate kernel runs into a panic. For sure kdump won't work in that case... 3) Distros would need maintain and test the additional UKI. 4) This approach basically needs to boot twice. But there are people out there who fight to reduce boot times extremely hard. For them every millisecond counts. Telling them that they will need to wait twice as long will be very hard to sell. > > > > > Let me throw an other wild idea in the ring. Instead of implementing > > > a EFI runtime we could also include a eBPF version of the stub into the > > > images. kexec could then extract the eBPF program and let it run just > > > like any other eBPF program with all the pros (and cons) that come
Re: [RFA] makedumpfile: fix access to os_info for /proc/kcore
Hi Alex, On Fri, 6 Sep 2024 16:35:50 +0200 Alexander Gordeev wrote: > On Wed, Sep 04, 2024 at 06:12:59PM +0200, Philipp Rudo wrote: > > Hi Philipp, > > > Hi Alex, > > > > our QE found a problem when trying to run makedumpfile with /proc/kcore > > on s390. For example > > > > # makedumpfile --mem-usage /proc/kcore > > s390x_init_vm: Can't get s390x os_info ptr. > > > > The exact options passed to makedumpfile don't matter. The error is > > always the same. Trying the same on a dump file created from > > /proc/vmcore works fine. As the function in question was introduced > > with you commit 6f8325d ("[PATCH v2 2/2] s390x: uncouple virtual and > > physical address spaces") I'm reaching out to you. > > > > Looking at /proc/kcore with crash I noticed that > > abs_lowcore->os_info (aka. address 0xe18) is zero. Hence the check > > > > if (!readmem(PADDR, S390X_LC_OS_INFO, &addr, > > sizeof(addr)) || !addr) { > > ERRMSG("Can't get s390x os_info ptr.\n"); > > return FALSE; > > } > > > > at the beginning of s390x_init_vm fails. My theory is that when trying > > to access the absolute lowcore via /proc/kcore the read gets prefixed > > and thus ends up in the per-cpu lowcore. As the os_info field isn't set > > in the per-cpu lowcore the read returns 0, triggering the error. > > Yes, I think your analysis is correct. \o/ I haven't lost all my s390 skills, yet. > > I played around with crash trying to access the absolute lowcore via > > __abs_lowcore and lowcore_ptr but failed. I always ended up in the > > per-cpu lowcore. I also tried to get the address of os_info from the > > dwarf information but that only returnes a virtual address which cannot > > be used in the function that sets up vm... > > > > Any idea how this problem could be fixed? > > I will take a deeper look at it. Thanks! > > > Thanks > > Philipp > > Thanks for reporting! > > > P.S. While looking at the function I found one nit. Right after the > > check mentioned above there's an other check for > > > > if (addr == 0) > > return TRUE; > > > > which can never be true as the !addr from above already handles this > > case. > > It will be TRUE when readmem() succeeded and read out zero. > In fact, || !addr condition is redundant. Do you want to send a patch? Could you take over the patch? I'm not really sure, when addr == 0 is expected. You are much more qualified to describe that. Thanks Philipp expected. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFCv2 0/9] UEFI emulator for kexec
Hi Ard, Hi Jan, On Wed, 28 Aug 2024 19:08:14 +0200 Ard Biesheuvel wrote: [...] > Thanks for putting this RFC together. This is useful work, and gives > us food for thought and discussion. > > There are a few problems that become apparent when going through these > changes. > > 1. Implementing UEFI entirely is intractable, and unnecessary. > Implementing the subset of UEFI that is actually needed to boot Linux > *is* tractable, though, but we need to work together to write this > down somewhere. > - the EFI stub needs the boot services for the EFI memory map and > the allocation routines > - GRUB needs block I/O > - systemd-stub/UKI needs file I/O to look for sidecars > - etc etc > > I implemented a Rust 'efiloader' crate a while ago that encapsulates > most of this (it can boot Linux/arm64 on QEMU and boot x86 via GRUB in > user space **). Adding file I/O to this should be straight-forward - > as Lennart points out, we only need the protocol, it doesn't need to > be backed by an actual file system, it just needs to be able to expose > other files in the right way. > > 2. Running the UEFI emulator on bare metal is not going to scale. > Cloning UART driver code and MMU code etc is a can of worms that you > want to leave closed. And as Lennart points out, there is other > hardware (TPM) that needs to be accessible as well. Providing a > separate set of drivers for all hardware that the EFI emulator may > need to access is not a tractable problem either. > > The fix for this, as I see it, is to run the EFI emulator in user > space, to the point where the payload calls ExitBootServices(). This > will allow all I/O and memory protocol to be implemented trivially, > using C library routines. I have a crude prototype** of this running > to the point where ExitBootServices() is called (and then it crashes). > The tricky yet interesting bit here is how we migrate a chunk of user > space memory to the bare metal context that will be created by the > kexec syscall later (in which the call to ExitBootServices() would > return and proceed with the boot). But the principle is rather > straight-forward, and would permit us, e.g., to kexec an OS installer > too. I mostly agree on what you have wrote. But I see a big problem in running the EFI emulator in user space when it comes to secure boot. The chain of trust ends in the kernel. So it's the kernel that needs to verify that the image to be loaded can be trusted. But when the EFI runtime is in user space the kernel simply cannot do that. Which means, if we want to go this way, we would need to extend the chain of trust to user space. Which will be a whole bucket of worms, not just a can. That's why I tend more to Jan's suggestion to include the EFI runtime in the kernel. Alas, that comes with it's own problem, as that requires to run code in the kernel that was never intended to run in kernel context. So even when we can trust the code not to be malicious, we cannot trust it to not accidentally change the system state in a way the kernel doesn't expect... Let me throw an other wild idea in the ring. Instead of implementing a EFI runtime we could also include a eBPF version of the stub into the images. kexec could then extract the eBPF program and let it run just like any other eBPF program with all the pros (and cons) that come with it. That won't be as generic as the EFI runtime, e.g. you couldn't simply kexec any OS installer. On the other hand it would make it easier to port UKIs et al. to non-EFI systems. What do you think? Thanks Philipp > 3. We need to figure out how to support TPM and PCRs in the context of > kexec. This is a fundamental issue with verified boot, given that the > kexec PCR state is necessarily different from the boot state, and so > we cannot reuse the TPM directly if we want to pretend that we are > doing an ordinary boot in kexec. The alternative is to leave the TPM > in a state where the kexec kernel can access its sealed secrets, and > mock up the TCG2 EFI protocols using a shim that sits between the TPM > hardware (as the real TCG2 protocols will be long gone) and the EFI > payload. But as I said, this is a fundamental issue, as the ability to > pretend that a kexec boot is a pristine boot would mean that verified > boot is broken. > > > As future work, I'd like to propose to collaborate on some alignment > regarding a UEFI baseline for Linux, i.e., the parts that we actually > need to boot Linux. > > For this series in particular, I don't see a way forward where we > adopt this approach, and carry all this code inside the kernel. > > Thanks. > Ard. > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[RFA] makedumpfile: fix access to os_info for /proc/kcore
Hi Alex, our QE found a problem when trying to run makedumpfile with /proc/kcore on s390. For example # makedumpfile --mem-usage /proc/kcore s390x_init_vm: Can't get s390x os_info ptr. The exact options passed to makedumpfile don't matter. The error is always the same. Trying the same on a dump file created from /proc/vmcore works fine. As the function in question was introduced with you commit 6f8325d ("[PATCH v2 2/2] s390x: uncouple virtual and physical address spaces") I'm reaching out to you. Looking at /proc/kcore with crash I noticed that abs_lowcore->os_info (aka. address 0xe18) is zero. Hence the check if (!readmem(PADDR, S390X_LC_OS_INFO, &addr, sizeof(addr)) || !addr) { ERRMSG("Can't get s390x os_info ptr.\n"); return FALSE; } at the beginning of s390x_init_vm fails. My theory is that when trying to access the absolute lowcore via /proc/kcore the read gets prefixed and thus ends up in the per-cpu lowcore. As the os_info field isn't set in the per-cpu lowcore the read returns 0, triggering the error. I played around with crash trying to access the absolute lowcore via __abs_lowcore and lowcore_ptr but failed. I always ended up in the per-cpu lowcore. I also tried to get the address of os_info from the dwarf information but that only returnes a virtual address which cannot be used in the function that sets up vm... Any idea how this problem could be fixed? Thanks Philipp P.S. While looking at the function I found one nit. Right after the check mentioned above there's an other check for if (addr == 0) return TRUE; which can never be true as the !addr from above already handles this case. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers
Hi Alex, On Fri, 2 Feb 2024 13:58:52 +0100 Alexander Graf wrote: > Hi Philipp, > > On 29.01.24 17:34, Philipp Rudo wrote: > > Hi Alex, > > > > adding linux-integrity as there are some synergies with IMA_KEXEC (in case > > we > > get KHO to work). > > > > Fist of all I believe that having a generic framework to pass information > > from > > one kernel to the other across kexec would be a good thing. But I'm afraid > > that > > > Thanks, I'm happy to hear that you agree with the basic motivation :). > There are fundamentally 2 problems with passing data: > > * Passing structured data in a cross-architecture way > * Passing memory > > KHO tackles both. It proposes a common FDT based format that allows us > to pass per-subsystem properties. That way, a subsystem does not need to > know whether it's running on ARM, x86, RISC-V or s390x. It just gains > awareness for KHO and can pass data. > > On top of that, it proposes a standardized "mem" property (and some > magic around that) which allows subsystems to pass memory. > > > > you are ignoring some fundamental problems which makes it extremely hard, if > > not impossible, to reliably transfer the kernel's state from one kernel to > > the > > other. > > > > One thing I don't understand is how reusing the scratch area is working. > > Sure > > you pass it's location via the dt/boot_params but I don't see any code that > > makes it a CMA region. So IIUC the scratch area won't be available for the > > 2nd > > kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded > > and > > runs inside that area and I don't believe the CMA design ever considered > > that > > the kernel image could be included in a CMA area. > > > That one took me a lot to figure out sensibly (with recursion all the > way down) while building KHO :). I hope I detailed it sensibly in the > documentation - please let me know how to improve it in case it's > unclear: https://lore.kernel.org/lkml/20240117144704.602-8-g...@amazon.com/ > > Let me explain inline using different words as well what happens: > > The first (and only the first) kernel that boots allocates a CMA region > as "scratch region". It loads the new kernel into that region. It passes > that region as "scratch region" to the next kernel. The next kernel now > takes it and marks every page block that the scratch region spans as CMA: > > https://lore.kernel.org/lkml/20240117144704.602-3-g...@amazon.com/ > > The CMA hint doesn't mean we create an actual CMA region. It mostly > means that the kernel won't use this memory for any kernel allocations. > Kernel allocations up to this point are allocations we don't need to > pass on with KHO again. Kernel allocations past that point may be > allocations that we want to pass, so we just never place them into the > "scratch region" again. > > And because we now already have a scratch region from the previous > kernel, we keep reusing that forever with any new KHO kexec. Thanks for the explanation. I've missed the memblock_mark_scratch in kho_populate. The code makes much more sense now :-) Having that said, for complex series like this one I like to do the review on a branch in my local git as that to avoid problems like that (or at least make them less likely). But your patches didn't apply. Can you tell me what your base is or make your git branch available. That would be very helpful to me. Thanks! > > Staying at reusing the scratch area. One thing that is broken for sure is > > that > > you reuse the scratch area without ever checking the kho_scratch parameter > > of > > the 2nd kernel's command line. Remember, with kexec you are dealing with two > > different kernels with two different command lines. Meaning you can only > > reuse > > the scratch area if the requested size in the 2nd kernel is identical to the > > one of the 1st kernel. In all other cases you need to adjust the scratch > > area's > > size or reserve a new one. > > > Hm. So you're saying a user may want to change the size of the scratch > area with a KHO kexec. That's insanely risky because you (as rightfully > pointed out below) may have significant fragmentation at that point. And > we will only know when we're in the new kernel so it's too late to > abort. IMHO it's better to just declare the scratch region as immutable > during KHO to avoid that pitfall. Yes, a user can set any command line with kexec. My expectation
Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers
Hi Alex, adding linux-integrity as there are some synergies with IMA_KEXEC (in case we get KHO to work). Fist of all I believe that having a generic framework to pass information from one kernel to the other across kexec would be a good thing. But I'm afraid that you are ignoring some fundamental problems which makes it extremely hard, if not impossible, to reliably transfer the kernel's state from one kernel to the other. One thing I don't understand is how reusing the scratch area is working. Sure you pass it's location via the dt/boot_params but I don't see any code that makes it a CMA region. So IIUC the scratch area won't be available for the 2nd kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and runs inside that area and I don't believe the CMA design ever considered that the kernel image could be included in a CMA area. Staying at reusing the scratch area. One thing that is broken for sure is that you reuse the scratch area without ever checking the kho_scratch parameter of the 2nd kernel's command line. Remember, with kexec you are dealing with two different kernels with two different command lines. Meaning you can only reuse the scratch area if the requested size in the 2nd kernel is identical to the one of the 1st kernel. In all other cases you need to adjust the scratch area's size or reserve a new one. This directly leads to the next problem. In kho_reserve_previous_mem you are reusing the different memory regions wherever the 1st kernel allocated them. But that also means you are handing over the 1st kernel's memory fragmentation to the 2nd kernel and you do that extremely early during boot. Which means that users who need to allocate large continuous physical memory, like the scratch area or the crashkernel memory, will have increasing chance to not find a suitable area. Which IMHO is unacceptable. Finally, and that's the big elephant in the room, is your lax handling of the unstable kernel internal ABI. Remember, you are dealing with two different kernels, that also means two different source levels and two different configs. So only because both the 1st and 2nd kernel have a e.g. struct buffer_page doesn't means that they have the same struct buffer_page. But that's what your code implicitly assumes. For KHO ever to make it upstream you need to make sure that both kernels are "speaking the same language". Personally I see two possible solutions: 1) You introduce a stable intermediate format for every subsystem similar to what IMA_KEXEC does. This should work for simple types like struct buffer_page but for complex ones like struct vfio_device that's basically impossible. 2) You also hand over the ABI version for every given type (basically just a hash over all fields including all the dependencies). So the 2nd kernel can verify that the data handed over is in a format it can handle and if not bail out with a descriptive error message rather than reading garbage. Plus side is that once such a system is in place you can reuse it to automatically resolve all dependencies so you no longer need to manually store the buffer_page and its buffer_data_page separately. Down side is that traversing the debuginfo (including the ones from modules) is not a simple task and I expect that such a system will be way more complex than the rest of KHO. In addition there are some cases that the versioning won't be able to capture. For example if a type contains a "void *"-field. Then although the definition of the type is identical in both kernels the field can be cast to different types when used. An other problem will be function pointers which you first need to resolve in the 1st kernel and then map to the identical function in the 2nd kernel. This will become particularly "fun" when the function is part of a module that isn't loaded at the time when you try to recreate the kernel's state. So to summarize, while it would be nice to have a generic framework like KHO to pass data from one kernel to the other via kexec there are good reasons why it doesn't exist, yet. Thanks Philipp On Wed, 17 Jan 2024 14:46:47 + Alexander Graf wrote: > Kexec today considers itself purely a boot loader: When we enter the new > kernel, any state the previous kernel left behind is irrelevant and the > new kernel reinitializes the system. > > However, there are use cases where this mode of operation is not what we > actually want. In virtualization hosts for example, we want to use kexec > to update the host kernel while virtual machine memory stays untouched. > When we add device assignment to the mix, we also need to ensure that > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we > need to do the same for the PCI subsystem. If we want to kexec while an > SEV-SNP enabled virtual machine is running, we need to preserve the VM > context pages and physical memory. See James' and my Linux Plumbers > Conference 2023 presentation for details: > > https://lpc.eve
Re: [RFC 0/3] kdump: Check mem_map of CMA area in kdump
Hi Pingfan, On Mon, 18 Dec 2023 13:23:22 +0800 Pingfan Liu wrote: > From: Pingfan Liu > > > First of all, this series is only for proof of concept. It only passes > compilation. > > For years, CMA is proposed to be used as crashkernel reserved memory. > But DIO prevent us to follow it since DMA may be in-flight and ruin the > kdump kernel. > > This series exports the crash kernel's CMA area information through > device-tree, and kdump kernel skips any page, which refcnt!=mapcount and > has a potential DMA activity. > > The exported information include: > u64 kdump_cma_pfn; > u64 kdump_cma_pg_cnt; > u64 kdump_cma_pg_paddr; > > And they should be filled with Jiri's series "[PATCH 0/4] kdump: > crashkernel reservation from CMA" > > After the conjunction of two series, the CMA used for kdump has only the > following risk, where the following conditions: > -1.a wrong code forges _refcnt and mapcount to the same value > -2.the page is also used by DIO > > > Is it acceptable, or any rescue e.g. CRC on page? > > Please share your thoughts. I don't think your approach will work as intended. The problem is that we are dealing with two separate kernels and there is no guarantee that both kernels are identical. So you cannot rely on the definition of struct page in the crash kernel to be identical to the one in the panicked kernel. Meaning check_poison_page from the crash kernel cannot simply operate on the struct pages from the panicked kernel. To get this approach to work I see three possible "fixes" 1) enforce in kexec that only the currently running kernel can be loaded as crash kernel. 2) pass all required "debuginfo" to the crash kernel so it can parse the required data reliably from the dump. This also requires to have all the mm helper functions to be reimplemented to work in check_poison_page. 3) the required information is passed via a new data structure which is designed in a way that it can easily be passed in between different kernels. But this would require the mm subsystem to maintain the page states in the CMA in two separate data structures. Personally I don't think that any of the three "fixes" is desirable. Thanks Philipp > Thanks, > > Pingfan > > > Cc: Jiri Bohac > Cc: Michal Hocko > Cc: Philipp Rudo > Cc: Baoquan He > Cc: Dave Young > To: kexec@lists.infradead.org > --- > Pingfan Liu (3): > crash_dump: Parse the CMA's mem_map in kdump > of: kexec: Set up properties for reusing CMA in kdump > of: fdt: Parse properties of reusing CMA in kdump > > drivers/of/fdt.c | 43 +++ > drivers/of/kexec.c| 14 > include/linux/kexec.h | 5 +++ > init/main.c | 4 +++ > kernel/crash_dump.c | 80 +++ > 5 files changed, 146 insertions(+) > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
On Wed, 6 Dec 2023 16:19:51 +0100 Michal Hocko wrote: > On Wed 06-12-23 14:49:51, Michal Hocko wrote: > > On Wed 06-12-23 12:08:05, Philipp Rudo wrote: > [...] > > > If I understand Documentation/core-api/pin_user_pages.rst correctly you > > > missed case 1 Direct IO. In that case "short term" DMA is allowed for > > > pages without FOLL_LONGTERM. Meaning that there is a way you can > > > corrupt the CMA and with that the crash kernel after the production > > > kernel has panicked. > > > > Could you expand on this? How exactly direct IO request survives across > > into the kdump kernel? I do understand the RMDA case because the IO is > > async and out of control of the receiving end. > > OK, I guess I get what you mean. You are worried that there is > DIO request > program DMA controller to read into CMA memory > > boot into crash kernel backed by CMA > DMA transfer is done. > > DIO doesn't migrate the pinned memory because it is considered a very > quick operation which doesn't block the movability for too long. That is > why I have considered that a non-problem. RDMA on the other might pin > memory for transfer for much longer but that case is handled by > migrating the memory away. Right that is the scenario we need to prevent. > Now I agree that there is a chance of the corruption from DIO. The > question I am not entirely clear about right now is how big of a real > problem that is. DMA transfers should be a very swift operation. Would > it help to wait for a grace period before jumping into the kdump kernel? Please see my other mail. > > Also if direct IO is a problem how come this is not a problem for kexec > > in general. The new kernel usually shares all the memory with the 1st > > kernel. > > This is also more clear now. Pure kexec is shutting down all the devices > which should terminate the in-flight DMA transfers. Right, it _should_ terminate all transfers. But here we are back at the shitty device drivers that don't have a working shutdown method. That's why we have already seen the problem you describe above with kexec. And please believe me that debugging such a scenario is an absolute pain. Especially when it's a proprietary, out-of-tree driver that caused the mess. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
On Thu, 7 Dec 2023 09:55:20 +0100 Michal Hocko wrote: > On Thu 07-12-23 12:23:13, Baoquan He wrote: > [...] > > We can't guarantee how swift the DMA transfer could be in the cma, case, > > it will be a venture. > > We can't guarantee this of course but AFAIK the DMA shouldn't take > minutes, right? While not perfect, waiting for some time before jumping > into the crash kernel should be acceptable from user POV and it should > work around most of those potential lingering programmed DMA transfers. I don't think that simply waiting is acceptable. For one it doesn't guarantee that there is no corruption (please also see below) but only reduces its probability. Furthermore, how long would you wait? Thing is that users don't only want to reduce the memory usage but also the downtime of kdump. In the end I'm afraid that "simply waiting" will make things unnecessarily more complex without really solving any issue. > So I guess what we would like to hear from you as kdump maintainers is > this. Is it absolutely imperative that these issue must be proven > impossible or is a best effort approach something worth investing time > into? Because if the requirement is an absolute guarantee then I simply > do not see any feasible way to achieve the goal of reusable memory. > > Let me reiterate that the existing reservation mechanism is showing its > limits for production systems and I strongly believe this is something > that needs addressing because crash dumps are very often the only tool > to investigate complex issues. Because having a crash dump is so important I want a prove that no legal operation can corrupt the crashkernel memory. The easiest way to achieve this is by simply keeping the two memory regions fully separated like it is today. In theory it should also be possible to prevent any kind of page pinning in the shared crashkernel memory. But I don't know which side effect this has for mm. Such an idea needs to be discussed on the mm mailing list first. Finally, let me question whether the whole approach actually solves anything. For me the difficulty in determining the correct crashkernel memory is only a symptom. The real problem is that most developers don't expect their code to run outside their typical environment. Especially not in an memory constraint environment like kdump. But that problem won't be solved by throwing more memory at it as this additional memory will eventually run out as well. In the end we are back at the point where we are today but with more memory. Finally finally, one tip. Next time a customer complaints about how much memory the crashkernel "wastes" ask them how much one day of down time for one machine costs them and how much memory they could buy for that money. After that calculation I'm pretty sure that an additional 100M of crashkernel memory becomes much more tempting. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
On Fri, 1 Dec 2023 17:59:02 +0100 Michal Hocko wrote: > On Fri 01-12-23 16:51:13, Philipp Rudo wrote: > > On Fri, 1 Dec 2023 12:55:52 +0100 > > Michal Hocko wrote: > > > > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote: > > > [...] > > > > And yes, those are all what-if concerns but unfortunately that is all > > > > we have right now. > > > > > > Should theoretical concerns without an actual evidence (e.g. multiple > > > drivers known to be broken) become a roadblock for this otherwise useful > > > feature? > > > > Those concerns aren't just theoretical. They are experiences we have > > from a related feature that suffers exactly the same problem regularly > > which wouldn't exist if everybody would simply work "properly". > > What is the related feature? kexec > > And yes, even purely theoretical concerns can become a roadblock for a > > feature when the cost of those theoretical concerns exceed the benefit > > of the feature. The thing is that bugs will be reported against kexec. > > So _we_ need to figure out which of the shitty drivers caused the > > problem. That puts additional burden on _us_. What we are trying to > > evaluate at the moment is if the benefit outweighs the extra burden > > with the information we have at the moment. > > I do understand your concerns! But I am pretty sure you do realize that > it is really hard to argue theoreticals. Let me restate what I consider > facts. Hopefully we can agree on these points > - the CMA region can be used by user space memory which is a > great advantage because the memory is not wasted and our > experience has shown that users do care about this a lot. We > _know_ that pressure on making those reservations smaller > results in a less reliable crashdump and more resources spent > on tuning and testing (especially after major upgrades). A > larger reservation which is not completely wasted for the > normal runtime is addressing that concern. > - There is no other known mechanism to achieve the reusability > of the crash kernel memory to stop the wastage without much > more intrusive code/api impact (e.g. a separate zone or > dedicated interface to prevent any hazardous usage like RDMA). > - implementation wise the patch has a very small footprint. It > is using an existing infrastructure (CMA) and it adds a > minimal hooking into crashkernel configuration. > - The only identified risk so far is RDMA acting on this memory > without using proper pinning interface. If it helps to have a > statement from RDMA maintainers/developers then we can pull > them in for a further discussion of course. > - The feature requires an explicit opt-in so this doesn't bring > any new risk to existing crash kernel users until they decide > to use it. AFAIU there is no way to tell that the crash kernel > memory used to be CMA based in the primary kernel. If you > believe that having that information available for > debugability would help then I believe this shouldn't be hard > to add. I think it would even make sense to mark this feature > experimental to make it clear to users that this needs some > time before it can be marked production ready. > > I hope I haven't really missed anything important. The final If I understand Documentation/core-api/pin_user_pages.rst correctly you missed case 1 Direct IO. In that case "short term" DMA is allowed for pages without FOLL_LONGTERM. Meaning that there is a way you can corrupt the CMA and with that the crash kernel after the production kernel has panicked. With that I don't see a chance this series can be included unless someone can explain me that that the documentation is wrong or I understood it wrong. Having that said NAcked-by: Philipp Rudo > cost/benefit judgment is up to you, maintainers, of course but I would > like to remind that we are dealing with a _real_ problem that many > production systems are struggling with and that we don't really have any > other solution available. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
On Fri, 1 Dec 2023 12:55:52 +0100 Michal Hocko wrote: > On Fri 01-12-23 12:33:53, Philipp Rudo wrote: > [...] > > And yes, those are all what-if concerns but unfortunately that is all > > we have right now. > > Should theoretical concerns without an actual evidence (e.g. multiple > drivers known to be broken) become a roadblock for this otherwise useful > feature? Those concerns aren't just theoretical. They are experiences we have from a related feature that suffers exactly the same problem regularly which wouldn't exist if everybody would simply work "properly". And yes, even purely theoretical concerns can become a roadblock for a feature when the cost of those theoretical concerns exceed the benefit of the feature. The thing is that bugs will be reported against kexec. So _we_ need to figure out which of the shitty drivers caused the problem. That puts additional burden on _us_. What we are trying to evaluate at the moment is if the benefit outweighs the extra burden with the information we have at the moment. > > Only alternative would be to run extended tests in > > the field. Which means this user facing change needs to be included. > > Which also means that we are stuck with it as once a user facing change > > is in it's extremely hard to get rid of it again... > > I am not really sure I follow you here. Are you suggesting once > crashkernel=cma is added it would become a user api and therefore > impossible to get rid of? Yes, sort of. I wouldn't rank a command line parameter as user api. So we still can get rid of it. But there will be long discussions I'd like to avoid if possible. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
Hi Jiri, I'd really love to see something like this to work. Although I also share the concerns about shitty device drivers corrupting the CMA. Please see my other mail for that. Anyway, one more comment below. On Fri, 24 Nov 2023 20:54:36 +0100 Jiri Bohac wrote: [...] > Now, specifying > crashkernel=100M craskhernel=1G,cma > on the command line will make a standard crashkernel reservation > of 100M, where kexec will load the kernel and initrd. > > An additional 1G will be reserved from CMA, still usable by the > production system. The crash kernel will have 1.1G memory > available. The 100M can be reliably predicted based on the size > of the kernel and initrd. I doubt that the fixed part can be predicted "reliably". For sure it will be more reliable than today but IMHO we will still be stuck with some guessing. Otherwise it would mean that you already know during boot which initrd the user space will be loading later on. Which IMHO is impossible as the initrd can always be rebuild with a larger size. Furthermore, I'd be careful when you are dealing with compressed kernel images. As I'm not sure how the different decompressor phases would handle scenarios where the (fixed) crashkernel memory is large enough to hold the compressed kernel (+initrd) but not the decompressed one. One more thing, I'm not sure I like that you need to reserve two separate memory regions. Personally I would prefer it if you could reserve one large region for the crashkernel but allow parts of it to be reused via CMA. Otherwise I'm afraid there will be people who only have one ,cma entry on the command line and cannot figure out why they cannot load the crash kernel. Thanks Philipp > When no crashkernel=size,cma is specified, everything works as > before. > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
Hi Michal, On Thu, 30 Nov 2023 14:41:12 +0100 Michal Hocko wrote: > On Thu 30-11-23 20:31:44, Baoquan He wrote: > [...] > > > > which doesn't use the proper pinning API (which would migrate away from > > > > the CMA) then what is the worst case? We will get crash kernel corrupted > > > > potentially and fail to take a proper kernel crash, right? Is this > > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem > > > > We may fail to take a proper kernel crash, why isn't it a roadblock? > > It would be if the threat was practical. So far I only see very > theoretical what-if concerns. And I do not mean to downplay those at > all. As already explained proper CMA users shouldn't ever leak out any > writes across kernel reboot. You are right, "proper" CMA users don't do that. But "proper" drivers also provide a working shutdown() method. Experience shows that there are enough shitty drivers out there without working shutdown(). So I think it is naive to assume you are only dealing with "proper" CMA users. For me the question is, what is less painful? Hunting down shitty (potentially out of tree) drivers that cause a memory corruption? Or ... > > We > > have stable way with a little more memory, why would we take risk to > > take another way, just for saving memory? Usually only high end server > > needs the big memory for crashkernel and the big end server usually have > > huge system ram. The big memory will be a very small percentage relative > > to huge system RAM. > > Jiri will likely talk more specific about that but our experience tells > that proper crashkernel memory scaling has turned out a real > maintainability problem because existing setups tend to break with major > kernel version upgrades or non trivial changes. ... frequently test if the crashkernel memory is still appropriate? The big advantage of the latter I see is that an OOM situation has very easy to detect and debug. A memory corruption isn't. Especially when it was triggered by an other kernel. And yes, those are all what-if concerns but unfortunately that is all we have right now. Only alternative would be to run extended tests in the field. Which means this user facing change needs to be included. Which also means that we are stuck with it as once a user facing change is in it's extremely hard to get rid of it again... Thanks Philipp > > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It > > > > is the said theoretical driver that needs fixing anyway. > > > > Now, what we want to make clear is if it's a theoretical possibility, or > > very likely happen. We have met several on-flight DMA stomping into > > kexec kernel's initrd in the past two years because device driver didn't > > provide shutdown() methor properly. For kdump, once it happen, the pain > > is we don't know how to debug. For kexec reboot, customer allows to > > login their system to reproduce and figure out the stomping. For kdump, > > the system corruption rarely happend, and the stomping could rarely > > happen too. > > yes, this is understood. > > > The code change looks simple and the benefit is very attractive. I > > surely like it if finally people confirm there's no risk. As I said, we > > can't afford to take the risk if it possibly happen. But I don't object > > if other people would rather take risk, we can let it land in kernel. > > I think it is fair to be cautious and I wouldn't impose the new method > as a default. Only time can tell how safe this really is. It is hard to > protect agains theoretical issues though. Bugs should be fixed. > I believe this option would allow to configure kdump much easier and > less fragile. > > > My personal opinion, thanks for sharing your thought. > > Thanks for sharing. > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/2] Sign the Image which is zboot's payload
Hi Dave, On Fri, 22 Sep 2023 13:41:22 +0800 Dave Young wrote: > Hi Jan, > > On Fri, 22 Sept 2023 at 13:19, Jan Hendrik Farr wrote: > > > > Hi Pingfan! > > > > On 21 21:37:01, Pingfan Liu wrote: > > > From: Pingfan Liu > > > > > > > > For security boot, the vmlinuz.efi will be signed so UEFI boot loader > > > can check against it. But at present, there is no signature for kexec > > > file load, this series makes a signature on the zboot's payload -- Image > > > before it is compressed. As a result, the kexec-tools parses and > > > decompresses the Image.gz to get the Image, which has signature and can > > > be checked against during kexec file load > > > > I missed some of the earlier discussion about this zboot kexec support. > > So just let me know if I'm missing something here. You were exploring > > these two options in getting this supported: > > > > 1. Making kexec_file_load do all the work. > > > > This option makes the signature verification easy. kexec_file_load > > checks the signature on the pe file and then extracts it and does the > > kexec. > > > > This is similar to how I'm approaching UKI support in [1]. > > > > 2. Extract in userspace and pass decompressed kernel to kexec_file_load > > > > This options requires the decompressed kernel to have a valid signature on > > it. That's why this patch adds the ability to add that signature to the > > kernel contained inside the zboot image. > > > > This option would not make sense for UKI support as it would not > > validate the signature with respect to the initrd and cmdline that it > > contains. > > Another possibility for the cmdline could be using the bootconfig > facility which was > introduced for boot time tracking: > Documentation/admin-guide/bootconfig.rst > > So the initrd+cmdline can be signed as well. Has this been discussed > before for UKI? Not that I know of. But I'm not sure if the bootconfig the way it works today does the trick. For one the bootconfig is simply glued to the end of the initrd. But that makes it part of the UKI as well. So there is no added gain. Plus, adding the cmdline to the UKI was done on purpose to prevent any unauthorized editing. That basically means that any change to the cmdline needs to be signed as well. But I don't see any signature verification while processing the bootconfig. Finally the bootconfig is setup too late in the boot process, in particular after setup_arch which reserves the crashkernel memory and needs to parse the kernel command line for that. An even more extreme example is the decompressor phase on s390. There the command line is parsed as well. And that is code that runs before start_kernel. All in all I don't believe that using the bootconfig adds much benefit for the UKI. Thanks Philipp
Re: [PATCH v2 0/2] x86/kexec: UKI Support
Hi Jan, On Thu, 21 Sep 2023 00:02:25 +0200 Jan Hendrik Farr wrote: [...] > > Maybe we should do a BoF at LPC to discuss this further? > > I definetly won't be at LPC, is it possible to join virtually? Yes, LPC will be hybrid again this year. Virtual access costs $50 although you can apply for an 50% discount when you are a non-professional. https://lpc.events/event/17/page/212-attend Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 0/2] x86/kexec: UKI Support
Hi Jan, On Thu, 14 Sep 2023 23:04:32 +0200 "Jan Hendrik Farr" wrote: > On Thu, Sep 14, 2023, at 8:51 PM, Philipp Rudo wrote: > > [...] > > > > In this context I hope it is also clear to you that when more and more > > people rely on the spec you need a more formal process when including > > changes. Especially when the change might break the implementation of > > others. So no more making the .cmdline optional and allowing it to be > > overwritten all on the same day. > > > > Having that said, what does "local override" exactly mean? Does that > > mean a distro can allow a user to freely choose the cmdline without > > checking any signatures? > > The behavior of systemd-stub is to allow the bootloader (or whatever > called sd-stub) supplied cmdline when there is no .cmdline section in > the UKI. That's how I understand "local override" here. For WIP v3 of > this patch the behavior is to use the cmdline supplied by userspace to > the kexec_file_load syscall if no .cmdline section is in the UKI. > > empty .cmdline section -> empty cmdline always passed to kernel > .cmdline section -> use bootloader/user supplied cmdline (which would > be empty by default) > > This setup does not make sense for a locked down / secure system though. > > Maybe the word "override" is not ideal. There is nothing actually being > overridden as there is no cmdline in the UKI in the first place. > > sd-stub also allows the bootloader supplied cmdline if not using secure > boot. So maybe the kernel could allow user supplied cmdline if not in > lockdown mode for kexec maybe? If not in lockdown mode somebody can just > kexec an unsigned kernel + unsigned cmdline using the kexec_load syscall > anyways. For this case the word "override" makes sense. > > The logic for all of this in sd-stub is in [1]. > > > I.e. does that mean we can get rid of this > > https://github.com/systemd/systemd/issues/24539 > > This is a different usecase IMO. Yeah, I expected that. The whole question was meant to be rhetorical. The point I wanted to make was that when a spec uses terms like "local override" it needs to explain what it means. Thanks Philipp > >> Hence, seeing the spec as set in stone and as inherently low quality > >> is the wrong way to see it I am sure. Instead, the goal here is to > >> adjust the spec to make it work really nicely for *both* systemd and > >> the kernel. > > > > Sorry, I never wanted to intend that the spec inherently low quality. > > Just that it doesn't meat my expectations, yet. But that is fine. The > > spec isn't even a year old and there's only a single implementation, > > yet. So it's more documentation rather than a spec. > > Let's make it happen. > > > [1] > https://github.com/systemd/systemd/blob/5898cef22a35ceefa068d5f46929eced2baab0ed/src/boot/efi/stub.c#L140 > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 0/2] x86/kexec: UKI Support
Hi Jan, On Wed, 13 Sep 2023 16:42:33 +0200 "Jan Hendrik Farr" wrote: > On Wed, Sep 13, 2023, at 4:00 PM, Philipp Rudo wrote: [...] > In [5] Luca writes: > > [...] we fully intend for the UKI format to be an open and stable > > specification, that anybody can support and rely on. > But that is unfortunately not where the format is at this point. > > What is annoying though is where this leaves a user that actually > wants this feature. They can carry a patch or they might have to wait > a long time. > > Can you indicate what it would take for the kernel community to consider > this spec as stable enough? I don't think there is a good answer to that question. In fact I believe if you ask 10 people from the community you will get 20+ different answers. My guess is that either (1) the spec is moved to some official standard committee where people spend decades to polish it before it makes it into the kernel or (2) there's a big flamewar on LKML until Linus had enough and passes his judgment on it. So definitely (2) ;-) Thanks Philipp > > > > In the end the only benefit this series brings is to extend the > > signature checking on the whole UKI except of just the kernel image. > > Everything else can also be done in user space. Compared to the > > problems described above this is a very small gain for me. > > Correct. That is the benefit of pulling the UKI apart in the > kernel. However having to sign the kernel inside the UKI defeats > the whole point. > > > [1] https://uapi-group.org/specifications/specs/unified_kernel_image/ > [2] https://github.com/uapi-group/specifications/pull/72 > [3] https://github.com/uapi-group/specifications/pull/73 > [4] https://github.com/uapi-group/specifications/issues/74 > [5] https://github.com/systemd/systemd/issues/28538 > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 0/2] x86/kexec: UKI Support
Hi Lennart, On Thu, 14 Sep 2023 11:32:20 +0200 Lennart Poettering wrote: > On Mi, 13.09.23 16:00, Philipp Rudo (pr...@redhat.com) wrote: > > > For example there are two definitions for the UKI which contradict each > > other. > > The dedicated one [1] you have cited earlier and the one in the BLS for > > type #2 > > entries [2]. In [1] the .linux and .initrd sections are mandatory and the > > .osrel and .cmdline sections are optional while in [2] it is the other way > > round. Which definition should the kernel follow? > > > > Furthermore, I absolutely don't understand how the spec should be read. All > > the spec does is defining some file formats. There is no word about which > > component in the boot chain is supposed to handle them and what exactly this > > component is supposed to do with it. But that is crucial if we want to add > > UKI > > support for kexec as the kexec systemcall will replace the stub. So we need > > to > > know what tasks the stub is supposed to perform. Currently this is only some > > implementation detail of the systemd-stub [3] that can change any moment > > and I > > strongly oppose to base any uapi on it. > > > > In the end the only benefit this series brings is to extend the signature > > checking on the whole UKI except of just the kernel image. Everything else > > can > > also be done in user space. Compared to the problems described above this > > is a > > very small gain for me. > > > > Until the spec got fixed I don't see a chance to add UKI support for kexec. > > > > So that spec is initially just a generalization of what > systemd-stub/systemd-boot/ukify does. The descrepancies between the > cited specs mostly come from the that generalization. If you want to > enumerate kernels and order them the ".osrel" stuff for example is > necessary, hence the boot loader spec really wants it. If you don't > care about the boot loader spec though and just want to register the > kernel UKI PE directly in BootXXX efi vars for example, then there's > no need to include .osrel. That all said we should certainly make the > two specs align better, and clarify the situation. Suggestions/patches > more than welcome. Thanks for the explanation. I know that writing a spec isn't easy and no matter how hard you try it will never be "perfect". That's why I think it's important to have a proper Introduction at the beginning of the spec that gives a context to the reader on the problem the spec is trying to solve, the different use cases/environments considered while writing the spec, the different components involved and how they interact with each other, which limitations there are, etc.. The BLS sort of has it (if you also consider the "Additional discussion") but for the UKI the introduction basically boils down to "it's a file that contains stuff", which isn't very helpful. Having such a context not only makes it easier to understand where such contradictions come from but also help to prevent problems where people from different backgrounds interpret the spec differently because they look at it from their context. An other thing I don't understand is, why the Extension Images (I assume describe the "Companion Files" in the systmd-stub man pages) are a separate spec. With the initrd and cmdline being part of the UKI and thus fixed you take away a lot of flexibility users have today. These extensions bring back part of the flexibility which IMHO is needed by general purpose distros as for them a simple one-size-fits-all solution doesn't work. That's why for me both belong together and thus should be described in the same spec. The extensions are also a good example why you need to properly define the different components and their responsibility. In a secureboot environment these extensions need to be signed and verified during boot. But wich component is responsible to check the signature? Is it the stub? The kernel? or even the initrd? If you don't define that in the spec you will eventually end up in situations where nobody checks the signature because everybody is sure it's "someone else's job". > Ultimately, I think a spec written as description with a single > implementation in mind (i.e. systemd) is a generally a bad spec. Hence > if kexec in the Linux kernel wants to add support for it, that'd be > great but I'd see that as an opportunity to adjust the spec to the > needs of the Linux kernel in this area, so that it reflects well more > than just one backend implementation. Fully agree. I must admit my first mail sounds pretty negative. But I don't oppose the UKI. All I wanted to say tha
Re: [PATCH v2 0/2] x86/kexec: UKI Support
Hi Jan, All in all the code looks fine to me. Nevertheless I don't think UKI support should be added at the moment. This is because IMHO you basically reinterpret the kexec_file systemcall and thus add a new uapi to the kernel. Once introduced it is extremely hard to remove or change an uapi again. The problem I see is that the spec you based your work on is in such a poor shape that I don't feel comfortable to add any new uapi based on it. For example there are two definitions for the UKI which contradict each other. The dedicated one [1] you have cited earlier and the one in the BLS for type #2 entries [2]. In [1] the .linux and .initrd sections are mandatory and the .osrel and .cmdline sections are optional while in [2] it is the other way round. Which definition should the kernel follow? Furthermore, I absolutely don't understand how the spec should be read. All the spec does is defining some file formats. There is no word about which component in the boot chain is supposed to handle them and what exactly this component is supposed to do with it. But that is crucial if we want to add UKI support for kexec as the kexec systemcall will replace the stub. So we need to know what tasks the stub is supposed to perform. Currently this is only some implementation detail of the systemd-stub [3] that can change any moment and I strongly oppose to base any uapi on it. In the end the only benefit this series brings is to extend the signature checking on the whole UKI except of just the kernel image. Everything else can also be done in user space. Compared to the problems described above this is a very small gain for me. Until the spec got fixed I don't see a chance to add UKI support for kexec. Thanks Philipp [1] https://uapi-group.org/specifications/specs/unified_kernel_image/ [2] https://uapi-group.org/specifications/specs/boot_loader_specification/#type-2-efi-unified-kernel-images [3] https://www.freedesktop.org/software/systemd/man/systemd-stub.html On Mon, 11 Sep 2023 07:25:33 +0200 Jan Hendrik Farr wrote: > Hello, > > this patch (v2) implements UKI support for kexec_file_load. It will require > support in the kexec-tools userspace utility. For testing purposes the > following can be used: https://github.com/Cydox/kexec-test/ > > Creating UKIs for testing can be done with ukify (included in systemd), > sbctl, and mkinitcpio, etc. > > There has been discussion on this topic in an issue on GitHub that is linked > below for reference. > > Changes for v2: > - .cmdline section is now optional > - moving pefile_parse_binary is now in a separate commit for clarity > - parse_pefile.c is now in /lib instead of arch/x86/kernel (not sure if > this is the best location, but it definetly shouldn't have been in an > architecture specific location) > - parse_pefile.h is now in include/kernel instead of architecture > specific location > - if initrd or cmdline is manually supplied EPERM is returned instead of > being silently ignored > - formatting tweaks > > > Some links: > - Related discussion: https://github.com/systemd/systemd/issues/28538 > - Documentation of UKIs: > https://uapi-group.org/specifications/specs/unified_kernel_image/ > > Jan Hendrik Farr (2): > move pefile_parse_binary to its own file > x86/kexec: UKI support > > arch/x86/include/asm/kexec-uki.h | 7 ++ > arch/x86/kernel/Makefile | 1 + > arch/x86/kernel/kexec-uki.c| 126 + > arch/x86/kernel/machine_kexec_64.c | 2 + > crypto/asymmetric_keys/mscode_parser.c | 2 +- > crypto/asymmetric_keys/verify_pefile.c | 110 +++-- > crypto/asymmetric_keys/verify_pefile.h | 16 > include/linux/parse_pefile.h | 32 +++ > lib/Makefile | 3 + > lib/parse_pefile.c | 109 + > 10 files changed, 292 insertions(+), 116 deletions(-) > create mode 100644 arch/x86/include/asm/kexec-uki.h > create mode 100644 arch/x86/kernel/kexec-uki.c > create mode 100644 include/linux/parse_pefile.h > create mode 100644 lib/parse_pefile.c > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC PATCH 0/4] kdump: add generic functions to simplify crashkernel crashkernel in architecture
Hi Baoquan, Hi Dave, On Sat, 8 Jul 2023 10:15:53 +0800 Dave Young wrote: > On 06/19/23 at 01:59pm, Baoquan He wrote: > > In the current arm64, crashkernel=,high support has been finished after > > several rounds of posting and careful reviewing. The code in arm64 which > > parses crashkernel kernel parameters firstly, then reserve memory can be > > a good example for other ARCH to refer to. > > > > Whereas in x86_64, the code mixing crashkernel parameter parsing and > > memory reserving is twisted, and looks messy. Refactoring the code to > > make it more readable maintainable is necessary. > > > > Here, try to abstract the crashkernel parameter parsing code into a > > generic function parse_crashkernel_generic(), and the crashkernel memory > > reserving code into a generic function reserve_crashkernel_generic(). > > Then, in ARCH which crashkernel=,high support is needed, a simple > > arch_reserve_crashkernel() can be added to call above two generic > > functions. This can remove the duplicated implmentation code in each > > ARCH, like arm64, x86_64. > > Hi Baoquan, the parse_crashkernel_common and parse_crashkernel_generic > are confusion to me. Thanks for the effort though. I agree, having both parse_crashkernel_common and parse_crashkernel_generic is pretty confusing. > I'm not sure if it will be easy or not, but ideally I think the parse > function can be arch independent, something like a general funtion > parse_crashkernel() which can return the whole necessary infomation of > crashkenrel for arch code to use, for example return like > below pseudo stucture(just a concept, may need to think more): > > structure crashkernel_range { > size, > range, > struct list_head list; > } > > structure crashkernel{ > structure crashkernel_range *range_list; > union { > offset, > low_high > } > } > > So the arch code can just get the data of crashkernel and then check > about the details, if it does not support low and high reservation then > it can just ignore the option. > > Thoughts? Sounds reasonable. The only thing I would make different is for the parser to take the systems ram into account and simply return the size that needs to be allocated in case multiple ranges are given. I've played around a little on how this might look like (probably wasting way too much time on it) and came up with the patch below. With that patch parse_crashkernel_{common,high,low} and friends could be removed once all architectures are ported to the new parse_crashkernel function. Please note that I never tested the patch. So it probably doesn't even compile. Nevertheless I hope it helps to get an idea on what I think should work :-) Thanks Philipp diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index fb904cc57ab1..fd5d01872c53 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -66,22 +66,12 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit; static void __init arch_reserve_crashkernel(void) { - unsigned long long low_size = 0; - unsigned long long crash_base, crash_size; char *cmdline = boot_command_line; - bool high = false; - int ret; if (!IS_ENABLED(CONFIG_KEXEC_CORE)) return; - ret = parse_crashkernel_generic(cmdline, &crash_size, &crash_base, - &low_size, &high); - if (ret) - return; - - reserve_crashkernel_generic(cmdline, crash_size, crash_base, - low_size, high); + reserve_crashkernel_generic(cmdline); } /* diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index b26221022283..4a78bf8ad494 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -486,28 +486,17 @@ unsigned long crash_low_size_default(void) static void __init arch_reserve_crashkernel(void) { - unsigned long long crash_base, crash_size, low_size = 0; char *cmdline = boot_command_line; - bool high = false; - int ret; if (!IS_ENABLED(CONFIG_KEXEC_CORE)) return; - ret = parse_crashkernel_generic(cmdline, &crash_size, &crash_base, - &low_size, &high); - if (ret) - return; - if (xen_pv_domain()) { pr_info("Ignoring crashkernel for a Xen PV domain\n"); return; } - reserve_crashkernel_generic(cmdline, crash_size, crash_base, - low_size, high); - - return; + reserve_crashkernel_generic(cmdline); } static struct resource standard_io_resources[] = { diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index 1b12868cad1b..a1ebd6c47467 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -84,35 +84,25 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsi
Re: [PATCH v5 1/2] kexec: Support purgatories with .text.hot sections
Hi Ricardo, On Thu, 30 Mar 2023 11:44:47 +0200 Ricardo Ribalda wrote: > Clang16 links the purgatory text in two sections: > > [ 1] .text PROGBITS 0040 >11a1 AX 0 0 16 > [ 2] .rela.textRELA 3498 >0648 0018 I 24 1 8 > ... > [17] .text.hot.PROGBITS 3220 >020b AX 0 0 1 > [18] .rela.text.hot. RELA 4428 >0078 0018 I 2417 8 > > And both of them have their range [sh_addr ... sh_addr+sh_size] on the > area pointed by `e_entry`. > > This causes that image->start is calculated twice, once for .text and > another time for .text.hot. The second calculation leaves image->start > in a random location. > > Because of this, the system crashes immediately after: > > kexec_core: Starting new kernel > > Cc: sta...@vger.kernel.org > Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") > Reviewed-by: Ross Zwisler > Signed-off-by: Ricardo Ribalda > --- > kernel/kexec_file.c | 14 +- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index f1a0e4e3fb5c..c7a0e51a6d87 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -901,10 +901,22 @@ static int kexec_purgatory_setup_sechdrs(struct > purgatory_info *pi, > } > > offset = ALIGN(offset, align); > + > + /* > + * Check if the segment contains the entry point, if so, > + * calculate the value of image->start based on it. > + * If the compiler has produced more than one .text section > + * (Eg: .text.hot), they are generally after the main .text > + * section, and they shall not be used to calculate > + * image->start. So do not re-calculate image->start if it > + * is not set to the initial value, and warn the user so they > + * have a chance to fix their purgatory's linker script. > + */ > if (sechdrs[i].sh_flags & SHF_EXECINSTR && > pi->ehdr->e_entry >= sechdrs[i].sh_addr && > pi->ehdr->e_entry < (sechdrs[i].sh_addr > - + sechdrs[i].sh_size)) { > + + sechdrs[i].sh_size) && > + !WARN_ON(kbuf->image->start != pi->ehdr->e_entry)) { Looks good to me. I'm not sure if it is better to use WARN_ON_ONCE to avoid spamming the log when there are more than two .text sections or when you reload the kernel. But that's only a rare corner case. So no strong opinion from my side. Either way Reviewed-by: Philipp Rudo > kbuf->image->start -= sechdrs[i].sh_addr; > kbuf->image->start += kbuf->mem + offset; > } > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v3] kexec: Support purgatories with .text.hot sections
Hi Ricardo, sorry for the late reply... On Mon, 27 Mar 2023 13:52:08 +0200 Ricardo Ribalda wrote: [...] > > I tried removing the -r from arch/x86/purgatory/Makefile and that resulted > into: > > [ 115.631578] BUG: unable to handle page fault for address: 93224d5c8e20 > [ 115.631583] #PF: supervisor write access in kernel mode > [ 115.631585] #PF: error_code(0x0002) - not-present page > [ 115.631586] PGD 10067 P4D 10067 PUD 1001ed067 PMD 132b58067 PTE 0 > [ 115.631589] Oops: 0002 [#1] PREEMPT SMP NOPTI > [ 115.631592] CPU: 0 PID: 5291 Comm: kexec-lite Tainted: G U > 5.15.103-17399-g852a928df601-dirty #19 > cd159e0d6a91f03e06035a0a8eb7fc984a8f3e82 > [ 115.631594] Hardware name: Google Crota/Crota, BIOS > Google_Crota.14505.288.0 11/08/2022 > [ 115.631595] RIP: 0010:memcpy_erms+0x6/0x10 > [ 115.631599] Code: 5d 00 eb bd eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 > c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 cc cc cc cc 66 90 48 89 f8 > 48 89 d1 a4 c3 cc cc cc cc 0f 1f 00 48 89 f8 48 83 fa 20 72 7e 40 > 38 fe > [ 115.631601] RSP: 0018:93224f65fe50 EFLAGS: 00010246 > [ 115.631602] RAX: 93224d5c8e20 RBX: ffea RCX: > 0100 > [ 115.631603] RDX: 0100 RSI: 9322407bd000 RDI: > 93224d5c8e20 > [ 115.631604] RBP: 93224f65fe88 R08: R09: > 92133cd3ef08 > [ 115.631605] R10: 9322407be000 R11: a1b4f2e0 R12: > > [ 115.631606] R13: 92133cee4c00 R14: 0100 R15: > a2b6f14f > [ 115.631607] FS: 78e8b9dbf7c0() GS:92143780() > knlGS: > [ 115.631609] CS: 0010 DS: ES: CR0: 80050033 > [ 115.631610] CR2: 93224d5c8e20 CR3: 00015be26001 CR4: > 00770ef0 > [ 115.631611] PKRU: 5554 > [ 115.631612] Call Trace: > [ 115.631614] > [ 115.631615] kexec_purgatory_get_set_symbol+0x82/0xd3 > [ 115.631619] __se_sys_kexec_file_load+0x523/0x644 > [ 115.631621] do_syscall_64+0x58/0xa5 > [ 115.631623] entry_SYSCALL_64_after_hwframe+0x61/0xcb Yeah, simply dropping -r doesn't work. You at least need to add -fPIE to the CFLAGS. But probably you need more. When you go down this route you really need to pay attention to some nasty details... > And I did not continue in that direction. That's totally fine. Thanks Philipp > I also tried finding a flag for llvm that would avoid splitting .text, > but was not lucky either. > > I will look into making a linker script for x86, we could combine it > with something like: > > if (sechdrs[i].sh_flags & SHF_EXECINSTR && > pi->ehdr->e_entry >= sechdrs[i].sh_addr && > pi->ehdr->e_entry < (sechdrs[i].sh_addr > -+ sechdrs[i].sh_size) && > - kbuf->image->start == pi->ehdr->e_entry) { > - kbuf->image->start -= sechdrs[i].sh_addr; > - kbuf->image->start += kbuf->mem + offset; > ++ sechdrs[i].sh_size)) { > + if (!WARN_ON(kbuf->image->start != > pi->ehdr->e_entry)) { > + kbuf->image->start -= sechdrs[i].sh_addr; > + kbuf->image->start += kbuf->mem + offset; > + } > } > > So developers have some hints of what to look at. > > Thanks! > > > > > > Thanks > > Philipp > > > > > kbuf->image->start -= sechdrs[i].sh_addr; > > > kbuf->image->start += kbuf->mem + offset; > > > } > > > > > > --- > > > base-commit: 17214b70a159c6547df9ae204a6275d983146f6b > > > change-id: 20230321-kexec_clang16-4510c23d129c > > > > > > Best regards, > > > > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v3] kexec: Support purgatories with .text.hot sections
Hi Ricardo, On Wed, 22 Mar 2023 20:09:21 +0100 Ricardo Ribalda wrote: > Clang16 links the purgatory text in two sections: > > [ 1] .text PROGBITS 0040 >11a1 AX 0 0 16 > [ 2] .rela.textRELA 3498 >0648 0018 I 24 1 8 > ... > [17] .text.hot.PROGBITS 3220 >020b AX 0 0 1 > [18] .rela.text.hot. RELA 4428 >0078 0018 I 2417 8 > > And both of them have their range [sh_addr ... sh_addr+sh_size] on the > area pointed by `e_entry`. > > This causes that image->start is calculated twice, once for .text and > another time for .text.hot. The second calculation leaves image->start > in a random location. > > Because of this, the system crashes inmediatly after: > > kexec_core: Starting new kernel Great analysis! > Signed-off-by: Ricardo Ribalda > --- > kexec: Fix kexec_file_load for llvm16 > > When upreving llvm I realised that kexec stopped working on my test > platform. This patch fixes it. > > To: Eric Biederman > Cc: Baoquan He > Cc: Philipp Rudo > Cc: kexec@lists.infradead.org > Cc: linux-ker...@vger.kernel.org > --- > Changes in v3: > - Fix initial value. Thanks Ross! > - Link to v2: > https://lore.kernel.org/r/20230321-kexec_clang16-v2-0-d10e5d517...@chromium.org > > Changes in v2: > - Fix if condition. Thanks Steven!. > - Update Philipp email. Thanks Baoquan. > - Link to v1: > https://lore.kernel.org/r/20230321-kexec_clang16-v1-0-a768fc2c7...@chromium.org > --- > kernel/kexec_file.c | 13 - > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index f1a0e4e3fb5c..25a37d8f113a 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -901,10 +901,21 @@ static int kexec_purgatory_setup_sechdrs(struct > purgatory_info *pi, > } > > offset = ALIGN(offset, align); > + > + /* > + * Check if the segment contains the entry point, if so, > + * calculate the value of image->start based on it. > + * If the compiler has produced more than one .text sections > + * (Eg: .text.hot), they are generally after the main .text > + * section, and they shall not be used to calculate > + * image->start. So do not re-calculate image->start if it > + * is not set to the initial value. > + */ > if (sechdrs[i].sh_flags & SHF_EXECINSTR && > pi->ehdr->e_entry >= sechdrs[i].sh_addr && > pi->ehdr->e_entry < (sechdrs[i].sh_addr > - + sechdrs[i].sh_size)) { > + + sechdrs[i].sh_size) && > + kbuf->image->start == pi->ehdr->e_entry) { I'm not entirely sure if this is the solution to go with. As you state in the comment above this solution assumes that the .text section comes before any other .text.* section. But this assumption isn't much stronger than the assumption that there is only a single .text section, which is used nowadays. The best solution I can come up with right now is to introduce a linker script for the purgatory that simply merges the .text sections into one. Similar to what I did for s390 in arch/s390/purgatory/purgatory.lds.S (although for a different reason). But that would require every architecture to get one. An alternative would be to find a way to get rid of the -r option on the LD_FLAGS, which IIRC is the reason why both section overlap in the first place. Thanks Philipp > kbuf->image->start -= sechdrs[i].sh_addr; > kbuf->image->start += kbuf->mem + offset; > } > > --- > base-commit: 17214b70a159c6547df9ae204a6275d983146f6b > change-id: 20230321-kexec_clang16-4510c23d129c > > Best regards, ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/2] kexec: Default __NR_kexec_file_load is set to undefined
Hi Baoquan, On Tue, 28 Feb 2023 21:32:26 +0800 Baoquan He wrote: > Hi Philipp, > > On 02/27/23 at 04:19pm, Philipp Rudo wrote: > .. > > > diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h > > > index be6ccd5..ea77936 100644 > > > --- a/kexec/kexec-syscall.h > > > +++ b/kexec/kexec-syscall.h > > > @@ -59,9 +59,7 @@ > > > #endif > > > #endif /*ifndef __NR_kexec_load*/ > > > > > > -#ifdef __arm__ > > > #undef __NR_kexec_file_load > > > -#endif > > > > > > #ifndef __NR_kexec_file_load > > > > I don't think this will work as intended. > > > > On the top of the file sys/syscall.h gets included. In there > > architectures that support kexec_file_load define __NR_kexec_file_load. > > This also means that if an architecture doesn't support kexec_file_load > > __NR_kexec_file_load shouldn't be defined in the first place. Thus I > > suggest that you find out why sys/syscall.h defines > > __NR_kexec_file_load for LoongArch even when the system call is not > > supported and fix it there. > > Checking whether LoongArch has defined __NR_kexec_file_load sounds a > good suggestion. Wondering why we still need add __NR_kexec_file_load > definition in kexec-tools. E.g below s390 kexec_file support you added. > sometime won't be found or __NR_kexec_file_load could be > not defined yet? > > commit d4a948c268272cf37c71be820fb02bf40e56292b > Author: Philipp Rudo > Date: Wed May 16 14:27:18 2018 +0200 > > kexec/s390: Add support for kexec_file_load > To be honest I don't remember why I have added it back then. Most likely I simply copied what others have done before. The benefit in having the extra definition I see is that you don't need to rebuild glibc when you implement the syscall and don't use the generic unistd.h. But that is something only few people should ever encounter. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/2] kexec: Default __NR_kexec_file_load is set to undefined
Hi Youling, On Tue, 28 Feb 2023 10:02:16 +0800 Youling Tang wrote: > Hi, Philipp > > On 02/27/2023 11:19 PM, Philipp Rudo wrote: > > Hi Youling, > > > > On Fri, 24 Feb 2023 17:51:07 +0800 > > Youling Tang wrote: > > > >> The initial reason is that after the merger of 29fe5067ed07 > >> ("kexec: make -a the default"), kexec cannot be used on LoongArch, > >> MIPS .etc architectures. We need to add "-c" for normal use. The > >> current kexec_file_load system call is not implemented in > >> architectures such as LoongArch, so it needs to pass kexec_load. > >> So we need to set __NR_kexec_file_load to undefined in unsupported > >> architectures. This will return EFALLBACK via > >> is_kexec_file_load_implemented, > >> and then via kexec_load. > >> > >> Signed-off-by: Youling Tang > >> --- > >> kexec/kexec-syscall.h | 2 -- > >> 1 file changed, 2 deletions(-) > >> > >> diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h > >> index be6ccd5..ea77936 100644 > >> --- a/kexec/kexec-syscall.h > >> +++ b/kexec/kexec-syscall.h > >> @@ -59,9 +59,7 @@ > >> #endif > >> #endif /*ifndef __NR_kexec_load*/ > >> > >> -#ifdef __arm__ > >> #undef __NR_kexec_file_load > >> -#endif > >> > >> #ifndef __NR_kexec_file_load > > > > I don't think this will work as intended. > > Works fine in LoongArch after applying this patch. I believe you. The problem is that it changes the behavior on other architectures. > > > > On the top of the file sys/syscall.h gets included. In there > > architectures that support kexec_file_load define __NR_kexec_file_load. > > This also means that if an architecture doesn't support kexec_file_load > > __NR_kexec_file_load shouldn't be defined in the first place. Thus I > > suggest that you find out why sys/syscall.h defines > > __NR_kexec_file_load for LoongArch even when the system call is not > > supported and fix it there. > Yes, in the kernel, LoongArch uses the generic syscall number (in > include/uapi/asm-generic/unistd.h), so __NR_kexec_file_load is > defined as 294. Ok, I see. I have expected that it is wrapped in some ifdef CONFIG_KEXEC_FILE. But apparently that is not the case. Thanks for the clarification. > I think the simpler way is to modify it in kexec-tools as follows, > #ifdef __loongarch__ > #undef __NR_kexec_file_load > #endif Yeah, looks like this is the best solution. Only thing you could do is to merge it with the arm special handling. I.e. so it reads #if defined(__arm__) || defined(__longarch__) #undef __NR_kexec_file_load #endif But that is more a matter of taste. So you and Simon need to agree what you like more. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/2] kexec: Default __NR_kexec_file_load is set to undefined
Hi Youling, On Fri, 24 Feb 2023 17:51:07 +0800 Youling Tang wrote: > The initial reason is that after the merger of 29fe5067ed07 > ("kexec: make -a the default"), kexec cannot be used on LoongArch, > MIPS .etc architectures. We need to add "-c" for normal use. The > current kexec_file_load system call is not implemented in > architectures such as LoongArch, so it needs to pass kexec_load. > So we need to set __NR_kexec_file_load to undefined in unsupported > architectures. This will return EFALLBACK via is_kexec_file_load_implemented, > and then via kexec_load. > > Signed-off-by: Youling Tang > --- > kexec/kexec-syscall.h | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h > index be6ccd5..ea77936 100644 > --- a/kexec/kexec-syscall.h > +++ b/kexec/kexec-syscall.h > @@ -59,9 +59,7 @@ > #endif > #endif /*ifndef __NR_kexec_load*/ > > -#ifdef __arm__ > #undef __NR_kexec_file_load > -#endif > > #ifndef __NR_kexec_file_load I don't think this will work as intended. On the top of the file sys/syscall.h gets included. In there architectures that support kexec_file_load define __NR_kexec_file_load. This also means that if an architecture doesn't support kexec_file_load __NR_kexec_file_load shouldn't be defined in the first place. Thus I suggest that you find out why sys/syscall.h defines __NR_kexec_file_load for LoongArch even when the system call is not supported and fix it there. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: Unified Kernel Image Support
Hi, On Mon, 6 Feb 2023 17:19:33 +0800 Baoquan He wrote: > On 02/03/23 at 10:46pm, Yishen Miao wrote: > > Hello all, > > > > I am experimenting kexec on my box. It uses systemd-boot as the bootloader > > and boots from a unified kernel image (objcopy'ed cmdline, kernel, > > initrdramfs, and microcode updates). As of kexec-tools 2.0.25 and systemd > > 252.5, when I rum systemctl kexec, it returns the following: > > > > > > # sudo systemctl kexec > > Running /usr/bin/kexec --load "/efi/EFI/Linux/ArchLinux-linux.efi" --append > > "root=UUID=----"(null) > > Cannot determine the file type of /efi/EFI/Linux/ArchLinux-linux.efi > > > > It seems that systemctl successfully identified the UKI from systemd-boot, > > however, kexec could not recognize it. > > > > Are there any plans to add UKI support to kexec? Your response is greatly > > appreciated! > > My colleageus mentioned UKI recently. We have plan to support it, while > haven't started to work on that. I've looked into UKI recently. In order to provide some base support one should only need to teach kexec_file_load the new file format [1]. However that still leaves two fundamental issues which limit the usefulness of that support. 1) The way I understand it the initrd contained in the UKI is only a stub that is supposed to read further "modules" from disk which together form the initrd needed for the given hardware/system configuration. The problem is that the kexec_file_load syscall only accepts one fd for the kernel and one fd for the initrd. So to support multiple modules we would either need to introduce a new syscall or define a uABI that allows to pass multiple initrds via this one fd. Either way it's a new user interface and should be designed with care. 2) As the kernel command line is part of the UKI and is protected by the signature it cannot be changed by users. So to support kdump with a UKI a distro would need to find one crashkernel= parameter that works for all users which is impossible. Thus before kdump can be supported with UKI there needs to be a mechanism in place that allows users to edit the command line. Others have the same problem. There is an open issue on github [2] to add this support. So all in all I believe there will be kexec support for UKI but I don't see it to come anytime soon. Thanks Philipp [1] ...and kexec-tools if you like to support kexec_load. But as the main use case for UKI is together with Secure Boot I don't think that's necessary. [2] https://github.com/systemd/systemd/issues/24539 > I have a testing machine at hand right now, just finished teseting > upstream patches. If you have the detailed steps about how to make UKI, > privately or publicly, I can try it now, and see what we can do. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kexec: Enable runtime allocation of crash_image
Ricardo, On Mon, 28 Nov 2022 18:07:06 +0100 Ricardo Ribalda wrote: > Hi Philipp > > > Thanks for your review. > > > On Mon, 28 Nov 2022 at 18:00, Philipp Rudo wrote: > > > > Hi Ricardo, > > > > On Thu, 24 Nov 2022 23:23:36 +0100 > > Ricardo Ribalda wrote: > > > > > Usually crash_image is defined statically via the crashkernel parameter > > > or DT. > > > > > > But if the crash kernel is not used, or is smaller than then > > > area pre-allocated that memory is wasted. > > > > > > Also, if the crash kernel was not defined at bootime, there is no way to > > > use the crash kernel. > > > > > > Enable runtime allocation of the crash_image if the crash_image is not > > > defined statically. Following the same memory allocation/validation path > > > that for the reboot kexec kernel. > > > > > > Signed-off-by: Ricardo Ribalda > > > > I don't think this patch will work as intended. For one you omit > > setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that > > type you will find that there is a lot of special handling done for it. > > I don't believe that this can simply be skipped without causing > > problems. > > > > Furthermore I think you have missed one important detail. The memory > > reserved for the crash kernel is not just a buffer for the image but > > the memory it runs in! For that it has to be a continuous piece of > > physical memory with usually some additional arch specific limitations. > > When allocated dynamically all those limitations need to be considered. > > But a standard kexec doesn't care about those limitations as it doesn't > > care about the os running before itself. It can simply overwrite the > > memory when booting. But if the crash kernel does the same it will > > corrupt the dump it is supposed to generate. > > Right now, I do not intend to use it to fetch a kdump, I am using it > as the image that will run when the system crashes. the crash_image is currently all about creating a dump. If you want to change that you need to discuss the new behavior in the commit message! Please update the commit message. Thanks Philipp > > It seems to work fine on the two devices that I am using for tests. > > > > > Thanks > > Philipp > > > > > --- > > > kexec: Enable runtime allocation of crash_image > > > > > > To: Eric Biederman > > > Cc: kexec@lists.infradead.org > > > Cc: linux-ker...@vger.kernel.org > > > Cc: Sergey Senozhatsky > > > Cc: linux-ker...@vger.kernel.org > > > Cc: Ross Zwisler > > > Cc: Philipp Rudo > > > Cc: Baoquan He > > > --- > > > include/linux/kexec.h | 1 + > > > kernel/kexec.c| 9 + > > > kernel/kexec_core.c | 5 + > > > kernel/kexec_file.c | 7 --- > > > 4 files changed, 15 insertions(+), 7 deletions(-) > > > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > > index 41a686996aaa..98ca9a32bc8e 100644 > > > --- a/include/linux/kexec.h > > > +++ b/include/linux/kexec.h > > > @@ -427,6 +427,7 @@ extern int kexec_load_disabled; > > > extern bool kexec_in_progress; > > > > > > int crash_shrink_memory(unsigned long new_size); > > > +bool __crash_memory_valid(void); > > > ssize_t crash_get_memory_size(void); > > > > > > #ifndef arch_kexec_protect_crashkres > > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > > index cb8e6e6f983c..b5c17db25e88 100644 > > > --- a/kernel/kexec.c > > > +++ b/kernel/kexec.c > > > @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, > > > unsigned long entry, > > > struct kimage *image; > > > bool kexec_on_panic = flags & KEXEC_ON_CRASH; > > > > > > - if (kexec_on_panic) { > > > + if (kexec_on_panic && __crash_memory_valid()) { > > > /* Verify we have a valid entry point */ > > > if ((entry < phys_to_boot_phys(crashk_res.start)) || > > > (entry > phys_to_boot_phys(crashk_res.end))) > > > @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, > > > unsigned long entry, > > > image->nr_segments = nr_segments; > > > memcpy(image->segment, segments, nr_segments * sizeof(*segments)); > > > > > > - if (kexec_on_panic) { > > >
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
Hi Steven, On Mon, 28 Nov 2022 11:42:00 -0500 Steven Rostedt wrote: > On Thu, 24 Nov 2022 16:01:15 +0100 > Philipp Rudo wrote: > > > No, I think the implementation is fine. I'm currently only struggling > > to understand what problem kexec_reboot_disabled solves that cannot be > > solved by kexec_load_disabled. > > Hi Philipp, > > Thanks for working with us on this. > > Let me try to explain our use case. We want kexec/kdump enabled, but we > really do not want kexec used for any other purpose. We must have the kexec > kernel loaded at boot up and not afterward. > > Your recommendation of: > > kexec -p dump_kernel > echo 1 > /proc/sys/kernel/kexec_load_disabled > > can work, and we will probably add it. But we are taking the paranoid > approach, and what I learned in security 101 ;-) and that is, only open up > the minimal attack surface as possible. Well that's sort of my problem. When you go all in on paranoia I would expect that you also restrict the crash kernel. Otherwise you keep most of the attack surface. But disabling 'reboot' of the crash kernel is quite intrusive and probably not what you want. That's why I think it is better do the restriction on the 'load' rather than the 'reboot' path. One solution for that is the script above. But that pushes all the responsibilities concerning syncing and error handling to the sysadmin. Depending on your level of paranoia that might be too risky. Personally I think it's fine as the kernel alone cannot fix all potential security problems. In my opinion this has to be done in the layer that is responsible for the task done. So when a security problem arises due to ill syncing when starting different services it's the job of the init system to fix the issue. An alternative approach and sort of compromise I see is to convert kexec_load_disabled from a simple on/off switch to a counter on how often a kexec load can be made (in practice a tristate on/off/one-shot should be sufficient). Ideally the reboot and panic path will have separate counters. With that you could for example use kexec_load_limit.reboot=0 and kexec_load_limit.panic=1 to disable the load of images for reboot while still allow to load a crash kernel once. With this you have the flexibility you need while also preventing a race where an attacker overwrites your crash kernel before you can toggle the switch. What do you think? > Yes, it's highly unlikely that the above would crash. But as with most > security vulnerabilities, it's not going to be an attacker that creates a > new gadget here, but probably another script in the future that causes this > to be delayed or something, and a new window of opportunity will arise for > an attacker. Maybe, that new window only works for non panic kernels. Yes, > this is a contrived scenario, but the work vs risk is very low in adding > this feature. True, but that problem is not limited to userspace. For example see Ricardos other patch [1] where he treats the load of a crash kernel just like a standard load. In my opinion he creates such a gadget in that patch. [1] https://lore.kernel.org/all/20221124-kexec-noalloc-v1-0-d78361e99...@chromium.org/ Thanks Philipp > Perhaps the attack surface that a reboot kexec could be, is that the > attacker gets the ability at boot up to load the kexec for reboot and not > panic. > Then the attack must wait for the victim to reboot their machine before > they have access to the new kernel. Again, I admit this is contrived, but > just because I can't think of a real situation that this could be a problem > doesn't mean that one doesn't exist. > > In other words, if we never want to allow a kexec reboot, why allow it at > all from the beginning? The above allows it, until we don't. That alone > makes us nervous. Whereas this patch is rather trivial and doesn't add > complexity. > > Thanks for your time, we appreciate it. > > -- Steve > > ___ > kexec mailing list > kexec@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kexec: Enable runtime allocation of crash_image
Hi Ricardo, On Thu, 24 Nov 2022 23:23:36 +0100 Ricardo Ribalda wrote: > Usually crash_image is defined statically via the crashkernel parameter > or DT. > > But if the crash kernel is not used, or is smaller than then > area pre-allocated that memory is wasted. > > Also, if the crash kernel was not defined at bootime, there is no way to > use the crash kernel. > > Enable runtime allocation of the crash_image if the crash_image is not > defined statically. Following the same memory allocation/validation path > that for the reboot kexec kernel. > > Signed-off-by: Ricardo Ribalda I don't think this patch will work as intended. For one you omit setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that type you will find that there is a lot of special handling done for it. I don't believe that this can simply be skipped without causing problems. Furthermore I think you have missed one important detail. The memory reserved for the crash kernel is not just a buffer for the image but the memory it runs in! For that it has to be a continuous piece of physical memory with usually some additional arch specific limitations. When allocated dynamically all those limitations need to be considered. But a standard kexec doesn't care about those limitations as it doesn't care about the os running before itself. It can simply overwrite the memory when booting. But if the crash kernel does the same it will corrupt the dump it is supposed to generate. Thanks Philipp > --- > kexec: Enable runtime allocation of crash_image > > To: Eric Biederman > Cc: kexec@lists.infradead.org > Cc: linux-ker...@vger.kernel.org > Cc: Sergey Senozhatsky > Cc: linux-ker...@vger.kernel.org > Cc: Ross Zwisler > Cc: Philipp Rudo > Cc: Baoquan He > --- > include/linux/kexec.h | 1 + > kernel/kexec.c| 9 + > kernel/kexec_core.c | 5 + > kernel/kexec_file.c | 7 --- > 4 files changed, 15 insertions(+), 7 deletions(-) > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index 41a686996aaa..98ca9a32bc8e 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -427,6 +427,7 @@ extern int kexec_load_disabled; > extern bool kexec_in_progress; > > int crash_shrink_memory(unsigned long new_size); > +bool __crash_memory_valid(void); > ssize_t crash_get_memory_size(void); > > #ifndef arch_kexec_protect_crashkres > diff --git a/kernel/kexec.c b/kernel/kexec.c > index cb8e6e6f983c..b5c17db25e88 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, > unsigned long entry, > struct kimage *image; > bool kexec_on_panic = flags & KEXEC_ON_CRASH; > > - if (kexec_on_panic) { > + if (kexec_on_panic && __crash_memory_valid()) { > /* Verify we have a valid entry point */ > if ((entry < phys_to_boot_phys(crashk_res.start)) || > (entry > phys_to_boot_phys(crashk_res.end))) > @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, > unsigned long entry, > image->nr_segments = nr_segments; > memcpy(image->segment, segments, nr_segments * sizeof(*segments)); > > - if (kexec_on_panic) { > + if (kexec_on_panic && __crash_memory_valid()) { > /* Enable special crash kernel control page alloc policy. */ > image->control_page = crashk_res.start; > image->type = KEXEC_TYPE_CRASH; > @@ -101,7 +101,7 @@ static int do_kexec_load(unsigned long entry, unsigned > long nr_segments, > > if (flags & KEXEC_ON_CRASH) { > dest_image = &kexec_crash_image; > - if (kexec_crash_image) > + if (kexec_crash_image && __crash_memory_valid()) > arch_kexec_unprotect_crashkres(); > } else { > dest_image = &kexec_image; > @@ -157,7 +157,8 @@ static int do_kexec_load(unsigned long entry, unsigned > long nr_segments, > image = xchg(dest_image, image); > > out: > - if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) > + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image && > + __crash_memory_valid()) > arch_kexec_protect_crashkres(); > > kimage_free(image); > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index ca2743f9c634..77083c9760fb 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -1004,6 +1004,11 @@ void crash_kexec(struct pt_regs *regs) > } > } > > +bool __crash_memory_valid(void) > +{ > + return crashk_r
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
Hi Ricardo, On Thu, 24 Nov 2022 23:32:34 +0100 Ricardo Ribalda wrote: > Hi Philipp > > > On Thu, 24 Nov 2022 at 16:01, Philipp Rudo wrote: > > > > On Thu, 24 Nov 2022 13:52:58 +0100 > > Ricardo Ribalda wrote: > > > > > On Thu, 24 Nov 2022 at 12:40, Philipp Rudo wrote: > > > > > > > > Hi Ricardo, > > > > > > > > On Wed, 23 Nov 2022 09:58:08 +0100 > > > > Ricardo Ribalda wrote: > > > > > > > > > Hi Philipp > > > > > > > > > > Thanks for your review. > > > > > > > > > > My scenario is a trusted system, where even if you are root, your > > > > > access to the system is very limited. > > > > > > > > > > Let's assume LOADPIN and verity are enabled. > > > > > > > > My point is that on such systems I expect that a sysadmin also wants to > > > > control the crash kernel including its initramfs (which also has to be > > > > part > > > > of the signed kernel?). But if that's the case a sysadmin can simply arm > > > > kdump early during boot and then toggle kexec_load_disabled. With that > > > > LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be > > > > loaded > > > > while kdump works. Thus there is no need to add the new interface. Or am > > > > I missing anything? > > > > > > Let's say that you have a script that does something like this > > > > > > > > > kexec -p dump_kernel > > > echo 1 > /proc/sys/kernel/kexec_load_disabled > > > > > > If an attacker can DDos the system and make that script crash... then > > > kexec is still accessible > > > > > > On the other hand, if you load the kernel with the commandline > > > > > > sysctl.kernel.kexec_load_disabled=1 > > > > reboot? > > yes :) thanks! > > > Otherwise you shouldn't be able to load the crash kernel at all. > > > > > Then even if the script crashes, the only way to abuse kexec is by > > > panicing the running kernel > > > > True. But when an attacker can DDos the system the final workload is > > already running. So wouldn't it be enough to make sure that the script > > above has finished before starting you workload. E.g. by setting an > > appropriate Before=/After= in the systemd.unit? > > What if the kexec binary crashes and the unit will never succeed? Then there are options like OnFailure= or FailureAction= the sysadmin can use do what ever he seems appropriate. > Or worse, your distro does not use systemd !!! In that case there are other ways to achieve the same. The two options are only examples. Why can't this be used as a replacement? > > Furthermore, I don't think that restricting kexec reboot alone is > > sufficient when the attacker can still control the crash kernel. At > > least my assumption is that triggering a panic instead of just > > rebooting is just a mild inconvenience for somebody who is able to pull > > off an attack like that. > > The attacker does not control the crash kernel completely. loadpin is > still in place. > Yes, they can downgrade the whole system to a vulnerable kernel image. > But the choices are limited :) > > With physical access to the device panicing a kernel is easily doable > (but not trivial). But remotely, it is more challenging. Well the same holds for kexec. So the only difference is triggering the panic where I'm still not convinced it's a huge obstacle for someone who is able to pull off all the steps before for such an attack. To be honest I don't think we make a progress here at the moment. I would like to hear from others what they think about this. Thanks Philipp > > > > > > Would it make you more comfortable if I model this as a kernel config > > > instead of a runtime option? > > > > No, I think the implementation is fine. I'm currently only struggling > > to understand what problem kexec_reboot_disabled solves that cannot be > > solved by kexec_load_disabled. > > > > > Thanks! > > > > Happy to help. > > > > Thanks > > Philipp > > > > > > > > > > > > > > > > Thanks > > > > Philipp > > > > > > > > > > > > > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo wrote: > > > > > > > > > > &
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
On Thu, 24 Nov 2022 13:52:58 +0100 Ricardo Ribalda wrote: > On Thu, 24 Nov 2022 at 12:40, Philipp Rudo wrote: > > > > Hi Ricardo, > > > > On Wed, 23 Nov 2022 09:58:08 +0100 > > Ricardo Ribalda wrote: > > > > > Hi Philipp > > > > > > Thanks for your review. > > > > > > My scenario is a trusted system, where even if you are root, your > > > access to the system is very limited. > > > > > > Let's assume LOADPIN and verity are enabled. > > > > My point is that on such systems I expect that a sysadmin also wants to > > control the crash kernel including its initramfs (which also has to be part > > of the signed kernel?). But if that's the case a sysadmin can simply arm > > kdump early during boot and then toggle kexec_load_disabled. With that > > LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be loaded > > while kdump works. Thus there is no need to add the new interface. Or am > > I missing anything? > > Let's say that you have a script that does something like this > > > kexec -p dump_kernel > echo 1 > /proc/sys/kernel/kexec_load_disabled > > If an attacker can DDos the system and make that script crash... then > kexec is still accessible > > On the other hand, if you load the kernel with the commandline > > sysctl.kernel.kexec_load_disabled=1 reboot? Otherwise you shouldn't be able to load the crash kernel at all. > Then even if the script crashes, the only way to abuse kexec is by > panicing the running kernel True. But when an attacker can DDos the system the final workload is already running. So wouldn't it be enough to make sure that the script above has finished before starting you workload. E.g. by setting an appropriate Before=/After= in the systemd.unit? Furthermore, I don't think that restricting kexec reboot alone is sufficient when the attacker can still control the crash kernel. At least my assumption is that triggering a panic instead of just rebooting is just a mild inconvenience for somebody who is able to pull off an attack like that. > Would it make you more comfortable if I model this as a kernel config > instead of a runtime option? No, I think the implementation is fine. I'm currently only struggling to understand what problem kexec_reboot_disabled solves that cannot be solved by kexec_load_disabled. > Thanks! Happy to help. Thanks Philipp > > > > > > Thanks > > Philipp > > > > > > > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo wrote: > > > > > > > > Hi Ricardo, > > > > > > > > On Thu, 17 Nov 2022 16:15:07 +0100 > > > > Ricardo Ribalda wrote: > > > > > > > > > Hi Philipp > > > > > > > > > > Thanks for your review! > > > > > > > > happy to help. > > > > > > > > > > > > > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo wrote: > > > > > > > > > > > > Hi Ricardo, > > > > > > > > > > > > all in all I think this patch makes sense. However, there is one > > > > > > point > > > > > > I don't like... > > > > > > > > > > > > On Mon, 14 Nov 2022 14:18:39 +0100 > > > > > > Ricardo Ribalda wrote: > > > > > > > > > > > > > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, > > > > > > > reducing the > > > > > > > attack surface to a system. > > > > > > > > > > > > > > Without this toogle, an attacker can only reboot into a different > > > > > > > kernel > > > > > > > if they can create a panic(). > > > > > > > > > > > > > > Signed-off-by: Ricardo Ribalda > > > > > > > > > > > > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst > > > > > > > b/Documentation/admin-guide/sysctl/kernel.rst > > > > > > > index 97394bd9d065..25d019682d33 100644 > > > > > > > --- a/Documentation/admin-guide/sysctl/kernel.rst > > > > > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst > > > > > > > @@ -462,6 +462,17 @@ altered. > > > > > > > Generally used together with the `modules_disabled`_ sysctl. > > > > > > > > > > > > > > > >
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
Hi Ricardo, On Wed, 23 Nov 2022 09:58:08 +0100 Ricardo Ribalda wrote: > Hi Philipp > > Thanks for your review. > > My scenario is a trusted system, where even if you are root, your > access to the system is very limited. > > Let's assume LOADPIN and verity are enabled. My point is that on such systems I expect that a sysadmin also wants to control the crash kernel including its initramfs (which also has to be part of the signed kernel?). But if that's the case a sysadmin can simply arm kdump early during boot and then toggle kexec_load_disabled. With that LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be loaded while kdump works. Thus there is no need to add the new interface. Or am I missing anything? Thanks Philipp > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo wrote: > > > > Hi Ricardo, > > > > On Thu, 17 Nov 2022 16:15:07 +0100 > > Ricardo Ribalda wrote: > > > > > Hi Philipp > > > > > > Thanks for your review! > > > > happy to help. > > > > > > > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo wrote: > > > > > > > > Hi Ricardo, > > > > > > > > all in all I think this patch makes sense. However, there is one point > > > > I don't like... > > > > > > > > On Mon, 14 Nov 2022 14:18:39 +0100 > > > > Ricardo Ribalda wrote: > > > > > > > > > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, reducing the > > > > > attack surface to a system. > > > > > > > > > > Without this toogle, an attacker can only reboot into a different > > > > > kernel > > > > > if they can create a panic(). > > > > > > > > > > Signed-off-by: Ricardo Ribalda > > > > > > > > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst > > > > > b/Documentation/admin-guide/sysctl/kernel.rst > > > > > index 97394bd9d065..25d019682d33 100644 > > > > > --- a/Documentation/admin-guide/sysctl/kernel.rst > > > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst > > > > > @@ -462,6 +462,17 @@ altered. > > > > > Generally used together with the `modules_disabled`_ sysctl. > > > > > > > > > > > > > > > +kexec_reboot_disabled > > > > > += > > > > > + > > > > > +A toggle indicating if ``LINUX_REBOOT_CMD_KEXEC`` has been disabled. > > > > > +This value defaults to 0 (false: ``LINUX_REBOOT_CMD_KEXEC`` enabled), > > > > > +but can be set to 1 (true: ``LINUX_REBOOT_CMD_KEXEC`` disabled). > > > > > +Once true, kexec can no longer be used for reboot and the toggle > > > > > +cannot be set back to false. > > > > > +This toggle does not affect the use of kexec during a crash. > > > > > + > > > > > + > > > > > kptr_restrict > > > > > = > > > > > > > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > > > > index 41a686996aaa..15c3fad8918b 100644 > > > > > --- a/include/linux/kexec.h > > > > > +++ b/include/linux/kexec.h > > > > > @@ -407,6 +407,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct > > > > > kimage *image); > > > > > extern struct kimage *kexec_image; > > > > > extern struct kimage *kexec_crash_image; > > > > > extern int kexec_load_disabled; > > > > > +extern int kexec_reboot_disabled; > > > > > > > > > > #ifndef kexec_flush_icache_page > > > > > #define kexec_flush_icache_page(page) > > > > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > > > > index cb8e6e6f983c..43063f803d81 100644 > > > > > --- a/kernel/kexec.c > > > > > +++ b/kernel/kexec.c > > > > > @@ -196,6 +196,10 @@ static inline int kexec_load_check(unsigned long > > > > > nr_segments, > > > > > if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) > > > > > return -EPERM; > > > > > > > > > > + /* Check if the system admin has disabled kexec reboot. */ > > > > > + if (!(flags & KEXEC_ON_CRASH) && kexec_reboot_disabled) > > > > > + return -EPERM; > > > > > > > > ... Allowing to load a crashkernel doesn't m
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
Hi Ricardo, On Thu, 17 Nov 2022 16:15:07 +0100 Ricardo Ribalda wrote: > Hi Philipp > > Thanks for your review! happy to help. > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo wrote: > > > > Hi Ricardo, > > > > all in all I think this patch makes sense. However, there is one point > > I don't like... > > > > On Mon, 14 Nov 2022 14:18:39 +0100 > > Ricardo Ribalda wrote: > > > > > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, reducing the > > > attack surface to a system. > > > > > > Without this toogle, an attacker can only reboot into a different kernel > > > if they can create a panic(). > > > > > > Signed-off-by: Ricardo Ribalda > > > > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst > > > b/Documentation/admin-guide/sysctl/kernel.rst > > > index 97394bd9d065..25d019682d33 100644 > > > --- a/Documentation/admin-guide/sysctl/kernel.rst > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst > > > @@ -462,6 +462,17 @@ altered. > > > Generally used together with the `modules_disabled`_ sysctl. > > > > > > > > > +kexec_reboot_disabled > > > += > > > + > > > +A toggle indicating if ``LINUX_REBOOT_CMD_KEXEC`` has been disabled. > > > +This value defaults to 0 (false: ``LINUX_REBOOT_CMD_KEXEC`` enabled), > > > +but can be set to 1 (true: ``LINUX_REBOOT_CMD_KEXEC`` disabled). > > > +Once true, kexec can no longer be used for reboot and the toggle > > > +cannot be set back to false. > > > +This toggle does not affect the use of kexec during a crash. > > > + > > > + > > > kptr_restrict > > > = > > > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > > index 41a686996aaa..15c3fad8918b 100644 > > > --- a/include/linux/kexec.h > > > +++ b/include/linux/kexec.h > > > @@ -407,6 +407,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage > > > *image); > > > extern struct kimage *kexec_image; > > > extern struct kimage *kexec_crash_image; > > > extern int kexec_load_disabled; > > > +extern int kexec_reboot_disabled; > > > > > > #ifndef kexec_flush_icache_page > > > #define kexec_flush_icache_page(page) > > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > > index cb8e6e6f983c..43063f803d81 100644 > > > --- a/kernel/kexec.c > > > +++ b/kernel/kexec.c > > > @@ -196,6 +196,10 @@ static inline int kexec_load_check(unsigned long > > > nr_segments, > > > if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) > > > return -EPERM; > > > > > > + /* Check if the system admin has disabled kexec reboot. */ > > > + if (!(flags & KEXEC_ON_CRASH) && kexec_reboot_disabled) > > > + return -EPERM; > > > > ... Allowing to load a crashkernel doesn't make sense in my opinion. If > > an attacker is capable of creating a malicious kernel, planting it on > > the victims system and then find a way to boot it via kexec this > > attacker also knows how to load the malicious kernel as crashkernel and > > trigger a panic. So you haven't really gained anything. That's why I > > would simply drop this hunk (and the corresponding one from > > kexec_file_load) and let users who worry about this use a combination of > > kexec_load_disabled and kexec_reboot_disabled. > > If for whatever reason your sysadmin configured kexec_reboot_disabed > it can be nice that when a user try to load it they get a warning. > It is easier to debug than waiting two steps later when they run kexec -e I'm having second thoughts about this patch. My main problem is that I don't see a real use case where kexec_reboot_disabled is advantageous over kexec_load_disabled. The point is that disabling LINUX_REBOOT_CMD_KEXEC is almost identical to toggling kexec_load_disabled without a loaded kernel (when you don't have a kernel loaded you cannot reboot into it). With this the main use case of kexec_reboot_disabled is already covered by kexec_load_disabled. However, there are two differences 1) with kexec_reboot_disable you can still (re-)load a crash kernel e.g. to update the initramfs after a config change. But as discussed in my first mail this comes on the cost that an attacker could still load a malicious crash kernel and then 'panic into it'. 2) kexec_load_disabled also prevents unloading of a loaded kernel. So once loaded kexec_load_disabled
Re: [PATCH RFC] kexec: Freeze processes before kexec
Hi Steve, On Wed, 16 Nov 2022 15:16:10 -0500 Steven Rostedt wrote: > On Wed, 16 Nov 2022 19:56:24 + > "Joel Fernandes (Google)" wrote: > > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -1175,6 +1175,12 @@ int kernel_kexec(void) > > } else > > #endif > > { > > + error = freeze_processes(); > > + if (error) { > > + error = -EBUSY; > > + goto Unlock; > > + } > > If this is the path of a kernel panic, do we really want to check the > return error of freeze_processes()? We are panicing, there's not much more > we can do. kernel_kexec isn't called during panic. We don't need to worry about it here. Having that said I think this is a problem in the device driver _not_ in kexec. In my opinion it's the job of the driver to prevent such races during shutdown. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
Hi Ricardo, all in all I think this patch makes sense. However, there is one point I don't like... On Mon, 14 Nov 2022 14:18:39 +0100 Ricardo Ribalda wrote: > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, reducing the > attack surface to a system. > > Without this toogle, an attacker can only reboot into a different kernel > if they can create a panic(). > > Signed-off-by: Ricardo Ribalda > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst > b/Documentation/admin-guide/sysctl/kernel.rst > index 97394bd9d065..25d019682d33 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -462,6 +462,17 @@ altered. > Generally used together with the `modules_disabled`_ sysctl. > > > +kexec_reboot_disabled > += > + > +A toggle indicating if ``LINUX_REBOOT_CMD_KEXEC`` has been disabled. > +This value defaults to 0 (false: ``LINUX_REBOOT_CMD_KEXEC`` enabled), > +but can be set to 1 (true: ``LINUX_REBOOT_CMD_KEXEC`` disabled). > +Once true, kexec can no longer be used for reboot and the toggle > +cannot be set back to false. > +This toggle does not affect the use of kexec during a crash. > + > + > kptr_restrict > = > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index 41a686996aaa..15c3fad8918b 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -407,6 +407,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage > *image); > extern struct kimage *kexec_image; > extern struct kimage *kexec_crash_image; > extern int kexec_load_disabled; > +extern int kexec_reboot_disabled; > > #ifndef kexec_flush_icache_page > #define kexec_flush_icache_page(page) > diff --git a/kernel/kexec.c b/kernel/kexec.c > index cb8e6e6f983c..43063f803d81 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -196,6 +196,10 @@ static inline int kexec_load_check(unsigned long > nr_segments, > if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) > return -EPERM; > > + /* Check if the system admin has disabled kexec reboot. */ > + if (!(flags & KEXEC_ON_CRASH) && kexec_reboot_disabled) > + return -EPERM; ... Allowing to load a crashkernel doesn't make sense in my opinion. If an attacker is capable of creating a malicious kernel, planting it on the victims system and then find a way to boot it via kexec this attacker also knows how to load the malicious kernel as crashkernel and trigger a panic. So you haven't really gained anything. That's why I would simply drop this hunk (and the corresponding one from kexec_file_load) and let users who worry about this use a combination of kexec_load_disabled and kexec_reboot_disabled. Thanks Philipp > + > /* Permit LSMs and IMA to fail the kexec */ > result = security_kernel_load_data(LOADING_KEXEC_IMAGE, false); > if (result < 0) > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index ca2743f9c634..fe82e2525705 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -929,6 +929,7 @@ int kimage_load_segment(struct kimage *image, > struct kimage *kexec_image; > struct kimage *kexec_crash_image; > int kexec_load_disabled; > +int kexec_reboot_disabled; > #ifdef CONFIG_SYSCTL > static struct ctl_table kexec_core_sysctls[] = { > { > @@ -941,6 +942,16 @@ static struct ctl_table kexec_core_sysctls[] = { > .extra1 = SYSCTL_ONE, > .extra2 = SYSCTL_ONE, > }, > + { > + .procname = "kexec_reboot_disabled", > + .data = &kexec_reboot_disabled, > + .maxlen = sizeof(int), > + .mode = 0644, > + /* only handle a transition from default "0" to "1" */ > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ONE, > + .extra2 = SYSCTL_ONE, > + }, > { } > }; > > @@ -1138,7 +1149,7 @@ int kernel_kexec(void) > > if (!kexec_trylock()) > return -EBUSY; > - if (!kexec_image) { > + if (!kexec_image || kexec_reboot_disabled) { > error = -EINVAL; > goto Unlock; > } > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 45637511e0de..583fba6de5cb 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -333,6 +333,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, > initrd_fd, > if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) > return -EPERM; > > + /* Check if the system admin has disabled kexec reboot. */ > + if (!(flags & (KEXEC_FILE_ON_CRASH | KEXEC_FILE_UNLOAD)) > + && kexec_reboot_disabled) > + return -EPERM; > + > /* Make sure we have a legal set of flags */ > if (flags != (flags & KEXEC_FILE_FLAGS)) > return -EINVAL; > ___ kexec mailing l
Re: [PATCH] kernel/crash_core.c : Remove redundant checks for ck_cmdline is NULL
Hi lizhe, On Mon, 2 May 2022 18:11:20 +0800 (CST) lizhe wrote: > HI Philipp Rudo. > > > When ck_cmdline is NULL. The last three lines of this function are > equivalent to : > if ( ! NULL) > return NULL; > return NULL; > This is obviously a redundant check. > > >I will use the above description to describe the patch, the explanation looks good to me. Thanks! Philipp > > > thanks. > > lizhe > > > > > > > > > > > > > > > > > > > At 2022-04-26 16:39:52, "Philipp Rudo" wrote: > >Hi lizhe, > > > >On Mon, 25 Apr 2022 08:38:57 -0700 > >lizhe wrote: > > > >> When ck_cmdline is NULL, the only caller of get_last_crashkernel() > >> has already done non-NULL check(see __parse_crashkernel()), > >> so it doesn't make any sense to make a check here > > > >sorry, but I still don't like the description. What I don't understand > >in particular is why you are mentioning the caller (__parse_crashkernel) > >here. ck_cmdline is a local variable to get_last_crashkernel. So the > >caller cannot perform any check on the variable but only the return > >value of the function. So the patch description should describe why we > >can remove the additional return NULL without changing the behavior of > >the function. > > > >Thanks > >Philipp > > > >> Signed-off-by: lizhe > >> --- > >> kernel/crash_core.c | 3 --- > >> 1 file changed, 3 deletions(-) > >> > >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c > >> index 256cf6db573c..c232f01a2c54 100644 > >> --- a/kernel/crash_core.c > >> +++ b/kernel/crash_core.c > >> @@ -222,9 +222,6 @@ static __init char *get_last_crashkernel(char *cmdline, > >>p = strstr(p+1, name); > >>} > >> > >> - if (!ck_cmdline) > >> - return NULL; > >> - > >>return ck_cmdline; > >> } > >> ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kernel/crash_core.c : Remove redundant checks for ck_cmdline is NULL
Hi lizhe, On Mon, 25 Apr 2022 08:38:57 -0700 lizhe wrote: > When ck_cmdline is NULL, the only caller of get_last_crashkernel() > has already done non-NULL check(see __parse_crashkernel()), > so it doesn't make any sense to make a check here sorry, but I still don't like the description. What I don't understand in particular is why you are mentioning the caller (__parse_crashkernel) here. ck_cmdline is a local variable to get_last_crashkernel. So the caller cannot perform any check on the variable but only the return value of the function. So the patch description should describe why we can remove the additional return NULL without changing the behavior of the function. Thanks Philipp > Signed-off-by: lizhe > --- > kernel/crash_core.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index 256cf6db573c..c232f01a2c54 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -222,9 +222,6 @@ static __init char *get_last_crashkernel(char *cmdline, > p = strstr(p+1, name); > } > > - if (!ck_cmdline) > - return NULL; > - > return ck_cmdline; > } > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kernel/crash_core.c: No judgment required
Hi, On Tue, 26 Apr 2022 10:17:18 +0200 Philipp Rudo wrote: > Hi lizhe, > > On Mon, 25 Apr 2022 14:22:31 +0800 (CST) > lizhe wrote: > > > HI : > > > > > > I found the problem at the first time and gave the solution, > > > > > > > > > > Pphilipp Rudo just saw the solution to the problem and gave an explanation. > > the author of this patch should only be me > > right, I only commented on the patch you sent. > > Could you please update the commit message and send a v2. should have checked the rest of my mails first... ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kernel/crash_core.c: No judgment required
Hi lizhe, On Mon, 25 Apr 2022 14:22:31 +0800 (CST) lizhe wrote: > HI : > > > I found the problem at the first time and gave the solution, > > > > > Pphilipp Rudo just saw the solution to the problem and gave an explanation. > the author of this patch should only be me right, I only commented on the patch you sent. Could you please update the commit message and send a v2. Thanks Philipp > > > > > > > > > lizhe > > > > > > > > > At 2022-04-25 09:36:17, "Baoquan He" wrote: > >On 12/14/21 at 05:32pm, Philipp Rudo wrote: > >> Hi lizhe, > >> > >> On Thu, 9 Dec 2021 19:20:03 -0800 > >> lizhe wrote: > >> > >> > No judgment required ck_cmdline is NULL > >> > its caller has alreadly judged, see __parse_crashkernel > >> > function > >> > > >> > Signed-off-by: lizhe > >> > --- > >> > kernel/crash_core.c | 3 --- > >> > 1 file changed, 3 deletions(-) > >> > > >> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > >> > index eb53f5ec62c9..9981cf9b9fe4 100644 > >> > --- a/kernel/crash_core.c > >> > +++ b/kernel/crash_core.c > >> > @@ -221,9 +221,6 @@ static __init char *get_last_crashkernel(char > >> > *cmdline, > >> > p = strstr(p+1, name); > >> > } > >> > > >> > -if (!ck_cmdline) > >> > -return NULL; > >> > - > >> > return ck_cmdline; > >> > } > >> > > >> > >> I agree that the if-block is not needed and can be removed. However, I > >> cannot follow your reasoning in the commit message. Could you please > >> explain it in more detail. > >> > >> The reason why I think that the 'if' can be removed is that the > >> expression can only be true when ck_cmdline = NULL. But with that the > >> last three lines are equivalent to > >> > >>if (!ck_cmdline) > >>return ck_cmdline; > >> > >>return ck_cmdline; > >> > >> Which simply doesn't make any sense. > > > >Right, the judgement actually introduces redundant codes. As Zhe > >replied, maybe you can rewrite the log and repost with your > >Signed-off-by, Philipp. As for Author, you two can discuss in private > >mail. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH makedumpfile] Avoid false-positive mem_section validation with vmlinux
Hi Kazu, On Wed, 20 Apr 2022 23:58:29 + HAGIO KAZUHITO(萩尾 一仁) wrote: > Currently get_mem_section() validates if SYMBOL(mem_section) is the address > of the mem_section array first. But there was a report that the first > validation wrongly returned TRUE with -x vmlinux and SPARSEMEM_EXTREME > (4.15+) on s390x. This leads to crash failing statup with the following > seek error: > > crash: seek error: kernel virtual address: 67fffc2800 type: "memory > section root table" > > Skip the first validation when satisfying the conditions. > > Reported-by: Dave Wysochanski > Signed-off-by: Kazuhito Hagio for me this patch looks fine and exactly addresses the problem, that mem_section has different types in the vmlinux and vmcoreinfo. So, for what I want Reviewed-and-Tested-by: Philipp Rudo > --- > makedumpfile.c | 31 +++ > 1 file changed, 31 insertions(+) > > diff --git a/makedumpfile.c b/makedumpfile.c > index a2f45c84cee3..65d1c7c2f02c 100644 > --- a/makedumpfile.c > +++ b/makedumpfile.c > @@ -3698,6 +3698,22 @@ validate_mem_section(unsigned long *mem_sec, > return ret; > } > > +/* > + * SYMBOL(mem_section) varies with the combination of memory model and > + * its source: > + * > + * SPARSEMEM > + * vmcoreinfo: address of mem_section root array > + * -x vmlinux: address of mem_section root array > + * > + * SPARSEMEM_EXTREME v1 > + * vmcoreinfo: address of mem_section root array > + * -x vmlinux: address of mem_section root array > + * > + * SPARSEMEM_EXTREME v2 (with 83e3c48729d9 and a0b1280368d1) 4.15+ > + * vmcoreinfo: address of mem_section root array > + * -x vmlinux: address of pointer to mem_section root array > + */ > static int > get_mem_section(unsigned int mem_section_size, unsigned long *mem_maps, > unsigned int num_section) > @@ -3710,12 +3726,27 @@ get_mem_section(unsigned int mem_section_size, > unsigned long *mem_maps, > strerror(errno)); > return FALSE; > } > + > + /* > + * There was a report that the first validation wrongly returned TRUE > + * with -x vmlinux and SPARSEMEM_EXTREME v2 on s390x, so skip it. > + * Howerver, leave the fallback validation as it is for the -i option. > + */ > + if (is_sparsemem_extreme() && info->name_vmlinux) { > + unsigned long flag = 0; > + if (get_symbol_type_name("mem_section", > DWARF_INFO_GET_SYMBOL_TYPE, > + NULL, &flag) > + && !(flag & TYPE_ARRAY)) > + goto skip_1st_validation; > + } > + > ret = validate_mem_section(mem_sec, SYMBOL(mem_section), > mem_section_size, mem_maps, num_section); > > if (!ret && is_sparsemem_extreme()) { > unsigned long mem_section_ptr; > > +skip_1st_validation: > if (!readmem(VADDR, SYMBOL(mem_section), &mem_section_ptr, >sizeof(mem_section_ptr))) > goto out; ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/2] makedumpfile: break loop after last dumpable page
Hi Kazu, On Thu, 7 Apr 2022 06:43:00 + HAGIO KAZUHITO(萩尾 一仁) wrote: > -Original Message- > > Once the last dumpable page was processed there is no need to finish the > > loop to the last page. Thus exit early to improve performance. > > > > Signed-off-by: Philipp Rudo > > --- > > makedumpfile.c | 6 ++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/makedumpfile.c b/makedumpfile.c > > index 2ef3458..c944d0e 100644 > > --- a/makedumpfile.c > > +++ b/makedumpfile.c > > @@ -8884,6 +8884,12 @@ write_kdump_pages_cyclic(struct cache_data > > *cd_header, struct cache_data *cd_pag > > > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > > > + /* > > +* All dumpable pages have been processed. No need to continue. > > +*/ > > + if (num_dumped == info->num_dumpable) > > + break; > > This patch is likely to increase the possibility of failure to capture > /proc/kcore, although this is an unofficial functionality... > > # makedumpfile -ld31 /proc/kcore kcore.snap > # crash vmlinux kcore.snap > ... > crash: page incomplete: kernel virtual address: 916fbeffed00 type: > "pglist node_id" > > In cyclic mode, makedumpfile first calculates only info->num_dumpable [1] and > frees the used bitmap, and later creates 2nd bitmap again [2] at this time. > > create_dumpfile > create_dump_bitmap > info->num_dumpable = get_num_dumpable_cyclic() <<-- [1] > writeout_dumpfile > write_kdump_pages_and_bitmap_cyclic > foreach cycle > create_2nd_bitmap <<-- [2] > write_kdump_pages_cyclic > > So with live system, num_dumped can exceed info->num_dumpable. > If it stops at info->num_dumpable, some necessary data can be missed. thanks for the explanation! I haven't considered that case and assumed info->num_dumpable is constant. > Capturing /proc/kcore is still fragile and not considered enough, but > sometimes useful... when I want to capture a snapshot of memory. > > (the bitmap is allocated as block, so probably it's working as some buffer?) > > So I will merge the 1/2 patch, but personally would not like to merge > this patch. How necessary is this? I don't think this patch is absolutely necessary. It's only a small performance improvement but shouldn't have a huge impact. If you like you can drop the patch. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 2/2] makedumpfile: break loop after last dumpable page
Once the last dumpable page was processed there is no need to finish the loop to the last page. Thus exit early to improve performance. Signed-off-by: Philipp Rudo --- makedumpfile.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/makedumpfile.c b/makedumpfile.c index 2ef3458..c944d0e 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -8884,6 +8884,12 @@ write_kdump_pages_cyclic(struct cache_data *cd_header, struct cache_data *cd_pag for (pfn = start_pfn; pfn < end_pfn; pfn++) { + /* +* All dumpable pages have been processed. No need to continue. +*/ + if (num_dumped == info->num_dumpable) + break; + /* * Check the excluded page. */ -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 1/2] makedumpfile: omit unnecessary calls to print_progress
Check first if a page is dumpable before printing the process. Otherwise there is the chance that num_dumped % per == 0 at the beginning of the block of undampable pages. In that case num_dumped isn't updated before the next dumpable page and thus print_process is called for every page in that block. This is especially annoying when the block is after the last dumpable page and thus num_dumped == info->num_dumpable. In that case print_process will bypass its check to only print the process once every second and thus flood the console with unnecessary prints. This can lead to a severe decrease in performance especially when the console is in line mode. Signed-off-by: Philipp Rudo --- makedumpfile.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index 14556db..2ef3458 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -8884,16 +8884,16 @@ write_kdump_pages_cyclic(struct cache_data *cd_header, struct cache_data *cd_pag for (pfn = start_pfn; pfn < end_pfn; pfn++) { - if ((num_dumped % per) == 0) - print_progress(PROGRESS_COPY, num_dumped, info->num_dumpable, &ts_start); - /* * Check the excluded page. */ if (!is_dumpable(info->bitmap2, pfn, cycle)) continue; + if ((num_dumped % per) == 0) + print_progress(PROGRESS_COPY, num_dumped, info->num_dumpable, &ts_start); num_dumped++; + if (!read_pfn(pfn, buf)) goto out; -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 0/2] makedumpfile: improve write_kdump_pages_cyclic
Hi, I noticed that on s390 makedumpfile sometimes floods the console with Copying data : [100.0 %] - eta: 0s causing the dump process to slow down dramatically. The patches attached are two somewhat separate approaches to fix this issue. In my opinion both patches make sense to be included even when one of them would be enough to fix the problem described above. Thanks Philipp Philipp Rudo (2): makedumpfile: omit unnecessary calls to print_progress makedumpfile: break loop after last dumpable page makedumpfile.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: running into OOM killer with kexec loading large ramdisk
Hi Tobias, On Wed, 23 Mar 2022 12:14:59 +0100 Tobias Powalowski wrote: > Hi, > again me, > I try to load a 900MB ramdisk on a ramfs rootfs with kexec and a 10MB kernel. > With 2800MB RAM assigned to qemu. > Memory free by /proc/memstat: 2.2GB > It keeps on OOM killed while executing: > kexec -l kernel --initrd=initrd.img > I can safely unpack the initrd in the ramfs without getting OOM killed. > What is kexec doing wrong here? I don't think that kexec is doing anything wrong here. The kexec_load syscall is designed in a way that user space prepares everything in a huge buffer and passes a pointer to it to the system call. The systemcall then needs to copy everything from the user buffer to a kernel buffer. So there are three copies of the initrd in memory (including the one in ramfs). Together they take up 3 * 900MB = 2.7GB. So basically all the memory of your guest. You can try using the kexec_file_load systemcall. That at least eliminates the copy in the user buffer so there are only two copies left. 400MB of free memory is still quite scarce, though. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH] util_lib/elf_info: harden parsing of printk buffer
The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size buffer (log_buf) that contains all messages. The location for the next message is stored in log_next_idx. In case the log_buf runs full log_next_idx wraps around and starts overwriting old messages at the beginning of the buffer. The wraparound is denoted by a message with msg->len == 0. Following the behavior described above blindly is dangerous as e.g. a memory corruption could overwrite (parts of) the log_buf. If the corruption adds a message with msg->len == 0 this leads to an endless loop when dumping the dmesg. Fix this by verifying that not wrapped around before when it encounters a message with msg->len == 0. While at it also verify that the index is within the log_buf and thus guard against corruptions with msg->len != 0. The same bug has been reported and fixed in makedumpfile [1]. [1] http://lists.infradead.org/pipermail/kexec/2022-March/024272.html Signed-off-by: Philipp Rudo --- util_lib/elf_info.c | 28 +--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c index d252eff..ce71c60 100644 --- a/util_lib/elf_info.c +++ b/util_lib/elf_info.c @@ -763,8 +763,9 @@ static void dump_dmesg_structured(int fd, void (*handler)(char*, unsigned int)) { #define OUT_BUF_SIZE 4096 uint64_t log_buf, log_buf_offset, ts_nsec; - uint32_t log_first_idx, log_next_idx, current_idx, len = 0, i; + uint32_t log_buf_len, log_first_idx, log_next_idx, current_idx, len = 0, i; char *buf, out_buf[OUT_BUF_SIZE]; + bool has_wrapped_around = false; ssize_t ret; char *msg; uint16_t text_len; @@ -811,6 +812,7 @@ static void dump_dmesg_structured(int fd, void (*handler)(char*, unsigned int)) } log_buf = read_file_pointer(fd, vaddr_to_offset(log_buf_vaddr)); + log_buf_len = read_file_s32(fd, vaddr_to_offset(log_buf_len_vaddr)); log_first_idx = read_file_u32(fd, vaddr_to_offset(log_first_idx_vaddr)); log_next_idx = read_file_u32(fd, vaddr_to_offset(log_next_idx_vaddr)); @@ -882,11 +884,31 @@ static void dump_dmesg_structured(int fd, void (*handler)(char*, unsigned int)) * and read the message at the start of the buffer. */ loglen = struct_val_u16(buf, log_offset_len); - if (!loglen) + if (!loglen) { + if (has_wrapped_around) { + if (len && handler) + handler(out_buf, len); + fprintf(stderr, "Cycle when parsing dmesg detected.\n"); + fprintf(stderr, "The prink log_buf is most likely corrupted.\n"); + fprintf(stderr, "log_buf = 0x%lx, idx = 0x%x\n", + log_buf, current_idx); + exit(68); + } current_idx = 0; - else + has_wrapped_around = true; + } else { /* Move to next record */ current_idx += loglen; + if(current_idx > log_buf_len - log_sz) { + if (len && handler) + handler(out_buf, len); + fprintf(stderr, "Index outside log_buf detected.\n"); + fprintf(stderr, "The prink log_buf is most likely corrupted.\n"); + fprintf(stderr, "log_buf = 0x%lx, idx = 0x%x\n", + log_buf, current_idx); + exit(69); + } + } } free(buf); if (len && handler) -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 0/4] makedumpfile: harden parsing of old prink buffer
Hi, dumping the dmesg can cause an endless loop for the old prink mechanism (> v3.5.0 and < v5.10.0) when the log_buf got corrupted. This series fixes those cases by adding a cycle detection. The cycle detection is implemented in a generic way so that it can be reused in other parts of makedumpfile. Thanks Philipp v2: * Rename 'idx' to 'ptr' * Also print the non-loop part when a cycle was detected. Such a situation can happen when log_buf wrapped around in the kernel (log_first_idx != 0) and the corruption occurred on an idx < log_first_idx. * Add patch 4 fixing a bug independent from the memory corruption but found while investigating it. Philipp Rudo (4): makedumpfile: add generic cycle detection makedumpfile: use pointer arithmetics for dump_dmesg makedumpfile: use cycle detection when parsing the prink log_buf makedumpfile: print error when reading with unsupported compression Makefile | 2 +- detect_cycle.c | 99 + detect_cycle.h | 40 +++ makedumpfile.c | 131 - 4 files changed, 247 insertions(+), 25 deletions(-) create mode 100644 detect_cycle.c create mode 100644 detect_cycle.h -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 3/4] makedumpfile: use cycle detection when parsing the prink log_buf
The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size buffer (log_buf) that contains all messages. The location for the next message is stored in log_next_idx. In case the log_buf runs full log_next_idx wraps around and starts overwriting old messages at the beginning of the buffer. The wraparound is denoted by a message with msg->len == 0. Following the behavior described above blindly in makedumpfile is dangerous as e.g. a memory corruption could overwrite (parts of) the log_buf. If the corruption adds a message with msg->len == 0 this leads to an endless loop when dumping the dmesg with makedumpfile appending the messages up to the corruption over and over again to the output file until file system is full. Fix this by using cycle detection and aboard once one is detected. While at it also verify that the index is within the log_buf and thus guard against corruptions with msg->len != 0. Reported-by: Audra Mitchell Suggested-by: Dave Wysochanski Signed-off-by: Philipp Rudo --- makedumpfile.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/makedumpfile.c b/makedumpfile.c index 9a05c96..b7ac999 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -15,6 +15,7 @@ */ #include "makedumpfile.h" #include "print_info.h" +#include "detect_cycle.h" #include "dwarf_info.h" #include "elf_info.h" #include "erase_info.h" @@ -5528,10 +5529,11 @@ dump_dmesg() unsigned long index, log_buf, log_end; unsigned int log_first_idx, log_next_idx; unsigned long long first_idx_sym; + struct detect_cycle *dc = NULL; unsigned long log_end_2_6_24; unsigned log_end_2_6_25; char *log_buffer = NULL, *log_ptr = NULL; - char *ptr; + char *ptr, *next_ptr; /* * log_end has been changed to "unsigned" since linux-2.6.25. @@ -5679,12 +5681,55 @@ dump_dmesg() goto out; } ptr = log_buffer + log_first_idx; + dc = dc_init(ptr, log_buffer, log_next); while (ptr != log_buffer + log_next_idx) { log_ptr = log_from_ptr(ptr, log_buffer); if (!dump_log_entry(log_ptr, info->fd_dumpfile, info->name_dumpfile)) goto out; ptr = log_next(ptr, log_buffer); + if (dc_next(dc, (void **) &next_ptr)) { + unsigned long len; + int in_cycle; + char *first; + + /* Clear everything we have already written... */ + ftruncate(info->fd_dumpfile, 0); + lseek(info->fd_dumpfile, 0, SEEK_SET); + + /* ...and only write up to the corruption. */ + dc_find_start(dc, (void **) &first, &len); + ptr = log_buffer + log_first_idx; + in_cycle = FALSE; + while (len) { + log_ptr = log_from_ptr(ptr, log_buffer); + if (!dump_log_entry(log_ptr, + info->fd_dumpfile, + info->name_dumpfile)) + goto out; + ptr = log_next(ptr, log_buffer); + + if (log_ptr == first) + in_cycle = TRUE; + + if (in_cycle) + len--; + } + ERRMSG("Cycle when parsing dmesg detected.\n"); + ERRMSG("The printk log_buf is most likely corrupted.\n"); + ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", log_buf, ptr - log_buffer); + close_files_for_creating_dumpfile(); + goto out; + } + if (next_ptr < log_buffer || + next_ptr > log_buffer + log_buf_len - SIZE(printk_log)) { + ERRMSG("Index outside log_buf detected.\n"); + ERRMSG("The printk log_buf is most likely corrupted.\n"); + ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", log_buf, ptr - log_buffer); + close_files_for_creating_dumpfile();
[PATCH v2 4/4] makedumpfile: print error when reading with unsupported compression
Currently makedumpfile only checks if the required compression algorithm was enabled during build when compressing a dump but not when reading from one. This can lead to situations where, one version of makedumpfile creates the dump using a compression algorithm an other version of makedumpfile doesn't support. When the second version now tries to, e.g. extract the dmesg from the dump it will fail with an error similar to # makedumpfile --dump-dmesg vmcore dmesg.txt __vtop4_x86_64: Can't get a valid pgd. readmem: Can't convert a virtual address(92e18284) to physical address. readmem: type_addr: 0, addr:92e18284, size:390 check_release: Can't get the address of system_utsname. makedumpfile Failed. That's because readpage_kdump_compressed{_parallel} does not return with an error if the page it is trying to read is compressed with an unsupported compression algorithm. Thus readmem copies random data from the (uninitialized) cachebuf to its caller and thus causing the error above. Fix this by checking if the required compression algorithm is supported in readpage_kdump_compressed{_parallel} and print a proper error message if it isn't. Reported-by: Dave Wysochanski Signed-off-by: Philipp Rudo --- makedumpfile.c | 56 ++ 1 file changed, 48 insertions(+), 8 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index b7ac999..56f3b6c 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -865,9 +865,14 @@ readpage_kdump_compressed(unsigned long long paddr, void *bufptr) ERRMSG("Uncompress failed: %d\n", ret); goto out_error; } + } else if ((pd.flags & DUMP_DH_COMPRESSED_LZO)) { #ifdef USELZO - } else if (info->flag_lzo_support - && (pd.flags & DUMP_DH_COMPRESSED_LZO)) { + if (!info->flag_lzo_support) { + ERRMSG("lzo compression unsupported\n"); + out = FALSE; + goto out_error; + } + retlen = info->page_size; ret = lzo1x_decompress_safe((unsigned char *)buf, pd.size, (unsigned char *)bufptr, &retlen, @@ -876,9 +881,14 @@ readpage_kdump_compressed(unsigned long long paddr, void *bufptr) ERRMSG("Uncompress failed: %d\n", ret); goto out_error; } +#else + ERRMSG("lzo compression unsupported\n"); + ERRMSG("Try `make USELZO=on` when building.\n"); + out = FALSE; + goto out_error; #endif -#ifdef USESNAPPY } else if ((pd.flags & DUMP_DH_COMPRESSED_SNAPPY)) { +#ifdef USESNAPPY ret = snappy_uncompressed_length(buf, pd.size, (size_t *)&retlen); if (ret != SNAPPY_OK) { @@ -891,14 +901,24 @@ readpage_kdump_compressed(unsigned long long paddr, void *bufptr) ERRMSG("Uncompress failed: %d\n", ret); goto out_error; } +#else + ERRMSG("snappy compression unsupported\n"); + ERRMSG("Try `make USESNAPPY=on` when building.\n"); + out = FALSE; + goto out_error; #endif -#ifdef USEZSTD } else if ((pd.flags & DUMP_DH_COMPRESSED_ZSTD)) { +#ifdef USEZSTD ret = ZSTD_decompress(bufptr, info->page_size, buf, pd.size); if (ZSTD_isError(ret) || (ret != info->page_size)) { ERRMSG("Uncompress failed: %d\n", ret); goto out_error; } +#else + ERRMSG("zstd compression unsupported\n"); + ERRMSG("Try `make USEZSTD=on` when building.\n"); + out = FALSE; + goto out_error; #endif } @@ -964,9 +984,14 @@ readpage_kdump_compressed_parallel(int fd_memory, unsigned long long paddr, ERRMSG("Uncompress failed: %d\n", ret); goto out_error; } + } else if ((pd.flags & DUMP_DH_COMPRESSED_LZO)) { #ifdef USELZO - } else if (info->flag_lzo_support - && (pd.flags & DUMP_DH_COMPRESSED_LZO)) { + if (!info->flag_lzo_support) { + ERRMSG("lzo compression unsupported\n"); + out = FALSE; + goto out_error; + } + retlen = info->page_size; ret = lzo1x_decompress_safe((unsigned char *)buf, pd.size, (unsigned char *)bufptr, &retlen, @@ -975,9 +1000,14
[PATCH v2 1/4] makedumpfile: add generic cycle detection
In order to work makedumpfile needs to interpret data read from the dump. This can cause problems as the data from the dump cannot be trusted (otherwise the kernel wouldn't have panicked in the first place). This also means that every loop which stop condition depend on data read from the dump has a chance to loop forever. Thus add a generic cycle detection mechanism that allows to detect and handle such situations appropriately. For cycle detection use Brent's algorithm [1] as it has constant memory usage. With this it can also be used in the kdump kernel without the danger that it runs oom when iterating large data structures. Furthermore it only depends on some pointer arithmetic. Thus the performance impact (as long as no cycle was detected) should be comparatively small. [1] https://en.wikipedia.org/wiki/Cycle_detection#Brent's_algorithm Suggested-by: Dave Wysochanski Signed-off-by: Philipp Rudo --- Makefile | 2 +- detect_cycle.c | 99 ++ detect_cycle.h | 40 3 files changed, 140 insertions(+), 1 deletion(-) create mode 100644 detect_cycle.c create mode 100644 detect_cycle.h diff --git a/Makefile b/Makefile index f118b31..3441364 100644 --- a/Makefile +++ b/Makefile @@ -45,7 +45,7 @@ CFLAGS_ARCH += -m32 endif SRC_BASE = makedumpfile.c makedumpfile.h diskdump_mod.h sadump_mod.h sadump_info.h -SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c cache.c tools.c printk.c +SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c cache.c tools.c printk.c detect_cycle.c OBJ_PART=$(patsubst %.c,%.o,$(SRC_PART)) SRC_ARCH = arch/arm.c arch/arm64.c arch/x86.c arch/x86_64.c arch/ia64.c arch/ppc64.c arch/s390x.c arch/ppc.c arch/sparc64.c arch/mips64.c OBJ_ARCH=$(patsubst %.c,%.o,$(SRC_ARCH)) diff --git a/detect_cycle.c b/detect_cycle.c new file mode 100644 index 000..6b551a7 --- /dev/null +++ b/detect_cycle.c @@ -0,0 +1,99 @@ +/* + * detect_cycle.c -- Generic cycle detection using Brent's algorithm + * + * Created by: Philipp Rudo + * + * Copyright (c) 2022 Red Hat, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include + +#include "detect_cycle.h" + +struct detect_cycle { + /* First entry of the list */ + void *head; + + /* Variables required by Brent's algorithm */ + void *fast_p; + void *slow_p; + unsigned long length; + unsigned long power; + + /* Function to get the next entry in the list */ + dc_next_t next; + + /* Private data passed to next */ + void *data; +}; + +struct detect_cycle *dc_init(void *head, void *data, dc_next_t next) +{ + struct detect_cycle *new; + + new = malloc(sizeof(*new)); + if (!new) + return NULL; + + new->next = next; + new->data = data; + + new->head = head; + new->slow_p = head; + new->fast_p = head; + new->length = 0; + new->power = 2; + + return new; +} + +int dc_next(struct detect_cycle *dc, void **next) +{ + + if (dc->length == dc->power) { + dc->length = 0; + dc->power *= 2; + dc->slow_p = dc->fast_p; + } + + dc->fast_p = dc->next(dc->fast_p, dc->data); + dc->length++; + + if (dc->slow_p == dc->fast_p) + return 1; + + *next = dc->fast_p; + return 0; +} + +void dc_find_start(struct detect_cycle *dc, void **first, unsigned long *len) +{ + void *slow_p, *fast_p; + unsigned long tmp; + + slow_p = fast_p = dc->head; + tmp = dc->length; + + while (tmp) { + fast_p = dc->next(fast_p, dc->data); + tmp--; + } + + while (slow_p != fast_p) { + slow_p = dc->next(slow_p, dc->data); + fast_p = dc->next(fast_p, dc->data); + } + + *first = slow_p; + *len = dc->length; +} diff --git a/detect_cycle.h b/detect_cycle.h new file mode 100644 index 000..2ca75c7 --- /dev/null +++ b/detect_cycle.h @@ -0,0 +1,40 @@ +/* + * detect_cycle.h -- Generic cycle detection using Brent's algorithm + * + * Created by: Philipp Rudo + * + * Copyright (c) 2022 Red Hat, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the ter
[PATCH v2 2/4] makedumpfile: use pointer arithmetics for dump_dmesg
When parsing the printk buffer for the old printk mechanism (> v3.5.0+ and < 5.10.0) a log entry is currently specified by the offset into the buffer where the entry starts. Change this to use a pointers instead. This is done in preparation for using the new cycle detection mechanism. Signed-off-by: Philipp Rudo --- makedumpfile.c | 29 + 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index 7ed9756..9a05c96 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -5482,13 +5482,10 @@ dump_log_entry(char *logptr, int fp, const char *file_name) * get log record by index; idx must point to valid message. */ static char * -log_from_idx(unsigned int idx, char *logbuf) +log_from_ptr(char *logptr, char *logbuf) { - char *logptr; unsigned int msglen; - logptr = logbuf + idx; - /* * A length == 0 record is the end of buffer marker. * Wrap around and return the message at the start of @@ -5502,14 +5499,13 @@ log_from_idx(unsigned int idx, char *logbuf) return logptr; } -static long -log_next(unsigned int idx, char *logbuf) +static void * +log_next(void *_logptr, void *_logbuf) { - char *logptr; + char *logptr = _logptr; + char *logbuf = _logbuf; unsigned int msglen; - logptr = logbuf + idx; - /* * A length == 0 record is the end of buffer marker. Wrap around and * read the message at the start of the buffer as *this* one, and @@ -5519,10 +5515,10 @@ log_next(unsigned int idx, char *logbuf) msglen = USHORT(logptr + OFFSET(printk_log.len)); if (!msglen) { msglen = USHORT(logbuf + OFFSET(printk_log.len)); - return msglen; + return logbuf + msglen; } - return idx + msglen; + return logptr + msglen; } int @@ -5530,11 +5526,12 @@ dump_dmesg() { int log_buf_len, length_log, length_oldlog, ret = FALSE; unsigned long index, log_buf, log_end; - unsigned int idx, log_first_idx, log_next_idx; + unsigned int log_first_idx, log_next_idx; unsigned long long first_idx_sym; unsigned long log_end_2_6_24; unsigned log_end_2_6_25; char *log_buffer = NULL, *log_ptr = NULL; + char *ptr; /* * log_end has been changed to "unsigned" since linux-2.6.25. @@ -5681,13 +5678,13 @@ dump_dmesg() ERRMSG("Can't open output file.\n"); goto out; } - idx = log_first_idx; - while (idx != log_next_idx) { - log_ptr = log_from_idx(idx, log_buffer); + ptr = log_buffer + log_first_idx; + while (ptr != log_buffer + log_next_idx) { + log_ptr = log_from_ptr(ptr, log_buffer); if (!dump_log_entry(log_ptr, info->fd_dumpfile, info->name_dumpfile)) goto out; - idx = log_next(idx, log_buffer); + ptr = log_next(ptr, log_buffer); } if (!close_files_for_creating_dumpfile()) goto out; -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/3] makedumpfile: use cycle detection when parsing the prink log_buf
Hi Kazu, On Fri, 11 Mar 2022 07:59:47 + HAGIO KAZUHITO(萩尾 一仁) wrote: > -Original Message- > > Hi Dave, > > > > On Wed, 9 Mar 2022 03:48:12 -0500 > > David Wysochanski wrote: > > > > > On Mon, Mar 7, 2022 at 12:23 PM Philipp Rudo wrote: > > > > > > > > The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size > > > > buffer (log_buf) that contains all messages. The location for the next > > > > message is stored in log_next_idx. In case the log_buf runs full > > > > log_next_idx wraps around and starts overwriting old messages at the > > > > beginning of the buffer. The wraparound is denoted by a message with > > > > msg->len == 0. > > > > > > > > Following the behavior described above blindly in makedumpfile is > > > > dangerous as e.g. a memory corruption could overwrite (parts of) the > > > > log_buf. If the corruption adds a message with msg->len == 0 this leads > > > > to an endless loop when dumping the dmesg with makedumpfile appending > > > > the messages up to the corruption over and over again to the output file > > > > until file system is full. Fix this by using cycle detection and aboard > > > > once one is detected. > > > > > > > > While at it also verify that the index is within the log_buf and thus > > > > guard against corruptions with msg->len != 0. > > > > > > > > Fixes: 36c2458 ("[PATCH] --dump-dmesg fix for post 3.5 kernels.") > > > > Reported-by: Audra Mitchell > > > > Suggested-by: Dave Wysochanski > > > > Signed-off-by: Philipp Rudo > > > > --- > > > > makedumpfile.c | 42 -- > > > > 1 file changed, 40 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/makedumpfile.c b/makedumpfile.c > > > > index edf128b..2738d16 100644 > > > > --- a/makedumpfile.c > > > > +++ b/makedumpfile.c > > > > @@ -15,6 +15,7 @@ > > > > */ > > > > #include "makedumpfile.h" > > > > #include "print_info.h" > > > > +#include "detect_cycle.h" > > > > #include "dwarf_info.h" > > > > #include "elf_info.h" > > > > #include "erase_info.h" > > > > @@ -5528,10 +5529,11 @@ dump_dmesg() > > > > unsigned long index, log_buf, log_end; > > > > unsigned int log_first_idx, log_next_idx; > > > > unsigned long long first_idx_sym; > > > > + struct detect_cycle *dc = NULL; > > > > unsigned long log_end_2_6_24; > > > > unsigned log_end_2_6_25; > > > > char *log_buffer = NULL, *log_ptr = NULL; > > > > - char *idx; > > > > + char *idx, *next_idx; > > > > > > > > > > Would be clearer to call the above "next_ptr" rather than "next_idx" > > > (as far as I know 'index' refers to 32-bit quantities). > > > Same comment about the "idx" variable, maybe "ptr"? > > > > Hmm... I stuck with idx as the kernel uses the same name. In my > > opinion using the same name makes it easier to see that both variables > > contain the same "quantity" even when the implementation is slightly > > different (in the kernel idx is the offset in the log_buf just like it > > was in makedumpfile before patch 2). But my opinion isn't very strong > > on the naming. So when the consent is to rename the variable I'm open > > to do it. > > > > @Kazu: Do you have a preference here? > > Personally I think it will be more readable to use "*ptr" for pointers > in this case, as Dave says. ok, then I'll rename it in v2. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/3] makedumpfile: use cycle detection when parsing the prink log_buf
Hi Dave, On Wed, 9 Mar 2022 03:48:12 -0500 David Wysochanski wrote: > On Mon, Mar 7, 2022 at 12:23 PM Philipp Rudo wrote: > > > > The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size > > buffer (log_buf) that contains all messages. The location for the next > > message is stored in log_next_idx. In case the log_buf runs full > > log_next_idx wraps around and starts overwriting old messages at the > > beginning of the buffer. The wraparound is denoted by a message with > > msg->len == 0. > > > > Following the behavior described above blindly in makedumpfile is > > dangerous as e.g. a memory corruption could overwrite (parts of) the > > log_buf. If the corruption adds a message with msg->len == 0 this leads > > to an endless loop when dumping the dmesg with makedumpfile appending > > the messages up to the corruption over and over again to the output file > > until file system is full. Fix this by using cycle detection and aboard > > once one is detected. > > > > While at it also verify that the index is within the log_buf and thus > > guard against corruptions with msg->len != 0. > > > > Fixes: 36c2458 ("[PATCH] --dump-dmesg fix for post 3.5 kernels.") > > Reported-by: Audra Mitchell > > Suggested-by: Dave Wysochanski > > Signed-off-by: Philipp Rudo > > --- > > makedumpfile.c | 42 -- > > 1 file changed, 40 insertions(+), 2 deletions(-) > > > > diff --git a/makedumpfile.c b/makedumpfile.c > > index edf128b..2738d16 100644 > > --- a/makedumpfile.c > > +++ b/makedumpfile.c > > @@ -15,6 +15,7 @@ > > */ > > #include "makedumpfile.h" > > #include "print_info.h" > > +#include "detect_cycle.h" > > #include "dwarf_info.h" > > #include "elf_info.h" > > #include "erase_info.h" > > @@ -5528,10 +5529,11 @@ dump_dmesg() > > unsigned long index, log_buf, log_end; > > unsigned int log_first_idx, log_next_idx; > > unsigned long long first_idx_sym; > > + struct detect_cycle *dc = NULL; > > unsigned long log_end_2_6_24; > > unsigned log_end_2_6_25; > > char *log_buffer = NULL, *log_ptr = NULL; > > - char *idx; > > + char *idx, *next_idx; > > > > Would be clearer to call the above "next_ptr" rather than "next_idx" > (as far as I know 'index' refers to 32-bit quantities). > Same comment about the "idx" variable, maybe "ptr"? Hmm... I stuck with idx as the kernel uses the same name. In my opinion using the same name makes it easier to see that both variables contain the same "quantity" even when the implementation is slightly different (in the kernel idx is the offset in the log_buf just like it was in makedumpfile before patch 2). But my opinion isn't very strong on the naming. So when the consent is to rename the variable I'm open to do it. @Kazu: Do you have a preference here? Same for your comments in patch 2. > > /* > > * log_end has been changed to "unsigned" since linux-2.6.25. > > @@ -5679,12 +5681,47 @@ dump_dmesg() > > goto out; > > } > > idx = log_buffer + log_first_idx; > > + dc = dc_init(idx, log_buffer, log_next); > > while (idx != log_buffer + log_next_idx) { > > log_ptr = log_from_idx(idx, log_buffer); > > if (!dump_log_entry(log_ptr, info->fd_dumpfile, > > info->name_dumpfile)) > > goto out; > > - idx = log_next(idx, log_buffer); > > + if (dc_next(dc, (void **) &next_idx)) { > > + unsigned long len; > > + char *first; > > + > > + /* Clear everything we have already > > written... */ > > + ftruncate(info->fd_dumpfile, 0); > > + lseek(info->fd_dumpfile, 0, SEEK_SET); > > + > > I'm not sure I understand why you're doing this. That's because in every pass of the loop the entry is written to the output file. So most likely the output file already contains more than needed. Thus we somehow need to trim the file to the end of the cycle. Thus I decided to go brute force and simply clear all content from the file and wr
[PATCH 0/3] makedumpfile: harden parsing of old prink buffer
Hi, dumping the dmesg can cause an endless loop for the old prink mechanism (> v3.5.0 and < v5.10.0) when the log_buf got corrupted. This series fixes those cases by adding a cycle detection. The cycle detection is implemented in a generic way so that it can be reused in other parts of makedumpfile. Thanks Philipp Philipp Rudo (3): makedumpfile: add generic cycle detection makedumpfile: use pointer arithmetics for dump_dmesg makedumpfile: use cycle detection when parsing the prink log_buf Makefile | 2 +- detect_cycle.c | 99 ++ detect_cycle.h | 40 makedumpfile.c | 65 + 4 files changed, 190 insertions(+), 16 deletions(-) create mode 100644 detect_cycle.c create mode 100644 detect_cycle.h -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 3/3] makedumpfile: use cycle detection when parsing the prink log_buf
The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size buffer (log_buf) that contains all messages. The location for the next message is stored in log_next_idx. In case the log_buf runs full log_next_idx wraps around and starts overwriting old messages at the beginning of the buffer. The wraparound is denoted by a message with msg->len == 0. Following the behavior described above blindly in makedumpfile is dangerous as e.g. a memory corruption could overwrite (parts of) the log_buf. If the corruption adds a message with msg->len == 0 this leads to an endless loop when dumping the dmesg with makedumpfile appending the messages up to the corruption over and over again to the output file until file system is full. Fix this by using cycle detection and aboard once one is detected. While at it also verify that the index is within the log_buf and thus guard against corruptions with msg->len != 0. Fixes: 36c2458 ("[PATCH] --dump-dmesg fix for post 3.5 kernels.") Reported-by: Audra Mitchell Suggested-by: Dave Wysochanski Signed-off-by: Philipp Rudo --- makedumpfile.c | 42 -- 1 file changed, 40 insertions(+), 2 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index edf128b..2738d16 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -15,6 +15,7 @@ */ #include "makedumpfile.h" #include "print_info.h" +#include "detect_cycle.h" #include "dwarf_info.h" #include "elf_info.h" #include "erase_info.h" @@ -5528,10 +5529,11 @@ dump_dmesg() unsigned long index, log_buf, log_end; unsigned int log_first_idx, log_next_idx; unsigned long long first_idx_sym; + struct detect_cycle *dc = NULL; unsigned long log_end_2_6_24; unsigned log_end_2_6_25; char *log_buffer = NULL, *log_ptr = NULL; - char *idx; + char *idx, *next_idx; /* * log_end has been changed to "unsigned" since linux-2.6.25. @@ -5679,12 +5681,47 @@ dump_dmesg() goto out; } idx = log_buffer + log_first_idx; + dc = dc_init(idx, log_buffer, log_next); while (idx != log_buffer + log_next_idx) { log_ptr = log_from_idx(idx, log_buffer); if (!dump_log_entry(log_ptr, info->fd_dumpfile, info->name_dumpfile)) goto out; - idx = log_next(idx, log_buffer); + if (dc_next(dc, (void **) &next_idx)) { + unsigned long len; + char *first; + + /* Clear everything we have already written... */ + ftruncate(info->fd_dumpfile, 0); + lseek(info->fd_dumpfile, 0, SEEK_SET); + + /* ...and only write up to the corruption. */ + dc_find_start(dc, (void **) &first, &len); + idx = log_buffer + log_first_idx; + while (len) { + log_ptr = log_from_idx(idx, log_buffer); + if (!dump_log_entry(log_ptr, + info->fd_dumpfile, + info->name_dumpfile)) + goto out; + idx = log_next(idx, log_buffer); + len--; + } + ERRMSG("Cycle when parsing dmesg detected.\n"); + ERRMSG("The printk log_buf is most likely corrupted.\n"); + ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", log_buf, idx - log_buffer); + close_files_for_creating_dumpfile(); + goto out; + } + if (next_idx < log_buffer || + next_idx > log_buffer + log_buf_len - SIZE(printk_log)) { + ERRMSG("Index outside log_buf detected.\n"); + ERRMSG("The printk log_buf is most likely corrupted.\n"); + ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", log_buf, idx - log_buffer); + close_files_for_creating_dumpfile(); + goto out; + } + idx = next_idx; } if (!close_files_for_creating_dumpfil
[PATCH 2/3] makedumpfile: use pointer arithmetics for dump_dmesg
When parsing the printk buffer for the old printk mechanism (> v3.5.0+ and < 5.10.0) a log entry is currently specified by the offset into the buffer where the entry starts. Change this to use a pointers instead. This is done in preparation for using the new cycle detection mechanism. Signed-off-by: Philipp Rudo --- makedumpfile.c | 25 +++-- 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/makedumpfile.c b/makedumpfile.c index 7ed9756..edf128b 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -5482,13 +5482,10 @@ dump_log_entry(char *logptr, int fp, const char *file_name) * get log record by index; idx must point to valid message. */ static char * -log_from_idx(unsigned int idx, char *logbuf) +log_from_idx(char *logptr, char *logbuf) { - char *logptr; unsigned int msglen; - logptr = logbuf + idx; - /* * A length == 0 record is the end of buffer marker. * Wrap around and return the message at the start of @@ -5502,14 +5499,13 @@ log_from_idx(unsigned int idx, char *logbuf) return logptr; } -static long -log_next(unsigned int idx, char *logbuf) +static void * +log_next(void *_logptr, void *_logbuf) { - char *logptr; + char *logptr = _logptr; + char *logbuf = _logbuf; unsigned int msglen; - logptr = logbuf + idx; - /* * A length == 0 record is the end of buffer marker. Wrap around and * read the message at the start of the buffer as *this* one, and @@ -5519,10 +5515,10 @@ log_next(unsigned int idx, char *logbuf) msglen = USHORT(logptr + OFFSET(printk_log.len)); if (!msglen) { msglen = USHORT(logbuf + OFFSET(printk_log.len)); - return msglen; + return logbuf + msglen; } - return idx + msglen; + return logptr + msglen; } int @@ -5530,11 +5526,12 @@ dump_dmesg() { int log_buf_len, length_log, length_oldlog, ret = FALSE; unsigned long index, log_buf, log_end; - unsigned int idx, log_first_idx, log_next_idx; + unsigned int log_first_idx, log_next_idx; unsigned long long first_idx_sym; unsigned long log_end_2_6_24; unsigned log_end_2_6_25; char *log_buffer = NULL, *log_ptr = NULL; + char *idx; /* * log_end has been changed to "unsigned" since linux-2.6.25. @@ -5681,8 +5678,8 @@ dump_dmesg() ERRMSG("Can't open output file.\n"); goto out; } - idx = log_first_idx; - while (idx != log_next_idx) { + idx = log_buffer + log_first_idx; + while (idx != log_buffer + log_next_idx) { log_ptr = log_from_idx(idx, log_buffer); if (!dump_log_entry(log_ptr, info->fd_dumpfile, info->name_dumpfile)) -- 2.35.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 1/3] makedumpfile: add generic cycle detection
In order to work makedumpfile needs to interpret data read from the dump. This can cause problems as the data from the dump cannot be trusted (otherwise the kernel wouldn't have panicked in the first place). This also means that every loop which stop condition depend on data read from the dump has a chance to loop forever. Thus add a generic cycle detection mechanism that allows to detect and handle such situations appropriately. For cycle detection use Brent's algorithm [1] as it has constant memory usage. With this it can also be used in the kdump kernel without the danger that it runs oom when iterating large data structures. Furthermore it only depends on some pointer arithmetic. Thus the performance impact (as long as no cycle was detected) should be comparatively small. [1] https://en.wikipedia.org/wiki/Cycle_detection#Brent's_algorithm Suggested-by: Dave Wysochanski Signed-off-by: Philipp Rudo --- Makefile | 2 +- detect_cycle.c | 99 ++ detect_cycle.h | 40 3 files changed, 140 insertions(+), 1 deletion(-) create mode 100644 detect_cycle.c create mode 100644 detect_cycle.h diff --git a/Makefile b/Makefile index 9f9fd22..e9a474f 100644 --- a/Makefile +++ b/Makefile @@ -45,7 +45,7 @@ CFLAGS_ARCH += -m32 endif SRC_BASE = makedumpfile.c makedumpfile.h diskdump_mod.h sadump_mod.h sadump_info.h -SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c cache.c tools.c printk.c +SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c cache.c tools.c printk.c detect_cycle.c OBJ_PART=$(patsubst %.c,%.o,$(SRC_PART)) SRC_ARCH = arch/arm.c arch/arm64.c arch/x86.c arch/x86_64.c arch/ia64.c arch/ppc64.c arch/s390x.c arch/ppc.c arch/sparc64.c arch/mips64.c OBJ_ARCH=$(patsubst %.c,%.o,$(SRC_ARCH)) diff --git a/detect_cycle.c b/detect_cycle.c new file mode 100644 index 000..6b551a7 --- /dev/null +++ b/detect_cycle.c @@ -0,0 +1,99 @@ +/* + * detect_cycle.c -- Generic cycle detection using Brent's algorithm + * + * Created by: Philipp Rudo + * + * Copyright (c) 2022 Red Hat, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include + +#include "detect_cycle.h" + +struct detect_cycle { + /* First entry of the list */ + void *head; + + /* Variables required by Brent's algorithm */ + void *fast_p; + void *slow_p; + unsigned long length; + unsigned long power; + + /* Function to get the next entry in the list */ + dc_next_t next; + + /* Private data passed to next */ + void *data; +}; + +struct detect_cycle *dc_init(void *head, void *data, dc_next_t next) +{ + struct detect_cycle *new; + + new = malloc(sizeof(*new)); + if (!new) + return NULL; + + new->next = next; + new->data = data; + + new->head = head; + new->slow_p = head; + new->fast_p = head; + new->length = 0; + new->power = 2; + + return new; +} + +int dc_next(struct detect_cycle *dc, void **next) +{ + + if (dc->length == dc->power) { + dc->length = 0; + dc->power *= 2; + dc->slow_p = dc->fast_p; + } + + dc->fast_p = dc->next(dc->fast_p, dc->data); + dc->length++; + + if (dc->slow_p == dc->fast_p) + return 1; + + *next = dc->fast_p; + return 0; +} + +void dc_find_start(struct detect_cycle *dc, void **first, unsigned long *len) +{ + void *slow_p, *fast_p; + unsigned long tmp; + + slow_p = fast_p = dc->head; + tmp = dc->length; + + while (tmp) { + fast_p = dc->next(fast_p, dc->data); + tmp--; + } + + while (slow_p != fast_p) { + slow_p = dc->next(slow_p, dc->data); + fast_p = dc->next(fast_p, dc->data); + } + + *first = slow_p; + *len = dc->length; +} diff --git a/detect_cycle.h b/detect_cycle.h new file mode 100644 index 000..2ca75c7 --- /dev/null +++ b/detect_cycle.h @@ -0,0 +1,40 @@ +/* + * detect_cycle.h -- Generic cycle detection using Brent's algorithm + * + * Created by: Philipp Rudo + * + * Copyright (c) 2022 Red Hat, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the ter
Re: [RFC PATCH] kdump: Add support for crashkernel=auto
Hi Petr, On Fri, 4 Feb 2022 06:34:19 +0100 Petr Tesařík wrote: > Hi Philipp, > > Dne 31. 01. 22 v 11:33 Philipp Rudo napsal(a): > > Hi, > > > > On Fri, 28 Jan 2022 11:31:49 +0100 > > Petr Tesařík wrote: > > > >> Hi Tiezhu Yang, > >> > >> On Jan 28, 2022 at 02:20 Tiezhu Yang wrote: > >>> [...] > >>> Hi Petr, > >>> > >>> Thank you for your reply. > >>> > >>> This is a RFC patch, the initial aim of this patch is to discuss what is > >>> the proper way to support crashkernel=auto. > >> > >> Well, the point I'm trying to make is that crashkernel=auto cannot be > >> implemented. Your code would have to know what happens in the future, > >> and AFAIK time travel has not been discovered yet. ;-) > >> > >> A better approach is to make a very large allocation initially, e.g. > >> half of available RAM. The remaining RAM should still be big enough to > >> start booting the system. Later, when a kdump user-space service knows > >> what it wants to load, it can shrink the reservation by writing a lower > >> value into /sys/kernel/kexec_crash_size. > > > > Even this approach doesn't work in every situation. For example it > > requires that the system has at least twice the RAM it requires to > > safely boot. That's not always given for e.g minimalistic VMs or > > embedded systems. > > If you reserve more RAM for the panic kernel than for running your > actual workload, then you definitely have very special needs, and you > should not expect that everything works out of the box. That was basically the point I was trying to make. There is always a scenario with special needs so that is is basically impossible to find that one solution that works for everybody. > > Furthermore the memory requirement can also change during runtime due > > to, e.g. workload spikes, device hot plug, moving the dump target from > > an un-encrypted to an encrypted disk, etc.. So even when your user-space > > program can exactly calculate the memory requirement at the moment > > it loads kdump it might be too little at the moment the system panics. > > In order for it to work the user-space would constantly need to monitor > > how much memory is needed and adjust the requirement. But that would > > also require to increase the reservation during runtime which would be > > extremely expensive (if possible at all). > > > > All in all I support Petr that time travel is the only proper solution > > for implementing crashkernel=auto. But once we have time travel I > > would prefer to use the gained knowledge to fix the bug that triggered > > the panic rather than calculating the memory requirement for kdump. > > Yeah, long live patching! :-) > > >> The alternative approach does not need any changes to the kernel, except > >> maybe adding something like "crashkernel=max". > > > > A slightly different approach is for the user-space tool to simply set > > the crashkernel= parameter on the kernel commandline for the next boot. > > This also works for memory restrained systems. Needs a reboot though... > > The downside is that if you remove some memory while your system is off, > then a reservation calculate for the previous RAM size may no longer be > possible on the next boot, and the kernel will boot up without any > reservation. That's where "crashkernel=max" would come in handy. Let me > send a patch and see the discussion. True, in that situation our approach will fail. I'm looking forward to see your patch. Thanks Philipp > >>> A moment ago, I find the following patch, it is more flexible, but it is > >>> not merged into the upstream kernel now. > >>> > >>> kernel/crash_core: Add crashkernel=auto for vmcore creation > >>> > >>> https://lore.kernel.org/lkml/20210223174153.72802-1-saeed.mirzamohamm...@oracle.com/ > >>> > > > > The patch was ultimately rejected by Linus > > > > https://lore.kernel.org/linux-mm/20210507010432.in24pudkt%25a...@linux-foundation.org/ > > > > Thanks > > Philipp > > > >>> > >>>> > >>>>> [...] > >>>>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c > >>>>> index 256cf6d..32c51e2 100644 > >>>>> --- a/kernel/crash_core.c > >>>>> +++ b/kernel/crash_core.c > >>>>> @@ -252,6 +252,26 @@ static int __init __parse_c
Re: Is it possible to use a stream for initrd=
Hi Tobias, On Wed, 9 Feb 2022 10:32:24 +0100 Tobias Powalowski wrote: > Sorry you misunderstood, I don't want to create the initrd.img file. > I want to pass the zstd directly to kexec initrd= option. sorry, that is not possible. The kexec-tools expect a file name with --initrd. Simply passing a binary blob to be used as initrd isn't implemented. Thanks Philipp > greetings > tpowa > > Am Mi., 9. Feb. 2022 um 10:28 Uhr schrieb Baoquan He : > > > > On 02/09/22 at 10:16am, Tobias Powalowski wrote: > > > Hi, > > > I have a tmp directory with all files placed in: > > > find . -mindepth 1 -printf '%P\0' | sort -z | LANG=C bsdtar --uid 0 > > > --gid 0 --null -cnf - -T - |\ > > > LANG=C bsdtar --null -cf - --format=newc @- | zstd -T0 > /initrd.img > > > > > > and this initrd.img I want to write into kexec without creating the > > > initrd file. > > > > > > kexec -l /vmlinuz-linux --initrd=/initrd.img --reuse-cmdline > > > systemctl kexec > > > > See manpage of kexec, the EXAMPLE part: > > > > kexec -l /boot/vmlinux --initrd=/boot/initrd --reuse-cmdline > > kexec -e > > > > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC PATCH] kdump: Add support for crashkernel=auto
Hi, On Fri, 28 Jan 2022 11:31:49 +0100 Petr Tesařík wrote: > Hi Tiezhu Yang, > > On Jan 28, 2022 at 02:20 Tiezhu Yang wrote: > >[...] > > Hi Petr, > > > > Thank you for your reply. > > > > This is a RFC patch, the initial aim of this patch is to discuss what is > > the proper way to support crashkernel=auto. > > Well, the point I'm trying to make is that crashkernel=auto cannot be > implemented. Your code would have to know what happens in the future, > and AFAIK time travel has not been discovered yet. ;-) > > A better approach is to make a very large allocation initially, e.g. > half of available RAM. The remaining RAM should still be big enough to > start booting the system. Later, when a kdump user-space service knows > what it wants to load, it can shrink the reservation by writing a lower > value into /sys/kernel/kexec_crash_size. Even this approach doesn't work in every situation. For example it requires that the system has at least twice the RAM it requires to safely boot. That's not always given for e.g minimalistic VMs or embedded systems. Furthermore the memory requirement can also change during runtime due to, e.g. workload spikes, device hot plug, moving the dump target from an un-encrypted to an encrypted disk, etc.. So even when your user-space program can exactly calculate the memory requirement at the moment it loads kdump it might be too little at the moment the system panics. In order for it to work the user-space would constantly need to monitor how much memory is needed and adjust the requirement. But that would also require to increase the reservation during runtime which would be extremely expensive (if possible at all). All in all I support Petr that time travel is the only proper solution for implementing crashkernel=auto. But once we have time travel I would prefer to use the gained knowledge to fix the bug that triggered the panic rather than calculating the memory requirement for kdump. > The alternative approach does not need any changes to the kernel, except > maybe adding something like "crashkernel=max". A slightly different approach is for the user-space tool to simply set the crashkernel= parameter on the kernel commandline for the next boot. This also works for memory restrained systems. Needs a reboot though... > > A moment ago, I find the following patch, it is more flexible, but it is > > not merged into the upstream kernel now. > > > > kernel/crash_core: Add crashkernel=auto for vmcore creation > > > > https://lore.kernel.org/lkml/20210223174153.72802-1-saeed.mirzamohamm...@oracle.com/ > > The patch was ultimately rejected by Linus https://lore.kernel.org/linux-mm/20210507010432.in24pudkt%25a...@linux-foundation.org/ Thanks Philipp > > > >> > >>> [...] > >>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c > >>> index 256cf6d..32c51e2 100644 > >>> --- a/kernel/crash_core.c > >>> +++ b/kernel/crash_core.c > >>> @@ -252,6 +252,26 @@ static int __init __parse_crashkernel(char > >>> *cmdline, > >>> if (suffix) > >>> return parse_crashkernel_suffix(ck_cmdline, crash_size, > >>> suffix); > >>> + > >>> + if (strncmp(ck_cmdline, "auto", 4) == 0) { > >>> +#if defined(CONFIG_X86_64) || defined(CONFIG_S390) > >>> + ck_cmdline = "1G-4G:160M,4G-64G:192M,64G-1T:256M,1T-:512M"; > >>> +#elif defined(CONFIG_ARM64) > >>> + ck_cmdline = "2G-:448M"; > >>> +#elif defined(CONFIG_PPC64) > >>> + char *fadump_cmdline; > >>> + > >>> + fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", > >>> NULL); > >>> + fadump_cmdline = fadump_cmdline ? > >>> + fadump_cmdline + strlen("fadump=") : NULL; > >>> + if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == > >>> 0)) > >>> + ck_cmdline = > >>> "2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G"; > >>> + else > >>> + ck_cmdline = > >>> "4G-16G:768M,16G-64G:1G,64G-128G:2G,128G-1T:4G,1T-2T:6G,2T-4T:12G,4T-8T:20G,8T-16T:36G,16T-32T:64G,32T-64T:128G,64T-:180G"; > >>> > >>> > >>> > >>> +#endif > >>> + pr_info("Using crashkernel=auto, the size chosen is a best > >>> effort estimation.\n"); > >>> + } > >>> + > >> > >> How did you even arrive at the above numbers? > > > > Memory requirements for kdump: > > > > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/supported-kdump-configurations-and-targets_managing-monitoring-and-updating-the-kernel#memory-requirements-for-kdump_supported-kdump-configurations-and-targets > > > > > > > > I've done some research on > >> this topic recently (ie. during the last 7 years or so). My x86_64 > >> system with 8G RAM running openSUSE Leap 15.3 seems needs 188M for > >> saving to the local disk, and 203M to save over the network (using > >> SFTP). My PPC64 LPAR with 16G RAM running latest Beta of SLES 15 SP4 > >> needs 587M, i.e. with the above nu
Re: [PATCHv4 1/4] arm64: make phys_offset signed
Hi Pingfan, On Tue, 18 Jan 2022 15:48:09 +0800 Pingfan Liu wrote: > After kernel commit 7bc1a0f9e176 ("arm64: mm: use single quantity to > represent the PA to VA translation"), phys_offset can be negative if > running 52-bits kernel on 48-bits hardware. > > So changing phys_offset from unsigned to signed. > > Signed-off-by: Pingfan Liu > Cc: Kairui Song > Cc: Simon Horman > Cc: Philipp Rudo > To: kexec@lists.infradead.org > --- > kexec/arch/arm64/kexec-arm64.c | 12 ++-- > kexec/arch/arm64/kexec-arm64.h | 2 +- > util_lib/elf_info.c| 2 +- > util_lib/include/elf_info.h| 2 +- > 4 files changed, 9 insertions(+), 9 deletions(-) > [...] > diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h > index ed447ac..1844b67 100644 > --- a/kexec/arch/arm64/kexec-arm64.h > +++ b/kexec/arch/arm64/kexec-arm64.h > @@ -58,7 +58,7 @@ extern off_t initrd_size; > */ > > struct arm64_mem { > - uint64_t phys_offset; > + long phys_offset; I think this one should be int64_t as well. Other than that Reviewed-by: Philipp Rudo > uint64_t text_offset; > uint64_t image_size; > uint64_t vp_offset; ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv4 2/4] arm64/crashdump: unify routine to get page_offset
Hi Pingfan, On Tue, 18 Jan 2022 15:48:10 +0800 Pingfan Liu wrote: > There are two funcs to get page_offset: > get_kernel_page_offset() > get_page_offset() > > Since get_kernel_page_offset() does not observe the kernel formula, and > remove it. Unify them in order to introduce 52-bits VA kernel more > easily in the coming patch. > > Signed-off-by: Pingfan Liu > Cc: Kairui Song > Cc: Simon Horman > Cc: Philipp Rudo > To: kexec@lists.infradead.org looks good Reviewed-by: Philipp Rudo > --- > kexec/arch/arm64/crashdump-arm64.c | 23 +-- > kexec/arch/arm64/kexec-arm64.c | 8 > kexec/arch/arm64/kexec-arm64.h | 1 + > 3 files changed, 6 insertions(+), 26 deletions(-) > > diff --git a/kexec/arch/arm64/crashdump-arm64.c > b/kexec/arch/arm64/crashdump-arm64.c > index a02019a..0a8d44c 100644 > --- a/kexec/arch/arm64/crashdump-arm64.c > +++ b/kexec/arch/arm64/crashdump-arm64.c > @@ -46,27 +46,6 @@ static struct crash_elf_info elf_info = { > .machine= EM_AARCH64, > }; > > -/* > - * Note: The returned value is correct only if !CONFIG_RANDOMIZE_BASE. > - */ > -static uint64_t get_kernel_page_offset(void) > -{ > - int i; > - > - if (elf_info.kern_vaddr_start == UINT64_MAX) > - return UINT64_MAX; > - > - /* Current max virtual memory range is 48-bits. */ > - for (i = 48; i > 0; i--) > - if (!(elf_info.kern_vaddr_start & (1UL << i))) > - break; > - > - if (i <= 0) > - return UINT64_MAX; > - else > - return UINT64_MAX << i; > -} > - > /* > * iomem_range_callback() - callback called for each iomem region > * @data: not used > @@ -198,7 +177,7 @@ int load_crashdump_segments(struct kexec_info *info) > if (err) > return EFAILED; > > - elf_info.page_offset = get_kernel_page_offset(); > + get_page_offset(&elf_info.page_offset); > dbgprintf("%s: page_offset: %016llx\n", __func__, > elf_info.page_offset); > > diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c > index c6c67e8..33cc258 100644 > --- a/kexec/arch/arm64/kexec-arm64.c > +++ b/kexec/arch/arm64/kexec-arm64.c > @@ -909,7 +909,7 @@ static int get_va_bits(void) > * get_page_offset - Helper for getting PAGE_OFFSET > */ > > -static int get_page_offset(void) > +int get_page_offset(unsigned long *page_offset) > { > int ret; > > @@ -917,8 +917,8 @@ static int get_page_offset(void) > if (ret < 0) > return ret; > > - page_offset = (0xUL) << (va_bits - 1); > - dbgprintf("page_offset : %lx\n", page_offset); > + *page_offset = UINT64_MAX << (va_bits - 1); > + dbgprintf("page_offset : %lx\n", *page_offset); > > return 0; > } > @@ -954,7 +954,7 @@ int get_phys_base_from_pt_load(long *phys_offset) > unsigned long long phys_start; > unsigned long long virt_start; > > - ret = get_page_offset(); > + ret = get_page_offset(&page_offset); > if (ret < 0) > return ret; > > diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h > index 1844b67..ed99d9d 100644 > --- a/kexec/arch/arm64/kexec-arm64.h > +++ b/kexec/arch/arm64/kexec-arm64.h > @@ -69,6 +69,7 @@ extern struct arm64_mem arm64_mem; > > uint64_t get_phys_offset(void); > uint64_t get_vp_offset(void); > +int get_page_offset(unsigned long *offset); > > static inline void reset_vp_offset(void) > { ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv4 3/4] arm64: read VA_BITS from kcore for 52-bits VA kernel
Hi Pingfan, On Tue, 18 Jan 2022 15:48:11 +0800 Pingfan Liu wrote: > phys_to_virt() calculates virtual address. As a important factor, > page_offset is excepted to be accurate. > > Since arm64 kernel exposes va_bits through vmcore, using it. > > Signed-off-by: Pingfan Liu > Cc: Kairui Song > Cc: Simon Horman > Cc: Philipp Rudo > To: kexec@lists.infradead.org looks good Reviewed-by: Philipp Rudo > --- > kexec/arch/arm64/kexec-arm64.c | 34 ++ > util_lib/elf_info.c| 5 + > util_lib/include/elf_info.h| 1 + > 3 files changed, 36 insertions(+), 4 deletions(-) > > diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c > index 33cc258..793799b 100644 > --- a/kexec/arch/arm64/kexec-arm64.c > +++ b/kexec/arch/arm64/kexec-arm64.c > @@ -54,7 +54,7 @@ > static bool try_read_phys_offset_from_kcore = false; > > /* Machine specific details. */ > -static int va_bits; > +static int va_bits = -1; > static unsigned long page_offset; > > /* Global varables the core kexec routines expect. */ > @@ -876,7 +876,18 @@ static inline void set_phys_offset(int64_t v, char > *set_method) > > static int get_va_bits(void) > { > - unsigned long long stext_sym_addr = get_kernel_sym("_stext"); > + unsigned long long stext_sym_addr; > + > + /* > + * if already got from kcore > + */ > + if (va_bits != -1) > + goto out; > + > + > + /* For kernel older than v4.19 */ > + fprintf(stderr, "Warning, can't get the VA_BITS from kcore\n"); > + stext_sym_addr = get_kernel_sym("_stext"); > > if (stext_sym_addr == 0) { > fprintf(stderr, "Can't get the symbol of _stext.\n"); > @@ -900,6 +911,7 @@ static int get_va_bits(void) > return -1; > } > > +out: > dbgprintf("va_bits : %d\n", va_bits); > > return 0; > @@ -917,14 +929,27 @@ int get_page_offset(unsigned long *page_offset) > if (ret < 0) > return ret; > > - *page_offset = UINT64_MAX << (va_bits - 1); > + if (va_bits < 52) > + *page_offset = UINT64_MAX << (va_bits - 1); > + else > + *page_offset = UINT64_MAX << va_bits; > + > dbgprintf("page_offset : %lx\n", *page_offset); > > return 0; > } > > +static void arm64_scan_vmcoreinfo(char *pos) > +{ > + const char *str; > + > + str = "NUMBER(VA_BITS)="; > + if (memcmp(str, pos, strlen(str)) == 0) > + va_bits = strtoul(pos + strlen(str), NULL, 10); > +} > + > /** > - * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET > + * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET > (and va_bits) > * from VMCOREINFO note inside 'kcore'. > */ > > @@ -937,6 +962,7 @@ static int get_phys_offset_from_vmcoreinfo_pt_note(long > *phys_offset) > return EFAILED; > } > > + arch_scan_vmcoreinfo = arm64_scan_vmcoreinfo; > ret = read_phys_offset_elf_kcore(fd, phys_offset); > > close(fd); > diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c > index 5574c7f..d252eff 100644 > --- a/util_lib/elf_info.c > +++ b/util_lib/elf_info.c > @@ -310,6 +310,8 @@ int get_pt_load(int idx, > > #define NOT_FOUND_LONG_VALUE (-1) > > +void (*arch_scan_vmcoreinfo)(char *pos); > + > void scan_vmcoreinfo(char *start, size_t size) > { > char *last = start + size - 1; > @@ -551,6 +553,9 @@ void scan_vmcoreinfo(char *start, size_t size) > } > } > > + if (arch_scan_vmcoreinfo != NULL) > + (*arch_scan_vmcoreinfo)(pos); > + > if (last_line) > break; > } > diff --git a/util_lib/include/elf_info.h b/util_lib/include/elf_info.h > index f550d86..fdf4c3d 100644 > --- a/util_lib/include/elf_info.h > +++ b/util_lib/include/elf_info.h > @@ -31,5 +31,6 @@ int get_pt_load(int idx, > int read_phys_offset_elf_kcore(int fd, long *phys_off); > int read_elf(int fd); > void dump_dmesg(int fd, void (*handler)(char*, unsigned int)); > +extern void (*arch_scan_vmcoreinfo)(char *pos); > > #endif /* ELF_INFO_H */ ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv4 4/4] arm64: fix PAGE_OFFSET calc for flipped mm
Hi Pingfan, On Tue, 18 Jan 2022 15:48:12 +0800 Pingfan Liu wrote: > From: Kairui Song > > Since kernel commit 14c127c957c1 ('arm64: mm: Flip kernel VA space'), > the memory layout on arm64 have changed, and kexec-tools can no longer > get the the right PAGE_OFFSET based on _text symbol. > > Prior to that, the kimage (_text) lays above PAGE_END with this layout: > 0 -> VA_START : Usespace > VA_START-> VA_START + 256M : BPF JIT, Modules > VA_START + 256M -> PAGE_OFFSET - (~GB misc) : Vmalloc (KERNEL _text HERE) > PAGE_OFFSET -> ... : * Linear map * > > And here we have: > VA_START= -1UL << VA_BITS > PAGE_OFFSET = -1UL << (VA_BITS - 1) > _text < -1UL << (VA_BITS - 1) > > Kernel image lays somewhere between VA_START and PAGE_OFFSET, so we just > calc VA_BITS by getting the highest unset bit of _text symbol address, > and shift one less bit of VA_BITS to get page offset. This works as long > as KASLR don't put kernel in a too high location (which is commented inline). > > And after that commit, kernel layout have changed: > 0 -> PAGE_OFFSET : Userspace > PAGE_OFFSET -> PAGE_END : * Linear map * > PAGE_END-> PAGE_END + 128M : bpf jit region > PAGE_END + 128M -> PAGE_END + 256MB : modules > PAGE_END + 256M -> ... : vmalloc (KERNEL _text HERE) > > Here we have: > PAGE_OFFSET = -1UL << VA_BITS > PAGE_END= -1UL << (VA_BITS - 1) > _text > -1UL << (VA_BITS - 1) > > Kernel image now lays above PAGE_END, so we have to shift one more bit to > get the VA_BITS, and shift the exact VA_BITS for PAGE_OFFSET. > > We can simply check if "_text > -1UL << (VA_BITS - 1)" is true to judge > which layout is being used and shift the page offset occordingly. > > Signed-off-by: Kairui Song > (rebased and stripped by Pingfan ) > Signed-off-by: Pingfan Liu > Cc: Simon Horman > Cc: Philipp Rudo > To: kexec@lists.infradead.org > --- > kexec/arch/arm64/kexec-arm64.c | 14 +- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c > index 793799b..ce7a5bb 100644 > --- a/kexec/arch/arm64/kexec-arm64.c > +++ b/kexec/arch/arm64/kexec-arm64.c > @@ -923,13 +923,25 @@ out: > > int get_page_offset(unsigned long *page_offset) > { > + unsigned long long text_sym_addr, kernel_va_mid; > int ret; > > + text_sym_addr = get_kernel_sym("_text"); > + if (text_sym_addr == 0) { > + fprintf(stderr, "Can't get the symbol of _text to calculate > page_offset.\n"); > + return -1; > + } > + > ret = get_va_bits(); > if (ret < 0) > return ret; > > - if (va_bits < 52) > + /* Since kernel 5.4, kernel image is put above > + * UINT64_MAX << (va_bits - 1) > + */ > + kernel_va_mid = UINT64_MAX << (va_bits - 1); > + /* older kernel */ > + if (text_sym_addr < kernel_va_mid) > *page_offset = UINT64_MAX << (va_bits - 1); > else > *page_offset = UINT64_MAX << va_bits; I would drop the kernel_va_mid and simply use *page_offset = UINT64_MAX << (va_bits - 1) if (*page_offset > text_sym_addr > *page_offset) *page_offset = UINT64_MAX << va_bits but that's more a matter of taste. Reviewed-by: Philipp Rudo ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 3/5] add slurp_proc_file()
Hi Sven, sorry, i need to nag again. On Wed, 15 Dec 2021 11:18:34 +0100 Sven Schnelle wrote: > slurp_file() cannot be used to read proc files, as they are returning > a size of zero in stat(). Add a function slurp_proc_file() which is > similar to slurp_file(), but doesn't require the size of the file to > be known. > > Signed-off-by: Sven Schnelle > --- > kexec/kexec.c | 32 > 1 file changed, 32 insertions(+) > > diff --git a/kexec/kexec.c b/kexec/kexec.c > index f63b36b771eb..a1acba2adf2a 100644 > --- a/kexec/kexec.c > +++ b/kexec/kexec.c > @@ -1106,6 +1106,38 @@ static void remove_parameter(char *line, const char > *param_name) > } > } > > +static char *slurp_proc_file(const char *filename, size_t *len) > +{ > + ssize_t ret, startpos = 0; > + unsigned int size = 64; > + char *buf = NULL, *tmp; > + int fd; > + > + fd = open(filename, O_RDONLY); > + if (fd == -1) > + return NULL; > + > + do { > + size *= 2; > + tmp = realloc(buf, size); > + if (!tmp) { > + free(buf); > + return NULL; > + } > + buf = tmp; > + > + ret = read(fd, buf + startpos, size - startpos); > + if (ret < 0) { > + free(buf); > + return NULL; > + } > + startpos += ret; > + *len = startpos; > + } while (ret == size); I don't think this will work as intended. 1) read returns the bytes read. So ret has a maximum value of size - startpos and thus can only be equal to size on the first pass when startpos = 0. 2) it's not an error when read reads less bytes than requested but can happen when, e.g. it get's interrupted. The simplest solution I see is to simply use 'while (ret)' Even when that means that there is an extra pass in the loop with an extra call to realloc. The cleaner solution probably would be to put the read into a second loop. I.e. the following should work (no tested) do { [...] do { ret = read(fd, buf + startpos, size - startpos); if (ret < 0) { free(buf); return NULL; } startpos += ret; while (ret && startpos != size); *len = startpos; } while (ret); Thanks Philipp > + > + return buf; > +} > + > /* > * Returns the contents of the current command line to be used with > * --reuse-cmdline option. The function gets called from architecture > specific ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/3] arm64: read VA_BITS from kcore for 52-bits VA kernel
Hi Pingfang, On Fri, 10 Dec 2021 11:07:35 +0800 Pingfan Liu wrote: > phys_to_virt() calculates virtual address. As a important factor, > page_offset is excepted to be accurate. > > Since arm64 kernel exposes va_bits through vmcore, using it. > > Signed-off-by: Pingfan Liu > --- > kexec/arch/arm64/kexec-arm64.c | 31 +++ > kexec/arch/arm64/kexec-arm64.h | 1 + > util_lib/elf_info.c| 5 + > 3 files changed, 33 insertions(+), 4 deletions(-) > > diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c > index bd650e6..ccc92db 100644 > --- a/kexec/arch/arm64/kexec-arm64.c > +++ b/kexec/arch/arm64/kexec-arm64.c > @@ -54,7 +54,7 @@ > static bool try_read_phys_offset_from_kcore = false; > > /* Machine specific details. */ > -static int va_bits; > +static int va_bits = -1; > static unsigned long page_offset; > > /* Global varables the core kexec routines expect. */ > @@ -876,7 +876,15 @@ static inline void set_phys_offset(long v, char > *set_method) > > static int get_va_bits(void) > { > - unsigned long long stext_sym_addr = get_kernel_sym("_stext"); > + unsigned long long stext_sym_addr; > + > + /* > + * if already got from kcore > + */ > + if (va_bits != -1) > + goto out; > + > + stext_sym_addr = get_kernel_sym("_stext"); > > if (stext_sym_addr == 0) { > fprintf(stderr, "Can't get the symbol of _stext.\n"); > @@ -900,6 +908,7 @@ static int get_va_bits(void) > return -1; > } > > +out: > dbgprintf("va_bits : %d\n", va_bits); > > return 0; > @@ -917,14 +926,27 @@ int get_page_offset(unsigned long *page_offset) > if (ret < 0) > return ret; > > - page_offset = (0xUL) << (va_bits - 1); > + if (va_bits < 52) > + *page_offset = (0xUL) << (va_bits - 1); > + else > + *page_offset = (0xUL) << va_bits; wouldn't it make sense to use ULONG_MAX here? At least for me it would be much better readable. > dbgprintf("page_offset : %lx\n", page_offset); > > return 0; > } > > +static void arm64_scan_vmcoreinfo(char *pos) > +{ > + const char *str; > + > + str = "NUMBER(VA_BITS)="; > + if (memcmp(str, pos, strlen(str)) == 0) > + va_bits = strtoul(pos + strlen(str), NULL, 10); > +} > + > /** > - * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET > + * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET > (and va_bits) > * from VMCOREINFO note inside 'kcore'. > */ > > @@ -937,6 +959,7 @@ static int get_phys_offset_from_vmcoreinfo_pt_note(long > *phys_offset) > return EFAILED; > } > > + arch_scan_vmcoreinfo = arm64_scan_vmcoreinfo; > ret = read_phys_offset_elf_kcore(fd, phys_offset); > > close(fd); > diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h > index ed99d9d..d291705 100644 > --- a/kexec/arch/arm64/kexec-arm64.h > +++ b/kexec/arch/arm64/kexec-arm64.h > @@ -66,6 +66,7 @@ struct arm64_mem { > > #define arm64_mem_ngv UINT64_MAX > extern struct arm64_mem arm64_mem; > +extern void (*arch_scan_vmcoreinfo)(char *pos); This definition isn't arm64 specific. I think this should go to util_lib/include/elf_info.h. Thanks Philipp > > uint64_t get_phys_offset(void); > uint64_t get_vp_offset(void); > diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c > index 5574c7f..d252eff 100644 > --- a/util_lib/elf_info.c > +++ b/util_lib/elf_info.c > @@ -310,6 +310,8 @@ int get_pt_load(int idx, > > #define NOT_FOUND_LONG_VALUE (-1) > > +void (*arch_scan_vmcoreinfo)(char *pos); > + > void scan_vmcoreinfo(char *start, size_t size) > { > char *last = start + size - 1; > @@ -551,6 +553,9 @@ void scan_vmcoreinfo(char *start, size_t size) > } > } > > + if (arch_scan_vmcoreinfo != NULL) > + (*arch_scan_vmcoreinfo)(pos); > + > if (last_line) > break; > } ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/3] arm64/crashdump: unify routine to get page_offset
Hi Pinfang, On Fri, 10 Dec 2021 11:07:34 +0800 Pingfan Liu wrote: > There are two funcs to get page_offset: > get_kernel_page_offset() > get_page_offset() > > Since get_kernel_page_offset() does not observe the kernel formula, and > remove it. Unify them in order to introduce 52-bits VA kernel more > easily in the coming patch. > > Signed-off-by: Pingfan Liu > --- > kexec/arch/arm64/crashdump-arm64.c | 23 +-- > kexec/arch/arm64/kexec-arm64.c | 4 ++-- > kexec/arch/arm64/kexec-arm64.h | 1 + > 3 files changed, 4 insertions(+), 24 deletions(-) > > diff --git a/kexec/arch/arm64/crashdump-arm64.c > b/kexec/arch/arm64/crashdump-arm64.c > index a02019a..0a8d44c 100644 > --- a/kexec/arch/arm64/crashdump-arm64.c > +++ b/kexec/arch/arm64/crashdump-arm64.c [...] > diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c > index 7373fa3..bd650e6 100644 > --- a/kexec/arch/arm64/kexec-arm64.c > +++ b/kexec/arch/arm64/kexec-arm64.c > @@ -909,7 +909,7 @@ static int get_va_bits(void) > * get_page_offset - Helper for getting PAGE_OFFSET > */ > > -static int get_page_offset(void) > +int get_page_offset(unsigned long *page_offset) > { > int ret; > The new page_offset is never used in this patch. There's still the line left with the global page_offset but that will cause a type miss-match (unsigend long vs unsigned log *). In the next patch you do the update (page_offset -> *page_offset) but in the meantime the code is broken. Thanks Philipp > @@ -954,7 +954,7 @@ int get_phys_base_from_pt_load(long *phys_offset) > unsigned long long phys_start; > unsigned long long virt_start; > > - ret = get_page_offset(); > + ret = get_page_offset(&page_offset); > if (ret < 0) > return ret; > > diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h > index 1844b67..ed99d9d 100644 > --- a/kexec/arch/arm64/kexec-arm64.h > +++ b/kexec/arch/arm64/kexec-arm64.h > @@ -69,6 +69,7 @@ extern struct arm64_mem arm64_mem; > > uint64_t get_phys_offset(void); > uint64_t get_vp_offset(void); > +int get_page_offset(unsigned long *offset); > > static inline void reset_vp_offset(void) > { ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kernel/crash_core.c: No judgment required
Hi lizhe, On Thu, 9 Dec 2021 19:20:03 -0800 lizhe wrote: > No judgment required ck_cmdline is NULL > its caller has alreadly judged, see __parse_crashkernel > function > > Signed-off-by: lizhe > --- > kernel/crash_core.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index eb53f5ec62c9..9981cf9b9fe4 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -221,9 +221,6 @@ static __init char *get_last_crashkernel(char *cmdline, > p = strstr(p+1, name); > } > > - if (!ck_cmdline) > - return NULL; > - > return ck_cmdline; > } > I agree that the if-block is not needed and can be removed. However, I cannot follow your reasoning in the commit message. Could you please explain it in more detail. The reason why I think that the 'if' can be removed is that the expression can only be true when ck_cmdline = NULL. But with that the last three lines are equivalent to if (!ck_cmdline) return ck_cmdline; return ck_cmdline; Which simply doesn't make any sense. Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/1] s390: handle R_390_PLT32DBL reloc entries in machine_apply_elf_rel()
Hi Alexander, @Alexander: Thanks for taking care of this. On Wed, 8 Dec 2021 13:53:55 +0100 Alexander Egorenkov wrote: > Starting with gcc 11.3, the C compiler will generate PLT-relative function > calls even if they are local and do not require it. Later on during linking, > the linker will replace all PLT-relative calls to local functions with > PC-relative ones. Unfortunately, the purgatory code of kexec/kdump is > not being linked as a regular executable or shared library would have been, > and therefore, all PLT-relative addresses remain in the generated purgatory > object code unresolved. This leads to the situation where the purgatory > code is being executed during kdump with all PLT-relative addresses > unresolved. And this results in endless loops within the purgatory code. Tiny nit. The last two sentences describe the situation in the kernel. Luckily the kexec-tools do proper error checking and die with "Unknown rela relocation: 0x14 0x73c0901c" when they encounter an unknown relocation type. Anyway, the code is correct Reviewed-by: Philipp Rudo > Furthermore, the clang C compiler has always behaved like described above > and this commit should fix the purgatory code built with the latter. > > Because the purgatory code is no regular executable or shared library, > contains only calls to local functions and has no PLT, all R_390_PLT32DBL > relocation entries can be resolved just like a R_390_PC32DBL one. > > * > https://refspecs.linuxfoundation.org/ELF/zSeries/lzsabi0_zSeries/x1633.html#AEN1699 > > Relocation entries of purgatory code generated with gcc 11.3 > > > $ readelf -r purgatory/purgatory.o > > Relocation section '.rela.text' at offset 0x6e8 contains 27 entries: > Offset Info Type Sym. ValueSym. Name + > Addend > 000c 00030013 R_390_PC32DBL .data + 2 > 001a 00100014 R_390_PLT32DBL sha256_starts + > 2 > 0030 00110014 R_390_PLT32DBL sha256_update + > 2 > 0046 00120014 R_390_PLT32DBL sha256_finish + > 2 > 0050 00030013 R_390_PC32DBL .data + 102 > 005a 00130014 R_390_PLT32DBL memcmp + 2 > ... > 0118 00160014 R_390_PLT32DBL setup_arch + 2 > 011e 00030013 R_390_PC32DBL .data + 2 > 012c 000f0014 R_390_PLT32DBL > verify_sha256_digest + 2 > 0142 00170014 R_390_PLT32DBL > post_verification[...] + 2 > > Relocation entries of purgatory code generated with gcc 11.2 > > > $ readelf -r purgatory/purgatory.o > > Relocation section '.rela.text' at offset 0x6e8 contains 27 entries: > Offset Info Type Sym. ValueSym. Name + > Addend > 000e 00030013 R_390_PC32DBL .data + 2 > 001c 00100013 R_390_PC32DBL sha256_starts + > 2 > 0036 00110013 R_390_PC32DBL sha256_update + > 2 > 0048 00120013 R_390_PC32DBL sha256_finish + > 2 > 0052 00030013 R_390_PC32DBL .data + 102 > 005c 00130013 R_390_PC32DBL memcmp + 2 > ... > 011a 00160013 R_390_PC32DBL setup_arch + 2 > 0120 00030013 R_390_PC32DBL .data + 122 > 0130 000f0013 R_390_PC32DBL > verify_sha256_digest + 2 > 0146 00170013 R_390_PC32DBL > post_verification[...] + 2 > > Corresponding s390 kernel discussion: > * > https://lore.kernel.org/linux-s390/20211208105801.188140-1-egore...@linux.ibm.com/T/#u > > Signed-off-by: Alexander Egorenkov > Reported-by: Tao Liu > Suggested-by: Philipp Rudo > --- > kexec/arch/s390/kexec-elf-rel-s390.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kexec/arch/s390/kexec-elf-rel-s390.c > b/kexec/arch/s390/kexec-elf-rel-s390.c > index a5e1b7345578..91ba86a9991d 100644 > --- a/kexec/arch/s390/kexec-elf-rel-s390.c > +++ b/kexec/arch/s390/kexec-elf-rel-s390.c > @@ -56,6 +56,7 @@ void machine_apply_elf_rel(struct mem_ehdr *UNUSED(ehdr), > case R_390_PC16:/* PC relative 16 bit. */ > case R_390_PC16DBL: /* PC relative 16 bit shifted by 1. */ > case R_390_PC32DBL: /* PC relative 32 bit shift
[PATCH v2] kernel/crash_core: suppress unknown crashkernel parameter warning
When booting with crashkernel= on the kernel command line a warning similar to [0.038294] Kernel command line: ro console=ttyS0 crashkernel=256M [0.038353] Unknown kernel command line parameters "crashkernel=256M", will be passed to user space. is printed. This comes from crashkernel= being parsed independent from the kernel parameter handling mechanism. So the code in init/main.c doesn't know that crashkernel= is a valid kernel parameter and prints this incorrect warning. Suppress the warning by adding a dummy early_param handler for crashkernel=. Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") Signed-off-by: Philipp Rudo Acked-by: Baoquan He --- v2: - improve commit message - add Acked-by from Baoquan --- kernel/crash_core.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index eb53f5ec62c9..256cf6db573c 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -6,6 +6,7 @@ #include #include +#include #include #include @@ -295,6 +296,16 @@ int __init parse_crashkernel_low(char *cmdline, "crashkernel=", suffix_tbl[SUFFIX_LOW]); } +/* + * Add a dummy early_param handler to mark crashkernel= as a known command line + * parameter and suppress incorrect warnings in init/main.c. + */ +static int __init parse_crashkernel_dummy(char *arg) +{ + return 0; +} +early_param("crashkernel", parse_crashkernel_dummy); + Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len) { -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 0/6] KEXEC_SIG with appended signature
Hi Michal, On Tue, 7 Dec 2021 18:32:21 +0100 Michal Suchánek wrote: > On Tue, Dec 07, 2021 at 05:10:14PM +0100, Philipp Rudo wrote: > > Hi Michal, > > > > i finally had the time to take a closer look at the series. Except for > > the nit in patch 4 and my personal preference in patch 6 the code looks > > good to me. > > > > What I don't like are the commit messages on the first commits. In my > > opinion they are so short that they are almost useless. For example in > > patch 2 there is absolutely no explanation why you can simply copy the > > s390 over to ppc. > > They use the same signature format. I suppose I can add a note saying > that. The note is what I was asking for. For me the commit message is an important piece of documentation for other developers (or yourself in a year). That's why in my opinion it's important to describe _why_ you do something in it as you cannot get the _why_ by reading the code. > > Or in patch 3 you are silently changing the error > > code in kexec from EKEYREJECT to ENODATA. So I would appreciate it if > > Not sure what I should do about this. The different implementations use > different random error codes, and when they are unified the error code > clearly changes for one or the other. My complaint wasn't that you change the return code. There's no way to avoid choosing one over the other. It's again that you don't document the change in the commit message for others. > Does anything depend on a particular error code returned? Not that I know of. At least in the kexec-tools ENODATA and EKEYREJECT are handled the same way. Thanks Philipp > Thanks > > Michal > > > you could improve them a little. > > > > Thanks > > Philipp > > > > On Thu, 25 Nov 2021 19:02:38 +0100 > > Michal Suchanek wrote: > > > > > Hello, > > > > > > This is resend of the KEXEC_SIG patchset. > > > > > > The first patch is new because it'a a cleanup that does not require any > > > change to the module verification code. > > > > > > The second patch is the only one that is intended to change any > > > functionality. > > > > > > The rest only deduplicates code but I did not receive any review on that > > > part so I don't know if it's desirable as implemented. > > > > > > The first two patches can be applied separately without the rest. > > > > > > Thanks > > > > > > Michal > > > > > > Michal Suchanek (6): > > > s390/kexec_file: Don't opencode appended signature check. > > > powerpc/kexec_file: Add KEXEC_SIG support. > > > kexec_file: Don't opencode appended signature verification. > > > module: strip the signature marker in the verification function. > > > module: Use key_being_used_for for log messages in > > > verify_appended_signature > > > module: Move duplicate mod_check_sig users code to mod_parse_sig > > > > > > arch/powerpc/Kconfig | 11 + > > > arch/powerpc/kexec/elf_64.c | 14 ++ > > > arch/s390/kernel/machine_kexec_file.c| 42 ++ > > > crypto/asymmetric_keys/asymmetric_type.c | 1 + > > > include/linux/module_signature.h | 1 + > > > include/linux/verification.h | 4 ++ > > > kernel/module-internal.h | 2 - > > > kernel/module.c | 12 +++-- > > > kernel/module_signature.c| 56 +++- > > > kernel/module_signing.c | 33 +++--- > > > security/integrity/ima/ima_modsig.c | 22 ++ > > > 11 files changed, 113 insertions(+), 85 deletions(-) > > > > > > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 6/6] module: Move duplicate mod_check_sig users code to mod_parse_sig
Hi Michal, On Thu, 25 Nov 2021 19:02:44 +0100 Michal Suchanek wrote: > Multiple users of mod_check_sig check for the marker, then call > mod_check_sig, extract signature length, and remove the signature. > > Put this code in one place together with mod_check_sig. > > Signed-off-by: Michal Suchanek > --- > include/linux/module_signature.h| 1 + > kernel/module_signature.c | 56 - > kernel/module_signing.c | 26 +++--- > security/integrity/ima/ima_modsig.c | 22 ++-- > 4 files changed, 63 insertions(+), 42 deletions(-) > > diff --git a/include/linux/module_signature.h > b/include/linux/module_signature.h > index 7eb4b00381ac..1343879b72b3 100644 > --- a/include/linux/module_signature.h > +++ b/include/linux/module_signature.h > @@ -42,5 +42,6 @@ struct module_signature { > > int mod_check_sig(const struct module_signature *ms, size_t file_len, > const char *name); > +int mod_parse_sig(const void *data, size_t *len, size_t *sig_len, const char > *name); > > #endif /* _LINUX_MODULE_SIGNATURE_H */ > diff --git a/kernel/module_signature.c b/kernel/module_signature.c > index 00132d12487c..784b40575ee4 100644 > --- a/kernel/module_signature.c > +++ b/kernel/module_signature.c > @@ -8,14 +8,36 @@ > > #include > #include > +#include > #include > #include > > +/** > + * mod_check_sig_marker - check that the given data has signature marker at > the end > + * > + * @data:Data with appended signature > + * @len: Length of data. Signature marker length is subtracted on > success. > + */ > +static inline int mod_check_sig_marker(const void *data, size_t *len) I personally don't like it when a function has a "check" in it's name as it doesn't describe what the function is checking for. For me mod_has_sig_marker is much more precise. I would use that instead. Thanks Philipp > +{ > + const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1; > + > + if (markerlen > *len) > + return -ENODATA; > + > + if (memcmp(data + *len - markerlen, MODULE_SIG_STRING, > +markerlen)) > + return -ENODATA; > + > + *len -= markerlen; > + return 0; > +} > + > /** > * mod_check_sig - check that the given signature is sane > * > * @ms: Signature to check. > - * @file_len:Size of the file to which @ms is appended. > + * @file_len:Size of the file to which @ms is appended (without the > marker). > * @name:What is being checked. Used for error messages. > */ > int mod_check_sig(const struct module_signature *ms, size_t file_len, > @@ -44,3 +66,35 @@ int mod_check_sig(const struct module_signature *ms, > size_t file_len, > > return 0; > } > + > +/** > + * mod_parse_sig - check that the given signature is sane and determine > signature length > + * > + * @data:Data with appended signature. > + * @len: Length of data. Signature and marker length is subtracted on > success. > + * @sig_len: Length of signature. Filled on success. > + * @name:What is being checked. Used for error messages. > + */ > +int mod_parse_sig(const void *data, size_t *len, size_t *sig_len, const char > *name) > +{ > + const struct module_signature *sig; > + int rc; > + > + rc = mod_check_sig_marker(data, len); > + if (rc) > + return rc; > + > + if (*len < sizeof(*sig)) > + return -ENODATA; > + > + sig = (const struct module_signature *)(data + (*len - sizeof(*sig))); > + > + rc = mod_check_sig(sig, *len, name); > + if (rc) > + return rc; > + > + *sig_len = be32_to_cpu(sig->sig_len); > + *len -= *sig_len + sizeof(*sig); > + > + return 0; > +} > diff --git a/kernel/module_signing.c b/kernel/module_signing.c > index cef72a6f6b5d..02bbca90f467 100644 > --- a/kernel/module_signing.c > +++ b/kernel/module_signing.c > @@ -25,35 +25,17 @@ int verify_appended_signature(const void *data, size_t > *len, > struct key *trusted_keys, > enum key_being_used_for purpose) > { > - const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1; > struct module_signature ms; > - size_t sig_len, modlen = *len; > + size_t sig_len; > int ret; > > - pr_devel("==>%s %s(,%zu)\n", __func__, key_being_used_for[purpose], > modlen); > + pr_devel("==>%s %s(,%zu)\n", __func__, key_being_used_for[purpose], > *len); > > - if (markerlen > modlen) > - return -ENODATA; > - > - if (memcmp(data + modlen - markerlen, MODULE_SIG_STRING, > -markerlen)) > - return -ENODATA; > - modlen -= markerlen; > - > - if (modlen <= sizeof(ms)) > - return -EBADMSG; > - > - memcpy(&ms, data + (modlen - sizeof(ms)), sizeof(ms)); > - > - ret = mod_check_sig(&ms, modlen, key_being_used_for[purpose]); > + ret = mod_parse_sig(data,
Re: [PATCH v2 4/6] module: strip the signature marker in the verification function.
Hi Michal, On Thu, 25 Nov 2021 19:02:42 +0100 Michal Suchanek wrote: > It is stripped by each caller separately. > > Signed-off-by: Michal Suchanek > --- > arch/powerpc/kexec/elf_64.c | 9 - > arch/s390/kernel/machine_kexec_file.c | 9 - > kernel/module.c | 7 +-- > kernel/module_signing.c | 12 ++-- kernel/module_signing.c is only compiled with MODULE_SIG enabled but KEXEC_SIG only selects MODULE_SIG_FORMAT. In the unlikely case that KEXEC_SIG is enabled but MODULE_SIG isn't this causes a build breakage. So you need to update KEXEC_SIG to select MODULE_SIG instead of MODULE_SIG_FORMAT for s390 and ppc. Thanks Philipp > 4 files changed, 11 insertions(+), 26 deletions(-) > > diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c > index 266cb26d3ca0..63634c95265d 100644 > --- a/arch/powerpc/kexec/elf_64.c > +++ b/arch/powerpc/kexec/elf_64.c > @@ -156,15 +156,6 @@ static void *elf64_load(struct kimage *image, char > *kernel_buf, > int elf64_verify_sig(const char *kernel, unsigned long length) > { > size_t kernel_len = length; > - const unsigned long marker_len = sizeof(MODULE_SIG_STRING) - 1; > - > - if (marker_len > kernel_len) > - return -EKEYREJECTED; > - > - if (memcmp(kernel + kernel_len - marker_len, MODULE_SIG_STRING, > -marker_len)) > - return -EKEYREJECTED; > - kernel_len -= marker_len; > > return verify_appended_signature(kernel, &kernel_len, > VERIFY_USE_PLATFORM_KEYRING, > "kexec_file"); > diff --git a/arch/s390/kernel/machine_kexec_file.c > b/arch/s390/kernel/machine_kexec_file.c > index 432797249db3..c4632c1a1b59 100644 > --- a/arch/s390/kernel/machine_kexec_file.c > +++ b/arch/s390/kernel/machine_kexec_file.c > @@ -27,20 +27,11 @@ const struct kexec_file_ops * const kexec_file_loaders[] > = { > int s390_verify_sig(const char *kernel, unsigned long length) > { > size_t kernel_len = length; > - const unsigned long marker_len = sizeof(MODULE_SIG_STRING) - 1; > > /* Skip signature verification when not secure IPLed. */ > if (!ipl_secure_flag) > return 0; > > - if (marker_len > kernel_len) > - return -EKEYREJECTED; > - > - if (memcmp(kernel + kernel_len - marker_len, MODULE_SIG_STRING, > -marker_len)) > - return -EKEYREJECTED; > - kernel_len -= marker_len; > - > return verify_appended_signature(kernel, &kernel_len, > VERIFY_USE_PLATFORM_KEYRING, > "kexec_file"); > } > diff --git a/kernel/module.c b/kernel/module.c > index 8481933dfa92..d91ca0f93a40 100644 > --- a/kernel/module.c > +++ b/kernel/module.c > @@ -2882,7 +2882,6 @@ static inline void kmemleak_load_module(const struct > module *mod, > static int module_sig_check(struct load_info *info, int flags) > { > int err = -ENODATA; > - const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1; > const char *reason; > const void *mod = info->hdr; > > @@ -2890,11 +2889,7 @@ static int module_sig_check(struct load_info *info, > int flags) >* Require flags == 0, as a module with version information >* removed is no longer the module that was signed >*/ > - if (flags == 0 && > - info->len > markerlen && > - memcmp(mod + info->len - markerlen, MODULE_SIG_STRING, markerlen) > == 0) { > - /* We truncate the module to discard the signature */ > - info->len -= markerlen; > + if (flags == 0) { > err = verify_appended_signature(mod, &info->len, > VERIFY_USE_SECONDARY_KEYRING, > "module"); > if (!err) { > diff --git a/kernel/module_signing.c b/kernel/module_signing.c > index f492e410564d..4c28cb55275f 100644 > --- a/kernel/module_signing.c > +++ b/kernel/module_signing.c > @@ -15,8 +15,7 @@ > #include "module-internal.h" > > /** > - * verify_appended_signature - Verify the signature on a module with the > - * signature marker stripped. > + * verify_appended_signature - Verify the signature on a module > * @data: The data to be verified > * @len: Size of @data. > * @trusted_keys: Keyring to use for verification > @@ -25,12 +24,21 @@ > int verify_appended_signature(const void *data, size_t *len, > struct key *trusted_keys, const char *what) > { > + const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1; > struct module_signature ms; > size_t sig_len, modlen = *len; > int ret; > > pr_devel("==>%s(,%zu)\n", __func__, modlen); > > + if (markerlen > modlen) > + return -ENODATA; > + > + if (memcmp(data + modlen - markerlen, MODULE_SIG_STRING, > +markerlen)) > + return -ENODATA; > + modlen -= markerlen; > + >
Re: [PATCH v2 0/6] KEXEC_SIG with appended signature
Hi Michal, i finally had the time to take a closer look at the series. Except for the nit in patch 4 and my personal preference in patch 6 the code looks good to me. What I don't like are the commit messages on the first commits. In my opinion they are so short that they are almost useless. For example in patch 2 there is absolutely no explanation why you can simply copy the s390 over to ppc. Or in patch 3 you are silently changing the error code in kexec from EKEYREJECT to ENODATA. So I would appreciate it if you could improve them a little. Thanks Philipp On Thu, 25 Nov 2021 19:02:38 +0100 Michal Suchanek wrote: > Hello, > > This is resend of the KEXEC_SIG patchset. > > The first patch is new because it'a a cleanup that does not require any > change to the module verification code. > > The second patch is the only one that is intended to change any > functionality. > > The rest only deduplicates code but I did not receive any review on that > part so I don't know if it's desirable as implemented. > > The first two patches can be applied separately without the rest. > > Thanks > > Michal > > Michal Suchanek (6): > s390/kexec_file: Don't opencode appended signature check. > powerpc/kexec_file: Add KEXEC_SIG support. > kexec_file: Don't opencode appended signature verification. > module: strip the signature marker in the verification function. > module: Use key_being_used_for for log messages in > verify_appended_signature > module: Move duplicate mod_check_sig users code to mod_parse_sig > > arch/powerpc/Kconfig | 11 + > arch/powerpc/kexec/elf_64.c | 14 ++ > arch/s390/kernel/machine_kexec_file.c| 42 ++ > crypto/asymmetric_keys/asymmetric_type.c | 1 + > include/linux/module_signature.h | 1 + > include/linux/verification.h | 4 ++ > kernel/module-internal.h | 2 - > kernel/module.c | 12 +++-- > kernel/module_signature.c| 56 +++- > kernel/module_signing.c | 33 +++--- > security/integrity/ima/ima_modsig.c | 22 ++ > 11 files changed, 113 insertions(+), 85 deletions(-) > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/3] s390: add support for --reuse-cmdline
Hi Sven, makes absolutely sense to have this option. One problem though... On Mon, 22 Nov 2021 08:14:01 +0100 Sven Schnelle wrote: > --reuse-cmdline reads the command line of the currently > running kernel from /proc/cmdline and uses that for the > kernel that should be kexec'd. > > Signed-off-by: Sven Schnelle > Reviewed-by: Alexander Egorenkov > --- > kexec/arch/s390/include/arch/options.h | 10 ++ > kexec/arch/s390/kexec-image.c | 9 + > 2 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/kexec/arch/s390/include/arch/options.h > b/kexec/arch/s390/include/arch/options.h > index 76044a301ceb..b030b61d61be 100644 [...] > diff --git a/kexec/arch/s390/kexec-image.c b/kexec/arch/s390/kexec-image.c > index dbeb689b830a..310d967ea331 100644 > --- a/kexec/arch/s390/kexec-image.c > +++ b/kexec/arch/s390/kexec-image.c > @@ -72,6 +72,10 @@ int image_s390_load_file(int argc, char **argv, struct > kexec_info *info) > case OPT_RAMDISK: > ramdisk = optarg; > break; > + case OPT_REUSE_CMDLINE: > + free(command_line); > + command_line = get_command_line(); get_command_line reads a maximum of 2048 bytes from /prc/cmdline. With the configurable size on s390 defaulting to 4096 bytes this will ultimately cause problems. So you need to make get_command_line more flexible first. Thanks Philipp > + break; > } > } > > @@ -123,6 +127,10 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > if (command_line_add(optarg)) > return -1; > break; > + case OPT_REUSE_CMDLINE: > + free(command_line); > + command_line = get_command_line(); > + break; > case OPT_RAMDISK: > ramdisk = optarg; > break; > @@ -223,5 +231,6 @@ image_s390_usage(void) > printf("--command-line=STRING Set the kernel command line to STRING.\n" > "--append=STRING Set the kernel command line to STRING.\n" > "--initrd=FILENAME Use the file FILENAME as a ramdisk.\n" > +"--reuse-cmdline Use kernel command line from running > system.\n" > ); > } ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/3] s390: add variable command line size
Hi Sven, On Mon, 22 Nov 2021 08:13:59 +0100 Sven Schnelle wrote: > Newer s390 kernels support a command line size longer than 896 > bytes. Such kernels contain a new member in the parameter area, > which might be utilized by tools like kexec. Older kernels have > the location initialized to zero, so we check whether there's a > non-zero number present and use that. If there isn't, we fallback > to the legacy command line size of 896 bytes. > > Signed-off-by: Sven Schnelle > Reviewed-by: Alexander Egorenkov > --- > kexec/arch/s390/kexec-image.c | 53 --- > kexec/arch/s390/kexec-s390.h | 19 +++-- > 2 files changed, 46 insertions(+), 26 deletions(-) > > diff --git a/kexec/arch/s390/kexec-image.c b/kexec/arch/s390/kexec-image.c > index 3c24fdfe3c7c..7747d02399db 100644 > --- a/kexec/arch/s390/kexec-image.c > +++ b/kexec/arch/s390/kexec-image.c > @@ -25,7 +25,7 @@ > #include > > static uint64_t crash_base, crash_end; > -static char command_line[COMMAND_LINESIZE]; > +static char *command_line; isn't this the perfect opportunity to get rid of this global variable and... > static void add_segment_check(struct kexec_info *info, const void *buf, > size_t bufsz, unsigned long base, size_t memsz) > @@ -38,11 +38,16 @@ static void add_segment_check(struct kexec_info *info, > const void *buf, > > int command_line_add(const char *str) ... simply pass the pointer as an argument ;) Thanks Philipp > { > - if (strlen(command_line) + strlen(str) + 1 > COMMAND_LINESIZE) { > - fprintf(stderr, "Command line too long.\n"); > + char *tmp = NULL; > + > + tmp = concat_cmdline(command_line, str); > + if (!tmp) { > + fprintf(stderr, "out of memory\n"); > return -1; > } > - strcat(command_line, str); > + > + free(command_line); > + command_line = tmp; > return 0; > } > > @@ -78,6 +83,8 @@ int image_s390_load_file(int argc, char **argv, struct > kexec_info *info) > if (info->initrd_fd == -1) { > fprintf(stderr, "Could not open initrd file %s:%s\n", > ramdisk, strerror(errno)); > + free(command_line); > + command_line = NULL; > return -1; > } > } > @@ -97,7 +104,7 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > const char *ramdisk; > off_t ramdisk_len; > unsigned int ramdisk_origin; > - int opt; > + int opt, ret = -1; > > if (info->file_mode) > return image_s390_load_file(argc, argv, info); > @@ -112,7 +119,6 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > }; > static const char short_options[] = KEXEC_OPT_STR ""; > > - command_line[0] = 0; > ramdisk = NULL; > ramdisk_len = 0; > ramdisk_origin = 0; > @@ -132,7 +138,7 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > if (info->kexec_flags & KEXEC_ON_CRASH) { > if (parse_iomem_single("Crash kernel\n", &crash_base, > &crash_end)) > - return -1; > + goto out; > } > > /* Add kernel segment */ > @@ -151,7 +157,7 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > rd_buffer = slurp_file_mmap(ramdisk, &ramdisk_len); > if (rd_buffer == NULL) { > fprintf(stderr, "Could not read ramdisk.\n"); > - return -1; > + goto out; > } > ramdisk_origin = MAX(RAMDISK_ORIGIN_ADDR, kernel_size); > ramdisk_origin = _ALIGN_UP(ramdisk_origin, 0x10); > @@ -160,7 +166,7 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > } > if (info->kexec_flags & KEXEC_ON_CRASH) { > if (load_crashdump_segments(info, crash_base, crash_end)) > - return -1; > + goto out; > } else { > info->entry = (void *) IMAGE_READ_OFFSET; > } > @@ -183,15 +189,28 @@ image_s390_load(int argc, char **argv, const char > *kernel_buf, > *tmp = crash_end - crash_base + 1; > } > } > - /* > - * We will write a probably given command line. > - * First, erase the old area, then setup the new parameters: > - */ > - if (strlen(command_line) != 0) { > - memset(krnl_buffer + COMMAND_LINE_OFFS, 0, COMMAND_LINESIZE); > - memcpy(krnl_buffer + COMMAND_LINE_OFFS, command_line, > strlen(command_line)); > + > + if (command_line) { > + unsigned long maxsize; > + char *dest = krnl_buffer + COMMAND_LINE_OFFS; > + > + maxsize = *(unsigned long *)(krnl_buffer + > MAX_COMMAND_LINESIZE_OFFS); >
Re: [PATCH] kernel/crash_core: suppress unknown crashkernel parameter warning
Hi Baoquan, On Tue, 7 Dec 2021 11:34:46 +0800 Baoquan He wrote: > On 12/06/21 at 12:17pm, Philipp Rudo wrote: > > When booting with crashkernel= on the kernel command line a warning > > similar to > > > > [0.038294] Kernel command line: ro console=ttyS0 crashkernel=256M > > [0.038353] Unknown kernel command line parameters "crashkernel=256M", > > will be passed to user space. > > > > is printed. This originates from crashkernel= being parsed independent from > > the early_param() mechanism. So the code in init/main.c doesn't know > > Not only the early_param(), __setup() also takes the same mechanism. > It's just handled in different stage. You might need to call it kernel > param handling mechanism, not sure if it's accurate. you are right, "kernel param handling" is better. I used early_param as that's where we would need to hook into if we wanted to use the common kernel param handling. But I don't think it is worth it. @akpm: do you update the commit message before sending the patch to Linus or shall I send a v2? > > that crashkernel= is a valid kernel parameter and prints this incorrect > > warning. Suppress the warning by adding a dummy early_param handler for > > crashkernel=. > > The fix looks good to me, thanks. > > Acked-by: Baoquan He Thanks > By the way, on which arch did you find this issue? Ask because I am > wondering whether there's any other similiar independent kernel cmdline > handling from __setup_param(). If have, is there a chance to take a > common method to handle them, e.g a generic function or a place to > identify them. Just wild thought, I have no idea yet. Otherwise, we may > need several this kind of dummy handler for each one. The issue was first reported on s390 but I used x86 to test the fix. The only other reported parameter I encountered was BOOT_IMAGE= which is not a kernel parameter and thus correct. But in the corresponding bugzilla Andrew (on cc) said "Gah! I thought I had squashed all of these interesting uses of the kernel command line, it is like playing whack-a-mole." So I believe there were multiple other parameters that had the same problem. Thanks Philipp > > Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") > > Signed-off-by: Philipp Rudo > > --- > > kernel/crash_core.c | 11 +++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > > index eb53f5ec62c9..256cf6db573c 100644 > > --- a/kernel/crash_core.c > > +++ b/kernel/crash_core.c > > @@ -6,6 +6,7 @@ > > > > #include > > #include > > +#include > > #include > > #include > > > > @@ -295,6 +296,16 @@ int __init parse_crashkernel_low(char *cmdline, > > "crashkernel=", suffix_tbl[SUFFIX_LOW]); > > } > > > > +/* > > + * Add a dummy early_param handler to mark crashkernel= as a known command > > line > > + * parameter and suppress incorrect warnings in init/main.c. > > + */ > > +static int __init parse_crashkernel_dummy(char *arg) > > +{ > > + return 0; > > +} > > +early_param("crashkernel", parse_crashkernel_dummy); > > + > > Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, > > void *data, size_t data_len) > > { > > -- > > 2.31.1 > > > > > > ___ > > kexec mailing list > > kexec@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/kexec > > > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH] kernel/crash_core: suppress unknown crashkernel parameter warning
When booting with crashkernel= on the kernel command line a warning similar to [0.038294] Kernel command line: ro console=ttyS0 crashkernel=256M [0.038353] Unknown kernel command line parameters "crashkernel=256M", will be passed to user space. is printed. This originates from crashkernel= being parsed independent from the early_param() mechanism. So the code in init/main.c doesn't know that crashkernel= is a valid kernel parameter and prints this incorrect warning. Suppress the warning by adding a dummy early_param handler for crashkernel=. Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") Signed-off-by: Philipp Rudo --- kernel/crash_core.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index eb53f5ec62c9..256cf6db573c 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -6,6 +6,7 @@ #include #include +#include #include #include @@ -295,6 +296,16 @@ int __init parse_crashkernel_low(char *cmdline, "crashkernel=", suffix_tbl[SUFFIX_LOW]); } +/* + * Add a dummy early_param handler to mark crashkernel= as a known command line + * parameter and suppress incorrect warnings in init/main.c. + */ +static int __init parse_crashkernel_dummy(char *arg) +{ + return 0; +} +early_param("crashkernel", parse_crashkernel_dummy); + Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len) { -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] s390/kexec: fix memory leak of ipl report buffer
Hi Baoquan, On Fri, 29 Oct 2021 17:26:35 +0800 Baoquan He wrote: > A memory leak is reported by kmemleak scanning: > > unreferenced object 0x38000195000 (size 4096): > comm "kexec", pid 8548, jiffies 4294953647 (age 32443.270s) > hex dump (first 32 bytes): > 00 00 00 c8 20 00 00 00 00 00 00 c0 02 80 00 00 ... > 40 40 40 40 40 40 40 40 00 00 00 00 00 00 00 00 > backtrace: > [<11a2f199>] __vmalloc_node_range+0xc0/0x140 > [<81fa2752>] vzalloc+0x5a/0x70 > [<63a4c92d>] ipl_report_finish+0x2c/0x180 > [<553304da>] kexec_file_add_ipl_report+0xf4/0x150 > [<862d033f>] kexec_file_add_components+0x124/0x160 > [<0d2717bb>] arch_kexec_kernel_image_load+0x62/0x90 > [<2e0373b6>] kimage_file_alloc_init+0x1aa/0x2e0 > [<60f2d14f>] __do_sys_kexec_file_load+0x17c/0x2c0 > [<8c86fe5a>] __s390x_sys_kexec_file_load+0x40/0x50 > [<1fdb9dac>] __do_syscall+0x1bc/0x1f0 > [<3ee4258d>] system_call+0x78/0xa0 > > The ipl report buffer is allocated via vmalloc, while has no chance to free > if the kexe loading is unloaded. This will cause obvious memory leak > when kexec/kdump kernel is reloaded, manually, or triggered, e.g by > memory hotplug. > > Here, add struct kimage_arch to s390 to pass out the ipl report buffer > address, and introduce arch dependent function > arch_kimage_file_post_load_cleanup() to free it when unloaded. > > Signed-off-by: Baoquan He other than a missing Fixes: 99feaa717e55 ("s390/kexec_file: Create ipl report and pass to next kernel") the patch looks good to me. Reviewed-by: Philipp Rudo > --- > arch/s390/include/asm/kexec.h | 7 +++ > arch/s390/kernel/machine_kexec_file.c | 9 + > 2 files changed, 16 insertions(+) > > diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h > index ea398a05f643..9f16e99fb882 100644 > --- a/arch/s390/include/asm/kexec.h > +++ b/arch/s390/include/asm/kexec.h > @@ -74,6 +74,13 @@ void *kexec_file_add_components(struct kimage *image, > int arch_kexec_do_relocs(int r_type, void *loc, unsigned long val, >unsigned long addr); > > +#define ARCH_HAS_KIMAGE_ARCH > + > +struct kimage_arch { > + void *ipl_buf; > +}; > + > + > extern const struct kexec_file_ops s390_kexec_image_ops; > extern const struct kexec_file_ops s390_kexec_elf_ops; > > diff --git a/arch/s390/kernel/machine_kexec_file.c > b/arch/s390/kernel/machine_kexec_file.c > index f9e4baa64b67..584c48b631b2 100644 > --- a/arch/s390/kernel/machine_kexec_file.c > +++ b/arch/s390/kernel/machine_kexec_file.c > @@ -202,6 +202,7 @@ static int kexec_file_add_ipl_report(struct kimage *image, > buf.buffer = ipl_report_finish(data->report); > buf.bufsz = data->report->size; > buf.memsz = buf.bufsz; > + image->arch.ipl_buf = buf.buffer; > > data->memsz += buf.memsz; > > @@ -321,3 +322,11 @@ int arch_kexec_kernel_image_probe(struct kimage *image, > void *buf, > > return kexec_image_probe_default(image, buf, buf_len); > } > + > +int arch_kimage_file_post_load_cleanup(struct kimage *image) > +{ > + kvfree(image->arch.ipl_buf); > + image->arch.ipl_buf = NULL; > + > + return kexec_image_post_load_cleanup_default(image); > +} ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 2/2] makedumpfile: Fix --dry-run for incomplete dumps
When running out of space during a dry run, e.g. by limiting the output size using the -L option, makedumpfile fails with [...] Can't write the dump file(vmcore). Size limit(104857600) reached. get_nr_pages: Can't seek end of the dump file(vmcore). __read_disk_dump_header: Can't open a file(vmcore). No such file or directory [...] This is because for --dry-run no dump file is created and a dummy file descriptor of -1 is used. Thus the file operations in get_nr_pages and check_and_modify_*_headers fail. Fix the first error by using write_bytes as file size for the output file and the second one by always returning TRUE for check_and_modify_headers. Fixes: 3422e1d ("[PATCH 1/2] Add --dry-run option to prevent writing the dumpfile") Signed-off-by: Philipp Rudo --- makedumpfile.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/makedumpfile.c b/makedumpfile.c index 30f9725..c063267 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -5113,6 +5113,9 @@ check_and_modify_multiple_kdump_headers() { int check_and_modify_headers() { + if (info->flag_dry_run) + return TRUE; + if (info->flag_elf_dumpfile) return check_and_modify_elf_headers(info->name_dumpfile); else @@ -7996,7 +7999,11 @@ get_nr_pages(void *buf, struct cache_data *cd_page){ int size, file_end, nr_pages; page_desc_t *pd = buf; - file_end = lseek(cd_page->fd, 0, SEEK_END); + if (info->flag_dry_run) + file_end = write_bytes; + else + file_end = lseek(cd_page->fd, 0, SEEK_END); + if (file_end < 0) { ERRMSG("Can't seek end of the dump file(%s).\n", cd_page->file_name); return -1; -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 1/2] makedumpfile: Fix bad file descriptor error when using --dry-run
When running makedumpfile with the --dry-run option it fails with [...] write_and_check_space: Can't seek the dump file(vmcore). Bad file descriptor [...] This is because for --dry-run no dump file is created and a dummy file descriptor of -1 is used. Thus the lseek in write_and_check_space will fail. Fix this by treating a dry run as if writing to STDOUT. Fixes: f0cfa86 ("[PATCH v2 3/3] Add -L option to limit output file size") Signed-off-by: Philipp Rudo --- makedumpfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/makedumpfile.c b/makedumpfile.c index b1b3b42..30f9725 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -4712,7 +4712,7 @@ write_and_check_space(int fd, void *buf, size_t buf_size, const char* desc, int retval = 0; off_t pos; - if (fd == STDOUT_FILENO) { + if (fd == STDOUT_FILENO || info->flag_dry_run) { pos = write_bytes; } else { pos = lseek(fd, 0, SEEK_CUR); -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 0/2] makedumpfile: fix two bugs with --dry-run
Hi, While playing with the --dry-run option I noticed two bugs. You can find the fixes below. Thanks Philipp Philipp Rudo (2): makedumpfile: Fix bad file descriptor error when using --dry-run makedumpfile: Fix --dry-run for incomplete dumps makedumpfile.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time
Hi, On Fri, 25 Sep 2020 10:56:25 -0400 Konrad Rzeszutek Wilk wrote: > On Fri, Sep 25, 2020 at 11:05:58AM +0800, Dave Young wrote: > > Hi, > > > > On 09/24/20 at 01:16pm, boris.ostrov...@oracle.com wrote: > > > > > > On 9/24/20 12:43 PM, Michael Kelley wrote: > > > > From: Eric W. Biederman Sent: Thursday, > > > > September 24, 2020 9:26 AM > > > >> Michael Kelley writes: > > > >> > > > > Added Hyper-V people and people who created the param, it is below > > > > commit, I also want to remove it if possible, let's see how people > > > > think, but the least way should be to disable the auto setting in > > > > both systemd > > > > and kernel: > > > >>> Hyper-V uses a notifier to inform the host system that a Linux VM has > > > >>> panic'ed. Informing the host is particularly important in a public > > > >>> cloud > > > >>> such as Azure so that the cloud software can alert the customer, and > > > >>> can > > > >>> track cloud-wide reliability statistics. Whether a kdump is taken > > > >>> is controlled > > > >>> entirely by the customer and how he configures the VM, and we want > > > >>> the host to be informed either way. > > > >> Why? > > > >> > > > >> Why does the host care? > > > >> Especially if the VM continues executing into a kdump kernel? > > > > The host itself doesn't care. But the host is a convenient out-of-band > > > > channel for recording that a panic has occurred and to collect basic > > > > data > > > > about the panic. This out-of-band channel is then used to notify the > > > > end > > > > customer that his VM has panic'ed. Sure, the customer should be running > > > > his own monitoring software, but customers don't always do what they > > > > should. Equally important, the out-of-band channel allows the cloud > > > > infrastructure software to notice trends, such as that the rate of Linux > > > > panics has increased, and that perhaps there is a cloud problem that > > > > should be investigated. > > > > > > > > > In many cases (especially in cloud environment) your dump device is > > > remote (e.g. iscsi) and kdump sometimes (often?) gets stuck because of > > > connectivity issues (which could be cause of the panic in the first > > > place). So it is quite desirable to inform the infrastructure that the VM > > > is on its way out without waiting for kdump to complete. > > > > That can probably be done in kdump kernel if it is really needed. Say > > informing host that panic happened and a kdump kernel is runnning. > > If kdump kernel gets to that point. Sometimes (sadly) it ends up being > misconfigured and it chokes up - and hence having multiple ways to emit > the crash information before running kdump kernel is a life-saver. > > > > > But I think to set crash_kexec_post_notifiers by default is still bad. > > Because of the way it is run today I presume? If there was some > safe/unsafe policy that should work right? I would think that the > safe ones that work properly all the time are: > > - HyperV CRASH_MSRs, > - KVM PVPANIC_[PANIC,CRASHLOAD] push button knob, > - pstore EFI variables > - Dumping in memory, > > And then some that depend on firmware version (aka BIOS, and vendor) are: > - ACPI ERST, > > And then the unsafe: > - s390, PowerPC (I don't actually know what they are but that > was Dave's primary motivator). that won't work on s390. Let me emphasize that the problems on s390 are not the notifiers themselves but the fact that they are called before crash_kexec. On s390 we have multiple dump methods besides kdump. We use a panic notifier to trigger these dump methods from the panicking kernel. The problem is that these dump methods are less powerful than kdump so we only want to use them as fallback, i.e. only use them when either kdump wasn't configured or loading of the crash kernel failed for whatever reason. That's why (plus historic reasons) our notifier stops the machine when it is called and none of the methods is configured. Which means that the second crash_kexec is never reached. Long story short, the problem on s390 is caused by the two hunks in kernel/panic.c:panic from f06e5153f4ae ("kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers"). Besides the problems on s390 I support Dave and think that setting crash_kexec_post_notifiers by default is wrong. We should keep in mind that we are in a panic situation. This means that the kernel is in a state where it doesn't trust itself anymore. So we should keep the code that is run to the bare minimum as we cannot rely on it to work properly. Thanks Philipp > > > > > > > > > > > > > > > > >> Further like I have mentioned everytime something like this has come up > > > >> a call on the kexec on panic code path should be a direct call (That > > > >> can > > > >> be audited) not something hidden in a notifier call chain (which can > > > >> not). > > > >> > > > > > > We btw already have
Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time
Hi Konrad, On Mon, 21 Sep 2020 16:18:12 -0400 Konrad Rzeszutek Wilk wrote: > On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote: > > On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young wrote: > > > > > crash_kexec_post_notifiers enables running various panic notifier > > > before kdump kernel booting. This increases risks of kdump failure. > > > It is well documented in kernel-parameters.txt. We do not suggest > > > people to enable it together with kdump unless he/she is really sure. > > > This is also not suggested to be enabled by default when users are > > > not aware in distributions. > > > > > > But unfortunately it is enabled by default in systemd, see below > > > discussions in a systemd report, we can not convince systemd to change > > > it: > > > https://github.com/systemd/systemd/issues/16661 > > > > > > Actually we have got reports about kdump kernel hangs in both s390x > > > and powerpcle cases caused by the systemd change, also some x86 cases > > > could also be caused by the same (although that is in Hyper-V code > > > instead of systemd, that need to be addressed separately). > > Perhaps it may be better to fix the issus on s390x and PowerPC as well? There's little s390 can fix. We use the panic_notifier_list to start other dumpers in case kdump isn't configured or failed. This behavior was introduced in 2006 long before crash_kexec_post_notifiers were introduced. So I suggest that crash_kexec_post_notifiers are fixed instead. > > > > > > Thus to avoid the auto enablement here just disable the param writable > > > permission in sysfs. > > > > > > > Well. I don't think this is at all a desirable way of resolving a > > disagreement with the systemd developers > > > > At the above github address I'm seeing "ryncsn added a commit to > > ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't > > enable crash_kexec_post_notifiers by default". So didn't that address > > the issue? > > It does in systemd, but there is a strong interest in making this on by > default. AFAIK pstore requires UEFI to work. So what's the point to enable it on non-UEFI systems? Thanks Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V40 08/29] kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE
Hi Matthew, found a typo ... On Mon, 19 Aug 2019 17:17:44 -0700 Matthew Garrett wrote: [...] > index 6d0635ceddd0..9b4f37a4edf1 100644 > --- a/arch/s390/kernel/kexec_elf.c > +++ b/arch/s390/kernel/kexec_elf.c > @@ -130,7 +130,7 @@ static int s390_elf_probe(const char *buf, unsigned long > len) > const struct kexec_file_ops s390_kexec_elf_ops = { > .probe = s390_elf_probe, > .load = s390_elf_load, > -#ifdef CONFIG_KEXEC_VERIFY_SIG > +#ifdef CONFIG_KEXEC__SIG ^^ ... here. > .verify_sig = s390_verify_sig, > -#endif /* CONFIG_KEXEC_VERIFY_SIG */ > +#endif /* CONFIG_KEXEC_SIG */ > }; Thanks Philipp
Re: [PATCH RFC] generic ELF support for kexec
Hi Sven, On Tue, 25 Jun 2019 20:54:34 +0200 Sven Schnelle wrote: > Hi List, > > i recently started working on kexec for PA-RISC. While doing so, i figured > that powerpc already has support for reading ELF images inside of the Kernel. > My first attempt was to steal the source code and modify it for PA-RISC, but > it turned out that i didn't had to change much. Only ARM specific stuff like > fdt blob fetching had to be removed. > > So instead of duplicating the code, i thought about moving the ELF stuff to > the core kexec code, and exposing several function to use that code from the > arch specific code. That's always the right approach. Well done. > I'm attaching the patch to this Mail. What do you think about that change? > s390 also uses ELF files, and (maybe?) could also switch to this > implementation. > But i don't know anything about S/390 and don't have one in my basement. So > i'll leave s390 to the IBM folks. I'm afraid there isn't much code here s390 can reuse. I see multiple problems in kexec_elf_load: * while loading the phdrs we also need to setup some data structures to pass to the next kernel * the s390 kernel needs to be loaded to a fixed address * there is no support to load a crash kernel Of course that could all be fixed/worked around by introducing some arch hooks. But when you take into account that the whole elf loader on s390 is ~100 lines of code, I don't think it is worth it. More comments below. [...] > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index b9b1bc5f9669..49b23b425f84 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -216,6 +216,41 @@ extern int crash_prepare_elf64_headers(struct crash_mem > *mem, int kernel_map, > void **addr, unsigned long *sz); > #endif /* CONFIG_KEXEC_FILE */ > > +#ifdef CONFIG_KEXEC_FILE_ELF > + > +struct kexec_elf_info { > + /* > + * Where the ELF binary contents are kept. > + * Memory managed by the user of the struct. > + */ > + const char *buffer; > + > + const struct elfhdr *ehdr; > + const struct elf_phdr *proghdrs; > + struct elf_shdr *sechdrs; > +}; Do i understand this right? elf_info->buffer contains the full elf file and elf_info->ehdr is a (endianness translated) copy of the files ehdr? If so ... > +void kexec_free_elf_info(struct kexec_elf_info *elf_info); > + > +int kexec_build_elf_info(const char *buf, size_t len, struct elfhdr *ehdr, > + struct kexec_elf_info *elf_info); > + > +int kexec_elf_kernel_load(struct kimage *image, struct kexec_buf *kbuf, > + char *kernel_buf, unsigned long kernel_len, > + unsigned long *kernel_load_addr); > + > +int kexec_elf_probe(const char *buf, unsigned long len); > + > +int kexec_elf_load(struct kimage *image, struct elfhdr *ehdr, > + struct kexec_elf_info *elf_info, > + struct kexec_buf *kbuf, > + unsigned long *lowest_load_addr); > + > +int kexec_elf_load(struct kimage *image, struct elfhdr *ehdr, > + struct kexec_elf_info *elf_info, > + struct kexec_buf *kbuf, > + unsigned long *lowest_load_addr); ... you could simplify the arguments by dropping the ehdr argument. The elf_info should contain all the information needed. Furthermore the kexec_buf also contains a pointer to its kimage. So you can drop that argument as well. An other thing is that you kzalloc the memory needed for proghdrs and sechdrs but expect the user of those functions to provide the memory needed for ehdr. Wouldn't it be more consistent to also kzalloc the ehdr? [...] > diff --git a/kernel/kexec_file_elf.c b/kernel/kexec_file_elf.c > new file mode 100644 > index ..bb966c93492c > --- /dev/null > +++ b/kernel/kexec_file_elf.c > @@ -0,0 +1,574 @@ [...] > +static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, uint64_t value) > +{ > + if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB) > + value = le64_to_cpu(value); > + else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB) > + value = be64_to_cpu(value); > + > + return value; > +} > + > +static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value) > +{ > + if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB) > + value = le16_to_cpu(value); > + else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB) > + value = be16_to_cpu(value); > + > + return value; > +} > + > +static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value) > +{ > + if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB) > + value = le32_to_cpu(value); > + else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB) > + value = be32_to_cpu(value); > + > + return value; > +} What are the elf*_to_cpu good for? In general I'd assume that kexec loads a kernel for the same architecture it is running on. So the
Re: [PATCH v13 03/16] s390, kexec_file: drop arch_kexec_mem_walk()
Hey Akashi, I kept thinking about this patch and remembered why I didn't made the change you are suggesting now. The problem is when you only check for kbuf->mem you are excluding address 0, which might be a valid address to load the kernel to. On s390 this is actually done when the kernel is not loaded via a boot loader. For kexec_file however, we cut off the first few kB of the image and jump directly to 'startup'. So checking for !0 does not cause a problem here. Anyway, the long term safer solution would be something like #define KEXEC_BUF_MEM_UNKNOWN (-1UL) for architectures to tell common code to search a fitting mem hole. Back then I didn't do the change because I had the other workaround, which didn't require a common code change. But when you are touching the code now it is worth thinking about it. Just wanted to let you know Philipp On Wed, 1 Aug 2018 16:58:07 +0900 AKASHI Takahiro wrote: > Since s390 already knows where to locate buffers, calling > arch_kexec_mem_walk() has no sense. So we can just drop it as kbuf->mem > indicates this while all other architectures sets it to 0 initially. > > This change is a preparatory work for the next patch, where all the > variant memory walks, either on system resource or memblock, will be > put in one common place so that it will satisfy all the architectures' > need. > > Signed-off-by: AKASHI Takahiro > Reviewed-by: Philipp Rudo > Cc: Martin Schwidefsky > Cc: Heiko Carstens > Cc: Dave Young > Cc: Vivek Goyal > Cc: Baoquan He > --- > arch/s390/kernel/machine_kexec_file.c | 10 -- > kernel/kexec_file.c | 4 > 2 files changed, 4 insertions(+), 10 deletions(-) > > diff --git a/arch/s390/kernel/machine_kexec_file.c > b/arch/s390/kernel/machine_kexec_file.c > index f413f57f8d20..32023b4f9dc0 100644 > --- a/arch/s390/kernel/machine_kexec_file.c > +++ b/arch/s390/kernel/machine_kexec_file.c > @@ -134,16 +134,6 @@ int kexec_file_add_initrd(struct kimage *image, struct > s390_load_data *data, > return ret; > } > > -/* > - * The kernel is loaded to a fixed location. Turn off kexec_locate_mem_hole > - * and provide kbuf->mem by hand. > - */ > -int arch_kexec_walk_mem(struct kexec_buf *kbuf, > - int (*func)(struct resource *, void *)) > -{ > - return 1; > -} > - > int arch_kexec_apply_relocations_add(struct purgatory_info *pi, >Elf_Shdr *section, >const Elf_Shdr *relsec, > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 63c7ce1c0c3e..bf39df5e5bb9 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -534,6 +534,10 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) > { > int ret; > > + /* Arch knows where to place */ > + if (kbuf->mem) > + return 0; > + > ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback); > > return ret == 1 ? 0 : -EADDRNOTAVAIL; ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v12 03/16] s390, kexec_file: drop arch_kexec_mem_walk()
Hi AKASHI, the patch looks good to me. Reviewed-by: Philipp Rudo Thanks Philipp On Tue, 24 Jul 2018 15:57:46 +0900 AKASHI Takahiro wrote: > Since s390 already knows where to locate buffers, calling > arch_kexec_mem_walk() has no sense. So we can just drop it as kbuf->mem > indicates this while all other architectures sets it to 0 initially. > > This change is a preparatory work for the next patch, where all the > variant memory walks, either on system resource or memblock, will be > put in one common place so that it will satisfy all the architectures' > need. > > Signed-off-by: AKASHI Takahiro > Cc: Martin Schwidefsky > Cc: Heiko Carstens > Cc: Dave Young > Cc: Vivek Goyal > Cc: Baoquan He > --- > arch/s390/kernel/machine_kexec_file.c | 10 -- > kernel/kexec_file.c | 4 > 2 files changed, 4 insertions(+), 10 deletions(-) > > diff --git a/arch/s390/kernel/machine_kexec_file.c > b/arch/s390/kernel/machine_kexec_file.c > index f413f57f8d20..32023b4f9dc0 100644 > --- a/arch/s390/kernel/machine_kexec_file.c > +++ b/arch/s390/kernel/machine_kexec_file.c > @@ -134,16 +134,6 @@ int kexec_file_add_initrd(struct kimage *image, struct > s390_load_data *data, > return ret; > } > > -/* > - * The kernel is loaded to a fixed location. Turn off kexec_locate_mem_hole > - * and provide kbuf->mem by hand. > - */ > -int arch_kexec_walk_mem(struct kexec_buf *kbuf, > - int (*func)(struct resource *, void *)) > -{ > - return 1; > -} > - > int arch_kexec_apply_relocations_add(struct purgatory_info *pi, >Elf_Shdr *section, >const Elf_Shdr *relsec, > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 63c7ce1c0c3e..bf39df5e5bb9 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -534,6 +534,10 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) > { > int ret; > > + /* Arch knows where to place */ > + if (kbuf->mem) > + return 0; > + > ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback); > > return ret == 1 ? 0 : -EADDRNOTAVAIL; ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/1] x86/purgatory: Fix Makefile bug
Hi Andrew, can you send this fix to Linus like the original patch? Can you also tell me if the bug is severe enough to go to stable? For productive systems the bug shouldn't cause any problem because they are usually build from scratch. For developers however, it can cause quite a pain while debugging. @Dave: Thanks for looking into this. Thanks Philipp On Fri, 6 Jul 2018 16:30:29 +0800 Dave Young wrote: > Hi Philipp, > > On 07/05/18 at 10:43am, Philipp Rudo wrote: > > Hi Dave, > > > > On Thu, 5 Jul 2018 14:46:20 +0800 > > Dave Young wrote: > > > > > On 07/04/18 at 01:00pm, Philipp Rudo wrote: > > > > Hi everybody, > > > > > > > > when i moved the purgatories sha256 implementation to common code, i > > > > forgot > > > > to add FORCE to the new Makefile target. This patch fixes it. > > > > > > Hi Philipp, > > > > > > Do you have the exact steps of how to reproduce your problem? > > > > I only tested it on s390, but on x86 it should work the same way. > > > > - Build the kernel without the fix > > - Add some flag to the purgatories KBUILD_CFLAGS,I used > > -fno-asynchronous-unwind-tables > > - Re-build the kernel > > > > When you look at make's output you see that sha256.o is not re-build in the > > last step. Also readelf -S still shows the .eh_frame section for sha256.o. > > > > With the fix sha256.o is re-build in the last step. > > > > Hope that helps. > > Thanks for explanation. > > For the patch: > Acked-by: Dave Young > > Thanks > Dave > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/1] x86/purgatory: Fix Makefile bug
Hi Dave, On Thu, 5 Jul 2018 14:46:20 +0800 Dave Young wrote: > On 07/04/18 at 01:00pm, Philipp Rudo wrote: > > Hi everybody, > > > > when i moved the purgatories sha256 implementation to common code, i forgot > > to add FORCE to the new Makefile target. This patch fixes it. > > Hi Philipp, > > Do you have the exact steps of how to reproduce your problem? I only tested it on s390, but on x86 it should work the same way. - Build the kernel without the fix - Add some flag to the purgatories KBUILD_CFLAGS,I used -fno-asynchronous-unwind-tables - Re-build the kernel When you look at make's output you see that sha256.o is not re-build in the last step. Also readelf -S still shows the .eh_frame section for sha256.o. With the fix sha256.o is re-build in the last step. Hope that helps. Thanks Philipp > > > > Note: There is an equivalent bug in the s390 purgatory. The fix for that > > will go upstream via Martin. > > > > Thanks > > Philipp > > > > Philipp Rudo (1): > > x86/purgatory: Add missing FORCE to Makefile target > > > > arch/x86/purgatory/Makefile | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > -- > > 2.16.4 > > > > Thanks > Dave > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 0/1] x86/purgatory: Fix Makefile bug
Hi everybody, when i moved the purgatories sha256 implementation to common code, i forgot to add FORCE to the new Makefile target. This patch fixes it. Note: There is an equivalent bug in the s390 purgatory. The fix for that will go upstream via Martin. Thanks Philipp Philipp Rudo (1): x86/purgatory: Add missing FORCE to Makefile target arch/x86/purgatory/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- 2.16.4 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH 1/1] x86/purgatory: Add missing FORCE to Makefile target
Without FORCE make does not detect changes only made to the command line options. So object files might not be re-built even when they should be. Fix this by adding FORCE where it is missing. Fixes: df6f2801f511 ("kernel/kexec_file.c: move purgatories sha256 to common code") Cc: # 4.17 Signed-off-by: Philipp Rudo --- arch/x86/purgatory/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile index 2e9ee023e6bc..81a8e33115ad 100644 --- a/arch/x86/purgatory/Makefile +++ b/arch/x86/purgatory/Makefile @@ -6,7 +6,7 @@ purgatory-y := purgatory.o stack.o setup-x86_$(BITS).o sha256.o entry64.o string targets += $(purgatory-y) PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y)) -$(obj)/sha256.o: $(srctree)/lib/sha256.c +$(obj)/sha256.o: $(srctree)/lib/sha256.c FORCE $(call if_changed_rule,cc_o_c) LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z nodefaultlib -- 2.16.4 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kexec/s390: Add support for kexec_file_load
On Tue, 22 May 2018 10:11:04 +0200 Simon Horman wrote: > On Mon, May 21, 2018 at 02:24:40PM +0800, Dave Young wrote: > > On 05/16/18 at 02:27pm, Philipp Rudo wrote: > > > Since kernel 4.17-rc2 s390 supports the kexec_file_load system call. Add > > > the new system call to kexec-tools and provide the -s > > > (--kexec-file-syscall) > > > option for s390 to support this new feature. > > > > > > Signed-off-by: Philipp Rudo > > ... > > > Acked-by: Dave Young > > Thanks, applied. > Sorry for the late answer, I had the last week off. Nevertheless thanks to both of you. Philipp ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH] kexec/s390: Add support for kexec_file_load
Since kernel 4.17-rc2 s390 supports the kexec_file_load system call. Add the new system call to kexec-tools and provide the -s (--kexec-file-syscall) option for s390 to support this new feature. Signed-off-by: Philipp Rudo --- kexec/arch/s390/kexec-image.c | 46 +++ kexec/kexec-syscall.h | 3 +++ 2 files changed, 49 insertions(+) diff --git a/kexec/arch/s390/kexec-image.c b/kexec/arch/s390/kexec-image.c index 0c8937b..8b39566 100644 --- a/kexec/arch/s390/kexec-image.c +++ b/kexec/arch/s390/kexec-image.c @@ -22,6 +22,7 @@ #include "../../kexec/crashdump.h" #include "kexec-s390.h" #include +#include static uint64_t crash_base, crash_end; static char command_line[COMMAND_LINESIZE]; @@ -45,6 +46,48 @@ int command_line_add(const char *str) return 0; } +int image_s390_load_file(int argc, char **argv, struct kexec_info *info) +{ + const char *ramdisk = NULL; + int opt; + + static const struct option options[] = + { + KEXEC_OPTIONS + {"command-line", 1, 0, OPT_APPEND}, + {"append", 1, 0, OPT_APPEND}, + {"initrd", 1, 0, OPT_RAMDISK}, + {0, 0, 0, 0}, + }; + static const char short_options[] = KEXEC_OPT_STR ""; + + while ((opt = getopt_long(argc, argv, short_options, options, 0)) != -1) { + switch(opt) { + case OPT_APPEND: + if (command_line_add(optarg)) + return -1; + break; + case OPT_RAMDISK: + ramdisk = optarg; + break; + } + } + + if (ramdisk) { + info->initrd_fd = open(ramdisk, O_RDONLY); + if (info->initrd_fd == -1) { + fprintf(stderr, "Could not open initrd file %s:%s\n", + ramdisk, strerror(errno)); + return -1; + } + } + + info->command_line = command_line; + info->command_line_len = strlen (command_line) + 1; + + return 0; +} + int image_s390_load(int argc, char **argv, const char *kernel_buf, off_t kernel_size, struct kexec_info *info) @@ -56,6 +99,9 @@ image_s390_load(int argc, char **argv, const char *kernel_buf, unsigned int ramdisk_origin; int opt; + if (info->file_mode) + return image_s390_load_file(argc, argv, info); + static const struct option options[] = { KEXEC_OPTIONS diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h index 33638c2..b96e02a 100644 --- a/kexec/kexec-syscall.h +++ b/kexec/kexec-syscall.h @@ -64,6 +64,9 @@ #ifdef __powerpc64__ #define __NR_kexec_file_load 382 #endif +#ifdef __s390x__ +#define __NR_kexec_file_load 381 +#endif #ifndef __NR_kexec_file_load /* system call not available for the arch */ -- 2.16.3 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v1 0/2] kexec: Remove "weak" annotations from headers
Hi Bjorn, in recent patches AKASHI [1] and I [2] made some changes to the declarations you are touching and already removed some of the weak statements. The patches got accepted on linux-next and will (hopefully) be pulled for v4.17. So you should prepare for some merge conflicts. Nevertheless three weak statements still remain (arch_kexec_walk_mem & arch_kexec_apply_relocations*) so your patch still makes totally sense. Thanks Philipp [1] https://lkml.org/lkml/2018/3/6/201 [2] https://lkml.org/lkml/2018/3/21/278 On Thu, 12 Apr 2018 13:23:29 -0500 Bjorn Helgaas wrote: > "Weak" annotations in header files are error-prone because they make > every definition weak. Remove them from include/linux/kexec.h. > > These were introduced in two separate commits, so this is in two > patches so they can be easily backported to stable kernels (some of > them date back to v4.3 and one only goes back to v4.10). > > --- > > Bjorn Helgaas (2): > kexec: Remove "weak" from kexec_file function declarations > kexec: Remove "weak" from arch_kexec_walk_mem() declaration > > > include/linux/kexec.h | 24 > 1 file changed, 12 insertions(+), 12 deletions(-) > > ___ > kexec mailing list > kexec@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 00/11] kexec_file: Clean up purgatory load
Hi Andrew, Hi Dave, i checked out linux-next with the patches and it looks good to me. Also made a quick test and everything works fine. Thanks very much to both of you Philipp On Wed, 21 Mar 2018 16:00:42 -0700 Andrew Morton wrote: > On Wed, 21 Mar 2018 12:27:40 +0100 Philipp Rudo > wrote: > > > here is the updated patch set including Dave's comments. > > There were some syntactic clashes with the "kexec_file, x86, powerpc: > refactoring for other architecutres" series. Please check (in > linux-next) that I got it all correct and that both patch series are > functioning correctly. > > I expect to send all this into Stephen for -next tomorrow. > ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec