Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in e820_table_firmware

2024-09-17 Thread Eric W. Biederman
Ard Biesheuvel  writes:

> On Tue, 17 Sept 2024 at 17:24, Eric W. Biederman  
> wrote:
>>
>> Ard Biesheuvel  writes:
>>
>> > Hi Eric,
>> >
>> > Thanks for chiming in.
>>
>> It just looked like after James gave some expert input the
>> conversation got stuck, so I am just trying to move it along.
>>
>> I don't think anyone knows what this whole elephant looks like,
>> which makes solving the problem tricky.
>>
>> > On Mon, 16 Sept 2024 at 22:21, Eric W. Biederman  
>> > wrote:
>> >>
> ...
>> >>
>> >> This leaves two practical questions if I have been following everything
>> >> correctly.
>> >>
>> >> 1) How to get kexec to avoid picking that memory for the new kernel to
>> >>run in before it initializes itself. (AKA the getting stomped by
>> >>relocate kernel problem).
>> >>
>> >> 2) How to point the new kernel to preserved tpm_log.
>> >>
>> >>
>> >> This recommendation is from memory so it may be a bit off but
>> >> the general structure should work.  The idea is as follows.
>> >>
>> >> - Pass the information between kernels.
>> >>
>> >>   It is probably simplest for the kernel to have a command line option
>> >>   that tells the kernel the address and size of the tpm_log.
>> >>
>> >>   We have a couple of mechanisms here.  Assuming you are loading a
>> >>   bzImage with kexec_file_load you should be able to have the in-kernel
>> >>   loader add those arguments to the kernel command line.
>> >>
>> >
>> > This shouldn't be necessary, and I think it is actively harmful to
>> > keep inventing special ways for the kexec kernel to learn about these
>> > things that deviate from the methods used by the first kernel. This is
>> > how we ended up with 5 sources of truth for the physical memory map
>> > (EFI memory map, memblock and 3 different versions of the e820 memory
>> > map).
>> >
>> > We should try very hard to make kexec idempotent, and reuse the
>> > existing methods where possible. In this case, the EFI configuration
>> > table is already being exposed to the kexec kernel, which describes
>> > the base of the allocation. The size of the allocation can be derived
>> > from the table header.
>> >
>> >> - Ensure that when the loader is finding an address to load the new
>> >>   kernel it treats the address of the tpm_log as unavailable.
>> >>
>> >
>> > The TPM log is a table created by the EFI stub loader, which is part
>> > of the kernel. So if we need to tweak this for kexec's benefit, I'd
>> > prefer changing it in a way that can accommodate the first kernel too.
>> > However, I think the current method already has that property so I
>> > don't think we need to do anything (modulo fixing the bug)
>>
>> I am fine with not inventing a new mechanism, but I think we need
>> to reuse whatever mechanism the stub loader uses to pass its
>> table to the kernel.  Not the EFI table that disappears at
>> ExitBootServices().
>>
>
> Not sure what you mean here - the EFI table that gets clobbered by
> kexec *is* the table that is created by the stub loader to pass the
> TPM log to the kernel. Not sure what alternative you have in mind
> here.

I was referring to whatever EFI table James Bottomley mentioned, which I
presume the stub loader reads from when it constructs the tpm_log that
is passed to the kernel.

So I believe we are in agreement on everything except terminology.

>> > That said, I am doubtful that the kexec kernel can make meaningful use
>> > of the TPM log to begin with, given that the TPM will be out of sync
>> > at this point. But it is still better to keep it for symmetry, letting
>> > the higher level kexec/kdump logic running in user space reason about
>> > whether the TPM log has any value to it.
>>
>> Someone seems to think so or there would not be a complaint that it is
>> getting corrupted.
>>
>
> No. The problem is that the size of the table is *in* the table, and
> so if it gets corrupted, the code that attempts to memblock_reserve()
> it goes off into the weeds. But that does not imply there is a point
> to having access to this table from a kexec kernel in the first place.

If there is no point to having access to it then we should just not
pass anything to the loaded kernel, so the kernel d

Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in e820_table_firmware

2024-09-17 Thread Eric W. Biederman
Ard Biesheuvel  writes:

> Hi Eric,
>
> Thanks for chiming in.

It just looked like after James gave some expert input the
conversation got stuck, so I am just trying to move it along.

I don't think anyone knows what this whole elephant looks like,
which makes solving the problem tricky.

> On Mon, 16 Sept 2024 at 22:21, Eric W. Biederman  
> wrote:
>>
>> James Bottomley  writes:
>>
>> > On Fri, 2024-09-13 at 04:57 -0700, Breno Leitao wrote:
>> >> Hello James,
>> >>
>> >> On Thu, Sep 12, 2024 at 12:22:01PM -0400, James Bottomley wrote:
>> >> > On Thu, 2024-09-12 at 06:03 -0700, Breno Leitao wrote:
>> >> > > Hello Ard,
>> >> > >
>> >> > > On Thu, Sep 12, 2024 at 12:51:57PM +0200, Ard Biesheuvel wrote:
>> >> > > > I don't see how this could be an EFI bug, given that it does
>> >> > > > not deal with E820 tables at all.
>> >> > >
>> >> > > I want to back up a little bit and make sure I am following the
>> >> > > discussion.
>> >> > >
>> >> > > From what I understand from previous discussion, we have an EFI
>> >> > > bug as the root cause of this issue.
>> >> > >
>> >> > > This happens because EFI does NOT mark the EFI TPM event log
>> >> > > memory region as reserved (EFI_RESERVED_TYPE). Without an
>> >> > > entry for the event table memory in the EFI memory map,
>> >> > > libstub will ignore it completely (the TPM event log memory
>> >> > > range) and not populate the e820 table with it.
>> >> >
>> >> > Wait, that's not correct.  The TPM log is in memory that doesn't
>> >> > survive ExitBootServices (by design in case the OS doesn't care
>> >> > about it).  So the EFI stub actually copies it over to a new
>> >> > configuration table that is in reserved memory before it calls
>> >> > ExitBootServices.  This new copy should be in kernel reserved
>> >> > memory regardless of its e820 map status.
>> >>
>> >> First of all, thanks for clarifying some points here.
>> >>
>> >> How should the TPM log table be passed to the next kernel when
>> >> kexecing() since it didn't survive ExitBootServices?
>> >
>> > I've no idea.  I'm assuming you don't elaborately reconstruct the EFI
>> > boot services, so you can't enter the EFI boot stub before
>> > ExitBootServices is called?  So I'd guess you want to preserve the EFI
>> > table that copied the TPM data into kernel memory.
>>
>> This leaves two practical questions if I have been following everything
>> correctly.
>>
>> 1) How to get kexec to avoid picking that memory for the new kernel to
>>run in before it initializes itself. (AKA the getting stomped by
>>relocate kernel problem).
>>
>> 2) How to point the new kernel to preserved tpm_log.
>>
>>
>> This recommendation is from memory so it may be a bit off but
>> the general structure should work.  The idea is as follows.
>>
>> - Pass the information between kernels.
>>
>>   It is probably simplest for the kernel to have a command line option
>>   that tells the kernel the address and size of the tpm_log.
>>
>>   We have a couple of mechanisms here.  Assuming you are loading a
>>   bzImage with kexec_file_load you should be able to have the in-kernel
>>   loader add those arguments to the kernel command line.
>>
>
> This shouldn't be necessary, and I think it is actively harmful to
> keep inventing special ways for the kexec kernel to learn about these
> things that deviate from the methods used by the first kernel. This is
> how we ended up with 5 sources of truth for the physical memory map
> (EFI memory map, memblock and 3 different versions of the e820 memory
> map).
>
> We should try very hard to make kexec idempotent, and reuse the
> existing methods where possible. In this case, the EFI configuration
> table is already being exposed to the kexec kernel, which describes
> the base of the allocation. The size of the allocation can be derived
> from the table header.
>
>> - Ensure that when the loader is finding an address to load the new
>>   kernel it treats the address of the tpm_log as unavailable.
>>
>
> The TPM log is a table created by the EFI stub loader, which is part
> of the kernel. So if we need to tweak this for kexec's benefit, I'd
> prefer changing it in a way that can accommodate the first kernel too.
> However, I think the current method already has that property so I
> don't think we need to do anything (modulo fixing the bug)

Re: [RFC] efi/tpm: add efi.tpm_log as a reserved region in e820_table_firmware

2024-09-16 Thread Eric W. Biederman
James Bottomley  writes:

> On Fri, 2024-09-13 at 04:57 -0700, Breno Leitao wrote:
>> Hello James,
>> 
>> On Thu, Sep 12, 2024 at 12:22:01PM -0400, James Bottomley wrote:
>> > On Thu, 2024-09-12 at 06:03 -0700, Breno Leitao wrote:
>> > > Hello Ard,
>> > > 
>> > > On Thu, Sep 12, 2024 at 12:51:57PM +0200, Ard Biesheuvel wrote:
>> > > > I don't see how this could be an EFI bug, given that it does
>> > > > not deal with E820 tables at all.
>> > > 
>> > > I want to back up a little bit and make sure I am following the
>> > > discussion.
>> > > 
>> > > From what I understand from previous discussion, we have an EFI
>> > > bug as the root cause of this issue.
>> > > 
>> > > This happens because EFI does NOT mark the EFI TPM event log
>> > > memory region as reserved (EFI_RESERVED_TYPE). Without an
>> > > entry for the event table memory in the EFI memory map,
>> > > libstub will ignore it completely (the TPM event log memory
>> > > range) and not populate the e820 table with it.
>> > 
>> > Wait, that's not correct.  The TPM log is in memory that doesn't
>> > survive ExitBootServices (by design in case the OS doesn't care
>> > about it).  So the EFI stub actually copies it over to a new
>> > configuration table that is in reserved memory before it calls
>> > ExitBootServices.  This new copy should be in kernel reserved
>> > memory regardless of its e820 map status.
>> 
>> First of all, thanks for clarifying some points here.
>> 
>> How should the TPM log table be passed to the next kernel when
>> kexecing() since it didn't survive ExitBootServices?
>
> I've no idea.  I'm assuming you don't elaborately reconstruct the EFI
> boot services, so you can't enter the EFI boot stub before
> ExitBootServices is called?  So I'd guess you want to preserve the EFI
> table that copied the TPM data into kernel memory.

This leaves two practical questions if I have been following everything
correctly.

1) How to get kexec to avoid picking that memory for the new kernel to
   run in before it initializes itself. (AKA the getting stomped by
   relocate kernel problem).

2) How to point the new kernel to preserved tpm_log.


This recommendation is from memory so it may be a bit off but
the general structure should work.  The idea is as follows.

- Pass the information between kernels.

  It is probably simplest for the kernel to have a command line option
  that tells the kernel the address and size of the tpm_log.

  We have a couple of mechanisms here.  Assuming you are loading a
  bzImage with kexec_file_load you should be able to have the in-kernel
  loader add those arguments to the kernel command line.

- Ensure that when the loader is finding an address to load the new
  kernel it treats the address of the tpm_log as unavailable.
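As a rough illustration of the command-line idea, here is a user-space sketch of how an appended argument could be parsed. The option name `tpm_log=` and its `<addr>,<size>` syntax are hypothetical, invented for this example; they are not an existing kernel parameter:

```python
def parse_tpm_log_option(cmdline: str):
    """Extract a hypothetical tpm_log=<addr>,<size> option from a kernel
    command line, returning (address, size) as integers, or None."""
    for token in cmdline.split():
        if token.startswith("tpm_log="):
            addr_s, size_s = token[len("tpm_log="):].split(",")
            # Kernel-style options typically accept hex with a 0x prefix;
            # int(..., 0) honours that prefix.
            return int(addr_s, 0), int(size_s, 0)
    return None

# The in-kernel kexec_file_load loader would append such an option
# (address and size here are made up):
cmdline = "root=/dev/sda1 ro tpm_log=0x7f8a1000,0x4000"
addr, size = parse_tpm_log_option(cmdline)
assert (addr, size) == (0x7f8a1000, 0x4000)
```

The new kernel would then reserve that range before the allocators start handing out memory.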

Does that sound like a doable plan?

Eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: EFI table being corrupted during Kexec

2024-09-10 Thread Eric W. Biederman
Breno Leitao  writes:

> Hello Eric,
>
> On Tue, Sep 10, 2024 at 09:26:00AM -0500, Eric W. Biederman wrote:
>> > I am wondering if that memory region/range should be part of the e820
>> > table that is passed by EFI firmware to the kernel, and if it is not
>> > passed (as it is not being passed today), then the kernel doesn't need
>> > to respect it, and it is free to overwrite (as it does today). In
>> > other words, this is a firmware bug and not a kernel bug.
>> >
>> > Am I missing something?
>> 
>> I agree that this appears to be a firmware bug.  This memory is reserved
>> in one location and not in another location.
>
> That is also our current understanding, but having the same issue
> in EDK2 and on a real machine's firmware was surprising.
>
> Anyway, I've CCed the EDK2 mailing list in this thread as well, let's
> see if someone has any comment.
>
>> As I recall the memblock allocator is the bootstrap memory allocator
>> used when bringing up the kernel.  So I don't see reserving something
>> in the memblock allocator as being authoritative as to how the firmware
>> has setup memory.
>> 
>> I would suggest writing a patch to update whatever is calling
>> memblock_reserve to also, or perhaps in preference to update the e820
>> map.  If the code is not x86 specific I would suggest using ACPI's
>> arch_reserve_mem_area call.
>
> Should all memblock_reserve() memory ranges be mapped to e820 table, or,
> just specific cases where we see problems?

Just specific cases.  There could be other Linux-specific reasons to
tell the memblock allocator not to allocate from a specific range
of memory.

Eric




Re: EFI table being corrupted during Kexec

2024-09-10 Thread Eric W. Biederman
Breno Leitao  writes:

> We've seen a problem in upstream kernel kexec, where an EFI TPM event log
> table is being overwritten.  This problem happens on real machines, as
> well as in a recent EDK2 qemu VM.
>
> Digging deep, the table is being overwritten during kexec, more precisely when
> relocating kernel (relocate_kernel() function).
>
> I've also found that the table is properly reserved using
> memblock_reserve() early in boot, and that range gets overwritten later
> by relocate_kernel(). In other words, kexec is overwriting memory that
> was previously reserved (via memblock_reserve()).
>
> Usama found that kexec only honours memory reservations from
> /sys/firmware/memmap, which come from the e820_table_firmware table.
>
> Looking at the TPM spec, I found the following part:
>
>   If the ACPI TPM2 table contains the address and size of the Platform
>   Firmware TCG log, firmware “pins” the memory associated with the
>   Platform Firmware TCG log, and reports this memory as “Reserved”
>   memory via the INT 15h/E820 interface.
>
>
> From: https://trustedcomputinggroup.org/wp-content/uploads/PC-ClientPlatform_Profile_for_TPM_2p0_Systems_v49_161114_public-review.pdf
>
> I am wondering if that memory region/range should be part of the e820
> table that is passed by EFI firmware to the kernel, and if it is not
> passed (as it is not being passed today), then the kernel doesn't need
> to respect it, and it is free to overwrite (as it does today). In other
> words, this is a firmware bug and not a kernel bug.
>
> Am I missing something?

I agree that this appears to be a firmware bug.  This memory is reserved
in one location and not in another location.

That said, that doesn't mean we can't deal with it in the kernel.
acpi_table_upgrade seems to have hit a similar issue and calls
arch_reserve_mem_area to reserve the area in the e820 tables.


The last time I looked, the e820 tables (in the kernel) are used to store
the EFI memory map when available, and only use the true e820 data on
older systems.

Which is a long way of saying that the e820 table in the kernel, last I
looked, was the master table of how the firmware views memory.


As I recall the memblock allocator is the bootstrap memory allocator
used when bringing up the kernel.  So I don't see reserving something
in the memblock allocator as being authoritative as to how the firmware
has setup memory.



I would suggest writing a patch to update whatever is calling
memblock_reserve to also, or perhaps in preference to update the e820
map.  If the code is not x86 specific I would suggest using ACPI's
arch_reserve_mem_area call.
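A toy model of the suggestion (the names below are illustrative Python stand-ins, not kernel APIs): since kexec, via /sys/firmware/memmap, only honours the firmware-provided map, a range reserved solely in memblock stays invisible to it, so the fix is to mirror the reservation into the e820 view as well:

```python
memblock_reserved = []   # bootstrap allocator reservations (Linux-internal)
e820_firmware = []       # (start, size, type) as firmware reported it

def reserve_tpm_log(start, size):
    """Reserve the log range in memblock AND mirror it into the e820 view.
    In the kernel, the mirroring step would be something like ACPI's
    arch_reserve_mem_area(), as suggested above."""
    memblock_reserved.append((start, size))
    e820_firmware.append((start, size, "reserved"))

def kexec_range_is_free(start, size):
    """kexec-style placement check: consults only the firmware map,
    never the memblock reservations."""
    end = start + size
    for (rs, rsz, rtype) in e820_firmware:
        if rtype == "reserved" and start < rs + rsz and rs < end:
            return False
    return True

# Address and size are made up for the example.
reserve_tpm_log(0x7f8a1000, 0x4000)
assert not kexec_range_is_free(0x7f8a2000, 0x1000)  # log is now protected
assert kexec_range_is_free(0x100000, 0x1000)        # unrelated range still free
```

Without the mirroring line, the first assertion would fail: kexec would happily place the new kernel on top of the log, which is exactly the corruption reported.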


If you have a good path to your the folks who write for the computers
where this happens it seems entirely reasonable to report this as a bug
to them as well.

Eric



Re: [PATCH 1/1] kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y

2024-08-16 Thread Eric W. Biederman
Petr Tesarik  writes:

> From: Petr Tesarik 
>
> Fix the condition to exclude the elfcorehdr segment from the SHA digest
> calculation.
>
> The j iterator is an index into the output sha_regions[] array, not into
> the input image->segment[] array. Once it reaches image->elfcorehdr_index,
> all subsequent segments are excluded. Besides, if the purgatory segment
> precedes the elfcorehdr segment, the elfcorehdr may be wrongly included in
> the calculation.

I would rather make CONFIG_CRASH_HOTPLUG depend on BROKEN.

The hash is supposed to include everything we depend upon so that when
a broken machine corrupts something we can detect that corruption
and not attempt to take a crash dump.

The elfcorehdr is definitely something that needs to be part of the
hash.

So please go back to the drawing board and find a way to include the
program header in the hash even with CONFIG_CRASH_HOTPLUG.


Eric


> Fixes: f7cc804a9fd4 ("kexec: exclude elfcorehdr from the segment digest")
> Cc: sta...@kernel.org
> Signed-off-by: Petr Tesarik 
> ---
>  kernel/kexec_file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 3d64290d24c9..3eedb8c226ad 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -752,7 +752,7 @@ static int kexec_calculate_store_digests(struct kimage *image)
>  
>  #ifdef CONFIG_CRASH_HOTPLUG
>  		/* Exclude elfcorehdr segment to allow future changes via hotplug */
> - if (j == image->elfcorehdr_index)
> + if (i == image->elfcorehdr_index)
>   continue;
>  #endif



Re: [PATCH v9 06/19] x86: Add early SHA-1 support for Secure Launch early measurements

2024-05-31 Thread Eric W. Biederman
Eric Biggers  writes:

> On Thu, May 30, 2024 at 06:03:18PM -0700, Ross Philipson wrote:
>> From: "Daniel P. Smith" 
>> 
>> For better or worse, Secure Launch needs SHA-1 and SHA-256. The
>> choice of hashes used lies with the platform firmware, not with
>> software, and is often outside of the user's control.
>> 
>> Even if we'd prefer to use SHA-256-only, if firmware elected to start us
>> with the SHA-1 and SHA-256 banks active, we still need SHA-1 to parse
>> the TPM event log thus far, and deliberately cap the SHA-1 PCRs in order
>> to safely use SHA-256 for everything else.
>> 
>> The SHA-1 code here has its origins in the code from the main kernel:
>> 
>> commit c4d5b9ffa31f ("crypto: sha1 - implement base layer for SHA-1")
>> 
>> A modified version of this code was introduced to the lib/crypto/sha1.c
>> to bring it in line with the SHA-256 code and allow it to be pulled into the
>> setup kernel in the same manner as SHA-256 is.
>> 
>> Signed-off-by: Daniel P. Smith 
>> Signed-off-by: Ross Philipson 
>
> Thanks.  This explanation doesn't seem to have made it into the actual code or
> documentation.  Can you please get it into a more permanent location?
>
> Also, can you point to where the "deliberately cap the SHA-1 PCRs" thing 
> happens
> in the code?
>
> That paragraph is also phrased as a hypothetical, "Even if we'd prefer to use
> SHA-256-only".  That implies that you do not, in fact, prefer SHA-256 only.  
> Is
> that the case?  Sure, maybe there are situations where you *have* to use 
> SHA-1,
> but why would you not at least *prefer* SHA-256?

Yes.  Please prefer to use SHA-256.

Have you considered implementing SHA-1-DC (I believe that is what git
uses), which is compatible with SHA-1 but detects and rejects the known
class of collision attacks for which SHA-1 is actively broken at this
point?

No offense to your Trenchboot project but my gut says that anything new
relying on SHA-1 is probably a bad joke at this point.

Firmware can most definitely be upgraded and if the goal is a more
secure boot the usual backwards compatibility concerns for supporting
old firmware really don't apply.

Perhaps hide all of the SHA-1 stuff behind CONFIG_TRENCHBOOT_PROTOTYPE
or something like that to make it clear that SHA-1 is not appropriate
for any kind of production deployment and is only good for prototyping
your implementation until properly implemented firmware comes along.

Eric



Re: [RFC PATCH 0/9] kexec x86 purgatory cleanup

2024-04-24 Thread Eric W. Biederman
Ard Biesheuvel  writes:

> From: Ard Biesheuvel 
>
> The kexec purgatory is built like a kernel module, i.e., a partially
> linked ELF object where each section is allocated and placed
> individually, and all relocations need to be fixed up, even place
> relative ones.
>
> This makes sense for kernel modules, which share the address space with
> the core kernel, and contain unresolved references that need to be wired
> up to symbols in other modules or the kernel itself.
>
> The purgatory, however, is a fully linked binary without any external
> references, or any overlap with the kernel's virtual address space. So
> it makes much more sense to create a fully linked ELF executable that
> can just be loaded and run anywhere in memory.

It does have external references that are resolved when it is loaded.

Further, it is at least my impression that non-PIC code is more
efficient.  PIC typically requires silly things like Global Offset
Tables that non-PIC code does not.  At first glance this looks like a
code pessimization.

Now a lot of functionality has been stripped out of purgatory, so maybe
in its stripped-down form this makes sense, but I want to challenge the
notion that this is the obvious thing to do.

> The purgatory build on x86 has already switched over to position
> independent codegen, which only leaves a handful of absolute references,
> which can either be dropped (patch #3) or converted into a RIP-relative
> one (patch #4). That leaves a purgatory executable that can run at any
> offset in memory with applying any relocations whatsoever.

I missed that conversation.  Do you happen to have a pointer?  I would
think the 32bit code is where the PIC would be most costly as the 32bit
x86 instruction set predates PIC being a common compilation target.

> Some tweaks are needed to deal with the difference between partially
> (ET_REL) and fully (ET_DYN/ET_EXEC) linked ELF objects, but with those
> in place, a substantial amount of complicated ELF allocation, placement
> and patching/relocation code can simply be dropped.

Really?  As I recall it only needed to handle a single allocation type,
and there were good reasons (at least when I wrote it) to patch symbols.

Again maybe the fact that people have removed 90% of the functionality
makes this make sense, but that is not obvious at first glance.

> The last patch in the series removes this code from the generic kexec
> implementation, but this can only be done once other architectures apply
> the same changes proposed here for x86 (powerpc, s390 and riscv all
> implement the purgatory using the shared logic)
>
> Link: 
> https://lore.kernel.org/all/CAKwvOd=3Jrzju++=Ve61=ZdeshxUM=K3-bGMNREnGOQgNw=a...@mail.gmail.com/
> Link: 
> https://lore.kernel.org/all/20240418201705.3673200-2-ardb+...@google.com/
>
> Cc: Arnd Bergmann 
> Cc: Eric Biederman 
> Cc: kexec@lists.infradead.org
> Cc: Nathan Chancellor 
> Cc: Nick Desaulniers 
> Cc: Kees Cook 
> Cc: Bill Wendling 
> Cc: Justin Stitt 
> Cc: Masahiro Yamada 
>
> Ard Biesheuvel (9):
>   x86/purgatory: Drop function entry padding from purgatory
>   x86/purgatory: Simplify stack handling
>   x86/purgatory: Drop pointless GDT switch
>   x86/purgatory: Avoid absolute reference to GDT
>   x86/purgatory: Simplify GDT and drop data segment
>   kexec: Add support for fully linked purgatory executables
>   x86/purgatory: Use fully linked PIE ELF executable
>   x86/purgatory: Simplify references to regs array
>   kexec: Drop support for partially linked purgatory executables
>
>  arch/x86/include/asm/kexec.h   |   8 -
>  arch/x86/kernel/kexec-bzimage64.c  |   8 -
>  arch/x86/kernel/machine_kexec_64.c | 127 --
>  arch/x86/purgatory/Makefile|  17 +-
>  arch/x86/purgatory/entry64.S   |  96 
>  arch/x86/purgatory/setup-x86_64.S  |  31 +--
>  arch/x86/purgatory/stack.S |  18 --
>  include/asm-generic/purgatory.lds  |  34 +++
>  kernel/kexec_file.c| 255 +++-
>  9 files changed, 125 insertions(+), 469 deletions(-)
>  delete mode 100644 arch/x86/purgatory/stack.S
>  create mode 100644 include/asm-generic/purgatory.lds

Eric



Re: [REGRESSION] kexec does firmware reboot in kernel v6.7.6

2024-03-13 Thread Eric W. Biederman
Steve Wahl  writes:

> [*really* added kexec maintainers this time.]
>
> Full thread starts here:
> https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce9...@pavinjoseph.com/
>
> On Wed, Mar 13, 2024 at 12:12:31AM +0530, Pavin Joseph wrote:
>> On 3/12/24 20:43, Steve Wahl wrote:
>> > But I don't want to introduce a new command line parameter if the
>> > actual problem can be understood and fixed.  The question is how much
>> > time do I have to persue a direct fix before some other action needs
>> > to be taken?
>> 
>> Perhaps the kexec maintainers [0] can be made aware of this and you could
>> coordinate with them on a potential fix?
>> 
>> Currently maintained by
>> P:  Simon Horman
>> M:  ho...@verge.net.au
>> L:  kexec@lists.infradead.org
>
> Probably a good idea to add kexec people to the list, so I've added
> them to this email.
>
> Everyone, my recent patch to the kernel that changed identity mapping:
>
> 7143c5f4cf2073193 x86/mm/ident_map: Use gbpages only where full GB page 
> should be mapped.
>
> ... has broken kexec on a few machines.  The symptom is they do a full
> BIOS reboot instead of a kexec of the new kernel.  Seems to be limited
> to AMD processors, but it's not all AMD processors, probably just some
> characteristic that they happen to share.
>
> The same machines that are broken by my patch, are also broken in
> previous kernels if you add "nogbpages" to the kernel command line
> (which makes the identity map bigger, "nogbpages" doing for all parts
> of the identity map what my patch does only for some parts of it).
>
> I'm still hoping to find a machine I can reproduce this on to try and
> debug it myself.
>
> If any of you have any assistance or advice to offer, it would be most
> welcome!

Kexec happens on identity mapped page tables.

The files of interest are machine_kexec_64.c and relocate_kernel_64.S

I suspect either the building of the identity mapped page table in
machine_kexec_prepare, or the switching to the page table in
identity_mapped in relocate_kernel_64.S is where something goes wrong.

Probably in kernel_ident_mapping_init as that code is directly used
to build the identity mapped page tables.

Hmm.

Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only
where full GB page should be mapped.")

Given the simplicity of that change itself, my guess is that somewhere in
the first 1GiB there are pages that needed to be mapped, like the IDT at
0, that are not getting mapped.

Reading through the changelog:
>   x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
>
>When ident_pud_init() uses only gbpages to create identity maps, large
>ranges of addresses not actually requested can be included in the
>resulting table; a 4K request will map a full GB.  On UV systems, this
>ends up including regions that will cause hardware to halt the system
>if accessed (these are marked "reserved" by BIOS).  Even processor
>speculation into these regions is enough to trigger the system halt.
>
>Only use gbpages when map creation requests include the full GB page
>of space.  Fall back to using smaller 2M pages when only portions of a
>GB page are included in the request.
>
>No attempt is made to coalesce mapping requests. If a request requires
>a map entry at the 2M (pmd) level, subsequent mapping requests within
>the same 1G region will also be at the pmd level, even if adjacent or
>overlapping such requests could have been combined to map a full
>gbpage.  Existing usage starts with larger regions and then adds
>smaller regions, so this should not have any great consequence.
>
>[ dhansen: fix up comment formatting, simplifty changelog ]
>
>Signed-off-by: Steve Wahl 
>Signed-off-by: Dave Hansen 
>Cc: sta...@vger.kernel.org
>Link: 
> https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
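The "a 4K request will map a full GB" behaviour described in that changelog is simple rounding arithmetic, sketched below as a user-space illustration (the addresses are made up; kernel_ident_mapping_init does not literally look like this):

```python
GB = 1 << 30       # 1 GiB page (gbpage) granularity
PMD_2M = 1 << 21   # 2 MiB page (pmd) granularity

def gbpage_coverage(addr, size):
    """Range actually mapped when a request is rounded out to 1 GiB pages,
    as the pre-patch ident_pud_init() effectively did."""
    start = addr & ~(GB - 1)
    end = (addr + size + GB - 1) & ~(GB - 1)
    return start, end

# A 4 KiB request near the bottom of memory drags in the entire first GiB,
# including firmware-"reserved" ranges that were never requested -- the
# ranges that halt the UV systems when even speculatively accessed.
assert gbpage_coverage(0x0009F000, 0x1000) == (0, GB)

# With 2 MiB pages, the over-mapping shrinks to a single pmd entry.
start2 = 0x0009F000 & ~(PMD_2M - 1)
end2 = (0x0009F000 + 0x1000 + PMD_2M - 1) & ~(PMD_2M - 1)
assert end2 - start2 == PMD_2M
```

The flip side, relevant to the regression: anything in the first 1GiB that used to be mapped only as a side effect of the gbpage rounding (such as low firmware structures) silently loses its mapping after the patch.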

I know historically that fixed mtrrs were used so that the first 1GiB
could be covered with page tables and cause problems.

I suspect that, whatever those UV systems are, a more targeted solution
would be to use the fixed MTRRs to disable caching and speculation on the
problematic ranges rather than completely changing the fundamental logic
of how pages are mapped.

Right now it looks like you get to play a game of whack-a-mole with
firmware/BIOS tables that don't mention something important, and
ensuring the kernel maps everything important in the first 1GiB.

It might be worth setting up early printk on some of these systems and
seeing if the failure is in the early boot-up of the new kernel (that is,
while using the kexec-supplied identity mapped pages) rather than in
kexec per se.

But that is just my guess at the moment.

Eric




>> I hope the root cause can be fixed instead of patching it over with a flag
>> to suppress the problem, but I don't know how regressions are handled here.
>
> That would be my preference as well.
>
> Thanks,
>
> --> Steve Wahl

Re: [PATCH] kexec: should use uchunk for user buffer increasing

2024-02-05 Thread Eric W. Biederman
Baoquan He  writes:

> On 01/30/24 at 06:18pm, yang.zhang wrote:
>> From: "yang.zhang" 
>> 
>> Because of the alignment requirement in kexec-tools, there is
>> no problem with the user buffer advancing when loading segments.
>> But when copying, the step is uchunk, so we should advance by
>> uchunk, not mchunk.
>
> In theory, ubytes is <= mbytes. So uchunk is always <= mchunk. If ubytes
> is exhausted, while there's still remaining mbytes, then uchunk is 0,
> there's still mchunk stepping forward. If I understand it correctly,
> this is a good catch. Not sure if Eric has comment on this to confirm.

As far as I can read the code the proposed change is a noop.

I agree it is more correct to not advance the pointers we read from,
but since we never read from them after that point it does not
matter.

>
> static int kimage_load_normal_segment(struct kimage *image,
>  struct kexec_segment *segment)
> {
> ..
>
> ptr += maddr & ~PAGE_MASK;
> mchunk = min_t(size_t, mbytes,
> PAGE_SIZE - (maddr & ~PAGE_MASK));
> uchunk = min(ubytes, mchunk);
> ..}

If we are going to improve the code for clarity, we probably
want to do something like:

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index d08fc7b5db97..1a8b8ce6bf15 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -800,22 +800,24 @@ static int kimage_load_normal_segment(struct kimage *image,
 				PAGE_SIZE - (maddr & ~PAGE_MASK));
 		uchunk = min(ubytes, mchunk);
 
-		/* For file based kexec, source pages are in kernel memory */
-		if (image->file_mode)
-			memcpy(ptr, kbuf, uchunk);
-		else
-			result = copy_from_user(ptr, buf, uchunk);
+		if (uchunk) {
+			/* For file based kexec, source pages are in kernel memory */
+			if (image->file_mode)
+				memcpy(ptr, kbuf, uchunk);
+			else
+				result = copy_from_user(ptr, buf, uchunk);
+			ubytes -= uchunk;
+			if (image->file_mode)
+				kbuf += uchunk;
+			else
+				buf += uchunk;
+		}
 		kunmap_local(ptr);
 		if (result) {
 			result = -EFAULT;
 			goto out;
 		}
-		ubytes -= uchunk;
 		maddr  += mchunk;
-		if (image->file_mode)
-			kbuf += mchunk;
-		else
-			buf += mchunk;
 		mbytes -= mchunk;
 
 		cond_resched();

And make it exceedingly clear that all of the copying and the rest
only happen before uchunk goes to zero.  Otherwise we are relying
on a lot of operations becoming noops when uchunk goes to zero.

Eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/3] kexec_file: fix incorrect end value passed to kimage_is_destination_range()

2023-12-15 Thread Eric W. Biederman
Yuntao Wang  writes:

> The end parameter received by kimage_is_destination_range() should be the
> last valid byte address of the target memory segment plus 1. However, in
> the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
> the corresponding value passed to kimage_is_destination_range() is the last
> valid byte address of the target memory segment, which is 1 less. Fix
> it.

If that is true, I think we should rather fix
kimage_is_destination_range.

Otherwise we run the risk of having areas whose end is not
representable, especially on 32-bit.


Eric


> Signed-off-by: Yuntao Wang 
> ---
>  kernel/kexec_file.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index f9a419cd22d4..26be070d3bdd 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -435,13 +435,12 @@ static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
>   if (temp_start < start || temp_start < kbuf->buf_min)
>   return 0;
>  
> - temp_end = temp_start + kbuf->memsz - 1;
> -
>   /*
>* Make sure this does not conflict with any of existing
>* segments
>*/
> - if (kimage_is_destination_range(image, temp_start, temp_end)) {
> + if (kimage_is_destination_range(image, temp_start,
> + temp_start + kbuf->memsz)) {
>   temp_start = temp_start - PAGE_SIZE;
>   continue;
>   }
> @@ -475,7 +474,7 @@ static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
>* Make sure this does not conflict with any of existing
>* segments
>*/
> - if (kimage_is_destination_range(image, temp_start, temp_end)) {
> - if (kimage_is_destination_range(image, temp_start, temp_end + 1)) {
>   temp_start = temp_start + PAGE_SIZE;
>   continue;
>   }



Re: [PATCH 00/15] kexec: Allow preservation of ftrace buffers

2023-12-14 Thread Eric W. Biederman
Alexander Graf  writes:

> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
> pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
> specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of,
> for example, IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.

What you are describing is in many ways the same problem as
kexec-on-panic.  The goal of leaving devices running absolutely requires
carving out memory for the new kernel to live in while it is coming up,
so that DMA from a device that was not shut down does not stomp on the
kernel coming up.

If I understand the virtualization case, some of those virtual machines
are going to have virtual NICs that will want to DMA memory to the
host system.  If I understand things correctly, that means that among
the devices you explicitly want to keep running there is no way to
avoid the chance of DMA coming in while the kernel is being changed.

There is also a huge maintenance challenge associated with all of this.

If you go with something that is essentially kexec-on-panic and then
add a little bit to help find things in the memory of the previous
kernel while the new kernel is coming up I can see it as a possibility.

As an example, preserving ftrace data across kexec seems bizarre.
I don't see how that is an interesting use case at all.  Not in
the situation of preserving virtual machines, and not in the situation
of kexec on panic.

If you are doing an orderly shutdown and kernel switch you should be
able to manually change the memory.  If you are not doing an orderly
shutdown then I really don't get it.

I don't hate the capability you are trying to build.

I have not read or looked at most of this so I am probably
missing subtle details.

As you are currently describing things I have the sense you have
completely misframed the problem and are trying to solve the wrong parts
of the problem.

Eric



Re: [PATCH] kexec: do syscore_shutdown() in kernel_kexec

2023-12-13 Thread Eric W. Biederman
James Gowans  writes:

> syscore_shutdown() runs driver and module callbacks to get the system
> into a state where it can be correctly shut down. In commit
> 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after 
> disable_nonboot_cpus()")
> syscore_shutdown() was removed from kernel_restart_prepare() and hence
> got (incorrectly?) removed from the kexec flow. This was innocuous until
> commit 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook 
> restart/shutdown")
> changed the way that KVM registered its shutdown callbacks, switching from
> reboot notifiers to syscore_ops.shutdown. As syscore_shutdown() is
> missing from kexec, KVM's shutdown hook is not run and virtualisation is
> left enabled on the boot CPU which results in triple faults when
> switching to the new kernel on Intel x86 VT-x with VMXE enabled.
>
> Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
> where to add it, it is being added after migrating the kexec task to the
> boot CPU, but before APs are shut down. It is not totally clear if this
> is the best place: in commit 6f389a8f1dd2 ("PM / reboot: call 
> syscore_shutdown() after disable_nonboot_cpus()")
> it is stated that "syscore_ops operations should be carried with one
> CPU on-line and interrupts disabled." APs are only offlined later in
> machine_shutdown(), so this syscore_shutdown() is being run while APs
> are still online. This seems to be the correct place as it matches where
> syscore_shutdown() is run in the reboot and halt flows - they also run
> it before APs are shut down. The assumption is that the commit message
> in commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after 
> disable_nonboot_cpus()")
> is no longer valid.
>
> KVM has been discussed here as it is what broke loudly by not having
> syscore_shutdown() in kexec, but this change impacts more than just KVM;
> all drivers/modules which register a syscore_ops.shutdown callback will
> now be invoked in the kexec flow. Looking at some of them like x86 MCE
> it is probably more correct to also shut these down during kexec.
> Maintainers of all drivers which use syscore_ops.shutdown are added on
> CC for visibility. They are:
>
> arch/powerpc/platforms/cell/spu_base.c  .shutdown = spu_shutdown,
> arch/x86/kernel/cpu/mce/core.c.shutdown = 
> mce_syscore_shutdown,
> arch/x86/kernel/i8259.c .shutdown = i8259A_shutdown,
> drivers/irqchip/irq-i8259.c   .shutdown = i8259A_shutdown,
> drivers/irqchip/irq-sun6i-r.c .shutdown = sun6i_r_intc_shutdown,
> drivers/leds/trigger/ledtrig-cpu.c.shutdown = 
> ledtrig_cpu_syscore_shutdown,
> drivers/power/reset/sc27xx-poweroff.c .shutdown = sc27xx_poweroff_shutdown,
> kernel/irq/generic-chip.c .shutdown = irq_gc_shutdown,
> virt/kvm/kvm_main.c   .shutdown = kvm_shutdown,
>
> This has been tested by doing a kexec on x86_64 and aarch64.

From the 10,000 foot perspective:
Acked-by: "Eric W. Biederman" 


Eric

> Fixes: 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to hook 
> restart/shutdown")
>
> Signed-off-by: James Gowans 
> Cc: Eric Biederman 
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Cc: Marc Zyngier 
> Cc: Arnd Bergmann 
> Cc: Tony Luck 
> Cc: Borislav Petkov 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Chen-Yu Tsai 
> Cc: Jernej Skrabec 
> Cc: Samuel Holland 
> Cc: Pavel Machek 
> Cc: Sebastian Reichel 
> Cc: Orson Zhai 
> Cc: Alexander Graf 
> Cc: Jan H. Schoenherr 
> ---
>  kernel/kexec_core.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index be5642a4ec49..b926c4db8a91 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1254,6 +1254,7 @@ int kernel_kexec(void)
>   kexec_in_progress = true;
>   kernel_restart_prepare("kexec reboot");
>   migrate_to_reboot_cpu();
> + syscore_shutdown();
>  
>   /*
>* migrate_to_reboot_cpu() disables CPU hotplug assuming that



Re: [PATCH v2 1/2] KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown

2023-12-11 Thread Eric W. Biederman
"Gowans, James"  writes:

> On Mon, 2023-12-11 at 09:54 +0200, James Gowans wrote:
>> > 
>> > What problem are you running into with your rebase that worked with
>> > reboot notifiers that is not working with syscore_shutdown?
>> 
>> Prior to this commit [1] which changed KVM from reboot notifiers to
>> syscore_ops, KVM's reboot notifier shutdown callback was invoked on
>> kexec via kernel_restart_prepare.
>> 
>> After this commit, KVM is not being shut down because currently the
>> kexec flow does not call syscore_shutdown.
>
> I think I missed what you're asking here; you're asking for a reproducer
> for the specific failure? 
>
> 1. Launch a QEMU VM with -enable-kvm flag
>
> 2. Do an immediate (-f flag) kexec:
> kexec -f --reuse-cmdline ./bzImage 
>
> Somewhere after doing the RET to new kernel in the relocate_kernel asm
> function the new kernel starts triple faulting; I can't exactly figure
> out where but I think it has to do with the new kernel trying to modify
> CR3 while the VMXE bit is still set in CR4 causing the triple fault.
>
> If KVM has been shut down via the shutdown callback, or alternatively if
> the QEMU process has actually been killed first (by not doing a -f exec)
> then the VMXE bit is clear and the kexec goes smoothly.
>
> So, TL;DR: kexec -f use to work with a KVM VM active, now it goes into a
> triple fault crash.

You mentioned a rebase so I thought you were backporting kernel patches.
By rebase do you mean you are porting your userspace to a newer kernel?


In any event I believe the bug with respect to kexec was introduced in
commit 6f389a8f1dd2 ("PM / reboot: call syscore_shutdown() after
disable_nonboot_cpus()").  That is where syscore_shutdown was removed
from kernel_restart_prepare().

At this point it looks like someone just needs to add the missing
syscore_shutdown call into kernel_kexec() right after
migrate_to_reboot_cpu() is called.

That said I am not seeing the reboot notifiers being called on the kexec
path either so your issue with kvm might be deeper.

Eric



Re: [PATCH v2 1/2] KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown

2023-12-09 Thread Eric W. Biederman
"Gowans, James"  writes:

> Hi Sean,
>
> Blast from the past but I've just been bitten by this patch when
> rebasing across v6.4.
>
> On Fri, 2023-05-12 at 16:31 -0700, Sean Christopherson wrote:
>> Use syscore_ops.shutdown to disable hardware virtualization during a
>> reboot instead of using the dedicated reboot_notifier so that KVM disables
>> virtualization _after_ system_state has been updated.  This will allow
>> fixing a race in KVM's handling of a forced reboot where KVM can end up
>> enabling hardware virtualization between kernel_restart_prepare() and
>> machine_restart().
>
> The issue is that, AFAICT, the syscore_ops.shutdown are not called when
> doing a kexec. Reboot notifiers are called across kexec via:
>
> kernel_kexec
>   kernel_restart_prepare
> blocking_notifier_call_chain
>   kvm_reboot
>
> So after this patch, KVM is not shutdown during kexec; if hardware virt
> mode is enabled then the kexec hangs in exactly the same manner as you
> describe with the reboot.

kernel_restart_prepare calls device_shutdown.  Which should call all
of the shutdown operations.  This has been the way the code has been
structured since December 2005.

> Some specific shutdown callbacks, for example IOMMU, HPET, IRQ, etc are
> called in native_machine_shutdown, but KVM is not one of these.
>
> Thoughts on possible ways to fix this:
> a) go back to reboot notifiers
> b) get kexec to call syscore_shutdown() to invoke all of these callbacks
> c) Add a KVM-specific callback to native_machine_shutdown(); we only
> need this for Intel x86, right?
>
> My slight preference is towards adding syscore_shutdown() to kexec, but
> I'm not sure that's feasible. Adding kexec maintainers for input.

Why isn't device_shutdown calling syscore_shutdown?

What problem are you running into with your rebase that worked with
reboot notifiers that is not working with syscore_shutdown?

Eric



Re: [PATCH] kexec: Fix reboot race during device_shutdown()

2023-10-10 Thread Eric W. Biederman
Joel Fernandes  writes:

> On Mon, Oct 9, 2023 at 11:30 AM Eric W. Biederman  
> wrote:
>>
>> Joel Fernandes  writes:
>>
>> > On Mon, Oct 2, 2023 at 2:18 PM Joel Fernandes  
>> > wrote:
>> > [..]
>> >> > > Such freezing is already being done if kernel supports KEXEC_JUMP and
>> >> > > kexec_image->preserve_context is true. However, doing it if either of 
>> >> > > these are
>> >> > > not true prevents crashes/races.
>> >> >
>> >> > The KEXEC_JUMP case is something else entirely.  It is supposed to work
>> >> > like suspend to RAM.  Maybe reboot should as well, but I am
>> >> > uncomfortable making a generic device fix kexec specific.
>> >>
>> >> I see your point of view. I think regular reboot should also be fixed
>> >> to avoid similar crash possibilities. I am happy to make a change for
>> >> that similar to this patch if we want to proceed that way.
>> >>
>> >> Thoughts?
>> >
>> > Just checking how we want to proceed, is the consensus that we should
>> > prevent kernel crashes without relying on userspace stopping all
>> > processes? Should we fix regular reboot syscall as well and not just
>> > kexec reboot?
>>
>> It just occurred to me there is something very fishy about all of this.
>>
>> What userspace do you have using kexec (not kexec on panic) that doesn't
perform the same userspace shutdown as a normal reboot?
>>
>> Quite frankly such a userspace is buggy, and arguably that is where you
>> should start fixing things.
>
> It is a simple unit test that tests kexec support by kexec-rebooting
> the kernel. I don't think SIGSTOP/SIGKILL'ing during kexec-reboot is
> ideal because in a real panic-on-kexec type crash, that may not happen
> and so does not emulate the real world that well. I think we want the
> kexec-reboot to do a *reboot* without crashing the kernel while doing
> so. Ricardo/Steve can chime on what they feel as well.

This is confusing.  You have a unit test that tests kexec on
panic using the full kexec reboot.

The two are fundamentally similar but you aren't going to have a valid
test case if you mix them.

There is a whole kernel module that tests more interesting cases;
for the simple case you probably just want to do:

echo 'c' > /proc/sysrq-trigger

The 'c' sysrq key is the one that triggers a kernel crash/panic;
'p' merely dumps the current registers.

That will ensure you are exercising the kexec on panic code path.  That
performs the minimal shutdown in the kernel.

>> That way you can get the orderly shutdown
>> of userspace daemons/services along with an orderly shutdown of
>> everything the kernel is responsible for.
>
> Fixing in userspace is an option but people are not happy that the
> kernel can crash like that.

In a kexec on panic scenario the kernel needs to perform that absolute
bare essential shutdown before calling kexec (basically nothing).
During kexec-on-panic nothing can be relied upon because we don't know
what is broken.  If that is what you care about (as suggested by the
unit test) you need to fix the device initialization.

In a normal kexec scenario the whole normal reboot process is expected.
I have no problems with fixing the kernel to handle that scenario,
but in the real world the entire orderly shutdown both, kernel
and userspace should be performed.

>> At the kernel level a kexec reboot and a normal reboot have been
>> deliberately kept as close as possible.  Which is why I say we should
>> fix it in reboot.
>
> You mean fix it in userspace?

No.  I mean in the kernel the orderly shutdown for a kexec reboot and an
ordinary reboot are kept as close to the same as possible.

It should be the case that the only differences between the two is that
in once case system firmware takes over after the orderly shutdown,
and in the other case a new kernel takes over after the orderly shutdown.

Eric




Re: [PATCH] kexec: Fix reboot race during device_shutdown()

2023-10-09 Thread Eric W. Biederman
Joel Fernandes  writes:

> On Mon, Oct 2, 2023 at 2:18 PM Joel Fernandes  wrote:
> [..]
>> > > Such freezing is already being done if kernel supports KEXEC_JUMP and
>> > > kexec_image->preserve_context is true. However, doing it if either of 
>> > > these are
>> > > not true prevents crashes/races.
>> >
>> > The KEXEC_JUMP case is something else entirely.  It is supposed to work
>> > like suspend to RAM.  Maybe reboot should as well, but I am
>> > uncomfortable making a generic device fix kexec specific.
>>
>> I see your point of view. I think regular reboot should also be fixed
>> to avoid similar crash possibilities. I am happy to make a change for
>> that similar to this patch if we want to proceed that way.
>>
>> Thoughts?
>
> Just checking how we want to proceed, is the consensus that we should
> prevent kernel crashes without relying on userspace stopping all
> processes? Should we fix regular reboot syscall as well and not just
> kexec reboot?

It just occurred to me there is something very fishy about all of this.

What userspace do you have using kexec (not kexec on panic) that doesn't
perform the same userspace shutdown as a normal reboot?

Quite frankly such a userspace is buggy, and arguably that is where you
should start fixing things.  That way you can get the orderly shutdown
of userspace daemons/services along with an orderly shutdown of
everything the kernel is responsible for.

At the kernel level a kexec reboot and a normal reboot have been
deliberately kept as close as possible.  Which is why I say we should
fix it in reboot.

Eric



Re: [PATCH] kexec: Fix reboot race during device_shutdown()

2023-09-29 Thread Eric W. Biederman
"Joel Fernandes (Google)"  writes:

> During kexec reboot, it is possible for a race to occur between
> device_shutdown() and userspace.  This causes accesses to GPU after pm_runtime
> suspend has already happened. Fix this by calling freeze_processes() before
> device_shutdown().

Is there any reason why this same race between sys_kexec and the
adreno ioctl cannot happen during a normal reboot?

Is there any reason why there is not a .shutdown method to prevent the
race?

I would think the thing to do is to prevent this race in
kernel_restart_prepare or in the GPU's .shutdown method, as I don't
see anything that would prevent this during a normal reboot.

>
> Such freezing is already being done if kernel supports KEXEC_JUMP and
> kexec_image->preserve_context is true. However, doing it if either of these 
> are
> not true prevents crashes/races.

The KEXEC_JUMP case is something else entirely.  It is supposed to work
like suspend to RAM.  Maybe reboot should as well, but I am
uncomfortable making a generic device fix kexec specific.


> This fixes the following crash during kexec reboot on an ARM64 device
> with adreno GPU:
>
> [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> [  292.534326] Call trace:
> [  292.534328]  dump_backtrace+0x0/0x1d4
> [  292.534337]  show_stack+0x20/0x2c
> [  292.534342]  dump_stack_lvl+0x60/0x78
> [  292.534347]  dump_stack+0x18/0x38
> [  292.534352]  panic+0x148/0x3b0
> [  292.534357]  nmi_panic+0x80/0x94
> [  292.534364]  arm64_serror_panic+0x70/0x7c
> [  292.534369]  do_serror+0x0/0x7c
> [  292.534372]  do_serror+0x54/0x7c
> [  292.534377]  el1h_64_error_handler+0x34/0x4c
> [  292.534381]  el1h_64_error+0x7c/0x80
> [  292.534386]  el1_interrupt+0x20/0x58
> [  292.534389]  el1h_64_irq_handler+0x18/0x24
> [  292.534395]  el1h_64_irq+0x7c/0x80
> [  292.534399]  local_daif_inherit+0x10/0x18
> [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> [  292.534410]  el1h_64_sync+0x7c/0x80
> [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> [  292.534426]  adreno_get_param+0x12c/0x1e0
> [  292.534433]  msm_ioctl_get_param+0x64/0x70
> [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> [  292.534448]  drm_ioctl+0x208/0x320
> [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> [  292.534461]  invoke_syscall+0x4c/0x118
>
> Cc: Steven Rostedt 
> Cc: Ricardo Ribalda 
> Cc: Ross Zwisler 
> Cc: Rob Clark 
> Cc: Linus Torvalds 
> Tested-by: Ricardo Ribalda 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  kernel/kexec_core.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index e2f2574d8b74..6599f485e42d 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1299,6 +1299,12 @@ int kernel_kexec(void)
>   } else
>  #endif
>   {
> + error = freeze_processes();
> + if (error) {
> + error = -EBUSY;
> + goto Unlock;
> + }
> +
>   kexec_in_progress = true;
>   kernel_restart_prepare("kexec reboot");
>   migrate_to_reboot_cpu();

Eric



Re: [PATCH 0/4] faster kexec reboot

2022-07-25 Thread Eric W. Biederman
Albert Huang  writes:

> From: "huangjie.albert" 
>
> In many time-sensitive scenarios, we need a shorter time to restart 
> the kernel. However, in the current kexec fast restart code, there 
> are many places in the memory copy operation, verification operation 
> and decompression operation, which take more time than 500ms. Through 
> the following patch series. machine_kexec-->start_kernel only takes
> 15ms

Is this a tiny embedded device you are taking the timings of?

How are you handling driver shutdown and restart?  I would expect those
to be a larger piece of the puzzle than memory.

My desktop can do something like 128GiB/s.  Which would suggest that
copying 128MiB of kernel+initrd would take perhaps 10ms.  The SHA256
implementation may not be tuned so that could be part of the performance
issue.  The SHA256 hash has a reputation for having fast
implementations.  I chose SHA256 originally simply because it has more
bits so it makes the odds of detecting an error higher.


If all you care about is booting a kernel as fast as possible it may
make sense to have a large reserved region of memory like we have for
the kexec on panic kernel.  If that really makes sense I recommend
adding a second kernel command line option and reserving a second
region of memory.  That makes telling if there are any conflicts simple.


I am having a hard time seeing how anyone else would want these options.
Losing megabytes of memory simply because you might reboot using kexec
seems like the wrong side of a trade-off.

The CONFIG_KEXEC_PURGATORY_SKIP_SIG option is very misnamed.  It is not
signature verification that is happening, it is a hash verification.
There are no encrypted bits at play.  Instead there is a check to
ensure that the kernel has not been corrupted by in-flight DMA that some
driver forgot to shut down.

So you are building a version of kexec where, if something goes wrong,
it could very easily eat your data, or otherwise do some very bad things
that are absolutely non-trivial to debug.

That the decision to skip the sha256 hash that prevents corruption is
happening at compile time, instead of at run-time, will guarantee the
option is simply not available on any general purpose kernel
configuration.  Given how dangerous it is to skip the hash verification
it is probably not a bad thing overall, but it is most definitely
something that will make maintenance more difficult.


If done well I don't see why anyone would mind an uncompressed kernel,
but I don't see what the advantage of what you are doing is over using
vmlinux in the build directory.  It isn't a bzImage but it is the
uncompressed kernel.

As a proof of concept, I think what you are doing goes some way toward
showing that things can be improved.  My overall sense is that improving
things the way you are proposing does not help the general case and
simply adds to the maintenance burden.

Eric

>
> How to measure time:
>
> c code:
> uint64_t current_cycles(void)
> {
> uint32_t low, high;
> asm volatile("rdtsc" : "=a"(low), "=d"(high));
> return ((uint64_t)low) | ((uint64_t)high << 32);
> }
> assembly code:
>pushq %rax
>pushq %rdx
>rdtsc
>mov   %eax,%eax
>shl   $0x20,%rdx
>or%rax,%rdx
>movq  %rdx,0x840(%r14)
>popq  %rdx
>popq  %rax
> the timestamp may store in boot_params or kexec control page, so we can
> get the all timestamp after kernel boot up.
>
> huangjie.albert (4):
>   kexec: reuse crash kernel reserved memory for normal kexec
>   kexec: add CONFING_KEXEC_PURGATORY_SKIP_SIG
>   x86: Support the uncompressed kernel to speed up booting
>   x86: boot: avoid memory copy if kernel is uncompressed
>
>  arch/x86/Kconfig   | 10 +
>  arch/x86/boot/compressed/Makefile  |  5 -
>  arch/x86/boot/compressed/head_64.S |  8 +--
>  arch/x86/boot/compressed/misc.c| 35 +-
>  arch/x86/purgatory/purgatory.c |  7 ++
>  include/linux/kexec.h  |  9 
>  include/uapi/linux/kexec.h |  2 ++
>  kernel/kexec.c | 19 +++-
>  kernel/kexec_core.c| 16 --
>  kernel/kexec_file.c| 20 +++--
>  scripts/Makefile.lib   |  5 +
>  11 files changed, 114 insertions(+), 22 deletions(-)



Re: [PATCH 3/4] x86: Support the uncompressed kernel to speed up booting

2022-07-25 Thread Eric W. Biederman
Albert Huang  writes:

> From: "huangjie.albert" 
>
> Although the compressed kernel can save the time of loading the
> kernel into the memory and save the disk space for storing the kernel,
> but in some time-sensitive scenarios, the time for decompressing the
> kernel is intolerable. Therefore, it is necessary to support uncompressed
> kernel images, so that the time of kernel decompression can be saved when
> the kernel is started.
>
> This part of the time on my machine is approximately:
> image type  image  size  times
> compressed(gzip) 8.5M159ms
> uncompressed 53M 8.5ms

Why in the world are you using arch/x86/boot/compressed/... for an
uncompressed kernel?  Especially if you don't plan to process
relocations.

Even if it somehow makes sense, why have you not followed the pattern
used by the rest of the code and added a file that implements
a no-op __decompress routine?

Eric


> Signed-off-by: huangjie.albert 
> ---
>  arch/x86/Kconfig  |  1 +
>  arch/x86/boot/compressed/Makefile |  5 -
>  arch/x86/boot/compressed/misc.c   | 13 +
>  scripts/Makefile.lib  |  5 +
>  4 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index adbd3a2bd60f..231187624c68 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -221,6 +221,7 @@ config X86
>   select HAVE_KERNEL_LZO
>   select HAVE_KERNEL_XZ
>   select HAVE_KERNEL_ZSTD
> + select HAVE_KERNEL_UNCOMPRESSED
>   select HAVE_KPROBES
>   select HAVE_KPROBES_ON_FTRACE
>   select HAVE_FUNCTION_ERROR_INJECTION
> diff --git a/arch/x86/boot/compressed/Makefile 
> b/arch/x86/boot/compressed/Makefile
> index 19e1905dcbf6..0c8417a2f792 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -26,7 +26,7 @@ OBJECT_FILES_NON_STANDARD   := y
>  KCOV_INSTRUMENT  := n
>  
>  targets := vmlinux vmlinux.bin vmlinux.bin.gz vmlinux.bin.bz2 
> vmlinux.bin.lzma \
> - vmlinux.bin.xz vmlinux.bin.lzo vmlinux.bin.lz4 vmlinux.bin.zst
> + vmlinux.bin.xz vmlinux.bin.lzo vmlinux.bin.lz4 vmlinux.bin.zst 
> vmlinux.bin.none
>  
>  # CLANG_FLAGS must come before any cc-disable-warning or cc-option calls in
>  # case of cross compiling, as it has the '--target=' flag, which is needed to
> @@ -139,6 +139,8 @@ $(obj)/vmlinux.bin.lz4: $(vmlinux.bin.all-y) FORCE
>   $(call if_changed,lz4_with_size)
>  $(obj)/vmlinux.bin.zst: $(vmlinux.bin.all-y) FORCE
>   $(call if_changed,zstd22_with_size)
> +$(obj)/vmlinux.bin.none: $(vmlinux.bin.all-y) FORCE
> + $(call if_changed,none)
>  
>  suffix-$(CONFIG_KERNEL_GZIP) := gz
>  suffix-$(CONFIG_KERNEL_BZIP2):= bz2
> @@ -147,6 +149,7 @@ suffix-$(CONFIG_KERNEL_XZ):= xz
>  suffix-$(CONFIG_KERNEL_LZO)  := lzo
>  suffix-$(CONFIG_KERNEL_LZ4)  := lz4
>  suffix-$(CONFIG_KERNEL_ZSTD) := zst
> +suffix-$(CONFIG_KERNEL_UNCOMPRESSED) := none
>  
>  quiet_cmd_mkpiggy = MKPIGGY $@
>cmd_mkpiggy = $(obj)/mkpiggy $< > $@
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index cf690d8712f4..c23c0f525d93 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -181,6 +181,19 @@ void __puthex(unsigned long value)
>   }
>  }
>  
> +#ifdef CONFIG_KERNEL_UNCOMPRESSED
> +#include 
> +static int __decompress(unsigned char *buf, long len,
> + long (*fill)(void*, unsigned long),
> + long (*flush)(void*, unsigned long),
> + unsigned char *outbuf, long olen,
> + long *pos, void (*error)(char *x))
> +{
> + memcpy(outbuf, buf, olen);
> + return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_X86_NEED_RELOCS
>  static void handle_relocations(void *output, unsigned long output_len,
>  unsigned long virt_addr)
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 3fb6a99e78c4..c89d5466c617 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -438,6 +438,11 @@ quiet_cmd_lz4 = LZ4 $@
>  quiet_cmd_lz4_with_size = LZ4 $@
>   cmd_lz4_with_size = { cat $(real-prereqs) | $(LZ4) -l -c1 stdin stdout; \
>    $(size_append); } > $@
> +# none
> +quiet_cmd_none = NONE $@
> +  cmd_none = (cat $(filter-out FORCE,$^) && \
> +  $(call size_append, $(filter-out FORCE,$^))) > $@ || \
> +  (rm -f $@ ; false)
>  
>  # U-Boot mkimage
>  # ---

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] panic, kexec: Make __crash_kexec() NMI safe

2022-06-25 Thread Eric W. Biederman
Valentin Schneider  writes:

> Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
> panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
> of mutex_trylock():
>
>   if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
>   return 0;
>
> This prevents an NMI panic() from executing the main body of
> __crash_kexec() which does the actual kexec into the kdump kernel.
> The warning and return are explained by:
>
>   6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
>   [...]
>   The reasons for this are:
>
>   1) There is a potential deadlock in the slowpath
>
>   2) Another cpu which blocks on the rtmutex will boost the task
>which allegedly locked the rtmutex, but that cannot work
>because the hard/softirq context borrows the task context.
>
> Furthermore, grabbing the lock isn't NMI safe, so do away with it and
> use an atomic variable to serialize reads vs writes of
> kexec_crash_image.
>
> Tested by triggering NMI panics via:
>
>   $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
>   $ echo 1 > /proc/sys/kernel/unknown_nmi_panic
>   $ echo 1 > /proc/sys/kernel/panic
>
>   $ ipmitool power diag
>
> Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
> Signed-off-by: Valentin Schneider 

I am not particularly fond of this patch as it adds more complexity than
is necessary to solve the problem.

Calling a spade a spade, PREEMPT_RT's mutex_trylock implementation is
broken, as it cannot support the use cases of an ordinary mutex_trylock.
I have not seen (possibly I skimmed too quickly) anywhere in the
discussion why PREEMPT_RT is not being fixed.  Looking at the code,
there is enough going on in try_to_take_rt_mutex that I can imagine
that some part of that code is not NMI safe.  So I can believe
PREEMPT_RT may be unfixably broken.


At this point I recommend going back to being ``unconventional'' with
the kexec locking and effectively reverting commit 8c5a1cf0ad3a ("kexec:
use a mutex for locking rather than xchg()").

That would also mean that we don't have to worry about the lockdep code
doing something weird in the future and breaking kexec.

Your change, switching to atomic_cmpxchg, is already most of the way to a
revert of commit 8c5a1cf0ad3a ("kexec: use a mutex for locking rather than
xchg()").  So we might as well go the whole way and just document that
the kexec-on-panic code cannot use conventional kernel locking
primitives and has to dig deep and build its own.  At which point it
makes no sense for the rest of the kexec code to use anything different.

Eric



Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory

2022-06-23 Thread Eric W. Biederman
Dave Hansen  writes:

> ... adding kexec folks
>
> On 6/14/22 05:02, Kirill A. Shutemov wrote:
>> On kexec, the target kernel has to know what memory has been accepted.
>> Information in EFI map is out of date and cannot be used.
>> 
>> boot_params.unaccepted_memory can be used to pass the bitmap between two
>> kernels on kexec, but the use-case is not yet implemented.
>> 
>> Disable kexec on machines with unaccepted memory for now.
> ...
>> +static int __init unaccepted_init(void)
>> +{
>> +if (!boot_params.unaccepted_memory)
>> +return 0;
>> +
>> +#ifdef CONFIG_KEXEC_CORE
>> +/*
>> + * TODO: Information on memory acceptance status has to be communicated
>> + * between kernel.
>> + */
>> +pr_warn("Disable kexec: not yet supported on systems with unaccepted memory\n");
>> +kexec_load_disabled = 1;
>> +#endif
>
> This looks to be the *only* in-kernel user tweaking kexec_load_disabled.
>  It doesn't feel great to just be disabling kexec like this.  Why not
> just fix it properly?
>
> What do the kexec folks think?

I didn't realize someone had implemented kexec_load_disabled.  I am not
particularly happy about that.  It looks like an over-broad stick that
we will have to support forever.

This change looks like it just builds on that bad decision.

If people don't want to deal with this situation right now, then I
recommend they make this new code and KEXEC conflict at the Kconfig
level.  That would give serious incentive to add the missing
implementation.

If there is some deep and fundamental reason why this cannot be supported,
then it probably makes sense to put some code in the arch_kexec_load
hook that verifies that deep and fundamental reason is present.

With the kexec code, all we have to verify it works is a little testing
and careful code review.  Something like this makes code review much
harder, because the entire kernel has to be checked to see if some random
driver without locking changed a variable, rather than having it
apparent that this special case exists when reading through the kexec
code.

Eric




Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-25 Thread Eric W. Biederman
"Guilherme G. Piccoli"  writes:

> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may badly affect kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
>
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
>
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
>
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
>
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
>
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
>
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
>
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
>
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
>
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
>
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
>
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
>
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/

I am late to this discussion.  My apologies.

Unfortunately I am also very grouchy.

Notifiers before kexec on panic are fundamentally broken.  So please
just remove crash_kexec_post_notifiers and be done with it.  Part of the
deep issue is that firmware always has a common and broken
implementation for anything that is not mission critical to
motherboards.

Notifiers in any sense on these paths are just bollocks.  Any kind of
notifier list is fundamentally fragile in the face of memory corruption
and very very difficult to review.

So I am going to refresh my ancient NACK on this.

I can certainly appreciate that there are pieces of the reboot paths
that can be improved.  I don't think making anything more feature full
or flexible is any kind of real improvement.

Eric



>
> Suggested-by: Petr Mladek 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>
> Special thanks to Petr and Baoquan for the suggestion and feedback in a previous
> email thread.  There are some important design decisions that are worth mentioning
> and discussing:
>
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but 

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-20 Thread Eric W. Biederman
Baoquan He  writes:

> On 05/19/22 at 12:59pm, Eric W. Biederman wrote:
>> Baoquan He  writes:
>> 
>> > Hi Eric,
>> >
>> > On 05/18/22 at 04:59pm, Eric W. Biederman wrote:
>> >> "Naveen N. Rao"  writes:
>> >> 
>> >> > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>> >> > symbols") [1], binutils (v2.36+) started dropping section symbols that
>> >> > it thought were unused.  This isn't an issue in general, but with
>> >> > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
>> >> > separate .text.unlikely section and the section symbol ".text.unlikely"
>> >> > is being dropped. Due to this, recordmcount is unable to find a non-weak
>> >> > symbol in .text.unlikely to generate a relocation record against.
>> >> >
>> >> > Address this by dropping the weak attribute from these functions:
>> >> > - arch_kexec_apply_relocations() is not overridden by any architecture
>> >> >   today, so just drop the weak attribute.
>> >> > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>> >> >   Retain the function prototype for those and move the weak
>> >> >   implementation into the header as a static inline for other
>> >> >   architectures.
>> >> >
>> >> > [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1
>> >> 
>> >> Any chance you can also get machine_kexec_post_load,
>> >> crash_free_reserved_phys_range, arch_kexec_protect_protect_crashkres,
>> >> arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe,
>> >> arch_kexec_kernel_image_probe, arch_kimage_file_post_load_cleanup,
>> >> arch_kexec_kernel_verify_sig, and arch_kexec_locate_mem_hole as well.
>> >> 
>> >> That is everything in kexec that uses a __weak symbol.  If we can't
>> >> count on them working we might as well just get rid of the rest
>> >> preemptively.
>> >
>> > Is there a new rule that __weak is not suggested in kernel any more?
>> > Please help provide a pointer if yes, so that I can learn that.
>> >
>> > In my mind, __weak is very simple and clear as a mechanism to add
>> > ARCH related functionality.
>> 
>> You should be able to trace the conversation back for all of the details
>> but if you can't here is the summary.
>> 
>> There is a tool that some architectures use called recordmcount.  The
>> recordmcount looks for a symbol in a section, and ignores all weak
>> symbols.  In certain cases sections become so simple there are only weak
>> symbols.  At which point recordmcount fails.
>> 
>> Which means in practice __weak symbols are unreliable and don't work
>> to add ARCH related functionality.
>> 
>> Given that __weak symbols fail randomly I would much rather have simpler
>> code that doesn't fail.  It has never been the case that __weak symbols
>> have been very common in the kernel.  I expect they are something like
bool that have been gaining traction.  Still, given that __weak symbols
don't work, I don't want them.
>
> Thanks for the summary, Eric.
>
> From Naveen's reply, what I got is, llvm's recent change makes
> symbol of section .text.unlikely lost,

If I have read the thread correctly this change happened in both
llvm and binutils.  So both tools chains that are used to build the
kernel.

> but the secton .text.unlikely
> still exists. The __weak symbol will be put in .text.unlikely partly,
> when arch_kexec_apply_relocations_add() includes the pr_err line. While
> removing the pr_err() line will put __weak symbol
> arch_kexec_apply_relocations_add() in .text instead.

Yes.  Calling pr_err has some effect: either causing an mcount
entry to be omitted, or causing the symbols in the function to be
placed in .text.unlikely.

> Now the status is that not only recordmcount hit this problem; objtool
> met it too and got an appropriate fix, meaning objtool's fix doesn't need
> any kernel adjustment.  Recordmcount needs the kernel to adjust because it
> lacks continuous support and development.  Naveen also said that they are
> converting to objtool; just the old CI cases rely on recordmcount.  In
> fact, if someone stands up to produce an appropriate recordmcount fix too,
> the problem will be gone as well.

If the descriptions are correct I suspect recordmcount could just
decide to use the weak sy

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-19 Thread Eric W. Biederman
Baoquan He  writes:

> Hi Eric,
>
> On 05/18/22 at 04:59pm, Eric W. Biederman wrote:
>> "Naveen N. Rao"  writes:
>> 
>> > Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>> > symbols") [1], binutils (v2.36+) started dropping section symbols that
>> > it thought were unused.  This isn't an issue in general, but with
>> > kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
>> > separate .text.unlikely section and the section symbol ".text.unlikely"
>> > is being dropped. Due to this, recordmcount is unable to find a non-weak
>> > symbol in .text.unlikely to generate a relocation record against.
>> >
>> > Address this by dropping the weak attribute from these functions:
>> > - arch_kexec_apply_relocations() is not overridden by any architecture
>> >   today, so just drop the weak attribute.
>> > - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>> >   Retain the function prototype for those and move the weak
>> >   implementation into the header as a static inline for other
>> >   architectures.
>> >
>> > [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1
>> 
>> Any chance you can also get machine_kexec_post_load,
>> crash_free_reserved_phys_range, arch_kexec_protect_protect_crashkres,
>> arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe,
>> arch_kexec_kernel_image_probe, arch_kimage_file_post_load_cleanup,
>> arch_kexec_kernel_verify_sig, and arch_kexec_locate_mem_hole as well.
>> 
>> That is everything in kexec that uses a __weak symbol.  If we can't
>> count on them working we might as well just get rid of the rest
>> preemptively.
>
> Is there a new rule that __weak is not suggested in kernel any more?
> Please help provide a pointer if yes, so that I can learn that.
>
> In my mind, __weak is very simple and clear as a mechanism to add
> ARCH related functionality.

You should be able to trace the conversation back for all of the details
but if you can't here is the summary.

There is a tool that some architectures use called recordmcount.  The
recordmcount looks for a symbol in a section, and ignores all weak
symbols.  In certain cases sections become so simple there are only weak
symbols.  At which point recordmcount fails.

Which means in practice __weak symbols are unreliable and don't work
to add ARCH related functionality.

Given that __weak symbols fail randomly I would much rather have simpler
code that doesn't fail.  It has never been the case that __weak symbols
have been very common in the kernel.  I expect they are something like
bool that have been gaining traction.  Still, given that __weak symbols
don't work, I don't want them.

Eric



Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-18 Thread Eric W. Biederman
"Naveen N. Rao"  writes:

> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
> symbols") [1], binutils (v2.36+) started dropping section symbols that
> it thought were unused.  This isn't an issue in general, but with
> kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
> separate .text.unlikely section and the section symbol ".text.unlikely"
> is being dropped. Due to this, recordmcount is unable to find a non-weak
> symbol in .text.unlikely to generate a relocation record against.
>
> Address this by dropping the weak attribute from these functions:
> - arch_kexec_apply_relocations() is not overridden by any architecture
>   today, so just drop the weak attribute.
> - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>   Retain the function prototype for those and move the weak
>   implementation into the header as a static inline for other
>   architectures.
>
> [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1

Any chance you can also get machine_kexec_post_load,
crash_free_reserved_phys_range, arch_kexec_protect_protect_crashkres,
arch_kexec_unprotect_crashkres, arch_kexec_kernel_image_probe,
arch_kexec_kernel_image_probe, arch_kimage_file_post_load_cleanup,
arch_kexec_kernel_verify_sig, and arch_kexec_locate_mem_hole as well.

That is everything in kexec that uses a __weak symbol.  If we can't
count on them working we might as well just get rid of the rest
preemptively.

Could you also address Andrew's concerns by using a Kconfig symbol that
the architectures that implement the symbol can select.
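The Kconfig-select shape being asked for might look something like the following sketch (the symbol name is a guess for illustration, not taken from the patch):

```kconfig
# Architectures that provide their own arch_kexec_apply_relocations_add()
# select this; everyone else gets the static-inline stub from the header.
config ARCH_HAS_KEXEC_APPLY_RELOCATIONS_ADD
	bool
```

The header would then test `#ifdef CONFIG_ARCH_HAS_KEXEC_APPLY_RELOCATIONS_ADD` instead of hard-coding an architecture list like `defined(CONFIG_X86_64) || defined(CONFIG_S390)`, so adding a third implementation only requires a `select`, not an edit to generic code.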

I don't want to ask too much of a volunteer, but if you are willing,
addressing both of those would be a great help.

Eric

> Signed-off-by: Naveen N. Rao 
> ---
>  include/linux/kexec.h | 28 
>  kernel/kexec_file.c   | 19 +--
>  2 files changed, 25 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 58d1b58a971e34..e656f981f43a73 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -193,10 +193,6 @@ void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name);
>  int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> unsigned long buf_len);
>  void *arch_kexec_kernel_image_load(struct kimage *image);
> -int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
> -  Elf_Shdr *section,
> -  const Elf_Shdr *relsec,
> -  const Elf_Shdr *symtab);
>  int arch_kexec_apply_relocations(struct purgatory_info *pi,
>Elf_Shdr *section,
>const Elf_Shdr *relsec,
> @@ -229,6 +225,30 @@ extern int crash_exclude_mem_range(struct crash_mem *mem,
>  unsigned long long mend);
>  extern int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map,
>  void **addr, unsigned long *sz);
> +
> +#if defined(CONFIG_X86_64) || defined(CONFIG_S390)
> +int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
> +  Elf_Shdr *section,
> +  const Elf_Shdr *relsec,
> +  const Elf_Shdr *symtab);
> +#else
> +/*
> + * arch_kexec_apply_relocations_add - apply relocations of type RELA
> + * @pi:  Purgatory to be relocated.
> + * @section: Section relocations applying to.
> + * @relsec:  Section containing RELAs.
> + * @symtab:  Corresponding symtab.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +static inline int
> +arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr *section,
> +  const Elf_Shdr *relsec, const Elf_Shdr *symtab)
> +{
> + pr_err("RELA relocation unsupported.\n");
> + return -ENOEXEC;
> +}
> +#endif /* CONFIG_X86_64 || CONFIG_S390 */
>  #endif /* CONFIG_KEXEC_FILE */
>  
>  #ifdef CONFIG_KEXEC_ELF
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 8347fc158d2b96..6bae253b4d315e 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -108,23 +108,6 @@ int __weak arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
>  }
>  #endif
>  
> -/*
> - * arch_kexec_apply_relocations_add - apply relocations of type RELA
> - * @pi:  Purgatory to be relocated.
> - * @section: Section relocations applying to.
> - * @relsec:  Section containing RELAs.
> - * @symtab:  Corresponding symtab.
> - *
> - * Return: 0 on success, negative errno on error.
> - */
> -int __weak
> -arch_kexec_apply_relocations_add(struct purgatory_info *pi, Elf_Shdr *section,
> -  const Elf_Shdr *relsec, const Elf_Shdr *symtab)
> -{
> - pr_err("RELA relocation unsupported.\n");
> - return -ENOEXEC;
> -}
> -

Re: [PATCH] kexec_file: Drop pr_err in weak implementations of arch_kexec_apply_relocations[_add]

2022-05-18 Thread Eric W. Biederman
Michael Ellerman  writes:

> "Eric W. Biederman"  writes:
>> Looking at this the pr_err is absolutely needed.  If an unsupported case
>> winds up in the purgatory blob and the code can't handle it things
>> will fail silently much worse later.
>
> It won't fail later, it will fail the syscall.
>
> sys_kexec_file_load()
>   kimage_file_alloc_init()
> kimage_file_prepare_segments()
>   arch_kexec_kernel_image_load()
> kexec_image_load_default()
>   image->fops->load()
> elf64_load()# powerpc
> bzImage64_load()# x86
>   kexec_load_purgatory()
> kexec_apply_relocations()
>
> Which does:
>
>   if (relsec->sh_type == SHT_RELA)
>   ret = arch_kexec_apply_relocations_add(pi, section,
>  relsec, symtab);
>   else if (relsec->sh_type == SHT_REL)
>   ret = arch_kexec_apply_relocations(pi, section,
>  relsec, symtab);
>   if (ret)
>   return ret;
>
> And that error is bubbled all the way back up. So as long as
> arch_kexec_apply_relocations() returns an error the syscall will fail
> back to userspace and there'll be an error message at that level.
>
> It's true that having nothing printed in dmesg makes it harder to work
> out why the syscall failed. But it's a kernel bug if there are unhandled
> relocations in the kernel-supplied purgatory code, so a user really has
> no way to do anything about the error even if it is printed.

Good point.  I really hadn't noticed the error code in there when I
looked.

I still don't think changing the functionality of the code because of
a tool issue is the right solution.


>> "Naveen N. Rao"  writes:
>>
>>> Baoquan He wrote:
>>>> On 04/25/22 at 11:11pm, Naveen N. Rao wrote:
>>>>> kexec_load_purgatory() can fail for many reasons - there is no need to
>>>>> print an error when encountering unsupported relocations.
>>>>> This solves a build issue on powerpc with binutils v2.36 and newer [1].
>>>>> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>>>>> symbols") [2], binutils started dropping section symbols that it thought
>>>> I am not familiar with binutils, while wondering if this exists in other
>>>> ARCHes except of ppc. Arm64 doesn't have the ARCH override either, do we
>>>> have problem with it?
>>>
>>> I'm not aware of this specific file causing a problem on other 
>>> architectures -
>>> perhaps the config options differ enough. There are however more reports of
>>> similar issues affecting other architectures with the llvm integrated 
>>> assembler:
>>> https://github.com/ClangBuiltLinux/linux/issues/981
>>>
>>>>
>>>>> were unused.  This isn't an issue in general, but with kexec_file.c, gcc
>>>>> is placing kexec_arch_apply_relocations[_add] into a separate
>>>>> .text.unlikely section and the section symbol ".text.unlikely" is being
>>>>> dropped. Due to this, recordmcount is unable to find a non-weak symbol
>>>> But arch_kexec_apply_relocations_add is weak symbol on ppc.
>>>
>>> Yes. Note that it is just the section symbol that gets dropped. The section 
>>> is
>>> still present and will continue to hold the symbols for the functions
>>> themselves.
>>
>> So we have a case where binutils thinks it is doing something useful
>> and our kernel specific tool gets tripped up by it.
>
> It's not just binutils, the LLVM assembler has the same behavior.
>
>> Reading the recordmcount code it looks like it is finding any symbol
>> within a section but ignoring weak symbols.  So I suspect the only
>> remaining symbol in the section is __weak and that confuses
>> recordmcount.
>>
>> Does removing the __weak annotation on those functions fix the build
>> error?  If so we can restructure the kexec code to simply not use __weak
>> symbols.
>>
>> Otherwise the fix needs to be in recordmcount or binutils, and we should
>> loop whoever maintains recordmcount in to see what they can do.
>
> It seems that recordmcount is not really maintained anymore now that x86
> uses objtool?
>
> There've been several threads about fixing recordmcount, but none of
> them seem to have lead to a solution.

That is unfortunate.

> These weak symbol vs recordmcount problems ha

Re: [PATCH] kexec_file: Drop pr_err in weak implementations of arch_kexec_apply_relocations[_add]

2022-05-17 Thread Eric W. Biederman



Looking at this, the pr_err is absolutely needed.  If an unsupported case
winds up in the purgatory blob and the code can't handle it, things
will fail silently and much worse later.  So the proposed patch is
unfortunately the wrong direction.

"Naveen N. Rao"  writes:

> Baoquan He wrote:
>> On 04/25/22 at 11:11pm, Naveen N. Rao wrote:
>>> kexec_load_purgatory() can fail for many reasons - there is no need to
>>> print an error when encountering unsupported relocations.
>>> This solves a build issue on powerpc with binutils v2.36 and newer [1].
>>> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
>>> symbols") [2], binutils started dropping section symbols that it thought
>> I am not familiar with binutils, while wondering if this exists in other
>> ARCHes except of ppc. Arm64 doesn't have the ARCH override either, do we
>> have problem with it?
>
> I'm not aware of this specific file causing a problem on other architectures -
> perhaps the config options differ enough. There are however more reports of
> similar issues affecting other architectures with the llvm integrated 
> assembler:
> https://github.com/ClangBuiltLinux/linux/issues/981
>
>> 
>>> were unused.  This isn't an issue in general, but with kexec_file.c, gcc
>>> is placing kexec_arch_apply_relocations[_add] into a separate
>>> .text.unlikely section and the section symbol ".text.unlikely" is being
>>> dropped. Due to this, recordmcount is unable to find a non-weak symbol
>> But arch_kexec_apply_relocations_add is weak symbol on ppc.
>
> Yes. Note that it is just the section symbol that gets dropped. The section is
> still present and will continue to hold the symbols for the functions
> themselves.

So we have a case where binutils thinks it is doing something useful
and our kernel specific tool gets tripped up by it.

Reading the recordmcount code it looks like it is finding any symbol
within a section but ignoring weak symbols.  So I suspect the only
remaining symbol in the section is __weak and that confuses
recordmcount.

Does removing the __weak annotation on those functions fix the build
error?  If so we can restructure the kexec code to simply not use __weak
symbols.

Otherwise the fix needs to be in recordmcount or binutils, and we should
loop whoever maintains recordmcount in to see what they can do.

Eric



Re: [PATCH v2 01/12] kexec: Allow architecture code to opt-out at runtime

2021-11-02 Thread Eric W. Biederman
Joerg Roedel  writes:

> Hi again,
>
> On Mon, Nov 01, 2021 at 04:11:42PM -0500, Eric W. Biederman wrote:
>> I seem to remember the consensus when this was reviewed that it was
>> unnecessary and there is already support for doing something like
>> this at a more fine grained level so we don't need a new kexec hook.
>
> Forgot to state to problem again which these patches solve:
>
> Currently a Linux kernel running as an SEV-ES guest has no way to
> successfully kexec into a new kernel. The normal SIPI sequence to reset
> the non-boot VCPUs does not work in SEV-ES guests and special code is
> needed in Linux to safely hand over the VCPUs from one kernel to the
> next. What happens currently is that the kexec'ed kernel will just hang.
>
> The code which implements the VCPU hand-over is also included in this
> patch-set, but it requires a certain level of Hypervisor support which
> is not available everywhere.
>
> To make it clear to the user that kexec will not work in their
> environment, it is best to disable the respected syscalls. This is what
> the hook is needed for.

Note this is environmental.  This is the equivalent of a driver for a
device without some feature.

The kernel already has machine_kexec_prepare, which is perfectly capable
of detecting that this is a problem and causing kexec_load to fail.  Which
is all that is required.

We don't need a new hook and a new code path to test for one
architecture.

So when we can reliably cause the system call to fail with a specific
error code, I don't think it makes sense to clutter up generic code
because of one architecture's design mistakes.


My honest preference would be to go farther and have a
firmware/hypervisor/platform independent rendezvous for the cpus so we
don't have to worry about what bugs the code underneath has implemented
for this special case.  Because frankly, when there are layers of
software, if a bug can slip through it always seems to, and it causes
problems.


But definitely there is no reason to add another generic hook when the
existing hook is quite good enough.

Eric


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/3] RISC-V: use memcpy for kexec_file mode

2021-11-01 Thread Eric W. Biederman
Björn Töpel  writes:

> On Sat, 30 Oct 2021 at 05:51, Eric W. Biederman  wrote:
>>
>> Liao Chang  writes:
>>
>> > The pointer to the buffer loading kernel binaries is in kernel space for
>> > kexec_file mode. When copy_from_user copies data from a pointer to a block
>> > of memory, it checks that the pointer is in the user space range, on
>> > RISC-V that is:
>> >
>> > static inline bool __access_ok(unsigned long addr, unsigned long size)
>> > {
>> >   return size <= TASK_SIZE && addr <= TASK_SIZE - size;
>> > }
>> >
>> > and TASK_SIZE is 0x40 for 64-bits, which now causes
>> > copy_from_user to reject the access of the field 'buf' of struct
>> > kexec_segment that is in the range [CONFIG_PAGE_OFFSET - VMALLOC_SIZE,
>> > CONFIG_PAGE_OFFSET), an invalid user space pointer.
>> >
>> > This patch fixes this issue by skipping access_ok() and using memcpy() instead.
>>
>> I am a bit confused.
>>
>> Why is machine_kexec ever calling copy_from_user?  That seems wrong in
>> all cases.
>>
>
> It's not machine_kexec -- it's machine_kexec_prepare, which pulls out
> the FDT from the image. It looks like MIPS does it similarly.
>
> (Caveat: I might be confused as well! ;-))

True, it is machine_kexec_prepare.  But still, copy_from_user does not
belong in there: it is not passed a userspace pointer.

This looks more like a case for kmap to read a table from the firmware.

Even if it somehow made sense, it definitely does not make sense to
make it a conditional copy_from_user.  That way lies madness.

The entire change is a smell that there is some abstraction that is
going wrong, and that abstraction needs to get fixed.

Eric




Re: [PATCH v2 01/12] kexec: Allow architecture code to opt-out at runtime

2021-11-01 Thread Eric W. Biederman
Borislav Petkov  writes:

> On Mon, Sep 13, 2021 at 05:55:52PM +0200, Joerg Roedel wrote:
>> From: Joerg Roedel 
>> 
>> Allow a runtime opt-out of kexec support for architecture code in case
>> the kernel is running in an environment where kexec is not properly
>> supported yet.
>> 
>> This will be used on x86 when the kernel is running as an SEV-ES
>> guest. SEV-ES guests need special handling for kexec to hand over all
>> CPUs to the new kernel. This requires special hypervisor support and
>> handling code in the guest which is not yet implemented.
>> 
>> Cc: sta...@vger.kernel.org # v5.10+
>> Signed-off-by: Joerg Roedel 
>> ---
>>  include/linux/kexec.h |  1 +
>>  kernel/kexec.c| 14 ++
>>  kernel/kexec_file.c   |  9 +
>>  3 files changed, 24 insertions(+)
>
> I guess I can take this through the tip tree along with the next one.

I seem to remember the consensus when this was reviewed that it was
unnecessary and there is already support for doing something like
this at a more fine grained level so we don't need a new kexec hook.

Eric




Re: [PATCH 2/3] RISC-V: use memcpy for kexec_file mode

2021-10-29 Thread Eric W. Biederman
Liao Chang  writes:

> The pointer to the buffer loading kernel binaries is in kernel space for
> kexec_file mode. When copy_from_user copies data from a pointer to a block
> of memory, it checks that the pointer is in the user space range, on
> RISC-V that is:
>
> static inline bool __access_ok(unsigned long addr, unsigned long size)
> {
>   return size <= TASK_SIZE && addr <= TASK_SIZE - size;
> }
>
> and TASK_SIZE is 0x40 for 64-bits, which now causes
> copy_from_user to reject the access of the field 'buf' of struct
> kexec_segment that is in the range [CONFIG_PAGE_OFFSET - VMALLOC_SIZE,
> CONFIG_PAGE_OFFSET), an invalid user space pointer.
>
> This patch fixes this issue by skipping access_ok() and using memcpy() instead.

I am a bit confused.

Why is machine_kexec ever calling copy_from_user?  That seems wrong in
all cases.

Even worse than having a copy_from_user is having data that you don't
know whether you should call copy_from_user on.

There is most definitely a bug here.  Can someone please sort it out
without making the kernel guess what kind of memory it is copying from.

Eric



> Signed-off-by: Liao Chang 
> ---
>  arch/riscv/kernel/machine_kexec.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/kernel/machine_kexec.c 
> b/arch/riscv/kernel/machine_kexec.c
> index e6eca271a4d6..4a5db856919b 100644
> --- a/arch/riscv/kernel/machine_kexec.c
> +++ b/arch/riscv/kernel/machine_kexec.c
> @@ -65,7 +65,9 @@ machine_kexec_prepare(struct kimage *image)
>   if (image->segment[i].memsz <= sizeof(fdt))
>   continue;
>  
> - if (copy_from_user(&fdt, image->segment[i].buf, sizeof(fdt)))
> + if (image->file_mode)
> + memcpy(&fdt, image->segment[i].buf, sizeof(fdt));
> + else if (copy_from_user(&fdt, image->segment[i].buf, 
> sizeof(fdt)))
>   continue;
>  
>   if (fdt_check_header(&fdt))



Re: [PATCH v2 1/2] kexec, KEYS: make the code in bzImage64_verify_sig generic

2021-10-11 Thread Eric W. Biederman
Coiby Xu  writes:

> The code in bzImage64_verify_sig could make use of system keyrings, including
> .builtin_trusted_keys, .secondary_trusted_keys and the .platform keyring, to verify
> a signed kernel image as a PE file. Make it generic so both x86_64 and arm64 can
> use it.

The naming is problematic.

At a minimum please name the new function kexec_kernel_verify_pe_sig,
i.e. what you named it without the "arch_" prefix.  A function named
with an "arch_" prefix implies that it has an architecture-specific
implementation.

It looks like arch_kexec_kernel_verify_sig should be killed as well
as it only has one implementation in the generic code.  And the code
should always call kexec_image_verify_sig_default.  Not that you should
do that but I am pointing it out as it seems that is the bad example you
are copying.

Eric

> Signed-off-by: Coiby Xu 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 13 +
>  include/linux/kexec.h |  3 +++
>  kernel/kexec_file.c   | 17 +
>  3 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 170d0fd68b1f..4136dd3be5a9 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -17,7 +17,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  #include 
>  #include 
> @@ -531,17 +530,7 @@ static int bzImage64_cleanup(void *loader_data)
>  #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
>  static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
>  {
> - int ret;
> -
> - ret = verify_pefile_signature(kernel, kernel_len,
> -   VERIFY_USE_SECONDARY_KEYRING,
> -   VERIFYING_KEXEC_PE_SIGNATURE);
> - if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> - ret = verify_pefile_signature(kernel, kernel_len,
> -   VERIFY_USE_PLATFORM_KEYRING,
> -   VERIFYING_KEXEC_PE_SIGNATURE);
> - }
> - return ret;
> + return arch_kexec_kernel_verify_pe_sig(kernel, kernel_len);
>  }
>  #endif
>  
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 0c994ae37729..d45f32336dbe 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -19,6 +19,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  #ifdef CONFIG_KEXEC_CORE
>  #include 
> @@ -199,6 +200,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage 
> *image);
>  #ifdef CONFIG_KEXEC_SIG
>  int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
>unsigned long buf_len);
> +int arch_kexec_kernel_verify_pe_sig(const char *kernel,
> + unsigned long kernel_len);
>  #endif
>  int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
>  
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 33400ff051a8..0530275b7aa3 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -106,6 +106,23 @@ int __weak arch_kexec_kernel_verify_sig(struct kimage 
> *image, void *buf,
>  {
>   return kexec_image_verify_sig_default(image, buf, buf_len);
>  }
> +
> +#ifdef CONFIG_SIGNED_PE_FILE_VERIFICATION
> +int arch_kexec_kernel_verify_pe_sig(const char *kernel, unsigned long 
> kernel_len)
> +{
> + int ret;
> +
> + ret = verify_pefile_signature(kernel, kernel_len,
> +   VERIFY_USE_SECONDARY_KEYRING,
> +   VERIFYING_KEXEC_PE_SIGNATURE);
> + if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> + ret = verify_pefile_signature(kernel, kernel_len,
> +   VERIFY_USE_PLATFORM_KEYRING,
> +   VERIFYING_KEXEC_PE_SIGNATURE);
> + }
> + return ret;
> +}
> +#endif
>  #endif
>  
>  /*



Re: [PATCH v3 1/4] kexec: simplify compat_sys_kexec_load

2021-05-18 Thread Eric W. Biederman


Arnd Bergmann  writes:

> On Tue, May 18, 2021 at 4:05 PM Arnd Bergmann  wrote:
>>
>> On Tue, May 18, 2021 at 3:41 PM Eric W. Biederman  
>> wrote:
>> >
>> > Arnd Bergmann  writes:
>> >
>> > > From: Arnd Bergmann 
>> > >
>> > > The compat version of sys_kexec_load() uses compat_alloc_user_space to
>> > > convert the user-provided arguments into the native format.
>> > >
>> > > Move the conversion into the regular implementation with
>> > > an in_compat_syscall() check to simplify it and avoid the
>> > > compat_alloc_user_space() call.
>> > >
>> > > compat_sys_kexec_load() now behaves the same as sys_kexec_load().
>> >
>> > Nacked-by: "Eric W. Biederman" 
>> >
>> > The patch is wrong.
>> >
>> > The logic between the compat entry point and the ordinary entry point
>> > are by necessity different.   This unifies the logic and breaks the compat
>> > entry point.
>> >
>> > The fundamental necessity is that the code being loaded needs to know
>> > which mode the kernel is running in so it can safely transition to the
>> > new kernel.
>> >
>> > Given that the two entry points fundamentally need different logic,
>> > that difference was not preserved: the goal of this patch set was to
>> > unify what fundamentally needs to be different.  I don't think this
>> > patch series makes any sense for kexec.
>>
>> Sorry, I'm not following that explanation. Can you clarify what different
>> modes of the kernel you are referring to here, and how my patch
>> changes this?

I think something like the untested diff below is enough to get rid of
compat_alloc_user_space cleanly.

Certainly it should be enough to give an idea of what I am thinking.

diff --git a/kernel/kexec.c b/kernel/kexec.c
index c82c6c06f051..ce69a5d68023 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -19,26 +19,21 @@
 
 #include "kexec_internal.h"
 
-static int copy_user_segment_list(struct kimage *image,
+static void copy_user_segment_list(struct kimage *image,
  unsigned long nr_segments,
- struct kexec_segment __user *segments)
+ struct kexec_segment *segments)
 {
-   int ret;
size_t segment_bytes;
 
/* Read in the segments */
image->nr_segments = nr_segments;
segment_bytes = nr_segments * sizeof(*segments);
-   ret = copy_from_user(image->segment, segments, segment_bytes);
-   if (ret)
-   ret = -EFAULT;
-
-   return ret;
+   memcpy(image->segment, segments, segment_bytes);
 }
 
 static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
 unsigned long nr_segments,
-struct kexec_segment __user *segments,
+struct kexec_segment *segments,
 unsigned long flags)
 {
int ret;
@@ -59,9 +54,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned 
long entry,
 
image->start = entry;
 
-   ret = copy_user_segment_list(image, nr_segments, segments);
-   if (ret)
-   goto out_free_image;
+   copy_user_segment_list(image, nr_segments, segments);
 
if (kexec_on_panic) {
/* Enable special crash kernel control page alloc policy. */
@@ -103,8 +96,8 @@ static int kimage_alloc_init(struct kimage **rimage, 
unsigned long entry,
return ret;
 }
 
-static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
-   struct kexec_segment __user *segments, unsigned long flags)
+static int do_kexec_load_locked(unsigned long entry, unsigned long nr_segments,
+   struct kexec_segment *segments, unsigned long flags)
 {
struct kimage **dest_image, *image;
unsigned long i;
@@ -174,6 +167,27 @@ static int do_kexec_load(unsigned long entry, unsigned 
long nr_segments,
return ret;
 }
 
+static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
+   struct kexec_segment *segments, unsigned long flags)
+{
+   int result;
+
+   /* Because we write directly to the reserved memory
+* region when loading crash kernels we need a mutex here to
+* prevent multiple crash  kernels from attempting to load
+* simultaneously, and to prevent a crash kernel from loading
+* over the top of a in use crash kernel.
+*
+* KISS: always take the mutex.
+*/
+   if (!mutex_trylock(&kexec_mutex))
+   return -EBUSY;
+
+   

Re: [PATCH v3 1/4] kexec: simplify compat_sys_kexec_load

2021-05-18 Thread Eric W. Biederman
Arnd Bergmann  writes:

> On Tue, May 18, 2021 at 4:05 PM Arnd Bergmann  wrote:
>>
>> On Tue, May 18, 2021 at 3:41 PM Eric W. Biederman  
>> wrote:
>> >
>> > Arnd Bergmann  writes:
>> >
>> > > From: Arnd Bergmann 
>> > >
>> > > The compat version of sys_kexec_load() uses compat_alloc_user_space to
>> > > convert the user-provided arguments into the native format.
>> > >
>> > > Move the conversion into the regular implementation with
>> > > an in_compat_syscall() check to simplify it and avoid the
>> > > compat_alloc_user_space() call.
>> > >
>> > > compat_sys_kexec_load() now behaves the same as sys_kexec_load().
>> >
>> > Nacked-by: "Eric W. Biederman" 
>> >
>> > The patch is wrong.
>> >
>> > The logic between the compat entry point and the ordinary entry point
>> > are by necessity different.   This unifies the logic and breaks the compat
>> > entry point.
>> >
>> > The fundamental necessity is that the code being loaded needs to know
>> > which mode the kernel is running in so it can safely transition to the
>> > new kernel.
>> >
>> > Given that the two entry points fundamentally need different logic,
>> > that difference was not preserved: the goal of this patch set was to
>> > unify what fundamentally needs to be different.  I don't think this
>> > patch series makes any sense for kexec.
>>
>> Sorry, I'm not following that explanation. Can you clarify what different
>> modes of the kernel you are referring to here, and how my patch
>> changes this?
>
> I think I figured it out now myself after comparing the two functions:
>
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -269,7 +269,8 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry,
> unsigned long, nr_segments,
>
> /* Verify we are on the appropriate architecture */
> if (((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH) &&
> -   ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
> +   (in_compat_syscall() ||
> +   ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT)))
> return -EINVAL;
>
> /* Because we write directly to the reserved memory
>
> Not sure if that's the best way of doing it, but it looks like folding this
> in restores the current behavior.

Yes.  That is pretty much all there is.

I personally can't stand the sight of in_compat_syscall(), doubly so
when you have to lie to the type system with casts.  The cognitive
dissonance I experience is extreme.

I will be happy to help you find another way to get rid of
compat_alloc_user_space, but not that way.


There is a whole mess in there, introduced when someone added
do_kexec_load while I was napping in 2017, that makes the system calls
very hard to follow.  It all needs to be cleaned up.

Eric



Re: [PATCH v3 1/4] kexec: simplify compat_sys_kexec_load

2021-05-18 Thread Eric W. Biederman
Arnd Bergmann  writes:

> From: Arnd Bergmann 
>
> The compat version of sys_kexec_load() uses compat_alloc_user_space to
> convert the user-provided arguments into the native format.
>
> Move the conversion into the regular implementation with
> an in_compat_syscall() check to simplify it and avoid the
> compat_alloc_user_space() call.
>
> compat_sys_kexec_load() now behaves the same as sys_kexec_load().

Nacked-by: "Eric W. Biederman" 

The patch is wrong.

The logic between the compat entry point and the ordinary entry point
are by necessity different.   This unifies the logic and breaks the compat
entry point.

The fundamental necessity is that the code being loaded needs to know
which mode the kernel is running in so it can safely transition to the
new kernel.

Given that the two entry points fundamentally need different logic,
that difference was not preserved: the goal of this patch set was to
unify what fundamentally needs to be different.  I don't think this
patch series makes any sense for kexec.

Eric




>
> Signed-off-by: Arnd Bergmann 
> ---
>  include/linux/kexec.h |  2 -
>  kernel/kexec.c| 95 +++
>  2 files changed, 42 insertions(+), 55 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 0c994ae37729..f61e310d7a85 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -88,14 +88,12 @@ struct kexec_segment {
>   size_t memsz;
>  };
>  
> -#ifdef CONFIG_COMPAT
>  struct compat_kexec_segment {
>   compat_uptr_t buf;
>   compat_size_t bufsz;
>   compat_ulong_t mem; /* User space sees this as a (void *) ... */
>   compat_size_t memsz;
>  };
> -#endif
>  
>  #ifdef CONFIG_KEXEC_FILE
>  struct purgatory_info {
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c82c6c06f051..6618b1d9f00b 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -19,21 +19,46 @@
>  
>  #include "kexec_internal.h"
>  
> +static int copy_user_compat_segment_list(struct kimage *image,
> +  unsigned long nr_segments,
> +  void __user *segments)
> +{
> + struct compat_kexec_segment __user *cs = segments;
> + struct compat_kexec_segment segment;
> + int i;
> +
> + for (i = 0; i < nr_segments; i++) {
> + if (copy_from_user(&segment, &cs[i], sizeof(segment)))
> + return -EFAULT;
> +
> + image->segment[i] = (struct kexec_segment) {
> + .buf   = compat_ptr(segment.buf),
> + .bufsz = segment.bufsz,
> + .mem   = segment.mem,
> + .memsz = segment.memsz,
> + };
> + }
> +
> + return 0;
> +}
> +
> +
>  static int copy_user_segment_list(struct kimage *image,
> unsigned long nr_segments,
> struct kexec_segment __user *segments)
>  {
> - int ret;
>   size_t segment_bytes;
>  
>   /* Read in the segments */
>   image->nr_segments = nr_segments;
>   segment_bytes = nr_segments * sizeof(*segments);
> - ret = copy_from_user(image->segment, segments, segment_bytes);
> - if (ret)
> - ret = -EFAULT;
> + if (in_compat_syscall())
> + return copy_user_compat_segment_list(image, nr_segments, 
> segments);
>  
> - return ret;
> + if (copy_from_user(image->segment, segments, segment_bytes))
> + return -EFAULT;
> +
> + return 0;
>  }
>  
>  static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> @@ -233,8 +258,9 @@ static inline int kexec_load_check(unsigned long 
> nr_segments,
>   return 0;
>  }
>  
> -SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
> - struct kexec_segment __user *, segments, unsigned long, flags)
> +static int kernel_kexec_load(unsigned long entry, unsigned long nr_segments,
> +  struct kexec_segment __user * segments,
> +  unsigned long flags)
>  {
>   int result;
>  
> @@ -265,57 +291,20 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>   return result;
>  }
>  
> +SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
> + struct kexec_segment __user *, segments, unsigned long, flags)
> +{
> + return kernel_kexec_load(entry, nr_segments, segments, flags);
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE4(kexec_l

Re: [PATCH v3 1/4] kexec: simplify compat_sys_kexec_load

2021-05-17 Thread Eric W. Biederman
Arnd Bergmann  writes:

> From: Arnd Bergmann 
>
> The compat version of sys_kexec_load() uses compat_alloc_user_space to
> convert the user-provided arguments into the native format.
>
> Move the conversion into the regular implementation with
> an in_compat_syscall() check to simplify it and avoid the
> compat_alloc_user_space() call.
>
> compat_sys_kexec_load() now behaves the same as sys_kexec_load().

Is it possible to do this without in_compat_syscall(),
and without casting pointers to the wrong type?

We open ourselves up to bugs whenever we lie to the type system.

Skimming through the code, it looks like it should be possible
to avoid both in_compat_syscall and the casts to the wrong
type by changing the order of the code a little bit.
Eric


> Signed-off-by: Arnd Bergmann 
> ---
>  include/linux/kexec.h |  2 -
>  kernel/kexec.c| 95 +++
>  2 files changed, 42 insertions(+), 55 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 0c994ae37729..f61e310d7a85 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -88,14 +88,12 @@ struct kexec_segment {
>   size_t memsz;
>  };
>  
> -#ifdef CONFIG_COMPAT
>  struct compat_kexec_segment {
>   compat_uptr_t buf;
>   compat_size_t bufsz;
>   compat_ulong_t mem; /* User space sees this as a (void *) ... */
>   compat_size_t memsz;
>  };
> -#endif
>  
>  #ifdef CONFIG_KEXEC_FILE
>  struct purgatory_info {
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c82c6c06f051..6618b1d9f00b 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -19,21 +19,46 @@
>  
>  #include "kexec_internal.h"
>  
> +static int copy_user_compat_segment_list(struct kimage *image,
> +  unsigned long nr_segments,
> +  void __user *segments)
> +{
> + struct compat_kexec_segment __user *cs = segments;
> + struct compat_kexec_segment segment;
> + int i;
> +
> + for (i = 0; i < nr_segments; i++) {
> + if (copy_from_user(&segment, &cs[i], sizeof(segment)))
> + return -EFAULT;
> +
> + image->segment[i] = (struct kexec_segment) {
> + .buf   = compat_ptr(segment.buf),
> + .bufsz = segment.bufsz,
> + .mem   = segment.mem,
> + .memsz = segment.memsz,
> + };
> + }
> +
> + return 0;
> +}
> +
> +
>  static int copy_user_segment_list(struct kimage *image,
> unsigned long nr_segments,
> struct kexec_segment __user *segments)
>  {
> - int ret;
>   size_t segment_bytes;
>  
>   /* Read in the segments */
>   image->nr_segments = nr_segments;
>   segment_bytes = nr_segments * sizeof(*segments);
> - ret = copy_from_user(image->segment, segments, segment_bytes);
> - if (ret)
> - ret = -EFAULT;
> + if (in_compat_syscall())
> + return copy_user_compat_segment_list(image, nr_segments, 
> segments);
>  
> - return ret;
> + if (copy_from_user(image->segment, segments, segment_bytes))
> + return -EFAULT;
> +
> + return 0;
>  }
>  
>  static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> @@ -233,8 +258,9 @@ static inline int kexec_load_check(unsigned long 
> nr_segments,
>   return 0;
>  }
>  
> -SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
> - struct kexec_segment __user *, segments, unsigned long, flags)
> +static int kernel_kexec_load(unsigned long entry, unsigned long nr_segments,
> +  struct kexec_segment __user * segments,
> +  unsigned long flags)
>  {
>   int result;
>  
> @@ -265,57 +291,20 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>   return result;
>  }
>  
> +SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
> + struct kexec_segment __user *, segments, unsigned long, flags)
> +{
> + return kernel_kexec_load(entry, nr_segments, segments, flags);
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
>  compat_ulong_t, nr_segments,
>  struct compat_kexec_segment __user *, segments,
>  compat_ulong_t, flags)
>  {
> - struct compat_kexec_segment in;
> - struct kexec_segment out, __user *ksegments;
> - unsigned long i, result;
> -
> - result = kexec_load_check(nr_segments, flags);
> - if (result)
> - return result;
> -
> - /* Don't allow clients that don't understand the native
> -  * architecture to do anything.
> -  */
> - if ((flags & KEXEC_ARCH_MASK) == KEXEC_ARCH_DEFAULT)
> - return -EINVAL;
> -
> - ksegments = compat_alloc_user_space(nr_segment

Re: [PATCH 2/2] x86/kexec/64: Forbid kexec when running as an SEV-ES guest

2021-05-06 Thread Eric W. Biederman
Joerg Roedel  writes:

> On Thu, May 06, 2021 at 12:42:03PM -0500, Eric W. Biederman wrote:
>> I don't understand this.
>> 
>> Fundamentally kexec is about doing things more or less in spite of
>> what the firmware is doing.
>> 
>> I don't have any idea what SEV-ES is.  But the normal x86 boot doesn't
>> do anything special.  Is cross cpu IPI emulation buggy?
>
> Under SEV-ES the normal SIPI-based sequence to re-initialize a CPU does
> not work anymore. An SEV-ES guest is a special virtual machine where the
> hardware encrypts the guest memory and the guest register state. The
> hypervisor can't make any modifications to the guests registers at
> runtime. Therefore it also can't emulate a SIPI sequence and reset the
> vCPU.
>
> The guest kernel has to reset the vCPU itself and hand it over from the
> old kernel to the kexec'ed kernel. This isn't currently implemented and
> therefore kexec needs to be disabled when running as an SEV-ES guest.
>
> Implementing this also requires an extension to the guest-hypervisor
> protocol (the GHCB Spec[1]) which is only available in version 2. So a
> guest running on a hypervisor supporting only version 1 will never
> properly support kexec.

Why does it need that?

Would it not make sense to instead teach kexec how to pass a cpu from
one kernel to another?  We could use that everywhere.

Even the kexec-on-panic case should work, as even in that case we have
to touch the cpus as they go down.

The hardware simply worked well enough that it hasn't mattered enough
for us to do something like that, but given that we need to do something
anyway, it seems like it would make the most sense to do something that
will work everywhere and does not introduce unnecessary dependencies
on hypervisors.

>> What is the actual problem you are trying to avoid?
>
> Currently, if someone tries kexec in an SEV-ES guest, the kexec'ed
> kernel will only be able to bring up the boot CPU, not the others. The
> others will wake up with the old kernel's CPU state in the new kernel's
> memory and do undefined things, most likely triple-fault because their
> page-table no longer exists.
>
> So since kexec currently does not work as expected under SEV-ES, it is
> better to hide it until everything is implemented so it can do what the
> user expects.

I can understand temporarily disabling the functionality.

>> And yes for a temporary hack the suggestion of putting code into
>> machine_kexec_prepare seems much more reasonable so we don't have to
>> carry special-case infrastructure for the foreseeable future.
>
> As I said above, for protocol version 1 it will stay disabled, so it is
> not only a temporary hack.

Why does bringing up a cpu need hypervisor support?

I understand why we can't do what we do currently, but that doesn't seem
to preclude doing something without hypervisor support.

Eric



Re: [PATCH 2/2] x86/kexec/64: Forbid kexec when running as an SEV-ES guest

2021-05-06 Thread Eric W. Biederman
Joerg Roedel  writes:

> From: Joerg Roedel 
>
> For now, kexec is not supported when running as an SEV-ES guest. Doing
> so requires additional hypervisor support and special code to hand
> over the CPUs to the new kernel in a safe way.
>
> Until this is implemented, do not support kexec in SEV-ES guests.

I don't understand this.

Fundamentally kexec is about doing things more or less in spite of
what the firmware is doing.

I don't have any idea what SEV-ES is.  But the normal x86 boot doesn't
do anything special.  Is cross cpu IPI emulation buggy?

If this is an in-your-face hypervisor, like Xen sometimes is, I can
see perhaps needing a little bit of different work during bootup:
perhaps handing back a cpu on system shutdown and asking for more cpus
on system boot-up.

What is the actual problem you are trying to avoid?

And yes for a temporary hack the suggestion of putting code into
machine_kexec_prepare seems much more reasonable so we don't have to
carry special-case infrastructure for the foreseeable future.

Eric


> Cc: sta...@vger.kernel.org # v5.10+
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kernel/machine_kexec_64.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c 
> b/arch/x86/kernel/machine_kexec_64.c
> index c078b0d3ab0e..f902cc9cc634 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -620,3 +620,11 @@ void arch_kexec_pre_free_pages(void *vaddr, unsigned int 
> pages)
>*/
>   set_memory_encrypted((unsigned long)vaddr, pages);
>  }
> +
> +/*
> + * Kexec is not supported in SEV-ES guests yet
> + */
> +bool arch_kexec_supported(void)
> +{
> + return !sev_es_active();
> +}



[CFT][PATCH] kexec: Remove the error prone kernel_version function

2021-04-09 Thread Eric W. Biederman


During kexec there are two kernel versions at play.  The version of
the running kernel and the version of the kernel that will be booted.

On powerpc it appears people have been using the version of the
running kernel to attempt to detect properties of the kernel to be
booted, which is just wrong.  As the Linux kernel version being
detected is from a kernel that is no longer supported, just remove
that buggy and confused code.

On x86_64 the kernel_version is used to compute the starting virtual
address of the running kernel so a proper core dump may be generated.
Using the kernel_version stopped working a while ago when the starting
virtual address became randomized.

The old code was kept for the case where the kernel was not built with
randomization support, but there is nothing about reading /proc/kcore
that won't work to detect the starting virtual address even there.
In fact /proc/kcore must contain the starting virtual address or a
debugger could not make sense of the running kernel.

So just make computing the starting virtual address on x86_64
unconditional, with a hard-coded fallback just in case something goes
wrong.

Doing something about kernel_version() has become important as recent
stable kernels have seen the minor version exceed 255.  Just removing
kernel_version() looks like the best option.

Signed-off-by: "Eric W. Biederman" 
---

Can folks please test this patch and verify that it works.  I really
think simply removing the problem code is going to be a much more robust
solution than papering over the bug.

 kexec/Makefile |  1 -
 kexec/arch/i386/crashdump-x86.c| 28 +--
 kexec/arch/ppc/crashdump-powerpc.c |  3 +-
 kexec/arch/ppc64/crashdump-ppc64.c |  3 +-
 kexec/kernel_version.c | 57 --
 kexec/kexec.h  |  4 ---
 6 files changed, 11 insertions(+), 85 deletions(-)
 delete mode 100644 kexec/kernel_version.c

diff --git a/kexec/Makefile b/kexec/Makefile
index 8e3e9ea39664..e69e30950ac8 100644
--- a/kexec/Makefile
+++ b/kexec/Makefile
@@ -22,7 +22,6 @@ KEXEC_SRCS_base += kexec/firmware_memmap.c
 KEXEC_SRCS_base += kexec/crashdump.c
 KEXEC_SRCS_base += kexec/crashdump-xen.c
 KEXEC_SRCS_base += kexec/phys_arch.c
-KEXEC_SRCS_base += kexec/kernel_version.c
 KEXEC_SRCS_base += kexec/lzma.c
 KEXEC_SRCS_base += kexec/zlib.c
 KEXEC_SRCS_base += kexec/kexec-xen.c
diff --git a/kexec/arch/i386/crashdump-x86.c b/kexec/arch/i386/crashdump-x86.c
index d5b5b6850fe8..ea3e7c73c621 100644
--- a/kexec/arch/i386/crashdump-x86.c
+++ b/kexec/arch/i386/crashdump-x86.c
@@ -55,16 +55,8 @@ static int get_kernel_page_offset(struct kexec_info *UNUSED(info),
int kv;
 
if (elf_info->machine == EM_X86_64) {
-   kv = kernel_version();
-   if (kv < 0)
-   return -1;
-
-   if (kv < KERNEL_VERSION(2, 6, 27))
-   elf_info->page_offset = X86_64_PAGE_OFFSET_PRE_2_6_27;
-   else if (kv < KERNEL_VERSION(4, 20, 0))
-   elf_info->page_offset = X86_64_PAGE_OFFSET_PRE_4_20_0;
-   else
-   elf_info->page_offset = X86_64_PAGE_OFFSET;
+   /* get_kernel_vaddr_and_size will override this */
+   elf_info->page_offset = X86_64_PAGE_OFFSET;
}
else if (elf_info->machine == EM_386) {
elf_info->page_offset = X86_PAGE_OFFSET;
@@ -151,17 +143,15 @@ static int get_kernel_vaddr_and_size(struct kexec_info *UNUSED(info),
 
/* Search for the real PAGE_OFFSET when KASLR memory randomization
 * is enabled */
-   if (get_kernel_sym("page_offset_base") != 0) {
-   for(phdr = ehdr.e_phdr; phdr != end_phdr; phdr++) {
-   if (phdr->p_type == PT_LOAD) {
-   vaddr = phdr->p_vaddr & pud_mask;
-   if (lowest_vaddr == 0 || lowest_vaddr > vaddr)
-   lowest_vaddr = vaddr;
-   }
+   for(phdr = ehdr.e_phdr; phdr != end_phdr; phdr++) {
+   if (phdr->p_type == PT_LOAD) {
+   vaddr = phdr->p_vaddr & pud_mask;
+   if (lowest_vaddr == 0 || lowest_vaddr > vaddr)
+   lowest_vaddr = vaddr;
}
-   if (lowest_vaddr != 0)
-   elf_info->page_offset = lowest_vaddr;
}
+   if (lowest_vaddr != 0)
+   elf_info->page_offset = lowest_vaddr;
 
/* Traverse through the Elf headers and find the region where
 * _stext symbol is located in. That's where kernel is mapped */
diff --git a/kexec/arch/ppc/crashdump-powerpc.c b/kexec/arch/ppc/crashdump-powerpc.c
index 4ad026f38dd0..15e85313ff75 100644
--- a/kexec/arch/ppc/crashdump-powerpc.c
+++ b/kexec/arch/ppc/crashdump-powerpc.c

Re: kexec does not work for kernel version with patch level >= 256

2021-04-07 Thread Eric W. Biederman
Liu Tao  writes:

> Hello Eric,
>
> Please correct me if I'm wrong. After my research, I found that the
> KERNEL_VERSION
> check cannot be removed.
>
> In x86_64 case, function get_kernel_page_offset set different hard coded
> values into
> elf_info->page_offset according to KERNEL_VERSION, then in function
> get_kernel_vaddr_and_size,
> elf_info->page_offset gets refreshed by reading program segments of
> /proc/kcore.
> The refresh can fail when KASLR is off, thus the hard coded values are
> still needed as pre-set
> default values.

I see that the code is conditional upon KASLR, but I don't see any
particular reason why the code in get_kernel_vaddr_and_size is
conditional upon KASLR.

Skimming through arch/x86/kernel/vmlinux.lds.S and fs/proc/kcore.c I
don't see anything that is ASLR specific.  So everything should work
simply by removing the unnecessary gate on the presence of the
page_offset_base symbol.

I suspect the code will even correctly compute PAGE_OFFSET on all
architectures, but we don't need to go that far to remove our use of the
kernel version.

> In addition, If I set a wrong value in elf_info->page_offset, readelf -l
> vmcore will give the value I set,
> reading symbols in crash-utility is not affected.

Especially if reading the symbols is not affected by a wrong value,
just auto-detecting the value really seems to make the most sense.

> From my point of view, extending the patch number from 8bit to 16bit is the
> solution. Any thoughts?

My thought is that in general the kernel version can not be depended
upon for anything as there exist enterprise kernels that get feature
backports.  So there very easily could be a kernel where the kernel
version does not accurately reflect what is going on.  So unless we can
say with certainty that there is no other way to detect the base address
of the kernel we really don't want to use the kernel version.

Right now it looks like all that is necessary is the removal of
an unnecessary if check.

Eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec does not work for kernel version with patch level >= 256

2021-04-01 Thread Eric W. Biederman
Liu Tao  writes:

> Hello Eric,
>
> I'd like to have a try on this issue.
> Since I'm not very familiar with the related code, maybe I need to spend a
> little more time on it.

Please give it a shot.

Getting rid of explicit kernel version checks would be wonderful and
prevent future problems as well.

Eric



Re: kexec does not work for kernel version with patch level >= 256

2021-03-31 Thread Eric W. Biederman
Greg KH  writes:

> On Wed, Mar 31, 2021 at 04:03:24PM +0800, Baoquan He wrote:
>> Add Sasha and Greg to the CC list.
>> 
>> On 03/31/21 at 11:48am, Baoquan He wrote:
>> > On 03/31/21 at 11:04am, Patrick Sung wrote:
>> > > On Wed, Mar 31, 2021 at 10:47 AM Baoquan He  wrote:
>> > > >
>> > > > On 03/24/21 at 12:28pm, Patrick Sung wrote:
>> > > > > Hello all,
>> > > > >
>> > > > > I am using the 4.9 long term kernel which is currently at 4.9.262.
>> > > > > When using this kernel with kexec-tools it prints out this error
>> > > > >
>> > > > >   Unsupported utsname.release: 4.9.262
>> > > > >   Cannot load 
>> > > > >
>> > > > > A quick search in the code shows that kexec/kernel_version.c doing 
>> > > > > this check:
>> > > > >
>> > > > >   if (major >= 256 || minor >= 256 || patch >= 256) {
>> > > > >
>> > > > > and also in kexec/kexec.h
>> > > > >   #define KERNEL_VERSION(major, minor, patch) \
>> > > > > (((major) << 16) | ((minor) << 8) | patch)
>> > > >
>> > > > Yeah, this seems to be a good catch. The existing longterm kernel 
>> > > > 4.9.262
>> > > > does cause the problem. I am not very sure about the longterm kernel
>> > > > version numbering, maybe we can leave 16 bits for the patch number to
>> > > > avoid the longterm kernel issue?
>> > > >
>> > > > Is there document telling the longterm kernel version numbering, or any
>> > > > pointer?
>> > > >
>> > > Actually I found that the mainline kernel clamp the "patch" version to 
>> > > 255
>> > > 
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/Makefile?id=9b82f13e7ef316cdc0a8858f1349f4defce3f9e0
>> > 
>> > Yeah, mainline kernel uses the below formula to construct the kernel
>> > version.  Seems longterm kernel takes a different way.  While it's
>> > understandable that a longterm kernel uses a larger patch number, since
>> > it will evolve for a longer time to get in bug fixes.  Maybe we should
>> > enlarge the patch number to 16 bits?
>> > 
>> > echo '#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + \
>> > ((c) > 255 ? 255 : (c)))';
>> 
>> Should we also do the same thing in the kexec-tools utility, to
>> clamp the sub-level to 255? And is the sub-level number not so important
>> that we can do the clamping without causing any issue?
>
> If you can enlarge your "c" number to be 16bits, please do so.  We
> couldn't do that in the kernel source as it would break an existing
> user/kernel api :(

The code in kexec certainly can.  The function kernel_version() returns
a long.

However.  I think the code in kexec can do better.

There are only 4 uses of KERNEL_VERSION in the tree and only one of them
is testing for a case that would apply to kernel versions that are being
maintained.

AKA I think the checks for 3.15 and 2.6.27 we can just delete.

The checks for 3.15 are about command line size and are actively wrong,
as checking the running kernel to find the version of the kernel that
is being loaded is not reliable.

Which leaves only one check and I think the information needed is
X86_64_PAGE_OFFSET.  I think that can be obtained directly by
reading /proc/kcore instead of guessing it from the kernel version.

Does anyone want to try that?  I admit it is a bit more work than
changing the macro to just add more bits to the patch level,
but it would be more robust long term.

Eric



Re: [PATCH v14 01/11] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2021-02-26 Thread Eric W. Biederman
chenzhou  writes:

> On 2021/2/25 15:25, Baoquan He wrote:
>> On 02/24/21 at 02:19pm, Catalin Marinas wrote:
>>> On Sat, Jan 30, 2021 at 03:10:15PM +0800, Chen Zhou wrote:
 Move CRASH_ALIGN to header asm/kexec.h for later use. Besides, the
 alignment of crash kernel regions in x86 is 16M(CRASH_ALIGN), but
 function reserve_crashkernel() also used 1M alignment. So just
 replace hard-coded alignment 1M with macro CRASH_ALIGN.
>>> [...]
 @@ -510,7 +507,7 @@ static void __init reserve_crashkernel(void)
} else {
unsigned long long start;
  
 -  start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
 +  start = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 
 crash_base,
  crash_base + crash_size);
if (start != crash_base) {
pr_info("crashkernel reservation failed - memory is in 
 use.\n");
>>> There is a small functional change here for x86. Prior to this patch,
>>> crash_base passed by the user on the command line is allowed to be 1MB
>>> aligned. With this patch, such reservation will fail.
>>>
>>> Is the current behaviour a bug in the current x86 code or it does allow
>>> 1MB-aligned reservations?
>> Hmm, you are right. Here we should keep 1MB alignment as is because
>> users specify the address and size, their intention should be respected.
>> The 1MB alignment for fixed memory region reservation was introduced in
>> below commit, but it doesn't tell what is Eric's request at that time, I
>> guess it meant respecting users' specifying.


> I think we could make the alignment unified.  Why is the alignment
> different between system-reserved and user-specified regions?  Besides,
> there is no documentation about the 1MB alignment.  How about documenting
> the alignment size (16MB) when the user specifies a start address, as
> arm64 does?

Look at what the code is doing: attempting to reserve a crash region
at the location the user specified.  Adding unnecessary alignment
constraints is totally broken.

I am not even certain enforcing a 1MB alignment makes sense.  I suspect
it was added so that we don't accidentally reserve low memory on x86.
Frankly I am not even certain that makes sense.

Now in practice there might be an argument for 2MB alignment that goes
with huge page sizes on x86.  But until someone finds that there are
actual problems with 1MB alignment I would not touch it.

The proper response to something that isn't documented and confusing is
not to arbitrarily change it and risk breaking users.  Especially in
this case where it is clear that adding additional alignment is total
nonsense.  The proper response to something that isn't clear and
documented is to dig in and document it, or to leave it alone and let it
be the next persons problem.

In this case there is no reason for changing this bit of code.
All CRASH_ALIGN is about is a default alignment when none is specified.
It is not a functional requirement but just something so that things
come out nicely.


Eric



Re: [PATCH v2] kexec: move machine_kexec_post_load() to public interface

2021-02-22 Thread Eric W. Biederman
Will Deacon  writes:

> On Fri, 19 Feb 2021 14:51:42 -0500, Pavel Tatashin wrote:
>> machine_kexec_post_load() is called after kexec load is finished. It must
>> declared in public header not in kexec_internal.h
>> 
>> Fixes the following compiler warning:
>> 
>> arch/arm64/kernel/machine_kexec.c:62:5: warning: no previous prototype for
>> function 'machine_kexec_post_load' [-Wmissing-prototypes]
>>int machine_kexec_post_load(struct kimage *kimage)
>
> Applied to arm64 (for-next/fixes), thanks!
>
> [1/1] kexec: move machine_kexec_post_load() to public interface
>   https://git.kernel.org/arm64/c/2596b6ae412b

As you have already applied this it seems a bit late,
but I will mention that the change could legitimately have the following
tag.

Fixes: de68e4daea90 ("kexec: add machine_kexec_post_load()")

So far arm64 is the only implementation of that hook.

Eric



Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

2021-02-04 Thread Eric W. Biederman
Pavel Tatashin  writes:

>> > I understand that having an extra set of page tables could potentially
>> > waste memory, especially if VAs are sparse, but in this case we use
>> > page tables exclusively for contiguous VA space (copy [src, src +
>> > size]). Therefore, the extra memory usage is tiny. The ratio for
>> > kernels with  4K page_size is (size of relocated memory) / 512.  A
>> > normal initrd + kernel is usually under 64M, an extra space which
>> > means ~128K for the page table. Even with a huge relocation, where
>> > initrd is ~512M the extra memory usage in the worst case is just ~1M.
>> > I really doubt we will have any problem from users because of such
>> > small overhead in comparison to the total kexec-load size.
>
> Hi Eric,
>
>>
>> Foolish question.
>
> Thank you for your e-mail, you gave some interesting insights.
>
>>
>> Does arm64 have something like 2M pages that it can use for the
>> linear map?
>
> Yes, with 4K pages arm64 as well has 2M pages, but arm64 also has a
> choice of 16K and 64K and second level pages are bigger there.

>> On x86_64 we always generate page tables, because they are necessary to
>> be in 64bit mode.  As I recall on x86_64 we always use 2M pages which
>> means for each 4K of page tables we map 1GiB of memory.   Which is very
>> tiny.
>>
>> If you do as well as x86_64 for arm64 I suspect that will be good enough
>> for people to not claim regression.
>>
>> Would a variation on the x86_64 implementation that allocates page
>> tables work for arm64?
> ...
>>
>> As long as the page table provided is a linear mapping of physical
>> memory (aka it looks like paging is disabled).  The the code that
>> relocates memory should be pretty much the same.
>>
>> My experience with other architectures suggests only a couple of
>> instructions need to be different to deal with a MMU being enabled.
>
> I think what you are proposing is similar to what James proposed. Yes,
> for a linear map relocation should be pretty much the same as we do
> relocation as with MMU disabled.
>
> Linear map still uses memory, because page tables must be outside of
> destination addresses of segments of the next kernel. Therefore, we
> must allocate a page table for the linear map. It might be a little
> smaller, but in reality the difference is small with 4K pages, and
> insignificant with 64K pages. The benefit of my approach is that the
> assembly copy loop is simpler, and allows hardware prefetching to
> work.
>
> The regular relocation loop works like this:
>
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
> addr = __va(entry & PAGE_MASK);
>
> switch (entry & IND_FLAGS) {
> case IND_DESTINATION:
> dest = addr;
> break;
> case IND_INDIRECTION:
> ptr = addr;
> break;
> case IND_SOURCE:
> copy_page(dest, addr);
> dest += PAGE_SIZE;
> }
> }
>
> The entry for the next relocation page has to be always fetched, and
> therefore prefetching cannot help with the actual loop.

True.

In the common case the loop looks like:
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
> addr = __va(entry & PAGE_MASK);
>
> switch (entry & IND_FLAGS) {
> case IND_SOURCE:
> copy_page(dest, addr);
> dest += PAGE_SIZE;
> }
> }

Which is a read of the source address followed by the copy_page.
I suspect the overhead of that loop is small enough that it swamped by
the cost of the copy_page.

If not and a better data structure can be proposed we can look at that.

> In comparison, the loop that I am proposing is like this:
>
> for (addr = head; addr < end; addr += PAGE_SIZE, dst += PAGE_SIZE)
> copy_page(dest, addr);
>
> Here is assembly code for my loop:
>
> 1: copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
> sub x11, x11, #PAGE_SIZE
> cbnz x11, 1b

I think you may be hiding the cost of that loop in the page table
fetches themselves.

It is possible though unlikely that a page table with huge pages
(and thus smaller page fault costs) and the original loop is actually
cheaper.

> That said, if James and you agree that linear map is the way to go
> forward, I am OK with that as well, as it is still much better than
> having no caching at all.

The big advantage of a linear map is that the kexec'd code can continue
to use it until it sets up its own page tables.

I probably did not document it well enough, but a linear map, the
equivalent of not having virtual addresses at all, was always my
intention for the hand-off state of kexec between kernels.

So please try the linear map.  If it is noticeably slower than your
optimized page table, give numbers and we can see if there is a way to
improve the generic kexec data structures.

Eric



Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

2021-02-03 Thread Eric W. Biederman
Pavel Tatashin  writes:

> Hi James,
>
>> The problem I see with this is rewriting the relocation code. It needs to 
>> work whether the
>> machine has enough memory to enable the MMU during kexec, or not.
>>
>> In off-list mail to Pavel I proposed an alternative implementation here:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0
>>
>> By using a copy of the linear map, and passing the phys_to_virt offset into
>> arm64_relocate_new_kernel() its possible to use the same code when we fail 
>> to allocate the
>> page tables, and run with the MMU off as it does today.
>> I'm convinced someone will crawl out of the woodwork screaming 'regression' 
>> if we
>> substantially increase the amount of memory needed to kexec at all.
>>
>> From that discussion: this didn't meet Pavel's timing needs.
>> If you depend on having all the src/dst pages lined up in a single line, it 
>> sounds like
>> you've over-tuned this to depend on the CPU's streaming mode. What causes 
>> the CPU to
>> start/stop that stuff is very implementation specific (and firmware 
>> configurable).
>> I don't think we should let this rule out systems that can kexec today, but 
>> don't have
>> enough extra memory for the page tables.
>> Having two copies of the relocation code is obviously a bad idea.
>
> I understand that having an extra set of page tables could potentially
> waste memory, especially if VAs are sparse, but in this case we use
> page tables exclusively for contiguous VA space (copy [src, src +
> size]). Therefore, the extra memory usage is tiny. The ratio for
> kernels with  4K page_size is (size of relocated memory) / 512.  A
> normal initrd + kernel is usually under 64M, an extra space which
> means ~128K for the page table. Even with a huge relocation, where
> initrd is ~512M the extra memory usage in the worst case is just ~1M.
> I really doubt we will have any problem from users because of such
> small overhead in comparison to the total kexec-load size.

Foolish question.

Does arm64 have something like 2M pages that it can use for the
linear map?

On x86_64 we always generate page tables, because they are necessary to
run in 64-bit mode.  As I recall on x86_64 we always use 2M pages, which
means for each 4K of page tables we map 1GiB of memory, which is very
tiny.

If you do as well as x86_64 for arm64 I suspect that will be good enough
for people to not claim regression.

Would a variation on the x86_64 implementation that allocates page
tables work for arm64?

>> (as before: ) Instead of trying to make the relocations run quickly, can we 
>> reduce them?
>> This would benefit other architectures too.
>
> This was exactly my first approach [1] where I tried to pre-reserve
> memory similar to how it is done for a crash kernel, but I was asked
> to go away [2] as this is an ARM64 specific problem, where current
> relocation performance is prohibitively slow. I have tested on x86,
> and it does not suffer from this problem, relocation performance is
> just as fast as with MMU enabled ARM64.
>
>>
>> Can the kexec core code allocate higher order pages, instead of doing 
>> everything page at
>> at time?
>
> Yes, however, failures during kexec-load due to failure to coalesce
> huge pages can add extra hassle to users, and therefore this should be
> only an optimization with fallback to base pages.
>
>>
>> If you have a crash kernel reservation, can we use that to eliminate the 
>> relocations
>> completely?
>> (I think this suggestion has been lost in translation each time I make it.
>> I mean like this:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
>> Runes to test it:
>> | sudo ./kexec -p -u
>> | sudo cat /proc/iomem | grep Crash
>> |  b020-f01f : Crash kernel
>> | sudo ./kexec --mem-min=0xb020 --mem-max=0xf01ff -l ~/Image 
>> --reuse-cmdline
>>
>> I bet its even faster!)
>
> There is a problem with this approach. While, with kexec_load() call
> it is possible to specify physical destinations for each segment, with
> kexec_file_load() it is not possible. The secure systems that do IMA
> checks during kexec load require kexec_file_load(), and we cannot
> ahead of time specify destinations for these segments (at least
> without substantially changing common kexec code, which is not going to
> happen for this arm64-specific problem).


>> I think 'as fast as possible' and 'memory constrained' are mutually exclusive
>> requirements. We need to make the page tables optional with a single 
>> implementation.

In my experience the slowdown from disabling a CPU's cache (which
apparently happens on arm64 when the MMU is disabled) is freakishly
huge.

Enabling the cache shouldn't be 'as fast as possible' but simply
disengaging the parking brake.

> In my opinion having two different types of relocations will only add
> extra corner cases, confusion about different performance, and bugs.
> It is better to have two types: 1. crash kernel type without
> relocation, 2. fast reloc

Re: amdgpu problem after kexec

2021-02-03 Thread Eric W. Biederman
Alex Deucher  writes:

> On Wed, Feb 3, 2021 at 3:36 AM Dave Young  wrote:
>>
>> Hi Baoquan,
>>
>> Thanks for ccing.
>> On 01/28/21 at 01:29pm, Baoquan He wrote:
>> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
>> > > Hello,
>> > >
>> > > I was trying out kexec on my new laptop, which is a HP EliteBook 735
>> > > G6. The problem is, amdgpu does not have hardware acceleration after
>> > > kexec. Also, strangely, the lines about BlueTooth are missing from
>> > > dmesg after kexec, but I have not tried to use BlueTooth on this
>> > > laptop yet. I don't know how to debug this, the relevant amdgpu lines
>> > > in dmesg are:
>> > >
>> > > amdgpu :04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB
>> > > test failed on gfx (-110).
>> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
>> > >
>> > > The good and bad dmesg files are attached. Is it a kexec problem (and
>> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
>> > > need to provide some extra kernel arguments for debugging?

The best debugging I can think of is: can you arrange to have the amdgpu
modules removed before the final kexec -e?

That would tell us if the code to shut down the gpu exists in the rmmod
path aka the .remove method and is simply missing in the kexec path aka
the .shutdown method.


>> > I am not familiar with graphical component. Add Dave to CC to see if
>> > he has some comments. It would be great if amdgpu expert can have a look.
>>
>> It needs amdgpu driver people to help.  Since kexec bypass
>> bios/UEFI initialization so we requires drivers to implement .shutdown
>> method and test it to make 2nd kernel to work correctly.
>
> kexec is tricky to make work properly on our GPUs.  The problem is
> that there are some engines on the GPU that cannot be re-initialized
> once they have been initialized without an intervening device reset.
> APUs are even trickier because they share a lot of hardware state with
> the CPU.  Doing lots of extra resets adds latency.  The driver has
> code to try and detect if certain engines are running at driver load
> time and do a reset before initialization to make this work, but it
> apparently is not working properly on your system.

There are two cases that I think sometimes get mixed up.

There is kexec-on-panic in which case all of the work needs to happen in
the driver initialization.

There is also a simple kexec in which case some of the work can happen
in the kernel that is being shutdown and sometimes that is easer.

Does it make sense to reset your device unconditionally on driver removal?
Would it make sense to reset your device unconditionally on driver add?

How can someone debug the smart logic of reset on driver load?

Eric



Re: [PATCH v2 1/1] kexec: dump kmessage before machine_kexec

2021-01-28 Thread Eric W. Biederman
Pavel Tatashin  writes:

> kmsg_dump(KMSG_DUMP_SHUTDOWN) is called before
> machine_restart(), machine_halt(), machine_power_off(), the only one that
> is missing is  machine_kexec().
>
> The dmesg output that it contains can be used to study the shutdown
> performance of both kernel and systemd during kexec reboot.
>
> Here is example of dmesg data collected after kexec:

As long as we keep kmsg_dump out of the crash_kexec path, where
it completely breaks kexec on panic, this seems a reasonable thing to do.
On the ordinary kernel_kexec path everything is expected to be working.

Is kmsg_dump expected to work after all of the device drivers
are shut down?  Otherwise this placement of kmsg_dump is too late.

Eric

> root@dplat-cp22:~# cat /sys/fs/pstore/dmesg-ramoops-0 | tail
> ...
> <6>[   70.914592] psci: CPU3 killed (polled 0 ms)
> <5>[   70.915705] CPU4: shutdown
> <6>[   70.916643] psci: CPU4 killed (polled 4 ms)
> <5>[   70.917715] CPU5: shutdown
> <6>[   70.918725] psci: CPU5 killed (polled 0 ms)
> <5>[   70.919704] CPU6: shutdown
> <6>[   70.920726] psci: CPU6 killed (polled 4 ms)
> <5>[   70.921642] CPU7: shutdown
> <6>[   70.922650] psci: CPU7 killed (polled 0 ms)
>
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Kees Cook 
> Reviewed-by: Petr Mladek 
> Reviewed-by: Bhupesh Sharma 
> ---
>  kernel/kexec_core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 4f8efc278aa7..e253c8b59145 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1180,6 +1181,7 @@ int kernel_kexec(void)
>   machine_shutdown();
>   }
>  
> + kmsg_dump(KMSG_DUMP_SHUTDOWN);
>   machine_kexec(kexec_image);
>  
>  #ifdef CONFIG_KEXEC_JUMP



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-17 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
>> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>>> But that does not solve the problem either simply because then the IOMMU
>>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>>> error interrupt.
>>
>> Not if you can tell the IOMMU to stop reporting those errors.
>>
>> We can easily do it per-device for DMA errors; not quite sure what
>> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
>> you set the Fault Processing Disable bit per IRTE entry, and you still
>> get faults for Compatibility Format interrupts? Not sure about AMD...
>
> It looks like the fault (DMAR) and event (AMD) interrupts can be
> disabled in the IOMMU. That might help to bridge the gap until the PCI
> bus is scanned in full glory and the devices can be shut up for real.
>
> If we make this conditional for a crash dump kernel that might do the
> trick.
>
> Lot's of _might_ there :)

Worth testing.

Folks tracking this down: is this enough of a hint for you to write a
patch and test it?

Also worth checking how close irqpoll is to handling a case like this.
At least historically it did a pretty good job at shutting down problem
interrupts.

I really find it weird that an edge triggered irq was firing fast enough
to stop a system from booting.  Level triggered irqs do that if they are
acknowledged without actually being handled.  I think edge triggered
irqs only fire when another event comes in, and it seems odd to get so
many actual events causing interrupts that the system soft locks
up.  Is my memory of that situation confused?

I recommend making these facilities general debug facilities so that
they can be used for cases other than crash dump.  The crash dump kernel
would always enable them because it can assume that the hardware is
very likely in a wonky state.

Eric




Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Eric W. Biederman
Bjorn Helgaas  writes:

> I don't think passing the device information to the kdump kernel is
> really practical.  The kdump kernel would use it to do PCI config
> writes to disable MSIs before enabling IRQs, and it doesn't know how
> to access config space that early.

I don't think it is particularly practical either.  But in practice
on x86 it is either mmio writes or 0xcf8 style writes and we could
pass a magic table that would have all of that information.

> We could invent special "early config access" things, but that gets
> really complicated really fast.  Config access depends on ACPI MCFG
> tables, firmware interfaces, and in many cases, on the native host
> bridge drivers in drivers/pci/controllers/.

I do agree that the practical problem with passing information early
is that gets us into the weeds and creates code that we only care
about in the case of kexec-on-panic.  It is much better to make the
existing code more robust, so that we reduce our dependency on firmware
doing the right thing.

> I think we need to disable MSIs in the crashing kernel before the
> kexec.  It adds a little more code in the crash_kexec() path, but it
> seems like a worthwhile tradeoff.

Disabling MSIs in the b0rken kernel is not possible.

Walking the device tree or even a significant subset of it hugely
increases the chances that we will run into something that is incorrect
in the known broken kernel.  I expect the code to do that would double
or triple the amount of code that must be executed in the known broken
kernel.  The last time something like that happened (switching from xchg
to ordinary locks) we had cases that stopped working.  Walking all of
the pci devices in the system is much more invasive.

That is not to downplay the problems of figuring out how to disable
things in early boot.

My two top candidates are poking the IOMMUs early to shut things off,
and figuring out if we can delay enabling interrupts until we have
initialized pci.

Poking at IOMMUs early should work for most systems with ``enterprise''
hardware.  Systems where people care about kdump the most.

Eric



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-16 Thread Eric W. Biederman
"Guilherme G. Piccoli"  writes:

> First of all, thanks everybody for the great insights/discussion! This
> thread ended up being a great learning experience for (at least) me.
>
> Given the big amount of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
>
>
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
>
> Ah. I wasn't aware of that limitation, I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, it's worth clarifying that this is not a
> problem affecting all arches _in practice_ - PowerPC for example has the
> FW primitives allowing a powerful PCI controller (out-of-band) reset,
> usually preventing this kind of issue.
>
> [0]
> https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com
>
>
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
>
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that is
> because the crash kernel doesn't know which devices exist and which
> have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crash kernel as setup_data or whatever.
>
> Guilherme's patches add a "clearmsi" command line parameter.  I guess
> the idea is that the parameter is always passed to the crash kernel but
> the patches don't do that, which seems odd."
>
> Very good points Lukas, thanks! The reason for not adding the clearmsi
> thing as a default kdump procedure is kinda related to your first point:
> it impacts boot time a bit, and it's additional logic in the kdump
> kernel, which we know is (sometimes) the last resort in gathering
> additional data to debug a potentially complex issue. That said, I'd not
> like to enforce this kind of "policy" on everybody, so my implementation
> relies on having it as a parameter, and the kdump userspace counterpart
> could then choose whether to add such a mechanism to the kdump
> kernel parameter list.
>
> About passing the data to next kernel, this is very interesting, we
> could do something like that either through setup_data (as you said) or
> using a new proposal, the PKRAM thing [a].
> This way we wouldn't need a crash_device_shutdown(), but instead when
> kernel is loading a crash kernel (through kexec -p) we can collect all
> the necessary information that'll be passed to the crash kernel
> (obviously, if we are collecting PCI topology information, we need
> callbacks in the PCI hotplug add/del path to update this information).
>
> [a]
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/
>
> Below, inline reply to Eric's last message.
>
>
> On 15/11/2020 17:46, Eric W. Biederman wrote:
>>> [...]
>>> An MSI interrupt is a (DMA) write to the local APIC base address
>>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>>> lower bits. The written data is the vector and more control bits.
>>>
>>> The only way to stop that is to shut it up at the PCI device level.
>> 
>> Or to write to magic chipset registers that will stop transforming DMA
>> writes to 0xfeexxxxx into x86 interrupts.  With an IOMMU I know x86 has
>> such registers (because the point of the IOMMU is to limit the problems
>> rogue devices can cause).  Without an IOMMU I don't know if x86 has any
>> such registers.  I remember that other platforms have an interrupt
>> controller that does provide some level of control.  That x86 does not
>> is what makes this an x86 specific problem.
>> [...]
>> Looking at patch 3/3 what this patchset does is an early disable
>> of the msi registers.  Which is mostly reasonable.  Especially as has
>> been pointed out the only location the x86 vector and x86 cpu can
>> be found is in the msi configuration registers.
>> 
>> That also seems reasonable.  But Bjorn's concern about not finding all
>> devices in all domains does seem real.

Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Eric W. Biederman
Thomas Gleixner  writes:

> On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
>> ebied...@xmission.com (Eric W. Biederman) writes:
>> For ordinary irqs you can have this with level triggered irqs
>> and the kernel has code that will shutdown the irq at the ioapic
>> level.  Then the kernel continues by polling the irq source.
>>
>> I am still missing details but my first question is can our general
>> solution to screaming level triggered irqs apply?
>
> No.
>
>> How can edge triggered MSI irqs be screaming?
>>
>> Is there something we can do in enabling the APICs or IOAPICs that
>> would allow this to be handled better?  My memory is that when we enable
>> the APICs and IOAPICs we completely clear the APIC entries and so
>> should be disabling sources.
>
> Yes, but MSI has nothing to do with APIC/IOAPIC

Yes, sorry.  It has been long enough that the details were paged out
of my memory.

>> Is the problem perhaps that we wind up using an APIC entry that was
>> previously used for the MSI interrupt as something else when we
>> reprogram them?  Even with this why doesn't the generic code
>> to stop screaming irqs apply here?
>
> Again. No. The problem cannot be solved at the APIC level. The APIC is
> the receiving end of MSI and has absolutely no control over it.
>
> An MSI interrupt is a (DMA) write to the local APIC base address
> 0xfeexxxxx which has the target CPU and control bits encoded in the
> lower bits. The written data is the vector and more control bits.
>
> The only way to stop that is to shut it up at the PCI device level.

Or to write to magic chipset registers that will stop transforming DMA
writes to 0xfeexxxxx into x86 interrupts.  With an IOMMU I know x86 has
such registers (because the point of the IOMMU is to limit the problems
rogue devices can cause).  Without an IOMMU I don't know if x86 has any
such registers.  I remember that other platforms have an interrupt
controller that does provide some level of control.  That x86 does not
is what makes this an x86 specific problem.
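To make the message format above concrete, here is an illustrative sketch of how an MSI address/data pair is composed (field layout as described above and in the Intel SDM: 0xFEE in address bits 31:20, destination APIC ID in bits 19:12, vector in the low data byte; the helper names are invented for this example):

```c
#include <stdint.h>

/* Sketch of the MSI message a device DMA-writes to the 0xfeexxxxx window.
 * Fixed delivery mode, edge trigger, physical destination: all other
 * bits stay zero in this simplified example. */
#define MSI_ADDR_BASE 0xfee00000u

static inline uint32_t msi_address(uint8_t dest_apic_id)
{
    return MSI_ADDR_BASE | ((uint32_t)dest_apic_id << 12);
}

static inline uint32_t msi_data(uint8_t vector)
{
    return vector;
}
```

This is also why the only place the x86 vector and target CPU can be found is in the device-side MSI registers: the APIC just sees an inbound write.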

The generic solution is to have the PCI code set bus master disables
when it is enumerating and initializing devices.  Last time I was
paying attention that was actually the policy of the pci layer and
drivers that did not enable bus mastering were considered buggy.

Looking at patch 3/3 what this patchset does is an early disable
of the msi registers.  Which is mostly reasonable.  Especially as has
been pointed out the only location the x86 vector and x86 cpu can
be found is in the msi configuration registers.

That also seems reasonable.  But Bjorn's concern about not finding all
devices in all domains does seem real.
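What such an early disable amounts to can be sketched as follows. To keep the example self-contained it operates on a 256-byte copy of a device's config space instead of issuing real 0xCF8/0xCFC or ECAM accesses; the offsets are the standard PCI ones, but the function itself is made up for illustration:

```c
#include <stdint.h>

/*
 * Sketch of an early-quirk-style shutdown: clear Bus Master Enable and
 * the MSI enable bit.  Operates on a local copy of config space so it
 * can be shown standalone; a real quirk would do the same reads and
 * writes through the config access mechanism.
 */
#define PCI_COMMAND          0x04
#define PCI_COMMAND_MASTER   0x04
#define PCI_STATUS           0x06
#define PCI_STATUS_CAP_LIST  0x10
#define PCI_CAPABILITY_LIST  0x34
#define PCI_CAP_ID_MSI       0x05
#define PCI_MSI_FLAGS        2       /* message control, relative to cap */
#define PCI_MSI_FLAGS_ENABLE 0x01

static void disable_busmaster_and_msi(uint8_t *cfg)
{
    /* Stop the device issuing DMA (and hence MSI) writes. */
    cfg[PCI_COMMAND] &= (uint8_t)~PCI_COMMAND_MASTER;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return;

    /* Walk the capability list looking for the MSI capability. */
    uint8_t pos = cfg[PCI_CAPABILITY_LIST] & ~3u;
    for (int i = 0; pos && i < 48; i++) {
        if (cfg[pos] == PCI_CAP_ID_MSI) {
            cfg[pos + PCI_MSI_FLAGS] &= (uint8_t)~PCI_MSI_FLAGS_ENABLE;
            return;
        }
        pos = cfg[pos + 1] & ~3u;    /* next-capability pointer */
    }
}
```

After the call both bus mastering and MSI generation stop at the device, assuming the device actually honours those bits.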

There are a handful of devices where the Bus master disable bit doesn't
disable bus mastering.  I wonder if there are devices where MSI and MSIX
disables don't fully work.  It seems completely possible to have MSI or
MSIX equivalent registers at a non-standard location as drivers must be
loaded to handle them.

So if we can safely and reliably disable DMA and MSI at the generic PCI
device level during boot up I am all for it.


How difficult would it be to tell the IOMMUs to stop passing traffic
through in an early pci quirk?  The problem setup was apparently someone
using the device directly from a VM.  So I presume there is an IOMMU
in that configuration.

> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.

So the question is how do we make this robust?

Can we perhaps disable all interrupts in this case and limp along
in polling mode until the pci bus has been enumerated?

It is nice and desirable to be able to use the hardware in high
performance mode in a kexec-on-panic situation but if we can detect a
problem and figure out how to limp along sometimes that is acceptable.

The failure mode in the kexec-on-panic kernel is definitely the correct
one.  We could not figure out how to wrestle the hardware into usability
so we fail to take a crash dump or do anything else that might corrupt
the system.

Eric



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Bjorn Helgaas  writes:
>
>> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>>
>> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>>> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>>> >> >> device causing the storm was in PCI_UNKNOWN state?
>>> >> >
>>> >> > That's indeed a really good question.
>>> >> 
>>> >> So we do that on kexec, but is that true when starting a kdump kernel
>>> >> from a kernel crash? I doubt it.
>>> >
>>> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>>> >
>>> >   crash_kexec
>>> > machine_kexec
>>> >
>>> > while the usual kexec path is:
>>> >
>>> >   kernel_kexec
>>> > kernel_restart_prepare
>>> >   device_shutdown
>>> > while (!list_empty(&devices_kset->list))
>>> >   dev->bus->shutdown
>>> > pci_device_shutdown# pci_bus_type.shutdown
>>> > machine_kexec
>>> >
>>> > So maybe we need to explore doing some or all of device_shutdown() in
>>> > the crash_kexec() path as well as in the kernel_kexec() path.
>>> 
>>> The problem is that if the machine crashed anything you try to attempt
>>> before starting the crash kernel is reducing the chance that the crash
>>> kernel actually starts.
>>
>> Right.
>>
>>> Is there something at the root bridge level which allows to tell the
>>> underlying busses to shut up, reset or go into a defined state? That
>>> might avoid chasing lists which might be already unreliable.
>>
>> Maybe we need some kind of crash_device_shutdown() that does the
>> minimal thing to protect the kdump kernel from devices.
>
> The kdump kernel does not use any memory the original kernel uses.
> Which should be a minimal and fairly robust level of protection
> until the device drivers can be loaded and get ahold of things.
>
>> The programming model for conventional PCI host bridges and PCIe Root
>> Complexes is device-specific since they're outside the PCI domain.
>> There probably *are* ways to do those things, but you would need a
>> native host bridge driver or something like an ACPI method.  I'm not
>> aware of an ACPI way to do this, but I added Rafael in case he is.
>>
>> A crash_device_shutdown() could do something at the host bridge level
>> if that's possible, or reset/disable bus mastering/disable MSI/etc on
>> individual PCI devices if necessary.
>
> Unless I am confused DMA'ing to memory that is not already in use
> is completely broken whether or not you are using the kdump kernel.

Bah.  I was confused because I had not read up-thread.

MSI mixes DMA and irqs so confusion is easy.

So the problem is screaming irqs when the kernel is booting up.
This is a fundamentally tricky problem.

For ordinary irqs you can have this with level triggered irqs
and the kernel has code that will shutdown the irq at the ioapic
level.  Then the kernel continues by polling the irq source.
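That generic defense can be sketched as follows (the 99900-in-100000 threshold mirrors the kernel's note_interrupt() heuristic in kernel/irq/spurious.c, but all names and numbers here are illustrative):

```c
/*
 * Sketch of the screaming-irq defense: if an irq line keeps firing with
 * no handler claiming it, mask it and fall back to polling the source.
 */
#define SPURIOUS_LIMIT 99900
#define SAMPLE_WINDOW  100000

struct irq_stats { unsigned count, unhandled, masked; };

/* Returns 1 if the line should now be masked and polled instead. */
static int note_unhandled_irq(struct irq_stats *st, int handled)
{
    st->count++;
    if (!handled)
        st->unhandled++;
    if (st->count >= SAMPLE_WINDOW) {
        if (st->unhandled > SPURIOUS_LIMIT)
            st->masked = 1;
        st->count = st->unhandled = 0;
    }
    return st->masked;
}
```

The catch for MSI, as discussed below, is that there is no per-vector mask at the APIC to apply once the storm is detected.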

I am still missing details but my first question is can our general
solution to screaming level triggered irqs apply?

How can edge triggered MSI irqs be screaming?

Is there something we can do in enabling the APICs or IOAPICs that
would allow this to be handled better?  My memory is that when we enable
the APICs and IOAPICs we completely clear the APIC entries and so
should be disabling sources.

Is the problem perhaps that we wind up using an APIC entry that was
previously used for the MSI interrupt as something else when we
reprogram them?  Even with this why doesn't the generic code
to stop screaming irqs apply here?

Eric



Re: [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

2020-11-15 Thread Eric W. Biederman
Bjorn Helgaas  writes:

> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>
> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> >> >> device causing the storm was in PCI_UNKNOWN state?
>> >> >
>> >> > That's indeed a really good question.
>> >> 
>> >> So we do that on kexec, but is that true when starting a kdump kernel
>> >> from a kernel crash? I doubt it.
>> >
>> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>> >
>> >   crash_kexec
>> > machine_kexec
>> >
>> > while the usual kexec path is:
>> >
>> >   kernel_kexec
>> > kernel_restart_prepare
>> >   device_shutdown
>> > while (!list_empty(&devices_kset->list))
>> >   dev->bus->shutdown
>> > pci_device_shutdown# pci_bus_type.shutdown
>> > machine_kexec
>> >
>> > So maybe we need to explore doing some or all of device_shutdown() in
>> > the crash_kexec() path as well as in the kernel_kexec() path.
>> 
>> The problem is that if the machine crashed anything you try to attempt
>> before starting the crash kernel is reducing the chance that the crash
>> kernel actually starts.
>
> Right.
>
>> Is there something at the root bridge level which allows to tell the
>> underlying busses to shut up, reset or go into a defined state? That
>> might avoid chasing lists which might be already unreliable.
>
> Maybe we need some kind of crash_device_shutdown() that does the
> minimal thing to protect the kdump kernel from devices.

The kdump kernel does not use any memory the original kernel uses.
Which should be a minimal and fairly robust level of protection
until the device drivers can be loaded and get ahold of things.

> The programming model for conventional PCI host bridges and PCIe Root
> Complexes is device-specific since they're outside the PCI domain.
> There probably *are* ways to do those things, but you would need a
> native host bridge driver or something like an ACPI method.  I'm not
> aware of an ACPI way to do this, but I added Rafael in case he is.
>
> A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary.

Unless I am confused DMA'ing to memory that is not already in use
is completely broken whether or not you are using the kdump kernel.

Eric



Re: [PATCH RFC PKS/PMEM 51/58] kernel: Utilize new kmap_thread()

2020-10-09 Thread Eric W. Biederman
ira.we...@intel.com writes:

> From: Ira Weiny 
>
> This kmap() call is localized to a single thread.  To avoid the
> overhead of global PKRS updates use the new kmap_thread() call.

Acked-by: "Eric W. Biederman" 

>
> Cc: Eric Biederman 
> Signed-off-by: Ira Weiny 
> ---
>  kernel/kexec_core.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..272a9920c0d6 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -815,7 +815,7 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   if (result < 0)
>   goto out;
>  
> - ptr = kmap(page);
> + ptr = kmap_thread(page);
>   /* Start with a clear page */
>   clear_page(ptr);
>   ptr += maddr & ~PAGE_MASK;
> @@ -828,7 +828,7 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   memcpy(ptr, kbuf, uchunk);
>   else
>   result = copy_from_user(ptr, buf, uchunk);
> - kunmap(page);
> + kunmap_thread(page);
>   if (result) {
>   result = -EFAULT;
>   goto out;
> @@ -879,7 +879,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>   goto out;
>   }
>   arch_kexec_post_alloc_pages(page_address(page), 1, 0);
> - ptr = kmap(page);
> + ptr = kmap_thread(page);
>   ptr += maddr & ~PAGE_MASK;
>   mchunk = min_t(size_t, mbytes,
>   PAGE_SIZE - (maddr & ~PAGE_MASK));
> @@ -895,7 +895,7 @@ static int kimage_load_crash_segment(struct kimage *image,
>   else
>   result = copy_from_user(ptr, buf, uchunk);
>   kexec_flush_icache_page(page);
> - kunmap(page);
> + kunmap_thread(page);
>   arch_kexec_pre_free_pages(page_address(page), 1);
>   if (result) {
>   result = -EFAULT;



Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-24 Thread Eric W. Biederman
Michael Kelley  writes:

> From: Konrad Rzeszutek Wilk  Sent: Wednesday, 
> September 23, 2020 8:48 AM
>> 
>> On Wed, Sep 23, 2020 at 10:43:29AM +0800, Dave Young wrote:
>> > + more people who may care about this param
>> 
>> Paarty time!!
>> 
>> (See below, didn't snip any comments)
>> > On 09/21/20 at 08:45pm, Eric W. Biederman wrote:
>> > > Konrad Rzeszutek Wilk  writes:
>> > >
>> > > > On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
>> > > >> On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  
>> > > >> wrote:
>> > > >>
>> > > >> > crash_kexec_post_notifiers enables running various panic notifier
>> > > >> > before kdump kernel booting. This increases risks of kdump failure.
>> > > >> > It is well documented in kernel-parameters.txt. We do not suggest
>> > > >> > people to enable it together with kdump unless he/she is really 
>> > > >> > sure.
>> > > >> > This is also not suggested to be enabled by default when users are
>> > > >> > not aware in distributions.
>> > > >> >
>> > > >> > But unfortunately it is enabled by default in systemd, see below
>> > > >> > discussions in a systemd report, we can not convince systemd to 
>> > > >> > change
>> > > >> > it:
>> > > >> >
>> > > >> > https://github.com/systemd/systemd/issues/16661
>> > > >> >
>> > > >> > Actually we have got reports about kdump kernel hangs in both s390x
>> > > >> > and powerpcle cases caused by the systemd change,  also some x86 
>> > > >> > cases
>> > > >> > could also be caused by the same (although that is in Hyper-V code
>> > > >> > instead of systemd, that need to be addressed separately).
>> > > >
>> > > > Perhaps it may be better to fix the issus on s390x and PowerPC as well?
>> > > >
>> > > >> >
>> > > >> > Thus to avoid the auto enablement here just disable the param 
>> > > >> > writable
>> > > >> > permission in sysfs.
>> > > >> >
>> > > >>
>> > > >> Well.  I don't think this is at all a desirable way of resolving a
>> > > >> disagreement with the systemd developers
>> > > >>
>> > > >> At the above github address I'm seeing "ryncsn added a commit to
>> > > >> ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
>> > > >> enable crash_kexec_post_notifiers by default".  So didn't that address
>> > > >> the issue?
>> > > >
>> > > > It does in systemd, but there is a strong interest in making this on
>> > > > by default.
>> > >
>> > > There is also a strong interest in removing this code entirely from the
>> > > kernel.
>> >
>> > Added Hyper-V people and people who created the param, it is below
>> > commit, I also want to remove it if possible, let's see how people
>> > think, but the least way should be to disable the auto setting in both 
>> > systemd
>> > and kernel:
>
> Hyper-V uses a notifier to inform the host system that a Linux VM has
> panic'ed.  Informing the host is particularly important in a public cloud
> such as Azure so that the cloud software can alert the customer, and can
> track cloud-wide reliability statistics.   Whether a kdump is taken is 
> controlled
> entirely by the customer and how he configures the VM, and we want
> the host to be informed either way.

Why?

Why does the host care?
Especially if the VM continues executing into a kdump kernel?

Further, like I have mentioned every time something like this has come up,
a call on the kexec on panic code path should be a direct call (that can
be audited), not something hidden in a notifier call chain (which can not).

Eric



Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-21 Thread Eric W. Biederman
Konrad Rzeszutek Wilk  writes:

> On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
>> On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:
>> 
>> > crash_kexec_post_notifiers enables running various panic notifier
>> > before kdump kernel booting. This increases risks of kdump failure.
>> > It is well documented in kernel-parameters.txt. We do not suggest
>> > people to enable it together with kdump unless he/she is really sure.
>> > This is also not suggested to be enabled by default when users are
>> > not aware in distributions.
>> > 
>> > But unfortunately it is enabled by default in systemd, see below
>> > discussions in a systemd report, we can not convince systemd to change
>> > it:
>> > https://github.com/systemd/systemd/issues/16661
>> > 
>> > Actually we have got reports about kdump kernel hangs in both s390x
>> > and powerpcle cases caused by the systemd change,  also some x86 cases
>> > could also be caused by the same (although that is in Hyper-V code
>> > instead of systemd, that need to be addressed separately).
>
> Perhaps it may be better to fix the issus on s390x and PowerPC as well?
>
>> > 
>> > Thus to avoid the auto enablement here just disable the param writable
>> > permission in sysfs.
>> > 
>> 
>> Well.  I don't think this is at all a desirable way of resolving a
>> disagreement with the systemd developers
>> 
>> At the above github address I'm seeing "ryncsn added a commit to
>> ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
>> enable crash_kexec_post_notifiers by default".  So didn't that address
>> the issue?
>
> It does in systemd, but there is a strong interest in making this on
> by default.

There is also a strong interest in removing this code entirely from the
kernel.

This failure is a case in point.

I think I am at my I told you so point.  This is what all of the testing
over all the years has said.  Leaving functionality to the peculiarities
of firmware, when you don't have to and can actually control what is
going on, doesn't work.

Eric





Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

2020-09-14 Thread Eric W. Biederman


Adding the kexec list as well.

Joerg Vehlow  writes:

> Hi Eric,
>> What is this patch supposed to be doing?
>>
>> What bug is it fixing?
> This information is part in the first message of this mail thread.
> The patch was intended for the active discussion in this thread,
> not for a broad review.

> A short summary: In the rt kernel, a panic in an interrupt context does
> not start the dump-capture kernel, because there is a mutex_trylock in
> __crash_kexec. If this is called in interrupt context, it always fails.
> In the non-rt kernel calling mutex_trylock is not allowed according to
> the comment of the function, but it still works.

Thanks.  For whatever reason I did not see the rest of this thread
when I was replying to your patch.

I get the feeling the rt kernel is breaking this case deliberately.
I don't know of any reason why a trylock couldn't work.

That said I won't propose fixing up the locks that way.

>> A BUG_ON that triggers inside of BUG_ONs seems not just suspect but
>> outright impossible to make use of.
> I am not entirely sure what would happen here. But even if it gets in
> some kind of endless loop, I guess this is ok, because it allows finding
> the problem. A piece of code in the function that ensures the precondition
> is a lot better than relying on only a comment.
> If this was in mutex_trylock, the bug described above wouldn't have sneaked
> in 12 years ago...

BUG_ON's are more likely to hide a problem than to show it.
Sometimes they are appropriate but they should be avoided as much as
possible.


>> I get the feeling skimming this that it is time to sort out and simplify
>> the locking here, rather than make it more complex, and more likely to
>> fail.
> I would very much like that, but sadly it looks like it is not possible.
> Either it would require blocking locks, which may fail, or no locking at
> all, which may also fail. Using a different kind of lock (like a spinlock)
> is also not possible, because spinlock_trylock again uses mutex_trylock
> in the rt kernel.

I think it is possible but the locking needs to be relooked at.

>> I get the feeling that over the years somehow the assumption that the
>> rest of the kernel is broken and that we need to get out of the broken
>> kernel as fast and as simply as possible has been lost.
> Yes I also have the feeling that the mutexes need fixing, but I wouldn't
> want to post any patch for that. At the moment, given the interface of the mutex,
> this is clearly a bug in kexec, even if it works in the non-rt kernel.

Cleanups that break the code. Sigh.

The code was written correctly for this case and was fine until
8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()").

Mostly because I didn't trust locks given their comparatively high level
of abstraction and what do you know that turned out to be correct in
this case.

It definitely looks time to see how the locking can be improved on the
kexec on panic code path.
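For reference, the pre-mutex arrangement can be sketched with a bare atomic exchange, which never sleeps and has no owner bookkeeping, so it is safe to attempt from any context, including an interrupt on a PREEMPT_RT kernel (C11 atomics stand in for the kernel's primitives; names are illustrative):

```c
#include <stdatomic.h>

/*
 * Sketch of the xchg-style lock the kexec-on-panic path used before
 * commit 8c5a1cf0ad3a replaced it with a mutex.  An atomic exchange
 * either takes the lock or fails immediately, in any context.
 */
static atomic_int kexec_lock = 0;

static int kexec_trylock(void)
{
    /* Old value 0 means we took the lock; returns 1 on success. */
    return atomic_exchange(&kexec_lock, 1) == 0;
}

static void kexec_unlock(void)
{
    atomic_store(&kexec_lock, 0);
}
```

A trylock built this way is what the panic path wants: it either gets the lock or bails, with no owner state that a crashed kernel might have corrupted.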

Eric



Re: [RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem

2020-09-09 Thread Eric W. Biederman
Kairui Song  writes:

> Currently vmcore only supports reading, this patch series is an RFC
> to add writing support to vmcore. It's x86_64 only yet, I'll add other
> architecture later if there is no problem with this idea.
>
> My purpose of adding writing support is to reuse the crashed kernel's
> old memory in kdump kernel, reduce kdump memory pressure, and
> allow kdump to run with a smaller crashkernel reservation.
>
> This is doable because in most cases, after a kernel panic, users are only
> interested in the crashed kernel itself, and userspace/cache/free
> memory pages are not dumped. `makedumpfile` is widely used to skip
> these pages. Kernel pages usually only take a small part of
> the whole old memory. So there will be many reusable pages.
>
> By adding writing support, userspace then can use these pages as a fast
> and temporary storage. This helps reduce memory pressure in many ways.
>
> For example, I've written a POC program based on this, it will find
> the reusable pages, and creates an NBD device which maps to these pages.
> The NBD device can then be used as swap, or to hold some temp files
> which previously lived in RAM.
>
> The link of the POC tool: https://github.com/ryncsn/kdumpd

A couple of thoughts.
1) Unless I am completely mistaken, treating this as an exercise in
   memory hotplug would be much simpler.

   AKA just plug in the memory that is not needed as part of the kdump.

   I see below that you have problems doing this because
   of fragmentation.  I still think hotplug is doable using some
   kind of fragmented memory zone.
   
2) The purpose of the memory reservation is that hardware is
   still potentially running against the memory of the old kernel.

   By the time we have brought up a new kernel enough of the hardware
   may have been reinitialized that we don't have to worry about
   hardware randomly dma'ing into the memory used by the old kernel.

   With IOMMUs and care we may be able to guarantee for some machine
   configurations it is impossible for DMA to come from some piece of
   hardware that is present but the kernel does not have a driver
   loaded for.


I really do not like this approach because it is fundamentally doing the
wrong thing.  Adding write support to read-only drivers.  I do not see
anywhere that you even mentioned the hard problem and the reason we
reserve memory in the first place.  Hardware spontaneously DMA'ing onto
it.

> It's been a long-standing issue that kdump suffers from OOM with
> limited crashkernel memory. So reusing old memory could be very
> helpful.

There is a very fine line here between reusing existing code (aka
drivers and userspace) and doing something that should work.

It might make sense to figure out what is using so much memory
that an OOM is triggered.

Ages ago I did something that was essentially dumping the kernels printk
buffer to the serial console in case of a crash and I had things down to
something comparatively miniscule like 8M or less.

My memory is that historically it has been high performance scsi raid
drivers or something like that, that are behind the need to have such
large memory reservations.

Now that I think about it, you aren't by any chance doing something
silly like running systemd in your initrd are you?  Are these OOMs by
any chance a userspace problem rather than a problem with inefficient
drivers?


In summary you either need to show that it is safe to reuse the
memory of the old kernel, or do some work to reduce the memory footprint
of the crashdump kernel, and the crashdump userspace. 

Eric



Re: [PATCH] kexec/kexec.c: Add missing fclose()/close() call

2020-08-25 Thread Eric W. Biederman
Youling Tang  writes:

> Add missing fclose()/close() call.

The program exits immediately and that exit closes all of the files.
What is the point of adding an explicit close?

>
> Signed-off-by: Youling Tang 
> ---
>  kexec/kexec.c | 28 ++--
>  1 file changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/kexec/kexec.c b/kexec/kexec.c
> index a62b362..f970b13 100644
> --- a/kexec/kexec.c
> +++ b/kexec/kexec.c
> @@ -542,6 +542,7 @@ static char *slurp_file_generic(const char *filename, 
> off_t *r_size,
>   }
>   result = fstat(fd, &stats);
>   if (result < 0) {
> + close(fd);
>   die("Cannot stat: %s: %s\n",
>   filename, strerror(errno));
>   }
> @@ -553,20 +554,26 @@ static char *slurp_file_generic(const char *filename, 
> off_t *r_size,
>   if (S_ISCHR(stats.st_mode)) {
>  
>   size = lseek(fd, 0, SEEK_END);
> - if (size < 0)
> + if (size < 0) {
> + close(fd);
>   die("Can not seek file %s: %s\n", filename,
>   strerror(errno));
> + }
>  
>   err = lseek(fd, 0, SEEK_SET);
> - if (err < 0)
> + if (err < 0) {
> + close(fd);
>   die("Can not seek to the begin of file %s: %s\n",
>   filename, strerror(errno));
> + }
>   buf = slurp_fd(fd, filename, size, &nread);
>   } else if (S_ISBLK(stats.st_mode)) {
>   err = ioctl(fd, BLKGETSIZE64, &size);
> - if (err < 0)
> + if (err < 0) {
> + close(fd);
>   die("Can't retrieve size of block device %s: %s\n",
>   filename, strerror(errno));
> + }
>   buf = slurp_fd(fd, filename, size, &nread);
>   } else {
>   size = stats.st_size;
> @@ -578,13 +585,18 @@ static char *slurp_file_generic(const char *filename, 
> off_t *r_size,
>   buf = slurp_fd(fd, filename, size, &nread);
>   }
>   }
> - if ((use_mmap && (buf == MAP_FAILED)) || (!use_mmap && (buf == NULL)))
> + if ((use_mmap && (buf == MAP_FAILED)) || (!use_mmap && (buf == NULL))) {
> + close(fd);
>   die("Cannot read %s", filename);
> + }
>  
> - if (nread != size)
> + if (nread != size) {
> + close(fd);
>   die("Read on %s ended before stat said it should\n", filename);
> + }
>  
>   *r_size = size;
> + close(fd);
>   return buf;
>  }
>  
> @@ -1131,8 +1143,10 @@ char *get_command_line(void)
>   if (!fp)
>   die("Could not open /proc/cmdline.");
>  
> - if (fgets(line, sizeof_line, fp) == NULL)
> + if (fgets(line, sizeof_line, fp) == NULL) {
> + fclose(fp);
>   die("Can't read /proc/cmdline.");
> + }
>  
>   fclose(fp);
>  
> @@ -1257,12 +1271,14 @@ static int do_kexec_file_load(int fileind, int argc, 
> char **argv,
>   if (i == file_types) {
>   fprintf(stderr, "Cannot determine the file type " "of %s\n",
>   kernel);
> + close(kernel_fd);
>   return EFAILED;
>   }
>  
>   ret = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
>   if (ret < 0) {
>   fprintf(stderr, "Cannot load %s\n", kernel);
> + close(kernel_fd);
>   return ret;
>   }



Re: [PATCH RFC 2/2] lkdtm: Add heap spraying test

2020-08-17 Thread Eric W. Biederman
Alexander Popov  writes:

> Add a simple test for CONFIG_SLAB_QUARANTINE.
>
> It performs heap spraying that aims to reallocate the recently freed heap
> object. This technique is used for exploiting use-after-free
> vulnerabilities in the kernel code.
>
> This test shows that CONFIG_SLAB_QUARANTINE breaks heap spraying
> exploitation technique.
>
> Signed-off-by: Alexander Popov 

Why put this test in the linux kernel dump test module?

I have no problem with tests, and I may be wrong, but this
does not look like it is testing whether heap corruption
triggers a crash dump, which is what the rest of the tests
in lkdtm are about: seeing if the test successfully
triggers a crash dump.

Eric

> ---
>  drivers/misc/lkdtm/core.c  |  1 +
>  drivers/misc/lkdtm/heap.c  | 40 ++
>  drivers/misc/lkdtm/lkdtm.h |  1 +
>  3 files changed, 42 insertions(+)
>
> diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
> index a5e344df9166..78b7669c35eb 100644
> --- a/drivers/misc/lkdtm/core.c
> +++ b/drivers/misc/lkdtm/core.c
> @@ -126,6 +126,7 @@ static const struct crashtype crashtypes[] = {
>   CRASHTYPE(SLAB_FREE_DOUBLE),
>   CRASHTYPE(SLAB_FREE_CROSS),
>   CRASHTYPE(SLAB_FREE_PAGE),
> + CRASHTYPE(HEAP_SPRAY),
>   CRASHTYPE(SOFTLOCKUP),
>   CRASHTYPE(HARDLOCKUP),
>   CRASHTYPE(SPINLOCKUP),
> diff --git a/drivers/misc/lkdtm/heap.c b/drivers/misc/lkdtm/heap.c
> index 1323bc16f113..a72a241e314a 100644
> --- a/drivers/misc/lkdtm/heap.c
> +++ b/drivers/misc/lkdtm/heap.c
> @@ -205,6 +205,46 @@ static void ctor_a(void *region)
>  static void ctor_b(void *region)
>  { }
>  
> +#define HEAP_SPRAY_SIZE 128
> +
> +void lkdtm_HEAP_SPRAY(void)
> +{
> + int *addr;
> + int *spray_addrs[HEAP_SPRAY_SIZE] = { 0 };
> + unsigned long i = 0;
> +
> + addr = kmem_cache_alloc(a_cache, GFP_KERNEL);
> + if (!addr) {
> + pr_info("Unable to allocate memory in lkdtm-heap-a cache\n");
> + return;
> + }
> +
> + *addr = 0x31337;
> + kmem_cache_free(a_cache, addr);
> +
> + pr_info("Performing heap spraying...\n");
> + for (i = 0; i < HEAP_SPRAY_SIZE; i++) {
> + spray_addrs[i] = kmem_cache_alloc(a_cache, GFP_KERNEL);
> + *spray_addrs[i] = 0x31337;
> + pr_info("attempt %lu: spray alloc addr %p vs freed addr %p\n",
> + i, spray_addrs[i], addr);
> + if (spray_addrs[i] == addr) {
> + pr_info("freed addr is reallocated!\n");
> + break;
> + }
> + }
> +
> + if (i < HEAP_SPRAY_SIZE)
> + pr_info("FAIL! Heap spraying succeed :(\n");
> + else
> + pr_info("OK! Heap spraying hasn't succeed :)\n");
> +
> + for (i = 0; i < HEAP_SPRAY_SIZE; i++) {
> + if (spray_addrs[i])
> + kmem_cache_free(a_cache, spray_addrs[i]);
> + }
> +}
> +
>  void __init lkdtm_heap_init(void)
>  {
>   double_free_cache = kmem_cache_create("lkdtm-heap-double_free",
> diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
> index 8878538b2c13..dfafb4ae6f3a 100644
> --- a/drivers/misc/lkdtm/lkdtm.h
> +++ b/drivers/misc/lkdtm/lkdtm.h
> @@ -45,6 +45,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
>  void lkdtm_SLAB_FREE_DOUBLE(void);
>  void lkdtm_SLAB_FREE_CROSS(void);
>  void lkdtm_SLAB_FREE_PAGE(void);
> +void lkdtm_HEAP_SPRAY(void);
>  
>  /* lkdtm_perms.c */
>  void __init lkdtm_perms_init(void);



Re: [PATCH 1/2] kexec: Add quick kexec support for kernel

2020-08-14 Thread Eric W. Biederman
Sang Yan  writes:

> In normal kexec, relocating kernel may cost 5 ~ 10 seconds, to
> copy all segments from vmalloced memory to kernel boot memory,
> because of disabled mmu.

I haven't seen kexec that slow since I tested on my 16MHz 386.

That machine has an excuse: it really is slow.  Anything else
that takes seconds is almost certainly slow because someone
has misconfigured things so that the data copied by kexec is not cached.

I humbly suggest that you fix the arm64 code so that the data gets
cached.

Eric



Re: [PATCH 2/3] security: add symbol namespace for reading file data

2020-05-13 Thread Eric W. Biederman
Luis Chamberlain  writes:

> Certain symbols are not meant to be used by everybody, the security
> helpers for reading files directly is one such case. Use a symbol
> namespace for them.
>
> This will prevent abuse of these symbols in places they were
> not intended to be used, and provides an easy way to audit where these
> types of operations happen as a whole.

Why not just remove the ability for the firmware loader to be a module?

Is there some important use case that requires the firmware loader
to be a module?

We already compile the code in by default.  So it is probably just
easier to remove the modular support altogether.  That would allow
the export of the security hooks to be removed as well.

Eric


> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/base/firmware_loader/fallback.c | 1 +
>  fs/exec.c   | 2 ++
>  kernel/kexec.c  | 2 ++
>  kernel/module.c | 2 ++
>  security/security.c | 6 +++---
>  5 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/firmware_loader/fallback.c 
> b/drivers/base/firmware_loader/fallback.c
> index d9ac7296205e..b06dafda 100644
> --- a/drivers/base/firmware_loader/fallback.c
> +++ b/drivers/base/firmware_loader/fallback.c
> @@ -19,6 +19,7 @@
>   */
>  
>  MODULE_IMPORT_NS(FIRMWARE_LOADER_PRIVATE);
> +MODULE_IMPORT_NS(SECURITY_READ);
>  
>  extern struct firmware_fallback_config fw_fallback_config;
>  
> diff --git a/fs/exec.c b/fs/exec.c
> index 9791b9eef9ce..30bd800ab1d6 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -72,6 +72,8 @@
>  
>  #include 
>  
> +MODULE_IMPORT_NS(SECURITY_READ);
> +
>  int suid_dumpable = 0;
>  
>  static LIST_HEAD(formats);
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index f977786fe498..8d572b41a157 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -19,6 +19,8 @@
>  
>  #include "kexec_internal.h"
>  
> +MODULE_IMPORT_NS(SECURITY_READ);
> +
>  static int copy_user_segment_list(struct kimage *image,
> unsigned long nr_segments,
> struct kexec_segment __user *segments)
> diff --git a/kernel/module.c b/kernel/module.c
> index 80faaf2116dd..8973a463712e 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -59,6 +59,8 @@
>  #include 
>  #include "module-internal.h"
>  
> +MODULE_IMPORT_NS(SECURITY_READ);
> +
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> diff --git a/security/security.c b/security/security.c
> index 8ae66e4c370f..bdbd1fc5105a 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1654,7 +1654,7 @@ int security_kernel_read_file(struct file *file, enum 
> kernel_read_file_id id)
>   return ret;
>   return ima_read_file(file, id);
>  }
> -EXPORT_SYMBOL_GPL(security_kernel_read_file);
> +EXPORT_SYMBOL_NS_GPL(security_kernel_read_file, SECURITY_READ);
>  
>  int security_kernel_post_read_file(struct file *file, char *buf, loff_t size,
>  enum kernel_read_file_id id)
> @@ -1666,7 +1666,7 @@ int security_kernel_post_read_file(struct file *file, 
> char *buf, loff_t size,
>   return ret;
>   return ima_post_read_file(file, buf, size, id);
>  }
> -EXPORT_SYMBOL_GPL(security_kernel_post_read_file);
> +EXPORT_SYMBOL_NS_GPL(security_kernel_post_read_file, SECURITY_READ);
>  
>  int security_kernel_load_data(enum kernel_load_data_id id)
>  {
> @@ -1677,7 +1677,7 @@ int security_kernel_load_data(enum kernel_load_data_id 
> id)
>   return ret;
>   return ima_load_data(id);
>  }
> -EXPORT_SYMBOL_GPL(security_kernel_load_data);
> +EXPORT_SYMBOL_NS_GPL(security_kernel_load_data, SECURITY_READ);
>  
>  int security_task_fix_setuid(struct cred *new, const struct cred *old,
>int flags)



Re: [PATCH] kexec: Discard loaded image on memory hotplug

2020-05-12 Thread Eric W. Biederman
David Hildenbrand  writes:
>> 
>> Maybe we have enough information to fixup the loaded kexec image
>> in the kexec_file_load case, we certainly don't in the ordinary
>> kexec_load case.
>
> Yes, that's also what I mentioned in my reply to Baoquan.
>
>> 
>> For now I want to stick to the simplest thing we can do which is either
>> blocking the memory hotplug operation (if that is possible) or
>> dropping the loaded kexec image.
>
> Yes, the latter is the best for now. It's simple.
>
> I am suggesting to add explicit callbacks to
> add_memory()/remove_memory(), and calling the invalidation from there -
> because I see various issues with the memory notifier approach (racy,
> false positives, never called if memory is not onlined).

Ok so we are in agreement.  Correct patch.  Wrong trigger condition.

Eric




Re: [PATCH] kexec: Discard loaded image on memory hotplug

2020-05-11 Thread Eric W. Biederman
David Hildenbrand  writes:

> On 09.05.20 17:14, Eric W. Biederman wrote:
>> David Hildenbrand  writes:
>> 
>>> On 01.05.20 18:57, James Morse wrote:
>>>> On x86, the kexec payload contains a copy of the current memory map.
>>>> If memory is added or removed, this copy of the memory map becomes
>>>> stale. Getting this wrong may prevent the next kernel from booting.
>>>> The first kernel may die if it tries to re-assemble the next kernel
>>>> in memory that has been removed.
>>>>
>>>> Discard the loaded kexec image when the memory map changes, user-space
>>>> should reload it.
>>>>
>>>> Kdump is unaffected, as it is placed within the crashkernel reserved
>>>> memory area and only uses this memory. The stale memory map may affect
>>>> generation of the vmcore, but the kdump kernel should be in a position
>>>> to validate it.
>>>>
>>>> Signed-off-by: James Morse 
>>>> ---
>>>> This patch obsoletes:
>>>>  * kexec/memory_hotplug: Prevent removal and accidental use
>>>> https://lore.kernel.org/linux-arm-kernel/20200326180730.4754-1-james.mo...@arm.com/
>>>>
>>>>  kernel/kexec_core.c | 40 
>>>>  1 file changed, 40 insertions(+)
>>>>
>>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>>> index c19c0dad1ebe..e1901e5bd4b5 100644
>>>> --- a/kernel/kexec_core.c
>>>> +++ b/kernel/kexec_core.c
>>>> @@ -12,6 +12,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> @@ -22,10 +23,12 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> @@ -1219,3 +1222,40 @@ void __weak arch_kexec_protect_crashkres(void)
>>>>  
>>>>  void __weak arch_kexec_unprotect_crashkres(void)
>>>>  {}
>>>> +
>>>> +/*
>>>> + * If the memory layout changes, any loaded kexec image should be evicted
>>>> + * as it may contain a copy of the (now stale) memory map. This also means
>>>> + * we don't need to check the memory is still present when re-assembling 
>>>> the
>>>> + * new kernel at machine_kexec() time.
>>>> + */
>>>
>>> Onlining/offlining is not a change of the memory map.
>> 
>> Phrasing it that way is nonsense.  What is important is the memory
>> available in the system.  A memory map is just a reflection of that;
>> it is not the definition of truth.
>> 
>> So if this notifier reflects when memory is coming and going on the
>> system, this is a reasonable approach.
>> 
>> Might these notifiers fire for special kinds of memory that should
>> only be used for very special purposes?
>> 
>> This change, with the addition of some filters, say to limit taking action
>> to MEM_ONLINE and MEM_OFFLINE, looks reasonable to me.  Probably also
>> filtering out special kinds of memory that is not generally useful.
>
> There are cases, where this notifier will not get called (e.g., hotplug
> a DIMM and don't online it) or will get called, although nothing changed
> (offline+re-online to a different zone triggered by user space). AFAIK,
> nothing in kexec (*besides kdump) cares about online vs. offline memory.
> This is why this feels wrong.

So what precisely do offline and online of memory mean in this context?
Is it turning the memory on and off (the obvious meaning)?
Or do offline and online mean letting the ordinary kernel use, or stop
using, a chunk of memory while the memory itself remains running
the entire time?


> add_memory()/try_remove_memory() is the place where:
> - Memblocks are created/deleted (if the memblock allocator is still
>   alive)
> - Memory resources are created/deleted (e.g., reflected in /proc/iomem)
> - Firmware memmap entries are created/deleted (/sys/firmware/memmap)
>
> My idea would be to add something like
> kexec_map_add()/kexec_map_remove() where we have
> firmware_map_add_hotplug()/firmware_map_remove(). From there, we can
> unload the kexec image like done in this patch.

I don't see the connection with a firmware_map.  Maybe that is how it is
thought about in the code, but in principle the firmware may not exist
or may completely ignore memory hotplug.

> And these callbacks might come in handy for fixing up the kexec initial
> memmap in case of kexec_file_load(). AFAIKS on x86_64:

Maybe we have enough information to fixup the loaded kexec image
in the kexec_file_load case, we certainly don't in the ordinary
kexec_load case.

For now I want to stick to the simplest thing we can do which is either
blocking the memory hotplug operation (if that is possible) or
dropping the loaded kexec image.

If that actually becomes a problem in practice we can do something more.
But for now let's just do the minimal we can do that will prevent
incorrect operation.

Eric



Re: [PATCH] kexec: Discard loaded image on memory hotplug

2020-05-09 Thread Eric W. Biederman
David Hildenbrand  writes:

> On 01.05.20 18:57, James Morse wrote:
>> On x86, the kexec payload contains a copy of the current memory map.
>> If memory is added or removed, this copy of the memory map becomes
>> stale. Getting this wrong may prevent the next kernel from booting.
>> The first kernel may die if it tries to re-assemble the next kernel
>> in memory that has been removed.
>> 
>> Discard the loaded kexec image when the memory map changes, user-space
>> should reload it.
>> 
>> Kdump is unaffected, as it is placed within the crashkernel reserved
>> memory area and only uses this memory. The stale memory map may affect
>> generation of the vmcore, but the kdump kernel should be in a position
>> to validate it.
>> 
>> Signed-off-by: James Morse 
>> ---
>> This patch obsoletes:
>>  * kexec/memory_hotplug: Prevent removal and accidental use
>> https://lore.kernel.org/linux-arm-kernel/20200326180730.4754-1-james.mo...@arm.com/
>> 
>>  kernel/kexec_core.c | 40 
>>  1 file changed, 40 insertions(+)
>> 
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index c19c0dad1ebe..e1901e5bd4b5 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -12,6 +12,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -22,10 +23,12 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -1219,3 +1222,40 @@ void __weak arch_kexec_protect_crashkres(void)
>>  
>>  void __weak arch_kexec_unprotect_crashkres(void)
>>  {}
>> +
>> +/*
>> + * If the memory layout changes, any loaded kexec image should be evicted
>> + * as it may contain a copy of the (now stale) memory map. This also means
>> + * we don't need to check the memory is still present when re-assembling the
>> + * new kernel at machine_kexec() time.
>> + */
>
> Onlining/offlining is not a change of the memory map.

Phrasing it that way is nonsense.  What is important is the memory
available in the system.  A memory map is just a reflection of that;
it is not the definition of truth.

So if this notifier reflects when memory is coming and going on the
system, this is a reasonable approach.

Might these notifiers fire for special kinds of memory that should
only be used for very special purposes?

This change, with the addition of some filters, say to limit taking action
to MEM_ONLINE and MEM_OFFLINE, looks reasonable to me.  Probably also
filtering out special kinds of memory that is not generally useful.

Eric



Re: [RFC][PATCH] kexec: Teach indirect pages how to live in high memory

2020-05-05 Thread Eric W. Biederman
Hari Bathini  writes:

> On 05/05/20 3:29 am, Eric W. Biederman wrote:
>> 
>> Recently a patch was proposed to kimage_alloc_page to slightly alter
>> the logic of how pages allocated with incompatible flags were
>> detected.  The logic was being altered because the semantics of the
>> page allocator were changing yet again.
>> 
>> Looking at that case I realized that there is no reason for it to even
>> exist.  Either the indirect page allocations and the source page
>> allocations could be separated out, or I could do as I am doing now
>> and simply teach the indirect pages to live in high memory.
>> 
>> This patch replaces pointers of type kimage_entry_t * with a new type,
>> kimage_entry_pos_t.  This new type holds the physical address of the
>> indirect page and the offset within that page of the next indirect
>> entry to write.  A special constant KIMAGE_ENTRY_POS_INVALID is added,
>> to which kimage_entry_pos_t variables that don't currently have a
>> valid value may be set.
>> 
>> Two new functions, kimage_read_entry and kimage_write_entry, have been
>> provided to read and write entries in a way that works even if they
>> live in high memory.
>> 
>> The now unnecessary checks to see if a destination entry is non-zero,
>> and to increment it if so, have been removed.  For safety, new indirect
>> pages are now cleared so we have a guarantee that everything not yet
>> used is zero.  Along with this, writing an extra trailing 0 entry has
>> been removed, as all trailing entries are known to be 0.
>> 
>> With highmem support implemented for indirect pages
>> kimage_image_alloc_page has been updated to always allocate
>> GFP_HIGHUSER pages, and handling of pages with different
>> gfp flags has been removed.
>> 
>> Signed-off-by: "Eric W. Biederman" 
>
> Eric, the patch failed with data access exception on ppc64. Using the below 
> patch on top
> got me going...

Doh!  Somehow I thought I had put that logic or something equivalent
into kimage_write_entry and it appears I did not.  I will see if I can
respin the patch.

Thank you very much for testing.

Eric


> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 45862fd..bef52f1 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -570,7 +570,12 @@ static int kimage_add_entry(struct kimage *image, 
> kimage_entry_t entry)
>   return -ENOMEM;
>  
>   ind_addr = page_to_boot_pfn(page) << PAGE_SHIFT;
> - kimage_write_entry(image->entry_pos, ind_addr | 
> IND_INDIRECTION);
> +
> + /* If it is the first entry, handle it here */
> + if (!image->head)
> + image->head = ind_addr | IND_INDIRECTION;
> + else
> + kimage_write_entry(image->entry_pos, ind_addr | 
> IND_INDIRECTION);
>  
>   clear_highpage(page);
>  
> @@ -623,7 +628,11 @@ int __weak machine_kexec_post_load(struct kimage *image)
>  
>  void kimage_terminate(struct kimage *image)
>  {
> - kimage_write_entry(image->entry_pos, IND_DONE);
> + /* This could be the only entry in case of kdump */
> + if (!image->head)
> + image->head = IND_DONE;
> + else
> + kimage_write_entry(image->entry_pos, IND_DONE);
>  }
>  
>  #define for_each_kimage_entry(image, pos, entry) 
> \
>
>
> Thanks
> Hari



[RFC][PATCH] kexec: Teach indirect pages how to live in high memory

2020-05-04 Thread Eric W. Biederman


Recently a patch was proposed to kimage_alloc_page to slightly alter
the logic of how pages allocated with incompatible flags were
detected.  The logic was being altered because the semantics of the
page allocator were changing yet again.

Looking at that case I realized that there is no reason for it to even
exist.  Either the indirect page allocations and the source page
allocations could be separated out, or I could do as I am doing now
and simply teach the indirect pages to live in high memory.

This patch replaces pointers of type kimage_entry_t * with a new type,
kimage_entry_pos_t.  This new type holds the physical address of the
indirect page and the offset within that page of the next indirect
entry to write.  A special constant KIMAGE_ENTRY_POS_INVALID is added,
to which kimage_entry_pos_t variables that don't currently have a
valid value may be set.

Two new functions, kimage_read_entry and kimage_write_entry, have been
provided to read and write entries in a way that works even if they
live in high memory.

The now unnecessary checks to see if a destination entry is non-zero,
and to increment it if so, have been removed.  For safety, new indirect
pages are now cleared so we have a guarantee that everything not yet
used is zero.  Along with this, writing an extra trailing 0 entry has
been removed, as all trailing entries are known to be 0.

With highmem support implemented for indirect pages
kimage_image_alloc_page has been updated to always allocate
GFP_HIGHUSER pages, and handling of pages with different
gfp flags has been removed.

Signed-off-by: "Eric W. Biederman" 
---

I have not done more than compile test this but I think this will remove
that tricky case in the kexec highmem support.

Any comments?  Does anyone have a 32bit highmem system where they can
test this code?  I can probably do something with a 32bit x86 kernel
but it has been a few days.

Does anyone know how we can more effectively allocate memory below
whatever maximum limit kexec supports?  Typically below
4G on 32-bit and below 2^64 on 64-bit.

Eric

 include/linux/kexec.h |   5 +-
 kernel/kexec_core.c   | 119 +-
 2 files changed, 73 insertions(+), 51 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 1776eb2e43a4..6d3f6f4cb926 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -69,6 +69,8 @@
  */
 
 typedef unsigned long kimage_entry_t;
+typedef unsigned long kimage_entry_pos_t;
+#define KIMAGE_ENTRY_POS_INVALID ((kimage_entry_pos_t)-2)
 
 struct kexec_segment {
/*
@@ -243,8 +245,7 @@ int kexec_elf_probe(const char *buf, unsigned long len);
 #endif
 struct kimage {
kimage_entry_t head;
-   kimage_entry_t *entry;
-   kimage_entry_t *last_entry;
+   kimage_entry_pos_t entry_pos;
 
unsigned long start;
struct page *control_code_page;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c19c0dad1ebe..45862fda9e60 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -142,7 +142,6 @@ EXPORT_SYMBOL_GPL(kexec_crash_loaded);
 #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
 
 static struct page *kimage_alloc_page(struct kimage *image,
-  gfp_t gfp_mask,
   unsigned long dest);
 
 int sanity_check_segment_list(struct kimage *image)
@@ -261,8 +260,7 @@ struct kimage *do_kimage_alloc_init(void)
return NULL;
 
image->head = 0;
-   image->entry = &image->head;
-   image->last_entry = &image->head;
+   image->entry_pos = KIMAGE_ENTRY_POS_INVALID;
image->control_page = ~0; /* By default this does not apply */
image->type = KEXEC_TYPE_DEFAULT;
 
@@ -531,28 +529,56 @@ int kimage_crash_copy_vmcoreinfo(struct kimage *image)
return 0;
 }
 
-static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+static kimage_entry_t kimage_read_entry(kimage_entry_pos_t pos)
 {
-   if (*image->entry != 0)
-   image->entry++;
+   kimage_entry_t *arr, entry;
+   struct page *page;
+   unsigned long off;
+
+   page = boot_pfn_to_page(pos >> PAGE_SHIFT);
+   off = pos & ~PAGE_MASK;
+   arr = kmap_atomic(page);
+   entry = arr[off];
+   kunmap_atomic(arr);
+
+   return entry;
+}
 
-   if (image->entry == image->last_entry) {
-   kimage_entry_t *ind_page;
+static void kimage_write_entry(kimage_entry_pos_t pos, kimage_entry_t entry)
+{
+   kimage_entry_t *arr;
+   struct page *page;
+   unsigned long off;
+
+   page = boot_pfn_to_page(pos >> PAGE_SHIFT);
+   off = pos & ~PAGE_MASK;
+   arr = kmap_atomic(page);
+   arr[off] = entry;
+   kunmap_atomic(arr);
+}
+
+#define LAST_KIMAGE_ENTRY ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1)
+static int kimage_add_entry(struct kimage *imag

Re: [PATCH v2 03/10] kexec: separate PageHighMem() and PageHighMemZone() use case

2020-05-04 Thread Eric W. Biederman

I have added the kexec mailing list.

Looking at the patch we are discussing, it appears that the kexec code
could be doing much better in highmem situations today, but is not.

Joonsoo Kim  writes:

> On Fri, 1 May 2020 at 11:06 PM, Eric W. Biederman  wrote:
>>
>> js1...@gmail.com writes:
>>
>> > From: Joonsoo Kim 
>> >
>> > Until now, PageHighMem() is used for two different cases. One is to check
>> > if there is a direct mapping for this page or not. The other is to check
> the zone of this page, that is, whether it is the highmem type zone or not.
>> >
>> > Now, we have separate functions, PageHighMem() and PageHighMemZone() for
>> > each cases. Use appropriate one.
>> >
>> > Note that there are some rules to determine the proper macro.
>> >
>> > 1. If PageHighMem() is called for checking if the direct mapping exists
>> > or not, use PageHighMem().
>> > 2. If PageHighMem() is used to predict the previous gfp_flags for
>> > this page, use PageHighMemZone(). The zone of the page is related to
>> > the gfp_flags.
> 3. If the purpose of calling PageHighMem() is to count highmem pages and
> to interact with the system by using this count, use PageHighMemZone().
> This counter is usually used to calculate the available memory for a
> kernel allocation, and pages in the highmem zone cannot be available
> for a kernel allocation.
> 4. Otherwise, use PageHighMemZone().  It's safe since its implementation
> is just a copy of the previous PageHighMem() implementation and won't
> be changed.
>> >
>> > I apply the rule #2 for this patch.
>>
>> Hmm.
>>
>> What happened to the notion of deprecating and reducing the usage of
>> highmem?  I know that we have some embedded architectures where it is
>> still important but this feels like it flies in the face of that.
>
> AFAIK, deprecating highmem requires some more time and, before then,
> we need to support it.

But it is at least worth looking at what we are doing with highmem
and asking whether it makes sense.

>> This part of kexec would be much more maintainable if it had a proper
>> mm layer helper that tested to see if the page matched the passed in
>> gfp flags.  That way the mm layer could keep changing and doing weird
>> gyrations and this code would not care.
>
> Good idea! I will do it.
>
>>
>> What would be really helpful is if there were a straightforward way to
>> allocate memory whose physical address fits in the native word size.
>>
>>
>> All I know for certain about this patch is that it takes a piece of code
>> that looked like it made sense, and transforms it into something I cannot
>> easily verify and cannot maintain.
>
> Although I decide to make a helper as you described above, I don't
> understand why you think that a new code isn't maintainable. It is just
> the same thing with different name. Could you elaborate more why do
> you think so?

Because the current code is already wrong.  It does not handle
the general case of what it claims to handle.  When the only distinction
that needs to be drawn is highmem or not highmem, that is likely fine.
But now you are making it possible to draw more distinctions, at which
point I have no idea which distinction needs to be drawn.


The code and the logic are about 20 years old.  When it was written I
don't recall taking NUMA seriously, and the kernel only had 3 zones
as I recall (DMA, aka the now-deprecated GFP_DMA; NORMAL; and HIGH).

The code attempts to work around limitations of those old zones and play
nice in a highmem world by allocating HIGH memory and not using
it if the memory was above 4G (on 32-bit).

Looking at it now, the kernel has GFP_DMA32, so on 32-bit with highmem we
should probably be using that when allocating memory.




Further, in dealing with this memory management situation there are
only two cases in which we call kimage_alloc_page.

For an indirect page which must have a valid page_address(page).
We could probably relax that if we cared to.

For a general kexec page to store the next kernel in until we switch.
The general pages can be in high memory.

In a highmem world all of those pages should have physical addresses below 4G.



Given that we fundamentally have two situations, my sense is that we
should just refactor the code so that we never have to deal with:


/* The old page I have found cannot be a
 * destination page, so return it if it's
 * gfp_flags honor the ones passed in.
 */
if (!(gfp_mask & __GFP_HIGHMEM) &&
 

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-23 Thread Eric W. Biederman
David Hildenbrand  writes:

>> The confusing part was talking about memory being still in use when
>> it is actually scheduled for use in the future.
>
> +1
>
>> 
 Usually somewhere in the loaded image
 is a copy of the memory map at the time the kexec kernel was loaded.
 That will invalidate the memory map as well.
>>>
>>> Ah, unconditionally. Sure, x86 needs this.
>>> (arm64 re-discovers the memory map from firmware tables after kexec)
>
> Does this include hotplugged DIMMs e.g., under KVM?
> [...]

As far as I know.  If the memory map changes we need to drop the loaded
image.


Having thought about it a little more I suspect it would be better to go
the other way and just block all hotplug actions after a kexec_load,
as all we expect to happen afterwards is running shutdown scripts.

If blocking the hotplug action uses printk to print a nice message
saying something like: "Hotplug blocked because of a loaded kexec image",
then people will be able to figure out what is going on and
call kexec -u if they haven't started the shutdown scripts yet.


Either way it is something simple and unconditional that will make
things work.

>>>> All of this should be for a very brief window of a few seconds, as
>>>> the loaded kexec image is quite short.
>>>
>>> It seems I'm the outlier anticipating anything could happen between
>>> those syscalls.
>> 
>> The design is:
>>  sys_kexec_load()
>>  shutdown scripts
>> sys_reboot(LINUX_REBOOT_CMD_KEXEC);
>> 
>> There are two system calls simply so that the shutdown scripts can run.
>> Now maybe someone somewhere does something different but that is not
>> expected.
>> 
>> Only the kexec on panic kernel is expected to persist somewhat
>> indefinitely.  But that should be in memory that is reserved from boot
>> time, and so the memory hotplug should have enough visibility to not
>> allow that memory to be given up.
>
> Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
> can usually not get offlined and, therefore, the memory cannot get removed.
>
> Interestingly, s390x even has a hotplug notifier for that
>
> arch/s390/kernel/setup.c:kdump_mem_notifier()
>
> (offlining of memory on s390x can result in memory getting depopulated
> in the hypervisor, so after it would have been offlined, it would no
> longer be accessible. I somewhat doubt that this notifier is really
> needed - all pages in the crashkernel area should look like ordinary
> allocated pages when the area is reserved early during boot via the
> memblock allocator, and therefore offlining cannot succeed. But that's a
> different story - and I suspect this is a leftover from pre-memblock times.)

It might be worth seeing if that is true, or if we need to generalize the
s390x code.

Eric


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Eric W. Biederman
James Morse  writes:

> Hi Eric,
>
> On 15/04/2020 21:33, Eric W. Biederman wrote:
>> James Morse  writes:
>> 
>>> An image loaded for kexec is not stored in place, instead its segments
>>> are scattered through memory, and are re-assembled when needed. In the
>>> meantime, the target memory may have been removed.
>>>
>>> Because mm is not aware that this memory is still in use, it allows it
>>> to be removed.
>>>
>>> Add a memory notifier to prevent the removal of memory regions that
>>> overlap with a loaded kexec image segment. e.g., when triggered from the
>>> Qemu console:
>>> | kexec_core: memory region in use
>>> | memory memory32: Offline failed.
>>>
>>> Signed-off-by: James Morse 
>> 
>> Given that we are talking about the destination pages for kexec
>> not where the loaded kernel is currently stored the description is
>> confusing.
>
> I think David has some better wording to cover this. I thought I had it with 
> 'scattered
> and re-assembled'.

The confusing part was talking about memory being still in use,
that is actually scheduled for use in the future.

>> Usually somewhere in the loaded image
>> is a copy of the memory map at the time the kexec kernel was loaded.
>> That will invalidate the memory map as well.
>
> Ah, unconditionally. Sure, x86 needs this.
> (arm64 re-discovers the memory map from firmware tables after kexec)
>
> If that's an acceptable change in behaviour, sure, lets do that.

Yes.


>> All of this should be for a very brief window of a few seconds, as
>> the loaded kexec image is quite short.
>
> It seems I'm the outlier anticipating anything could happen between
> those syscalls.

The design is:
sys_kexec_load()
shutdown scripts
sys_reboot(LINUX_REBOOT_CMD_KEXEC);

There are two system calls simply so that the shutdown scripts can run.
Now maybe someone somewhere does something different but that is not
expected.

Only the kexec on panic kernel is expected to persist somewhat
indefinitely.  But that should be in memory that is reserved from boot
time, and so the memory hotplug should have enough visibility to not
allow that memory to be given up.

Eric



Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-04-22 Thread Eric W. Biederman
James Morse  writes:

> Hi Eric,
>
> On 15/04/2020 21:29, Eric W. Biederman wrote:
>> James Morse  writes:
>> 
>>> Hello!
>>>
>>> arm64 recently queued support for memory hotremove, which led to some
>>> new corner cases for kexec.
>>>
>>> If the kexec segments are loaded for a removable region, that region may
>>> be removed before kexec actually occurs. This causes the first kernel to
>>> lockup when applying the relocations. (I've triggered this on x86 too).
>>>
>>> The first patch adds a memory notifier for kexec so that it can refuse
>>> to allow in-use regions to be taken offline.
>>>
>>>
>>> This doesn't solve the problem for arm64, where the new kernel must
>>> initially rely on the data structures from the first boot to describe
>>> memory. These don't describe hotpluggable memory.
>>> If kexec places the kernel in one of these regions, it must also provide
>>> a DT that describes the region in which the kernel was mapped as memory.
>>> (and somehow ensure its always present in the future...)
>>>
>>> To prevent this from happening accidentally with unaware user-space,
>>> patches two and three allow arm64 to give these regions a different
>>> name.
>>>
>>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>>> were added separately.
>>>
>>>
>>> I haven't tried kdump.
>>> Unaware kdump from user-space probably won't describe the hotplug
>>> regions if the name is different, which saves us from problems if
>>> the memory is no longer present at kdump time, but means the vmcore
>>> is incomplete.
>>>
>>>
>>> These patches are based on arm64's for-next/core branch, but can all
>>> be merged independently.
>> 
>> So I just looked through these quickly and I think there are real
>> problems here we can fix, and that are worth fixing.
>> 
>> However I am not thrilled with the fixes you propose.
>
> Sure. Unfortunately /proc/iomem is the only trick arm64 has to keep the 
> existing
> kexec-tools working.
> (We've had 'unthrilling' patches like this before to prevent user-space from 
> loading the
> kernel over the top of the in-memory firmware tables.)
>
> arm64 expects the description of memory to come from firmware, be that UEFI 
> for memory
> present at boot, or the ACPI AML methods for memory that was added
> later.
>
> On arm64 there is no standard location for memory. The kernel has to be 
> handed a pointer
> to the firmware tables that describe it. The kernel expects to boot from 
> memory that was
> present at boot.

What do you do when the firmware is wrong?  Does arm64 support the
mem=xxx@yyy kernel command line options?

If you want to handle the general case of memory hotplug, having a
limitation that you have to boot from memory that was present at boot is
a bug, because the memory might not be there.

> Modifying the firmware tables at runtime doesn't solve the problem as we may 
> need to move
> the firmware-reserved memory region that describes memory. User-space may 
> still load and
> kexec either side of that update.
>
> Even if we could modify the structures at runtime, we can't update a loaded 
> kexec image.
> We have no idea which blob from userspace is the DT. It may not even be linux 
> that has
> been loaded.

What can be done and very reasonably so is on memory hotplug:
- Unload any loaded kexec image.
- Block loading any new image until the hotplug operation completes.

That is simple and generic, and can be done for all architectures.

This doesn't apply to kexec on panic kernel because it fundamentally
needs to figure out how to limp along (or reliably stop) when it has the
wrong memory map.

> We can't emulate parts of UEFI's handover because kexec's purgatory
> isn't an EFI program.

Plus much of EFI is unusable after ExitBootServices is called.

> I can't see a path through all this. If we have to modify existing 
> user-space, I'd rather
> leave it broken. We can detect the problem in the arch code and print a 
> warning at load time.

The weirdest thing to me in all of this is that you have been wanting to
handle memory hotplug.  But you don't want to change or deal with the
memory map changing when hotplug occurs.  The memory map changing is
fundamentally what memory hotplug does.

So I think it is fundamental to figure out how to pass the updated
memory map.  Either through mem=xxx@yyy command line options or through
another mechanism.
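For illustration, here is a minimal userspace parser for the mem=&lt;size&gt;@&lt;start&gt; form referred to above (a simplified stand-in for the kernel's memparse-based handling; the function names are hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a size with an optional K/M/G suffix, memparse-style. */
static uint64_t parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 0);

	switch (*e) {
	case 'G': case 'g': v <<= 30; e++; break;
	case 'M': case 'm': v <<= 20; e++; break;
	case 'K': case 'k': v <<= 10; e++; break;
	}
	if (end)
		*end = e;
	return v;
}

/* Parse "mem=<size>@<start>", e.g. "mem=512M@0x80000000".
 * Returns 0 on success, -1 on malformed input. */
static int parse_mem_opt(const char *opt, uint64_t *size, uint64_t *start)
{
	const char *p = opt;

	if (strncmp(p, "mem=", 4) != 0)
		return -1;
	p += 4;
	*size = parse_size(p, &p);
	if (*p++ != '@')
		return -1;
	*start = parse_size(p, &p);
	return 0;
}
```

A kexec loader could emit one such option per usable region when constructing the command line for the next kernel.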

If you really want to keep the limitation that you have to have the
kernel in the initial memory map you can compare that map to the
efi tables when selecting the load address.

Expecting userspace to reload the loaded kernel after memory hotplug is
completely reasonable.

Unless I am mistaken memory hotplug is expected to be a rare event not
something that happens every day, certainly not something that happens
every minute.

Eric





Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-21 Thread Eric W. Biederman
David Hildenbrand  writes:

>>> ACPI SRAT is embedded into efi, we need to read out the rsdp pointer.
>>> If we don't pass the efi, it won't get the SRAT table correctly, if I
>>> remember correctly. Yeah, I remember kvm guest can get memory hotplugged
>>> with ACPI only, this won't happen on bare metal though. Need to check
>>> carefully. I have been using kvm guest with uefi firmware recently.
>> 
>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>> 
>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>> will not detect it automatically (good!), instead load the virtio-mem
>> driver and let it add memory back to the system.
>> 
>> I should probably play with kexec and virtio-mem once I have some spare
>> cycles ... to find out what's broken and needs to be addressed :)
>
> FWIW, I just gave virtio-mem and kexec/kdump a try.
>
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().
>
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

This sounds like a bug.

> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
>
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

This sounds like correct behavior to me.  If you add memory to the
system it is treated as memory to the system.

If we need to make it a special kind of memory with special rules we can
have some kind of special marking for the memory.  But hotplugged is not
in itself a sufficient criterion to say don't use this as normal memory.

If I take a huge server and plug in an extra DIMM it is just memory.

For a similarly huge server I might want to have the memory that the
system booted with be unpluggable, in case hardware error reporting
notices a DIMM generating a lot of memory errors.

Now perhaps virtualization needs a special tier of memory that should
only be used for cases where the memory is easily movable.

I am not familiar with virtio-mem but my skim of the initial design
suggests that virtio-mem was not designed to be such a special tier of
memory.  Perhaps something has changed?
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html


> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.

No.

> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").

How does the kernel memory allocator treat this memory?

The logic is simple.  If the kernel memory allocator treats that memory
as ordinary memory available for all uses it should be presented as
ordinary memory available for all uses.

If the kernel memory allocator treats that memory as special memory,
only available for uses where we can easily free it later and give it
back to the system, AKA it is special and not ordinary memory, then we
should mark it as such.
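A trivial sketch of that marking idea: the "System RAM (virtio-mem)" string comes from the quoted proposal above, while the helper itself is hypothetical, simply mapping the ordinary/special distinction to a /proc/iomem resource name so unaware tooling like kexec-tools skips the special regions:

```c
#include <stdbool.h>

/* Hypothetical helper: ordinary memory keeps the name every tool
 * recognizes; special (easily reclaimable) memory gets a distinct
 * name so it is not reused as a kexec destination. */
static const char *memory_resource_name(bool ordinary)
{
	return ordinary ? "System RAM" : "System RAM (virtio-mem)";
}
```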

Eric

p.s.  Please excuse me for jumping in I may be missing some important
context, but what I read when I saw this message in my inbox just seemed
very wrong.





Re: [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name

2020-04-15 Thread Eric W. Biederman
James Morse  writes:

> If kexec chooses to place the kernel in a memory region that was
> added after boot, we fail to boot as the kernel is running from a
> location that is not described as memory by the UEFI memory map or
> the original DT.
>
> To prevent unaware user-space kexec from doing this accidentally,
> give these regions a different name.

Please fix the problem and don't hack around it.

Eric



Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-04-15 Thread Eric W. Biederman
James Morse  writes:

> Memory added to the system by hotplug has a 'System RAM' resource created
> for it. This is exposed to user-space via /proc/iomem.
>
> This poses problems for kexec on arm64. If kexec decides to place the
> kernel in one of these newly onlined regions, the new kernel will find
> itself booting from a region not described as memory in the firmware
> tables.
>
> Arm64 doesn't have a structure like the e820 memory map that can be
> re-written when memory is brought online. Instead arm64 uses the UEFI
> memory map, or the memory node from the DT, sometimes both. We never
> rewrite these.
>
> Allow an architecture to specify a different name for these hotplug
> regions.

Gah.  No.

Please find a way to pass the current memory map to the loaded kexec'd
kernel.

Starting a kernel with no way for it to know what the current memory map
is just plain scary.

Eric



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-15 Thread Eric W. Biederman
James Morse  writes:

> An image loaded for kexec is not stored in place, instead its segments
> are scattered through memory, and are re-assembled when needed. In the
> meantime, the target memory may have been removed.
>
> Because mm is not aware that this memory is still in use, it allows it
> to be removed.
>
> Add a memory notifier to prevent the removal of memory regions that
> overlap with a loaded kexec image segment. e.g., when triggered from the
> Qemu console:
> | kexec_core: memory region in use
> | memory memory32: Offline failed.
>
> Signed-off-by: James Morse 

Given that we are talking about the destination pages for kexec
not where the loaded kernel is currently stored the description is
confusing.

Beyond that I think it would be better to simply unload the loaded
kernel at memory hotunplug time.  Usually somewhere in the loaded image
is a copy of the memory map at the time the kexec kernel was loaded.
That will invalidate the memory map as well.

All of this should be for a very brief window of a few seconds, as
the loaded kexec image is quite short.

So instead of failing in the notifier, if you could simply unload the
loaded image in the notifier I think that would be simpler and more
robust.  While still preventing the loaded image from falling over
when it starts executing.

Eric



Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-04-15 Thread Eric W. Biederman
James Morse  writes:

> Hello!
>
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
>
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).
>
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.
>
>
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
>
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
>
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
>
>
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.
>
>
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.

So I just looked through these quickly and I think there are real
problems here we can fix, and that are worth fixing.

However I am not thrilled with the fixes you propose.

Eric



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-13 Thread Eric W. Biederman
Andrew Morton  writes:

> On Mon, 13 Apr 2020 08:15:23 -0500 ebied...@xmission.com (Eric W. Biederman) 
> wrote:
>
>> > For 3), people can still use kexec_load and develop/fix for it, if no
>> > kexec_file_load supported. But 32-bit arm should be a different one,
>> > more like i386, we will leave it as is, and fix anything which could
>> > break it. But people really expects to improve or add feature to it? E.g
>> > in this patchset, the mem hotplug issue James raised, I assume James is
>> > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
>> > another reply, people even don't agree to continue supporting memory
>> > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
>> > bug on i386 with a patch, but people would rather set it as BROKEN.
>> 
>> For memory hotplug just reload.  Userspace already gets good events.
>> 
>> We should not expect anything except a panic kernel to be loaded over a
>> memory hotplug event. The kexec on panic code should actually be loaded
>> in a location that we don't relinquish if asked for it.
>
> Is that a nack for James's patchset?

I have just read the end of the thread and I have the sense that the
patchset had already been rejected.  I will see if I can go back and
read the beginning.

I was mostly reacting to the idea that you could stop maintaining an
interface that people are actively using because there is a newer
interface.

Eric




Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-13 Thread Eric W. Biederman
Baoquan He  writes:

> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
>> 
>> The only benefit of kexec_file_load is that it is simple enough from a
>> kernel perspective that signatures can be checked.
>
> We don't have this restriction any more with below commit:
>
> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> and KEXEC_SIG_FORCE")
>
> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> secure boot or legacy system for kexec/kdump. Being simple enough is
> enough to astract and convince us to use it instead. And kexec_file_load
> has been in use for several years on systems with secure boot, since
> added in 2014, on x86_64.

No.  Actually kexec_file_load is the less capable interface, and less
flexible interface.  Which is why it is appropriate for signature
verification.

>> kexec_load in every other respect is the more capable and functional
>> interface.  It makes no sense to get rid of it.
>> 
>> It does make sense to reload with a loaded kernel on memory hotplug.
>> That is simple and easy.  If we are going to handle something in the
>> kernel it should simply be an automated unloading of the kernel on memory
>> hotplug.
>> 
>> 
>> I think it would be irresponsible to deprecate kexec_load on any
>> platform.
>> 
>> I also suspect that kexec_file_load could be taught to copy the dtb
>> on arm32 if someone wants to deal with signatures.
>> 
>> We definitely can not even think of deprecating kexec_load until
>> architecture that supports it also supports kexec_file_load and everyone
>> is happy with that interface.  That is Linus's no regression rule.
>
> I should pick a milder word to express our tendency and tell our plan
> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> much. I didn't mean to say 'deprecate' at all when replied.
>
> The situation and trend I understand about kexec_load and kexec_file_load
> are:
>
> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> have yet, just as x86_64, arm64 and s390 have done;
>  
> 2) kexec_file_load is suggested to use, and take precedence over
> kexec_load in the future, if both are supported in one ARCH.

The deep problem is that kexec_file_load is distinctly less expressive
than kexec_load.

> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> and by ARCHes for back compatibility w/ kexec_file_load support.
>
> For 1) and 2), I think the reason is obvious as Eric said,
> kexec_file_load is simple enough. And currently, whenever we got a bug
> report, we may need fix them twice, for kexec_load and kexec_file_load.
> If kexec_file_load is made by default, e.g on x86_64, we will change it
> in kernel space only, for kexec_file_load. This is what I meant about
> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> Unless there's some critical/blocker bug in kexec_load, to corrupt the
> old kexec_load interface in old product.

Maybe.  The code that kexec_file_load sucked into the kernel is quite
stable and rarely needs changes except during a port of kexec to
another architecture.

Last I looked the real maintenance effort of kexec and kexec on panic was
in the drivers.  So I don't think we can use maintenance burden to decide
anything.

> For 3), people can still use kexec_load and develop/fix for it, if no
> kexec_file_load supported. But 32-bit arm should be a different one,
> more like i386, we will leave it as is, and fix anything which could
> break it. But people really expects to improve or add feature to it? E.g
> in this patchset, the mem hotplug issue James raised, I assume James is
> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> another reply, people even don't agree to continue supporting memory
> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> bug on i386 with a patch, but people would rather set it as BROKEN.

For memory hotplug just reload.  Userspace already gets good events.

We should not expect anything except a panic kernel to be loaded over a
memory hotplug event. The kexec on panic code should actually be loaded
in a location that we don't relinquish if asked for it.

Quite frankly at this point I would love to see the signature fad die,
which would allow us to remove kexec_file_load.  I still have not seen
the signature code used anywhere except by people anticipating trouble.

Given that Microsoft has already directly signed a malicious bootloader.
(Not in the Linux ecosystem.)  I don't even know if any of the reasons
for having kexec_file_load are legitimate.


If someone wants to do the work and ensure

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-12 Thread Eric W. Biederman


The only benefit of kexec_file_load is that it is simple enough from a
kernel perspective that signatures can be checked.

kexec_load in every other respect is the more capable and functional
interface.  It makes no sense to get rid of it.

It does make sense to reload with a loaded kernel on memory hotplug.
That is simple and easy.  If we are going to handle something in the
kernel it should simply be an automated unloading of the kernel on memory
hotplug.


I think it would be irresponsible to deprecate kexec_load on any
platform.

I also suspect that kexec_file_load could be taught to copy the dtb
on arm32 if someone wants to deal with signatures.

We definitely can not even think of deprecating kexec_load until
architecture that supports it also supports kexec_file_load and everyone
is happy with that interface.  That is Linus's no regression rule.


Eric



Re: [PATCH] crash_dump: remove saved_max_pfn

2020-03-30 Thread Eric W. Biederman
Kairui Song  writes:

> This variable is no longer used.
>
> saved_max_pfn was originally introduced in commit 92aa63a5a1bf ("[PATCH]
> kdump: Retrieve saved max pfn"), used to make sure that the user does not
> try to read the physical memory beyond saved_max_pfn. But since
> commit 921d58c0e699 ("vmcore: remove saved_max_pfn check")
> it's no longer used for the check.
>
> The only user left is the Calgary IOMMU, which started using it in
> commit 95b68dec0d52 ("calgary iommu: use the first kernels TCE tables
> in kdump"). But again, recently in commit 90dc392fc445 ("x86: Remove
> the calgary IOMMU driver"), the Calgary IOMMU was removed and this
> variable no longer has any user.
>
> So just remove it.
>
> Signed-off-by: Kairui Song 

Acked-by: "Eric W. Biederman" 

Can we merge this through the tip tree?


> ---
>  arch/x86/kernel/e820.c | 8 
>  include/linux/crash_dump.h | 2 --
>  kernel/crash_dump.c | 6 --
>  3 files changed, 16 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c5399e80c59c..4d13c57f370a 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -910,14 +910,6 @@ static int __init parse_memmap_one(char *p)
>   return -EINVAL;
>  
>   if (!strncmp(p, "exactmap", 8)) {
> -#ifdef CONFIG_CRASH_DUMP
> - /*
> -  * If we are doing a crash dump, we still need to know
> -  * the real memory size before the original memory map is
> -  * reset.
> -  */
> - saved_max_pfn = e820__end_of_ram_pfn();
> -#endif
>   e820_table->nr_entries = 0;
>   userdef = 1;
>   return 0;
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 4664fc1871de..bc156285d097 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -97,8 +97,6 @@ extern void unregister_oldmem_pfn_is_ram(void);
>  static inline bool is_kdump_kernel(void) { return 0; }
>  #endif /* CONFIG_CRASH_DUMP */
>  
> -extern unsigned long saved_max_pfn;
> -
>  /* Device Dump information to be filled by drivers */
>  struct vmcoredd_data {
>   char dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
> diff --git a/kernel/crash_dump.c b/kernel/crash_dump.c
> index 9c23ae074b40..92da32275af5 100644
> --- a/kernel/crash_dump.c
> +++ b/kernel/crash_dump.c
> @@ -5,12 +5,6 @@
>  #include 
>  #include 
>  
> -/*
> - * If we have booted due to a crash, max_pfn will be a very low value. We 
> need
> - * to know the amount of memory that the previous kernel used.
> - */
> -unsigned long saved_max_pfn;
> -
>  /*
>   * stores the physical address of elf header of crash image
>   *



Re: [PATCH v1 2/2] firmware: dmi_scan: Pass dmi_entry_point to kexec'ed kernel

2020-01-21 Thread Eric W. Biederman
Andy Shevchenko  writes:

> On Tue, Jan 21, 2020 at 12:18:03AM +0100, Ard Biesheuvel wrote:
>> On Mon, 20 Jan 2020 at 23:31, Andy Shevchenko  
>> wrote:
>> > On Mon, Jan 20, 2020 at 9:28 PM Eric W. Biederman  
>> > wrote:
>> > > Andy Shevchenko  writes:
>> > > > On Sat, Dec 17, 2016 at 06:57:21PM +0800, Dave Young wrote:
>
> ...
>
>> > > > Can we apply these patches for now until you will find better
>> > > > solution?
>> > >
>> > > Not a chance.  The patches don't apply to any kernel in the git history.
>> > >
>> > > Which may be part of your problem.  You are or at least were running
>> > > with code that has not been merged upstream.
>> >
>> > It's done against linux-next.
>> > Applied clearly. (Not the version in this more than yearly old series
>> > of course, that's why I told I can resend)
>> >
>> > > > P.S. I may resend them rebased on recent vanilla.
>> > >
>> > > Second.  I looked at your test results and they don't directly make
>> > > sense.  dmidecode bypasses the kernel completely or it did last time
>> > > I looked so I don't know why you would be using that to test if
>> > > something in the kernel is working.
>> > >
>> > > However dmidecode failing suggests that the actual problem is something
>> > > in the first kernel is stomping the dmi tables.
>> >
>> > See below.
>> >
>> > > Adding a command line option won't fix stomped tables.
>> >
>> > It provides a mechanism, which seems to be absent, to the second
>> > kernel to know where to look for SMBIOS tables.
>> >
>> > > So what I would suggest is:
>> > > a) Verify that dmidecode works before kexec.
>> >
>> > Yes, it does.
>> >
>> > > b) Test to see if dmidecode works after kexec.
>> >
>> > No, it doesn't.
>> >
>> > > c) Once (a) shows that dmidecode works and (b) shows that dmidecode
>> > >fails figure out what is stomping your dmi tables during or before
>> > >kexec and that is what should get fixed.
>> >
>> > The problem here as I can see it that EFI and kexec protocols are not
>> > friendly to each other.
>> > I'm not an expert in either. That's why I'm asking for possible
>> > solutions. And this needs to be done in kernel to allow drivers to
>> > work.
>> >
>> > Does the
>> >
>> > commit 4996c02306a25def1d352ec8e8f48895bbc7dea9
>> > Author: Takao Indoh 
>> > Date:   Thu Jul 14 18:05:21 2011 -0400
>> >
>> > ACPI: introduce "acpi_rsdp=" parameter for kdump
>> >
>> > description shed a light on this?
>> >
>> > > Now using a non-efi method of dmi detection relies on the
>> > > tables being between 0xF0000 and 0x100000.  AKA the last 64K
>> > > of the first 1MiB of memory.  You might check to see if your
>> > > dmi tables are in that address range.
>> >
>> > # dmidecode --no-sysfs
>> > # dmidecode 3.2
>> > Scanning /dev/mem for entry point.
>> > # No SMBIOS nor DMI entry point found, sorry.
>> >
>> > === with patch applied ===
>> > # dmidecode
>> > ...
>> > Release Date: 03/10/2015
>> > ...
>> >
>> > >
>> > > Otherwise I suspect the good solution is to give EFI its own page
>> > > tables in the kernel and switch to them whenever EFI functions are called.
>> > >
>> >
>> > > But on 32-bit the Linux kernel has historically been just fine directly
>> > > accessing the hardware, and ignoring EFI and all of the other BIOSes.
>> >
>> > It seems it is not only the 32-bit Linux kernel anymore: the MS
>> > Surface 3 runs 64-bit code.
>> >
>> > > So if that doesn't work on Intel Galileo that is probably a firmware
>> > > problem.
>> >
>> > It's not only about Galileo anymore.
>> >
>> 
>> Looking at the x86 kexec EFI code, it seems that it has special
>> handling for the legacy SMBIOS table address, but not for the SMBIOS3
>> table address, which was introduced to accommodate SMBIOS tables
>> living in memory that is not 32-bit addressable.
>> 
>> Could anyone check whether these systems provide SMBIOS 3.0 tables,
>> and whether their address

Re: [PATCH v1 2/2] firmware: dmi_scan: Pass dmi_entry_point to kexec'ed kernel

2020-01-21 Thread Eric W. Biederman
Jean Delvare  writes:

> On Mon, 20 Jan 2020 23:55:43 +0200, Andy Shevchenko wrote:
>> On Mon, Jan 20, 2020 at 11:44 PM Jean Delvare  wrote:
>> >
>> > On Mon, 20 Jan 2020 10:04:04 -0600, Eric W. Biederman wrote:  
>> > > Second.  I looked at your test results and they don't directly make
>> > > sense.  dmidecode bypasses the kernel completely or it did last time
>> > > I looked so I don't know why you would be using that to test if
>> > > something in the kernel is working.  
>> >
>> > That must have been long ago. A recent version of dmidecode (>= 3.0)
>> > running on a recent kernel  
>> > (>= d7f96f97c4031fa4ffdb7801f9aae23e96170a6f, v4.2) will read the DMI  
>> > data from /sys/firmware/dmi/tables, so it is very much relying on the
>> > kernel doing the right thing. If not, it will still try to fall back to
>> > reading from /dev/mem directly on certain architectures. You can force
>> > that old method with --no-sysfs.
>> >
>> > Hope that helps,  
>> 
>> I don't understand how it can possibly help for in-kernel code, like
>> DMI quirks in drivers.
>
> OK, just ignore me then, probably I misunderstood the point made by
> Eric.

No.  I just haven't dived into this area of code in a long time.

It seems a little indirect to use dmidecode as the test to see if the
kernel has the pointer to the DMI tables, but with the knowledge you
provided it seems like a perfectly valid test.

Eric


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 2/2] firmware: dmi_scan: Pass dmi_entry_point to kexec'ed kernel

2020-01-20 Thread Eric W. Biederman
Andy Shevchenko  writes:

> On Sat, Dec 17, 2016 at 06:57:21PM +0800, Dave Young wrote:
>> Ccing efi people.
>> 
>> On 12/16/16 at 02:33pm, Jean Delvare wrote:
>> > On Fri, 16 Dec 2016 14:18:58 +0200, Andy Shevchenko wrote:
>> > > On Fri, 2016-12-16 at 10:32 +0800, Dave Young wrote:
>> > > > On 12/15/16 at 12:28pm, Jean Delvare wrote:
>> > > > > I am no kexec expert but this confuses me. Shouldn't the second
>> > > > > kernel have access to the EFI systab as the first kernel does? It
>> > > > > includes many more pointers than just ACPI and DMI tables, and it
>> > > > > would seem inconvenient to have to pass all these addresses
>> > > > > individually explicitly.
>> > > > 
>> > > > Yes, in the modern Linux kernel kexec has support for EFI; I think it
>> > > > should work naturally, at least on x86_64.
>> > > 
>> > > Thanks for this good news!
>> > > 
>> > > Unfortunately Intel Galileo is 32-bit platform.
>> > 
>> > If it was done for X86_64 then maybe it can be generalized to X86?
>> 
>> For X86_64, we have a new way of EFI runtime memory mapping; the i386
>> code still uses the old ioremap way. It is impossible to use the same
>> way as X86_64 since the virtual address space is limited.
>> 
>> But maybe for 32-bit the kexec kernel can run in physical mode; I'm
>> not sure. I would suggest Andy do a test first with efi=noruntime for
>> the kexec 2nd kernel.
>
> Guys, it has been quite a while with no word from you. As I told you,
> the proposed workaround didn't help. Today I found that the Microsoft
> Surface 3 is also affected by this.
>
> Can we apply these patches for now until you find a better
> solution?

Not a chance.  The patches don't apply to any kernel in the git history.

Which may be part of your problem.  You are or at least were running
with code that has not been merged upstream.

> P.S. I may resend them rebased on recent vanilla.

Second.  I looked at your test results and they don't directly make
sense.  dmidecode bypasses the kernel completely or it did last time
I looked so I don't know why you would be using that to test if
something in the kernel is working.

However dmidecode failing suggests that the actual problem is something
in the first kernel is stomping the dmi tables.

Adding a command line option won't fix stomped tables.

So what I would suggest is:
a) Verify that dmidecode works before kexec.
b) Test to see if dmidecode works after kexec.
c) Once (a) shows that dmidecode works and (b) shows that dmidecode
   fails figure out what is stomping your dmi tables during or before
   kexec and that is what should get fixed.

Now using a non-EFI method of DMI detection relies on the
tables being between 0xF0000 and 0x100000, AKA the last 64K
of the first 1MiB of memory.  You might check to see if your
DMI tables are in that address range.

Otherwise I suspect the good solution is to give EFI its own page
tables in the kernel and switch to them whenever EFI functions are called.

But on 32-bit the Linux kernel has historically been just fine directly
accessing the hardware, and ignoring EFI and all of the other BIOSes.
So if that doesn't work on Intel Galileo, that is probably a firmware
problem.

Eric




Re: kexec_file overwrites reserved EFI ESRT memory

2019-11-26 Thread Eric W. Biederman
Michael Weiser  writes:

> Hello Eric,
> Hello Ard,
>
> on my machine, kexec_file loads the normal (not crash) kernel image
> right across the EFI ESRT reserved memory range:
>
> esrt: Reserving ESRT space from 0x74dd6f98 to 0x74dd6fd0.
> [...]
> kexec_file: kernel signature verification successful.
> kexec_file: Loading segment 0: buf=0xe99b31ad bufsz=0x5000 
> mem=0x91000 memsz=0x6000
> kexec_file: Loading segment 1: buf=0xe45cdeb8 bufsz=0x1240 
> mem=0x8f000 memsz=0x2000
> kexec_file: Loading segment 2: buf=0x096e6de9 bufsz=0x1133888 
> mem=0x7300 memsz=0x249a000
>
> This causes the following message by the kexec'd kernel:
>
> esrt: Unsupported ESRT version 2904149718861218184.
>
> (The image is rather large at 18MiB as it has a built-in initrd.)

When did x86_64 get support for ARCH_KEEP_MEMBLOCK?  I can't find it
anywhere.

My recollection is that on x86 the definitive specification of what is
reserved and what is not is the struct resource tree (aka /proc/iomem),
while some other architectures apparently do something else: the
memblock implementation.

Fundamentally, when deciding where to place a new kernel, kexec (either
user space or the in-kernel kexec_file implementation) needs to be able
to ask which memory areas are reserved.  What the buddy allocator does
is unimportant, as kexec copies memory from all over the place and
places it at the destination memory addresses at the time of the kexec
operation.

So my question is why doesn't the ESRT reservation wind up in
/proc/iomem?

Are you dealing with an embedded port that is being clever?

Or is there some subtle breakage now that x86 has memblock support that
/proc/iomem is no longer being properly maintained?

Eric

> Poking at the involved code a bit (as a layman) I found that the EFI
> code reserves the memory range using memblock_reserve() which is by all
> appearances correctly handed over to the buddy allocator as
> in-use/reserved. kexec_file on the other hand by default looks at iomem
> regions of type System RAM using walk_system_ram_res() and does not seem
> to have that particular information available to consider. (As may have
> become clear from this explanation I'm still somewhat fuzzy (to put it
> mildly) on the relationship of memblock, buddy and slab allocator and how
> (if at all) kexec_file interacts with them to a.) find available memory
> regions for the new kernel to load to and b.) tell them where it
> loaded the new kernel to so they don't use it any more.)
>
> As is to be expected, activating CONFIG_ARCH_KEEP_MEMBLOCK makes
> kexec_file use the preserved memblock structures and indeed end up using
> totally different memory regions and gets rid of the message:
>
> kexec_file: kernel signature verification successful.
> kexec_file: Loading segment 0: buf=0x2dea71f8 bufsz=0x5000 
> mem=0x47df8e000 memsz=0x6000
> kexec_file: Loading segment 1: buf=0x0686ff17 bufsz=0x1240 
> mem=0x47df8c000 memsz=0x2000
> kexec_file: Loading segment 2: buf=0xfc444e67 bufsz=0x1133888 
> mem=0x46900 memsz=0x2497000
>
> This is with 5.3.11 mainline and linux-next 5.4.0-rc8-next-20191122.
>
> I'm not actually trying to use ESRT for anything at this point but want
> to stop the boot message from messing up silent boot and suspect that
> this could potentially happen to other, more important EFI memory
> regions as well.
>
> I'm willing to chase this further but at this point I'm wondering
> whether it's the EFI code not reserving this memory area with enough
> emphasis (as iomem?) or kexec_file not checking usability of
> candidate memory regions rigorously enough (based on what other
> criteria?).
>
> Are there maybe any upcoming patches or subsystem-specific kernel trees
> I should try?
>
> Please let me know what other information may be helpful or if I should
> open a bug on bugzilla.kernel.org.
>
> Boot messages on normal boot:
> Linux version 5.3.11-gentoo (m@n) (gcc version 9.2.0 (Gentoo 9.2.0-r2 p3)) 
> #29 SMP Thu Nov 21 20:40:28 CET 2019
> Command line: 
> x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
> x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
> x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 
> 'compacted' format.
> BIOS-provided physical RAM map:
> BIOS-e820: [mem 0x-0x0009efff] usable
> BIOS-e820: [mem 0x0009f000-0x000f] reserved
> BIOS-e820: [mem 0x0010-0x763fafff] usable
> BIOS-e820: [mem 0x763fb000-0x79979fff] reserved
> BIOS-e820: [mem 0x7997a000-0x7

Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-24 Thread Eric W. Biederman
lijiang  writes:

>  * Returns the length of the argument (regardless of if it was
>  * truncated to fit in the buffer), or -1 on not found.
>  */
> static int
> __cmdline_find_option(const char *cmdline, int max_cmdline_size,
>   const char *option, char *buffer, int bufsize)
>
>
> According to the above code comment, it should be better like this:
>
> +   if (cmdline_find_option(boot_command_line, "crashkernel",
> +   NULL, 0) > 0) {
>
> After i test, i will post again.
>

This seems reasonable as we are dealing with x86-only code.

It would be nice if someone could generalize cmdline_find_option to be
arch-independent so that crash_core.c:parse_crashkernel could use it.
I don't think for this patchset, but it looks like an overdue cleanup.

With parse_crashkernel using strstr and this using another algorithm,
we run the risk of different kernel command line parsers giving
different results, disagreeing on whether "crashkernel=" is present
on the kernel command line.

Eric





Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-16 Thread Eric W. Biederman
lijiang  writes:

> 在 2019年10月15日 19:04, Eric W. Biederman 写道:
>> lijiang  writes:
>> 
>>> 在 2019年10月13日 11:54, Eric W. Biederman 写道:
>>>> Dave Young  writes:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>>> Lianbo Jiang  writes:
>>>>>>
>>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>>> 640k content to a backup region.
>>>>>>>
>>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>>> so lets clean up.
>>>>>>>
>>>>>>> Signed-off-by: Lianbo Jiang 
>>>>>>> ---
>>>>>>
>>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>>>>>>> *regs)
>>>>>>>  
>>>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>>>  
>>>>>>> -static unsigned long crash_zero_bytes;
>>>>>>> -
>>>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>>  {
>>>>>>> unsigned int *nr_ranges = arg;
>>>>>>> @@ -234,9 +232,15 @@ static int 
>>>>>>> prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>>  {
>>>>>>> struct crash_mem *cmem = arg;
>>>>>>>  
>>>>>>> -   cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>>> -   cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> -   cmem->nr_ranges++;
>>>>>>> +   if (res->start >= SZ_1M) {
>>>>>>> +   cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>>> +   cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> +   cmem->nr_ranges++;
>>>>>>> +   } else if (res->end > SZ_1M) {
>>>>>>> +   cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>>> +   cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> +   cmem->nr_ranges++;
>>>>>>> +   }
>>>>>>
>>>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>>>> comment.
>>>>>
>>>>> Indeed it needs some code comment, this is based on some offline
>>>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>>>> mapping the system ram.
>>>>>
>>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>>> kernel can use the low 1M memory because for example the trampoline
>>>>> code.
>>>>>
>>>>>>
>>>>>>>  
>>>>>>> return 0;
>>>>>>>  }
>>>>>>
>>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage 
>>>>>>> *image, struct boot_params *params)
>>>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>>> cmd.params = params;
>>>>>>>  
>>>>>>> -   /* Add first 640K segment */
>>>>>>> -   ei.addr = image->arch.backup_src_start;
>>>>>>> -   ei.size = image->arch.backup_src_sz;
>>>>>>> +   /*
>>>>>>> +* Add the low memory range[0x1000, SZ_1M], skip
>>>>>>> +* the first zero page.
>>>>>>> +*/
>>>>>>> +   ei.addr = PAGE_SIZE;
>>>>>>> +   ei.size = SZ_1M - PAGE

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-15 Thread Eric W. Biederman
lijiang  writes:

> 在 2019年10月12日 20:16, Dave Young 写道:
>> Hi Eric,
>> 
>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>> Lianbo Jiang  writes:
>>>
>>>> When the crashkernel kernel command line option is specified, the
>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>> necessary to create a backup region and also no need to copy the first
>>>> 640k content to a backup region.
>>>>
>>>> Currently, the code related to the backup region can be safely removed,
>>>> so lets clean up.
>>>>
>>>> Signed-off-by: Lianbo Jiang 
>>>> ---
>>>
>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>> --- a/arch/x86/kernel/crash.c
>>>> +++ b/arch/x86/kernel/crash.c
>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>>>> *regs)
>>>>  
>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>  
>>>> -static unsigned long crash_zero_bytes;
>>>> -
>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>  {
>>>>unsigned int *nr_ranges = arg;
>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct 
>>>> resource *res, void *arg)
>>>>  {
>>>>struct crash_mem *cmem = arg;
>>>>  
>>>> -  cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> -  cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> -  cmem->nr_ranges++;
>>>> +  if (res->start >= SZ_1M) {
>>>> +  cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> +  cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> +  cmem->nr_ranges++;
>>>> +  } else if (res->end > SZ_1M) {
>>>> +  cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>> +  cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> +  cmem->nr_ranges++;
>>>> +  }
>>>
>>> What is going on with this chunk?  I can guess but this needs a clear
>>> comment.
>> 
>> Indeed it needs some code comment, this is based on some offline
>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>> mapping the system ram.
>> 
>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>> kernel can use the low 1M memory because for example the trampoline
>> code.
>> 
> Thank you, Eric and Dave. I will add the code comment as below if it would be 
> OK.
>
> @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct 
> resource *res, void *arg)
>  {
> struct crash_mem *cmem = arg;
>  
> -   cmem->ranges[cmem->nr_ranges].start = res->start;
> -   cmem->ranges[cmem->nr_ranges].end = res->end;
> -   cmem->nr_ranges++;
> +   /*
> +* Currently, pass the low 1MiB range to kdump kernel in e820
> +* as system ram so that kdump kernel can also use the low 1MiB
> +* memory due to the real mode trampoline code.
> +* And later, the low 1MiB range will be excluded from the elf header,
> +* which will avoid remapping the 1MiB system ram when dumping
> +* vmcore.
> +*/
> +   if (res->start >= SZ_1M) {
> +   cmem->ranges[cmem->nr_ranges].start = res->start;
> +   cmem->ranges[cmem->nr_ranges].end = res->end;
> +   cmem->nr_ranges++;
> +   } else if (res->end > SZ_1M) {
> +   cmem->ranges[cmem->nr_ranges].start = SZ_1M;
> +   cmem->ranges[cmem->nr_ranges].end = res->end;
> +   cmem->nr_ranges++;
> +   }
>  
> return 0;
>  }

I just read through the appropriate section of crash.c and, the way
things are structured, doing this work in
prepare_elf64_ram_headers_callback is wrong.

This can be done in a simpler manner in elf_header_exclude_ranges.
Something like:

/* The low 1MiB is always reserved */
ret = crash_exclude_mem_range(cmem, 0, 1024*1024);
if (ret)
return ret;

Eric


Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-15 Thread Eric W. Biederman
lijiang  writes:

> 在 2019年10月13日 11:54, Eric W. Biederman 写道:
>> Dave Young  writes:
>> 
>>> Hi Eric,
>>>
>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>> Lianbo Jiang  writes:
>>>>
>>>>> When the crashkernel kernel command line option is specified, the
>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>> necessary to create a backup region and also no need to copy the first
>>>>> 640k content to a backup region.
>>>>>
>>>>> Currently, the code related to the backup region can be safely removed,
>>>>> so lets clean up.
>>>>>
>>>>> Signed-off-by: Lianbo Jiang 
>>>>> ---
>>>>
>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>> --- a/arch/x86/kernel/crash.c
>>>>> +++ b/arch/x86/kernel/crash.c
>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>>>>> *regs)
>>>>>  
>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>  
>>>>> -static unsigned long crash_zero_bytes;
>>>>> -
>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>  {
>>>>>   unsigned int *nr_ranges = arg;
>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct 
>>>>> resource *res, void *arg)
>>>>>  {
>>>>>   struct crash_mem *cmem = arg;
>>>>>  
>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> - cmem->nr_ranges++;
>>>>> + if (res->start >= SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + } else if (res->end > SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + }
>>>>
>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>> comment.
>>>
>>> Indeed it needs some code comment, this is based on some offline
>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>> mapping the system ram.
>>>
>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>> kernel can use the low 1M memory because for example the trampoline
>>> code.
>>>
>>>>
>>>>>  
>>>>>   return 0;
>>>>>  }
>>>>
>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, 
>>>>> struct boot_params *params)
>>>>>   memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>   cmd.params = params;
>>>>>  
>>>>> - /* Add first 640K segment */
>>>>> - ei.addr = image->arch.backup_src_start;
>>>>> - ei.size = image->arch.backup_src_sz;
>>>>> + /*
>>>>> +  * Add the low memory range[0x1000, SZ_1M], skip
>>>>> +  * the first zero page.
>>>>> +  */
>>>>> + ei.addr = PAGE_SIZE;
>>>>> + ei.size = SZ_1M - PAGE_SIZE;
>>>>>   ei.type = E820_TYPE_RAM;
>>>>>   add_e820_entry(params, &ei);
>>>>
>>>> Likewise here.  Why do we need a special case?
>>>> Why the magic with PAGE_SIZE?
>>>
>>> Good catch, the zero page part is useless, I think no other special
>>> reason, just assumed zero page is not usable, but it should be ok to
>>> remove the special handling, just pass 0 - 1M is good enough.
>> 
>> But if we have stopped special casing the low 1M.  Why do we need a
>> special case here at all?
>> 
> Here we need to pass the low memory range to the kdump kernel, which will
> guarantee the availability of low memory in the kdump kernel; otherwise,
> the kdump kernel won't use the low memory region.
>
>> If you need the special case it is almost certainly wro

Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-12 Thread Eric W. Biederman
Dave Young  writes:

> Hi Eric,
>
> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>> Lianbo Jiang  writes:
>> 
>> > When the crashkernel kernel command line option is specified, the
>> > low 1MiB memory will always be reserved, which makes that the memory
>> > allocated later won't fall into the low 1MiB area, thereby, it's not
>> > necessary to create a backup region and also no need to copy the first
>> > 640k content to a backup region.
>> >
>> > Currently, the code related to the backup region can be safely removed,
>> > so lets clean up.
>> >
>> > Signed-off-by: Lianbo Jiang 
>> > ---
>> 
>> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> > index eb651fbde92a..cc5774fc84c0 100644
>> > --- a/arch/x86/kernel/crash.c
>> > +++ b/arch/x86/kernel/crash.c
>> > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs 
>> > *regs)
>> >  
>> >  #ifdef CONFIG_KEXEC_FILE
>> >  
>> > -static unsigned long crash_zero_bytes;
>> > -
>> >  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>> >  {
>> >unsigned int *nr_ranges = arg;
>> > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct 
>> > resource *res, void *arg)
>> >  {
>> >struct crash_mem *cmem = arg;
>> >  
>> > -  cmem->ranges[cmem->nr_ranges].start = res->start;
>> > -  cmem->ranges[cmem->nr_ranges].end = res->end;
>> > -  cmem->nr_ranges++;
>> > +  if (res->start >= SZ_1M) {
>> > +  cmem->ranges[cmem->nr_ranges].start = res->start;
>> > +  cmem->ranges[cmem->nr_ranges].end = res->end;
>> > +  cmem->nr_ranges++;
>> > +  } else if (res->end > SZ_1M) {
>> > +  cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>> > +  cmem->ranges[cmem->nr_ranges].end = res->end;
>> > +  cmem->nr_ranges++;
>> > +  }
>> 
>> What is going on with this chunk?  I can guess but this needs a clear
>> comment.
>
> Indeed it needs some code comment, this is based on some offline
> discussion.  cat /proc/vmcore will give a warning because ioremap is
> mapping the system ram.
>
> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
> kernel can use the low 1M memory because for example the trampoline
> code.
>
>> 
>> >  
>> >return 0;
>> >  }
>> 
>> > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, 
>> > struct boot_params *params)
>> >memset(&cmd, 0, sizeof(struct crash_memmap_data));
>> >cmd.params = params;
>> >  
>> > -  /* Add first 640K segment */
>> > -  ei.addr = image->arch.backup_src_start;
>> > -  ei.size = image->arch.backup_src_sz;
>> > +  /*
>> > +   * Add the low memory range[0x1000, SZ_1M], skip
>> > +   * the first zero page.
>> > +   */
>> > +  ei.addr = PAGE_SIZE;
>> > +  ei.size = SZ_1M - PAGE_SIZE;
>> >ei.type = E820_TYPE_RAM;
>> >add_e820_entry(params, &ei);
>> 
>> Likewise here.  Why do we need a special case?
>> Why the magic with PAGE_SIZE?
>
> Good catch, the zero page part is useless, I think no other special
> reason, just assumed zero page is not usable, but it should be ok to
> remove the special handling, just pass 0 - 1M is good enough.

But if we have stopped special-casing the low 1M, why do we need a
special case here at all?

If you need the special case it is almost certainly wrong to say you
have RAM above 640KiB and below 1MiB.  That is the legacy ROM and video
MMIO area.

There is a reason the original code said 640KiB.

Eric


Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

2019-10-12 Thread Eric W. Biederman
Lianbo Jiang  writes:

> When the crashkernel kernel command line option is specified, the
> low 1MiB memory will always be reserved, which makes that the memory
> allocated later won't fall into the low 1MiB area, thereby, it's not
> necessary to create a backup region and also no need to copy the first
> 640k content to a backup region.
>
> Currently, the code related to the backup region can be safely removed,
> so lets clean up.
>
> Signed-off-by: Lianbo Jiang 
> ---

> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index eb651fbde92a..cc5774fc84c0 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  
>  #ifdef CONFIG_KEXEC_FILE
>  
> -static unsigned long crash_zero_bytes;
> -
>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>  {
>   unsigned int *nr_ranges = arg;
> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct 
> resource *res, void *arg)
>  {
>   struct crash_mem *cmem = arg;
>  
> - cmem->ranges[cmem->nr_ranges].start = res->start;
> - cmem->ranges[cmem->nr_ranges].end = res->end;
> - cmem->nr_ranges++;
> + if (res->start >= SZ_1M) {
> + cmem->ranges[cmem->nr_ranges].start = res->start;
> + cmem->ranges[cmem->nr_ranges].end = res->end;
> + cmem->nr_ranges++;
> + } else if (res->end > SZ_1M) {
> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
> + cmem->ranges[cmem->nr_ranges].end = res->end;
> + cmem->nr_ranges++;
> + }

What is going on with this chunk?  I can guess but this needs a clear
comment.

>  
>   return 0;
>  }

> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, 
> struct boot_params *params)
>   memset(&cmd, 0, sizeof(struct crash_memmap_data));
>   cmd.params = params;
>  
> - /* Add first 640K segment */
> - ei.addr = image->arch.backup_src_start;
> - ei.size = image->arch.backup_src_sz;
> + /*
> +  * Add the low memory range[0x1000, SZ_1M], skip
> +  * the first zero page.
> +  */
> + ei.addr = PAGE_SIZE;
> + ei.size = SZ_1M - PAGE_SIZE;
>   ei.type = E820_TYPE_RAM;
>   add_e820_entry(params, &ei);

Likewise here.  Why do we need a special case?
Why the magic with PAGE_SIZE?

Is this needed because of your other special case above?

When SME is active and the crashkernel command line is enabled do we
just want to leave the low 1MB unencrypted?  So we don't need any
special cases?

Eric


Re: [PATCH v2] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-10-07 Thread Eric W. Biederman
lijiang  writes:

> 在 2019年10月07日 17:33, Dave Young 写道:
>> Hi Lianbo,
>> On 10/07/19 at 03:08pm, Lianbo Jiang wrote:
>>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>>>
>>> Kdump kernel will reuse the first 640k region for several reasons;
>>> for example, the trampoline and conventional PC system BIOS regions may
>>> require allocating memory in this area. Obviously, kdump kernel will
>>> also overwrite the first 640k region, therefore, kernel has to copy
>>> the contents of the first 640k area to a backup area, which is done in
>>> purgatory(), because vmcore may need the old memory. When vmcore is
>>> dumped, kdump kernel will read the old memory from the backup area of
>>> the first 640k area.
>>>
>>> Basically, the main reason should be clear: the kernel does not correctly
>>> handle the first 640k region when SME is active, which causes the
>>> kernel not to properly copy this old memory to the backup area in
>>> purgatory(). Therefore, kdump kernel reads out the incorrect contents
>>> from the backup area when dumping vmcore. Finally, the phenomenon is
>>> as follow:
>>>
>>> [root linux]$ crash vmlinux 
>>> /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore
>>> WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values
>>>
>>>   KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux
>>> DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore  [PARTIAL 
>>> DUMP]
>>> CPUS: 128
>>> DATE: Thu Sep 19 08:31:18 2019
>>>   UPTIME: 00:01:21
>>> LOAD AVERAGE: 0.16, 0.07, 0.02
>>>TASKS: 1343
>>> NODENAME: amd-ethanol
>>>  RELEASE: 5.3.0-rc7+
>>>  VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019
>>>  MACHINE: x86_64  (2195 Mhz)
>>>   MEMORY: 127.9 GB
>>>PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>>  PID: 9789
>>>  COMMAND: "bash"
>>> TASK: "89711894ae80  [THREAD_INFO: 89711894ae80]"
>>>  CPU: 83
>>>STATE: TASK_RUNNING (PANIC)
>>>
>>> crash> kmem -s|grep -i invalid
>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid 
>>> freepointer:a6086ac099f0c5a4
>>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid 
>>> freepointer:a6086ac099f0c5a4
>>> crash>
>>>
>>> BTW: I also tried to fix the above problem in purgatory(), but there
>>> are too many restricts in purgatory() context, for example: i can't
>>> allocate new memory to create the identity mapping page table for SME
>>> situation.
>>>
>>> Currently, there are two places where the first 640k area is needed,
>>> the first one is in the find_trampoline_placement(), another one is
>>> in the reserve_real_mode(), and their content doesn't matter. To avoid
>>> the above error, lets occupy the remain memory of the first 640k region
>>> (expect for the trampoline and real mode) so that the allocated memory
>>> does not fall into the first 640k area when SME is active, which makes
>>> us not to worry about whether kernel can correctly copy the contents of
>>> the first 640k area to a backup region in the purgatory().
>>>
>>> Signed-off-by: Lianbo Jiang 
>>> ---
>>> Changes since v1:
>>> 1. Improve patch log
>>> 2. Change the checking condition from sme_active() to sme_active()
>>>&& strstr(boot_command_line, "crashkernel=")
>>>
>>>  arch/x86/kernel/setup.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>> index 77ea96b794bd..bdb1a02a84fd 100644
>>> --- a/arch/x86/kernel/setup.c
>>> +++ b/arch/x86/kernel/setup.c
>>> @@ -1148,6 +1148,9 @@ void __init setup_arch(char **cmdline_p)
>>>  
>>> reserve_real_mode();
>>>  
>>> +   if (sme_active() && strstr(boot_command_line, "crashkernel="))
>>> +   memblock_reserve(0, 640*1024);
>>> +
>> 
>> It seems you missed the comment about "unconditionally do it"; checking
>> only for the crashkernel param looks better.
>> 
> If so, it means that copying the first 640k to a backup region is no
> longer needed, and I should post a patch series to remove
> copy_backup_region(). Any ideas?
>
>> Also I noticed re

Re: [PATCH] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-09-30 Thread Eric W. Biederman
Baoquan He  writes:

> On 09/27/19 at 09:32pm, Eric W. Biederman wrote:
>> Baoquan He  writes:
>> 
>> > On 09/27/19 at 03:49pm, Eric W. Biederman wrote:
>> >> Dave Young  writes:
>> >> >> In order to avoid such a problem, let's occupy the first 640k region
>> >> >> when SME is active, which will ensure that allocated memory does not
>> >> >> fall into the first 640k area. So there is no need to worry about
>> >> >> whether the kernel can correctly copy the contents of the first 640K
>> >> >> area to a backup region in purgatory().
>> >> 
>> >> We must occupy part of the first 640k so that we can start up secondary
>> >> cpus unless someone has added another way to do that in recent years on
>> >> SME capable cpus.
>> >> 
>> >> Further there is Firmware/BIOS interaction that happens within those
>> >> first 640K.
>> >> 
>> >> Furthermore the kdump kernel needs to be able to read all of the memory
>> >> that the previous kernel could read.  Otherwise we can't get a crash
>> >> dump.
>> >> 
>> >> So I do not think ignoring the first 640K is the correct resolution
>> >> here.
>> >> 
>> >> > The log is too brief. I know you made some other attempts to fix
>> >> > this, but the patch log does not show why the 640k cannot be copied
>> >> > correctly in the current kdump code, in purgatory here.
>> >> >
>> >> > Also, this patch seems to work in your test, but let's still see
>> >> > whether other people can comment on whether it is safe, and whether
>> >> > there are risks other than wasting a small chunk of memory. If it is
>> >> > safe, kdump can just drop the backup logic and use this in common
>> >> > code instead of doing it only for SME.
>> >> 
>> >> Exactly.
>> >> 
>> >> I think at best this avoids the symptoms, but does not give a reliable
>> >> crash dump.
>> >
>> > Sorry, didn't notice this comment at bottom.
>> >
>> > From code, currently the first 640K area is needed in two places.
>> > One is for 5-level trampoline during boot compressing stage, in
>> > find_trampoline_placement(). 
>> >
>> > The other is in reserve_real_mode(), as you mentioned, for application
>> > CPU booting.
>> >
>> > Only allow these two to put data inside the first 640K, then lock it
>> > down. It should not impact crash dump collection or parsing, and the
>> > content of these two doesn't matter.
>> 
>> Do I understand correctly that the idea is that the kernel
>> that may crash will never touch these pages?  And that the reservation
>
> Thanks a lot for your comments, Eric.
>
> And yes. Except for the reserved real mode trampoline for secondary CPUs:
> if the trampoline is allocated in this area, it will be reused. We search
> for the real mode area under 1M. Otherwise, the kernel that may crash
> will never touch these pages.
>
>> is not in the kernel that recovers from the crash?  That definitely
>
> You mean the kdump kernel? Only the e820 reserved regions that are
> inserted into io_resource will be passed to the kdump kernel; a memblock
> reserved region is only seen by the present kernel.
>
> But the content in the first 640K is not erased; the kdump kernel will
> ignore it and treat it as an available memory resource for the memory
> allocator.
>
>
>> needs a little better description.  I know it is not a lot on modern
>> systems but reserving an extra 1M of memory to avoid having to special
>> case it later seems in need of calling out.
>> 
>> I have an old system around on which I think 640K is about 25% of
>> memory.
>
> Understood. Basically 640K is wasted in this case. But we only do this
> in the SME case; a condition check is added. And systems with SME are
> pretty new models, so this should not impact old systems.

The conditional really should be based on whether we are reserving memory
for a kdump kernel, i.e. whether crashkernel=XXX is specified on the
kernel command line.

At which point I think it would be very reasonable to unconditionally
reserve the low 640k and make the whole thing a non-issue.  This would
allow the kdump code to just not do anything special for any of the
weird special cases.

It isn't perfect because we need a page or so used in the first kernel
for bootstrapping the secondary cpus, but that seems like the least of
evils.  Especial

Re: [PATCH] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-09-27 Thread Eric W. Biederman
Baoquan He  writes:

> On 09/27/19 at 03:49pm, Eric W. Biederman wrote:
>> Dave Young  writes:
>> >> In order to avoid such a problem, let's occupy the first 640k region
>> >> when SME is active, which will ensure that allocated memory does not
>> >> fall into the first 640k area. So there is no need to worry about
>> >> whether the kernel can correctly copy the contents of the first 640K
>> >> area to a backup region in purgatory().
>> 
>> We must occupy part of the first 640k so that we can start up secondary
>> cpus unless someone has added another way to do that in recent years on
>> SME capable cpus.
>> 
>> Further there is Firmware/BIOS interaction that happens within those
>> first 640K.
>> 
>> Furthermore the kdump kernel needs to be able to read all of the memory
>> that the previous kernel could read.  Otherwise we can't get a crash
>> dump.
>> 
>> So I do not think ignoring the first 640K is the correct resolution
>> here.
>> 
>> > The log is too brief. I know you made some other attempts to fix this,
>> > but the patch log does not show why the 640k cannot be copied correctly
>> > in the current kdump code, in purgatory here.
>> >
>> > Also, this patch seems to work in your test, but let's still see
>> > whether other people can comment on whether it is safe, and whether
>> > there are risks other than wasting a small chunk of memory. If it is
>> > safe, kdump can just drop the backup logic and use this in common code
>> > instead of doing it only for SME.
>> 
>> Exactly.
>> 
>> I think at best this avoids the symptoms, but does not give a reliable
>> crash dump.
>
> Sorry, didn't notice this comment at bottom.
>
> From code, currently the first 640K area is needed in two places.
> One is for 5-level trampoline during boot compressing stage, in
> find_trampoline_placement(). 
>
> The other is in reserve_real_mode(), as you mentioned, for application
> CPU booting.
>
> Only allow these two to put data inside the first 640K, then lock it
> down. It should not impact crash dump collection or parsing, and the
> content of these two doesn't matter.

Apologies.  Do I understand correctly that the idea is that the kernel
that may crash will never touch these pages?  And that the reservation
is not in the kernel that recovers from the crash?  That definitely
needs a little better description.  I know it is not a lot on modern
systems but reserving an extra 1M of memory to avoid having to special
case it later seems in need of calling out.

I have an old system around on which I think 640K is about 25% of
memory.

How we interact with BIOS tables in the first 640k needs some
explanation.  Both in the first kernel and in the crash kernel.

Eric


Re: [PATCH] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

2019-09-27 Thread Eric W. Biederman
Dave Young  writes:

> Hi Lianbo,
>
> For kexec/kdump patches, please remember to cc kexec list next time.
> Also it is definitely kdump specific issue, I added Vivek and Eric in
> cc. 
>
> On 09/20/19 at 11:53am, Lianbo Jiang wrote:
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
>> 
>> The kdump kernel will reuse the first 640k region for several reasons;
>> for example, the trampoline and the conventional PC system BIOS region
>> may require memory to be allocated in this area. Obviously, the kdump
>> kernel will also overwrite the first 640k region, so the kernel has to
>> copy the contents of the first 640k area to a backup area, which is
>> done in purgatory(), because the vmcore may need the old memory. When
>> the vmcore is dumped, the kdump kernel reads the old memory from the
>> backup area of the first 640k area.
>>
>> Basically, the main reason should be clear: the kernel does not
>> correctly handle the first 640k region when SME is active, which causes
>> the kernel to not properly copy this old memory to the backup area in
>> purgatory(). Therefore, the kdump kernel reads incorrect contents from
>> the backup area when dumping the vmcore. The symptom looks like this:
>> 
>> [root linux]$ crash vmlinux /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore
>> WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values
>> 
>>   KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux
>> DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore  [PARTIAL DUMP]
>> CPUS: 128
>> DATE: Thu Sep 19 08:31:18 2019
>>   UPTIME: 00:01:21
>> LOAD AVERAGE: 0.16, 0.07, 0.02
>>TASKS: 1343
>> NODENAME: amd-ethanol
>>  RELEASE: 5.3.0-rc7+
>>  VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019
>>  MACHINE: x86_64  (2195 Mhz)
>>   MEMORY: 127.9 GB
>>PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>  PID: 9789
>>  COMMAND: "bash"
>> TASK: "89711894ae80  [THREAD_INFO: 89711894ae80]"
>>  CPU: 83
>>STATE: TASK_RUNNING (PANIC)
>> 
>> crash> kmem -s|grep -i invalid
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid freepointer:a6086ac099f0c5a4
>> crash>
>> 
>> In order to avoid such a problem, let's occupy the first 640k region
>> when SME is active, which will ensure that allocated memory does not
>> fall into the first 640k area. So there is no need to worry about
>> whether the kernel can correctly copy the contents of the first 640K
>> area to a backup region in purgatory().

We must occupy part of the first 640k so that we can start up secondary
cpus unless someone has added another way to do that in recent years on
SME capable cpus.

Further there is Firmware/BIOS interaction that happens within those
first 640K.

Furthermore the kdump kernel needs to be able to read all of the memory
that the previous kernel could read.  Otherwise we can't get a crash
dump.

So I do not think ignoring the first 640K is the correct resolution
here.

> The log is too brief. I know you made some other attempts to fix this,
> but the patch log does not show why the 640k cannot be copied correctly
> in the current kdump code, in purgatory here.
>
> Also, this patch seems to work in your test, but let's still see whether
> other people can comment on whether it is safe, and whether there are
> risks other than wasting a small chunk of memory. If it is safe, kdump
> can just drop the backup logic and use this in common code instead of
> doing it only for SME.

Exactly.

I think at best this avoids the symptoms, but does not give a reliable
crash dump.

Eric


Re: [v1 0/5] allow to reserve memory for normal kexec kernel

2019-07-08 Thread Eric W. Biederman
Pavel Tatashin  writes:

> Currently, it is only allowed to reserve memory for the crash kernel,
> because having memory reserved is a requirement for booting into the
> crash kernel without touching the memory of the crashed kernel.
>
> The second benefit for having memory reserved for kexec kernel is
> that it does not require a relocation after segments are loaded into
> memory.
>
> If kexec functionality is used for a fast system update with minimal
> downtime, the relocation of kernel + initramfs might take a significant
> portion of the reboot time.
>
> In fact, on the machine that we are using, which has an ARM64 processor,
> it takes 0.35s to relocate during kexec, thus taking 52% of the kernel
> reboot time:
>
> kernel shutdown   0.03s
> relocation0.35s
> kernel startup0.29s
>
> The Image is 13M and the initramfs is 24M. If the initramfs grows, the
> relocation time increases proportionally.

Something is very very wrong there.

Last time I measured memory bandwidth seriously I could touch a gigabyte
per second easily, and that was nearly 20 years ago.  Did you manage to
disable caching, or do you have some particularly slow code doing the
relocations?

There is a serious cost to reserving memory in that it is simply not
available at other times.  For kexec on panic there is no other reliable
way to get memory that won't be DMA'd to.

We have options in this case and I would strongly encourage you to track
down why that copy in relocation is so very slow.  I suspect a 4KiB page
size is large enough that it can swamp pointer following costs.

My back-of-the-napkin math says that even 20 years ago your copying
costs should have been only 0.037s.  The only machine I have ever tested
on where the copy costs were noticeable was my old 386.

Maybe I am out to lunch here but a claim that your memory only runs
at 100MiB/s (the speed of my spinning rust hard drive) is rather
incredible.

Eric


Re: [PATCH] x86/kexec: prefer _THIS_IP_ to current_text_addr

2018-08-21 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Mon, Aug 20, 2018 at 10:58 AM Nick Desaulniers
>  wrote:
>>
>> + akpm, Linus
>>
>> Bumping for review.
>
> Ugh. I am not personally a huge fan of this endless "fix up one at a time".
>
> Just do a patch that removes current_text_addr() entirely and be done
> with it, if that's what we want the end result to be.
>
> Don't bother with these small "let's remove the remaining ones one by
> one". Just get it over and done with.

One is generic code and the other is assembly, but both appear to do the
same thing without unexpected complexity.

That said, the patch earlier in this thread has clearly never been
compiled, as it uses THIS_IP instead of _THIS_IP_, and there is no
define of THIS_IP in the kernel.

Eric


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] uts: Don't randomize "struct uts_namespace".

2018-07-06 Thread Eric W. Biederman
Tetsuo Handa  writes:

> I'm waiting for a response from the makedumpfile developers.
>
> But makedumpfile is a tool for saving kernel crash dump.
> If makedumpfile cannot work, we cannot use kernel crash dump.

I suspect the version string is comparable in size to a pointer to the
version string, and as such would be a better candidate to include in
the handful of ELF notes we provide for makedumpfile and its ilk to
bootstrap.

I assume getting the kernel's version string will be enough to track
back to an individual vmlinux and figure out the layout of everything
else.

Eric



Re: [PATCH] kexec: yield to scheduler when loading kimage segments

2018-06-11 Thread Eric W. Biederman
Jarrett Farnitano  writes:

> Without yielding while loading kimage segments, a large initrd
> will block all other work on the CPU performing the load until
> it is completed. For example, loading an initrd of 200MB on a
> low power single core system will lock up the system for a few
> seconds.
>
> To increase system responsiveness to other tasks at that time,
> call cond_resched() in both the crash kernel and normal kernel
> segment loading loops.
>
> Signed-off-by: Jarrett Farnitano 

Reviewed-by: "Eric W. Biederman" 

> ---
>  kernel/kexec_core.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 5616755..8ee07d6 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -783,6 +783,8 @@ static int kimage_load_normal_segment(struct kimage 
> *image,
>   else
>   buf += mchunk;
>   mbytes -= mchunk;
> +
> + cond_resched();
>   }
>  out:
>   return result;
> @@ -847,6 +849,8 @@ static int kimage_load_crash_segment(struct kimage *image,
>   else
>   buf += mchunk;
>   mbytes -= mchunk;
> +
> + cond_resched();
>   }
>  out:
>   return result;


