Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-02 Thread Dave Young
On 11/29/19 at 04:27pm, Michael Weiser wrote:
> Hello Dave,
> 
> On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> 
> > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > user space or the in kernel kexec_file implementation) needs to be able
> > > > to ask the question which memory ares are reserved.
> [...]
> > > > So my question is why doesn't the ESRT reservation wind up in
> > > > /proc/iomem?
> > > 
> > > My guess is that the focus was that some EFI structures need to be kept
> > > around accross the life cycle of *one* running kernel and
> > > memblock_reserve() was enough for that. Marking them so they survive
> > > kexecing another kernel might just never have cropped up thus far. Ard
> > > or Matt would know.
> > Can you check your un-reserved memory, if your memory falls into EFI
> > BOOT* then in X86 you can use something like below if it is not covered:
> 
> > void __init efi_esrt_init(void)
> > {
> > ...
> > pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > if (md.type == EFI_BOOT_SERVICES_DATA)
> > efi_mem_reserve(esrt_data, esrt_data_size);
> > ...
> > }
> 
> Please bear with me if I'm a bit slow on the uptake here: On my machine,
> the esrt module reports at boot:
> 
> [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> 0x74dd2fd0.
> 
> This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> code you quote reserve it using memblock_reserve() shown by
> memblock=debug:
> 
> [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> efi_mem_reserve+0x1d/0x2b
> 
> It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> as shown by efi=debug:
> 
> [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  |   
> |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  |   
> |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  |   
> |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> 
> This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> from calling __memblock_free_late() on it. And indeed, memblock=debug does
> not report this area as being free'd while the surrounding ones are:
> 
> [0.178369] __memblock_free_late: [0x74dd3000-0x75becfff] 
> efi_free_boot_services+0x126/0x1f8
> [0.178658] __memblock_free_late: [0x6d635000-0x74dd1fff] 
> efi_free_boot_services+0x126/0x1f8
> 
> The esrt area does not show up in /proc/iomem though:
> 
> 0010-763f5fff : System RAM
>   6200-62a00d80 : Kernel code
>   62c0-62f15fff : Kernel rodata
>   6300-630ea8bf : Kernel data
>   63fed000-641f : Kernel bss
>   6500-6aff : Crash kernel
> 
> And thus kexec loads the new kernel right over that area as shown when
> enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> and 0x7300+0x24be000 = 0x754be000):
> 
> [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> bufsz=0x5000 mem=0x98000 memsz=0x6000
> [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> bufsz=0x1240 mem=0x96000 memsz=0x2000
> [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> 
> ... because it looks for any memory hole large enough in iomem resources
> tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> excluded from on my system.
> 
> Looking some more at efi_arch_mem_reserve() I see that it also registers
> the area with efi.memmap and installs it using efi_memmap_install().
> which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> of the comments in the source of memremap(), MEMREMAP_WB does specifically
> *not* reserve that memory in any way.
> 
> > Unfortunately I noticed there are different requirements/ways for
> > different types of "reserved" memory.  But that is another topic..
> 
> I tried to reserve the area with something like this:
> 
> t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 4de244683a7e..b86a5df027a2 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
> efi_memory_desc_t md;
> int num_entries;
> void *new;
> +   struct resource *res;
>  
> if (efi_mem_desc_lookup(addr, &md) ||
> md.type != EFI_BOOT_SERVICES_DATA) {
> @@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
> early_memunmap(new, new_size);
>  
> efi_memmap_install(new_phys, num_entries);
> +
> +   res = memblock_alloc(sizeof(*res),

Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-02 Thread Dave Young
Add more cc
On 12/02/19 at 04:58pm, Dave Young wrote:
> On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > Hello Dave,
> > 
> > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > 
> > > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > > user space or the in kernel kexec_file implementation) needs to be 
> > > > > able
> > > > > to ask the question which memory ares are reserved.
> > [...]
> > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > /proc/iomem?
> > > > 
> > > > My guess is that the focus was that some EFI structures need to be kept
> > > > around accross the life cycle of *one* running kernel and
> > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > or Matt would know.
> > > Can you check your un-reserved memory, if your memory falls into EFI
> > > BOOT* then in X86 you can use something like below if it is not covered:
> > 
> > > void __init efi_esrt_init(void)
> > > {
> > > ...
> > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > >   efi_mem_reserve(esrt_data, esrt_data_size);
> > > ...
> > > }
> > 
> > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > the esrt module reports at boot:
> > 
> > [0.001244] esrt: Reserving ESRT space from 0x74dd2f98 to 
> > 0x74dd2fd0.
> > 
> > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > code you quote reserve it using memblock_reserve() shown by
> > memblock=debug:
> > 
> > [0.001246] memblock_reserve: [0x74dd2f98-0x74dd2fcf] 
> > efi_mem_reserve+0x1d/0x2b
> > 
> > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > as shown by efi=debug:
> > 
> > [0.178111] efi: mem10: [Boot Data  |   |  |  |  |  |  |  |  |   
> > |WB|WT|WC|UC] range=[0x74dd3000-0x75becfff] (14MB)
> > [0.178113] efi: mem11: [Boot Data  |RUN|  |  |  |  |  |  |  |   
> > |WB|WT|WC|UC] range=[0x74dd2000-0x74dd2fff] (0MB)
> > [0.178114] efi: mem12: [Boot Data  |   |  |  |  |  |  |  |  |   
> > |WB|WT|WC|UC] range=[0x6d635000-0x74dd1fff] (119MB)
> > 
> > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > not report this area as being free'd while the surrounding ones are:
> > 
> > [0.178369] __memblock_free_late: 
> > [0x74dd3000-0x75becfff] efi_free_boot_services+0x126/0x1f8
> > [0.178658] __memblock_free_late: 
> > [0x6d635000-0x74dd1fff] efi_free_boot_services+0x126/0x1f8
> > 
> > The esrt area does not show up in /proc/iomem though:
> > 
> > 0010-763f5fff : System RAM
> >   6200-62a00d80 : Kernel code
> >   62c0-62f15fff : Kernel rodata
> >   6300-630ea8bf : Kernel data
> >   63fed000-641f : Kernel bss
> >   6500-6aff : Crash kernel
> > 
> > And thus kexec loads the new kernel right over that area as shown when
> > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x7300
> > and 0x7300+0x24be000 = 0x754be000):
> > 
> > [  650.007695] kexec_file: Loading segment 0: buf=0x3a9c84d6 
> > bufsz=0x5000 mem=0x98000 memsz=0x6000
> > [  650.007699] kexec_file: Loading segment 1: buf=0x17b2b9e6 
> > bufsz=0x1240 mem=0x96000 memsz=0x2000
> > [  650.007703] kexec_file: Loading segment 2: buf=0xfdf72ba2 
> > bufsz=0x1150888 mem=0x7300 memsz=0x24be000
> > 
> > ... because it looks for any memory hole large enough in iomem resources
> > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > excluded from on my system.
> > 
> > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > the area with efi.memmap and installs it using efi_memmap_install().
> > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > *not* reserve that memory in any way.
> > 
> > > Unfortunately I noticed there are different requirements/ways for
> > > different types of "reserved" memory.  But that is another topic..
> > 
> > I tried to reserve the area with something like this:
> > 
> > t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 4de244683a7e..b86a5df027a2 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> > size)
> > efi_memory_desc_t md;
> > int num_entries;
> > void *new;
> > +   struct resource *res;
> >  
> > if (efi_mem_desc_lookup(addr, &md) ||
> >

Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread Petr Mladek
Hi,

I have seen few prelimitary versions before this public one.
I am either happy with it or blind to see new problems[*].

It would be great if anyone else could look at it. Especially
I am intreseted:

  + Whether the algorithm can be understood by people who
see it for the "first" time.

  + Whether there are any obvious races. I wonder if our assumtions
about atomic_cmpxchg() guaranttes are correct.


[*] I have found one theoretical ABA problem. I think that is
really rather theoretical and should not block this patch.
I think that we could do something to prevent it either
now or later. See below for more details


On Thu 2019-11-28 02:58:33, John Ogness wrote:
> Add a new lockless ringbuffer implementation to be used for printk.
> First only add support for writers. Reader support will be added in
> a follow-up commit.
> 
> Signed-off-by: John Ogness 
> ---
>  kernel/printk/printk_ringbuffer.c | 676 ++
>  kernel/printk/printk_ringbuffer.h | 239 +++
>  2 files changed, 915 insertions(+)
>  create mode 100644 kernel/printk/printk_ringbuffer.c
>  create mode 100644 kernel/printk/printk_ringbuffer.h
> 
> diff --git a/kernel/printk/printk_ringbuffer.c 
> b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index ..09c32e52fd40
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -0,0 +1,676 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "printk_ringbuffer.h"
> +
> +/**
> + * DOC: printk_ringbuffer overview
> + *
> + * Data Structure
> + * --
> + * The printk_ringbuffer is made up of 3 internal ringbuffers::
> + *
> + * * desc_ring:  A ring of descriptors. A descriptor contains all record
> + *   meta-data (sequence number, timestamp, loglevel, etc.) 
> as
> + *   well as internal state information about the record and
> + *   logical positions specifying where in the other
> + *   ringbuffers the text and dictionary strings are located.
> + *
> + * * text_data_ring: A ring of data blocks. A data block consists of a 32-bit
> + *   integer (ID) that maps to a desc_ring index followed by
> + *   the text string of the record.
> + *
> + * * dict_data_ring: A ring of data blocks. A data block consists of a 32-bit
> + *   integer (ID) that maps to a desc_ring index followed by
> + *   the dictionary string of the record.
> + *

I slightly miss some top level description of the approach. Especially
how the code deals with races. The following comes to my mind:


The state of each entry is always defined by desc->state_val.
It consists of 30-bit id and 2-bit state value, see DESC_INDEX()
and get_desc_state.

prb_reserve() call returns descriptor in a reserved state. It points
to reserved data blocks in the data rings.

The algorithm starts with using never used descriptors and data
blocks. Later it uses descriptors and data blocks in the reusable
state.

The descriptors and data space are reused in a round-robin fahion.
Only records in the committed state could be moved to reusable state.

prb_reserve() has to do several steps. Namely it has to update head
and tail positions in all ring buffers. Also it has to manipulare
desc->state_val. Most races are avoided by using atomic_cmpxchg()
operations. They make sure that all values are moved from one
well defined state to another well defined state.

ABA races are avoided by using logical positions and indexes.
The rinbuffer must be reused many times before these values
overflow. Anyway, logical positions to data ring could not overflow
wildly. They are manipulated only when the descriptor is already
reserved. The descriptor could get reused only when it was commited
before. It means that there is a limited number of writers
until they would need to reuse a particular descriptor. [*]

Back to the prb_reserve() algorithm. The repeated pattern is that
it does not matter what caller invalidates oldest entries. They are
fine when it has already been done by another writer in the meantime.
The real winner is the caller that is able to move the head position
and reserve the space. Others need to go back to invalidating the oldest
entry, etc.

prb_reserve() caller has exclusive write access to the reserved
descriptor and data blocks. It has to call prb_commit() when finished.
It is a signal that the data are valid for readers. But it is also
a signal that the descriptor and the space might get reused be
other writers.


Finally, readers just need to check the state of the descriptor
before and after reading the data. The data are correct only
when the descriptor is in committed state.

[*] Hmm, the ABA problem might theoretically happen in desc_reserve(),
see below.



> +/*
> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
> + * block do

Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread Petr Mladek
On Mon 2019-12-02 16:48:41, Petr Mladek wrote:
> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> > +{
> > +   struct prb_desc_ring *desc_ring = &rb->desc_ring;
> > +   struct prb_desc *desc;
> > +   u32 id_prev_wrap;
> > +   u32 head_id;
> > +   u32 id;
> > +
> > +   head_id = atomic_read(&desc_ring->head_id);
> > +
> > +   do {
> > +   desc = to_desc(desc_ring, head_id);
> > +
> > +   id = DESC_ID(head_id + 1);
> > +   id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> > +
> > +   if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> > +   if (!desc_push_tail(rb, id_prev_wrap))
> > +   return false;
> > +   }
> > +   } while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
> 
> Hmm, in theory, ABA problem might cause that we successfully
> move desc_ring->head_id when tail has not been pushed yet.
> 
> As a result we would never call desc_push_tail() until
> it overflows again.
> 
> I am not sure if we need to take care of it. The code is called with
> interrupts disabled. IMHO, only NMI could cause ABA problem
> in reality. But the game (debugging) is lost anyway when NMI ovewrites
> the buffer several times.

BTW: If I am counting correctly. The ABA problem would happen when
exactly 2^30 (1G) messages is written in the mean time.

> Also it should not be a complete failure. We still could bail out when
> the previous state was not reusable. We are on the safe side
> when it was reusable.
> 
> Also we could probably detect when desc_ring->tail_id is not
> updated any longer and do something about it.
> 
> > +
> > +   desc = to_desc(desc_ring, id);
> > +
> > +   /* Set the descriptor's ID and also set its state to reserved. */
> > +   atomic_set(&desc->state_var, id | 0);
> 
> We should check here that the original state id from the previous
> wrap in reusable state (or 0 for bootstrap). On error, we know that
> there was the ABA problem and would need to do something about it.

I believe that it should be enough to add this check and
bail out in case of error.

Best Regards,
Petr

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread John Ogness
On 2019-12-02, Petr Mladek  wrote:
>> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
>> > +{
>> > +  struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> > +  struct prb_desc *desc;
>> > +  u32 id_prev_wrap;
>> > +  u32 head_id;
>> > +  u32 id;
>> > +
>> > +  head_id = atomic_read(&desc_ring->head_id);
>> > +
>> > +  do {
>> > +  desc = to_desc(desc_ring, head_id);
>> > +
>> > +  id = DESC_ID(head_id + 1);
>> > +  id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>> > +
>> > +  if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
>> > +  if (!desc_push_tail(rb, id_prev_wrap))
>> > +  return false;
>> > +  }
>> > +  } while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, id));
>> 
>> Hmm, in theory, ABA problem might cause that we successfully
>> move desc_ring->head_id when tail has not been pushed yet.
>> 
>> As a result we would never call desc_push_tail() until
>> it overflows again.
>> 
>> I am not sure if we need to take care of it. The code is called with
>> interrupts disabled. IMHO, only NMI could cause ABA problem
>> in reality. But the game (debugging) is lost anyway when NMI ovewrites
>> the buffer several times.
>
> BTW: If I am counting correctly. The ABA problem would happen when
> exactly 2^30 (1G) messages is written in the mean time.

All the ringbuffer code assumes that the use of index numbers handles
the ABA problem (i.e. there must not be 1 billion printk's within an
NMI). If we want to support 1 billion+ printk's within an NMI, then
perhaps the index number should be increased. For 64-bit systems it
would be no problem to go to 62 bits. For 32-bit systems, I don't know
how well the 64-bit atomic operations are supported.

>> Also it should not be a complete failure. We still could bail out when
>> the previous state was not reusable. We are on the safe side
>> when it was reusable.
>> 
>> Also we could probably detect when desc_ring->tail_id is not
>> updated any longer and do something about it.
>> 
>> > +
>> > +  desc = to_desc(desc_ring, id);
>> > +
>> > +  /* Set the descriptor's ID and also set its state to reserved. */
>> > +  atomic_set(&desc->state_var, id | 0);
>> 
>> We should check here that the original state id from the previous
>> wrap in reusable state (or 0 for bootstrap). On error, we know that
>> there was the ABA problem and would need to do something about it.
>
> I believe that it should be enough to add this check and
> bail out in case of error.

I need to go through the code again in detail and see how many locations
are affected by ABA. All the code was written with the assumption that
this type of ABA will not happen.

As you've stated, we could add minimal handling so that the ringbuffer
at least does not break or get stuck.

BTW: The same assumption is made for logical positions. There are 4
times as many of these (on 32-bit systems) but logical positions advance
much faster. I will review these as well.

John Ogness

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RESEND PATCH v5 1/5] crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo

2019-12-02 Thread John Donnelly

On 11/29/19 1:59 PM, Bhupesh Sharma wrote:

Right now user-space tools like 'makedumpfile' and 'crash' need to rely
on a best-guess method of determining value of 'MAX_PHYSMEM_BITS'
supported by underlying kernel.

This value is used in user-space code to calculate the bit-space
required to store a section for SPARESMEM (similar to the existing
calculation method used in the kernel implementation):

   #define SECTIONS_SHIFT(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

Now, regressions have been reported in user-space utilities
like 'makedumpfile' and 'crash' on arm64, with the recently added
kernel support for 52-bit physical address space, as there is
no clear method of determining this value in user-space
(other than reading kernel CONFIG flags).

As per suggestion from makedumpfile maintainer (Kazu), it makes more
sense to append 'MAX_PHYSMEM_BITS' to vmcoreinfo in the core code itself
rather than in arch-specific code, so that the user-space code for other
archs can also benefit from this addition to the vmcoreinfo and use it
as a standard way of determining 'SECTIONS_SHIFT' value in user-land.

A reference 'makedumpfile' implementation which reads the
'MAX_PHYSMEM_BITS' value from vmcoreinfo in a arch-independent fashion
is available here:

[0]. 
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_bhupesh-2Dsharma_makedumpfile_blob_remove-2Dmax-2Dphys-2Dmem-2Dbit-2Dv1_arch_ppc64.c-23L471&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=t2fPg9D87F7D8jm0_3CG9yoiIKdRg4qc_thBw4bzMhc&m=WdAKu3i_AeChZ2ZngChACvP5LtULN6mgHopxlQbu16Q&s=AYn-OBh6EqC80XVBg3oLc8ivaCwCNs-cm0PMkJ_2fjo&e=

Cc: Boris Petkov 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: James Morse 
Cc: Mark Rutland 
Cc: Will Deacon 
Cc: Steve Capper 
Cc: Catalin Marinas 
Cc: Ard Biesheuvel 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Dave Anderson 
Cc: Kazuhito Hagio 
Cc: x...@kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: kexec@lists.infradead.org
Signed-off-by: Bhupesh Sharma 


Tested-by:  John Donnelly 


---
  kernel/crash_core.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 9f1557b98468..18175687133a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -413,6 +413,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
+   VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
  #endif
VMCOREINFO_STRUCT_SIZE(page);
VMCOREINFO_STRUCT_SIZE(pglist_data);




--
Thank You,
John

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RESEND PATCH v5 2/5] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2019-12-02 Thread John Donnelly

On 11/29/19 1:59 PM, Bhupesh Sharma wrote:

vabits_actual variable on arm64 indicates the actual VA space size,
and allows a single binary to support both 48-bit and 52-bit VA
spaces.

If the ARMv8.2-LVA optional feature is present, and we are running
with a 64KB page size; then it is possible to use 52-bits of address
space for both userspace and kernel addresses. However, any kernel
binary that supports 52-bit must also be able to fall back to 48-bit
at early boot time if the hardware feature is not present.

Since TCR_EL1.T1SZ indicates the size offset of the memory region
addressed by TTBR1_EL1 (and hence can be used for determining the
vabits_actual value) it makes more sense to export the same in
vmcoreinfo rather than vabits_actual variable, as the name of the
variable can change in future kernel versions, but the architectural
constructs like TCR_EL1.T1SZ can be used better to indicate intended
specific fields to user-space.

User-space utilities like makedumpfile and crash-utility, need to
read/write this value from/to vmcoreinfo for determining if a virtual
address lies in the linear map range.

The user-space computation for determining whether an address lies in
the linear map range is the same as we have in kernel-space:

   #define __is_lm_address(addr)(!(((u64)addr) & BIT(vabits_actual - 
1)))

I have sent out user-space patches for makedumpfile and crash-utility
to add features for obtaining vabits_actual value from TCR_EL1.T1SZ (see
[0] and [1]).

Akashi reported that he was able to use this patchset and the user-space
changes to get user-space working fine with the 52-bit kernel VA
changes (see [2]).

[0]. 
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_pipermail_kexec_2019-2DNovember_023966.html&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=t2fPg9D87F7D8jm0_3CG9yoiIKdRg4qc_thBw4bzMhc&m=hl1GOXoV3jxYFfIKiiUHZCo3tERnHZ8sNFncKCLsu0g&s=AuQKgcRrjZzeOv_rg3saDrjUlJJraGBptzAlDaUNirc&e=
[1]. 
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_pipermail_kexec_2019-2DNovember_024006.html&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=t2fPg9D87F7D8jm0_3CG9yoiIKdRg4qc_thBw4bzMhc&m=hl1GOXoV3jxYFfIKiiUHZCo3tERnHZ8sNFncKCLsu0g&s=48CAvsBJrAJIyXWl_7dQd_a4HPt2dYKZ134lP3jDLA8&e=
[2]. 
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_pipermail_kexec_2019-2DNovember_023992.html&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=t2fPg9D87F7D8jm0_3CG9yoiIKdRg4qc_thBw4bzMhc&m=hl1GOXoV3jxYFfIKiiUHZCo3tERnHZ8sNFncKCLsu0g&s=c-62ZXf5jXn9NYHIS_Qu6-xlbRjPPyG5D07RoEzVzC4&e=

Cc: James Morse 
Cc: Mark Rutland 
Cc: Will Deacon 
Cc: Steve Capper 
Cc: Catalin Marinas 
Cc: Ard Biesheuvel 
Cc: Dave Anderson 
Cc: Kazuhito Hagio 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: kexec@lists.infradead.org
Signed-off-by: Bhupesh Sharma 


Tested-by:  John Donnelly 


---
  arch/arm64/include/asm/pgtable-hwdef.h | 1 +
  arch/arm64/kernel/crash_core.c | 9 +
  2 files changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
b/arch/arm64/include/asm/pgtable-hwdef.h
index d9fbd433cc17..d2e7aff5821e 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -215,6 +215,7 @@
  #define TCR_TxSZ(x)   (TCR_T0SZ(x) | TCR_T1SZ(x))
  #define TCR_TxSZ_WIDTH6
  #define TCR_T0SZ_MASK (((UL(1) << TCR_TxSZ_WIDTH) - 1) << 
TCR_T0SZ_OFFSET)
+#define TCR_T1SZ_MASK  (((UL(1) << TCR_TxSZ_WIDTH) - 1) << 
TCR_T1SZ_OFFSET)
  
  #define TCR_EPD0_SHIFT		7

  #define TCR_EPD0_MASK (UL(1) << TCR_EPD0_SHIFT)
diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index ca4c3e12d8c5..f78310ba65ea 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -7,6 +7,13 @@
  #include 
  #include 
  
+static inline u64 get_tcr_el1_t1sz(void);

+
+static inline u64 get_tcr_el1_t1sz(void)
+{
+   return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
+}
+
  void arch_crash_save_vmcoreinfo(void)
  {
VMCOREINFO_NUMBER(VA_BITS);
@@ -15,5 +22,7 @@ void arch_crash_save_vmcoreinfo(void)
kimage_voffset);
vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
PHYS_OFFSET);
+   vmcoreinfo_append_str("NUMBER(tcr_el1_t1sz)=0x%llx\n",
+   get_tcr_el1_t1sz());
vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
  }




--
Thank You,
John

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec_file overwrites reserved EFI ESRT memory

2019-12-02 Thread Michael Weiser
Hi Dave,

On Mon, Dec 02, 2019 at 05:05:20PM +0800, Dave Young wrote:

> > It seems a serious problem, the EFI modified memmap does not get an
> > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > X86.
> > 
> > There is an question from Sai about why add_efi_memmap is not enabled by
> > default:
> > https://www.spinics.net/lists/linux-mm/msg185166.html

Incidentally, a data point I did not think to mention: I do boot the
kernel as EFI application directly from the firmware as a boot entry
with compiled in initrd and command line:

$ grep EFI nobak/kernel/linux/.config
CONFIG_EFI=y
CONFIG_EFI_STUB=y
# CONFIG_EFI_MIXED is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# EFI (Extensible Firmware Interface) Support
CONFIG_EFI_VARS=m
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=m
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_EFI_RCI2_TABLE is not set
# end of EFI (Extensible Firmware Interface) Support
CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y
CONFIG_EFI_EARLYCON=y
CONFIG_EFI_PARTITION=y
CONFIG_FB_EFI=y
CONFIG_EFIVAR_FS=y
# CONFIG_EFI_PGT_DUMP is not set

$ grep CMDLINE nobak/kernel/linux/.config
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="root=UUID=97[...]e4 rd.luks.uuid=8a[...]c3 
rd.luks.allow-discards=8a[...]c3 mem_sleep_default=deep resume=UUID=97[...]e4 
resume_offset=96256 efi=debug memblock=debug"
CONFIG_CMDLINE_OVERRIDE=y
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_FB_CMDLINE=y

$ efibootmgr -v
BootCurrent: 000A
Timeout: 2 seconds
BootOrder: 000A,0009,0008,0005,0007,0006,0004,0002,0001,,0003
[...]
Boot0005* gentoo-5.4.0-next-20191127+-clear
HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.0-next-20191127+-clear)
[...]
Boot000A* gentoo-5.4.1-gentoo
HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.1-gentoo)

So there's no boot loader that could construct an e820 table for the
kernel to consume. I understand it's then up to the EFI stub to come up
with a e820 table from the EFI memory map.

> > Long time ago the add_efi_memmap is only enabled in case we explict
> > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > should.   Need opinion from X86 maintainers..
> > Can you try below diff see if it works for you? (not tested, and need
> > explicitly 'add_efi_memmap' in kernel cmdline param)

Neither adding add_efi_memmap nor adding your patch and setting that option
does make the ESRT memory region appear in /proc/iomem. kexec_file still
loads the kernel across the ESRT region.

What occurs to me is that nowhere does the ESRT memory region appear in
any externally provided memory map. Neither e820 nor EFI seem to declare
it. Is that expected or a bug of my particular system?

For example, the e820 map (derived from the EFI map by the EFI stub?)
has these regions:

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x-0x0009efff] usable
BIOS-e820: [mem 0x0009f000-0x000f] reserved
BIOS-e820: [mem 0x0010-0x763f5fff] usable
BIOS-e820: [mem 0x763f6000-0x79974fff] reserved
BIOS-e820: [mem 0x79975000-0x799f1fff] ACPI data
BIOS-e820: [mem 0x799f2000-0x79aa6fff] ACPI NVS
BIOS-e820: [mem 0x79aa7000-0x7a40dfff] reserved
BIOS-e820: [mem 0x7a40e000-0x7a40efff] usable
BIOS-e820: [mem 0x7a40f000-0x7fff] reserved
BIOS-e820: [mem 0xf000-0xf7ff] reserved
BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
BIOS-e820: [mem 0xff00-0x] reserved
BIOS-e820: [mem 0x0001-0x00047dff] usable

The ESRT region sits smack in the middle of a large system RAM region:

BIOS-e820: [mem 0x0010-0x763f5fff] usable

Consequently, the relevant part of /proc/iomem looks like this:

-0fff : Reserved
1000-0009efff : System RAM
0009f000-000f : Reserved
  000a-000b : PCI Bus :00
  000e-000e3fff : PCI Bus :00
  000e4000-000e7fff : PCI Bus :00
  000e8000-000ebfff : PCI Bus :00
  000ec000-000e : PCI Bus :00
  000f-000f : PCI Bus :00
000f-000f : System ROM
0010-763f5fff : System RAM
  6500-6aff : Crash kernel
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

What it would need to look like for kexec to leave ESRT alone, I guess, is:

-0fff : Reserved
1000-0009efff : System RAM
0009f000-000f : Reserved
  000

Re: [RFC PATCH v5 1/3] printk-rb: new printk ringbuffer implementation (writer)

2019-12-02 Thread Sergey Senozhatsky
On (19/12/02 17:37), John Ogness wrote:
> On 2019-12-02, Petr Mladek  wrote:
> >> > +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> > +static bool desc_reserve(struct printk_ringbuffer *rb, u32 *id_out)
> >> > +{
> >> > +struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> > +struct prb_desc *desc;
> >> > +u32 id_prev_wrap;
> >> > +u32 head_id;
> >> > +u32 id;
> >> > +
> >> > +head_id = atomic_read(&desc_ring->head_id);
> >> > +
> >> > +do {
> >> > +desc = to_desc(desc_ring, head_id);
> >> > +
> >> > +id = DESC_ID(head_id + 1);
> >> > +id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >> > +
> >> > +if (id_prev_wrap == atomic_read(&desc_ring->tail_id)) {
> >> > +if (!desc_push_tail(rb, id_prev_wrap))
> >> > +return false;
> >> > +}
> >> > +} while (!atomic_try_cmpxchg(&desc_ring->head_id, &head_id, 
> >> > id));
> >> 
> >> Hmm, in theory, ABA problem might cause that we successfully
> >> move desc_ring->head_id when tail has not been pushed yet.
> >> 
> >> As a result we would never call desc_push_tail() until
> >> it overflows again.
> >> 
> >> I am not sure if we need to take care of it. The code is called with
> >> interrupts disabled. IMHO, only NMI could cause ABA problem
> >> in reality. But the game (debugging) is lost anyway when NMI ovewrites
> >> the buffer several times.
> >
> > BTW: If I am counting correctly. The ABA problem would happen when
> > exactly 2^30 (1G) messages is written in the mean time.
> 
> All the ringbuffer code assumes that the use of index numbers handles
> the ABA problem (i.e. there must not be 1 billion printk's within an
> NMI). If we want to support 1 billion+ printk's within an NMI, then
> perhaps the index number should be increased. For 64-bit systems it
> would be no problem to go to 62 bits. For 32-bit systems, I don't know
> how well the 64-bit atomic operations are supported.

ftrace dumps from NMI (DUMP_ALL type ftrace_dump_on_oops on a $BIG
machine)? 1G seems large enough, but who knows.

-ss

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] efi: arm64: Introduce /sys/firmware/efi/memreserve to tell the persistent pages

2019-12-02 Thread Masayoshi Mizuma
On Fri, Nov 29, 2019 at 01:25:36PM +0100, Matthias Brugger wrote:
> 
> 
> On 25/11/2019 19:49, Masayoshi Mizuma wrote:
> > From: Masayoshi Mizuma 
> > 
> > kexec reboot stops in early boot sequence because efi_config_parse_tables()
> > refers garbage data. We can see the log with memblock=debug kernel option:
> > 
> >   efi:  ACPI 2.0=0x9821790014  PROP=0x8757f5c0  SMBIOS 3.0=0x982074  
> > MEMRESERVE=0x9820bfdc58
> >   memblock_reserve: [0x009820bfdc58-0x009820bfdc67] 
> > efi_config_parse_tables+0x228/0x278
> >   memblock_reserve: [0x8276-0x324d07ff] 
> > efi_config_parse_tables+0x228/0x278
> >   memblock_reserve: [0xcc4f84ecc0511670-0x5f6e5214a7fd91f9] 
> > efi_config_parse_tables+0x244/0x278
> >   memblock_reserve: [0xd2fd4144b9af693d-0xad0c1db1086f40a2] 
> > efi_config_parse_tables+0x244/0x278
> >   memblock_reserve: [0x0c719bb159b1fadc-0x5aa6e62a1417ce12] 
> > efi_config_parse_tables+0x244/0x278
> >   ...
> > 
> > That happens because 0x8276, struct linux_efi_memreserve, is destroyed.
> > 0x8276 is pointed from efi.mem_reseve, and efi.mem_reserve points the
> > head page of LPI pending table and LPI property table which are allocated by
> > gic_reserve_range().
> > 
> > The destroyer is kexec. kexec locates the initrd to the area:
> > 
> >   ]# kexec -d -l /boot/vmlinuz-5.4.0-rc7 /boot/initramfs-5.4.0-rc7.img 
> > --reuse-cmdline
> >   ...
> >   initrd: base 8229, size 388dd8ah (59301258)
> >   ...
> > 
> > From dynamic debug log. initrd is located in segment[1]:
> >   machine_kexec_prepare:70:
> > kexec kimage info:
> >   type:0
> >   start:   85b30680
> >   head:0
> >   nr_segments: 4
> > segment[0]: 8048 - 8229, 0x1e1 bytes, 
> > 481 pages
> > segment[1]: 8229 - 85b2, 0x389 bytes, 
> > 905 pages
> > segment[2]: 85b2 - 85b3, 0x1 bytes, 1 
> > pages
> > segment[3]: 85b3 - 85b4, 0x1 bytes, 1 
> > pages
> > 
> > kexec searches the memory region to locate initrd through
> > "System RAM" in /proc/iomem. The pending tables are included in
> > "System RAM" because they are allocated by alloc_pages(), so kexec
> > destroys the LPI pending tables.
> > 
> 
> Doesn't that mean that you haven't enough memory reserved so that you have to
> fallback to allocate it via __get_free_page()?

That's a not fallback allocation. The pending tables and also property
tables are allocated by alloc_pages() on its_allocate_prop_table() and
its_allocate_pending_table().

> 
> 
> > Introduce /sys/firmware/efi/memreserve to tell the pages pointed by
> > efi.mem_reserve so that kexec can avoid the area to locate initrd.
> > 
> 
> Doesn't that need a patch for kexec-tools to actually take this into account?

Yes, we need a patch for kexec-tools as well. I'm preparing the kexec
patch.

> 
> > Signed-off-by: Masayoshi Mizuma 
> > ---
> >  drivers/firmware/efi/efi.c | 45 +-
> >  1 file changed, 44 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index e98bbf8e5..0aa07cc09 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -141,6 +141,47 @@ static ssize_t systab_show(struct kobject *kobj,
> >  
> >  static struct kobj_attribute efi_attr_systab = __ATTR_RO_MODE(systab, 
> > 0400);
> >  
> > +static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init;
> > +#ifdef CONFIG_KEXEC
> > +static ssize_t memreserve_show(struct kobject *kobj,
> > +  struct kobj_attribute *attr, char *buf)
> > +{
> > +   struct linux_efi_memreserve *rsv;
> > +   phys_addr_t start, end;
> > +   unsigned long prsv;
> > +   char *str = buf;
> > +   int count, i;
> > +
> > +   if (!kobj || !buf)
> > +   return -EINVAL;
> > +
> > +   if ((efi_memreserve_root == (void *)ULONG_MAX) ||
> > +   (!efi_memreserve_root))
> > +   return -ENODEV;
> > +
> > +   for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) {
> > +   rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB);
> > +   if (!rsv) {
> > +   pr_err("Could not map efi_memreserve\n");
> > +   return -ENOMEM;
> > +   }
> > +   count = atomic_read(&rsv->count);
> > +   for (i = 0; i < count; i++) {
> > +   start = rsv->entry[i].base;
> > +   end = start + rsv->entry[i].size - 1;
> > +
> > +   str += sprintf(str, "%pa-%pa\n", &start, &end);
> 
> What happens if we provide a buf which is too small?

Good point.
The strings may exceed the buffer size (PAGE_SIZE) in case
efi_memreserve_root has a lot of entries.
It might be better to use seq_printf() to show efi_memreserve_root...
I'll move the file from a sysfs entry to a proc entry so that 
efi_memreserve_root can be handled by s

[PATCH] makedumpfile: assign bitmap2 fd for sub process during refiltering

2019-12-02 Thread Pingfan Liu
In refiltering mode, each sub process inherits bitmap2->fd from parent.
Then they lseek()/read() on the same fd, which means that they interference
with each other.

This breaks the purpose of SPLITTING_FD_BITMAP(i) for each sub process.
Fix it by assigning a sub process dedicated fd to bitmap2->fd.

Signed-off-by: Pingfan Liu 
---
 makedumpfile.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index d76a435..1dc8640 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -8857,7 +8857,8 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data 
*cd_header, struct cache_d
if (info->flag_cyclic) {
if (!prepare_bitmap2_buffer())
return FALSE;
-   }
+   } else if (info->flag_refiltering)
+   info->bitmap2->fd = info->fd_bitmap;
 
/*
 * Write pages and bitmap cyclically.
-- 
2.7.5


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec