Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-19 Thread Mike Rapoport
From: Mike Rapoport (IBM) 

On Thu, 13 Jun 2024 11:55:06 -0400, Steven Rostedt wrote:
> Reserve unspecified location of physical memory from kernel command line
> 
> Background:
> 
> In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
> dmesg output and some other information when a crash happens in the field.
> (This is only done when the user selects "Allow Google to collect data for
>  improving the system"). But there are cases when there's a bug that
> requires more data to be retrieved to figure out what is happening. We would
> like to increase the pstore size, either temporarily, or maybe even
> permanently. The pstore on these devices are at a fixed location in RAM (as
> the RAM is not cleared on soft reboots nor crashes). The location is chosen
> by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
> There's a driver that queries for this to initialize the pstore for
> ChromeOS:
> 
> [...]

Applied to for-next branch of memblock.git tree, thanks!

[0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332
[1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332
[2/2] pstore/ramoops: Add ramoops.mem_name= command line option
  commit: d9d814eebb1ae9742e7fd7f39730653b16326bd4

tree: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
branch: for-next

--
Sincerely yours,
Mike.




Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-19 Thread Mike Rapoport
On Tue, Jun 18, 2024 at 12:21:19PM +0200, Ard Biesheuvel wrote:
> On Mon, 17 Jun 2024 at 23:01, Alexander Graf  wrote:
>
> So I would personally prefer for this code not to go in at all. But if
> it does go in (and Steven has already agreed to this), it needs a
> giant disclaimer that it is best effort and may get broken
> inadvertently by changes that may seem unrelated.

I think that reserve_mem= is way less intrusive than memmap= we anyway
have.
It will reserve memory and the documentation adequately warns that the
location of that memory might be at different physical addresses.

-- 
Sincerely yours,
Mike.



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-18 Thread Steven Rostedt
On Tue, 18 Jun 2024 13:47:49 +0200
Alexander Graf  wrote:

> IMHO the big fat disclaimer should be in the argument name. 
> "reserve_mem" to me sounds like it actually guarantees a reservation - 
> which it doesn't. Can we name it more along the lines of "debug" (to 
> indicate it's not for production data) or "phoenix" (usually gets reborn 
> out of ashes, but you can never know for sure): "debug_mem", / 
> "phoenix_mem"?

I don't see any reason it will not reserve memory. That doesn't seem to
be the issue.  What is not guaranteed is that it will be in the same
location as last time. So the name is correct. It's not
"reserve_consistent_memory" ;-)

-- Steve



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-18 Thread Alexander Graf


On 18.06.24 12:21, Ard Biesheuvel wrote:

On Mon, 17 Jun 2024 at 23:01, Alexander Graf  wrote:

On 17.06.24 22:40, Steven Rostedt wrote:

On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf  wrote:


Hey Steve,


I believe we're talking about 2 different things :). Let me rephrase a
bit and make a concrete example.

Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
line option. The kernel now comes up and allocates a random chunk of
memory that - by (admittedly good) chance - may be at the same physical
location as before. Let's assume it deemed 0x100 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).


Let's now assume you're running on a UEFI system. There, you always have
non-volatile storage available to you even in the pre-boot phase. That
means the kernel could create a UEFI variable that says "12M:4096:trace
-> 0x100". The pre-boot phase takes all these UEFI variables and
marks them as reserved. When you finally reach your command line parsing
logic for reserve_mem=, you can flip all reservations that were not on
the command line back to normal memory.

That way you have pretty much guaranteed persistent memory regions, even
with KASLR changing your memory layout across boots.

The nice thing is that the above is an extension of what you've already
built: Systems with UEFI simply get better guarantees that their regions
persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.


It definitely would be an added feature, yes. But one that allows you to
ensure persistence a lot more safely :).



Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be

With KASLR is enabled, doesn't this approach break too often to be
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make
KASLR avoid that reserved pstore region in its search for a viable KASLR
offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

Once is pretty good. Do you know why? Also once out of how many runs? Is
the randomness source not as random as it should be or are the number of
bits for KASLR maybe so few on your target architecture that the odds of
hitting anything become low? Do these same constraints hold true outside
of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

for i in `seq 100`; do
  ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  
>> text; grub-reboot '1>0'; sleep 0.5; reboot"
  sleep 25
done

It searches dmesg for my added printk as well as the print of were the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x1. And then it would be off by a little.

It was consistently at:

0x27d00

And when it failed, it was at 0x27ce0.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x1, something else gets
allocated causing this to get pushed down.


Thinking about it again: What if you run the allocation super early (see
arch/x86/boot/compressed/kaslr.c:handle_mem_options())?

That code is not used by EFI boot anymore.

In general, I would recommend (and have recommended) against these
kinds of hacks in mainline, because -especially on x86- there is
always someone that turns up with some kind of convoluted use case
that gets broken if we try to change anything in the boot code.

I spent considerable time over the past year making the EFI/x86 boot
code compatible with the new MS imposed requirements on PC boot
firmware (related to secure boot and NX restrictions on memory
mappings). This involved some radical refactoring of the boot
sequence, including the KASLR logic. Adding fragile code there that
will result in regressions observable to end users when it gets broken
is really not what I'd like to see.

So I would personally prefer for this code not to go in at all. But if
it does go in (and Steven has already agreed to this), it needs a
giant disclaimer that it is best effort and may get broken
inadvertently by changes that may 

Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-18 Thread Ard Biesheuvel
On Mon, 17 Jun 2024 at 23:01, Alexander Graf  wrote:
>
> On 17.06.24 22:40, Steven Rostedt wrote:
> > On Mon, 17 Jun 2024 09:07:29 +0200
> > Alexander Graf  wrote:
> >
> >> Hey Steve,
> >>
> >>
> >> I believe we're talking about 2 different things :). Let me rephrase a
> >> bit and make a concrete example.
> >>
> >> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
> >> line option. The kernel now comes up and allocates a random chunk of
> >> memory that - by (admittedly good) chance - may be at the same physical
> >> location as before. Let's assume it deemed 0x100 as a good offset.
> > Note, it's not random. It picks from the top of available memory every
> > time. But things can mess with it (see below).
> >
> >> Let's now assume you're running on a UEFI system. There, you always have
> >> non-volatile storage available to you even in the pre-boot phase. That
> >> means the kernel could create a UEFI variable that says "12M:4096:trace
> >> -> 0x100". The pre-boot phase takes all these UEFI variables and
> >> marks them as reserved. When you finally reach your command line parsing
> >> logic for reserve_mem=, you can flip all reservations that were not on
> >> the command line back to normal memory.
> >>
> >> That way you have pretty much guaranteed persistent memory regions, even
> >> with KASLR changing your memory layout across boots.
> >>
> >> The nice thing is that the above is an extension of what you've already
> >> built: Systems with UEFI simply get better guarantees that their regions
> >> persist.
> > This could be an added feature, but it is very architecture specific,
> > and would likely need architecture specific updates.
>
>
> It definitely would be an added feature, yes. But one that allows you to
> ensure persistence a lot more safely :).
>
>
> > Requirement:
> >
> > Need a way to reserve memory that will be at a consistent location for
> > every boot, if the kernel and system are the same. Does not need to work
> > if rebooting to a different kernel, or if the system can change the
> > memory layout between boots.
> >
> > The reserved memory can not be an hard coded address, as the same 
> > kernel /
> > command line needs to run on several different machines. The picked 
> > memory
> > reservation just needs to be the same for a given machine, but may be
>  With KASLR is enabled, doesn't this approach break too often to be
>  reliable enough for the data you want to extract?
> 
>  Picking up the idea above, with a persistent variable we could even make
>  KASLR avoid that reserved pstore region in its search for a viable KASLR
>  offset.
> >>> I think I was hit by it once in all my testing. For our use case, the
> >>> few times it fails to map is not going to affect what we need this for
> >>> at all.
> >> Once is pretty good. Do you know why? Also once out of how many runs? Is
> >> the randomness source not as random as it should be or are the number of
> >> bits for KASLR maybe so few on your target architecture that the odds of
> >> hitting anything become low? Do these same constraints hold true outside
> >> of your testing environment?
> > So I just ran it a hundred times in a loop. I added a patch to print
> > the location of "_text". The loop was this:
> >
> >for i in `seq 100`; do
> >  ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 
> > 'mapped boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
> >  sleep 25
> >done
> >
> > It searches dmesg for my added printk as well as the print of were the
> > ring buffer was loaded in physical memory.
> >
> > It takes about 15 seconds to reboot, so I waited 25. The results are
> > attached. I found that it was consistent 76 times, which means 1 out of
> > 4 it's not. Funny enough, it broke whenever it loaded the kernel below
> > 0x1. And then it would be off by a little.
> >
> > It was consistently at:
> >
> >0x27d00
> >
> > And when it failed, it was at 0x27ce0.
> >
> > Note, when I used the e820 tables to do this, I never saw a failure. My
> > assumption is that when it is below 0x1, something else gets
> > allocated causing this to get pushed down.
>
>
> Thinking about it again: What if you run the allocation super early (see
> arch/x86/boot/compressed/kaslr.c:handle_mem_options())?

That code is not used by EFI boot anymore.

In general, I would recommend (and have recommended) against these
kinds of hacks in mainline, because -especially on x86- there is
always someone that turns up with some kind of convoluted use case
that gets broken if we try to change anything in the boot code.

I spent considerable time over the past year making the EFI/x86 boot
code compatible with the new MS imposed requirements on PC boot
firmware (related to secure boot and NX restrictions on memory
mappings). This involved some radical refactoring of the boot
sequence, including the 

Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Steven Rostedt
On Mon, 17 Jun 2024 23:01:12 +0200
Alexander Graf  wrote:
> > This could be an added feature, but it is very architecture specific,
> > and would likely need architecture specific updates.  
> 
> 
> It definitely would be an added feature, yes. But one that allows you to 
> ensure persistence a lot more safely :).

Sure.

> 
> Thinking about it again: What if you run the allocation super early (see 
> arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to 
> allocating only from top, you're effectively kernel version independent 
> for your allocations because none of the kernel code ran yet and 
> definitely KASLR independent because you're running deterministically 
> before KASLR even gets allocated.
> 
> > As this code relies on memblock_phys_alloc() being consistent, if
> > something gets allocated before it differently depending on where the
> > kernel is, it can also move the location. A plugin to UEFI would mean
> > that it would need to reserve the memory, and the code here will need
> > to know where it is. We could always make the function reserve_mem()
> > global and weak so that architectures can override it.  
> 
> 
> Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new 
> type of memblock with the respective reservations and you later call 
> memblock_find_in_range_node() instead of memblock_phys_alloc() to pass 
> in flags that you want to allocate only from the new 
> MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots 
> through the handle_mem_options() path above. In fact, if the BIOS way 
> works fine, we don't even need UEFI variables: The same way allocations 
> will be identical during BIOS execution, they should stay identical 
> across UEFI launches.
> 
> As cherry on top, kexec also works seamlessly with the special memblock 
> approach because kexec (at least on x86) hands memblocks as is to the 
> next kernel. So the new kernel will also automatically use the same 
> ranges for its allocations.

I'm all for expanding this. But I would just want to get this in for
now as is. It theoretically works on all architectures. If someone
wants to make in more robust and accurate on a specific architecture,
I'm all for it. Like I said, we could make the reserver_mem() function
global and weak, and then if an architecture has a better way to handle
this, it could use that.

Hmm, x86 could do this with the e820 code like I did in my first
versions. Like I said, it didn't fail at all with that.

And we can have an UEFI version as well.

-- Steve



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Alexander Graf

[resend because Thunderbird decided to send the previous version as HTML :(]


On 17.06.24 22:40, Steven Rostedt wrote:

On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf  wrote:


Hey Steve,


I believe we're talking about 2 different things :). Let me rephrase a
bit and make a concrete example.

Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
line option. The kernel now comes up and allocates a random chunk of
memory that - by (admittedly good) chance - may be at the same physical
location as before. Let's assume it deemed 0x100 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).


Let's now assume you're running on a UEFI system. There, you always have
non-volatile storage available to you even in the pre-boot phase. That
means the kernel could create a UEFI variable that says "12M:4096:trace
-> 0x100". The pre-boot phase takes all these UEFI variables and
marks them as reserved. When you finally reach your command line parsing
logic for reserve_mem=, you can flip all reservations that were not on
the command line back to normal memory.

That way you have pretty much guaranteed persistent memory regions, even
with KASLR changing your memory layout across boots.

The nice thing is that the above is an extension of what you've already
built: Systems with UEFI simply get better guarantees that their regions
persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.



It definitely would be an added feature, yes. But one that allows you to 
ensure persistence a lot more safely :).




Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be

With KASLR is enabled, doesn't this approach break too often to be
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make
KASLR avoid that reserved pstore region in its search for a viable KASLR
offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

Once is pretty good. Do you know why? Also once out of how many runs? Is
the randomness source not as random as it should be or are the number of
bits for KASLR maybe so few on your target architecture that the odds of
hitting anything become low? Do these same constraints hold true outside
of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

   for i in `seq 100`; do
 ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  
>> text; grub-reboot '1>0'; sleep 0.5; reboot"
 sleep 25
   done

It searches dmesg for my added printk as well as the print of were the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x1. And then it would be off by a little.

It was consistently at:

   0x27d00

And when it failed, it was at 0x27ce0.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x1, something else gets
allocated causing this to get pushed down.



Thinking about it again: What if you run the allocation super early (see 
arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to 
allocating only from top, you're effectively kernel version independent 
for your allocations because none of the kernel code ran yet and 
definitely KASLR independent because you're running deterministically 
before KASLR even gets allocated.



As this code relies on memblock_phys_alloc() being consistent, if
something gets allocated before it differently depending on where the
kernel is, it can also move the location. A plugin to UEFI would mean
that it would need to reserve the memory, and the code here will need
to know where it is. We could always make the function reserve_mem()
global and weak so that architectures can override it.



Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new 
type of memblock with the respective reservations and you later call 
memblock_find_in_range_node() instead of memblock_phys_alloc() to pass 
in flags that you want to allocate only from the new 
MEMBLOCK_RESERVE_MEM type. The same model would work for 

Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Steven Rostedt
On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf  wrote:

> Hey Steve,
> 
> 
> I believe we're talking about 2 different things :). Let me rephrase a 
> bit and make a concrete example.
> 
> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command 
> line option. The kernel now comes up and allocates a random chunk of 
> memory that - by (admittedly good) chance - may be at the same physical 
> location as before. Let's assume it deemed 0x100 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).

> 
> Let's now assume you're running on a UEFI system. There, you always have 
> non-volatile storage available to you even in the pre-boot phase. That 
> means the kernel could create a UEFI variable that says "12M:4096:trace 
> -> 0x100". The pre-boot phase takes all these UEFI variables and   
> marks them as reserved. When you finally reach your command line parsing 
> logic for reserve_mem=, you can flip all reservations that were not on 
> the command line back to normal memory.
> 
> That way you have pretty much guaranteed persistent memory regions, even 
> with KASLR changing your memory layout across boots.
> 
> The nice thing is that the above is an extension of what you've already 
> built: Systems with UEFI simply get better guarantees that their regions 
> persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.

> 
> 
> >  
> >>  
> >>> Requirement:
> >>>
> >>> Need a way to reserve memory that will be at a consistent location for
> >>> every boot, if the kernel and system are the same. Does not need to work
> >>> if rebooting to a different kernel, or if the system can change the
> >>> memory layout between boots.
> >>>
> >>> The reserved memory can not be an hard coded address, as the same kernel /
> >>> command line needs to run on several different machines. The picked memory
> >>> reservation just needs to be the same for a given machine, but may be  
> >>
> >> With KASLR is enabled, doesn't this approach break too often to be
> >> reliable enough for the data you want to extract?
> >>
> >> Picking up the idea above, with a persistent variable we could even make
> >> KASLR avoid that reserved pstore region in its search for a viable KASLR
> >> offset.  
> > I think I was hit by it once in all my testing. For our use case, the
> > few times it fails to map is not going to affect what we need this for
> > at all.  
> 
> 
> Once is pretty good. Do you know why? Also once out of how many runs? Is 
> the randomness source not as random as it should be or are the number of 
> bits for KASLR maybe so few on your target architecture that the odds of 
> hitting anything become low? Do these same constraints hold true outside 
> of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

  for i in `seq 100`; do
ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped 
boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
sleep 25
  done

It searches dmesg for my added printk as well as the print of were the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x1. And then it would be off by a little.

It was consistently at:

  0x27d00

And when it failed, it was at 0x27ce0.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x1, something else gets
allocated causing this to get pushed down.

As this code relies on memblock_phys_alloc() being consistent, if
something gets allocated before it differently depending on where the
kernel is, it can also move the location. A plugin to UEFI would mean
that it would need to reserve the memory, and the code here will need
to know where it is. We could always make the function reserve_mem()
global and weak so that architectures can override it.

-- Steve


text
Description: Binary data


Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Alexander Graf

Hey Steve,

On 13.06.24 19:12, Steven Rostedt wrote:

On Thu, 13 Jun 2024 18:54:12 +0200
Alexander Graf  wrote:


Do you have a "real" pstore on these systems that you could store
non-volatile variables in, such as persistent UEFI variables? If so, you
could create an actually persistent mapping for your trace pstore even
across kernel version updates as a general mechanism to create reserved
memblocks at fixed offsets.

After implementing all this, I don't think I can use pstore for my
purpose. pstore is a generic interface for persistent storage, and
requires an interface to access it. From what I understand, it's not
the place to just ask for an area of RAM.

For this, I have a single patch that allows the tracing instance to use
an area reserved by reserve_mem.

   reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace

I've already tested this on qemu and a couple of chromebooks. It works
well.



I believe we're talking about 2 different things :). Let me rephrase a 
bit and make a concrete example.


Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command 
line option. The kernel now comes up and allocates a random chunk of 
memory that - by (admittedly good) chance - may be at the same physical 
location as before. Let's assume it deemed 0x100 as a good offset.


Let's now assume you're running on a UEFI system. There, you always have 
non-volatile storage available to you even in the pre-boot phase. That 
means the kernel could create a UEFI variable that says "12M:4096:trace 
-> 0x100". The pre-boot phase takes all these UEFI variables and 
marks them as reserved. When you finally reach your command line parsing 
logic for reserve_mem=, you can flip all reservations that were not on 
the command line back to normal memory.


That way you have pretty much guaranteed persistent memory regions, even 
with KASLR changing your memory layout across boots.


The nice thing is that the above is an extension of what you've already 
built: Systems with UEFI simply get better guarantees that their regions 
persist.








Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be


With KASLR is enabled, doesn't this approach break too often to be
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make
KASLR avoid that reserved pstore region in its search for a viable KASLR
offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.



Once is pretty good. Do you know why? Also once out of how many runs? Is 
the randomness source not as random as it should be or are the number of 
bits for KASLR maybe so few on your target architecture that the odds of 
hitting anything become low? Do these same constraints hold true outside 
of your testing environment?



Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597


[PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
Reserve unspecified location of physical memory from kernel command line

Background:

In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
 improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices are at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:

  See drivers/platform/chrome/chromeos_pstore.c

Problem:

The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be in a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.

Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be
different for different machines.

Solution:

The solution I have come up with is to introduce a new "reserve_mem=" kernel
command line. This parameter takes the following format:

  reserve_mem=nn:align:label

Where nn is the size of memory to reserve, the align is the alignment of
that memory, and label is the way for other sub-systems to find that memory.
This way the kernel command line could have:

  reserve_mem=12M:4096:oops   ramoops.mem_name=oops

At boot up, the kernel will search for 12 megabytes in usable memory regions
with an alignment of 4096. It will start at the highest regions and work its
way down (for those old devices that want access to lower address DMA). When
it finds a region, it will save it off in a small table and mark it with the
"oops" label. Then the pstore ramoops sub-system could ask for that memory
and location, and it will map itself there.

This prototype allows for 8 different mappings (which may be overkill, 4 is
probably plenty) with 16 byte size to store the label.

I have tested this and it works for us to solve the above problem. We can
update the kernel and command line and increase the size of pstore without
needing to update the firmware, or knowing every memory layout of each
board. I only tested this locally, it has not been tested in the field.

Changes since v5: 
https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/

[ patch at bottom showing differences ]

- Stressed more that this is a best effort use case

- Updated ramoops.rst to document this new feature

- Used a new variable "tmp" to use in reserve_mem_find_by_name() instead
  of using "size" and possibly corrupting it.

Changes since v4: 
https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/

- Add all checks about reserve_mem before allocation.
  This means reserved_mem_add() is now a void function.

- Check for name duplications.

- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=")

Changes since v3: 
https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/

- Changed table type of start and size from unsigned long to phys_addr_t
  (as well as the parameters to the functions that use them)

- Changed old reference to "early_reserve_mem" to "reserve_mem"

- Check before reservering memory:
  o Size is non-zero
  o name has text in it

- If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES

- Remove the silly check of testing *p == '\0' after a p += strlen(p)

Changes since v2: 
https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/

- Fixed typo of "reserver"

- Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name()

- Removed "built-in" from module description that was changed from v1.


Changes since v1: 
https://lore.kernel.org/all/2024060320.801075...@goodmis.org/

- Updated the change log of the first patch as well as added an entry
  into kernel-parameters.txt about how reserve_mem is for soft reboots
  and may not be reliable. 


Steven Rostedt (Google) (2):
  mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  pstore/ramoops: Add ramoops.mem_name= command line option


 Documentation/admin-guide/kernel-parameters.txt |  22 +
 

Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 18:54:12 +0200
Alexander Graf  wrote:

> 
> Do you have a "real" pstore on these systems that you could store 
> non-volatile variables in, such as persistent UEFI variables? If so, you 
> could create an actually persistent mapping for your trace pstore even 
> across kernel version updates as a general mechanism to create reserved 
> memblocks at fixed offsets.

After implementing all this, I don't think I can use pstore for my
purpose. pstore is a generic interface for persistent storage, and
requires an interface to access it. From what I understand, it's not
the place to just ask for an area of RAM.

For this, I have a single patch that allows the tracing instance to use
an area reserved by reserve_mem.

  reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace

I've already tested this on qemu and a couple of chromebooks. It works
well.

> 
> 
> > Requirement:
> >
> > Need a way to reserve memory that will be at a consistent location for
> > every boot, if the kernel and system are the same. Does not need to work
> > if rebooting to a different kernel, or if the system can change the
> > memory layout between boots.
> >
> > The reserved memory can not be an hard coded address, as the same kernel /
> > command line needs to run on several different machines. The picked memory
> > reservation just needs to be the same for a given machine, but may be  
> 
> 
> With KASLR is enabled, doesn't this approach break too often to be 
> reliable enough for the data you want to extract?
> 
> Picking up the idea above, with a persistent variable we could even make 
> KASLR avoid that reserved pstore region in its search for a viable KASLR 
> offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

-- Steve



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Alexander Graf

Hey Steve,

On 13.06.24 17:55, Steven Rostedt wrote:

Reserve unspecified location of physical memory from kernel command line

Background:

In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
  improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices are at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:

   See drivers/platform/chrome/chromeos_pstore.c

Problem:

The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be in a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.



(sorry for not commenting on earlier versions, I didn't see v1-v5 in my 
inbox)


Do you have a "real" pstore on these systems that you could store 
non-volatile variables in, such as persistent UEFI variables? If so, you 
could create an actually persistent mapping for your trace pstore even 
across kernel version updates as a general mechanism to create reserved 
memblocks at fixed offsets.




Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be



With KASLR is enabled, doesn't this approach break too often to be 
reliable enough for the data you want to extract?


Picking up the idea above, with a persistent variable we could even make 
KASLR avoid that reserved pstore region in its search for a viable KASLR 
offset.



Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597