Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-30 Thread Dave Young
Hi James,
On 03/30/20 at 06:17pm, James Morse wrote:
> Hi Baoquan,
> 
> On 3/30/20 2:55 PM, Baoquan He wrote:
> > On 03/26/20 at 06:07pm, James Morse wrote:
> >> arm64 recently queued support for memory hotremove, which led to some
> >> new corner cases for kexec.
> >>
> >> If the kexec segments are loaded for a removable region, that region may
> >> be removed before kexec actually occurs. This causes the first kernel to
> >> lockup when applying the relocations. (I've triggered this on x86 too).
> >>
> >> The first patch adds a memory notifier for kexec so that it can refuse
> >> to allow in-use regions to be taken offline.
> > 
> > I talked about this with Dave Young. We tend to use
> > kexec_file_load more going forward: since most of its implementation is
> > in the kernel, we can get information about the kernel more easily. For a
> > kexec kernel loaded into a hotpluggable area, we can fix it on the
> > kexec_file_load side, as we know the MOVABLE zone's start and end. As for
> > the old kexec_load, we would like to keep it for backward compatibility. At
> > least in our distros, we have switched to kexec_file_load and will
> > gradually obsolete kexec_load.
> 
> > So for this one, I suggest avoiding the
> > MOVABLE memory regions when searching for a place for the kexec kernel.
> 
> How does today's user-space know?
> 
> 
> > Not sure if arm64 will still have difficulty.
> 
> arm64 added support for kexec_load first, then kexec_file_load. (evidently a
> mistake).
> kexec_file_load support was only added in the last year or so, I'd hazard most
> people using this, are using the regular load kind. (and probably don't know or
> care).

I agree that file load is still not widely used, but in the long run
we should not maintain both of them forever. Especially
when some kernel-userspace interfaces need to be introduced, file load
will have a natural advantage. We may keep kexec_load for other
misc use cases, but we can use file load for the major modern
linux-to-linux loading. I'm not saying we can do it immediately, I just
think we should reduce the duplicated effort and try to avoid hacks if
possible.

Anyway, about this particular issue, I wonder if we can just reload with
a udev rule, as I replied in another mail.

Thanks
Dave


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-30 Thread Dave Young
Hi James,
On 03/26/20 at 06:07pm, James Morse wrote:
> Hello!
> 
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
> 
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).

Does a kexec reload work for your case? If yes, then I would suggest
doing it in userspace, for example with a udev rule that reloads kexec if
needed.

Actually we have a rule to restart kdump loading, but not for kexec. It
sounds like we would also need a service to load kexec, and a udev rule to
reload it on memory hotplug.

> 
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.
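
The refusal described above comes down to a range-overlap test. A minimal userspace sketch, assuming the loaded segments and the offline request are both expressed as page-frame ranges; `struct seg`, `ranges_overlap()` and `offline_allowed()` are hypothetical names used for illustration only, not the code from the patch:

```c
#include <stddef.h>

/* Hypothetical description of one loaded kexec segment, in page frames. */
struct seg {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

/*
 * Two half-open ranges [a, a+an) and [b, b+bn) overlap iff each one
 * starts before the other ends. Overflow of a+an is ignored in this sketch.
 */
int ranges_overlap(unsigned long a, unsigned long an,
		   unsigned long b, unsigned long bn)
{
	return a < b + bn && b < a + an;
}

/* Returns 0 (refuse) if the range going offline holds any segment, else 1. */
int offline_allowed(const struct seg *segs, size_t nr_segs,
		    unsigned long start_pfn, unsigned long nr_pages)
{
	for (size_t i = 0; i < nr_segs; i++) {
		if (ranges_overlap(segs[i].start_pfn, segs[i].nr_pages,
				   start_pfn, nr_pages))
			return 0;
	}
	return 1;
}
```

In the kernel, the equivalent decision would presumably be taken from the memory notifier when a range is about to go offline; the actual hook and return values are in the patch itself.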
> 
> 
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure it's always present in the future...)
> 
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
> 
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
> 
> 
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.
> 
> 
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.
> 
> Thanks,
> 
> James Morse (3):
>   kexec: Prevent removal of memory in use by a loaded kexec image
>   mm/memory_hotplug: Allow arch override of non boot memory resource names
>   arm64: memory: Give hotplug memory a different resource name
> 
>  arch/arm64/include/asm/memory.h | 11 +++
>  kernel/kexec_core.c | 56 +
>  mm/memory_hotplug.c |  6 +++-
>  3 files changed, 72 insertions(+), 1 deletion(-)
> 
> -- 
> 2.25.1
> 
> 

Thanks
Dave




Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Baoquan He
On 03/30/20 at 10:42pm, Alexander Graf wrote:
> 
> 
> On 30.03.20 15:40, Konrad Rzeszutek Wilk wrote:
> > 
> > 
> > 
> > On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> > > On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
> > > > 
> > > > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > > > The swiotlb is a very convenient fallback mechanism for bounce buffering of
> > > > > DMAable data. It is usually used for the compatibility case where devices
> > > > > can only DMA to a "low region".
> > > > > 
> > > > > However, in some scenarios this "low region" may be bound even more
> > > > > heavily. For example, there are embedded system where only an SRAM region
> > > > > is shared between device and CPU. There are also heterogeneous computing
> > > > > scenarios where only a subset of RAM is cache coherent between the
> > > > > components of the system. There are partitioning hypervisors, where
> > > > > a "control VM" that implements device emulation has limited view into a
> > > > > partition's memory for DMA capabilities due to safety concerns.
> > > > > 
> > > > > This patch adds a command line driven mechanism to move all DMA memory into
> > > > > a predefined shared memory region which may or may not be part of the
> > > > > physical address layout of the Operating System.
> > > > > 
> > > > > Ideally, the typical path to set this configuration would be through Device
> > > > > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > > > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > > > instead configure the system purely through kernel command line options.
> > > > > 
> > > > > I'm sure other people will find the functionality useful going forward
> > > > > though and extend it to be triggered by DT/ACPI in the future.
> > > > 
> > > > Hmm, we have a use case for kdump, this may be useful.  For example
> > > > swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> > > > kernel we have to increase the crashkernel reserved size for the extra
> > > > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > > > swiotlb region and pass the addr to kdump kernel.
> > > > 
> > > 
> > > Yes, definitely helpful for kdump kernel. This can help reduce the
> > > crashkernel value.
> > > 
> > > Previously I was thinking about something similar, play around the
> > > e820 entry passed to kdump and let it place swiotlb in wanted region.
> > > Simply remap it like in this patch looks much cleaner.
> > > 
> > > If this patch is acceptable, one more patch is needed to expose the
> > > swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> > > second kernel.
> > 
> > We seem to be passing a lot of data to kexec.. Perhaps something
> > of a unified way since we seem to have a lot of things to pass - disabling
> > IOMMU, ACPI RSDT address, and then this.
> > 
> > CC-ing Anthony who is working on something - would you by any chance
> > have a doc on this?
> 
> 
> I see in general 2 use cases here:
> 
> 
> 1) Allow for a generic mechanism to have the full system, individual buses,
> devices or functions of a device go through a particular, self-contained
> bounce buffer.
> 
> This sounds like the holy grail to a lot of problems. It would solve typical
> embedded scenarios where you only have a shared SRAM. It solves the safety
> case (to some extent) where you need to ensure that one device interaction
> doesn't conflict with another device interaction. It also solves the problem
> I've tried to solve with the patch here.
> 
> It's unfortunately a lot harder than the patch I sent, so it will take me
> some time to come up with a working patch set.. I suppose starting with a DT
> binding only is sensible. Worst case, x86 does also support DT ...
> 
> (And yes, I'm always happy to review patches if someone else beats me to it)
> 
> 
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump
> 
> I see little direct relation to SEV here. The only reason SEV makes it more
> relevant, is that you need to have an SWIOTLB region available with SEV
> while without you could live with a disabled IOMMU.
> 
> However, I can definitely understand how you would want to have a way to
> tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its
> memory for its own SWIOTLB. That way, you don't have to reserve another 64MB
> of RAM for kdump.
> 
> What I'm curious on is whether we need to be as elaborate. Can't we just
> pass the old SWIOTLB as free memory to the new kexec'ed kernel and
> everything else will fall into place? All that would take is a bit of
> shuffling on the e820 table pass-through to the kexec'ed kernel, no?

Swiotlb memory has to be contiguous. We can't guarantee that region
won't be touched by kernel allocations before swiotlb init. Then we may
not have a chance to get a contiguous 

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Dave Young
Hi,

[snip]
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump

We should only care about kdump

> 
> I see little direct relation to SEV here. The only reason SEV makes it more
> relevant, is that you need to have an SWIOTLB region available with SEV
> while without you could live with a disabled IOMMU.


There is a comment in arch/x86/kernel/pci-swiotlb.c explaining why it is
enforced:
	/*
	 * If SME is active then swiotlb will be set to 1 so that bounce
	 * buffers are allocated and used for devices that do not support
	 * the addressing range required for the encryption mask.
	 */
	if (sme_active())
		swiotlb = 1;

> 
> However, I can definitely understand how you would want to have a way to
> tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its
> memory for its own SWIOTLB. That way, you don't have to reserve another 64MB
> of RAM for kdump.
> 
> What I'm curious on is whether we need to be as elaborate. Can't we just
> pass the old SWIOTLB as free memory to the new kexec'ed kernel and
> everything else will fall into place? All that would take is a bit of
> shuffling on the e820 table pass-through to the kexec'ed kernel, no?

Maybe either of the two is fine. But we may need to ensure that this swiotlb
area is reused explicitly in some way. Take the crashkernel=X,high case:
the major part is in the region above 4G, with only a small piece in low
memory. During boot, kernel/driver initialization could use up the
low memory, the remaining part could be too small for swiotlb, and
the swiotlb allocation would finally fail.

Thanks
Dave




Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Dave Young
On 03/30/20 at 09:40am, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> > On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
> > >
> > > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > > The swiotlb is a very convenient fallback mechanism for bounce buffering of
> > > > DMAable data. It is usually used for the compatibility case where devices
> > > > can only DMA to a "low region".
> > > >
> > > > However, in some scenarios this "low region" may be bound even more
> > > > heavily. For example, there are embedded system where only an SRAM region
> > > > is shared between device and CPU. There are also heterogeneous computing
> > > > scenarios where only a subset of RAM is cache coherent between the
> > > > components of the system. There are partitioning hypervisors, where
> > > > a "control VM" that implements device emulation has limited view into a
> > > > partition's memory for DMA capabilities due to safety concerns.
> > > >
> > > > This patch adds a command line driven mechanism to move all DMA memory into
> > > > a predefined shared memory region which may or may not be part of the
> > > > physical address layout of the Operating System.
> > > >
> > > > Ideally, the typical path to set this configuration would be through Device
> > > > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > > instead configure the system purely through kernel command line options.
> > > >
> > > > I'm sure other people will find the functionality useful going forward
> > > > though and extend it to be triggered by DT/ACPI in the future.
> > >
> > > Hmm, we have a use case for kdump, this may be useful.  For example
> > > swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> > > kernel we have to increase the crashkernel reserved size for the extra
> > > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > > swiotlb region and pass the addr to kdump kernel.
> > >
> > 
> > Yes, definitely helpful for kdump kernel. This can help reduce the
> > crashkernel value.
> > 
> > Previously I was thinking about something similar, play around the
> > e820 entry passed to kdump and let it place swiotlb in wanted region.
> > Simply remap it like in this patch looks much cleaner.
> > 
> > If this patch is acceptable, one more patch is needed to expose the
> > swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> > second kernel.
> 
> We seem to be passing a lot of data to kexec.. Perhaps something
> of a unified way since we seem to have a lot of things to pass - disabling
> IOMMU, ACPI RSDT address, and then this.

The acpi_rsdp kernel cmdline is not useful anymore. Its initial purpose was
kexec rebooting on EFI systems, but EFI boot is now supported
across kexec reboot, so it is no longer needed.

Thanks
Dave




Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Anthony Yznaga



On 3/30/20 1:42 PM, Alexander Graf wrote:
>
>
> On 30.03.20 15:40, Konrad Rzeszutek Wilk wrote:
>>
>>
>>
>> On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
>>> On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:

>>>> On 03/26/20 at 05:29pm, Alexander Graf wrote:
>>>>> The swiotlb is a very convenient fallback mechanism for bounce buffering of
>>>>> DMAable data. It is usually used for the compatibility case where devices
>>>>> can only DMA to a "low region".
>>>>>
>>>>> However, in some scenarios this "low region" may be bound even more
>>>>> heavily. For example, there are embedded system where only an SRAM region
>>>>> is shared between device and CPU. There are also heterogeneous computing
>>>>> scenarios where only a subset of RAM is cache coherent between the
>>>>> components of the system. There are partitioning hypervisors, where
>>>>> a "control VM" that implements device emulation has limited view into a
>>>>> partition's memory for DMA capabilities due to safety concerns.
>>>>>
>>>>> This patch adds a command line driven mechanism to move all DMA memory into
>>>>> a predefined shared memory region which may or may not be part of the
>>>>> physical address layout of the Operating System.
>>>>>
>>>>> Ideally, the typical path to set this configuration would be through Device
>>>>> Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
>>>>> in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
>>>>> instead configure the system purely through kernel command line options.
>>>>>
>>>>> I'm sure other people will find the functionality useful going forward
>>>>> though and extend it to be triggered by DT/ACPI in the future.
>>>>
>>>> Hmm, we have a use case for kdump, this may be useful.  For example
>>>> swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
>>>> kernel we have to increase the crashkernel reserved size for the extra
>>>> swiotlb requirement.  I wonder if we can just reuse the old kernel's
>>>> swiotlb region and pass the addr to kdump kernel.

>>>
>>> Yes, definitely helpful for kdump kernel. This can help reduce the
>>> crashkernel value.
>>>
>>> Previously I was thinking about something similar, play around the
>>> e820 entry passed to kdump and let it place swiotlb in wanted region.
>>> Simply remap it like in this patch looks much cleaner.
>>>
>>> If this patch is acceptable, one more patch is needed to expose the
>>> swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
>>> second kernel.
>>
>> We seem to be passing a lot of data to kexec.. Perhaps something
>> of a unified way since we seem to have a lot of things to pass - disabling
>> IOMMU, ACPI RSDT address, and then this.
>>
>> CC-ing Anthony who is working on something - would you by any chance
>> have a doc on this?
>
>
> I see in general 2 use cases here:
>
>
> 1) Allow for a generic mechanism to have the full system, individual buses, 
> devices or functions of a device go through a particular, self-contained 
> bounce buffer.
>
> This sounds like the holy grail to a lot of problems. It would solve typical 
> embedded scenarios where you only have a shared SRAM. It solves the safety 
> case (to some extent) where you need to ensure that one device interaction 
> doesn't conflict with another device interaction. It also solves the problem 
> I've tried to solve with the patch here.
>
> It's unfortunately a lot harder than the patch I sent, so it will take me 
> some time to come up with a working patch set.. I suppose starting with a DT 
> binding only is sensible. Worst case, x86 does also support DT ...
>
> (And yes, I'm always happy to review patches if someone else beats me to it)

Not precisely what is described here, but I am working on a somewhat generic 
mechanism for preserving pages across kexec based on this old RFC patch set: 
https://lkml.org/lkml/2013/7/1/211.  I expect to post patches soon.

Anthony

>
>
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump
>
> I see little direct relation to SEV here. The only reason SEV makes it more 
> relevant, is that you need to have an SWIOTLB region available with SEV while 
> without you could live with a disabled IOMMU.
>
> However, I can definitely understand how you would want to have a way to tell 
> the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its memory 
> for its own SWIOTLB. That way, you don't have to reserve another 64MB of RAM 
> for kdump.
>
> What I'm curious on is whether we need to be as elaborate. Can't we just pass 
> the old SWIOTLB as free memory to the new kexec'ed kernel and everything else 
> will fall into place? All that would take is a bit of shuffling on the e820 
> table pass-through to the kexec'ed kernel, no?
>
>
> Thanks,
>
> Alex
>
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am 

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Alexander Graf




On 30.03.20 15:40, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
>> On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
>>> On 03/26/20 at 05:29pm, Alexander Graf wrote:
>>>> The swiotlb is a very convenient fallback mechanism for bounce buffering of
>>>> DMAable data. It is usually used for the compatibility case where devices
>>>> can only DMA to a "low region".
>>>>
>>>> However, in some scenarios this "low region" may be bound even more
>>>> heavily. For example, there are embedded system where only an SRAM region
>>>> is shared between device and CPU. There are also heterogeneous computing
>>>> scenarios where only a subset of RAM is cache coherent between the
>>>> components of the system. There are partitioning hypervisors, where
>>>> a "control VM" that implements device emulation has limited view into a
>>>> partition's memory for DMA capabilities due to safety concerns.
>>>>
>>>> This patch adds a command line driven mechanism to move all DMA memory into
>>>> a predefined shared memory region which may or may not be part of the
>>>> physical address layout of the Operating System.
>>>>
>>>> Ideally, the typical path to set this configuration would be through Device
>>>> Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
>>>> in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
>>>> instead configure the system purely through kernel command line options.
>>>>
>>>> I'm sure other people will find the functionality useful going forward
>>>> though and extend it to be triggered by DT/ACPI in the future.
>>>
>>> Hmm, we have a use case for kdump, this may be useful.  For example
>>> swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
>>> kernel we have to increase the crashkernel reserved size for the extra
>>> swiotlb requirement.  I wonder if we can just reuse the old kernel's
>>> swiotlb region and pass the addr to kdump kernel.
>>
>> Yes, definitely helpful for kdump kernel. This can help reduce the
>> crashkernel value.
>>
>> Previously I was thinking about something similar, play around the
>> e820 entry passed to kdump and let it place swiotlb in wanted region.
>> Simply remap it like in this patch looks much cleaner.
>>
>> If this patch is acceptable, one more patch is needed to expose the
>> swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
>> second kernel.
>
> We seem to be passing a lot of data to kexec.. Perhaps something
> of a unified way since we seem to have a lot of things to pass - disabling
> IOMMU, ACPI RSDT address, and then this.
>
> CC-ing Anthony who is working on something - would you by any chance
> have a doc on this?



I see in general 2 use cases here:


1) Allow for a generic mechanism to have the full system, individual 
buses, devices or functions of a device go through a particular, 
self-contained bounce buffer.


This sounds like the holy grail to a lot of problems. It would solve 
typical embedded scenarios where you only have a shared SRAM. It solves 
the safety case (to some extent) where you need to ensure that one 
device interaction doesn't conflict with another device interaction. It 
also solves the problem I've tried to solve with the patch here.


It's unfortunately a lot harder than the patch I sent, so it will take 
me some time to come up with a working patch set.. I suppose starting 
with a DT binding only is sensible. Worst case, x86 does also support DT ...


(And yes, I'm always happy to review patches if someone else beats me to it)


2) Reuse the SWIOTLB from the previous boot on kexec/kdump

I see little direct relation to SEV here. The only reason SEV makes it 
more relevant, is that you need to have an SWIOTLB region available with 
SEV while without you could live with a disabled IOMMU.


However, I can definitely understand how you would want to have a way to 
tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse 
its memory for its own SWIOTLB. That way, you don't have to reserve 
another 64MB of RAM for kdump.


What I'm curious on is whether we need to be as elaborate. Can't we just 
pass the old SWIOTLB as free memory to the new kexec'ed kernel and 
everything else will fall into place? All that would take is a bit of 
shuffling on the e820 table pass-through to the kexec'ed kernel, no?



Thanks,

Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879






Re: [PATCH] crash_dump: remove saved_max_pfn

2020-03-30 Thread Eric W. Biederman
Kairui Song  writes:

> This variable is no longer used.
>
> saved_max_pfn was originally introduced in commit 92aa63a5a1bf ("[PATCH]
> kdump: Retrieve saved max pfn"), used to make sure that the user does not
> try to read physical memory beyond saved_max_pfn. But since
> commit 921d58c0e699 ("vmcore: remove saved_max_pfn check")
> it's no longer used for the check.
>
> The only user left is the Calgary IOMMU, which started using it in
> commit 95b68dec0d52 ("calgary iommu: use the first kernels TCE tables
> in kdump"). But again, recently in commit 90dc392fc445 ("x86: Remove
> the calgary IOMMU driver"), the Calgary IOMMU was removed and this variable
> no longer has any user.
>
> So just remove it.
>
> Signed-off-by: Kairui Song 

Acked-by: "Eric W. Biederman" 

Can we merge this through the tip tree?


> ---
>  arch/x86/kernel/e820.c | 8 
>  include/linux/crash_dump.h | 2 --
>  kernel/crash_dump.c| 6 --
>  3 files changed, 16 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c5399e80c59c..4d13c57f370a 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -910,14 +910,6 @@ static int __init parse_memmap_one(char *p)
>   return -EINVAL;
>  
>   if (!strncmp(p, "exactmap", 8)) {
> -#ifdef CONFIG_CRASH_DUMP
> - /*
> -  * If we are doing a crash dump, we still need to know
> -  * the real memory size before the original memory map is
> -  * reset.
> -  */
> - saved_max_pfn = e820__end_of_ram_pfn();
> -#endif
>   e820_table->nr_entries = 0;
>   userdef = 1;
>   return 0;
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 4664fc1871de..bc156285d097 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -97,8 +97,6 @@ extern void unregister_oldmem_pfn_is_ram(void);
>  static inline bool is_kdump_kernel(void) { return 0; }
>  #endif /* CONFIG_CRASH_DUMP */
>  
> -extern unsigned long saved_max_pfn;
> -
>  /* Device Dump information to be filled by drivers */
>  struct vmcoredd_data {
>   char dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
> diff --git a/kernel/crash_dump.c b/kernel/crash_dump.c
> index 9c23ae074b40..92da32275af5 100644
> --- a/kernel/crash_dump.c
> +++ b/kernel/crash_dump.c
> @@ -5,12 +5,6 @@
>  #include 
>  #include 
>  
> -/*
> - * If we have booted due to a crash, max_pfn will be a very low value. We need
> - * to know the amount of memory that the previous kernel used.
> - */
> -unsigned long saved_max_pfn;
> -
>  /*
>   * stores the physical address of elf header of crash image
>   *



Re: [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name

2020-03-30 Thread David Hildenbrand
On 26.03.20 19:07, James Morse wrote:
> If kexec chooses to place the kernel in a memory region that was
> added after boot, we fail to boot as the kernel is running from a
> location that is not described as memory by the UEFI memory map or
> the original DT.
> 
> To prevent unaware user-space kexec from doing this accidentally,
> give these regions a different name.
> 
> Signed-off-by: James Morse 
> ---
> This is a change in behaviour as seen by user-space, because memory hot-add
> has already been merged.
> 
>  arch/arm64/include/asm/memory.h | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 2be67b232499..ef1686518469 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -166,6 +166,17 @@
>  #define IOREMAP_MAX_ORDER	(PMD_SHIFT)
>  #endif
>  
> +/*
> + * Memory hotplug allows new regions of 'System RAM' to be added to the system.
> + * These aren't described as memory by the UEFI memory map, or DT memory node.
> + * If we kexec from one of these regions, the new kernel boots from a location
> + * that isn't described as RAM.
> + *
> + * Give these resources a different name, so unaware kexec doesn't do this by
> + * accident.
> + */
> +#define MEMORY_HOTPLUG_RES_NAME "System RAM (hotplug)"
> +
>  #ifndef __ASSEMBLY__
>  extern u64   vabits_actual;
>  #define PAGE_END (_PAGE_END(vabits_actual))
> 

(While I am familiar with makedumpfile in the crash kernel, I am not yet
familiar with kexec, so bear with me.)


Looking at kexec:arch/arm64/crashdump-arm64.c

load_crashdump_segments() -> crash_get_memory_ranges() ->
kexec_iomem_for_each_line() -> iomem_range_callback()


#define SYSTEM_RAM "System RAM\n"

...
} else if (strncmp(str, SYSTEM_RAM, strlen(SYSTEM_RAM)) == 0) {
return mem_regions_add(_memory_rgns, ...);
}


The hotplugged memory will no longer be detected as a crashdump segment,
consequently (AFAIU) not be described in the elf header, and therefore
also no longer dumped (e.g., by makedumpfile).

I assume you'll have to adapt kexec-tools to still consider this memory
for dumping, correct? Or am I missing something?
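
The mismatch can be reproduced outside kexec-tools. A minimal sketch, assuming resource names arrive newline-terminated as the SYSTEM_RAM define above implies; `is_system_ram()` and `is_system_ram_prefix()` are hypothetical helpers for illustration, not kexec-tools functions:

```c
#include <string.h>

#define SYSTEM_RAM "System RAM\n"
#define SYSTEM_RAM_PREFIX "System RAM"

/* Mirrors the strncmp() test quoted above: the trailing newline is part of the match. */
int is_system_ram(const char *str)
{
	return strncmp(str, SYSTEM_RAM, strlen(SYSTEM_RAM)) == 0;
}

/* Hypothetical relaxed variant: accepts any name that starts with "System RAM". */
int is_system_ram_prefix(const char *str)
{
	return strncmp(str, SYSTEM_RAM_PREFIX, strlen(SYSTEM_RAM_PREFIX)) == 0;
}
```

With the strict test, a resource named "System RAM (hotplug)" is skipped because the character after "System RAM" is a space rather than a newline. A prefix test would accept it, at the cost of also accepting any other qualifier appended to the name.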


-- 
Thanks,

David / dhildenb




Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-03-30 Thread James Morse
Hi David,

On 3/30/20 2:23 PM, David Hildenbrand wrote:
>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>> index 0a54ffac8c68..69b03dd7fc74 100644
>>>> --- a/mm/memory_hotplug.c
>>>> +++ b/mm/memory_hotplug.c
>>>> @@ -42,6 +42,10 @@
>>>>  #include "internal.h"
>>>>  #include "shuffle.h"
>>>>  
>>>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>>>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>>>> +#endif
>>>
>>> So I assume changing this for all architectures would result in some
>>> user space tool breaking? Are we aware of any?
>>
>> Last time we had to touch arm64's /proc/iomem strings I went through debian's
>> codesearch for stuff that reads it, kexec-tools was the only thing that parsed
>> it in anger. (From memory, the other tools were looking for PCIe windows to do
>> firmware flashing..)
>>
>> Looking again, having qualifiers on the end of 'System RAM' looks like it could
>> confuse 's390-tools's detect_mem_chunks parser.
> 
> Good to know, we should find out if this could work.
> 
>>
>> It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate set.
>>
>>
>>> I do wonder if we should simply change it for all architectures if possible.
>>
>> If it's possible that would be great. But I suspect that ship has sailed,
>> changing it on other architectures could break some fragile parsing code.
> 
> I assume any parser has to be prepared for new types showing up.
> Otherwise these would not be future proof. The question is if a common
> prefix is problematic.
> 
> E.g., Use "Hotplugged System RAM" instead of "System RAM (hotplug)"

Aha, I went for a (suffix) because that is what 32bit Arm did for the boot alias.


>> I'm wary of changing it on arm64, the only thing that makes it tolerable is that
>> memory hot-add was relatively recently merged, and we don't anticipate it being
>> widely used until you can remove memory as well.
>>
>> Changing it on arm64 is to prevent today's versions of kexec-tools from
>> accidentally placing the new kernel in memory that wasn't described at boot.
>> This leads to an unhandled exception during boot[0] because the kernel can't
>> access itself via the mapping of all memory. (hotpluggable regions are only
>> discovered by suitably configured ACPI systems much later)

> I want the very same for virtio-mem (initially x86-only, but later open
> for all archs). Can also be interesting for Hyper-V. kexec should not
> try to use hotplugged memory as kexec target, as memory blocks can be
> partially inaccessible.

Great! I assumed these placement requirements would be arm64 specific.


> Of course, I can provide an interface to override the name via
> add_memory(), but having it on all architectures handled in a similar
> way right from the start would be nicer.

I agree having it the same on all architectures would be good.

It sounds like virtio-mem is a better argument for doing this than arm64's
firmware memory description.

I'll have a read, and maybe post something to linux-arch to do this at rc1.
(I assume we'll have a few weeks to make sure arm64 at least uses the same name
if it goes on longer)


Thanks,

James

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
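
To illustrate the parsing concern raised in the message above, here is a
minimal Python sketch of how a user-space tool might pick RAM ranges out of
/proc/iomem-style text. It is not taken from kexec-tools or s390-tools. The
sample listing, the function name iomem_ranges and the "System RAM (hotplug)"
string are assumptions for illustration; the final resource name was still
under discussion in this thread.

```python
import re

# One /proc/iomem-style line: optional indent, "start-end : name" (hex addresses).
_IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def iomem_ranges(iomem_text, wanted="System RAM", prefix=False):
    """Return (start, end) pairs for top-level entries named `wanted`.

    With prefix=False the name must match exactly; with prefix=True any
    name that merely starts with `wanted` is accepted.
    """
    ranges = []
    for line in iomem_text.splitlines():
        m = _IOMEM_LINE.match(line)
        if m is None:
            continue
        indent, start, end, name = m.groups()
        if indent:
            # Nested entries (e.g. "Kernel code") describe sub-ranges; skip them.
            continue
        name = name.strip()
        if name == wanted or (prefix and name.startswith(wanted)):
            ranges.append((int(start, 16), int(end, 16)))
    return ranges


# Hypothetical listing; the addresses and the hotplug name are made up.
SAMPLE = """\
40000000-7fffffff : System RAM
  40080000-40ffffff : Kernel code
80000000-bfffffff : System RAM (hotplug)
c0000000-c0000fff : reserved
"""

exact = iomem_ranges(SAMPLE)               # exact-name match
loose = iomem_ranges(SAMPLE, prefix=True)  # prefix match
```

With exact matching only the boot-time region is returned. A parser that
matches on the 'System RAM' prefix would also pick up the suffixed hotplug
entry, which is the case a name such as "Hotplugged System RAM" would avoid.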


Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-30 Thread James Morse
Hi Baoquan,

On 3/30/20 2:55 PM, Baoquan He wrote:
> On 03/26/20 at 06:07pm, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
> 
> I talked about this with Dave Young. Currently, we tend to use
> kexec_file_load more in the future since most of its implementation is
> in kernel, we can get information about kernel more easily. For the
> kexec kernel loaded into hotpluggable area, we can fix it in
> kexec_file_load side, we know the MOVABLE zone's start and end. As for
> the old kexec_load, we would like to keep it for back compatibility. At
> least in our distros, we have switched to kexec_file_load, will
> gradually obsolete kexec_load.

> So for this one, I suggest avoiding those
> MOVABLE memory regions when searching for a place for the kexec kernel.

How does today's user-space know?


> Not sure if arm64 will still have difficulty.

arm64 added support for kexec_load first, then kexec_file_load. (evidently a
mistake).
kexec_file_load support was only added in the last year or so, I'd hazard most
people using this, are using the regular load kind. (and probably don't know or
care).



Thanks,

James



Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-30 Thread Baoquan He
Hi James,

On 03/26/20 at 06:07pm, James Morse wrote:
> Hello!
> 
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
> 
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).
> 
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.

I talked about this with Dave Young. We intend to use kexec_file_load
more in the future: most of its implementation is in the kernel, so we
can get information about the kernel more easily. For a kexec kernel
loaded into a hotpluggable area, we can fix it on the kexec_file_load
side, since we know the MOVABLE zone's start and end. As for the old
kexec_load, we would like to keep it for backward compatibility. At
least in our distros, we have switched to kexec_file_load and will
gradually obsolete kexec_load. So for this one, I suggest avoiding
MOVABLE memory regions when searching for a place for the kexec kernel.

Not sure if arm64 will still have difficulty.
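
As a rough illustration of the suggestion above (skipping MOVABLE regions
when searching for a place for the kexec kernel), here is a small user-space
Python model. It is not kernel code. The function name find_hole, the
half-open (start, end) range representation and the 2 MiB alignment are
assumptions made for this sketch; the real search in the kernel is more
involved.

```python
def find_hole(candidates, avoid, size, align=0x200000):
    """Return the lowest aligned start of a free hole, or None.

    candidates: half-open (start, end) ranges that may hold the image.
    avoid:      half-open (start, end) ranges that must not be used
                (standing in for MOVABLE / hotpluggable memory).
    size:       number of bytes needed.
    align:      required alignment of the start address (assumed 2 MiB).
    """
    for c_start, c_end in sorted(candidates):
        # Round the candidate start up to the next aligned address.
        start = -(-c_start // align) * align
        while start + size <= c_end:
            # End addresses of every avoided range overlapping [start, start+size).
            clash = [a_end for a_start, a_end in avoid
                     if a_start < start + size and start < a_end]
            if not clash:
                return start
            # Jump past the furthest overlapping range, then realign.
            start = -(-max(clash) // align) * align
    return None


ram = [(0x40100000, 0xc0000000)]          # hypothetical RAM range
movable = [(0x80000000, 0xc0000000)]      # hypothetical MOVABLE range
placement = find_hole(ram, movable, 0x1000000)
```

Here the 16 MiB image lands at the first aligned address in the non-movable
part of RAM; if the whole candidate range is marked as avoided, the function
returns None and the load would have to be refused.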

> 
> 
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
> 
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
> 
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
> 
> 
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.
> 
> 
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.
> 
> Thanks,
> 
> James Morse (3):
>   kexec: Prevent removal of memory in use by a loaded kexec image
>   mm/memory_hotplug: Allow arch override of non boot memory resource
> names
>   arm64: memory: Give hotplug memory a different resource name
> 
>  arch/arm64/include/asm/memory.h | 11 +++
>  kernel/kexec_core.c | 56 +
>  mm/memory_hotplug.c |  6 +++-
>  3 files changed, 72 insertions(+), 1 deletion(-)
> 
> -- 
> 2.25.1
> 
> 




Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Konrad Rzeszutek Wilk
On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> On Sat, Mar 28, 2020 at 7:57 PM Dave Young wrote:
> >
> > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > > The swiotlb is a very convenient fallback mechanism for bounce buffering of
> > > DMAable data. It is usually used for the compatibility case where devices
> > > can only DMA to a "low region".
> > >
> > > However, in some scenarios this "low region" may be bound even more
> > > > heavily. For example, there are embedded systems where only an SRAM region
> > > is shared between device and CPU. There are also heterogeneous computing
> > > scenarios where only a subset of RAM is cache coherent between the
> > > components of the system. There are partitioning hypervisors, where
> > > a "control VM" that implements device emulation has limited view into a
> > > partition's memory for DMA capabilities due to safety concerns.
> > >
> > > > This patch adds a command line driven mechanism to move all DMA memory into
> > > a predefined shared memory region which may or may not be part of the
> > > physical address layout of the Operating System.
> > >
> > > > Ideally, the typical path to set this configuration would be through Device
> > > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > instead configure the system purely through kernel command line options.
> > >
> > > I'm sure other people will find the functionality useful going forward
> > > though and extend it to be triggered by DT/ACPI in the future.
> >
> > Hmm, we have a use case for kdump, this may be useful.  For example
> > swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> > kernel we have to increase the crashkernel reserved size for the extra
> > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > swiotlb region and pass the addr to kdump kernel.
> >
> 
> Yes, definitely helpful for kdump kernel. This can help reduce the
> crashkernel value.
> 
> Previously I was thinking about something similar, play around the
> e820 entry passed to kdump and let it place swiotlb in wanted region.
> Simply remap it like in this patch looks much cleaner.
> 
> If this patch is acceptable, one more patch is needed to expose the
> swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> second kernel.

We seem to be passing a lot of data to kexec. Perhaps we need a unified
way of doing it, since there are a lot of things to pass: disabling the
IOMMU, the ACPI RSDT address, and now this.

CC-ing Anthony who is working on something - would you by any chance
have a doc on this?

Thanks!
> 
> > >
> > > Signed-off-by: Alexander Graf 
> > > ---
> > > >  Documentation/admin-guide/kernel-parameters.txt |  3 +-
> > > >  Documentation/x86/x86_64/boot-options.rst       |  4 ++-
> > > >  kernel/dma/swiotlb.c                            | 46 +++--
> > > >  3 files changed, 49 insertions(+), 4 deletions(-)
> > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index c07815d230bc..d085d55c3cbe 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -4785,11 +4785,12 @@
> > > >   it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
> > > >
> > > >   swiotlb=[ARM,IA-64,PPC,MIPS,X86]
> > > > - Format: { <int> | force | noforce }
> > > > + Format: { <int> | force | noforce | addr=<phys addr> }
> > > >   <int> -- Number of I/O TLB slabs
> > > >   force -- force using of bounce buffers even if they
> > > >            wouldn't be automatically used by the kernel
> > > >   noforce -- Never use bounce buffers (for debugging)
> > > > + addr=<phys addr> -- Try to allocate SWIOTLB at defined address
> > >
> > >   switches=   [HW,M68k]
> > >
> > > > diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
> > > > index 2b98efb5ba7f..ca46c57b68c9 100644
> > > > --- a/Documentation/x86/x86_64/boot-options.rst
> > > > +++ b/Documentation/x86/x86_64/boot-options.rst
> > > > @@ -297,11 +297,13 @@ iommu options only relevant to the AMD GART hardware IOMMU:
> > > >  iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
> > > >  implementation:
> > > >
> > > > -  swiotlb=<pages>[,force]
> > > > +  swiotlb=<pages>[,force][,addr=<phys addr>]
> > > >    <pages>
> > > >      Prereserve that many 128K pages for the software IO bounce buffering.
> > > >    force
> > > >      Force all IO through the software TLB.
> > > > +  addr=<phys addr>
> > > > +    Try to allocate SWIOTLB at defined address
> > >
> > >  Settings for the IBM Calgary hardware IOMMU currently found in IBM
> > > 

Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-03-30 Thread David Hildenbrand
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 0a54ffac8c68..69b03dd7fc74 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -42,6 +42,10 @@
>>>  #include "internal.h"
>>>  #include "shuffle.h"
>>>  
>>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>>> +#endif
>>
>> So I assume changing this for all architectures would result in some
>> user space tool breaking? Are we aware of any?
> 
> Last time we had to touch arm64's /proc/iomem strings I went through debian's
> codesearch for stuff that reads it, kexec-tools was the only thing that parsed
> it in anger. (From memory, the other tools were looking for PCIe windows to do
> firmware flashing..)
> 
> Looking again, having qualifiers on the end of 'System RAM' looks like it could
> confuse 's390-tools's detect_mem_chunks parser.

Good to know, we should find out if this could work.

> 
> It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate set.
> 
> 
>> I do wonder if we should simply change it for all architectures if possible.
> 
> If it's possible that would be great. But I suspect that ship has sailed,
> changing it on other architectures could break some fragile parsing code.

I assume any parser has to be prepared for new types showing up.
Otherwise these would not be future proof. The question is if a common
prefix is problematic.

E.g., Use "Hotplugged System RAM" instead of "System RAM (hotplug)"

> 
> I'm wary of changing it on arm64, the only thing that makes it tolerable is that
> memory hot-add was relatively recently merged, and we don't anticipate it being
> widely used until you can remove memory as well.
> 
> Changing it on arm64 is to prevent today's versions of kexec-tools from
> accidentally placing the new kernel in memory that wasn't described at boot.
> This leads to an unhandled exception during boot[0] because the kernel can't
> access itself via the mapping of all memory. (hotpluggable regions are only
> discovered by suitably configured ACPI systems much later)

I want the very same for virtio-mem (initially x86-only, but later open
for all archs). Can also be interesting for Hyper-V. kexec should not
try to use hotplugged memory as kexec target, as memory blocks can be
partially inaccessible.

Of course, I can provide an interface to override the name via
add_memory(), but having it on all architectures handled in a similar
way right from the start would be nicer.


-- 
Thanks,

David / dhildenb




Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use

2020-03-30 Thread David Hildenbrand
On 27.03.20 16:42, James Morse wrote:
> Hi David,
> 
> On 3/27/20 9:27 AM, David Hildenbrand wrote:
>> On 26.03.20 19:07, James Morse wrote:
>>> arm64 recently queued support for memory hotremove, which led to some
>>> new corner cases for kexec.
>>>
>>> If the kexec segments are loaded for a removable region, that region may
>>> be removed before kexec actually occurs. This causes the first kernel to
>>> lockup when applying the relocations. (I've triggered this on x86 too).
>>>
>>> The first patch adds a memory notifier for kexec so that it can refuse
>>> to allow in-use regions to be taken offline.
> 
>> IIRC other architectures handle that by setting the affected pages
>> PageReserved. Any reason why to not stick to the same?
> 
> Hmm, I didn't spot this. How come core code doesn't do it if its needed?
> 
> Doesn't PG_Reserved prevent the page from being used for regular allocations?
> (or is that only if its done early)
> 
> I prefer the runtime check as the dmesg output gives the user some chance of
> knowing why their memory-offline failed, and doing something about it!

I was confused about which memory we are trying to protect. Understood now:
you are dealing with the target physical memory described during
kexec_load.

[...]

> 
>> Also, makedumpfile will check if the
>> sections are still around IIRC.
> 
> Curious. I thought the vmcore was virtually addressed, how does it know which
> linear-map portions correspond to sysfs memory nodes with KASLR?

That's a very interesting question. I remember there was KASLR support
being implemented specifically for that - but I don't know any details.

>> Not sure what you mean by "Unaware kdump from user-space".
> 
> The existing kexec-tools binaries, that (I assume) don't go probing to find out
> if 'System RAM' is removable or not, loading a kdump kernel, along with the
> user-space generated blob that describes the first kernel's memory usage to the
> second kernel.

Finally understood how kexec without kdump works, thanks.

-- 
Thanks,

David / dhildenb




Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-30 Thread David Hildenbrand
> Adding a sentence about the way kexec load works may help, the first paragraph
> would read:
> 
> | Kexec allows user-space to specify the address that the kexec image should be
> | loaded to. Because this memory may be in use, an image loaded for kexec is not
> | stored in place, instead its segments are scattered through memory, and are
> | re-assembled when needed. In the meantime, the target memory may have been
> | removed.
> 
> Do you think that's clearer?

Yes, very much. Maybe add that the target is described by user space
during kexec_load() and that user space - right now - parses /proc/iomem
to find applicable system memory.

> [...]
> 
>>> Load kexec:
>>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>>
>>
>> I assume this will trigger
>>
>> kexec_load -> do_kexec_load -> kimage_load_segment ->
>> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
>>
>> Which will just allocate a bunch of pages and mark them reserved.
>>
>> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
>> segment allocations will actually end up on your DIMM (as it is onlined
>> online_movable).
>>
>> So, the loaded image (with its segments) from user won't be problematic
>> and not get placed on your DIMM.
>>
>>
>> Now, the problematic part is (via man kexec_load) "mem and memsz specify
>> a physical address range that is the target of the copy."
>>
>> So the place where the image will be "assembled" at when doing the
>> reboot. Understood :)
> 
> Yup.
> 
> [...]
> 
>> I wonder if we should instead make the "kexec -e" fail. It tries to
>> touch random system memory.
> 
> Heh, isn't touching random system memory what kexec does?!

Having a racy user interface that can trigger kernel crashes feels very
wrong. We should limit the impact.

> 
> It's all described to user-space as 'System RAM'. Teaching it to probe
> /sys/devices/memory/... would require a user-space change.

I think we should really rename hotplugged memory on all architectures.

This is especially relevant for the virtio-mem/Hyper-V balloon, where some
pieces of (hotplugged) memory blocks are partially unavailable and
should not be touched - accessing them results in unpredictable behavior
(e.g., crashes or discarded writes).

[...]

>> Will probably need some thought. But it will actually also bail out when
>> user space passes wrong physical memory addresses, instead of
>> triple-faulting silently.
> 
> With this change, the reboot(LINUX_REBOOT_CMD_KEXEC) call would fail. This
> thing doesn't usually return, so we're likely to trigger error-handling that has
> never run before.
> 
> (Last time I debugged one of these, it turned out kexec had taken the network
> interfaces down, meaning the nfsroot was no longer accessible)
> 
> How can user-space know whether kexec is going to succeed, or fail like this?
> Any loaded kexec kernel could secretly be in this broken state.
> 
> Can user-space know what caused this to become unreliable? (without reading the
> kernel source)
> 
> 
> Given kexec can be unloaded by user-space, I think it's better to prevent us
> getting into the broken state, preferably giving the hint that kexec is using
> that memory. The user can 'kexec -u', then retry removing the memory.
> 
> I think forbidding the memory-offline is simpler for user-space to deal with.

I thought about this over the weekend, and I don't think it's the right
approach.

1. It's racy. If memory is getting offlined/unplugged just while user
space is about to trigger the kexec_load(), you end up with the very
same triple-fault.

2. It's semantically wrong. kexec does not need online memory ("managed
by the buddy"), but still you disallow offlining memory.


I would much rather see user-space choosing boot memory
(e.g., by renaming hotplugged memory on all architectures), and a check
during "kexec -e" that the selected memory is actually "there", before
trying to write to it.

-- 
Thanks,

David / dhildenb
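
The check proposed above (verifying at "kexec -e" time that the selected
memory is still there) can be modelled in a few lines. This is a Python
sketch of the idea only, not kernel code: image_is_present and merge_ranges
are hypothetical names, segments are (mem, memsz) pairs as passed to
kexec_load, and present memory is given as half-open (start, end) ranges.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent half-open (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def image_is_present(segments, present):
    """True if every (mem, memsz) segment lies entirely in present memory."""
    merged = merge_ranges(present)
    return all(
        any(start <= mem and mem + memsz <= end for start, end in merged)
        for mem, memsz in segments
    )


boot = (0x40000000, 0x80000000)    # hypothetical boot memory
dimm = (0x80000000, 0xc0000000)    # hypothetical hotplugged DIMM
segments = [(0x7ff00000, 0x200000)]  # one segment straddling both

ok_before_unplug = image_is_present(segments, [boot, dimm])
ok_after_unplug = image_is_present(segments, [boot])
```

A segment that straddles the boot memory and the DIMM passes while both are
present and fails once the DIMM is gone, which is the point at which the
exec would be refused rather than writing to missing memory.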




Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-03-30 Thread James Morse
Hi David,

On 3/27/20 6:52 PM, David Hildenbrand wrote:
>>> 2. You do the kexec. The kexec kernel will only operate on a reserved
>>> memory region (reserved via e.g., kernel cmdline crashkernel=128M).
>>
>> I think you are merging the kexec and kdump behaviours.
>> (Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')
> 
> Oh, I see - I think your example below clarifies things. Something like
> that should go in the cover letter if we end up in this patch being
> required :)

Do you mean the commit message? I think it's far too long...

Adding a sentence about the way kexec load works may help, the first paragraph
would read:

| Kexec allows user-space to specify the address that the kexec image should be
| loaded to. Because this memory may be in use, an image loaded for kexec is not
| stored in place, instead its segments are scattered through memory, and are
| re-assembled when needed. In the meantime, the target memory may have been
| removed.

Do you think that's clearer?


> (I missed that the problematic part is "random" addresses passed by user
> space to the kernel, where it wants data to be loaded to on kexec -e)

[...]

>> Load kexec:
>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>
> 
> I assume this will trigger
> 
> kexec_load -> do_kexec_load -> kimage_load_segment ->
> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
> 
> Which will just allocate a bunch of pages and mark them reserved.
> 
> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
> segment allocations will actually end up on your DIMM (as it is onlined
> online_movable).
> 
> So, the loaded image (with its segments) from user won't be problematic
> and not get placed on your DIMM.
> 
> 
> Now, the problematic part is (via man kexec_load) "mem and memsz specify
> a physical address range that is the target of the copy."
> 
> So the place where the image will be "assembled" at when doing the
> reboot. Understood :)

Yup.

[...]

> I wonder if we should instead make the "kexec -e" fail. It tries to
> touch random system memory.

Heh, isn't touching random system memory what kexec does?!

It's all described to user-space as 'System RAM'. Teaching it to probe
/sys/devices/memory/... would require a user-space change.


> Denying to offline MOVABLE memory should be avoided - and what kexec
> does here sounds dangerous to me (allowing it to write random system
> memory).

> Roughly what I am thinking is this:
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ba1d91e868ca..70c39a5307e5 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1135,6 +1135,10 @@ int kernel_kexec(void)
> error = -EINVAL;
> goto Unlock;
> }
> +   if (!kexec_image_validate()) {
> +   error = -EINVAL;
> +   goto Unlock;
> +   }
> 
>  #ifdef CONFIG_KEXEC_JUMP
> if (kexec_image->preserve_context) {
> 
> 
> kexec_image_validate() would go over all segments and validate that the
> involved pages are actual valid memory (pfn_to_online_page()).
> 
> All we have to do is protect from memory hotplug until we switch to the
> new kernel.

(migrate_to_reboot_cpu() can sleep), I think you'd end up with something like
this patch, but only while kexec_in_progress. I don't think letting kexec fail
if the events occur in a different order is good for user-space.


> Will probably need some thought. But it will actually also bail out when
> user space passes wrong physical memory addresses, instead of
> triple-faulting silently.

With this change, the reboot(LINUX_REBOOT_CMD_KEXEC) call would fail. This
thing doesn't usually return, so we're likely to trigger error-handling that has
never run before.

(Last time I debugged one of these, it turned out kexec had taken the network
interfaces down, meaning the nfsroot was no longer accessible)

How can user-space know whether kexec is going to succeed, or fail like this?
Any loaded kexec kernel could secretly be in this broken state.

Can user-space know what caused this to become unreliable? (without reading the
kernel source)


Given kexec can be unloaded by user-space, I think it's better to prevent us
getting into the broken state, preferably giving the hint that kexec is using
that memory. The user can 'kexec -u', then retry removing the memory.

I think forbidding the memory-offline is simpler for user-space to deal with.


Thanks,

James
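
The alternative argued for above, refusing memory-offline while a loaded
kexec image targets that memory, boils down to a range-overlap test. Below is
a minimal Python model of that decision. It is not the code from patch 1:
may_offline is a hypothetical helper standing in for what a memory notifier
would decide, and segments are assumed to be (mem, memsz) pairs.

```python
def may_offline(block_start, block_size, loaded_segments):
    """Return False if the memory block overlaps any loaded kexec segment.

    block_start, block_size: the memory block about to go offline.
    loaded_segments: (mem, memsz) target ranges of the loaded image;
                     an empty list models "no image loaded" (after 'kexec -u').
    """
    block_end = block_start + block_size
    for mem, memsz in loaded_segments:
        # Two half-open ranges overlap when each starts before the other ends.
        if mem < block_end and block_start < mem + memsz:
            return False
    return True


# Hypothetical image with one segment at 2 GiB + 1 MiB, 16 MiB long.
loaded = [(0x80100000, 0x1000000)]
veto = may_offline(0x80000000, 0x8000000, loaded)      # block holds the segment
unrelated = may_offline(0x90000000, 0x8000000, loaded)  # block elsewhere
after_unload = may_offline(0x80000000, 0x8000000, [])   # image unloaded
```

The first request is refused because the block contains the segment's target
range; the same request succeeds once the image has been unloaded.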



Re: [RFC PATCH v2 1/3] meminfo_extra: introduce meminfo extra

2020-03-30 Thread Leon Romanovsky
On Sun, Mar 29, 2020 at 09:23:04AM +0200, Greg KH wrote:
> On Sun, Mar 29, 2020 at 10:19:07AM +0300, Leon Romanovsky wrote:
> > On Tue, Mar 24, 2020 at 09:53:16PM +0900, Jaewon Kim wrote:
> > >
> > >
> > > > On 2020-03-24 20:46, Greg KH wrote:
> > > > > On Tue, Mar 24, 2020 at 08:37:38PM +0900, Jaewon Kim wrote:
> > > > >>
> > > > >> On 2020-03-24 19:11, Greg KH wrote:
> > > > >>> On Tue, Mar 24, 2020 at 06:11:17PM +0900, Jaewon Kim wrote:
> > > >  On 2020-03-23 18:53, Greg KH wrote:
> > > > >> +int register_meminfo_extra(atomic_long_t *val, int shift, const char *name)
> > > >> +{
> > > >> +  struct meminfo_extra *meminfo, *memtemp;
> > > >> +  int len;
> > > >> +  int error = 0;
> > > >> +
> > > >> +  meminfo = kzalloc(sizeof(*meminfo), GFP_KERNEL);
> > > >> +  if (!meminfo) {
> > > >> +  error = -ENOMEM;
> > > >> +  goto out;
> > > >> +  }
> > > >> +
> > > >> +  meminfo->val = val;
> > > >> +  meminfo->shift_for_page = shift;
> > > >> +  strncpy(meminfo->name, name, NAME_SIZE);
> > > >> +  len = strlen(meminfo->name);
> > > >> +  meminfo->name[len] = ':';
> > > >> +  strncpy(meminfo->name_pad, meminfo->name, NAME_BUF_SIZE);
> > > >> +  while (++len < NAME_BUF_SIZE - 1)
> > > >> +  meminfo->name_pad[len] = ' ';
> > > >> +
> > > > >> +  spin_lock(&meminfo_lock);
> > > > >> +  list_for_each_entry_rcu(memtemp, &meminfo_head, list) {
> > > > >> +  if (memtemp->val == val) {
> > > > >> +  error = -EINVAL;
> > > > >> +  break;
> > > > >> +  }
> > > > >> +  }
> > > > >> +  if (!error)
> > > > >> +  list_add_tail_rcu(&meminfo->list, &meminfo_head);
> > > > >> +  spin_unlock(&meminfo_lock);
> > > > If you have a lock, why are you needing rcu?
> > >  I think _rcu should be removed out of list_for_each_entry_rcu.
> > >  But I'm confused about what you meant.
> > >  I used rcu_read_lock on __meminfo_extra,
> > > >  and I think spin_lock is also needed for addition and deletion to
> > > >  handle multiple modifiers.
> > > > >>> If that's the case, then that's fine, it just didn't seem like that was
> > > > >>> needed.  Or I might have been reading your rcu logic incorrectly...
> > > >>>
> > > >> +  if (error)
> > > >> +  kfree(meminfo);
> > > >> +out:
> > > >> +
> > > >> +  return error;
> > > >> +}
> > > >> +EXPORT_SYMBOL(register_meminfo_extra);
> > > > EXPORT_SYMBOL_GPL()?  I have to ask :)
> > >  I can use EXPORT_SYMBOL_GPL.
> > > > thanks,
> > > >
> > > > greg k-h
> > > >
> > > >
> > >  Hello
> > >  Thank you for your comment.
> > > 
> > > >  By the way there was not resolved discussion on v1 patch as I mentioned
> > > >  on cover page.
> > >  I'd like to hear your opinion on this /proc/meminfo_extra node.
> > > > >>> I think it is the propagation of an old and obsolete interface that you
> > > > >>> will have to support for the next 20+ years and yet not actually be
> > > > >>> useful :)
> > > >>>
> > >  Do you think this is meaningful or cannot co-exist with other future
> > >  sysfs based API.
> > > >>> What sysfs-based API?
> > > >> Please refer to mail thread on v1 patch set - 
> > > >> https://protect2.fireeye.com/url?k=16e3accc-4b2f6548-16e22783-0cc47aa8f5ba-935fe828ac2f6656=https://lkml.org/lkml/fancy/2020/3/10/2102
> > > >> especially discussion with Leon Romanovsky on 
> > > >> https://protect2.fireeye.com/url?k=74208ed9-29ec475d-74210596-0cc47aa8f5ba-0bd4ef48931fec95=https://lkml.org/lkml/fancy/2020/3/16/140
> > > > I really do not understand what you are referring to here, sorry.   I do
> > > > not see any sysfs-based code in that thread.
> > > Sorry. I also did not see actual code.
> > > > Hello Leon Romanovsky, could you elaborate your plan regarding sysfs stuff?
> >
> > Sorry for being late, I wasn't in "TO:", so missed the whole discussion.
> >
> > Greg,
> >
> > We need the exposed information for the memory optimizations (debug, not
> > production) of our high speed NICs. Our devices (mlx5) allocate a lot of
> > memory, so optimization there can help us scale more easily in SRIOV mode
> > and be less constrained by memory.
>
> Great, then use debugfs and expose whatever you want in whatever way
> you want, no restrictions there, you do not need any type of kernel-wide
> /proc file for that today.

No argument here, I just gave you an example of why Jaewon's idea is worth
exploring.

>
> > I want to emphasize that I don't like the idea of extending the /proc/*
> > interface, because it is going to be painful to grep on large machines with
> > many devices. And I don't like the idea that every driver will need to
> > register into this interface, because it will be abused almost immediately.
>
> I agree.
>
> > My proposal was to create a new sysfs file by driver/core and put all
> > information automatically there, for example, 

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Kairui Song
On Sat, Mar 28, 2020 at 7:57 PM Dave Young wrote:
>
> On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > The swiotlb is a very convenient fallback mechanism for bounce buffering of
> > DMAable data. It is usually used for the compatibility case where devices
> > can only DMA to a "low region".
> >
> > However, in some scenarios this "low region" may be bound even more
> > heavily. For example, there are embedded systems where only an SRAM region
> > is shared between device and CPU. There are also heterogeneous computing
> > scenarios where only a subset of RAM is cache coherent between the
> > components of the system. There are partitioning hypervisors, where
> > a "control VM" that implements device emulation has limited view into a
> > partition's memory for DMA capabilities due to safety concerns.
> >
> > This patch adds a command line driven mechanism to move all DMA memory into
> > a predefined shared memory region which may or may not be part of the
> > physical address layout of the Operating System.
> >
> > Ideally, the typical path to set this configuration would be through Device
> > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > instead configure the system purely through kernel command line options.
> >
> > I'm sure other people will find the functionality useful going forward
> > though and extend it to be triggered by DT/ACPI in the future.
>
> Hmm, we have a use case for kdump, so this may be useful.  For example,
> swiotlb is enabled by default if AMD SME/SEV is active, and in the kdump
> kernel we have to increase the crashkernel reserved size for the extra
> swiotlb requirement.  I wonder if we can just reuse the old kernel's
> swiotlb region and pass the addr to the kdump kernel.
>

Yes, this is definitely helpful for the kdump kernel. It can help reduce the
crashkernel value.

Previously I was thinking about something similar: playing around with the
e820 entries passed to kdump so that it places swiotlb in the wanted region.
Simply remapping it as in this patch looks much cleaner.

If this patch is acceptable, one more patch is needed to expose the
swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
the second kernel.
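
To make the iomem idea concrete, here is a rough userspace sketch of what
a tool on the kexec-tools side could do: read one /proc/iomem-style line,
pick out the swiotlb range, and build a swiotlb= argument for the second
kernel. Several things here are assumptions and not taken from the patch:
the "swiotlb" resource name (no such iomem entry exists today), the helper
names parse_iomem_line() and build_swiotlb_arg(), a hexadecimal physical
address as the value of addr=, and a slab size supplied by the caller
(2 KiB is assumed in the usage below).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/iomem-style line ("start-end : name", hex without 0x).
 * Returns 0 and fills *start / *end only if the resource name matches.
 * The resource name to look for is an assumption of this sketch.
 */
static int parse_iomem_line(const char *line, const char *name,
                            uint64_t *start, uint64_t *end)
{
    uint64_t s, e;
    char label[64];

    if (sscanf(line, " %" SCNx64 "-%" SCNx64 " : %63[^\n]", &s, &e, label) != 3)
        return -1;
    if (strcmp(label, name) != 0)
        return -1;

    *start = s;
    *end = e;
    return 0;
}

/*
 * Build "swiotlb=<nslabs>,addr=0x<start>" for an inclusive [start, end]
 * range. slab_size is passed in because the slab size is an assumption
 * here. Returns 0 on success, -1 on a bad range or a too-small buffer.
 */
static int build_swiotlb_arg(uint64_t start, uint64_t end, uint64_t slab_size,
                             char *buf, size_t len)
{
    uint64_t nslabs;
    int n;

    if (slab_size == 0 || end < start)
        return -1;

    nslabs = (end - start + 1) / slab_size;
    if (nslabs == 0)
        return -1;

    n = snprintf(buf, len, "swiotlb=%" PRIu64 ",addr=0x%" PRIx64,
                 nslabs, start);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For a hypothetical line "7a000000-7dffffff : swiotlb" (a 64 MiB range) and
a 2048-byte slab size, build_swiotlb_arg() produces
"swiotlb=32768,addr=0x7a000000". Lines for other resources are rejected,
so the caller can simply feed every line of /proc/iomem through
parse_iomem_line() until one matches.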

> >
> > Signed-off-by: Alexander Graf 
> > ---
> >  Documentation/admin-guide/kernel-parameters.txt |  3 +-
> >  Documentation/x86/x86_64/boot-options.rst       |  4 ++-
> >  kernel/dma/swiotlb.c                            | 46 +++--
> >  3 files changed, 49 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index c07815d230bc..d085d55c3cbe 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4785,11 +4785,12 @@
> >   it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
> >
> >   swiotlb=[ARM,IA-64,PPC,MIPS,X86]
> > - Format: {  | force | noforce }
> > + Format: {  | force | noforce | addr= }
> >-- Number of I/O TLB slabs
> >   force -- force using of bounce buffers even if they
> >wouldn't be automatically used by the kernel
> >   noforce -- Never use bounce buffers (for debugging)
> > + addr= -- Try to allocate SWIOTLB at defined address
> >
> >   switches=   [HW,M68k]
> >
> > diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
> > index 2b98efb5ba7f..ca46c57b68c9 100644
> > --- a/Documentation/x86/x86_64/boot-options.rst
> > +++ b/Documentation/x86/x86_64/boot-options.rst
> > @@ -297,11 +297,13 @@ iommu options only relevant to the AMD GART hardware IOMMU:
> >  iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
> >  implementation:
> >
> > -swiotlb=[,force]
> > +swiotlb=[,force][,addr=]
> >
> >  Prereserve that many 128K pages for the software IO bounce buffering.
> >force
> >  Force all IO through the software TLB.
> > +  addr=
> > +Try to allocate SWIOTLB at defined address
> >
> >  Settings for the IBM Calgary hardware IOMMU currently found in IBM
> >  pSeries and xSeries machines
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index c19379fabd20..83da0caa2f93 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -46,6 +46,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -102,6 +103,12 @@ unsigned int max_segment;
> >  #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
> >  static phys_addr_t *io_tlb_orig_addr;
> >
> > +/*
> > + * The TLB phys addr may be defined on the command line. Store it here if it is.