Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Baoquan He
On 04/22/20 at 12:05pm, David Hildenbrand wrote:
> On 22.04.20 11:57, Baoquan He wrote:
> > On 04/22/20 at 11:24am, David Hildenbrand wrote:
> >> On 22.04.20 11:17, Baoquan He wrote:
> >>> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >>>>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
> >>>>>> If we don't pass the efi, it won't get the SRAT table correctly, if
> >>>>>> I remember correctly. Yeah, I remember kvm guest can get memory
> >>>>>> hotplugged with ACPI only, this won't happen on bare metal though.
> >>>>>> Need to check carefully. I have been using kvm guest with uefi
> >>>>>> firmware recently.
> >>>>>
> >>>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> >>>>>
> >>>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> >>>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
> >>>>> will not detect it automatically (good!), instead load the virtio-mem
> >>>>> driver and let it add memory back to the system.
> >>>>>
> >>>>> I should probably play with kexec and virtio-mem once I have some spare
> >>>>> cycles ... to find out what's broken and needs to be addressed :)
> >>>>
> >>>> FWIW, I just gave virtio-mem and kexec/kdump a try.
> >>>>
> >>>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> >>>> The kexec kernel only uses memory in the crash region. The virtio-mem
> >>>> driver properly bails out due to is_kdump_kernel().
> >>>
> >>> Right, kdump is not impacted by memory added later.
> >>>
> >>>>
> >>>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> >>>> to get placed on virtio-mem memory (pure luck due to the left-to-right
> >>>> search). Memory added by virtio-mem is not getting added to the e820
> >>>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> >>>> right memory is readded.
> >>>
> >>> kexec_file_load behaves just as you tested. It doesn't collect memory
> >>> added later into e820 because it uses e820_table_kexec directly to
> >>> pass e820 to the kexec-ed kernel. However, e820_table_kexec is only
> >>> updated during the boot stage. I tried hot adding a DIMM after boot;
> >>> the kexec-ed kernel doesn't have it in e820 during bootup, but it's
> >>> recognized and added during ACPI scanning. I think we should update
> >>> e820_table_kexec when hot adding/removing memory, at least for DIMMs.
> >>> Not sure if DLPAR, virtio-mem, or balloon memory will need to be
> >>> added into e820_table_kexec too, and if this is expected behaviour.
> >>>
> >>> But whatever we do, it won't impact kexec file loading, because of
> >>> the bottom-up search strategy. Just adding them into
> >>> e820_table_kexec will make it consistent with a cold reboot, which
> >>> recognizes them and gets them into e820 during bootup.
> >>
> >> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
> >> kernel should see. Not more, not less.
> >>
> >> Regarding virtio-mem: Not in e820 on cold-boot.
> >> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
> >> IIRC. I think on real HW it can be different.
> > 
> > Yeah, DIMMs under KVM won't show up in the e820 map. However, this
> > is not a feature of QEMU/KVM, but a defect of it. I once asked Igor,
> > who is the developer of QEMU/KVM guest in this area, why we don't
> > make the kvm guest recognize hotpluggable DIMMs and add them into
> > the e820 map. He said he had tried to do it, but it would corrupt
> > the guest on HyperV. So he had to revert the
> 
> Yeah, I remember that this had to be reverted due to something breaking.
> But OTOH, it allows us to online coldplugged DIMMs online_movable
> easily, so I'd say it's even a feature (although it does not behave
> like the real HW we have).
> 
> I use this extensively when testing memory hot(un)plug via coldplugged
> DIMMs.
> 
> I do wonder if there is real HW where this is also the case.

None that I know of. Hotplug on real HW includes two parts; boot memory
being hotpluggable is the more flexible one. It allows people to replace
a bad DIMM. You can see that the boot-stage code has been adjusted a lot
for this purpose; at that time, people hadn't thought about kvm guests.

> 
> > commit in qemu. So I think we can leave it for now for both real HW
> > and kvm, or update e820_table_kexec to include added DIMMs for both
> > real HW and KVM. I hope one day KVM devs will find a way to overcome
> > the defect on HyperV and make the e820 map consistent with bare
> > metal. After all, a kvm guest is trying to imitate real HW for the
> > most part.
> > 
> > Anyway, I will think about the e820_table_kexec updating. See if we can
> > do something about it.
> 
> Yeah, for DIMMs on real HW it might definitely make sense. We might be
> able to hook into updates of /sys/firmware/memmap on memory add/remove.



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread David Hildenbrand
On 22.04.20 11:57, Baoquan He wrote:
> On 04/22/20 at 11:24am, David Hildenbrand wrote:
>> On 22.04.20 11:17, Baoquan He wrote:
>>> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
>>>>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
>>>>>> If we don't pass the efi, it won't get the SRAT table correctly, if
>>>>>> I remember correctly. Yeah, I remember kvm guest can get memory
>>>>>> hotplugged with ACPI only, this won't happen on bare metal though.
>>>>>> Need to check carefully. I have been using kvm guest with uefi
>>>>>> firmware recently.
>>>>>
>>>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>>>>>
>>>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>>>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>>>>> will not detect it automatically (good!), instead load the virtio-mem
>>>>> driver and let it add memory back to the system.
>>>>>
>>>>> I should probably play with kexec and virtio-mem once I have some spare
>>>>> cycles ... to find out what's broken and needs to be addressed :)
>>>>
>>>> FWIW, I just gave virtio-mem and kexec/kdump a try.
>>>>
>>>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
>>>> The kexec kernel only uses memory in the crash region. The virtio-mem
>>>> driver properly bails out due to is_kdump_kernel().
>>>
>>> Right, kdump is not impacted by memory added later.
>>>

>>>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
>>>> to get placed on virtio-mem memory (pure luck due to the left-to-right
>>>> search). Memory added by virtio-mem is not getting added to the e820
>>>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
>>>> right memory is readded.
>>>
>>> kexec_file_load behaves just as you tested. It doesn't collect memory
>>> added later into e820 because it uses e820_table_kexec directly to
>>> pass e820 to the kexec-ed kernel. However, e820_table_kexec is only
>>> updated during the boot stage. I tried hot adding a DIMM after boot;
>>> the kexec-ed kernel doesn't have it in e820 during bootup, but it's
>>> recognized and added during ACPI scanning. I think we should update
>>> e820_table_kexec when hot adding/removing memory, at least for DIMMs.
>>> Not sure if DLPAR, virtio-mem, or balloon memory will need to be
>>> added into e820_table_kexec too, and if this is expected behaviour.
>>>
>>> But whatever we do, it won't impact kexec file loading, because of
>>> the bottom-up search strategy. Just adding them into
>>> e820_table_kexec will make it consistent with a cold reboot, which
>>> recognizes them and gets them into e820 during bootup.
>>
>> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
>> kernel should see. Not more, not less.
>>
>> Regarding virtio-mem: Not in e820 on cold-boot.
>> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
>> IIRC. I think on real HW it can be different.
> 
> Yeah, DIMMs under KVM won't show up in the e820 map. However, this
> is not a feature of QEMU/KVM, but a defect of it. I once asked Igor,
> who is the developer of QEMU/KVM guest in this area, why we don't
> make the kvm guest recognize hotpluggable DIMMs and add them into
> the e820 map. He said he had tried to do it, but it would corrupt
> the guest on HyperV. So he had to revert the

Yeah, I remember that this had to be reverted due to something breaking.
But OTOH, it allows us to online coldplugged DIMMs online_movable
easily, so I'd say it's even a feature (although it does not behave
like the real HW we have).

I use this extensively when testing memory hot(un)plug via coldplugged
DIMMs.

I do wonder if there is real HW where this is also the case.

> commit in qemu. So I think we can leave it for now for both real HW
> and kvm, or update e820_table_kexec to include added DIMMs for both
> real HW and KVM. I hope one day KVM devs will find a way to overcome
> the defect on HyperV and make the e820 map consistent with bare
> metal. After all, a kvm guest is trying to imitate real HW for the
> most part.
> 
> Anyway, I will think about the e820_table_kexec updating. See if we can
> do something about it.

Yeah, for DIMMs on real HW it might definitely make sense. We might be
able to hook into updates of /sys/firmware/memmap on memory add/remove.

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Baoquan He
On 04/22/20 at 11:24am, David Hildenbrand wrote:
> On 22.04.20 11:17, Baoquan He wrote:
> > On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
> >>>> If we don't pass the efi, it won't get the SRAT table correctly, if
> >>>> I remember correctly. Yeah, I remember kvm guest can get memory
> >>>> hotplugged with ACPI only, this won't happen on bare metal though.
> >>>> Need to check carefully. I have been using kvm guest with uefi
> >>>> firmware recently.
> >>>
> >>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> >>>
> >>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> >>> not part of any efi tables or whatsoever. So I assume the kexec kernel
> >>> will not detect it automatically (good!), instead load the virtio-mem
> >>> driver and let it add memory back to the system.
> >>>
> >>> I should probably play with kexec and virtio-mem once I have some spare
> >>> cycles ... to find out what's broken and needs to be addressed :)
> >>
> >> FWIW, I just gave virtio-mem and kexec/kdump a try.
> >>
> >> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> >> The kexec kernel only uses memory in the crash region. The virtio-mem
> >> driver properly bails out due to is_kdump_kernel().
> > 
> > Right, kdump is not impacted by memory added later.
> > 
> >>
> >> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> >> to get placed on virtio-mem memory (pure luck due to the left-to-right
> >> search). Memory added by virtio-mem is not getting added to the e820
> >> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> >> right memory is readded.
> > 
> > kexec_file_load behaves just as you tested. It doesn't collect memory
> > added later into e820 because it uses e820_table_kexec directly to
> > pass e820 to the kexec-ed kernel. However, e820_table_kexec is only
> > updated during the boot stage. I tried hot adding a DIMM after boot;
> > the kexec-ed kernel doesn't have it in e820 during bootup, but it's
> > recognized and added during ACPI scanning. I think we should update
> > e820_table_kexec when hot adding/removing memory, at least for DIMMs.
> > Not sure if DLPAR, virtio-mem, or balloon memory will need to be
> > added into e820_table_kexec too, and if this is expected behaviour.
> > 
> > But whatever we do, it won't impact kexec file loading, because of
> > the bottom-up search strategy. Just adding them into
> > e820_table_kexec will make it consistent with a cold reboot, which
> > recognizes them and gets them into e820 during bootup.
> 
> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
> kernel should see. Not more, not less.
> 
> Regarding virtio-mem: Not in e820 on cold-boot.
> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
> IIRC. I think on real HW it can be different.

Yeah, DIMMs under KVM won't show up in the e820 map. However, this is
not a feature of QEMU/KVM, but a defect of it. I once asked Igor, who is
the developer of QEMU/KVM guest in this area, why we don't make the kvm
guest recognize hotpluggable DIMMs and add them into the e820 map. He
said he had tried to do it, but it would corrupt the guest on HyperV. So
he had to revert the commit in qemu. So I think we can leave it for now
for both real HW and kvm, or update e820_table_kexec to include added
DIMMs for both real HW and KVM. I hope one day KVM devs will find a way
to overcome the defect on HyperV and make the e820 map consistent with
bare metal. After all, a kvm guest is trying to imitate real HW for the
most part.

Anyway, I will think about the e820_table_kexec updating. See if we can
do something about it.



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread David Hildenbrand
On 22.04.20 11:17, Baoquan He wrote:
> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
>>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
>>>> If we don't pass the efi, it won't get the SRAT table correctly, if
>>>> I remember correctly. Yeah, I remember kvm guest can get memory
>>>> hotplugged with ACPI only, this won't happen on bare metal though.
>>>> Need to check carefully. I have been using kvm guest with uefi
>>>> firmware recently.
>>>
>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>>>
>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>>> will not detect it automatically (good!), instead load the virtio-mem
>>> driver and let it add memory back to the system.
>>>
>>> I should probably play with kexec and virtio-mem once I have some spare
>>> cycles ... to find out what's broken and needs to be addressed :)
>>
>> FWIW, I just gave virtio-mem and kexec/kdump a try.
>>
>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
>> The kexec kernel only uses memory in the crash region. The virtio-mem
>> driver properly bails out due to is_kdump_kernel().
> 
> Right, kdump is not impacted by memory added later.
> 
>>
>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
>> to get placed on virtio-mem memory (pure luck due to the left-to-right
>> search). Memory added by virtio-mem is not getting added to the e820
>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
>> right memory is readded.
> 
> kexec_file_load behaves just as you tested. It doesn't collect memory
> added later into e820 because it uses e820_table_kexec directly to
> pass e820 to the kexec-ed kernel. However, e820_table_kexec is only
> updated during the boot stage. I tried hot adding a DIMM after boot;
> the kexec-ed kernel doesn't have it in e820 during bootup, but it's
> recognized and added during ACPI scanning. I think we should update
> e820_table_kexec when hot adding/removing memory, at least for DIMMs.
> Not sure if DLPAR, virtio-mem, or balloon memory will need to be
> added into e820_table_kexec too, and if this is expected behaviour.
> 
> But whatever we do, it won't impact kexec file loading, because of
> the bottom-up search strategy. Just adding them into
> e820_table_kexec will make it consistent with a cold reboot, which
> recognizes them and gets them into e820 during bootup.

Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
kernel should see. Not more, not less.

Regarding virtio-mem: Not in e820 on cold-boot.
Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
IIRC. I think on real HW it can be different.

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Baoquan He
On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
> >> If we don't pass the efi, it won't get the SRAT table correctly, if
> >> I remember correctly. Yeah, I remember kvm guest can get memory
> >> hotplugged with ACPI only, this won't happen on bare metal though.
> >> Need to check carefully. I have been using kvm guest with uefi
> >> firmware recently.
> > 
> > Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> > 
> > I'm also asking because of virtio-mem. Memory added via virtio-mem is
> > not part of any efi tables or whatsoever. So I assume the kexec kernel
> > will not detect it automatically (good!), instead load the virtio-mem
> > driver and let it add memory back to the system.
> > 
> > I should probably play with kexec and virtio-mem once I have some spare
> > cycles ... to find out what's broken and needs to be addressed :)
> 
> FWIW, I just gave virtio-mem and kexec/kdump a try.
> 
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().

Right, kdump is not impacted by memory added later.

> 
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

kexec_file_load behaves just as you tested. It doesn't collect memory
added later into e820 because it uses e820_table_kexec directly to pass
e820 to the kexec-ed kernel. However, e820_table_kexec is only updated
during the boot stage. I tried hot adding a DIMM after boot; the
kexec-ed kernel doesn't have it in e820 during bootup, but it's
recognized and added during ACPI scanning. I think we should update
e820_table_kexec when hot adding/removing memory, at least for DIMMs.
Not sure if DLPAR, virtio-mem, or balloon memory will need to be added
into e820_table_kexec too, and if this is expected behaviour.

But whatever we do, it won't impact kexec file loading, because of the
bottom-up search strategy. Just adding them into e820_table_kexec will
make it consistent with a cold reboot, which recognizes them and gets
them into e820 during bootup.
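
To illustrate the inconsistency described above, here is a toy Python
model (not kernel code). The class and method names are made up for the
example; "kexec_table" merely stands in for e820_table_kexec, and the
update_kexec_table flag models the proposed change, which does not exist
in this form.

```python
class MemoryMaps:
    """Toy model: one live memory view plus a boot-time table for kexec."""

    def __init__(self, boot_ranges):
        # Both views start out identical at boot.
        self.live = list(boot_ranges)
        self.kexec_table = list(boot_ranges)  # stand-in for e820_table_kexec

    def hot_add(self, start, end, update_kexec_table=False):
        # The running kernel always learns about the new range.
        self.live.append((start, end))
        # Today's behaviour corresponds to update_kexec_table=False;
        # True models the proposal of updating the table on hot add.
        if update_kexec_table:
            self.kexec_table.append((start, end))
```

With update_kexec_table left at False, a hot-added range shows up in
"live" but not in "kexec_table", which mirrors what the kexec-ed kernel
sees in its e820 before ACPI scanning.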
> 
> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
> 
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

Yes, kexec_load will read memory regions from /sys/firmware/memmap/ or
/proc/iomem. Making it right seems a little harder: we can export them
to /proc/iomem or /sys/firmware/memmap/ and mark them as 'hotplug',
but which zone they belong to is not easy to tell.

We are proactively testing kexec_file_load widely on x86_64, s390, and
arm64 by adding test cases into CKI.

> 
> 
> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.
> 
> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").
> 
> Baoquan, any opinion on that?
> 
> -- 
> Thanks,
> 
> David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-21 Thread David Hildenbrand


>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
>> to get placed on virtio-mem memory (pure luck due to the left-to-right
>> search). Memory added by virtio-mem is not getting added to the e820
>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
>> right memory is readded.
> 
> This sounds like a bug.

This is how virtio-mem wants its memory to get handled.

> 
>> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
>> is added to the e820 map, which is wrong. Memory that should not be
>> touched will be touched by the kexec kernel. I assume kexec-tools just
>> goes ahead and adds anything it can find in /proc/iomem (or
>> /sys/firmware/memmap/) to the e820 map of the new kernel.
>>
>> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
>> similarly added to the e820 map and, therefore, won't be able to be
>> onlined MOVABLE easily.
> 
> This sounds like correct behavior to me.  If you add memory to the
> system it is treated as memory to the system.

Yeah, I would agree if we are talking about DIMMs, but this memory is
special. It's added via a paravirtualized interface and will contain
holes, especially after unplug. While memory in these holes can usually
be read, it should not be written. More on that below.

> 
> If we need to make it a special kind of memory with special rules we can
> have some kind of special marking for the memory.  But hotplugged is not
> in itself a sufficient criterion to say don't use this as normal memory.

Agreed. It is special, though.

> 
> If I take a huge server and I plug in an extra DIMM it is just memory.

Agreed.

[...]

> 
> Now perhaps virtualization needs a special tier of memory that should
> only be used for cases where the memory is easily movable.
> 
> I am not familiar with virtio-mem but my skim of the initial design
> is that virtio-mem was not designed to be such a special tier of memory.
> Perhaps something has changed?
> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html

Yes, a lot changed. See
https://lkml.kernel.org/r/20200311171422.10484-1-da...@redhat.com for
the latest-greatest design overview.


> 
>> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
>> indicating it in /proc/iomem in a special way ("System RAM
>> (hotplugged)"/"System RAM (virtio-mem)").
> 
> How does the kernel memory allocator treat this memory?

So what virtio-mem does is add memory sections on demand and populate
within these sections the requested amount of memory. E.g., if 64MB are
requested, it will add a 128MB section/resource but only make the first
64MB accessible (via the hypervisor) and only give the first 64MB to the
buddy. This way of adding memory is similar to what the XEN and Hyper-V
balloon drivers do when hotplugging memory.

When requested to plug more memory, it might go ahead and make (parts
of) the remaining 64MB accessible and give them to the buddy. In case it
cannot "fill any holes", it will add a new section.

When requested to unplug memory, it will try to take (parts of) the
added memory (here 64MB) back from the buddy and tell the hypervisor
about it.

So, it has some similarity to ballooning in virtual environments;
however, it manages its own device memory only and can therefore give
better guarantees and detect malicious guests.
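
As a rough illustration of the plug/unplug behaviour described above,
here is a toy Python model. It is only a sketch: the class name, the
byte-granular accounting, and the choice to unplug from the most recently
added section first are assumptions for the example, and the 128MB
section size is taken from the example numbers above. The real driver
tracks where the holes are; this model only counts bytes.

```python
MiB = 1 << 20
SECTION_SIZE = 128 * MiB  # from the 64MB/128MB example above


class VirtioMemModel:
    """Toy model: sections are added on demand, only part of each is plugged."""

    def __init__(self):
        self.plugged = []  # plugged bytes per added section

    def plug(self, nbytes):
        # First fill holes in sections that were already added.
        for i, used in enumerate(self.plugged):
            take = min(SECTION_SIZE - used, nbytes)
            self.plugged[i] += take
            nbytes -= take
        # Then add new sections for whatever is left.
        while nbytes > 0:
            take = min(SECTION_SIZE, nbytes)
            self.plugged.append(take)
            nbytes -= take

    def unplug(self, nbytes):
        # Take memory back, newest section first (arbitrary choice here).
        for i in reversed(range(len(self.plugged))):
            take = min(self.plugged[i], nbytes)
            self.plugged[i] -= take
            nbytes -= take
```

Requesting 64MB yields one section with 64MB plugged; a later request
first fills the remaining 64MB of that section before a second section
is added.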

Right now, I think the right approach would be to not create
/sys/firmware/memmap entries for memory added by virtio-mem.

[...]

> 
> p.s.  Please excuse me for jumping in; I may be missing some important
> context, but what I read when I saw this message in my inbox just seemed
> very wrong.

Yeah, still, thanks for having a look. Please let me know if you need
more information.

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-21 Thread Eric W. Biederman
David Hildenbrand writes:

>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
>>> If we don't pass the efi, it won't get the SRAT table correctly, if
>>> I remember correctly. Yeah, I remember kvm guest can get memory
>>> hotplugged with ACPI only, this won't happen on bare metal though.
>>> Need to check carefully. I have been using kvm guest with uefi
>>> firmware recently.
>> 
>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>> 
>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>> will not detect it automatically (good!), instead load the virtio-mem
>> driver and let it add memory back to the system.
>> 
>> I should probably play with kexec and virtio-mem once I have some spare
>> cycles ... to find out what's broken and needs to be addressed :)
>
> FWIW, I just gave virtio-mem and kexec/kdump a try.
>
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().
>
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

This sounds like a bug.

> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
>
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

This sounds like correct behavior to me.  If you add memory to the
system it is treated as memory to the system.

If we need to make it a special kind of memory with special rules we can
have some kind of special marking for the memory.  But hotplugged is not
in itself a sufficient criterion to say don't use this as normal memory.

If I take a huge server and I plug in an extra DIMM it is just memory.

For a similarly huge server I might want to have memory that the system
booted with unpluggable, in case hardware error reporting notices
a dimm generating a lot of memory errors.

Now perhaps virtualization needs a special tier of memory that should
only be used for cases where the memory is easily movable.

I am not familiar with virtio-mem but my skim of the initial design
is that virtio-mem was not designed to be such a special tier of memory.
Perhaps something has changed?
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html


> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.

No.

> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").

How does the kernel memory allocator treat this memory?

The logic is simple.  If the kernel memory allocator treats that memory
as ordinary memory available for all uses it should be presented as
ordinary memory available for all uses.

If the kernel memory allocator treats that memory as special memory,
only available for uses that we can easily free later and give back to
the system (AKA it is special and not ordinary memory), we should mark
it as such.

Eric

p.s.  Please excuse me for jumping in; I may be missing some important
context, but what I read when I saw this message in my inbox just seemed
very wrong.




Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-21 Thread David Hildenbrand
On 21.04.20 15:29, David Hildenbrand wrote:
>>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
>>> If we don't pass the efi, it won't get the SRAT table correctly, if
>>> I remember correctly. Yeah, I remember kvm guest can get memory
>>> hotplugged with ACPI only, this won't happen on bare metal though.
>>> Need to check carefully. I have been using kvm guest with uefi
>>> firmware recently.
>>
>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>>
>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>> will not detect it automatically (good!), instead load the virtio-mem
>> driver and let it add memory back to the system.
>>
>> I should probably play with kexec and virtio-mem once I have some spare
>> cycles ... to find out what's broken and needs to be addressed :)
> 
> FWIW, I just gave virtio-mem and kexec/kdump a try.
> 
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().
> 
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.
> 
> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
> 
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.
> 
> 
> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.
> 
> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").

I just realized that *not* creating /sys/firmware/memmap/ entries for
virtio-mem memory seems to be the right thing to do.


-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-21 Thread David Hildenbrand
>> ACPI SRAT is embedded into efi, need to read out the rsdp pointer.
>> If we don't pass the efi, it won't get the SRAT table correctly, if
>> I remember correctly. Yeah, I remember kvm guest can get memory
>> hotplugged with ACPI only, this won't happen on bare metal though.
>> Need to check carefully. I have been using kvm guest with uefi
>> firmware recently.
> 
> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> 
> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> not part of any efi tables or whatsoever. So I assume the kexec kernel
> will not detect it automatically (good!), instead load the virtio-mem
> driver and let it add memory back to the system.
> 
> I should probably play with kexec and virtio-mem once I have some spare
> cycles ... to find out what's broken and needs to be addressed :)

FWIW, I just gave virtio-mem and kexec/kdump a try.

a) kdump seems to work. Memory added by virtio-mem is getting dumped.
The kexec kernel only uses memory in the crash region. The virtio-mem
driver properly bails out due to is_kdump_kernel().

b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
to get placed on virtio-mem memory (pure luck due to the left-to-right
search). Memory added by virtio-mem is not getting added to the e820
map. Once the virtio-mem driver comes back up in the kexec kernel, the
right memory is readded.

c) "kexec -c -l" does not work properly. All memory added by virtio-mem
is added to the e820 map, which is wrong. Memory that should not be
touched will be touched by the kexec kernel. I assume kexec-tools just
goes ahead and adds anything it can find in /proc/iomem (or
/sys/firmware/memmap/) to the e820 map of the new kernel.

Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
similarly added to the e820 map and, therefore, won't be able to be
onlined MOVABLE easily.


At least for virtio-mem, I would either have to
a) Not support "kexec -c -l". A viable option if we plan to stop
supporting it in the long term anyway. I could eventually block this
in-kernel somehow.

b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
indicating it in /proc/iomem in a special way ("System RAM
(hotplugged)"/"System RAM (virtio-mem)").
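
To illustrate option b), here is a rough userspace sketch in Python (not
kexec-tools code) of skipping specially tagged ranges while walking a
/proc/iomem-style listing. The "System RAM (virtio-mem)" tag is only the
naming proposed above and does not exist today; the function name and the
sample listing are made up for illustration.

```python
import re

# Matches top-level /proc/iomem lines such as "00100000-7fffffff : System RAM".
IOMEM_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")


def usable_ram_ranges(iomem_text):
    """Return (start, end) for plain "System RAM" entries only.

    Entries carrying a suffix such as "System RAM (virtio-mem)" are skipped.
    That suffix is the hypothetical marker proposed in this thread.
    """
    ranges = []
    for line in iomem_text.splitlines():
        if line.startswith(" "):
            continue  # indented lines are child resources (e.g. "Kernel code")
        m = IOMEM_LINE.match(line)
        if m and m.group(3) == "System RAM":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges


# Made-up sample listing, not taken from a real machine.
SAMPLE = """\
00001000-0009fbff : System RAM
00100000-7fffffff : System RAM
  01000000-01c00000 : Kernel code
100000000-13fffffff : System RAM (virtio-mem)
"""

print([(hex(s), hex(e)) for s, e in usable_ram_ranges(SAMPLE)])
```

With such a marker, the tagged range would simply never reach the e820 map
that userspace builds for the new kernel.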

Baoquan, any opinion on that?

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread David Hildenbrand
>> kexec_walk_memblock() has the option for "kbuf->top_down". Only
>> kexec_walk_resources() seems to ignore it.
> 
> Yeah, that top down searching is done in a found low mem area. Means
> firstly search an available region bottom up, then put kernel top down
> in that region. The reason is our iomem res is linked with singly linked
> list. So we can only search bottom up efficiently.
> 
> kexec_load is doing the real top down searching, so kernel will be put
> at the top of system ram. I ever tried to change it to support top down
> searching for kexec_file_load too with patches, since QE and customers
> are often confused with this difference when debugging.
> 
> Andrew may remeber this, he suggested me to change the singly linked list 
> to doubly linked list for iomem res, then do the top down searching for
> kexec_file_load. I tried with some effort, the change introduced too much
> code change, I just gave up finally.

Well, at least right now this seems to be the right approach (hotplug),
lol :)

> 
> http://archive.lwn.net:8080/devicetree/20180718024944.577-1-...@redhat.com/
> 
> I can see that top down searching for kexec can avoid the highly used
> low memory region, esp under 4G, for dma, kinds of firmware reserving,
> etc. And customers/QE of kexec get used to it. I can change kexec_file_load
> to top down too with a simple way if people really complain it. But now, 
> seems bottom up is not bad too.

Ah, I understand the problem. Maybe a simple "optimization" would be to
start searching bottom-up from e.g., 2GB/4GB first. If nothing was found,
search bottom-up from 0-2GB/4GB etc.
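
A toy model of that two-pass idea, in Python rather than kernel C. The
function names, the range representation and the 4GB default split are
assumptions made only for this sketch; nothing like this exists in the
kernel as far as this thread shows.

```python
GIB = 1 << 30


def find_hole(free_ranges, size, floor):
    """Return the lowest start address >= floor where `size` bytes fit, else None.

    free_ranges: (start, end) pairs, end exclusive, sorted by start address.
    """
    for start, end in free_ranges:
        start = max(start, floor)  # ignore the part of the range below the floor
        if end - start >= size:
            return start
    return None


def find_hole_prefer_high(free_ranges, size, split=4 * GIB):
    """Search bottom-up from `split` first; fall back to searching from 0."""
    addr = find_hole(free_ranges, size, split)
    if addr is None:
        addr = find_hole(free_ranges, size, 0)
    return addr
```

Both passes still walk the list in ascending order, so a singly linked
resource list would be sufficient.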

> 
>>
>> So I think in case of memblocks (e.g., arm64), this still applies?
> 
> Yeah, aren't you trying to remove it? I haven't read your patches
> carefully, maybe I got it wrong. And arm64 even can't support the hot added

For arm64 we're still creating memblocks for hotplugged memory, but I
guess it's not too hard to stop doing that.

> memory being able to recorded into firmware, seems it's not so ready, 
> won't they change that design in the future?

It seems to be incomplete, yes. No idea if it's fixable, no arm64 expert ...


>>>>>> - powerpc to filter out all LMBs that can be removed (assuming not all
>>>>>>   memory corresponds to LMBs that can be removed, otherwise we're in
>>>>>>   trouble ... :) )
>>>>>> - virtio-mem to filter out all memory it added.
>>>>>> - hyper-v to filter out partially backed memory blocks (esp. the last
>>>>>>   memory block it added and only partially backed it by memory).
>>>>>>
>>>>>> This would make it work for kexec_file_load(), however, I do wonder how
>>>>>> we would want to approach that from userspace kexec-tools when handling
>>>>>> it from kexec_load().
>>>>>
>>>>> Let's make kexec_file_load work firstly. Since this work is only first
>>>>> step to make kexec-ed kernel not break memory hotplug. After kexec
>>>>> rebooting, the KASLR may locate kernel into hotpluggable area too.
>>>>
>>>> Can you elaborate how that would work?
>>>
>>> Well, boot memory can be hotplugged or not after boot, they are marked
>>> in uefi tables, the current kexec doesn't save and pass them into 2nd
>>> kenrel, when kexec kernel bootup, it need read them and avoid them to
>>> randomize kernel into.
>>
>> What about e.g., memory hotplugged by ACPI? I would assume, that the
>> kexec kernel will not make use of that (IOW detected that) until the
>> ACPI driver comes up and re-detects + adds that memory.
>>
>> Or how would that machinery work in case we have a DIMM hotplugged via ACPI?
> 
> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
> pass the efi, it won't get the SRAT table correctly, if I remember
> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
> ACPI only, this won't happen on bare metal though. Need check carefully. 
> I have been using kvm guest with uefi firmwire recently.

Yeah, I can imagine that bare metal is different. kvm only uses ACPI.

I'm also asking because of virtio-mem. Memory added via virtio-mem is
not part of any EFI tables whatsoever. So I assume the kexec kernel
will not detect it automatically (good!), instead load the virtio-mem
driver and let it add memory back to the system.

I should probably play with kexec and virtio-mem once I have some spare
cycles ... to find out what's broken and needs to be addressed :)

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread Baoquan He
On 04/16/20 at 04:09pm, David Hildenbrand wrote:
> >>> Sounds doable to me, and not complicated.
> >>>
> >>>> images. It would apply to
> >>>>
> >>>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
> >>>>   be used).
> >>>
> >>> Do you mean hot added memory after boot can't be recognized and added
> >>> into system RAM on arm64?
> >>
> >> See patch #3 of this patch set, which wants to avoid placing kexec
> >> binaries on hotplugged memory. But I have no idea what the current plan
> >> regarding arm64 is (this thread exploded :) ).
> >>
> >> I would assume that we don't want to place kexec images on any
> >> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.
> > 
> > Yes, noticed that and James replied to DaveY.
> > 
> > Later, when I was considering to make a draft patch to do the picking of
> > memory from normal zone, and add a notifier, as we discussed at above, I
> > suddenly realized that kexec_file_load doesn't have this issue. It
> > traverse system RAM bottom up to get an available region to put
> > kernel/initrd/boot_param, etc. I can't think of a system where its
> > low memory could be unavailable.
> 
> kexec_walk_memblock() has the option for "kbuf->top_down". Only
> kexec_walk_resources() seems to ignore it.

Yeah, that top-down search is done within a low memory area that was
already found. That is, we first search bottom up for an available region,
then place the kernel top down within that region. The reason is that our
iomem resources are linked in a singly linked list, so we can only search
bottom up efficiently.
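
A simplified Python model of that behavior (walk regions bottom up, then
place at the top of the first region that fits). This is only a sketch of
the idea, not the kernel code; the function name, the region representation
and the alignment handling are assumptions.

```python
def place_top_down(regions, size, align):
    """Walk regions bottom-up; in the first one that fits, place at the top.

    regions: (start, end) pairs, end exclusive, sorted by start address.
    align: power-of-two alignment for the returned start address.
    """
    for start, end in regions:
        if end - start < size:
            continue
        # Highest aligned start address that still leaves `size` bytes below end.
        addr = (end - size) & ~(align - 1)
        if addr >= start:
            return addr
    return None
```

So the region is chosen from low to high addresses, and only the position
inside the chosen region is top down.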

kexec_load does a real top-down search, so the kernel is put at the top
of system RAM. I once tried, with patches, to make kexec_file_load search
top down too, since QE and customers are often confused by this difference
when debugging.

Andrew may remember this. He suggested that I change the singly linked list
for iomem resources to a doubly linked list, then do the top-down search for
kexec_file_load. I put some effort into it, but it required too many code
changes, so I finally gave up.

http://archive.lwn.net:8080/devicetree/20180718024944.577-1-...@redhat.com/

I can see that top-down searching for kexec avoids the heavily used low
memory region, especially under 4G, which is needed for DMA, various
firmware reservations, etc. And customers/QE of kexec have gotten used to it.
I can change kexec_file_load to top down too, in a simple way, if people
really complain about it. But for now, bottom up does not seem bad either.

> 
> So I think in case of memblocks (e.g., arm64), this still applies?

Yeah, aren't you trying to remove it? I haven't read your patches
carefully, so maybe I got it wrong. Also, arm64 can't even record hot-added
memory into firmware, so it doesn't seem ready yet. Won't they change that
design in the future?
> 
> >>
> >>>
> >>>
> >>>> - powerpc to filter out all LMBs that can be removed (assuming not all
> >>>>   memory corresponds to LMBs that can be removed, otherwise we're in
> >>>>   trouble ... :) )
> >>>> - virtio-mem to filter out all memory it added.
> >>>> - hyper-v to filter out partially backed memory blocks (esp. the last
> >>>>   memory block it added and only partially backed it by memory).
> >>>>
> >>>> This would make it work for kexec_file_load(), however, I do wonder how
> >>>> we would want to approach that from userspace kexec-tools when handling
> >>>> it from kexec_load().
> >>>
> >>> Let's make kexec_file_load work firstly. Since this work is only first
> >>> step to make kexec-ed kernel not break memory hotplug. After kexec
> >>> rebooting, the KASLR may locate kernel into hotpluggable area too.
> >>
> >> Can you elaborate how that would work?
> > 
> > Well, boot memory can be hotplugged or not after boot, they are marked
> > in uefi tables, the current kexec doesn't save and pass them into 2nd
> > kenrel, when kexec kernel bootup, it need read them and avoid them to
> > randomize kernel into.
> 
> What about e.g., memory hotplugged by ACPI? I would assume, that the
> kexec kernel will not make use of that (IOW detected that) until the
> ACPI driver comes up and re-detects + adds that memory.
> 
> Or how would that machinery work in case we have a DIMM hotplugged via ACPI?

The ACPI SRAT is embedded in EFI; we need to read out the RSDP pointer. If we
don't pass the EFI info, the kexec kernel won't get the SRAT table correctly,
if I remember correctly. Yeah, I remember that a KVM guest can get memory
hotplugged with ACPI only; this won't happen on bare metal, though. I need to
check carefully. I have been using a KVM guest with UEFI firmware recently.



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread David Hildenbrand
>>> Sounds doable to me, and not complicated.
>>>
>>>> images. It would apply to
>>>>
>>>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>>>>   be used).
>>>
>>> Do you mean hot added memory after boot can't be recognized and added
>>> into system RAM on arm64?
>>
>> See patch #3 of this patch set, which wants to avoid placing kexec
>> binaries on hotplugged memory. But I have no idea what the current plan
>> regarding arm64 is (this thread exploded :) ).
>>
>> I would assume that we don't want to place kexec images on any
>> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.
> 
> Yes, noticed that and James replied to DaveY.
> 
> Later, when I was considering to make a draft patch to do the picking of
> memory from normal zone, and add a notifier, as we discussed at above, I
> suddenly realized that kexec_file_load doesn't have this issue. It
> traverse system RAM bottom up to get an available region to put
> kernel/initrd/boot_param, etc. I can't think of a system where its
> low memory could be unavailable.

kexec_walk_memblock() has the option for "kbuf->top_down". Only
kexec_walk_resources() seems to ignore it.

So I think in case of memblocks (e.g., arm64), this still applies?

>>
>>>
>>>
>>>> - powerpc to filter out all LMBs that can be removed (assuming not all
>>>>   memory corresponds to LMBs that can be removed, otherwise we're in
>>>>   trouble ... :) )
>>>> - virtio-mem to filter out all memory it added.
>>>> - hyper-v to filter out partially backed memory blocks (esp. the last
>>>>   memory block it added and only partially backed it by memory).
>>>>
>>>> This would make it work for kexec_file_load(), however, I do wonder how
>>>> we would want to approach that from userspace kexec-tools when handling
>>>> it from kexec_load().
>>>
>>> Let's make kexec_file_load work firstly. Since this work is only first
>>> step to make kexec-ed kernel not break memory hotplug. After kexec
>>> rebooting, the KASLR may locate kernel into hotpluggable area too.
>>
>> Can you elaborate how that would work?
> 
> Well, boot memory can be hotplugged or not after boot, they are marked
> in uefi tables, the current kexec doesn't save and pass them into 2nd
> kenrel, when kexec kernel bootup, it need read them and avoid them to
> randomize kernel into.

What about e.g., memory hotplugged by ACPI? I would assume, that the
kexec kernel will not make use of that (IOW detected that) until the
ACPI driver comes up and re-detects + adds that memory.

Or how would that machinery work in case we have a DIMM hotplugged via ACPI?

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread Baoquan He
On 04/16/20 at 03:31pm, David Hildenbrand wrote:
> > Not sure if I get the notifier idea clearly. If you mean 
> > 
> > 1) Add a common function to pick memory in unmovable zone;
> 
> Not strictly required IMHO. But, minor detail.
> 
> > 2) Let DLPAR, balloon register with notifier;
> 
> Yeah, or virtio-mem, or any other technology that adds/removes memory
> dynamically.
> 
> > 3) In the common function, ask notified part to check if the picked
> >unmovable memory is available for locating kexec kernel;
> 
> Yeah.

These may not be needed, please see the comment below.

> 
> > 
> > Sounds doable to me, and not complicated.
> > 
> >> images. It would apply to
> >>
> >> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
> >>   be used).
> > 
> > Do you mean hot added memory after boot can't be recognized and added
> > into system RAM on arm64?
> 
> See patch #3 of this patch set, which wants to avoid placing kexec
> binaries on hotplugged memory. But I have no idea what the current plan
> regarding arm64 is (this thread exploded :) ).
> 
> I would assume that we don't want to place kexec images on any
> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.

Yes, I noticed that, and James replied to DaveY.

Later, when I was considering drafting a patch to pick memory from the
normal zone and add a notifier, as we discussed above, I suddenly realized
that kexec_file_load doesn't have this issue. It traverses system RAM
bottom up to find an available region for kernel/initrd/boot_param, etc.
I can't think of a system where low memory could be unavailable.
> 
> > 
> > 
> >> - powerpc to filter out all LMBs that can be removed (assuming not all
> >>   memory corresponds to LMBs that can be removed, otherwise we're in
> >>   trouble ... :) )
> >> - virtio-mem to filter out all memory it added.
> >> - hyper-v to filter out partially backed memory blocks (esp. the last
> >>   memory block it added and only partially backed it by memory).
> >>
> >> This would make it work for kexec_file_load(), however, I do wonder how
> >> we would want to approach that from userspace kexec-tools when handling
> >> it from kexec_load().
> > 
> > Let's make kexec_file_load work firstly. Since this work is only first
> > step to make kexec-ed kernel not break memory hotplug. After kexec
> > rebooting, the KASLR may locate kernel into hotpluggable area too.
> 
> Can you elaborate how that would work?

Well, boot memory may or may not be hotpluggable after boot; this is marked
in the UEFI tables. The current kexec doesn't save these marks and pass them
to the 2nd kernel. When the kexec kernel boots up, it needs to read them and
avoid randomizing the kernel into those regions.
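
As a toy illustration only (Python, not the real KASLR code), avoiding
hotpluggable ranges when choosing a random kernel location could look like
the sketch below. The function name, the way ranges are passed in and the
slot enumeration are all assumptions; the real code would get the ranges
from firmware tables such as SRAT.

```python
import random


def pick_kaslr_slot(ram, hotpluggable, size, align, rng):
    """Pick a random aligned address in `ram` whose image avoids `hotpluggable`.

    ram: (start, end), end exclusive. hotpluggable: list of (start, end) pairs.
    Returns None if no aligned slot of `size` bytes is free of overlap.
    """
    start, end = ram
    first = (start + align - 1) // align * align  # round start up to alignment
    slots = [
        addr
        for addr in range(first, end - size + 1, align)
        if all(addr + size <= h_start or addr >= h_end
               for h_start, h_end in hotpluggable)
    ]
    return rng.choice(slots) if slots else None


# Example: the upper half of a small RAM range is hotpluggable.
addr = pick_kaslr_slot((0, 0x8000), [(0x4000, 0x8000)], 0x1000, 0x1000,
                       random.Random(0))
```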



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread David Hildenbrand
> Not sure if I get the notifier idea clearly. If you mean 
> 
> 1) Add a common function to pick memory in unmovable zone;

Not strictly required IMHO. But, minor detail.

> 2) Let DLPAR, balloon register with notifier;

Yeah, or virtio-mem, or any other technology that adds/removes memory
dynamically.

> 3) In the common function, ask notified part to check if the picked
>unmovable memory is available for locating kexec kernel;

Yeah.

> 
> Sounds doable to me, and not complicated.
> 
>> images. It would apply to
>>
>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>>   be used).
> 
> Do you mean hot added memory after boot can't be recognized and added
> into system RAM on arm64?

See patch #3 of this patch set, which wants to avoid placing kexec
binaries on hotplugged memory. But I have no idea what the current plan
regarding arm64 is (this thread exploded :) ).

I would assume that we don't want to place kexec images on any
hotplugged (or rather: hot(un)pluggable) memory - on any architecture.

> 
> 
>> - powerpc to filter out all LMBs that can be removed (assuming not all
>>   memory corresponds to LMBs that can be removed, otherwise we're in
>>   trouble ... :) )
>> - virtio-mem to filter out all memory it added.
>> - hyper-v to filter out partially backed memory blocks (esp. the last
>>   memory block it added and only partially backed it by memory).
>>
>> This would make it work for kexec_file_load(), however, I do wonder how
>> we would want to approach that from userspace kexec-tools when handling
>> it from kexec_load().
> 
> Let's make kexec_file_load work firstly. Since this work is only first
> step to make kexec-ed kernel not break memory hotplug. After kexec
> rebooting, the KASLR may locate kernel into hotpluggable area too.

Can you elaborate how that would work?

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread Baoquan He
On 04/14/20 at 04:49pm, David Hildenbrand wrote:
> > The root cause is kexec-ed kernel is targeted at hotpluggable memory
> > region. Just avoiding the movable area can fix it. In kexec_file_load(),
> > just checking or picking those unmovable region to put kernel/initrd in
> > function locate_mem_hole_callback() can fix it. The page or pageblock's
> > zone is movable or not, it's easy to know. This fix doesn't need to
> > bother other component.
> 
>  I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
>  does not imply that it cannot get offlined and removed e.g., this is
>  heavily used on ppc64, with 16MB sections.
> >>>
> >>> Really? I just know there are two kinds of mem hoplug in ppc, but don't
> >>> know the details. So in this case, is there any flag or a way to know
> >>> those memory block are hotpluggable? I am curious how those kernel data
> >>> is avoided to be put in this area. Or ppc just freely uses it for kernel
> >>> data or user space data, then try to migrate when hot remove?
> >>
> >> See
> >> arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_memory_remove_by_count()
> >>
> >> Under DLAPR, it can remove memory in LMB granularity, which is usually
> >> 16MB (== single section on ppc64). DLPAR will directly online all
> >> hotplugged memory (LMBs) from the kernel using device_online(), which
> >> will go to ZONE_NORMAL.
> >>
> >> When trying to remove memory, it simply scans for offlineable 16MB
> >> memory blocks (==section == LMB), offlines and removes them. No need for
> >> the movable zone and all the involved issues.
> > 
> > Yes, this is a different one, thanks for pointing it out. It sounds like
> > balloon driver in virt platform, doesn't it?
> 
> With DLPAR there is a hypervisor involved (which manages the actual HW
> DIMMs), so yes.
> 
> > 
> > Avoiding to put kexec kernel into movable zone can't solve this DLPAR
> > case as you said.
> > 
> >>
> >> Now, the interesting question is, can we have LMBs added during boot
> >> (not via add_memory()), that will later be removed via remove_memory().
> >> IIRC, we had BUGs related to that, so I think yes. If a section contains
> >> no unmovable allocations (after boot), it can get removed.
> > 
> > I do want to ask this question. If we can add LMB into system RAM, then
> > reload kexec can solve it. 
> > 
> > Another better way is adding a common function to filter out the
> > movable zone when search position for kexec kernel, use a arch specific
> > funciton to filter out DLPAR memory blocks for ppc only. Over there,
> > we can simply use for_each_drmem_lmb() to do that.
> 
> I was thinking about something similar. Maybe something like a notifier
> that can be used to test if selected memory can be used for kexec

I'm not sure I understand the notifier idea correctly. If you mean

1) Add a common function to pick memory in unmovable zone;
2) Let DLPAR, balloon register with notifier;
3) In the common function, ask the registered parties to check if the picked
   unmovable memory is available for locating kexec kernel;

then it sounds doable to me, and not complicated.
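
The three steps could be modeled roughly as below. This is a Python toy,
not kernel code; the class, the function names and the overlap-based veto
are all hypothetical and only show the flow of "pick a range, then let the
registered parties veto it".

```python
class KexecRangeNotifier:
    """Toy stand-in for a notifier chain; all names here are hypothetical."""

    def __init__(self):
        self._checks = []

    def register(self, check):
        # check(start, end) returns True if the range is fine for a kexec image.
        self._checks.append(check)

    def range_ok(self, start, end):
        # The range is usable only if no registered party objects.
        return all(check(start, end) for check in self._checks)


def pick_range(candidates, size, notifier):
    """Return the first (start, start + size) that fits and nobody vetoes."""
    for start, end in candidates:
        if end - start >= size and notifier.range_ok(start, start + size):
            return (start, start + size)
    return None


def make_overlap_veto(owned):
    """Build a check rejecting ranges that overlap any (start, end) in `owned`."""
    def check(start, end):
        return all(end <= o_start or start >= o_end for o_start, o_end in owned)
    return check
```

A DLPAR or balloon driver would register one such check for the memory it
may remove later.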

> images. It would apply to
> 
> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>   be used).

Do you mean hot added memory after boot can't be recognized and added
into system RAM on arm64?


> - powerpc to filter out all LMBs that can be removed (assuming not all
>   memory corresponds to LMBs that can be removed, otherwise we're in
>   trouble ... :) )
> - virtio-mem to filter out all memory it added.
> - hyper-v to filter out partially backed memory blocks (esp. the last
>   memory block it added and only partially backed it by memory).
> 
> This would make it work for kexec_file_load(), however, I do wonder how
> we would want to approach that from userspace kexec-tools when handling
> it from kexec_load().

Let's make kexec_file_load work first, since this work is only the first
step in making the kexec-ed kernel not break memory hotplug. After kexec
rebooting, KASLR may place the kernel in a hotpluggable area too.



Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread David Hildenbrand
On 14.04.20 16:39, Baoquan He wrote:
> On 04/14/20 at 11:37am, David Hildenbrand wrote:
>> On 14.04.20 11:22, Baoquan He wrote:
>>> On 04/14/20 at 10:00am, David Hildenbrand wrote:
 On 14.04.20 08:40, Baoquan He wrote:
> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
>> Baoquan He  writes:
>>
>>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:

 The only benefit of kexec_file_load is that it is simple enough from a
 kernel perspective that signatures can be checked.
>>>
>>> We don't have this restriction any more with below commit:
>>>
>>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
>>> and KEXEC_SIG_FORCE")
>>>
>>> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
>>> secure boot or legacy system for kexec/kdump. Being simple enough is
>>> enough to astract and convince us to use it instead. And kexec_file_load
>>> has been in use for several years on systems with secure boot, since
>>> added in 2014, on x86_64.
>>
>> No.  Actaully kexec_file_load is the less capable interface, and less
>> flexible interface.  Which is why it is appropriate for signature
>> verification.
>
> Well, everyone has a stance and the corresponding view. You could have
> wider view from long time maintenance and in upstrem position, and think
> kexec_file_load is horrible. But I can only see from our work as a front
> line engineer to maintain/develop kexec/kdump in RHEL, and think
> kexec_file_load is easier to maintain.
>
> Surely except of multiple kernel image format support. No matter it is
> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> This is produced from kerel building by default. We have no way to
> support it in our distros and add it into kexec_file_load.
>
> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> https://lkml.org/lkml/2017/2/15/654
>
>>
 kexec_load in every other respect is the more capable and functional
 interface.  It makes no sense to get rid of it.

 It does make sense to reload with a loaded kernel on memory hotplug.
 That is simple and easy.  If we are going to handle something in the
 kernel it should simple an automated unloading of the kernel on memory
 hotplug.


 I think it would be irresponsible to deprecate kexec_load on any
 platform.

 I also suspect that kexec_file_load could be taught to copy the dtb
 on arm32 if someone wants to deal with signatures.

 We definitely can not even think of deprecating kexec_load until
 architecture that supports it also supports kexec_file_load and 
 everyone
 is happy with that interface.  That is Linus's no regression rule.
>>>
>>> I should pick a milder word to express our tendency and tell our plan
>>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
>>> much. I didn't mean to say 'deprecate' at all when replied.
>>>
>>> The situation and trend I understand about kexec_load and 
>>> kexec_file_load
>>> are:
>>>
>>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
>>> have yet, just as x86_64, arm64 and s390 have done;
>>>  
>>> 2) kexec_file_load is suggested to use, and take precedence over
>>> kexec_load in the future, if both are supported in one ARCH.
>>
>> The deep problem is that kexec_file_load is distinctly less expressive
>> than kexec_load.
>>
>>> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
>>> and by ARCHes for back compatibility w/ kexec_file_load support.
>>>
>>> For 1) and 2), I think the reason is obvious as Eric said,
>>> kexec_file_load is simple enough. And currently, whenever we got a bug
>>> report, we may need fix them twice, for kexec_load and kexec_file_load.
>>> If kexec_file_load is made by default, e.g on x86_64, we will change it
>>> in kernel space only, for kexec_file_load. This is what I meant about
>>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
>>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
>>> old kexec_load interface in old product.
>>
>> Maybe.  The code that kexec_file_load sucked into the kernel is quite
>> stable and rarely needs changes except during a port of kexec to
>> another architecture.
>>
>> Last I looked the real maintenance effor of kexec and kexec on panic was
>> in the drivers.  So I don't think we can use maintenance to do anything.
>
> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
> been taken to make SEV work well on kexec_file_load. And we have
> switched to use kexec_file_load in the newly 

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread Baoquan He
On 04/14/20 at 11:37am, David Hildenbrand wrote:
> On 14.04.20 11:22, Baoquan He wrote:
> > On 04/14/20 at 10:00am, David Hildenbrand wrote:
> >> On 14.04.20 08:40, Baoquan He wrote:
> >>> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
>  Baoquan He  writes:
> 
> > On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >>
> >> The only benefit of kexec_file_load is that it is simple enough from a
> >> kernel perspective that signatures can be checked.
> >
> > We don't have this restriction any more with below commit:
> >
> > commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> > and KEXEC_SIG_FORCE")
> >
> > With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> > secure boot or legacy system for kexec/kdump. Being simple enough is
> > enough to astract and convince us to use it instead. And kexec_file_load
> > has been in use for several years on systems with secure boot, since
> > added in 2014, on x86_64.
> 
>  No.  Actaully kexec_file_load is the less capable interface, and less
>  flexible interface.  Which is why it is appropriate for signature
>  verification.
> >>>
> >>> Well, everyone has a stance and the corresponding view. You could have
> >>> wider view from long time maintenance and in upstrem position, and think
> >>> kexec_file_load is horrible. But I can only see from our work as a front
> >>> line engineer to maintain/develop kexec/kdump in RHEL, and think
> >>> kexec_file_load is easier to maintain.
> >>>
> >>> Surely except of multiple kernel image format support. No matter it is
> >>> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> >>> This is produced from kerel building by default. We have no way to
> >>> support it in our distros and add it into kexec_file_load.
> >>>
> >>> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> >>> https://lkml.org/lkml/2017/2/15/654
> >>>
> 
> >> kexec_load in every other respect is the more capable and functional
> >> interface.  It makes no sense to get rid of it.
> >>
> >> It does make sense to reload with a loaded kernel on memory hotplug.
> >> That is simple and easy.  If we are going to handle something in the
> >> kernel it should simple an automated unloading of the kernel on memory
> >> hotplug.
> >>
> >>
> >> I think it would be irresponsible to deprecate kexec_load on any
> >> platform.
> >>
> >> I also suspect that kexec_file_load could be taught to copy the dtb
> >> on arm32 if someone wants to deal with signatures.
> >>
> >> We definitely can not even think of deprecating kexec_load until
> >> architecture that supports it also supports kexec_file_load and 
> >> everyone
> >> is happy with that interface.  That is Linus's no regression rule.
> >
> > I should pick a milder word to express our tendency and tell our plan
> > then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> > much. I didn't mean to say 'deprecate' at all when replied.
> >
> > The situation and trend I understand about kexec_load and 
> > kexec_file_load
> > are:
> >
> > 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> > have yet, just as x86_64, arm64 and s390 have done;
> >  
> > 2) kexec_file_load is suggested to use, and take precedence over
> > kexec_load in the future, if both are supported in one ARCH.
> 
>  The deep problem is that kexec_file_load is distinctly less expressive
>  than kexec_load.
> 
> > 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> > and by ARCHes for back compatibility w/ kexec_file_load support.
> >
> > For 1) and 2), I think the reason is obvious as Eric said,
> > kexec_file_load is simple enough. And currently, whenever we got a bug
> > report, we may need fix them twice, for kexec_load and kexec_file_load.
> > If kexec_file_load is made by default, e.g on x86_64, we will change it
> > in kernel space only, for kexec_file_load. This is what I meant about
> > 'obsolete gradually'. I think for arm64, s390, they will do these too.
> > Unless there's some critical/blocker bug in kexec_load, to corrupt the
> > old kexec_load interface in old product.
> 
>  Maybe.  The code that kexec_file_load sucked into the kernel is quite
>  stable and rarely needs changes except during a port of kexec to
>  another architecture.
> 
>  Last I looked the real maintenance effor of kexec and kexec on panic was
>  in the drivers.  So I don't think we can use maintenance to do anything.
> >>>
> >>> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
> >>> been taken to make SEV work well on kexec_file_load. And we have
> >>> switched to use kexec_file_load in the newly published  Fedora release
> >>> on 

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread David Hildenbrand
On 14.04.20 11:22, Baoquan He wrote:
> On 04/14/20 at 10:00am, David Hildenbrand wrote:
>> On 14.04.20 08:40, Baoquan He wrote:
>>> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
 Baoquan He  writes:

> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
>>
>> The only benefit of kexec_file_load is that it is simple enough from a
>> kernel perspective that signatures can be checked.
>
> We don't have this restriction any more with below commit:
>
> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> and KEXEC_SIG_FORCE")
>
> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> secure boot or legacy system for kexec/kdump. Being simple enough is
> enough to astract and convince us to use it instead. And kexec_file_load
> has been in use for several years on systems with secure boot, since
> added in 2014, on x86_64.

 No.  Actaully kexec_file_load is the less capable interface, and less
 flexible interface.  Which is why it is appropriate for signature
 verification.
>>>
>>> Well, everyone has a stance and the corresponding view. You could have
>>> wider view from long time maintenance and in upstream position, and think
>>> kexec_file_load is horrible. But I can only see from our work as a front
>>> line engineer to maintain/develop kexec/kdump in RHEL, and think
>>> kexec_file_load is easier to maintain.
>>>
>>> Except, of course, for multiple kernel image format support. Whether it is
>>> kexec_load or kexec_file_load, e.g. on x86_64, we only support bzImage.
>>> This is produced by the kernel build by default. We have no way to
>>> support it in our distros and add it into kexec_file_load.
>>>
>>> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
>>> https://lkml.org/lkml/2017/2/15/654
>>>

>> kexec_load in every other respect is the more capable and functional
>> interface.  It makes no sense to get rid of it.
>>
>> It does make sense to reload with a loaded kernel on memory hotplug.
>> That is simple and easy.  If we are going to handle something in the
>> kernel it should simply be an automated unloading of the kernel on memory
>> hotplug.
>>
>>
>> I think it would be irresponsible to deprecate kexec_load on any
>> platform.
>>
>> I also suspect that kexec_file_load could be taught to copy the dtb
>> on arm32 if someone wants to deal with signatures.
>>
>> We definitely can not even think of deprecating kexec_load until
>> every architecture that supports it also supports kexec_file_load and everyone
>> is happy with that interface.  That is Linus's no regression rule.
>
> I should pick a milder word to express our tendency and tell our plan
> than 'obsolete'. Even though I added 'gradually', it seems it doesn't help
> much. I didn't mean to say 'deprecate' at all when I replied.
>
> The situation and trend I understand about kexec_load and kexec_file_load
> are:
>
> 1) Adding kexec_file_load support is suggested for ARCHes which don't
> have it yet, just as x86_64, arm64 and s390 have done;
>
> 2) Using kexec_file_load is suggested, and it will take precedence over
> kexec_load in the future, if both are supported in one ARCH.

 The deep problem is that kexec_file_load is distinctly less expressive
 than kexec_load.

> 3) kexec_load keeps being used by ARCHes w/o kexec_file_load support,
> and by ARCHes w/ kexec_file_load support for backward compatibility.
>
> For 1) and 2), I think the reason is obvious as Eric said,
> kexec_file_load is simple enough. And currently, whenever we get a bug
> report, we may need to fix it twice, for kexec_load and kexec_file_load.
> If kexec_file_load is made the default, e.g. on x86_64, we will change it
> in kernel space only, for kexec_file_load. This is what I meant about
> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> Unless there's some critical/blocker bug in kexec_load, to corrupt the
> old kexec_load interface in old product.

 Maybe.  The code that kexec_file_load sucked into the kernel is quite
 stable and rarely needs changes except during a port of kexec to
 another architecture.

 Last I looked the real maintenance effort of kexec and kexec on panic was
 in the drivers.  So I don't think we can use maintenance to do anything.
>>>
>>> Not sure if I got it. But if you check Lianbo's patches, a lot of effort has
>>> been taken to make SEV work well on kexec_file_load. And we have
>>> switched to use kexec_file_load in the newly published Fedora release
>>> on x86_64 by default. Before this, Lianbo has investigated and done many
>>> experiments to make sure the switching is safe. We finally made this
>>> decision. Next we will do the switch in Enterprise distros. Once these
>>> are proved safe, we will suggest customers to 

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread Baoquan He
On 04/14/20 at 10:00am, David Hildenbrand wrote:
> On 14.04.20 08:40, Baoquan He wrote:
> > On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> >> Baoquan He  writes:
> >>
> >>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> 
>  The only benefit of kexec_file_load is that it is simple enough from a
>  kernel perspective that signatures can be checked.
> >>>
> >>> We don't have this restriction any more with below commit:
> >>>
> >>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> >>> and KEXEC_SIG_FORCE")
> >>>
> >>> With KEXEC_SIG_FORCE not set, we can use kexec_file_load to cover both
> >>> secure boot and legacy systems for kexec/kdump. Being simple enough is
> >>> enough to attract and convince us to use it instead. And kexec_file_load
> >>> has been in use for several years on systems with secure boot, since
> >>> added in 2014, on x86_64.
> >>
> >> No.  Actually kexec_file_load is the less capable interface, and less
> >> flexible interface.  Which is why it is appropriate for signature
> >> verification.
> > 
> > Well, everyone has a stance and the corresponding view. You could have
> > wider view from long time maintenance and in upstream position, and think
> > kexec_file_load is horrible. But I can only see from our work as a front
> > line engineer to maintain/develop kexec/kdump in RHEL, and think
> > kexec_file_load is easier to maintain.
> > 
> > Except, of course, for multiple kernel image format support. Whether it is
> > kexec_load or kexec_file_load, e.g. on x86_64, we only support bzImage.
> > This is produced by the kernel build by default. We have no way to
> > support it in our distros and add it into kexec_file_load.
> > 
> > [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> > https://lkml.org/lkml/2017/2/15/654
> > 
> >>
>  kexec_load in every other respect is the more capable and functional
>  interface.  It makes no sense to get rid of it.
> 
>  It does make sense to reload with a loaded kernel on memory hotplug.
>  That is simple and easy.  If we are going to handle something in the
>  kernel it should simply be an automated unloading of the kernel on memory
>  hotplug.
> 
> 
>  I think it would be irresponsible to deprecate kexec_load on any
>  platform.
> 
>  I also suspect that kexec_file_load could be taught to copy the dtb
>  on arm32 if someone wants to deal with signatures.
> 
>  We definitely can not even think of deprecating kexec_load until
>  every architecture that supports it also supports kexec_file_load and everyone
>  is happy with that interface.  That is Linus's no regression rule.
> >>>
> >>> I should pick a milder word to express our tendency and tell our plan
> >>> than 'obsolete'. Even though I added 'gradually', it seems it doesn't help
> >>> much. I didn't mean to say 'deprecate' at all when I replied.
> >>>
> >>> The situation and trend I understand about kexec_load and kexec_file_load
> >>> are:
> >>>
> >>> 1) Adding kexec_file_load support is suggested for ARCHes which don't
> >>> have it yet, just as x86_64, arm64 and s390 have done;
> >>>
> >>> 2) Using kexec_file_load is suggested, and it will take precedence over
> >>> kexec_load in the future, if both are supported in one ARCH.
> >>
> >> The deep problem is that kexec_file_load is distinctly less expressive
> >> than kexec_load.
> >>
> >>> 3) kexec_load keeps being used by ARCHes w/o kexec_file_load support,
> >>> and by ARCHes w/ kexec_file_load support for backward compatibility.
> >>>
> >>> For 1) and 2), I think the reason is obvious as Eric said,
> >>> kexec_file_load is simple enough. And currently, whenever we get a bug
> >>> report, we may need to fix it twice, for kexec_load and kexec_file_load.
> >>> If kexec_file_load is made the default, e.g. on x86_64, we will change it
> >>> in kernel space only, for kexec_file_load. This is what I meant about
> >>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> >>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
> >>> old kexec_load interface in old product.
> >>
> >> Maybe.  The code that kexec_file_load sucked into the kernel is quite
> >> stable and rarely needs changes except during a port of kexec to
> >> another architecture.
> >>
> >> Last I looked the real maintenance effort of kexec and kexec on panic was
> >> in the drivers.  So I don't think we can use maintenance to do anything.
> > 
> > Not sure if I got it. But if you check Lianbo's patches, a lot of effort has
> > been taken to make SEV work well on kexec_file_load. And we have
> > switched to use kexec_file_load in the newly published Fedora release
> > on x86_64 by default. Before this, Lianbo has investigated and done many
> > experiments to make sure the switching is safe. We finally made this
> > decision. Next we will do the switch in Enterprise distros. Once these
> > are proved safe, we will suggest customers to use kexec_file_load for
> > kexec