Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-06 Thread David Hildenbrand

On 06.12.23 12:08, Philipp Rudo wrote:

On Fri, 1 Dec 2023 17:59:02 +0100
Michal Hocko  wrote:


On Fri 01-12-23 16:51:13, Philipp Rudo wrote:

On Fri, 1 Dec 2023 12:55:52 +0100
Michal Hocko  wrote:
   

On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
[...]

And yes, those are all what-if concerns but unfortunately that is all
we have right now.


Should theoretical concerns without actual evidence (e.g. multiple
drivers known to be broken) become a roadblock for this otherwise useful
feature?


Those concerns aren't just theoretical. They are based on experience with
a related feature that regularly suffers from exactly the same problem,
which wouldn't exist if everybody simply worked "properly".


What is the related feature?


kexec


And yes, even purely theoretical concerns can become a roadblock for a
feature when the cost of those theoretical concerns exceeds the benefit
of the feature. The thing is that bugs will be reported against kexec.
So _we_ need to figure out which of the shitty drivers caused the
problem. That puts additional burden on _us_. What we are trying to
evaluate at the moment is whether the benefit outweighs the extra burden
with the information we have at the moment.


I do understand your concerns! But I am pretty sure you do realize that
it is really hard to argue about theoreticals. Let me restate what I
consider facts. Hopefully we can agree on these points:
- the CMA region can be used for user space memory, which is a
  great advantage because the memory is not wasted, and our
  experience has shown that users do care about this a lot. We
  _know_ that pressure on making those reservations smaller
  results in a less reliable crashdump and more resources spent
  on tuning and testing (especially after major upgrades). A
  larger reservation which is not completely wasted during
  normal runtime addresses that concern.
- There is no other known mechanism to achieve this reusability
  of the crash kernel memory and stop the wastage without a much
  more intrusive code/API impact (e.g. a separate zone or a
  dedicated interface to prevent any hazardous usage like RDMA).
- Implementation-wise the patch has a very small footprint. It
  uses existing infrastructure (CMA) and adds minimal hooking
  into the crashkernel configuration.
- The only identified risk so far is RDMA acting on this memory
  without using the proper pinning interface. If it helps to have
  a statement from RDMA maintainers/developers then we can pull
  them in for a further discussion of course.
- The feature requires an explicit opt-in (see the sketch right
  after this list), so it brings no new risk to existing crash
  kernel users until they decide to use it. AFAIU there is no way
  to tell in the crash kernel that its memory used to be CMA-based
  in the primary kernel. If you believe that having that
  information available for debuggability would help, then I
  believe this shouldn't be hard to add. I think it would even
  make sense to mark this feature experimental to make it clear
  to users that this needs some time before it can be marked
  production ready.
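
For illustration (the exact syntax is whatever the series defines; this
sketch assumes a ",cma" suffix to the existing crashkernel= parameter):

    crashkernel=1G,cma

The reservation would then live in a CMA area that user space can use for
movable allocations until a crash actually happens.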

I hope I haven't really missed anything important. The final


If I understand Documentation/core-api/pin_user_pages.rst correctly, you
missed case 1, Direct IO. In that case "short term" DMA is allowed for
pages without FOLL_LONGTERM. Meaning that there is a way you can
corrupt the CMA area, and with it the crash kernel, after the production
kernel has panicked.

With that I don't see a chance this series can be included unless
someone can explain to me that the documentation is wrong or that I
understood it wrong.


I think you are right. We'd have to disallow any FOLL_PIN on these CMA 
pages, or find other ways of handling that (detect that there are no 
short-term pins anymore).
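
To make the Direct IO distinction concrete, here is a minimal sketch (the
wrapper function is hypothetical; pin_user_pages_fast(), unpin_user_pages()
and the FOLL_* flags are the documented GUP API):

#include <linux/mm.h>

/* Sketch only: the two pin flavors from pin_user_pages.rst. */
static int pin_example(unsigned long uaddr, int nr_pages, struct page **pages)
{
	int nr;

	/* Case 1 (Direct IO): short-term pin without FOLL_LONGTERM. The
	 * pages may sit in CMA, and DMA may target them until unpinned. */
	nr = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (nr <= 0)
		return nr;
	/* ... device DMA for the duration of the I/O ... */
	unpin_user_pages(pages, nr);

	/* Long-term pins (e.g. RDMA) must add FOLL_LONGTERM, which first
	 * migrates the pages out of CMA/ZONE_MOVABLE. */
	return pin_user_pages_fast(uaddr, nr_pages,
				   FOLL_WRITE | FOLL_LONGTERM, pages);
}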


But, I'm also wondering how MMU-notifier-based approaches might 
interfere, where CMA pages might be transparently mapped into secondary 
MMUs, possibly having DMA going on.


Are we sure that all these secondary MMUs are inactive as soon as we kexec?

--
Cheers,

David / dhildenb




Re: [RFC PATCH v2 0/7] Introduce persistent memory pool

2023-09-28 Thread David Hildenbrand

On 28.09.23 15:22, Dave Hansen wrote:

On 9/27/23 09:13, Stanislav Kinsburskii wrote:

Once deposited, these pages can't be accessed by Linux anymore and thus
must be preserved in "used" state across kexec, as the hypervisor is
unaware of kexec.


If Linux can't access them, they're not RAM any more.  I'd much rather
remove them from the memory map and move on with life rather than
implement a bunch of new ABI that's got to be handed across kernels.


The motivation of handling kexec (faster?) in a Hyper-V domain doesn't 
sound particularly compelling to me for such features. If you inflated 
memory, just don't allow kexec. It's been broken for years IIUC.


Maybe the other use cases are more "relevant".

--
Cheers,

David / dhildenb




Re: [RFC PATCH v2 0/7] Introduce persistent memory pool

2023-09-28 Thread David Hildenbrand

On 28.09.23 12:25, Baoquan He wrote:

On 09/27/23 at 09:13am, Stanislav Kinsburskii wrote:

On Wed, Sep 27, 2023 at 01:44:38PM +0800, Baoquan He wrote:

Hi Stanislav,

On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote:

This patch introduces a memory allocator specifically tailored for
persistent memory within the kernel. The allocator maintains
kernel-specific states like DMA passthrough device states, IOMMU state, and
more across kexec.


Can you give more details about how this persistent memory pool will be
utilized in an actual scenario? I mean, what problem have you met that
requires introducing a persistent memory pool to solve it?



The major reason we have at the moment is that the Linux root partition
running on top of the Microsoft hypervisor needs to deposit pages to the
hypervisor at runtime, when the hypervisor runs out of memory.
"Depositing" here means that Linux passes a set of its PFNs to the
hypervisor via hypercall, and the hypervisor then uses these pages for
its own needs.

Once deposited, these pages can't be accessed by Linux anymore and thus
must be preserved in "used" state across kexec, as the hypervisor is
unaware of kexec. At the same time, these pages can be withdrawn when
unused. Thus, an allocator persistent across kexec looks reasonable for
this particular matter.


Thanks for these details.

The deposit and withdraw remind me of the balloon driver, David's
virtio-mem, and DLPAR on ppc, which can hot-add or shrink physical memory
on a guest OS. Can't the Microsoft hypervisor do a similar thing to
reclaim or give back memory from or to the 'Linux root partition' running
on top of the hypervisor?


virtio-mem was designed with kexec support in mind. You only expose the 
initial memory to the second kernel, and that memory can never have such 
holes. That does not apply to memory ballooning implementations, like 
Hyper-V dynamic memory.


In the virtio-mem paper I have the following:

"In our experiments, Hyper-V VMs crashed reliably when
trying to use kexec under Linux for fast OS reboots with
an inflated balloon. Other memory ballooning mechanisms
either have to temporarily deflate the whole balloon or al-
low access to inflated memory, which is undesired in cloud
environments."

I remember XEN does something elaborate, whereby they allow access to 
all inflated memory during reboot, but limit the total number of pages 
they will hand out. IIRC, you then have to work around things like 
"Windows initializes all memory with 0s when booting", and cope with 
that. So there are ways hypervisors have handled that in the past.


--
Cheers,

David / dhildenb




Re: [PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory

2023-09-12 Thread David Hildenbrand

On 12.09.23 09:47, Adrian Hunter wrote:

On 12/09/23 10:19, David Hildenbrand wrote:

On 11.09.23 13:21, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit dcdfdd40fa82
("mm: Add support for unaccepted memory"), whereby a virtual machine may
need to accept memory before it can be used.

Do not let /proc/vmcore try to access unaccepted memory because it can
cause the guest to fail.


Oh, hold on. What are the actual side effects of this?

Once we're in the kdump kernel, any guest is already dead. So failing a guest 
doesn't apply, no?


Unaccepted Memory is used by virtual machines.  In this case the guest
has kexec'ed to a dump-capture kernel, so the virtual machine is still
alive and running the dump-capture kernel.


Ah, I got lost in TDX host semantics. So what you're saying is: if we 
(the guest) read unaccepted memory, we will get zapped. Makes sense.


--
Cheers,

David / dhildenb




Re: [PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory

2023-09-12 Thread David Hildenbrand

On 11.09.23 13:21, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit dcdfdd40fa82
("mm: Add support for unaccepted memory"), whereby a virtual machine may
need to accept memory before it can be used.

Do not let /proc/vmcore try to access unaccepted memory because it can
cause the guest to fail.


Oh, hold on. What are the actual side effects of this?

Once we're in the kdump kernel, any guest is already dead. So failing a 
guest doesn't apply, no?


--
Cheers,

David / dhildenb




Re: [PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory

2023-09-12 Thread David Hildenbrand

On 11.09.23 13:21, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit dcdfdd40fa82
("mm: Add support for unaccepted memory"), whereby a virtual machine may
need to accept memory before it can be used.

Do not let /proc/vmcore try to access unaccepted memory because it can
cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.

Signed-off-by: Adrian Hunter 
---


[...]


+static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
+{
+   phys_addr_t paddr = pfn << PAGE_SHIFT;
+
+   return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
+}
+
  #endif /* _LINUX_MM_H */


As stated, if the relevant table is not already properly populated with 
information about unaccepted memory by the first kernel, this probably 
logically belongs in Kirill's series.


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V2 2/2] proc/kcore: Do not try to access unaccepted memory

2023-09-12 Thread David Hildenbrand

On 11.09.23 13:21, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual
machine may need to accept memory before it can be used.

Do not try to access unaccepted memory because it can cause the guest to
fail.

For /proc/kcore, which is read-only and does not support mmap, this means a
read of unaccepted memory will return zeros.

Signed-off-by: Adrian Hunter 
---
  fs/proc/kcore.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)


Changes in V2:

   Change patch subject and commit message
   Do not open code pfn_is_unaccepted_memory()


diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 23fc24d16b31..6422e569b080 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -546,7 +546,8 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 * and explicitly excluded physical ranges.
 */
if (!page || PageOffline(page) ||
-   is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
+   is_page_hwpoison(page) || !pfn_is_ram(pfn) ||
+   pfn_is_unaccepted_memory(pfn)) {
if (iov_iter_zero(tsz, iter) != tsz) {
ret = -EFAULT;
goto out;


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 11.09.23 12:05, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 11:50:31AM +0200, David Hildenbrand wrote:

On 11.09.23 11:27, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:

On 11.09.23 10:41, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:

On 06.09.23 09:39, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not map unaccepted memory because it can cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.


Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
to the information whether memory of the first kernel is unaccepted (IOW,
not its memory, but the memory of the first kernel it is supposed to expose
via /proc/vmcore)?


There are a few patches in my queue for a few related issues, but
generally, yes, the information is available to the target kernel via the
EFI configuration table.


I assume that table is provided by the first kernel, and not read directly
from HW, correct?


The table is constructed by the EFI stub in the first kernel based on EFI
memory map.



Okay, should work then once that's done by the first kernel.

Maybe include this patch in your series?


Can do. But the other two patches are not related to kexec. Hm.


Yes, the others can go in separately. But this here really needs other 
kexec/kdump changes.


--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 11.09.23 11:27, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote:

On 11.09.23 10:41, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:

On 06.09.23 09:39, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not map unaccepted memory because it can cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.


Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
to the information whether memory of the first kernel is unaccepted (IOW,
not its memory, but the memory of the first kernel it is supposed to expose
via /proc/vmcore)?


There are a few patches in my queue for a few related issues, but
generally, yes, the information is available to the target kernel via the
EFI configuration table.


I assume that table is provided by the first kernel, and not read directly
from HW, correct?


The table is constructed by the EFI stub in the first kernel based on EFI
memory map.



Okay, should work then once that's done by the first kernel.

Maybe include this patch in your series?

--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 11.09.23 10:41, Kirill A. Shutemov wrote:

On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote:

On 06.09.23 09:39, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not map unaccepted memory because it can cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.


Does a second (kdump) kernel that exposes /proc/vmcore reliably get access
to the information whether memory of the first kernel is unaccepted (IOW,
not its memory, but the memory of the first kernel it is supposed to expose
via /proc/vmcore)?


There are a few patches in my queue for a few related issues, but
generally, yes, the information is available to the target kernel via the
EFI configuration table.


I assume that table is provided by the first kernel, and not read directly 
from HW, correct?


--
Cheers,

David / dhildenb




Re: [PATCH 3/3] /dev/mem: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 07.09.23 16:46, Dave Hansen wrote:

On 9/7/23 07:25, Kirill A. Shutemov wrote:

On Thu, Sep 07, 2023 at 07:15:21AM -0700, Dave Hansen wrote:

On 9/6/23 00:39, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not map unaccepted memory because it can cause the guest to fail.

Doesn't /dev/mem already provide a billion ways for someone to shoot
themselves in the foot?  TDX seems to have added the 1,000,000,001st.
Is this really worth patching?

Is it better to let the TD die silently? I don't think so.


First, let's take a look at all of the distro kernels that folks will
run under TDX.  Do they have STRICT_DEVMEM set?


For virtio-mem, we do

config VIRTIO_MEM
...
depends on EXCLUSIVE_SYSTEM_RAM

Which in turn:

config EXCLUSIVE_SYSTEM_RAM
...
depends on !DEVMEM || STRICT_DEVMEM


Not supported on all archs, but at least on RHEL9 on x86_64 and aarch64.

So, making unaccepted memory similarly depend on "!DEVMEM || 
STRICT_DEVMEM" does not sound too far off ...



--
Cheers,

David / dhildenb




Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory

2023-09-11 Thread David Hildenbrand

On 06.09.23 09:39, Adrian Hunter wrote:

Support for unaccepted memory was added recently, refer commit
dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby
a virtual machine may need to accept memory before it can be used.

Do not map unaccepted memory because it can cause the guest to fail.

For /proc/vmcore, which is read-only, this means a read or mmap of
unaccepted memory will return zeros.


Does a second (kdump) kernel that exposes /proc/vmcore reliably get 
access to the information whether memory of the first kernel is 
unaccepted (IOW, not its memory, but the memory of the first kernel it 
is supposed to expose via /proc/vmcore)?


I recall there might be other kdump-related issues for TDX and friends 
to solve. Especially regarding which information the second kernel gets 
provided by the first kernel.


So can this patch even be tested reasonably (IOW, get into a kdump 
kernel in an environment where the first kernel has unaccepted memory, 
and verify that unaccepted memory is handled accordingly)? ... And is 
kdump doing anything reasonable in such an environment at all?


--
Cheers,

David / dhildenb




Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

2022-10-26 Thread David Hildenbrand

On 26.10.22 16:48, Baoquan He wrote:

On 10/25/22 at 12:31pm, Borislav Petkov wrote:

On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:

The concern to range number mainly is on Virt guest systems.


And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
real machine?


IIRC, ACPI only allows for 256 slots. PPC dlpar might provide more.



Well, currently, memory hotplug is an important feature on virt systems
to dynamically increase/shrink memory on the system. If it only emulated
a real machine, it wouldn't be different from a bare metal system.

IIRC, the balloon driver or the virtio-mem feature can add a memory
board, e.g. 1G with a block size of 128M, so 8 blocks are added. When
shrinking this 1G of memory later, it will make a best-effort attempt to
hot remove memory, meaning any occupied memory block is kept in place.
In the end we might only be able to remove every second block, 4 blocks
altogether. The remaining un-removed blocks then produce 4 separate
memory regions. Like this, a virt guest could have many memory regions
in the kernel after memory has been added/removed.

If I am wrong, Please correct me, David.


Yes, virtio-mem (but also PPC dlpar) can result in many individual 
memory blocks with holes in between after hotunplug. Hotplug OTOH, 
usually tries to "plug" these holes and reduce the total number of 
memory blocks. It might be rare that our range will be heavily 
fragmented after unplug, but it's certainly possible.


[...]



Yes, now assume we have an HPE SGI system that has memory hotplug
capability. The system itself already has more than 1024 memory regions.
When we hot add an extra memory board, we want to include the newly
added memory regions in the elfcorehdr so that they will be dumped out
in the kdump kernel.

That's why I earlier suggested 2048 for the number of memory regions.


The more the better, unless "it hurts". Assuming a single memory block 
is 128 MiB, 2048 regions would cover 256 GiB.


Usually, on big systems, the memory block size is 2 GiB, so 2048 regions would cover 4 TiB.

--
Thanks,

David / dhildenb




Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-29 Thread David Hildenbrand
On 29.08.22 05:07, Linus Torvalds wrote:
> On Sun, Aug 28, 2022 at 6:56 PM Dave Young  wrote:
>>
>>> John mentioned PANIC_ON().
>>
>> I would vote for PANIC_ON(), it sounds like a good idea, because
>> BUG_ON() is not obvious, whereas PANIC_ON() can alert the code author that
>> this will cause a kernel panic, so one will be more careful before
>> using it.
> 
> People, NO.
> 
> We're trying to get rid of BUG_ON() because it kills the machine.
> 
> Not replace it with another bogus thing that kills a machine.
> 
> So no PANIC_ON(). We used to have "panic()" many many years ago, we
> got rid of it. We're not re-introducing it.
> 
> People who want to panic on warnings can do so. WARN_ON() _becomes_
> PANIC for those people. But those people are the "we have a million
> machines, we want to just fail things on any sign of trouble, and we
> have MIS people who can look at the logs".
> 
> And it's not like we need to get rid of _all_ BUG_ON() cases. If you
> have a "this is major internal corruption, there's no way we can
> continue", then BUG_ON() is appropriate. It will try to kill that
> process and try to keep the machine running, and again, the kind of
> people who don't care about one machine (because - again - they have
> millions of them) can just turn that into a panic-and-reboot
> situation.
> 
> But the kind of people for whom the machine they are on IS THEIR ONLY
> MACHINE - whether it be a workstation, a laptop, or a cellphone -
> there is absolutely zero situation where "let's just kill the machine"
> is *EVER* appropriate. Even a BUG_ON() will try to continue as well as
> it can after killing the current thread, but it's going to be iffy,
> because locking etc.
> 
> So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for
> "oops, I really don't know what to do, and I physically *cannot*
> continue" (and that is *not* "I'm too lazy to do error handling").
> 
> There is no room for PANIC. None. Ever.

Let me make clearer what I had in mind, avoiding the PANIC_ON terminology
John raised. I was wondering if it would make sense to

1) Be able to specify a severity for WARN (developer decision)

2) Be able to specify a severity for panic_on_warn (admin decision)

Distributions with kdump could keep a mode whereby severe warnings
(e.g., former BUG_ON) would properly kdump+reboot and harmless warnings
(e.g., clean recovery path) would WARN once + continue.

I agree that whether to panic should in most cases be a decision of the
admin, not the developer.


Now, that's a different discussion than the documentation update at
hand; I primarily wanted to raise awareness among the kdump people and
ask them whether a stronger move towards WARN_ON_ONCE will affect
them/customer expectations.

I'll work with John to document the current rules to reflect everything
you said here.

-- 
Thanks,

David / dhildenb




Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-26 Thread David Hildenbrand
On 26.08.22 03:43, Dave Young wrote:
> Hi David,
> 
> [Added more people in cc]
> 

Hi Dave,

thanks for your input!

[...]

>> Side note: especially with kdump() I feel like we might see much more
>> widespread use of panic_on_warn to be able to actually extract debug
>> information in a controlled manner -- for example on enterprise distros.
>> ... which would then make these systems more likely to crash, because
>> there is no way to distinguish a rather harmless warning from a severe
>> warning :/ . But let's see if some kdump() folks will share their
>> opinion as reply to the cover letter.
> 
> I can understand the intention of this patch, and I totally agree that
> BUG() should be used carefully, this is a good proposal if we can
> clearly define the standard about when to use BUG().  But I do have

Essentially, the general rule from Linus is "absolutely no new BUG_ON()
calls ever" -- but I think the consensus in that thread was that there
are corner cases when it comes to unavoidable data corruption/security
issues. And these are rare cases, not the usual case where we'd have
used BUG_ON()/VM_BUG_ON().

> some worries. I think this standard is different for different
> sub-components; it is not clear to me at least, so this may introduce an
> unstable running kernel and cause troubles (e.g. data corruption) with
> a WARN instead of a BUG. Probably it would be better to say "Do not
> WARN lightly, and do not hesitate to use BUG if it is really needed"?


Well, I don't make the rules, I document them and share them for general
awareness/comments :) Documenting this is valuable, because there seem
to be quite some different opinions floating around in the community --
and I've been learning different rules from different people over the years.

> 
> About "patch_on_warn", it will depend on the admin/end user to set it,
> it is not a good idea for distribution to set it. It seems we are
> leaving it to end users to take the risk of a kernel panic even with
> all kernel WARN even if it is sometimes not necessary.

My question would be what we could add/improve to keep systems with
kdump armed running as expected for end users, that is most probably:

1) don't crash on harmless WARN() that can just be reported and the
   machine will continue running mostly fine without real issues.
2) crash on severe issues (previously BUG) such that we can properly
   capture a system dump via kdump. Then restart the machine.

Of course, once one would run into 2), one could try reproducing with
"panic_on_warn" to get a reasonable system dump. But I guess that's not
what enterprise customers expect.


One wild idea (in the cover letter) was to add something new that can be
configured by user space and that expresses that something is more
severe than just some warning that can be recovered easily. But it can
eventually be recovered to keep the system running to some degree. But
still, it's configurable if we want to trigger a panic or let the system
run.

John mentioned PANIC_ON().


What would be your expectation for kdump users under which conditions we
want to trigger kdump and when not?

Regarding panic_on_warn, how often do e.g., RHEL users observe warnings
that we're not able to catch during testing, such that "panic_on_warn"
would be a real no-go?

-- 
Thanks,

David / dhildenb




Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-25 Thread David Hildenbrand
On 24.08.22 23:59, John Hubbard wrote:
> On 8/24/22 09:30, David Hildenbrand wrote:
>> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
>> index 03eb53fd029a..a6d81ff578fe 100644
>> --- a/Documentation/process/coding-style.rst
>> +++ b/Documentation/process/coding-style.rst
>> @@ -1186,6 +1186,33 @@ expression used.  For instance:
>>  #endif /* CONFIG_SOMETHING */
>>  
> 
> I like the idea of adding this documentation, and this is the right
> place. Naturally, if one likes something, one must immediately change
> it. :) Therefore, here is an alternative writeup that I think captures
> what you and the email threads were saying.
> 
> How's this sound?

Much better, thanks! :)

> 
> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> index 03eb53fd029a..32df0d503388 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1185,6 +1185,53 @@ expression used.  For instance:
> ...
> #endif /* CONFIG_SOMETHING */
>  
> +22) Do not crash the kernel
> +---
> +
> +Use WARN() rather than BUG()
> +
> +
> +Do not add new code that uses any of the BUG() variants, such as BUG(),
> +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably
> +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not 
> required
> +if there is no reasonable way to at least partially recover.

I'll tend to keep in this section:

"Unavoidable data corruption / security issues might be a very rare
exception to this rule and need good justification."

Because there are rare exceptions, and I'd much rather document the
clear exception to this rule.

> +
> +Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
> +**
> +
> +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is
> +common for a given warning condition, if it occurs at all, to occur multiple
> +times. (For example, once per file, or once per struct page.) This can fill 
> up

I'll drop the "For example" part. I feel like this doesn't really need
an example -- most probably we've all been there already when the kernel
log was flooded :)

> +and wrap the kernel log, and can even slow the system enough that the 
> excessive
> +logging turns into its own, additional problem.
> +
> +Do not WARN lightly
> +***
> +
> +WARN*() is intended for unexpected, this-should-never-happen situations. 
> WARN*()
> +macros are not to be used for anything that is expected to happen during 
> normal
> +operation. These are not pre- or post-condition asserts, for example. Again:
> +WARN*() must not be used for a condition that is expected to trigger easily, 
> for
> +example, by user space actions. pr_warn_once() is a possible alternative, if 
> you
> +need to notify the user of a problem.
> +
> +Do not worry about panic_on_warn users
> +**
> +
> +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> +available kernel option, and that many users set this option. This is why 
> there
> +is a "Do not WARN lightly" writeup, above. However, the existence of
> +panic_on_warn users is not a valid reason to avoid the judicious use of WARN*().
> +That is because, whoever enables panic_on_warn has explicitly asked the 
> kernel
> +to crash if a WARN*() fires, and such users must be prepared to deal with the
> +consequences of a system that is somewhat more likely to crash.

Side note: especially with kdump() I feel like we might see much more
widespread use of panic_on_warn to be able to actually extract debug
information in a controlled manner -- for example on enterprise distros.
... which would then make these systems more likely to crash, because
there is no way to distinguish a rather harmless warning from a severe
warning :/ . But let's see if some kdump() folks will share their
opinion as reply to the cover letter.

-- 
Thanks,

David / dhildenb




Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

2022-08-25 Thread David Hildenbrand
On 25.08.22 13:43, Jani Nikula wrote:
> On Thu, 25 Aug 2022, David Hildenbrand  wrote:
>> On 24.08.22 18:52, Joe Perches wrote:
>>> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>>>> checkpatch does not point out that VM_BUG_ON() and friends should be
>>>> avoided, however, Linus notes:
>>>>
>>>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>>>> no different, the only difference is "we can make the code smaller
>>>> because these are less important". [1]
>>>>
>>>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>>>> clearer that the kernel really shouldn't be crashed.
>>>>
>>>> Note that there are some other *_BUG_ON flavors, but they are not all
>>>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>>>> flags KVM as being buggy, so we'll not care about them for now here.
>>> []
>>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>>> []
>>>> @@ -4695,12 +4695,12 @@ sub process {
>>>>}
>>>>}
>>>>  
>>>> -# avoid BUG() or BUG_ON()
>>>> -  if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>>>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>>>> +  if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
>>>
>>> Perhaps better as something like the below to pick up more variants
>>>
>>
>> Trying to find more possible variants and exceptions
> 
>> CI_BUG_ON(
>> -> Bad with CONFIG_DRM_I915_DEBUG
>> GEM_BUG_ON(
>> -> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE
> 
> These are hidden behind debug knobs that we use in our CI to
> specifically catch "should not happen" cases fast and loud. Should not
> be a problem for regular users.
> 

I tend to agree, but I don't think this is worth an exception.
VM_BUG_ON also requires CONFIG_DEBUG_VM and, as I learned, absolutely
shouldn't be used.

Quoting Linus:

   Really. BUG_ON() IS NOT FOR DEBUGGING. [1]

   This kind of "I don't think this can happen" is _never_ an excuse for it. [2]


For CI work, it might be sufficient to use WARN_ON_ONCE() combined with 
panic_on_warn.
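
For example (standard kernel knobs, shown here for illustration), a CI
setup can enable this at boot or at runtime:

	# either boot with panic_on_warn=1 on the kernel command line, or:
	echo 1 > /proc/sys/kernel/panic_on_warn

With kdump armed, the first WARN_ON_ONCE() that fires then produces a
proper dump instead of just a log line.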

[1] 
https://lore.kernel.org/all/CAHk-=wiEAH+ojSpAgx_Ep=nkpwhu8ado3v56bxccsu97oyj...@mail.gmail.com/
[2] 
https://lore.kernel.org/all/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com/
 

-- 
Thanks,

David / dhildenb




Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

2022-08-25 Thread David Hildenbrand
On 24.08.22 18:52, Joe Perches wrote:
> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>> checkpatch does not point out that VM_BUG_ON() and friends should be
>> avoided, however, Linus notes:
>>
>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>> no different, the only difference is "we can make the code smaller
>> because these are less important". [1]
>>
>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>> clearer that the kernel really shouldn't be crashed.
>>
>> Note that there are some other *_BUG_ON flavors, but they are not all
>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>> flags KVM as being buggy, so we'll not care about them for now here.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -4695,12 +4695,12 @@ sub process {
>>  }
>>  }
>>  
>> -# avoid BUG() or BUG_ON()
>> -if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>> +if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
> 
> Perhaps better as something like the below to pick up more variants
> 

Trying to find more possible variants and exceptions

$ git grep -h -o -E "\b[a-zA-Z]+_BUG(_ON(_[a-zA-Z]+)*)?\(" | sort | uniq
AA_BUG(
-> Ok, no BUG()
ASM_BUG(
-> Bad
BUILD_BUG(
BUILD_BUG_ON(
BUILD_BUG_ON_INVALID(
BUILD_BUG_ON_MSG(
BUILD_BUG_ON_ZERO(
-> Ok
CI_BUG_ON(
-> Bad with CONFIG_DRM_I915_DEBUG
DCCP_BUG(
DCCP_BUG_ON(
-> Ok, no BUG()
do_BUG(
-> BUG implementation, ok.
GEM_BUG_ON(
-> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE
GLOCK_BUG_ON(
-> Bad
handle_BUG(
-> BUG implementation, ok.
IDA_BUG_ON(
KVM_BUG(
KVM_BUG_ON(
-> Ok, no BUG()
lkdtm_BUG(
paravirt_BUG(
-> bad
PROM_BUG(
-> unused, will remove
RWLOCK_BUG_ON(
-> Ok, no BUG()
snd_BUG(
snd_BUG_ON(
-> Ok, no BUG()
SNIC_BUG_ON(
-> Bad
SPIN_BUG_ON(
-> Ok, no BUG()
UNWINDER_BUG(
UNWINDER_BUG_ON(
VIRTUAL_BUG_ON(
VM_BUG_ON(
VM_BUG_ON_FOLIO(
VM_BUG_ON_MM(
VM_BUG_ON_PAGE(
VM_BUG_ON_PGFLAGS(
VM_BUG_ON_VMA(
XA_BUG_ON(
-> Bad

So an extended version of your proposal would be something like (ignoring
do_BUG and handle_BUG; people are smart enough to figure those out)

if ($line =~ /\b(?!AA_|BUILD_|DCCP_|IDA_|KVM_|RWLOCK_|snd_|SPIN_)(?:[a-zA-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/

?

-- 
Thanks,

David / dhildenb




Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

2022-08-24 Thread David Hildenbrand
On 24.08.22 18:52, Joe Perches wrote:
> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>> checkpatch does not point out that VM_BUG_ON() and friends should be
>> avoided, however, Linus notes:
>>
>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>> no different, the only difference is "we can make the code smaller
>> because these are less important". [1]
>>
>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>> clearer that the kernel really shouldn't be crashed.
>>
>> Note that there are some other *_BUG_ON flavors, but they are not all
>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>> flags KVM as being buggy, so we'll not care about them for now here.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -4695,12 +4695,12 @@ sub process {
>>  }
>>  }
>>  
>> -# avoid BUG() or BUG_ON()
>> -if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>> +if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
> 
> Perhaps better as something like the below to pick up more variants
> 
>   if ($line =~ /\b(?!KVM_|BUILD_)(?:[A-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/

... then I'll have to scan the other cases to see if they do something
similar to KVM. ... well, okay, I'll do it. :)

> 
>>  my $msg_level = \&WARN;
>>  $msg_level = \&CHK if ($file);
>>  &{$msg_level}("AVOID_BUG",
>> -  "Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
> 
> and maybe:
> 
> "Do not crash the kernel unless it is 
> unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than 
> BUG() or variants\n" . $herecurr);
> 
> 

Yes, thanks!

-- 
Thanks,

David / dhildenb




[PATCH RFC 0/2] coding-style.rst: document BUG() and WARN() rules

2022-08-24 Thread David Hildenbrand
As it seems to be rather unclear if/when to use BUG(), BUG_ON(),
VM_BUG_ON(), WARN_ON_ONCE(), ... let's try to document the result of a
recent discussion.

Details can be found in patch #1.

--

Here is some braindump after thinking about BUG_ON(), WARN_ON(), ... and
how it interacts with kdump.

I was wondering what the expectations on a system with armed kdump are,
for example, after we removed most BUG_ON() instances and replaced them
by WARN_ON_ONCE(). I would assume that we actually want to panic in some
cases to capture a proper system dump instead of continuing and eventually
ending up with a completely broken system where it's hard to extract any
useful debug information. We'd have to enable panic_on_warn. But we'd only
want to do that in case kdump is actually armed after boot.

So one idea would be to have some kind of "panic_on_warn_with_kdump" mode.
But then, we'd actually crash+kdump even on the most harmless WARN_ON()
conditions, because they all look alike. To compensate, we would need
some kind of "severity" levels of a warning -- at least some kind of
"this is harmless and we can easily recover, but please tell the
developers" vs. "this is real bad and unexpected, capture a dump
immediately instead of trying to recover and eventually failing miserably".

But then, maybe we really want something like BUG_ON() -- let's call it
CBUG_ON() for simplicity -- but make it usable in
conditionals (to implement recovery code if easily possible) and make the
runtime behavior configurable.

if (CBUG_ON(whatever))
try_to_recover()

Whereby, for example, "panic_on_cbug" and "panic_on_cbug_with_kdump"
could control the runtime behavior.
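
A minimal sketch of how such a CBUG_ON() could look (panic_on_cbug is a
hypothetical knob here, not an existing interface):

extern bool panic_on_cbug;	/* hypothetical knob (sysctl/cmdline) */

#define CBUG_ON(cond)						\
({								\
	bool __failed = WARN_ON_ONCE(cond);			\
	/* warn once by default; panic if the admin asked for it */	\
	if (unlikely(__failed && panic_on_cbug))		\
		panic("CBUG_ON(%s)", #cond);			\
	__failed;						\
})

By default this warns once and lets the caller recover; with the knob set,
it panics and lets an armed kdump capture a dump.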

But this is just a braindump and I assume people reading along have other,
better ideas. Especially, a better name for CBUG.


Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: David Laight 
Cc: Jonathan Corbet 
Cc: Andy Whitcroft 
Cc: Joe Perches 
Cc: Dwaipayan Ray 
Cc: Lukas Bulwahn 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Dave Young 

David Hildenbrand (2):
  coding-style.rst: document BUG() and WARN() rules ("do not crash the
kernel")
  checkpatch: warn on usage of VM_BUG_ON() and friends

 Documentation/process/coding-style.rst | 27 ++
 scripts/checkpatch.pl  |  6 +++---
 2 files changed, 30 insertions(+), 3 deletions(-)

-- 
2.37.1




[PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-24 Thread David Hildenbrand
Linus notes [1] that the introduction of new code that uses VM_BUG_ON()
is just as bad as BUG_ON(), because it will crash the kernel on
distributions that enable CONFIG_DEBUG_VM (like Fedora):

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
no different, the only difference is "we can make the code smaller
because these are less important". [2]

This resulted in a more generic discussion about usage of BUG() and
friends. While there might be corner cases that still deserve a BUG_ON(),
most BUG_ON() cases should simply use WARN_ON_ONCE() and implement a
recovery path if reasonable:

The only possible case where BUG_ON can validly be used is "I have
some fundamental data corruption and cannot possibly return an
error". [2]

A very good approximation is the general rule:

"absolutely no new BUG_ON() calls _ever_" [2]

... not even if something really shouldn't ever happen and is merely for
documenting that an invariant always has to hold.

There is only one good BUG_ON():

Now, that said, there is one very valid sub-form of BUG_ON():
BUILD_BUG_ON() is absolutely 100% fine. [2]

While WARN will also crash the machine with panic_on_warn set, that's
exactly to be expected:

So we have two very different cases: the "virtual machine with good
logging where a dead machine is fine" - use 'panic_on_warn'. And
the actual real hardware with real drivers, running real loads by
users. [3]

The basic idea is that warnings will similarly get reported by users
and be found during testing. However, in contrast to a BUG(), there is a
way to actually influence the expected behavior (e.g., panic_on_warn)
and to eventually keep the machine alive to extract some debug info.

Ingo notes that not all WARN_ON_ONCE cases need recovery. If we don't ever
expect this code to trigger in any case, recovery code is not really
helpful.

I'd prefer to keep all these warnings 'simple' - i.e. no attempted
recovery & control flow, unless we ever expect these to trigger.
[4]

There have been different rules floating around that were never properly
documented. Let's try to clarify.

[1] 
https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=nkpwhu8ado3v56bxccsu97oyj...@mail.gmail.com
[2] 
https://lore.kernel.org/r/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com
[3] 
https://lore.kernel.org/r/CAHk-=wgF7K2gSSpy=m_=k3nov4zaceux9puqf1tjktjla2x...@mail.gmail.com

Signed-off-by: David Hildenbrand 
---
 Documentation/process/coding-style.rst | 27 ++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
index 03eb53fd029a..a6d81ff578fe 100644
--- a/Documentation/process/coding-style.rst
+++ b/Documentation/process/coding-style.rst
@@ -1186,6 +1186,33 @@ expression used.  For instance:
#endif /* CONFIG_SOMETHING */
 
 
+22) Do not crash the kernel
+---
+
+Do not add new code that uses BUG(), BUG_ON(), VM_BUG_ON(), ... to crash
+the kernel if certain conditions are not met. Instead, use WARN_ON_ONCE()
+with recovery code (if reasonable) instead. Unavoidable data corruption /
+security issues might be a very rare exception to this rule and need good
+justification.
+
+There is no need for WARN_ON_ONCE() recovery code if we never expect it to
+possibly trigger unless something goes seriously wrong or some other code
+is changed to break invariants. Note that VM_WARN_ON_ONCE() cannot be used
+in conditionals.
+
+Usage of WARN() and friends is fine for something that is not expected to
+be triggered easily. ``panic_on_warn`` users are not an excuse to not use
+WARN(): whoever enables ``panic_on_warn`` explicitly asked the kernel to
+crash in this case.
+
+However, WARN() and friends should not be used for something that is
+expected to trigger easily, for example, by user space. pr_warn_once()
+might be a reasonable replacement to notify the user.
+
+Note that BUILD_BUG_ON() is perfectly fine because it will make compilation
+fail instead of crashing the kernel.
+
+
 Appendix I) References
 --
 
-- 
2.37.1




[PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

2022-08-24 Thread David Hildenbrand
checkpatch does not point out that VM_BUG_ON() and friends should be
avoided, however, Linus notes:

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
no different, the only difference is "we can make the code smaller
because these are less important". [1]

So let's warn on VM_BUG_ON() and friends as well. While at it, make it
clearer that the kernel really shouldn't be crashed.

Note that there are some other *_BUG_ON flavors, but they are not all
bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
flags KVM as being buggy, so we'll not care about them for now here.

[1] 
https://lore.kernel.org/r/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com

Signed-off-by: David Hildenbrand 
---
 scripts/checkpatch.pl | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 79e759aac543..4c18acf17032 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -4695,12 +4695,12 @@ sub process {
}
}
 
-# avoid BUG() or BUG_ON()
-   if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
+# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
+   if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
	my $msg_level = \&WARN;
	$msg_level = \&CHK if ($file);
	&{$msg_level}("AVOID_BUG",
-		      "Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
+		      "Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG(), BUG_ON(), VM_BUG_ON(), ...\n" . $herecurr);
}
 
 # avoid LINUX_VERSION_CODE
-- 
2.37.1




Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support

2022-06-15 Thread David Hildenbrand
On 01.06.22 00:25, Eric DeVolder wrote:
> 
> 
> On 5/31/22 08:15, David Hildenbrand wrote:
>> On 12.05.22 18:10, Eric DeVolder wrote:
>>> David,
>>> Great questions! See inline responses below.
>>> eric
>>
>> Sorry for the late reply, travel and vacation ...
> No problem, greatly appreciate the feedback!
> eric
> 
>>
>>>>
>>>>> +
>>>>> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
>>>>> +void __weak arch_crash_handle_hotplug_event(struct kimage *image,
>>>>> + unsigned int hp_action, unsigned int cpu)
>>>>> +{
>>>>> + WARN(1, "crash hotplug handler not implemented");
>>>>
>>>>
>>>> Won't that trigger on any arch that has CONFIG_HOTPLUG_CPU and 
>>>> CONFIG_MEMORY_HOTPLUG?
>>>> I mean, you only implement it for x86 later in this series. Or what else 
>>>> stops this WARN from
>>>> triggering?
>>>>
>>> You're correct. What about: printk_once(KERN_DEBUG "...") ?
>>
>> Why even bother about printing anything? If the feature is not
>> supported, there should be some way for user space to figure out that it
>> still has to reload on hot(un)plug manually, no?
> 
> I've changed this to WARN_ONCE(). If that isn't agreeable, I'll remove it.

Please don't use WARN* on expected error paths.
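
A sketch of the alternative hinted at above: expose the capability to user
space instead of printing (the attribute and the config symbol are
hypothetical here, not an existing interface):

/* Hypothetical sysfs attribute: lets user space query whether the
 * kernel updates the crash image itself, instead of warning at runtime. */
static ssize_t crash_hotplug_show(struct kobject *kobj,
				  struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n",
			  IS_ENABLED(CONFIG_CRASH_HOTPLUG) ? 1 : 0);
}
static struct kobj_attribute crash_hotplug_attr = __ATTR_RO(crash_hotplug);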

-- 
Thanks,

David / dhildenb




Re: [PATCH v8 0/7] crash: Kernel handling of CPU and memory hot un/plug

2022-05-31 Thread David Hildenbrand
On 26.05.22 15:39, Sourabh Jain wrote:
> Hello Eric,
> 
> On 26/05/22 18:46, Eric DeVolder wrote:
>>
>>
>> On 5/25/22 10:13, Sourabh Jain wrote:
>>> Hello Eric,
>>>
>>> On 06/05/22 00:15, Eric DeVolder wrote:
 When the kdump service is loaded, if a CPU or memory is hot
 un/plugged, the crash elfcorehdr (for x86), which describes the CPUs
 and memory in the system, must also be updated, else the resulting
 vmcore is inaccurate (eg. missing either CPU context or memory
 regions).

 The current solution utilizes udev to initiate an unload-then-reload
 of the kdump image (e.g. kernel, initrd, boot_params, purgatory and
 elfcorehdr) by the userspace kexec utility. In previous posts I have
 outlined the significant performance problems related to offloading
 this activity to userspace.

 This patchset introduces a generic crash hot un/plug handler that
 registers with the CPU and memory notifiers. Upon CPU or memory
 changes, this generic handler is invoked and performs important
 housekeeping, for example obtaining the appropriate lock, and then
 invokes an architecture specific handler to do the appropriate
 updates.

 In the case of x86_64, the arch specific handler generates a new
 elfcorehdr, and overwrites the old one in memory. No involvement
 with userspace needed.

 To realize the benefits/test this patchset, one must make a couple
 of minor changes to userspace:

   - Disable the udev rule for updating kdump on hot un/plug changes.
     Add the following as the first two lines to the udev rule file
     /usr/lib/udev/rules.d/98-kexec.rules:
>>>
>>> If we can have a sysfs attribute to advertise this feature then 
>>> userspace
>>> utilities (kexec tool/udev rules) can take action accordingly. In 
>>> short, it will
>>> help us maintain backward compatibility.
>>>
>>> kexec tool can use the new sysfs attribute and allocate additional 
>>> buffer space
>>> for elfcorehdr accordingly. Similarly, the checksum-related changes 
>>> can come
>>> under this check.
>>>
>>> Udev rule can use this sysfs file to decide kdump service reload is 
>>> required or not.
>>
>> Great idea. I've been working on the corresponding udev and 
>> kexec-tools changes and your input/idea here is quite timely.
>>
>> I have boolean "crash_hotplug" as a core_param(), so it will show up as:
>>
>> # cat /sys/module/kernel/parameters/crash_hotplug
>> N
> 
> How about using 0-1 instead Y/N?
> 0 = crash hotplug not supported
> 1 = crash hotplug supported
> 
> Also how about keeping sysfs here instead?
> /sys/kernel/kexec_crash_hotplug

It's not only about hotplug, though. And actually we care about
onlining/offlining. Hmm, I wonder if there is a better name for this
automatic handling of cpu and memory devices.
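
A hypothetical udev rule fragment building on Sourabh's suggestion
(attribute name and label are assumptions, not a final interface):

SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
# ... existing reload actions ...
LABEL="kdump_reload_end"

Systems where the kernel handles the update itself would then skip the
expensive unload-then-reload cycle.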

-- 
Thanks,

David / dhildenb




Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support

2022-05-31 Thread David Hildenbrand
On 12.05.22 18:10, Eric DeVolder wrote:
> David,
> Great questions! See inline responses below.
> eric

Sorry for the late reply, travel and vacation ...

>>
>>> +
>>> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
>>> +void __weak arch_crash_handle_hotplug_event(struct kimage *image,
>>> +   unsigned int hp_action, unsigned int cpu)
>>> +{
>>> +   WARN(1, "crash hotplug handler not implemented");
>>
>>
>> Won't that trigger on any arch that has CONFIG_HOTPLUG_CPU and 
>> CONFIG_MEMORY_HOTPLUG?
>> I mean, you only implement it for x86 later in this series. Or what else 
>> stops this WARN from
>> triggering?
>>
> You're correct. What about: printk_once(KERN_DEBUG "...") ?

Why even bother about printing anything? If the feature is not
supported, there should be some way for user space to figure out that it
still has to reload on hot(un)plug manually, no?


[...]

> 
>>
>> 2. Why can't the unprotect+reprotect not be done inside
>> arch_crash_handle_hotplug_event() ? It's all arch specific either way.
>>
>> IMHO, this code here should be as simple as
>>
>> if (kexec_crash_image)
>>  arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, cpu);
>>
> 
> The intent of this code was to be generic infrastructure. Just invoking the
> arch_crash_handle_hotplug_event() would certainly be as generic as it gets.
> But there were a series of steps that seemed to be common, so those I hoisted
> into this bit of code.

But most common parts are actually arch_* calls already? :)

Anyhow, no strong opinion.

> 
>> 3. Why do we have to forward the CPU for CPU onlining/offlining but not the
>> memory block id (or similar) when onlining/offlining a memory block?
>  From patch "kexec: exclude hot remove cpu from elfcorehdr notes" commit 
> message:
> 
> Due to use of CPUHP_AP_ONLINE_DYN, upon CPU unplug, the CPU is
> still in the for_each_present_cpu() list when within the
> handle_hotplug_event(). Thus the CPU must be explicitly excluded
> when building the new list of CPUs.
> 
> This change identifies in handle_hotplug_event() the CPU to be
> excluded, and the check for excluding the CPU in
> crash_prepare_elf64_headers().
> 
> If there is a better CPUHP_ to use than _DYN, I'd be all for that!

Ah okay, thanks.

-- 
Thanks,

David / dhildenb




Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support

2022-05-12 Thread David Hildenbrand
On 05.05.22 20:45, Eric DeVolder wrote:
> CPU and memory change notifications are received in order to
> regenerate the elfcorehdr.
> 
> To support cpu hotplug, a callback is registered to capture the
> CPUHP_AP_ONLINE_DYN online and offline events via
> cpuhp_setup_state_nocalls().
> 
> To support memory hotplug, a notifier is registered to capture the
> MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().
> 
> The cpu callback and memory notifiers call handle_hotplug_event()
> to handle the hot plug/unplug event. Then handle_hotplug_event()
> dispatches the event to the architecture specific
> arch_crash_handle_hotplug_event(). During the process, the
> kexec_mutex is held.
> 
> Signed-off-by: Eric DeVolder 

[...]

> +
> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
> +void __weak arch_crash_handle_hotplug_event(struct kimage *image,
> + unsigned int hp_action, unsigned int cpu)
> +{
> + WARN(1, "crash hotplug handler not implemented");


Won't that trigger on any arch that has CONFIG_HOTPLUG_CPU and 
CONFIG_MEMORY_HOTPLUG?
I mean, you only implement it for x86 later in this series. Or what else stops 
this WARN from
triggering?

> +}
> +
> +static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
> +{
> + /* Obtain lock while changing crash information */
> + if (!mutex_trylock(&kexec_mutex))
> + return;

This looks wrong. What if you offline memory but for some reason the mutex
is currently locked? You'd miss updating the vmcore, which would be broken.

Why is this trylock in place? Some workaround for potential locking issues,
or what is the rationale?
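
(For comparison, a blocking variant that cannot miss an update -- a
minimal sketch reusing the identifiers from the quoted patch:)

static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
{
	/*
	 * Sleep until the lock is available instead of bailing out; a
	 * silently skipped update would leave a stale elfcorehdr behind.
	 */
	mutex_lock(&kexec_mutex);
	if (kexec_crash_image)
		arch_crash_handle_hotplug_event(kexec_crash_image,
						hp_action, cpu);
	mutex_unlock(&kexec_mutex);
}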

> +
> + /* Check kdump is loaded */
> + if (kexec_crash_image) {
> + pr_debug("crash hp: hp_action %u, cpu %u", hp_action, cpu);
> +
> + /* Needed in order for the segments to be updated */
> + arch_kexec_unprotect_crashkres();
> +
> + /* Flag to differentiate between normal load and hotplug */
> + kexec_crash_image->hotplug_event = true;

1. Why is that required? Why can't arch_crash_handle_hotplug_event() forward 
that
information? I mean, *hotplug* in the name implies that the function should be
aware of what's happening.

2. Why can't the unprotect+reprotect not be done inside
arch_crash_handle_hotplug_event() ? It's all arch specific either way.

IMHO, this code here should be as simple as

if (kexec_crash_image)
arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, cpu);

3. Why do we have to forward the CPU for CPU onlining/offlining but not the
memory block id (or similar) when onlining/offlining a memory block?

> +
> + /* Now invoke arch-specific update handler */
> + arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, 
> cpu);
> +
> + /* No longer handling a hotplug event */
> + kexec_crash_image->hotplug_event = false;
> +
> + /* Change back to read-only */
> + arch_kexec_protect_crashkres();
> + }
> +
> + /* Release lock now that update complete */
> + mutex_unlock(&kexec_mutex);
> +}
> +
> +static int crash_memhp_notifier(struct notifier_block *nb, unsigned long 
> val, void *v)
> +{
> + switch (val) {
> + case MEM_ONLINE:
> + handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY, 0);
> + break;
> +
> + case MEM_OFFLINE:
> + handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_MEMORY, 0);
> + break;
> + }
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block crash_memhp_nb = {
> + .notifier_call = crash_memhp_notifier,
> + .priority = 0
> +};
> +
> +static int crash_cpuhp_online(unsigned int cpu)
> +{
> + handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
> + return 0;
> +}
> +
> +static int crash_cpuhp_offline(unsigned int cpu)
> +{
> + handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
> + return 0;
> +}
> +
> +static int __init crash_hotplug_init(void)
> +{
> + int result = 0;
> +
> + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
> + register_memory_notifier(&crash_memhp_nb);
> +
> + if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
> + result = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> + 
> "crash/cpuhp",
> + 
> crash_cpuhp_online,
> + 
> crash_cpuhp_offline);

Ehm, this indentation looks very weird.

> +
> + return result;
> +}
> +
> +subsys_initcall(crash_hotplug_init);
> +#endif


-- 
Thanks,

David / dhildenb




Re: [PATCH v8 2/7] crash: prototype change for crash_prepare_elf64_headers

2022-05-12 Thread David Hildenbrand
On 05.05.22 20:45, Eric DeVolder wrote:
> From within crash_prepare_elf64_headers() there is a need to
> reference the struct kimage hotplug members. As such, this
> change passes the struct kimage as a parameter to the
> crash_prepare_elf64_headers().

You should make it clearer that the hotplug members will be added by
successive patches.

> 
> This is preparation for a later patch; no functional change.
> 
> Signed-off-by: Eric DeVolder 
> Acked-by: Baoquan He 

Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb




Re: [PATCH v7 1/8] x86/crash: fix minor typo/bug in debug message

2022-04-28 Thread David Hildenbrand
On 13.04.22 18:42, Eric DeVolder wrote:
> The pr_debug() intends to display the memsz member, but the
> parameter is actually the bufsz member (which is already
> displayed). Correct this to display the memsz value.
> 
> Signed-off-by: Eric DeVolder 
> Acked-by: Baoquan He 
> ---

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb




Re: [PATCH v4 02/10] crash hp: Introduce CRASH_HOTPLUG configuration options

2022-03-03 Thread David Hildenbrand
On 03.03.22 11:22, Baoquan He wrote:
> On 03/02/22 at 10:20am, David Hildenbrand wrote:
>> On 01.03.22 21:04, Eric DeVolder wrote:
>>>
>>>
>>> On 2/22/22 21:25, Baoquan He wrote:
>>>> On 02/09/22 at 02:56pm, Eric DeVolder wrote:
>>>>> Support for CPU and memory hotplug for crash is controlled by the
>>>>> CRASH_HOTPLUG configuration option, introduced by this patch.
>>>>>
>>>>> The CRASH_HOTPLUG_ELFCOREHDR_SZ related configuration option is
>>>>> also introduced with this patch.
>>>>>
>>>>> Signed-off-by: Eric DeVolder 
>>>>> ---
>>>>>   arch/x86/Kconfig | 26 ++
>>>>>   1 file changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>> index ebe8fc76949a..4e3374edab02 100644
>>>>> --- a/arch/x86/Kconfig
>>>>> +++ b/arch/x86/Kconfig
>>>>> @@ -2060,6 +2060,32 @@ config CRASH_DUMP
>>>>> (CONFIG_RELOCATABLE=y).
>>>>> For more details see Documentation/admin-guide/kdump/kdump.rst
>>>>>   
>>>>> +config CRASH_HOTPLUG
>>>>> + bool "kernel updates of crash elfcorehdr"
>>>>> + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) && KEXEC_FILE
>>>>> + help
>>>>> +   Enable the kernel to update the crash elfcorehdr (which contains
>>>>> +   the list of CPUs and memory regions) directly when hot plug/unplug
>>>>> +   of CPUs or memory. Otherwise userspace must monitor these hot
>>>>> +   plug/unplug change notifications via udev in order to
>>>>> +   unload-then-reload the crash kernel so that the list of CPUs and
>>>>> +   memory regions is kept up-to-date. Note that the udev CPU and
>>>>> +   memory change notifications still occur (however, userspace is not
>>>>> +   required to monitor for crash dump purposes).
>>>>> +
>>>>> +config CRASH_HOTPLUG_ELFCOREHDR_SZ
>>>>> + depends on CRASH_HOTPLUG
>>>>> + int
>>>>> + default 131072
>>>>> + help
>>>>> +   Specify the maximum size of the elfcorehdr buffer/segment.
>>>>> +   The 128KiB default is sized so that it can accommodate 2048
>>>>> +   Elf64_Phdr, where each Phdr represents either a CPU or a
>>>>> +   region of memory.
>>>>> +   For example, this size can accommodate hotplugging a machine
>>>>> +   with up to 1024 CPUs and up to 1024 memory regions (e.g. 1TiB
>>>>> +   with 1024 1GiB memory DIMMs).
>>>>
>>>> This example of memory could be a little misleading. The memory regions
>>>> may not be related to memory DIMMs. System could split them into many
>>>> smaller regions during bootup.
>>>
>>> I changed "with 1024 1GiB memory DIMMs" to "with 1024 1GiB hotplug 
>>> memories".
>>> eric
>>
>> It's still not quite precise. Essentially it's the individual "System
>> RAM" entries in /proc/iomem
>>
>> Boot memory (i.e., a single DIMM) might be represented by multiple
>> entries due to rearranged holes (by the BIOS).
>>
>> While hotplugged DIMMs (under virt!) are usually represented using a
>> single range, it can be different on physical machines. Last but not
>> least, dax/kmem and virtio-mem behave in a different way.
> 
> Right. How about only mentioning the 'System RAM' entries in /proc/iomem
> as below? It's just giving an example, talking about the details of
> memory regions from each type may not be necessary here. People
> interested can refer to the related code or documentation to get it.
> 
> 
> + default 131072
> + help
> +   Specify the maximum size of the elfcorehdr buffer/segment.
> +   The 128KiB default is sized so that it can accommodate 2048
> +   Elf64_Phdr, where each Phdr represents either a CPU or a
> +   region of memory.
> +   For example, this size can accommodate hotplugging a machine
> +   with up to 1024 CPUs and up to 1024 memory regions which are
> represented by 'System RAM' entries in /proc/iomem.

Maybe changing the last paragraph to:

"For example, this size can accommodate a machine with up to 1024 CPUs
and up to 1024 memory regions, for example, as represented by 'System
RAM' entries in /proc/iomem."
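
(As a quick sanity check of the 128 KiB default against that example, a
standalone sketch; structure sizes per <elf.h> on a 64-bit build:)

#include <elf.h>
#include <stdio.h>

int main(void)
{
	/* One Elf64_Phdr per CPU note plus one per memory region. */
	size_t nphdr = 1024 + 1024;
	size_t total = sizeof(Elf64_Ehdr) + nphdr * sizeof(Elf64_Phdr);

	/* 64 + 2048 * 56 = 114752 bytes, comfortably below 131072. */
	printf("worst-case elfcorehdr: %zu bytes (limit 131072)\n", total);
	return 0;
}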


-- 
Thanks,

David / dhildenb




Re: [PATCH v4 02/10] crash hp: Introduce CRASH_HOTPLUG configuration options

2022-03-02 Thread David Hildenbrand
On 01.03.22 21:04, Eric DeVolder wrote:
> 
> 
> On 2/22/22 21:25, Baoquan He wrote:
>> On 02/09/22 at 02:56pm, Eric DeVolder wrote:
>>> Support for CPU and memory hotplug for crash is controlled by the
>>> CRASH_HOTPLUG configuration option, introduced by this patch.
>>>
>>> The CRASH_HOTPLUG_ELFCOREHDR_SZ related configuration option is
>>> also introduced with this patch.
>>>
>>> Signed-off-by: Eric DeVolder 
>>> ---
>>>   arch/x86/Kconfig | 26 ++
>>>   1 file changed, 26 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index ebe8fc76949a..4e3374edab02 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -2060,6 +2060,32 @@ config CRASH_DUMP
>>>   (CONFIG_RELOCATABLE=y).
>>>   For more details see Documentation/admin-guide/kdump/kdump.rst
>>>   
>>> +config CRASH_HOTPLUG
>>> +   bool "kernel updates of crash elfcorehdr"
>>> +   depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) && KEXEC_FILE
>>> +   help
>>> + Enable the kernel to update the crash elfcorehdr (which contains
>>> + the list of CPUs and memory regions) directly when hot plug/unplug
>>> + of CPUs or memory. Otherwise userspace must monitor these hot
>>> + plug/unplug change notifications via udev in order to
>>> + unload-then-reload the crash kernel so that the list of CPUs and
>>> + memory regions is kept up-to-date. Note that the udev CPU and
>>> + memory change notifications still occur (however, userspace is not
>>> + required to monitor for crash dump purposes).
>>> +
>>> +config CRASH_HOTPLUG_ELFCOREHDR_SZ
>>> +   depends on CRASH_HOTPLUG
>>> +   int
>>> +   default 131072
>>> +   help
>>> + Specify the maximum size of the elfcorehdr buffer/segment.
>>> + The 128KiB default is sized so that it can accommodate 2048
>>> + Elf64_Phdr, where each Phdr represents either a CPU or a
>>> + region of memory.
>>> + For example, this size can accommodate hotplugging a machine
>>> + with up to 1024 CPUs and up to 1024 memory regions (e.g. 1TiB
>>> + with 1024 1GiB memory DIMMs).
>>
>> This example of memory could be a little misleading. The memory regions
>> may not be related to memory DIMMs. System could split them into many
>> smaller regions during bootup.
> 
> I changed "with 1024 1GiB memory DIMMs" to "with 1024 1GiB hotplug memories".
> eric

It's still not quite precise. Essentially it's the individual "System
RAM" entries in /proc/iomem

Boot memory (i.e., a single DIMM) might be represented by multiple
entries due to rearranged holes (by the BIOS).

While hotplugged DIMMs (under virt!) are usually represented using a
single range, it can be different on physical machines. Last but not
least, dax/kmem and virtio-mem behave in a different way.

-- 
Thanks,

David / dhildenb




[PATCH v2] proc/vmcore: fix possible deadlock on concurrent mmap and read

2022-01-19 Thread David Hildenbrand
0]  ksys_mmap_pgoff+0x178/0x200
[6.428701]  do_syscall_64+0x43/0x90
[6.429126]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[6.429745] RIP: 0033:0x7fc7359b8fc7
[6.430157] Code: 00 00 00 89 ef e8 69 b3 ff ff eb e4 e8 c2 64 01 00 66 90 
f3 0f 1e fa 41 89 ca 41 f7 c1 ff 0f 00 00 75 10 b8 09 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 21 c3 48 8b 05 21 7e 0e 00 64 c7 00 16 00 00
[6.432147] RSP: 002b:7fff35b4c208 EFLAGS: 0246 ORIG_RAX: 
0009
[6.432970] RAX: ffda RBX: 0001 RCX: 7fc7359b8fc7
[6.433746] RDX: 0001 RSI: 0040 RDI: 
[6.434529] RBP: 55a1125ecf10 R08: 0003 R09: 2000
[6.435310] R10: 0002 R11: 0246 R12: 2000
[6.436093] R13: 0040 R14: 55a1124269e2 R15: 
[6.436887]  

Reported-by: Baoquan He 
Fixes: cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more 
generic vmcore callbacks")
Cc: Andrew Morton 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Dave Young 
Cc: "Paul E. McKenney" 
Cc: Josh Triplett 
Cc: Peter Zijlstra 
Cc: Boqun Feng 
Signed-off-by: David Hildenbrand 
---

Was: [PATCH v1] proc/vmcore: fix false positive lockdep warning

v1 -> v2:
- Adjust subject/description
- Add Fixes:

---
 fs/proc/vmcore.c | 41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 702754dd1daf..edeb01dfe05d 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -62,7 +62,8 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 /* Device Dump Size */
 static size_t vmcoredd_orig_sz;
 
-static DECLARE_RWSEM(vmcore_cb_rwsem);
+static DEFINE_SPINLOCK(vmcore_cb_lock);
+DEFINE_STATIC_SRCU(vmcore_cb_srcu);
 /* List of registered vmcore callbacks. */
 static LIST_HEAD(vmcore_cb_list);
 /* Whether the vmcore has been opened once. */
@@ -70,8 +71,8 @@ static bool vmcore_opened;
 
 void register_vmcore_cb(struct vmcore_cb *cb)
 {
-   down_write(&vmcore_cb_rwsem);
INIT_LIST_HEAD(&cb->next);
+   spin_lock(&vmcore_cb_lock);
list_add_tail(&cb->next, &vmcore_cb_list);
/*
 * Registering a vmcore callback after the vmcore was opened is
@@ -79,14 +80,14 @@ void register_vmcore_cb(struct vmcore_cb *cb)
 */
if (vmcore_opened)
pr_warn_once("Unexpected vmcore callback registration\n");
-   up_write(&vmcore_cb_rwsem);
+   spin_unlock(&vmcore_cb_lock);
 }
 EXPORT_SYMBOL_GPL(register_vmcore_cb);
 
 void unregister_vmcore_cb(struct vmcore_cb *cb)
 {
-   down_write(&vmcore_cb_rwsem);
-   list_del(&cb->next);
+   spin_lock(&vmcore_cb_lock);
+   list_del_rcu(&cb->next);
/*
 * Unregistering a vmcore callback after the vmcore was opened is
 * very unusual (e.g., forced driver removal), but we cannot stop
@@ -94,7 +95,9 @@ void unregister_vmcore_cb(struct vmcore_cb *cb)
 */
if (vmcore_opened)
pr_warn_once("Unexpected vmcore callback unregistration\n");
-   up_write(&vmcore_cb_rwsem);
+   spin_unlock(&vmcore_cb_lock);
+
+   synchronize_srcu(&vmcore_cb_srcu);
 }
 EXPORT_SYMBOL_GPL(unregister_vmcore_cb);
 
@@ -103,9 +106,8 @@ static bool pfn_is_ram(unsigned long pfn)
struct vmcore_cb *cb;
bool ret = true;
 
-   lockdep_assert_held_read(&vmcore_cb_rwsem);
-
-   list_for_each_entry(cb, &vmcore_cb_list, next) {
+   list_for_each_entry_srcu(cb, &vmcore_cb_list, next,
+srcu_read_lock_held(&vmcore_cb_srcu)) {
if (unlikely(!cb->pfn_is_ram))
continue;
ret = cb->pfn_is_ram(cb, pfn);
@@ -118,9 +120,9 @@ static bool pfn_is_ram(unsigned long pfn)
 
 static int open_vmcore(struct inode *inode, struct file *file)
 {
-   down_read(&vmcore_cb_rwsem);
+   spin_lock(&vmcore_cb_lock);
vmcore_opened = true;
-   up_read(&vmcore_cb_rwsem);
+   spin_unlock(&vmcore_cb_lock);
 
return 0;
 }
@@ -133,6 +135,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
unsigned long pfn, offset;
size_t nr_bytes;
ssize_t read = 0, tmp;
+   int idx;
 
if (!count)
return 0;
@@ -140,7 +143,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
offset = (unsigned long)(*ppos % PAGE_SIZE);
pfn = (unsigned long)(*ppos / PAGE_SIZE);
 
-   down_read(&vmcore_cb_rwsem);
+   idx = srcu_read_lock(&vmcore_cb_srcu);
do {
if (count > (PAGE_SIZE - offset))
nr_bytes = PAGE_SIZE - offset;
@@ -165,7 +168,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
   offset, userbuf);
}
if (tmp < 0) {
-   up_read(&vmcore_cb_rwsem);
+   srcu_read_unlock(&vmcore_cb_srcu, idx);
return tmp;
}
 
@@ -176,8 +179,8 @@ ssize_t read_from_oldmem(char *buf, size_t count,
++pfn;
offset = 0;
} while (count);
+   srcu_read_unlock(&vmcore_cb_srcu, idx);
 
-   up_read(&vmcore_cb_rwsem);
return read;
 }
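
(The resulting synchronization pattern, reduced to a self-contained
sketch with generic names, not the patch's:)

#include <linux/rculist.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>

struct example_cb {
	struct list_head next;
	void (*fn)(struct example_cb *cb);
};

static LIST_HEAD(example_list);
static DEFINE_SPINLOCK(example_lock);	/* serializes list updates */
DEFINE_STATIC_SRCU(example_srcu);	/* protects lockless readers */

static void example_for_each(void)
{
	struct example_cb *cb;
	int idx = srcu_read_lock(&example_srcu);

	list_for_each_entry_srcu(cb, &example_list, next,
				 srcu_read_lock_held(&example_srcu))
		cb->fn(cb);
	srcu_read_unlock(&example_srcu, idx);
}

static void example_unregister(struct example_cb *cb)
{
	spin_lock(&example_lock);
	list_del_rcu(&cb->next);
	spin_unlock(&example_lock);
	/*
	 * Wait until no reader can still observe cb. Unlike the rwsem,
	 * SRCU readers do not create a lock-ordering dependency against
	 * mmap_lock, which is what made lockdep complain.
	 */
	synchronize_srcu(&example_srcu);
}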

Re: [PATCH v1] proc/vmcore: fix false positive lockdep warning

2022-01-19 Thread David Hildenbrand
On 19.01.22 16:15, David Hildenbrand wrote:
> On 19.01.22 16:08, Boqun Feng wrote:
>> Hi,
>>
>> On Wed, Jan 19, 2022 at 12:37:02PM +0100, David Hildenbrand wrote:
>>> Lockdep complains that we do during mmap of the vmcore:
>>> down_write(mmap_lock);
>>> down_read(vmcore_cb_rwsem);
>>> And during read of the vmcore:
>>> down_read(vmcore_cb_rwsem);
>>> down_read(mmap_lock);
>>>
>>> We cannot possibly deadlock when only taking vmcore_cb_rwsem in read
>>> mode, however, it's hard to teach that to lockdep.
>>>
>>
>> Lockdep warned about the above sequences because rw_semaphore is a fair
>> read-write lock, and the following can cause a deadlock:
>>
>>  TASK 1  TASK 2  TASK 3
>>  ==  ==  ==
>>  down_write(mmap_lock);
>>  down_read(vmcore_cb_rwsem)
>>  down_write(vmcore_cb_rwsem); // 
>> blocked
>>  down_read(vmcore_cb_rwsem); // cannot get the lock because of the 
>> fairness
>>  down_read(mmap_lock); // blocked
>>  
>> IOW, a reader can block another reader if there is a writer queued by the
>> second reader and the lock is fair.
>>
>> So there is a deadlock possibility.
> 
> Task 3 will never take the mmap_lock before doing a
> down_write(vmcore_cb_rwsem).
> 
> How would this happen?

Ah, I get it, nevermind. I'll adjust the patch description.

Thanks!

-- 
Thanks,

David / dhildenb




Re: [PATCH v1] proc/vmcore: fix false positive lockdep warning

2022-01-19 Thread David Hildenbrand
On 19.01.22 16:08, Boqun Feng wrote:
> Hi,
> 
> On Wed, Jan 19, 2022 at 12:37:02PM +0100, David Hildenbrand wrote:
>> Lockdep complains that we do during mmap of the vmcore:
>>  down_write(mmap_lock);
>>  down_read(vmcore_cb_rwsem);
>> And during read of the vmcore:
>>  down_read(vmcore_cb_rwsem);
>>  down_read(mmap_lock);
>>
>> We cannot possibly deadlock when only taking vmcore_cb_rwsem in read
>> mode, however, it's hard to teach that to lockdep.
>>
> 
> Lockdep warned about the above sequences because rw_semaphore is a fair
> read-write lock, and the following can cause a deadlock:
> 
>   TASK 1  TASK 2  TASK 3
>   ==  ==  ==
>   down_write(mmap_lock);
>   down_read(vmcore_cb_rwsem)
>   down_write(vmcore_cb_rwsem); // 
> blocked
>   down_read(vmcore_cb_rwsem); // cannot get the lock because of the 
> fairness
>   down_read(mmap_lock); // blocked
>   
> IOW, a reader can block another reader if there is a writer queued by the
> second reader and the lock is fair.
> 
> So there is a deadlock possibility.

Task 3 will never take the mmap_lock before doing a
down_write(vmcore_cb_rwsem).

How would this happen?


-- 
Thanks,

David / dhildenb




[PATCH v1] proc/vmcore: fix false positive lockdep warning

2022-01-19 Thread David Hildenbrand
a RBX: 0001 RCX: 7fc7359b8fc7
[6.433746] RDX: 0001 RSI: 0040 RDI: 
[6.434529] RBP: 55a1125ecf10 R08: 0003 R09: 2000
[6.435310] R10: 0002 R11: 0246 R12: 2000
[6.436093] R13: 0040 R14: 55a1124269e2 R15: 
[6.436887]  

Reported-by: Baoquan He 
Cc: Andrew Morton 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Dave Young 
Cc: "Paul E. McKenney" 
Cc: Josh Triplett 
Cc: Peter Zijlstra 
Cc: Boqun Feng 
Signed-off-by: David Hildenbrand 
---

Based on next-20220118

---
 fs/proc/vmcore.c | 41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 702754dd1daf..edeb01dfe05d 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -62,7 +62,8 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 /* Device Dump Size */
 static size_t vmcoredd_orig_sz;
 
-static DECLARE_RWSEM(vmcore_cb_rwsem);
+static DEFINE_SPINLOCK(vmcore_cb_lock);
+DEFINE_STATIC_SRCU(vmcore_cb_srcu);
 /* List of registered vmcore callbacks. */
 static LIST_HEAD(vmcore_cb_list);
 /* Whether the vmcore has been opened once. */
@@ -70,8 +71,8 @@ static bool vmcore_opened;
 
 void register_vmcore_cb(struct vmcore_cb *cb)
 {
-   down_write(&vmcore_cb_rwsem);
INIT_LIST_HEAD(&cb->next);
+   spin_lock(&vmcore_cb_lock);
list_add_tail(&cb->next, &vmcore_cb_list);
/*
 * Registering a vmcore callback after the vmcore was opened is
@@ -79,14 +80,14 @@ void register_vmcore_cb(struct vmcore_cb *cb)
 */
if (vmcore_opened)
pr_warn_once("Unexpected vmcore callback registration\n");
-   up_write(&vmcore_cb_rwsem);
+   spin_unlock(&vmcore_cb_lock);
 }
 EXPORT_SYMBOL_GPL(register_vmcore_cb);
 
 void unregister_vmcore_cb(struct vmcore_cb *cb)
 {
-   down_write(&vmcore_cb_rwsem);
-   list_del(&cb->next);
+   spin_lock(&vmcore_cb_lock);
+   list_del_rcu(&cb->next);
/*
 * Unregistering a vmcore callback after the vmcore was opened is
 * very unusual (e.g., forced driver removal), but we cannot stop
@@ -94,7 +95,9 @@ void unregister_vmcore_cb(struct vmcore_cb *cb)
 */
if (vmcore_opened)
pr_warn_once("Unexpected vmcore callback unregistration\n");
-   up_write(&vmcore_cb_rwsem);
+   spin_unlock(&vmcore_cb_lock);
+
+   synchronize_srcu(&vmcore_cb_srcu);
 }
 EXPORT_SYMBOL_GPL(unregister_vmcore_cb);
 
@@ -103,9 +106,8 @@ static bool pfn_is_ram(unsigned long pfn)
struct vmcore_cb *cb;
bool ret = true;
 
-   lockdep_assert_held_read(&vmcore_cb_rwsem);
-
-   list_for_each_entry(cb, &vmcore_cb_list, next) {
+   list_for_each_entry_srcu(cb, &vmcore_cb_list, next,
+srcu_read_lock_held(&vmcore_cb_srcu)) {
if (unlikely(!cb->pfn_is_ram))
continue;
ret = cb->pfn_is_ram(cb, pfn);
@@ -118,9 +120,9 @@ static bool pfn_is_ram(unsigned long pfn)
 
 static int open_vmcore(struct inode *inode, struct file *file)
 {
-   down_read(&vmcore_cb_rwsem);
+   spin_lock(&vmcore_cb_lock);
vmcore_opened = true;
-   up_read(&vmcore_cb_rwsem);
+   spin_unlock(&vmcore_cb_lock);
 
return 0;
 }
@@ -133,6 +135,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
unsigned long pfn, offset;
size_t nr_bytes;
ssize_t read = 0, tmp;
+   int idx;
 
if (!count)
return 0;
@@ -140,7 +143,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
offset = (unsigned long)(*ppos % PAGE_SIZE);
pfn = (unsigned long)(*ppos / PAGE_SIZE);
 
-   down_read(&vmcore_cb_rwsem);
+   idx = srcu_read_lock(&vmcore_cb_srcu);
do {
if (count > (PAGE_SIZE - offset))
nr_bytes = PAGE_SIZE - offset;
@@ -165,7 +168,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
   offset, userbuf);
}
if (tmp < 0) {
-   up_read(&vmcore_cb_rwsem);
+   srcu_read_unlock(&vmcore_cb_srcu, idx);
return tmp;
}
 
@@ -176,8 +179,8 @@ ssize_t read_from_oldmem(char *buf, size_t count,
++pfn;
offset = 0;
} while (count);
+   srcu_read_unlock(&vmcore_cb_srcu, idx);
 
-   up_read(&vmcore_cb_rwsem);
return read;
 }
 
@@ -568,18 +571,18 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct 
*vma,
unsigned long from, unsigned long pfn,
unsigned long size, pgprot_t prot)
 {
-   int ret;
+   int ret, idx;
 
/*
-* Check if oldmem_pfn_is_ram was registered to avoid
-* looping over all pages without a reason.
+* Check if a callback was registered to avoid looping over all
+* pages without a reason.
 */
-   down_read(&vmcore_cb_rwsem);
+   idx = srcu_read_lock(&vmcore_cb_srcu);
if (!list_empty(&vmcore_cb_list))
ret = remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
else
ret = remap_oldmem_pfn_range(vma, from, pfn, size, prot);
-   up_read(&vmcore_cb_rwsem);
+   srcu_read_unlock(&vmcore_cb_srcu, idx);
return ret;
 }

Re: [PATCH v4 2/3] dma/pool: create dma atomic pool only if dma zone has managed pages

2022-01-03 Thread David Hildenbrand
On 23.12.21 10:44, Baoquan He wrote:
> Currently three dma atomic pools are initialized as long as the relevant
> kernel codes are built in. While in kdump kernel of x86_64, this is not
> right when trying to create atomic_pool_dma, because there's no managed
> pages in DMA zone. In the case, DMA zone only has low 1M memory presented
> and locked down by memblock allocator. So no pages are added into buddy
> of DMA zone. Please check commit f1d4d47c5851 ("x86/setup: Always reserve
> the first 1M of RAM").
> 
> Then in kdump kernel of x86_64, it always prints below failure message:
> 
>  DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
>  swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), 
> nodemask=(null),cpuset=/,mems_allowed=0
>  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.13.0-0.rc5.20210611git929d931f2b40.42.fc35.x86_64 #1
>  Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018
>  Call Trace:
>   dump_stack+0x7f/0xa1
>   warn_alloc.cold+0x72/0xd6
>   ? _raw_spin_unlock_irq+0x24/0x40
>   ? __alloc_pages_direct_compact+0x90/0x1b0
>   __alloc_pages_slowpath.constprop.0+0xf29/0xf50
>   ? __cond_resched+0x16/0x50
>   ? prepare_alloc_pages.constprop.0+0x19d/0x1b0
>   __alloc_pages+0x24d/0x2c0
>   ? __dma_atomic_pool_init+0x93/0x93
>   alloc_page_interleave+0x13/0xb0
>   atomic_pool_expand+0x118/0x210
>   ? __dma_atomic_pool_init+0x93/0x93
>   __dma_atomic_pool_init+0x45/0x93
>   dma_atomic_pool_init+0xdb/0x176
>   do_one_initcall+0x67/0x320
>   ? rcu_read_lock_sched_held+0x3f/0x80
>   kernel_init_freeable+0x290/0x2dc
>   ? rest_init+0x24f/0x24f
>   kernel_init+0xa/0x111
>   ret_from_fork+0x22/0x30
>  Mem-Info:
>  ..
>  DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
>  DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
> 
> Here, let's check if DMA zone has managed pages, then create atomic_pool_dma
> if yes. Otherwise just skip it.
> 
> Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the 
> crashkernel option is specified")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Baoquan He 
> Cc: Christoph Hellwig 
> Cc: Marek Szyprowski 
> Cc: Robin Murphy 
> Cc: io...@lists.linux-foundation.org
> ---
>  kernel/dma/pool.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index 5a85804b5beb..00df3edd6c5d 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -206,7 +206,7 @@ static int __init dma_atomic_pool_init(void)
>   GFP_KERNEL);
>   if (!atomic_pool_kernel)
>   ret = -ENOMEM;
> - if (IS_ENABLED(CONFIG_ZONE_DMA)) {
> + if (has_managed_dma()) {
>   atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
>   GFP_KERNEL | GFP_DMA);
>   if (!atomic_pool_dma)
> @@ -229,7 +229,7 @@ static inline struct gen_pool *dma_guess_pool(struct 
> gen_pool *prev, gfp_t gfp)
>   if (prev == NULL) {
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
>   return atomic_pool_dma32;
> - if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
> + if (atomic_pool_dma && (gfp & GFP_DMA))
>   return atomic_pool_dma;
>   return atomic_pool_kernel;
>   }

I thought for a second that we might have to tweak
atomic_pool_work_fn(), but atomic_pool_resize() handles it properly already.

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb




Re: [PATCH v3 3/5] mm_zone: add function to check if managed dma zone exists

2021-12-16 Thread David Hildenbrand
On 13.12.21 13:27, Baoquan He wrote:
> In some places of the current kernel, it assumes that dma zone must have
> managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true.
> E.g in kdump kernel of x86_64, only low 1M is presented and locked down
> at very early stage of boot, so that there's no managed pages at all in
> DMA zone. This exception will always cause page allocation failure if page
> is requested from DMA zone.
> 
> Here add function has_managed_dma() and the relevant helper functions to
> check if there's DMA zone with managed pages. It will be used in later
> patches.
> 
> Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the 
> crashkernel option is specified")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Baoquan He 
> ---
> v2->v3:
>  Rewrite has_managed_dma() in a simpler and more efficient way which is
>  suggested by DavidH. 
> 
>  include/linux/mmzone.h |  9 +
>  mm/page_alloc.c| 15 +++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 58e744b78c2c..6e1b726e9adf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1046,6 +1046,15 @@ static inline int is_highmem_idx(enum zone_type idx)
>  #endif
>  }
>  
> +#ifdef CONFIG_ZONE_DMA
> +bool has_managed_dma(void);
> +#else
> +static inline bool has_managed_dma(void)
> +{
> + return false;
> +}
> +#endif
> +
>  /**
>   * is_highmem - helper function to quickly check if a struct zone is a
>   *  highmem zone or not.  This is an attempt to keep references
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..7c7a0b5de2ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -9460,3 +9460,18 @@ bool take_page_off_buddy(struct page *page)
>   return ret;
>  }
>  #endif
> +
> +#ifdef CONFIG_ZONE_DMA
> +bool has_managed_dma(void)
> +{
> + struct pglist_data *pgdat;
> +
> + for_each_online_pgdat(pgdat) {
> + struct zone *zone = &pgdat->node_zones[ZONE_DMA];
> +
> + if (managed_zone(zone))
> + return true;
> + }
> + return false;
> +}
> +#endif /* CONFIG_ZONE_DMA */
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb




Re: [PATCH RESEND v2 3/5] mm_zone: add function to check if managed dma zone exists

2021-12-09 Thread David Hildenbrand
On 09.12.21 14:02, Baoquan He wrote:
> On 12/07/21 at 12:23pm, David Hildenbrand wrote:
>> On 07.12.21 04:07, Baoquan He wrote:
>>> In some places of the current kernel, it assumes that dma zone must have
>>> managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true.
>>> E.g in kdump kernel of x86_64, only low 1M is presented and locked down
>>> at very early stage of boot, so that there's no managed pages at all in
>>> DMA zone. This exception will always cause page allocation failure if page
>>> is requested from DMA zone.
>>>
>>> Here add function has_managed_dma() and the relevant helper functions to
>>> check if there's DMA zone with managed pages. It will be used in later
>>> patches.
>>>
>>> Signed-off-by: Baoquan He 
>>> ---
>>>  include/linux/mmzone.h | 21 +
>>>  mm/page_alloc.c| 11 +++
>>>  2 files changed, 32 insertions(+)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 58e744b78c2c..82d23e13e0e5 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -998,6 +998,18 @@ static inline bool zone_is_zone_device(struct zone 
>>> *zone)
>>>  }
>>>  #endif
>>>  
>>> +#ifdef CONFIG_ZONE_DMA
>>> +static inline bool zone_is_dma(struct zone *zone)
>>> +{
>>> +   return zone_idx(zone) == ZONE_DMA;
>>> +}
>>> +#else
>>> +static inline bool zone_is_dma(struct zone *zone)
>>> +{
>>> +   return false;
>>> +}
>>> +#endif
>>> +
>>>  /*
>>>   * Returns true if a zone has pages managed by the buddy allocator.
>>>   * All the reclaim decisions have to use this function rather than
>>> @@ -1046,6 +1058,7 @@ static inline int is_highmem_idx(enum zone_type idx)
>>>  #endif
>>>  }
>>>  
>>> +bool has_managed_dma(void);
>>>  /**
>>>   * is_highmem - helper function to quickly check if a struct zone is a
>>>   *  highmem zone or not.  This is an attempt to keep references
>>> @@ -1131,6 +1144,14 @@ extern struct zone *next_zone(struct zone *zone);
>>> ; /* do nothing */  \
>>> else
>>>  
>>> +#define for_each_managed_zone(zone)\
>>> +   for (zone = (first_online_pgdat())->node_zones; \
>>> +zone;  \
>>> +zone = next_zone(zone))\
>>> +   if (!managed_zone(zone))\
>>> +   ; /* do nothing */  \
>>> +   else
>>> +
>>>  static inline struct zone *zonelist_zone(struct zoneref *zoneref)
>>>  {
>>> return zoneref->zone;
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c5952749ad40..ac0ea42a4e5f 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -9459,4 +9459,15 @@ bool take_page_off_buddy(struct page *page)
> spin_unlock_irqrestore(&zone->lock, flags);
>>> return ret;
>>>  }
>>> +
>>> +bool has_managed_dma(void)
>>> +{
>>> +   struct zone *zone;
>>> +
>>> +   for_each_managed_zone(zone) {
>>> +   if (zone_is_dma(zone))
>>> +   return true;
>>> +   }
>>> +   return false;
>>> +}
>>
>> Wouldn't it be "easier/faster" to just iterate online nodes and directly
>> obtain the ZONE_DMA, checking if there are managed pages?
> 
> Thanks, Dave.
> 
> Please check for_each_managed_zone(), it already iterates over online
> nodes and each of their managed zones.
> 
> Is below what you are suggesting? The only difference is I introduced
> for_each_managed_zone() which can be reused later if needed. Not sure if
> I got your suggestion correctly.
> 
> bool has_managed_dma(void)
> {
> struct pglist_data *pgdat;
> struct zone *zone;
> enum zone_type i, j;
> 
> for_each_online_pgdat(pgdat) {
> for (i = 0; i < MAX_NR_ZONES - 1; i++) {  
> struct zone *zone = &pgdat->node_zones[i];
> if (zone_is_dma(zone))
> return true;
> }
> }
> return false;
> 
> }


Even simpler, no need to iterate over zones at all, only over nodes:

#ifdef CONFIG_ZONE_DMA
bool has_managed_dma(void)
{
struct pglist_data *pgdat;

for_each_online_pgdat(pgdat) {
struct zone *zone = &pgdat->node_zones[ZONE_DMA];

if (managed_zone(zone))
return true;
}
return false;
}
#endif /* CONFIG_ZONE_DMA */

Without CONFIG_ZONE_DMA, simply provide a dummy in the header that
returns false.
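
(Spelled out, the header side then reads as follows; this is the shape
that v3 of the patch adopted:)

#ifdef CONFIG_ZONE_DMA
bool has_managed_dma(void);
#else
static inline bool has_managed_dma(void)
{
	return false;
}
#endif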

-- 
Thanks,

David / dhildenb




Re: [RFC v2 4/6] crash hp: generic crash hotplug support infrastructure

2021-12-08 Thread David Hildenbrand
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int crash_memhp_notifier(struct notifier_block *nb,
> + unsigned long val, void *v)
> +{
> + struct memory_notify *mhp = v;
> + unsigned long start, end;
> +
> + start = mhp->start_pfn << PAGE_SHIFT;
> + end = ((mhp->start_pfn + mhp->nr_pages) << PAGE_SHIFT) - 1;
> +
> + switch (val) {
> + case MEM_GOING_ONLINE:
> + crash_hotplug_handler(KEXEC_CRASH_HP_ADD_MEMORY,
> + start, end-start);
> + break;
> +
> + case MEM_OFFLINE:
> + case MEM_CANCEL_ONLINE:
> + crash_hotplug_handler(KEXEC_CRASH_HP_REMOVE_MEMORY,
> + start, end-start);

Any reason you don't handle this after the effects completely, meaning
MEM_ONLINE and MEM_OFFLINE?
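
(I.e., something like this sketch, reusing the handler and event names
from the quoted patch; later revisions of the series did switch to the
completed-transition events:)

	switch (val) {
	case MEM_ONLINE:
		crash_hotplug_handler(KEXEC_CRASH_HP_ADD_MEMORY,
				start, end-start);
		break;

	case MEM_OFFLINE:
		crash_hotplug_handler(KEXEC_CRASH_HP_REMOVE_MEMORY,
				start, end-start);
		break;
	}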

-- 
Thanks,

David / dhildenb




Re: [PATCH RESEND v2 3/5] mm_zone: add function to check if managed dma zone exists

2021-12-07 Thread David Hildenbrand
On 07.12.21 04:07, Baoquan He wrote:
> In some places of the current kernel, it assumes that dma zone must have
> managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true.
> E.g in kdump kernel of x86_64, only low 1M is presented and locked down
> at very early stage of boot, so that there's no managed pages at all in
> DMA zone. This exception will always cause page allocation failure if page
> is requested from DMA zone.
> 
> Here add function has_managed_dma() and the relevant helper functions to
> check if there's DMA zone with managed pages. It will be used in later
> patches.
> 
> Signed-off-by: Baoquan He 
> ---
>  include/linux/mmzone.h | 21 +
>  mm/page_alloc.c| 11 +++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 58e744b78c2c..82d23e13e0e5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -998,6 +998,18 @@ static inline bool zone_is_zone_device(struct zone *zone)
>  }
>  #endif
>  
> +#ifdef CONFIG_ZONE_DMA
> +static inline bool zone_is_dma(struct zone *zone)
> +{
> + return zone_idx(zone) == ZONE_DMA;
> +}
> +#else
> +static inline bool zone_is_dma(struct zone *zone)
> +{
> + return false;
> +}
> +#endif
> +
>  /*
>   * Returns true if a zone has pages managed by the buddy allocator.
>   * All the reclaim decisions have to use this function rather than
> @@ -1046,6 +1058,7 @@ static inline int is_highmem_idx(enum zone_type idx)
>  #endif
>  }
>  
> +bool has_managed_dma(void);
>  /**
>   * is_highmem - helper function to quickly check if a struct zone is a
>   *  highmem zone or not.  This is an attempt to keep references
> @@ -1131,6 +1144,14 @@ extern struct zone *next_zone(struct zone *zone);
>   ; /* do nothing */  \
>   else
>  
> +#define for_each_managed_zone(zone)  \
> + for (zone = (first_online_pgdat())->node_zones; \
> +  zone;  \
> +  zone = next_zone(zone))\
> + if (!managed_zone(zone))\
> + ; /* do nothing */  \
> + else
> +
>  static inline struct zone *zonelist_zone(struct zoneref *zoneref)
>  {
>   return zoneref->zone;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..ac0ea42a4e5f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -9459,4 +9459,15 @@ bool take_page_off_buddy(struct page *page)
> spin_unlock_irqrestore(&zone->lock, flags);
>   return ret;
>  }
> +
> +bool has_managed_dma(void)
> +{
> + struct zone *zone;
> +
> + for_each_managed_zone(zone) {
> + if (zone_is_dma(zone))
> + return true;
> + }
> + return false;
> +}

Wouldn't it be "easier/faster" to just iterate online nodes and directly
obtain the ZONE_DMA, checking if there are managed pages?


-- 
Thanks,

David / dhildenb




Re: [PATCH v2] proc/vmcore: fix clearing user buffer by properly using clear_user()

2021-11-16 Thread David Hildenbrand
On 15.11.21 23:04, Andrew Morton wrote:
> On Fri, 12 Nov 2021 10:27:50 +0100 David Hildenbrand  wrote:
> 
>> To clear a user buffer we cannot simply use memset, we have to use
>> clear_user(). With a virtio-mem device that registers a vmcore_cb and has
>> some logically unplugged memory inside an added Linux memory block, I can
>> easily trigger a BUG by copying the vmcore via "cp":
>>
>> ...
>>
>> Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
>> Prevention (SMAP)", which is used to detect wrong access from the kernel to
>> user buffers like this: SMAP triggers a permissions violation on wrong
>> access. In the x86-64 variant of clear_user(), SMAP is properly
>> handled via clac()+stac().
>>
>> To fix, properly use clear_user() when we're dealing with a user buffer.
>>
> 
> I added cc:stable, OK?
> 

I was a bit hesitant because this would (before the virtio-mem changes)
only trigger under XEN, and I was wondering why nobody noticed under XEN
so far. But yes, even though it only applies to the kdump kernel,
cc:stable sounds like the right thing to do!

Thanks Andrew!

-- 
Thanks,

David / dhildenb




[PATCH v2] proc/vmcore: fix clearing user buffer by properly using clear_user()

2021-11-12 Thread David Hildenbrand
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and has
some logically unplugged memory inside an added Linux memory block, I can
easily trigger a BUG by copying the vmcore via "cp":

[   11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
[   11.339697] kdump[420]: Kdump is using the default log level(3).
[   11.370964] kdump[453]: saving to 
/sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[   11.373997] kdump[458]: saving vmcore-dmesg.txt to 
/sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[   11.385357] kdump[465]: saving vmcore-dmesg.txt complete
[   11.386722] kdump[467]: saving vmcore
[   16.531275] BUG: unable to handle page fault for address: 7f2374e01000
[   16.531705] #PF: supervisor write access in kernel mode
[   16.532037] #PF: error_code(0x0003) - permissions violation
[   16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 
80007048f867
[   16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
[   16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
[   16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
[   16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 
38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 
00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
[   16.535670] RSP: 0018:c973be08 EFLAGS: 00010212
[   16.535998] RAX: 1000 RBX: 002fd000 RCX: 7f2374e01000
[   16.536441] RDX: 0001 RSI: dfff RDI: 7f2374e01008
[   16.536878] RBP: 1000 R08:  R09: c973bc50
[   16.537315] R10: c973bc48 R11: 829461a8 R12: f000
[   16.537755] R13: 7f2374e01000 R14:  R15: 88807bd421e8
[   16.538200] FS:  7f2374e12140() GS:88807f00() 
knlGS:
[   16.538696] CS:  0010 DS:  ES:  CR0: 80050033
[   16.539055] CR2: 7f2374e01000 CR3: 7a4aa000 CR4: 00350eb0
[   16.539510] Call Trace:
[   16.539679]  <TASK>
[   16.539828]  read_vmcore+0x236/0x2c0
[   16.540063]  ? enqueue_hrtimer+0x2f/0x80
[   16.540323]  ? inode_security+0x22/0x60
[   16.540572]  proc_reg_read+0x55/0xa0
[   16.540807]  vfs_read+0x95/0x190
[   16.541022]  ksys_read+0x4f/0xc0
[   16.541238]  do_syscall_64+0x3b/0x90
[   16.541475]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel to
user buffers like this: SMAP triggers a permissions violation on wrong
access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().
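
(Reduced to its essence, the access pattern is roughly the following; a
simplified model, not the literal arch code, with raw_clear_user()
standing in for the unsafe inner zeroing loop:)

unsigned long clear_user_model(void __user *to, unsigned long n)
{
	if (!access_ok(to, n))
		return n;
	stac();				/* EFLAGS.AC set: user access ok */
	n = raw_clear_user(to, n);	/* may fault; returns bytes left */
	clac();				/* EFLAGS.AC clear: access traps */
	return n;			/* 0 on success */
}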

To fix, properly use clear_user() when we're dealing with a user buffer.

Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check 
for non-ram pages")
Acked-by: Baoquan He 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Andrew Morton 
Cc: Philipp Rudo 
Cc: kexec@lists.infradead.org
Cc: linux...@kvack.org
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---

v1 -> v2:
- Extend patch description

---
 fs/proc/vmcore.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 30a3b66f475a..509f85148fee 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size_t count,
nr_bytes = count;
 
/* If pfn is not ram, return zeros for sparse dump files */
-   if (!pfn_is_ram(pfn))
-   memset(buf, 0, nr_bytes);
-   else {
+   if (!pfn_is_ram(pfn)) {
+   tmp = 0;
+   if (!userbuf)
+   memset(buf, 0, nr_bytes);
+   else if (clear_user(buf, nr_bytes))
+   tmp = -EFAULT;
+   } else {
if (encrypted)
tmp = copy_oldmem_page_encrypted(pfn, buf,
 nr_bytes,
@@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size_t count,
else
tmp = copy_oldmem_page(pfn, buf, nr_bytes,
   offset, userbuf);
-
-   if (tmp < 0) {
-   up_read(&vmcore_cb_rwsem);
-   return tmp;
-   }
}
+   if (tmp < 0) {
+   up_read(&vmcore_cb_rwsem);
+   return tmp;
+   }
+
*ppos += nr_bytes;
count -= nr_bytes;
buf += nr_bytes;

Re: [PATCH v1] proc/vmcore: fix clearing user buffer by properly using clear_user()

2021-11-12 Thread David Hildenbrand
> > "that allows supervisor mode programs to optionally set user-space
> > memory mappings so that access to those mappings from supervisor mode
> > will cause a trap. This makes it harder for malicious programs to
> > "trick" the kernel into using instructions or data from a user-space
> > program"
>
> OK, probably. I thought it's triggered in access_ok(), and tried to
> figure out why. But seems we should do something to check this in
> access_ok(), otherwise the logic of clear_user()/__clear_user() is not so
> reasonable. Anyway, I have learned it, thanks a lot for digging it out.
>
> By the way, I can't open the above wiki article, but found the commit
> below from hpa. Maybe we can add some of this to the log; not a strong
> opinion, leave it to you.

Yes, now that we know the root cause I'll add some more details to the
patch description and resend -- thanks Baoquan!




Re: [PATCH v1] proc/vmcore: don't fake reading zeroes on surprise vmcore_cb unregistration

2021-11-12 Thread David Hildenbrand
On 12.11.21 04:30, Baoquan He wrote:
> On 11/11/21 at 08:22pm, David Hildenbrand wrote:
>> In commit cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback
>> to more generic vmcore callbacks"), we added detection of surprise
>> vmcore_cb unregistration after the vmcore was already opened. Once
>> detected, we warn the user and simulate reading zeroes from that point on
>> when accessing the vmcore.
>>
>> The basic reason was that unexpected unregistration, for example, by
>> manually unbinding a driver from a device after opening the
>> vmcore, is not supported and could result in reading oldmem the
>> vmcore_cb would have actually prohibited while registered. However,
>> something like that can similarly be triggered by a user that's really
>> looking for trouble simply by unbinding the relevant driver before opening
>> the vmcore -- or by disallowing loading the driver in the first place.
>> So it's actually of limited help.
> 
> Yes, this is the change what I would like to see in the original patch
> "proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore 
> callbacks".
> I am happy with this patch appended to commit cc5f2704c934.

Good, thanks!

> 
>>
>> Currently, unregistration can only be triggered via virtio-mem when
>> manually unbinding the driver from the device inside the VM; there is no
>> way to trigger it from the hypervisor, as hypervisors don't allow for
>> unplugging virtio-mem devices -- ripping out system RAM from a VM without
>> coordination with the guest is usually not a good idea.
>>
>> The important part is that unbinding the driver and unregistering the
>> vmcore_cb while concurrently reading the vmcore won't crash the system,
>> and that is handled by the rwsem.
>>
>> To make the mechanism more future proof, let's remove the "read zero"
>> part, but leave the warning in place. For example, we could have a future
>> driver (like virtio-balloon) that will contact the hypervisor to figure out
>> if we already populated a page for a given PFN. Hotunplugging such a device
>> and consequently unregistering the vmcore_cb could be triggered from the
>> hypervisor without harming the system even while kdump is running. In that
> 
> I am a little confused, could you tell more about "contact the hypervisor to
> figure out if we already populated a page for a given PFN."? I think
> vmcore dumping relies on the eflcorehdr which is created beforehand, and
> relies on vmcore_cb registered in advance too, virtio-balloon could take
> another way to interact with kdump to make sure the dumpable ram?

This is essentially what the XEN callback does: check if a PFN is
actually populated in the hypervisor; if not, avoid reading it so we
won't be faulting+populating a fresh/zero page in the hypervisor just to
be able to dump it in the guest. But in the XEN world we usually simply
rely on straight hypercalls, not glued to actual devices that can get
hot(un)plugged.

Once you have some device that performs such checks instead that could
get hotunplugged and unregister the vmcore_cb (and virtio-balloon is
just one example), you would be able to trigger this.

As we're dealing with a moving target (hypervisor will populate pages as
necessary once the old kernel accesses them), there isn't really a way
to adjust this in the old kernel -- where we build the eflcorehdr. We
could try to adjust the elfcorehdr in the new kernel, but that certainly
opens up another can of worms.

But again, this is just an example to back the "future proof" claim
because Dave was explicitly concerned about this situation.

-- 
Thanks,

David / dhildenb




Re: [PATCH v1] proc/vmcore: fix clearing user buffer by properly using clear_user()

2021-11-12 Thread David Hildenbrand
On 12.11.21 08:01, Baoquan He wrote:
> On 11/11/21 at 08:18pm, David Hildenbrand wrote:
>> To clear a user buffer we cannot simply use memset, we have to use
>> clear_user(). Using a kernel config based on rawhide Fedora and a
>> virtio-mem device that registers a vmcore_cb, I can easily trigger:
>>
>> [   11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
>> [   11.339697] kdump[420]: Kdump is using the default log level(3).
>> [   11.370964] kdump[453]: saving to 
>> /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
>> [   11.373997] kdump[458]: saving vmcore-dmesg.txt to 
>> /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
>> [   11.385357] kdump[465]: saving vmcore-dmesg.txt complete
>> [   11.386722] kdump[467]: saving vmcore
>> [   16.531275] BUG: unable to handle page fault for address: 7f2374e01000
>> [   16.531705] #PF: supervisor write access in kernel mode
>> [   16.532037] #PF: error_code(0x0003) - permissions violation
>> [   16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 
>> 80007048f867
>> [   16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
>> [   16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
>> [   16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
>> rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
>> [   16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
>> [   16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 
>> c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> 
>> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
>> [   16.535670] RSP: 0018:c973be08 EFLAGS: 00010212
>> [   16.535998] RAX: 1000 RBX: 002fd000 RCX: 
>> 7f2374e01000
>> [   16.536441] RDX: 0001 RSI: dfff RDI: 
>> 7f2374e01008
>> [   16.536878] RBP: 1000 R08:  R09: 
>> c973bc50
>> [   16.537315] R10: c973bc48 R11: 829461a8 R12: 
>> f000
>> [   16.537755] R13: 7f2374e01000 R14:  R15: 
>> 88807bd421e8
>> [   16.538200] FS:  7f2374e12140() GS:88807f00() 
>> knlGS:
>> [   16.538696] CS:  0010 DS:  ES:  CR0: 80050033
>> [   16.539055] CR2: 7f2374e01000 CR3: 7a4aa000 CR4: 
>> 00350eb0
>> [   16.539510] Call Trace:
>> [   16.539679]  <TASK>
>> [   16.539828]  read_vmcore+0x236/0x2c0
>> [   16.540063]  ? enqueue_hrtimer+0x2f/0x80
>> [   16.540323]  ? inode_security+0x22/0x60
>> [   16.540572]  proc_reg_read+0x55/0xa0
>> [   16.540807]  vfs_read+0x95/0x190
>> [   16.541022]  ksys_read+0x4f/0xc0
>> [   16.541238]  do_syscall_64+0x3b/0x90
>> [   16.541475]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> To fix, properly use clear_user() when required.
> 
> Looks like a great fix to me, thanks for fixing this. 
> 
> Check the code: clear_user() invokes access_ok() to do the check, then
> calls memset(). It's unclear to me how the bug is triggered, could you
> please tell more so that I can learn? 
>
TBH, I was testing virtio-mem+vmcore before without running into this
issue, but after I retested with upstream in a different setup
(different kernel config but eventually also different CPU features), I
ran into this.


Note that you were looking at the generic __clear_user() implementation,
the x86-64 variant is different, see arch/x86/lib/usercopy_64.c

I can spot that it triggers stac()/clac() (X86_SMAP):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention

"that allows supervisor mode programs to optionally set user-space
memory mappings so that access to those mappings from supervisor mode
will cause a trap. This makes it harder for malicious programs to
"trick" the kernel into using instructions or data from a user-space
program"

Yes, that's most probably it :)

-- 
Thanks,

David / dhildenb




[PATCH v1] proc/vmcore: don't fake reading zeroes on surprise vmcore_cb unregistration

2021-11-11 Thread David Hildenbrand
In commit cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback
to more generic vmcore callbacks"), we added detection of surprise
vmcore_cb unregistration after the vmcore was already opened. Once
detected, we warn the user and simulate reading zeroes from that point on
when accessing the vmcore.

The basic reason was that unexpected unregistration, for example, by
manually unbinding a driver from a device after opening the
vmcore, is not supported and could result in reading oldmem the
vmcore_cb would have actually prohibited while registered. However,
something like that can similarly be triggered by a user that's really
looking for trouble simply by unbinding the relevant driver before opening
the vmcore -- or by disallowing loading the driver in the first place.
So it's actually of limited help.

Currently, unregistration can only be triggered via virtio-mem when
manually unbinding the driver from the device inside the VM; there is no
way to trigger it from the hypervisor, as hypervisors don't allow for
unplugging virtio-mem devices -- ripping out system RAM from a VM without
coordination with the guest is usually not a good idea.

The important part is that unbinding the driver and unregistering the
vmcore_cb while concurrently reading the vmcore won't crash the system,
and that is handled by the rwsem.

To make the mechanism more future proof, let's remove the "read zero"
part, but leave the warning in place. For example, we could have a future
driver (like virtio-balloon) that will contact the hypervisor to figure out
if we already populated a page for a given PFN. Hotunplugging such a device
and consequently unregistering the vmcore_cb could be triggered from the
hypervisor without harming the system even while kdump is running. In that
case, we don't want to silently end up with a vmcore that contains wrong
data, because the user inside the VM might be unaware of the hypervisor
action and might easily miss the warning in the log.

Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Andrew Morton 
Cc: Philipp Rudo 
Cc: kexec@lists.infradead.org
Cc: linux...@kvack.org
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 fs/proc/vmcore.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 30a3b66f475a..948691cf4a1a 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -65,8 +65,6 @@ static size_t vmcoredd_orig_sz;
 static DECLARE_RWSEM(vmcore_cb_rwsem);
 /* List of registered vmcore callbacks. */
 static LIST_HEAD(vmcore_cb_list);
-/* Whether we had a surprise unregistration of a callback. */
-static bool vmcore_cb_unstable;
 /* Whether the vmcore has been opened once. */
 static bool vmcore_opened;
 
@@ -94,10 +92,8 @@ void unregister_vmcore_cb(struct vmcore_cb *cb)
 * very unusual (e.g., forced driver removal), but we cannot stop
 * unregistering.
 */
-   if (vmcore_opened) {
+   if (vmcore_opened)
pr_warn_once("Unexpected vmcore callback unregistration\n");
-   vmcore_cb_unstable = true;
-   }
up_write(&vmcore_cb_rwsem);
 }
 EXPORT_SYMBOL_GPL(unregister_vmcore_cb);
@@ -108,8 +104,6 @@ static bool pfn_is_ram(unsigned long pfn)
bool ret = true;
 
lockdep_assert_held_read(&vmcore_cb_rwsem);
-   if (unlikely(vmcore_cb_unstable))
-   return false;
 
list_for_each_entry(cb, &vmcore_cb_list, next) {
if (unlikely(!cb->pfn_is_ram))
@@ -577,7 +571,7 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct 
*vma,
 * looping over all pages without a reason.
 */
down_read(&vmcore_cb_rwsem);
-   if (!list_empty(&vmcore_cb_list) || vmcore_cb_unstable)
+   if (!list_empty(&vmcore_cb_list))
ret = remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
else
ret = remap_oldmem_pfn_range(vma, from, pfn, size, prot);

base-commit: debe436e77c72fcee804fb867f275e6d31aa999c
-- 
2.31.1




[PATCH v1] proc/vmcore: fix clearing user buffer by properly using clear_user()

2021-11-11 Thread David Hildenbrand
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). Using a kernel config based on rawhide Fedora and a
virtio-mem device that registers a vmcore_cb, I can easily trigger:

[   11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
[   11.339697] kdump[420]: Kdump is using the default log level(3).
[   11.370964] kdump[453]: saving to 
/sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[   11.373997] kdump[458]: saving vmcore-dmesg.txt to 
/sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[   11.385357] kdump[465]: saving vmcore-dmesg.txt complete
[   11.386722] kdump[467]: saving vmcore
[   16.531275] BUG: unable to handle page fault for address: 7f2374e01000
[   16.531705] #PF: supervisor write access in kernel mode
[   16.532037] #PF: error_code(0x0003) - permissions violation
[   16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 
80007048f867
[   16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
[   16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
[   16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
[   16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 
38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 
00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
[   16.535670] RSP: 0018:c973be08 EFLAGS: 00010212
[   16.535998] RAX: 1000 RBX: 002fd000 RCX: 7f2374e01000
[   16.536441] RDX: 0001 RSI: dfff RDI: 7f2374e01008
[   16.536878] RBP: 1000 R08:  R09: c973bc50
[   16.537315] R10: c973bc48 R11: 829461a8 R12: f000
[   16.537755] R13: 7f2374e01000 R14:  R15: 88807bd421e8
[   16.538200] FS:  7f2374e12140() GS:88807f00() 
knlGS:
[   16.538696] CS:  0010 DS:  ES:  CR0: 80050033
[   16.539055] CR2: 7f2374e01000 CR3: 7a4aa000 CR4: 00350eb0
[   16.539510] Call Trace:
[   16.539679]  
[   16.539828]  read_vmcore+0x236/0x2c0
[   16.540063]  ? enqueue_hrtimer+0x2f/0x80
[   16.540323]  ? inode_security+0x22/0x60
[   16.540572]  proc_reg_read+0x55/0xa0
[   16.540807]  vfs_read+0x95/0x190
[   16.541022]  ksys_read+0x4f/0xc0
[   16.541238]  do_syscall_64+0x3b/0x90
[   16.541475]  entry_SYSCALL_64_after_hwframe+0x44/0xae

To fix, properly use clear_user() when required.
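
As an aside, the general rule this fix follows can be sketched as below
(illustrative only, not part of the patch): kernel buffers may be cleared
directly, while user buffers must go through the uaccess helpers, which
handle faults and SMAP/PAN:

	/*
	 * Illustrative sketch: zero a destination buffer that may live in
	 * kernel or user space. A plain memset() on a __user pointer triggers
	 * exactly the permissions-violation oops quoted above.
	 */
	static int zero_dst(char *buf, size_t len, bool userbuf)
	{
		if (!userbuf) {
			/* kernel buffer: direct access is fine */
			memset(buf, 0, len);
			return 0;
		}
		/* user buffer: may fault; clear_user() returns #bytes not cleared */
		if (clear_user((char __user *)buf, len))
			return -EFAULT;
		return 0;
	}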

Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Andrew Morton 
Cc: Philipp Rudo 
Cc: kexec@lists.infradead.org
Cc: linux...@kvack.org
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 fs/proc/vmcore.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 30a3b66f475a..509f85148fee 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size_t count,
nr_bytes = count;
 
/* If pfn is not ram, return zeros for sparse dump files */
-   if (!pfn_is_ram(pfn))
-   memset(buf, 0, nr_bytes);
-   else {
+   if (!pfn_is_ram(pfn)) {
+   tmp = 0;
+   if (!userbuf)
+   memset(buf, 0, nr_bytes);
+   else if (clear_user(buf, nr_bytes))
+   tmp = -EFAULT;
+   } else {
if (encrypted)
tmp = copy_oldmem_page_encrypted(pfn, buf,
 nr_bytes,
@@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size_t count,
else
tmp = copy_oldmem_page(pfn, buf, nr_bytes,
   offset, userbuf);
-
-   if (tmp < 0) {
-   up_read(&vmcore_cb_rwsem);
-   return tmp;
-   }
}
+   if (tmp < 0) {
+   up_read(&vmcore_cb_rwsem);
+   return tmp;
+   }
+
*ppos += nr_bytes;
count -= nr_bytes;
buf += nr_bytes;

base-commit: debe436e77c72fcee804fb867f275e6d31aa999c
-- 
2.31.1




[PATCH v2 3/9] x86/xen: print a warning when HVMOP_get_mem_type fails

2021-10-05 Thread David Hildenbrand
HVMOP_get_mem_type is not expected to fail, "This call failing is
indication of something going quite wrong and it would be good to know
about this." [1]

Let's add a pr_warn_once().

[1] https://lkml.kernel.org/r/3b935aa0-6d85-0bcd-100e-15098add3...@oracle.com

Suggested-by: Boris Ostrovsky 
Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index d1b38c77352b..6ba8826dcdcc 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -22,8 +22,10 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
.pfn = pfn,
};
 
-   if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
+   if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) {
+   pr_warn_once("Unexpected HVMOP_get_mem_type failure\n");
return -ENXIO;
+   }
return a.mem_type != HVMMEM_mmio_dm;
 }
 #endif
-- 
2.31.1




[PATCH v2 2/9] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-10-05 Thread David Hildenbrand
Let's simplify return handling.

Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index b242d1f4b426..d1b38c77352b 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -21,23 +21,10 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
.domid = DOMID_SELF,
.pfn = pfn,
};
-   int ram;
 
if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
return -ENXIO;
-
-   switch (a.mem_type) {
-   case HVMMEM_mmio_dm:
-   ram = 0;
-   break;
-   case HVMMEM_ram_rw:
-   case HVMMEM_ram_ro:
-   default:
-   ram = 1;
-   break;
-   }
-
-   return ram;
+   return a.mem_type != HVMMEM_mmio_dm;
 }
 #endif
 
-- 
2.31.1




[PATCH v2 1/9] x86/xen: update xen_oldmem_pfn_is_ram() documentation

2021-10-05 Thread David Hildenbrand
The callback is only used for the vmcore nowadays.

Reviewed-by: Boris Ostrovsky 
Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index 57409373750f..b242d1f4b426 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -9,12 +9,9 @@
 
 #ifdef CONFIG_PROC_VMCORE
 /*
- * This function is used in two contexts:
- * - the kdump kernel has to check whether a pfn of the crashed kernel
- *   was a ballooned page. vmcore is using this function to decide
- *   whether to access a pfn of the crashed kernel.
- * - the kexec kernel has to check whether a pfn was ballooned by the
- *   previous kernel. If the pfn is ballooned, handle it properly.
+ * The kdump kernel has to check whether a pfn of the crashed kernel
+ * was a ballooned page. vmcore is using this function to decide
+ * whether to access a pfn of the crashed kernel.
  * Returns 0 if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-- 
2.31.1




[PATCH v2 0/9] proc/vmcore: sanitize access to virtio-mem memory

2021-10-05 Thread David Hildenbrand
As so often with virtio-mem changes that mess with common MM
infrastructure, this might be a good candidate to go via Andrew's tree.

--

After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
this series tackles the last sane way a VM could accidentally access
logically unplugged memory managed by a virtio-mem device: /proc/vmcore

When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory part of
a virtio-mem device.

Patch #1-#3 are cleanups. Patch #4 extends the existing oldmem_pfn_is_ram
mechanism. Patch #5-#7 are virtio-mem refactorings for patch #8, which
implements the virtio-mem logic to query the state of device blocks.

Patch #8:

"
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently.
[...]
Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.
"

This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.

Note: this is best-effort. We'll never be able to control what runs inside
the second kernel, really, but we also don't have to care: we only care
about sane setups where we don't want our VM getting zapped once we
touch the wrong memory location while dumping. While we usually expect sane
setups to use "makedumfile", nothing really speaks against just copying
/proc/vmcore, especially in environments where HWpoisioning isn't typically
expected. Also, we really don't want to put all our trust completely on the
memmap, so sanitizing also makes sense when just using "makedumpfile".

[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html

v1 -> v2:
- "x86/xen: simplify xen_oldmem_pfn_is_ram()"
-- Simplify even more
- "x86/xen: print a warning when HVMOP_get_mem_type fails"
-- Added
- "virtio-mem: kdump mode to sanitize /proc/vmcore access"
-- Fix wrong range check

Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mike Rapoport 
Cc: "Rafael J. Wysocki" 
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: kexec@lists.infradead.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org

David Hildenbrand (9):
  x86/xen: update xen_oldmem_pfn_is_ram() documentation
  x86/xen: simplify xen_oldmem_pfn_is_ram()
  x86/xen: print a warning when HVMOP_get_mem_type fails
  proc/vmcore: let pfn_is_ram() return a bool
  proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore
callbacks
  virtio-mem: factor out hotplug specifics from virtio_mem_init() into
virtio_mem_init_hotplug()
  virtio-mem: factor out hotplug specifics from virtio_mem_probe() into
virtio_mem_init_hotplug()
  virtio-mem: factor out hotplug specifics from virtio_mem_remove() into
virtio_mem_deinit_hotplug()
  virtio-mem: kdump mode to sanitize /proc/vmcore access

 arch/x86/kernel/aperture_64.c |  13 +-
 arch/x86/xen/mmu_hvm.c|  37 ++---
 drivers/virtio/virtio_mem.c   | 297 --
 fs/proc/vmcore.c  | 105 
 include/linux/crash_dump.h|  26 ++-
 5 files changed, 333 insertions(+), 145 deletions(-)


base-commit: 9e1ff307c779ce1f0f810c7ecce3d95bbae40896
-- 
2.31.1




[PATCH v2 9/9] virtio-mem: kdump mode to sanitize /proc/vmcore access

2021-10-05 Thread David Hildenbrand
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently. [1]

Let's register a vmcore callback, to allow vmcore code to check if a PFN
belonging to a virtio-mem device is either currently plugged and should
be dumped or is currently unplugged and should not be accessed, instead
mapping the shared zeropage or returning zeroes when reading.

This is important when not capturing /proc/vmcore via tools like
"makedumpfile" that can identify logically unplugged virtio-mem memory via
PG_offline in the memmap, but simply by e.g., copying the file.

Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.

With this series, we'll send one virtio-mem state request for every
~2 MiB chunk of virtio-mem memory indicated in the vmcore that we intend
to read/map.
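
The throttling to one request per chunk works by caching the state of the
device block queried last. A simplified sketch of the callback (the
virtio_mem_send_state_request() helper and the last_block_* fields appear
in the diff below; the exact error handling here is an assumption):

	static bool virtio_mem_vmcore_pfn_is_ram(struct vmcore_cb *cb,
						 unsigned long pfn)
	{
		struct virtio_mem *vm = container_of(cb, struct virtio_mem,
						     vmcore_cb);
		uint64_t addr = PFN_PHYS(pfn);
		bool is_ram;

		/* serialize device requests and access to the cached state */
		mutex_lock(&vm->hotplug_mutex);
		addr = ALIGN_DOWN(addr, vm->device_block_size);
		if (addr != vm->last_block_addr) {
			int rc = virtio_mem_send_state_request(vm, addr,
							       vm->device_block_size);

			/* on any kind of error, treat the block as not RAM */
			vm->last_block_plugged = rc == VIRTIO_MEM_STATE_PLUGGED;
			vm->last_block_addr = addr;
		}
		is_ram = vm->last_block_plugged;
		mutex_unlock(&vm->hotplug_mutex);
		return is_ram;
	}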

In the future, we might want to allow building virtio-mem for kdump
mode only, even without CONFIG_MEMORY_HOTPLUG and friends: this way,
we could support special stripped-down kdump kernels that have many
other config options disabled; we'll tackle that once required. Further,
we might want to try sensing bigger blocks (e.g., memory sections)
first before falling back to device blocks on demand.

Tested with Fedora rawhide, which contains a recent kexec-tools version
(considering "System RAM (virtio_mem)" when creating the vmcore header) and
a recent dracut version (including the virtio_mem module in the kdump
initrd).

[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 136 
 1 file changed, 124 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 76d8aef3cfd2..3573b78f6b05 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -223,6 +223,9 @@ struct virtio_mem {
 * When this lock is held the pointers can't change, ONLINE and
 * OFFLINE blocks can't change the state and no subblocks will get
 * plugged/unplugged.
+*
+* In kdump mode, used to serialize requests, last_block_addr and
+* last_block_plugged.
 */
struct mutex hotplug_mutex;
bool hotplug_active;
@@ -230,6 +233,9 @@ struct virtio_mem {
/* An error occurred we cannot handle - stop processing requests. */
bool broken;
 
+   /* Cached value of is_kdump_kernel() when the device was probed. */
+   bool in_kdump;
+
/* The driver is being removed. */
spinlock_t removal_lock;
bool removing;
@@ -243,6 +249,13 @@ struct virtio_mem {
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
 
+#ifdef CONFIG_PROC_VMCORE
+   /* vmcore callback for /proc/vmcore handling in kdump mode */
+   struct vmcore_cb vmcore_cb;
+   uint64_t last_block_addr;
+   bool last_block_plugged;
+#endif /* CONFIG_PROC_VMCORE */
+
/* Next device in the list of virtio-mem devices. */
struct list_head next;
 };
@@ -2293,6 +2306,12 @@ static void virtio_mem_run_wq(struct work_struct *work)
uint64_t diff;
int rc;
 
+   if (unlikely(vm->in_kdump)) {
+   dev_warn_once(&vm->vdev->dev,
+"unexpected workqueue run in kdump kernel\n");
+   return;
+   }
+
hrtimer_cancel(&vm->retry_timer);
 
if (vm->broken)
@@ -2521,6 +2540,86 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
return rc;
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static int virtio_mem_send_state_request(struct virtio_mem *vm, uint64_t addr,
+uint64_t size)
+{
+   const uint64_t nb_vm_blocks = size / vm->device_block_size;
+   const struct virtio_mem_req req = {
+   .type = cpu_to_virtio16(vm->vdev, VIRTIO_MEM_REQ_STATE),
+   .u.state.addr = cpu_to_virtio64(vm->vdev, addr),
+   .u.state.nb_blocks = cpu_to_virtio16(vm->vdev, nb_vm_blocks),
+   };
+   int rc = -ENOMEM;
+
+   dev_dbg(>vdev->dev, "requesting state: 0x%llx - 0x%llx\n", addr,
+   addr + size - 1);
+
+   switch (virtio_mem_send_request(vm, &req)) {
+   case VIRTIO_MEM_RESP_ACK:
+   return virtio16_to_cpu(vm->vdev, vm->resp.u.state.state);
+   case VIRTIO_MEM_RESP_ERROR:
+   

[PATCH v2 7/9] virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug()

2021-10-05 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 87 +++--
 1 file changed, 45 insertions(+), 42 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 2ba7e8d6ba8d..1be3ee7f684d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -260,6 +260,8 @@ static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
 static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
   unsigned long nr_pages);
 static void virtio_mem_retry(struct virtio_mem *vm);
+static int virtio_mem_create_resource(struct virtio_mem *vm);
+static void virtio_mem_delete_resource(struct virtio_mem *vm);
 
 /*
  * Register a virtio-mem device so it will be considered for the online_page
@@ -2395,7 +2397,8 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
 static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
const struct range pluggable_range = mhp_get_pluggable_range(true);
-   uint64_t sb_size, addr;
+   uint64_t unit_pages, sb_size, addr;
+   int rc;
 
/* bad device setup - warn only */
if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2474,7 +2477,48 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
dev_info(>vdev->dev, "big block size: 0x%llx",
 (unsigned long long)vm->bbm.bb_size);
 
+   /* create the parent resource for all memory */
+   rc = virtio_mem_create_resource(vm);
+   if (rc)
+   return rc;
+
+   /* use a single dynamic memory group to cover the whole memory device */
+   if (vm->in_sbm)
+   unit_pages = PHYS_PFN(memory_block_size_bytes());
+   else
+   unit_pages = PHYS_PFN(vm->bbm.bb_size);
+   rc = memory_group_register_dynamic(vm->nid, unit_pages);
+   if (rc < 0)
+   goto out_del_resource;
+   vm->mgid = rc;
+
+   /*
+* If we still have memory plugged, we have to unplug all memory first.
+* Registering our parent resource makes sure that this memory isn't
+* actually in use (e.g., trying to reload the driver).
+*/
+   if (vm->plugged_size) {
+   vm->unplug_all_required = true;
+   dev_info(>vdev->dev, "unplugging all memory is required\n");
+   }
+
+   /* register callbacks */
+   vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
+   rc = register_memory_notifier(&vm->memory_notifier);
+   if (rc)
+   goto out_unreg_group;
+   rc = register_virtio_mem_device(vm);
+   if (rc)
+   goto out_unreg_mem;
+
return 0;
+out_unreg_mem:
+   unregister_memory_notifier(&vm->memory_notifier);
+out_unreg_group:
+   memory_group_unregister(vm->mgid);
+out_del_resource:
+   virtio_mem_delete_resource(vm);
+   return rc;
 }
 
 static int virtio_mem_init(struct virtio_mem *vm)
@@ -2578,7 +2622,6 @@ static bool virtio_mem_has_memory_added(struct virtio_mem *vm)
 static int virtio_mem_probe(struct virtio_device *vdev)
 {
struct virtio_mem *vm;
-   uint64_t unit_pages;
int rc;
 
BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
@@ -2608,40 +2651,6 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
-   /* create the parent resource for all memory */
-   rc = virtio_mem_create_resource(vm);
-   if (rc)
-   goto out_del_vq;
-
-   /* use a single dynamic memory group to cover the whole memory device */
-   if (vm->in_sbm)
-   unit_pages = PHYS_PFN(memory_block_size_bytes());
-   else
-   unit_pages = PHYS_PFN(vm->bbm.bb_size);
-   rc = memory_group_register_dynamic(vm->nid, unit_pages);
-   if (rc < 0)
-   goto out_del_resource;
-   vm->mgid = rc;
-
-   /*
-* If we still have memory plugged, we have to unplug all memory first.
-* Registering our parent resource makes sure that this memory isn't
-* actually in use (e.g., trying to reload the driver).
-*/
-   if (vm->plugged_size) {
-   vm->unplug_all_required = true;
-   dev_info(>vdev->dev, "unplugging all memory is required\n");
-   }
-
-   /* register callbacks */
-   vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
-   rc = register_memory_notifier(&vm->memory_notifier);
-   if (rc)
-   goto out_unreg_group;
-   rc = register_virtio_mem_device(vm);
-   if (rc)
-   goto out_unreg_mem;
-
virtio_device_ready(vdev)

[PATCH v2 8/9] virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug()

2021-10-05 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 1be3ee7f684d..76d8aef3cfd2 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2667,9 +2667,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return rc;
 }
 
-static void virtio_mem_remove(struct virtio_device *vdev)
+static void virtio_mem_deinit_hotplug(struct virtio_mem *vm)
 {
-   struct virtio_mem *vm = vdev->priv;
unsigned long mb_id;
int rc;
 
@@ -2716,7 +2715,8 @@ static void virtio_mem_remove(struct virtio_device *vdev)
 * away. Warn at least.
 */
if (virtio_mem_has_memory_added(vm)) {
-   dev_warn(>dev, "device still has system memory added\n");
+   dev_warn(>vdev->dev,
+"device still has system memory added\n");
} else {
virtio_mem_delete_resource(vm);
kfree_const(vm->resource_name);
@@ -2730,6 +2730,13 @@ static void virtio_mem_remove(struct virtio_device *vdev)
} else {
vfree(vm->bbm.bb_states);
}
+}
+
+static void virtio_mem_remove(struct virtio_device *vdev)
+{
+   struct virtio_mem *vm = vdev->priv;
+
+   virtio_mem_deinit_hotplug(vm);
 
/* reset the device and cleanup the queues */
vdev->config->reset(vdev);
-- 
2.31.1




[PATCH v2 6/9] virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug()

2021-10-05 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 81 -
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index bef8ad6bf466..2ba7e8d6ba8d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2392,41 +2392,10 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
return 0;
 }
 
-static int virtio_mem_init(struct virtio_mem *vm)
+static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
const struct range pluggable_range = mhp_get_pluggable_range(true);
uint64_t sb_size, addr;
-   uint16_t node_id;
-
-   if (!vm->vdev->config->get) {
-   dev_err(>vdev->dev, "config access disabled\n");
-   return -EINVAL;
-   }
-
-   /*
-* We don't want to (un)plug or reuse any memory when in kdump. The
-* memory is still accessible (but not mapped).
-*/
-   if (is_kdump_kernel()) {
-   dev_warn(>vdev->dev, "disabled in kdump kernel\n");
-   return -EBUSY;
-   }
-
-   /* Fetch all properties that can't change. */
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
-   &vm->plugged_size);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
-   &vm->device_block_size);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
-   &node_id);
-   vm->nid = virtio_mem_translate_node_id(vm, node_id);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
-   &vm->region_size);
-
-   /* Determine the nid for the device based on the lowest address. */
-   if (vm->nid == NUMA_NO_NODE)
-   vm->nid = memory_add_physaddr_to_nid(vm->addr);
 
/* bad device setup - warn only */
if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2496,10 +2465,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
  vm->offline_threshold);
}
 
-   dev_info(>vdev->dev, "start address: 0x%llx", vm->addr);
-   dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size);
-   dev_info(>vdev->dev, "device block size: 0x%llx",
-(unsigned long long)vm->device_block_size);
dev_info(>vdev->dev, "memory block size: 0x%lx",
 memory_block_size_bytes());
if (vm->in_sbm)
@@ -2508,10 +2473,52 @@ static int virtio_mem_init(struct virtio_mem *vm)
else
dev_info(>vdev->dev, "big block size: 0x%llx",
 (unsigned long long)vm->bbm.bb_size);
+
+   return 0;
+}
+
+static int virtio_mem_init(struct virtio_mem *vm)
+{
+   uint16_t node_id;
+
+   if (!vm->vdev->config->get) {
+   dev_err(>vdev->dev, "config access disabled\n");
+   return -EINVAL;
+   }
+
+   /*
+* We don't want to (un)plug or reuse any memory when in kdump. The
+* memory is still accessible (but not mapped).
+*/
+   if (is_kdump_kernel()) {
+   dev_warn(>vdev->dev, "disabled in kdump kernel\n");
+   return -EBUSY;
+   }
+
+   /* Fetch all properties that can't change. */
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
+   &vm->plugged_size);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
+   &vm->device_block_size);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
+   &node_id);
+   vm->nid = virtio_mem_translate_node_id(vm, node_id);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
+   &vm->region_size);
+
+   /* Determine the nid for the device based on the lowest address. */
+   if (vm->nid == NUMA_NO_NODE)
+   vm->nid = memory_add_physaddr_to_nid(vm->addr);
+
+   dev_info(>vdev->dev, "start address: 0x%llx", vm->addr);
+   dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size);
+   dev_info(>vdev->dev, "device block size: 0x%llx",
+(unsigned long long)vm->device_block_size);
if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA))
dev_info(>vdev->dev, "nid: %d", vm->nid);
 
-   return 0;
+   return virtio_mem_init_hotplug(vm);
 }
 
 static int virtio_mem_create_resource(struct virtio_mem *vm)
-- 
2.31.1




[PATCH v2 4/9] proc/vmcore: let pfn_is_ram() return a bool

2021-10-05 Thread David Hildenbrand
The callback should deal with errors internally; it doesn't make sense to
expose them via pfn_is_ram(). We'll rework the callbacks next. Right now
we consider errors as if "it's RAM"; no functional change.

Signed-off-by: David Hildenbrand 
---
 fs/proc/vmcore.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 9a15334da208..a9bd80ab670e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -84,11 +84,11 @@ void unregister_oldmem_pfn_is_ram(void)
 }
 EXPORT_SYMBOL_GPL(unregister_oldmem_pfn_is_ram);
 
-static int pfn_is_ram(unsigned long pfn)
+static bool pfn_is_ram(unsigned long pfn)
 {
int (*fn)(unsigned long pfn);
/* pfn is ram unless fn() checks pagetype */
-   int ret = 1;
+   bool ret = true;
 
/*
 * Ask hypervisor if the pfn is really ram.
@@ -97,7 +97,7 @@ static int pfn_is_ram(unsigned long pfn)
 */
fn = oldmem_pfn_is_ram;
if (fn)
-   ret = fn(pfn);
+   ret = !!fn(pfn);
 
return ret;
 }
@@ -124,7 +124,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
nr_bytes = count;
 
/* If pfn is not ram, return zeros for sparse dump files */
-   if (pfn_is_ram(pfn) == 0)
+   if (!pfn_is_ram(pfn))
memset(buf, 0, nr_bytes);
else {
if (encrypted)
-- 
2.31.1




[PATCH v2 5/9] proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks

2021-10-05 Thread David Hildenbrand
Let's support multiple registered callbacks, making sure that
registering vmcore callbacks cannot fail. Make the callback return a
bool instead of an int, handling how to deal with errors internally.
Drop unused HAVE_OLDMEM_PFN_IS_RAM.

We soon want to make use of this infrastructure from other drivers:
virtio-mem, registering one callback for each virtio-mem device, to
prevent reading unplugged virtio-mem memory.

Handle it via a generic vmcore_cb structure, prepared for future
extensions: for example, once we support virtio-mem on s390x where the
vmcore is completely constructed in the second kernel, we want to detect
and add plugged virtio-mem memory ranges to the vmcore in order for them
to get dumped properly.

Handle corner cases that are unexpected and shouldn't happen in sane
setups: registering a callback after the vmcore has already been opened
(warn only) and unregistering a callback after the vmcore has already been
opened (warn and essentially read only zeroes from that point on).
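
For driver authors, usage boils down to the following sketch (the "foo"
driver and its helper are hypothetical; struct vmcore_cb,
register_vmcore_cb() and unregister_vmcore_cb() are the interfaces added
below):

	/* hypothetical driver-side usage of the vmcore callback interface */
	static bool foo_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
	{
		/* returning false makes /proc/vmcore read zeroes for this PFN */
		return !foo_pfn_is_inaccessible(pfn);	/* hypothetical helper */
	}

	static struct vmcore_cb foo_vmcore_cb = {
		.pfn_is_ram = foo_pfn_is_ram,
	};

	static int foo_probe(void)
	{
		register_vmcore_cb(&foo_vmcore_cb);	/* cannot fail by design */
		return 0;
	}

	static void foo_remove(void)
	{
		/* unregistering after /proc/vmcore was opened warns once */
		unregister_vmcore_cb(&foo_vmcore_cb);
	}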

Signed-off-by: David Hildenbrand 
---
 arch/x86/kernel/aperture_64.c | 13 -
 arch/x86/xen/mmu_hvm.c| 11 ++--
 fs/proc/vmcore.c  | 99 ---
 include/linux/crash_dump.h| 26 +++--
 4 files changed, 111 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 10562885f5fc..af3ba08b684b 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -73,12 +73,23 @@ static int gart_mem_pfn_is_ram(unsigned long pfn)
  (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static bool gart_oldmem_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
+{
+   return !!gart_mem_pfn_is_ram(pfn);
+}
+
+static struct vmcore_cb gart_vmcore_cb = {
+   .pfn_is_ram = gart_oldmem_pfn_is_ram,
+};
+#endif
+
 static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
aperture_pfn_start = aper_base >> PAGE_SHIFT;
aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
 #ifdef CONFIG_PROC_VMCORE
-   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+   register_vmcore_cb(&gart_vmcore_cb);
 #endif
 #ifdef CONFIG_PROC_KCORE
WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index 6ba8826dcdcc..509bdee3ab90 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -12,10 +12,10 @@
  * The kdump kernel has to check whether a pfn of the crashed kernel
  * was a ballooned page. vmcore is using this function to decide
  * whether to access a pfn of the crashed kernel.
- * Returns 0 if the pfn is not backed by a RAM page, the caller may
+ * Returns "false" if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-static int xen_oldmem_pfn_is_ram(unsigned long pfn)
+static bool xen_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
 {
struct xen_hvm_get_mem_type a = {
.domid = DOMID_SELF,
@@ -24,10 +24,13 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
 
if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) {
pr_warn_once("Unexpected HVMOP_get_mem_type failure\n");
-   return -ENXIO;
+   return true;
}
return a.mem_type != HVMMEM_mmio_dm;
 }
+static struct vmcore_cb xen_vmcore_cb = {
+   .pfn_is_ram = xen_vmcore_pfn_is_ram,
+};
 #endif
 
 static void xen_hvm_exit_mmap(struct mm_struct *mm)
@@ -61,6 +64,6 @@ void __init xen_hvm_init_mmu_ops(void)
if (is_pagetable_dying_supported())
pv_ops.mmu.exit_mmap = xen_hvm_exit_mmap;
 #ifdef CONFIG_PROC_VMCORE
-   WARN_ON(register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram));
+   register_vmcore_cb(&xen_vmcore_cb);
 #endif
 }
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a9bd80ab670e..7a04b2eca287 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -62,46 +62,75 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 /* Device Dump Size */
 static size_t vmcoredd_orig_sz;
 
-/*
- * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
- * The called function has to take care of module refcounting.
- */
-static int (*oldmem_pfn_is_ram)(unsigned long pfn);
-
-int register_oldmem_pfn_is_ram(int (*fn)(unsigned long pfn))
+static DECLARE_RWSEM(vmcore_cb_rwsem);
+/* List of registered vmcore callbacks. */
+static LIST_HEAD(vmcore_cb_list);
+/* Whether we had a surprise unregistration of a callback. */
+static bool vmcore_cb_unstable;
+/* Whether the vmcore has been opened once. */
+static bool vmcore_opened;
+
+void register_vmcore_cb(struct vmcore_cb *cb)
 {
-   if (oldmem_pfn_is_ram)
-   return -EBUSY;
-   oldmem_pfn_is_ram = fn;
-   return 0;
+   down_write(&vmcore_cb_rwsem);
+   INIT_LIST_HEAD(&cb->next);
+   list_add_tail(&cb->next, &vmcore_cb_list);

[PATCH v2 5/5] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-10-04 Thread David Hildenbrand
Let's communicate driver-managed regions to memblock, to properly
teach kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on
these memory regions.

Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5f873e7f5b29..6d90818d4ce8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1357,6 +1357,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
struct vmem_altmap mhp_altmap = {};
struct memory_group *group = NULL;
u64 start, size;
@@ -1385,7 +1386,9 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
mem_hotplug_begin();
 
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-   ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
+   if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED)
+   memblock_flags = MEMBLOCK_DRIVER_MANAGED;
+   ret = memblock_add_node(start, size, nid, memblock_flags);
if (ret)
goto error_mem_hotplug_end;
}
-- 
2.31.1




[PATCH v2 4/5] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-10-04 Thread David Hildenbrand
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
indicating that we're dealing with a memory region that is never
indicated in the firmware-provided memory map, but always detected and
added by a driver.

Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
regions like ordinary MEMBLOCK_NONE memory regions -- for example, when
selecting memory regions to add to the vmcore for dumping in the
crashkernel via for_each_mem_range().

However, especially kexec_file is not supposed to select such memblocks via
for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
without CONFIG_ARCH_KEEP_MEMBLOCK.

We'll make sure that memory hotplug code sets the flag where applicable
(IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
support.

Note that kexec *must not* indicate this memory to the second kernel
and *must not* place kexec-images on this memory. Let's add a comment to
kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
locate_mem_hole_callback() for kexec_walk_resources().

Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
semantics:
MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
firmware-provided memory map and added to the system early during
boot; kexec *has to* indicate this memory to the second kernel and
can place kexec-images on this memory. After memory hotunplug,
kexec has to be re-armed. We mostly ignore this flag when
"movable_node" is not set on the kernel command line, because
then we're told to not care about hotunpluggability of such
memory regions.

MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
the firmware-provided memory map; this memory is always detected
and added to the system by a driver; memory might not actually be
physically hotunpluggable. kexec *must not* indicate this memory to
the second kernel and *must not* place kexec-images on this memory.
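
The practical effect on the two iterators can be sketched as follows
(illustrative, not a hunk from this patch): for_each_mem_range() keeps
returning such regions because it now explicitly includes
MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, while a free-memory walk that
passes MEMBLOCK_NONE -- as kexec_walk_memblock() does via
for_each_free_mem_range() -- will skip them:

	/* illustrative: these candidates can never be driver-managed */
	static void dump_kexec_image_candidates(void)
	{
		phys_addr_t start, end;
		u64 i;

		for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
					&start, &end, NULL)
			pr_info("candidate: %pa-%pa\n", &start, &end);
	}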

Signed-off-by: David Hildenbrand 
---
 include/linux/memblock.h | 16 ++--
 kernel/kexec_file.c  |  5 +
 mm/memblock.c|  4 
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 2bc726e43a1b..b3b29ccf91f3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -37,12 +37,17 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
  * for further details
+ * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected and added
+ * via a driver, and never indicated in the firmware-provided memory map as
+ * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
+ * kernel resource tree.
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
MEMBLOCK_HOTPLUG= 0x1,  /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2,  /* mirrored region */
MEMBLOCK_NOMAP  = 0x4,  /* don't add to kernel direct mapping */
+   MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */
 };
 
 /**
@@ -213,7 +218,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
  */
 #define for_each_mem_range(i, p_start, p_end) \
__for_each_mem_range(i, , NULL, NUMA_NO_NODE,   \
-MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
+p_start, p_end, NULL)
 
 /**
  * for_each_mem_range_rev - reverse iterate through memblock areas from
@@ -224,7 +230,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
  */
 #define for_each_mem_range_rev(i, p_start, p_end)  \
__for_each_mem_range_rev(i, , NULL, NUMA_NO_NODE, \
-MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
+p_start, p_end, NULL)
 
 /**
  * for_each_reserved_mem_range - iterate over all reserved memblock areas
@@ -254,6 +261,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_driver_managed(struct memblock_region *m)
+{
+   return m->flags & MEMBLOCK_DRIVER_MANAGED;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int n

[PATCH v2 3/5] memblock: allow to specify flags with memblock_add_node()

2021-10-04 Thread David Hildenbrand
We want to specify flags when hotplugging memory. Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.

Note that when hotplugging memory the system is already up and running
and we might have concurrent memblock users: for example, while we're
hotplugging memory, kexec_file code might search for suitable memory
regions to place kexec images. It's important to add the memory directly
to memblock via a single call with the right flags, instead of adding the
memory first and apply flags later: otherwise, concurrent memblock users
might temporarily stumble over memblocks with wrong flags, which will be
important in a follow-up patch that introduces a new flag to properly
handle add_memory_driver_managed().

Acked-by: Geert Uytterhoeven 
Acked-by: Heiko Carstens 
Signed-off-by: David Hildenbrand 
---
 arch/arc/mm/init.c   | 4 ++--
 arch/ia64/mm/contig.c| 2 +-
 arch/ia64/mm/init.c  | 2 +-
 arch/m68k/mm/mcfmmu.c| 3 ++-
 arch/m68k/mm/motorola.c  | 6 --
 arch/mips/loongson64/init.c  | 4 +++-
 arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
 arch/s390/kernel/setup.c | 3 ++-
 include/linux/memblock.h | 3 ++-
 include/linux/mm.h   | 2 +-
 mm/memblock.c| 9 +
 mm/memory_hotplug.c  | 2 +-
 12 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 699ecf119641..110eb69e9bee 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
 
low_mem_sz = size;
in_use = 1;
-   memblock_add_node(base, size, 0);
+   memblock_add_node(base, size, 0, MEMBLOCK_NONE);
} else {
 #ifdef CONFIG_HIGHMEM
high_mem_start = base;
high_mem_sz = size;
in_use = 1;
-   memblock_add_node(base, size, 1);
+   memblock_add_node(base, size, 1, MEMBLOCK_NONE);
memblock_reserve(base, size);
 #endif
}
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 42e025cfbd08..24901d809301 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -153,7 +153,7 @@ find_memory (void)
efi_memmap_walk(find_max_min_low_pfn, NULL);
max_pfn = max_low_pfn;
 
-   memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
+   memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
 
find_initrd();
 
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 5c6da8d83c1a..5d165607bf35 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int nid)
 #endif
 
if (start < end)
-   memblock_add_node(__pa(start), end - start, nid);
+   memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
return 0;
 }
 
diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index eac9dde65193..6f1f25125294 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
m68k_memory[0].addr = _rambase;
m68k_memory[0].size = _ramend - _rambase;
 
-   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+ MEMBLOCK_NONE);
 
/* compute total pages in system */
num_pages = PFN_DOWN(_ramend - _rambase);
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 9f3f77785aa7..2b05bb2bac00 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -410,7 +410,8 @@ void __init paging_init(void)
 
min_addr = m68k_memory[0].addr;
max_addr = min_addr + m68k_memory[0].size;
-   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+ MEMBLOCK_NONE);
for (i = 1; i < m68k_num_memory;) {
if (m68k_memory[i].addr < min_addr) {
printk("Ignoring memory chunk at 0x%lx:0x%lx before the 
first chunk\n",
@@ -421,7 +422,8 @@ void __init paging_init(void)
(m68k_num_memory - i) * sizeof(struct m68k_mem_info));
continue;
}
-   memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
+   memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
+ MEMBLOCK_NONE);
addr = m68k_memory[i].addr + m68k_memory[i].size;
if (addr > max_addr)
max_addr = addr;
diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
index 76e0a9636a0e..4ac5ba80bbf6 100644
--- a/arch/mips/loongson64/init.c
+++ b/arch/

[PATCH v2 0/5] mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK

2021-10-04 Thread David Hildenbrand
Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64,
don't cleanly support add_memory_driver_managed() yet. Most prominently,
kexec_file can still end up placing kexec images on such driver-managed
memory, resulting in undesired behavior, for example, having kexec images
located on memory not part of the firmware-provided memory map.

Teaching kexec to not place images on driver-managed memory is especially
relevant for virtio-mem. Details can be found in commit 7b7b27214bba
("mm/memory_hotplug: introduce add_memory_driver_managed()").

Extend memblock with a new flag and set it from memory hotplug code
when applicable. This is required to fully support virtio-mem on
arm64, making also kexec_file behave like on x86-64.

v1 -> v2:
- "memblock: improve MEMBLOCK_HOTPLUG documentation"
-- Added
- "memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
   IORESOURCE_SYSRAM_DRIVER_MANAGED"
-- Improve documentation of MEMBLOCK_DRIVER_MANAGED
- Refine patch descriptions

Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Jianyong Wu 
Cc: Aneesh Kumar K.V 
Cc: Vineet Gupta 
Cc: Geert Uytterhoeven 
Cc: Huacai Chen 
Cc: Jiaxun Yang 
Cc: Thomas Bogendoerfer 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Eric Biederman 
Cc: Arnd Bergmann 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@kvack.org
Cc: kexec@lists.infradead.org

David Hildenbrand (5):
  mm/memory_hotplug: handle memblock_add_node() failures in
add_memory_resource()
  memblock: improve MEMBLOCK_HOTPLUG documentation
  memblock: allow to specify flags with memblock_add_node()
  memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
IORESOURCE_SYSRAM_DRIVER_MANAGED
  mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with
IORESOURCE_SYSRAM_DRIVER_MANAGED

 arch/arc/mm/init.c   |  4 ++--
 arch/ia64/mm/contig.c|  2 +-
 arch/ia64/mm/init.c  |  2 +-
 arch/m68k/mm/mcfmmu.c|  3 ++-
 arch/m68k/mm/motorola.c  |  6 --
 arch/mips/loongson64/init.c  |  4 +++-
 arch/mips/sgi-ip27/ip27-memory.c |  3 ++-
 arch/s390/kernel/setup.c |  3 ++-
 include/linux/memblock.h | 25 +
 include/linux/mm.h   |  2 +-
 kernel/kexec_file.c  |  5 +
 mm/memblock.c| 13 +
 mm/memory_hotplug.c  | 11 +--
 13 files changed, 62 insertions(+), 21 deletions(-)


base-commit: 9e1ff307c779ce1f0f810c7ecce3d95bbae40896
-- 
2.31.1




[PATCH v2 2/5] memblock: improve MEMBLOCK_HOTPLUG documentation

2021-10-04 Thread David Hildenbrand
The description of MEMBLOCK_HOTPLUG is currently short and consequently
misleading: we're actually dealing with a memory region that might get
hotunplugged later (i.e., the platform+firmware supports it), yet it is
indicated in the firmware-provided memory map as system ram that will just
get used by the system for any purpose when not taking special care. The
firmware marked this memory region as hot(un)pluggable (e.g., hotplugged
before reboot), implying that it might get hotunplugged again later.

Whether we consider this information depends on the "movable_node" kernel
commandline parameter: only with "movable_node" set, we'll try keeping
this memory hotunpluggable, for example, by not serving early allocations
from this memory region and by letting the buddy manage it using the
ZONE_MOVABLE.

Let's make this clearer by extending the documentation.

Note: kexec *has to* indicate this memory to the second kernel. With
"movable_node" set, we don't want to place kexec-images on this memory.
Without "movable_node" set, we don't care and can place kexec-images on
this memory. In both cases, after successful memory hotunplug, kexec has to
be re-armed to update the memory map for the second kernel and to place the
kexec-images somewhere else.

Signed-off-by: David Hildenbrand 
---
 include/linux/memblock.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 34de69b3b8ba..4ee8dd2d63a7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -28,7 +28,11 @@ extern unsigned long long max_possible_pfn;
 /**
  * enum memblock_flags - definition of memory region attributes
  * @MEMBLOCK_NONE: no special request
- * @MEMBLOCK_HOTPLUG: hotpluggable region
+ * @MEMBLOCK_HOTPLUG: memory region indicated in the firmware-provided memory
+ * map during early boot as hot(un)pluggable system RAM (e.g., memory range
+ * that might get hotunplugged later). With "movable_node" set on the kernel
+ * commandline, try keeping this memory region hotunpluggable. Does not apply
+ * to memblocks added ("hotplugged") after early boot.
  * @MEMBLOCK_MIRROR: mirrored region
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
-- 
2.31.1




[PATCH v2 1/5] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()

2021-10-04 Thread David Hildenbrand
If memblock_add_node() fails, we're most probably running out of memory.
While this is unlikely to happen, it can happen and having memory added
without a memblock can be problematic for architectures that use
memblock to detect valid memory. Let's fail in a nice way instead of
silently ignoring the error.

Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fd0be32a281..917b3528636d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,8 +1384,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
mem_hotplug_begin();
 
-   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
-   memblock_add_node(start, size, nid);
+   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+   ret = memblock_add_node(start, size, nid);
+   if (ret)
+   goto error_mem_hotplug_end;
+   }
 
ret = __try_online_node(nid, false);
if (ret < 0)
@@ -1458,6 +1461,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
rollback_node_hotadd(nid);
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
memblock_remove(start, size);
+error_mem_hotplug_end:
mem_hotplug_done();
return ret;
 }
-- 
2.31.1




Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-10-01 Thread David Hildenbrand

On 30.09.21 23:21, Mike Rapoport wrote:

On Wed, Sep 29, 2021 at 06:54:01PM +0200, David Hildenbrand wrote:

On 29.09.21 18:39, Mike Rapoport wrote:

Hi,

On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote:

Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory
regions to add to the vmcore for dumping in the crashkernel via
for_each_mem_range().

Can you please elaborate on the difference in semantics of MEMBLOCK_HOTPLUG
and MEMBLOCK_DRIVER_MANAGED?
Unless I'm missing something they both mark memory that can be unplugged
anytime and so it should not be used in certain cases. Why is there a need
for a new flag?


In the cover letter I have "Alternative B: Reuse MEMBLOCK_HOTPLUG.
MEMBLOCK_HOTPLUG serves a different purpose, though.", but looking into the
details it won't work as is.

MEMBLOCK_HOTPLUG is used to mark memory early during boot that can later get
hotunplugged again and should be placed into ZONE_MOVABLE if the
"movable_node" kernel parameter is set.

The confusing part is that we talk about "hotpluggable" but really mean
"hotunpluggable": the reason is that HW flags DIMM slots that can later be
hotplugged as "hotpluggable" even though there is already something
hotplugged.


MEMBLOCK_HOTPLUG name is indeed somewhat confusing, but still its core
meaning is "this memory may be removed", which does not differ from what
IORESOURCE_SYSRAM_DRIVER_MANAGED means.

MEMBLOCK_HOTPLUG regions are indeed placed into ZONE_MOVABLE, but more
importantly, they are avoided when we allocate memory from memblock.

So, in my view, both flags mean that the memory may be removed and it
should not be used for certain types of allocations.


The semantics are different:

MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the 
firmware-provided memory map and added to the system early during boot; 
we want this memory to be managed by ZONE_MOVABLE with "movable_node" 
set on the kernel command line, because only then we want it to be 
hotpluggable again. kexec *has to* indicate this memory to the second 
kernel and can place kexec-images on this memory. After memory 
hotunplug, kexec has to be re-armed.


MEMBLOCK_DRIVER_MANAGED: memory is not indicated as System RAM" in the 
firmware-provided memory map; this memory is always detected and added 
to the system by a driver; memory might not actually be physically 
hotunpluggable and the ZONE selection does not depend on "movable_core". 
kexec *must not* indicate this memory to the second kernel and *must 
not* place kexec-images on this memory.



I would really advise against mixing concepts here.


What we could do is indicate *all* hotplugged memory (not just 
IORESOURCE_SYSRAM_DRIVER_MANAGED memory) as MEMBLOCK_HOTPLUG and make 
MEMBLOCK_HOTPLUG less dependent on "movable_node".


MEMBLOCK_HOTPLUG for early boot memory: with "movable_core", place it in 
ZONE_MOVABLE. Even without "movable_core", don't place early kernel 
allocations on this memory.
MEMBLOCK_HOTPLUG for all memory: don't place kexec images or on this 
memory, independent of "movable_core".



memblock would then not contain the information "contained in 
firmware-provided memory map" vs. "not contained in firmware-provided 
memory map"; but I think right now it's not strictly required to have 
that information if we'd go down that path.


  

For example, ranges in the ACPI SRAT that are marked as
ACPI_SRAT_MEM_HOT_PLUGGABLE will be marked MEMBLOCK_HOTPLUG early during
boot (drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()). Later, we
use that information to size ZONE_MOVABLE
(mm/page_alloc.c:find_zone_movable_pfns_for_nodes()). This will make sure
that these "hotpluggable" DIMMs can later get hotunplugged.

Also, see should_skip_region() how this relates to the "movable_node" kernel
parameter:

/* skip hotpluggable memory regions if needed */
if (movable_node_is_enabled() && memblock_is_hotpluggable(m) &&
    !(flags & MEMBLOCK_HOTPLUG))
return true;


Hmm, I think that the movable_node_is_enabled() check here is excessive,
but I suspect we cannot simply remove it without breaking anything.


The reasoning is: without "movable_core" we don't want this memory to be 
hotunpluggable; consequently, we don't care if we place kexec-images on 
this memory. MEMBLOCK_HOTPLUG is currently only active with "movable_core".


If we remove that check, we will never place early kernel 
allocations on that memory, even if we don't care about ZONE_MOVABLE.




I'll take a deeper look at the potential consequences.

BTW, is there anything that prevents putting kexec to hot-un

Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-09-29 Thread David Hildenbrand

On 29.09.21 18:39, Mike Rapoport wrote:

Hi,

On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote:

Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory
regions to add to the vmcore for dumping in the crashkernel via
for_each_mem_range().
  
Can you please elaborate on the difference in semantics of MEMBLOCK_HOTPLUG

and MEMBLOCK_DRIVER_MANAGED?
Unless I'm missing something they both mark memory that can be unplugged
anytime and so it should not be used in certain cases. Why is there a need
for a new flag?


In the cover letter I have "Alternative B: Reuse MEMBLOCK_HOTPLUG. 
MEMBLOCK_HOTPLUG serves a different purpose, though.", but looking into 
the details it won't work as is.


MEMBLOCK_HOTPLUG is used to mark memory early during boot that can later 
get hotunplugged again and should be placed into ZONE_MOVABLE if the 
"movable_node" kernel parameter is set.


The confusing part is that we talk about "hotpluggable" but really mean 
"hotunpluggable": the reason is that HW flags DIMM slots that can later 
be hotplugged as "hotpluggable" even though there is already something 
hotplugged.


For example, ranges in the ACPI SRAT that are marked as 
ACPI_SRAT_MEM_HOT_PLUGGABLE will be marked MEMBLOCK_HOTPLUG early during 
boot (drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()). Later, 
we use that information to size ZONE_MOVABLE 
(mm/page_alloc.c:find_zone_movable_pfns_for_nodes()). This will make 
sure that these "hotpluggable" DIMMs can later get hotunplugged.


Also, see should_skip_region() how this relates to the "movable_node" 
kernel parameter:


/* skip hotpluggable memory regions if needed */
if (movable_node_is_enabled() && memblock_is_hotpluggable(m) &&
    !(flags & MEMBLOCK_HOTPLUG))
return true;

Long story short: MEMBLOCK_HOTPLUG has different semantics and is a 
special case for "movable_node".


--
Thanks,

David / dhildenb




Re: [PATCH v1 2/4] memblock: allow to specify flags with memblock_add_node()

2021-09-29 Thread David Hildenbrand

On 29.09.21 18:25, Mike Rapoport wrote:

On Mon, Sep 27, 2021 at 05:05:16PM +0200, David Hildenbrand wrote:

We want to specify flags when hotplugging memory. Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.

Note that when hotplugging memory the system is already up and running
and we don't want to add the memory first and apply flags later: it
should happen within one memblock call.


Why is it important that the system is up and why it should happen in a
single call?
I don't mind adding flags parameter to memblock_add_node() but this
changelog does not really explain the reasons to do it.


"After memblock_add_node(), we could race with anybody performing a 
search for MEMBLOCK_NONE, like kexec_file -- and that only happens once 
the system is already up and running. So we want both steps to happen 
atomically."


I can add that to the patch description.

(I think it still won't be completely atomic because memblock isn't 
properly implementing locking yet, but that's a different story)


--
Thanks,

David / dhildenb




Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-09-29 Thread David Hildenbrand

On 29.09.21 16:22, Boris Ostrovsky wrote:


On 9/29/21 5:03 AM, David Hildenbrand wrote:

On 29.09.21 10:45, David Hildenbrand wrote:



Can we go one step further and do


@@ -20,24 +20,11 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
   struct xen_hvm_get_mem_type a = {
   .domid = DOMID_SELF,
   .pfn = pfn,
+   .mem_type = HVMMEM_ram_rw,
   };
-   int ram;
 
-   if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
-   return -ENXIO;
-
-   switch (a.mem_type) {
-   case HVMMEM_mmio_dm:
-   ram = 0;
-   break;
-   case HVMMEM_ram_rw:
-   case HVMMEM_ram_ro:
-   default:
-   ram = 1;
-   break;
-   }
-
-   return ram;
+   HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a);
+   return a.mem_type != HVMMEM_mmio_dm;



I was actually thinking of asking you to add another patch with pr_warn_once()
here (and print the error code as well). This call failing is an indication of
something going quite wrong and it would be good to know about it.


Will include a patch in v2, thanks!
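
Probably something along these lines (a sketch only; the exact message is
made up):

	ret = HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a);
	if (ret) {
		/* hypothetical message text */
		pr_warn_once("HVMOP_get_mem_type failed: %d\n", ret);
		return -ENXIO;
	}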


--
Thanks,

David / dhildenb




Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-09-29 Thread David Hildenbrand

On 29.09.21 10:45, David Hildenbrand wrote:


How about

      return a.mem_type != HVMMEM_mmio_dm;



Ha, how could I have missed that :)



Result should be promoted to int and this has the added benefit of not requiring
changes in patch 4.



Can we go one step further and do


@@ -20,24 +20,11 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
  struct xen_hvm_get_mem_type a = {
  .domid = DOMID_SELF,
  .pfn = pfn,
+   .mem_type = HVMMEM_ram_rw,
  };
-   int ram;
   
-   if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
-   return -ENXIO;
-
-   switch (a.mem_type) {
-   case HVMMEM_mmio_dm:
-   ram = 0;
-   break;
-   case HVMMEM_ram_rw:
-   case HVMMEM_ram_ro:
-   default:
-   ram = 1;
-   break;
-   }
-
-   return ram;
+   HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a);
+   return a.mem_type != HVMMEM_mmio_dm;
   }
   #endif


Assuming that if HYPERVISOR_hvm_op() fails that
.mem_type is not set to HVMMEM_mmio_dm.



Okay we can't, due to "__must_check" ...
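
So the best we can do while consuming the return value is keep the error
path and only simplify the classification, e.g. (sketch):

	if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
		return -ENXIO;

	return a.mem_type != HVMMEM_mmio_dm;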

--
Thanks,

David / dhildenb




Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-09-29 Thread David Hildenbrand


How about

     return a.mem_type != HVMMEM_mmio_dm;



Ha, how could I have missed that :)



Result should be promoted to int and this has the added benefit of not requiring
changes in patch 4.



Can we go one step further and do


@@ -20,24 +20,11 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
struct xen_hvm_get_mem_type a = {
.domid = DOMID_SELF,
.pfn = pfn,
+   .mem_type = HVMMEM_ram_rw,
};
-   int ram;
 
-   if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
-   return -ENXIO;
-
-   switch (a.mem_type) {
-   case HVMMEM_mmio_dm:
-   ram = 0;
-   break;
-   case HVMMEM_ram_rw:
-   case HVMMEM_ram_ro:
-   default:
-   ram = 1;
-   break;
-   }
-
-   return ram;
+   HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a);
+   return a.mem_type != HVMMEM_mmio_dm;
 }
 #endif


Assuming that if HYPERVISOR_hvm_op() fails that
.mem_type is not set to HVMMEM_mmio_dm.

--
Thanks,

David / dhildenb




Re: [PATCH v1 8/8] virtio-mem: kdump mode to sanitize /proc/vmcore access

2021-09-29 Thread David Hildenbrand

[...]


+
+static bool virtio_mem_vmcore_pfn_is_ram(struct vmcore_cb *cb,
+unsigned long pfn)
+{
+   struct virtio_mem *vm = container_of(cb, struct virtio_mem,
+vmcore_cb);
+   uint64_t addr = PFN_PHYS(pfn);
+   bool is_ram;
+   int rc;
+
+   if (!virtio_mem_contains_range(vm, addr, addr + PAGE_SIZE))


Some more testing revealed that this has to be

if (!virtio_mem_contains_range(vm, addr, PAGE_SIZE))
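
The helper takes a size, not an end address. Roughly (a sketch of the
helper as it exists in the virtio-mem driver):

	/* Test if a range [start, start + size) is covered by the device. */
	static bool virtio_mem_contains_range(struct virtio_mem *vm,
					      uint64_t start, uint64_t size)
	{
		return start >= vm->addr &&
		       start + size <= vm->addr + vm->region_size;
	}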


--
Thanks,

David / dhildenb




[PATCH v1 5/8] virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug()

2021-09-28 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 81 -
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index bef8ad6bf466..2ba7e8d6ba8d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2392,41 +2392,10 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
return 0;
 }
 
-static int virtio_mem_init(struct virtio_mem *vm)
+static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
const struct range pluggable_range = mhp_get_pluggable_range(true);
uint64_t sb_size, addr;
-   uint16_t node_id;
-
-   if (!vm->vdev->config->get) {
-   dev_err(&vm->vdev->dev, "config access disabled\n");
-   return -EINVAL;
-   }
-
-   /*
-* We don't want to (un)plug or reuse any memory when in kdump. The
-* memory is still accessible (but not mapped).
-*/
-   if (is_kdump_kernel()) {
-   dev_warn(&vm->vdev->dev, "disabled in kdump kernel\n");
-   return -EBUSY;
-   }
-
-   /* Fetch all properties that can't change. */
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
-   &vm->plugged_size);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
-   &vm->device_block_size);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
-   &node_id);
-   vm->nid = virtio_mem_translate_node_id(vm, node_id);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
-   virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
-   &vm->region_size);
-
-   /* Determine the nid for the device based on the lowest address. */
-   if (vm->nid == NUMA_NO_NODE)
-   vm->nid = memory_add_physaddr_to_nid(vm->addr);
 
/* bad device setup - warn only */
if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2496,10 +2465,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
  vm->offline_threshold);
}
 
-   dev_info(>vdev->dev, "start address: 0x%llx", vm->addr);
-   dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size);
-   dev_info(>vdev->dev, "device block size: 0x%llx",
-(unsigned long long)vm->device_block_size);
dev_info(>vdev->dev, "memory block size: 0x%lx",
 memory_block_size_bytes());
if (vm->in_sbm)
@@ -2508,10 +2473,52 @@ static int virtio_mem_init(struct virtio_mem *vm)
else
dev_info(>vdev->dev, "big block size: 0x%llx",
 (unsigned long long)vm->bbm.bb_size);
+
+   return 0;
+}
+
+static int virtio_mem_init(struct virtio_mem *vm)
+{
+   uint16_t node_id;
+
+   if (!vm->vdev->config->get) {
+   dev_err(&vm->vdev->dev, "config access disabled\n");
+   return -EINVAL;
+   }
+
+   /*
+* We don't want to (un)plug or reuse any memory when in kdump. The
+* memory is still accessible (but not mapped).
+*/
+   if (is_kdump_kernel()) {
+   dev_warn(&vm->vdev->dev, "disabled in kdump kernel\n");
+   return -EBUSY;
+   }
+
+   /* Fetch all properties that can't change. */
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
+   &vm->plugged_size);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
+   &vm->device_block_size);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
+   &node_id);
+   vm->nid = virtio_mem_translate_node_id(vm, node_id);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
+   virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
+   &vm->region_size);
+
+   /* Determine the nid for the device based on the lowest address. */
+   if (vm->nid == NUMA_NO_NODE)
+   vm->nid = memory_add_physaddr_to_nid(vm->addr);
+
+   dev_info(>vdev->dev, "start address: 0x%llx", vm->addr);
+   dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size);
+   dev_info(>vdev->dev, "device block size: 0x%llx",
+(unsigned long long)vm->device_block_size);
if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA))
dev_info(>vdev->dev, "nid: %d", vm->nid);
 
-   return 0;
+   return virtio_mem_init_hotplug(vm);
 }
 
 static int virtio_mem_create_resource(struct virtio_mem *vm)
-- 
2.31.1




[PATCH v1 6/8] virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug()

2021-09-28 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 87 +++--
 1 file changed, 45 insertions(+), 42 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 2ba7e8d6ba8d..1be3ee7f684d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -260,6 +260,8 @@ static void virtio_mem_fake_offline_going_offline(unsigned 
long pfn,
 static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
   unsigned long nr_pages);
 static void virtio_mem_retry(struct virtio_mem *vm);
+static int virtio_mem_create_resource(struct virtio_mem *vm);
+static void virtio_mem_delete_resource(struct virtio_mem *vm);
 
 /*
  * Register a virtio-mem device so it will be considered for the online_page
@@ -2395,7 +2397,8 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
 static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
const struct range pluggable_range = mhp_get_pluggable_range(true);
-   uint64_t sb_size, addr;
+   uint64_t unit_pages, sb_size, addr;
+   int rc;
 
/* bad device setup - warn only */
if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2474,7 +2477,48 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
dev_info(>vdev->dev, "big block size: 0x%llx",
 (unsigned long long)vm->bbm.bb_size);
 
+   /* create the parent resource for all memory */
+   rc = virtio_mem_create_resource(vm);
+   if (rc)
+   return rc;
+
+   /* use a single dynamic memory group to cover the whole memory device */
+   if (vm->in_sbm)
+   unit_pages = PHYS_PFN(memory_block_size_bytes());
+   else
+   unit_pages = PHYS_PFN(vm->bbm.bb_size);
+   rc = memory_group_register_dynamic(vm->nid, unit_pages);
+   if (rc < 0)
+   goto out_del_resource;
+   vm->mgid = rc;
+
+   /*
+* If we still have memory plugged, we have to unplug all memory first.
+* Registering our parent resource makes sure that this memory isn't
+* actually in use (e.g., trying to reload the driver).
+*/
+   if (vm->plugged_size) {
+   vm->unplug_all_required = true;
+   dev_info(>vdev->dev, "unplugging all memory is required\n");
+   }
+
+   /* register callbacks */
+   vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
+   rc = register_memory_notifier(&vm->memory_notifier);
+   if (rc)
+   goto out_unreg_group;
+   rc = register_virtio_mem_device(vm);
+   if (rc)
+   goto out_unreg_mem;
+
return 0;
+out_unreg_mem:
+   unregister_memory_notifier(&vm->memory_notifier);
+out_unreg_group:
+   memory_group_unregister(vm->mgid);
+out_del_resource:
+   virtio_mem_delete_resource(vm);
+   return rc;
 }
 
 static int virtio_mem_init(struct virtio_mem *vm)
@@ -2578,7 +2622,6 @@ static bool virtio_mem_has_memory_added(struct virtio_mem 
*vm)
 static int virtio_mem_probe(struct virtio_device *vdev)
 {
struct virtio_mem *vm;
-   uint64_t unit_pages;
int rc;
 
BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
@@ -2608,40 +2651,6 @@ static int virtio_mem_probe(struct virtio_device *vdev)
if (rc)
goto out_del_vq;
 
-   /* create the parent resource for all memory */
-   rc = virtio_mem_create_resource(vm);
-   if (rc)
-   goto out_del_vq;
-
-   /* use a single dynamic memory group to cover the whole memory device */
-   if (vm->in_sbm)
-   unit_pages = PHYS_PFN(memory_block_size_bytes());
-   else
-   unit_pages = PHYS_PFN(vm->bbm.bb_size);
-   rc = memory_group_register_dynamic(vm->nid, unit_pages);
-   if (rc < 0)
-   goto out_del_resource;
-   vm->mgid = rc;
-
-   /*
-* If we still have memory plugged, we have to unplug all memory first.
-* Registering our parent resource makes sure that this memory isn't
-* actually in use (e.g., trying to reload the driver).
-*/
-   if (vm->plugged_size) {
-   vm->unplug_all_required = true;
-   dev_info(>vdev->dev, "unplugging all memory is required\n");
-   }
-
-   /* register callbacks */
-   vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
-   rc = register_memory_notifier(&vm->memory_notifier);
-   if (rc)
-   goto out_unreg_group;
-   rc = register_virtio_mem_device(vm);
-   if (rc)
-   goto out_unreg_mem;
-
virtio_device_ready(vdev)

[PATCH v1 8/8] virtio-mem: kdump mode to sanitize /proc/vmcore access

2021-09-28 Thread David Hildenbrand
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently. [1]

Let's register a vmcore callback, to allow vmcore code to check if a PFN
belonging to a virtio-mem device is either currently plugged and should
be dumped or is currently unplugged and should not be accessed, instead
mapping the shared zeropage or returning zeroes when reading.

This is important when not capturing /proc/vmcore via tools like
"makedumpfile" that can identify logically unplugged virtio-mem memory via
PG_offline in the memmap, but instead simply copying the file.

Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.

With this series, we'll send one virtio-mem state request for every
~2 MiB chunk of virtio-mem memory indicated in the vmcore that we intend
to read/map.

In the future, we might want to allow building virtio-mem for kdump
mode only, even without CONFIG_MEMORY_HOTPLUG and friends: this way,
we could support special stripped-down kdump kernels that have many
other config options disabled; we'll tackle that once required. Further,
we might want to try sensing bigger blocks (e.g., memory sections)
first before falling back to device blocks on demand.
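
Condensed, the per-block lookup with caching looks roughly like this (a
sketch of what patch #8 implements; consecutive PFNs in the same device
block don't trigger a new request):

	/* serialized by vm->hotplug_mutex; addr aligned to a device block */
	if (addr != vm->last_block_addr) {
		rc = virtio_mem_send_state_request(vm, addr,
						   vm->device_block_size);
		/* on any kind of error, signal "not RAM" */
		is_ram = rc == VIRTIO_MEM_STATE_PLUGGED;
		vm->last_block_addr = addr;
		vm->last_block_plugged = is_ram;
	} else {
		is_ram = vm->last_block_plugged;
	}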

Tested with Fedora rawhide, which contains a recent kexec-tools version
(considering "System RAM (virtio_mem)" when creating the vmcore header) and
a recent dracut version (including the virtio_mem module in the kdump
initrd).

[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 136 
 1 file changed, 124 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 76d8aef3cfd2..ec0b2ab37acb 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -223,6 +223,9 @@ struct virtio_mem {
 * When this lock is held the pointers can't change, ONLINE and
 * OFFLINE blocks can't change the state and no subblocks will get
 * plugged/unplugged.
+*
+* In kdump mode, used to serialize requests, last_block_addr and
+* last_block_plugged.
 */
struct mutex hotplug_mutex;
bool hotplug_active;
@@ -230,6 +233,9 @@ struct virtio_mem {
/* An error occurred we cannot handle - stop processing requests. */
bool broken;
 
+   /* Cached value of is_kdump_kernel() when the device was probed. */
+   bool in_kdump;
+
/* The driver is being removed. */
spinlock_t removal_lock;
bool removing;
@@ -243,6 +249,13 @@ struct virtio_mem {
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
 
+#ifdef CONFIG_PROC_VMCORE
+   /* vmcore callback for /proc/vmcore handling in kdump mode */
+   struct vmcore_cb vmcore_cb;
+   uint64_t last_block_addr;
+   bool last_block_plugged;
+#endif /* CONFIG_PROC_VMCORE */
+
/* Next device in the list of virtio-mem devices. */
struct list_head next;
 };
@@ -2293,6 +2306,12 @@ static void virtio_mem_run_wq(struct work_struct *work)
uint64_t diff;
int rc;
 
+   if (unlikely(vm->in_kdump)) {
+   dev_warn_once(&vm->vdev->dev,
+"unexpected workqueue run in kdump kernel\n");
+   return;
+   }
+
hrtimer_cancel(&vm->retry_timer);
 
if (vm->broken)
@@ -2521,6 +2540,86 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
return rc;
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static int virtio_mem_send_state_request(struct virtio_mem *vm, uint64_t addr,
+uint64_t size)
+{
+   const uint64_t nb_vm_blocks = size / vm->device_block_size;
+   const struct virtio_mem_req req = {
+   .type = cpu_to_virtio16(vm->vdev, VIRTIO_MEM_REQ_STATE),
+   .u.state.addr = cpu_to_virtio64(vm->vdev, addr),
+   .u.state.nb_blocks = cpu_to_virtio16(vm->vdev, nb_vm_blocks),
+   };
+   int rc = -ENOMEM;
+
+   dev_dbg(&vm->vdev->dev, "requesting state: 0x%llx - 0x%llx\n", addr,
+   addr + size - 1);
+
+   switch (virtio_mem_send_request(vm, &req)) {
+   case VIRTIO_MEM_RESP_ACK:
+   return virtio16_to_cpu(vm->vdev, vm->resp.u.state.state);
+   case VIRTIO_MEM_RESP_ERROR:
+   

[PATCH v1 7/8] virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug()

2021-09-28 Thread David Hildenbrand
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 1be3ee7f684d..76d8aef3cfd2 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2667,9 +2667,8 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return rc;
 }
 
-static void virtio_mem_remove(struct virtio_device *vdev)
+static void virtio_mem_deinit_hotplug(struct virtio_mem *vm)
 {
-   struct virtio_mem *vm = vdev->priv;
unsigned long mb_id;
int rc;
 
@@ -2716,7 +2715,8 @@ static void virtio_mem_remove(struct virtio_device *vdev)
 * away. Warn at least.
 */
if (virtio_mem_has_memory_added(vm)) {
-   dev_warn(>dev, "device still has system memory added\n");
+   dev_warn(>vdev->dev,
+"device still has system memory added\n");
} else {
virtio_mem_delete_resource(vm);
kfree_const(vm->resource_name);
@@ -2730,6 +2730,13 @@ static void virtio_mem_remove(struct virtio_device *vdev)
} else {
vfree(vm->bbm.bb_states);
}
+}
+
+static void virtio_mem_remove(struct virtio_device *vdev)
+{
+   struct virtio_mem *vm = vdev->priv;
+
+   virtio_mem_deinit_hotplug(vm);
 
/* reset the device and cleanup the queues */
vdev->config->reset(vdev);
-- 
2.31.1




[PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()

2021-09-28 Thread David Hildenbrand
Let's simplify return handling.

Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index b242d1f4b426..eb61622df75b 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -21,23 +21,16 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
.domid = DOMID_SELF,
.pfn = pfn,
};
-   int ram;
 
if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
return -ENXIO;
 
switch (a.mem_type) {
case HVMMEM_mmio_dm:
-   ram = 0;
-   break;
-   case HVMMEM_ram_rw:
-   case HVMMEM_ram_ro:
+   return 0;
default:
-   ram = 1;
-   break;
+   return 1;
}
-
-   return ram;
 }
 #endif
 
-- 
2.31.1




[PATCH v1 1/8] x86/xen: update xen_oldmem_pfn_is_ram() documentation

2021-09-28 Thread David Hildenbrand
The callback is only used for the vmcore nowadays.

Signed-off-by: David Hildenbrand 
---
 arch/x86/xen/mmu_hvm.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index 57409373750f..b242d1f4b426 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -9,12 +9,9 @@
 
 #ifdef CONFIG_PROC_VMCORE
 /*
- * This function is used in two contexts:
- * - the kdump kernel has to check whether a pfn of the crashed kernel
- *   was a ballooned page. vmcore is using this function to decide
- *   whether to access a pfn of the crashed kernel.
- * - the kexec kernel has to check whether a pfn was ballooned by the
- *   previous kernel. If the pfn is ballooned, handle it properly.
+ * The kdump kernel has to check whether a pfn of the crashed kernel
+ * was a ballooned page. vmcore is using this function to decide
+ * whether to access a pfn of the crashed kernel.
  * Returns 0 if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-- 
2.31.1




[PATCH v1 4/8] proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks

2021-09-28 Thread David Hildenbrand
Let's support multiple registered callbacks, making sure that
registering vmcore callbacks cannot fail. Make the callback return a
bool instead of an int, handling errors internally.
Drop unused HAVE_OLDMEM_PFN_IS_RAM.

We soon want to make use of this infrastructure from other drivers:
virtio-mem, registering one callback for each virtio-mem device, to
prevent reading unplugged virtio-mem memory.

Handle it via a generic vmcore_cb structure, prepared for future
extensions: for example, once we support virtio-mem on s390x where the
vmcore is completely constructed in the second kernel, we want to detect
and add plugged virtio-mem memory ranges to the vmcore in order for them
to get dumped properly.

Handle corner cases that are unexpected and shouldn't happen in sane
setups: registering a callback after the vmcore has already been opened
(warn only) and unregistering a callback after the vmcore has already been
opened (warn and essentially read only zeroes from that point on).
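
With multiple callbacks, a PFN is only treated as RAM if no registered
callback objects. Condensed, the resulting pfn_is_ram() looks roughly like
this (a sketch; locking omitted):

	static bool pfn_is_ram(unsigned long pfn)
	{
		struct vmcore_cb *cb;
		bool ret = true;

		/* after a surprise unregistration, only read zeroes */
		if (unlikely(vmcore_cb_unstable))
			return false;

		list_for_each_entry(cb, &vmcore_cb_list, next) {
			if (unlikely(!cb->pfn_is_ram))
				continue;
			ret = cb->pfn_is_ram(cb, pfn);
			if (!ret)
				break;
		}
		return ret;
	}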

Signed-off-by: David Hildenbrand 
---
 arch/x86/kernel/aperture_64.c | 13 -
 arch/x86/xen/mmu_hvm.c| 15 +++---
 fs/proc/vmcore.c  | 99 ---
 include/linux/crash_dump.h| 26 +++--
 4 files changed, 113 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 10562885f5fc..af3ba08b684b 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -73,12 +73,23 @@ static int gart_mem_pfn_is_ram(unsigned long pfn)
  (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static bool gart_oldmem_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
+{
+   return !!gart_mem_pfn_is_ram(pfn);
+}
+
+static struct vmcore_cb gart_vmcore_cb = {
+   .pfn_is_ram = gart_oldmem_pfn_is_ram,
+};
+#endif
+
 static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
aperture_pfn_start = aper_base >> PAGE_SHIFT;
aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
 #ifdef CONFIG_PROC_VMCORE
-   WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+   register_vmcore_cb(&gart_vmcore_cb);
 #endif
 #ifdef CONFIG_PROC_KCORE
WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index eb61622df75b..49bd4a6a5858 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -12,10 +12,10 @@
  * The kdump kernel has to check whether a pfn of the crashed kernel
  * was a ballooned page. vmcore is using this function to decide
  * whether to access a pfn of the crashed kernel.
- * Returns 0 if the pfn is not backed by a RAM page, the caller may
+ * Returns "false" if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-static int xen_oldmem_pfn_is_ram(unsigned long pfn)
+static bool xen_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
 {
struct xen_hvm_get_mem_type a = {
.domid = DOMID_SELF,
@@ -23,15 +23,18 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
};
 
if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a))
-   return -ENXIO;
+   return true;
 
switch (a.mem_type) {
case HVMMEM_mmio_dm:
-   return 0;
+   return false;
default:
-   return 1;
+   return true;
}
 }
+static struct vmcore_cb xen_vmcore_cb = {
+   .pfn_is_ram = xen_vmcore_pfn_is_ram,
+};
 #endif
 
 static void xen_hvm_exit_mmap(struct mm_struct *mm)
@@ -65,6 +68,6 @@ void __init xen_hvm_init_mmu_ops(void)
if (is_pagetable_dying_supported())
pv_ops.mmu.exit_mmap = xen_hvm_exit_mmap;
 #ifdef CONFIG_PROC_VMCORE
-   WARN_ON(register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram));
+   register_vmcore_cb(&xen_vmcore_cb);
 #endif
 }
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a9bd80ab670e..7a04b2eca287 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -62,46 +62,75 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 /* Device Dump Size */
 static size_t vmcoredd_orig_sz;
 
-/*
- * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
- * The called function has to take care of module refcounting.
- */
-static int (*oldmem_pfn_is_ram)(unsigned long pfn);
-
-int register_oldmem_pfn_is_ram(int (*fn)(unsigned long pfn))
+static DECLARE_RWSEM(vmcore_cb_rwsem);
+/* List of registered vmcore callbacks. */
+static LIST_HEAD(vmcore_cb_list);
+/* Whether we had a surprise unregistration of a callback. */
+static bool vmcore_cb_unstable;
+/* Whether the vmcore has been opened once. */
+static bool vmcore_opened;
+
+void register_vmcore_cb(struct vmcore_cb *cb)
 {
-   if (oldmem_pfn_is_ram)
-   return -EBUSY;
-   oldmem_pfn_is_ram = fn;
-   return 0;
+   down_write(&vmcore_cb_rwsem);

[PATCH v1 0/8] proc/vmcore: sanitize access to virtio-mem memory

2021-09-28 Thread David Hildenbrand
As so often with virtio-mem changes that mess with common MM
infrastructure, this might be a good candidate to go via Andrew's tree.

--

After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem,
this series tackles the last sane way how a VM could accidentially access
logically unplugged memory managed by a virtio-mem device: /proc/vmcore

When dumping memory via "makedumpfile", PG_offline pages, used by
virtio-mem to flag logically unplugged memory, are already properly
excluded; however, especially when accessing/copying /proc/vmcore "the
usual way", we can still end up reading logically unplugged memory part of
a virtio-mem device.

Patch #1-#3 are cleanups. Patch #4 extends the existing oldmem_pfn_is_ram
mechanism. Patch #5-#7 are virtio-mem refactorings for patch #8, which
implements the virtio-mem logic to query the state of device blocks.

Patch #8:

"
Although virtio-mem currently supports reading unplugged memory in the
hypervisor, this will change in the future, indicated to the device via
a new feature flag. We similarly sanitized /proc/kcore access recently.
[...]
Distributions that support virtio-mem+kdump have to make sure that the
virtio_mem module will be part of the kdump kernel or the kdump initrd;
dracut was recently [2] extended to include virtio-mem in the generated
initrd. As long as no special kdump kernels are used, this will
automatically make sure that virtio-mem will be around in the kdump initrd
and sanitize /proc/vmcore access -- with dracut.
"

This is the last remaining bit to support
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of
virtio-mem.

Note: this is best-effort. We'll never be able to control what runs inside
the second kernel, really, but we also don't have to care: we only care
about sane setups where we don't want our VM getting zapped once we
touch the wrong memory location while dumping. While we usually expect sane
setups to use "makedumfile", nothing really speaks against just copying
/proc/vmcore, especially in environments where HWpoisioning isn't typically
expected. Also, we really don't want to put all our trust completely on the
memmap, so sanitizing also makes sense when just using "makedumpfile".

[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com
[2] https://github.com/dracutdevs/dracut/pull/1157
[3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html

Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mike Rapoport 
Cc: "Rafael J. Wysocki" 
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: kexec@lists.infradead.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org

David Hildenbrand (8):
  x86/xen: update xen_oldmem_pfn_is_ram() documentation
  x86/xen: simplify xen_oldmem_pfn_is_ram()
  proc/vmcore: let pfn_is_ram() return a bool
  proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore
callbacks
  virtio-mem: factor out hotplug specifics from virtio_mem_init() into
virtio_mem_init_hotplug()
  virtio-mem: factor out hotplug specifics from virtio_mem_probe() into
virtio_mem_init_hotplug()
  virtio-mem: factor out hotplug specifics from virtio_mem_remove() into
virtio_mem_deinit_hotplug()
  virtio-mem: kdump mode to sanitize /proc/vmcore access

 arch/x86/kernel/aperture_64.c |  13 +-
 arch/x86/xen/mmu_hvm.c|  31 ++--
 drivers/virtio/virtio_mem.c   | 297 --
 fs/proc/vmcore.c  | 105 
 include/linux/crash_dump.h|  26 ++-
 5 files changed, 332 insertions(+), 140 deletions(-)


base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.31.1




[PATCH v1 3/8] proc/vmcore: let pfn_is_ram() return a bool

2021-09-28 Thread David Hildenbrand
The callback should deal with errors internally, it doesn't make sense to
expose these via pfn_is_ram(). We'll rework the callbacks next. Right now
we consider errors as if "it's RAM"; no functional change.

Signed-off-by: David Hildenbrand 
---
 fs/proc/vmcore.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 9a15334da208..a9bd80ab670e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -84,11 +84,11 @@ void unregister_oldmem_pfn_is_ram(void)
 }
 EXPORT_SYMBOL_GPL(unregister_oldmem_pfn_is_ram);
 
-static int pfn_is_ram(unsigned long pfn)
+static bool pfn_is_ram(unsigned long pfn)
 {
int (*fn)(unsigned long pfn);
/* pfn is ram unless fn() checks pagetype */
-   int ret = 1;
+   bool ret = true;
 
/*
 * Ask hypervisor if the pfn is really ram.
@@ -97,7 +97,7 @@ static int pfn_is_ram(unsigned long pfn)
 */
fn = oldmem_pfn_is_ram;
if (fn)
-   ret = fn(pfn);
+   ret = !!fn(pfn);
 
return ret;
 }
@@ -124,7 +124,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
nr_bytes = count;
 
/* If pfn is not ram, return zeros for sparse dump files */
-   if (pfn_is_ram(pfn) == 0)
+   if (!pfn_is_ram(pfn))
memset(buf, 0, nr_bytes);
else {
if (encrypted)
-- 
2.31.1




Re: [PATCH v1 0/4] mm/memory_hotplug: full support for

2021-09-27 Thread David Hildenbrand
Intended subject was "[PATCH v1 0/4] mm/memory_hotplug: full support for 
add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK"


--
Thanks,

David / dhildenb




[PATCH v1 2/4] memblock: allow to specify flags with memblock_add_node()

2021-09-27 Thread David Hildenbrand
We want to specify flags when hotplugging memory. Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.

Note that when hotplugging memory the system is already up and running
and we don't want to add the memory first and apply flags later: it
should happen within one memblock call.

Signed-off-by: David Hildenbrand 
---
 arch/arc/mm/init.c   | 4 ++--
 arch/ia64/mm/contig.c| 2 +-
 arch/ia64/mm/init.c  | 2 +-
 arch/m68k/mm/mcfmmu.c| 3 ++-
 arch/m68k/mm/motorola.c  | 6 --
 arch/mips/loongson64/init.c  | 4 +++-
 arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
 arch/s390/kernel/setup.c | 3 ++-
 include/linux/memblock.h | 3 ++-
 include/linux/mm.h   | 2 +-
 mm/memblock.c| 9 +
 mm/memory_hotplug.c  | 2 +-
 12 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 699ecf119641..110eb69e9bee 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 
size)
 
low_mem_sz = size;
in_use = 1;
-   memblock_add_node(base, size, 0);
+   memblock_add_node(base, size, 0, MEMBLOCK_NONE);
} else {
 #ifdef CONFIG_HIGHMEM
high_mem_start = base;
high_mem_sz = size;
in_use = 1;
-   memblock_add_node(base, size, 1);
+   memblock_add_node(base, size, 1, MEMBLOCK_NONE);
memblock_reserve(base, size);
 #endif
}
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 42e025cfbd08..24901d809301 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -153,7 +153,7 @@ find_memory (void)
efi_memmap_walk(find_max_min_low_pfn, NULL);
max_pfn = max_low_pfn;
 
-   memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
+   memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
 
find_initrd();
 
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 5c6da8d83c1a..5d165607bf35 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int 
nid)
 #endif
 
if (start < end)
-   memblock_add_node(__pa(start), end - start, nid);
+   memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
return 0;
 }
 
diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index eac9dde65193..6f1f25125294 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
m68k_memory[0].addr = _rambase;
m68k_memory[0].size = _ramend - _rambase;
 
-   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+ MEMBLOCK_NONE);
 
/* compute total pages in system */
num_pages = PFN_DOWN(_ramend - _rambase);
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 3a653f0a4188..e80c5d7e6728 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -410,7 +410,8 @@ void __init paging_init(void)
 
min_addr = m68k_memory[0].addr;
max_addr = min_addr + m68k_memory[0].size;
-   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+   memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+ MEMBLOCK_NONE);
for (i = 1; i < m68k_num_memory;) {
if (m68k_memory[i].addr < min_addr) {
printk("Ignoring memory chunk at 0x%lx:0x%lx before the 
first chunk\n",
@@ -421,7 +422,8 @@ void __init paging_init(void)
(m68k_num_memory - i) * sizeof(struct 
m68k_mem_info));
continue;
}
-   memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
+   memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
+ MEMBLOCK_NONE);
addr = m68k_memory[i].addr + m68k_memory[i].size;
if (addr > max_addr)
max_addr = addr;
diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
index 76e0a9636a0e..4ac5ba80bbf6 100644
--- a/arch/mips/loongson64/init.c
+++ b/arch/mips/loongson64/init.c
@@ -77,7 +77,9 @@ void __init szmem(unsigned int node)
(u32)node_id, mem_type, mem_start, mem_size);
pr_info("   start_pfn:0x%llx, end_pfn:0x%llx, 
num_physpages:0x%lx\n",
start_pfn, end_pfn, num_physpages);
-   memblock_add_node(PFN_PHYS(start_pfn), 
PFN_PHYS(node_psize), node);
+   memb

[PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-09-27 Thread David Hildenbrand
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory
regions to add to the vmcore for dumping in the crashkernel via
for_each_mem_range().

However, especially kexec_file is not supposed to select such memblocks via
for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
without CONFIG_ARCH_KEEP_MEMBLOCK.

Let's document why kexec_walk_memblock() won't try placing images on
areas marked MEMBLOCK_DRIVER_MANAGED -- similar to
IORESOURCE_SYSRAM_DRIVER_MANAGED handling in locate_mem_hole_callback()
via kexec_walk_resources().

We'll make sure that memory hotplug code sets the flag where applicable
(IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
support.

Signed-off-by: David Hildenbrand 
---
 include/linux/memblock.h | 16 ++--
 kernel/kexec_file.c  |  5 +
 mm/memblock.c|  4 
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b49a58f621bc..7d8d656d5082 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -33,12 +33,17 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
  * for further details
+ * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected via a 
driver,
+ * corresponding to IORESOURCE_SYSRAM_DRIVER_MANAGED in the kernel resource
+ * tree. Especially kexec should never use this memory for placing images and
+ * shouldn't expose this memory to the second kernel.
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
MEMBLOCK_HOTPLUG= 0x1,  /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2,  /* mirrored region */
MEMBLOCK_NOMAP  = 0x4,  /* don't add to kernel direct mapping */
+   MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */
 };
 
 /**
@@ -209,7 +214,8 @@ static inline void __next_physmem_range(u64 *idx, struct 
memblock_type *type,
  */
 #define for_each_mem_range(i, p_start, p_end) \
__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,   \
-MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
+p_start, p_end, NULL)
 
 /**
  * for_each_mem_range_rev - reverse iterate through memblock areas from
@@ -220,7 +226,8 @@ static inline void __next_physmem_range(u64 *idx, struct 
memblock_type *type,
  */
 #define for_each_mem_range_rev(i, p_start, p_end)  \
__for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, \
-MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
+p_start, p_end, NULL)
 
 /**
  * for_each_reserved_mem_range - iterate over all reserved memblock areas
@@ -250,6 +257,11 @@ static inline bool memblock_is_nomap(struct 
memblock_region *m)
return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_driver_managed(struct memblock_region *m)
+{
+   return m->flags & MEMBLOCK_DRIVER_MANAGED;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 33400ff051a8..8347fc158d2b 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -556,6 +556,11 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf,
if (kbuf->image->type == KEXEC_TYPE_CRASH)
return func(&crashk_res, kbuf);
 
+   /*
+* Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See
+* IORESOURCE_SYSRAM_DRIVER_MANAGED handling in
+* locate_mem_hole_callback().
+*/
if (kbuf->top_down) {
for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
&mstart, &mend, NULL) {
diff --git a/mm/memblock.c b/mm/memblock.c
index 47a56b223141..540a35317fb0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -979,6 +979,10 @@ static bool should_skip_region(struct memblock_type *type,
if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
return true;
 
+   /* skip driver-managed memory unless we were asked for it explicitly */
+   if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
+   return true;

[PATCH v1 0/4] mm/memory_hotplug: full support for

2021-09-27 Thread David Hildenbrand
Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64,
don't cleanly support add_memory_driver_managed() yet. Most prominently,
kexec_file can still end up placing images on such driver-managed memory,
resulting in undesired behavior.

Teaching kexec to not place images on driver-managed memory is especially
relevant for virtio-mem. Details can be found in commit 7b7b27214bba
("mm/memory_hotplug: introduce add_memory_driver_managed()").

Extend memblock with a new flag and set it from memory hotplug code
when applicable. This is required to fully support virtio-mem on
arm64, making also kexec_file behave like on x86-64.

Alternative A: Extend kexec_walk_memblock() to consult the kernel resource
tree whether IORESOURCE_SYSRAM_DRIVER_MANAGED is set. This feels wrong,
because the goal was to rely on memblock and not the resource tree.

Alternative B: Reuse MEMBLOCK_HOTPLUG. MEMBLOCK_HOTPLUG serves a different
purpose, though.

Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Jianyong Wu 
Cc: Aneesh Kumar K.V 
Cc: Vineet Gupta 
Cc: Geert Uytterhoeven 
Cc: Huacai Chen 
Cc: Jiaxun Yang 
Cc: Thomas Bogendoerfer 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Eric Biederman 
Cc: Arnd Bergmann 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@kvack.org
Cc: kexec@lists.infradead.org

David Hildenbrand (4):
  mm/memory_hotplug: handle memblock_add_node() failures in
add_memory_resource()
  memblock: allow to specify flags with memblock_add_node()
  memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
IORESOURCE_SYSRAM_DRIVER_MANAGED
  mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with
IORESOURCE_SYSRAM_DRIVER_MANAGED

 arch/arc/mm/init.c   |  4 ++--
 arch/ia64/mm/contig.c|  2 +-
 arch/ia64/mm/init.c  |  2 +-
 arch/m68k/mm/mcfmmu.c|  3 ++-
 arch/m68k/mm/motorola.c  |  6 --
 arch/mips/loongson64/init.c  |  4 +++-
 arch/mips/sgi-ip27/ip27-memory.c |  3 ++-
 arch/s390/kernel/setup.c |  3 ++-
 include/linux/memblock.h | 19 ---
 include/linux/mm.h   |  2 +-
 kernel/kexec_file.c  |  5 +
 mm/memblock.c| 13 +
 mm/memory_hotplug.c  | 11 +--
 13 files changed, 57 insertions(+), 20 deletions(-)


base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.31.1




[PATCH v1 1/4] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()

2021-09-27 Thread David Hildenbrand
If memblock_add_node() fails, we're most probably running out of memory.
While this is unlikely to happen, it can happen and having memory added
without a memblock can be problematic for architectures that use
memblock to detect valid memory. Let's fail in a nice way instead of
silently ignoring the error.

Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fd0be32a281..917b3528636d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,8 +1384,11 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
 
mem_hotplug_begin();
 
-   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
-   memblock_add_node(start, size, nid);
+   if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+   ret = memblock_add_node(start, size, nid);
+   if (ret)
+   goto error_mem_hotplug_end;
+   }
 
ret = __try_online_node(nid, false);
if (ret < 0)
@@ -1458,6 +1461,7 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
rollback_node_hotadd(nid);
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
memblock_remove(start, size);
+error_mem_hotplug_end:
mem_hotplug_done();
return ret;
 }
-- 
2.31.1




[PATCH v1 4/4] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-09-27 Thread David Hildenbrand
Let's communicate driver-managed regions to memblock, to properly
teach kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on
these memory regions.

Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5f873e7f5b29..6d90818d4ce8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1357,6 +1357,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
struct vmem_altmap mhp_altmap = {};
struct memory_group *group = NULL;
u64 start, size;
@@ -1385,7 +1386,9 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
mem_hotplug_begin();
 
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-   ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
+   if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED)
+   memblock_flags = MEMBLOCK_DRIVER_MANAGED;
+   ret = memblock_add_node(start, size, nid, memblock_flags);
if (ret)
goto error_mem_hotplug_end;
}
-- 
2.31.1




Re: [PATCH 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-09 Thread David Hildenbrand
 1) == ZONE_HIGHMEM;
@@ -1041,17 +1041,17 @@ extern int percpu_pagelist_fraction;
  extern char numa_zonelist_order[];
  #define NUMA_ZONELIST_ORDER_LEN   16
  
-#ifndef CONFIG_NEED_MULTIPLE_NODES

+#ifndef CONFIG_NUMA
  
  extern struct pglist_data contig_page_data;

#define NODE_DATA(nid)(&contig_page_data)
  #define NODE_MEM_MAP(nid) mem_map
  
-#else /* CONFIG_NEED_MULTIPLE_NODES */

+#else /* CONFIG_NUMA */
  
  #include 
  
-#endif /* !CONFIG_NEED_MULTIPLE_NODES */

+#endif /* !CONFIG_NUMA */
  
  extern struct pglist_data *first_online_pgdat(void);

  extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 825284baaf46..53eb8bc6026d 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -455,7 +455,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_SYMBOL(_stext);
VMCOREINFO_SYMBOL(vmap_area_list);
  
-#ifndef CONFIG_NEED_MULTIPLE_NODES

+#ifndef CONFIG_NUMA
VMCOREINFO_SYMBOL(mem_map);
VMCOREINFO_SYMBOL(contig_page_data);
  #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 218b96ccc84a..bffe4bd859f3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -59,15 +59,6 @@ config FLAT_NODE_MEM_MAP
def_bool y
depends on !SPARSEMEM
  
-#

-# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
-# to represent different areas of memory.  This variable allows
-# those dependencies to exist individually.
-#
-config NEED_MULTIPLE_NODES
-   def_bool y
-   depends on NUMA
-
  #
  # SPARSEMEM_EXTREME (which is the default) does some bootmem
  # allocations when sparse_init() is called.  If this cannot
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..123feef5259d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -92,7 +92,7 @@
   * system initialization completes.
   */
  
-#ifndef CONFIG_NEED_MULTIPLE_NODES

+#ifndef CONFIG_NUMA
  struct pglist_data __refdata contig_page_data;
  EXPORT_SYMBOL(contig_page_data);
  #endif
@@ -607,7 +607,7 @@ static int __init_memblock memblock_add_range(struct 
memblock_type *type,
 * area, insert that portion.
 */
if (rbase > base) {
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
WARN_ON(nid != memblock_get_region_node(rgn));
  #endif
WARN_ON(flags != rgn->flags);
@@ -1205,7 +1205,7 @@ void __init_memblock __next_mem_pfn_range(int *idx, int 
nid,
  int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
  struct memblock_type *type, int nid)
  {
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
int start_rgn, end_rgn;
int i, ret;
  
@@ -1849,7 +1849,7 @@ static void __init_memblock memblock_dump(struct memblock_type *type)

size = rgn->size;
end = base + size - 1;
flags = rgn->flags;
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
 memblock_get_region_node(rgn));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6fc22482eaa8..8f08135d3eb4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,7 +1596,7 @@ void __free_pages_core(struct page *page, unsigned int 
order)
__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
  }
  
-#ifdef CONFIG_NEED_MULTIPLE_NODES

+#ifdef CONFIG_NUMA
  
  /*

   * During memory init memblocks map pfns to nids. The search is expensive and
@@ -1646,7 +1646,7 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
  
  	return nid;

  }
-#endif /* CONFIG_NEED_MULTIPLE_NODES */
+#endif /* CONFIG_NUMA */
  
  void __init memblock_free_pages(struct page *page, unsigned long pfn,

unsigned int order)
@@ -7276,7 +7276,7 @@ static void __ref alloc_node_mem_map(struct pglist_data 
*pgdat)
pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
__func__, pgdat->node_id, (unsigned long)pgdat,
        (unsigned long)pgdat->node_mem_map);
-#ifndef CONFIG_NEED_MULTIPLE_NODES
+#ifndef CONFIG_NUMA
/*
 * With no DISCONTIG, the global mem_map is just set as node 0's
 */



Nice

Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 9/9] mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

2021-06-09 Thread David Hildenbrand

On 02.06.21 12:53, Mike Rapoport wrote:

From: Mike Rapoport 

After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Signed-off-by: Mike Rapoport 
---
  include/linux/mmzone.h | 4 ++--
  kernel/crash_core.c| 2 +-
  mm/Kconfig | 4 
  mm/page_alloc.c| 6 +++---
  mm/page_ext.c  | 2 +-
  5 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ad42f440c704..2698cdbfbf75 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -775,7 +775,7 @@ typedef struct pglist_data {
struct zonelist node_zonelists[MAX_ZONELISTS];
  
  	int nr_zones; /* number of populated zones in this node */

-#ifdef CONFIG_FLAT_NODE_MEM_MAP/* means !SPARSEMEM */
+#ifdef CONFIG_FLATMEM  /* means !SPARSEMEM */
struct page *node_mem_map;
  #ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
@@ -865,7 +865,7 @@ typedef struct pglist_data {
  
  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

  #define node_spanned_pages(nid)   (NODE_DATA(nid)->node_spanned_pages)
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
  #define pgdat_page_nr(pgdat, pagenr)  ((pgdat)->node_mem_map + (pagenr))
  #else
  #define pgdat_page_nr(pgdat, pagenr)  pfn_to_page((pgdat)->node_start_pfn + 
(pagenr))
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 53eb8bc6026d..2b8446ea7105 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -483,7 +483,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, compound_head);
VMCOREINFO_OFFSET(pglist_data, node_zones);
VMCOREINFO_OFFSET(pglist_data, nr_zones);
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
VMCOREINFO_OFFSET(pglist_data, node_mem_map);
  #endif
VMCOREINFO_OFFSET(pglist_data, node_start_pfn);
diff --git a/mm/Kconfig b/mm/Kconfig
index bffe4bd859f3..ded98fb859ab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -55,10 +55,6 @@ config FLATMEM
def_bool y
depends on !SPARSEMEM || FLATMEM_MANUAL
  
-config FLAT_NODE_MEM_MAP

-   def_bool y
-   depends on !SPARSEMEM
-
  #
  # SPARSEMEM_EXTREME (which is the default) does some bootmem
  # allocations when sparse_init() is called.  If this cannot
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f08135d3eb4..f039736541eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6444,7 +6444,7 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
  }
  
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)

+#if !defined(CONFIG_FLATMEM)
  /*
   * Only struct pages that correspond to ranges defined by memblock.memory
   * are zeroed and initialized by going through __init_single_page() during
@@ -7241,7 +7241,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
}
  }
  
-#ifdef CONFIG_FLAT_NODE_MEM_MAP

+#ifdef CONFIG_FLATMEM
  static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
  {
unsigned long __maybe_unused start = 0;
@@ -7289,7 +7289,7 @@ static void __ref alloc_node_mem_map(struct pglist_data 
*pgdat)
  }
  #else
  static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { }
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+#endif /* CONFIG_FLATMEM */
  
  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT

  static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index df6f74aac8e1..293b2685fc48 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -191,7 +191,7 @@ void __init page_ext_init_flatmem(void)
panic("Out of memory");
  }
  
-#else /* CONFIG_FLAT_NODE_MEM_MAP */

+#else /* CONFIG_FLATMEM */
  
  struct page_ext *lookup_page_ext(const struct page *page)

  {



Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 7/9] docs: remove description of DISCONTIGMEM

2021-06-09 Thread David Hildenbrand

On 02.06.21 12:53, Mike Rapoport wrote:

From: Mike Rapoport 

Remove description of DISCONTIGMEM from the "Memory Models" document and
update VM sysctl description so that it won't mention DISCONIGMEM.

Signed-off-by: Mike Rapoport 
---
  Documentation/admin-guide/sysctl/vm.rst | 12 +++
  Documentation/vm/memory-model.rst   | 45 ++---
  2 files changed, 8 insertions(+), 49 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..ddbd71d592e0 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -936,12 +936,12 @@ allocations, THP and hugetlbfs pages.
  
  To make it sensible with respect to the watermark_scale_factor

  parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+15,000 means that up to 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. The level of reclaim
+is determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g.  2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
  
  
  watermark_scale_factor

diff --git a/Documentation/vm/memory-model.rst 
b/Documentation/vm/memory-model.rst
index ce398a7dc6cd..30e8fbed6914 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -14,15 +14,11 @@ for the CPU. Then there could be several contiguous ranges 
at
  completely distinct addresses. And, don't forget about NUMA, where
  different memory banks are attached to different CPUs.
  
-Linux abstracts this diversity using one of the three memory models:

-FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
+Linux abstracts this diversity using one of the two memory models:
+FLATMEM and SPARSEMEM. Each architecture defines what
  memory models it supports, what the default memory model is and
  whether it is possible to manually override that default.
  
-.. note::

-   At time of this writing, DISCONTIGMEM is considered deprecated,
-   although it is still in use by several architectures.
-
  All the memory models track the status of physical page frames using
  struct page arranged in one or more arrays.
  
@@ -63,43 +59,6 @@ straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the

  The `ARCH_PFN_OFFSET` defines the first page frame number for
  systems with physical memory starting at address different from 0.
  
-DISCONTIGMEM
-============
-
-The DISCONTIGMEM model treats the physical memory as a collection of
-`nodes` similarly to how Linux NUMA support does. For each node Linux
-constructs an independent memory management subsystem represented by
-`struct pglist_data` (or `pg_data_t` for short). Among other
-things, `pg_data_t` holds the `node_mem_map` array that maps
-physical pages belonging to that node. The `node_start_pfn` field of
-`pg_data_t` is the number of the first page frame belonging to that
-node.
-
-The architecture setup code should call :c:func:`free_area_init_node` for
-each node in the system to initialize the `pg_data_t` object and its
-`node_mem_map`.
-
-Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
-every physical page frame in a node has a `struct page` entry in the
-`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
-`flags` field of the `struct page` encodes the node number of the
-node hosting that page.
-
-The conversion between a PFN and the `struct page` in the
-DISCONTIGMEM model became slightly more complex as it has to determine
-which node hosts the physical page and which `pg_data_t` object
-holds the `struct page`.
-
-Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
-to convert PFN to the node number. The opposite conversion helper
-:c:func:`page_to_nid` is generic as it uses the node number encoded in
-page->flags.
-
-Once the node number is known, the PFN can be used to index
-appropriate `node_mem_map` array to access the `struct page` and
-the offset of the `struct page` from the `node_mem_map` plus
-`node_start_pfn` is the PFN of that page.
-
  SPARSEMEM
  =========
  



Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb
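
A minimal sketch of the two PFN-to-page conversions described in the doc
text above (names taken from memory-model.rst; illustrative only, not the
kernel's actual pfn_to_page() implementations):

	/* FLATMEM: one global mem_map array, offset by ARCH_PFN_OFFSET. */
	struct page *flatmem_pfn_to_page(unsigned long pfn)
	{
		return &mem_map[pfn - ARCH_PFN_OFFSET];
	}

	/* DISCONTIGMEM (removed above): find the owning node first, then
	 * index that node's node_mem_map relative to its node_start_pfn. */
	struct page *discontigmem_pfn_to_page(unsigned long pfn)
	{
		pg_data_t *pgdat = NODE_DATA(pfn_to_nid(pfn));

		return &pgdat->node_mem_map[pfn - pgdat->node_start_pfn];
	}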




Re: [PATCH 5/9] mm: remove CONFIG_DISCONTIGMEM

2021-06-09 Thread David Hildenbrand
@@ -85,7 +66,7 @@ config FLAT_NODE_MEM_MAP
  #
  config NEED_MULTIPLE_NODES
def_bool y
-   depends on DISCONTIGMEM || NUMA
+   depends on NUMA
  
  #

  # SPARSEMEM_EXTREME (which is the default) does some bootmem
diff --git a/mm/memory.c b/mm/memory.c
index 730daa00952b..7c7b6ea02504 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -90,8 +90,7 @@
  #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
  #endif
  
-#ifndef CONFIG_NEED_MULTIPLE_NODES

-/* use the per-pgdat data instead for discontigmem - mbligh */
+#ifdef CONFIG_FLATMEM
  unsigned long max_mapnr;
  EXPORT_SYMBOL(max_mapnr);
  
diff --git a/mm/page_alloc.c b/mm/page_alloc.c

index aaa1655cf682..6fc22482eaa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,20 +331,7 @@ compound_page_dtor * const 
compound_page_dtors[NR_COMPOUND_DTORS] = {
  
  int min_free_kbytes = 1024;

  int user_min_free_kbytes = -1;
-#ifdef CONFIG_DISCONTIGMEM
-/*
- * DiscontigMem defines memory ranges as separate pg_data_t even if the ranges
- * are not on separate NUMA nodes. Functionally this works but with
- * watermark_boost_factor, it can reclaim prematurely as the ranges can be
- * quite small. By default, do not boost watermarks on discontigmem as in
- * many cases very high-order allocations like THP are likely to be
- * unsupported and the premature reclaim offsets the advantage of long-term
- * fragmentation avoidance.
- */
-int watermark_boost_factor __read_mostly;
-#else
  int watermark_boost_factor __read_mostly = 15000;
-#endif
  int watermark_scale_factor = 10;
  
  static unsigned long nr_kernel_pages __initdata;




Nice

Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 6/9] arch, mm: remove stale mentions of DISCONIGMEM

2021-06-09 Thread David Hildenbrand
one containing just zones from the node the zonelist belongs to.
   *
- * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
- * optimized to &contig_page_data at compile-time.
+ * For the case of non-NUMA systems the NODE_DATA() gets optimized to
+ * &contig_page_data at compile-time.
   */
  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
  {



Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 3/9] arc: remove support for DISCONTIGMEM

2021-06-09 Thread David Hildenbrand

On 02.06.21 12:53, Mike Rapoport wrote:

From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 


Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 1/9] alpha: remove DISCONTIGMEM and NUMA

2021-06-09 Thread David Hildenbrand
-	node_data[nid] = (pg_data_t *)(__va(node_min_pfn << PAGE_SHIFT));
-#endif
-   printk(" Detected node memory:   start %8lu, end %8lu\n",
-  node_min_pfn, node_max_pfn);
-
-   DBGDCONT(" DISCONTIG: node_data[%d]   is at 0x%p\n", nid, 
NODE_DATA(nid));
-
-   /* Find the bounds of kernel memory.  */
-   start_kernel_pfn = PFN_DOWN(KERNEL_START_PHYS);
-   end_kernel_pfn = PFN_UP(virt_to_phys(kernel_end));
-
-   if (!nid && (node_max_pfn < end_kernel_pfn || node_min_pfn > 
start_kernel_pfn))
-   panic("kernel loaded out of ram");
-
-   memblock_add_node(PFN_PHYS(node_min_pfn),
- (node_max_pfn - node_min_pfn) << PAGE_SHIFT, nid);
-
-   /* Zone start phys-addr must be 2^(MAX_ORDER-1) aligned.
-  Note that we round this down, not up - node memory
-  has much larger alignment than 8Mb, so it's safe. */
-   node_min_pfn &= ~((1UL << (MAX_ORDER-1))-1);
-
-   NODE_DATA(nid)->node_start_pfn = node_min_pfn;
-   NODE_DATA(nid)->node_present_pages = node_max_pfn - node_min_pfn;
-
-   node_set_online(nid);
-}
-
-void __init
-setup_memory(void *kernel_end)
-{
-   unsigned long kernel_size;
-   int nid;
-
-   show_mem_layout();
-
-   nodes_clear(node_online_map);
-
-   min_low_pfn = ~0UL;
-   max_low_pfn = 0UL;
-   for (nid = 0; nid < MAX_NUMNODES; nid++)
-   setup_memory_node(nid, kernel_end);
-
-   kernel_size = virt_to_phys(kernel_end) - KERNEL_START_PHYS;
-   memblock_reserve(KERNEL_START_PHYS, kernel_size);
-
-#ifdef CONFIG_BLK_DEV_INITRD
-   initrd_start = INITRD_START;
-   if (initrd_start) {
-   extern void *move_initrd(unsigned long);
-
-   initrd_end = initrd_start+INITRD_SIZE;
-   printk("Initial ramdisk at: 0x%p (%lu bytes)\n",
-  (void *) initrd_start, INITRD_SIZE);
-
-   if ((void *)initrd_end > phys_to_virt(PFN_PHYS(max_low_pfn))) {
-   if (!move_initrd(PFN_PHYS(max_low_pfn)))
-   printk("initrd extends beyond end of memory "
-  "(0x%08lx > 0x%p)\ndisabling initrd\n",
-  initrd_end,
-  phys_to_virt(PFN_PHYS(max_low_pfn)));
-   } else {
-   nid = kvaddr_to_nid(initrd_start);
-   memblock_reserve(virt_to_phys((void *)initrd_start),
-INITRD_SIZE);
-   }
-   }
-#endif /* CONFIG_BLK_DEV_INITRD */
-}
-
-void __init paging_init(void)
-{
-   unsigned long   max_zone_pfn[MAX_NR_ZONES] = {0, };
-   unsigned long   dma_local_pfn;
-
-   /*
-* The old global MAX_DMA_ADDRESS per-arch API doesn't fit
-* in the NUMA model, for now we convert it to a pfn and
-* we interpret this pfn as a local per-node information.
-* This issue isn't very important since none of these machines
-* have legacy ISA slots anyways.
-*/
-   dma_local_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-   max_zone_pfn[ZONE_DMA] = dma_local_pfn;
-   max_zone_pfn[ZONE_NORMAL] = max_pfn;
-
-   free_area_init(max_zone_pfn);
-
-   /* Initialize the kernel's ZERO_PGE. */
-   memset((void *)ZERO_PGE, 0, PAGE_SIZE);
-}



Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 2/9] arc: update comment about HIGHMEM implementation

2021-06-09 Thread David Hildenbrand

On 02.06.21 12:53, Mike Rapoport wrote:

From: Mike Rapoport 

Arc does not use DISCONTIGMEM to implement high memory, update the comment
describing how high memory works to reflect this.

Signed-off-by: Mike Rapoport 
---
  arch/arc/mm/init.c | 13 +
  1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index e2ed355438c9..397a201adfe3 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -139,16 +139,13 @@ void __init setup_arch_memory(void)
  
  #ifdef CONFIG_HIGHMEM

/*
-* Populate a new node with highmem
-*
 * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based)
-* than addresses in normal ala low memory (0x8000_0000 based).
+* than addresses in normal aka low memory (0x8000_0000 based).
 * Even with PAE, the huge peripheral space hole would waste a lot of
-* mem with single mem_map[]. This warrants a mem_map per region design.
-* Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM.
-*
-* DISCONTIGMEM in turns requires multiple nodes. node 0 above is
-* populated with normal memory zone while node 1 only has highmem
+* mem with single contiguous mem_map[].
+* Thus when HIGHMEM on ARC is enabled the memory map corresponding
+* to the hole is freed and ARC specific version of pfn_valid()
+* handles the hole in the memory map.
 */
  #ifdef CONFIG_DISCONTIGMEM
node_set_online(1);



Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH V2 4/6] mm: rename the global section array to mem_sections

2021-06-01 Thread David Hildenbrand

On 01.06.21 10:37, Dong Aisheng wrote:

On Tue, Jun 1, 2021 at 4:22 PM David Hildenbrand  wrote:


On 31.05.21 11:19, Dong Aisheng wrote:

In order to distinguish it from struct mem_section, improve code
readability, and align with the kernel doc [1] naming below, change the
global mem section array name from 'mem_section' to 'mem_sections'.

[1] Documentation/vm/memory-model.rst
"The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`."

Cc: Andrew Morton 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: kexec@lists.infradead.org
Signed-off-by: Dong Aisheng 
---
v1->v2:
   * no changes
---
   include/linux/mmzone.h | 10 +-
   kernel/crash_core.c|  4 ++--
   mm/sparse.c| 16 
   3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a6bfde85ddb0..0ed61f32d898 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1302,9 +1302,9 @@ struct mem_section {
   #define SECTION_ROOT_MASK   (SECTIONS_PER_ROOT - 1)

   #ifdef CONFIG_SPARSEMEM_EXTREME
-extern struct mem_section **mem_section;
+extern struct mem_section **mem_sections;
   #else
-extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
+extern struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
   #endif

   static inline unsigned long *section_to_usemap(struct mem_section *ms)
@@ -1315,12 +1315,12 @@ static inline unsigned long *section_to_usemap(struct 
mem_section *ms)
   static inline struct mem_section *__nr_to_section(unsigned long nr)
   {
   #ifdef CONFIG_SPARSEMEM_EXTREME
- if (!mem_section)
+ if (!mem_sections)
   return NULL;
   #endif
- if (!mem_section[SECTION_NR_TO_ROOT(nr)])
+ if (!mem_sections[SECTION_NR_TO_ROOT(nr)])
   return NULL;
- return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
+ return &mem_sections[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
   }
   extern unsigned long __section_nr(struct mem_section *ms);
   extern size_t mem_section_usage_size(void);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 29cc15398ee4..fb1180d81b5a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -414,8 +414,8 @@ static int __init crash_save_vmcoreinfo_init(void)
   VMCOREINFO_SYMBOL(contig_page_data);
   #endif
   #ifdef CONFIG_SPARSEMEM
- VMCOREINFO_SYMBOL_ARRAY(mem_section);
- VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
+ VMCOREINFO_SYMBOL_ARRAY(mem_sections);
+ VMCOREINFO_LENGTH(mem_sections, NR_SECTION_ROOTS);
   VMCOREINFO_STRUCT_SIZE(mem_section);
   VMCOREINFO_OFFSET(mem_section, section_mem_map);
   VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
diff --git a/mm/sparse.c b/mm/sparse.c
index d02ee6bb7cbc..6412010478f7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -24,12 +24,12 @@
* 1) mem_section   - memory sections, mem_map's for valid memory
*/
   #ifdef CONFIG_SPARSEMEM_EXTREME
-struct mem_section **mem_section;
+struct mem_section **mem_sections;
   #else
-struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
+struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
   cacheline_internodealigned_in_smp;
   #endif
-EXPORT_SYMBOL(mem_section);
+EXPORT_SYMBOL(mem_sections);

   #ifdef NODE_NOT_IN_PAGE_FLAGS
   /*
@@ -66,8 +66,8 @@ static void __init sparse_alloc_section_roots(void)

   size = sizeof(struct mem_section *) * NR_SECTION_ROOTS;
   align = 1 << (INTERNODE_CACHE_SHIFT);
- mem_section = memblock_alloc(size, align);
- if (!mem_section)
+ mem_sections = memblock_alloc(size, align);
+ if (!mem_sections)
   panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
 __func__, size, align);
   }
@@ -103,14 +103,14 @@ static int __meminit sparse_index_init(unsigned long 
section_nr, int nid)
*
* The mem_hotplug_lock resolves the apparent race below.
*/
- if (mem_section[root])
+ if (mem_sections[root])
   return 0;

   section = sparse_index_alloc(nid);
   if (!section)
   return -ENOMEM;

- mem_section[root] = section;
+ mem_sections[root] = section;

   return 0;
   }
@@ -145,7 +145,7 @@ unsigned long __section_nr(struct mem_section *ms)
   #else
   unsigned long __section_nr(struct mem_section *ms)
   {
- return (unsigned long)(ms - mem_section[0]);
+ return (unsigned long)(ms - mem_sections[0]);
   }
   #endif




I repeat: unnecessary code churn IMHO.


Hi David,

Thanks, I explained the reason in my last reply.
Andrew has already picked this patch up into the -mm tree.


Just because it's in Andrew's tree doesn't mean it will end up upstream. ;)

Anyhow, no really strong opinion, it's simply unnecessary code churn 
that makes bisecting harder without real value IMHO.


--
Thanks,

David / dhildenb
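
A minimal sketch of how a section number indexes the renamed
two-dimensional array (names from the patch above; an illustrative helper,
not part of the patch itself):

	static struct mem_section *nr_to_section_sketch(unsigned long nr)
	{
		unsigned long root   = SECTION_NR_TO_ROOT(nr); /* nr / SECTIONS_PER_ROOT */
		unsigned long offset = nr & SECTION_ROOT_MASK; /* nr % SECTIONS_PER_ROOT */

		/* SECTIONS_PER_ROOT is a power of two, so the mask is an
		 * exact modulo; with SPARSEMEM_EXTREME a root pointer may
		 * still be NULL until sparse_index_init() populates it. */
		return &mem_sections[root][offset];
	}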



Re: [PATCH V2 4/6] mm: rename the global section array to mem_sections

2021-06-01 Thread David Hildenbrand

On 31.05.21 11:19, Dong Aisheng wrote:

In order to distinguish it from struct mem_section, improve code
readability, and align with the kernel doc [1] naming below, change the
global mem section array name from 'mem_section' to 'mem_sections'.

[1] Documentation/vm/memory-model.rst
"The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`."

Cc: Andrew Morton 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: kexec@lists.infradead.org
Signed-off-by: Dong Aisheng 
---
v1->v2:
  * no changes
---
  include/linux/mmzone.h | 10 +-
  kernel/crash_core.c|  4 ++--
  mm/sparse.c| 16 
  3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a6bfde85ddb0..0ed61f32d898 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1302,9 +1302,9 @@ struct mem_section {
  #define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1)
  
  #ifdef CONFIG_SPARSEMEM_EXTREME

-extern struct mem_section **mem_section;
+extern struct mem_section **mem_sections;
  #else
-extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
+extern struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
  #endif
  
  static inline unsigned long *section_to_usemap(struct mem_section *ms)

@@ -1315,12 +1315,12 @@ static inline unsigned long *section_to_usemap(struct 
mem_section *ms)
  static inline struct mem_section *__nr_to_section(unsigned long nr)
  {
  #ifdef CONFIG_SPARSEMEM_EXTREME
-   if (!mem_section)
+   if (!mem_sections)
return NULL;
  #endif
-   if (!mem_section[SECTION_NR_TO_ROOT(nr)])
+   if (!mem_sections[SECTION_NR_TO_ROOT(nr)])
return NULL;
-   return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
+   return &mem_sections[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
  }
  extern unsigned long __section_nr(struct mem_section *ms);
  extern size_t mem_section_usage_size(void);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 29cc15398ee4..fb1180d81b5a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -414,8 +414,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_SYMBOL(contig_page_data);
  #endif
  #ifdef CONFIG_SPARSEMEM
-   VMCOREINFO_SYMBOL_ARRAY(mem_section);
-   VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
+   VMCOREINFO_SYMBOL_ARRAY(mem_sections);
+   VMCOREINFO_LENGTH(mem_sections, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
diff --git a/mm/sparse.c b/mm/sparse.c
index d02ee6bb7cbc..6412010478f7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -24,12 +24,12 @@
   * 1) mem_section - memory sections, mem_map's for valid memory
   */
  #ifdef CONFIG_SPARSEMEM_EXTREME
-struct mem_section **mem_section;
+struct mem_section **mem_sections;
  #else
-struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
+struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
cacheline_internodealigned_in_smp;
  #endif
-EXPORT_SYMBOL(mem_section);
+EXPORT_SYMBOL(mem_sections);
  
  #ifdef NODE_NOT_IN_PAGE_FLAGS

  /*
@@ -66,8 +66,8 @@ static void __init sparse_alloc_section_roots(void)
  
  	size = sizeof(struct mem_section *) * NR_SECTION_ROOTS;

align = 1 << (INTERNODE_CACHE_SHIFT);
-   mem_section = memblock_alloc(size, align);
-   if (!mem_section)
+   mem_sections = memblock_alloc(size, align);
+   if (!mem_sections)
panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
  __func__, size, align);
  }
@@ -103,14 +103,14 @@ static int __meminit sparse_index_init(unsigned long 
section_nr, int nid)
 *
 * The mem_hotplug_lock resolves the apparent race below.
 */
-   if (mem_section[root])
+   if (mem_sections[root])
return 0;
  
  	section = sparse_index_alloc(nid);

if (!section)
return -ENOMEM;
  
-	mem_section[root] = section;

+   mem_sections[root] = section;
  
  	return 0;

  }
@@ -145,7 +145,7 @@ unsigned long __section_nr(struct mem_section *ms)
  #else
  unsigned long __section_nr(struct mem_section *ms)
  {
-   return (unsigned long)(ms - mem_section[0]);
+   return (unsigned long)(ms - mem_sections[0]);
  }
  #endif
  



I repeat: unnecessary code churn IMHO.

--
Thanks,

David / dhildenb




Re: [PATCH 4/5] mm: rename the global section array to mem_sections

2021-05-25 Thread David Hildenbrand

On 17.05.21 13:20, Dong Aisheng wrote:

In order to distinguish it from struct mem_section, improve code
readability, and align with the kernel doc [1] naming below, change the
global mem section array name from 'mem_section' to 'mem_sections'.

[1] Documentation/vm/memory-model.rst
"The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`."

Cc: Andrew Morton 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: kexec@lists.infradead.org
Signed-off-by: Dong Aisheng 
---
  include/linux/mmzone.h | 10 +-
  kernel/crash_core.c|  4 ++--
  mm/sparse.c| 16 
  3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc23e36cb165..b348a06915c5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1297,9 +1297,9 @@ struct mem_section {
  #define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1)
  
  #ifdef CONFIG_SPARSEMEM_EXTREME

-extern struct mem_section **mem_section;
+extern struct mem_section **mem_sections;
  #else
-extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
+extern struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
  #endif
  
  static inline unsigned long *section_to_usemap(struct mem_section *ms)

@@ -1310,12 +1310,12 @@ static inline unsigned long *section_to_usemap(struct 
mem_section *ms)
  static inline struct mem_section *__nr_to_section(unsigned long nr)
  {
  #ifdef CONFIG_SPARSEMEM_EXTREME
-   if (!mem_section)
+   if (!mem_sections)
return NULL;
  #endif
-   if (!mem_section[SECTION_NR_TO_ROOT(nr)])
+   if (!mem_sections[SECTION_NR_TO_ROOT(nr)])
return NULL;
-   return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
+   return &mem_sections[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
  }
  extern unsigned long __section_nr(struct mem_section *ms);
  extern size_t mem_section_usage_size(void);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 29cc15398ee4..fb1180d81b5a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -414,8 +414,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_SYMBOL(contig_page_data);
  #endif
  #ifdef CONFIG_SPARSEMEM
-   VMCOREINFO_SYMBOL_ARRAY(mem_section);
-   VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
+   VMCOREINFO_SYMBOL_ARRAY(mem_sections);
+   VMCOREINFO_LENGTH(mem_sections, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
diff --git a/mm/sparse.c b/mm/sparse.c
index df4418c12f04..a96e7e65475f 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -24,12 +24,12 @@
   * 1) mem_section - memory sections, mem_map's for valid memory
   */
  #ifdef CONFIG_SPARSEMEM_EXTREME
-struct mem_section **mem_section;
+struct mem_section **mem_sections;
  #else
-struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
+struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
cacheline_internodealigned_in_smp;
  #endif
-EXPORT_SYMBOL(mem_section);
+EXPORT_SYMBOL(mem_sections);
  
  #ifdef NODE_NOT_IN_PAGE_FLAGS

  /*
@@ -91,14 +91,14 @@ static int __meminit sparse_index_init(unsigned long 
section_nr, int nid)
 *
 * The mem_hotplug_lock resolves the apparent race below.
 */
-   if (mem_section[root])
+   if (mem_sections[root])
return 0;
  
  	section = sparse_index_alloc(nid);

if (!section)
return -ENOMEM;
  
-	mem_section[root] = section;

+   mem_sections[root] = section;
  
  	return 0;

  }
@@ -131,7 +131,7 @@ unsigned long __section_nr(struct mem_section *ms)
  #else
  unsigned long __section_nr(struct mem_section *ms)
  {
-   return (unsigned long)(ms - mem_section[0]);
+   return (unsigned long)(ms - mem_sections[0]);
  }
  #endif
  
@@ -286,8 +286,8 @@ static void __init memblocks_present(void)

  #ifdef CONFIG_SPARSEMEM_EXTREME
size = sizeof(struct mem_section *) * NR_SECTION_ROOTS;
align = 1 << (INTERNODE_CACHE_SHIFT);
-   mem_section = memblock_alloc(size, align);
-   if (!mem_section)
+   mem_sections = memblock_alloc(size, align);
+   if (!mem_sections)
panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
  __func__, size, align);
  #endif



Smells like unnecessary code churn. I'd rather just fix the doc because 
it doesn't really improve readability IMHO.


Anyhow, just my 2 cents.

--
Thanks,

David / dhildenb




Re: kexec-starting-kernel-problem-on-vm

2021-04-19 Thread David Hildenbrand

On 19.04.21 10:26, Baoquan He wrote:

Hi Jingxian,

On 04/14/21 at 03:04pm, Jingxian He wrote:

We use 'kexec -l' and 'kexec -e' on our virtual machine to upgrade the
Linux kernel. We find that the new kernel may fail to start because the
sha256 checksum verification of the initrd segment fails, with low probability.

The related code is as follows:
/* arch/x86/purgatory/purgatory.c */
static int verify_sha256_digest(void)
{
struct kexec_sha_region *ptr, *end;
u8 digest[SHA256_DIGEST_SIZE];
struct sha256_state sctx;

sha256_init(&sctx);
end = purgatory_sha_regions + ARRAY_SIZE(purgatory_sha_regions);

for (ptr = purgatory_sha_regions; ptr < end; ptr++)
sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);

sha256_final(&sctx, digest);

if (memcmp(digest, purgatory_sha256_digest, sizeof(digest)))
return 1;

return 0;
}

void purgatory(void)
{
int ret;

ret = verify_sha256_digest();


I usually use a qemu/kvm guest to test kernel, kexec and kdump, and haven't
met this issue. kexec -l/-e works well for me. It seems you are not using
the latest kexec-tools. Alternatively, you can use "-i (--no-checks)" to work
around this for the time being.


if (ret) { // <-- verify_sha256_digest() failed: loop forever below
/* loop forever */
for (;;)
;
}
copy_backup_region();
}


Our opinion of this problem:
We think that the process of relocating the new kernel depends on the
boot CPU running without interruption. However, the vCPUs may be interrupted
by the QEMU process with async_page_fault interrupts.


So, are you saying that the host still delivers an APF (async page fault)
to the guest, even though it has interrupts disabled (including APF)? Hard
to imagine that this would be the case right now.


Any other host activity that temporarily stops/schedules out the vCPU
should not be relevant to the VM ("vCPU interrupted by the QEMU
process"); if there were something running inside the VM that
disables interrupts to reduce the size of a race window, that would need
fixing inside the VM.


--
Thanks,

David / dhildenb




Re: [PATCH v1 0/3] crashdump/x86: dump dax/kmem and virtio-mem added System RAM

2021-04-02 Thread David Hildenbrand

> On 02.04.2021 at 12:16, Simon Horman wrote:
> 
> On Tue, Mar 30, 2021 at 10:58:28AM +0200, David Hildenbrand wrote:
>>> On 23.03.21 11:01, David Hildenbrand wrote:
>>> Let's properly support dumping any kind of "System RAM" that is not on the
>>> top level of the kernel resource tree in /proc/iomem, processing any
>>> /proc/iomem entry that contains "System RAM". In addition, increase
>>> the maximum number of crash memory ranges to fully support virtio-mem
>>> even in corner cases where we end up with a lot of individual memory
>>> ranges.
>>> 
>>> David Hildenbrand (3):
>>>   crashdump/x86: dump any kind of "System RAM"
>>>   crashdump/x86: iterate only over actual crash memory ranges
>>>   crashdump/x86: increase CRASH_MAX_MEMORY_RANGES to 32k
>>> 
>>>  kexec/arch/i386/crashdump-x86.c | 18 +-
>>>  kexec/arch/i386/crashdump-x86.h |  3 ++-
>>>  2 files changed, 15 insertions(+), 6 deletions(-)
>>> 
>> 
>> Gentle ping.
> 
> Thanks, and sorry for the delay.
> 
> Series applied.
> 
No worries - thanks a lot!

Have a nice weekend :)




Re: [PATCH v1 0/3] crashdump/x86: dump dax/kmem and virtio-mem added System RAM

2021-03-30 Thread David Hildenbrand

On 23.03.21 11:01, David Hildenbrand wrote:

Let's properly support dumping any kind of "System RAM" that is not on the
top level of the kernel resource tree in /proc/iomem, processing any
/proc/iomem entry that contains "System RAM". In addition, increase
the maximum number of crash memory ranges to fully support virtio-mem
even in corner cases where we end up with a lot of individual memory
ranges.

David Hildenbrand (3):
   crashdump/x86: dump any kind of "System RAM"
   crashdump/x86: iterate only over actual crash memory ranges
   crashdump/x86: increase CRASH_MAX_MEMORY_RANGES to 32k

  kexec/arch/i386/crashdump-x86.c | 18 +-
  kexec/arch/i386/crashdump-x86.h |  3 ++-
  2 files changed, 15 insertions(+), 6 deletions(-)



Gentle ping.

--
Thanks,

David / dhildenb
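
The cover letter above boils down to one matching rule: accept any
/proc/iomem entry whose name contains "System RAM", at any nesting depth.
A minimal userspace sketch of that rule (illustrative only; not the actual
kexec-tools parser, and the resource names in the comment are examples):

	#include <string.h>

	/* Matches plain "System RAM" as well as driver-managed variants
	 * such as "System RAM (virtio_mem)" or "System RAM (kmem)". */
	static int is_system_ram(const char *name)
	{
		return strstr(name, "System RAM") != NULL;
	}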




[PATCH v2 2/3] kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources

2021-03-25 Thread David Hildenbrand
It used to be true that we could have system RAM (IORESOURCE_SYSTEM_RAM |
IORESOURCE_BUSY) only on the first level in the resource tree. However,
this no longer holds for driver-managed system RAM (i.e., added via
dax/kmem and virtio-mem), which gets added on lower levels, for example,
inside device containers.

IORESOURCE_SYSTEM_RAM is defined as IORESOURCE_MEM | IORESOURCE_SYSRAM
and is just a special type of IORESOURCE_MEM.

The function walk_mem_res() only considers the first level and is
used in arch/x86/mm/ioremap.c:__ioremap_check_mem() only. We currently
fail to identify System RAM added by dax/kmem and virtio-mem as
"IORES_MAP_SYSTEM_RAM", for example, allowing for remapping of such
"normal RAM" in __ioremap_caller().

Let's find all IORESOURCE_MEM | IORESOURCE_BUSY resources, making the
function behave similar to walk_system_ram_res().

Fixes: ebf71552bb0e ("virtio-mem: Add parent resource for all added "System 
RAM"")
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like 
normal RAM")
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Greg Kroah-Hartman 
Cc: Dan Williams 
Cc: Daniel Vetter 
Cc: Andy Shevchenko 
Cc: Mauro Carvalho Chehab 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Dave Hansen 
Cc: Keith Busch 
Cc: Michal Hocko 
Cc: Qian Cai 
Cc: Oscar Salvador 
Cc: Eric Biederman 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Tom Lendacky 
Cc: Brijesh Singh 
Cc: x...@kernel.org
Cc: kexec@lists.infradead.org
Signed-off-by: David Hildenbrand 
---
 kernel/resource.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 4efd6e912279..16e0c7e8ed24 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -470,7 +470,7 @@ int walk_mem_res(u64 start, u64 end, void *arg,
 {
unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 
-   return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, true,
+   return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, false,
 arg, func);
 }
 
-- 
2.29.2
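
To make the nesting problem concrete, a hypothetical /proc/iomem excerpt
with driver-managed System RAM below a device container (addresses and
names illustrative):

	140000000-33fffffff : virtio0
	  140000000-147ffffff : System RAM (virtio_mem)
	  148000000-14fffffff : System RAM (virtio_mem)

With the old first-level-only walk, walk_mem_res() would never reach the
two child ranges; after this patch they are matched like any other busy
IORESOURCE_MEM resource.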




[PATCH v2 3/3] kernel/resource: remove first_lvl / siblings_only logic

2021-03-25 Thread David Hildenbrand
All functions that search for IORESOURCE_SYSTEM_RAM or IORESOURCE_MEM
resources now properly consider the whole resource tree, not just the first
level. Let's drop the unused first_lvl / siblings_only logic.

Remove documentation that indicates that some functions behave differently,
all consider the full resource tree now.

Reviewed-by: Dan Williams 
Reviewed-by: Andy Shevchenko 
Cc: Andrew Morton 
Cc: Greg Kroah-Hartman 
Cc: Dan Williams 
Cc: Daniel Vetter 
Cc: Andy Shevchenko 
Cc: Mauro Carvalho Chehab 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Dave Hansen 
Cc: Keith Busch 
Cc: Michal Hocko 
Cc: Qian Cai 
Cc: Oscar Salvador 
Cc: Eric Biederman 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Tom Lendacky 
Cc: Brijesh Singh 
Cc: x...@kernel.org
Cc: kexec@lists.infradead.org
Signed-off-by: David Hildenbrand 
---
 kernel/resource.c | 45 -
 1 file changed, 12 insertions(+), 33 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 16e0c7e8ed24..7e00239a023a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -64,12 +64,8 @@ static DEFINE_RWLOCK(resource_lock);
 static struct resource *bootmem_resource_free;
 static DEFINE_SPINLOCK(bootmem_resource_lock);
 
-static struct resource *next_resource(struct resource *p, bool sibling_only)
+static struct resource *next_resource(struct resource *p)
 {
-   /* Caller wants to traverse through siblings only */
-   if (sibling_only)
-   return p->sibling;
-
if (p->child)
return p->child;
while (!p->sibling && p->parent)
@@ -81,7 +77,7 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos)
 {
struct resource *p = v;
(*pos)++;
-   return (void *)next_resource(p, false);
+   return (void *)next_resource(p);
 }
 
 #ifdef CONFIG_PROC_FS
@@ -330,14 +326,10 @@ EXPORT_SYMBOL(release_resource);
  * of the resource that's within [@start..@end]; if none is found, returns
  * -ENODEV.  Returns -EINVAL for invalid parameters.
  *
- * This function walks the whole tree and not just first level children
- * unless @first_lvl is true.
- *
  * @start: start address of the resource searched for
  * @end:   end address of same resource
  * @flags: flags which the resource must have
  * @desc:  descriptor the resource must have
- * @first_lvl: walk only the first level children, if set
  * @res:   return ptr, if resource found
  *
  * The caller must specify @start, @end, @flags, and @desc
@@ -345,9 +337,8 @@ EXPORT_SYMBOL(release_resource);
  */
 static int find_next_iomem_res(resource_size_t start, resource_size_t end,
   unsigned long flags, unsigned long desc,
-  bool first_lvl, struct resource *res)
+  struct resource *res)
 {
-   bool siblings_only = true;
struct resource *p;
 
if (!res)
@@ -358,7 +349,7 @@ static int find_next_iomem_res(resource_size_t start, 
resource_size_t end,
 
	read_lock(&resource_lock);
 
-   for (p = iomem_resource.child; p; p = next_resource(p, siblings_only)) {
+   for (p = iomem_resource.child; p; p = next_resource(p)) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -369,13 +360,6 @@ static int find_next_iomem_res(resource_size_t start, 
resource_size_t end,
if (p->end < start)
continue;
 
-   /*
-* Now that we found a range that matches what we look for,
-* check the flags and the descriptor. If we were not asked to
-* use only the first level, start looking at children as well.
-*/
-   siblings_only = first_lvl;
-
if ((p->flags & flags) != flags)
continue;
if ((desc != IORES_DESC_NONE) && (desc != p->desc))
@@ -402,14 +386,14 @@ static int find_next_iomem_res(resource_size_t start, 
resource_size_t end,
 
 static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
 unsigned long flags, unsigned long desc,
-bool first_lvl, void *arg,
+void *arg,
 int (*func)(struct resource *, void *))
 {
struct resource res;
int ret = -EINVAL;
 
while (start < end &&
-  !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
+  !find_next_iomem_res(start, end, flags, desc, &res)) {
	ret = (*func)(&res, arg);
if (ret)
break;
@@ -431,7 +415,6 @@ static int __walk_iomem_res_desc(resource_size_t start, 
resource_size_t end,
  * @arg: function argument for the callback @func

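The simplified next_resource() above always walks depth-first: children
before siblings, then back up until a sibling exists. A self-contained toy
model of that traversal order (plain userspace C, illustrative only, not
kernel code):

	#include <stdio.h>

	struct res { const char *name; struct res *parent, *sibling, *child; };

	/* Same shape as the simplified kernel helper. */
	static struct res *next_resource(struct res *p)
	{
		if (p->child)
			return p->child;
		while (!p->sibling && p->parent)
			p = p->parent;
		return p->sibling;
	}

	int main(void)
	{
		struct res b = { "B" }, a1 = { "A1" };
		struct res a = { "A", NULL, &b, &a1 };	/* sibling B, child A1 */

		a1.parent = &a;

		for (struct res *p = &a; p; p = next_resource(p))
			printf("%s\n", p->name);	/* prints: A A1 B */
		return 0;
	}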