Re: [PATCH 0/4] kdump: crashkernel reservation from CMA
On 06.12.23 12:08, Philipp Rudo wrote: On Fri, 1 Dec 2023 17:59:02 +0100 Michal Hocko wrote: On Fri 01-12-23 16:51:13, Philipp Rudo wrote: On Fri, 1 Dec 2023 12:55:52 +0100 Michal Hocko wrote: On Fri 01-12-23 12:33:53, Philipp Rudo wrote: [...] And yes, those are all what-if concerns but unfortunately that is all we have right now. Should theoretical concerns without actual evidence (e.g. multiple drivers known to be broken) become a roadblock for this otherwise useful feature? Those concerns aren't just theoretical. They are experiences we have from a related feature that suffers exactly the same problem regularly, which wouldn't exist if everybody would simply work "properly". What is the related feature? kexec And yes, even purely theoretical concerns can become a roadblock for a feature when the cost of those theoretical concerns exceeds the benefit of the feature. The thing is that bugs will be reported against kexec. So _we_ need to figure out which of the shitty drivers caused the problem. That puts additional burden on _us_. What we are trying to evaluate at the moment is if the benefit outweighs the extra burden with the information we have at the moment. I do understand your concerns! But I am pretty sure you do realize that it is really hard to argue about theoreticals. Let me restate what I consider facts. Hopefully we can agree on these points - the CMA region can be used by user space memory which is a great advantage because the memory is not wasted and our experience has shown that users do care about this a lot. We _know_ that pressure on making those reservations smaller results in a less reliable crashdump and more resources spent on tuning and testing (especially after major upgrades). A larger reservation which is not completely wasted for the normal runtime is addressing that concern. - There is no other known mechanism to achieve the reusability of the crash kernel memory to stop the wastage without much more intrusive code/api impact (e.g. 
a separate zone or dedicated interface to prevent any hazardous usage like RDMA). - implementation-wise the patch has a very small footprint. It is using an existing infrastructure (CMA) and it adds a minimal hooking into crashkernel configuration. - The only identified risk so far is RDMA acting on this memory without using the proper pinning interface. If it helps to have a statement from RDMA maintainers/developers then we can pull them in for a further discussion of course. - The feature requires an explicit opt-in so this doesn't bring any new risk to existing crash kernel users until they decide to use it. AFAIU there is no way to tell that the crash kernel memory used to be CMA based in the primary kernel. If you believe that having that information available for debuggability would help then I believe this shouldn't be hard to add. I think it would even make sense to mark this feature experimental to make it clear to users that this needs some time before it can be marked production ready. I hope I haven't really missed anything important. If I understand Documentation/core-api/pin_user_pages.rst correctly you missed case 1, Direct IO. In that case "short term" DMA is allowed for pages without FOLL_LONGTERM. Meaning that there is a way you can corrupt the CMA and with that the crash kernel after the production kernel has panicked. With that I don't see a chance this series can be included unless someone can explain to me that the documentation is wrong or that I understood it wrong. I think you are right. We'd have to disallow any FOLL_PIN on these CMA pages, or find other ways of handling that (detect that there are no short-term pins anymore). But, I'm also wondering how MMU-notifier-based approaches might interfere, where CMA pages might be transparently mapped into secondary MMUs, possibly having DMA going on. Are we sure that all these secondary MMUs are inactive as soon as we kexec? 
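To make the concern concrete, here is a minimal userspace model of the pinning rules being discussed (all names are invented for illustration; this is not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: a page in the CMA-backed crashkernel area, with
 * separate short-term and long-term pin counts. */
struct page_model {
    int short_term_pins;   /* e.g. direct I/O, no FOLL_LONGTERM */
    int long_term_pins;    /* e.g. RDMA with FOLL_LONGTERM */
};

/* FOLL_LONGTERM pins must be refused on CMA (the kernel already migrates
 * pages off CMA for long-term pins).  Short-term pins are allowed -- and
 * that is exactly the hole Philipp points out: DMA may still be in
 * flight when the kernel panics. */
bool pin_page(struct page_model *p, bool longterm)
{
    if (longterm)
        return false;          /* rejected, as for CMA today */
    p->short_term_pins++;      /* allowed: direct I/O case */
    return true;
}

/* Handing the region to the crash kernel is only safe with no pins. */
bool safe_to_hand_over(const struct page_model *p)
{
    return p->short_term_pins == 0 && p->long_term_pins == 0;
}
```

This is only a sketch of the invariant under discussion, not a proposed fix.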
-- Cheers, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
On 28.09.23 15:22, Dave Hansen wrote: On 9/27/23 09:13, Stanislav Kinsburskii wrote: Once deposited, these pages can't be accessed by Linux anymore and thus must be preserved in "used" state across kexec, as hypervisor state is unaware of kexec. If Linux can't access them, they're not RAM any more. I'd much rather remove them from the memory map and move on with life rather than implement a bunch of new ABI that's got to be handed across kernels. The motivation of handling kexec (faster?) in a Hyper-V domain doesn't sound particularly compelling to me for such features. If you inflated memory, just don't allow kexec. It's been broken for years IIUC. Maybe the other use cases are more "relevant". -- Cheers, David / dhildenb
Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
On 28.09.23 12:25, Baoquan He wrote: On 09/27/23 at 09:13am, Stanislav Kinsburskii wrote: On Wed, Sep 27, 2023 at 01:44:38PM +0800, Baoquan He wrote: Hi Stanislav, On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote: This patch introduces a memory allocator specifically tailored for persistent memory within the kernel. The allocator maintains kernel-specific states like DMA passthrough device states, IOMMU state, and more across kexec. Can you give more details about how this persistent memory pool will be utilized in an actual scenario? I mean, what problem have you met so that you have to introduce a persistent memory pool to solve it? The major reason we have at the moment is that a Linux root partition running on top of the Microsoft hypervisor needs to deposit pages to the hypervisor at runtime, when the hypervisor runs out of memory. "Depositing" here means that Linux passes a set of its PFNs to the hypervisor via hypercall, and the hypervisor then uses these pages for its own needs. Once deposited, these pages can't be accessed by Linux anymore and thus must be preserved in "used" state across kexec, as hypervisor state is unaware of kexec. At the same time, these pages can be withdrawn when unused. Thus, an allocator persistent across kexec looks reasonable for this particular matter. Thanks for these details. The deposit and withdraw remind me of the balloon driver, David's virtio-mem, and DLPAR on ppc, which can hot add or shrink physical memory on the guest OS. Can't the Microsoft hypervisor do a similar thing to reclaim or give back the memory from or to the 'Linux root partition' running on top of the hypervisor? virtio-mem was designed with kexec support in mind. You only expose the initial memory to the second kernel, and that memory can never have such holes. That does not apply to memory ballooning implementations, like Hyper-V dynamic memory. 
In the virtio-mem paper I have the following: "In our experiments, Hyper-V VMs crashed reliably when trying to use kexec under Linux for fast OS reboots with an inflated balloon. Other memory ballooning mechanisms either have to temporarily deflate the whole balloon or allow access to inflated memory, which is undesired in cloud environments." I remember XEN does something elaborate, whereby they allow access to all inflated memory during reboot, but limit the total number of pages they will hand out. IIRC, you then have to work around things like "Windows initializes all memory with 0s when booting, and cope with that". So there are ways hypervisors have handled that in the past. -- Cheers, David / dhildenb
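The deposit/withdraw lifecycle described in this thread can be modeled minimally like this (a userspace sketch with invented names; the real Hyper-V hypercall interface differs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of pages handed to a hypervisor.  The key property being
 * argued about: this set is state that would have to survive kexec. */
#define MAX_PFNS 64
static bool deposited[MAX_PFNS];

/* Page handed to the hypervisor; Linux may no longer touch it. */
bool deposit_pfn(uint64_t pfn)
{
    if (pfn >= MAX_PFNS || deposited[pfn])
        return false;
    deposited[pfn] = true;
    return true;
}

/* Page returned to Linux once the hypervisor no longer needs it. */
bool withdraw_pfn(uint64_t pfn)
{
    if (pfn >= MAX_PFNS || !deposited[pfn])
        return false;
    deposited[pfn] = false;
    return true;
}
```

Dave Hansen's counter-position amounts to: instead of persisting `deposited[]` across kernels, simply remove those ranges from the next kernel's memory map.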
Re: [PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
On 12.09.23 09:47, Adrian Hunter wrote: On 12/09/23 10:19, David Hildenbrand wrote: On 11.09.23 13:21, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not let /proc/vmcore try to access unaccepted memory because it can cause the guest to fail. Oh, hold on. What are the actual side effects of this? Once we're in the kdump kernel, any guest is already dead. So failing a guest doesn't apply, no? Unaccepted memory is used by virtual machines. In this case the guest has kexec'ed to a dump-capture kernel, so the virtual machine is still alive and running the dump-capture kernel. Ah, I got lost in TDX host semantics. So what you're saying is: if we (the guest) read unaccepted memory, we will get zapped. Makes sense. -- Cheers, David / dhildenb
Re: [PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
On 11.09.23 13:21, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not let /proc/vmcore try to access unaccepted memory because it can cause the guest to fail. Oh, hold on. What are the actual side effects of this? Once we're in the kdump kernel, any guest is already dead. So failing a guest doesn't apply, no? -- Cheers, David / dhildenb
Re: [PATCH V2 1/2] efi/unaccepted: Do not let /proc/vmcore try to access unaccepted memory
On 11.09.23 13:21, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not let /proc/vmcore try to access unaccepted memory because it can cause the guest to fail. For /proc/vmcore, which is read-only, this means a read or mmap of unaccepted memory will return zeros. Signed-off-by: Adrian Hunter --- [...] +static inline bool pfn_is_unaccepted_memory(unsigned long pfn) +{ + phys_addr_t paddr = pfn << PAGE_SHIFT; + + return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE); +} + #endif /* _LINUX_MM_H */ As stated, if the relevant table is not already properly populated with information about unaccepted memory by the first kernel, this probably logically belongs into Kirill's series. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V2 2/2] proc/kcore: Do not try to access unaccepted memory
On 11.09.23 13:21, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not try to access unaccepted memory because it can cause the guest to fail. For /proc/kcore, which is read-only and does not support mmap, this means a read of unaccepted memory will return zeros. Signed-off-by: Adrian Hunter --- fs/proc/kcore.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Changes in V2: Change patch subject and commit message Do not open code pfn_is_unaccepted_memory() diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 23fc24d16b31..6422e569b080 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -546,7 +546,8 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter) * and explicitly excluded physical ranges. */ if (!page || PageOffline(page) || - is_page_hwpoison(page) || !pfn_is_ram(pfn)) { + is_page_hwpoison(page) || !pfn_is_ram(pfn) || + pfn_is_unaccepted_memory(pfn)) { if (iov_iter_zero(tsz, iter) != tsz) { ret = -EFAULT; goto out; Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
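For readers following along, the logic of the `pfn_is_unaccepted_memory()` helper used in both patches can be exercised standalone with a stubbed-out check (the real `range_contains_unaccepted_memory()` consults the EFI unaccepted-memory bitmap; the [2 MiB, 4 MiB) range below is made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Stand-in for the EFI-backed check: pretend only [2 MiB, 4 MiB)
 * has not been accepted yet. */
static bool range_contains_unaccepted_memory(uint64_t start, uint64_t end)
{
    const uint64_t ua_start = 2ULL << 20, ua_end = 4ULL << 20;
    return start < ua_end && end > ua_start;   /* interval overlap */
}

/* Same shape as the helper added by the patch: test the one page
 * starting at pfn << PAGE_SHIFT. */
static bool pfn_is_unaccepted_memory(unsigned long pfn)
{
    uint64_t paddr = (uint64_t)pfn << PAGE_SHIFT;
    return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
}
```

With 4 KiB pages, pfn 512 is the first page of the stubbed range and pfn 1024 is the first page past it.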
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 11.09.23 12:05, Kirill A. Shutemov wrote: On Mon, Sep 11, 2023 at 11:50:31AM +0200, David Hildenbrand wrote: On 11.09.23 11:27, Kirill A. Shutemov wrote: On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote: On 11.09.23 10:41, Kirill A. Shutemov wrote: On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote: On 06.09.23 09:39, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not map unaccepted memory because it can cause the guest to fail. For /proc/vmcore, which is read-only, this means a read or mmap of unaccepted memory will return zeros. Does a second (kdump) kernel that exposes /proc/vmcore reliably get access to the information whether memory of the first kernel is unaccepted (IOW, not its memory, but the memory of the first kernel it is supposed to expose via /proc/vmcore)? There are a few patches in my queue for a few related issues, but generally, yes, the information is available to the target kernel via the EFI configuration table. I assume that table is provided by the first kernel, and not read directly from HW, correct? The table is constructed by the EFI stub in the first kernel based on the EFI memory map. Okay, should work then once that's done by the first kernel. Maybe include this patch in your series? Can do. But the other two patches are not related to kexec. Hm. Yes, the others can go in separately. But this here really needs the other kexec/kdump changes. -- Cheers, David / dhildenb
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 11.09.23 11:27, Kirill A. Shutemov wrote: On Mon, Sep 11, 2023 at 10:42:51AM +0200, David Hildenbrand wrote: On 11.09.23 10:41, Kirill A. Shutemov wrote: On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote: On 06.09.23 09:39, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not map unaccepted memory because it can cause the guest to fail. For /proc/vmcore, which is read-only, this means a read or mmap of unaccepted memory will return zeros. Does a second (kdump) kernel that exposes /proc/vmcore reliably get access to the information whether memory of the first kernel is unaccepted (IOW, not its memory, but the memory of the first kernel it is supposed to expose via /proc/vmcore)? There are a few patches in my queue for a few related issues, but generally, yes, the information is available to the target kernel via the EFI configuration table. I assume that table is provided by the first kernel, and not read directly from HW, correct? The table is constructed by the EFI stub in the first kernel based on the EFI memory map. Okay, should work then once that's done by the first kernel. Maybe include this patch in your series? -- Cheers, David / dhildenb
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 11.09.23 10:41, Kirill A. Shutemov wrote: On Mon, Sep 11, 2023 at 10:03:36AM +0200, David Hildenbrand wrote: On 06.09.23 09:39, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not map unaccepted memory because it can cause the guest to fail. For /proc/vmcore, which is read-only, this means a read or mmap of unaccepted memory will return zeros. Does a second (kdump) kernel that exposes /proc/vmcore reliably get access to the information whether memory of the first kernel is unaccepted (IOW, not its memory, but the memory of the first kernel it is supposed to expose via /proc/vmcore)? There are a few patches in my queue for a few related issues, but generally, yes, the information is available to the target kernel via the EFI configuration table. I assume that table is provided by the first kernel, and not read directly from HW, correct? -- Cheers, David / dhildenb
Re: [PATCH 3/3] /dev/mem: Do not map unaccepted memory
On 07.09.23 16:46, Dave Hansen wrote: On 9/7/23 07:25, Kirill A. Shutemov wrote: On Thu, Sep 07, 2023 at 07:15:21AM -0700, Dave Hansen wrote: On 9/6/23 00:39, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not map unaccepted memory because it can cause the guest to fail. Doesn't /dev/mem already provide a billion ways for someone to shoot themselves in the foot? TDX seems to have added the 1,000,000,001st. Is this really worth patching? Is it better to let TD die silently? I don't think so. First, let's take a look at all of the distro kernels that folks will run under TDX. Do they have STRICT_DEVMEM set? For virtio-mem, we do config VIRTIO_MEM ... depends on EXCLUSIVE_SYSTEM_RAM Which in turn: config EXCLUSIVE_SYSTEM_RAM ... depends on !DEVMEM || STRICT_DEVMEM Not supported on all archs, but at least on RHEL9 on x86_64 and aarch64. So, making unaccepted memory similarly depend on "!DEVMEM || STRICT_DEVMEM" does not sound too far off ... -- Cheers, David / dhildenb
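Following the virtio-mem precedent quoted above, the suggested dependency would look roughly like this as a Kconfig fragment (a sketch only; whether to hang it off the existing `UNACCEPTED_MEMORY` symbol, and what else that symbol selects, is left open here):

```kconfig
config UNACCEPTED_MEMORY
	bool
	# ... existing dependencies elided ...
	# Mirror what EXCLUSIVE_SYSTEM_RAM demands for virtio-mem:
	# never let a writable /dev/mem poke at unaccepted ranges.
	depends on !DEVMEM || STRICT_DEVMEM
```

The design choice is the same one virtio-mem made: rather than teaching /dev/mem about every hazardous range, forbid the unrestricted /dev/mem configuration outright when the feature is enabled.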
Re: [PATCH 1/3] proc/vmcore: Do not map unaccepted memory
On 06.09.23 09:39, Adrian Hunter wrote: Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not map unaccepted memory because it can cause the guest to fail. For /proc/vmcore, which is read-only, this means a read or mmap of unaccepted memory will return zeros. Does a second (kdump) kernel that exposes /proc/vmcore reliably get access to the information whether memory of the first kernel is unaccepted (IOW, not its memory, but the memory of the first kernel it is supposed to expose via /proc/vmcore)? I recall there might be other kdump-related issues for TDX and friends to solve. Especially, which information the second kernel gets provided by the first kernel. So can this patch even be tested reasonably (IOW, get into a kdump kernel in an environment where the first kernel has unaccepted memory, and verify that unaccepted memory is handled accordingly ... while kdump does anything reasonable in such an environment at all)? -- Cheers, David / dhildenb
Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support
On 26.10.22 16:48, Baoquan He wrote: On 10/25/22 at 12:31pm, Borislav Petkov wrote: On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote: The concern about the range number is mainly on virt guest systems. And why would virt emulate 1K hotpluggable DIMM slots and not emulate a real machine? IIRC, ACPI only allows for 256 slots. PPC dlpar might provide more. Well, currently, mem hotplug is an important feature on virt systems to dynamically grow/shrink memory on the system. If we only emulated a real machine, it would be no different from a bare metal system. IIRC, the balloon driver or virtio-mem feature can add a memory board, e.g. 1G, block size 128M, 8 blocks added. When shrinking this 1G memory later, it will take a best-effort way to hot remove memory. Meaning, if any memory block is occupied, it will be kept there. In the end we might only be able to remove every second block, 4 blocks altogether. Then the remaining un-removed blocks will produce 4 separate memory regions. Like this, a virt guest could have many memory regions in the kernel after memory is added/removed. If I am wrong, please correct me, David. Yes, virtio-mem (but also PPC dlpar) can result in many individual memory blocks with holes in between after hotunplug. Hotplug OTOH usually tries to "plug" these holes and reduce the total number of memory blocks. It might be rare that our range will be heavily fragmented after unplug, but it's certainly possible. [...] Yes, now assume we have a HPE SGI system and it has memory hotplug capacity. The system itself already has more than 1024 memory regions. Then when we hot add an extra memory board, we want to include the newly added memory regions in the elfcorehdr so that they will be dumped out in the kdump kernel. That's why I earlier suggested 2048 for the number of memory regions. The more the better, unless "it hurts". Assuming a single memory block is 128 MiB, that would be 256 GiB. Usually, on big systems, the memory block size is 2 GiB. So 4 TiB. 
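The arithmetic behind those coverage numbers can be spelled out as a quick sketch (pure illustration of the back-of-the-envelope figures above):

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1ULL << 20)
#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/* Total memory describable by nr_regions elfcorehdr entries, if each
 * entry covers exactly one memory block of the given size. */
static uint64_t covered(uint64_t nr_regions, uint64_t block_size)
{
    return nr_regions * block_size;
}
```

With 2048 regions: 2048 x 128 MiB = 256 GiB, and 2048 x 2 GiB = 4 TiB, matching the figures in the discussion.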
-- Thanks, David / dhildenb
Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
On 29.08.22 05:07, Linus Torvalds wrote: > On Sun, Aug 28, 2022 at 6:56 PM Dave Young wrote: >> >>> John mentioned PANIC_ON(). >> >> I would vote for PANIC_ON(), it sounds like a good idea, because >> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that >> this will cause a kernel panic and one will be more careful before >> using it. > > People, NO. > > We're trying to get rid of BUG_ON() because it kills the machine. > > Not replace it with another bogus thing that kills a machine. > > So no PANIC_ON(). We used to have "panic()" many many years ago, we > got rid of it. We're not re-introducing it. > > People who want to panic on warnings can do so. WARN_ON() _becomes_ > PANIC for those people. But those people are the "we have a million > machines, we want to just fail things on any sign of trouble, and we > have MIS people who can look at the logs". > > And it's not like we need to get rid of _all_ BUG_ON() cases. If you > have a "this is major internal corruption, there's no way we can > continue", then BUG_ON() is appropriate. It will try to kill that > process and try to keep the machine running, and again, the kind of > people who don't care about one machine (because - again - they have > millions of them) can just turn that into a panic-and-reboot > situation. > > But the kind of people for whom the machine they are on IS THEIR ONLY > MACHINE - whether it be a workstation, a laptop, or a cellphone - > there is absolutely zero situation where "let's just kill the machine" > is *EVER* approproate. Even a BUG_ON() will try to continue as well as > it can after killing the current thread, but it's going to be iffy, > because locking etc. > > So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for > "oops, I really don't know what to do, and I physically *cannot* > continue" (and that is *not* "I'm too lazy to do error handling"). > > There is no room for PANIC. None. Ever. 
Let me be clearer about what I had in mind, avoiding the PANIC_ON terminology John raised. I was wondering if it would make sense to 1) Be able to specify a severity for WARN (developer decision) 2) Be able to specify a severity for panic_on_warn (admin decision) Distributions with kdump could keep a mode whereby severe warnings (e.g., former BUG_ON) would properly kdump+reboot and harmless warnings (e.g., clean recovery path) would WARN once + continue. I agree that whether to panic should in most cases be a decision of the admin, not the developer. Now, that's a different discussion than the documentation update at hand, and I primarily wanted to raise awareness among the kdump people, and ask them whether a stronger move towards WARN_ON_ONCE will affect them/customer expectations. I'll work with John to document the current rules to reflect everything you said here. -- Thanks, David / dhildenb
Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
On 26.08.22 03:43, Dave Young wrote: > Hi David, > > [Added more people in cc] > Hi Dave, thanks for your input! [...] >> Side note: especially with kdump I feel like we might see much more >> widespread use of panic_on_warn to be able to actually extract debug >> information in a controlled manner -- for example on enterprise distros. >> ... which would then make these systems more likely to crash, because >> there is no way to distinguish a rather harmless warning from a severe >> warning :/ . But let's see if some kdump folks will share their >> opinion as reply to the cover letter. > I can understand the intention of this patch, and I totally agree that > BUG() should be used carefully, this is a good proposal if we can > clearly define the standard about when to use BUG(). But I do have Essentially, the general rule from Linus is "absolutely no new BUG_ON() calls ever" -- but I think the consensus in that thread was that there are corner cases when it comes to unavoidable data corruption/security issues. And these are rare cases, not the usual case where we'd have used BUG_ON()/VM_BUG_ON(). > some worries, I think this standard is different for different sub > components, it is not clear to me at least, so this may introduce an > unstable running kernel and cause troubles (e.g. data corruption) with > a WARN instead of a BUG. Probably it would be better to say "Do not > WARN lightly, and do not hesitate to use BUG if it is really needed"? Well, I don't make the rules, I document them and share them for general awareness/comments :) Documenting this is valuable, because there seem to be quite some different opinions floating around in the community -- and I've been learning different rules from different people over the years. > > About "panic_on_warn", it will depend on the admin/end user to set it, > it is not a good idea for a distribution to set it. 
It seems we are > leaving it to end users to take the risk of a kernel panic even with > all kernel WARNs even if it is sometimes not necessary. My question would be what we could add/improve to keep systems with kdump armed running as expected for end users, that is most probably: 1) don't crash on a harmless WARN() that can just be reported and the machine will continue running mostly fine without real issues. 2) crash on severe issues (previously BUG) such that we can properly capture a system dump via kdump. Then restart the machine. Of course, once one would run into 2), one could try reproducing with "panic_on_warn" to get a reasonable system dump. But I guess that's not what enterprise customers expect. One wild idea (in the cover letter) was to add something new that can be configured by user space and that expresses that something is more severe than just some warning that can be recovered easily. But it can eventually be recovered to keep the system running to some degree. But still, it's configurable whether we want to trigger a panic or let the system run. John mentioned PANIC_ON(). What would be your expectation for kdump users, under which conditions do we want to trigger kdump and when not? Regarding panic_on_warn, how often do e.g. RHEL users observe warnings that we're not able to catch during testing, such that "panic_on_warn" would be a real no-go? -- Thanks, David / dhildenb
Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
On 24.08.22 23:59, John Hubbard wrote: > On 8/24/22 09:30, David Hildenbrand wrote: >> diff --git a/Documentation/process/coding-style.rst >> b/Documentation/process/coding-style.rst >> index 03eb53fd029a..a6d81ff578fe 100644 >> --- a/Documentation/process/coding-style.rst >> +++ b/Documentation/process/coding-style.rst >> @@ -1186,6 +1186,33 @@ expression used. For instance: >> #endif /* CONFIG_SOMETHING */ >> > > I like the idea of adding this documentation, and this is the right > place. Naturally, if one likes something, one must immediately change > it. :) Therefore, here is an alternative writeup that I think captures > what you and the email threads were saying. > > How's this sound? Much better, thanks! :) > > diff --git a/Documentation/process/coding-style.rst > b/Documentation/process/coding-style.rst > index 03eb53fd029a..32df0d503388 100644 > --- a/Documentation/process/coding-style.rst > +++ b/Documentation/process/coding-style.rst > @@ -1185,6 +1185,53 @@ expression used. For instance: > ... > #endif /* CONFIG_SOMETHING */ > > +22) Do not crash the kernel > +--- > + > +Use WARN() rather than BUG() > + > + > +Do not add new code that uses any of the BUG() variants, such as BUG(), > +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably > +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not > required > +if there is no reasonable way to at least partially recover. I'll tend to keep in this section: "Unavoidable data corruption / security issues might be a very rare exception to this rule and need good justification." Because there are rare exceptions, and I'd much rather document the clear exception to this rule. > + > +Use WARN_ON_ONCE() rather than WARN() or WARN_ON() > +** > + > +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is > +common for a given warning condition, if it occurs at all, to occur multiple > +times. (For example, once per file, or once per struct page.) 
This can fill > up I'll drop the "For example" part. I feel like this doesn't really need an example -- most probably we've all been there already when the kernel log was flooded :) > +and wrap the kernel log, and can even slow the system enough that the > excessive > +logging turns into its own, additional problem. > + > +Do not WARN lightly > +*** > + > +WARN*() is intended for unexpected, this-should-never-happen situations. > WARN*() > +macros are not to be used for anything that is expected to happen during > normal > +operation. These are not pre- or post-condition asserts, for example. Again: > +WARN*() must not be used for a condition that is expected to trigger easily, > for > +example, by user space actions. pr_warn_once() is a possible alternative, if > you > +need to notify the user of a problem. > + > +Do not worry about panic_on_warn users > +** > + > +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an > +available kernel option, and that many users set this option. This is why > there > +is a "Do not WARN lightly" writeup, above. However, the existence of > +panic_on_warn users is not a valid reason to avoid the judicious use WARN*(). > +That is because, whoever enables panic_on_warn has explicitly asked the > kernel > +to crash if a WARN*() fires, and such users must be prepared to deal with the > +consequences of a system that is somewhat more likely to crash. Side note: especially with kdump() I feel like we might see much more widespread use of panic_on_warn to be able to actually extract debug information in a controlled manner -- for example on enterprise distros. ... which would then make these systems more likely to crash, because there is no way to distinguish a rather harmless warning from a severe warning :/ . But let's see if some kdump() folks will share their opinion as reply to the cover letter. 
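The "warn once, recover, keep running" pattern the documentation recommends can be imitated in plain C like so (a userspace stand-in using a GCC statement expression; the kernel's real WARN_ON_ONCE() differs in implementation detail):

```c
#include <stdbool.h>
#include <stdio.h>

/* Userspace imitation of WARN_ON_ONCE(): report the condition at most
 * once per call site, return whether it fired, let the caller recover. */
#define WARN_ON_ONCE(cond) ({                        \
    static bool __warned;                            \
    bool __ret = (cond);                             \
    if (__ret && !__warned) {                        \
        __warned = true;                             \
        fprintf(stderr, "WARNING: %s\n", #cond);     \
    }                                                \
    __ret;                                           \
})

/* Example recovery path: fail the one request gracefully instead of
 * BUG()ing out.  panic_on_warn users would still get their crash. */
int handle_request(int len)
{
    if (WARN_ON_ONCE(len < 0))
        return -1;   /* recover and keep the system running */
    return len;
}
```

Note the once-only behavior: repeated bad requests still recover, but the log line is emitted a single time, which is exactly the log-flooding concern the text raises.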
-- Thanks, David / dhildenb
Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends
On 25.08.22 13:43, Jani Nikula wrote: > On Thu, 25 Aug 2022, David Hildenbrand wrote: >> On 24.08.22 18:52, Joe Perches wrote: >>> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote: >>>> checkpatch does not point out that VM_BUG_ON() and friends should be >>>> avoided, however, Linus notes: >>>> >>>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally >>>> no different, the only difference is "we can make the code smaller >>>> because these are less important". [1] >>>> >>>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it >>>> clearer that the kernel really shouldn't be crashed. >>>> >>>> Note that there are some other *_BUG_ON flavors, but they are not all >>>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then >>>> flags KVM as being buggy, so we'll not care about them for now here. >>> [] >>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl >>> [] >>>> @@ -4695,12 +4695,12 @@ sub process { >>>>} >>>>} >>>> >>>> -# avoid BUG() or BUG_ON() >>>> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) { >>>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends. >>>> + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) { >>> >>> Perhaps better as something like the below to pick up more variants >>> >> >> Trying to find more possible variants and exceptions > >> CI_BUG_ON( >> -> Bad with CONFIG_DRM_I915_DEBUG >> GEM_BUG_ON( >> -> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE > > These are hidden behind debug knobs that we use in our CI to > specifically catch "should not happen" cases fast and loud. Should not > be a problem for regular users. > I tend to agree but I don't think this is worth an exception. VM_BUG_ON also requires CONFIG_DEBUG_VM and absolutely shouldn't be used as I learned. Quoting Linus: Really. BUG_ON() IS NOT FOR DEBUGGING. [1] This kind of "I don't think this can happen" is _never_ an excuse for it. 
[2] For CI work, it might be sufficient to use WARN_ON_ONCE() combined with panic_on_warn. [1] https://lore.kernel.org/all/CAHk-=wiEAH+ojSpAgx_Ep=nkpwhu8ado3v56bxccsu97oyj...@mail.gmail.com/ [2] https://lore.kernel.org/all/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com/ -- Thanks, David / dhildenb
Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends
On 24.08.22 18:52, Joe Perches wrote: > On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote: >> checkpatch does not point out that VM_BUG_ON() and friends should be >> avoided, however, Linus notes: >> >> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally >> no different, the only difference is "we can make the code smaller >> because these are less important". [1] >> >> So let's warn on VM_BUG_ON() and friends as well. While at it, make it >> clearer that the kernel really shouldn't be crashed. >> >> Note that there are some other *_BUG_ON flavors, but they are not all >> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then >> flags KVM as being buggy, so we'll not care about them for now here. > [] >> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl > [] >> @@ -4695,12 +4695,12 @@ sub process { >> } >> } >> >> -# avoid BUG() or BUG_ON() >> -if ($line =~ /\b(?:BUG|BUG_ON)\b/) { >> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends. >> +if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) { > > Perhaps better as something like the below to pick up more variants > Trying to find more possible variants and exceptions $ git grep -h -o -E "\b[a-zA-Z]+_BUG(_ON(_[a-zA-Z]+)*)?\(" | sort | uniq AA_BUG( -> Ok, no BUG() ASM_BUG( -> Bad BUILD_BUG( BUILD_BUG_ON( BUILD_BUG_ON_INVALID( BUILD_BUG_ON_MSG( BUILD_BUG_ON_ZERO( -> Ok CI_BUG_ON( -> Bad with CONFIG_DRM_I915_DEBUG DCCP_BUG( DCCP_BUG_ON( -> Ok, no BUG() do_BUG( -> BUG implementation, ok. GEM_BUG_ON( -> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE GLOCK_BUG_ON( -> Bad handle_BUG( -> BUG implementation, ok. 
IDA_BUG_ON( KVM_BUG( KVM_BUG_ON( -> Ok, no BUG() lkdtm_BUG( paravirt_BUG( -> bad PROM_BUG( -> unused, will remove RWLOCK_BUG_ON( -> Ok, no BUG() snd_BUG( snd_BUG_ON( -> Ok, no BUG() SNIC_BUG_ON( -> Bad SPIN_BUG_ON( -> Ok, no BUG() UNWINDER_BUG( UNWINDER_BUG_ON( VIRTUAL_BUG_ON( VM_BUG_ON( VM_BUG_ON_FOLIO( VM_BUG_ON_MM( VM_BUG_ON_PAGE( VM_BUG_ON_PGFLAGS( VM_BUG_ON_VMA( XA_BUG_ON( -> Bad So an extended version of your proposal like (ignoring do_BUG and handle_BUG, people are smart enough to figure that out) if ($line =~ /\b(?!AA_|BUILD_|DCCP_|IDA_|KVM_|RWLOCK_|snd_|SPIN_)(?:[a-zA-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/ ? -- Thanks, David / dhildenb
Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends
On 24.08.22 18:52, Joe Perches wrote: > On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote: >> checkpatch does not point out that VM_BUG_ON() and friends should be >> avoided, however, Linus notes: >> >> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally >> no different, the only difference is "we can make the code smaller >> because these are less important". [1] >> >> So let's warn on VM_BUG_ON() and friends as well. While at it, make it >> clearer that the kernel really shouldn't be crashed. >> >> Note that there are some other *_BUG_ON flavors, but they are not all >> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then >> flags KVM as being buggy, so we'll not care about them for now here. > [] >> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl > [] >> @@ -4695,12 +4695,12 @@ sub process { >> } >> } >> >> -# avoid BUG() or BUG_ON() >> -if ($line =~ /\b(?:BUG|BUG_ON)\b/) { >> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends. >> +if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) { > > Perhaps better as something like the below to pick up more variants > > if ($line =~ > /\b(?!KVM_|BUILD_)(?:[A-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/ ... then I'll have to scan the other cases if they do something similar as KVM. ... well, okay, I'll do it. :) > >> my $msg_level = \&WARN; >> $msg_level = \&CHK if ($file); >> &{$msg_level}("AVOID_BUG", >> - "Avoid crashing the kernel - try using >> WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr); > > and maybe: > > "Do not crash the kernel unless it is > unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than > BUG() or variants\n" . $herecurr); > > Yes, thanks! -- Thanks, David / dhildenb
[PATCH RFC 0/2] coding-style.rst: document BUG() and WARN() rules
As it seems to be rather unclear if/when to use BUG(), BUG_ON(), VM_BUG_ON(), WARN_ON_ONCE(), ... let's try to document the result of a recent discussion. Details can be found in patch #1. -- Here is some braindump after thinking about BUG_ON(), WARN_ON(), ... and how they interact with kdump. I was wondering what the expectations on a system with armed kdump are, for example, after we removed most BUG_ON() instances and replaced them by WARN_ON_ONCE(). I would assume that we actually want to panic in some cases to capture a proper system dump instead of continuing and eventually ending up with a completely broken system where it's hard to extract any useful debug information. We'd have to enable panic_on_warn. But we'd only want to do that in case kdump is actually armed after boot. So one idea would be to have some kind of "panic_on_warn_with_kdump" mode. But then, we'd actually crash+kdump even on the most harmless WARN_ON() conditions, because they all look alike. To compensate, we would need some kind of "severity" levels of a warning -- at least some kind of "this is harmless and we can easily recover, but please tell the developers" vs. "this is real bad and unexpected, capture a dump immediately instead of trying to recover and eventually failing miserably". But then, maybe we really want something like BUG_ON() -- let's call it CBUG_ON() for simplicity -- but make it usable in conditionals (to implement recovery code if easily possible) and make the runtime behavior configurable. if (CBUG_ON(whatever)) try_to_recover() Whereby, for example, "panic_on_cbug" and "panic_on_cbug_with_kdump" could control the runtime behavior. But this is just a braindump and I assume people reading along have other, better ideas. Especially, a better name for CBUG.
Cc: Linus Torvalds Cc: Andrew Morton Cc: Ingo Molnar Cc: David Laight Cc: Jonathan Corbet Cc: Andy Whitcroft Cc: Joe Perches Cc: Dwaipayan Ray Cc: Lukas Bulwahn Cc: Baoquan He Cc: Vivek Goyal Cc: Dave Young David Hildenbrand (2): coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel") checkpatch: warn on usage of VM_BUG_ON() and friends Documentation/process/coding-style.rst | 27 ++ scripts/checkpatch.pl | 6 +++--- 2 files changed, 30 insertions(+), 3 deletions(-) -- 2.37.1
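The CBUG_ON() idea from the braindump above can be sketched roughly as follows. This is a userspace mock for illustration only: CBUG_ON and panic_on_cbug are the hypothetical names proposed in the cover letter, not existing kernel APIs, and abort() merely stands in for panic().

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical runtime knob; in the kernel this would be a sysctl. */
static bool panic_on_cbug;

/*
 * Mock of the proposed CBUG_ON(): warn once per call site, optionally
 * escalate to a crash, and return the condition so that callers can
 * attach recovery code in a conditional.
 */
#define CBUG_ON(cond) ({                                        \
	bool __c = (cond);                                      \
	if (__c) {                                              \
		static bool __warned;                           \
		if (!__warned) {                                \
			__warned = true;                        \
			fprintf(stderr, "CBUG at %s:%d\n",      \
				__FILE__, __LINE__);            \
		}                                               \
		if (panic_on_cbug)                              \
			abort();  /* stands in for panic() */   \
	}                                                       \
	__c;                                                    \
})

/* Example caller: recover instead of crashing when the knob is off. */
static int demo(int broken)
{
	if (CBUG_ON(broken))
		return -1;  /* the "try_to_recover()" path */
	return 0;
}
```

With panic_on_cbug (or the suggested panic_on_cbug_with_kdump) enabled, the same call site would escalate to a crash and capture a dump instead of taking the recovery path.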
[PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
Linus notes [1] that the introduction of new code that uses VM_BUG_ON() is just as bad as BUG_ON(), because it will crash the kernel on distributions that enable CONFIG_DEBUG_VM (like Fedora): VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is "we can make the code smaller because these are less important". [2] This resulted in a more generic discussion about usage of BUG() and friends. While there might be corner cases that still deserve a BUG_ON(), most BUG_ON() cases should simply use WARN_ON_ONCE() and implement a recovery path if reasonable: The only possible case where BUG_ON can validly be used is "I have some fundamental data corruption and cannot possibly return an error". [2] A very good approximation is the general rule: "absolutely no new BUG_ON() calls _ever_" [2] ... not even if something really shouldn't ever happen and is merely for documenting that an invariant always has to hold. There is only one good BUG_ON(): Now, that said, there is one very valid sub-form of BUG_ON(): BUILD_BUG_ON() is absolutely 100% fine. [2] While WARN will also crash the machine with panic_on_warn set, that's exactly to be expected: So we have two very different cases: the "virtual machine with good logging where a dead machine is fine" - use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users. [3] The basic idea is that warnings will similarly get reported by users and be found during testing. However, in contrast to a BUG(), there is a way to actually influence the expected behavior (e.g., panic_on_warn) and to eventually keep the machine alive to extract some debug info. Ingo notes that not all WARN_ON_ONCE cases need recovery. If we don't ever expect this code to trigger in any case, recovery code is not really helpful. I'd prefer to keep all these warnings 'simple' - i.e. no attempted recovery & control flow, unless we ever expect these to trigger.
[4] There have been different rules floating around that were never properly documented. Let's try to clarify. [1] https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=nkpwhu8ado3v56bxccsu97oyj...@mail.gmail.com [2] https://lore.kernel.org/r/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com [3] https://lore.kernel.org/r/CAHk-=wgF7K2gSSpy=m_=k3nov4zaceux9puqf1tjktjla2x...@mail.gmail.com Signed-off-by: David Hildenbrand --- Documentation/process/coding-style.rst | 27 ++ 1 file changed, 27 insertions(+) diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index 03eb53fd029a..a6d81ff578fe 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -1186,6 +1186,33 @@ expression used. For instance: #endif /* CONFIG_SOMETHING */ +22) Do not crash the kernel +--- + +Do not add new code that uses BUG(), BUG_ON(), VM_BUG_ON(), ... to crash +the kernel if certain conditions are not met. Use WARN_ON_ONCE() +with recovery code (if reasonable) instead. Unavoidable data corruption / +security issues might be a very rare exception to this rule and need good +justification. + +There is no need for WARN_ON_ONCE() recovery code if we never expect it to +possibly trigger unless something goes seriously wrong or some other code +is changed to break invariants. Note that VM_WARN_ON_ONCE() cannot be used +in conditionals. + +Usage of WARN() and friends is fine for something that is not expected to +be triggered easily. ``panic_on_warn`` users are not an excuse to not use +WARN(): whoever enables ``panic_on_warn`` explicitly asked the kernel to +crash in this case. + +However, WARN() and friends should not be used for something that is +expected to trigger easily, for example, by user space. pr_warn_once() +might be a reasonable replacement to notify the user. + +Note that BUILD_BUG_ON() is perfectly fine because it will make compilation +fail instead of crashing the kernel.
+ + Appendix I) References -- -- 2.37.1
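The documented rule, as a rough illustration in code. This is a userspace mock, since WARN_ON_ONCE() is kernel-only; the consume_item() function and the -22 (-EINVAL) return are made up for this example:

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal userspace stand-in for the kernel's WARN_ON_ONCE(). */
#define WARN_ON_ONCE(cond) ({                                   \
	bool __c = (cond);                                      \
	static bool __warned;                                   \
	if (__c && !__warned) {                                 \
		__warned = true;                                \
		fprintf(stderr, "WARNING at %s:%d\n",           \
			__FILE__, __LINE__);                    \
	}                                                       \
	__c;                                                    \
})

/*
 * Per the rule above: do not BUG_ON(!item) and kill the machine;
 * warn once and take a recovery path instead.
 */
static int consume_item(const int *item)
{
	if (WARN_ON_ONCE(!item))
		return -22;  /* -EINVAL: recover, keep the kernel alive */
	return *item;
}
```

A user who really wants the machine to die on such a condition can still opt in via panic_on_warn, which is exactly the split the new section describes.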
[PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends
checkpatch does not point out that VM_BUG_ON() and friends should be avoided, however, Linus notes: VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is "we can make the code smaller because these are less important". [1] So let's warn on VM_BUG_ON() and friends as well. While at it, make it clearer that the kernel really shouldn't be crashed. Note that there are some other *_BUG_ON flavors, but they are not all bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then flags KVM as being buggy, so we'll not care about them for now here. [1] https://lore.kernel.org/r/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com Signed-off-by: David Hildenbrand --- scripts/checkpatch.pl | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 79e759aac543..4c18acf17032 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -4695,12 +4695,12 @@ sub process { } } -# avoid BUG() or BUG_ON() - if ($line =~ /\b(?:BUG|BUG_ON)\b/) { +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends. + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) { my $msg_level = \&WARN; $msg_level = \&CHK if ($file); &{$msg_level}("AVOID_BUG", - "Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr); + "Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG(), BUG_ON(), VM_BUG_ON(), ...\n" . $herecurr); } # avoid LINUX_VERSION_CODE -- 2.37.1
Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support
On 01.06.22 00:25, Eric DeVolder wrote: > > > On 5/31/22 08:15, David Hildenbrand wrote: >> On 12.05.22 18:10, Eric DeVolder wrote: >>> David, >>> Great questions! See inline responses below. >>> eric >> >> Sorry for the late reply, travel and vacation ... > No problem, greatly appreciate the feedback! > eric > >> >>>> >>>>> + >>>>> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG) >>>>> +void __weak arch_crash_handle_hotplug_event(struct kimage *image, >>>>> + unsigned int hp_action, >>>>> unsigned int cpu) >>>>> +{ >>>>> + WARN(1, "crash hotplug handler not implemented"); >>>> >>>> >>>> Won't that trigger on any arch that has CONFIG_HOTPLUG_CPU and >>>> CONFIG_MEMORY_HOTPLUG? >>>> I mean, you only implement it for x86 later in this series. Or what else >>>> stops this WARN from >>>> triggering? >>>> >>> You're correct. What about: printk_once(KERN_DEBUG "...") ? >> >> Why even bother about printing anything? If the feature is not >> supported, there should be some way for user space to figure out that it >> still has to reload on hot(un)plug manually, no? > > I've changed this to WARN_ONCE(). If that isn't agreeable, I'll remove it. Please don't use WARN* on expected error paths. -- Thanks, David / dhildenb
Re: [PATCH v8 0/7] crash: Kernel handling of CPU and memory hot un/plug
On 26.05.22 15:39, Sourabh Jain wrote: > Hello Eric, > > On 26/05/22 18:46, Eric DeVolder wrote: >> >> >> On 5/25/22 10:13, Sourabh Jain wrote: >>> Hello Eric, >>> >>> On 06/05/22 00:15, Eric DeVolder wrote: When the kdump service is loaded, if a CPU or memory is hot un/plugged, the crash elfcorehdr (for x86), which describes the CPUs and memory in the system, must also be updated, else the resulting vmcore is inaccurate (e.g. missing either CPU context or memory regions). The current solution utilizes udev to initiate an unload-then-reload of the kdump image (e.g. kernel, initrd, boot_params, purgatory and elfcorehdr) by the userspace kexec utility. In previous posts I have outlined the significant performance problems related to offloading this activity to userspace. This patchset introduces a generic crash hot un/plug handler that registers with the CPU and memory notifiers. Upon CPU or memory changes, this generic handler is invoked and performs important housekeeping, for example obtaining the appropriate lock, and then invokes an architecture specific handler to do the appropriate updates. In the case of x86_64, the arch specific handler generates a new elfcorehdr, and overwrites the old one in memory. No involvement with userspace needed. To realize the benefits/test this patchset, one must make a couple of minor changes to userspace: - Disable the udev rule for updating kdump on hot un/plug changes. Add the following as the first two lines to the udev rule file /usr/lib/udev/rules.d/98-kexec.rules: >>> >>> If we can have a sysfs attribute to advertise this feature then >>> userspace >>> utilities (kexec tool/udev rules) can take action accordingly. In >>> short, it will >>> help us maintain backward compatibility. >>> >>> kexec tool can use the new sysfs attribute and allocate additional >>> buffer space >>> for elfcorehdr accordingly. Similarly, the checksum-related changes >>> can come >>> under this check.
>>> >>> Udev rule can use this sysfs file to decide kdump service reload is >>> required or not. >> >> Great idea. I've been working on the corresponding udev and >> kexec-tools changes and your input/idea here is quite timely. >> >> I have boolean "crash_hotplug" as a core_param(), so it will show up as: >> >> # cat /sys/module/kernel/parameters/crash_hotplug >> N > > How about using 0-1 instead of Y/N? > 0 = crash hotplug not supported > 1 = crash hotplug supported > > Also how about keeping sysfs here instead? > /sys/kernel/kexec_crash_hotplug It's not only about hotplug, though. And actually we care about onlining/offlining. Hmm, I wonder if there is a better name for this automatic handling of cpu and memory devices. -- Thanks, David / dhildenb
Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support
On 12.05.22 18:10, Eric DeVolder wrote: > David, > Great questions! See inline responses below. > eric Sorry for the late reply, travel and vacation ... >> >>> + >>> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG) >>> +void __weak arch_crash_handle_hotplug_event(struct kimage *image, >>> + unsigned int hp_action, >>> unsigned int cpu) >>> +{ >>> + WARN(1, "crash hotplug handler not implemented"); >> >> >> Won't that trigger on any arch that has CONFIG_HOTPLUG_CPU and >> CONFIG_MEMORY_HOTPLUG? >> I mean, you only implement it for x86 later in this series. Or what else >> stops this WARN from >> triggering? >> > You're correct. What about: printk_once(KERN_DEBUG "...") ? Why even bother about printing anything? If the feature is not supported, there should be some way for user space to figure out that it still has to reload on hot(un)plug manually, no? [...] > >> >> 2. Why can't the unprotect+reprotect be done inside >> arch_crash_handle_hotplug_event() ? It's all arch specific either way. >> >> IMHO, this code here should be as simple as >> >> if (kexec_crash_image) >> arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, cpu); >> > > The intent of this code was to be generic infrastructure. Just invoking the > arch_crash_handle_hotplug_event() would certainly be as generic as it gets. > But there were a series of steps that seemed to be common, so those I hoisted > into this bit of code. But most common parts are actually arch_* calls already? :) Anyhow, no strong opinion. > >> 3. Why do we have to forward the CPU for CPU onlining/offlining but not the >> memory block id (or similar) when onlining/offlining a memory block? > From patch "kexec: exclude hot remove cpu from elfcorehdr notes" commit > message: > > Due to use of CPUHP_AP_ONLINE_DYN, upon CPU unplug, the CPU is > still in the for_each_present_cpu() list when within the > handle_hotplug_event(). Thus the CPU must be explicitly excluded > when building the new list of CPUs.
> > This change identifies in handle_hotplug_event() the CPU to be > excluded, and the check for excluding the CPU in > crash_prepare_elf64_headers(). > > If there is a better CPUHP_ to use than _DYN, I'd be all for that! Ah okay, thanks. -- Thanks, David / dhildenb
Re: [PATCH v8 3/7] crash: add generic infrastructure for crash hotplug support
On 05.05.22 20:45, Eric DeVolder wrote: > CPU and memory change notifications are received in order to > regenerate the elfcorehdr. > > To support cpu hotplug, a callback is registered to capture the > CPUHP_AP_ONLINE_DYN online and offline events via > cpuhp_setup_state_nocalls(). > > To support memory hotplug, a notifier is registered to capture the > MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier(). > > The cpu callback and memory notifiers call handle_hotplug_event() > to handle the hot plug/unplug event. Then handle_hotplug_event() > dispatches the event to the architecture specific > arch_crash_handle_hotplug_event(). During the process, the > kexec_mutex is held. > > Signed-off-by: Eric DeVolder [...] > + > +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG) > +void __weak arch_crash_handle_hotplug_event(struct kimage *image, > + unsigned int hp_action, > unsigned int cpu) > +{ > + WARN(1, "crash hotplug handler not implemented"); Won't that trigger on any arch that has CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG? I mean, you only implement it for x86 later in this series. Or what else stops this WARN from triggering? > +} > + > +static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu) > +{ > + /* Obtain lock while changing crash information */ > + if (!mutex_trylock(&kexec_mutex)) > + return; This looks wrong. What if you offline memory but for some reason the mutex is currently locked? You'd miss updating the vmcore, which would be broken. Why is this trylock in place? Some workaround for potential locking issues, or what is the rationale? > + > + /* Check kdump is loaded */ > + if (kexec_crash_image) { > + pr_debug("crash hp: hp_action %u, cpu %u", hp_action, cpu); > + > + /* Needed in order for the segments to be updated */ > + arch_kexec_unprotect_crashkres(); > + > + /* Flag to differentiate between normal load and hotplug */ > + kexec_crash_image->hotplug_event = true; 1. Why is that required?
Why can't arch_crash_handle_hotplug_event() forward that information? I mean, *hotplug* in the name implies that the function should be aware what's happening. 2. Why can't the unprotect+reprotect be done inside arch_crash_handle_hotplug_event() ? It's all arch specific either way. IMHO, this code here should be as simple as if (kexec_crash_image) arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, cpu); 3. Why do we have to forward the CPU for CPU onlining/offlining but not the memory block id (or similar) when onlining/offlining a memory block? > + > + /* Now invoke arch-specific update handler */ > + arch_crash_handle_hotplug_event(kexec_crash_image, hp_action, > cpu); > + > + /* No longer handling a hotplug event */ > + kexec_crash_image->hotplug_event = false; > + > + /* Change back to read-only */ > + arch_kexec_protect_crashkres(); > + } > + > + /* Release lock now that update complete */ > + mutex_unlock(&kexec_mutex); > +} > + > +static int crash_memhp_notifier(struct notifier_block *nb, unsigned long > val, void *v) > +{ > + switch (val) { > + case MEM_ONLINE: > + handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY, 0); > + break; > + > + case MEM_OFFLINE: > + handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_MEMORY, 0); > + break; > + } > + return NOTIFY_OK; > +} > + > +static struct notifier_block crash_memhp_nb = { > + .notifier_call = crash_memhp_notifier, > + .priority = 0 > +}; > + > +static int crash_cpuhp_online(unsigned int cpu) > +{ > + handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu); > + return 0; > +} > + > +static int crash_cpuhp_offline(unsigned int cpu) > +{ > + handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu); > + return 0; > +} > + > +static int __init crash_hotplug_init(void) > +{ > + int result = 0; > + > + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) > + register_memory_notifier(&crash_memhp_nb); > + > + if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) > + result = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, > + > "crash/cpuhp", > + > crash_cpuhp_online,
> + > crash_cpuhp_offline); Ehm, this indentation looks very weird. > + > + return result; > +} > + > +subsys_initcall(crash_hotplug_init); > +#endif -- Thanks, David / dhildenb
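The trylock concern raised in the review above can be demonstrated with a small userspace sketch. A pthread mutex stands in for kexec's mutex and the names are illustrative, not from the patch: if another path happens to hold the lock when a hotplug event fires, the trylock pattern silently drops the event and the crash elfcorehdr goes stale.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t crash_lock = PTHREAD_MUTEX_INITIALIZER;
static int elfcorehdr_updates;

/* Mirrors the patch's pattern: bail out when the lock is contended. */
static bool handle_event_trylock(void)
{
	if (pthread_mutex_trylock(&crash_lock) != 0)
		return false;  /* event dropped, vmcore description stale */
	elfcorehdr_updates++;  /* stands in for regenerating elfcorehdr */
	pthread_mutex_unlock(&crash_lock);
	return true;
}
```

A blocking mutex_lock() (or retry/deferral logic) would instead guarantee that every event is eventually processed, which is the point of David's question.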
Re: [PATCH v8 2/7] crash: prototype change for crash_prepare_elf64_headers
On 05.05.22 20:45, Eric DeVolder wrote: > From within crash_prepare_elf64_headers() there is a need to > reference the struct kimage hotplug members. As such, this > change passes the struct kimage as a parameter to the > crash_prepare_elf64_headers(). You should make it clearer that the hotplug members will be added by successive patches. > > This is preparation for later patch, no functionality change. > > Signed-off-by: Eric DeVolder > Acked-by: Baoquan He Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 1/8] x86/crash: fix minor typo/bug in debug message
On 13.04.22 18:42, Eric DeVolder wrote: > The pr_debug() intends to display the memsz member, but the > parameter is actually the bufsz member (which is already > displayed). Correct this to display memsz value. > > Signed-off-by: Eric DeVolder > Acked-by: Baoquan He > --- Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v4 02/10] crash hp: Introduce CRASH_HOTPLUG configuration options
On 03.03.22 11:22, Baoquan He wrote: > On 03/02/22 at 10:20am, David Hildenbrand wrote: >> On 01.03.22 21:04, Eric DeVolder wrote: >>> >>> >>> On 2/22/22 21:25, Baoquan He wrote: >>>> On 02/09/22 at 02:56pm, Eric DeVolder wrote: >>>>> Support for CPU and memory hotplug for crash is controlled by the >>>>> CRASH_HOTPLUG configuration option, introduced by this patch. >>>>> >>>>> The CRASH_HOTPLUG_ELFCOREHDR_SZ related configuration option is >>>>> also introduced with this patch. >>>>> >>>>> Signed-off-by: Eric DeVolder >>>>> --- >>>>> arch/x86/Kconfig | 26 ++ >>>>> 1 file changed, 26 insertions(+) >>>>> >>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>>>> index ebe8fc76949a..4e3374edab02 100644 >>>>> --- a/arch/x86/Kconfig >>>>> +++ b/arch/x86/Kconfig >>>>> @@ -2060,6 +2060,32 @@ config CRASH_DUMP >>>>> (CONFIG_RELOCATABLE=y). >>>>> For more details see Documentation/admin-guide/kdump/kdump.rst >>>>> >>>>> +config CRASH_HOTPLUG >>>>> + bool "kernel updates of crash elfcorehdr" >>>>> + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) && KEXEC_FILE >>>>> + help >>>>> + Enable the kernel to update the crash elfcorehdr (which contains >>>>> + the list of CPUs and memory regions) directly when hot plug/unplug >>>>> + of CPUs or memory. Otherwise userspace must monitor these hot >>>>> + plug/unplug change notifications via udev in order to >>>>> + unload-then-reload the crash kernel so that the list of CPUs and >>>>> + memory regions is kept up-to-date. Note that the udev CPU and >>>>> + memory change notifications still occur (however, userspace is not >>>>> + required to monitor for crash dump purposes). >>>>> + >>>>> +config CRASH_HOTPLUG_ELFCOREHDR_SZ >>>>> + depends on CRASH_HOTPLUG >>>>> + int >>>>> + default 131072 >>>>> + help >>>>> + Specify the maximum size of the elfcorehdr buffer/segment. 
>>>>> + The 128KiB default is sized so that it can accommodate 2048 >>>>> + Elf64_Phdr, where each Phdr represents either a CPU or a >>>>> + region of memory. >>>>> + For example, this size can accommodate hotplugging a machine >>>>> + with up to 1024 CPUs and up to 1024 memory regions (e.g. 1TiB >>>>> + with 1024 1GiB memory DIMMs). >>>> >>>> This example of memory could be a little misleading. The memory regions >>>> may not be related to memory DIMMs. System could split them into many >>>> smaller regions during bootup. >>> >>> I changed "with 1024 1GiB memory DIMMs" to "with 1024 1GiB hotplug >>> memories". >>> eric >> >> It's still not quite precise. Essentially it's the individual "System >> RAM" entries in /proc/iomem >> >> Boot memory (i.e., a single DIMM) might be represented by multiple >> entries due to rearranged holes (by the BIOS). >> >> While hotplugged DIMMs (under virt!) are usually represented using a >> single range, it can be different on physical machines. Last but not >> least, dax/kmem and virtio-mem behave in a different way. > > Right. How about only mentioning the 'System RAM' entries in /proc/iomem > as below? It's just giving an example, talking about the details of > memory regions from each type may not be necessary here. People > interested can refer to code or document related to get it. > > > + default 131072 > + help > + Specify the maximum size of the elfcorehdr buffer/segment. > + The 128KiB default is sized so that it can accommodate 2048 > + Elf64_Phdr, where each Phdr represents either a CPU or a > + region of memory. > + For example, this size can accommodate hotplugging a machine > + with up to 1024 CPUs and up to 1024 memory regions which are > represented by 'System RAM' entries in /proc/iomem. Maybe changing the last paragraph to: "For example, this size can accommodate a machine with up to 1024 CPUs and up to 1024 memory regions, for example, as represented by 'System RAM' entries in /proc/iomem."
-- Thanks, David / dhildenb
Re: [PATCH v4 02/10] crash hp: Introduce CRASH_HOTPLUG configuration options
On 01.03.22 21:04, Eric DeVolder wrote: > > > On 2/22/22 21:25, Baoquan He wrote: >> On 02/09/22 at 02:56pm, Eric DeVolder wrote: >>> Support for CPU and memory hotplug for crash is controlled by the >>> CRASH_HOTPLUG configuration option, introduced by this patch. >>> >>> The CRASH_HOTPLUG_ELFCOREHDR_SZ related configuration option is >>> also introduced with this patch. >>> >>> Signed-off-by: Eric DeVolder >>> --- >>> arch/x86/Kconfig | 26 ++ >>> 1 file changed, 26 insertions(+) >>> >>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>> index ebe8fc76949a..4e3374edab02 100644 >>> --- a/arch/x86/Kconfig >>> +++ b/arch/x86/Kconfig >>> @@ -2060,6 +2060,32 @@ config CRASH_DUMP >>> (CONFIG_RELOCATABLE=y). >>> For more details see Documentation/admin-guide/kdump/kdump.rst >>> >>> +config CRASH_HOTPLUG >>> + bool "kernel updates of crash elfcorehdr" >>> + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) && KEXEC_FILE >>> + help >>> + Enable the kernel to update the crash elfcorehdr (which contains >>> + the list of CPUs and memory regions) directly when hot plug/unplug >>> + of CPUs or memory. Otherwise userspace must monitor these hot >>> + plug/unplug change notifications via udev in order to >>> + unload-then-reload the crash kernel so that the list of CPUs and >>> + memory regions is kept up-to-date. Note that the udev CPU and >>> + memory change notifications still occur (however, userspace is not >>> + required to monitor for crash dump purposes). >>> + >>> +config CRASH_HOTPLUG_ELFCOREHDR_SZ >>> + depends on CRASH_HOTPLUG >>> + int >>> + default 131072 >>> + help >>> + Specify the maximum size of the elfcorehdr buffer/segment. >>> + The 128KiB default is sized so that it can accommodate 2048 >>> + Elf64_Phdr, where each Phdr represents either a CPU or a >>> + region of memory. >>> + For example, this size can accommodate hotplugging a machine >>> + with up to 1024 CPUs and up to 1024 memory regions (e.g. 1TiB >>> + with 1024 1GiB memory DIMMs). 
>> >> This example of memory could be a little misleading. The memory regions >> may not be related to memory DIMMs. System could split them into many >> smaller regions during bootup. > > I changed "with 1024 1GiB memory DIMMs" to "with 1024 1GiB hotplug memories". > eric It's still not quite precise. Essentially it's the individual "System RAM" entries in /proc/iomem Boot memory (i.e., a single DIMM) might be represented by multiple entries due to rearranged holes (by the BIOS). While hotplugged DIMMs (under virt!) are usually represented using a single range, it can be different on physical machines. Last but not least, dax/kmem and virtio-mem behave in a different way. -- Thanks, David / dhildenb
[PATCH v2] proc/vmcore: fix possible deadlock on concurrent mmap and read
0] ksys_mmap_pgoff+0x178/0x200 [6.428701] do_syscall_64+0x43/0x90 [6.429126] entry_SYSCALL_64_after_hwframe+0x44/0xae [6.429745] RIP: 0033:0x7fc7359b8fc7 [6.430157] Code: 00 00 00 89 ef e8 69 b3 ff ff eb e4 e8 c2 64 01 00 66 90 f3 0f 1e fa 41 89 ca 41 f7 c1 ff 0f 00 00 75 10 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 21 c3 48 8b 05 21 7e 0e 00 64 c7 00 16 00 00 [6.432147] RSP: 002b:7fff35b4c208 EFLAGS: 0246 ORIG_RAX: 0009 [6.432970] RAX: ffda RBX: 0001 RCX: 7fc7359b8fc7 [6.433746] RDX: 0001 RSI: 0040 RDI: [6.434529] RBP: 55a1125ecf10 R08: 0003 R09: 2000 [6.435310] R10: 0002 R11: 0246 R12: 2000 [6.436093] R13: 0040 R14: 55a1124269e2 R15: [6.436887] Reported-by: Baoquan He Fixes: cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks") Cc: Andrew Morton Cc: Baoquan He Cc: Vivek Goyal Cc: Dave Young Cc: "Paul E. McKenney" Cc: Josh Triplett Cc: Peter Zijlstra Cc: Boqun Feng Signed-off-by: David Hildenbrand --- Was: [PATCH v1] proc/vmcore: fix false positive lockdep warning v1 -> v2: - Adjust subject/description - Add Fixes: --- fs/proc/vmcore.c | 41 ++--- 1 file changed, 22 insertions(+), 19 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 702754dd1daf..edeb01dfe05d 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -62,7 +62,8 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0); /* Device Dump Size */ static size_t vmcoredd_orig_sz; -static DECLARE_RWSEM(vmcore_cb_rwsem); +static DEFINE_SPINLOCK(vmcore_cb_lock); +DEFINE_STATIC_SRCU(vmcore_cb_srcu); /* List of registered vmcore callbacks. */ static LIST_HEAD(vmcore_cb_list); /* Whether the vmcore has been opened once. 
*/ @@ -70,8 +71,8 @@ static bool vmcore_opened; void register_vmcore_cb(struct vmcore_cb *cb) { - down_write(&vmcore_cb_rwsem); INIT_LIST_HEAD(&cb->next); + spin_lock(&vmcore_cb_lock); list_add_tail(&cb->next, &vmcore_cb_list); /* * Registering a vmcore callback after the vmcore was opened is @@ -79,14 +80,14 @@ void register_vmcore_cb(struct vmcore_cb *cb) */ if (vmcore_opened) pr_warn_once("Unexpected vmcore callback registration\n"); - up_write(&vmcore_cb_rwsem); + spin_unlock(&vmcore_cb_lock); } EXPORT_SYMBOL_GPL(register_vmcore_cb); void unregister_vmcore_cb(struct vmcore_cb *cb) { - down_write(&vmcore_cb_rwsem); - list_del(&cb->next); + spin_lock(&vmcore_cb_lock); + list_del_rcu(&cb->next); /* * Unregistering a vmcore callback after the vmcore was opened is * very unusual (e.g., forced driver removal), but we cannot stop @@ -94,7 +95,9 @@ void unregister_vmcore_cb(struct vmcore_cb *cb) */ if (vmcore_opened) pr_warn_once("Unexpected vmcore callback unregistration\n"); - up_write(&vmcore_cb_rwsem); + spin_unlock(&vmcore_cb_lock); + + synchronize_srcu(&vmcore_cb_srcu); } EXPORT_SYMBOL_GPL(unregister_vmcore_cb); @@ -103,9 +106,8 @@ static bool pfn_is_ram(unsigned long pfn) struct vmcore_cb *cb; bool ret = true; - lockdep_assert_held_read(&vmcore_cb_rwsem); - - list_for_each_entry(cb, &vmcore_cb_list, next) { + list_for_each_entry_srcu(cb, &vmcore_cb_list, next, +srcu_read_lock_held(&vmcore_cb_srcu)) { if (unlikely(!cb->pfn_is_ram)) continue; ret = cb->pfn_is_ram(cb, pfn); @@ -118,9 +120,9 @@ static bool pfn_is_ram(unsigned long pfn) static int open_vmcore(struct inode *inode, struct file *file) { - down_read(&vmcore_cb_rwsem); + spin_lock(&vmcore_cb_lock); vmcore_opened = true; - up_read(&vmcore_cb_rwsem); + spin_unlock(&vmcore_cb_lock); return 0; } @@ -133,6 +135,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, unsigned long pfn, offset; size_t nr_bytes; ssize_t read = 0, tmp; + int idx; if (!count) return 0; @@ -140,7 +143,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, offset = (unsigned long)(*ppos % PAGE_SIZE); pfn = (unsigned long)(*ppos / PAGE_SIZE); - down_read(&vmcore_cb_rwsem); + idx = srcu_read_lock(&vmcore_cb_srcu); do { if (count > (PAGE_SIZE - offset)) nr_bytes = PAGE_SIZE - offset; @@ -165,7 +168,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, offset, userbuf); } if (tmp < 0) { - up_read(&vmcore_cb_rwsem); + srcu_read_unlock(&vmcore_cb_srcu, idx); return tmp; } @@ -176,8 +179,
Re: [PATCH v1] proc/vmcore: fix false positive lockdep warning
On 19.01.22 16:15, David Hildenbrand wrote: > On 19.01.22 16:08, Boqun Feng wrote: >> Hi, >> >> On Wed, Jan 19, 2022 at 12:37:02PM +0100, David Hildenbrand wrote: >>> Lockdep complains that we do during mmap of the vmcore: >>> down_write(mmap_lock); >>> down_read(vmcore_cb_rwsem); >>> And during read of the vmcore: >>> down_read(vmcore_cb_rwsem); >>> down_read(mmap_lock); >>> >>> We cannot possibly deadlock when only taking vmcore_cb_rwsem in read >>> mode, however, it's hard to teach that to lockdep. >>> >> >> Lockdep warned about the above sequences because rw_semaphore is a fair >> read-write lock, and the following can cause a deadlock: >> >> TASK 1 TASK 2 TASK 3 >> == == == >> down_write(mmap_lock); >> down_read(vmcore_cb_rwsem) >> down_write(vmcore_cb_rwsem); // >> blocked >> down_read(vmcore_cb_rwsem); // cannot get the lock because of the >> fairness >> down_read(mmap_lock); // blocked >> >> IOW, a reader can block another read if there is a writer queued by the >> second reader and the lock is fair. >> >> So there is a deadlock possiblity. > > Task 3 will never take the mmap_lock before doing a > down_write(vmcore_cb_rwsem). > > How would this happen? Ah, I get it, nevermind. I'll adjust the patch description. Thanks! -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v1] proc/vmcore: fix false positive lockdep warning
On 19.01.22 16:08, Boqun Feng wrote: > Hi, > > On Wed, Jan 19, 2022 at 12:37:02PM +0100, David Hildenbrand wrote: >> Lockdep complains that we do during mmap of the vmcore: >> down_write(mmap_lock); >> down_read(vmcore_cb_rwsem); >> And during read of the vmcore: >> down_read(vmcore_cb_rwsem); >> down_read(mmap_lock); >> >> We cannot possibly deadlock when only taking vmcore_cb_rwsem in read >> mode, however, it's hard to teach that to lockdep. >> > > Lockdep warned about the above sequences because rw_semaphore is a fair > read-write lock, and the following can cause a deadlock: > > TASK 1 TASK 2 TASK 3 > == == == > down_write(mmap_lock); > down_read(vmcore_cb_rwsem) > down_write(vmcore_cb_rwsem); // > blocked > down_read(vmcore_cb_rwsem); // cannot get the lock because of the > fairness > down_read(mmap_lock); // blocked > > IOW, a reader can block another read if there is a writer queued by the > second reader and the lock is fair. > > So there is a deadlock possiblity. Task 3 will never take the mmap_lock before doing a down_write(vmcore_cb_rwsem). How would this happen? -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
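Boqun's three-task scenario hinges on reader/writer fairness: once a writer is queued, later readers wait behind it even though another reader already holds the lock. A toy model of that queueing rule (illustrative only — not the kernel's rwsem implementation) shows how Task 1's second down_read() ends up blocked:

```python
# Minimal model of a *fair* read-write lock's wait queue. A read request
# is granted immediately only if no writer is queued ahead of it;
# otherwise it queues. That rule is what turns "reader held + writer
# queued" into a block for a later reader, completing the deadlock cycle.
class FairRWLock:
    def __init__(self):
        self.readers = 0   # currently granted readers
        self.queue = []    # pending requests, in arrival order

    def try_read(self, name):
        if not self.queue:             # fairness: never jump a queued writer
            self.readers += 1
            return True                # granted
        self.queue.append(("r", name))
        return False                   # blocked

    def try_write(self, name):
        if self.readers == 0 and not self.queue:
            return True
        self.queue.append(("w", name))
        return False

vmcore_cb_rwsem = FairRWLock()
print(vmcore_cb_rwsem.try_read("task2"))   # True:  Task 2 down_read() succeeds
print(vmcore_cb_rwsem.try_write("task3"))  # False: Task 3 down_write() queues
print(vmcore_cb_rwsem.try_read("task1"))   # False: Task 1 down_read() blocks
```

With Task 1 already holding mmap_lock for write and Task 2 about to take mmap_lock for read, this blocked down_read() closes the cycle lockdep warns about.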
[PATCH v1] proc/vmcore: fix false positive lockdep warning
a RBX: 0001 RCX: 7fc7359b8fc7 [6.433746] RDX: 0001 RSI: 0040 RDI: [6.434529] RBP: 55a1125ecf10 R08: 0003 R09: 2000 [6.435310] R10: 0002 R11: 0246 R12: 2000 [6.436093] R13: 0040 R14: 55a1124269e2 R15: [6.436887] Reported-by: Baoquan He Cc: Andrew Morton Cc: Baoquan He Cc: Vivek Goyal Cc: Dave Young Cc: "Paul E. McKenney" Cc: Josh Triplett Cc: Peter Zijlstra Cc: Boqun Feng Signed-off-by: David Hildenbrand --- Based on next-20220118 --- fs/proc/vmcore.c | 41 ++--- 1 file changed, 22 insertions(+), 19 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 702754dd1daf..edeb01dfe05d 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -62,7 +62,8 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0); /* Device Dump Size */ static size_t vmcoredd_orig_sz; -static DECLARE_RWSEM(vmcore_cb_rwsem); +static DEFINE_SPINLOCK(vmcore_cb_lock); +DEFINE_STATIC_SRCU(vmcore_cb_srcu); /* List of registered vmcore callbacks. */ static LIST_HEAD(vmcore_cb_list); /* Whether the vmcore has been opened once. 
*/ @@ -70,8 +71,8 @@ static bool vmcore_opened; void register_vmcore_cb(struct vmcore_cb *cb) { - down_write(&vmcore_cb_rwsem); INIT_LIST_HEAD(&cb->next); + spin_lock(&vmcore_cb_lock); list_add_tail(&cb->next, &vmcore_cb_list); /* * Registering a vmcore callback after the vmcore was opened is @@ -79,14 +80,14 @@ void register_vmcore_cb(struct vmcore_cb *cb) */ if (vmcore_opened) pr_warn_once("Unexpected vmcore callback registration\n"); - up_write(&vmcore_cb_rwsem); + spin_unlock(&vmcore_cb_lock); } EXPORT_SYMBOL_GPL(register_vmcore_cb); void unregister_vmcore_cb(struct vmcore_cb *cb) { - down_write(&vmcore_cb_rwsem); - list_del(&cb->next); + spin_lock(&vmcore_cb_lock); + list_del_rcu(&cb->next); /* * Unregistering a vmcore callback after the vmcore was opened is * very unusual (e.g., forced driver removal), but we cannot stop @@ -94,7 +95,9 @@ void unregister_vmcore_cb(struct vmcore_cb *cb) */ if (vmcore_opened) pr_warn_once("Unexpected vmcore callback unregistration\n"); - up_write(&vmcore_cb_rwsem); + spin_unlock(&vmcore_cb_lock); + + synchronize_srcu(&vmcore_cb_srcu); } EXPORT_SYMBOL_GPL(unregister_vmcore_cb); @@ -103,9 +106,8 @@ static bool pfn_is_ram(unsigned long pfn) struct vmcore_cb *cb; bool ret = true; - lockdep_assert_held_read(&vmcore_cb_rwsem); - - list_for_each_entry(cb, &vmcore_cb_list, next) { + list_for_each_entry_srcu(cb, &vmcore_cb_list, next, +srcu_read_lock_held(&vmcore_cb_srcu)) { if (unlikely(!cb->pfn_is_ram)) continue; ret = cb->pfn_is_ram(cb, pfn); @@ -118,9 +120,9 @@ static bool pfn_is_ram(unsigned long pfn) static int open_vmcore(struct inode *inode, struct file *file) { - down_read(&vmcore_cb_rwsem); + spin_lock(&vmcore_cb_lock); vmcore_opened = true; - up_read(&vmcore_cb_rwsem); + spin_unlock(&vmcore_cb_lock); return 0; } @@ -133,6 +135,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, unsigned long pfn, offset; size_t nr_bytes; ssize_t read = 0, tmp; + int idx; if (!count) return 0; @@ -140,7 +143,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, offset = (unsigned long)(*ppos % PAGE_SIZE); pfn = (unsigned long)(*ppos / PAGE_SIZE); - down_read(&vmcore_cb_rwsem); + idx = srcu_read_lock(&vmcore_cb_srcu); do { if (count > (PAGE_SIZE - offset)) nr_bytes = PAGE_SIZE - offset; @@ -165,7 +168,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, offset, userbuf); } if (tmp < 0) { - up_read(&vmcore_cb_rwsem); + srcu_read_unlock(&vmcore_cb_srcu, idx); return tmp; } @@ -176,8 +179,8 @@ ssize_t read_from_oldmem(char *buf, size_t count, ++pfn; offset = 0; } while (count); + srcu_read_unlock(&vmcore_cb_srcu, idx); - up_read(&vmcore_cb_rwsem); return read; } @@ -568,18 +571,18 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, unsigned long from, unsigned long pfn, unsigned long size, pgprot_t prot) { - int ret; + int ret, idx; /* -* Check if oldmem_pfn_is_ram was registered to avoid -* looping over all pages without a reason. +* Check if a callback was registered to avoid looping over all +* pages without a reason
Re: [PATCH v4 2/3] dma/pool: create dma atomic pool only if dma zone has managed pages
On 23.12.21 10:44, Baoquan He wrote: > Currently three dma atomic pools are initialized as long as the relevant > kernel codes are built in. While in kdump kernel of x86_64, this is not > right when trying to create atomic_pool_dma, because there's no managed > pages in DMA zone. In the case, DMA zone only has low 1M memory presented > and locked down by memblock allocator. So no pages are added into buddy > of DMA zone. Please check commit f1d4d47c5851 ("x86/setup: Always reserve > the first 1M of RAM"). > > Then in kdump kernel of x86_64, it always prints below failure message: > > DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations > swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), > nodemask=(null),cpuset=/,mems_allowed=0 > CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 5.13.0-0.rc5.20210611git929d931f2b40.42.fc35.x86_64 #1 > Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018 > Call Trace: > dump_stack+0x7f/0xa1 > warn_alloc.cold+0x72/0xd6 > ? _raw_spin_unlock_irq+0x24/0x40 > ? __alloc_pages_direct_compact+0x90/0x1b0 > __alloc_pages_slowpath.constprop.0+0xf29/0xf50 > ? __cond_resched+0x16/0x50 > ? prepare_alloc_pages.constprop.0+0x19d/0x1b0 > __alloc_pages+0x24d/0x2c0 > ? __dma_atomic_pool_init+0x93/0x93 > alloc_page_interleave+0x13/0xb0 > atomic_pool_expand+0x118/0x210 > ? __dma_atomic_pool_init+0x93/0x93 > __dma_atomic_pool_init+0x45/0x93 > dma_atomic_pool_init+0xdb/0x176 > do_one_initcall+0x67/0x320 > ? rcu_read_lock_sched_held+0x3f/0x80 > kernel_init_freeable+0x290/0x2dc > ? rest_init+0x24f/0x24f > kernel_init+0xa/0x111 > ret_from_fork+0x22/0x30 > Mem-Info: > .. > DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations > > Here, let's check if DMA zone has managed pages, then create atomic_pool_dma > if yes. Otherwise just skip it. 
> > Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the > crashkernel option is specified") > Cc: sta...@vger.kernel.org > Signed-off-by: Baoquan He > Cc: Christoph Hellwig > Cc: Marek Szyprowski > Cc: Robin Murphy > Cc: io...@lists.linux-foundation.org > --- > kernel/dma/pool.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c > index 5a85804b5beb..00df3edd6c5d 100644 > --- a/kernel/dma/pool.c > +++ b/kernel/dma/pool.c > @@ -206,7 +206,7 @@ static int __init dma_atomic_pool_init(void) > GFP_KERNEL); > if (!atomic_pool_kernel) > ret = -ENOMEM; > - if (IS_ENABLED(CONFIG_ZONE_DMA)) { > + if (has_managed_dma()) { > atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size, > GFP_KERNEL | GFP_DMA); > if (!atomic_pool_dma) > @@ -229,7 +229,7 @@ static inline struct gen_pool *dma_guess_pool(struct > gen_pool *prev, gfp_t gfp) > if (prev == NULL) { > if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32)) > return atomic_pool_dma32; > - if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA)) > + if (atomic_pool_dma && (gfp & GFP_DMA)) > return atomic_pool_dma; > return atomic_pool_kernel; > } I thought for a second that we might have to tweak atomic_pool_work_fn(), but atomic_pool_resize() handles it properly already. Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v3 3/5] mm_zone: add function to check if managed dma zone exists
On 13.12.21 13:27, Baoquan He wrote: > In some places of the current kernel, it assumes that dma zone must have > managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true. > E.g in kdump kernel of x86_64, only low 1M is presented and locked down > at very early stage of boot, so that there's no managed pages at all in > DMA zone. This exception will always cause page allocation failure if page > is requested from DMA zone. > > Here add function has_managed_dma() and the relevant helper functions to > check if there's DMA zone with managed pages. It will be used in later > patches. > > Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the > crashkernel option is specified") > Cc: sta...@vger.kernel.org > Signed-off-by: Baoquan He > --- > v2->v3: > Rewrite has_managed_dma() in a simpler and more efficient way which is > sugggested by DavidH. > > include/linux/mmzone.h | 9 + > mm/page_alloc.c| 15 +++ > 2 files changed, 24 insertions(+) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 58e744b78c2c..6e1b726e9adf 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1046,6 +1046,15 @@ static inline int is_highmem_idx(enum zone_type idx) > #endif > } > > +#ifdef CONFIG_ZONE_DMA > +bool has_managed_dma(void); > +#else > +static inline bool has_managed_dma(void) > +{ > + return false; > +} > +#endif > + > /** > * is_highmem - helper function to quickly check if a struct zone is a > * highmem zone or not. 
This is an attempt to keep references > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c5952749ad40..7c7a0b5de2ff 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -9460,3 +9460,18 @@ bool take_page_off_buddy(struct page *page) > return ret; > } > #endif > + > +#ifdef CONFIG_ZONE_DMA > +bool has_managed_dma(void) > +{ > + struct pglist_data *pgdat; > + > + for_each_online_pgdat(pgdat) { > + struct zone *zone = &pgdat->node_zones[ZONE_DMA]; > + > + if (managed_zone(zone)) > + return true; > + } > + return false; > +} > +#endif /* CONFIG_ZONE_DMA */ > Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH RESEND v2 3/5] mm_zone: add function to check if managed dma zone exists
On 09.12.21 14:02, Baoquan He wrote: > On 12/07/21 at 12:23pm, David Hildenbrand wrote: >> On 07.12.21 04:07, Baoquan He wrote: >>> In some places of the current kernel, it assumes that dma zone must have >>> managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true. >>> E.g in kdump kernel of x86_64, only low 1M is presented and locked down >>> at very early stage of boot, so that there's no managed pages at all in >>> DMA zone. This exception will always cause page allocation failure if page >>> is requested from DMA zone. >>> >>> Here add function has_managed_dma() and the relevant helper functions to >>> check if there's DMA zone with managed pages. It will be used in later >>> patches. >>> >>> Signed-off-by: Baoquan He >>> --- >>> include/linux/mmzone.h | 21 + >>> mm/page_alloc.c| 11 +++ >>> 2 files changed, 32 insertions(+) >>> >>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >>> index 58e744b78c2c..82d23e13e0e5 100644 >>> --- a/include/linux/mmzone.h >>> +++ b/include/linux/mmzone.h >>> @@ -998,6 +998,18 @@ static inline bool zone_is_zone_device(struct zone >>> *zone) >>> } >>> #endif >>> >>> +#ifdef CONFIG_ZONE_DMA >>> +static inline bool zone_is_dma(struct zone *zone) >>> +{ >>> + return zone_idx(zone) == ZONE_DMA; >>> +} >>> +#else >>> +static inline bool zone_is_dma(struct zone *zone) >>> +{ >>> + return false; >>> +} >>> +#endif >>> + >>> /* >>> * Returns true if a zone has pages managed by the buddy allocator. >>> * All the reclaim decisions have to use this function rather than >>> @@ -1046,6 +1058,7 @@ static inline int is_highmem_idx(enum zone_type idx) >>> #endif >>> } >>> >>> +bool has_managed_dma(void); >>> /** >>> * is_highmem - helper function to quickly check if a struct zone is a >>> * highmem zone or not. 
This is an attempt to keep references >>> @@ -1131,6 +1144,14 @@ extern struct zone *next_zone(struct zone *zone); >>> ; /* do nothing */ \ >>> else >>> >>> +#define for_each_managed_zone(zone)\ >>> + for (zone = (first_online_pgdat())->node_zones; \ >>> +zone; \ >>> +zone = next_zone(zone))\ >>> + if (!managed_zone(zone))\ >>> + ; /* do nothing */ \ >>> + else >>> + >>> static inline struct zone *zonelist_zone(struct zoneref *zoneref) >>> { >>> return zoneref->zone; >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index c5952749ad40..ac0ea42a4e5f 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -9459,4 +9459,15 @@ bool take_page_off_buddy(struct page *page) >>> spin_unlock_irqrestore(>lock, flags); >>> return ret; >>> } >>> + >>> +bool has_managed_dma(void) >>> +{ >>> + struct zone *zone; >>> + >>> + for_each_managed_zone(zone) { >>> + if (zone_is_dma(zone)) >>> + return true; >>> + } >>> + return false; >>> +} >> >> Wouldn't it be "easier/faster" to just iterate online nodes and directly >> obtain the ZONE_DMA, checking if there are managed pages? > > Thanks, Dave. > > Please check for_each_managed_zone(), it is iterating online nodes and > it's each managed zone. > > Is below what you are suggesting? The only difference is I introduced > for_each_managed_zone() which can be reused later if needed. Not sure if > I got your suggestion correctly. 
> > bool has_managed_dma(void) > { > struct pglist_data *pgdat; > struct zone *zone; > enum zone_type i, j; > > for_each_online_pgdat(pgdat) { > for (i = 0; i < MAX_NR_ZONES - 1; i++) { > struct zone *zone = &pgdat->node_zones[i]; > if (zone_is_dma(zone)) > > return true; > } > } > return false; > > } Even simpler, no need to iterate over zones at all, only over nodes: #ifdef CONFIG_ZONE_DMA bool has_managed_dma(void) { struct pglist_data *pgdat; for_each_online_pgdat(pgdat) { struct zone *zone = &pgdat->node_zones[ZONE_DMA]; if (managed_zone(zone)) return true; } return false; } #endif /* CONFIG_ZONE_DMA */ Without CONFIG_ZONE_DMA, simply provide a dummy in the header that returns false. -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [RFC v2 4/6] crash hp: generic crash hotplug support infrastructure
> +#if defined(CONFIG_MEMORY_HOTPLUG) > +static int crash_memhp_notifier(struct notifier_block *nb, > + unsigned long val, void *v) > +{ > + struct memory_notify *mhp = v; > + unsigned long start, end; > + > + start = mhp->start_pfn << PAGE_SHIFT; > + end = ((mhp->start_pfn + mhp->nr_pages) << PAGE_SHIFT) - 1; > + > + switch (val) { > + case MEM_GOING_ONLINE: > + crash_hotplug_handler(KEXEC_CRASH_HP_ADD_MEMORY, > + start, end-start); > + break; > + > + case MEM_OFFLINE: > + case MEM_CANCEL_ONLINE: > + crash_hotplug_handler(KEXEC_CRASH_HP_REMOVE_MEMORY, > + start, end-start); Any reason you don't handle this after the effects completely, meaning MEM_ONLINE and MEM_OFFLINE? -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH RESEND v2 3/5] mm_zone: add function to check if managed dma zone exists
On 07.12.21 04:07, Baoquan He wrote: > In some places of the current kernel, it assumes that dma zone must have > managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true. > E.g in kdump kernel of x86_64, only low 1M is presented and locked down > at very early stage of boot, so that there's no managed pages at all in > DMA zone. This exception will always cause page allocation failure if page > is requested from DMA zone. > > Here add function has_managed_dma() and the relevant helper functions to > check if there's DMA zone with managed pages. It will be used in later > patches. > > Signed-off-by: Baoquan He > --- > include/linux/mmzone.h | 21 + > mm/page_alloc.c| 11 +++ > 2 files changed, 32 insertions(+) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 58e744b78c2c..82d23e13e0e5 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -998,6 +998,18 @@ static inline bool zone_is_zone_device(struct zone *zone) > } > #endif > > +#ifdef CONFIG_ZONE_DMA > +static inline bool zone_is_dma(struct zone *zone) > +{ > + return zone_idx(zone) == ZONE_DMA; > +} > +#else > +static inline bool zone_is_dma(struct zone *zone) > +{ > + return false; > +} > +#endif > + > /* > * Returns true if a zone has pages managed by the buddy allocator. > * All the reclaim decisions have to use this function rather than > @@ -1046,6 +1058,7 @@ static inline int is_highmem_idx(enum zone_type idx) > #endif > } > > +bool has_managed_dma(void); > /** > * is_highmem - helper function to quickly check if a struct zone is a > * highmem zone or not. 
This is an attempt to keep references > @@ -1131,6 +1144,14 @@ extern struct zone *next_zone(struct zone *zone); > ; /* do nothing */ \ > else > > +#define for_each_managed_zone(zone) \ > + for (zone = (first_online_pgdat())->node_zones; \ > + zone; \ > + zone = next_zone(zone))\ > + if (!managed_zone(zone))\ > + ; /* do nothing */ \ > + else > + > static inline struct zone *zonelist_zone(struct zoneref *zoneref) > { > return zoneref->zone; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c5952749ad40..ac0ea42a4e5f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -9459,4 +9459,15 @@ bool take_page_off_buddy(struct page *page) > spin_unlock_irqrestore(>lock, flags); > return ret; > } > + > +bool has_managed_dma(void) > +{ > + struct zone *zone; > + > + for_each_managed_zone(zone) { > + if (zone_is_dma(zone)) > + return true; > + } > + return false; > +} Wouldn't it be "easier/faster" to just iterate online nodes and directly obtain the ZONE_DMA, checking if there are managed pages? -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] proc/vmcore: fix clearing user buffer by properly using clear_user()
On 15.11.21 23:04, Andrew Morton wrote: > On Fri, 12 Nov 2021 10:27:50 +0100 David Hildenbrand wrote: > >> To clear a user buffer we cannot simply use memset, we have to use >> clear_user(). With a virtio-mem device that registers a vmcore_cb and has >> some logically unplugged memory inside an added Linux memory block, I can >> easily trigger a BUG by copying the vmcore via "cp": >> >> ... >> >> Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access >> Prevention (SMAP)", which is used to detect wrong access from the kernel to >> user buffers like this: SMAP triggers a permissions violation on wrong >> access. In the x86-64 variant of clear_user(), SMAP is properly >> handled via clac()+stac(). >> >> To fix, properly use clear_user() when we're dealing with a user buffer. >> > > I added cc:stable, OK? > I was a bit hesitant because this would (before the virtio-mem changes) only trigger under XEN and I was wondering why nobody noticed under XEN so far. But yes, even though it only applies to the kdump kernel, cc:stable sounds like the right thing to do! Thanks Andrew! -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2] proc/vmcore: fix clearing user buffer by properly using clear_user()
To clear a user buffer we cannot simply use memset, we have to use clear_user(). With a virtio-mem device that registers a vmcore_cb and has some logically unplugged memory inside an added Linux memory block, I can easily trigger a BUG by copying the vmcore via "cp": [ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service... [ 11.339697] kdump[420]: Kdump is using the default log level(3). [ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ [ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ [ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete [ 11.386722] kdump[467]: saving vmcore [ 16.531275] BUG: unable to handle page fault for address: 7f2374e01000 [ 16.531705] #PF: supervisor write access in kernel mode [ 16.532037] #PF: error_code(0x0003) - permissions violation [ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 80007048f867 [ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI [ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6 [ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014 [ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86 [ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81 [ 16.535670] RSP: 0018:c973be08 EFLAGS: 00010212 [ 16.535998] RAX: 1000 RBX: 002fd000 RCX: 7f2374e01000 [ 16.536441] RDX: 0001 RSI: dfff RDI: 7f2374e01008 [ 16.536878] RBP: 1000 R08: R09: c973bc50 [ 16.537315] R10: c973bc48 R11: 829461a8 R12: f000 [ 16.537755] R13: 7f2374e01000 R14: R15: 88807bd421e8 [ 16.538200] FS: 7f2374e12140() GS:88807f00() knlGS: [ 16.538696] CS: 0010 DS: ES: CR0: 80050033 [ 16.539055] CR2: 7f2374e01000 CR3: 7a4aa000 CR4: 00350eb0 [ 16.539510] Call Trace: [ 16.539679] [ 16.539828] read_vmcore+0x236/0x2c0 [ 
16.540063] ? enqueue_hrtimer+0x2f/0x80 [ 16.540323] ? inode_security+0x22/0x60 [ 16.540572] proc_reg_read+0x55/0xa0 [ 16.540807] vfs_read+0x95/0x190 [ 16.541022] ksys_read+0x4f/0xc0 [ 16.541238] do_syscall_64+0x3b/0x90 [ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access Prevention (SMAP)", which is used to detect wrong access from the kernel to user buffers like this: SMAP triggers a permissions violation on wrong access. In the x86-64 variant of clear_user(), SMAP is properly handled via clac()+stac(). To fix, properly use clear_user() when we're dealing with a user buffer. Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages") Acked-by: Baoquan He Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Andrew Morton Cc: Philipp Rudo Cc: kexec@lists.infradead.org Cc: linux...@kvack.org Cc: linux-fsde...@vger.kernel.org Signed-off-by: David Hildenbrand --- v1 -> v2: - Extend patch description --- fs/proc/vmcore.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 30a3b66f475a..509f85148fee 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size_t count, nr_bytes = count; /* If pfn is not ram, return zeros for sparse dump files */ - if (!pfn_is_ram(pfn)) - memset(buf, 0, nr_bytes); - else { + if (!pfn_is_ram(pfn)) { + tmp = 0; + if (!userbuf) + memset(buf, 0, nr_bytes); + else if (clear_user(buf, nr_bytes)) + tmp = -EFAULT; + } else { if (encrypted) tmp = copy_oldmem_page_encrypted(pfn, buf, nr_bytes, @@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size_t count, else tmp = copy_oldmem_page(pfn, buf, nr_bytes, offset, userbuf); - - if (tmp < 0) { - up_read(&vmcore_cb_rwsem); - return tmp; - } } + if (tmp < 0) { + up_read(&vmcore_cb_rwsem); + return tmp; + } + *ppos +
Re: [PATCH v1] proc/vmcore: fix clearing user buffer by properly using clear_user()
> > "that allows supervisor mode programs to optionally set user-space > > memory mappings so that access to those mappings from supervisor mode > > will cause a trap. This makes it harder for malicious programs to > > "trick" the kernel into using instructions or data from a user-space > > program" > > OK, probably. I thought it's triggered in access_ok(), and tried to > figure out why. But seems we should do something to check this in > access_ok(), otherwise the logic of clear_user/_clear_user is not so > reasonable. Anyway, I have learned it, thanks a lot for digging it out. > > By the way, I can't open above wiki article, found below commit from > hpa. Maybe we can add some into log to tell this, not strong opinin, > leave it to you. Yes, now that we know the root cause I'll add some more details to the patch description and resend -- thanks Baoquan! ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v1] proc/vmcore: don't fake reading zeroes on surprise vmcore_cb unregistration
On 12.11.21 04:30, Baoquan He wrote: > On 11/11/21 at 08:22pm, David Hildenbrand wrote: >> In commit cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback >> to more generic vmcore callbacks"), we added detection of surprise >> vmcore_cb unregistration after the vmcore was already opened. Once >> detected, we warn the user and simulate reading zeroes from that point on >> when accessing the vmcore. >> >> The basic reason was that unexpected unregistration, for example, by >> manually unbinding a driver from a device after opening the >> vmcore, is not supported and could result in reading oldmem the >> vmcore_cb would have actually prohibited while registered. However, >> something like that can similarly be trigger by a user that's really >> looking for trouble simply by unbinding the relevant driver before opening >> the vmcore -- or by disallowing loading the driver in the first place. >> So it's actually of limited help. > > Yes, this is the change what I would like to see in the original patch > "proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore > callbacks". > I am happy with this patch appended to commit cc5f2704c934. Good, thanks! > >> >> Currently, unregistration can only be triggered via virtio-mem when >> manually unbinding the driver from the device inside the VM; there is no >> way to trigger it from the hypervisor, as hypervisors don't allow for >> unplugging virtio-mem devices -- ripping out system RAM from a VM without >> coordination with the guest is usually not a good idea. >> >> The important part is that unbinding the driver and unregistering the >> vmcore_cb while concurrently reading the vmcore won't crash the system, >> and that is handled by the rwsem. >> >> To make the mechanism more future proof, let's remove the "read zero" >> part, but leave the warning in place. 
For example, we could have a future >> driver (like virtio-balloon) that will contact the hypervisor to figure out >> if we already populated a page for a given PFN. Hotunplugging such a device >> and consequently unregistering the vmcore_cb could be triggered from the >> hypervisor without harming the system even while kdump is running. In that > > I am a little confused, could you say more about "contact the hypervisor to > figure out if we already populated a page for a given PFN."? I think > vmcore dumping relies on the elfcorehdr which is created beforehand, and > relies on vmcore_cb registered in advance too, virtio-balloon could take > another way to interact with kdump to make sure the dumpable ram? This is essentially what the XEN callback does: check if a PFN is actually populated in the hypervisor; if not, avoid reading it so we won't be faulting+populating a fresh/zero page in the hypervisor just to be able to dump it in the guest. But in the XEN world we usually simply rely on straight hypercalls, not glued to actual devices that can get hot(un)plugged. Once you have some device that performs such checks instead that could get hotunplugged and unregister the vmcore_cb (and virtio-balloon is just one example), you would be able to trigger this. As we're dealing with a moving target (hypervisor will populate pages as necessary once the old kernel accesses them), there isn't really a way to adjust this in the old kernel -- where we build the elfcorehdr. We could try to adjust the elfcorehdr in the new kernel, but that certainly opens up another can of worms. But again, this is just an example to back the "future proof" claim because Dave was explicitly concerned about this situation. -- Thanks, David / dhildenb
Re: [PATCH v1] proc/vmcore: fix clearing user buffer by properly using clear_user()
On 12.11.21 08:01, Baoquan He wrote: > On 11/11/21 at 08:18pm, David Hildenbrand wrote: >> To clear a user buffer we cannot simply use memset, we have to use >> clear_user(). Using a kernel config based on rawhide Fedora and a >> virtio-mem device that registers a vmcore_cb, I can easily trigger: >> >> [ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service... >> [ 11.339697] kdump[420]: Kdump is using the default log level(3). >> [ 11.370964] kdump[453]: saving to >> /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ >> [ 11.373997] kdump[458]: saving vmcore-dmesg.txt to >> /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ >> [ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete >> [ 11.386722] kdump[467]: saving vmcore >> [ 16.531275] BUG: unable to handle page fault for address: 7f2374e01000 >> [ 16.531705] #PF: supervisor write access in kernel mode >> [ 16.532037] #PF: error_code(0x0003) - permissions violation >> [ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE >> 80007048f867 >> [ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI >> [ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6 >> [ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >> rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014 >> [ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86 >> [ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 >> c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> >> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81 >> [ 16.535670] RSP: 0018:c973be08 EFLAGS: 00010212 >> [ 16.535998] RAX: 1000 RBX: 002fd000 RCX: >> 7f2374e01000 >> [ 16.536441] RDX: 0001 RSI: dfff RDI: >> 7f2374e01008 >> [ 16.536878] RBP: 1000 R08: R09: >> c973bc50 >> [ 16.537315] R10: c973bc48 R11: 829461a8 R12: >> f000 >> [ 16.537755] R13: 7f2374e01000 R14: R15: >> 88807bd421e8 >> [ 16.538200] FS: 7f2374e12140() GS:88807f00() >> knlGS: >> [ 16.538696] CS: 0010 DS: ES: CR0: 80050033 
>> [ 16.539055] CR2: 7f2374e01000 CR3: 7a4aa000 CR4: >> 00350eb0 >> [ 16.539510] Call Trace: >> [ 16.539679] >> [ 16.539828] read_vmcore+0x236/0x2c0 >> [ 16.540063] ? enqueue_hrtimer+0x2f/0x80 >> [ 16.540323] ? inode_security+0x22/0x60 >> [ 16.540572] proc_reg_read+0x55/0xa0 >> [ 16.540807] vfs_read+0x95/0x190 >> [ 16.541022] ksys_read+0x4f/0xc0 >> [ 16.541238] do_syscall_64+0x3b/0x90 >> [ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae >> >> To fix, properly use clear_user() when required. > > Looks a great fix to me, thanks for fixing this. > > Check the code, clear_user invokes access_ok to do check, then call > memset(). It's unclear to me how the bug is triggered, could you > please tell more so that I can learn? > TBH, I was testing virtio-mem+vmcore before without running into this issue, but after I retested with upstream in a different setup (different kernel config but eventually also different CPU features), I ran into this. Note that you were looking at the generic __clear_user() implementation, the x86-64 variant is different, see arch/x86/lib/usercopy_64.c I can spot that it triggers stac()/clac() (X86_SMAP): https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention "that allows supervisor mode programs to optionally set user-space memory mappings so that access to those mappings from supervisor mode will cause a trap. This makes it harder for malicious programs to "trick" the kernel into using instructions or data from a user-space program" Yes, that's most probably it :) -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v1] proc/vmcore: don't fake reading zeroes on surprise vmcore_cb unregistration
In commit cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks"), we added detection of surprise vmcore_cb unregistration after the vmcore was already opened. Once detected, we warn the user and simulate reading zeroes from that point on when accessing the vmcore. The basic reason was that unexpected unregistration, for example, by manually unbinding a driver from a device after opening the vmcore, is not supported and could result in reading oldmem the vmcore_cb would have actually prohibited while registered. However, something like that can similarly be triggered by a user that's really looking for trouble simply by unbinding the relevant driver before opening the vmcore -- or by disallowing loading the driver in the first place. So it's actually of limited help. Currently, unregistration can only be triggered via virtio-mem when manually unbinding the driver from the device inside the VM; there is no way to trigger it from the hypervisor, as hypervisors don't allow for unplugging virtio-mem devices -- ripping out system RAM from a VM without coordination with the guest is usually not a good idea. The important part is that unbinding the driver and unregistering the vmcore_cb while concurrently reading the vmcore won't crash the system, and that is handled by the rwsem. To make the mechanism more future proof, let's remove the "read zero" part, but leave the warning in place. For example, we could have a future driver (like virtio-balloon) that will contact the hypervisor to figure out if we already populated a page for a given PFN. Hotunplugging such a device and consequently unregistering the vmcore_cb could be triggered from the hypervisor without harming the system even while kdump is running. In that case, we don't want to silently end up with a vmcore that contains wrong data, because the user inside the VM might be unaware of the hypervisor action and might easily miss the warning in the log. 
Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Andrew Morton Cc: Philipp Rudo Cc: kexec@lists.infradead.org Cc: linux...@kvack.org Cc: linux-fsde...@vger.kernel.org Signed-off-by: David Hildenbrand --- fs/proc/vmcore.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 30a3b66f475a..948691cf4a1a 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -65,8 +65,6 @@ static size_t vmcoredd_orig_sz; static DECLARE_RWSEM(vmcore_cb_rwsem); /* List of registered vmcore callbacks. */ static LIST_HEAD(vmcore_cb_list); -/* Whether we had a surprise unregistration of a callback. */ -static bool vmcore_cb_unstable; /* Whether the vmcore has been opened once. */ static bool vmcore_opened; @@ -94,10 +92,8 @@ void unregister_vmcore_cb(struct vmcore_cb *cb) * very unusual (e.g., forced driver removal), but we cannot stop * unregistering. */ - if (vmcore_opened) { + if (vmcore_opened) pr_warn_once("Unexpected vmcore callback unregistration\n"); - vmcore_cb_unstable = true; - } up_write(&vmcore_cb_rwsem); } EXPORT_SYMBOL_GPL(unregister_vmcore_cb); @@ -108,8 +104,6 @@ static bool pfn_is_ram(unsigned long pfn) bool ret = true; lockdep_assert_held_read(&vmcore_cb_rwsem); - if (unlikely(vmcore_cb_unstable)) - return false; list_for_each_entry(cb, &vmcore_cb_list, next) { if (unlikely(!cb->pfn_is_ram)) @@ -577,7 +571,7 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, * looping over all pages without a reason. */ down_read(&vmcore_cb_rwsem); - if (!list_empty(&vmcore_cb_list) || vmcore_cb_unstable) + if (!list_empty(&vmcore_cb_list)) ret = remap_oldmem_pfn_checked(vma, from, pfn, size, prot); else ret = remap_oldmem_pfn_range(vma, from, pfn, size, prot); base-commit: debe436e77c72fcee804fb867f275e6d31aa999c -- 2.31.1
[PATCH v1] proc/vmcore: fix clearing user buffer by properly using clear_user()
To clear a user buffer we cannot simply use memset, we have to use clear_user(). Using a kernel config based on rawhide Fedora and a virtio-mem device that registers a vmcore_cb, I can easily trigger: [ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service... [ 11.339697] kdump[420]: Kdump is using the default log level(3). [ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ [ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ [ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete [ 11.386722] kdump[467]: saving vmcore [ 16.531275] BUG: unable to handle page fault for address: 7f2374e01000 [ 16.531705] #PF: supervisor write access in kernel mode [ 16.532037] #PF: error_code(0x0003) - permissions violation [ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 80007048f867 [ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI [ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6 [ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014 [ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86 [ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81 [ 16.535670] RSP: 0018:c973be08 EFLAGS: 00010212 [ 16.535998] RAX: 1000 RBX: 002fd000 RCX: 7f2374e01000 [ 16.536441] RDX: 0001 RSI: dfff RDI: 7f2374e01008 [ 16.536878] RBP: 1000 R08: R09: c973bc50 [ 16.537315] R10: c973bc48 R11: 829461a8 R12: f000 [ 16.537755] R13: 7f2374e01000 R14: R15: 88807bd421e8 [ 16.538200] FS: 7f2374e12140() GS:88807f00() knlGS: [ 16.538696] CS: 0010 DS: ES: CR0: 80050033 [ 16.539055] CR2: 7f2374e01000 CR3: 7a4aa000 CR4: 00350eb0 [ 16.539510] Call Trace: [ 16.539679] [ 16.539828] read_vmcore+0x236/0x2c0 [ 16.540063] ? enqueue_hrtimer+0x2f/0x80 [ 16.540323] ? 
inode_security+0x22/0x60 [ 16.540572] proc_reg_read+0x55/0xa0 [ 16.540807] vfs_read+0x95/0x190 [ 16.541022] ksys_read+0x4f/0xc0 [ 16.541238] do_syscall_64+0x3b/0x90 [ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae To fix, properly use clear_user() when required. Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages") Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Andrew Morton Cc: Philipp Rudo Cc: kexec@lists.infradead.org Cc: linux...@kvack.org Cc: linux-fsde...@vger.kernel.org Signed-off-by: David Hildenbrand --- fs/proc/vmcore.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 30a3b66f475a..509f85148fee 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size_t count, nr_bytes = count; /* If pfn is not ram, return zeros for sparse dump files */ - if (!pfn_is_ram(pfn)) - memset(buf, 0, nr_bytes); - else { + if (!pfn_is_ram(pfn)) { + tmp = 0; + if (!userbuf) + memset(buf, 0, nr_bytes); + else if (clear_user(buf, nr_bytes)) + tmp = -EFAULT; + } else { if (encrypted) tmp = copy_oldmem_page_encrypted(pfn, buf, nr_bytes, @@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size_t count, else tmp = copy_oldmem_page(pfn, buf, nr_bytes, offset, userbuf); - - if (tmp < 0) { - up_read(&vmcore_cb_rwsem); - return tmp; - } } + if (tmp < 0) { + up_read(&vmcore_cb_rwsem); + return tmp; + } + *ppos += nr_bytes; count -= nr_bytes; buf += nr_bytes; base-commit: debe436e77c72fcee804fb867f275e6d31aa999c -- 2.31.1
[PATCH v2 3/9] x86/xen: print a warning when HVMOP_get_mem_type fails
HVMOP_get_mem_type is not expected to fail, "This call failing is indication of something going quite wrong and it would be good to know about this." [1] Let's add a pr_warn_once(). [1] https://lkml.kernel.org/r/3b935aa0-6d85-0bcd-100e-15098add3...@oracle.com Suggested-by: Boris Ostrovsky Signed-off-by: David Hildenbrand --- arch/x86/xen/mmu_hvm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c index d1b38c77352b..6ba8826dcdcc 100644 --- a/arch/x86/xen/mmu_hvm.c +++ b/arch/x86/xen/mmu_hvm.c @@ -22,8 +22,10 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) .pfn = pfn, }; - if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) + if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) { + pr_warn_once("Unexpected HVMOP_get_mem_type failure\n"); return -ENXIO; + } return a.mem_type != HVMMEM_mmio_dm; } #endif -- 2.31.1
[PATCH v2 2/9] x86/xen: simplify xen_oldmem_pfn_is_ram()
Let's simplify return handling. Signed-off-by: David Hildenbrand --- arch/x86/xen/mmu_hvm.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c index b242d1f4b426..d1b38c77352b 100644 --- a/arch/x86/xen/mmu_hvm.c +++ b/arch/x86/xen/mmu_hvm.c @@ -21,23 +21,10 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) .domid = DOMID_SELF, .pfn = pfn, }; - int ram; if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) return -ENXIO; - - switch (a.mem_type) { - case HVMMEM_mmio_dm: - ram = 0; - break; - case HVMMEM_ram_rw: - case HVMMEM_ram_ro: - default: - ram = 1; - break; - } - - return ram; + return a.mem_type != HVMMEM_mmio_dm; } #endif -- 2.31.1
[PATCH v2 1/9] x86/xen: update xen_oldmem_pfn_is_ram() documentation
The callback is only used for the vmcore nowadays. Reviewed-by: Boris Ostrovsky Signed-off-by: David Hildenbrand --- arch/x86/xen/mmu_hvm.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c index 57409373750f..b242d1f4b426 100644 --- a/arch/x86/xen/mmu_hvm.c +++ b/arch/x86/xen/mmu_hvm.c @@ -9,12 +9,9 @@ #ifdef CONFIG_PROC_VMCORE /* - * This function is used in two contexts: - * - the kdump kernel has to check whether a pfn of the crashed kernel - * was a ballooned page. vmcore is using this function to decide - * whether to access a pfn of the crashed kernel. - * - the kexec kernel has to check whether a pfn was ballooned by the - * previous kernel. If the pfn is ballooned, handle it properly. + * The kdump kernel has to check whether a pfn of the crashed kernel + * was a ballooned page. vmcore is using this function to decide + * whether to access a pfn of the crashed kernel. * Returns 0 if the pfn is not backed by a RAM page, the caller may * handle the pfn special in this case. */ -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 0/9] proc/vmcore: sanitize access to virtio-mem memory
As so often with virtio-mem changes that mess with common MM infrastructure, this might be a good candidate to go via Andrew's tree. -- After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem, this series tackles the last sane way in which a VM could accidentally access logically unplugged memory managed by a virtio-mem device: /proc/vmcore When dumping memory via "makedumpfile", PG_offline pages, used by virtio-mem to flag logically unplugged memory, are already properly excluded; however, especially when accessing/copying /proc/vmcore "the usual way", we can still end up reading logically unplugged memory that is part of a virtio-mem device. Patch #1-#3 are cleanups. Patch #4 extends the existing oldmem_pfn_is_ram mechanism. Patch #5-#7 are virtio-mem refactorings for patch #8, which implements the virtio-mem logic to query the state of device blocks. Patch #8: " Although virtio-mem currently supports reading unplugged memory in the hypervisor, this will change in the future, indicated to the device via a new feature flag. We similarly sanitized /proc/kcore access recently. [...] Distributions that support virtio-mem+kdump have to make sure that the virtio_mem module will be part of the kdump kernel or the kdump initrd; dracut was recently [2] extended to include virtio-mem in the generated initrd. As long as no special kdump kernels are used, this will automatically make sure that virtio-mem will be around in the kdump initrd and sanitize /proc/vmcore access -- with dracut. " This is the last remaining bit to support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of virtio-mem. Note: this is best-effort. We'll never be able to control what runs inside the second kernel, really, but we also don't have to care: we only care about sane setups where we don't want our VM getting zapped once we touch the wrong memory location while dumping. 
While we usually expect sane setups to use "makedumpfile", nothing really speaks against just copying /proc/vmcore, especially in environments where HWpoisoning isn't typically expected. Also, we really don't want to place all our trust in the memmap, so sanitizing also makes sense when just using "makedumpfile". [1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com [2] https://github.com/dracutdevs/dracut/pull/1157 [3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html v1 -> v2: - "x86/xen: simplify xen_oldmem_pfn_is_ram()" -- Simplify even more - "x86/xen: print a warning when HVMOP_get_mem_type fails" -- Added - "virtio-mem: kdump mode to sanitize /proc/vmcore access" -- Fix wrong range check Cc: Andrew Morton Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Michal Hocko Cc: Oscar Salvador Cc: Mike Rapoport Cc: "Rafael J. 
Wysocki" Cc: x...@kernel.org Cc: xen-de...@lists.xenproject.org Cc: virtualizat...@lists.linux-foundation.org Cc: kexec@lists.infradead.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org David Hildenbrand (9): x86/xen: update xen_oldmem_pfn_is_ram() documentation x86/xen: simplify xen_oldmem_pfn_is_ram() x86/xen: print a warning when HVMOP_get_mem_type fails proc/vmcore: let pfn_is_ram() return a bool proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug() virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug() virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug() virtio-mem: kdump mode to sanitize /proc/vmcore access arch/x86/kernel/aperture_64.c | 13 +- arch/x86/xen/mmu_hvm.c| 37 ++--- drivers/virtio/virtio_mem.c | 297 -- fs/proc/vmcore.c | 105 include/linux/crash_dump.h| 26 ++- 5 files changed, 333 insertions(+), 145 deletions(-) base-commit: 9e1ff307c779ce1f0f810c7ecce3d95bbae40896 -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 9/9] virtio-mem: kdump mode to sanitize /proc/vmcore access
Although virtio-mem currently supports reading unplugged memory in the hypervisor, this will change in the future, indicated to the device via a new feature flag. We similarly sanitized /proc/kcore access recently. [1] Let's register a vmcore callback, to allow vmcore code to check if a PFN belonging to a virtio-mem device is either currently plugged and should be dumped or is currently unplugged and should not be accessed, instead mapping the shared zeropage or returning zeroes when reading. This is important when not capturing /proc/vmcore via tools like "makedumpfile" that can identify logically unplugged virtio-mem memory via PG_offline in the memmap, but simply by e.g., copying the file. Distributions that support virtio-mem+kdump have to make sure that the virtio_mem module will be part of the kdump kernel or the kdump initrd; dracut was recently [2] extended to include virtio-mem in the generated initrd. As long as no special kdump kernels are used, this will automatically make sure that virtio-mem will be around in the kdump initrd and sanitize /proc/vmcore access -- with dracut. With this series, we'll send one virtio-mem state request for every ~2 MiB chunk of virtio-mem memory indicated in the vmcore that we intend to read/map. In the future, we might want to allow building virtio-mem for kdump mode only, even without CONFIG_MEMORY_HOTPLUG and friends: this way, we could support special stripped-down kdump kernels that have many other config options disabled; we'll tackle that once required. Further, we might want to try sensing bigger blocks (e.g., memory sections) first before falling back to device blocks on demand. Tested with Fedora rawhide, which contains a recent kexec-tools version (considering "System RAM (virtio_mem)" when creating the vmcore header) and a recent dracut version (including the virtio_mem module in the kdump initrd). 
[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com [2] https://github.com/dracutdevs/dracut/pull/1157 Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 136 1 file changed, 124 insertions(+), 12 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 76d8aef3cfd2..3573b78f6b05 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -223,6 +223,9 @@ struct virtio_mem { * When this lock is held the pointers can't change, ONLINE and * OFFLINE blocks can't change the state and no subblocks will get * plugged/unplugged. +* +* In kdump mode, used to serialize requests, last_block_addr and +* last_block_plugged. */ struct mutex hotplug_mutex; bool hotplug_active; @@ -230,6 +233,9 @@ struct virtio_mem { /* An error occurred we cannot handle - stop processing requests. */ bool broken; + /* Cached valued of is_kdump_kernel() when the device was probed. */ + bool in_kdump; + /* The driver is being removed. */ spinlock_t removal_lock; bool removing; @@ -243,6 +249,13 @@ struct virtio_mem { /* Memory notifier (online/offline events). */ struct notifier_block memory_notifier; +#ifdef CONFIG_PROC_VMCORE + /* vmcore callback for /proc/vmcore handling in kdump mode */ + struct vmcore_cb vmcore_cb; + uint64_t last_block_addr; + bool last_block_plugged; +#endif /* CONFIG_PROC_VMCORE */ + /* Next device in the list of virtio-mem devices. 
*/ struct list_head next; }; @@ -2293,6 +2306,12 @@ static void virtio_mem_run_wq(struct work_struct *work) uint64_t diff; int rc; + if (unlikely(vm->in_kdump)) { + dev_warn_once(>vdev->dev, +"unexpected workqueue run in kdump kernel\n"); + return; + } + hrtimer_cancel(>retry_timer); if (vm->broken) @@ -2521,6 +2540,86 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm) return rc; } +#ifdef CONFIG_PROC_VMCORE +static int virtio_mem_send_state_request(struct virtio_mem *vm, uint64_t addr, +uint64_t size) +{ + const uint64_t nb_vm_blocks = size / vm->device_block_size; + const struct virtio_mem_req req = { + .type = cpu_to_virtio16(vm->vdev, VIRTIO_MEM_REQ_STATE), + .u.state.addr = cpu_to_virtio64(vm->vdev, addr), + .u.state.nb_blocks = cpu_to_virtio16(vm->vdev, nb_vm_blocks), + }; + int rc = -ENOMEM; + + dev_dbg(>vdev->dev, "requesting state: 0x%llx - 0x%llx\n", addr, + addr + size - 1); + + switch (virtio_mem_send_request(vm, )) { + case VIRTIO_MEM_RESP_ACK: + return virtio16_to_cpu(vm->vdev, vm->resp.u.state.state); + case VIRTIO_MEM_RESP_ERROR: +
[PATCH v2 7/9] virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug()
Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 87 +++-- 1 file changed, 45 insertions(+), 42 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 2ba7e8d6ba8d..1be3ee7f684d 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -260,6 +260,8 @@ static void virtio_mem_fake_offline_going_offline(unsigned long pfn, static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn, unsigned long nr_pages); static void virtio_mem_retry(struct virtio_mem *vm); +static int virtio_mem_create_resource(struct virtio_mem *vm); +static void virtio_mem_delete_resource(struct virtio_mem *vm); /* * Register a virtio-mem device so it will be considered for the online_page @@ -2395,7 +2397,8 @@ static int virtio_mem_init_vq(struct virtio_mem *vm) static int virtio_mem_init_hotplug(struct virtio_mem *vm) { const struct range pluggable_range = mhp_get_pluggable_range(true); - uint64_t sb_size, addr; + uint64_t unit_pages, sb_size, addr; + int rc; /* bad device setup - warn only */ if (!IS_ALIGNED(vm->addr, memory_block_size_bytes())) @@ -2474,7 +2477,48 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm) dev_info(>vdev->dev, "big block size: 0x%llx", (unsigned long long)vm->bbm.bb_size); + /* create the parent resource for all memory */ + rc = virtio_mem_create_resource(vm); + if (rc) + return rc; + + /* use a single dynamic memory group to cover the whole memory device */ + if (vm->in_sbm) + unit_pages = PHYS_PFN(memory_block_size_bytes()); + else + unit_pages = PHYS_PFN(vm->bbm.bb_size); + rc = memory_group_register_dynamic(vm->nid, unit_pages); + if (rc < 0) + goto out_del_resource; + vm->mgid = rc; + + /* +* If we still have memory plugged, we have to unplug all memory first. 
+* Registering our parent resource makes sure that this memory isn't +* actually in use (e.g., trying to reload the driver). +*/ + if (vm->plugged_size) { + vm->unplug_all_required = true; + dev_info(>vdev->dev, "unplugging all memory is required\n"); + } + + /* register callbacks */ + vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb; + rc = register_memory_notifier(>memory_notifier); + if (rc) + goto out_unreg_group; + rc = register_virtio_mem_device(vm); + if (rc) + goto out_unreg_mem; + return 0; +out_unreg_mem: + unregister_memory_notifier(>memory_notifier); +out_unreg_group: + memory_group_unregister(vm->mgid); +out_del_resource: + virtio_mem_delete_resource(vm); + return rc; } static int virtio_mem_init(struct virtio_mem *vm) @@ -2578,7 +2622,6 @@ static bool virtio_mem_has_memory_added(struct virtio_mem *vm) static int virtio_mem_probe(struct virtio_device *vdev) { struct virtio_mem *vm; - uint64_t unit_pages; int rc; BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24); @@ -2608,40 +2651,6 @@ static int virtio_mem_probe(struct virtio_device *vdev) if (rc) goto out_del_vq; - /* create the parent resource for all memory */ - rc = virtio_mem_create_resource(vm); - if (rc) - goto out_del_vq; - - /* use a single dynamic memory group to cover the whole memory device */ - if (vm->in_sbm) - unit_pages = PHYS_PFN(memory_block_size_bytes()); - else - unit_pages = PHYS_PFN(vm->bbm.bb_size); - rc = memory_group_register_dynamic(vm->nid, unit_pages); - if (rc < 0) - goto out_del_resource; - vm->mgid = rc; - - /* -* If we still have memory plugged, we have to unplug all memory first. -* Registering our parent resource makes sure that this memory isn't -* actually in use (e.g., trying to reload the driver). 
-*/ - if (vm->plugged_size) { - vm->unplug_all_required = true; - dev_info(>vdev->dev, "unplugging all memory is required\n"); - } - - /* register callbacks */ - vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb; - rc = register_memory_notifier(>memory_notifier); - if (rc) - goto out_unreg_group; - rc = register_virtio_mem_device(vm); - if (rc) - goto out_unreg_mem; - virtio_device_ready(vdev)
[PATCH v2 8/9] virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug()
Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 1be3ee7f684d..76d8aef3cfd2 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -2667,9 +2667,8 @@ static int virtio_mem_probe(struct virtio_device *vdev) return rc; } -static void virtio_mem_remove(struct virtio_device *vdev) +static void virtio_mem_deinit_hotplug(struct virtio_mem *vm) { - struct virtio_mem *vm = vdev->priv; unsigned long mb_id; int rc; @@ -2716,7 +2715,8 @@ static void virtio_mem_remove(struct virtio_device *vdev) * away. Warn at least. */ if (virtio_mem_has_memory_added(vm)) { - dev_warn(&vdev->dev, "device still has system memory added\n"); + dev_warn(&vm->vdev->dev, +"device still has system memory added\n"); } else { virtio_mem_delete_resource(vm); kfree_const(vm->resource_name); @@ -2730,6 +2730,13 @@ static void virtio_mem_remove(struct virtio_device *vdev) } else { vfree(vm->bbm.bb_states); } +} + +static void virtio_mem_remove(struct virtio_device *vdev) +{ + struct virtio_mem *vm = vdev->priv; + + virtio_mem_deinit_hotplug(vm); /* reset the device and cleanup the queues */ vdev->config->reset(vdev); -- 2.31.1
[PATCH v2 6/9] virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug()
Let's prepare for a new virtio-mem kdump mode in which we don't actually
hot(un)plug any memory but only observe the state of device blocks.

Signed-off-by: David Hildenbrand

---
 drivers/virtio/virtio_mem.c | 81 ++++++++++++++++++++-----------------
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index bef8ad6bf466..2ba7e8d6ba8d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2392,41 +2392,10 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
 	return 0;
 }
 
-static int virtio_mem_init(struct virtio_mem *vm)
+static int virtio_mem_init_hotplug(struct virtio_mem *vm)
 {
 	const struct range pluggable_range = mhp_get_pluggable_range(true);
 	uint64_t sb_size, addr;
-	uint16_t node_id;
-
-	if (!vm->vdev->config->get) {
-		dev_err(&vm->vdev->dev, "config access disabled\n");
-		return -EINVAL;
-	}
-
-	/*
-	 * We don't want to (un)plug or reuse any memory when in kdump. The
-	 * memory is still accessible (but not mapped).
-	 */
-	if (is_kdump_kernel()) {
-		dev_warn(&vm->vdev->dev, "disabled in kdump kernel\n");
-		return -EBUSY;
-	}
-
-	/* Fetch all properties that can't change. */
-	virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
-			&vm->plugged_size);
-	virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
-			&vm->device_block_size);
-	virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
-			&node_id);
-	vm->nid = virtio_mem_translate_node_id(vm, node_id);
-	virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
-	virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
-			&vm->region_size);
-
-	/* Determine the nid for the device based on the lowest address. */
-	if (vm->nid == NUMA_NO_NODE)
-		vm->nid = memory_add_physaddr_to_nid(vm->addr);
 
 	/* bad device setup - warn only */
 	if (!IS_ALIGNED(vm->addr, memory_block_size_bytes()))
@@ -2496,10 +2465,6 @@ static int virtio_mem_init(struct virtio_mem *vm)
 			 vm->offline_threshold);
 	}
 
-	dev_info(&vm->vdev->dev, "start address: 0x%llx", vm->addr);
-	dev_info(&vm->vdev->dev, "region size: 0x%llx", vm->region_size);
-	dev_info(&vm->vdev->dev, "device block size: 0x%llx",
-		 (unsigned long long)vm->device_block_size);
 	dev_info(&vm->vdev->dev, "memory block size: 0x%lx",
 		 memory_block_size_bytes());
 	if (vm->in_sbm)
@@ -2508,10 +2473,52 @@ static int virtio_mem_init(struct virtio_mem *vm)
 	else
 		dev_info(&vm->vdev->dev, "big block size: 0x%llx",
 			 (unsigned long long)vm->bbm.bb_size);
+
+	return 0;
+}
+
+static int virtio_mem_init(struct virtio_mem *vm)
+{
+	uint16_t node_id;
+
+	if (!vm->vdev->config->get) {
+		dev_err(&vm->vdev->dev, "config access disabled\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * We don't want to (un)plug or reuse any memory when in kdump. The
+	 * memory is still accessible (but not mapped).
+	 */
+	if (is_kdump_kernel()) {
+		dev_warn(&vm->vdev->dev, "disabled in kdump kernel\n");
+		return -EBUSY;
+	}
+
+	/* Fetch all properties that can't change. */
+	virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size,
+			&vm->plugged_size);
+	virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size,
+			&vm->device_block_size);
+	virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id,
+			&node_id);
+	vm->nid = virtio_mem_translate_node_id(vm, node_id);
+	virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, &vm->addr);
+	virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size,
+			&vm->region_size);
+
+	/* Determine the nid for the device based on the lowest address. */
+	if (vm->nid == NUMA_NO_NODE)
+		vm->nid = memory_add_physaddr_to_nid(vm->addr);
+
+	dev_info(&vm->vdev->dev, "start address: 0x%llx", vm->addr);
+	dev_info(&vm->vdev->dev, "region size: 0x%llx", vm->region_size);
+	dev_info(&vm->vdev->dev, "device block size: 0x%llx",
+		 (unsigned long long)vm->device_block_size);
 	if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA))
 		dev_info(&vm->vdev->dev, "nid: %d", vm->nid);
 
-	return 0;
+	return virtio_mem_init_hotplug(vm);
 }
 
 static int virtio_mem_create_resource(struct virtio_mem *vm)
--
2.31.1
[PATCH v2 4/9] proc/vmcore: let pfn_is_ram() return a bool
The callback should deal with errors internally; it doesn't make sense to
expose these via pfn_is_ram(). We'll rework the callbacks next. Right now
we consider errors as if "it's RAM"; no functional change.

Signed-off-by: David Hildenbrand

---
 fs/proc/vmcore.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 9a15334da208..a9bd80ab670e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -84,11 +84,11 @@ void unregister_oldmem_pfn_is_ram(void)
 }
 EXPORT_SYMBOL_GPL(unregister_oldmem_pfn_is_ram);
 
-static int pfn_is_ram(unsigned long pfn)
+static bool pfn_is_ram(unsigned long pfn)
 {
 	int (*fn)(unsigned long pfn);
 	/* pfn is ram unless fn() checks pagetype */
-	int ret = 1;
+	bool ret = true;
 
 	/*
 	 * Ask hypervisor if the pfn is really ram.
@@ -97,7 +97,7 @@ static int pfn_is_ram(unsigned long pfn)
 	 */
 	fn = oldmem_pfn_is_ram;
 	if (fn)
-		ret = fn(pfn);
+		ret = !!fn(pfn);
 
 	return ret;
 }
@@ -124,7 +124,7 @@ ssize_t read_from_oldmem(char *buf, size_t count,
 			nr_bytes = count;
 
 		/* If pfn is not ram, return zeros for sparse dump files */
-		if (pfn_is_ram(pfn) == 0)
+		if (!pfn_is_ram(pfn))
 			memset(buf, 0, nr_bytes);
 		else {
 			if (encrypted)
--
2.31.1
[PATCH v2 5/9] proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks
Let's support multiple registered callbacks, making sure that registering vmcore callbacks cannot fail. Make the callback return a bool instead of an int, handling how to deal with errors internally. Drop unused HAVE_OLDMEM_PFN_IS_RAM. We soon want to make use of this infrastructure from other drivers: virtio-mem, registering one callback for each virtio-mem device, to prevent reading unplugged virtio-mem memory. Handle it via a generic vmcore_cb structure, prepared for future extensions: for example, once we support virtio-mem on s390x where the vmcore is completely constructed in the second kernel, we want to detect and add plugged virtio-mem memory ranges to the vmcore in order for them to get dumped properly. Handle corner cases that are unexpected and shouldn't happen in sane setups: registering a callback after the vmcore has already been opened (warn only) and unregistering a callback after the vmcore has already been opened (warn and essentially read only zeroes from that point on). 
Signed-off-by: David Hildenbrand

---
 arch/x86/kernel/aperture_64.c | 13 ++++++++-
 arch/x86/xen/mmu_hvm.c        | 11 +++++----
 fs/proc/vmcore.c              | 99 +++++++++++++++++++++++++-----------
 include/linux/crash_dump.h    | 26 ++++++++---
 4 files changed, 111 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 10562885f5fc..af3ba08b684b 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -73,12 +73,23 @@ static int gart_mem_pfn_is_ram(unsigned long pfn)
 		      (pfn >= aperture_pfn_start + aperture_page_count));
 }
 
+#ifdef CONFIG_PROC_VMCORE
+static bool gart_oldmem_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
+{
+	return !!gart_mem_pfn_is_ram(pfn);
+}
+
+static struct vmcore_cb gart_vmcore_cb = {
+	.pfn_is_ram = gart_oldmem_pfn_is_ram,
+};
+#endif
+
 static void __init exclude_from_core(u64 aper_base, u32 aper_order)
 {
 	aperture_pfn_start = aper_base >> PAGE_SHIFT;
 	aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
 #ifdef CONFIG_PROC_VMCORE
-	WARN_ON(register_oldmem_pfn_is_ram(&gart_mem_pfn_is_ram));
+	register_vmcore_cb(&gart_vmcore_cb);
 #endif
 #ifdef CONFIG_PROC_KCORE
 	WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
index 6ba8826dcdcc..509bdee3ab90 100644
--- a/arch/x86/xen/mmu_hvm.c
+++ b/arch/x86/xen/mmu_hvm.c
@@ -12,10 +12,10 @@
  * The kdump kernel has to check whether a pfn of the crashed kernel
  * was a ballooned page. vmcore is using this function to decide
  * whether to access a pfn of the crashed kernel.
- * Returns 0 if the pfn is not backed by a RAM page, the caller may
+ * Returns "false" if the pfn is not backed by a RAM page, the caller may
  * handle the pfn special in this case.
  */
-static int xen_oldmem_pfn_is_ram(unsigned long pfn)
+static bool xen_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
 {
 	struct xen_hvm_get_mem_type a = {
 		.domid = DOMID_SELF,
@@ -24,10 +24,13 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn)
 
 	if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) {
 		pr_warn_once("Unexpected HVMOP_get_mem_type failure\n");
-		return -ENXIO;
+		return true;
 	}
 	return a.mem_type != HVMMEM_mmio_dm;
 }
+static struct vmcore_cb xen_vmcore_cb = {
+	.pfn_is_ram = xen_vmcore_pfn_is_ram,
+};
 #endif
 
 static void xen_hvm_exit_mmap(struct mm_struct *mm)
@@ -61,6 +64,6 @@ void __init xen_hvm_init_mmu_ops(void)
 	if (is_pagetable_dying_supported())
 		pv_ops.mmu.exit_mmap = xen_hvm_exit_mmap;
 #ifdef CONFIG_PROC_VMCORE
-	WARN_ON(register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram));
+	register_vmcore_cb(&xen_vmcore_cb);
 #endif
 }
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a9bd80ab670e..7a04b2eca287 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -62,46 +62,75 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0);
 /* Device Dump Size */
 static size_t vmcoredd_orig_sz;
 
-/*
- * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
- * The called function has to take care of module refcounting.
- */
-static int (*oldmem_pfn_is_ram)(unsigned long pfn);
-
-int register_oldmem_pfn_is_ram(int (*fn)(unsigned long pfn))
+static DECLARE_RWSEM(vmcore_cb_rwsem);
+/* List of registered vmcore callbacks. */
+static LIST_HEAD(vmcore_cb_list);
+/* Whether we had a surprise unregistration of a callback. */
+static bool vmcore_cb_unstable;
+/* Whether the vmcore has been opened once. */
+static bool vmcore_opened;
+
+void register_vmcore_cb(struct vmcore_cb *cb)
 {
-	if (oldmem_pfn_is_ram)
-		return -EBUSY;
-	oldmem_pfn_is_ram = fn;
-	return 0;
+	down_write(&vmcore_cb_rwsem);
+	INIT_LIST_HEAD(&cb->next);
+	list_add_tail(&cb->next, &vmcore_cb_list)
[PATCH v2 5/5] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED
Let's communicate driver-managed regions to memblock, to properly teach
kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on these
memory regions.

Signed-off-by: David Hildenbrand

---
 mm/memory_hotplug.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5f873e7f5b29..6d90818d4ce8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1357,6 +1357,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
 	struct vmem_altmap mhp_altmap = {};
 	struct memory_group *group = NULL;
 	u64 start, size;
@@ -1385,7 +1386,9 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	mem_hotplug_begin();
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-		ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
+		if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED)
+			memblock_flags = MEMBLOCK_DRIVER_MANAGED;
+		ret = memblock_add_node(start, size, nid, memblock_flags);
 		if (ret)
 			goto error_mem_hotplug_end;
 	}
--
2.31.1
[PATCH v2 4/5] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED, indicating that we're dealing with a memory region that is never indicated in the firmware-provided memory map, but always detected and added by a driver. Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory regions like ordinary MEMBLOCK_NONE memory regions -- for example, when selecting memory regions to add to the vmcore for dumping in the crashkernel via for_each_mem_range(). However, especially kexec_file is not supposed to select such memblocks via for_each_free_mem_range() / for_each_free_mem_range_reverse() to place kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK. We'll make sure that memory hotplug code sets the flag where applicable (IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem support. Note that kexec *must not* indicate this memory to the second kernel and *must not* place kexec-images on this memory. Let's add a comment to kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in locate_mem_hole_callback() for kexec_walk_resources(). Also note that MEMBLOCK_HOTPLUG cannot be reused due to different semantics: MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the firmware-provided memory map and added to the system early during boot; kexec *has to* indicate this memory to the second kernel and can place kexec-images on this memory. After memory hotunplug, kexec has to be re-armed. We mostly ignore this flag when "movable_node" is not set on the kernel command line, because then we're told to not care about hotunpluggability of such memory regions. 
MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in the
firmware-provided memory map; this memory is always detected and added to
the system by a driver; memory might not actually be physically
hotunpluggable. kexec *must not* indicate this memory to the second
kernel and *must not* place kexec-images on this memory.

Signed-off-by: David Hildenbrand

---
 include/linux/memblock.h | 16 ++++++++++++++--
 kernel/kexec_file.c      |  5 +++++
 mm/memblock.c            |  4 ++++
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 2bc726e43a1b..b3b29ccf91f3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -37,12 +37,17 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
  * for further details
+ * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected and added
+ * via a driver, and never indicated in the firmware-provided memory map as
+ * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
+ * kernel resource tree.
  */
 enum memblock_flags {
 	MEMBLOCK_NONE		= 0x0,	/* No special request */
 	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
+	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
 };
 
 /**
@@ -213,7 +218,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
  */
 #define for_each_mem_range(i, p_start, p_end) \
 	__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,	\
-			     MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+			     MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
+			     p_start, p_end, NULL)
 
 /**
  * for_each_mem_range_rev - reverse iterate through memblock areas from
@@ -224,7 +230,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
  */
 #define for_each_mem_range_rev(i, p_start, p_end)			\
 	__for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, \
-				 MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+				 MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
+				 p_start, p_end, NULL)
 
 /**
  * for_each_reserved_mem_range - iterate over all reserved memblock areas
@@ -254,6 +261,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 	return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_driver_managed(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_DRIVER_MANAGED;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long *end_pfn);
 void __next_mem_pfn_range(int *idx, int n
[PATCH v2 3/5] memblock: allow to specify flags with memblock_add_node()
We want to specify flags when hotplugging memory. Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.

Note that when hotplugging memory the system is already up and running
and we might have concurrent memblock users: for example, while we're
hotplugging memory, kexec_file code might search for suitable memory
regions to place kexec images. It's important to add the memory directly
to memblock via a single call with the right flags, instead of adding
the memory first and applying flags later: otherwise, concurrent
memblock users might temporarily stumble over memblocks with wrong
flags, which will be important in a follow-up patch that introduces a
new flag to properly handle add_memory_driver_managed().

Acked-by: Geert Uytterhoeven
Acked-by: Heiko Carstens
Signed-off-by: David Hildenbrand

---
 arch/arc/mm/init.c               | 4 ++--
 arch/ia64/mm/contig.c            | 2 +-
 arch/ia64/mm/init.c              | 2 +-
 arch/m68k/mm/mcfmmu.c            | 3 ++-
 arch/m68k/mm/motorola.c          | 6 ++++--
 arch/mips/loongson64/init.c      | 4 +++-
 arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
 arch/s390/kernel/setup.c         | 3 ++-
 include/linux/memblock.h         | 3 ++-
 include/linux/mm.h               | 2 +-
 mm/memblock.c                    | 9 +++++----
 mm/memory_hotplug.c              | 2 +-
 12 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 699ecf119641..110eb69e9bee 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
 		low_mem_sz = size;
 		in_use = 1;
-		memblock_add_node(base, size, 0);
+		memblock_add_node(base, size, 0, MEMBLOCK_NONE);
 	} else {
 #ifdef CONFIG_HIGHMEM
 		high_mem_start = base;
 		high_mem_sz = size;
 		in_use = 1;
-		memblock_add_node(base, size, 1);
+		memblock_add_node(base, size, 1, MEMBLOCK_NONE);
 		memblock_reserve(base, size);
 #endif
 	}
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 42e025cfbd08..24901d809301 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -153,7 +153,7 @@ find_memory (void)
 	efi_memmap_walk(find_max_min_low_pfn, NULL);
 	max_pfn = max_low_pfn;
 
-	memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
+	memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
 
 	find_initrd();
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 5c6da8d83c1a..5d165607bf35 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int nid)
 #endif
 
 	if (start < end)
-		memblock_add_node(__pa(start), end - start, nid);
+		memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
 	return 0;
 }
diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index eac9dde65193..6f1f25125294 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
 	m68k_memory[0].addr = _rambase;
 	m68k_memory[0].size = _ramend - _rambase;
 
-	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+			  MEMBLOCK_NONE);
 
 	/* compute total pages in system */
 	num_pages = PFN_DOWN(_ramend - _rambase);
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 9f3f77785aa7..2b05bb2bac00 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -410,7 +410,8 @@ void __init paging_init(void)
 
 	min_addr = m68k_memory[0].addr;
 	max_addr = min_addr + m68k_memory[0].size;
-	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+			  MEMBLOCK_NONE);
 	for (i = 1; i < m68k_num_memory;) {
 		if (m68k_memory[i].addr < min_addr) {
 			printk("Ignoring memory chunk at 0x%lx:0x%lx before the first chunk\n",
@@ -421,7 +422,8 @@ void __init paging_init(void)
 			       (m68k_num_memory - i) * sizeof(struct m68k_mem_info));
 			continue;
 		}
-		memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
+		memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
+				  MEMBLOCK_NONE);
 		addr = m68k_memory[i].addr + m68k_memory[i].size;
 		if (addr > max_addr)
 			max_addr = addr;
diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
index 76e0a9636a0e..4ac5ba80bbf6 100644
--- a/arch/
[PATCH v2 0/5] mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK
Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64,
don't cleanly support add_memory_driver_managed() yet. Most prominently,
kexec_file can still end up placing kexec images on such driver-managed
memory, resulting in undesired behavior, for example, having kexec
images located on memory not part of the firmware-provided memory map.

Teaching kexec to not place images on driver-managed memory is
especially relevant for virtio-mem. Details can be found in commit
7b7b27214bba ("mm/memory_hotplug: introduce add_memory_driver_managed()").

Extend memblock with a new flag and set it from memory hotplug code when
applicable. This is required to fully support virtio-mem on arm64, also
making kexec_file behave like on x86-64.

v1 -> v2:
- "memblock: improve MEMBLOCK_HOTPLUG documentation"
  -- Added
- "memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
   IORESOURCE_SYSRAM_DRIVER_MANAGED"
  -- Improve documentation of MEMBLOCK_DRIVER_MANAGED
- Refine patch descriptions

Cc: Andrew Morton
Cc: Mike Rapoport
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Jianyong Wu
Cc: Aneesh Kumar K.V
Cc: Vineet Gupta
Cc: Geert Uytterhoeven
Cc: Huacai Chen
Cc: Jiaxun Yang
Cc: Thomas Bogendoerfer
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Christian Borntraeger
Cc: Eric Biederman
Cc: Arnd Bergmann
Cc: linux-snps-...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@kvack.org
Cc: kexec@lists.infradead.org

David Hildenbrand (5):
  mm/memory_hotplug: handle memblock_add_node() failures in
    add_memory_resource()
  memblock: improve MEMBLOCK_HOTPLUG documentation
  memblock: allow to specify flags with memblock_add_node()
  memblock: add MEMBLOCK_DRIVER_MANAGED to mimic
    IORESOURCE_SYSRAM_DRIVER_MANAGED
  mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with
    IORESOURCE_SYSRAM_DRIVER_MANAGED

 arch/arc/mm/init.c               |  4 ++--
 arch/ia64/mm/contig.c            |  2 +-
 arch/ia64/mm/init.c              |  2 +-
 arch/m68k/mm/mcfmmu.c            |  3 ++-
 arch/m68k/mm/motorola.c          |  6 ++++--
 arch/mips/loongson64/init.c      |  4 +++-
 arch/mips/sgi-ip27/ip27-memory.c |  3 ++-
 arch/s390/kernel/setup.c         |  3 ++-
 include/linux/memblock.h         | 25 ++++++++++++++++++++-----
 include/linux/mm.h               |  2 +-
 kernel/kexec_file.c              |  5 +++++
 mm/memblock.c                    | 13 ++++++++-----
 mm/memory_hotplug.c              | 11 +++++++++--
 13 files changed, 62 insertions(+), 21 deletions(-)

base-commit: 9e1ff307c779ce1f0f810c7ecce3d95bbae40896
--
2.31.1
[PATCH v2 2/5] memblock: improve MEMBLOCK_HOTPLUG documentation
The description of MEMBLOCK_HOTPLUG is currently short and consequently
misleading: we're actually dealing with a memory region that might get
hotunplugged later (i.e., the platform+firmware supports it), yet it is
indicated in the firmware-provided memory map as system RAM that will
just get used by the system for any purpose when not taking special
care.

The firmware marked this memory region as hot(un)pluggable (e.g.,
hotplugged before reboot), implying that it might get hotunplugged again
later. Whether we consider this information depends on the
"movable_node" kernel commandline parameter: only with "movable_node"
set, we'll try keeping this memory hotunpluggable, for example, by not
serving early allocations from this memory region and by letting the
buddy manage it via ZONE_MOVABLE.

Let's make this clearer by extending the documentation.

Note: kexec *has to* indicate this memory to the second kernel. With
"movable_node" set, we don't want to place kexec-images on this memory.
Without "movable_node" set, we don't care and can place kexec-images on
this memory. In both cases, after successful memory hotunplug, kexec has
to be re-armed to update the memory map for the second kernel and to
place the kexec-images somewhere else.

Signed-off-by: David Hildenbrand

---
 include/linux/memblock.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 34de69b3b8ba..4ee8dd2d63a7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -28,7 +28,11 @@ extern unsigned long long max_possible_pfn;
 /**
  * enum memblock_flags - definition of memory region attributes
  * @MEMBLOCK_NONE: no special request
- * @MEMBLOCK_HOTPLUG: hotpluggable region
+ * @MEMBLOCK_HOTPLUG: memory region indicated in the firmware-provided memory
+ * map during early boot as hot(un)pluggable system RAM (e.g., memory range
+ * that might get hotunplugged later). With "movable_node" set on the kernel
+ * commandline, try keeping this memory region hotunpluggable. Does not apply
+ * to memblocks added ("hotplugged") after early boot.
  * @MEMBLOCK_MIRROR: mirrored region
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
--
2.31.1
[PATCH v2 1/5] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()
If memblock_add_node() fails, we're most probably running out of memory.
While this is unlikely to happen, it can happen, and having memory added
without a memblock can be problematic for architectures that use
memblock to detect valid memory. Let's fail in a nice way instead of
silently ignoring the error.

Signed-off-by: David Hildenbrand

---
 mm/memory_hotplug.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fd0be32a281..917b3528636d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,8 +1384,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	mem_hotplug_begin();
 
-	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
-		memblock_add_node(start, size, nid);
+	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+		ret = memblock_add_node(start, size, nid);
+		if (ret)
+			goto error_mem_hotplug_end;
+	}
 
 	ret = __try_online_node(nid, false);
 	if (ret < 0)
@@ -1458,6 +1461,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
+error_mem_hotplug_end:
 	mem_hotplug_done();
 	return ret;
 }
--
2.31.1
Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
On 30.09.21 23:21, Mike Rapoport wrote: On Wed, Sep 29, 2021 at 06:54:01PM +0200, David Hildenbrand wrote: On 29.09.21 18:39, Mike Rapoport wrote: Hi, On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote: Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED. Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory regions to add to the vmcore for dumping in the crashkernel via for_each_mem_range(). Can you please elaborate on the difference in semantics of MEMBLOCK_HOTPLUG and MEMBLOCK_DRIVER_MANAGED? Unless I'm missing something they both mark memory that can be unplugged anytime and so it should not be used in certain cases. Why is there a need for a new flag? In the cover letter I have "Alternative B: Reuse MEMBLOCK_HOTPLUG. MEMBLOCK_HOTPLUG serves a different purpose, though.", but looking into the details it won't work as is. MEMBLOCK_HOTPLUG is used to mark memory early during boot that can later get hotunplugged again and should be placed into ZONE_MOVABLE if the "movable_node" kernel parameter is set. The confusing part is that we talk about "hotpluggable" but really mean "hotunpluggable": the reason is that HW flags DIMM slots that can later be hotplugged as "hotpluggable" even though there is already something hotplugged. MEMBLOCK_HOTPLUG name is indeed somewhat confusing, but still it's core meaning "this memory may be removed" which does not differ from what IORESOURCE_SYSRAM_DRIVER_MANAGED means. MEMBLOCK_HOTPLUG regions are indeed placed into ZONE_MOVABLE, but more importantly, they are avoided when we allocate memory from memblock. So, in my view, both flags mean that the memory may be removed and it should not be used for certain types of allocations. 
The semantics are different:

MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
firmware-provided memory map and added to the system early during boot;
we want this memory to be managed by ZONE_MOVABLE with "movable_node"
set on the kernel command line, because only then we want it to be
hotpluggable again. kexec *has to* indicate this memory to the second
kernel and can place kexec-images on this memory. After memory
hotunplug, kexec has to be re-armed.

MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in the
firmware-provided memory map; this memory is always detected and added
to the system by a driver; memory might not actually be physically
hotunpluggable and the ZONE selection does not depend on "movable_node".
kexec *must not* indicate this memory to the second kernel and *must
not* place kexec-images on this memory.

I would really advise against mixing concepts here.

What we could do is indicate *all* hotplugged memory (not just
IORESOURCE_SYSRAM_DRIVER_MANAGED memory) as MEMBLOCK_HOTPLUG and make
MEMBLOCK_HOTPLUG less dependent on "movable_node":

MEMBLOCK_HOTPLUG for early boot memory: with "movable_node", place it in
ZONE_MOVABLE. Even without "movable_node", don't place early kernel
allocations on this memory.

MEMBLOCK_HOTPLUG for all memory: don't place kexec images on this
memory, independent of "movable_node".

memblock would then not contain the information "contained in
firmware-provided memory map" vs. "not contained in firmware-provided
memory map"; but I think right now it's not strictly required to have
that information if we'd go down that path.

For example, ranges in the ACPI SRAT that are marked as
ACPI_SRAT_MEM_HOT_PLUGGABLE will be marked MEMBLOCK_HOTPLUG early during
boot (drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()). Later,
we use that information to size ZONE_MOVABLE
(mm/page_alloc.c:find_zone_movable_pfns_for_nodes()). This will make
sure that these "hotpluggable" DIMMs can later get hotunplugged.
Also, see should_skip_region() for how this relates to the
"movable_node" kernel parameter:

	/* skip hotpluggable memory regions if needed */
	if (movable_node_is_enabled() && memblock_is_hotpluggable(m) &&
	    !(flags & MEMBLOCK_HOTPLUG))
		return true;

Hmm, I think that the movable_node_is_enabled() check here is excessive,
but I suspect we cannot simply remove it without breaking anything.

The reasoning is: without "movable_node" we don't want this memory to be
hotunpluggable; consequently, we don't care if we place kexec-images on
this memory. MEMBLOCK_HOTPLUG is currently only active with
"movable_node".

If we remove that check, we will always not place early kernel
allocations on that memory, even if we don't care about ZONE_MOVABLE.
I'll take a deeper look on the potential consequences.

BTW, is there anything that prevents putting kexec to hot-un
Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
On 29.09.21 18:39, Mike Rapoport wrote:

Hi,

On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote:
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
memory like ordinary MEMBLOCK_NONE memory -- for example, when selecting
memory regions to add to the vmcore for dumping in the crashkernel via
for_each_mem_range().

Can you please elaborate on the difference in semantics of
MEMBLOCK_HOTPLUG and MEMBLOCK_DRIVER_MANAGED? Unless I'm missing
something they both mark memory that can be unplugged anytime and so it
should not be used in certain cases. Why is there a need for a new flag?

In the cover letter I have "Alternative B: Reuse MEMBLOCK_HOTPLUG.
MEMBLOCK_HOTPLUG serves a different purpose, though.", but looking into
the details it won't work as is.

MEMBLOCK_HOTPLUG is used to mark memory early during boot that can later
get hotunplugged again and should be placed into ZONE_MOVABLE if the
"movable_node" kernel parameter is set. The confusing part is that we
talk about "hotpluggable" but really mean "hotunpluggable": the reason
is that HW flags DIMM slots that can later be hotplugged as
"hotpluggable" even though there is already something hotplugged.

For example, ranges in the ACPI SRAT that are marked as
ACPI_SRAT_MEM_HOT_PLUGGABLE will be marked MEMBLOCK_HOTPLUG early during
boot (drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()). Later,
we use that information to size ZONE_MOVABLE
(mm/page_alloc.c:find_zone_movable_pfns_for_nodes()). This will make
sure that these "hotpluggable" DIMMs can later get hotunplugged.

Also, see should_skip_region() for how this relates to the
"movable_node" kernel parameter:

	/* skip hotpluggable memory regions if needed */
	if (movable_node_is_enabled() && memblock_is_hotpluggable(m) &&
	    !(flags & MEMBLOCK_HOTPLUG))
		return true;

Long story short: MEMBLOCK_HOTPLUG has different semantics and is a
special case for "movable_node".
--
Thanks,

David / dhildenb
Re: [PATCH v1 2/4] memblock: allow to specify flags with memblock_add_node()
On 29.09.21 18:25, Mike Rapoport wrote: On Mon, Sep 27, 2021 at 05:05:16PM +0200, David Hildenbrand wrote: We want to specify flags when hotplugging memory. Let's prepare to pass flags to memblock_add_node() by adjusting all existing users. Note that when hotplugging memory the system is already up and running and we don't want to add the memory first and apply flags later: it should happen within one memblock call. Why is it important that the system is up and why it should happen in a single call? I don't mind adding a flags parameter to memblock_add_node() but this changelog does not really explain the reasons to do it. "After memblock_add_node(), we could race with anybody performing a search for MEMBLOCK_NONE, like kexec_file -- and that only happens once the system is already up and running. So we want both steps to happen atomically." I can add that to the patch description. (I think it still won't be completely atomic because memblock isn't properly implementing locking yet, but that's a different story) -- Thanks, David / dhildenb
Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()
On 29.09.21 16:22, Boris Ostrovsky wrote: On 9/29/21 5:03 AM, David Hildenbrand wrote: On 29.09.21 10:45, David Hildenbrand wrote: Can we go one step further and do @@ -20,24 +20,11 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) struct xen_hvm_get_mem_type a = { .domid = DOMID_SELF, .pfn = pfn, + .mem_type = HVMMEM_ram_rw, }; - int ram; - if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) - return -ENXIO; - - switch (a.mem_type) { - case HVMMEM_mmio_dm: - ram = 0; - break; - case HVMMEM_ram_rw: - case HVMMEM_ram_ro: - default: - ram = 1; - break; - } - - return ram; + HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a); + return a.mem_type != HVMMEM_mmio_dm; I was actually thinking of asking you to add another patch with pr_warn_once() here (and print the error code as well). This call failing is an indication of something going quite wrong and it would be good to know about it. Will include a patch in v2, thanks! -- Thanks, David / dhildenb
Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()
On 29.09.21 10:45, David Hildenbrand wrote: How about return a.mem_type != HVMMEM_mmio_dm; Ha, how could I have missed that :) Result should be promoted to int and this has the added benefit of not requiring changes in patch 4. Can we go one step further and do @@ -20,24 +20,11 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) struct xen_hvm_get_mem_type a = { .domid = DOMID_SELF, .pfn = pfn, + .mem_type = HVMMEM_ram_rw, }; - int ram; - if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) - return -ENXIO; - - switch (a.mem_type) { - case HVMMEM_mmio_dm: - ram = 0; - break; - case HVMMEM_ram_rw: - case HVMMEM_ram_ro: - default: - ram = 1; - break; - } - - return ram; + HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a); + return a.mem_type != HVMMEM_mmio_dm; } #endif Assuming that if HYPERVISOR_hvm_op() fails, .mem_type is not set to HVMMEM_mmio_dm. Okay we can't, due to "__must_check" ... -- Thanks, David / dhildenb
Re: [PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()
How about return a.mem_type != HVMMEM_mmio_dm; Ha, how could I have missed that :) Result should be promoted to int and this has the added benefit of not requiring changes in patch 4. Can we go one step further and do @@ -20,24 +20,11 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) struct xen_hvm_get_mem_type a = { .domid = DOMID_SELF, .pfn = pfn, + .mem_type = HVMMEM_ram_rw, }; - int ram; - if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) - return -ENXIO; - - switch (a.mem_type) { - case HVMMEM_mmio_dm: - ram = 0; - break; - case HVMMEM_ram_rw: - case HVMMEM_ram_ro: - default: - ram = 1; - break; - } - - return ram; + HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a); + return a.mem_type != HVMMEM_mmio_dm; } #endif Assuming that if HYPERVISOR_hvm_op() fails, .mem_type is not set to HVMMEM_mmio_dm. -- Thanks, David / dhildenb
Re: [PATCH v1 8/8] virtio-mem: kdump mode to sanitize /proc/vmcore access
[...] + +static bool virtio_mem_vmcore_pfn_is_ram(struct vmcore_cb *cb, +unsigned long pfn) +{ + struct virtio_mem *vm = container_of(cb, struct virtio_mem, +vmcore_cb); + uint64_t addr = PFN_PHYS(pfn); + bool is_ram; + int rc; + + if (!virtio_mem_contains_range(vm, addr, addr + PAGE_SIZE)) Some more testing revealed that this has to be if (!virtio_mem_contains_range(vm, addr, PAGE_SIZE)) -- Thanks, David / dhildenb
[PATCH v1 5/8] virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug()
Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 81 - 1 file changed, 44 insertions(+), 37 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index bef8ad6bf466..2ba7e8d6ba8d 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -2392,41 +2392,10 @@ static int virtio_mem_init_vq(struct virtio_mem *vm) return 0; } -static int virtio_mem_init(struct virtio_mem *vm) +static int virtio_mem_init_hotplug(struct virtio_mem *vm) { const struct range pluggable_range = mhp_get_pluggable_range(true); uint64_t sb_size, addr; - uint16_t node_id; - - if (!vm->vdev->config->get) { - dev_err(>vdev->dev, "config access disabled\n"); - return -EINVAL; - } - - /* -* We don't want to (un)plug or reuse any memory when in kdump. The -* memory is still accessible (but not mapped). -*/ - if (is_kdump_kernel()) { - dev_warn(>vdev->dev, "disabled in kdump kernel\n"); - return -EBUSY; - } - - /* Fetch all properties that can't change. */ - virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size, - >plugged_size); - virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size, - >device_block_size); - virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id, - _id); - vm->nid = virtio_mem_translate_node_id(vm, node_id); - virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, >addr); - virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size, - >region_size); - - /* Determine the nid for the device based on the lowest address. 
*/ - if (vm->nid == NUMA_NO_NODE) - vm->nid = memory_add_physaddr_to_nid(vm->addr); /* bad device setup - warn only */ if (!IS_ALIGNED(vm->addr, memory_block_size_bytes())) @@ -2496,10 +2465,6 @@ static int virtio_mem_init(struct virtio_mem *vm) vm->offline_threshold); } - dev_info(>vdev->dev, "start address: 0x%llx", vm->addr); - dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size); - dev_info(>vdev->dev, "device block size: 0x%llx", -(unsigned long long)vm->device_block_size); dev_info(>vdev->dev, "memory block size: 0x%lx", memory_block_size_bytes()); if (vm->in_sbm) @@ -2508,10 +2473,52 @@ static int virtio_mem_init(struct virtio_mem *vm) else dev_info(>vdev->dev, "big block size: 0x%llx", (unsigned long long)vm->bbm.bb_size); + + return 0; +} + +static int virtio_mem_init(struct virtio_mem *vm) +{ + uint16_t node_id; + + if (!vm->vdev->config->get) { + dev_err(>vdev->dev, "config access disabled\n"); + return -EINVAL; + } + + /* +* We don't want to (un)plug or reuse any memory when in kdump. The +* memory is still accessible (but not mapped). +*/ + if (is_kdump_kernel()) { + dev_warn(>vdev->dev, "disabled in kdump kernel\n"); + return -EBUSY; + } + + /* Fetch all properties that can't change. */ + virtio_cread_le(vm->vdev, struct virtio_mem_config, plugged_size, + >plugged_size); + virtio_cread_le(vm->vdev, struct virtio_mem_config, block_size, + >device_block_size); + virtio_cread_le(vm->vdev, struct virtio_mem_config, node_id, + _id); + vm->nid = virtio_mem_translate_node_id(vm, node_id); + virtio_cread_le(vm->vdev, struct virtio_mem_config, addr, >addr); + virtio_cread_le(vm->vdev, struct virtio_mem_config, region_size, + >region_size); + + /* Determine the nid for the device based on the lowest address. 
*/ + if (vm->nid == NUMA_NO_NODE) + vm->nid = memory_add_physaddr_to_nid(vm->addr); + + dev_info(>vdev->dev, "start address: 0x%llx", vm->addr); + dev_info(>vdev->dev, "region size: 0x%llx", vm->region_size); + dev_info(>vdev->dev, "device block size: 0x%llx", +(unsigned long long)vm->device_block_size); if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA)) dev_info(>vdev->dev, "nid: %d", vm->nid); - return 0; + return virtio_mem_init_hotplug(vm); } static int virtio_mem_create_resource(struct virtio_mem *vm) -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v1 6/8] virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug()
Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 87 +++-- 1 file changed, 45 insertions(+), 42 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 2ba7e8d6ba8d..1be3ee7f684d 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -260,6 +260,8 @@ static void virtio_mem_fake_offline_going_offline(unsigned long pfn, static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn, unsigned long nr_pages); static void virtio_mem_retry(struct virtio_mem *vm); +static int virtio_mem_create_resource(struct virtio_mem *vm); +static void virtio_mem_delete_resource(struct virtio_mem *vm); /* * Register a virtio-mem device so it will be considered for the online_page @@ -2395,7 +2397,8 @@ static int virtio_mem_init_vq(struct virtio_mem *vm) static int virtio_mem_init_hotplug(struct virtio_mem *vm) { const struct range pluggable_range = mhp_get_pluggable_range(true); - uint64_t sb_size, addr; + uint64_t unit_pages, sb_size, addr; + int rc; /* bad device setup - warn only */ if (!IS_ALIGNED(vm->addr, memory_block_size_bytes())) @@ -2474,7 +2477,48 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm) dev_info(>vdev->dev, "big block size: 0x%llx", (unsigned long long)vm->bbm.bb_size); + /* create the parent resource for all memory */ + rc = virtio_mem_create_resource(vm); + if (rc) + return rc; + + /* use a single dynamic memory group to cover the whole memory device */ + if (vm->in_sbm) + unit_pages = PHYS_PFN(memory_block_size_bytes()); + else + unit_pages = PHYS_PFN(vm->bbm.bb_size); + rc = memory_group_register_dynamic(vm->nid, unit_pages); + if (rc < 0) + goto out_del_resource; + vm->mgid = rc; + + /* +* If we still have memory plugged, we have to unplug all memory first. 
+* Registering our parent resource makes sure that this memory isn't +* actually in use (e.g., trying to reload the driver). +*/ + if (vm->plugged_size) { + vm->unplug_all_required = true; + dev_info(>vdev->dev, "unplugging all memory is required\n"); + } + + /* register callbacks */ + vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb; + rc = register_memory_notifier(>memory_notifier); + if (rc) + goto out_unreg_group; + rc = register_virtio_mem_device(vm); + if (rc) + goto out_unreg_mem; + return 0; +out_unreg_mem: + unregister_memory_notifier(>memory_notifier); +out_unreg_group: + memory_group_unregister(vm->mgid); +out_del_resource: + virtio_mem_delete_resource(vm); + return rc; } static int virtio_mem_init(struct virtio_mem *vm) @@ -2578,7 +2622,6 @@ static bool virtio_mem_has_memory_added(struct virtio_mem *vm) static int virtio_mem_probe(struct virtio_device *vdev) { struct virtio_mem *vm; - uint64_t unit_pages; int rc; BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24); @@ -2608,40 +2651,6 @@ static int virtio_mem_probe(struct virtio_device *vdev) if (rc) goto out_del_vq; - /* create the parent resource for all memory */ - rc = virtio_mem_create_resource(vm); - if (rc) - goto out_del_vq; - - /* use a single dynamic memory group to cover the whole memory device */ - if (vm->in_sbm) - unit_pages = PHYS_PFN(memory_block_size_bytes()); - else - unit_pages = PHYS_PFN(vm->bbm.bb_size); - rc = memory_group_register_dynamic(vm->nid, unit_pages); - if (rc < 0) - goto out_del_resource; - vm->mgid = rc; - - /* -* If we still have memory plugged, we have to unplug all memory first. -* Registering our parent resource makes sure that this memory isn't -* actually in use (e.g., trying to reload the driver). 
-*/ - if (vm->plugged_size) { - vm->unplug_all_required = true; - dev_info(>vdev->dev, "unplugging all memory is required\n"); - } - - /* register callbacks */ - vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb; - rc = register_memory_notifier(>memory_notifier); - if (rc) - goto out_unreg_group; - rc = register_virtio_mem_device(vm); - if (rc) - goto out_unreg_mem; - virtio_device_ready(vdev)
[PATCH v1 8/8] virtio-mem: kdump mode to sanitize /proc/vmcore access
Although virtio-mem currently supports reading unplugged memory in the hypervisor, this will change in the future, indicated to the device via a new feature flag. We similarly sanitized /proc/kcore access recently. [1] Let's register a vmcore callback, to allow vmcore code to check if a PFN belonging to a virtio-mem device is either currently plugged and should be dumped or is currently unplugged and should not be accessed, instead mapping the shared zeropage or returning zeroes when reading. This is important when not capturing /proc/vmcore via tools like "makedumpfile" that can identify logically unplugged virtio-mem memory via PG_offline in the memmap, but simply by e.g., copying the file. Distributions that support virtio-mem+kdump have to make sure that the virtio_mem module will be part of the kdump kernel or the kdump initrd; dracut was recently [2] extended to include virtio-mem in the generated initrd. As long as no special kdump kernels are used, this will automatically make sure that virtio-mem will be around in the kdump initrd and sanitize /proc/vmcore access -- with dracut. With this series, we'll send one virtio-mem state request for every ~2 MiB chunk of virtio-mem memory indicated in the vmcore that we intend to read/map. In the future, we might want to allow building virtio-mem for kdump mode only, even without CONFIG_MEMORY_HOTPLUG and friends: this way, we could support special stripped-down kdump kernels that have many other config options disabled; we'll tackle that once required. Further, we might want to try sensing bigger blocks (e.g., memory sections) first before falling back to device blocks on demand. Tested with Fedora rawhide, which contains a recent kexec-tools version (considering "System RAM (virtio_mem)" when creating the vmcore header) and a recent dracut version (including the virtio_mem module in the kdump initrd). 
[1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com [2] https://github.com/dracutdevs/dracut/pull/1157 Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 136 1 file changed, 124 insertions(+), 12 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 76d8aef3cfd2..ec0b2ab37acb 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -223,6 +223,9 @@ struct virtio_mem { * When this lock is held the pointers can't change, ONLINE and * OFFLINE blocks can't change the state and no subblocks will get * plugged/unplugged. +* +* In kdump mode, used to serialize requests, last_block_addr and +* last_block_plugged. */ struct mutex hotplug_mutex; bool hotplug_active; @@ -230,6 +233,9 @@ struct virtio_mem { /* An error occurred we cannot handle - stop processing requests. */ bool broken; + /* Cached valued of is_kdump_kernel() when the device was probed. */ + bool in_kdump; + /* The driver is being removed. */ spinlock_t removal_lock; bool removing; @@ -243,6 +249,13 @@ struct virtio_mem { /* Memory notifier (online/offline events). */ struct notifier_block memory_notifier; +#ifdef CONFIG_PROC_VMCORE + /* vmcore callback for /proc/vmcore handling in kdump mode */ + struct vmcore_cb vmcore_cb; + uint64_t last_block_addr; + bool last_block_plugged; +#endif /* CONFIG_PROC_VMCORE */ + /* Next device in the list of virtio-mem devices. 
*/ struct list_head next; }; @@ -2293,6 +2306,12 @@ static void virtio_mem_run_wq(struct work_struct *work) uint64_t diff; int rc; + if (unlikely(vm->in_kdump)) { + dev_warn_once(>vdev->dev, +"unexpected workqueue run in kdump kernel\n"); + return; + } + hrtimer_cancel(>retry_timer); if (vm->broken) @@ -2521,6 +2540,86 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm) return rc; } +#ifdef CONFIG_PROC_VMCORE +static int virtio_mem_send_state_request(struct virtio_mem *vm, uint64_t addr, +uint64_t size) +{ + const uint64_t nb_vm_blocks = size / vm->device_block_size; + const struct virtio_mem_req req = { + .type = cpu_to_virtio16(vm->vdev, VIRTIO_MEM_REQ_STATE), + .u.state.addr = cpu_to_virtio64(vm->vdev, addr), + .u.state.nb_blocks = cpu_to_virtio16(vm->vdev, nb_vm_blocks), + }; + int rc = -ENOMEM; + + dev_dbg(>vdev->dev, "requesting state: 0x%llx - 0x%llx\n", addr, + addr + size - 1); + + switch (virtio_mem_send_request(vm, )) { + case VIRTIO_MEM_RESP_ACK: + return virtio16_to_cpu(vm->vdev, vm->resp.u.state.state); + case VIRTIO_MEM_RESP_ERROR: +
[PATCH v1 7/8] virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug()
Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 1be3ee7f684d..76d8aef3cfd2 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -2667,9 +2667,8 @@ static int virtio_mem_probe(struct virtio_device *vdev) return rc; } -static void virtio_mem_remove(struct virtio_device *vdev) +static void virtio_mem_deinit_hotplug(struct virtio_mem *vm) { - struct virtio_mem *vm = vdev->priv; unsigned long mb_id; int rc; @@ -2716,7 +2715,8 @@ static void virtio_mem_remove(struct virtio_device *vdev) * away. Warn at least. */ if (virtio_mem_has_memory_added(vm)) { - dev_warn(&vdev->dev, "device still has system memory added\n"); + dev_warn(&vm->vdev->dev, +"device still has system memory added\n"); } else { virtio_mem_delete_resource(vm); kfree_const(vm->resource_name); @@ -2730,6 +2730,13 @@ static void virtio_mem_remove(struct virtio_device *vdev) } else { vfree(vm->bbm.bb_states); } +} + +static void virtio_mem_remove(struct virtio_device *vdev) +{ + struct virtio_mem *vm = vdev->priv; + + virtio_mem_deinit_hotplug(vm); /* reset the device and cleanup the queues */ vdev->config->reset(vdev); -- 2.31.1
[PATCH v1 2/8] x86/xen: simplify xen_oldmem_pfn_is_ram()
Let's simplify return handling. Signed-off-by: David Hildenbrand --- arch/x86/xen/mmu_hvm.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c index b242d1f4b426..eb61622df75b 100644 --- a/arch/x86/xen/mmu_hvm.c +++ b/arch/x86/xen/mmu_hvm.c @@ -21,23 +21,16 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) .domid = DOMID_SELF, .pfn = pfn, }; - int ram; if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, &a)) return -ENXIO; switch (a.mem_type) { case HVMMEM_mmio_dm: - ram = 0; - break; - case HVMMEM_ram_rw: - case HVMMEM_ram_ro: + return 0; default: - ram = 1; - break; + return 1; } - - return ram; } #endif -- 2.31.1
[PATCH v1 1/8] x86/xen: update xen_oldmem_pfn_is_ram() documentation
The callback is only used for the vmcore nowadays. Signed-off-by: David Hildenbrand --- arch/x86/xen/mmu_hvm.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c index 57409373750f..b242d1f4b426 100644 --- a/arch/x86/xen/mmu_hvm.c +++ b/arch/x86/xen/mmu_hvm.c @@ -9,12 +9,9 @@ #ifdef CONFIG_PROC_VMCORE /* - * This function is used in two contexts: - * - the kdump kernel has to check whether a pfn of the crashed kernel - * was a ballooned page. vmcore is using this function to decide - * whether to access a pfn of the crashed kernel. - * - the kexec kernel has to check whether a pfn was ballooned by the - * previous kernel. If the pfn is ballooned, handle it properly. + * The kdump kernel has to check whether a pfn of the crashed kernel + * was a ballooned page. vmcore is using this function to decide + * whether to access a pfn of the crashed kernel. * Returns 0 if the pfn is not backed by a RAM page, the caller may * handle the pfn special in this case. */ -- 2.31.1
[PATCH v1 4/8] proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks
Let's support multiple registered callbacks, making sure that registering vmcore callbacks cannot fail. Make the callback return a bool instead of an int, handling how to deal with errors internally. Drop unused HAVE_OLDMEM_PFN_IS_RAM. We soon want to make use of this infrastructure from other drivers: virtio-mem, registering one callback for each virtio-mem device, to prevent reading unplugged virtio-mem memory. Handle it via a generic vmcore_cb structure, prepared for future extensions: for example, once we support virtio-mem on s390x where the vmcore is completely constructed in the second kernel, we want to detect and add plugged virtio-mem memory ranges to the vmcore in order for them to get dumped properly. Handle corner cases that are unexpected and shouldn't happen in sane setups: registering a callback after the vmcore has already been opened (warn only) and unregistering a callback after the vmcore has already been opened (warn and essentially read only zeroes from that point on). 
Signed-off-by: David Hildenbrand --- arch/x86/kernel/aperture_64.c | 13 - arch/x86/xen/mmu_hvm.c| 15 +++--- fs/proc/vmcore.c | 99 --- include/linux/crash_dump.h| 26 +++-- 4 files changed, 113 insertions(+), 40 deletions(-) diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c index 10562885f5fc..af3ba08b684b 100644 --- a/arch/x86/kernel/aperture_64.c +++ b/arch/x86/kernel/aperture_64.c @@ -73,12 +73,23 @@ static int gart_mem_pfn_is_ram(unsigned long pfn) (pfn >= aperture_pfn_start + aperture_page_count)); } +#ifdef CONFIG_PROC_VMCORE +static bool gart_oldmem_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn) +{ + return !!gart_mem_pfn_is_ram(pfn); +} + +static struct vmcore_cb gart_vmcore_cb = { + .pfn_is_ram = gart_oldmem_pfn_is_ram, +}; +#endif + static void __init exclude_from_core(u64 aper_base, u32 aper_order) { aperture_pfn_start = aper_base >> PAGE_SHIFT; aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT; #ifdef CONFIG_PROC_VMCORE - WARN_ON(register_oldmem_pfn_is_ram(_mem_pfn_is_ram)); + register_vmcore_cb(_vmcore_cb); #endif #ifdef CONFIG_PROC_KCORE WARN_ON(register_mem_pfn_is_ram(_mem_pfn_is_ram)); diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c index eb61622df75b..49bd4a6a5858 100644 --- a/arch/x86/xen/mmu_hvm.c +++ b/arch/x86/xen/mmu_hvm.c @@ -12,10 +12,10 @@ * The kdump kernel has to check whether a pfn of the crashed kernel * was a ballooned page. vmcore is using this function to decide * whether to access a pfn of the crashed kernel. - * Returns 0 if the pfn is not backed by a RAM page, the caller may + * Returns "false" if the pfn is not backed by a RAM page, the caller may * handle the pfn special in this case. 
*/ -static int xen_oldmem_pfn_is_ram(unsigned long pfn) +static bool xen_vmcore_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn) { struct xen_hvm_get_mem_type a = { .domid = DOMID_SELF, @@ -23,15 +23,18 @@ static int xen_oldmem_pfn_is_ram(unsigned long pfn) }; if (HYPERVISOR_hvm_op(HVMOP_get_mem_type, )) - return -ENXIO; + return true; switch (a.mem_type) { case HVMMEM_mmio_dm: - return 0; + return false; default: - return 1; + return true; } } +static struct vmcore_cb xen_vmcore_cb = { + .pfn_is_ram = xen_vmcore_pfn_is_ram, +}; #endif static void xen_hvm_exit_mmap(struct mm_struct *mm) @@ -65,6 +68,6 @@ void __init xen_hvm_init_mmu_ops(void) if (is_pagetable_dying_supported()) pv_ops.mmu.exit_mmap = xen_hvm_exit_mmap; #ifdef CONFIG_PROC_VMCORE - WARN_ON(register_oldmem_pfn_is_ram(_oldmem_pfn_is_ram)); + register_vmcore_cb(_vmcore_cb); #endif } diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index a9bd80ab670e..7a04b2eca287 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -62,46 +62,75 @@ core_param(novmcoredd, vmcoredd_disabled, bool, 0); /* Device Dump Size */ static size_t vmcoredd_orig_sz; -/* - * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error - * The called function has to take care of module refcounting. - */ -static int (*oldmem_pfn_is_ram)(unsigned long pfn); - -int register_oldmem_pfn_is_ram(int (*fn)(unsigned long pfn)) +static DECLARE_RWSEM(vmcore_cb_rwsem); +/* List of registered vmcore callbacks. */ +static LIST_HEAD(vmcore_cb_list); +/* Whether we had a surprise unregistration of a callback. */ +static bool vmcore_cb_unstable; +/* Whether the vmcore has been opened once. */ +static bool vmcore_opened; + +void register_vmcore_cb(struct vmcore_cb *cb) { - if (oldmem_pfn_is_ram) - return -EBUSY; - oldmem_pfn_is_ram = fn; - return 0; + down_write(_cb
[PATCH v1 0/8] proc/vmcore: sanitize access to virtio-mem memory
As so often with virtio-mem changes that mess with common MM infrastructure, this might be a good candidate to go via Andrew's tree. -- After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem, this series tackles the last sane way a VM could accidentally access logically unplugged memory managed by a virtio-mem device: /proc/vmcore When dumping memory via "makedumpfile", PG_offline pages, used by virtio-mem to flag logically unplugged memory, are already properly excluded; however, especially when accessing/copying /proc/vmcore "the usual way", we can still end up reading logically unplugged memory that is part of a virtio-mem device. Patch #1-#3 are cleanups. Patch #4 extends the existing oldmem_pfn_is_ram mechanism. Patch #5-#7 are virtio-mem refactorings for patch #8, which implements the virtio-mem logic to query the state of device blocks. Patch #8: " Although virtio-mem currently supports reading unplugged memory in the hypervisor, this will change in the future, indicated to the device via a new feature flag. We similarly sanitized /proc/kcore access recently. [...] Distributions that support virtio-mem+kdump have to make sure that the virtio_mem module will be part of the kdump kernel or the kdump initrd; dracut was recently [2] extended to include virtio-mem in the generated initrd. As long as no special kdump kernels are used, this will automatically make sure that virtio-mem will be around in the kdump initrd and sanitize /proc/vmcore access -- with dracut. " This is the last remaining bit to support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of virtio-mem. Note: this is best-effort. We'll never be able to control what runs inside the second kernel, really, but we also don't have to care: we only care about sane setups where we don't want our VM getting zapped once we touch the wrong memory location while dumping. 
While we usually expect sane setups to use "makedumpfile", nothing really speaks against just copying /proc/vmcore, especially in environments where HW poisoning isn't typically expected. Also, we really don't want to put all our trust in the memmap, so sanitizing also makes sense when just using "makedumpfile". [1] https://lkml.kernel.org/r/20210526093041.8800-1-da...@redhat.com [2] https://github.com/dracutdevs/dracut/pull/1157 [3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html Cc: Andrew Morton Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Michal Hocko Cc: Oscar Salvador Cc: Mike Rapoport Cc: "Rafael J. Wysocki" Cc: x...@kernel.org Cc: xen-de...@lists.xenproject.org Cc: virtualizat...@lists.linux-foundation.org Cc: kexec@lists.infradead.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org David Hildenbrand (8): x86/xen: update xen_oldmem_pfn_is_ram() documentation x86/xen: simplify xen_oldmem_pfn_is_ram() proc/vmcore: let pfn_is_ram() return a bool proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug() virtio-mem: factor out hotplug specifics from virtio_mem_probe() into virtio_mem_init_hotplug() virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug() virtio-mem: kdump mode to sanitize /proc/vmcore access arch/x86/kernel/aperture_64.c | 13 +- arch/x86/xen/mmu_hvm.c| 31 ++-- drivers/virtio/virtio_mem.c | 297 -- fs/proc/vmcore.c | 105 include/linux/crash_dump.h| 26 ++- 5 files changed, 332 insertions(+), 140 deletions(-) base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29 -- 2.31.1
[PATCH v1 3/8] proc/vmcore: let pfn_is_ram() return a bool
The callback should deal with errors internally; it doesn't make sense to expose these via pfn_is_ram(). We'll rework the callbacks next. Right now we consider errors as if "it's RAM"; no functional change. Signed-off-by: David Hildenbrand --- fs/proc/vmcore.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 9a15334da208..a9bd80ab670e 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -84,11 +84,11 @@ void unregister_oldmem_pfn_is_ram(void) } EXPORT_SYMBOL_GPL(unregister_oldmem_pfn_is_ram); -static int pfn_is_ram(unsigned long pfn) +static bool pfn_is_ram(unsigned long pfn) { int (*fn)(unsigned long pfn); /* pfn is ram unless fn() checks pagetype */ - int ret = 1; + bool ret = true; /* * Ask hypervisor if the pfn is really ram. @@ -97,7 +97,7 @@ static int pfn_is_ram(unsigned long pfn) */ fn = oldmem_pfn_is_ram; if (fn) - ret = fn(pfn); + ret = !!fn(pfn); return ret; } @@ -124,7 +124,7 @@ ssize_t read_from_oldmem(char *buf, size_t count, nr_bytes = count; /* If pfn is not ram, return zeros for sparse dump files */ - if (pfn_is_ram(pfn) == 0) + if (!pfn_is_ram(pfn)) memset(buf, 0, nr_bytes); else { if (encrypted) -- 2.31.1
Re: [PATCH v1 0/4] mm/memory_hotplug: full support for
Intended subject was "[PATCH v1 0/4] mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK" -- Thanks, David / dhildenb
[PATCH v1 2/4] memblock: allow to specify flags with memblock_add_node()
We want to specify flags when hotplugging memory. Let's prepare to pass flags to memblock_add_node() by adjusting all existing users. Note that when hotplugging memory the system is already up and running and we don't want to add the memory first and apply flags later: it should happen within one memblock call. Signed-off-by: David Hildenbrand --- arch/arc/mm/init.c | 4 ++-- arch/ia64/mm/contig.c| 2 +- arch/ia64/mm/init.c | 2 +- arch/m68k/mm/mcfmmu.c| 3 ++- arch/m68k/mm/motorola.c | 6 -- arch/mips/loongson64/init.c | 4 +++- arch/mips/sgi-ip27/ip27-memory.c | 3 ++- arch/s390/kernel/setup.c | 3 ++- include/linux/memblock.h | 3 ++- include/linux/mm.h | 2 +- mm/memblock.c| 9 + mm/memory_hotplug.c | 2 +- 12 files changed, 26 insertions(+), 17 deletions(-) diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index 699ecf119641..110eb69e9bee 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size) low_mem_sz = size; in_use = 1; - memblock_add_node(base, size, 0); + memblock_add_node(base, size, 0, MEMBLOCK_NONE); } else { #ifdef CONFIG_HIGHMEM high_mem_start = base; high_mem_sz = size; in_use = 1; - memblock_add_node(base, size, 1); + memblock_add_node(base, size, 1, MEMBLOCK_NONE); memblock_reserve(base, size); #endif } diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c index 42e025cfbd08..24901d809301 100644 --- a/arch/ia64/mm/contig.c +++ b/arch/ia64/mm/contig.c @@ -153,7 +153,7 @@ find_memory (void) efi_memmap_walk(find_max_min_low_pfn, NULL); max_pfn = max_low_pfn; - memblock_add_node(0, PFN_PHYS(max_low_pfn), 0); + memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE); find_initrd(); diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index 5c6da8d83c1a..5d165607bf35 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int nid) #endif if (start < end) - memblock_add_node(__pa(start), end 
- start, nid); + memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE); return 0; } diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c index eac9dde65193..6f1f25125294 100644 --- a/arch/m68k/mm/mcfmmu.c +++ b/arch/m68k/mm/mcfmmu.c @@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void) m68k_memory[0].addr = _rambase; m68k_memory[0].size = _ramend - _rambase; - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0); + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0, + MEMBLOCK_NONE); /* compute total pages in system */ num_pages = PFN_DOWN(_ramend - _rambase); diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c index 3a653f0a4188..e80c5d7e6728 100644 --- a/arch/m68k/mm/motorola.c +++ b/arch/m68k/mm/motorola.c @@ -410,7 +410,8 @@ void __init paging_init(void) min_addr = m68k_memory[0].addr; max_addr = min_addr + m68k_memory[0].size; - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0); + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0, + MEMBLOCK_NONE); for (i = 1; i < m68k_num_memory;) { if (m68k_memory[i].addr < min_addr) { printk("Ignoring memory chunk at 0x%lx:0x%lx before the first chunk\n", @@ -421,7 +422,8 @@ void __init paging_init(void) (m68k_num_memory - i) * sizeof(struct m68k_mem_info)); continue; } - memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i); + memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i, + MEMBLOCK_NONE); addr = m68k_memory[i].addr + m68k_memory[i].size; if (addr > max_addr) max_addr = addr; diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c index 76e0a9636a0e..4ac5ba80bbf6 100644 --- a/arch/mips/loongson64/init.c +++ b/arch/mips/loongson64/init.c @@ -77,7 +77,9 @@ void __init szmem(unsigned int node) (u32)node_id, mem_type, mem_start, mem_size); pr_info(" start_pfn:0x%llx, end_pfn:0x%llx, num_physpages:0x%lx\n", start_pfn, end_pfn, num_physpages); - memblock_add_node(PFN_PHYS(start_pfn), PFN_PHYS(node_psize), node); + 
memblock_add_node(PFN_PHYS(start_pfn), PFN_PHYS(node_psize), node, MEMBLOCK_NONE);
[PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED. Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory regions to add to the vmcore for dumping in the crashkernel via for_each_mem_range(). However, especially kexec_file is not supposed to select such memblocks via for_each_free_mem_range() / for_each_free_mem_range_reverse() to place kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK. Let's document why kexec_walk_memblock() won't try placing images on areas marked MEMBLOCK_DRIVER_MANAGED -- similar to IORESOURCE_SYSRAM_DRIVER_MANAGED handling in locate_mem_hole_callback() via kexec_walk_resources(). We'll make sure that memory hotplug code sets the flag where applicable (IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem support. Signed-off-by: David Hildenbrand --- include/linux/memblock.h | 16 ++-- kernel/kexec_file.c | 5 + mm/memblock.c| 4 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index b49a58f621bc..7d8d656d5082 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -33,12 +33,17 @@ extern unsigned long long max_possible_pfn; * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as * reserved in the memory map; refer to memblock_mark_nomap() description * for further details + * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected via a driver, + * corresponding to IORESOURCE_SYSRAM_DRIVER_MANAGED in the kernel resource + * tree. Especially kexec should never use this memory for placing images and + * shouldn't expose this memory to the second kernel. 
*/ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ MEMBLOCK_HOTPLUG= 0x1, /* hotpluggable region */ MEMBLOCK_MIRROR = 0x2, /* mirrored region */ MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */ + MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */ }; /** @@ -209,7 +214,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, */ #define for_each_mem_range(i, p_start, p_end) \ __for_each_mem_range(i, , NULL, NUMA_NO_NODE, \ -MEMBLOCK_HOTPLUG, p_start, p_end, NULL) +MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \ +p_start, p_end, NULL) /** * for_each_mem_range_rev - reverse iterate through memblock areas from @@ -220,7 +226,8 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, */ #define for_each_mem_range_rev(i, p_start, p_end) \ __for_each_mem_range_rev(i, , NULL, NUMA_NO_NODE, \ -MEMBLOCK_HOTPLUG, p_start, p_end, NULL) +MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\ +p_start, p_end, NULL) /** * for_each_reserved_mem_range - iterate over all reserved memblock areas @@ -250,6 +257,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m) return m->flags & MEMBLOCK_NOMAP; } +static inline bool memblock_is_driver_managed(struct memblock_region *m) +{ + return m->flags & MEMBLOCK_DRIVER_MANAGED; +} + int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn, unsigned long *end_pfn); void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 33400ff051a8..8347fc158d2b 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -556,6 +556,11 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf, if (kbuf->image->type == KEXEC_TYPE_CRASH) return func(_res, kbuf); + /* +* Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See +* IORESOURCE_SYSRAM_DRIVER_MANAGED handling in +* locate_mem_hole_callback(). 
+*/ if (kbuf->top_down) { for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE, , , NULL) { diff --git a/mm/memblock.c b/mm/memblock.c index 47a56b223141..540a35317fb0 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -979,6 +979,10 @@ static bool should_skip_region(struct memblock_type *type, if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m)) return true; + /* skip driver-managed memory unless we were asked for it explicitly */ + if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m)) + return true;
[PATCH v1 0/4] mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK
Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64, don't cleanly support add_memory_driver_managed() yet. Most prominently, kexec_file can still end up placing images on such driver-managed memory, resulting in undesired behavior. Teaching kexec to not place images on driver-managed memory is especially relevant for virtio-mem. Details can be found in commit 7b7b27214bba ("mm/memory_hotplug: introduce add_memory_driver_managed()"). Extend memblock with a new flag and set it from memory hotplug code when applicable. This is required to fully support virtio-mem on arm64, making also kexec_file behave like on x86-64. Alternative A: Extend kexec_walk_memblock() to consult the kernel resource tree whether IORESOURCE_SYSRAM_DRIVER_MANAGED is set. This feels wrong, because the goal was to rely on memblock and not the resource tree. Alternative B: Reuse MEMBLOCK_HOTPLUG. MEMBLOCK_HOTPLUG serves a different purpose, though. Cc: Andrew Morton Cc: Mike Rapoport Cc: Michal Hocko Cc: Oscar Salvador Cc: Jianyong Wu Cc: Aneesh Kumar K.V Cc: Vineet Gupta Cc: Geert Uytterhoeven Cc: Huacai Chen Cc: Jiaxun Yang Cc: Thomas Bogendoerfer Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Christian Borntraeger Cc: Eric Biederman Cc: Arnd Bergmann Cc: linux-snps-...@lists.infradead.org Cc: linux-i...@vger.kernel.org Cc: linux-m...@lists.linux-m68k.org Cc: linux-m...@vger.kernel.org Cc: linux-s...@vger.kernel.org Cc: linux...@kvack.org Cc: kexec@lists.infradead.org David Hildenbrand (4): mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() memblock: allow to specify flags with memblock_add_node() memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED arch/arc/mm/init.c | 4 ++-- arch/ia64/mm/contig.c| 2 +- arch/ia64/mm/init.c | 2 +- arch/m68k/mm/mcfmmu.c| 3 ++- arch/m68k/mm/motorola.c | 6 -- arch/mips/loongson64/init.c | 4 +++- 
arch/mips/sgi-ip27/ip27-memory.c | 3 ++- arch/s390/kernel/setup.c | 3 ++- include/linux/memblock.h | 19 --- include/linux/mm.h | 2 +- kernel/kexec_file.c | 5 + mm/memblock.c| 13 + mm/memory_hotplug.c | 11 +-- 13 files changed, 57 insertions(+), 20 deletions(-) base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29 -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v1 1/4] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()
If memblock_add_node() fails, we're most probably running out of memory. While this is unlikely to happen, it can happen and having memory added without a memblock can be problematic for architectures that use memblock to detect valid memory. Let's fail in a nice way instead of silently ignoring the error. Signed-off-by: David Hildenbrand --- mm/memory_hotplug.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 9fd0be32a281..917b3528636d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1384,8 +1384,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) mem_hotplug_begin(); - if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) - memblock_add_node(start, size, nid); + if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { + ret = memblock_add_node(start, size, nid); + if (ret) + goto error_mem_hotplug_end; + } ret = __try_online_node(nid, false); if (ret < 0) @@ -1458,6 +1461,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) rollback_node_hotadd(nid); if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) memblock_remove(start, size); +error_mem_hotplug_end: mem_hotplug_done(); return ret; } -- 2.31.1
[PATCH v1 4/4] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED
Let's communicate driver-managed regions to memblock, to properly teach kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on these memory regions. Signed-off-by: David Hildenbrand --- mm/memory_hotplug.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 5f873e7f5b29..6d90818d4ce8 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1357,6 +1357,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size) int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + enum memblock_flags memblock_flags = MEMBLOCK_NONE; struct vmem_altmap mhp_altmap = {}; struct memory_group *group = NULL; u64 start, size; @@ -1385,7 +1386,9 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) mem_hotplug_begin(); if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { - ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE); + if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED) + memblock_flags = MEMBLOCK_DRIVER_MANAGED; + ret = memblock_add_node(start, size, nid, memblock_flags); if (ret) goto error_mem_hotplug_end; } -- 2.31.1
Re: [PATCH 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
1) == ZONE_HIGHMEM; @@ -1041,17 +1041,17 @@ extern int percpu_pagelist_fraction; extern char numa_zonelist_order[]; #define NUMA_ZONELIST_ORDER_LEN 16 -#ifndef CONFIG_NEED_MULTIPLE_NODES +#ifndef CONFIG_NUMA extern struct pglist_data contig_page_data; #define NODE_DATA(nid)(_page_data) #define NODE_MEM_MAP(nid) mem_map -#else /* CONFIG_NEED_MULTIPLE_NODES */ +#else /* CONFIG_NUMA */ #include -#endif /* !CONFIG_NEED_MULTIPLE_NODES */ +#endif /* !CONFIG_NUMA */ extern struct pglist_data *first_online_pgdat(void); extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 825284baaf46..53eb8bc6026d 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -455,7 +455,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_SYMBOL(_stext); VMCOREINFO_SYMBOL(vmap_area_list); -#ifndef CONFIG_NEED_MULTIPLE_NODES +#ifndef CONFIG_NUMA VMCOREINFO_SYMBOL(mem_map); VMCOREINFO_SYMBOL(contig_page_data); #endif diff --git a/mm/Kconfig b/mm/Kconfig index 218b96ccc84a..bffe4bd859f3 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -59,15 +59,6 @@ config FLAT_NODE_MEM_MAP def_bool y depends on !SPARSEMEM -# -# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's -# to represent different areas of memory. This variable allows -# those dependencies to exist individually. -# -config NEED_MULTIPLE_NODES - def_bool y - depends on NUMA - # # SPARSEMEM_EXTREME (which is the default) does some bootmem # allocations when sparse_init() is called. If this cannot diff --git a/mm/memblock.c b/mm/memblock.c index afaefa8fc6ab..123feef5259d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -92,7 +92,7 @@ * system initialization completes. */ -#ifndef CONFIG_NEED_MULTIPLE_NODES +#ifndef CONFIG_NUMA struct pglist_data __refdata contig_page_data; EXPORT_SYMBOL(contig_page_data); #endif @@ -607,7 +607,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type, * area, insert that portion. 
*/ if (rbase > base) { -#ifdef CONFIG_NEED_MULTIPLE_NODES +#ifdef CONFIG_NUMA WARN_ON(nid != memblock_get_region_node(rgn)); #endif WARN_ON(flags != rgn->flags); @@ -1205,7 +1205,7 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid, int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size, struct memblock_type *type, int nid) { -#ifdef CONFIG_NEED_MULTIPLE_NODES +#ifdef CONFIG_NUMA int start_rgn, end_rgn; int i, ret; @@ -1849,7 +1849,7 @@ static void __init_memblock memblock_dump(struct memblock_type *type) size = rgn->size; end = base + size - 1; flags = rgn->flags; -#ifdef CONFIG_NEED_MULTIPLE_NODES +#ifdef CONFIG_NUMA if (memblock_get_region_node(rgn) != MAX_NUMNODES) snprintf(nid_buf, sizeof(nid_buf), " on node %d", memblock_get_region_node(rgn)); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6fc22482eaa8..8f08135d3eb4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1596,7 +1596,7 @@ void __free_pages_core(struct page *page, unsigned int order) __free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON); } -#ifdef CONFIG_NEED_MULTIPLE_NODES +#ifdef CONFIG_NUMA /* * During memory init memblocks map pfns to nids. The search is expensive and @@ -1646,7 +1646,7 @@ int __meminit early_pfn_to_nid(unsigned long pfn) return nid; } -#endif /* CONFIG_NEED_MULTIPLE_NODES */ +#endif /* CONFIG_NUMA */ void __init memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order) @@ -7276,7 +7276,7 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat) pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n", __func__, pgdat->node_id, (unsigned long)pgdat, (unsigned long)pgdat->node_mem_map); -#ifndef CONFIG_NEED_MULTIPLE_NODES +#ifndef CONFIG_NUMA /* * With no DISCONTIG, the global mem_map is just set as node 0's */ Nice Acked-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 9/9] mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
On 02.06.21 12:53, Mike Rapoport wrote: From: Mike Rapoport After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP configuration option is equivalent to FLATMEM. Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead. Signed-off-by: Mike Rapoport --- include/linux/mmzone.h | 4 ++-- kernel/crash_core.c| 2 +- mm/Kconfig | 4 mm/page_alloc.c| 6 +++--- mm/page_ext.c | 2 +- 5 files changed, 7 insertions(+), 11 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ad42f440c704..2698cdbfbf75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -775,7 +775,7 @@ typedef struct pglist_data { struct zonelist node_zonelists[MAX_ZONELISTS]; int nr_zones; /* number of populated zones in this node */ -#ifdef CONFIG_FLAT_NODE_MEM_MAP/* means !SPARSEMEM */ +#ifdef CONFIG_FLATMEM /* means !SPARSEMEM */ struct page *node_mem_map; #ifdef CONFIG_PAGE_EXTENSION struct page_ext *node_page_ext; @@ -865,7 +865,7 @@ typedef struct pglist_data { #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) #define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages) -#ifdef CONFIG_FLAT_NODE_MEM_MAP +#ifdef CONFIG_FLATMEM #define pgdat_page_nr(pgdat, pagenr) ((pgdat)->node_mem_map + (pagenr)) #else #define pgdat_page_nr(pgdat, pagenr) pfn_to_page((pgdat)->node_start_pfn + (pagenr)) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 53eb8bc6026d..2b8446ea7105 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -483,7 +483,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_OFFSET(page, compound_head); VMCOREINFO_OFFSET(pglist_data, node_zones); VMCOREINFO_OFFSET(pglist_data, nr_zones); -#ifdef CONFIG_FLAT_NODE_MEM_MAP +#ifdef CONFIG_FLATMEM VMCOREINFO_OFFSET(pglist_data, node_mem_map); #endif VMCOREINFO_OFFSET(pglist_data, node_start_pfn); diff --git a/mm/Kconfig b/mm/Kconfig index bffe4bd859f3..ded98fb859ab 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -55,10 +55,6 @@ config FLATMEM 
def_bool y depends on !SPARSEMEM || FLATMEM_MANUAL -config FLAT_NODE_MEM_MAP - def_bool y - depends on !SPARSEMEM - # # SPARSEMEM_EXTREME (which is the default) does some bootmem # allocations when sparse_init() is called. If this cannot diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8f08135d3eb4..f039736541eb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6444,7 +6444,7 @@ static void __meminit zone_init_free_lists(struct zone *zone) } } -#if !defined(CONFIG_FLAT_NODE_MEM_MAP) +#if !defined(CONFIG_FLATMEM) /* * Only struct pages that correspond to ranges defined by memblock.memory * are zeroed and initialized by going through __init_single_page() during @@ -7241,7 +7241,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat) } } -#ifdef CONFIG_FLAT_NODE_MEM_MAP +#ifdef CONFIG_FLATMEM static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { unsigned long __maybe_unused start = 0; @@ -7289,7 +7289,7 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat) } #else static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { } -#endif /* CONFIG_FLAT_NODE_MEM_MAP */ +#endif /* CONFIG_FLATMEM */ #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT static inline void pgdat_set_deferred_range(pg_data_t *pgdat) diff --git a/mm/page_ext.c b/mm/page_ext.c index df6f74aac8e1..293b2685fc48 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -191,7 +191,7 @@ void __init page_ext_init_flatmem(void) panic("Out of memory"); } -#else /* CONFIG_FLAT_NODE_MEM_MAP */ +#else /* CONFIG_FLATMEM */ struct page_ext *lookup_page_ext(const struct page *page) { Acked-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 7/9] docs: remove description of DISCONTIGMEM
On 02.06.21 12:53, Mike Rapoport wrote: From: Mike Rapoport Remove description of DISCONTIGMEM from the "Memory Models" document and update VM sysctl description so that it won't mention DISCONIGMEM. Signed-off-by: Mike Rapoport --- Documentation/admin-guide/sysctl/vm.rst | 12 +++ Documentation/vm/memory-model.rst | 45 ++--- 2 files changed, 8 insertions(+), 49 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 586cd4b86428..ddbd71d592e0 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -936,12 +936,12 @@ allocations, THP and hugetlbfs pages. To make it sensible with respect to the watermark_scale_factor parameter, the unit is in fractions of 10,000. The default value of -15,000 on !DISCONTIGMEM configurations means that up to 150% of the high -watermark will be reclaimed in the event of a pageblock being mixed due -to fragmentation. The level of reclaim is determined by the number of -fragmentation events that occurred in the recent past. If this value is -smaller than a pageblock then a pageblocks worth of pages will be reclaimed -(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature. +15,000 means that up to 150% of the high watermark will be reclaimed in the +event of a pageblock being mixed due to fragmentation. The level of reclaim +is determined by the number of fragmentation events that occurred in the +recent past. If this value is smaller than a pageblock then a pageblocks +worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor +of 0 will disable the feature. watermark_scale_factor diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst index ce398a7dc6cd..30e8fbed6914 100644 --- a/Documentation/vm/memory-model.rst +++ b/Documentation/vm/memory-model.rst @@ -14,15 +14,11 @@ for the CPU. Then there could be several contiguous ranges at completely distinct addresses. 
And, don't forget about NUMA, where different memory banks are attached to different CPUs. -Linux abstracts this diversity using one of the three memory models: -FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what +Linux abstracts this diversity using one of the two memory models: +FLATMEM and SPARSEMEM. Each architecture defines what memory models it supports, what the default memory model is and whether it is possible to manually override that default. -.. note:: - At time of this writing, DISCONTIGMEM is considered deprecated, - although it is still in use by several architectures. - All the memory models track the status of physical page frames using struct page arranged in one or more arrays. @@ -63,43 +59,6 @@ straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the The `ARCH_PFN_OFFSET` defines the first page frame number for systems with physical memory starting at address different from 0. -DISCONTIGMEM - - -The DISCONTIGMEM model treats the physical memory as a collection of -`nodes` similarly to how Linux NUMA support does. For each node Linux -constructs an independent memory management subsystem represented by -`struct pglist_data` (or `pg_data_t` for short). Among other -things, `pg_data_t` holds the `node_mem_map` array that maps -physical pages belonging to that node. The `node_start_pfn` field of -`pg_data_t` is the number of the first page frame belonging to that -node. - -The architecture setup code should call :c:func:`free_area_init_node` for -each node in the system to initialize the `pg_data_t` object and its -`node_mem_map`. - -Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` - -every physical page frame in a node has a `struct page` entry in the -`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the -`flags` field of the `struct page` encodes the node number of the -node hosting that page. 
- -The conversion between a PFN and the `struct page` in the -DISCONTIGMEM model became slightly more complex as it has to determine -which node hosts the physical page and which `pg_data_t` object -holds the `struct page`. - -Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid` -to convert PFN to the node number. The opposite conversion helper -:c:func:`page_to_nid` is generic as it uses the node number encoded in -page->flags. - -Once the node number is known, the PFN can be used to index -appropriate `node_mem_map` array to access the `struct page` and -the offset of the `struct page` from the `node_mem_map` plus -`node_start_pfn` is the PFN of that page. - SPARSEMEM = Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 5/9] mm: remove CONFIG_DISCONTIGMEM
ol y @@ -85,7 +66,7 @@ config FLAT_NODE_MEM_MAP # config NEED_MULTIPLE_NODES def_bool y - depends on DISCONTIGMEM || NUMA + depends on NUMA # # SPARSEMEM_EXTREME (which is the default) does some bootmem diff --git a/mm/memory.c b/mm/memory.c index 730daa00952b..7c7b6ea02504 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -90,8 +90,7 @@ #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. #endif -#ifndef CONFIG_NEED_MULTIPLE_NODES -/* use the per-pgdat data instead for discontigmem - mbligh */ +#ifdef CONFIG_FLATMEM unsigned long max_mapnr; EXPORT_SYMBOL(max_mapnr); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aaa1655cf682..6fc22482eaa8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -331,20 +331,7 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; -#ifdef CONFIG_DISCONTIGMEM -/* - * DiscontigMem defines memory ranges as separate pg_data_t even if the ranges - * are not on separate NUMA nodes. Functionally this works but with - * watermark_boost_factor, it can reclaim prematurely as the ranges can be - * quite small. By default, do not boost watermarks on discontigmem as in - * many cases very high-order allocations like THP are likely to be - * unsupported and the premature reclaim offsets the advantage of long-term - * fragmentation avoidance. - */ -int watermark_boost_factor __read_mostly; -#else int watermark_boost_factor __read_mostly = 15000; -#endif int watermark_scale_factor = 10; static unsigned long nr_kernel_pages __initdata; Nice Acked-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 6/9] arch, mm: remove stale mentions of DISCONTIGMEM
de the zonelist belongs to. * - * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets - * optimized to _page_data at compile-time. + * For the case of non-NUMA systems the NODE_DATA() gets optimized to + * _page_data at compile-time. */ static inline struct zonelist *node_zonelist(int nid, gfp_t flags) { Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/9] arc: remove support for DISCONTIGMEM
On 02.06.21 12:53, Mike Rapoport wrote: From: Mike Rapoport DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map in v5.11. Remove the support for DISCONTIGMEM entirely. Signed-off-by: Mike Rapoport Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH 1/9] alpha: remove DISCONTIGMEM and NUMA
*)(__va(node_min_pfn << PAGE_SHIFT)); -#endif - printk(" Detected node memory: start %8lu, end %8lu\n", - node_min_pfn, node_max_pfn); - - DBGDCONT(" DISCONTIG: node_data[%d] is at 0x%p\n", nid, NODE_DATA(nid)); - - /* Find the bounds of kernel memory. */ - start_kernel_pfn = PFN_DOWN(KERNEL_START_PHYS); - end_kernel_pfn = PFN_UP(virt_to_phys(kernel_end)); - - if (!nid && (node_max_pfn < end_kernel_pfn || node_min_pfn > start_kernel_pfn)) - panic("kernel loaded out of ram"); - - memblock_add_node(PFN_PHYS(node_min_pfn), - (node_max_pfn - node_min_pfn) << PAGE_SHIFT, nid); - - /* Zone start phys-addr must be 2^(MAX_ORDER-1) aligned. - Note that we round this down, not up - node memory - has much larger alignment than 8Mb, so it's safe. */ - node_min_pfn &= ~((1UL << (MAX_ORDER-1))-1); - - NODE_DATA(nid)->node_start_pfn = node_min_pfn; - NODE_DATA(nid)->node_present_pages = node_max_pfn - node_min_pfn; - - node_set_online(nid); -} - -void __init -setup_memory(void *kernel_end) -{ - unsigned long kernel_size; - int nid; - - show_mem_layout(); - - nodes_clear(node_online_map); - - min_low_pfn = ~0UL; - max_low_pfn = 0UL; - for (nid = 0; nid < MAX_NUMNODES; nid++) - setup_memory_node(nid, kernel_end); - - kernel_size = virt_to_phys(kernel_end) - KERNEL_START_PHYS; - memblock_reserve(KERNEL_START_PHYS, kernel_size); - -#ifdef CONFIG_BLK_DEV_INITRD - initrd_start = INITRD_START; - if (initrd_start) { - extern void *move_initrd(unsigned long); - - initrd_end = initrd_start+INITRD_SIZE; - printk("Initial ramdisk at: 0x%p (%lu bytes)\n", - (void *) initrd_start, INITRD_SIZE); - - if ((void *)initrd_end > phys_to_virt(PFN_PHYS(max_low_pfn))) { - if (!move_initrd(PFN_PHYS(max_low_pfn))) - printk("initrd extends beyond end of memory " - "(0x%08lx > 0x%p)\ndisabling initrd\n", - initrd_end, - phys_to_virt(PFN_PHYS(max_low_pfn))); - } else { - nid = kvaddr_to_nid(initrd_start); - memblock_reserve(virt_to_phys((void *)initrd_start), -INITRD_SIZE); - } - } -#endif /* 
CONFIG_BLK_DEV_INITRD */ -} - -void __init paging_init(void) -{ - unsigned long max_zone_pfn[MAX_NR_ZONES] = {0, }; - unsigned long dma_local_pfn; - - /* -* The old global MAX_DMA_ADDRESS per-arch API doesn't fit -* in the NUMA model, for now we convert it to a pfn and -* we interpret this pfn as a local per-node information. -* This issue isn't very important since none of these machines -* have legacy ISA slots anyways. -*/ - dma_local_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT; - - max_zone_pfn[ZONE_DMA] = dma_local_pfn; - max_zone_pfn[ZONE_NORMAL] = max_pfn; - - free_area_init(max_zone_pfn); - - /* Initialize the kernel's ZERO_PGE. */ - memset((void *)ZERO_PGE, 0, PAGE_SIZE); -} Acked-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/9] arc: update comment about HIGHMEM implementation
On 02.06.21 12:53, Mike Rapoport wrote: From: Mike Rapoport Arc does not use DISCONTIGMEM to implement high memory, update the comment describing how high memory works to reflect this. Signed-off-by: Mike Rapoport --- arch/arc/mm/init.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index e2ed355438c9..397a201adfe3 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -139,16 +139,13 @@ void __init setup_arch_memory(void) #ifdef CONFIG_HIGHMEM /* -* Populate a new node with highmem -* * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based) -* than addresses in normal ala low memory (0x8000_ based). +* than addresses in normal aka low memory (0x8000_ based). * Even with PAE, the huge peripheral space hole would waste a lot of -* mem with single mem_map[]. This warrants a mem_map per region design. -* Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM. -* -* DISCONTIGMEM in turns requires multiple nodes. node 0 above is -* populated with normal memory zone while node 1 only has highmem +* mem with single contiguous mem_map[]. +* Thus when HIGHMEM on ARC is enabled the memory map corresponding +* to the hole is freed and ARC specific version of pfn_valid() +* handles the hole in the memory map. */ #ifdef CONFIG_DISCONTIGMEM node_set_online(1); Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V2 4/6] mm: rename the global section array to mem_sections
On 01.06.21 10:37, Dong Aisheng wrote: On Tue, Jun 1, 2021 at 4:22 PM David Hildenbrand wrote: On 31.05.21 11:19, Dong Aisheng wrote: In order to distinguish the struct mem_section for a better code readability and align with kernel doc [1] name below, change the global mem section name to 'mem_sections' from 'mem_section'. [1] Documentation/vm/memory-model.rst "The `mem_section` objects are arranged in a two-dimensional array called `mem_sections`." Cc: Andrew Morton Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: kexec@lists.infradead.org Signed-off-by: Dong Aisheng --- v1->v2: * no changes --- include/linux/mmzone.h | 10 +- kernel/crash_core.c| 4 ++-- mm/sparse.c| 16 3 files changed, 15 insertions(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a6bfde85ddb0..0ed61f32d898 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1302,9 +1302,9 @@ struct mem_section { #define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1) #ifdef CONFIG_SPARSEMEM_EXTREME -extern struct mem_section **mem_section; +extern struct mem_section **mem_sections; #else -extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; +extern struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif static inline unsigned long *section_to_usemap(struct mem_section *ms) @@ -1315,12 +1315,12 @@ static inline unsigned long *section_to_usemap(struct mem_section *ms) static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME - if (!mem_section) + if (!mem_sections) return NULL; #endif - if (!mem_section[SECTION_NR_TO_ROOT(nr)]) + if (!mem_sections[SECTION_NR_TO_ROOT(nr)]) return NULL; - return _section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; + return _sections[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; } extern unsigned long __section_nr(struct mem_section *ms); extern size_t mem_section_usage_size(void); diff --git a/kernel/crash_core.c b/kernel/crash_core.c 
index 29cc15398ee4..fb1180d81b5a 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -414,8 +414,8 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_SYMBOL(contig_page_data); #endif #ifdef CONFIG_SPARSEMEM - VMCOREINFO_SYMBOL_ARRAY(mem_section); - VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS); + VMCOREINFO_SYMBOL_ARRAY(mem_sections); + VMCOREINFO_LENGTH(mem_sections, NR_SECTION_ROOTS); VMCOREINFO_STRUCT_SIZE(mem_section); VMCOREINFO_OFFSET(mem_section, section_mem_map); VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS); diff --git a/mm/sparse.c b/mm/sparse.c index d02ee6bb7cbc..6412010478f7 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -24,12 +24,12 @@ * 1) mem_section - memory sections, mem_map's for valid memory */ #ifdef CONFIG_SPARSEMEM_EXTREME -struct mem_section **mem_section; +struct mem_section **mem_sections; #else -struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] +struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] cacheline_internodealigned_in_smp; #endif -EXPORT_SYMBOL(mem_section); +EXPORT_SYMBOL(mem_sections); #ifdef NODE_NOT_IN_PAGE_FLAGS /* @@ -66,8 +66,8 @@ static void __init sparse_alloc_section_roots(void) size = sizeof(struct mem_section *) * NR_SECTION_ROOTS; align = 1 << (INTERNODE_CACHE_SHIFT); - mem_section = memblock_alloc(size, align); - if (!mem_section) + mem_sections = memblock_alloc(size, align); + if (!mem_sections) panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); } @@ -103,14 +103,14 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid) * * The mem_hotplug_lock resolves the apparent race below. 
*/ - if (mem_section[root]) + if (mem_sections[root]) return 0; section = sparse_index_alloc(nid); if (!section) return -ENOMEM; - mem_section[root] = section; + mem_sections[root] = section; return 0; } @@ -145,7 +145,7 @@ unsigned long __section_nr(struct mem_section *ms) #else unsigned long __section_nr(struct mem_section *ms) { - return (unsigned long)(ms - mem_section[0]); + return (unsigned long)(ms - mem_sections[0]); } #endif I repeat: unnecessary code churn IMHO. Hi David, Thanks, I explained the reason during my last reply. Andrew has already picked this patch up into the -mm tree. Just because it's in Andrew's tree doesn't mean it will end up upstream. ;) Anyhow, no really strong opinion, it's simply unnecessary code churn that makes bisecting harder without real value IMHO. -- Thanks, David / dhildenb
Re: [PATCH V2 4/6] mm: rename the global section array to mem_sections
On 31.05.21 11:19, Dong Aisheng wrote: In order to distinguish the struct mem_section for a better code readability and align with kernel doc [1] name below, change the global mem section name to 'mem_sections' from 'mem_section'. [1] Documentation/vm/memory-model.rst "The `mem_section` objects are arranged in a two-dimensional array called `mem_sections`." Cc: Andrew Morton Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: kexec@lists.infradead.org Signed-off-by: Dong Aisheng --- v1->v2: * no changes --- include/linux/mmzone.h | 10 +++++----- kernel/crash_core.c | 4 ++-- mm/sparse.c | 16 ++++++++-------- 3 files changed, 15 insertions(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a6bfde85ddb0..0ed61f32d898 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1302,9 +1302,9 @@ struct mem_section { #define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1) #ifdef CONFIG_SPARSEMEM_EXTREME -extern struct mem_section **mem_section; +extern struct mem_section **mem_sections; #else -extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; +extern struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif static inline unsigned long *section_to_usemap(struct mem_section *ms) @@ -1315,12 +1315,12 @@ static inline unsigned long *section_to_usemap(struct mem_section *ms) static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME - if (!mem_section) + if (!mem_sections) return NULL; #endif - if (!mem_section[SECTION_NR_TO_ROOT(nr)]) + if (!mem_sections[SECTION_NR_TO_ROOT(nr)]) return NULL; - return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; + return &mem_sections[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; } extern unsigned long __section_nr(struct mem_section *ms); extern size_t mem_section_usage_size(void); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 29cc15398ee4..fb1180d81b5a 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ 
-414,8 +414,8 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_SYMBOL(contig_page_data); #endif #ifdef CONFIG_SPARSEMEM - VMCOREINFO_SYMBOL_ARRAY(mem_section); - VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS); + VMCOREINFO_SYMBOL_ARRAY(mem_sections); + VMCOREINFO_LENGTH(mem_sections, NR_SECTION_ROOTS); VMCOREINFO_STRUCT_SIZE(mem_section); VMCOREINFO_OFFSET(mem_section, section_mem_map); VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS); diff --git a/mm/sparse.c b/mm/sparse.c index d02ee6bb7cbc..6412010478f7 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -24,12 +24,12 @@ * 1) mem_section - memory sections, mem_map's for valid memory */ #ifdef CONFIG_SPARSEMEM_EXTREME -struct mem_section **mem_section; +struct mem_section **mem_sections; #else -struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] +struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] cacheline_internodealigned_in_smp; #endif -EXPORT_SYMBOL(mem_section); +EXPORT_SYMBOL(mem_sections); #ifdef NODE_NOT_IN_PAGE_FLAGS /* @@ -66,8 +66,8 @@ static void __init sparse_alloc_section_roots(void) size = sizeof(struct mem_section *) * NR_SECTION_ROOTS; align = 1 << (INTERNODE_CACHE_SHIFT); - mem_section = memblock_alloc(size, align); - if (!mem_section) + mem_sections = memblock_alloc(size, align); + if (!mem_sections) panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); } @@ -103,14 +103,14 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid) * * The mem_hotplug_lock resolves the apparent race below. 
*/ - if (mem_section[root]) + if (mem_sections[root]) return 0; section = sparse_index_alloc(nid); if (!section) return -ENOMEM; - mem_section[root] = section; + mem_sections[root] = section; return 0; } @@ -145,7 +145,7 @@ unsigned long __section_nr(struct mem_section *ms) #else unsigned long __section_nr(struct mem_section *ms) { - return (unsigned long)(ms - mem_section[0]); + return (unsigned long)(ms - mem_sections[0]); } #endif I repeat: unnecessary code churn IMHO. -- Thanks, David / dhildenb ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 4/5] mm: rename the global section array to mem_sections
On 17.05.21 13:20, Dong Aisheng wrote: In order to distinguish the struct mem_section for a better code readability and align with kernel doc [1] name below, change the global mem section name to 'mem_sections' from 'mem_section'. [1] Documentation/vm/memory-model.rst "The `mem_section` objects are arranged in a two-dimensional array called `mem_sections`." Cc: Andrew Morton Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: kexec@lists.infradead.org Signed-off-by: Dong Aisheng --- include/linux/mmzone.h | 10 +++++----- kernel/crash_core.c | 4 ++-- mm/sparse.c | 16 ++++++++-------- 3 files changed, 15 insertions(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc23e36cb165..b348a06915c5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1297,9 +1297,9 @@ struct mem_section { #define SECTION_ROOT_MASK (SECTIONS_PER_ROOT - 1) #ifdef CONFIG_SPARSEMEM_EXTREME -extern struct mem_section **mem_section; +extern struct mem_section **mem_sections; #else -extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; +extern struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif static inline unsigned long *section_to_usemap(struct mem_section *ms) @@ -1310,12 +1310,12 @@ static inline unsigned long *section_to_usemap(struct mem_section *ms) static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME - if (!mem_section) + if (!mem_sections) return NULL; #endif - if (!mem_section[SECTION_NR_TO_ROOT(nr)]) + if (!mem_sections[SECTION_NR_TO_ROOT(nr)]) return NULL; - return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; + return &mem_sections[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; } extern unsigned long __section_nr(struct mem_section *ms); extern size_t mem_section_usage_size(void); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 29cc15398ee4..fb1180d81b5a 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -414,8 +414,8 @@ static int 
__init crash_save_vmcoreinfo_init(void) VMCOREINFO_SYMBOL(contig_page_data); #endif #ifdef CONFIG_SPARSEMEM - VMCOREINFO_SYMBOL_ARRAY(mem_section); - VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS); + VMCOREINFO_SYMBOL_ARRAY(mem_sections); + VMCOREINFO_LENGTH(mem_sections, NR_SECTION_ROOTS); VMCOREINFO_STRUCT_SIZE(mem_section); VMCOREINFO_OFFSET(mem_section, section_mem_map); VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS); diff --git a/mm/sparse.c b/mm/sparse.c index df4418c12f04..a96e7e65475f 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -24,12 +24,12 @@ * 1) mem_section - memory sections, mem_map's for valid memory */ #ifdef CONFIG_SPARSEMEM_EXTREME -struct mem_section **mem_section; +struct mem_section **mem_sections; #else -struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] +struct mem_section mem_sections[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] cacheline_internodealigned_in_smp; #endif -EXPORT_SYMBOL(mem_section); +EXPORT_SYMBOL(mem_sections); #ifdef NODE_NOT_IN_PAGE_FLAGS /* @@ -91,14 +91,14 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid) * * The mem_hotplug_lock resolves the apparent race below. 
*/ - if (mem_section[root]) + if (mem_sections[root]) return 0; section = sparse_index_alloc(nid); if (!section) return -ENOMEM; - mem_section[root] = section; + mem_sections[root] = section; return 0; } @@ -131,7 +131,7 @@ unsigned long __section_nr(struct mem_section *ms) #else unsigned long __section_nr(struct mem_section *ms) { - return (unsigned long)(ms - mem_section[0]); + return (unsigned long)(ms - mem_sections[0]); } #endif @@ -286,8 +286,8 @@ static void __init memblocks_present(void) #ifdef CONFIG_SPARSEMEM_EXTREME size = sizeof(struct mem_section *) * NR_SECTION_ROOTS; align = 1 << (INTERNODE_CACHE_SHIFT); - mem_section = memblock_alloc(size, align); - if (!mem_section) + mem_sections = memblock_alloc(size, align); + if (!mem_sections) panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); #endif Smells like unnecessary code churn. I'd rather just fix the doc because it doesn't really improve readability IMHO. Anyhow, just my 2 cents. -- Thanks, David / dhildenb
Re: kexec-starting-kernel-problem-on-vm
On 19.04.21 10:26, Baoquan He wrote: Hi Jingxian, On 04/14/21 at 03:04pm, Jingxian He wrote: We use 'kexec -l' and 'kexec -e' on our virtual machine to upgrade the linux kernel. We find that, with low probability, the new kernel may fail to start because the sha256 checksum verification of the initrd segment fails. The related code is as follows: /* arch/x86/purgatory/purgatory.c */ static int verify_sha256_digest(void) { struct kexec_sha_region *ptr, *end; u8 digest[SHA256_DIGEST_SIZE]; struct sha256_state sctx; sha256_init(&sctx); end = purgatory_sha_regions + ARRAY_SIZE(purgatory_sha_regions); for (ptr = purgatory_sha_regions; ptr < end; ptr++) sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len); sha256_final(&sctx, digest); if (memcmp(digest, purgatory_sha256_digest, sizeof(digest))) return 1; return 0; } void purgatory(void) { int ret; ret = verify_sha256_digest(); I usually use a qemu/kvm guest to test kernel, kexec and kdump, and haven't met this issue. kexec -l/-e works well for me. Seems you are not using the latest kexec-tools. Otherwise you can use "-i (--no-checks)" to work around this for the time being. if (ret) { /* <-- verify_sha256 failed, entering loop forever */ /* loop forever */ for (;;) ; } copy_backup_region(); } Our opinion of this problem: We think that the process of relocating the new kernel depends on the boot cpu running without interruption. However, the vcpus may be interrupted by the qemu process with an async_page_fault interrupt. So, are you saying that the host still delivers an APF (async page fault) to the guest, even though it has interrupts disabled (including APF)? Hard to imagine that this would be the case right now. Any other host activity that temporarily stops/schedules out the VCPU should not be relevant to the VM ("vCPU interrupted by the QEMU process"); if there would be something running inside the VM that disables interrupts to reduce the size of a race window, that would need fixing inside the VM. 
-- Thanks, David / dhildenb
Re: [PATCH v1 0/3] crashdump/x86: dump dax/kmem and virtio-mem added System RAM
> On 02.04.2021 at 12:16, Simon Horman wrote: > > On Tue, Mar 30, 2021 at 10:58:28AM +0200, David Hildenbrand wrote: >>> On 23.03.21 11:01, David Hildenbrand wrote: >>> Let's properly support dumping any kind of "System RAM" that is not on the >>> top level of the kernel resource tree in /proc/iomem, processing any >>> /proc/iomem entry that contains "System RAM". In addition, increase >>> the maximum number of crash memory ranges to fully support virtio-mem >>> even in corner cases where we end up with a lot of individual memory >>> ranges. >>> >>> David Hildenbrand (3): >>> crashdump/x86: dump any kind of "System RAM" >>> crashdump/x86: iterate only over actual crash memory ranges >>> crashdump/x86: increase CRASH_MAX_MEMORY_RANGES to 32k >>> >>> kexec/arch/i386/crashdump-x86.c | 18 +- >>> kexec/arch/i386/crashdump-x86.h | 3 ++- >>> 2 files changed, 15 insertions(+), 6 deletions(-) >>> >> >> Gentle ping. > > Thanks, and sorry for the delay. > > Series applied. > No worries - thanks a lot! Have a nice weekend :)
Re: [PATCH v1 0/3] crashdump/x86: dump dax/kmem and virtio-mem added System RAM
On 23.03.21 11:01, David Hildenbrand wrote: Let's properly support dumping any kind of "System RAM" that is not on the top level of the kernel resource tree in /proc/iomem, processing any /proc/iomem entry that contains "System RAM". In addition, increase the maximum number of crash memory ranges to fully support virtio-mem even in corner cases where we end up with a lot of individual memory ranges. David Hildenbrand (3): crashdump/x86: dump any kind of "System RAM" crashdump/x86: iterate only over actual crash memory ranges crashdump/x86: increase CRASH_MAX_MEMORY_RANGES to 32k kexec/arch/i386/crashdump-x86.c | 18 +- kexec/arch/i386/crashdump-x86.h | 3 ++- 2 files changed, 15 insertions(+), 6 deletions(-) Gentle ping. -- Thanks, David / dhildenb
[PATCH v2 2/3] kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources
It used to be true that we can have system RAM (IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY) only on the first level in the resource tree. However, this no longer holds for driver-managed system RAM (i.e., added via dax/kmem and virtio-mem), which gets added on lower levels, for example, inside device containers. IORESOURCE_SYSTEM_RAM is defined as IORESOURCE_MEM | IORESOURCE_SYSRAM and is just a special type of IORESOURCE_MEM. The function walk_mem_res() only considers the first level and is used in arch/x86/mm/ioremap.c:__ioremap_check_mem() only. We currently fail to identify System RAM added by dax/kmem and virtio-mem as "IORES_MAP_SYSTEM_RAM", for example, allowing for remapping of such "normal RAM" in __ioremap_caller(). Let's find all IORESOURCE_MEM | IORESOURCE_BUSY resources, making the function behave similar to walk_system_ram_res(). Fixes: ebf71552bb0e ("virtio-mem: Add parent resource for all added "System RAM"") Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM") Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Greg Kroah-Hartman Cc: Dan Williams Cc: Daniel Vetter Cc: Andy Shevchenko Cc: Mauro Carvalho Chehab Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Dave Hansen Cc: Keith Busch Cc: Michal Hocko Cc: Qian Cai Cc: Oscar Salvador Cc: Eric Biederman Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. 
Peter Anvin" Cc: Tom Lendacky Cc: Brijesh Singh Cc: x...@kernel.org Cc: kexec@lists.infradead.org Signed-off-by: David Hildenbrand --- kernel/resource.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/resource.c b/kernel/resource.c index 4efd6e912279..16e0c7e8ed24 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -470,7 +470,7 @@ int walk_mem_res(u64 start, u64 end, void *arg, { unsigned long flags = IORESOURCE_MEM | IORESOURCE_BUSY; - return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, true, + return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, false, arg, func); } -- 2.29.2 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 3/3] kernel/resource: remove first_lvl / siblings_only logic
All functions that search for IORESOURCE_SYSTEM_RAM or IORESOURCE_MEM resources now properly consider the whole resource tree, not just the first level. Let's drop the unused first_lvl / siblings_only logic. Remove documentation that indicates that some functions behave differently, all consider the full resource tree now. Reviewed-by: Dan Williams Reviewed-by: Andy Shevchenko Cc: Andrew Morton Cc: Greg Kroah-Hartman Cc: Dan Williams Cc: Daniel Vetter Cc: Andy Shevchenko Cc: Mauro Carvalho Chehab Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Dave Hansen Cc: Keith Busch Cc: Michal Hocko Cc: Qian Cai Cc: Oscar Salvador Cc: Eric Biederman Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Tom Lendacky Cc: Brijesh Singh Cc: x...@kernel.org Cc: kexec@lists.infradead.org Signed-off-by: David Hildenbrand --- kernel/resource.c | 45 - 1 file changed, 12 insertions(+), 33 deletions(-) diff --git a/kernel/resource.c b/kernel/resource.c index 16e0c7e8ed24..7e00239a023a 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -64,12 +64,8 @@ static DEFINE_RWLOCK(resource_lock); static struct resource *bootmem_resource_free; static DEFINE_SPINLOCK(bootmem_resource_lock); -static struct resource *next_resource(struct resource *p, bool sibling_only) +static struct resource *next_resource(struct resource *p) { - /* Caller wants to traverse through siblings only */ - if (sibling_only) - return p->sibling; - if (p->child) return p->child; while (!p->sibling && p->parent) @@ -81,7 +77,7 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos) { struct resource *p = v; (*pos)++; - return (void *)next_resource(p, false); + return (void *)next_resource(p); } #ifdef CONFIG_PROC_FS @@ -330,14 +326,10 @@ EXPORT_SYMBOL(release_resource); * of the resource that's within [@start..@end]; if none is found, returns * -ENODEV. Returns -EINVAL for invalid parameters. 
* - * This function walks the whole tree and not just first level children - unless @first_lvl is true. - * * @start: start address of the resource searched for * @end: end address of same resource * @flags: flags which the resource must have * @desc: descriptor the resource must have - * @first_lvl: walk only the first level children, if set * @res: return ptr, if resource found * * The caller must specify @start, @end, @flags, and @desc @@ -345,9 +337,8 @@ EXPORT_SYMBOL(release_resource); */ static int find_next_iomem_res(resource_size_t start, resource_size_t end, unsigned long flags, unsigned long desc, - bool first_lvl, struct resource *res) + struct resource *res) { - bool siblings_only = true; struct resource *p; if (!res) @@ -358,7 +349,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end, read_lock(&resource_lock); - for (p = iomem_resource.child; p; p = next_resource(p, siblings_only)) { + for (p = iomem_resource.child; p; p = next_resource(p)) { /* If we passed the resource we are looking for, stop */ if (p->start > end) { p = NULL; @@ -369,13 +360,6 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end, if (p->end < start) continue; - /* -* Now that we found a range that matches what we look for, -* check the flags and the descriptor. If we were not asked to -* use only the first level, start looking at children as well. 
-*/ - siblings_only = first_lvl; - if ((p->flags & flags) != flags) continue; if ((desc != IORES_DESC_NONE) && (desc != p->desc)) @@ -402,14 +386,14 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end, static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end, unsigned long flags, unsigned long desc, -bool first_lvl, void *arg, +void *arg, int (*func)(struct resource *, void *)) { struct resource res; int ret = -EINVAL; while (start < end && - !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) { + !find_next_iomem_res(start, end, flags, desc, &res)) { ret = (*func)(&res, arg); if (ret) break; @@ -431,7 +415,6 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end, * @arg: function argument f