Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()
On Thu, Jun 13, 2019 at 11:42 AM Qian Cai wrote: > > On Wed, 2019-06-12 at 12:37 -0700, Dan Williams wrote: > > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai wrote: > > > > > > The linux-next commit "mm/sparsemem: Add helpers track active portions > > > of a section at boot" [1] causes a crash below when the first kmemleak > > > scan kthread kicks in. This is because kmemleak_scan() calls > > > pfn_to_online_page(() which calls pfn_valid_within() instead of > > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n. > > > > > > The commit [1] did add an additional check of pfn_section_valid() in > > > pfn_valid(), but forgot to add it in the above code path. > > > > > > page:ea0002748000 is uninitialized and poisoned > > > raw: > > > raw: > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > > [ cut here ] > > > kernel BUG at include/linux/mm.h:1084! > > > invalid opcode: [#1] SMP DEBUG_PAGEALLOC KASAN PTI > > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6 > > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-, > > > BIOS -[TEE113T-1.00]- 07/07/2017 > > > RIP: 0010:kmemleak_scan+0x6df/0xad0 > > > Call Trace: > > > kmemleak_scan_thread+0x9f/0xc7 > > > kthread+0x1d2/0x1f0 > > > ret_from_fork+0x35/0x4 > > > > > > [1] https://patchwork.kernel.org/patch/10977957/ > > > > > > Signed-off-by: Qian Cai > > > --- > > > include/linux/memory_hotplug.h | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/include/linux/memory_hotplug.h > > > b/include/linux/memory_hotplug.h > > > index 0b8a5e5ef2da..f02be86077e3 100644 > > > --- a/include/linux/memory_hotplug.h > > > +++ b/include/linux/memory_hotplug.h > > > @@ -28,6 +28,7 @@ > > > unsigned long ___nr = pfn_to_section_nr(___pfn); \ > > >\ > > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ > > > + pfn_section_valid(__nr_to_section(___nr), pfn) && \ > > > pfn_valid_within(___pfn)) \ > > > ___page = pfn_to_page(___pfn); \ > > > ___page; \ > > > > Looks 
ok to me: > > > > Acked-by: Dan Williams > > > > ...but why is pfn_to_online_page() a multi-line macro instead of a > > static inline like all the helper routines it invokes? > > Sigh, probably because it is a mess over there. > > memory_hotplug.h and mmzone.h include each other. Converting it directly to > a static inline triggers compilation errors because mmzone.h is included > elsewhere, and pfn_to_online_page() needs things like > pfn_valid_within() and online_section_nr() etc. which are only defined later in > mmzone.h. Ok, makes sense. I had assumed it was something horrible like that. Qian, can you send more details on the reproduction steps for the failures you are seeing, like the configs and platforms you're testing? I've tried enabling kmemleak and offlining memory and have yet to trigger these failures. I also have a couple of people willing to help me track down the PowerPC issue, but I assume they need some help with the reproduction as well.
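For readers following the thread: the static-inline shape Dan is asking about would look roughly like the sketch below. As Qian explains, it cannot actually be compiled in memory_hotplug.h today, because the helpers it calls are only defined later in mmzone.h, which in turn includes this header. This is an illustrative sketch of the macro's logic, not code from the patch.

```c
/*
 * Hypothetical static-inline rendering of the pfn_to_online_page()
 * macro quoted above. Not buildable in memory_hotplug.h as-is:
 * pfn_valid_within(), online_section_nr(), etc. are declared later
 * in mmzone.h, which includes this header.
 */
static inline struct page *pfn_to_online_page(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
	    pfn_section_valid(__nr_to_section(nr), pfn) &&
	    pfn_valid_within(pfn))
		return pfn_to_page(pfn);
	return NULL;
}
```

A macro sidesteps the ordering problem because its body is only expanded at the use site, after mmzone.h has defined everything it needs.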
Re: [PATCH 08/22] memremap: pass a struct dev_pagemap to ->kill
On Thu, Jun 13, 2019 at 1:12 PM Logan Gunthorpe wrote: > > > > On 2019-06-13 3:43 a.m., Christoph Hellwig wrote: > > Passing the actual typed structure leads to more understandable code > > vs the actual references. > > Ha, ok, I originally suggested this to Dan when he introduced the > callback[1]. > > Reviewed-by: Logan Gunthorpe Reviewed-by: Dan Williams Reported-by: Logan Gunthorpe :) > > Logan > > [1] > https://lore.kernel.org/lkml/8f0cae82-130f-8a64-cfbd-fda5fd76b...@deltatee.com/T/#u >
Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()
On Wed, Jun 12, 2019 at 2:47 PM Qian Cai wrote: > > On Wed, 2019-06-12 at 12:38 -0700, Dan Williams wrote: > > On Wed, Jun 12, 2019 at 12:37 PM Dan Williams > > wrote: > > > > > > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai wrote: > > > > > > > > The linux-next commit "mm/sparsemem: Add helpers track active portions > > > > of a section at boot" [1] causes a crash below when the first kmemleak > > > > scan kthread kicks in. This is because kmemleak_scan() calls > > > > pfn_to_online_page(() which calls pfn_valid_within() instead of > > > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n. > > > > > > > > The commit [1] did add an additional check of pfn_section_valid() in > > > > pfn_valid(), but forgot to add it in the above code path. > > > > > > > > page:ea0002748000 is uninitialized and poisoned > > > > raw: > > > > raw: > > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > > > [ cut here ] > > > > kernel BUG at include/linux/mm.h:1084! > > > > invalid opcode: [#1] SMP DEBUG_PAGEALLOC KASAN PTI > > > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6 > > > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-, > > > > BIOS -[TEE113T-1.00]- 07/07/2017 > > > > RIP: 0010:kmemleak_scan+0x6df/0xad0 > > > > Call Trace: > > > > kmemleak_scan_thread+0x9f/0xc7 > > > > kthread+0x1d2/0x1f0 > > > > ret_from_fork+0x35/0x4 > > > > > > > > [1] https://patchwork.kernel.org/patch/10977957/ > > > > > > > > Signed-off-by: Qian Cai > > > > --- > > > > include/linux/memory_hotplug.h | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/include/linux/memory_hotplug.h > > > > b/include/linux/memory_hotplug.h > > > > index 0b8a5e5ef2da..f02be86077e3 100644 > > > > --- a/include/linux/memory_hotplug.h > > > > +++ b/include/linux/memory_hotplug.h > > > > @@ -28,6 +28,7 @@ > > > > unsigned long ___nr = pfn_to_section_nr(___pfn); \ > > > >\ > > > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ > > > > + 
pfn_section_valid(__nr_to_section(___nr), pfn) && \ > > > > pfn_valid_within(___pfn)) \ > > > > ___page = pfn_to_page(___pfn); \ > > > > ___page; \ > > > > > > Looks ok to me: > > > > > > Acked-by: Dan Williams > > > > > > ...but why is pfn_to_online_page() a multi-line macro instead of a > > > static inline like all the helper routines it invokes? > > > > I do need to send out a refreshed version of the sub-section patchset, > > so I'll fold this in and give you a Reported-by credit. > > BTW, not sure if your new version will fix those two problem below due to the > same commit. > > https://patchwork.kernel.org/patch/10977957/ > > 1) offline is busted [1]. It looks like test_pages_in_a_zone() missed the same > pfn_section_valid() check. > > 2) powerpc booting is generating endless warnings [2]. In vmemmap_populated() > at > arch/powerpc/mm/init_64.c, I tried to change PAGES_PER_SECTION to > PAGES_PER_SUBSECTION, but it alone seems not enough. Yes, I was just sending you another note about this. I don't think your proposed fix is sufficient. The original intent of pfn_valid_within() was to use it as a cheaper lookup after already validating that the first page in a MAX_ORDER_NR_PAGES range satisfied pfn_valid(). Quoting commit 14e072984179 "add pfn_valid_within helper for sub-MAX_ORDER hole detection": Add a pfn_valid_within() helper which should be used when scanning pages within a MAX_ORDER_NR_PAGES block when we have already checked the validility of the block normally with pfn_valid(). This can then be optimised away when we do not have holes within a MAX_ORDER_NR_PAGES block of pages. 
So, with that insight I think the complete fix is this: diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6dd52d544857..9d15ec793330 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1428,7 +1428,7 @@ void memory_present(int nid, unsigned long start, unsigned long end); #ifdef CONFIG_HOLES_IN_ZONE #define pfn_valid_within(pfn) pfn_valid(pfn) #else -#define pfn_valid_within(pfn) (1) +#define pfn_valid_within(pfn) pfn_section_valid(pfn) #endif #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny wrote: > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > The main objection to the current ODP & DAX solution is that very > > > > > > > little HW can actually implement it, having the alternative still > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do this > > > > > > > invalidation, but it won't be universal, and I'd rather leave it > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX > > > > > > > is > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > Agreed. I think software wise there is not much some of the > > > > > > devices can do > > > > > > with such an "invalidate". > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just > > > > > closes > > > > > the file pointing to RDMA object? It has to handle that somehow by > > > > > aborting > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > everything avoids creating inconsistencies that destroying a subset > > > > would create. > > > > > > > > What has been talked about for lease break is not destroying anything > > > > but very selectively saying that one memory region linked to the GUP > > > > is no longer functional. > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules > > > and closes the file with existing pins (and thus layout lease) we would > > > force it to abort everything. Yes, it is disruptive but then the app > > > didn't > > > obey the rule that it has to maintain file lease while holding pins. 
Thus > > > such situation should never happen unless the app is malicious / buggy. > > > > We do have the infrastructure to completely revoke the entire > > *content* of a FD (this is called device disassociate). It is > > basically close without the app doing close. But again it only works > > with some drivers. However, this is more likely something a driver > > could support without a HW change though. > > > > It is quite destructive as it forcibly kills everything RDMA related > > the process(es) are doing, but it is less violent than SIGKILL, and > > there is perhaps a way for the app to recover from this, if it is > > coded for it. > > I don't think many are... I think most would effectively be "killed" if this > happened to them. > > > > > My preference would be to avoid this scenario, but if it is really > > necessary, we could probably build it with some work. > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > used and only for an 'emergency' like use case. > > I'd really like to avoid this as well. I think it will be very confusing for > RDMA apps to have their context suddenly be invalid. I think if we have a way > for admins to ID who is pinning a file the admin can take more appropriate > action on those processes. Up to and including killing the process. Can RDMA context invalidation, "device disassociate", be inflicted on a process from the outside? Identifying the pid of a pin holder only leaves SIGKILL of the entire process as the remediation for revoking a pin, and I assume admins would use the finer grained invalidation where it was available.
Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()
On Wed, Jun 12, 2019 at 2:47 PM Qian Cai wrote: > > On Wed, 2019-06-12 at 12:38 -0700, Dan Williams wrote: > > On Wed, Jun 12, 2019 at 12:37 PM Dan Williams > > wrote: > > > > > > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai wrote: > > > > > > > > The linux-next commit "mm/sparsemem: Add helpers track active portions > > > > of a section at boot" [1] causes a crash below when the first kmemleak > > > > scan kthread kicks in. This is because kmemleak_scan() calls > > > > pfn_to_online_page(() which calls pfn_valid_within() instead of > > > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n. > > > > > > > > The commit [1] did add an additional check of pfn_section_valid() in > > > > pfn_valid(), but forgot to add it in the above code path. > > > > > > > > page:ea0002748000 is uninitialized and poisoned > > > > raw: > > > > raw: > > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > > > [ cut here ] > > > > kernel BUG at include/linux/mm.h:1084! > > > > invalid opcode: [#1] SMP DEBUG_PAGEALLOC KASAN PTI > > > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6 > > > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-, > > > > BIOS -[TEE113T-1.00]- 07/07/2017 > > > > RIP: 0010:kmemleak_scan+0x6df/0xad0 > > > > Call Trace: > > > > kmemleak_scan_thread+0x9f/0xc7 > > > > kthread+0x1d2/0x1f0 > > > > ret_from_fork+0x35/0x4 > > > > > > > > [1] https://patchwork.kernel.org/patch/10977957/ > > > > > > > > Signed-off-by: Qian Cai > > > > --- > > > > include/linux/memory_hotplug.h | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/include/linux/memory_hotplug.h > > > > b/include/linux/memory_hotplug.h > > > > index 0b8a5e5ef2da..f02be86077e3 100644 > > > > --- a/include/linux/memory_hotplug.h > > > > +++ b/include/linux/memory_hotplug.h > > > > @@ -28,6 +28,7 @@ > > > > unsigned long ___nr = pfn_to_section_nr(___pfn); \ > > > >\ > > > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ > > > > + 
pfn_section_valid(__nr_to_section(___nr), pfn) && \ > > > > pfn_valid_within(___pfn)) \ > > > > ___page = pfn_to_page(___pfn); \ > > > > ___page; \ > > > Looks ok to me: > > > > > > Acked-by: Dan Williams > > > > > > ...but why is pfn_to_online_page() a multi-line macro instead of a > > > static inline like all the helper routines it invokes? > > > > I do need to send out a refreshed version of the sub-section patchset, > > so I'll fold this in and give you a Reported-by credit. > > BTW, not sure if your new version will fix those two problem below due to the > same commit. > > https://patchwork.kernel.org/patch/10977957/ > > 1) offline is busted [1]. It looks like test_pages_in_a_zone() missed the same > pfn_section_valid() check. All online memory is onlined in complete sections, so I think the issue is more related to vmemmap_populated() not establishing the mem_map for all pages in a section. I take back my suggestions about pfn_valid_within(); that operation should always be scoped to a section when validating online memory. > > 2) powerpc booting is generating endless warnings [2]. In vmemmap_populated() at > arch/powerpc/mm/init_64.c, I tried to change PAGES_PER_SECTION to > PAGES_PER_SUBSECTION, but it alone seems not enough. On PowerPC PAGES_PER_SECTION == PAGES_PER_SUBSECTION because the PowerPC section size was already small. Instead I think the issue is that PowerPC is partially populating sections, but still expecting pfn_valid() to succeed. I.e., prior to the sub-section patches, pfn_valid() would still work for those holes, but now that it is more precise it is failing.
Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
On Wed, Jun 12, 2019 at 4:32 PM Ira Weiny wrote: > > On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote: > > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny wrote: > > > > > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote: > > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote: > > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote: > > > > > > > > > > > > > > > The main objection to the current ODP & DAX solution is that > > > > > > > > > very > > > > > > > > > little HW can actually implement it, having the alternative > > > > > > > > > still > > > > > > > > > require HW support doesn't seem like progress. > > > > > > > > > > > > > > > > > > I think we will eventually start seein some HW be able to do > > > > > > > > > this > > > > > > > > > invalidation, but it won't be universal, and I'd rather leave > > > > > > > > > it > > > > > > > > > optional, for recovery from truely catastrophic errors (ie my > > > > > > > > > DAX is > > > > > > > > > on fire, I need to unplug it). > > > > > > > > > > > > > > > > Agreed. I think software wise there is not much some of the > > > > > > > > devices can do > > > > > > > > with such an "invalidate". > > > > > > > > > > > > > > So out of curiosity: What does RDMA driver do when userspace just > > > > > > > closes > > > > > > > the file pointing to RDMA object? It has to handle that somehow > > > > > > > by aborting > > > > > > > everything that's going on... And I wanted similar behavior here. > > > > > > > > > > > > It aborts *everything* connected to that file descriptor. Destroying > > > > > > everything avoids creating inconsistencies that destroying a subset > > > > > > would create. > > > > > > > > > > > > What has been talked about for lease break is not destroying > > > > > > anything > > > > > > but very selectively saying that one memory region linked to the GUP > > > > > > is no longer functional. 
> > > > > > > > > > OK, so what I had in mind was that if RDMA app doesn't play by the > > > > > rules > > > > > and closes the file with existing pins (and thus layout lease) we > > > > > would > > > > > force it to abort everything. Yes, it is disruptive but then the app > > > > > didn't > > > > > obey the rule that it has to maintain file lease while holding pins. > > > > > Thus > > > > > such situation should never happen unless the app is malicious / > > > > > buggy. > > > > > > > > We do have the infrastructure to completely revoke the entire > > > > *content* of a FD (this is called device disassociate). It is > > > > basically close without the app doing close. But again it only works > > > > with some drivers. However, this is more likely something a driver > > > > could support without a HW change though. > > > > > > > > It is quite destructive as it forcibly kills everything RDMA related > > > > the process(es) are doing, but it is less violent than SIGKILL, and > > > > there is perhaps a way for the app to recover from this, if it is > > > > coded for it. > > > > > > I don't think many are... I think most would effectively be "killed" if > > > this > > > happened to them. > > > > > > > > > > > My preference would be to avoid this scenario, but if it is really > > > > necessary, we could probably build it with some work. > > > > > > > > The only case we use it today is forced HW hot unplug, so it is rarely > > > > used and only for an 'emergency' like use case. > > > > > > I'd really like to avoid this as well. I think it will be very confusing > > > for > > > RDMA apps to have their context suddenly be invalid. I think if we have > > > a way > > > for admins to ID who is pinning a file the admin can take more appropriate > > > action on those processes. Up to and including killing the process. > > > > Can RDMA context invalidation, "device disassociate", be inflicted on > > a process from the outside? 
Identifying the pid of a pin holder only > > leaves SIGKILL of the entire process as the remediation for revoking a > > pin, and I assume admins would use the finer grained invalidation > > where it was available. > > No not in the way you are describing it. As Jason said you can hotplug the > device which is "from the outside" but this would affect all users of that > device. > > Effectively, we would need a way for an admin to close a specific file > descriptor (or set of fds) which point to that file. AFAIK there is no way to > do that at all, is there? Even if there were, that gets back to my other question: does RDMA teardown happen at close(fd), or at the final fput() of the 'struct file'? I.e., does it also need munmap() to get the vma to drop its reference? Perhaps a pointer to the relevant code would help me wrap my head around this mechanism.
Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
On Thu, Jun 13, 2019 at 8:14 AM Jason Gunthorpe wrote: > > On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote: > > > Effectively, we would need a way for an admin to close a specific file > > > descriptor (or set of fds) which point to that file. AFAIK there is no > > > way to > > > do that at all, is there? > > > > Even if there were that gets back to my other question, does RDMA > > teardown happen at close(fd), or at final fput() of the 'struct > > file'? > > AFAIK there is no kernel side driver hook for close(fd). > > rdma uses a normal chardev so its lifetime is linked to the file_ops > release, which is called on last fput. So all the mmaps, all the dups, > everything must go before it releases its resources. Oh, I must have missed where this conversation started talking about the driver-device fd. I thought we were talking about the close / release of the target file that is MAP_SHARED for the memory registration. A release of the driver fd is orthogonal to coordinating / signalling actions relative to the leased file.
Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()
On Wed, Jun 12, 2019 at 12:37 PM Dan Williams wrote: > > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai wrote: > > > > The linux-next commit "mm/sparsemem: Add helpers track active portions > > of a section at boot" [1] causes a crash below when the first kmemleak > > scan kthread kicks in. This is because kmemleak_scan() calls > > pfn_to_online_page(() which calls pfn_valid_within() instead of > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n. > > > > The commit [1] did add an additional check of pfn_section_valid() in > > pfn_valid(), but forgot to add it in the above code path. > > > > page:ea0002748000 is uninitialized and poisoned > > raw: > > raw: > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > [ cut here ] > > kernel BUG at include/linux/mm.h:1084! > > invalid opcode: [#1] SMP DEBUG_PAGEALLOC KASAN PTI > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6 > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-, > > BIOS -[TEE113T-1.00]- 07/07/2017 > > RIP: 0010:kmemleak_scan+0x6df/0xad0 > > Call Trace: > > kmemleak_scan_thread+0x9f/0xc7 > > kthread+0x1d2/0x1f0 > > ret_from_fork+0x35/0x4 > > > > [1] https://patchwork.kernel.org/patch/10977957/ > > > > Signed-off-by: Qian Cai > > --- > > include/linux/memory_hotplug.h | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > > index 0b8a5e5ef2da..f02be86077e3 100644 > > --- a/include/linux/memory_hotplug.h > > +++ b/include/linux/memory_hotplug.h > > @@ -28,6 +28,7 @@ > > unsigned long ___nr = pfn_to_section_nr(___pfn); \ > >\ > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ > > + pfn_section_valid(__nr_to_section(___nr), pfn) && \ > > pfn_valid_within(___pfn)) \ > > ___page = pfn_to_page(___pfn); \ > > ___page; \ > > Looks ok to me: > > Acked-by: Dan Williams > > ...but why is pfn_to_online_page() a multi-line macro instead of a > static inline like all the helper routines it 
invokes? I do need to send out a refreshed version of the sub-section patchset, so I'll fold this in and give you a Reported-by credit.
Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()
On Wed, Jun 12, 2019 at 12:16 PM Qian Cai wrote: > > The linux-next commit "mm/sparsemem: Add helpers track active portions > of a section at boot" [1] causes a crash below when the first kmemleak > scan kthread kicks in. This is because kmemleak_scan() calls > pfn_to_online_page() which calls pfn_valid_within() instead of > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n. > > The commit [1] did add an additional check of pfn_section_valid() in > pfn_valid(), but forgot to add it in the above code path. > > page:ea0002748000 is uninitialized and poisoned > raw: > raw: > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > [ cut here ] > kernel BUG at include/linux/mm.h:1084! > invalid opcode: [#1] SMP DEBUG_PAGEALLOC KASAN PTI > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6 > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-, > BIOS -[TEE113T-1.00]- 07/07/2017 > RIP: 0010:kmemleak_scan+0x6df/0xad0 > Call Trace: > kmemleak_scan_thread+0x9f/0xc7 > kthread+0x1d2/0x1f0 > ret_from_fork+0x35/0x4 > > [1] https://patchwork.kernel.org/patch/10977957/ > > Signed-off-by: Qian Cai > --- > include/linux/memory_hotplug.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index 0b8a5e5ef2da..f02be86077e3 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -28,6 +28,7 @@ > unsigned long ___nr = pfn_to_section_nr(___pfn); \ >\ > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ > + pfn_section_valid(__nr_to_section(___nr), pfn) && \ > pfn_valid_within(___pfn)) \ > ___page = pfn_to_page(___pfn); \ > ___page; \ Looks ok to me: Acked-by: Dan Williams ...but why is pfn_to_online_page() a multi-line macro instead of a static inline like all the helper routines it invokes?
Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
On Wed, Jun 12, 2019 at 3:29 AM Jan Kara wrote:
> On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > > >
> > > > Because the pins would be invisible to sysadmin from that point on.
> > >
> > > It is not invisible, it just shows up in a rdma specific kernel
> > > interface. You have to use rdma netlink to see the kernel object
> > > holding this pin.
> > >
> > > If this visibility is the main sticking point I suggest just enhancing
> > > the existing MR reporting to include the file info for current GUP
> > > pins and teaching lsof to collect information from there as well so it
> > > is easy to use.
> > >
> > > If the ownership of the lease transfers to the MR, and we report that
> > > ownership to userspace in a way lsof can find, then I think all the
> > > concerns that have been raised are met, right?
> >
> > I was contemplating some new lsof feature yesterday. But what I don't
> > think we want is sysadmins to have multiple tools for multiple
> > subsystems. Or even have to teach lsof something new for every potential
> > new subsystem user of GUP pins.
>
> Agreed.
>
> > I was thinking more along the lines of reporting files which have GUP
> > pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> > report that information. That would cover any subsystem which does a
> > longterm pin.
>
> So lsof already parses /proc/<pid>/maps to learn about files held open by
> memory mappings. It could parse some other file as well I guess. The good
> thing about that would be that then "longterm pin" structure would just hold
> struct file reference. That would avoid any needs of special behavior on
> file close (the file reference in the "longterm pin" structure would make
> sure struct file and thus the lease stays around, we'd just need to make
> explicit lease unlock block until the "longterm pin" structure is freed).
> The bad thing is that it requires us to come up with a sane new proc > interface for reporting "longterm pins" and associated struct file. Also we > need to define what this interface shows if the pinned pages are in DRAM > (either page cache or anon) and not on NVDIMM. The anon vs shared detection case is important because a longterm pin might be blocking a memory-hot-unplug operation if it is pinning ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an explicit concern of the interface. For the anon / cached case I expect it might be useful to put that communication under the memory-blocks sysfs interface. I.e. a list of pids that are pinning that memory-block from being hot-unplugged.
Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
On Wed, Jun 12, 2019 at 5:09 AM Jan Kara wrote:
> On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > >
> > > > > The main objection to the current ODP & DAX solution is that very
> > > > > little HW can actually implement it, having the alternative still
> > > > > require HW support doesn't seem like progress.
> > > > >
> > > > > I think we will eventually start seeing some HW be able to do this
> > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > > on fire, I need to unplug it).
> > > >
> > > > Agreed. I think software wise there is not much some of the devices
> > > > can do with such an "invalidate".
> > >
> > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > the file pointing to RDMA object? It has to handle that somehow by
> > > aborting everything that's going on... And I wanted similar behavior
> > > here.
> >
> > It aborts *everything* connected to that file descriptor. Destroying
> > everything avoids creating inconsistencies that destroying a subset
> > would create.
> >
> > What has been talked about for lease break is not destroying anything
> > but very selectively saying that one memory region linked to the GUP
> > is no longer functional.
>
> OK, so what I had in mind was that if RDMA app doesn't play by the rules
> and closes the file with existing pins (and thus layout lease) we would
> force it to abort everything. Yes, it is disruptive but then the app didn't
> obey the rule that it has to maintain file lease while holding pins. Thus
> such situation should never happen unless the app is malicious / buggy.

When you say 'close' do you mean the final release of the fd? The vma keeps a reference to a 'struct file' live even after the fd is closed.
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Wed, 2019-06-12 at 10:31 +0200, Arnd Bergmann wrote:
> On Tue, Jun 11, 2019 at 7:23 PM Dan Williams wrote:
> > On Tue, 2019-06-11 at 10:52 -0600, Subash Abhinov Kasiviswanathan wrote:
> >
> > rmnet should handle muxing the QMAP, QoS, and aggregation and pass the
> > resulting packet to the lower layer. That lower layer could be IPA or
> > qmi_wwan, which in turn passes that QMAP packet to USB or GSI or
> > whatever. This is typically how Linux handles clean abstractions
> > between different protocol layers in drivers.
> >
> > Similar to some WiFi drivers (drivers/net/wireless/marvell/libertas for
> > example) where the same firmware interface can be accessed via PCI,
> > SDIO, USB, SPI, etc. The bus-specific code is self-contained and does
> > not creep into the upper more generic parts.
>
> Yes, I think that is a good model. In case of libertas, we have multiple
> layers of inheritance from the basic device (slightly different in the
> implementation, but that is how it should be):

To be clear (and I probably wasn't earlier) I wasn't talking as deep about the actual code structures as you are here but this is a great discussion. I was trying to make the point that rmnet doesn't need to care about how the QMAP packets get to the device itself; it can be pretty generic so that it can be used by IPA/qmi_wwan/rmnet_smd/etc. Your points below are a great discussion though...

> struct if_cs_card {               /* pcmcia specific */
>     struct lbs_private {          /* libertas specific */
>         struct wireless_dev {     /* 802.11 specific */
>             struct net_device {
>                 struct device {
>                     ...
>                 };
>                 ...
>             };
>             ...
>         };
>         ...
>     };
>     ...
> };
>
> The outer structure gets allocated when probing the hardware specific
> driver, and everything below it is implemented as direct function calls
> into the more generic code, or as function pointers into the more
> specific code.
> > The current rmnet model is different in that by design the upper > layer > (rmnet) and the lower layer (qmi_wwan, ipa, ...) are kept independent > in > both directions, i.e. ipa has (almost) no knowledge of rmnet, and > just > has pointers to the other net_device: > >ipa_device >net_device > >rmnet_port >net_device > > I understand that the rmnet model was intended to provide a cleaner > abstraction, but it's not how we normally structure subsystems in > Linux, and moving to a model more like how wireless_dev works > would improve both readability and performance, as you describe > it, it would be more like (ignoring for now the need for multiple > connections): > >ipa_dev > rmnet_dev >wwan_dev > net_device Perhaps I'm assuming too much from this diagram but this shows a 1:1 between wwan_dev and "lower" devices. What Johannes is proposing (IIRC) is something a bit looser where a wwan_dev does not necessarily provide netdev itself, but is instead the central point that various channels (control, data, gps, sim card, etc) register with. That way the wwan_dev can provide an overall view of the WWAN device to userspace, and userspace can talk to the wwan_dev to ask the lower drivers (ipa, rmnet, etc) to create new channels (netdev, tty, otherwise) when the control channel has told the modem firmware to expect one. For example, say you have told the firmware to create a new data channel with ID 5 via QMI (which the kernel is unaware of because it does not process higher-level QMI requests). Perhaps (and this is all just brainstorming) then userspace asks the wwan_dev to create a new data channel with ID 5 and a certain QoS. IPA (or rmnet because that's the data channel provider for IPA) has registered callbacks to the wwan_dev, receives this request, and creates a new rmnet_dev/net_device, and then the wwan_dev passes the ifindex back to userspace so we don't have to play the dance of "match up my request with a random netlink ADD event". 
The point being that since data channels aren't actually useful until the control channel agrees with the firmware that one should exist, if we have a wwan_dev that represents the entire WWAN device then we don't need the initial-but-useless net_device. Just some thoughts; Johannes can feel free to correct me at any time :) > Where each layer is a specialization of the next. Note: this is a > common change when moving from proprietary code to upstream > code. If a driver module is designed to live out of tree, there > is a strong incentive to limit the number of interfaces it uses, > but when it gets merged, it becomes much more flexible, as > an internal interface between wwan_dev and the hardware driver(s) > can be easily changed by modifying all drivers at once. Yep, I've seen this time and time again. Dan
[PATCH 4/6] libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
In preparation for fixing a deadlock between wait_for_bus_probe_idle() and the nvdimm_bus_list_mutex, arrange for __nd_ioctl() to be called without nvdimm_bus_list_mutex held. This also unifies the 'dimm' and 'bus' level ioctls into a common nd_ioctl() preamble implementation.

Marked for -stable as it is a pre-requisite for a follow-on fix.

Cc:
Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
Cc: Vishal Verma
Signed-off-by: Dan Williams
---
 drivers/nvdimm/bus.c     | 96 ++++++++++++++++++++++++++++-----------------
 drivers/nvdimm/nd-core.h |  3 +
 2 files changed, 60 insertions(+), 39 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 42713b210f51..dfb93228d6a7 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -73,7 +73,7 @@ static void nvdimm_bus_probe_end(struct nvdimm_bus *nvdimm_bus)
 {
     nvdimm_bus_lock(&nvdimm_bus->dev);
     if (--nvdimm_bus->probe_active == 0)
-        wake_up(&nvdimm_bus->probe_wait);
+        wake_up(&nvdimm_bus->wait);
     nvdimm_bus_unlock(&nvdimm_bus->dev);
 }
@@ -341,7 +341,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
         return NULL;
     INIT_LIST_HEAD(&nvdimm_bus->list);
     INIT_LIST_HEAD(&nvdimm_bus->mapping_list);
-    init_waitqueue_head(&nvdimm_bus->probe_wait);
+    init_waitqueue_head(&nvdimm_bus->wait);
     nvdimm_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
     if (nvdimm_bus->id < 0) {
         kfree(nvdimm_bus);
@@ -426,6 +426,9 @@ static int nd_bus_remove(struct device *dev)
     list_del_init(&nvdimm_bus->list);
     mutex_unlock(&nvdimm_bus_list_mutex);

+    wait_event(nvdimm_bus->wait,
+            atomic_read(&nvdimm_bus->ioctl_active) == 0);
+
     nd_synchronize();
     device_for_each_child(&nvdimm_bus->dev, NULL, child_unregister);
@@ -885,7 +888,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
         if (nvdimm_bus->probe_active == 0)
             break;
         nvdimm_bus_unlock(&nvdimm_bus->dev);
-        wait_event(nvdimm_bus->probe_wait,
+        wait_event(nvdimm_bus->wait,
                 nvdimm_bus->probe_active == 0);
         nvdimm_bus_lock(&nvdimm_bus->dev);
     } while (true);
@@ -1115,24 +1118,10 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
     return rc;
 }

-static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-    long id = (long) file->private_data;
-    int rc = -ENXIO, ro;
-    struct nvdimm_bus *nvdimm_bus;
-
-    ro = ((file->f_flags & O_ACCMODE) == O_RDONLY);
-    mutex_lock(&nvdimm_bus_list_mutex);
-    list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) {
-        if (nvdimm_bus->id == id) {
-            rc = __nd_ioctl(nvdimm_bus, NULL, ro, cmd, arg);
-            break;
-        }
-    }
-    mutex_unlock(&nvdimm_bus_list_mutex);
-
-    return rc;
-}
+enum nd_ioctl_mode {
+    BUS_IOCTL,
+    DIMM_IOCTL,
+};

 static int match_dimm(struct device *dev, void *data)
 {
@@ -1147,31 +1136,62 @@ static int match_dimm(struct device *dev, void *data)
     return 0;
 }

-static long nvdimm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
+        enum nd_ioctl_mode mode)
+
 {
-    int rc = -ENXIO, ro;
-    struct nvdimm_bus *nvdimm_bus;
+    struct nvdimm_bus *nvdimm_bus, *found = NULL;
+    long id = (long) file->private_data;
+    struct nvdimm *nvdimm = NULL;
+    int rc, ro;

     ro = ((file->f_flags & O_ACCMODE) == O_RDONLY);
     mutex_lock(&nvdimm_bus_list_mutex);
     list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) {
-        struct device *dev = device_find_child(&nvdimm_bus->dev,
-                file->private_data, match_dimm);
-        struct nvdimm *nvdimm;
-
-        if (!dev)
-            continue;
+        if (mode == DIMM_IOCTL) {
+            struct device *dev;
+
+            dev = device_find_child(&nvdimm_bus->dev,
+                    file->private_data, match_dimm);
+            if (!dev)
+                continue;
+            nvdimm = to_nvdimm(dev);
+            found = nvdimm_bus;
+        } else if (nvdimm_bus->id == id) {
+            found = nvdimm_bus;
+        }

-        nvdimm = to_nvdimm(dev);
-        rc = __nd_ioctl(nvdimm_bus, nvdimm, ro, cmd, arg);
-        put_device(dev);
-        break;
+        if (found) {
+            atomic_inc(&nvdimm_bus->ioctl_active);
+            break;
+        }
     }
     mutex_unlock(&nvdimm_bus_list_mutex);

+    if (!fo
[PATCH 5/6] libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
A multithreaded namespace creation/destruction stress test currently deadlocks with the following lockup signature: INFO: task ndctl:2924 blocked for more than 122 seconds. Tainted: G OE 5.2.0-rc4+ #3382 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ndctl D0 2924 1176 0x Call Trace: ? __schedule+0x27e/0x780 schedule+0x30/0xb0 wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm] ? finish_wait+0x80/0x80 uuid_store+0xe6/0x2e0 [libnvdimm] kernfs_fop_write+0xf0/0x1a0 vfs_write+0xb7/0x1b0 ksys_write+0x5c/0xd0 do_syscall_64+0x60/0x240 INFO: task ndctl:2923 blocked for more than 122 seconds. Tainted: G OE 5.2.0-rc4+ #3382 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ndctl D0 2923 1175 0x Call Trace: ? __schedule+0x27e/0x780 ? __mutex_lock+0x489/0x910 schedule+0x30/0xb0 schedule_preempt_disabled+0x11/0x20 __mutex_lock+0x48e/0x910 ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] ? __lock_acquire+0x23f/0x1710 ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] __dax_pmem_probe+0x5e/0x210 [dax_pmem_core] ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm] dax_pmem_probe+0xc/0x20 [dax_pmem] nvdimm_bus_probe+0x90/0x2c0 [libnvdimm] really_probe+0xef/0x390 driver_probe_device+0xb4/0x100 In this sequence an 'nd_dax' device is being probed and trying to take the lock on its backing namespace to validate that the 'nd_dax' device indeed has exclusive access to the backing namespace. Meanwhile, another thread is trying to update the uuid property of that same backing namespace. So one thread is in the probe path trying to acquire the lock, and the other thread has acquired the lock and tries to flush the probe path. Fix this deadlock by not holding the namespace device_lock over the wait_nvdimm_bus_probe_idle() synchronization step. 
In turn this requires the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and subsequently dropped internally to wait_nvdimm_bus_probe_idle().

Cc:
Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
Cc: Vishal Verma
Signed-off-by: Dan Williams
---
 drivers/nvdimm/bus.c         | 17 +++++++++++------
 drivers/nvdimm/region_devs.c |  4 ++++
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index dfb93228d6a7..c1d26fca9c4c 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -887,10 +887,12 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
     do {
         if (nvdimm_bus->probe_active == 0)
             break;
-        nvdimm_bus_unlock(&nvdimm_bus->dev);
+        nvdimm_bus_unlock(dev);
+        device_unlock(dev);
         wait_event(nvdimm_bus->wait,
                 nvdimm_bus->probe_active == 0);
-        nvdimm_bus_lock(&nvdimm_bus->dev);
+        device_lock(dev);
+        nvdimm_bus_lock(dev);
     } while (true);
 }
@@ -1017,7 +1019,7 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
     case ND_CMD_ARS_START:
     case ND_CMD_CLEAR_ERROR:
     case ND_CMD_CALL:
-        dev_dbg(&nvdimm_bus->dev, "'%s' command while read-only.\n",
+        dev_dbg(dev, "'%s' command while read-only.\n",
                 nvdimm ? nvdimm_cmd_name(cmd)
                 : nvdimm_bus_cmd_name(cmd));
         return -EPERM;
@@ -1088,7 +1090,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
         goto out;
     }

-    nvdimm_bus_lock(&nvdimm_bus->dev);
+    device_lock(dev);
+    nvdimm_bus_lock(dev);
     rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, func, buf);
     if (rc)
         goto out_unlock;
@@ -1103,7 +1106,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
         nvdimm_account_cleared_poison(nvdimm_bus, clear_err->address,
                 clear_err->cleared);
     }
-    nvdimm_bus_unlock(&nvdimm_bus->dev);
+    nvdimm_bus_unlock(dev);
+    device_unlock(dev);

     if (copy_to_user(p, buf, buf_len))
         rc = -EFAULT;
@@ -1112,7 +1116,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
     return rc;

 out_unlock:
-    nvdimm_bus_unlock(&nvdimm_bus->dev);
+    nvdimm_bus_unlock(dev);
+    device_unlock(dev);
 out:
     vfree(buf);
     return rc;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/regio
[PATCH 6/6] driver-core, libnvdimm: Let device subsystems add local lockdep coverage
For good reason, the standard device_lock() is marked lockdep_set_novalidate_class() because there is simply no sane way to describe the myriad ways the device_lock() is ordered with other locks. However, that leaves subsystems that know their own local device_lock() ordering rules to find lock ordering mistakes manually.

Instead, introduce an optional / additional lockdep-enabled lock that a subsystem can acquire in all the same paths that the device_lock() is acquired.

A conversion of the NFIT driver and NVDIMM subsystem to a lockdep-validated device_lock() scheme is included. The debug_nvdimm_lock() implementation provides the correct lock-class and stacking order for the libnvdimm device topology hierarchy.

Cc: Ingo Molnar
Cc: Ira Weiny
Cc: Will Deacon
Cc: Dave Jiang
Cc: Keith Busch
Cc: Peter Zijlstra
Cc: Vishal Verma
Cc: "Rafael J. Wysocki"
Cc: Greg Kroah-Hartman
Signed-off-by: Dan Williams
---
 drivers/acpi/nfit/core.c        | 28 ++++----
 drivers/acpi/nfit/nfit.h        | 24 ++++++
 drivers/base/core.c             |  3 ++
 drivers/nvdimm/btt_devs.c       | 16 +++--
 drivers/nvdimm/bus.c            | 28 ++++--
 drivers/nvdimm/core.c           | 10 +---
 drivers/nvdimm/dimm_devs.c      |  4 +-
 drivers/nvdimm/namespace_devs.c | 36 +++---
 drivers/nvdimm/nd-core.h        | 68 +++++++++
 drivers/nvdimm/pfn_devs.c       | 24 +++---
 drivers/nvdimm/pmem.c           |  4 +-
 drivers/nvdimm/region.c         |  2 +
 drivers/nvdimm/region_devs.c    | 16 +
 include/linux/device.h          |  5 +++
 14 files changed, 187 insertions(+), 81 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 23022cf20d26..f22139458ce1 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1282,7 +1282,7 @@ static ssize_t hw_error_scrub_store(struct device *dev,
     if (rc)
         return rc;

-    device_lock(dev);
+    nfit_device_lock(dev);
     nd_desc = dev_get_drvdata(dev);
     if (nd_desc) {
         struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
@@ -1299,7 +1299,7 @@ static ssize_t hw_error_scrub_store(struct device *dev,
             break;
         }
     }
-    device_unlock(dev);
+    nfit_device_unlock(dev);
     if (rc)
         return rc;
     return size;
@@ -1319,7 +1319,7 @@ static ssize_t scrub_show(struct device *dev,
     ssize_t rc = -ENXIO;
     bool busy;

-    device_lock(dev);
+    nfit_device_lock(dev);
     nd_desc = dev_get_drvdata(dev);
     if (!nd_desc) {
         device_unlock(dev);
@@ -1339,7 +1339,7 @@ static ssize_t scrub_show(struct device *dev,
     }

     mutex_unlock(&acpi_desc->init_mutex);
-    device_unlock(dev);
+    nfit_device_unlock(dev);
     return rc;
 }
@@ -1356,14 +1356,14 @@ static ssize_t scrub_store(struct device *dev,
     if (val != 1)
         return -EINVAL;

-    device_lock(dev);
+    nfit_device_lock(dev);
     nd_desc = dev_get_drvdata(dev);
     if (nd_desc) {
         struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);

         rc = acpi_nfit_ars_rescan(acpi_desc, ARS_REQ_LONG);
     }
-    device_unlock(dev);
+    nfit_device_unlock(dev);
     if (rc)
         return rc;
     return size;
@@ -1749,9 +1749,9 @@ static void acpi_nvdimm_notify(acpi_handle handle, u32 event, void *data)
     struct acpi_device *adev = data;
     struct device *dev = &adev->dev;

-    device_lock(dev->parent);
+    nfit_device_lock(dev->parent);
     __acpi_nvdimm_notify(dev, event);
-    device_unlock(dev->parent);
+    nfit_device_unlock(dev->parent);
 }

 static bool acpi_nvdimm_has_method(struct acpi_device *adev, char *method)
@@ -3457,8 +3457,8 @@ static int acpi_nfit_flush_probe(struct nvdimm_bus_descriptor *nd_desc)
     struct device *dev = acpi_desc->dev;

     /* Bounce the device lock to flush acpi_nfit_add / acpi_nfit_notify */
-    device_lock(dev);
-    device_unlock(dev);
+    nfit_device_lock(dev);
+    nfit_device_unlock(dev);

     /* Bounce the init_mutex to complete initial registration */
     mutex_lock(&acpi_desc->init_mutex);
@@ -3602,8 +3602,8 @@ void acpi_nfit_shutdown(void *data)
      * acpi_nfit_ars_rescan() submissions have had a chance to
      * either submit or see ->cancel set.
      */
-    device_lock(bus_dev);
-    device_unlock(bus_dev);
+    nfit_device_lock(bus_dev);
+    nfit_device_unlock(bus_dev);

     flush_workqueue(nfit_wq);
 }
@@ -3746,9 +3746,9 @@ EXPORT_SYMBOL_GPL(__acpi_nfit_notify);

 static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
 {
-    device_lock(&adev->dev);
+    nfit_device_lock(&adev->dev);
     __acpi_nfit_notify(&adev->dev, ad
[PATCH 3/6] libnvdimm/region: Register badblocks before namespaces
Namespace activation expects to be able to reference region badblocks. The following warning sometimes triggers when asynchronous namespace activation races in front of the completion of namespace probing. Move all possible namespace probing after region badblocks initialization.

Otherwise, lockdep sometimes catches the uninitialized state of the badblocks seqlock with stack trace signatures like:

 INFO: trying to register non-static key.
 pmem2: detected capacity change from 0 to 136365211648
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.
 CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G OE 5.2.0-rc4+ #3382
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
 Workqueue: events_unbound async_run_entry_fn
 Call Trace:
  dump_stack+0x85/0xc0
 pmem1.12: detected capacity change from 0 to 8589934592
  register_lock_class+0x56a/0x570
  ? check_object+0x140/0x270
  __lock_acquire+0x80/0x1710
  ? __mutex_lock+0x39d/0x910
  lock_acquire+0x9e/0x180
  ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
  badblocks_check+0x93/0x1f0
  ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
  nd_pfn_validate+0x28f/0x440 [libnvdimm]
  ? lockdep_hardirqs_on+0xf0/0x180
  nd_dax_probe+0x9a/0x120 [libnvdimm]
  nd_pmem_probe+0x6d/0x180 [nd_pmem]
  nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]

Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
Cc:
Cc: Vishal Verma
Signed-off-by: Dan Williams
---
 drivers/nvdimm/region.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index ef46cc3a71ae..488c47ac4c4a 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -34,17 +34,6 @@ static int nd_region_probe(struct device *dev)
     if (rc)
         return rc;

-    rc = nd_region_register_namespaces(nd_region, &err);
-    if (rc < 0)
-        return rc;
-
-    ndrd = dev_get_drvdata(dev);
-    ndrd->ns_active = rc;
-    ndrd->ns_count = rc + err;
-
-    if (rc && err && rc == err)
-        return -ENODEV;
-
     if (is_nd_pmem(&nd_region->dev)) {
         struct resource ndr_res;

@@ -60,6 +49,17 @@ static int nd_region_probe(struct device *dev)
         nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res);
     }

+    rc = nd_region_register_namespaces(nd_region, &err);
+    if (rc < 0)
+        return rc;
+
+    ndrd = dev_get_drvdata(dev);
+    ndrd->ns_active = rc;
+    ndrd->ns_count = rc + err;
+
+    if (rc && err && rc == err)
+        return -ENODEV;
+
     nd_region->btt_seed = nd_btt_create(nd_region);
     nd_region->pfn_seed = nd_pfn_create(nd_region);
     nd_region->dax_seed = nd_dax_create(nd_region);
[PATCH 2/6] libnvdimm/bus: Prevent duplicate device_unregister() calls
A multithreaded namespace creation/destruction stress test currently fails with signatures like the following:

 sysfs group 'power' not found for kobject 'dax1.1'
 RIP: 0010:sysfs_remove_group+0x76/0x80
 Call Trace:
  device_del+0x73/0x370
  device_unregister+0x16/0x50
  nd_async_device_unregister+0x1e/0x30 [libnvdimm]
  async_run_entry_fn+0x39/0x160
  process_one_work+0x23c/0x5e0
  worker_thread+0x3c/0x390

 BUG: kernel NULL pointer dereference, address: 0020
 RIP: 0010:klist_put+0x1b/0x6c
 Call Trace:
  klist_del+0xe/0x10
  device_del+0x8a/0x2c9
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70
  device_unregister+0x44/0x4f
  nd_async_device_unregister+0x22/0x2d [libnvdimm]
  async_run_entry_fn+0x47/0x15a
  process_one_work+0x1a2/0x2eb
  worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple threads issuing kill, device_unregister(), requests.

Reported-by: Jane Chu
Reported-by: Erwin Tsaur
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc:
Link: https://github.com/pmem/ndctl/issues/96
Signed-off-by: Dan Williams
---
 drivers/nvdimm/bus.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 2dca3034fee0..42713b210f51 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -547,13 +547,38 @@ EXPORT_SYMBOL(nd_device_register);

 void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
 {
+    bool killed;
+
     switch (mode) {
     case ND_ASYNC:
+        /*
+         * In the async case this is being triggered with the
+         * device lock held and the unregistration work needs to
+         * be moved out of line iff this thread has won the
+         * race to schedule the deletion.
+         */
+        if (!kill_device(dev))
+            return;
+
         get_device(dev);
         async_schedule_domain(nd_async_device_unregister, dev,
                 &nd_async_domain);
         break;
     case ND_SYNC:
+        /*
+         * In the sync case the device is being unregistered due
+         * to a state change of the parent. Claim the kill state
+         * to synchronize against other unregistration requests,
+         * or otherwise let the async path handle it if the
+         * unregistration was already queued.
+         */
+        device_lock(dev);
+        killed = kill_device(dev);
+        device_unlock(dev);
+
+        if (!killed)
+            return;
+
         nd_synchronize();
         device_unregister(dev);
         break;
[PATCH 1/6] drivers/base: Introduce kill_device()
The libnvdimm subsystem arranges for devices to be destroyed as a result of a sysfs operation. Since device_unregister() cannot be called from an actively running sysfs attribute of the same device, libnvdimm arranges for device_unregister() to be performed in an out-of-line async context.

The driver core maintains a 'dead' state for coordinating its own racing async registration / de-registration requests. Rather than add local 'dead' state tracking infrastructure to libnvdimm device objects, export the existing state tracking via a new kill_device() helper.

The kill_device() helper simply marks the device as dead, i.e. that it is on its way to device_del(), or returns that the device was already dead. This can be used in advance of calling device_unregister() for subsystems like libnvdimm that might need to handle multiple user threads racing to delete a device.

This refactoring does not change any behavior, but it is a pre-requisite for follow-on fixes and therefore marked for -stable.

Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc:
Signed-off-by: Dan Williams
---
 drivers/base/core.c    | 27 +++++++++++++++++++--------
 include/linux/device.h |  1 +
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index fd7511e04e62..eaf3aa0cb803 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2211,6 +2211,24 @@ void put_device(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(put_device);

+bool kill_device(struct device *dev)
+{
+    /*
+     * Require the device lock and set the "dead" flag to guarantee that
+     * the update behavior is consistent with the other bitfields near
+     * it and that we cannot have an asynchronous probe routine trying
+     * to run while we are tearing out the bus/class/sysfs from
+     * underneath the device.
+     */
+    lockdep_assert_held(&dev->mutex);
+
+    if (dev->p->dead)
+        return false;
+    dev->p->dead = true;
+    return true;
+}
+EXPORT_SYMBOL_GPL(kill_device);
+
 /**
  * device_del - delete device from system.
  * @dev: device.
@@ -2230,15 +2248,8 @@ void device_del(struct device *dev)
     struct kobject *glue_dir = NULL;
     struct class_interface *class_intf;

-    /*
-     * Hold the device lock and set the "dead" flag to guarantee that
-     * the update behavior is consistent with the other bitfields near
-     * it and that we cannot have an asynchronous probe routine trying
-     * to run while we are tearing out the bus/class/sysfs from
-     * underneath the device.
-     */
     device_lock(dev);
-    dev->p->dead = true;
+    kill_device(dev);
     device_unlock(dev);

     /* Notify clients of device removal. This call must come
diff --git a/include/linux/device.h b/include/linux/device.h
index e85264fb6616..0da5c67f6be1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1373,6 +1373,7 @@ extern int (*platform_notify_remove)(struct device *dev);
  */
 extern struct device *get_device(struct device *dev);
 extern void put_device(struct device *dev);
+extern bool kill_device(struct device *dev);

 #ifdef CONFIG_DEVTMPFS
 extern int devtmpfs_create_node(struct device *dev);
[PATCH 0/6] libnvdimm: Fix async operations and locking
The libnvdimm subsystem uses async operations to parallelize device probing operations and to allow sysfs to trigger device_unregister() on deleted namespaces. A multithreaded stress test of the libnvdimm sysfs interface uncovered a case where device_unregister() is triggered multiple times, and the subsequent investigation uncovered a broken locking scenario.

The lack of lockdep coverage for device_lock() stymied the debug. That is, until patch6 "driver-core, libnvdimm: Let device subsystems add local lockdep coverage" solved that with a shadow lock, with lockdep coverage, to mirror device_lock() operations. Given the time saved by the shadow-lock debug hack, patch6 attempts to generalize a device_lock() debug facility that might be able to be carried upstream. Patch6 is staged at the end of this fix series in case it is contentious and needs to be dropped.

Patch1 "drivers/base: Introduce kill_device()" could be achieved with local libnvdimm infrastructure. However, the existing 'dead' flag in 'struct device_private' aims to solve similar async register/unregister races, so the fix in patch2 "libnvdimm/bus: Prevent duplicate device_unregister() calls" can be implemented with existing driver-core infrastructure.

Patch3 is a rare lockdep warning that is intermittent based on namespaces racing ahead of the completion of probe of their parent region. It is not related to the other fixes, it just happened to trigger as a result of the async stress test.

Patch4 and patch5 address an ABBA deadlock tripped by the stress test.

These patches pass the failing stress test and the existing libnvdimm unit tests with CONFIG_PROVE_LOCKING=y and the new "dev->lockdep_mutex" shadow lock with no lockdep warnings.
--- Dan Williams (6): drivers/base: Introduce kill_device() libnvdimm/bus: Prevent duplicate device_unregister() calls libnvdimm/region: Register badblocks before namespaces libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl() libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock driver-core, libnvdimm: Let device subsystems add local lockdep coverage drivers/acpi/nfit/core.c| 28 --- drivers/acpi/nfit/nfit.h| 24 ++ drivers/base/core.c | 30 ++-- drivers/nvdimm/btt_devs.c | 16 ++-- drivers/nvdimm/bus.c| 154 +++ drivers/nvdimm/core.c | 10 +-- drivers/nvdimm/dimm_devs.c |4 + drivers/nvdimm/namespace_devs.c | 36 + drivers/nvdimm/nd-core.h| 71 ++ drivers/nvdimm/pfn_devs.c | 24 +++--- drivers/nvdimm/pmem.c |4 + drivers/nvdimm/region.c | 24 +++--- drivers/nvdimm/region_devs.c| 12 ++- include/linux/device.h |6 ++ 14 files changed, 308 insertions(+), 135 deletions(-)
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Tue, 2019-06-11 at 10:52 -0600, Subash Abhinov Kasiviswanathan wrote: > > The general plan (and I believe Daniele Palmas was working on it) was > > to eventually make qmi_wwan use rmnet rather than its internal > > sysfs-based implementation. qmi_wwan and ipa are at essentially the same > > level and both could utilize rmnet on top. > > > > *That's* what I'd like to see. I don't want to see two different ways > > to get QMAP packets to modem firmware from two different drivers that > > really could use the same code. > > > > Dan > > qmi_wwan is based on USB and is very different from the IPA interconnect > though. AFAIK, they do not have much in common (apart from sending & > receiving MAP packets from hardware). That is correct, they are very different drivers but as you state they send and receive MAP packets with the other end via some closer-to-hardware protocol (USB or GSI?) than QMAP. rmnet should handle muxing the QMAP, QoS, and aggregation and pass the resulting packet to the lower layer. That lower layer could be IPA or qmi_wwan, which in turn passes that QMAP packet to USB or GSI or whatever. This is typically how Linux handles clean abstractions between different protocol layers in drivers. Similar to some WiFi drivers (drivers/net/wireless/marvell/libertas for example) where the same firmware interface can be accessed via PCI, SDIO, USB, SPI, etc. The bus-specific code is self-contained and does not creep into the upper more generic parts. Dan
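To make the layering above concrete, here is a minimal sketch of the framing rmnet would own: a small MAP header (mux channel id plus big-endian payload length) prepended to each packet before handing it to the closer-to-hardware layer (qmi_wwan/USB or IPA/GSI). The layout is modeled on the upstream struct rmnet_map_header, but treat the exact byte positions here as illustrative rather than authoritative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 4-byte MAP header: flags, mux id, big-endian length. */
enum { MAP_HDR_LEN = 4 };

/* Prepend a MAP header to a payload; returns total frame length. */
static size_t map_encap(uint8_t *buf, uint8_t mux_id,
			const uint8_t *payload, uint16_t len)
{
	buf[0] = 0;          /* pad length / command bit, unused here */
	buf[1] = mux_id;     /* which logical channel this packet belongs to */
	buf[2] = len >> 8;   /* payload length, network byte order */
	buf[3] = len & 0xff;
	memcpy(buf + MAP_HDR_LEN, payload, len);
	return MAP_HDR_LEN + len;
}

/* Recover the mux id and payload length from a received frame. */
static uint16_t map_decap(const uint8_t *buf, uint8_t *mux_id)
{
	*mux_id = buf[1];
	return (uint16_t)((buf[2] << 8) | buf[3]);
}
```

The point of the abstraction is that this encode/decode step is transport-agnostic: the same mux/demux logic sits above either USB or GSI, exactly as the bus-specific libertas glue sits below a shared firmware interface.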
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Tue, 2019-06-11 at 13:56 +0200, Arnd Bergmann wrote: > On Tue, Jun 11, 2019 at 10:12 AM Johannes Berg > wrote: > > > > As I've made clear before, my work on this has been focused on > > > the IPA transport, > > > and some of this higher-level LTE architecture is new to me. But > > > it > > > seems pretty clear that an abstracted WWAN subsystem is a good > > > plan, > > > because these devices represent a superset of what a "normal" > > > netdev > > > implements. > > > > I'm not sure I'd actually call it a superset. By themselves, these > > netdevs are actually completely useless to the network stack, > > AFAICT. > > Therefore, the overlap with netdevs you can really use with the > > network > > stack is pretty small? > > I think Alex meant the concept of having a type of netdev with a > generic > user space interface for wwan and similar to a wlan device, as I > understood > you had suggested as well, as opposed to a stacked device as in > rmnet or those drivers it seems to be modeled after (vlan, ip tunnel, > ...)/. > > > > HOWEVER I disagree with your suggestion that the IPA code should > > > not be committed until after that is all sorted out. In part > > > it's > > > for selfish reasons, but I think there are legitimate reasons to > > > commit IPA now *knowing* that it will need to be adapted to fit > > > into the generic model that gets defined and developed. Here > > > are some reasons why. > > > > I can't really argue with those, though I would point out that the > > converse also holds - if we commit to this now, then we will have > > to > > actually keep the API offered by IPA/rmnet today, so we cannot > > actually > > remove the netdev again, even if we do migrate it to offer support > > for a > > WWAN framework in the future. > > Right. The interface to support rmnet might be simple enough to keep > next to what becomes the generic interface, but it will always > continue > to be an annoyance. 
> > > > Second, the IPA code has been out for review recently, and has > > > been > > > the subject of some detailed discussion in the past few > > > weeks. Arnd > > > especially has invested considerable time in review and > > > discussion. > > > Delaying things until after a better generic model is settled on > > > (which I'm guessing might be on the order of months) > > > > I dunno if it really has to be months. I think we can cobble > > something > > together relatively quickly that addresses the needs of IPA more > > specifically, and then extend later? > > > > But OTOH it may make sense to take a more paced approach and think > > about the details more carefully than we have over in the other > > thread so far. > > I would hope that as soon as we can agree on a general approach, it > would also be possible to merge a minimal implementation into the > kernel > along with IPA. Alex already mentioned that IPA in its current state > does > not actually support more than one data channel, so the necessary > setup for it becomes even simpler. 
> > At the moment, the rmnet configuration in include/uapi/linux/if_link.h
> is almost trivial, with the three pieces of information needed being
> an IFLA_LINK to point to the real device (not needed if there is only
> one device per channel, instead of two), the IFLA_RMNET_MUX_ID
> setting the ID of the muxing channel (not needed if there is only
> one channel ?), a way to specify software bridging between channels
> (not useful if there is only one channel) and a few flags that I assume
> must match the remote end:
>
> #define RMNET_FLAGS_INGRESS_DEAGGREGATION (1U << 0)
> #define RMNET_FLAGS_INGRESS_MAP_COMMANDS  (1U << 1)
> #define RMNET_FLAGS_INGRESS_MAP_CKSUMV4   (1U << 2)
> #define RMNET_FLAGS_EGRESS_MAP_CKSUMV4    (1U << 3)
>
> enum {
>	IFLA_RMNET_UNSPEC,
>	IFLA_RMNET_MUX_ID,
>	IFLA_RMNET_FLAGS,
>	__IFLA_RMNET_MAX,
> };
> #define IFLA_RMNET_MAX (__IFLA_RMNET_MAX - 1)
>
> struct ifla_rmnet_flags {
>	__u32 flags;
>	__u32 mask;
> };
>
> > > Third, having the code upstream actually means the actual requirements > > for rmnet-over-IPA are clear and explicit. This might not be a huge > > deal, but I think it's better to devise a generic WWAN scheme that > > can refer to actual code than to do so with assumptions about what > > will work with rmnet (and others). As far as I know, the upstream > > rmnet has no other upstream back end; IPA will make it "real." > > Is that really true? I had previously been told that rmnet actually does > have use with a few existing drivers. > > > > If true though, then I think this would be the killer argument *in > favour* of *not* merging this - because that would mean we *don't* have > to actually keep the rmnet API around for all foreseeable future. > > I would agree with that. From the code I can see no other driver > including the rmnet protocol header (see
Re: [PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
On Fri, Jun 7, 2019 at 12:54 PM Andrew Morton wrote: > > On Thu, 6 Jun 2019 15:06:26 -0700 Dan Williams > wrote: > > > On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton > > wrote: > > > > > > On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams > > > wrote: > > > > > > > At namespace creation time there is the potential for the "expected to > > > > be zero" fields of a 'pfn' info-block to be filled with indeterminate > > > > data. While the kernel buffer is zeroed on allocation it is immediately > > > > overwritten by nd_pfn_validate() filling it with the current contents of > > > > the on-media info-block location. For fields like 'flags' and the > > > > 'padding' it potentially means that future implementations can not rely > > > > on those fields being zero. > > > > > > > > In preparation to stop using the 'start_pad' and 'end_trunc' fields for > > > > section alignment, arrange for fields that are not explicitly > > > > initialized to be guaranteed zero. Bump the minor version to indicate it > > > > is safe to assume the 'padding' and 'flags' are zero. Otherwise, this > > > > corruption is expected to be benign since all other critical fields are > > > > explicitly initialized. > > > > > > > > Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") > > > > Cc: > > > > Signed-off-by: Dan Williams > > > > > > The cc:stable in [11/12] seems odd. Is this independent of the other > > > patches? If so, shouldn't it be a standalone thing which can be > > > prioritized? > > > > > > > The cc: stable is about spreading this new policy to as many kernels > > as possible, not fixing an issue in those kernels. It's not until patch > > 12 "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" > > that the fields matter, since all previous kernels do initialize all fields. > > > > I'd be ok to drop that cc: stable, my concern is distros that somehow > > pickup and backport patch 12 and miss patch 11. 
> > Could you please propose a changelog paragraph which explains all this > to those who will be considering this patch for backports? > Will do. I'll resend the series with this and the fixups proposed by Oscar.
[PATCH v3 00/10] EFI Specific Purpose Memory Support
Changes since v2:
- Consolidate the new E820_TYPE and IORES_DESC and EFI configuration symbol on an "_APPLICATION_RESERVED" suffix. (Ard)
- Rework the implementation to drop the new MEMBLOCK_APP_SPECIFIC memblock and move the reservation earlier to e820__memblock_setup(). (Mike)
- Move efi_fake_mem support for EFI_MEMORY_SP to its own implementation that does not require memblock allocations.
- Move is_efi_application_reserved() into the x86 efi implementation. (Ard)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html

---

Merge logistics: These patches touch core-efi, acpi, device-dax, and x86. Given that the regression risk is highest for the x86 changes, it seems tip.git is the best tree to host the series.

---

The EFI 2.8 Specification [2] introduces the EFI_MEMORY_SP ("specific purpose") memory attribute. This attribute bit replaces the deprecated ACPI HMAT "reservation hint" that was introduced in ACPI 6.2 and removed in ACPI 6.3. Given the increasing diversity of memory types that might be advertised to the operating system, there is a need for platform firmware to hint which memory ranges are free for the OS to use as general purpose memory and which ranges are intended for application-specific usage. For example, an application with prior knowledge of the platform may expect to be able to exclusively allocate a precious / limited pool of high bandwidth memory. Alternatively, for the general purpose case, the operating system may want to make the memory available on a best-effort basis as a unique numa-node with performance properties reported by the new CONFIG_HMEM_REPORTING [3] facility. In support of optionally allowing either application-exclusive or core-kernel-mm-managed access to differentiated memory, claim EFI_MEMORY_SP ranges for exposure as device-dax instances by default. Such instances can be directly owned / mapped by a platform-topology-aware application. 
Alternatively, with the new kmem facility [4], the administrator has the option to instead designate that those memory ranges be hot-added to the core-kernel-mm as a unique memory numa-node. In short, allow for the decision about what software agent manages specific-purpose memory to be made at runtime.

The patches are based on the new HMAT+HMEM_REPORTING facilities merged for v5.2-rc1. The implementation is tested with qemu emulation of HMAT [5] plus the efi_fake_mem facility for applying the EFI_MEMORY_SP attribute.

[2]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e1cf33aafb84
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308f
[5]: http://patchwork.ozlabs.org/cover/1096737/

---

Dan Williams (10):
      acpi/numa: Establish a new drivers/acpi/numa/ directory
      acpi/numa/hmat: Skip publishing target info for nodes with no online memory
      efi: Enumerate EFI_MEMORY_SP
      x86, efi: Push EFI_MEMMAP check into leaf routines
      x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax
      x86, efi: Add efi_fake_mem support for EFI_MEMORY_SP
      lib/memregion: Uplevel the pmem "region" ida to a global allocator
      device-dax: Add a driver for "hmem" devices
      acpi/numa/hmat: Register HMAT at device_initcall level
      acpi/numa/hmat: Register "specific purpose" memory as an "hmem" device

 arch/x86/Kconfig                    |   21 +
 arch/x86/boot/compressed/eboot.c    |    5 +
 arch/x86/boot/compressed/kaslr.c    |    3 -
 arch/x86/include/asm/e820/types.h   |    9 ++
 arch/x86/include/asm/efi.h          |   15 
 arch/x86/kernel/e820.c              |   12 ++-
 arch/x86/kernel/setup.c             |   21 +++--
 arch/x86/platform/efi/efi.c         |   40 -
 arch/x86/platform/efi/quirks.c      |    3 +
 drivers/acpi/Kconfig                |    9 --
 drivers/acpi/Makefile               |    3 -
 drivers/acpi/hmat/Makefile          |    2 
 drivers/acpi/numa/Kconfig           |    8 ++
 drivers/acpi/numa/Makefile          |    3 +
 drivers/acpi/numa/hmat.c            |  149 +++
 drivers/acpi/numa/srat.c            |    0 
 drivers/dax/Kconfig                 |   27 +-
 drivers/dax/Makefile                |    2 
 drivers/dax/hmem.c                  |   58 ++
 drivers/firmware/efi/Makefile       |    5 +
 drivers/firmware/efi/efi.c          |    5 +
 drivers/firmware/efi/esrt.c         |    3 +
 drivers/firmware/efi/fake_mem-x86.c |   69 
 drivers/firmware/efi/fake_mem.c     |   26 +++---
 drivers/firmware/efi/fake_mem.h     |   10 ++
 drivers/nvdimm/Kconfig              |    1 
 drivers/nvdimm/core.c               |    1 
 drivers/nvdimm/nd-core.h            |    1 
 drivers/nvdimm/region_devs.c        |   13 +--
 include/linux/efi.h                 |    1 
 include/linux/ioport.h
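The policy described in this cover letter can be restated in a few lines: conventional memory carrying the UEFI 2.8 EFI_MEMORY_SP attribute is kept out of the general page allocator and handed to device-dax by default. The sketch below mirrors the spirit of the series' is_efi_application_reserved() check; the trimmed descriptor struct and helper name are illustrative, not the kernel's actual definitions, though the attribute value (bit 0x40000) and memory type come from the UEFI specification.

```c
#include <assert.h>
#include <stdint.h>

/* EFI_MEMORY_SP per UEFI 2.8; EfiConventionalMemory is type 7. */
#define EFI_MEMORY_SP           0x0000000000040000ULL
#define EFI_CONVENTIONAL_MEMORY 7

/* Trimmed-down memory map descriptor: just the fields the check uses. */
struct efi_mem_desc {
	uint32_t type;
	uint64_t attribute;
};

/* Conventional memory flagged "specific purpose" is reserved for
 * application-exclusive (device-dax) use rather than general RAM. */
static int is_app_reserved(const struct efi_mem_desc *md)
{
	return md->type == EFI_CONVENTIONAL_MEMORY &&
	       (md->attribute & EFI_MEMORY_SP) != 0;
}
```

With the kmem facility, the administrator can later override this default and hot-add such a range back to the page allocator as its own numa-node.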
Re: [PATCH v9 08/12] mm/sparsemem: Support sub-section hotplug
On Fri, Jun 7, 2019 at 1:34 AM Oscar Salvador wrote: > > On Wed, Jun 05, 2019 at 02:58:42PM -0700, Dan Williams wrote: > > The libnvdimm sub-system has suffered a series of hacks and broken > > workarounds for the memory-hotplug implementation's awkward > > section-aligned (128MB) granularity. For example the following backtrace > > is emitted when attempting arch_add_memory() with physical address > > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) > > within a given section: > > > > WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 > > devm_memremap_pages+0x3b5/0x4c0 > > devm_memremap_pages attempted on mixed region [mem 0x2-0x2fbff > > flags 0x200] > > [..] > > Call Trace: > >dump_stack+0x86/0xc3 > >__warn+0xcb/0xf0 > >warn_slowpath_fmt+0x5f/0x80 > >devm_memremap_pages+0x3b5/0x4c0 > >__wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap] > >pmem_attach_disk+0x19a/0x440 [nd_pmem] > > > > Recently it was discovered that the problem goes beyond RAM vs PMEM > > collisions as some platform produce PMEM vs PMEM collisions within a > > given section. The libnvdimm workaround for that case revealed that the > > libnvdimm section-alignment-padding implementation has been broken for a > > long while. A fix for that long-standing breakage introduces as many > > problems as it solves as it would require a backward-incompatible change > > to the namespace metadata interpretation. Instead of that dubious route > > [1], address the root problem in the memory-hotplug implementation. > > > > [1]: > > https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com > > Cc: Michal Hocko > > Cc: Vlastimil Babka > > Cc: Logan Gunthorpe > > Cc: Oscar Salvador > > Cc: Pavel Tatashin > > Signed-off-by: Dan Williams > > --- > > include/linux/memory_hotplug.h |2 > > mm/memory_hotplug.c|7 - > > mm/page_alloc.c|2 > > mm/sparse.c| 225 > > +++- > > 4 files changed, 155 insertions(+), 81 deletions(-) > > > [...] 
> > @@ -325,6 +332,15 @@ static void __meminit sparse_init_one_section(struct > > mem_section *ms, > > unsigned long pnum, struct page *mem_map, > > struct mem_section_usage *usage) > > { > > + /* > > + * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug, > > + * ->section_mem_map can not be guaranteed to point to a full > > + * section's worth of memory. The field is only valid / used > > + * in the SPARSEMEM_VMEMMAP=n case. > > + */ > > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) > > + mem_map = NULL; > > Will this be a problem when reading mem_map with the crash-tool? > I do not expect it to be, but I am not sure if crash internally tries > to read ms->section_mem_map and do some sort of translation. > And since ms->section_mem_map has SECTION_HAS_MEM_MAP, it might be that it expects > a valid mem_map? I don't know, but I can't imagine it would because it's much easier to do mem_map relative translations by simple PAGE_OFFSET arithmetic. > > +static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > > + struct vmem_altmap *altmap) > > +{ > > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; > > + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; > > + struct mem_section *ms = __pfn_to_section(pfn); > > + bool early_section = is_early_section(ms); > > + struct page *memmap = NULL; > > + unsigned long *subsection_map = ms->usage > > + ? &ms->usage->subsection_map[0] : NULL; > > + > > + subsection_mask_set(map, pfn, nr_pages); > > + if (subsection_map) > > + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION); > > + > > + if (WARN(!subsection_map || !bitmap_equal(tmp, map, > > SUBSECTIONS_PER_SECTION), > > + "section already deactivated (%#lx + %ld)\n", > > + pfn, nr_pages)) > > + return; > > + > > + /* > > + * There are 3 cases to handle across two configurations > > + * (SPARSEMEM_VMEMMAP={y,n}): > > + * > > + * 1/ deactivation of a partial hot-added section (only possible > > + * in the SPARSEMEM_VMEMMAP=y case). 
> > + *a/ section was
Re: [Lsf-pc] [LSF/MM TOPIC] The end of the DAX experiment
On Wed, Jun 5, 2019 at 9:22 AM Goldwyn Rodrigues wrote: > > Hi Dan/Jerome, > > On 12:20 14/02, Dan Williams wrote: > > On Thu, Feb 14, 2019 at 12:09 PM Matthew Wilcox wrote: > > > > > > On Thu, Feb 14, 2019 at 11:31:24AM -0800, Dan Williams wrote: > > > > On Thu, Feb 14, 2019 at 11:10 AM Jerome Glisse > > > > wrote: > > > > > I am just again working on my struct page mapping patchset as well as > > > > > the generic page write protection that sits on top. I hope to be able > > > > > to post the v2 in couple weeks. You can always look at my posting last > > > > > year to see more details. > > > > > > > > Yes, I have that in mind as one of the contenders. However, it's not > > > > clear to me that its a suitable fit for filesystem-reflink. Others > > > > have floated the 'page proxy' idea, so it would be good to discuss the > > > > merits of the general approaches. > > > > > > ... and my preferred option of putting pfn entries in the page cache. > > > > Another option to include the discussion. > > > > > Or is that what you meant by "page proxy"? > > > > Page proxy would be an object that a filesystem could allocate to > > point back to a single physical 'struct page *'. The proxy would > > contain an override for page->index. > > Was there any outcome on this and its implementation? I am specifically > interested in this for DAX support on btrfs/CoW: The TODO comment on > top of dax_associate_entry() :) > > If there are patches/git tree I could use to base my patches on, it would > be nice. Half joking, but I was hoping that by the time I had circled back to finally reviewing the brtfs dax patches that a solution to this problem would be waiting. We spent more time on other DAX topics. I recall that Jerome and I were mutually skeptical of each others approaches at first glance, but we did not get into the details. So, I think it will be a matter of getting the idea coded up. 
I'm finally coming out from under a pile of mm work so I should have some time this cycle to push the page proxy idea forward.
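For readers following the thread, a minimal sketch of the "page proxy" idea being discussed: a small filesystem-allocated object pointing back at one physical page while overriding page->index, so that e.g. two reflinked files can map the same pmem page at different file offsets. Every name below is invented for illustration; this is a speculative design from the discussion, not an existing kernel API.

```c
#include <assert.h>

/* Stand-in for the kernel's struct page, reduced to the field at issue. */
struct page { unsigned long index; };

/* A proxy: many proxies may share one physical page, each with its own
 * view of the file offset (index). */
struct page_proxy {
	struct page *real;    /* the single shared physical page */
	unsigned long index;  /* per-mapping override of real->index */
};

/* Callers that need the file offset consult the proxy, not the page. */
static unsigned long page_index_of(const struct page_proxy *p)
{
	return p->index;
}
```

This is what would let dax_associate_entry() cope with CoW/reflink: the reverse-mapping state moves from the shared page into the per-file proxy.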
Re: [PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton wrote: > > On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams > wrote: > > > At namespace creation time there is the potential for the "expected to > > be zero" fields of a 'pfn' info-block to be filled with indeterminate > > data. While the kernel buffer is zeroed on allocation it is immediately > > overwritten by nd_pfn_validate() filling it with the current contents of > > the on-media info-block location. For fields like, 'flags' and the > > 'padding' it potentially means that future implementations can not rely > > on those fields being zero. > > > > In preparation to stop using the 'start_pad' and 'end_trunc' fields for > > section alignment, arrange for fields that are not explicitly > > initialized to be guaranteed zero. Bump the minor version to indicate it > > is safe to assume the 'padding' and 'flags' are zero. Otherwise, this > > corruption is expected to benign since all other critical fields are > > explicitly initialized. > > > > Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") > > Cc: > > Signed-off-by: Dan Williams > > The cc:stable in [11/12] seems odd. Is this independent of the other > patches? If so, shouldn't it be a standalone thing which can be > prioritized? > The cc: stable is about spreading this new policy to as many kernels as possible not fixing an issue in those kernels. It's not until patch 12 "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" as all previous kernel do initialize all fields. I'd be ok to drop that cc: stable, my concern is distros that somehow pickup and backport patch 12 and miss patch 11.
Re: [PATCH v9 07/12] mm/sparsemem: Prepare for sub-section ranges
On Thu, Jun 6, 2019 at 10:21 AM Oscar Salvador wrote: > > On Wed, Jun 05, 2019 at 02:58:37PM -0700, Dan Williams wrote: > > Prepare the memory hot-{add,remove} paths for handling sub-section > > ranges by plumbing the starting page frame and number of pages being > > handled through arch_{add,remove}_memory() to > > sparse_{add,remove}_one_section(). > > > > This is simply plumbing, small cleanups, and some identifier renames. No > > intended functional changes. > > > > Cc: Michal Hocko > > Cc: Vlastimil Babka > > Cc: Logan Gunthorpe > > Cc: Oscar Salvador > > Reviewed-by: Pavel Tatashin > > Signed-off-by: Dan Williams > > --- > > include/linux/memory_hotplug.h |5 +- > > mm/memory_hotplug.c| 114 > > +--- > > mm/sparse.c| 15 ++--- > > 3 files changed, 81 insertions(+), 53 deletions(-) > > > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > > index 79e0add6a597..3ab0282b4fe5 100644 > > --- a/include/linux/memory_hotplug.h > > +++ b/include/linux/memory_hotplug.h > > @@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct > > resource *resource); > > extern void move_pfn_range_to_zone(struct zone *zone, unsigned long > > start_pfn, > > unsigned long nr_pages, struct vmem_altmap *altmap); > > extern bool is_memblock_offlined(struct memory_block *mem); > > -extern int sparse_add_one_section(int nid, unsigned long start_pfn, > > - struct vmem_altmap *altmap); > > +extern int sparse_add_section(int nid, unsigned long pfn, > > + unsigned long nr_pages, struct vmem_altmap *altmap); > > extern void sparse_remove_one_section(struct mem_section *ms, > > + unsigned long pfn, unsigned long nr_pages, > > unsigned long map_offset, struct vmem_altmap *altmap); > > extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, > > unsigned long pnum); > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > > index 4b882c57781a..399bf78bccc5 100644 > > --- a/mm/memory_hotplug.c > > +++ b/mm/memory_hotplug.c > > @@ -252,51 
+252,84 @@ void __init register_page_bootmem_info_node(struct > > pglist_data *pgdat) > > } > > #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */ > > > > -static int __meminit __add_section(int nid, unsigned long phys_start_pfn, > > -struct vmem_altmap *altmap) > > +static int __meminit __add_section(int nid, unsigned long pfn, > > + unsigned long nr_pages, struct vmem_altmap *altmap) > > { > > int ret; > > > > - if (pfn_valid(phys_start_pfn)) > > + if (pfn_valid(pfn)) > > return -EEXIST; > > > > - ret = sparse_add_one_section(nid, phys_start_pfn, altmap); > > + ret = sparse_add_section(nid, pfn, nr_pages, altmap); > > return ret < 0 ? ret : 0; > > } > > > > +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, > > + const char *reason) > > +{ > > + /* > > + * Disallow all operations smaller than a sub-section and only > > + * allow operations smaller than a section for > > + * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range() > > + * enforces a larger memory_block_size_bytes() granularity for > > + * memory that will be marked online, so this check should only > > + * fire for direct arch_{add,remove}_memory() users outside of > > + * add_memory_resource(). > > + */ > > + unsigned long min_align; > > + > > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) > > + min_align = PAGES_PER_SUBSECTION; > > + else > > + min_align = PAGES_PER_SECTION; > > + if (!IS_ALIGNED(pfn, min_align) > > + || !IS_ALIGNED(nr_pages, min_align)) { > > + WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n", > > + reason, pfn, pfn + nr_pages - 1); > > + return -EINVAL; > > + } > > + return 0; > > +} > > > This caught my eye. > Back in patch#4 "Convert kmalloc_section_memmap() to > populate_section_memmap()", > you placed a mis-usage check for !CONFIG_SPARSEMEM_VMEMMAP in > populate_section_memmap(). 
> > populate_section_memmap() gets called from sparse_add_one_section(), which > means > that we should have passed this check, otherwise we cannot go further and call > __add_section(). > > So, unless I am missing something it seems to me that the check from patch#4 > could go? > And I think the same applies to depopulate_section_memmap()? Yes, good catch, I can kill those extra checks in favor of this one. > Besides that, it looks good to me: Thanks Oscar! > > Reviewed-by: Oscar Salvador > > -- > Oscar Salvador > SUSE L3
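The check_pfn_span() helper reviewed above can be restated in userspace to show the rule it enforces: with SPARSEMEM_VMEMMAP, hotplug may operate at sub-section (2MB) granularity, otherwise a full section (128MB) is required. The constants below assume x86-64 with 4K pages, and the plain -1 return stands in for -EINVAL; this is a sketch for illustration, not the kernel code.

```c
#include <assert.h>

#define PAGES_PER_SECTION    (1UL << 15) /* 128MB / 4K pages */
#define PAGES_PER_SUBSECTION (1UL << 9)  /* 2MB / 4K pages */
#define IS_ALIGNED(x, a)     (((x) & ((a) - 1)) == 0)

/* Both the start pfn and the page count must be aligned to the minimum
 * hotplug granularity for the configuration. */
static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
			  int vmemmap)
{
	unsigned long min_align = vmemmap ? PAGES_PER_SUBSECTION
					  : PAGES_PER_SECTION;
	if (!IS_ALIGNED(pfn, min_align) || !IS_ALIGNED(nr_pages, min_align))
		return -1; /* -EINVAL in the kernel */
	return 0;
}
```

As the patch's comment notes, check_hotplug_memory_range() already enforces a coarser memory_block_size_bytes() granularity for memory that will be onlined, so this check mainly guards direct arch_{add,remove}_memory() callers such as devm_memremap_pages().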
Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
On Thu, Jun 6, 2019 at 3:42 AM Jan Kara wrote: > > On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote: > > From: Ira Weiny > > > > ... V1,000,000 ;-) > > > > Pre-requisites: > > John Hubbard's put_user_pages() patch series.[1] > > Jan Kara's ext4_break_layouts() fixes[2] > > > > Based on the feedback from LSFmm and the LWN article which resulted, I've > > decided to take a slightly different tack on this problem. > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed > > memory which is then truncated. So really any solution we present which: > > > > A) Prevents file system corruption or data leaks > > ...and... > > B) Informs the user that they did something wrong > > > > Should be an acceptable solution. > > > > Because this is slightly new behavior, and because this is going to be > > specific to DAX (because of the lack of a page cache), we have made the user > > "opt in" to this behavior. > > > > The following patches implement the following solution. > > > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease > >(now made visible). > > 2) GUP will fail (EPERM) if a layout lease is not taken > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail. > > 4) The user has the option of holding the layout lease to receive a SIGIO > > for > >notification to the original thread that another thread has tried to > > delete > >their data. Furthermore this indicates that if the user needs to GUP the > >file again they will need to retake the Layout lease before doing so. > > > > > > NOTE: If the user releases the layout lease or if it has been broken by > > another operation further GUP operations on the file will fail without > > re-taking the lease. This means that if a user would like to register > > pieces of a file and continue to register other pieces later they would > > be advised to keep the layout lease, get a SIGIO notification, and retake > > the lease. 
> > > > NOTE2: Truncation of pages which are not actively pinned will succeed. > > Similar to accessing an mmap to this area GUP pins of that memory may > > fail. > > So after some thought I'm willing to accept the fact that pinned DAX pages > will just make truncate / hole punch fail and shove it into the same bucket > of situations like "user can open a file and unlink won't delete it" or > "ETXTBUSY when user is executing a file being truncated". The problem I > have with this proposal is a lack of visibility from sysadmin POV. For > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the > problematic process and kill it. There's nothing like that with your > proposal since currently once you hold page reference, you can unmap the > file, drop layout lease, close the file, and there's no trace that you're > responsible for the pinned page anymore. > > So I'd like to actually mandate that you *must* hold the file lease until > you unpin all pages in the given range (not just that you have an option to > hold a lease). And I believe the kernel should actually enforce this. That > way we maintain a sane state that if someone uses a physical location of > logical file offset on disk, he has a layout lease. Also once this is done, > sysadmin has a reasonably easy way to discover run-away RDMA application > and kill it if he wishes so. Yes, this satisfies the primary concern that made me oppose failing truncate. If the administrator determines that reclaiming capacity is more important than maintaining active RDMA mappings "lsof + kill" is a reasonable way to recover. I'd go so far as to say that anything less is an abdication of the kernel's responsibility as an arbiter of platform resources. > The question is on how to exactly enforce that lease is taken until all > pages are unpinned. I believe it could be done by tracking number of > long-term pinned pages within a lease. 
Gup_longterm could easily increment > the count when verifying the lease exists, gup_longterm users will somehow > need to propagate corresponding 'filp' (struct file pointer) to > put_user_pages_longterm() callsites so that they can look up appropriate > lease to drop reference - probably I'd just transition all gup_longterm() > users to a saner API similar to the one we have in mm/frame_vector.c where > we don't hand out page pointers but an encapsulating structure that does > all the necessary tracking. Removing a lease would need to block until all > pins are released - this is probably the most hairy part since we need to > handle a case if application just closes the file descriptor which would > release the lease but OTOH we need to make sure task exit does not deadlock. > Maybe we could block only on explicit lease unlock and just drop the layout > lease on file close and if there are still pinned pages, send SIGKILL to an > application as a reminder it did something stupid... > > What do people think about this? SIGKILL on close() without explicit
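A rough userspace model of the enforcement Jan proposes above: a layout lease tracks the number of long-term GUP pins taken under it, pinning fails without a lease, and the lease cannot be cleanly released while pins remain. Entirely illustrative; none of these names are real kernel interfaces, and the real design would need to handle file close and task exit as discussed.

```c
#include <assert.h>

/* Per-file lease state: whether it is held, and how many long-term
 * pins were taken under it. */
struct layout_lease { int held; int pins; };

/* A long-term pin must be accounted against a held lease. */
static int gup_longterm(struct layout_lease *l)
{
	if (!l->held)
		return -1; /* no lease, no pin (-EPERM in the proposal) */
	l->pins++;
	return 0;
}

/* Dropping the pages decrements the lease's pin count. */
static void put_user_pages_longterm(struct layout_lease *l)
{
	l->pins--;
}

/* An explicit unlock cannot complete while pins remain. */
static int lease_unlock(struct layout_lease *l)
{
	if (l->pins)
		return -1; /* would block until all pins are dropped */
	l->held = 0;
	return 0;
}
```

The visibility benefit follows directly: as long as a pin exists, the lease (and thus the owning file descriptor and process) remains discoverable via lsof.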
[PATCH v9 07/12] mm/sparsemem: Prepare for sub-section ranges
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().

This is simply plumbing, small cleanups, and some identifier renames. No
intended functional changes.

Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Logan Gunthorpe
Cc: Oscar Salvador
Reviewed-by: Pavel Tatashin
Signed-off-by: Dan Williams
---
 include/linux/memory_hotplug.h |    5 +-
 mm/memory_hotplug.c            |  114 +---
 mm/sparse.c                    |   15 ++---
 3 files changed, 81 insertions(+), 53 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 79e0add6a597..3ab0282b4fe5 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct resource *resource);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int sparse_add_one_section(int nid, unsigned long start_pfn,
-		struct vmem_altmap *altmap);
+extern int sparse_add_section(int nid, unsigned long pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern void sparse_remove_one_section(struct mem_section *ms,
+		unsigned long pfn, unsigned long nr_pages,
		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
		unsigned long pnum);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4b882c57781a..399bf78bccc5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 }
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */

-static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap)
+static int __meminit __add_section(int nid, unsigned long pfn,
+		unsigned
long nr_pages, struct vmem_altmap *altmap) { int ret; - if (pfn_valid(phys_start_pfn)) + if (pfn_valid(pfn)) return -EEXIST; - ret = sparse_add_one_section(nid, phys_start_pfn, altmap); + ret = sparse_add_section(nid, pfn, nr_pages, altmap); return ret < 0 ? ret : 0; } +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, + const char *reason) +{ + /* +* Disallow all operations smaller than a sub-section and only +* allow operations smaller than a section for +* SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range() +* enforces a larger memory_block_size_bytes() granularity for +* memory that will be marked online, so this check should only +* fire for direct arch_{add,remove}_memory() users outside of +* add_memory_resource(). +*/ + unsigned long min_align; + + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) + min_align = PAGES_PER_SUBSECTION; + else + min_align = PAGES_PER_SECTION; + if (!IS_ALIGNED(pfn, min_align) + || !IS_ALIGNED(nr_pages, min_align)) { + WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n", + reason, pfn, pfn + nr_pages - 1); + return -EINVAL; + } + return 0; +} + /* * Reasonably generic function for adding memory. It is * expected that archs that support memory hotplug will * call this function after deciding the zone to which to * add the new pages. 
*/ -int __ref __add_pages(int nid, unsigned long phys_start_pfn, - unsigned long nr_pages, struct mhp_restrictions *restrictions) +int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, + struct mhp_restrictions *restrictions) { unsigned long i; - int err = 0; - int start_sec, end_sec; + int start_sec, end_sec, err; struct vmem_altmap *altmap = restrictions->altmap; - /* during initialize mem_map, align hot-added range to section */ - start_sec = pfn_to_section_nr(phys_start_pfn); - end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1); - if (altmap) { /* * Validate altmap is within bounds of the total request */ - if (altmap->base_pfn != phys_start_pfn + if (altmap->base_pfn != pfn || vmem_altmap_offset(altmap) > nr_pages) { pr_warn_once("memory add fail, invalid altmap\n"); - err = -E
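The alignment rule that check_pfn_span() enforces above can be modeled in plain userspace C. This is a sketch, not kernel code; the constants assume x86_64 (PAGE_SHIFT=12, SUBSECTION_SHIFT=PMD_SHIFT=21, SECTION_SIZE_BITS=27):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of check_pfn_span(): both the start pfn and the page
 * count must be aligned to the minimum hotplug granularity, which is a
 * 2MB sub-section under SPARSEMEM_VMEMMAP and a full 128MB section
 * otherwise.  Constants are x86_64 assumptions. */
#define PAGE_SHIFT              12
#define SUBSECTION_SHIFT        21      /* PMD_SHIFT */
#define SECTION_SIZE_BITS       27
#define PAGES_PER_SUBSECTION    (1UL << (SUBSECTION_SHIFT - PAGE_SHIFT)) /* 512 */
#define PAGES_PER_SECTION       (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT)) /* 32768 */
#define IS_ALIGNED(x, a)        (((x) & ((a) - 1)) == 0)

/* vmemmap == true models CONFIG_SPARSEMEM_VMEMMAP=y */
static bool span_ok(unsigned long pfn, unsigned long nr_pages, bool vmemmap)
{
        unsigned long min_align = vmemmap ? PAGES_PER_SUBSECTION
                                          : PAGES_PER_SECTION;

        return IS_ALIGNED(pfn, min_align) && IS_ALIGNED(nr_pages, min_align);
}
```

Direct arch_{add,remove}_memory() callers that trip this check would hit the WARN in the patch above; add_memory_resource() users never do, because memory_block_size_bytes() is a larger granularity.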
[PATCH v9 12/12] libnvdimm/pfn: Stop padding pmem namespaces to section alignment
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE memory, we no longer need to add padding at pfn/dax device creation time. The kernel will still honor padding established by older kernels. Reported-by: Jeff Moyer Signed-off-by: Dan Williams --- drivers/nvdimm/pfn.h | 14 drivers/nvdimm/pfn_devs.c | 77 - include/linux/mmzone.h|3 ++ 3 files changed, 16 insertions(+), 78 deletions(-) diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h index e901e3a3b04c..cc042a98758f 100644 --- a/drivers/nvdimm/pfn.h +++ b/drivers/nvdimm/pfn.h @@ -41,18 +41,4 @@ struct nd_pfn_sb { __le64 checksum; }; -#ifdef CONFIG_SPARSEMEM -#define PFN_SECTION_ALIGN_DOWN(x) SECTION_ALIGN_DOWN(x) -#define PFN_SECTION_ALIGN_UP(x) SECTION_ALIGN_UP(x) -#else -/* - * In this case ZONE_DEVICE=n and we will disable 'pfn' device support, - * but we still want pmem to compile. - */ -#define PFN_SECTION_ALIGN_DOWN(x) (x) -#define PFN_SECTION_ALIGN_UP(x) (x) -#endif - -#define PHYS_SECTION_ALIGN_DOWN(x) PFN_PHYS(PFN_SECTION_ALIGN_DOWN(PHYS_PFN(x))) -#define PHYS_SECTION_ALIGN_UP(x) PFN_PHYS(PFN_SECTION_ALIGN_UP(PHYS_PFN(x))) #endif /* __NVDIMM_PFN_H */ diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index a2406253eb70..7f54374b082f 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -595,14 +595,14 @@ static u32 info_block_reserve(void) } /* - * We hotplug memory at section granularity, pad the reserved area from - * the previous section base to the namespace base address. + * We hotplug memory at sub-section granularity, pad the reserved area + * from the previous section base to the namespace base address. 
*/ static unsigned long init_altmap_base(resource_size_t base) { unsigned long base_pfn = PHYS_PFN(base); - return PFN_SECTION_ALIGN_DOWN(base_pfn); + return SUBSECTION_ALIGN_DOWN(base_pfn); } static unsigned long init_altmap_reserve(resource_size_t base) @@ -610,7 +610,7 @@ static unsigned long init_altmap_reserve(resource_size_t base) unsigned long reserve = info_block_reserve() >> PAGE_SHIFT; unsigned long base_pfn = PHYS_PFN(base); - reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn); + reserve += base_pfn - SUBSECTION_ALIGN_DOWN(base_pfn); return reserve; } @@ -641,8 +641,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap) nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns); pgmap->altmap_valid = false; } else if (nd_pfn->mode == PFN_MODE_PMEM) { - nd_pfn->npfns = PFN_SECTION_ALIGN_UP((resource_size(res) - - offset) / PAGE_SIZE); + nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset)); if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns) dev_info(_pfn->dev, "number of pfns truncated from %lld to %ld\n", @@ -658,54 +657,14 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap) return 0; } -static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys) -{ - return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys), - ALIGN_DOWN(phys, nd_pfn->align)); -} - -/* - * Check if pmem collides with 'System RAM', or other regions when - * section aligned. Trim it accordingly. 
- */ -static void trim_pfn_device(struct nd_pfn *nd_pfn, u32 *start_pad, u32 *end_trunc) -{ - struct nd_namespace_common *ndns = nd_pfn->ndns; - struct nd_namespace_io *nsio = to_nd_namespace_io(>dev); - struct nd_region *nd_region = to_nd_region(nd_pfn->dev.parent); - const resource_size_t start = nsio->res.start; - const resource_size_t end = start + resource_size(>res); - resource_size_t adjust, size; - - *start_pad = 0; - *end_trunc = 0; - - adjust = start - PHYS_SECTION_ALIGN_DOWN(start); - size = resource_size(>res) + adjust; - if (region_intersects(start - adjust, size, IORESOURCE_SYSTEM_RAM, - IORES_DESC_NONE) == REGION_MIXED - || nd_region_conflict(nd_region, start - adjust, size)) - *start_pad = PHYS_SECTION_ALIGN_UP(start) - start; - - /* Now check that end of the range does not collide. */ - adjust = PHYS_SECTION_ALIGN_UP(end) - end; - size = resource_size(>res) + adjust; - if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM, - IORES_DESC_NONE) == REGION_MIXED - || !IS_ALIGNED(end, nd_pfn->align) - || nd_region_conflict(nd_region, start, size)) - *end_trunc = end - phys_pmem_align_down(nd_pfn, end); -} - static int nd_pfn_init(struc
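The switch from PFN_SECTION_ALIGN_DOWN to SUBSECTION_ALIGN_DOWN above shrinks the altmap padding from "gap back to a 128MB section boundary" to "gap back to a 2MB sub-section boundary". A userspace sketch of the arithmetic, with assumed x86_64 constants and an illustrative info-block reserve passed in as a parameter:

```c
#include <assert.h>

/* Sketch of init_altmap_{base,reserve}() after the patch: the altmap base
 * pfn is rounded down to a 2MB sub-section, and the reserve covers only
 * that small gap plus the info-block pages.  Constants assume x86_64;
 * info_reserve_pages stands in for info_block_reserve() >> PAGE_SHIFT. */
#define PAGE_SHIFT              12
#define SUBSECTION_SHIFT        21
#define PAGES_PER_SUBSECTION    (1UL << (SUBSECTION_SHIFT - PAGE_SHIFT))
#define PAGE_SUBSECTION_MASK    (~(PAGES_PER_SUBSECTION - 1))
#define PHYS_PFN(x)             ((unsigned long)((x) >> PAGE_SHIFT))

static unsigned long init_altmap_base(unsigned long long base)
{
        return PHYS_PFN(base) & PAGE_SUBSECTION_MASK;
}

static unsigned long init_altmap_reserve(unsigned long long base,
                                         unsigned long info_reserve_pages)
{
        unsigned long base_pfn = PHYS_PFN(base);

        return info_reserve_pages +
                (base_pfn - (base_pfn & PAGE_SUBSECTION_MASK));
}
```

With the old section-sized rounding the same namespace base could waste up to 128MB-worth of pfns in reserve; sub-section rounding caps the gap at 2MB.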
[PATCH v9 06/12] mm: Kill is_dev_zone() helper
Given there are no more usages of is_dev_zone() outside of 'ifdef CONFIG_ZONE_DEVICE' protection, kill off the compilation helper. Cc: Michal Hocko Cc: Logan Gunthorpe Acked-by: David Hildenbrand Reviewed-by: Oscar Salvador Reviewed-by: Pavel Tatashin Signed-off-by: Dan Williams --- include/linux/mmzone.h | 12 mm/page_alloc.c|2 +- 2 files changed, 1 insertion(+), 13 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6dd52d544857..49e7fb452dfd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -855,18 +855,6 @@ static inline int local_memory_node(int node_id) { return node_id; }; */ #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) -#ifdef CONFIG_ZONE_DEVICE -static inline bool is_dev_zone(const struct zone *zone) -{ - return zone_idx(zone) == ZONE_DEVICE; -} -#else -static inline bool is_dev_zone(const struct zone *zone) -{ - return false; -} -#endif - /* * Returns true if a zone has pages managed by the buddy allocator. * All the reclaim decisions have to use this function rather than diff --git a/mm/page_alloc.c b/mm/page_alloc.c index bd773efe5b82..5dff3f49a372 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5865,7 +5865,7 @@ void __ref memmap_init_zone_device(struct zone *zone, unsigned long start = jiffies; int nid = pgdat->node_id; - if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone))) + if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE)) return; /*
[PATCH v9 05/12] mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
The zone type check was a leftover from the cleanup that plumbed altmap through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the vmem_altmap to arch_remove_memory and __remove_pages". Cc: Michal Hocko Cc: Logan Gunthorpe Cc: Pavel Tatashin Reviewed-by: David Hildenbrand Reviewed-by: Oscar Salvador Signed-off-by: Dan Williams --- mm/memory_hotplug.c |7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 647859a1d119..4b882c57781a 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -535,11 +535,8 @@ void __remove_pages(struct zone *zone, unsigned long phys_start_pfn, unsigned long map_offset = 0; int sections_to_remove; - /* In the ZONE_DEVICE case device driver owns the memory region */ - if (is_dev_zone(zone)) { - if (altmap) - map_offset = vmem_altmap_offset(altmap); - } + if (altmap) + map_offset = vmem_altmap_offset(altmap); clear_zone_contiguous(zone);
[PATCH v9 09/12] mm: Document ZONE_DEVICE memory-model implications
Explain the general mechanisms of 'ZONE_DEVICE' pages and list the users of 'devm_memremap_pages()'. Cc: Jonathan Corbet Reported-by: Mike Rapoport Signed-off-by: Dan Williams --- Documentation/vm/memory-model.rst | 39 + 1 file changed, 39 insertions(+) diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst index 382f72ace1fc..e0af47e02e78 100644 --- a/Documentation/vm/memory-model.rst +++ b/Documentation/vm/memory-model.rst @@ -181,3 +181,42 @@ that is eventually passed to vmemmap_populate() through a long chain of function calls. The vmemmap_populate() implementation may use the `vmem_altmap` along with :c:func:`altmap_alloc_block_buf` helper to allocate memory map on the persistent memory device. + +ZONE_DEVICE +=== +The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer +`struct page` `mem_map` services for device driver identified physical +address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact +that the page objects for these address ranges are never marked online, +and that a reference must be taken against the device, not just the page +to keep the memory pinned for active use. `ZONE_DEVICE`, via +:c:func:`devm_memremap_pages`, performs just enough memory hotplug to +turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and +:c:func:`get_user_pages` service for the given range of pfns. Since the +page reference count never drops below 1 the page is never tracked as +free memory and the page's `struct list_head lru` space is repurposed +for back referencing to the host device / driver that mapped the memory. + +While `SPARSEMEM` presents memory as a collection of sections, +optionally collected into memory blocks, `ZONE_DEVICE` users have a need +for smaller granularity of populating the `mem_map`. Given that +`ZONE_DEVICE` memory is never marked online it is subsequently never +subject to its memory ranges being exposed through the sysfs memory +hotplug api on memory block boundaries. 
The implementation relies on +this lack of user-api constraint to allow sub-section sized memory +ranges to be specified to :c:func:`arch_add_memory`, the top-half of +memory hotplug. Sub-section support allows for `PMD_SIZE` as the minimum +alignment granularity for :c:func:`devm_memremap_pages`. + +The users of `ZONE_DEVICE` are: +* pmem: Map platform persistent memory to be used as a direct-I/O target + via DAX mappings. + +* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()` + event callbacks to allow a device-driver to coordinate memory management + events related to device-memory, typically GPU memory. See + Documentation/vm/hmm.rst. + +* p2pdma: Create `struct page` objects to allow peer devices in a + PCI/-E topology to coordinate direct-DMA operations between themselves, + i.e. bypass host memory.
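The PMD_SIZE minimum-alignment claim above is easy to sanity-check with alignment arithmetic. A minimal sketch (sizes assume x86_64, where a section is 128MB and a PMD page is 2MB): a namespace base that is only 2MB-aligned fails the old section test but passes the sub-section test.

```c
#include <assert.h>
#include <stdbool.h>

/* Why sub-section hotplug matters for ZONE_DEVICE: device-dax / pmem
 * ranges are frequently 2MB-aligned but not 128MB-aligned. */
#define SZ_2M   (2ULL << 20)
#define SZ_128M (128ULL << 20)

static bool aligned_to(unsigned long long addr, unsigned long long granule)
{
        return (addr % granule) == 0;
}
```

So a base of 4GB + 2MB, for example, is a valid devm_memremap_pages() target after this series but required padding hacks before it.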
[PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like 'flags' and the 'padding' it potentially means that future implementations cannot rely on those fields being zero. In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to be benign since all other critical fields are explicitly initialized. Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Cc: Signed-off-by: Dan Williams --- drivers/nvdimm/dax_devs.c |2 +- drivers/nvdimm/pfn.h |1 + drivers/nvdimm/pfn_devs.c | 18 +++--- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c index 0453f49dc708..326f02ffca81 100644 --- a/drivers/nvdimm/dax_devs.c +++ b/drivers/nvdimm/dax_devs.c @@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns) nvdimm_bus_unlock(>dev); if (!dax_dev) return -ENOMEM; - pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); + pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); nd_pfn->pfn_sb = pfn_sb; rc = nd_pfn_validate(nd_pfn, DAX_SIG); dev_dbg(dev, "dax: %s\n", rc == 0 ?
dev_name(dax_dev) : ""); diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h index dde9853453d3..e901e3a3b04c 100644 --- a/drivers/nvdimm/pfn.h +++ b/drivers/nvdimm/pfn.h @@ -36,6 +36,7 @@ struct nd_pfn_sb { __le32 end_trunc; /* minor-version-2 record the base alignment of the mapping */ __le32 align; + /* minor-version-3 guarantee the padding and flags are zero */ u8 padding[4000]; __le64 checksum; }; diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index 01f40672507f..a2406253eb70 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn) return 0; } +/** + * nd_pfn_validate - read and validate info-block + * @nd_pfn: fsdax namespace runtime state / properties + * @sig: 'devdax' or 'fsdax' signature + * + * Upon return the info-block buffer contents (->pfn_sb) are + * indeterminate when validation fails, and a coherent info-block + * otherwise. + */ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig) { u64 checksum, offset; @@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns) nvdimm_bus_unlock(>dev); if (!pfn_dev) return -ENOMEM; - pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); + pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL); nd_pfn = to_nd_pfn(pfn_dev); nd_pfn->pfn_sb = pfn_sb; rc = nd_pfn_validate(nd_pfn, PFN_SIG); @@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) u64 checksum; int rc; - pfn_sb = devm_kzalloc(_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL); + pfn_sb = devm_kmalloc(_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL); if (!pfn_sb) return -ENOMEM; @@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) sig = DAX_SIG; else sig = PFN_SIG; + rc = nd_pfn_validate(nd_pfn, sig); if (rc != -ENODEV) return rc; /* no info block, do init */; + memset(pfn_sb, 0, sizeof(*pfn_sb)); + nd_region = to_nd_region(nd_pfn->dev.parent); if (nd_region->ro) { dev_info(_pfn->dev, @@ 
-768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn) memcpy(pfn_sb->uuid, nd_pfn->uuid, 16); memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(>dev), 16); pfn_sb->version_major = cpu_to_le16(1); - pfn_sb->version_minor = cpu_to_le16(2); + pfn_sb->version_minor = cpu_to_le16(3); pfn_sb->start_pad = cpu_to_le32(start_pad); pfn_sb->end_trunc = cpu_to_le32(end_trunc); pfn_sb->align = cpu_to_le32(nd_pfn->align);
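The ordering the patch establishes — allocate without zeroing, let validation overwrite the buffer from media, and only memset() when no valid info-block exists — can be sketched in userspace. The struct layout below is illustrative only, not the on-media format:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the info-block lifecycle after the patch: validate() fills the
 * buffer from "media"; on failure the buffer is indeterminate, so init
 * starts from an explicit memset(0), guaranteeing reserved fields are zero
 * in a fresh minor-version-3 block. */
struct info_block {
        unsigned int version_minor;
        unsigned char padding[16];      /* must be zero in new blocks */
};

/* 0: media held a valid block (buffer now holds it); -1: no block */
static int validate(struct info_block *buf, const struct info_block *media)
{
        if (media) {
                memcpy(buf, media, sizeof(*buf));
                return 0;
        }
        return -1;      /* buffer contents stay indeterminate */
}

static void init_block(struct info_block *buf, const struct info_block *media)
{
        if (validate(buf, media) == 0)
                return;                 /* keep the existing block */
        memset(buf, 0, sizeof(*buf));   /* no block: start from all zeros */
        buf->version_minor = 3;
}
```

This is why devm_kzalloc() could become devm_kmalloc(): the zeroing at allocation time was illusory, since validation clobbered it anyway.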
[PATCH v9 10/12] mm/devm_memremap_pages: Enable sub-section remap
Teach devm_memremap_pages() about the new sub-section capabilities of arch_{add,remove}_memory(). Effectively, just replace all usage of align_start, align_end, and align_size with res->start, res->end, and resource_size(res). The existing sanity check will still make sure that the two separate remap attempts do not collide within a sub-section (2MB on x86). Cc: Michal Hocko Cc: Toshi Kani Cc: Jérôme Glisse Cc: Logan Gunthorpe Cc: Oscar Salvador Cc: Pavel Tatashin Signed-off-by: Dan Williams --- kernel/memremap.c | 61 + 1 file changed, 24 insertions(+), 37 deletions(-) diff --git a/kernel/memremap.c b/kernel/memremap.c index 57980ed4e571..a0e5f6b91b04 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -58,7 +58,7 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap) struct vmem_altmap *altmap = >altmap; unsigned long pfn; - pfn = res->start >> PAGE_SHIFT; + pfn = PHYS_PFN(res->start); if (pgmap->altmap_valid) pfn += vmem_altmap_offset(altmap); return pfn; @@ -86,7 +86,6 @@ static void devm_memremap_pages_release(void *data) struct dev_pagemap *pgmap = data; struct device *dev = pgmap->dev; struct resource *res = >res; - resource_size_t align_start, align_size; unsigned long pfn; int nid; @@ -96,25 +95,21 @@ static void devm_memremap_pages_release(void *data) pgmap->cleanup(pgmap->ref); /* pages are dead and unused, undo the arch mapping */ - align_start = res->start & ~(PA_SECTION_SIZE - 1); - align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE) - - align_start; - - nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT)); + nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start))); mem_hotplug_begin(); if (pgmap->type == MEMORY_DEVICE_PRIVATE) { - pfn = align_start >> PAGE_SHIFT; + pfn = PHYS_PFN(res->start); __remove_pages(page_zone(pfn_to_page(pfn)), pfn, - align_size >> PAGE_SHIFT, NULL); + PHYS_PFN(resource_size(res)), NULL); } else { - arch_remove_memory(nid, align_start, align_size, + arch_remove_memory(nid, res->start, 
resource_size(res), pgmap->altmap_valid ? >altmap : NULL); - kasan_remove_zero_shadow(__va(align_start), align_size); + kasan_remove_zero_shadow(__va(res->start), resource_size(res)); } mem_hotplug_done(); - untrack_pfn(NULL, PHYS_PFN(align_start), align_size); + untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res)); pgmap_array_delete(res); dev_WARN_ONCE(dev, pgmap->altmap.alloc, "%s: failed to free all reserved pages\n", __func__); @@ -141,16 +136,13 @@ static void devm_memremap_pages_release(void *data) */ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) { - resource_size_t align_start, align_size, align_end; - struct vmem_altmap *altmap = pgmap->altmap_valid ? - >altmap : NULL; struct resource *res = >res; struct dev_pagemap *conflict_pgmap; struct mhp_restrictions restrictions = { /* * We do not want any optional features only our own memmap */ - .altmap = altmap, + .altmap = pgmap->altmap_valid ? >altmap : NULL, }; pgprot_t pgprot = PAGE_KERNEL; int error, nid, is_ram; @@ -160,12 +152,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) return ERR_PTR(-EINVAL); } - align_start = res->start & ~(PA_SECTION_SIZE - 1); - align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE) - - align_start; - align_end = align_start + align_size - 1; - - conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_start), NULL); + conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL); if (conflict_pgmap) { dev_WARN(dev, "Conflicting mapping in same section\n"); put_dev_pagemap(conflict_pgmap); @@ -173,7 +160,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) goto err_array; } - conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_end), NULL); + conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL); if (conflict_pgmap) { dev_WARN(dev, "Conflicting mapping in same section\n"); put_dev_pagemap(conflict_pgmap); @@ -181,7 +168,7 @@ void *devm_memremap_pages(struct device
[PATCH v9 08/12] mm/sparsemem: Support sub-section hotplug
The libnvdimm sub-system has suffered a series of hacks and broken workarounds for the memory-hotplug implementation's awkward section-aligned (128MB) granularity. For example, the following backtrace is emitted when attempting arch_add_memory() with physical address ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section: WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0 devm_memremap_pages attempted on mixed region [mem 0x2-0x2fbff flags 0x200] [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 devm_memremap_pages+0x3b5/0x4c0 __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap] pmem_attach_disk+0x19a/0x440 [nd_pmem] Recently it was discovered that the problem goes beyond RAM vs PMEM collisions as some platforms produce PMEM vs PMEM collisions within a given section. The libnvdimm workaround for that case revealed that the libnvdimm section-alignment-padding implementation has been broken for a long while. A fix for that long-standing breakage introduces as many problems as it solves, as it would require a backward-incompatible change to the namespace metadata interpretation. Instead of that dubious route [1], address the root problem in the memory-hotplug implementation.
[1]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com Cc: Michal Hocko Cc: Vlastimil Babka Cc: Logan Gunthorpe Cc: Oscar Salvador Cc: Pavel Tatashin Signed-off-by: Dan Williams --- include/linux/memory_hotplug.h |2 mm/memory_hotplug.c|7 - mm/page_alloc.c|2 mm/sparse.c| 225 +++- 4 files changed, 155 insertions(+), 81 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 3ab0282b4fe5..0b8a5e5ef2da 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -350,7 +350,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, extern bool is_memblock_offlined(struct memory_block *mem); extern int sparse_add_section(int nid, unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap); -extern void sparse_remove_one_section(struct mem_section *ms, +extern void sparse_remove_section(struct mem_section *ms, unsigned long pfn, unsigned long nr_pages, unsigned long map_offset, struct vmem_altmap *altmap); extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 399bf78bccc5..8188be7a9edb 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -255,13 +255,10 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat) static int __meminit __add_section(int nid, unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap) { - int ret; - if (pfn_valid(pfn)) return -EEXIST; - ret = sparse_add_section(nid, pfn, nr_pages, altmap); - return ret < 0 ? 
ret : 0; + return sparse_add_section(nid, pfn, nr_pages, altmap); } static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, @@ -541,7 +538,7 @@ static void __remove_section(struct zone *zone, unsigned long pfn, return; __remove_zone(zone, pfn, nr_pages); - sparse_remove_one_section(ms, pfn, nr_pages, map_offset, altmap); + sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap); } /** diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5dff3f49a372..af260cc469cd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5915,7 +5915,7 @@ void __ref memmap_init_zone_device(struct zone *zone, * pfn out of zone. * * Please note that MEMMAP_HOTPLUG path doesn't clear memmap -* because this is done early in sparse_add_one_section +* because this is done early in section_activate() */ if (!(pfn & (pageblock_nr_pages - 1))) { set_pageblock_migratetype(page, MIGRATE_MOVABLE); diff --git a/mm/sparse.c b/mm/sparse.c index f65206deaf49..d83bac5d1324 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -83,8 +83,15 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid) unsigned long root = SECTION_NR_TO_ROOT(section_nr); struct mem_section *section; + /* +* An existing section is possible in the sub-section hotplug +* case. First hot-add instantiates, follow-on hot-add reuses +* the existing section. +* +* The mem_hotplug_lock resolves the apparent race below. +*/ if (mem_section[root]) - return -EEXIST; + return 0; section = sparse_index_al
[PATCH v9 00/12] mm: Sub-section memory hotplug support
subsection-pending' branch of ndctl [5]. [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com [3]: https://github.com/pmem/ndctl/issues/76 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c --- Dan Williams (12): mm/sparsemem: Introduce struct mem_section_usage mm/sparsemem: Add helpers track active portions of a section at boot mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() mm/hotplug: Kill is_dev_zone() usage in __remove_pages() mm: Kill is_dev_zone() helper mm/sparsemem: Prepare for sub-section ranges mm/sparsemem: Support sub-section hotplug mm: Document ZONE_DEVICE memory-model implications mm/devm_memremap_pages: Enable sub-section remap libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields libnvdimm/pfn: Stop padding pmem namespaces to section alignment Documentation/vm/memory-model.rst| 39 arch/powerpc/include/asm/sparsemem.h |3 arch/x86/mm/init_64.c|4 drivers/nvdimm/dax_devs.c|2 drivers/nvdimm/pfn.h | 15 - drivers/nvdimm/pfn_devs.c| 95 +++-- include/linux/memory_hotplug.h |7 - include/linux/mm.h |4 include/linux/mmzone.h | 92 +++-- kernel/memremap.c| 61 ++ mm/memory_hotplug.c | 171 +--- mm/page_alloc.c | 10 + mm/sparse-vmemmap.c | 21 +- mm/sparse.c | 359 +++--- 14 files changed, 534 insertions(+), 349 deletions(-)
[PATCH v9 04/12] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
Allow sub-section sized ranges to be added to the memmap. populate_section_memmap() takes an explict pfn range rather than assuming a full section, and those parameters are plumbed all the way through to vmmemap_populate(). There should be no sub-section usage in current deployments. New warnings are added to clarify which memmap allocation paths are sub-section capable. Cc: Michal Hocko Cc: David Hildenbrand Cc: Logan Gunthorpe Cc: Oscar Salvador Reviewed-by: Pavel Tatashin Signed-off-by: Dan Williams --- arch/x86/mm/init_64.c |4 ++- include/linux/mm.h|4 ++- mm/sparse-vmemmap.c | 21 +++-- mm/sparse.c | 61 +++-- 4 files changed, 57 insertions(+), 33 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 8335ac6e1112..688fb0687e55 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1520,7 +1520,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, { int err; - if (boot_cpu_has(X86_FEATURE_PSE)) + if (end - start < PAGES_PER_SECTION * sizeof(struct page)) + err = vmemmap_populate_basepages(start, end, node); + else if (boot_cpu_has(X86_FEATURE_PSE)) err = vmemmap_populate_hugepages(start, end, node, altmap); else if (altmap) { pr_err_once("%s: no cpu support for altmap allocations\n", diff --git a/include/linux/mm.h b/include/linux/mm.h index acc578407e9e..c502f3ce8661 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2734,8 +2734,8 @@ const char * arch_vma_name(struct vm_area_struct *vma); void print_vma_addr(char *prefix, unsigned long rip); void *sparse_buffer_alloc(unsigned long size); -struct page *sparse_mem_map_populate(unsigned long pnum, int nid, - struct vmem_altmap *altmap); +struct page * __populate_section_memmap(unsigned long pfn, + unsigned long nr_pages, int nid, struct vmem_altmap *altmap); pgd_t *vmemmap_pgd_populate(unsigned long addr, int node); p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node); pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long 
addr, int node); diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 7fec05796796..200aef686722 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -245,19 +245,26 @@ int __meminit vmemmap_populate_basepages(unsigned long start, return 0; } -struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid, - struct vmem_altmap *altmap) +struct page * __meminit __populate_section_memmap(unsigned long pfn, + unsigned long nr_pages, int nid, struct vmem_altmap *altmap) { unsigned long start; unsigned long end; - struct page *map; - map = pfn_to_page(pnum * PAGES_PER_SECTION); - start = (unsigned long)map; - end = (unsigned long)(map + PAGES_PER_SECTION); + /* +* The minimum granularity of memmap extensions is +* PAGES_PER_SUBSECTION as allocations are tracked in the +* 'subsection_map' bitmap of the section. +*/ + end = ALIGN(pfn + nr_pages, PAGES_PER_SUBSECTION); + pfn &= PAGE_SUBSECTION_MASK; + nr_pages = end - pfn; + + start = (unsigned long) pfn_to_page(pfn); + end = start + nr_pages * sizeof(struct page); if (vmemmap_populate(start, end, nid, altmap)) return NULL; - return map; + return pfn_to_page(pfn); } diff --git a/mm/sparse.c b/mm/sparse.c index 0baa2e55cfdd..2093c662a5f7 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -439,8 +439,8 @@ static unsigned long __init section_map_size(void) return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION); } -struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid, - struct vmem_altmap *altmap) +struct page __init *__populate_section_memmap(unsigned long pfn, + unsigned long nr_pages, int nid, struct vmem_altmap *altmap) { unsigned long size = section_map_size(); struct page *map = sparse_buffer_alloc(size); @@ -521,10 +521,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin, } sparse_buffer_init(map_count * section_map_size(), nid); for_each_present_section_nr(pnum_begin, pnum) { + unsigned long pfn = section_nr_to_pfn(pnum); + if (pnum >= pnum_end) 
break; - map = sparse_mem_map_populate(pnum, nid, NULL); + map = __populate_section_memmap(pfn, PAGES_PER_SECTION, + nid, NULL); if (!map) { pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.", __func__, nid); @@ -624,17 +627,17 @@ void offline_mem_sections(unsi
[PATCH v9 01/12] mm/sparsemem: Introduce struct mem_section_usage
Towards enabling memory hotplug to track partial population of a section, introduce 'struct mem_section_usage'. A pointer to a 'struct mem_section_usage' instance replaces the existing pointer to a 'pageblock_flags' bitmap. Effectively it adds one more 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house a new 'subsection_map' bitmap. The new bitmap enables the memory hot{plug,remove} implementation to act on incremental sub-divisions of a section. The default SUBSECTION_SHIFT is chosen to keep the 'subsection_map' no larger than a single 'unsigned long' on the major architectures. Alternatively, an architecture can define ARCH_SUBSECTION_SHIFT to override the default PMD_SHIFT. Note that PowerPC needs to use ARCH_SUBSECTION_SHIFT to work around PMD_SHIFT being a non-constant expression on that architecture. The primary motivation for this functionality is to support platforms that mix "System RAM" and "Persistent Memory" within a single section, or multiple PMEM ranges with different mapping lifetimes within a single section. The section restriction for hotplug has caused an ongoing saga of hacks and bugs for devm_memremap_pages() users. Beyond the fixups to teach existing paths how to retrieve the 'usemap' from a section, and updates to the usemap allocation path, there are no expected behavior changes.
Cc: Michal Hocko Cc: Vlastimil Babka Cc: Logan Gunthorpe Cc: Oscar Salvador Cc: Pavel Tatashin Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Signed-off-by: Dan Williams --- arch/powerpc/include/asm/sparsemem.h |3 + include/linux/mmzone.h | 48 +++- mm/memory_hotplug.c | 18 mm/page_alloc.c |2 - mm/sparse.c | 81 +- 5 files changed, 99 insertions(+), 53 deletions(-) diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h index 3192d454a733..1aa3c9303bf8 100644 --- a/arch/powerpc/include/asm/sparsemem.h +++ b/arch/powerpc/include/asm/sparsemem.h @@ -10,6 +10,9 @@ */ #define SECTION_SIZE_BITS 24 +/* Reflect the largest possible PMD-size as the subsection-size constant */ +#define ARCH_SUBSECTION_SHIFT 24 + #endif /* CONFIG_SPARSEMEM */ #ifdef CONFIG_MEMORY_HOTPLUG diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 427b79c39b3c..ac163f2f274f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1161,6 +1161,44 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec) #define SECTION_ALIGN_UP(pfn) (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK) #define SECTION_ALIGN_DOWN(pfn)((pfn) & PAGE_SECTION_MASK) +/* + * SUBSECTION_SHIFT must be constant since it is used to declare + * subsection_map and related bitmaps without triggering the generation + * of variable-length arrays. The most natural size for a subsection is + * a PMD-page. For architectures that do not have a constant PMD-size + * ARCH_SUBSECTION_SHIFT can be set to a constant max size, or otherwise + * fallback to 2MB. + */ +#if defined(ARCH_SUBSECTION_SHIFT) +#define SUBSECTION_SHIFT (ARCH_SUBSECTION_SHIFT) +#elif defined(PMD_SHIFT) +#define SUBSECTION_SHIFT (PMD_SHIFT) +#else +/* + * Memory hotplug enabled platforms avoid this default because they + * either define ARCH_SUBSECTION_SHIFT, or PMD_SHIFT is a constant, but + * this is kept as a backstop to allow compilation on + * !ARCH_ENABLE_MEMORY_HOTPLUG archs. 
+ */ +#define SUBSECTION_SHIFT 21 +#endif + +#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT) +#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT) +#define PAGE_SUBSECTION_MASK ((~(PAGES_PER_SUBSECTION-1))) + +#if SUBSECTION_SHIFT > SECTION_SIZE_BITS +#error Subsection size exceeds section size +#else +#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT)) +#endif + +struct mem_section_usage { + DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION); + /* See declaration of similar field in struct zone */ + unsigned long pageblock_flags[0]; +}; + struct page; struct page_ext; struct mem_section { @@ -1178,8 +1216,7 @@ struct mem_section { */ unsigned long section_mem_map; - /* See declaration of similar field in struct zone */ - unsigned long *pageblock_flags; + struct mem_section_usage *usage; #ifdef CONFIG_PAGE_EXTENSION /* * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use @@ -1210,6 +1247,11 @@ extern struct mem_section **mem_section; extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif +static inline unsigned long *section_to_usemap(struct mem_section *ms) +{ + return ms->usage->pageblock_flags; +} + static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME @@ -1
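The subsection constants introduced by the patch above are plain shift arithmetic and can be checked with a small stand-alone model. This is a sketch, not kernel code; it assumes the common x86-64 values (4K pages, 2MB PMDs, 128MB sections) — other architectures differ:

```c
/*
 * Stand-alone model of the subsection constants. The macro names
 * mirror the kernel's, but the values assume x86-64 defaults:
 * PAGE_SHIFT=12 (4K pages), SUBSECTION_SHIFT=PMD_SHIFT=21 (2MB),
 * SECTION_SIZE_BITS=27 (128MB sections).
 */
#define PAGE_SHIFT        12
#define SUBSECTION_SHIFT  21
#define SECTION_SIZE_BITS 27

#define PFN_SUBSECTION_SHIFT    (SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION    (1UL << PFN_SUBSECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
```

With these values SUBSECTIONS_PER_SECTION is 64, so the whole 'subsection_map' fits in a single 'unsigned long' on 64-bit — exactly the size constraint the changelog describes.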
[PATCH v9 02/12] mm/sparsemem: Add helpers track active portions of a section at boot
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a sub-section active bitmask, each bit representing a PMD_SIZE span of the architecture's memory hotplug section size. The implication of a partially populated section is that pfn_valid() needs to go beyond a valid_section() check and read the sub-section active ranges from the bitmask. The expectation is that the bitmask (subsection_map) fits in the same cacheline as the valid_section() data, so the incremental performance overhead to pfn_valid() should be negligible. Cc: Michal Hocko Cc: Vlastimil Babka Cc: Logan Gunthorpe Cc: Oscar Salvador Cc: Pavel Tatashin Tested-by: Jane Chu Signed-off-by: Dan Williams --- include/linux/mmzone.h | 29 - mm/page_alloc.c | 4 +++- mm/sparse.c | 35 +++ 3 files changed, 66 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ac163f2f274f..6dd52d544857 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1199,6 +1199,8 @@ struct mem_section_usage { unsigned long pageblock_flags[0]; }; +void subsection_map_init(unsigned long pfn, unsigned long nr_pages); + struct page; struct page_ext; struct mem_section { @@ -1336,12 +1338,36 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn) extern int __highest_present_section_nr; +static inline int subsection_map_index(unsigned long pfn) +{ + return (pfn & ~(PAGE_SECTION_MASK)) / PAGES_PER_SUBSECTION; +} + +#ifdef CONFIG_SPARSEMEM_VMEMMAP +static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) +{ + int idx = subsection_map_index(pfn); + + return test_bit(idx, ms->usage->subsection_map); +} +#else +static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) +{ + return 1; +} +#endif + #ifndef CONFIG_HAVE_ARCH_PFN_VALID static inline int pfn_valid(unsigned long pfn) { + struct mem_section *ms; + if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) return 0; - return 
valid_section(__nr_to_section(pfn_to_section_nr(pfn))); + ms = __nr_to_section(pfn_to_section_nr(pfn)); + if (!valid_section(ms)) + return 0; + return pfn_section_valid(ms, pfn); } #endif @@ -1373,6 +1399,7 @@ void sparse_init(void); #define sparse_init() do {} while (0) #define sparse_index_init(_sec, _nid) do {} while (0) #define pfn_present pfn_valid +#define subsection_map_init(_pfn, _nr_pages) do {} while (0) #endif /* CONFIG_SPARSEMEM */ /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c6d8224d792e..bd773efe5b82 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7292,10 +7292,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn) /* Print out the early node map */ pr_info("Early memory node ranges\n"); - for_each_mem_pfn_range(i, MAX_NUMNODES, _pfn, _pfn, ) + for_each_mem_pfn_range(i, MAX_NUMNODES, _pfn, _pfn, ) { pr_info(" node %3d: [mem %#018Lx-%#018Lx]\n", nid, (u64)start_pfn << PAGE_SHIFT, ((u64)end_pfn << PAGE_SHIFT) - 1); + subsection_map_init(start_pfn, end_pfn - start_pfn); + } /* Initialise every node */ mminit_verify_pageflags_layout(); diff --git a/mm/sparse.c b/mm/sparse.c index 71da15cc7432..0baa2e55cfdd 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -210,6 +210,41 @@ static inline unsigned long first_present_section_nr(void) return next_present_section_nr(-1); } +void subsection_mask_set(unsigned long *map, unsigned long pfn, + unsigned long nr_pages) +{ + int idx = subsection_map_index(pfn); + int end = subsection_map_index(pfn + nr_pages - 1); + + bitmap_set(map, idx, end - idx + 1); +} + +void subsection_map_init(unsigned long pfn, unsigned long nr_pages) +{ + int end_sec = pfn_to_section_nr(pfn + nr_pages - 1); + int i, start_sec = pfn_to_section_nr(pfn); + + if (!nr_pages) + return; + + for (i = start_sec; i <= end_sec; i++) { + struct mem_section *ms; + unsigned long pfns; + + pfns = min(nr_pages, PAGES_PER_SECTION + - (pfn & ~PAGE_SECTION_MASK)); + ms = __nr_to_section(i); + 
subsection_mask_set(ms->usage->subsection_map, pfn, pfns); + + pr_debug("%s: sec: %d pfns: %ld set(%d, %d)\n", __func__, i, + pfns, subsection_map_index(pfn), + subsection_map_index(pfn + pfns - 1)); + + pfn += pfns; + nr_pages -= pfns; + } +} + /* Record a memory area against a node. */ void __init memory_present(int nid, unsigned long start, unsigned long end) {
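The subsection_map_index()/subsection_mask_set() pair above is simple enough to model in user space. A hypothetical single-word sketch (assuming 128MB sections of 4K pages, so 64 subsections of 512 pages each; not kernel code — the kernel uses DECLARE_BITMAP() and bitmap_set()):

```c
#include <stdint.h>

#define PAGES_PER_SECTION    32768UL /* 128MB of 4K pages */
#define PAGES_PER_SUBSECTION 512UL   /* 2MB of 4K pages */

/* Index of the subsection containing pfn, relative to its section. */
static int subsection_map_index(unsigned long pfn)
{
	return (pfn & (PAGES_PER_SECTION - 1)) / PAGES_PER_SUBSECTION;
}

/* Single-word stand-in for bitmap_set() over [pfn, pfn + nr_pages). */
static void subsection_mask_set(uint64_t *map, unsigned long pfn,
				unsigned long nr_pages)
{
	int idx = subsection_map_index(pfn);
	int end = subsection_map_index(pfn + nr_pages - 1);

	for (int i = idx; i <= end; i++)
		*map |= 1ULL << i;
}

/* Helper: the mask produced for a single range, starting from empty. */
static uint64_t demo_mask(unsigned long pfn, unsigned long nr_pages)
{
	uint64_t map = 0;

	subsection_mask_set(&map, pfn, nr_pages);
	return map;
}
```

For example, marking 1000 pages starting at pfn 100 touches subsections 0 through 2 (mask 0x7) — the same (idx, end) range the pr_debug() in subsection_map_init() would report.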
[PATCH v9 03/12] mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
Sub-section hotplug support reduces the unit of operation of hotplug from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units (PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not valid_section(), can toggle. Cc: Michal Hocko Cc: Vlastimil Babka Cc: Logan Gunthorpe Reviewed-by: Pavel Tatashin Reviewed-by: Oscar Salvador Signed-off-by: Dan Williams --- mm/memory_hotplug.c | 29 - 1 file changed, 8 insertions(+), 21 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 7b963c2d3a0d..647859a1d119 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -318,12 +318,8 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, unsigned long end_pfn) { - struct mem_section *ms; - - for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) { - ms = __pfn_to_section(start_pfn); - - if (unlikely(!valid_section(ms))) + for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) { + if (unlikely(!pfn_valid(start_pfn))) continue; if (unlikely(pfn_to_nid(start_pfn) != nid)) @@ -343,15 +339,12 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, unsigned long end_pfn) { - struct mem_section *ms; unsigned long pfn; /* pfn is the end pfn of a memory section. 
*/ pfn = end_pfn - 1; - for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) { - ms = __pfn_to_section(pfn); - - if (unlikely(!valid_section(ms))) + for (; pfn >= start_pfn; pfn -= PAGES_PER_SUBSECTION) { + if (unlikely(!pfn_valid(pfn))) continue; if (unlikely(pfn_to_nid(pfn) != nid)) @@ -373,7 +366,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */ unsigned long zone_end_pfn = z; unsigned long pfn; - struct mem_section *ms; int nid = zone_to_nid(zone); zone_span_writelock(zone); @@ -410,10 +402,8 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, * it check the zone has only hole or not. */ pfn = zone_start_pfn; - for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) { - ms = __pfn_to_section(pfn); - - if (unlikely(!valid_section(ms))) + for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) { + if (unlikely(!pfn_valid(pfn))) continue; if (page_zone(pfn_to_page(pfn)) != zone) @@ -441,7 +431,6 @@ static void shrink_pgdat_span(struct pglist_data *pgdat, unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */ unsigned long pgdat_end_pfn = p; unsigned long pfn; - struct mem_section *ms; int nid = pgdat->node_id; if (pgdat_start_pfn == start_pfn) { @@ -478,10 +467,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat, * has only hole or not. */ pfn = pgdat_start_pfn; - for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) { - ms = __pfn_to_section(pfn); - - if (unlikely(!valid_section(ms))) + for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) { + if (unlikely(!pfn_valid(pfn))) continue; if (pfn_to_nid(pfn) != nid)
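The effect of this patch is that the hole scan advances one subsection at a time and asks pfn_valid() instead of valid_section(). A toy single-section model of the new find_smallest_section_pfn() loop shape (hypothetical; assumes 64 subsections of 512 pages and subsection-aligned start, and omits the node/zone checks the kernel performs):

```c
#include <stdint.h>

#define PAGES_PER_SUBSECTION 512UL

/*
 * 'map' is one section's subsection_map. Return the first pfn in
 * [start_pfn, end_pfn) whose subsection bit is set, or end_pfn if
 * the whole range is a hole. The kernel loop is identical in shape
 * but tests pfn_valid() and also checks pfn_to_nid()/page_zone().
 */
static unsigned long first_valid_pfn(uint64_t map, unsigned long start_pfn,
				     unsigned long end_pfn)
{
	for (unsigned long pfn = start_pfn; pfn < end_pfn;
	     pfn += PAGES_PER_SUBSECTION) {
		if (map & (1ULL << (pfn / PAGES_PER_SUBSECTION)))
			return pfn;
	}
	return end_pfn;
}
```

A map with only bit 3 set yields pfn 1536 (3 * 512), i.e. the scan now resolves holes at 2MB rather than 128MB granularity.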
Re: [PATCH v3] mm/swap: Fix release_pages() when releasing devmap pages
On Tue, Jun 4, 2019 at 1:17 PM John Hubbard wrote: > > On 6/4/19 1:11 PM, Dan Williams wrote: > > On Tue, Jun 4, 2019 at 12:48 PM John Hubbard wrote: > >> > >> On 6/4/19 9:48 AM, ira.we...@intel.com wrote: > >>> From: Ira Weiny > >>> > ... > >>> diff --git a/mm/swap.c b/mm/swap.c > >>> index 7ede3eddc12a..6d153ce4cb8c 100644 > >>> --- a/mm/swap.c > >>> +++ b/mm/swap.c > >>> @@ -740,15 +740,20 @@ void release_pages(struct page **pages, int nr) > >>> if (is_huge_zero_page(page)) > >>> continue; > >>> > >>> - /* Device public page can not be huge page */ > >>> - if (is_device_public_page(page)) { > >>> + if (is_zone_device_page(page)) { > >>> if (locked_pgdat) { > >>> > >>> spin_unlock_irqrestore(_pgdat->lru_lock, > >>> flags); > >>> locked_pgdat = NULL; > >>> } > >>> - put_devmap_managed_page(page); > >>> - continue; > >>> + /* > >>> + * Not all zone-device-pages require special > >>> + * processing. Those pages return 'false' from > >>> + * put_devmap_managed_page() expecting a call to > >>> + * put_page_testzero() > >>> + */ > >> > >> Just a documentation tweak: how about: > >> > >> /* > >> * ZONE_DEVICE pages that return 'false' from > >> * put_devmap_managed_page() do not require special > >> * processing, and instead, expect a call to > >> * put_page_testzero(). > >> */ > > > > Looks better to me, but maybe just go ahead and list those > > expectations explicitly. Something like: > > > > /* > > * put_devmap_managed_page() only handles > > * ZONE_DEVICE (struct dev_pagemap managed) > > * pages when the hosting dev_pagemap has the > > * ->free() or ->fault() callback handlers > > * implemented as indicated by > > * dev_pagemap.type. Otherwise the expectation > > * is to fall back to a plain decrement / > > * put_page_testzero(). > > */ > > I like it--but not here, because it's too much internal detail in a > call site that doesn't use that level of detail. The call site looks > at the return value, only. 
> > Let's instead put that blurb above (or in) the put_devmap_managed_page() > routine itself. And leave the blurb that I wrote where it is. And then I > think everything will have an appropriate level of detail in the right places. Ok. Ideally there wouldn't be any commentary needed at the call site and the put_page() could be handled internal to put_devmap_managed_page(), but I did not see a way to do that without breaking the compile out / static branch optimization when there are no active ZONE_DEVICE users.
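The calling convention being documented — put_devmap_managed_page() returns true when it consumed the reference itself, false when the caller must fall back to a plain put_page_testzero()-style decrement — can be modeled with a toy refcount. All names here are hypothetical stand-ins for the kernel types, and 'devmap_managed' abbreviates "the hosting dev_pagemap implements ->free()/->fault()":

```c
#include <stdbool.h>

struct toy_page {
	int  refcount;
	bool devmap_managed; /* dev_pagemap has ->free()/->fault() handlers */
};

/*
 * Returns true when the special ZONE_DEVICE path did its own put;
 * false tells the caller to fall back to a plain decrement.
 */
static bool put_devmap_managed_page(struct toy_page *p)
{
	if (!p->devmap_managed)
		return false;
	p->refcount--; /* the devmap path handles its own put */
	return true;
}

/* Mirrors the release_pages() call site under discussion. */
static int release_one(struct toy_page *p)
{
	if (put_devmap_managed_page(p))
		return 1; /* handled by the devmap path */
	--p->refcount; /* put_page_testzero() fallback */
	return 0;
}

/* Helper: which path released a single-reference page? */
static int which_path(bool devmap)
{
	struct toy_page p = { .refcount = 1, .devmap_managed = devmap };

	return release_one(&p);
}
```

Either way exactly one reference is dropped; the return value only tells the call site whether the special processing already happened — which is why the comment placement debate above matters.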
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Tue, 2019-06-04 at 22:04 +0200, Arnd Bergmann wrote: > On Tue, Jun 4, 2019 at 5:18 PM Dan Williams wrote: > > On Tue, 2019-06-04 at 10:13 +0200, Arnd Bergmann wrote: > > > Can you describe what kind of multiplexing is actually going on? > > > I'm still unclear about what we actually use multiple logical > > > interfaces for here, and how they relate to one another. > > > > Each logical interface represents a different "connection" (PDP/EPS > > context) to the provider network with a distinct IP address and > > QoS. > > VLANs may be a suitable analogy but here they are L3+QoS. > > > > In realistic example the main interface (say rmnet0) would be used > > for > > web browsing and have best-effort QoS. A second interface (say > > rmnet1) > > would be used for VOIP and have certain QoS guarantees from both > > the > > modem and the network itself. > > > > QMAP can also aggregate frames for a given channel > > (connection/EPS/PDP > > context/rmnet interface/etc) to better support LTE speeds. > > Thanks, that's a very helpful explanation! > > Is it correct to say then that the concept of having those separate > connections would be required for any proper LTE modem > implementation, > but the QMAP protocol (and based on that, the rmnet implementation) > is Qualcomm specific and shared only among several generations of > modems from that one vendor? Exactly correct. This is what Johannes is discussing in his "cellular modem APIs - take 2" thread about how this should all be organized at the driver level and I think we should figure that out before we commit to IPA-with-a-useless-netdev that requires rmnets to be created on top. That may end up being the solution but let's have that discussion. 
> > You mentioned the need to have a common user space interface > for configuration, and if the above is true, I agree that we should > try > to achieve that, either by ensuring rmnet is generic enough to > cover other vendors (and non-QMAP clients), or by creating a > new user level interface that IPA/rmnet can be adapted to. I would not suggest making rmnet generic; it's pretty QMAP specific (but QMAP is spoken by many, many modems: SoC, USB stick, and PCIe minicard). Instead, I think what Johannes is discussing is a better approach. A kernel WWAN framework with a consistent user API that rmnet/IPA/qmi_wwan/MBIM/QMI/serial/Sierra can all implement. That wouldn't affect the core packet processing of IPA/rmnet but would instead determine: 1) when/how an rmnet device actually gets created on top of the IPA (or qmi_wwan) device, AND (one of these two) a) whether IPA creates a netdev on probe, OR b) whether there is some "WWAN device" kernel object which userspace interacts with to create rmnet channels on top of IPA Dan
Re: [PATCH v3] mm/swap: Fix release_pages() when releasing devmap pages
On Tue, Jun 4, 2019 at 12:48 PM John Hubbard wrote: > > On 6/4/19 9:48 AM, ira.we...@intel.com wrote: > > From: Ira Weiny > > > > release_pages() is an optimized version of a loop around put_page(). > > Unfortunately for devmap pages the logic is not entirely correct in > > release_pages(). This is because device pages can be more than type > > MEMORY_DEVICE_PUBLIC. There are in fact 4 types, private, public, FS > > DAX, and PCI P2PDMA. Some of these have specific needs to "put" the > > page while others do not. > > > > This logic to handle any special needs is contained in > > put_devmap_managed_page(). Therefore all devmap pages should be > > processed by this function where we can contain the correct logic for a > > page put. > > > > Handle all device type pages within release_pages() by calling > > put_devmap_managed_page() on all devmap pages. If > > put_devmap_managed_page() returns true the page has been put and we > > continue with the next page. A false return of > > put_devmap_managed_page() means the page did not require special > > processing and should fall to "normal" processing. > > > > This was found via code inspection while determining if release_pages() > > and the new put_user_pages() could be interchangeable.[1] > > > > [1] > > https://lore.kernel.org/lkml/20190523172852.ga27...@iweiny-desk2.sc.intel.com/ > > > > Cc: Jérôme Glisse > > Cc: Michal Hocko > > Reviewed-by: Dan Williams > > Reviewed-by: John Hubbard > > Signed-off-by: Ira Weiny > > > > --- > > Changes from V2: > > Update changelog for more clarity as requested by Michal > > Update comment WRT "failing" of put_devmap_managed_page() > > > > Changes from V1: > > Add comment clarifying that put_devmap_managed_page() can still > > fail. > > Add Reviewed-by tags. 
> > > > mm/swap.c | 13 + > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/mm/swap.c b/mm/swap.c > > index 7ede3eddc12a..6d153ce4cb8c 100644 > > --- a/mm/swap.c > > +++ b/mm/swap.c > > @@ -740,15 +740,20 @@ void release_pages(struct page **pages, int nr) > > if (is_huge_zero_page(page)) > > continue; > > > > - /* Device public page can not be huge page */ > > - if (is_device_public_page(page)) { > > + if (is_zone_device_page(page)) { > > if (locked_pgdat) { > > > > spin_unlock_irqrestore(_pgdat->lru_lock, > > flags); > > locked_pgdat = NULL; > > } > > - put_devmap_managed_page(page); > > - continue; > > + /* > > + * Not all zone-device-pages require special > > + * processing. Those pages return 'false' from > > + * put_devmap_managed_page() expecting a call to > > + * put_page_testzero() > > + */ > > Just a documentation tweak: how about: > > /* > * ZONE_DEVICE pages that return 'false' from > * put_devmap_managed_page() do not require special > * processing, and instead, expect a call to > * put_page_testzero(). > */ Looks better to me, but maybe just go ahead and list those expectations explicitly. Something like: /* * put_devmap_managed_page() only handles * ZONE_DEVICE (struct dev_pagemap managed) * pages when the hosting dev_pagemap has the * ->free() or ->fault() callback handlers * implemented as indicated by * dev_pagemap.type. Otherwise the expectation * is to fall back to a plain decrement / * put_page_testzero(). */
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Mon, 2019-06-03 at 13:04 -0600, Subash Abhinov Kasiviswanathan wrote: > > > I can't (or won't) comment right now on whether IPA needs its own > > > netdev for rmnet to use. The IPA endpoints used for the modem > > > network interfaces are enabled when the netdev is opened and > > > disabled when closed. Outside of that, TX and RX are pretty > > > much immediately passed through to the layer below or above. > > > IPA currently has no other net device operations. > > > > I don't really have issues with the patchset underneath the netdev > > layer. I'm interested in how the various bits present themselves to > > userspace, which is why I am trying to tie this in with Johannes' > > conversation about WWAN devices, netdevs, channels, and how the > > various > > drivers present API for creating data channels that map to > > different > > modem contexts. > > > > So let me rephrase. If the control plane has set up the default > > context > > and sent a QMI Start Network message (or the network attached the > > default one) and the resulting IP details are applied to the IPA > > netdev > > can things just start sending data? Or do we need to create an > > rmnet on > > top to get that working? > > > > Dan > > Hi Dan > > All data from the hardware will have the MAP headers. > We still need to create rmnet devs over the IPA netdev and use it > for > the > data path. > The IPA netdev will pass on the packets which it receives from the > hardware > and queue it to network stack where it will be intercepted by the > rmnet rx handler. Ok, so IPA only needs a netdev so that rmnet has something to send/receive packets to/from? This gets even closer to the discussion in "cellular modem driver APIs - take 2" from last week. Dan
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Tue, 2019-06-04 at 10:13 +0200, Arnd Bergmann wrote: > On Mon, Jun 3, 2019 at 3:32 PM Alex Elder wrote: > > On 6/3/19 5:04 AM, Arnd Bergmann wrote: > > > On Sat, Jun 1, 2019 at 1:59 AM Subash Abhinov Kasiviswanathan > > > > > > - What I'm worried about most here is the flow control handling > > > on the > > > transmit side. The IPA driver now uses the modern BQL method to > > > control how much data gets submitted to the hardware at any > > > time. > > > The rmnet driver also uses flow control using the > > > rmnet_map_command() function, that blocks tx on the higher > > > level device when the remote side asks us to. > > > I fear that doing flow control for a single physical device on > > > two > > > separate netdev instances is counterproductive and confuses > > > both sides. > > > > I understand what you're saying here, and instinctively I think > > you're right. > > > > But BQL manages the *local* interface's ability to get rid of > > packets, whereas the QMAP flow control is initiated by the other > > end of the connection (the modem in this case). > > > > With multiplexing, it's possible that one of several logical > > devices on the modem side has exhausted a resource and must > > ask the source of the data on the host side to suspend the > > flow. Meanwhile the other logical devices sharing the physical > > link might be fine, and should not be delayed by the first one. > > > > It is the multiplexing itself that confuses the BQL algorithm. > > The abstraction obscures the *real* rates at which individual > > logical connections are able to transmit data. > > I would assume that the real rate constantly changes, at least > for wireless interfaces that are also shared with other users > on the same network. BQL is meant to deal with that, at least > when using a modern queuing algorithm. 
> > > Even if the multiple logical interfaces implemented BQL, they > > would not get the feedback they need directly from the IPA > > driver, because transmitting over the physical interface might > > succeed even if the logical interface on the modem side can't > > handle more data. So I think the flow control commands may be > > necessary, given multiplexing. > > Can you describe what kind of multiplexing is actually going on? > I'm still unclear about what we actually use multiple logical > interfaces for here, and how they relate to one another. Each logical interface represents a different "connection" (PDP/EPS context) to the provider network with a distinct IP address and QoS. VLANs may be a suitable analogy but here they are L3+QoS. In realistic example the main interface (say rmnet0) would be used for web browsing and have best-effort QoS. A second interface (say rmnet1) would be used for VOIP and have certain QoS guarantees from both the modem and the network itself. QMAP can also aggregate frames for a given channel (connection/EPS/PDP context/rmnet interface/etc) to better support LTE speeds. Dan > > The rmnet driver could use BQL, and could return NETDEV_TX_BUSY > > for a logical interface when its TX flow has been stopped by a > > QMAP command. That way the feedback for BQL on the logical > > interfaces would be provided in the right place. > > > > I have no good intuition about the interaction between > > two layered BQL managed queues though. > > Returning NETDEV_TX_BUSY is usually a bad idea as that > leads to unnecessary frame drop. > > I do think that using BQL and the QMAP flow command on > the /same/ device would be best, as that throttles the connection > when either of the two algorithms wants us to slow down. > > The question is mainly which of the two devices that should be. 
> Doing it in the ipa driver is probably easier to implement here, > but ideally I think we'd only have a single queue visible to the > network stack, if we can come up with a way to do that. > > > > - I was a little confused by the location of the rmnet driver in > > > drivers/net/ethernet/... More conventionally, I think as a > > > protocol > > > handler it should go into net/qmap/, with the ipa driver going > > > into drivers/net/qmap/ipa/, similar to what we have fo > > > ethernet, > > > wireless, ppp, appletalk, etc. > > > > > > - The rx_handler uses gro_cells, which as I understand is meant > > > for generic tunnelling setups and takes another loop through > > > NAPI to aggregate data from multiple queues, but in case of > > > IPA's single-queue receive calling gro directly would be > > > simpler > > > and more efficient. > > > > I have been planning to investigate some of the generic GRO > > stuff for IPA but was going to wait on that until the basic > > code was upstream. > > That's ok, that part can easily be changed after the fact, as it > does not impact the user interface or the general design. > > > > From the overall design and the rmnet Kconfig description, it > > > appears as though the intention as that rmnet could be a > > > generic
Re: [PATCH v8 09/12] mm/sparsemem: Support sub-section hotplug
On Mon, May 13, 2019 at 6:55 AM Oscar Salvador wrote: > > On Mon, May 06, 2019 at 04:40:14PM -0700, Dan Williams wrote: > > > > +void subsection_mask_set(unsigned long *map, unsigned long pfn, > > + unsigned long nr_pages) > > +{ > > + int idx = subsection_map_index(pfn); > > + int end = subsection_map_index(pfn + nr_pages - 1); > > + > > + bitmap_set(map, idx, end - idx + 1); > > +} > > + > > void subsection_map_init(unsigned long pfn, unsigned long nr_pages) > > { > > int end_sec = pfn_to_section_nr(pfn + nr_pages - 1); > > @@ -219,20 +235,17 @@ void subsection_map_init(unsigned long pfn, unsigned > > long nr_pages) > > return; > > > > for (i = start_sec; i <= end_sec; i++) { > > - int idx, end; > > - unsigned long pfns; > > struct mem_section *ms; > > + unsigned long pfns; > > > > - idx = subsection_map_index(pfn); > > pfns = min(nr_pages, PAGES_PER_SECTION > > - (pfn & ~PAGE_SECTION_MASK)); > > - end = subsection_map_index(pfn + pfns - 1); > > - > > ms = __nr_to_section(i); > > - bitmap_set(ms->usage->subsection_map, idx, end - idx + 1); > > + subsection_mask_set(ms->usage->subsection_map, pfn, pfns); > > > > pr_debug("%s: sec: %d pfns: %ld set(%d, %d)\n", __func__, i, > > - pfns, idx, end - idx + 1); > > + pfns, subsection_map_index(pfn), > > + subsection_map_index(pfn + pfns - 1)); > > I would definetely add subsection_mask_set() and above change to Patch#3. > I think it suits there better than here. Yes, done. > > > > > pfn += pfns; > > nr_pages -= pfns; > > @@ -319,6 +332,15 @@ static void __meminit sparse_init_one_section(struct > > mem_section *ms, > > unsigned long pnum, struct page *mem_map, > > struct mem_section_usage *usage) > > { > > + /* > > + * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug, > > + * ->section_mem_map can not be guaranteed to point to a full > > + * section's worth of memory. The field is only valid / used > > + * in the SPARSEMEM_VMEMMAP=n case. 
> > + */ > > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) > > + mem_map = NULL; > > + > > ms->section_mem_map &= ~SECTION_MAP_MASK; > > ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) | > > SECTION_HAS_MEM_MAP; > > @@ -724,10 +746,142 @@ static void free_map_bootmem(struct page *memmap) > > #endif /* CONFIG_MEMORY_HOTREMOVE */ > > #endif /* CONFIG_SPARSEMEM_VMEMMAP */ > > > > +#ifndef CONFIG_MEMORY_HOTREMOVE > > +static void free_map_bootmem(struct page *memmap) > > +{ > > +} > > +#endif > > + > > +static bool is_early_section(struct mem_section *ms) > > +{ > > + struct page *usage_page; > > + > > + usage_page = virt_to_page(ms->usage); > > + if (PageSlab(usage_page) || PageCompound(usage_page)) > > + return false; > > + else > > + return true; > > +} > > + > > +static void section_deactivate(unsigned long pfn, unsigned long nr_pages, > > + int nid, struct vmem_altmap *altmap) > > +{ > > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; > > + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; > > + struct mem_section *ms = __pfn_to_section(pfn); > > + bool early_section = is_early_section(ms); > > + struct page *memmap = NULL; > > + unsigned long *subsection_map = ms->usage > > + ? >usage->subsection_map[0] : NULL; > > + > > + subsection_mask_set(map, pfn, nr_pages); > > + if (subsection_map) > > + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION); > > + > > + if (WARN(!subsection_map || !bitmap_equal(tmp, map, > > SUBSECTIONS_PER_SECTION), > > + "section already deactivated (%#lx + %ld)\n", > > + pfn, nr_pages)) > > + return; > > + > > + if (WARN(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) > > + && nr_pages < PAGES_PER_SECTI
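The sanity check at the top of section_deactivate() quoted above boils down to "every subsection being removed must currently be active". A single-word model of the bitmap_and()/bitmap_equal() test (a sketch with hypothetical names; the kernel version operates on DECLARE_BITMAP() arrays of SUBSECTIONS_PER_SECTION bits):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * 'active' models ms->usage->subsection_map; 'removing' models the
 * mask built by subsection_mask_set() for the range being removed.
 * Removal is legal only if 'removing' is a subset of 'active', i.e.
 * (removing & active) == removing — the same condition the kernel
 * checks with bitmap_and() followed by bitmap_equal(), warning
 * "section already deactivated" when it fails.
 */
static bool removal_is_active(uint64_t active, uint64_t removing)
{
	return (removing & active) == removing;
}
```

For example, removing subsections 0-3 (0x0f) from a fully active word (0xff) passes, while removing subsections already cleared fails and would trigger the WARN above.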
Re: [PATCH v7 09/12] mm/sparsemem: Support sub-section hotplug
On Fri, May 3, 2019 at 5:56 AM Oscar Salvador wrote: > > On Wed, May 01, 2019 at 10:56:10PM -0700, Dan Williams wrote: > > The libnvdimm sub-system has suffered a series of hacks and broken > > workarounds for the memory-hotplug implementation's awkward > > section-aligned (128MB) granularity. For example the following backtrace > > is emitted when attempting arch_add_memory() with physical address > > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) > > within a given section: > > > > WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 > > devm_memremap_pages+0x3b5/0x4c0 > > devm_memremap_pages attempted on mixed region [mem 0x2-0x2fbff > > flags 0x200] > > [..] > > Call Trace: > >dump_stack+0x86/0xc3 > >__warn+0xcb/0xf0 > >warn_slowpath_fmt+0x5f/0x80 > >devm_memremap_pages+0x3b5/0x4c0 > >__wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap] > >pmem_attach_disk+0x19a/0x440 [nd_pmem] > > > > Recently it was discovered that the problem goes beyond RAM vs PMEM > > collisions as some platform produce PMEM vs PMEM collisions within a > > given section. The libnvdimm workaround for that case revealed that the > > libnvdimm section-alignment-padding implementation has been broken for a > > long while. A fix for that long-standing breakage introduces as many > > problems as it solves as it would require a backward-incompatible change > > to the namespace metadata interpretation. Instead of that dubious route > > [1], address the root problem in the memory-hotplug implementation. 
> > > > [1]: > > https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com > > Cc: Michal Hocko > > Cc: Vlastimil Babka > > Cc: Logan Gunthorpe > > Signed-off-by: Dan Williams > > --- > > mm/sparse.c | 223 > > --- > > 1 file changed, 150 insertions(+), 73 deletions(-) > > > > diff --git a/mm/sparse.c b/mm/sparse.c > > index 198371e5fc87..419a3620af6e 100644 > > --- a/mm/sparse.c > > +++ b/mm/sparse.c > > @@ -83,8 +83,15 @@ static int __meminit sparse_index_init(unsigned long > > section_nr, int nid) > > unsigned long root = SECTION_NR_TO_ROOT(section_nr); > > struct mem_section *section; > > > > + /* > > + * An existing section is possible in the sub-section hotplug > > + * case. First hot-add instantiates, follow-on hot-add reuses > > + * the existing section. > > + * > > + * The mem_hotplug_lock resolves the apparent race below. > > + */ > > if (mem_section[root]) > > - return -EEXIST; > > + return 0; > > Just a sidenote: we do not bail out on -EEXIST, so it should be fine if we > stick with it. > But if not, I would then clean up sparse_add_section: > > --- a/mm/sparse.c > +++ b/mm/sparse.c > @@ -901,13 +901,12 @@ int __meminit sparse_add_section(int nid, unsigned long > start_pfn, > int ret; > > ret = sparse_index_init(section_nr, nid); > - if (ret < 0 && ret != -EEXIST) > + if (ret < 0) > return ret; > > memmap = section_activate(nid, start_pfn, nr_pages, altmap); > if (IS_ERR(memmap)) > return PTR_ERR(memmap); > - ret = 0; Good catch, folded the cleanup. > > > > + > > + if (!mask) > > + rc = -EINVAL; > > + else if (mask & ms->usage->map_active) > > else if (ms->usage->map_active) should be enough? 
> > + rc = -EEXIST; > > + else > > + ms->usage->map_active |= mask; > > + > > + if (rc) { > > + if (usage) > > + ms->usage = NULL; > > + kfree(usage); > > + return ERR_PTR(rc); > > + } > > + > > + /* > > + * The early init code does not consider partially populated > > + * initial sections, it simply assumes that memory will never be > > + * referenced. If we hot-add memory into such a section then we > > + * do not need to populate the memmap and can simply reuse what > > + * is already there. > > + */ This puzzles me a bit. > I think we cannot have partially populated early sections, can we? Yes, at boot memory need not be section aligned; it has historically been handled as an un-removable section of memory with holes. > And ho
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Mon, 2019-06-03 at 10:52 -0500, Alex Elder wrote: > On 6/3/19 9:54 AM, Dan Williams wrote: > > > To be perfectly honest, at first I thought having IPA use rmnet > > > was a cargo cult thing like Dan suggested, because I didn't see > > To be clear I only meant cargo-culting the naming, not any > > functionality. Clearly IPA/rmnet/QMAP are pretty intimately > > connected > > at this point. But this goes back to whether IPA needs a netdev > > itself > > or whether you need an rmnet device created on top. If the former > > then > > I'd say no cargo-culting, if the later then it's a moot point > > because > > the device name will be rmnet%d anyway. > > OK I thought you weren't sure why rmnet was a layer at all. As I > said, I didn't have a very good understanding of why it was even > needed when I first started working on this. No problem. > I can't (or won't) comment right now on whether IPA needs its own > netdev for rmnet to use. The IPA endpoints used for the modem > network interfaces are enabled when the netdev is opened and > disabled when closed. Outside of that, TX and RX are pretty > much immediately passed through to the layer below or above. > IPA currently has no other net device operations. I don't really have issues with the patchset underneath the netdev layer. I'm interested in how the various bits present themselves to userspace, which is why I am trying to tie this in with Johannes' conversation about WWAN devices, netdevs, channels, and how the various drivers present API for creating data channels that map to different modem contexts. So let me rephrase. If the control plane has set up the default context and sent a QMI Start Network message (or the network attached the default one) and the resulting IP details are applied to the IPA netdev, can things just start sending data? Or do we need to create an rmnet on top to get that working? Dan
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Fri, 2019-05-31 at 15:47 -0500, Alex Elder wrote: > On 5/31/19 2:19 PM, Arnd Bergmann wrote: > > On Fri, May 31, 2019 at 6:36 PM Alex Elder > > wrote: > > > On 5/31/19 9:58 AM, Dan Williams wrote: > > > > On Thu, 2019-05-30 at 22:53 -0500, Alex Elder wrote: > > > > > > > > My question from the Nov 2018 IPA rmnet driver still stands; > > > > how does > > > > this relate to net/ethernet/qualcomm/rmnet/ if at all? And if > > > > this is > > > > really just a netdev talking to the IPA itself and unrelated to > > > > net/ethernet/qualcomm/rmnet, let's call it "ipa%d" and stop > > > > cargo-culting rmnet around just because it happens to be a net driver > > > > for a > > > > QC SoC. > > > > > > First, the relationship between the IPA driver and the rmnet > > > driver > > > is that the IPA driver is assumed to sit between the rmnet driver > > > and the hardware. > > > > Does this mean that IPA can only be used to back rmnet, and rmnet > > can only be used on top of IPA, or can one or both of them be combined > > with another driver to talk to instead? > > No it does not mean that. > > As I understand it, one reason for the rmnet layer was to abstract > the back end, which would allow using a modem, or using something > else (a LAN?), without exposing certain details of the hardware. > (Perhaps to support multiplexing, etc. without duplicating that > logic in two "back-end" drivers?) > > To be perfectly honest, at first I thought having IPA use rmnet > was a cargo cult thing like Dan suggested, because I didn't see To be clear I only meant cargo-culting the naming, not any functionality. Clearly IPA/rmnet/QMAP are pretty intimately connected at this point. But this goes back to whether IPA needs a netdev itself or whether you need an rmnet device created on top. If the former then I'd say no cargo-culting, if the latter then it's a moot point because the device name will be rmnet%d anyway. Dan > the benefit.
I now see why one would use that pass-through layer > to handle the QMAP features. > > But back to your question. The other thing is that I see no > reason the IPA couldn't present a "normal" (non QMAP) interface > for a modem. It's something I'd really like to be able to do, > but I can't do it without having the modem firmware change its > configuration for these endpoints. My access to the people who > implement the modem firmware has been very limited (something > I hope to improve), and unless and until I can get corresponding > changes on the modem side to implement connections that don't > use QMAP, I can't implement such a thing. > > > > Currently the modem is assumed to use QMAP protocol. This means > > > each packet is prefixed by a (struct rmnet_map_header) structure > > > that allows the IPA connection to be multiplexed for several > > > logical > > > connections. The rmnet driver parses such messages and > > > implements > > > the multiplexed network interfaces. > > > > > > QMAP protocol can also be used for aggregating many small packets > > > into a larger message. The rmnet driver implements de-aggregation > > > of such messages (and could probably aggregate them for TX as > > > well). > > > > > > Finally, the IPA can support checksum offload, and the rmnet > > > driver handles providing a prepended header (for TX) and > > > interpreting the appended trailer (for RX) if these features > > > are enabled. > > > > > > So basically, the purpose of the rmnet driver is to handle QMAP > > > protocol connections, and right now that's what the modem > > > provides. > > > > Do you have any idea why this particular design was picked? > > I don't really. I inherited it. Early on, when I asked about > the need for QMAP I was told it was important because it offered > certain features, but at that time I was somewhat new to the code > and didn't have the insight to judge the merits of the design.
> Since then I've mostly just accepted it and concentrated on > improving the IPA driver. > > > My best guess is that it evolved organically with multiple > > generations of hardware and software, rather than being thought > > out as a nice abstraction layer. If the two are tightly connected, > > this might mean that what we actually want here is to reintegrate > > the two components into a single driver with a much simpler > > RX and TX path that handles the checksumming an
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Fri, 2019-05-31 at 17:59 -0600, Subash Abhinov Kasiviswanathan wrote: > On 2019-05-31 17:33, Bjorn Andersson wrote: > > On Fri 31 May 13:47 PDT 2019, Alex Elder wrote: > > > > > On 5/31/19 2:19 PM, Arnd Bergmann wrote: > > > > On Fri, May 31, 2019 at 6:36 PM Alex Elder > > > > wrote: > > > > > On 5/31/19 9:58 AM, Dan Williams wrote: > > > > > > On Thu, 2019-05-30 at 22:53 -0500, Alex Elder wrote: > > > > > > > > > > > > My question from the Nov 2018 IPA rmnet driver still > > > > > > stands; how does > > > > > > this relate to net/ethernet/qualcomm/rmnet/ if at all? And > > > > > > if this is > > > > > > really just a netdev talking to the IPA itself and > > > > > > unrelated to > > > > > > net/ethernet/qualcomm/rmnet, let's call it "ipa%d" and stop > > > > > > cargo-culting rmnet around just because it happens to be a net > > > > > > driver for a > > > > > > QC SoC. > > > > > > > > > > First, the relationship between the IPA driver and the rmnet > > > > > driver > > > > > is that the IPA driver is assumed to sit between the rmnet > > > > > driver > > > > > and the hardware. > > > > > > > > Does this mean that IPA can only be used to back rmnet, and > > > > rmnet > > > > can only be used on top of IPA, or can one or both of them be > > > > combined > > > > with another driver to talk to instead? > > > > > > No it does not mean that. > > > > > > As I understand it, one reason for the rmnet layer was to > > > abstract > > > the back end, which would allow using a modem, or using something > > > else (a LAN?), without exposing certain details of the hardware. > > > (Perhaps to support multiplexing, etc. without duplicating that > > > logic in two "back-end" drivers?) > > > > > > To be perfectly honest, at first I thought having IPA use rmnet > > > was a cargo cult thing like Dan suggested, because I didn't see > > > the benefit. I now see why one would use that pass-through layer > > > to handle the QMAP features. > > > > > > But back to your question.
The other thing is that I see no > > > reason the IPA couldn't present a "normal" (non QMAP) interface > > > for a modem. It's something I'd really like to be able to do, > > > but I can't do it without having the modem firmware change its > > > configuration for these endpoints. My access to the people who > > > implement the modem firmware has been very limited (something > > > I hope to improve), and unless and until I can get corresponding > > > changes on the modem side to implement connections that don't > > > use QMAP, I can't implement such a thing. > > > > > > > But any such changes would either be years into the future or for > > specific devices and as such not applicable to any/most of devices > > on > > the market now or in the coming years. > > > > > > But as Arnd points out, if the software split between IPA and rmnet > > is > > suboptimal you are encouraged to fix that. > > > > Regards, > > Bjorn > > The split rmnet design was chosen because we could place rmnet > over any transport - IPA, PCIe (https://lkml.org/lkml/2018/4/26/1159) > or USB. Yeah, that's what I was looking for clarification on :) Clearly since rmnet can have many transports it should be able to be used by different HW drivers, be that qmi_wwan, IPA, and maybe even rmnet_smd.c? > rmnet registers an rx handler, so the rmnet packet processing itself > happens in the same softirq when packets are queued to the network stack > by IPA. This directly relates to the discussion about a WWAN subsystem that Johannes Berg started a couple weeks ago. IPA appears to create a netdev of its own. Is that netdev usable immediately, or does one need to create an rmnet device on top to access the default PDN? Thanks, Dan
Re: [PATCH v10 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization
On Fri, May 31, 2019 at 12:33 AM Vlastimil Babka wrote: > > On 2/1/19 6:15 AM, Dan Williams wrote: > > --- a/init/Kconfig > > +++ b/init/Kconfig > > @@ -1714,6 +1714,29 @@ config SLAB_FREELIST_HARDENED > > sacrifices to harden the kernel slab allocator against common > > freelist exploit methods. > > > > +config SHUFFLE_PAGE_ALLOCATOR > > + bool "Page allocator randomization" > > + default SLAB_FREELIST_RANDOM && ACPI_NUMA > > + help > > + Randomization of the page allocator improves the average > > + utilization of a direct-mapped memory-side-cache. See section > > + 5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI > > + 6.2a specification for an example of how a platform advertises > > + the presence of a memory-side-cache. There are also incidental > > + security benefits as it reduces the predictability of page > > + allocations to complement SLAB_FREELIST_RANDOM, but the > > + default granularity of shuffling on 4MB (MAX_ORDER) pages is > > + selected based on cache utilization benefits. > > + > > + While the randomization improves cache utilization it may > > + negatively impact workloads on platforms without a cache. For > > + this reason, by default, the randomization is enabled only > > + after runtime detection of a direct-mapped memory-side-cache. > > + Otherwise, the randomization may be force enabled with the > > + 'page_alloc.shuffle' kernel command line parameter. > > + > > + Say Y if unsure. > > It says "Say Y if unsure", yet if I run make oldconfig, the default is > N. Does that make sense? The default is due to the general policy of not forcing users into new kernel functionality (i.e. the common Linus objection when a new config symbol is default 'y'). However, if someone is actively considering whether to enable it, I think there's no harm in recommending 'y' because the facility currently needs to be paired with the page_alloc.shuffle=1 command line option.
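For concreteness, pairing the config symbol with its command-line override looks like this (a hypothetical bootloader fragment; only CONFIG_SHUFFLE_PAGE_ALLOCATOR and page_alloc.shuffle come from the discussion above):

```
# .config: build the page allocator randomization support
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y

# /etc/default/grub (hypothetical): force-enable shuffling even when
# no direct-mapped memory-side cache is detected at boot
GRUB_CMDLINE_LINUX="page_alloc.shuffle=1"
```

Without the command-line parameter, shuffling stays dormant unless a memory-side cache is detected at runtime, per the Kconfig help text.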
Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver
On Thu, 2019-05-30 at 22:53 -0500, Alex Elder wrote: > This series presents the driver for the Qualcomm IP Accelerator > (IPA). > > This is version 2 of the series. This version has addressed almost > all of the feedback received in the first version: > > https://lore.kernel.org/lkml/20190512012508.10608-1-el...@linaro.org/ > More detail is included in the individual patches, but here is a > high-level summary of what's changed since then: > - Two spinlocks have been removed. > - The code for enabling and disabling endpoint interrupts has > been simplified considerably, and the spinlock is no longer > required > - A spinlock used when updating ring buffer pointers is no > longer needed. Integers indexing the ring are used instead > (and they don't even have to be atomic). > - One spinlock remains to protect list updates, but it is always > acquired using spin_lock_bh() (no more irqsave). > - Information about the queueing and completion of messages is now > supplied to the network stack in batches rather than one at a > time. > - I/O completion handling has been simplified, with the IRQ > handler now consisting mainly of disabling the interrupt and > calling napi_schedule(). > - Some comments have been updated and improved throughout. > > What follows is the introduction supplied with v1 of the series. > > - > > The IPA is a component present in some Qualcomm SoCs that allows > network functions such as aggregation, filtering, routing, and NAT > to be performed without active involvement of the main application > processor (AP). > > Initially, these advanced features are disabled; the IPA driver > simply provides a network interface that makes the modem's LTE > network available to the AP. In addition, only support for the > IPA found in the Qualcomm SDM845 SoC is provided. My question from the Nov 2018 IPA rmnet driver still stands; how does this relate to net/ethernet/qualcomm/rmnet/ if at all?
And if this is really just a netdev talking to the IPA itself and unrelated to net/ethernet/qualcomm/rmnet, let's call it "ipa%d" and stop cargo-culting rmnet around just because it happens to be a net driver for a QC SoC. Is the firmware that the driver loads already in linux-firmware or going to be there soon? How does the driver support multiple PDNs (e.g. PDP or EPS contexts) that are enabled through the control plane via QMI messages? I couldn't quite find that out. Thanks, Dan > This code is derived from a driver developed internally by Qualcomm. > A version of the original source can be seen here: > https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree > in the "drivers/platform/msm/ipa" directory. Many were involved in > developing this, but the following individuals deserve explicit > acknowledgement for their substantial contributions: > > Abhishek Choubey > Ady Abraham > Chaitanya Pratapa > David Arinzon > Ghanim Fodi > Gidon Studinski > Ravi Gummadidala > Shihuan Liu > Skylar Chang > > A version of this code was posted in November 2018 as an RFC. > > https://lore.kernel.org/lkml/20181107003250.5832-1-el...@linaro.org/ > All feedback received was addressed. The code has undergone > considerable further rework since that time, and most of the > "future work" described then has now been completed. > > This code is available in buildable form here, based on kernel > v5.2-rc1: > remote: ssh://g...@git.linaro.org/people/alex.elder/linux.git > branch: ipa-v2_kernel-v5.2-rc2 > 75adf2ac1266 arm64: defconfig: enable build of IPA code > > The branch depends on a commit now found in net-next.
It has > been cherry-picked, and (in this branch) has this commit ID: > 13c627b5a078 net: qualcomm: rmnet: Move common struct definitions > to include > by
>
> -Alex
>
> Alex Elder (17):
>   bitfield.h: add FIELD_MAX() and field_max()
>   dt-bindings: soc: qcom: add IPA bindings
>   soc: qcom: ipa: main code
>   soc: qcom: ipa: configuration data
>   soc: qcom: ipa: clocking, interrupts, and memory
>   soc: qcom: ipa: GSI headers
>   soc: qcom: ipa: the generic software interface
>   soc: qcom: ipa: GSI transactions
>   soc: qcom: ipa: IPA interface to GSI
>   soc: qcom: ipa: IPA endpoints
>   soc: qcom: ipa: immediate commands
>   soc: qcom: ipa: IPA network device and microcontroller
>   soc: qcom: ipa: AP/modem communications
>   soc: qcom: ipa: support build of IPA code
>   MAINTAINERS: add entry for the Qualcomm IPA driver
>   arm64: dts: sdm845: add IPA information
>   arm64: defconfig: enable build of IPA code
>
>  .../devicetree/bindings/net/qcom,ipa.yaml |  180 ++
>  MAINTAINERS                               |    6 +
>  arch/arm64/boot/dts/qcom/sdm845.dtsi      |   51 +
>  arch/arm64/configs/defconfig              |    1 +
>  drivers/net/Kconfig                       |    2 +
>  drivers/net/Makefile
Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
On Mon, May 13, 2019 at 12:22 PM Logan Gunthorpe wrote: > > > > On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote: > > > > > > On 2019-05-07 5:55 p.m., Dan Williams wrote: > >> Changes since v1 [1]: > >> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan) > >> > >> - Refresh the p2pdma patch headers to match the format of other p2pdma > >>patches (Bjorn) > >> > >> - Collect Ira's reviewed-by > >> > >> [1]: > >> https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.st...@dwillia2-desk3.amr.corp.intel.com/ > > > > This series looks good to me: > > > > Reviewed-by: Logan Gunthorpe > > > > However, I haven't tested it yet but I intend to later this week. > > I've tested libnvdimm-pending which includes this series on my setup and > everything works great. Hi Andrew, With this tested-by can we move forward on this fix set? I'm not aware of any other remaining comments. Greg had a question about "drivers/base/devres: Introduce devm_release_action()" that I answered, but otherwise the feedback has gone silent.
[GIT PULL] libnvdimm fixes for v5.2-rc2
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.2-rc2

...to receive a regression fix, a small (2 line code change) performance enhancement, and some miscellaneous compilation warning fixes. These have soaked in -next the past week with no known issues. The device-mapper touches have Mike's ack, and the hardened user-copy bypass was reviewed with Kees.

---

The following changes since commit a188339ca5a396acc588e5851ed7e19f66b0ebd9:

  Linux 5.2-rc1 (2019-05-19 15:47:09 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.2-rc2

for you to fetch changes up to 52f476a323f9efc959be1c890d0cdcf12e1582e0:

  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead (2019-05-20 20:43:32 -0700)

----
libnvdimm fixes v5.2-rc2

- Fix a regression that disabled device-mapper dax support

- Remove unnecessary hardened-user-copy overhead (>30%) for dax
  read(2)/write(2).

- Fix some compilation warnings.

----
Dan Williams (2):
      dax: Arrange for dax_supported check to span multiple devices
      libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

Qian Cai (1):
      libnvdimm: Fix compilation warnings with W=1

 drivers/dax/super.c          | 88
 drivers/md/dm-table.c        | 17 ++---
 drivers/md/dm.c              | 20 ++
 drivers/md/dm.h              |  1 +
 drivers/nvdimm/bus.c         |  4 +-
 drivers/nvdimm/label.c       |  2 +
 drivers/nvdimm/label.h       |  2 -
 drivers/nvdimm/pmem.c        | 11 +-
 drivers/s390/block/dcssblk.c |  1 +
 include/linux/dax.h          | 26 +
 10 files changed, 129 insertions(+), 43 deletions(-)
Re: [PATCH] mm/swap: Fix release_pages() when releasing devmap pages
On Thu, May 23, 2019 at 8:58 PM Dan Williams wrote: > > On Thu, May 23, 2019 at 3:37 PM wrote: > > > > From: Ira Weiny > > > > Device pages can be more than type MEMORY_DEVICE_PUBLIC. > > > > Handle all device pages within release_pages() > > > > This was found via code inspection while determining if release_pages() > > and the new put_user_pages() could be interchangeable. > > > > Cc: Jérôme Glisse > > Cc: Dan Williams > > Cc: Michal Hocko > > Cc: John Hubbard > > Signed-off-by: Ira Weiny > > --- > > mm/swap.c | 7 +++ > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > diff --git a/mm/swap.c b/mm/swap.c > > index 3a75722e68a9..d1e8122568d0 100644 > > --- a/mm/swap.c > > +++ b/mm/swap.c > > @@ -739,15 +739,14 @@ void release_pages(struct page **pages, int nr) > > if (is_huge_zero_page(page)) > > continue; > > > > - /* Device public page can not be huge page */ > > - if (is_device_public_page(page)) { > > + if (is_zone_device_page(page)) { > > if (locked_pgdat) { > > > > spin_unlock_irqrestore(&locked_pgdat->lru_lock, > > flags); > > locked_pgdat = NULL; > > } > > - put_devmap_managed_page(page); > > - continue; > > + if (put_devmap_managed_page(page)) > > This "shouldn't" fail, and if it does the code that follows might get > confused by a ZONE_DEVICE page. If anything I would make this a > WARN_ON_ONCE(!put_devmap_managed_page(page)), but always continue > unconditionally. As discussed offline, I'm wrong here. It needs to fall through to put_page_testzero() for the device-dax case, but perhaps add a comment for the next time I forget that subtlety.
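The fall-through subtlety conceded here can be sketched in a toy userspace model (all structures and the refcounting below are simplified stand-ins, not the kernel's): devmap-managed pages (e.g. fsdax/hmm) are released via the devmap path, while device-dax pages make put_devmap_managed_page() return false and must still reach the regular put_page_testzero() step.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the release_pages() control flow under discussion.
 * put_devmap_managed_page() returns false when the page is not
 * devmap-managed (e.g. device-dax, or the devmap static key is off);
 * the caller must then fall through to the normal refcount drop.
 */
struct page {
	int refcount;
	bool zone_device;
	bool devmap_managed;   /* e.g. fsdax/hmm pages, not device-dax */
};

static bool put_devmap_managed_page(struct page *p)
{
	if (!p->devmap_managed)
		return false;  /* caller must do the regular put */
	p->refcount--;         /* ->page_free() style handling elided */
	return true;
}

/* Returns true when the final reference was dropped the regular way. */
static bool release_one_page(struct page *p)
{
	if (p->zone_device && put_devmap_managed_page(p))
		return false;            /* handled by the devmap path */
	return --p->refcount == 0;       /* put_page_testzero() stand-in */
}
```

Unconditionally skipping the regular put when put_devmap_managed_page() fails, as the first review suggested, would leak the device-dax reference; the model makes that visible.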
Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
On Fri, May 24, 2019 at 9:07 AM Larry Bassel wrote: > On 14 May 19 16:01, Kirill A. Shutemov wrote: > > On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote: [..] > > > diff --git a/mm/memory.c b/mm/memory.c > > > index f7d962d..4c1814c 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct > > > vm_fault *vmf) > > > return 0; > > > } > > > > > > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD > > > +static pmd_t *huge_pmd_offset(struct mm_struct *mm, > > > + unsigned long addr, unsigned long sz) > > > > Could you explain what this function suppose to do? > > > > As far as I can see vma_mmu_pagesize() is always PAGE_SIZE of DAX > > filesystem. So we have 'sz' == PAGE_SIZE here. > > I thought so too, but in my testing I found that vma_mmu_pagesize() returns > 4KiB, which differs from the DAX filesystem's 2MiB pagesize. A given filesystem-dax vma is allowed to support both 4K and 2M mappings, so the vma_mmu_pagesize() is not granular enough to describe the capabilities of a filesystem-dax vma. In the device-dax case, where there are mapping guarantees, the implementation does arrange for vma_mmu_pagesize() to reflect the right page size.
Re: [PATCH] mm/swap: Fix release_pages() when releasing devmap pages
On Fri, May 24, 2019 at 8:35 AM Ira Weiny wrote: > > On Thu, May 23, 2019 at 08:58:12PM -0700, Dan Williams wrote: > > On Thu, May 23, 2019 at 3:37 PM wrote: > > > > > > From: Ira Weiny > > > > > > Device pages can be more than type MEMORY_DEVICE_PUBLIC. > > > > > > Handle all device pages within release_pages() > > > > > > This was found via code inspection while determining if release_pages() > > > and the new put_user_pages() could be interchangeable. > > > > > > Cc: Jérôme Glisse > > > Cc: Dan Williams > > > Cc: Michal Hocko > > > Cc: John Hubbard > > > Signed-off-by: Ira Weiny > > > --- > > > mm/swap.c | 7 +++ > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > diff --git a/mm/swap.c b/mm/swap.c > > > index 3a75722e68a9..d1e8122568d0 100644 > > > --- a/mm/swap.c > > > +++ b/mm/swap.c > > > @@ -739,15 +739,14 @@ void release_pages(struct page **pages, int nr) > > > if (is_huge_zero_page(page)) > > > continue; > > > > > > - /* Device public page can not be huge page */ > > > - if (is_device_public_page(page)) { > > > + if (is_zone_device_page(page)) { > > > if (locked_pgdat) { > > > > > > spin_unlock_irqrestore(&locked_pgdat->lru_lock, > > > flags); > > > locked_pgdat = NULL; > > > } > > > - put_devmap_managed_page(page); > > > - continue; > > > + if (put_devmap_managed_page(page)) > > > > This "shouldn't" fail, and if it does the code that follows might get > > I agree it shouldn't based on the check. However... > > > confused by a ZONE_DEVICE page. If anything I would make this a > > WARN_ON_ONCE(!put_devmap_managed_page(page)), but always continue > > unconditionally. > > I was trying to follow the pattern from put_page(), where if it fails it > indicated > it was not a devmap page and so "regular" processing should continue. In this case that regular continuation already happened by not taking the if (is_zone_device_page(page)) branch. > > Since I'm unsure I'll just ask what does this check do?
> > if (!static_branch_unlikely(&devmap_managed_key)) > return false; That attempts to skip the overhead imposed by device-pages, i.e. ->page_free() callback and other extras, if there are no device-page producers in the system. I.e. use the old simple put_page() path when there is no hmm or pmem.
Re: [PATCH] mm/swap: Fix release_pages() when releasing devmap pages
On Thu, May 23, 2019 at 3:37 PM wrote: > > From: Ira Weiny > > Device pages can be more than type MEMORY_DEVICE_PUBLIC. > > Handle all device pages within release_pages() > > This was found via code inspection while determining if release_pages() > and the new put_user_pages() could be interchangeable. > > Cc: Jérôme Glisse > Cc: Dan Williams > Cc: Michal Hocko > Cc: John Hubbard > Signed-off-by: Ira Weiny > --- > mm/swap.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/swap.c b/mm/swap.c > index 3a75722e68a9..d1e8122568d0 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -739,15 +739,14 @@ void release_pages(struct page **pages, int nr) > if (is_huge_zero_page(page)) > continue; > > - /* Device public page can not be huge page */ > - if (is_device_public_page(page)) { > + if (is_zone_device_page(page)) { > if (locked_pgdat) { > > spin_unlock_irqrestore(&locked_pgdat->lru_lock, > flags); > locked_pgdat = NULL; > } > - put_devmap_managed_page(page); > - continue; > + if (put_devmap_managed_page(page)) This "shouldn't" fail, and if it does the code that follows might get confused by a ZONE_DEVICE page. If anything I would make this a WARN_ON_ONCE(!put_devmap_managed_page(page)), but always continue unconditionally. Other than that you can add: Reviewed-by: Dan Williams
Re: [v6 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM
On Fri, May 17, 2019 at 2:54 PM Pavel Tatashin wrote: > > It is now allowed to use persistent memory like a regular RAM, but > currently there is no way to remove this memory until the machine is > rebooted. > > This work expands the functionality to also allow hotremoving > previously hotplugged persistent memory, and recover the device for use > for other purposes. > > To hotremove persistent memory, the management software must first > offline all memory blocks of the dax region, and then unbind it from the > device-dax/kmem driver. So, operations should look like this: > > echo offline > /sys/devices/system/memory/memoryN/state > ... > echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind > > Note: if unbind is done without offlining memory beforehand, it won't be > possible to do dax0.0 hotremove, and dax's memory is going to be part of > System RAM until reboot. > > Signed-off-by: Pavel Tatashin > Reviewed-by: David Hildenbrand > --- > drivers/dax/dax-private.h | 2 ++ > drivers/dax/kmem.c| 41 +++ > 2 files changed, 39 insertions(+), 4 deletions(-) > > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h > index a45612148ca0..999aaf3a29b3 100644 > --- a/drivers/dax/dax-private.h > +++ b/drivers/dax/dax-private.h > @@ -53,6 +53,7 @@ struct dax_region { > * @pgmap - pgmap for memmap setup / lifetime (driver owned) > * @ref: pgmap reference count (driver owned) > * @cmp: @ref final put completion (driver owned) > + * @dax_mem_res: physical address range of hotadded DAX memory > */ > struct dev_dax { > struct dax_region *region; > @@ -62,6 +63,7 @@ struct dev_dax { > struct dev_pagemap pgmap; > struct percpu_ref ref; > struct completion cmp; > + struct resource *dax_kmem_res; > }; > > static inline struct dev_dax *to_dev_dax(struct device *dev) > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 4c0131857133..3d0a7e702c94 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev) >
kfree(new_res); > return rc; > } > + dev_dax->dax_kmem_res = new_res; > > return 0; > } > > +#ifdef CONFIG_MEMORY_HOTREMOVE > +static int dev_dax_kmem_remove(struct device *dev) > +{ > + struct dev_dax *dev_dax = to_dev_dax(dev); > + struct resource *res = dev_dax->dax_kmem_res; > + resource_size_t kmem_start = res->start; > + resource_size_t kmem_size = resource_size(res); > + int rc; > + > + /* > +* We have one shot for removing memory, if some memory blocks were > not > +* offline prior to calling this function remove_memory() will fail, > and > +* there is no way to hotremove this memory until reboot because > device > +* unbind will succeed even if we return failure. > +*/ > + rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); > + if (rc) { > + dev_err(dev, > + "DAX region %pR cannot be hotremoved until the next > reboot\n", > + res); Small quibbles with this error message... "DAX" is redundant since the device name is printed by dev_err(). I'd suggest dropping "until the next reboot" since there is no guarantee it will work then either and the surefire mechanism to recover the memory from the kmem driver is to not add it in the first place. Perhaps also print out the error code in case it might specify a finer grained reason the memory is pinned. Other than that you can add Reviewed-by: Dan Williams ...as it looks like Andrew will take this through -mm.
Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address
On Mon, May 20, 2019 at 12:27 PM Jerome Glisse wrote: > > On Mon, May 20, 2019 at 11:07:38AM +0530, Anshuman Khandual wrote: > > On 05/18/2019 03:20 AM, Andrew Morton wrote: > > > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual > > > wrote: > > > > > >> The presence of struct page does not guarantee linear mapping for the pfn > > >> physical range. Device private memory which is non-coherent is excluded > > >> from linear mapping during devm_memremap_pages() though they will still > > >> have struct page coverage. Just check for device private memory before > > >> giving out virtual address for a given pfn. > > > > > > I was going to give my standard "what are the user-visible runtime > > > effects of this change?", but... > > > > > >> All these helper functions are all pfn_t related but could not figure out > > >> another way of determining a private pfn without looking into its struct > > >> page. pfn_t_to_virt() is not getting used anywhere in the mainline kernel. Is > > >> it used by out-of-tree drivers? Should we then drop it completely ? > > > > > > Yeah, let's kill it. > > > > > > But first, let's fix it so that if someone brings it back, they bring > > > back a non-buggy version. > > > > Makes sense. > > > > > > > > So... what (would be) the user-visible runtime effects of this change? > > > > I am not very well aware about the user interaction with the drivers which > > hotplug and manage ZONE_DEVICE memory in general. Hence will not be able to > > comment on its user-visible runtime impact. I just figured this out from > > code audit while testing ZONE_DEVICE on arm64 platform. But the fix makes > > the function a bit more expensive as it now involves some additional memory > > references. > > A device private pfn can never leak outside code that does not understand it. > So this change is useless for any existing users and I would like to keep the > existing behavior, i.e. never leak device private pfn.
The issue is that only an HMM expert might know that such a pfn can never leak, in other words the pfn concept from a code perspective is already leaked / widespread. Ideally any developer familiar with a pfn and the core-mm pfn helpers need only worry about pfn semantics without being required to go audit HMM users.
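The hazard being argued over can be modeled in plain C (the flag values and the phys-to-virt arithmetic below are invented for illustration; only the rule that MEMORY_DEVICE_PRIVATE pfns have no linear mapping comes from the thread):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the pfn_t_to_virt() fix under discussion: a pfn backed
 * by MEMORY_DEVICE_PRIVATE has a struct page but is excluded from the
 * linear map, so handing out a __va()-style address for it is a bug.
 */
#define PFN_DEV      (1u << 0)
#define PFN_PRIVATE  (1u << 1)   /* stand-in for MEMORY_DEVICE_PRIVATE */

typedef struct { unsigned long val; unsigned int flags; } pfn_t;

static bool pfn_t_has_linear_map(pfn_t pfn)
{
	/* device-private memory never appears in the linear map */
	return !(pfn.flags & PFN_PRIVATE);
}

static void *pfn_t_to_virt(pfn_t pfn)
{
	if (!pfn_t_has_linear_map(pfn))
		return NULL;   /* the fix: refuse to compute an address */
	return (void *)(uintptr_t)(pfn.val << 12);  /* fake __va() */
}
```

The pre-fix behavior corresponds to skipping the check and computing the address unconditionally, which is exactly the "leak" a non-HMM-expert caller could trip over.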
[PATCH v2] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
Jeff discovered that performance improves from ~375K iops to ~519K iops on a simple psync-write fio workload when moving the location of 'struct page' from the default PMEM location to DRAM. This result is surprising because the expectation is that 'struct page' for dax is only needed for third party references to dax mappings. For example, a dax-mapped buffer passed to another system call for direct-I/O requires 'struct page' for sending the request down the driver stack and pinning the page. There is no usage of 'struct page' for first party access to a file via read(2)/write(2) and friends. However, this "no page needed" expectation is violated by CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The check_heap_object() helper routine assumes the buffer is backed by a slab allocator (DRAM) page and applies some checks. Those checks are invalid, dax pages do not originate from the slab, and redundant, dax_iomap_actor() has already validated that the I/O is within bounds. Specifically that routine validates that the logical file offset is within bounds of the file, then it does a sector-to-pfn translation which validates that the physical mapping is within bounds of the block device. Bypass additional hardened usercopy overhead and call the 'no check' versions of the copy_{to,from}_iter operations directly. Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...") Cc: Cc: Jeff Moyer Cc: Ingo Molnar Cc: Christoph Hellwig Cc: Al Viro Cc: Thomas Gleixner Cc: Matthew Wilcox Reported-and-tested-by: Jeff Smits Acked-by: Kees Cook Acked-by: Jan Kara Signed-off-by: Dan Williams --- Changes since v1 [1]: * Update the changelog to clarify which checks in dax_iomap_actor() obviate the need for "hardened" checks. (Jan) * Update the code comment in drivers/nvdimm/pmem.c to reflect the same. * Collect some Acks from Kees and Jan. 
[1]: https://lore.kernel.org/lkml/155805321833.867447.3864104616303535270.st...@dwillia2-desk3.amr.corp.intel.com/ drivers/nvdimm/pmem.c |9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 845c5b430cdd..c894f45e5077 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev, return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn); } +/* + * Use the 'no check' versions of copy_from_iter_flushcache() and + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds + * checking is handled by dax_iomap_actor() + */ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) { - return copy_from_iter_flushcache(addr, bytes, i); + return _copy_from_iter_flushcache(addr, bytes, i); } static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) { - return copy_to_iter_mcsafe(addr, bytes, i); + return _copy_to_iter_mcsafe(addr, bytes, i); } static const struct dax_operations pmem_dax_ops = {
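The layering that the patch exploits can be illustrated with a small userspace sketch (hypothetical names; the real kernel helpers are check_copy_size()/check_heap_object() and the underscore-prefixed _copy_{to,from}_iter variants): a "hardened" wrapper stacks an object-bounds validation on top of a raw copy primitive, and a caller whose bounds were already proven upstream, as dax_iomap_actor() does for pmem, can call the raw primitive directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Raw copy primitive with no object-bounds validation
 * (stands in for _copy_from_iter_flushcache()). */
static size_t raw_copy(void *dst, const void *src, size_t bytes)
{
        memcpy(dst, src, bytes);
        return bytes;
}

/* Hardened wrapper: validates the destination object before copying
 * (stands in for check_copy_size() + copy_from_iter_flushcache()). */
static size_t checked_copy(void *dst, size_t dst_size,
                           const void *src, size_t bytes)
{
        if (bytes > dst_size)   /* would overflow the object */
                return 0;
        return raw_copy(dst, src, bytes);
}

/* A caller like pmem_copy_from_iter(), whose bounds were already
 * validated upstream, skips the redundant re-check. */
static size_t prevalidated_copy(void *dst, const void *src, size_t bytes)
{
        return raw_copy(dst, src, bytes);
}
```

The performance argument in the changelog is about what the wrapper must touch: the real check_heap_object() needs the 'struct page' for the buffer, which for pmem-resident pages is a high-latency cacheline miss on every copy.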
Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address
On Sun, May 19, 2019 at 10:37 PM Anshuman Khandual wrote: > > On 05/18/2019 03:20 AM, Andrew Morton wrote: > > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual > > wrote: > > > >> The presence of struct page does not guarantee linear mapping for the pfn > >> physical range. Device private memory which is non-coherent is excluded > >> from the linear mapping during devm_memremap_pages() though it will still > >> have struct page coverage. Just check for device private memory before > >> giving out a virtual address for a given pfn. > > > > I was going to give my standard "what are the user-visible runtime > > effects of this change?", but... > > > >> All these helper functions are pfn_t related but I could not figure out > >> another way of determining a private pfn without looking into its struct > >> page. pfn_t_to_virt() is not getting used anywhere in the mainline kernel. Is > >> it used by out-of-tree drivers? Should we then drop it completely? > > > > Yeah, let's kill it. +1 to killing it, since there has been a paucity of 'unsigned long pfn' code path conversions to 'pfn_t', and it continues to go unused. > > But first, let's fix it so that if someone brings it back, they bring > back a non-buggy version. Not sure this can be solved without a rethink of who owns the virtual address space corresponding to MEMORY_DEVICE_PRIVATE, and clawing back some of the special-ness of HMM. > > Makes sense. > > > > > So... what (would be) the user-visible runtime effects of this change? > > I am not very well aware of the user interaction with the drivers which > hotplug and manage ZONE_DEVICE memory in general, hence I will not be able to > comment on its user-visible runtime impact. I just figured this out from > a code audit while testing ZONE_DEVICE on an arm64 platform. But the fix makes > the function a bit more expensive as it now involves some additional memory > references.
MEMORY_DEVICE_PRIVATE semantics were part of the package of the initial HMM submission that landed in the kernel without an upstream user. While pfn_t_to_virt() also does not have an upstream user, it was at least modeled after the existing pfn_to_virt() API to allow for future 'unsigned long pfn' to 'pfn_t' conversions. As for what a fix might look like, it seems to me that we should try to unify 'pfn_t' and 'hmm_pfn's. I don't see why 'hmm_pfn's need to exist as their own concept vs trying to consume more flag space out of pfn_t. That would at least allow the pfn_t_has_page() helper to detect the HMM case.
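A rough model of the flag-space idea being proposed (the bit positions and flag names below are illustrative, not the exact values from include/linux/pfn_t.h): pfn_t carries type flags in otherwise-unused high bits of the pfn value, and a helper like pfn_t_has_page() classifies the pfn from those flags.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pfn_t;

/* Flags live in high bits the pfn itself never uses; the real kernel
 * defines PFN_DEV, PFN_MAP, etc. with different encodings. */
#define PFN_DEV         (1ULL << 62)    /* device memory, no linear map  */
#define PFN_MAP         (1ULL << 63)    /* pfn is backed by struct page  */
#define PFN_FLAGS_MASK  (PFN_DEV | PFN_MAP)

static pfn_t pfn_to_pfn_t(uint64_t pfn, uint64_t flags)
{
        return (pfn_t){ .val = pfn | flags };
}

static uint64_t pfn_t_to_pfn(pfn_t p)
{
        return p.val & ~PFN_FLAGS_MASK;
}

/* Ordinary memory (no flags) has a page; device memory only when it was
 * explicitly mapped with struct page coverage. */
static int pfn_t_has_page(pfn_t p)
{
        return (p.val & PFN_FLAGS_MASK) == 0 || (p.val & PFN_MAP) != 0;
}
```

Folding HMM's private-memory state into this encoding is what would let generic helpers detect the MEMORY_DEVICE_PRIVATE case without dereferencing the struct page.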
Re: [PATCH] dax: Fix last_page check in __bdev_dax_supported()
On Thu, May 16, 2019 at 10:37 PM Vaibhav Jain wrote: > > Dan Williams writes: > > > On Wed, May 15, 2019 at 10:55 PM Vaibhav Jain wrote: > >> > >> Presently __bdev_dax_supported() checks if the first sector of the last > >> page ( last_page ) on the block device is aligned to a page > >> boundary. However the code to compute 'last_page' assumes that there > >> are 8 sectors/page, which is only true for a 4K page-size. > >> > >> This assumption breaks on architectures which use a different page > >> size, specifically PPC64 where page-size == 64K. Hence a warning is > >> seen while trying to mount an xfs/ext4 file-system with dax enabled: > >> > >> $ sudo mount -o dax /dev/pmem0 /mnt/pmem > >> XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk > >> XFS (pmem0): DAX unsupported by block device. Turning off DAX. > >> > >> The patch fixes this issue by updating the calculation of 'last_page' to > >> take into account the number of sectors/page instead of assuming it to be > >> '8'. > > > > Yes, I noticed this too and fixed it up in a wider change that also > > allows device-mapper to validate each component device. Does this > > patch work for you? > > > > https://lore.kernel.org/lkml/155789172402.748145.11853718580748830476.st...@dwillia2-desk3.amr.corp.intel.com/ > > Thanks Dan, I tested your patch and am not seeing the issue anymore. Thanks, I recorded a "Tested-by" for you.
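The arithmetic behind the bug is easy to model. The functions below are illustrative stand-ins, not the actual __bdev_dax_supported() code: the buggy form hard-codes 8 sectors per page (true only when PAGE_SIZE is 4K), while the fixed form derives sectors-per-page from the page size, which is what makes 64K-page PPC64 work.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL

/* Buggy form: hard-codes 8 sectors per page, i.e. assumes 4K pages.
 * Returns the first sector of the last full page on the device. */
static uint64_t last_page_buggy(uint64_t nr_sectors)
{
        return (nr_sectors / 8) * 8 - 8;
}

/* Fixed form: derive sectors-per-page from the actual page size. */
static uint64_t last_page_fixed(uint64_t nr_sectors, uint64_t page_size)
{
        uint64_t spp = page_size / SECTOR_SIZE;

        return (nr_sectors / spp) * spp - spp;
}
```

With 4K pages both forms agree; with 64K pages the hard-coded '8' lands mid-page and the subsequent alignment check fails, producing the "DAX unsupported by block device" warning.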
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Mon, May 20, 2019 at 12:52 AM Jan Kara wrote: > > On Sat 18-05-19 21:46:03, Dan Williams wrote: > > On Fri, May 17, 2019 at 12:25 PM Kees Cook wrote: > > > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote: > > > > It seems dax_iomap_actor() is not a path where we'd be worried about > > > > needing hardened user copy checks. > > > > > > I would agree: I think the proposed patch makes sense. :) > > > > Sounds like an acked-by to me. > > Yeah, if Kees agrees, I'm fine with skipping the checks as well. I just > wanted that to be clarified. Also it helped me that you wrote: > > That routine (dax_iomap_actor()) validates that the logical file offset is > within bounds of the file, then it does a sector-to-pfn translation which > validates that the physical mapping is within bounds of the block device. > > That is more specific than "dax_iomap_actor() takes care of necessary > checks" which was in the changelog. And the above paragraph helped me > clarify which checks in dax_iomap_actor() you think replace those usercopy > checks. So I think it would be good to add that paragraph to those > copy_from_pmem() functions as a comment just in case we are wondering in > the future why we are skipping the checks... Also feel free to add: > > Acked-by: Jan Kara Will do, thanks Jan.
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 12:25 PM Kees Cook wrote: > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote: > > It seems dax_iomap_actor() is not a path where we'd be worried about > > needing hardened user copy checks. > > I would agree: I think the proposed patch makes sense. :) Sounds like an acked-by to me.
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 8:57 AM Kees Cook wrote: > > On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote: > > As far as I can see it's mostly check_heap_object() that is the > > problem, so I'm open to finding a way to just bypass that sub-routine. > > However, as far as I can see none of the other block / filesystem user > > copy implementations submit to the hardened checks, like > > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So, > > either those need to grow additional checks, or the hardened copy > > implementation is targeting single object copy use cases, not > > necessarily block-I/O. Yes, Kees, please advise. > > The intention is mainly for copies that haven't had explicit bounds > checking already performed on them, yes. Is there something getting > checked out of the slab, or is it literally just the overhead of doing > the "is this slab?" check that you're seeing? It's literally the overhead of "is this slab?" since it needs to go retrieve the struct page and read that potentially cold cacheline. In the case where that page is on memory media that is higher latency than DRAM we get the ~37% performance loss that Jeff measured. The path is via the filesystem ->write_iter() file operation. In the DAX case the filesystem traps that path early, before submitting block I/O, and routes it to the dax_iomap_actor() routine. That routine validates that the logical file offset is within bounds of the file, then it does a sector-to-pfn translation which validates that the physical mapping is within bounds of the block device. It seems dax_iomap_actor() is not a path where we'd be worried about needing hardened user copy checks.
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 1:47 AM Jan Kara wrote: > > Let's add Kees to CC for usercopy expertise... > > On Thu 16-05-19 17:33:38, Dan Williams wrote: > > Jeff discovered that performance improves from ~375K iops to ~519K iops > > on a simple psync-write fio workload when moving the location of 'struct > > page' from the default PMEM location to DRAM. This result is surprising > > because the expectation is that 'struct page' for dax is only needed for > > third party references to dax mappings. For example, a dax-mapped buffer > > passed to another system call for direct-I/O requires 'struct page' for > > sending the request down the driver stack and pinning the page. There is > > no usage of 'struct page' for first party access to a file via > > read(2)/write(2) and friends. > > > > However, this "no page needed" expectation is violated by > > CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in > > copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The > > check_heap_object() helper routine assumes the buffer is backed by a > > page-allocator DRAM page and applies some checks. Those checks are > > invalid, dax pages are not from the heap, and redundant, > > dax_iomap_actor() has already validated that the I/O is within bounds. > > So this last paragraph is not obvious to me as check_copy_size() does a lot > of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of > those checks don't make sense for PMEM pages but I'd rather handle that by > refining check_copy_size() and check_object_size() functions to detect and > appropriately handle pmem pages rather that generally skip all the checks > in pmem_copy_from/to_iter(). And yes, every check in such hot path is going > to cost performance but that's what user asked for with > CONFIG_HARDENED_USERCOPY... Kees? As far as I can see it's mostly check_heap_object() that is the problem, so I'm open to finding a way to just bypass that sub-routine. 
However, as far as I can see none of the other block / filesystem user-copy implementations submit to the hardened checks, like bio_copy_from_iter() and iov_iter_copy_from_user_atomic(). So, either those need to grow additional checks, or the hardened copy implementation is targeting single-object copy use cases, not necessarily block-I/O. Yes, Kees, please advise.
[PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
Jeff discovered that performance improves from ~375K iops to ~519K iops on a simple psync-write fio workload when moving the location of 'struct page' from the default PMEM location to DRAM. This result is surprising because the expectation is that 'struct page' for dax is only needed for third party references to dax mappings. For example, a dax-mapped buffer passed to another system call for direct-I/O requires 'struct page' for sending the request down the driver stack and pinning the page. There is no usage of 'struct page' for first party access to a file via read(2)/write(2) and friends. However, this "no page needed" expectation is violated by CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The check_heap_object() helper routine assumes the buffer is backed by a page-allocator DRAM page and applies some checks. Those checks are invalid, dax pages are not from the heap, and redundant, dax_iomap_actor() has already validated that the I/O is within bounds. Bypass this overhead and call the 'no check' versions of the copy_{to,from}_iter operations directly. Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...") Cc: Jan Kara Cc: Cc: Jeff Moyer Cc: Ingo Molnar Cc: Christoph Hellwig Cc: Al Viro Cc: Thomas Gleixner Cc: Matthew Wilcox Reported-and-tested-by: Jeff Smits Signed-off-by: Dan Williams --- drivers/nvdimm/pmem.c |9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 845c5b430cdd..c894f45e5077 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev, return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn); } +/* + * Use the 'no check' versions of copy_from_iter_flushcache() and + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. 
Bounds + * checking is handled by dax_iomap_actor() + */ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) { - return copy_from_iter_flushcache(addr, bytes, i); + return _copy_from_iter_flushcache(addr, bytes, i); } static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i) { - return copy_to_iter_mcsafe(addr, bytes, i); + return _copy_to_iter_mcsafe(addr, bytes, i); } static const struct dax_operations pmem_dax_ops = {
Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
On Thu, May 16, 2019 at 9:45 AM Jane Chu wrote: > > Hi, > > I'm able to reproduce the panic below by running two sets of ndctl > commands that actually serve a legitimate purpose in parallel (unlike > the brute force experiment earlier), each set in an indefinite loop. > This time it takes about an hour to panic. But I gather the cause > is probably the same: I've overlapped ndctl commands on the same > region. > > Could we add a check in nd_ioctl(), such that if there is > an ongoing ndctl command on a region, a subsequent ndctl request > will fail immediately with something to the effect of EAGAIN? > The rationale being that the kernel should protect itself against > user mistakes. We do already have locking in the driver to prevent configuration collisions. The problem looks to be broken assumptions about running the device unregistration path in a separate thread outside the lock. I suspect it may be incorrect assumptions about the userspace visibility of the device relative to teardown actions. To be clear, this isn't the nd_ioctl() path; this is the sysfs path. > Also, sensing the subject fix is for a different problem, and has been > verified, I'm happy to see it in upstream, so we have a better > code base to dig deeper in terms of how the destructive ndctl > commands interact with typical mission-critical applications, including > but not limited to RDMA. Right, the crash signature you are seeing looks unrelated to the issue being addressed in these patches, which is device-teardown racing active page pins. I'll start the investigation on the crash signature, but again I don't think it reads on this fix series.
Re: [PATCH] nvdimm: fix compilation warnings with W=1
On Thu, May 16, 2019 at 9:05 AM Qian Cai wrote: > > Several places (dimm_devs.c, core.c etc) include label.h but only > label.c uses NSINDEX_SIGNATURE, so move its definition to label.c > instead. > > In file included from drivers/nvdimm/dimm_devs.c:23: > drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but > not used [-Wunused-const-variable=] > > Also, some places abuse "/**" which is only reserved for the kernel-doc. > > drivers/nvdimm/bus.c:648: warning: cannot understand function prototype: > 'struct attribute_group nd_device_attribute_group = ' > drivers/nvdimm/bus.c:677: warning: cannot understand function prototype: > 'struct attribute_group nd_numa_attribute_group = ' > > Those are just some member assignments for the "struct attribute_group" > instances and it can't be expressed in the kernel-doc. Ah, good point, I missed that. > Reviewed-by: Vishal Verma > Signed-off-by: Qian Cai Looks good, I'll pull this in for a post -rc1 update.
Re: [PATCH] dax: Fix last_page check in __bdev_dax_supported()
On Wed, May 15, 2019 at 10:55 PM Vaibhav Jain wrote: > > Presently __bdev_dax_supported() checks if the first sector of the last > page ( last_page ) on the block device is aligned to a page > boundary. However the code to compute 'last_page' assumes that there > are 8 sectors/page, which is only true for a 4K page-size. > > This assumption breaks on architectures which use a different page > size, specifically PPC64 where page-size == 64K. Hence a warning is > seen while trying to mount an xfs/ext4 file-system with dax enabled: > > $ sudo mount -o dax /dev/pmem0 /mnt/pmem > XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk > XFS (pmem0): DAX unsupported by block device. Turning off DAX. > > The patch fixes this issue by updating the calculation of 'last_page' to > take into account the number of sectors/page instead of assuming it to be > '8'. Yes, I noticed this too and fixed it up in a wider change that also allows device-mapper to validate each component device. Does this patch work for you? https://lore.kernel.org/lkml/155789172402.748145.11853718580748830476.st...@dwillia2-desk3.amr.corp.intel.com/
[GIT PULL] libnvdimm fixes for v5.2-rc1
Hi Linus, please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.2-rc1 ...to receive just a small collection of fixes this time around. The new virtio-pmem driver is nearly ready, but some last minute device-mapper acks and virtio questions made it prudent to await v5.3. Other major topics that were brewing on the linux-nvdimm mailing list like sub-section hotplug, and other devm_memremap_pages() reworks will go upstream through Andrew's tree. These have seen multiple -next releases with no reported issues. --- The following changes since commit 085b7755808aa11f78ab9377257e1dad2e6fa4bb: Linux 5.1-rc6 (2019-04-21 10:45:57 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.2-rc1 for you to fetch changes up to 67476656febd7ec5f1fe1aeec3c441fcf53b1e45: drivers/dax: Allow to include DEV_DAX_PMEM as builtin (2019-05-07 07:48:06 -0700) libnvdimm fixes 5.2-rc1 * Fix a long standing namespace label corruption scenario when re-provisioning capacity for a namespace. * Restore the ability of the dax_pmem module to be built-in. * Harden the build for the 'nfit_test' unit test modules so that the userspace test harness can ensure all required test modules are available. 
Aneesh Kumar K.V (1):
      drivers/dax: Allow to include DEV_DAX_PMEM as builtin

Dan Williams (1):
      libnvdimm/namespace: Fix label tracking error

Vishal Verma (2):
      dax/pmem: Fix whitespace in dax_pmem
      tools/testing/nvdimm: add watermarks for dax_pmem* modules

 drivers/dax/Kconfig                         |  3 +--
 drivers/dax/pmem/core.c                     |  6 +++---
 drivers/nvdimm/label.c                      | 29 -
 drivers/nvdimm/namespace_devs.c             | 15 +++
 drivers/nvdimm/nd.h                         |  4
 tools/testing/nvdimm/Kbuild                 |  3 +++
 tools/testing/nvdimm/dax_pmem_compat_test.c |  8
 tools/testing/nvdimm/dax_pmem_core_test.c   |  8
 tools/testing/nvdimm/dax_pmem_test.c        |  8
 tools/testing/nvdimm/test/nfit.c            |  3 +++
 tools/testing/nvdimm/watermark.h            |  3 +++
 11 files changed, 72 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/nvdimm/dax_pmem_compat_test.c
 create mode 100644 tools/testing/nvdimm/dax_pmem_core_test.c
 create mode 100644 tools/testing/nvdimm/dax_pmem_test.c
Re: [RESEND PATCH] nvdimm: fix some compilation warnings
On Wed, May 15, 2019 at 5:25 PM Verma, Vishal L wrote: > > On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote: > > > > > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c > > > index 4671776f5623..9f02a99cfac0 100644 > > > --- a/drivers/nvdimm/btt.c > > > +++ b/drivers/nvdimm/btt.c > > > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct > > > bio_integrity_payload *bip, > > > > > > ret = btt_data_read(arena, page, off, postmap, cur_len); > > > if (ret) { > > > - int rc; > > > - > > > /* Media error - set the e_flag */ > > > - rc = btt_map_write(arena, premap, postmap, 0, 1, > > > - NVDIMM_IO_ATOMIC); > > > + btt_map_write(arena, premap, postmap, 0, 1, > > > + NVDIMM_IO_ATOMIC); > > > goto out_rtt; > > > > This doesn't look correct to me, shouldn't we at least be logging that > > the bad-block failed to be persistently tracked? > > Yes logging it sounds good to me. Qian, can you include this in your > respin or shall I send a fix for it separately (since we were always > ignoring the failure here regardless of this patch)? I think a separate fix for this makes more sense. Likely also needs to be a ratelimited message in case a storm of errors is encountered.
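The suggested ratelimited logging can be modeled with a toy token counter in the spirit of the kernel's ratelimit machinery (dev_err_ratelimited() and friends); names and structure here are illustrative only. The point is that on a media-error storm, only the first burst of failures is logged while the rest are suppressed.

```c
#include <assert.h>

/* Toy rate limiter: allow up to 'burst' events, suppress the rest.
 * The real kernel ratelimit also refills the budget per interval. */
struct ratelimit {
        int burst;
        int used;
};

/* Returns nonzero if this event may be logged. */
static int ratelimit_ok(struct ratelimit *rl)
{
        if (rl->used >= rl->burst)
                return 0;       /* suppressed */
        rl->used++;
        return 1;
}

/* Usage sketch for the btt_map_write() failure path:
 *
 *     if (rc && ratelimit_ok(&rl))
 *         log("btt: failed to persist media-error flag");
 */
```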
Re: [v5 0/3] "Hotremove" persistent memory
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin wrote: > > > Hi Pavel, > > > > I am working on adding this sort of a workflow into a new daxctl command > > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a > > dax device to kmem, online the resulting memory, and with your patches, > > also attempt to offline the memory, and change back to device-dax. > > > > In running with these patches, and testing the offlining part, I ran > > into the following lockdep below. > > > > This is with just these three patches on top of -rc7. > > > > > > [ +0.004886] == > > [ +0.001576] WARNING: possible circular locking dependency detected > > [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O > > [ +0.000929] -- > > [ +0.000708] daxctl/22950 is trying to acquire lock: > > [ +0.000548] f4d397f7 (kn->count#424){}, at: > > kernfs_remove_by_name_ns+0x40/0x80 > > [ +0.000922] > > but task is already holding lock: > > [ +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: > > unregister_memory_section+0x22/0xa0 > > I have studied this issue, and now have a clear understanding why it > happens, I am not yet sure how to fix it, so suggestions are welcomed > :) I would think that ACPI hotplug would have a similar problem, but it does this: acpi_unbind_memory_blocks(info); __remove_memory(nid, info->start_addr, info->length); I wonder if that ordering prevents going too deep into the device_unregister() call stack that you highlighted below. 
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>   vfs_write
>     __vfs_write
>       kernfs_fop_write
>         kernfs_get_active
>           lock_acquire                kn->count#122 (lock for "memory8/state" kn)
>         sysfs_kf_write
>           dev_attr_store
>             state_store
>               device_offline
>                 memory_subsys_offline
>                   memory_block_action
>                     offline_pages
>                       __offline_pages
>                         percpu_down_write
>                           down_write
>                             lock_acquire  mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following stack:
>
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>   unbind_store
>     device_driver_detach
>       device_release_driver_internal
>         dev_dax_kmem_remove
>           remove_memory               device_hotplug_lock
>             try_remove_memory         mem_hotplug_lock.rw_sem
>               arch_remove_memory
>                 __remove_pages
>                   __remove_section
>                     unregister_memory_section
>                       remove_memory_section   mem_sysfs_mutex
>                         unregister_memory
>                           device_unregister
>                             device_del
>                               device_remove_attrs
>                                 sysfs_remove_groups
>                                   sysfs_remove_group
>                                     remove_files
>                                       kernfs_remove_by_name
>                                         kernfs_remove_by_name_ns
>                                           __kernfs_remove   kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122
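The inversion in those two stacks can be modeled with a toy lock-order tracker (illustrative only; real lockdep builds a full dependency graph with lock classes): the offline path records the edge kn->count held while taking mem_hotplug_lock, the unbind path records the reverse edge, and the resulting cycle is exactly the deadlock potential lockdep reports.

```c
#include <assert.h>
#include <string.h>

#define MAX_EDGES 16

/* An edge records that 'from' was held while 'to' was acquired. */
struct edge {
        const char *from, *to;
};

static struct edge edges[MAX_EDGES];
static int nedges;

static void record_order(const char *from, const char *to)
{
        edges[nedges].from = from;
        edges[nedges].to = to;
        nedges++;
}

/* Report whether the recorded orders contain a direct A->B / B->A cycle. */
static int has_inversion(void)
{
        for (int i = 0; i < nedges; i++)
                for (int j = 0; j < nedges; j++)
                        if (strcmp(edges[i].from, edges[j].to) == 0 &&
                            strcmp(edges[i].to, edges[j].from) == 0)
                                return 1;
        return 0;
}
```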
Re: [PATCH v2 12/30] dax: remove block device dependencies
On Wed, May 15, 2019 at 12:28 PM Vivek Goyal wrote: > > From: Stefan Hajnoczi > > Although struct dax_device itself is not tied to a block device, some > DAX code assumes there is a block device. Make block devices optional > by allowing bdev to be NULL in commonly used DAX APIs. > > When there is no block device: > * Skip the partition offset calculation in bdev_dax_pgoff() > * Skip the blkdev_issue_zeroout() optimization > > Note that more block device assumptions remain but I haven't reached those > code paths yet. > Is there a generic object that non-block-based filesystems reference for physical storage as a bdev stand-in? I assume "sector_t" is still the common type for addressing filesystem capacity? It just seems to me that we should stop pretending that the filesystem-dax facility requires block devices and try to move this functionality to generically use a dax device across all interfaces.
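The "bdev may be NULL" change can be sketched with a simplified page-offset helper modeled loosely on bdev_dax_pgoff(); the types and names below are illustrative, not the real kernel API. The partition-start adjustment only applies when a block device is present, so a NULL bdev simply means the sector already addresses the dax device directly.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12      /* 4K pages for this sketch */
#define SECTOR_SHIFT    9       /* 512-byte sectors */

/* Stand-in for the partition information a block device carries. */
struct bdev {
        uint64_t start_sect;    /* partition start, in sectors */
};

/* Translate a sector to a page offset; bdev may be NULL. */
static int dax_pgoff(const struct bdev *bdev, uint64_t sector,
                     uint64_t *pgoff)
{
        uint64_t phys = sector << SECTOR_SHIFT;

        if (bdev)       /* partition offset only exists with a bdev */
                phys += bdev->start_sect << SECTOR_SHIFT;
        if (phys & ((1ULL << PAGE_SHIFT) - 1))
                return -1;      /* not page aligned */
        *pgoff = phys >> PAGE_SHIFT;
        return 0;
}
```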
Re: [RESEND PATCH] nvdimm: fix some compilation warnings
On Tue, May 14, 2019 at 8:08 AM Qian Cai wrote: > > Several places (dimm_devs.c, core.c etc) include label.h but only > label.c uses NSINDEX_SIGNATURE, so move its definition to label.c > instead. > In file included from drivers/nvdimm/dimm_devs.c:23: > drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but > not used [-Wunused-const-variable=] > > The commit d9b83c756953 ("libnvdimm, btt: rework error clearing") left > an unused variable. > drivers/nvdimm/btt.c: In function 'btt_read_pg': > drivers/nvdimm/btt.c:1272:8: warning: variable 'rc' set but not used > [-Wunused-but-set-variable] > > Last, some places abuse "/**" which is only reserved for the kernel-doc. > drivers/nvdimm/bus.c:648: warning: cannot understand function prototype: > 'struct attribute_group nd_device_attribute_group = ' > drivers/nvdimm/bus.c:677: warning: cannot understand function prototype: > 'struct attribute_group nd_numa_attribute_group = ' Can you include the compiler where these errors start appearing, since I don't see these warnings with gcc-8.3.1 > > Reviewed-by: Vishal Verma > Signed-off-by: Qian Cai > --- > drivers/nvdimm/btt.c | 6 ++ > drivers/nvdimm/bus.c | 4 ++-- > drivers/nvdimm/label.c | 2 ++ > drivers/nvdimm/label.h | 2 -- > 4 files changed, 6 insertions(+), 8 deletions(-) > > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c > index 4671776f5623..9f02a99cfac0 100644 > --- a/drivers/nvdimm/btt.c > +++ b/drivers/nvdimm/btt.c > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct > bio_integrity_payload *bip, > > ret = btt_data_read(arena, page, off, postmap, cur_len); > if (ret) { > - int rc; > - > /* Media error - set the e_flag */ > - rc = btt_map_write(arena, premap, postmap, 0, 1, > - NVDIMM_IO_ATOMIC); > + btt_map_write(arena, premap, postmap, 0, 1, > + NVDIMM_IO_ATOMIC); > goto out_rtt; This doesn't look correct to me, shouldn't we at least be logging that the bad-block failed to be persistently tracked? 
> } > > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c > index 7ff684159f29..2eb6a6cfe9e4 100644 > --- a/drivers/nvdimm/bus.c > +++ b/drivers/nvdimm/bus.c > @@ -642,7 +642,7 @@ static struct attribute *nd_device_attributes[] = { > NULL, > }; > > -/** > +/* > * nd_device_attribute_group - generic attributes for all devices on an nd > bus > */ > struct attribute_group nd_device_attribute_group = { > @@ -671,7 +671,7 @@ static umode_t nd_numa_attr_visible(struct kobject *kobj, > struct attribute *a, > return a->mode; > } > > -/** > +/* > * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus > */ Lets just fix this to be a valid kernel-doc format for a struct. @@ -672,7 +672,7 @@ static umode_t nd_numa_attr_visible(struct kobject *kobj, struct attribute *a, } /** - * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus + * struct nd_numa_attribute_group - NUMA attributes for all devices on an nd bus */ struct attribute_group nd_numa_attribute_group = { .attrs = nd_numa_attributes, > struct attribute_group nd_numa_attribute_group = { > diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c > index f3d753d3169c..02a51b7775e1 100644 > --- a/drivers/nvdimm/label.c > +++ b/drivers/nvdimm/label.c > @@ -25,6 +25,8 @@ static guid_t nvdimm_btt2_guid; > static guid_t nvdimm_pfn_guid; > static guid_t nvdimm_dax_guid; > > +static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0"; > + Looks good to me. > static u32 best_seq(u32 a, u32 b) > { > a &= NSINDEX_SEQ_MASK; > diff --git a/drivers/nvdimm/label.h b/drivers/nvdimm/label.h > index e9a2ad3c2150..4bb7add39580 100644 > --- a/drivers/nvdimm/label.h > +++ b/drivers/nvdimm/label.h > @@ -38,8 +38,6 @@ enum { > ND_NSINDEX_INIT = 0x1, > }; > > -static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0"; > - > /** > * struct nd_namespace_index - label set superblock > * @sig: NAMESPACE_INDEX\0 > -- > 2.20.1 (Apple Git-117) >
Re: [PATCH v9 4/7] dm: enable synchronous dax
[ add Mike and dm-devel ] Mike, any concerns with the below addition to the device-mapper-dax implementation? On Tue, May 14, 2019 at 7:58 AM Pankaj Gupta wrote: > > This patch sets the dax device 'DAXDEV_SYNC' flag if all the target > devices of the device mapper support synchronous DAX. If the device > mapper consists of both synchronous and asynchronous dax devices, > we don't set the 'DAXDEV_SYNC' flag. > > Signed-off-by: Pankaj Gupta > --- > drivers/md/dm-table.c | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > index cde3b49b2a91..1cce626ff576 100644 > --- a/drivers/md/dm-table.c > +++ b/drivers/md/dm-table.c > @@ -886,10 +886,17 @@ static int device_supports_dax(struct dm_target *ti, > struct dm_dev *dev, > return bdev_dax_supported(dev->bdev, PAGE_SIZE); > } > > +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev, > + sector_t start, sector_t len, void *data) > +{ > + return dax_synchronous(dev->dax_dev); > +} > + > static bool dm_table_supports_dax(struct dm_table *t) > { > struct dm_target *ti; > unsigned i; > + bool dax_sync = true; > > /* Ensure that all targets support DAX. */ > for (i = 0; i < dm_table_get_num_targets(t); i++) { > @@ -901,7 +908,14 @@ static bool dm_table_supports_dax(struct dm_table *t) > if (!ti->type->iterate_devices || > !ti->type->iterate_devices(ti, device_supports_dax, NULL)) > return false; > + > + /* Check devices support synchronous DAX */ > + if (dax_sync && > + !ti->type->iterate_devices(ti, device_synchronous, NULL)) > + dax_sync = false; > } > + if (dax_sync) > + set_dax_synchronous(t->md->dax_dev); > > return true; > } > -- > 2.20.1 >
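The aggregation pattern used by dm_table_supports_dax(), where a table only advertises a capability if every component device reports it, can be sketched with illustrative types (this is not the real device-mapper iterate_devices API):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a component device's capabilities. */
struct dev {
        int supports_dax;
        int synchronous;
};

typedef int (*dev_pred)(const struct dev *d);

static int dev_supports_dax(const struct dev *d) { return d->supports_dax; }
static int dev_synchronous(const struct dev *d)  { return d->synchronous; }

/* A capability holds for the table only if every device satisfies the
 * predicate, mirroring the iterate_devices() loop in the patch. */
static int all_devices(const struct dev *devs, size_t n, dev_pred p)
{
        for (size_t i = 0; i < n; i++)
                if (!p(&devs[i]))
                        return 0;
        return 1;
}
```

This is why a table mixing synchronous and asynchronous dax members ends up without DAXDEV_SYNC: one asynchronous member is enough to fail the aggregate check.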
Re: [GIT PULL] device-dax for 5.1: PMEM as RAM
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds wrote: > > On Mon, Mar 11, 2019 at 8:37 AM Dan Williams wrote: > > > > Another feature the userspace tooling can support for the PMEM as RAM > > case is the ability to complete an Address Range Scrub of the range > > before it is added to the core-mm. I.e at least ensure that previously > > encountered poison is eliminated. > > Ok, so this at least makes sense as an argument to me. > > In the "PMEM as filesystem" part, the errors have long-term history, > while in "PMEM as RAM" the memory may be physically the same thing, > but it doesn't have the history and as such may not be prone to > long-term errors the same way. > > So that validly argues that yes, when used as RAM, the likelihood for > errors is much lower because they don't accumulate the same way. In case anyone is looking for the above mentioned tooling for use with the v5.1 kernel, Vishal has released ndctl-v65 with the new "clear-errors" command [1]. [1]: https://pmem.io/ndctl/ndctl-clear-errors.html
Re: [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action()
On Tue, May 14, 2019 at 12:12 PM Greg Kroah-Hartman wrote: > > On Tue, May 07, 2019 at 04:56:05PM -0700, Dan Williams wrote: > > The devm_add_action() facility allows a resource allocation routine to > > add custom devm semantics. One such user is devm_memremap_pages(). > > > > There is now a need to manually trigger devm_memremap_pages_release(). > > Introduce devm_release_action() so the release action can be triggered > > via a new devm_memunmap_pages() api in a follow-on change. > > > > Cc: Logan Gunthorpe > > Cc: Bjorn Helgaas > > Cc: Christoph Hellwig > > Cc: Greg Kroah-Hartman > > Cc: "Rafael J. Wysocki" > > Reviewed-by: Ira Weiny > > Signed-off-by: Dan Williams > > --- > > drivers/base/devres.c | 24 +++- > > include/linux/device.h |1 + > > 2 files changed, 24 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/base/devres.c b/drivers/base/devres.c > > index e038e2b3b7ea..0bbb328bd17f 100644 > > --- a/drivers/base/devres.c > > +++ b/drivers/base/devres.c > > @@ -755,10 +755,32 @@ void devm_remove_action(struct device *dev, void > > (*action)(void *), void *data) > > > > WARN_ON(devres_destroy(dev, devm_action_release, devm_action_match, > > &devres)); > > - > > } > > EXPORT_SYMBOL_GPL(devm_remove_action); > > > > +/** > > + * devm_release_action() - release previously added custom action > > + * @dev: Device that owns the action > > + * @action: Function implementing the action > > + * @data: Pointer to data passed to @action implementation > > + * > > + * Releases and removes instance of @action previously added by > > + * devm_add_action(). Both action and data should match one of the > > + * existing entries. > > + */ > > +void devm_release_action(struct device *dev, void (*action)(void *), void > > *data) > > +{ > > + struct action_devres devres = { > > + .data = data, > > + .action = action, > > + }; > > + > > + WARN_ON(devres_release(dev, devm_action_release, devm_action_match, > > + &devres)); > > What does WARN_ON help here?
are we going to start getting syzbot > reports of this happening? Hopefully, yes, if developers misuse the api they get a loud notification similar to devm_remove_action() misuse. > How can this fail? It's a catch to make sure that @dev actually has a live devres resource that can be found via @action and @data.
Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
On Tue, May 14, 2019 at 11:53 AM Jane Chu wrote: > > On 5/13/2019 12:22 PM, Logan Gunthorpe wrote: > > On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote: > > On 2019-05-07 5:55 p.m., Dan Williams wrote: > > Changes since v1 [1]: > - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan) > > - Refresh the p2pdma patch headers to match the format of other p2pdma >patches (Bjorn) > > - Collect Ira's reviewed-by > > [1]: > https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.st...@dwillia2-desk3.amr.corp.intel.com/ > > This series looks good to me: > > Reviewed-by: Logan Gunthorpe > > However, I haven't tested it yet but I intend to later this week. > > I've tested libnvdimm-pending which includes this series on my setup and > everything works great. > > Just wondering in a difference scenario where pmem pages are exported to > a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f", > will the kernel wait indefinitely until the user figures out to kill the guest > and release the pmem pages? It depends on whether the pages are pinned. Typically DAX memory mappings assigned to a guest are not pinned in the host and can be invalidated at any time. The pinning only occurs with VFIO and device-assignment which isn't the common case, especially since that configuration is blocked by fsdax. However, with devdax, yes you can arrange for the system to go into an indefinite wait. This somewhat ties back to the get_user_pages() vs DAX debate. The indefinite stall issue with device-assignment could be addressed with a requirement to hold a lease and expect that a lease revocation event may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The expectation with device-dax is that it is already a raw interface with pointy edges and caveats, but I would not be opposed to introducing a lease semantic.
Re: [PATCH v8 00/12] mm: Sub-section memory hotplug support
On Mon, May 13, 2019 at 2:02 PM Mike Rapoport wrote: > > Hi Dan, > > On Mon, May 06, 2019 at 04:39:26PM -0700, Dan Williams wrote: > > Changes since v7 [1]: > > Sorry for jumping late No worries, it needs to be rebased on David's "mm/memory_hotplug: Factor out memory block device handling" series which puts it firmly in v5.3 territory. > but presuming there will be v9, it'd be great if it > would also include include updates to > Documentation/admin-guide/mm/memory-hotplug.rst and If I've done my job right this series should be irrelevant to Documentation/admin-guide/mm/memory-hotplug.rst. The subsection capability is strictly limited to arch_add_memory() users that never online the memory, i.e. only ZONE_DEVICE / devm_memremap_pages() users. So this isn't "memory-hotplug" as much as it is "support for subsection vmemmap management". > Documentation/vm/memory-model.rst This looks more fitting and should probably include a paragraph on the general ZONE_DEVICE / devm_memremap_pages() use case.
Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag
On Fri, May 10, 2019 at 5:45 PM Pankaj Gupta wrote: > > > > > > > > > This patch adds 'DAXDEV_SYNC' flag which is set > > > for nd_region doing synchronous flush. This is later > > > used to disable MAP_SYNC functionality for > > > ext4 & xfs filesystems for devices that don't support > > > synchronous flush. > > > > > > Signed-off-by: Pankaj Gupta > > > --- > > > drivers/dax/bus.c| 2 +- > > > drivers/dax/super.c | 13 - > > > drivers/md/dm.c | 3 ++- > > > drivers/nvdimm/pmem.c| 5 - > > > drivers/nvdimm/region_devs.c | 7 +++ > > > include/linux/dax.h | 8 ++-- > > > include/linux/libnvdimm.h| 1 + > > > 7 files changed, 33 insertions(+), 6 deletions(-) > > [..] > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > > index 043f0761e4a0..ee007b75d9fd 100644 > > > --- a/drivers/md/dm.c > > > +++ b/drivers/md/dm.c > > > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor) > > > sprintf(md->disk->disk_name, "dm-%d", minor); > > > > > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) { > > > - dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops); > > > + dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops, > > > + DAXDEV_F_SYNC); > > > > Apologies for not realizing this until now, but this is broken. > > Imagine a device-mapper configuration composed of both 'async' > > virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified > > across all members. I would change this argument to '0' and then > > arrange for it to be set at dm_table_supports_dax() time after > > validating that all components support synchronous dax. > > o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target > components support synchronous DAX. > > Just a question: if a device-mapper configuration is composed of both > virtio-pmem and pmem devices, do we want to configure the device mapper for async > flush? If it's composed of both then, yes, it needs to be async flush at the device-mapper level. 
Otherwise MAP_SYNC will succeed and fail to trigger fsync on the host file when necessary for the virtio-pmem backed portion of the device-mapper device.
Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver
On Wed, May 8, 2019 at 4:19 AM Pankaj Gupta wrote: > > > Hi Dan, > > Thank you for the review. Please see my reply inline. > > > > > Hi Pankaj, > > > > Some minor file placement comments below. > > Sure. > > > > > On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta wrote: > > > > > > This patch adds virtio-pmem driver for KVM guest. > > > > > > Guest reads the persistent memory range information from > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also > > > creates a nd_region object with the persistent memory > > > range information so that existing 'nvdimm/pmem' driver > > > can reserve this into system memory map. This way > > > 'virtio-pmem' driver uses existing functionality of pmem > > > driver to register persistent memory compatible for DAX > > > capable filesystems. > > > > > > This also provides function to perform guest flush over > > > VIRTIO from 'pmem' driver when userspace performs flush > > > on DAX memory range. > > > > > > Signed-off-by: Pankaj Gupta > > > --- > > > drivers/nvdimm/virtio_pmem.c | 114 + > > > drivers/virtio/Kconfig | 10 +++ > > > drivers/virtio/Makefile | 1 + > > > drivers/virtio/pmem.c| 118 +++ > > > include/linux/virtio_pmem.h | 60 > > > include/uapi/linux/virtio_ids.h | 1 + > > > include/uapi/linux/virtio_pmem.h | 10 +++ > > > 7 files changed, 314 insertions(+) > > > create mode 100644 drivers/nvdimm/virtio_pmem.c > > > create mode 100644 drivers/virtio/pmem.c > > > create mode 100644 include/linux/virtio_pmem.h > > > create mode 100644 include/uapi/linux/virtio_pmem.h > > > > > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c > > > new file mode 100644 > > > index ..66b582f751a3 > > > --- /dev/null > > > +++ b/drivers/nvdimm/virtio_pmem.c > > > @@ -0,0 +1,114 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * virtio_pmem.c: Virtio pmem Driver > > > + * > > > + * Discovers persistent memory range information > > > + * from host and provides a virtio based flushing > > > + * 
interface. > > > + */ > > > +#include > > > +#include "nd.h" > > > + > > > + /* The interrupt handler */ > > > +void host_ack(struct virtqueue *vq) > > > +{ > > > + unsigned int len; > > > + unsigned long flags; > > > + struct virtio_pmem_request *req, *req_buf; > > > + struct virtio_pmem *vpmem = vq->vdev->priv; > > > + > > > + spin_lock_irqsave(>pmem_lock, flags); > > > + while ((req = virtqueue_get_buf(vq, )) != NULL) { > > > + req->done = true; > > > + wake_up(>host_acked); > > > + > > > + if (!list_empty(>req_list)) { > > > + req_buf = list_first_entry(>req_list, > > > + struct virtio_pmem_request, list); > > > + list_del(>req_list); > > > + req_buf->wq_buf_avail = true; > > > + wake_up(_buf->wq_buf); > > > + } > > > + } > > > + spin_unlock_irqrestore(>pmem_lock, flags); > > > +} > > > +EXPORT_SYMBOL_GPL(host_ack); > > > + > > > + /* The request submission function */ > > > +int virtio_pmem_flush(struct nd_region *nd_region) > > > +{ > > > + int err; > > > + unsigned long flags; > > > + struct scatterlist *sgs[2], sg, ret; > > > + struct virtio_device *vdev = nd_region->provider_data; > > > + struct virtio_pmem *vpmem = vdev->priv; > > > + struct virtio_pmem_request *req; > > > + > > > + might_sleep(); > > > + req = kmalloc(sizeof(*req), GFP_KERNEL); > > > + if (!req) > > > + return -ENOMEM; > > > + > > > + req->done = req->wq_buf_avail = false; > > > + strcpy(req->name, "FLUSH"); > > > + init_waitqueue_head(>host_acked); > > > + init_waitqueue_head(>wq_buf); > > > + sg_init_one(, req->name, strlen(req->name)); > > > + sgs[0] = > > > + sg_init_one(, >ret, sizeof(req->ret)); > > > + sgs[1] = > > > + > > > + spin_lock_irqsave(>pmem_lock, flags); > > > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, > > > GFP_ATOMIC); > > > + if (err) { > > > + dev_err(>dev, "failed to send command to virtio pmem > > > device\n"); > > > + > > > + list_add_tail(>req_list, >list); > > > + spin_unlock_irqrestore(>pmem_lock, flags); > > > + > > > + /* When host has read 
buffer, this completes via host_ack > > > */ > > > + wait_event(req->wq_buf, req->wq_buf_avail); > > > + spin_lock_irqsave(>pmem_lock, flags); > > > + } > > > + err = virtqueue_kick(vpmem->req_vq); > > > + spin_unlock_irqrestore(>pmem_lock, flags); > > > + > > > + if (!err) { > > > + err = -EIO; > > > + goto ret; > > > + } > > > + /* When host has
Re: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing
On Fri, May 10, 2019 at 9:32 AM Elliott, Robert (Servers) wrote: > > > > > -Original Message- > > From: Linux-nvdimm On Behalf Of > > Larry Bassel > > Sent: Thursday, May 09, 2019 11:06 AM > > Subject: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD > > sharing > > > > If enabled, sharing of FS/DAX PMDs will be attempted. > > > ... > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > ... > > > > +config MAY_SHARE_FSDAX_PMD > > +def_bool y > > + > > Is a config option really necessary - is there any reason to > not choose to do this? Agree. Either the arch implementation supports it or it doesn't, I don't see a need for any further configuration flexibility. Seems ARCH_WANT_HUGE_PMD_SHARE should be renamed ARCH_HAS_HUGE_PMD_SHARE and then auto-enable it.
[PATCH] mm/huge_memory: Fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses
Starting with commit c6f3c5ee40c1 "mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()" vmf_insert_pfn_pmd() internally calls pmdp_set_access_flags(). That helper enforces a pmd aligned @address argument via VM_BUG_ON() assertion. Update the implementation to take a 'struct vm_fault' argument directly and apply the address alignment fixup internally to fix crash signatures like: kernel BUG at arch/x86/mm/pgtable.c:515! invalid opcode: [#1] SMP NOPTI CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1 [..] RIP: 0010:pmdp_set_access_flags+0x48/0x50 [..] Call Trace: vmf_insert_pfn_pmd+0x198/0x350 dax_iomap_fault+0xe82/0x1190 ext4_dax_huge_fault+0x103/0x1f0 ? __switch_to_asm+0x40/0x70 __handle_mm_fault+0x3f6/0x1370 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 handle_mm_fault+0xda/0x200 __do_page_fault+0x249/0x4f0 do_page_fault+0x32/0x110 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 Cc: Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") Reported-by: Piotr Balcer Tested-by: Yan Ma Cc: Aneesh Kumar K.V Cc: Chandan Rajendra Cc: Jan Kara Cc: Andrew Morton Cc: Matthew Wilcox Cc: Souptick Joarder Signed-off-by: Dan Williams --- drivers/dax/device.c|6 ++ fs/dax.c|6 ++ include/linux/huge_mm.h |6 ++ mm/huge_memory.c| 16 ++-- 4 files changed, 16 insertions(+), 18 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index e428468ab661..996d68ff992a 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -184,8 +184,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags); - return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, *pfn, - vmf->flags & FAULT_FLAG_WRITE); + return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE); } #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD @@ -235,8 +234,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags); 
- return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, *pfn, - vmf->flags & FAULT_FLAG_WRITE); + return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE); } #else static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, diff --git a/fs/dax.c b/fs/dax.c index e5e54da1715f..83009875308c 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1575,8 +1575,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, } trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry); - result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn, - write); + result = vmf_insert_pfn_pmd(vmf, pfn, write); break; case IOMAP_UNWRITTEN: case IOMAP_HOLE: @@ -1686,8 +1685,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn); #ifdef CONFIG_FS_DAX_PMD else if (order == PMD_ORDER) - ret = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, - pfn, true); + ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE); #endif else ret = VM_FAULT_FALLBACK; diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 381e872bfde0..7cd5c150c21d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -47,10 +47,8 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, pgprot_t newprot, int prot_numa); -vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, pfn_t pfn, bool write); -vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, pfn_t pfn, bool write); +vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write); +vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 
165ea46bf149..4310c6e9e5a3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -793,11 +793,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, pte_free(mm, pgtable); } -vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vm
[PATCH v2 5/6] PCI/P2PDMA: Track pgmap references per resource, not globally
In preparation for fixing a race between devm_memremap_pages_release() and the final put of a page from the device-page-map, allocate a percpu-ref per p2pdma resource mapping. Cc: Logan Gunthorpe Cc: Bjorn Helgaas Cc: Christoph Hellwig Reviewed-by: Ira Weiny Signed-off-by: Dan Williams --- drivers/pci/p2pdma.c | 124 +- 1 file changed, 81 insertions(+), 43 deletions(-) diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 595a534bd749..54d475569058 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -20,12 +20,16 @@ #include struct pci_p2pdma { - struct percpu_ref devmap_ref; - struct completion devmap_ref_done; struct gen_pool *pool; bool p2pmem_published; }; +struct p2pdma_pagemap { + struct dev_pagemap pgmap; + struct percpu_ref ref; + struct completion ref_done; +}; + static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -74,41 +78,45 @@ static const struct attribute_group p2pmem_group = { .name = "p2pmem", }; +static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref) +{ + return container_of(ref, struct p2pdma_pagemap, ref); +} + static void pci_p2pdma_percpu_release(struct percpu_ref *ref) { - struct pci_p2pdma *p2p = - container_of(ref, struct pci_p2pdma, devmap_ref); + struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref); - complete_all(>devmap_ref_done); + complete(_pgmap->ref_done); } static void pci_p2pdma_percpu_kill(struct percpu_ref *ref) { - /* -* pci_p2pdma_add_resource() may be called multiple times -* by a driver and may register the percpu_kill devm action multiple -* times. We only want the first action to actually kill the -* percpu_ref. 
-*/ - if (percpu_ref_is_dying(ref)) - return; - percpu_ref_kill(ref); } +static void pci_p2pdma_percpu_cleanup(void *ref) +{ + struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref); + + wait_for_completion(_pgmap->ref_done); + percpu_ref_exit(_pgmap->ref); +} + static void pci_p2pdma_release(void *data) { struct pci_dev *pdev = data; + struct pci_p2pdma *p2pdma = pdev->p2pdma; - if (!pdev->p2pdma) + if (!p2pdma) return; - wait_for_completion(>p2pdma->devmap_ref_done); - percpu_ref_exit(>p2pdma->devmap_ref); + /* Flush and disable pci_alloc_p2p_mem() */ + pdev->p2pdma = NULL; + synchronize_rcu(); - gen_pool_destroy(pdev->p2pdma->pool); + gen_pool_destroy(p2pdma->pool); sysfs_remove_group(>dev.kobj, _group); - pdev->p2pdma = NULL; } static int pci_p2pdma_setup(struct pci_dev *pdev) @@ -124,12 +132,6 @@ static int pci_p2pdma_setup(struct pci_dev *pdev) if (!p2p->pool) goto out; - init_completion(>devmap_ref_done); - error = percpu_ref_init(>devmap_ref, - pci_p2pdma_percpu_release, 0, GFP_KERNEL); - if (error) - goto out_pool_destroy; - error = devm_add_action_or_reset(>dev, pci_p2pdma_release, pdev); if (error) goto out_pool_destroy; @@ -163,6 +165,7 @@ static int pci_p2pdma_setup(struct pci_dev *pdev) int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { + struct p2pdma_pagemap *p2p_pgmap; struct dev_pagemap *pgmap; void *addr; int error; @@ -185,14 +188,32 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, return error; } - pgmap = devm_kzalloc(>dev, sizeof(*pgmap), GFP_KERNEL); - if (!pgmap) + p2p_pgmap = devm_kzalloc(>dev, sizeof(*p2p_pgmap), GFP_KERNEL); + if (!p2p_pgmap) return -ENOMEM; + init_completion(_pgmap->ref_done); + error = percpu_ref_init(_pgmap->ref, + pci_p2pdma_percpu_release, 0, GFP_KERNEL); + if (error) + goto pgmap_free; + + /* +* FIXME: the percpu_ref_exit needs to be coordinated internal +* to devm_memremap_pages_release(). 
Duplicate the same ordering +* as other devm_memremap_pages() users for now. +*/ + error = devm_add_action(>dev, pci_p2pdma_percpu_cleanup, + _pgmap->ref); + if (error) + goto ref_cleanup; + + pgmap = _pgmap->pgmap; + pgmap->res.start = pci_resource_start(pdev, bar) + offset; pgmap->res.end = pgmap->res.start + size - 1; pgmap->res.flags = pci_resource_flags(pdev, bar); - pgmap->ref = >p2pdma->devmap_ref; + pgmap->ref = _pgmap->
[PATCH v2 2/6] mm/devm_memremap_pages: Introduce devm_memunmap_pages
Use the new devm_release_action() facility to allow devm_memremap_pages_release() to be manually triggered. Cc: Logan Gunthorpe Cc: Bjorn Helgaas Cc: Christoph Hellwig Reviewed-by: Ira Weiny Signed-off-by: Dan Williams --- include/linux/memremap.h |6 ++ kernel/memremap.c|6 ++ 2 files changed, 12 insertions(+) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index f0628660d541..7601ee314c4a 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -100,6 +100,7 @@ struct dev_pagemap { #ifdef CONFIG_ZONE_DEVICE void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); +void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap); struct dev_pagemap *get_dev_pagemap(unsigned long pfn, struct dev_pagemap *pgmap); @@ -118,6 +119,11 @@ static inline void *devm_memremap_pages(struct device *dev, return ERR_PTR(-ENXIO); } +static inline void devm_memunmap_pages(struct device *dev, + struct dev_pagemap *pgmap) +{ +} + static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn, struct dev_pagemap *pgmap) { diff --git a/kernel/memremap.c b/kernel/memremap.c index a856cb5ff192..65afbacab44e 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -266,6 +266,12 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) } EXPORT_SYMBOL_GPL(devm_memremap_pages); +void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap) +{ + devm_release_action(dev, devm_memremap_pages_release, pgmap); +} +EXPORT_SYMBOL_GPL(devm_memunmap_pages); + unsigned long vmem_altmap_offset(struct vmem_altmap *altmap) { /* number of pfns from base where pfn_to_page() is valid */
[PATCH v2 6/6] mm/devm_memremap_pages: Fix final page put race
Logan noticed that devm_memremap_pages_release() kills the percpu_ref drops all the page references that were acquired at init and then immediately proceeds to unplug, arch_remove_memory(), the backing pages for the pagemap. If for some reason device shutdown actually collides with a busy / elevated-ref-count page then arch_remove_memory() should be deferred until after that reference is dropped. As it stands the "wait for last page ref drop" happens *after* devm_memremap_pages_release() returns, which is obviously too late and can lead to crashes. Fix this situation by assigning the responsibility to wait for the percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup() callback. Implement the new cleanup callback for all devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma. Reported-by: Logan Gunthorpe Fixes: 41e94a851304 ("add devm_memremap_pages") Cc: Bjorn Helgaas Cc: "Jérôme Glisse" Cc: Christoph Hellwig Reviewed-by: Ira Weiny Signed-off-by: Dan Williams --- drivers/dax/device.c | 13 +++-- drivers/nvdimm/pmem.c | 17 + drivers/pci/p2pdma.c | 17 +++-- include/linux/memremap.h |2 ++ kernel/memremap.c | 17 - mm/hmm.c | 14 +++--- tools/testing/nvdimm/test/iomap.c |2 ++ 7 files changed, 38 insertions(+), 44 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index e428468ab661..e3aa78dd1bb0 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -27,9 +27,8 @@ static void dev_dax_percpu_release(struct percpu_ref *ref) complete(_dax->cmp); } -static void dev_dax_percpu_exit(void *data) +static void dev_dax_percpu_exit(struct percpu_ref *ref) { - struct percpu_ref *ref = data; struct dev_dax *dev_dax = ref_to_dev_dax(ref); dev_dbg(_dax->dev, "%s\n", __func__); @@ -468,18 +467,12 @@ int dev_dax_probe(struct device *dev) if (rc) return rc; - rc = devm_add_action_or_reset(dev, dev_dax_percpu_exit, _dax->ref); - if (rc) - return rc; - dev_dax->pgmap.ref = _dax->ref; dev_dax->pgmap.kill = dev_dax_percpu_kill; + 
dev_dax->pgmap.cleanup = dev_dax_percpu_exit; addr = devm_memremap_pages(dev, _dax->pgmap); - if (IS_ERR(addr)) { - devm_remove_action(dev, dev_dax_percpu_exit, _dax->ref); - percpu_ref_exit(_dax->ref); + if (IS_ERR(addr)) return PTR_ERR(addr); - } inode = dax_inode(dax_dev); cdev = inode->i_cdev; diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 0279eb1da3ef..1c9181712fa4 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -304,11 +304,19 @@ static const struct attribute_group *pmem_attribute_groups[] = { NULL, }; -static void pmem_release_queue(void *q) +static void __pmem_release_queue(struct percpu_ref *ref) { + struct request_queue *q; + + q = container_of(ref, typeof(*q), q_usage_counter); blk_cleanup_queue(q); } +static void pmem_release_queue(void *ref) +{ + __pmem_release_queue(ref); +} + static void pmem_freeze_queue(struct percpu_ref *ref) { struct request_queue *q; @@ -400,12 +408,10 @@ static int pmem_attach_disk(struct device *dev, if (!q) return -ENOMEM; - if (devm_add_action_or_reset(dev, pmem_release_queue, q)) - return -ENOMEM; - pmem->pfn_flags = PFN_DEV; pmem->pgmap.ref = >q_usage_counter; pmem->pgmap.kill = pmem_freeze_queue; + pmem->pgmap.cleanup = __pmem_release_queue; if (is_nd_pfn(dev)) { if (setup_pagemap_fsdax(dev, >pgmap)) return -ENOMEM; @@ -426,6 +432,9 @@ static int pmem_attach_disk(struct device *dev, pmem->pfn_flags |= PFN_MAP; memcpy(_res, >pgmap.res, sizeof(bb_res)); } else { + if (devm_add_action_or_reset(dev, pmem_release_queue, + >q_usage_counter)) + return -ENOMEM; addr = devm_memremap(dev, pmem->phys_addr, pmem->size, ARCH_MEMREMAP_PMEM); memcpy(_res, >res, sizeof(bb_res)); diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 54d475569058..a7a66b958720 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -95,7 +95,7 @@ static void pci_p2pdma_percpu_kill(struct percpu_ref *ref) percpu_ref_kill(ref); } -static void pci_p2pdma_percpu_cleanup(void *ref) +static void 
pci_p2pdma_percpu_cleanup(struct percpu_ref *ref) { struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref); @@ -198,16 +198,6 @@ int pci_p2p
Re: [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()
On Tue, May 7, 2019 at 11:38 AM David Hildenbrand wrote: > > Only memory to be added to the buddy and to be onlined/offlined by > user space using memory block devices needs (and should have!) memory > block devices. > > Factor out creation of memory block devices Create all devices after > arch_add_memory() succeeded. We can later drop the want_memblock parameter, > because it is now effectively stale. > > Only after memory block devices have been added, memory can be onlined > by user space. This implies, that memory is not visible to user space at > all before arch_add_memory() succeeded. Nice! > > Cc: Greg Kroah-Hartman > Cc: "Rafael J. Wysocki" > Cc: David Hildenbrand > Cc: "mike.tra...@hpe.com" > Cc: Andrew Morton > Cc: Ingo Molnar > Cc: Andrew Banman > Cc: Oscar Salvador > Cc: Michal Hocko > Cc: Pavel Tatashin > Cc: Qian Cai > Cc: Wei Yang > Cc: Arun KS > Cc: Mathieu Malaterre > Signed-off-by: David Hildenbrand > --- > drivers/base/memory.c | 70 ++ > include/linux/memory.h | 2 +- > mm/memory_hotplug.c| 15 - > 3 files changed, 53 insertions(+), 34 deletions(-) > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index 6e0cb4fda179..862c202a18ca 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr) > return 0; > } > > +static void unregister_memory(struct memory_block *memory) > +{ > + BUG_ON(memory->dev.bus != _subsys); Given this should never happen and only a future kernel developer might trip over it, do we really need to kill that developer's machine? I.e. s/BUG/WARN/? I guess an argument can be made to move such a change that to a follow-on patch since you're just preserving existing behavior, but I figure might as well address these as the code is refactored. > + > + /* drop the ref. 
we got via find_memory_block() */ > + put_device(>dev); > + device_unregister(>dev); > +} > + > /* > - * need an interface for the VM to add new memory regions, > - * but without onlining it. > + * Create memory block devices for the given memory area. Start and size > + * have to be aligned to memory block granularity. Memory block devices > + * will be initialized as offline. > */ > -int hotplug_memory_register(int nid, struct mem_section *section) > +int hotplug_memory_register(unsigned long start, unsigned long size) > { > - int ret = 0; > + unsigned long block_nr_pages = memory_block_size_bytes() >> > PAGE_SHIFT; > + unsigned long start_pfn = PFN_DOWN(start); > + unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT); > + unsigned long pfn; > struct memory_block *mem; > + int ret = 0; > > - mutex_lock(_sysfs_mutex); > + BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes())); > + BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes())); Perhaps: if (WARN_ON(...)) return -EINVAL; > > - mem = find_memory_block(section); > - if (mem) { > - mem->section_count++; > - put_device(>dev); > - } else { > - ret = init_memory_block(, section, MEM_OFFLINE); > + mutex_lock(_sysfs_mutex); > + for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) { > + mem = find_memory_block(__pfn_to_section(pfn)); > + if (mem) { > + WARN_ON_ONCE(false); ?? Isn't that a nop? 
> + put_device(>dev); > + continue; > + } > + ret = init_memory_block(, __pfn_to_section(pfn), > + MEM_OFFLINE); > if (ret) > - goto out; > - mem->section_count++; > + break; > + mem->section_count = memory_block_size_bytes() / > +MIN_MEMORY_BLOCK_SIZE; > + } > + if (ret) { > + end_pfn = pfn; > + for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) { > + mem = find_memory_block(__pfn_to_section(pfn)); > + if (!mem) > + continue; > + mem->section_count = 0; > + unregister_memory(mem); > + } > } > - > -out: > mutex_unlock(_sysfs_mutex); > return ret; > } > > -static void > -unregister_memory(struct memory_block *memory) > -{ > - BUG_ON(memory->dev.bus != _subsys); > - > - /* drop the ref. we got via find_memory_block() */ > - put_device(>dev); > - device_unregister(>dev); > -} > - > -void unregister_memory_section(struct mem_section *section) > +static int remove_memory_section(struct mem_section *section) > { > struct memory_block *mem; > > diff --git a/include/linux/memory.h
Re: [PATCH v2 5/8] mm/memory_hotplug: Drop MHP_MEMBLOCK_API
On Tue, May 7, 2019 at 11:38 AM David Hildenbrand wrote: > > No longer needed, the callers of arch_add_memory() can handle this > manually. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Pavel Tatashin > Cc: Wei Yang > Cc: Joonsoo Kim > Cc: Qian Cai > Cc: Arun KS > Cc: Mathieu Malaterre > Signed-off-by: David Hildenbrand > --- > include/linux/memory_hotplug.h | 8 > mm/memory_hotplug.c| 9 +++-- > 2 files changed, 3 insertions(+), 14 deletions(-) > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index 2d4de313926d..2f1f87e13baa 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -128,14 +128,6 @@ extern void arch_remove_memory(int nid, u64 start, u64 > size, > extern void __remove_pages(struct zone *zone, unsigned long start_pfn, >unsigned long nr_pages, struct vmem_altmap > *altmap); > > -/* > - * Do we want sysfs memblock files created. This will allow userspace to > online > - * and offline memory explicitly. Lack of this bit means that the caller has > to > - * call move_pfn_range_to_zone to finish the initialization. 
> - */ > - > -#define MHP_MEMBLOCK_API (1<<0) > - > /* reasonably generic interface to expand the physical pages */ > extern int __add_pages(int nid, unsigned long start_pfn, unsigned long > nr_pages, >struct mhp_restrictions *restrictions); > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index e1637c8a0723..107f72952347 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -250,7 +250,7 @@ void __init register_page_bootmem_info_node(struct > pglist_data *pgdat) > #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */ > > static int __meminit __add_section(int nid, unsigned long phys_start_pfn, > - struct vmem_altmap *altmap, bool want_memblock) > + struct vmem_altmap *altmap) > { > int ret; > > @@ -293,8 +293,7 @@ int __ref __add_pages(int nid, unsigned long > phys_start_pfn, > } > > for (i = start_sec; i <= end_sec; i++) { > - err = __add_section(nid, section_nr_to_pfn(i), altmap, > - restrictions->flags & MHP_MEMBLOCK_API); > + err = __add_section(nid, section_nr_to_pfn(i), altmap); > > /* > * EEXIST is finally dealt with by ioresource collision > @@ -1066,9 +1065,7 @@ static int online_memory_block(struct memory_block > *mem, void *arg) > */ > int __ref add_memory_resource(int nid, struct resource *res) > { > - struct mhp_restrictions restrictions = { > - .flags = MHP_MEMBLOCK_API, > - }; > + struct mhp_restrictions restrictions = {}; With mhp_restrictions.flags no longer needed, can we drop mhp_restrictions and just go back to a plain @altmap argument?
[PATCH v8 05/12] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
Allow sub-section sized ranges to be added to the memmap. populate_section_memmap() takes an explicit pfn range rather than assuming a full section, and those parameters are plumbed all the way through to vmemmap_populate(). There should be no sub-section usage in current deployments. New warnings are added to clarify which memmap allocation paths are sub-section capable. Cc: Michal Hocko Cc: David Hildenbrand Cc: Logan Gunthorpe Cc: Oscar Salvador Reviewed-by: Pavel Tatashin Signed-off-by: Dan Williams --- arch/x86/mm/init_64.c |4 ++- include/linux/mm.h|4 ++- mm/sparse-vmemmap.c | 21 +++-- mm/sparse.c | 61 +++-- 4 files changed, 57 insertions(+), 33 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 20d14254b686..bb018d09d2dc 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1457,7 +1457,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, { int err; - if (boot_cpu_has(X86_FEATURE_PSE)) + if (end - start < PAGES_PER_SECTION * sizeof(struct page)) + err = vmemmap_populate_basepages(start, end, node); + else if (boot_cpu_has(X86_FEATURE_PSE)) err = vmemmap_populate_hugepages(start, end, node, altmap); else if (altmap) { pr_err_once("%s: no cpu support for altmap allocations\n", diff --git a/include/linux/mm.h b/include/linux/mm.h index 0e8834ac32b7..5360a0e4051d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2748,8 +2748,8 @@ const char * arch_vma_name(struct vm_area_struct *vma); void print_vma_addr(char *prefix, unsigned long rip); void *sparse_buffer_alloc(unsigned long size); -struct page *sparse_mem_map_populate(unsigned long pnum, int nid, - struct vmem_altmap *altmap); +struct page * __populate_section_memmap(unsigned long pfn, + unsigned long nr_pages, int nid, struct vmem_altmap *altmap); pgd_t *vmemmap_pgd_populate(unsigned long addr, int node); p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node); pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long 
addr, int node); diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 7fec05796796..200aef686722 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -245,19 +245,26 @@ int __meminit vmemmap_populate_basepages(unsigned long start, return 0; } -struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid, - struct vmem_altmap *altmap) +struct page * __meminit __populate_section_memmap(unsigned long pfn, + unsigned long nr_pages, int nid, struct vmem_altmap *altmap) { unsigned long start; unsigned long end; - struct page *map; - map = pfn_to_page(pnum * PAGES_PER_SECTION); - start = (unsigned long)map; - end = (unsigned long)(map + PAGES_PER_SECTION); + /* +* The minimum granularity of memmap extensions is +* PAGES_PER_SUBSECTION as allocations are tracked in the +* 'subsection_map' bitmap of the section. +*/ + end = ALIGN(pfn + nr_pages, PAGES_PER_SUBSECTION); + pfn &= PAGE_SUBSECTION_MASK; + nr_pages = end - pfn; + + start = (unsigned long) pfn_to_page(pfn); + end = start + nr_pages * sizeof(struct page); if (vmemmap_populate(start, end, nid, altmap)) return NULL; - return map; + return pfn_to_page(pfn); } diff --git a/mm/sparse.c b/mm/sparse.c index ac47a48050c7..d613f108cf34 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -433,8 +433,8 @@ static unsigned long __init section_map_size(void) return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION); } -struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid, - struct vmem_altmap *altmap) +struct page __init *__populate_section_memmap(unsigned long pfn, + unsigned long nr_pages, int nid, struct vmem_altmap *altmap) { unsigned long size = section_map_size(); struct page *map = sparse_buffer_alloc(size); @@ -515,10 +515,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin, } sparse_buffer_init(map_count * section_map_size(), nid); for_each_present_section_nr(pnum_begin, pnum) { + unsigned long pfn = section_nr_to_pfn(pnum); + if (pnum >= pnum_end) 
break; - map = sparse_mem_map_populate(pnum, nid, NULL); + map = __populate_section_memmap(pfn, PAGES_PER_SECTION, + nid, NULL); if (!map) { pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.", __func__, nid); @@ -618,17 +621,17 @@ void offline_mem_sections(unsi
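The pfn/nr_pages rounding in the new __populate_section_memmap() can be checked in isolation. This sketch lifts the arithmetic out of the kernel context; the PAGE_SHIFT/SUBSECTION_SHIFT values assume x86_64 with 4K pages and 2M subsections, and `subsection_round()` is a made-up name for illustration:

```c
#define PAGE_SHIFT		12
#define SUBSECTION_SHIFT	21
#define PFN_SUBSECTION_SHIFT	(SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION	(1UL << PFN_SUBSECTION_SHIFT)	/* 512 */
#define PAGE_SUBSECTION_MASK	(~(PAGES_PER_SUBSECTION - 1))
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/* Mirror the hunk above: expand an arbitrary [pfn, pfn + nr_pages)
 * request to whole subsections, returning the rounded page count and
 * rounding *pfn down in place. */
static unsigned long subsection_round(unsigned long *pfn,
				      unsigned long nr_pages)
{
	unsigned long end = ALIGN(*pfn + nr_pages, PAGES_PER_SUBSECTION);

	*pfn &= PAGE_SUBSECTION_MASK;
	return end - *pfn;
}
```

A 1-page request at pfn 0x1234 expands to the full 512-page subsection starting at pfn 0x1200, matching the comment in the patch that PAGES_PER_SUBSECTION is the minimum granularity of memmap extensions.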
[PATCH v8 01/12] mm/sparsemem: Introduce struct mem_section_usage
Towards enabling memory hotplug to track partial population of a section, introduce 'struct mem_section_usage'. A pointer to a 'struct mem_section_usage' instance replaces the existing pointer to a 'pageblock_flags' bitmap. Effectively it adds one more 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house a new 'subsection_map' bitmap. The new bitmap enables the memory hot{plug,remove} implementation to act on incremental sub-divisions of a section. The default SUBSECTION_SHIFT is chosen to keep the 'subsection_map' no larger than a single 'unsigned long' on the major architectures. Alternatively, an architecture can define ARCH_SUBSECTION_SHIFT to override the default PMD_SHIFT. Note that PowerPC needs to use ARCH_SUBSECTION_SHIFT to work around PMD_SHIFT being a non-constant expression. The primary motivation for this functionality is to support platforms that mix "System RAM" and "Persistent Memory" within a single section, or multiple PMEM ranges with different mapping lifetimes within a single section. The section restriction for hotplug has caused an ongoing saga of hacks and bugs for devm_memremap_pages() users. Beyond the fixups to teach existing paths how to retrieve the 'usemap' from a section, and updates to the usemap allocation path, there are no expected behavior changes. 
Cc: Michal Hocko Cc: Vlastimil Babka Cc: Logan Gunthorpe Cc: Oscar Salvador Cc: Pavel Tatashin Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Signed-off-by: Dan Williams --- arch/powerpc/include/asm/sparsemem.h |3 + include/linux/mmzone.h | 48 +++- mm/memory_hotplug.c | 18 mm/page_alloc.c |2 - mm/sparse.c | 81 +- 5 files changed, 99 insertions(+), 53 deletions(-) diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h index 3192d454a733..1aa3c9303bf8 100644 --- a/arch/powerpc/include/asm/sparsemem.h +++ b/arch/powerpc/include/asm/sparsemem.h @@ -10,6 +10,9 @@ */ #define SECTION_SIZE_BITS 24 +/* Reflect the largest possible PMD-size as the subsection-size constant */ +#define ARCH_SUBSECTION_SHIFT 24 + #endif /* CONFIG_SPARSEMEM */ #ifdef CONFIG_MEMORY_HOTPLUG diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 70394cabaf4e..ef8d878079f9 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1160,6 +1160,44 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec) #define SECTION_ALIGN_UP(pfn) (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK) #define SECTION_ALIGN_DOWN(pfn)((pfn) & PAGE_SECTION_MASK) +/* + * SUBSECTION_SHIFT must be constant since it is used to declare + * subsection_map and related bitmaps without triggering the generation + * of variable-length arrays. The most natural size for a subsection is + * a PMD-page. For architectures that do not have a constant PMD-size + * ARCH_SUBSECTION_SHIFT can be set to a constant max size, or otherwise + * fallback to 2MB. + */ +#if defined(ARCH_SUBSECTION_SHIFT) +#define SUBSECTION_SHIFT (ARCH_SUBSECTION_SHIFT) +#elif defined(PMD_SHIFT) +#define SUBSECTION_SHIFT (PMD_SHIFT) +#else +/* + * Memory hotplug enabled platforms avoid this default because they + * either define ARCH_SUBSECTION_SHIFT, or PMD_SHIFT is a constant, but + * this is kept as a backstop to allow compilation on + * !ARCH_ENABLE_MEMORY_HOTPLUG archs. 
+ */ +#define SUBSECTION_SHIFT 21 +#endif + +#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT) +#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT) +#define PAGE_SUBSECTION_MASK ((~(PAGES_PER_SUBSECTION-1))) + +#if SUBSECTION_SHIFT > SECTION_SIZE_BITS +#error Subsection size exceeds section size +#else +#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT)) +#endif + +struct mem_section_usage { + DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION); + /* See declaration of similar field in struct zone */ + unsigned long pageblock_flags[0]; +}; + struct page; struct page_ext; struct mem_section { @@ -1177,8 +1215,7 @@ struct mem_section { */ unsigned long section_mem_map; - /* See declaration of similar field in struct zone */ - unsigned long *pageblock_flags; + struct mem_section_usage *usage; #ifdef CONFIG_PAGE_EXTENSION /* * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use @@ -1209,6 +1246,11 @@ extern struct mem_section **mem_section; extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif +static inline unsigned long *section_to_usemap(struct mem_section *ms) +{ + return ms->usage->pageblock_flags; +} + static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME @@ -1
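The sizing claim in the changelog, that 'subsection_map' stays within a single 'unsigned long' on the major architectures, is easy to sanity-check with the shift arithmetic above. This sketch plugs in the x86_64 values (SECTION_SIZE_BITS 27, PMD_SHIFT 21) and the PowerPC values from the patch; the helper names are invented for illustration:

```c
/* Number of subsections tracked per section for a given configuration. */
static unsigned long subsections_per_section(int section_size_bits,
					     int subsection_shift)
{
	return 1UL << (section_size_bits - subsection_shift);
}

/* How many unsigned longs DECLARE_BITMAP() would consume for the
 * subsection_map, assuming 64-bit longs. */
static unsigned long subsection_map_longs(int section_size_bits,
					  int subsection_shift)
{
	return (subsections_per_section(section_size_bits, subsection_shift)
		+ 63) / 64;
}
```

x86_64 gets 64 subsections per 128M section, exactly one long; PowerPC, with SECTION_SIZE_BITS and ARCH_SUBSECTION_SHIFT both 24, collapses to a single subsection per section.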
Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
On Mon, May 6, 2019 at 10:57 AM Dave Hansen wrote: > > > -static inline void remove_memory(int nid, u64 start, u64 size) {} > > +static inline bool remove_memory(int nid, u64 start, u64 size) > > +{ > > + return -EBUSY; > > +} > > This seems like an appropriate place for a WARN_ONCE(), if someone > manages to call remove_memory() with hotplug disabled. > > BTW, I looked and can't think of a better errno, but -EBUSY probably > isn't the best error code, right? > > > -void remove_memory(int nid, u64 start, u64 size) > > +/** > > + * remove_memory > > + * @nid: the node ID > > + * @start: physical address of the region to remove > > + * @size: size of the region to remove > > + * > > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > + * and online/offline operations before this call, as required by > > + * try_offline_node(). > > + */ > > +void __remove_memory(int nid, u64 start, u64 size) > > { > > + > > + /* > > + * trigger BUG() if some memory is not offlined prior to calling this > > + * function > > + */ > > + if (try_remove_memory(nid, start, size)) > > + BUG(); > > +} > > Could we call this remove_offline_memory()? That way, it makes _some_ > sense why we would BUG() if the memory isn't offline. Please WARN() instead of BUG() because failing to remove memory should not be system fatal.
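A userspace model of the suggested change, with WARN() reduced to a counter plus fprintf stand-in for the real backtrace-dumping macro, and try_remove_memory() stubbed to always fail as if the range were still online (everything except the kernel function names is an assumption for illustration):

```c
#include <stdio.h>

#define EBUSY 16

static int warn_count;
#define WARN(cond, msg) do {				\
	if (cond) {					\
		warn_count++;				\
		fprintf(stderr, "WARNING: %s", msg);	\
	}						\
} while (0)

/* Stub: pretend the caller forgot to offline the range first. */
static int try_remove_memory(int nid, unsigned long long start,
			     unsigned long long size)
{
	return -EBUSY;
}

/* Per the review: a failed removal logs a warning and the system keeps
 * running, unlike BUG(), which would panic here. */
static void __remove_memory(int nid, unsigned long long start,
			    unsigned long long size)
{
	WARN(try_remove_memory(nid, start, size),
	     "memory removal failed; was the range offlined?\n");
}
```

Every failed call increments the counter and logs, but execution continues, which is the whole point of preferring WARN() over BUG() for a recoverable error.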
Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
On Thu, May 2, 2019 at 3:46 PM Pavel Tatashin wrote: > > Hi Dan, > > How do you test these patches? Do you have any instructions? Yes, I briefly mentioned this in the cover letter, but here is the test I am using: > > I see for example that check_hotplug_memory_range() still enforces > memory_block_size_bytes() alignment. > > Also, after removing check_hotplug_memory_range(), I tried to online > 16M aligned DAX memory, and got the following panic: Right, this functionality is currently strictly limited to the devm_memremap_pages() case where there are guarantees that the memory will never be onlined. This is due to the fact that the section size is entangled with the memblock api. That said I would have expected you to trigger the warning in subsection_check() before getting this far into the hotplug process. > > # echo online > /sys/devices/system/memory/memory7/state > [ 202.193132] WARNING: CPU: 2 PID: 351 at drivers/base/memory.c:207 > memory_block_action+0x110/0x178 > [ 202.193391] Modules linked in: > [ 202.193698] CPU: 2 PID: 351 Comm: sh Not tainted > 5.1.0-rc7_pt_devdax-00038-g865af4385544-dirty #9 > [ 202.193909] Hardware name: linux,dummy-virt (DT) > [ 202.194122] pstate: 6005 (nZCv daif -PAN -UAO) > [ 202.194243] pc : memory_block_action+0x110/0x178 > [ 202.194404] lr : memory_block_action+0x90/0x178 > [ 202.194506] sp : 16763ca0 > [ 202.194592] x29: 16763ca0 x28: 80016fd29b80 > [ 202.194724] x27: x26: > [ 202.194838] x25: 15546000 x24: 001c > [ 202.194949] x23: x22: 0004 > [ 202.195058] x21: 001c x20: 0008 > [ 202.195168] x19: 0007 x18: > [ 202.195281] x17: x16: > [ 202.195393] x15: x14: > [ 202.195505] x13: x12: > [ 202.195614] x11: x10: > [ 202.195744] x9 : x8 : 00018000 > [ 202.195858] x7 : 0018 x6 : 15541930 > [ 202.195966] x5 : 15541930 x4 : 0001 > [ 202.196074] x3 : 0001 x2 : > [ 202.196185] x1 : 0070 x0 : > [ 202.196366] Call trace: > [ 202.196455] memory_block_action+0x110/0x178 > [ 202.196589] memory_subsys_online+0x3c/0x80 > [ 202.196681] 
device_online+0x6c/0x90 > [ 202.196761] state_store+0x84/0x100 > [ 202.196841] dev_attr_store+0x18/0x28 > [ 202.196927] sysfs_kf_write+0x40/0x58 > [ 202.197010] kernfs_fop_write+0xcc/0x1d8 > [ 202.197099] __vfs_write+0x18/0x40 > [ 202.197187] vfs_write+0xa4/0x1b0 > [ 202.197295] ksys_write+0x64/0xd8 > [ 202.197430] __arm64_sys_write+0x18/0x20 > [ 202.197521] el0_svc_common.constprop.0+0x7c/0xe8 > [ 202.197621] el0_svc_handler+0x28/0x78 > [ 202.197706] el0_svc+0x8/0xc > [ 202.197828] ---[ end trace 57719823dda6d21e ]--- > > Thank you, > Pasha
Re: [PATCH] kernel: remove the unused device_private_entry_fault export
On Mon, Apr 29, 2019 at 4:56 AM Christoph Hellwig wrote: > > This export has been entirely unused since it was added more than 1 1/2 > years ago. > > Signed-off-by: Christoph Hellwig > --- > kernel/memremap.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/kernel/memremap.c b/kernel/memremap.c > index 4e59d29245f4..1490e63f69a9 100644 > --- a/kernel/memremap.c > +++ b/kernel/memremap.c > @@ -45,7 +45,6 @@ vm_fault_t device_private_entry_fault(struct vm_area_struct > *vma, > */ > return devmem->page_fault(vma, addr, page, flags, pmdp); > } > -EXPORT_SYMBOL(device_private_entry_fault); > #endif /* CONFIG_DEVICE_PRIVATE */ Looks good to me. Reviewed-by: Dan Williams
Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
On Thu, Apr 25, 2019 at 1:05 AM Du, Fan wrote: > > > > >-Original Message- > >From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > >Behalf Of Michal Hocko > >Sent: Thursday, April 25, 2019 3:54 PM > >To: Du, Fan > >Cc: a...@linux-foundation.org; Wu, Fengguang ; > >Williams, Dan J ; Hansen, Dave > >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying > >; linux...@kvack.org; linux-kernel@vger.kernel.org > >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous > >memory system > > > >On Thu 25-04-19 07:41:40, Du, Fan wrote: > >> > >> > >> >-Original Message- > >> >From: Michal Hocko [mailto:mho...@kernel.org] > >> >Sent: Thursday, April 25, 2019 2:37 PM > >> >To: Du, Fan > >> >Cc: a...@linux-foundation.org; Wu, Fengguang > >; > >> >Williams, Dan J ; Hansen, Dave > >> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying > >> >; linux...@kvack.org; > >linux-kernel@vger.kernel.org > >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous > >> >memory system > >> > > >> >On Thu 25-04-19 09:21:30, Fan Du wrote: > >> >[...] > >> >> However PMEM has different characteristics from DRAM, > >> >> the more reasonable or desirable fallback style would be: > >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3. > >> >> When DRAM is exhausted, try PMEM then. > >> > > >> >Why and who does care? NUMA is fundamentally about memory nodes > >with > >> >different access characteristics so why is PMEM any special? > >> > >> Michal, thanks for your comments! > >> > >> The "different" lies in the local or remote access, usually the underlying > >> memory is the same type, i.e. DRAM. > >> > >> By "special", PMEM is usually in gigantic capacity than DRAM per dimm, > >> while with different read/write access latency than DRAM. > > > >You are describing a NUMA in general here. Yes access to different NUMA > >nodes has a different read/write latency. But that doesn't make PMEM > >really special from a regular DRAM. 
> > Not the numa distance b/w cpu and PMEM node make PMEM different than > DRAM. The difference lies in the physical layer. The access latency > characteristics > comes from media level. No, there is no such thing as a "PMEM node". I've pushed back on this broken concept in the past [1] [2]. Consider that PMEM could be as fast as DRAM for technologies like NVDIMM-N or in emulation environments. These attempts to look at persistence as an attribute of performance are entirely missing the point that the system can have multiple varied memory types and the platform firmware needs to enumerate these performance properties in the HMAT on ACPI platforms. Any scheme that only considers a binary DRAM and not-DRAM property is immediately invalidated the moment the OS needs to consider a 3rd or 4th memory type, or a more varied connection topology. [1]: https://lore.kernel.org/lkml/CAPcyv4heiUbZvP7Ewoy-Hy=-mprdjcjeusw+0rwdouhdjwe...@mail.gmail.com/ [2]: https://lore.kernel.org/lkml/capcyv4it1w7sddvbv24crcvhtlb3s1pvb5+sdm02uw4rbah...@mail.gmail.com/
Re: [PATCH v6 06/12] mm/hotplug: Add mem-hotplug restrictions for remove_memory()
On Tue, Apr 23, 2019 at 2:21 PM David Hildenbrand wrote: > > On 17.04.19 20:39, Dan Williams wrote: > > Teach the arch_remove_memory() path to consult the same 'struct > > mhp_restrictions' context as was specified at arch_add_memory() time. > > > > No functional change, this is a preparation step for teaching > > __remove_pages() about how and when to allow sub-section hot-remove, and > > a cleanup for an unnecessary "is_dev_zone()" special case. > > I am not yet sure if this is the right thing to do. When adding memory, > we obviously have to specify the "how". When removing memory, we usually > should be able to look such stuff up. True, the implementation can just use find_memory_block(), and no need to plumb this flag. > > > > void __remove_pages(struct zone *zone, unsigned long phys_start_pfn, > > - unsigned long nr_pages, struct vmem_altmap *altmap) > > + unsigned long nr_pages, struct mhp_restrictions *restrictions) > > { > > unsigned long i; > > - unsigned long map_offset = 0; > > int sections_to_remove; > > + unsigned long map_offset = 0; > > + struct vmem_altmap *altmap = restrictions->altmap; > > > > - /* In the ZONE_DEVICE case device driver owns the memory region */ > > - if (is_dev_zone(zone)) { > > - if (altmap) > > - map_offset = vmem_altmap_offset(altmap); > > - } > > + if (altmap) > > + map_offset = vmem_altmap_offset(altmap); > > > > Why weren't we able to use this exact same hunk before? (after my > resource deletion cleanup of course) > > IOW, do we really need struct mhp_restrictions here? We don't need it. It was only for the memblock info that I added the "restrictions" argument. > After I factor out memory device handling into the caller of > arch_remove_memory(), also the next patch ("mm/sparsemem: Prepare for > sub-section ranges") should no longer need it. Or am I missing something? That patch is still needed for the places where it adds the @nr_pages argument, but the mhp_restrictions-related bits can be dropped. 
The subsection_check() helper needs to be refactored a bit to not rely on mhp_restrictions.
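With the mhp_restrictions dependency gone, subsection_check() could reduce to a pure alignment test on the pfn range. A hypothetical sketch follows; the real helper in the posted series may differ, and the constants assume 4K pages with 2M subsections:

```c
#define EINVAL 22
#define PAGE_SHIFT		12
#define SUBSECTION_SHIFT	21
#define PAGES_PER_SUBSECTION	(1UL << (SUBSECTION_SHIFT - PAGE_SHIFT))

/* Reject hot-add/remove requests that are not whole, aligned
 * subsections; an in-kernel version would likely WARN() before
 * returning the error. */
static int subsection_check(unsigned long pfn, unsigned long nr_pages)
{
	if ((pfn | nr_pages) & (PAGES_PER_SUBSECTION - 1))
		return -EINVAL;
	return 0;
}
```

OR-ing pfn and nr_pages before masking catches a misaligned start and a non-multiple length in a single branch, a common kernel idiom for paired alignment checks.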