Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-13 Thread Dan Williams
On Thu, Jun 13, 2019 at 11:42 AM Qian Cai  wrote:
>
> On Wed, 2019-06-12 at 12:37 -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
> > >
> > > The linux-next commit "mm/sparsemem: Add helpers track active portions
> > > of a section at boot" [1] causes a crash below when the first kmemleak
> > > scan kthread kicks in. This is because kmemleak_scan() calls
> > > pfn_to_online_page() which calls pfn_valid_within() instead of
> > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
> > >
> > > The commit [1] did add an additional check of pfn_section_valid() in
> > > pfn_valid(), but forgot to add it in the above code path.
> > >
> > > page:ea0002748000 is uninitialized and poisoned
> > > raw:    
> > > raw:    
> > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > [ cut here ]
> > > kernel BUG at include/linux/mm.h:1084!
> > > invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> > > BIOS -[TEE113T-1.00]- 07/07/2017
> > > RIP: 0010:kmemleak_scan+0x6df/0xad0
> > > Call Trace:
> > >  kmemleak_scan_thread+0x9f/0xc7
> > >  kthread+0x1d2/0x1f0
> > >  ret_from_fork+0x35/0x4
> > >
> > > [1] https://patchwork.kernel.org/patch/10977957/
> > >
> > > Signed-off-by: Qian Cai 
> > > ---
> > >  include/linux/memory_hotplug.h | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/include/linux/memory_hotplug.h 
> > > b/include/linux/memory_hotplug.h
> > > index 0b8a5e5ef2da..f02be86077e3 100644
> > > --- a/include/linux/memory_hotplug.h
> > > +++ b/include/linux/memory_hotplug.h
> > > @@ -28,6 +28,7 @@
> > > unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> > >\
> > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> > > +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> > > pfn_valid_within(___pfn))  \
> > > ___page = pfn_to_page(___pfn); \
> > > ___page;   \
> >
> > Looks ok to me:
> >
> > Acked-by: Dan Williams 
> >
> > ...but why is pfn_to_online_page() a multi-line macro instead of a
> > static inline like all the helper routines it invokes?
>
> Sigh, probably because it is a mess over there.
>
> memory_hotplug.h and mmzone.h include each other. Converting it directly to
> a static inline triggers compilation errors, because mmzone.h gets included
> elsewhere and pfn_to_online_page() then needs things like pfn_valid_within()
> and online_section_nr(), which are only defined later in mmzone.h.

Ok, makes sense. I had assumed it was something horrible like that.

Qian, can you send more details on the reproduction steps for the
failures you are seeing? Like configs and platforms you're testing.
I've tried enabling kmemleak and offlining memory and have yet to
trigger these failures. I also have a couple people willing to help me
out with tracking down the PowerPC issue, but I assume they need some
help with the reproduction as well.
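
For reference, the check being discussed, written as the static inline Dan asks
about, would look roughly like this. It is only a sketch of what the quoted
macro does; as Qian explains above, it does not build as-is because of the
memory_hotplug.h / mmzone.h include cycle:

static inline struct page *pfn_to_online_page(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
	    pfn_section_valid(__nr_to_section(nr), pfn) &&
	    pfn_valid_within(pfn))
		return pfn_to_page(pfn);

	return NULL;
}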


Re: [PATCH 08/22] memremap: pass a struct dev_pagemap to ->kill

2019-06-13 Thread Dan Williams
On Thu, Jun 13, 2019 at 1:12 PM Logan Gunthorpe  wrote:
>
>
>
> On 2019-06-13 3:43 a.m., Christoph Hellwig wrote:
> > Passing the actual typed structure leads to more understandable code
> > vs the actual references.
>
> Ha, ok, I originally suggested this to Dan when he introduced the
> callback[1].
>
> Reviewed-by: Logan Gunthorpe 

Reviewed-by: Dan Williams 
Reported-by: Logan Gunthorpe  :)


>
> Logan
>
> [1]
> https://lore.kernel.org/lkml/8f0cae82-130f-8a64-cfbd-fda5fd76b...@deltatee.com/T/#u
>


Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-13 Thread Dan Williams
On Wed, Jun 12, 2019 at 2:47 PM Qian Cai  wrote:
>
> On Wed, 2019-06-12 at 12:38 -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 12:37 PM Dan Williams 
> > wrote:
> > >
> > > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
> > > >
> > > > The linux-next commit "mm/sparsemem: Add helpers track active portions
> > > > of a section at boot" [1] causes a crash below when the first kmemleak
> > > > scan kthread kicks in. This is because kmemleak_scan() calls
> > > > pfn_to_online_page() which calls pfn_valid_within() instead of
> > > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
> > > >
> > > > The commit [1] did add an additional check of pfn_section_valid() in
> > > > pfn_valid(), but forgot to add it in the above code path.
> > > >
> > > > page:ea0002748000 is uninitialized and poisoned
> > > > raw:    
> > > > raw:    
> > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > > [ cut here ]
> > > > kernel BUG at include/linux/mm.h:1084!
> > > > invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> > > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> > > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> > > > BIOS -[TEE113T-1.00]- 07/07/2017
> > > > RIP: 0010:kmemleak_scan+0x6df/0xad0
> > > > Call Trace:
> > > >  kmemleak_scan_thread+0x9f/0xc7
> > > >  kthread+0x1d2/0x1f0
> > > >  ret_from_fork+0x35/0x4
> > > >
> > > > [1] https://patchwork.kernel.org/patch/10977957/
> > > >
> > > > Signed-off-by: Qian Cai 
> > > > ---
> > > >  include/linux/memory_hotplug.h | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/include/linux/memory_hotplug.h
> > > > b/include/linux/memory_hotplug.h
> > > > index 0b8a5e5ef2da..f02be86077e3 100644
> > > > --- a/include/linux/memory_hotplug.h
> > > > +++ b/include/linux/memory_hotplug.h
> > > > @@ -28,6 +28,7 @@
> > > > unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> > > >\
> > > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> > > > +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> > > > pfn_valid_within(___pfn))  \
> > > > ___page = pfn_to_page(___pfn); \
> > > > ___page;   \
> > >
> > > Looks ok to me:
> > >
> > > Acked-by: Dan Williams 
> > >
> > > ...but why is pfn_to_online_page() a multi-line macro instead of a
> > > static inline like all the helper routines it invokes?
> >
> > I do need to send out a refreshed version of the sub-section patchset,
> > so I'll fold this in and give you a Reported-by credit.
>
> BTW, not sure if your new version will fix those two problems below due to the
> same commit.
>
> https://patchwork.kernel.org/patch/10977957/
>
> 1) offline is busted [1]. It looks like test_pages_in_a_zone() missed the same
> pfn_section_valid() check.
>
> 2) powerpc booting is generating endless warnings [2]. In vmemmap_populated() 
> at
> arch/powerpc/mm/init_64.c, I tried to change PAGES_PER_SECTION to
> PAGES_PER_SUBSECTION, but it alone seems not enough.

Yes, I was just sending you another note about this. I don't think
your proposed fix is sufficient. The original intent of
pfn_valid_within() was to use it as a cheaper lookup after already
validating that the first page in a MAX_ORDER_NR_PAGES range satisfied
pfn_valid(). Quoting commit  14e072984179 "add pfn_valid_within helper
for sub-MAX_ORDER hole detection":

Add a pfn_valid_within() helper which should be used when scanning pages
within a MAX_ORDER_NR_PAGES block when we have already checked the validity
of the block normally with pfn_valid().  This can then be optimised away
when we do not have holes within a MAX_ORDER_NR_PAGES block of pages.

So, with that insight I think the complete fix is this:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6dd52d544857..9d15ec793330 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1428,7 +1428,7 @@ void memory_present(int nid, unsigned long
start, unsigned long end);
 #ifdef CONFIG_HOLES_IN_ZONE
 #define pfn_valid_within(pfn) pfn_valid(pfn)
 #else
-#define pfn_valid_within(pfn) (1)
+#define pfn_valid_within(pfn) pfn_section_valid(pfn)
 #endif

 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
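
The usage pattern that quoted changelog describes looks roughly like the
following; a simplified sketch (scan_block() is a made-up name, not a kernel
function), not a copy of any particular call site:

static void scan_block(unsigned long start_pfn)
{
	unsigned long pfn;

	/* validate the MAX_ORDER_NR_PAGES block once... */
	if (!pfn_valid(start_pfn))
		return;

	/* ...then use the cheap per-pfn check within the block */
	for (pfn = start_pfn; pfn < start_pfn + MAX_ORDER_NR_PAGES; pfn++) {
		if (!pfn_valid_within(pfn))
			continue;
		/* pfn_to_page(pfn) is safe to inspect here */
	}
}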


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dan Williams
On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny  wrote:
>
> On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > >
> > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > require HW support doesn't seem like progress.
> > > > > > >
> > > > > > > I think we will eventually start seeing some HW be able to do this
> > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > optional, for recovery from truly catastrophic errors (ie my DAX
> > > > > > > is
> > > > > > > on fire, I need to unplug it).
> > > > > >
> > > > > > Agreed.  I think software wise there is not much some of the 
> > > > > > devices can do
> > > > > > with such an "invalidate".
> > > > >
> > > > > So out of curiosity: What does RDMA driver do when userspace just 
> > > > > closes
> > > > > the file pointing to RDMA object? It has to handle that somehow by 
> > > > > aborting
> > > > > everything that's going on... And I wanted similar behavior here.
> > > >
> > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > everything avoids creating inconsistencies that destroying a subset
> > > > would create.
> > > >
> > > > What has been talked about for lease break is not destroying anything
> > > > but very selectively saying that one memory region linked to the GUP
> > > > is no longer functional.
> > >
> > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > and closes the file with existing pins (and thus layout lease) we would
> > > force it to abort everything. Yes, it is disruptive but then the app 
> > > didn't
> > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > such situation should never happen unless the app is malicious / buggy.
> >
> > We do have the infrastructure to completely revoke the entire
> > *content* of a FD (this is called device disassociate). It is
> > basically close without the app doing close. But again it only works
> > with some drivers. However, this is more likely something a driver
> > could support without a HW change though.
> >
> > It is quite destructive as it forcibly kills everything RDMA related
> > the process(es) are doing, but it is less violent than SIGKILL, and
> > there is perhaps a way for the app to recover from this, if it is
> > coded for it.
>
> I don't think many are...  I think most would effectively be "killed" if this
> happened to them.
>
> >
> > My preference would be to avoid this scenario, but if it is really
> > necessary, we could probably build it with some work.
> >
> > The only case we use it today is forced HW hot unplug, so it is rarely
> > used and only for an 'emergency' like use case.
>
> I'd really like to avoid this as well.  I think it will be very confusing for
> RDMA apps to have their context suddenly be invalid.  I think if we have a way
> for admins to ID who is pinning a file the admin can take more appropriate
> action on those processes.   Up to and including killing the process.

Can RDMA context invalidation, "device disassociate", be inflicted on
a process from the outside? Identifying the pid of a pin holder only
leaves SIGKILL of the entire process as the remediation for revoking a
pin, and I assume admins would use the finer grained invalidation
where it was available.


Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-13 Thread Dan Williams
On Wed, Jun 12, 2019 at 2:47 PM Qian Cai  wrote:
>
> On Wed, 2019-06-12 at 12:38 -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 12:37 PM Dan Williams 
> > wrote:
> > >
> > > On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
> > > >
> > > > The linux-next commit "mm/sparsemem: Add helpers track active portions
> > > > of a section at boot" [1] causes a crash below when the first kmemleak
> > > > scan kthread kicks in. This is because kmemleak_scan() calls
> > > > pfn_to_online_page() which calls pfn_valid_within() instead of
> > > > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
> > > >
> > > > The commit [1] did add an additional check of pfn_section_valid() in
> > > > pfn_valid(), but forgot to add it in the above code path.
> > > >
> > > > page:ea0002748000 is uninitialized and poisoned
> > > > raw:    
> > > > raw:    
> > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > > [ cut here ]
> > > > kernel BUG at include/linux/mm.h:1084!
> > > > invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> > > > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> > > > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> > > > BIOS -[TEE113T-1.00]- 07/07/2017
> > > > RIP: 0010:kmemleak_scan+0x6df/0xad0
> > > > Call Trace:
> > > >  kmemleak_scan_thread+0x9f/0xc7
> > > >  kthread+0x1d2/0x1f0
> > > >  ret_from_fork+0x35/0x4
> > > >
> > > > [1] https://patchwork.kernel.org/patch/10977957/
> > > >
> > > > Signed-off-by: Qian Cai 
> > > > ---
> > > >  include/linux/memory_hotplug.h | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/include/linux/memory_hotplug.h
> > > > b/include/linux/memory_hotplug.h
> > > > index 0b8a5e5ef2da..f02be86077e3 100644
> > > > --- a/include/linux/memory_hotplug.h
> > > > +++ b/include/linux/memory_hotplug.h
> > > > @@ -28,6 +28,7 @@
> > > > unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> > > >\
> > > > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> > > > +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> > > > pfn_valid_within(___pfn))  \
> > > > ___page = pfn_to_page(___pfn); \
> > > > ___page;   \
> > >
> > > Looks ok to me:
> > >
> > > Acked-by: Dan Williams 
> > >
> > > ...but why is pfn_to_online_page() a multi-line macro instead of a
> > > static inline like all the helper routines it invokes?
> >
> > I do need to send out a refreshed version of the sub-section patchset,
> > so I'll fold this in and give you a Reported-by credit.
>
> BTW, not sure if your new version will fix those two problems below due to the
> same commit.
>
> https://patchwork.kernel.org/patch/10977957/
>
> 1) offline is busted [1]. It looks like test_pages_in_a_zone() missed the same
> pfn_section_valid() check.

All online memory is to be onlined as a complete section, so I think
the issue is more related to vmemmap_populated() not establishing the
mem_map for all pages in a section.

I take back my suggestion about pfn_valid_within(); that check should
always be scoped to a section when validating online memory.

>
> 2) powerpc booting is generating endless warnings [2]. In vmemmap_populated() 
> at
> arch/powerpc/mm/init_64.c, I tried to change PAGES_PER_SECTION to
> PAGES_PER_SUBSECTION, but it alone seems not enough.

On PowerPC PAGES_PER_SECTION == PAGES_PER_SUBSECTION because the
PowerPC section size was already small. Instead I think the issue is
that PowerPC is partially populating sections, but still expecting
pfn_valid() to succeed. I.e. prior to the subsection patches
pfn_valid() would still work for those holes, but now that it is more
precise it is failing.
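
For context, the sub-section patches make pfn_valid() consult a per-sub-section
bitmap in addition to the section lookup, which is why a partially populated
section that used to pass now fails. An illustrative sketch of that shape (the
name and details are approximate, not the exact linux-next code):

static inline int pfn_valid_sketch(unsigned long pfn)
{
	struct mem_section *ms;

	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	ms = __nr_to_section(pfn_to_section_nr(pfn));
	if (!valid_section(ms))
		return 0;
	/* new with the sub-section patches: the sub-section holding this
	 * pfn must also be marked active */
	return pfn_section_valid(ms, pfn);
}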


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dan Williams
On Wed, Jun 12, 2019 at 4:32 PM Ira Weiny  wrote:
>
> On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny  wrote:
> > >
> > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > > >
> > > > > > > > > The main objection to the current ODP & DAX solution is that 
> > > > > > > > > very
> > > > > > > > > little HW can actually implement it, having the alternative 
> > > > > > > > > still
> > > > > > > > > require HW support doesn't seem like progress.
> > > > > > > > >
> > > > > > > > > I think we will eventually start seeing some HW be able to do
> > > > > > > > > this
> > > > > > > > > invalidation, but it won't be universal, and I'd rather leave 
> > > > > > > > > it
> > > > > > > > > optional, for recovery from truly catastrophic errors (ie my
> > > > > > > > > DAX is
> > > > > > > > > on fire, I need to unplug it).
> > > > > > > >
> > > > > > > > Agreed.  I think software wise there is not much some of the 
> > > > > > > > devices can do
> > > > > > > > with such an "invalidate".
> > > > > > >
> > > > > > > So out of curiosity: What does RDMA driver do when userspace just 
> > > > > > > closes
> > > > > > > the file pointing to RDMA object? It has to handle that somehow 
> > > > > > > by aborting
> > > > > > > everything that's going on... And I wanted similar behavior here.
> > > > > >
> > > > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > > > everything avoids creating inconsistencies that destroying a subset
> > > > > > would create.
> > > > > >
> > > > > > What has been talked about for lease break is not destroying 
> > > > > > anything
> > > > > > but very selectively saying that one memory region linked to the GUP
> > > > > > is no longer functional.
> > > > >
> > > > > OK, so what I had in mind was that if RDMA app doesn't play by the 
> > > > > rules
> > > > > and closes the file with existing pins (and thus layout lease) we 
> > > > > would
> > > > > force it to abort everything. Yes, it is disruptive but then the app 
> > > > > didn't
> > > > > obey the rule that it has to maintain file lease while holding pins. 
> > > > > Thus
> > > > > such situation should never happen unless the app is malicious / 
> > > > > buggy.
> > > >
> > > > We do have the infrastructure to completely revoke the entire
> > > > *content* of a FD (this is called device disassociate). It is
> > > > basically close without the app doing close. But again it only works
> > > > with some drivers. However, this is more likely something a driver
> > > > could support without a HW change though.
> > > >
> > > > It is quite destructive as it forcibly kills everything RDMA related
> > > > the process(es) are doing, but it is less violent than SIGKILL, and
> > > > there is perhaps a way for the app to recover from this, if it is
> > > > coded for it.
> > >
> > > I don't think many are...  I think most would effectively be "killed" if 
> > > this
> > > happened to them.
> > >
> > > >
> > > > My preference would be to avoid this scenario, but if it is really
> > > > necessary, we could probably build it with some work.
> > > >
> > > > The only case we use it today is forced HW hot unplug, so it is rarely
> > > > used and only for an 'emergency' like use case.
> > >
> > > I'd really like to avoid this as well.  I think it will be very confusing 
> > > for
> > > RDMA apps to have their context suddenly be invalid.  I think if we have 
> > > a way
> > > for admins to ID who is pinning a file the admin can take more appropriate
> > > action on those processes.   Up to and including killing the process.
> >
> > Can RDMA context invalidation, "device disassociate", be inflicted on
> > a process from the outside? Identifying the pid of a pin holder only
> > leaves SIGKILL of the entire process as the remediation for revoking a
> > pin, and I assume admins would use the finer grained invalidation
> > where it was available.
>
> No not in the way you are describing it.  As Jason said you can hotplug the
> device which is "from the outside" but this would affect all users of that
> device.
>
> Effectively, we would need a way for an admin to close a specific file
> descriptor (or set of fds) which point to that file.  AFAIK there is no way to
> do that at all, is there?

Even if there were that gets back to my other question, does RDMA
teardown happen at close(fd), or at final fput() of the 'struct file'?
I.e. does it also need munmap() to get the vma to drop its reference?
Perhaps a pointer to the relevant code would help me wrap my head
around this mechanism.
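
A tiny userspace illustration of the lifetime question above: closing the fd
does not drop the mapping's reference to the 'struct file', so ->release only
runs once the last reference (here, the vma's) goes away. The path is
arbitrary and error handling is omitted:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/scratch", O_RDWR | O_CREAT, 0600);
	char *p;

	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);                     /* fd gone, struct file still pinned by the vma */
	strcpy(p, "still writable");   /* the mapping keeps working */
	munmap(p, 4096);               /* vma reference dropped; final fput can happen now */
	return 0;
}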


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dan Williams
On Thu, Jun 13, 2019 at 8:14 AM Jason Gunthorpe  wrote:
>
> On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote:
> > > Effectively, we would need a way for an admin to close a specific file
> > > descriptor (or set of fds) which point to that file.  AFAIK there is no 
> > > way to
> > > do that at all, is there?
> >
> > Even if there were that gets back to my other question, does RDMA
> > teardown happen at close(fd), or at final fput() of the 'struct
> > file'?
>
> AFAIK there is no kernel side driver hook for close(fd).
>
> rdma uses a normal chardev so its lifetime is linked to the file_ops
> release, which is called on last fput. So all the mmaps, all the dups,
> everything must go before it releases its resources.

Oh, I must have missed where this conversation started talking about
the driver-device fd. I thought we were talking about the close /
release of the target file that is MAP_SHARED for the memory
registration. A release of the driver fd is orthogonal to coordinating
/ signalling actions relative to the leased file.


Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-12 Thread Dan Williams
On Wed, Jun 12, 2019 at 12:37 PM Dan Williams  wrote:
>
> On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
> >
> > The linux-next commit "mm/sparsemem: Add helpers track active portions
> > of a section at boot" [1] causes a crash below when the first kmemleak
> > scan kthread kicks in. This is because kmemleak_scan() calls
> > pfn_to_online_page() which calls pfn_valid_within() instead of
> > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
> >
> > The commit [1] did add an additional check of pfn_section_valid() in
> > pfn_valid(), but forgot to add it in the above code path.
> >
> > page:ea0002748000 is uninitialized and poisoned
> > raw:    
> > raw:    
> > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > [ cut here ]
> > kernel BUG at include/linux/mm.h:1084!
> > invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> > BIOS -[TEE113T-1.00]- 07/07/2017
> > RIP: 0010:kmemleak_scan+0x6df/0xad0
> > Call Trace:
> >  kmemleak_scan_thread+0x9f/0xc7
> >  kthread+0x1d2/0x1f0
> >  ret_from_fork+0x35/0x4
> >
> > [1] https://patchwork.kernel.org/patch/10977957/
> >
> > Signed-off-by: Qian Cai 
> > ---
> >  include/linux/memory_hotplug.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 0b8a5e5ef2da..f02be86077e3 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -28,6 +28,7 @@
> > unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> >\
> > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> > +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> > pfn_valid_within(___pfn))  \
> > ___page = pfn_to_page(___pfn); \
> > ___page;   \
>
> Looks ok to me:
>
> Acked-by: Dan Williams 
>
> ...but why is pfn_to_online_page() a multi-line macro instead of a
> static inline like all the helper routines it invokes?

I do need to send out a refreshed version of the sub-section patchset,
so I'll fold this in and give you a Reported-by credit.


Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-12 Thread Dan Williams
On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
>
> The linux-next commit "mm/sparsemem: Add helpers track active portions
> of a section at boot" [1] causes a crash below when the first kmemleak
> scan kthread kicks in. This is because kmemleak_scan() calls
> pfn_to_online_page() which calls pfn_valid_within() instead of
> pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
>
> The commit [1] did add an additional check of pfn_section_valid() in
> pfn_valid(), but forgot to add it in the above code path.
>
> page:ea0002748000 is uninitialized and poisoned
> raw:    
> raw:    
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [ cut here ]
> kernel BUG at include/linux/mm.h:1084!
> invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> BIOS -[TEE113T-1.00]- 07/07/2017
> RIP: 0010:kmemleak_scan+0x6df/0xad0
> Call Trace:
>  kmemleak_scan_thread+0x9f/0xc7
>  kthread+0x1d2/0x1f0
>  ret_from_fork+0x35/0x4
>
> [1] https://patchwork.kernel.org/patch/10977957/
>
> Signed-off-by: Qian Cai 
> ---
>  include/linux/memory_hotplug.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 0b8a5e5ef2da..f02be86077e3 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -28,6 +28,7 @@
> unsigned long ___nr = pfn_to_section_nr(___pfn);   \
>\
> if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> pfn_valid_within(___pfn))  \
>     ___page = pfn_to_page(___pfn); \
> ___page;   \

Looks ok to me:

Acked-by: Dan Williams 

...but why is pfn_to_online_page() a multi-line macro instead of a
static inline like all the helper routines it invokes?


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Dan Williams
On Wed, Jun 12, 2019 at 3:29 AM Jan Kara  wrote:
>
> On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > >
> > > > Because the pins would be invisible to sysadmin from that point on.
> > >
> > > It is not invisible, it just shows up in a rdma specific kernel
> > > interface. You have to use rdma netlink to see the kernel object
> > > holding this pin.
> > >
> > > If this visibility is the main sticking point I suggest just enhancing
> > > the existing MR reporting to include the file info for current GUP
> > > pins and teaching lsof to collect information from there as well so it
> > > is easy to use.
> > >
> > > If the ownership of the lease transfers to the MR, and we report that
> > > ownership to userspace in a way lsof can find, then I think all the
> > > concerns that have been raised are met, right?
> >
> > I was contemplating some new lsof feature yesterday.  But what I don't
> > think we want is sysadmins to have multiple tools for multiple
> > subsystems.  Or even have to teach lsof something new for every potential
> > new subsystem user of GUP pins.
>
> Agreed.
>
> > I was thinking more along the lines of reporting files which have GUP
> > pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> > report that information.  That would cover any subsystem which does a
> > longterm pin.
>
> So lsof already parses /proc/<pid>/maps to learn about files held open by
> memory mappings. It could parse some other file as well I guess. The good
> thing about that would be that then "longterm pin" structure would just hold
> struct file reference. That would avoid any needs of special behavior on
> file close (the file reference in the "longterm pin" structure would make
> sure struct file and thus the lease stays around, we'd just need to make
> explicit lease unlock block until the "longterm pin" structure is freed).
> The bad thing is that it requires us to come up with a sane new proc
> interface for reporting "longterm pins" and associated struct file. Also we
> need to define what this interface shows if the pinned pages are in DRAM
> (either page cache or anon) and not on NVDIMM.

The anon vs shared detection case is important because a longterm pin
might be blocking a memory-hot-unplug operation if it is pinning
ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an
explicit concern of the interface. For the anon / cached case I expect
it might be useful to put that communication under the memory-blocks
sysfs interface. I.e. a list of pids that are pinning that
memory-block from being hot-unplugged.
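
For reference, the existing mechanism Jan mentions amounts to reading the
pathname column of /proc/<pid>/maps; a minimal sketch of that parsing (error
handling omitted, current process only):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/self/maps", "r");

	while (fgets(line, sizeof(line), f)) {
		/* file-backed mappings carry an absolute pathname at the end */
		char *path = strchr(line, '/');
		if (path)
			fputs(path, stdout);
	}
	fclose(f);
	return 0;
}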


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Dan Williams
On Wed, Jun 12, 2019 at 5:09 AM Jan Kara  wrote:
>
> On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> >
> > > > > The main objection to the current ODP & DAX solution is that very
> > > > > little HW can actually implement it, having the alternative still
> > > > > require HW support doesn't seem like progress.
> > > > >
> > > > > I think we will eventually start seeing some HW be able to do this
> > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > > on fire, I need to unplug it).
> > > >
> > > > Agreed.  I think software wise there is not much some of the devices 
> > > > can do
> > > > with such an "invalidate".
> > >
> > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > the file pointing to RDMA object? It has to handle that somehow by 
> > > aborting
> > > everything that's going on... And I wanted similar behavior here.
> >
> > It aborts *everything* connected to that file descriptor. Destroying
> > everything avoids creating inconsistencies that destroying a subset
> > would create.
> >
> > What has been talked about for lease break is not destroying anything
> > but very selectively saying that one memory region linked to the GUP
> > is no longer functional.
>
> OK, so what I had in mind was that if RDMA app doesn't play by the rules
> and closes the file with existing pins (and thus layout lease) we would
> force it to abort everything. Yes, it is disruptive but then the app didn't
> obey the rule that it has to maintain file lease while holding pins. Thus
> such situation should never happen unless the app is malicious / buggy.

When you say 'close' do you mean the final release of the fd? The vma
keeps a reference to a 'struct file' live even after the fd is closed.


Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-12 Thread Dan Williams
On Wed, 2019-06-12 at 10:31 +0200, Arnd Bergmann wrote:
> On Tue, Jun 11, 2019 at 7:23 PM Dan Williams  wrote:
> > On Tue, 2019-06-11 at 10:52 -0600, Subash Abhinov Kasiviswanathan
> > wrote:
> > 
> > rmnet should handle muxing the QMAP, QoS, and aggregation and pass
> > the
> > resulting packet to the lower layer. That lower layer could be IPA
> > or
> > qmi_wwan, which in turn passes that QMAP packet to USB or GSI or
> > whatever. This is typically how Linux handles clean abstractions
> > between different protocol layers in drivers.
> > 
> > Similar to some WiFi drivers (drivers/net/wireless/marvell/libertas
> > for
> > example) where the same firmware interface can be accessed via PCI,
> > SDIO, USB, SPI, etc. The bus-specific code is self-contained and
> > does
> > not creep into the upper more generic parts.
> 
> Yes, I think that is a good model. In case of libertas, we have multiple
> layers of inheritance from the basic device (slightly different in the
> implementation, but that is how it should be):

To be clear (and I probably wasn't earlier) I wasn't talking as deep
about the actual code structures as you are here but this a great
discussion.

I was trying to make the point that rmnet doesn't need to care about
how the QMAP packets get to the device itself; it can be pretty generic
so that it can be used by IPA/qmi_wwan/rmnet_smd/etc.

Your points below are a great discussion though...

> struct if_cs_card { /* pcmcia specific */
>  struct lbs_private {  /* libertas specific */
>struct wireless_dev { /* 802.11 specific */
>   struct net_device {
> struct device {
>   ...
> };
> ...
>   };
>   ...
>};
>...
>   };
>   ...
> };

> The outer structure gets allocated when probing the hardware specific
> driver, and everything below it is implemented as direct function calls
> into the more generic code, or as function pointers into the more
> specific code.
> 
> The current rmnet model is different in that by design the upper layer
> (rmnet) and the lower layer (qmi_wwan, ipa, ...) are kept independent in
> both directions, i.e. ipa has (almost) no knowledge of rmnet, and just
> has pointers to the other net_device:
> 
>ipa_device
>net_device
> 
>rmnet_port
>net_device
> 
> I understand that the rmnet model was intended to provide a cleaner
> abstraction, but it's not how we normally structure subsystems in
> Linux, and moving to a model more like how wireless_dev works
> would improve both readability and performance, as you describe
> it, it would be more like (ignoring for now the need for multiple
> connections):
> 
>ipa_dev
> rmnet_dev
>wwan_dev
>   net_device

Perhaps I'm assuming too much from this diagram, but it shows a 1:1 mapping
between wwan_dev and "lower" devices.

What Johannes is proposing (IIRC) is something a bit looser where a
wwan_dev does not necessarily provide netdev itself, but is instead the
central point that various channels (control, data, gps, sim card, etc)
register with. That way the wwan_dev can provide an overall view of the
WWAN device to userspace, and userspace can talk to the wwan_dev to ask
the lower drivers (ipa, rmnet, etc) to create new channels (netdev,
tty, otherwise) when the control channel has told the modem firmware to
expect one.

For example, say you have told the firmware to create a new data
channel with ID 5 via QMI (which the kernel is unaware of because it
does not process higher-level QMI requests).

Perhaps (and this is all just brainstorming) then userspace asks the
wwan_dev to create a new data channel with ID 5 and a certain QoS. IPA
(or rmnet because that's the data channel provider for IPA) has
registered callbacks to the wwan_dev, receives this request, and
creates a new rmnet_dev/net_device, and then the wwan_dev passes the
ifindex back to userspace so we don't have to play the dance of "match
up my request with a random netlink ADD event".

The point being that since data channels aren't actually useful until
the control channel agrees with the firmware that one should exist, if
we have a wwan_dev that represents the entire WWAN device then we don't
need the initial-but-useless net_device.

Just some thoughts; Johannes can feel free to correct me at any time :)

> Where each layer is a specialization of the next. Note: this is a
> common change when moving from proprietary code to upstream
> code. If a driver module is designed to live out of tree, there
> is a strong incentive to limit the number of interfaces it uses,
> but when it gets merged, it becomes much more flexible, as
> an internal interface between wwan_dev and the hardware driver(s)
> can be easily changed by modifying all drivers at once.

Yep, I've seen this time and time again.

Dan



[PATCH 4/6] libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()

2019-06-11 Thread Dan Williams
In preparation for fixing a deadlock between wait_for_bus_probe_idle()
and the nvdimm_bus_list_mutex, arrange for __nd_ioctl() to be called
without nvdimm_bus_list_mutex held. This also unifies the 'dimm' and 'bus'
level ioctls into a common nd_ioctl() preamble implementation.

Marked for -stable as it is a pre-requisite for a follow-on fix.

Cc: 
Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
Cc: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/bus.c |   96 --
 drivers/nvdimm/nd-core.h |3 +
 2 files changed, 60 insertions(+), 39 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 42713b210f51..dfb93228d6a7 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -73,7 +73,7 @@ static void nvdimm_bus_probe_end(struct nvdimm_bus 
*nvdimm_bus)
 {
nvdimm_bus_lock(&nvdimm_bus->dev);
if (--nvdimm_bus->probe_active == 0)
-   wake_up(&nvdimm_bus->probe_wait);
+   wake_up(&nvdimm_bus->wait);
nvdimm_bus_unlock(&nvdimm_bus->dev);
 }
 
@@ -341,7 +341,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device 
*parent,
return NULL;
INIT_LIST_HEAD(&nvdimm_bus->list);
INIT_LIST_HEAD(&nvdimm_bus->mapping_list);
-   init_waitqueue_head(&nvdimm_bus->probe_wait);
+   init_waitqueue_head(&nvdimm_bus->wait);
nvdimm_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
if (nvdimm_bus->id < 0) {
kfree(nvdimm_bus);
@@ -426,6 +426,9 @@ static int nd_bus_remove(struct device *dev)
list_del_init(&nvdimm_bus->list);
mutex_unlock(&nvdimm_bus_list_mutex);
 
+   wait_event(nvdimm_bus->wait,
+   atomic_read(&nvdimm_bus->ioctl_active) == 0);
+
nd_synchronize();
device_for_each_child(&nvdimm_bus->dev, NULL, child_unregister);
 
@@ -885,7 +888,7 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
if (nvdimm_bus->probe_active == 0)
break;
nvdimm_bus_unlock(&nvdimm_bus->dev);
-   wait_event(nvdimm_bus->probe_wait,
+   wait_event(nvdimm_bus->wait,
nvdimm_bus->probe_active == 0);
nvdimm_bus_lock(&nvdimm_bus->dev);
} while (true);
@@ -1115,24 +1118,10 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
return rc;
 }
 
-static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-   long id = (long) file->private_data;
-   int rc = -ENXIO, ro;
-   struct nvdimm_bus *nvdimm_bus;
-
-   ro = ((file->f_flags & O_ACCMODE) == O_RDONLY);
-   mutex_lock(&nvdimm_bus_list_mutex);
-   list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) {
-   if (nvdimm_bus->id == id) {
-   rc = __nd_ioctl(nvdimm_bus, NULL, ro, cmd, arg);
-   break;
-   }
-   }
-   mutex_unlock(&nvdimm_bus_list_mutex);
-
-   return rc;
-}
+enum nd_ioctl_mode {
+   BUS_IOCTL,
+   DIMM_IOCTL,
+};
 
 static int match_dimm(struct device *dev, void *data)
 {
@@ -1147,31 +1136,62 @@ static int match_dimm(struct device *dev, void *data)
return 0;
 }
 
-static long nvdimm_ioctl(struct file *file, unsigned int cmd, unsigned long 
arg)
+static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
+   enum nd_ioctl_mode mode)
+
 {
-   int rc = -ENXIO, ro;
-   struct nvdimm_bus *nvdimm_bus;
+   struct nvdimm_bus *nvdimm_bus, *found = NULL;
+   long id = (long) file->private_data;
+   struct nvdimm *nvdimm = NULL;
+   int rc, ro;
 
ro = ((file->f_flags & O_ACCMODE) == O_RDONLY);
mutex_lock(&nvdimm_bus_list_mutex);
list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) {
-   struct device *dev = device_find_child(&nvdimm_bus->dev,
-   file->private_data, match_dimm);
-   struct nvdimm *nvdimm;
-
-   if (!dev)
-   continue;
+   if (mode == DIMM_IOCTL) {
+   struct device *dev;
+
+   dev = device_find_child(_bus->dev,
+   file->private_data, match_dimm);
+   if (!dev)
+   continue;
+   nvdimm = to_nvdimm(dev);
+   found = nvdimm_bus;
+   } else if (nvdimm_bus->id == id) {
+   found = nvdimm_bus;
+   }
 
-   nvdimm = to_nvdimm(dev);
-   rc = __nd_ioctl(nvdimm_bus, nvdimm, ro, cmd, arg);
-   put_device(dev);
-   break;
+   if (found) {
+   atomic_inc(&nvdimm_bus->ioctl_active);
+   break;
+   }
}
mutex_unlock(&nvdimm_bus_list_mutex);
 
+   if (!fo
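
The tail of this hunk is truncated in the archive. The bus-ioctl half of the
scheme it implements is roughly the following sketch (the DIMM_IOCTL handling
and error paths of the real patch are omitted, and the function name is made
up): pin the bus with a counter while the list mutex is held, run the command
without that mutex, then release the pin so nd_bus_remove() (shown earlier in
the diff) can stop waiting.

static long nd_bus_ioctl_sketch(struct file *file, unsigned int cmd,
		unsigned long arg)
{
	struct nvdimm_bus *nvdimm_bus, *found = NULL;
	long id = (long) file->private_data;
	int ro = ((file->f_flags & O_ACCMODE) == O_RDONLY);
	long rc = -ENXIO;

	mutex_lock(&nvdimm_bus_list_mutex);
	list_for_each_entry(nvdimm_bus, &nvdimm_bus_list, list) {
		if (nvdimm_bus->id == id) {
			/* pin the bus so nd_bus_remove() waits for us */
			atomic_inc(&nvdimm_bus->ioctl_active);
			found = nvdimm_bus;
			break;
		}
	}
	mutex_unlock(&nvdimm_bus_list_mutex);

	if (!found)
		return rc;

	/* the command now runs without nvdimm_bus_list_mutex held */
	rc = __nd_ioctl(found, NULL, ro, cmd, arg);

	/* drop the pin and let a pending nd_bus_remove() proceed */
	if (atomic_dec_and_test(&found->ioctl_active))
		wake_up(&found->wait);

	return rc;
}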

[PATCH 5/6] libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock

2019-06-11 Thread Dan Williams
A multithreaded namespace creation/destruction stress test currently
deadlocks with the following lockup signature:

INFO: task ndctl:2924 blocked for more than 122 seconds.
  Tainted: G   OE 5.2.0-rc4+ #3382
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ndctl   D0  2924   1176 0x
Call Trace:
 ? __schedule+0x27e/0x780
 schedule+0x30/0xb0
 wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
 ? finish_wait+0x80/0x80
 uuid_store+0xe6/0x2e0 [libnvdimm]
 kernfs_fop_write+0xf0/0x1a0
 vfs_write+0xb7/0x1b0
 ksys_write+0x5c/0xd0
 do_syscall_64+0x60/0x240

 INFO: task ndctl:2923 blocked for more than 122 seconds.
   Tainted: G   OE 5.2.0-rc4+ #3382
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ndctl   D0  2923   1175 0x
 Call Trace:
  ? __schedule+0x27e/0x780
  ? __mutex_lock+0x489/0x910
  schedule+0x30/0xb0
  schedule_preempt_disabled+0x11/0x20
  __mutex_lock+0x48e/0x910
  ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
  ? __lock_acquire+0x23f/0x1710
  ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
  nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
  __dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
  ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
  dax_pmem_probe+0xc/0x20 [dax_pmem]
  nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
  really_probe+0xef/0x390
  driver_probe_device+0xb4/0x100

In this sequence an 'nd_dax' device is being probed and trying to take
the lock on its backing namespace to validate that the 'nd_dax' device
indeed has exclusive access to the backing namespace. Meanwhile, another
thread is trying to update the uuid property of that same backing
namespace. So one thread is in the probe path trying to acquire the
lock, and the other thread has acquired the lock and tries to flush the
probe path.

Fix this deadlock by not holding the namespace device_lock over the
wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
subsequently dropped internally to wait_nvdimm_bus_probe_idle().

Cc: 
Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
Cc: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/bus.c |   17 +++--
 drivers/nvdimm/region_devs.c |4 
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index dfb93228d6a7..c1d26fca9c4c 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -887,10 +887,12 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
do {
if (nvdimm_bus->probe_active == 0)
break;
-   nvdimm_bus_unlock(&nvdimm_bus->dev);
+   nvdimm_bus_unlock(dev);
+   device_unlock(dev);
wait_event(nvdimm_bus->wait,
nvdimm_bus->probe_active == 0);
-   nvdimm_bus_lock(&nvdimm_bus->dev);
+   device_lock(dev);
+   nvdimm_bus_lock(dev);
} while (true);
 }
 
@@ -1017,7 +1019,7 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
case ND_CMD_ARS_START:
case ND_CMD_CLEAR_ERROR:
case ND_CMD_CALL:
-   dev_dbg(&nvdimm_bus->dev, "'%s' command while read-only.\n",
+   dev_dbg(dev, "'%s' command while read-only.\n",
nvdimm ? nvdimm_cmd_name(cmd)
: nvdimm_bus_cmd_name(cmd));
return -EPERM;
@@ -1088,7 +1090,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
goto out;
}
 
-   nvdimm_bus_lock(&nvdimm_bus->dev);
+   device_lock(dev);
+   nvdimm_bus_lock(dev);
rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, func, buf);
if (rc)
goto out_unlock;
@@ -1103,7 +1106,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
nvdimm_account_cleared_poison(nvdimm_bus, clear_err->address,
clear_err->cleared);
}
-   nvdimm_bus_unlock(&nvdimm_bus->dev);
+   nvdimm_bus_unlock(dev);
+   device_unlock(dev);
 
if (copy_to_user(p, buf, buf_len))
rc = -EFAULT;
@@ -1112,7 +1116,8 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
return rc;
 
  out_unlock:
-   nvdimm_bus_unlock(&nvdimm_bus->dev);
+   nvdimm_bus_unlock(dev);
+   device_unlock(dev);
  out:
vfree(buf);
return rc;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/regio
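
For reference, the inversion the changelog describes reduces to roughly:

  thread A (uuid_store)                    thread B (dax_pmem probe)
  ---------------------                    -------------------------
  device_lock(namespace)                   probe_active elevated
  wait_nvdimm_bus_probe_idle()             nvdimm_namespace_common_probe()
    waits for probe_active == 0              blocks on device_lock(namespace)

Neither thread can make progress. The first hunk above breaks the cycle by
dropping both the bus lock and the namespace device_lock around the wait and
retaking them afterwards.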

[PATCH 6/6] driver-core, libnvdimm: Let device subsystems add local lockdep coverage

2019-06-11 Thread Dan Williams
For good reason, the standard device_lock() is marked
lockdep_set_novalidate_class() because there is simply no sane way to
describe the myriad ways the device_lock() is ordered with other locks.
However, that leaves subsystems that know their own local device_lock()
ordering rules to find lock ordering mistakes manually. Instead,
introduce an optional / additional lockdep-enabled lock that a subsystem
can acquire in all the same paths that the device_lock() is acquired.

A conversion of the NFIT driver and NVDIMM subsystem to a
lockdep-validate device_lock() scheme is included. The
debug_nvdimm_lock() implementation implements the correct lock-class and
stacking order for the libnvdimm device topology hierarchy.

Cc: Ingo Molnar 
Cc: Ira Weiny 
Cc: Will Deacon 
Cc: Dave Jiang 
Cc: Keith Busch 
Cc: Peter Zijlstra 
Cc: Vishal Verma 
Cc: "Rafael J. Wysocki" 
Cc: Greg Kroah-Hartman 
Signed-off-by: Dan Williams 
---
 drivers/acpi/nfit/core.c|   28 
 drivers/acpi/nfit/nfit.h|   24 ++
 drivers/base/core.c |3 ++
 drivers/nvdimm/btt_devs.c   |   16 +
 drivers/nvdimm/bus.c|   28 ++--
 drivers/nvdimm/core.c   |   10 +++---
 drivers/nvdimm/dimm_devs.c  |4 +-
 drivers/nvdimm/namespace_devs.c |   36 ++---
 drivers/nvdimm/nd-core.h|   68 +++
 drivers/nvdimm/pfn_devs.c   |   24 +++---
 drivers/nvdimm/pmem.c   |4 +-
 drivers/nvdimm/region.c |2 +
 drivers/nvdimm/region_devs.c|   16 +
 include/linux/device.h  |5 +++
 14 files changed, 187 insertions(+), 81 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 23022cf20d26..f22139458ce1 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1282,7 +1282,7 @@ static ssize_t hw_error_scrub_store(struct device *dev,
if (rc)
return rc;
 
-   device_lock(dev);
+   nfit_device_lock(dev);
nd_desc = dev_get_drvdata(dev);
if (nd_desc) {
struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
@@ -1299,7 +1299,7 @@ static ssize_t hw_error_scrub_store(struct device *dev,
break;
}
}
-   device_unlock(dev);
+   nfit_device_unlock(dev);
if (rc)
return rc;
return size;
@@ -1319,7 +1319,7 @@ static ssize_t scrub_show(struct device *dev,
ssize_t rc = -ENXIO;
bool busy;
 
-   device_lock(dev);
+   nfit_device_lock(dev);
nd_desc = dev_get_drvdata(dev);
if (!nd_desc) {
device_unlock(dev);
@@ -1339,7 +1339,7 @@ static ssize_t scrub_show(struct device *dev,
}
 
mutex_unlock(&acpi_desc->init_mutex);
-   device_unlock(dev);
+   nfit_device_unlock(dev);
return rc;
 }
 
@@ -1356,14 +1356,14 @@ static ssize_t scrub_store(struct device *dev,
if (val != 1)
return -EINVAL;
 
-   device_lock(dev);
+   nfit_device_lock(dev);
nd_desc = dev_get_drvdata(dev);
if (nd_desc) {
struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
 
rc = acpi_nfit_ars_rescan(acpi_desc, ARS_REQ_LONG);
}
-   device_unlock(dev);
+   nfit_device_unlock(dev);
if (rc)
return rc;
return size;
@@ -1749,9 +1749,9 @@ static void acpi_nvdimm_notify(acpi_handle handle, u32 
event, void *data)
struct acpi_device *adev = data;
struct device *dev = &adev->dev;
 
-   device_lock(dev->parent);
+   nfit_device_lock(dev->parent);
__acpi_nvdimm_notify(dev, event);
-   device_unlock(dev->parent);
+   nfit_device_unlock(dev->parent);
 }
 
 static bool acpi_nvdimm_has_method(struct acpi_device *adev, char *method)
@@ -3457,8 +3457,8 @@ static int acpi_nfit_flush_probe(struct 
nvdimm_bus_descriptor *nd_desc)
struct device *dev = acpi_desc->dev;
 
/* Bounce the device lock to flush acpi_nfit_add / acpi_nfit_notify */
-   device_lock(dev);
-   device_unlock(dev);
+   nfit_device_lock(dev);
+   nfit_device_unlock(dev);
 
/* Bounce the init_mutex to complete initial registration */
mutex_lock(&acpi_desc->init_mutex);
@@ -3602,8 +3602,8 @@ void acpi_nfit_shutdown(void *data)
 * acpi_nfit_ars_rescan() submissions have had a chance to
 * either submit or see ->cancel set.
 */
-   device_lock(bus_dev);
-   device_unlock(bus_dev);
+   nfit_device_lock(bus_dev);
+   nfit_device_unlock(bus_dev);
 
flush_workqueue(nfit_wq);
 }
@@ -3746,9 +3746,9 @@ EXPORT_SYMBOL_GPL(__acpi_nfit_notify);
 
 static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
 {
-   device_lock(&adev->dev);
+   nfit_device_lock(&adev->dev);
__acpi_nfit_notify(&adev->dev, ad
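
The nfit_device_lock()/nfit_device_unlock() helpers themselves are not visible
in this truncated diff. Per the changelog and the include/linux/device.h entry
in the diffstat, the idea is roughly the following sketch (the shadow-mutex
name is assumed from the series, not quoted from it):

static inline void nfit_device_lock(struct device *dev)
{
	device_lock(dev);                /* the real lock, invisible to lockdep */
	mutex_lock(&dev->lockdep_mutex); /* shadow lock that lockdep can validate */
}

static inline void nfit_device_unlock(struct device *dev)
{
	mutex_unlock(&dev->lockdep_mutex);
	device_unlock(dev);
}

The debug_nvdimm_lock() variant mentioned in the changelog presumably does the
same with nested lock classes so lockdep can model the bus -> region ->
namespace stacking order.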

[PATCH 3/6] libnvdimm/region: Register badblocks before namespaces

2019-06-11 Thread Dan Williams
Namespace activation expects to be able to reference region badblocks.
The following warning sometimes triggers when asynchronous namespace
activation races in front of the completion of namespace probing. Move
all possible namespace probing after region badblocks initialization.

Otherwise, lockdep sometimes catches the uninitialized state of the
badblocks seqlock with stack trace signatures like:

INFO: trying to register non-static key.
pmem2: detected capacity change from 0 to 136365211648
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G   OE 5.2.0-rc4+ 
#3382
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 dump_stack+0x85/0xc0
pmem1.12: detected capacity change from 0 to 8589934592
 register_lock_class+0x56a/0x570
 ? check_object+0x140/0x270
 __lock_acquire+0x80/0x1710
 ? __mutex_lock+0x39d/0x910
 lock_acquire+0x9e/0x180
 ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
 badblocks_check+0x93/0x1f0
 ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
 nd_pfn_validate+0x28f/0x440 [libnvdimm]
 ? lockdep_hardirqs_on+0xf0/0x180
 nd_dax_probe+0x9a/0x120 [libnvdimm]
 nd_pmem_probe+0x6d/0x180 [nd_pmem]
 nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]

Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
Cc: 
Cc: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/region.c |   22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index ef46cc3a71ae..488c47ac4c4a 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -34,17 +34,6 @@ static int nd_region_probe(struct device *dev)
if (rc)
return rc;
 
-   rc = nd_region_register_namespaces(nd_region, &err);
-   if (rc < 0)
-   return rc;
-
-   ndrd = dev_get_drvdata(dev);
-   ndrd->ns_active = rc;
-   ndrd->ns_count = rc + err;
-
-   if (rc && err && rc == err)
-   return -ENODEV;
-
	if (is_nd_pmem(&nd_region->dev)) {
struct resource ndr_res;
 
@@ -60,6 +49,17 @@ static int nd_region_probe(struct device *dev)
	nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res);
}
 
+   rc = nd_region_register_namespaces(nd_region, &err);
+   if (rc < 0)
+   return rc;
+
+   ndrd = dev_get_drvdata(dev);
+   ndrd->ns_active = rc;
+   ndrd->ns_count = rc + err;
+
+   if (rc && err && rc == err)
+   return -ENODEV;
+
nd_region->btt_seed = nd_btt_create(nd_region);
nd_region->pfn_seed = nd_pfn_create(nd_region);
nd_region->dax_seed = nd_dax_create(nd_region);



[PATCH 2/6] libnvdimm/bus: Prevent duplicate device_unregister() calls

2019-06-11 Thread Dan Williams
A multithreaded namespace creation/destruction stress test currently
fails with signatures like the following:

sysfs group 'power' not found for kobject 'dax1.1'
RIP: 0010:sysfs_remove_group+0x76/0x80
Call Trace:
 device_del+0x73/0x370
 device_unregister+0x16/0x50
 nd_async_device_unregister+0x1e/0x30 [libnvdimm]
 async_run_entry_fn+0x39/0x160
 process_one_work+0x23c/0x5e0
 worker_thread+0x3c/0x390

BUG: kernel NULL pointer dereference, address: 0020
RIP: 0010:klist_put+0x1b/0x6c
Call Trace:
 klist_del+0xe/0x10
 device_del+0x8a/0x2c9
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 device_unregister+0x44/0x4f
 nd_async_device_unregister+0x22/0x2d [libnvdimm]
 async_run_entry_fn+0x47/0x15a
 process_one_work+0x1a2/0x2eb
 worker_thread+0x1b8/0x26e

Use the kill_device() helper to atomically resolve the race of multiple
threads issuing kill / device_unregister() requests.

Reported-by: Jane Chu 
Reported-by: Erwin Tsaur 
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm 
device-driver...")
Cc: 
Link: https://github.com/pmem/ndctl/issues/96
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/bus.c |   25 +
 1 file changed, 25 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 2dca3034fee0..42713b210f51 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -547,13 +547,38 @@ EXPORT_SYMBOL(nd_device_register);
 
 void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
 {
+   bool killed;
+
switch (mode) {
case ND_ASYNC:
+   /*
+* In the async case this is being triggered with the
+* device lock held and the unregistration work needs to
+* be moved out of line iff this thread has won the
+* race to schedule the deletion.
+*/
+   if (!kill_device(dev))
+   return;
+
get_device(dev);
async_schedule_domain(nd_async_device_unregister, dev,
&nd_async_domain);
break;
case ND_SYNC:
+   /*
+* In the sync case the device is being unregistered due
+* to a state change of the parent. Claim the kill state
+* to synchronize against other unregistration requests,
+* or otherwise let the async path handle it if the
+* unregistration was already queued.
+*/
+   device_lock(dev);
+   killed = kill_device(dev);
+   device_unlock(dev);
+
+   if (!killed)
+   return;
+
nd_synchronize();
device_unregister(dev);
break;



[PATCH 1/6] drivers/base: Introduce kill_device()

2019-06-11 Thread Dan Williams
The libnvdimm subsystem arranges for devices to be destroyed as a result
of a sysfs operation. Since device_unregister() cannot be called from
an actively running sysfs attribute of the same device libnvdimm
arranges for device_unregister() to be performed in an out-of-line async
context.

The driver core maintains a 'dead' state for coordinating its own racing
async registration / de-registration requests. Rather than add local
'dead' state tracking infrastructure to libnvdimm device objects, export
the existing state tracking via a new kill_device() helper.

The kill_device() helper simply marks the device as dead, i.e. that it
is on its way to device_del(), or returns that the device was already
dead. This can be used in advance of calling device_unregister() for
subsystems like libnvdimm that might need to handle multiple user
threads racing to delete a device.
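
The intended call pattern (mirroring the ND_SYNC case in patch 2/6, shown
here only as a sketch) is roughly:

	bool killed;

	device_lock(dev);
	killed = kill_device(dev);	/* false if already marked dead */
	device_unlock(dev);

	if (!killed)
		return;			/* another thread owns the teardown */

	device_unregister(dev);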

This refactoring does not change any behavior, but it is a pre-requisite
for follow-on fixes and therefore marked for -stable.

Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm 
device-driver...")
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/base/core.c|   27 +++
 include/linux/device.h |1 +
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index fd7511e04e62..eaf3aa0cb803 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2211,6 +2211,24 @@ void put_device(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(put_device);
 
+bool kill_device(struct device *dev)
+{
+   /*
+* Require the device lock and set the "dead" flag to guarantee that
+* the update behavior is consistent with the other bitfields near
+* it and that we cannot have an asynchronous probe routine trying
+* to run while we are tearing out the bus/class/sysfs from
+* underneath the device.
+*/
+   lockdep_assert_held(&dev->mutex);
+
+   if (dev->p->dead)
+   return false;
+   dev->p->dead = true;
+   return true;
+}
+EXPORT_SYMBOL_GPL(kill_device);
+
 /**
  * device_del - delete device from system.
  * @dev: device.
@@ -2230,15 +2248,8 @@ void device_del(struct device *dev)
struct kobject *glue_dir = NULL;
struct class_interface *class_intf;
 
-   /*
-* Hold the device lock and set the "dead" flag to guarantee that
-* the update behavior is consistent with the other bitfields near
-* it and that we cannot have an asynchronous probe routine trying
-* to run while we are tearing out the bus/class/sysfs from
-* underneath the device.
-*/
device_lock(dev);
-   dev->p->dead = true;
+   kill_device(dev);
device_unlock(dev);
 
/* Notify clients of device removal.  This call must come
diff --git a/include/linux/device.h b/include/linux/device.h
index e85264fb6616..0da5c67f6be1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1373,6 +1373,7 @@ extern int (*platform_notify_remove)(struct device *dev);
  */
 extern struct device *get_device(struct device *dev);
 extern void put_device(struct device *dev);
+extern bool kill_device(struct device *dev);
 
 #ifdef CONFIG_DEVTMPFS
 extern int devtmpfs_create_node(struct device *dev);



[PATCH 0/6] libnvdimm: Fix async operations and locking

2019-06-11 Thread Dan Williams
The libnvdimm subsystem uses async operations to parallelize device
probing operations and to allow sysfs to trigger device_unregister() on
deleted namespaces. A multithreaded stress test of the libnvdimm sysfs
interface uncovered a case where device_unregister() is triggered
multiple times, and the subsequent investigation uncovered a broken
locking scenario.

The lack of lockdep coverage for device_lock() stymied the debugging.
That is, until patch6 "driver-core, libnvdimm: Let device subsystems add
local lockdep coverage" solved it with a shadow lock, with lockdep
coverage, that mirrors device_lock() operations. Given the time saved by
the shadow-lock debug hack, patch6 attempts to generalize it into a
device_lock() debug facility that might be carried upstream. Patch6 is
staged at the end of this fix series in case it is contentious and needs
to be dropped.
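
For reference, the shadow-lock idea boils down to wrappers roughly like
the following (a sketch, not the exact patch6 code; it assumes the new
"dev->lockdep_mutex" field mentioned above):

	static inline void nfit_device_lock(struct device *dev)
	{
		device_lock(dev);
		mutex_lock(&dev->lockdep_mutex);	/* shadow lock, visible to lockdep */
	}

	static inline void nfit_device_unlock(struct device *dev)
	{
		mutex_unlock(&dev->lockdep_mutex);
		device_unlock(dev);
	}

Since device_lock() itself is invisible to lockdep, taking the shadow
mutex immediately inside it lets lockdep report the same ordering
violations the real lock would cause.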

Patch1 "drivers/base: Introduce kill_device()" could be achieved with
local libnvdimm infrastructure. However, the existing 'dead' flag in
'struct device_private' aims to solve similar async register/unregister
races, so the fix in patch2 "libnvdimm/bus: Prevent duplicate
device_unregister() calls" can be implemented with existing driver-core
infrastructure.

Patch3 addresses a rare lockdep warning that triggers intermittently
when namespaces race ahead of the completion of their parent region's
probe. It is not related to the other fixes; it just happened to
trigger as a result of the async stress test.

Patch4 and patch5 address an ABBA deadlock tripped by the stress test.

These patches pass the failing stress test and the existing libnvdimm
unit tests with CONFIG_PROVE_LOCKING=y and the new "dev->lockdep_mutex"
shadow lock with no lockdep warnings.

---

Dan Williams (6):
  drivers/base: Introduce kill_device()
  libnvdimm/bus: Prevent duplicate device_unregister() calls
  libnvdimm/region: Register badblocks before namespaces
  libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
  libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
  driver-core, libnvdimm: Let device subsystems add local lockdep coverage


 drivers/acpi/nfit/core.c|   28 ---
 drivers/acpi/nfit/nfit.h|   24 ++
 drivers/base/core.c |   30 ++--
 drivers/nvdimm/btt_devs.c   |   16 ++--
 drivers/nvdimm/bus.c|  154 +++
 drivers/nvdimm/core.c   |   10 +--
 drivers/nvdimm/dimm_devs.c  |4 +
 drivers/nvdimm/namespace_devs.c |   36 +
 drivers/nvdimm/nd-core.h|   71 ++
 drivers/nvdimm/pfn_devs.c   |   24 +++---
 drivers/nvdimm/pmem.c   |4 +
 drivers/nvdimm/region.c |   24 +++---
 drivers/nvdimm/region_devs.c|   12 ++-
 include/linux/device.h  |6 ++
 14 files changed, 308 insertions(+), 135 deletions(-)


Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-11 Thread Dan Williams
On Tue, 2019-06-11 at 10:52 -0600, Subash Abhinov Kasiviswanathan
wrote:
> > The general plan (and I believe Daniele Palmas was working on it)
> > was
> > to eventually make qmi_wwan use rmnet rather than its internal
> > sysfs-
> > based implementation. qmi_wwan and ipa are at essentially the same
> > level and both could utilize rmnet on top.
> > 
> > *That's* what I'd like to see. I don't want to see two different
> > ways
> > to get QMAP packets to modem firmware from two different drivers
> > that
> > really could use the same code.
> > 
> > Dan
> 
> qmi_wwan is based on USB and is very different from the IPA
> interconnect
> though. AFAIK, they do not have much in common (apart from sending &
> receiving MAP packets from hardware).

That is correct, they are very different drivers but as you state they
send and receive MAP packets with the other end via some closer-to-
hardware protocol (USB or GSI?) than QMAP.

rmnet should handle muxing the QMAP, QoS, and aggregation and pass the
resulting packet to the lower layer. That lower layer could be IPA or
qmi_wwan, which in turn passes that QMAP packet to USB or GSI or
whatever. This is typically how Linux handles clean abstractions
between different protocol layers in drivers.

Similar to some WiFi drivers (drivers/net/wireless/marvell/libertas for
example) where the same firmware interface can be accessed via PCI,
SDIO, USB, SPI, etc. The bus-specific code is self-contained and does
not creep into the upper more generic parts.

Dan



Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-11 Thread Dan Williams
On Tue, 2019-06-11 at 13:56 +0200, Arnd Bergmann wrote:
> On Tue, Jun 11, 2019 at 10:12 AM Johannes Berg
>  wrote:
> 
> > > As I've made clear before, my work on this has been focused on
> > > the IPA transport,
> > > and some of this higher-level LTE architecture is new to me.  But
> > > it
> > > seems pretty clear that an abstracted WWAN subsystem is a good
> > > plan,
> > > because these devices represent a superset of what a "normal"
> > > netdev
> > > implements.
> > 
> > I'm not sure I'd actually call it a superset. By themselves, these
> > netdevs are actually completely useless to the network stack,
> > AFAICT.
> > Therefore, the overlap with netdevs you can really use with the
> > network
> > stack is pretty small?
> 
> I think Alex meant the concept of having a type of netdev with a
> generic
> user space interface for wwan and similar to a wlan device, as I
> understood
> you had suggested as well, as opposed to a stacked device as in
> rmnet or those drivers it seems to be modeled after (vlan, ip tunnel,
> ...)/.
> 
> > > HOWEVER I disagree with your suggestion that the IPA code should
> > > not be committed until after that is all sorted out.  In part
> > > it's
> > > for selfish reasons, but I think there are legitimate reasons to
> > > commit IPA now *knowing* that it will need to be adapted to fit
> > > into the generic model that gets defined and developed.  Here
> > > are some reasons why.
> > 
> > I can't really argue with those, though I would point out that the
> > converse also holds - if we commit to this now, then we will have
> > to
> > actually keep the API offered by IPA/rmnet today, so we cannot
> > actually
> > remove the netdev again, even if we do migrate it to offer support
> > for a
> > WWAN framework in the future.
> 
> Right. The interface to support rmnet might be simple enough to keep
> next to what becomes the generic interface, but it will always
> continue
> to be an annoyance.
> 
> > > Second, the IPA code has been out for review recently, and has
> > > been
> > > the subject of some detailed discussion in the past few
> > > weeks.  Arnd
> > > especially has invested considerable time in review and
> > > discussion.
> > > Delaying things until after a better generic model is settled on
> > > (which I'm guessing might be on the order of months)
> > 
> > I dunno if it really has to be months. I think we can cobble
> > something
> > together relatively quickly that addresses the needs of IPA more
> > specifically, and then extend later?
> > 
> > But OTOH it may make sense to take a more paced approach and think
> > about the details more carefully than we have over in the other
> > thread so far.
> 
> I would hope that as soon as we can agree on a general approach, it
> would also be possible to merge a minimal implementation into the
> kernel
> along with IPA. Alex already mentioned that IPA in its current state
> does
> not actually support more than one data channel, so the necessary
> setup for it becomes even simpler.
> 
> At the moment, the rmnet configuration in
> include/uapi/linux/if_link.h
> is almost trivial, with the three pieces of information needed being
> an IFLA_LINK to point to the real device (not needed if there is only
> one device per channel, instead of two), the IFLA_RMNET_MUX_ID
> setting the ID of the muxing channel (not needed if there is only
> one channel ?), a way to specify software bridging between channels
> (not useful if there is only one channel) and a few flags that I
> assume
> must match the remote end:
> 
> #define RMNET_FLAGS_INGRESS_DEAGGREGATION (1U << 0)
> #define RMNET_FLAGS_INGRESS_MAP_COMMANDS  (1U << 1)
> #define RMNET_FLAGS_INGRESS_MAP_CKSUMV4   (1U << 2)
> #define RMNET_FLAGS_EGRESS_MAP_CKSUMV4(1U << 3)
> enum {
> IFLA_RMNET_UNSPEC,
> IFLA_RMNET_MUX_ID,
> IFLA_RMNET_FLAGS,
> __IFLA_RMNET_MAX,
> };
> #define IFLA_RMNET_MAX  (__IFLA_RMNET_MAX - 1)
> struct ifla_rmnet_flags {
> __u32   flags;
> __u32   mask;
> };
> 
> > > Third, having the code upstream actually means the actual
> > > requirements
> > > for rmnet-over-IPA are clear and explicit.  This might not be a
> > > huge
> > > deal, but I think it's better to devise a generic WWAN scheme
> > > that
> > > can refer to actual code than to do so with assumptions about
> > > what
> > > will work with rmnet (and others).  As far as I know, the
> > > upstream
> > > rmnet has no other upstream back end; IPA will make it "real."
> > 
> > Is that really true? I had previously been told that rmnet actually
> > does
> > have use with a few existing drivers.
> > 
> > 
> > If true though, then I think this would be the killer argument *in
> > favour* of *not* merging this - because that would mean we *don't*
> > have
> > to actually keep the rmnet API around for all foreseeable future.
> 
> I would agree with that. From the code I can see no other driver
> including the rmnet protocol header (see 

Re: [PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields

2019-06-07 Thread Dan Williams
On Fri, Jun 7, 2019 at 12:54 PM Andrew Morton  wrote:
>
> On Thu, 6 Jun 2019 15:06:26 -0700 Dan Williams  
> wrote:
>
> > On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton  
> > wrote:
> > >
> > > On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams 
> > >  wrote:
> > >
> > > > At namespace creation time there is the potential for the "expected to
> > > > be zero" fields of a 'pfn' info-block to be filled with indeterminate
> > > > data. While the kernel buffer is zeroed on allocation it is immediately
> > > > overwritten by nd_pfn_validate() filling it with the current contents of
> > > > the on-media info-block location. For fields like 'flags' and the
> > > > 'padding' it potentially means that future implementations can not rely
> > > > on those fields being zero.
> > > >
> > > > In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> > > > section alignment, arrange for fields that are not explicitly
> > > > initialized to be guaranteed zero. Bump the minor version to indicate it
> > > > is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> > > > corruption is expected to be benign since all other critical fields are
> > > > explicitly initialized.
> > > >
> > > > Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
> > > > Cc: 
> > > > Signed-off-by: Dan Williams 
> > >
> > > The cc:stable in [11/12] seems odd.  Is this independent of the other
> > > patches?  If so, shouldn't it be a standalone thing which can be
> > > prioritized?
> > >
> >
> > The cc: stable is about spreading this new policy to as many kernels
> > as possible, not fixing an issue in those kernels. It only becomes an
> > issue with patch 12, "libnvdimm/pfn: Stop padding pmem namespaces to
> > section alignment", as all previous kernels do initialize all fields.
> >
> > I'd be OK to drop that cc: stable; my concern is distros that somehow
> > pick up and backport patch 12 and miss patch 11.
>
> Could you please propose a changelog paragraph which explains all this
> to those who will be considering this patch for backports?
>

Will do. I'll resend the series with this and the fixups proposed by Oscar.


[PATCH v3 00/10] EFI Specific Purpose Memory Support

2019-06-07 Thread Dan Williams
Changes since v2:
- Consolidate the new E820_TYPE and IORES_DESC and EFI configuration
  symbol on an "_APPLICATION_RESERVED" suffix. (Ard).

- Rework the implementation to drop the new MEMBLOCK_APP_SPECIFIC
  memblock and move the reservation earlier to e820__memblock_setup().
  (Mike)

- Move efi_fake_mem support for EFI_MEMORY_SP to its own implementation
  that does not require memblock allocations.

- Move is_efi_application_reserved() into the x86 efi implementation.
  (Ard)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html

---

Merge logistics: These patches touch core-efi, acpi, device-dax, and
x86. Given the regression risk is highest for the x86 changes it seems
tip.git is the best tree to host the series.

---

The EFI 2.8 Specification [2] introduces the EFI_MEMORY_SP ("specific
purpose") memory attribute. This attribute bit replaces the deprecated
ACPI HMAT "reservation hint" that was introduced in ACPI 6.2 and removed
in ACPI 6.3.

Given the increasing diversity of memory types that might be advertised
to the operating system, there is a need for platform firmware to hint
which memory ranges are free for the OS to use as general purpose memory
and which ranges are intended for application specific usage. For
example, an application with prior knowledge of the platform may expect
to be able to exclusively allocate a precious / limited pool of high
bandwidth memory. Alternatively, for the general purpose case, the
operating system may want to make the memory available on a best effort
basis as a unique numa-node with performance properties described by the new
CONFIG_HMEM_REPORTING [3] facility.

In support of optionally allowing either application-exclusive or
core-kernel-mm managed access to differentiated memory, claim
EFI_MEMORY_SP ranges for exposure as device-dax instances by default.
Such instances can be directly owned / mapped by a
platform-topology-aware application. Alternatively, with the new kmem
facility [4], the administrator has the option to instead designate that
those memory ranges be hot-added to the core-kernel-mm as a unique
memory numa-node. In short, allow for the decision about what software
agent manages specific-purpose memory to be made at runtime.
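
As a rough sketch (not the actual patch code), the x86 side amounts to
walking the EFI memory map and keeping EFI_MEMORY_SP ranges away from
the general-purpose allocator so that the device-dax "hmem" driver can
claim them later:

	static void __init reserve_specific_purpose_memory(void)
	{
		efi_memory_desc_t *md;

		for_each_efi_memory_desc(md) {
			if (!(md->attribute & EFI_MEMORY_SP))
				continue;

			/* keep the range out of the page allocator */
			memblock_reserve(md->phys_addr,
					 md->num_pages << EFI_PAGE_SHIFT);
		}
	}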

The patches are based on the new HMAT+HMEM_REPORTING facilities merged
for v5.2-rc1. The implementation is tested with qemu emulation of HMAT
[5] plus the efi_fake_mem facility for applying the EFI_MEMORY_SP
attribute.

[2]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[3]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e1cf33aafb84
[4]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308f
[5]: http://patchwork.ozlabs.org/cover/1096737/

---

Dan Williams (10):
  acpi/numa: Establish a new drivers/acpi/numa/ directory
  acpi/numa/hmat: Skip publishing target info for nodes with no online 
memory
  efi: Enumerate EFI_MEMORY_SP
  x86, efi: Push EFI_MEMMAP check into leaf routines
  x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax
  x86, efi: Add efi_fake_mem support for EFI_MEMORY_SP
  lib/memregion: Uplevel the pmem "region" ida to a global allocator
  device-dax: Add a driver for "hmem" devices
  acpi/numa/hmat: Register HMAT at device_initcall level
  acpi/numa/hmat: Register "specific purpose" memory as an "hmem" device


 arch/x86/Kconfig|   21 +
 arch/x86/boot/compressed/eboot.c|5 +
 arch/x86/boot/compressed/kaslr.c|3 -
 arch/x86/include/asm/e820/types.h   |9 ++
 arch/x86/include/asm/efi.h  |   15 
 arch/x86/kernel/e820.c  |   12 ++-
 arch/x86/kernel/setup.c |   21 +++--
 arch/x86/platform/efi/efi.c |   40 -
 arch/x86/platform/efi/quirks.c  |3 +
 drivers/acpi/Kconfig|9 --
 drivers/acpi/Makefile   |3 -
 drivers/acpi/hmat/Makefile  |2 
 drivers/acpi/numa/Kconfig   |8 ++
 drivers/acpi/numa/Makefile  |3 +
 drivers/acpi/numa/hmat.c|  149 +++
 drivers/acpi/numa/srat.c|0 
 drivers/dax/Kconfig |   27 +-
 drivers/dax/Makefile|2 
 drivers/dax/hmem.c  |   58 ++
 drivers/firmware/efi/Makefile   |5 +
 drivers/firmware/efi/efi.c  |5 +
 drivers/firmware/efi/esrt.c |3 +
 drivers/firmware/efi/fake_mem-x86.c |   69 
 drivers/firmware/efi/fake_mem.c |   26 +++---
 drivers/firmware/efi/fake_mem.h |   10 ++
 drivers/nvdimm/Kconfig  |1 
 drivers/nvdimm/core.c   |1 
 drivers/nvdimm/nd-core.h|1 
 drivers/nvdimm/region_devs.c|   13 +--
 include/linux/efi.h |1 
 include/linux/ioport.h

Re: [PATCH v9 08/12] mm/sparsemem: Support sub-section hotplug

2019-06-07 Thread Dan Williams
On Fri, Jun 7, 2019 at 1:34 AM Oscar Salvador  wrote:
>
> On Wed, Jun 05, 2019 at 02:58:42PM -0700, Dan Williams wrote:
> > The libnvdimm sub-system has suffered a series of hacks and broken
> > workarounds for the memory-hotplug implementation's awkward
> > section-aligned (128MB) granularity. For example the following backtrace
> > is emitted when attempting arch_add_memory() with physical address
> > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> > within a given section:
> >
> >  WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 
> > devm_memremap_pages+0x3b5/0x4c0
> >  devm_memremap_pages attempted on mixed region [mem 0x2-0x2fbff 
> > flags 0x200]
> >  [..]
> >  Call Trace:
> >dump_stack+0x86/0xc3
> >__warn+0xcb/0xf0
> >warn_slowpath_fmt+0x5f/0x80
> >devm_memremap_pages+0x3b5/0x4c0
> >__wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >
> > Recently it was discovered that the problem goes beyond RAM vs PMEM
> > collisions, as some platforms produce PMEM vs PMEM collisions within a
> > given section. The libnvdimm workaround for that case revealed that the
> > libnvdimm section-alignment-padding implementation has been broken for a
> > long while. A fix for that long-standing breakage introduces as many
> > problems as it solves, as it would require a backward-incompatible change
> > to the namespace metadata interpretation. Instead of that dubious route
> > [1], address the root problem in the memory-hotplug implementation.
> >
> > [1]: 
> > https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com
> > Cc: Michal Hocko 
> > Cc: Vlastimil Babka 
> > Cc: Logan Gunthorpe 
> > Cc: Oscar Salvador 
> > Cc: Pavel Tatashin 
> > Signed-off-by: Dan Williams 
> > ---
> >  include/linux/memory_hotplug.h |2
> >  mm/memory_hotplug.c|7 -
> >  mm/page_alloc.c|2
> >  mm/sparse.c|  225 
> > +++-
> >  4 files changed, 155 insertions(+), 81 deletions(-)
> >
> [...]
> > @@ -325,6 +332,15 @@ static void __meminit sparse_init_one_section(struct 
> > mem_section *ms,
> >   unsigned long pnum, struct page *mem_map,
> >   struct mem_section_usage *usage)
> >  {
> > + /*
> > +  * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug,
> > +  * ->section_mem_map can not be guaranteed to point to a full
> > +  *  section's worth of memory.  The field is only valid / used
> > +  *  in the SPARSEMEM_VMEMMAP=n case.
> > +  */
> > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> > + mem_map = NULL;
>
> Will this be a problem when reading mem_map with the crash-tool?
> I do not expect it to be, but I am not sure if crash internally tries
> to read ms->section_mem_map and do some sort of translation.
> And since ms->section_mem_map SECTION_HAS_MEM_MAP, it might be that it expects
> a valid mem_map?

I don't know, but I can't imagine it would because it's much easier to
do mem_map relative translations by simple PAGE_OFFSET arithmetic.

> > +static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> > + struct vmem_altmap *altmap)
> > +{
> > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> > + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
> > + struct mem_section *ms = __pfn_to_section(pfn);
> > + bool early_section = is_early_section(ms);
> > + struct page *memmap = NULL;
> > + unsigned long *subsection_map = ms->usage
> > + ? &ms->usage->subsection_map[0] : NULL;
> > +
> > + subsection_mask_set(map, pfn, nr_pages);
> > + if (subsection_map)
> > + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
> > +
> > + if (WARN(!subsection_map || !bitmap_equal(tmp, map, 
> > SUBSECTIONS_PER_SECTION),
> > + "section already deactivated (%#lx + %ld)\n",
> > + pfn, nr_pages))
> > + return;
> > +
> > + /*
> > +  * There are 3 cases to handle across two configurations
> > +  * (SPARSEMEM_VMEMMAP={y,n}):
> > +  *
> > +  * 1/ deactivation of a partial hot-added section (only possible
> > +  * in the SPARSEMEM_VMEMMAP=y case).
> > +  *a/ section was 

Re: [Lsf-pc] [LSF/MM TOPIC] The end of the DAX experiment

2019-06-07 Thread Dan Williams
On Wed, Jun 5, 2019 at 9:22 AM Goldwyn Rodrigues  wrote:
>
> Hi Dan/Jerome,
>
> On 12:20 14/02, Dan Williams wrote:
> > On Thu, Feb 14, 2019 at 12:09 PM Matthew Wilcox  wrote:
> > >
> > > On Thu, Feb 14, 2019 at 11:31:24AM -0800, Dan Williams wrote:
> > > > On Thu, Feb 14, 2019 at 11:10 AM Jerome Glisse  
> > > > wrote:
> > > > > I am just again working on my struct page mapping patchset as well as
> > > > > the generic page write protection that sits on top. I hope to be able
> > > > > to post the v2 in couple weeks. You can always look at my posting last
> > > > > year to see more details.
> > > >
> > > > Yes, I have that in mind as one of the contenders. However, it's not
> > > > clear to me that its a suitable fit for filesystem-reflink. Others
> > > > have floated the 'page proxy' idea, so it would be good to discuss the
> > > > merits of the general approaches.
> > >
> > > ... and my preferred option of putting pfn entries in the page cache.
> >
> > Another option to include the discussion.
> >
> > > Or is that what you meant by "page proxy"?
> >
> > Page proxy would be an object that a filesystem could allocate to
> > point back to a single physical 'struct page *'. The proxy would
> > contain an override for page->index.
>
> Was there any outcome on this and its implementation? I am specifically
> interested in this for DAX support on btrfs/CoW: The TODO comment on
> top of dax_associate_entry() :)
>
> If there are patches/git tree I could use to base my patches on, it would
> be nice.

Half joking, but I was hoping that by the time I circled back to
finally reviewing the btrfs dax patches, a solution to this
problem would be waiting. We spent more time on other DAX topics. I
recall that Jerome and I were mutually skeptical of each others
approaches at first glance, but we did not get into the details. So, I
think it will be a matter of getting the idea coded up. I'm finally
coming out from under a pile of mm work so I should have some time
this cycle to push the page proxy idea forward.
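
For illustration only, the proxy object described above might look
something like this (no such structure exists today; the names are made
up):

	/* hypothetical: a per-file view of a shared physical page */
	struct page_proxy {
		struct page *page;		/* the real, shared struct page */
		struct address_space *mapping;	/* owning file's mapping */
		pgoff_t index;			/* per-file override of page->index */
	};

The point being that reflinked files could each hold their own proxy
while sharing the one physical page.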


Re: [PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields

2019-06-06 Thread Dan Williams
On Thu, Jun 6, 2019 at 2:46 PM Andrew Morton  wrote:
>
> On Wed, 05 Jun 2019 14:58:58 -0700 Dan Williams  
> wrote:
>
> > At namespace creation time there is the potential for the "expected to
> > be zero" fields of a 'pfn' info-block to be filled with indeterminate
> > data. While the kernel buffer is zeroed on allocation it is immediately
> > overwritten by nd_pfn_validate() filling it with the current contents of
> > the on-media info-block location. For fields like 'flags' and the
> > 'padding' it potentially means that future implementations can not rely
> > on those fields being zero.
> >
> > In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> > section alignment, arrange for fields that are not explicitly
> > initialized to be guaranteed zero. Bump the minor version to indicate it
> > is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> > corruption is expected to be benign since all other critical fields are
> > explicitly initialized.
> >
> > Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
> > Cc: 
> > Signed-off-by: Dan Williams 
>
> The cc:stable in [11/12] seems odd.  Is this independent of the other
> patches?  If so, shouldn't it be a standalone thing which can be
> prioritized?
>

The cc: stable is about spreading this new policy to as many kernels
as possible, not fixing an issue in those kernels. It only becomes an
issue with patch 12, "libnvdimm/pfn: Stop padding pmem namespaces to
section alignment", as all previous kernels do initialize all fields.

I'd be OK to drop that cc: stable; my concern is distros that somehow
pick up and backport patch 12 and miss patch 11.


Re: [PATCH v9 07/12] mm/sparsemem: Prepare for sub-section ranges

2019-06-06 Thread Dan Williams
On Thu, Jun 6, 2019 at 10:21 AM Oscar Salvador  wrote:
>
> On Wed, Jun 05, 2019 at 02:58:37PM -0700, Dan Williams wrote:
> > Prepare the memory hot-{add,remove} paths for handling sub-section
> > ranges by plumbing the starting page frame and number of pages being
> > handled through arch_{add,remove}_memory() to
> > sparse_{add,remove}_one_section().
> >
> > This is simply plumbing, small cleanups, and some identifier renames. No
> > intended functional changes.
> >
> > Cc: Michal Hocko 
> > Cc: Vlastimil Babka 
> > Cc: Logan Gunthorpe 
> > Cc: Oscar Salvador 
> > Reviewed-by: Pavel Tatashin 
> > Signed-off-by: Dan Williams 
> > ---
> >  include/linux/memory_hotplug.h |5 +-
> >  mm/memory_hotplug.c|  114 
> > +---
> >  mm/sparse.c|   15 ++---
> >  3 files changed, 81 insertions(+), 53 deletions(-)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 79e0add6a597..3ab0282b4fe5 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct 
> > resource *resource);
> >  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long 
> > start_pfn,
> >   unsigned long nr_pages, struct vmem_altmap *altmap);
> >  extern bool is_memblock_offlined(struct memory_block *mem);
> > -extern int sparse_add_one_section(int nid, unsigned long start_pfn,
> > -   struct vmem_altmap *altmap);
> > +extern int sparse_add_section(int nid, unsigned long pfn,
> > + unsigned long nr_pages, struct vmem_altmap *altmap);
> >  extern void sparse_remove_one_section(struct mem_section *ms,
> > + unsigned long pfn, unsigned long nr_pages,
> >   unsigned long map_offset, struct vmem_altmap *altmap);
> >  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
> > unsigned long pnum);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 4b882c57781a..399bf78bccc5 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct 
> > pglist_data *pgdat)
> >  }
> >  #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
> >
> > -static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
> > -struct vmem_altmap *altmap)
> > +static int __meminit __add_section(int nid, unsigned long pfn,
> > + unsigned long nr_pages, struct vmem_altmap *altmap)
> >  {
> >   int ret;
> >
> > - if (pfn_valid(phys_start_pfn))
> > + if (pfn_valid(pfn))
> >   return -EEXIST;
> >
> > - ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
> > + ret = sparse_add_section(nid, pfn, nr_pages, altmap);
> >   return ret < 0 ? ret : 0;
> >  }
> >
> > +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> > + const char *reason)
> > +{
> > + /*
> > +  * Disallow all operations smaller than a sub-section and only
> > +  * allow operations smaller than a section for
> > +  * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
> > +  * enforces a larger memory_block_size_bytes() granularity for
> > +  * memory that will be marked online, so this check should only
> > +  * fire for direct arch_{add,remove}_memory() users outside of
> > +  * add_memory_resource().
> > +  */
> > + unsigned long min_align;
> > +
> > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> > + min_align = PAGES_PER_SUBSECTION;
> > + else
> > + min_align = PAGES_PER_SECTION;
> > + if (!IS_ALIGNED(pfn, min_align)
> > + || !IS_ALIGNED(nr_pages, min_align)) {
> > + WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n",
> > + reason, pfn, pfn + nr_pages - 1);
> > + return -EINVAL;
> > + }
> > + return 0;
> > +}
>
>
> This caught my eye.
> Back in patch#4 "Convert kmalloc_section_memmap() to 
> populate_section_memmap()",
> you placed a mis-usage check for !CONFIG_SPARSEMEM_VMEMMAP in
> populate_section_memmap().
>
> populate_section_memmap() gets called from sparse_add_one_section(), which 
> means
> that we should have passed this check, otherwise we cannot go further and call
> __add_section().
>
> So, unless I am missing something it seems to me that the check from patch#4 
> could go?
> And I think the same applies to depopulate_section_memmap()?

Yes, good catch, I can kill those extra checks in favor of this one.

> Besides that, it looks good to me:

Thanks Oscar!

>
> Reviewed-by: Oscar Salvador 
>
> --
> Oscar Salvador
> SUSE L3


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-06 Thread Dan Williams
On Thu, Jun 6, 2019 at 3:42 AM Jan Kara  wrote:
>
> On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> >
> > ... V1,000,000   ;-)
> >
> > Pre-requisites:
> >   John Hubbard's put_user_pages() patch series.[1]
> >   Jan Kara's ext4_break_layouts() fixes[2]
> >
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> >
> > The real issue is that there is no use case for a user to have RDMA-pinned
> > memory which is then truncated.  So really any solution we present which:
> >
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> >
> > Should be an acceptable solution.
> >
> > Because this is slightly new behavior.  And because this is going to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> >
> > The following patches implement the following solution.
> >
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >(now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO 
> > for
> >notification to the original thread that another thread has tried to 
> > delete
> >their data.  Furthermore this indicates that if the user needs to GUP the
> >file again they will need to retake the Layout lease before doing so.
> >
> >
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease.  This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> >
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
>
> So after some thought I'm willing to accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into the same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated".  The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.
>
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Yes, this satisfies the primary concern that made me oppose failing
truncate. If the administrator determines that reclaiming capacity is
more important than maintaining active RDMA mappings "lsof + kill" is
a reasonable way to recover. I'd go so far as to say that anything
less is an abdication of the kernel's responsibility as an arbiter of
platform resources.

> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking. Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which would
> release the lease but OTOH we need to make sure task exit does not deadlock.
> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...
>
> What do people think about this?

SIGKILL on close() without explicit 

[PATCH v9 07/12] mm/sparsemem: Prepare for sub-section ranges

2019-06-05 Thread Dan Williams
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().

This is simply plumbing, small cleanups, and some identifier renames. No
intended functional changes.

Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Reviewed-by: Pavel Tatashin 
Signed-off-by: Dan Williams 
---
 include/linux/memory_hotplug.h |5 +-
 mm/memory_hotplug.c|  114 +---
 mm/sparse.c|   15 ++---
 3 files changed, 81 insertions(+), 53 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 79e0add6a597..3ab0282b4fe5 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct resource 
*resource);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int sparse_add_one_section(int nid, unsigned long start_pfn,
- struct vmem_altmap *altmap);
+extern int sparse_add_section(int nid, unsigned long pfn,
+   unsigned long nr_pages, struct vmem_altmap *altmap);
 extern void sparse_remove_one_section(struct mem_section *ms,
+   unsigned long pfn, unsigned long nr_pages,
unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  unsigned long pnum);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4b882c57781a..399bf78bccc5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct 
pglist_data *pgdat)
 }
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
-static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-  struct vmem_altmap *altmap)
+static int __meminit __add_section(int nid, unsigned long pfn,
+   unsigned long nr_pages, struct vmem_altmap *altmap)
 {
int ret;
 
-   if (pfn_valid(phys_start_pfn))
+   if (pfn_valid(pfn))
return -EEXIST;
 
-   ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
+   ret = sparse_add_section(nid, pfn, nr_pages, altmap);
return ret < 0 ? ret : 0;
 }
 
+static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
+   const char *reason)
+{
+   /*
+* Disallow all operations smaller than a sub-section and only
+* allow operations smaller than a section for
+* SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
+* enforces a larger memory_block_size_bytes() granularity for
+* memory that will be marked online, so this check should only
+* fire for direct arch_{add,remove}_memory() users outside of
+* add_memory_resource().
+*/
+   unsigned long min_align;
+
+   if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
+   min_align = PAGES_PER_SUBSECTION;
+   else
+   min_align = PAGES_PER_SECTION;
+   if (!IS_ALIGNED(pfn, min_align)
+   || !IS_ALIGNED(nr_pages, min_align)) {
+   WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n",
+   reason, pfn, pfn + nr_pages - 1);
+   return -EINVAL;
+   }
+   return 0;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
  * call this function after deciding the zone to which to
  * add the new pages.
  */
-int __ref __add_pages(int nid, unsigned long phys_start_pfn,
-   unsigned long nr_pages, struct mhp_restrictions *restrictions)
+int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
+   struct mhp_restrictions *restrictions)
 {
unsigned long i;
-   int err = 0;
-   int start_sec, end_sec;
+   int start_sec, end_sec, err;
struct vmem_altmap *altmap = restrictions->altmap;
 
-   /* during initialize mem_map, align hot-added range to section */
-   start_sec = pfn_to_section_nr(phys_start_pfn);
-   end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
-
if (altmap) {
/*
 * Validate altmap is within bounds of the total request
 */
-   if (altmap->base_pfn != phys_start_pfn
+   if (altmap->base_pfn != pfn
|| vmem_altmap_offset(altmap) > nr_pages) {
pr_warn_once("memory add fail, invalid altmap\n");
-   err = -E

[PATCH v9 12/12] libnvdimm/pfn: Stop padding pmem namespaces to section alignment

2019-06-05 Thread Dan Williams
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time. The kernel will still honor padding established by older kernels.

Reported-by: Jeff Moyer 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn.h  |   14 
 drivers/nvdimm/pfn_devs.c |   77 -
 include/linux/mmzone.h|3 ++
 3 files changed, 16 insertions(+), 78 deletions(-)

diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index e901e3a3b04c..cc042a98758f 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -41,18 +41,4 @@ struct nd_pfn_sb {
__le64 checksum;
 };
 
-#ifdef CONFIG_SPARSEMEM
-#define PFN_SECTION_ALIGN_DOWN(x) SECTION_ALIGN_DOWN(x)
-#define PFN_SECTION_ALIGN_UP(x) SECTION_ALIGN_UP(x)
-#else
-/*
- * In this case ZONE_DEVICE=n and we will disable 'pfn' device support,
- * but we still want pmem to compile.
- */
-#define PFN_SECTION_ALIGN_DOWN(x) (x)
-#define PFN_SECTION_ALIGN_UP(x) (x)
-#endif
-
-#define PHYS_SECTION_ALIGN_DOWN(x) 
PFN_PHYS(PFN_SECTION_ALIGN_DOWN(PHYS_PFN(x)))
-#define PHYS_SECTION_ALIGN_UP(x) PFN_PHYS(PFN_SECTION_ALIGN_UP(PHYS_PFN(x)))
 #endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index a2406253eb70..7f54374b082f 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -595,14 +595,14 @@ static u32 info_block_reserve(void)
 }
 
 /*
- * We hotplug memory at section granularity, pad the reserved area from
- * the previous section base to the namespace base address.
+ * We hotplug memory at sub-section granularity, pad the reserved area
+ * from the previous section base to the namespace base address.
  */
 static unsigned long init_altmap_base(resource_size_t base)
 {
unsigned long base_pfn = PHYS_PFN(base);
 
-   return PFN_SECTION_ALIGN_DOWN(base_pfn);
+   return SUBSECTION_ALIGN_DOWN(base_pfn);
 }
 
 static unsigned long init_altmap_reserve(resource_size_t base)
@@ -610,7 +610,7 @@ static unsigned long init_altmap_reserve(resource_size_t 
base)
unsigned long reserve = info_block_reserve() >> PAGE_SHIFT;
unsigned long base_pfn = PHYS_PFN(base);
 
-   reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);
+   reserve += base_pfn - SUBSECTION_ALIGN_DOWN(base_pfn);
return reserve;
 }
 
@@ -641,8 +641,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct 
dev_pagemap *pgmap)
nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
pgmap->altmap_valid = false;
} else if (nd_pfn->mode == PFN_MODE_PMEM) {
-   nd_pfn->npfns = PFN_SECTION_ALIGN_UP((resource_size(res)
-   - offset) / PAGE_SIZE);
+   nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset));
if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
	dev_info(&nd_pfn->dev,
"number of pfns truncated from %lld to 
%ld\n",
@@ -658,54 +657,14 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, 
struct dev_pagemap *pgmap)
return 0;
 }
 
-static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys)
-{
-   return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys),
-   ALIGN_DOWN(phys, nd_pfn->align));
-}
-
-/*
- * Check if pmem collides with 'System RAM', or other regions when
- * section aligned.  Trim it accordingly.
- */
-static void trim_pfn_device(struct nd_pfn *nd_pfn, u32 *start_pad, u32 
*end_trunc)
-{
-   struct nd_namespace_common *ndns = nd_pfn->ndns;
-   struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
-   struct nd_region *nd_region = to_nd_region(nd_pfn->dev.parent);
-   const resource_size_t start = nsio->res.start;
-   const resource_size_t end = start + resource_size(&nsio->res);
-   resource_size_t adjust, size;
-
-   *start_pad = 0;
-   *end_trunc = 0;
-
-   adjust = start - PHYS_SECTION_ALIGN_DOWN(start);
-   size = resource_size(&nsio->res) + adjust;
-   if (region_intersects(start - adjust, size, IORESOURCE_SYSTEM_RAM,
-   IORES_DESC_NONE) == REGION_MIXED
-   || nd_region_conflict(nd_region, start - adjust, size))
-   *start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
-
-   /* Now check that end of the range does not collide. */
-   adjust = PHYS_SECTION_ALIGN_UP(end) - end;
-   size = resource_size(&nsio->res) + adjust;
-   if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
-   IORES_DESC_NONE) == REGION_MIXED
-   || !IS_ALIGNED(end, nd_pfn->align)
-   || nd_region_conflict(nd_region, start, size))
-   *end_trunc = end - phys_pmem_align_down(nd_pfn, end);
-}
-
 static int nd_pfn_init(struc

[PATCH v9 06/12] mm: Kill is_dev_zone() helper

2019-06-05 Thread Dan Williams
Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

Cc: Michal Hocko 
Cc: Logan Gunthorpe 
Acked-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Reviewed-by: Pavel Tatashin 
Signed-off-by: Dan Williams 
---
 include/linux/mmzone.h |   12 
 mm/page_alloc.c|2 +-
 2 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6dd52d544857..49e7fb452dfd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -855,18 +855,6 @@ static inline int local_memory_node(int node_id) { return 
node_id; };
  */
 #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
 
-#ifdef CONFIG_ZONE_DEVICE
-static inline bool is_dev_zone(const struct zone *zone)
-{
-   return zone_idx(zone) == ZONE_DEVICE;
-}
-#else
-static inline bool is_dev_zone(const struct zone *zone)
-{
-   return false;
-}
-#endif
-
 /*
  * Returns true if a zone has pages managed by the buddy allocator.
  * All the reclaim decisions have to use this function rather than
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd773efe5b82..5dff3f49a372 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5865,7 +5865,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
unsigned long start = jiffies;
int nid = pgdat->node_id;
 
-   if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
+   if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE))
return;
 
/*



[PATCH v9 05/12] mm/hotplug: Kill is_dev_zone() usage in __remove_pages()

2019-06-05 Thread Dan Williams
The zone type check was a leftover from the cleanup that plumbed altmap
through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
vmem_altmap to arch_remove_memory and __remove_pages".

Cc: Michal Hocko 
Cc: Logan Gunthorpe 
Cc: Pavel Tatashin 
Reviewed-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Signed-off-by: Dan Williams 
---
 mm/memory_hotplug.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 647859a1d119..4b882c57781a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -535,11 +535,8 @@ void __remove_pages(struct zone *zone, unsigned long 
phys_start_pfn,
unsigned long map_offset = 0;
int sections_to_remove;
 
-   /* In the ZONE_DEVICE case device driver owns the memory region */
-   if (is_dev_zone(zone)) {
-   if (altmap)
-   map_offset = vmem_altmap_offset(altmap);
-   }
+   if (altmap)
+   map_offset = vmem_altmap_offset(altmap);
 
clear_zone_contiguous(zone);
 



[PATCH v9 09/12] mm: Document ZONE_DEVICE memory-model implications

2019-06-05 Thread Dan Williams
Explain the general mechanisms of 'ZONE_DEVICE' pages and list the users
of 'devm_memremap_pages()'.

Cc: Jonathan Corbet 
Reported-by: Mike Rapoport 
Signed-off-by: Dan Williams 
---
 Documentation/vm/memory-model.rst |   39 +
 1 file changed, 39 insertions(+)

diff --git a/Documentation/vm/memory-model.rst 
b/Documentation/vm/memory-model.rst
index 382f72ace1fc..e0af47e02e78 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -181,3 +181,42 @@ that is eventually passed to vmemmap_populate() through a 
long chain
 of function calls. The vmemmap_populate() implementation may use the
 `vmem_altmap` along with :c:func:`altmap_alloc_block_buf` helper to
 allocate memory map on the persistent memory device.
+
+ZONE_DEVICE
+===
+The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
+`struct page` `mem_map` services for device driver identified physical
+address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
+that the page objects for these address ranges are never marked online,
+and that a reference must be taken against the device, not just the page
+to keep the memory pinned for active use. `ZONE_DEVICE`, via
+:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
+turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
+:c:func:`get_user_pages` service for the given range of pfns. Since the
+page reference count never drops below 1 the page is never tracked as
+free memory and the page's `struct list_head lru` space is repurposed
+for back referencing to the host device / driver that mapped the memory.
+
+While `SPARSEMEM` presents memory as a collection of sections,
+optionally collected into memory blocks, `ZONE_DEVICE` users have a need
+for smaller granularity of populating the `mem_map`. Given that
+`ZONE_DEVICE` memory is never marked online it is subsequently never
+subject to its memory ranges being exposed through the sysfs memory
+hotplug api on memory block boundaries. The implementation relies on
+this lack of user-api constraint to allow sub-section sized memory
+ranges to be specified to :c:func:`arch_add_memory`, the top-half of
+memory hotplug. Sub-section support allows for `PMD_SIZE` as the minimum
+alignment granularity for :c:func:`devm_memremap_pages`.
+
+The users of `ZONE_DEVICE` are:
+* pmem: Map platform persistent memory to be used as a direct-I/O target
+  via DAX mappings.
+
+* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
+  event callbacks to allow a device-driver to coordinate memory management
+  events related to device-memory, typically GPU memory. See
+  Documentation/vm/hmm.rst.
+
+* p2pdma: Create `struct page` objects to allow peer devices in a
+  PCI/-E topology to coordinate direct-DMA operations between themselves,
+  i.e. bypass host memory.



[PATCH v9 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields

2019-06-05 Thread Dan Williams
At namespace creation time there is the potential for the "expected to
be zero" fields of a 'pfn' info-block to be filled with indeterminate
data. While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like 'flags' and the
'padding' it potentially means that future implementations can not rely
on those fields being zero.

In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate it
is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
corruption is expected to be benign since all other critical fields are
explicitly initialized.

Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dax_devs.c |2 +-
 drivers/nvdimm/pfn.h  |1 +
 drivers/nvdimm/pfn_devs.c |   18 +++---
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 0453f49dc708..326f02ffca81 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, struct 
nd_namespace_common *ndns)
	nvdimm_bus_unlock(&ndns->dev);
if (!dax_dev)
return -ENOMEM;
-   pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+   pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, DAX_SIG);
dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "");
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index dde9853453d3..e901e3a3b04c 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -36,6 +36,7 @@ struct nd_pfn_sb {
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
__le32 align;
+   /* minor-version-3 guarantee the padding and flags are zero */
u8 padding[4000];
__le64 checksum;
 };
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 01f40672507f..a2406253eb70 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn 
*nd_pfn)
return 0;
 }
 
+/**
+ * nd_pfn_validate - read and validate info-block
+ * @nd_pfn: fsdax namespace runtime state / properties
+ * @sig: 'devdax' or 'fsdax' signature
+ *
+ * Upon return the info-block buffer contents (->pfn_sb) are
+ * indeterminate when validation fails, and a coherent info-block
+ * otherwise.
+ */
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
u64 checksum, offset;
@@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, struct 
nd_namespace_common *ndns)
nvdimm_bus_unlock(&ndns->dev);
if (!pfn_dev)
return -ENOMEM;
-   pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+   pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn = to_nd_pfn(pfn_dev);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, PFN_SIG);
@@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
u64 checksum;
int rc;
 
-   pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
+   pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
if (!pfn_sb)
return -ENOMEM;
 
@@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
sig = DAX_SIG;
else
sig = PFN_SIG;
+
rc = nd_pfn_validate(nd_pfn, sig);
if (rc != -ENODEV)
return rc;
 
/* no info block, do init */;
+   memset(pfn_sb, 0, sizeof(*pfn_sb));
+
nd_region = to_nd_region(nd_pfn->dev.parent);
if (nd_region->ro) {
dev_info(&nd_pfn->dev,
@@ -768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
-   pfn_sb->version_minor = cpu_to_le16(2);
+   pfn_sb->version_minor = cpu_to_le16(3);
pfn_sb->start_pad = cpu_to_le32(start_pad);
pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);



[PATCH v9 10/12] mm/devm_memremap_pages: Enable sub-section remap

2019-06-05 Thread Dan Williams
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory(). Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res). The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).
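
As a hedged before/after illustration (not part of the patch itself), the
section-alignment arithmetic that goes away looks roughly like this:

	/* before: expand the request to 128MB section boundaries */
	align_start = res->start & ~(PA_SECTION_SIZE - 1);
	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
			- align_start;
	error = arch_add_memory(nid, align_start, align_size, &restrictions);

	/* after: pass the resource through; sub-sections handle the rest */
	error = arch_add_memory(nid, res->start, resource_size(res), &restrictions);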

Cc: Michal Hocko 
Cc: Toshi Kani 
Cc: Jérôme Glisse 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Signed-off-by: Dan Williams 
---
 kernel/memremap.c |   61 +
 1 file changed, 24 insertions(+), 37 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 57980ed4e571..a0e5f6b91b04 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -58,7 +58,7 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap)
struct vmem_altmap *altmap = &pgmap->altmap;
unsigned long pfn;
 
-   pfn = res->start >> PAGE_SHIFT;
+   pfn = PHYS_PFN(res->start);
if (pgmap->altmap_valid)
pfn += vmem_altmap_offset(altmap);
return pfn;
@@ -86,7 +86,6 @@ static void devm_memremap_pages_release(void *data)
struct dev_pagemap *pgmap = data;
struct device *dev = pgmap->dev;
struct resource *res = &pgmap->res;
-   resource_size_t align_start, align_size;
unsigned long pfn;
int nid;
 
@@ -96,25 +95,21 @@ static void devm_memremap_pages_release(void *data)
pgmap->cleanup(pgmap->ref);
 
/* pages are dead and unused, undo the arch mapping */
-   align_start = res->start & ~(PA_SECTION_SIZE - 1);
-   align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
-   - align_start;
-
-   nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
+   nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
 
mem_hotplug_begin();
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-   pfn = align_start >> PAGE_SHIFT;
+   pfn = PHYS_PFN(res->start);
__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-   align_size >> PAGE_SHIFT, NULL);
+   PHYS_PFN(resource_size(res)), NULL);
} else {
-   arch_remove_memory(nid, align_start, align_size,
+   arch_remove_memory(nid, res->start, resource_size(res),
pgmap->altmap_valid ? &pgmap->altmap : NULL);
-   kasan_remove_zero_shadow(__va(align_start), align_size);
+   kasan_remove_zero_shadow(__va(res->start), resource_size(res));
}
mem_hotplug_done();
 
-   untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+   untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
pgmap_array_delete(res);
dev_WARN_ONCE(dev, pgmap->altmap.alloc,
  "%s: failed to free all reserved pages\n", __func__);
@@ -141,16 +136,13 @@ static void devm_memremap_pages_release(void *data)
  */
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
-   resource_size_t align_start, align_size, align_end;
-   struct vmem_altmap *altmap = pgmap->altmap_valid ?
-   &pgmap->altmap : NULL;
struct resource *res = &pgmap->res;
struct dev_pagemap *conflict_pgmap;
struct mhp_restrictions restrictions = {
/*
 * We do not want any optional features only our own memmap
*/
-   .altmap = altmap,
+   .altmap = pgmap->altmap_valid ? &pgmap->altmap : NULL,
};
pgprot_t pgprot = PAGE_KERNEL;
int error, nid, is_ram;
@@ -160,12 +152,7 @@ void *devm_memremap_pages(struct device *dev, struct 
dev_pagemap *pgmap)
return ERR_PTR(-EINVAL);
}
 
-   align_start = res->start & ~(PA_SECTION_SIZE - 1);
-   align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
-   - align_start;
-   align_end = align_start + align_size - 1;
-
-   conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_start), NULL);
+   conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL);
if (conflict_pgmap) {
dev_WARN(dev, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
@@ -173,7 +160,7 @@ void *devm_memremap_pages(struct device *dev, struct 
dev_pagemap *pgmap)
goto err_array;
}
 
-   conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_end), NULL);
+   conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL);
if (conflict_pgmap) {
dev_WARN(dev, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
@@ -181,7 +168,7 @@ void *devm_memremap_pages(struct device 

[PATCH v9 08/12] mm/sparsemem: Support sub-section hotplug

2019-06-05 Thread Dan Williams
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity. For example the following backtrace
is emitted when attempting arch_add_memory() with physical address
ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
within a given section:

 WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 
devm_memremap_pages+0x3b5/0x4c0
 devm_memremap_pages attempted on mixed region [mem 0x2-0x2fbff 
flags 0x200]
 [..]
 Call Trace:
   dump_stack+0x86/0xc3
   __warn+0xcb/0xf0
   warn_slowpath_fmt+0x5f/0x80
   devm_memremap_pages+0x3b5/0x4c0
   __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
   pmem_attach_disk+0x19a/0x440 [nd_pmem]

Recently it was discovered that the problem goes beyond RAM vs PMEM
collisions, as some platforms produce PMEM vs PMEM collisions within a
given section. The libnvdimm workaround for that case revealed that the
libnvdimm section-alignment-padding implementation has been broken for a
long while. A fix for that long-standing breakage introduces as many
problems as it solves, as it would require a backward-incompatible change
to the namespace metadata interpretation. Instead of that dubious route
[1], address the root problem in the memory-hotplug implementation.

[1]: 
https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Signed-off-by: Dan Williams 
---
 include/linux/memory_hotplug.h |2 
 mm/memory_hotplug.c|7 -
 mm/page_alloc.c|2 
 mm/sparse.c|  225 +++-
 4 files changed, 155 insertions(+), 81 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3ab0282b4fe5..0b8a5e5ef2da 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -350,7 +350,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct mem_section *ms,
+extern void sparse_remove_section(struct mem_section *ms,
unsigned long pfn, unsigned long nr_pages,
unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 399bf78bccc5..8188be7a9edb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -255,13 +255,10 @@ void __init register_page_bootmem_info_node(struct 
pglist_data *pgdat)
 static int __meminit __add_section(int nid, unsigned long pfn,
unsigned long nr_pages, struct vmem_altmap *altmap)
 {
-   int ret;
-
if (pfn_valid(pfn))
return -EEXIST;
 
-   ret = sparse_add_section(nid, pfn, nr_pages, altmap);
-   return ret < 0 ? ret : 0;
+   return sparse_add_section(nid, pfn, nr_pages, altmap);
 }
 
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
@@ -541,7 +538,7 @@ static void __remove_section(struct zone *zone, unsigned 
long pfn,
return;
 
__remove_zone(zone, pfn, nr_pages);
-   sparse_remove_one_section(ms, pfn, nr_pages, map_offset, altmap);
+   sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dff3f49a372..af260cc469cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5915,7 +5915,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 * pfn out of zone.
 *
 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
-* because this is done early in sparse_add_one_section
+* because this is done early in section_activate()
 */
if (!(pfn & (pageblock_nr_pages - 1))) {
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
diff --git a/mm/sparse.c b/mm/sparse.c
index f65206deaf49..d83bac5d1324 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -83,8 +83,15 @@ static int __meminit sparse_index_init(unsigned long 
section_nr, int nid)
unsigned long root = SECTION_NR_TO_ROOT(section_nr);
struct mem_section *section;
 
+   /*
+* An existing section is possible in the sub-section hotplug
+* case. First hot-add instantiates, follow-on hot-add reuses
+* the existing section.
+*
+* The mem_hotplug_lock resolves the apparent race below.
+*/
if (mem_section[root])
-   return -EEXIST;
+   return 0;
 
section = sparse_index_al

[PATCH v9 00/12] mm: Sub-section memory hotplug support

2019-06-05 Thread Dan Williams
subsection-pending' branch of ndctl [5].

[2]: 
https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: 
https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c


---

Dan Williams (12):
  mm/sparsemem: Introduce struct mem_section_usage
  mm/sparsemem: Add helpers track active portions of a section at boot
  mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
  mm/sparsemem: Convert kmalloc_section_memmap() to 
populate_section_memmap()
  mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
  mm: Kill is_dev_zone() helper
  mm/sparsemem: Prepare for sub-section ranges
  mm/sparsemem: Support sub-section hotplug
  mm: Document ZONE_DEVICE memory-model implications
  mm/devm_memremap_pages: Enable sub-section remap
  libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
  libnvdimm/pfn: Stop padding pmem namespaces to section alignment


 Documentation/vm/memory-model.rst|   39 
 arch/powerpc/include/asm/sparsemem.h |3 
 arch/x86/mm/init_64.c|4 
 drivers/nvdimm/dax_devs.c|2 
 drivers/nvdimm/pfn.h |   15 -
 drivers/nvdimm/pfn_devs.c|   95 +++--
 include/linux/memory_hotplug.h   |7 -
 include/linux/mm.h   |4 
 include/linux/mmzone.h   |   92 +++--
 kernel/memremap.c|   61 ++
 mm/memory_hotplug.c  |  171 +---
 mm/page_alloc.c  |   10 +
 mm/sparse-vmemmap.c  |   21 +-
 mm/sparse.c  |  359 +++---
 14 files changed, 534 insertions(+), 349 deletions(-)


[PATCH v9 04/12] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()

2019-06-05 Thread Dan Williams
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explicit pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmemmap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.
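
A hedged sketch of the interface change for callers (the full-section
variant shown is equivalent to the old call):

	/* before: implicit full section */
	map = sparse_mem_map_populate(pnum, nid, altmap);

	/* after: explicit pfn range; a full section is spelled out ... */
	map = __populate_section_memmap(section_nr_to_pfn(pnum), PAGES_PER_SECTION,
			nid, altmap);

	/* ... and sub-section callers pass their own range, which is rounded
	 * internally to PAGES_PER_SUBSECTION granularity */
	map = __populate_section_memmap(pfn, nr_pages, nid, altmap);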

Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Reviewed-by: Pavel Tatashin 
Signed-off-by: Dan Williams 
---
 arch/x86/mm/init_64.c |4 ++-
 include/linux/mm.h|4 ++-
 mm/sparse-vmemmap.c   |   21 +++--
 mm/sparse.c   |   61 +++--
 4 files changed, 57 insertions(+), 33 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8335ac6e1112..688fb0687e55 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1520,7 +1520,9 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
 {
int err;
 
-   if (boot_cpu_has(X86_FEATURE_PSE))
+   if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+   err = vmemmap_populate_basepages(start, end, node);
+   else if (boot_cpu_has(X86_FEATURE_PSE))
err = vmemmap_populate_hugepages(start, end, node, altmap);
else if (altmap) {
pr_err_once("%s: no cpu support for altmap allocations\n",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index acc578407e9e..c502f3ce8661 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2734,8 +2734,8 @@ const char * arch_vma_name(struct vm_area_struct *vma);
 void print_vma_addr(char *prefix, unsigned long rip);
 
 void *sparse_buffer_alloc(unsigned long size);
-struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
-   struct vmem_altmap *altmap);
+struct page * __populate_section_memmap(unsigned long pfn,
+   unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 7fec05796796..200aef686722 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -245,19 +245,26 @@ int __meminit vmemmap_populate_basepages(unsigned long 
start,
return 0;
 }
 
-struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid,
-   struct vmem_altmap *altmap)
+struct page * __meminit __populate_section_memmap(unsigned long pfn,
+   unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
unsigned long start;
unsigned long end;
-   struct page *map;
 
-   map = pfn_to_page(pnum * PAGES_PER_SECTION);
-   start = (unsigned long)map;
-   end = (unsigned long)(map + PAGES_PER_SECTION);
+   /*
+* The minimum granularity of memmap extensions is
+* PAGES_PER_SUBSECTION as allocations are tracked in the
+* 'subsection_map' bitmap of the section.
+*/
+   end = ALIGN(pfn + nr_pages, PAGES_PER_SUBSECTION);
+   pfn &= PAGE_SUBSECTION_MASK;
+   nr_pages = end - pfn;
+
+   start = (unsigned long) pfn_to_page(pfn);
+   end = start + nr_pages * sizeof(struct page);
 
if (vmemmap_populate(start, end, nid, altmap))
return NULL;
 
-   return map;
+   return pfn_to_page(pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index 0baa2e55cfdd..2093c662a5f7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -439,8 +439,8 @@ static unsigned long __init section_map_size(void)
return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
 }
 
-struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid,
-   struct vmem_altmap *altmap)
+struct page __init *__populate_section_memmap(unsigned long pfn,
+   unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
unsigned long size = section_map_size();
struct page *map = sparse_buffer_alloc(size);
@@ -521,10 +521,13 @@ static void __init sparse_init_nid(int nid, unsigned long 
pnum_begin,
}
sparse_buffer_init(map_count * section_map_size(), nid);
for_each_present_section_nr(pnum_begin, pnum) {
+   unsigned long pfn = section_nr_to_pfn(pnum);
+
if (pnum >= pnum_end)
break;
 
-   map = sparse_mem_map_populate(pnum, nid, NULL);
+   map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
+   nid, NULL);
if (!map) {
pr_err("%s: node[%d] memory map backing failed. Some 
memory will not be available.",
   __func__, nid);
@@ -624,17 +627,17 @@ void offline_mem_sections(unsi

[PATCH v9 01/12] mm/sparsemem: Introduce struct mem_section_usage

2019-06-05 Thread Dan Williams
Towards enabling memory hotplug to track partial population of a
section, introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to
house a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

The default SUBSECTION_SHIFT is chosen to keep the 'subsection_map' no
larger than a single 'unsigned long' on the major architectures.
Alternatively, an architecture can define ARCH_SUBSECTION_SHIFT to
override the PMD_SHIFT default. Note that PowerPC needs
ARCH_SUBSECTION_SHIFT to work around PMD_SHIFT being a non-constant
expression on that architecture.

The primary motivation for this functionality is to support platforms
that mix "System RAM" and "Persistent Memory" within a single section,
or multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.
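
A hedged worked example of the sizing claim, using x86_64 defaults
(SECTION_SIZE_BITS = 27, PMD_SHIFT = 21); the EXAMPLE_* names are
illustrative, not part of the patch:

#include <linux/bitmap.h>

#define EXAMPLE_SECTION_SIZE_BITS	27	/* 128MB sections */
#define EXAMPLE_SUBSECTION_SHIFT	21	/* 2MB (PMD_SIZE) sub-sections */
#define EXAMPLE_SUBSECTIONS_PER_SECTION \
	(1UL << (EXAMPLE_SECTION_SIZE_BITS - EXAMPLE_SUBSECTION_SHIFT))	/* == 64 */

/* Mirrors the new struct: the bitmap adds exactly one unsigned long on x86_64 */
struct example_mem_section_usage {
	DECLARE_BITMAP(subsection_map, EXAMPLE_SUBSECTIONS_PER_SECTION);
	unsigned long pageblock_flags[0];	/* the existing usemap follows, unchanged */
};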

Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Signed-off-by: Dan Williams 
---
 arch/powerpc/include/asm/sparsemem.h |3 +
 include/linux/mmzone.h   |   48 +++-
 mm/memory_hotplug.c  |   18 
 mm/page_alloc.c  |2 -
 mm/sparse.c  |   81 +-
 5 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index 3192d454a733..1aa3c9303bf8 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -10,6 +10,9 @@
  */
 #define SECTION_SIZE_BITS   24
 
+/* Reflect the largest possible PMD-size as the subsection-size constant */
+#define ARCH_SUBSECTION_SHIFT 24
+
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 427b79c39b3c..ac163f2f274f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1161,6 +1161,44 @@ static inline unsigned long section_nr_to_pfn(unsigned 
long sec)
 #define SECTION_ALIGN_UP(pfn)  (((pfn) + PAGES_PER_SECTION - 1) & 
PAGE_SECTION_MASK)
 #define SECTION_ALIGN_DOWN(pfn)((pfn) & PAGE_SECTION_MASK)
 
+/*
+ * SUBSECTION_SHIFT must be constant since it is used to declare
+ * subsection_map and related bitmaps without triggering the generation
+ * of variable-length arrays. The most natural size for a subsection is
+ * a PMD-page. For architectures that do not have a constant PMD-size
+ * ARCH_SUBSECTION_SHIFT can be set to a constant max size, or otherwise
+ * fallback to 2MB.
+ */
+#if defined(ARCH_SUBSECTION_SHIFT)
+#define SUBSECTION_SHIFT (ARCH_SUBSECTION_SHIFT)
+#elif defined(PMD_SHIFT)
+#define SUBSECTION_SHIFT (PMD_SHIFT)
+#else
+/*
+ * Memory hotplug enabled platforms avoid this default because they
+ * either define ARCH_SUBSECTION_SHIFT, or PMD_SHIFT is a constant, but
+ * this is kept as a backstop to allow compilation on
+ * !ARCH_ENABLE_MEMORY_HOTPLUG archs.
+ */
+#define SUBSECTION_SHIFT 21
+#endif
+
+#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
+#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
+#define PAGE_SUBSECTION_MASK ((~(PAGES_PER_SUBSECTION-1)))
+
+#if SUBSECTION_SHIFT > SECTION_SIZE_BITS
+#error Subsection size exceeds section size
+#else
+#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
+#endif
+
+struct mem_section_usage {
+   DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
+   /* See declaration of similar field in struct zone */
+   unsigned long pageblock_flags[0];
+};
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1178,8 +1216,7 @@ struct mem_section {
 */
unsigned long section_mem_map;
 
-   /* See declaration of similar field in struct zone */
-   unsigned long *pageblock_flags;
+   struct mem_section_usage *usage;
 #ifdef CONFIG_PAGE_EXTENSION
/*
 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -1210,6 +1247,11 @@ extern struct mem_section **mem_section;
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
+static inline unsigned long *section_to_usemap(struct mem_section *ms)
+{
+   return ms->usage->pageblock_flags;
+}
+
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1

[PATCH v9 02/12] mm/sparsemem: Add helpers track active portions of a section at boot

2019-06-05 Thread Dan Williams
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.

The implication of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and read the sub-section
active ranges from the bitmask. The expectation is that the bitmask
(subsection_map) fits in the same cacheline as the valid_section() data,
so the incremental performance overhead to pfn_valid() should be
negligible.
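
A hedged worked example of the lookup being described, with 4K pages,
2MB sub-sections and 128MB sections (PAGES_PER_SUBSECTION = 512,
PAGES_PER_SECTION = 32768); example_subsection_map_index() simply mirrors
the patch's helper:

static inline int example_subsection_map_index(unsigned long pfn)
{
	/* offset into the section, divided by the 512-pfn sub-section span */
	return (pfn & ~PAGE_SECTION_MASK) / PAGES_PER_SUBSECTION;
}

/*
 * e.g. pfn 0x8300 is 0x300 (768) pfns into its section, so pfn_valid()
 * additionally tests bit 768 / 512 = 1 of that section's subsection_map.
 */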

Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Tested-by: Jane Chu 
Signed-off-by: Dan Williams 
---
 include/linux/mmzone.h |   29 -
 mm/page_alloc.c|4 +++-
 mm/sparse.c|   35 +++
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac163f2f274f..6dd52d544857 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1199,6 +1199,8 @@ struct mem_section_usage {
unsigned long pageblock_flags[0];
 };
 
+void subsection_map_init(unsigned long pfn, unsigned long nr_pages);
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1336,12 +1338,36 @@ static inline struct mem_section 
*__pfn_to_section(unsigned long pfn)
 
 extern int __highest_present_section_nr;
 
+static inline int subsection_map_index(unsigned long pfn)
+{
+   return (pfn & ~(PAGE_SECTION_MASK)) / PAGES_PER_SUBSECTION;
+}
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+   int idx = subsection_map_index(pfn);
+
+   return test_bit(idx, ms->usage->subsection_map);
+}
+#else
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+   return 1;
+}
+#endif
+
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
 static inline int pfn_valid(unsigned long pfn)
 {
+   struct mem_section *ms;
+
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
-   return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+   ms = __nr_to_section(pfn_to_section_nr(pfn));
+   if (!valid_section(ms))
+   return 0;
+   return pfn_section_valid(ms, pfn);
 }
 #endif
 
@@ -1373,6 +1399,7 @@ void sparse_init(void);
 #define sparse_init()  do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
 #define pfn_present pfn_valid
+#define subsection_map_init(_pfn, _nr_pages) do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6d8224d792e..bd773efe5b82 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7292,10 +7292,12 @@ void __init free_area_init_nodes(unsigned long 
*max_zone_pfn)
 
/* Print out the early node map */
pr_info("Early memory node ranges\n");
-   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
(u64)start_pfn << PAGE_SHIFT,
((u64)end_pfn << PAGE_SHIFT) - 1);
+   subsection_map_init(start_pfn, end_pfn - start_pfn);
+   }
 
/* Initialise every node */
mminit_verify_pageflags_layout();
diff --git a/mm/sparse.c b/mm/sparse.c
index 71da15cc7432..0baa2e55cfdd 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -210,6 +210,41 @@ static inline unsigned long first_present_section_nr(void)
return next_present_section_nr(-1);
 }
 
+void subsection_mask_set(unsigned long *map, unsigned long pfn,
+   unsigned long nr_pages)
+{
+   int idx = subsection_map_index(pfn);
+   int end = subsection_map_index(pfn + nr_pages - 1);
+
+   bitmap_set(map, idx, end - idx + 1);
+}
+
+void subsection_map_init(unsigned long pfn, unsigned long nr_pages)
+{
+   int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
+   int i, start_sec = pfn_to_section_nr(pfn);
+
+   if (!nr_pages)
+   return;
+
+   for (i = start_sec; i <= end_sec; i++) {
+   struct mem_section *ms;
+   unsigned long pfns;
+
+   pfns = min(nr_pages, PAGES_PER_SECTION
+   - (pfn & ~PAGE_SECTION_MASK));
+   ms = __nr_to_section(i);
+   subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
+
+   pr_debug("%s: sec: %d pfns: %ld set(%d, %d)\n", __func__, i,
+   pfns, subsection_map_index(pfn),
+   subsection_map_index(pfn + pfns - 1));
+
+   pfn += pfns;
+   nr_pages -= pfns;
+   }
+}
+
 /* Record a memory area against a node. */
 void __init memory_present(int nid, unsigned long start, unsigned long end)
 {



[PATCH v9 03/12] mm/hotplug: Prepare shrink_{zone, pgdat}_span for sub-section removal

2019-06-05 Thread Dan Williams
Sub-section hotplug support reduces the unit of operation of hotplug
from section-sized units (PAGES_PER_SECTION) to sub-section-sized units
(PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
valid_section(), can toggle.
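
A hedged sketch of the resulting walk (loop bodies abbreviated):

	/* before: 128MB steps, whole sections only */
	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
		if (unlikely(!valid_section(__pfn_to_section(pfn))))
			continue;
		/* ... nid / zone checks ... */
	}

	/* after: 2MB steps, honoring partially populated sections */
	for (; pfn < end_pfn; pfn += PAGES_PER_SUBSECTION) {
		if (unlikely(!pfn_valid(pfn)))
			continue;
		/* ... nid / zone checks, unchanged ... */
	}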

Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Logan Gunthorpe 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Oscar Salvador 
Signed-off-by: Dan Williams 
---
 mm/memory_hotplug.c |   29 -
 1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7b963c2d3a0d..647859a1d119 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -318,12 +318,8 @@ static unsigned long find_smallest_section_pfn(int nid, 
struct zone *zone,
 unsigned long start_pfn,
 unsigned long end_pfn)
 {
-   struct mem_section *ms;
-
-   for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
-   ms = __pfn_to_section(start_pfn);
-
-   if (unlikely(!valid_section(ms)))
+   for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) {
+   if (unlikely(!pfn_valid(start_pfn)))
continue;
 
if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -343,15 +339,12 @@ static unsigned long find_biggest_section_pfn(int nid, 
struct zone *zone,
unsigned long start_pfn,
unsigned long end_pfn)
 {
-   struct mem_section *ms;
unsigned long pfn;
 
/* pfn is the end pfn of a memory section. */
pfn = end_pfn - 1;
-   for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
-   ms = __pfn_to_section(pfn);
-
-   if (unlikely(!valid_section(ms)))
+   for (; pfn >= start_pfn; pfn -= PAGES_PER_SUBSECTION) {
+   if (unlikely(!pfn_valid(pfn)))
continue;
 
if (unlikely(pfn_to_nid(pfn) != nid))
@@ -373,7 +366,6 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
unsigned long zone_end_pfn = z;
unsigned long pfn;
-   struct mem_section *ms;
int nid = zone_to_nid(zone);
 
zone_span_writelock(zone);
@@ -410,10 +402,8 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
 * it check the zone has only hole or not.
 */
pfn = zone_start_pfn;
-   for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
-   ms = __pfn_to_section(pfn);
-
-   if (unlikely(!valid_section(ms)))
+   for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
+   if (unlikely(!pfn_valid(pfn)))
continue;
 
if (page_zone(pfn_to_page(pfn)) != zone)
@@ -441,7 +431,6 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace 
clash */
unsigned long pgdat_end_pfn = p;
unsigned long pfn;
-   struct mem_section *ms;
int nid = pgdat->node_id;
 
if (pgdat_start_pfn == start_pfn) {
@@ -478,10 +467,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 * has only hole or not.
 */
pfn = pgdat_start_pfn;
-   for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
-   ms = __pfn_to_section(pfn);
-
-   if (unlikely(!valid_section(ms)))
+   for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) {
+   if (unlikely(!pfn_valid(pfn)))
continue;
 
if (pfn_to_nid(pfn) != nid)



Re: [PATCH v3] mm/swap: Fix release_pages() when releasing devmap pages

2019-06-04 Thread Dan Williams
On Tue, Jun 4, 2019 at 1:17 PM John Hubbard  wrote:
>
> On 6/4/19 1:11 PM, Dan Williams wrote:
> > On Tue, Jun 4, 2019 at 12:48 PM John Hubbard  wrote:
> >>
> >> On 6/4/19 9:48 AM, ira.we...@intel.com wrote:
> >>> From: Ira Weiny 
> >>>
> ...
> >>> diff --git a/mm/swap.c b/mm/swap.c
> >>> index 7ede3eddc12a..6d153ce4cb8c 100644
> >>> --- a/mm/swap.c
> >>> +++ b/mm/swap.c
> >>> @@ -740,15 +740,20 @@ void release_pages(struct page **pages, int nr)
> >>>   if (is_huge_zero_page(page))
> >>>   continue;
> >>>
> >>> - /* Device public page can not be huge page */
> >>> - if (is_device_public_page(page)) {
> >>> + if (is_zone_device_page(page)) {
> >>>   if (locked_pgdat) {
> >>>   
> >>> spin_unlock_irqrestore(_pgdat->lru_lock,
> >>>  flags);
> >>>   locked_pgdat = NULL;
> >>>   }
> >>> - put_devmap_managed_page(page);
> >>> - continue;
> >>> + /*
> >>> +  * Not all zone-device-pages require special
> >>> +  * processing.  Those pages return 'false' from
> >>> +  * put_devmap_managed_page() expecting a call to
> >>> +  * put_page_testzero()
> >>> +  */
> >>
> >> Just a documentation tweak: how about:
> >>
> >> /*
> >>  * ZONE_DEVICE pages that return 'false' from
> >>  * put_devmap_managed_page() do not require special
> >>  * processing, and instead, expect a call to
> >>  * put_page_testzero().
> >>  */
> >
> > Looks better to me, but maybe just go ahead and list those
> > expectations explicitly. Something like:
> >
> > /*
> >  * put_devmap_managed_page() only handles
> >  * ZONE_DEVICE (struct dev_pagemap managed)
> >  * pages when the hosting dev_pagemap has the
> >  * ->free() or ->fault() callback handlers
> >  *  implemented as indicated by
> >  *  dev_pagemap.type. Otherwise the expectation
> >  *  is to fall back to a plain decrement /
> >  *  put_page_testzero().
> >  */
>
> I like it--but not here, because it's too much internal detail in a
> call site that doesn't use that level of detail. The call site looks
> at the return value, only.
>
> Let's instead put that blurb above (or in) the put_devmap_managed_page()
> routine itself. And leave the blurb that I wrote where it is. And then I
> think everything will have an appropriate level of detail in the right places.

Ok.  Ideally there wouldn't be any commentary needed at the call site
and the put_page() could be handled internally by
put_devmap_managed_page(), but I did not see a way to do that without
breaking the compile-out / static-branch optimization when there are
no active ZONE_DEVICE users.
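
For reference, a hedged sketch of where the release_pages() call site
appears to land after this discussion (mirroring the v3 hunk quoted below
plus the comment wording agreed above); this is a reading of the thread,
not the final merged code:

		if (is_zone_device_page(page)) {
			if (locked_pgdat) {
				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
				locked_pgdat = NULL;
			}
			/*
			 * ZONE_DEVICE pages that return 'false' from
			 * put_devmap_managed_page() do not require special
			 * processing, and instead, expect a call to
			 * put_page_testzero().
			 */
			if (put_devmap_managed_page(page))
				continue;
		}

		if (!put_page_testzero(page))
			continue;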


Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-04 Thread Dan Williams
On Tue, 2019-06-04 at 22:04 +0200, Arnd Bergmann wrote:
> On Tue, Jun 4, 2019 at 5:18 PM Dan Williams  wrote:
> > On Tue, 2019-06-04 at 10:13 +0200, Arnd Bergmann wrote:
> > > Can you describe what kind of multiplexing is actually going on?
> > > I'm still unclear about what we actually use multiple logical
> > > interfaces for here, and how they relate to one another.
> > 
> > Each logical interface represents a different "connection" (PDP/EPS
> > context) to the provider network with a distinct IP address and
> > QoS.
> > VLANs may be a suitable analogy but here they are L3+QoS.
> > 
> > In realistic example the main interface (say rmnet0) would be used
> > for
> > web browsing and have best-effort QoS. A second interface (say
> > rmnet1)
> > would be used for VOIP and have certain QoS guarantees from both
> > the
> > modem and the network itself.
> > 
> > QMAP can also aggregate frames for a given channel
> > (connection/EPS/PDP
> > context/rmnet interface/etc) to better support LTE speeds.
> 
> Thanks, that's a very helpful explanation!
> 
> Is it correct to say then that the concept of having those separate
> connections would be required for any proper LTE modem
> implementation,
> but the QMAP protocol (and based on that, the rmnet implementation)
> is Qualcomm specific and shared only among several generations of
> modems from that one vendor?

Exactly correct.  This is what Johannes is discussing in his "cellular
modem APIs - take 2" thread about how this should all be organized at
the driver level, and I think we should figure that out before we commit
to IPA-with-a-useless-netdev that requires rmnets to be created on top.
That may end up being the solution, but let's have that discussion.

> > You mentioned the need to have a common user space interface
> for configuration, and if the above is true, I agree that we should
> try
> to achieve that, either by ensuring rmnet is generic enough to
> cover other vendors (and non-QMAP clients), or by creating a
> new user level interface that IPA/rmnet can be adapted to.

I would not suggest making rmnet generic; it's pretty QMAP-specific
(but QMAP is spoken by many, many modems: SoC, USB stick, and PCIe
minicard).

Instead, I think what Johannes is discussing is a better approach: a
kernel WWAN framework with a consistent user API that
rmnet/IPA/qmi_wwan/MBIM/QMI/serial/Sierra can all implement.

That wouldn't affect the core packet processing of IPA/rmnet but
instead:

1) when/how an rmnet device actually gets created on top of the IPA (or
qmi_wwan) device

AND (one of these two)

a) whether IPA creates a netdev on probe

OR

b) whether there is some "WWAN device" kernel object which userspace
interacts with to create rmnet channels on top of IPA

Dan



Re: [PATCH v3] mm/swap: Fix release_pages() when releasing devmap pages

2019-06-04 Thread Dan Williams
On Tue, Jun 4, 2019 at 12:48 PM John Hubbard  wrote:
>
> On 6/4/19 9:48 AM, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> >
> > release_pages() is an optimized version of a loop around put_page().
> > Unfortunately for devmap pages the logic is not entirely correct in
> > release_pages().  This is because device pages can be more than type
> > MEMORY_DEVICE_PUBLIC.  There are in fact 4 types, private, public, FS
> > DAX, and PCI P2PDMA.  Some of these have specific needs to "put" the
> > page while others do not.
> >
> > This logic to handle any special needs is contained in
> > put_devmap_managed_page().  Therefore all devmap pages should be
> > processed by this function where we can contain the correct logic for a
> > page put.
> >
> > Handle all device type pages within release_pages() by calling
> > put_devmap_managed_page() on all devmap pages.  If
> > put_devmap_managed_page() returns true the page has been put and we
> > continue with the next page.  A false return of
> > put_devmap_managed_page() means the page did not require special
> > processing and should fall to "normal" processing.
> >
> > This was found via code inspection while determining if release_pages()
> > and the new put_user_pages() could be interchangeable.[1]
> >
> > [1] 
> > https://lore.kernel.org/lkml/20190523172852.ga27...@iweiny-desk2.sc.intel.com/
> >
> > Cc: Jérôme Glisse 
> > Cc: Michal Hocko 
> > Reviewed-by: Dan Williams 
> > Reviewed-by: John Hubbard 
> > Signed-off-by: Ira Weiny 
> >
> > ---
> > Changes from V2:
> >   Update changelog for more clarity as requested by Michal
> >   Update comment WRT "failing" of put_devmap_managed_page()
> >
> > Changes from V1:
> >   Add comment clarifying that put_devmap_managed_page() can still
> >   fail.
> >   Add Reviewed-by tags.
> >
> >  mm/swap.c | 13 +
> >  1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 7ede3eddc12a..6d153ce4cb8c 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -740,15 +740,20 @@ void release_pages(struct page **pages, int nr)
> >   if (is_huge_zero_page(page))
> >   continue;
> >
> > - /* Device public page can not be huge page */
> > - if (is_device_public_page(page)) {
> > + if (is_zone_device_page(page)) {
> >   if (locked_pgdat) {
> >   
> > spin_unlock_irqrestore(_pgdat->lru_lock,
> >  flags);
> >   locked_pgdat = NULL;
> >   }
> > - put_devmap_managed_page(page);
> > - continue;
> > + /*
> > +  * Not all zone-device-pages require special
> > +  * processing.  Those pages return 'false' from
> > +  * put_devmap_managed_page() expecting a call to
> > +  * put_page_testzero()
> > +  */
>
> Just a documentation tweak: how about:
>
> /*
>  * ZONE_DEVICE pages that return 'false' from
>  * put_devmap_managed_page() do not require special
>  * processing, and instead, expect a call to
>  * put_page_testzero().
>  */

Looks better to me, but maybe just go ahead and list those
expectations explicitly. Something like:

/*
 * put_devmap_managed_page() only handles
 * ZONE_DEVICE (struct dev_pagemap managed)
 * pages when the hosting dev_pagemap has the
 * ->free() or ->fault() callback handlers
 *  implemented as indicated by
 *  dev_pagemap.type. Otherwise the expectation
 *  is to fall back to a plain decrement /
 *  put_page_testzero().
 */


Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-04 Thread Dan Williams
On Mon, 2019-06-03 at 13:04 -0600, Subash Abhinov Kasiviswanathan
wrote:
> > > I can't (or won't) comment right now on whether IPA needs its own
> > > netdev for rmnet to use.  The IPA endpoints used for the modem
> > > network interfaces are enabled when the netdev is opened and
> > > disabled when closed.  Outside of that, TX and RX are pretty
> > > much immediately passed through to the layer below or above.
> > > IPA currently has no other net device operations.
> > 
> > I don't really have issues with the patchset underneath the netdev
> > layer. I'm interested in how the various bits present themselves to
> > userspace, which is why I am trying to tie this in with Johannes'
> > conversation about WWAN devices, netdevs, channels, and how the
> > various
> > drivers present API for creating data channels that map to
> > different
> > modem contexts.
> > 
> > So let me rephrase. If the control plane has set up the default
> > context
> > and sent a QMI Start Network message (or the network attached the
> > default one) and the resulting IP details are applied to the IPA
> > netdev
> > can things just start sending data? Or do we need to create an
> > rmnet on
> > top to get that working?
> > 
> > Dan
> 
> Hi Dan
> 
> All data from the hardware will have the MAP headers.
> We still need to create rmnet devs over the IPA netdev and use it
> for 
> the
> data path.
> The IPA netdev will pass on the packets which it receives from the 
> hardware
> and queue it to network stack where it will be intercepted by the
> rmnet rx handler.

Ok, so IPA only needs a netdev so that rmnet has something to
send/receive packets to/from? This gets even closer to the discussion
in "cellular modem driver APIs - take 2" from last week.

Dan



Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-04 Thread Dan Williams
On Tue, 2019-06-04 at 10:13 +0200, Arnd Bergmann wrote:
> On Mon, Jun 3, 2019 at 3:32 PM Alex Elder  wrote:
> > On 6/3/19 5:04 AM, Arnd Bergmann wrote:
> > > On Sat, Jun 1, 2019 at 1:59 AM Subash Abhinov Kasiviswanathan
> > > 
> > > - What I'm worried about most here is the flow control handling
> > > on the
> > >   transmit side. The IPA driver now uses the modern BQL method to
> > >   control how much data gets submitted to the hardware at any
> > > time.
> > >   The rmnet driver also uses flow control using the
> > >   rmnet_map_command() function, that blocks tx on the higher
> > >   level device when the remote side asks us to.
> > >   I fear that doing flow control for a single physical device on
> > > two
> > >   separate netdev instances is counterproductive and confuses
> > >   both sides.
> > 
> > I understand what you're saying here, and instinctively I think
> > you're right.
> > 
> > But BQL manages the *local* interface's ability to get rid of
> > packets, whereas the QMAP flow control is initiated by the other
> > end of the connection (the modem in this case).
> > 
> > With multiplexing, it's possible that one of several logical
> > devices on the modem side has exhausted a resource and must
> > ask the source of the data on the host side to suspend the
> > flow.  Meanwhile the other logical devices sharing the physical
> > link might be fine, and should not be delayed by the first one.
> > 
> > It is the multiplexing itself that confuses the BQL algorithm.
> > The abstraction obscures the *real* rates at which individual
> > logical connections are able to transmit data.
> 
> I would assume that the real rate constantly changes, at least
> for wireless interfaces that are also shared with other users
> on the same network. BQL is meant to deal with that, at least
> when using a modern queuing algorithm.
> 
> > Even if the multiple logical interfaces implemented BQL, they
> > would not get the feedback they need directly from the IPA
> > driver, because transmitting over the physical interface might
> > succeed even if the logical interface on the modem side can't
> > handle more data.  So I think the flow control commands may be
> > necessary, given multiplexing.
> 
> Can you describe what kind of multiplexing is actually going on?
> I'm still unclear about what we actually use multiple logical
> interfaces for here, and how they relate to one another.

Each logical interface represents a different "connection" (PDP/EPS
context) to the provider network with a distinct IP address and QoS.
VLANs may be a suitable analogy but here they are L3+QoS.

In a realistic example, the main interface (say rmnet0) would be used for
web browsing and have best-effort QoS. A second interface (say rmnet1)
would be used for VOIP and have certain QoS guarantees from both the
modem and the network itself.

QMAP can also aggregate frames for a given channel (connection/EPS/PDP
context/rmnet interface/etc) to better support LTE speeds.
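
A hedged sketch of the per-frame header that carries this multiplexing;
the field layout here is illustrative only (see struct rmnet_map_header in
drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h for the authoritative
definition):

#include <linux/types.h>

struct example_qmap_hdr {
	u8	pad_len:6;	/* bytes of trailing padding (aggregation) */
	u8	reserved:1;
	u8	cd_bit:1;	/* 1 = command (e.g. flow control), 0 = data */
	u8	mux_id;		/* selects the logical channel: rmnet0, rmnet1, ... */
	__be16	pkt_len;	/* payload length, including padding */
};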

Dan

> > The rmnet driver could use BQL, and could return NETDEV_TX_BUSY
> > for a logical interface when its TX flow has been stopped by a
> > QMAP command.  That way the feedback for BQL on the logical
> > interfaces would be provided in the right place.
> > 
> > I have no good intuition about the interaction between
> > two layered BQL managed queues though.
> 
> Returning NETDEV_TX_BUSY is usually a bad idea as that
> leads to unnecessary frame drop.
> 
> I do think that using BQL and the QMAP flow command on
> the /same/ device would be best, as that throttles the connection
> when either of the two algorithms wants us to slow down.
> 
> The question is mainly which of the two devices that should be.
> Doing it in the ipa driver is probably easier to implement here,
> but ideally I think we'd only have a single queue visible to the
> network stack, if we can come up with a way to do that.
> 
> > > - I was a little confused by the location of the rmnet driver in
> > >   drivers/net/ethernet/... More conventionally, I think as a
> > > protocol
> > >   handler it should go into net/qmap/, with the ipa driver going
> > >   into drivers/net/qmap/ipa/, similar to what we have fo
> > > ethernet,
> > >   wireless, ppp, appletalk, etc.
> > > 
> > > - The rx_handler uses gro_cells, which as I understand is meant
> > >   for generic tunnelling setups and takes another loop through
> > >   NAPI to aggregate data from multiple queues, but in case of
> > >   IPA's single-queue receive calling gro directly would be
> > > simpler
> > >   and more efficient.
> > 
> > I have been planning to investigate some of the generic GRO
> > stuff for IPA but was going to wait on that until the basic
> > code was upstream.
> 
> That's ok, that part can easily be changed after the fact, as it
> does not impact the user interface or the general design.
> 
> > >   From the overall design and the rmnet Kconfig description, it
> > >   appears as though the intention as that rmnet could be a
> > >   generic 

Re: [PATCH v8 09/12] mm/sparsemem: Support sub-section hotplug

2019-06-03 Thread Dan Williams
On Mon, May 13, 2019 at 6:55 AM Oscar Salvador  wrote:
>
> On Mon, May 06, 2019 at 04:40:14PM -0700, Dan Williams wrote:
> >
> > +void subsection_mask_set(unsigned long *map, unsigned long pfn,
> > + unsigned long nr_pages)
> > +{
> > + int idx = subsection_map_index(pfn);
> > + int end = subsection_map_index(pfn + nr_pages - 1);
> > +
> > + bitmap_set(map, idx, end - idx + 1);
> > +}
> > +
> >  void subsection_map_init(unsigned long pfn, unsigned long nr_pages)
> >  {
> >   int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> > @@ -219,20 +235,17 @@ void subsection_map_init(unsigned long pfn, unsigned 
> > long nr_pages)
> >   return;
> >
> >   for (i = start_sec; i <= end_sec; i++) {
> > - int idx, end;
> > - unsigned long pfns;
> >   struct mem_section *ms;
> > + unsigned long pfns;
> >
> > - idx = subsection_map_index(pfn);
> >   pfns = min(nr_pages, PAGES_PER_SECTION
> >   - (pfn & ~PAGE_SECTION_MASK));
> > - end = subsection_map_index(pfn + pfns - 1);
> > -
> >   ms = __nr_to_section(i);
> > - bitmap_set(ms->usage->subsection_map, idx, end - idx + 1);
> > + subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
> >
> >   pr_debug("%s: sec: %d pfns: %ld set(%d, %d)\n", __func__, i,
> > - pfns, idx, end - idx + 1);
> > + pfns, subsection_map_index(pfn),
> > + subsection_map_index(pfn + pfns - 1));
>
> I would definetely add subsection_mask_set() and above change to Patch#3.
> I think it suits there better than here.

Yes, done.

>
> >
> >   pfn += pfns;
> >   nr_pages -= pfns;
> > @@ -319,6 +332,15 @@ static void __meminit sparse_init_one_section(struct 
> > mem_section *ms,
> >   unsigned long pnum, struct page *mem_map,
> >   struct mem_section_usage *usage)
> >  {
> > + /*
> > +  * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug,
> > +  * ->section_mem_map can not be guaranteed to point to a full
> > +  *  section's worth of memory.  The field is only valid / used
> > +  *  in the SPARSEMEM_VMEMMAP=n case.
> > +  */
> > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> > + mem_map = NULL;
> > +
> >   ms->section_mem_map &= ~SECTION_MAP_MASK;
> >   ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
> >   SECTION_HAS_MEM_MAP;
> > @@ -724,10 +746,142 @@ static void free_map_bootmem(struct page *memmap)
> >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> >  #endif /* CONFIG_SPARSEMEM_VMEMMAP */
> >
> > +#ifndef CONFIG_MEMORY_HOTREMOVE
> > +static void free_map_bootmem(struct page *memmap)
> > +{
> > +}
> > +#endif
> > +
> > +static bool is_early_section(struct mem_section *ms)
> > +{
> > + struct page *usage_page;
> > +
> > + usage_page = virt_to_page(ms->usage);
> > + if (PageSlab(usage_page) || PageCompound(usage_page))
> > + return false;
> > + else
> > + return true;
> > +}
> > +
> > +static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> > + int nid, struct vmem_altmap *altmap)
> > +{
> > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> > + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
> > + struct mem_section *ms = __pfn_to_section(pfn);
> > + bool early_section = is_early_section(ms);
> > + struct page *memmap = NULL;
> > + unsigned long *subsection_map = ms->usage
> > + ? >usage->subsection_map[0] : NULL;
> > +
> > + subsection_mask_set(map, pfn, nr_pages);
> > + if (subsection_map)
> > + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
> > +
> > + if (WARN(!subsection_map || !bitmap_equal(tmp, map, 
> > SUBSECTIONS_PER_SECTION),
> > + "section already deactivated (%#lx + %ld)\n",
> > + pfn, nr_pages))
> > + return;
> > +
> > + if (WARN(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)
> > + && nr_pages < PAGES_PER_SECTI

Re: [PATCH v7 09/12] mm/sparsemem: Support sub-section hotplug

2019-06-03 Thread Dan Williams
On Fri, May 3, 2019 at 5:56 AM Oscar Salvador  wrote:
>
> On Wed, May 01, 2019 at 10:56:10PM -0700, Dan Williams wrote:
> > The libnvdimm sub-system has suffered a series of hacks and broken
> > workarounds for the memory-hotplug implementation's awkward
> > section-aligned (128MB) granularity. For example the following backtrace
> > is emitted when attempting arch_add_memory() with physical address
> > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> > within a given section:
> >
> >  WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 
> > devm_memremap_pages+0x3b5/0x4c0
> >  devm_memremap_pages attempted on mixed region [mem 0x2-0x2fbff 
> > flags 0x200]
> >  [..]
> >  Call Trace:
> >dump_stack+0x86/0xc3
> >__warn+0xcb/0xf0
> >warn_slowpath_fmt+0x5f/0x80
> >devm_memremap_pages+0x3b5/0x4c0
> >__wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >
> > Recently it was discovered that the problem goes beyond RAM vs PMEM
> > collisions as some platform produce PMEM vs PMEM collisions within a
> > given section. The libnvdimm workaround for that case revealed that the
> > libnvdimm section-alignment-padding implementation has been broken for a
> > long while. A fix for that long-standing breakage introduces as many
> > problems as it solves as it would require a backward-incompatible change
> > to the namespace metadata interpretation. Instead of that dubious route
> > [1], address the root problem in the memory-hotplug implementation.
> >
> > [1]: 
> > https://lore.kernel.org/r/155000671719.348031.2347363160141119237.st...@dwillia2-desk3.amr.corp.intel.com
> > Cc: Michal Hocko 
> > Cc: Vlastimil Babka 
> > Cc: Logan Gunthorpe 
> > Signed-off-by: Dan Williams 
> > ---
> >  mm/sparse.c |  223 
> > ---
> >  1 file changed, 150 insertions(+), 73 deletions(-)
> >
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 198371e5fc87..419a3620af6e 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -83,8 +83,15 @@ static int __meminit sparse_index_init(unsigned long 
> > section_nr, int nid)
> >   unsigned long root = SECTION_NR_TO_ROOT(section_nr);
> >   struct mem_section *section;
> >
> > + /*
> > +  * An existing section is possible in the sub-section hotplug
> > +  * case. First hot-add instantiates, follow-on hot-add reuses
> > +  * the existing section.
> > +  *
> > +  * The mem_hotplug_lock resolves the apparent race below.
> > +  */
> >   if (mem_section[root])
> > - return -EEXIST;
> > + return 0;
>
> Just a sidenote: we do not bail out on -EEXIST, so it should be fine if we
> stick with it.
> But if not, I would then clean up sparse_add_section:
>
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -901,13 +901,12 @@ int __meminit sparse_add_section(int nid, unsigned long 
> start_pfn,
> int ret;
>
> ret = sparse_index_init(section_nr, nid);
> -   if (ret < 0 && ret != -EEXIST)
> +   if (ret < 0)
> return ret;
>
> memmap = section_activate(nid, start_pfn, nr_pages, altmap);
> if (IS_ERR(memmap))
> return PTR_ERR(memmap);
> -   ret = 0;

Good catch, folded the cleanup.

>
>
> > +
> > + if (!mask)
> > + rc = -EINVAL;
> > + else if (mask & ms->usage->map_active)
>
> else if (ms->usage->map_active) should be enough?
>
> > + rc = -EEXIST;
> > + else
> > + ms->usage->map_active |= mask;
> > +
> > + if (rc) {
> > + if (usage)
> > + ms->usage = NULL;
> > + kfree(usage);
> > + return ERR_PTR(rc);
> > + }
> > +
> > + /*
> > +  * The early init code does not consider partially populated
> > +  * initial sections, it simply assumes that memory will never be
> > +  * referenced.  If we hot-add memory into such a section then we
> > +  * do not need to populate the memmap and can simply reuse what
> > +  * is already there.
> > +  */
>
> This puzzles me a bit.
> I think we cannot have partially populated early sections, can we?

Yes, at boot memory need not be section aligned; it has historically
been handled as an un-removable section of memory with holes.

> And ho

Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-03 Thread Dan Williams
On Mon, 2019-06-03 at 10:52 -0500, Alex Elder wrote:
> On 6/3/19 9:54 AM, Dan Williams wrote:
> > > To be perfectly honest, at first I thought having IPA use rmnet
> > > was a cargo cult thing like Dan suggested, because I didn't see
> > To be clear I only meant cargo-culting the naming, not any
> > functionality. Clearly IPA/rmnet/QMAP are pretty intimately
> > connected
> > at this point. But this goes back to whether IPA needs a netdev
> > itself
> > or whether you need an rmnet device created on top. If the former
> > then
> > I'd say no cargo-culting, if the later then it's a moot point
> > because
> > the device name will be rmnet%d anyway.
> 
> OK I thought you weren't sure why rmnet was a layer at all.  As I
> said, I didn't have a very good understanding of why it was even
> needed when I first started working on this.

No problem.

> I can't (or won't) comment right now on whether IPA needs its own
> netdev for rmnet to use.  The IPA endpoints used for the modem
> network interfaces are enabled when the netdev is opened and
> disabled when closed.  Outside of that, TX and RX are pretty
> much immediately passed through to the layer below or above.
> IPA currently has no other net device operations.

I don't really have issues with the patchset underneath the netdev
layer. I'm interested in how the various bits present themselves to
userspace, which is why I am trying to tie this in with Johannes'
conversation about WWAN devices, netdevs, channels, and how the various
drivers present API for creating data channels that map to different
modem contexts.

So let me rephrase. If the control plane has set up the default context
and sent a QMI Start Network message (or the network attached the
default one) and the resulting IP details are applied to the IPA netdev
can things just start sending data? Or do we need to create an rmnet on
top to get that working?

Dan



Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-03 Thread Dan Williams
On Fri, 2019-05-31 at 15:47 -0500, Alex Elder wrote:
> On 5/31/19 2:19 PM, Arnd Bergmann wrote:
> > On Fri, May 31, 2019 at 6:36 PM Alex Elder 
> > wrote:
> > > On 5/31/19 9:58 AM, Dan Williams wrote:
> > > > On Thu, 2019-05-30 at 22:53 -0500, Alex Elder wrote:
> > > > 
> > > > My question from the Nov 2018 IPA rmnet driver still stands;
> > > > how does
> > > > this relate to net/ethernet/qualcomm/rmnet/ if at all? And if
> > > > this is
> > > > really just a netdev talking to the IPA itself and unrelated to
> > > > net/ethernet/qualcomm/rmnet, let's call it "ipa%d" and stop
> > > > cargo-
> > > > culting rmnet around just because it happens to be a net driver
> > > > for a
> > > > QC SoC.
> > > 
> > > First, the relationship between the IPA driver and the rmnet
> > > driver
> > > is that the IPA driver is assumed to sit between the rmnet driver
> > > and the hardware.
> > 
> > Does this mean that IPA can only be used to back rmnet, and rmnet
> > can only be used on top of IPA, or can or both of them be combined
> > with another driver to talk to instead?
> 
> No it does not mean that.
> 
> As I understand it, one reason for the rmnet layer was to abstract
> the back end, which would allow using a modem, or using something
> else (a LAN?), without exposing certain details of the hardware.
> (Perhaps to support multiplexing, etc. without duplicating that
> logic in two "back-end" drivers?)
> 
> To be perfectly honest, at first I thought having IPA use rmnet
> was a cargo cult thing like Dan suggested, because I didn't see

To be clear I only meant cargo-culting the naming, not any
functionality. Clearly IPA/rmnet/QMAP are pretty intimately connected
at this point. But this goes back to whether IPA needs a netdev itself
or whether you need an rmnet device created on top. If the former then
I'd say no cargo-culting, if the latter then it's a moot point because
the device name will be rmnet%d anyway.

Dan

> the benefit.  I now see why one would use that pass-through layer
> to handle the QMAP features.
> 
> But back to your question.  The other thing is that I see no
> reason the IPA couldn't present a "normal" (non QMAP) interface
> for a modem.  It's something I'd really like to be able to do,
> but I can't do it without having the modem firmware change its
> configuration for these endpoints.  My access to the people who
> implement the modem firmware has been very limited (something
> I hope to improve), and unless and until I can get corresponding
> changes on the modem side to implement connections that don't
> use QMAP, I can't implement such a thing.
> 
> > > Currently the modem is assumed to use QMAP protocol.  This means
> > > each packet is prefixed by a (struct rmnet_map_header) structure
> > > that allows the IPA connection to be multiplexed for several
> > > logical
> > > connections.  The rmnet driver parses such messages and
> > > implements
> > > the multiplexed network interfaces.
> > > 
> > > QMAP protocol can also be used for aggregating many small packets
> > > into a larger message.  The rmnet driver implements de-
> > > aggregation
> > > of such messages (and could probably aggregate them for TX as
> > > well).
> > > 
> > > Finally, the IPA can support checksum offload, and the rmnet
> > > driver handles providing a prepended header (for TX) and
> > > interpreting the appended trailer (for RX) if these features
> > > are enabled.
> > > 
> > > So basically, the purpose of the rmnet driver is to handle QMAP
> > > protocol connections, and right now that's what the modem
> > > provides.
> > 
> > Do you have any idea why this particular design was picked?
> 
> I don't really.  I inherited it.  Early on, when I asked about
> the need for QMAP I was told it was important because it offered
> certain features, but at that time I was somewhat new to the code
> and didn't have the insight to judge the merits of the design.
> Since then I've mostly just accepted it and concentrated on
> improving the IPA driver.
> 
> > My best guess is that it evolved organically with multiple
> > generations of hardware and software, rather than being thought
> > out as a nice abstraction layer. If the two are tightly connected,
> > this might mean that what we actually want here is to reintegrate
> > the two components into a single driver with a much simpler
> > RX and TX path that handles the checksumming an

Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-06-03 Thread Dan Williams
On Fri, 2019-05-31 at 17:59 -0600, Subash Abhinov Kasiviswanathan
wrote:
> On 2019-05-31 17:33, Bjorn Andersson wrote:
> > On Fri 31 May 13:47 PDT 2019, Alex Elder wrote:
> > 
> > > On 5/31/19 2:19 PM, Arnd Bergmann wrote:
> > > > On Fri, May 31, 2019 at 6:36 PM Alex Elder 
> > > > wrote:
> > > > > On 5/31/19 9:58 AM, Dan Williams wrote:
> > > > > > On Thu, 2019-05-30 at 22:53 -0500, Alex Elder wrote:
> > > > > > 
> > > > > > My question from the Nov 2018 IPA rmnet driver still
> > > > > > stands; how does
> > > > > > this relate to net/ethernet/qualcomm/rmnet/ if at all? And
> > > > > > if this is
> > > > > > really just a netdev talking to the IPA itself and
> > > > > > unrelated to
> > > > > > net/ethernet/qualcomm/rmnet, let's call it "ipa%d" and stop
> > > > > > cargo-
> > > > > > culting rmnet around just because it happens to be a net
> > > > > > driver for a
> > > > > > QC SoC.
> > > > > 
> > > > > First, the relationship between the IPA driver and the rmnet
> > > > > driver
> > > > > is that the IPA driver is assumed to sit between the rmnet
> > > > > driver
> > > > > and the hardware.
> > > > 
> > > > Does this mean that IPA can only be used to back rmnet, and
> > > > rmnet
> > > > can only be used on top of IPA, or can both of them be
> > > > combined
> > > > with another driver to talk to instead?
> > > 
> > > No it does not mean that.
> > > 
> > > As I understand it, one reason for the rmnet layer was to
> > > abstract
> > > the back end, which would allow using a modem, or using something
> > > else (a LAN?), without exposing certain details of the hardware.
> > > (Perhaps to support multiplexing, etc. without duplicating that
> > > logic in two "back-end" drivers?)
> > > 
> > > To be perfectly honest, at first I thought having IPA use rmnet
> > > was a cargo cult thing like Dan suggested, because I didn't see
> > > the benefit.  I now see why one would use that pass-through layer
> > > to handle the QMAP features.
> > > 
> > > But back to your question.  The other thing is that I see no
> > > reason the IPA couldn't present a "normal" (non QMAP) interface
> > > for a modem.  It's something I'd really like to be able to do,
> > > but I can't do it without having the modem firmware change its
> > > configuration for these endpoints.  My access to the people who
> > > implement the modem firmware has been very limited (something
> > > I hope to improve), and unless and until I can get corresponding
> > > changes on the modem side to implement connections that don't
> > > use QMAP, I can't implement such a thing.
> > > 
> > 
> > But any such changes would either be years into the future or for
> > specific devices and as such not applicable to any/most of devices
> > on
> > the market now or in the coming years.
> > 
> > 
> > But as Arnd points out, if the software split between IPA and rmnet
> > is
> > suboptimal you are encouraged to fix that.
> > 
> > Regards,
> > Bjorn
> 
> The split rmnet design was chosen because we could place rmnet
> over any transport - IPA, PCIe (https://lkml.org/lkml/2018/4/26/1159)
> or USB.

Yeah, that's what I was looking for clarification on :) Clearly since
rmnet can have many transports it should be able to be used by
different HW drivers, be that qmi_wwan, IPA, and maybe even
rmnet_smd.c?

> rmnet registers a rx handler, so the rmnet packet processing itself
> happens in the same softirq when packets are queued to network stack
> by IPA.

This directly relates to the discussion about a WWAN subsystem that
Johannes Berg started a couple weeks ago. IPA appears to create a
netdev of its own. Is that netdev usable immediately, or does one need
to create an rmnet device on top to access the default PDN?

Thanks,
Dan



Re: [PATCH v10 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization

2019-05-31 Thread Dan Williams
On Fri, May 31, 2019 at 12:33 AM Vlastimil Babka  wrote:
>
> On 2/1/19 6:15 AM, Dan Williams wrote:
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1714,6 +1714,29 @@ config SLAB_FREELIST_HARDENED
> > sacrifices to harden the kernel slab allocator against common
> > freelist exploit methods.
> >
> > +config SHUFFLE_PAGE_ALLOCATOR
> > + bool "Page allocator randomization"
> > + default SLAB_FREELIST_RANDOM && ACPI_NUMA
> > + help
> > +   Randomization of the page allocator improves the average
> > +   utilization of a direct-mapped memory-side-cache. See section
> > +   5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
> > +   6.2a specification for an example of how a platform advertises
> > +   the presence of a memory-side-cache. There are also incidental
> > +   security benefits as it reduces the predictability of page
> > +   allocations to complement SLAB_FREELIST_RANDOM, but the
> > +   default granularity of shuffling on 4MB (MAX_ORDER) pages is
> > +   selected based on cache utilization benefits.
> > +
> > +   While the randomization improves cache utilization it may
> > +   negatively impact workloads on platforms without a cache. For
> > +   this reason, by default, the randomization is enabled only
> > +   after runtime detection of a direct-mapped memory-side-cache.
> > +   Otherwise, the randomization may be force enabled with the
> > +   'page_alloc.shuffle' kernel command line parameter.
> > +
> > +   Say Y if unsure.
>
> It says "Say Y if unsure", yet if I run make oldconfig, the default is
> N. Does that make sense?

The default is due to the general policy of not forcing users into new
kernel functionality (i.e. the common Linus objection when a new
config symbol is default 'y') . However, if someone is actively
considering whether to enable it I think there's no harm in
recommending 'y' because the facility currently needs to be paired
with the page_alloc.shuffle=1 command line option.
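
Concretely, the pairing looks something like this (illustrative, matching the
help text above; the command line parameter is the one named in the patch):

    # .config
    CONFIG_SHUFFLE_PAGE_ALLOCATOR=y

    # kernel command line, to force shuffling on a platform that does not
    # advertise a memory-side-cache via the ACPI HMAT
    page_alloc.shuffle=1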


Re: [PATCH v2 00/17] net: introduce Qualcomm IPA driver

2019-05-31 Thread Dan Williams
On Thu, 2019-05-30 at 22:53 -0500, Alex Elder wrote:
> This series presents the driver for the Qualcomm IP Accelerator
> (IPA).
> 
> This is version 2 of the series.  This version has addressed almost
> all of the feedback received in the first version:
>   
> https://lore.kernel.org/lkml/20190512012508.10608-1-el...@linaro.org/
> More detail is included in the individual patches, but here is a
> high-level summary of what's changed since then:
>   - Two spinlocks have been removed.
>   - The code for enabling and disabling endpoint interrupts has
> been simplified considerably, and the spinlock is no longer
>   required
>   - A spinlock used when updating ring buffer pointers is no
> longer needed.  Integers indexing the ring are used instead
>   (and they don't even have to be atomic).
>   - One spinlock remains to protect list updates, but it is always
> acquired using spin_lock_bh() (no more irqsave).
>   - Information about the queueing and completion of messages is now
> supplied to the network stack in batches rather than one at a
> time.
>   - I/O completion handling has been simplified, with the IRQ
> handler now consisting mainly of disabling the interrupt and
> calling napi_schedule().
>   - Some comments have been updated and improved throughout.
> 
> What follows is the introduction supplied with v1 of the series.
> 
> -
> 
> The IPA is a component present in some Qualcomm SoCs that allows
> network functions such as aggregation, filtering, routing, and NAT
> to be performed without active involvement of the main application
> processor (AP).
> 
> Initially, these advanced features are disabled; the IPA driver
> simply provides a network interface that makes the modem's LTE
> network available to the AP.  In addition, only support for the
> IPA found in the Qualcomm SDM845 SoC is provided.

My question from the Nov 2018 IPA rmnet driver still stands; how does
this relate to net/ethernet/qualcomm/rmnet/ if at all? And if this is
really just a netdev talking to the IPA itself and unrelated to
net/ethernet/qualcomm/rmnet, let's call it "ipa%d" and stop cargo-
culting rmnet around just because it happens to be a net driver for a
QC SoC.

Is the firmware that the driver loads already in linux-firmware or
going to be there soon?

How does the driver support multiple PDNs (eg PDP or EPS contexts) that
are enabled through the control plane via QMI messages? I couldn't
quite find that out.

Thanks,
Dan

> This code is derived from a driver developed internally by Qualcomm.
> A version of the original source can be seen here:
>   https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree
> in the "drivers/platform/msm/ipa" directory.  Many were involved in
> developing this, but the following individuals deserve explicit
> acknowledgement for their substantial contributions:
> 
> Abhishek Choubey
> Ady Abraham
> Chaitanya Pratapa
> David Arinzon
> Ghanim Fodi
> Gidon Studinski
> Ravi Gummadidala
> Shihuan Liu
> Skylar Chang
> 
> A version of this code was posted in November 2018 as an RFC.
>   
> https://lore.kernel.org/lkml/20181107003250.5832-1-el...@linaro.org/
> All feedback received was addressed.  The code has undergone
> considerable further rework since that time, and most of the
> "future work" described then has now been completed.
> 
> This code is available in buildable form here, based on kernel
> v5.2-rc1:
>   remote: ssh://g...@git.linaro.org/people/alex.elder/linux.git
>   branch: ipa-v2_kernel-v5.2-rc2
> 75adf2ac1266 arm64: defconfig: enable build of IPA code
> 
> The branch depends on a commit now found in net-next.  It has
> been cherry-picked, and (in this branch) has this commit ID:
>   13c627b5a078 net: qualcomm: rmnet: Move common struct definitions
> to include
> by 
> 
>   -Alex
> 
> Alex Elder (17):
>   bitfield.h: add FIELD_MAX() and field_max()
>   dt-bindings: soc: qcom: add IPA bindings
>   soc: qcom: ipa: main code
>   soc: qcom: ipa: configuration data
>   soc: qcom: ipa: clocking, interrupts, and memory
>   soc: qcom: ipa: GSI headers
>   soc: qcom: ipa: the generic software interface
>   soc: qcom: ipa: GSI transactions
>   soc: qcom: ipa: IPA interface to GSI
>   soc: qcom: ipa: IPA endpoints
>   soc: qcom: ipa: immediate commands
>   soc: qcom: ipa: IPA network device and microcontroller
>   soc: qcom: ipa: AP/modem communications
>   soc: qcom: ipa: support build of IPA code
>   MAINTAINERS: add entry for the Qualcomm IPA driver
>   arm64: dts: sdm845: add IPA information
>   arm64: defconfig: enable build of IPA code
> 
>  .../devicetree/bindings/net/qcom,ipa.yaml |  180 ++
>  MAINTAINERS   |6 +
>  arch/arm64/boot/dts/qcom/sdm845.dtsi  |   51 +
>  arch/arm64/configs/defconfig  |1 +
>  drivers/net/Kconfig   |2 +
>  drivers/net/Makefile   

Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race

2019-05-30 Thread Dan Williams
On Mon, May 13, 2019 at 12:22 PM Logan Gunthorpe  wrote:
>
>
>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
> >
> >
> > On 2019-05-07 5:55 p.m., Dan Williams wrote:
> >> Changes since v1 [1]:
> >> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
> >>
> >> - Refresh the p2pdma patch headers to match the format of other p2pdma
> >>patches (Bjorn)
> >>
> >> - Collect Ira's reviewed-by
> >>
> >> [1]: 
> >> https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.st...@dwillia2-desk3.amr.corp.intel.com/
> >
> > This series looks good to me:
> >
> > Reviewed-by: Logan Gunthorpe 
> >
> > However, I haven't tested it yet but I intend to later this week.
>
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.

Hi Andrew,

With this tested-by can we move forward on this fix set? I'm not aware
of any other remaining comments. Greg had a question about
"drivers/base/devres: Introduce devm_release_action()" that I
answered, but otherwise the feedback has gone silent.


[GIT PULL] libnvdimm fixes for v5.2-rc2

2019-05-25 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-5.2-rc2

...to receive a regression fix, a small (2 line code change)
performance enhancement, and some miscellaneous compilation warning
fixes. These have soaked in -next the past week with no known issues.
The device-mapper touches have Mike's ack, and the hardened user-copy
bypass was reviewed with Kees.

---

The following changes since commit a188339ca5a396acc588e5851ed7e19f66b0ebd9:

  Linux 5.2-rc1 (2019-05-19 15:47:09 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-5.2-rc2

for you to fetch changes up to 52f476a323f9efc959be1c890d0cdcf12e1582e0:

  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead (2019-05-20
20:43:32 -0700)


libnvdimm fixes v5.2-rc2

- Fix a regression that disabled device-mapper dax support

- Remove unnecessary hardened-user-copy overhead (>30%) for dax
  read(2)/write(2).

- Fix some compilation warnings.

----
Dan Williams (2):
  dax: Arrange for dax_supported check to span multiple devices
  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

Qian Cai (1):
  libnvdimm: Fix compilation warnings with W=1

 drivers/dax/super.c  | 88 
 drivers/md/dm-table.c| 17 ++---
 drivers/md/dm.c  | 20 ++
 drivers/md/dm.h  |  1 +
 drivers/nvdimm/bus.c |  4 +-
 drivers/nvdimm/label.c   |  2 +
 drivers/nvdimm/label.h   |  2 -
 drivers/nvdimm/pmem.c| 11 +-
 drivers/s390/block/dcssblk.c |  1 +
 include/linux/dax.h  | 26 +
 10 files changed, 129 insertions(+), 43 deletions(-)


Re: [PATCH] mm/swap: Fix release_pages() when releasing devmap pages

2019-05-24 Thread Dan Williams
On Thu, May 23, 2019 at 8:58 PM Dan Williams  wrote:
>
> On Thu, May 23, 2019 at 3:37 PM  wrote:
> >
> > From: Ira Weiny 
> >
> > Device pages can be more than type MEMORY_DEVICE_PUBLIC.
> >
> > Handle all device pages within release_pages()
> >
> > This was found via code inspection while determining if release_pages()
> > and the new put_user_pages() could be interchangeable.
> >
> > Cc: Jérôme Glisse 
> > Cc: Dan Williams 
> > Cc: Michal Hocko 
> > Cc: John Hubbard 
> > Signed-off-by: Ira Weiny 
> > ---
> >  mm/swap.c | 7 +++
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 3a75722e68a9..d1e8122568d0 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -739,15 +739,14 @@ void release_pages(struct page **pages, int nr)
> > if (is_huge_zero_page(page))
> > continue;
> >
> > -   /* Device public page can not be huge page */
> > -   if (is_device_public_page(page)) {
> > +   if (is_zone_device_page(page)) {
> > if (locked_pgdat) {
> > 
> > > spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> >flags);
> > locked_pgdat = NULL;
> > }
> > -   put_devmap_managed_page(page);
> > -   continue;
> > +   if (put_devmap_managed_page(page))
>
> This "shouldn't" fail, and if it does the code that follows might get
> confused by a ZONE_DEVICE page. If anything I would make this a
> WARN_ON_ONCE(!put_devmap_managed_page(page)), but always continue
> unconditionally.

As discussed offline, I'm wrong here. It needs to fall through to
put_page_testzero() for the device-dax case, but perhaps a comment for
the next time I forget that subtlety.
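
Something like the following sketch of that hunk, with the comment spelled out
(illustrative only, not a formal patch):

        if (is_zone_device_page(page)) {
                if (locked_pgdat) {
                        spin_unlock_irqrestore(&locked_pgdat->lru_lock,
                                               flags);
                        locked_pgdat = NULL;
                }
                /*
                 * Device-dax pages are not devmap-managed, so
                 * put_devmap_managed_page() returns false for them and the
                 * page must fall through to put_page_testzero() below.
                 */
                if (put_devmap_managed_page(page))
                        continue;
        }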


Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX

2019-05-24 Thread Dan Williams
On Fri, May 24, 2019 at 9:07 AM Larry Bassel  wrote:
> On 14 May 19 16:01, Kirill A. Shutemov wrote:
> > On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
[..]
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f7d962d..4c1814c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct 
> > > vm_fault *vmf)
> > > return 0;
> > >  }
> > >
> > > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> > > +static pmd_t *huge_pmd_offset(struct mm_struct *mm,
> > > + unsigned long addr, unsigned long sz)
> >
> > Could you explain what this function suppose to do?
> >
> > As far as I can see vma_mmu_pagesize() is always PAGE_SIZE of DAX
> > filesystem. So we have 'sz' == PAGE_SIZE here.
>
> I thought so too, but in my testing I found that vma_mmu_pagesize() returns
> 4KiB, which differs from the DAX filesystem's 2MiB pagesize.

A given filesystem-dax vma is allowed to support both 4K and 2M
mappings, so the vma_mmu_pagesize() is not granular enough to describe
the capabilities of a filesystem-dax vma. In the device-dax case,
where there are mapping guarantees, the implementation does arrange
for vma_mmu_pagesize() to reflect the right page size.
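
For reference, device-dax reports that via the vm_operations ->pagesize() hook
that vma_kernel_pagesize()/vma_mmu_pagesize() consult. Roughly (paraphrasing
drivers/dax/device.c from memory, not a verbatim copy):

    static unsigned long dev_dax_pagesize(struct vm_area_struct *vma)
    {
            struct file *filp = vma->vm_file;
            struct dev_dax *dev_dax = filp->private_data;

            /* the device enforces a fixed mapping granularity, e.g. 2M */
            return dev_dax->region->align;
    }

    static const struct vm_operations_struct dax_vm_ops = {
            .fault          = dev_dax_fault,
            .huge_fault     = dev_dax_huge_fault,
            .pagesize       = dev_dax_pagesize,
    };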


Re: [PATCH] mm/swap: Fix release_pages() when releasing devmap pages

2019-05-24 Thread Dan Williams
On Fri, May 24, 2019 at 8:35 AM Ira Weiny  wrote:
>
> On Thu, May 23, 2019 at 08:58:12PM -0700, Dan Williams wrote:
> > On Thu, May 23, 2019 at 3:37 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > Device pages can be more than type MEMORY_DEVICE_PUBLIC.
> > >
> > > Handle all device pages within release_pages()
> > >
> > > This was found via code inspection while determining if release_pages()
> > > and the new put_user_pages() could be interchangeable.
> > >
> > > Cc: Jérôme Glisse 
> > > Cc: Dan Williams 
> > > Cc: Michal Hocko 
> > > Cc: John Hubbard 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  mm/swap.c | 7 +++
> > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index 3a75722e68a9..d1e8122568d0 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -739,15 +739,14 @@ void release_pages(struct page **pages, int nr)
> > > if (is_huge_zero_page(page))
> > > continue;
> > >
> > > -   /* Device public page can not be huge page */
> > > -   if (is_device_public_page(page)) {
> > > +   if (is_zone_device_page(page)) {
> > > if (locked_pgdat) {
> > > 
> > > spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> > >flags);
> > > locked_pgdat = NULL;
> > > }
> > > -   put_devmap_managed_page(page);
> > > -   continue;
> > > +   if (put_devmap_managed_page(page))
> >
> > This "shouldn't" fail, and if it does the code that follows might get
>
> I agree it shouldn't based on the check.  However...
>
> > confused by a ZONE_DEVICE page. If anything I would make this a
> > WARN_ON_ONCE(!put_devmap_managed_page(page)), but always continue
> > unconditionally.
>
> I was trying to follow the pattern from put_page(), where if it fails it
> indicates it was not a devmap page and so "regular" processing should continue.

In this case that regular continuation already happened by not taking
the if (is_zone_device_page(page)) branch.

>
> Since I'm unsure I'll just ask what does this check do?
>
> if (!static_branch_unlikely(&devmap_managed_key))
> return false;

That attempts to skip the overhead imposed by device-pages, i.e.
->page_free() callback and other extras, if there are no device-page
producers in the system. I.e. use the old simple put_page() path when
there is no hmm or pmem.
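
For context, the helper in question looks roughly like this (paraphrasing
include/linux/mm.h of this era, not a verbatim copy):

    static inline bool put_devmap_managed_page(struct page *page)
    {
            /* no pmem/hmm producers registered: take the plain put_page() path */
            if (!static_branch_unlikely(&devmap_managed_key))
                    return false;
            if (!is_zone_device_page(page))
                    return false;
            switch (page->pgmap->type) {
            case MEMORY_DEVICE_PRIVATE:
            case MEMORY_DEVICE_PUBLIC:
            case MEMORY_DEVICE_FS_DAX:
                    /* page-ref handling with a ->page_free() callback */
                    __put_devmap_managed_page(page);
                    return true;
            default:
                    break;
            }
            return false;
    }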


Re: [PATCH] mm/swap: Fix release_pages() when releasing devmap pages

2019-05-23 Thread Dan Williams
On Thu, May 23, 2019 at 3:37 PM  wrote:
>
> From: Ira Weiny 
>
> Device pages can be more than type MEMORY_DEVICE_PUBLIC.
>
> Handle all device pages within release_pages()
>
> This was found via code inspection while determining if release_pages()
> and the new put_user_pages() could be interchangeable.
>
> Cc: Jérôme Glisse 
> Cc: Dan Williams 
> Cc: Michal Hocko 
> Cc: John Hubbard 
> Signed-off-by: Ira Weiny 
> ---
>  mm/swap.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 3a75722e68a9..d1e8122568d0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -739,15 +739,14 @@ void release_pages(struct page **pages, int nr)
> if (is_huge_zero_page(page))
> continue;
>
> -   /* Device public page can not be huge page */
> -   if (is_device_public_page(page)) {
> +   if (is_zone_device_page(page)) {
> if (locked_pgdat) {
> 
> spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>flags);
> locked_pgdat = NULL;
> }
> -   put_devmap_managed_page(page);
> -   continue;
> +   if (put_devmap_managed_page(page))

This "shouldn't" fail, and if it does the code that follows might get
confused by a ZONE_DEVICE page. If anything I would make this a
WARN_ON_ONCE(!put_devmap_managed_page(page)), but always continue
unconditionally.

Other than that you can add:

Reviewed-by: Dan Williams 


Re: [v6 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM

2019-05-21 Thread Dan Williams
On Fri, May 17, 2019 at 2:54 PM Pavel Tatashin
 wrote:
>
> It is now allowed to use persistent memory like a regular RAM, but
> currently there is no way to remove this memory until machine is
> rebooted.
>
> This work expands the functionality to also allows hotremoving
> previously hotplugged persistent memory, and recover the device for use
> for other purposes.
>
> To hotremove persistent memory, the management software must first
> offline all memory blocks of dax region, and than unbind it from
> device-dax/kmem driver. So, operations should look like this:
>
> echo offline > /sys/devices/system/memory/memoryN/state
> ...
> echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>
> Note: if unbind is done without offlining memory beforehand, it won't be
> possible to do dax0.0 hotremove, and dax's memory is going to be part of
> System RAM until reboot.
>
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: David Hildenbrand 
> ---
>  drivers/dax/dax-private.h |  2 ++
>  drivers/dax/kmem.c| 41 +++
>  2 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index a45612148ca0..999aaf3a29b3 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -53,6 +53,7 @@ struct dax_region {
>   * @pgmap - pgmap for memmap setup / lifetime (driver owned)
>   * @ref: pgmap reference count (driver owned)
>   * @cmp: @ref final put completion (driver owned)
> + * @dax_mem_res: physical address range of hotadded DAX memory
>   */
>  struct dev_dax {
> struct dax_region *region;
> @@ -62,6 +63,7 @@ struct dev_dax {
> struct dev_pagemap pgmap;
> struct percpu_ref ref;
> struct completion cmp;
> +   struct resource *dax_kmem_res;
>  };
>
>  static inline struct dev_dax *to_dev_dax(struct device *dev)
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 4c0131857133..3d0a7e702c94 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev)
> kfree(new_res);
> return rc;
> }
> +   dev_dax->dax_kmem_res = new_res;
>
> return 0;
>  }
>
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> +   struct dev_dax *dev_dax = to_dev_dax(dev);
> +   struct resource *res = dev_dax->dax_kmem_res;
> +   resource_size_t kmem_start = res->start;
> +   resource_size_t kmem_size = resource_size(res);
> +   int rc;
> +
> +   /*
> +* We have one shot for removing memory, if some memory blocks were 
> not
> +* offline prior to calling this function remove_memory() will fail, 
> and
> +* there is no way to hotremove this memory until reboot because 
> device
> +* unbind will succeed even if we return failure.
> +*/
> +   rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> +   if (rc) {
> +   dev_err(dev,
> +   "DAX region %pR cannot be hotremoved until the next 
> reboot\n",
> +   res);

Small quibbles with this error message... "DAX" is redundant since the
device name is printed by dev_err(). I'd suggest dropping "until the
next reboot" since there is no guarantee it will work then either and
the surefire mechanism to recover the memory from the kmem driver is
to not add it in the first place. Perhaps also print out the error
code in case it might specify a finer grained reason the memory is
pinned.
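
In other words, something roughly like this (untested, just to illustrate the
suggestion):

    rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
    if (rc) {
            /* the device name is already printed by dev_err() */
            dev_err(dev, "%pR cannot be hotremoved: %d\n", res, rc);
            return rc;
    }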

Other than that you can add

   Reviewed-by: Dan Williams 

...as it looks like Andrew will take this through -mm.


Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address

2019-05-20 Thread Dan Williams
On Mon, May 20, 2019 at 12:27 PM Jerome Glisse  wrote:
>
> On Mon, May 20, 2019 at 11:07:38AM +0530, Anshuman Khandual wrote:
> > On 05/18/2019 03:20 AM, Andrew Morton wrote:
> > > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual 
> > >  wrote:
> > >
> > >> The presence of struct page does not guarantee linear mapping for the pfn
> > >> physical range. Device private memory which is non-coherent is excluded
> > >> from linear mapping during devm_memremap_pages() though they will still
> > >> have struct page coverage. Just check for device private memory before
> > >> giving out virtual address for a given pfn.
> > >
> > > I was going to give my standard "what are the user-visible runtime
> > > effects of this change?", but...
> > >
> > >> All these helper functions are all pfn_t related but could not figure out
> > >> another way of determining a private pfn without looking into it's struct
> > >> page. pfn_t_to_virt() is not getting used any where in mainline kernel.Is
> > >> it used by out of tree drivers ? Should we then drop it completely ?
> > >
> > > Yeah, let's kill it.
> > >
> > > But first, let's fix it so that if someone brings it back, they bring
> > > back a non-buggy version.
> >
> > Makes sense.
> >
> > >
> > > So...  what (would be) the user-visible runtime effects of this change?
> >
> > I am not very well aware of the user interaction with the drivers which
> > hotplug and manage ZONE_DEVICE memory in general. Hence I will not be able to
> > comment on its user-visible runtime impact. I just figured this out from a
> > code audit while testing ZONE_DEVICE on an arm64 platform. But the fix makes
> > the function a bit more expensive as it now involves some additional memory
> > references.
>
> A device private pfn can never leak outside code that does not understand it.
> So this change is useless for any existing users and I would like to keep the
> existing behavior, i.e. never leak a device private pfn.

The issue is that only an HMM expert might know that such a pfn can
never leak; in other words, the pfn concept from a code perspective is
already leaked / widespread. Ideally any developer familiar with a pfn
and the core-mm pfn helpers need only worry about pfn semantics
without being required to go audit HMM users.


[PATCH v2] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-20 Thread Dan Williams
Jeff discovered that performance improves from ~375K iops to ~519K iops
on a simple psync-write fio workload when moving the location of 'struct
page' from the default PMEM location to DRAM. This result is surprising
because the expectation is that 'struct page' for dax is only needed for
third party references to dax mappings. For example, a dax-mapped buffer
passed to another system call for direct-I/O requires 'struct page' for
sending the request down the driver stack and pinning the page. There is
no usage of 'struct page' for first party access to a file via
read(2)/write(2) and friends.

However, this "no page needed" expectation is violated by
CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
check_heap_object() helper routine assumes the buffer is backed by a
slab allocator (DRAM) page and applies some checks.  Those checks are
invalid, dax pages do not originate from the slab, and redundant,
dax_iomap_actor() has already validated that the I/O is within bounds.
Specifically that routine validates that the logical file offset is
within bounds of the file, then it does a sector-to-pfn translation
which validates that the physical mapping is within bounds of the block
device.

Bypass additional hardened usercopy overhead and call the 'no check'
versions of the copy_{to,from}_iter operations directly.

Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
Cc: 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Al Viro 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Reported-and-tested-by: Jeff Smits 
Acked-by: Kees Cook 
Acked-by: Jan Kara 
Signed-off-by: Dan Williams 
---
Changes since v1 [1]:
* Update the changelog to clarify which checks in dax_iomap_actor()
  obviate the need for "hardened" checks. (Jan)
* Update the code comment in drivers/nvdimm/pmem.c to reflect the same.
* Collect some Acks from Kees and Jan.

[1]: 
https://lore.kernel.org/lkml/155805321833.867447.3864104616303535270.st...@dwillia2-desk3.amr.corp.intel.com/

 drivers/nvdimm/pmem.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 845c5b430cdd..c894f45e5077 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+/*
+ * Use the 'no check' versions of copy_from_iter_flushcache() and
+ * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
+ * checking is handled by dax_iomap_actor()
+ */
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   return copy_from_iter_flushcache(addr, bytes, i);
+   return _copy_from_iter_flushcache(addr, bytes, i);
 }
 
 static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   return copy_to_iter_mcsafe(addr, bytes, i);
+   return _copy_to_iter_mcsafe(addr, bytes, i);
 }
 
 static const struct dax_operations pmem_dax_ops = {



Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address

2019-05-20 Thread Dan Williams
On Sun, May 19, 2019 at 10:37 PM Anshuman Khandual
 wrote:
>
> On 05/18/2019 03:20 AM, Andrew Morton wrote:
> > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual 
> >  wrote:
> >
> >> The presence of struct page does not guarantee linear mapping for the pfn
> >> physical range. Device private memory which is non-coherent is excluded
> >> from linear mapping during devm_memremap_pages() though they will still
> >> have struct page coverage. Just check for device private memory before
> >> giving out virtual address for a given pfn.
> >
> > I was going to give my standard "what are the user-visible runtime
> > effects of this change?", but...
> >
> >> All these helper functions are all pfn_t related but could not figure out
> >> another way of determining a private pfn without looking into its struct
> >> page. pfn_t_to_virt() is not getting used anywhere in the mainline kernel. Is
> >> it used by out of tree drivers ? Should we then drop it completely ?
> >
> > Yeah, let's kill it.

+1 to killing it, since there has been a paucity of 'unsigned long
pfn' code path conversions to 'pfn_t', and it continues to go unused.

> > But first, let's fix it so that if someone brings it back, they bring
> > back a non-buggy version.

Not sure this can be solved without a rethink of who owns the virtual
address space corresponding to MEMORY_DEVICE_PRIVATE, and clawing back
some of the special-ness of HMM.

>
> Makes sense.
>
> >
> > So...  what (would be) the user-visible runtime effects of this change?
>
> I am not very well aware of the user interaction with the drivers which
> hotplug and manage ZONE_DEVICE memory in general. Hence I will not be able to
> comment on its user-visible runtime impact. I just figured this out from a
> code audit while testing ZONE_DEVICE on an arm64 platform. But the fix makes
> the function a bit more expensive as it now involves some additional memory
> references.

MEMORY_DEVICE_PRIVATE semantics were part of the package of the
initial HMM submission that landed in the kernel without an upstream
user. While pfn_t_to_virt() also does not have an upstream user it was
at least modeled after the existing pfn_to_virt() api to allow for
future 'unsigned long pfn' to 'pfn_t' conversions. As for what a fix
might look like, it seems to me that we should try to unify 'pfn_t'
and 'hmm_pfn's. I don't see why 'hmm_pfn's need to exist as their own
concept vs trying consume more flag space out of pfn_t. That would at
least allow the pfn_t_has_page() helper to detect the HMM case.
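
For reference, pfn_t already reserves a few high bits for flags, so the
unification would amount to carving out one more. Paraphrasing
include/linux/pfn_t.h (the PFN_DEV_PRIVATE line is purely hypothetical, for
illustration only):

    #define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
    #define PFN_SG_CHAIN   (1ULL << (BITS_PER_LONG_LONG - 1))
    #define PFN_SG_LAST    (1ULL << (BITS_PER_LONG_LONG - 2))
    #define PFN_DEV        (1ULL << (BITS_PER_LONG_LONG - 3))
    #define PFN_MAP        (1ULL << (BITS_PER_LONG_LONG - 4))
    /* hypothetical: #define PFN_DEV_PRIVATE (1ULL << (BITS_PER_LONG_LONG - 5)) */

    static inline bool pfn_t_has_page(pfn_t pfn)
    {
            return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
    }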


Re: [PATCH] dax: Fix last_page check in __bdev_dax_supported()

2019-05-20 Thread Dan Williams
On Thu, May 16, 2019 at 10:37 PM Vaibhav Jain  wrote:
>
> Dan Williams  writes:
>
> > On Wed, May 15, 2019 at 10:55 PM Vaibhav Jain  wrote:
> >>
> >> Presently __bdev_dax_supported() checks if first sector of last
> >> page ( last_page ) on the block device is aligned to page
> >> boundary. However the code to compute 'last_page' assumes that there
> >> are 8 sectors/page assuming a 4K page-size.
> >>
> >> This assumption breaks on architectures which use a different page
> >> size specifically PPC64 where page-size == 64K. Hence a warning is
> >> seen while trying to mount a xfs/ext4 file-system with dax enabled:
> >>
> >> $ sudo mount -o dax /dev/pmem0 /mnt/pmem
> >> XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> >> XFS (pmem0): DAX unsupported by block device. Turning off DAX.
> >>
> >> The patch fixes this issue by updating the calculation of 'last_page' to
> >> take into account number-of-sectors/page instead of assuming it to be
> >> '8'.
> >
> > Yes, I noticed this too and fixed it up in a wider change that also
> > allows device-mapper to validate each component device. Does this
> > patch work for you?
> >
> > https://lore.kernel.org/lkml/155789172402.748145.11853718580748830476.st...@dwillia2-desk3.amr.corp.intel.com/
>
> Thanks Dan, I tested your patch and not seeing the issue anymore.

Thanks, I recorded a "Tested-by" for you.


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-20 Thread Dan Williams
On Mon, May 20, 2019 at 12:52 AM Jan Kara  wrote:
>
> On Sat 18-05-19 21:46:03, Dan Williams wrote:
> > On Fri, May 17, 2019 at 12:25 PM Kees Cook  wrote:
> > > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > > > It seems dax_iomap_actor() is not a path where we'd be worried about
> > > > needing hardened user copy checks.
> > >
> > > I would agree: I think the proposed patch makes sense. :)
> >
> > Sounds like an acked-by to me.
>
> Yeah, if Kees agrees, I'm fine with skipping the checks as well. I just
> wanted that to be clarified. Also it helped me that you wrote:
>
> That routine (dax_iomap_actor()) validates that the logical file offset is
> within bounds of the file, then it does a sector-to-pfn translation which
> validates that the physical mapping is within bounds of the block device.
>
> That is more specific than "dax_iomap_actor() takes care of necessary
> checks" which was in the changelog. And the above paragraph helped me
> clarify which checks in dax_iomap_actor() you think replace those usercopy
> checks. So I think it would be good to add that paragraph to those
> copy_from_pmem() functions as a comment just in case we are wondering in
> the future why we are skipping the checks... Also feel free to add:
>
> Acked-by: Jan Kara 

Will do, thanks Jan.


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-19 Thread Dan Williams
On Fri, May 17, 2019 at 12:25 PM Kees Cook  wrote:
> On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > It seems dax_iomap_actor() is not a path where we'd be worried about
> > needing hardened user copy checks.
>
> I would agree: I think the proposed patch makes sense. :)

Sounds like an acked-by to me.


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Dan Williams
On Fri, May 17, 2019 at 8:57 AM Kees Cook  wrote:
>
> On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > As far as I can see it's mostly check_heap_object() that is the
> > problem, so I'm open to finding a way to just bypass that sub-routine.
> > However, as far as I can see none of the other block / filesystem user
> > copy implementations submit to the hardened checks, like
> > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
> > either those need to grow additional checks, or the hardened copy
> > implementation is targeting single object copy use cases, not
> > necessarily block-I/O. Yes, Kees, please advise.
>
> The intention is mainly for copies that haven't had explicit bounds
> checking already performed on them, yes. Is there something getting
> checked out of the slab, or is it literally just the overhead of doing
> the "is this slab?" check that you're seeing?

It's literally the overhead of "is this slab?" since it needs to go
retrieve the struct page and read that potentially cold cacheline. In
the case where that page is on memory media that is higher latency
than DRAM we get the ~37% performance loss that Jeff measured.
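
For reference, the struct-page dereference happens in check_heap_object(),
roughly (paraphrasing mm/usercopy.c of this era, not a verbatim copy):

    static inline void check_heap_object(const void *ptr, unsigned long n,
                                         bool to_user)
    {
            struct page *page;

            if (!virt_addr_valid(ptr))
                    return;

            /* this read is the potentially cold 'struct page' cacheline */
            page = virt_to_head_page(ptr);

            if (PageSlab(page))
                    __check_heap_object(ptr, n, page, to_user);
            else
                    check_page_span(ptr, n, page, to_user);
    }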

The path is via the filesystem ->write_iter() file operation. In the
DAX case the filesystem traps that path early, before submitting block
I/O, and routes it to the dax_iomap_actor() routine. That routine
validates that the logical file offset is within bounds of the file,
then it does a sector-to-pfn translation which validates that the
physical mapping is within bounds of the block device.

It seems dax_iomap_actor() is not a path where we'd be worried about
needing hardened user copy checks.


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Dan Williams
On Fri, May 17, 2019 at 1:47 AM Jan Kara  wrote:
>
> Let's add Kees to CC for usercopy expertise...
>
> On Thu 16-05-19 17:33:38, Dan Williams wrote:
> > Jeff discovered that performance improves from ~375K iops to ~519K iops
> > on a simple psync-write fio workload when moving the location of 'struct
> > page' from the default PMEM location to DRAM. This result is surprising
> > because the expectation is that 'struct page' for dax is only needed for
> > third party references to dax mappings. For example, a dax-mapped buffer
> > passed to another system call for direct-I/O requires 'struct page' for
> > sending the request down the driver stack and pinning the page. There is
> > no usage of 'struct page' for first party access to a file via
> > read(2)/write(2) and friends.
> >
> > However, this "no page needed" expectation is violated by
> > CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> > copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> > check_heap_object() helper routine assumes the buffer is backed by a
> > page-allocator DRAM page and applies some checks.  Those checks are
> > invalid, dax pages are not from the heap, and redundant,
> > dax_iomap_actor() has already validated that the I/O is within bounds.
>
> So this last paragraph is not obvious to me as check_copy_size() does a lot
> of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of
> those checks don't make sense for PMEM pages but I'd rather handle that by
> refining check_copy_size() and check_object_size() functions to detect and
> appropriately handle pmem pages rather that generally skip all the checks
> in pmem_copy_from/to_iter(). And yes, every check in such hot path is going
> to cost performance but that's what user asked for with
> CONFIG_HARDENED_USERCOPY... Kees?

As far as I can see it's mostly check_heap_object() that is the
problem, so I'm open to finding a way to just bypass that sub-routine.
However, as far as I can see none of the other block / filesystem user
copy implementations submit to the hardened checks, like
bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
either those need to grow additional checks, or the hardened copy
implementation is targeting single object copy use cases, not
necessarily block-I/O. Yes, Kees, please advise.


[PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-16 Thread Dan Williams
Jeff discovered that performance improves from ~375K iops to ~519K iops
on a simple psync-write fio workload when moving the location of 'struct
page' from the default PMEM location to DRAM. This result is surprising
because the expectation is that 'struct page' for dax is only needed for
third party references to dax mappings. For example, a dax-mapped buffer
passed to another system call for direct-I/O requires 'struct page' for
sending the request down the driver stack and pinning the page. There is
no usage of 'struct page' for first party access to a file via
read(2)/write(2) and friends.

However, this "no page needed" expectation is violated by
CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
check_heap_object() helper routine assumes the buffer is backed by a
page-allocator DRAM page and applies some checks.  Those checks are
invalid, dax pages are not from the heap, and redundant,
dax_iomap_actor() has already validated that the I/O is within bounds.

Bypass this overhead and call the 'no check' versions of the
copy_{to,from}_iter operations directly.

Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
Cc: Jan Kara 
Cc: 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Al Viro 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Reported-and-tested-by: Jeff Smits 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 845c5b430cdd..c894f45e5077 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device 
*dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+/*
+ * Use the 'no check' versions of copy_from_iter_flushcache() and
+ * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
+ * checking is handled by dax_iomap_actor()
+ */
 static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   return copy_from_iter_flushcache(addr, bytes, i);
+   return _copy_from_iter_flushcache(addr, bytes, i);
 }
 
 static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
 {
-   return copy_to_iter_mcsafe(addr, bytes, i);
+   return _copy_to_iter_mcsafe(addr, bytes, i);
 }
 
 static const struct dax_operations pmem_dax_ops = {



Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race

2019-05-16 Thread Dan Williams
On Thu, May 16, 2019 at 9:45 AM Jane Chu  wrote:
>
> Hi,
>
> I'm able to reproduce the panic below by running two sets of ndctl
> commands that actually serve legitimate purpose in parallel (unlike
> the brute force experiment earlier), each set in a indefinite loop.
> This time it takes about an hour to panic.  But I gather the cause
> is probably the same: I've overlapped ndctl commands on the same
> region.
>
> Could we add a check in nd_ioctl(), such that if there is
> an ongoing ndctl command on a region, subsequent ndctl request
> will fail immediately with something to the effect of EAGAIN?
> The rationale being that kernel should protect itself against
> user mistakes.

We do already have locking in the driver to prevent configuration
collisions. The problem looks to be broken assumptions about running
the device unregistration path in a separate thread outside the lock.
I suspect it may be incorrect assumptions about the userspace
visibility of the device relative to teardown actions. To be clear,
this isn't the nd_ioctl() path; this is the sysfs path.


> Also, since the subject fix is for a different problem, and has been
> verified, I'm happy to see it go upstream, so we have a better
> code base to dig deeper into how the destructive ndctl
> commands interact with typical mission critical applications, including
> but not limited to rdma.

Right, the crash signature you are seeing looks unrelated to the issue
being address in these patches which is device-teardown racing active
page pins. I'll start the investigation on the crash signature, but
again I don't think it reads on this fix series.


Re: [PATCH] nvdimm: fix compilation warnings with W=1

2019-05-16 Thread Dan Williams
On Thu, May 16, 2019 at 9:05 AM Qian Cai  wrote:
>
> Several places (dimm_devs.c, core.c etc) include label.h but only
> label.c uses NSINDEX_SIGNATURE, so move its definition to label.c
> instead.
>
> In file included from drivers/nvdimm/dimm_devs.c:23:
> drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but
> not used [-Wunused-const-variable=]
>
> Also, some places abuse "/**" which is only reserved for the kernel-doc.
>
> drivers/nvdimm/bus.c:648: warning: cannot understand function prototype:
> 'struct attribute_group nd_device_attribute_group = '
> drivers/nvdimm/bus.c:677: warning: cannot understand function prototype:
> 'struct attribute_group nd_numa_attribute_group = '
>
> Those are just some member assignments for the "struct attribute_group"
> instances and it can't be expressed in the kernel-doc.

Ah, good point, I missed that.

> Reviewed-by: Vishal Verma 
> Signed-off-by: Qian Cai 

Looks good, I'll pull this in for a post -rc1 update.


Re: [PATCH] dax: Fix last_page check in __bdev_dax_supported()

2019-05-16 Thread Dan Williams
On Wed, May 15, 2019 at 10:55 PM Vaibhav Jain  wrote:
>
> Presently __bdev_dax_supported() checks if first sector of last
> page ( last_page ) on the block device is aligned to page
> boundary. However the code to compute 'last_page' assumes that there
> are 8 sectors/page assuming a 4K page-size.
>
> This assumption breaks on architectures which use a different page
> size specifically PPC64 where page-size == 64K. Hence a warning is
> seen while trying to mount a xfs/ext4 file-system with dax enabled:
>
> $ sudo mount -o dax /dev/pmem0 /mnt/pmem
> XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> XFS (pmem0): DAX unsupported by block device. Turning off DAX.
>
> The patch fixes this issue by updating the calculation of 'last_page' to
> take into account number-of-sectors/page instead of assuming it to be
> '8'.

Yes, I noticed this too and fixed it up in a wider change that also
allows device-mapper to validate each component device. Does this
patch work for you?

https://lore.kernel.org/lkml/155789172402.748145.11853718580748830476.st...@dwillia2-desk3.amr.corp.intel.com/
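
For the record, the underlying issue is just the hard-coded 8; a
page-size-independent version of that calculation would look something like
the sketch below (illustrative only, not the code in either patch):

    unsigned int sectors_per_page = PAGE_SIZE >> SECTOR_SHIFT; /* 8 on 4K, 128 on 64K */
    sector_t last_page = PFN_DOWN(i_size_read(bdev->bd_inode) - 1) * sectors_per_page;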


[GIT PULL] libnvdimm fixes for v5.2-rc1

2019-05-15 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-5.2-rc1

...to receive just a small collection of fixes this time around. The
new virtio-pmem driver is nearly ready, but some last minute
device-mapper acks and virtio questions made it prudent to await v5.3.
Other major topics that were brewing on the linux-nvdimm mailing list
like sub-section hotplug, and other devm_memremap_pages() reworks will
go upstream through Andrew's tree.

These have seen multiple -next releases with no reported issues.

---

The following changes since commit 085b7755808aa11f78ab9377257e1dad2e6fa4bb:

  Linux 5.1-rc6 (2019-04-21 10:45:57 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-5.2-rc1

for you to fetch changes up to 67476656febd7ec5f1fe1aeec3c441fcf53b1e45:

  drivers/dax: Allow to include DEV_DAX_PMEM as builtin (2019-05-07
07:48:06 -0700)


libnvdimm fixes 5.2-rc1

* Fix a long standing namespace label corruption scenario when
  re-provisioning capacity for a namespace.

* Restore the ability of the dax_pmem module to be built-in.

* Harden the build for the 'nfit_test' unit test modules so that the
  userspace test harness can ensure all required test modules are
  available.


Aneesh Kumar K.V (1):
  drivers/dax: Allow to include DEV_DAX_PMEM as builtin

Dan Williams (1):
  libnvdimm/namespace: Fix label tracking error

Vishal Verma (2):
  dax/pmem: Fix whitespace in dax_pmem
  tools/testing/nvdimm: add watermarks for dax_pmem* modules

 drivers/dax/Kconfig |  3 +--
 drivers/dax/pmem/core.c |  6 +++---
 drivers/nvdimm/label.c  | 29 -
 drivers/nvdimm/namespace_devs.c | 15 +++
 drivers/nvdimm/nd.h |  4 
 tools/testing/nvdimm/Kbuild |  3 +++
 tools/testing/nvdimm/dax_pmem_compat_test.c |  8 
 tools/testing/nvdimm/dax_pmem_core_test.c   |  8 
 tools/testing/nvdimm/dax_pmem_test.c|  8 
 tools/testing/nvdimm/test/nfit.c|  3 +++
 tools/testing/nvdimm/watermark.h|  3 +++
 11 files changed, 72 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/nvdimm/dax_pmem_compat_test.c
 create mode 100644 tools/testing/nvdimm/dax_pmem_core_test.c
 create mode 100644 tools/testing/nvdimm/dax_pmem_test.c


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 5:25 PM Verma, Vishal L
 wrote:
>
> On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote:
> >
> > > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > > index 4671776f5623..9f02a99cfac0 100644
> > > --- a/drivers/nvdimm/btt.c
> > > +++ b/drivers/nvdimm/btt.c
> > > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct 
> > > bio_integrity_payload *bip,
> > >
> > > ret = btt_data_read(arena, page, off, postmap, cur_len);
> > > if (ret) {
> > > -   int rc;
> > > -
> > > /* Media error - set the e_flag */
> > > -   rc = btt_map_write(arena, premap, postmap, 0, 1,
> > > -   NVDIMM_IO_ATOMIC);
> > > +   btt_map_write(arena, premap, postmap, 0, 1,
> > > + NVDIMM_IO_ATOMIC);
> > > goto out_rtt;
> >
> > This doesn't look correct to me, shouldn't we at least be logging that
> > the bad-block failed to be persistently tracked?
>
> Yes logging it sounds good to me. Qian, can you include this in your
> respin or shall I send a fix for it separately (since we were always
> ignoring the failure here regardless of this patch)?

I think a separate fix for this makes more sense. Likely also needs to
be a ratelimited message in case a storm of errors is encountered.
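
Something along these lines, perhaps (an untested sketch of the shape of that
separate fix; it assumes btt.c's existing to_dev() helper and
dev_warn_ratelimited()):

    ret = btt_data_read(arena, page, off, postmap, cur_len);
    if (ret) {
            /* Media error - set the e_flag */
            if (btt_map_write(arena, premap, postmap, 0, 1, NVDIMM_IO_ATOMIC))
                    dev_warn_ratelimited(to_dev(arena),
                            "Error persistently tracking bad blocks at %#x\n",
                            premap);
            goto out_rtt;
    }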


Re: [v5 0/3] "Hotremove" persistent memory

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
 wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [  +0.004886] ==
> > [  +0.001576] WARNING: possible circular locking dependency detected
> > [  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
> > [  +0.000929] --
> > [  +0.000708] daxctl/22950 is trying to acquire lock:
> > [  +0.000548] f4d397f7 (kn->count#424){}, at: 
> > kernfs_remove_by_name_ns+0x40/0x80
> > [  +0.000922]
> >   but task is already holding lock:
> > [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: 
> > unregister_memory_section+0x22/0xa0
>
> I have studied this issue, and now have a clear understanding of why it
> happens. I am not yet sure how to fix it, so suggestions are welcome
> :)

I would think that ACPI hotplug would have a similar problem, but it does this:

acpi_unbind_memory_blocks(info);
__remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.


>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>  vfs_write
>   __vfs_write
>kernfs_fop_write
> kernfs_get_active
>  lock_acquire   kn->count#122 (lock for
> "memory8/state" kn)
> sysfs_kf_write
>  dev_attr_store
>   state_store
>device_offline
> memory_subsys_offline
>  memory_block_action
>   offline_pages
>__offline_pages
> percpu_down_write
>  down_write
>   lock_acquire  mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following  stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>  unbind_store
>   device_driver_detach
>device_release_driver_internal
> dev_dax_kmem_remove
>  remove_memory  device_hotplug_lock
>   try_remove_memory mem_hotplug_lock.rw_sem
>arch_remove_memory
> __remove_pages
>  __remove_section
>   unregister_memory_section
>remove_memory_sectionmem_sysfs_mutex
> unregister_memory
>  device_unregister
>   device_del
>device_remove_attrs
> sysfs_remove_groups
>  sysfs_remove_group
>   remove_files
>kernfs_remove_by_name
> kernfs_remove_by_name_ns
>  __kernfs_removekn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122


Re: [PATCH v2 12/30] dax: remove block device dependencies

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 12:28 PM Vivek Goyal  wrote:
>
> From: Stefan Hajnoczi 
>
> Although struct dax_device itself is not tied to a block device, some
> DAX code assumes there is a block device.  Make block devices optional
> by allowing bdev to be NULL in commonly used DAX APIs.
>
> When there is no block device:
>  * Skip the partition offset calculation in bdev_dax_pgoff()
>  * Skip the blkdev_issue_zeroout() optimization
>
> Note that more block device assumptions remain but I haven't reached those
> code paths yet.
>

Is there a generic object that non-block-based filesystems reference
for physical storage as a bdev stand-in? I assume "sector_t" is still
the common type for addressing filesystem capacity?

It just seems to me that we should stop pretending that the
filesystem-dax facility requires block devices and try to move this
functionality to generically use a dax device across all interfaces.
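
For illustration, the bdev-optional change described above could look
roughly like the sketch below; this is a sketch only, where the
NULL-bdev branch is the assumed new behavior layered on the existing
bdev_dax_pgoff() helper:

	int bdev_dax_pgoff(struct block_device *bdev, sector_t sector,
			size_t size, pgoff_t *pgoff)
	{
		/* no bdev means no partition, so no start-sector offset */
		sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
		phys_addr_t phys_off = (start_sect + sector) * 512;

		if (pgoff)
			*pgoff = PHYS_PFN(phys_off);
		if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
			return -EINVAL;
		return 0;
	}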


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Dan Williams
On Tue, May 14, 2019 at 8:08 AM Qian Cai  wrote:
>
> Several places (dimm_devs.c, core.c etc) include label.h but only
> label.c uses NSINDEX_SIGNATURE, so move its definition to label.c
> instead.
> In file included from drivers/nvdimm/dimm_devs.c:23:
> drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but
> not used [-Wunused-const-variable=]
>
> The commit d9b83c756953 ("libnvdimm, btt: rework error clearing") left
> an unused variable.
> drivers/nvdimm/btt.c: In function 'btt_read_pg':
> drivers/nvdimm/btt.c:1272:8: warning: variable 'rc' set but not used
> [-Wunused-but-set-variable]
>
> Last, some places abuse "/**" which is only reserved for the kernel-doc.
> drivers/nvdimm/bus.c:648: warning: cannot understand function prototype:
> 'struct attribute_group nd_device_attribute_group = '
> drivers/nvdimm/bus.c:677: warning: cannot understand function prototype:
> 'struct attribute_group nd_numa_attribute_group = '

Can you include the compiler version where these warnings start
appearing? I don't see these warnings with gcc-8.3.1.

>
> Reviewed-by: Vishal Verma 
> Signed-off-by: Qian Cai 
> ---
>  drivers/nvdimm/btt.c   | 6 ++
>  drivers/nvdimm/bus.c   | 4 ++--
>  drivers/nvdimm/label.c | 2 ++
>  drivers/nvdimm/label.h | 2 --
>  4 files changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index 4671776f5623..9f02a99cfac0 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct 
> bio_integrity_payload *bip,
>
> ret = btt_data_read(arena, page, off, postmap, cur_len);
> if (ret) {
> -   int rc;
> -
> /* Media error - set the e_flag */
> -   rc = btt_map_write(arena, premap, postmap, 0, 1,
> -   NVDIMM_IO_ATOMIC);
> +   btt_map_write(arena, premap, postmap, 0, 1,
> + NVDIMM_IO_ATOMIC);
> goto out_rtt;

This doesn't look correct to me; shouldn't we at least be logging that
the bad block failed to be persistently tracked?
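
A hedged sketch of what such logging could look like (the
dev_warn_ratelimited()/to_dev() usage below is my assumption, not the
submitted patch):

	ret = btt_data_read(arena, page, off, postmap, cur_len);
	if (ret) {
		/* Media error - set the e_flag */
		if (btt_map_write(arena, premap, postmap, 0, 1,
				NVDIMM_IO_ATOMIC))
			dev_warn_ratelimited(to_dev(arena),
				"error persistently tracking bad blocks at %#x\n",
				premap);
		goto out_rtt;
	}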

> }
>
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 7ff684159f29..2eb6a6cfe9e4 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -642,7 +642,7 @@ static struct attribute *nd_device_attributes[] = {
> NULL,
>  };
>
> -/**
> +/*
>   * nd_device_attribute_group - generic attributes for all devices on an nd 
> bus
>   */
>  struct attribute_group nd_device_attribute_group = {
> @@ -671,7 +671,7 @@ static umode_t nd_numa_attr_visible(struct kobject *kobj, 
> struct attribute *a,
> return a->mode;
>  }
>
> -/**
> +/*
>   * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
>   */

Let's just fix this to be valid kernel-doc format for a struct.

@@ -672,7 +672,7 @@ static umode_t nd_numa_attr_visible(struct kobject
*kobj, struct attribute *a,
 }

 /**
- * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
+ * struct nd_numa_attribute_group - NUMA attributes for all devices
on an nd bus
  */
 struct attribute_group nd_numa_attribute_group = {
.attrs = nd_numa_attributes,

>  struct attribute_group nd_numa_attribute_group = {
> diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
> index f3d753d3169c..02a51b7775e1 100644
> --- a/drivers/nvdimm/label.c
> +++ b/drivers/nvdimm/label.c
> @@ -25,6 +25,8 @@ static guid_t nvdimm_btt2_guid;
>  static guid_t nvdimm_pfn_guid;
>  static guid_t nvdimm_dax_guid;
>
> +static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
> +

Looks good to me.

>  static u32 best_seq(u32 a, u32 b)
>  {
> a &= NSINDEX_SEQ_MASK;
> diff --git a/drivers/nvdimm/label.h b/drivers/nvdimm/label.h
> index e9a2ad3c2150..4bb7add39580 100644
> --- a/drivers/nvdimm/label.h
> +++ b/drivers/nvdimm/label.h
> @@ -38,8 +38,6 @@ enum {
> ND_NSINDEX_INIT = 0x1,
>  };
>
> -static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";
> -
>  /**
>   * struct nd_namespace_index - label set superblock
>   * @sig: NAMESPACE_INDEX\0
> --
> 2.20.1 (Apple Git-117)
>


Re: [PATCH v9 4/7] dm: enable synchronous dax

2019-05-15 Thread Dan Williams
[ add Mike and dm-devel ]

Mike, any concerns with the below addition to the device-mapper-dax
implementation?

On Tue, May 14, 2019 at 7:58 AM Pankaj Gupta  wrote:
>
>  This patch sets dax device 'DAXDEV_SYNC' flag if all the target
>  devices of device mapper support synchrononous DAX. If device
>  mapper consists of both synchronous and asynchronous dax devices,
>  we don't set 'DAXDEV_SYNC' flag.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/md/dm-table.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index cde3b49b2a91..1cce626ff576 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -886,10 +886,17 @@ static int device_supports_dax(struct dm_target *ti, 
> struct dm_dev *dev,
> return bdev_dax_supported(dev->bdev, PAGE_SIZE);
>  }
>
> +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev,
> +  sector_t start, sector_t len, void *data)
> +{
> +   return dax_synchronous(dev->dax_dev);
> +}
> +
>  static bool dm_table_supports_dax(struct dm_table *t)
>  {
> struct dm_target *ti;
> unsigned i;
> +   bool dax_sync = true;
>
> /* Ensure that all targets support DAX. */
> for (i = 0; i < dm_table_get_num_targets(t); i++) {
> @@ -901,7 +908,14 @@ static bool dm_table_supports_dax(struct dm_table *t)
> if (!ti->type->iterate_devices ||
> !ti->type->iterate_devices(ti, device_supports_dax, NULL))
> return false;
> +
> +   /* Check devices support synchronous DAX */
> +   if (dax_sync &&
> +   !ti->type->iterate_devices(ti, device_synchronous, NULL))
> +   dax_sync = false;
> }
> +   if (dax_sync)
> +   set_dax_synchronous(t->md->dax_dev);
>
> return true;
>  }
> --
> 2.20.1
>


Re: [GIT PULL] device-dax for 5.1: PMEM as RAM

2019-05-15 Thread Dan Williams
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds
 wrote:
>
> On Mon, Mar 11, 2019 at 8:37 AM Dan Williams  wrote:
> >
> > Another feature the userspace tooling can support for the PMEM as RAM
> > case is the ability to complete an Address Range Scrub of the range
> > before it is added to the core-mm. I.e at least ensure that previously
> > encountered poison is eliminated.
>
> Ok, so this at least makes sense as an argument to me.
>
> In the "PMEM as filesystem" part, the errors have long-term history,
> while in "PMEM as RAM" the memory may be physically the same thing,
> but it doesn't have the history and as such may not be prone to
> long-term errors the same way.
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower because they don't accumulate the same way.

In case anyone is looking for the above-mentioned tooling for use with
the v5.1 kernel, Vishal has released ndctl-v65 with the new
"clear-errors" command [1].

[1]: https://pmem.io/ndctl/ndctl-clear-errors.html


Re: [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action()

2019-05-14 Thread Dan Williams
On Tue, May 14, 2019 at 12:12 PM Greg Kroah-Hartman
 wrote:
>
> On Tue, May 07, 2019 at 04:56:05PM -0700, Dan Williams wrote:
> > The devm_add_action() facility allows a resource allocation routine to
> > add custom devm semantics. One such user is devm_memremap_pages().
> >
> > There is now a need to manually trigger devm_memremap_pages_release().
> > Introduce devm_release_action() so the release action can be triggered
> > via a new devm_memunmap_pages() api in a follow-on change.
> >
> > Cc: Logan Gunthorpe 
> > Cc: Bjorn Helgaas 
> > Cc: Christoph Hellwig 
> > Cc: Greg Kroah-Hartman 
> > Cc: "Rafael J. Wysocki" 
> > Reviewed-by: Ira Weiny 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/base/devres.c  |   24 +++-
> >  include/linux/device.h |1 +
> >  2 files changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/devres.c b/drivers/base/devres.c
> > index e038e2b3b7ea..0bbb328bd17f 100644
> > --- a/drivers/base/devres.c
> > +++ b/drivers/base/devres.c
> > @@ -755,10 +755,32 @@ void devm_remove_action(struct device *dev, void 
> > (*action)(void *), void *data)
> >
> >   WARN_ON(devres_destroy(dev, devm_action_release, devm_action_match,
> >  ));
> > -
> >  }
> >  EXPORT_SYMBOL_GPL(devm_remove_action);
> >
> > +/**
> > + * devm_release_action() - release previously added custom action
> > + * @dev: Device that owns the action
> > + * @action: Function implementing the action
> > + * @data: Pointer to data passed to @action implementation
> > + *
> > + * Releases and removes instance of @action previously added by
> > + * devm_add_action().  Both action and data should match one of the
> > + * existing entries.
> > + */
> > +void devm_release_action(struct device *dev, void (*action)(void *), void 
> > *data)
> > +{
> > + struct action_devres devres = {
> > + .data = data,
> > + .action = action,
> > + };
> > +
> > + WARN_ON(devres_release(dev, devm_action_release, devm_action_match,
> > +));
>
> What does WARN_ON help here?  are we going to start getting syzbot
> reports of this happening?

Hopefully, yes: if developers misuse the API they get a loud
notification, similar to devm_remove_action() misuse.

> How can this fail?

It's a catch to make sure that @dev actually has a live devres
resource that can be found via @action and @data.
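
As a usage sketch (the driver, structure, and teardown function here
are hypothetical, just to show the pairing):

	static void foo_teardown(void *data)
	{
		struct foo *foo = data;

		foo_disable(foo);	/* hypothetical cleanup work */
	}

	static int foo_probe(struct device *dev, struct foo *foo)
	{
		int rc = devm_add_action_or_reset(dev, foo_teardown, foo);

		if (rc)
			return rc;
		/* ... */
		return 0;
	}

	static void foo_early_teardown(struct device *dev, struct foo *foo)
	{
		/* run and drop the action now, ahead of devres unwind */
		devm_release_action(dev, foo_teardown, foo);
	}

The WARN_ON() fires when the (action, data) pair was never registered
against @dev, or was already released.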


Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race

2019-05-14 Thread Dan Williams
On Tue, May 14, 2019 at 11:53 AM Jane Chu  wrote:
>
> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>
> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>
> Changes since v1 [1]:
> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>
> - Refresh the p2pdma patch headers to match the format of other p2pdma
>patches (Bjorn)
>
> - Collect Ira's reviewed-by
>
> [1]: 
> https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.st...@dwillia2-desk3.amr.corp.intel.com/
>
> This series looks good to me:
>
> Reviewed-by: Logan Gunthorpe 
>
> However, I haven't tested it yet but I intend to later this week.
>
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.
>
> Just wondering in a difference scenario where pmem pages are exported to
> a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
> will the kernel wait indefinitely until the user figures out to kill the guest
> and release the pmem pages?

It depends on whether the pages are pinned. Typically DAX memory
mappings assigned to a guest are not pinned in the host and can be
invalidated at any time. The pinning only occurs with VFIO and
device-assignment which isn't the common case, especially since that
configuration is blocked by fsdax. However, with devdax, yes you can
arrange for the system to go into an indefinite wait.

This somewhat ties back to the get_user_pages() vs DAX debate. The
indefinite stall issue with device-assignment could be addressed with
a requirement to hold a lease and expect that a lease revocation event
may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
expectation with device-dax is that it is already a raw interface with
pointy edges and caveats, but I would not be opposed to introducing a
lease semantic.


Re: [PATCH v8 00/12] mm: Sub-section memory hotplug support

2019-05-13 Thread Dan Williams
On Mon, May 13, 2019 at 2:02 PM Mike Rapoport  wrote:
>
> Hi Dan,
>
> On Mon, May 06, 2019 at 04:39:26PM -0700, Dan Williams wrote:
> > Changes since v7 [1]:
>
> Sorry for jumping late

No worries, it needs to be rebased on David's "mm/memory_hotplug:
Factor out memory block device handling" series which puts it firmly
in v5.3 territory.

> but presuming there will be v9, it'd be great if it
> would also include include updates to
> Documentation/admin-guide/mm/memory-hotplug.rst and

If I've done my job right this series should be irrelevant to
Documentation/admin-guide/mm/memory-hotplug.rst. The subsection
capability is strictly limited to arch_add_memory() users that never
online the memory, i.e. only ZONE_DEVICE / devm_memremap_pages()
users. So this isn't "memory-hotplug" as much as it is "support for
subsection vmemmap management".

> Documentation/vm/memory-model.rst

This looks more fitting and should probably include a paragraph on the
general ZONE_DEVICE / devm_memremap_pages() use case.


Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 5:45 PM Pankaj Gupta  wrote:
>
>
>
> > >
> > > This patch adds 'DAXDEV_SYNC' flag which is set
> > > for nd_region doing synchronous flush. This later
> > > is used to disable MAP_SYNC functionality for
> > > ext4 & xfs filesystem for devices don't support
> > > synchronous flush.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/dax/bus.c|  2 +-
> > >  drivers/dax/super.c  | 13 -
> > >  drivers/md/dm.c  |  3 ++-
> > >  drivers/nvdimm/pmem.c|  5 -
> > >  drivers/nvdimm/region_devs.c |  7 +++
> > >  include/linux/dax.h  |  8 ++--
> > >  include/linux/libnvdimm.h|  1 +
> > >  7 files changed, 33 insertions(+), 6 deletions(-)
> > [..]
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index 043f0761e4a0..ee007b75d9fd 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> > > sprintf(md->disk->disk_name, "dm-%d", minor);
> > >
> > > if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> > > -   dax_dev = alloc_dax(md, md->disk->disk_name, _dax_ops);
> > > +   dax_dev = alloc_dax(md, md->disk->disk_name, _dax_ops,
> > > +DAXDEV_F_SYNC);
> >
> > Apologies for not realizing this until now, but this is broken.
> > Imaging a device-mapper configuration composed of both 'async'
> > virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
> > across all members. I would change this argument to '0' and then
> > arrange for it to be set at dm_table_supports_dax() time after
> > validating that all components support synchronous dax.
>
> o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target
> components support synchronous DAX.
>
> Just a question, If device mapper configuration have composed of both
> virtio-pmem or pmem devices, we want to configure device mapper for async 
> flush?

If it's composed of both then, yes, it needs to be async flush at the
device-mapper level. Otherwise MAP_SYNC will succeed and fail to
trigger fsync on the host file when necessary for the virtio-pmem
backed portion of the device-mapper device.
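
For context, the filesystem side of this series gates MAP_SYNC on that
flag; roughly (a sketch based on the series under discussion, details
may differ in the final version):

	static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
					struct dax_device *dax_dev)
	{
		if (!(vma->vm_flags & VM_SYNC))
			return true;
		if (!IS_DAX(file_inode(vma->vm_file)))
			return false;
		return dax_synchronous(dax_dev);
	}

So a device-mapper table that reported itself as synchronous while
containing a virtio-pmem member would let MAP_SYNC mappings through
without the required host-side flush.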


Re: [Qemu-devel] [PATCH v7 2/6] virtio-pmem: Add virtio pmem driver

2019-05-10 Thread Dan Williams
On Wed, May 8, 2019 at 4:19 AM Pankaj Gupta  wrote:
>
>
> Hi Dan,
>
> Thank you for the review. Please see my reply inline.
>
> >
> > Hi Pankaj,
> >
> > Some minor file placement comments below.
>
> Sure.
>
> >
> > On Thu, Apr 25, 2019 at 10:02 PM Pankaj Gupta  wrote:
> > >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/nvdimm/virtio_pmem.c | 114 +
> > >  drivers/virtio/Kconfig   |  10 +++
> > >  drivers/virtio/Makefile  |   1 +
> > >  drivers/virtio/pmem.c| 118 +++
> > >  include/linux/virtio_pmem.h  |  60 
> > >  include/uapi/linux/virtio_ids.h  |   1 +
> > >  include/uapi/linux/virtio_pmem.h |  10 +++
> > >  7 files changed, 314 insertions(+)
> > >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> > >  create mode 100644 drivers/virtio/pmem.c
> > >  create mode 100644 include/linux/virtio_pmem.h
> > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > >
> > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > new file mode 100644
> > > index ..66b582f751a3
> > > --- /dev/null
> > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > @@ -0,0 +1,114 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include 
> > > +#include "nd.h"
> > > +
> > > + /* The interrupt handler */
> > > +void host_ack(struct virtqueue *vq)
> > > +{
> > > +   unsigned int len;
> > > +   unsigned long flags;
> > > +   struct virtio_pmem_request *req, *req_buf;
> > > +   struct virtio_pmem *vpmem = vq->vdev->priv;
> > > +
> > > +   spin_lock_irqsave(>pmem_lock, flags);
> > > +   while ((req = virtqueue_get_buf(vq, )) != NULL) {
> > > +   req->done = true;
> > > +   wake_up(>host_acked);
> > > +
> > > +   if (!list_empty(>req_list)) {
> > > +   req_buf = list_first_entry(>req_list,
> > > +   struct virtio_pmem_request, list);
> > > +   list_del(>req_list);
> > > +   req_buf->wq_buf_avail = true;
> > > +   wake_up(_buf->wq_buf);
> > > +   }
> > > +   }
> > > +   spin_unlock_irqrestore(>pmem_lock, flags);
> > > +}
> > > +EXPORT_SYMBOL_GPL(host_ack);
> > > +
> > > + /* The request submission function */
> > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > +   int err;
> > > +   unsigned long flags;
> > > +   struct scatterlist *sgs[2], sg, ret;
> > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > +   struct virtio_pmem *vpmem = vdev->priv;
> > > +   struct virtio_pmem_request *req;
> > > +
> > > +   might_sleep();
> > > +   req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > +   if (!req)
> > > +   return -ENOMEM;
> > > +
> > > +   req->done = req->wq_buf_avail = false;
> > > +   strcpy(req->name, "FLUSH");
> > > +   init_waitqueue_head(>host_acked);
> > > +   init_waitqueue_head(>wq_buf);
> > > +   sg_init_one(, req->name, strlen(req->name));
> > > +   sgs[0] = 
> > > +   sg_init_one(, >ret, sizeof(req->ret));
> > > +   sgs[1] = 
> > > +
> > > +   spin_lock_irqsave(>pmem_lock, flags);
> > > +   err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, 
> > > GFP_ATOMIC);
> > > +   if (err) {
> > > +   dev_err(>dev, "failed to send command to virtio pmem
> > > device\n");
> > > +
> > > +   list_add_tail(>req_list, >list);
> > > +   spin_unlock_irqrestore(>pmem_lock, flags);
> > > +
> > > +   /* When host has read buffer, this completes via host_ack
> > > */
> > > +   wait_event(req->wq_buf, req->wq_buf_avail);
> > > +   spin_lock_irqsave(>pmem_lock, flags);
> > > +   }
> > > +   err = virtqueue_kick(vpmem->req_vq);
> > > +   spin_unlock_irqrestore(>pmem_lock, flags);
> > > +
> > > +   if (!err) {
> > > +   err = -EIO;
> > > +   goto ret;
> > > +   }
> > > +   /* When host has 

Re: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing

2019-05-10 Thread Dan Williams
On Fri, May 10, 2019 at 9:32 AM Elliott, Robert (Servers)
 wrote:
>
>
>
> > -Original Message-
> > From: Linux-nvdimm  On Behalf Of
> > Larry Bassel
> > Sent: Thursday, May 09, 2019 11:06 AM
> > Subject: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD
> > sharing
> >
> > If enabled, sharing of FS/DAX PMDs will be attempted.
> >
> ...
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> ...
> >
> > +config MAY_SHARE_FSDAX_PMD
> > +def_bool y
> > +
>
> Is a config option really necessary - is there any reason to
> not choose to do this?

Agreed. Either the arch implementation supports it or it doesn't; I
don't see a need for any further configuration flexibility. It seems
ARCH_WANT_HUGE_PMD_SHARE should be renamed to ARCH_HAS_HUGE_PMD_SHARE
and the new option auto-enabled whenever the architecture selects it.


[PATCH] mm/huge_memory: Fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses

2019-05-09 Thread Dan Williams
Starting with commit c6f3c5ee40c1 "mm/huge_memory.c: fix modifying of
page protection by insert_pfn_pmd()" vmf_insert_pfn_pmd() internally
calls pmdp_set_access_flags(). That helper enforces a pmd-aligned
@address argument via a VM_BUG_ON() assertion.

Update the implementation to take a 'struct vm_fault' argument directly
and apply the address alignment fixup internally to fix crash signatures
like:

kernel BUG at arch/x86/mm/pgtable.c:515!
invalid opcode:  [#1] SMP NOPTI
CPU: 51 PID: 43713 Comm: java Tainted: G   OE 4.19.35 #1
[..]
RIP: 0010:pmdp_set_access_flags+0x48/0x50
[..]
Call Trace:
 vmf_insert_pfn_pmd+0x198/0x350
 dax_iomap_fault+0xe82/0x1190
 ext4_dax_huge_fault+0x103/0x1f0
 ? __switch_to_asm+0x40/0x70
 __handle_mm_fault+0x3f6/0x1370
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 handle_mm_fault+0xda/0x200
 __do_page_fault+0x249/0x4f0
 do_page_fault+0x32/0x110
 ? page_fault+0x8/0x30
 page_fault+0x1e/0x30

Cc: 
Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by 
insert_pfn_pmd()")
Reported-by: Piotr Balcer 
Tested-by: Yan Ma 
Cc: Aneesh Kumar K.V 
Cc: Chandan Rajendra 
Cc: Jan Kara 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Souptick Joarder 
Signed-off-by: Dan Williams 
---

 drivers/dax/device.c|6 ++
 fs/dax.c|6 ++
 include/linux/huge_mm.h |6 ++
 mm/huge_memory.c|   16 ++--
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e428468ab661..996d68ff992a 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -184,8 +184,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax 
*dev_dax,
 
*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, *pfn,
-   vmf->flags & FAULT_FLAG_WRITE);
+   return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -235,8 +234,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax 
*dev_dax,
 
*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, *pfn,
-   vmf->flags & FAULT_FLAG_WRITE);
+   return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/fs/dax.c b/fs/dax.c
index e5e54da1715f..83009875308c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1575,8 +1575,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
*vmf, pfn_t *pfnp,
}
 
trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
-   result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
-   write);
+   result = vmf_insert_pfn_pmd(vmf, pfn, write);
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
@@ -1686,8 +1685,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, 
unsigned int order)
ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
 #ifdef CONFIG_FS_DAX_PMD
else if (order == PMD_ORDER)
-   ret = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
-   pfn, true);
+   ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
else
ret = VM_FAULT_FALLBACK;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 381e872bfde0..7cd5c150c21d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -47,10 +47,8 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
int prot_numa);
-vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-   pmd_t *pmd, pfn_t pfn, bool write);
-vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
-   pud_t *pud, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 165ea46bf149..4310c6e9e5a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -793,11 +793,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
unsigned long addr,
pte_free(mm, pgtable);
 }
 
-vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vm
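
[ The quoted hunk is truncated here.  Sketched from the patch
  description above (not the verbatim diff), the new entry point masks
  the faulting address down to a PMD boundary internally, roughly: ]

	vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
	{
		struct vm_area_struct *vma = vmf->vma;
		/* apply the pmd alignment fixup internally */
		unsigned long addr = vmf->address & PMD_MASK;

		/* ... existing validation and pgtable setup elided ... */
		insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot,
				write, pgtable);
		return VM_FAULT_NOPAGE;
	}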

[PATCH v2 5/6] PCI/P2PDMA: Track pgmap references per resource, not globally

2019-05-07 Thread Dan Williams
In preparation for fixing a race between devm_memremap_pages_release()
and the final put of a page from the device-page-map, allocate a
percpu-ref per p2pdma resource mapping.

Cc: Logan Gunthorpe 
Cc: Bjorn Helgaas 
Cc: Christoph Hellwig 
Reviewed-by: Ira Weiny 
Signed-off-by: Dan Williams 
---
 drivers/pci/p2pdma.c |  124 +-
 1 file changed, 81 insertions(+), 43 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 595a534bd749..54d475569058 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,12 +20,16 @@
 #include 
 
 struct pci_p2pdma {
-   struct percpu_ref devmap_ref;
-   struct completion devmap_ref_done;
struct gen_pool *pool;
bool p2pmem_published;
 };
 
+struct p2pdma_pagemap {
+   struct dev_pagemap pgmap;
+   struct percpu_ref ref;
+   struct completion ref_done;
+};
+
 static ssize_t size_show(struct device *dev, struct device_attribute *attr,
 char *buf)
 {
@@ -74,41 +78,45 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref)
+{
+   return container_of(ref, struct p2pdma_pagemap, ref);
+}
+
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
-   struct pci_p2pdma *p2p =
-   container_of(ref, struct pci_p2pdma, devmap_ref);
+   struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
 
-   complete_all(&p2p->devmap_ref_done);
+   complete(&p2p_pgmap->ref_done);
 }
 
 static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
 {
-   /*
-* pci_p2pdma_add_resource() may be called multiple times
-* by a driver and may register the percpu_kill devm action multiple
-* times. We only want the first action to actually kill the
-* percpu_ref.
-*/
-   if (percpu_ref_is_dying(ref))
-   return;
-
percpu_ref_kill(ref);
 }
 
+static void pci_p2pdma_percpu_cleanup(void *ref)
+{
+   struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
+
+   wait_for_completion(&p2p_pgmap->ref_done);
+   percpu_ref_exit(&p2p_pgmap->ref);
+}
+
 static void pci_p2pdma_release(void *data)
 {
struct pci_dev *pdev = data;
+   struct pci_p2pdma *p2pdma = pdev->p2pdma;
 
-   if (!pdev->p2pdma)
+   if (!p2pdma)
return;
 
-   wait_for_completion(&pdev->p2pdma->devmap_ref_done);
-   percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+   /* Flush and disable pci_alloc_p2p_mem() */
+   pdev->p2pdma = NULL;
+   synchronize_rcu();
 
-   gen_pool_destroy(pdev->p2pdma->pool);
+   gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
-   pdev->p2pdma = NULL;
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -124,12 +132,6 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
 
-   init_completion(&p2p->devmap_ref_done);
-   error = percpu_ref_init(&p2p->devmap_ref,
-   pci_p2pdma_percpu_release, 0, GFP_KERNEL);
-   if (error)
-   goto out_pool_destroy;
-
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
if (error)
goto out_pool_destroy;
@@ -163,6 +165,7 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset)
 {
+   struct p2pdma_pagemap *p2p_pgmap;
struct dev_pagemap *pgmap;
void *addr;
int error;
@@ -185,14 +188,32 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
bar, size_t size,
return error;
}
 
-   pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
-   if (!pgmap)
+   p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
+   if (!p2p_pgmap)
return -ENOMEM;
 
+   init_completion(&p2p_pgmap->ref_done);
+   error = percpu_ref_init(&p2p_pgmap->ref,
+   pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+   if (error)
+   goto pgmap_free;
+
+   /*
+* FIXME: the percpu_ref_exit needs to be coordinated internal
+* to devm_memremap_pages_release(). Duplicate the same ordering
+* as other devm_memremap_pages() users for now.
+*/
+   error = devm_add_action(&pdev->dev, pci_p2pdma_percpu_cleanup,
+   &p2p_pgmap->ref);
+   if (error)
+   goto ref_cleanup;
+
+   pgmap = &p2p_pgmap->pgmap;
+
pgmap->res.start = pci_resource_start(pdev, bar) + offset;
pgmap->res.end = pgmap->res.start + size - 1;
pgmap->res.flags = pci_resource_flags(pdev, bar);
-   pgmap->ref = &pdev->p2pdma->devmap_ref;
+   pgmap->ref = &p2p_pgmap->

[PATCH v2 2/6] mm/devm_memremap_pages: Introduce devm_memunmap_pages

2019-05-07 Thread Dan Williams
Use the new devm_release_action() facility to allow
devm_memremap_pages_release() to be triggered manually.

Cc: Logan Gunthorpe 
Cc: Bjorn Helgaas 
Cc: Christoph Hellwig 
Reviewed-by: Ira Weiny 
Signed-off-by: Dan Williams 
---
 include/linux/memremap.h |6 ++
 kernel/memremap.c|6 ++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f0628660d541..7601ee314c4a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,7 @@ struct dev_pagemap {
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
+void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap);
 
@@ -118,6 +119,11 @@ static inline void *devm_memremap_pages(struct device *dev,
return ERR_PTR(-ENXIO);
 }
 
+static inline void devm_memunmap_pages(struct device *dev,
+   struct dev_pagemap *pgmap)
+{
+}
+
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..65afbacab44e 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -266,6 +266,12 @@ void *devm_memremap_pages(struct device *dev, struct 
dev_pagemap *pgmap)
 }
 EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
+void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap)
+{
+   devm_release_action(dev, devm_memremap_pages_release, pgmap);
+}
+EXPORT_SYMBOL_GPL(devm_memunmap_pages);
+
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
/* number of pfns from base where pfn_to_page() is valid */
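
A usage sketch of the pairing this enables (the driver structure and
field names below are hypothetical):

	void *addr;

	addr = devm_memremap_pages(dev, &foo->pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);
	foo->virt_addr = addr;

	/*
	 * ... later, e.g. in an unbind path that must tear the pages
	 * down before generic devres unwind would get to them:
	 */
	devm_memunmap_pages(dev, &foo->pgmap);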



[PATCH v2 6/6] mm/devm_memremap_pages: Fix final page put race

2019-05-07 Thread Dan Williams
Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
drops all the page references that were acquired at init, and then
immediately proceeds to unplug, via arch_remove_memory(), the backing pages
for the pagemap. If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.

As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.

Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback. Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.

Reported-by: Logan Gunthorpe 
Fixes: 41e94a851304 ("add devm_memremap_pages")
Cc: Bjorn Helgaas 
Cc: "Jérôme Glisse" 
Cc: Christoph Hellwig 
Reviewed-by: Ira Weiny 
Signed-off-by: Dan Williams 
---
 drivers/dax/device.c  |   13 +++--
 drivers/nvdimm/pmem.c |   17 +
 drivers/pci/p2pdma.c  |   17 +++--
 include/linux/memremap.h  |2 ++
 kernel/memremap.c |   17 -
 mm/hmm.c  |   14 +++---
 tools/testing/nvdimm/test/iomap.c |2 ++
 7 files changed, 38 insertions(+), 44 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e428468ab661..e3aa78dd1bb0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -27,9 +27,8 @@ static void dev_dax_percpu_release(struct percpu_ref *ref)
complete(&dev_dax->cmp);
 }
 
-static void dev_dax_percpu_exit(void *data)
+static void dev_dax_percpu_exit(struct percpu_ref *ref)
 {
-   struct percpu_ref *ref = data;
struct dev_dax *dev_dax = ref_to_dev_dax(ref);
 
dev_dbg(&dev_dax->dev, "%s\n", __func__);
@@ -468,18 +467,12 @@ int dev_dax_probe(struct device *dev)
if (rc)
return rc;
 
-   rc = devm_add_action_or_reset(dev, dev_dax_percpu_exit, &dev_dax->ref);
-   if (rc)
-   return rc;
-
dev_dax->pgmap.ref = &dev_dax->ref;
dev_dax->pgmap.kill = dev_dax_percpu_kill;
+   dev_dax->pgmap.cleanup = dev_dax_percpu_exit;
addr = devm_memremap_pages(dev, &dev_dax->pgmap);
-   if (IS_ERR(addr)) {
-   devm_remove_action(dev, dev_dax_percpu_exit, &dev_dax->ref);
-   percpu_ref_exit(&dev_dax->ref);
+   if (IS_ERR(addr))
return PTR_ERR(addr);
-   }
 
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0279eb1da3ef..1c9181712fa4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -304,11 +304,19 @@ static const struct attribute_group 
*pmem_attribute_groups[] = {
NULL,
 };
 
-static void pmem_release_queue(void *q)
+static void __pmem_release_queue(struct percpu_ref *ref)
 {
+   struct request_queue *q;
+
+   q = container_of(ref, typeof(*q), q_usage_counter);
blk_cleanup_queue(q);
 }
 
+static void pmem_release_queue(void *ref)
+{
+   __pmem_release_queue(ref);
+}
+
 static void pmem_freeze_queue(struct percpu_ref *ref)
 {
struct request_queue *q;
@@ -400,12 +408,10 @@ static int pmem_attach_disk(struct device *dev,
if (!q)
return -ENOMEM;
 
-   if (devm_add_action_or_reset(dev, pmem_release_queue, q))
-   return -ENOMEM;
-
pmem->pfn_flags = PFN_DEV;
pmem->pgmap.ref = &q->q_usage_counter;
pmem->pgmap.kill = pmem_freeze_queue;
+   pmem->pgmap.cleanup = __pmem_release_queue;
if (is_nd_pfn(dev)) {
if (setup_pagemap_fsdax(dev, &pmem->pgmap))
return -ENOMEM;
@@ -426,6 +432,9 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags |= PFN_MAP;
memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
} else {
+   if (devm_add_action_or_reset(dev, pmem_release_queue,
+   &q->q_usage_counter))
+   return -ENOMEM;
addr = devm_memremap(dev, pmem->phys_addr,
pmem->size, ARCH_MEMREMAP_PMEM);
memcpy(&bb_res, &nsio->res, sizeof(bb_res));
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 54d475569058..a7a66b958720 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -95,7 +95,7 @@ static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
percpu_ref_kill(ref);
 }
 
-static void pci_p2pdma_percpu_cleanup(void *ref)
+static void pci_p2pdma_percpu_cleanup(struct percpu_ref *ref)
 {
struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
 
@@ -198,16 +198,6 @@ int pci_p2p

Re: [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-05-07 Thread Dan Williams
On Tue, May 7, 2019 at 11:38 AM David Hildenbrand  wrote:
>
> Only memory to be added to the buddy and to be onlined/offlined by
> user space using memory block devices needs (and should have!) memory
> block devices.
>
> Factor out creation of memory block devices Create all devices after
> arch_add_memory() succeeded. We can later drop the want_memblock parameter,
> because it is now effectively stale.
>
> Only after memory block devices have been added, memory can be onlined
> by user space. This implies, that memory is not visible to user space at
> all before arch_add_memory() succeeded.

Nice!

>
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: David Hildenbrand 
> Cc: "mike.tra...@hpe.com" 
> Cc: Andrew Morton 
> Cc: Ingo Molnar 
> Cc: Andrew Banman 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Qian Cai 
> Cc: Wei Yang 
> Cc: Arun KS 
> Cc: Mathieu Malaterre 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/base/memory.c  | 70 ++
>  include/linux/memory.h |  2 +-
>  mm/memory_hotplug.c| 15 -
>  3 files changed, 53 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 6e0cb4fda179..862c202a18ca 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
> return 0;
>  }
>
> +static void unregister_memory(struct memory_block *memory)
> +{
> +   BUG_ON(memory->dev.bus != _subsys);

Given this should never happen and only a future kernel developer
might trip over it, do we really need to kill that developer's
machine? I.e. s/BUG/WARN/? I guess an argument can be made to move
such a change to a follow-on patch since you're just preserving
existing behavior, but I figure we might as well address these as the
code is refactored.

> +
> +   /* drop the ref. we got via find_memory_block() */
> +   put_device(>dev);
> +   device_unregister(>dev);
> +}
> +
>  /*
> - * need an interface for the VM to add new memory regions,
> - * but without onlining it.
> + * Create memory block devices for the given memory area. Start and size
> + * have to be aligned to memory block granularity. Memory block devices
> + * will be initialized as offline.
>   */
> -int hotplug_memory_register(int nid, struct mem_section *section)
> +int hotplug_memory_register(unsigned long start, unsigned long size)
>  {
> -   int ret = 0;
> +   unsigned long block_nr_pages = memory_block_size_bytes() >> 
> PAGE_SHIFT;
> +   unsigned long start_pfn = PFN_DOWN(start);
> +   unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
> +   unsigned long pfn;
> struct memory_block *mem;
> +   int ret = 0;
>
> -   mutex_lock(_sysfs_mutex);
> +   BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
> +   BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));

Perhaps:

if (WARN_ON(...))
return -EINVAL;

>
> -   mem = find_memory_block(section);
> -   if (mem) {
> -   mem->section_count++;
> -   put_device(>dev);
> -   } else {
> -   ret = init_memory_block(, section, MEM_OFFLINE);
> +   mutex_lock(_sysfs_mutex);
> +   for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
> +   mem = find_memory_block(__pfn_to_section(pfn));
> +   if (mem) {
> +   WARN_ON_ONCE(false);

?? Isn't that a nop?

> +   put_device(>dev);
> +   continue;
> +   }
> +   ret = init_memory_block(, __pfn_to_section(pfn),
> +   MEM_OFFLINE);
> if (ret)
> -   goto out;
> -   mem->section_count++;
> +   break;
> +   mem->section_count = memory_block_size_bytes() /
> +MIN_MEMORY_BLOCK_SIZE;
> +   }
> +   if (ret) {
> +   end_pfn = pfn;
> +   for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
> +   mem = find_memory_block(__pfn_to_section(pfn));
> +   if (!mem)
> +   continue;
> +   mem->section_count = 0;
> +   unregister_memory(mem);
> +   }
> }
> -
> -out:
> mutex_unlock(_sysfs_mutex);
> return ret;
>  }
>
> -static void
> -unregister_memory(struct memory_block *memory)
> -{
> -   BUG_ON(memory->dev.bus != _subsys);
> -
> -   /* drop the ref. we got via find_memory_block() */
> -   put_device(>dev);
> -   device_unregister(>dev);
> -}
> -
> -void unregister_memory_section(struct mem_section *section)
> +static int remove_memory_section(struct mem_section *section)
>  {
> struct memory_block *mem;
>
> diff --git a/include/linux/memory.h 

Re: [PATCH v2 5/8] mm/memory_hotplug: Drop MHP_MEMBLOCK_API

2019-05-07 Thread Dan Williams
On Tue, May 7, 2019 at 11:38 AM David Hildenbrand  wrote:
>
> No longer needed, the callers of arch_add_memory() can handle this
> manually.
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Pavel Tatashin 
> Cc: Wei Yang 
> Cc: Joonsoo Kim 
> Cc: Qian Cai 
> Cc: Arun KS 
> Cc: Mathieu Malaterre 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/memory_hotplug.h | 8 
>  mm/memory_hotplug.c| 9 +++--
>  2 files changed, 3 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 2d4de313926d..2f1f87e13baa 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -128,14 +128,6 @@ extern void arch_remove_memory(int nid, u64 start, u64 
> size,
>  extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
>unsigned long nr_pages, struct vmem_altmap 
> *altmap);
>
> -/*
> - * Do we want sysfs memblock files created. This will allow userspace to 
> online
> - * and offline memory explicitly. Lack of this bit means that the caller has 
> to
> - * call move_pfn_range_to_zone to finish the initialization.
> - */
> -
> -#define MHP_MEMBLOCK_API   (1<<0)
> -
>  /* reasonably generic interface to expand the physical pages */
>  extern int __add_pages(int nid, unsigned long start_pfn, unsigned long 
> nr_pages,
>struct mhp_restrictions *restrictions);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e1637c8a0723..107f72952347 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -250,7 +250,7 @@ void __init register_page_bootmem_info_node(struct 
> pglist_data *pgdat)
>  #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
>
>  static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
> -   struct vmem_altmap *altmap, bool want_memblock)
> +  struct vmem_altmap *altmap)
>  {
> int ret;
>
> @@ -293,8 +293,7 @@ int __ref __add_pages(int nid, unsigned long 
> phys_start_pfn,
> }
>
> for (i = start_sec; i <= end_sec; i++) {
> -   err = __add_section(nid, section_nr_to_pfn(i), altmap,
> -   restrictions->flags & MHP_MEMBLOCK_API);
> +   err = __add_section(nid, section_nr_to_pfn(i), altmap);
>
> /*
>  * EEXIST is finally dealt with by ioresource collision
> @@ -1066,9 +1065,7 @@ static int online_memory_block(struct memory_block 
> *mem, void *arg)
>   */
>  int __ref add_memory_resource(int nid, struct resource *res)
>  {
> -   struct mhp_restrictions restrictions = {
> -   .flags = MHP_MEMBLOCK_API,
> -   };
> +   struct mhp_restrictions restrictions = {};

With mhp_restrictions.flags no longer needed, can we drop
mhp_restrictions and just go back to a plain @altmap argument?
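
I.e., roughly (a sketch of the suggestion, not a patch):

	/* pass the altmap (NULL for regular hotplug) directly */
	int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
			struct vmem_altmap *altmap);

	/* ...and add_memory_resource() would then simply do: */
	ret = arch_add_memory(nid, start, size, NULL);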


[PATCH v8 05/12] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()

2019-05-06 Thread Dan Williams
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explicit pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmemmap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.

Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Reviewed-by: Pavel Tatashin 
Signed-off-by: Dan Williams 
---
 arch/x86/mm/init_64.c |4 ++-
 include/linux/mm.h|4 ++-
 mm/sparse-vmemmap.c   |   21 +++--
 mm/sparse.c   |   61 +++--
 4 files changed, 57 insertions(+), 33 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 20d14254b686..bb018d09d2dc 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1457,7 +1457,9 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
 {
int err;
 
-   if (boot_cpu_has(X86_FEATURE_PSE))
+   if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+   err = vmemmap_populate_basepages(start, end, node);
+   else if (boot_cpu_has(X86_FEATURE_PSE))
err = vmemmap_populate_hugepages(start, end, node, altmap);
else if (altmap) {
pr_err_once("%s: no cpu support for altmap allocations\n",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..5360a0e4051d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2748,8 +2748,8 @@ const char * arch_vma_name(struct vm_area_struct *vma);
 void print_vma_addr(char *prefix, unsigned long rip);
 
 void *sparse_buffer_alloc(unsigned long size);
-struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
-   struct vmem_altmap *altmap);
+struct page * __populate_section_memmap(unsigned long pfn,
+   unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 7fec05796796..200aef686722 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -245,19 +245,26 @@ int __meminit vmemmap_populate_basepages(unsigned long 
start,
return 0;
 }
 
-struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid,
-   struct vmem_altmap *altmap)
+struct page * __meminit __populate_section_memmap(unsigned long pfn,
+   unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
unsigned long start;
unsigned long end;
-   struct page *map;
 
-   map = pfn_to_page(pnum * PAGES_PER_SECTION);
-   start = (unsigned long)map;
-   end = (unsigned long)(map + PAGES_PER_SECTION);
+   /*
+* The minimum granularity of memmap extensions is
+* PAGES_PER_SUBSECTION as allocations are tracked in the
+* 'subsection_map' bitmap of the section.
+*/
+   end = ALIGN(pfn + nr_pages, PAGES_PER_SUBSECTION);
+   pfn &= PAGE_SUBSECTION_MASK;
+   nr_pages = end - pfn;
+
+   start = (unsigned long) pfn_to_page(pfn);
+   end = start + nr_pages * sizeof(struct page);
 
if (vmemmap_populate(start, end, nid, altmap))
return NULL;
 
-   return map;
+   return pfn_to_page(pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index ac47a48050c7..d613f108cf34 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -433,8 +433,8 @@ static unsigned long __init section_map_size(void)
return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
 }
 
-struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid,
-   struct vmem_altmap *altmap)
+struct page __init *__populate_section_memmap(unsigned long pfn,
+   unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
unsigned long size = section_map_size();
struct page *map = sparse_buffer_alloc(size);
@@ -515,10 +515,13 @@ static void __init sparse_init_nid(int nid, unsigned long 
pnum_begin,
}
sparse_buffer_init(map_count * section_map_size(), nid);
for_each_present_section_nr(pnum_begin, pnum) {
+   unsigned long pfn = section_nr_to_pfn(pnum);
+
if (pnum >= pnum_end)
break;
 
-   map = sparse_mem_map_populate(pnum, nid, NULL);
+   map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
+   nid, NULL);
if (!map) {
pr_err("%s: node[%d] memory map backing failed. Some 
memory will not be available.",
   __func__, nid);
@@ -618,17 +621,17 @@ void offline_mem_sections(unsi

[PATCH v8 01/12] mm/sparsemem: Introduce struct mem_section_usage

2019-05-06 Thread Dan Williams
Towards enabling memory hotplug to track partial population of a
section, introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to
house a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

The default SUBSECTION_SHIFT is chosen to keep the 'subsection_map' no
larger than a single 'unsigned long' on the major architectures.
Alternatively an architecture can define ARCH_SUBSECTION_SHIFT to
override the default PMD_SHIFT. Note that PowerPC needs to use
ARCH_SUBSECTION_SHIFT to workaround PMD_SHIFT being a non-constant
expression on PowerPC.

The primary motivation for this functionality is to support platforms
that mix "System RAM" and "Persistent Memory" within a single section,
or multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.

Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Logan Gunthorpe 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Signed-off-by: Dan Williams 
---
 arch/powerpc/include/asm/sparsemem.h |3 +
 include/linux/mmzone.h   |   48 +++-
 mm/memory_hotplug.c  |   18 
 mm/page_alloc.c  |2 -
 mm/sparse.c  |   81 +-
 5 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index 3192d454a733..1aa3c9303bf8 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -10,6 +10,9 @@
  */
 #define SECTION_SIZE_BITS   24
 
+/* Reflect the largest possible PMD-size as the subsection-size constant */
+#define ARCH_SUBSECTION_SHIFT 24
+
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 70394cabaf4e..ef8d878079f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1160,6 +1160,44 @@ static inline unsigned long section_nr_to_pfn(unsigned 
long sec)
 #define SECTION_ALIGN_UP(pfn)  (((pfn) + PAGES_PER_SECTION - 1) & 
PAGE_SECTION_MASK)
 #define SECTION_ALIGN_DOWN(pfn)((pfn) & PAGE_SECTION_MASK)
 
+/*
+ * SUBSECTION_SHIFT must be constant since it is used to declare
+ * subsection_map and related bitmaps without triggering the generation
+ * of variable-length arrays. The most natural size for a subsection is
+ * a PMD-page. For architectures that do not have a constant PMD-size
+ * ARCH_SUBSECTION_SHIFT can be set to a constant max size, or otherwise
+ * fallback to 2MB.
+ */
+#if defined(ARCH_SUBSECTION_SHIFT)
+#define SUBSECTION_SHIFT (ARCH_SUBSECTION_SHIFT)
+#elif defined(PMD_SHIFT)
+#define SUBSECTION_SHIFT (PMD_SHIFT)
+#else
+/*
+ * Memory hotplug enabled platforms avoid this default because they
+ * either define ARCH_SUBSECTION_SHIFT, or PMD_SHIFT is a constant, but
+ * this is kept as a backstop to allow compilation on
+ * !ARCH_ENABLE_MEMORY_HOTPLUG archs.
+ */
+#define SUBSECTION_SHIFT 21
+#endif
+
+#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
+#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
+#define PAGE_SUBSECTION_MASK ((~(PAGES_PER_SUBSECTION-1)))
+
+#if SUBSECTION_SHIFT > SECTION_SIZE_BITS
+#error Subsection size exceeds section size
+#else
+#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
+#endif
+
+struct mem_section_usage {
+   DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
+   /* See declaration of similar field in struct zone */
+   unsigned long pageblock_flags[0];
+};
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1177,8 +1215,7 @@ struct mem_section {
 */
unsigned long section_mem_map;
 
-   /* See declaration of similar field in struct zone */
-   unsigned long *pageblock_flags;
+   struct mem_section_usage *usage;
 #ifdef CONFIG_PAGE_EXTENSION
/*
 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -1209,6 +1246,11 @@ extern struct mem_section **mem_section;
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
+static inline unsigned long *section_to_usemap(struct mem_section *ms)
+{
+   return ms->usage->pageblock_flags;
+}
+
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1

Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

2019-05-06 Thread Dan Williams
On Mon, May 6, 2019 at 10:57 AM Dave Hansen  wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > + return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?
>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> >  {
> > +
> > + /*
> > +  * trigger BUG() is some memory is not offlined prior to calling this
> > +  * function
> > +  */
> > + if (try_remove_memory(nid, start, size))
> > + BUG();
> > +}
>
> Could we call this remove_offline_memory()?  That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.

Please WARN() instead of BUG() because failing to remove memory should
not be system fatal.
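
I.e. something along these lines (a sketch of the suggestion, not a
tested patch):

	void __remove_memory(int nid, u64 start, u64 size)
	{
		/*
		 * Warn (rather than panic) if some memory was not
		 * offlined prior to calling this function.
		 */
		WARN_ON(try_remove_memory(nid, start, size));
	}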


Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support

2019-05-02 Thread Dan Williams
On Thu, May 2, 2019 at 3:46 PM Pavel Tatashin  wrote:
>
> Hi Dan,
>
> How do you test these patches? Do you have any instructions?

Yes, I briefly mentioned this in the cover letter, but here is the
test I am using:

>
> I see for example that check_hotplug_memory_range() still enforces
> memory_block_size_bytes() alignment.
>
> Also, after removing check_hotplug_memory_range(), I tried to online
> 16M aligned DAX memory, and got the following panic:

Right, this functionality is currently strictly limited to the
devm_memremap_pages() case, where there are guarantees that the memory
will never be onlined. This is because the section size is entangled
with the memblock API. That said, I would have expected you to trigger
the warning in subsection_check() before getting this far into the
hotplug process.
>
> # echo online > /sys/devices/system/memory/memory7/state
> [  202.193132] WARNING: CPU: 2 PID: 351 at drivers/base/memory.c:207
> memory_block_action+0x110/0x178
> [  202.193391] Modules linked in:
> [  202.193698] CPU: 2 PID: 351 Comm: sh Not tainted
> 5.1.0-rc7_pt_devdax-00038-g865af4385544-dirty #9
> [  202.193909] Hardware name: linux,dummy-virt (DT)
> [  202.194122] pstate: 6005 (nZCv daif -PAN -UAO)
> [  202.194243] pc : memory_block_action+0x110/0x178
> [  202.194404] lr : memory_block_action+0x90/0x178
> [  202.194506] sp : 16763ca0
> [  202.194592] x29: 16763ca0 x28: 80016fd29b80
> [  202.194724] x27:  x26: 
> [  202.194838] x25: 15546000 x24: 001c
> [  202.194949] x23:  x22: 0004
> [  202.195058] x21: 001c x20: 0008
> [  202.195168] x19: 0007 x18: 
> [  202.195281] x17:  x16: 
> [  202.195393] x15:  x14: 
> [  202.195505] x13:  x12: 
> [  202.195614] x11:  x10: 
> [  202.195744] x9 :  x8 : 00018000
> [  202.195858] x7 : 0018 x6 : 15541930
> [  202.195966] x5 : 15541930 x4 : 0001
> [  202.196074] x3 : 0001 x2 : 
> [  202.196185] x1 : 0070 x0 : 
> [  202.196366] Call trace:
> [  202.196455]  memory_block_action+0x110/0x178
> [  202.196589]  memory_subsys_online+0x3c/0x80
> [  202.196681]  device_online+0x6c/0x90
> [  202.196761]  state_store+0x84/0x100
> [  202.196841]  dev_attr_store+0x18/0x28
> [  202.196927]  sysfs_kf_write+0x40/0x58
> [  202.197010]  kernfs_fop_write+0xcc/0x1d8
> [  202.197099]  __vfs_write+0x18/0x40
> [  202.197187]  vfs_write+0xa4/0x1b0
> [  202.197295]  ksys_write+0x64/0xd8
> [  202.197430]  __arm64_sys_write+0x18/0x20
> [  202.197521]  el0_svc_common.constprop.0+0x7c/0xe8
> [  202.197621]  el0_svc_handler+0x28/0x78
> [  202.197706]  el0_svc+0x8/0xc
> [  202.197828] ---[ end trace 57719823dda6d21e ]---
>
> Thank you,
> Pasha


Re: [PATCH] kernel: remove the unused device_private_entry_fault export

2019-04-29 Thread Dan Williams
On Mon, Apr 29, 2019 at 4:56 AM Christoph Hellwig  wrote:
>
> This export has been entirely unused since it was added more than 1 1/2
> years ago.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/memremap.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 4e59d29245f4..1490e63f69a9 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -45,7 +45,6 @@ vm_fault_t device_private_entry_fault(struct vm_area_struct 
> *vma,
>  */
> return devmem->page_fault(vma, addr, page, flags, pmdp);
>  }
> -EXPORT_SYMBOL(device_private_entry_fault);
>  #endif /* CONFIG_DEVICE_PRIVATE */

Looks good to me.

Reviewed-by: Dan Williams 


Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-25 Thread Dan Williams
On Thu, Apr 25, 2019 at 1:05 AM Du, Fan  wrote:
>
>
>
> >-Original Message-
> >From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> >Behalf Of Michal Hocko
> >Sent: Thursday, April 25, 2019 3:54 PM
> >To: Du, Fan 
> >Cc: a...@linux-foundation.org; Wu, Fengguang ;
> >Williams, Dan J ; Hansen, Dave
> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
> >; linux...@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >memory system
> >
> >On Thu 25-04-19 07:41:40, Du, Fan wrote:
> >>
> >>
> >> >-Original Message-
> >> >From: Michal Hocko [mailto:mho...@kernel.org]
> >> >Sent: Thursday, April 25, 2019 2:37 PM
> >> >To: Du, Fan 
> >> >Cc: a...@linux-foundation.org; Wu, Fengguang
> >;
> >> >Williams, Dan J ; Hansen, Dave
> >> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
> >> >; linux...@kvack.org;
> >linux-kernel@vger.kernel.org
> >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >> >memory system
> >> >
> >> >On Thu 25-04-19 09:21:30, Fan Du wrote:
> >> >[...]
> >> >> However PMEM has different characteristics from DRAM,
> >> >> the more reasonable or desirable fallback style would be:
> >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> >> >> When DRAM is exhausted, try PMEM then.
> >> >
> >> >Why, and who cares? NUMA is fundamentally about memory nodes with
> >> >different access characteristics, so why is PMEM special?
> >>
> >> Michal, thanks for your comments!
> >>
> >> The "different" lies in local versus remote access; usually the underlying
> >> memory is the same type, i.e. DRAM.
> >>
> >> By "special", PMEM usually has a much larger capacity per DIMM than DRAM,
> >> while having different read/write access latency than DRAM.
> >
> >You are describing NUMA in general here. Yes, access to different NUMA
> >nodes has different read/write latency, but that doesn't make PMEM
> >particularly special compared to regular DRAM.
>
> It is not the NUMA distance between the CPU and the PMEM node that makes
> PMEM different from DRAM. The difference lies in the physical layer: the
> access latency characteristics come from the media level.

No, there is no such thing as a "PMEM node". I've pushed back on this
broken concept in the past [1] [2]. Consider that PMEM could be as
fast as DRAM for technologies like NVDIMM-N or in emulation
environments. These attempts to look at persistence as an attribute of
performance are entirely missing the point that the system can have
multiple varied memory types and the platform firmware needs to
enumerate these performance properties in the HMAT on ACPI platforms.
Any scheme that only considers a binary DRAM and not-DRAM property is
immediately invalidated the moment the OS needs to consider a 3rd or
4th memory type, or a more varied connection topology.

[1]: 
https://lore.kernel.org/lkml/CAPcyv4heiUbZvP7Ewoy-Hy=-mprdjcjeusw+0rwdouhdjwe...@mail.gmail.com/

[2]: 
https://lore.kernel.org/lkml/capcyv4it1w7sddvbv24crcvhtlb3s1pvb5+sdm02uw4rbah...@mail.gmail.com/
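
To make that concrete: a fallback order has to be derived from enumerated
performance data rather than from a type flag. The helper below is
hypothetical -- a stand-in for whatever the platform reports via the HMAT,
not a real kernel interface -- and only illustrates that ranking nodes by a
measured property keeps working when a 3rd or 4th memory type shows up,
whereas a binary DRAM/PMEM check does not.

/* node_mem_read_latency() is hypothetical, standing in for HMAT data */
static unsigned int node_mem_read_latency(int nid);

static int cmp_nodes_by_read_latency(const void *a, const void *b)
{
	return node_mem_read_latency(*(const int *)a) -
	       node_mem_read_latency(*(const int *)b);
}

/* e.g.: sort(node_order, nr_nodes, sizeof(int), cmp_nodes_by_read_latency, NULL); */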


Re: [PATCH v6 06/12] mm/hotplug: Add mem-hotplug restrictions for remove_memory()

2019-04-24 Thread Dan Williams
On Tue, Apr 23, 2019 at 2:21 PM David Hildenbrand  wrote:
>
> On 17.04.19 20:39, Dan Williams wrote:
> > Teach the arch_remove_memory() path to consult the same 'struct
> > mhp_restrictions' context as was specified at arch_add_memory() time.
> >
> > No functional change, this is a preparation step for teaching
> > __remove_pages() about how and when to allow sub-section hot-remove, and
> > a cleanup for an unnecessary "is_dev_zone()" special case.
>
> I am not yet sure this is the right thing to do. When adding memory,
> we obviously have to specify the "how". When removing memory, we should
> usually be able to look that information up.

True, the implementation can just use find_memory_block(), so there is
no need to plumb this flag.
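
Something along these lines, purely as an illustration (not the actual
patch): derive the information at remove time from the memory block that
covers the range, instead of passing a flag down through
arch_remove_memory().

/*
 * Illustrative sketch: a range with no registered memory block was never
 * exposed to the onlining machinery, i.e. it is device managed.
 */
static bool range_is_device_managed(unsigned long start_pfn)
{
	struct mem_section *ms = __pfn_to_section(start_pfn);
	struct memory_block *mem = find_memory_block(ms);

	if (!mem)
		return true;

	/* find_memory_block() took a device reference, drop it */
	put_device(&mem->dev);
	return false;
}
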

>
>
> >  void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> > - unsigned long nr_pages, struct vmem_altmap *altmap)
> > + unsigned long nr_pages, struct mhp_restrictions *restrictions)
> >  {
> >   unsigned long i;
> > - unsigned long map_offset = 0;
> >   int sections_to_remove;
> > + unsigned long map_offset = 0;
> > + struct vmem_altmap *altmap = restrictions->altmap;
> >
> > - /* In the ZONE_DEVICE case device driver owns the memory region */
> > - if (is_dev_zone(zone)) {
> > - if (altmap)
> > - map_offset = vmem_altmap_offset(altmap);
> > - }
> > + if (altmap)
> > + map_offset = vmem_altmap_offset(altmap);
> >
>
> Why weren't we able to use this exact same hunk before? (after my
> resource deletion cleanup of course)
>
> IOW, do we really need struct mhp_restrictions here?

We don't need it. It was only for the memblock info that I added the
"restrictions" argument.

> After I factor out memory device handling into the caller of
> arch_remove_memory(), the next patch ("mm/sparsemem: Prepare for
> sub-section ranges") should also no longer need it. Or am I missing something?

That patch is still needed for the places where it adds the @nr_pages
argument, but the mhp_restrictions-related bits can be dropped. The
subsection_check() helper needs to be refactored a bit so that it does
not rely on mhp_restrictions.


<    8   9   10   11   12   13   14   15   16   17   >