Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-30 Thread Dan Williams
On Tue, Oct 30, 2018 at 1:18 AM Michal Hocko  wrote:
>
> On Mon 29-10-18 23:55:12, Dan Williams wrote:
> > On Mon, Oct 29, 2018 at 11:29 PM Michal Hocko  wrote:
[..]
> > That testing identified this initialization performance problem and
> > thankfully got it addressed in time for the current merge window.
>
> And I still cannot see a word about that in changelogs.

True, I would have preferred that commit 966cf44f637e "mm: defer
ZONE_DEVICE page initialization to the point where we init pgmap"
include some notes about the scaling advantages afforded by not
serializing memmap_init_zone() work. I think this information got
distributed across several patch series because removing the lock was
not sufficient by itself; Alex went further and also reworked the
physical-socket affinity of the nvdimm subsystem's async
initialization threads.

As the code gets refactored further, that will be a chance to add
commentary on the scaling expectations of the design.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-30 Thread Michal Hocko
On Mon 29-10-18 23:55:12, Dan Williams wrote:
> On Mon, Oct 29, 2018 at 11:29 PM Michal Hocko  wrote:
> >
> > On Mon 29-10-18 12:59:11, Alexander Duyck wrote:
> > > On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
> [..]
> > > The patches Andrew pushed addressed the immediate issue so that now
> > > systems with nvdimm/DAX memory can at least initialize quick enough
> > > that systemd doesn't refuse to mount the root file system due to a
> > > timeout.
> >
> > This is about the first time you actually mention that. I have re-read
> > the cover letter and all changelogs of the patches in this series. Unless I
> > have missed something, there is nothing about real users hitting issues
> > out there. nvdimm is still considered a toy because there is no real HW
> > users can play with.
> 
> Yes, you have missed something, because that's incorrect. There have been
> public articles about these parts sampling since May.
> 
> 
> https://www.anandtech.com/show/12828/intel-launches-optane-dimms-up-to-512gb-apache-pass-is-here

indeed!

> That testing identified this initialization performance problem and
> thankfully got it addressed in time for the current merge window.

And I still cannot see a word about that in changelogs.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-30 Thread Oscar Salvador
> Yes, the hotplug lock was part of the original issue. However that
> starts to drift into the area I believe Oscar was working on as a part
> of his patch set in encapsulating the move_pfn_range_to_zone and other
> calls that were contained in the hotplug lock into their own functions.


While reworking it for my patchset, I thought that we could move
move_pfn_range_to_zone() out of the hotplug lock.
But then I __think__ we would have to move init_currently_empty_zone() under
the span lock, as zone->zone_start_pfn is touched there.
At least that is what the zone locking rules say about it.
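The span-lock rule Oscar refers to can be illustrated with a seqcount-style writer/reader protocol: whoever resizes the zone span (as init_currently_empty_zone() does when it sets zone_start_pfn) must do so under the span lock so lockless readers can detect a concurrent resize and retry. The following is a simplified standalone model of that idea, not the kernel's actual zone_span_writelock() implementation; all names here are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified model of a zone whose span is protected by a seqcount:
 * an odd sequence value means a writer is mid-update. */
struct zone_model {
    atomic_uint seq;               /* odd = write in progress */
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
};

static void span_write_begin(struct zone_model *z) {
    atomic_fetch_add(&z->seq, 1); /* seq becomes odd */
}

static void span_write_end(struct zone_model *z) {
    atomic_fetch_add(&z->seq, 1); /* seq becomes even again */
}

/* Reader: snapshot the span, retrying if a writer was or became active. */
static unsigned long span_read_end_pfn(struct zone_model *z) {
    unsigned int s;
    unsigned long start, pages;
    do {
        s = atomic_load(&z->seq);
        start = z->zone_start_pfn;
        pages = z->spanned_pages;
    } while ((s & 1) || atomic_load(&z->seq) != s);
    return start + pages;
}

/* Writer: grow the span under the "span lock" (simplified growth logic). */
static void grow_zone(struct zone_model *z, unsigned long start, unsigned long n) {
    span_write_begin(z);
    if (z->spanned_pages == 0 || start < z->zone_start_pfn)
        z->zone_start_pfn = start;
    z->spanned_pages += n;
    span_write_end(z);
}
```

The point of the model is only the ordering discipline: moving init_currently_empty_zone() under the span lock means its zone_start_pfn write happens inside the write_begin/write_end window.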

Since I saw that Dan was still reworking his patchset to unify the HMM/devm
code, I just took one step back and went simpler [1].
The main reason for backing off was that I felt a bit demotivated due to
the lack of feedback,
and I did not want to interfere with either your work or Dan's.
Plus, I was also unsure about some other things, like whether it is OK to call
kasan_add_zero_shadow/kasan_remove_zero_shadow outside the lock.
So I decided to make fewer changes with regard to HMM/devm.

Unfortunately, I have not gotten a lot of feedback there yet.
Just some reviews from David and a confirmation that it fixes one of the
issues Jonathan reported [2].

>
> I was hoping to wait until after Dan's HMM patches and Oscar's changes
> had been sorted before I get into any further refactor of this specific
> code.


I plan to ping the series, but I wanted to give people more time
since we are in the merge window now.

[1] https://patchwork.kernel.org/cover/10642049/
[2] https://patchwork.kernel.org/patch/10642057/#22275173


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-30 Thread Dan Williams
On Mon, Oct 29, 2018 at 11:29 PM Michal Hocko  wrote:
>
> On Mon 29-10-18 12:59:11, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
[..]
> > The patches Andrew pushed addressed the immediate issue so that now
> > systems with nvdimm/DAX memory can at least initialize quick enough
> > that systemd doesn't refuse to mount the root file system due to a
> > timeout.
>
> This is about the first time you actually mention that. I have re-read
> the cover letter and all changelogs of the patches in this series. Unless I
> have missed something, there is nothing about real users hitting issues
> out there. nvdimm is still considered a toy because there is no real HW
> users can play with.

Yes, you have missed something, because that's incorrect. There have been
public articles about these parts sampling since May.


https://www.anandtech.com/show/12828/intel-launches-optane-dimms-up-to-512gb-apache-pass-is-here

That testing identified this initialization performance problem and
thankfully got it addressed in time for the current merge window.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-30 Thread Michal Hocko
On Mon 29-10-18 12:59:11, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
[...]

I will try to get to your other points later.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 89d2a2ab3fe6..048e4cc72fdf 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5474,8 +5474,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  * Honor reservation requested by the driver for this ZONE_DEVICE
> >  * memory
> >  */
> > -   if (altmap && start_pfn == altmap->base_pfn)
> > -   start_pfn += altmap->reserve;
> > +   if (pgmap && pgmap->get_memmap)
> > +   start_pfn = pgmap->get_memmap(pgmap, start_pfn);
> >  
> > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > /*
> > 
> > [...]
> 
The only reason why I hadn't bothered with these bits is that I was
actually trying to leave this generic, since I thought I had seen other
discussions about hotplug scenarios where memory may want to change
where the vmemmap is initialized other than just the case of
ZONE_DEVICE pages. So I was thinking at some point we may see an altmap
without the pgmap.

I wanted to abuse the altmap to allocate struct pages from the physical
range to be added. In that case I would abstract the
allocation/initialization part of pgmap into a more abstract type.
That could be done trivially without affecting end users of the
hotplug API.

[...]
> > Anyway we have gone into details while the primary problem here was that
> > the hotplug lock doesn't scale, AFAIR. And my question was why we cannot
> > pull move_pfn_range_to_zone out and what has to be done to achieve that.
> > That is a fundamental thing to address first. Then you can micro-optimize
> > on top.
> 
> Yes, the hotplug lock was part of the original issue. However that
> starts to drift into the area I believe Oscar was working on as a part
> of his patch set in encapsulating the move_pfn_range_to_zone and other
> calls that were contained in the hotplug lock into their own functions.

Well, I would really love to keep the external API as simple as
possible. That means we need arch_add_memory/add_pages and
move_pfn_range_to_zone to associate pages with a zone. The hotplug lock
should preferably be hidden from callers of those two, and ideally it
shouldn't be a global lock. We should be good with a range lock.
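The range-lock idea Michal floats here can be sketched as a try-lock registry of pfn ranges: two hotplug operations may proceed in parallel as long as their pfn ranges do not overlap. This is a hypothetical single-threaded illustration of the semantics only (a real version would need its own internal locking), not an existing kernel API; every name below is invented.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANGES 16

/* A locked pfn range [start, end). */
struct pfn_range { unsigned long start, end; bool used; };
static struct pfn_range locked[MAX_RANGES];

static bool overlaps(unsigned long s1, unsigned long e1,
                     unsigned long s2, unsigned long e2) {
    return s1 < e2 && s2 < e1;
}

/* Returns a slot id on success, -1 if the range conflicts or the table
 * is full. Non-overlapping hotplug operations both succeed, which is the
 * scalability win over a single global hotplug lock. */
static int range_trylock(unsigned long start, unsigned long end) {
    int free = -1;
    for (int i = 0; i < MAX_RANGES; i++) {
        if (locked[i].used) {
            if (overlaps(start, end, locked[i].start, locked[i].end))
                return -1;        /* conflicting hotplug in progress */
        } else if (free < 0) {
            free = i;
        }
    }
    if (free >= 0)
        locked[free] = (struct pfn_range){ start, end, true };
    return free;
}

static void range_unlock(int slot) {
    locked[slot].used = false;
}
```

With this shape, arch_add_memory/add_pages for disjoint ranges never serialize against each other, while overlapping requests still exclude one another.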
 
> The patches Andrew pushed addressed the immediate issue so that now
> systems with nvdimm/DAX memory can at least initialize quick enough
> that systemd doesn't refuse to mount the root file system due to a
> timeout.

This is about the first time you actually mention that. I have re-read
the cover letter and all changelogs of the patches in this series. Unless I
have missed something, there is nothing about real users hitting issues
out there. nvdimm is still considered a toy because there is no real HW
users can play with.

And hence my complaints about half-baked solutions rushed in just to fix
a performance regression. I can certainly understand that a pressing
problem might justify rushing things a bit, but this should always be
carefully justified.

> The next patch set I have refactors things to reduce code and
> allow us to reuse some of the hotplug code for the deferred page init, 
> https://lore.kernel.org/lkml/20181017235043.17213.92459.stgit@localhost.localdomain/
> . After that I was planning to work on dealing with the PageReserved
> flag and trying to get that sorted out.
> 
> I was hoping to wait until after Dan's HMM patches and Oscar's changes
> had been sorted before I get into any further refactor of this specific
> code.

Yes, there is quite a lot going on here. I would really appreciate it if we
all sat down and actually tried to come up with something robust rather than
hacking here and there. I haven't yet seen your follow-up series in full,
so maybe you are indeed heading in the correct direction.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
> On Mon 29-10-18 10:42:33, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 18:24 +0100, Michal Hocko wrote:
> > > On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> 
> [...]
> > > > So there end up being a few different issues with constructors. First
> > > > in my mind is that it means we have to initialize the region of memory
> > > > and cannot assume what the constructors are going to do for us. As a
> > > > result we will have to initialize the LRU pointers, and then overwrite
> > > > them with the pgmap and hmm_data.
> > > 
> > > Why we would do that? What does really prevent you from making a fully
> > > customized constructor?
> > 
> > It is more an argument of complexity. Do I just pass a single pointer
> > and write that value, or the LRU values in init, or do I have to pass a
> > function pointer, some abstracted data, and then call said function
> > pointer while passing the page and the abstracted data?
> 
> I thought you had said that pgmap is the current common denominator for
> zone device users. I really do not see what the problem is with doing
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 89d2a2ab3fe6..9105a4ed2c96 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5516,7 +5516,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  
>  not_early:
>   page = pfn_to_page(pfn);
> - __init_single_page(page, pfn, zone, nid);
> + if (pgmap && pgmap->init_page)
> + pgmap->init_page(page, pfn, zone, nid, pgmap);
> + else
> + __init_single_page(page, pfn, zone, nid);
>   if (context == MEMMAP_HOTPLUG)
>   SetPageReserved(page);
>  
> that would require replacing altmap with pgmap throughout the call chain.
> Altmap could then be renamed to something clearer.

So as I had pointed out earlier, doing per-page init is much slower than
initializing pages in bulk. Ideally I would like to see us separate the
memmap_init_zone function into two pieces, one section for handling
hotplug and another for the everything-else case. As it is, the fact that
you have to jump over a bunch of tests for the "not_early" case is
quite ugly in my opinion.

I could probably take your patch and test it. I'm suspecting this is
going to be a significant slow-down in general, as the indirect function
pointer stuff is probably going to come into play.

The "init_page" function in this case is going to end up being much
more complex than it really needs to be in this design as well, since I
have to get the altmap and figure out if the page was used for vmemmap
storage or is an actual DAX page. I might just see if I can add an
additional test for the pfn being before the end of the vmemmap in the
case of pgmap being present.
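The extra test described above can be sketched as a simple pfn classification: when a pgmap is present, pfns below the end of the altmap reservation hold the vmemmap itself and must be treated differently from real DAX pages. This is a hypothetical standalone model; the struct loosely mirrors the kernel's vmem_altmap fields but is not the actual implementation, and `classify_pfn` is an invented helper.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's vmem_altmap. */
struct vmem_altmap_model {
    unsigned long base_pfn;   /* first pfn of the hotplugged range */
    unsigned long reserve;    /* pfns reserved by the driver */
    unsigned long free;       /* pfns handed out for vmemmap storage */
};

enum page_kind { PAGE_VMEMMAP_STORAGE, PAGE_DAX };

/* A pfn is vmemmap storage if it falls before the end of the reserved +
 * allocated vmemmap region at the start of the range; everything after
 * that is a regular DAX data page. */
static enum page_kind classify_pfn(const struct vmem_altmap_model *altmap,
                                   unsigned long pfn) {
    unsigned long vmemmap_end = altmap->base_pfn + altmap->reserve + altmap->free;
    return pfn < vmemmap_end ? PAGE_VMEMMAP_STORAGE : PAGE_DAX;
}
```

A branch on this classification inside an init_page callback is the "additional test" the paragraph contemplates.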

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 89d2a2ab3fe6..048e4cc72fdf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5474,8 +5474,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>* Honor reservation requested by the driver for this ZONE_DEVICE
>* memory
>*/
> - if (altmap && start_pfn == altmap->base_pfn)
> - start_pfn += altmap->reserve;
> + if (pgmap && pgmap->get_memmap)
> + start_pfn = pgmap->get_memmap(pgmap, start_pfn);
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>   /*
> 
> [...]

The only reason why I hadn't bothered with these bits is that I was
actually trying to leave this generic, since I thought I had seen other
discussions about hotplug scenarios where memory may want to change
where the vmemmap is initialized other than just the case of
ZONE_DEVICE pages. So I was thinking at some point we may see an altmap
without the pgmap.

> > If I have to implement the code to verify the slowdown I will, but I
> > really feel like it is just going to be time wasted since we have seen
> > this in other spots within the kernel.
> 
> Please try to understand that I am not trying to force you to write some
> artificial benchmarks. All I really care about is that we have sane
> interfaces with reasonable performance, especially for one-off things
> in relatively slow paths. I fully recognize that ZONE_DEVICE begs for
> better integration, but really, try to go incremental and try to unify
> the code first and micro-optimize on top. Is that way too much to ask for?

No, but the patches I had already proposed were, I thought, heading in
that direction. I had unified memmap_init_zone,
memmap_init_zone_device, and the deferred page initialization into a
small set of functions, and all had improved performance as a result.

> Anyway we have gone into details while the primary problem here was that
> the hotplug lock doesn't scale, AFAIR. And my question was why we cannot
> pull move_pfn_range_to_zone and what has to be done to 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 10:42:33, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 18:24 +0100, Michal Hocko wrote:
> > On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
[...]
> > > So there end up being a few different issues with constructors. First
> > > in my mind is that it means we have to initialize the region of memory
> > > and cannot assume what the constructors are going to do for us. As a
> > > result we will have to initialize the LRU pointers, and then overwrite
> > > them with the pgmap and hmm_data.
> > 
> > Why we would do that? What does really prevent you from making a fully
> > customized constructor?
> 
> It is more an argument of complexity. Do I just pass a single pointer
> and write that value, or the LRU values in init, or do I have to pass a
> function pointer, some abstracted data, and then call said function
> pointer while passing the page and the abstracted data?

I thought you had said that pgmap is the current common denominator for
zone device users. I really do not see what the problem is with doing
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..9105a4ed2c96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5516,7 +5516,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 not_early:
page = pfn_to_page(pfn);
-   __init_single_page(page, pfn, zone, nid);
+   if (pgmap && pgmap->init_page)
+   pgmap->init_page(page, pfn, zone, nid, pgmap);
+   else
+   __init_single_page(page, pfn, zone, nid);
if (context == MEMMAP_HOTPLUG)
SetPageReserved(page);
 
that would require replacing altmap with pgmap throughout the call chain.
Altmap could then be renamed to something clearer.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..048e4cc72fdf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5474,8 +5474,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 * Honor reservation requested by the driver for this ZONE_DEVICE
 * memory
 */
-   if (altmap && start_pfn == altmap->base_pfn)
-   start_pfn += altmap->reserve;
+   if (pgmap && pgmap->get_memmap)
+   start_pfn = pgmap->get_memmap(pgmap, start_pfn);
 
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*

[...]

> If I have to implement the code to verify the slowdown I will, but I
> really feel like it is just going to be time wasted since we have seen
> this in other spots within the kernel.

Please try to understand that I am not trying to force you to write some
artificial benchmarks. All I really care about is that we have sane
interfaces with reasonable performance, especially for one-off things
in relatively slow paths. I fully recognize that ZONE_DEVICE begs for
better integration, but really, try to go incremental and try to unify
the code first and micro-optimize on top. Is that way too much to ask for?

Anyway we have gone into details while the primary problem here was that
the hotplug lock doesn't scale, AFAIR. And my question was why we cannot
pull move_pfn_range_to_zone and what has to be done to achieve that.
That is a fundamental thing to address first. Then you can micro-optimize
on top.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 10:34:22, Dan Williams wrote:
> On Mon, Oct 29, 2018 at 10:24 AM Michal Hocko  wrote:
> >
> > On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> > > On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> [..]
> > > > You are already doing per-page initialization so I fail to see a larger
> > > > unit to operate on.
> > >
> > > I have a patch that makes it so that we can work at a pageblock level
> > > since all of the variables with the exception of only the LRU and page
> > > address fields can be precomputed. Doing that is one of the ways I was
> > > able to reduce page init to 1/3 to 1/4 of the time it was taking
> > > otherwise in the case of deferred page init.
> >
> > You still have to call set_page_links for each page. But let's assume we
> > can do initialization in larger units. Nothing really prevents hiding
> > that in a constructor as well.
> 
> A constructor / indirect function call makes sense when there are
> multiple sub-classes of object initialization; on the table I only see
> 3 cases: typical hotplug, base ZONE_DEVICE, ZONE_DEVICE + HMM. I think
> we can look to move the HMM special casing out of line, then we're
> down to 2. Even at 3 cases we're better off open-coding than using a
> constructor for such a low number of sub-cases. I do not
> foresee more cases arriving, so I struggle to see what the constructor
> buys us in terms of code readability / maintainability.

I hadn't dreamed of ZONE_DEVICE and HMM a few years back. But anyway,
let me note that while I am not in love with callbacks, I find them to be a
useful abstraction. I can be convinced (by numbers) that special casing
inside the core hotplug code is really beneficial. But let's do that in
a single place.

All I am arguing against throughout this thread is
memmap_init_zone_device and the whole code duplication just because zone
device needs something special.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 18:24 +0100, Michal Hocko wrote:
> On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> > > On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
> 
> [...]
> > > > So for example the call "online_pages_range" doesn't invoke the
> > > > online_page_callback unless the first pfn in the range is marked as
> > > > reserved.
> > > 
> > > Yes and there is no fundamental reason to do that. We can easily check
> > > the online status without that.
> > > 
> > > > Another example Dan had pointed out was the saveable_page function in
> > > > kernel/power/snapshot.c.
> > > 
> > > Use pfn_to_online_page there.
> > 
> > Right. Which getting back to my original point, there is a bunch of
> > other changes that need to happen in order for us to make this work.
> 
> Which is a standard process of upstreaming stuff. My main point was that
> the reason why I've added SetPageReserved was as a safety net, because I
> knew that different code paths would back off on PageReserved while they
> wouldn't on a partially initialized structure. Now that you really want
> to prevent setting this bit for performance reasons, it makes sense
> to revisit that earlier decision and get rid of it rather than build on
> top of it and duplicate and special-case the low-level hotplug init
> code.

Just to clarify, not setting the reserved bit isn't so much a
performance optimization as it is a functional issue. My understanding
is that DAX pages currently cannot be used through KVM, as it
doesn't recognize them as being "online". So in going through and
cleaning up the checks for the reserved bit, this problem will likely
solve itself.

> > I am going to end up with yet another patch set to clean up all the
> > spots that are using PageReserved that shouldn't be before I can get
> > to the point of not setting that bit.
> 
> This would be highly appreciated. There are not that many PageReserved
> checks.

Yeah, not many, only about 3 dozen. It should only take me a week or
two to get them all sorted.
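The kind of cleanup being discussed swaps PageReserved tests for pfn_to_online_page, whose key property is that a pfn can have a valid struct page (pfn_valid) while its memory section is not yet online, which is exactly the window ZONE_DEVICE pages sit in. The following standalone model illustrates that distinction; the section size, table, and `_model` names are all invented for this sketch and are not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SECTION_SHIFT 8          /* 256 pfns per "section" in this model */
#define NR_SECTIONS   8

struct page_model { int dummy; };

static struct page_model memmap[NR_SECTIONS << SECTION_SHIFT];
static bool section_online[NR_SECTIONS];

/* Analogue of pfn_valid: the struct page exists. */
static bool pfn_valid_model(unsigned long pfn) {
    return pfn < (unsigned long)(NR_SECTIONS << SECTION_SHIFT);
}

/* Analogue of pfn_to_online_page: NULL unless the backing section is
 * online, so callers never touch struct pages that are valid but not yet
 * (or never) associated with an online zone. */
static struct page_model *pfn_to_online_page_model(unsigned long pfn) {
    if (!pfn_valid_model(pfn))
        return NULL;
    if (!section_online[pfn >> SECTION_SHIFT])
        return NULL;
    return &memmap[pfn];
}
```

A pfn walker using this helper simply skips NULL results instead of testing PageReserved, which is why the reserved-bit checks become removable once walkers are converted.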

> > > > > > > > > Regarding the post initialization required by devm_memremap_pages and
> > > > > > > > > potentially others. Can we update the altmap, which is already a way to
> > > > > > > > > get alternative struct pages, by a constructor which we could call
> > > > > > > > > from memmap_init_zone and do the post initialization? This would reduce
> > > > > > > > > the additional loop in the caller while it would still fit the overall
> > > > > > > > > design of the altmap, and the core hotplug doesn't have to know anything
> > > > > > > > > about DAX or whatever needs special treatment.
> > > > > > > > > 
> > > > > > > > > Does that make any sense?
> > > > > > > > 
> > > > > > > > I think the only thing that is currently using the altmap is the ZONE_DEVICE
> > > > > > > > memory init. Specifically I think it is only really used by the
> > > > > > > > devm_memremap_pages version of things, and then only under certain
> > > > > > > > circumstances. Also the HMM driver doesn't pass an altmap. What we would
> > > > > > > > really need is a non-ZONE_DEVICE user of the altmap to really justify
> > > > > > > > sticking with that as the preferred argument to pass.
> > > > > > > 
> > > > > > > I am not aware of any upstream HMM user, so I am not sure what the
> > > > > > > expectations are there. But I thought that ZONE_DEVICE users use altmap. If
> > > > > > > that is not generally true then we certainly have to think about a
> > > > > > > better interface.
> > > > > > 
> > > > > > I'm just basing my statement on the use of the move_pfn_range_to_zone call.
> > > > > > The only caller that is actually passing the altmap is devm_memremap_pages,
> > > > > > and if I understand things correctly that is only used when we want to store
> > > > > > the vmemmap on the same memory that we just hotplugged.
> > > > > 
> > > > > Yes, and that is what I've called an allocator callback earlier.
> > > > 
> > > > I am really not a fan of the callback approach. It just means we will
> > > > have to do everything multiple times in terms of initialization.
> > > 
> > > I do not follow. Could you elaborate?
> > 
> > So there end up being a few different issues with constructors. First
> > in my mind is that it means we have to initialize the region of memory
> > and cannot assume what the constructors are going to do for us. As a
> > result we will have to initialize the LRU pointers, and then overwrite
> > them with the pgmap and hmm_data.
> 
> Why would we do that? What really prevents you from making a fully
> customized constructor?

It is more an argument of complexity. Do I just pass a single pointer
and write that value, or the LRU 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Dan Williams
On Mon, Oct 29, 2018 at 10:24 AM Michal Hocko  wrote:
>
> On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
[..]
> > > You are already doing per-page initialization so I fail to see a larger
> > > unit to operate on.
> >
> > I have a patch that makes it so that we can work at a pageblock level
> > since all of the variables with the exception of only the LRU and page
> > address fields can be precomputed. Doing that is one of the ways I was
> > able to reduce page init to 1/3 to 1/4 of the time it was taking
> > otherwise in the case of deferred page init.
>
You still have to call set_page_links for each page. But let's assume we
can do initialization in larger units. Nothing really prevents hiding
that in a constructor as well.

A constructor / indirect function call makes sense when there are
multiple sub-classes of object initialization; on the table I only see
3 cases: typical hotplug, base ZONE_DEVICE, ZONE_DEVICE + HMM. I think
we can look to move the HMM special casing out of line, then we're
down to 2. Even at 3 cases we're better off open-coding than using a
constructor for such a low number of sub-cases. I do not
foresee more cases arriving, so I struggle to see what the constructor
buys us in terms of code readability / maintainability.
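The open-coding Dan argues for can be sketched as a direct branch in the init loop rather than an indirect constructor call per page, which also sidesteps the retpoline cost raised elsewhere in the thread. This is a hypothetical standalone model with invented names and fields, not the kernel's memmap_init_zone.

```c
#include <assert.h>
#include <stddef.h>

/* The two remaining init flavors once HMM special casing moves out of line. */
enum init_kind { INIT_HOTPLUG, INIT_ZONE_DEVICE };

struct page_stub {
    unsigned long flags;
    void *pgmap;          /* only set for ZONE_DEVICE pages */
};

#define PG_RESERVED (1UL << 0)

static void init_single_page(struct page_stub *p) {
    p->flags = 0;
    p->pgmap = NULL;
}

/* Open-coded: the special case is a predictable branch inside the loop,
 * not a per-page indirect call through a constructor pointer. */
static void memmap_init(struct page_stub *pages, int n,
                        enum init_kind kind, void *pgmap) {
    for (int i = 0; i < n; i++) {
        init_single_page(&pages[i]);
        if (kind == INIT_ZONE_DEVICE)
            pages[i].pgmap = pgmap;        /* ZONE_DEVICE post-init, inline */
        else
            pages[i].flags |= PG_RESERVED; /* hotplug pages start reserved */
    }
}
```

With only two cases, the branch predicts perfectly across the loop, which is the readability-plus-performance trade Dan is weighing against a constructor.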


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> > On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
[...]
> > > So for example the call "online_pages_range" doesn't invoke the
> > > online_page_callback unless the first pfn in the range is marked as
> > > reserved.
> > 
> > Yes and there is no fundamental reason to do that. We can easily check
> > the online status without that.
> > 
> > > Another example Dan had pointed out was the saveable_page function in
> > > kernel/power/snapshot.c.
> > 
> > Use pfn_to_online_page there.
> 
> Right. Which getting back to my original point, there is a bunch of
> other changes that need to happen in order for us to make this work.

Which is a standard process of upstreaming stuff. My main point was that
the reason why I've added SetPageReserved was as a safety net, because I
knew that different code paths would back off on PageReserved while they
wouldn't on a partially initialized structure. Now that you really want
to prevent setting this bit for performance reasons, it makes sense
to revisit that earlier decision and get rid of it rather than build on
top of it and duplicate and special-case the low-level hotplug init
code.

> I am going to end up with yet another patch set to clean up all the
> spots that are using PageReserved that shouldn't be before I can get
> to the point of not setting that bit.

This would be highly appreciated. There are not that many PageReserved
checks.

> > > > > > > > Regarding the post initialization required by devm_memremap_pages and
> > > > > > > > potentially others. Can we update the altmap, which is already a way to
> > > > > > > > get alternative struct pages, by a constructor which we could call
> > > > > > > > from memmap_init_zone and do the post initialization? This would reduce
> > > > > > > > the additional loop in the caller while it would still fit the overall
> > > > > > > > design of the altmap, and the core hotplug doesn't have to know anything
> > > > > > > > about DAX or whatever needs special treatment.
> > > > > > > > 
> > > > > > > > Does that make any sense?
> > > > > > > 
> > > > > > > I think the only thing that is currently using the altmap is the ZONE_DEVICE
> > > > > > > memory init. Specifically I think it is only really used by the
> > > > > > > devm_memremap_pages version of things, and then only under certain
> > > > > > > circumstances. Also the HMM driver doesn't pass an altmap. What we would
> > > > > > > really need is a non-ZONE_DEVICE user of the altmap to really justify
> > > > > > > sticking with that as the preferred argument to pass.
> > > > > > 
> > > > > > I am not aware of any upstream HMM user, so I am not sure what the
> > > > > > expectations are there. But I thought that ZONE_DEVICE users use altmap. If
> > > > > > that is not generally true then we certainly have to think about a
> > > > > > better interface.
> > > > > 
> > > > > I'm just basing my statement on the use of the move_pfn_range_to_zone call.
> > > > > The only caller that is actually passing the altmap is devm_memremap_pages,
> > > > > and if I understand things correctly that is only used when we want to store
> > > > > the vmemmap on the same memory that we just hotplugged.
> > > > 
> > > > Yes, and that is what I've called an allocator callback earlier.
> > > 
> > > I am really not a fan of the callback approach. It just means we will
> > > have to do everything multiple times in terms of initialization.
> > 
> > I do not follow. Could you elaborate?
> 
> So there end up being a few different issues with constructors. First
> in my mind is that it means we have to initialize the region of memory
> and cannot assume what the constructors are going to do for us. As a
> result we will have to initialize the LRU pointers, and then overwrite
> them with the pgmap and hmm_data.

Why would we do that? What really prevents you from making a fully
customized constructor?

> I am generally not a fan of that, as
> the next patch set I have gets rid of most of the redundancy we already
> had in the writes, where we were memsetting everything to 0, then
> writing the values, and then taking care of the reserved bit and
> pgmap/hmm_data fields. Dealing with the init serially like that is just
> slow.
> 
> Another complication is retpoline making function pointers just more
> expensive in general. I know in the networking area we have been
> dealing with this for a while, as even the DMA ops have been a pain
> there.

We use callbacks all over the kernel and in hot paths as well. This is
far from anything resembling a hot path AFAICT. It can be time consuming
because we have to touch each and every page but that is a fundamental
thing to do. We cannot simply batch the 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 15:12 +0100, Michal Hocko wrote:
> > > On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> > > > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > > > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > > > > [...]
> > > > > > > > My recollection was that we do clear the reserved bit in
> > > > > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. 
> > > > > > > > But then
> > > > > > > > we set the bit back right afterwards. This seems to be the case 
> > > > > > > > since
> > > > > > > > d0dc12e86b319 which reorganized the code. I have to study this 
> > > > > > > > some more
> > > > > > > > obviously.
> > > > > > > 
> > > > > > > so my recollection was wrong and d0dc12e86b319 hasn't really 
> > > > > > > changed
> > > > > > > much because __init_single_page wouldn't zero out the struct page 
> > > > > > > for
> > > > > > > the hotplug context. A comment in move_pfn_range_to_zone explains 
> > > > > > > that we
> > > > > > > want the reserved bit because pfn walkers already do see the pfn 
> > > > > > > range
> > > > > > > and the page is not fully associated with the zone until it is 
> > > > > > > onlined.
> > > > > > > 
> > > > > > > I am thinking that we might be overzealous here. With the full 
> > > > > > > state
> > > > > > > initialized we shouldn't actually care. pfn_to_online_page should 
> > > > > > > return
> > > > > > > NULL regardless of the reserved bit and normal pfn walkers 
> > > > > > > shouldn't
> > > > > > > touch pages they do not recognize and a plain page with ref. 
> > > > > > > count 1
> > > > > > > doesn't tell much to anybody. So I _suspect_ that we can simply 
> > > > > > > drop the
> > > > > > > reserved bit setting here.
> > > > > > 
> > > > > > So this has me a bit hesitant to want to just drop the bit 
> > > > > > entirely. If
> > > > > > nothing else I think I may want to make that a patch onto itself so 
> > > > > > that if
> > > > > > we aren't going to set it we just drop it there. That way if it 
> > > > > > does cause
> > > > > > issues we can bisect it to that patch and pinpoint the cause.
> > > > > 
> > > > > Yes a patch on its own makes sense for bisectability.
> > > > 
> > > > For now I think I am going to back off of this. There is a bunch of 
> > > > other
> > > > changes that need to happen in order for us to make this work. As far 
> > > > as I
> > > > can tell there are several places that are relying on this reserved bit.
> > > 
> > > Please be more specific. Unless I misremember, I have added this
> > > PageReserved just to be sure (f1dd2cd13c4bb) because pages were just
> > > half initialized back then. I am not aware anybody is depending on this.
> > > If there is somebody then be explicit about that. The last thing I want
> > > to see is to preserve a cargo cult and build a design around it.
> > 
> > It is mostly just a matter of going through and auditing all the
> > places that are using PageReserved to identify pages that they aren't
> > supposed to touch for whatever reason.
> > 
> > From what I can tell the issue appears to be the fact that the reserved
> > bit is used to identify if a region of memory is "online" or "offline".
> 
> No, this is wrong. pfn_to_online_page does that. PageReserved has
> nothing to do with online vs. offline status. It merely says that you
> shouldn't touch the page unless you own it. Sure we might have a few
> places relying on it but nothing should really depend on the reserved bit
> check from the MM hotplug POV.
> 
> > So for example the call "online_pages_range" doesn't invoke the
> > online_page_callback unless the first pfn in the range is marked as
> > reserved.
> 
> Yes and there is no fundamental reason to do that. We can easily check
> the online status without that.
> 
> > Another example Dan had pointed out was the saveable_page function in
> > kernel/power/snapshot.c.
> 
> Use pfn_to_online_page there.

Right. Which getting back to my original point, there is a bunch of
other changes that need to happen in order for us to make this work. I
am going to end up with yet another patch set to clean up all the spots
that are using PageReserved that shouldn't be before I can get to the
point of not setting that bit.
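The audit described above largely amounts to replacing PageReserved tests with pfn_to_online_page() in pfn walkers. A minimal userspace model of that pattern (sizes and names are illustrative, not the kernel's): a walker like saveable_page() skips any pfn whose backing section is not online, with no reliance on a per-page reserved bit.

```c
#include <assert.h>
#include <stddef.h>

#define SECTION_SHIFT 2		/* tiny sections of 4 pfns for the model */
#define NR_SECTIONS   4

struct page { unsigned long pfn; };

/* Per-section online flag, standing in for SECTION_IS_ONLINE. */
static int section_online[NR_SECTIONS];
static struct page memmap[NR_SECTIONS << SECTION_SHIFT];

/* Model of pfn_to_online_page(): NULL unless the backing section is
 * online, independent of any per-page reserved bit. */
static struct page *pfn_to_online_page(unsigned long pfn)
{
	unsigned long section = pfn >> SECTION_SHIFT;

	if (section >= NR_SECTIONS || !section_online[section])
		return NULL;
	return &memmap[pfn];
}

/* A walker in the style of saveable_page(): only touch online pages. */
static unsigned long count_saveable(unsigned long start, unsigned long end)
{
	unsigned long pfn, n = 0;

	for (pfn = start; pfn < end; pfn++)
		if (pfn_to_online_page(pfn))
			n++;
	return n;
}
```

The real kernel check works per memory section, which is why it can answer the online/offline question without any per-page state at all.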

> > > > > > > Regarding the post initialization required by devm_memremap_pages 
> > > > > > > and
> > > > > > > potentially others. Can we update the altmap which is already a 
> > > > > > > way how
> > > > > > > to get alternative struct pages by a constructor which we could 
> > > > > > > call
> > > > > > > from memmap_init_zone and do the post initialization? This would 
> > > > > > > reduce
> > > > > > > the additional loop in the caller while it would still fit the 
> > > > > > > overall
> > > > > > > design of the altmap and the core hotplug doesn't have to know anything
> > > > > > > about DAX or whatever needs a special treatment.

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 15:12 +0100, Michal Hocko wrote:
> > On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> > > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > My recollection was that we do clear the reserved bit in
> > > > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. 
> > > > > > > But then
> > > > > > > we set the bit back right afterwards. This seems to be the case 
> > > > > > > since
> > > > > > > d0dc12e86b319 which reorganized the code. I have to study this 
> > > > > > > some more
> > > > > > > obviously.
> > > > > > 
> > > > > > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > > > > > much because __init_single_page wouldn't zero out the struct page 
> > > > > > for
> > > > > > > the hotplug context. A comment in move_pfn_range_to_zone explains 
> > > > > > that we
> > > > > > want the reserved bit because pfn walkers already do see the pfn 
> > > > > > range
> > > > > > and the page is not fully associated with the zone until it is 
> > > > > > onlined.
> > > > > > 
> > > > > > I am thinking that we might be overzealous here. With the full state
> > > > > > initialized we shouldn't actually care. pfn_to_online_page should 
> > > > > > return
> > > > > > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > > > > > touch pages they do not recognize and a plain page with ref. count 1
> > > > > > doesn't tell much to anybody. So I _suspect_ that we can simply 
> > > > > > drop the
> > > > > > reserved bit setting here.
> > > > > 
> > > > > So this has me a bit hesitant to want to just drop the bit entirely. 
> > > > > If
> > > > > nothing else I think I may want to make that a patch onto itself so 
> > > > > that if
> > > > > we aren't going to set it we just drop it there. That way if it does 
> > > > > cause
> > > > > issues we can bisect it to that patch and pinpoint the cause.
> > > > 
> > > > Yes a patch on its own makes sense for bisectability.
> > > 
> > > For now I think I am going to back off of this. There is a bunch of other
> > > changes that need to happen in order for us to make this work. As far as I
> > > can tell there are several places that are relying on this reserved bit.
> > 
> > Please be more specific. Unless I misremember, I have added this
> > PageReserved just to be sure (f1dd2cd13c4bb) because pages were just
> > half initialized back then. I am not aware anybody is depending on this.
> > If there is somebody then be explicit about that. The last thing I want
> > to see is to preserve a cargo cult and build a design around it.
> 
> It is mostly just a matter of going through and auditing all the
> places that are using PageReserved to identify pages that they aren't
> supposed to touch for whatever reason.
>
> From what I can tell the issue appears to be the fact that the reserved
> bit is used to identify if a region of memory is "online" or "offline".

No, this is wrong. pfn_to_online_page does that. PageReserved has
nothing to do with online vs. offline status. It merely says that you
shouldn't touch the page unless you own it. Sure we might have a few
places relying on it but nothing should really depend on the reserved bit
check from the MM hotplug POV.

> So for example the call "online_pages_range" doesn't invoke the
> online_page_callback unless the first pfn in the range is marked as
> reserved.

Yes and there is no fundamental reason to do that. We can easily check
the online status without that.

> Another example Dan had pointed out was the saveable_page function in
> kernel/power/snapshot.c.

Use pfn_to_online_page there.

> > > > > > Regarding the post initialization required by devm_memremap_pages 
> > > > > > and
> > > > > > potentially others. Can we update the altmap which is already a way 
> > > > > > how
> > > > > > to get alternative struct pages by a constructor which we could call
> > > > > > from memmap_init_zone and do the post initialization? This would 
> > > > > > reduce
> > > > > > the additional loop in the caller while it would still fit the 
> > > > > > overall
> > > > > > design of the altmap and the core hotplug doesn't have to know 
> > > > > > anything
> > > > > > about DAX or whatever needs a special treatment.
> > > > > > 
> > > > > > Does that make any sense?
> > > > > 
> > > > > I think the only thing that is currently using the altmap is the 
> > > > > ZONE_DEVICE
> > > > > memory init. Specifically I think it is only really used by the
> > > > > devm_memremap_pages version of things, and then only under certain
> > > > > circumstances. Also the HMM driver doesn't pass an altmap. What we 
> > > > > would
> > > really need is a non-ZONE_DEVICE user of the altmap to really justify
> > > sticking with that as the preferred argument to pass.

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 15:12 +0100, Michal Hocko wrote:
> On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > > [...]
> > > > > > My recollection was that we do clear the reserved bit in
> > > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. But 
> > > > > > then
> > > > > > we set the bit back right afterwards. This seems to be the case 
> > > > > > since
> > > > > > d0dc12e86b319 which reorganized the code. I have to study this some 
> > > > > > more
> > > > > > obviously.
> > > > > 
> > > > > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > > > > much because __init_single_page wouldn't zero out the struct page for
> > > > > the hotplug context. A comment in move_pfn_range_to_zone explains that 
> > > > > we
> > > > > want the reserved bit because pfn walkers already do see the pfn range
> > > > > and the page is not fully associated with the zone until it is 
> > > > > onlined.
> > > > > 
> > > > > I am thinking that we might be overzealous here. With the full state
> > > > > initialized we shouldn't actually care. pfn_to_online_page should 
> > > > > return
> > > > > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > > > > touch pages they do not recognize and a plain page with ref. count 1
> > > > > doesn't tell much to anybody. So I _suspect_ that we can simply drop 
> > > > > the
> > > > > reserved bit setting here.
> > > > 
> > > > So this has me a bit hesitant to want to just drop the bit entirely. If
> > > > nothing else I think I may want to make that a patch onto itself so that 
> > > > if
> > > > we aren't going to set it we just drop it there. That way if it does 
> > > > cause
> > > > issues we can bisect it to that patch and pinpoint the cause.
> > > 
> > > Yes a patch on its own makes sense for bisectability.
> > 
> > For now I think I am going to back off of this. There is a bunch of other
> > changes that need to happen in order for us to make this work. As far as I
> > can tell there are several places that are relying on this reserved bit.
> 
> Please be more specific. Unless I misremember, I have added this
> PageReserved just to be sure (f1dd2cd13c4bb) because pages were just
> half initialized back then. I am not aware anybody is depending on this.
> If there is somebody then be explicit about that. The last thing I want
> to see is to preserve a cargo cult and build a design around it.

It is mostly just a matter of going through and auditing all the
places that are using PageReserved to identify pages that they aren't
supposed to touch for whatever reason.

From what I can tell the issue appears to be the fact that the reserved
bit is used to identify if a region of memory is "online" or "offline".
So for example the call "online_pages_range" doesn't invoke the
online_page_callback unless the first pfn in the range is marked as
reserved.

Another example Dan had pointed out was the saveable_page function in
kernel/power/snapshot.c.

> > > > > Regarding the post initialization required by devm_memremap_pages and
> > > > > potentially others. Can we update the altmap which is already a way 
> > > > > how
> > > > > to get alternative struct pages by a constructor which we could call
> > > > > from memmap_init_zone and do the post initialization? This would 
> > > > > reduce
> > > > > the additional loop in the caller while it would still fit the overall
> > > > > design of the altmap and the core hotplug doesn't have to know 
> > > > > anything
> > > > > about DAX or whatever needs a special treatment.
> > > > > 
> > > > > Does that make any sense?
> > > > 
> > > > I think the only thing that is currently using the altmap is the 
> > > > ZONE_DEVICE
> > > > memory init. Specifically I think it is only really used by the
> > > > devm_memremap_pages version of things, and then only under certain
> > > > circumstances. Also the HMM driver doesn't pass an altmap. What we would
> > > really need is a non-ZONE_DEVICE user of the altmap to really justify
> > > > sticking with that as the preferred argument to pass.
> > > 
> > I am not aware of any upstream HMM user so I am not sure what the
> > expectations are there. But I thought that ZONE_DEVICE users use altmap. If
> > > that is not generally true then we certainly have to think about a
> > > better interface.
> > 
> > I'm just basing my statement on the use of the move_pfn_range_to_zone call.
> > The only caller that is actually passing the altmap is devm_memremap_pages
> > and if I understand things correctly that is only used when we want to store
> > the vmemmap on the same memory that we just hotplugged.
> 
> Yes, and that is what I've called an allocator callback earlier.

I am really not a fan of the callback approach. It just means we will
have to do everything multiple times in terms of initialization.

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 08:49:46, Dan Williams wrote:
> On Wed, Oct 17, 2018 at 8:02 AM Alexander Duyck
>  wrote:
> >
> > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > >> On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > >>> On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > >>> [...]
> >  My recollection was that we do clear the reserved bit in
> >  move_pfn_range_to_zone and we indeed do in __init_single_page. But then
> >  we set the bit back right afterwards. This seems to be the case since
> >  d0dc12e86b319 which reorganized the code. I have to study this some 
> >  more
> >  obviously.
> > >>>
> > >>> so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > >>> much because __init_single_page wouldn't zero out the struct page for
> > >>> the hotplug context. A comment in move_pfn_range_to_zone explains that we
> > >>> want the reserved bit because pfn walkers already do see the pfn range
> > >>> and the page is not fully associated with the zone until it is onlined.
> > >>>
> > >>> I am thinking that we might be overzealous here. With the full state
> > >>> initialized we shouldn't actually care. pfn_to_online_page should return
> > >>> NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > >>> touch pages they do not recognize and a plain page with ref. count 1
> > >>> doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> > >>> reserved bit setting here.
> > >>
> > >> So this has me a bit hesitant to want to just drop the bit entirely. If
> > >> nothing else I think I may want to make that a patch onto itself so that 
> > >> if
> > >> we aren't going to set it we just drop it there. That way if it does 
> > >> cause
> > >> issues we can bisect it to that patch and pinpoint the cause.
> > >
> > > Yes a patch on its own makes sense for bisectability.
> >
> > For now I think I am going to back off of this. There is a bunch of
> > other changes that need to happen in order for us to make this work. As
> > far as I can tell there are several places that are relying on this
> > reserved bit.
> 
> When David Hildenbrand and I looked it was only the hibernation code
> that we thought needed changing.

More details please?

-- 
Michal Hocko
SUSE Labs
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Dan Williams
On Wed, Oct 17, 2018 at 8:02 AM Alexander Duyck
 wrote:
>
> On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> >> On 10/11/2018 1:55 AM, Michal Hocko wrote:
> >>> On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> >>> [...]
>  My recollection was that we do clear the reserved bit in
>  move_pfn_range_to_zone and we indeed do in __init_single_page. But then
>  we set the bit back right afterwards. This seems to be the case since
>  d0dc12e86b319 which reorganized the code. I have to study this some more
>  obviously.
> >>>
> >>> so my recollection was wrong and d0dc12e86b319 hasn't really changed
> >>> much because __init_single_page wouldn't zero out the struct page for
> >>> the hotplug context. A comment in move_pfn_range_to_zone explains that we
> >>> want the reserved bit because pfn walkers already do see the pfn range
> >>> and the page is not fully associated with the zone until it is onlined.
> >>>
> >>> I am thinking that we might be overzealous here. With the full state
> >>> initialized we shouldn't actually care. pfn_to_online_page should return
> >>> NULL regardless of the reserved bit and normal pfn walkers shouldn't
> >>> touch pages they do not recognize and a plain page with ref. count 1
> >>> doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> >>> reserved bit setting here.
> >>
> >> So this has me a bit hesitant to want to just drop the bit entirely. If
> >> nothing else I think I may want to make that a patch onto itself so that if
> >> we aren't going to set it we just drop it there. That way if it does cause
> >> issues we can bisect it to that patch and pinpoint the cause.
> >
> > Yes a patch on its own makes sense for bisectability.
>
> For now I think I am going to back off of this. There is a bunch of
> other changes that need to happen in order for us to make this work. As
> far as I can tell there are several places that are relying on this
> reserved bit.

When David Hildenbrand and I looked it was only the hibernation code
that we thought needed changing. We either need to audit the removal
or go back to adding a special case hack for kvm because this is a
blocking issue for them.

What do you see beyond the hibernation change?


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > [...]
> > > > > My recollection was that we do clear the reserved bit in
> > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. But 
> > > > > then
> > > > > we set the bit back right afterwards. This seems to be the case since
> > > > > d0dc12e86b319 which reorganized the code. I have to study this some 
> > > > > more
> > > > > obviously.
> > > > 
> > > > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > > > much because __init_single_page wouldn't zero out the struct page for
> > > > the hotplug context. A comment in move_pfn_range_to_zone explains that we
> > > > want the reserved bit because pfn walkers already do see the pfn range
> > > > and the page is not fully associated with the zone until it is onlined.
> > > > 
> > > > I am thinking that we might be overzealous here. With the full state
> > > > initialized we shouldn't actually care. pfn_to_online_page should return
> > > > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > > > touch pages they do not recognize and a plain page with ref. count 1
> > > > doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> > > > reserved bit setting here.
> > > 
> > > So this has me a bit hesitant to want to just drop the bit entirely. If
> > > nothing else I think I may want to make that a patch onto itself so that if
> > > we aren't going to set it we just drop it there. That way if it does cause
> > > issues we can bisect it to that patch and pinpoint the cause.
> > 
> > Yes a patch on its own makes sense for bisectability.
> 
> For now I think I am going to back off of this. There is a bunch of other
> changes that need to happen in order for us to make this work. As far as I
> can tell there are several places that are relying on this reserved bit.

Please be more specific. Unless I misremember, I have added this
PageReserved just to be sure (f1dd2cd13c4bb) because pages were just
half initialized back then. I am not aware anybody is depending on this.
If there is somebody then be explicit about that. The last thing I want
to see is to preserve a cargo cult and build a design around it.

> > > > Regarding the post initialization required by devm_memremap_pages and
> > > > potentially others. Can we update the altmap which is already a way how
> > > > to get alternative struct pages by a constructor which we could call
> > > > from memmap_init_zone and do the post initialization? This would reduce
> > > > the additional loop in the caller while it would still fit the overall
> > > > design of the altmap and the core hotplug doesn't have to know anything
> > > > about DAX or whatever needs a special treatment.
> > > > 
> > > > Does that make any sense?
> > > 
> > > I think the only thing that is currently using the altmap is the 
> > > ZONE_DEVICE
> > > memory init. Specifically I think it is only really used by the
> > > devm_memremap_pages version of things, and then only under certain
> > > circumstances. Also the HMM driver doesn't pass an altmap. What we would
> > > really need is a non-ZONE_DEVICE user of the altmap to really justify
> > > sticking with that as the preferred argument to pass.
> > 
> > I am not aware of any upstream HMM user so I am not sure what the
> > expectations are there. But I thought that ZONE_DEVICE users use altmap. If
> > that is not generally true then we certainly have to think about a
> > better interface.
> 
> I'm just basing my statement on the use of the move_pfn_range_to_zone call.
> The only caller that is actually passing the altmap is devm_memremap_pages
> and if I understand things correctly that is only used when we want to store
> the vmemmap on the same memory that we just hotplugged.

Yes, and that is what I've called an allocator callback earlier.

> That is why it made more sense to me to just create a ZONE_DEVICE specific
> function for handling the page initialization because the one value I do
> have to pass is the dev_pagemap in both HMM and memremap case, and that has
> the altmap already embedded inside of it.

And I have argued that this is a wrong approach to the problem. If you
need a very specific struct page initialization then create an init
(constructor) callback.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-17 Thread Alexander Duyck

On 10/17/2018 12:52 AM, Michal Hocko wrote:

On Thu 11-10-18 10:38:39, Alexander Duyck wrote:

On 10/11/2018 1:55 AM, Michal Hocko wrote:

On Wed 10-10-18 20:52:42, Michal Hocko wrote:
[...]

My recollection was that we do clear the reserved bit in
move_pfn_range_to_zone and we indeed do in __init_single_page. But then
we set the bit back right afterwards. This seems to be the case since
d0dc12e86b319 which reorganized the code. I have to study this some more
obviously.


so my recollection was wrong and d0dc12e86b319 hasn't really changed
much because __init_single_page wouldn't zero out the struct page for
the hotplug context. A comment in move_pfn_range_to_zone explains that we
want the reserved bit because pfn walkers already do see the pfn range
and the page is not fully associated with the zone until it is onlined.

I am thinking that we might be overzealous here. With the full state
initialized we shouldn't actually care. pfn_to_online_page should return
NULL regardless of the reserved bit and normal pfn walkers shouldn't
touch pages they do not recognize and a plain page with ref. count 1
doesn't tell much to anybody. So I _suspect_ that we can simply drop the
reserved bit setting here.


So this has me a bit hesitant to want to just drop the bit entirely. If
nothing else I think I may want to make that a patch onto itself so that if
we aren't going to set it we just drop it there. That way if it does cause
issues we can bisect it to that patch and pinpoint the cause.


Yes a patch on its own makes sense for bisectability.


For now I think I am going to back off of this. There is a bunch of 
other changes that need to happen in order for us to make this work. As 
far as I can tell there are several places that are relying on this 
reserved bit.



Regarding the post initialization required by devm_memremap_pages and
potentially others. Can we update the altmap which is already a way how
to get alternative struct pages by a constructor which we could call
from memmap_init_zone and do the post initialization? This would reduce
the additional loop in the caller while it would still fit the overall
design of the altmap and the core hotplug doesn't have to know anything
about DAX or whatever needs a special treatment.

Does that make any sense?


I think the only thing that is currently using the altmap is the ZONE_DEVICE
memory init. Specifically I think it is only really used by the
devm_memremap_pages version of things, and then only under certain
circumstances. Also the HMM driver doesn't pass an altmap. What we would
really need is a non-ZONE_DEVICE user of the altmap to really justify
sticking with that as the preferred argument to pass.


I am not aware of any upstream HMM user so I am not sure what the
expectations are there. But I thought that ZONE_DEVICE users use altmap. If
that is not generally true then we certainly have to think about a
better interface.


I'm just basing my statement on the use of the move_pfn_range_to_zone 
call. The only caller that is actually passing the altmap is 
devm_memremap_pages and if I understand things correctly that is only 
used when we want to store the vmemmap on the same memory that we just 
hotplugged.
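A rough userspace model of what the altmap buys: the struct page array for the hotplugged range is carved out of the range itself, so the first pfns are reserved to hold the memmap. The constants below (4 KiB pages, 64-byte struct page) are illustrative assumptions, not taken from this thread.

```c
#include <assert.h>

#define PAGE_SIZE        4096UL
#define STRUCT_PAGE_SIZE 64UL	/* typical sizeof(struct page) */

/* Model of vmem_altmap accounting: how many leading pfns of the
 * hotplugged range must be set aside so the range can host its own
 * memmap. */
static unsigned long altmap_reserved_pfns(unsigned long nr_pfns)
{
	unsigned long memmap_bytes = nr_pfns * STRUCT_PAGE_SIZE;

	return (memmap_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

/* pfns left over for actual data once the memmap is carved out. */
static unsigned long altmap_usable_pfns(unsigned long nr_pfns)
{
	return nr_pfns - altmap_reserved_pfns(nr_pfns);
}
```

With these numbers a 1 GiB hotplug (262144 pfns) spends 16 MiB (4096 pfns) of its own capacity on the memmap, which is exactly why large persistent-memory regions use the altmap instead of consuming regular RAM for their struct pages.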


That is why it made more sense to me to just create a ZONE_DEVICE 
specific function for handling the page initialization because the one 
value I do have to pass is the dev_pagemap in both HMM and memremap 
case, and that has the altmap already embedded inside of it.



For those two functions it currently makes much more sense to pass the
dev_pagemap pointer and then reference the altmap from there. Otherwise we
are likely starting to look at something that would be more of a dirty hack
where we are passing an unused altmap in order to get to the dev_pagemap so
that we could populate the page.


If dev_pagemap is a general abstraction then I agree.


Well so far HMM and the memremap code have both agreed to use that 
structure to store the metadata for ZONE_DEVICE mappings, and at this 
point we are already looking at 3 different memory types being stored 
within that zone as we already have the private, public, and DAX memory 
types all using this structure.
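The shared-container argument can be sketched like this. It is a userspace model; the field and enum names echo the kernel's dev_pagemap of that era but are reproduced from memory and not authoritative. The point is the shape of the interface Alexander argues for: pass the pgmap, and derive the altmap from it when one is embedded, so HMM callers (which carry no altmap) and devm_memremap_pages callers go through the same path.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the metadata container the thread discusses. */
enum memory_type {
	MEMORY_DEVICE_PRIVATE,
	MEMORY_DEVICE_PUBLIC,
	MEMORY_DEVICE_FS_DAX,
};

struct vmem_altmap {
	unsigned long base_pfn;
	unsigned long reserve;	/* pfns carved out for the memmap */
};

struct dev_pagemap {
	enum memory_type type;
	struct vmem_altmap altmap;
	int altmap_valid;	/* HMM-style callers leave the altmap unused */
};

/* Init helper in the style argued for above: take the pgmap and
 * reference the altmap from it, rather than passing the altmap
 * separately. */
static const struct vmem_altmap *pgmap_altmap(const struct dev_pagemap *pgmap)
{
	return pgmap->altmap_valid ? &pgmap->altmap : NULL;
}
```

This keeps the altmap an implementation detail of the pgmap rather than an extra parameter that most callers would pass as NULL or leave unused.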



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-17 Thread Michal Hocko
On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > [...]
> > > My recollection was that we do clear the reserved bit in
> > > move_pfn_range_to_zone and we indeed do in __init_single_page. But then
> > > we set the bit back right afterwards. This seems to be the case since
> > > d0dc12e86b319 which reorganized the code. I have to study this some more
> > > obviously.
> > 
> > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > much because __init_single_page wouldn't zero out the struct page for
> > the hotplug context. A comment in move_pfn_range_to_zone explains that we
> > want the reserved bit because pfn walkers already do see the pfn range
> > and the page is not fully associated with the zone until it is onlined.
> > 
> > I am thinking that we might be overzealous here. With the full state
> > initialized we shouldn't actually care. pfn_to_online_page should return
> > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > touch pages they do not recognize and a plain page with ref. count 1
> > doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> > reserved bit setting here.
> 
> So this has me a bit hesitant to want to just drop the bit entirely. If
> nothing else I think I may want to make that a patch onto itself so that if
> we aren't going to set it we just drop it there. That way if it does cause
> issues we can bisect it to that patch and pinpoint the cause.

Yes a patch on its own makes sense for bisectability.

> > Regarding the post initialization required by devm_memremap_pages and
> > potentially others. Can we update the altmap which is already a way how
> > to get alternative struct pages by a constructor which we could call
> > from memmap_init_zone and do the post initialization? This would reduce
> > the additional loop in the caller while it would still fit the overall
> > design of the altmap and the core hotplug doesn't have to know anything
> > about DAX or whatever needs a special treatment.
> > 
> > Does that make any sense?
> 
> I think the only thing that is currently using the altmap is the ZONE_DEVICE
> memory init. Specifically I think it is only really used by the
> devm_memremap_pages version of things, and then only under certain
> circumstances. Also the HMM driver doesn't pass an altmap. What we would
> really need is a non-ZONE_DEVICE user of the altmap to really justify
> sticking with that as the preferred argument to pass.

I am not aware of any upstream HMM user so I am not sure what the
expectations are there. But I thought that ZONE_DEVICE users use altmap. If
that is not generally true then we certainly have to think about a
better interface.
 
> For those two functions it currently makes much more sense to pass the
> dev_pagemap pointer and then reference the altmap from there. Otherwise we
> are likely starting to look at something that would be more of a dirty hack
> where we are passing an unused altmap in order to get to the dev_pagemap so
> that we could populate the page.

If dev_pagemap is a general abstraction then I agree.
-- 
Michal Hocko
SUSE Labs
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-11 Thread Dan Williams
On Thu, Oct 11, 2018 at 10:39 AM Alexander Duyck
 wrote:
>
> On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > [...]
> >> My recollection was that we do clear the reserved bit in
> >> move_pfn_range_to_zone and we indeed do in __init_single_page. But then
> >> we set the bit back right afterwards. This seems to be the case since
> >> d0dc12e86b319 which reorganized the code. I have to study this some more
> >> obviously.
> >
> > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > much because __init_single_page wouldn't zero out the struct page for
> >> the hotplug context. A comment in move_pfn_range_to_zone explains that we
> > want the reserved bit because pfn walkers already do see the pfn range
> > and the page is not fully associated with the zone until it is onlined.
> >
> > I am thinking that we might be overzealous here. With the full state
> > initialized we shouldn't actually care. pfn_to_online_page should return
> > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > touch pages they do not recognize and a plain page with ref. count 1
> > doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> > reserved bit setting here.
>
> So this has me a bit hesitant to want to just drop the bit entirely. If
> nothing else I think I may want to make that a patch onto itself so that
> if we aren't going to set it we just drop it there. That way if it does
> cause issues we can bisect it to that patch and pinpoint the cause.
>
> > Regarding the post initialization required by devm_memremap_pages and
> > potentially others. Can we update the altmap which is already a way how
> > to get alternative struct pages by a constructor which we could call
> > from memmap_init_zone and do the post initialization? This would reduce
> > the additional loop in the caller while it would still fit the overall
> > design of the altmap and the core hotplug doesn't have to know anything
> > about DAX or whatever needs a special treatment.
> >
> > Does that make any sense?
>
> I think the only thing that is currently using the altmap is the
> ZONE_DEVICE memory init. Specifically I think it is only really used by
> the devm_memremap_pages version of things, and then only under certain
> circumstances. Also the HMM driver doesn't pass an altmap. What we would
> really need is a non-ZONE_DEVICE user of the altmap to really justify
> sticking with that as the preferred argument to pass.

Right, the altmap is optional. It's only there to direct the memmap
array to be allocated from the memory-range being hot-added vs a
dynamic page-allocator allocation from System-RAM.

> For those two functions it currently makes much more sense to pass the
> dev_pagemap pointer and then reference the altmap from there. Otherwise
> we are likely starting to look at something that would be more of a
> dirty hack where we are passing an unused altmap in order to get to the
> dev_pagemap so that we could populate the page.

Yeah, we can't rely on the altmap, it's marked invalid in many cases.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-11 Thread Alexander Duyck

On 10/11/2018 1:55 AM, Michal Hocko wrote:

On Wed 10-10-18 20:52:42, Michal Hocko wrote:
[...]

My recollection was that we do clear the reserved bit in
move_pfn_range_to_zone and we indeed do in __init_single_page. But then
we set the bit back right afterwards. This seems to be the case since
d0dc12e86b319 which reorganized the code. I have to study this some more
obviously.


so my recollection was wrong and d0dc12e86b319 hasn't really changed
much because __init_single_page wouldn't zero out the struct page for
the hotplug context. A comment in move_pfn_range_to_zone explains that we
want the reserved bit because pfn walkers already do see the pfn range
and the page is not fully associated with the zone until it is onlined.

I am thinking that we might be overzealous here. With the full state
initialized we shouldn't actually care. pfn_to_online_page should return
NULL regardless of the reserved bit and normal pfn walkers shouldn't
touch pages they do not recognize and a plain page with ref. count 1
doesn't tell much to anybody. So I _suspect_ that we can simply drop the
reserved bit setting here.


So this has me a bit hesitant to want to just drop the bit entirely. If 
nothing else I think I may want to make that a patch onto itself so that 
if we aren't going to set it we just drop it there. That way if it does 
cause issues we can bisect it to that patch and pinpoint the cause.



Regarding the post initialization required by devm_memremap_pages and
potentially others. Can we update the altmap which is already a way how
to get alternative struct pages by a constructor which we could call
from memmap_init_zone and do the post initialization? This would reduce
the additional loop in the caller while it would still fit the overall
design of the altmap and the core hotplug doesn't have to know anything
about DAX or whatever needs a special treatment.

Does that make any sense?


I think the only thing that is currently using the altmap is the 
ZONE_DEVICE memory init. Specifically I think it is only really used by 
the devm_memremap_pages version of things, and then only under certain 
circumstances. Also the HMM driver doesn't pass an altmap. What we would 
really need is a non-ZONE_DEVICE user of the altmap to really justify 
sticking with that as the preferred argument to pass.


For those two functions it currently makes much more sense to pass the 
dev_pagemap pointer and then reference the altmap from there. Otherwise 
we are likely starting to look at something that would be more of a 
dirty hack where we are passing an unused altmap in order to get to the 
dev_pagemap so that we could populate the page.








Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-11 Thread Alexander Duyck

On 10/11/2018 1:39 AM, Yi Zhang wrote:

On 2018-10-10 at 11:18:49 -0700, Dan Williams wrote:

On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko  wrote:


On Wed 10-10-18 09:39:08, Alexander Duyck wrote:

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if the
pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
special cases where the memory may not be accessible to the CPU or cannot be
pinned in order to allow for eviction.


Could you give me an example please?


The specific case that Dan and Yi are referring to is for the type
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
reserved bit. Part of me wants to say that we should wait and clear the bit
later, but that would end up just adding time back to initialization. At
this point I would consider the change more of a follow-up optimization
rather than a fix though since this is tailoring things specifically for DAX
versus the other ZONE_DEVICE types.


I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.



Right, so we're in a situation where a hack is needed for KVM's
current interpretation of the Reserved flag relative to dax mapped
pages. I'm arguing to push that knowledge / handling as deep as
possible into the core rather than hack the leaf implementations like
KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_*
ZONE_DEVICE types.

Here is the KVM thread about why they need a change:

 https://lkml.org/lkml/2018/9/7/552

...and where I pushed back on a KVM-local hack:

 https://lkml.org/lkml/2018/9/19/154

Yeah, thanks Dan, I think I can go on with something like this:

@@ -5589,6 +5589,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		struct page *page = pfn_to_page(pfn);
 
 		__init_single_page(page, pfn, zone_idx, nid);
+		/* Could we move this a little bit earlier as I can
+		 * direct use is_dax_page(page), or something else?
+		 */
+		page->pgmap = pgmap;
 
 		/*
 		 * Mark page reserved as it will need to wait for onlining
@@ -5597,14 +5598,14 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		 * We can use the non-atomic __set_bit operation for setting
 		 * the flag as we are still initializing the pages.
 		 */
-		__SetPageReserved(page);
+		if (!is_dax_page(page))
+			__SetPageReserved(page);
 
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
 		 * page is ever freed or placed on a driver-private list.
 		 */
-		page->pgmap = pgmap;
 		page->hmm_data = 0;

After Alex's patch merged.


So I am not a huge fan of splitting up the pgmap init from the hmm_data, 
but I suppose this is just for your proof-of-concept?


I already have another patch set outstanding that may actually make this 
change easier[1]. What I could do is add the logic there based on the 
pgmap.type as an additional patch since I pass a boolean to determine if 
I am setting the reserved bit or not.


[1] 
https://lore.kernel.org/lkml/20181005151006.17473.83040.stgit@localhost.localdomain/




Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Yi Zhang
On 2018-10-10 at 11:18:49 -0700, Dan Williams wrote:
> On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko  wrote:
> >
> > On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> > > On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > > > [...]
> > > > > I would think with that being the case we still probably need the 
> > > > > call to
> > > > > __SetPageReserved to set the bit with the expectation that it will 
> > > > > not be
> > > > > cleared for device-pages since the pages are not onlined. Removing 
> > > > > the call
> > > > > to __SetPageReserved would probably introduce a number of regressions 
> > > > > as
> > > > > there are multiple spots that use the reserved bit to determine if a 
> > > > > page
> > > > > can be swapped out to disk, mapped as system memory, or migrated.
> > > >
> > > > PageReserved is meant to tell any potential pfn walkers that might get
> > > > to this struct page to back off and not touch it. Even though
> > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > > > pages available for further use so the page reserved bit should be
> > > > cleared.
> > >
> > > So from what I can tell that isn't necessarily the case. Specifically if 
> > > the
> > > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> > > special cases where the memory may not be accessible to the CPU or cannot 
> > > be
> > > pinned in order to allow for eviction.
> >
> > Could you give me an example please?
> >
> > > The specific case that Dan and Yi are referring to is for the type
> > > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting 
> > > the
> > > reserved bit. Part of me wants to say that we should wait and clear the 
> > > bit
> > > later, but that would end up just adding time back to initialization. At
> > > this point I would consider the change more of a follow-up optimization
> > > rather than a fix though since this is tailoring things specifically for 
> > > DAX
> > > versus the other ZONE_DEVICE types.
> >
> > I thought I have already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
> >
> 
> Right, so we're in a situation where a hack is needed for KVM's
> current interpretation of the Reserved flag relative to dax mapped
> pages. I'm arguing to push that knowledge / handling as deep as
> possible into the core rather than hack the leaf implementations like
> KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_*
> ZONE_DEVICE types.
> 
> Here is the KVM thread about why they need a change:
> 
> https://lkml.org/lkml/2018/9/7/552
> 
> ...and where I pushed back on a KVM-local hack:
> 
> https://lkml.org/lkml/2018/9/19/154
Yeah, thanks Dan, I think I can go on with something like this:

@@ -5589,6 +5589,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		struct page *page = pfn_to_page(pfn);
 
 		__init_single_page(page, pfn, zone_idx, nid);
+		/* Could we move this a little bit earlier as I can
+		 * direct use is_dax_page(page), or something else?
+		 */
+		page->pgmap = pgmap;
 
 		/*
 		 * Mark page reserved as it will need to wait for onlining
@@ -5597,14 +5598,14 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		 * We can use the non-atomic __set_bit operation for setting
 		 * the flag as we are still initializing the pages.
 		 */
-		__SetPageReserved(page);
+		if (!is_dax_page(page))
+			__SetPageReserved(page);
 
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
 		 * page is ever freed or placed on a driver-private list.
 		 */
-		page->pgmap = pgmap;
 		page->hmm_data = 0;


After Alex's patch merged.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Yi Zhang
On 2018-10-10 at 08:27:58 -0700, Alexander Duyck wrote:
> 
> 
> On 10/10/2018 5:52 AM, Yi Zhang wrote:
> >On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:
> >>On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
> >> wrote:
> >>>
> >>>On 10/9/2018 11:04 AM, Dan Williams wrote:
> On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  
> wrote:
> >>[..]
> That comment is incorrect, device-pages are never onlined. So I think
> we can just skip that call to __SetPageReserved() unless the memory
> range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> 
> >>>
> >>>When pages are "onlined" via __free_pages_boot_core they clear the
> >>>reserved bit, that is the reason for the comment. The reserved bit is
> >>>meant to indicate that the page cannot be swapped out or moved based on
> >>>the description of the bit.
> >>
> >>...but ZONE_DEVICE pages are never onlined so I would expect
> >>memmap_init_zone_device() to know that detail.
> >>
> >>>I would think with that being the case we still probably need the call
> >>>to __SetPageReserved to set the bit with the expectation that it will
> >>>not be cleared for device-pages since the pages are not onlined.
> >>>Removing the call to __SetPageReserved would probably introduce a number
> >>>of regressions as there are multiple spots that use the reserved bit to
> >>>determine if a page can be swapped out to disk, mapped as system memory,
> >>>or migrated.
> >
> >Another thing: it seems page_init/set_reserved has already been done in
> >move_pfn_range_to_zone
> > |-->memmap_init_zone
> > |-->for_each_page_in_pfn
> > |-->__init_single_page
> > |-->SetPageReserved
> >
> >Why haven't we removed this redundant initialization in memmap_init_zone?
> >
> >Correct me if I missed something.
> 
> In this case it isn't redundant as only the vmemmap pages are initialized
> in memmap_init_zone now. So all of the pages that are going to be used as
> device pages are not initialized until the call to memmap_init_zone_device.
> What I did is split the initialization of the pages into two parts in order
> to allow us to initialize the pages outside of the hotplug lock.
Ah, I see. Thanks for the explanation; so we only need to care about the
reserved flag on the device pages, and plan to remove that.
> 
> >>
> >>Right, this is what Yi is working on... the PageReserved flag is
> >>problematic for KVM. Auditing those locations it seems as long as we
> >>teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
> >>the reserved flag for DAX pages. What I'm trying to avoid is a local
> >>KVM hack to check for DAX pages when the Reserved flag is not
> >>otherwise needed.
> >Thanks Dan. Provide the patch link.
> >
> >https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com
> 
> So it looks like your current logic is just working around the bit then
> since it just allows for reserved DAX pages.
> 
> 


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 10:39:01, Alexander Duyck wrote:
> On 10/10/2018 10:24 AM, Michal Hocko wrote:
[...]
> > I thought I have already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
> 
> I might have misunderstood your earlier comment then. I thought you were
> saying that we shouldn't bother with setting the reserved bit. Now it sounds
> like you were thinking more along the lines of what I was here in my comment
> where I thought the bit should be cleared later in some code specifically
> related to DAX when it is exposing it for use to userspace or KVM.

It seems I managed to confuse myself completely. Sorry, it's been a long
day and I am sick so the brain doesn't work all that well. I will get
back to this tomorrow or on Friday with a fresh brain.

My recollection was that we do clear the reserved bit in
move_pfn_range_to_zone and we indeed do in __init_single_page. But then
we set the bit back right afterwards. This seems to be the case since
d0dc12e86b319 which reorganized the code. I have to study this some more
obviously.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Dan Williams
On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko  wrote:
>
> On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> > On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > > [...]
> > > > I would think with that being the case we still probably need the call 
> > > > to
> > > > __SetPageReserved to set the bit with the expectation that it will not 
> > > > be
> > > > cleared for device-pages since the pages are not onlined. Removing the 
> > > > call
> > > > to __SetPageReserved would probably introduce a number of regressions as
> > > > there are multiple spots that use the reserved bit to determine if a 
> > > > page
> > > > can be swapped out to disk, mapped as system memory, or migrated.
> > >
> > > PageReserved is meant to tell any potential pfn walkers that might get
> > > to this struct page to back off and not touch it. Even though
> > > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > > pages available for further use so the page reserved bit should be
> > > cleared.
> >
> > So from what I can tell that isn't necessarily the case. Specifically if the
> > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> > special cases where the memory may not be accessible to the CPU or cannot be
> > pinned in order to allow for eviction.
>
> Could you give me an example please?
>
> > The specific case that Dan and Yi are referring to is for the type
> > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
> > reserved bit. Part of me wants to say that we should wait and clear the bit
> > later, but that would end up just adding time back to initialization. At
> > this point I would consider the change more of a follow-up optimization
> > rather than a fix though since this is tailoring things specifically for DAX
> > versus the other ZONE_DEVICE types.
>
> I thought I have already made it clear that these zone device hacks are
> not acceptable to the generic hotplug code. If the current reserve bit
> handling is not correct then give us a specific reason for that and we
> can start thinking about the proper fix.
>

Right, so we're in a situation where a hack is needed for KVM's
current interpretation of the Reserved flag relative to dax mapped
pages. I'm arguing to push that knowledge / handling as deep as
possible into the core rather than hack the leaf implementations like
KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_*
ZONE_DEVICE types.

Here is the KVM thread about why they need a change:

https://lkml.org/lkml/2018/9/7/552

...and where I pushed back on a KVM-local hack:

https://lkml.org/lkml/2018/9/19/154


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck

On 10/10/2018 10:53 AM, Michal Hocko wrote:

On Wed 10-10-18 10:39:01, Alexander Duyck wrote:



On 10/10/2018 10:24 AM, Michal Hocko wrote:

On Wed 10-10-18 09:39:08, Alexander Duyck wrote:

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if the
pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
special cases where the memory may not be accessible to the CPU or cannot be
pinned in order to allow for eviction.


Could you give me an example please?


Honestly I am getting a bit beyond my depth here so maybe Dan could explain
better. I am basing the above comment on Dan's earlier comment in this
thread combined with the comment that explains the "memory_type" field for
the pgmap:
https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28


The specific case that Dan and Yi are referring to is for the type
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
reserved bit. Part of me wants to say that we should wait and clear the bit
later, but that would end up just adding time back to initialization. At
this point I would consider the change more of a follow-up optimization
rather than a fix though since this is tailoring things specifically for DAX
versus the other ZONE_DEVICE types.


I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.


I might have misunderstood your earlier comment then. I thought you were
saying that we shouldn't bother with setting the reserved bit. Now it sounds
like you were thinking more along the lines of what I was here in my comment
where I thought the bit should be cleared later in some code specifically
related to DAX when it is exposing it for use to userspace or KVM.


I was referring to my earlier comment that if you need to do something
about struct page initialization (move_pfn_range_to_zone) outside of the
lock (with the appropriate ground work that is needed) rather than
pulling more zone device hacks into the generic hotplug code [1]

[1] http://lkml.kernel.org/r/20180926075540.gd6...@dhcp22.suse.cz


The only issue is that if we don't pull the code together we are looking 
at a massive increase in initialization times. For example, just the loop 
that was originally there setting the pgmap and resetting the LRU prev 
pointer added an additional 20+ seconds per node with 3TB allocated per 
node. That essentially doubles the initialization time.


How would you recommend I address that? Currently it is a few extra 
lines in the memmap_init_zone_device function. Eventually I was planning 
to combine the memmap_init_zone hotplug functionality and 
memmap_init_zone_device core functionality into a single function called 
__init_pageblock[1] and then reuse that functionality for deferred page 
init as well.


[1] 
http://lkml.kernel.org/r/20181005151224.17473.53398.stgit@localhost.localdomain




Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 10:39:01, Alexander Duyck wrote:
> 
> 
> On 10/10/2018 10:24 AM, Michal Hocko wrote:
> > On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> > > On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > > > [...]
> > > > > I would think with that being the case we still probably need the 
> > > > > call to
> > > > > __SetPageReserved to set the bit with the expectation that it will 
> > > > > not be
> > > > > cleared for device-pages since the pages are not onlined. Removing 
> > > > > the call
> > > > > to __SetPageReserved would probably introduce a number of regressions 
> > > > > as
> > > > > there are multiple spots that use the reserved bit to determine if a 
> > > > > page
> > > > > can be swapped out to disk, mapped as system memory, or migrated.
> > > > 
> > > > PageReserved is meant to tell any potential pfn walkers that might get
> > > > to this struct page to back off and not touch it. Even though
> > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > > > pages available for further use so the page reserved bit should be
> > > > cleared.
> > > 
> > > So from what I can tell that isn't necessarily the case. Specifically if 
> > > the
> > > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> > > special cases where the memory may not be accessible to the CPU or cannot 
> > > be
> > > pinned in order to allow for eviction.
> > 
> > Could you give me an example please?
> 
> Honestly I am getting a bit beyond my depth here so maybe Dan could explain
> better. I am basing the above comment on Dan's earlier comment in this
> thread combined with the comment that explains the "memory_type" field for
> the pgmap:
> https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28
> 
> > > The specific case that Dan and Yi are referring to is for the type
> > > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting 
> > > the
> > > reserved bit. Part of me wants to say that we should wait and clear the 
> > > bit
> > > later, but that would end up just adding time back to initialization. At
> > > this point I would consider the change more of a follow-up optimization
> > > rather than a fix though since this is tailoring things specifically for 
> > > DAX
> > > versus the other ZONE_DEVICE types.
> > 
> > I thought I have already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
> 
> I might have misunderstood your earlier comment then. I thought you were
> saying that we shouldn't bother with setting the reserved bit. Now it sounds
> like you were thinking more along the lines of what I was here in my comment
> where I thought the bit should be cleared later in some code specifically
> related to DAX when it is exposing it for use to userspace or KVM.

I was referring to my earlier comment that if you need to do something
about struct page initialization (move_pfn_range_to_zone) outside of the
lock (with the appropriate ground work that is needed) rather than
pulling more zone device hacks into the generic hotplug code [1]

[1] http://lkml.kernel.org/r/20180926075540.gd6...@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck




On 10/10/2018 10:24 AM, Michal Hocko wrote:

On Wed 10-10-18 09:39:08, Alexander Duyck wrote:

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if the
pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
special cases where the memory may not be accessible to the CPU or cannot be
pinned in order to allow for eviction.


Could you give me an example please?


Honestly I am getting a bit beyond my depth here so maybe Dan could 
explain better. I am basing the above comment on Dan's earlier comment 
in this thread combined with the comment that explains the "memory_type" 
field for the pgmap:

https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28


The specific case that Dan and Yi are referring to is for the type
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
reserved bit. Part of me wants to say that we should wait and clear the bit
later, but that would end up just adding time back to initialization. At
this point I would consider the change more of a follow-up optimization
rather than a fix though since this is tailoring things specifically for DAX
versus the other ZONE_DEVICE types.


I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.


I might have misunderstood your earlier comment then. I thought you were 
saying that we shouldn't bother with setting the reserved bit. Now it 
sounds like you were thinking more along the lines of what I was here in 
my comment where I thought the bit should be cleared later in some code 
specifically related to DAX when it is exposing it for use to userspace 
or KVM.



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > [...]
> > > I would think with that being the case we still probably need the call to
> > > __SetPageReserved to set the bit with the expectation that it will not be
> > > cleared for device-pages since the pages are not onlined. Removing the call
> > > to __SetPageReserved would probably introduce a number of regressions as
> > > there are multiple spots that use the reserved bit to determine if a page
> > > can be swapped out to disk, mapped as system memory, or migrated.
> > 
> > PageReserved is meant to tell any potential pfn walkers that might get
> > to this struct page to back off and not touch it. Even though
> > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > pages available for further use so the page reserved bit should be
> > cleared.
> 
> So from what I can tell that isn't necessarily the case. Specifically if the
> pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> special cases where the memory may not be accessible to the CPU or cannot be
> pinned in order to allow for eviction.

Could you give me an example please?

> The specific case that Dan and Yi are referring to is for the type
> MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
> reserved bit. Part of me wants to say that we should wait and clear the bit
> later, but that would end up just adding time back to initialization. At
> this point I would consider the change more of a follow-up optimization
> rather than a fix though since this is tailoring things specifically for DAX
> versus the other ZONE_DEVICE types.

I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if 
the pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both 
are special cases where the memory may not be accessible to the CPU or 
cannot be pinned in order to allow for eviction.


The specific case that Dan and Yi are referring to is for the type 
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting 
the reserved bit. Part of me wants to say that we should wait and clear 
the bit later, but that would end up just adding time back to 
initialization. At this point I would consider the change more of a 
follow-up optimization rather than a fix though since this is tailoring 
things specifically for DAX versus the other ZONE_DEVICE types.




Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck




On 10/10/2018 5:52 AM, Yi Zhang wrote:

On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:

On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
 wrote:


On 10/9/2018 11:04 AM, Dan Williams wrote:

On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  wrote:

[..]

That comment is incorrect, device-pages are never onlined. So I think
we can just skip that call to __SetPageReserved() unless the memory
range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.



When pages are "onlined" via __free_pages_boot_core they clear the
reserved bit, that is the reason for the comment. The reserved bit is
meant to indicate that the page cannot be swapped out or moved based on
the description of the bit.


...but ZONE_DEVICE pages are never onlined so I would expect
memmap_init_zone_device() to know that detail.


I would think with that being the case we still probably need the call
to __SetPageReserved to set the bit with the expectation that it will
not be cleared for device-pages since the pages are not onlined.
Removing the call to __SetPageReserved would probably introduce a number
of regressions as there are multiple spots that use the reserved bit to
determine if a page can be swapped out to disk, mapped as system memory,
or migrated.


Another thing: it seems the page init / SetPageReserved has already been done in the
move_pfn_range_to_zone path:
 |-->memmap_init_zone
      |-->for_each_page_in_pfn
           |-->__init_single_page
                |-->SetPageReserved

Why haven't we removed this redundant initialization in memmap_init_zone?

Correct me if I missed something.


In this case it isn't redundant as only the vmmemmap pages are 
initialized in memmap_init_zone now. So all of the pages that are going 
to be used as device pages are not initialized until the call to 
memmap_init_zone_device. What I did is split the initialization of the 
pages into two parts in order to allow us to initialize the pages 
outside of the hotplug lock.




Right, this is what Yi is working on... the PageReserved flag is
problematic for KVM. Auditing those locations it seems as long as we
teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
the reserved flag for DAX pages. What I'm trying to avoid is a local
KVM hack to check for DAX pages when the Reserved flag is not
otherwise needed.

Thanks Dan. Provide the patch link.

https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com


So it looks like your current logic is just working around the bit then 
since it just allows for reserved DAX pages.





Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]
> I would think with that being the case we still probably need the call to
> __SetPageReserved to set the bit with the expectation that it will not be
> cleared for device-pages since the pages are not onlined. Removing the call
> to __SetPageReserved would probably introduce a number of regressions as
> there are multiple spots that use the reserved bit to determine if a page
> can be swapped out to disk, mapped as system memory, or migrated.

PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Yi Zhang
On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:
> On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
>  wrote:
> >
> > On 10/9/2018 11:04 AM, Dan Williams wrote:
> > > On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  
> > > wrote:
> [..]
> > > That comment is incorrect, device-pages are never onlined. So I think
> > > we can just skip that call to __SetPageReserved() unless the memory
> > > range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> > >
> >
> > When pages are "onlined" via __free_pages_boot_core they clear the
> > reserved bit, that is the reason for the comment. The reserved bit is
> > meant to indicate that the page cannot be swapped out or moved based on
> > the description of the bit.
> 
> ...but ZONE_DEVICE pages are never onlined so I would expect
> memmap_init_zone_device() to know that detail.
> 
> > I would think with that being the case we still probably need the call
> > to __SetPageReserved to set the bit with the expectation that it will
> > not be cleared for device-pages since the pages are not onlined.
> > Removing the call to __SetPageReserved would probably introduce a number
> > of regressions as there are multiple spots that use the reserved bit to
> > determine if a page can be swapped out to disk, mapped as system memory,
> > or migrated.

Another thing: it seems the page init / SetPageReserved has already been done in the
move_pfn_range_to_zone path:
 |-->memmap_init_zone
      |-->for_each_page_in_pfn
           |-->__init_single_page
                |-->SetPageReserved

Why haven't we removed this redundant initialization in memmap_init_zone?

Correct me if I missed something.

> 
> Right, this is what Yi is working on... the PageReserved flag is
> problematic for KVM. Auditing those locations it seems as long as we
> teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
> the reserved flag for DAX pages. What I'm trying to avoid is a local
> KVM hack to check for DAX pages when the Reserved flag is not
> otherwise needed.
Thanks Dan. Provide the patch link.

https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com





Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-09 Thread Dan Williams
On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
 wrote:
>
> On 10/9/2018 11:04 AM, Dan Williams wrote:
> > On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  wrote:
[..]
> > That comment is incorrect, device-pages are never onlined. So I think
> > we can just skip that call to __SetPageReserved() unless the memory
> > range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> >
>
> When pages are "onlined" via __free_pages_boot_core they clear the
> reserved bit, that is the reason for the comment. The reserved bit is
> meant to indicate that the page cannot be swapped out or moved based on
> the description of the bit.

...but ZONE_DEVICE pages are never onlined so I would expect
memmap_init_zone_device() to know that detail.

> I would think with that being the case we still probably need the call
> to __SetPageReserved to set the bit with the expectation that it will
> not be cleared for device-pages since the pages are not onlined.
> Removing the call to __SetPageReserved would probably introduce a number
> of regressions as there are multiple spots that use the reserved bit to
> determine if a page can be swapped out to disk, mapped as system memory,
> or migrated.

Right, this is what Yi is working on... the PageReserved flag is
problematic for KVM. Auditing those locations it seems as long as we
teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
the reserved flag for DAX pages. What I'm trying to avoid is a local
KVM hack to check for DAX pages when the Reserved flag is not
otherwise needed.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-09 Thread Alexander Duyck

On 10/9/2018 11:04 AM, Dan Williams wrote:

On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  wrote:


On 2018-09-25 at 13:21:24 -0700, Alexander Duyck wrote:

The ZONE_DEVICE pages were being initialized in two locations. One was with
the memory_hotplug lock held and another was outside of that lock. The
problem with this is that it was nearly doubling the memory initialization
time. Instead of doing this twice, once while holding a global lock and
once without, I am opting to defer the initialization to the one outside of
the lock. This allows us to avoid serializing the overhead for memory init
and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it to
a poison value. Since this is something that is exposed to the driver in
the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

Reviewed-by: Pavel Tatashin 
Signed-off-by: Alexander Duyck 
---

v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
 merge conflicts with other changes in the kernel.
v5: No change

  include/linux/mm.h |2 +
  kernel/memremap.c  |   24 +-
  mm/hmm.c   |   12 ---
  mm/page_alloc.c|   92 ++--
  4 files changed, 107 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06d7d7576f8d..7312fb78ef31 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
  {
   return page_zonenum(page) == ZONE_DEVICE;
  }
+extern void memmap_init_zone_device(struct zone *, unsigned long,
+ unsigned long, struct dev_pagemap *);
  #else
  static inline bool is_zone_device_page(const struct page *page)
  {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5b8600d39931..d0c32e473f82 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
   struct vmem_altmap *altmap = pgmap->altmap_valid ?
   &pgmap->altmap : NULL;
   struct resource *res = &pgmap->res;
- unsigned long pfn, pgoff, order;
+ struct dev_pagemap *conflict_pgmap;
   pgprot_t pgprot = PAGE_KERNEL;
+ unsigned long pgoff, order;
   int error, nid, is_ram;
- struct dev_pagemap *conflict_pgmap;

   align_start = res->start & ~(SECTION_SIZE - 1);
   align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
   if (error)
   goto err_add_memory;

- for_each_device_pfn(pfn, pgmap) {
- struct page *page = pfn_to_page(pfn);
-
- /*
-  * ZONE_DEVICE pages union ->lru with a ->pgmap back
-  * pointer.  It is a bug if a ZONE_DEVICE page is ever
-  * freed or placed on a driver-private list.  Seed the
-  * storage with LIST_POISON* values.
-  */
- list_del(&page->lru);
- page->pgmap = pgmap;
- percpu_ref_get(pgmap->ref);
- }
+ /*
+  * Initialization of the pages has been deferred until now in order
+  * to allow us to do the work while not holding the hotplug lock.
+  */
+ memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+ align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, pgmap);

   devm_add_action(dev, devm_memremap_pages_release, pgmap);

diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..774d684fa2b4 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
   resource_size_t key, align_start, align_size, align_end;
   struct device *device = devmem->device;
   int ret, nid, is_ram;
- unsigned long pfn;

   align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
   align_size = ALIGN(devmem->resource->start +
@@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
   align_size >> PAGE_SHIFT, NULL);
   mem_hotplug_done();

- for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
- struct page *page = pfn_to_page(pfn);
+ /*
+  * Initialization of the pages has been deferred until now in order
+  * to allow us to do the work while not holding the hotplug lock.
+  */
+ memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+ align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, &devmem->pagemap);

- page->pgmap = 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-09 Thread Dan Williams
On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  wrote:
>
> On 2018-09-25 at 13:21:24 -0700, Alexander Duyck wrote:
> > The ZONE_DEVICE pages were being initialized in two locations. One was with
> > the memory_hotplug lock held and another was outside of that lock. The
> > problem with this is that it was nearly doubling the memory initialization
> > time. Instead of doing this twice, once while holding a global lock and
> > once without, I am opting to defer the initialization to the one outside of
> > the lock. This allows us to avoid serializing the overhead for memory init
> > and we can instead focus on per-node init times.
> >
> > One issue I encountered is that devm_memremap_pages and
> > hmm_devmem_pages_create were initializing only the pgmap field the same
> > way. One wasn't initializing hmm_data, and the other was initializing it to
> > a poison value. Since this is something that is exposed to the driver in
> > the case of hmm I am opting for a third option and just initializing
> > hmm_data to 0 since this is going to be exposed to unknown third party
> > drivers.
> >
> > Reviewed-by: Pavel Tatashin 
> > Signed-off-by: Alexander Duyck 
> > ---
> >
> > v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
> > merge conflicts with other changes in the kernel.
> > v5: No change
> >
> >  include/linux/mm.h |2 +
> >  kernel/memremap.c  |   24 +-
> >  mm/hmm.c   |   12 ---
> >  mm/page_alloc.c|   92 ++--
> >  4 files changed, 107 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 06d7d7576f8d..7312fb78ef31 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
> >  {
> >   return page_zonenum(page) == ZONE_DEVICE;
> >  }
> > +extern void memmap_init_zone_device(struct zone *, unsigned long,
> > + unsigned long, struct dev_pagemap *);
> >  #else
> >  static inline bool is_zone_device_page(const struct page *page)
> >  {
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 5b8600d39931..d0c32e473f82 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> >   struct vmem_altmap *altmap = pgmap->altmap_valid ?
> >   &pgmap->altmap : NULL;
> >   struct resource *res = &pgmap->res;
> > - unsigned long pfn, pgoff, order;
> > + struct dev_pagemap *conflict_pgmap;
> >   pgprot_t pgprot = PAGE_KERNEL;
> > + unsigned long pgoff, order;
> >   int error, nid, is_ram;
> > - struct dev_pagemap *conflict_pgmap;
> >
> >   align_start = res->start & ~(SECTION_SIZE - 1);
> >   align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> > @@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> >   if (error)
> >   goto err_add_memory;
> >
> > - for_each_device_pfn(pfn, pgmap) {
> > - struct page *page = pfn_to_page(pfn);
> > -
> > - /*
> > -  * ZONE_DEVICE pages union ->lru with a ->pgmap back
> > -  * pointer.  It is a bug if a ZONE_DEVICE page is ever
> > -  * freed or placed on a driver-private list.  Seed the
> > -  * storage with LIST_POISON* values.
> > -  */
> > - list_del(&page->lru);
> > - page->pgmap = pgmap;
> > - percpu_ref_get(pgmap->ref);
> > - }
> > + /*
> > +  * Initialization of the pages has been deferred until now in order
> > +  * to allow us to do the work while not holding the hotplug lock.
> > +  */
> > + memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> > + align_start >> PAGE_SHIFT,
> > + align_size >> PAGE_SHIFT, pgmap);
> >
> >   devm_add_action(dev, devm_memremap_pages_release, pgmap);
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index c968e49f7a0c..774d684fa2b4 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> >   resource_size_t key, align_start, align_size, align_end;
> >   struct device *device = devmem->device;
> >   int ret, nid, is_ram;
> > - unsigned long pfn;
> >
> >   align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
> >   align_size = ALIGN(devmem->resource->start +
> > @@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> >   align_size >> PAGE_SHIFT, NULL);
> >   mem_hotplug_done();
> >
> > - for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> > - struct page *page = pfn_to_page(pfn);
> > + /*
> > +  * 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-09 Thread Yi Zhang
On 2018-09-25 at 13:21:24 -0700, Alexander Duyck wrote:
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
> 
> One issue I encountered is that devm_memremap_pages and
> hmm_devmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.
> 
> Reviewed-by: Pavel Tatashin 
> Signed-off-by: Alexander Duyck 
> ---
> 
> v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
> merge conflicts with other changes in the kernel.
> v5: No change
> 
>  include/linux/mm.h |2 +
>  kernel/memremap.c  |   24 +-
>  mm/hmm.c   |   12 ---
>  mm/page_alloc.c|   92 ++--
>  4 files changed, 107 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06d7d7576f8d..7312fb78ef31 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
>  {
>   return page_zonenum(page) == ZONE_DEVICE;
>  }
> +extern void memmap_init_zone_device(struct zone *, unsigned long,
> + unsigned long, struct dev_pagemap *);
>  #else
>  static inline bool is_zone_device_page(const struct page *page)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 5b8600d39931..d0c32e473f82 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   struct vmem_altmap *altmap = pgmap->altmap_valid ?
>   &pgmap->altmap : NULL;
>   struct resource *res = &pgmap->res;
> - unsigned long pfn, pgoff, order;
> + struct dev_pagemap *conflict_pgmap;
>   pgprot_t pgprot = PAGE_KERNEL;
> + unsigned long pgoff, order;
>   int error, nid, is_ram;
> - struct dev_pagemap *conflict_pgmap;
>  
>   align_start = res->start & ~(SECTION_SIZE - 1);
>   align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> @@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   if (error)
>   goto err_add_memory;
>  
> - for_each_device_pfn(pfn, pgmap) {
> - struct page *page = pfn_to_page(pfn);
> -
> - /*
> -  * ZONE_DEVICE pages union ->lru with a ->pgmap back
> -  * pointer.  It is a bug if a ZONE_DEVICE page is ever
> -  * freed or placed on a driver-private list.  Seed the
> -  * storage with LIST_POISON* values.
> -  */
> - list_del(&page->lru);
> - page->pgmap = pgmap;
> - percpu_ref_get(pgmap->ref);
> - }
> + /*
> +  * Initialization of the pages has been deferred until now in order
> +  * to allow us to do the work while not holding the hotplug lock.
> +  */
> + memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> + align_start >> PAGE_SHIFT,
> + align_size >> PAGE_SHIFT, pgmap);
>  
>   devm_add_action(dev, devm_memremap_pages_release, pgmap);
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c968e49f7a0c..774d684fa2b4 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>   resource_size_t key, align_start, align_size, align_end;
>   struct device *device = devmem->device;
>   int ret, nid, is_ram;
> - unsigned long pfn;
>  
>   align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
>   align_size = ALIGN(devmem->resource->start +
> @@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>   align_size >> PAGE_SHIFT, NULL);
>   mem_hotplug_done();
>  
> - for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> - struct page *page = pfn_to_page(pfn);
> + /*
> +  * Initialization of the pages has been deferred until now in order
> +  * to allow us to do the work while not holding the hotplug lock.
> +  */
> + memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> + align_start >> 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-08 Thread Dan Williams
On Mon, Oct 8, 2018 at 3:37 PM Alexander Duyck
 wrote:
>
>
>
> On 10/8/2018 3:00 PM, Dan Williams wrote:
> > On Mon, Oct 8, 2018 at 2:48 PM Alexander Duyck
> >  wrote:
> >>
> >> On 10/8/2018 2:01 PM, Dan Williams wrote:
> >>> On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
> >>>  wrote:
> 
>  The ZONE_DEVICE pages were being initialized in two locations. One was with
>  the memory_hotplug lock held and another was outside of that lock. The
>  problem with this is that it was nearly doubling the memory initialization
>  time. Instead of doing this twice, once while holding a global lock and
>  once without, I am opting to defer the initialization to the one outside of
>  the lock. This allows us to avoid serializing the overhead for memory init
>  and we can instead focus on per-node init times.
> 
>  One issue I encountered is that devm_memremap_pages and
>  hmm_devmem_pages_create were initializing only the pgmap field the same
>  way. One wasn't initializing hmm_data, and the other was initializing it to
>  a poison value. Since this is something that is exposed to the driver in
>  the case of hmm I am opting for a third option and just initializing
>  hmm_data to 0 since this is going to be exposed to unknown third party
>  drivers.
> 
>  Reviewed-by: Pavel Tatashin 
>  Signed-off-by: Alexander Duyck 
>  ---
> 
>  v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
>    merge conflicts with other changes in the kernel.
>  v5: No change
> >>>
> >>> This patch appears to cause a regression in the "create.sh" unit test
> >>> in the ndctl test suite.
> >>
> >> So all you had to do is run the create.sh script to see the issue? I
> >> just want to confirm there isn't any additional information needed
> >> before I try chasing this down.
> >
> >  From the ndctl source tree run:
> >
> >  make -j TESTS="create.sh" check
> >
> > ...the readme has some more setup instructions:
> > https://github.com/pmem/ndctl/blob/master/README.md
> >
> > 0day has sometimes run this test suite automatically, but we need to
> > get that more robust because setting up this environment is a bit of a
> > hoop to jump through with the need to setup the nfit_test module.
> >
> >>> I tried to reproduce on -next with:
> >>>
> >>> 2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
> >>> where we init pgmap
> >>>
> >>> ...but -next does not even boot for me at that commit.
> >>
> >> What version of -next? There are a couple of patches probably needed
> >> depending on which version you are trying to boot.
> >
> > Today's -next, but backed up to that above commit. I was also seeing
> > CONFIG_DEBUG_LIST spamming the logs, and a crash in the crypto layer.
> >
> >>> Here is a warning signature that proceeds a hang with this patch
> >>> applied against v4.19-rc6:
> >>>
> >>> percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
> >>> switching to atomic
> >>> WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
> >>> percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
> >>> CPU: 24 PID: 7346 Comm: modprobe Tainted: G   OE 4.19.0-rc6+ 
> >>> #2458
> >>> [..]
> >>> RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
> >>> [..]
> >>> Call Trace:
> >>>
> >>>? percpu_ref_reinit+0x140/0x140
> >>>rcu_process_callbacks+0x273/0x880
> >>>__do_softirq+0xd2/0x428
> >>>irq_exit+0xf6/0x100
> >>>smp_apic_timer_interrupt+0xa2/0x220
> >>>apic_timer_interrupt+0xf/0x20
> >>>
> >>> RIP: 0010:lock_acquire+0xb8/0x1a0
> >>> [..]
> >>>? __put_page+0x55/0x150
> >>>? __put_page+0x55/0x150
> >>>__put_page+0x83/0x150
> >>>? __put_page+0x55/0x150
> >>>devm_memremap_pages_release+0x194/0x250
> >>>release_nodes+0x17c/0x2c0
> >>>device_release_driver_internal+0x1a2/0x250
> >>>driver_detach+0x3a/0x70
> >>>bus_remove_driver+0x58/0xd0
> >>>__x64_sys_delete_module+0x13f/0x200
> >>>? trace_hardirqs_off_thunk+0x1a/0x1c
> >>>do_syscall_64+0x60/0x210
> >>>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>
> >>
> >> So it looks like we are tearing down memory when this is triggered. Do
> >> we know if this is at the end of the test or if this is running in
> >> parallel with anything?
> >
> > Should not be running in parallel with anything; this test is
> > performing a series of namespace setup and teardown events.
> >
> > Wait, where did the call to "percpu_ref_get()" go? I think that's the bug.
> >
>
> I have a reproduction on my system now as well. I should have a patch
> ready to go for it in the next hour or so.
>

Nice! Thanks for jumping on this, and I like the "get_many" optimization.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-08 Thread Alexander Duyck




On 10/8/2018 3:00 PM, Dan Williams wrote:

On Mon, Oct 8, 2018 at 2:48 PM Alexander Duyck
 wrote:


On 10/8/2018 2:01 PM, Dan Williams wrote:

On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
 wrote:


The ZONE_DEVICE pages were being initialized in two locations. One was with
the memory_hotplug lock held and another was outside of that lock. The
problem with this is that it was nearly doubling the memory initialization
time. Instead of doing this twice, once while holding a global lock and
once without, I am opting to defer the initialization to the one outside of
the lock. This allows us to avoid serializing the overhead for memory init
and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it to
a poison value. Since this is something that is exposed to the driver in
the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

Reviewed-by: Pavel Tatashin 
Signed-off-by: Alexander Duyck 
---

v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
  merge conflicts with other changes in the kernel.
v5: No change


This patch appears to cause a regression in the "create.sh" unit test
in the ndctl test suite.


So all you had to do is run the create.sh script to see the issue? I
just want to confirm there isn't any additional information needed
before I try chasing this down.


 From the ndctl source tree run:

 make -j TESTS="create.sh" check

...the readme has some more setup instructions:
https://github.com/pmem/ndctl/blob/master/README.md

0day has sometimes run this test suite automatically, but we need to
get that more robust because setting up this environment is a bit of a
hoop to jump through with the need to set up the nfit_test module.


I tried to reproduce on -next with:

2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
where we init pgmap

...but -next does not even boot for me at that commit.


What version of -next? There are a couple of patches probably needed
depending on which version you are trying to boot.


Today's -next, but backed up to that above commit. I was also seeing
CONFIG_DEBUG_LIST spamming the logs, and a crash in the crypto layer.


Here is a warning signature that precedes a hang with this patch
applied against v4.19-rc6:

percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
switching to atomic
WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
CPU: 24 PID: 7346 Comm: modprobe Tainted: G   OE 4.19.0-rc6+ #2458
[..]
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
[..]
Call Trace:
   
   ? percpu_ref_reinit+0x140/0x140
   rcu_process_callbacks+0x273/0x880
   __do_softirq+0xd2/0x428
   irq_exit+0xf6/0x100
   smp_apic_timer_interrupt+0xa2/0x220
   apic_timer_interrupt+0xf/0x20
   
RIP: 0010:lock_acquire+0xb8/0x1a0
[..]
   ? __put_page+0x55/0x150
   ? __put_page+0x55/0x150
   __put_page+0x83/0x150
   ? __put_page+0x55/0x150
   devm_memremap_pages_release+0x194/0x250
   release_nodes+0x17c/0x2c0
   device_release_driver_internal+0x1a2/0x250
   driver_detach+0x3a/0x70
   bus_remove_driver+0x58/0xd0
   __x64_sys_delete_module+0x13f/0x200
   ? trace_hardirqs_off_thunk+0x1a/0x1c
   do_syscall_64+0x60/0x210
   entry_SYSCALL_64_after_hwframe+0x49/0xbe



So it looks like we are tearing down memory when this is triggered. Do
we know if this is at the end of the test or if this is running in
parallel with anything?


Should not be running in parallel with anything; this test is
performing a series of namespace setup and teardown events.

Wait, where did the call to "percpu_ref_get()" go? I think that's the bug.



I have a reproduction on my system now as well. I should have a patch 
ready to go for it in the next hour or so.


Thanks.

- Alex


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-08 Thread Alexander Duyck




On 10/8/2018 3:00 PM, Dan Williams wrote:

On Mon, Oct 8, 2018 at 2:48 PM Alexander Duyck
 wrote:


On 10/8/2018 2:01 PM, Dan Williams wrote:

On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
 wrote:


The ZONE_DEVICE pages were being initialized in two locations. One was with
the memory_hotplug lock held and another was outside of that lock. The
problem with this is that it was nearly doubling the memory initialization
time. Instead of doing this twice, once while holding a global lock and
once without, I am opting to defer the initialization to the one outside of
the lock. This allows us to avoid serializing the overhead for memory init
and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it to
a poison value. Since this is something that is exposed to the driver in
the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

Reviewed-by: Pavel Tatashin 
Signed-off-by: Alexander Duyck 
---

v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
  merge conflicts with other changes in the kernel.
v5: No change


This patch appears to cause a regression in the "create.sh" unit test
in the ndctl test suite.


So all you had to do is run the create.sh script to see the issue? I
just want to confirm there isn't any additional information needed
before I try chasing this down.


 From the ndctl source tree run:

 make -j TESTS="create.sh" check

...the readme has some more setup instructions:
https://github.com/pmem/ndctl/blob/master/README.md

0day has sometimes run this test suite automatically, but we need to
get that more robust because setting up this environment is a bit of a
hoop to jump through with the need to set up the nfit_test module.


I tried to reproduce on -next with:

2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
where we init pgmap

...but -next does not even boot for me at that commit.


What version of -next? There are a couple of patches probably needed
depending on which version you are trying to boot.


Today's -next, but backed up to that above commit. I was also seeing
CONFIG_DEBUG_LIST spamming the logs, and a crash in the crypto layer.


Here is a warning signature that precedes a hang with this patch
applied against v4.19-rc6:

percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
switching to atomic
WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
CPU: 24 PID: 7346 Comm: modprobe Tainted: G   OE 4.19.0-rc6+ #2458
[..]
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
[..]
Call Trace:
   
   ? percpu_ref_reinit+0x140/0x140
   rcu_process_callbacks+0x273/0x880
   __do_softirq+0xd2/0x428
   irq_exit+0xf6/0x100
   smp_apic_timer_interrupt+0xa2/0x220
   apic_timer_interrupt+0xf/0x20
   
RIP: 0010:lock_acquire+0xb8/0x1a0
[..]
   ? __put_page+0x55/0x150
   ? __put_page+0x55/0x150
   __put_page+0x83/0x150
   ? __put_page+0x55/0x150
   devm_memremap_pages_release+0x194/0x250
   release_nodes+0x17c/0x2c0
   device_release_driver_internal+0x1a2/0x250
   driver_detach+0x3a/0x70
   bus_remove_driver+0x58/0xd0
   __x64_sys_delete_module+0x13f/0x200
   ? trace_hardirqs_off_thunk+0x1a/0x1c
   do_syscall_64+0x60/0x210
   entry_SYSCALL_64_after_hwframe+0x49/0xbe



So it looks like we are tearing down memory when this is triggered. Do
we know if this is at the end of the test or if this is running in
parallel with anything?


Should not be running in parallel with anything; this test is
performing a series of namespace setup and teardown events.

Wait, where did the call to "percpu_ref_get()" go? I think that's the bug.


Actually I think you are probably right. Do you want to get that or 
should I? Should be a quick patch since you could probably just add a 
call to percpu_ref_get_many to hold a reference for each page in the 
range of device pages before calling memmap_init_zone_device.


- Alex


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-08 Thread Dan Williams
On Mon, Oct 8, 2018 at 2:48 PM Alexander Duyck
 wrote:
>
> On 10/8/2018 2:01 PM, Dan Williams wrote:
> > On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
> >  wrote:
> >>
> >> The ZONE_DEVICE pages were being initialized in two locations. One was with
> >> the memory_hotplug lock held and another was outside of that lock. The
> >> problem with this is that it was nearly doubling the memory initialization
> >> time. Instead of doing this twice, once while holding a global lock and
> >> once without, I am opting to defer the initialization to the one outside of
> >> the lock. This allows us to avoid serializing the overhead for memory init
> >> and we can instead focus on per-node init times.
> >>
> >> One issue I encountered is that devm_memremap_pages and
> >> hmm_devmem_pages_create were initializing only the pgmap field the same
> >> way. One wasn't initializing hmm_data, and the other was initializing it to
> >> a poison value. Since this is something that is exposed to the driver in
> >> the case of hmm I am opting for a third option and just initializing
> >> hmm_data to 0 since this is going to be exposed to unknown third party
> >> drivers.
> >>
> >> Reviewed-by: Pavel Tatashin 
> >> Signed-off-by: Alexander Duyck 
> >> ---
> >>
> >> v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
> >>  merge conflicts with other changes in the kernel.
> >> v5: No change
> >
> > This patch appears to cause a regression in the "create.sh" unit test
> > in the ndctl test suite.
>
> So all you had to do is run the create.sh script to see the issue? I
> just want to confirm there isn't any additional information needed
> before I try chasing this down.

From the ndctl source tree run:

make -j TESTS="create.sh" check

...the readme has some more setup instructions:
https://github.com/pmem/ndctl/blob/master/README.md

0day has sometimes run this test suite automatically, but we need to
get that more robust because setting up this environment is a bit of a
hoop to jump through with the need to set up the nfit_test module.

> > I tried to reproduce on -next with:
> >
> > 2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
> > where we init pgmap
> >
> > ...but -next does not even boot for me at that commit.
>
> What version of -next? There are a couple of patches probably needed
> depending on which version you are trying to boot.

Today's -next, but backed up to that above commit. I was also seeing
CONFIG_DEBUG_LIST spamming the logs, and a crash in the crypto layer.

> > Here is a warning signature that precedes a hang with this patch
> > applied against v4.19-rc6:
> >
> > percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
> > switching to atomic
> > WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
> > percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
> > CPU: 24 PID: 7346 Comm: modprobe Tainted: G   OE 4.19.0-rc6+ 
> > #2458
> > [..]
> > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
> > [..]
> > Call Trace:
> >   
> >   ? percpu_ref_reinit+0x140/0x140
> >   rcu_process_callbacks+0x273/0x880
> >   __do_softirq+0xd2/0x428
> >   irq_exit+0xf6/0x100
> >   smp_apic_timer_interrupt+0xa2/0x220
> >   apic_timer_interrupt+0xf/0x20
> >   
> > RIP: 0010:lock_acquire+0xb8/0x1a0
> > [..]
> >   ? __put_page+0x55/0x150
> >   ? __put_page+0x55/0x150
> >   __put_page+0x83/0x150
> >   ? __put_page+0x55/0x150
> >   devm_memremap_pages_release+0x194/0x250
> >   release_nodes+0x17c/0x2c0
> >   device_release_driver_internal+0x1a2/0x250
> >   driver_detach+0x3a/0x70
> >   bus_remove_driver+0x58/0xd0
> >   __x64_sys_delete_module+0x13f/0x200
> >   ? trace_hardirqs_off_thunk+0x1a/0x1c
> >   do_syscall_64+0x60/0x210
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
>
> So it looks like we are tearing down memory when this is triggered. Do
> we know if this is at the end of the test or if this is running in
> parallel with anything?

Should not be running in parallel with anything; this test is
performing a series of namespace setup and teardown events.

Wait, where did the call to "percpu_ref_get()" go? I think that's the bug.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-08 Thread Alexander Duyck

On 10/8/2018 2:01 PM, Dan Williams wrote:

On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
 wrote:


The ZONE_DEVICE pages were being initialized in two locations. One was with
the memory_hotplug lock held and another was outside of that lock. The
problem with this is that it was nearly doubling the memory initialization
time. Instead of doing this twice, once while holding a global lock and
once without, I am opting to defer the initialization to the one outside of
the lock. This allows us to avoid serializing the overhead for memory init
and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it to
a poison value. Since this is something that is exposed to the driver in
the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

Reviewed-by: Pavel Tatashin 
Signed-off-by: Alexander Duyck 
---

v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
 merge conflicts with other changes in the kernel.
v5: No change


This patch appears to cause a regression in the "create.sh" unit test
in the ndctl test suite.


So all you had to do is run the create.sh script to see the issue? I 
just want to confirm there isn't any additional information needed 
before I try chasing this down.



I tried to reproduce on -next with:

2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
where we init pgmap

...but -next does not even boot for me at that commit.


What version of -next? There are a couple of patches probably needed 
depending on which version you are trying to boot.



Here is a warning signature that precedes a hang with this patch
applied against v4.19-rc6:

percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
switching to atomic
WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
CPU: 24 PID: 7346 Comm: modprobe Tainted: G   OE 4.19.0-rc6+ #2458
[..]
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
[..]
Call Trace:
  
  ? percpu_ref_reinit+0x140/0x140
  rcu_process_callbacks+0x273/0x880
  __do_softirq+0xd2/0x428
  irq_exit+0xf6/0x100
  smp_apic_timer_interrupt+0xa2/0x220
  apic_timer_interrupt+0xf/0x20
  
RIP: 0010:lock_acquire+0xb8/0x1a0
[..]
  ? __put_page+0x55/0x150
  ? __put_page+0x55/0x150
  __put_page+0x83/0x150
  ? __put_page+0x55/0x150
  devm_memremap_pages_release+0x194/0x250
  release_nodes+0x17c/0x2c0
  device_release_driver_internal+0x1a2/0x250
  driver_detach+0x3a/0x70
  bus_remove_driver+0x58/0xd0
  __x64_sys_delete_module+0x13f/0x200
  ? trace_hardirqs_off_thunk+0x1a/0x1c
  do_syscall_64+0x60/0x210
  entry_SYSCALL_64_after_hwframe+0x49/0xbe



So it looks like we are tearing down memory when this is triggered. Do 
we know if this is at the end of the test or if this is running in 
parallel with anything?



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-08 Thread Dan Williams
On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
 wrote:
>
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
>
> One issue I encountered is that devm_memremap_pages and
> hmm_devmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.
>
> Reviewed-by: Pavel Tatashin 
> Signed-off-by: Alexander Duyck 
> ---
>
> v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
> merge conflicts with other changes in the kernel.
> v5: No change

This patch appears to cause a regression in the "create.sh" unit test
in the ndctl test suite.

I tried to reproduce on -next with:

2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
where we init pgmap

...but -next does not even boot for me at that commit.

Here is a warning signature that precedes a hang with this patch
applied against v4.19-rc6:

percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
switching to atomic
WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
CPU: 24 PID: 7346 Comm: modprobe Tainted: G   OE 4.19.0-rc6+ #2458
[..]
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
[..]
Call Trace:
 
 ? percpu_ref_reinit+0x140/0x140
 rcu_process_callbacks+0x273/0x880
 __do_softirq+0xd2/0x428
 irq_exit+0xf6/0x100
 smp_apic_timer_interrupt+0xa2/0x220
 apic_timer_interrupt+0xf/0x20
 
RIP: 0010:lock_acquire+0xb8/0x1a0
[..]
 ? __put_page+0x55/0x150
 ? __put_page+0x55/0x150
 __put_page+0x83/0x150
 ? __put_page+0x55/0x150
 devm_memremap_pages_release+0x194/0x250
 release_nodes+0x17c/0x2c0
 device_release_driver_internal+0x1a2/0x250
 driver_detach+0x3a/0x70
 bus_remove_driver+0x58/0xd0
 __x64_sys_delete_module+0x13f/0x200
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 do_syscall_64+0x60/0x210
 entry_SYSCALL_64_after_hwframe+0x49/0xbe


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-28 Thread Dan Williams
On Fri, Sep 28, 2018 at 1:45 AM Oscar Salvador
 wrote:
>
> On Fri, Sep 28, 2018 at 10:12:24AM +0200, Oscar Salvador wrote:
> > Although I am not sure about leaving memmap_init_zone unprotected.
> > For the normal memory, that is not a problem since the memblock's lock
> > protects us from touching the same pages at the same time in 
> > online/offline_pages,
> > but for HMM/devm the story is different.
> >
> > I am totally unaware of HMM/devm, so I am not sure if it's protected somehow.
> > e.g: what happens if devm_memremap_pages and devm_memremap_pages_release 
> > are running
> > at the same time for the same memory-range (with the assumption that the 
> > hotplug-lock
> > does not protect move_pfn_range_to_zone anymore).
>
> I guess that this could not happen since the device is not linked to 
> devm_memremap_pages_release
> until the end with:
>
> devm_add_action(dev, devm_memremap_pages_release, pgmap)

It's a bug if devm_memremap_pages and devm_memremap_pages_release are
running simultaneously for the same range. This is enforced by the
requirement that the caller has done a successful request_region() on
the range before the call to map pages.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-28 Thread Oscar Salvador
On Fri, Sep 28, 2018 at 10:12:24AM +0200, Oscar Salvador wrote:
> Although I am not sure about leaving memmap_init_zone unprotected.
> For the normal memory, that is not a problem since the memblock's lock
> protects us from touching the same pages at the same time in 
> online/offline_pages,
> but for HMM/devm the story is different.
> 
> I am totally unaware of HMM/devm, so I am not sure if it's protected somehow.
> e.g: what happens if devm_memremap_pages and devm_memremap_pages_release are 
> running
> at the same time for the same memory-range (with the assumption that the 
> hotplug-lock
> does not protect move_pfn_range_to_zone anymore).

I guess that this could not happen since the device is not linked to 
devm_memremap_pages_release
until the end with:

devm_add_action(dev, devm_memremap_pages_release, pgmap)
-- 
Oscar Salvador
SUSE L3


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-28 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 03:13:29PM +0200, Michal Hocko wrote:
> I would have to double check but is the hotplug lock really serializing
> access to the state initialized by init_currently_empty_zone? E.g.
> zone_start_pfn is a nice example of a state that is used outside of the
> lock. zone's free lists are similar. So do we really need the hotplug
> lock? And more broadly, what is the hotplug lock supposed to
> serialize in general? Proper documentation would surely help to answer
> these questions. There is way too much of "do not touch this code and
> just make my particular hack" mindset which made the whole memory
> hotplug a giant pile of mess. We really should start with some proper
> engineering here finally.

* Locking rules:
*
* zone_start_pfn and spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path.  But, it is written
* quite infrequently.
*
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/end(). Any reader who can't tolerant drift of
* present_pages should get_online_mems() to get a stable value.

IIUC, it looks like zone_start_pfn should be protected by
zone_span_writelock/zone_span_writeunlock, and since zone_start_pfn is changed
in init_currently_empty_zone, I guess that the whole function should be within
that lock.

So, a blind shot, but could we do something like the following?

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 898e1f816821..49f87252f1b1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -764,14 +764,13 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	int nid = pgdat->node_id;
 	unsigned long flags;
 
-	if (zone_is_empty(zone))
-		init_currently_empty_zone(zone, start_pfn, nr_pages);
-
 	clear_zone_contiguous(zone);
 
 	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
 	pgdat_resize_lock(pgdat, &flags);
 	zone_span_writelock(zone);
+	if (zone_is_empty(zone))
+		init_currently_empty_zone(zone, start_pfn, nr_pages);
 	resize_zone_range(zone, start_pfn, nr_pages);
 	zone_span_writeunlock(zone);
 	resize_pgdat_range(pgdat, start_pfn, nr_pages);

Then, we could take move_pfn_range_to_zone out of the hotplug lock.

Although I am not sure about leaving memmap_init_zone unprotected.
For the normal memory, that is not a problem since the memblock's lock
protects us from touching the same pages at the same time in 
online/offline_pages,
but for HMM/devm the story is different.

I am totally unaware of HMM/devm, so I am not sure if it's protected somehow.
e.g: what happens if devm_memremap_pages and devm_memremap_pages_release are 
running
at the same time for the same memory-range (with the assumption that the 
hotplug-lock
does not protect move_pfn_range_to_zone anymore).

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread David Hildenbrand
On 27/09/2018 16:50, Oscar Salvador wrote:
> On Thu, Sep 27, 2018 at 03:13:29PM +0200, Michal Hocko wrote:
>> On Thu 27-09-18 14:25:37, Oscar Salvador wrote:
>>> On Thu, Sep 27, 2018 at 01:09:26PM +0200, Michal Hocko wrote:
> So there were a few things I wasn't sure we could pull outside of the
> hotplug lock. One specific example is the bits related to resizing the 
> pgdat
> and zone. I wanted to avoid pulling those bits outside of the hotplug 
> lock.

 Why would that be a problem. There are dedicated locks for resizing.
>>>
>>> True is that move_pfn_range_to_zone() manages the locks for pgdat/zone 
>>> resizing,
>>> but it also takes care of calling init_currently_empty_zone() in case the 
>>> zone is empty.
>>> Could not that be a problem if we take move_pfn_range_to_zone() out of the 
>>> lock?
>>
>> I would have to double check but is the hotplug lock really serializing
>> access to the state initialized by init_currently_empty_zone? E.g.
>> zone_start_pfn is a nice example of a state that is used outside of the
>> lock. zone's free lists are similar. So do we really need the hotplug
>> lock? And more broadly, what is the hotplug lock supposed to
>> serialize in general? Proper documentation would surely help to answer
>> these questions. There is way too much of "do not touch this code and
>> just make my particular hack" mindset which made the whole memory
>> hotplug a giant pile of mess. We really should start with some proper
>> engineering here finally.
> 
> CC David
> 
> David has been looking into this lately, he even has updated 
> memory-hotplug.txt
> with some more documentation about the locking aspect [1].
> And with this change [2], the hotplug lock has been moved
> to the online/offline_pages.
> 
> From what I see (I might be wrong), the hotplug lock is there
> to serialize the online/offline operations.

mem_hotplug_lock is especially relevant for users of
get_online_mems/put_online_mems. Whatever affects them, you can't move
out of the lock.

Everything else is theoretically serialized via device_hotplug_lock now.

> 
> In online_pages, we do (among other things):
> 
> a) initialize the zone and its pages, and link them to the zone
> b) re-adjust zone/pgdat nr of pages (present, spanned, managed)
> c) check if the node changes with regard to N_MEMORY, N_HIGH_MEMORY or
> N_NORMAL_MEMORY.
> d) fire notifiers
> e) rebuild the zonelists in case we got a new zone
> f) online memory sections and free the pages to the buddy allocator
> g) wake up kswapd/kcompactd in case we got a new node
> 
> while in offline_pages we do the opposite.
> 
> Hotplug lock here serializes the operations as a whole, online and offline 
> memory,
> so they do not step on each other's feet.
> 
> Having said that, we might be able to move some of those operations out of 
> the hotplug lock.
> The device_hotplug_lock coming from every memblock (which is taken in 
> device_online/device_offline) should protect
> us against some operations being made on the same memblock (e.g: touching the 
> same pages).

Yes, very right.


-- 

Thanks,

David / dhildenb


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 03:13:29PM +0200, Michal Hocko wrote:
> On Thu 27-09-18 14:25:37, Oscar Salvador wrote:
> > On Thu, Sep 27, 2018 at 01:09:26PM +0200, Michal Hocko wrote:
> > > > So there were a few things I wasn't sure we could pull outside of the
> > > > hotplug lock. One specific example is the bits related to resizing the 
> > > > pgdat
> > > > and zone. I wanted to avoid pulling those bits outside of the hotplug 
> > > > lock.
> > > 
> > > Why would that be a problem. There are dedicated locks for resizing.
> > 
> > True is that move_pfn_range_to_zone() manages the locks for pgdat/zone 
> > resizing,
> > but it also takes care of calling init_currently_empty_zone() in case the 
> > zone is empty.
> > Could not that be a problem if we take move_pfn_range_to_zone() out of the 
> > lock?
> 
> I would have to double check but is the hotplug lock really serializing
> access to the state initialized by init_currently_empty_zone? E.g.
> zone_start_pfn is a nice example of a state that is used outside of the
> lock. zone's free lists are similar. So do we really need the hotplug
> lock? And more broadly, what is the hotplug lock supposed to
> serialize in general? Proper documentation would surely help to answer
> these questions. There is way too much of "do not touch this code and
> just make my particular hack" mindset which made the whole memory
> hotplug a giant pile of mess. We really should start with some proper
> engineering here finally.

CC David

David has been looking into this lately, he even has updated memory-hotplug.txt
with some more documentation about the locking aspect [1].
And with this change [2], the hotplug lock has been moved
to the online/offline_pages.

From what I see (I might be wrong), the hotplug lock is there
to serialize the online/offline operations.

In online_pages, we do (among other things):

a) initialize the zone and its pages, and link them to the zone
b) re-adjust zone/pgdat nr of pages (present, spanned, managed)
c) check if the node changes with regard to N_MEMORY, N_HIGH_MEMORY or
N_NORMAL_MEMORY.
d) fire notifiers
e) rebuild the zonelists in case we got a new zone
f) online memory sections and free the pages to the buddy allocator
g) wake up kswapd/kcompactd in case we got a new node

while in offline_pages we do the opposite.

Hotplug lock here serializes the operations as a whole, online and offline 
memory,
so they do not step on each other's feet.

Having said that, we might be able to move some of those operations out of the 
hotplug lock.
The device_hotplug_lock coming from every memblock (which is taken in 
device_online/device_offline) should protect
us against some operations being made on the same memblock (e.g: touching the 
same pages).

For the given case about move_pfn_range_to_zone() I have my doubts.
The resizing of the pgdat/zone is already serialized, so no discussion there.

Then we have memmap_init_zone. I __think__ that should not worry us because
those pages belong to a memblock, and device_online/device_offline is
serialized. HMM/devm is different here, as they do not handle memblocks.

And then we have init_currently_empty_zone.
Assuming that the shrinking of pages/removal of the zone is finally brought to
the offline stage (where it should be), I am not sure if we can somehow
race with an offline operation there if we do not hold the hotplug lock.

[1] https://patchwork.kernel.org/patch/10617715/
[2] https://patchwork.kernel.org/patch/10617705/
-- 
Oscar Salvador
SUSE L3
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread Michal Hocko
On Thu 27-09-18 14:25:37, Oscar Salvador wrote:
> On Thu, Sep 27, 2018 at 01:09:26PM +0200, Michal Hocko wrote:
> > > So there were a few things I wasn't sure we could pull outside of the
> > > hotplug lock. One specific example is the bits related to resizing the
> > > pgdat and zone. I wanted to avoid pulling those bits outside of the
> > > hotplug lock.
> > 
> > Why would that be a problem. There are dedicated locks for resizing.
> 
> True is that move_pfn_range_to_zone() manages the locks for pgdat/zone
> resizing, but it also takes care of calling init_currently_empty_zone() in
> case the zone is empty.
> Could not that be a problem if we take move_pfn_range_to_zone() out of the
> lock?

I would have to double check but is the hotplug lock really serializing
access to the state initialized by init_currently_empty_zone? E.g.
zone_start_pfn is a nice example of a state that is used outside of the
lock. zone's free lists are similar. So do we really need the hotplug
lock? And more broadly, what is the hotplug lock supposed to
serialize in general? Proper documentation would surely help to answer
these questions. There is way too much of a "do not touch this code and
just make my particular hack work" mindset, which has made the whole
memory hotplug code a giant pile of mess. We really should start with
some proper engineering here finally.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread Oscar Salvador
On Wed, Sep 26, 2018 at 11:25:37AM -0700, Alexander Duyck wrote:
> With that said I am open to suggestions if you still feel like I need to
> follow this up with some additional work. I just want to avoid introducing
> any regressions in regards to functionality or performance.

Hi Alexander,

the problem I see is that devm/hmm is using some of the memory-hotplug
features, but their paths are diverging more and more with changes
like this, and that is sometimes a problem when we need to change
something in the generic memory-hotplug code.

E.g.: I am trying to fix two issues in the memory-hotplug code where we can
access stale pages if we hot-remove memory before onlining it.
That was not so difficult to fix, but I really struggled with the exceptions
that HMM/devm represent in this regard, for instance regarding the resources.

The RFCv2 can be found here [1], and the initial discussion with Jerome
Glisse can be found here [2].

So it would be great to stick to the memory-hotplug path as much as possible;
otherwise, when a problem arises, we need to think about how we can work
around HMM/devm.

[1] https://patchwork.kernel.org/patch/10569083/
[2] https://patchwork.kernel.org/patch/10558725/

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 01:09:26PM +0200, Michal Hocko wrote:
> > So there were a few things I wasn't sure we could pull outside of the
> > hotplug lock. One specific example is the bits related to resizing the pgdat
> > and zone. I wanted to avoid pulling those bits outside of the hotplug lock.
> 
> Why would that be a problem. There are dedicated locks for resizing.

True is that move_pfn_range_to_zone() manages the locks for pgdat/zone resizing,
but it also takes care of calling init_currently_empty_zone() in case the zone 
is empty.
Could not that be a problem if we take move_pfn_range_to_zone() out of the lock?

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread Michal Hocko
On Wed 26-09-18 11:52:56, Dan Williams wrote:
[...]
> Could we push the hotplug lock deeper to the places that actually need
> it? What I found with my initial investigation is that we don't even
> need the hotplug lock for the vmemmap initialization with this patch
> [1].

Yes, the scope of the hotplug lock should be evaluated and _documented_.
Then we can build on top.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-27 Thread Michal Hocko
On Wed 26-09-18 11:25:37, Alexander Duyck wrote:
> 
> 
> On 9/26/2018 12:55 AM, Michal Hocko wrote:
> > On Tue 25-09-18 13:21:24, Alexander Duyck wrote:
> > > The ZONE_DEVICE pages were being initialized in two locations. One was
> > > with the memory_hotplug lock held and another was outside of that lock.
> > > The problem with this is that it was nearly doubling the memory
> > > initialization time. Instead of doing this twice, once while holding a
> > > global lock and once without, I am opting to defer the initialization to
> > > the one outside of the lock. This allows us to avoid serializing the
> > > overhead for memory init and we can instead focus on per-node init times.
> > > 
> > > One issue I encountered is that devm_memremap_pages and
> > > hmm_devmem_pages_create were initializing only the pgmap field the same
> > > way. One wasn't initializing hmm_data, and the other was initializing it
> > > to a poison value. Since this is something that is exposed to the driver
> > > in the case of hmm I am opting for a third option and just initializing
> > > hmm_data to 0 since this is going to be exposed to unknown third party
> > > drivers.
> > 
> > Why cannot you pull move_pfn_range_to_zone out of the hotplug lock? In
> > other words why are you making zone device even more special in the
> > generic hotplug code when it already has its own means to initialize the
> > pfn range by calling move_pfn_range_to_zone. Not to mention the code
> > duplication.
> 
> So there were a few things I wasn't sure we could pull outside of the
> hotplug lock. One specific example is the bits related to resizing the pgdat
> and zone. I wanted to avoid pulling those bits outside of the hotplug lock.

Why would that be a problem. There are dedicated locks for resizing.

> The other bit that I left inside the hot-plug lock with this approach was
> the initialization of the pages that contain the vmemmap.

Again, why this is needed?

> > That being said I really dislike this patch.
> 
> In my mind this was a patch that "killed two birds with one stone". I had
> two issues to address, the first one being the fact that we were performing
> the memmap_init_zone while holding the hotplug lock, and the other being the
> loop that was going through and initializing pgmap in the hmm and memremap
> calls essentially added another 20 seconds (measured for 3TB of memory per
> node) to the init time. With this patch I was able to cut my init time per
> node by that 20 seconds, and then made it so that we could scale as we added
> nodes as they could run in parallel.
> 
> With that said I am open to suggestions if you still feel like I need to
> follow this up with some additional work. I just want to avoid introducing
> any regressions in regards to functionality or performance.

Yes, I really do prefer this to be done properly rather than tweak it
around because of uncertainties.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-26 Thread Dan Williams
On Wed, Sep 26, 2018 at 11:25 AM Alexander Duyck
 wrote:
>
>
>
> On 9/26/2018 12:55 AM, Michal Hocko wrote:
> > On Tue 25-09-18 13:21:24, Alexander Duyck wrote:
> >> The ZONE_DEVICE pages were being initialized in two locations. One was with
> >> the memory_hotplug lock held and another was outside of that lock. The
> >> problem with this is that it was nearly doubling the memory initialization
> >> time. Instead of doing this twice, once while holding a global lock and
> >> once without, I am opting to defer the initialization to the one outside of
> >> the lock. This allows us to avoid serializing the overhead for memory init
> >> and we can instead focus on per-node init times.
> >>
> >> One issue I encountered is that devm_memremap_pages and
> >> hmm_devmem_pages_create were initializing only the pgmap field the same
> >> way. One wasn't initializing hmm_data, and the other was initializing it to
> >> a poison value. Since this is something that is exposed to the driver in
> >> the case of hmm I am opting for a third option and just initializing
> >> hmm_data to 0 since this is going to be exposed to unknown third party
> >> drivers.
> >
> > Why cannot you pull move_pfn_range_to_zone out of the hotplug lock? In
> > other words why are you making zone device even more special in the
> > generic hotplug code when it already has its own means to initialize the
> > pfn range by calling move_pfn_range_to_zone. Not to mention the code
> > duplication.
>
> So there were a few things I wasn't sure we could pull outside of the
> hotplug lock. One specific example is the bits related to resizing the
> pgdat and zone. I wanted to avoid pulling those bits outside of the
> hotplug lock.
>
> The other bit that I left inside the hot-plug lock with this approach
> was the initialization of the pages that contain the vmemmap.
>
> > That being said I really dislike this patch.
>
> In my mind this was a patch that "killed two birds with one stone". I
> had two issues to address, the first one being the fact that we were
> performing the memmap_init_zone while holding the hotplug lock, and the
> other being the loop that was going through and initializing pgmap in
> the hmm and memremap calls essentially added another 20 seconds
> (measured for 3TB of memory per node) to the init time. With this patch
> I was able to cut my init time per node by that 20 seconds, and then
> made it so that we could scale as we added nodes as they could run in
> parallel.

Yeah, at the very least there is no reason for devm_memremap_pages()
to do another loop through all pages, the core should handle this, but
cleaning up the scope of the hotplug lock is needed.

> With that said I am open to suggestions if you still feel like I need to
> follow this up with some additional work. I just want to avoid
> introducing any regressions in regards to functionality or performance.

Could we push the hotplug lock deeper to the places that actually need
it? What I found with my initial investigation is that we don't even
need the hotplug lock for the vmemmap initialization with this patch
[1].

Alternatively it seems the hotplug lock wants to synchronize changes
to the zone and the page init work. If the hotplug lock was an rwsem
the zone changes would be a write lock, but the init work could be
done as a read lock to allow parallelism. I.e. still provide a sync
point to be able to assert that no hotplug work is in-flight while
holding the write lock, but otherwise allow threads that are touching
independent parts of the memmap to run at the same time.

[1]: https://patchwork.kernel.org/patch/10527229/ just focus on the
mm/sparse-vmemmap.c changes at the end.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-26 Thread Alexander Duyck




On 9/26/2018 12:55 AM, Michal Hocko wrote:
> On Tue 25-09-18 13:21:24, Alexander Duyck wrote:
>> The ZONE_DEVICE pages were being initialized in two locations. One was with
>> the memory_hotplug lock held and another was outside of that lock. The
>> problem with this is that it was nearly doubling the memory initialization
>> time. Instead of doing this twice, once while holding a global lock and
>> once without, I am opting to defer the initialization to the one outside of
>> the lock. This allows us to avoid serializing the overhead for memory init
>> and we can instead focus on per-node init times.
>>
>> One issue I encountered is that devm_memremap_pages and
>> hmm_devmem_pages_create were initializing only the pgmap field the same
>> way. One wasn't initializing hmm_data, and the other was initializing it to
>> a poison value. Since this is something that is exposed to the driver in
>> the case of hmm I am opting for a third option and just initializing
>> hmm_data to 0 since this is going to be exposed to unknown third party
>> drivers.
>
> Why cannot you pull move_pfn_range_to_zone out of the hotplug lock? In
> other words why are you making zone device even more special in the
> generic hotplug code when it already has its own means to initialize the
> pfn range by calling move_pfn_range_to_zone. Not to mention the code
> duplication.

So there were a few things I wasn't sure we could pull outside of the 
hotplug lock. One specific example is the bits related to resizing the 
pgdat and zone. I wanted to avoid pulling those bits outside of the 
hotplug lock.

The other bit that I left inside the hot-plug lock with this approach 
was the initialization of the pages that contain the vmemmap.

> That being said I really dislike this patch.


In my mind this was a patch that "killed two birds with one stone". I 
had two issues to address, the first one being the fact that we were 
performing the memmap_init_zone while holding the hotplug lock, and the 
other being the loop that was going through and initializing pgmap in 
the hmm and memremap calls essentially added another 20 seconds 
(measured for 3TB of memory per node) to the init time. With this patch 
I was able to cut my init time per node by that 20 seconds, and then 
made it so that we could scale as we added nodes as they could run in 
parallel.


With that said I am open to suggestions if you still feel like I need to 
follow this up with some additional work. I just want to avoid 
introducing any regressions in regards to functionality or performance.


Thanks.

- Alex




Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-26 Thread Michal Hocko
On Tue 25-09-18 13:21:24, Alexander Duyck wrote:
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
> 
> One issue I encountered is that devm_memremap_pages and
> hmm_devmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.

Why cannot you pull move_pfn_range_to_zone out of the hotplug lock? In
other words why are you making zone device even more special in the
generic hotplug code when it already has its own means to initialize the
pfn range by calling move_pfn_range_to_zone. Not to mention the code
duplication.

That being said I really dislike this patch.

> Reviewed-by: Pavel Tatashin 
> Signed-off-by: Alexander Duyck 
> ---
> 
> v4: Moved memmap_init_zone_device to below memmap_init_zone to avoid
> merge conflicts with other changes in the kernel.
> v5: No change
> 
>  include/linux/mm.h |2 +
>  kernel/memremap.c  |   24 +-
>  mm/hmm.c   |   12 ---
>  mm/page_alloc.c|   92 ++--
>  4 files changed, 107 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06d7d7576f8d..7312fb78ef31 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
>  {
>   return page_zonenum(page) == ZONE_DEVICE;
>  }
> +extern void memmap_init_zone_device(struct zone *, unsigned long,
> + unsigned long, struct dev_pagemap *);
>  #else
>  static inline bool is_zone_device_page(const struct page *page)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 5b8600d39931..d0c32e473f82 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   struct vmem_altmap *altmap = pgmap->altmap_valid ?
>   &pgmap->altmap : NULL;
>   struct resource *res = &pgmap->res;
> - unsigned long pfn, pgoff, order;
> + struct dev_pagemap *conflict_pgmap;
>   pgprot_t pgprot = PAGE_KERNEL;
> + unsigned long pgoff, order;
>   int error, nid, is_ram;
> - struct dev_pagemap *conflict_pgmap;
>  
>   align_start = res->start & ~(SECTION_SIZE - 1);
>   align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> @@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   if (error)
>   goto err_add_memory;
>  
> - for_each_device_pfn(pfn, pgmap) {
> - struct page *page = pfn_to_page(pfn);
> -
> - /*
> -  * ZONE_DEVICE pages union ->lru with a ->pgmap back
> -  * pointer.  It is a bug if a ZONE_DEVICE page is ever
> -  * freed or placed on a driver-private list.  Seed the
> -  * storage with LIST_POISON* values.
> -  */
> - list_del(&page->lru);
> - page->pgmap = pgmap;
> - percpu_ref_get(pgmap->ref);
> - }
> + /*
> +  * Initialization of the pages has been deferred until now in order
> +  * to allow us to do the work while not holding the hotplug lock.
> +  */
> + memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> + align_start >> PAGE_SHIFT,
> + align_size >> PAGE_SHIFT, pgmap);
>  
>   devm_add_action(dev, devm_memremap_pages_release, pgmap);
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c968e49f7a0c..774d684fa2b4 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>   resource_size_t key, align_start, align_size, align_end;
>   struct device *device = devmem->device;
>   int ret, nid, is_ram;
> - unsigned long pfn;
>  
>   align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
>   align_size = ALIGN(devmem->resource->start +
> @@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>   align_size >> PAGE_SHIFT, NULL);
>   mem_hotplug_done();
>  
> - for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> -