About remaining issues of FS-DAX

2018-10-29 Thread y-g...@fujitsu.com


Hello,

I noticed the discussion about what is necessary for FS-DAX
in the following thread on the xfs/ext4 mailing lists:

https://patchwork.kernel.org/cover/10631363/

(I am not subscribed to the xfs or ext4 lists, so I did not notice it at first.)

Though these points may have been discussed already,
I would like to offer my humble opinion on them from a normal user's point of view
(or as a technical support engineer).

I hope this helps make progress toward removing the "Experimental" status of FS-DAX.

From Dave Chinner
-
> I think we need to decide on:
> 
> - default filesystem behaviour on dax-capable block devices

My first impression of filesystem DAX was that the "-o dax" mount option is a bit
strange, because the system administrator already designates the fs-dax capable area
with ndctl create-namespace. I could not see why he/she needs to specify dax again
at mount time; it is the second time fsdax is being specified, so I thought it is
not needed.

So, I think the filesystem should detect an FS-DAX capable area and enable DAX
automatically. That is easier for administrators to use.
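
As a rough illustration of the two-step flow described above (a minimal sketch;
the device name, mount point, and filesystem type are only examples, and the call
needs root privileges), today the administrator expresses the DAX intent twice,
once at namespace creation time and once again at mount time:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Assumes /dev/pmem0 was already set up as an fsdax-mode namespace
	 * (e.g. with "ndctl create-namespace --mode=fsdax") and formatted
	 * with xfs.  Even so, DAX has to be requested a second time here
	 * via the filesystem-specific "dax" option string.
	 */
	if (mount("/dev/pmem0", "/mnt/pmem", "xfs", 0, "dax") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}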


> - what information about DAX do applications actually need? What
>   makes sense to provide them with that information?

I suppose that an application should not have to grep the dax mount option to
confirm that "fs-dax is enabled".
Instead, the OS should give the application a reliable way to know whether
fs-dax is actually enabled or not.
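
One way an application can already get a definite answer (a minimal sketch, not
something proposed in this thread) is to probe with MAP_SYNC, which the kernel
only accepts on mappings that can honor DAX-style synchronous page faults:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Returns 1 if 'path' supports MAP_SYNC (DAX semantics), 0 otherwise. */
static int file_supports_map_sync(const char *path, size_t len)
{
	void *addr;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return 0;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	close(fd);
	if (addr == MAP_FAILED)
		return 0;	/* e.g. EOPNOTSUPP: no DAX/MAP_SYNC here */

	munmap(addr, len);
	return 1;
}

int main(void)
{
	/* The path below is only an example. */
	printf("MAP_SYNC supported: %d\n",
	       file_supports_map_sync("/mnt/pmem/testfile", 4096));
	return 0;
}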

> - how to provide hints to the kernel for desired behaviour
>  - on-disk inode flags, or something else?
>  - dax/nodax mount options or root dir inode flags become default
>global hints?
>  - is a single hint flag sufficient or do we also need an
>explicit "do not use dax" flag?

Is there any reason each inode must remember a dax flag?
It seems simpler for only the root directory inode to know it, as a global hint.

> - behaviour of MAP_SYNC w.r.t. non-DAX filesystems that can provide
>   required MAP_SYNC semnatics

Hmm, if there is a use case for MAP_SYNC on a non-DAX filesystem,
it should be allowed.
But I could not find any positive reason for it.


> - behaviour of MAP_DIRECT - hint like O_DIRECT or guarantee?

I would like to bet on "guarantee".
If the system cannot provide the feature the user specified,
then I suppose the system should return an error to report that.

If there is no way to guarantee it for technical reasons, and the system
chooses to fall back from MAP_DIRECT, then I think the system should notify
the user, or log the event somewhere, so the user can detect that fallback.
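
For example (a sketch only; MAP_DIRECT is a flag proposed in this discussion, not
an existing API, so both the flag and its value below are hypothetical), "guarantee"
semantics would let the application detect the situation immediately:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT 0x400000	/* hypothetical value, illustration only */
#endif

/* 'fd' and 'len' come from the application; NULL means DAX is not in effect. */
static void *map_direct_or_fail(int fd, size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);

	if (addr == MAP_FAILED) {
		/*
		 * With "guarantee" semantics the kernel refuses the mapping
		 * rather than silently falling back, so the caller knows
		 * right away instead of discovering it later.
		 */
		perror("mmap(MAP_DIRECT)");
		return NULL;
	}
	return addr;
}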

> - default read/write path behaviour of dax-capable block devices
>   - automatically bypass the pagecache if bdev is capable?

I prefer automatically bypassing the page cache, so that it is easy for users to use...


> - default mmap behaviour on dax capable devices
>   - use dax always?

I suppose "always" is easy for user to understand.
If not, I suppose document should describe what is necessary to use dax.


> - DAX vs get_user_pages_longterm
>   - turns off DAX dynamically?

If turning DAX off cannot be avoided, a notification of that event will be necessary
for software using DAX and/or for the system administrator.

>   - how do DAX-enabled filesystems interact with page fault capable
> hardware? Can we allow DAX in those cases?

I have no comment here.
(To be honest, I don't understand what the problem is...
 I would be glad if someone could explain it to me.)

From Dan Williams
---
> - Is MADV_DIRECT_ACCESS a hint or a requirement?

If MADV_DIRECT_ACCESS becomes a hint, how and when would FS-DAX fall back?
In that case, software would need to change its behavior and call msync()
instead of flushing CPU caches to make its data persistent.

So, I suppose a requirement is better.
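
To illustrate the behavioral difference described above (a minimal sketch;
have_sync_mapping is an assumed application-level flag, addr is assumed to be
suitably aligned, and _mm_clwb needs a CPU with CLWB plus compiling with -mclwb):

#include <immintrin.h>
#include <stddef.h>
#include <sys/mman.h>

#define CACHE_LINE	64

/* Make the stores in [addr, addr + len) durable in a shared file mapping. */
static void persist_range(void *addr, size_t len, int have_sync_mapping)
{
	if (have_sync_mapping) {
		/*
		 * DAX mapping established with MAP_SYNC: flushing CPU caches
		 * is enough, no system call is required.
		 */
		char *p = addr;
		size_t off;

		for (off = 0; off < len; off += CACHE_LINE)
			_mm_clwb(p + off);
		_mm_sfence();
	} else {
		/* No DAX/MAP_SYNC guarantee: fall back to msync(). */
		msync(addr, len, MS_SYNC);
	}
}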


> - How does the kernel communicate the effective mode of a mapping
>   taking into account madvise(), inode flags, mount options, and / or
>   default fs behavior? New madvice() syscall?

Hmm, sysfs?


> - What is the behavior of dax in the presence of reflink'd extents?
>   Just failing seems the 'experimental' behavior. What to do about
>   page->index when page belongs to more than 1 file via reflink?

Sorry, I don't have a good idea about this.
To be honest, I'm not sure why reflink is necessary for NVDIMM...


> - Is there ever a case to force disable dax operation? To date we've
>   only ever thought about interfaces to force *enable* dax operation

If something goes wrong with dax, the administrator may want to forcibly disable
dax as an emergency mode and try to rescue the data in the filesystem.

Maybe a "-o nodax" option?

> - The virtio-pmem use case wants dax mappings but requires an explicit
>   fsync() instead of MAP_SYNC to flush software buffers, it's a DAX
>   sub-set, should it have it's own name?

Is this just a naming problem?
"DAX with sync"?

> - DAX operation is loosely tied to block devices. There has been
>   discussions of mounting filesystems on /dev/dax devices directly.
>   Should we take that to its logical conclusion and support a
>   block-layer-less conversion of dax-capable file systems?

To be honest, I'm not sure why this is necessary.
Could someone share the use case?
It see

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > [...]
> > > > > My recollection was that we do clear the reserved bit in
> > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. But 
> > > > > then
> > > > > we set the bit back right afterwards. This seems to be the case since
> > > > > d0dc12e86b319 which reorganized the code. I have to study this some 
> > > > > more
> > > > > obviously.
> > > > 
> > > > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > > > much because __init_single_page wouldn't zero out the struct page for
> > > > the hotplug contex. A comment in move_pfn_range_to_zone explains that we
> > > > want the reserved bit because pfn walkers already do see the pfn range
> > > > and the page is not fully associated with the zone until it is onlined.
> > > > 
> > > > I am thinking that we might be overzealous here. With the full state
> > > > initialized we shouldn't actually care. pfn_to_online_page should return
> > > > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > > > touch pages they do not recognize and a plain page with ref. count 1
> > > > doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> > > > reserved bit setting here.
> > > 
> > > So this has me a bit hesitant to want to just drop the bit entirely. If
> > > nothing else I think I may wan to make that a patch onto itself so that if
> > > we aren't going to set it we just drop it there. That way if it does cause
> > > issues we can bisect it to that patch and pinpoint the cause.
> > 
> > Yes a patch on its own make sense for bisectability.
> 
> For now I think I am going to back off of this. There is a bunch of other
> changes that need to happen in order for us to make this work. As far as I
> can tell there are several places that are relying on this reserved bit.

Please be more specific. Unless I misremember, I added this
PageReserved just to be sure (f1dd2cd13c4bb) because pages were only
half initialized back then. I am not aware of anybody depending on this.
If there is somebody, then be explicit about it. The last thing I want
to see is to preserve a cargo cult and build a design around it.

> > > > Regarding the post initialization required by devm_memremap_pages and
> > > > potentially others. Can we update the altmap which is already a way how
> > > > to get alternative struct pages by a constructor which we could call
> > > > from memmap_init_zone and do the post initialization? This would reduce
> > > > the additional loop in the caller while it would still fit the overall
> > > > design of the altmap and the core hotplug doesn't have to know anything
> > > > about DAX or whatever needs a special treatment.
> > > > 
> > > > Does that make any sense?
> > > 
> > > I think the only thing that is currently using the altmap is the 
> > > ZONE_DEVICE
> > > memory init. Specifically I think it is only really used by the
> > > devm_memremap_pages version of things, and then only under certain
> > > circumstances. Also the HMM driver doesn't pass an altmap. What we would
> > > really need is a non-ZONE_DEVICE users of the altmap to really justify
> > > sticking with that as the preferred argument to pass.
> > 
> > I am not aware of any upstream HMM user so I am not sure what are the
> > expectations there. But I thought that ZONE_DEVICE users use altmap. If
> > that is not generally true then we certainly have to think about a
> > better interface.
> 
> I'm just basing my statement on the use of the move_pfn_range_to_zone call.
> The only caller that is actually passing the altmap is devm_memremap_pages
> and if I understand things correctly that is only used when we want to stare
> the vmmemmap on the same memory that we just hotplugged.

Yes, and that is what I've called the allocator callback earlier.

> That is why it made more sense to me to just create a ZONE_DEVICE specific
> function for handling the page initialization because the one value I do
> have to pass is the dev_pagemap in both HMM and memremap case, and that has
> the altmap already embedded inside of it.

And I have argued that this is the wrong approach to the problem. If you
need a very specific struct page initialization, then create an init
(constructor) callback.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Dan Williams
On Wed, Oct 17, 2018 at 8:02 AM Alexander Duyck
 wrote:
>
> On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> >> On 10/11/2018 1:55 AM, Michal Hocko wrote:
> >>> On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> >>> [...]
>  My recollection was that we do clear the reserved bit in
>  move_pfn_range_to_zone and we indeed do in __init_single_page. But then
>  we set the bit back right afterwards. This seems to be the case since
>  d0dc12e86b319 which reorganized the code. I have to study this some more
>  obviously.
> >>>
> >>> so my recollection was wrong and d0dc12e86b319 hasn't really changed
> >>> much because __init_single_page wouldn't zero out the struct page for
> >>> the hotplug contex. A comment in move_pfn_range_to_zone explains that we
> >>> want the reserved bit because pfn walkers already do see the pfn range
> >>> and the page is not fully associated with the zone until it is onlined.
> >>>
> >>> I am thinking that we might be overzealous here. With the full state
> >>> initialized we shouldn't actually care. pfn_to_online_page should return
> >>> NULL regardless of the reserved bit and normal pfn walkers shouldn't
> >>> touch pages they do not recognize and a plain page with ref. count 1
> >>> doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> >>> reserved bit setting here.
> >>
> >> So this has me a bit hesitant to want to just drop the bit entirely. If
> >> nothing else I think I may wan to make that a patch onto itself so that if
> >> we aren't going to set it we just drop it there. That way if it does cause
> >> issues we can bisect it to that patch and pinpoint the cause.
> >
> > Yes a patch on its own make sense for bisectability.
>
> For now I think I am going to back off of this. There is a bunch of
> other changes that need to happen in order for us to make this work. As
> far as I can tell there are several places that are relying on this
> reserved bit.

When David Hildenbrand and I looked, it was only the hibernation code
that we thought needed changing. We either need to audit the removal
or go back to adding a special-case hack for KVM, because this is a
blocking issue for them.

What do you see beyond the hibernation change?


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 08:49:46, Dan Williams wrote:
> On Wed, Oct 17, 2018 at 8:02 AM Alexander Duyck
>  wrote:
> >
> > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > >> On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > >>> On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > >>> [...]
> >  My recollection was that we do clear the reserved bit in
> >  move_pfn_range_to_zone and we indeed do in __init_single_page. But then
> >  we set the bit back right afterwards. This seems to be the case since
> >  d0dc12e86b319 which reorganized the code. I have to study this some 
> >  more
> >  obviously.
> > >>>
> > >>> so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > >>> much because __init_single_page wouldn't zero out the struct page for
> > >>> the hotplug contex. A comment in move_pfn_range_to_zone explains that we
> > >>> want the reserved bit because pfn walkers already do see the pfn range
> > >>> and the page is not fully associated with the zone until it is onlined.
> > >>>
> > >>> I am thinking that we might be overzealous here. With the full state
> > >>> initialized we shouldn't actually care. pfn_to_online_page should return
> > >>> NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > >>> touch pages they do not recognize and a plain page with ref. count 1
> > >>> doesn't tell much to anybody. So I _suspect_ that we can simply drop the
> > >>> reserved bit setting here.
> > >>
> > >> So this has me a bit hesitant to want to just drop the bit entirely. If
> > >> nothing else I think I may wan to make that a patch onto itself so that 
> > >> if
> > >> we aren't going to set it we just drop it there. That way if it does 
> > >> cause
> > >> issues we can bisect it to that patch and pinpoint the cause.
> > >
> > > Yes a patch on its own make sense for bisectability.
> >
> > For now I think I am going to back off of this. There is a bunch of
> > other changes that need to happen in order for us to make this work. As
> > far as I can tell there are several places that are relying on this
> > reserved bit.
> 
> When David Hildebrand and I looked it was only the hibernation code
> that we thought needed changing.

More details please?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 15:12 +0100, Michal Hocko wrote:
> On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > > [...]
> > > > > > My recollection was that we do clear the reserved bit in
> > > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. But 
> > > > > > then
> > > > > > we set the bit back right afterwards. This seems to be the case 
> > > > > > since
> > > > > > d0dc12e86b319 which reorganized the code. I have to study this some 
> > > > > > more
> > > > > > obviously.
> > > > > 
> > > > > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > > > > much because __init_single_page wouldn't zero out the struct page for
> > > > > the hotplug contex. A comment in move_pfn_range_to_zone explains that 
> > > > > we
> > > > > want the reserved bit because pfn walkers already do see the pfn range
> > > > > and the page is not fully associated with the zone until it is 
> > > > > onlined.
> > > > > 
> > > > > I am thinking that we might be overzealous here. With the full state
> > > > > initialized we shouldn't actually care. pfn_to_online_page should 
> > > > > return
> > > > > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > > > > touch pages they do not recognize and a plain page with ref. count 1
> > > > > doesn't tell much to anybody. So I _suspect_ that we can simply drop 
> > > > > the
> > > > > reserved bit setting here.
> > > > 
> > > > So this has me a bit hesitant to want to just drop the bit entirely. If
> > > > nothing else I think I may wan to make that a patch onto itself so that 
> > > > if
> > > > we aren't going to set it we just drop it there. That way if it does 
> > > > cause
> > > > issues we can bisect it to that patch and pinpoint the cause.
> > > 
> > > Yes a patch on its own make sense for bisectability.
> > 
> > For now I think I am going to back off of this. There is a bunch of other
> > changes that need to happen in order for us to make this work. As far as I
> > can tell there are several places that are relying on this reserved bit.
> 
> Please be more specific. Unless I misremember, I have added this
> PageReserved just to be sure (f1dd2cd13c4bb) because pages where just
> half initialized back then. I am not aware anybody is depending on this.
> If there is somebody then be explicit about that. The last thing I want
> to see is to preserve a cargo cult and build a design around it.

It is mostly just a matter of going through and auditing all the
places that are using PageReserved to identify pages that they aren't
supposed to touch for whatever reason.

From what I can tell, the issue appears to be that the reserved bit is
used to identify whether a region of memory is "online" or "offline".
So, for example, the call "online_pages_range" doesn't invoke the
online_page_callback unless the first pfn in the range is marked as
reserved.

Another example Dan had pointed out was the saveable_page function in
kernel/power/snapshot.c.

> > > > > Regarding the post initialization required by devm_memremap_pages and
> > > > > potentially others. Can we update the altmap which is already a way 
> > > > > how
> > > > > to get alternative struct pages by a constructor which we could call
> > > > > from memmap_init_zone and do the post initialization? This would 
> > > > > reduce
> > > > > the additional loop in the caller while it would still fit the overall
> > > > > design of the altmap and the core hotplug doesn't have to know 
> > > > > anything
> > > > > about DAX or whatever needs a special treatment.
> > > > > 
> > > > > Does that make any sense?
> > > > 
> > > > I think the only thing that is currently using the altmap is the 
> > > > ZONE_DEVICE
> > > > memory init. Specifically I think it is only really used by the
> > > > devm_memremap_pages version of things, and then only under certain
> > > > circumstances. Also the HMM driver doesn't pass an altmap. What we would
> > > > really need is a non-ZONE_DEVICE users of the altmap to really justify
> > > > sticking with that as the preferred argument to pass.
> > > 
> > > I am not aware of any upstream HMM user so I am not sure what are the
> > > expectations there. But I thought that ZONE_DEVICE users use altmap. If
> > > that is not generally true then we certainly have to think about a
> > > better interface.
> > 
> > I'm just basing my statement on the use of the move_pfn_range_to_zone call.
> > The only caller that is actually passing the altmap is devm_memremap_pages
> > and if I understand things correctly that is only used when we want to stare
> > the vmmemmap on the same memory that we just hotplugged.
> 
> Yes, and that is what I've called as allocator callback earlier.

I am really not a fan of the callback approach. It just means we will
have t

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 15:12 +0100, Michal Hocko wrote:
> > On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> > > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > My recollection was that we do clear the reserved bit in
> > > > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. 
> > > > > > > But then
> > > > > > > we set the bit back right afterwards. This seems to be the case 
> > > > > > > since
> > > > > > > d0dc12e86b319 which reorganized the code. I have to study this 
> > > > > > > some more
> > > > > > > obviously.
> > > > > > 
> > > > > > so my recollection was wrong and d0dc12e86b319 hasn't really changed
> > > > > > much because __init_single_page wouldn't zero out the struct page 
> > > > > > for
> > > > > > the hotplug contex. A comment in move_pfn_range_to_zone explains 
> > > > > > that we
> > > > > > want the reserved bit because pfn walkers already do see the pfn 
> > > > > > range
> > > > > > and the page is not fully associated with the zone until it is 
> > > > > > onlined.
> > > > > > 
> > > > > > I am thinking that we might be overzealous here. With the full state
> > > > > > initialized we shouldn't actually care. pfn_to_online_page should 
> > > > > > return
> > > > > > NULL regardless of the reserved bit and normal pfn walkers shouldn't
> > > > > > touch pages they do not recognize and a plain page with ref. count 1
> > > > > > doesn't tell much to anybody. So I _suspect_ that we can simply 
> > > > > > drop the
> > > > > > reserved bit setting here.
> > > > > 
> > > > > So this has me a bit hesitant to want to just drop the bit entirely. 
> > > > > If
> > > > > nothing else I think I may wan to make that a patch onto itself so 
> > > > > that if
> > > > > we aren't going to set it we just drop it there. That way if it does 
> > > > > cause
> > > > > issues we can bisect it to that patch and pinpoint the cause.
> > > > 
> > > > Yes a patch on its own make sense for bisectability.
> > > 
> > > For now I think I am going to back off of this. There is a bunch of other
> > > changes that need to happen in order for us to make this work. As far as I
> > > can tell there are several places that are relying on this reserved bit.
> > 
> > Please be more specific. Unless I misremember, I have added this
> > PageReserved just to be sure (f1dd2cd13c4bb) because pages where just
> > half initialized back then. I am not aware anybody is depending on this.
> > If there is somebody then be explicit about that. The last thing I want
> > to see is to preserve a cargo cult and build a design around it.
> 
> It is mostly just a matter of going through and auditing all the
> places that are using PageReserved to identify pages that they aren't
> supposed to touch for whatever reason.
>
> From what I can tell the issue appears to be the fact that the reserved
> bit is used to identify if a region of memory is "online" or "offline".

No, this is wrong. pfn_to_online_page does that. PageReserved has
nothing to do with online vs. offline status. It merely says that you
shouldn't touch the page unless you own it. Sure, we might have a few
places relying on it, but nothing should really depend on the reserved
bit check from the MM hotplug POV.

> So for example the call "online_pages_range" doesn't invoke the
> online_page_callback unless the first pfn in the range is marked as
> reserved.

Yes and there is no fundamental reason to do that. We can easily check
the online status without that.

> Another example Dan had pointed out was the saveable_page function in
> kernel/power/snapshot.c.

Use pfn_to_online_page there.
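
For illustration, a conversion along those lines could look roughly like this
(a sketch only; the real saveable_page() in kernel/power/snapshot.c has more
checks than shown here, and the function name is made up for the example):

#include <linux/memory_hotplug.h>
#include <linux/mm.h>

static struct page *saveable_page_sketch(struct zone *zone, unsigned long pfn)
{
	/*
	 * pfn_to_online_page() returns NULL for offline sections and for
	 * ZONE_DEVICE ranges, so no PageReserved() check is needed to skip
	 * hotplugged-but-not-onlined or device memory.
	 */
	struct page *page = pfn_to_online_page(pfn);

	if (!page || page_zone(page) != zone)
		return NULL;

	if (PageHighMem(page))
		return NULL;

	return page;
}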

> > > > > > Regarding the post initialization required by devm_memremap_pages 
> > > > > > and
> > > > > > potentially others. Can we update the altmap which is already a way 
> > > > > > how
> > > > > > to get alternative struct pages by a constructor which we could call
> > > > > > from memmap_init_zone and do the post initialization? This would 
> > > > > > reduce
> > > > > > the additional loop in the caller while it would still fit the 
> > > > > > overall
> > > > > > design of the altmap and the core hotplug doesn't have to know 
> > > > > > anything
> > > > > > about DAX or whatever needs a special treatment.
> > > > > > 
> > > > > > Does that make any sense?
> > > > > 
> > > > > I think the only thing that is currently using the altmap is the 
> > > > > ZONE_DEVICE
> > > > > memory init. Specifically I think it is only really used by the
> > > > > devm_memremap_pages version of things, and then only under certain
> > > > > circumstances. Also the HMM driver doesn't pass an altmap. What we 
> > > > > would
> > > > > really need is a non-ZONE_DEVICE users of the altmap to really justify
> > > > > sticking with that as the preferred 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 15:12 +0100, Michal Hocko wrote:
> > > On Wed 17-10-18 08:02:20, Alexander Duyck wrote:
> > > > On 10/17/2018 12:52 AM, Michal Hocko wrote:
> > > > > On Thu 11-10-18 10:38:39, Alexander Duyck wrote:
> > > > > > On 10/11/2018 1:55 AM, Michal Hocko wrote:
> > > > > > > On Wed 10-10-18 20:52:42, Michal Hocko wrote:
> > > > > > > [...]
> > > > > > > > My recollection was that we do clear the reserved bit in
> > > > > > > > move_pfn_range_to_zone and we indeed do in __init_single_page. 
> > > > > > > > But then
> > > > > > > > we set the bit back right afterwards. This seems to be the case 
> > > > > > > > since
> > > > > > > > d0dc12e86b319 which reorganized the code. I have to study this 
> > > > > > > > some more
> > > > > > > > obviously.
> > > > > > > 
> > > > > > > so my recollection was wrong and d0dc12e86b319 hasn't really 
> > > > > > > changed
> > > > > > > much because __init_single_page wouldn't zero out the struct page 
> > > > > > > for
> > > > > > > the hotplug contex. A comment in move_pfn_range_to_zone explains 
> > > > > > > that we
> > > > > > > want the reserved bit because pfn walkers already do see the pfn 
> > > > > > > range
> > > > > > > and the page is not fully associated with the zone until it is 
> > > > > > > onlined.
> > > > > > > 
> > > > > > > I am thinking that we might be overzealous here. With the full 
> > > > > > > state
> > > > > > > initialized we shouldn't actually care. pfn_to_online_page should 
> > > > > > > return
> > > > > > > NULL regardless of the reserved bit and normal pfn walkers 
> > > > > > > shouldn't
> > > > > > > touch pages they do not recognize and a plain page with ref. 
> > > > > > > count 1
> > > > > > > doesn't tell much to anybody. So I _suspect_ that we can simply 
> > > > > > > drop the
> > > > > > > reserved bit setting here.
> > > > > > 
> > > > > > So this has me a bit hesitant to want to just drop the bit 
> > > > > > entirely. If
> > > > > > nothing else I think I may wan to make that a patch onto itself so 
> > > > > > that if
> > > > > > we aren't going to set it we just drop it there. That way if it 
> > > > > > does cause
> > > > > > issues we can bisect it to that patch and pinpoint the cause.
> > > > > 
> > > > > Yes a patch on its own make sense for bisectability.
> > > > 
> > > > For now I think I am going to back off of this. There is a bunch of 
> > > > other
> > > > changes that need to happen in order for us to make this work. As far 
> > > > as I
> > > > can tell there are several places that are relying on this reserved bit.
> > > 
> > > Please be more specific. Unless I misremember, I have added this
> > > PageReserved just to be sure (f1dd2cd13c4bb) because pages where just
> > > half initialized back then. I am not aware anybody is depending on this.
> > > If there is somebody then be explicit about that. The last thing I want
> > > to see is to preserve a cargo cult and build a design around it.
> > 
> > It is mostly just a matter of going through and auditing all the
> > places that are using PageReserved to identify pages that they aren't
> > supposed to touch for whatever reason.
> > 
> > From what I can tell the issue appears to be the fact that the reserved
> > bit is used to identify if a region of memory is "online" or "offline".
> 
> No, this is wrong. pfn_to_online_page does that. PageReserved has
> nothing to do with online vs. offline status. It merely says that you
> shouldn't touch the page unless you own it. Sure we might have few
> places relying on it but nothing should really depend the reserved bit
> check from the MM hotplug POV.
> 
> > So for example the call "online_pages_range" doesn't invoke the
> > online_page_callback unless the first pfn in the range is marked as
> > reserved.
> 
> Yes and there is no fundamental reason to do that. We can easily check
> the online status without that.
> 
> > Another example Dan had pointed out was the saveable_page function in
> > kernel/power/snapshot.c.
> 
> Use pfn_to_online_page there.

Right. Which gets back to my original point: there are a bunch of
other changes that need to happen in order for us to make this work. I
am going to end up with yet another patch set to clean up all the spots
that are using PageReserved when they shouldn't be, before I can get to
the point of not setting that bit.

> > > > > > > Regarding the post initialization required by devm_memremap_pages 
> > > > > > > and
> > > > > > > potentially others. Can we update the altmap which is already a 
> > > > > > > way how
> > > > > > > to get alternative struct pages by a constructor which we could 
> > > > > > > call
> > > > > > > from memmap_init_zone and do the post initialization? This would 
> > > > > > > reduce
> > > > > > > the additional loop in the caller while it would still fit the 
> > > > > > > overall
> > > > > > > design of the altmap and the core ho

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> > On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
[...]
> > > So for example the call "online_pages_range" doesn't invoke the
> > > online_page_callback unless the first pfn in the range is marked as
> > > reserved.
> > 
> > Yes and there is no fundamental reason to do that. We can easily check
> > the online status without that.
> > 
> > > Another example Dan had pointed out was the saveable_page function in
> > > kernel/power/snapshot.c.
> > 
> > Use pfn_to_online_page there.
> 
> Right. Which getting back to my original point, there is a bunch of
> other changes that need to happen in order for us to make this work.

Which is the standard process of upstreaming stuff. My main point was that
the reason I added SetPageReserved was as a safety net, because I
knew that different code paths would back off on PageReserved while they
wouldn't on a partially initialized structure. Now that you really want
to prevent setting this bit for performance reasons, it makes sense
to revisit that earlier decision and get rid of it, rather than build on
top of it and duplicate and special-case the low level hotplug init
code.

> I am going to end up with yet another patch set to clean up all the
> spots that are using PageReserved that shouldn't be before I can get
> to the point of not setting that bit.

This would be highly appreciated. There are not that many PageReserved
checks.

> > > > > > > > Regarding the post initialization required by 
> > > > > > > > devm_memremap_pages and
> > > > > > > > potentially others. Can we update the altmap which is already a 
> > > > > > > > way how
> > > > > > > > to get alternative struct pages by a constructor which we could 
> > > > > > > > call
> > > > > > > > from memmap_init_zone and do the post initialization? This 
> > > > > > > > would reduce
> > > > > > > > the additional loop in the caller while it would still fit the 
> > > > > > > > overall
> > > > > > > > design of the altmap and the core hotplug doesn't have to know 
> > > > > > > > anything
> > > > > > > > about DAX or whatever needs a special treatment.
> > > > > > > > 
> > > > > > > > Does that make any sense?
> > > > > > > 
> > > > > > > I think the only thing that is currently using the altmap is the 
> > > > > > > ZONE_DEVICE
> > > > > > > memory init. Specifically I think it is only really used by the
> > > > > > > devm_memremap_pages version of things, and then only under certain
> > > > > > > circumstances. Also the HMM driver doesn't pass an altmap. What 
> > > > > > > we would
> > > > > > > really need is a non-ZONE_DEVICE users of the altmap to really 
> > > > > > > justify
> > > > > > > sticking with that as the preferred argument to pass.
> > > > > > 
> > > > > > I am not aware of any upstream HMM user so I am not sure what are 
> > > > > > the
> > > > > > expectations there. But I thought that ZONE_DEVICE users use 
> > > > > > altmap. If
> > > > > > that is not generally true then we certainly have to think about a
> > > > > > better interface.
> > > > > 
> > > > > I'm just basing my statement on the use of the move_pfn_range_to_zone 
> > > > > call.
> > > > > The only caller that is actually passing the altmap is 
> > > > > devm_memremap_pages
> > > > > and if I understand things correctly that is only used when we want 
> > > > > to stare
> > > > > the vmmemmap on the same memory that we just hotplugged.
> > > > 
> > > > Yes, and that is what I've called as allocator callback earlier.
> > > 
> > > I am really not a fan of the callback approach. It just means we will
> > > have to do everything multiple times in terms of initialization.
> > 
> > I do not follow. Could you elaborate?
> 
> So there end up being a few different issues with constructors. First
> in my mind is that it means we have to initialize the region of memory
> and cannot assume what the constructors are going to do for us. As a
> result we will have to initialize the LRU pointers, and then overwrite
> them with the pgmap and hmm_data.

Why would we do that? What really prevents you from making a fully
customized constructor?

> I am generally not a fan of that as
> the next patch set I have gets rid of most of the redundancy we already
> had in the writes where we were memsetting everything to 0, then
> writing the values, and then taking care of the reserved bit and
> pgmap/hmm_data fields. Dealing with the init serially like that is just
> slow.
> 
> Another complication is retpoline making function pointers just more
> expensive in general. I know in the networking area we have been
> dealing with this for a while as even the DMA ops have been a pain
> there.

We use callbacks all over the kernel and in hot paths as well. This is
far from anything resembling a hot path AFAICT. It can be time consuming
because we have to touch each and every page but that is a fundamental
thing to do. We cannot simply batch the larg

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Dan Williams
On Mon, Oct 29, 2018 at 10:24 AM Michal Hocko  wrote:
>
> On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
[..]
> > > You are already doing per-page initialization so I fail to see a larger
> > > unit to operate on.
> >
> > I have a patch that makes it so that we can work at a pageblock level
> > since all of the variables with the exception of only the LRU and page
> > address fields can be precomputed. Doing that is one of the ways I was
> > able to reduce page init to 1/3 to 1/4 of the time it was taking
> > otherwise in the case of deferred page init.
>
> You still have to call set_page_links for each page. But let's assume we
> can do initialization per larger units. Nothing really prevent to hide
> that into constructor as well.

A constructor / indirect function call makes sense when there are
multiple sub-classes of object initialization, on the table I only see
3 cases: typical hotplug, base ZONE_DEVICE, ZONE_DEVICE + HMM. I think
we can look to move the HMM special casing out of line, then we're
down to 2. Even at 3 cases we're better off open-coding than a
constructor for such a low number of sub-cases to handle. I do not
foresee more cases arriving, so I struggle to see what the constructor
buys us in terms of code readability / maintainability?


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 18:24 +0100, Michal Hocko wrote:
> On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> > > On Mon 29-10-18 08:59:34, Alexander Duyck wrote:
> 
> [...]
> > > > So for example the call "online_pages_range" doesn't invoke the
> > > > online_page_callback unless the first pfn in the range is marked as
> > > > reserved.
> > > 
> > > Yes and there is no fundamental reason to do that. We can easily check
> > > the online status without that.
> > > 
> > > > Another example Dan had pointed out was the saveable_page function in
> > > > kernel/power/snapshot.c.
> > > 
> > > Use pfn_to_online_page there.
> > 
> > Right. Which getting back to my original point, there is a bunch of
> > other changes that need to happen in order for us to make this work.
> 
> Which is a standard process of upstreaming stuff. My main point was that
> the reason why I've added SetPageReserved was a safety net because I
> knew that different code paths would back of on PageReserved while they
> wouldn't on a partially initialized structure. Now that you really want
> to prevent setting this bit for performance reasons then it makes sense
> to revisit that earlier decision and get rid of it rather than build on
> top of it and duplicate and special case the low level hotplug init
> code.

Just to clarify, not setting the reserved bit isn't so much a
performance optimization as it is a functional issue. My understanding
is that DAX pages cannot currently be used with KVM, as it doesn't
recognize them as being "online". So in going through and cleaning up
the checks for the reserved bit, this problem will likely solve itself.
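
The KVM check in question is roughly the following (paraphrased from
virt/kvm/kvm_main.c of that time, so treat it as a sketch rather than an
exact quote):

/*
 * KVM treats a pfn as "reserved" (and special-cases it) whenever the backing
 * struct page has PageReserved set, which is exactly what the hotplug path
 * currently does for ZONE_DEVICE/DAX pages.
 */
bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn));

	return true;
}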

> > I am going to end up with yet another patch set to clean up all the
> > spots that are using PageReserved that shouldn't be before I can get
> > to the point of not setting that bit.
> 
> This would be highly appreciated. There are not that many PageReserved
> checks.

Yeah, not many, only about 3 dozen. It should only take me a week or
two to get them all sorted.

> > > > > > > > > Regarding the post initialization required by 
> > > > > > > > > devm_memremap_pages and
> > > > > > > > > potentially others. Can we update the altmap which is already 
> > > > > > > > > a way how
> > > > > > > > > to get alternative struct pages by a constructor which we 
> > > > > > > > > could call
> > > > > > > > > from memmap_init_zone and do the post initialization? This 
> > > > > > > > > would reduce
> > > > > > > > > the additional loop in the caller while it would still fit 
> > > > > > > > > the overall
> > > > > > > > > design of the altmap and the core hotplug doesn't have to 
> > > > > > > > > know anything
> > > > > > > > > about DAX or whatever needs a special treatment.
> > > > > > > > > 
> > > > > > > > > Does that make any sense?
> > > > > > > > 
> > > > > > > > I think the only thing that is currently using the altmap is 
> > > > > > > > the ZONE_DEVICE
> > > > > > > > memory init. Specifically I think it is only really used by the
> > > > > > > > devm_memremap_pages version of things, and then only under 
> > > > > > > > certain
> > > > > > > > circumstances. Also the HMM driver doesn't pass an altmap. What 
> > > > > > > > we would
> > > > > > > > really need is a non-ZONE_DEVICE users of the altmap to really 
> > > > > > > > justify
> > > > > > > > sticking with that as the preferred argument to pass.
> > > > > > > 
> > > > > > > I am not aware of any upstream HMM user so I am not sure what are 
> > > > > > > the
> > > > > > > expectations there. But I thought that ZONE_DEVICE users use 
> > > > > > > altmap. If
> > > > > > > that is not generally true then we certainly have to think about a
> > > > > > > better interface.
> > > > > > 
> > > > > > I'm just basing my statement on the use of the 
> > > > > > move_pfn_range_to_zone call.
> > > > > > The only caller that is actually passing the altmap is 
> > > > > > devm_memremap_pages
> > > > > > and if I understand things correctly that is only used when we want 
> > > > > > to stare
> > > > > > the vmmemmap on the same memory that we just hotplugged.
> > > > > 
> > > > > Yes, and that is what I've called as allocator callback earlier.
> > > > 
> > > > I am really not a fan of the callback approach. It just means we will
> > > > have to do everything multiple times in terms of initialization.
> > > 
> > > I do not follow. Could you elaborate?
> > 
> > So there end up being a few different issues with constructors. First
> > in my mind is that it means we have to initialize the region of memory
> > and cannot assume what the constructors are going to do for us. As a
> > result we will have to initialize the LRU pointers, and then overwrite
> > them with the pgmap and hmm_data.
> 
> Why we would do that? What does really prevent you from making a fully
> customized constructor?

It is more an argument of complexity. Do I just pass a single pointer
and write that value, or the LRU valu

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 10:34:22, Dan Williams wrote:
> On Mon, Oct 29, 2018 at 10:24 AM Michal Hocko  wrote:
> >
> > On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> > > On Mon, 2018-10-29 at 17:35 +0100, Michal Hocko wrote:
> [..]
> > > > You are already doing per-page initialization so I fail to see a larger
> > > > unit to operate on.
> > >
> > > I have a patch that makes it so that we can work at a pageblock level
> > > since all of the variables with the exception of only the LRU and page
> > > address fields can be precomputed. Doing that is one of the ways I was
> > > able to reduce page init to 1/3 to 1/4 of the time it was taking
> > > otherwise in the case of deferred page init.
> >
> > You still have to call set_page_links for each page. But let's assume we
> > can do initialization per larger units. Nothing really prevent to hide
> > that into constructor as well.
> 
> A constructor / indirect function call makes sense when there are
> multiple sub-classes of object initialization, on the table I only see
> 3 cases: typical hotplug, base ZONE_DEVICE, ZONE_DEVICE + HMM. I think
> we can look to move the HMM special casing out of line, then we're
> down to 2. Even at 3 cases we're better off open-coding than a
> constructor for such a low number of sub-cases to handle. I do not
> foresee more cases arriving, so I struggle to see what the constructor
> buys us in terms of code readability / maintainability?

I hadn't dreamed of ZONE_DEVICE and HMM a few years back. But anyway,
let me note that I am not in love with callbacks. I find them to be a
useful abstraction. I can be convinced (by numbers) that special casing
inside the core hotplug code is really beneficial. But let's do that in
a single place.

All I am arguing against throughout this thread is memmap_init_zone_device
and the whole code duplication just because ZONE_DEVICE needs something
special.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 10:42:33, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 18:24 +0100, Michal Hocko wrote:
> > On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
[...]
> > > So there end up being a few different issues with constructors. First
> > > in my mind is that it means we have to initialize the region of memory
> > > and cannot assume what the constructors are going to do for us. As a
> > > result we will have to initialize the LRU pointers, and then overwrite
> > > them with the pgmap and hmm_data.
> > 
> > Why we would do that? What does really prevent you from making a fully
> > customized constructor?
> 
> It is more an argument of complexity. Do I just pass a single pointer
> and write that value, or the LRU values in init, or do I have to pass a
> function pointer, some abstracted data, and then call said function
> pointer while passing the page and the abstracted data?

I thought you said that pgmap is the current common denominator for
zone device users. I really do not see what the problem is with doing
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..9105a4ed2c96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5516,7 +5516,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 not_early:
page = pfn_to_page(pfn);
-   __init_single_page(page, pfn, zone, nid);
+   if (pgmap && pgmap->init_page)
+   pgmap->init_page(page, pfn, zone, nid, pgmap);
+   else
+   __init_single_page(page, pfn, zone, nid);
if (context == MEMMAP_HOTPLUG)
SetPageReserved(page);
 
That would require replacing altmap throughout the call chain with pgmap.
Altmap could then be renamed to something clearer:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..048e4cc72fdf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5474,8 +5474,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 * Honor reservation requested by the driver for this ZONE_DEVICE
 * memory
 */
-   if (altmap && start_pfn == altmap->base_pfn)
-   start_pfn += altmap->reserve;
+   if (pgmap && pgmap->get_memmap)
+   start_pfn = pgmap->get_memmap(pgmap, start_pfn);
 
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*

[...]

> If I have to implement the code to verify the slowdown I will, but I
> really feel like it is just going to be time wasted since we have seen
> this in other spots within the kernel.

Please try to understand that I am not trying to force you to write some
artificial benchmarks. All I really care about is that we have sane
interfaces with reasonable performance, especially for one-off things
in relatively slow paths. I fully recognize that ZONE_DEVICE begs for
better integration, but really, try to go incremental: unify the code
first and micro-optimize on top. Is that way too much to ask for?

Anyway, we have gone into details while the primary problem here was that
the hotplug lock doesn't scale, AFAIR. And my question was why we cannot
pull move_pfn_range_to_zone and what has to be done to achieve that.
That is the fundamental thing to address first. Then you can micro-optimize
on top.
-- 
Michal Hocko
SUSE Labs


[PATCH] tools/testing/nvdimm: Fix the index for dimm devices.

2018-10-29 Thread Masayoshi Mizuma
From: Masayoshi Mizuma 

KASAN reports the following global out-of-bounds access while
nfit_test is being loaded. The out-of-bounds access happens at the
following reference to dimm_fail_cmd_flags[dimm], where 'dimm' can
exceed the array size, NUM_DCR (== 5).

---
  static int override_return_code(int dimm, unsigned int func, int rc)
  {
  if ((1 << func) & dimm_fail_cmd_flags[dimm]) {

dimm_fail_cmd_flags[] definition:
  static unsigned long dimm_fail_cmd_flags[NUM_DCR];
---

'dimm' is the return value of get_dimm(), and get_dimm() returns an
index into the handle[] array. handle[] has 7 entries: indexes #0 to #4
are for nfit_test.0, and #5 and #6 are for nfit_test.1. NUM_DCR only
covers nfit_test.0, so extend the arrays to cover nfit_test.1 as well.

KASAN report:
==
BUG: KASAN: global-out-of-bounds in nfit_test_ctl+0x47bb/0x55b0 [nfit_test]
Read of size 8 at addr c10cbbe8 by task kworker/u41:0/8
...
Call Trace:
 dump_stack+0xea/0x1b0
 ? dump_stack_print_info.cold.0+0x1b/0x1b
 ? kmsg_dump_rewind_nolock+0xd9/0xd9
 print_address_description+0x65/0x22e
 ? nfit_test_ctl+0x47bb/0x55b0 [nfit_test]
 kasan_report.cold.6+0x92/0x1a6
 nfit_test_ctl+0x47bb/0x55b0 [nfit_test]
...
The buggy address belongs to the variable:
 dimm_fail_cmd_flags+0x28/0xa440 [nfit_test]
==

Signed-off-by: Masayoshi Mizuma 
---
 tools/testing/nvdimm/test/nfit.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 9527d47a1070..38f066ce2f47 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -104,6 +104,8 @@
 enum {
NUM_PM  = 3,
NUM_DCR = 5,
+   NUM_DCR_ALL = NUM_DCR + 2, /* 5 DCRs for test.0, 2 DCRs for test.1 */
+
NUM_HINTS = 8,
NUM_BDW = NUM_DCR,
NUM_SPA = NUM_PM + NUM_DCR + NUM_BDW,
@@ -140,8 +142,8 @@ static u32 handle[] = {
[6] = NFIT_DIMM_HANDLE(1, 0, 0, 0, 1),
 };
 
-static unsigned long dimm_fail_cmd_flags[NUM_DCR];
-static int dimm_fail_cmd_code[NUM_DCR];
+static unsigned long dimm_fail_cmd_flags[NUM_DCR_ALL];
+static int dimm_fail_cmd_code[NUM_DCR_ALL];
 
 static const struct nd_intel_smart smart_def = {
.flags = ND_INTEL_SMART_HEALTH_VALID
@@ -205,7 +207,7 @@ struct nfit_test {
unsigned long deadline;
spinlock_t lock;
} ars_state;
-   struct device *dimm_dev[NUM_DCR];
+   struct device *dimm_dev[NUM_DCR_ALL];
struct nd_intel_smart *smart;
struct nd_intel_smart_threshold *smart_threshold;
struct badrange badrange;
@@ -2680,7 +2682,7 @@ static int nfit_test_probe(struct platform_device *pdev)
u32 nfit_handle = __to_nfit_memdev(nfit_mem)->device_handle;
int i;
 
-   for (i = 0; i < NUM_DCR; i++)
+   for (i = 0; i < NUM_DCR_ALL; i++)
if (nfit_handle == handle[i])
dev_set_drvdata(nfit_test->dimm_dev[i],
nfit_mem);
-- 
2.18.0



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Alexander Duyck
On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
> On Mon 29-10-18 10:42:33, Alexander Duyck wrote:
> > On Mon, 2018-10-29 at 18:24 +0100, Michal Hocko wrote:
> > > On Mon 29-10-18 10:01:28, Alexander Duyck wrote:
> 
> [...]
> > > > So there end up being a few different issues with constructors. First
> > > > in my mind is that it means we have to initialize the region of memory
> > > > and cannot assume what the constructors are going to do for us. As a
> > > > result we will have to initialize the LRU pointers, and then overwrite
> > > > them with the pgmap and hmm_data.
> > > 
> > > Why we would do that? What does really prevent you from making a fully
> > > customized constructor?
> > 
> > It is more an argument of complexity. Do I just pass a single pointer
> > and write that value, or the LRU values in init, or do I have to pass a
> > function pointer, some abstracted data, and then call said function
> > pointer while passing the page and the abstracted data?
> 
> I though you have said that pgmap is the current common denominator for
> zone device users. I really do not see what is the problem to do
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 89d2a2ab3fe6..9105a4ed2c96 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5516,7 +5516,10 @@ void __meminit memmap_init_zone(unsigned long size, 
> int nid, unsigned long zone,
>  
>  not_early:
>   page = pfn_to_page(pfn);
> - __init_single_page(page, pfn, zone, nid);
> + if (pgmap && pgmap->init_page)
> + pgmap->init_page(page, pfn, zone, nid, pgmap);
> + else
> + __init_single_page(page, pfn, zone, nid);
>   if (context == MEMMAP_HOTPLUG)
>   SetPageReserved(page);
>  
> that would require to replace altmap throughout the call chain and
> replace it by pgmap. Altmap could be then renamed to something more
> clear

So, as I pointed out earlier, doing per-page init is much slower than
initializing pages in bulk. Ideally I would like to see us separate the
memmap_init_zone function into two pieces, one section for handling
hotplug and another for the everything-else case. As it is, the fact
that you have to jump over a bunch of tests for the "not_early" case is
quite ugly in my opinion.

I could probably take your patch and test it. I suspect this is
going to be a significant slow-down in general, as the indirect function
pointer stuff is probably going to come into play.

The "init_page" function in this case is going to end up being much
more complex then it really needs to be in this design as well since I
have to get the altmap and figure out if the page was used for vmmemmap
storage or is an actual DAX page. I might just see if I could add an
additional test for the pfn being before the end of the vmmemmap in the
case of pgmap being present.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 89d2a2ab3fe6..048e4cc72fdf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5474,8 +5474,8 @@ void __meminit memmap_init_zone(unsigned long size, int 
> nid, unsigned long zone,
>* Honor reservation requested by the driver for this ZONE_DEVICE
>* memory
>*/
> - if (altmap && start_pfn == altmap->base_pfn)
> - start_pfn += altmap->reserve;
> + if (pgmap && pgmap->get_memmap)
> + start_pfn = pgmap->get_memmap(pgmap, start_pfn);
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>   /*
> 
> [...]

The only reason why I hadn't bothered with these bits is that I was
actually trying to leave this generic, since I thought I had seen other
discussions about hotplug scenarios where memory may want to change
where the vmemmap is initialized, other than just the case of
ZONE_DEVICE pages. So I was thinking at some point we may see an altmap
without the pgmap.

> > If I have to implement the code to verify the slowdown I will, but I
> > really feel like it is just going to be time wasted since we have seen
> > this in other spots within the kernel.
> 
> Please try to understand that I am not trying to force you write some
> artificial benchmarks. All I really do care about is that we have sane
> interfaces with reasonable performance. Especially for one-off things
> in relattively slow paths. I fully recognize that ZONE_DEVICE begs for a
> better integration but really, try to go incremental and try to unify
> the code first and microptimize on top. Is that way too much to ask for?

No, but the patches I had already proposed were, I thought, heading in
that direction. I had unified memmap_init_zone,
memmap_init_zone_device, and the deferred page initialization into a
small set of functions, and all had improved performance as a result.

> Anyway we have gone into details while the primary problem here was that
> the hotplug lock doesn't scale AFAIR. And my question was why cannot we
> pull move_pfn_range_to_zone and what has to be done to achi

[RFC PATCH] kvm: Use huge pages for DAX-backed files

2018-10-29 Thread Barret Rhoden
This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound.  The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not.  For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.

For DAX, we can check the page table itself.  Actually, we might always
be able to walk the page table, even for PageTransCompound pages, but
it's probably a little slower.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock (grabbed before checking
the mmu seq).  Based on the other comments about not worrying about a
pmd split, we might be able to safely walk the page table without
holding the mm sem.

This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
which I've hacked up for testing this code.  That change should
eventually happen:

https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/

Another issue is that kvm_mmu_zap_collapsible_spte() also uses
PageTransCompoundMap() to detect huge pages, but we don't have a way to
get the HVA easily.  Can we just aggressively zap DAX pages there?

Alternatively, is there a better way to track at the struct page level
whether or not a page is huge-mapped?  Maybe the DAX huge pages mark
themselves as TransCompound or something similar, and we don't need to
special case DAX/ZONE_DEVICE pages.

Signed-off-by: Barret Rhoden 
---
 arch/x86/kvm/mmu.c | 71 +-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf5f572f2305..9f3e0f83a2dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
return -EFAULT;
 }
 
+static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
+{
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   return 0;
+
+   p4d = p4d_offset(pgd, addr);
+   if (!p4d_present(*p4d))
+   return 0;
+   if (p4d_huge(*p4d))
+   return P4D_SIZE;
+
+   pud = pud_offset(p4d, addr);
+   if (!pud_present(*pud))
+   return 0;
+   if (pud_huge(*pud))
+   return PUD_SIZE;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   return 0;
+   if (pmd_huge(*pmd))
+   return PMD_SIZE;
+
+   pte = pte_offset_map(pmd, addr);
+   if (!pte_present(*pte))
+   return 0;
+   return PAGE_SIZE;
+}
+
+static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   unsigned long hva, map_sz;
+
+   if (!is_zone_device_page(page))
+   return PageTransCompoundMap(page);
+
+   /*
+* DAX pages do not use compound pages.  The page should have already
+* been mapped into the host-side page table during try_async_pf(), so
+* we can check the page tables directly.
+*/
+   hva = gfn_to_hva(kvm, gfn);
+   if (kvm_is_error_hva(hva))
+   return false;
+
+   /*
+* Our caller grabbed the KVM mmu_lock with a successful
+* mmu_notifier_retry, so we're safe to walk the page table.
+*/
+   map_sz = pgd_mapping_size(current->mm, hva);
+   switch (map_sz) {
+   case PMD_SIZE:
+   return true;
+   case P4D_SIZE:
+   case PUD_SIZE:
+   printk_once(KERN_INFO "KVM THP promo found a very large page");
+   return false;
+   }
+   return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
gfn_t *gfnp, kvm_pfn_t *pfnp,
int *levelp)
@@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 */
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
-   PageTransCompoundMap(pfn_to_page(pfn)) &&
+   pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
unsigned long mask;
/*
-- 
2.19.1.568.g152ad8e336-goog



Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files

2018-10-29 Thread Dan Williams
On Mon, Oct 29, 2018 at 2:07 PM Barret Rhoden  wrote:
>
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound.  The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not.  For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>
> For DAX, we can check the page table itself.  Actually, we might always
> be able to walk the page table, even for PageTransCompound pages, but
> it's probably a little slower.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock (grabbed before checking
> the mmu seq).  Based on the other comments about not worrying about a
> pmd split, we might be able to safely walk the page table without
> holding the mm sem.
>
> This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
> which I've hacked up for testing this code.  That change should
> eventually happen:
>
> https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/
>
> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
> PageTransCompoundMap() to detect huge pages, but we don't have a way to
> get the HVA easily.  Can we just aggressively zap DAX pages there?
>
> Alternatively, is there a better way to track at the struct page level
> whether or not a page is huge-mapped?  Maybe the DAX huge pages mark
> themselves as TransCompound or something similar, and we don't need to
> special case DAX/ZONE_DEVICE pages.
>
> Signed-off-by: Barret Rhoden 
> ---
>  arch/x86/kvm/mmu.c | 71 +-
>  1 file changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf5f572f2305..9f3e0f83a2dd 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> return -EFAULT;
>  }
>
> +static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
> +{
> +   pgd_t *pgd;
> +   p4d_t *p4d;
> +   pud_t *pud;
> +   pmd_t *pmd;
> +   pte_t *pte;
> +
> +   pgd = pgd_offset(mm, addr);
> +   if (!pgd_present(*pgd))
> +   return 0;
> +
> +   p4d = p4d_offset(pgd, addr);
> +   if (!p4d_present(*p4d))
> +   return 0;
> +   if (p4d_huge(*p4d))
> +   return P4D_SIZE;
> +
> +   pud = pud_offset(p4d, addr);
> +   if (!pud_present(*pud))
> +   return 0;
> +   if (pud_huge(*pud))
> +   return PUD_SIZE;
> +
> +   pmd = pmd_offset(pud, addr);
> +   if (!pmd_present(*pmd))
> +   return 0;
> +   if (pmd_huge(*pmd))
> +   return PMD_SIZE;
> +
> +   pte = pte_offset_map(pmd, addr);
> +   if (!pte_present(*pte))
> +   return 0;
> +   return PAGE_SIZE;
> +}
> +
> +static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +   struct page *page = pfn_to_page(pfn);
> +   unsigned long hva, map_sz;
> +
> +   if (!is_zone_device_page(page))
> +   return PageTransCompoundMap(page);
> +
> +   /*
> +* DAX pages do not use compound pages.  The page should have already
> +* been mapped into the host-side page table during try_async_pf(), so
> +* we can check the page tables directly.
> +*/
> +   hva = gfn_to_hva(kvm, gfn);
> +   if (kvm_is_error_hva(hva))
> +   return false;
> +
> +   /*
> +* Our caller grabbed the KVM mmu_lock with a successful
> +* mmu_notifier_retry, so we're safe to walk the page table.
> +*/
> +   map_sz = pgd_mapping_size(current->mm, hva);
> +   switch (map_sz) {
> +   case PMD_SIZE:
> +   return true;
> +   case P4D_SIZE:
> +   case PUD_SIZE:
> +   printk_once(KERN_INFO "KVM THP promo found a very large page");

Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.

> +   return false;
> +   }
> +   return false;
> +}

The above 2 functions are similar to what we need to do for
determining the blast radius of a memory error, see
dev_pagemap_mapping_shift() and its usage in add_to_kill().

> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t *gfnp, kvm_pfn_t *pfnp,
> int *levelp)
> @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> level == PT_PAGE_TABLE_LEVEL &&
> -   PageTransCompoundMap(pfn_to_page(pfn)) &&
> +   pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&

I'm wondering, if we're adding an explicit is_zone_device_page() check
in this path to determine the page mapping size, whether that can be a
replacement for the kvm_is_reserved_pfn() check.

Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files

2018-10-29 Thread Barret Rhoden
On 2018-10-29 at 15:25 Dan Williams  wrote:
> > +   /*
> > +* Our caller grabbed the KVM mmu_lock with a successful
> > +* mmu_notifier_retry, so we're safe to walk the page table.
> > +*/
> > +   map_sz = pgd_mapping_size(current->mm, hva);
> > +   switch (map_sz) {
> > +   case PMD_SIZE:
> > +   return true;
> > +   case P4D_SIZE:
> > +   case PUD_SIZE:
> > +   printk_once(KERN_INFO "KVM THP promo found a very large page");
> 
> Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.

The place where I use that helper seemed to care about PMDs (compared
to huge pages larger than PUDs), I think due to THP.  Though it also
checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point.

I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow
any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for
non-DAX, IIUC.
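
A minimal sketch of that direction, reusing pgd_mapping_size() from the
RFC above (illustrative only, not the code that will actually be posted):

static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
        struct page *page = pfn_to_page(pfn);
        unsigned long hva;

        /* Non-ZONE_DEVICE pages keep the existing compound-page check. */
        if (!is_zone_device_page(page))
                return PageTransCompoundMap(page);

        hva = gfn_to_hva(kvm, gfn);
        if (kvm_is_error_hva(hva))
                return false;

        /* For DAX, accept either a PMD or a PUD host mapping. */
        switch (pgd_mapping_size(current->mm, hva)) {
        case PUD_SIZE:
        case PMD_SIZE:
                return true;
        default:
                return false;
        }
}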

> 
> > +   return false;
> > +   }
> > +   return false;
> > +}  
> 
> The above 2 functions are  similar to what we need to do for
> determining the blast radius of a memory error, see
> dev_pagemap_mapping_shift() and its usage in add_to_kill().

Great.  I don't know if I have access in the KVM code to the VMA to use
those functions directly, but I can extract the guts of
dev_pagemap_mapping_shift() or something and put it in mm/util.c.
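
For reference, the walk in dev_pagemap_mapping_shift() is roughly the
following (paraphrased from mm/memory-failure.c of this era, with the
address passed in directly rather than derived from the VMA; not a
verbatim copy):

static unsigned long dev_pagemap_mapping_shift(struct mm_struct *mm,
                                               unsigned long address)
{
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

        /* Walk top-down; report the shift of the leaf mapping, or 0 if unmapped. */
        pgd = pgd_offset(mm, address);
        if (!pgd_present(*pgd))
                return 0;
        p4d = p4d_offset(pgd, address);
        if (!p4d_present(*p4d))
                return 0;
        pud = pud_offset(p4d, address);
        if (!pud_present(*pud))
                return 0;
        if (pud_devmap(*pud))
                return PUD_SHIFT;
        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
                return 0;
        if (pmd_devmap(*pmd))
                return PMD_SHIFT;
        pte = pte_offset_map(pmd, address);
        if (!pte_present(*pte))
                return 0;
        return PAGE_SHIFT;
}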

> >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > gfn_t *gfnp, kvm_pfn_t *pfnp,
> > int *levelp)
> > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> >  */
> > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > level == PT_PAGE_TABLE_LEVEL &&
> > -   PageTransCompoundMap(pfn_to_page(pfn)) &&
> > +   pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&  
> 
> I'm wondering if we're adding an explicit is_zone_device_page() check
> in this path to determine the page mapping size if that can be a
> replacement for the kvm_is_reserved_pfn() check. In other words, the
> goal of fixing up PageReserved() was to preclude the need for DAX-page
> special casing in KVM, but if we already need to add some special casing
> for page size determination, might as well bypass the
> kvm_is_reserved_pfn() dependency as well.

kvm_is_reserved_pfn() is used in some other places, like
kvm_set_pfn_dirty() and kvm_set_pfn_accessed().  Maybe the way those
treat DAX pages matters on a case-by-case basis?  
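
(For context, kvm_is_reserved_pfn() is essentially a PageReserved() test,
which is why DAX/ZONE_DEVICE pages currently trip it; roughly, from
memory rather than a verbatim copy of virt/kvm/kvm_main.c:)

bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
        /* Any pfn backed by a struct page is "reserved" iff the page is. */
        if (pfn_valid(pfn))
                return PageReserved(pfn_to_page(pfn));

        return true;
}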

There are other callers of kvm_is_reserved_pfn() such as
kvm_pfn_to_page() and gfn_to_page().  I'm not familiar (yet) with how
struct pages and DAX work together, and whether or not the callers of
those pfn_to_page() functions have expectations about the 'type' of
struct page they get back.

It looks like another time that this popped up was kvm_is_mmio_pfn(),
though that wasn't exactly checking kvm_is_reserved_pfn(), and it
special cased based on the memory type / PAT business.

Thanks,

Barret



Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files

2018-10-29 Thread Dan Williams
On Mon, Oct 29, 2018 at 5:29 PM Barret Rhoden  wrote:
>
> On 2018-10-29 at 15:25 Dan Williams  wrote:
> > > +   /*
> > > +* Our caller grabbed the KVM mmu_lock with a successful
> > > +* mmu_notifier_retry, so we're safe to walk the page table.
> > > +*/
> > > +   map_sz = pgd_mapping_size(current->mm, hva);
> > > +   switch (map_sz) {
> > > +   case PMD_SIZE:
> > > +   return true;
> > > +   case P4D_SIZE:
> > > +   case PUD_SIZE:
> > > +   printk_once(KERN_INFO "KVM THP promo found a very large page");
> >
> > Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.
>
> The place where I use that helper seemed to care about PMDs (compared
> to huge pages larger than PUDs), I think due to THP.  Though it also
> checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point.
>
> I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow
> any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for
> non-DAX, IIUC.

Yes, THP stops at PMDs, but DAX and hugetlbfs support PUD level mappings.

> > > +   return false;
> > > +   }
> > > +   return false;
> > > +}
> >
> > The above 2 functions are  similar to what we need to do for
> > determining the blast radius of a memory error, see
> > dev_pagemap_mapping_shift() and its usage in add_to_kill().
>
> Great.  I don't know if I have access in the KVM code to the VMA to use
> those functions directly, but I can extract the guts of
> dev_pagemap_mapping_shift() or something and put it in mm/util.c.

Sounds good.

> > >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > gfn_t *gfnp, kvm_pfn_t *pfnp,
> > > int *levelp)
> > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > >  */
> > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > > level == PT_PAGE_TABLE_LEVEL &&
> > > -   PageTransCompoundMap(pfn_to_page(pfn)) &&
> > > +   pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
> >
> > I'm wondering if we're adding an explicit is_zone_device_page() check
> > in this path to determine the page mapping size if that can be a
> > replacement for the kvm_is_reserved_pfn() check. In other words, the
> > goal of fixing up PageReserved() was to preclude the need for DAX-page
> > special casing in KVM, but if we already need to add some special casing
> > for page size determination, might as well bypass the
> > kvm_is_reserved_pfn() dependency as well.
>
> kvm_is_reserved_pfn() is used in some other places, like
> kvm_set_pfn_dirty() and kvm_set_pfn_accessed().  Maybe the way those
> treat DAX pages matters on a case-by-case basis?
>
> There are other callers of kvm_is_reserved_pfn() such as
> kvm_pfn_to_page() and gfn_to_page().  I'm not familiar (yet) with how
> struct pages and DAX work together, and whether or not the callers of
> those pfn_to_page() functions have expectations about the 'type' of
> struct page they get back.
>

The property of DAX pages that requires special coordination is the
fact that the device hosting the pages can be disabled at will. The
get_dev_pagemap() API is the interface to pin a device-pfn so that you
can safely perform a pfn_to_page() operation.
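
A minimal sketch of that pattern (illustrative caller;
get_dev_pagemap()/put_dev_pagemap() are the real interfaces):

/*
 * Illustrative only: pin the hosting device before trusting pfn_to_page()
 * for a device pfn.  The caller holds the returned pgmap reference for as
 * long as it uses the page and then drops it with put_dev_pagemap().
 */
static struct page *devmap_pfn_to_page(unsigned long pfn,
                                       struct dev_pagemap **pgmap)
{
        *pgmap = get_dev_pagemap(pfn, NULL);
        if (!*pgmap)
                return NULL;    /* not ZONE_DEVICE, or the device is going away */

        return pfn_to_page(pfn);
}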

Have the pages that kvm uses in this path already been pinned by vfio?


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-29 Thread Michal Hocko
On Mon 29-10-18 12:59:11, Alexander Duyck wrote:
> On Mon, 2018-10-29 at 19:18 +0100, Michal Hocko wrote:
[...]

I will try to get to your other points later.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 89d2a2ab3fe6..048e4cc72fdf 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5474,8 +5474,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  * Honor reservation requested by the driver for this ZONE_DEVICE
> >  * memory
> >  */
> > -   if (altmap && start_pfn == altmap->base_pfn)
> > -   start_pfn += altmap->reserve;
> > +   if (pgmap && pgmap->get_memmap)
> > +   start_pfn = pgmap->get_memmap(pgmap, start_pfn);
> >  
> > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > /*
> > 
> > [...]
> 
> The only reason why I hadn't bothered with these bits is that I was
> actually trying to leave this generic, since I thought I had seen other
> discussions about hotplug scenarios where memory may want to change
> where the vmemmap is initialized other than just the case of
> ZONE_DEVICE pages. So I was thinking that at some point we may see an
> altmap without the pgmap.

I wanted to abuse the altmap to allocate struct pages from the physical
range to be added. In that case I would abstract the
allocation/initialization part of pgmap into a more abstract type,
something that could trivially be done without affecting end users of the
hotplug API.

[...]
> > Anyway we have gone into details while the primary problem here was that
> > the hotplug lock doesn't scale AFAIR. And my question was why cannot we
> > pull move_pfn_range_to_zone and what has to be done to achieve that.
> > That is a fundamental thing to address first. Then you can micro-optimize
> > on top.
> 
Yes, the hotplug lock was part of the original issue. However, that
starts to drift into the area I believe Oscar was working on as part
of his patch set, which encapsulates move_pfn_range_to_zone and the other
calls made under the hotplug lock into their own functions.

Well, I would really love to keep the external API as simple as
possible. That means that we need arch_add_memory/add_pages and
move_pfn_range_to_zone to associate pages with a zone. The hotplug lock
should preferably be hidden from callers of those two, and ideally it
shouldn't be a global lock. We should be good with a range lock.
 
> The patches Andrew pushed addressed the immediate issue, so that now
> systems with nvdimm/DAX memory can at least initialize quickly enough
> that systemd doesn't refuse to mount the root file system due to a
> timeout.

This is about the first time you actually mention that. I have re-read
the cover letter and all the changelogs of patches in this series. Unless I
have missed something, there is nothing about real users hitting issues
out there. nvdimm is still considered a toy because there is no real HW
that users can play with.

And hence my complaints about half-baked solutions rushed in just to fix
a performance regression. I can certainly understand that a pressing
problem might justify rushing things a bit, but this should always be
carefully justified.

> The next patch set I have refactors things to reduce code and
> allow us to reuse some of the hotplug code for the deferred page init, 
> https://lore.kernel.org/lkml/20181017235043.17213.92459.stgit@localhost.localdomain/
> . After that I was planning to work on dealing with the PageReserved
> flag and trying to get that sorted out.
> 
> I was hoping to wait until after Dan's HMM patches and Oscar's changes
> had been sorted before I get into any further refactor of this specific
> code.

Yes, there is quite a lot going on here. I would really appreciate it if we
all sat down and actually tried to come up with something robust rather than
hacking here and there. I haven't yet read your follow-up series completely,
so maybe you are indeed heading in the correct direction.

-- 
Michal Hocko
SUSE Labs


Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

2018-10-29 Thread Dan Williams
On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner  wrote:
>
> On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > MAP_SYNC
> > > > - file system guarantees that metadata required to reach faulted-in file
> > > >   data is consistent on media before a write fault is completed.  A
> > > >   side-effect is that the page cache will not be used for
> > > >   writably-mapped pages.
> > >
> > > I think you are conflating current implementation with API
> > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > use. The man page definition simply says "supported only for files
> > > supporting DAX" and that it provides certain data integrity
> > > guarantees. It does not define the implementation.
> > >
> > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > > because that's the only way we can currently provide the required
> > > behaviour to userspace. However, if a filesystem can use the page
> > > cache to provide the required functionality, then it's free to do
> > > so.
> > >
> > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > > integrity could be provided /without DAX/ by any filesystem using
> > > that persistent page cache. i.e. MAP_SYNC really only requires
> > > mmap() of CPU addressable persistent memory - it does not require
> > > DAX. Right now, however, the only way to get this functionality is
> > > through a DAX capable filesystem on dax capable storage.
> > >
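
(For concreteness, userspace asks for these semantics with
MAP_SHARED_VALIDATE | MAP_SYNC, and the mmap() fails outright if the
filesystem can't honour them; a minimal userspace sketch, with an
illustrative path:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
/* Fallback values mirror the uapi mman headers, for older libcs. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(void)
{
        int fd = open("/mnt/pmem/file", O_RDWR);        /* path is illustrative */
        if (fd < 0)
                return 1;

        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap"); /* e.g. EOPNOTSUPP when MAP_SYNC can't be honoured */
                close(fd);
                return 1;
        }

        /* After a write fault completes, the metadata needed to reach the
         * page is durable; flushing CPU caches for the written range is
         * enough to persist the data itself. */
        ((volatile char *)p)[0] = 1;

        munmap(p, 4096);
        close(fd);
        return 0;
}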
> > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > > COWs new pages in pmem and attaches them a special per-inode cache
> > > on clean->dirty transition. Then on data sync, background writeback
> > > or crash recovery, it migrates them from the cache into the file map
> > > proper via atomic metadata pointer swaps.
> > >
> > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > > separate persistent per-inode write cache to provide the correct
> > > crash recovery semantics for MAP_SYNC.
> >
> > Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> > effectively it will also be able to provide MAP_DIRECT semantics, right?
>
> Yes, I think so. It still needs to do COW on first write fault,
> but then the app has direct access to the data buffer until it is
> cleaned and put back in place. The "put back in place" is just an
> atomic swap of metadata pointers, so it doesn't need the page cache
> at all...
>
> > Because there won't be DRAM between app and persistent storage and I don't
> > think COW tricks or other data integrity methods are that interesting for
> > the application.
>
> Not for the application, but the filesystem still wants to support
> snapshots and other such functionality that requires COW. And NOVA
> doesn't have write-in-place functionality at all - it always COWs
> on the clean->dirty transition.
>
> > Most users of O_DIRECT are concerned about getting close
> > to media speed performance and low DRAM usage...
>
> *nod*
>
> > > > and what I think Dan had proposed:
> > > >
> > > > mmap flag, MAP_DIRECT
> > > > - file system guarantees that page cache will not be used to front storage.
> > > >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> > > >   The subtle difference is that a write fault /may/ not result in metadata
> > > >   being written back to media.
> > >
> > > Similar to O_DIRECT, these semantics do not allow userspace apps to
> > > replace msync/fsync with CPU cache flush operations. So any
> > > application that uses this mode still needs to use either MAP_SYNC
> > > or issue msync/fsync for data integrity.
> > >
> > > If the app is using MAP_DIRECT, then what do we do if the filesystem
> > > can't provide the required semantics for that specific operation? In
> > > the case of O_DIRECT, we fall back to buffered IO because it has the
> > > same data integrity semantics as O_DIRECT and will always work. It's
> > > just slower and consumes more memory, but the app continues to work
> > > just fine.
> > >
> > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > > without using the pagecache seems extremely problematic to me.  e.g.
> > > an app already has an open MAP_DIRECT file, and a third party
> > > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > > to do COW operations. This isn't the app's fault - the kernel should
> > > just fall back transparently to using the page cache for the
> > > MAP_DIRECT app and just keep working, just like it would if it was
> > > using O_DIRECT read/write.
> >
> > There's another option of failing reflink / dedupe with EBUSY if the file
> > is mapped with MAP_DIRECT and the filesystem cannot support reflink &
> > MAP_DIRECT together. But there are downsides to that as well.
>
> Yup, not the least that setting MAP_DIRECT can race with a
> reflink
>