Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-02 Thread Jerome Glisse
On Tue, May 02, 2017 at 02:37:46PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote:
> > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> > > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > > > change ZONE_DEVICE pages to not have an elevated reference count 
> > > > > > > when
> > > > > > > they are created so you can keep the HMM references out of the mm 
> > > > > > > hot
> > > > > > > path?
> > > > > > 
> > > > > > 100% sure on that :) I need to callback into driver for 2->1 
> > > > > > transition
> > > > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > > > reference count that you need to make a lot more change to mm so 
> > > > > > that
> > > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also 
> > > > > > need
> > > > > > to make change to be sure that ZONE_DEVICE page never endup in one 
> > > > > > of
> > > > > > the path that try to put them back on lru. There is a lot of place 
> > > > > > that
> > > > > > would need to be updated and it would be highly intrusive and add a
> > > > > > lot of special cases to other hot code path.
> > > > > 
> > > > > Could you explain more on where the requirement comes from or point 
> > > > > me to
> > > > > where I can read about this.
> > > > > 
> > > > 
> > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
> > > > page)
> > > > in _any_ vma. So i need to know when a page is freed ie either as 
> > > > result of
> > > > unmap, exit or migration or anything that would free the memory. For 
> > > > zone
> > > > device a page is free once its refcount reach 1 so i need to catch 
> > > > refcount
> > > > transition from 2->1
> > > 
> > > What if we would rework zone device to have pages with refcount 0 at
> > > start?
> > 
> > That is a _lot_ of work from top of my head because it would need changes
> > to a lot of places and likely more hot code path that simply adding some-
> > thing to put_page() note that i only need something in put_page() i do not
> > need anything in the get page path. Is adding a conditional branch for
> > HMM pages in put_page() that much of a problem ?
> 
> Well, it gets inlined everywhere. Removing zone_device code from
> get_page() and put_page() saved non-trivial ~140k in vmlinux for
> allyesconfig.
> 
> Re-introducing part this bloat would be unfortunate.
> 
> > > > This is the only way i can inform the device that the page is now free. 
> > > > See
> > > > 
> > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> > > > 
> > > > There is _no_ way around that.
> > > 
> > > I'm still not convinced that it's impossible.
> > > 
> > > Could you describe lifecycle for pages in case of HMM?
> > 
> > Process malloc something, end it over to some function in the program
> > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> > trigger a migration to device memory.
> > 
> > So in the kernel you get a migration like any existing migration,
> > original page is unmap, if refcount is all ok (no pin) then a device
> > page is allocated and thing are migrated to device memory.
> > 
> > What happen after is unknown. Either userspace/kernel driver decide
> > to migrate back to system memory, either there is an munmap, either
> > there is a CPU page fault, ... So from that point on the device page
> > as the exact same life as a regular page.
> > 
> > Above i describe the migrate case, but you can also have new memory
> > allocation that directly allocate device memory. For instance if the
> > GPU do a page fault on an address that isn't back by anything then
> > we can directly allocate a device page. No migration involve in that
> > case.
> > 
> > HMM pages are like any other pages in most respect. Exception are:
> >   - no GUP
> 
> Hm. How do you exclude GUP? And why is it required?

Well it is not forbiden it just can not happen simply because as device
memory is not accessible by CPU then the corresponding CPU page table
entry is a special entry and thus GUP trigger a page fault that migrate
thing back to regular memory.

The why is simply because we need to always be able to migrate back to
regular memory as device memory is not accessible by the CPU. So we can
not allow anyone to pin it.

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-02 Thread Jerome Glisse
On Tue, May 02, 2017 at 02:37:46PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote:
> > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> > > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > > > change ZONE_DEVICE pages to not have an elevated reference count 
> > > > > > > when
> > > > > > > they are created so you can keep the HMM references out of the mm 
> > > > > > > hot
> > > > > > > path?
> > > > > > 
> > > > > > 100% sure on that :) I need to callback into driver for 2->1 
> > > > > > transition
> > > > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > > > reference count that you need to make a lot more change to mm so 
> > > > > > that
> > > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also 
> > > > > > need
> > > > > > to make change to be sure that ZONE_DEVICE page never endup in one 
> > > > > > of
> > > > > > the path that try to put them back on lru. There is a lot of place 
> > > > > > that
> > > > > > would need to be updated and it would be highly intrusive and add a
> > > > > > lot of special cases to other hot code path.
> > > > > 
> > > > > Could you explain more on where the requirement comes from or point 
> > > > > me to
> > > > > where I can read about this.
> > > > > 
> > > > 
> > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
> > > > page)
> > > > in _any_ vma. So i need to know when a page is freed ie either as 
> > > > result of
> > > > unmap, exit or migration or anything that would free the memory. For 
> > > > zone
> > > > device a page is free once its refcount reach 1 so i need to catch 
> > > > refcount
> > > > transition from 2->1
> > > 
> > > What if we would rework zone device to have pages with refcount 0 at
> > > start?
> > 
> > That is a _lot_ of work from top of my head because it would need changes
> > to a lot of places and likely more hot code path that simply adding some-
> > thing to put_page() note that i only need something in put_page() i do not
> > need anything in the get page path. Is adding a conditional branch for
> > HMM pages in put_page() that much of a problem ?
> 
> Well, it gets inlined everywhere. Removing zone_device code from
> get_page() and put_page() saved non-trivial ~140k in vmlinux for
> allyesconfig.
> 
> Re-introducing part this bloat would be unfortunate.
> 
> > > > This is the only way i can inform the device that the page is now free. 
> > > > See
> > > > 
> > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> > > > 
> > > > There is _no_ way around that.
> > > 
> > > I'm still not convinced that it's impossible.
> > > 
> > > Could you describe lifecycle for pages in case of HMM?
> > 
> > Process malloc something, end it over to some function in the program
> > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> > trigger a migration to device memory.
> > 
> > So in the kernel you get a migration like any existing migration,
> > original page is unmap, if refcount is all ok (no pin) then a device
> > page is allocated and thing are migrated to device memory.
> > 
> > What happen after is unknown. Either userspace/kernel driver decide
> > to migrate back to system memory, either there is an munmap, either
> > there is a CPU page fault, ... So from that point on the device page
> > as the exact same life as a regular page.
> > 
> > Above i describe the migrate case, but you can also have new memory
> > allocation that directly allocate device memory. For instance if the
> > GPU do a page fault on an address that isn't back by anything then
> > we can directly allocate a device page. No migration involve in that
> > case.
> > 
> > HMM pages are like any other pages in most respect. Exception are:
> >   - no GUP
> 
> Hm. How do you exclude GUP? And why is it required?

Well it is not forbiden it just can not happen simply because as device
memory is not accessible by CPU then the corresponding CPU page table
entry is a special entry and thus GUP trigger a page fault that migrate
thing back to regular memory.

The why is simply because we need to always be able to migrate back to
regular memory as device memory is not accessible by the CPU. So we can
not allow anyone to pin it.

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-02 Thread Kirill A. Shutemov
On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote:
> On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > > change ZONE_DEVICE pages to not have an elevated reference count 
> > > > > > when
> > > > > > they are created so you can keep the HMM references out of the mm 
> > > > > > hot
> > > > > > path?
> > > > > 
> > > > > 100% sure on that :) I need to callback into driver for 2->1 
> > > > > transition
> > > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > > reference count that you need to make a lot more change to mm so that
> > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > > the path that try to put them back on lru. There is a lot of place 
> > > > > that
> > > > > would need to be updated and it would be highly intrusive and add a
> > > > > lot of special cases to other hot code path.
> > > > 
> > > > Could you explain more on where the requirement comes from or point me 
> > > > to
> > > > where I can read about this.
> > > > 
> > > 
> > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
> > > page)
> > > in _any_ vma. So i need to know when a page is freed ie either as result 
> > > of
> > > unmap, exit or migration or anything that would free the memory. For zone
> > > device a page is free once its refcount reach 1 so i need to catch 
> > > refcount
> > > transition from 2->1
> > 
> > What if we would rework zone device to have pages with refcount 0 at
> > start?
> 
> That is a _lot_ of work from top of my head because it would need changes
> to a lot of places and likely more hot code path that simply adding some-
> thing to put_page() note that i only need something in put_page() i do not
> need anything in the get page path. Is adding a conditional branch for
> HMM pages in put_page() that much of a problem ?

Well, it gets inlined everywhere. Removing zone_device code from
get_page() and put_page() saved non-trivial ~140k in vmlinux for
allyesconfig.

Re-introducing part this bloat would be unfortunate.

> > > This is the only way i can inform the device that the page is now free. 
> > > See
> > > 
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> > > 
> > > There is _no_ way around that.
> > 
> > I'm still not convinced that it's impossible.
> > 
> > Could you describe lifecycle for pages in case of HMM?
> 
> Process malloc something, end it over to some function in the program
> that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> trigger a migration to device memory.
> 
> So in the kernel you get a migration like any existing migration,
> original page is unmap, if refcount is all ok (no pin) then a device
> page is allocated and thing are migrated to device memory.
> 
> What happen after is unknown. Either userspace/kernel driver decide
> to migrate back to system memory, either there is an munmap, either
> there is a CPU page fault, ... So from that point on the device page
> as the exact same life as a regular page.
> 
> Above i describe the migrate case, but you can also have new memory
> allocation that directly allocate device memory. For instance if the
> GPU do a page fault on an address that isn't back by anything then
> we can directly allocate a device page. No migration involve in that
> case.
> 
> HMM pages are like any other pages in most respect. Exception are:
>   - no GUP

Hm. How do you exclude GUP? And why is it required?

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-02 Thread Kirill A. Shutemov
On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote:
> On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > > change ZONE_DEVICE pages to not have an elevated reference count 
> > > > > > when
> > > > > > they are created so you can keep the HMM references out of the mm 
> > > > > > hot
> > > > > > path?
> > > > > 
> > > > > 100% sure on that :) I need to callback into driver for 2->1 
> > > > > transition
> > > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > > reference count that you need to make a lot more change to mm so that
> > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > > the path that try to put them back on lru. There is a lot of place 
> > > > > that
> > > > > would need to be updated and it would be highly intrusive and add a
> > > > > lot of special cases to other hot code path.
> > > > 
> > > > Could you explain more on where the requirement comes from or point me 
> > > > to
> > > > where I can read about this.
> > > > 
> > > 
> > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
> > > page)
> > > in _any_ vma. So i need to know when a page is freed ie either as result 
> > > of
> > > unmap, exit or migration or anything that would free the memory. For zone
> > > device a page is free once its refcount reach 1 so i need to catch 
> > > refcount
> > > transition from 2->1
> > 
> > What if we would rework zone device to have pages with refcount 0 at
> > start?
> 
> That is a _lot_ of work from top of my head because it would need changes
> to a lot of places and likely more hot code path that simply adding some-
> thing to put_page() note that i only need something in put_page() i do not
> need anything in the get page path. Is adding a conditional branch for
> HMM pages in put_page() that much of a problem ?

Well, it gets inlined everywhere. Removing zone_device code from
get_page() and put_page() saved non-trivial ~140k in vmlinux for
allyesconfig.

Re-introducing part this bloat would be unfortunate.

> > > This is the only way i can inform the device that the page is now free. 
> > > See
> > > 
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> > > 
> > > There is _no_ way around that.
> > 
> > I'm still not convinced that it's impossible.
> > 
> > Could you describe lifecycle for pages in case of HMM?
> 
> Process malloc something, end it over to some function in the program
> that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> trigger a migration to device memory.
> 
> So in the kernel you get a migration like any existing migration,
> original page is unmap, if refcount is all ok (no pin) then a device
> page is allocated and thing are migrated to device memory.
> 
> What happen after is unknown. Either userspace/kernel driver decide
> to migrate back to system memory, either there is an munmap, either
> there is a CPU page fault, ... So from that point on the device page
> as the exact same life as a regular page.
> 
> Above i describe the migrate case, but you can also have new memory
> allocation that directly allocate device memory. For instance if the
> GPU do a page fault on an address that isn't back by anything then
> we can directly allocate a device page. No migration involve in that
> case.
> 
> HMM pages are like any other pages in most respect. Exception are:
>   - no GUP

Hm. How do you exclude GUP? And why is it required?

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Jerome Glisse
On Mon, May 01, 2017 at 01:19:24PM -0700, Dan Williams wrote:
> On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse  wrote:
> > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> >> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> >> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> >> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> >> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> >> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> >> > > > > change ZONE_DEVICE pages to not have an elevated reference count 
> >> > > > > when
> >> > > > > they are created so you can keep the HMM references out of the mm 
> >> > > > > hot
> >> > > > > path?
> >> > > >
> >> > > > 100% sure on that :) I need to callback into driver for 2->1 
> >> > > > transition
> >> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> >> > > > reference count that you need to make a lot more change to mm so that
> >> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> >> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> >> > > > the path that try to put them back on lru. There is a lot of place 
> >> > > > that
> >> > > > would need to be updated and it would be highly intrusive and add a
> >> > > > lot of special cases to other hot code path.
> >> > >
> >> > > Could you explain more on where the requirement comes from or point me 
> >> > > to
> >> > > where I can read about this.
> >> > >
> >> >
> >> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
> >> > page)
> >> > in _any_ vma. So i need to know when a page is freed ie either as result 
> >> > of
> >> > unmap, exit or migration or anything that would free the memory. For zone
> >> > device a page is free once its refcount reach 1 so i need to catch 
> >> > refcount
> >> > transition from 2->1
> >>
> >> What if we would rework zone device to have pages with refcount 0 at
> >> start?
> >
> > That is a _lot_ of work from top of my head because it would need changes
> > to a lot of places and likely more hot code path that simply adding some-
> > thing to put_page() note that i only need something in put_page() i do not
> > need anything in the get page path. Is adding a conditional branch for
> > HMM pages in put_page() that much of a problem ?
> >
> >
> >> > This is the only way i can inform the device that the page is now free. 
> >> > See
> >> >
> >> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> >> >
> >> > There is _no_ way around that.
> >>
> >> I'm still not convinced that it's impossible.
> >>
> >> Could you describe lifecycle for pages in case of HMM?
> >
> > Process malloc something, end it over to some function in the program
> > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> > trigger a migration to device memory.
> >
> > So in the kernel you get a migration like any existing migration,
> > original page is unmap, if refcount is all ok (no pin) then a device
> > page is allocated and thing are migrated to device memory.
> >
> > What happen after is unknown. Either userspace/kernel driver decide
> > to migrate back to system memory, either there is an munmap, either
> > there is a CPU page fault, ... So from that point on the device page
> > as the exact same life as a regular page.
> >
> > Above i describe the migrate case, but you can also have new memory
> > allocation that directly allocate device memory. For instance if the
> > GPU do a page fault on an address that isn't back by anything then
> > we can directly allocate a device page. No migration involve in that
> > case.
> >
> > HMM pages are like any other pages in most respect. Exception are:
> >   - no GUP
> >   - no KSM
> >   - no lru reclaim
> >   - no NUMA balancing
> >   - no regular migration (existing migrate_page)
> >
> > The fact that minimum refcount for ZONE_DEVICE is 1 already gives
> > us for free most of the above exception. To convert the refcount to
> > be like other pages would mean that all of the above would need to
> > be audited and probably modify to ignore ZONE_DEVICE pages (i am
> > pretty sure Dan do not want any of the above either).
> 
> Right, adding HMM references to get_page() and put_page() seems less
> intrusive. Given how uncommon HMM hardware is (insert grumble about no
> visible upstream user of this functionality) I think the 'static
> branch' approach helps mitigate the impact for everything else.
> Looking back, I should have used that mechanism for the pmem use case,
> but it's moot now.

I do not need anything in get_page() all i need is something in put_page()
to catch the 2 -> 1 refcount transition to know when a page is freed.

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Jerome Glisse
On Mon, May 01, 2017 at 01:19:24PM -0700, Dan Williams wrote:
> On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse  wrote:
> > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> >> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> >> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> >> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> >> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> >> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> >> > > > > change ZONE_DEVICE pages to not have an elevated reference count 
> >> > > > > when
> >> > > > > they are created so you can keep the HMM references out of the mm 
> >> > > > > hot
> >> > > > > path?
> >> > > >
> >> > > > 100% sure on that :) I need to callback into driver for 2->1 
> >> > > > transition
> >> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> >> > > > reference count that you need to make a lot more change to mm so that
> >> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> >> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> >> > > > the path that try to put them back on lru. There is a lot of place 
> >> > > > that
> >> > > > would need to be updated and it would be highly intrusive and add a
> >> > > > lot of special cases to other hot code path.
> >> > >
> >> > > Could you explain more on where the requirement comes from or point me 
> >> > > to
> >> > > where I can read about this.
> >> > >
> >> >
> >> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
> >> > page)
> >> > in _any_ vma. So i need to know when a page is freed ie either as result 
> >> > of
> >> > unmap, exit or migration or anything that would free the memory. For zone
> >> > device a page is free once its refcount reach 1 so i need to catch 
> >> > refcount
> >> > transition from 2->1
> >>
> >> What if we would rework zone device to have pages with refcount 0 at
> >> start?
> >
> > That is a _lot_ of work from top of my head because it would need changes
> > to a lot of places and likely more hot code path that simply adding some-
> > thing to put_page() note that i only need something in put_page() i do not
> > need anything in the get page path. Is adding a conditional branch for
> > HMM pages in put_page() that much of a problem ?
> >
> >
> >> > This is the only way i can inform the device that the page is now free. 
> >> > See
> >> >
> >> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> >> >
> >> > There is _no_ way around that.
> >>
> >> I'm still not convinced that it's impossible.
> >>
> >> Could you describe lifecycle for pages in case of HMM?
> >
> > Process malloc something, end it over to some function in the program
> > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> > trigger a migration to device memory.
> >
> > So in the kernel you get a migration like any existing migration,
> > original page is unmap, if refcount is all ok (no pin) then a device
> > page is allocated and thing are migrated to device memory.
> >
> > What happen after is unknown. Either userspace/kernel driver decide
> > to migrate back to system memory, either there is an munmap, either
> > there is a CPU page fault, ... So from that point on the device page
> > as the exact same life as a regular page.
> >
> > Above i describe the migrate case, but you can also have new memory
> > allocation that directly allocate device memory. For instance if the
> > GPU do a page fault on an address that isn't back by anything then
> > we can directly allocate a device page. No migration involve in that
> > case.
> >
> > HMM pages are like any other pages in most respect. Exception are:
> >   - no GUP
> >   - no KSM
> >   - no lru reclaim
> >   - no NUMA balancing
> >   - no regular migration (existing migrate_page)
> >
> > The fact that minimum refcount for ZONE_DEVICE is 1 already gives
> > us for free most of the above exception. To convert the refcount to
> > be like other pages would mean that all of the above would need to
> > be audited and probably modify to ignore ZONE_DEVICE pages (i am
> > pretty sure Dan do not want any of the above either).
> 
> Right, adding HMM references to get_page() and put_page() seems less
> intrusive. Given how uncommon HMM hardware is (insert grumble about no
> visible upstream user of this functionality) I think the 'static
> branch' approach helps mitigate the impact for everything else.
> Looking back, I should have used that mechanism for the pmem use case,
> but it's moot now.

I do not need anything in get_page() all i need is something in put_page()
to catch the 2 -> 1 refcount transition to know when a page is freed.

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Dan Williams
On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse  wrote:
> On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
>> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
>> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
>> > > > > they are created so you can keep the HMM references out of the mm hot
>> > > > > path?
>> > > >
>> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
>> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
>> > > > reference count that you need to make a lot more change to mm so that
>> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
>> > > > the path that try to put them back on lru. There is a lot of place that
>> > > > would need to be updated and it would be highly intrusive and add a
>> > > > lot of special cases to other hot code path.
>> > >
>> > > Could you explain more on where the requirement comes from or point me to
>> > > where I can read about this.
>> > >
>> >
>> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
>> > page)
>> > in _any_ vma. So i need to know when a page is freed ie either as result of
>> > unmap, exit or migration or anything that would free the memory. For zone
>> > device a page is free once its refcount reach 1 so i need to catch refcount
>> > transition from 2->1
>>
>> What if we would rework zone device to have pages with refcount 0 at
>> start?
>
> That is a _lot_ of work from top of my head because it would need changes
> to a lot of places and likely more hot code path that simply adding some-
> thing to put_page() note that i only need something in put_page() i do not
> need anything in the get page path. Is adding a conditional branch for
> HMM pages in put_page() that much of a problem ?
>
>
>> > This is the only way i can inform the device that the page is now free. See
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
>> >
>> > There is _no_ way around that.
>>
>> I'm still not convinced that it's impossible.
>>
>> Could you describe lifecycle for pages in case of HMM?
>
> Process malloc something, end it over to some function in the program
> that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> trigger a migration to device memory.
>
> So in the kernel you get a migration like any existing migration,
> original page is unmap, if refcount is all ok (no pin) then a device
> page is allocated and thing are migrated to device memory.
>
> What happen after is unknown. Either userspace/kernel driver decide
> to migrate back to system memory, either there is an munmap, either
> there is a CPU page fault, ... So from that point on the device page
> as the exact same life as a regular page.
>
> Above i describe the migrate case, but you can also have new memory
> allocation that directly allocate device memory. For instance if the
> GPU do a page fault on an address that isn't back by anything then
> we can directly allocate a device page. No migration involve in that
> case.
>
> HMM pages are like any other pages in most respect. Exception are:
>   - no GUP
>   - no KSM
>   - no lru reclaim
>   - no NUMA balancing
>   - no regular migration (existing migrate_page)
>
> The fact that minimum refcount for ZONE_DEVICE is 1 already gives
> us for free most of the above exception. To convert the refcount to
> be like other pages would mean that all of the above would need to
> be audited and probably modify to ignore ZONE_DEVICE pages (i am
> pretty sure Dan do not want any of the above either).

Right, adding HMM references to get_page() and put_page() seems less
intrusive. Given how uncommon HMM hardware is (insert grumble about no
visible upstream user of this functionality) I think the 'static
branch' approach helps mitigate the impact for everything else.
Looking back, I should have used that mechanism for the pmem use case,
but it's moot now.


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Dan Williams
On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse  wrote:
> On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
>> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
>> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
>> > > > > they are created so you can keep the HMM references out of the mm hot
>> > > > > path?
>> > > >
>> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
>> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
>> > > > reference count that you need to make a lot more change to mm so that
>> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
>> > > > the path that try to put them back on lru. There is a lot of place that
>> > > > would need to be updated and it would be highly intrusive and add a
>> > > > lot of special cases to other hot code path.
>> > >
>> > > Could you explain more on where the requirement comes from or point me to
>> > > where I can read about this.
>> > >
>> >
>> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
>> > page)
>> > in _any_ vma. So i need to know when a page is freed ie either as result of
>> > unmap, exit or migration or anything that would free the memory. For zone
>> > device a page is free once its refcount reach 1 so i need to catch refcount
>> > transition from 2->1
>>
>> What if we would rework zone device to have pages with refcount 0 at
>> start?
>
> That is a _lot_ of work from top of my head because it would need changes
> to a lot of places and likely more hot code path that simply adding some-
> thing to put_page() note that i only need something in put_page() i do not
> need anything in the get page path. Is adding a conditional branch for
> HMM pages in put_page() that much of a problem ?
>
>
>> > This is the only way i can inform the device that the page is now free. See
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
>> >
>> > There is _no_ way around that.
>>
>> I'm still not convinced that it's impossible.
>>
>> Could you describe lifecycle for pages in case of HMM?
>
> Process malloc something, end it over to some function in the program
> that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> trigger a migration to device memory.
>
> So in the kernel you get a migration like any existing migration,
> original page is unmap, if refcount is all ok (no pin) then a device
> page is allocated and thing are migrated to device memory.
>
> What happen after is unknown. Either userspace/kernel driver decide
> to migrate back to system memory, either there is an munmap, either
> there is a CPU page fault, ... So from that point on the device page
> as the exact same life as a regular page.
>
> Above i describe the migrate case, but you can also have new memory
> allocation that directly allocate device memory. For instance if the
> GPU do a page fault on an address that isn't back by anything then
> we can directly allocate a device page. No migration involve in that
> case.
>
> HMM pages are like any other pages in most respect. Exception are:
>   - no GUP
>   - no KSM
>   - no lru reclaim
>   - no NUMA balancing
>   - no regular migration (existing migrate_page)
>
> The fact that minimum refcount for ZONE_DEVICE is 1 already gives
> us for free most of the above exception. To convert the refcount to
> be like other pages would mean that all of the above would need to
> be audited and probably modify to ignore ZONE_DEVICE pages (i am
> pretty sure Dan do not want any of the above either).

Right, adding HMM references to get_page() and put_page() seems less
intrusive. Given how uncommon HMM hardware is (insert grumble about no
visible upstream user of this functionality) I think the 'static
branch' approach helps mitigate the impact for everything else.
Looking back, I should have used that mechanism for the pmem use case,
but it's moot now.


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Jerome Glisse
On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > > they are created so you can keep the HMM references out of the mm hot
> > > > > path?
> > > > 
> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > reference count that you need to make a lot more change to mm so that
> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > the path that try to put them back on lru. There is a lot of place that
> > > > would need to be updated and it would be highly intrusive and add a
> > > > lot of special cases to other hot code path.
> > > 
> > > Could you explain more on where the requirement comes from or point me to
> > > where I can read about this.
> > > 
> > 
> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > in _any_ vma. So i need to know when a page is freed ie either as result of
> > unmap, exit or migration or anything that would free the memory. For zone
> > device a page is free once its refcount reach 1 so i need to catch refcount
> > transition from 2->1
> 
> What if we would rework zone device to have pages with refcount 0 at
> start?

That is a _lot_ of work from top of my head because it would need changes
to a lot of places and likely more hot code path that simply adding some-
thing to put_page() note that i only need something in put_page() i do not
need anything in the get page path. Is adding a conditional branch for
HMM pages in put_page() that much of a problem ?


> > This is the only way i can inform the device that the page is now free. See
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> > 
> > There is _no_ way around that.
> 
> I'm still not convinced that it's impossible.
> 
> Could you describe lifecycle for pages in case of HMM?

Process malloc something, end it over to some function in the program
that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
trigger a migration to device memory.

So in the kernel you get a migration like any existing migration,
original page is unmap, if refcount is all ok (no pin) then a device
page is allocated and thing are migrated to device memory.

What happen after is unknown. Either userspace/kernel driver decide
to migrate back to system memory, either there is an munmap, either
there is a CPU page fault, ... So from that point on the device page
as the exact same life as a regular page.

Above i describe the migrate case, but you can also have new memory
allocation that directly allocate device memory. For instance if the
GPU do a page fault on an address that isn't back by anything then
we can directly allocate a device page. No migration involve in that
case.

HMM pages are like any other pages in most respect. Exception are:
  - no GUP
  - no KSM
  - no lru reclaim
  - no NUMA balancing
  - no regular migration (existing migrate_page)

The fact that minimum refcount for ZONE_DEVICE is 1 already gives
us for free most of the above exception. To convert the refcount to
be like other pages would mean that all of the above would need to
be audited and probably modify to ignore ZONE_DEVICE pages (i am
pretty sure Dan do not want any of the above either).

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Jerome Glisse
On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > > they are created so you can keep the HMM references out of the mm hot
> > > > > path?
> > > > 
> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > reference count that you need to make a lot more change to mm so that
> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > the path that try to put them back on lru. There is a lot of place that
> > > > would need to be updated and it would be highly intrusive and add a
> > > > lot of special cases to other hot code path.
> > > 
> > > Could you explain more on where the requirement comes from or point me to
> > > where I can read about this.
> > > 
> > 
> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > in _any_ vma. So i need to know when a page is freed ie either as result of
> > unmap, exit or migration or anything that would free the memory. For zone
> > device a page is free once its refcount reach 1 so i need to catch refcount
> > transition from 2->1
> 
> What if we would rework zone device to have pages with refcount 0 at
> start?

That is a _lot_ of work from top of my head because it would need changes
to a lot of places and likely more hot code path that simply adding some-
thing to put_page() note that i only need something in put_page() i do not
need anything in the get page path. Is adding a conditional branch for
HMM pages in put_page() that much of a problem ?


> > This is the only way i can inform the device that the page is now free. See
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> > 
> > There is _no_ way around that.
> 
> I'm still not convinced that it's impossible.
> 
> Could you describe lifecycle for pages in case of HMM?

Process malloc something, end it over to some function in the program
that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
trigger a migration to device memory.

So in the kernel you get a migration like any existing migration,
original page is unmap, if refcount is all ok (no pin) then a device
page is allocated and thing are migrated to device memory.

What happen after is unknown. Either userspace/kernel driver decide
to migrate back to system memory, either there is an munmap, either
there is a CPU page fault, ... So from that point on the device page
as the exact same life as a regular page.

Above i describe the migrate case, but you can also have new memory
allocation that directly allocate device memory. For instance if the
GPU do a page fault on an address that isn't back by anything then
we can directly allocate a device page. No migration involve in that
case.

HMM pages are like any other pages in most respect. Exception are:
  - no GUP
  - no KSM
  - no lru reclaim
  - no NUMA balancing
  - no regular migration (existing migrate_page)

The fact that minimum refcount for ZONE_DEVICE is 1 already gives
us for free most of the above exception. To convert the refcount to
be like other pages would mean that all of the above would need to
be audited and probably modify to ignore ZONE_DEVICE pages (i am
pretty sure Dan do not want any of the above either).

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Kirill A. Shutemov
On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > they are created so you can keep the HMM references out of the mm hot
> > > > path?
> > > 
> > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > reference count that you need to make a lot more change to mm so that
> > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > the path that try to put them back on lru. There is a lot of place that
> > > would need to be updated and it would be highly intrusive and add a
> > > lot of special cases to other hot code path.
> > 
> > Could you explain more on where the requirement comes from or point me to
> > where I can read about this.
> > 
> 
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1

What if we would rework zone device to have pages with refcount 0 at
start?

> This is the only way i can inform the device that the page is now free. See
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> 
> There is _no_ way around that.

I'm still not convinced that it's impossible.

Could you describe lifecycle for pages in case of HMM?

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Kirill A. Shutemov
On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > they are created so you can keep the HMM references out of the mm hot
> > > > path?
> > > 
> > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > reference count that you need to make a lot more change to mm so that
> > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > the path that try to put them back on lru. There is a lot of place that
> > > would need to be updated and it would be highly intrusive and add a
> > > lot of special cases to other hot code path.
> > 
> > Could you explain more on where the requirement comes from or point me to
> > where I can read about this.
> > 
> 
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1

What if we would rework zone device to have pages with refcount 0 at
start?

> This is the only way i can inform the device that the page is now free. See
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> 
> There is _no_ way around that.

I'm still not convinced that it's impossible.

Could you describe lifecycle for pages in case of HMM?

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Kirill A. Shutemov
On Mon, May 01, 2017 at 09:12:59AM +0200, Ingo Molnar wrote:
> ... is this extension to the changelog correct?

Looks good to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Kirill A. Shutemov
On Mon, May 01, 2017 at 09:12:59AM +0200, Ingo Molnar wrote:
> ... is this extension to the changelog correct?

Looks good to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Ingo Molnar

* Dan Williams  wrote:

> On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar  wrote:
> >
> > * Dan Williams  wrote:
> >
> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> reference to signify that the page is alive and use the final put of the
> >> page to drop that reference.
> >>
> >> This does require some care to make sure that any waits for the
> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> since it now maintains its own elevated reference.
> >>
> >> Cc: Ingo Molnar 
> >> Cc: Jérôme Glisse 
> >> Cc: Andrew Morton 
> >> Reviewed-by: Logan Gunthorpe 
> >> Suggested-by: Kirill Shutemov 
> >> Tested-by: Kirill Shutemov 
> >> Signed-off-by: Dan Williams 
> >
> > This changelog is lacking an explanation about how this solves the crashes 
> > you
> > were seeing.
> 
> Kirill? It wasn't clear to me why the conversion to generic 
> get_user_pages_fast() caused the reference counts to be off.

Ok, the merge window is open and we really need this fix for x86/mm, so this is 
what I've decoded:

 The x86 conversion to the generic GUP code included a small change which causes
 crashes and data corruption in the pmem code - not good.

 The root cause is that the /dev/pmem driver code implicitly relies on the x86
 get_user_pages() implementation doing a get_page() on the page refcount, 
because
 get_page() does a get_zone_device_page() which properly refcounts pmem's 
separate
 page struct arrays that are not present in the regular page struct structures.
 (The pmem driver does this because it can cover huge memory areas.)

 But the x86 conversion to the generic GUP code changed the get_page() to
 page_cache_get_speculative() which is faster but doesn't do the
 get_zone_device_page() call the pmem code relies on.

 One way to solve the regression would be to change the generic GUP code to use 
 get_page(), but that would slow things down a bit and punish other generic-GUP 
 using architectures for an x86-ism they did not care about. (Arguably the pmem 
 driver was probably not working reliably for them: but nvdimm is an Intel
 feature, so non-x86 exposure is probably still limited.)

 So restructure the pmem code's interface with the MM instead: get rid of the 
 get/put_zone_device_page() distinction, integrate put_zone_device_page() into 
 __put_page() and and restructure the pmem completion-wait and teardown 
machinery.

 This speeds up things while also making the pmem refcounting more robust going 
 forward.

... is this extension to the changelog correct?

I'll apply this for the time being - but can still amend the text before 
sending 
it to Linus later today.

Thanks,

Ingo


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-05-01 Thread Ingo Molnar

* Dan Williams  wrote:

> On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar  wrote:
> >
> > * Dan Williams  wrote:
> >
> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> reference to signify that the page is alive and use the final put of the
> >> page to drop that reference.
> >>
> >> This does require some care to make sure that any waits for the
> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> since it now maintains its own elevated reference.
> >>
> >> Cc: Ingo Molnar 
> >> Cc: Jérôme Glisse 
> >> Cc: Andrew Morton 
> >> Reviewed-by: Logan Gunthorpe 
> >> Suggested-by: Kirill Shutemov 
> >> Tested-by: Kirill Shutemov 
> >> Signed-off-by: Dan Williams 
> >
> > This changelog is lacking an explanation about how this solves the crashes 
> > you
> > were seeing.
> 
> Kirill? It wasn't clear to me why the conversion to generic 
> get_user_pages_fast() caused the reference counts to be off.

Ok, the merge window is open and we really need this fix for x86/mm, so this is 
what I've decoded:

 The x86 conversion to the generic GUP code included a small change which causes
 crashes and data corruption in the pmem code - not good.

 The root cause is that the /dev/pmem driver code implicitly relies on the x86
 get_user_pages() implementation doing a get_page() on the page refcount, 
because
 get_page() does a get_zone_device_page() which properly refcounts pmem's 
separate
 page struct arrays that are not present in the regular page struct structures.
 (The pmem driver does this because it can cover huge memory areas.)

 But the x86 conversion to the generic GUP code changed the get_page() to
 page_cache_get_speculative() which is faster but doesn't do the
 get_zone_device_page() call the pmem code relies on.

 One way to solve the regression would be to change the generic GUP code to use 
 get_page(), but that would slow things down a bit and punish other generic-GUP 
 using architectures for an x86-ism they did not care about. (Arguably the pmem 
 driver was probably not working reliably for them: but nvdimm is an Intel
 feature, so non-x86 exposure is probably still limited.)

 So restructure the pmem code's interface with the MM instead: get rid of the 
 get/put_zone_device_page() distinction, integrate put_zone_device_page() into 
 __put_page() and and restructure the pmem completion-wait and teardown 
machinery.

 This speeds up things while also making the pmem refcounting more robust going 
 forward.

... is this extension to the changelog correct?

I'll apply this for the time being - but can still amend the text before 
sending 
it to Linus later today.

Thanks,

Ingo


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Logan Gunthorpe


On 30/04/17 05:14 PM, Jerome Glisse wrote:
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1
> 
> This is the only way i can inform the device that the page is now free. See
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> 
> There is _no_ way around that.

I had a similar issue in a piece of my p2pmem RFC [1]. I hacked around
it by tracking the pages separately and freeing them when the vma is
closed. This is by no means a great solution, it certainly has it's own
warts. However, maybe it will spark some ideas for some alternate
choices which avoid the hot path.

Another thing I briefly looked at was hooking the vma close process
earlier so that it would callback in time that you can loop through the
pages and do your free process. Of course this all depends on the vma
not getting closed while the pages have other references. So it may not
work at all. Again, just ideas.

Logan


[1]
https://github.com/sbates130272/linux-p2pmem/commit/77c631d92cb5c451c9824b3a4cf9b6cddfde6bb7


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Logan Gunthorpe


On 30/04/17 05:14 PM, Jerome Glisse wrote:
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1
> 
> This is the only way i can inform the device that the page is now free. See
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> 
> There is _no_ way around that.

I had a similar issue in a piece of my p2pmem RFC [1]. I hacked around
it by tracking the pages separately and freeing them when the vma is
closed. This is by no means a great solution, it certainly has it's own
warts. However, maybe it will spark some ideas for some alternate
choices which avoid the hot path.

Another thing I briefly looked at was hooking the vma close process
earlier so that it would callback in time that you can loop through the
pages and do your free process. Of course this all depends on the vma
not getting closed while the pages have other references. So it may not
work at all. Again, just ideas.

Logan


[1]
https://github.com/sbates130272/linux-p2pmem/commit/77c631d92cb5c451c9824b3a4cf9b6cddfde6bb7


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Dan Williams
On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar  wrote:
>
> * Dan Williams  wrote:
>
>> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> removed from the mm fast path if we take a single get_dev_pagemap()
>> reference to signify that the page is alive and use the final put of the
>> page to drop that reference.
>>
>> This does require some care to make sure that any waits for the
>> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> since it now maintains its own elevated reference.
>>
>> Cc: Ingo Molnar 
>> Cc: Jérôme Glisse 
>> Cc: Andrew Morton 
>> Reviewed-by: Logan Gunthorpe 
>> Suggested-by: Kirill Shutemov 
>> Tested-by: Kirill Shutemov 
>> Signed-off-by: Dan Williams 
>
> This changelog is lacking an explanation about how this solves the crashes you
> were seeing.
>

Kirill? It wasn't clear to me why the conversion to generic
get_user_pages_fast() caused the reference counts to be off.


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Dan Williams
On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar  wrote:
>
> * Dan Williams  wrote:
>
>> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> removed from the mm fast path if we take a single get_dev_pagemap()
>> reference to signify that the page is alive and use the final put of the
>> page to drop that reference.
>>
>> This does require some care to make sure that any waits for the
>> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> since it now maintains its own elevated reference.
>>
>> Cc: Ingo Molnar 
>> Cc: Jérôme Glisse 
>> Cc: Andrew Morton 
>> Reviewed-by: Logan Gunthorpe 
>> Suggested-by: Kirill Shutemov 
>> Tested-by: Kirill Shutemov 
>> Signed-off-by: Dan Williams 
>
> This changelog is lacking an explanation about how this solves the crashes you
> were seeing.
>

Kirill? It wasn't clear to me why the conversion to generic
get_user_pages_fast() caused the reference counts to be off.


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Dan Williams
On Sun, Apr 30, 2017 at 6:54 PM, Jerome Glisse  wrote:
> On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote:
>> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse  wrote:
>> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> >> > > change ZONE_DEVICE pages to not have an elevated reference count when
>> >> > > they are created so you can keep the HMM references out of the mm hot
>> >> > > path?
>> >> >
>> >> > 100% sure on that :) I need to callback into driver for 2->1 transition
>> >> > no way around that. If we change ZONE_DEVICE to not have an elevated
>> >> > reference count that you need to make a lot more change to mm so that
>> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> >> > to make change to be sure that ZONE_DEVICE page never endup in one of
>> >> > the path that try to put them back on lru. There is a lot of place that
>> >> > would need to be updated and it would be highly intrusive and add a
>> >> > lot of special cases to other hot code path.
>> >>
>> >> Could you explain more on where the requirement comes from or point me to
>> >> where I can read about this.
>> >>
>> >
>> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
>> > page)
>> > in _any_ vma. So i need to know when a page is freed ie either as result of
>> > unmap, exit or migration or anything that would free the memory. For zone
>> > device a page is free once its refcount reach 1 so i need to catch refcount
>> > transition from 2->1
>> >
>> > This is the only way i can inform the device that the page is now free. See
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
>> >
>> > There is _no_ way around that.
>>
>> Ok, but I need to point out that this not a ZONE_DEVICE requirement.
>> This is an HMM-specific need. So, this extra reference counting should
>> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.
>
> And it already is delimited, i think you even gave your review by on
> the patch.
>

I gave my reviewed-by to the patch that leveraged
{get,put}_zone_device_page(), but now those are gone and HMM needs a
new patch. I'm just saying these reference counts should not come back
as {get,put}_zone_device_page() because now they're HMM specific.

>> Can we hide the extra reference counting behind a static branch so
>> that the common case fast path doesn't get slower until a HMM device
>> shows up?
>
> Like i already did With likely()/unlikely() ? Or something else ?
>
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=e84778e9db0672e371eb6599dfcb812512118842

Something different:
http://lxr.free-electrons.com/source/Documentation/static-keys.txt


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Dan Williams
On Sun, Apr 30, 2017 at 6:54 PM, Jerome Glisse  wrote:
> On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote:
>> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse  wrote:
>> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> >> > > change ZONE_DEVICE pages to not have an elevated reference count when
>> >> > > they are created so you can keep the HMM references out of the mm hot
>> >> > > path?
>> >> >
>> >> > 100% sure on that :) I need to callback into driver for 2->1 transition
>> >> > no way around that. If we change ZONE_DEVICE to not have an elevated
>> >> > reference count that you need to make a lot more change to mm so that
>> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> >> > to make change to be sure that ZONE_DEVICE page never endup in one of
>> >> > the path that try to put them back on lru. There is a lot of place that
>> >> > would need to be updated and it would be highly intrusive and add a
>> >> > lot of special cases to other hot code path.
>> >>
>> >> Could you explain more on where the requirement comes from or point me to
>> >> where I can read about this.
>> >>
>> >
>> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back 
>> > page)
>> > in _any_ vma. So i need to know when a page is freed ie either as result of
>> > unmap, exit or migration or anything that would free the memory. For zone
>> > device a page is free once its refcount reach 1 so i need to catch refcount
>> > transition from 2->1
>> >
>> > This is the only way i can inform the device that the page is now free. See
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
>> >
>> > There is _no_ way around that.
>>
>> Ok, but I need to point out that this not a ZONE_DEVICE requirement.
>> This is an HMM-specific need. So, this extra reference counting should
>> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.
>
> And it already is delimited, i think you even gave your review by on
> the patch.
>

I gave my reviewed-by to the patch that leveraged
{get,put}_zone_device_page(), but now those are gone and HMM needs a
new patch. I'm just saying these reference counts should not come back
as {get,put}_zone_device_page() because now they're HMM specific.

>> Can we hide the extra reference counting behind a static branch so
>> that the common case fast path doesn't get slower until a HMM device
>> shows up?
>
> Like i already did With likely()/unlikely() ? Or something else ?
>
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=e84778e9db0672e371eb6599dfcb812512118842

Something different:
http://lxr.free-electrons.com/source/Documentation/static-keys.txt


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Jerome Glisse
On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote:
> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse  wrote:
> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> >> > > change ZONE_DEVICE pages to not have an elevated reference count when
> >> > > they are created so you can keep the HMM references out of the mm hot
> >> > > path?
> >> >
> >> > 100% sure on that :) I need to callback into driver for 2->1 transition
> >> > no way around that. If we change ZONE_DEVICE to not have an elevated
> >> > reference count that you need to make a lot more change to mm so that
> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> >> > to make change to be sure that ZONE_DEVICE page never endup in one of
> >> > the path that try to put them back on lru. There is a lot of place that
> >> > would need to be updated and it would be highly intrusive and add a
> >> > lot of special cases to other hot code path.
> >>
> >> Could you explain more on where the requirement comes from or point me to
> >> where I can read about this.
> >>
> >
> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > in _any_ vma. So i need to know when a page is freed ie either as result of
> > unmap, exit or migration or anything that would free the memory. For zone
> > device a page is free once its refcount reach 1 so i need to catch refcount
> > transition from 2->1
> >
> > This is the only way i can inform the device that the page is now free. See
> >
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> >
> > There is _no_ way around that.
> 
> Ok, but I need to point out that this not a ZONE_DEVICE requirement.
> This is an HMM-specific need. So, this extra reference counting should
> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.

And it already is delimited, i think you even gave your review by on
the patch.


> Can we hide the extra reference counting behind a static branch so
> that the common case fast path doesn't get slower until a HMM device
> shows up?

Like i already did With likely()/unlikely() ? Or something else ?

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=e84778e9db0672e371eb6599dfcb812512118842

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Jerome Glisse
On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote:
> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse  wrote:
> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> >> > > change ZONE_DEVICE pages to not have an elevated reference count when
> >> > > they are created so you can keep the HMM references out of the mm hot
> >> > > path?
> >> >
> >> > 100% sure on that :) I need to callback into driver for 2->1 transition
> >> > no way around that. If we change ZONE_DEVICE to not have an elevated
> >> > reference count that you need to make a lot more change to mm so that
> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> >> > to make change to be sure that ZONE_DEVICE page never endup in one of
> >> > the path that try to put them back on lru. There is a lot of place that
> >> > would need to be updated and it would be highly intrusive and add a
> >> > lot of special cases to other hot code path.
> >>
> >> Could you explain more on where the requirement comes from or point me to
> >> where I can read about this.
> >>
> >
> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > in _any_ vma. So i need to know when a page is freed ie either as result of
> > unmap, exit or migration or anything that would free the memory. For zone
> > device a page is free once its refcount reach 1 so i need to catch refcount
> > transition from 2->1
> >
> > This is the only way i can inform the device that the page is now free. See
> >
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
> >
> > There is _no_ way around that.
> 
> Ok, but I need to point out that this not a ZONE_DEVICE requirement.
> This is an HMM-specific need. So, this extra reference counting should
> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.

And it already is delimited, i think you even gave your review by on
the patch.


> Can we hide the extra reference counting behind a static branch so
> that the common case fast path doesn't get slower until a HMM device
> shows up?

Like i already did With likely()/unlikely() ? Or something else ?

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=e84778e9db0672e371eb6599dfcb812512118842

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Dan Williams
On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse  wrote:
> On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> > > change ZONE_DEVICE pages to not have an elevated reference count when
>> > > they are created so you can keep the HMM references out of the mm hot
>> > > path?
>> >
>> > 100% sure on that :) I need to callback into driver for 2->1 transition
>> > no way around that. If we change ZONE_DEVICE to not have an elevated
>> > reference count that you need to make a lot more change to mm so that
>> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> > to make change to be sure that ZONE_DEVICE page never endup in one of
>> > the path that try to put them back on lru. There is a lot of place that
>> > would need to be updated and it would be highly intrusive and add a
>> > lot of special cases to other hot code path.
>>
>> Could you explain more on where the requirement comes from or point me to
>> where I can read about this.
>>
>
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1
>
> This is the only way i can inform the device that the page is now free. See
>
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
>
> There is _no_ way around that.

Ok, but I need to point out that this not a ZONE_DEVICE requirement.
This is an HMM-specific need. So, this extra reference counting should
be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.

Can we hide the extra reference counting behind a static branch so
that the common case fast path doesn't get slower until a HMM device
shows up?


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Dan Williams
On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse  wrote:
> On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> > > change ZONE_DEVICE pages to not have an elevated reference count when
>> > > they are created so you can keep the HMM references out of the mm hot
>> > > path?
>> >
>> > 100% sure on that :) I need to callback into driver for 2->1 transition
>> > no way around that. If we change ZONE_DEVICE to not have an elevated
>> > reference count that you need to make a lot more change to mm so that
>> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> > to make change to be sure that ZONE_DEVICE page never endup in one of
>> > the path that try to put them back on lru. There is a lot of place that
>> > would need to be updated and it would be highly intrusive and add a
>> > lot of special cases to other hot code path.
>>
>> Could you explain more on where the requirement comes from or point me to
>> where I can read about this.
>>
>
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1
>
> This is the only way i can inform the device that the page is now free. See
>
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d
>
> There is _no_ way around that.

Ok, but I need to point out that this not a ZONE_DEVICE requirement.
This is an HMM-specific need. So, this extra reference counting should
be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.

Can we hide the extra reference counting behind a static branch so
that the common case fast path doesn't get slower until a HMM device
shows up?


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Jerome Glisse
On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > they are created so you can keep the HMM references out of the mm hot
> > > path?
> > 
> > 100% sure on that :) I need to callback into driver for 2->1 transition
> > no way around that. If we change ZONE_DEVICE to not have an elevated
> > reference count that you need to make a lot more change to mm so that
> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > to make change to be sure that ZONE_DEVICE page never endup in one of
> > the path that try to put them back on lru. There is a lot of place that
> > would need to be updated and it would be highly intrusive and add a
> > lot of special cases to other hot code path.
> 
> Could you explain more on where the requirement comes from or point me to
> where I can read about this.
> 

HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
in _any_ vma. So i need to know when a page is freed ie either as result of
unmap, exit or migration or anything that would free the memory. For zone
device a page is free once its refcount reach 1 so i need to catch refcount
transition from 2->1

This is the only way i can inform the device that the page is now free. See

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d

There is _no_ way around that.

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-30 Thread Jerome Glisse
On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > they are created so you can keep the HMM references out of the mm hot
> > > path?
> > 
> > 100% sure on that :) I need to callback into driver for 2->1 transition
> > no way around that. If we change ZONE_DEVICE to not have an elevated
> > reference count that you need to make a lot more change to mm so that
> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > to make change to be sure that ZONE_DEVICE page never endup in one of
> > the path that try to put them back on lru. There is a lot of place that
> > would need to be updated and it would be highly intrusive and add a
> > lot of special cases to other hot code path.
> 
> Could you explain more on where the requirement comes from or point me to
> where I can read about this.
> 

HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
in _any_ vma. So i need to know when a page is freed ie either as result of
unmap, exit or migration or anything that would free the memory. For zone
device a page is free once its refcount reach 1 so i need to catch refcount
transition from 2->1

This is the only way i can inform the device that the page is now free. See

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21=52da8fe1a088b87b5321319add79e43b8372ed7d

There is _no_ way around that.

Cheers,
Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-29 Thread Ingo Molnar

* Dan Williams  wrote:

> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.
> 
> Cc: Ingo Molnar 
> Cc: Jérôme Glisse 
> Cc: Andrew Morton 
> Reviewed-by: Logan Gunthorpe 
> Suggested-by: Kirill Shutemov 
> Tested-by: Kirill Shutemov 
> Signed-off-by: Dan Williams 

This changelog is lacking an explanation about how this solves the crashes you 
were seeing.

Thanks,

Ingo


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-29 Thread Ingo Molnar

* Dan Williams  wrote:

> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.
> 
> Cc: Ingo Molnar 
> Cc: Jérôme Glisse 
> Cc: Andrew Morton 
> Reviewed-by: Logan Gunthorpe 
> Suggested-by: Kirill Shutemov 
> Tested-by: Kirill Shutemov 
> Signed-off-by: Dan Williams 

This changelog is lacking an explanation about how this solves the crashes you 
were seeing.

Thanks,

Ingo


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-29 Thread Kirill A. Shutemov
On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > change ZONE_DEVICE pages to not have an elevated reference count when
> > they are created so you can keep the HMM references out of the mm hot
> > path?
> 
> 100% sure on that :) I need to callback into driver for 2->1 transition
> no way around that. If we change ZONE_DEVICE to not have an elevated
> reference count that you need to make a lot more change to mm so that
> ZONE_DEVICE is never use as fallback for memory allocation. Also need
> to make change to be sure that ZONE_DEVICE page never endup in one of
> the path that try to put them back on lru. There is a lot of place that
> would need to be updated and it would be highly intrusive and add a
> lot of special cases to other hot code path.

Could you explain more on where the requirement comes from or point me to
where I can read about this.

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-29 Thread Kirill A. Shutemov
On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > change ZONE_DEVICE pages to not have an elevated reference count when
> > they are created so you can keep the HMM references out of the mm hot
> > path?
> 
> 100% sure on that :) I need to callback into driver for 2->1 transition
> no way around that. If we change ZONE_DEVICE to not have an elevated
> reference count that you need to make a lot more change to mm so that
> ZONE_DEVICE is never use as fallback for memory allocation. Also need
> to make change to be sure that ZONE_DEVICE page never endup in one of
> the path that try to put them back on lru. There is a lot of place that
> would need to be updated and it would be highly intrusive and add a
> lot of special cases to other hot code path.

Could you explain more on where the requirement comes from or point me to
where I can read about this.

-- 
 Kirill A. Shutemov


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse  wrote:
> >> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
> >> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse 
> >> >> wrote:
> >> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> >> >> reference to signify that the page is alive and use the final put of
> >> >> >> the
> >> >> >> page to drop that reference.
> >> >> >>
> >> >> >> This does require some care to make sure that any waits for the
> >> >> >> percpu_ref to drop to zero occur *after* 
> >> >> >> devm_memremap_page_release(),
> >> >> >> since it now maintains its own elevated reference.
> >> >> >
> >> >> > This is NAK from HMM point of view as i need those call. So if you
> >> >> > remove
> >> >> > them now i will need to add them back as part of HMM.
> >> >>
> >> >> I thought you only need them at page free time? You can still hook
> >> >> __put_page().
> >> >
> >> > No, i need a hook when page refcount reach 1, not 0. That being said
> >> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
> >> > patch is fine from HMM point of view but i definitly need to hook my-
> >> > self in the general put_page() function.
> >> >
> >> > So i will have to undo part of this patch for HMM (put_page() will
> >> > need to handle ZONE_DEVICE page differently).
> >>
> >> Ok, I'd rather this go in now since it fixes the existing use case,
> >> and unblocks the get_user_pages_fast() conversion to generic code.
> >> That also gives Kirill and -mm folks a chance to review what HMM wants
> >> to do on top of the page_ref infrastructure.  The
> >> {get,put}_zone_device_page interface went in in 4.5 right before
> >> page_ref went in during 4.6, so it was just an oversight that
> >> {get,put}_zone_device_page were not removed earlier.
> >>
> >
> > I don't mind this going in, i am hopping people won't ignore HMM patchset
> > once i repost after 4.12 merge window. Note that there is absolutely no way
> > around me hooking up inside put_page(). The only other way to do it would
> > be to modify virtualy all places that call that function to handle 
> > ZONE_DEVICE
> > case.
> 
> Are you sure about needing to hook the 2 -> 1 transition? Could we
> change ZONE_DEVICE pages to not have an elevated reference count when
> they are created so you can keep the HMM references out of the mm hot
> path?

100% sure on that :) I need to callback into driver for 2->1 transition
no way around that. If we change ZONE_DEVICE to not have an elevated
reference count that you need to make a lot more change to mm so that
ZONE_DEVICE is never use as fallback for memory allocation. Also need
to make change to be sure that ZONE_DEVICE page never endup in one of
the path that try to put them back on lru. There is a lot of place that
would need to be updated and it would be highly intrusive and add a
lot of special cases to other hot code path.

Maybe i over estimate the amount of work but from top of my head it is
far from being trivial.

Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse  wrote:
> >> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
> >> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse 
> >> >> wrote:
> >> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> >> >> reference to signify that the page is alive and use the final put of
> >> >> >> the
> >> >> >> page to drop that reference.
> >> >> >>
> >> >> >> This does require some care to make sure that any waits for the
> >> >> >> percpu_ref to drop to zero occur *after* 
> >> >> >> devm_memremap_page_release(),
> >> >> >> since it now maintains its own elevated reference.
> >> >> >
> >> >> > This is NAK from HMM point of view as i need those call. So if you
> >> >> > remove
> >> >> > them now i will need to add them back as part of HMM.
> >> >>
> >> >> I thought you only need them at page free time? You can still hook
> >> >> __put_page().
> >> >
> >> > No, i need a hook when page refcount reach 1, not 0. That being said
> >> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
> >> > patch is fine from HMM point of view but i definitly need to hook my-
> >> > self in the general put_page() function.
> >> >
> >> > So i will have to undo part of this patch for HMM (put_page() will
> >> > need to handle ZONE_DEVICE page differently).
> >>
> >> Ok, I'd rather this go in now since it fixes the existing use case,
> >> and unblocks the get_user_pages_fast() conversion to generic code.
> >> That also gives Kirill and -mm folks a chance to review what HMM wants
> >> to do on top of the page_ref infrastructure.  The
> >> {get,put}_zone_device_page interface went in in 4.5 right before
> >> page_ref went in during 4.6, so it was just an oversight that
> >> {get,put}_zone_device_page were not removed earlier.
> >>
> >
> > I don't mind this going in, i am hopping people won't ignore HMM patchset
> > once i repost after 4.12 merge window. Note that there is absolutely no way
> > around me hooking up inside put_page(). The only other way to do it would
> > be to modify virtualy all places that call that function to handle 
> > ZONE_DEVICE
> > case.
> 
> Are you sure about needing to hook the 2 -> 1 transition? Could we
> change ZONE_DEVICE pages to not have an elevated reference count when
> they are created so you can keep the HMM references out of the mm hot
> path?

100% sure on that :) I need to callback into driver for 2->1 transition
no way around that. If we change ZONE_DEVICE to not have an elevated
reference count that you need to make a lot more change to mm so that
ZONE_DEVICE is never use as fallback for memory allocation. Also need
to make change to be sure that ZONE_DEVICE page never endup in one of
the path that try to put them back on lru. There is a lot of place that
would need to be updated and it would be highly intrusive and add a
lot of special cases to other hot code path.

Maybe i over estimate the amount of work but from top of my head it is
far from being trivial.

Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Dan Williams
On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse  wrote:
>> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
>> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse 
>> >> wrote:
>> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
>> >> >> reference to signify that the page is alive and use the final put of
>> >> >> the
>> >> >> page to drop that reference.
>> >> >>
>> >> >> This does require some care to make sure that any waits for the
>> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> >> >> since it now maintains its own elevated reference.
>> >> >
>> >> > This is NAK from HMM point of view as i need those call. So if you
>> >> > remove
>> >> > them now i will need to add them back as part of HMM.
>> >>
>> >> I thought you only need them at page free time? You can still hook
>> >> __put_page().
>> >
>> > No, i need a hook when page refcount reach 1, not 0. That being said
>> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
>> > patch is fine from HMM point of view but i definitly need to hook my-
>> > self in the general put_page() function.
>> >
>> > So i will have to undo part of this patch for HMM (put_page() will
>> > need to handle ZONE_DEVICE page differently).
>>
>> Ok, I'd rather this go in now since it fixes the existing use case,
>> and unblocks the get_user_pages_fast() conversion to generic code.
>> That also gives Kirill and -mm folks a chance to review what HMM wants
>> to do on top of the page_ref infrastructure.  The
>> {get,put}_zone_device_page interface went in in 4.5 right before
>> page_ref went in during 4.6, so it was just an oversight that
>> {get,put}_zone_device_page were not removed earlier.
>>
>
> I don't mind this going in, i am hopping people won't ignore HMM patchset
> once i repost after 4.12 merge window. Note that there is absolutely no way
> around me hooking up inside put_page(). The only other way to do it would
> be to modify virtualy all places that call that function to handle ZONE_DEVICE
> case.

Are you sure about needing to hook the 2 -> 1 transition? Could we
change ZONE_DEVICE pages to not have an elevated reference count when
they are created so you can keep the HMM references out of the mm hot
path?


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Dan Williams
On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse  wrote:
>> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
>> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse 
>> >> wrote:
>> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
>> >> >> reference to signify that the page is alive and use the final put of
>> >> >> the
>> >> >> page to drop that reference.
>> >> >>
>> >> >> This does require some care to make sure that any waits for the
>> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> >> >> since it now maintains its own elevated reference.
>> >> >
>> >> > This is NAK from HMM point of view as i need those call. So if you
>> >> > remove
>> >> > them now i will need to add them back as part of HMM.
>> >>
>> >> I thought you only need them at page free time? You can still hook
>> >> __put_page().
>> >
>> > No, i need a hook when page refcount reach 1, not 0. That being said
>> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
>> > patch is fine from HMM point of view but i definitly need to hook my-
>> > self in the general put_page() function.
>> >
>> > So i will have to undo part of this patch for HMM (put_page() will
>> > need to handle ZONE_DEVICE page differently).
>>
>> Ok, I'd rather this go in now since it fixes the existing use case,
>> and unblocks the get_user_pages_fast() conversion to generic code.
>> That also gives Kirill and -mm folks a chance to review what HMM wants
>> to do on top of the page_ref infrastructure.  The
>> {get,put}_zone_device_page interface went in in 4.5 right before
>> page_ref went in during 4.6, so it was just an oversight that
>> {get,put}_zone_device_page were not removed earlier.
>>
>
> I don't mind this going in, i am hopping people won't ignore HMM patchset
> once i repost after 4.12 merge window. Note that there is absolutely no way
> around me hooking up inside put_page(). The only other way to do it would
> be to modify virtualy all places that call that function to handle ZONE_DEVICE
> case.

Are you sure about needing to hook the 2 -> 1 transition? Could we
change ZONE_DEVICE pages to not have an elevated reference count when
they are created so you can keep the HMM references out of the mm hot
path?


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse 
> >> wrote:
> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> >> reference to signify that the page is alive and use the final put of
> >> >> the
> >> >> page to drop that reference.
> >> >>
> >> >> This does require some care to make sure that any waits for the
> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> >> since it now maintains its own elevated reference.
> >> >
> >> > This is NAK from HMM point of view as i need those call. So if you
> >> > remove
> >> > them now i will need to add them back as part of HMM.
> >>
> >> I thought you only need them at page free time? You can still hook
> >> __put_page().
> >
> > No, i need a hook when page refcount reach 1, not 0. That being said
> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
> > patch is fine from HMM point of view but i definitly need to hook my-
> > self in the general put_page() function.
> >
> > So i will have to undo part of this patch for HMM (put_page() will
> > need to handle ZONE_DEVICE page differently).
> 
> Ok, I'd rather this go in now since it fixes the existing use case,
> and unblocks the get_user_pages_fast() conversion to generic code.
> That also gives Kirill and -mm folks a chance to review what HMM wants
> to do on top of the page_ref infrastructure.  The
> {get,put}_zone_device_page interface went in in 4.5 right before
> page_ref went in during 4.6, so it was just an oversight that
> {get,put}_zone_device_page were not removed earlier.
> 

I don't mind this going in, i am hopping people won't ignore HMM patchset
once i repost after 4.12 merge window. Note that there is absolutely no way
around me hooking up inside put_page(). The only other way to do it would
be to modify virtualy all places that call that function to handle ZONE_DEVICE
case.

Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse 
> >> wrote:
> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> >> reference to signify that the page is alive and use the final put of
> >> >> the
> >> >> page to drop that reference.
> >> >>
> >> >> This does require some care to make sure that any waits for the
> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> >> since it now maintains its own elevated reference.
> >> >
> >> > This is NAK from HMM point of view as i need those call. So if you
> >> > remove
> >> > them now i will need to add them back as part of HMM.
> >>
> >> I thought you only need them at page free time? You can still hook
> >> __put_page().
> >
> > No, i need a hook when page refcount reach 1, not 0. That being said
> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
> > patch is fine from HMM point of view but i definitly need to hook my-
> > self in the general put_page() function.
> >
> > So i will have to undo part of this patch for HMM (put_page() will
> > need to handle ZONE_DEVICE page differently).
> 
> Ok, I'd rather this go in now since it fixes the existing use case,
> and unblocks the get_user_pages_fast() conversion to generic code.
> That also gives Kirill and -mm folks a chance to review what HMM wants
> to do on top of the page_ref infrastructure.  The
> {get,put}_zone_device_page interface went in in 4.5 right before
> page_ref went in during 4.6, so it was just an oversight that
> {get,put}_zone_device_page were not removed earlier.
> 

I don't mind this going in, i am hopping people won't ignore HMM patchset
once i repost after 4.12 merge window. Note that there is absolutely no way
around me hooking up inside put_page(). The only other way to do it would
be to modify virtualy all places that call that function to handle ZONE_DEVICE
case.

Jérôme


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Dan Williams
On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
>> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse  wrote:
>> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> >> removed from the mm fast path if we take a single get_dev_pagemap()
>> >> reference to signify that the page is alive and use the final put of the
>> >> page to drop that reference.
>> >>
>> >> This does require some care to make sure that any waits for the
>> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> >> since it now maintains its own elevated reference.
>> >
>> > This is NAK from HMM point of view as i need those call. So if you remove
>> > them now i will need to add them back as part of HMM.
>>
>> I thought you only need them at page free time? You can still hook
>> __put_page().
>
> No, i need a hook when page refcount reach 1, not 0. That being said
> i don't care about put_dev_pagemap(page->pgmap); so that part of the
> patch is fine from HMM point of view but i definitly need to hook my-
> self in the general put_page() function.
>
> So i will have to undo part of this patch for HMM (put_page() will
> need to handle ZONE_DEVICE page differently).

Ok, I'd rather this go in now since it fixes the existing use case,
and unblocks the get_user_pages_fast() conversion to generic code.
That also gives Kirill and -mm folks a chance to review what HMM wants
to do on top of the page_ref infrastructure.  The
{get,put}_zone_device_page interface went in in 4.5 right before
page_ref went in during 4.6, so it was just an oversight that
{get,put}_zone_device_page were not removed earlier.


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Dan Williams
On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse  wrote:
>> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse  wrote:
>> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> >> removed from the mm fast path if we take a single get_dev_pagemap()
>> >> reference to signify that the page is alive and use the final put of the
>> >> page to drop that reference.
>> >>
>> >> This does require some care to make sure that any waits for the
>> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> >> since it now maintains its own elevated reference.
>> >
>> > This is NAK from HMM point of view as i need those call. So if you remove
>> > them now i will need to add them back as part of HMM.
>>
>> I thought you only need them at page free time? You can still hook
>> __put_page().
>
> No, i need a hook when page refcount reach 1, not 0. That being said
> i don't care about put_dev_pagemap(page->pgmap); so that part of the
> patch is fine from HMM point of view but i definitly need to hook my-
> self in the general put_page() function.
>
> So i will have to undo part of this patch for HMM (put_page() will
> need to handle ZONE_DEVICE page differently).

Ok, I'd rather this go in now since it fixes the existing use case,
and unblocks the get_user_pages_fast() conversion to generic code.
That also gives Kirill and -mm folks a chance to review what HMM wants
to do on top of the page_ref infrastructure.  The
{get,put}_zone_device_page interface went in in 4.5 right before
page_ref went in during 4.6, so it was just an oversight that
{get,put}_zone_device_page were not removed earlier.


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse  wrote:
> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> reference to signify that the page is alive and use the final put of the
> >> page to drop that reference.
> >>
> >> This does require some care to make sure that any waits for the
> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> since it now maintains its own elevated reference.
> >
> > This is NAK from HMM point of view as i need those call. So if you remove
> > them now i will need to add them back as part of HMM.
> 
> I thought you only need them at page free time? You can still hook
> __put_page().

No, i need a hook when page refcount reach 1, not 0. That being said
i don't care about put_dev_pagemap(page->pgmap); so that part of the
patch is fine from HMM point of view but i definitly need to hook my-
self in the general put_page() function.

So i will have to undo part of this patch for HMM (put_page() will
need to handle ZONE_DEVICE page differently).

Cheers,
Jérôme 


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse  wrote:
> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> reference to signify that the page is alive and use the final put of the
> >> page to drop that reference.
> >>
> >> This does require some care to make sure that any waits for the
> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> since it now maintains its own elevated reference.
> >
> > This is NAK from HMM point of view as i need those call. So if you remove
> > them now i will need to add them back as part of HMM.
> 
> I thought you only need them at page free time? You can still hook
> __put_page().

No, i need a hook when page refcount reach 1, not 0. That being said
i don't care about put_dev_pagemap(page->pgmap); so that part of the
patch is fine from HMM point of view but i definitly need to hook my-
self in the general put_page() function.

So i will have to undo part of this patch for HMM (put_page() will
need to handle ZONE_DEVICE page differently).

Cheers,
Jérôme 


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Dan Williams
On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse  wrote:
>> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> removed from the mm fast path if we take a single get_dev_pagemap()
>> reference to signify that the page is alive and use the final put of the
>> page to drop that reference.
>>
>> This does require some care to make sure that any waits for the
>> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> since it now maintains its own elevated reference.
>
> This is NAK from HMM point of view as i need those call. So if you remove
> them now i will need to add them back as part of HMM.

I thought you only need them at page free time? You can still hook __put_page().


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Dan Williams
On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse  wrote:
>> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> removed from the mm fast path if we take a single get_dev_pagemap()
>> reference to signify that the page is alive and use the final put of the
>> page to drop that reference.
>>
>> This does require some care to make sure that any waits for the
>> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> since it now maintains its own elevated reference.
>
> This is NAK from HMM point of view as i need those call. So if you remove
> them now i will need to add them back as part of HMM.

I thought you only need them at page free time? You can still hook __put_page().


Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.

This is NAK from HMM point of view as i need those call. So if you remove
them now i will need to add them back as part of HMM.

Cheers,
Jérôme

> 
> Cc: Ingo Molnar 
> Cc: Jérôme Glisse 
> Cc: Andrew Morton 
> Reviewed-by: Logan Gunthorpe 
> Suggested-by: Kirill Shutemov 
> Tested-by: Kirill Shutemov 
> Signed-off-by: Dan Williams 
> ---
> Changes in v2:
> * Rebased to tip/master
> * Clarified comment in __put_page
> * Clarified devm_memremap_pages() kernel doc about ordering of
>   devm_memremap_pages_release() vs percpu_ref_kill() vs wait for
>   percpu_ref to drop to zero.
> 
> Ingo, I retested this with a revert of commit 6dd29b3df975 "Revert
> 'x86/mm/gup: Switch GUP to the generic get_user_page_fast()
> implementation'". It should be good to go through x86/mm.
> 
>  drivers/dax/pmem.c|2 +-
>  drivers/nvdimm/pmem.c |   13 +++--
>  include/linux/mm.h|   14 --
>  kernel/memremap.c |   22 +-
>  mm/swap.c |   10 ++
>  5 files changed, 31 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index 033f49b31fdc..cb0d742fa23f 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
>   struct dax_pmem *dax_pmem = to_dax_pmem(ref);
>  
>   dev_dbg(dax_pmem->dev, "%s\n", __func__);
> + wait_for_completion(_pmem->cmp);
>   percpu_ref_exit(ref);
>  }
>  
> @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
>  
>   dev_dbg(dax_pmem->dev, "%s\n", __func__);
>   percpu_ref_kill(ref);
> - wait_for_completion(_pmem->cmp);
>  }
>  
>  static int dax_pmem_probe(struct device *dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 5b536be5a12e..fb7bbc79ac26 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -231,6 +232,11 @@ static void pmem_release_queue(void *q)
>   blk_cleanup_queue(q);
>  }
>  
> +static void pmem_freeze_queue(void *q)
> +{
> + blk_mq_freeze_queue_start(q);
> +}
> +
>  static void pmem_release_disk(void *disk)
>  {
>   del_gendisk(disk);
> @@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev,
>   if (!q)
>   return -ENOMEM;
>  
> + if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> + return -ENOMEM;
> +
>   pmem->pfn_flags = PFN_DEV;
>   if (is_nd_pfn(dev)) {
>   addr = devm_memremap_pages(dev, _res, >q_usage_counter,
> @@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->size, ARCH_MEMREMAP_PMEM);
>  
>   /*
> -  * At release time the queue must be dead before
> +  * At release time the queue must be frozen before
>* devm_memremap_pages is unwound
>*/
> - if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
>   return -ENOMEM;
>  
>   if (IS_ERR(addr))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a835edd2db34..695da2a19b4c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct
> page *page)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void get_zone_device_page(struct page *page);
> -void put_zone_device_page(struct page *page);
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>   return page_zonenum(page) == ZONE_DEVICE;
>  }
>  #else
> -static inline void get_zone_device_page(struct page *page)
> -{
> -}
> -static inline void put_zone_device_page(struct page *page)
> -{
> -}
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>   return false;
> @@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
>*/
>   VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>   page_ref_inc(page);
> -
> - if (unlikely(is_zone_device_page(page)))
> - get_zone_device_page(page);
>  }
>  
>  static inline void put_page(struct page *page)
> @@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
>  
>   if (put_page_testzero(page))
>   __put_page(page);
> -
> - if 

Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference

2017-04-28 Thread Jerome Glisse
> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.

This is NAK from HMM point of view as i need those call. So if you remove
them now i will need to add them back as part of HMM.

Cheers,
Jérôme

> 
> Cc: Ingo Molnar 
> Cc: Jérôme Glisse 
> Cc: Andrew Morton 
> Reviewed-by: Logan Gunthorpe 
> Suggested-by: Kirill Shutemov 
> Tested-by: Kirill Shutemov 
> Signed-off-by: Dan Williams 
> ---
> Changes in v2:
> * Rebased to tip/master
> * Clarified comment in __put_page
> * Clarified devm_memremap_pages() kernel doc about ordering of
>   devm_memremap_pages_release() vs percpu_ref_kill() vs wait for
>   percpu_ref to drop to zero.
> 
> Ingo, I retested this with a revert of commit 6dd29b3df975 "Revert
> 'x86/mm/gup: Switch GUP to the generic get_user_page_fast()
> implementation'". It should be good to go through x86/mm.
> 
>  drivers/dax/pmem.c|2 +-
>  drivers/nvdimm/pmem.c |   13 +++--
>  include/linux/mm.h|   14 --
>  kernel/memremap.c |   22 +-
>  mm/swap.c |   10 ++
>  5 files changed, 31 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index 033f49b31fdc..cb0d742fa23f 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
>   struct dax_pmem *dax_pmem = to_dax_pmem(ref);
>  
>   dev_dbg(dax_pmem->dev, "%s\n", __func__);
> + wait_for_completion(_pmem->cmp);
>   percpu_ref_exit(ref);
>  }
>  
> @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
>  
>   dev_dbg(dax_pmem->dev, "%s\n", __func__);
>   percpu_ref_kill(ref);
> - wait_for_completion(_pmem->cmp);
>  }
>  
>  static int dax_pmem_probe(struct device *dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 5b536be5a12e..fb7bbc79ac26 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -231,6 +232,11 @@ static void pmem_release_queue(void *q)
>   blk_cleanup_queue(q);
>  }
>  
> +static void pmem_freeze_queue(void *q)
> +{
> + blk_mq_freeze_queue_start(q);
> +}
> +
>  static void pmem_release_disk(void *disk)
>  {
>   del_gendisk(disk);
> @@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev,
>   if (!q)
>   return -ENOMEM;
>  
> + if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> + return -ENOMEM;
> +
>   pmem->pfn_flags = PFN_DEV;
>   if (is_nd_pfn(dev)) {
>   addr = devm_memremap_pages(dev, _res, >q_usage_counter,
> @@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->size, ARCH_MEMREMAP_PMEM);
>  
>   /*
> -  * At release time the queue must be dead before
> +  * At release time the queue must be frozen before
>* devm_memremap_pages is unwound
>*/
> - if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
>   return -ENOMEM;
>  
>   if (IS_ERR(addr))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a835edd2db34..695da2a19b4c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct
> page *page)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void get_zone_device_page(struct page *page);
> -void put_zone_device_page(struct page *page);
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>   return page_zonenum(page) == ZONE_DEVICE;
>  }
>  #else
> -static inline void get_zone_device_page(struct page *page)
> -{
> -}
> -static inline void put_zone_device_page(struct page *page)
> -{
> -}
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>   return false;
> @@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
>*/
>   VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>   page_ref_inc(page);
> -
> - if (unlikely(is_zone_device_page(page)))
> - get_zone_device_page(page);
>  }
>  
>  static inline void put_page(struct page *page)
> @@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
>  
>   if (put_page_testzero(page))
>   __put_page(page);
> -
> - if (unlikely(is_zone_device_page(page)))
> - put_zone_device_page(page);
>  }
>  
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git