Re: [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-30 Thread Dan Williams
Alistair Popple wrote:
> 
> Dan Williams  writes:
> 
> > Alistair Popple wrote:
> >>
> >> Jason Gunthorpe  writes:
> >>
> >> > On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
> >> >> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
> >> >> refcount") device private pages have no longer had an extra reference
> >> >> count when the page is in use. However, before handing them back to the
> >> >> owning device driver we add an extra reference count such that free
> >> >> pages have a reference count of one.
> >> >>
> >> >> This makes it difficult to tell if a page is free or not because both
> >> >> free and in-use pages will have a non-zero refcount. Instead we should
> >> >> return pages to the driver's page allocator with a zero reference count.
> >> >> Kernel code can then safely use kernel functions such as
> >> >> get_page_unless_zero().
> >> >>
> >> >> Signed-off-by: Alistair Popple 
> >> >> ---
> >> >>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
> >> >>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
> >> >>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
> >> >>  lib/test_hmm.c   | 1 +
> >> >>  mm/memremap.c| 5 -
> >> >>  mm/page_alloc.c  | 6 ++
> >> >>  6 files changed, 10 insertions(+), 5 deletions(-)
> >> >
> >> > I think this is a great idea, but I'm surprised no dax stuff is
> >> > touched here?
> >>
> >> free_zone_device_page() shouldn't be called for pgmap->type ==
> >> MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX
> >> there. Except that the folio code looks like it might have introduced a
> >> bug. AFAICT put_page() always calls
> >> put_devmap_managed_page(&folio->page) but folio_put() does not (although
> >> folios_put() does!). So it seems folio_put() won't end up calling
> >> __put_devmap_managed_page_refs() as I think it should.
> >>
> >> I think you're right about the change to __init_zone_device_page() - I
> >> should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to
> >> look at Dan's patch series more closely as I suspect it might be better
> >> to rebase this patch on top of that.
> >
> > Apologies for the delay I was travelling the past few days. Yes, I think
> > this patch slots in nicely to avoid the introduction of an init_mode
> > [1]:
> >
> > https://lore.kernel.org/nvdimm/166329940343.2786261.6047770378829215962.st...@dwillia2-xfh.jf.intel.com/
> >
> > Mind if I steal it into my series?
> 
> No problem, although I notice Andrew has already merged it into
> mm-unstable. If you end up rebasing your series on top of mine I think
> all that's needed is a patch somewhere in your series to drop the
> various `if (pgmap->type == MEMORY_DEVICE_*)` I added to (hopefully)
> avoid breaking DAX. Assuming DAX takes a pagemap reference on struct
> page allocation, something like the below.

Yeah, I'll go that route and rebase on top of -mm.

Thanks again.
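
For illustration, with free device-private pages held at refcount zero as
the patch above arranges, "is this page in use?" reduces to a tryget. A
minimal sketch; driver_page_in_use() is a hypothetical helper, not part of
the patch:

/*
 * Minimal sketch, not from the patch: driver_page_in_use() is a
 * hypothetical helper.  With free pages held at refcount zero, the
 * tryget fails for a free page and succeeds for an in-use one.
 */
static bool driver_page_in_use(struct page *page)
{
	if (!get_page_unless_zero(page))
		return false;		/* free: refcount was zero */
	put_page(page);			/* drop the temporary reference */
	return true;
}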


Re: [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-29 Thread Dan Williams
Alistair Popple wrote:
> 
> Jason Gunthorpe  writes:
> 
> > On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
> >> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
> >> refcount") device private pages have no longer had an extra reference
> >> count when the page is in use. However, before handing them back to the
> >> owning device driver we add an extra reference count such that free
> >> pages have a reference count of one.
> >>
> >> This makes it difficult to tell if a page is free or not because both
> >> free and in-use pages will have a non-zero refcount. Instead we should
> >> return pages to the driver's page allocator with a zero reference count.
> >> Kernel code can then safely use kernel functions such as
> >> get_page_unless_zero().
> >>
> >> Signed-off-by: Alistair Popple 
> >> ---
> >>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
> >>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
> >>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
> >>  lib/test_hmm.c   | 1 +
> >>  mm/memremap.c| 5 -
> >>  mm/page_alloc.c  | 6 ++
> >>  6 files changed, 10 insertions(+), 5 deletions(-)
> >
> > I think this is a great idea, but I'm surprised no dax stuff is
> > touched here?
> 
> free_zone_device_page() shouldn't be called for pgmap->type ==
> MEMORY_DEVICE_FS_DAX so I don't think we should have to worry about DAX
> there. Except that the folio code looks like it might have introduced a
> bug. AFAICT put_page() always calls
> put_devmap_managed_page(&folio->page) but folio_put() does not (although
> folios_put() does!). So it seems folio_put() won't end up calling
> __put_devmap_managed_page_refs() as I think it should.
> 
> I think you're right about the change to __init_zone_device_page() - I
> should limit it to DEVICE_PRIVATE/COHERENT pages only. But I need to
> look at Dan's patch series more closely as I suspect it might be better
> to rebase this patch on top of that.

Apologies for the delay I was travelling the past few days. Yes, I think
this patch slots in nicely to avoid the introduction of an init_mode
[1]:

https://lore.kernel.org/nvdimm/166329940343.2786261.6047770378829215962.st...@dwillia2-xfh.jf.intel.com/

Mind if I steal it into my series?


Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-08 Thread Dan Williams
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
[..]
> @@ -500,28 +482,27 @@ void free_devmap_managed_page(struct page *page)
>  */
> page->mapping = NULL;
> page->pgmap->ops->page_free(page);
> +
> +   /*
> +    * Reset the page count to 1 to prepare for handing out the page again.
> +    */
> +   set_page_count(page, 1);

Interesting. I had expected that, to really fix the refcount problem,
fs/dax.c would need to start taking real page references as pages were
added to a mapping, just like the page cache.

This looks ok to me, and passes my tests. So, given I'm still working
my way back to fixing the references properly, I'm ok with this hack
replacing the more broken hack that is there presently.

Reviewed-by: Dan Williams 
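
For context, the reset-to-1 above means the allocation side can hand the
page out already carrying its first reference. A sketch under that
assumption (the free-list type and pop helper are hypothetical, not kernel
API):

/* Sketch only: devmap_free_list and devmap_free_list_pop() are
 * hypothetical driver-side constructs, not kernel API. */
struct devmap_free_list;
struct page *devmap_free_list_pop(struct devmap_free_list *fl);

static struct page *devmap_alloc_page(struct devmap_free_list *fl)
{
	struct page *page = devmap_free_list_pop(fl);

	if (!page)
		return NULL;
	/* free_devmap_managed_page() left the count at 1 */
	WARN_ON_ONCE(page_count(page) != 1);
	return page;
}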



Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-08 Thread Dan Williams
On Mon, Feb 7, 2022 at 3:49 PM Dan Williams  wrote:
>
> On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
> >
> > Move the check for the actual pgmap types that need the free at refcount
> > one behavior into the out of line helper, and thus avoid the need to
> > pull memremap.h into mm.h.
>
> Looks good to me assuming the compile bots agree.
>
> Reviewed-by: Dan Williams 

Yeah, same as Logan:

mm/memcontrol.c: In function ‘get_mctgt_type’:
mm/memcontrol.c:5724:29: error: implicit declaration of function
‘is_device_private_page’; did you mean
‘is_device_private_entry’? [-Werror=implicit-function-declaration]
 5724 |                 if (is_device_private_page(page))
      |                     ^~~~~~~~~~~~~~~~~~~~~~
      |                     is_device_private_entry

...needs:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1e97a54ae53..0ac7515c85f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include "internal.h"
 #include 
 #include 



Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-07 Thread Dan Williams
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
>
> Move the check for the actual pgmap types that need the free at refcount
> one behavior into the out of line helper, and thus avoid the need to
> pull memremap.h into mm.h.

Looks good to me assuming the compile bots agree.

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/arm64/mm/mmu.c|  1 +
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  1 +
>  drivers/gpu/drm/drm_cache.c|  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 +
>  drivers/gpu/drm/nouveau/nouveau_svm.c  |  1 +
>  drivers/infiniband/core/rw.c   |  1 +
>  drivers/nvdimm/pmem.h  |  1 +
>  drivers/nvme/host/pci.c|  1 +
>  drivers/nvme/target/io-cmd-bdev.c  |  1 +
>  fs/fuse/virtio_fs.c|  1 +
>  include/linux/memremap.h   | 18 ++
>  include/linux/mm.h | 20 
>  lib/test_hmm.c |  1 +
>  mm/memremap.c  |  6 +-
>  14 files changed, 34 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index acfae9b41cc8c9..580abae6c0b93f 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index ea68f3b3a4e9cb..6d643b4b791d87 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -25,6 +25,7 @@
>
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index f19d9acbe95936..50b8a088f763a6 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -27,11 +27,11 @@
>  /*
>   * Authors: Thomas Hellström 
>   */
> -
>  #include 
>  #include 
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>
>  #include 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index e886a3b9e08c7d..a5cdfbe32b5e54 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -39,6 +39,7 @@
>
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>
>  /*
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
> b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index 266809e511e2c1..090b9b47708cca 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>
>  struct nouveau_svm {
> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
> index 5a3bd41b331c93..4d98f931a13ddd 100644
> --- a/drivers/infiniband/core/rw.c
> +++ b/drivers/infiniband/core/rw.c
> @@ -2,6 +2,7 @@
>  /*
>   * Copyright (c) 2016 HGST, a Western Digital Company.
>   */
> +#include <linux/memremap.h>
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> index 59cfe13ea8a85c..1f51a23614299b 100644
> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -3,6 +3,7 @@
>  #define __NVDIMM_PMEM_H__
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6a99ed68091589..ab15bc72710dbe 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/nvme/target/io-cmd-bdev.c 
> b/drivers/nvme/target/io-cmd-bdev.c
> index 70ca9dfc1771a9..a141446db1bea3 100644
> --- a/drivers/nvme/target/io-cmd-bdev.c
> +++ b/drivers/nvme/target/io-cmd-bdev.c
> @@ -6,6 +6,7 @@
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>  #include "nvmet.h"
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 9d737904d07c0b..86b7dbb6a0d43e 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/memremap.h>
>  #include 
>  #include 
>  #include 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1fafcc38acbad6..514ab46f597e5c 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -1,6 +1,8 @@

Re: [PATCH 5/8] mm: simplify freeing of devmap managed pages

2022-02-07 Thread Dan Williams
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
>
> Make put_devmap_managed_page return if it took charge of the page
> or not and remove the separate page_is_devmap_managed helper.

Looks good to me:

Reviewed-by: Dan Williams 
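
For reference, the call-site shape this produces is roughly the following
(a paraphrase of the patch's effect, not the actual diff):

/* Rough paraphrase of put_page() after the change, not the actual
 * diff: the boolean return replaces the separate
 * page_is_devmap_managed() predicate. */
void put_page(struct page *page)
{
	if (put_devmap_managed_page(page))
		return;			/* devmap code handled the final put */

	if (put_page_testzero(page))
		__put_page(page);
}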



Re: [PATCH 4/8] mm: move free_devmap_managed_page to memremap.c

2022-02-07 Thread Dan Williams
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
>
> free_devmap_managed_page has nothing to do with the code in swap.c,
> move it to live with the rest of the code for devmap handling.
>

Looks good.

Reviewed-by: Dan Williams 



Re: [PATCH 2/8] mm: remove the __KERNEL__ guard from <linux/mm.h>

2022-02-07 Thread Dan Williams
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
>
> __KERNEL__ ifdefs don't make sense outside of include/uapi/.

Yes.

Reviewed-by: Dan Williams 



Re: [PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

2022-02-07 Thread Dan Williams
On Sun, Feb 6, 2022 at 10:33 PM Christoph Hellwig  wrote:
>
> memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
> the superfluous extra check.

Looks good to me.

Reviewed-by: Dan Williams 



Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-20 Thread Dan Williams
On Wed, Oct 20, 2021 at 10:09 AM Joao Martins  wrote:
>
> On 10/19/21 20:21, Dan Williams wrote:
> > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe  wrote:
> >>
> >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
> >>> On 10/19/21 00:06, Jason Gunthorpe wrote:
> >>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
> >>>>
> >>>>>> device-dax uses PUD, along with TTM, they are the only places. I'm not
> >>>>>> sure TTM is a real place though.
> >>>>>
> >>>>> I was setting device-dax aside because it can use Joao's changes to
> >>>>> get compound-page support.
> >>>>
> >>>> Ideally, but the ideas in that patch series have been floating around
> >>>> for a long time now..
> >>>>
> >>> The current status of the series misses a Rb on patches 6,7,10,12-14.
> >>> Well, patch 8 too should now drop its tag, considering the latest
> >>> discussion.
> >>>
> >>> If it helps moving things forward I could split my series further into:
> >>>
> >>> 1) the compound page introduction (patches 1-7) of my aforementioned series
> >>> 2) vmemmap deduplication for memory gains (patches 9-14)
> >>> 3) gup improvements (patch 8 and gup-slow improvements)
> >>
> >> I would split it, yes..
> >>
> >> I think we can see a general consensus that making compound_head/etc
> >> work consistently with how THP uses it will provide value and
> >> opportunity for optimization going forward.
> >>
>
> I'll go do that. Meanwhile, I'll wait a couple days for Dan to review the
> dax subsystem patches (6 & 7), or otherwise send them over.
>
> >>> What's the benefit of preventing longterm at start versus only after
> >>> mounting the filesystem? Or is the intended future purpose to pass more
> >>> context into a holder's potential future callback, e.g. to nack longterm
> >>> pins on a per-page basis?
> >>
> >> I understood Dan's remark that the device-dax path allows
> >> FOLL_LONGTERM and the FSDAX path does not ?
> >>
> >> Which, IIRC, today is signaled based on vma properties and in all cases
> >> fast-gup is denied.
> >
> > Yeah, I forgot that 7af75561e171 eliminated any possibility of
> > longterm-gup-fast for device-dax, let's not disturb that status quo.
> >
> I am slightly confused by this comment -- the status quo is what we are
> questioning here -- and we talked about changing that in the past too
> (thread below): longterm-gup-fast was an oversight, and that commit was
> only applicable to fsdax. [Maybe this is just my English confusion]

No, you have it correct. However, that "regression" has gone unnoticed,
so unless there is data showing that gup-fast on device-dax is critical
for longterm page-pinning workflows, I'm ok for longterm to continue to
trigger gup-slow.



Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-19 Thread Dan Williams
On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe  wrote:
>
> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
> > On 10/19/21 00:06, Jason Gunthorpe wrote:
> > > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
> > >
> > >>> device-dax uses PUD, along with TTM, they are the only places. I'm not
> > >>> sure TTM is a real place though.
> > >>
> > >> I was setting device-dax aside because it can use Joao's changes to
> > >> get compound-page support.
> > >
> > > Ideally, but the ideas in that patch series have been floating around
> > > for a long time now..
> > >
> > The current status of the series misses a Rb on patches 6,7,10,12-14.
> > Well, patch 8 too should now drop its tag, considering the latest
> > discussion.
> >
> > If it helps moving things forward I could split my series further into:
> >
> > 1) the compound page introduction (patches 1-7) of my aforementioned series
> > 2) vmemmap deduplication for memory gains (patches 9-14)
> > 3) gup improvements (patch 8 and gup-slow improvements)
>
> I would split it, yes..
>
> I think we can see a general consensus that making compound_head/etc
> work consistently with how THP uses it will provide value and
> opportunity for optimization going forward.
>
> > What's the benefit of preventing longterm at start versus only after
> > mounting the filesystem? Or is the intended future purpose to pass more
> > context into a holder's potential future callback, e.g. to nack longterm
> > pins on a per-page basis?
>
> I understood Dan's remark that the device-dax path allows
> FOLL_LONGTERM and the FSDAX path does not ?
>
> Which, IIRC, today is signaled based on vma properties and in all cases
> fast-gup is denied.

Yeah, I forgot that 7af75561e171 eliminated any possibility of
longterm-gup-fast for device-dax, let's not disturb that status quo.

> > Maybe we can start by at least not adding any flags and just prevent
> > FOLL_LONGTERM on fsdax -- which I guess was the original purpose of
> > commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast").
> > This patch (which I can formally send) has a sketch of that (below the
> > scissors mark):
> >
> > https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10f...@oracle.com/
>
> Yes, basically, whatever test we want for 'deny fast gup foll
> longterm' is fine.
>
> Personally I'd like to see us move toward a set of flags specifying
> each special behavior and not a collection of types that imply special
> behaviors.
>
> Eg we have at least:
>  - Block gup fast on foll_longterm
>  - Capture the refcount ==1 and use the pgmap free hook
>(confusingly called page_is_devmap_managed())
>  - Always use a swap entry
>  - page->index/mapping are used in the usual file based way?
>
> Probably more things..

Yes, agree with the principle of reducing type-implied special casing.
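
To make that concrete, a sketch of the flag-based direction (the names are
illustrative, not a real kernel API):

/* Illustrative only -- not a real kernel API.  Each special behavior
 * in Jason's list becomes an explicit pgmap flag instead of being
 * implied by the MEMORY_DEVICE_* type: */
#define PGMAP_DENY_GUP_FAST_LONGTERM	(1UL << 0) /* block gup-fast FOLL_LONGTERM */
#define PGMAP_FREE_AT_REFCOUNT_ONE	(1UL << 1) /* pgmap free hook at refcount 1 */
#define PGMAP_SWAP_ENTRY		(1UL << 2) /* always use a swap entry */
#define PGMAP_FILE_MAPPED		(1UL << 3) /* page->index/mapping used normally */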



Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-18 Thread Dan Williams
On Mon, Oct 18, 2021 at 11:26 AM Jason Gunthorpe  wrote:
>
> On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote:
>
> > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> > > THP would make using normal refconting much simpler. I looked at
> > > teaching the mm core to deal with page arrays - it is certainly
> > > doable, but it is quite inefficient and ugly mm code.
> >
> > THP does not support PUD, and neither does FSDAX, so it's only PMDs we
> > need to worry about.
>
> device-dax uses PUD, along with TTM, they are the only places. I'm not
> sure TTM is a real place though.

I was setting device-dax aside because it can use Joao's changes to
get compound-page support.

>
> > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
> > >
> > > Joao has a series that does this to device-dax:
> > >
> > > https://lore.kernel.org/all/20210827145819.16471-1-joao.m.mart...@oracle.com/
> >
> > That assumes there's never any need to fracture a huge page which
> > FSDAX could not support unless the filesystem was built with 2MB block
> > size.
>
> As I understand things, something like FSDAX post-folio should
> generate maximal compound pages for extents in the page cache that are
> physically contiguous.
>
> A high order folio can be placed in any lower order in the page
> tables, so we never have to fracture it, unless the underlying page
> are moved around - which requires an unmap_mapping_range() cycle..

That would be useful to disconnect the compound-page size from the
page-table-entry installed for the page. However, don't we need
typical compound page fracturing in the near term until folios move
ahead?

>
> > > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > > deleting the PUD/PMD support until it can be done with compound pages?
> >
> > There are end users that would notice the PMD regression, and I think
> > FSDAX PMDs with proper compound page metadata is on the same order of
> > work as fixing the refcount.
>
> Hmm, I don't know.. I sketched out the refcount stuff and the code is
> OK but ugly and will add a conditional to some THP cases

That reminds me that there are several places that do:

pmd_devmap(pmd) || pmd_trans_huge(pmd)

...for the common cases where a THP and DEVMAP page are equivalent,
but there are a few places where those paths are not shared when the
THP path expects that the page came from the page allocator. So while
DEVMAP is not needed in GUP after this conversion, there still needs
to be an audit of when THP needs to be careful of DAX mappings.
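
The recurring call-site pattern referred to above looks like this
(representative sketch, not a specific kernel site):

/* Representative sketch, not a specific kernel site: devmap (DAX)
 * huge entries ride the THP path, so that path must not assume the
 * page came from the page allocator when pmd_devmap() is set. */
if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

	/* ... huge-page handling, with care for DAX-backed pages ... */
	spin_unlock(ptl);
}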

> On the other hand, making THP unmap cases a bit slower is probably a
> net win compared to making put_page a bit slower.. Considering unmap
> is already quite heavy.

FSDAX eventually learned how to replace 'struct page' with xarray for
several paths, so "how hard could it be" (/famous last words) to
replace xarray with 'struct page'? I think the end result will be
cleaner, but yes, I expect some dragons in the conversion.

>
> > > 4) Ask what the pgmap owner wants to do:
> > >
> > > if (head->pgmap->deny_foll_longterm)
> > >   return FAIL
> >
> > The pgmap itself does not know, but the "holder" could specify this
> > policy.
>
> Here I imagine the thing that creates the pgmap would specify the
> policy it wants. In most cases the policy is tightly coupled to what
> the free function in the the provided dev_pagemap_ops does..

The thing that creates the pgmap is the device-driver, and
device-driver does not implement truncate or reclaim. It's not until
the FS mounts that the pgmap needs to start enforcing pin lifetime
guarantees.

>
> > Which is in line with the 'dax_holder_ops' concept being introduced
> > for reverse mapping support. I.e. when the FS claims the dax-device
> > it can specify at that point that it wants to forbid longterm.
>
> Which is a reasonable refinment if we think there are cases where two
> nvdim users would want different things.
>

It's already the case that device-dax does not enforce transient pin lifetimes.
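
A sketch of what that holder-specified policy could look like (hypothetical
shape; dax_holder_ops carried no such field at the time):

/* Hypothetical sketch of a holder-specified pin policy; not the
 * actual dax_holder_ops definition. */
struct dax_holder_ops {
	bool deny_longterm;	/* FS forbids FOLL_LONGTERM pins at claim time */
};

/* The driver-created pgmap stays policy-free; the FS supplies policy
 * when it claims the device (function shape is illustrative). */
int dax_device_claim(struct dax_device *dax_dev, void *holder,
		     const struct dax_holder_ops *ops);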

> Anyhow, I'm wondering on a way forward. There are many balls in the
> air, all linked:
>  - Joao's compound page support for device_dax and more
>  - Alex's DEVICE_COHERENT

I have not seen these patches.

/me notices no MAINTAINERS mention for include/linux/memremap.h

>  - The refcount normalization
>  - Removing the pgmap test from GUP
>  - Removing the need for the PUD/PMD/PTE special bit
>  - Removing the need for the PUD/PMD/PTE devmap bit

It's not clear that this is anything but pure cleanup once the special
bit can be used for architectures

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-17 Thread Dan Williams
On Sat, Oct 16, 2021 at 8:45 AM Jason Gunthorpe  wrote:
>
> On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote:
> > On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe  wrote:
> > >
> > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > > > Does anyone know why devmap is pte_special anyhow?
> > > >
> > > > It does not need to be special as mentioned here:
> > > >
> > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aksyuvkiu3-fk-n16ijvzq3n8ot00...@mail.gmail.com/
> > >
> > > I added a remark there
> > >
> > > Not special means more to me, it means devmap should do the refcounts
> > > properly like normal memory pages.
> > >
> > > It means vm_normal_page should return !NULL and it means insert_page,
> > > not insert_pfn should be used to install them in the PTE. VMAs should
> > > not be MIXED MAP, but normal struct page maps.
> > >
> > > I think this change alone would fix all the refcount problems
> > > everwhere in DAX and devmap.
> > >
> > > > The refcount dependencies also go away after this...
> > > >
> > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com/
> > > >
> > > > ...but you can see that patches 1 and 2 in that series depend on being
> > > > able to guarantee that all mappings are invalidated when the underlying
> > > > device that owns the pgmap goes away.
> > >
> > > If I have put everything together right this is because of what I
> > > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> > > expecting that to work sanely.
> > >
> > > This means the page map cannot be removed until all the PTEs are fully
> > > flushed, which buggily doesn't happen because of the missing unplug.
> > >
> > > However, this is all because nobody incremented a refcount to represent
> > > the reference in the PTE, and since this meant that 0 refcount pages were
> > > wrongly stuffed into PTEs, devmap used the refcount == 1 hack to
> > > unbreak GUP?
> > >
> > > So.. Is there some reason why devmap pages are trying so hard to avoid
> > > sane refcounting???
> >
> > I wouldn't put it that way. It's more that the original sin of
> > ZONE_DEVICE, which sought to reuse the lru field space by never having
> > a zero refcount, then got layered upon and calcified in malignant ways.
> > In the meantime surrounding infrastructure got decrustified. Work like
> > the 'struct page' cleanup among other things, made it clearer and
> > clearer over time that the original design choice needed to be fixed.
>
> So, there used to be some reason, but with the current code
> arrangement it is not the case? This is why it looks so strange when
> reading it..
>
> AFIACT we are good on the LRU stuff now? Eg release_pages() does not
> use page->lru for is_zone_device_page()?

Yes, the lru collision was fixed by the 'struct page' cleanup.

>
> > The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
> > now that we have page-available DAX, yes, we can skip the FS
> > notification and just rely on typical refcounting and hanging until
> > the FS has a chance to uninstall the PTEs. You're right, the FS
> > notification is an improvement to the conversion, not a requirement.
>
> It all sounds so simple. I looked at this for a good long time and the
> first major roadblock is huge pages.
>
> The mm side is designed around THP which puts a single high order page
> into the PUD/PMD such that the mm only has to juggle one page. This a
> very sane and reasonable thing to do.
>
> DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> THP would make using normal refcounting much simpler. I looked at
> teaching the mm core to deal with page arrays - it is certainly
> doable, but it is quite inefficient and ugly mm code.

THP does not support PUD, and neither does FSDAX, so it's only PMDs we
need to worry about.

>
> So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
>
> Joao has a series that does this to device-dax:
>
> https://lore.kernel.org/all/20210827145819.16471-1-joao.m.mart...@oracle.com/

That assumes there's never any need to fracture a huge page which
FSDAX could not support unless the filesystem was built with 2MB block
size.

> TTM is kind of broken already but does have a struct page, and it is
> probably already a high order one. Maybe it is OK? I will ask Thomas
>
> That leaves FSDAX. Can this be fixed?

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-17 Thread Dan Williams
On Sat, Oct 16, 2021 at 9:39 AM Matthew Wilcox  wrote:
>
> On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote:
> > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > deleting the PUD/PMD support until it can be done with compound pages?
>
> I think there are customers who would find that an unacceptable answer :-)

No, not given the number of end users that ask for help debugging PMD support.



Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-14 Thread Dan Williams
On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe  wrote:
>
> On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > Does anyone know why devmap is pte_special anyhow?
> >
> > It does not need to be special as mentioned here:
> >
> > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aksyuvkiu3-fk-n16ijvzq3n8ot00...@mail.gmail.com/
>
> I added a remark there
>
> Not special means more to me, it means devmap should do the refcounts
> properly like normal memory pages.
>
> It means vm_normal_page should return !NULL and it means insert_page,
> not insert_pfn should be used to install them in the PTE. VMAs should
> not be MIXED MAP, but normal struct page maps.
>
> I think this change alone would fix all the refcount problems
> everwhere in DAX and devmap.
>
> > The refcount dependencies also go away after this...
> >
> > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.st...@dwillia2-desk3.amr.corp.intel.com/
> >
> > ...but you can see that patches 1 and 2 in that series depend on being
> > able to guarantee that all mappings are invalidated when the underlying
> > device that owns the pgmap goes away.
>
> If I have put everything together right this is because of what I
> pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> expecting that to work sanely.
>
> This means the page map cannot be removed until all the PTEs are fully
> flushed, which buggily doesn't happen because of the missing unplug.
>
> However, this is all because nobody incremented a refcount to represent
> the reference in the PTE, and since this meant that 0 refcount pages were
> wrongly stuffed into PTEs, devmap used the refcount == 1 hack to
> unbreak GUP?
>
> So.. Is there some reason why devmap pages are trying so hard to avoid
> sane refcounting???

I wouldn't put it that way. It's more that the original sin of
ZONE_DEVICE, which sought to reuse the lru field space by never having
a zero refcount, then got layered upon and calcified in malignant ways.
In the meantime surrounding infrastructure got decrustified. Work like
the 'struct page' cleanup among other things, made it clearer and
clearer over time that the original design choice needed to be fixed.

> If the PTE itself holds the refcount (by not being special) then there
> is no need for the pagemap stuff in GUP. pagemap already waits for
> refs to go to 0 so the missing shootdown during nvdimm unplug will
> cause pagemap to block until the address spaces are invalidated. IMHO
> this is already better than the current buggy situation of allowing
> continued PTE reference to memory that is now removed from the system.
>
> > For that to happen there needs to be communication back to the FS for
> > device-gone / failure events. That work is in progress via this
> > series:
> >
> > https://lore.kernel.org/all/20210924130959.2695749-1-ruansy.f...@fujitsu.com/
>
> This is fine, but I don't think it should block fixing the mm side -
> the end result here still cannot be 0 ref count pages installed in
> PTEs.
>
> Fixing that does not depend on shootdown during device removal, right?
>
> It requires holding refcounts while pages are installed into address
> spaces - and this lack is a direct cause of making the PTEs all
> special and using insert_pfn and MIXED_MAP.

The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
now that we have page-available DAX, yes, we can skip the FS
notification and just rely on typical refcounting and hanging until
the FS has a chance to uninstall the PTEs. You're right, the FS
notification is an improvement to the conversion, not a requirement.

However, there still needs to be something in the gup-fast path to
indicate that GUP_LONGTERM is not possible because the PTE represents
a pfn that can not support typical page-cache behavior for truncate
which is to just disconnect the page from the file and keep the page
pinned indefinitely. I think the "no longterm" caveat would be the
only remaining utility of PTE_DEVMAP after the above conversion to use
typical page refcounts throughout DAX.
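
Concretely, the remaining gup-fast check would be on the order of the
following (a sketch; the exact predicate is what the thread is debating):

/* Sketch of the remaining PTE_DEVMAP use described above: deny
 * longterm pins in the fast path for devmap pfns and punt to the
 * slow path, where the vma-based checks can run. */
if ((flags & FOLL_LONGTERM) && pte_devmap(pte))
	return 0;	/* fall back to the get_user_pages() slow path */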



Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-14 Thread Dan Williams
On Thu, Oct 14, 2021 at 11:45 AM Matthew Wilcox  wrote:
>
>
> It would probably help if you cc'd Dan on this.

Thanks.

[..]
>
> On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> > > From: Ralph Campbell 
> > >
> > > ZONE_DEVICE struct pages have an extra reference count that complicates
> > > the code for put_page() and several places in the kernel that need to
> > > check the reference count to see that a page is not being used (gup,
> > > compaction, migration, etc.). Clean up the code so the reference count
> > > doesn't need to be treated specially for ZONE_DEVICE.
> > >
> > > Signed-off-by: Ralph Campbell 
> > > Signed-off-by: Alex Sierra 
> > > Reviewed-by: Christoph Hellwig 
> > > ---
> > > v2:
> > > AS: merged this patch in linux 5.11 version
> > >
> > > v5:
> > > AS: add condition at try_grab_page to check for the zone device type,
> > > while the page ref counter is checked less/equal to zero. In case of
> > > device zone, page ref counters are initialized to zero.
> > >
> > > v7:
> > > AS: the condition added at try_grab_page in v5 is invalid. It was supposed
> > > to fix the xfstests/generic/413 test; however, there's a known issue with
> > > this test where DIO from a DAX-mapped area to non-DAX is expected to fail.
> > > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com
> > > This condition was removed after rebase over patch series
> > > https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com
> > > ---
> > >  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
> > >  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> > >  fs/dax.c   |  4 +-
> > >  include/linux/dax.h|  2 +-
> > >  include/linux/memremap.h   |  7 +--
> > >  include/linux/mm.h | 11 
> > >  lib/test_hmm.c |  2 +-
> > >  mm/internal.h  |  8 +++
> > >  mm/memcontrol.c|  6 +--
> > >  mm/memremap.c  | 69 +++---
> > >  mm/migrate.c   |  5 --
> > >  mm/page_alloc.c|  3 ++
> > >  mm/swap.c  | 45 ++---
> > >  13 files changed, 46 insertions(+), 120 deletions(-)
> >
> > Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
> > backed memory still work?
> >
> > What refcount value does the struct pages have when they are installed
> > in the PTEs? Remember a 0 refcount will make all the get_user_pages()
> > fail.
> >
> > I'm looking at the call path starting in ext4_punch_hole() and I would
> > expect to see something manipulating the page ref count before
> > the ext4_break_layouts() call path gets to the dax_page_unused() test.
> >
> > All I see is we go into unmap_mapping_pages() - that would normally
> > put back the page references held by PTEs but insert_pfn() has this:
> >
> >   if (pfn_t_devmap(pfn))
> >   entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> >
> > And:
> >
> > static inline pte_t pte_mkdevmap(pte_t pte)
> > {
> >   return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> > }
> >
> > Which interacts with vm_normal_page():
> >
> >   if (pte_devmap(pte))
> >   return NULL;
> >
> > To disable that refcounting?
> >
> > So... I have a feeling this will have PTEs pointing to 0 refcount
> > pages? Unless FSDAX is !pte_devmap which is not the case, right?
> >
> > This seems further confirmed by this comment:
> >
> >   /*
> >* If we race get_user_pages_fast() here either we'll see the
> >* elevated page count in the iteration and wait, or
> >* get_user_pages_fast() will see that the page it took a reference
> >* against is no longer mapped in the page tables and bail to the
> >* get_user_pages() slow path.  The slow path is protected by
> >* pte_lock() and pmd_lock(). New references are not taken without
> >* holding those locks, and unmap_mapping_pages() will not zero the
> >* pte or pmd without holding the respective lock, so we are
> >* guaranteed to either see new references or prevent new
> >* references from being established.
> >*/
> >
> > Which seems to explain this scheme relies on unmap_mapping_pages() to
> > fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
> > stop.
> >
> > This seems like it would be properly fixed by using normal page
> > refcounting for PTEs - ie stop using special for these pages?
> >
> > Does anyone know why devmap is pte_special anyhow?

It does not need to be special as mentioned here:

https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aksyuvkiu3-fk-n16ijvzq3n8ot00...@mail.gmail.com/

The refcount dependencies also go away after this...


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-02 Thread Dan Williams
On Thu, Sep 2, 2021 at 1:18 AM Christoph Hellwig  wrote:
>
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > >>> It looks like I'm totally misunderstanding what you are adding here
> > >>> then.  Why do we need any special treatment at all for memory that
> > >>> has normal struct pages and is part of the direct kernel map?
> > >> The pages are like normal memory for purposes of mapping them in CPU
> > >> page tables and for coherent access from the CPU.
> > > That's the user page tables.  What about the kernel direct map?
> > > If there is a normal kernel struct page backing there really should
> > > be no need for the pgmap.
> >
> > I'm not sure. The physical address ranges are in the UEFI system address
> > map as special-purpose memory. Does Linux create the struct pages and
> > kernel direct map for that without a pgmap call? I didn't see that last
> > time I went digging through that code.
>
> So doing some googling finds a patch from Dan that claims to hand EFI
> special purpose memory to the device dax driver.  But when I try to
> follow the version that got merged it looks like it is treated simply as an
> MMIO region to be claimed by drivers, which would not get a struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

The original implementation of "soft reserve" support depended on the
combination of the EFI special purpose memory type and the ACPI HMAT
to define the device ranges. The requirement for ACPI HMAT was relaxed
later with commit:

5ccac54f3e12 ACPI: HMAT: attach a device for each soft-reserved range

The expectation is that system software policy can then either use the
device interface, assign a portion of the reservation back to the page
allocator, or ignore the reservation altogether. Is this discussion
asking for a way to assign this memory to the GPU driver to manage?
device-dax already knows how to hand off to the page allocator, so it
seems reasonable for it to be able to hand off to another driver.


Re: [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Dan Williams
On Sat, Oct 17, 2020 at 9:10 AM  wrote:
>
> From: Tom Rix 
>
> This is an upcoming change to clean up a new warning treewide.
> I am wondering if the change could be one mega patch (see below), a
> normal patch per file (about 100 patches), or somewhere halfway by
> collecting early acks.
>
> clang has a number of useful, new warnings see
> https://clang.llvm.org/docs/DiagnosticsReference.html
>
> This change cleans up -Wunreachable-code-break
> https://clang.llvm.org/docs/DiagnosticsReference.html#wunreachable-code-break
> for 266 of 485 warnings in this week's linux-next, allyesconfig on x86_64.
>
> The method of fixing was to look for warnings where the preceding statement
> was a simple statement, and to confirm by inspection that the subsequent
> break was unneeded.
> In order of frequency these look like
>
> return and break
>
> switch (c->x86_vendor) {
> case X86_VENDOR_INTEL:
> intel_p5_mcheck_init(c);
> return 1;
> -   break;
>
> goto and break
>
> default:
> operation = 0; /* make gcc happy */
> goto fail_response;
> -   break;
>
> break and break
> case COLOR_SPACE_SRGB:
> /* by pass */
> REG_SET(OUTPUT_CSC_CONTROL, 0,
> OUTPUT_CSC_GRPH_MODE, 0);
> break;
> -   break;
>
> The exception to the simple statement is a switch case with a block
> where the end of the block is a return
>
> struct obj_buffer *buff = r->ptr;
> return scnprintf(str, PRIV_STR_SIZE,
> "size=%u\naddr=0x%X\n", buff->size,
> buff->addr);
> }
> -   break;
>
> Not considered obvious and excluded, breaks after
> multi level switches
> complicated if-else if-else blocks
> panic() or similar calls
>
> And there is an odd addition of a 'fallthrough' in drivers/tty/nozomi.c
[..]
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 5a7c80053c62..2f250874b1a4 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -200,11 +200,10 @@ ssize_t nd_namespace_store(struct device *dev,
> }
> break;
> default:
> len = -EBUSY;
> goto out_attach;
> -   break;
> }

Acked-by: Dan Williams 


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Dan Williams
On Fri, Oct 9, 2020 at 12:52 PM  wrote:
>
> From: Ira Weiny 
>
> The kmap() calls in this FS are localized to a single thread.  To avoid
> the overhead of global PKRS updates use the new kmap_thread() call.
>
> Cc: Nicolas Pitre 
> Signed-off-by: Ira Weiny 
> ---
>  fs/cramfs/inode.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index 912308600d39..003c014a42ed 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
> struct page *page = pages[i];
>
> if (page) {
> -   memcpy(data, kmap(page), PAGE_SIZE);
> -   kunmap(page);
> +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> +   kunmap_thread(page);

Why does this need a sleepable kmap? This looks like a textbook
kmap_atomic() use case.


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Dan Williams
On Tue, Oct 13, 2020 at 12:37 PM Matthew Wilcox  wrote:
>
> On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> > On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the overhead of global PKRS updates use the new kmap_thread() call.
> > >
> > > Cc: Nicolas Pitre 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  fs/cramfs/inode.c | 10 +-
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > > index 912308600d39..003c014a42ed 100644
> > > --- a/fs/cramfs/inode.c
> > > +++ b/fs/cramfs/inode.c
> > > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
> > > struct page *page = pages[i];
> > >
> > > if (page) {
> > > -   memcpy(data, kmap(page), PAGE_SIZE);
> > > -   kunmap(page);
> > > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > > +   kunmap_thread(page);
> >
> > Why does this need a sleepable kmap? This looks like a textbook
> > kmap_atomic() use case.
>
> There's a lot of code of this form.  Could we perhaps have:
>
> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned int size)
> {
> char *vto = kmap_atomic(to);
>
> memcpy(vto, vfrom, size);
> kunmap_atomic(vto);
> }
>
> in linux/highmem.h ?

Nice, yes, that could also replace the local ones in lib/iov_iter.c
(memcpy_{to,from}_page())
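
The from-page counterpart would follow the same pattern, e.g. (a sketch
mirroring Matthew's helper above, assuming the same linux/highmem.h
placement):

/* Mirror of copy_to_highpage() above, same kmap_atomic() pattern in
 * the other direction (sketch, assuming the same placement in
 * linux/highmem.h). */
static inline void copy_from_highpage(void *vto, struct page *from,
				      unsigned int size)
{
	char *vfrom = kmap_atomic(from);

	memcpy(vto, vfrom, size);
	kunmap_atomic(vfrom);
}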


Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-16 Thread Dan Williams
On Fri, Aug 16, 2019 at 5:24 AM Jason Gunthorpe  wrote:
>
> On Thu, Aug 15, 2019 at 08:54:46PM -0700, Dan Williams wrote:
>
> > > However, this means we cannot do any processing of ZONE_DEVICE pages
> > > outside the driver lock, so eg, doing any DMA map that might rely on
> > > MEMORY_DEVICE_PCI_P2PDMA has to be done in the driver lock, which is
> > > a bit unfortunate.
> >
> > Wouldn't P2PDMA use page pins? Not needing to hold a lock over
> > ZONE_DEVICE page operations was one of the motivations for plumbing
> > get_dev_pagemap() with a percpu-ref.
>
> hmm_range_fault() doesn't use page pins at all, so if a ZONE_DEVICE
> page comes out of it then it needs to use another locking pattern.
>
> If I follow it all right:
>
> We can do a get_dev_pagemap inside the page_walk and touch the pgmap,
> or we can do the 'device mutex && retry' pattern and touch the pgmap
> in the driver, under that lock.
>
> However in all cases the current get_dev_pagemap()'s in the page walk
> are not necessary, and we can delete them.

Yes, as long as 'struct page' instances resulting from that lookup are
not passed outside of that lock.
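
The resulting pattern, sketched against the nouveau flow quoted later in
this thread (2019-era API; helper and lock names are per that driver and
should be treated as illustrative):

/* Sketch of the agreed pattern: struct page results from
 * hmm_range_fault() are only dereferenced under the driver lock,
 * after checking that no invalidation raced the walk. */
mutex_lock(&svmm->mutex);
if (!nouveau_range_done(&range)) {
	mutex_unlock(&svmm->mutex);
	goto again;			/* raced an invalidation: retry */
}
nouveau_dmem_convert_pfn(drm, &range);	/* safe under the lock */
mutex_unlock(&svmm->mutex);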

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Dan Williams
On Thu, Aug 15, 2019 at 5:41 PM Jason Gunthorpe  wrote:
>
> On Thu, Aug 15, 2019 at 01:47:12PM -0700, Dan Williams wrote:
> > On Thu, Aug 15, 2019 at 1:41 PM Jason Gunthorpe  wrote:
> > >
> > > On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
> > >
> > > > So nor HMM nor driver should dereference the struct page (i do not
> > > > think any iommu driver would either),
> > >
> > > Er, they do technically deref the struct page:
> > >
> > > nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
> > >  struct hmm_range *range)
> > > struct page *page;
> > > page = hmm_pfn_to_page(range, range->pfns[i]);
> > > if (!nouveau_dmem_page(drm, page)) {
> > >
> > >
> > > nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> > > {
> > > return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
> > >
> > >
> > > Which does touch 'page->pgmap'
> > >
> > > Is this OK without having a get_dev_pagemap() ?
> > >
> > > Noting that the collision-retry scheme doesn't protect anything here
> > > as we can have a concurrent invalidation while doing the above deref.
> >
> > As long take_driver_page_table_lock() in Jerome's flow can replace
> > percpu_ref_tryget_live() on the pagemap reference. It seems
> > nouveau_dmem_convert_pfn() happens after:
> >
> > mutex_lock(&svmm->mutex);
> > if (!nouveau_range_done()) {
> >
> > ...so I would expect that to be functionally equivalent to validating
> > the reference count.
>
> Yes, OK, that makes sense, I was mostly surprised by the statement the
> driver doesn't touch the struct page..
>
> I suppose "doesn't touch the struct page out of the driver lock" is
> the case.
>
> However, this means we cannot do any processing of ZONE_DEVICE pages
> outside the driver lock, so eg, doing any DMA map that might rely on
> MEMORY_DEVICE_PCI_P2PDMA has to be done in the driver lock, which is
> a bit unfortunate.

Wouldn't P2PDMA use page pins? Not needing to hold a lock over
ZONE_DEVICE page operations was one of the motivations for plumbing
get_dev_pagemap() with a percpu-ref.

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Dan Williams
On Thu, Aug 15, 2019 at 12:44 PM Jerome Glisse  wrote:
>
> On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> > On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> > > > >
> > > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > > Section alignment constraints somewhat save us here. The only
> > > > > > > example I can think of where a PMD does not contain a uniform pgmap
> > > > > > > association for each pte is the case when the pgmap overlaps normal
> > > > > > > dram, i.e. shares the same 'struct memory_section' for a given span.
> > > > > > > Otherwise, distinct pgmaps arrange to manage their own exclusive
> > > > > > > sections (and now subsections as of v5.3). Otherwise the
> > > > > > > implementation could not guarantee different mapping lifetimes.
> > > > > > >
> > > > > > > That said, this seems to want a better mechanism to determine "pfn
> > > > > > > is ZONE_DEVICE".
> > > > > >
> > > > > > So I guess this patch is fine for now, and once you provide a better
> > > > > > mechanism we can switch over to it?
> > > > >
> > > > > What about the version I sent to just get rid of all the strange
> > > > > put_dev_pagemaps while scanning? Odds are good we will work with only
> > > > > a single pagemap, so it makes some sense to cache it once we find it?
> > > >
> > > > Yes, if the scan is over a single pmd then caching it makes sense.
> > >
> > > Quite frankly an easier and better solution is to remove the pagemap
> > > lookup. As HMM users abide by mmu notifiers, it means we will not make
> > > use of or dereference the struct page, so we are safe from any racing
> > > hotunplug of dax memory (as long as the device driver using hmm does
> > > not have a bug).
> >
> > Yes, as long as the driver remove is synchronized against HMM
> > operations via another mechanism then there is no need to take pagemap
> > references. Can you briefly describe what that other mechanism is?
>
> So if you hotunplug some dax memory I assume that this can only
> happen once all the pages are unmapped (as they must have a
> zero refcount, well 1 because of the bias) and any unmap will
> trigger a mmu notifier callback. Users of hmm mirror abiding by
> the API will never make use of information they get through the
> fault or snapshot function until checking for a racing notifier
> under lock.

Hmm that first assumption is not guaranteed by the dev_pagemap core.
The dev_pagemap end of life model is "disable, invalidate, drain" so
it's possible to call devm_memunmap_pages() while pages are still mapped
it just won't complete the teardown of the pagemap until the last
reference is dropped. New references are blocked during this teardown.

However, if the driver is validating the liveness of the mapping in
the mmu-notifier path and blocking new references it sounds like it
should be ok. Might there be GPU driver unit tests that cover this
racing teardown case?
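
The "disable, invalidate, drain" sequence, sketched (field and function
names are illustrative of the era's memremap.c internals, not exact):

/* "disable, invalidate, drain" sketch; names are illustrative of
 * the era's memremap.c internals, not exact. */
percpu_ref_kill(pgmap->ref);		/* disable: tryget_live() now fails */
unmap_mapping_range(mapping, 0, 0, 1);	/* invalidate: tear down mappings */
wait_for_completion(&pgmap->done);	/* drain: the last put completes this */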

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Dan Williams
On Thu, Aug 15, 2019 at 1:41 PM Jason Gunthorpe  wrote:
>
> On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
>
> > So nor HMM nor driver should dereference the struct page (i do not
> > think any iommu driver would either),
>
> Er, they do technically deref the struct page:
>
> nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>  struct hmm_range *range)
> struct page *page;
> page = hmm_pfn_to_page(range, range->pfns[i]);
> if (!nouveau_dmem_page(drm, page)) {
>
>
> nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> {
> return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
>
>
> Which does touch 'page->pgmap'
>
> Is this OK without having a get_dev_pagemap() ?
>
> Noting that the collision-retry scheme doesn't protect anything here
> as we can have a concurrent invalidation while doing the above deref.

As long take_driver_page_table_lock() in Jerome's flow can replace
percpu_ref_tryget_live() on the pagemap reference. It seems
nouveau_dmem_convert_pfn() happens after:

> mutex_lock(&svmm->mutex);
if (!nouveau_range_done()) {

...so I would expect that to be functionally equivalent to validating
the reference count.

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Dan Williams
On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
>
> On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > Section alignment constraints somewhat save us here. The only example
> > > > > I can think of where a PMD does not contain a uniform pgmap association for
> > > > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > guarantee different mapping lifetimes.
> > > > >
> > > > > That said, this seems to want a better mechanism to determine "pfn is
> > > > > ZONE_DEVICE".
> > > >
> > > > So I guess this patch is fine for now, and once you provide a better
> > > > mechanism we can switch over to it?
> > >
> > > What about the version I sent to just get rid of all the strange
> > > put_dev_pagemaps while scanning? Odds are good we will work with only
> > > a single pagemap, so it makes some sense to cache it once we find it?
> >
> > Yes, if the scan is over a single pmd then caching it makes sense.
>
> Quite frankly an easier and better solution is to remove the pagemap
> lookup. As HMM users abide by mmu notifiers, it means we will not make
> use of or dereference the struct page, so we are safe from any racing
> hotunplug of dax memory (as long as the device driver using hmm does
> not have a bug).

Yes, as long as the driver remove is synchronized against HMM
operations via another mechanism then there is no need to take pagemap
references. Can you briefly describe what that other mechanism is?

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-14 Thread Dan Williams
On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
>
> On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > Section alignment constraints somewhat save us here. The only example
> > > I can think of where a PMD does not contain a uniform pgmap association for
> > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > pgmaps arrange to manage their own exclusive sections (and now
> > > subsections as of v5.3). Otherwise the implementation could not
> > > guarantee different mapping lifetimes.
> > >
> > > That said, this seems to want a better mechanism to determine "pfn is
> > > ZONE_DEVICE".
> >
> > So I guess this patch is fine for now, and once you provide a better
> > mechanism we can switch over to it?
>
> What about the version I sent to just get rid of all the strange
> put_dev_pagemaps while scanning? Odds are good we will work with only
> a single pagemap, so it makes some sense to cache it once we find it?

Yes, if the scan is over a single pmd then caching it makes sense.

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-14 Thread Dan Williams
On Wed, Aug 7, 2019 at 11:59 PM Christoph Hellwig  wrote:
>
> On Wed, Aug 07, 2019 at 11:47:22AM -0700, Dan Williams wrote:
> > > Unrelated to this patch, but what is the point of getting checking
> > > that the pgmap exists for the page and then immediately releasing it?
> > > This code has this pattern in several places.
> > >
> > > It feels racy
> >
> > Agree, not sure what the intent is here. The only other reason to call
> > get_dev_pagemap() is to just check in general if the pfn is indeed
> > owned by some ZONE_DEVICE instance, but if the intent is to make sure
> > the device is still attached/enabled that check is invalidated at
> > put_dev_pagemap().
> >
> > If it's the former case, validating ZONE_DEVICE pfns, I imagine we can
> > do something cheaper with a helper that is on the order of the same
> > cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag
> > or something similar.
>
> The hmm literally never dereferences the pgmap, so validity checking is
> the only explanation for it.
>
> > > +   /*
> > > +* We do put_dev_pagemap() here so that we can leverage
> > > +* get_dev_pagemap() optimization which will not re-take a
> > > +* reference on a pgmap if we already have one.
> > > +*/
> > > +   if (hmm_vma_walk->pgmap)
> > > +   put_dev_pagemap(hmm_vma_walk->pgmap);
> > > +
> >
> > Seems ok, but only if the caller is guaranteeing that the range does
> > not span outside of a single pagemap instance. If that guarantee is
> > met why not just have the caller pass in a pinned pagemap? If that
> > guarantee is not met, then I think we're back to your race concern.
>
> It iterates over multiple ptes in a non-huge pmd.  Is there any kind of
> limitations on different pgmap instances inside a pmd?  I can't think
> of one, so this might actually be a bug.

Section alignment constraints somewhat save us here. The only example
I can think of where a PMD does not contain a uniform pgmap association
for each pte is the case when the pgmap overlaps normal dram, i.e.
shares the same 'struct memory_section' for a given span. Otherwise,
distinct pgmaps arrange to manage their own exclusive sections (and now
subsections as of v5.3); the implementation could not otherwise
guarantee different mapping lifetimes.

That said, this seems to want a better mechanism to determine "pfn is
ZONE_DEVICE".

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-07 Thread Dan Williams
On Wed, Aug 7, 2019 at 10:45 AM Jason Gunthorpe  wrote:
>
> On Tue, Aug 06, 2019 at 07:05:42PM +0300, Christoph Hellwig wrote:
> > There is only a single place where the pgmap is passed over a function
> > call, so replace it with local variables in the places where we deal
> > with the pgmap.
> >
> > Signed-off-by: Christoph Hellwig 
> >  mm/hmm.c | 62 
> >  1 file changed, 27 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 9a908902e4cc..d66fa29b42e0 100644
> > +++ b/mm/hmm.c
> > @@ -278,7 +278,6 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
> >
> >  struct hmm_vma_walk {
> >   struct hmm_range*range;
> > - struct dev_pagemap  *pgmap;
> >   unsigned long   last;
> >   unsigned intflags;
> >  };
> > @@ -475,6 +474,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   struct hmm_vma_walk *hmm_vma_walk = walk->private;
> >   struct hmm_range *range = hmm_vma_walk->range;
> > + struct dev_pagemap *pgmap = NULL;
> >   unsigned long pfn, npages, i;
> >   bool fault, write_fault;
> >   uint64_t cpu_flags;
> > @@ -490,17 +490,14 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >   pfn = pmd_pfn(pmd) + pte_index(addr);
> >   for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> >   if (pmd_devmap(pmd)) {
> > - hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> > -   hmm_vma_walk->pgmap);
> > - if (unlikely(!hmm_vma_walk->pgmap))
> > + pgmap = get_dev_pagemap(pfn, pgmap);
> > + if (unlikely(!pgmap))
> >   return -EBUSY;
>
> Unrelated to this patch, but what is the point of checking
> that the pgmap exists for the page and then immediately releasing it?
> This code has this pattern in several places.
>
> It feels racy

Agree, not sure what the intent is here. The only other reason to call
get_dev_pagemap() is to just check in general if the pfn is indeed
owned by some ZONE_DEVICE instance, but if the intent is to make sure
the device is still attached/enabled that check is invalidated at
put_dev_pagemap().

If it's the former case, validating ZONE_DEVICE pfns, I imagine we can
do something cheaper with a helper that is on the order of the same
cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag
or something similar.
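
Here is a sketch of the kind of helper that suggestion implies. To be
explicit about what is invented: SECTION_IS_ZONE_DEVICE does not exist
and stands in for the proposed mem_section flag; the rest mirrors how
pfn_valid()-style checks consult the section map.

#include <linux/mmzone.h>

static inline bool pfn_is_zone_device(unsigned long pfn)
{
	struct mem_section *ms;

	if (!pfn_valid(pfn))
		return false;
	ms = __pfn_to_section(pfn);
	/* SECTION_IS_ZONE_DEVICE is hypothetical, mirroring existing flags. */
	return ms && (ms->section_mem_map & SECTION_IS_ZONE_DEVICE);
}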

>
> >   }
> >   pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
> >   }
> > - if (hmm_vma_walk->pgmap) {
> > - put_dev_pagemap(hmm_vma_walk->pgmap);
> > - hmm_vma_walk->pgmap = NULL;
>
> Putting the value in the hmm_vma_walk would have made some sense to me
> if the pgmap was not set to NULL all over the place. Then most of the
> xa_loads would be eliminated, as I would expect the pgmap to be mostly
> uniform for these use cases.
>
> Is there some reason the pgmap ref can't be held across
> faulting/sleeping? ie like below.

No restriction on holding refs over faulting / sleeping.

>
> Anyhow, I looked over this pretty carefully and the change looks
> functionally OK, I just don't know why the code is like this in the
> first place.
>
> Reviewed-by: Jason Gunthorpe 
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9a908902e4cc38..4e30128c23a505 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -497,10 +497,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> }
> pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
> }
> -   if (hmm_vma_walk->pgmap) {
> -   put_dev_pagemap(hmm_vma_walk->pgmap);
> -   hmm_vma_walk->pgmap = NULL;
> -   }
> hmm_vma_walk->last = end;
> return 0;
>  #else
> @@ -604,10 +600,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> return 0;
>
>  fault:
> -   if (hmm_vma_walk->pgmap) {
> -   put_dev_pagemap(hmm_vma_walk->pgmap);
> -   hmm_vma_walk->pgmap = NULL;
> -   }
> pte_unmap(ptep);
> /* Fault any virtual address we were asked to fault */
> return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> @@ -690,16 +682,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> return r;
> }
> }
> -   if (hmm_vma_walk->pgmap) {
> -   /*
> -* We do put_dev_pagemap() here and not in hmm_vma_handle_pte()
> -* so that we can leverage get_dev_pagemap() optimization which
> -* will not re-take a reference on a pgmap if we already have
> -* one.
> -*/
> -   put_dev_pagemap(hmm_vma_walk->pgmap);
> -   hmm_vma_walk->pgmap = NULL;
> -   }
> pte_unmap(ptep - 1);
>

Re: [PATCH v5 13/29] compat_ioctl: move more drivers to compat_ptr_ioctl

2019-07-30 Thread Dan Williams
On Tue, Jul 30, 2019 at 12:59 PM Arnd Bergmann  wrote:
>
> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
>
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will never run on s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
>
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
>
> Acked-by: Jason Gunthorpe 
> Acked-by: Daniel Vetter 
> Acked-by: Mauro Carvalho Chehab 
> Acked-by: Greg Kroah-Hartman 
> Acked-by: David Sterba 
> Acked-by: Darren Hart (VMware) 
> Acked-by: Jonathan Cameron 
> Acked-by: Bjorn Andersson 
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/nvdimm/bus.c| 4 ++--
[..]
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 798c5c4aea9c..6ca142d833ab 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -1229,7 +1229,7 @@ static const struct file_operations nvdimm_bus_fops = {
> .owner = THIS_MODULE,
> .open = nd_open,
> .unlocked_ioctl = bus_ioctl,
> -   .compat_ioctl = bus_ioctl,
> +   .compat_ioctl = compat_ptr_ioctl,
> .llseek = noop_llseek,
>  };
>
> @@ -1237,7 +1237,7 @@ static const struct file_operations nvdimm_fops = {
> .owner = THIS_MODULE,
> .open = nd_open,
> .unlocked_ioctl = dimm_ioctl,
> -   .compat_ioctl = dimm_ioctl,
> +   .compat_ioctl = compat_ptr_ioctl,
> .llseek = noop_llseek,
>  };

Acked-by: Dan Williams 
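
For reference, the generic helper these hunks switch to is essentially
the following (paraphrased from the series rather than quoted verbatim):

#include <linux/compat.h>
#include <linux/fs.h>

long compat_ptr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	if (!file->f_op->unlocked_ioctl)
		return -ENOIOCTLCMD;

	/* compat_ptr() does the 31->64 bit pointer widening s390 needs. */
	return file->f_op->unlocked_ioctl(file, cmd,
					  (unsigned long)compat_ptr(arg));
}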

Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Dan Williams
On Wed, Apr 25, 2018 at 10:44 AM, Alex Deucher  wrote:
> On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig  wrote:
>> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
>>> > It has a non-coherent transaction mode (which the chipset can opt to
>>> > not implement and still flush), to make sure the AGP horror show
>>> > doesn't happen again and GPU folks are happy with PCIe. That's at
>>> > least my understanding from digging around in amd the last time we had
>>> > coherency issues between intel and amd gpus. GPUs have some bits
>>> > somewhere (in the pagetables, or in the buffer object description
>>> > table created by userspace) to control that stuff.
>>>
>>> Right.  We have a bit in the GPU page table entries that determines
>>> whether we snoop the CPU's cache or not.
>>
>> I can see how that works with the GPU on the same SOC or SOC set as the
>> CPU.  But how is that going to work for a GPU that is a plain old PCIe
>> card?  The cache snooping in that case is happening in the PCIe root
>> complex.
>
> I'm not a pci expert, but as far as I know, the device sends either a
> snooped or non-snooped transaction on the bus.  I think the
> transaction descriptor supports a no snoop attribute.  Our GPUs have
> supported this feature for probably 20 years if not more, going back
> to PCI.  Non-snooped transactions have lower latency and faster
> throughput compared to snooped transactions.

Right, 'no snoop' and 'relaxed ordering' have been part of the PCI
spec since forever. With a plain old PCI-E card the root complex
indeed arranges for caches to be snooped. It is not always faster,
though, depending on the platform: 'no snoop' traffic may be relegated
to fewer bus resources than the much more common snooped traffic.
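
To anchor the terminology: 'Enable No Snoop' is a bit in the PCIe
Device Control register, settable with existing kernel helpers; whether
the device then actually emits no-snoop TLPs, and whether the fabric
honors them, remains device- and platform-specific. A minimal sketch:

#include <linux/pci.h>

static int mydrv_enable_no_snoop(struct pci_dev *pdev)
{
	/* Permit the function to set the No Snoop attribute in its TLPs. */
	return pcie_capability_set_word(pdev, PCI_EXP_DEVCTL,
					PCI_EXP_DEVCTL_NOSNOOP_EN);
}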