Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-09 Thread Jerome Glisse
On Tue, Feb 09, 2021 at 09:35:20AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 09, 2021 at 11:57:28PM +1100, Alistair Popple wrote:
> > On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote:
> > > >
> > > > Recent changes to pin_user_pages() prevent the creation of pinned pages
> > > > in ZONE_MOVABLE. This series allows pinned pages to be created in
> > > > ZONE_MOVABLE as attempts to migrate may fail which would be fatal to
> > > > userspace.
> > > >
> > > > In this case migration of the pinned page is unnecessary as the page can
> > > > be unpinned at anytime by having the driver revoke atomic permission as it
> > > > does for the migrate_to_ram() callback. However a method of calling this
> > > > when memory needs to be moved has yet to be resolved so any discussion
> > > > is welcome.
> > > 
> > > Why do we need to pin for gpu atomics? You still have the callback for
> > > cpu faults, so you
> > > can move the page as needed, and hence a long-term pin sounds like the
> > > wrong approach.
> > 
> > Technically a real long term unmoveable pin isn't required, because as you
> > say the page can be moved as needed at any time. However I needed some way
> > of stopping the CPU page from being freed once the userspace mappings for
> > it had been removed.
> 
> The issue is you took the page out of the PTE it belongs to, which
> makes it orphaned and unlocatable by the rest of the mm?
> 
> Ideally this would leave the PTE in place so everything continues to
> work, just disable CPU access to it.
> 
> Maybe some kind of special swap entry?
> 
> I also don't much like the use of ZONE_DEVICE here, that should only
> be used for actual device memory, not as a temporary proxy for CPU
> pages.. Having two struct pages refer to the same physical memory is
> pretty ugly.
> 
> > The normal solution of registering an MMU notifier to unpin the page when
> > it needs to be moved also doesn't work as the CPU page tables now point to
> > the device-private page and hence the migration code won't call any
> > invalidate notifiers for the CPU page.
> 
> The fact the page is lost from the MM seems to be the main issue here.
> 
> > Yes, I would like to avoid the long term pin constraints as well if possible
> > I just haven't found a solution yet. Are you suggesting it might be possible
> > to add a callback in the page migration logic to specially deal with moving
> > these pages?
> 
> How would migration even find the page?

Migration can scan memory by physical address (see isolate_migratepages_range()),
so the CPU mapping is not the only path to get to a page.
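
A simplified sketch of that kind of physical scan (isolate_migratepages_range()
itself takes a compact_control and does much more; the checks here are heavily
reduced for illustration):

static void scan_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;

		/* The page is reached purely via its physical address. */
		page = pfn_to_page(pfn);
		if (!PageLRU(page))
			continue;

		/* ... isolate the page and queue it for migration ... */
	}
}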

Cheers,
Jérôme



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-13 Thread Jerome Glisse
On Wed, Jan 13, 2021 at 07:39:36PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 13, 2021 at 04:56:38PM -0500, Jerome Glisse wrote:
> 
> > is a broken model and the way GPU use GUP is less broken then RDMA. In
> > GPU driver GUP contract with userspace is that the data the GPU can
> > access is a snapshot of what the process memory was at the time you
> > asked for the GUP. Process can start using different pages right after.
> > There is no constant coherency contract (ie CPU and GPU can be working
> > on different pages).
> 
> Look at the habana labs "totally not a GPU" driver, it doesn't work
> that way, GPU compute operations do want coherency.
> 
> The mmu notifier hackery some of the other GPU drivers use to get
> coherency requires putting the kernel between every single work
> submission, and has all kinds of wonky issues and limitations - I
> think it is net worse approach than GUP, honestly.

Yes, what GPU drivers do today with GUP is wrong, but it is only
used for texture upload/download. So that is a very limited
scope (amdkfd being an exception here).

Yes also to the fact that waiting on GPU fences from mmu notifier
callbacks is bad. We are thinking about how to solve this.

But what does matter is that the hardware is moving in the right
direction and we will no longer need GUP. So GUP is dying out in
GPU drivers.

Cheers,
Jérôme



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-13 Thread Jerome Glisse
On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 05:43:56PM -0500, Andrea Arcangeli wrote:
> > On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > > > The majority cannot be converted to notifiers because they are DMA
> > > > > based. Every one of those is an ABI for something, and does not expect
> > > > > extra privilege to function. It would be a major breaking change to
> > > > > have pin_user_pages require some cap.
> > > > 
> > > > ... what makes them safe is to be transient GUP pin and not long
> > > > term.
> > > > 
> > > > Please note the "long term" in the underlined line.
> > > 
> > > Many of them are long term, though only 50 or so have been marked
> > > specifically with FOLL_LONGTERM. I don't see how we can make such a
> > > major ABI break.
> > 
> > io_uring is one of those indeed and I already flagged it.
> > 
> > This isn't a black and white issue, kernel memory is also pinned but
> > it's not in movable pageblocks... How do you tell the VM in GUP to
> > migrate memory to a non movable pageblock before pinning it? Because
> > that's what it should do to create less breakage.
> 
> There is already a patch series floating about to do exactly that for
> FOLL_LONGTERM pins based on the existing code in GUP for CMA migration
> 
> > For example iommu obviously need to be privileged, if your argument
> > that it's enough to use the right API to take long term pins
> > unconstrained, that's not the case. Pins are pins and prevent moving
> > or freeing the memory, their effect is the same and again worse than
> > mlock on many levels.
> 
> The ship sailed on this a decade ago, it is completely infeasible to
> go back now, it would completely break widely used things like GPU,
> RDMA and more.
> 

I am late to this, but GPU should not be used as an excuse for GUP. GUP
is a broken model and the way GPUs use GUP is less broken than RDMA. In
GPU drivers the GUP contract with userspace is that the data the GPU can
access is a snapshot of what the process memory was at the time you
asked for the GUP. The process can start using different pages right after.
There is no constant coherency contract (ie the CPU and GPU can be working
on different pages).

If you want coherency, ie always have the CPU and GPU work on the same page,
then you need to use mmu notifiers and avoid pinning pages. Anything that
does not abide by mmu notifiers is broken and cannot be fixed.
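
As a rough sketch, this is the mmu_interval_notifier / hmm_range_fault retry
pattern recent kernels provide for exactly that; driver page table locking and
the hmm_range setup are omitted, and error handling is minimal:

#include <linux/mmu_notifier.h>
#include <linux/hmm.h>

static bool demo_invalidate(struct mmu_interval_notifier *mni,
			    const struct mmu_notifier_range *range,
			    unsigned long cur_seq)
{
	/* Tear down the device mapping for this range; must not block for long. */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops demo_ops = {
	.invalidate = demo_invalidate,
};

static int demo_map_range(struct mmu_interval_notifier *mni,
			  struct hmm_range *range)
{
	int ret;

	do {
		range->notifier_seq = mmu_interval_read_begin(mni);
		mmap_read_lock(mni->mm);
		ret = hmm_range_fault(range);	/* fault pages, no pinning */
		mmap_read_unlock(mni->mm);
		if (ret)
			return ret;
		/* Program the device page table under the driver lock here. */
	} while (mmu_interval_read_retry(mni, range->notifier_seq));

	return 0;
}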

Cheers,
Jérôme



Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-12 Thread Jerome Glisse
On Mon, Jan 11, 2021 at 02:18:13PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2021 at 11:19 AM Linus Torvalds
>  wrote:
> >
> > On Sun, Jan 10, 2021 at 11:27 PM John Hubbard  wrote:
> > > IMHO, a lot of the bits in page _refcount are still being wasted (even
> > > after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> > > there are many callers of gup/pup per page.
> >
> > It may be unlikely under real loads.
> >
> > But we've actually had overflow issues on this because rather than
> > real loads you can do attack loads (ie "lots of processes, lots of
> > pipe file descriptors, lots of vmsplice() operations on the same
> > page".
> >
> > We had to literally add that conditional "try_get_page()" that
> > protects against overflow..
> 
> Actually, what I think might be a better model is to actually
> strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS
> entirely.
> 
> What we could do is just make a few clear rules explicit (most of
> which we already basically hold to). Starting from that basic
> 
>  (a) Anonymous pages are made writable (ie COW) only when they have a
> page_count() of 1
> 
> That very simple rule then automatically results in the corollary
> 
>  (b) a writable page in a COW mapping always starts out reachable
> _only_ from the page tables
> 
> and now we could have a couple of really simple new rules:
> 
>  (c) we never ever make a writable page in a COW mapping read-only
> _unless_ it has a page_count() of 1

This breaks mprotect(PROT_READ); I do not think we want to do that. It
might break security schemes in userspace applications which expect
mprotect to make the CPU mapping read-only.

Maybe an alternative would be to copy the page on mprotect for pages that
do not have a page_count of 1? But that makes me uneasy about short-lived
GUP (direct IO racing with an mprotect, or maybe simply page migration)
versus unbounded ones (like RDMA).


Also I want to make sure I properly understand what happens on fork()
in a COW mapping for a page that has a page_count > 1: do we copy the
page instead of write protecting it?

I believe it would be better here to write protect the page on the CPU but
forbid the child to reuse it, ie if the child ever inherits the page
(because the parent unmapped it, for instance) it will have to make a copy,
and the GUP reference (taken before the fork) might linger on a page that
is no longer associated with any VMA. This way we keep fork fast.


Jérôme



Re: [PATCH 00/14] Small step toward KSM for file back page.

2020-10-08 Thread Jerome Glisse
On Thu, Oct 08, 2020 at 04:43:41PM +0100, Matthew Wilcox wrote:
> On Thu, Oct 08, 2020 at 11:30:28AM -0400, Jerome Glisse wrote:
> > On Wed, Oct 07, 2020 at 11:09:16PM +0100, Matthew Wilcox wrote:
> > > So ... why don't you put a PageKsm page in the page cache?  That way you
> > > can share code with the current KSM implementation.  You'd need
> > > something like this:
> > 
> > I do just that but there is no need to change anything in page cache.
> 
> That's clearly untrue.  If you just put a PageKsm page in the page
> cache today, here's what will happen on a truncate:
> 
> void truncate_inode_pages_range(struct address_space *mapping,
> loff_t lstart, loff_t lend)
> {
> ...
> struct page *page = find_lock_page(mapping, start - 1);
> 
> find_lock_page() does this:
> return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
> 
> pagecache_get_page():
> 
> repeat:
> page = find_get_entry(mapping, index);
> ...
> if (fgp_flags & FGP_LOCK) {
> ...
> if (unlikely(compound_head(page)->mapping != mapping)) {
> unlock_page(page);
> put_page(page);
> goto repeat;
> 
> so it's just going to spin.  There are plenty of other codepaths that
> would need to be checked.  If you haven't found them, that shows you
> don't understand the problem deeply enough yet.

I also changed truncate, splice and a few other special cases that do
not go through GUP/page fault/mkwrite (memory debug too, but that is
a different beast).


> I believe we should solve this problem, but I don't think you're going
> about it the right way.

I have done much more than what I posted, but there is a bug that I
need to hammer down before posting everything, and I wanted to get
the discussion started. I guess I will finish tracking that one
down and post the whole thing.


> > So flow is:
> > 
> >   Same as before:
> > 1 - write fault (address, vma)
> > 2 - regular write fault handler -> find page in page cache
> > 
> >   New to common page fault code:
> > 3 - ksm check in write fault common code (same as ksm today
> > for anonymous page fault code path).
> > 4 - break ksm (address, vma) -> (file offset, mapping)
> > 4.a - use mapping and file offset to lookup the proper
> >   fs specific information that were save when the
> >   page was made ksm.
> > 4.b - allocate new page and initialize it with that
> >   information (and page content), update page cache
> >   and mappings ie all the pte who where pointing to
> >   the ksm for that mapping at that offset to now use
> >   the new page (like KSM for anonymous page today).
> 
> But by putting that logic in the page fault path, you've missed
> the truncate path.  And maybe other places.  Putting the logic
> down in pagecache_get_page() means you _don't_ need to find
> all the places that call pagecache_get_page().

There are cases where the page cache is not even in the loop, ie you
already have the page and do not need to look it up (page fault, some
fs common code, anything that goes through GUP, memory reclaim, ...).
Making all those places go through the page cache every time would
slow them down, and many are hot code paths that I do not believe we
want to slow down even when the feature is not in use.

Cheers,
Jérôme



Re: [PATCH 00/14] Small step toward KSM for file back page.

2020-10-08 Thread Jerome Glisse
On Wed, Oct 07, 2020 at 11:09:16PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 07, 2020 at 01:54:19PM -0400, Jerome Glisse wrote:
> > > For other things (NUMA distribution), we can point to something which
> > > isn't a struct page and can be distiguished from a real struct page by a
> > > bit somewhere (I have ideas for at least three bits in struct page that
> > > could be used for this).  Then use a pointer in that data structure to
> > > point to the real page.  Or do NUMA distribution at the inode level.
> > > Have a way to get from (inode, node) to an address_space which contains
> > > just regular pages.
> > 
> > How do you find all the copies ? KSM maintains a list for a reasons.
> > Same would be needed here because if you want to break the write prot
> > you need to find all the copy first. If you intend to walk page table
> > then how do you synchronize to avoid more copy to spawn while you
> > walk reverse mapping, we could lock the struct page i guess. Also how
> > do you walk device page table which are completely hidden from core mm.
> 
> So ... why don't you put a PageKsm page in the page cache?  That way you
> can share code with the current KSM implementation.  You'd need
> something like this:

I do just that, but there is no need to change anything in the page cache,
so the code below is not necessary. What you need is a way to find all
the copies: on a write fault (or any write access) you get the mapping
and offset, and you use those to look up the fs specific information and
de-duplicate the page into a new page carrying that information. Hence
the filesystem code does not need to know anything; it all happens in
generic common code.

So the flow is:

  Same as before:
    1 - write fault (address, vma)
    2 - regular write fault handler -> find page in page cache

  New to common page fault code:
    3 - ksm check in write fault common code (same as ksm today
        for the anonymous page fault code path).
    4 - break ksm (address, vma) -> (file offset, mapping)
        4.a - use mapping and file offset to look up the proper
              fs specific information that was saved when the
              page was made ksm.
        4.b - allocate a new page and initialize it with that
              information (and the page content), update the page
              cache and the mappings, ie all the ptes that were
              pointing to the ksm page for that mapping at that
              offset now use the new page (like KSM for anonymous
              pages today).

  Resume regular code path:
    mkwrite / set pte ...

Roughly the same applies to the write ioctl (other cases go through GUP,
which itself goes through the page fault code path). There is no need to
change the page cache in any way, just the common code paths that enable
writes to file backed pages.

The fs specific information is page->private, some of the flags
(page->flags) and page->index (the file offset). Every time a page
is deduplicated, a copy of that information is saved in an alias
struct which you can get to from the shared KSM page (page->mapping
is a pointer to a ksm root struct which has a pointer to the list
of all aliases).
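
A purely illustrative sketch of that bookkeeping; the struct and field
names below are made up for this description, not taken from the posted
series:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mm_types.h>

struct fksm_alias {
	struct address_space *mapping;	/* original file mapping */
	pgoff_t index;			/* original page->index (file offset) */
	unsigned long private;		/* saved page->private */
	unsigned long flags;		/* saved page->flags bits of interest */
	struct list_head node;		/* linked into the root's alias list */
};

struct fksm_root {
	struct page *shared;		/* the deduplicated (shared) page */
	struct list_head aliases;	/* one fksm_alias per deduplicated copy */
};

/*
 * page->mapping of the shared page points (suitably tagged) to the fksm_root,
 * so a write fault can go page -> root -> alias(mapping, index) and rebuild a
 * private copy carrying the saved fs specific information.
 */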

> 
> +++ b/mm/filemap.c
> @@ -1622,6 +1622,9 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
> lock_page(page);
> /* Has the page been truncated? */
> if (unlikely(page->mapping != mapping)) {
> +   if (PageKsm(page)) {
> +   ...
> +   }
> unlock_page(page);
> put_page(page);
> goto repeat;
> @@ -1655,6 +1658,7 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
>   * * %FGP_WRITE - The page will be written
>   * * %FGP_NOFS - __GFP_FS will get cleared in gfp mask
>   * * %FGP_NOWAIT - Don't get blocked by page lock
> + * * %FGP_KSM - Return KSM pages
>   *
>   * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
>   * if the %GFP flags specified for %FGP_CREAT are atomic.
> @@ -1687,6 +1691,11 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
>  
> /* Has the page been truncated? */
> if (unlikely(page->mapping != mapping)) {
> +   if (PageKsm(page)) {
> +   if (fgp_flags & FGP_KSM)
> +   return page;
> +   ...
> +   }
> unlock_page(page);
> put_page(page);
> goto repeat;
> 
> I don't know what you want to do when you find a KSM page, so I just left
> an ellipsis.
> 



Re: [PATCH 00/14] Small step toward KSM for file back page.

2020-10-07 Thread Jerome Glisse
On Wed, Oct 07, 2020 at 07:33:16PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 07, 2020 at 01:54:19PM -0400, Jerome Glisse wrote:
> > On Wed, Oct 07, 2020 at 06:05:58PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 07, 2020 at 10:48:35AM -0400, Jerome Glisse wrote:
> > > > On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Oct 06, 2020 at 09:05:49PM -0400, jgli...@redhat.com wrote:
> > > For other things (NUMA distribution), we can point to something which

[...]

> > > isn't a struct page and can be distiguished from a real struct page by a
> > > bit somewhere (I have ideas for at least three bits in struct page that
> > > could be used for this).  Then use a pointer in that data structure to
> > > point to the real page.  Or do NUMA distribution at the inode level.
> > > Have a way to get from (inode, node) to an address_space which contains
> > > just regular pages.
> > 
> > How do you find all the copies ? KSM maintains a list for a reasons.
> > Same would be needed here because if you want to break the write prot
> > you need to find all the copy first. If you intend to walk page table
> > then how do you synchronize to avoid more copy to spawn while you
> > walk reverse mapping, we could lock the struct page i guess. Also how
> > do you walk device page table which are completely hidden from core mm.
> 
> You have the inode and you iterate over each mapping, looking up the page
> that's in each mapping.  Or you use the i_mmap tree to find the pages.

This would slow things down for everyone, as we would have to walk all
mappings each time we try to write to a page. Also, we have a mechanism
for page writeback to avoid races between a thread trying to write and
writeback; we would need something similar here. Without mediating this
through struct page I do not see how to keep this reasonable from a
performance point of view.


> > > I don't have time to work on all of these.  If there's one that
> > > particularly interests you, let's dive deep into it and figure out how
> > 
> > I care about KSM, duplicate NUMA copy (not only for CPU but also
> > device) and write protection or exclusive write access. In each case
> > you need a list of all the copy (for KSM of the deduplicated page)
> > Having a special entry in the page cache does not sound like a good
> > option in many code path you would need to re-look the page cache to
> > find out if the page is in special state. If you use a bit flag in
> > struct page how do you get to the callback or to the copy/alias,
> > walk all the page tables ?
> 
> Like I said, something that _looks_ like a struct page.  At least looks
> enough like a struct page that you can pull a pointer out of the page
> cache and check the bit.  But since it's not actually a struct page,
> you can use the rest of the data structure for pointers to things you
> want to track.  Like the real struct page.

What I fear is the added cost, because it means we need to do this
lookup every time to check, and we also need proper locking to avoid
races. Adding an ancillary struct and trying to keep everything
synchronized seems harder to me.

> 
> > I do not see how i am doing violence to struct page :) The basis of
> > my approach is to pass down the mapping. We always have the mapping
> > at the top of the stack (either syscall entry point on a file or
> > through the vma when working on virtual address).
> 
> Yes, you explained all that in Utah.  I wasn't impressed than, and I'm
> not impressed now.

Is this more of a taste thing, or is there something specific you do
not like?

Cheers,
Jérôme



Re: [PATCH 00/14] Small step toward KSM for file back page.

2020-10-07 Thread Jerome Glisse
On Wed, Oct 07, 2020 at 06:05:58PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 07, 2020 at 10:48:35AM -0400, Jerome Glisse wrote:
> > On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 06, 2020 at 09:05:49PM -0400, jgli...@redhat.com wrote:
> > > > The present patchset just add mapping argument to the various vfs call-
> > > > backs. It does not make use of that new parameter to avoid regression.
> > > > I am posting this whole things as small contain patchset as it is rather
> > > > big and i would like to make progress step by step.
> > > 
> > > Well, that's the problem.  This patch set is gigantic and unreviewable.
> > > And it has no benefits.  The idea you present here was discussed at
> > > LSFMM in Utah and I recall absolutely nobody being in favour of it.
> > > You claim many wonderful features will be unlocked by this, but I think
> > > they can all be achieved without doing any of this very disruptive work.
> > 
> > You have any ideas on how to achieve them without such change ? I will
> > be more than happy for a simpler solution but i fail to see how you can
> > work around the need for a pointer inside struct page. Given struct
> > page can not grow it means you need to be able to overload one of the
> > existing field, at least i do not see any otherway.
> 
> The one I've spent the most time thinking about is sharing pages between
> reflinked files.  My approach is to pull DAX entries into the main page
> cache and have them reference the PFN directly.  It's not a struct page,
> but we can find a struct page from it if we need it.  The struct page
> would belong to a mapping that isn't part of the file.

You would need a lot of filesystem specific changes to make sure the
fs understands the special mapping. It is doable, but I feel it would
have a lot of fs specific parts.


> For other things (NUMA distribution), we can point to something which
> isn't a struct page and can be distiguished from a real struct page by a
> bit somewhere (I have ideas for at least three bits in struct page that
> could be used for this).  Then use a pointer in that data structure to
> point to the real page.  Or do NUMA distribution at the inode level.
> Have a way to get from (inode, node) to an address_space which contains
> just regular pages.

How do you find all the copies? KSM maintains a list for a reason.
The same would be needed here, because if you want to break the write
protection you need to find all the copies first. If you intend to walk
the page tables, how do you synchronize to avoid more copies spawning
while you walk the reverse mapping? We could lock the struct page, I
guess. Also, how do you walk device page tables, which are completely
hidden from the core mm?


> Using main memory to cache DAX could be done today without any data
> structure changes.  It just needs the DAX entries pulled up into the
> main pagecache.  See earlier item.
> 
> Exclusive write access ... you could put a magic value in the pagecache
> for pages which are exclusively for someone else's use and handle those
> specially.  I don't entirely understand this use case.

For this use case you need a callback to break the protection, and it
needs to handle all cases, ie not only writes by the CPU through a file
mapping but also the file write syscall and other syscalls that can
write to the page (pipe, ...).


> I don't have time to work on all of these.  If there's one that
> particularly interests you, let's dive deep into it and figure out how

I care about KSM, duplicate NUMA copies (not only for the CPU but also
for devices) and write protection or exclusive write access. In each
case you need a list of all the copies (for KSM, of the deduplicated
page). Having a special entry in the page cache does not sound like a
good option: in many code paths you would need to re-look up the page
cache to find out whether the page is in a special state. If you use a
bit flag in struct page, how do you get to the callback or to the
copy/alias? Walk all the page tables?

> you can do it without committing this kind of violence to struct page.

I do not see how I am doing violence to struct page :) The basis of
my approach is to pass down the mapping. We always have the mapping
at the top of the stack (either at the syscall entry point on a file
or through the vma when working on a virtual address).

But we rarely pass this mapping down the stack into the fs code. I am
only passing the mapping down to the bottom of the stack so we do not
need to rely on page->mapping all the time. I am not trying to remove
the page->mapping field, it is still useful; I just want to be able to
overload it so that we can make the KSM code generic and reuse that
generic part for other use cases.
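
For reference, this is roughly how anonymous KSM already overloads
page->mapping with tag bits today (values as in current
include/linux/page-flags.h, shown here only to illustrate the trick being
extended to file backed pages):

#define PAGE_MAPPING_ANON	0x1
#define PAGE_MAPPING_MOVABLE	0x2
#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

static inline bool page_is_ksm(const struct page *page)
{
	/* A KSM page stores a tagged stable-node pointer in page->mapping. */
	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
	       PAGE_MAPPING_KSM;
}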

Cheers,
Jérôme



Re: [PATCH 00/14] Small step toward KSM for file back page.

2020-10-07 Thread Jerome Glisse
On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 06, 2020 at 09:05:49PM -0400, jgli...@redhat.com wrote:
> > The present patchset just add mapping argument to the various vfs call-
> > backs. It does not make use of that new parameter to avoid regression.
> > I am posting this whole things as small contain patchset as it is rather
> > big and i would like to make progress step by step.
> 
> Well, that's the problem.  This patch set is gigantic and unreviewable.
> And it has no benefits.  The idea you present here was discussed at
> LSFMM in Utah and I recall absolutely nobody being in favour of it.
> You claim many wonderful features will be unlocked by this, but I think
> they can all be achieved without doing any of this very disruptive work.

Do you have any ideas on how to achieve them without such a change? I
will be more than happy with a simpler solution, but I fail to see how
you can work around the need for a pointer inside struct page. Given
that struct page cannot grow, you need to be able to overload one of
the existing fields; at least I do not see any other way.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-22 Thread Jerome Glisse
On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
> > Not doable as page refcount can change for things unrelated to GUP, with
> > John changes we can identify GUP and we could potentialy copy GUPed page
> > instead of COW but this can potentialy slow down fork() and i am not sure
> > how acceptable this would be. Also this does not solve GUP against page
> > that are already in fork tree ie page P0 is in process A which forks,
> > we now have page P0 in process A and B. Now we have process A which forks
> > again and we have page P0 in A, B, and C. Here B and C are two branches
> > with root in A. B and/or C can keep forking and grow the fork tree.
> 
> For a long time now RDMA has broken COW pages when creating user DMA
> regions.
> 
> The problem has been that fork re-COW's regions that had their COW
> broken.
> 
> So, if you break the COW upon mapping and prevent fork (and others)
> from copying DMA pinned then you'd cover the cases.

I am not sure we want to prevent COW for pinned GUP pages; this would
change the current semantics and potentially break or slow down existing
apps.

Anyway, I think we focus too much on fork/COW. It is just an unfixable
broken corner case, and mmu notifiers allow you to avoid it. Forcing a
real copy on fork would likely be seen as a regression by most people.


> > Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some
> > what "fix" that but GUP fast is still succeptible to this.
> 
> Ah, so everyone breaks the COW now, not just RDMA..
> 
> What do you mean 'GUP fast is still succeptible to this' ?

Not all GUP fast paths were updated (intentionally); __get_user_pages_fast()
for instance still keeps COW intact. People using GUP should really know
what they are doing.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 10:43:20PM +0200, Daniel Vetter wrote:
> On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse  wrote:
> >
> > On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > > > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > > >
> > > > > > The madness is only that device B's mmu notifier might need to wait
> > > > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > > > wait for device A to finish first.
> > > > >
> > > > > So, it sound, fundamentally you've got this graph of operations across
> > > > > an unknown set of drivers and the kernel cannot insert itself in
> > > > > dma_fence hand offs to re-validate any of the buffers involved?
> > > > > Buffers which by definition cannot be touched by the hardware yet.
> > > > >
> > > > > That really is a pretty horrible place to end up..
> > > > >
> > > > > Pinning really is right answer for this kind of work flow. I think
> > > > > converting pinning to notifers should not be done unless notifier
> > > > > invalidation is relatively bounded.
> > > > >
> > > > > I know people like notifiers because they give a bit nicer performance
> > > > > in some happy cases, but this cripples all the bad cases..
> > > > >
> > > > > If pinning doesn't work for some reason maybe we should address that?
> > > >
> > > > Note that the dma fence is only true for user ptr buffer which predate
> > > > any HMM work and thus were using mmu notifier already. You need the
> > > > mmu notifier there because of fork and other corner cases.
> > >
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> >
> > Just no way to deal with it easily, i thought about forcing the
> > anon_vma (page->mapping for anonymous page) to the anon_vma that
> > belongs to the vma against which the GUP was done but it would
> > break things if page is already in other branch of a fork tree.
> > Also this forbid fast GUP.
> >
> > Quite frankly the fork was not the main motivating factor. GPU
> > can pin potentialy GBytes of memory thus we wanted to be able
> > to release it but since Michal changes to reclaim code this is
> > no longer effective.
> 
> What where how? My patch to annote reclaim paths with mmu notifier
> possibility just landed in -mm, so if direct reclaim can't reclaim mmu
> notifier'ed stuff anymore we need to know.
> 
> Also this would resolve the entire pain we're discussing in this
> thread about dma_fence_wait deadlocking against anything that's not
> GFP_ATOMIC ...

Sorry, my bad: reclaim still works, it is only the OOM path that skips
it. It was a couple of years ago and I thought that some of the things
discussed back then had made it upstream.

It is probably a good time to also point out that what I wanted
to do is have all the mmu notifier callbacks provide some kind
of fence (not a dma fence) so that we can split the notification
into steps:
A - schedule the notification on all devices/systems and get fences;
    this step should minimize lock dependencies and should not
    have to wait for anything. It is also best if you can avoid
    memory allocation, for instance by pre-allocating what you
    need for the notification.
B - mm can do things like unmap but cannot map new pages,
    so it writes special swap ptes to the CPU page table.
C - wait on each fence from A.
... resume old code, ie replace the pte or finish the unmap ...

The idea here is that at step C the core mm can decide to back
off if any fence returned from A would have to wait. This means that
every device has invalidated for nothing, but if we get there it
might still be a good thing, as next time around the kernel might
succeed without a wait.

This would allow things like reclaim to make forward progress
and skip over, or limit the wait to a given timeout.

I also thought of extending this even to multi-CPU TLB flushes so
that devices and CPUs follow the same pattern and we can make
progress on each in parallel.


Getting to such a scheme is a lot of work. My plan was to first
make the fence part of the notifier user API and hide it from the
mm inside the notifier common code, then update each core mm path
to the new model and see if there is any benefit from it. Reclaim
would be the first candidate.
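
A hypothetical sketch of what the A/B/C split could look like from the core
mm side; none of these helpers or the fence struct exist, they only
illustrate the idea:

#include <linux/completion.h>
#include <linux/mmu_notifier.h>

struct mmu_notifier_fence {		/* hypothetical, not a dma fence */
	struct completion done;
};

/* Hypothetical helpers standing in for steps A and B described above. */
static int schedule_device_invalidations(struct mmu_notifier_range *range,
					 struct mmu_notifier_fence **fences);
static void unmap_and_poison_ptes(struct mmu_notifier_range *range);

static int example_invalidate(struct mmu_notifier_range *range)
{
	struct mmu_notifier_fence *fences[16];
	int i, n;

	/* A: every device schedules its invalidation and hands back a fence. */
	n = schedule_device_invalidations(range, fences);

	/* B: unmap / write special swap ptes; nothing new gets mapped. */
	unmap_and_poison_ptes(range);

	/*
	 * C: wait on the fences. A caller like reclaim could instead back
	 * off here, or bound the wait with a timeout.
	 */
	for (i = 0; i < n; i++)
		wait_for_completion(&fences[i]->done);

	return 0;
}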

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 04:55:38PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
> > Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > >>>
> > >>>> The madness is only that device B's mmu notifier might need to wait
> > >>>> for fence_B so that the dma operation finishes. Which in turn has to
> > >>>> wait for device A to finish first.
> > >>> So, it sound, fundamentally you've got this graph of operations across
> > >>> an unknown set of drivers and the kernel cannot insert itself in
> > >>> dma_fence hand offs to re-validate any of the buffers involved?
> > >>> Buffers which by definition cannot be touched by the hardware yet.
> > >>>
> > >>> That really is a pretty horrible place to end up..
> > >>>
> > >>> Pinning really is right answer for this kind of work flow. I think
> > >>> converting pinning to notifers should not be done unless notifier
> > >>> invalidation is relatively bounded. 
> > >>>
> > >>> I know people like notifiers because they give a bit nicer performance
> > >>> in some happy cases, but this cripples all the bad cases..
> > >>>
> > >>> If pinning doesn't work for some reason maybe we should address that?
> > >> Note that the dma fence is only true for user ptr buffer which predate
> > >> any HMM work and thus were using mmu notifier already. You need the
> > >> mmu notifier there because of fork and other corner cases.
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> > >
> > > (isn't this broken for O_DIRECT as well anyhow?)
> > >
> > > How does mmu_notifiers help the fork case anyhow? Block fork from
> > > progressing?
> > 
> > How much the mmu_notifier blocks fork progress depends, on quickly we
> > can preempt GPU jobs accessing affected memory. If we don't have
> > fine-grained preemption capability (graphics), the best we can do is
> > wait for the GPU jobs to complete. We can also delay submission of new
> > GPU jobs to the same memory until the MMU notifier is done. Future jobs
> > would use the new page addresses.
> > 
> > With fine-grained preemption (ROCm compute), we can preempt GPU work on
> > the affected adders space to minimize the delay seen by fork.
> > 
> > With recoverable device page faults, we can invalidate GPU page table
> > entries, so device access to the affected pages stops immediately.
> > 
> > In all cases, the end result is, that the device page table gets updated
> > with the address of the copied pages before the GPU accesses the COW
> > memory again.Without the MMU notifier, we'd end up with the GPU
> > corrupting memory of the other process.
> 
> The model here in fork has been wrong for a long time, and I do wonder
> how O_DIRECT manages to not be broken too.. I guess the time windows
> there are too small to get unlucky.

This was discussed extensively in the GUP work John has been doing.
Yes, O_DIRECT can potentially break, but only if you are writing to
COW pages, you initiated the O_DIRECT right before the fork, and
GUP happened before fork was able to write protect the pages.

If you O_DIRECT but use the memory as input, ie you are writing the
memory to the file, not reading from the file, then fork is harmless
as you are just reading memory. You can still face the COW uncertainty
(the process against which you did the O_DIRECT gets "new" pages while
your O_DIRECT goes on with the "old" pages), but doing O_DIRECT and fork
concurrently is asking for trouble.

> 
> If you have a write pin on a page then it should not be COW'd into the
> fork'd process but copied with the originating page remaining with the
> original mm.
> 
> I wonder if there is some easy way to achive that - if that is the
> main reason to use notifiers then it would be a better solution.

Not doable, as the page refcount can change for things unrelated to GUP.
With John's changes we can identify GUP and we could potentially copy the
GUPed page instead of COWing it, but this could slow down fork() and I am
not sure how acceptable that would be. Also, it does not solve GUP against
pages that are already in a fork tree, ie page P0 is in process A which
forks: we now have page P0 in A and B. Process A forks again and we have
page P0 in A, B and C, where B and C are two branches with their root in A.
B and/or C can keep forking and grow the fork tree.


Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > 
> > > > The madness is only that device B's mmu notifier might need to wait
> > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > wait for device A to finish first.
> > > 
> > > So, it sound, fundamentally you've got this graph of operations across
> > > an unknown set of drivers and the kernel cannot insert itself in
> > > dma_fence hand offs to re-validate any of the buffers involved?
> > > Buffers which by definition cannot be touched by the hardware yet.
> > > 
> > > That really is a pretty horrible place to end up..
> > > 
> > > Pinning really is right answer for this kind of work flow. I think
> > > converting pinning to notifers should not be done unless notifier
> > > invalidation is relatively bounded. 
> > > 
> > > I know people like notifiers because they give a bit nicer performance
> > > in some happy cases, but this cripples all the bad cases..
> > > 
> > > If pinning doesn't work for some reason maybe we should address that?
> > 
> > Note that the dma fence is only true for user ptr buffer which predate
> > any HMM work and thus were using mmu notifier already. You need the
> > mmu notifier there because of fork and other corner cases.
> 
> I wonder if we should try to fix the fork case more directly - RDMA
> has this same problem and added MADV_DONTFORK a long time ago as a
> hacky way to deal with it.
>
> Some crazy page pin that resolved COW in a way that always kept the
> physical memory with the mm that initiated the pin?

There is just no way to deal with it easily. I thought about forcing the
anon_vma (page->mapping for an anonymous page) to the anon_vma that
belongs to the vma against which the GUP was done, but that would
break things if the page is already in another branch of a fork tree.
It also forbids fast GUP.

Quite frankly, fork was not the main motivating factor. GPUs can pin
potentially gigabytes of memory, so we wanted to be able to release it,
but since Michal's changes to the reclaim code this is no longer
effective.

User ptr buffers should never end up in those weird corner cases. IIRC
the first usage was for xorg EXA texture upload, which was then
generalized to texture upload in Mesa and later on to more upload cases
(vertices, ...). At least this is what I remember today. So in
those cases we do not expect fork, splice, mremap, mprotect, ...

Maybe we can audit how user ptr buffers are used today and see if
we can define a usage pattern that would allow us to cut corners in
the kernel. For instance, we could use the mmu notifier just to block
CPU pte updates while we do GUP, and thus never wait on a dma fence.

Then GPU drivers just keep the GUP pin around until they are done
with the page. They can also use the mmu notifier to keep a flag
so that the driver knows if it needs to redo a GUP, ie:

The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)

    GPU_validate_buffer_pages(bo)
        if (!bo->need_gup)
            return;
        put_pages(bo->pages);
        range = bo_vaddr_range(bo)
        gpu_lock_cpu_pagetable(range)
        GUP(bo->pages, range)
        gpu_unlock_cpu_pagetable(range)


Depending on how user_ptr buffers are used today, this could work.


> (isn't this broken for O_DIRECT as well anyhow?)

Yes, it can in theory, if you have an application that does O_DIRECT
and fork concurrently (ie O_DIRECT in one thread and fork in another).
Note that O_DIRECT after fork is fine; it is an issue only if GUP fast
was able to look up a page with write permission before fork had the
chance to update it to read-only for COW.

But doing O_DIRECT (or anything that uses GUP fast) in one thread and
fork in another is inherently broken, ie there is no way to fix it.

See 17839856fd588f4ab6b789f482ed3ffd7c403e1f

> 
> How does mmu_notifiers help the fork case anyhow? Block fork from
> progressing?

It enforces ordering between fork and GUP: if fork is first it blocks
GUP, and if fork is last then fork waits on GUP and then the user
buffer gets invalidated.

> 
> > I probably need to warn AMD folks again that using HMM means that you
> > must be able to update the GPU page table asynchronously without
> > fence wait.
> 
> It is kind of unrelated to HMM, it just shouldn't be using mmu
> notifiers to replace page pinning..

Well, my POV is that if you abide by the rules HMM defined then you do
not need to pin pages. The rule is asynchronous device page table
updates.

Pinning pages is problematic; it blocks many core mm features and
is just bad all around. It is also inherently broken in the face
of fork/mremap/splice/...

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
> 
> Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
> > On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse  wrote:
> >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >>>
> >>>> The madness is only that device B's mmu notifier might need to wait
> >>>> for fence_B so that the dma operation finishes. Which in turn has to
> >>>> wait for device A to finish first.
> >>> So, it sound, fundamentally you've got this graph of operations across
> >>> an unknown set of drivers and the kernel cannot insert itself in
> >>> dma_fence hand offs to re-validate any of the buffers involved?
> >>> Buffers which by definition cannot be touched by the hardware yet.
> >>>
> >>> That really is a pretty horrible place to end up..
> >>>
> >>> Pinning really is right answer for this kind of work flow. I think
> >>> converting pinning to notifers should not be done unless notifier
> >>> invalidation is relatively bounded.
> >>>
> >>> I know people like notifiers because they give a bit nicer performance
> >>> in some happy cases, but this cripples all the bad cases..
> >>>
> >>> If pinning doesn't work for some reason maybe we should address that?
> >> Note that the dma fence is only true for user ptr buffer which predate
> >> any HMM work and thus were using mmu notifier already. You need the
> >> mmu notifier there because of fork and other corner cases.
> >>
> >> For nouveau the notifier do not need to wait for anything it can update
> >> the GPU page table right away. Modulo needing to write to GPU memory
> >> using dma engine if the GPU page table is in GPU memory that is not
> >> accessible from the CPU but that's never the case for nouveau so far
> >> (but i expect it will be at one point).
> >>
> >>
> >> So i see this as 2 different cases, the user ptr case, which does pin
> >> pages by the way, where things are synchronous. Versus the HMM cases
> >> where everything is asynchronous.
> >>
> >>
> >> I probably need to warn AMD folks again that using HMM means that you
> >> must be able to update the GPU page table asynchronously without
> >> fence wait. The issue for AMD is that they already update their GPU
> >> page table using DMA engine. I believe this is still doable if they
> >> use a kernel only DMA engine context, where only kernel can queue up
> >> jobs so that you do not need to wait for unrelated things and you can
> >> prioritize GPU page table update which should translate in fast GPU
> >> page table update without DMA fence.
> > All devices which support recoverable page faults also have a
> > dedicated paging engine for the kernel driver which the driver already
> > makes use of.  We can also update the GPU page tables with the CPU.
> 
> We have a potential problem with CPU updating page tables while the GPU
> is retrying on page table entries because 64 bit CPU transactions don't
> arrive in device memory atomically.
> 
> We are using SDMA for page table updates. This currently goes through a
> the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode
> only. But since it's based on the DRM GPU scheduler, we do use dma-fence
> to wait for completion.

Yeah, my worry is mostly that some cross dma fence dependency leaks
into it, but that should never really happen; maybe there is a way to
catch it if it does and print a warning.

So yes, you can use dma fences, as long as they do not have cross
dependencies. Another expectation is that they complete quickly, and
page table updates usually do.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Thu, Jun 11, 2020 at 07:35:35PM -0400, Felix Kuehling wrote:
> Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
> > On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
> >>> I still have my doubts about allowing fence waiting from within shrinkers.
> >>> IMO ideally they should use a trywait approach, in order to allow memory
> >>> allocation during command submission for drivers that
> >>> publish fences before command submission. (Since early reservation object
> >>> release requires that).
> >> Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
> >> with a mempool to make sure it can handle it's allocations.
> >>
> >>> But since drivers are already waiting from within shrinkers and I take
> >>> your word for HMM requiring this,
> >> Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
> >> one, the shrinker one is a lot less established.
> > I really question if HW that needs something like DMA fence should
> > even be using mmu notifiers - the best use is HW that can fence the
> > DMA directly without having to get involved with some command stream
> > processing.
> >
> > Or at the very least it should not be a generic DMA fence but a
> > narrowed completion tied only into the same GPU driver's command
> > completion processing which should be able to progress without
> > blocking.
> >
> > The intent of notifiers was never to endlessly block while vast
> > amounts of SW does work.
> >
> > Going around and switching everything in a GPU to GFP_ATOMIC seems
> > like bad idea.
> >
> >> I've pinged a bunch of armsoc gpu driver people and ask them how much this
> >> hurts, so that we have a clear answer. On x86 I don't think we have much
> >> of a choice on this, with userptr in amd and i915 and hmm work in nouveau
> >> (but nouveau I think doesn't use dma_fence in there). 
> 
> Soon nouveau will get company. We're working on a recoverable page fault
> implementation for HMM in amdgpu where we'll need to update page tables
> using the GPUs SDMA engine and wait for corresponding fences in MMU
> notifiers.

Note that HMM mandates, and I have stressed this several times in the
past, that all GPU page table updates are asynchronous and do not have
to wait on _anything_.

I understand that you use a DMA engine for GPU page table updates, but
if you want to do so with HMM then you need a DMA context dedicated to
GPU page table updates, through which all such updates go and on which
user space cannot queue up jobs.

It can be for HMM only, but if you want to mix HMM with non-HMM then
everything needs to be on that queue, and other command queues will
have to depend on it.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> 
> > The madness is only that device B's mmu notifier might need to wait
> > for fence_B so that the dma operation finishes. Which in turn has to
> > wait for device A to finish first.
> 
> So, it sound, fundamentally you've got this graph of operations across
> an unknown set of drivers and the kernel cannot insert itself in
> dma_fence hand offs to re-validate any of the buffers involved?
> Buffers which by definition cannot be touched by the hardware yet.
> 
> That really is a pretty horrible place to end up..
> 
> Pinning really is right answer for this kind of work flow. I think
> converting pinning to notifers should not be done unless notifier
> invalidation is relatively bounded. 
> 
> I know people like notifiers because they give a bit nicer performance
> in some happy cases, but this cripples all the bad cases..
> 
> If pinning doesn't work for some reason maybe we should address that?

Note that the dma fence issue is only true for user ptr buffers, which
predate any HMM work and thus were using mmu notifiers already. You need
the mmu notifier there because of fork and other corner cases.

For nouveau the notifier does not need to wait for anything; it can update
the GPU page table right away. Modulo needing to write to GPU memory
using a dma engine if the GPU page table is in GPU memory that is not
accessible from the CPU, but that is never the case for nouveau so far
(though I expect it will be at some point).


So I see this as two different cases: the user ptr case, which does pin
pages by the way, where things are synchronous, versus the HMM case
where everything is asynchronous.


I probably need to warn AMD folks again that using HMM means you
must be able to update the GPU page table asynchronously without a
fence wait. The issue for AMD is that they already update their GPU
page table using a DMA engine. I believe this is still doable if they
use a kernel only DMA engine context, where only the kernel can queue up
jobs, so that you do not need to wait for unrelated things and you can
prioritize GPU page table updates, which should translate into fast GPU
page table updates without DMA fences.


> > Full disclosure: We are aware that we've designed ourselves into an
> > impressive corner here, and there's lots of talks going on about
> > untangling the dma synchronization from the memory management
> > completely. But
> 
> I think the documenting is really important: only GPU should be using
> this stuff and driving notifiers this way. Complete NO for any
> totally-not-a-GPU things in drivers/accel for sure.

Yes, for users that expect HMM, it needs to be asynchronous. But it is
hard to revert user ptr as it was done a long time ago.

Cheers,
Jérôme



Re: Cache flush issue with page_mapping_file() and swap back shmem page ?

2020-05-28 Thread Jerome Glisse
On Wed, May 27, 2020 at 08:46:22PM -0700, Hugh Dickins wrote:
> Hi Jerome,
> 
> On Wed, 27 May 2020, Jerome Glisse wrote:
> > So any arch code which uses page_mapping_file() might get the wrong
> > answer, this function will return NULL for a swap backed page which
> > can be a shmem pages. But shmem pages can still be shared among
> > multiple process (and possibly at different virtual addresses if
> > mremap was use).
> > 
> > Attached is a patch that changes page_mapping_file() to return the
> > shmem mapping for swap backed shmem page. I have not tested it (no
> > way for me to test all those architecture) and i spotted this while
> > working on something else. So i hope someone can take a closer look.
> 
> I'm certainly no expert on flush_dcache_page() and friends, but I'd
> be very surprised if such a problem exists, yet has gone unnoticed
> for so long.  page_mapping_file() itself is fairly new, added when
> a risk of crashing on a race with swapoff came in: but the previous
> use of page_mapping() would have suffered equally if there were such
> a cache flushinhg problem here.
> 
> And I'm afraid your patch won't do anything to help if there is a
> problem: very soon after shmem calls add_to_swap_cache(), it calls
> shmem_delete_from_page_cache(), which sets page->mapping to NULL.
> 
> But I can assure you that a shmem page (unlike an anon page) is never
> put into swap cache while it is mapped into userspace, and never
> mapped into userspace while it is still in swap cache: does that help?
> 

You are right, I missed/forgot the part where shmem is never in the swap
cache and mapped at the same time; thus page_mapping_file() can return
NULL for those, as they can no longer have alias mappings.

Thank you Hugh
Jérôme



Cache flush issue with page_mapping_file() and swap back shmem page ?

2020-05-27 Thread Jerome Glisse
So any arch code which uses page_mapping_file() might get the wrong
answer: this function will return NULL for a swap backed page, which
can be a shmem page. But shmem pages can still be shared among
multiple processes (and possibly at different virtual addresses if
mremap was used).

Attached is a patch that changes page_mapping_file() to return the
shmem mapping for swap backed shmem pages. I have not tested it (no
way for me to test all those architectures) and I spotted this while
working on something else, so I hope someone can take a closer look.

Cheers,
Jérôme
From 6c76b9f8baa87ff872f6be5a44805a74c1e07fea Mon Sep 17 00:00:00 2001
From: Jérôme Glisse 
Date: Wed, 27 May 2020 20:18:59 -0400
Subject: [PATCH] mm: fix cache flush for shmem page that are swap backed.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This might be a shmem page, which is in a sense a file that
can be mapped multiple times in different processes at
possibly different virtual addresses (fork + mremap). So
return the shmem mapping, which will allow any arch code to
find all mappings of the page.

Note that even if the page is not anonymous it might still
have a NULL page->mapping field if it is being truncated,
but then this is fine, as each pte pointing to the page will
be removed and cache flushing should be handled properly by
that part of the code.

Signed-off-by: Jérôme Glisse 
Cc: "Huang, Ying" 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Russell King 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: "David S. Miller" 
Cc: "James E.J. Bottomley" 
---
 mm/util.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..ec8739ab0cc3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -685,8 +685,24 @@ EXPORT_SYMBOL(page_mapping);
  */
 struct address_space *page_mapping_file(struct page *page)
 {
-   if (unlikely(PageSwapCache(page)))
+   if (unlikely(PageSwapCache(page))) {
+   /*
+* This might be a shmem page that is in a sense a file that
+* can be mapped multiple times in different processes at
+* possibly different virtual addresses (fork + mremap). So
+* return the shmem mapping that will allow any arch code to
+* find all mappings of the page.
+*
+* Note that even if page is not anonymous then the page might
+* have a NULL page->mapping field if it is being truncated,
+* but then it is fine as each pte poiting to the page will be
+* remove and cache flushing should be handled properly by that
+* part of the code.
+*/
+   if (!PageAnon(page))
+   return page->mapping;
return NULL;
+   }
return page_mapping(page);
 }
 
-- 
2.26.2



Re: [PATCH v6 2/3] uacce: add uacce driver

2019-10-23 Thread Jerome Glisse
On Wed, Oct 16, 2019 at 07:28:02PM +0200, Jean-Philippe Brucker wrote:
[...]

> > +static struct uacce_qfile_region *
> > +uacce_create_region(struct uacce_queue *q, struct vm_area_struct *vma,
> > +   enum uacce_qfrt type, unsigned int flags)
> > +{
> > +   struct uacce_qfile_region *qfr;
> > +   struct uacce_device *uacce = q->uacce;
> > +   unsigned long vm_pgoff;
> > +   int ret = -ENOMEM;
> > +
> > +   qfr = kzalloc(sizeof(*qfr), GFP_ATOMIC);
> > +   if (!qfr)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   qfr->type = type;
> > +   qfr->flags = flags;
> > +   qfr->iova = vma->vm_start;
> > +   qfr->nr_pages = vma_pages(vma);
> > +
> > +   if (vma->vm_flags & VM_READ)
> > +   qfr->prot |= IOMMU_READ;
> > +
> > +   if (vma->vm_flags & VM_WRITE)
> > +   qfr->prot |= IOMMU_WRITE;
> > +
> > +   if (flags & UACCE_QFRF_SELFMT) {
> > +   if (!uacce->ops->mmap) {
> > +   ret = -EINVAL;
> > +   goto err_with_qfr;
> > +   }
> > +
> > +   ret = uacce->ops->mmap(q, vma, qfr);
> > +   if (ret)
> > +   goto err_with_qfr;
> > +   return qfr;
> > +   }
> 
> I wish the SVA and !SVA paths were less interleaved. Both models are
> fundamentally different:
> 
> * Without SVA you cannot share the device between multiple processes. All
>   DMA mappings are in the "main", non-PASID address space of the device.
> 
>   Note that process isolation without SVA could be achieved with the
>   auxiliary domains IOMMU API (introduced primarily for vfio-mdev) but
>   this is not the model chosen here.
> 
> * With SVA you can share the device between multiple processes. But if the
>   process can somehow program its portion of the device to access the main
>   address space, you loose isolation. Only the kernel must be able to
>   program and access the main address space.
> 
> When interleaving both code paths it's easy to make a mistake and loose
> this isolation. Although I think this code is correct, it took me some
> time to understand that we never end up calling dma_alloc or iommu_map
> when using SVA. Might be worth at least adding a check that if
> UACCE_DEV_SVA, then we never end up in the bottom part of this function.

I would go even further and just remove the DMA path as it is not used.
But yes, at a bare minimum it needs to be completely separate to avoid
confusion.
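
For example, the check Jean-Philippe asks for could be as small as the
sketch below (not from the posted series; it only reuses the
UACCE_DEV_SVA and UACCE_QFRF_SELFMT flags quoted above, and the helper
name is made up):

static bool uacce_region_needs_kernel_dma(struct uacce_device *uacce,
                                          unsigned int flags)
{
        /* SELFMT regions are mmap'ed by the device driver itself. */
        if (flags & UACCE_QFRF_SELFMT)
                return false;

        /*
         * With SVA the process address space is used directly via PASID;
         * allocating kernel DMA buffers here would mix in the main,
         * non-PASID address space and break isolation.
         */
        if (WARN_ON_ONCE(uacce->flags & UACCE_DEV_SVA))
                return false;

        return true;
}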


[...]


> > +static int uacce_fops_open(struct inode *inode, struct file *filep)
> > +{
> > +   struct uacce_queue *q;
> > +   struct iommu_sva *handle = NULL;
> > +   struct uacce_device *uacce;
> > +   int ret;
> > +   int pasid = 0;
> > +
> > +   uacce = idr_find(&uacce_idr, iminor(inode));
> > +   if (!uacce)
> > +   return -ENODEV;
> > +
> > +   if (!try_module_get(uacce->pdev->driver->owner))
> > +   return -ENODEV;
> > +
> > +   ret = uacce_dev_open_check(uacce);
> > +   if (ret)
> > +   goto out_with_module;
> > +
> > +   if (uacce->flags & UACCE_DEV_SVA) {
> > +   handle = iommu_sva_bind_device(uacce->pdev, current->mm, NULL);
> > +   if (IS_ERR(handle))
> > +   goto out_with_module;
> > +   pasid = iommu_sva_get_pasid(handle);
> 
> We need to register an mm_exit callback. Once we return, userspace will
> start running jobs on the accelerator. If the process is killed while the
> accelerator is running, the mm_exit callback tells the device driver to
> stop using this PASID (stop_queue()), so that it can be reallocated for
> another process.
> 
> Implementing this with the right locking and ordering can be tricky. I'll
> try to implement the callback and test it on the device this week.

It already exists: it is called an mmu notifier. You can register an mmu
notifier and get a callback once the mm exits.
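
For instance, a minimal sketch of that pattern (hypothetical names, not
uacce code; uacce_stop_queue() stands in for whatever stop_queue()-style
teardown the driver needs):

struct q_exit_notifier {
        struct mmu_notifier mn;
        struct uacce_queue *q;          /* hypothetical back-pointer */
};

static void q_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct q_exit_notifier *n = container_of(mn, struct q_exit_notifier, mn);

        /* ->release() runs when the mm goes away (process exit or kill). */
        uacce_stop_queue(n->q);         /* hypothetical teardown helper */
}

static const struct mmu_notifier_ops q_exit_ops = {
        .release = q_mm_release,
};

/* At open time: n->mn.ops = &q_exit_ops; mmu_notifier_register(&n->mn, current->mm); */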

Cheers,
Jérôme



Re: [PATCH 1/1] mm/gup_benchmark: fix MAP_HUGETLB case

2019-10-22 Thread Jerome Glisse
On Tue, Oct 22, 2019 at 11:41:57AM -0700, John Hubbard wrote:
> On 10/22/19 10:14 AM, Jerome Glisse wrote:
> > On Mon, Oct 21, 2019 at 02:24:35PM -0700, John Hubbard wrote:
> >> The MAP_HUGETLB ("-H" option) of gup_benchmark fails:
> >>
> >> $ sudo ./gup_benchmark -H
> >> mmap: Invalid argument
> >>
> >> This is because gup_benchmark.c is passing in a file descriptor to
> >> mmap(), but the fd came from opening up the /dev/zero file. This
> >> confuses the mmap syscall implementation, which thinks that, if the
> >> caller did not specify MAP_ANONYMOUS, then the file must be a huge
> >> page file. So it attempts to verify that the file really is a huge
> >> page file, as you can see here:
> >>
> >> ksys_mmap_pgoff()
> >> {
> >> if (!(flags & MAP_ANONYMOUS)) {
> >> retval = -EINVAL;
> >> if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
> >> goto out_fput; /* THIS IS WHERE WE END UP */
> >>
> >> else if (flags & MAP_HUGETLB) {
> >> ...proceed normally, /dev/zero is ok here...
> >>
> >> ...and of course is_file_hugepages() returns "false" for the /dev/zero
> >> file.
> >>
> >> The problem is that the user space program, gup_benchmark.c, really just
> >> wants anonymous memory here. The simplest way to get that is to pass
> >> MAP_ANONYMOUS whenever MAP_HUGETLB is specified, so that's what this
> >> patch does.
> > 
> > This looks wrong, MAP_HUGETLB should only be use to create vma
> > for hugetlbfs. If you want anonymous private vma do not set the
> > MAP_HUGETLB. If you want huge page inside your anonymous vma
> > there is nothing to do at the mmap time, this is the job of the
> > transparent huge page code (THP).
> > 
> 
> Not the point. Please look more closely at ksys_mmap_pgoff(). You'll 
> see that, since 2009 (and probably earlier; 2009 is just when Hugh Dickens 
> moved it over from util.c), this routine has had full support for using
> hugetlbfs automatically, via mmap.
> 
> It does that via hugetlb_file_setup():
> 
> unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
> unsigned long prot, unsigned long flags,
> unsigned long fd, unsigned long pgoff)
> {
> ...
>   if (!(flags & MAP_ANONYMOUS)) {
> ...
>   } else if (flags & MAP_HUGETLB) {
>   struct user_struct *user = NULL;
>   struct hstate *hs;
> 
>   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
>   if (!hs)
>   return -EINVAL;
> 
>   len = ALIGN(len, huge_page_size(hs));
>   /*
>* VM_NORESERVE is used because the reservations will be
>* taken when vm_ops->mmap() is called
>* A dummy user value is used because we are not locking
>* memory so no accounting is necessary
>*/
>   file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
>   VM_NORESERVE,
>   &user, HUGETLB_ANONHUGE_INODE,
>   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
>   if (IS_ERR(file))
>   return PTR_ERR(file);
>   }
> ...
> 
> 
> Also, there are 14 (!) other pre-existing examples of passing
> MAP_HUGETLB | MAP_ANONYMOUS to mmap, so I'm not exactly the first one
> to reach this understanding.
> 
> 
> > NAK as misleading
> 
> Ouch. But I think I'm actually leading correctly, rather than misleading.
> Can you prove me wrong? :)

So I was misled by the file descriptor; passing a file descriptor and
asking for anonymous memory always bugs me. But yeah, the _Linux_ kernel
is happy to ignore the file argument if you set the anonymous flag. I
guess the rule of passing -1 for fd when mapping anonymously is just
engraved in my brain.

Also I thought that the file was an argument of the test and thus that
for huge pages you needed to pass a hugetlbfs file.

Anyway, my mistake, you are right: you can pass a file and ask for
anonymous and hugetlb at the same time.
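
For reference, the form I had in mind is the -1 convention (a standalone
userspace illustration, not the gup_benchmark code); as discussed, the
kernel ignores the fd once MAP_ANONYMOUS is set:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;         /* one 2MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");         /* e.g. no huge pages reserved */
                return EXIT_FAILURE;
        }
        munmap(p, len);
        return EXIT_SUCCESS;
}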

Reviewed-by: Jérôme Glisse 



Re: [PATCH 1/1] mm/gup_benchmark: fix MAP_HUGETLB case

2019-10-22 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 02:24:35PM -0700, John Hubbard wrote:
> The MAP_HUGETLB ("-H" option) of gup_benchmark fails:
> 
> $ sudo ./gup_benchmark -H
> mmap: Invalid argument
> 
> This is because gup_benchmark.c is passing in a file descriptor to
> mmap(), but the fd came from opening up the /dev/zero file. This
> confuses the mmap syscall implementation, which thinks that, if the
> caller did not specify MAP_ANONYMOUS, then the file must be a huge
> page file. So it attempts to verify that the file really is a huge
> page file, as you can see here:
> 
> ksys_mmap_pgoff()
> {
> if (!(flags & MAP_ANONYMOUS)) {
> retval = -EINVAL;
> if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
> goto out_fput; /* THIS IS WHERE WE END UP */
> 
> else if (flags & MAP_HUGETLB) {
> ...proceed normally, /dev/zero is ok here...
> 
> ...and of course is_file_hugepages() returns "false" for the /dev/zero
> file.
> 
> The problem is that the user space program, gup_benchmark.c, really just
> wants anonymous memory here. The simplest way to get that is to pass
> MAP_ANONYMOUS whenever MAP_HUGETLB is specified, so that's what this
> patch does.

This looks wrong: MAP_HUGETLB should only be used to create a vma
for hugetlbfs. If you want an anonymous private vma, do not set
MAP_HUGETLB. If you want huge pages inside your anonymous vma
there is nothing to do at mmap time; that is the job of the
transparent huge page code (THP).

NAK as misleading

Cheers,
Jérôme



Re: hmm pud-entry callback locking?

2019-10-07 Thread Jerome Glisse
On Sat, Oct 05, 2019 at 02:29:40PM +0200, Thomas Hellström (VMware) wrote:
> Hi, Jerome,
> 
> I was asked by Kirill to try to unify the pagewalk pud_entry and pmd_entry
> callbacks. The only user of the pagewalk pud-entry is currently hmm.
> 
> But the pagewalk code call pud_entry only for huge puds with the page-table
> lock held, whereas the hmm callback appears to assume it gets called
> unconditionally without the page-table lock held?
> 
> Could you shed some light into this?

I think in my mind they were already unified :) I think the easiest thing
is to remove the hmm pud walker; this is not a big deal. It will break
huge pud support for now, and we can re-add it to hmm once you have
unified them.

Cheers,
Jérôme


Re: [RFC PATCH 2/2] mm/gup: introduce vaddr_pin_pages_remote()

2019-08-16 Thread Jerome Glisse
On Fri, Aug 16, 2019 at 06:13:55PM +0200, Jan Kara wrote:
> On Fri 16-08-19 11:52:20, Jerome Glisse wrote:
> > On Fri, Aug 16, 2019 at 05:44:04PM +0200, Jan Kara wrote:
> > > On Fri 16-08-19 10:47:21, Vlastimil Babka wrote:
> > > > On 8/15/19 3:35 PM, Jan Kara wrote:
> > > > >> 
> > > > >> So when the GUP user uses MMU notifiers to stop writing to pages 
> > > > >> whenever
> > > > >> they are writeprotected with page_mkclean(), they don't really need 
> > > > >> page
> > > > >> pin - their access is then fully equivalent to any other mmap 
> > > > >> userspace
> > > > >> access and filesystem knows how to deal with those. I forgot out 
> > > > >> this case
> > > > >> when I wrote the above sentence.
> > > > >> 
> > > > >> So to sum up there are three cases:
> > > > >> 1) DIO case - GUP references to pages serving as DIO buffers are 
> > > > >> needed for
> > > > >>relatively short time, no special synchronization with 
> > > > >> page_mkclean() or
> > > > >>munmap() => needs FOLL_PIN
> > > > >> 2) RDMA case - GUP references to pages serving as DMA buffers needed 
> > > > >> for a
> > > > >>long time, no special synchronization with page_mkclean() or 
> > > > >> munmap()
> > > > >>=> needs FOLL_PIN | FOLL_LONGTERM
> > > > >>This case has also a special case when the pages are actually 
> > > > >> DAX. Then
> > > > >>the caller additionally needs file lease and additional file_pin
> > > > >>structure is used for tracking this usage.
> > > > >> 3) ODP case - GUP references to pages serving as DMA buffers, MMU 
> > > > >> notifiers
> > > > >>used to synchronize with page_mkclean() and munmap() => normal 
> > > > >> page
> > > > >>references are fine.
> > > > 
> > > > IMHO the munlock lesson told us about another one, that's in the end 
> > > > equivalent
> > > > to 3)
> > > > 
> > > > 4) pinning for struct page manipulation only => normal page references
> > > > are fine
> > > 
> > > Right, it's good to have this for clarity.
> > > 
> > > > > I want to add that I'd like to convert users in cases 1) and 2) from 
> > > > > using
> > > > > GUP to using differently named function. Users in case 3) can stay as 
> > > > > they
> > > > > are for now although ultimately I'd like to denote such use cases in a
> > > > > special way as well...
> > > > 
> > > > So after 1/2/3 is renamed/specially denoted, only 4) keeps the current
> > > > interface?
> > > 
> > > Well, munlock() code doesn't even use GUP, just follow_page(). I'd wait to
> > > see what's left after handling cases 1), 2), and 3) to decide about the
> > > interface for the remainder.
> > > 
> > 
> > For 3 we do not need to take a reference at all :) So just forget about 3
> > it does not exist. For 3 the reference is the reference the CPU page table
> > has on the page and that's it. GUP is no longer involve in ODP or anything
> > like that.
> 
> Yes, I understand. But the fact is that GUP calls are currently still there
> e.g. in ODP code. If you can make the code work without taking a page
> reference at all, I'm only happy :)

It is already in rdma-next AFAIK, so in 5.4 it will be gone :) I have been
removing all GUP users that do not need a reference. The Intel i915 driver
is a leftover; I will work some more with them to get rid of it too.

Cheers,
Jérôme


Re: [RFC PATCH 2/2] mm/gup: introduce vaddr_pin_pages_remote()

2019-08-16 Thread Jerome Glisse
On Fri, Aug 16, 2019 at 05:44:04PM +0200, Jan Kara wrote:
> On Fri 16-08-19 10:47:21, Vlastimil Babka wrote:
> > On 8/15/19 3:35 PM, Jan Kara wrote:
> > >> 
> > >> So when the GUP user uses MMU notifiers to stop writing to pages whenever
> > >> they are writeprotected with page_mkclean(), they don't really need page
> > >> pin - their access is then fully equivalent to any other mmap userspace
> > >> access and filesystem knows how to deal with those. I forgot out this 
> > >> case
> > >> when I wrote the above sentence.
> > >> 
> > >> So to sum up there are three cases:
> > >> 1) DIO case - GUP references to pages serving as DIO buffers are needed 
> > >> for
> > >>relatively short time, no special synchronization with page_mkclean() 
> > >> or
> > >>munmap() => needs FOLL_PIN
> > >> 2) RDMA case - GUP references to pages serving as DMA buffers needed for 
> > >> a
> > >>long time, no special synchronization with page_mkclean() or munmap()
> > >>=> needs FOLL_PIN | FOLL_LONGTERM
> > >>This case has also a special case when the pages are actually DAX. 
> > >> Then
> > >>the caller additionally needs file lease and additional file_pin
> > >>structure is used for tracking this usage.
> > >> 3) ODP case - GUP references to pages serving as DMA buffers, MMU 
> > >> notifiers
> > >>used to synchronize with page_mkclean() and munmap() => normal page
> > >>references are fine.
> > 
> > IMHO the munlock lesson told us about another one, that's in the end 
> > equivalent
> > to 3)
> > 
> > 4) pinning for struct page manipulation only => normal page references
> > are fine
> 
> Right, it's good to have this for clarity.
> 
> > > I want to add that I'd like to convert users in cases 1) and 2) from using
> > > GUP to using differently named function. Users in case 3) can stay as they
> > > are for now although ultimately I'd like to denote such use cases in a
> > > special way as well...
> > 
> > So after 1/2/3 is renamed/specially denoted, only 4) keeps the current
> > interface?
> 
> Well, munlock() code doesn't even use GUP, just follow_page(). I'd wait to
> see what's left after handling cases 1), 2), and 3) to decide about the
> interface for the remainder.
> 

For 3 we do not need to take a reference at all :) So just forget about 3,
it does not exist. For 3 the reference is the one the CPU page table has
on the page, and that's it. GUP is no longer involved in ODP or anything
like that.

Cheers,
Jérôme


Re: [PATCHv2] mm/migrate: clean up useless code in migrate_vma_collect_pmd()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 12:23:44PM -0700, Ralph Campbell wrote:
> 
> On 8/15/19 10:19 AM, Jerome Glisse wrote:
> > On Wed, Aug 07, 2019 at 04:41:12PM +0800, Pingfan Liu wrote:
> > > Clean up useless 'pfn' variable.
> > 
> > NAK there is a bug see below:
> > 
> > > 
> > > Signed-off-by: Pingfan Liu 
> > > Cc: "Jérôme Glisse" 
> > > Cc: Andrew Morton 
> > > Cc: Mel Gorman 
> > > Cc: Jan Kara 
> > > Cc: "Kirill A. Shutemov" 
> > > Cc: Michal Hocko 
> > > Cc: Mike Kravetz 
> > > Cc: Andrea Arcangeli 
> > > Cc: Matthew Wilcox 
> > > To: linux...@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > ---
> > >   mm/migrate.c | 9 +++--
> > >   1 file changed, 3 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 8992741..d483a55 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -2225,17 +2225,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >   pte_t pte;
> > >   pte = *ptep;
> > > - pfn = pte_pfn(pte);
> > >   if (pte_none(pte)) {
> > >   mpfn = MIGRATE_PFN_MIGRATE;
> > >   migrate->cpages++;
> > > - pfn = 0;
> > >   goto next;
> > >   }
> > >   if (!pte_present(pte)) {
> > > - mpfn = pfn = 0;
> > > + mpfn = 0;
> > >   /*
> > >* Only care about unaddressable device page 
> > > special
> > > @@ -2252,10 +2250,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >   if (is_write_device_private_entry(entry))
> > >   mpfn |= MIGRATE_PFN_WRITE;
> > >   } else {
> > > + pfn = pte_pfn(pte);
> > >   if (is_zero_pfn(pfn)) {
> > >   mpfn = MIGRATE_PFN_MIGRATE;
> > >   migrate->cpages++;
> > > - pfn = 0;
> > >   goto next;
> > >   }
> > >   page = vm_normal_page(migrate->vma, addr, pte);
> > > @@ -2265,10 +2263,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >   /* FIXME support THP */
> > >   if (!page || !page->mapping || PageTransCompound(page)) 
> > > {
> > > - mpfn = pfn = 0;
> > > + mpfn = 0;
> > >   goto next;
> > >   }
> > > - pfn = page_to_pfn(page);
> > 
> > You can not remove that one ! Otherwise it will break the device
> > private case.
> > 
> 
> I don't understand. The only use of "pfn" I see is in the "else"
> clause above where it is set just before using it.

Ok, I managed to confuse myself with mpfn and probably with an old
version of the code. Sorry for reading too quickly. Can we move the
unsigned long pfn; declaration into the else { branch so that there is
no more confusion about its scope?
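
i.e. something like this (untested sketch, only the lines quoted above,
just to show the scoping I have in mind):

                } else {
                        unsigned long pfn = pte_pfn(pte);

                        if (is_zero_pfn(pfn)) {
                                mpfn = MIGRATE_PFN_MIGRATE;
                                migrate->cpages++;
                                goto next;
                        }
                        page = vm_normal_page(migrate->vma, addr, pte);
                        /* ... rest of the else branch unchanged, pfn stays local ... */
                }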

Cheers,
Jérôme


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > > 
> > > > I'm not really well versed in the details of our userptr, but both
> > > > amdgpu and i915 wait for the gpu to complete from
> > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > > one, so maybe he can explain what exactly it is we're doing ...
> > > 
> > > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > > it is trying to do. The calls to dma_fence under the
> > > invalidate_range_start do not give me a good feeling.
> > > 
> > > However, i915 shows all the signs of trying to follow the registration
> > > cache model, it even has a nice comment in
> > > i915_gem_userptr_get_pages() explaining that the races it has don't
> > > matter because it is a user space bug to change the VA mapping in the
> > > first place. That just screams registration cache to me.
> > > 
> > > So it is fine to run HW that way, but if you do, there is no reason to
> > > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > > clean it up & release the page pins when all DMA buffer refs go to
> > > zero. The next access to that VA should get a new DMA buffer with the
> > > right mapping.
> > > 
> > > In other words the invalidation should be very simple without
> > > complicated locking, or wait_event's. Look at hfi1 for example.
> > 
> > This would break the today usage model of uptr and it will
> > break userspace expectation ie if GPU is writting to that
> > memory and that memory then the userspace want to make sure
> > that it will see what the GPU write.
> 
> How exactly? This is holding the page pin, so the only way the VA
> mapping can be changed is via explicit user action.
> 
> ie:
> 
>gpu_write_something(va, size)
>mmap(.., va, size, MMAP_FIXED);
>gpu_wait_done()
> 
> This is racy and indeterminate with both models.
> 
> Based on the comment in i915 it appears to be going on the model that
> changes to the mmap by userspace when the GPU is working on it is a
> programming bug. This is reasonable, lots of systems use this kind of
> consistency model.

Well, a userspace process doing munmap(), mremap(), fork() and things like
that is a bug from the i915 kernel/userspace contract point of view.

But things like migration or reclaim are not covered under that contract,
and for those the expectation is that CPU access to the same virtual address
should return what was last written to it, either by the GPU or the CPU.

> 
> Since the driver seems to rely on a dma_fence to block DMA access, it
> looks to me like the kernel has full visibility to the
> 'gpu_write_something()' and 'gpu_wait_done()' markers.

So let's only consider the case where the GPU wants to write to the memory
(the read-only case is obviously fine and does not need any notifier, in
fact), and as above the only thing we care about is reclaim or migration
(for instance because of some NUMA compaction) as the rest is covered by
the i915 userspace contract.

So in the write case we do GUPfast(write=true), which will be "safe" from
any concurrent CPU page table update: if GUPfast gets a reference on the
page, then any racing CPU page table update will not be able to migrate
or reclaim the page, and thus the virtual address to page association
will be preserved.

This is only true because of GUPfast(); if GUPfast() fails, it falls back
to the slow GUP case, which makes the same thing safe by taking the page
table lock.

Because of the reference on the page the i915 driver can forego the mmu
notifier end callback. The thing here is that taking a page reference
is pointless if we have better synchronization and tracking of the mmu
notifier. Hence converting to hmm mirror allows us to avoid taking a ref
on the page while still keeping the same functionality as today.
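
Roughly, the "GUPfast with slow fallback" step boils down to something
like this (a hedged sketch, not the actual i915 code; the helper name is
made up):

static long userptr_pin(unsigned long start, int npages, struct page **pages)
{
        long pinned;

        /* Fast path: no mmap_sem, safe against racing CPU page table updates. */
        pinned = get_user_pages_fast(start, npages, FOLL_WRITE, pages);
        if (pinned == npages)
                return pinned;

        /* Fall back to slow GUP, which serializes on the page table lock. */
        if (pinned > 0)
                release_pages(pages, pinned);
        down_read(&current->mm->mmap_sem);
        pinned = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);
        up_read(&current->mm->mmap_sem);
        return pinned;
}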


> I think trying to use hmm_range_fault on HW that can't do HW page
> faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
> is what amd gpu is trying to do.
> 
> I haven't yet seen anything that looks like 'TLB shootdown' in i915??

GPU drivers have complex usage patterns; the TLB shootdown is implicit.
Once the GEM object associated with the userptr is invalidated, the next
time userspace submits a command against that GEM object it will have to
re-validate it, which means re-programming the GPU page tables.

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> >
> > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > Section alignment constraints somewhat save us here. The only example
> > > > I can think of a PMD not containing a uniform pgmap association for
> > > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > subsections as of v5.3). Otherwise the implementation could not
> > > > guarantee different mapping lifetimes.
> > > >
> > > > That said, this seems to want a better mechanism to determine "pfn is
> > > > ZONE_DEVICE".
> > >
> > > So I guess this patch is fine for now, and once you provide a better
> > > mechanism we can switch over to it?
> >
> > What about the version I sent to just get rid of all the strange
> > put_dev_pagemaps while scanning? Odds are good we will work with only
> > a single pagemap, so it makes some sense to cache it once we find it?
> 
> Yes, if the scan is over a single pmd then caching it makes sense.

Quite frankly an easier and better solution is to remove the pagemap
lookup. As HMM users abide by mmu notifiers, we will not use or
dereference the struct page, so we are safe from any racing hotunplug
of dax memory (as long as device drivers using hmm do not have a bug).

Cheers,
Jérôme


Re: [PATCH 3/3] mm/migrate: remove the duplicated code migrate_vma_collect_hole()

2019-08-15 Thread Jerome Glisse
On Tue, Aug 06, 2019 at 04:00:11PM +0800, Pingfan Liu wrote:
> After the previous patch which sees hole as invalid source,
> migrate_vma_collect_hole() has the same code as migrate_vma_collect_skip().
> Removing the duplicated code.

NAK this one too given previous NAK.

> 
> Signed-off-by: Pingfan Liu 
> Cc: "Jérôme Glisse" 
> Cc: Andrew Morton 
> Cc: Mel Gorman 
> Cc: Jan Kara 
> Cc: "Kirill A. Shutemov" 
> Cc: Michal Hocko 
> Cc: Mike Kravetz 
> Cc: Andrea Arcangeli 
> Cc: Matthew Wilcox 
> To: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/migrate.c | 22 +++---
>  1 file changed, 3 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 832483f..95e038d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2128,22 +2128,6 @@ struct migrate_vma {
>   unsigned long   end;
>  };
>  
> -static int migrate_vma_collect_hole(unsigned long start,
> - unsigned long end,
> - struct mm_walk *walk)
> -{
> - struct migrate_vma *migrate = walk->private;
> - unsigned long addr;
> -
> - for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
> - migrate->src[migrate->npages] = 0;
> - migrate->dst[migrate->npages] = 0;
> - migrate->npages++;
> - }
> -
> - return 0;
> -}
> -
>  static int migrate_vma_collect_skip(unsigned long start,
>   unsigned long end,
>   struct mm_walk *walk)
> @@ -2173,7 +2157,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  
>  again:
>   if (pmd_none(*pmdp))
> - return migrate_vma_collect_hole(start, end, walk);
> + return migrate_vma_collect_skip(start, end, walk);
>  
>   if (pmd_trans_huge(*pmdp)) {
>   struct page *page;
> @@ -2206,7 +2190,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   return migrate_vma_collect_skip(start, end,
>   walk);
>   if (pmd_none(*pmdp))
> - return migrate_vma_collect_hole(start, end,
> + return migrate_vma_collect_skip(start, end,
>   walk);
>   }
>   }
> @@ -2337,7 +2321,7 @@ static void migrate_vma_collect(struct migrate_vma 
> *migrate)
>  
>   mm_walk.pmd_entry = migrate_vma_collect_pmd;
>   mm_walk.pte_entry = NULL;
> - mm_walk.pte_hole = migrate_vma_collect_hole;
> + mm_walk.pte_hole = migrate_vma_collect_skip;
>   mm_walk.hugetlb_entry = NULL;
>   mm_walk.test_walk = NULL;
>   mm_walk.vma = migrate->vma;
> -- 
> 2.7.5
> 


Re: [PATCH 2/3] mm/migrate: see hole as invalid source page

2019-08-15 Thread Jerome Glisse
On Tue, Aug 06, 2019 at 04:00:10PM +0800, Pingfan Liu wrote:
> MIGRATE_PFN_MIGRATE marks a valid pfn, further more, suitable to migrate.
> As for hole, there is no valid pfn, not to mention migration.
> 
> Before this patch, hole has already relied on the following code to be
> filtered out. Hence it is more reasonable to see hole as invalid source
> page.
> migrate_vma_prepare()
> {
>   struct page *page = migrate_pfn_to_page(migrate->src[i]);
> 
>   if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
>\_ this condition
> }

NAK, you break the API. MIGRATE_PFN_MIGRATE is used for 2 things:
first it allows the collection code to mark entries that can be
migrated, then it is used by the driver to allow the driver to skip
migration for some entries (for whatever reason the driver might have).
We still need to keep the entry and not clear it so that we can clean
things up (ie remove the migration pte entry).

> 
> Signed-off-by: Pingfan Liu 
> Cc: "Jérôme Glisse" 
> Cc: Andrew Morton 
> Cc: Mel Gorman 
> Cc: Jan Kara 
> Cc: "Kirill A. Shutemov" 
> Cc: Michal Hocko 
> Cc: Mike Kravetz 
> Cc: Andrea Arcangeli 
> Cc: Matthew Wilcox 
> To: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/migrate.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c2ec614..832483f 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2136,10 +2136,9 @@ static int migrate_vma_collect_hole(unsigned long 
> start,
>   unsigned long addr;
>  
>   for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
> - migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> + migrate->src[migrate->npages] = 0;
>   migrate->dst[migrate->npages] = 0;
>   migrate->npages++;
> - migrate->cpages++;
>   }
>  
>   return 0;
> @@ -2228,8 +2227,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   pfn = pte_pfn(pte);
>  
>   if (pte_none(pte)) {
> - mpfn = MIGRATE_PFN_MIGRATE;
> - migrate->cpages++;
> + mpfn = 0;
>   goto next;
>   }
>  
> -- 
> 2.7.5
> 


Re: [PATCHv2] mm/migrate: clean up useless code in migrate_vma_collect_pmd()

2019-08-15 Thread Jerome Glisse
On Wed, Aug 07, 2019 at 04:41:12PM +0800, Pingfan Liu wrote:
> Clean up useless 'pfn' variable.

NAK, there is a bug, see below:

> 
> Signed-off-by: Pingfan Liu 
> Cc: "Jérôme Glisse" 
> Cc: Andrew Morton 
> Cc: Mel Gorman 
> Cc: Jan Kara 
> Cc: "Kirill A. Shutemov" 
> Cc: Michal Hocko 
> Cc: Mike Kravetz 
> Cc: Andrea Arcangeli 
> Cc: Matthew Wilcox 
> To: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/migrate.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8992741..d483a55 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2225,17 +2225,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   pte_t pte;
>  
>   pte = *ptep;
> - pfn = pte_pfn(pte);
>  
>   if (pte_none(pte)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
>   migrate->cpages++;
> - pfn = 0;
>   goto next;
>   }
>  
>   if (!pte_present(pte)) {
> - mpfn = pfn = 0;
> + mpfn = 0;
>  
>   /*
>* Only care about unaddressable device page special
> @@ -2252,10 +2250,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (is_write_device_private_entry(entry))
>   mpfn |= MIGRATE_PFN_WRITE;
>   } else {
> + pfn = pte_pfn(pte);
>   if (is_zero_pfn(pfn)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
>   migrate->cpages++;
> - pfn = 0;
>   goto next;
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
> @@ -2265,10 +2263,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  
>   /* FIXME support THP */
>   if (!page || !page->mapping || PageTransCompound(page)) {
> - mpfn = pfn = 0;
> + mpfn = 0;
>   goto next;
>   }
> - pfn = page_to_pfn(page);

You can not remove that one! Otherwise it will break the device
private case.


Re: [PATCH 0/2] A General Accelerator Framework, WarpDrive

2019-08-15 Thread Jerome Glisse
On Wed, Aug 14, 2019 at 05:34:23PM +0800, Zhangfei Gao wrote:
> *WarpDrive* is a general accelerator framework for the user application to
> access the hardware without going through the kernel in data path.
> 
> WarpDrive is the name for the whole framework. The component in kernel
> is called uacce, meaning "Unified/User-space-access-intended Accelerator
> Framework". It makes use of the capability of IOMMU to maintain a
> unified virtual address space between the hardware and the process.
> 
> WarpDrive is intended to be used with Jean Philippe Brucker's SVA
> patchset[1], which enables IO side page fault and PASID support. 
> We have keep verifying with Jean's sva/current [2]
> We also keep verifying with Eric's SMMUv3 Nested Stage patch [3]
> 
> This series and related zip & qm driver as well as dummy driver for qemu test:
> https://github.com/Linaro/linux-kernel-warpdrive/tree/5.3-rc1-warpdrive-v1
> zip driver already been upstreamed.
> zip supporting uacce will be the next step.
> 
> The library and user application:
> https://github.com/Linaro/warpdrive/tree/wdprd-v1-current

Do we want a new framework? I think that is the first question that
should be answered here. Accelerators come in many forms and so far
there has never been enough commonality to create a framework; even
GPUs with drm are an example of that, drm only offers a shared framework
for the modesetting part of the GPU (as thankfully monitor connectors
are not specific to GPU brands :))

FPGA is another example: the only common code exposed to userspace is
about bitstream management AFAIK.

I would argue that a framework should only be created once there are
enough devices with the same userspace API. Meanwhile you can provide
in-kernel helpers that allow drivers to expose the same API. If after a
while we have enough device drivers which all use that same in-kernel
helper API, then it will be a good time to introduce a new framework.
Meanwhile this will allow individual device drivers to tinker with
their API and maybe get to something useful to more devices in the
end.

Note that what I propose also allows userspace code sharing for all
drivers that use the same in-kernel helpers.

Cheers,
Jérôme


Re: [PATCH] mm/hmm: Fix bad subpage pointer in try_to_unmap_one

2019-07-16 Thread Jerome Glisse
On Mon, Jul 15, 2019 at 11:14:31PM -0700, John Hubbard wrote:
> On 7/15/19 5:38 PM, Ralph Campbell wrote:
> > On 7/15/19 4:34 PM, John Hubbard wrote:
> > > On 7/15/19 3:00 PM, Andrew Morton wrote:
> > > > On Tue, 9 Jul 2019 18:24:57 -0700 Ralph Campbell  
> > > > wrote:
> > > > 
> > > >   mm/rmap.c |    1 +
> > > >   1 file changed, 1 insertion(+)
> > > > 
> > > > --- a/mm/rmap.c~mm-hmm-fix-bad-subpage-pointer-in-try_to_unmap_one
> > > > +++ a/mm/rmap.c
> > > > @@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page
> > > >    * No need to invalidate here it will synchronize on
> > > >    * against the special swap migration pte.
> > > >    */
> > > > +    subpage = page;
> > > >   goto discard;
> > > >   }
> > > 
> > > Hi Ralph and everyone,
> > > 
> > > While the above prevents a crash, I'm concerned that it is still not
> > > an accurate fix. This fix leads to repeatedly removing the rmap, against 
> > > the
> > > same struct page, which is odd, and also doesn't directly address the
> > > root cause, which I understand to be: this routine can't handle migrating
> > > the zero page properly--over and back, anyway. (We should also mention 
> > > more
> > > about how this is triggered, in the commit description.)
> > > 
> > > I'll take a closer look at possible fixes (I have to step out for a bit) 
> > > soon,
> > > but any more experienced help is also appreciated here.
> > > 
> > > thanks,
> > 
> > I'm not surprised at the confusion. It took me quite awhile to
> > understand how migrate_vma() works with ZONE_DEVICE private memory.
> > The big point to be aware of is that when migrating a page to
> > device private memory, the source page's page->mapping pointer
> > is copied to the ZONE_DEVICE struct page and the page_mapcount()
> > is increased. So, the kernel sees the page as being "mapped"
> > but the page table entry as being is_swap_pte() so the CPU will fault
> > if it tries to access the mapped address.
> 
> Thanks for humoring me here...
> 
> The part about the source page's page->mapping pointer being *copied*
> to the ZONE_DEVICE struct page is particularly interesting, and belongs
> maybe even in a comment (if not already there). Definitely at least in
> the commit description, for now.
> 
> > So yes, the source anon page is unmapped, DMA'ed to the device,
> > and then mapped again. Then on a CPU fault, the zone device page
> > is unmapped, DMA'ed to system memory, and mapped again.
> > The rmap_walk() is used to clear the temporary migration pte so
> > that is another important detail of how migrate_vma() works.
> > At the moment, only single anon private pages can migrate to
> > device private memory so there are no subpages and setting it to "page"
> > should be correct for now. I'm looking at supporting migration of
> > transparent huge pages but that is a work in progress.
> 
> Well here, I worry, because subpage != tail page, right? subpage is a
> strange variable name, and here it is used to record the page that
> corresponds to *each* mapping that is found during the reverse page
> mapping walk.
> 
> And that makes me suspect that if there were more than one of these
> found (which is unlikely, given the light testing that we have available
> so far, I realize), then there could possibly be a problem with the fix,
> yes?

There is no THP when migrating to device memory, so there is no tail vs head page issue here.

Cheers,
Jérôme


Re: [PATCH 2/5] mm/hmm: Clean up some coding style and comments

2019-06-06 Thread Jerome Glisse
On Thu, Jun 06, 2019 at 12:41:29PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 10:27:43AM -0400, Jerome Glisse wrote:
> > On Thu, Jun 06, 2019 at 11:16:44AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 06, 2019 at 04:29:39PM -0700, rcampb...@nvidia.com wrote:
> > > > From: Ralph Campbell 
> > > > 
> > > > There are no functional changes, just some coding style clean ups and
> > > > minor comment changes.
> > > > 
> > > > Signed-off-by: Ralph Campbell 
> > > > Reviewed-by: Jérôme Glisse 
> > > > Cc: John Hubbard 
> > > > Cc: Ira Weiny 
> > > > Cc: Dan Williams 
> > > > Cc: Arnd Bergmann 
> > > > Cc: Balbir Singh 
> > > > Cc: Dan Carpenter 
> > > > Cc: Matthew Wilcox 
> > > > Cc: Souptick Joarder 
> > > > Cc: Andrew Morton 
> > > >  include/linux/hmm.h | 71 +++--
> > > >  mm/hmm.c| 51 
> > > >  2 files changed, 62 insertions(+), 60 deletions(-)
> > > 
> > > Applied to hmm.git, thanks
> > 
> > Can you hold off, i was already collecting patches and we will
> > be stepping on each other toe ... for instance i had
> 
> I'd really rather not, I have a lot of work to do for this cycle and
> this part needs to start to move forward now. I can't do everything
> last minute, sorry.
> 
> The patches I picked up all look very safe to move ahead.

I want to post all the patches you need to apply soon. It is really
painful because there are a lot of different branches I have to work
with; if you start pulling patches that differ from the branch below
then you are making things even more difficult for me.

If you hold off I will be posting all the patches in one big set so
that you can apply all of them in one go, and it will be a _lot_
easier for me that way.

> 
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3
> 
> I'm aware, and am referring to this tree. You can trivially rebase it
> on top of hmm.git..
> 
> BTW, what were you planning to do with this git branch anyhow?

This is just something I use for testing and to stack up all the patches.

> 
> As we'd already agreed I will send the hmm patches to Linus on a clean
> git branch so we can properly collaborate between the various involved
> trees.
> 
> As a tree-runner I very much prefer to take patches directly from the
> mailing list where everything is public. This is the standard kernel
> workflow.

Like I said above, I want to resend all the patches in one big set.

On the process side, it would be easier if we ask Dave/Daniel to merge
hmm within drm this cycle. Merging via Linus will break drm drivers
and it seems easier to me to fix all this within the drm tree.

But if you want to do everything via Linus, fine.

Cheers,
Jérôme


Re: [PATCH 2/5] mm/hmm: Clean up some coding style and comments

2019-06-06 Thread Jerome Glisse
On Thu, Jun 06, 2019 at 11:16:44AM -0300, Jason Gunthorpe wrote:
> On Mon, May 06, 2019 at 04:29:39PM -0700, rcampb...@nvidia.com wrote:
> > From: Ralph Campbell 
> > 
> > There are no functional changes, just some coding style clean ups and
> > minor comment changes.
> > 
> > Signed-off-by: Ralph Campbell 
> > Reviewed-by: Jérôme Glisse 
> > Cc: John Hubbard 
> > Cc: Ira Weiny 
> > Cc: Dan Williams 
> > Cc: Arnd Bergmann 
> > Cc: Balbir Singh 
> > Cc: Dan Carpenter 
> > Cc: Matthew Wilcox 
> > Cc: Souptick Joarder 
> > Cc: Andrew Morton 
> >  include/linux/hmm.h | 71 +++--
> >  mm/hmm.c| 51 
> >  2 files changed, 62 insertions(+), 60 deletions(-)
> 
> Applied to hmm.git, thanks

Can you hold off? I was already collecting patches and we will
be stepping on each other's toes ... for instance I had

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3

But I have been working on collecting more.

Cheers,
Jérôme


Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-23 Thread Jerome Glisse
On Thu, May 23, 2019 at 04:10:38PM -0300, Jason Gunthorpe wrote:
> 
> On Thu, May 23, 2019 at 02:24:58PM -0400, Jerome Glisse wrote:
> > I can not take mmap_sem in range_register, the READ_ONCE is fine and
> > they are no race as we do take a reference on the hmm struct thus
> 
> Of course there are use after free races with a READ_ONCE scheme, I
> shouldn't have to explain this.

Well, again I can not think of any race here: mm->hmm can not
change while the driver is calling hmm_range_register(), so if you
want I can remove the READ_ONCE(); this does not change anything.


> If you cannot take the read mmap sem (why not?), then please use my
> version and push the update to the driver through -mm..

Please see previous threads on why it was a failure.

Cheers,
Jérôme


Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-23 Thread Jerome Glisse
On Thu, May 23, 2019 at 02:55:46PM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 01:33:03PM -0400, Jerome Glisse wrote:
> > On Thu, May 23, 2019 at 01:34:29PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 11:52:08AM -0400, Jerome Glisse wrote:
> > > > On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > > > > > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > > > > > 
> > > > > > > > > > So attached is a rebase on top of 5.2-rc1, i have tested 
> > > > > > > > > > with pingpong
> > > > > > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > > > > > 
> > > > > > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > > > > > 
> > > > > > > > > The conflicts are a little more extensive than I'd prefer to 
> > > > > > > > > handle..
> > > > > > > > > Can I ask you to rebase it on top of this branch please:
> > > > > > > > > 
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > > > > > 
> > > > > > > > > Specifically it conflicts with this patch:
> > > > > > > > > 
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > > > > > 
> > > > > > > There is at least one more serious blocker here:
> > > > > > > 
> > > > > > > config ARCH_HAS_HMM_MIRROR
> > > > > > > bool
> > > > > > > default y
> > > > > > > depends on (X86_64 || PPC64)
> > > > > > > depends on MMU && 64BIT
> > > > > > > 
> > > > > > > I can't loose ARM64 support for ODP by merging this, that is too
> > > > > > > serious of a regression.
> > > > > > > 
> > > > > > > Can you fix it?
> > > > > > 
> > > > > > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > > > > > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> > > > > 
> > > > > Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used 
> > > > > anywhere?
> > > > 
> > > > Yes this is multi-step update, first add the new Kconfig release n,
> > > > update driver in release n+1, update core Kconfig in release n+2
> > > > 
> > > > So we are in release n (5.2), in 5.3 i will update nouveau and amdgpu
> > > > so that in 5.4 in ca remove the old ARCH_HAS_HMM
> > > 
> > > Why don't you just send the patch for both parts to mm or to DRM?
> > > 
> > > This is very normal - as long as the resulting conflicts would be
> > > small during there is no reason not to do this. Can you share the
> > > combined patch?
> > 
> > This was tested in the past an resulted in failure. So for now i am
> > taking the simplest and easiest path with the least burden for every
> > maintainer. It only complexify my life.
> 
> I don't know what you tried to do in the past, but it happens all the
> time, every merge cycle with success. Not everything can be done, but
> changing the signature of one function with one call site should
> really not be a problem.
> 
> > Note that mm is not a git tree and thus i can not play any git trick
> > to help in this endeavor.
> 
> I am aware..
> 
> > > > > If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> > > > > existing stuff, and patch from you to enable ARM64, then we can
> > > > > continue to merge into 5.3
> > > > > 
> > > > > So, let us try to get acks on those other threads..
> > > > 
> > > > I will be merging your patchset and Ralph and repost, they are only
> > > > minor change mostly that you can not update the driver API in just
> > > > one release.
> > 

Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-23 Thread Jerome Glisse
On Thu, May 23, 2019 at 01:34:29PM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 11:52:08AM -0400, Jerome Glisse wrote:
> > On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > > > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > > > 
> > > > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with 
> > > > > > > > pingpong
> > > > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > > > 
> > > > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > > > 
> > > > > > > The conflicts are a little more extensive than I'd prefer to 
> > > > > > > handle..
> > > > > > > Can I ask you to rebase it on top of this branch please:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > > > 
> > > > > > > Specifically it conflicts with this patch:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > > > 
> > > > > There is at least one more serious blocker here:
> > > > > 
> > > > > config ARCH_HAS_HMM_MIRROR
> > > > > bool
> > > > > default y
> > > > > depends on (X86_64 || PPC64)
> > > > > depends on MMU && 64BIT
> > > > > 
> > > > > I can't loose ARM64 support for ODP by merging this, that is too
> > > > > serious of a regression.
> > > > > 
> > > > > Can you fix it?
> > > > 
> > > > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > > > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> > > 
> > > Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?
> > 
> > Yes this is multi-step update, first add the new Kconfig release n,
> > update driver in release n+1, update core Kconfig in release n+2
> > 
> > So we are in release n (5.2), in 5.3 i will update nouveau and amdgpu
> > so that in 5.4 in ca remove the old ARCH_HAS_HMM
> 
> Why don't you just send the patch for both parts to mm or to DRM?
> 
> This is very normal - as long as the resulting conflicts would be
> small during there is no reason not to do this. Can you share the
> combined patch?

This was tested in the past and resulted in failure. So for now I am
taking the simplest and easiest path with the least burden for every
maintainer. It only complexifies my life.

Note that mm is not a git tree and thus I can not play any git tricks
to help in this endeavor.

> > > If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> > > existing stuff, and patch from you to enable ARM64, then we can
> > > continue to merge into 5.3
> > > 
> > > So, let us try to get acks on those other threads..
> > 
> > I will be merging your patchset and Ralph and repost, they are only
> > minor change mostly that you can not update the driver API in just
> > one release.
> 
> Of course you can, we do it all the time. It requires some
> co-ordination, but as long as the merge conflicts are not big it is
> fine.
> 
> Merge the driver API change and the call site updates to -mm and
> refain from merging horrendously conflicting patches through DRM.
> 
> In the case of the changes in my HMM RFC it is something like 2
> lines in DRM that need touching, no problem at all.
> 
> If you want help I can volunteer make a hmm PR for Linus just for this
> during the merge window - but Andrew would need to agree and ack the
> patches.

This was tested in the past and I do not want to go over this issue
again (or re-iterate the long email discussion associated with it).
It failed and it put the burden on every maintainer. So it is easier
to do the multi-step thing.

You can take a peek at Ralph's patchset and yours combined into one
with minor changes here:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3

I am about to start testing it with nouveau, amdgpu and RDMA.

Cheers,
Jérôme


Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-23 Thread Jerome Glisse
On Thu, May 23, 2019 at 12:41:49PM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 11:04:32AM -0400, Jerome Glisse wrote:
> > On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > 
> > > > > > So attached is a rebase on top of 5.2-rc1, i have tested with 
> > > > > > pingpong
> > > > > > (prefetch and not and different sizes). Seems to work ok.
> > > > > 
> > > > > Urk, it already doesn't apply to the rdma tree :(
> > > > > 
> > > > > The conflicts are a little more extensive than I'd prefer to handle..
> > > > > Can I ask you to rebase it on top of this branch please:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > > > 
> > > > > Specifically it conflicts with this patch:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> > > 
> > > There is at least one more serious blocker here:
> > > 
> > > config ARCH_HAS_HMM_MIRROR
> > > bool
> > > default y
> > > depends on (X86_64 || PPC64)
> > > depends on MMU && 64BIT
> > > 
> > > I can't loose ARM64 support for ODP by merging this, that is too
> > > serious of a regression.
> > > 
> > > Can you fix it?
> > 
> > 5.2 already has patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
> > ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM) I need to update nouveau
> 
> Newer than 5.2-rc1? Is this why ARCH_HAS_HMM_MIRROR is not used anywhere?

Yes, this is a multi-step update: first add the new Kconfig in release n,
update the drivers in release n+1, update the core Kconfig in release n+2.

So we are in release n (5.2); in 5.3 I will update nouveau and amdgpu
so that in 5.4 I can remove the old ARCH_HAS_HMM.

> > in 5.3 so that i can drop the old ARCH_HAS_HMM and then convert
> > core mm in 5.4 to use ARCH_HAS_HMM_MIRROR and ARCH_HAS_HMM_DEVICE
> > instead of ARCH_HAS_HMM
> 
> My problem is that ODP needs HMM_MIRROR which needs HMM & ARCH_HAS_HMM
> - and then even if fixed we still have the ARCH_HAS_HMM_MIRROR
> restricted to ARM64..
> 
> Can we broaden HMM_MIRROR to all arches? I would very much prefer
> that.

Ignore ARCH_HAS_HMM, it will be removed in 5.4. All that will matter
for ODP is ARCH_HAS_HMM_MIRROR, which should be enabled for ARM64 as
ARM64 has everything needed for that. I just did not add ARM64 to
ARCH_HAS_HMM_MIRROR because I did not have hardware to test it on.

So in 5.3 I will update nouveau and amdgpu to use ARCH_HAS_HMM_DEVICE
and ARCH_HAS_HMM_MIRROR. In 5.4 I will update mm/Kconfig to remove
ARCH_HAS_HMM.

> 
> > So it seems it will have to wait 5.4 for ODP. I will re-spin the
> > patch for ODP once i am done reviewing Ralph changes and yours
> > for 5.3.
> 
> I think we are still OK for 5.3.

I can not update mm/Kconfig in 5.3, so any Kconfig update will be
in 5.4.

> 
> If mm takes the fixup patches so hmm mirror is as reliable as ODP's
> existing stuff, and patch from you to enable ARM64, then we can
> continue to merge into 5.3
> 
> So, let us try to get acks on those other threads..

I will be merging your patchset and Ralph's and repost; they are only
minor changes. The main constraint is that you can not update the driver
API in just one release: first add the new API in release n, then replace
the old API usage in release n+1, then remove the old API in n+2.

Cheers,
Jérôme


Re: [PATCH 1/1] infiniband/mm: convert put_page() to put_user_page*()

2019-05-23 Thread Jerome Glisse
On Thu, May 23, 2019 at 12:25:37AM -0700, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> For infiniband code that retains pages via get_user_pages*(),
> release those pages via the new put_user_page(), or
> put_user_pages*(), instead of put_page()
> 
> This is a tiny part of the second step of fixing the problem described
> in [1]. The steps are:
> 
> 1) Provide put_user_page*() routines, intended to be used
>for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>invoke put_user_page*(), instead of put_page(). This involves dozens of
>call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>implement tracking of these pages. This tracking will be separate from
>the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>special handling (especially in writeback paths) when the pages are
>backed by a filesystem. Again, [1] provides details as to why that is
>desirable.
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Doug Ledford 
> Cc: Jason Gunthorpe 
> Cc: Mike Marciniszyn 
> Cc: Dennis Dalessandro 
> Cc: Christian Benvenuti 
> 
> Reviewed-by: Jan Kara 
> Reviewed-by: Dennis Dalessandro 
> Acked-by: Jason Gunthorpe 
> Tested-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

Btw, I have a wishlist, see below.


> ---
>  drivers/infiniband/core/umem.c  |  7 ---
>  drivers/infiniband/core/umem_odp.c  | 10 +-
>  drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
>  drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
>  drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
>  drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
>  7 files changed, 27 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index e7ea819fcb11..673f0d240b3e 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -54,9 +54,10 @@ static void __ib_umem_release(struct ib_device *dev, 
> struct ib_umem *umem, int d
>  
>   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>   page = sg_page_iter_page(&sg_iter);
> - if (!PageDirty(page) && umem->writable && dirty)
> - set_page_dirty_lock(page);
> - put_page(page);
> + if (umem->writable && dirty)
> + put_user_pages_dirty_lock(&page, 1);
> + else
> + put_user_page(page);

Can we get a put_user_page_dirty(struct page **pages, bool dirty, npages)?

It is a common pattern that we might have to conditionally dirty the pages,
and I feel it would look cleaner if we could move the branch inside the
put_user_page*() functions.
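
Something along these lines is what I have in mind (just a sketch on top
of the put_user_pages*() helpers this series adds; the name and exact
signature are of course up for discussion):

/*
 * Sketch only: release a batch of GUP-pinned pages, optionally marking
 * them dirty, so callers do not have to open-code the branch above.
 */
static inline void put_user_pages_maybe_dirty(struct page **pages,
                                              unsigned long npages,
                                              bool dirty)
{
        if (dirty)
                put_user_pages_dirty_lock(pages, npages);
        else
                put_user_pages(pages, npages);
}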

Cheers,
Jérôme


Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-23 Thread Jerome Glisse
On Wed, May 22, 2019 at 08:57:37PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
> > > > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > > > (prefetch and not and different sizes). Seems to work ok.
> > > 
> > > Urk, it already doesn't apply to the rdma tree :(
> > > 
> > > The conflicts are a little more extensive than I'd prefer to handle..
> > > Can I ask you to rebase it on top of this branch please:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > > 
> > > Specifically it conflicts with this patch:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> 
> There is at least one more serious blocker here:
> 
> config ARCH_HAS_HMM_MIRROR
> bool
> default y
> depends on (X86_64 || PPC64)
> depends on MMU && 64BIT
> 
> I can't loose ARM64 support for ODP by merging this, that is too
> serious of a regression.
> 
> Can you fix it?

5.2 already has the patch to fix the Kconfig (ARCH_HAS_HMM_MIRROR and
ARCH_HAS_HMM_DEVICE replacing ARCH_HAS_HMM). I need to update nouveau
in 5.3 so that I can drop the old ARCH_HAS_HMM, and then convert
core mm in 5.4 to use ARCH_HAS_HMM_MIRROR and ARCH_HAS_HMM_DEVICE
instead of ARCH_HAS_HMM.

Adding ARM64 to ARCH_HAS_HMM_MIRROR should not be an issue, but I would
need access to an ARM64 machine to test, as I did not want to enable it
without testing.

So it seems ODP will have to wait for 5.4. I will re-spin the ODP patch
once I am done reviewing Ralph's changes and yours for 5.3.

Cheers,
Jérôme


Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-22 Thread Jerome Glisse
On Wed, May 22, 2019 at 07:39:06PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 06:04:20PM -0400, Jerome Glisse wrote:
> > On Wed, May 22, 2019 at 05:12:47PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > > 
> > > >  static void put_per_mm(struct ib_umem_odp *umem_odp)
> > > >  {
> > > > struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp 
> > > > *umem_odp)
> > > > up_write(&per_mm->umem_rwsem);
> > > >  
> > > > WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > > > -   mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > > > +   hmm_mirror_unregister(&per_mm->mirror);
> > > > put_pid(per_mm->tgid);
> > > > -   mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > > > +
> > > > +   kfree(per_mm);
> > > 
> > > Notice that mmu_notifier only uses SRCU to fence in-progress ops
> > > callbacks, so I think hmm internally has the bug that this ODP
> > > approach prevents.
> > > 
> > > hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> > > struct, use container_of in the mmu_notifier callbacks, and use the
> > > otherwise vestigal kref_get_unless_zero() to bail:
> > > 
> > > From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
> > > From: Jason Gunthorpe 
> > > Date: Wed, 22 May 2019 16:52:52 -0300
> > > Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu 
> > > notifiers
> > > 
> > > mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> > > system will continue to reference hmm->mn until the srcu grace period
> > > expires.
> > > 
> > >  CPU0 CPU1
> > >
> > > __mmu_notifier_invalidate_range_start()
> > >  srcu_read_lock
> > >  hlist_for_each ()
> > >// mn == hmm->mn
> > > hmm_mirror_unregister()
> > >   hmm_put()
> > > hmm_free()
> > >   mmu_notifier_unregister_no_release()
> > >  hlist_del_init_rcu(hmm-mn->list)
> > >  
> > > mn->ops->invalidate_range_start(mn, range);
> > >mm_get_hmm()
> > >   mm->hmm = NULL;
> > >   kfree(hmm)
> > >  
> > > mutex_lock(&hmm->lock);
> > > 
> > > Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> > > existing. Get the now-safe hmm struct through container_of and directly
> > > check kref_get_unless_zero to lock it against free.
> > 
> > It is already badly handled with BUG_ON()
> 
> You can't crash the kernel because userspace forced a race, and no it
> isn't handled today because there is no RCU locking in mm_get_hmm nor
> is there a kfree_rcu for the struct hmm to make the
> kref_get_unless_zero work without use-after-free.
> 
> > i just need to convert those to return and to use
> > mmu_notifier_call_srcu() to free hmm struct.
> 
> Isn't that what this patch does?

Yes, but the other chunks just need to replace the BUG_ON() with a return.

> 
> > The way race is avoided is because mm->hmm will either be NULL or
> > point to another hmm struct before an existing hmm is free. 
> 
> There is no locking on mm->hmm so it is useless to prevent races.

There is locking on mm->hmm: it is only set or cleared under
mm->page_table_lock (see the sketch below).
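
Roughly the kind of lookup I mean (sketch only, relying on mm->hmm being
set and cleared under mm->page_table_lock, as the race diagram further
down also shows; not the exact code in the tree):

	static struct hmm *mm_get_hmm(struct mm_struct *mm)
	{
		struct hmm *hmm;

		spin_lock(&mm->page_table_lock);
		hmm = mm->hmm;
		/* Either NULL, a new hmm, or a dying one we refuse to use. */
		if (hmm && !kref_get_unless_zero(&hmm->kref))
			hmm = NULL;
		spin_unlock(&mm->page_table_lock);

		return hmm;
	}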

> 
> > Also if range_start/range_end use kref_get_unless_zero() but right
> > now this is BUG_ON if it turn out to be NULL, it should just return
> > on NULL.
> 
> Still needs rcu.
> 
> Also the container_of is necessary to avoid some race where you could
> be doing:
> 
>   CPU0 CPU1   
>   CPU2
>hlist_for_each ()
>mmu_notifier_unregister_no_release(hmm1) 
>spin_lock(&mm->page_table_lock);
>mm->hmm = NULL
>spin_unlock(&mm->page_table_lock)

Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-22 Thread Jerome Glisse
On Wed, May 22, 2019 at 02:12:31PM -0700, Ralph Campbell wrote:
> 
> On 5/22/19 1:12 PM, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> > 
> > >   static void put_per_mm(struct ib_umem_odp *umem_odp)
> > >   {
> > >   struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
> > >   up_write(&per_mm->umem_rwsem);
> > >   WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > > - mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > > + hmm_mirror_unregister(&per_mm->mirror);
> > >   put_pid(per_mm->tgid);
> > > - mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > > +
> > > + kfree(per_mm);
> > 
> > Notice that mmu_notifier only uses SRCU to fence in-progress ops
> > callbacks, so I think hmm internally has the bug that this ODP
> > approach prevents.
> > 
> > hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> > struct, use container_of in the mmu_notifier callbacks, and use the
> > otherwise vestigal kref_get_unless_zero() to bail:
> 
> You might also want to look at my patch where
> I try to fix some of these same issues (5/5).
> 
> https://marc.info/?l=linux-mm&m=155718572908765&w=2

I need to review the patchset, but I do not want to invert the referencing,
i.e. have the mm hold a reference on hmm. I will review it tomorrow; I
wanted to do that today but did not have time.



Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-22 Thread Jerome Glisse
On Wed, May 22, 2019 at 05:12:47PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
> >  static void put_per_mm(struct ib_umem_odp *umem_odp)
> >  {
> > struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > @@ -325,9 +283,10 @@ static void put_per_mm(struct ib_umem_odp *umem_odp)
> > up_write(&per_mm->umem_rwsem);
> >  
> > WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> > -   mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> > +   hmm_mirror_unregister(&per_mm->mirror);
> > put_pid(per_mm->tgid);
> > -   mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> > +
> > +   kfree(per_mm);
> 
> Notice that mmu_notifier only uses SRCU to fence in-progress ops
> callbacks, so I think hmm internally has the bug that this ODP
> approach prevents.
> 
> hmm should follow the same pattern ODP has and 'kfree_srcu' the hmm
> struct, use container_of in the mmu_notifier callbacks, and use the
> otherwise vestigal kref_get_unless_zero() to bail:
> 
> From 0cb536dc0150ba964a1d655151d7b7a84d0f915a Mon Sep 17 00:00:00 2001
> From: Jason Gunthorpe 
> Date: Wed, 22 May 2019 16:52:52 -0300
> Subject: [PATCH] hmm: Fix use after free with struct hmm in the mmu notifiers
> 
> mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> system will continue to reference hmm->mn until the srcu grace period
> expires.
> 
>  CPU0 CPU1
>
> __mmu_notifier_invalidate_range_start()
>  srcu_read_lock
>  hlist_for_each ()
>// mn == hmm->mn
> hmm_mirror_unregister()
>   hmm_put()
> hmm_free()
>   mmu_notifier_unregister_no_release()
>  hlist_del_init_rcu(hmm-mn->list)
>  
> mn->ops->invalidate_range_start(mn, range);
>mm_get_hmm()
>   mm->hmm = NULL;
>   kfree(hmm)
>  mutex_lock(&hmm->lock);
> 
> Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> existing. Get the now-safe hmm struct through container_of and directly
> check kref_get_unless_zero to lock it against free.

It is already (badly) handled with BUG_ON(); I just need to convert
those to returns and use mmu_notifier_call_srcu() to free the hmm
struct.

The race is avoided because mm->hmm will either be NULL or point to
another hmm struct before an existing hmm is freed. Also,
range_start/range_end should use kref_get_unless_zero(), but right now
it is a BUG_ON() if that turns out to be NULL; it should just return
on NULL.
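
For the notifier side, what I have in mind is roughly the sketch below
(untested, assuming the current mmu_notifier_range callback signature;
the real invalidation work stays as it is today):

	static int hmm_invalidate_range_start(struct mmu_notifier *mn,
				const struct mmu_notifier_range *nrange)
	{
		struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);

		/*
		 * If the hmm struct is already on its way out (kref went to
		 * zero) there is nothing left to invalidate, bail out instead
		 * of BUG_ON().
		 */
		if (!kref_get_unless_zero(&hmm->kref))
			return 0;

		/* ... existing invalidation work goes here ... */

		hmm_put(hmm);
		return 0;
	}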

> 
> Signed-off-by: Jason Gunthorpe 
> ---
>  include/linux/hmm.h |  1 +
>  mm/hmm.c| 25 +++--
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 51ec27a8466816..8b91c90d3b88cb 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -102,6 +102,7 @@ struct hmm {
>   struct mmu_notifier mmu_notifier;
>   struct rw_semaphore mirrors_sem;
>   wait_queue_head_t   wq;
> + struct rcu_head rcu;
>   longnotifiers;
>   booldead;
>  };
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 816c2356f2449f..824e7e160d8167 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct 
> *mm)
>   return NULL;
>  }
>  
> +static void hmm_fee_rcu(struct rcu_head *rcu)
> +{
> + kfree(container_of(rcu, struct hmm, rcu));
> +}
> +
>  static void hmm_free(struct kref *kref)
>  {
>   struct hmm *hmm = container_of(kref, struct hmm, kref);
> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>   mm->hmm = NULL;
>   spin_unlock(&mm->page_table_lock);
>  
> - kfree(hmm);
> + mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
>  }
>  
>  static inline void hmm_put(struct hmm *hmm)
> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> - struct hmm *hmm = mm_get_hmm(mm);
> + struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   struct hmm_mirror *mirror;
>   struct hmm_range *range;
>  
> + /* hmm is in pr

Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-22 Thread Jerome Glisse
On Wed, May 22, 2019 at 04:22:19PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 01:48:52PM -0400, Jerome Glisse wrote:
> 
> > > > +long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> > > > +  struct hmm_range *range)
> > > >  {
> > > > +   struct device *device = 
> > > > umem_odp->umem.context->device->dma_device;
> > > > +   struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > > > struct ib_umem *umem = &umem_odp->umem;
> > > > -   struct task_struct *owning_process  = NULL;
> > > > -   struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
> > > > -   struct page   **local_page_list = NULL;
> > > > -   u64 page_mask, off;
> > > > -   int j, k, ret = 0, start_idx, npages = 0, page_shift;
> > > > -   unsigned int flags = 0;
> > > > -   phys_addr_t p = 0;
> > > > -
> > > > -   if (access_mask == 0)
> > > > +   struct mm_struct *mm = per_mm->mm;
> > > > +   unsigned long idx, npages;
> > > > +   long ret;
> > > > +
> > > > +   if (mm == NULL)
> > > > +   return -ENOENT;
> > > > +
> > > > +   /* Only drivers with invalidate support can use this function. 
> > > > */
> > > > +   if (!umem->context->invalidate_range)
> > > > return -EINVAL;
> > > >  
> > > > -   if (user_virt < ib_umem_start(umem) ||
> > > > -   user_virt + bcnt > ib_umem_end(umem))
> > > > -   return -EFAULT;
> > > > +   /* Sanity checks. */
> > > > +   if (range->default_flags == 0)
> > > > +   return -EINVAL;
> > > >  
> > > > -   local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > > > -   if (!local_page_list)
> > > > -   return -ENOMEM;
> > > > +   if (range->start < ib_umem_start(umem) ||
> > > > +   range->end > ib_umem_end(umem))
> > > > +   return -EINVAL;
> > > >  
> > > > -   page_shift = umem->page_shift;
> > > > -   page_mask = ~(BIT(page_shift) - 1);
> > > > -   off = user_virt & (~page_mask);
> > > > -   user_virt = user_virt & page_mask;
> > > > -   bcnt += off; /* Charge for the first page offset as well. */
> > > > +   idx = (range->start - ib_umem_start(umem)) >> umem->page_shift;
> > > 
> > > Is this math OK? What is supposed to happen if the range->start is not
> > > page aligned to the internal page size?
> > 
> > range->start is align on 1 << page_shift boundary within pagefault_mr
> > thus the above math is ok. We can add a BUG_ON() and comments if you
> > want.
> 
> OK
> 
> > > > +   range->pfns = &umem_odp->pfns[idx];
> > > > +   range->pfn_shift = ODP_FLAGS_BITS;
> > > > +   range->values = odp_hmm_values;
> > > > +   range->flags = odp_hmm_flags;
> > > >  
> > > > /*
> > > > -* owning_process is allowed to be NULL, this means somehow the 
> > > > mm is
> > > > -* existing beyond the lifetime of the originating process.. 
> > > > Presumably
> > > > -* mmget_not_zero will fail in this case.
> > > > +* If mm is dying just bail out early without trying to take 
> > > > mmap_sem.
> > > > +* Note that this might race with mm destruction but that is 
> > > > fine the
> > > > +* is properly refcounted so are all HMM structure.
> > > >  */
> > > > -   owning_process = get_pid_task(umem_odp->per_mm->tgid, 
> > > > PIDTYPE_PID);
> > > > -   if (!owning_process || !mmget_not_zero(owning_mm)) {
> > > 
> > > But we are not in a HMM context here, and per_mm is not a HMM
> > > structure. 
> > > 
> > > So why is mm suddenly guarenteed valid? It was a bug report that
> > > triggered the race the mmget_not_zero is fixing, so I need a better
> > > explanation why it is now safe. From what I see the hmm_range_fault
> > > is doing stuff like find_vma without

Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-22 Thread Jerome Glisse
On Tue, May 21, 2019 at 09:52:25PM -0300, Jason Gunthorpe wrote:
> On Tue, May 21, 2019 at 04:53:22PM -0400, Jerome Glisse wrote:
> > On Mon, May 06, 2019 at 04:56:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 11, 2019 at 02:13:13PM -0400, jgli...@redhat.com wrote:
> > > > From: Jérôme Glisse 
> > > > 
> > > > Just fixed Kconfig and build when ODP was not enabled, other than that
> > > > this is the same as v3. Here is previous cover letter:
> > > > 
> > > > Git tree with all prerequisite:
> > > > https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
> > > > 
> > > > This patchset convert RDMA ODP to use HMM underneath this is motivated
> > > > by stronger code sharing for same feature (share virtual memory SVM or
> > > > Share Virtual Address SVA) and also stronger integration with mm code to
> > > > achieve that. It depends on HMM patchset posted for inclusion in 5.2 [2]
> > > > and [3].
> > > > 
> > > > It has been tested with pingpong test with -o and others flags to test
> > > > different size/features associated with ODP.
> > > > 
> > > > Moreover they are some features of HMM in the works like peer to peer
> > > > support, fast CPU page table snapshot, fast IOMMU mapping update ...
> > > > It will be easier for RDMA devices with ODP to leverage those if they
> > > > use HMM underneath.
> > > > 
> > > > Quick summary of what HMM is:
> > > > HMM is a toolbox for device driver to implement software support for
> > > > Share Virtual Memory (SVM). Not only it provides helpers to mirror a
> > > > process address space on a device (hmm_mirror). It also provides
> > > > helper to allow to use device memory to back regular valid virtual
> > > > address of a process (any valid mmap that is not an mmap of a device
> > > > or a DAX mapping). They are two kinds of device memory. Private 
> > > > memory
> > > > that is not accessible to CPU because it does not have all the 
> > > > expected
> > > > properties (this is for all PCIE devices) or public memory which can
> > > > also be access by CPU without restriction (with OpenCAPI or CCIX or
> > > > similar cache-coherent and atomic inter-connect).
> > > > 
> > > > Device driver can use each of HMM tools separatly. You do not have 
> > > > to
> > > > use all the tools it provides.
> > > > 
> > > > For RDMA device i do not expect a need to use the device memory support
> > > > of HMM. This device memory support is geared toward accelerator like 
> > > > GPU.
> > > > 
> > > > 
> > > > You can find a branch [1] with all the prerequisite in. This patch is on
> > > > top of rdma-next with the HMM patchset [2] and mmu notifier patchset [3]
> > > > applied on top of it.
> > > > 
> > > > [1] https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
> > > > [2] https://lkml.org/lkml/2019/4/3/1032
> > > > [3] https://lkml.org/lkml/2019/3/26/900
> > > 
> > > Jerome, please let me know if these dependent series are merged during
> > > the first week of the merge window.
> > > 
> > > This patch has been tested and could go along next week if the
> > > dependencies are met.
> > > 
> > 
> > So attached is a rebase on top of 5.2-rc1, i have tested with pingpong
> > (prefetch and not and different sizes). Seems to work ok.
> 
> Urk, it already doesn't apply to the rdma tree :(
> 
> The conflicts are a little more extensive than I'd prefer to handle..
> Can I ask you to rebase it on top of this branch please:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> 
> Specifically it conflicts with this patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-next&id=d2183c6f1958e6b6dfdde279f4cee04280710e34
> 
> > +long ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp,
> > +  struct hmm_range *range)
> >  {
> > +   struct device *device = umem_odp->umem.context->device->dma_device;
> > +   struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
> > struct ib_umem *umem = &umem_odp->umem;
> > -   struct task_struct *owning_process  = NULL;
> > -   st

Re: [PATCH v4 0/1] Use HMM for ODP v4

2019-05-21 Thread Jerome Glisse
On Mon, May 06, 2019 at 04:56:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2019 at 02:13:13PM -0400, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > Just fixed Kconfig and build when ODP was not enabled, other than that
> > this is the same as v3. Here is previous cover letter:
> > 
> > Git tree with all prerequisite:
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
> > 
> > This patchset convert RDMA ODP to use HMM underneath this is motivated
> > by stronger code sharing for same feature (share virtual memory SVM or
> > Share Virtual Address SVA) and also stronger integration with mm code to
> > achieve that. It depends on HMM patchset posted for inclusion in 5.2 [2]
> > and [3].
> > 
> > It has been tested with pingpong test with -o and others flags to test
> > different size/features associated with ODP.
> > 
> > Moreover they are some features of HMM in the works like peer to peer
> > support, fast CPU page table snapshot, fast IOMMU mapping update ...
> > It will be easier for RDMA devices with ODP to leverage those if they
> > use HMM underneath.
> > 
> > Quick summary of what HMM is:
> > HMM is a toolbox for device driver to implement software support for
> > Share Virtual Memory (SVM). Not only it provides helpers to mirror a
> > process address space on a device (hmm_mirror). It also provides
> > helper to allow to use device memory to back regular valid virtual
> > address of a process (any valid mmap that is not an mmap of a device
> > or a DAX mapping). They are two kinds of device memory. Private memory
> > that is not accessible to CPU because it does not have all the expected
> > properties (this is for all PCIE devices) or public memory which can
> > also be access by CPU without restriction (with OpenCAPI or CCIX or
> > similar cache-coherent and atomic inter-connect).
> > 
> > Device driver can use each of HMM tools separatly. You do not have to
> > use all the tools it provides.
> > 
> > For RDMA device i do not expect a need to use the device memory support
> > of HMM. This device memory support is geared toward accelerator like GPU.
> > 
> > 
> > You can find a branch [1] with all the prerequisite in. This patch is on
> > top of rdma-next with the HMM patchset [2] and mmu notifier patchset [3]
> > applied on top of it.
> > 
> > [1] https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4
> > [2] https://lkml.org/lkml/2019/4/3/1032
> > [3] https://lkml.org/lkml/2019/3/26/900
> 
> Jerome, please let me know if these dependent series are merged during
> the first week of the merge window.
> 
> This patch has been tested and could go along next week if the
> dependencies are met.
> 

So attached is a rebase on top of 5.2-rc1. I have tested with pingpong
(prefetch and not, and different sizes). Seems to work OK.

Cheers,
Jérôme
>From 80d98b62b0106d94825eccacd5035fb67ad7b825 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= 
Date: Sat, 8 Dec 2018 15:47:55 -0500
Subject: [PATCH] RDMA/odp: convert to use HMM for ODP v4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Convert ODP to use HMM so that we can build on common infrastructure
for different classes of devices that want to mirror a process address
space into a device. There are no functional changes.

Changes since v3:
- Rebase on top of 5.2-rc1
Changes since v2:
- Update to match changes to HMM API
Changes since v1:
- improved comments
- simplified page alignment computation

Signed-off-by: Jérôme Glisse 
Cc: Jason Gunthorpe 
Cc: Leon Romanovsky 
Cc: Doug Ledford 
Cc: Artemy Kovalyov 
Cc: Moni Shoua 
Cc: Mike Marciniszyn 
Cc: Kaike Wan 
Cc: Dennis Dalessandro 
---
 drivers/infiniband/core/umem_odp.c | 491 -
 drivers/infiniband/hw/mlx5/mem.c   |  20 +-
 drivers/infiniband/hw/mlx5/mr.c|   2 +-
 drivers/infiniband/hw/mlx5/odp.c   | 107 ---
 include/rdma/ib_umem_odp.h |  47 ++-
 5 files changed, 224 insertions(+), 443 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index f962b5bbfa40..b94ab0d34f1b 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -46,6 +46,20 @@
 #include 
 #include 
 
+
+static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
+   ODP_READ_BIT,   /* HMM_PFN_VALID */
+   ODP_WRITE_BIT,  /* HMM_PFN_WRITE */
+   ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
+};
+
+static uint64_t odp_hmm_values[HMM_PFN_VALUE_MAX] = {
+   -1UL,   /* HMM_PFN_ERROR */
+   0UL,/* HMM_PFN_NONE */
+   -2UL,   /* HMM_PFN_SPECIAL */
+};
+
+
 /*
  * The ib_umem list keeps track of memory regions for which the HW
  * device request to receive notification when the related memory
@@ -78,57 +92,25 @@ static u64 node_last(struct umem_odp_node *n)
 INTERVAL_TREE_DEFINE(struct umem_odp_node, rb, u64, __subtree_last,
  

Re: [PATCH 4/4] mm, notifier: Add a lockdep map for invalidate_range_start

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:45PM +0200, Daniel Vetter wrote:
> This is a similar idea to the fs_reclaim fake lockdep lock. It's
> fairly easy to provoke a specific notifier to be run on a specific
> range: Just prep it, and then munmap() it.
> 
> A bit harder, but still doable, is to provoke the mmu notifiers for
> all the various callchains that might lead to them. But both at the
> same time is really hard to reliable hit, especially when you want to
> exercise paths like direct reclaim or compaction, where it's not
> easy to control what exactly will be unmapped.
> 
> By introducing a lockdep map to tie them all together we allow lockdep
> to see a lot more dependencies, without having to actually hit them
> in a single challchain while testing.
> 
> Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> this out for the invaliate_range_start callback. If there's
> interest, we should probably roll this out to all of them. But my
> undestanding of core mm is seriously lacking, and I'm not clear on
> whether we need a lockdep map for each callback, or whether some can
> be shared.

I need to read more on lockdep, but it is legal to have mmu notifier
invalidations nested within each other. For instance, when you munmap
you might split a huge pmd, and that triggers a second invalidate range
while the munmap one is not done yet. Would that trigger the lockdep
annotation here?

The worst case I can think of is two invalidate_range_start chains, one
nested inside the other. I don't think you can trigger three levels of
nesting, but maybe.
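
Concretely, something like this call chain (illustrative only, from
memory of the zap path, not an actual trace):

	munmap()
	  __mmu_notifier_invalidate_range_start()       <- munmap range
	    zap_pmd_range()
	      __split_huge_pmd()
	        __mmu_notifier_invalidate_range_start()  <- nested, pmd-sized range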

Cheers,
Jérôme


Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address

2019-05-20 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:07:38AM +0530, Anshuman Khandual wrote:
> On 05/18/2019 03:20 AM, Andrew Morton wrote:
> > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual 
> >  wrote:
> > 
> >> The presence of struct page does not guarantee linear mapping for the pfn
> >> physical range. Device private memory which is non-coherent is excluded
> >> from linear mapping during devm_memremap_pages() though they will still
> >> have struct page coverage. Just check for device private memory before
> >> giving out virtual address for a given pfn.
> > 
> > I was going to give my standard "what are the user-visible runtime
> > effects of this change?", but...
> > 
> >> All these helper functions are all pfn_t related but could not figure out
> >> another way of determining a private pfn without looking into it's struct
> >> page. pfn_t_to_virt() is not getting used any where in mainline kernel.Is
> >> it used by out of tree drivers ? Should we then drop it completely ?
> > 
> > Yeah, let's kill it.
> > 
> > But first, let's fix it so that if someone brings it back, they bring
> > back a non-buggy version.
> 
> Makes sense.
> 
> > 
> > So...  what (would be) the user-visible runtime effects of this change?
> 
> I am not very well aware about the user interaction with the drivers which
> hotplug and manage ZONE_DEVICE memory in general. Hence will not be able to
> comment on it's user visible runtime impact. I just figured this out from
> code audit while testing ZONE_DEVICE on arm64 platform. But the fix makes
> the function bit more expensive as it now involve some additional memory
> references.

A device private pfn must never leak into code that does not understand it.
So this change is useless for any existing user, and I would like to keep the
existing behavior, i.e. never leak a device private pfn.

Cheers,
Jérôme


Re: [PATCH 0/5] mm/hmm: HMM documentation updates and code fixes

2019-05-13 Thread Jerome Glisse
On Mon, May 13, 2019 at 10:26:59AM -0700, Ralph Campbell wrote:
> 
> 
> On 5/12/19 8:08 AM, Jerome Glisse wrote:
> > On Mon, May 06, 2019 at 04:29:37PM -0700, rcampb...@nvidia.com wrote:
> > > From: Ralph Campbell 
> > > 
> > > I hit a use after free bug in hmm_free() with KASAN and then couldn't
> > > stop myself from cleaning up a bunch of documentation and coding style
> > > changes. So the first two patches are clean ups, the last three are
> > > the fixes.
> > > 
> > > Ralph Campbell (5):
> > >mm/hmm: Update HMM documentation
> > >mm/hmm: Clean up some coding style and comments
> > >mm/hmm: Use mm_get_hmm() in hmm_range_register()
> > >mm/hmm: hmm_vma_fault() doesn't always call hmm_range_unregister()
> > >mm/hmm: Fix mm stale reference use in hmm_free()
> > 
> > This patchset does not seems to be on top of
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.2-v3
> > 
> > So here we are out of sync, on documentation and code. If you
> > have any fix for 
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.2-v3
> > then please submit something on top of that.
> > 
> > Cheers,
> > Jérôme
> > 
> > > 
> > >   Documentation/vm/hmm.rst | 139 ++-
> > >   include/linux/hmm.h  |  84 ++
> > >   mm/hmm.c | 151 ---
> > >   3 files changed, 174 insertions(+), 200 deletions(-)
> > > 
> > > -- 
> > > 2.20.1
> 
> The patches are based on top of Andrew's mmotm tree
> git://git.cmpxchg.org/linux-mmotm.git v5.1-rc6-mmotm-2019-04-25-16-30.
> They apply cleanly to that git tag as well as your hmm-5.2-v3 branch
> so I guess I am confused where we are out of sync.

No, disregard my email; I was trying to apply it on top of the wrong
branch yesterday morning while catching up on a big backlog of email.
The failure was on my side.

Cheers,
Jérôme


Re: [PATCH 4/5] mm/hmm: hmm_vma_fault() doesn't always call hmm_range_unregister()

2019-05-12 Thread Jerome Glisse
On Sun, May 12, 2019 at 11:07:24AM -0400, Jerome Glisse wrote:
> On Tue, May 07, 2019 at 11:12:14AM -0700, Ralph Campbell wrote:
> > 
> > On 5/7/19 6:15 AM, Souptick Joarder wrote:
> > > On Tue, May 7, 2019 at 5:00 AM  wrote:
> > > > 
> > > > From: Ralph Campbell 
> > > > 
> > > > The helper function hmm_vma_fault() calls hmm_range_register() but is
> > > > missing a call to hmm_range_unregister() in one of the error paths.
> > > > This leads to a reference count leak and ultimately a memory leak on
> > > > struct hmm.
> > > > 
> > > > Always call hmm_range_unregister() if hmm_range_register() succeeded.
> > > 
> > > How about * Call hmm_range_unregister() in error path if
> > > hmm_range_register() succeeded* ?
> > 
> > Sure, sounds good.
> > I'll include that in v2.
> 
> NAK for the patch see below why
> 
> > 
> > > > 
> > > > Signed-off-by: Ralph Campbell 
> > > > Cc: John Hubbard 
> > > > Cc: Ira Weiny 
> > > > Cc: Dan Williams 
> > > > Cc: Arnd Bergmann 
> > > > Cc: Balbir Singh 
> > > > Cc: Dan Carpenter 
> > > > Cc: Matthew Wilcox 
> > > > Cc: Souptick Joarder 
> > > > Cc: Andrew Morton 
> > > > ---
> > > >   include/linux/hmm.h | 3 ++-
> > > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > > > index 35a429621e1e..fa0671d67269 100644
> > > > --- a/include/linux/hmm.h
> > > > +++ b/include/linux/hmm.h
> > > > @@ -559,6 +559,7 @@ static inline int hmm_vma_fault(struct hmm_range 
> > > > *range, bool block)
> > > >  return (int)ret;
> > > > 
> > > >  if (!hmm_range_wait_until_valid(range, 
> > > > HMM_RANGE_DEFAULT_TIMEOUT)) {
> > > > +   hmm_range_unregister(range);
> > > >  /*
> > > >   * The mmap_sem was taken by driver we release it here 
> > > > and
> > > >   * returns -EAGAIN which correspond to mmap_sem have 
> > > > been
> > > > @@ -570,13 +571,13 @@ static inline int hmm_vma_fault(struct hmm_range 
> > > > *range, bool block)
> > > > 
> > > >  ret = hmm_range_fault(range, block);
> > > >  if (ret <= 0) {
> > > > +   hmm_range_unregister(range);
> > > 
> > > what is the reason to moved it up ?
> > 
> > I moved it up because the normal calling pattern is:
> > down_read(&mm->mmap_sem)
> > hmm_vma_fault()
> > hmm_range_register()
> > hmm_range_fault()
> > hmm_range_unregister()
> > up_read(&mm->mmap_sem)
> > 
> > I don't think it is a bug to unlock mmap_sem and then unregister,
> > it is just more consistent nesting.
> 
> So this is not the usage pattern with HMM usage pattern is:
> 
> hmm_range_register()
> hmm_range_fault()
> hmm_range_unregister()
> 
> The hmm_vma_fault() is gonne so this patch here break thing.
> 
> See https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.2-v3

Sorry, not enough coffee on a Sunday morning. So yeah, this patch
looks good, except that you do not need to move it up.

Note that hmm_vma_fault() is a goner: once the ODP-to-HMM conversion is
upstream and I have converted nouveau/amd to the new API, we can remove
that one.

Cheers,
Jérôme


Re: [PATCH 0/5] mm/hmm: HMM documentation updates and code fixes

2019-05-12 Thread Jerome Glisse
On Mon, May 06, 2019 at 04:29:37PM -0700, rcampb...@nvidia.com wrote:
> From: Ralph Campbell 
> 
> I hit a use after free bug in hmm_free() with KASAN and then couldn't
> stop myself from cleaning up a bunch of documentation and coding style
> changes. So the first two patches are clean ups, the last three are
> the fixes.
> 
> Ralph Campbell (5):
>   mm/hmm: Update HMM documentation
>   mm/hmm: Clean up some coding style and comments
>   mm/hmm: Use mm_get_hmm() in hmm_range_register()
>   mm/hmm: hmm_vma_fault() doesn't always call hmm_range_unregister()
>   mm/hmm: Fix mm stale reference use in hmm_free()

This patchset does not seem to be on top of
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.2-v3

So here we are out of sync, on documentation and code. If you
have any fix for https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.2-v3
then please submit something on top of that.

Cheers,
Jérôme

> 
>  Documentation/vm/hmm.rst | 139 ++-
>  include/linux/hmm.h  |  84 ++
>  mm/hmm.c | 151 ---
>  3 files changed, 174 insertions(+), 200 deletions(-)
> 
> -- 
> 2.20.1
> 


Re: [PATCH 4/5] mm/hmm: hmm_vma_fault() doesn't always call hmm_range_unregister()

2019-05-12 Thread Jerome Glisse
On Tue, May 07, 2019 at 11:12:14AM -0700, Ralph Campbell wrote:
> 
> On 5/7/19 6:15 AM, Souptick Joarder wrote:
> > On Tue, May 7, 2019 at 5:00 AM  wrote:
> > > 
> > > From: Ralph Campbell 
> > > 
> > > The helper function hmm_vma_fault() calls hmm_range_register() but is
> > > missing a call to hmm_range_unregister() in one of the error paths.
> > > This leads to a reference count leak and ultimately a memory leak on
> > > struct hmm.
> > > 
> > > Always call hmm_range_unregister() if hmm_range_register() succeeded.
> > 
> > How about * Call hmm_range_unregister() in error path if
> > hmm_range_register() succeeded* ?
> 
> Sure, sounds good.
> I'll include that in v2.

NAK for the patch, see below for why.

> 
> > > 
> > > Signed-off-by: Ralph Campbell 
> > > Cc: John Hubbard 
> > > Cc: Ira Weiny 
> > > Cc: Dan Williams 
> > > Cc: Arnd Bergmann 
> > > Cc: Balbir Singh 
> > > Cc: Dan Carpenter 
> > > Cc: Matthew Wilcox 
> > > Cc: Souptick Joarder 
> > > Cc: Andrew Morton 
> > > ---
> > >   include/linux/hmm.h | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > > index 35a429621e1e..fa0671d67269 100644
> > > --- a/include/linux/hmm.h
> > > +++ b/include/linux/hmm.h
> > > @@ -559,6 +559,7 @@ static inline int hmm_vma_fault(struct hmm_range 
> > > *range, bool block)
> > >  return (int)ret;
> > > 
> > >  if (!hmm_range_wait_until_valid(range, 
> > > HMM_RANGE_DEFAULT_TIMEOUT)) {
> > > +   hmm_range_unregister(range);
> > >  /*
> > >   * The mmap_sem was taken by driver we release it here 
> > > and
> > >   * returns -EAGAIN which correspond to mmap_sem have been
> > > @@ -570,13 +571,13 @@ static inline int hmm_vma_fault(struct hmm_range 
> > > *range, bool block)
> > > 
> > >  ret = hmm_range_fault(range, block);
> > >  if (ret <= 0) {
> > > +   hmm_range_unregister(range);
> > 
> > what is the reason to moved it up ?
> 
> I moved it up because the normal calling pattern is:
> down_read(&mm->mmap_sem)
> hmm_vma_fault()
> hmm_range_register()
> hmm_range_fault()
> hmm_range_unregister()
> up_read(&mm->mmap_sem)
> 
> I don't think it is a bug to unlock mmap_sem and then unregister,
> it is just more consistent nesting.

So this is not the usage pattern; the HMM usage pattern is:

hmm_range_register()
hmm_range_fault()
hmm_range_unregister()

hmm_vma_fault() is gone, so this patch here breaks things.

See https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.2-v3
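
To illustrate the pattern above, a driver fault path ends up looking
roughly like this (simplified sketch, not the actual nouveau/ODP code;
in particular the hmm_range_register() arguments here assume the current
5.2 signature):

	long ret;

	ret = hmm_range_register(range, mm, start, end, PAGE_SHIFT);
	if (ret)
		return ret;

	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
		hmm_range_unregister(range);
		return -EBUSY;
	}

	down_read(&mm->mmap_sem);
	ret = hmm_range_fault(range, true);
	up_read(&mm->mmap_sem);

	hmm_range_unregister(range);
	return ret < 0 ? ret : 0;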




Re: [PATCH] mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig

2019-05-02 Thread Jerome Glisse
On Wed, May 01, 2019 at 12:23:58PM -0700, Guenter Roeck wrote:
> On Wed, May 01, 2019 at 02:38:51PM -0400, Jerome Glisse wrote:
> > Andrew just the patch that would be nice to get in 5.2 so i can fix
> > device driver Kconfig before doing the real update to mm HMM Kconfig
> > 
> > On Wed, Apr 17, 2019 at 05:11:41PM -0400, jgli...@redhat.com wrote:
> > > From: Jérôme Glisse 
> > > 
> > > This patch just add 2 new Kconfig that are _not use_ by anyone. I check
> > > that various make ARCH=somearch allmodconfig do work and do not complain.
> > > This new Kconfig need to be added first so that device driver that do
> > > depend on HMM can be updated.
> > > 
> > > Once drivers are updated then i can update the HMM Kconfig to depends
> > > on this new Kconfig in a followup patch.
> > > 
> 
> I am probably missing something, but why not submit the entire series 
> together ?
> That might explain why XARRAY_MULTI is enabled below, and what the series is
> about. Additional comments below.
> 
> > > Signed-off-by: Jérôme Glisse 
> > > Cc: Guenter Roeck 
> > > Cc: Leon Romanovsky 
> > > Cc: Jason Gunthorpe 
> > > Cc: Andrew Morton 
> > > Cc: Ralph Campbell 
> > > Cc: John Hubbard 
> > > ---
> > >  mm/Kconfig | 16 
> > >  1 file changed, 16 insertions(+)
> > > 
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 25c71eb8a7db..daadc9131087 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -676,6 +676,22 @@ config ZONE_DEVICE
> > >  
> > > If FS_DAX is enabled, then say Y.
> > >  
> > > +config ARCH_HAS_HMM_MIRROR
> > > + bool
> > > + default y
> > > + depends on (X86_64 || PPC64)
> > > + depends on MMU && 64BIT
> > > +
> > > +config ARCH_HAS_HMM_DEVICE
> > > + bool
> > > + default y
> > > + depends on (X86_64 || PPC64)
> > > + depends on MEMORY_HOTPLUG
> > > + depends on MEMORY_HOTREMOVE
> > > + depends on SPARSEMEM_VMEMMAP
> > > + depends on ARCH_HAS_ZONE_DEVICE
> 
> This is almost identical to ARCH_HAS_HMM except ARCH_HAS_HMM
> depends on ZONE_DEVICE and MMU && 64BIT. ARCH_HAS_HMM_MIRROR
> and ARCH_HAS_HMM_DEVICE together almost match ARCH_HAS_HMM,
> except for the ARCH_HAS_ZONE_DEVICE vs. ZONE_DEVICE dependency.
> And ZONE_DEVICE selects XARRAY_MULTI, meaning there is really
> substantial overlap.
> 
> Not really my concern, but personally I'd like to see some
> reasoning why the additional options are needed .. thus the
> question above, why not submit the series together ?
> 

There is no series here; this is about solving the Kconfig situation for
HMM. Given that device drivers go through their own trees, we want to
avoid changing them from the mm tree. So the plan is:

1 - Kernel release N: add the new Kconfig options to mm/Kconfig (this patch)
2 - Kernel release N+1: update drivers to depend on the new Kconfig
    options, i.e. stop using ARCH_HAS_HMM and start using
    ARCH_HAS_HMM_MIRROR and ARCH_HAS_HMM_DEVICE (one or the other, or
    both, depending on the driver)
3 - Kernel release N+2: remove ARCH_HAS_HMM and do the final Kconfig
    update in mm/Kconfig

This has been discussed in the past, and while it is a bit painful it
is the easiest solution (short of a git topic branch, but the mm tree
is not managed as a git tree).

Cheers,
Jérôme


Re: [PATCH] mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig

2019-05-01 Thread Jerome Glisse
Andrew, this is just the patch that would be nice to get into 5.2 so I
can fix the device driver Kconfig before doing the real update to the mm
HMM Kconfig.

On Wed, Apr 17, 2019 at 05:11:41PM -0400, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> This patch just add 2 new Kconfig that are _not use_ by anyone. I check
> that various make ARCH=somearch allmodconfig do work and do not complain.
> This new Kconfig need to be added first so that device driver that do
> depend on HMM can be updated.
> 
> Once drivers are updated then i can update the HMM Kconfig to depends
> on this new Kconfig in a followup patch.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Guenter Roeck 
> Cc: Leon Romanovsky 
> Cc: Jason Gunthorpe 
> Cc: Andrew Morton 
> Cc: Ralph Campbell 
> Cc: John Hubbard 
> ---
>  mm/Kconfig | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 25c71eb8a7db..daadc9131087 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -676,6 +676,22 @@ config ZONE_DEVICE
>  
> If FS_DAX is enabled, then say Y.
>  
> +config ARCH_HAS_HMM_MIRROR
> + bool
> + default y
> + depends on (X86_64 || PPC64)
> + depends on MMU && 64BIT
> +
> +config ARCH_HAS_HMM_DEVICE
> + bool
> + default y
> + depends on (X86_64 || PPC64)
> + depends on MEMORY_HOTPLUG
> + depends on MEMORY_HOTREMOVE
> + depends on SPARSEMEM_VMEMMAP
> + depends on ARCH_HAS_ZONE_DEVICE
> + select XARRAY_MULTI
> +
>  config ARCH_HAS_HMM
>   bool
>   default y
> -- 
> 2.20.1
> 


Re: [LSF/MM TOPIC] NUMA, memory hierarchy and device memory

2019-04-25 Thread Jerome Glisse


I see that the schedule is not full yet for the mm track, and I would
really like to be able to have a discussion on this topic.

Schedule:
https://docs.google.com/spreadsheets/d/1Z1pDL-XeUT1ZwMWrBL8T8q3vtSqZpLPgF3Bzu_jejfk/edit#gid=0


On Fri, Jan 18, 2019 at 12:45:13PM -0500, Jerome Glisse wrote:
> Hi, i would like to discuss about NUMA API and its short comings when
> it comes to memory hierarchy (from fast HBM, to slower persistent
> memory through regular memory) and also device memory (which can have
> its own hierarchy).
> 
> I have proposed a patch to add a new memory topology model to the
> kernel for application to be able to get that informations, it
> also included a set of new API to bind/migrate process range [1].
> Note that this model also support device memory.
> 
> So far device memory support is achieve through device specific ioctl
> and this forbid some scenario like device memory interleaving accross
> multiple devices for a range. It also make the whole userspace more
> complex as program have to mix and match multiple device specific API
> on top of NUMA API.
> 
> While memory hierarchy can be more or less expose through the existing
> NUMA API by creating node for non-regular memory [2], i do not see this
> as a satisfying solution. Moreover such scheme does not work for device
> memory that might not even be accessible by CPUs.
> 
> 
> Hence i would like to discuss few points:
> - What proof people wants to see this as problem we need to solve ?
> - How to build concensus to move forward on this ?
> - What kind of syscall API people would like to see ?
> 
> People to discuss this topic:
> Dan Williams 
> Dave Hansen 
> Felix Kuehling 
> John Hubbard 
> Jonathan Cameron 
> Keith Busch 
> Mel Gorman 
> Michal Hocko 
> Paul Blinzer 
> 
> Probably others, sorry if i miss anyone from previous discussions.
> 
> Cheers,
> Jérôme
> 
> [1] https://lkml.org/lkml/2018/12/3/1072
> [2] https://lkml.org/lkml/2018/12/10/1112
> 


[LSF/MM] Preliminary agenda ? Anyone ... anyone ? Bueller ?

2019-04-25 Thread Jerome Glisse
Did I miss the preliminary agenda somewhere? In previous years I think
there used to be one by now :)

Cheers,
Jérôme


Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp

2019-04-23 Thread Jerome Glisse
On Tue, Apr 23, 2019 at 11:00:30AM +0800, Peter Xu wrote:
> On Mon, Apr 22, 2019 at 10:54:02AM -0400, Jerome Glisse wrote:
> > On Mon, Apr 22, 2019 at 08:20:10PM +0800, Peter Xu wrote:
> > > On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:
> > > 
> > > [...]
> > > 
> > > > > > > + if (uffd_wp_resolve) {
> > > > > > > + /* If the fault is resolved already, 
> > > > > > > skip */
> > > > > > > + if (!pte_uffd_wp(*pte))
> > > > > > > + continue;
> > > > > > > + page = vm_normal_page(vma, addr, 
> > > > > > > oldpte);
> > > > > > > + if (!page || page_mapcount(page) > 1) {
> > > > > > > + struct vm_fault vmf = {
> > > > > > > + .vma = vma,
> > > > > > > + .address = addr & 
> > > > > > > PAGE_MASK,
> > > > > > > + .page = page,
> > > > > > > + .orig_pte = oldpte,
> > > > > > > + .pmd = pmd,
> > > > > > > + /* pte and ptl not 
> > > > > > > needed */
> > > > > > > + };
> > > > > > > + vm_fault_t ret;
> > > > > > > +
> > > > > > > + if (page)
> > > > > > > + get_page(page);
> > > > > > > + arch_leave_lazy_mmu_mode();
> > > > > > > + pte_unmap_unlock(pte, ptl);
> > > > > > > + ret = wp_page_copy(&vmf);
> > > > > > > + /* PTE is changed, or OOM */
> > > > > > > + if (ret == 0)
> > > > > > > + /* It's done by others 
> > > > > > > */
> > > > > > > + continue;
> > > > > > 
> > > > > > This is wrong if ret == 0 you still need to remap the pte before
> > > > > > continuing as otherwise you will go to next pte without the page
> > > > > > table lock for the directory. So 0 case must be handled after
> > > > > > arch_enter_lazy_mmu_mode() below.
> > > > > > 
> > > > > > Sorry i should have catch that in previous review.
> > > > > 
> > > > > My fault to not have noticed it since the very beginning... thanks for
> > > > > spotting that.
> > > > > 
> > > > > I'm squashing below changes into the patch:
> > > > 
> > > > 
> > > > Well thinking of this some more i think you should use do_wp_page() and
> > > > not wp_page_copy() it would avoid bunch of code above and also you are
> > > > not properly handling KSM page or page in the swap cache. Instead of
> > > > duplicating same code that is in do_wp_page() it would be better to call
> > > > it here.
> > > 
> > > Yeah it makes sense to me.  Then here's my plan:
> > > 
> > > - I'll need to drop previous patch "export wp_page_copy" since then
> > >   it'll be not needed
> > > 
> > > - I'll introduce another patch to split current do_wp_page() and
> > >   introduce function "wp_page_copy_cont" (better suggestion on the
> > >   naming would be welcomed) which contains most of the wp handling
> > >   that'll be needed for change_pte_range() in this patch and isolate
> > >   the uffd handling:
> > > 
> > > static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > >   __releases(vmf->ptl)
> > > {
> > >   struct vm_area_struct *vma = vmf->vma;
> > > 
> > >   if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > >   pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >   return handle_userfault(vmf, VM

Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
> This allows to search for a VMA structure without holding the mmap_sem.
> 
> The search is repeated while the mm seqlock is changing and until we found
> a valid VMA.
> 
> While under the RCU protection, a reference is taken on the VMA, so the
> caller must call put_vma() once it not more need the VMA structure.
> 
> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
> reference is taken to the VMA by calling get_vma().
> 
> When removing a VMA from the MM RB tree, the VMA is not release immediately
> but at the end of the RCU grace period through vm_rcu_put(). This ensures
> that the VMA remains allocated until the end the RCU grace period.
> 
> Since the vm_file pointer, if valid, is released in put_vma(), there is no
> guarantee that the file pointer will be valid on the returned VMA.
> 
> Signed-off-by: Laurent Dufour 

Minor comments about a comment (I love recursion :)); see below.

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mm_types.h |  1 +
>  mm/internal.h|  5 ++-
>  mm/mmap.c| 76 ++--
>  3 files changed, 78 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a6159e11a3f..9af6694cb95d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -287,6 +287,7 @@ struct vm_area_struct {
>  
>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>   atomic_t vm_ref_count;
> + struct rcu_head vm_rcu;
>  #endif
>   struct rb_node vm_rb;
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 302382bed406..1e368e4afe3c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
>   __free_vma(vma);
>  }
>  
> -#else
> +extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
> +unsigned long addr);
> +
> +#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>  
>  static inline void get_vma(struct vm_area_struct *vma)
>  {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c106440dcae7..34bf261dc2c8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct 
> *mm)
>  {
>   write_sequnlock(&mm->mm_seq);
>  }
> +
> +static void __vm_rcu_put(struct rcu_head *head)
> +{
> + struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> +   vm_rcu);
> + put_vma(vma);
> +}
> +static void vm_rcu_put(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
> + call_rcu(&vma->vm_rcu, __vm_rcu_put);
> +}
>  #else
>  static inline void mm_write_seqlock(struct mm_struct *mm)
>  {
> @@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct 
> *mm)
>  
>  void __free_vma(struct vm_area_struct *vma)
>  {
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>   mpol_put(vma_policy(vma));
>   vm_area_free(vma);
>  }
> @@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
>  /*
>   * Close a vm structure and free it, returning the next.
>   */
> -static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
>  {
>   struct vm_area_struct *next = vma->vm_next;
>  
>   might_sleep();
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
> + !RB_EMPTY_NODE(&vma->vm_rb)) {
> + /*
> +  * If the VMA is still linked in the RB tree, we must release
> +  * that reference by calling put_vma().
> +  * This should only happen when called from exit_mmap().
> +  * We forcely clear the node to satisfy the chec in
^
Typo: chec -> check

> +  * __free_vma(). This is safe since the RB tree is not walked
> +  * anymore.
> +  */
> + RB_CLEAR_NODE(&vma->vm_rb);
> + put_vma(vma);
> + }
>   if (vma->vm_ops && vma->vm_ops->close)
>   vma->vm_ops->close(vma);
>   if (vma->vm_file)
> @@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct 
> vm_area_struct *vma)
>   return next;
>  }
>  
> +static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +{
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);

It would be worth adding a comment here explaining the BUG_ON so people
can understand what is wrong if it ever triggers. For instance:

/*
 * remove_vma() should be called only once a vma has been removed from the
 * rbtree, at which point vma->vm_rb is an empty node. The exception is when
 * vmas are destroyed through exit_mmap(), in which case we do not bother
 * updating the rbtree

Re: [PATCH v12 15/31] mm: introduce __lru_cache_add_active_or_unevictable

2019-04-22 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 03:45:06PM +0200, Laurent Dufour wrote:
> The speculative page fault handler which is run without holding the
> mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags
> is not guaranteed to remain constant.
> Introducing __lru_cache_add_active_or_unevictable() which has the vma flags
> value parameter instead of the vma pointer.
> 
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/swap.h | 10 --
>  mm/memory.c  |  8 
>  mm/swap.c|  6 +++---
>  3 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bfb5c4ac108..d33b94eb3c69 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -343,8 +343,14 @@ extern void deactivate_file_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
> -extern void lru_cache_add_active_or_unevictable(struct page *page,
> - struct vm_area_struct *vma);
> +extern void __lru_cache_add_active_or_unevictable(struct page *page,
> + unsigned long vma_flags);
> +
> +static inline void lru_cache_add_active_or_unevictable(struct page *page,
> + struct vm_area_struct *vma)
> +{
> + return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
> +}
>  
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> diff --git a/mm/memory.c b/mm/memory.c
> index 56802850e72c..85ec5ce5c0a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
>   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(new_page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(new_page, vma);
> + __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
>   /*
>* We call the notify macro here because, when using secondary
>* mmu page tables (such as kvm shadow page tables), we want the
> @@ -2896,7 +2896,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   if (unlikely(page != swapcache && swapcache)) {
>   page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>   } else {
>   do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
>   mem_cgroup_commit_charge(page, memcg, true, false);
> @@ -3048,7 +3048,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault 
> *vmf)
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>   page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>  setpte:
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>  
> @@ -3327,7 +3327,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct 
> mem_cgroup *memcg,
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>   page_add_new_anon_rmap(page, vma, vmf->address, false);
>   mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
>   } else {
>   inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>   page_add_file_rmap(page, false);
> diff --git a/mm/swap.c b/mm/swap.c
> index 3a75722e68a9..a55f0505b563 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -450,12 +450,12 @@ void lru_cache_add(struct page *page)
>   * directly back onto it's zone's unevictable list, it does NOT use a
>   * per cpu pagevec.
>   */
> -void lru_cache_add_active_or_unevictable(struct page *page,
> -  struct vm_area_struct *vma)
> +void __lru_cache_add_active_or_unevictable(struct page *page,
> +unsigned long vma_flags)
>  {
>   VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> + if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
>   SetPageActive(page);
>   else if (!TestSetPageMlocked(page)) {
>   /*
> -- 
> 2.21.0
> 


Re: [RESEND PATCH] mm/hmm: Fix initial PFN for hugetlbfs pages

2019-04-22 Thread Jerome Glisse
On Fri, Apr 19, 2019 at 04:35:36PM -0700, rcampb...@nvidia.com wrote:
> From: Ralph Campbell 
> 
> The mmotm patch [1] adds hugetlbfs support for HMM but the initial
> PFN used to fill the HMM range->pfns[] array doesn't properly
> compute the starting PFN offset.
> This can be tested by running test-hugetlbfs-read from [2].
> 
> Fix the PFN offset by adjusting the page offset by the device's
> page size.
> 
> Andrew, this should probably be squashed into Jerome's patch.
> 
> [1] https://marc.info/?l=linux-mm&m=155432003506068&w=2
> ("mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping)")
> [2] https://gitlab.freedesktop.org/glisse/svm-cl-tests
> 
> Signed-off-by: Ralph Campbell 

Good catch.

Reviewed-by: Jérôme Glisse 

> Cc: Jérôme Glisse 
> Cc: Ira Weiny 
> Cc: John Hubbard 
> Cc: Dan Williams 
> Cc: Arnd Bergmann 
> Cc: Balbir Singh 
> Cc: Dan Carpenter 
> Cc: Matthew Wilcox 
> Cc: Souptick Joarder 
> Cc: Andrew Morton 
> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index def451a56c3e..fcf8e4fb5770 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -868,7 +868,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, 
> unsigned long hmask,
>   goto unlock;
>   }
>  
> - pfn = pte_pfn(entry) + (start & mask);
> + pfn = pte_pfn(entry) + ((start & mask) >> range->page_shift);
>   for (; addr < end; addr += size, i++, pfn += pfn_inc)
>   range->pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
>cpu_flags;
> -- 
> 2.20.1
> 


Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp

2019-04-22 Thread Jerome Glisse
On Mon, Apr 22, 2019 at 08:20:10PM +0800, Peter Xu wrote:
> On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:
> 
> [...]
> 
> > > > > + if (uffd_wp_resolve) {
> > > > > + /* If the fault is resolved already, 
> > > > > skip */
> > > > > + if (!pte_uffd_wp(*pte))
> > > > > + continue;
> > > > > + page = vm_normal_page(vma, addr, 
> > > > > oldpte);
> > > > > + if (!page || page_mapcount(page) > 1) {
> > > > > + struct vm_fault vmf = {
> > > > > + .vma = vma,
> > > > > + .address = addr & 
> > > > > PAGE_MASK,
> > > > > + .page = page,
> > > > > + .orig_pte = oldpte,
> > > > > + .pmd = pmd,
> > > > > + /* pte and ptl not 
> > > > > needed */
> > > > > + };
> > > > > + vm_fault_t ret;
> > > > > +
> > > > > + if (page)
> > > > > + get_page(page);
> > > > > + arch_leave_lazy_mmu_mode();
> > > > > + pte_unmap_unlock(pte, ptl);
> > > > > + ret = wp_page_copy(&vmf);
> > > > > + /* PTE is changed, or OOM */
> > > > > + if (ret == 0)
> > > > > + /* It's done by others 
> > > > > */
> > > > > + continue;
> > > > 
> > > > This is wrong if ret == 0 you still need to remap the pte before
> > > > continuing as otherwise you will go to next pte without the page
> > > > table lock for the directory. So 0 case must be handled after
> > > > arch_enter_lazy_mmu_mode() below.
> > > > 
> > > > Sorry i should have catch that in previous review.
> > > 
> > > My fault to not have noticed it since the very beginning... thanks for
> > > spotting that.
> > > 
> > > I'm squashing below changes into the patch:
> > 
> > 
> > Well thinking of this some more i think you should use do_wp_page() and
> > not wp_page_copy() it would avoid bunch of code above and also you are
> > not properly handling KSM page or page in the swap cache. Instead of
> > duplicating same code that is in do_wp_page() it would be better to call
> > it here.
> 
> Yeah it makes sense to me.  Then here's my plan:
> 
> - I'll need to drop previous patch "export wp_page_copy" since then
>   it'll be not needed
> 
> - I'll introduce another patch to split current do_wp_page() and
>   introduce function "wp_page_copy_cont" (better suggestion on the
>   naming would be welcomed) which contains most of the wp handling
>   that'll be needed for change_pte_range() in this patch and isolate
>   the uffd handling:
> 
> static vm_fault_t do_wp_page(struct vm_fault *vmf)
>   __releases(vmf->ptl)
> {
>   struct vm_area_struct *vma = vmf->vma;
> 
>   if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>   return handle_userfault(vmf, VM_UFFD_WP);
>   }
> 
>   return do_wp_page_cont(vmf);
> }
> 
> Then I can probably use do_wp_page_cont() in this patch.

Instead I would keep the do_wp_page() name and do:

static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf)
{
	... /* the userfaultfd check you have above */
	return do_wp_page(vmf);
}

Naming-wise I think it would be better to keep do_wp_page() as is.
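
To make that concrete, here is a slightly fuller sketch of the split, with
the "..." filled in by the userfaultfd check from your snippet above
(hypothetical code, just to show the shape I have in mind, not something
to merge verbatim):

static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf)
	__releases(vmf->ptl)
{
	struct vm_area_struct *vma = vmf->vma;

	/* Resolve the userfaultfd write protection first... */
	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}

	/* ...then fall through to the regular wp handling, name unchanged. */
	return do_wp_page(vmf);
}

That way the page fault path keeps calling the function that does the
uffd-wp check, while change_pte_range() can call do_wp_page() directly.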

Cheers,
Jérôme


Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration

2019-04-19 Thread Jerome Glisse
On Fri, Apr 19, 2019 at 03:42:20PM +0800, Peter Xu wrote:
> On Thu, Apr 18, 2019 at 04:59:07PM -0400, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> > > For either swap and page migration, we all use the bit 2 of the entry to
> > > identify whether this entry is uffd write-protected.  It plays a similar
> > > role as the existing soft dirty bit in swap entries but only for keeping
> > > the uffd-wp tracking for a specific PTE/PMD.
> > > 
> > > Something special here is that when we want to recover the uffd-wp bit
> > > from a swap/migration entry to the PTE bit we'll also need to take care
> > > of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> > > _PAGE_UFFD_WP bit we can't trap it at all.
> > > 
> > > Note that this patch removed two lines from "userfaultfd: wp: hook
> > > userfault handler to write protection fault" where we try to remove the
> > > VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> > > patch will still keep the write flag there.
> > > 
> > > Reviewed-by: Mike Rapoport 
> > > Signed-off-by: Peter Xu 
> > 
> > Some missing thing see below.
> > 
> > [...]
> > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6405d56debee..c3d57fa890f2 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct 
> > > mm_struct *src_mm,
> > >   pte = swp_entry_to_pte(entry);
> > >   if (pte_swp_soft_dirty(*src_pte))
> > >   pte = pte_swp_mksoft_dirty(pte);
> > > + if (pte_swp_uffd_wp(*src_pte))
> > > + pte = pte_swp_mkuffd_wp(pte);
> > >   set_pte_at(src_mm, addr, src_pte, pte);
> > >   }
> > >   } else if (is_device_private_entry(entry)) {
> > 
> > You need to handle the is_device_private_entry() as the migration case
> > too.
> 
> Hi, Jerome,
> 
> Yes I can simply add the handling, but I'd confess I haven't thought
> clearly yet on how userfault-wp will be used with HMM (and that's
> mostly because my unfamiliarity so far with HMM).  Could you give me
> some hint on a most general and possible scenario?

Device private is just a temporary state with HMM: you can have things
like a GPU or FPGA migrate some anonymous page to their local memory
because it is used by the GPU or the FPGA. The GPU or FPGA behaves like
a CPU from the mm point of view, so if it wants to write it will fault
and go through the regular CPU page fault path.

That said, it can still migrate a page that is uffd write protected,
just because the device only cares about reading. So if you have a
uffd-wp pte to a regular page that gets migrated to some device memory,
you want to keep the uffd-wp flag across the migration (in both
directions: when going to device memory and when coming back from it).

As far as uffd is concerned this is just another page; it just does not
have a valid pte entry because the CPU cannot access such memory. But
from the mm point of view it is just another page.
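
Concretely, in the copy_one_pte() hunk quoted above, the
is_device_private_entry() branch needs the same two lines as the migration
entry case, applied wherever that branch rewrites the swap pte. Rough
sketch, untested:

	/*
	 * Inside the is_device_private_entry(entry) branch, right before
	 * the set_pte_at(src_mm, addr, src_pte, pte) there: carry the
	 * uffd-wp bit over the device private swap entry so the write
	 * protection survives the trip to device memory and back.
	 */
	if (pte_swp_uffd_wp(*src_pte))
		pte = pte_swp_mkuffd_wp(pte);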

> 
> > 
> > 
> > 
> > > @@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >   flush_icache_page(vma, page);
> > >   if (pte_swp_soft_dirty(vmf->orig_pte))
> > >   pte = pte_mksoft_dirty(pte);
> > > + if (pte_swp_uffd_wp(vmf->orig_pte)) {
> > > + pte = pte_mkuffd_wp(pte);
> > > + pte = pte_wrprotect(pte);
> > > + }
> > >   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > >   arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >   vmf->orig_pte = pte;
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 181f5d2718a9..72cde187d4a1 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, 
> > > struct vm_area_struct *vma,
> > >   entry = pte_to_swp_entry(*pvmw.pte);
> > >   if (is_write_migration_entry(entry))
> > >   pte = maybe_mkwrite(pte, vma);
> > > + else if (pte_swp_uffd_wp(*pvmw.pte))
> > > + pte = pte_mkuffd_wp(pte);
> > >  
> > >   if (unlikely(is_zone_device_page(new))) {
> > >   if (is_device_private_page(new)) {
> > 
> > You need to handle the is_device_private_page() case too, ie mark its
> > swap entry as uffd-wp.

Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp

2019-04-19 Thread Jerome Glisse
On Fri, Apr 19, 2019 at 02:26:50PM +0800, Peter Xu wrote:
> On Thu, Apr 18, 2019 at 04:51:15PM -0400, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> > > This allows uffd-wp to support write-protected pages for COW.
> > > 
> > > For example, the uffd write-protected PTE could also be write-protected
> > > by other usages like COW or zero pages.  When that happens, we can't
> > > simply set the write bit in the PTE since otherwise it'll change the
> > > content of every single reference to the page.  Instead, we should do
> > > the COW first if necessary, then handle the uffd-wp fault.
> > > 
> > > To correctly copy the page, we'll also need to carry over the
> > > _PAGE_UFFD_WP bit if it was set in the original PTE.
> > > 
> > > For huge PMDs, we just simply split the huge PMDs where we want to
> > > resolve an uffd-wp page fault always.  That matches what we do with
> > > general huge PMD write protections.  In that way, we resolved the huge
> > > PMD copy-on-write issue into PTE copy-on-write.
> > > 
> > > Signed-off-by: Peter Xu 
> > 
> > This one has a bug see below.
> > 
> > 
> > > ---
> > >  mm/memory.c   |  5 +++-
> > >  mm/mprotect.c | 64 ---
> > >  2 files changed, 65 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e7a4b9650225..b8a4c0bab461 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
> > >   }
> > >   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> > >   entry = mk_pte(new_page, vma->vm_page_prot);
> > > - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > > + if (pte_uffd_wp(vmf->orig_pte))
> > > + entry = pte_mkuffd_wp(entry);
> > > + else
> > > + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > >   /*
> > >* Clear the pte entry and flush it first, before updating the
> > >* pte with the new entry. This will avoid a race condition
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 9d4433044c21..855dddb07ff2 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >   flush_tlb_batched_pending(vma->vm_mm);
> > >   arch_enter_lazy_mmu_mode();
> > >   do {
> > > +retry_pte:
> > >   oldpte = *pte;
> > >   if (pte_present(oldpte)) {
> > >   pte_t ptent;
> > >   bool preserve_write = prot_numa && pte_write(oldpte);
> > > + struct page *page;
> > >  
> > >   /*
> > >* Avoid trapping faults against the zero or KSM
> > >* pages. See similar comment in change_huge_pmd.
> > >*/
> > >   if (prot_numa) {
> > > - struct page *page;
> > > -
> > >   page = vm_normal_page(vma, addr, oldpte);
> > >   if (!page || PageKsm(page))
> > >   continue;
> > > @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >   continue;
> > >   }
> > >  
> > > + /*
> > > +  * Detect whether we'll need to COW before
> > > +  * resolving an uffd-wp fault.  Note that this
> > > +  * includes detection of the zero page (where
> > > +  * page==NULL)
> > > +  */
> > > + if (uffd_wp_resolve) {
> > > + /* If the fault is resolved already, skip */
> > > + if (!pte_uffd_wp(*pte))
> > > + continue;
> > > + page = vm_normal_page(vma, addr, oldpte);
> > > + if (!page || page_mapcount(page) > 1) {
> > > + struct vm_fault vmf = {

Re: [PATCH] cifs: fix page reference leak with readv/writev

2019-04-18 Thread Jerome Glisse
On Wed, Apr 10, 2019 at 09:47:01PM -0500, Steve French wrote:
> How was this discovered? Does it address a reported user problem?

I spotted it while tracking down how page references are taken for
bios and how they are released. In the current code, once the pages
are GUPed they are never released if there is any failure: on failure
cifs_aio_ctx_release() will be called, but it will just free the
bio_vec, not release the page references.

Page references are only dropped if everything is successful.

So this patch moves the page reference dropping to cifs_aio_ctx_release(),
which is called from all code paths, including failure, AFAICT, and thus
the page references will be dropped if a failure does happen.

Cheers,
Jérôme

> 
> On Wed, Apr 10, 2019 at 2:38 PM  wrote:
> >
> > From: Jérôme Glisse 
> >
> > CIFS can leak pages reference gotten through GUP (get_user_pages*()
> > through iov_iter_get_pages()). This happen if cifs_send_async_read()
> > or cifs_write_from_iter() calls fail from within __cifs_readv() and
> > __cifs_writev() respectively. This patch move page unreference to
> > cifs_aio_ctx_release() which will happens on all code paths this is
> > all simpler to follow for correctness.
> >
> > Signed-off-by: Jérôme Glisse 
> > Cc: Steve French 
> > Cc: linux-c...@vger.kernel.org
> > Cc: samba-techni...@lists.samba.org
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Linus Torvalds 
> > Cc: Stable 
> > ---
> >  fs/cifs/file.c | 15 +--
> >  fs/cifs/misc.c | 23 ++-
> >  2 files changed, 23 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > index 89006e044973..a756a4d3f70f 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -2858,7 +2858,6 @@ static void collect_uncached_write_data(struct 
> > cifs_aio_ctx *ctx)
> > struct cifs_tcon *tcon;
> > struct cifs_sb_info *cifs_sb;
> > struct dentry *dentry = ctx->cfile->dentry;
> > -   unsigned int i;
> > int rc;
> >
> > tcon = tlink_tcon(ctx->cfile->tlink);
> > @@ -2922,10 +2921,6 @@ static void collect_uncached_write_data(struct 
> > cifs_aio_ctx *ctx)
> > kref_put(&wdata->refcount, cifs_uncached_writedata_release);
> > }
> >
> > -   if (!ctx->direct_io)
> > -   for (i = 0; i < ctx->npages; i++)
> > -   put_page(ctx->bv[i].bv_page);
> > -
> > cifs_stats_bytes_written(tcon, ctx->total_len);
> > set_bit(CIFS_INO_INVALID_MAPPING, &CIFS_I(dentry->d_inode)->flags);
> >
> > @@ -3563,7 +3558,6 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
> > struct iov_iter *to = &ctx->iter;
> > struct cifs_sb_info *cifs_sb;
> > struct cifs_tcon *tcon;
> > -   unsigned int i;
> > int rc;
> >
> > tcon = tlink_tcon(ctx->cfile->tlink);
> > @@ -3647,15 +3641,8 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
> > kref_put(&rdata->refcount, cifs_uncached_readdata_release);
> > }
> >
> > -   if (!ctx->direct_io) {
> > -   for (i = 0; i < ctx->npages; i++) {
> > -   if (ctx->should_dirty)
> > -   set_page_dirty(ctx->bv[i].bv_page);
> > -   put_page(ctx->bv[i].bv_page);
> > -   }
> > -
> > +   if (!ctx->direct_io)
> > ctx->total_len = ctx->len - iov_iter_count(to);
> > -   }
> >
> > /* mask nodata case */
> > if (rc == -ENODATA)
> > diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
> > index bee203055b30..9bc0d17a9d77 100644
> > --- a/fs/cifs/misc.c
> > +++ b/fs/cifs/misc.c
> > @@ -768,6 +768,11 @@ cifs_aio_ctx_alloc(void)
> >  {
> > struct cifs_aio_ctx *ctx;
> >
> > +   /*
> > +* Must use kzalloc to initialize ctx->bv to NULL and ctx->direct_io
> > +* to false so that we know when we have to unreference pages within
> > +* cifs_aio_ctx_release()
> > +*/
> > ctx = kzalloc(sizeof(struct cifs_aio_ctx), GFP_KERNEL);
> > if (!ctx)
> > return NULL;
> > @@ -786,7 +791,23 @@ cifs_aio_ctx_release(struct kref *refcount)
> > struct cifs_aio_ctx, refcount);
> >
> > cifsFileInfo_put(ctx->cfile);
> > -   kvfree(ctx->bv);
> > +
> > +   /*
> > +* ctx->bv is only set if setup_aio_ctx_iter() was call successfuly
> > +* which means that iov_iter_get_pages() was a success and thus that
> > +* we have taken reference on pages.
> > +*/
> > +   if (ctx->bv) {
> > +   unsigned i;
> > +
> > +   for (i = 0; i < ctx->npages; i++) {
> > +   if (ctx->should_dirty)
> > +   set_page_dirty(ctx->bv[i].bv_page);
> > +   put_page(ctx->bv[i].bv_page);
> > +   }
> > +   kvfree(ctx->bv);
> > +   }
> > +
> > kfree(ctx);
> >  }
> >
> 

Re: [PATCH v3 00/28] userfaultfd: write protection support

2019-04-18 Thread Jerome Glisse
On Tue, Apr 09, 2019 at 02:08:39PM +0800, Peter Xu wrote:
> On Wed, Mar 20, 2019 at 10:06:14AM +0800, Peter Xu wrote:
> > This series implements initial write protection support for
> > userfaultfd.  Currently both shmem and hugetlbfs are not supported
> > yet, but only anonymous memory.  This is the 3nd version of it.
> > 
> > The latest code can also be found at:
> > 
> >   https://github.com/xzpeter/linux/tree/uffd-wp-merged
> > 
> > Note again that the first 5 patches in the series can be seen as
> > isolated work on page fault mechanism.  I would hope that they can be
> > considered to be reviewed/picked even earlier than the rest of the
> > series since it's even useful for existing userfaultfd MISSING case
> > [8].
> 
> Ping - any further comments for v3?  Is there any chance to have this
> series (or the first 5 patches) for 5.2?

A few issues are left; sorry for taking so long to get to the review,
sometimes it goes to the bottom of my stack.

I am guessing this should be merged through Andrew? Unless Andrea has
a tree for userfaultfd (I am not following that too closely).

From my point of view it almost all looks good. I sent reviews before
this email. Maybe we need some review from the x86 folks on the x86
arch changes for the feature?

Cheers,
Jérôme


Re: [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:39AM +0800, Peter Xu wrote:
> In change_pte_range() we do nothing for uffd if the PTE is a swap
> entry.  That can lead to data mismatch if the page that we are going
> to write protect is swapped out when sending the UFFDIO_WRITEPROTECT.
> This patch applies/removes the uffd-wp bit even for the swap entries.
> 
> Signed-off-by: Peter Xu 

This one seems to address some of the comments I made on patch 17,
but not all of them though. Maybe squash them together?

> ---
> 
> I kept this patch a standalone one majorly to make review easier.  The
> patch can be considered as standalone or to squash into the patch
> "userfaultfd: wp: support swap and page migration".
> ---
>  mm/mprotect.c | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 96c0f521099d..a23e03053787 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -183,11 +183,11 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   }
>   ptep_modify_prot_commit(mm, addr, pte, ptent);
>   pages++;
> - } else if (IS_ENABLED(CONFIG_MIGRATION)) {
> + } else if (is_swap_pte(oldpte)) {
>   swp_entry_t entry = pte_to_swp_entry(oldpte);
> + pte_t newpte;
>  
>   if (is_write_migration_entry(entry)) {
> - pte_t newpte;
>   /*
>* A protection check is difficult so
>* just be safe and disable write
> @@ -198,22 +198,24 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   newpte = pte_swp_mksoft_dirty(newpte);
>   if (pte_swp_uffd_wp(oldpte))
>   newpte = pte_swp_mkuffd_wp(newpte);
> - set_pte_at(mm, addr, pte, newpte);
> -
> - pages++;
> - }
> -
> - if (is_write_device_private_entry(entry)) {
> - pte_t newpte;
> -
> + } else if (is_write_device_private_entry(entry)) {
>   /*
>* We do not preserve soft-dirtiness. See
>* copy_one_pte() for explanation.
>*/
>   make_device_private_entry_read(&entry);
>   newpte = swp_entry_to_pte(entry);
> - set_pte_at(mm, addr, pte, newpte);
> + } else {
> + newpte = oldpte;
> + }
>  
> + if (uffd_wp)
> + newpte = pte_swp_mkuffd_wp(newpte);
> + else if (uffd_wp_resolve)
> + newpte = pte_swp_clear_uffd_wp(newpte);
> +
> + if (!pte_same(oldpte, newpte)) {
> + set_pte_at(mm, addr, pte, newpte);
>   pages++;
>   }
>   }
> -- 
> 2.17.1
> 


Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> For either swap and page migration, we all use the bit 2 of the entry to
> identify whether this entry is uffd write-protected.  It plays a similar
> role as the existing soft dirty bit in swap entries but only for keeping
> the uffd-wp tracking for a specific PTE/PMD.
> 
> Something special here is that when we want to recover the uffd-wp bit
> from a swap/migration entry to the PTE bit we'll also need to take care
> of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> _PAGE_UFFD_WP bit we can't trap it at all.
> 
> Note that this patch removed two lines from "userfaultfd: wp: hook
> userfault handler to write protection fault" where we try to remove the
> VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> patch will still keep the write flag there.
> 
> Reviewed-by: Mike Rapoport 
> Signed-off-by: Peter Xu 

Some things are missing, see below.

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index 6405d56debee..c3d57fa890f2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
> *src_mm,
>   pte = swp_entry_to_pte(entry);
>   if (pte_swp_soft_dirty(*src_pte))
>   pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
>   set_pte_at(src_mm, addr, src_pte, pte);
>   }
>   } else if (is_device_private_entry(entry)) {

You need to handle the is_device_private_entry() case the same way as
the migration case too.



> @@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   flush_icache_page(vma, page);
>   if (pte_swp_soft_dirty(vmf->orig_pte))
>   pte = pte_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(vmf->orig_pte)) {
> + pte = pte_mkuffd_wp(pte);
> + pte = pte_wrprotect(pte);
> + }
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>   arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>   vmf->orig_pte = pte;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 181f5d2718a9..72cde187d4a1 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, 
> struct vm_area_struct *vma,
>   entry = pte_to_swp_entry(*pvmw.pte);
>   if (is_write_migration_entry(entry))
>   pte = maybe_mkwrite(pte, vma);
> + else if (pte_swp_uffd_wp(*pvmw.pte))
> + pte = pte_mkuffd_wp(pte);
>  
>   if (unlikely(is_zone_device_page(new))) {
>   if (is_device_private_page(new)) {

You need to handle the is_device_private_page() case too, ie mark its
swap entry as uffd-wp.

> @@ -2301,6 +2303,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   swp_pte = swp_entry_to_pte(entry);
>   if (pte_soft_dirty(pte))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
> + if (pte_uffd_wp(pte))
> + swp_pte = pte_swp_mkuffd_wp(swp_pte);
>   set_pte_at(mm, addr, ptep, swp_pte);
>
>   /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 855dddb07ff2..96c0f521099d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -196,6 +196,8 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   newpte = swp_entry_to_pte(entry);
>   if (pte_swp_soft_dirty(oldpte))
>   newpte = pte_swp_mksoft_dirty(newpte);
> + if (pte_swp_uffd_wp(oldpte))
> + newpte = pte_swp_mkuffd_wp(newpte);
>   set_pte_at(mm, addr, pte, newpte);
>  
>   pages++;

You also need to handle the is_write_device_private_entry() case just
below that chunk.
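
Something along these lines, mirroring what the patch does for the
migration entry just above (rough sketch, untested):

		} else if (is_write_device_private_entry(entry)) {
			/*
			 * Same treatment as the migration entry: preserve
			 * the uffd-wp bit on the device private swap entry
			 * as well.
			 */
			make_device_private_entry_read(&entry);
			newpte = swp_entry_to_pte(entry);
			if (pte_swp_uffd_wp(oldpte))
				newpte = pte_swp_mkuffd_wp(newpte);
			set_pte_at(mm, addr, pte, newpte);

			pages++;
		}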

Cheers,
Jérôme


Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> This allows uffd-wp to support write-protected pages for COW.
> 
> For example, the uffd write-protected PTE could also be write-protected
> by other usages like COW or zero pages.  When that happens, we can't
> simply set the write bit in the PTE since otherwise it'll change the
> content of every single reference to the page.  Instead, we should do
> the COW first if necessary, then handle the uffd-wp fault.
> 
> To correctly copy the page, we'll also need to carry over the
> _PAGE_UFFD_WP bit if it was set in the original PTE.
> 
> For huge PMDs, we just simply split the huge PMDs where we want to
> resolve an uffd-wp page fault always.  That matches what we do with
> general huge PMD write protections.  In that way, we resolved the huge
> PMD copy-on-write issue into PTE copy-on-write.
> 
> Signed-off-by: Peter Xu 

This one has a bug, see below.


> ---
>  mm/memory.c   |  5 +++-
>  mm/mprotect.c | 64 ---
>  2 files changed, 65 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e7a4b9650225..b8a4c0bab461 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   }
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>   entry = mk_pte(new_page, vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (pte_uffd_wp(vmf->orig_pte))
> + entry = pte_mkuffd_wp(entry);
> + else
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>   /*
>* Clear the pte entry and flush it first, before updating the
>* pte with the new entry. This will avoid a race condition
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 9d4433044c21..855dddb07ff2 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   flush_tlb_batched_pending(vma->vm_mm);
>   arch_enter_lazy_mmu_mode();
>   do {
> +retry_pte:
>   oldpte = *pte;
>   if (pte_present(oldpte)) {
>   pte_t ptent;
>   bool preserve_write = prot_numa && pte_write(oldpte);
> + struct page *page;
>  
>   /*
>* Avoid trapping faults against the zero or KSM
>* pages. See similar comment in change_huge_pmd.
>*/
>   if (prot_numa) {
> - struct page *page;
> -
>   page = vm_normal_page(vma, addr, oldpte);
>   if (!page || PageKsm(page))
>   continue;
> @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   continue;
>   }
>  
> + /*
> +  * Detect whether we'll need to COW before
> +  * resolving an uffd-wp fault.  Note that this
> +  * includes detection of the zero page (where
> +  * page==NULL)
> +  */
> + if (uffd_wp_resolve) {
> + /* If the fault is resolved already, skip */
> + if (!pte_uffd_wp(*pte))
> + continue;
> + page = vm_normal_page(vma, addr, oldpte);
> + if (!page || page_mapcount(page) > 1) {
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = addr & PAGE_MASK,
> + .page = page,
> + .orig_pte = oldpte,
> + .pmd = pmd,
> + /* pte and ptl not needed */
> + };
> + vm_fault_t ret;
> +
> + if (page)
> + get_page(page);
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(pte, ptl);
> + ret = wp_page_copy(&vmf);
> + /* PTE is changed, or OOM */
> + if (ret == 0)
> + /* It's done by others */
> + continue;

This is wrong: if ret == 0 you still need to remap the pte before
continuing, as otherwise you will go to the next pte without holding
the page table lock for the directory. So the 0 case must be handled
after arch_enter_lazy_mmu_mode() below.

Sorry, I should have caught that in the previous review.

Re: [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:18AM +0800, Peter Xu wrote:
> The idea comes from a discussion between Linus and Andrea [1].
> 
> Before this patch we only allow a page fault to retry once.  We
> achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> handle_mm_fault() the second time.  This was majorly used to avoid
> unexpected starvation of the system by looping over forever to handle
> the page fault on a single page.  However that should hardly happen,
> and after all for each code path to return a VM_FAULT_RETRY we'll
> first wait for a condition (during which time we should possibly yield
> the cpu) to happen before VM_FAULT_RETRY is really returned.
> 
> This patch removes the restriction by keeping the
> FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
> that the page fault handler now can retry the page fault for multiple
> times if necessary without the need to generate another page fault
> event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so page
> fault handler can still identify whether a page fault is the first
> attempt or not.
> 
> Then we'll have these combinations of fault flags (only considering
> ALLOW_RETRY flag and TRIED flag):
> 
>   - ALLOW_RETRY and !TRIED:  this means the page fault allows to
>  retry, and this is the first try
> 
>   - ALLOW_RETRY and TRIED:   this means the page fault allows to
>  retry, and this is not the first try
> 
>   - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
>  to retry at all
> 
>   - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
> 
> In existing code we have multiple places that has taken special care
> of the first condition above by checking against (fault_flags &
> FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
> detect the first retry of a page fault by checking against
> both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag &
> FAULT_FLAG_TRIED) because now even the 2nd try will have the
> ALLOW_RETRY set, then use that helper in all existing special paths.
> One example is in __lock_page_or_retry(), now we'll drop the mmap_sem
> only in the first attempt of page fault and we'll keep it in follow up
> retries, so old locking behavior will be retained.
> 
> This will be a nice enhancement for current code [2] at the same time
> a supporting material for the future userfaultfd-writeprotect work,
> since in that work there will always be an explicit userfault
> writeprotect retry for protected pages, and if that cannot resolve the
> page fault (e.g., when userfaultfd-writeprotect is used in conjunction
> with swapped pages) then we'll possibly need a 3rd retry of the page
> fault.  It might also benefit other potential users who will have
> similar requirement like userfault write-protection.
> 
> GUP code is not touched yet and will be covered in follow up patch.
> 
> Please read the thread below for more information.
> 
> [1] https://lkml.org/lkml/2017/11/2/833
> [2] https://lkml.org/lkml/2018/12/30/64
> 
> Suggested-by: Linus Torvalds 
> Suggested-by: Andrea Arcangeli 
> Signed-off-by: Peter Xu 

Reviewed-by: Jérôme Glisse 

A minor comment suggestion below, but it can be fixed in a follow-up patch.

[...]

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..f73dbc4a1957 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -336,16 +336,52 @@ extern unsigned int kobjsize(const void *objp);
>   */
>  extern pgprot_t protection_map[16];
>  
> +/*
> + * About FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED: we can specify whether 
> we
> + * would allow page faults to retry by specifying these two fault flags
> + * correctly.  Currently there can be three legal combinations:
> + *
> + * (a) ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and
> + *  this is the first try
> + *
> + * (b) ALLOW_RETRY and TRIED:   this means the page fault allows retry, and
> + *  we've already tried at least once
> + *
> + * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow 
> retry
> + *
> + * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should 
> never
> + * be used.  Note that page faults can be allowed to retry for multiple 
> times,
> + * in which case we'll have an initial fault with flags (a) then later on
> + * continuous faults with flags (b).  We should always try to detect pending
> + * signals before a retry to make sure the continuous page faults can still 
> be
> + * interrupted if necessary.
> + */
> +
>  #define FAULT_FLAG_WRITE 0x01/* Fault was a write access */
>  #define FAULT_FLAG_MKWRITE   0x02/* Fault was mkwrite of existing pte */
>  #define FAULT_FLAG_ALLOW_RETRY   0x04/* Retry fault if blocking */
>  #define FAULT_FLAG_RETRY_NOWAIT  0x08/* Don't drop mmap_sem and wait 
> when retrying */

Re: [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:21AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli 
> 
> There are several cases write protection fault happens. It could be a
> write to zero page, swaped page or userfault write protected
> page. When the fault happens, there is no way to know if userfault
> write protect the page before. Here we just blindly issue a userfault
> notification for vma with VM_UFFD_WP regardless if app write protects
> it yet. Application should be ready to handle such wp fault.
> 
> v1: From: Shaohua Li 
> 
> v2: Handle the userfault in the common do_wp_page. If we get there a
> pagetable is present and readonly so no need to do further processing
> until we solve the userfault.
> 
> In the swapin case, always swapin as readonly. This will cause false
> positive userfaults. We need to decide later if to eliminate them with
> a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
> 
> hugetlbfs wouldn't need to worry about swapouts but and tmpfs would
> be handled by a swap entry bit like anonymous memory.
> 
> The main problem with no easy solution to eliminate the false
> positives, will be if/when userfaultfd is extended to real filesystem
> pagecache. When the pagecache is freed by reclaim we can't leave the
> radix tree pinned if the inode and in turn the radix tree is reclaimed
> as well.
> 
> The estimation is that full accuracy and lack of false positives could
> be easily provided only to anonymous memory (as long as there's no
> fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous
> range) tmpfs and hugetlbfs, it's most certainly worth to achieve it
> but in a later incremental patch.
> 
> v3: Add hooking point for THP wrprotect faults.
> 
> CC: Shaohua Li 
> Signed-off-by: Andrea Arcangeli 
> [peterx: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
> Reviewed-by: Mike Rapoport 
> Signed-off-by: Peter Xu 


Reviewed-by: Jérôme Glisse 

> ---
>  mm/memory.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..567686ec086d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2483,6 +2483,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>  
> + if (userfaultfd_wp(vma)) {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return handle_userfault(vmf, VM_UFFD_WP);
> + }
> +
>   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>   if (!vmf->page) {
>   /*
> @@ -3684,8 +3689,11 @@ static inline vm_fault_t create_huge_pmd(struct 
> vm_fault *vmf)
>  /* `inline' is required to avoid gcc 4.1.2 build error */
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
> - if (vma_is_anonymous(vmf->vma))
> + if (vma_is_anonymous(vmf->vma)) {
> + if (userfaultfd_wp(vmf->vma))
> + return handle_userfault(vmf, VM_UFFD_WP);
>   return do_huge_pmd_wp_page(vmf, orig_pmd);
> + }
>   if (vmf->vma->vm_ops->huge_fault)
>   return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
>  
> -- 
> 2.17.1
> 


Re: IOMMU Page faults when running DMA transfers from PCIe device

2019-04-18 Thread Jerome Glisse
On Thu, Apr 18, 2019 at 09:37:58AM +, David Laight wrote:
> From: Jerome Glisse
> > Sent: 16 April 2019 16:33
> ...
> > I am no expert but i am guessing your FPGA set the request field in the
> > PCIE TLP write packet to 00:00.0 and this might work when IOMMU is off but
> > might not work when IOMMU is on ie when IOMMU is on your device should set
> > the request field to the FPGA PCIE id so that the IOMMU knows for which
> > device the PCIE write or read packet is and thus against which IOMMU page
> > table.
> 
> Interesting.
> Does that mean that a malicious PCIe device can send write TLP
> that contain the 'wrong' id (IIRC that is bus:dev:fn) and so
> write to areas that it shouldn't access?

Yes it does; there are a bunch of papers on that, look for "IOMMU DMA
attack".

> 
> For any degree of security the PCIe bridge nearest the target
> needs to verify the id as well.
> Actually all bridges need to verify the 'bus' part.
> Then boards with 'dodgy' bridges can only write to locations
> that other dev:fn on the same board can access.

Yes, they should, but it has a cost, and AFAIK no bridge, not even
the root port, does that. PCIe bandwidth is big, which means a lot
of packets can go through a PCIe switch or PCIe bridge, and I
believe such PCIe packet inspection has been considered too costly.
After all, if someone can plug a rogue device into your computer
(ignoring laptops), then they can do more harm with easier methods.
FPGA accelerators as PCIe devices might open a door for clever and
_resourceful_ people to try to use them as a remote attack vector.

Cheers,
Jérôme


Re: [PATCH] mm/hmm: kconfig split HMM address space mirroring from device memory

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 12:33:35PM -0700, Guenter Roeck wrote:
> On Wed, Apr 17, 2019 at 02:26:18PM -0400, Jerome Glisse wrote:
> > On Wed, Apr 17, 2019 at 11:21:18AM -0700, Guenter Roeck wrote:
> > > On Thu, Apr 11, 2019 at 02:03:26PM -0400, jgli...@redhat.com wrote:
> > > > From: Jérôme Glisse 
> > > > 
> > > > To allow building device driver that only care about address space
> > > > mirroring (like RDMA ODP) on platform that do not have all the pre-
> > > > requisite for HMM device memory (like ZONE_DEVICE on ARM) split the
> > > > HMM_MIRROR option dependency from the HMM_DEVICE dependency.
> > > > 
> > > > Signed-off-by: Jérôme Glisse 
> > > > Cc: Leon Romanovsky 
> > > > Cc: Jason Gunthorpe 
> > > > Cc: Andrew Morton 
> > > > Cc: Ralph Campbell 
> > > > Cc: John Hubbard 
> > > > Tested-by: Leon Romanovsky 
> > > 
> > > In case it hasn't been reported already:
> > > 
> > > mm/hmm.c: In function 'hmm_vma_handle_pmd':
> > > mm/hmm.c:537:8: error: implicit declaration of function 'pmd_pfn'; did 
> > > you mean 'pte_pfn'?
> > 
> > No it is pmd_pfn
> > 
> FWIW, this is a compiler message.
> 
> > > 
> > > and similar errors when building alpha:allmodconfig (and maybe others).
> > 
> > Does HMM_MIRROR get enabled in your config ? It should not
> > does adding depends on (X86_64 || PPC64) to ARCH_HAS_HMM
> > fix it ? I should just add that there for arch i do build.
> > 
> 
> The eror is seen with is alpha:allmodconfig. "make ARCH=alpha allmodconfig".
> It does set CONFIG_ARCH_HAS_HMM=y.
> 
> This patch has additional problems. For arm64:allmodconfig
> and many others, when running "make ARCH=arm64 allmodconfig":
> 
> WARNING: unmet direct dependencies detected for DEVICE_PRIVATE
>   Depends on [n]: ARCH_HAS_HMM_DEVICE [=n] && ZONE_DEVICE [=n]
>   Selected by [m]:
>   - DRM_NOUVEAU_SVM [=y] && HAS_IOMEM [=y] && ARCH_HAS_HMM [=y] && 
> DRM_NOUVEAU [=m] && STAGING [=y]
> 
> WARNING: unmet direct dependencies detected for DEVICE_PRIVATE
>   Depends on [n]: ARCH_HAS_HMM_DEVICE [=n] && ZONE_DEVICE [=n]
>   Selected by [m]:
>   - DRM_NOUVEAU_SVM [=y] && HAS_IOMEM [=y] && ARCH_HAS_HMM [=y] && 
> DRM_NOUVEAU [=m] && STAGING [=y]
> 
> WARNING: unmet direct dependencies detected for DEVICE_PRIVATE
>   Depends on [n]: ARCH_HAS_HMM_DEVICE [=n] && ZONE_DEVICE [=n]
>   Selected by [m]:
>   - DRM_NOUVEAU_SVM [=y] && HAS_IOMEM [=y] && ARCH_HAS_HMM [=y] && 
> DRM_NOUVEAU [=m] && STAGING [=y]
> 
> This in turn results in:
> 
> arch64-linux-ld: mm/memory.o: in function `do_swap_page':
> memory.c:(.text+0x798c): undefined reference to `device_private_entry_fault'
> 
> not only on arm64, but on other architectures as well.
> 
> All those problems are gone after reverting this patch.
> 
> Guenter

Andrew, let's drop this patch; I need to fix the nouveau Kconfig first.

Cheers,
Jérôme


Re: [PATCH] mm/hmm: kconfig split HMM address space mirroring from device memory

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 11:21:18AM -0700, Guenter Roeck wrote:
> On Thu, Apr 11, 2019 at 02:03:26PM -0400, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > To allow building device driver that only care about address space
> > mirroring (like RDMA ODP) on platform that do not have all the pre-
> > requisite for HMM device memory (like ZONE_DEVICE on ARM) split the
> > HMM_MIRROR option dependency from the HMM_DEVICE dependency.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Cc: Leon Romanovsky 
> > Cc: Jason Gunthorpe 
> > Cc: Andrew Morton 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Tested-by: Leon Romanovsky 
> 
> In case it hasn't been reported already:
> 
> mm/hmm.c: In function 'hmm_vma_handle_pmd':
> mm/hmm.c:537:8: error: implicit declaration of function 'pmd_pfn'; did you 
> mean 'pte_pfn'?

No, it is pmd_pfn().

> 
> and similar errors when building alpha:allmodconfig (and maybe others).

Does HMM_MIRROR get enabled in your config? It should not.
Does adding "depends on (X86_64 || PPC64)" to ARCH_HAS_HMM
fix it? I should just add that there for the arches I do build.

Cheers,
Jérôme


Re: IOMMU Page faults when running DMA transfers from PCIe device

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 04:17:09PM +0200, Patrick Brunner wrote:
> Am Dienstag, 16. April 2019, 17:33:07 CEST schrieb Jerome Glisse:
> > On Mon, Apr 15, 2019 at 06:04:11PM +0200, Patrick Brunner wrote:
> > > Dear all,
> > > 
> > > I'm encountering very nasty problems regarding DMA transfers from an
> > > external PCIe device to the main memory while the IOMMU is enabled, and
> > > I'm running out of ideas. I'm not even sure, whether it's a kernel issue
> > > or not. But I would highly appreciate any hints from experienced
> > > developers how to proceed to solve that issue.
> > > 
> > > The problem: An FPGA (see details below) should write a small amount of
> > > data (~128 bytes) over a PCIe 2.0 x1 link to an address in the CPU's
> > > memory space. The destination address (64 bits) for the Mem Write TLP is
> > > written to a BAR- mapped register before-hand.
> > > 
> > > On the system side, the driver consists of the usual setup code:
> > > - request PCI regions
> > > - pci_set_master
> > > - I/O remapping of BARs
> > > - setting DMA mask (dma_set_mask_and_coherent), tried both 32/64 bits
> > > - allocating DMA buffers with dma_alloc_coherent (4096 bytes, but also
> > > tried smaller numbers)
> > > - allocating IRQ lines (MSI) with pci_alloc_irq_vectors and pci_irq_vector
> > > - writing the DMA buffers' logical address (as returned in dma_handle_t
> > > from dma_alloc_coherent) to a BAR-mapped register
> > > 
> > > There is also an IRQ handler dumping the first 2 DWs from the DMA buffer
> > > when triggered.
> > > 
> > > The FPGA part will initiate following transfers at an interval of 2.5ms:
> > > - Memory write to DMA address
> > > - Send MSI (to signal that transfer is done)
> > > - Memory read from DMA address+offset
> > > 
> > > And now, the clue: everything works fine with the IOMMU disabled
> > > (iommu=off), i.e. the 2 DWs dumped in the ISR handler contain valid data.
> > > But if the IOMMU is enabled (iommu=soft or force), I receive an IO page
> > > fault (sometimes even more, depending on the payload size) on every
> > > transfer, and the data is all zeros:
> > > 
> > > [   49.001605] IO_PAGE_FAULT device=00:00.0 domain=0x
> > > address=0xffbf8000 flags=0x0070]
> > > 
> > > Where the device ID corresponds to the Host bridge, and the address
> > > corresponds to the DMA handle I got from dma_alloc_coherent respectively.
> > 
> > I am now expert but i am guessing your FPGA set the request field in the
> > PCIE TLP write packet to 00:00.0 and this might work when IOMMU is off but
> > might not work when IOMMU is on ie when IOMMU is on your device should set
> > the request field to the FPGA PCIE id so that the IOMMU knows for which
> > device the PCIE write or read packet is and thus against which IOMMU page
> > table.
> > 
> > Cheers,
> > Jérôme
> 
> Hi Jérôme
> 
> Thank you very much for your response.
> 
> You hit the nail! That was exactly the root cause of the problem. The request 
> field was properly filled in for the Memory Read TLP, but not for the Memory 
> Write TLP, where it was all-zeroes.
> 
> If I may ask another question: Is it possible to remap a buffer for DMA which 
> was allocated by other means? For the second phase, we are going to use the 
> RTAI extension(*) which provides its own memory allocation routines (e.g. 
> rt_shm_alloc()). There, you may pass the flag USE_GFP_DMA to indicate that 
> this buffer should be suitable for DMA. I've tried to remap this memory area 
> using virt_to_phys() and use the resulting address for the DMA transfer from 
> the FPGA, getting other IO page faults. E.g.:
> 
> [   70.100140] IO_PAGE_FAULT device=01:00.0 domain=0x0001 
> address=0x0008 flags=0x0020]
> 
> It's remarkable that the logical addresses returned from dma_alloc_coherent 
> (e.g. ffbd8000) look quite different from those returned by rt_shm_alloc
> +virt_to_phys (e.g. 0008).
> 
> Unfortunately, it does not seem possible to do that the other way round, i.e. 
> forcing RTAI to use the buffer from dma_alloc_coherent.

You can use pci_map_page() or dma_map_page(). First you must get the
page that corresponds to the virtual address (maybe with
get_user_pages*(), but I would advise against it as it comes with a
long list of gotchas, and there is no other alternative unless your
device is advanced enough).

Once you have the page for the virtual address then you can call either
dma_map_page() or pci_map_page(). I am sure you can find examples of
their usage within the kernel.
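
For example, a minimal sketch for the simple case where the buffer is
physically contiguous lowmem (map_rt_buffer, pdev, rt_buf and size are
placeholder names; if rt_shm_alloc() hands you vmalloc memory you would
need vmalloc_to_page() on each page instead):

static int map_rt_buffer(struct pci_dev *pdev, void *rt_buf, size_t size,
			 dma_addr_t *dma)
{
	/* Only valid for lowmem; vmalloc memory needs vmalloc_to_page(). */
	struct page *page = virt_to_page(rt_buf);

	/* The FPGA both writes and reads the buffer. */
	*dma = dma_map_page(&pdev->dev, page, offset_in_page(rt_buf),
			    size, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(&pdev->dev, *dma))
		return -ENOMEM;

	/*
	 * *dma is what you program into the FPGA register; undo with
	 * dma_unmap_page(&pdev->dev, *dma, size, DMA_BIDIRECTIONAL)
	 * when done.
	 */
	return 0;
}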

It is also documented somewhere under Documentation/.

Hope this helps.

Cheers,
Jérôme


Re: [PATCH v1 00/15] Keep track of GUPed pages in fs and block

2019-04-16 Thread Jerome Glisse
On Tue, Apr 16, 2019 at 10:28:40PM +0300, Boaz Harrosh wrote:
> On 16/04/19 22:12, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> >  wrote:
> <>
> > This all reminds of the failed attempt to teach the block layer to
> > operate without pages:
> > 
> > https://lore.kernel.org/lkml/20150316201640.33102.33761.st...@dwillia2-desk3.amr.corp.intel.com/
> > 
> 
> Exactly why I want to make sure it is just a [pointer | flag] and not any 
> kind of pfn
> type. Let us please not go there again?
> 
> >>
> >> Question though - why do we need a flag for whether a page is a GUP page 
> >> or not?
> >> Couldn't the needed information just be determined by what range the pfn 
> >> is not
> >> (i.e. whether or not it has a struct page associated with it)?
> > 
> > That amounts to a pfn_valid() check which is a bit heavier than if we
> > can store a flag in the bv_pfn entry directly.
> > 
> > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > an 'unsigned long'.
> > 
> 
> No, please please not. This is not a pfn and not a pfn_t. It is a page-ptr
> and a flag that says where/how to put_page it. IE I did a GUP on this page
> please do a PUP on this page instead of regular put_page. So no where do I 
> mean
> pfn or pfn_t in this code. Then why?
> 
> > That said, I'm still in favor of Jan's proposal to just make the
> > bv_page semantics uniform. Otherwise we're complicating this core
> > infrastructure for some yet to be implemented GPU memory management
> > capabilities with yet to be determined value. Circle back when that
> > value is clear, but in the meantime fix the GUP bug.
> > 
> 
> I agree there are simpler ways to solve the bugs at hand then
> to system wide separate get_user_page from get_page and force all put_user
> callers to remember what to do. Is there some Document explaining the
> all design of where this is going?
> 

There is a very long thread on this:

https://lkml.org/lkml/2018/12/3/1128

especially all the replies to that first one.

There is also:

https://lkml.org/lkml/2019/3/26/1395
https://lwn.net/Articles/753027/

Cheers,
Jérôme


Re: IOMMU Page faults when running DMA transfers from PCIe device

2019-04-16 Thread Jerome Glisse
On Mon, Apr 15, 2019 at 06:04:11PM +0200, Patrick Brunner wrote:
> Dear all,
> 
> I'm encountering very nasty problems regarding DMA transfers from an external 
> PCIe device to the main memory while the IOMMU is enabled, and I'm running 
> out 
> of ideas. I'm not even sure, whether it's a kernel issue or not. But I would 
> highly appreciate any hints from experienced developers how to proceed to 
> solve that issue.
> 
> The problem: An FPGA (see details below) should write a small amount of data 
> (~128 bytes) over a PCIe 2.0 x1 link to an address in the CPU's memory space. 
> The destination address (64 bits) for the Mem Write TLP is written to a BAR-
> mapped register before-hand.
> 
> On the system side, the driver consists of the usual setup code:
> - request PCI regions
> - pci_set_master
> - I/O remapping of BARs
> - setting DMA mask (dma_set_mask_and_coherent), tried both 32/64 bits
> - allocating DMA buffers with dma_alloc_coherent (4096 bytes, but also tried 
> smaller numbers)
> - allocating IRQ lines (MSI) with pci_alloc_irq_vectors and pci_irq_vector
> - writing the DMA buffers' logical address (as returned in dma_handle_t from 
> dma_alloc_coherent) to a BAR-mapped register
> 
> There is also an IRQ handler dumping the first 2 DWs from the DMA buffer when 
> triggered.
> 
> The FPGA part will initiate following transfers at an interval of 2.5ms:
> - Memory write to DMA address
> - Send MSI (to signal that transfer is done)
> - Memory read from DMA address+offset
> 
> And now, the clue: everything works fine with the IOMMU disabled (iommu=off), 
> i.e. the 2 DWs dumped in the ISR handler contain valid data. But if the IOMMU 
> is enabled (iommu=soft or force), I receive an IO page fault (sometimes even 
> more, depending on the payload size) on every transfer, and the data is all 
> zeros:
> 
> [   49.001605] IO_PAGE_FAULT device=00:00.0 domain=0x 
> address=0xffbf8000 flags=0x0070]
> 
> Where the device ID corresponds to the Host bridge, and the address 
> corresponds to the DMA handle I got from dma_alloc_coherent respectively.

I am no expert, but I am guessing your FPGA sets the requester ID field
in the PCIe Memory Write TLP to 00:00.0. This might work when the IOMMU
is off but not when the IOMMU is on: when the IOMMU is on, your device
should set the requester ID field to the FPGA's own PCIe ID so that the
IOMMU knows which device the PCIe write or read packet comes from, and
thus which IOMMU page table to check it against.

Cheers,
Jérôme


Re: [PATCH v3 0/1] Use HMM for ODP v3

2019-04-11 Thread Jerome Glisse
On Thu, Apr 11, 2019 at 12:29:43PM +, Leon Romanovsky wrote:
> On Wed, Apr 10, 2019 at 11:41:24AM -0400, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> >
> > Changes since v1/v2 are about rebase and better comments in the code.
> > Previous cover letter slightly updated.
> >
> >
> > This patchset convert RDMA ODP to use HMM underneath this is motivated
> > by stronger code sharing for same feature (share virtual memory SVM or
> > Share Virtual Address SVA) and also stronger integration with mm code to
> > achieve that. It depends on HMM patchset posted for inclusion in 5.2 [2]
> > and [3].
> >
> > It has been tested with pingpong test with -o and others flags to test
> > different size/features associated with ODP.
> >
> > Moreover they are some features of HMM in the works like peer to peer
> > support, fast CPU page table snapshot, fast IOMMU mapping update ...
> > It will be easier for RDMA devices with ODP to leverage those if they
> > use HMM underneath.
> >
> > Quick summary of what HMM is:
> > HMM is a toolbox for device driver to implement software support for
> > Share Virtual Memory (SVM). Not only it provides helpers to mirror a
> > process address space on a device (hmm_mirror). It also provides
> > helper to allow to use device memory to back regular valid virtual
> > address of a process (any valid mmap that is not an mmap of a device
> > or a DAX mapping). They are two kinds of device memory. Private memory
> > that is not accessible to CPU because it does not have all the expected
> > properties (this is for all PCIE devices) or public memory which can
> > also be access by CPU without restriction (with OpenCAPI or CCIX or
> > similar cache-coherent and atomic inter-connect).
> >
> > Device driver can use each of HMM tools separatly. You do not have to
> > use all the tools it provides.
> >
> > For RDMA device i do not expect a need to use the device memory support
> > of HMM. This device memory support is geared toward accelerator like GPU.
> >
> >
> > You can find a branch [1] with all the prerequisite in. This patch is on
> > top of rdma-next with the HMM patchset [2] and mmu notifier patchset [3]
> > applied on top of it.
> >
> > [1] https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-5.2
> 
> Hi Jerome,
> 
> I took this branch and merged with our latest rdma-next, but it doesn't
> compile.
> 
> In file included from drivers/infiniband/hw/mlx5/mem.c:35:
> ./include/rdma/ib_umem_odp.h:110:20: error: field _mirror_ has
> incomplete type
>   struct hmm_mirror mirror;
>   ^~
> ./include/rdma/ib_umem_odp.h:132:18: warning: _struct hmm_range_ declared 
> inside parameter list will not be visible outside of this definition or 
> declaration
> struct hmm_range *range);
>   ^
> make[4]: *** [scripts/Makefile.build:276: drivers/infiniband/hw/mlx5/mem.o] 
> Error 1
> 
> The reason to it that in my .config, ZONE_DEVICE, MEMORY_HOTPLUG and HMM 
> options were disabled.

Silly me, I forgot to update the Kconfig, so I pushed a branch with
proper Kconfig changes in the ODP patch, but it depends on changes
to the HMM Kconfig so that HMM_MIRROR can be enabled on arches that
do not have everything needed for HMM_DEVICE.

https://cgit.freedesktop.org/~glisse/linux/log/?h=rdma-odp-hmm-v4

I am doing builds with various Kconfig variations before posting to
make sure it is all good.

Cheers,
Jérôme


Re: [PATCH] zram: pass down the bvec we need to read into in the work struct

2019-04-10 Thread Jerome Glisse
Adding more Cc and stable (I thought this was a 5.1 addition). Note that
without this patch, on an arch/kernel where PAGE_SIZE != 4096, userspace
could read random memory through a zram block device (though userspace
probably would have no control over the address being read).

On Mon, Apr 08, 2019 at 02:32:19PM -0400, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> When scheduling work item to read page we need to pass down the proper
> bvec struct which point to the page to read into. Before this patch it
> uses randomly initialized bvec (only if PAGE_SIZE != 4096) which is
> wrong.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Minchan Kim 
> Cc: Nitin Gupta 
> Cc: Sergey Senozhatsky 
> Cc: linux-kernel@vger.kernel.org
> ---
>  drivers/block/zram/zram_drv.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 399cad7daae7..d58a359a6622 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -774,18 +774,18 @@ struct zram_work {
>   struct zram *zram;
>   unsigned long entry;
>   struct bio *bio;
> + struct bio_vec bvec;
>  };
>  
>  #if PAGE_SIZE != 4096
>  static void zram_sync_read(struct work_struct *work)
>  {
> - struct bio_vec bvec;
>   struct zram_work *zw = container_of(work, struct zram_work, work);
>   struct zram *zram = zw->zram;
>   unsigned long entry = zw->entry;
>   struct bio *bio = zw->bio;
>  
> - read_from_bdev_async(zram, &bvec, entry, bio);
> + read_from_bdev_async(zram, &zw->bvec, entry, bio);
>  }
>  
>  /*
> @@ -798,6 +798,7 @@ static int read_from_bdev_sync(struct zram *zram, struct 
> bio_vec *bvec,
>  {
>   struct zram_work work;
>  
> + work.bvec = *bvec;
>   work.zram = zram;
>   work.entry = entry;
>   work.bio = bio;
> -- 
> 2.20.1
> 


Re: [PATCH v2] vfio/type1: Limit DMA mappings per container

2019-04-03 Thread Jerome Glisse
On Tue, Apr 02, 2019 at 10:15:38AM -0600, Alex Williamson wrote:
> Memory backed DMA mappings are accounted against a user's locked
> memory limit, including multiple mappings of the same memory.  This
> accounting bounds the number of such mappings that a user can create.
> However, DMA mappings that are not backed by memory, such as DMA
> mappings of device MMIO via mmaps, do not make use of page pinning
> and therefore do not count against the user's locked memory limit.
> These mappings still consume memory, but the memory is not well
> associated to the process for the purpose of oom killing a task.
> 
> To add bounding on this use case, we introduce a limit to the total
> number of concurrent DMA mappings that a user is allowed to create.
> This limit is exposed as a tunable module option where the default
> value of 64K is expected to be well in excess of any reasonable use
> case (a large virtual machine configuration would typically only make
> use of tens of concurrent mappings).
> 
> This fixes CVE-2019-3882.
> 
> Signed-off-by: Alex Williamson 

Have you tested with GPU passthrough? GPUs have huge BARs, from
hundreds of megabytes to gigabytes (some drivers resize them to
cover the whole GPU memory). Drivers need to map those to work
properly. I am not sure what path is taken by an mmap of an MMIO
BAR from a guest on the host, but I just thought I would point
that out.

Cheers,
Jérôme


Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE

2019-04-03 Thread Jerome Glisse
On Wed, Apr 03, 2019 at 02:58:28PM +0100, Robin Murphy wrote:
> [ +Dan, Jerome ]
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:
> > Arch implementation for functions which create or destroy vmemmap mapping
> > (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> > device memory range through driver provided vmem_altmap structure which
> > fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> 
> ZONE_DEVICE is about more than just altmap support, no?
> 
> > enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
> > applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
> > creates vmemmap section mappings and utilize vmem_altmap structure.
> 
> What prevents it from working with other page sizes? One of the foremost
> use-cases for our 52-bit VA/PA support is to enable mapping large quantities
> of persistent memory, so we really do need this for 64K pages too. FWIW, it
> appears not to be an issue for PowerPC.
> 
> > Signed-off-by: Anshuman Khandual 
> > ---
> >   arch/arm64/Kconfig | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index db3e625..b5d8cf5 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -31,6 +31,7 @@ config ARM64
> > select ARCH_HAS_SYSCALL_WRAPPER
> > select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
> > select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> > +   select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> 
> IIRC certain configurations (HMM?) don't even build if you just turn this on
> alone (although of course things may have changed elsewhere in the meantime)
> - crucially, though, from previous discussions[1] it seems fundamentally
> unsafe, since I don't think we can guarantee that nobody will touch the
> corners of ZONE_DEVICE that also require pte_devmap in order not to go
> subtly wrong. I did get as far as cooking up some patches to sort that out
> [2][3] which I never got round to posting for their own sake, so please
> consider picking those up as part of this series.

Correct, _do not_ enable ZONE_DEVICE without support for pte_devmap
detection. If you only want some feature of ZONE_DEVICE, like HMM, note
that while DAX does require pte_devmap, HMM device private memory does
not; so you would first have to split ZONE_DEVICE into more sub-feature
Kconfig options.

What is the end use case you are looking for? Persistent memory?

Cheers,
Jérôme


Re: [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM

2019-03-29 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 01:33:42PM -0700, John Hubbard wrote:
> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > To avoid random config build issue, select mmu notifier when HMM is
> > selected. In any cases when HMM get selected it will be by users that
> > will also wants the mmu notifier.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Acked-by: Balbir Singh 
> > Cc: Ralph Campbell 
> > Cc: Andrew Morton 
> > Cc: John Hubbard 
> > Cc: Dan Williams 
> > ---
> >  mm/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 25c71eb8a7db..0d2944278d80 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -694,6 +694,7 @@ config DEV_PAGEMAP_OPS
> >  
> >  config HMM
> > bool
> > +   select MMU_NOTIFIER
> > select MIGRATE_VMA_HELPER
> >  
> >  config HMM_MIRROR
> > 
> 
> Yes, this is a good move, given that MMU notifiers are completely,
> indispensably part of the HMM design and implementation.
> 
> The alternative would also work, but it's not quite as good. I'm
> listing it in order to forestall any debate: 
> 
>   config HMM
>   bool
>  +depends on MMU_NOTIFIER
>   select MIGRATE_VMA_HELPER
> 
> ...and "depends on" versus "select" is always a subtle question. But in
> this case, I'd say that if someone wants HMM, there's no advantage in
> making them know that they must first ensure MMU_NOTIFIER is enabled.
> After poking around a bit I don't see any obvious downsides either.

You can not depend on MMU_NOTIFIER: it is one of the kernel config
options that is not user selectable. So any config that needs
MMU_NOTIFIER must select it.

> 
> However, given that you're making this change, in order to avoid odd
> redundancy, you should also do this:
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0d2944278d80..2e6d24d783f7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -700,7 +700,6 @@ config HMM
>  config HMM_MIRROR
> bool "HMM mirror CPU page table into a device page table"
> depends on ARCH_HAS_HMM
> -   select MMU_NOTIFIER
> select HMM
> help
>   Select HMM_MIRROR if you want to mirror range of the CPU page table 
> of a

Because it is a select option no harm comes from the redundancy, hence I
did not remove it, but I can remove it.

Cheers,
Jérôme


Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 11:21:00AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 09:50:03PM -0400, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 06:18:35PM -0700, John Hubbard wrote:
> > > On 3/28/19 6:00 PM, Jerome Glisse wrote:
> > > > On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> > > >> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> > > >>> On 3/28/19 2:21 PM, Jerome Glisse wrote:
> > > >>>> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> > > >>>>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> > > >>>>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> > > >>>>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> > > >>>>>>>> From: Jérôme Glisse 
> > > >>> [...]
> > > >>>>>>>> @@ -67,14 +78,9 @@ struct hmm {
> > > >>>>>>>>   */
> > > >>>>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> > > >>>>>>>>  {
> > > >>>>>>>> -struct hmm *hmm = READ_ONCE(mm->hmm);
> > > >>>>>>>> +struct hmm *hmm = mm_get_hmm(mm);
> > > >>>>>>>
> > > >>>>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> > > >>>>>>
> > > >>>>>> The thing is that you want only one hmm struct per process and thus
> > > >>>>>> if there is already one and it is not being destroy then you want 
> > > >>>>>> to
> > > >>>>>> reuse it.
> > > >>>>>>
> > > >>>>>> Also this is all internal to HMM code and so it should not confuse
> > > >>>>>> anyone.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Well, it has repeatedly come up, and I'd claim that it is quite 
> > > >>>>> counter-intuitive. So if there is an easy way to make this internal 
> > > >>>>> HMM code clearer or better named, I would really love that to 
> > > >>>>> happen.
> > > >>>>>
> > > >>>>> And we shouldn't ever dismiss feedback based on "this is just 
> > > >>>>> internal
> > > >>>>> xxx subsystem code, no need for it to be as clear as other parts of 
> > > >>>>> the
> > > >>>>> kernel", right?
> > > >>>>
> > > >>>> Yes but i have not seen any better alternative that present code. If
> > > >>>> there is please submit patch.
> > > >>>>
> > > >>>
> > > >>> Ira, do you have any patch you're working on, or a more detailed 
> > > >>> suggestion there?
> > > >>> If not, then I might (later, as it's not urgent) propose a small 
> > > >>> cleanup patch 
> > > >>> I had in mind for the hmm_register code. But I don't want to 
> > > >>> duplicate effort 
> > > >>> if you're already thinking about it.
> > > >>
> > > >> No I don't have anything.
> > > >>
> > > >> I was just really digging into these this time around and I was about 
> > > >> to
> > > >> comment on the lack of "get's" for some "puts" when I realized that
> > > >> "hmm_register" _was_ the get...
> > > >>
> > > >> :-(
> > > >>
> > > > 
> > > > The get is mm_get_hmm() were you get a reference on HMM from a mm 
> > > > struct.
> > > > John in previous posting complained about me naming that function 
> > > > hmm_get()
> > > > and thus in this version i renamed it to mm_get_hmm() as we are getting
> > > > a reference on hmm from a mm struct.
> > > 
> > > Well, that's not what I recommended, though. The actual conversation went 
> > > like
> > > this [1]:
> > > 
> > > ---
> > > >> So for this, hmm_get() really ought to be symmetric with
> > > >> hmm_put(), by taking a struct hmm

Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 07:11:17PM -0700, John Hubbard wrote:
> On 3/28/19 6:50 PM, Jerome Glisse wrote:
> [...]
> >>>
> >>> The hmm_put() is just releasing the reference on the hmm struct.
> >>>
> >>> Here i feel i am getting contradicting requirement from different people.
> >>> I don't think there is a way to please everyone here.
> >>>
> >>
> >> That's not a true conflict: you're comparing your actual implementation
> >> to Ira's request, rather than comparing my request to Ira's request.
> >>
> >> I think there's a way forward. Ira and I are actually both asking for the
> >> same thing:
> >>
> >> a) clear, concise get/put routines
> >>
> >> b) avoiding odd side effects in functions that have one name, but do
> >> additional surprising things.
> > 
> > Please show me code because i do not see any other way to do it then
> > how i did.
> > 
> 
> Sure, I'll take a run at it. I've driven you crazy enough with the naming 
> today, it's time to back it up with actual code. :)

Note that every single line in mm_get_hmm() does matter.

> I hope this is not one of those "we must also change Nouveau in N+M steps" 
> situations, though. I'm starting to despair about reviewing code that
> basically can't be changed...

It can be changed but I would rather not do too many changes in one go.
Each change is like a tango with one partner; dancing with multiple
partners at once is painful and much more likely to end with stepping on
each other's feet.

Cheers,
Jérôme


Re: [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 11:04:26AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:09AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse 
> > 
> > HMM mirror is a device driver helpers to mirror range of virtual address.
> > It means that the process jobs running on the device can access the same
> > virtual address as the CPU threads of that process. This patch adds support
> > for mirroring mapping of file that are on a DAX block device (ie range of
> > virtual address that is an mmap of a file in a filesystem on a DAX block
> > device). There is no reason to not support such case when mirroring virtual
> > address on a device.
> > 
> > Note that unlike GUP code we do not take page reference hence when we
> > back-off we have nothing to undo.
> > 
> > Changes since v1:
> > - improved commit message
> > - squashed: Arnd Bergmann: fix unused variable warning in 
> > hmm_vma_walk_pud
> > 
> > Signed-off-by: Jérôme Glisse 
> > Reviewed-by: Ralph Campbell 
> > Cc: Andrew Morton 
> > Cc: Dan Williams 
> > Cc: John Hubbard 
> > Cc: Arnd Bergmann 
> > ---
> >  mm/hmm.c | 132 ++-
> >  1 file changed, 111 insertions(+), 21 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 64a33770813b..ce33151c6832 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -325,6 +325,7 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
> >  
> >  struct hmm_vma_walk {
> > struct hmm_range*range;
> > +   struct dev_pagemap  *pgmap;
> > unsigned long   last;
> > boolfault;
> > boolblock;
> > @@ -499,6 +500,15 @@ static inline uint64_t pmd_to_hmm_pfn_flags(struct 
> > hmm_range *range, pmd_t pmd)
> > range->flags[HMM_PFN_VALID];
> >  }
> >  
> > +static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t 
> > pud)
> > +{
> > +   if (!pud_present(pud))
> > +   return 0;
> > +   return pud_write(pud) ? range->flags[HMM_PFN_VALID] |
> > +   range->flags[HMM_PFN_WRITE] :
> > +   range->flags[HMM_PFN_VALID];
> > +}
> > +
> >  static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >   unsigned long addr,
> >   unsigned long end,
> > @@ -520,8 +530,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> > return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> >  
> > pfn = pmd_pfn(pmd) + pte_index(addr);
> > -   for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> > +   for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +   if (pmd_devmap(pmd)) {
> > +   hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> > + hmm_vma_walk->pgmap);
> > +   if (unlikely(!hmm_vma_walk->pgmap))
> > +   return -EBUSY;
> > +   }
> > pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
> > +   }
> > +   if (hmm_vma_walk->pgmap) {
> > +   put_dev_pagemap(hmm_vma_walk->pgmap);
> > +   hmm_vma_walk->pgmap = NULL;
> > +   }
> > hmm_vma_walk->last = end;
> > return 0;
> >  }
> > @@ -608,10 +629,24 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
> > unsigned long addr,
> > if (fault || write_fault)
> > goto fault;
> >  
> > +   if (pte_devmap(pte)) {
> > +   hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
> > + hmm_vma_walk->pgmap);
> > +   if (unlikely(!hmm_vma_walk->pgmap))
> > +   return -EBUSY;
> > +   } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) 
> > {
> > +   *pfn = range->values[HMM_PFN_SPECIAL];
> > +   return -EFAULT;
> > +   }
> > +
> > *pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
> 
>   
> 
> > return 0;
> >  
> >  fault:
> > +   if (hmm_vma_walk->pgmap) {
> > +   put_dev_pagemap(hmm_vma_walk->pgmap);
> > +   hmm_vma_walk->pgmap = NULL;
> > +   }
> > pte_unmap(ptep);
> > /* Fault any virtual address we were asked to fault */
> > return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> > @@ -699,12 +734,83 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> > return r;
> > }
> > }
> > +   if (hmm_vma_walk->pgmap) {
> > +   put_dev_pagemap(hmm_vma_walk->pgmap);
> > +   hmm_vma_walk->pgmap = NULL;
> > +   }
> 
> 
> Why is this here and not in hmm_vma_handle_pte()?  Unless I'm just getting
> tired this is the corresponding put when hmm_vma_handle_pte() returns 0 above
> at  above.

This is because get_dev_pagemap() optimizes away taking a new reference
if we already hold one on the correct dev_pagemap. So if we were
releasing the reference within hmm_vma_handle_pte() then we would lose
the get_dev_pagemap() optimization.
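
To make that concrete, the shape of the pattern is roughly this (a
simplified sketch, not the exact mm/hmm.c code):

/*
 * Sketch only: get_dev_pagemap(pfn, pgmap) hands back the cached pgmap
 * without taking a new reference when the pfn still falls inside it, so
 * the reference is kept in hmm_vma_walk across entries and dropped once
 * at the end of the run instead of once per PTE.
 */
static int walk_device_entries(struct hmm_vma_walk *hmm_vma_walk,
			       unsigned long first_pfn, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		/* cheap when the cached pgmap already covers this pfn */
		hmm_vma_walk->pgmap = get_dev_pagemap(first_pfn + i,
						      hmm_vma_walk->pgmap);
		if (unlikely(!hmm_vma_walk->pgmap))
			return -EBUSY;
		/* ... fill the corresponding pfns[] entry here ... */
	}
	/* drop the cached reference once, after the whole run */
	if (hmm_vma_walk->pgmap) {
		put_dev_pagemap(hmm_vma_walk->pgmap);
		hmm_vma_walk->pgmap = NULL;
	}
	return 0;
}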

Cheers,
Jérôme


Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 07:05:21PM -0700, John Hubbard wrote:
> On 3/28/19 6:59 PM, Jerome Glisse wrote:
> >>>>>> [...]
> >>>>> Indeed I did not realize there is an hmm "pfn" until I saw this 
> >>>>> function:
> >>>>>
> >>>>> /*
> >>>>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> >>>>>  * @range: range use to encode HMM pfn value
> >>>>>  * @pfn: pfn value for which to create the HMM pfn
> >>>>>  * Returns: valid HMM pfn for the pfn
> >>>>>  */
> >>>>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> >>>>> unsigned long pfn)
> >>>>>
> >>>>> So should this patch contain some sort of helper like this... maybe?
> >>>>>
> >>>>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> >>>>> discussed here?
> >>>>>
> >>>>> I'm also thinking calling it pfn is confusing.  I'm not advocating a 
> >>>>> new type
> >>>>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it 
> >>>>> would
> >>>>> have shortened the discussion here.
> >>>>>
> >>>>
> >>>> That helper is also use today by nouveau so changing that name is not 
> >>>> that
> >>>> easy it does require the multi-release dance. So i am not sure how much
> >>>> value there is in a name change.
> >>>>
> >>>
> >>> Once the dust settles, I would expect that a name change for this could go
> >>> via Andrew's tree, right? It seems incredible to claim that we've built 
> >>> something
> >>> that effectively does not allow any minor changes!
> >>>
> >>> I do think it's worth some *minor* trouble to improve the name, assuming 
> >>> that we
> >>> can do it in a simple patch, rather than some huge maintainer-level 
> >>> effort.
> >>
> >> Change to nouveau have to go through nouveau tree so changing name means:
> 
> Yes, I understand the guideline, but is that always how it must be done? Ben 
> (+cc)?

Yes, and it is not only about nouveau, it will be about every single
upstream driver using HMM. It is the easiest solution; all other
solutions involve coordination and/or the risk that whoever handles
the conflict does something that breaks things.

Cheers,
Jérôme


Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 09:42:59PM -0400, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 06:30:26PM -0700, John Hubbard wrote:
> > On 3/28/19 6:17 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> > >> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> > >>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > >>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> > >>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > >>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> > >>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > >>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> > >>>>>>>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> > >>>>>>>>>> From: Jérôme Glisse 
> > >>> [...]
> > >> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> > >>
> > >> /*
> > >>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> > >>  * @range: range use to encode HMM pfn value
> > >>  * @pfn: pfn value for which to create the HMM pfn
> > >>  * Returns: valid HMM pfn for the pfn
> > >>  */
> > >> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> > >> unsigned long pfn)
> > >>
> > >> So should this patch contain some sort of helper like this... maybe?
> > >>
> > >> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> > >> discussed here?
> > >>
> > >> I'm also thinking calling it pfn is confusing.  I'm not advocating a new 
> > >> type
> > >> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it 
> > >> would
> > >> have shortened the discussion here.
> > >>
> > > 
> > > That helper is also use today by nouveau so changing that name is not that
> > > easy it does require the multi-release dance. So i am not sure how much
> > > value there is in a name change.
> > > 
> > 
> > Once the dust settles, I would expect that a name change for this could go
> > via Andrew's tree, right? It seems incredible to claim that we've built 
> > something
> > that effectively does not allow any minor changes!
> > 
> > I do think it's worth some *minor* trouble to improve the name, assuming 
> > that we
> > can do it in a simple patch, rather than some huge maintainer-level effort.
> 
> Change to nouveau have to go through nouveau tree so changing name means:
>  -  release N add function with new name, maybe make the old function just
> a wrapper to the new function
>  -  release N+1 update user to use the new name
>  -  release N+2 remove the old name
> 
> So it is do-able but it is painful so i rather do that one latter that now
> as i am sure people will then complain again about some little thing and it
> will post pone this whole patchset on that new bit. To avoid post-poning
> RDMA and bunch of other patchset that build on top of that i rather get
> this patchset in and then do more changes in the next cycle.
> 
> This is just a capacity thing.

Also, for clarity: the API changes I am doing in this patchset are there to
make the ODP conversion easier and thus they bring real, hard value. Renaming
those functions is aesthetic; I am not saying it is useless, I am saying it
does not have the same value as those other changes, and I would rather not
miss another merge window just for aesthetic changes.

Cheers,
Jérôme


Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 06:18:35PM -0700, John Hubbard wrote:
> On 3/28/19 6:00 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> >> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> >>> On 3/28/19 2:21 PM, Jerome Glisse wrote:
> >>>> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> >>>>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> >>>>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> >>>>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> >>>>>>>> From: Jérôme Glisse 
> >>> [...]
> >>>>>>>> @@ -67,14 +78,9 @@ struct hmm {
> >>>>>>>>   */
> >>>>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> >>>>>>>>  {
> >>>>>>>> -struct hmm *hmm = READ_ONCE(mm->hmm);
> >>>>>>>> +struct hmm *hmm = mm_get_hmm(mm);
> >>>>>>>
> >>>>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> >>>>>>
> >>>>>> The thing is that you want only one hmm struct per process and thus
> >>>>>> if there is already one and it is not being destroy then you want to
> >>>>>> reuse it.
> >>>>>>
> >>>>>> Also this is all internal to HMM code and so it should not confuse
> >>>>>> anyone.
> >>>>>>
> >>>>>
> >>>>> Well, it has repeatedly come up, and I'd claim that it is quite 
> >>>>> counter-intuitive. So if there is an easy way to make this internal 
> >>>>> HMM code clearer or better named, I would really love that to happen.
> >>>>>
> >>>>> And we shouldn't ever dismiss feedback based on "this is just internal
> >>>>> xxx subsystem code, no need for it to be as clear as other parts of the
> >>>>> kernel", right?
> >>>>
> >>>> Yes but i have not seen any better alternative that present code. If
> >>>> there is please submit patch.
> >>>>
> >>>
> >>> Ira, do you have any patch you're working on, or a more detailed 
> >>> suggestion there?
> >>> If not, then I might (later, as it's not urgent) propose a small cleanup 
> >>> patch 
> >>> I had in mind for the hmm_register code. But I don't want to duplicate 
> >>> effort 
> >>> if you're already thinking about it.
> >>
> >> No I don't have anything.
> >>
> >> I was just really digging into these this time around and I was about to
> >> comment on the lack of "get's" for some "puts" when I realized that
> >> "hmm_register" _was_ the get...
> >>
> >> :-(
> >>
> > 
> > The get is mm_get_hmm() were you get a reference on HMM from a mm struct.
> > John in previous posting complained about me naming that function hmm_get()
> > and thus in this version i renamed it to mm_get_hmm() as we are getting
> > a reference on hmm from a mm struct.
> 
> Well, that's not what I recommended, though. The actual conversation went like
> this [1]:
> 
> ---
> >> So for this, hmm_get() really ought to be symmetric with
> >> hmm_put(), by taking a struct hmm*. And the null check is
> >> not helping here, so let's just go with this smaller version:
> >>
> >> static inline struct hmm *hmm_get(struct hmm *hmm)
> >> {
> >> if (kref_get_unless_zero(&hmm->kref))
> >> return hmm;
> >>
> >> return NULL;
> >> }
> >>
> >> ...and change the few callers accordingly.
> >>
> >
> > What about renaning hmm_get() to mm_get_hmm() instead ?
> >
> 
> For a get/put pair of functions, it would be ideal to pass
> the same argument type to each. It looks like we are passing
> around hmm*, and hmm retains a reference count on hmm->mm,
> so I think you have a choice of using either mm* or hmm* as
> the argument. I'm not sure that one is better than the other
> here, as the lifetimes appear to be linked pretty tightly.
> 
> Whichever one is used, I think it would be best to use it
> in both the _get() and _put() calls. 

Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 06:30:26PM -0700, John Hubbard wrote:
> On 3/28/19 6:17 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> >> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> >>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> >>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>>>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>>>>>>>>> From: Jérôme Glisse 
> >>> [...]
> >> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> >>
> >> /*
> >>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> >>  * @range: range use to encode HMM pfn value
> >>  * @pfn: pfn value for which to create the HMM pfn
> >>  * Returns: valid HMM pfn for the pfn
> >>  */
> >> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> >> unsigned long pfn)
> >>
> >> So should this patch contain some sort of helper like this... maybe?
> >>
> >> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> >> discussed here?
> >>
> >> I'm also thinking calling it pfn is confusing.  I'm not advocating a new 
> >> type
> >> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it 
> >> would
> >> have shortened the discussion here.
> >>
> > 
> > That helper is also use today by nouveau so changing that name is not that
> > easy it does require the multi-release dance. So i am not sure how much
> > value there is in a name change.
> > 
> 
> Once the dust settles, I would expect that a name change for this could go
> via Andrew's tree, right? It seems incredible to claim that we've built 
> something
> that effectively does not allow any minor changes!
> 
> I do think it's worth some *minor* trouble to improve the name, assuming that 
> we
> can do it in a simple patch, rather than some huge maintainer-level effort.

Changes to nouveau have to go through the nouveau tree, so changing the name means:
 -  release N: add a function with the new name, maybe make the old function
just a wrapper around the new function
 -  release N+1: update users to use the new name
 -  release N+2: remove the old name

So it is doable, but it is painful, so I would rather do it later than now,
as I am sure people will then complain again about some other little thing
and it will postpone this whole patchset on that new bit. To avoid postponing
RDMA and a bunch of other patchsets that build on top of this, I would rather
get this patchset in and then do more changes in the next cycle.

This is just a capacity thing.
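
For what it is worth, step one of that dance is trivial: the old name
becomes a thin wrapper so nothing in nouveau has to change in release N.
A self-contained sketch with invented names and a stub type, just to
show the shape of the staging, not the real helper:

#include <stdint.h>

struct range_stub {
	uint64_t valid_flag;
	uint8_t  pfn_shift;
};

/* release N: introduce the new name with the real implementation */
static inline uint64_t device_entry_from_pfn(const struct range_stub *range,
					     unsigned long pfn)
{
	return range->valid_flag | ((uint64_t)pfn << range->pfn_shift);
}

/* release N: keep the old name as a wrapper; remove it in release N+2 */
static inline uint64_t entry_from_pfn(const struct range_stub *range,
				      unsigned long pfn)
{
	return device_entry_from_pfn(range, pfn);
}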

Cheers,
Jérôme


Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> > On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> > >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> > >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> > >>>>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> > >>>>>>> From: Jérôme Glisse 
> > [...]
> > >> Hi Jerome,
> > >>
> > >> I think you're talking about flags, but I'm talking about the mask. The 
> > >> above link doesn't appear to use the pfn_flags_mask, and the 
> > >> default_flags 
> > >> that it uses are still in the same lower 3 bits:
> > >>
> > >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> > >> +ODP_READ_BIT,   /* HMM_PFN_VALID */
> > >> +ODP_WRITE_BIT,  /* HMM_PFN_WRITE */
> > >> +ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
> > >> +};
> > >>
> > >> So I still don't see why we need the flexibility of a full 
> > >> 0x
> > >> mask, that is *also* runtime changeable. 
> > > 
> > > So the pfn array is using a device driver specific format and we have
> > > no idea nor do we need to know where the valid, write, ... bit are in
> > > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > > we do not care. They are device with bit at the top and for those you
> > > need a mask that allows you to mask out those bits or not depending on
> > > what the user want to do.
> > > 
> > > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > > not presume where the bits will be and thus we can not presume what a
> > > proper mask is.
> > > 
> > > So that's why a full unsigned long mask is use here.
> > > 
> > > Maybe an example will help let say the device flag are:
> > > VALID (1 << 63)
> > > WRITE (1 << 62)
> > > 
> > > Now let say that device wants to fault with at least read a range
> > > it does set:
> > > range->default_flags = (1 << 63)
> > > range->pfn_flags_mask = 0;
> > > 
> > > This will fill fault all page in the range with at least read
> > > permission.
> > > 
> > > Now let say it wants to do the same except for one page in the range
> > > for which its want to have write. Now driver set:
> > > range->default_flags = (1 << 63);
> > > range->pfn_flags_mask = (1 << 62);
> > > range->pfns[index_of_write] = (1 << 62);
> > > 
> > > With this HMM will fault in all page with at least read (ie valid)
> > > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > > will fault with write permission ie if the CPU pte does not have
> > > write permission set then handle_mm_fault() will be call asking for
> > > write permission.
> > > 
> > > 
> > > Note that in the above HMM will populate the pfns array with write
> > > permission for any entry that have write permission within the CPU
> > > pte ie the default_flags and pfn_flags_mask is only the minimun
> > > requirement but HMM always returns all the flag that are set in the
> > > CPU pte.
> > > 
> > > 
> > > Now let say you are an "old" driver like nouveau upstream, then it
> > > means that you are setting each individual entry within range->pfns
> > > with the exact flags you want for each address hence here what you
> > > want is:
> > > range->default_flags = 0;
> > > range->pfn_flags_mask = -1UL;
> > > 
> > > So that what we do is (for each entry):
> > > (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > > and we end up with the flags that were set by the driver for each of
> > > the individual range->pfns entries.
> > > 
> > > 
> > > Does this help ?
> > > 
> > 
> > Yes, the key point for me was that this is an entirely device driver 
> > specific
> > format. OK. But then we have HMM setting it. So a com

Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> > On 3/28/19 2:21 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> > >> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> > >>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> > >>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> > >>>>> From: Jérôme Glisse 
> > [...]
> > >>>>> @@ -67,14 +78,9 @@ struct hmm {
> > >>>>>   */
> > >>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> > >>>>>  {
> > >>>>> - struct hmm *hmm = READ_ONCE(mm->hmm);
> > >>>>> + struct hmm *hmm = mm_get_hmm(mm);
> > >>>>
> > >>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> > >>>
> > >>> The thing is that you want only one hmm struct per process and thus
> > >>> if there is already one and it is not being destroy then you want to
> > >>> reuse it.
> > >>>
> > >>> Also this is all internal to HMM code and so it should not confuse
> > >>> anyone.
> > >>>
> > >>
> > >> Well, it has repeatedly come up, and I'd claim that it is quite 
> > >> counter-intuitive. So if there is an easy way to make this internal 
> > >> HMM code clearer or better named, I would really love that to happen.
> > >>
> > >> And we shouldn't ever dismiss feedback based on "this is just internal
> > >> xxx subsystem code, no need for it to be as clear as other parts of the
> > >> kernel", right?
> > > 
> > > Yes but i have not seen any better alternative that present code. If
> > > there is please submit patch.
> > > 
> > 
> > Ira, do you have any patch you're working on, or a more detailed suggestion 
> > there?
> > If not, then I might (later, as it's not urgent) propose a small cleanup 
> > patch 
> > I had in mind for the hmm_register code. But I don't want to duplicate 
> > effort 
> > if you're already thinking about it.
> 
> No I don't have anything.
> 
> I was just really digging into these this time around and I was about to
> comment on the lack of "get's" for some "puts" when I realized that
> "hmm_register" _was_ the get...
> 
> :-(
> 

The get is mm_get_hmm(), where you get a reference on the HMM struct from
a mm struct. John in a previous posting complained about me naming that
function hmm_get(), and thus in this version I renamed it to mm_get_hmm()
as we are getting a reference on hmm from a mm struct.

The hmm_put() is just releasing the reference on the hmm struct.

Here I feel I am getting contradictory requirements from different people.
I don't think there is a way to please everyone here.
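
For anyone skimming the thread, the shape of the pair being argued about
is roughly this (a sketch based on the kref usage quoted above, not the
exact patch; hmm_free() is a stand-in name for the real release callback):

static struct hmm *mm_get_hmm(struct mm_struct *mm)
{
	struct hmm *hmm = READ_ONCE(mm->hmm);

	/* hand out a reference only if the struct is not already dying */
	if (hmm && kref_get_unless_zero(&hmm->kref))
		return hmm;

	return NULL;
}

static void hmm_put(struct hmm *hmm)
{
	/* hmm_free() stands in for the kref release callback */
	kref_put(&hmm->kref, hmm_free);
}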

Cheers,
Jérôme


Re: [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 09:12:21AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:06AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse 
> > 
> > A common use case for HMM mirror is user trying to mirror a range
> > and before they could program the hardware it get invalidated by
> > some core mm event. Instead of having user re-try right away to
> > mirror the range provide a completion mechanism for them to wait
> > for any active invalidation affecting the range.
> > 
> > This also changes how hmm_range_snapshot() and hmm_range_fault()
> > works by not relying on vma so that we can drop the mmap_sem
> > when waiting and lookup the vma again on retry.
> > 
> > Changes since v1:
> > - squashed: Dan Carpenter: potential deadlock in nonblocking code
> > 
> > Signed-off-by: Jérôme Glisse 
> > Reviewed-by: Ralph Campbell 
> > Cc: Andrew Morton 
> > Cc: John Hubbard 
> > Cc: Dan Williams 
> > Cc: Dan Carpenter 
> > Cc: Matthew Wilcox 
> > ---
> >  include/linux/hmm.h | 208 ++---
> >  mm/hmm.c| 528 +---
> >  2 files changed, 428 insertions(+), 308 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index e9afd23c2eac..79671036cb5f 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -77,8 +77,34 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> > -struct hmm;
> > +
> > +/*
> > + * struct hmm - HMM per mm struct
> > + *
> > + * @mm: mm struct this HMM struct is bound to
> > + * @lock: lock protecting ranges list
> > + * @ranges: list of range being snapshotted
> > + * @mirrors: list of mirrors for this mm
> > + * @mmu_notifier: mmu notifier to track updates to CPU page table
> > + * @mirrors_sem: read/write semaphore protecting the mirrors list
> > + * @wq: wait queue for user waiting on a range invalidation
> > + * @notifiers: count of active mmu notifiers
> > + * @dead: is the mm dead ?
> > + */
> > +struct hmm {
> > +   struct mm_struct*mm;
> > +   struct kref kref;
> > +   struct mutexlock;
> > +   struct list_headranges;
> > +   struct list_headmirrors;
> > +   struct mmu_notifier mmu_notifier;
> > +   struct rw_semaphore mirrors_sem;
> > +   wait_queue_head_t   wq;
> > +   longnotifiers;
> > +   booldead;
> > +};
> >  
> >  /*
> >   * hmm_pfn_flag_e - HMM flag enums
> > @@ -155,6 +181,38 @@ struct hmm_range {
> > boolvalid;
> >  };
> >  
> > +/*
> > + * hmm_range_wait_until_valid() - wait for range to be valid
> > + * @range: range affected by invalidation to wait on
> > + * @timeout: time out for wait in ms (ie abort wait after that period of 
> > time)
> > + * Returns: true if the range is valid, false otherwise.
> > + */
> > +static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> > + unsigned long timeout)
> > +{
> > +   /* Check if mm is dead ? */
> > +   if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > +   range->valid = false;
> > +   return false;
> > +   }
> > +   if (range->valid)
> > +   return true;
> > +   wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +  msecs_to_jiffies(timeout));
> > +   /* Return current valid status just in case we get lucky */
> > +   return range->valid;
> > +}
> > +
> > +/*
> > + * hmm_range_valid() - test if a range is valid or not
> > + * @range: range
> > + * Returns: true if the range is valid, false otherwise.
> > + */
> > +static inline bool hmm_range_valid(struct hmm_range *range)
> > +{
> > +   return range->valid;
> > +}
> > +
> >  /*
> >   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> >   * @range: range use to decode HMM pfn value
> > @@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror 
> > *mirror);
> >  
> >  
> >  /*
> > - * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a 
> > device
> > - * driver lock that serializes device page table updates, then call
> > - * hmm_vma_range_done(), to check if the snapshot is still valid. The same
> >

Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse 
> [...]
> >> Hi Jerome,
> >>
> >> I think you're talking about flags, but I'm talking about the mask. The 
> >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> >> that it uses are still in the same lower 3 bits:
> >>
> >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> >> +  ODP_READ_BIT,   /* HMM_PFN_VALID */
> >> +  ODP_WRITE_BIT,  /* HMM_PFN_WRITE */
> >> +  ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
> >> +};
> >>
> >> So I still don't see why we need the flexibility of a full 
> >> 0x
> >> mask, that is *also* runtime changeable. 
> > 
> > So the pfn array is using a device driver specific format and we have
> > no idea nor do we need to know where the valid, write, ... bit are in
> > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > we do not care. They are device with bit at the top and for those you
> > need a mask that allows you to mask out those bits or not depending on
> > what the user want to do.
> > 
> > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > not presume where the bits will be and thus we can not presume what a
> > proper mask is.
> > 
> > So that's why a full unsigned long mask is use here.
> > 
> > Maybe an example will help let say the device flag are:
> > VALID (1 << 63)
> > WRITE (1 << 62)
> > 
> > Now let say that device wants to fault with at least read a range
> > it does set:
> > range->default_flags = (1 << 63)
> > range->pfn_flags_mask = 0;
> > 
> > This will fill fault all page in the range with at least read
> > permission.
> > 
> > Now let say it wants to do the same except for one page in the range
> > for which its want to have write. Now driver set:
> > range->default_flags = (1 << 63);
> > range->pfn_flags_mask = (1 << 62);
> > range->pfns[index_of_write] = (1 << 62);
> > 
> > With this HMM will fault in all page with at least read (ie valid)
> > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > will fault with write permission ie if the CPU pte does not have
> > write permission set then handle_mm_fault() will be call asking for
> > write permission.
> > 
> > 
> > Note that in the above HMM will populate the pfns array with write
> > permission for any entry that have write permission within the CPU
> > pte ie the default_flags and pfn_flags_mask is only the minimun
> > requirement but HMM always returns all the flag that are set in the
> > CPU pte.
> > 
> > 
> > Now let say you are an "old" driver like nouveau upstream, then it
> > means that you are setting each individual entry within range->pfns
> > with the exact flags you want for each address hence here what you
> > want is:
> > range->default_flags = 0;
> > range->pfn_flags_mask = -1UL;
> > 
> > So that what we do is (for each entry):
> > (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > and we end up with the flags that were set by the driver for each of
> > the individual range->pfns entries.
> > 
> > 
> > Does this help ?
> > 
> 
> Yes, the key point for me was that this is an entirely device driver specific
> format. OK. But then we have HMM setting it. So a comment to the effect that
> this is device-specific might be nice, but I'll leave that up to you whether
> it is useful.

The code you were pointing at is temporary, ie once this gets merged that
code will get removed in release N+2: merge the code in N, update nouveau
in N+1 and remove this temporary code in N+2.

When updating the HMM API it is easier to stage the API update over releases
like that, so there is no need to synchronize across multiple trees (mm, drm,
rdma, ...).

> Either way, you can add:
> 
>   Reviewed-by: John Hubbard 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 04:20:37PM -0700, John Hubbard wrote:
> On 3/28/19 4:05 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:40 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >>>>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>>>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>>>>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>>>>>>>> From: Jérôme Glisse 
> >>>> [...]
> >>>> OK, so let's either drop this patch, or if merge windows won't allow 
> >>>> that,
> >>>> then *eventually* drop this patch. And instead, put in a 
> >>>> hmm_sanity_check()
> >>>> that does the same checks.
> >>>
> >>> RDMA depends on this, so does the nouveau patchset that convert to new 
> >>> API.
> >>> So i do not see reason to drop this. They are user for this they are 
> >>> posted
> >>> and i hope i explained properly the benefit.
> >>>
> >>> It is a common pattern. Yes it only save couple lines of code but down the
> >>> road i will also help for people working on the mmap_sem patchset.
> >>>
> >>
> >> It *adds* a couple of lines that are misleading, because they look like 
> >> they
> >> make things safer, but they don't actually do so.
> > 
> > It is not about safety, sorry if it confused you but there is nothing about
> > safety here, i can add a big fat comment that explains that there is no 
> > safety
> > here. The intention is to allow the page fault handler that potential have
> > hundred of page fault queue up to abort as soon as it sees that it is 
> > pointless
> > to keep faulting on a dying process.
> > 
> > Again if we race it is _fine_ nothing bad will happen, we are just doing 
> > use-
> > less work that gonna be thrown on the floor and we are just slowing down the
> > process tear down.
> > 
> 
> In addition to a comment, how about naming this thing to indicate the above 
> intention?  I have a really hard time with this odd down_read() wrapper, which
> allows code to proceed without really getting a lock. It's just too 
> wrong-looking.
> If it were instead named:
> 
>   hmm_is_exiting()

What about: hmm_lock_mmap_if_alive() ?


> 
> and had a comment about why racy is OK, then I'd be a lot happier. :)

Will add fat comment.

Cheers,
Jérôme


Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>>>> From: Jérôme Glisse 
> >>>>>
> >>>>> The HMM mirror API can be use in two fashions. The first one where the 
> >>>>> HMM
> >>>>> user coalesce multiple page faults into one request and set flags per 
> >>>>> pfns
> >>>>> for of those faults. The second one where the HMM user want to 
> >>>>> pre-fault a
> >>>>> range with specific flags. For the latter one it is a waste to have the 
> >>>>> user
> >>>>> pre-fill the pfn arrays with a default flags value.
> >>>>>
> >>>>> This patch adds a default flags value allowing user to set them for a 
> >>>>> range
> >>>>> without having to pre-fill the pfn array.
> >>>>>
> >>>>> Signed-off-by: Jérôme Glisse 
> >>>>> Reviewed-by: Ralph Campbell 
> >>>>> Cc: Andrew Morton 
> >>>>> Cc: John Hubbard 
> >>>>> Cc: Dan Williams 
> >>>>> ---
> >>>>>  include/linux/hmm.h |  7 +++
> >>>>>  mm/hmm.c| 12 
> >>>>>  2 files changed, 19 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>>>> index 79671036cb5f..13bc2c72f791 100644
> >>>>> --- a/include/linux/hmm.h
> >>>>> +++ b/include/linux/hmm.h
> >>>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >>>>>   * @pfns: array of pfns (big enough for the range)
> >>>>>   * @flags: pfn flags to match device driver page table
> >>>>>   * @values: pfn value for some special case (none, special, error, ...)
> >>>>> + * @default_flags: default flags for the range (write, read, ...)
> >>>>> + * @pfn_flags_mask: allows to mask pfn flags so that only 
> >>>>> default_flags matter
> >>>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >>>>>   * @valid: pfns array did not change since it has been fill by an HMM 
> >>>>> function
> >>>>>   */
> >>>>> @@ -177,6 +179,8 @@ struct hmm_range {
> >>>>> uint64_t*pfns;
> >>>>> const uint64_t  *flags;
> >>>>> const uint64_t  *values;
> >>>>> +   uint64_tdefault_flags;
> >>>>> +   uint64_tpfn_flags_mask;
> >>>>> uint8_t pfn_shift;
> >>>>> boolvalid;
> >>>>>  };
> >>>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range 
> >>>>> *range, bool block)
> >>>>>  {
> >>>>> long ret;
> >>>>>  
> >>>>> +   range->default_flags = 0;
> >>>>> +   range->pfn_flags_mask = -1UL;
> >>>>
> >>>> Hi Jerome,
> >>>>
> >>>> This is nice to have. Let's constrain it a little bit more, though: the 
> >>>> pfn_flags_mask
> >>>> definitely does not need to be a run time value. And we want some 
> >>>> assurance that
> >>>> the mask is 
> >>>>  a) large enough for the flags, and
> >>>>  b) small enough to avoid overrunning the pfns field.
> >>>>
> >>>> Those are less certain with a run-time struct field, and more obviously 
> >>>> correct with
> >>>> something like, approximately:
> >>>>
> >>>>  #define PFN_FLAGS_MASK 0x
> >>>>
> >>>> or something.
> >>>>
> >>>> In other words, this is more flexibility than we need--just a touch too 
> >>>> much,
> >>>> IMHO.
> >>>
> >>> This mirror the fact that flags are provided as an array and some devices 
> >>> use
> >>> the top bits for flags (read,

Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
> On 3/28/19 3:40 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse 
> >> [...]
> >>>>
> >>>>>>
> >>>>>> If you insist on having this wrapper, I think it should have 
> >>>>>> approximately 
> >>>>>> this form:
> >>>>>>
> >>>>>> void hmm_mirror_mm_down_read(...)
> >>>>>> {
> >>>>>>WARN_ON(...)
> >>>>>>down_read(...)
> >>>>>> } 
> >>>>>
> >>>>> I do insist as it is useful and use by both RDMA and nouveau and the
> >>>>> above would kill the intent. The intent is do not try to take the lock
> >>>>> if the process is dying.
> >>>>
> >>>> Could you provide me a link to those examples so I can take a peek? I
> >>>> am still convinced that this whole thing is a race condition at best.
> >>>
> >>> The race is fine and ok see:
> >>>
> >>> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> >>>
> >>> which has been posted and i think i provided a link in the cover
> >>> letter to that post. The same patch exist for nouveau i need to
> >>> cleanup that tree and push it.
> >>
> >> Thanks for that link, and I apologize for not keeping up with that
> >> other review thread.
> >>
> >> Looking it over, hmm_mirror_mm_down_read() is only used in one place.
> >> So, what you really want there is not a down_read() wrapper, but rather,
> >> something like
> >>
> >>hmm_sanity_check()
> >>
> >> , that ib_umem_odp_map_dma_pages() calls.
> > 
> > Why ? The device driver pattern is:
> > if (hmm_is_it_dying()) {
> > // handle when process die and abort the fault ie useless
> > // to call within HMM
> > }
> > down_read(mmap_sem);
> > 
> > This pattern is common within nouveau and RDMA and other device driver in
> > the work. Hence why i am replacing it with just one helper. Also it has the
> > added benefit that changes being discussed around the mmap sem will be 
> > easier
> > to do as it avoid having to update each driver but instead it can be done
> > just once for the HMM helpers.
> 
> Yes, and I'm saying that the pattern is broken. Because it's racy. :)

And I explained why it is fine: it is just an optimization. In most cases
it takes time to tear down a process, and the device page fault handler
can be triggered while that happens, so instead of letting it pile more
work on we can detect that, even if the detection is racy. It is just
about avoiding useless work. There is nothing about correctness here. It
does not need to identify a dying process with 100% accuracy. The fact
that the process is dying will be identified race free later on; it just
means that in the meantime we are doing useless work, potentially tons of
useless work.

There is hardware that can storm the page fault handler and we end up
with hundreds of page faults queued up against a process that might be
dying. It is a big waste to go over all those faults and do work that
will be thrown on the floor later on.

> 
> >>>>>>> +{
> >>>>>>> + struct mm_struct *mm;
> >>>>>>> +
> >>>>>>> + /* Sanity check ... */
> >>>>>>> + if (!mirror || !mirror->hmm)
> >>>>>>> + return -EINVAL;
> >>>>>>> + /*
> >>>>>>> +  * Before trying to take the mmap_sem make sure the mm is still
> >>>>>>> +  * alive as device driver context might outlive the mm lifetime.
> >>>>>>
> >>>>>> Let's find another way, and a better place, to solve this problem.
> >>>>>> Ref counting?
> >>>>>
> >>>>> This has nothing to do with refcount or use after free or anthing
> >>>>> like tha

Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>>>> From: Jérôme Glisse 
> [...]
> >>
> >>>>
> >>>> If you insist on having this wrapper, I think it should have 
> >>>> approximately 
> >>>> this form:
> >>>>
> >>>> void hmm_mirror_mm_down_read(...)
> >>>> {
> >>>>  WARN_ON(...)
> >>>>  down_read(...)
> >>>> } 
> >>>
> >>> I do insist as it is useful and use by both RDMA and nouveau and the
> >>> above would kill the intent. The intent is do not try to take the lock
> >>> if the process is dying.
> >>
> >> Could you provide me a link to those examples so I can take a peek? I
> >> am still convinced that this whole thing is a race condition at best.
> > 
> > The race is fine and ok see:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> > 
> > which has been posted and i think i provided a link in the cover
> > letter to that post. The same patch exist for nouveau i need to
> > cleanup that tree and push it.
> 
> Thanks for that link, and I apologize for not keeping up with that
> other review thread.
> 
> Looking it over, hmm_mirror_mm_down_read() is only used in one place.
> So, what you really want there is not a down_read() wrapper, but rather,
> something like
> 
>   hmm_sanity_check()
> 
> , that ib_umem_odp_map_dma_pages() calls.

Why? The device driver pattern is:
if (hmm_is_it_dying()) {
// the process is dying: abort the fault, it is useless
// to call into HMM
}
down_read(mmap_sem);

This pattern is common within nouveau, RDMA and other device drivers in
the works, hence why I am replacing it with just one helper (see the sketch
below). It also has the added benefit that the changes being discussed
around the mmap_sem will be easier to do, as it avoids having to update
each driver; instead it can be done just once in the HMM helpers.
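
Roughly, the helper folds that pattern into something like this (a sketch
along the lines of the posted ODP patch, not the exact code):

/*
 * Sketch only.  The dying check is deliberately racy: losing the race
 * just means some wasted fault work; the real teardown is caught race
 * free later on.
 */
static int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
{
	struct mm_struct *mm;

	/* sanity check */
	if (!mirror || !mirror->hmm)
		return -EINVAL;

	/* bail out early if the mirrored mm is already being torn down */
	mm = READ_ONCE(mirror->hmm->mm);
	if (mirror->hmm->dead || mm == NULL)
		return -EINVAL;

	down_read(&mm->mmap_sem);
	return 0;
}

A driver fault handler then just returns early when this fails instead of
queuing up work for a process that is going away.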

> 
> 
> > 
> >>>
> >>>
> >>>>
> >>>>> +{
> >>>>> +   struct mm_struct *mm;
> >>>>> +
> >>>>> +   /* Sanity check ... */
> >>>>> +   if (!mirror || !mirror->hmm)
> >>>>> +   return -EINVAL;
> >>>>> +   /*
> >>>>> +* Before trying to take the mmap_sem make sure the mm is still
> >>>>> +* alive as device driver context might outlive the mm lifetime.
> >>>>
> >>>> Let's find another way, and a better place, to solve this problem.
> >>>> Ref counting?
> >>>
> >>> This has nothing to do with refcount or use after free or anthing
> >>> like that. It is just about checking wether we are about to do
> >>> something pointless. If the process is dying then it is pointless
> >>> to try to take the lock and it is pointless for the device driver
> >>> to trigger handle_mm_fault().
> >>
> >> Well, what happens if you let such pointless code run anyway? 
> >> Does everything still work? If yes, then we don't need this change.
> >> If no, then we need a race-free version of this change.
> > 
> > Yes everything work, nothing bad can happen from a race, it will just
> > do useless work which never hurt anyone.
> > 
> 
> OK, so let's either drop this patch, or if merge windows won't allow that,
> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
> that does the same checks.

RDMA depends on this, and so does the nouveau patchset that converts to the
new API. So I do not see a reason to drop this. There are users for this,
they are posted, and I hope I explained the benefit properly.

It is a common pattern. Yes, it only saves a couple of lines of code, but
down the road it will also help the people working on the mmap_sem patchset.


Cheers,
Jérôme


Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>> From: Jérôme Glisse 
> >>>
> >>> The HMM mirror API can be use in two fashions. The first one where the HMM
> >>> user coalesce multiple page faults into one request and set flags per pfns
> >>> for of those faults. The second one where the HMM user want to pre-fault a
> >>> range with specific flags. For the latter one it is a waste to have the 
> >>> user
> >>> pre-fill the pfn arrays with a default flags value.
> >>>
> >>> This patch adds a default flags value allowing user to set them for a 
> >>> range
> >>> without having to pre-fill the pfn array.
> >>>
> >>> Signed-off-by: Jérôme Glisse 
> >>> Reviewed-by: Ralph Campbell 
> >>> Cc: Andrew Morton 
> >>> Cc: John Hubbard 
> >>> Cc: Dan Williams 
> >>> ---
> >>>  include/linux/hmm.h |  7 +++
> >>>  mm/hmm.c| 12 
> >>>  2 files changed, 19 insertions(+)
> >>>
> >>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>> index 79671036cb5f..13bc2c72f791 100644
> >>> --- a/include/linux/hmm.h
> >>> +++ b/include/linux/hmm.h
> >>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >>>   * @pfns: array of pfns (big enough for the range)
> >>>   * @flags: pfn flags to match device driver page table
> >>>   * @values: pfn value for some special case (none, special, error, ...)
> >>> + * @default_flags: default flags for the range (write, read, ...)
> >>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags 
> >>> matter
> >>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >>>   * @valid: pfns array did not change since it has been fill by an HMM 
> >>> function
> >>>   */
> >>> @@ -177,6 +179,8 @@ struct hmm_range {
> >>>   uint64_t*pfns;
> >>>   const uint64_t  *flags;
> >>>   const uint64_t  *values;
> >>> + uint64_tdefault_flags;
> >>> + uint64_tpfn_flags_mask;
> >>>   uint8_t pfn_shift;
> >>>   boolvalid;
> >>>  };
> >>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range 
> >>> *range, bool block)
> >>>  {
> >>>   long ret;
> >>>  
> >>> + range->default_flags = 0;
> >>> + range->pfn_flags_mask = -1UL;
> >>
> >> Hi Jerome,
> >>
> >> This is nice to have. Let's constrain it a little bit more, though: the 
> >> pfn_flags_mask
> >> definitely does not need to be a run time value. And we want some 
> >> assurance that
> >> the mask is 
> >>a) large enough for the flags, and
> >>b) small enough to avoid overrunning the pfns field.
> >>
> >> Those are less certain with a run-time struct field, and more obviously 
> >> correct with
> >> something like, approximately:
> >>
> >>#define PFN_FLAGS_MASK 0x
> >>
> >> or something.
> >>
> >> In other words, this is more flexibility than we need--just a touch too 
> >> much,
> >> IMHO.
> > 
> > This mirror the fact that flags are provided as an array and some devices 
> > use
> > the top bits for flags (read, write, ...). So here it is the safe default to
> > set it to -1. If the caller want to leverage this optimization it can 
> > override
> > the default_flags value.
> > 
> 
> Optimization? OK, now I'm a bit lost. Maybe this is another place where I 
> could
> use a peek at the calling code. The only flags I've seen so far use the bottom
> 3 bits and that's it. 
> 
> Maybe comments here?
> 
> >>
> >>> +
> >>>   ret = hmm_range_register(range, range->vma->vm_mm,
> >>>range->start, range->end);
> >>>   if (ret)
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index fa9498eeb9b6..4fe88a196d17 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(c

Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > The HMM mirror API can be used in two fashions. The first one is where the
> > HMM user coalesces multiple page faults into one request and sets flags per
> > pfn for each of those faults. The second one is where the HMM user wants to
> > pre-fault a range with specific flags. For the latter it is a waste to have
> > the user pre-fill the pfn arrays with a default flags value.
> > 
> > This patch adds a default flags value, allowing the user to set flags for a
> > whole range without having to pre-fill the pfn array.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Reviewed-by: Ralph Campbell 
> > Cc: Andrew Morton 
> > Cc: John Hubbard 
> > Cc: Dan Williams 
> > ---
> >  include/linux/hmm.h |  7 +++++++
> >  mm/hmm.c            | 12 ++++++++++++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 79671036cb5f..13bc2c72f791 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >   * @pfns: array of pfns (big enough for the range)
> >   * @flags: pfn flags to match device driver page table
> >   * @values: pfn value for some special case (none, special, error, ...)
> > + * @default_flags: default flags for the range (write, read, ...)
> > + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags 
> > matter
> >   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >   * @valid: pfns array did not change since it has been fill by an HMM 
> > function
> >   */
> > @@ -177,6 +179,8 @@ struct hmm_range {
> > uint64_t*pfns;
> > const uint64_t  *flags;
> > const uint64_t  *values;
> > +   uint64_tdefault_flags;
> > +   uint64_tpfn_flags_mask;
> > uint8_t pfn_shift;
> > boolvalid;
> >  };
> > @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range 
> > *range, bool block)
> >  {
> > long ret;
> >  
> > +   range->default_flags = 0;
> > +   range->pfn_flags_mask = -1UL;
> 
> Hi Jerome,
> 
> This is nice to have. Let's constrain it a little bit more, though: the 
> pfn_flags_mask
> definitely does not need to be a run time value. And we want some assurance 
> that
> the mask is 
>   a) large enough for the flags, and
>   b) small enough to avoid overrunning the pfns field.
> 
> Those are less certain with a run-time struct field, and more obviously 
> correct with
> something like, approximately:
> 
>   #define PFN_FLAGS_MASK 0x
> 
> or something.
> 
> In other words, this is more flexibility than we need--just a touch too much,
> IMHO.

This mirrors the fact that flags are provided as an array and some devices use
the top bits for flags (read, write, ...). So the safe default here is to set
the mask to -1. If the caller wants to leverage this optimization it can
override the default_flags value.

> 
> > +
> > ret = hmm_range_register(range, range->vma->vm_mm,
> >  range->start, range->end);
> > if (ret)
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index fa9498eeb9b6..4fe88a196d17 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct 
> > hmm_vma_walk *hmm_vma_walk,
> > if (!hmm_vma_walk->fault)
> > return;
> >  
> > +   /*
> > +* So we not only consider the individual per page request we also
> > +* consider the default flags requested for the range. The API can
> > +* be use in 2 fashions. The first one where the HMM user coalesce
> > +* multiple page fault into one request and set flags per pfns for
> > +* of those faults. The second one where the HMM user want to pre-
> > +* fault a range with specific flags. For the latter one it is a
> > +* waste to have the user pre-fill the pfn arrays with a default
> > +* flags value.
> > +*/
> > +   pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> 
> Need to verify that the mask isn't too large or too small.

I need to check again, but the default flags are ANDed somewhere to limit
the bits to the ones we expect.

Cheers,
Jérôme
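
[For illustration only, not code from the posted series: a minimal sketch of
how a driver that keeps its permission bits in the upper pfn bits could use
default_flags and pfn_flags_mask to pre-fault a whole range for write without
pre-filling the pfns array. The MY_PFN_* values, my_flags table and
my_prefault_range_for_write() are hypothetical; the struct hmm_range fields and
hmm_range_fault() are taken from this series, and the HMM_PFN_* indices are
assumed to be the existing enum in include/linux/hmm.h.]

#include <linux/hmm.h>

/* Hypothetical driver flag encoding, kept in the upper pfn bits. */
#define MY_PFN_VALID	(1ULL << 63)
#define MY_PFN_WRITE	(1ULL << 62)

static const uint64_t my_flags[HMM_PFN_FLAG_MAX] = {
	[HMM_PFN_VALID] = MY_PFN_VALID,
	[HMM_PFN_WRITE] = MY_PFN_WRITE,
};

/*
 * Pre-fault the whole range for write. Assumes the caller has registered
 * the range (hmm_range_register()) and holds the mmap_sem in read mode.
 */
static long my_prefault_range_for_write(struct hmm_range *range)
{
	range->flags = my_flags;
	/* Mask off whatever is in range->pfns: no per-pfn requests... */
	range->pfn_flags_mask = 0;
	/* ...and instead request valid + write for every page in the range. */
	range->default_flags = MY_PFN_VALID | MY_PFN_WRITE;

	return hmm_range_fault(range, true);
}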


Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >> On 3/25/19 7:40 AM, jgli...@redhat.com wrote:
> >>> From: Jérôme Glisse 
> >>>
> >>> The device driver context which holds a reference to the mirror, and thus
> >>> to the core hmm struct, might outlive the mm against which it was created.
> >>> To avoid every driver having to check for that case, provide a helper that
> >>> checks whether the mm is still alive and takes the mmap_sem in read mode if
> >>> so. If the mm has been destroyed (the mmu_notifier release callback did
> >>> happen) then we return -EINVAL so that calling code knows that it is trying
> >>> to do something against an mm that is no longer valid.
> >>>
> >>> Changes since v1:
> >>> - removed a bunch of useless checks (if the API is used with bogus
> >>>   arguments it is better to fail loudly so users fix their code)
> >>>
> >>> Signed-off-by: Jérôme Glisse 
> >>> Reviewed-by: Ralph Campbell 
> >>> Cc: Andrew Morton 
> >>> Cc: John Hubbard 
> >>> Cc: Dan Williams 
> >>> ---
> >>>  include/linux/hmm.h | 50 ++---
> >>>  1 file changed, 47 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>> index f3b919b04eda..5f9deaeb9d77 100644
> >>> --- a/include/linux/hmm.h
> >>> +++ b/include/linux/hmm.h
> >>> @@ -438,6 +438,50 @@ struct hmm_mirror {
> >>>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> >>>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >>>  
> >>> +/*
> >>> + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> >>> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> >>> + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> >>> + *
> >>> + * The device driver context which holds reference to mirror and thus to 
> >>> core
> >>> + * hmm struct might outlive the mm against which it was created. To 
> >>> avoid every
> >>> + * driver to check for that case provide an helper that check if mm is 
> >>> still
> >>> + * alive and take the mmap_sem in read mode if so. If the mm have been 
> >>> destroy
> >>> + * (mmu_notifier release call back did happen) then we return -EINVAL so 
> >>> that
> >>> + * calling code knows that it is trying to do something against a mm 
> >>> that is
> >>> + * no longer valid.
> >>> + */
> >>> +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
> >>
> >> Hi Jerome,
> >>
> >> Let's please not do this. There are at least two problems here:
> >>
> >> 1. The hmm_mirror_mm_down_read() wrapper around down_read() requires a 
> >> return value. This is counter to how locking is normally done: callers do
> >> not normally have to check the return value of most locks (other than
> >> trylocks). And sure enough, your own code below doesn't check the return 
> >> value.
> >> That is a pretty good illustration of why not to do this.
> > 
> > Please read the function description: this is not about checking a lock
> > return value, it is about checking whether we are racing with process
> > destruction and avoiding taking the lock in such cases, so that the driver
> > aborts as quickly as possible when a process is being killed.
> > 
> >>
> >> 2. This is a weird place to randomly check for semi-unrelated state, such 
> >> as "is HMM still alive". By that I mean, if you have to detect a problem
> >> at down_read() time, then the problem could have existed both before and
> >> after the call to this wrapper. So it is providing a false sense of 
> >> security,
> >> and it is therefore actually undesirable to add the code.
> > 
> > It is not, this function is used in the device page fault handler, which
> > happens asynchronously from CPU events or process lifetime. When a process
> > is killed or is dying we do want to avoid useless page fault work and
> > we do want to avoid blocking the page fault queue of the device. This
> > function reports to the caller that the process is dying and that it
> > should just abort the page fault and do whatev
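
[For illustration only, a sketch rather than code from the series: the shape of
a device fault handler built around the proposed helper. Only
hmm_mirror_mm_down_read() is from this patch; struct my_mirror,
my_handle_device_fault() and the unlock path are assumptions.]

#include <linux/hmm.h>
#include <linux/mm.h>

/* Hypothetical per-device wrapper around the HMM mirror. */
struct my_mirror {
	struct hmm_mirror mirror;
	struct mm_struct *mm;	/* mm the mirror was registered against */
};

static int my_handle_device_fault(struct my_mirror *m, unsigned long addr)
{
	/*
	 * If the process is already being torn down (the mmu_notifier
	 * release callback ran), abort the device fault immediately
	 * instead of queueing useless work behind a dead mm.
	 */
	if (hmm_mirror_mm_down_read(&m->mirror))
		return -EINVAL;

	/* ... snapshot or fault the range covering @addr here ... */

	up_read(&m->mm->mmap_sem);	/* matching unlock (assumed path) */
	return 0;
}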

Re: [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2

2019-03-28 Thread Jerome Glisse
On Thu, Mar 28, 2019 at 06:43:51AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:05AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse 
> > 
> > Rename for consistency between code, comments and documentation. Also
> > improve the comments on all the possible return values. Improve the
> > function by returning the number of populated entries in the pfns array.
> > 
> > Changes since v1:
> > - updated documentation
> > - reformatted some comments
> > 
> > Signed-off-by: Jérôme Glisse 
> > Reviewed-by: Ralph Campbell 
> > Cc: Andrew Morton 
> > Cc: John Hubbard 
> > Cc: Dan Williams 
> > ---
> >  Documentation/vm/hmm.rst |  8 +---
> >  include/linux/hmm.h  | 13 +-
> >  mm/hmm.c | 91 +---
> >  3 files changed, 52 insertions(+), 60 deletions(-)
> > 
> > diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> > index d9b27bdadd1b..61f073215a8d 100644
> > --- a/Documentation/vm/hmm.rst
> > +++ b/Documentation/vm/hmm.rst
> > @@ -190,13 +190,7 @@ When the device driver wants to populate a range of 
> > virtual addresses, it can
> >  use either::
> >  
> >long hmm_range_snapshot(struct hmm_range *range);
> > -  int hmm_vma_fault(struct vm_area_struct *vma,
> > -struct hmm_range *range,
> > -unsigned long start,
> > -unsigned long end,
> > -hmm_pfn_t *pfns,
> > -bool write,
> > -bool block);
> > +  long hmm_range_fault(struct hmm_range *range, bool block);
> >  
> >  The first one (hmm_range_snapshot()) will only fetch present CPU page table
> >  entries and will not trigger a page fault on missing or non-present 
> > entries.
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 32206b0b1bfd..e9afd23c2eac 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -391,7 +391,18 @@ bool hmm_vma_range_done(struct hmm_range *range);
> >   *
> >   * See the function description in mm/hmm.c for further documentation.
> >   */
> > -int hmm_vma_fault(struct hmm_range *range, bool block);
> > +long hmm_range_fault(struct hmm_range *range, bool block);
> > +
> > +/* This is a temporary helper to avoid merge conflict between trees. */
> > +static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> > +{
> > +   long ret = hmm_range_fault(range, block);
> > +   if (ret == -EBUSY)
> > +   ret = -EAGAIN;
> > +   else if (ret == -EAGAIN)
> > +   ret = -EBUSY;
> > +   return ret < 0 ? ret : 0;
> > +}
> >  
> >  /* Below are for HMM internal use only! Not to be used by device driver! */
> >  void hmm_mm_destroy(struct mm_struct *mm);
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 91361aa74b8b..7860e63c3ba7 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -336,13 +336,13 @@ static int hmm_vma_do_fault(struct mm_walk *walk, 
> > unsigned long addr,
> > flags |= write_fault ? FAULT_FLAG_WRITE : 0;
> > ret = handle_mm_fault(vma, addr, flags);
> > if (ret & VM_FAULT_RETRY)
> > -   return -EBUSY;
> > +   return -EAGAIN;
> > if (ret & VM_FAULT_ERROR) {
> > *pfn = range->values[HMM_PFN_ERROR];
> > return -EFAULT;
> > }
> >  
> > -   return -EAGAIN;
> > +   return -EBUSY;
> >  }
> >  
> >  static int hmm_pfns_bad(unsigned long addr,
> > @@ -368,7 +368,7 @@ static int hmm_pfns_bad(unsigned long addr,
> >   * @fault: should we fault or not ?
> >   * @write_fault: write fault ?
> >   * @walk: mm_walk structure
> > - * Returns: 0 on success, -EAGAIN after page fault, or page fault error
> > + * Returns: 0 on success, -EBUSY after page fault, or page fault error
> >   *
> >   * This function will be called whenever pmd_none() or pte_none() returns 
> > true,
> >   * or whenever there is no page directory covering the virtual address 
> > range.
> > @@ -391,12 +391,12 @@ static int hmm_vma_walk_hole_(unsigned long addr, 
> > unsigned long end,
> >  
> > ret = hmm_vma_do_fault(walk, addr, write_fault,
> >&pfns[i]);
> > -   if (ret != -EAGAIN)
> > +   if (ret != -EBUSY)
> > return ret;
> > }
> > }
> >  
> > -   
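
[For reference, an illustrative caller-side sketch of the return convention
described in the commit message above: hmm_range_fault() returns the number of
populated entries in range->pfns[], or a negative errno. The helper name and
surrounding code are assumptions, not code from the series.]

#include <linux/hmm.h>

/*
 * Illustrative only: call the renamed API directly and treat a negative
 * return as an error, anything >= 0 as the count of populated pfn entries.
 * Assumes the caller set up and registered @range and holds the mmap_sem.
 */
static int my_populate_range(struct hmm_range *range, bool block)
{
	long ret = hmm_range_fault(range, block);

	if (ret < 0)
		return ret;	/* e.g. -EAGAIN, -EBUSY, -EFAULT */

	/* @ret entries of range->pfns[] were populated. */
	return 0;
}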
