Re: [Nouveau] [PATCH v2] mm: Take a page reference when removing device exclusive entries

2023-03-29 Thread John Hubbard

On 3/29/23 18:25, Alistair Popple wrote:

Device exclusive page table entries are used to prevent CPU access to
a page whilst it is being accessed from a device. Typically this is
used to implement atomic operations when the underlying bus does not
support atomic access. When a CPU thread encounters a device exclusive
entry it locks the page and restores the original entry after calling
mmu notifiers to signal drivers that exclusive access is no longer
available.

The device exclusive entry holds a reference to the page making it
safe to access the struct page whilst the entry is present. However
the fault handling code does not hold the PTL when taking the page
lock. This means if there are multiple threads faulting concurrently
on the device exclusive entry one will remove the entry whilst others
will wait on the page lock without holding a reference.

This can lead to threads locking or waiting on a folio with a zero
refcount. Whilst mmap_lock prevents the pages getting freed via
munmap() they may still be freed by a migration. This leads to
warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked
when the refcount drops to zero.

Fix this by trying to take a reference on the folio before locking
it. The code already checks the PTE under the PTL and aborts if the
entry is no longer there. It is also possible the folio has been
unmapped, freed and re-allocated allowing a reference to be taken on
an unrelated folio. This case is also detected by the PTE check and
the folio is unlocked without further changes.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: John Hubbard 
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Cc: sta...@vger.kernel.org

---

Changes for v2:

  - Rebased to Linus master
  - Reworded commit message
  - Switched to using folios (thanks Matthew!)
  - Added Reviewed-by's


v2 looks correct to me.

thanks,
--
John Hubbard
NVIDIA


---
  mm/memory.c | 16 +++++++++++++++-
  1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..01a23ad48a04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3563,8 +3563,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;
  
-	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags))
+	/*
+	 * We need a reference to lock the folio because we don't hold
+	 * the PTL so a racing thread can remove the device-exclusive
+	 * entry and unmap it. If the folio is free the entry must
+	 * have been removed already. If it happens to have already
+	 * been re-allocated after being freed all we do is lock and
+	 * unlock it.
+	 */
+	if (!folio_try_get(folio))
+		return 0;
+
+	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+		folio_put(folio);
 		return VM_FAULT_RETRY;
+	}
	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
vma->vm_mm, vmf->address & PAGE_MASK,
(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
@@ -3577,6 +3590,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
  
  	pte_unmap_unlock(vmf->pte, vmf->ptl);

folio_unlock(folio);
+   folio_put(folio);
  
	mmu_notifier_invalidate_range_end(&range);

return 0;
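
For readers who prefer the end state to a diff, the resulting function reads
roughly as follows (reconstructed from the hunks above plus the unchanged
context they imply; not guaranteed verbatim):

static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);
	struct vm_area_struct *vma = vmf->vma;
	struct mmu_notifier_range range;

	/* Take a reference first, so a racing thread removing the entry
	 * cannot leave us locking a folio with a zero refcount. */
	if (!folio_try_get(folio))
		return 0;

	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
		folio_put(folio);
		return VM_FAULT_RETRY;
	}
	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
				vma->vm_mm, vmf->address & PAGE_MASK,
				(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
	mmu_notifier_invalidate_range_start(&range);

	/* Re-check the PTE under the PTL; if another thread already
	 * removed the entry (or the folio was re-allocated), simply
	 * unlock and drop the reference below. */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
		restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);

	pte_unmap_unlock(vmf->pte, vmf->ptl);
	folio_unlock(folio);
	folio_put(folio);

	mmu_notifier_invalidate_range_end(&range);
	return 0;
}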





Re: [Nouveau] [PATCH] mm: Take a page reference when removing device exclusive entries

2023-03-28 Thread John Hubbard

On 3/28/23 20:16, Matthew Wilcox wrote:
...

+   if (!get_page_unless_zero(vmf->page))
+   return 0;


 From a folio point of view: what the hell are you doing here?  Tail
pages don't have individual refcounts; all the refcounts are actually
taken on the folio.


ohh, and I really should have caught that too. I plead spending too much
time recently in a somewhat more driver-centric mindset, and failing to
mentally shift gears properly for this case.

Sorry for missing that!

thanks,
--
John Hubbard
NVIDIA



Re: [Nouveau] [PATCH] mm: Take a page reference when removing device exclusive entries

2023-03-28 Thread John Hubbard
On 3/27/23 19:14, Alistair Popple wrote:
> Device exclusive page table entries are used to prevent CPU access to
> a page whilst it is being accessed from a device. Typically this is
> used to implement atomic operations when the underlying bus does not
> support atomic access. When a CPU thread encounters a device exclusive
> entry it locks the page and restores the original entry after calling
> mmu notifiers to signal drivers that exclusive access is no longer
> available.
> 
> The device exclusive entry holds a reference to the page making it
> safe to access the struct page whilst the entry is present. However
> the fault handling code does not hold the PTL when taking the page
> lock. This means if there are multiple threads faulting concurrently
> on the device exclusive entry one will remove the entry whilst others
> will wait on the page lock without holding a reference.
> 
> This can lead to threads locking or waiting on a page with a zero
> refcount. Whilst mmap_lock prevents the pages getting freed via
> munmap() they may still be freed by a migration. This leads to

An important point! So I'm glad that you wrote it here clearly.

> warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked
> when the refcount drops to zero. Note that during removal of the
> device exclusive entry the PTE is currently re-checked under the PTL
> so no further bad page accesses occur once it is locked.

Maybe change that last sentence to something like this:

"Fix this by taking a page reference before starting to remove a device
exclusive pte. This is done safely in a lock-free way by first getting a
reference via get_page_unless_zero(), and then re-checking after
acquiring the PTL, that the page is the correct one."

?

...well, maybe that's not all that much help. But it does at least
provide the traditional description of what the patch *does*, at
the end of the commit description. But please treat this as just
an optional suggestion.

> 
> Signed-off-by: Alistair Popple 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Cc: sta...@vger.kernel.org
> ---
>  mm/memory.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)

On the patch process, I see that this applies to linux-stable's 6.1.y
branch. I'd suggest two things:

1) Normally, what I've seen done is to post against either the current
top of tree linux.git, or else against one of the mm-stable branches.
And then after it's accepted, create a version for -stable. 

2) Either indicate in the cover letter (or after the ---) which branch
or commit this applies to, or let git format-patch help by passing in
the base commit via --base. That will save "some people" (people like
me) from having to guess, if they want to apply the patch locally and
poke around at it.

Anyway, all of the above are just little documentation and process
suggestions, but the patch itself looks great, so please feel free to
add:

Reviewed-by: John Hubbard 


thanks,
-- 
John Hubbard
NVIDIA

> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c8420934d60..b499bd283d8e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3623,8 +3623,19 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>   struct vm_area_struct *vma = vmf->vma;
>   struct mmu_notifier_range range;
>  
> -	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags))
> +	/*
> +	 * We need a page reference to lock the page because we don't
> +	 * hold the PTL so a racing thread can remove the
> +	 * device-exclusive entry and unmap the page. If the page is
> +	 * free the entry must have been removed already.
> +	 */
> +	if (!get_page_unless_zero(vmf->page))
> +		return 0;
> +
> +	if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
> +		put_page(vmf->page);
>  		return VM_FAULT_RETRY;
> +	}
> 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> 			vma->vm_mm, vmf->address & PAGE_MASK,
> 			(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
> @@ -3637,6 +3648,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>  
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>   folio_unlock(folio);
> + put_page(vmf->page);
>  
> 	mmu_notifier_invalidate_range_end(&range);
>   return 0;




Re: [Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread John Hubbard
On 9/26/22 14:35, Lyude Paul wrote:
>> +	for (i = 0; i < npages; i++) {
>> +		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>> +			struct page *dpage;
>> +
>> +			/*
>> +			 * __GFP_NOFAIL because the GPU is going away and there
>> +			 * is nothing sensible we can do if we can't copy the
>> +			 * data back.
>> +			 */
> 
> You'll have to excuse me for a moment since this area of nouveau isn't one of
> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
> retry, in the case of a GPU hotplug event I would assume we would rather just
> stop trying to migrate things to the GPU and just drop the data instead of
> hanging on infinite retries.
> 
Hi Lyude!

Actually, I really think it's better in this case to keep trying
(presumably not necessarily infinitely, but only until memory becomes
available), rather than failing out and corrupting data.

That's because I'm not sure it's completely clear that this memory is
discardable. And at some point, we're going to make this all work with
file-backed memory, which will *definitely* not be discardable--I
realize that we're not there yet, of course.

But here, it's reasonable to commit to just retrying indefinitely,
really. Memory should eventually show up. And if it doesn't, then
restarting the machine is better than corrupting data, generally.
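
To make that concrete, the eviction loop under discussion has roughly this
shape (a sketch based on the quoted hunk; dst_pfns[] and the copy step are
assumptions, not the exact driver code):

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/*
		 * __GFP_NOFAIL: block and retry until memory shows up,
		 * rather than returning NULL and forcing us to throw
		 * away data that only the departing GPU still holds.
		 */
		dpage = alloc_page(GFP_HIGHUSER | __GFP_NOFAIL);
		lock_page(dpage);
		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage));
		/* ... DMA-copy the device page back into dpage ... */
	}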


thanks,

-- 
John Hubbard
NVIDIA



Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-03 Thread John Hubbard

On 6/2/21 1:50 AM, Balbir Singh wrote:
...

only impact the address space of programs using the GPU. Should the exclusively
marked range live in the unreclaimable list and recycled back to
active/in-active to account for the fact that

1. It is not reclaimable and reclaim will only hurt via page faults?
2. It ages the page correctly or at-least allows for that possibility when the
 page is used by the GPU.


I'm not sure that that is *necessarily* something we can conclude. It depends
upon access patterns of each program. For example, a "reduction" parallel
program sends over lots of data to the GPU, and only a tiny bit of (reduced!)
data comes back to the CPU. In that case, freeing the physical page on the CPU
is actually the best decision for the OS to make (if the OS is sufficiently
prescient).



With a shared device or a device exclusive range, it would be good to get the
device usage pattern and update the mm with that knowledge, so that the LRU
can be better


Integrating a GPU (or "device") processor and its mm behavior with the Linux
kernel is always an interesting concept. Certainly worth exploring, although
it's probably not a small project by any means.


maintained. With your comment you seem to suggest that a page used by the GPU
might be a good candidate for reclaim based on the CPU's understanding of the
age of the page should not account for use by the device (are GPU workloads -
access once and discard?)



Well, that's a little too narrow of an interpretation. The GPU is a fairly
general purpose processor, and so it has all kinds of workloads. I'm trying to
discourage any hopes that one can know, in advance, precisely how the GPU's
pages need to be managed. It's similar to the CPU, in that regard. My example
was just one, out of a vast pool of possible behaviors.

thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-26 Thread John Hubbard

On 5/25/21 4:51 AM, Balbir Singh wrote:
...

How beneficial is this code to nouveau users?  I see that it permits a
part of OpenCL to be implemented, but how useful/important is this in
the real world?


That is a very good question! I've not reviewed the code, but a sample
program with the described use case would make things easy to parse.
I suspect that is not easy to build at the moment?



The cover letter says this:

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. Without
this series the test fails as there is no way of write-protecting the page
mapping which results in the device clobbering CPU writes. For reference
the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

...so that seems to cover the "sample program" request, at least.


I wonder how we co-ordinate all the work the mm is doing, page migration,
reclaim with device exclusive access? Do we have any numbers for the worst
case page fault latency when something is marked away for exclusive access?


CPU page fault latency is approximately "terrible", if a page is resident on
the GPU. We have to spin up a DMA engine on the GPU and have it copy the page
over the PCIe bus, after all.


I presume for now this is anonymous memory only? SWP_DEVICE_EXCLUSIVE would


Yes, for now.


only impact the address space of programs using the GPU. Should the exclusively
marked range live in the unreclaimable list and recycled back to
active/in-active to account for the fact that

1. It is not reclaimable and reclaim will only hurt via page faults?
2. It ages the page correctly or at-least allows for that possibility when the
page is used by the GPU.


I'm not sure that that is *necessarily* something we can conclude. It depends
upon access patterns of each program. For example, a "reduction" parallel
program sends over lots of data to the GPU, and only a tiny bit of (reduced!)
data comes back to the CPU. In that case, freeing the physical page on the CPU
is actually the best decision for the OS to make (if the OS is sufficiently
prescient).

thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-24 Thread John Hubbard

On 5/24/21 3:11 PM, Andrew Morton wrote:

...

  Documentation/vm/hmm.rst |  17 
  include/linux/mmu_notifier.h |   6 ++
  include/linux/rmap.h |   4 +
  include/linux/swap.h |   7 +-
  include/linux/swapops.h  |  44 -
  mm/hmm.c |   5 +
  mm/memory.c  | 128 +++-
  mm/mprotect.c|   8 ++
  mm/page_vma_mapped.c |   9 +-
  mm/rmap.c| 186 +++
  10 files changed, 405 insertions(+), 9 deletions(-)



This is quite a lot of code added to core MM for a single driver.

Is there any expectation that other drivers will use this code?


Yes! This should work for GPUs (and potentially, other devices) that support
OpenCL SVM atomic accesses on the device. I haven't looked into how amdgpu
works in any detail, but that's certainly at the top of the list of likely
additional callers.



Is there a way of reducing the impact (code size, at least) for systems
which don't need this code?


I'll leave this question to others for the moment, in order to answer
the "do we need it at all" points.



How beneficial is this code to nouveau users?  I see that it permits a
part of OpenCL to be implemented, but how useful/important is this in
the real world?



So this is interesting. Right now, OpenCL support in Nouveau is rather new
and so probably not a huge impact yet. However, we've built up enough experience
with CUDA and OpenCL to learn that atomic operations, as part of the user
space programming model, are a super big deal. Atomic operations are so
useful and important that I'd expect many OpenCL SVM users to be uninterested in
programming models that lack atomic operations for GPU compute programs.

Again, this doesn't rule out future, non-GPU accelerator devices that may
come along.

Atomic ops are just a really important piece of high-end multi-threaded
programming, it turns out. So this is the beginning of support for an
important building block for general purpose programming on devices that
have GPU-like memory models.


thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 8:56 PM, John Hubbard wrote:

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
*
* Called from munlock code.  Checks all of the VMAs mapping the page
* to make sure nobody else has this page mlocked. The page will be
* returned with PG_mlocked cleared if no other vmas have it mlocked.
*/

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

     try_to_munlock() --> try_to_mlock()
     try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.


Actually, re-reading your and Jason's earlier points in the thread, I see
that I'm *not* missing anything, and we are actually in agreement about how
the code operates. OK, good!

Also, as you point out above, maybe the "try_" prefix is not really accurate
either, given how this works. So maybe we have arrived at something like:

try_to_munlock() --> page_mlock() // or mlock_page()...
try_to_munlock_one() --> page_mlock_one()



thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
 * try_to_munlock - try to munlock a page
 * @page: the page to be munlocked
 *
 * Called from munlock code.  Checks all of the VMAs mapping the page
 * to make sure nobody else has this page mlocked. The page will be
 * returned with PG_mlocked cleared if no other vmas have it mlocked.
 */

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

 try_to_munlock() --> try_to_mlock()
 try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.



This is actually inspired from a suggestion in Documentation/vm/unevictable-
lru.rst which warns about this problem:

try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.



This is actually rather bad advice! page_referenced() returns an
int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
stands now, returns void. Usually when I'm writing a TODO item, I'm in a
hurry, and I think that's what probably happened here, too. :)



Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.


Unless anyone objects strongly I will roll the rename into this patch as there
is only one caller of try_to_munlock.

  - Alistair



No objections here. :)

thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
...

As far as I can tell this has always been called try_to_munlock() even though
it appears to do the opposite.


Maybe we should change it then?


/**
  * try_to_munlock - try to munlock a page
  * @page: the page to be munlocked
  *
  * Called from munlock code.  Checks all of the VMAs mapping the page
  * to make sure nobody else has this page mlocked. The page will be
  * returned with PG_mlocked cleared if no other vmas have it mlocked.
  */


In other words it sets PG_mlocked if one or more vmas has it mlocked. So
try_to_mlock() might be a better name, except that seems to have the potential
for confusion as well because it's only called from the munlock code path and
never for mlock.


That explanation makes more sense.. This function looks like it is
'set PG_mlocked of the page if any vm->flags has VM_LOCKED'

Maybe call it check_vm_locked or something then and reword the above
comment?

(and why is it OK to read vm->flags for this without any locking?)


Something needs attention here..


I think the code is correct, but perhaps the naming could be better. Would be
interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
as the current name appears based on the context it is called from (munlock)
rather than what it does (mlock).


The point of this patch is to make it clearer, after all, so I'd
change something and maybe slightly clarify the comment.



I'd add that, after looking around the calling code, this is a really unhappy
pre-existing situation. Anyone reading this has to remember at which point in
the call stack the naming transitions from "do the opposite of what the name
says", to "do what the name says".

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.

Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.
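
To see why "mlock" fits better than "munlock": condensed from mm/rmap.c of
that era, the per-VMA step that try_to_munlock() drives via rmap_walk() is
roughly this (a simplified sketch, not verbatim kernel code):

static bool page_mlock_one_sketch(struct page *page,
				  struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_LOCKED) {
		/* Some mlocked VMA still maps the page: *set* PG_mlocked. */
		mlock_vma_page(page);
		return false;	/* stop the walk; the page stays mlocked */
	}
	return true;		/* keep scanning the remaining VMAs */
}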

thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-10 Thread John Hubbard

On 2/10/21 4:59 AM, Daniel Vetter wrote:
...

GPU atomic operations to sysmem are hard to categorize, because application
programmers could easily write programs that do a long series of atomic
operations. Such a program would be a little weird, but it's hard to rule out.


Yeah, but we can forcefully break this whenever we feel like by revoking
the page, moving it, and then reinstating the gpu pte again and let it
continue.


Oh yes, that's true.



If that's not possible then what we need here instead is an mlock() type of
thing I think.

No need for that, then.


thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-09 Thread John Hubbard

On 2/9/21 5:37 AM, Daniel Vetter wrote:

On Tue, Feb 9, 2021 at 1:57 PM Alistair Popple  wrote:


On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote:


Recent changes to pin_user_pages() prevent the creation of pinned pages in
ZONE_MOVABLE. This series allows pinned pages to be created in ZONE_MOVABLE
as attempts to migrate may fail which would be fatal to userspace.

In this case migration of the pinned page is unnecessary as the page can be
unpinned at anytime by having the driver revoke atomic permission as it
does for the migrate_to_ram() callback. However a method of calling this
when memory needs to be moved has yet to be resolved so any discussion is
welcome.


Why do we need to pin for gpu atomics? You still have the callback for
cpu faults, so you can move the page as needed, and hence a long-term pin
sounds like the wrong approach.


Technically a real long term unmoveable pin isn't required, because as you say
the page can be moved as needed at any time. However I needed some way of
stopping the CPU page from being freed once the userspace mappings for it had
been removed. Obviously I could have just used get_page() but from the
perspective of page migration the result is much the same as a pin - a page
which can't be moved because of the extra refcount.


long term pin vs short term page reference aren't fully fleshed out.
But the rule more or less is:
- short term page reference: _must_ get released in finite time for
migration and other things, either because you have a callback, or
because it's just for direct I/O, which will complete. This means
short term pins will delay migration, but not foul it complete



GPU atomic operations to sysmem are hard to categorize, because application
programmers could easily write programs that do a long series of atomic
operations. Such a program would be a little weird, but it's hard to rule out.




- long term pin: the page cannot be moved, all migration must fail.
Also this will have an impact on COW behaviour for fork (but not sure
where those patches are, John Hubbard will know).



That would be Jason's commit 57efa1fe59576 ("mm/gup: prevent gup_fast from
racing with COW during fork"), which is in linux-next 20201216.




So I think for your use case here you want a) short term page
reference to make sure it doesn't disappear plus b) callback to make
sure migrate isn't blocked.

Breaking ZONE_MOVEABLE with either allowing long term pins or failing
migrations because you don't release your short term page reference
isn't good.


The normal solution of registering an MMU notifier to unpin the page when it
needs to be moved also doesn't work as the CPU page tables now point to the
device-private page and hence the migration code won't call any invalidate
notifiers for the CPU page.


Yeah you need some other callback for migration on the page directly.
it's a bit awkward since there is one already for struct
address_space, but that's owned by the address_space/page cache, not
HMM. So I think we need something else, maybe something for each
ZONE_DEVICE?



This direction sounds at least...possible. Using MMU notifiers instead of pins
is definitely appealing. I'm not quite clear on the callback idea above, but
overall it seems like taking advantage of the ZONE_DEVICE tracking of pages
(without having to put anything additional in each struct page), could work.

Additional notes or ideas here are definitely welcome.
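
For concreteness, the two reference flavours map onto the GUP API like this
(a sketch; error handling elided):

static void demo_pin_flavours(unsigned long addr)
{
	struct page *page;

	/* Short term reference: must be released in finite time, e.g.
	 * for direct I/O. It may delay migration but must not defeat it. */
	if (pin_user_pages_fast(addr, 1, FOLL_WRITE, &page) == 1)
		unpin_user_page(page);

	/* Long term pin: may be held indefinitely, so GUP refuses to
	 * create it in ZONE_MOVABLE (the page gets migrated out first). */
	if (pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM,
				&page) == 1)
		unpin_user_page(page);
}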



thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH] nouveau: fix page fault on device private memory

2020-06-26 Thread John Hubbard

On 2020-06-26 10:26, Ralph Campbell wrote:

If system memory is migrated to device private memory and no GPU MMU
page table entry exists, the GPU will fault and call hmm_range_fault()
to get the PFN for the page. Since the .dev_private_owner pointer in
struct hmm_range is not set, hmm_range_fault returns an error which
results in the GPU program stopping with a fatal fault.
Fix this by setting .dev_private_owner appropriately.

Fixes: 08da667b ("mm/hmm: check the device private page owner in hmm_range_fault()")
Signed-off-by: Ralph Campbell 
---

This is based on Linux-5.8.0-rc2 and is for Ben Skeggs nouveau tree.
It doesn't depend on any of the other nouveau/HMM changes I have
recently posted.


Maybe cc stable, seeing as how the problem affects Linux 5.7?

thanks,
--
John Hubbard
NVIDIA



  drivers/gpu/drm/nouveau/nouveau_svm.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index ba9f9359c30e..6586d9d39874 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -562,6 +562,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
.end = notifier->notifier.interval_tree.last + 1,
.pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
.hmm_pfns = hmm_pfns,
+   .dev_private_owner = drm->dev,
};
struct mm_struct *mm = notifier->notifier.mm;
int ret;





Re: [Nouveau] [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static

2020-06-22 Thread John Hubbard

On 2020-06-22 16:38, Ralph Campbell wrote:

The functions nvkm_vmm_ctor() and nvkm_mmu_ptp_get() are not called outside
of the file defining them so make them static.

Signed-off-by: Ralph Campbell 
---
  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 2 +-
  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  | 2 +-
  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  | 3 ---
  3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
index ee11ccaf0563..de91e9a26172 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
@@ -61,7 +61,7 @@ nvkm_mmu_ptp_put(struct nvkm_mmu *mmu, bool force, struct nvkm_mmu_pt *pt)
 	kfree(pt);
 }
 
-struct nvkm_mmu_pt *
+static struct nvkm_mmu_pt *
  nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
  {
struct nvkm_mmu_pt *pt;
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
index 199f94e15c5f..67b00dcef4b8 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
@@ -1030,7 +1030,7 @@ nvkm_vmm_ctor_managed(struct nvkm_vmm *vmm, u64 addr, u64 size)
 	return 0;
 }
 
-int
+static int
  nvkm_vmm_ctor(const struct nvkm_vmm_func *func, struct nvkm_mmu *mmu,
  u32 pd_header, bool managed, u64 addr, u64 size,
  struct lock_class_key *key, const char *name,
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
index d3f8f916d0db..a2b179568970 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -163,9 +163,6 @@ int nvkm_vmm_new_(const struct nvkm_vmm_func *, struct nvkm_mmu *,
  u32 pd_header, bool managed, u64 addr, u64 size,
  struct lock_class_key *, const char *name,
  struct nvkm_vmm **);
-int nvkm_vmm_ctor(const struct nvkm_vmm_func *, struct nvkm_mmu *,
- u32 pd_header, bool managed, u64 addr, u64 size,
- struct lock_class_key *, const char *name, struct nvkm_vmm *);
  struct nvkm_vma *nvkm_vmm_node_search(struct nvkm_vmm *, u64 addr);
  struct nvkm_vma *nvkm_vmm_node_split(struct nvkm_vmm *, struct nvkm_vma *,
 u64 addr, u64 size);



Looks accurate: the order within vmm.c (now that there is no .h
declaration) is still good, and I found no other uses of either function
within the linux.git tree, so


Reviewed-by: John Hubbard


Re: [Nouveau] [RESEND PATCH 1/3] nouveau: fix migrate page regression

2020-06-22 Thread John Hubbard

On 2020-06-22 16:38, Ralph Campbell wrote:

The patch to add zero page migration to GPU memory inadvertantly included


inadvertently


part of a future change which broke normal page migration to GPU memory
by copying too much data and corrupting GPU memory.
Fix this by only copying one page instead of a byte count.

Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
Signed-off-by: Ralph Campbell 
---
  drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e5c230d9ae24..cc9993837508 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
goto out_free_page;
-   if (drm->dmem->migrate.copy_func(drm, page_size(spage),
+   if (drm->dmem->migrate.copy_func(drm, 1,
NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
goto out_dma_unmap;
} else {




I Am Not A Nouveau Expert, nor is it really clear to me how
page_size(spage) came to contain something other than a page's worth of
byte count, but this fix looks accurate to me. It's better for
maintenance, too, because the function never intends to migrate "some
number of bytes". It intends to migrate exactly one page.

Hope I'm not missing something fundamental, but:

Reviewed-by: John Hubbard


Re: [Nouveau] [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration

2020-06-22 Thread John Hubbard

On 2020-06-22 16:38, Ralph Campbell wrote:

The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
migrate memory in the given address range to device private memory. The
source pages might already have been migrated to device private memory.
In that case, the source struct page is not checked to see if it is
a device private page and incorrectly computes the GPU's physical
address of local memory leading to data corruption.
Fix this by checking the source struct page and computing the correct
physical address.

Signed-off-by: Ralph Campbell 
---
  drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
  1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index cc9993837508..f6a806ba3caa 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
if (!(src & MIGRATE_PFN_MIGRATE))
goto out;
  
+	if (spage && is_device_private_page(spage)) {
+   paddr = nouveau_dmem_page_addr(spage);
+   *dma_addr = DMA_MAPPING_ERROR;
+   goto done;
+   }
+
dpage = nouveau_dmem_page_alloc_locked(drm);
if (!dpage)
goto out;
@@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
goto out_free_page;
}
  
+done:
*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
if (src & MIGRATE_PFN_WRITE)
@@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
struct migrate_vma args = {
.vma= vma,
.start  = start,
+   .src_owner  = drm->dev,


Hi Ralph,

This .src_owner setting does look like a required fix, but it seems like
a completely separate fix from what is listed in this patch's commit
description, right? (It feels like a casualty of rearranging the patches.)


thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH 13/16] mm: support THP migration to device private memory

2020-06-22 Thread John Hubbard

On 2020-06-22 15:33, Yang Shi wrote:

On Mon, Jun 22, 2020 at 3:30 PM Yang Shi  wrote:

On Mon, Jun 22, 2020 at 2:53 PM Zi Yan  wrote:

On 22 Jun 2020, at 17:31, Ralph Campbell wrote:

On 6/22/20 1:10 PM, Zi Yan wrote:

On 22 Jun 2020, at 15:36, Ralph Campbell wrote:

On 6/21/20 4:20 PM, Zi Yan wrote:

On 19 Jun 2020, at 17:56, Ralph Campbell wrote:

...

Ying (cc’d) developed the code to swapout and swapin THP in one piece:
https://lore.kernel.org/linux-mm/20181207054122.27822-1-ying.hu...@intel.com/.
I am not sure whether the patchset made it into the mainline or not. It could
be a good technical reference for swapping in device private pages, although
swapping in pages from disk and from device private memory are two different
scenarios.

Since the device private memory swapin impacts core mm performance, we might
want to discuss your patches with more people, like the ones from Ying’s
patchset, in the next version.


I believe Ying will give you more insights about how THP swap works.

But, IMHO device memory migration (migrate to system memory) seems
like THP CoW more than swap.



A fine point: overall, the desired behavior is "migrate", not CoW.
That's important. Migrate means that you don't leave a page behind, even
a read-only one. And that's exactly how device private migration is
specified.

We should try to avoid any erosion of clarity here. Even if somehow
(really?) the underlying implementation calls this THP CoW, the actual
goal is to migrate pages over to the device (and back).




When migrating in:


Sorry for my fat finger, hit sent button inadvertently, let me finish here.

When migrating in:

 - if THP is enabled: allocate THP, but need to handle allocation
   failure by falling back to base page
 - if THP is disabled: fallback to base page



OK, but *all* page entries (base and huge/large pages) need to be cleared,
when migrating to device memory, unless I'm really confused here.
So: not CoW.
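
For reference, the flow drivers use here is the migrate_vma API; a minimal
sketch with the v5.8-era field names that appear in the quoted patches (the
copy step is driver-specific and elided):

static int demo_migrate_to_device(struct vm_area_struct *vma,
				  unsigned long start, unsigned long end,
				  unsigned long *src, unsigned long *dst,
				  void *owner)
{
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.src_owner	= owner,
	};

	if (migrate_vma_setup(&args))
		return -EINVAL;

	/* The driver copies each collected page to device memory and
	 * fills dst[] with MIGRATE_PFN_* entries, then: */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);

	/* Every CPU PTE in the range now holds a device-private entry;
	 * no read-only copy stays behind -- migrate, not CoW. */
	return 0;
}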

thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH hmm v2 5/5] mm/hmm: remove the customizable pfn format from hmm_range_fault

2020-05-04 Thread John Hubbard
Seems like we could get stuck in a loop here,
if we're not issuing a new REQ, right?



if (ret == -EBUSY)
continue;
return ret;
@@ -562,7 +587,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
break;
}
  
-	nouveau_dmem_convert_pfn(drm, &range);
+	nouveau_hmm_convert_pfn(drm, &range, ioctl_addr);
  
  	svmm->vmm->vmm.object.client->super = true;

	ret = nvif_object_ioctl(&svmm->vmm->vmm.object, data, size, NULL);
@@ -589,6 +614,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
} i;
u64 phys[16];
} args;
+   unsigned long hmm_pfns[ARRAY_SIZE(args.phys)];



Is there a risk of blowing up the stack here?

...


--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -19,45 +19,45 @@
  #include 
  
  /*

- * hmm_pfn_flag_e - HMM flag enums
+ * On output:
+ * 0 - The page is faultable and a future call with
+ * HMM_PFN_REQ_FAULT could succeed.
+ * HMM_PFN_VALID - the pfn field points to a valid PFN. This PFN is at
+ *	least readable. If dev_private_owner is !NULL then this could
+ *	point at a DEVICE_PRIVATE page.
+ * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
+ * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
+ *	fail. ie poisoned memory, special pages, no vma, etc
   *
- * Flags:
- * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
- * HMM_PFN_WRITE: CPU page table has write permission set
- *
- * The driver provides a flags array for mapping page protections to device
- * PTE bits. If the driver valid bit for an entry is bit 3,
- * i.e., (entry & (1 << 3)), then the driver must provide
- * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
- * Same logic apply to all flags. This is the same idea as vm_page_prot in vma
- * except that this is per device driver rather than per architecture.
+ * On input:
+ * 0 - Return the current state of the page, do not fault it.
+ * HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or hmm_range_fault()
+ * will fail
+ * HMM_PFN_REQ_WRITE - The output must have HMM_PFN_WRITE or hmm_range_fault()
+ * will fail. Must be combined with HMM_PFN_REQ_FAULT.
   */
-enum hmm_pfn_flag_e {
-   HMM_PFN_VALID = 0,
-   HMM_PFN_WRITE,
-   HMM_PFN_FLAG_MAX
+enum hmm_pfn_flags {


Let's add:

/* Output flags: */


+   HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
+   HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
+   HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+


/* Input flags: */

...


@@ -174,44 +162,44 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
}
if (required_fault)
return hmm_vma_fault(addr, end, required_fault, walk);
-   return hmm_pfns_fill(addr, end, range, HMM_PFN_NONE);
+   return hmm_pfns_fill(addr, end, range, 0);
  }
  
-static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
+static inline unsigned long pmd_to_hmm_pfn_flags(struct hmm_range *range,
+						 pmd_t pmd)
  {
if (pmd_protnone(pmd))
return 0;
-   return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
-   range->flags[HMM_PFN_WRITE] :
-   range->flags[HMM_PFN_VALID];
+   return pmd_write(pmd) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;



I always found the previous range->flags[...] approach hard to remember, so it's
nice to see a simpler version now.


thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH hmm v2 4/5] mm/hmm: remove HMM_PFN_SPECIAL

2020-05-04 Thread John Hubbard

On 2020-05-01 11:20, Jason Gunthorpe wrote:

From: Jason Gunthorpe 

This is just an alias for HMM_PFN_ERROR, nothing cares that the error was
because of a special page vs any other error case.


Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Acked-by: Felix Kuehling 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 -
  drivers/gpu/drm/nouveau/nouveau_svm.c   | 1 -
  include/linux/hmm.h                     | 8 --------
  mm/hmm.c                                | 2 +-
  4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 41ae7f96f48194..76b4a4fa39ed04 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -775,7 +775,6 @@ static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
  static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
	0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
	0, /* HMM_PFN_NONE */
-	0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
  };
  
  /**

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index c68e9317cf0740..cf0d9bd61bebf9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -379,7 +379,6 @@ static const u64
  nouveau_svm_pfn_values[HMM_PFN_VALUE_MAX] = {
[HMM_PFN_ERROR  ] = ~NVIF_VMM_PFNMAP_V0_V,
[HMM_PFN_NONE   ] =  NVIF_VMM_PFNMAP_V0_NONE,
-   [HMM_PFN_SPECIAL] = ~NVIF_VMM_PFNMAP_V0_V,
  };
  
  /* Issue fault replay for GPU to retry accesses that faulted previously. */

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0df27dd03d53d7..81c302c884c0e3 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -44,10 +44,6 @@ enum hmm_pfn_flag_e {
   * Flags:
   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
   * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
- * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
- *  result of vmf_insert_pfn() or vm_insert_page(). Therefore, it should not
- *  be mirrored by a device, because the entry will never have HMM_PFN_VALID
- *  set and the pfn value is undefined.
   *
   * Driver provides values for none entry, error entry, and special entry.
   * Driver can alias (i.e., use same value) error and special, but
@@ -56,12 +52,10 @@ enum hmm_pfn_flag_e {
   * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
   * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
   * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table entry,
- * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
   */
  enum hmm_pfn_value_e {
HMM_PFN_ERROR,
HMM_PFN_NONE,
-   HMM_PFN_SPECIAL,
HMM_PFN_VALUE_MAX
  };
  
@@ -110,8 +104,6 @@ static inline struct page *hmm_device_entry_to_page(const struct hmm_range *range,
return NULL;
if (entry == range->values[HMM_PFN_ERROR])
return NULL;
-   if (entry == range->values[HMM_PFN_SPECIAL])
-   return NULL;
if (!(entry & range->flags[HMM_PFN_VALID]))
return NULL;
return pfn_to_page(entry >> range->pfn_shift);
diff --git a/mm/hmm.c b/mm/hmm.c
index f06bcac948a79b..2e975eedb14f89 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -301,7 +301,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_unmap(ptep);
return -EFAULT;
}
-   *pfn = range->values[HMM_PFN_SPECIAL];
+   *pfn = range->values[HMM_PFN_ERROR];
return 0;
}
  





Re: [Nouveau] [PATCH hmm v2 2/5] mm/hmm: make hmm_range_fault return 0 or -1

2020-05-04 Thread John Hubbard

On 2020-05-01 11:20, Jason Gunthorpe wrote:

From: Jason Gunthorpe 

hmm_vma_walk->last is supposed to be updated after every write to the
pfns, so that it can be returned by hmm_range_fault(). However, this is
not done consistently. Fortunately nothing checks the return code of
hmm_range_fault() for anything other than error.

More importantly last must be set before returning -EBUSY as it is used to
prevent reading an output pfn as an input flags when the loop restarts.

For clarity and simplicity make hmm_range_fault() return 0 or -ERRNO. Only
set last when returning -EBUSY.


Yes, this is also a nice simplification.


...
@@ -590,10 +580,13 @@ long hmm_range_fault(struct hmm_range *range)
return -EBUSY;
ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
			      &hmm_walk_ops, &hmm_vma_walk);
+		/*
+		 * When -EBUSY is returned the loop restarts with
+		 * hmm_vma_walk.last set to an address that has not been stored
+		 * in pfns. All entries < last in the pfn array are set to their
+		 * output, and all >= are still at their input values.
+		 */


I'm glad you added that comment. This is much easier to figure out with
that in place. After poking around this patch and eventually understanding the
.last handling, I wondered if you might like this slightly tweaked wording
instead:

/*
 * Each of the hmm_walk_ops routines returns -EBUSY if and only if
 * hmm_vma_walk.last has been set to an address that has not yet
 * been stored in pfns. All entries < last in the pfn array are
 * set to their output, and all >= are still at their input
 * values.
 */

Either way,

    Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


} while (ret == -EBUSY);
-
-   if (ret)
-   return ret;
-   return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+   return ret;
  }
  EXPORT_SYMBOL(hmm_range_fault);
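
With the 0-or--ERRNO convention, the caller side collapses to the retry loop
that hmm.rst documents; a sketch (post-series hmm_pfns[] naming, v5.7-era
mmap_sem locking):

static int demo_fault_retry(struct hmm_range *range, struct mm_struct *mm)
{
	int ret;

	do {
		range->notifier_seq = mmu_interval_read_begin(range->notifier);
		down_read(&mm->mmap_sem);	/* mmap_read_lock() on later kernels */
		ret = hmm_range_fault(range);
		up_read(&mm->mmap_sem);
	} while (ret == -EBUSY);

	return ret;	/* 0 means range->hmm_pfns[] is fully populated */
}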





Re: [Nouveau] [PATCH hmm v2 1/5] mm/hmm: make CONFIG_DEVICE_PRIVATE into a select

2020-05-01 Thread John Hubbard

On 2020-05-01 11:20, Jason Gunthorpe wrote:

From: Jason Gunthorpe 

There is no reason for a user to select this or not directly - it should
be selected by drivers that are going to use the feature, similar to how
CONFIG_HMM_MIRROR works.


Yes, this is a nice touch.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA



Currently all drivers provide a feature kconfig that will disable use of
DEVICE_PRIVATE in that driver, allowing users to avoid enabling this if
they don't want the overhead.

Acked-by: Felix Kuehling 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
  arch/powerpc/Kconfig            | 2 +-
  drivers/gpu/drm/nouveau/Kconfig | 2 +-
  mm/Kconfig                      | 7 +------
  3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 924c541a926008..8de52aefdc74cc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -455,7 +455,7 @@ config PPC_TRANSACTIONAL_MEM
  config PPC_UV
bool "Ultravisor support"
depends on KVM_BOOK3S_HV_POSSIBLE
-   depends on DEVICE_PRIVATE
+   select DEVICE_PRIVATE
default n
help
  This option paravirtualizes the kernel to run in POWER platforms that
diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig
index d6e4ae1ef7053a..af5793f3e7c2cf 100644
--- a/drivers/gpu/drm/nouveau/Kconfig
+++ b/drivers/gpu/drm/nouveau/Kconfig
@@ -86,10 +86,10 @@ config DRM_NOUVEAU_BACKLIGHT
  
  config DRM_NOUVEAU_SVM
bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
-   depends on DEVICE_PRIVATE
depends on DRM_NOUVEAU
depends on MMU
depends on STAGING
+   select DEVICE_PRIVATE
select HMM_MIRROR
select MMU_NOTIFIER
default n
diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358c..7ca36bf5f5058e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -805,15 +805,10 @@ config HMM_MIRROR
depends on MMU
  
  config DEVICE_PRIVATE
-   bool "Unaddressable device memory (GPU memory, ...)"
+   bool
depends on ZONE_DEVICE
select DEV_PAGEMAP_OPS
  
-	help
- Allows creation of struct pages to represent unaddressable device
- memory; i.e., memory that is only accessible from the device (or
- group of devices). You likely also want to select HMM_MIRROR.
-
  config FRAME_VECTOR
bool
  





Re: [Nouveau] [PATCH] nouveau: no need to check return value of debugfs_create functions

2020-02-13 Thread John Hubbard
On 2/13/20 2:39 PM, Greg Kroah-Hartman wrote:
> On Thu, Feb 13, 2020 at 02:30:09PM -0800, John Hubbard wrote:
>> On 2/9/20 2:55 AM, Greg Kroah-Hartman wrote:
>>> When calling debugfs functions, there is no need to ever check the
>>> return value.  The function can work or not, but the code logic should
>>> never do something different based on this.
>>>
>>
>> Should we follow that line of reasoning further, and simply return void
>> from the debugfs functions--rather than playing whack-a-mole with this
>> indefinitely?
> 
> That is what we (well I) have been doing.  Look at all of the changes
> that have happened to include/linux/debugfs.h over the past few
> releases.  I'm slowly winnowing down the api to make it impossible to
> get wrong for this type of thing, and am almost there.
>


Oops, I see now that you have already been very busy with this. :)  
Looking good...


thanks,
-- 
John Hubbard
NVIDIA
 
> DRM is the big fish left to tackle, I have submitted some patches in the
> past, but lots more cleanup needs to be done to get them into mergable
> shape.  I just need to find the time...
>>
> thanks,
> 
> greg k-h
> 


Re: [Nouveau] [PATCH] nouveau: no need to check return value of debugfs_create functions

2020-02-13 Thread John Hubbard
On 2/9/20 2:55 AM, Greg Kroah-Hartman wrote:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 

Should we follow that line of reasoning further, and simply return void
from the debugfs functions--rather than playing whack-a-mole with this
indefinitely?


thanks,
-- 
John Hubbard
NVIDIA

> Cc: Ben Skeggs 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: dri-de...@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> Signed-off-by: Greg Kroah-Hartman 
> ---
>  drivers/gpu/drm/nouveau/nouveau_debugfs.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_debugfs.c b/drivers/gpu/drm/nouveau/nouveau_debugfs.c
> index 080e964d49aa..d1c82fc45a68 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_debugfs.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_debugfs.c
> @@ -224,14 +224,10 @@ nouveau_drm_debugfs_init(struct drm_minor *minor)
>   struct dentry *dentry;
>   int i;
>  
> - for (i = 0; i < ARRAY_SIZE(nouveau_debugfs_files); i++) {
> - dentry = debugfs_create_file(nouveau_debugfs_files[i].name,
> -  S_IRUGO | S_IWUSR,
> -  minor->debugfs_root, minor->dev,
> -  nouveau_debugfs_files[i].fops);
> - if (!dentry)
> - return -ENOMEM;
> - }
> + for (i = 0; i < ARRAY_SIZE(nouveau_debugfs_files); i++)
> + debugfs_create_file(nouveau_debugfs_files[i].name,
> + S_IRUGO | S_IWUSR, minor->debugfs_root,
> + minor->dev, nouveau_debugfs_files[i].fops);
>  
>   drm_debugfs_create_files(nouveau_debugfs_list,
>NOUVEAU_DEBUGFS_ENTRIES,
> 


Re: [Nouveau] [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled

2019-12-27 Thread John Hubbard
On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Now that we have KERNEL_HEADER_TEST all headers are generally compile
> tested, so relying on makefile tricks to avoid compiling code that depends
> on CONFIG_MMU_NOTIFIER is more annoying.
> 
> Instead follow the usual pattern and provide most of the header with only
> the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
> ensures code compiles no matter what the config setting is.
> 
> While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.

and correct a minor spelling error in a comment. Good. :)

> 
> Reviewed-by: Jérôme Glisse 
> Signed-off-by: Jason Gunthorpe 
> ---
>  include/linux/mmu_notifier.h | 46 +++++++++++++++++-----------------------------
>  mm/mmu_notifier.c            | 13 +++++++++++++
>  2 files changed, 30 insertions(+), 29 deletions(-)
> 

Because this is correct as-is, you can add:

Reviewed-by: John Hubbard 


...whether or not you take the following recommendation, which is:
you've only done part of the job of making struct mmu_notifier_mm 
private to mmu_notifier.c. There's more:

* struct mmu_notifier_mm is referred to in two places now: mm_types.h
  and (still) mmu_notifier.h. Therefore:

a) Move the last two traces of it out of mmu_notifier.h, and

b) Put a forward declaration in mm_types.h, which is where it
   belongs because that's where it's referred to.

So if you apply this incremental patch on top, I think it's where
you want to be:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fa795284..df93a3cc0da9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -25,6 +25,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct mmu_notifier_mm;
 
 /*
  * Each physical page in the system has a struct page associated with
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 51b92ba013dd..84efd2c51f5c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -8,7 +8,6 @@
 #include 
 #include 
 
-struct mmu_notifier_mm;
 struct mmu_notifier;
 struct mmu_notifier_range;
 struct mmu_range_notifier;
@@ -263,10 +262,7 @@ struct mmu_notifier_range {
enum mmu_notifier_event event;
 };
 
-static inline int mm_has_notifiers(struct mm_struct *mm)
-{
-   return unlikely(mm->mmu_notifier_mm);
-}
+int mm_has_notifiers(struct mm_struct *mm);
 
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
 struct mm_struct *mm);
@@ -477,10 +473,7 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
__mmu_notifier_invalidate_range(mm, start, end);
 }
 
-static inline void mmu_notifier_mm_init(struct mm_struct *mm)
-{
-   mm->mmu_notifier_mm = NULL;
-}
+void mmu_notifier_mm_init(struct mm_struct *mm);
 
 static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2b7485919ecf..107f9406a92d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -47,6 +47,16 @@ struct mmu_notifier_mm {
struct hlist_head deferred_list;
 };
 
+int mm_has_notifiers(struct mm_struct *mm)
+{
+   return unlikely(mm->mmu_notifier_mm);
+}
+
+void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+   mm->mmu_notifier_mm = NULL;
+}
+
 /*
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at


thanks,

John Hubbard
NVIDIA

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 1bd8e6a09a3c27..12bd603d318ce7 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -7,8 +7,9 @@
>  #include 
>  #include 
>  
> +struct mmu_notifier_mm;
>  struct mmu_notifier;
> -struct mmu_notifier_ops;
> +struct mmu_notifier_range;
>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -40,36 +41,8 @@ enum mmu_notifier_event {
>   MMU_NOTIFY_SOFT_DIRTY,
>  };
>  
> -#ifdef CONFIG_MMU_NOTIFIER
> -
> -#ifdef CONFIG_LOCKDEP
> -extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> -#endif
> -
> -/*
> - * The mmu notifier_mm structure is allocated and installed in
> - * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> - * critical section and it's released only when mm_count reaches zero
> - * in mmdrop().
> - */
> -struct mmu_notifier_mm {
> - /* all mmu notifiers registerd in this mm are queued in this list */
> - struct hlist_head list;
> - /* to serialize the list modifications and hlist_unhashed */
> - spinlock_t lock;
> -};
> -
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> -struct mmu_notifier_range {
> - struct vm

Re: [Nouveau] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-12-27 Thread John Hubbard
- * False is not reliable and only suggests a collision has not happened. It
- * can be called many times and does not have to hold the user provided lock.
+ * False is not reliable and only suggests a collision may not have
+ * occurred. It can be called many times and does not have to hold the user
+ * provided lock.
   *
   * This call can be used as part of loops and other expensive operations to
   * expedite a retry.
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2b7485919ecfeb..afe1e2d94183f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -51,7 +51,8 @@ struct mmu_notifier_mm {
   * This is a collision-retry read-side/write-side 'lock', a lot like a
   * seqcount, however this allows multiple write-sides to hold it at
   * once. Conceptually the write side is protecting the values of the PTEs in
- * this mm, such that PTES cannot be read into SPTEs while any writer exists.
+ * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
+ * writer exists.
   *
   * Note that the core mm creates nested invalidate_range_start()/end() regions
   * within the same thread, and runs invalidate_range_start()/end() in parallel
@@ -64,12 +65,13 @@ struct mmu_notifier_mm {
   *
   * The write side has two states, fully excluded:
   *  - mm->active_invalidate_ranges != 0
- *  - mnn->invalidate_seq & 1 == True
+ *  - mnn->invalidate_seq & 1 == True (odd)
   *  - some range on the mm_struct is being invalidated
   *  - the itree is not allowed to change
   *
   * And partially excluded:
   *  - mm->active_invalidate_ranges != 0
+ *  - mnn->invalidate_seq & 1 == False (even)
   *  - some range on the mm_struct is being invalidated
   *  - the itree is allowed to change
   *
@@ -131,12 +133,13 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
return;
}
  
+	/* Make invalidate_seq even */
 	mmn_mm->invalidate_seq++;
 	need_wake = true;
 
 	/*
 	 * The inv_end incorporates a deferred mechanism like
-	 * rtnl_lock(). Adds and removes are queued until the final inv_end
+	 * rtnl_unlock(). Adds and removes are queued until the final inv_end
 	 * happens then they are progressed. This arrangement for tree updates
 	 * is used to avoid using a blocking lock during
 	 * invalidate_range_start.
@@ -230,10 +233,11 @@ unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
 	spin_unlock(&mmn_mm->lock);
 
 	/*

-	 * mrn->invalidate_seq is always set to an odd value. This ensures
-	 * that if seq does wrap we will always clear the below sleep in some
-	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
-	 * state.
+	 * mrn->invalidate_seq must always be set to an odd value via
+	 * mmu_range_set_seq() using the provided cur_seq from
+	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
+	 * will always clear the below sleep in some reasonable time as
+	 * mmn_mm->invalidate_seq is even in the idle state.
 */
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
@@ -892,6 +896,12 @@ static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
mrn->invalidate_seq = mmn_mm->invalidate_seq;
} else {
WARN_ON(mn_itree_is_invalidating(mmn_mm));
+		/*
+		 * The starting seq for a mrn not under invalidation should be
+		 * odd, not equal to the current invalidate_seq and
+		 * invalidate_seq should not 'wrap' to the new seq any time
+		 * soon.
+		 */
mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
 		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
}
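
For readers following along: the even/odd invalidate_seq scheme in this patch
is essentially a multi-writer seqlock. A stripped-down user-space sketch of
the read-side pattern, with invented names, C11 atomics in place of the
kernel primitives, and a busy-wait where the real code can sleep:

#include <stdatomic.h>
#include <stdbool.h>

/* Invented stand-in for mmu_notifier_mm: the sequence is odd while any
 * writer (invalidation) is active, even when idle. */
struct inv_state {
	atomic_ulong invalidate_seq;
};

static unsigned long read_begin(struct inv_state *s)
{
	unsigned long seq;

	/* Wait until no writer is active (even seq), then sample it. */
	do {
		seq = atomic_load_explicit(&s->invalidate_seq,
					   memory_order_acquire);
	} while (seq & 1);
	return seq;
}

static bool read_retry(struct inv_state *s, unsigned long seq)
{
	/* If the sequence moved, an invalidation ran; redo the work. */
	return atomic_load_explicit(&s->invalidate_seq,
				    memory_order_acquire) != seq;
}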



Looks good. We're just polishing up minor points now, so you can add:

Reviewed-by: John Hubbard 



thanks,
--
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-12-27 Thread John Hubbard
> +				       &mmn_mm->deferred_list);
> +		else {
> +			mmn_mm->invalidate_seq |= 1;
> +			interval_tree_insert(&mrn->interval_tree,
> +					     &mmn_mm->itree);
> +		}
> +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> +	} else {
> +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;

Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
for seq numbers here?  I'm acutely unhappy trying to figure this out.
I suspect it's another unfortunate side effect of trying to use the
lower bit of the seq number (even/odd) for something else.

> +		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
> +	}
> +	spin_unlock(&mmn_mm->lock);
> +	return 0;
> +}
> +
> +/**
> + * mmu_range_notifier_insert - Insert a range notifier
> + * @mrn: Range notifier to register
> + * @start: Starting virtual address to monitor
> + * @length: Length of the range to monitor
> + * @mm : mm_struct to attach to
> + *
> + * This function subscribes the range notifier for notifications from the mm.
> + * Upon return the ops related to mmu_range_notifier will be called whenever
> + * an event that intersects with the given range occurs.
> + *
> + * Upon return the range_notifier may not be present in the interval tree yet.
> + * The caller must use the normal range notifier locking flow via
> + * mmu_range_read_begin() to establish SPTEs for this range.
> + */
> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +   unsigned long start, unsigned long length,
> +   struct mm_struct *mm)
> +{
> + struct mmu_notifier_mm *mmn_mm;
> + int ret;

Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
I'll check on that. It's correct here, though.

> +
> +	might_lock(&mm->mmap_sem);
> +
> +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);

What does the above pair with? Should have a comment that specifies that.
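
For reference, an smp_load_acquire() is normally paired with an
smp_store_release() at the site that publishes the pointer, and naming that
pairing is exactly what the requested comment should do. A hedged fragment --
the publish side below is invented for illustration, not quoted from the
series:

/* Hypothetical publish side: fully initialize the object first, then
 * publish the pointer with release semantics. */
	mmn_mm = kzalloc(sizeof(*mmn_mm), GFP_KERNEL);
	if (!mmn_mm)
		return -ENOMEM;
	/* ... initialize all mmn_mm fields ... */
	smp_store_release(&mm->mmu_notifier_mm, mmn_mm);

/* Reader side (the line quoted above): the acquire pairs with that
 * release, so a non-NULL result is guaranteed fully initialized. */
	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);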

 
thanks,

John Hubbard
NVIDIA


Re: [Nouveau] [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-26 Thread John Hubbard
On 6/25/19 10:45 PM, Michal Hocko wrote:
> On Tue 25-06-19 20:15:28, John Hubbard wrote:
>> On 6/19/19 12:27 PM, Jason Gunthorpe wrote:
>>> On Thu, Jun 13, 2019 at 06:23:04PM -0700, John Hubbard wrote:
>>>> On 6/13/19 5:43 PM, Ira Weiny wrote:
>>>>> On Thu, Jun 13, 2019 at 07:58:29PM +, Jason Gunthorpe wrote:
>>>>>> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote:
>>>>>>>
>>>> ...
>>>>> So I think it is ok.  Frankly I was wondering if we should remove the 
>>>>> public
>>>>> type altogether but conceptually it seems ok.  But I don't see any users 
>>>>> of it
>>>>> so...  should we get rid of it in the code rather than turning the config 
>>>>> off?
>>>>>
>>>>> Ira
>>>>
>>>> That seems reasonable. I recall that the hope was for those IBM Power 9
>>>> systems to use _PUBLIC, as they have hardware-based coherent device (GPU)
>>>> memory, and so the memory really is visible to the CPU. And the IBM team
>>>> was thinking of taking advantage of it. But I haven't seen anything on
>>>> that front for a while.
>>>
>>> Does anyone know who those people are and can we encourage them to
>>> send some patches? :)
>>>
>>
>> I asked about this, and it seems that the idea was: DEVICE_PUBLIC was there
>> in order to provide an alternative way to do things (such as migrate memory
>> to and from a device), in case the combination of existing and near-future
>> NUMA APIs was insufficient. This probably came as a follow-up to the early
>> 2017-ish conversations about NUMA, in which the linux-mm recommendation was
>> "try using HMM mechanisms, and if those are inadequate, then maybe we can
>> look at enhancing NUMA so that it has better handling of advanced (GPU-like)
>> devices".
> 
> Yes that was the original idea. It sounds so much better to use a common
> framework rather than awkward special cased cpuless NUMA nodes with
> a weird semantic. Users of neither of the two have shown up, so I guess
> that the envisioned HW just didn't materialize. Or has there been a
> completely different approach chosen?

The HW showed up, alright: it's the IBM Power 9, which provides HW-based
memory coherency between its CPUs and GPUs. So on this system, the CPU is
allowed to access GPU memory, which *could* be modeled as DEVICE_PUBLIC.

However, what happened was that the system worked well enough with a combination
of the device driver, plus NUMA APIs, plus heaven knows what sort of /proc
tuning might have also gone on. :) No one saw the need to reach for the
DEVICE_PUBLIC functionality.

> 
>> In the end, however, _PUBLIC was never used, nor does anyone in the local
>> (NVIDIA + IBM) kernel vicinity seem to have plans to use it.  So it really
>> does seem safe to remove, although of course it's good to start with 
>> BROKEN and see if anyone pops up and complains.
> 
> Well, I do not really see much of a difference. Preserving an unused
> code which doesn't have any user in sight just adds a maintenance burden
> whether the code depends on BROKEN or not. We can always revert patches
> which remove the code once a real user shows up.

Sure, I don't see much difference either. Either way seems fine.

thanks,
-- 
John Hubbard
NVIDIA

Re: [Nouveau] [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-25 Thread John Hubbard
On 6/19/19 12:27 PM, Jason Gunthorpe wrote:
> On Thu, Jun 13, 2019 at 06:23:04PM -0700, John Hubbard wrote:
>> On 6/13/19 5:43 PM, Ira Weiny wrote:
>>> On Thu, Jun 13, 2019 at 07:58:29PM +, Jason Gunthorpe wrote:
>>>> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote:
>>>>>
>> ...
>>> So I think it is ok.  Frankly I was wondering if we should remove the public
>>> type altogether but conceptually it seems ok.  But I don't see any users of 
>>> it
>>> so...  should we get rid of it in the code rather than turning the config 
>>> off?
>>>
>>> Ira
>>
>> That seems reasonable. I recall that the hope was for those IBM Power 9
>> systems to use _PUBLIC, as they have hardware-based coherent device (GPU)
>> memory, and so the memory really is visible to the CPU. And the IBM team
>> was thinking of taking advantage of it. But I haven't seen anything on
>> that front for a while.
> 
> Does anyone know who those people are and can we encourage them to
> send some patches? :)
> 

I asked about this, and it seems that the idea was: DEVICE_PUBLIC was there
in order to provide an alternative way to do things (such as migrate memory
to and from a device), in case the combination of existing and near-future
NUMA APIs was insufficient. This probably came as a follow-up to the early
2017-ish conversations about NUMA, in which the linux-mm recommendation was
"try using HMM mechanisms, and if those are inadequate, then maybe we can
look at enhancing NUMA so that it has better handling of advanced (GPU-like)
devices".

In the end, however, _PUBLIC was never used, nor does anyone in the local
(NVIDIA + IBM) kernel vicinity seem to have plans to use it.  So it really
does seem safe to remove, although of course it's good to start with 
BROKEN and see if anyone pops up and complains.

thanks,
-- 
John Hubbard
NVIDIA

Re: [PATCH 06/22] mm: factor out a devm_request_free_mem_region helper

2019-06-14 Thread John Hubbard
On 6/13/19 2:43 AM, Christoph Hellwig wrote:
> Keep the physical address allocation that hmm_add_device does with the
> rest of the resource code, and allow future reuse of it without the hmm
> wrapper.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/ioport.h |  2 ++
>  kernel/resource.c  | 39 +++
>  mm/hmm.c   | 33 -
>  3 files changed, 45 insertions(+), 29 deletions(-)

Some trivial typos noted below, but this accurately moves the code
into a helper routine, looks good.

Reviewed-by: John Hubbard  


> 
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index da0ebaec25f0..76a33ae3bf6c 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -286,6 +286,8 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
> return (r1->start <= r2->end && r1->end >= r2->start);
>  }
>  
> +struct resource *devm_request_free_mem_region(struct device *dev,
> + struct resource *base, unsigned long size);
>  
>  #endif /* __ASSEMBLY__ */
>  #endif   /* _LINUX_IOPORT_H */
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 158f04ec1d4f..99c58134ed1c 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1628,6 +1628,45 @@ void resource_list_free(struct list_head *head)
>  }
>  EXPORT_SYMBOL(resource_list_free);
>  
> +#ifdef CONFIG_DEVICE_PRIVATE
> +/**
> + * devm_request_free_mem_region - find free region for device private memory
> + *
> + * @dev: device struct to bind the resource too

   "to"

> + * @size: size in bytes of the device memory to add
> + * @base: resource tree to look in
> + *
> + * This function tries to find an empty range of physical address big enough to
> + * contain the new resource, so that it can later be hotpluged as ZONE_DEVICE

"hotplugged"

> + * memory, which in turn allocates struct pages.
> + */
> +struct resource *devm_request_free_mem_region(struct device *dev,
> + struct resource *base, unsigned long size)
> +{
> + resource_size_t end, addr;
> + struct resource *res;
> +
> + size = ALIGN(size, 1UL << PA_SECTION_SHIFT);
> + end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1);
> + addr = end - size + 1UL;
> +
> + for (; addr > size && addr >= base->start; addr -= size) {
> + if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
> + REGION_DISJOINT)
> + continue;
> +
> + res = devm_request_mem_region(dev, addr, size, dev_name(dev));
> + if (!res)
> + return ERR_PTR(-ENOMEM);
> + res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
> + return res;
> + }
> +
> + return ERR_PTR(-ERANGE);
> +}
> +EXPORT_SYMBOL_GPL(devm_request_free_mem_region);
> +#endif /* CONFIG_DEVICE_PRIVATE */
> +
>  static int __init strict_iomem(char *str)
>  {
>   if (strstr(str, "relaxed"))
> diff --git a/mm/hmm.c b/mm/hmm.c
> index e1dc98407e7b..13a16faf0a77 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -26,8 +26,6 @@
>  #include 
>  #include 
>  
> -#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
> -
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> @@ -1372,7 +1370,6 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
> 			      unsigned long size)
>  {
>   struct hmm_devmem *devmem;
> - resource_size_t addr;
>   void *result;
>   int ret;
>  
> @@ -1398,32 +1395,10 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>   if (ret)
>   return ERR_PTR(ret);
>  
> - size = ALIGN(size, PA_SECTION_SIZE);
> - addr = min((unsigned long)iomem_resource.end,
> -(1UL << MAX_PHYSMEM_BITS) - 1);
> - addr = addr - size + 1UL;
> -
> - /*
> -  * FIXME add a new helper to quickly walk resource tree and find free
> -  * range
> -  *
> -  * FIXME what about ioport_resource resource ?
> -  */
> - for (; addr > size && addr >= iomem_resource.start; addr -= size) {
> - ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
> - if (ret != REGION_DISJOINT)
> - continue;
> -
> - devmem->resource = devm_requ
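
The excerpt is cut off above, but based on the helper's signature earlier in
the patch, the converted call site in hmm_devmem_add() presumably reduces to
something like this sketch (not verbatim from the patch):

	devmem->resource = devm_request_free_mem_region(device, &iomem_resource,
							size);
	if (IS_ERR(devmem->resource))
		return ERR_CAST(devmem->resource);

	/* The PFN range then falls out of the returned resource. */
	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
	devmem->pfn_last = devmem->pfn_first +
			   (resource_size(devmem->resource) >> PAGE_SHIFT);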

Re: [Nouveau] [PATCH] drm/nouveau/dmem: missing mutex_lock in error path

2019-06-14 Thread John Hubbard
On 6/14/19 10:39 AM, Ralph Campbell wrote:
> On 6/13/19 5:49 PM, John Hubbard wrote:
>> On 6/13/19 5:11 PM, Ralph Campbell wrote:
...
>> Actually, the pre-existing code is a little concerning. Your change preserves
>> the behavior, but it seems questionable to be doing a "return 0" (whether
>> via the above break, or your change) when it's in this partially allocated
>> state. It's reporting success when it only allocates part of what was 
>> requested,
>> and it doesn't fill in the pages array either.
>>
>>
>>
>>> +    return 0;
>>>   return ret;
>>>   }
>>> +    mutex_lock(&drm->dmem->mutex);
>>>   continue;
>>>   }
>>>  
>>
>> The above comment is about pre-existing potential problems, but your patch 
>> itself
>> looks correct, so:
>>
>> Reviewed-by: John Hubbard 
>>
>>
>> thanks,
>>
> The crash was the NULL pointer bug in Christoph's patch #10.
> I sent a separate reply for that.
> 
> Below is the console output I got, then I made the changes just based on
> code inspection. Do you think I should include it in the change log?

Yes, I think it's good to have it in there. If you look at git log,
you'll see that it's common to include the symptoms, including the
backtrace. It helps people see if they are hitting the same problem,
for one thing.

> 
> As for the "return 0", If you follow the call chain,
> nouveau_dmem_pages_alloc() is only ever called for one page so this
> currently "works" but I agree it is a bit of a time bomb. There are a
> number of other bugs that I can see that need fixing but I think those
> should be separate patches.
> 

Yes of course. I called it out for the benefit of the email list, not to
say that your patch needs any changes. 

thanks,
-- 
John Hubbard
NVIDIA

Re: [Nouveau] [PATCH 22/22] mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR

2019-06-13 Thread John Hubbard
On 6/13/19 2:43 AM, Christoph Hellwig wrote:
> The migrate_vma helper is only used by noveau to migrate device private
> pages around.  Other HMM_MIRROR users like amdgpu or infiniband don't
> need it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/nouveau/Kconfig | 1 +
>  mm/Kconfig  | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig
> index 66c839d8e9d1..96b9814e6d06 100644
> --- a/drivers/gpu/drm/nouveau/Kconfig
> +++ b/drivers/gpu/drm/nouveau/Kconfig
> @@ -88,6 +88,7 @@ config DRM_NOUVEAU_SVM
>   depends on DRM_NOUVEAU
>   depends on HMM_MIRROR
>   depends on STAGING
> + select MIGRATE_VMA_HELPER
>   default n
>   help
> Say Y here if you want to enable experimental support for
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 73676cb4693f..eca88679b624 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -679,7 +679,6 @@ config HMM_MIRROR
>   bool "HMM mirror CPU page table into a device page table"
>   depends on MMU
>   select MMU_NOTIFIER
> - select MIGRATE_VMA_HELPER
>   help
> 	  Select HMM_MIRROR if you want to mirror range of the CPU page table of a
> 	  process into a device page table. Here, mirror means "keep synchronized".
> 

For those who have out of tree drivers that need migrate_vma(), but are not
Nouveau, could we pretty please allow a way to select that independently?

It's not a big deal, as I expect the Nouveau option will normally be selected,
but it would be nice, because there is a valid configuration that involves
Nouveau not being selected, but our driver still wanting to run.

Maybe we can add something like this on top of what you have?

diff --git a/mm/Kconfig b/mm/Kconfig
index eca88679b624..330996632513 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -670,7 +670,10 @@ config ZONE_DEVICE
  If FS_DAX is enabled, then say Y.
 
 config MIGRATE_VMA_HELPER
-   bool
+   bool "migrate_vma() helper routine"
+   help
+ Provides a migrate_vma() routine that GPUs and other
+ device drivers may need.
 
 config DEV_PAGEMAP_OPS
bool



thanks,
-- 
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH 05/22] mm: export alloc_pages_vma

2019-06-13 Thread John Hubbard
On 6/13/19 2:43 AM, Christoph Hellwig wrote:
> noveau is currently using this through an odd hmm wrapper, and I plan

  "nouveau"

> to switch it to the real thing later in this series.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Reviewed-by: John Hubbard  

thanks,
-- 
John Hubbard
NVIDIA

>  mm/mempolicy.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 01600d80ae01..f9023b5fba37 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2098,6 +2098,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  out:
>   return page;
>  }
> +EXPORT_SYMBOL_GPL(alloc_pages_vma);
>  
>  /**
>   *   alloc_pages_current - Allocate pages.
> 

Re: [Nouveau] [PATCH 04/22] mm: don't clear ->mapping in hmm_devmem_free

2019-06-13 Thread John Hubbard
On 6/13/19 2:43 AM, Christoph Hellwig wrote:
> ->mapping isn't even used by HMM users, and the field at the same offset
> in the zone_device part of the union is declared as pad.  (Which btw is
> rather confusing, as DAX uses ->pgmap and ->mapping from two different
> sides of the union, but DAX doesn't use hmm_devmem_free).
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/hmm.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 0c62426d1257..e1dc98407e7b 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1347,8 +1347,6 @@ static void hmm_devmem_free(struct page *page, void *data)
>  {
>   struct hmm_devmem *devmem = data;
>  
> - page->mapping = NULL;
> -
>   devmem->ops->free(devmem, page);
>  }
>  
> 

Yes, I think that line was unnecessary. I see from git history that it was
originally being set to NULL from within __put_devmap_managed_page(), and then
in commit 2fa147bdbf672c53386a8f5f2c7fe358004c3ef8, Dan moved it out of there,
and stashed it specifically here. But it appears to have been unnecessary from
the beginning.

Reviewed-by: John Hubbard  

thanks,
-- 
John Hubbard
NVIDIA

Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-13 Thread John Hubbard
On 6/13/19 5:43 PM, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 07:58:29PM +, Jason Gunthorpe wrote:
>> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote:
>>>
...
>> Hum, so the only thing this config does is short circuit here:
>>
>> static inline bool is_device_public_page(const struct page *page)
>> {
>> return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
>> IS_ENABLED(CONFIG_DEVICE_PUBLIC) &&
>> is_zone_device_page(page) &&
>> page->pgmap->type == MEMORY_DEVICE_PUBLIC;
>> }
>>
>> Which is called all over the place.. 
> 
>   yes but the earlier patch:
> 
> [PATCH 03/22] mm: remove hmm_devmem_add_resource
> 
> Removes the only place type is set to MEMORY_DEVICE_PUBLIC.
> 
> So I think it is ok.  Frankly I was wondering if we should remove the public
> type altogether but conceptually it seems ok.  But I don't see any users of it
> so...  should we get rid of it in the code rather than turning the config off?
> 
> Ira

That seems reasonable. I recall that the hope was for those IBM Power 9
systems to use _PUBLIC, as they have hardware-based coherent device (GPU)
memory, and so the memory really is visible to the CPU. And the IBM team
was thinking of taking advantage of it. But I haven't seen anything on
that front for a while.

So maybe it will get re-added as part of a future patchset to use that
kind of memory, but yes, we should not hesitate to clean house at this
point, and delete unused code.


thanks,
-- 
John Hubbard
NVIDIA

> 
>>
>> So, yes, we really don't want any distro or something to turn this on
>> until it has a use.
>>
>> Reviewed-by: Jason Gunthorpe 
>>
>> Jason


Re: [Nouveau] [PATCH 03/22] mm: remove hmm_devmem_add_resource

2019-06-13 Thread John Hubbard
On 6/13/19 2:43 AM, Christoph Hellwig wrote:
> This function has never been used since it was first added to the kernel
> more than a year and a half ago, and if we ever grow a consumer of the
> MEMORY_DEVICE_PUBLIC infrastructure it can easily use devm_memremap_pages
> directly now that we've simplified the API for it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/hmm.h |  3 ---
>  mm/hmm.c| 54 -
>  2 files changed, 57 deletions(-)
> 

No objections here, good cleanup.

Reviewed-by: John Hubbard  

thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4867b9da1b6c..5761a39221a6 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -688,9 +688,6 @@ struct hmm_devmem {
>  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
> struct device *device,
> unsigned long size);
> -struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> -struct device *device,
> -struct resource *res);
>  
>  /*
>   * hmm_devmem_page_set_drvdata - set per-page driver data field
> diff --git a/mm/hmm.c b/mm/hmm.c
> index ff2598eb7377..0c62426d1257 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1445,58 +1445,4 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>   return devmem;
>  }
>  EXPORT_SYMBOL_GPL(hmm_devmem_add);
> -
> -struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> -struct device *device,
> -struct resource *res)
> -{
> - struct hmm_devmem *devmem;
> - void *result;
> - int ret;
> -
> - if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
> - return ERR_PTR(-EINVAL);
> -
> - dev_pagemap_get_ops();
> -
> - devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
> - if (!devmem)
> - return ERR_PTR(-ENOMEM);
> -
> - init_completion(>completion);
> - devmem->pfn_first = -1UL;
> - devmem->pfn_last = -1UL;
> - devmem->resource = res;
> - devmem->device = device;
> - devmem->ops = ops;
> -
> -	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
> -			      0, GFP_KERNEL);
> -	if (ret)
> -		return ERR_PTR(ret);
> -
> -	ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit,
> -				       &devmem->ref);
> - if (ret)
> - return ERR_PTR(ret);
> -
> - devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
> - devmem->pfn_last = devmem->pfn_first +
> -(resource_size(devmem->resource) >> PAGE_SHIFT);
> - devmem->page_fault = hmm_devmem_fault;
> -
> - devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
> - devmem->pagemap.res = *devmem->resource;
> - devmem->pagemap.page_free = hmm_devmem_free;
> - devmem->pagemap.altmap_valid = false;
> -	devmem->pagemap.ref = &devmem->ref;
> -	devmem->pagemap.data = devmem;
> -	devmem->pagemap.kill = hmm_devmem_ref_kill;
> -
> -	result = devm_memremap_pages(devmem->device, &devmem->pagemap);
> - if (IS_ERR(result))
> - return result;
> - return devmem;
> -}
> -EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
>  #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> 


Re: [Nouveau] [PATCH] drm/nouveau/dmem: missing mutex_lock in error path

2019-06-13 Thread John Hubbard
On 6/13/19 5:11 PM, Ralph Campbell wrote:
> In nouveau_dmem_pages_alloc(), the drm->dmem->mutex is unlocked before
> calling nouveau_dmem_chunk_alloc().
> Reacquire the lock before continuing to the next page.
> 
> Signed-off-by: Ralph Campbell 
> ---
> 
> I found this while testing Jason Gunthorpe's hmm tree but this is
> independant of those changes. I guess it could go through
> David Airlie's tree for nouveau or Jason's tree.
> 

Hi Ralph,

btw, was this the fix for the crash you were seeing? It might be nice to
mention in the commit description, if you are seeing real symptoms.


>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 27aa4e72abe9..00f7236af1b9 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -379,9 +379,10 @@ nouveau_dmem_pages_alloc(struct nouveau_drm *drm,
>   ret = nouveau_dmem_chunk_alloc(drm);
>   if (ret) {
>   if (c)
> - break;

Actually, the pre-existing code is a little concerning. Your change preserves
the behavior, but it seems questionable to be doing a "return 0" (whether
via the above break, or your change) when it's in this partially allocated
state. It's reporting success when it only allocates part of what was requested,
and it doesn't fill in the pages array either.



> + return 0;
>   return ret;
>   }
> +	mutex_lock(&drm->dmem->mutex);
>   continue;
>   }
>  
> 
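
To make the locking expectation concrete, the rough shape of the loop after
this fix is shown below -- a paraphrase with simplified names, not the actual
nouveau code. The invariant is that drm->dmem->mutex is held whenever the
free-page list is touched, and dropped around the sleeping chunk allocation:

	mutex_lock(&drm->dmem->mutex);
	for (c = 0; c < npages;) {
		if (list_empty(&drm->dmem->free_list)) {
			/* Chunk allocation can sleep: drop the lock... */
			mutex_unlock(&drm->dmem->mutex);
			ret = nouveau_dmem_chunk_alloc(drm);
			if (ret) {
				if (c)
					return 0;	/* partial, see above */
				return ret;
			}
			/* ...and retake it before touching the list again
			 * (the line this patch adds). */
			mutex_lock(&drm->dmem->mutex);
			continue;
		}
		/* pop a page off the free list, c++, etc. */
	}
	mutex_unlock(&drm->dmem->mutex);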

The above comment is about pre-existing potential problems, but your patch
itself looks correct, so:

Reviewed-by: John Hubbard  


thanks,
-- 
John Hubbard
NVIDIA


Re: [Nouveau] [PATCH 02/22] mm: remove the struct hmm_device infrastructure

2019-06-13 Thread John Hubbard
On 6/13/19 2:43 AM, Christoph Hellwig wrote:
> This code is a trivial wrapper around device model helpers, which
> should have been integrated into the driver device model usage from
> the start.  Assuming it actually had users, which it never had since
> the code was added more than 1 1/2 years ago.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/hmm.h | 20 
>  mm/hmm.c| 80 -
>  2 files changed, 100 deletions(-)
> 

Yes. This code is definitely unnecessary, and it's a good housecleaning here.

(As to the history: I know that there was some early "HMM dummy device" 
testing when the HMM code was much younger, but such testing has long since
been superseded by more elaborate testing with real drivers.)


Reviewed-by: John Hubbard  


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 0fa8ea34ccef..4867b9da1b6c 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -717,26 +717,6 @@ static inline unsigned long 
> hmm_devmem_page_get_drvdata(const struct page *page)
>  {
>   return page->hmm_data;
>  }
> -
> -
> -/*
> - * struct hmm_device - fake device to hang device memory onto
> - *
> - * @device: device struct
> - * @minor: device minor number
> - */
> -struct hmm_device {
> - struct device   device;
> - unsigned intminor;
> -};
> -
> -/*
> - * A device driver that wants to handle multiple devices memory through a
> - * single fake device can use hmm_device to do so. This is purely a helper 
> and
> - * it is not strictly needed, in order to make use of any HMM functionality.
> - */
> -struct hmm_device *hmm_device_new(void *drvdata);
> -void hmm_device_put(struct hmm_device *hmm_device);
>  #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
>  #else /* IS_ENABLED(CONFIG_HMM) */
>  static inline void hmm_mm_destroy(struct mm_struct *mm) {}
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 886b18695b97..ff2598eb7377 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1499,84 +1499,4 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
>   return devmem;
>  }
>  EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
> -
> -/*
> - * A device driver that wants to handle multiple devices memory through a
> - * single fake device can use hmm_device to do so. This is purely a helper
> - * and it is not needed to make use of any HMM functionality.
> - */
> -#define HMM_DEVICE_MAX 256
> -
> -static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
> -static DEFINE_SPINLOCK(hmm_device_lock);
> -static struct class *hmm_device_class;
> -static dev_t hmm_device_devt;
> -
> -static void hmm_device_release(struct device *device)
> -{
> - struct hmm_device *hmm_device;
> -
> - hmm_device = container_of(device, struct hmm_device, device);
> -	spin_lock(&hmm_device_lock);
> -	clear_bit(hmm_device->minor, hmm_device_mask);
> -	spin_unlock(&hmm_device_lock);
> -
> - kfree(hmm_device);
> -}
> -
> -struct hmm_device *hmm_device_new(void *drvdata)
> -{
> - struct hmm_device *hmm_device;
> -
> - hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
> - if (!hmm_device)
> - return ERR_PTR(-ENOMEM);
> -
> -	spin_lock(&hmm_device_lock);
> -	hmm_device->minor = find_first_zero_bit(hmm_device_mask,
> -						HMM_DEVICE_MAX);
> -	if (hmm_device->minor >= HMM_DEVICE_MAX) {
> -		spin_unlock(&hmm_device_lock);
> -		kfree(hmm_device);
> -		return ERR_PTR(-EBUSY);
> -	}
> -	set_bit(hmm_device->minor, hmm_device_mask);
> -	spin_unlock(&hmm_device_lock);
> -
> -	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
> -	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
> -					hmm_device->minor);
> -	hmm_device->device.release = hmm_device_release;
> -	dev_set_drvdata(&hmm_device->device, drvdata);
> -	hmm_device->device.class = hmm_device_class;
> -	device_initialize(&hmm_device->device);
> -
> - return hmm_device;
> -}
> -EXPORT_SYMBOL(hmm_device_new);
> -
> -void hmm_device_put(struct hmm_device *hmm_device)
> -{
> -	put_device(&hmm_device->device);
> -}
> -EXPORT_SYMBOL(hmm_device_put);
> -
> -static int __init hmm_init(void)
> -{
> - int ret;
> -
> -	ret = alloc_chrdev_region(&hmm_device_devt, 0,
> -				  HMM_DEVICE_MAX,
> -				  "hmm_device");
> - if (ret)
> - return ret;
> -
> - hmm_device_class = class_create(THIS_MODULE, "hmm_device");
> - if (IS_ERR(hmm_device_class)) {
> - unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
> - return PTR_ERR(hmm_device_class);
> - }
> - return 0;
> -}
> -
> -device_initcall(hmm_init);
>  #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> 


Re: [Nouveau] [PATCH] drm/nouveau/secboot/acr: fix memory leak

2018-09-07 Thread John Hubbard
On 8/2/18 12:51 PM, Gustavo A. R. Silva wrote:
> Hi all,
> 
> Friendly ping! Who can take this?
> 
> Thanks
> --
> Gustavo
> 
> On 07/24/2018 08:27 AM, Gustavo A. R. Silva wrote:
>> In case memory resources for *bl_desc* were allocated, release
>> them before return.
>>
>> Addresses-Coverity-ID: 1472021 ("Resource leak")
>> Fixes: 0d466901552a ("drm/nouveau/secboot/acr: Remove VLA usage")
>> Signed-off-by: Gustavo A. R. Silva 
>> ---
>>  drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c b/drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c
>> index d02e183..5c14d6a 100644
>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c
>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c
>> @@ -801,6 +801,7 @@ acr_r352_load(struct nvkm_acr *_acr, struct nvkm_falcon *falcon,
>>  bl = acr->hsbl_unload_blob;
>>  } else {
>>  nvkm_error(_acr->subdev, "invalid secure boot blob!\n");
>> +kfree(bl_desc);
>>  return -EINVAL;
>>  }
>>  
>>

Hi Gustavo,

Seeing as how I've been working on this corner of Nouveau lately (don't ask,
haha), I reviewed and also tested this. It looks good, you can add:

Reviewed-by: John Hubbard 
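
As a general pattern note, leaks like this one are often prevented outright
by funneling every exit path through a single cleanup label, so no individual
branch has to remember the kfree(). A generic sketch with hypothetical names
(struct bl_desc and blob_is_valid() are invented), not the acr code itself:

static int example_load(void)
{
	struct bl_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);
	int ret = 0;

	if (!desc)
		return -ENOMEM;

	if (!blob_is_valid()) {		/* hypothetical check */
		ret = -EINVAL;
		goto out;		/* one exit point frees desc */
	}

	/* ... use desc ... */
out:
	kfree(desc);
	return ret;
}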

thanks,
-- 
John Hubbard
NVIDIA


Re: [Nouveau] [RFC PATCH 00/13] SVM (share virtual memory) with HMM in nouveau

2018-03-13 Thread John Hubbard
On 03/12/2018 10:50 AM, Jerome Glisse wrote:
> On Mon, Mar 12, 2018 at 06:30:09PM +0100, Daniel Vetter wrote:
>> On Sat, Mar 10, 2018 at 04:01:58PM +0100, Christian König wrote:
> 
> [...]
> 
>>>> They are work underway to revamp nouveau channel creation with a new
>>>> userspace API. So we might want to delay upstreaming until this lands.
>>>> We can stil discuss one aspect specific to HMM here namely the issue
>>>> around GEM objects used for some specific part of the GPU. Some engine
>>>> inside the GPU (engine are a GPU block like the display block which
>>>> is responsible of scaning memory to send out a picture through some
>>>> connector for instance HDMI or DisplayPort) can only access memory
>>>> with virtual address below (1 << 40). To accomodate those we need to
>>>> create a "hole" inside the process address space. This patchset have
>>>> a hack for that (patch 13 HACK FOR HMM AREA), it reserves a range of
>>>> device file offset so that process can mmap this range with PROT_NONE
>>>> to create a hole (process must make sure the hole is below 1 << 40).
>>>> I feel un-easy of doing it this way but maybe it is ok with other
>>>> folks.
>>>
>>> Well we have essentially the same problem with pre gfx9 AMD hardware. Felix
>>> might have some advise how it was solved for HSA.
>>
>> Couldn't we do an in-kernel address space for those special gpu blocks? As
>> long as it's display the kernel needs to manage it anyway, and adding a
>> 2nd mapping when you pin/unpin for scanout usage shouldn't really matter
>> (as long as you cache the mapping until the buffer gets thrown out of
>> vram). More-or-less what we do for i915 (where we have an entirely
>> separate address space for these things which is 4G on the latest chips).
>> -Daniel
> 
> We can not do an in-kernel address space for those. We already have an
> in kernel address space but it does not apply for the object considered
> here.
> 
> For NVidia (i believe this is the same for AMD AFAIK) the objects we
> are talking about are objects that must be in the same address space
> as the one against which process's shader/dma/... get executed.
> 
> For instance command buffer submited by userspace must be inside a
> GEM object mapped inside the GPU's process address against which the
> command are executed. My understanding is that the PFIFO (the engine
> on nv GPU that fetch commands) first context switch to address space
> associated with the channel and then starts fetching commands with
> all address being interpreted against the channel address space.
> 
> Hence why we need to reserve some range in the process virtual address
> space if we want to do SVM in a sane way. I mean we could just map
> buffer into GPU page table and then cross fingers and toes hopping that
> the process will never get any of its mmap overlapping those mapping :)
> 
> Cheers,
> Jérôme
> 

Hi Jerome and all,

Yes, on NVIDIA GPUs, the Host/FIFO unit is limited to 40-bit addresses, so
things such as the following need to be below (1 << 40), and also accessible 
to both CPU (user space) and GPU hardware. 
-- command buffers (CPU user space driver fills them, GPU consumes them), 
-- semaphores (here, a GPU-centric term, rather than OS-type: these are
   memory locations that, for example, the GPU hardware might write to, in
   order to indicate work completion; there are other uses as well), 
-- a few other things most likely (this is not a complete list).

So what I'd tentatively expect that to translate into in the driver stack is,
approximately:

-- User space driver code mmap's an area below (1 << 40). It's hard to avoid
   this, given that user space needs access to the area (for filling out
   command buffers and monitoring semaphores, that sort of thing). Then
   suballocate from there using mmap's MAP_FIXED or (future-ish)
   MAP_FIXED_SAFE flags. (A sketch of this reservation step follows below.)

   ...glancing at the other fork of this thread, I think that is exactly what
   Felix is saying, too. So that's good.

-- The user space program sits above the user space driver, and although the
   program could, in theory, interfere with this mmap'd area, that would be
   wrong in the same way that mucking around with malloc'd areas (outside of
   malloc() itself) is wrong. So I don't see any particular need to do much
   more than the above.
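
The reservation step itself could look something like the following
user-space sketch. The address and size are illustrative only, and
MAP_FIXED_NOREPLACE is used on the assumption that it is the flag the
proposed MAP_FIXED_SAFE eventually landed as; plain MAP_FIXED works too if
clobbering existing mappings is acceptable:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Reserve a 1 GiB PROT_NONE hole at a fixed address below (1 << 40),
	 * failing rather than clobbering an existing mapping. */
	void *want = (void *)(UINT64_C(1) << 39);
	size_t len = 1UL << 30;
	void *hole = mmap(want, len, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
			  -1, 0);

	if (hole == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("reserved hole at %p\n", hole);
	return 0;
}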

thanks,
-- 
John Hubbard
NVIDIA


Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2018-02-06 Thread John Hubbard
On 01/28/2018 04:05 PM, Martin Peres wrote:
> On 29/01/18 01:24, Martin Peres wrote:
>> On 28/11/17 07:32, John Hubbard wrote:
>>> On 11/23/2017 02:48 PM, Martin Peres wrote:
>>>> On 23/11/17 10:06, John Hubbard wrote:
>>>>> On 11/22/2017 05:07 PM, Martin Peres wrote:
>>>>>> Hey,
>>>>>>
>>>>>> Thanks for your answer, Andy!
>>>>>>
>>>>>> On 22/11/17 04:06, Ilia Mirkin wrote:
>>>>>>> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <arit...@nvidia.com> wrote:
>>>>>>> Martin's question was very long, but it boils down to this:
>>>>>>>
>>>>>>> How do we compute the correct values to write into the e114/e118 pwm
>>>>>>> registers based on the VBIOS contents and current state of the board
>>>>>>> (like temperature).
>>>>>>
>>>>>> Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on
>>>>>> GF119+, or 0x200cd/d0 on Kepler+.
>>>>>>
>>>>>> At least, it looks like we know which PWM controler we need to drive, so
>>>>>> I did not want to muddy the water even more by giving register
>>>>>> addresses, rather concentrating on the problem at hand: How to compute
>>>>>> the duty value for the PWM controler.
>>>>>>
>>>>>>>
>>>>>>> We generally do this right, but appear to get it extra-wrong for 
>>>>>>> certain GPUs.
>>>>>>
>>>>>> Yes... So far, we are always safe, but users tend to mind when their
>>>>>> computer sound like a jumbo jet at take off... Who would have thought? :D
>>>>>>
>>>>>> Anyway, looking forward to your answer!
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>
[...]

Hi Martin,

I strongly suspect you are seeing a special behavior, which is: on
some GF108 boards we use only a very limited range of PWM,
0.4 to 2.5%, due to the particular type of DC power conversion
circuit on those boards. However, it could also just be difficulties
in interpreting the fixed-point variables in the tables. In either
case, the answer is to explain those formats, so I'll do that now.

I am attaching the fan cooler table, in HTML format. We have also
published the BIT (BIOS Information Table) format, separately:


http://download.nvidia.com/open-gpu-doc/BIOS-Information-Table/1/BIOS-Information-Table.html

, but I don't think it has any surprises for you, in this regard. You
can check it, to be sure you're looking at the right subtable, though,
just in case.

The interesting parts of that table are:

PWM Scale Slope (16 bits):

  Slope to scale effective PWM to actual PWM (1/4096, F4.12, signed).
  For backwards compatibility, a value of 0.0 (0x0000) is interpreted as 1.0
  (0x1000). This value is used to scale the effective PWM duty cycle, a
  conceptual fraction of full speed (0% to 100%), to the actual electrical
  PWM duty cycle.
  PWM(actual) = Slope × PWM(effective) + Offset

PWM Scale Offset (16 bits):

  Offset to scale effective PWM to actual PWM (1/4096, F4.12, signed).
  This value is used to scale the effective PWM duty cycle, a conceptual
  fraction of full speed (0% to 100%), to the actual electrical PWM duty
  cycle.
  PWM(actual) = Slope × PWM(effective) + Offset
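
Before the code excerpts below, the F4.12 decode and the scaling formula can
be restated as a small user-space sketch. The doubles are for clarity only --
the driver does this in integer arithmetic -- and the table values in main()
are made up:

#include <stdint.h>
#include <stdio.h>

/* F4.12 signed fixed point: 4 integer bits, 12 fractional, 1/4096 per LSB. */
static double f4_12_to_double(int16_t raw)
{
	return (double)raw / 4096.0;
}

/* pwm_effective is the conceptual fraction of full speed, 0.0..1.0. */
static double pwm_actual(int16_t slope_raw, int16_t offset_raw,
			 double pwm_effective)
{
	/* Backwards compatibility: slope 0x0000 means 1.0 (0x1000). */
	double slope = slope_raw ? f4_12_to_double(slope_raw) : 1.0;
	double offset = f4_12_to_double(offset_raw);

	return slope * pwm_effective + offset;
}

int main(void)
{
	/* Hypothetical table values: slope 0.5 (0x0800), offset ~2% (0x0052). */
	printf("actual PWM at 100%% effective: %f\n",
	       pwm_actual(0x0800, 0x0052, 1.0));
	return 0;
}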


However, the calculations are hard to get right, and the table stores
values in fixed-point format, so I'm showing a few simplified code excerpts
that use these. The various fixed point macro definitions are found as part of
our normal driver package, in nvmisc.h and nvtypes.h. Any other definitions
that you need are included right here (I ran a quick compiler check to be sure.)

#define VBIOS_THERM_COOLER_TABLE_10_ENTRY_SIZE_10     0x0010

// Fan Cooler Table Version
#define NV_FAN_COOLER_TABLE_V2_VERSION_10             0x10

// We limit the size of V2 tables
#define NV_FAN_COOLER_TABLE_V2_MAX_ENTRIES            10

// Entry skip.
#define NV_FAN_COOLER_TABLE_V2_SKIP_ENTRY             0x00FF

#define NV_FAN_COOLER_TABLE_TYPE                      3:0   // field 1 - dword
#define NV_FAN_COOLER_TABLE_TYPE_PASSIVE_HEAT_SINK    0x0000
#define NV_FAN_COOLER_TABLE_TYPE_ACTIVE_FAN_SINK      0x0001
#define NV_FAN_COOLER_TABLE_TYPE_ACTIVE_SKIP          0x000F
#define NV_FAN_COOLER_TABLE_TYPE_DEFAULT              NV_FAN_COOLER_TABLE_TYPE_PASSIVE_HEAT_SINK

#define NV_FAN_COOLER_TABLE_TARGET_AFFINITY  

Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2018-01-28 Thread John Hubbard
On 01/28/2018 04:05 PM, Martin Peres wrote:
> On 29/01/18 01:24, Martin Peres wrote:
>> On 28/11/17 07:32, John Hubbard wrote:
>>> On 11/23/2017 02:48 PM, Martin Peres wrote:
>>>> On 23/11/17 10:06, John Hubbard wrote:
>>>>> On 11/22/2017 05:07 PM, Martin Peres wrote:
>>>>>> Hey,
>>>>>>
>>>>>> Thanks for your answer, Andy!
>>>>>>
>>>>>> On 22/11/17 04:06, Ilia Mirkin wrote:
>>>>>>> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <arit...@nvidia.com> wrote:
>>>>>>> Martin's question was very long, but it boils down to this:
>>>>>>>
>>>>>>> How do we compute the correct values to write into the e114/e118 pwm
>>>>>>> registers based on the VBIOS contents and current state of the board
>>>>>>> (like temperature).
>>>>>>
>>>>>> Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on
>>>>>> GF119+, or 0x200cd/d0 on Kepler+.
>>>>>>
>>>>>> At least, it looks like we know which PWM controler we need to drive, so
>>>>>> I did not want to muddy the water even more by giving register
>>>>>> addresses, rather concentrating on the problem at hand: How to compute
>>>>>> the duty value for the PWM controler.
>>>>>>
>>>>>>>
>>>>>>> We generally do this right, but appear to get it extra-wrong for 
>>>>>>> certain GPUs.
>>>>>>
>>>>>> Yes... So far, we are always safe, but users tend to mind when their
>>>>>> computer sound like a jumbo jet at take off... Who would have thought? :D
>>>>>>
>>>>>> Anyway, looking forward to your answer!
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> One of our firmware engineers thinks that this looks a lot like PWM 
>>>>> inversion.
>>>>> For some SKUs, the interpretation of the PWM duty cycle is inverted. That 
>>>>> would probably make it *very* difficult to find a sensible algorithm that 
>>>>> covered all the SKUs, given that some are inverted and others are not.
>>>>>
>>>>> For the noisy GPUs, a very useful experiment would be to try inverting 
>>>>> it, 
>>>>> like this:
>>>>>
>>>>>   pwmDutyCycle = pwmPeriod - pwmDutyCycle;
>>>>>
>>>>> ...and then see if fan control starts behaving closer to how you've 
>>>>> actually 
>>>>> programmed it.
>>>>>
>>>>> Would that be easy enough to try out? It should help narrow down the
>>>>> problem at least.
>>>>>
>>>>
>>>> Hey John,
>>>>
>>>> Unfortunately, we know about PWM inversion, and one can know which mode
>>>> to use based on the GPIO entry associated to the fan (inverted). We have
>>>> had support for this in Nouveau for a long time. At the very least, this
>>>> is not the problem on my GF108.
>>>>
>>>> I am certain that the problem I am seeing is related to this vbios table
>>>> I wrote about (BIT P, offset 0x18). It is used to compute what PWM duty
>>>> I should use for both 0 and 100% of the fan speed.
>>>>
>>>> Computing the value for 0% fan speed is difficult because of
>>>> non-continuous nature of some of the functions[1], but I can always
>>>> over-approximate. However, I failed to accurately compute the duty I
>>>> need to write to get the 100% fan speed (I have cases where I greatly
>>>> over-estimate it...).
>>>>
>>>> Could you please check out the vbios table I am pointing at? I am quite
>>>> sure that your documentation will be clearer than my babbling :D
>>>
>>> Yes. We will check on this. There has been some productive discussion 
>>> internally, but it will take some more investigation.
>>>
>>> thanks,
>>> John Hubbard
>>
>> Have the productive discussions panned out?

Yes, we concluded our discussions, and decided that I should study the
situation and write some documentation. I just finished my research and
writeup late last Friday, though, so my colleagues haven't had a chance to
review it. Not to put undue pre

Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2017-11-27 Thread John Hubbard
On 11/23/2017 02:48 PM, Martin Peres wrote:
> On 23/11/17 10:06, John Hubbard wrote:
>> On 11/22/2017 05:07 PM, Martin Peres wrote:
>>> Hey,
>>>
>>> Thanks for your answer, Andy!
>>>
>>> On 22/11/17 04:06, Ilia Mirkin wrote:
>>>> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <arit...@nvidia.com> wrote:
>>>> Martin's question was very long, but it boils down to this:
>>>>
>>>> How do we compute the correct values to write into the e114/e118 pwm
>>>> registers based on the VBIOS contents and current state of the board
>>>> (like temperature).
>>>
>>> Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on
>>> GF119+, or 0x200cd/d0 on Kepler+.
>>>
>>> At least, it looks like we know which PWM controler we need to drive, so
>>> I did not want to muddy the water even more by giving register
>>> addresses, rather concentrating on the problem at hand: How to compute
>>> the duty value for the PWM controler.
>>>
>>>>
>>>> We generally do this right, but appear to get it extra-wrong for certain 
>>>> GPUs.
>>>
>>> Yes... So far, we are always safe, but users tend to mind when their
>>> computer sound like a jumbo jet at take off... Who would have thought? :D
>>>
>>> Anyway, looking forward to your answer!
>>>
>>> Cheers,
>>> Martin
>>
>>
>> Hi Martin,
>>
>> One of our firmware engineers thinks that this looks a lot like PWM 
>> inversion.
>> For some SKUs, the interpretation of the PWM duty cycle is inverted. That 
>> would probably make it *very* difficult to find a sensible algorithm that 
>> covered all the SKUs, given that some are inverted and others are not.
>>
>> For the noisy GPUs, a very useful experiment would be to try inverting it, 
>> like this:
>>
>>  pwmDutyCycle = pwmPeriod - pwmDutyCycle;
>>
>> ...and then see if fan control starts behaving closer to how you've actually 
>> programmed it.
>>
>> Would that be easy enough to try out? It should help narrow down the
>> problem at least.
>>
> 
> Hey John,
> 
> Unfortunately, we know about PWM inversion, and one can know which mode
> to use based on the GPIO entry associated to the fan (inverted). We have
> had support for this in Nouveau for a long time. At the very least, this
> is not the problem on my GF108.
> 
> I am certain that the problem I am seeing is related to this vbios table
> I wrote about (BIT P, offset 0x18). It is used to compute what PWM duty
> I should use for both 0 and 100% of the fan speed.
> 
> Computing the value for 0% fan speed is difficult because of
> non-continuous nature of some of the functions[1], but I can always
> over-approximate. However, I failed to accurately compute the duty I
> need to write to get the 100% fan speed (I have cases where I greatly
> over-estimate it...).
> 
> Could you please check out the vbios table I am pointing at? I am quite
> sure that your documentation will be clearer than my babbling :D

Yes. We will check on this. There has been some productive discussion 
internally, but it will take some more investigation.

thanks,
John Hubbard


> 
> Thanks,
> Martin
> 
> [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
> 


Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2017-11-23 Thread John Hubbard
On 11/22/2017 05:07 PM, Martin Peres wrote:
> Hey,
> 
> Thanks for your answer, Andy!
> 
> On 22/11/17 04:06, Ilia Mirkin wrote:
>> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <arit...@nvidia.com> wrote:
>> Martin's question was very long, but it boils down to this:
>>
>> How do we compute the correct values to write into the e114/e118 pwm
>> registers based on the VBIOS contents and current state of the board
>> (like temperature).
> 
> Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on
> GF119+, or 0x200cd/d0 on Kepler+.
> 
> At least, it looks like we know which PWM controler we need to drive, so
> I did not want to muddy the water even more by giving register
> addresses, rather concentrating on the problem at hand: How to compute
> the duty value for the PWM controler.
> 
>>
>> We generally do this right, but appear to get it extra-wrong for certain 
>> GPUs.
> 
> Yes... So far, we are always safe, but users tend to mind when their
> computer sound like a jumbo jet at take off... Who would have thought? :D
> 
> Anyway, looking forward to your answer!
> 
> Cheers,
> Martin


Hi Martin,

One of our firmware engineers thinks that this looks a lot like PWM inversion.
For some SKUs, the interpretation of the PWM duty cycle is inverted. That 
would probably make it *very* difficult to find a sensible algorithm that 
covered all the SKUs, given that some are inverted and others are not.

For the noisy GPUs, a very useful experiment would be to try inverting it, 
like this:

pwmDutyCycle = pwmPeriod - pwmDutyCycle;

...and then see if fan control starts behaving closer to how you've actually 
programmed it.

Would that be easy enough to try out? It should help narrow down the
problem at least.

thanks,
John Hubbard


Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2017-11-14 Thread John Hubbard
Hi Martin,

This is just a quick ACK. I've started an internal email thread and
we'll see if we can get back to you soon.

Yes, our thermal and fan control definitely changes a lot with
the various chip architectures. I'm continually impressed by how much
the SW+HW has been able to improve performance per watt, year after
year, but of course the side effect is a very complex system, as 
you are seeing. But even so, let's see if there is any sort of
simpler approximation that would work for you here...no promises,
because I'm about to be humbled when the thermal experts respond. :)


thanks,
John Hubbard
NVIDIA

On 11/12/2017 06:29 PM, Martin Peres wrote:
> Hello,
> 
> Some users have been complaining for years about their GPU sounding like
> a jet engine at take off. Last year, I finally laid my hand on one of
> these GPUs and have been trying to fix this issue on and off since then.
> 
> After failing to find anything in the HW, I figured out that the duty
> cycle set by nvidia's proprietary driver would be way under the expected
> value. By randomly changing values in the unknown tables of the vbios, I
> found out that there is a fan calibration table at the offset 0x18 in
> the BIT P table (version 2).
> 
> In this table, I identified 2 major 16 bits parameters at offset 0xa and
> 0xc[2]. The first one, I named pwm_max, while naming the latter
> pwm_offset. As expected, these parameters look like a mapping function
> of the form aX + b. However, after gathering more samples, I found out
> that the output was not continuous when linearly increasing pwm_offset
> [1]. Even more funnily, the period of this square function is linear
> with the frequency used for the fan's PWM.
> 
> I tried reverse engineering the formula to describe this function, but
> failed to find a version that would work perfectly for all PWM
> frequency. This is the closest I have got to[3], and I basically stopped
> there about a year ago because I could not figure it out and got
> frustrated :s.
> 
> I started again on this project 2 weeks ago, with the intent of finding
> a good-enough solution for nouveau, and modelling the rest of the
> equation that would allow me to compute what duty I should set for
> every wanted fan speed (%). I again mostly succeeded... but it would
> seem that the interpretation of the table depends on the generation of
> chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the
> proprietary driver is not consistent for rules such as what to do when the
> computed duty value is going to be lower than 0 or not (sometimes we
> clamp it to 0, some times we set it to the same value as the divider,
> some times we set it to a slightly lower value than the divider).
> 
> I have been trying to cover all edge cases by generating a randomized
> set of values for the PWM frequency, pwm_max, and pwm_offset values,
> flashed the vbios, and iterate from 0% to 100% fan speed while dumping
> the values set by your driver. Using half a million sample points (which
> took a week to acquire), my model computes 97% of the values correctly
> (ignoring off by ones), while the remaining 3% are worryingly off (by up
> to 100%)... It is clear that the code is not trivial and is full of
> branching, which makes clean-room reverse engineering a chore.
> 
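(A harness mirroring that methodology could be as simple as the following
sketch, reusing fan_percent_to_duty from above; the sample layout and the
off-by-one pass criterion are assumptions based on the description in the
preceding paragraph.)

#include <stdint.h>
#include <stdlib.h>

/* One randomized sample: vbios parameters, the requested speed, and
 * the duty value dumped from the proprietary driver for comparison. */
struct fan_sample {
	uint32_t divider;    /* PWM period derived from the fan frequency */
	uint16_t pwm_max;
	int16_t  pwm_offset;
	uint32_t percent;    /* requested fan speed, 0-100 */
	uint32_t duty_ref;   /* duty observed from the proprietary driver */
};

/* Fraction of samples the model reproduces, ignoring off-by-ones. */
static double model_pass_rate(const struct fan_sample *s, size_t n)
{
	size_t pass = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t duty = fan_percent_to_duty(s[i].percent, s[i].pwm_max,
						    s[i].pwm_offset,
						    s[i].divider);
		if (abs((int)duty - (int)s[i].duty_ref) <= 1)
			pass++;
	}
	return n ? (double)pass / (double)n : 0.0;
}
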
> As a final attempt to make a somewhat complete solution, I tried this
> weekend to make a "safe" model that would still make the GPUs quiet. I
> managed to improve the pass rate from 97% to 99.6%, but the remaining
> failures conflict with my previous findings, which are also way more
> prevalent. In the end, the only completely safe way of driving the fan
> is the current behaviour of nouveau...
> 
> At this point, I am ready to throw in the towel and hardcode parameters
> in nouveau to address the problem of the loudest GPUs, but this is of
> course suboptimal. This is why I am asking for your help. Would you have
> some documentation about this fan calibration table that could help me
> here? Code would be even more appreciated.
> 
> Thanks a lot in advance,
> Martin
> 
> PS: here is most of the code you may want to see:
> http://fs.mupuf.org/nvidia/fan_calib/
> 
> [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
> [2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
> [3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
> 


Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2017-11-12 Thread John Hubbard
On 11/12/2017 06:29 PM, Martin Peres wrote:
> Hello,
> 
> Some users have been complaining for years about their GPU sounding like
> a jet engine at takeoff. Last year, I finally laid my hands on one of
> these GPUs and have been trying to fix this issue on and off since then.

Some early feedback: can you tell us the exact SKUs you have? And are these
production boards with production VBIOSes?  

Normally, it's just our bringup boards that we'd expect to be noisy like 
this, so we're looking for a few more details.

thanks,
John Hubbard
NVIDIA

> 
> After failing to find anything in the HW, I figured out that the duty
> cycle set by nvidia's proprietary driver would be way under the expected
> value. By randomly changing values in the unknown tables of the vbios, I
> found out that there is a fan calibration table at the offset 0x18 in
> the BIT P table (version 2).
> 
> In this table, I identified 2 major 16 bits parameters at offset 0xa and
> 0xc[2]. The first one, I named pwm_max, while naming the latter
> pwm_offset. As expected, these parameters look like a mapping function
> of the form aX + b. However, after gathering more samples, I found out
> that the output was not continuous when linearly increasing pwm_offset
> [1]. Even more funnily, the period of this square function is linear
> with the frequency used for the fan's PWN.
> 
> I tried reverse engineering the formula to describe this function, but
> failed to find a version that would work perfectly for all PWM
> frequency. This is the closest I have got to[3], and I basically stopped
> there about a year ago because I could not figure it out and got
> frustrated :s.
> 
> I started again on this project 2 weeks ago, with the intent of finding
> a good-enough solution for nouveau, and modelling the rest of the
> equation that that would allow me to compute what duty I should set for
> every wanted fan speed (%). I again mostly succeeded... but it would
> seem that the interpretation of the table depends on the generation of
> chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the
> proprietary is not consistent for rules such as what to do when the
> computed duty value is going to be lower than 0 or not (sometimes we
> clamp it to 0, some times we set it to the same value as the divider,
> some times we set it to a slightly lower value than the divider).
> 
> I have been trying to cover all edge cases by generating a randomized
> set of values for the PWM frequency, pwm_max, and pwm_offset values,
> flashed the vbios, and iterate from 0% to 100% fan speed while dumping
> the values set by your driver. Using half a million sample points (which
> took a week to acquire), my model computes 97% of the values correctly
> (ignoring off by ones), while the remaining 3% are worryingly off (by up
> to 100%)... It is clear that the code is not trivial and is full of
> branching, which makes clean-room reverse engineering a chore.
> 
> As a final attempt to make a somewhat complete solution, I tried this
> weekend to make a "safe" model that would still make the GPUs quiet. I
> managed to improve the pass rate from 97 to 99.6%, but the remaining
> failures conflict with my previous findings, which are also way more
> prevalent. In the end, the only completely-safe way of driving the fan
> is the current behaviour of nouveau...
> 
> At this point, I am ready to throw in the towel and hardcode parameters
> in nouveau to address the problem of the loudest GPUs, but this is of
> course suboptimal. This is why I am asking for your help. Would you have
> some documentation about this fan calibration table that could help me
> here? Code would be even more appreciated.
> 
> Thanks a lot in advance,
> Martin
> 
> PS: here is most of the code you may want to see:
> http://fs.mupuf.org/nvidia/fan_calib/
> 
> [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
> [2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
> [3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
> 
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau