Re: [Nouveau] 5.12.1 0010:nvkm_falcon_v1_wait_for_halt+0x8f/0xb9 [nouveau]

2021-05-24 Thread Ben Skeggs
On Fri, 7 May 2021 at 00:50, Bjorn Helgaas  wrote:
>
> [+cc Ben]
>
> Hi Marc,
>
> Thanks for paying attention to these things.  I added Ben (who
> probably would see this via nouveau@lists.freedesktop.org anyway).
> I don't see a PCI issue here, but the nouveau timeout, which I know
> nothing about, does look like it could be interesting.
This is likely from a bug that snuck into linux-firmware, I've sent a
patch[1] recently that will probably solve this.

Ben.

[1] 
https://lore.kernel.org/linux-firmware/20210518063631.5072-1-bske...@redhat.com/T/#u

>
> On Wed, May 05, 2021 at 02:42:27PM -0700, Marc MERLIN wrote:
> > Howdy,
> > I upgraded my thinkpad P73 from 5.9 to 5.12, and I now get this new
> > ug at boot (although the system does continue booting and display works
> > since I use i915 for display and only use nouveau for PM)
> >
> > Short:
> > [   18.561181] WARNING: CPU: 15 PID: 220 at 
> > drivers/gpu/drm/nouveau/nvkm/falcon/v1.c:247 
> > nvkm_falcon_v1_wait_for_halt+0x8f/0xb9 [nouveau]
> > [   18.561300] Modules linked in: dm_crypt trusted tpm rng_core dm_mod 
> > raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx 
> > multipath sata_sil24 r8169 realtek mdio_devres libphy mii hid_generic 
> > usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel xhci_pci 
> > rtsx_pci_sdmmc nouveau ghash_clmulni_intel xhci_hcd mmc_core e1000e 
> > i2c_designware_platform mxm_wmi i2c_designware_core hwmon ptp aesni_intel 
> > intel_lpss_pci drm_ttm_helper i2c_i801 crypto_simd intel_lpss i2c_smbus 
> > psmouse i915 cryptd pps_core thunderbolt rtsx_pci idma64 usbcore ttm 
> > i2c_nvidia_gpu thermal wmi battery
> > [   18.561636] CPU: 15 PID: 220 Comm: kworker/15:2 Tainted: G U 
> >5.12.1-amd64-preempt-sysrq-20190817 #1
> > [   18.561707] Hardware name: LENOVO 20QRS00200/20QRS00200, BIOS N2NET40W 
> > (1.25 ) 08/26/2020
> > [   18.561765] Workqueue: pm pm_runtime_work
> > [   18.561799] RIP: 0010:nvkm_falcon_v1_wait_for_halt+0x8f/0xb9 [nouveau]
> >
> > Despite the warning, chip seems to go to sleep on batteries, poewertop
> > shows an encouraging low battery use (my lowest one yet of any kernel):
> > The battery reports a discharge rate of 10.7 W
> > The power consumed was 230 J
> >
> > So it seems that what I need from nouveau is working (power management)
> >
> > Full warning below with logs
> >
> >
> > Long:
> > [0.00] Linux version 5.12.1-amd64-preempt-sysrq-20190817 
> > (r...@sauron.svh.merlins.org) (gcc (Debian 10.2.1-3) 10.2.1 20201224, GNU 
> > ld (GNU Binutils for Debian) 2.35.1) #1 SMP PREEMPT Wed May 5 13:05:02 PDT 
> > 2021
> > [0.00] Command line: 
> > BOOT_IMAGE=/vmlinuz-5.12.1-amd64-preempt-sysrq-20190817 
> > root=/dev/mapper/cryptroot ro rootflags=subvol=root 
> > cryptopts=source=/dev/nvme0n1p7,keyscript=/sbin/cryptgetpw 
> > usbcore.autosuspend=1 pcie_aspm=force resume=/dev/dm-1 
> > acpi_backlight=vendor nouveau.debug=disp=trace
> > [8.672663] nouveau :01:00.0: runtime IRQ mapping not provided by 
> > arch
> > [8.677434] nouveau :01:00.0: enabling device ( -> 0003)
> > [8.691872] nouveau :01:00.0: NVIDIA TU104 (164000a1)
> > [8.789240] nouveau :01:00.0: bios: version 90.04.4d.00.2c
> > [8.789605] nouveau :01:00.0: pmu: firmware unavailable
> > [8.789897] nouveau :01:00.0: enabling bus mastering
> > [8.789978] nouveau :01:00.0: disp: preinit running...
> > [8.789981] nouveau :01:00.0: disp: preinit completed in 0us
> > [8.789997] nouveau :01:00.0: disp: fini running...
> > [8.78] nouveau :01:00.0: disp: fini completed in 0us
> > [8.790189] nouveau :01:00.0: fb: 8192 MiB GDDR6
> > [8.800113] nouveau :01:00.0: disp: init running...
> > [8.800116] nouveau :01:00.0: disp: init skipped, engine has no users
> > [8.800118] nouveau :01:00.0: disp: init completed in 2us
> > [8.801512] nouveau :01:00.0: DRM: VRAM: 8192 MiB
> > [8.801515] nouveau :01:00.0: DRM: GART: 536870912 MiB
> > [8.801517] nouveau :01:00.0: DRM: BIT table 'A' not found
> > [8.801520] nouveau :01:00.0: DRM: BIT table 'L' not found
> > [8.801521] nouveau :01:00.0: DRM: TMDS table version 2.0
> > [8.801525] nouveau :01:00.0: DRM: DCB version 4.1
> > [8.801527] nouveau :01:00.0: DRM: DCB outp 00: 02800f66 04600020
> > [8.801529] nouveau :01:00.0: DRM: DCB outp 01: 02011f52 00020010
> > [8.801531] nouveau :01:00.0: DRM: DCB outp 02: 01022f36 04600010
> > [8.801533] nouveau :01:00.0: DRM: DCB outp 03: 04033f76 04600010
> > [8.801535] nouveau :01:00.0: DRM: DCB outp 04: 04044f86 04600020
> > [8.801537] nouveau :01:00.0: DRM: DCB conn 00: 00020047
> > [8.801539] nouveau :01:00.0: DRM: DCB conn 01: 00010161
> > [8.801541] nouveau :01:00.0: DRM: DCB conn 02: 1248
> > [8.801543] nouveau :01:00.0: DRM: DCB conn 03: 01000348
> > [8.801543] nouveau 

Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-24 Thread John Hubbard

On 5/24/21 3:11 PM, Andrew Morton wrote:

...

  Documentation/vm/hmm.rst |  17 
  include/linux/mmu_notifier.h |   6 ++
  include/linux/rmap.h |   4 +
  include/linux/swap.h |   7 +-
  include/linux/swapops.h  |  44 -
  mm/hmm.c |   5 +
  mm/memory.c  | 128 +++-
  mm/mprotect.c|   8 ++
  mm/page_vma_mapped.c |   9 +-
  mm/rmap.c| 186 +++
  10 files changed, 405 insertions(+), 9 deletions(-)



This is quite a lot of code added to core MM for a single driver.

Is there any expectation that other drivers will use this code?


Yes! This should work for GPUs (and potentially, other devices) that support
OpenCL SVM atomic accesses on the device. I haven't looked into how amdgpu
works in any detail, but that's certainly at the top of the list of likely
additional callers.



Is there a way of reducing the impact (code size, at least) for systems
which don't need this code?


I'll leave this question to others for the moment, in order to answer
the "do we need it at all" points.



How beneficial is this code to nouveau users?  I see that it permits a
part of OpenCL to be implemented, but how useful/important is this in
the real world?



So this is interesting. Right now, OpenCL support in Nouveau is rather new
and so probably not a huge impact yet. However, we've built up enough experience
with CUDA and OpenCL to learn that atomic operations, as part of the user
space programming model, are a super big deal. Atomic operations are so
useful and important that I'd expect many OpenCL SVM users to be uninterested in
programming models that lack atomic operations for GPU compute programs.

Again, this doesn't rule out future, non-GPU accelerator devices that may
come along.

Atomic ops are just a really important piece of high-end multi-threaded
programming, it turns out. So this is the beginning of support for an
important building block for general purpose programming on devices that
have GPU-like memory models.


thanks,
--
John Hubbard
NVIDIA
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-24 Thread Andrew Morton
On Mon, 24 May 2021 23:27:22 +1000 Alistair Popple  wrote:

> Some devices require exclusive write access to shared virtual
> memory (SVM) ranges to perform atomic operations on that memory. This
> requires CPU page tables to be updated to deny access whilst atomic
> operations are occurring.
> 
> In order to do this introduce a new swap entry
> type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> exclusive access by a device all page table mappings for the particular
> range are replaced with device exclusive swap entries. This causes any
> CPU access to the page to result in a fault.
> 
> Faults are resovled by replacing the faulting entry with the original
> mapping. This results in MMU notifiers being called which a driver uses
> to update access permissions such as revoking atomic access. After
> notifiers have been called the device will no longer have exclusive
> access to the region.
> 
> Walking of the page tables to find the target pages is handled by
> get_user_pages() rather than a direct page table walk. A direct page
> table walk similar to what migrate_vma_collect()/unmap() does could also
> have been utilised. However this resulted in more code similar in
> functionality to what get_user_pages() provides as page faulting is
> required to make the PTEs present and to break COW.
> 
> ...
>
>  Documentation/vm/hmm.rst |  17 
>  include/linux/mmu_notifier.h |   6 ++
>  include/linux/rmap.h |   4 +
>  include/linux/swap.h |   7 +-
>  include/linux/swapops.h  |  44 -
>  mm/hmm.c |   5 +
>  mm/memory.c  | 128 +++-
>  mm/mprotect.c|   8 ++
>  mm/page_vma_mapped.c |   9 +-
>  mm/rmap.c| 186 +++
>  10 files changed, 405 insertions(+), 9 deletions(-)
> 

This is quite a lot of code added to core MM for a single driver.

Is there any expectation that other drivers will use this code?

Is there a way of reducing the impact (code size, at least) for systems
which don't need this code?

How beneficial is this code to nouveau users?  I see that it permits a
part of OpenCL to be implemented, but how useful/important is this in
the real world?

Thanks.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v9 10/10] nouveau/svm: Implement atomic SVM access

2021-05-24 Thread Alistair Popple
Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored allowing CPU access to proceed.

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v9:
* Added Ben's Reviewed-By

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
  device
---
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 126 --
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h 
b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
 #define NVIF_VMM_PFNMAP_V0_APER   0x00f0ULL
 #define NVIF_VMM_PFNMAP_V0_HOST   0xULL
 #define NVIF_VMM_PFNMAP_V0_VRAM   0x0010ULL
+#define NVIF_VMM_PFNMAP_V0_A 0x0004ULL
 #define NVIF_VMM_PFNMAP_V0_W  0x0002ULL
 #define NVIF_VMM_PFNMAP_V0_V  0x0001ULL
 #define NVIF_VMM_PFNMAP_V0_NONE   0xULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..81526d65b4e2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct nouveau_svm {
struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
} buffer[1];
 };
 
+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
 
@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
  fault->client);
 }
 
+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+   switch (fault) {
+   case FAULT_ACCESS_PREFETCH:
+   return 0;
+   case FAULT_ACCESS_READ:
+   return 1;
+   case FAULT_ACCESS_WRITE:
+   return 2;
+   case FAULT_ACCESS_ATOMIC:
+   return 3;
+   default:
+   WARN_ON_ONCE(1);
+   return -1;
+   }
+}
+
 static int
 nouveau_svm_fault_cmp(const void *a, const void *b)
 {
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
return ret;
if ((ret = (s64)fa->addr - fb->addr))
return ret;
-   /*XXX: atomic? */
-   return (fa->access == 0 || fa->access == 3) -
-  (fb->access == 0 || fb->access == 3);
+   return nouveau_svm_fault_priority(fa->access) -
+   nouveau_svm_fault_priority(fb->access);
 }
 
 static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct 
mmu_interval_notifier *mni,
struct svm_notifier *sn =
container_of(mni, struct svm_notifier, notifier);
 
+   if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+   range->owner == sn->svmm->vmm->cli->drm->dev)
+   return true;
+
/*
 * serializes the update to mni->invalidate_seq done by caller and
 * prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm 
*drm,
args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
 }
 
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+  struct nouveau_drm *drm,
+  struct nouveau_pfnmap_args *args, u32 size,
+  struct svm_notifier *notifier)
+{
+   unsigned long timeout =
+   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+   struct mm_struct *mm = svmm->notifier.mm;
+   struct page *page;
+   unsigned long start = args->p.addr;
+   unsigned long notifier_seq;
+   int ret = 0;
+
+   ret = 

[Nouveau] [PATCH v9 09/10] nouveau/svm: Refactor nouveau_range_fault

2021-05-24 Thread Alistair Popple
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.

Signed-off-by: Alistair Popple 
Reviewed-by: Ben Skeggs 

---

v9:

* Added Ben's Reviewed-By (thanks!)
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ---
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 94f841026c3b..a195e48c9aee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
unsigned long hmm_pfns[1];
struct hmm_range range = {
.notifier = >notifier,
-   .start = notifier->notifier.interval_tree.start,
-   .end = notifier->notifier.interval_tree.last + 1,
.default_flags = hmm_flags,
.hmm_pfns = hmm_pfns,
.dev_private_owner = drm->dev,
};
-   struct mm_struct *mm = notifier->notifier.mm;
+   struct mm_struct *mm = svmm->notifier.mm;
int ret;
 
+   ret = mmu_interval_notifier_insert(>notifier, mm,
+   args->p.addr, args->p.size,
+   _svm_mni_ops);
+   if (ret)
+   return ret;
+
+   range.start = notifier->notifier.interval_tree.start;
+   range.end = notifier->notifier.interval_tree.last + 1;
+
while (true) {
-   if (time_after(jiffies, timeout))
-   return -EBUSY;
+   if (time_after(jiffies, timeout)) {
+   ret = -EBUSY;
+   goto out;
+   }
 
range.notifier_seq = mmu_interval_read_begin(range.notifier);
mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
if (ret) {
if (ret == -EBUSY)
continue;
-   return ret;
+   goto out;
}
 
mutex_lock(>mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
svmm->vmm->vmm.object.client->super = false;
mutex_unlock(>mutex);
 
+out:
+   mmu_interval_notifier_remove(>notifier);
+
return ret;
 }
 
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify)
}
 
notifier.svmm = svmm;
-   ret = mmu_interval_notifier_insert(, mm,
-  args.i.p.addr, args.i.p.size,
-  _svm_mni_ops);
-   if (!ret) {
-   ret = nouveau_range_fault(svmm, svm->drm, ,
-   sizeof(args), hmm_flags, );
-   mmu_interval_notifier_remove();
-   }
+   ret = nouveau_range_fault(svmm, svm->drm, ,
+   sizeof(args), hmm_flags, );
mmput(mm);
 
limit = args.i.p.addr + args.i.p.size;
-- 
2.20.1

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-24 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 

---

v9:
* Split rename of migrate_pgmap_owner into a separate patch.
* Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
* Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
  somewhat on a suggestion from Peter Xu. I was never particularly happy
  with try_to_protect() as a name so think this is better.
* Removed unneccesary code and reworded some comments based on feedback
  from Peter Xu.
* Removed the VMA walk when restoring PTEs for device-exclusive entries.
* Simplified implementation of copy_pte_range() to fail if the page
  cannot be locked. This might lead to occasional fork() failures but at
  this stage we don't think that will be an issue.

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst |  17 
 include/linux/mmu_notifier.h |   6 ++
 include/linux/rmap.h |   4 +
 include/linux/swap.h |   7 +-
 include/linux/swapops.h  |  44 -
 mm/hmm.c |   5 +
 mm/memory.c  | 128 +++-
 mm/mprotect.c|   8 ++
 mm/page_vma_mapped.c |   9 +-
 mm/rmap.c| 186 +++
 10 files changed, 405 insertions(+), 9 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 3df79307a797..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Some devices have features such as atomic PTE bits that can be used to 
implement
+atomic access to system memory. To support atomic operations to a shared 
virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 
 
diff --git 

[Nouveau] [PATCH v9 08/10] mm: Selftests for exclusive device memory

2021-05-24 Thread Alistair Popple
Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple 
Acked-by: Jason Gunthorpe 
Tested-by: Ralph Campbell 
Reviewed-by: Ralph Campbell 
---
 lib/test_hmm.c | 124 +++
 lib/test_hmm_uapi.h|   2 +
 tools/testing/selftests/vm/hmm-tests.c | 158 +
 3 files changed, 284 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5c9f5a020c1d..305a9d9e2b4c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "test_hmm_uapi.h"
 
@@ -46,6 +47,7 @@ struct dmirror_bounce {
unsigned long   cpages;
 };
 
+#define DPT_XA_TAG_ATOMIC 1UL
 #define DPT_XA_TAG_WRITE 3UL
 
 /*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
}
 }
 
+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+unsigned long end)
+{
+   unsigned long pfn;
+
+   for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+   void *entry;
+   struct page *page;
+
+   entry = xa_load(>pt, pfn);
+   page = xa_untag_pointer(entry);
+   if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+   return -EPERM;
+   }
+
+   return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+ struct page **pages, struct dmirror *dmirror)
+{
+   unsigned long pfn, mapped = 0;
+   int i;
+
+   /* Map the migrated pages into the device's page tables. */
+   mutex_lock(>mutex);
+
+   for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); 
pfn++, i++) {
+   void *entry;
+
+   if (!pages[i])
+   continue;
+
+   entry = pages[i];
+   entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+   entry = xa_store(>pt, pfn, entry, GFP_ATOMIC);
+   if (xa_is_err(entry)) {
+   mutex_unlock(>mutex);
+   return xa_err(entry);
+   }
+
+   mapped++;
+   }
+
+   mutex_unlock(>mutex);
+   return mapped;
+}
+
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
struct dmirror *dmirror)
 {
@@ -661,6 +711,71 @@ static int dmirror_migrate_finalize_and_map(struct 
migrate_vma *args,
return 0;
 }
 
+static int dmirror_exclusive(struct dmirror *dmirror,
+struct hmm_dmirror_cmd *cmd)
+{
+   unsigned long start, end, addr;
+   unsigned long size = cmd->npages << PAGE_SHIFT;
+   struct mm_struct *mm = dmirror->notifier.mm;
+   struct page *pages[64];
+   struct dmirror_bounce bounce;
+   unsigned long next;
+   int ret;
+
+   start = cmd->addr;
+   end = start + size;
+   if (end < start)
+   return -EINVAL;
+
+   /* Since the mm is for the mirrored process, get a reference first. */
+   if (!mmget_not_zero(mm))
+   return -EINVAL;
+
+   mmap_read_lock(mm);
+   for (addr = start; addr < end; addr = next) {
+   int i, mapped;
+
+   if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+   next = end;
+   else
+   next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+   ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+   mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+   for (i = 0; i < ret; i++) {
+   if (pages[i]) {
+   unlock_page(pages[i]);
+   put_page(pages[i]);
+   }
+   }
+
+   if (addr + (mapped << PAGE_SHIFT) < next) {
+   mmap_read_unlock(mm);
+   mmput(mm);
+   return -EBUSY;
+   }
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   /* Return the migrated data for verification. */
+   ret = dmirror_bounce_init(, start, size);
+   if (ret)
+   return ret;
+   mutex_lock(>mutex);
+   ret = dmirror_do_read(dmirror, start, end, );
+   mutex_unlock(>mutex);
+   if (ret == 0) {
+   if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+bounce.size))
+   ret = -EFAULT;
+   }
+
+   cmd->cpages = bounce.cpages;
+   dmirror_bounce_fini();
+   return ret;
+}
+
 static int dmirror_migrate(struct dmirror *dmirror,
   struct hmm_dmirror_cmd *cmd)
 {
@@ -949,6 +1064,15 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_migrate(dmirror, );
break;
 
+   

[Nouveau] [PATCH v9 04/10] mm/rmap: Split migration into its own function

2021-05-24 Thread Alistair Popple
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c  - either
try_to_migrate() for PageAnon or try_to_unmap().

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Ralph Campbell 

---

v5:
* Added comments about how PMD splitting works for migration vs.
  unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
  which TTU_XXX flags are supported.
---
 include/linux/rmap.h |   4 +-
 mm/huge_memory.c |  15 +-
 mm/migrate.c |   9 +-
 mm/rmap.c| 358 ---
 4 files changed, 280 insertions(+), 106 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 38a746787c2f..0e25d829f742 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-   TTU_MIGRATION   = 0x1,  /* migration mode */
-
TTU_SPLIT_HUGE_PMD  = 0x4,  /* split huge PMD if any */
TTU_IGNORE_MLOCK= 0x8,  /* ignore mlock */
TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
@@ -96,7 +94,6 @@ enum ttu_flags {
 * do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
 * caller holds it */
-   TTU_SPLIT_FREEZE= 0x100,/* freeze pte under 
splitting thp */
 };
 
 #ifdef CONFIG_MMU
@@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool 
compound)
 int page_referenced(struct page *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
 
+bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ec6dab72217..6dddc75b89ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2345,16 +2345,21 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;
 
VM_BUG_ON_PAGE(!PageHead(page), page);
 
if (PageAnon(page))
-   ttu_flags |= TTU_SPLIT_FREEZE;
-
-   unmap_success = try_to_unmap(page, ttu_flags);
+   unmap_success = try_to_migrate(page, ttu_flags);
+   else
+   /*
+* Don't install migration entries for file backed pages. This
+* helps handle cases when i_size is in the middle of the page
+* as there is no need to unmap pages beyond i_size manually.
+*/
+   unmap_success = try_to_unmap(page, ttu_flags |
+   TTU_IGNORE_MLOCK);
VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 930de919b1f2..05740f816bc4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1103,7 +1103,7 @@ static int __unmap_and_move(struct page *page, struct 
page *newpage,
/* Establish migration ptes */
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
-   try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+   try_to_migrate(page, 0);
page_was_mapped = 1;
}
 
@@ -1305,7 +1305,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
 
if (page_mapped(hpage)) {
bool mapping_locked = false;
-   enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+   enum ttu_flags ttu = 0;
 
if (!PageAnon(hpage)) {
/*
@@ -1322,7 +1322,7 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
ttu |= TTU_RMAP_LOCKED;
}
 
-   try_to_unmap(hpage, ttu);
+   try_to_migrate(hpage, ttu);
page_was_mapped 

[Nouveau] [PATCH v9 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte()

2021-05-24 Thread Alistair Popple
Currently if copy_nonpresent_pte() returns a non-zero value it is
assumed to be a swap entry which requires further processing outside the
loop in copy_pte_range() after dropping locks. This prevents other
values being returned to signal conditions such as failure which a
subsequent change requires.

Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.

Signed-off-by: Alistair Popple 

---

v9:

New for v9 to allow device exclusive handling to occur in
copy_nonpresent_pte().
---
 mm/memory.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2fb455c365c2..e061cfa18c11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,7 +718,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
 
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
-   return entry.val;
+   return -EAGAIN;
 
/* make sure dst_mm is on swapoff's mmlist. */
if (unlikely(list_empty(_mm->mmlist))) {
@@ -974,11 +974,13 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
continue;
}
if (unlikely(!pte_present(*src_pte))) {
-   entry.val = copy_nonpresent_pte(dst_mm, src_mm,
-   dst_pte, src_pte,
-   src_vma, addr, rss);
-   if (entry.val)
+   ret = copy_nonpresent_pte(dst_mm, src_mm,
+   dst_pte, src_pte,
+   src_vma, addr, rss);
+   if (ret == -EAGAIN) {
+   entry = pte_to_swp_entry(*src_pte);
break;
+   }
progress += 8;
continue;
}
-- 
2.20.1

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [PATCH v9 05/10] mm: Rename migrate_pgmap_owner

2021-05-24 Thread Alistair Popple
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the
'migrate_pgmap_owner' field to 'owner' and create a new notifier
initialisation function to initialise this field.

Signed-off-by: Alistair Popple 
Suggested-by: Peter Xu 

---

v9:

Previously part of the next patch in the series ('mm: Device exclusive
memory access') but now split out as a separate change as suggested by
Peter Xu.
---
 Documentation/vm/hmm.rst  |  2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
 include/linux/mmu_notifier.h  | 20 ++--
 lib/test_hmm.c|  2 +-
 mm/migrate.c  | 10 +-
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..3df79307a797 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 * the invalidation is handled as part of the migration process.
 */
if (update->event == MMU_NOTIFY_MIGRATE &&
-   update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+   update->owner == svmm->vmm->cli->drm->dev)
goto out;
 
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1a6a9eb6d3fa..8e428eb813b8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,7 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -269,7 +269,7 @@ struct mmu_notifier_range {
unsigned long end;
unsigned flags;
enum mmu_notifier_event event;
-   void *migrate_pgmap_owner;
+   void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +521,14 @@ static inline void mmu_notifier_range_init(struct 
mmu_notifier_range *range,
range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-   struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+   struct mmu_notifier_range *range,
+   enum mmu_notifier_event event, unsigned int flags,
struct vm_area_struct *vma, struct mm_struct *mm,
-   unsigned long start, unsigned long end, void *pgmap)
+   unsigned long start, unsigned long end, void *owner)
 {
-   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-   start, end);
-   range->migrate_pgmap_owner = pgmap;
+   mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+   range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
@@ -655,8 +655,8 @@ static inline void _mmu_notifier_range_init(struct 
mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-   pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+   end, owner) \
_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -218,7 +218,7 @@ static bool 

[Nouveau] [PATCH v9 03/10] mm/rmap: Split try_to_munlock from try_to_unmap

2021-05-24 Thread Alistair Popple
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v9:
* Improved comments

v8:
* Renamed try_to_munlock to page_mlock to better reflect what the
  function actually does.
* Removed the TODO from the documentation that this patch addresses.

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
 Documentation/vm/unevictable-lru.rst | 33 ++-
 include/linux/rmap.h |  3 +-
 mm/mlock.c   | 10 ++---
 mm/rmap.c| 61 
 4 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/Documentation/vm/unevictable-lru.rst 
b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics 
for the number of
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown 
(munlock_vma_pages_all), reclaim,
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 -
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling ` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
 
@@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list.
 shrink_inactive_list() should only see 

[Nouveau] [PATCH v9 02/10] mm/swapops: Rework swap entry manipulation code

2021-05-24 Thread Alistair Popple
Both migration and device private pages use special swap entries that
are manipluated by a range of inline functions. The arguments to these
are somewhat inconsitent so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ralph Campbell 
---
 include/linux/swapops.h | 56 ++---
 mm/debug_vm_pgtable.c   | 12 -
 mm/hmm.c|  2 +-
 mm/huge_memory.c| 26 +--
 mm/hugetlb.c| 10 +---
 mm/memory.c | 10 +---
 mm/migrate.c| 26 ++-
 mm/mprotect.c   | 10 +---
 mm/rmap.c   | 10 +---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-   return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-page_to_pfn(page));
+   return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-   int type = swp_type(entry);
-   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+   return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-   *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+   int type = swp_type(entry);
+   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool 
write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+   return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t 
entry)
return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-   BUG_ON(!PageLocked(compound_head(page)));
-
-   return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-   page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-   *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+   return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, 
pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm, 

[Nouveau] [PATCH v9 01/10] mm: Remove special swap entry functions

2021-05-24 Thread Alistair Popple
Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions as this results is shorter code that is easier to understand.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v9:
* Rebased on v5.13-rc2

v8:
* No changes

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c  | 23 +-
 include/linux/swap.h|  4 +--
 include/linux/swapops.h | 69 ++---
 mm/hmm.c|  5 ++-
 mm/huge_memory.c|  4 +--
 mm/memcontrol.c |  2 +-
 mm/memory.c | 10 +++---
 mm/migrate.c|  6 ++--
 mm/page_vma_mapped.c|  6 ++--
 10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+   struct page *page = pfn_swap_entry_to_page(entry);
 
dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc9784544b24..0953732c8ce1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = xa_load(>vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -694,10 +692,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-   if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1384,11 +1380,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
@@ -1445,7 +1438,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration_entry_to_page(entry);
+   page = 

[Nouveau] [PATCH v9 00/10] Add support for SVM atomics in Nouveau

2021-05-24 Thread Alistair Popple
This is a repost of the previous series to rebase on v5.13-rc2 and to
address comments.

Outside of some code comment updates the primary change was to split the
renaming of migrate_pgmap_owner into a separate patch and to further
simplify the handling of device exclusive entries in copy_pte_range(). This
may result in temporary fork() failures if the process is using a device
whilst forking, but such usage is unlikely to be practical.

This resulted in a new clean-up patch for the series (patch 6) so that
device exclusive entries can be handled inside copy_nonpresent_pte(),
although more extensive clean-ups of copy_pte_range() are planned as
further development work in future.

Introduction


Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to a
shared virtual memory page such a device needs access to that page which is
exclusive of the CPU. This series introduces a mechanism to temporarily
unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.

Patches
===

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
===

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. Without
this series the test fails as there is no way of write-protecting the page
mapping which results in the device clobbering CPU writes. For reference
the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

Alistair Popple (10):
  mm: Remove special swap entry functions
  mm/swapops: Rework swap entry manipulation code
  mm/rmap: Split try_to_munlock from try_to_unmap
  mm/rmap: Split migration into its own function
  mm: Rename migrate_pgmap_owner
  mm/memory.c: Allow different return codes for copy_nonpresent_pte()
  mm: Device exclusive memory access
  mm: Selftests for exclusive device memory
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/svm: Implement atomic SVM access

 Documentation/vm/hmm.rst  |  19 +-
 Documentation/vm/unevictable-lru.rst  |  33 +-
 arch/s390/mm/pgtable.c|   2 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c | 156 -
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c|   6 +
 fs/proc/task_mmu.c|  23 +-
 include/linux/mmu_notifier.h  |  26 +-
 include/linux/rmap.h  |  11 +-
 include/linux/swap.h  |  11 +-
 include/linux/swapops.h   | 123 ++--
 lib/test_hmm.c| 126 +++-
 lib/test_hmm_uapi.h   |   2 +
 mm/debug_vm_pgtable.c |  12 +-
 mm/hmm.c  |  12 +-
 mm/huge_memory.c  |  45 +-
 mm/hugetlb.c  |  10 +-
 mm/memcontrol.c   |   2 +-
 mm/memory.c   | 160 -
 mm/migrate.c  |  51 +-
 mm/mlock.c|  10 +-
 mm/mprotect.c |  18 +-
 mm/page_vma_mapped.c  |  15 +-
 mm/rmap.c | 601 +++---
 

Re: [Nouveau] [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-24 Thread Tvrtko Ursulin


On 20/05/2021 09:35, Tvrtko Ursulin wrote:

On 19/05/2021 19:23, Daniel Vetter wrote:

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin
 wrote:



On 18/05/2021 10:40, Tvrtko Ursulin wrote:


On 18/05/2021 10:16, Daniel Stone wrote:

Hi,

On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin
 wrote:

I was just wondering if stat(2) and a chrdev major check would be a
solid criteria to more efficiently (compared to parsing the text
content) detect drm files while walking procfs.


Maybe I'm missing something, but is the per-PID walk actually a
measurable performance issue rather than just a bit unpleasant?


Per pid and per each open fd.

As said in the other thread what bothers me a bit in this scheme is 
that

the cost of obtaining GPU usage scales based on non-GPU criteria.

For use case of a top-like tool which shows all processes this is a
smaller additional cost, but then for a gpu-top like tool it is 
somewhat

higher.


To further expand, not only cost would scale per pid multiplies per open
fd, but to detect which of the fds are DRM I see these three options:

1) Open and parse fdinfo.
2) Name based matching ie /dev/dri/.. something.
3) Stat the symlink target and check for DRM major.


stat with symlink following should be plenty fast.


Maybe. I don't think my point about keeping the dentry cache needlessly 
hot is getting through at all. On my lightly loaded desktop:


  $ sudo lsof | wc -l
  599551

  $ sudo lsof | grep "/dev/dri/" | wc -l
  1965

It's going to look up ~600k pointless dentries in every iteration. Just 
to find a handful of DRM ones. Hard to say if that is better or worse 
than just parsing fdinfo text for all files. Will see.


CPU usage looks passable under a production kernel (non-debug). Once a 
second refresh period, on a not really that loaded system (115 running 
processes, 3096 open file descriptors as reported by lsof, none of which 
are DRM), results in a system call heavy load:


real0m55.348s
user0m0.100s
sys 0m0.319s

Once per second loop is essentially along the lines of:

  for each pid in /proc/:
for each fd in /proc//fdinfo:
  if fstatat(fd) is drm major:
read fdinfo text in one sweep and parse it

I'll post the quick intel_gpu_top patch for reference but string parsing 
in C leaves a few things to be desired there.


Regards,

Tvrtko
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau