Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status

2023-12-04 Thread David Hildenbrand

On 02.12.23 15:50, Pedro Falcato wrote:

On Fri, Dec 1, 2023 at 9:23 AM David Hildenbrand  wrote:


On 28.11.23 13:50, Weixi Zhu wrote:

This patch adds an abstraction layer, struct vm_object, that maintains
per-process virtual-to-physical mapping status stored in struct gm_mapping.
For example, a virtual page may be mapped to a CPU physical page or to a
device physical page. Struct vm_object effectively maintains an
arch-independent page table, which is defined as a "logical page table".
While arch-dependent page table used by a real MMU is named a "physical
page table". The logical page table is useful if Linux core MM is extended
to handle a unified virtual address space with external accelerators using
customized MMUs.


Which raises the question why we are dealing with anonymous memory at
all? Why not go for shmem if you are already only special-casing VMAs
with a MMAP flag right now?

That would maybe avoid having to introduce controversial BSD design
concepts into Linux, that feel like going a step backwards in time to me
and adding *more* MM complexity.



In this patch, struct vm_object utilizes a radix
tree (xarray) to track where a virtual page is mapped to. This adds extra
memory consumption from xarray, but provides a nice abstraction to isolate
mapping status from the machine-dependent layer (PTEs). Besides supporting
accelerators with external MMUs, struct vm_object is planned to further
union with i_pages in struct address_mapping for file-backed memory.


A file already has a tree structure (pagecache) to manage the pages that
are theoretically mapped. It's easy to translate from a VMA to a page
inside that tree structure that is currently not present in page tables.

Why the need for that tree structure if you can just remove anon memory
from the picture?



The idea of struct vm_object is originated from FreeBSD VM design, which
provides a unified abstraction for anonymous memory, file-backed memory,
page cache and etc[1].


:/


Currently, Linux utilizes a set of hierarchical page walk functions to
abstract page table manipulations of different CPU architecture. The
problem happens when a device wants to reuse Linux MM code to manage its
page table -- the device page table may not be accessible to the CPU.
Existing solution like Linux HMM utilizes the MMU notifier mechanisms to
invoke device-specific MMU functions, but relies on encoding the mapping
status on the CPU page table entries. This entangles machine-independent
code with machine-dependent code, and also brings unnecessary restrictions.


Why? We have primitives to walk arch page tables in a non-arch specific
fashion and are using them all over the place.

We even have various mechanisms to map something into the page tables
and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as
used for NUMA balancing, fake swap entries).


The PTE size and format vary arch by arch, which harms the extensibility.


Not really.

We might have some features limited to some architectures because of the
lack of PTE bits. And usually the problem is that people don't care
enough about enabling these features on older architectures.

If we ever *really* need more space for sw-defined data, it would be
possible to allocate auxiliary data for page tables only where required
(where the features apply), instead of crafting a completely new,
auxiliary data structure with its own locking.

So far it was not required to enable the feature we need on the
architectures we care about.



[1] https://docs.freebsd.org/en/articles/vm-design/


In the cover letter you have:

"The future plan of logical page table is to provide a generic
abstraction layer that support common anonymous memory (I am looking at
you, transparent huge pages) and file-backed memory."

Which I doubt will happen; there is little interest in making anonymous
memory management slower, more serialized, and wasting more memory on
metadata.


Also worth noting that:

1) Mach VM (which FreeBSD inherited, from the old BSD) vm_objects
aren't quite what's being stated here, rather they are somewhat
replacements for both anon_vma and address_space[1]. Very similarly to
Linux, they take pages from vm_objects and map them in page tables
using pmap (the big difference is anon memory, which has its
bookkeeping in page tables, on Linux)

2) These vm_objects were a horrendous mistake (see CoW chaining) and
FreeBSD has to go to horrendous lengths to make them tolerable. The
UVM paper/dissertation (by Charles Cranor) talks about these issues at
length, and 20 years later it's still true.

3) Despite Linux MM having its warts, it's probably correct to
consider it a solid improvement over FreeBSD MM or NetBSD UVM

And, finally, randomly tacking on core MM concepts from other systems
is at best a *really weird* idea. Particularly when they aren't even
what was stated!


Can you read my mind? :) Thanks for noting all that, with which I 100% 
agree.


--
Cheers,

David / dhildenb



Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status

2023-12-01 Thread David Hildenbrand

On 28.11.23 13:50, Weixi Zhu wrote:

This patch adds an abstraction layer, struct vm_object, that maintains
per-process virtual-to-physical mapping status stored in struct gm_mapping.
For example, a virtual page may be mapped to a CPU physical page or to a
device physical page. Struct vm_object effectively maintains an
arch-independent page table, which is defined as a "logical page table".
While arch-dependent page table used by a real MMU is named a "physical
page table". The logical page table is useful if Linux core MM is extended
to handle a unified virtual address space with external accelerators using
customized MMUs.


Which raises the question why we are dealing with anonymous memory at 
all? Why not go for shmem if you are already only special-casing VMAs 
with a MMAP flag right now?


That would maybe avoid having to introduce controversial BSD design 
concepts into Linux, that feel like going a step backwards in time to me 
and adding *more* MM complexity.




In this patch, struct vm_object utilizes a radix
tree (xarray) to track where a virtual page is mapped to. This adds extra
memory consumption from xarray, but provides a nice abstraction to isolate
mapping status from the machine-dependent layer (PTEs). Besides supporting
accelerators with external MMUs, struct vm_object is planned to further
union with i_pages in struct address_mapping for file-backed memory.


A file already has a tree structure (pagecache) to manage the pages that 
are theoretically mapped. It's easy to translate from a VMA to a page 
inside that tree structure that is currently not present in page tables.


Why the need for that tree structure if you can just remove anon memory 
from the picture?




The idea of struct vm_object is originated from FreeBSD VM design, which
provides a unified abstraction for anonymous memory, file-backed memory,
page cache and etc[1].


:/


Currently, Linux utilizes a set of hierarchical page walk functions to
abstract page table manipulations of different CPU architecture. The
problem happens when a device wants to reuse Linux MM code to manage its
page table -- the device page table may not be accessible to the CPU.
Existing solution like Linux HMM utilizes the MMU notifier mechanisms to
invoke device-specific MMU functions, but relies on encoding the mapping
status on the CPU page table entries. This entangles machine-independent
code with machine-dependent code, and also brings unnecessary restrictions.


Why? We have primitives to walk arch page tables in a non-arch specific 
fashion and are using them all over the place.


We even have various mechanisms to map something into the page tables 
and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as 
used for NUMA balancing, fake swap entries).



The PTE size and format vary arch by arch, which harms the extensibility.


Not really.

We might have some features limited to some architectures because of the 
lack of PTE bits. And usually the problem is that people don't care 
enough about enabling these features on older architectures.


If we ever *really* need more space for sw-defined data, it would be 
possible to allocate auxiliary data for page tables only where required 
(where the features apply), instead of crafting a completely new, 
auxiliary data structure with its own locking.


So far it was not required to enable the feature we need on the 
architectures we care about.




[1] https://docs.freebsd.org/en/articles/vm-design/


In the cover letter you have:

"The future plan of logical page table is to provide a generic 
abstraction layer that support common anonymous memory (I am looking at 
you, transparent huge pages) and file-backed memory."


Which I doubt will happen; there is little interest in making anonymous 
memory management slower, more serialized, and wasting more memory on 
metadata.


Note that you won't make many friends around here with statements like 
"To be honest, not using a logical page table for anonymous memory is 
why Linux THP fails compared with FreeBSD's superpage".


I read one paper that makes such claims (I'm curious how you define 
"winning"), and am aware of some shortcomings. But I am not convinced 
that a second datastructure "is why Linux THP fails". It just requires 
some more work to get it sorted under Linux (e.g., allocate THP, PTE-map 
it and map inaccessible parts PROT_NONE, later collapse it in-place into 
a PMD), and so far, there was not a lot of interest that I am aware of to 
even start working on that.


So if there is not enough pain for companies to even work on that in 
Linux, maybe FreeBSD superpages are "winning" "on paper" only? Remember 
that the target audience here are Linux developers.



But yeah, this here is all designed around the idea "core MM is extended 
to handle a unified virtual address space with external accelerators 
using customized MMUs." and then trying to find other arguments why it's 
a good idea, without going too much into detail.

Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-12-01 Thread David Hildenbrand

On 01.12.23 03:44, zhuweixi wrote:

Thanks!


I hope you understood that that was a joke :)


I am planning to present GMEM in Linux MM Alignment Sessions so I can collect 
more input from the mm developers.


Sounds good. But please try inviting key HMM/driver developer as well.

Most of the core-mm folks attending that meeting are not that familiar 
with these concepts and they are usually not happy about:


(1) More core-MM complexity for things that can be neatly handled in
separate subsystems with the existing infrastructure already.

(2) One new way of doing things while the other things remain in place.

(3) New MMAP flags. Usually you have a hard time getting this in.
Sometimes, there are other ways (e.g., special-purpose file-
systems).

(4) Changing controversial core-mm design decisions to handle corner
cases.

--
Cheers,

David / dhildenb



Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-11-30 Thread David Hildenbrand

On 29.11.23 09:27, zhuweixi wrote:

Glad to hear that more sharable code is desirable.
IMHO, for a common MM subsystem, it is more beneficial for
GMEM to extend core MM instead of building a separate one.


More core-mm complexity, awesome, we all love that! ;)

--
Cheers,

David / dhildenb



Re: [PATCH v2 3/4] selinux: use vma_is_initial_stack() and vma_is_initial_heap()

2023-07-19 Thread David Hildenbrand

On 19.07.23 09:51, Kefeng Wang wrote:

Use the helpers to simplify code.

Cc: Paul Moore 
Cc: Stephen Smalley 
Cc: Eric Paris 
Acked-by: Paul Moore 
Signed-off-by: Kefeng Wang 
---
  security/selinux/hooks.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index d06e350fedee..ee8575540a8e 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3762,13 +3762,10 @@ static int selinux_file_mprotect(struct vm_area_struct *vma,
if (default_noexec &&
(prot & PROT_EXEC) && !(vma->vm_flags & VM_EXEC)) {
int rc = 0;
-   if (vma->vm_start >= vma->vm_mm->start_brk &&
-   vma->vm_end <= vma->vm_mm->brk) {
+   if (vma_is_initial_heap(vma)) {
rc = avc_has_perm(sid, sid, SECCLASS_PROCESS,
  PROCESS__EXECHEAP, NULL);
-   } else if (!vma->vm_file &&
-  ((vma->vm_start <= vma->vm_mm->start_stack &&
-vma->vm_end >= vma->vm_mm->start_stack) ||
+   } else if (!vma->vm_file && (vma_is_initial_stack(vma) ||
vma_is_stack_for_current(vma))) {
rc = avc_has_perm(sid, sid, SECCLASS_PROCESS,
  PROCESS__EXECSTACK, NULL);


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v2 1/4] mm: factor out VMA stack and heap checks

2023-07-19 Thread David Hildenbrand

On 19.07.23 09:51, Kefeng Wang wrote:

Factor out VMA stack and heap checks and name them
vma_is_initial_stack() and vma_is_initial_heap() for
general use.

Cc: Christian Göttsche 
Cc: David Hildenbrand 
Signed-off-by: Kefeng Wang 
---



[...]


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..51f8c573db74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -822,6 +822,27 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return !vma->vm_ops;
  }
  


Worth adding a similar comment like for vma_is_initial_stack() ?


+static inline bool vma_is_initial_heap(const struct vm_area_struct *vma)
+{
+   return vma->vm_start <= vma->vm_mm->brk &&
+   vma->vm_end >= vma->vm_mm->start_brk;
+}
+
+/*
+ * Indicate if the VMA is a stack for the given task; for
+ * /proc/PID/maps that is the stack of the main task.
+ */
+static inline bool vma_is_initial_stack(const struct vm_area_struct *vma)
+{
+   /*
+* We make no effort to guess what a given thread considers to be
+* its "stack".  It's not even well-defined for programs written
	 * in languages like Go.
+*/
+   return vma->vm_start <= vma->vm_mm->start_stack &&
+  vma->vm_end >= vma->vm_mm->start_stack;
+}
+
  static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
  {
int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()

2023-07-19 Thread David Hildenbrand

On 19.07.23 09:51, Kefeng Wang wrote:

Use the helpers to simplify code.

Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: "Christian König" 
Cc: "Pan, Xinhui" 
Cc: David Airlie 
Cc: Daniel Vetter 
Signed-off-by: Kefeng Wang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 5ff1a5a89d96..0b7bfbd0cb66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, int64_t addr,
return -EFAULT;
}
  
-	*is_heap_stack = (vma->vm_start <= vma->vm_mm->brk &&
-			  vma->vm_end >= vma->vm_mm->start_brk) ||
-			 (vma->vm_start <= vma->vm_mm->start_stack &&
-			  vma->vm_end >= vma->vm_mm->start_stack);
+	*is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma);
 
 	start_limit = max(vma->vm_start >> PAGE_SHIFT,
 			  (unsigned long)ALIGN_DOWN(addr, 2UL << 8));


Certainly a valid refactoring, although questionable if such code should 
care about that.


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v2 4/4] perf/core: use vma_is_initial_stack() and vma_is_initial_heap()

2023-07-19 Thread David Hildenbrand

On 19.07.23 09:51, Kefeng Wang wrote:

Use the helpers to simplify code, also kill unneeded goto cpy_name.

Cc: Peter Zijlstra 
Cc: Arnaldo Carvalho de Melo 
Signed-off-by: Kefeng Wang 
---
  kernel/events/core.c | 22 +++---
  1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 78ae7b6f90fd..d59f6327472f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8685,22 +8685,14 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
}
  
  		name = (char *)arch_vma_name(vma);

-   if (name)
-   goto cpy_name;
-
-   if (vma->vm_start <= vma->vm_mm->start_brk &&
-   vma->vm_end >= vma->vm_mm->brk) {
-   name = "[heap]";
-   goto cpy_name;
+   if (!name) {
+   if (vma_is_initial_heap(vma))
+   name = "[heap]";
+   else if (vma_is_initial_stack(vma))
+   name = "[stack]";
+   else
+   name = "//anon";
}
-   if (vma->vm_start <= vma->vm_mm->start_stack &&
-   vma->vm_end >= vma->vm_mm->start_stack) {
-   name = "[stack]";
-   goto cpy_name;
-   }
-
-   name = "//anon";
-   goto cpy_name;


If you're removing that goto, maybe also worth removing the goto at the 
end of the previous if branch.


Reviewed-by: David Hildenbrand 


--
Cheers,

David / dhildenb



Re: [PATCH 1/5] mm: introduce vma_is_stack() and vma_is_heap()

2023-07-17 Thread David Hildenbrand

On 12.07.23 16:38, Kefeng Wang wrote:

Introduce the two helpers for general use.

Signed-off-by: Kefeng Wang 
---
  include/linux/mm.h | 12 
  1 file changed, 12 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1462cf15badf..0bbeb31ac750 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -926,6 +926,18 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return !vma->vm_ops;
  }
  
+static inline bool vma_is_heap(struct vm_area_struct *vma)
+{
+   return vma->vm_start <= vma->vm_mm->brk &&
+   vma->vm_end >= vma->vm_mm->start_brk;
+}
+
+static inline bool vma_is_stack(struct vm_area_struct *vma)
+{
+   return vma->vm_start <= vma->vm_mm->start_stack &&
+  vma->vm_end >= vma->vm_mm->start_stack;
+}
+
  static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
  {
int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);


Looking at the comments in patch #3, should these functions be called

vma_is_initial_heap / vma_is_initial_stack ?

--
Cheers,

David / dhildenb



Re: [PATCH 2/5] mm: use vma_is_stack() and vma_is_heap()

2023-07-17 Thread David Hildenbrand

On 12.07.23 16:38, Kefeng Wang wrote:

Use the helpers to simplify code.

Signed-off-by: Kefeng Wang 
---
  fs/proc/task_mmu.c   | 24 
  fs/proc/task_nommu.c | 15 +--
  2 files changed, 5 insertions(+), 34 deletions(-)



Please squash patch #1 and this patch and call it something like

"mm: factor out VMA stack and heap checks"

And then, maybe also keep the comments in these functions, they sound 
reasonable to have.


--
Cheers,

David / dhildenb



Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()

2023-05-16 Thread David Hildenbrand

On 16.05.23 16:30, Sean Christopherson wrote:

On Tue, May 16, 2023, David Hildenbrand wrote:

On 15.05.23 21:07, Sean Christopherson wrote:

On Sun, May 14, 2023, Lorenzo Stoakes wrote:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb5c13eee193..eaa5bb8dbadc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
   {
int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
-   rc = get_user_pages(addr, 1, flags, NULL, NULL);
+   rc = get_user_pages(addr, 1, flags, NULL);
return rc == -EHWPOISON;


Unrelated to this patch, I think there's a pre-existing bug here.  If gup() 
returns
a valid page, KVM will leak the refcount and unintentionally pin the page.  
That's


When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked()
won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with
the mapcount of the page.


For completeness: s/mapcount/refcount/ :)



Ah, that's what I'm missing.




--
Thanks,

David / dhildenb



Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()

2023-05-16 Thread David Hildenbrand

On 15.05.23 21:07, Sean Christopherson wrote:

On Sun, May 14, 2023, Lorenzo Stoakes wrote:

No invocation of get_user_pages() use the vmas parameter, so remove it.

The GUP API is confusing and caveated. Recent changes have done much to
improve that, however there is more we can do. Exporting vmas is a prime
target as the caller has to be extremely careful to preclude their use
after the mmap_lock has expired or otherwise be left with dangling
pointers.

Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.

This is part of a patch series aiming to remove the vmas parameter
altogether.

Suggested-by: Matthew Wilcox (Oracle) 
Acked-by: Greg Kroah-Hartman 
Acked-by: David Hildenbrand 
Reviewed-by: Jason Gunthorpe 
Acked-by: Christian König  (for radeon parts)
Acked-by: Jarkko Sakkinen 
Signed-off-by: Lorenzo Stoakes 
---
  arch/x86/kernel/cpu/sgx/ioctl.c | 2 +-
  drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
  drivers/misc/sgi-gru/grufault.c | 2 +-
  include/linux/mm.h  | 3 +--
  mm/gup.c| 9 +++--
  mm/gup_test.c   | 5 ++---
  virt/kvm/kvm_main.c | 2 +-
  7 files changed, 10 insertions(+), 15 deletions(-)


Acked-by: Sean Christopherson  (KVM)


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb5c13eee193..eaa5bb8dbadc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  {
int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
  
-	rc = get_user_pages(addr, 1, flags, NULL, NULL);
+	rc = get_user_pages(addr, 1, flags, NULL);
return rc == -EHWPOISON;


Unrelated to this patch, I think there's a pre-existing bug here.  If gup() 
returns
a valid page, KVM will leak the refcount and unintentionally pin the page.  
That's


When passing NULL as "pages" to get_user_pages(), 
__get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not 
set, we won't be messing with the mapcount of the page.


So even if get_user_pages() returns "1", we should be fine.


Or am I misunderstanding your concern? At least hva_to_pfn_slow() most 
certainly didn't return "1" if we end up calling 
check_user_page_hwpoison(), so nothing would have been pinned there as well.


--
Thanks,

David / dhildenb



Re: [PATCH v3 1/7] mm/gup: remove unused vmas parameter from get_user_pages()

2023-04-17 Thread David Hildenbrand

On 15.04.23 14:09, Lorenzo Stoakes wrote:

No invocation of get_user_pages() uses the vmas parameter, so remove
it.

The GUP API is confusing and caveated. Recent changes have done much to
improve that, however there is more we can do. Exporting vmas is a prime
target as the caller has to be extremely careful to preclude their use
after the mmap_lock has expired or otherwise be left with dangling
pointers.

Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.

This is part of a patch series aiming to remove the vmas parameter
altogether.

Signed-off-by: Lorenzo Stoakes 
Suggested-by: Matthew Wilcox (Oracle) 
Acked-by: Greg Kroah-Hartman 
---


Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb



Re: [PATCH] mm/gup.c: Fix formating in check_and_migrate_movable_page()

2022-07-21 Thread David Hildenbrand
On 21.07.22 04:05, Alistair Popple wrote:
> Commit b05a79d4377f ("mm/gup: migrate device coherent pages when pinning
> instead of failing") added a badly formatted if statement. Fix it.
> 
> Signed-off-by: Alistair Popple 
> Reported-by: David Hildenbrand 
> ---
> 
> Apologies Andrew for missing this. Hopefully this fixes things.
> 
>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 364b274a10c2..c6d060dee9e0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1980,8 +1980,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>   folio_nr_pages(folio));
>   }
>  
> - if (!list_empty(&movable_page_list) || isolation_error_count
> - || coherent_pages)
> + if (!list_empty(&movable_page_list) || isolation_error_count ||
> + coherent_pages)
>   goto unpin_pages;
>  
>   /*

Happy to see the series go upstream soon.

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v9 06/14] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-19 Thread David Hildenbrand
On 18.07.22 22:32, Andrew Morton wrote:
> On Mon, 18 Jul 2022 12:56:29 +0200 David Hildenbrand  wrote:
> 
>>> /*
>>>  * Try to move out any movable page before pinning the range.
>>>  */
>>> @@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>>> folio_nr_pages(folio));
>>> }
>>>  
>>> -   if (!list_empty(&movable_page_list) || isolation_error_count)
>>> +   if (!list_empty(&movable_page_list) || isolation_error_count
>>> +   || coherent_pages)
>>
>> The common style is to
>>
>> a) add the || to the end of the previous line
>> b) indent such that we have a nice-to-read alignment
>>
>> if (!list_empty(&movable_page_list) || isolation_error_count ||
>> coherent_pages)
>>
> 
> I missed that.  This series is now in mm-stable so any fix will need to
> be a standalone followup patch, please.
> 
>> Apart from that lgtm.
>>
>> Reviewed-by: David Hildenbrand 
> 
> And your reviewed-by's will be lost.  Stupid git.

I know, I already raised my concerns regarding the new workflow, so I
won't repeat that. I can understand (to some degree) that we don't want
code to change just before the new merge window opens.

But I do wonder if we really don't even want to do subject+description
updates. Sure, the commit IDs will change. Who cares?

Anyhow, it is what it is.

-- 
Thanks,

David / dhildenb



Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h

2022-07-18 Thread David Hildenbrand
On 18.07.22 19:52, Felix Kuehling wrote:
> On 2022-07-18 06:50, David Hildenbrand wrote:
>> On 15.07.22 17:05, Alex Sierra wrote:
>>> [WHY]
>>> It makes more sense to have these helpers in zone specific header
>>> file, rather than the generic mm.h
>>>
>>> Signed-off-by: Alex Sierra 
>> Acked-by: David Hildenbrand 
> 
> Thank you! I don't think I have the authority to give this a 
> Reviewed-by. Who does?


Sure you can. Everybody can give Reviewed-by/Tested-by ... tags. :)


-- 
Thanks,

David / dhildenb



Re: [PATCH v9 06/14] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-18 Thread David Hildenbrand
On 15.07.22 17:05, Alex Sierra wrote:
> From: Alistair Popple 
> 
> Currently any attempts to pin a device coherent page will fail. This is
> because device coherent pages need to be managed by a device driver, and
> pinning them would prevent a driver from migrating them off the device.
> 
> However this is no reason to fail pinning of these pages. These are
> coherent and accessible from the CPU so can be migrated just like
> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> them first try migrating them out of ZONE_DEVICE.
> 
> [hch: rebased to the split device memory checks,
>   moved migrate_device_page to migrate_device.c]
> 
> Signed-off-by: Alistair Popple 
> Acked-by: Felix Kuehling 
> Signed-off-by: Christoph Hellwig 

[...]

>   /*
>* Try to move out any movable page before pinning the range.
>*/
> @@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>   folio_nr_pages(folio));
>   }
>  
> - if (!list_empty(&movable_page_list) || isolation_error_count)
> + if (!list_empty(&movable_page_list) || isolation_error_count
> + || coherent_pages)

The common style is to

a) add the || to the end of the previous line
b) indent such that we have a nice-to-read alignment

if (!list_empty(&movable_page_list) || isolation_error_count ||
coherent_pages)


Apart from that lgtm.

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v9 04/14] mm: handling Non-LRU pages returned by vm_normal_pages

2022-07-18 Thread David Hildenbrand
On 15.07.22 17:05, Alex Sierra wrote:
> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page, and for
> COW. They do not support LRU lists, NUMA migration or THP.
> 
> Callers to follow_page() currently don't expect ZONE_DEVICE pages,
> however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check
> for ZONE_DEVICE pages in applicable users of follow_page() as well.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling  (v2)
> Reviewed-by: Alistair Popple  (v6)
>

Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h

2022-07-18 Thread David Hildenbrand
On 15.07.22 17:05, Alex Sierra wrote:
> [WHY]
> It makes more sense to have these helpers in zone specific header
> file, rather than the generic mm.h
> 
> Signed-off-by: Alex Sierra 

Acked-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH v8 02/15] mm: move page zone helpers into new header-specific file

2022-07-11 Thread David Hildenbrand
On 08.07.22 23:25, Felix Kuehling wrote:
> On 2022-07-08 07:28, David Hildenbrand wrote:
>> On 07.07.22 21:03, Alex Sierra wrote:
>>> [WHY]
>>> Have a cleaner way to expose all page zone helpers in one header
>> What exactly is a "page zone"? Do you mean a buddy zone as in
>> include/linux/mmzone.h ?
>>
> Zone as in ZONE_DEVICE. Maybe we could extend mmzone.h instead of 

Yes, I think so. And try moving as little as possible in this patch.

> creating page_zone.h? That should work as long as it's OK to include 
> mmzone.h in memremap.h.

I think so.

-- 
Thanks,

David / dhildenb



Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-11 Thread David Hildenbrand
On 11.07.22 16:00, Matthew Wilcox wrote:
> On Mon, Jul 11, 2022 at 03:35:42PM +0200, David Hildenbrand wrote:
>>> +   /*
>>> +* Device coherent pages are managed by a driver and should not
>>> +* be pinned indefinitely as it prevents the driver moving the
>>> +* page. So when trying to pin with FOLL_LONGTERM instead try
>>> +* to migrate the page out of device memory.
>>> +*/
>>> +   if (folio_is_device_coherent(folio)) {
>>> +   WARN_ON_ONCE(PageCompound(&folio->page));
>>
>> Maybe that belongs into migrate_device_page()?
> 
> ... and should be simply WARN_ON_ONCE(folio_test_large(folio)).
> No need to go converting it back into a page in order to test things
> that can't possibly be true.
> 

Good point.

-- 
Thanks,

David / dhildenb



Re: [PATCH v8 06/15] mm: remove the vma check in migrate_vma_setup()

2022-07-11 Thread David Hildenbrand
On 07.07.22 21:03, Alex Sierra wrote:
> From: Alistair Popple 
> 
> migrate_vma_setup() checks that a valid vma is passed so that the page
> tables can be walked to find the pfns associated with a given address
> range. However in some cases the pfns are already known, such as when
> migrating device coherent pages during pin_user_pages() meaning a valid
> vma isn't required.

As raised in my other reply, without a VMA ... it feels odd to use a
"migrate_vma" API. For an internal (mm/migrate_device.c) use case it is
ok I guess, but it certainly adds a bit of confusion. For example,
because migrate_vma_setup() will undo ref+lock not obtained by it.

I guess the interesting point is that

a) Besides migrate_vma_pages() and migrate_vma_setup(), the ->vma is unused.

b) migrate_vma_setup() does collect+unmap+cleanup if unmap failed.

c) With our source page in our hands, we cannot be processing a hole in
a VMA.



Not sure if it's better. but I would

a) Enforce in migrate_vma_setup() that there is a VMA. Code outside of
mm/migrate_device.c shouldn't be doing some hacks like this.

b) Don't call migrate_vma_setup() from migrate_device_page(), but
directly migrate_vma_unmap() and add a comment.


That will leave a single change to this patch (migrate_vma_pages()). But
is that even required? Because 

> @@ -685,7 +685,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>   continue;
>   }
>  
> - if (!page) {
> + if (!page && migrate->vma) {

How could we ever have !page in case of migrate_device_page()?

Instead, I think a VM_BUG_ON(migrate->vma); should hold and you can just
simplify.

>   if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
>   continue;
>   if (!notified) {
-- 
Thanks,

David / dhildenb



Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-11 Thread David Hildenbrand
On 07.07.22 21:03, Alex Sierra wrote:
> From: Alistair Popple 
> 
> Currently any attempts to pin a device coherent page will fail. This is
> because device coherent pages need to be managed by a device driver, and
> pinning them would prevent a driver from migrating them off the device.
> 
> However this is no reason to fail pinning of these pages. These are
> coherent and accessible from the CPU so can be migrated just like
> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> them first try migrating them out of ZONE_DEVICE.
> 
> Signed-off-by: Alistair Popple 
> Acked-by: Felix Kuehling 
> [hch: rebased to the split device memory checks,
>   moved migrate_device_page to migrate_device.c]
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/gup.c| 47 +++-
>  mm/internal.h   |  1 +
>  mm/migrate_device.c | 53 +
>  3 files changed, 96 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index b65fe8bf5af4..9b6b9923d22d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned 
> long nr_pages,
>   continue;
>   prev_folio = folio;
>  
> - if (folio_is_longterm_pinnable(folio))
> + /*
> +  * Device private pages will get faulted in during gup so it
> +  * shouldn't be possible to see one here.
> +  */
> + if (WARN_ON_ONCE(folio_is_device_private(folio))) {
> + ret = -EFAULT;
> + goto unpin_pages;
> + }

I'd just drop that. Device private pages are never part of a present PTE. So if
we could actually get a grab of one via GUP, we would be in bigger trouble ...
already before this patch.

> +
> + /*
> +  * Device coherent pages are managed by a driver and should not
> +  * be pinned indefinitely as it prevents the driver moving the
> +  * page. So when trying to pin with FOLL_LONGTERM instead try
> +  * to migrate the page out of device memory.
> +  */
> + if (folio_is_device_coherent(folio)) {
> + WARN_ON_ONCE(PageCompound(&folio->page));

Maybe that belongs into migrate_device_page()?

> +
> + /*
> +  * Migration will fail if the page is pinned, so convert

[...]

>  /*
>   * mm/gup.c
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index cf9668376c5a..5decd26dd551 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>   }
>  }
>  EXPORT_SYMBOL(migrate_vma_finalize);
> +
> +/*
> + * Migrate a device coherent page back to normal memory.  The caller should 
> have
> + * a reference on page which will be copied to the new page if migration is
> + * successful or dropped on failure.
> + */
> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags)

Function name should most probably indicate that we're dealing with coherent 
pages here?

> +{
> + unsigned long src_pfn, dst_pfn = 0;
> + struct migrate_vma args;
> + struct page *dpage;
> +
> + lock_page(page);
> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> + args.src = &src_pfn;
> + args.dst = &dst_pfn;
> + args.cpages = 1;
> + args.npages = 1;
> + args.vma = NULL;
> + migrate_vma_setup(&args);
> + if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> + return NULL;

Wow, these refcount and page locking/unlocking rules with this migrate_* API are
confusing now. And the usage here of sometimes returning and sometimes falling
through doesn't make it particularly easy to understand.

I'm not 100% happy about reusing migrate_vma_setup() if there *is no VMA*.
That's just absolutely confusing, because usually migrate_vma_setup() itself 
would do the collection step and ref+lock pages. :/

In general, I can see why/how we're reusing the migrate_vma_* API here, but 
there 
is absolutely no VMA ... not sure what to improve besides providing a second API
that does a simple single-page migration. But that can be changed later ...
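One possible shape for such a dedicated single-page helper (a kernel-style sketch only, not compiled; the function name and exact error handling are assumptions, not necessarily what was merged): skip migrate_vma_setup() and its VMA-centric collect step entirely, and unmap the one known page directly.

```
/* Sketch only: migrate a single device coherent page without a VMA. */
static int migrate_device_coherent_page(struct page *page)
{
	unsigned long src_pfn, dst_pfn = 0;
	struct migrate_vma args;
	struct page *dpage;

	WARN_ON_ONCE(PageCompound(page));

	lock_page(page);
	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
	args.src = &src_pfn;
	args.dst = &dst_pfn;
	args.cpages = 1;
	args.npages = 1;
	args.vma = NULL;

	/*
	 * We already hold a reference and the page lock; with no VMA there
	 * is nothing to collect, so unmap the known page directly.
	 */
	migrate_vma_unmap(&args);
	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
		return -EBUSY;

	dpage = alloc_page(GFP_USER | __GFP_NOWARN);
	if (dpage) {
		lock_page(dpage);
		dst_pfn = migrate_pfn(page_to_pfn(dpage));
	}

	migrate_vma_pages(&args);
	if (src_pfn & MIGRATE_PFN_MIGRATE)
		copy_highpage(dpage, page);
	migrate_vma_finalize(&args);

	return (src_pfn & MIGRATE_PFN_MIGRATE) ? 0 : -EBUSY;
}
```

This keeps the migrate_vma_* machinery for the unmap/restore/finalize steps while making the "no VMA" case explicit at the API boundary.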


> +
> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> +

alloc_page()

> + /*
> +  * get/pin the new page now so we don't have to retry gup after
> +  * migrating. We already have a reference so this should never fail.
> +  */
> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> + __free_pages(dpage, 0);

__free_page()

> + dpage = NULL;
> + }

Hm, this means that we are not pinning via the PTE at hand, but via something
we expect migration to put into the PTE. I'm not really happy about this.

Ideally, we'd make the pinning decision only on the actual GUP path, not in 
here.
Just like in the

Re: [PATCH v8 02/15] mm: move page zone helpers into new header-specific file

2022-07-08 Thread David Hildenbrand
On 07.07.22 21:03, Alex Sierra wrote:
> [WHY]
> Have a cleaner way to expose all page zone helpers in one header

What exactly is a "page zone"? Do you mean a buddy zone as in
include/linux/mmzone.h ?


-- 
Thanks,

David / dhildenb



Re: [PATCH v8 01/15] mm: rename is_pinnable_pages to is_longterm_pinnable_pages

2022-07-08 Thread David Hildenbrand
On 07.07.22 21:03, Alex Sierra wrote:
> is_pinnable_page() and folio_is_pinnable() were renamed to
> is_longterm_pinnable_page() and folio_is_longterm_pinnable()
> respectively. These functions are used in the FOLL_LONGTERM flag
> context.
> 
> Signed-off-by: Alex Sierra 
> ---
>  include/linux/mm.h | 8 
>  mm/gup.c   | 4 ++--
>  mm/gup_test.c  | 2 +-
>  mm/hugetlb.c   | 2 +-
>  4 files changed, 8 insertions(+), 8 deletions(-)
> 

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH v7 04/14] mm: add device coherent vma selection for memory migration

2022-06-30 Thread David Hildenbrand
On 30.06.22 13:44, Alistair Popple wrote:
> 
> David Hildenbrand  writes:
> 
>> On 29.06.22 05:54, Alex Sierra wrote:
>>> This case is used to migrate pages from device memory, back to system
>>> memory. Device coherent type memory is cache coherent from device and CPU
>>> point of view.
>>>
>>> Signed-off-by: Alex Sierra 
>>> Acked-by: Felix Kuehling 
>>> Reviewed-by: Alistair Poppple 
>>> Signed-off-by: Christoph Hellwig 
>>
>>
>> I'm not too familiar with this code, please excuse my naive questions:
>>
>>> @@ -148,15 +148,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> if (is_writable_device_private_entry(entry))
>>> mpfn |= MIGRATE_PFN_WRITE;
>>> } else {
>>> -   if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
>>> -   goto next;
>>
>> Why not exclude MIGRATE_VMA_SELECT_DEVICE_PRIVATE here? IIRC that would
>> have happened before this change.
> 
> I might be missing something as I don't quite follow - this path is for
> normal system pages so we only want to skip selecting them if
> MIGRATE_VMA_SELECT_SYSTEM or MIGRATE_VMA_SELECT_DEVICE_COHERENT aren't
> set.
> 
> Note that MIGRATE_VMA_SELECT_DEVICE_PRIVATE doesn't apply here because
> we already know it's not a device private page by virtue of
> pte_present(pte) == True.

Ah, stupid me, pte_present(pte) is the key.

> 
>>> pfn = pte_pfn(pte);
>>> -   if (is_zero_pfn(pfn)) {
>>> +   if (is_zero_pfn(pfn) &&
>>> +   (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>> mpfn = MIGRATE_PFN_MIGRATE;
>>> migrate->cpages++;
>>> goto next;
>>> }
>>> page = vm_normal_page(migrate->vma, addr, pte);
>>> +   if (page && !is_zone_device_page(page) &&
>>
>> I'm wondering if that check logically belongs into patch #2.
> 
> I don't think so as it would break functionality until the below
> conditionals are added - we explicitly don't want to skip
> is_zone_device_page(page) == False here because that is the pages we are
> trying to select.
> 
> You could add in this:
> 
>>> +   !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> 
> But then in patch 2 we know this can never be true because we've already
> checked for !MIGRATE_VMA_SELECT_SYSTEM there.


Ah, okay, thanks

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v7 04/14] mm: add device coherent vma selection for memory migration

2022-06-30 Thread David Hildenbrand
On 29.06.22 05:54, Alex Sierra wrote:
> This case is used to migrate pages from device memory, back to system
> memory. Device coherent type memory is cache coherent from device and CPU
> point of view.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> Reviewed-by: Alistair Poppple 
> Signed-off-by: Christoph Hellwig 


I'm not too familiar with this code, please excuse my naive questions:

> @@ -148,15 +148,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (is_writable_device_private_entry(entry))
>   mpfn |= MIGRATE_PFN_WRITE;
>   } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> - goto next;

Why not exclude MIGRATE_VMA_SELECT_DEVICE_PRIVATE here? IIRC that would
have happened before this change.


>   pfn = pte_pfn(pte);
> - if (is_zero_pfn(pfn)) {
> + if (is_zero_pfn(pfn) &&
> + (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
>   migrate->cpages++;
>   goto next;
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
> + if (page && !is_zone_device_page(page) &&

I'm wondering if that check logically belongs into patch #2.

> + !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + goto next;
> + else if (page && is_device_coherent_page(page) &&
> + (!(migrate->flags & 
> MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +  page->pgmap->owner != migrate->pgmap_owner))


In general LGTM

-- 
Thanks,

David / dhildenb



Re: [PATCH v7 01/14] mm: rename is_pinnable_pages to is_pinnable_longterm_pages

2022-06-29 Thread David Hildenbrand
On 30.06.22 00:08, Felix Kuehling wrote:
> On 2022-06-29 03:33, David Hildenbrand wrote:
>> On 29.06.22 05:54, Alex Sierra wrote:
>>> is_pinnable_page() and folio_is_pinnable() were renamed to
>>> is_longterm_pinnable_page() and folio_is_longterm_pinnable()
>>> respectively. These functions are used in the FOLL_LONGTERM flag
>>> context.
>> Subject talks about "*_pages"
>>
>>
>> Can you elaborate why the move from mm.h to memremap.h is justified?
> 
> Patch 2 adds is_device_coherent_page in memremap.h and updates 
> is_longterm_pinnable_page to call is_device_coherent_page. memremap.h 
> cannot include mm.h because it is itself included by mm.h. So the choice 
> was to move is_longterm_pinnable_page to memremap.h, or move 
> is_device_coherent_page and all its dependencies to mm.h. The latter 
> would have been a bigger change.

I really don't think something mm generic that compiles without
ZONE_DEVICE belongs into memremap.h. Please find a cleaner way to get
this done.

-- 
Thanks,

David / dhildenb



Re: [PATCH v7 02/14] mm: add zone device coherent type memory support

2022-06-29 Thread David Hildenbrand
On 29.06.22 05:54, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> Reviewed-by: Alistair Popple 
> [hch: rebased ontop of the refcount changes,
>   removed is_dev_private_or_coherent_page]
> Signed-off-by: Christoph Hellwig 

From what I can tell, this looks good to me

Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v7 03/14] mm: handling Non-LRU pages returned by vm_normal_pages

2022-06-29 Thread David Hildenbrand
On 29.06.22 05:54, Alex Sierra wrote:
> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page, and for
> COW. They do not support LRU lists, NUMA migration or THP.
> 
> Callers to follow_page that expect LRU pages, are also checked for
> device zone pages due to DEVICE_COHERENT type.

Can we rephrase that to (because zeropage)

"Callers to follow_page() currently don't expect ZONE_DEVICE pages,
however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for
ZONE_DEVICE pages in applicable users of follow_page() as well."



[...]

>   /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a089145cad4..e18555af9024 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -624,6 +624,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (is_zero_pfn(pfn))
>   return NULL;
>   if (pte_devmap(pte))
> +/*
> + * NOTE: New uers of ZONE_DEVICE will not set pte_devmap() and will have

s/uers/users/

> + * refcounts incremented on their struct pages when they are inserted into
> + * PTEs, thus they are safe to return here. Legacy ZONE_DEVICE pages that set
> + * pte_devmap() do not have refcounts. Example of legacy ZONE_DEVICE is
> + * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers.
> + */

[...]

> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ba5592655ee3..e034aae2a98b 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -95,7 +95,7 @@ static unsigned long change_pte_range(struct mmu_gather 
> *tlb,
>   continue;
>  
>   page = vm_normal_page(vma, addr, oldpte);
> - if (!page || PageKsm(page))
> + if (!page || is_zone_device_page(page) || 
> PageKsm(page))
>   continue;
>  
>   /* Also skip shared copy-on-write pages */

In -next/-mm there is now an additional can_change_pte_writable() that
calls vm_normal_page() --  added by me. I assume that that is indeed
fine because we can simply map device coherent pages writable.

Besides the nits, LGTM

Acked-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v7 01/14] mm: rename is_pinnable_pages to is_pinnable_longterm_pages

2022-06-29 Thread David Hildenbrand
On 29.06.22 05:54, Alex Sierra wrote:
> is_pinnable_page() and folio_is_pinnable() were renamed to
> is_longterm_pinnable_page() and folio_is_longterm_pinnable()
> respectively. These functions are used in the FOLL_LONGTERM flag
> context.

Subject talks about "*_pages"


Can you elaborate why the move from mm.h to memremap.h is justified?

I'd have called it "is_longterm_pinnable_page", but I am not a native
speaker, so no strong opinion :)


-- 
Thanks,

David / dhildenb



Re: [PATCH v6 02/14] mm: handling Non-LRU pages returned by vm_normal_pages

2022-06-28 Thread David Hildenbrand
On 28.06.22 02:14, Alex Sierra wrote:
> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page, and for
> COW. They do not support LRU lists, NUMA migration or THP.
> 
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> Reviewed-by: Alistair Popple 
> ---

I think my review feedback regarding FOLL_LRU has been ignored.


-- 
Thanks,

David / dhildenb



Re: [PATCH v6 06/14] mm: add device coherent checker to is_pinnable_page

2022-06-28 Thread David Hildenbrand
On 28.06.22 02:14, Alex Sierra wrote:
> is_device_coherent checker was added to is_pinnable_page and renamed
> to is_longterm_pinnable_page. The reason is that device coherent
> pages are not supported for longterm pinning.
> 
> Signed-off-by: Alex Sierra 
> ---
>  include/linux/memremap.h | 25 +
>  include/linux/mm.h   | 24 
>  mm/gup.c |  5 ++---
>  mm/gup_test.c|  4 ++--
>  mm/hugetlb.c |  2 +-
>  5 files changed, 30 insertions(+), 30 deletions(-)


Rename of the function should be a separate cleanup patch before any
other changes, and the remaining change should be squashed into patch
#1, to logically make sense, because it still states "no one should be
allowed to pin such memory so that it can always be evicted."

Or am I missing something?

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-23 Thread David Hildenbrand
On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/23/2022 2:57 AM, David Hildenbrand wrote:
>> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>>> David Hildenbrand  writes:
>>>>>>>
>>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point 
>>>>>>>>>>>>>>> of view.
>>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus 
>>>>>>>>>>>>>>> (like CAPI
>>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. 
>>>>>>>>>>>>>>> However,
>>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can 
>>>>>>>>>>>>>>> always be
>>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Alex Sierra
>>>>>>>>>>>>>>> Acked-by: Felix Kuehling
>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple
>>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>>> removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>include/linux/memremap.h | 19 +++
>>>>>>>>>>>>>>>mm/memcontrol.c  |  7 ---
>>>>>>>>>>>>>>>mm/memory-failure.c  |  8 ++--
>>>>>>>>>>>>>>>mm/memremap.c| 10 ++
>>>>>>>>>>>>>>>mm/migrate_device.c  | 16 +++-
>>>>>>>>>>>>>>>mm/rmap.c|  5 +++--
>>>>>>>>>>>>>>>6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>>> * A more complete discussion of unaddressable memory 
>>>>>>>>>>>>>>> may be found in
>>>>>>>>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU 
>>>>>>>>>>>>>>> point of view. This
>>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like 
>>>>>>>>>>>>>>> CAPI or CXL). A

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-23 Thread David Hildenbrand
On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>> David Hildenbrand  writes:
>>>>>
>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of 
>>>>>>>>>>>>> view.
>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like 
>>>>>>>>>>>>> CAPI
>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. 
>>>>>>>>>>>>> However,
>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always 
>>>>>>>>>>>>> be
>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Alex Sierra
>>>>>>>>>>>>> Acked-by: Felix Kuehling
>>>>>>>>>>>>> Reviewed-by: Alistair Popple
>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   include/linux/memremap.h | 19 +++
>>>>>>>>>>>>>   mm/memcontrol.c  |  7 ---
>>>>>>>>>>>>>   mm/memory-failure.c  |  8 ++--
>>>>>>>>>>>>>   mm/memremap.c| 10 ++
>>>>>>>>>>>>>   mm/migrate_device.c  | 16 +++-
>>>>>>>>>>>>>   mm/rmap.c|  5 +++--
>>>>>>>>>>>>>   6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>* A more complete discussion of unaddressable memory may 
>>>>>>>>>>>>> be found in
>>>>>>>>>>>>>* include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>*
>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU 
>>>>>>>>>>>>> point of view. This
>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like 
>>>>>>>>>>>>> CAPI or CXL). A
>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and 
>>>>>>>>>>>>> with that memory
>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. 
>>>>>>>>>>>>> However no one
>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking 
>>>>>>>>>>>> about special pages

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-21 Thread David Hildenbrand
On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>> On 21.06.22 13:55, Alistair Popple wrote:
>>> David Hildenbrand  writes:
>>>
>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of 
>>>>>>>>>>> view.
>>>>>>>>>>> This is used on platforms that have an advanced system bus (like 
>>>>>>>>>>> CAPI
>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. 
>>>>>>>>>>> However,
>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>> evicted.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alex Sierra
>>>>>>>>>>> Acked-by: Felix Kuehling
>>>>>>>>>>> Reviewed-by: Alistair Popple
>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>   removed is_dev_private_or_coherent_page]
>>>>>>>>>>> Signed-off-by: Christoph Hellwig
>>>>>>>>>>> ---
>>>>>>>>>>>  include/linux/memremap.h | 19 +++
>>>>>>>>>>>  mm/memcontrol.c  |  7 ---
>>>>>>>>>>>  mm/memory-failure.c  |  8 ++--
>>>>>>>>>>>  mm/memremap.c| 10 ++
>>>>>>>>>>>  mm/migrate_device.c  | 16 +++-
>>>>>>>>>>>  mm/rmap.c|  5 +++--
>>>>>>>>>>>  6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>   * A more complete discussion of unaddressable memory may be 
>>>>>>>>>>> found in
>>>>>>>>>>>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>   *
>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point 
>>>>>>>>>>> of view. This
>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like 
>>>>>>>>>>> CAPI or CXL). A
>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with 
>>>>>>>>>>> that memory
>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. 
>>>>>>>>>>> However no one
>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking 
>>>>>>>>>> about special pages
>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>
>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be 
>>>>>>>>>>> evicted.
>>>>>>>>>>> + *
>>>>>>>>>>>   * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>   * Host memory that has similar access semantics as System RAM 

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-21 Thread David Hildenbrand
On 21.06.22 13:55, Alistair Popple wrote:
> 
> David Hildenbrand  writes:
> 
>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>
>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>> Device memory that is cache coherent from device and CPU point of 
>>>>>>>>> view.
>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. 
>>>>>>>>> However,
>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>> evicted.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alex Sierra 
>>>>>>>>> Acked-by: Felix Kuehling 
>>>>>>>>> Reviewed-by: Alistair Popple 
>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>  removed is_dev_private_or_coherent_page]
>>>>>>>>> Signed-off-by: Christoph Hellwig 
>>>>>>>>> ---
>>>>>>>>> include/linux/memremap.h | 19 +++
>>>>>>>>> mm/memcontrol.c  |  7 ---
>>>>>>>>> mm/memory-failure.c  |  8 ++--
>>>>>>>>> mm/memremap.c| 10 ++
>>>>>>>>> mm/migrate_device.c  | 16 +++-
>>>>>>>>> mm/rmap.c|  5 +++--
>>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>  * A more complete discussion of unaddressable memory may be 
>>>>>>>>> found in
>>>>>>>>>  * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>  *
>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of 
>>>>>>>>> view. This
>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI 
>>>>>>>>> or CXL). A
>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with 
>>>>>>>>> that memory
>>>>>>>>> + * type. Any page of a process can be migrated to such memory. 
>>>>>>>>> However no one
>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about 
>>>>>>>> special pages
>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>
>>>>>>>>> + * should be allowed to pin such memory so that it can always be 
>>>>>>>>> evicted.
>>>>>>>>> + *
>>>>>>>>>  * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>  * Host memory that has similar access semantics as System RAM 
>>>>>>>>> i.e. DMA
>>>>>>>>>  * coherent and supports page pinning. In support of coordinating 
>>>>>>>>> page
>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>> enum memory_type {
>>>>>>>>>   /* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>   MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>> + MEMORY_DEVICE_COHERENT,
>>>>>>>>>   MEMORY_DEVICE_FS_DAX,

Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-21 Thread David Hildenbrand
On 21.06.22 13:25, Felix Kuehling wrote:
> 
> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>> evicted.
>>>>>>>
>>>>>>> Signed-off-by: Alex Sierra 
>>>>>>> Acked-by: Felix Kuehling 
>>>>>>> Reviewed-by: Alistair Popple 
>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>  removed is_dev_private_or_coherent_page]
>>>>>>> Signed-off-by: Christoph Hellwig 
>>>>>>> ---
>>>>>>> include/linux/memremap.h | 19 +++
>>>>>>> mm/memcontrol.c  |  7 ---
>>>>>>> mm/memory-failure.c  |  8 ++--
>>>>>>> mm/memremap.c| 10 ++
>>>>>>> mm/migrate_device.c  | 16 +++-
>>>>>>> mm/rmap.c|  5 +++--
>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-18 Thread David Hildenbrand
On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>> no one should be allowed to pin such memory so that it can always be
>>>>> evicted.
>>>>>
>>>>> Signed-off-by: Alex Sierra 
>>>>> Acked-by: Felix Kuehling 
>>>>> Reviewed-by: Alistair Popple 
>>>>> [hch: rebased ontop of the refcount changes,
>>>>> removed is_dev_private_or_coherent_page]
>>>>> Signed-off-by: Christoph Hellwig 
>>>>> ---
>>>>>include/linux/memremap.h | 19 +++
>>>>>mm/memcontrol.c  |  7 ---
>>>>>mm/memory-failure.c  |  8 ++--
>>>>>mm/memremap.c| 10 ++
>>>>>mm/migrate_device.c  | 16 +++-
>>>>>mm/rmap.c|  5 +++--
>>>>>6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>> --- a/include/linux/memremap.h
>>>>> +++ b/include/linux/memremap.h
>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>> * A more complete discussion of unaddressable memory may be found in
>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>> *
>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>> + * Device memory that is cache coherent from device and CPU point of 
>>>>> view. This
>>>>> + * is used on platforms that have an advanced system bus (like CAPI or 
>>>>> CXL). A
>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that 
>>>>> memory
>>>>> + * type. Any page of a process can be migrated to such memory. However 
>>>>> no one
>>>> Any page might not be right, I'm pretty sure. ... just thinking about 
>>>> special pages
>>>> like vdso, shared zeropage, ... pinned pages ...
>> Well, you cannot migrate long term pages, that's what I meant :)
>>
>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>> + *
>>>>> * MEMORY_DEVICE_FS_DAX:
>>>>> * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>> * coherent and supports page pinning. In support of coordinating page
>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>enum memory_type {
>>>>>   /* 0 is reserved to catch uninitialized type fields */
>>>>>   MEMORY_DEVICE_PRIVATE = 1,
>>>>> + MEMORY_DEVICE_COHERENT,
>>>>>   MEMORY_DEVICE_FS_DAX,
>>>>>   MEMORY_DEVICE_GENERIC,
>>>>>   MEMORY_DEVICE_PCI_P2PDMA,
>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const 
>>>>> struct folio *folio)
>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I 
>>>> think.
>>>>
>>>>
>>>> However, where exactly is pinning forbidden?
>>> Long-term pinning is forbidden since it would interfere with the device
>>> memory manager owning the
>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>> is allowed on this device type.
>> I don't see updates to folio_is_pinnable() in this patch.
> Device coherent type pages should return true here, as they are pinnable 
> pages.

That function is only called for long-term pinnings in try_grab_folio().

>>
>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
> 
> As far as I understand this returns NULL for long-term pinned pages. 
> Otherwise their refcount gets incremented.

I don't follow.

You're saying

a) folio_is_pinnable() returns true for device coherent pages

and that

b) device coherent pages don't get long-term pinned


Yet, the code says

struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_folio(page, refs);
	else if (flags & FOLL_PIN) {
		struct folio *folio;

		/*
		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
		 * right zone, so fail and let the caller fall back to the slow
		 * path.
		 */
		if (unlikely((flags & FOLL_LONGTERM) &&
			     !is_pinnable_page(page)))
			return NULL;
		...
		return folio;
	}
}


What prevents these pages from getting long-term pinned as stated in this patch?

I am probably missing something important.

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-18 Thread David Hildenbrand
On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> On 31.05.22 22:00, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra 
>>> Acked-by: Felix Kuehling 
>>> Reviewed-by: Alistair Popple 
>>> [hch: rebased ontop of the refcount changes,
>>>removed is_dev_private_or_coherent_page]
>>> Signed-off-by: Christoph Hellwig 
>>> ---
>>>   include/linux/memremap.h | 19 +++
>>>   mm/memcontrol.c  |  7 ---
>>>   mm/memory-failure.c  |  8 ++--
>>>   mm/memremap.c| 10 ++
>>>   mm/migrate_device.c  | 16 +++-
>>>   mm/rmap.c|  5 +++--
>>>   6 files changed, 49 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index 8af304f6b504..9f752ebed613 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>* A more complete discussion of unaddressable memory may be found in
>>>* include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>*
>>> + * MEMORY_DEVICE_COHERENT:
>>> + * Device memory that is cache coherent from device and CPU point of view. 
>>> This
>>> + * is used on platforms that have an advanced system bus (like CAPI or 
>>> CXL). A
>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that 
>>> memory
>>> + * type. Any page of a process can be migrated to such memory. However no 
>>> one
>> Any page might not be right, I'm pretty sure. ... just thinking about 
>> special pages
>> like vdso, shared zeropage, ... pinned pages ...
> 

Well, you cannot migrate long term pages, that's what I meant :)

>>
>>> + * should be allowed to pin such memory so that it can always be evicted.
>>> + *
>>>* MEMORY_DEVICE_FS_DAX:
>>>* Host memory that has similar access semantics as System RAM i.e. DMA
>>>* coherent and supports page pinning. In support of coordinating page
>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>   enum memory_type {
>>> /* 0 is reserved to catch uninitialized type fields */
>>> MEMORY_DEVICE_PRIVATE = 1,
>>> +   MEMORY_DEVICE_COHERENT,
>>> MEMORY_DEVICE_FS_DAX,
>>> MEMORY_DEVICE_GENERIC,
>>> MEMORY_DEVICE_PCI_P2PDMA,
>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const 
>>> struct folio *folio)
>> In general, this LGTM, and it should be correct with PageAnonExclusive I 
>> think.
>>
>>
>> However, where exactly is pinning forbidden?
> 
> Long-term pinning is forbidden since it would interfere with the device 
> memory manager owning the
> device-coherent pages (e.g. evictions in TTM). However, normal pinning 
> is allowed on this device type.

I don't see updates to folio_is_pinnable() in this patch.

So wouldn't try_grab_folio() simply pin these pages? What am I missing?

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

2022-06-18 Thread David Hildenbrand
On 31.05.22 22:00, Alex Sierra wrote:
> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page tables, and
> for COW, they do not support LRU lists, NUMA migration or THP.
> 
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> ---
>  fs/proc/task_mmu.c | 2 +-
>  include/linux/mm.h | 3 ++-
>  mm/gup.c   | 6 +-
>  mm/huge_memory.c   | 2 +-
>  mm/khugepaged.c| 9 ++---
>  mm/ksm.c   | 6 +++---
>  mm/madvise.c   | 4 ++--
>  mm/memory.c| 9 -
>  mm/mempolicy.c | 2 +-
>  mm/migrate.c   | 4 ++--
>  mm/mlock.c | 2 +-
>  mm/mprotect.c  | 2 +-
>  12 files changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 2d04e3470d4c..2dd8c8a66924 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1792,7 +1792,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
> struct vm_area_struct *vma,
>   return NULL;
>  
>   page = vm_normal_page(vma, addr, pte);
> - if (!page)
> + if (!page || is_zone_device_page(page))
>   return NULL;
>  
>   if (PageReserved(page))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..d3f43908ff8d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -601,7 +601,7 @@ struct vm_operations_struct {
>  #endif
>   /*
>* Called by vm_normal_page() for special PTEs to find the
> -  * page for @addr.  This is useful if the default behavior
> +  * page for @addr. This is useful if the default behavior
>* (using pte_page()) would not find the correct page.
>*/
>   struct page *(*find_special_page)(struct vm_area_struct *vma,
> @@ -2934,6 +2934,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
> unsigned long address,
>  #define FOLL_NUMA0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
> entry */
>  #define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
> +#define FOLL_LRU0x1000  /* return only LRU (anon or page cache) */

Does that statement hold for special pages like the shared zeropage?

Also, this flag is only valid for in-kernel follow_page() but not for
the ordinary GUP interfaces. What are the semantics there? Is it fenced?


I really wonder if you should simply similarly teach the handful of
users of follow_page() to just special case these pages ... sounds
cleaner to me then adding flags with unclear semantics. Alternatively,
properly document what that flag is actually doing and where it applies.


I know, there was discussion on ... sorry for jumping in now, but this
doesn't look clean to me yet.

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

2022-06-18 Thread David Hildenbrand
On 31.05.22 22:00, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> Reviewed-by: Alistair Popple 
> [hch: rebased ontop of the refcount changes,
>   removed is_dev_private_or_coherent_page]
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/memremap.h | 19 +++
>  mm/memcontrol.c  |  7 ---
>  mm/memory-failure.c  |  8 ++--
>  mm/memremap.c| 10 ++
>  mm/migrate_device.c  | 16 +++-
>  mm/rmap.c|  5 +++--
>  6 files changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 8af304f6b504..9f752ebed613 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -41,6 +41,13 @@ struct vmem_altmap {
>   * A more complete discussion of unaddressable memory may be found in
>   * include/linux/hmm.h and Documentation/vm/hmm.rst.
>   *
> + * MEMORY_DEVICE_COHERENT:
> + * Device memory that is cache coherent from device and CPU point of view. 
> This
> + * is used on platforms that have an advanced system bus (like CAPI or CXL). 
> A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that 
> memory
> + * type. Any page of a process can be migrated to such memory. However no one

Any page might not be right, I'm pretty sure. ... just thinking about special 
pages
like vdso, shared zeropage, ... pinned pages ...

> + * should be allowed to pin such memory so that it can always be evicted.
> + *
>   * MEMORY_DEVICE_FS_DAX:
>   * Host memory that has similar access semantics as System RAM i.e. DMA
>   * coherent and supports page pinning. In support of coordinating page
> @@ -61,6 +68,7 @@ struct vmem_altmap {
>  enum memory_type {
>   /* 0 is reserved to catch uninitialized type fields */
>   MEMORY_DEVICE_PRIVATE = 1,
> + MEMORY_DEVICE_COHERENT,
>   MEMORY_DEVICE_FS_DAX,
>   MEMORY_DEVICE_GENERIC,
>   MEMORY_DEVICE_PCI_P2PDMA,
> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct 
> folio *folio)

In general, this LGTM, and it should be correct with PageAnonExclusive I think.


However, where exactly is pinning forbidden?

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-06-18 Thread David Hildenbrand
On 17.06.22 04:19, Andrew Morton wrote:
> On Tue, 31 May 2022 15:00:28 -0500 Alex Sierra  wrote:
> 
>> This is our MEMORY_DEVICE_COHERENT patch series rebased and updated
>> for current 5.18.0
> 
> I plan to move this series into the non-rebasing mm-stable branch in a
> few days.  Unless sternly told not to do so!
> 

I want to double-check some things regarding PageAnonExclusive
interaction. I'm busy, but I'll try prioritizing it.

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only

2022-03-31 Thread David Hildenbrand
On 31.03.22 10:53, Christoph Hellwig wrote:
>> -page = vm_normal_page(vma, addr, pte);
>> +page = vm_normal_lru_page(vma, addr, pte);
> 
> Why can't this deal with ZONE_DEVICE pages?  It certainly has
> nothing do with a LRU I think.  In fact being able to have
> stats that count say the number of device pages here would
> probably be useful at some point.
> 
> In general I find the vm_normal_lru_page vs vm_normal_page
> API highly confusing.  An explicit check for zone device pages
> in the dozen or so spots that care has a much better documentation
> value, especially if accompanied by comments where it isn't entirely
> obvious.

What's your thought on FOLL_LRU?

-- 
Thanks,

David / dhildenb



Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-17 Thread David Hildenbrand
On 17.03.22 03:54, Alistair Popple wrote:
> Felix Kuehling  writes:
> 
>> On 2022-03-11 04:16, David Hildenbrand wrote:
>>> On 10.03.22 18:26, Alex Sierra wrote:
>>>> DEVICE_COHERENT pages introduce a subtle distinction in the way
>>>> "normal" pages can be used by various callers throughout the kernel.
>>>> They behave like normal pages for purposes of mapping in CPU page
>>>> tables, and for COW. But they do not support LRU lists, NUMA
>>>> migration or THP. Therefore we split vm_normal_page into two
>>>> functions vm_normal_any_page and vm_normal_lru_page. The latter will
>>>> only return pages that can be put on an LRU list and that support
>>>> NUMA migration, KSM and THP.
>>>>
>>>> We also introduced a FOLL_LRU flag that adds the same behaviour to
>>>> follow_page and related APIs, to allow callers to specify that they
>>>> expect to put pages on an LRU list.
>>>>
>>> I still don't see the need for s/vm_normal_page/vm_normal_any_page/. And
>>> as this patch is dominated by that change, I'd suggest (again) to just
>>> drop it as I don't see any value of that renaming. No specifier implies any.
>>
>> OK. If nobody objects, we can adopts that naming convention.
> 
> I'd prefer we avoid the churn too, but I don't think we should make
> vm_normal_page() the equivalent of vm_normal_any_page(). It would mean
> vm_normal_page() would return non-LRU device coherent pages, but to me at 
> least
> device coherent pages seem special and not what I'd expect from a function 
> with
> "normal" in the name.
> 
> So I think it would be better to s/vm_normal_lru_page/vm_normal_page/ and keep
> vm_normal_any_page() (or perhaps call it vm_any_page?). This is basically what
> the previous incarnation of this feature did:
> 
> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte, bool with_public_device);
> #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
> 
> Except we should add:
> 
> #define vm_normal_any_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, 
> true)
> 

"normal" simply tells us that this is not a special mapping -- IOW, we
want the VM to take a look at the memmap and not treat it like a PFN
map. What we're changing is that we're now also returning non-lru pages.
Fair enough, that's why we introduce vm_normal_lru_page() as a
replacement where we really can only deal with lru pages.

vm_normal_page vs vm_normal_lru_page is good enough. "lru" further
limits what we get via vm_normal_page, that's even how it's implemented.

vm_normal_page vs vm_normal_any_page is confusing IMHO.


-- 
Thanks,

David / dhildenb



Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-14 Thread David Hildenbrand
On 10.03.22 18:26, Alex Sierra wrote:
> DEVICE_COHERENT pages introduce a subtle distinction in the way
> "normal" pages can be used by various callers throughout the kernel.
> They behave like normal pages for purposes of mapping in CPU page
> tables, and for COW. But they do not support LRU lists, NUMA
> migration or THP. Therefore we split vm_normal_page into two
> functions vm_normal_any_page and vm_normal_lru_page. The latter will
> only return pages that can be put on an LRU list and that support
> NUMA migration, KSM and THP.
> 
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
> 

I still don't see the need for s/vm_normal_page/vm_normal_any_page/. And
as this patch is dominated by that change, I'd suggest (again) to just
drop it as I don't see any value of that renaming. No specifier implies any.

The general idea of this change LGTM.


I wonder how this interacts with the actual DEVICE_COHERENT coherent
series. Is this a preparation? Should it be part of the DEVICE_COHERENT
series?

IOW, should this patch start with

"With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page tables, and for
COW, they do not support LRU lists, NUMA migration or THP. [...]"

But then, I'm confused by patch 2 and 3, because it feels more like we'd
already have DEVICE_COHERENT then ("hmm_is_coherent_type").


-- 
Thanks,

David / dhildenb



Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-01 Thread David Hildenbrand


>>
>>> if (PageReserved(page))
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index c31d04b46a5e..17d049311b78 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct 
>>> *mm, unsigned long addr,
>>> goto out;
>>>   
>>> /* FOLL_DUMP to ignore special (like zero) pages */
>>> -   follflags = FOLL_GET | FOLL_DUMP;
>>> +   follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>> page = follow_page(vma, addr, follflags);
>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
> 
> This function later calls isolate_lru_page, which is something you can't 
> do with a device page.
> 

Then, that code might require care instead. We most certainly don't want
to have random memory holes in a dump just because some anonymous memory
was migrated to DEVICE_COHERENT.

-- 
Thanks,

David / dhildenb



Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-01 Thread David Hildenbrand
On 01.03.22 17:30, Felix Kuehling wrote:
> Am 2022-03-01 um 11:22 schrieb David Hildenbrand:
>>>>>   if (PageReserved(page))
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index c31d04b46a5e..17d049311b78 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct 
>>>>> *mm, unsigned long addr,
>>>>>   goto out;
>>>>>
>>>>>   /* FOLL_DUMP to ignore special (like zero) pages */
>>>>> - follflags = FOLL_GET | FOLL_DUMP;
>>>>> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>>>>   page = follow_page(vma, addr, follflags);
>>>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
>>> This function later calls isolate_lru_page, which is something you can't
>>> do with a device page.
>>>
>> Then, that code might require care instead. We most certainly don't want
>> to have random memory holes in a dump just because some anonymous memory
>> was migrated to DEVICE_COHERENT.
> I don't think this code is for core dumps. The call chain I see is
> 
> SYSCALL_DEFINE6(move_pages, ...) -> kernel_move_pages -> do_pages_move 
> -> add_page_for_migration
> 

Ah, sorry, I got misled by FOLL_DUMP and thought we'd be in
get_dump_page() . As you said, nothing to do.

-- 
Thanks,

David / dhildenb



Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-01 Thread David Hildenbrand
On 28.02.22 21:34, Alex Sierra wrote:
> DEVICE_COHERENT pages introduce a subtle distinction in the way
> "normal" pages can be used by various callers throughout the kernel.
> They behave like normal pages for purposes of mapping in CPU page
> tables, and for COW. But they do not support LRU lists, NUMA
> migration or THP. Therefore we split vm_normal_page into two
> functions vm_normal_any_page and vm_normal_lru_page. The latter will
> only return pages that can be put on an LRU list and that support
> NUMA migration and THP.

Why not s/vm_normal_any_page/vm_normal_page/ and avoid code churn?

> 
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.

[...]
> -#define FOLL_WRITE   0x01/* check pte is writable */
> -#define FOLL_TOUCH   0x02/* mark page accessed */
> -#define FOLL_GET 0x04/* do get_page on page */
> -#define FOLL_DUMP0x08/* give error on hole if it would be zero */
> -#define FOLL_FORCE   0x10/* get_user_pages read/write w/o permission */
> -#define FOLL_NOWAIT  0x20/* if a disk transfer is needed, start the IO
> -  * and return without waiting upon it */
> -#define FOLL_POPULATE0x40/* fault in pages (with FOLL_MLOCK) */
> -#define FOLL_NOFAULT 0x80/* do not fault in pages */
> -#define FOLL_HWPOISON0x100   /* check page is hwpoisoned */
> -#define FOLL_NUMA0x200   /* force NUMA hinting page fault */
> -#define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
> entry */
> -#define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
> -#define FOLL_MLOCK   0x1000  /* lock present pages */
> -#define FOLL_REMOTE  0x2000  /* we are working on non-current tsk/mm */
> -#define FOLL_COW 0x4000  /* internal GUP flag */
> -#define FOLL_ANON0x8000  /* don't do file mappings */
> -#define FOLL_LONGTERM0x1 /* mapping lifetime is indefinite: see 
> below */
> -#define FOLL_SPLIT_PMD   0x2 /* split huge pmd before returning */
> -#define FOLL_PIN 0x4 /* pages must be released via unpin_user_page */
> -#define FOLL_FAST_ONLY   0x8 /* gup_fast: prevent fall-back to slow 
> gup */
> +#define FOLL_WRITE   0x01 /* check pte is writable */
> +#define FOLL_TOUCH   0x02 /* mark page accessed */
> +#define FOLL_GET 0x04 /* do get_page on page */
> +#define FOLL_DUMP0x08 /* give error on hole if it would be zero */
> +#define FOLL_FORCE   0x10 /* get_user_pages read/write w/o permission */
> +#define FOLL_NOWAIT  0x20 /* if a disk transfer is needed, start the IO
> +   * and return without waiting upon it */
> +#define FOLL_POPULATE0x40 /* fault in pages (with FOLL_MLOCK) */
> +#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> +#define FOLL_HWPOISON0x100/* check page is hwpoisoned */
> +#define FOLL_NUMA0x200/* force NUMA hinting page fault */
> +#define FOLL_MIGRATION   0x400/* wait for page to replace migration 
> entry */
> +#define FOLL_TRIED   0x800/* a retry, previous pass started an IO */
> +#define FOLL_MLOCK   0x1000   /* lock present pages */
> +#define FOLL_REMOTE  0x2000   /* we are working on non-current tsk/mm */
> +#define FOLL_COW 0x4000   /* internal GUP flag */
> +#define FOLL_ANON0x8000   /* don't do file mappings */
> +#define FOLL_LONGTERM0x1  /* mapping lifetime is indefinite: see 
> below */
> +#define FOLL_SPLIT_PMD   0x2  /* split huge pmd before returning */
> +#define FOLL_PIN 0x4  /* pages must be released via unpin_user_page 
> */
> +#define FOLL_FAST_ONLY   0x8  /* gup_fast: prevent fall-back to slow 
> gup */
> +#define FOLL_LRU 0x10 /* return only LRU (anon or page cache) */
>  

Can we minimize code churn, please?


>   if (PageReserved(page))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c31d04b46a5e..17d049311b78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, 
> unsigned long addr,
>   goto out;
>  
>   /* FOLL_DUMP to ignore special (like zero) pages */
> - follflags = FOLL_GET | FOLL_DUMP;
> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>   page = follow_page(vma, addr, follflags);

Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.


-- 
Thanks,

David / dhildenb



Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-16 Thread David Hildenbrand
On 16.02.22 03:36, Alistair Popple wrote:
> On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
>> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>>
>>> Device private and device coherent pages are not marked with pte_devmap and 
>>> they
>>> are backed by a struct page. The only way of inserting them is via 
>>> migrate_vma.
>>> The refcount is decremented in zap_pte_range() on munmap() with special 
>>> handling
>>> for device private pages. Looking at it again though I wonder if there is 
>>> any
>>> special treatment required in zap_pte_range() for device coherent pages 
>>> given
>>> they count as present pages.
>>
>> This is what I guessed, but we shouldn't be able to just drop
>> pte_devmap on these pages without any other work?? Granted it does
>> very little already..
> 
> Yes, I agree we need to check this more closely. For device private pages
> not having pte_devmap is fine, because they are non-present swap entries so
> they always get special handling in the swap entry paths but the same isn't
> true for coherent device pages.

I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
look like when mapped? I'd assume it's also (currently) still offset by
one, meaning, if it's mapped into a single page table it's always at
least 2.

Just a note that if my assumption is correct and if we'd have such a
page mapped R/O, do_wp_page() would always have to copy it
unconditionally and would not be able to reuse it on write faults.
(while I'm working on improving the reuse logic, I think there is also
work in progress to avoid this additional reference on some ZONE_DEVICE
stuff -- I'd assume that would include DEVICE_COHERENT ?)

> 
>> I thought at least gup_fast needed to be touched or did this get
>> handled by scanning the page list after the fact?
> 
> Right, for gup I think the only special handling required is to prevent
> pinning. I had assumed that check_and_migrate_movable_pages() would still get
> called for gup_fast but unless I've missed something I don't think it does.
> That means gup_fast could still pin movable and coherent pages. Technically
> that is ok for coherent pages, but it's undesirable.

We really should have the same pinning rules for GUP vs. GUP-fast.
is_pinnable_page() should be the right place for such checks (similarly
as indicated in my reply to the migration series).

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-15 Thread David Hildenbrand
On 11.02.22 18:07, Felix Kuehling wrote:
> 
> Am 2022-02-11 um 11:39 schrieb David Hildenbrand:
>> On 11.02.22 17:15, David Hildenbrand wrote:
>>> On 01.02.22 16:48, Alex Sierra wrote:
>>>> Device memory that is cache coherent from device and CPU point of view.
>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>> no one should be allowed to pin such memory so that it can always be
>>>> evicted.
>>>>
>>>> Signed-off-by: Alex Sierra 
>>>> Acked-by: Felix Kuehling 
>>>> Reviewed-by: Alistair Popple 
>>> So, I'm currently messing with PageAnon() pages and CoW semantics ...
>>> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
>>> easier but I'm not sure yet if they make my life harder. I hope you can
>>> help me understand some of that stuff.
>>>
>>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>>
>>> I assume we'll share them just like other PageAnon() pages during fork()
>>> readable, and the first sharer writing to them receives an "ordinary"
>>> !ZONE_DEVICE copy.
>>>
>>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>>> that we don't have to go through the loop of restoring a device
>>> exclusive entry?
>>>
>>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>>
>>> I assume for PageAnon() ZONE_DEVICE pages we'll always for via
>>> free_devmap_managed_page(), correct?
>>>
>>>
>>> 3) FOLL_PIN
>>>
>>> While you write "no one should be allowed to pin such memory", patch #2
>>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>>> you might want to be a bit more precise?
>>>
>>>
>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
>>> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
>>>
>>>
>>> Thanks for any information.
>>>
>> (digging a bit more, I realized that device exclusive pages are not
>> actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
>> will be the actual first PageAnon() ZONE_DEVICE pages that can be
>> present in a page table.)
> 
> I think DEVICE_GENERIC pages can also be mapped in the page table. In 
> fact, the first version of our patches attempted to add migration 
> support to DEVICE_GENERIC. But we were convinced to create a new 
> ZONE_DEVICE page type for our use case instead.

Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
assumption was that they would be part of a special mapping.

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-15 Thread David Hildenbrand
On 11.02.22 17:56, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
>> On 11.02.22 17:45, Jason Gunthorpe wrote:
>>> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>>>
>>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>>>
>>> Currently the only way to get a DEVICE_PRIVATE page out of the page
>>> tables is via hmm_range_fault() and that doesn't manipulate any ref
>>> counts.
>>
>> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
>> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
>> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
>> present in the page tables where GUP could FOLL_PIN them.
> 
> This is my understanding
> 
> Though you probably understand what PageAnon means a lot better than I
> do.. I wonder if it really makes sense to talk about that together
> with ZONE_DEVICE which has a lot in common with filesystem originated
> pages too.

For me, PageAnon() means that modifications are visible only to the
modifying process. On actual CoW, the underlying page will get replaced
-- in the world of DEVICE_COHERENT that would mean that once you write
to a DEVICE_COHERENT you could suddenly have a !DEVICE_COHERENT page.

PageAnon() pages don't have a mapping, thus they can only be found in
MAP_ANON VMAs or in file-backed VMAs mapped with MAP_PRIVATE. They can
only be found via a page table, and not looked up via the page cache
(excluding the swap cache).

So if we have PageAnon() pages on ZONE_DEVICE, they generally have the
exact same semantics as !ZONE_DEVICE pages, but the way they "appear" in
the page tables and the allocation/freeing path differ -- I guess :)

... and as we want pinning semantics to be different we have to touch GUP.

> 
> I'm not sure what AMDs plan is here, is there an expecation that a GPU
> driver will somehow stuff these pages into an existing anonymous
> memory VMA or do they always come from a driver originated VMA?

My understanding is that a driver can just decide to replace "ordinary"
PageAnon() pages e.g., in a MAP_ANON VMA by these pages. Hopefully AMD
can clarify.


-- 
Thanks,

David / dhildenb



Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-15 Thread David Hildenbrand
On 11.02.22 00:41, Alistair Popple wrote:
> On Thursday, 10 February 2022 10:47:35 PM AEDT David Hildenbrand wrote:
>> On 10.02.22 12:39, Alistair Popple wrote:
>>> On Thursday, 10 February 2022 9:53:38 PM AEDT David Hildenbrand wrote:
>>>> On 07.02.22 05:26, Alistair Popple wrote:
>>>>> Currently any attempts to pin a device coherent page will fail. This is
>>>>> because device coherent pages need to be managed by a device driver, and
>>>>> pinning them would prevent a driver from migrating them off the device.
>>>>>
>>>>> However this is no reason to fail pinning of these pages. These are
>>>>> coherent and accessible from the CPU so can be migrated just like
>>>>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
>>>>> them first try migrating them out of ZONE_DEVICE.
>>>>>
>>>>> Signed-off-by: Alistair Popple 
>>>>> Acked-by: Felix Kuehling 
>>>>> ---
>>>>>
>>>>> Changes for v2:
>>>>>
>>>>>  - Added Felix's Acked-by
>>>>>  - Fixed missing check for dpage == NULL
>>>>>
>>>>>  mm/gup.c | 105 ++--
>>>>>  1 file changed, 95 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>> index 56d9577..5e826db 100644
>>>>> --- a/mm/gup.c
>>>>> +++ b/mm/gup.c
>>>>> @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
>>>>>  
>>>>>  #ifdef CONFIG_MIGRATION
>>>>>  /*
>>>>> + * Migrates a device coherent page back to normal memory. Caller should 
>>>>> have a
>>>>> + * reference on page which will be copied to the new page if migration is
>>>>> + * successful or dropped on failure.
>>>>> + */
>>>>> +static struct page *migrate_device_page(struct page *page,
>>>>> + unsigned int gup_flags)
>>>>> +{
>>>>> + struct page *dpage;
>>>>> + struct migrate_vma args;
>>>>> + unsigned long src_pfn, dst_pfn = 0;
>>>>> +
>>>>> + lock_page(page);
>>>>> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
>>>>> + args.src = &src_pfn;
>>>>> + args.dst = &dst_pfn;
>>>>> + args.cpages = 1;
>>>>> + args.npages = 1;
>>>>> + args.vma = NULL;
>>>>> + migrate_vma_setup(&args);
>>>>> + if (!(src_pfn & MIGRATE_PFN_MIGRATE))
>>>>> + return NULL;
>>>>> +
>>>>> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
>>>>> +
>>>>> + /*
>>>>> +  * get/pin the new page now so we don't have to retry gup after
>>>>> +  * migrating. We already have a reference so this should never fail.
>>>>> +  */
>>>>> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
>>>>> + __free_pages(dpage, 0);
>>>>> + dpage = NULL;
>>>>> + }
>>>>> +
>>>>> + if (dpage) {
>>>>> + lock_page(dpage);
>>>>> + dst_pfn = migrate_pfn(page_to_pfn(dpage));
>>>>> + }
>>>>> +
>>>>> + migrate_vma_pages(&args);
>>>>> + if (src_pfn & MIGRATE_PFN_MIGRATE)
>>>>> + copy_highpage(dpage, page);
>>>>> + migrate_vma_finalize(&args);
>>>>> + if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
>>>>> + if (gup_flags & FOLL_PIN)
>>>>> + unpin_user_page(dpage);
>>>>> + else
>>>>> + put_page(dpage);
>>>>> + dpage = NULL;
>>>>> + }
>>>>> +
>>>>> + return dpage;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>>   * Check whether all pages are pinnable, if so return number of pages.  
>>>>> If some
>>>>>   * pages are not pinnable, migrate them, and unpin all pages. Return 
>>>>> zero if
>>>>>   * pages were migrated, or if some pages were not successfully isolated.
>>>>> @@ -1888,15 +1942,40 @@ static long 
>>>

Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-11 Thread David Hildenbrand
On 11.02.22 17:15, David Hildenbrand wrote:
> On 01.02.22 16:48, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra 
>> Acked-by: Felix Kuehling 
>> Reviewed-by: Alistair Popple 
> 
> So, I'm currently messing with PageAnon() pages and CoW semantics ...
> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
> easier but I'm not sure yet if they make my life harder. I hope you can
> help me understand some of that stuff.
> 
> 1) What are expected CoW semantics for DEVICE_COHERENT?
> 
> I assume we'll share them just like other PageAnon() pages during fork()
> as readable, and the first sharer writing to them receives an "ordinary"
> !ZONE_DEVICE copy.
> 
> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
> that we don't have to go through the loop of restoring a device
> exclusive entry?
> 
> 2) How are these pages freed to clear/invalidate PageAnon() ?
> 
> I assume for PageAnon() ZONE_DEVICE pages we'll always free via
> free_devmap_managed_page(), correct?
> 
> 
> 3) FOLL_PIN
> 
> While you write "no one should be allowed to pin such memory", patch #2
> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
> you might want to be a bit more precise?
> 
> 
> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
> 
> 
> Thanks for any information.
> 

(digging a bit more, I realized that device exclusive pages are not
actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
will be the actual first PageAnon() ZONE_DEVICE pages that can be
present in a page table.)

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-11 Thread David Hildenbrand
On 11.02.22 17:45, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
> 
>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
> 
> Currently the only way to get a DEVICE_PRIVATE page out of the page
> tables is via hmm_range_fault() and that doesn't manipulate any ref
> counts.

Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
essentially just pointers at ordinary PageAnon() pages. So with DEVICE
COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
present in the page tables where GUP could FOLL_PIN them.


-- 
Thanks,

David / dhildenb



Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-11 Thread David Hildenbrand
On 01.02.22 16:48, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra 
> Acked-by: Felix Kuehling 
> Reviewed-by: Alistair Popple 

So, I'm currently messing with PageAnon() pages and CoW semantics ...
all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
easier but I'm not sure yet if they make my life harder. I hope you can
help me understand some of that stuff.

1) What are expected CoW semantics for DEVICE_COHERENT?

I assume we'll share them just like other PageAnon() pages during fork()
as readable, and the first sharer writing to them receives an "ordinary"
!ZONE_DEVICE copy.

So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
that we don't have to go through the loop of restoring a device
exclusive entry?

2) How are these pages freed to clear/invalidate PageAnon() ?

I assume for PageAnon() ZONE_DEVICE pages we'll always free via
free_devmap_managed_page(), correct?


3) FOLL_PIN

While you write "no one should be allowed to pin such memory", patch #2
only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
you might want to be a bit more precise?


... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?


Thanks for any information.

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-10 Thread David Hildenbrand
On 07.02.22 05:26, Alistair Popple wrote:
> Currently any attempts to pin a device coherent page will fail. This is
> because device coherent pages need to be managed by a device driver, and
> pinning them would prevent a driver from migrating them off the device.
> 
> However this is no reason to fail pinning of these pages. These are
> coherent and accessible from the CPU so can be migrated just like
> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> them first try migrating them out of ZONE_DEVICE.
> 
> Signed-off-by: Alistair Popple 
> Acked-by: Felix Kuehling 
> ---
> 
> Changes for v2:
> 
>  - Added Felix's Acked-by
>  - Fixed missing check for dpage == NULL
> 
>  mm/gup.c | 105 ++--
>  1 file changed, 95 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 56d9577..5e826db 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
>  
>  #ifdef CONFIG_MIGRATION
>  /*
> + * Migrates a device coherent page back to normal memory. Caller should have 
> a
> + * reference on page which will be copied to the new page if migration is
> + * successful or dropped on failure.
> + */
> +static struct page *migrate_device_page(struct page *page,
> + unsigned int gup_flags)
> +{
> + struct page *dpage;
> + struct migrate_vma args;
> + unsigned long src_pfn, dst_pfn = 0;
> +
> + lock_page(page);
> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> + args.src = &src_pfn;
> + args.dst = &dst_pfn;
> + args.cpages = 1;
> + args.npages = 1;
> + args.vma = NULL;
> + migrate_vma_setup(&args);
> + if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> + return NULL;
> +
> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> +
> + /*
> +  * get/pin the new page now so we don't have to retry gup after
> +  * migrating. We already have a reference so this should never fail.
> +  */
> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> + __free_pages(dpage, 0);
> + dpage = NULL;
> + }
> +
> + if (dpage) {
> + lock_page(dpage);
> + dst_pfn = migrate_pfn(page_to_pfn(dpage));
> + }
> +
> + migrate_vma_pages(&args);
> + if (src_pfn & MIGRATE_PFN_MIGRATE)
> + copy_highpage(dpage, page);
> + migrate_vma_finalize(&args);
> + if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> + if (gup_flags & FOLL_PIN)
> + unpin_user_page(dpage);
> + else
> + put_page(dpage);
> + dpage = NULL;
> + }
> +
> + return dpage;
> +}
> +
> +/*
>   * Check whether all pages are pinnable, if so return number of pages.  If 
> some
>   * pages are not pinnable, migrate them, and unpin all pages. Return zero if
>   * pages were migrated, or if some pages were not successfully isolated.
> @@ -1888,15 +1942,40 @@ static long check_and_migrate_movable_pages(unsigned 
> long nr_pages,
>   continue;
>   prev_head = head;
>   /*
> -  * If we get a movable page, since we are going to be pinning
> -  * these entries, try to move them out if possible.
> +  * Device coherent pages are managed by a driver and should not
> +  * be pinned indefinitely as it prevents the driver moving the
> +  * page. So when trying to pin with FOLL_LONGTERM instead try
> +  * migrating page out of device memory.
>*/
>   if (is_dev_private_or_coherent_page(head)) {
> + /*
> +  * device private pages will get faulted in during gup
> +  * so it shouldn't be possible to see one here.
> +  */
>   WARN_ON_ONCE(is_device_private_page(head));
> - ret = -EFAULT;
> - goto unpin_pages;
> + WARN_ON_ONCE(PageCompound(head));
> +
> + /*
> +  * migration will fail if the page is pinned, so convert
> +  * the pin on the source page to a normal reference.
> +  */
> + if (gup_flags & FOLL_PIN) {
> + get_page(head);
> + unpin_user_page(head);
> + }
> +
> + pages[i] = migrate_device_page(head, gup_flags);



For ordinary migrate_pages(), we'll unpin all pages and return 0 so the
caller will retry pinning by walking the page tables again.

Why can't we apply the same mechanism here? This "let's avoid another
walk" looks unnecessarily complicated to me, but I might be wrong.

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-10 Thread David Hildenbrand
On 10.02.22 12:39, Alistair Popple wrote:
> On Thursday, 10 February 2022 9:53:38 PM AEDT David Hildenbrand wrote:
>> On 07.02.22 05:26, Alistair Popple wrote:
>>> Currently any attempts to pin a device coherent page will fail. This is
>>> because device coherent pages need to be managed by a device driver, and
>>> pinning them would prevent a driver from migrating them off the device.
>>>
>>> However this is no reason to fail pinning of these pages. These are
>>> coherent and accessible from the CPU so can be migrated just like
>>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
>>> them first try migrating them out of ZONE_DEVICE.
>>>
>>> Signed-off-by: Alistair Popple 
>>> Acked-by: Felix Kuehling 
>>> ---
>>>
>>> Changes for v2:
>>>
>>>  - Added Felix's Acked-by
>>>  - Fixed missing check for dpage == NULL
>>>
>>>  mm/gup.c | 105 ++--
>>>  1 file changed, 95 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 56d9577..5e826db 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr)
>>>  
>>>  #ifdef CONFIG_MIGRATION
>>>  /*
>>> + * Migrates a device coherent page back to normal memory. Caller should 
>>> have a
>>> + * reference on page which will be copied to the new page if migration is
>>> + * successful or dropped on failure.
>>> + */
>>> +static struct page *migrate_device_page(struct page *page,
>>> +   unsigned int gup_flags)
>>> +{
>>> +   struct page *dpage;
>>> +   struct migrate_vma args;
>>> +   unsigned long src_pfn, dst_pfn = 0;
>>> +
>>> +   lock_page(page);
>>> +   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
>>> +   args.src = &src_pfn;
>>> +   args.dst = &dst_pfn;
>>> +   args.cpages = 1;
>>> +   args.npages = 1;
>>> +   args.vma = NULL;
>>> +   migrate_vma_setup(&args);
>>> +   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
>>> +   return NULL;
>>> +
>>> +   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
>>> +
>>> +   /*
>>> +* get/pin the new page now so we don't have to retry gup after
>>> +* migrating. We already have a reference so this should never fail.
>>> +*/
>>> +   if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
>>> +   __free_pages(dpage, 0);
>>> +   dpage = NULL;
>>> +   }
>>> +
>>> +   if (dpage) {
>>> +   lock_page(dpage);
>>> +   dst_pfn = migrate_pfn(page_to_pfn(dpage));
>>> +   }
>>> +
>>> +   migrate_vma_pages(&args);
>>> +   if (src_pfn & MIGRATE_PFN_MIGRATE)
>>> +   copy_highpage(dpage, page);
>>> +   migrate_vma_finalize(&args);
>>> +   if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
>>> +   if (gup_flags & FOLL_PIN)
>>> +   unpin_user_page(dpage);
>>> +   else
>>> +   put_page(dpage);
>>> +   dpage = NULL;
>>> +   }
>>> +
>>> +   return dpage;
>>> +}
>>> +
>>> +/*
>>>   * Check whether all pages are pinnable, if so return number of pages.  If 
>>> some
>>>   * pages are not pinnable, migrate them, and unpin all pages. Return zero 
>>> if
>>>   * pages were migrated, or if some pages were not successfully isolated.
>>> @@ -1888,15 +1942,40 @@ static long 
>>> check_and_migrate_movable_pages(unsigned long nr_pages,
>>> continue;
>>> prev_head = head;
>>> /*
>>> -* If we get a movable page, since we are going to be pinning
>>> -* these entries, try to move them out if possible.
>>> +* Device coherent pages are managed by a driver and should not
>>> +* be pinned indefinitely as it prevents the driver moving the
>>> +* page. So when trying to pin with FOLL_LONGTERM instead try
>>> +* migrating page out of device memory.
>>>  */
>>> if (is_dev_private_or_coherent_page(head)) {
>>> +   /*
>

Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-12 Thread David Hildenbrand
On 10.01.22 23:31, Alex Sierra wrote:
> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like
> MEMORY_DEVICE_PRIVATE.
> 
> Christoph, the suggestion to incorporate Ralph Campbell’s refcount
> cleanup patch into our hardware page migration patchset originally came
> from you, but it proved impractical to do things in that order because
> the refcount cleanup introduced a bug with wide ranging structural
> implications. Instead, we amended Ralph’s patch so that it could be
> applied after merging the migration work. As we saw from the recent
> discussion, merging the refcount work is going to take some time and
> cooperation between multiple development groups, while the migration
> work is ready now and is needed now. So we propose to merge this
> patchset first and continue to work with Ralph and others to merge the
> refcount cleanup separately, when it is ready.
> 
> This patch series is mostly self-contained except for a few places where
> it needs to update other subsystems to handle the new memory type.
> System stability and performance are not affected according to our
> ongoing testing, including xfstests.
> 
> How it works: The system BIOS advertises the GPU device memory
> (aka VRAM) as SPM (special purpose memory) in the UEFI system address
> map.
> 
> The amdgpu driver registers the memory with devmap as
> MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
> this hardware page migration capability is the Frontier supercomputer
> project. This functionality is not AMD-specific. We expect other GPU
> vendors to find this functionality useful, and possibly other hardware
> types in the future.
> 
> Our test nodes in the lab are similar to the Frontier configuration,
> with .5 TB of system memory plus 256 GB of device memory split across
> 4 GPUs, all in a single coherent address space. Page migration is
> expected to improve application efficiency significantly. We will
> report empirical results as they become available.

Hi,

might be a dumb question because I'm not too familiar with
MEMORY_DEVICE_COHERENT, but who's in charge of migrating *to* that
memory? Or how does a process ever get a grab on such pages?

And where does migration come into play? I assume migration is only
required to migrate off of that device memory to ordinary system RAM
when required because the device memory has to be freed up, correct?

(a high level description on how this is exploited from users space
would be great)

-- 
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-16 Thread David Hildenbrand

On 16.03.21 09:43, Liang, Liang (Leo) wrote:

[AMD Public Use]

Hi David,

Thanks for your explanation. We saw the slow boot issue on our farm/QA's 
machines and mine. All of the machines are the same SoC/board.


I cannot spot anything really special in the logs -- it's just ordinary 
system RAM -- except:


[0.27] MTRR fixed ranges enabled:
[0.28]   0-9 write-back
[0.29]   A-B uncachable
[0.30]   C-F write-through
[0.31] MTRR variable ranges enabled:
[0.32]   0 base  mask 8000 write-back
[0.34]   1 base FFE0 mask FFE0 write-protect
[0.35]   2 base 0001 mask FF00 write-protect
[0.36]   3 base FFDE mask FFFE write-protect
[0.38]   4 base FF00 mask FFF8 write-protect
[0.39]   5 disabled
[0.39]   6 disabled
[0.40]   7 disabled

Not sure if "2 base 0001" indicates something nasty. Not sure 
how to interpret the masks.


Can you provide the output of "cat /proc/mtrr" ?

--
Thanks,

David / dhildenb

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-16 Thread David Hildenbrand

On 16.03.21 09:58, Liang, Liang (Leo) wrote:

Hi David,

root@scbu-Chachani:~# cat /proc/mtrr
reg00: base=0x0 (0MB), size= 2048MB, count=1: write-back
reg01: base=0x0ffe0 ( 4094MB), size=2MB, count=1: write-protect
reg02: base=0x1 ( 4096MB), size=   16MB, count=1: write-protect


^ there it is

https://wiki.osdev.org/MTRR

"Reads allocate cache lines on a cache miss. All writes update main memory.

Cache lines are not allocated on a write miss. Write hits invalidate the 
cache line and update main memory. "


AFAIU, writes completely bypass caches and store directly to main 
memory. If there are cache lines from a previous read, they are 
invalidated. So I think the especially slow pattern will be read(addr), 
write(addr), read(addr), ... which is what we have in the kstream benchmark.



The question is:

- Who sets this up without owning the memory?
- Is the memory actually special/slow, or is that setting wrong?
- Buggy firmware/BIOS?
- Buggy device driver?



reg03: base=0x0ffde ( 4093MB), size=  128KB, count=1: write-protect
reg04: base=0x0ff00 ( 4080MB), size=  512KB, count=1: write-protect



--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-16 Thread David Hildenbrand

On 16.03.21 12:02, Liang, Liang (Leo) wrote:

Hi David and Mike,

It's BIOS buggy. Now fixed by new BIOS. Thanks you so much! Cheers!

[0.34] MTRR variable ranges enabled:
[0.35]   0 base  mask 8000 write-back
[0.37]   1 base FFE0 mask FFE0 write-protect
[0.39]   2 base FFDE mask FFFE write-protect
[0.40]   3 base FF00 mask FFF8 write-protect
[0.41]   4 disabled
[0.42]   5 disabled
[0.43]   6 disabled
[0.44]   7 disabled
[0.45] TOM2: 00028000 aka 10240M

root@scbu-Chachani:/home/scbu# cat /proc/mtrr
reg00: base=0x0 (0MB), size= 2048MB, count=1: write-back
reg01: base=0x0ffe0 ( 4094MB), size=2MB, count=1: write-protect
reg02: base=0x0ffde ( 4093MB), size=  128KB, count=1: write-protect
reg03: base=0x0ff00 ( 4080MB), size=  512KB, count=1: write-protect


Great :)

(another latent BUG found with 7fef431be9c9 :) )

--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-16 Thread David Hildenbrand

On 16.03.21 09:00, Liang, Liang (Leo) wrote:

Hi Mike,

Thanks for the help. The patch works for me and boot time is back to normal. 
So is it a fix, or just a workaround?


Hi Leo,

excluding up to 16 MiB of memory on every system just because that 
single platform is weird is not acceptable.


I think we have to figure out

a) why that memory is so special. This is weird.
b) why the platform doesn't indicate it in a special way. Why is it 
ordinary system RAM but still *that* slow?

c) how we can reliably identify such memory and exclude it.

I'll have a peek at the memory layout of that machine from boot logs 
next to figure out if we can answer any of these questions.


Just to verify: this does happen on multiple machines, not just a single 
one? (i.e., we're not dealing with faulty RAM)


--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-15 Thread David Hildenbrand

On 13.03.21 14:48, Mike Rapoport wrote:

Hi,

On Sat, Mar 13, 2021 at 10:05:23AM +0100, David Hildenbrand wrote:

On 13.03.2021 at 05:04, Liang, Liang (Leo) wrote:

Hi David,

Which benchmark tool you prefer? Memtest86+ or else?


Hi Leo,

I think you want something that runs under Linux natively.

I'm planning on coding up a kernel module to walk all 4MB pages in the
freelists and perform a stream benchmark individually. Then we might be
able to identify the problematic range - if there is a problematic range :)


My wild guess would be that the pages that are now at the head of free
lists have wrong caching enabled. Might be worth checking in your test
module.


I hacked something up real quick:

https://github.com/davidhildenbrand/kstream

Only briefly tested inside a VM. The output looks something like

[...]
[ 8396.432225] [0x4580 - 0x45bf] 25322 MB/s / 
38948 MB/s
[ 8396.448749] [0x45c0 - 0x45ff] 24481 MB/s / 
38946 MB/s
[ 8396.465197] [0x4600 - 0x463f] 24892 MB/s / 
39170 MB/s
[ 8396.481552] [0x4640 - 0x467f] 25222 MB/s / 
39156 MB/s
[ 8396.498012] [0x4680 - 0x46bf] 24416 MB/s / 
39159 MB/s
[ 8396.514397] [0x46c0 - 0x46ff] 25469 MB/s / 
38940 MB/s
[ 8396.530849] [0x4700 - 0x473f] 24885 MB/s / 
38734 MB/s
[ 8396.547195] [0x4740 - 0x477f] 25458 MB/s / 
38941 MB/s

[...]

The benchmark allocates one 4 MiB chunk at a time and runs a simplified 
STREAM benchmark a) without flushing caches and b) with caches flushed 
before every memory access.


It would be great if you could run that with the *old behavior* kernel 
(IOW, without 7fef431be9c9), so we might still be lucky to catch the 
problematic area in the freelist.


Let's see if that will indicate anything.

--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-13 Thread David Hildenbrand

> On 13.03.2021 at 05:04, Liang, Liang (Leo) wrote:
> 
> Hi David,
> 
> Which benchmark tool you prefer? Memtest86+ or else?

Hi Leo,

I think you want something that runs under Linux natively.

I'm planning on coding up a kernel module to walk all 4MB pages in the 
freelists and perform a stream benchmark individually. Then we might be able to 
identify the problematic range - if there is a problematic range :) Guess I'll 
have it running by Monday and let you know.

Cheers!

> 
> BRs,
> Leo
> -----Original Message-----
> From: David Hildenbrand  
> Sent: Saturday, March 13, 2021 12:47 AM
> To: Liang, Liang (Leo) ; Deucher, Alexander 
> ; linux-ker...@vger.kernel.org; amd-gfx list 
> ; Andrew Morton 
> Cc: Huang, Ray ; Koenig, Christian 
> ; Mike Rapoport ; Rafael J. 
> Wysocki ; George Kennedy 
> Subject: Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail 
> in __free_pages_core()")
> 
>> On 12.03.21 17:19, Liang, Liang (Leo) wrote:
>> 
>> Dmesg attached.
>> 
> 
> 
> So, looks like the "real" slowdown starts once the buddy is up and running 
> (no surprise).
> 
> 
> [0.044035] Memory: 6856724K/7200304K available (14345K kernel code, 9699K 
> rwdata, 5276K rodata, 2628K init, 12104K bss, 343324K reserved, 0K 
> cma-reserved)
> [0.044045] random: get_random_u64 called from 
> __kmem_cache_create+0x33/0x460 with crng_init=1
> [0.049025] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
> [0.050036] ftrace: allocating 47158 entries in 185 pages
> [0.097487] ftrace: allocated 185 pages with 5 groups
> [0.109210] rcu: Hierarchical RCU implementation.
> 
> vs.
> 
> [0.041115] Memory: 6869396K/7200304K available (14345K kernel code, 3433K 
> rwdata, 5284K rodata, 2624K init, 6088K bss, 330652K reserved, 0K 
> cma-reserved)
> [0.041127] random: get_random_u64 called from 
> __kmem_cache_create+0x31/0x430 with crng_init=1
> [0.041309] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
> [0.041335] ftrace: allocating 47184 entries in 185 pages
> [0.055719] ftrace: allocated 185 pages with 5 groups
> [0.055863] rcu: Hierarchical RCU implementation.
> 
> 
> And it gets especially bad during ACPI table processing:
> 
> [4.158303] ACPI: Added _OSI(Module Device)
> [4.158767] ACPI: Added _OSI(Processor Device)
> [4.159230] ACPI: Added _OSI(3.0 _SCP Extensions)
> [4.159705] ACPI: Added _OSI(Processor Aggregator Device)
> [4.160551] ACPI: Added _OSI(Linux-Dell-Video)
> [4.161359] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
> [4.162264] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
> [   17.713421] ACPI: 13 ACPI AML tables successfully acquired and loaded
> [   18.716065] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> [   20.743828] ACPI: EC: EC started
> [   20.744155] ACPI: EC: interrupt blocked
> [   20.945956] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662
> [   20.946618] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle 
> transactions
> [   20.947348] ACPI: Interpreter enabled
> [   20.951278] ACPI: (supports S0 S3 S4 S5)
> [   20.951632] ACPI: Using IOAPIC for interrupt routing
> 
> vs.
> 
> [0.216039] ACPI: Added _OSI(Module Device)
> [0.216041] ACPI: Added _OSI(Processor Device)
> [0.216043] ACPI: Added _OSI(3.0 _SCP Extensions)
> [0.216044] ACPI: Added _OSI(Processor Aggregator Device)
> [0.216046] ACPI: Added _OSI(Linux-Dell-Video)
> [0.216048] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
> [0.216049] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
> [0.228259] ACPI: 13 ACPI AML tables successfully acquired and loaded
> [0.229527] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> [0.231663] ACPI: EC: EC started
> [0.231666] ACPI: EC: interrupt blocked
> [0.233664] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662
> [0.233667] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle 
> transactions
> [0.233670] ACPI: Interpreter enabled
> [0.233685] ACPI: (supports S0 S3 S4 S5)
> [0.233687] ACPI: Using IOAPIC for interrupt routing
> 
> The jump from 4.1 -> 17.7 is especially bad.
> 
> Which might in fact indicate that this could be related to using some very 
> special slow (ACPI?) memory for ordinary purposes, interfering with actual 
> ACPI users?
> 
> But again, just a wild guess, because the system is extremely slow 
> afterwards, however, we don't have any pauses without any signs of life for 
> that long.
> 
> 
> It would be interesting to run a simple memory bandwidth benchmark on the 
> fast kernel with differing sizes up to running OOM to see if there is really 
> some memory that is just horribly slow once allocated and used.
> 
> --
> Thanks,
> 
> David / dhildenb
> 



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-12 Thread David Hildenbrand

On 12.03.21 17:19, Liang, Liang (Leo) wrote:

Dmesg attached.




So, looks like the "real" slowdown starts once the buddy is up and running (no 
surprise).


[0.044035] Memory: 6856724K/7200304K available (14345K kernel code, 9699K 
rwdata, 5276K rodata, 2628K init, 12104K bss, 343324K reserved, 0K cma-reserved)
[0.044045] random: get_random_u64 called from 
__kmem_cache_create+0x33/0x460 with crng_init=1
[0.049025] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[0.050036] ftrace: allocating 47158 entries in 185 pages
[0.097487] ftrace: allocated 185 pages with 5 groups
[0.109210] rcu: Hierarchical RCU implementation.

vs.

[0.041115] Memory: 6869396K/7200304K available (14345K kernel code, 3433K 
rwdata, 5284K rodata, 2624K init, 6088K bss, 330652K reserved, 0K cma-reserved)
[0.041127] random: get_random_u64 called from 
__kmem_cache_create+0x31/0x430 with crng_init=1
[0.041309] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[0.041335] ftrace: allocating 47184 entries in 185 pages
[0.055719] ftrace: allocated 185 pages with 5 groups
[0.055863] rcu: Hierarchical RCU implementation.


And it gets especially bad during ACPI table processing:

[4.158303] ACPI: Added _OSI(Module Device)
[4.158767] ACPI: Added _OSI(Processor Device)
[4.159230] ACPI: Added _OSI(3.0 _SCP Extensions)
[4.159705] ACPI: Added _OSI(Processor Aggregator Device)
[4.160551] ACPI: Added _OSI(Linux-Dell-Video)
[4.161359] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[4.162264] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[   17.713421] ACPI: 13 ACPI AML tables successfully acquired and loaded
[   18.716065] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[   20.743828] ACPI: EC: EC started
[   20.744155] ACPI: EC: interrupt blocked
[   20.945956] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662
[   20.946618] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle 
transactions
[   20.947348] ACPI: Interpreter enabled
[   20.951278] ACPI: (supports S0 S3 S4 S5)
[   20.951632] ACPI: Using IOAPIC for interrupt routing

vs.

[0.216039] ACPI: Added _OSI(Module Device)
[0.216041] ACPI: Added _OSI(Processor Device)
[0.216043] ACPI: Added _OSI(3.0 _SCP Extensions)
[0.216044] ACPI: Added _OSI(Processor Aggregator Device)
[0.216046] ACPI: Added _OSI(Linux-Dell-Video)
[0.216048] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[0.216049] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[0.228259] ACPI: 13 ACPI AML tables successfully acquired and loaded
[0.229527] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[0.231663] ACPI: EC: EC started
[0.231666] ACPI: EC: interrupt blocked
[0.233664] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662
[0.233667] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle 
transactions
[0.233670] ACPI: Interpreter enabled
[0.233685] ACPI: (supports S0 S3 S4 S5)
[0.233687] ACPI: Using IOAPIC for interrupt routing

The jump from 4.1 -> 17.7 is especially bad.
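To localize stalls like this without eyeballing the log, one could diff consecutive dmesg timestamps and print the largest gaps. This is an editor's sketch, not something from the thread:

```python
import re

def dmesg_gaps(lines, top=5):
    """Return the `top` largest gaps between consecutive timestamped
    dmesg lines, as (delta_seconds, line_after_gap) tuples."""
    stamped = []
    for line in lines:
        m = re.match(r"\[\s*(\d+\.\d+)\]", line)
        if m:
            stamped.append((float(m.group(1)), line.strip()))
    # Pair each line with its successor and measure the time between them.
    gaps = [(later[0] - earlier[0], later[1])
            for earlier, later in zip(stamped, stamped[1:])]
    return sorted(gaps, reverse=True)[:top]
```

Feeding it the dmesg output from the slow boot should surface the ~13.5 s stall before the AML table load right at the top.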

Which might in fact indicate that this could be related to using
some very special slow (ACPI?) memory for ordinary purposes,
interfering with actual ACPI users?

But again, this is just a wild guess: the system is extremely slow
afterwards, yet we don't see any pauses without signs of life for
that long.


It would be interesting to run a simple memory bandwidth benchmark
on the fast kernel with differing sizes up to running OOM to see if
there is really some memory that is just horribly slow once allocated and
used.
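A crude version of that experiment might look like the following (an editor's sketch; a pure-Python touch loop measures allocation-plus-first-use cost rather than raw bandwidth, but a horribly slow physical region should still show up as a rate drop):

```python
import time

def touch_rate_mib_s(size_mib, page=4096):
    """Allocate a size_mib MiB buffer, write one byte per 4 KiB page,
    and return MiB touched per second. Growing size_mib across runs
    pulls in fresh pages from the allocator."""
    n = size_mib * 1024 * 1024
    buf = bytearray(n)                 # allocation (pages may already be zero-filled here)
    start = time.perf_counter()
    for off in range(0, n, page):      # dirty every page exactly once
        buf[off] = 0xA5
    return size_mib / (time.perf_counter() - start)
```

Running it on the fast kernel with sizes stepping up toward OOM, as suggested above, would show whether some allocations become pathologically slow once a certain amount of memory is in use.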

--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-12 Thread David Hildenbrand

8G (with some carve out for the integrated GPU).
[0.044181] Memory: 6858688K/7200304K available (14345K kernel code, 9659K 
rwdata, 4980K rodata, 2484K init, 12292K bss, 341360K reserved, 0K cma-reserved)

Nothing particularly special about these systems that I am aware of.  I'll see 
if we can repro this issue on any other platforms, but so far, no one else has 
noticed any problems.



Increasing the boot time from a few seconds to 2-3 minutes does not smell
like some corner case cache effects we might be hitting in this particular
instance - there have been minor reports that it either slightly increases or
slightly decreases initial system performance, but that was about it.

Either it is yet another latent BUG (but why? why should memory access
suddenly be that slow? I can only guess that we are now making use of very
slow memory sooner), or there is really something else weird going on.


Looks like pretty much everything is slower based on the timestamps in the 
dmesg output.  There is a big jump here:


If we're really dealing with some specific slow memory regions and that 
memory gets allocated for something that gets used regularly, then we 
might get a general slowdown. Hard to identify, though :)
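One way to at least enumerate the candidate regions (again an editor's sketch, not from the thread) is to total up the firmware memory map by type. The same parsing works on `BIOS-e820:` lines from dmesg, such as the map quoted elsewhere in this thread, though the archive has truncated some of those addresses:

```python
import re
from collections import defaultdict

def e820_summary(lines):
    """Parse 'BIOS-e820: [mem 0xLO-0xHI] TYPE' lines and return a dict
    mapping each region type to its total size in bytes."""
    pat = re.compile(r"BIOS-e820: \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\] (.+)$")
    totals = defaultdict(int)
    for line in lines:
        m = pat.search(line)
        if m:
            lo, hi = int(m.group(1), 16), int(m.group(2), 16)
            totals[m.group(3).strip()] += hi - lo + 1  # ends are inclusive
    return dict(totals)
```

Comparing the per-type totals between a fast and a slow boot would at least show whether the kernel sees the same amount of "usable" memory in both cases.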





[3.758596] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[3.759372] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[   16.177983] ACPI: 13 ACPI AML tables successfully acquired and loaded
[   17.099316] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[   18.969959] ACPI: EC: EC started


And here:


[   36.566608] PCI: CLS 64 bytes, default 64
[   36.575383] Trying to unpack rootfs image as initramfs...
[   44.594348] Initramfs unpacking failed: Decoding failed
[   44.765141] Freeing initrd memory: 46348K


Also seeing soft lockups:

[  124.588634] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]


Yes, I noticed that -- there is a heavy slowdown somewhere.

As that patch has been in v5.10 already (and we're close to v5.12), I assume 
something is particularly weird about the platform you are running on - 
because this is the first time I have seen a report like that.


--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-12 Thread David Hildenbrand

On 12.03.21 15:06, Deucher, Alexander wrote:

[AMD Public Use]


-Original Message-
From: David Hildenbrand 
Sent: Thursday, March 11, 2021 10:03 AM
To: Deucher, Alexander ; linux-
ker...@vger.kernel.org; amd-gfx list ;
Andrew Morton 
Cc: Huang, Ray ; Koenig, Christian
; Liang, Liang (Leo) ;
Mike Rapoport ; Rafael J. Wysocki
; George Kennedy 
Subject: Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to
tail in __free_pages_core()")

On 11.03.21 15:41, Deucher, Alexander wrote:

[AMD Public Use]

Booting kernels on certain AMD platforms takes 2-3 minutes with the patch

in the subject.  Reverting it restores quick boot times (few seconds).  Any
ideas?




Hi,

We just discovered latent BUGs in ACPI code whereby ACPI tables are
exposed to the page allocator as ordinary, free system RAM. With the
patch you mention, the order in which pages get allocated from the page
allocator is changed - which makes the BUG trigger more easily.

I could imagine that someone allocates and uses that memory on your
platform, and I could imagine that such accesses are very slow.

I cannot tell if that is the root cause, but at least it would make sense.
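The failure mode being described, a firmware range handed out as free RAM, boils down to an interval-overlap check. A minimal illustration (function name and ranges are the editor's, purely hypothetical, not kernel code):

```python
def overlapping(reserved, allocated):
    """Return (reserved_range, allocated_range) pairs that intersect.
    Ranges are (start, end) tuples with inclusive ends."""
    return [(r, a)
            for r in reserved
            for a in allocated
            if a[0] <= r[1] and r[0] <= a[1]]  # standard interval-intersection test
```

If a page-allocator range ever overlapped an ACPI NVS/data range like this, ordinary allocations would land on firmware-owned (and possibly very slow) memory.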

See https://lore.kernel.org/patchwork/patch/1389314/

You might want to give that patch a try (not sure if it's the latest
version). CCing George


Thanks for the patch.  Unfortunately it didn't help.  Any other ideas?  Is 
there a newer version of that patch?



@George?

It's interesting that this only applies to these special AMD systems so  
far. Is there anything particular about these systems? How much memory  
do these systems have?


Increasing the boot time from a few seconds to 2-3 minutes does not  
smell like some corner case cache effects we might be hitting in this  
particular instance - there have been minor reports that it either  
slightly increases or slightly decreases initial system performance, but  
that was about it.


Either it is yet another latent BUG (but why? why should memory access  
suddenly be that slow? I can only guess that we are now making use of very  
slow memory sooner), or there is really something else weird going on.


Cheers!


Alex



--
Thanks,

David / dhildenb



Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

2021-03-11 Thread David Hildenbrand

On 11.03.21 15:41, Deucher, Alexander wrote:

[AMD Public Use]

Booting kernels on certain AMD platforms takes 2-3 minutes with the patch in 
the subject.  Reverting it restores quick boot times (few seconds).  Any ideas?



Hi,

We just discovered latent BUGs in ACPI code whereby ACPI tables are  
exposed to the page allocator as ordinary, free system RAM. With the  
patch you mention, the order in which pages get allocated from the page  
allocator is changed - which makes the BUG trigger more easily.


I could imagine that someone allocates and uses that memory on your  
platform, and I could imagine that such accesses are very slow.


I cannot tell if that is the root cause, but at least it would make sense.

See https://lore.kernel.org/patchwork/patch/1389314/

You might want to give that patch a try (not sure if it's the latest  
version). CCing George


Thanks


Thanks,

Alex

[0.00] Linux version 5.11.0-7490c004ae7e (jenkins@24dbd4b4380b) (gcc 
(Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, GNU ld (GNU Binutils for Ubuntu) 2.30) 
#20210308 SMP Sun Mar 7 20:04:05 UTC 2021
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-7490c004ae7e 
root=UUID=459758f3-5106-4173-b9bc-cf9d528828ec ro 
resume=UUID=23390f67-bbaf-42c1-b31d-64ef7288e39e amd_iommu=off nokaslr
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Hygon HygonGenuine
[0.00]   Centaur CentaurHauls
[0.00]   zhaoxin   Shanghai
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, 
using 'compacted' format.
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009efff] usable
[0.00] BIOS-e820: [mem 0x0009f000-0x000b] reserved
[0.00] BIOS-e820: [mem 0x0010-0x09af] usable
[0.00] BIOS-e820: [mem 0x09b0-0x09df] reserved
[0.00] BIOS-e820: [mem 0x09e0-0x09ef] usable
[0.00] BIOS-e820: [mem 0x09f0-0x09f10fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x09f11000-0x6c56efff] usable
[0.00] BIOS-e820: [mem 0x6c56f000-0x6c56] reserved
[0.00] BIOS-e820: [mem 0x6c57-0x7877efff] usable
[0.00] BIOS-e820: [mem 0x7877f000-0x7af7efff] reserved
[0.00] BIOS-e820: [mem 0x7af7f000-0x7cf7efff] ACPI NVS
[0.00] BIOS-e820: [mem 0x7cf7f000-0x7cffefff] ACPI data
[0.00] BIOS-e820: [mem 0x7cfff000-0x7cff] usable
[0.00] BIOS-e820: [mem 0x7d00-0x7dff] reserved
[0.00] BIOS-e820: [mem 0x7f00-0x7fff] reserved
[0.00] BIOS-e820: [mem 0xa000-0xa00f] reserved
[0.00] BIOS-e820: [mem 0xf000-0xf7ff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec01fff] reserved
[0.00] BIOS-e820: [mem 0xfec1-0xfec10fff] reserved
[0.00] BIOS-e820: [mem 0xfec2-0xfec20fff] reserved
[0.00] BIOS-e820: [mem 0xfed8-0xfed81fff] reserved
[0.00] BIOS-e820: [mem 0xfedc-0xfedd] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff08-0xffdd] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00023f37] usable
[0.00] BIOS-e820: [mem 0x00023f38-0x00027fff] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] e820: update [mem 0x6a275018-0x6a283857] usable ==> usable
[0.00] e820: update [mem 0x6a275018-0x6a283857] usable ==> usable
[0.00] e820: update [mem 0x6c572018-0x6c57c657] usable ==> usable
[0.00] e820: update [mem 0x6c572018-0x6c57c657] usable ==> usable
[0.00] extended physical RAM map:
[0.00] reserve setup_data: [mem 0x-0x0009efff] 
usable
[0.00] reserve setup_data: [mem 0x0009f000-0x000b] 
reserved
[0.00] reserve setup_data: [mem 0x0010-0x09af] 
usable
[0.00] reserve setup_data: [mem 0x09b0-0x09df] 
reserved
[0.00] reserve setup_data: [mem 0x09e0-0x09ef] 
usable
[0.00] reserve setup_data: [mem 0x09f0-0x09f10fff] 
ACPI NVS
[0.00] reserve setup_data: [mem 0x09f11000-0x6