Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
On 02.12.23 15:50, Pedro Falcato wrote:
> On Fri, Dec 1, 2023 at 9:23 AM David Hildenbrand wrote:
>> On 28.11.23 13:50, Weixi Zhu wrote:
>>> This patch adds an abstraction layer, struct vm_object, that maintains
>>> per-process virtual-to-physical mapping status stored in struct
>>> gm_mapping. For example, a virtual page may be mapped to a CPU physical
>>> page or to a device physical page. Struct vm_object effectively
>>> maintains an arch-independent page table, which is defined as a
>>> "logical page table", while the arch-dependent page table used by a
>>> real MMU is named a "physical page table". The logical page table is
>>> useful if Linux core MM is extended to handle a unified virtual address
>>> space with external accelerators using customized MMUs.
>>
>> Which raises the question why we are dealing with anonymous memory at
>> all? Why not go for shmem if you are already only special-casing VMAs
>> with a MMAP flag right now?
>>
>> That would maybe avoid having to introduce controversial BSD design
>> concepts into Linux, that feel like going a step backwards in time to
>> me and adding *more* MM complexity.
>>
>>> In this patch, struct vm_object utilizes a radix tree (xarray) to track
>>> where a virtual page is mapped to. This adds extra memory consumption
>>> from xarray, but provides a nice abstraction to isolate mapping status
>>> from the machine-dependent layer (PTEs). Besides supporting
>>> accelerators with external MMUs, struct vm_object is planned to
>>> further union with i_pages in struct address_mapping for file-backed
>>> memory.
>>
>> A file already has a tree structure (pagecache) to manage the pages
>> that are theoretically mapped. It's easy to translate from a VMA to a
>> page inside that tree structure that is currently not present in page
>> tables.
>>
>> Why the need for that tree structure if you can just remove anon memory
>> from the picture?
>>
>>> The idea of struct vm_object is originated from FreeBSD VM design,
>>> which provides a unified abstraction for anonymous memory, file-backed
>>> memory, page cache and etc[1].
>>
>> :/
>>
>>> Currently, Linux utilizes a set of hierarchical page walk functions to
>>> abstract page table manipulations of different CPU architectures. The
>>> problem happens when a device wants to reuse Linux MM code to manage
>>> its page table -- the device page table may not be accessible to the
>>> CPU. Existing solutions like Linux HMM utilize the MMU notifier
>>> mechanisms to invoke device-specific MMU functions, but rely on
>>> encoding the mapping status on the CPU page table entries. This
>>> entangles machine-independent code with machine-dependent code, and
>>> also brings unnecessary restrictions.
>>
>> Why? We have primitives to walk arch page tables in a non-arch-specific
>> fashion and are using them all over the place. We even have various
>> mechanisms to map something into the page tables and get the CPU to
>> fault on it, as if it is inaccessible (PROT_NONE as used for NUMA
>> balancing, fake swap entries).
>>
>>> The PTE size and format vary arch by arch, which harms the
>>> extensibility.
>>
>> Not really. We might have some features limited to some architectures
>> because of the lack of PTE bits. And usually the problem is that people
>> don't care enough about enabling these features on older architectures.
>>
>> If we ever *really* need more space for sw-defined data, it would be
>> possible to allocate auxiliary data for page tables only where required
>> (where the features apply), instead of crafting a completely new,
>> auxiliary data structure with its own locking.
>>
>> So far it was not required to enable the feature we need on the
>> architectures we care about.
>>
>>> [1] https://docs.freebsd.org/en/articles/vm-design/
>>
>> In the cover letter you have:
>>
>> "The future plan of logical page table is to provide a generic
>> abstraction layer that support common anonymous memory (I am looking at
>> you, transparent huge pages) and file-backed memory."
>>
>> Which I doubt will happen; there is little interest in making anonymous
>> memory management slower, more serialized, and wasting more memory on
>> metadata.
>
> Also worth noting that:
>
> 1) Mach VM (which FreeBSD inherited, from the old BSD) vm_objects aren't
> quite what's being stated here; rather, they are somewhat replacements
> for both anon_vma and address_space[1]. Very similarly to Linux, they
> take pages from vm_objects and map them in page tables using pmap (the
> big difference is anon memory, which has its bookkeeping in page tables,
> on Linux).
>
> 2) These vm_objects were a horrendous mistake (see CoW chaining) and
> FreeBSD has to go to horrendous lengths to make them tolerable. The UVM
> paper/dissertation (by Charles Cranor) talks about these issues at
> length, and 20 years later it's still true.
>
> 3) Despite Linux MM having its warts, it's probably correct to consider
> it a solid improvement over FreeBSD MM or NetBSD UVM.
>
> And, finally, randomly tacking on core MM concepts from other systems is
> at best a *really weird* idea. Particularly when they aren't even what
> was stated!

Can you read my mind? :)

Thanks for noting all that, with which I 100% agree.

--
Cheers,

David / dhildenb
Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
On 28.11.23 13:50, Weixi Zhu wrote:
> This patch adds an abstraction layer, struct vm_object, that maintains
> per-process virtual-to-physical mapping status stored in struct
> gm_mapping. For example, a virtual page may be mapped to a CPU physical
> page or to a device physical page. Struct vm_object effectively
> maintains an arch-independent page table, which is defined as a "logical
> page table", while the arch-dependent page table used by a real MMU is
> named a "physical page table". The logical page table is useful if Linux
> core MM is extended to handle a unified virtual address space with
> external accelerators using customized MMUs.

Which raises the question why we are dealing with anonymous memory at all? Why not go for shmem if you are already only special-casing VMAs with a MMAP flag right now?

That would maybe avoid having to introduce controversial BSD design concepts into Linux, that feel like going a step backwards in time to me and adding *more* MM complexity.

> In this patch, struct vm_object utilizes a radix tree (xarray) to track
> where a virtual page is mapped to. This adds extra memory consumption
> from xarray, but provides a nice abstraction to isolate mapping status
> from the machine-dependent layer (PTEs). Besides supporting accelerators
> with external MMUs, struct vm_object is planned to further union with
> i_pages in struct address_mapping for file-backed memory.

A file already has a tree structure (pagecache) to manage the pages that are theoretically mapped. It's easy to translate from a VMA to a page inside that tree structure that is currently not present in page tables.

Why the need for that tree structure if you can just remove anon memory from the picture?

> The idea of struct vm_object is originated from FreeBSD VM design, which
> provides a unified abstraction for anonymous memory, file-backed memory,
> page cache and etc[1].

:/

> Currently, Linux utilizes a set of hierarchical page walk functions to
> abstract page table manipulations of different CPU architectures. The
> problem happens when a device wants to reuse Linux MM code to manage its
> page table -- the device page table may not be accessible to the CPU.
> Existing solutions like Linux HMM utilize the MMU notifier mechanisms to
> invoke device-specific MMU functions, but rely on encoding the mapping
> status on the CPU page table entries. This entangles machine-independent
> code with machine-dependent code, and also brings unnecessary
> restrictions.

Why? We have primitives to walk arch page tables in a non-arch-specific fashion and are using them all over the place. We even have various mechanisms to map something into the page tables and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as used for NUMA balancing, fake swap entries).

> The PTE size and format vary arch by arch, which harms the
> extensibility.

Not really. We might have some features limited to some architectures because of the lack of PTE bits. And usually the problem is that people don't care enough about enabling these features on older architectures.

If we ever *really* need more space for sw-defined data, it would be possible to allocate auxiliary data for page tables only where required (where the features apply), instead of crafting a completely new, auxiliary data structure with its own locking.

So far it was not required to enable the feature we need on the architectures we care about.

> [1] https://docs.freebsd.org/en/articles/vm-design/

In the cover letter you have:

"The future plan of logical page table is to provide a generic abstraction layer that support common anonymous memory (I am looking at you, transparent huge pages) and file-backed memory."

Which I doubt will happen; there is little interest in making anonymous memory management slower, more serialized, and wasting more memory on metadata.

Note that you won't make many friends around here with statements like "To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpage".

I read one paper that makes such claims (I'm curious how you define "winning"), and am aware of some shortcomings. But I am not convinced that a second datastructure "is why Linux THP fails". It just requires some more work to get it sorted under Linux (e.g., allocate THP, PTE-map it and map inaccessible parts PROT_NONE, later collapse it in-place into a PMD), and so far, there was not a lot of interest that I am aware of to even start working on that.

So if there is not enough pain for companies to even work on that in Linux, maybe FreeBSD superpages are "winning" "on paper" only? Remember that the target audience here are Linux developers.

But yeah, this here is all designed around the idea "core MM is extended to handle a unified virtual address space with external accelerators using customized MMUs." and then trying to find other arguments why it's a good idea, without going too much into detail.
Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On 01.12.23 03:44, zhuweixi wrote:
> Thanks!

I hope you understood that that was a joke :)

> I am planning to present GMEM in Linux MM Alignment Sessions so I can
> collect more input from the mm developers.

Sounds good. But please try inviting key HMM/driver developers as well. Most of the core-mm folks attending that meeting are not that familiar with these concepts and they are usually not happy about:

(1) More core-MM complexity for things that can be neatly handled in
    separate subsystems with the existing infrastructure already.
(2) One new way of doing things while the other things remain in place.
(3) New MMAP flags. Usually you have a hard time getting this in.
    Sometimes, there are other ways (e.g., special-purpose filesystems).
(4) Changing controversial core-mm design decisions to handle corner
    cases.

--
Cheers,

David / dhildenb
Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On 29.11.23 09:27, zhuweixi wrote:
> Glad to hear that more sharable code is desirable. IMHO, for a common MM
> subsystem, it is more beneficial for GMEM to extend core MM instead of
> building a separate one.

More core-mm complexity, awesome, we all love that! ;)

--
Cheers,

David / dhildenb
Re: [PATCH v2 3/4] selinux: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote:
> Use the helpers to simplify code.
>
> Cc: Paul Moore
> Cc: Stephen Smalley
> Cc: Eric Paris
> Acked-by: Paul Moore
> Signed-off-by: Kefeng Wang
> ---
>  security/selinux/hooks.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index d06e350fedee..ee8575540a8e 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -3762,13 +3762,10 @@ static int selinux_file_mprotect(struct vm_area_struct *vma,
>  	if (default_noexec &&
>  	    (prot & PROT_EXEC) && !(vma->vm_flags & VM_EXEC)) {
>  		int rc = 0;
> -		if (vma->vm_start >= vma->vm_mm->start_brk &&
> -		    vma->vm_end <= vma->vm_mm->brk) {
> +		if (vma_is_initial_heap(vma)) {
>  			rc = avc_has_perm(sid, sid, SECCLASS_PROCESS,
>  					  PROCESS__EXECHEAP, NULL);
> -		} else if (!vma->vm_file &&
> -			   ((vma->vm_start <= vma->vm_mm->start_stack &&
> -			     vma->vm_end >= vma->vm_mm->start_stack) ||
> +		} else if (!vma->vm_file && (vma_is_initial_stack(vma) ||
>  			    vma_is_stack_for_current(vma))) {
>  			rc = avc_has_perm(sid, sid, SECCLASS_PROCESS,
>  					  PROCESS__EXECSTACK, NULL);

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v2 1/4] mm: factor out VMA stack and heap checks
On 19.07.23 09:51, Kefeng Wang wrote:
> Factor out VMA stack and heap checks and name them vma_is_initial_stack()
> and vma_is_initial_heap() for general use.
>
> Cc: Christian Göttsche
> Cc: David Hildenbrand
> Signed-off-by: Kefeng Wang
> ---
>
> [...]
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..51f8c573db74 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -822,6 +822,27 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
>  	return !vma->vm_ops;
>  }

Worth adding a similar comment like for vma_is_initial_stack()?

> +static inline bool vma_is_initial_heap(const struct vm_area_struct *vma)
> +{
> +	return vma->vm_start <= vma->vm_mm->brk &&
> +		vma->vm_end >= vma->vm_mm->start_brk;
> +}
> +
> +/*
> + * Indicate if the VMA is a stack for the given task; for
> + * /proc/PID/maps that is the stack of the main task.
> + */
> +static inline bool vma_is_initial_stack(const struct vm_area_struct *vma)
> +{
> +	/*
> +	 * We make no effort to guess what a given thread considers to be
> +	 * its "stack". It's not even well-defined for programs written
> +	 * in languages like Go.
> +	 */
> +	return vma->vm_start <= vma->vm_mm->start_stack &&
> +		vma->vm_end >= vma->vm_mm->start_stack;
> +}
> +
>  static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
>  {
>  	int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote:
> Use the helpers to simplify code.
>
> Cc: Felix Kuehling
> Cc: Alex Deucher
> Cc: "Christian König"
> Cc: "Pan, Xinhui"
> Cc: David Airlie
> Cc: Daniel Vetter
> Signed-off-by: Kefeng Wang
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 5ff1a5a89d96..0b7bfbd0cb66 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, int64_t addr,
>  		return -EFAULT;
>  	}
>
> -	*is_heap_stack = (vma->vm_start <= vma->vm_mm->brk &&
> -			  vma->vm_end >= vma->vm_mm->start_brk) ||
> -			 (vma->vm_start <= vma->vm_mm->start_stack &&
> -			  vma->vm_end >= vma->vm_mm->start_stack);
> +	*is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma);
>
>  	start_limit = max(vma->vm_start >> PAGE_SHIFT,
>  			  (unsigned long)ALIGN_DOWN(addr, 2UL << 8));

Certainly a valid refactoring, although it is questionable whether such code should care about that.

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v2 4/4] perf/core: use vma_is_initial_stack() and vma_is_initial_heap()
On 19.07.23 09:51, Kefeng Wang wrote:
> Use the helpers to simplify code, also kill the unneeded goto cpy_name.
>
> Cc: Peter Zijlstra
> Cc: Arnaldo Carvalho de Melo
> Signed-off-by: Kefeng Wang
> ---
>  kernel/events/core.c | 22 +++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 78ae7b6f90fd..d59f6327472f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8685,22 +8685,14 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
>  	}
>
>  	name = (char *)arch_vma_name(vma);
> -	if (name)
> -		goto cpy_name;
> -
> -	if (vma->vm_start <= vma->vm_mm->start_brk &&
> -	    vma->vm_end >= vma->vm_mm->brk) {
> -		name = "[heap]";
> -		goto cpy_name;
> +	if (!name) {
> +		if (vma_is_initial_heap(vma))
> +			name = "[heap]";
> +		else if (vma_is_initial_stack(vma))
> +			name = "[stack]";
> +		else
> +			name = "//anon";
>  	}
> -	if (vma->vm_start <= vma->vm_mm->start_stack &&
> -	    vma->vm_end >= vma->vm_mm->start_stack) {
> -		name = "[stack]";
> -		goto cpy_name;
> -	}
> -
> -	name = "//anon";
> -	goto cpy_name;

If you're removing that goto, maybe also worth removing the goto at the end of the previous if branch.

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH 1/5] mm: introduce vma_is_stack() and vma_is_heap()
On 12.07.23 16:38, Kefeng Wang wrote:
> Introduce the two helpers for general use.
>
> Signed-off-by: Kefeng Wang
> ---
>  include/linux/mm.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1462cf15badf..0bbeb31ac750 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -926,6 +926,18 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
>  	return !vma->vm_ops;
>  }
>
> +static inline bool vma_is_heap(struct vm_area_struct *vma)
> +{
> +	return vma->vm_start <= vma->vm_mm->brk &&
> +		vma->vm_end >= vma->vm_mm->start_brk;
> +}
> +
> +static inline bool vma_is_stack(struct vm_area_struct *vma)
> +{
> +	return vma->vm_start <= vma->vm_mm->start_stack &&
> +		vma->vm_end >= vma->vm_mm->start_stack;
> +}
> +
>  static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
>  {
>  	int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);

Looking at the comments in patch #3, should these functions be called vma_is_initial_heap / vma_is_initial_stack?

--
Cheers,

David / dhildenb
Re: [PATCH 2/5] mm: use vma_is_stack() and vma_is_heap()
On 12.07.23 16:38, Kefeng Wang wrote:
> Use the helpers to simplify code.
>
> Signed-off-by: Kefeng Wang
> ---
>  fs/proc/task_mmu.c   | 24
>  fs/proc/task_nommu.c | 15 +--
>  2 files changed, 5 insertions(+), 34 deletions(-)

Please squash patch #1 and this patch and call it something like "mm: factor out VMA stack and heap checks".

And then, maybe also keep the comments in these functions; they sound reasonable to have.

--
Cheers,

David / dhildenb
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
On 16.05.23 16:30, Sean Christopherson wrote:
> On Tue, May 16, 2023, David Hildenbrand wrote:
>> On 15.05.23 21:07, Sean Christopherson wrote:
>>> On Sun, May 14, 2023, Lorenzo Stoakes wrote:
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index cb5c13eee193..eaa5bb8dbadc 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>>>>  {
>>>>  	int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
>>>>
>>>> -	rc = get_user_pages(addr, 1, flags, NULL, NULL);
>>>> +	rc = get_user_pages(addr, 1, flags, NULL);
>>>>  	return rc == -EHWPOISON;
>>>
>>> Unrelated to this patch, I think there's a pre-existing bug here. If
>>> gup() returns a valid page, KVM will leak the refcount and
>>> unintentionally pin the page. That's
>>
>> When passing NULL as "pages" to get_user_pages(),
>> __get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not
>> set, we won't be messing with the mapcount of the page.

For completeness: s/mapcount/refcount/ :)

> Ah, that's what I'm missing.

--
Thanks,

David / dhildenb
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
On 15.05.23 21:07, Sean Christopherson wrote:
> On Sun, May 14, 2023, Lorenzo Stoakes wrote:
>> No invocation of get_user_pages() uses the vmas parameter, so remove it.
>>
>> The GUP API is confusing and caveated. Recent changes have done much to
>> improve that, however there is more we can do. Exporting vmas is a
>> prime target as the caller has to be extremely careful to preclude
>> their use after the mmap_lock has expired or otherwise be left with
>> dangling pointers.
>>
>> Removing the vmas parameter focuses the GUP functions upon their
>> primary purpose - pinning (and outputting) pages as well as performing
>> the actions implied by the input flags.
>>
>> This is part of a patch series aiming to remove the vmas parameter
>> altogether.
>>
>> Suggested-by: Matthew Wilcox (Oracle)
>> Acked-by: Greg Kroah-Hartman
>> Acked-by: David Hildenbrand
>> Reviewed-by: Jason Gunthorpe
>> Acked-by: Christian König (for radeon parts)
>> Acked-by: Jarkko Sakkinen
>> Signed-off-by: Lorenzo Stoakes
>> ---
>>  arch/x86/kernel/cpu/sgx/ioctl.c     | 2 +-
>>  drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
>>  drivers/misc/sgi-gru/grufault.c     | 2 +-
>>  include/linux/mm.h                  | 3 +--
>>  mm/gup.c                            | 9 +++--
>>  mm/gup_test.c                       | 5 ++---
>>  virt/kvm/kvm_main.c                 | 2 +-
>>  7 files changed, 10 insertions(+), 15 deletions(-)
>
> Acked-by: Sean Christopherson (KVM)
>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index cb5c13eee193..eaa5bb8dbadc 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2477,7 +2477,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>>  {
>>  	int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
>>
>> -	rc = get_user_pages(addr, 1, flags, NULL, NULL);
>> +	rc = get_user_pages(addr, 1, flags, NULL);
>>  	return rc == -EHWPOISON;
>
> Unrelated to this patch, I think there's a pre-existing bug here. If
> gup() returns a valid page, KVM will leak the refcount and
> unintentionally pin the page. That's

When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked() won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with the mapcount of the page.

So even if get_user_pages() returns "1", we should be fine. Or am I misunderstanding your concern?

At least hva_to_pfn_slow() most certainly didn't return "1" if we end up calling check_user_page_hwpoison(), so nothing would have been pinned there as well.

--
Thanks,

David / dhildenb
Re: [PATCH v3 1/7] mm/gup: remove unused vmas parameter from get_user_pages()
On 15.04.23 14:09, Lorenzo Stoakes wrote:
> No invocation of get_user_pages() uses the vmas parameter, so remove it.
>
> The GUP API is confusing and caveated. Recent changes have done much to
> improve that, however there is more we can do. Exporting vmas is a prime
> target as the caller has to be extremely careful to preclude their use
> after the mmap_lock has expired or otherwise be left with dangling
> pointers.
>
> Removing the vmas parameter focuses the GUP functions upon their primary
> purpose - pinning (and outputting) pages as well as performing the
> actions implied by the input flags.
>
> This is part of a patch series aiming to remove the vmas parameter
> altogether.
>
> Signed-off-by: Lorenzo Stoakes
> Suggested-by: Matthew Wilcox (Oracle)
> Acked-by: Greg Kroah-Hartman
> ---

Acked-by: David Hildenbrand

--
Thanks,

David / dhildenb
Re: [PATCH] mm/gup.c: Fix formating in check_and_migrate_movable_page()
On 21.07.22 04:05, Alistair Popple wrote: > Commit b05a79d4377f ("mm/gup: migrate device coherent pages when pinning > instead of failing") added a badly formatted if statement. Fix it. > > Signed-off-by: Alistair Popple > Reported-by: David Hildenbrand > --- > > Apologies Andrew for missing this. Hopefully this fixes things. > > mm/gup.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 364b274a10c2..c6d060dee9e0 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1980,8 +1980,8 @@ static long check_and_migrate_movable_pages(unsigned > long nr_pages, > folio_nr_pages(folio)); > } > > - if (!list_empty(&movable_page_list) || isolation_error_count > - || coherent_pages) > + if (!list_empty(&movable_page_list) || isolation_error_count || > + coherent_pages) > goto unpin_pages; > > /* Happy to see the series go upstream soon. Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v9 06/14] mm/gup: migrate device coherent pages when pinning instead of failing
On 18.07.22 22:32, Andrew Morton wrote:
> On Mon, 18 Jul 2022 12:56:29 +0200 David Hildenbrand wrote:
>
>>>  	/*
>>>  	 * Try to move out any movable page before pinning the range.
>>>  	 */
>>> @@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>>>  			folio_nr_pages(folio));
>>>  	}
>>>
>>> -	if (!list_empty(&movable_page_list) || isolation_error_count)
>>> +	if (!list_empty(&movable_page_list) || isolation_error_count
>>> +	    || coherent_pages)
>>
>> The common style is to
>>
>> a) add the || to the end of the previous line
>> b) indent such that we have a nice-to-read alignment
>>
>> 	if (!list_empty(&movable_page_list) || isolation_error_count ||
>> 	    coherent_pages)
>
> I missed that. This series is now in mm-stable so any fix will need to
> be a standalone followup patch, please.
>
>> Apart from that lgtm.
>>
>> Reviewed-by: David Hildenbrand
>
> And your reviewed-by's will be lost. Stupid git.

I know, I already raised my concerns regarding the new workflow, so I won't repeat that.

I can understand (to some degree) that we don't want code to change just before the new merge window opens. But I do wonder if we really don't even want to do subject+description updates. Sure, the commit IDs will change. Who cares?

Anyhow, it is what it is.

--
Thanks,

David / dhildenb
Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h
On 18.07.22 19:52, Felix Kuehling wrote:
> On 2022-07-18 06:50, David Hildenbrand wrote:
>> On 15.07.22 17:05, Alex Sierra wrote:
>>> [WHY]
>>> It makes more sense to have these helpers in zone specific header
>>> file, rather than the generic mm.h
>>>
>>> Signed-off-by: Alex Sierra
>>
>> Acked-by: David Hildenbrand
>
> Thank you! I don't think I have the authority to give this a
> Reviewed-by. Who does?

Sure you can. Everybody can give Reviewed-by/Tested-by ... tags. :)

--
Thanks,

David / dhildenb
Re: [PATCH v9 06/14] mm/gup: migrate device coherent pages when pinning instead of failing
On 15.07.22 17:05, Alex Sierra wrote:
> From: Alistair Popple
>
> Currently any attempts to pin a device coherent page will fail. This is
> because device coherent pages need to be managed by a device driver, and
> pinning them would prevent a driver from migrating them off the device.
>
> However this is no reason to fail pinning of these pages. These are
> coherent and accessible from the CPU so can be migrated just like
> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
> them first try migrating them out of ZONE_DEVICE.
>
> [hch: rebased to the split device memory checks,
>  moved migrate_device_page to migrate_device.c]
>
> Signed-off-by: Alistair Popple
> Acked-by: Felix Kuehling
> Signed-off-by: Christoph Hellwig

[...]

>  	/*
>  	 * Try to move out any movable page before pinning the range.
>  	 */
> @@ -1919,7 +1948,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  			folio_nr_pages(folio));
>  	}
>
> -	if (!list_empty(&movable_page_list) || isolation_error_count)
> +	if (!list_empty(&movable_page_list) || isolation_error_count
> +	    || coherent_pages)

The common style is to

a) add the || to the end of the previous line
b) indent such that we have a nice-to-read alignment

	if (!list_empty(&movable_page_list) || isolation_error_count ||
	    coherent_pages)

Apart from that lgtm.

Reviewed-by: David Hildenbrand

--
Thanks,

David / dhildenb
Re: [PATCH v9 04/14] mm: handling Non-LRU pages returned by vm_normal_pages
On 15.07.22 17:05, Alex Sierra wrote:
> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page tables, and
> for COW, they do not support LRU lists, NUMA migration or THP.
>
> Callers to follow_page() currently don't expect ZONE_DEVICE pages,
> however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check
> for ZONE_DEVICE pages in applicable users of follow_page() as well.
>
> Signed-off-by: Alex Sierra
> Acked-by: Felix Kuehling (v2)
> Reviewed-by: Alistair Popple (v6)

Acked-by: David Hildenbrand

--
Thanks,

David / dhildenb
Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h
On 15.07.22 17:05, Alex Sierra wrote:
> [WHY]
> It makes more sense to have these helpers in zone specific header
> file, rather than the generic mm.h
>
> Signed-off-by: Alex Sierra

Acked-by: David Hildenbrand

--
Thanks,

David / dhildenb
Re: [PATCH v8 02/15] mm: move page zone helpers into new header-specific file
On 08.07.22 23:25, Felix Kuehling wrote:
> On 2022-07-08 07:28, David Hildenbrand wrote:
>> On 07.07.22 21:03, Alex Sierra wrote:
>>> [WHY]
>>> Have a cleaner way to expose all page zone helpers in one header
>>
>> What exactly is a "page zone"? Do you mean a buddy zone as in
>> include/linux/mmzone.h?
>
> Zone as in ZONE_DEVICE. Maybe we could extend mmzone.h instead of

Yes, I think so. And try moving as little as possible in this patch.

> creating page_zone.h? That should work as long as it's OK to include
> mmzone.h in memremap.h.

I think so.

--
Thanks,

David / dhildenb
Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing
On 11.07.22 16:00, Matthew Wilcox wrote:
> On Mon, Jul 11, 2022 at 03:35:42PM +0200, David Hildenbrand wrote:
>>> +	/*
>>> +	 * Device coherent pages are managed by a driver and should not
>>> +	 * be pinned indefinitely as it prevents the driver moving the
>>> +	 * page. So when trying to pin with FOLL_LONGTERM instead try
>>> +	 * to migrate the page out of device memory.
>>> +	 */
>>> +	if (folio_is_device_coherent(folio)) {
>>> +		WARN_ON_ONCE(PageCompound(&folio->page));
>>
>> Maybe that belongs into migrate_device_page()?
>
> ... and should be simply WARN_ON_ONCE(folio_test_large(folio)).
> No need to go converting it back into a page in order to test things
> that can't possibly be true.

Good point.

--
Thanks,

David / dhildenb
Re: [PATCH v8 06/15] mm: remove the vma check in migrate_vma_setup()
On 07.07.22 21:03, Alex Sierra wrote:
> From: Alistair Popple
>
> migrate_vma_setup() checks that a valid vma is passed so that the page
> tables can be walked to find the pfns associated with a given address
> range. However in some cases the pfns are already known, such as when
> migrating device coherent pages during pin_user_pages(), meaning a
> valid vma isn't required.

As raised in my other reply, without a VMA ... it feels odd to use a "migrate_vma" API. For an internal (mm/migrate_device.c) use case it is ok I guess, but it certainly adds a bit of confusion. For example, because migrate_vma_setup() will undo ref+lock not obtained by it.

I guess the interesting point is that

a) Besides migrate_vma_pages() and migrate_vma_setup(), the ->vma is unused.
b) migrate_vma_setup() does collect+unmap+cleanup if unmap failed.
c) With our source page in our hands, we cannot be processing a hole in a VMA.

Not sure if it's better, but I would

a) Enforce in migrate_vma_setup() that there is a VMA. Code outside of mm/migrate_device.c shouldn't be doing hacks like this.
b) Not call migrate_vma_setup() from migrate_device_page(), but directly migrate_vma_unmap(), and add a comment.

That will leave a single change to this patch (migrate_vma_pages()). But is that even required? Because

> @@ -685,7 +685,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
>  			continue;
>  		}
>
> -		if (!page) {
> +		if (!page && migrate->vma) {

How could we ever have !page in case of migrate_device_page()?

Instead, I think a VM_BUG_ON(migrate->vma); should hold and you can just simplify.

>  		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
>  			continue;
>  		if (!notified) {

--
Thanks,

David / dhildenb
Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing
On 07.07.22 21:03, Alex Sierra wrote: > From: Alistair Popple > > Currently any attempts to pin a device coherent page will fail. This is > because device coherent pages need to be managed by a device driver, and > pinning them would prevent a driver from migrating them off the device. > > However this is no reason to fail pinning of these pages. These are > coherent and accessible from the CPU so can be migrated just like > pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin > them first try migrating them out of ZONE_DEVICE. > > Signed-off-by: Alistair Popple > Acked-by: Felix Kuehling > [hch: rebased to the split device memory checks, > moved migrate_device_page to migrate_device.c] > Signed-off-by: Christoph Hellwig > --- > mm/gup.c| 47 +++- > mm/internal.h | 1 + > mm/migrate_device.c | 53 + > 3 files changed, 96 insertions(+), 5 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index b65fe8bf5af4..9b6b9923d22d 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned > long nr_pages, > continue; > prev_folio = folio; > > - if (folio_is_longterm_pinnable(folio)) > + /* > + * Device private pages will get faulted in during gup so it > + * shouldn't be possible to see one here. > + */ > + if (WARN_ON_ONCE(folio_is_device_private(folio))) { > + ret = -EFAULT; > + goto unpin_pages; > + } I'd just drop that. Device private pages are never part of a present PTE. So if we could actually get a grab of one via GUP we would be in bigger trouble ... already before this patch. > + > + /* > + * Device coherent pages are managed by a driver and should not > + * be pinned indefinitely as it prevents the driver moving the > + * page. So when trying to pin with FOLL_LONGTERM instead try > + * to migrate the page out of device memory. > + */ > + if (folio_is_device_coherent(folio)) { > + WARN_ON_ONCE(PageCompound(&folio->page)); Maybe that belongs into migrate_device_page()? 
> +
> +			/*
> +			 * Migration will fail if the page is pinned, so convert

[...]

> /*
>  * mm/gup.c

> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index cf9668376c5a..5decd26dd551 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>  	}
>  }
>  EXPORT_SYMBOL(migrate_vma_finalize);
> +
> +/*
> + * Migrate a device coherent page back to normal memory. The caller should have
> + * a reference on page which will be copied to the new page if migration is
> + * successful or dropped on failure.
> + */
> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags)

Function name should most probably indicate that we're dealing with coherent pages here?

> +{
> +	unsigned long src_pfn, dst_pfn = 0;
> +	struct migrate_vma args;
> +	struct page *dpage;
> +
> +	lock_page(page);
> +	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> +	args.src = &src_pfn;
> +	args.dst = &dst_pfn;
> +	args.cpages = 1;
> +	args.npages = 1;
> +	args.vma = NULL;
> +	migrate_vma_setup(&args);
> +	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> +		return NULL;

Wow, these refcount and page locking/unlocking rules with this migrate_* API are confusing now. And the usage here of sometimes returning and sometimes falling through doesn't make it particularly easier to understand here.

I'm not 100% happy about reusing migrate_vma_setup() if there *is no VMA*. That's just absolutely confusing, because usually migrate_vma_setup() itself would do the collection step and ref+lock pages. :/

In general, I can see why/how we're reusing the migrate_vma_* API here, but there is absolutely no VMA ... not sure what to improve besides providing a second API that does a simple single-page migration. But that can be changed later ...

> +
> +	dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);

alloc_page()

> +
> +	/*
> +	 * get/pin the new page now so we don't have to retry gup after
> +	 * migrating.
> +	 * We already have a reference so this should never fail.
> +	 */
> +	if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> +		__free_pages(dpage, 0);

__free_page()

> +		dpage = NULL;
> +	}

Hm, this means that we are not pinning via the PTE at hand, but via something we expect migration to put into the PTE. I'm not really happy about this. Ideally, we'd make the pinning decision only on the actual GUP path, not in here. Just like in the
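For readers following the thread, the FOLL_LONGTERM decision flow under review can be sketched as a simplified userspace model. The struct fields, enum and function names below are illustrative stand-ins, not kernel types; the sketch only captures the ordering of the checks being discussed:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the relevant folio state; not the kernel's struct folio. */
struct folio_state {
	bool device_private;    /* MEMORY_DEVICE_PRIVATE backing */
	bool device_coherent;   /* MEMORY_DEVICE_COHERENT backing */
	bool longterm_pinnable;
};

enum pin_action {
	PIN_OK,                 /* folio_is_longterm_pinnable(): just pin */
	PIN_MIGRATE_COHERENT,   /* migrate out of ZONE_DEVICE before pinning */
	PIN_MIGRATE_MOVABLE,    /* usual movable-page migration path */
	PIN_FAULT,              /* WARN_ON_ONCE + -EFAULT in the patch */
};

/*
 * Models the per-folio decision in check_and_migrate_movable_pages() as
 * proposed by the patch: device-private pages should never be seen here
 * (they fault in via GUP, which is why David suggests dropping the check),
 * device-coherent pages are migrated to system memory first, and other
 * non-pinnable pages go through movable-page migration.
 */
static enum pin_action classify_for_longterm_pin(const struct folio_state *f)
{
	if (f->device_private)
		return PIN_FAULT;
	if (f->device_coherent)
		return PIN_MIGRATE_COHERENT;
	if (!f->longterm_pinnable)
		return PIN_MIGRATE_MOVABLE;
	return PIN_OK;
}
```

The real function additionally batches pages and performs the migration and unpinning; the model is only the classification step.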
Re: [PATCH v8 02/15] mm: move page zone helpers into new header-specific file
On 07.07.22 21:03, Alex Sierra wrote:
> [WHY]
> Have a cleaner way to expose all page zone helpers in one header

What exactly is a "page zone"? Do you mean a buddy zone as in include/linux/mmzone.h?

-- 
Thanks,

David / dhildenb
Re: [PATCH v8 01/15] mm: rename is_pinnable_pages to is_longterm_pinnable_pages
On 07.07.22 21:03, Alex Sierra wrote: > is_pinnable_page() and folio_is_pinnable() were renamed to > is_longterm_pinnable_page() and folio_is_longterm_pinnable() > respectively. These functions are used in the FOLL_LONGTERM flag > context. > > Signed-off-by: Alex Sierra > --- > include/linux/mm.h | 8 > mm/gup.c | 4 ++-- > mm/gup_test.c | 2 +- > mm/hugetlb.c | 2 +- > 4 files changed, 8 insertions(+), 8 deletions(-) > Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 04/14] mm: add device coherent vma selection for memory migration
On 30.06.22 13:44, Alistair Popple wrote: > > David Hildenbrand writes: > >> On 29.06.22 05:54, Alex Sierra wrote: >>> This case is used to migrate pages from device memory, back to system >>> memory. Device coherent type memory is cache coherent from device and CPU >>> point of view. >>> >>> Signed-off-by: Alex Sierra >>> Acked-by: Felix Kuehling >>> Reviewed-by: Alistair Poppple >>> Signed-off-by: Christoph Hellwig >> >> >> I'm not too familiar with this code, please excuse my naive questions: >> >>> @@ -148,15 +148,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>> if (is_writable_device_private_entry(entry)) >>> mpfn |= MIGRATE_PFN_WRITE; >>> } else { >>> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) >>> - goto next; >> >> Why not exclude MIGRATE_VMA_SELECT_DEVICE_PRIVATE here? IIRC that would >> have happened before this change. > > I might be missing something as I don't quite follow - this path is for > normal system pages so we only want to skip selecting them if > MIGRATE_VMA_SELECT_SYSTEM or MIGRATE_VMA_SELECT_DEVICE_COHERENT aren't > set. > > Note that MIGRATE_VMA_SELECT_DEVICE_PRIVATE doesn't apply here because > we already know it's not a device private page by virtue of > pte_present(pte) == True. Ah, stupid me, pte_present(pte) is the key. > >>> pfn = pte_pfn(pte); >>> - if (is_zero_pfn(pfn)) { >>> + if (is_zero_pfn(pfn) && >>> + (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) { >>> mpfn = MIGRATE_PFN_MIGRATE; >>> migrate->cpages++; >>> goto next; >>> } >>> page = vm_normal_page(migrate->vma, addr, pte); >>> + if (page && !is_zone_device_page(page) && >> >> I'm wondering if that check logically belongs into patch #2. > > I don't think so as it would break functionality until the below > conditionals are added - we explicitly don't want to skip > is_zone_device_page(page) == False here because that is the pages we are > trying to select. 
> > You could add in this: > >>> + !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) > > But then in patch 2 we know this can never be true because we've already > checked for !MIGRATE_VMA_SELECT_SYSTEM there. Ah, okay, thanks Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 04/14] mm: add device coherent vma selection for memory migration
On 29.06.22 05:54, Alex Sierra wrote:
> This case is used to migrate pages from device memory, back to system
> memory. Device coherent type memory is cache coherent from device and CPU
> point of view.
>
> Signed-off-by: Alex Sierra
> Acked-by: Felix Kuehling
> Reviewed-by: Alistair Poppple
> Signed-off-by: Christoph Hellwig

I'm not too familiar with this code, please excuse my naive questions:

> @@ -148,15 +148,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (is_writable_device_private_entry(entry))
>  				mpfn |= MIGRATE_PFN_WRITE;
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> -				goto next;

Why not exclude MIGRATE_VMA_SELECT_DEVICE_PRIVATE here? IIRC that would have happened before this change.

>  			pfn = pte_pfn(pte);
> -			if (is_zero_pfn(pfn)) {
> +			if (is_zero_pfn(pfn) &&
> +			    (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>  				mpfn = MIGRATE_PFN_MIGRATE;
>  				migrate->cpages++;
>  				goto next;
>  			}
>  			page = vm_normal_page(migrate->vma, addr, pte);
> +			if (page && !is_zone_device_page(page) &&

I'm wondering if that check logically belongs into patch #2.

> +			    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +				goto next;
> +			else if (page && is_device_coherent_page(page) &&
> +			    (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
> +			     page->pgmap->owner != migrate->pgmap_owner))

In general LGTM

-- 
Thanks,

David / dhildenb
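The present-PTE selection logic discussed in this exchange boils down to a small predicate. A hedged userspace sketch follows; the flag values and struct fields are stand-ins (the zero-pfn branch and non-present device-private entries are deliberately omitted, since device-private pages never have a present PTE):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values; the kernel defines its own bit layout. */
#define MIGRATE_VMA_SELECT_SYSTEM          (1u << 0)
#define MIGRATE_VMA_SELECT_DEVICE_PRIVATE  (1u << 1)
#define MIGRATE_VMA_SELECT_DEVICE_COHERENT (1u << 2)

/* Stand-in for the page state visible to migrate_vma_collect_pmd(). */
struct page_state {
	bool zone_device;        /* a device-coherent ZONE_DEVICE page */
	const void *pgmap_owner; /* page->pgmap->owner */
};

/*
 * Models the selection for a present PTE after this patch: a normal
 * system page is selected only with MIGRATE_VMA_SELECT_SYSTEM; a
 * device-coherent page is selected only with
 * MIGRATE_VMA_SELECT_DEVICE_COHERENT and a matching pgmap owner.
 */
static bool selects_present_page(const struct page_state *p,
				 unsigned int flags,
				 const void *migrate_owner)
{
	if (!p->zone_device)
		return (flags & MIGRATE_VMA_SELECT_SYSTEM) != 0;
	return (flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) &&
	       p->pgmap_owner == migrate_owner;
}
```

This also illustrates Alistair's point above: pte_present() already rules out device-private pages on this path, so only the SYSTEM and DEVICE_COHERENT flags matter here.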
Re: [PATCH v7 01/14] mm: rename is_pinnable_pages to is_pinnable_longterm_pages
On 30.06.22 00:08, Felix Kuehling wrote: > On 2022-06-29 03:33, David Hildenbrand wrote: >> On 29.06.22 05:54, Alex Sierra wrote: >>> is_pinnable_page() and folio_is_pinnable() were renamed to >>> is_longterm_pinnable_page() and folio_is_longterm_pinnable() >>> respectively. These functions are used in the FOLL_LONGTERM flag >>> context. >> Subject talks about "*_pages" >> >> >> Can you elaborate why the move from mm.h to memremap.h is justified? > > Patch 2 adds is_device_coherent_page in memremap.h and updates > is_longterm_pinnable_page to call is_device_coherent_page. memremap.h > cannot include mm.h because it is itself included by mm.h. So the choice > was to move is_longterm_pinnable_page to memremap.h, or move > is_device_coherent_page and all its dependencies to mm.h. The latter > would have been a bigger change. I really don't think something mm generic that compiles without ZONE_DEVICE belongs into memremap.h. Please find a cleaner way to get this done. -- Thanks, David / dhildenb
Re: [PATCH v7 02/14] mm: add zone device coherent type memory support
On 29.06.22 05:54, Alex Sierra wrote: > Device memory that is cache coherent from device and CPU point of view. > This is used on platforms that have an advanced system bus (like CAPI > or CXL). Any page of a process can be migrated to such memory. However, > no one should be allowed to pin such memory so that it can always be > evicted. > > Signed-off-by: Alex Sierra > Acked-by: Felix Kuehling > Reviewed-by: Alistair Popple > [hch: rebased ontop of the refcount changes, > removed is_dev_private_or_coherent_page] > Signed-off-by: Christoph Hellwig >From what I can tell, this looks good to me Acked-by: David Hildenbrand -- Thanks, David / dhildenb
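As a rough userspace sketch of what the new memory type implies: the enum mirrors the hunk quoted elsewhere in this thread, while the helpers are simplified stand-ins for is_device_coherent_page() and the long-term-pinnable check, not the kernel implementations:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the patched enum; 0 stays reserved to catch uninitialized
 * type fields. */
enum memory_type {
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_COHERENT,
	MEMORY_DEVICE_FS_DAX,
	MEMORY_DEVICE_GENERIC,
	MEMORY_DEVICE_PCI_P2PDMA,
};

/* Stand-in for struct dev_pagemap; NULL means a normal system page. */
struct pagemap_state {
	enum memory_type type;
};

static bool is_device_coherent(const struct pagemap_state *pgmap)
{
	return pgmap && pgmap->type == MEMORY_DEVICE_COHERENT;
}

/*
 * Coherent pages are CPU-accessible and may be pinned in the ordinary
 * sense, but per the commit message they must not be long-term pinned,
 * so the owning driver can always evict them.
 */
static bool longterm_pinnable(const struct pagemap_state *pgmap)
{
	return !is_device_coherent(pgmap);
}
```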
Re: [PATCH v7 03/14] mm: handling Non-LRU pages returned by vm_normal_pages
On 29.06.22 05:54, Alex Sierra wrote: > With DEVICE_COHERENT, we'll soon have vm_normal_pages() return > device-managed anonymous pages that are not LRU pages. Although they > behave like normal pages for purposes of mapping in CPU page, and for > COW. They do not support LRU lists, NUMA migration or THP. > > Callers to follow_page that expect LRU pages, are also checked for > device zone pages due to DEVICE_COHERENT type. Can we rephrase that to (because zeropage) "Callers to follow_page() currently don't expect ZONE_DEVICE pages, however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for ZONE_DEVICE pages in applicable users of follow_page() as well." [...] > /* > diff --git a/mm/memory.c b/mm/memory.c > index 7a089145cad4..e18555af9024 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -624,6 +624,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, > unsigned long addr, > if (is_zero_pfn(pfn)) > return NULL; > if (pte_devmap(pte)) > +/* > + * NOTE: New uers of ZONE_DEVICE will not set pte_devmap() and will have s/uers/users/ > + * refcounts incremented on their struct pages when they are inserted into > + * PTEs, thus they are safe to return here. Legacy ZONE_DEVICE pages that set > + * pte_devmap() do not have refcounts. Example of legacy ZONE_DEVICE is > + * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers. > + */ [...] > diff --git a/mm/mprotect.c b/mm/mprotect.c > index ba5592655ee3..e034aae2a98b 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -95,7 +95,7 @@ static unsigned long change_pte_range(struct mmu_gather > *tlb, > continue; > > page = vm_normal_page(vma, addr, oldpte); > - if (!page || PageKsm(page)) > + if (!page || is_zone_device_page(page) || > PageKsm(page)) > continue; > > /* Also skip shared copy-on-write pages */ In -next/-mm there is now an additional can_change_pte_writable() that calls vm_normal_page() -- added by me. 
I assume that that is indeed fine because we can simply map device coherent pages writable. Besides the nits, LGTM Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 01/14] mm: rename is_pinnable_pages to is_pinnable_longterm_pages
On 29.06.22 05:54, Alex Sierra wrote: > is_pinnable_page() and folio_is_pinnable() were renamed to > is_longterm_pinnable_page() and folio_is_longterm_pinnable() > respectively. These functions are used in the FOLL_LONGTERM flag > context. Subject talks about "*_pages" Can you elaborate why the move from mm.h to memremap.h is justified? I'd have called it "is_longterm_pinnable_page", but I am not a native speaker, so no strong opinion :) -- Thanks, David / dhildenb
Re: [PATCH v6 02/14] mm: handling Non-LRU pages returned by vm_normal_pages
On 28.06.22 02:14, Alex Sierra wrote: > With DEVICE_COHERENT, we'll soon have vm_normal_pages() return > device-managed anonymous pages that are not LRU pages. Although they > behave like normal pages for purposes of mapping in CPU page, and for > COW. They do not support LRU lists, NUMA migration or THP. > > We also introduced a FOLL_LRU flag that adds the same behaviour to > follow_page and related APIs, to allow callers to specify that they > expect to put pages on an LRU list. > > Signed-off-by: Alex Sierra > Acked-by: Felix Kuehling > Reviewed-by: Alistair Popple > --- I think my review feedback regarding FOLL_LRU has been ignored. -- Thanks, David / dhildenb
Re: [PATCH v6 06/14] mm: add device coherent checker to is_pinnable_page
On 28.06.22 02:14, Alex Sierra wrote: > is_device_coherent checker was added to is_pinnable_page and renamed > to is_longterm_pinnable_page. The reason is that device coherent > pages are not supported for longterm pinning. > > Signed-off-by: Alex Sierra > --- > include/linux/memremap.h | 25 + > include/linux/mm.h | 24 > mm/gup.c | 5 ++--- > mm/gup_test.c| 4 ++-- > mm/hugetlb.c | 2 +- > 5 files changed, 30 insertions(+), 30 deletions(-) Rename of the function should be a separate cleanup patch before any other changes, and the remaining change should be squashed into patch #1, to logically make sense, because it still states "no one should be allowed to pin such memory so that it can always be evicted." Or am I missing something? -- Thanks, David / dhildenb
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote: > > On 6/23/2022 2:57 AM, David Hildenbrand wrote: >> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote: >>> On 6/21/2022 11:16 AM, David Hildenbrand wrote: >>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote: >>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote: >>>>>> On 21.06.22 13:55, Alistair Popple wrote: >>>>>>> David Hildenbrand writes: >>>>>>> >>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote: >>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand: >>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote: >>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point >>>>>>>>>>>>>>> of view. >>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus >>>>>>>>>>>>>>> (like CAPI >>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. >>>>>>>>>>>>>>> However, >>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can >>>>>>>>>>>>>>> always be >>>>>>>>>>>>>>> evicted. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Signed-off-by: Alex Sierra >>>>>>>>>>>>>>> Acked-by: Felix Kuehling >>>>>>>>>>>>>>> Reviewed-by: Alistair Popple >>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes, >>>>>>>>>>>>>>> removed is_dev_private_or_coherent_page] >>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig >>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>include/linux/memremap.h | 19 +++ >>>>>>>>>>>>>>>mm/memcontrol.c | 7 --- >>>>>>>>>>>>>>>mm/memory-failure.c | 8 ++-- >>>>>>>>>>>>>>>mm/memremap.c| 10 ++ >>>>>>>>>>>>>>>mm/migrate_device.c | 16 +++- >>>>>>>>>>>>>>>mm/rmap.c| 5 +++-- >>>>>>>>>>>>>>>6 files changed, 49 insertions(+), 16 deletions(-) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644 >>>>>>>>>>>>>>> --- a/include/linux/memremap.h >>>>>>>>>>>>>>> +++ b/include/linux/memremap.h >>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>>>>>>>>>>>> * A more complete discussion of unaddressable memory >>>>>>>>>>>>>>> may be found in >>>>>>>>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst. >>>>>>>>>>>>>>> * >>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT: >>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU >>>>>>>>>>>>>>> point of view. This >>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like >>>>>>>>>>>>>>> CAPI or CXL). A >>>>
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote: > > On 6/21/2022 11:16 AM, David Hildenbrand wrote: >> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote: >>> On 6/21/2022 7:25 AM, David Hildenbrand wrote: >>>> On 21.06.22 13:55, Alistair Popple wrote: >>>>> David Hildenbrand writes: >>>>> >>>>>> On 21.06.22 13:25, Felix Kuehling wrote: >>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand: >>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote: >>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of >>>>>>>>>>>>> view. >>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like >>>>>>>>>>>>> CAPI >>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. >>>>>>>>>>>>> However, >>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always >>>>>>>>>>>>> be >>>>>>>>>>>>> evicted. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Signed-off-by: Alex Sierra >>>>>>>>>>>>> Acked-by: Felix Kuehling >>>>>>>>>>>>> Reviewed-by: Alistair Popple >>>>>>>>>>>>> [hch: rebased ontop of the refcount changes, >>>>>>>>>>>>>removed is_dev_private_or_coherent_page] >>>>>>>>>>>>> Signed-off-by: Christoph Hellwig >>>>>>>>>>>>> --- >>>>>>>>>>>>> include/linux/memremap.h | 19 +++ >>>>>>>>>>>>> mm/memcontrol.c | 7 --- >>>>>>>>>>>>> mm/memory-failure.c | 8 ++-- >>>>>>>>>>>>> mm/memremap.c| 10 ++ >>>>>>>>>>>>> mm/migrate_device.c | 16 +++- >>>>>>>>>>>>> mm/rmap.c| 5 +++-- >>>>>>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-) >>>>>>>>>>>>> >>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644 >>>>>>>>>>>>> --- a/include/linux/memremap.h >>>>>>>>>>>>> +++ b/include/linux/memremap.h >>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>>>>>>>>>>* A more complete discussion of unaddressable memory may >>>>>>>>>>>>> be found in >>>>>>>>>>>>>* include/linux/hmm.h and Documentation/vm/hmm.rst. >>>>>>>>>>>>>* >>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT: >>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU >>>>>>>>>>>>> point of view. This >>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like >>>>>>>>>>>>> CAPI or CXL). A >>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and >>>>>>>>>>>>> with that memory >>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. >>>>>>>>>>>>> However no one >>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking >>>>>>>>>>>> about special pages &g
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote: > > On 6/21/2022 7:25 AM, David Hildenbrand wrote: >> On 21.06.22 13:55, Alistair Popple wrote: >>> David Hildenbrand writes: >>> >>>> On 21.06.22 13:25, Felix Kuehling wrote: >>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand: >>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote: >>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>>>>>>>> Device memory that is cache coherent from device and CPU point of >>>>>>>>>>> view. >>>>>>>>>>> This is used on platforms that have an advanced system bus (like >>>>>>>>>>> CAPI >>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. >>>>>>>>>>> However, >>>>>>>>>>> no one should be allowed to pin such memory so that it can always be >>>>>>>>>>> evicted. >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Alex Sierra >>>>>>>>>>> Acked-by: Felix Kuehling >>>>>>>>>>> Reviewed-by: Alistair Popple >>>>>>>>>>> [hch: rebased ontop of the refcount changes, >>>>>>>>>>> removed is_dev_private_or_coherent_page] >>>>>>>>>>> Signed-off-by: Christoph Hellwig >>>>>>>>>>> --- >>>>>>>>>>> include/linux/memremap.h | 19 +++ >>>>>>>>>>> mm/memcontrol.c | 7 --- >>>>>>>>>>> mm/memory-failure.c | 8 ++-- >>>>>>>>>>> mm/memremap.c| 10 ++ >>>>>>>>>>> mm/migrate_device.c | 16 +++- >>>>>>>>>>> mm/rmap.c| 5 +++-- >>>>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644 >>>>>>>>>>> --- a/include/linux/memremap.h >>>>>>>>>>> +++ b/include/linux/memremap.h >>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>>>>>>>> * A more complete discussion of unaddressable memory may be >>>>>>>>>>> found in >>>>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst. 
>>>>>>>>>>> * >>>>>>>>>>> + * MEMORY_DEVICE_COHERENT: >>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point >>>>>>>>>>> of view. This >>>>>>>>>>> + * is used on platforms that have an advanced system bus (like >>>>>>>>>>> CAPI or CXL). A >>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with >>>>>>>>>>> that memory >>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. >>>>>>>>>>> However no one >>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking >>>>>>>>>> about special pages >>>>>>>>>> like vdso, shared zeropage, ... pinned pages ... >>>>>>>> Well, you cannot migrate long term pages, that's what I meant :) >>>>>>>> >>>>>>>>>>> + * should be allowed to pin such memory so that it can always be >>>>>>>>>>> evicted. >>>>>>>>>>> + * >>>>>>>>>>> * MEMORY_DEVICE_FS_DAX: >>>>>>>>>>> * Host memory that has similar access semantics as System RAM &g
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 21.06.22 13:55, Alistair Popple wrote: > > David Hildenbrand writes: > >> On 21.06.22 13:25, Felix Kuehling wrote: >>> >>> Am 6/17/22 um 23:19 schrieb David Hildenbrand: >>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: >>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote: >>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>>>>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>>>>>> Device memory that is cache coherent from device and CPU point of >>>>>>>>> view. >>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI >>>>>>>>> or CXL). Any page of a process can be migrated to such memory. >>>>>>>>> However, >>>>>>>>> no one should be allowed to pin such memory so that it can always be >>>>>>>>> evicted. >>>>>>>>> >>>>>>>>> Signed-off-by: Alex Sierra >>>>>>>>> Acked-by: Felix Kuehling >>>>>>>>> Reviewed-by: Alistair Popple >>>>>>>>> [hch: rebased ontop of the refcount changes, >>>>>>>>> removed is_dev_private_or_coherent_page] >>>>>>>>> Signed-off-by: Christoph Hellwig >>>>>>>>> --- >>>>>>>>> include/linux/memremap.h | 19 +++ >>>>>>>>> mm/memcontrol.c | 7 --- >>>>>>>>> mm/memory-failure.c | 8 ++-- >>>>>>>>> mm/memremap.c| 10 ++ >>>>>>>>> mm/migrate_device.c | 16 +++- >>>>>>>>> mm/rmap.c| 5 +++-- >>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>>>> index 8af304f6b504..9f752ebed613 100644 >>>>>>>>> --- a/include/linux/memremap.h >>>>>>>>> +++ b/include/linux/memremap.h >>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>>>>>> * A more complete discussion of unaddressable memory may be >>>>>>>>> found in >>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst. >>>>>>>>> * >>>>>>>>> + * MEMORY_DEVICE_COHERENT: >>>>>>>>> + * Device memory that is cache coherent from device and CPU point of >>>>>>>>> view. 
This >>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI >>>>>>>>> or CXL). A >>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with >>>>>>>>> that memory >>>>>>>>> + * type. Any page of a process can be migrated to such memory. >>>>>>>>> However no one >>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about >>>>>>>> special pages >>>>>>>> like vdso, shared zeropage, ... pinned pages ... >>>>>> Well, you cannot migrate long term pages, that's what I meant :) >>>>>> >>>>>>>>> + * should be allowed to pin such memory so that it can always be >>>>>>>>> evicted. >>>>>>>>> + * >>>>>>>>> * MEMORY_DEVICE_FS_DAX: >>>>>>>>> * Host memory that has similar access semantics as System RAM >>>>>>>>> i.e. DMA >>>>>>>>> * coherent and supports page pinning. In support of coordinating >>>>>>>>> page >>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap { >>>>>>>>> enum memory_type { >>>>>>>>> /* 0 is reserved to catch uninitialized type fields */ >>>>>>>>> MEMORY_DEVICE_PRIVATE = 1, >>>>>>>>> + MEMORY_DEVICE_COHERENT, >>>>>>>>> MEMORY_DEVICE_FS_DAX, &
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 21.06.22 13:25, Felix Kuehling wrote: > > Am 6/17/22 um 23:19 schrieb David Hildenbrand: >> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: >>> On 6/17/2022 12:33 PM, David Hildenbrand wrote: >>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>>>> Device memory that is cache coherent from device and CPU point of view. >>>>>>> This is used on platforms that have an advanced system bus (like CAPI >>>>>>> or CXL). Any page of a process can be migrated to such memory. However, >>>>>>> no one should be allowed to pin such memory so that it can always be >>>>>>> evicted. >>>>>>> >>>>>>> Signed-off-by: Alex Sierra >>>>>>> Acked-by: Felix Kuehling >>>>>>> Reviewed-by: Alistair Popple >>>>>>> [hch: rebased ontop of the refcount changes, >>>>>>> removed is_dev_private_or_coherent_page] >>>>>>> Signed-off-by: Christoph Hellwig >>>>>>> --- >>>>>>> include/linux/memremap.h | 19 +++ >>>>>>> mm/memcontrol.c | 7 --- >>>>>>> mm/memory-failure.c | 8 ++-- >>>>>>> mm/memremap.c| 10 ++ >>>>>>> mm/migrate_device.c | 16 +++- >>>>>>> mm/rmap.c| 5 +++-- >>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-) >>>>>>> >>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>> index 8af304f6b504..9f752ebed613 100644 >>>>>>> --- a/include/linux/memremap.h >>>>>>> +++ b/include/linux/memremap.h >>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>>>> * A more complete discussion of unaddressable memory may be found >>>>>>> in >>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst. >>>>>>> * >>>>>>> + * MEMORY_DEVICE_COHERENT: >>>>>>> + * Device memory that is cache coherent from device and CPU point of >>>>>>> view. This >>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or >>>>>>> CXL). A >>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with >>>>>>> that memory >>>>>>> + * type. 
Any page of a process can be migrated to such memory. However >>>>>>> no one >>>>>> Any page might not be right, I'm pretty sure. ... just thinking about >>>>>> special pages >>>>>> like vdso, shared zeropage, ... pinned pages ... >>>> Well, you cannot migrate long term pages, that's what I meant :) >>>> >>>>>>> + * should be allowed to pin such memory so that it can always be >>>>>>> evicted. >>>>>>> + * >>>>>>> * MEMORY_DEVICE_FS_DAX: >>>>>>> * Host memory that has similar access semantics as System RAM i.e. >>>>>>> DMA >>>>>>> * coherent and supports page pinning. In support of coordinating >>>>>>> page >>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap { >>>>>>> enum memory_type { >>>>>>> /* 0 is reserved to catch uninitialized type fields */ >>>>>>> MEMORY_DEVICE_PRIVATE = 1, >>>>>>> + MEMORY_DEVICE_COHERENT, >>>>>>> MEMORY_DEVICE_FS_DAX, >>>>>>> MEMORY_DEVICE_GENERIC, >>>>>>> MEMORY_DEVICE_PCI_P2PDMA, >>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const >>>>>>> struct folio *folio) >>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I >>>>>> think. >>>>>> >>>>>> >>>>>> However, where exactly is pinning forbidden? >>>>> Long-term pinning is forbidden since it would interfere with the device >>>>> memory manager owning the >>>>> d
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: > > On 6/17/2022 12:33 PM, David Hildenbrand wrote: >> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>> Device memory that is cache coherent from device and CPU point of view. >>>>> This is used on platforms that have an advanced system bus (like CAPI >>>>> or CXL). Any page of a process can be migrated to such memory. However, >>>>> no one should be allowed to pin such memory so that it can always be >>>>> evicted. >>>>> >>>>> Signed-off-by: Alex Sierra >>>>> Acked-by: Felix Kuehling >>>>> Reviewed-by: Alistair Popple >>>>> [hch: rebased ontop of the refcount changes, >>>>> removed is_dev_private_or_coherent_page] >>>>> Signed-off-by: Christoph Hellwig >>>>> --- >>>>>include/linux/memremap.h | 19 +++ >>>>>mm/memcontrol.c | 7 --- >>>>>mm/memory-failure.c | 8 ++-- >>>>>mm/memremap.c| 10 ++ >>>>>mm/migrate_device.c | 16 +++- >>>>>mm/rmap.c| 5 +++-- >>>>>6 files changed, 49 insertions(+), 16 deletions(-) >>>>> >>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>> index 8af304f6b504..9f752ebed613 100644 >>>>> --- a/include/linux/memremap.h >>>>> +++ b/include/linux/memremap.h >>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>> * A more complete discussion of unaddressable memory may be found in >>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst. >>>>> * >>>>> + * MEMORY_DEVICE_COHERENT: >>>>> + * Device memory that is cache coherent from device and CPU point of >>>>> view. This >>>>> + * is used on platforms that have an advanced system bus (like CAPI or >>>>> CXL). A >>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that >>>>> memory >>>>> + * type. Any page of a process can be migrated to such memory. However >>>>> no one >>>> Any page might not be right, I'm pretty sure. ... just thinking about >>>> special pages >>>> like vdso, shared zeropage, ... 
pinned pages ... >> Well, you cannot migrate long term pages, that's what I meant :) >> >>>>> + * should be allowed to pin such memory so that it can always be evicted. >>>>> + * >>>>> * MEMORY_DEVICE_FS_DAX: >>>>> * Host memory that has similar access semantics as System RAM i.e. DMA >>>>> * coherent and supports page pinning. In support of coordinating page >>>>> @@ -61,6 +68,7 @@ struct vmem_altmap { >>>>>enum memory_type { >>>>> /* 0 is reserved to catch uninitialized type fields */ >>>>> MEMORY_DEVICE_PRIVATE = 1, >>>>> + MEMORY_DEVICE_COHERENT, >>>>> MEMORY_DEVICE_FS_DAX, >>>>> MEMORY_DEVICE_GENERIC, >>>>> MEMORY_DEVICE_PCI_P2PDMA, >>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const >>>>> struct folio *folio) >>>> In general, this LGTM, and it should be correct with PageAnonExclusive I >>>> think. >>>> >>>> >>>> However, where exactly is pinning forbidden? >>> Long-term pinning is forbidden since it would interfere with the device >>> memory manager owning the >>> device-coherent pages (e.g. evictions in TTM). However, normal pinning >>> is allowed on this device type. >> I don't see updates to folio_is_pinnable() in this patch. > Device coherent type pages should return true here, as they are pinnable > pages. That function is only called for long-term pinnings in try_grab_folio(). >> >> So wouldn't try_grab_folio() simply pin these pages? What am I missing? > > As far as I understand this return NULL for long term pin pages. > Otherwise they get refcount incremented. I don't follow. 
You're saying

a) folio_is_pinnable() returns true for device coherent pages

and that

b) device coherent pages don't get long-term pinned

Yet, the code says

struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_folio(page, refs);
	else if (flags & FOLL_PIN) {
		struct folio *folio;

		/*
		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
		 * right zone, so fail and let the caller fall back to the slow
		 * path.
		 */
		if (unlikely((flags & FOLL_LONGTERM) &&
			     !is_pinnable_page(page)))
			return NULL;
		...
		return folio;
	}
}

What prevents these pages from getting long-term pinned as stated in this patch? I am probably missing something important.

-- 
Thanks,

David / dhildenb
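The gup-fast behavior David quotes reduces to a small model. This is a userspace sketch with illustrative FOLL_* values, not the kernel's: if is_pinnable_page() (later is_longterm_pinnable_page()) returns false for a page, a FOLL_PIN | FOLL_LONGTERM attempt on the fast path bails out, and the caller falls back to the slow path where migration can run first:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; the kernel defines its own FOLL_* bits. */
#define FOLL_GET      (1u << 0)
#define FOLL_PIN      (1u << 1)
#define FOLL_LONGTERM (1u << 2)

/*
 * Models the quoted try_grab_folio() fast-path decision: returning
 * false here corresponds to returning NULL in the kernel, which makes
 * GUP retry via the slow path.
 */
static bool gup_fast_grab_succeeds(unsigned int flags, bool pinnable)
{
	if (flags & FOLL_GET)
		return true;	/* plain get: no pinnability check */
	if (flags & FOLL_PIN) {
		/* FOLL_LONGTERM on a non-pinnable page: fail and let the
		 * caller fall back to the slow path. */
		if ((flags & FOLL_LONGTERM) && !pinnable)
			return false;
		return true;
	}
	return false;
}
```

The open question in the email is exactly whether device-coherent pages report pinnable == false here; if they report true, nothing on this path stops a long-term pin.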
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: > > On 6/17/2022 4:40 AM, David Hildenbrand wrote: >> On 31.05.22 22:00, Alex Sierra wrote: >>> Device memory that is cache coherent from device and CPU point of view. >>> This is used on platforms that have an advanced system bus (like CAPI >>> or CXL). Any page of a process can be migrated to such memory. However, >>> no one should be allowed to pin such memory so that it can always be >>> evicted. >>> >>> Signed-off-by: Alex Sierra >>> Acked-by: Felix Kuehling >>> Reviewed-by: Alistair Popple >>> [hch: rebased ontop of the refcount changes, >>>removed is_dev_private_or_coherent_page] >>> Signed-off-by: Christoph Hellwig >>> --- >>> include/linux/memremap.h | 19 +++ >>> mm/memcontrol.c | 7 --- >>> mm/memory-failure.c | 8 ++-- >>> mm/memremap.c| 10 ++ >>> mm/migrate_device.c | 16 +++- >>> mm/rmap.c| 5 +++-- >>> 6 files changed, 49 insertions(+), 16 deletions(-) >>> >>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>> index 8af304f6b504..9f752ebed613 100644 >>> --- a/include/linux/memremap.h >>> +++ b/include/linux/memremap.h >>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>* A more complete discussion of unaddressable memory may be found in >>>* include/linux/hmm.h and Documentation/vm/hmm.rst. >>>* >>> + * MEMORY_DEVICE_COHERENT: >>> + * Device memory that is cache coherent from device and CPU point of view. >>> This >>> + * is used on platforms that have an advanced system bus (like CAPI or >>> CXL). A >>> + * driver can hotplug the device memory using ZONE_DEVICE and with that >>> memory >>> + * type. Any page of a process can be migrated to such memory. However no >>> one >> Any page might not be right, I'm pretty sure. ... just thinking about >> special pages >> like vdso, shared zeropage, ... pinned pages ... > Well, you cannot migrate long term pages, that's what I meant :) >> >>> + * should be allowed to pin such memory so that it can always be evicted. 
>>> + * >>>* MEMORY_DEVICE_FS_DAX: >>>* Host memory that has similar access semantics as System RAM i.e. DMA >>>* coherent and supports page pinning. In support of coordinating page >>> @@ -61,6 +68,7 @@ struct vmem_altmap { >>> enum memory_type { >>> /* 0 is reserved to catch uninitialized type fields */ >>> MEMORY_DEVICE_PRIVATE = 1, >>> + MEMORY_DEVICE_COHERENT, >>> MEMORY_DEVICE_FS_DAX, >>> MEMORY_DEVICE_GENERIC, >>> MEMORY_DEVICE_PCI_P2PDMA, >>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const >>> struct folio *folio) >> In general, this LGTM, and it should be correct with PageAnonExclusive I >> think. >> >> >> However, where exactly is pinning forbidden? > > Long-term pinning is forbidden since it would interfere with the device > memory manager owning the > device-coherent pages (e.g. evictions in TTM). However, normal pinning > is allowed on this device type. I don't see updates to folio_is_pinnable() in this patch. So wouldn't try_grab_folio() simply pin these pages? What am I missing? -- Thanks, David / dhildenb
Re: [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages
On 31.05.22 22:00, Alex Sierra wrote: > With DEVICE_COHERENT, we'll soon have vm_normal_pages() return > device-managed anonymous pages that are not LRU pages. Although they > behave like normal pages for purposes of mapping in CPU page, and for > COW. They do not support LRU lists, NUMA migration or THP. > > We also introduced a FOLL_LRU flag that adds the same behaviour to > follow_page and related APIs, to allow callers to specify that they > expect to put pages on an LRU list. > > Signed-off-by: Alex Sierra > Acked-by: Felix Kuehling > --- > fs/proc/task_mmu.c | 2 +- > include/linux/mm.h | 3 ++- > mm/gup.c | 6 +- > mm/huge_memory.c | 2 +- > mm/khugepaged.c| 9 ++--- > mm/ksm.c | 6 +++--- > mm/madvise.c | 4 ++-- > mm/memory.c| 9 - > mm/mempolicy.c | 2 +- > mm/migrate.c | 4 ++-- > mm/mlock.c | 2 +- > mm/mprotect.c | 2 +- > 12 files changed, 33 insertions(+), 18 deletions(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 2d04e3470d4c..2dd8c8a66924 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1792,7 +1792,7 @@ static struct page *can_gather_numa_stats(pte_t pte, > struct vm_area_struct *vma, > return NULL; > > page = vm_normal_page(vma, addr, pte); > - if (!page) > + if (!page || is_zone_device_page(page)) > return NULL; > > if (PageReserved(page)) > diff --git a/include/linux/mm.h b/include/linux/mm.h > index bc8f326be0ce..d3f43908ff8d 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -601,7 +601,7 @@ struct vm_operations_struct { > #endif > /* >* Called by vm_normal_page() for special PTEs to find the > - * page for @addr. This is useful if the default behavior > + * page for @addr. This is useful if the default behavior >* (using pte_page()) would not find the correct page. 
>	 */
>	struct page *(*find_special_page)(struct vm_area_struct *vma,
> @@ -2934,6 +2934,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
>  #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
> +#define FOLL_LRU	0x1000	/* return only LRU (anon or page cache) */

Does that statement hold for special pages like the shared zeropage?

Also, this flag is only valid for in-kernel follow_page() but not for the
ordinary GUP interfaces. What are the semantics there? Is it fenced?

I really wonder if you should simply similarly teach the handful of users of
follow_page() to just special-case these pages ... sounds cleaner to me than
adding flags with unclear semantics. Alternatively, properly document what
that flag is actually doing and where it applies.

I know, there was discussion on ... sorry for jumping in now, but this
doesn't look clean to me yet.

--
Thanks,

David / dhildenb
Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
On 31.05.22 22:00, Alex Sierra wrote: > Device memory that is cache coherent from device and CPU point of view. > This is used on platforms that have an advanced system bus (like CAPI > or CXL). Any page of a process can be migrated to such memory. However, > no one should be allowed to pin such memory so that it can always be > evicted. > > Signed-off-by: Alex Sierra > Acked-by: Felix Kuehling > Reviewed-by: Alistair Popple > [hch: rebased ontop of the refcount changes, > removed is_dev_private_or_coherent_page] > Signed-off-by: Christoph Hellwig > --- > include/linux/memremap.h | 19 +++ > mm/memcontrol.c | 7 --- > mm/memory-failure.c | 8 ++-- > mm/memremap.c| 10 ++ > mm/migrate_device.c | 16 +++- > mm/rmap.c| 5 +++-- > 6 files changed, 49 insertions(+), 16 deletions(-) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 8af304f6b504..9f752ebed613 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -41,6 +41,13 @@ struct vmem_altmap { > * A more complete discussion of unaddressable memory may be found in > * include/linux/hmm.h and Documentation/vm/hmm.rst. > * > + * MEMORY_DEVICE_COHERENT: > + * Device memory that is cache coherent from device and CPU point of view. > This > + * is used on platforms that have an advanced system bus (like CAPI or CXL). > A > + * driver can hotplug the device memory using ZONE_DEVICE and with that > memory > + * type. Any page of a process can be migrated to such memory. However no one Any page might not be right, I'm pretty sure. ... just thinking about special pages like vdso, shared zeropage, ... pinned pages ... > + * should be allowed to pin such memory so that it can always be evicted. > + * > * MEMORY_DEVICE_FS_DAX: > * Host memory that has similar access semantics as System RAM i.e. DMA > * coherent and supports page pinning. 
In support of coordinating page > @@ -61,6 +68,7 @@ struct vmem_altmap { > enum memory_type { > /* 0 is reserved to catch uninitialized type fields */ > MEMORY_DEVICE_PRIVATE = 1, > + MEMORY_DEVICE_COHERENT, > MEMORY_DEVICE_FS_DAX, > MEMORY_DEVICE_GENERIC, > MEMORY_DEVICE_PCI_P2PDMA, > @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct > folio *folio) In general, this LGTM, and it should be correct with PageAnonExclusive I think. However, where exactly is pinning forbidden? -- Thanks, David / dhildenb
Re: [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping
On 17.06.22 04:19, Andrew Morton wrote:
> On Tue, 31 May 2022 15:00:28 -0500 Alex Sierra wrote:
>
>> This is our MEMORY_DEVICE_COHERENT patch series rebased and updated
>> for current 5.18.0
>
> I plan to move this series into the non-rebasing mm-stable branch in a
> few days. Unless sternly told not to do so!
>

I want to double-check some things regarding PageAnonExclusive
interaction. I'm busy, but I'll try prioritizing it.

--
Thanks,

David / dhildenb
Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
On 31.03.22 10:53, Christoph Hellwig wrote:
>> -	page = vm_normal_page(vma, addr, pte);
>> +	page = vm_normal_lru_page(vma, addr, pte);
>
> Why can't this deal with ZONE_DEVICE pages? It certainly has
> nothing to do with an LRU, I think. In fact, being able to have
> stats that count say the number of device pages here would
> probably be useful at some point.
>
> In general I find the vm_normal_lru_page vs vm_normal_page
> API highly confusing. An explicit check for zone device pages
> in the dozen or so spots that care has a much better documentation
> value, especially if accompanied by comments where it isn't entirely
> obvious.

What's your thought on FOLL_LRU?

--
Thanks,

David / dhildenb
Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling
On 17.03.22 03:54, Alistair Popple wrote: > Felix Kuehling writes: > >> On 2022-03-11 04:16, David Hildenbrand wrote: >>> On 10.03.22 18:26, Alex Sierra wrote: >>>> DEVICE_COHERENT pages introduce a subtle distinction in the way >>>> "normal" pages can be used by various callers throughout the kernel. >>>> They behave like normal pages for purposes of mapping in CPU page >>>> tables, and for COW. But they do not support LRU lists, NUMA >>>> migration or THP. Therefore we split vm_normal_page into two >>>> functions vm_normal_any_page and vm_normal_lru_page. The latter will >>>> only return pages that can be put on an LRU list and that support >>>> NUMA migration, KSM and THP. >>>> >>>> We also introduced a FOLL_LRU flag that adds the same behaviour to >>>> follow_page and related APIs, to allow callers to specify that they >>>> expect to put pages on an LRU list. >>>> >>> I still don't see the need for s/vm_normal_page/vm_normal_any_page/. And >>> as this patch is dominated by that change, I'd suggest (again) to just >>> drop it as I don't see any value of that renaming. No specifier implies any. >> >> OK. If nobody objects, we can adopts that naming convention. > > I'd prefer we avoid the churn too, but I don't think we should make > vm_normal_page() the equivalent of vm_normal_any_page(). It would mean > vm_normal_page() would return non-LRU device coherent pages, but to me at > least > device coherent pages seem special and not what I'd expect from a function > with > "normal" in the name. > > So I think it would be better to s/vm_normal_lru_page/vm_normal_page/ and keep > vm_normal_any_page() (or perhaps call it vm_any_page?). 
This is basically what > the previous incarnation of this feature did: > > struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, > pte_t pte, bool with_public_device); > #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false) > > Except we should add: > > #define vm_normal_any_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, > true) > "normal" simply tells us that this is not a special mapping -- IOW, we want the VM to take a look at the memmap and not treat it like a PFN map. What we're changing is that we're now also returning non-lru pages. Fair enough, that's why we introduce vm_normal_lru_page() as a replacement where we really can only deal with lru pages. vm_normal_page vs vm_normal_lru_page is good enough. "lru" further limits what we get via vm_normal_page, that's even how it's implemented. vm_normal_page vs vm_normal_any_page is confusing IMHO. -- Thanks, David / dhildenb
Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling
On 10.03.22 18:26, Alex Sierra wrote: > DEVICE_COHERENT pages introduce a subtle distinction in the way > "normal" pages can be used by various callers throughout the kernel. > They behave like normal pages for purposes of mapping in CPU page > tables, and for COW. But they do not support LRU lists, NUMA > migration or THP. Therefore we split vm_normal_page into two > functions vm_normal_any_page and vm_normal_lru_page. The latter will > only return pages that can be put on an LRU list and that support > NUMA migration, KSM and THP. > > We also introduced a FOLL_LRU flag that adds the same behaviour to > follow_page and related APIs, to allow callers to specify that they > expect to put pages on an LRU list. > I still don't see the need for s/vm_normal_page/vm_normal_any_page/. And as this patch is dominated by that change, I'd suggest (again) to just drop it as I don't see any value of that renaming. No specifier implies any. The general idea of this change LGTM. I wonder how this interacts with the actual DEVICE_COHERENT coherent series. Is this a preparation? Should it be part of the DEVICE_COHERENT series? IOW, should this patch start with "With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW, they do not support LRU lists, NUMA migration or THP. [...]" But then, I'm confused by patch 2 and 3, because it feels more like we'd already have DEVICE_COHERENT then ("hmm_is_coherent_type"). -- Thanks, David / dhildenb
Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling
>>
>>>  	if (PageReserved(page))
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index c31d04b46a5e..17d049311b78 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>>>  		goto out;
>>>
>>>  	/* FOLL_DUMP to ignore special (like zero) pages */
>>> -	follflags = FOLL_GET | FOLL_DUMP;
>>> +	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>>  	page = follow_page(vma, addr, follflags);
>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
>
> This function later calls isolate_lru_page, which is something you can't
> do with a device page.
>

Then, that code might require care instead. We most certainly don't want
to have random memory holes in a dump just because some anonymous memory
was migrated to DEVICE_COHERENT.

--
Thanks,

David / dhildenb
Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling
On 01.03.22 17:30, Felix Kuehling wrote:
> On 2022-03-01 11:22, David Hildenbrand wrote:
>>>>>  	if (PageReserved(page))
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index c31d04b46a5e..17d049311b78 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>>>>>  		goto out;
>>>>>
>>>>>  	/* FOLL_DUMP to ignore special (like zero) pages */
>>>>> -	follflags = FOLL_GET | FOLL_DUMP;
>>>>> +	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>>>>  	page = follow_page(vma, addr, follflags);
>>>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
>>> This function later calls isolate_lru_page, which is something you can't
>>> do with a device page.
>>>
>> Then, that code might require care instead. We most certainly don't want
>> to have random memory holes in a dump just because some anonymous memory
>> was migrated to DEVICE_COHERENT.
> I don't think this code is for core dumps. The call chain I see is
>
> SYSCALL_DEFINE6(move_pages, ...) -> kernel_move_pages -> do_pages_move
> -> add_page_for_migration
>

Ah, sorry, I got misled by FOLL_DUMP and thought we'd be in
get_dump_page(). As you said, nothing to do.

--
Thanks,

David / dhildenb
Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling
On 28.02.22 21:34, Alex Sierra wrote: > DEVICE_COHERENT pages introduce a subtle distinction in the way > "normal" pages can be used by various callers throughout the kernel. > They behave like normal pages for purposes of mapping in CPU page > tables, and for COW. But they do not support LRU lists, NUMA > migration or THP. Therefore we split vm_normal_page into two > functions vm_normal_any_page and vm_normal_lru_page. The latter will > only return pages that can be put on an LRU list and that support > NUMA migration and THP. Why not s/vm_normal_any_page/vm_normal_page/ and avoid code churn? > > We also introduced a FOLL_LRU flag that adds the same behaviour to > follow_page and related APIs, to allow callers to specify that they > expect to put pages on an LRU list. [...] > -#define FOLL_WRITE 0x01/* check pte is writable */ > -#define FOLL_TOUCH 0x02/* mark page accessed */ > -#define FOLL_GET 0x04/* do get_page on page */ > -#define FOLL_DUMP0x08/* give error on hole if it would be zero */ > -#define FOLL_FORCE 0x10/* get_user_pages read/write w/o permission */ > -#define FOLL_NOWAIT 0x20/* if a disk transfer is needed, start the IO > - * and return without waiting upon it */ > -#define FOLL_POPULATE0x40/* fault in pages (with FOLL_MLOCK) */ > -#define FOLL_NOFAULT 0x80/* do not fault in pages */ > -#define FOLL_HWPOISON0x100 /* check page is hwpoisoned */ > -#define FOLL_NUMA0x200 /* force NUMA hinting page fault */ > -#define FOLL_MIGRATION 0x400 /* wait for page to replace migration > entry */ > -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ > -#define FOLL_MLOCK 0x1000 /* lock present pages */ > -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ > -#define FOLL_COW 0x4000 /* internal GUP flag */ > -#define FOLL_ANON0x8000 /* don't do file mappings */ > -#define FOLL_LONGTERM0x1 /* mapping lifetime is indefinite: see > below */ > -#define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */ > -#define FOLL_PIN 
0x4 /* pages must be released via unpin_user_page */ > -#define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow > gup */ > +#define FOLL_WRITE 0x01 /* check pte is writable */ > +#define FOLL_TOUCH 0x02 /* mark page accessed */ > +#define FOLL_GET 0x04 /* do get_page on page */ > +#define FOLL_DUMP0x08 /* give error on hole if it would be zero */ > +#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ > +#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO > + * and return without waiting upon it */ > +#define FOLL_POPULATE0x40 /* fault in pages (with FOLL_MLOCK) */ > +#define FOLL_NOFAULT 0x80 /* do not fault in pages */ > +#define FOLL_HWPOISON0x100/* check page is hwpoisoned */ > +#define FOLL_NUMA0x200/* force NUMA hinting page fault */ > +#define FOLL_MIGRATION 0x400/* wait for page to replace migration > entry */ > +#define FOLL_TRIED 0x800/* a retry, previous pass started an IO */ > +#define FOLL_MLOCK 0x1000 /* lock present pages */ > +#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ > +#define FOLL_COW 0x4000 /* internal GUP flag */ > +#define FOLL_ANON0x8000 /* don't do file mappings */ > +#define FOLL_LONGTERM0x1 /* mapping lifetime is indefinite: see > below */ > +#define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */ > +#define FOLL_PIN 0x4 /* pages must be released via unpin_user_page > */ > +#define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow > gup */ > +#define FOLL_LRU 0x10 /* return only LRU (anon or page cache) */ > Can we minimize code churn, please? 
>  	if (PageReserved(page))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c31d04b46a5e..17d049311b78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  		goto out;
>
>  	/* FOLL_DUMP to ignore special (like zero) pages */
> -	follflags = FOLL_GET | FOLL_DUMP;
> +	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>  	page = follow_page(vma, addr, follflags);

Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.

--
Thanks,

David / dhildenb
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 16.02.22 03:36, Alistair Popple wrote: > On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote: >> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote: >> >>> Device private and device coherent pages are not marked with pte_devmap and >>> they >>> are backed by a struct page. The only way of inserting them is via >>> migrate_vma. >>> The refcount is decremented in zap_pte_range() on munmap() with special >>> handling >>> for device private pages. Looking at it again though I wonder if there is >>> any >>> special treatment required in zap_pte_range() for device coherent pages >>> given >>> they count as present pages. >> >> This is what I guessed, but we shouldn't be able to just drop >> pte_devmap on these pages without any other work?? Granted it does >> very little already.. > > Yes, I agree we need to check this more closely. For device private pages > not having pte_devmap is fine, because they are non-present swap entries so > they always get special handling in the swap entry paths but the same isn't > true for coherent device pages. I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page look like when mapped? I'd assume it's also (currently) still offset by one, meaning, if it's mapped into a single page table it's always at least 2. Just a note that if my assumption is correct and if we'd have such a page mapped R/O, do_wp_page() would always have to copy it unconditionally and would not be able to reuse it on write faults. (while I'm working on improving the reuse logic, I think there is also work in progress to avoid this additional reference on some ZONE_DEVICE stuff -- I'd assume that would include DEVICE_COHERENT ?) > >> I thought at least gup_fast needed to be touched or did this get >> handled by scanning the page list after the fact? > > Right, for gup I think the only special handling required is to prevent > pinning. 
I had assumed that check_and_migrate_movable_pages() would still get > called for gup_fast but unless I've missed something I don't think it does. > That means gup_fast could still pin movable and coherent pages. Technically > that is ok for coherent pages, but it's undesirable. We really should have the same pinning rules for GUP vs. GUP-fast. is_pinnable_page() should be the right place for such checks (similarly as indicated in my reply to the migration series). -- Thanks, David / dhildenb
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 11.02.22 18:07, Felix Kuehling wrote: > > Am 2022-02-11 um 11:39 schrieb David Hildenbrand: >> On 11.02.22 17:15, David Hildenbrand wrote: >>> On 01.02.22 16:48, Alex Sierra wrote: >>>> Device memory that is cache coherent from device and CPU point of view. >>>> This is used on platforms that have an advanced system bus (like CAPI >>>> or CXL). Any page of a process can be migrated to such memory. However, >>>> no one should be allowed to pin such memory so that it can always be >>>> evicted. >>>> >>>> Signed-off-by: Alex Sierra >>>> Acked-by: Felix Kuehling >>>> Reviewed-by: Alistair Popple >>> So, I'm currently messing with PageAnon() pages and CoW semantics ... >>> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life >>> easier but I'm not sure yet if they make my life harder. I hope you can >>> help me understand some of that stuff. >>> >>> 1) What are expected CoW semantics for DEVICE_COHERENT? >>> >>> I assume we'll share them just like other PageAnon() pages during fork() >>> readable, and the first sharer writing to them receives an "ordinary" >>> !ZONE_DEVICE copy. >>> >>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just >>> that we don't have to go through the loop of restoring a device >>> exclusive entry? >>> >>> 2) How are these pages freed to clear/invalidate PageAnon() ? >>> >>> I assume for PageAnon() ZONE_DEVICE pages we'll always for via >>> free_devmap_managed_page(), correct? >>> >>> >>> 3) FOLL_PIN >>> >>> While you write "no one should be allowed to pin such memory", patch #2 >>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and >>> you might want to be a bit more precise? >>> >>> >>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we >>> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so? >>> >>> >>> Thanks for any information. 
>>> >> (digging a bit more, I realized that device exclusive pages are not >> actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT >> will be the actual first PageAnon() ZONE_DEVICE pages that can be >> present in a page table.) > > I think DEVICE_GENERIC pages can also be mapped in the page table. In > fact, the first version of our patches attempted to add migration > support to DEVICE_GENERIC. But we were convinced to create a new > ZONE_DEVICE page type for our use case instead. Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My assumption was that they would be part of a special mapping. -- Thanks, David / dhildenb
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 11.02.22 17:56, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
>> On 11.02.22 17:45, Jason Gunthorpe wrote:
>>> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>>>
>>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>>>
>>> Currently the only way to get a DEVICE_PRIVATE page out of the page
>>> tables is via hmm_range_fault() and that doesn't manipulate any ref
>>> counts.
>>
>> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
>> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
>> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
>> present in the page tables where GUP could FOLL_PIN them.
>
> This is my understanding
>
> Though you probably understand what PageAnon means a lot better than I
> do.. I wonder if it really makes sense to talk about that together
> with ZONE_DEVICE, which has a lot in common with filesystem-originated
> pages too.

For me, PageAnon() means that modifications are visible only to the
modifying process. On actual CoW, the underlying page will get replaced --
in the world of DEVICE_COHERENT that would mean that once you write to a
DEVICE_COHERENT page you could suddenly have a !DEVICE_COHERENT page.

PageAnon() pages don't have a mapping, thus they can only be found in
MAP_ANON VMAs or in file-backed MAP_PRIVATE VMAs. They can only be found
via a page table, and not looked up via the page cache (excluding the
swap cache).

So if we have PageAnon() pages on ZONE_DEVICE, they generally have the
exact same semantics as !ZONE_DEVICE pages, but the way they "appear" in
the page tables and the allocation/freeing path differ -- I guess :)

... and as we want pinning semantics to be different we have to touch GUP.
> > I'm not sure what AMDs plan is here, is there an expecation that a GPU > driver will somehow stuff these pages into an existing anonymous > memory VMA or do they always come from a driver originated VMA? My understanding is that a driver can just decide to replace "ordinary" PageAnon() pages e.g., in a MAP_ANON VMA by these pages. Hopefully AMD can clarify. -- Thanks, David / dhildenb
Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing
On 11.02.22 00:41, Alistair Popple wrote: > On Thursday, 10 February 2022 10:47:35 PM AEDT David Hildenbrand wrote: >> On 10.02.22 12:39, Alistair Popple wrote: >>> On Thursday, 10 February 2022 9:53:38 PM AEDT David Hildenbrand wrote: >>>> On 07.02.22 05:26, Alistair Popple wrote: >>>>> Currently any attempts to pin a device coherent page will fail. This is >>>>> because device coherent pages need to be managed by a device driver, and >>>>> pinning them would prevent a driver from migrating them off the device. >>>>> >>>>> However this is no reason to fail pinning of these pages. These are >>>>> coherent and accessible from the CPU so can be migrated just like >>>>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin >>>>> them first try migrating them out of ZONE_DEVICE. >>>>> >>>>> Signed-off-by: Alistair Popple >>>>> Acked-by: Felix Kuehling >>>>> --- >>>>> >>>>> Changes for v2: >>>>> >>>>> - Added Felix's Acked-by >>>>> - Fixed missing check for dpage == NULL >>>>> >>>>> mm/gup.c | 105 ++-- >>>>> 1 file changed, 95 insertions(+), 10 deletions(-) >>>>> >>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>> index 56d9577..5e826db 100644 >>>>> --- a/mm/gup.c >>>>> +++ b/mm/gup.c >>>>> @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr) >>>>> >>>>> #ifdef CONFIG_MIGRATION >>>>> /* >>>>> + * Migrates a device coherent page back to normal memory. Caller should >>>>> have a >>>>> + * reference on page which will be copied to the new page if migration is >>>>> + * successful or dropped on failure. 
>>>>> + */ >>>>> +static struct page *migrate_device_page(struct page *page, >>>>> + unsigned int gup_flags) >>>>> +{ >>>>> + struct page *dpage; >>>>> + struct migrate_vma args; >>>>> + unsigned long src_pfn, dst_pfn = 0; >>>>> + >>>>> + lock_page(page); >>>>> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; >>>>> + args.src = &src_pfn; >>>>> + args.dst = &dst_pfn; >>>>> + args.cpages = 1; >>>>> + args.npages = 1; >>>>> + args.vma = NULL; >>>>> + migrate_vma_setup(&args); >>>>> + if (!(src_pfn & MIGRATE_PFN_MIGRATE)) >>>>> + return NULL; >>>>> + >>>>> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); >>>>> + >>>>> + /* >>>>> + * get/pin the new page now so we don't have to retry gup after >>>>> + * migrating. We already have a reference so this should never fail. >>>>> + */ >>>>> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { >>>>> + __free_pages(dpage, 0); >>>>> + dpage = NULL; >>>>> + } >>>>> + >>>>> + if (dpage) { >>>>> + lock_page(dpage); >>>>> + dst_pfn = migrate_pfn(page_to_pfn(dpage)); >>>>> + } >>>>> + >>>>> + migrate_vma_pages(&args); >>>>> + if (src_pfn & MIGRATE_PFN_MIGRATE) >>>>> + copy_highpage(dpage, page); >>>>> + migrate_vma_finalize(&args); >>>>> + if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) { >>>>> + if (gup_flags & FOLL_PIN) >>>>> + unpin_user_page(dpage); >>>>> + else >>>>> + put_page(dpage); >>>>> + dpage = NULL; >>>>> + } >>>>> + >>>>> + return dpage; >>>>> +} >>>>> + >>>>> +/* >>>>> * Check whether all pages are pinnable, if so return number of pages. >>>>> If some >>>>> * pages are not pinnable, migrate them, and unpin all pages. Return >>>>> zero if >>>>> * pages were migrated, or if some pages were not successfully isolated. >>>>> @@ -1888,15 +1942,40 @@ static long >>>&
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 11.02.22 17:15, David Hildenbrand wrote: > On 01.02.22 16:48, Alex Sierra wrote: >> Device memory that is cache coherent from device and CPU point of view. >> This is used on platforms that have an advanced system bus (like CAPI >> or CXL). Any page of a process can be migrated to such memory. However, >> no one should be allowed to pin such memory so that it can always be >> evicted. >> >> Signed-off-by: Alex Sierra >> Acked-by: Felix Kuehling >> Reviewed-by: Alistair Popple > > So, I'm currently messing with PageAnon() pages and CoW semantics ... > all these PageAnon() ZONE_DEVICE variants don't necessarily make my life > easier but I'm not sure yet if they make my life harder. I hope you can > help me understand some of that stuff. > > 1) What are expected CoW semantics for DEVICE_COHERENT? > > I assume we'll share them just like other PageAnon() pages during fork() > readable, and the first sharer writing to them receives an "ordinary" > !ZONE_DEVICE copy. > > So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just > that we don't have to go through the loop of restoring a device > exclusive entry? > > 2) How are these pages freed to clear/invalidate PageAnon() ? > > I assume for PageAnon() ZONE_DEVICE pages we'll always for via > free_devmap_managed_page(), correct? > > > 3) FOLL_PIN > > While you write "no one should be allowed to pin such memory", patch #2 > only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and > you might want to be a bit more precise? > > > ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we > FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so? > > > Thanks for any information. > (digging a bit more, I realized that device exclusive pages are not actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT will be the actual first PageAnon() ZONE_DEVICE pages that can be present in a page table.) -- Thanks, David / dhildenb
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 11.02.22 17:45, Jason Gunthorpe wrote: > On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote: > >> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages > > Currently the only way to get a DEVICE_PRIVATE page out of the page > tables is via hmm_range_fault() and that doesn't manipulate any ref > counts. Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are essentially just pointers at ordinary PageAnon() pages. So with DEVICE COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as present in the page tables where GUP could FOLL_PIN them. -- Thanks, David / dhildenb
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On 01.02.22 16:48, Alex Sierra wrote: > Device memory that is cache coherent from device and CPU point of view. > This is used on platforms that have an advanced system bus (like CAPI > or CXL). Any page of a process can be migrated to such memory. However, > no one should be allowed to pin such memory so that it can always be > evicted. > > Signed-off-by: Alex Sierra > Acked-by: Felix Kuehling > Reviewed-by: Alistair Popple So, I'm currently messing with PageAnon() pages and CoW semantics ... all these PageAnon() ZONE_DEVICE variants don't necessarily make my life easier but I'm not sure yet if they make my life harder. I hope you can help me understand some of that stuff. 1) What are expected CoW semantics for DEVICE_COHERENT? I assume we'll share them just like other PageAnon() pages during fork(), mapped readable, and the first sharer writing to them receives an "ordinary" !ZONE_DEVICE copy. So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just that we don't have to go through the loop of restoring a device exclusive entry? 2) How are these pages freed to clear/invalidate PageAnon() ? I assume for PageAnon() ZONE_DEVICE pages we'll always free via free_devmap_managed_page(), correct? 3) FOLL_PIN While you write "no one should be allowed to pin such memory", patch #2 only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and you might want to be a bit more precise? ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so? Thanks for any information. -- Thanks, David / dhildenb
Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing
On 07.02.22 05:26, Alistair Popple wrote: > Currently any attempts to pin a device coherent page will fail. This is > because device coherent pages need to be managed by a device driver, and > pinning them would prevent a driver from migrating them off the device. > > However this is no reason to fail pinning of these pages. These are > coherent and accessible from the CPU so can be migrated just like > pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin > them first try migrating them out of ZONE_DEVICE. > > Signed-off-by: Alistair Popple > Acked-by: Felix Kuehling > --- > > Changes for v2: > > - Added Felix's Acked-by > - Fixed missing check for dpage == NULL > > mm/gup.c | 105 ++-- > 1 file changed, 95 insertions(+), 10 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 56d9577..5e826db 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr) > > #ifdef CONFIG_MIGRATION > /* > + * Migrates a device coherent page back to normal memory. Caller should have > a > + * reference on page which will be copied to the new page if migration is > + * successful or dropped on failure. > + */ > +static struct page *migrate_device_page(struct page *page, > + unsigned int gup_flags) > +{ > + struct page *dpage; > + struct migrate_vma args; > + unsigned long src_pfn, dst_pfn = 0; > + > + lock_page(page); > + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; > + args.src = &src_pfn; > + args.dst = &dst_pfn; > + args.cpages = 1; > + args.npages = 1; > + args.vma = NULL; > + migrate_vma_setup(&args); > + if (!(src_pfn & MIGRATE_PFN_MIGRATE)) > + return NULL; > + > + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); > + > + /* > + * get/pin the new page now so we don't have to retry gup after > + * migrating. We already have a reference so this should never fail. 
> + */ > + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { > + __free_pages(dpage, 0); > + dpage = NULL; > + } > + > + if (dpage) { > + lock_page(dpage); > + dst_pfn = migrate_pfn(page_to_pfn(dpage)); > + } > + > + migrate_vma_pages(&args); > + if (src_pfn & MIGRATE_PFN_MIGRATE) > + copy_highpage(dpage, page); > + migrate_vma_finalize(&args); > + if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) { > + if (gup_flags & FOLL_PIN) > + unpin_user_page(dpage); > + else > + put_page(dpage); > + dpage = NULL; > + } > + > + return dpage; > +} > + > +/* > * Check whether all pages are pinnable, if so return number of pages. If > some > * pages are not pinnable, migrate them, and unpin all pages. Return zero if > * pages were migrated, or if some pages were not successfully isolated. > @@ -1888,15 +1942,40 @@ static long check_and_migrate_movable_pages(unsigned > long nr_pages, > continue; > prev_head = head; > /* > - * If we get a movable page, since we are going to be pinning > - * these entries, try to move them out if possible. > + * Device coherent pages are managed by a driver and should not > + * be pinned indefinitely as it prevents the driver moving the > + * page. So when trying to pin with FOLL_LONGTERM instead try > + * migrating page out of device memory. >*/ > if (is_dev_private_or_coherent_page(head)) { > + /* > + * device private pages will get faulted in during gup > + * so it shouldn't be possible to see one here. > + */ > WARN_ON_ONCE(is_device_private_page(head)); > - ret = -EFAULT; > - goto unpin_pages; > + WARN_ON_ONCE(PageCompound(head)); > + > + /* > + * migration will fail if the page is pinned, so convert > + * the pin on the source page to a normal reference. > + */ > + if (gup_flags & FOLL_PIN) { > + get_page(head); > + unpin_user_page(head); > + } > + > + pages[i] = migrate_device_page(head, gup_flags); For ordinary migrate_pages(), we'll unpin all pages and return 0 so the caller will retry pinning by walking the page tables again. 
Why can't we apply the same mechanism here? This "let's avoid another walk" looks unnecessarily complicated to me, but I might be wrong. -- Thanks, David / dhildenb
Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing
On 10.02.22 12:39, Alistair Popple wrote: > On Thursday, 10 February 2022 9:53:38 PM AEDT David Hildenbrand wrote: >> On 07.02.22 05:26, Alistair Popple wrote: >>> Currently any attempts to pin a device coherent page will fail. This is >>> because device coherent pages need to be managed by a device driver, and >>> pinning them would prevent a driver from migrating them off the device. >>> >>> However this is no reason to fail pinning of these pages. These are >>> coherent and accessible from the CPU so can be migrated just like >>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin >>> them first try migrating them out of ZONE_DEVICE. >>> >>> Signed-off-by: Alistair Popple >>> Acked-by: Felix Kuehling >>> --- >>> >>> Changes for v2: >>> >>> - Added Felix's Acked-by >>> - Fixed missing check for dpage == NULL >>> >>> mm/gup.c | 105 ++-- >>> 1 file changed, 95 insertions(+), 10 deletions(-) >>> >>> diff --git a/mm/gup.c b/mm/gup.c >>> index 56d9577..5e826db 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -1861,6 +1861,60 @@ struct page *get_dump_page(unsigned long addr) >>> >>> #ifdef CONFIG_MIGRATION >>> /* >>> + * Migrates a device coherent page back to normal memory. Caller should >>> have a >>> + * reference on page which will be copied to the new page if migration is >>> + * successful or dropped on failure. 
>>> + */ >>> +static struct page *migrate_device_page(struct page *page, >>> + unsigned int gup_flags) >>> +{ >>> + struct page *dpage; >>> + struct migrate_vma args; >>> + unsigned long src_pfn, dst_pfn = 0; >>> + >>> + lock_page(page); >>> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; >>> + args.src = &src_pfn; >>> + args.dst = &dst_pfn; >>> + args.cpages = 1; >>> + args.npages = 1; >>> + args.vma = NULL; >>> + migrate_vma_setup(&args); >>> + if (!(src_pfn & MIGRATE_PFN_MIGRATE)) >>> + return NULL; >>> + >>> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); >>> + >>> + /* >>> +* get/pin the new page now so we don't have to retry gup after >>> +* migrating. We already have a reference so this should never fail. >>> +*/ >>> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { >>> + __free_pages(dpage, 0); >>> + dpage = NULL; >>> + } >>> + >>> + if (dpage) { >>> + lock_page(dpage); >>> + dst_pfn = migrate_pfn(page_to_pfn(dpage)); >>> + } >>> + >>> + migrate_vma_pages(&args); >>> + if (src_pfn & MIGRATE_PFN_MIGRATE) >>> + copy_highpage(dpage, page); >>> + migrate_vma_finalize(&args); >>> + if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) { >>> + if (gup_flags & FOLL_PIN) >>> + unpin_user_page(dpage); >>> + else >>> + put_page(dpage); >>> + dpage = NULL; >>> + } >>> + >>> + return dpage; >>> +} >>> + >>> +/* >>> * Check whether all pages are pinnable, if so return number of pages. If >>> some >>> * pages are not pinnable, migrate them, and unpin all pages. Return zero >>> if >>> * pages were migrated, or if some pages were not successfully isolated. >>> @@ -1888,15 +1942,40 @@ static long >>> check_and_migrate_movable_pages(unsigned long nr_pages, >>> continue; >>> prev_head = head; >>> /* >>> -* If we get a movable page, since we are going to be pinning >>> -* these entries, try to move them out if possible. 
>>> +* Device coherent pages are managed by a driver and should not >>> +* be pinned indefinitely as it prevents the driver moving the >>> +* page. So when trying to pin with FOLL_LONGTERM instead try >>> +* migrating page out of device memory. >>> */ >>> if (is_dev_private_or_coherent_page(head)) { >>> + /*
Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping
On 10.01.22 23:31, Alex Sierra wrote: > This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory > owned by a device that can be mapped into CPU page tables like > MEMORY_DEVICE_GENERIC and can also be migrated like > MEMORY_DEVICE_PRIVATE. > > Christoph, the suggestion to incorporate Ralph Campbell’s refcount > cleanup patch into our hardware page migration patchset originally came > from you, but it proved impractical to do things in that order because > the refcount cleanup introduced a bug with wide ranging structural > implications. Instead, we amended Ralph’s patch so that it could be > applied after merging the migration work. As we saw from the recent > discussion, merging the refcount work is going to take some time and > cooperation between multiple development groups, while the migration > work is ready now and is needed now. So we propose to merge this > patchset first and continue to work with Ralph and others to merge the > refcount cleanup separately, when it is ready. > > This patch series is mostly self-contained except for a few places where > it needs to update other subsystems to handle the new memory type. > System stability and performance are not affected according to our > ongoing testing, including xfstests. > > How it works: The system BIOS advertises the GPU device memory > (aka VRAM) as SPM (special purpose memory) in the UEFI system address > map. > > The amdgpu driver registers the memory with devmap as > MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for > this hardware page migration capability is the Frontier supercomputer > project. This functionality is not AMD-specific. We expect other GPU > vendors to find this functionality useful, and possibly other hardware > types in the future. > > Our test nodes in the lab are similar to the Frontier configuration, > with .5 TB of system memory plus 256 GB of device memory split across > 4 GPUs, all in a single coherent address space. 
Page migration is > expected to improve application efficiency significantly. We will > report empirical results as they become available. Hi, might be a dumb question because I'm not too familiar with MEMORY_DEVICE_COHERENT, but who's in charge of migrating *to* that memory? Or how does a process ever get a grab on such pages? And where does migration come into play? I assume migration is only required to migrate off of that device memory to ordinary system RAM when required because the device memory has to be freed up, correct? (a high-level description on how this is exploited from user space would be great) -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 16.03.21 09:43, Liang, Liang (Leo) wrote: [AMD Public Use] Hi David, Thanks for your explanation. We saw slow boot issue on our farm/QA's machines and mine. All of machines are same SoC/board. I cannot spot anything really special in the logs -- it's just ordinary system ram -- except: [0.27] MTRR fixed ranges enabled: [0.28] 0-9 write-back [0.29] A-B uncachable [0.30] C-F write-through [0.31] MTRR variable ranges enabled: [0.32] 0 base mask 8000 write-back [0.34] 1 base FFE0 mask FFE0 write-protect [0.35] 2 base 0001 mask FF00 write-protect [0.36] 3 base FFDE mask FFFE write-protect [0.38] 4 base FF00 mask FFF8 write-protect [0.39] 5 disabled [0.39] 6 disabled [0.40] 7 disabled Not sure if "2 base 0001" indicates something nasty. Not sure how to interpret the masks. Can you provide the output of "cat /proc/mtrr" ? -- Thanks, David / dhildenb ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 16.03.21 09:58, Liang, Liang (Leo) wrote: [AMD Public Use] Hi David, root@scbu-Chachani:~# cat /proc/mtrr reg00: base=0x0 (0MB), size= 2048MB, count=1: write-back reg01: base=0x0ffe0 ( 4094MB), size=2MB, count=1: write-protect reg02: base=0x1 ( 4096MB), size= 16MB, count=1: write-protect ^ there it is https://wiki.osdev.org/MTRR "Reads allocate cache lines on a cache miss. All writes update main memory. Cache lines are not allocated on a write miss. Write hits invalidate the cache line and update main memory. " AFAIU, writes completely bypass caches and store directly to main memory. If there are cache lines from a previous read, they are invalidated. So I think the especially slow pattern will be read(addr), write(addr), read(addr), ... which is what we have in the kstream benchmark. The question is: who sets this up without owning the memory? Is the memory actually special/slow or is that setting wrong? Buggy firmware/BIOS? Buggy device driver? reg03: base=0x0ffde ( 4093MB), size= 128KB, count=1: write-protect reg04: base=0x0ff00 ( 4080MB), size= 512KB, count=1: write-protect -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 16.03.21 12:02, Liang, Liang (Leo) wrote: [AMD Public Use] Hi David and Mike, It's a BIOS bug. Now fixed by a new BIOS. Thank you so much! Cheers! [0.34] MTRR variable ranges enabled: [0.35] 0 base mask 8000 write-back [0.37] 1 base FFE0 mask FFE0 write-protect [0.39] 2 base FFDE mask FFFE write-protect [0.40] 3 base FF00 mask FFF8 write-protect [0.41] 4 disabled [0.42] 5 disabled [0.43] 6 disabled [0.44] 7 disabled [0.45] TOM2: 00028000 aka 10240M root@scbu-Chachani:/home/scbu# cat /proc/mtrr reg00: base=0x0 (0MB), size= 2048MB, count=1: write-back reg01: base=0x0ffe0 ( 4094MB), size=2MB, count=1: write-protect reg02: base=0x0ffde ( 4093MB), size= 128KB, count=1: write-protect reg03: base=0x0ff00 ( 4080MB), size= 512KB, count=1: write-protect Great :) (another latent BUG found with 7fef431be9c9 :) ) -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 16.03.21 09:00, Liang, Liang (Leo) wrote: [AMD Public Use] Hi Mike, Thanks for the help. The patch works for me and boot time is back to normal. So is it a fix, or just a workaround? Hi Leo, excluding up to 16 MiB of memory on every system just because that single platform is weird is not acceptable. I think we have to figure out a) why that memory is so special. This is weird. b) why the platform doesn't indicate it in a special way. Why is it ordinary system RAM but still *that* slow? c) how we can reliably identify such memory and exclude it. I'll have a peek at the memory layout of that machine from boot logs next to figure out if we can answer any of these questions. Just to verify: this does happen on multiple machines, not just a single one? (i.e., we're not dealing with faulty RAM) -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 13.03.21 14:48, Mike Rapoport wrote: Hi, On Sat, Mar 13, 2021 at 10:05:23AM +0100, David Hildenbrand wrote: Am 13.03.2021 um 05:04 schrieb Liang, Liang (Leo) : Hi David, Which benchmark tool do you prefer? Memtest86+ or else? Hi Leo, I think you want something that runs under Linux natively. I'm planning on coding up a kernel module to walk all 4MB pages in the freelists and perform a stream benchmark individually. Then we might be able to identify the problematic range - if there is a problematic range :) My wild guess would be that the pages that are now at the head of free lists have wrong caching enabled. Might be worth checking in your test module. I hacked something up real quick: https://github.com/davidhildenbrand/kstream Only briefly tested inside a VM. The output looks something like [...] [ 8396.432225] [0x4580 - 0x45bf] 25322 MB/s / 38948 MB/s [ 8396.448749] [0x45c0 - 0x45ff] 24481 MB/s / 38946 MB/s [ 8396.465197] [0x4600 - 0x463f] 24892 MB/s / 39170 MB/s [ 8396.481552] [0x4640 - 0x467f] 25222 MB/s / 39156 MB/s [ 8396.498012] [0x4680 - 0x46bf] 24416 MB/s / 39159 MB/s [ 8396.514397] [0x46c0 - 0x46ff] 25469 MB/s / 38940 MB/s [ 8396.530849] [0x4700 - 0x473f] 24885 MB/s / 38734 MB/s [ 8396.547195] [0x4740 - 0x477f] 25458 MB/s / 38941 MB/s [...] The benchmark allocates one 4 MiB chunk at a time and runs a simplified STREAM benchmark a) without flushing caches b) flushing caches before every memory access. It would be great if you could run that with the *old behavior* kernel (IOW, without 7fef431be9c9), so we might still be lucky to catch the problematic area in the freelist. Let's see if that will indicate anything. -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
> Am 13.03.2021 um 05:04 schrieb Liang, Liang (Leo) : > > [AMD Public Use] > > Hi David, > > Which benchmark tool you prefer? Memtest86+ or else? Hi Leo, I think you want something that runs under Linux natively. I‘m planning on coding up a kernel module to walk all 4MB pages in the freelists and perform a stream benchmark individually. Then we might be able to identify the problematic range - if there is a problematic range :) Guess I‘ll have it running by Monday and let you know. Cheers! > > BRs, > Leo > -----Original Message- > From: David Hildenbrand > Sent: Saturday, March 13, 2021 12:47 AM > To: Liang, Liang (Leo) ; Deucher, Alexander > ; linux-ker...@vger.kernel.org; amd-gfx list > ; Andrew Morton > Cc: Huang, Ray ; Koenig, Christian > ; Mike Rapoport ; Rafael J. > Wysocki ; George Kennedy > Subject: Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail > in __free_pages_core()") > >> On 12.03.21 17:19, Liang, Liang (Leo) wrote: >> [AMD Public Use] >> >> Dmesg attached. >> > > > So, looks like the "real" slowdown starts once the buddy is up and running > (no surprise). > > > [0.044035] Memory: 6856724K/7200304K available (14345K kernel code, 9699K > rwdata, 5276K rodata, 2628K init, 12104K bss, 343324K reserved, 0K > cma-reserved) > [0.044045] random: get_random_u64 called from > __kmem_cache_create+0x33/0x460 with crng_init=1 > [0.049025] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1 > [0.050036] ftrace: allocating 47158 entries in 185 pages > [0.097487] ftrace: allocated 185 pages with 5 groups > [0.109210] rcu: Hierarchical RCU implementation. > > vs. 
> > [0.041115] Memory: 6869396K/7200304K available (14345K kernel code, 3433K > rwdata, 5284K rodata, 2624K init, 6088K bss, 330652K reserved, 0K > cma-reserved) > [0.041127] random: get_random_u64 called from > __kmem_cache_create+0x31/0x430 with crng_init=1 > [0.041309] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1 > [0.041335] ftrace: allocating 47184 entries in 185 pages > [0.055719] ftrace: allocated 185 pages with 5 groups > [0.055863] rcu: Hierarchical RCU implementation. > > > And it gets especially bad during ACPI table processing: > > [4.158303] ACPI: Added _OSI(Module Device) > [4.158767] ACPI: Added _OSI(Processor Device) > [4.159230] ACPI: Added _OSI(3.0 _SCP Extensions) > [4.159705] ACPI: Added _OSI(Processor Aggregator Device) > [4.160551] ACPI: Added _OSI(Linux-Dell-Video) > [4.161359] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) > [4.162264] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) > [ 17.713421] ACPI: 13 ACPI AML tables successfully acquired and loaded > [ 18.716065] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored > [ 20.743828] ACPI: EC: EC started > [ 20.744155] ACPI: EC: interrupt blocked > [ 20.945956] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662 > [ 20.946618] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle > transactions > [ 20.947348] ACPI: Interpreter enabled > [ 20.951278] ACPI: (supports S0 S3 S4 S5) > [ 20.951632] ACPI: Using IOAPIC for interrupt routing > > vs. 
> > [0.216039] ACPI: Added _OSI(Module Device) > [0.216041] ACPI: Added _OSI(Processor Device) > [0.216043] ACPI: Added _OSI(3.0 _SCP Extensions) > [0.216044] ACPI: Added _OSI(Processor Aggregator Device) > [0.216046] ACPI: Added _OSI(Linux-Dell-Video) > [0.216048] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) > [0.216049] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) > [0.228259] ACPI: 13 ACPI AML tables successfully acquired and loaded > [0.229527] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored > [0.231663] ACPI: EC: EC started > [0.231666] ACPI: EC: interrupt blocked > [0.233664] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662 > [0.233667] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle > transactions > [0.233670] ACPI: Interpreter enabled > [0.233685] ACPI: (supports S0 S3 S4 S5) > [0.233687] ACPI: Using IOAPIC for interrupt routing > > The jump from 4.1 -> 17.7 is especially bad. > > Which might in fact indicate that this could be related to using some very > special slow (ACPI?) memory for ordinary purposes, interfering with actual > ACPI users? > > But again, just a wild guess, because the system is extremely slow > afterwards, however, we don't have any pauses without any signs of life for > that long. > > > It would be interesting to run a simple memory bandwidth benchmark on the > fast kernel with differing sizes up to running OOM to see if there is really > some memory that is just horribly slow once allocated and used. > > -- > Thanks, > > David / dhildenb >
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 12.03.21 17:19, Liang, Liang (Leo) wrote: [AMD Public Use] Dmesg attached. So, looks like the "real" slowdown starts once the buddy is up and running (no surprise). [0.044035] Memory: 6856724K/7200304K available (14345K kernel code, 9699K rwdata, 5276K rodata, 2628K init, 12104K bss, 343324K reserved, 0K cma-reserved) [0.044045] random: get_random_u64 called from __kmem_cache_create+0x33/0x460 with crng_init=1 [0.049025] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1 [0.050036] ftrace: allocating 47158 entries in 185 pages [0.097487] ftrace: allocated 185 pages with 5 groups [0.109210] rcu: Hierarchical RCU implementation. vs. [0.041115] Memory: 6869396K/7200304K available (14345K kernel code, 3433K rwdata, 5284K rodata, 2624K init, 6088K bss, 330652K reserved, 0K cma-reserved) [0.041127] random: get_random_u64 called from __kmem_cache_create+0x31/0x430 with crng_init=1 [0.041309] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1 [0.041335] ftrace: allocating 47184 entries in 185 pages [0.055719] ftrace: allocated 185 pages with 5 groups [0.055863] rcu: Hierarchical RCU implementation. 
And it gets especially bad during ACPI table processing: [4.158303] ACPI: Added _OSI(Module Device) [4.158767] ACPI: Added _OSI(Processor Device) [4.159230] ACPI: Added _OSI(3.0 _SCP Extensions) [4.159705] ACPI: Added _OSI(Processor Aggregator Device) [4.160551] ACPI: Added _OSI(Linux-Dell-Video) [4.161359] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) [4.162264] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) [ 17.713421] ACPI: 13 ACPI AML tables successfully acquired and loaded [ 18.716065] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored [ 20.743828] ACPI: EC: EC started [ 20.744155] ACPI: EC: interrupt blocked [ 20.945956] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662 [ 20.946618] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle transactions [ 20.947348] ACPI: Interpreter enabled [ 20.951278] ACPI: (supports S0 S3 S4 S5) [ 20.951632] ACPI: Using IOAPIC for interrupt routing vs. [0.216039] ACPI: Added _OSI(Module Device) [0.216041] ACPI: Added _OSI(Processor Device) [0.216043] ACPI: Added _OSI(3.0 _SCP Extensions) [0.216044] ACPI: Added _OSI(Processor Aggregator Device) [0.216046] ACPI: Added _OSI(Linux-Dell-Video) [0.216048] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) [0.216049] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) [0.228259] ACPI: 13 ACPI AML tables successfully acquired and loaded [0.229527] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored [0.231663] ACPI: EC: EC started [0.231666] ACPI: EC: interrupt blocked [0.233664] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662 [0.233667] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle transactions [0.233670] ACPI: Interpreter enabled [0.233685] ACPI: (supports S0 S3 S4 S5) [0.233687] ACPI: Using IOAPIC for interrupt routing The jump from 4.1 -> 17.7 is especially bad. Which might in fact indicate that this could be related to using some very special slow (ACPI?) memory for ordinary purposes, interfering with actual ACPI users? 
But again, this is just a wild guess: the system is extremely slow afterwards, yet we don't have any pauses without any signs of life for that long. It would be interesting to run a simple memory bandwidth benchmark on the fast kernel with differing sizes up to running OOM to see if there is really some memory that is just horribly slow once allocated and used. -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
8G (with some carve out for the integrated GPU). [0.044181] Memory: 6858688K/7200304K available (14345K kernel code, 9659K rwdata, 4980K rodata, 2484K init, 12292K bss, 341360K reserved, 0K cma-reserved) Nothing particularly special about these systems that I am aware of. I'll see if we can repro this issue on any other platforms, but so far, not one has noticed any problems. Increasing the boot time from a few seconds to 2-3 minutes does not smell like some corner case cache effects we might be hitting in this particular instance - there have been minor reports that it either slightly increased or slightly decreases initial system performance, but that was about it. Either, yet another latent BUG (but why? why should memory access suddenly be that slow? I could only guess that we are now making sooner use of very slow memory), or there is really something else weird going on. Looks like pretty much everything is slower based on the timestamps in the dmesg output. There is a big jump here: If we're really dealing with some specific slow memory regions and that memory gets allocated for something that gets used regularly, then we might get a general slowdown. Hard to identify, though :) [3.758596] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) [3.759372] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) [ 16.177983] ACPI: 13 ACPI AML tables successfully acquired and loaded [ 17.099316] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored [ 18.969959] ACPI: EC: EC started And here: [ 36.566608] PCI: CLS 64 bytes, default 64 [ 36.575383] Trying to unpack rootfs image as initramfs... [ 44.594348] Initramfs unpacking failed: Decoding failed [ 44.765141] Freeing initrd memory: 46348K Also seeing soft lockups: [ 124.588634] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0] Yes, I noticed that -- there is a heavy slowdown somewhere. 
As that patch is v5.10 already (and we're close to v5.12) I assume something is particularly weird about the platform you are running on - because this is the first time I've seen a report like that. -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 12.03.21 15:06, Deucher, Alexander wrote: [AMD Public Use] -----Original Message----- From: David Hildenbrand Sent: Thursday, March 11, 2021 10:03 AM To: Deucher, Alexander ; linux- ker...@vger.kernel.org; amd-gfx list ; Andrew Morton Cc: Huang, Ray ; Koenig, Christian ; Liang, Liang (Leo) ; Mike Rapoport ; Rafael J. Wysocki ; George Kennedy Subject: Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()") On 11.03.21 15:41, Deucher, Alexander wrote: [AMD Public Use] Booting kernels on certain AMD platforms takes 2-3 minutes with the patch in the subject. Reverting it restores quick boot times (few seconds). Any ideas? Hi, We just discovered latent BUGs in ACPI code whereby ACPI tables are exposed to the page allocator as ordinary, free system RAM. With the patch you mention, the order in which pages get allocated from the page allocator is changed - which makes the BUG trigger more easily. I could imagine that someone allocates and uses that memory on your platform, and I could imagine that such accesses are very slow. I cannot tell if that is the root cause, but at least it would make sense. See https://lore.kernel.org/patchwork/patch/1389314/ You might want to give that patch a try (not sure if it's the latest version). CCing George Thanks for the patch. Unfortunately it didn't help. Any other ideas? Is there a newer version of that patch? @George? It's interesting that this only applies to these special AMD systems so far. Is there anything particular about these systems? How much memory do these systems have?
Increasing the boot time from a few seconds to 2-3 minutes does not smell like some corner case cache effects we might be hitting in this particular instance - there have been minor reports that it either slightly increased or slightly decreased initial system performance, but that was about it. Either, yet another latent BUG (but why? why should memory access suddenly be that slow? I could only guess that we are now making use of very slow memory sooner), or there is really something else weird going on. Cheers! Alex -- Thanks, David / dhildenb
Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")
On 11.03.21 15:41, Deucher, Alexander wrote: [AMD Public Use] Booting kernels on certain AMD platforms takes 2-3 minutes with the patch in the subject. Reverting it restores quick boot times (a few seconds). Any ideas? Hi, We just discovered latent BUGs in ACPI code whereby ACPI tables are exposed to the page allocator as ordinary, free system RAM. With the patch you mention, the order in which pages get allocated from the page allocator is changed - which makes the BUG trigger more easily. I could imagine that someone allocates and uses that memory on your platform, and I could imagine that such accesses are very slow. I cannot tell if that is the root cause, but at least it would make sense. See https://lore.kernel.org/patchwork/patch/1389314/ You might want to give that patch a try (not sure if it's the latest version). CCing George Thanks, Alex
[0.00] Linux version 5.11.0-7490c004ae7e (jenkins@24dbd4b4380b) (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #20210308 SMP Sun Mar 7 20:04:05 UTC 2021
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-7490c004ae7e root=UUID=459758f3-5106-4173-b9bc-cf9d528828ec ro resume=UUID=23390f67-bbaf-42c1-b31d-64ef7288e39e amd_iommu=off nokaslr
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Hygon HygonGenuine
[0.00]   Centaur CentaurHauls
[0.00]   zhaoxin Shanghai
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009efff] usable
[0.00] BIOS-e820: [mem 0x0009f000-0x000b] reserved
[0.00] BIOS-e820: [mem 0x0010-0x09af] usable
[0.00] BIOS-e820: [mem 0x09b0-0x09df] reserved
[0.00] BIOS-e820: [mem 0x09e0-0x09ef] usable
[0.00] BIOS-e820: [mem 0x09f0-0x09f10fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x09f11000-0x6c56efff] usable
[0.00] BIOS-e820: [mem 0x6c56f000-0x6c56] reserved
[0.00] BIOS-e820: [mem 0x6c57-0x7877efff] usable
[0.00] BIOS-e820: [mem 0x7877f000-0x7af7efff] reserved
[0.00] BIOS-e820: [mem 0x7af7f000-0x7cf7efff] ACPI NVS
[0.00] BIOS-e820: [mem 0x7cf7f000-0x7cffefff] ACPI data
[0.00] BIOS-e820: [mem 0x7cfff000-0x7cff] usable
[0.00] BIOS-e820: [mem 0x7d00-0x7dff] reserved
[0.00] BIOS-e820: [mem 0x7f00-0x7fff] reserved
[0.00] BIOS-e820: [mem 0xa000-0xa00f] reserved
[0.00] BIOS-e820: [mem 0xf000-0xf7ff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec01fff] reserved
[0.00] BIOS-e820: [mem 0xfec1-0xfec10fff] reserved
[0.00] BIOS-e820: [mem 0xfec2-0xfec20fff] reserved
[0.00] BIOS-e820: [mem 0xfed8-0xfed81fff] reserved
[0.00] BIOS-e820: [mem 0xfedc-0xfedd] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff08-0xffdd] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00023f37] usable
[0.00] BIOS-e820: [mem 0x00023f38-0x00027fff] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] e820: update [mem 0x6a275018-0x6a283857] usable ==> usable
[0.00] e820: update [mem 0x6a275018-0x6a283857] usable ==> usable
[0.00] e820: update [mem 0x6c572018-0x6c57c657] usable ==> usable
[0.00] e820: update [mem 0x6c572018-0x6c57c657] usable ==> usable
[0.00] extended physical RAM map:
[0.00] reserve setup_data: [mem 0x-0x0009efff] usable
[0.00] reserve setup_data: [mem 0x0009f000-0x000b] reserved
[0.00] reserve setup_data: [mem 0x0010-0x09af] usable
[0.00] reserve setup_data: [mem 0x09b0-0x09df] reserved
[0.00] reserve setup_data: [mem 0x09e0-0x09ef] usable
[0.00] reserve setup_data: [mem 0x09f0-0x09f10fff] ACPI NVS
[0.00] reserve setup_data: [mem 0x09f11000-0x6