Re: [PATCH v2 07/14] LoongArch: Implement ARCH_HAS_KERNEL_FPU_SUPPORT
On Sun, Jan 07, 2024 at 10:39:07AM +0800, Huacai Chen wrote:
> > Do you mean that LoongArch32 does not support double-precision FP in
> > hardware?
> > At least both of the consumers in this series use double-precision, so
> > my first thought is that LoongArch32 could not select
> > ARCH_HAS_KERNEL_FPU_SUPPORT.
> Then is it possible to introduce CC_FLAGS_SP_FPU and CC_FLAGS_DP_FPU?
> I think there may be some place where SP FP is enough.

Let's defer that until it is actually needed.
Re: [PATCH v2 13/14] selftests/fpu: Move FP code to a separate translation unit
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH v2 12/14] drm/amd/display: Use ARCH_HAS_KERNEL_FPU_SUPPORT
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH v2 05/14] arm64: crypto: Use CC_FLAGS_FPU for NEON CFLAGS
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH v2 10/14] riscv: Add support for kernel-mode FPU
On Wed, Dec 27, 2023 at 05:42:00PM -0800, Samuel Holland wrote: > This is motivated by the amdgpu DRM driver, which needs floating-point > code to support recent hardware. That code is not performance-critical, > so only provide a minimal non-preemptible implementation for now. > > Signed-off-by: Samuel Holland Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH v2 01/14] arch: Add ARCH_HAS_KERNEL_FPU_SUPPORT
Thanks for all the great documentation! Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH 3/3] drm/amd/display: Support DRM_AMD_DC_FP on RISC-V
On Thu, Dec 07, 2023 at 10:49:53PM -0600, Samuel Holland wrote:
> Actually tracking all possibly-FPU-tainted functions and their call sites is
> probably possible, but a much larger task.

I think objtool should be able to do that reasonably easily; it already
does it for checking sections where userspace address access is enabled
or not, which is very similar.
Re: [RFC PATCH 05/12] lib/raid6: Use CC_FLAGS_FPU for NEON CFLAGS
On Mon, Dec 11, 2023 at 10:12:27AM -0600, Samuel Holland wrote:
> On 2023-12-11 10:07 AM, Christoph Hellwig wrote:
> > Unfortunately, not all of the relevant options can be no-prefixed:

Ok. That is another good argument for having the obj-fpu += syntax I
proposed. You might need help from the kbuild maintainers for that, as
trying to understand the kbuild magic isn't something I'd expect from a
normal contributor (including myself..).
Re: [RFC PATCH 11/12] selftests/fpu: Move FP code to a separate translation unit
> obj-$(CONFIG_TEST_FPU) += test_fpu.o
> -CFLAGS_test_fpu.o += $(FPU_CFLAGS)
> +test_fpu-y := test_fpu_glue.o test_fpu_impl.o
> +CFLAGS_test_fpu_impl.o += $(FPU_CFLAGS)

Btw, I really wonder if having a modname-fpu += foo.o syntax in kbuild
wouldn't be preferable to this. Of course that requires someone who
understands kbuild inside out.

> +int test_fpu(void);

This needs to go into a header.

And I think I understand your way to enforce the use of a separate
compilation unit in the riscv patch now. Can we just make that generic,
e.g. have a generic header that wraps the arch-specific one and does the
guard based on a -D_LINUX_FPU_COMPILATION_UNIT=1 on the command line so
that all the code becomes fully portable? Any legacy arch specific fpu
users not using it would not be affected by it, although it would be
great to eventually migrate them to the common scheme.
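A minimal userspace sketch of the guard idea from the mail above, not the
actual kernel implementation. Only the _LINUX_FPU_COMPILATION_UNIT macro
name and the glue/impl split come from the thread; the fpu_depth counter,
fp_scale() and scale_with_fpu() are hypothetical stand-ins chosen for
illustration.

```c
/*
 * Sketch only: models what a generic guard header might do. In a real
 * tree the build system would pass -D_LINUX_FPU_COMPILATION_UNIT=1 for
 * FP-enabled objects only (assumption based on the mail).
 */
#ifdef _LINUX_FPU_COMPILATION_UNIT
/* FP-enabled units must never switch FPU state themselves. */
#define kernel_fpu_begin() \
	_Static_assert(0, "floating-point code must use a separate translation unit")
#define kernel_fpu_end() kernel_fpu_begin()
#else
int fpu_depth;	/* hypothetical stand-in for real FPU state handling */

static inline void kernel_fpu_begin(void) { fpu_depth++; }
static inline void kernel_fpu_end(void) { fpu_depth--; }
#endif

/* Would live in the "impl" unit, the only one built with FPU CFLAGS. */
int fp_scale(int v)
{
	return (int)(v * 1.5);
}

/* Would live in the "glue" unit: integer-only, owns the begin/end pair. */
int scale_with_fpu(int v)
{
	int r;

	kernel_fpu_begin();
	r = fp_scale(v);
	kernel_fpu_end();
	return r;
}
```

Defining the macro while compiling the glue code would turn every
kernel_fpu_begin() call into a build failure, which is the enforcement
the thread is after.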
Re: [RFC PATCH 12/12] selftests/fpu: Allow building on other architectures
Looks good: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 07/12] powerpc: Implement ARCH_HAS_KERNEL_FPU_SUPPORT
On Thu, Dec 07, 2023 at 09:54:37PM -0800, Samuel Holland wrote: > PowerPC provides an equivalent to the common kernel-mode FPU API, but in > a different header and using different function names. The PowerPC API > also requires a non-preemptible context. Add a wrapper header, and > export the CFLAGS adjustments. Looks good: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 05/12] lib/raid6: Use CC_FLAGS_FPU for NEON CFLAGS
> +CFLAGS_REMOVE_neon1.o += $(CC_FLAGS_NO_FPU) > +CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU) > +CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU) > +CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU) Btw, do we even really need the extra variables for compiler flags to remove? Don't gcc/clang options work so that if you add a no-prefixed version of the option later it transparently gets removed? Except for that: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 09/12] riscv: Add support for kernel-mode FPU
> +#ifdef __riscv_f
> +
> +#define kernel_fpu_begin() \
> +	static_assert(false, "floating-point code must use a separate translation unit")
> +#define kernel_fpu_end() kernel_fpu_begin()
> +
> +#else
> +
> +void kernel_fpu_begin(void);
> +void kernel_fpu_end(void);
> +
> +#endif

I'll assume this is related to the trick that places code in a separate
translation unit, but I fail to understand it. Can you add a comment
explaining it?
Re: [RFC PATCH 04/12] arm64: Implement ARCH_HAS_KERNEL_FPU_SUPPORT
> + * linux/arch/arm64/include/asm/fpu.h Same comment as for arm here. Except for that: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 02/12] ARM: Implement ARCH_HAS_KERNEL_FPU_SUPPORT
> --- /dev/null
> +++ b/arch/arm/include/asm/fpu.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * linux/arch/arm/include/asm/fpu.h

Please don't add the file name to the top-of-file comment. It serves no
purpose and easily gets out of date.

Except for that:

Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 01/12] arch: Add ARCH_HAS_KERNEL_FPU_SUPPORT
Looks good: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 03/12] ARM: crypto: Use CC_FLAGS_FPU for NEON CFLAGS
On Thu, Dec 07, 2023 at 09:54:33PM -0800, Samuel Holland wrote: > Now that CC_FLAGS_FPU is exported and can be used anywhere in the source > tree, use it instead of duplicating the flags here. Looks good: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 06/12] LoongArch: Implement ARCH_HAS_KERNEL_FPU_SUPPORT
On Thu, Dec 07, 2023 at 09:54:36PM -0800, Samuel Holland wrote: > LoongArch already provides kernel_fpu_begin() and kernel_fpu_end() in > asm/fpu.h, so it only needs to add kernel_fpu_available() and export > the CFLAGS adjustments. Looks good: Reviewed-by: Christoph Hellwig
Re: [RFC PATCH 08/12] x86: Implement ARCH_HAS_KERNEL_FPU_SUPPORT
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH 3/3] drm/amd/display: Support DRM_AMD_DC_FP on RISC-V
> - select DRM_AMD_DC_FP if (X86 || LOONGARCH || (PPC64 && ALTIVEC) || (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG))
> + select DRM_AMD_DC_FP if ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG
> + select DRM_AMD_DC_FP if PPC64 && ALTIVEC
> + select DRM_AMD_DC_FP if RISCV && FPU
> + select DRM_AMD_DC_FP if LOONGARCH || X86

This really is a mess. Can you add an ARCH_HAS_KERNEL_FPU_SUPPORT symbol
that all architectures that have it select instead, and then make
DRM_AMD_DC_FP depend on it?

> -#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
> +#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH) || defined(CONFIG_RISCV)
> kernel_fpu_begin();
> #elif defined(CONFIG_PPC64)
> if (cpu_has_feature(CPU_FTR_VSX_COMP))
> @@ -122,7 +124,7 @@ void dc_fpu_end(const char *function_name, const int line)
>
> depth = __this_cpu_dec_return(fpu_recursion_depth);
> if (depth == 0) {
> -#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
> +#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH) || defined(CONFIG_RISCV)
> kernel_fpu_end();
> #elif defined(CONFIG_PPC64)
> if (cpu_has_feature(CPU_FTR_VSX_COMP))

And then this mess can go away. We'll need to decide if we want to cover
all the in-kernel vector support as part of it, which would seem
reasonable to me, or have a separate generic kernel_vector_begin with
its own option.

> diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile b/drivers/gpu/drm/amd/display/dc/dml/Makefile
> index ea7d60f9a9b4..5c8f840ef323 100644
> --- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
> +++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
> @@ -43,6 +43,12 @@ dml_ccflags := -mfpu=64
> dml_rcflags := -msoft-float
> endif
>
> +ifdef CONFIG_RISCV
> +include $(srctree)/arch/riscv/Makefile.isa
> +# Remove V from the ISA string, like in arch/riscv/Makefile, but keep F and D.
> +dml_ccflags := -march=$(shell echo $(riscv-march-y) | sed -E 's/(rv32ima|rv64ima)([^v_]*)v?/\1\2/')
> +endif
> +
> ifdef CONFIG_CC_IS_GCC
> ifneq ($(call gcc-min-version, 70100),y)
> IS_OLD_GCC = 1

And this is again not really something we should be doing. Instead we
need a generic way in Kconfig to enable FPU support for an object file
or a set of them, that the arch support can hook into.

Btw, I'm also really worried about folks using the FPU instructions
outside the kernel_fpu_begin/end windows in general (not directly
related to the RISC-V support). Can we have objtool checks for that
similar to only allowing the unsafe uaccess in the uaccess begin/end
pairs?
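A rough sketch of what the driver-side helpers could shrink to once every
architecture exposes the same kernel_fpu_begin()/kernel_fpu_end() pair.
This is a userspace model under stated assumptions: the two kernel
functions are replaced by stand-ins that flip a hypothetical
hw_fpu_enabled flag, and the per-CPU fpu_recursion_depth from the quoted
hunk is a plain int here.

```c
#include <stdbool.h>

bool hw_fpu_enabled;		/* hypothetical: models who owns FPU state */
int fpu_recursion_depth;	/* per-CPU in the driver, a plain int here */

/* Stand-ins for the generic API; not the kernel implementations. */
static void kernel_fpu_begin(void) { hw_fpu_enabled = true; }
static void kernel_fpu_end(void) { hw_fpu_enabled = false; }

/* Only the outermost begin touches the FPU state. */
void dc_fpu_begin(void)
{
	if (++fpu_recursion_depth == 1)
		kernel_fpu_begin();
}

/* Only the matching outermost end releases it again. */
void dc_fpu_end(void)
{
	if (--fpu_recursion_depth == 0)
		kernel_fpu_end();
}
```

With a single API there is no per-architecture #if ladder left in either
helper, which is the cleanup the review asks for.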
Re: [PATCH 1/3] riscv: Add support for kernel-mode FPU
On Tue, Nov 21, 2023 at 07:05:13PM -0800, Samuel Holland wrote: > +static inline void kernel_fpu_begin(void) > +{ > + preempt_disable(); > + fstate_save(current, task_pt_regs(current)); > + csr_set(CSR_SSTATUS, SR_FS); > +} > + > +static inline void kernel_fpu_end(void) > +{ > + csr_clear(CSR_SSTATUS, SR_FS); > + fstate_restore(current, task_pt_regs(current)); > + preempt_enable(); > +} Is there any critical reason to inline these two? I'd much rather see them out of line and exported instead of the low-level helpers.
Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()
On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote: > Use the helpers to simplify code. Nothing against your addition of a helper, but a GPU driver really should have no business even looking at this information..
Re: [PATCH] drm/amdgpu: add the accelerator pcie class
On Thu, May 25, 2023 at 08:52:06PM +, Deucher, Alexander wrote: > We already handle this today for CLASS_DISPLAY via a data table provided on > our hardware that details the components on the board. The driver can then > determine whether or not that combination of components is supported. If the > data table doesn't exist or isn’t parse-able, or the components enumerated > are not supported, the driver doesn't load. But things like module loading and initramfs generation still work off the ID table and not your internal tables.
Re: [PATCH v2] drm/amd/display: enable more strict compile checks
> +subdir-ccflags-y += -Werror -Wunused -Wmisleading-indentation We have a config option for -Werror. Blindly adding this will create problems with too new (or sometimes too old, or just too weird) compilers all the time. Don't do this.
Re: [PATCH] drm/amdgpu: add the accelerator pcie class
On Tue, May 23, 2023 at 10:02:32AM -0400, Alex Deucher wrote:
> On Tue, May 23, 2023 at 5:25 AM Christoph Hellwig wrote:
> >
> > On Tue, May 23, 2023 at 12:02:32PM +0800, Shiwu Zhang wrote:
> > > + { PCI_DEVICE(0x1002, PCI_ANY_ID),
> > > + .class = PCI_CLASS_ACCELERATOR_PROCESSING << 8,
> > > + .class_mask = 0xff,
> > > + .driver_data = CHIP_IP_DISCOVERY },
> >
> > Probing for every single device of a given class for a single vendor
> > to a driver is just fundamentally wrong. Please list the actual IDs
> > that the driver can handle.
>
> How so? The driver handles all devices of that class. We already do
> that for PCI_CLASS_DISPLAY_VGA and PCI_CLASS_DISPLAY_OTHER. Other
> drivers do similar things.

How is that going to work in the long run? It is a given that totally
incompatible devices from the same vendor will appear.

> The hda audio driver does the same thing
> for PCI_CLASS_MULTIMEDIA_HD_AUDIO for example.

That, just like PCI_CLASS_STORAGE_EXPRESS, is a different case, as the
class is associated with an actual documented programming interface.
Re: [PATCH] drm/amdgpu: add the accelerator pcie class
On Tue, May 23, 2023 at 12:02:32PM +0800, Shiwu Zhang wrote:
> + { PCI_DEVICE(0x1002, PCI_ANY_ID),
> + .class = PCI_CLASS_ACCELERATOR_PROCESSING << 8,
> + .class_mask = 0xff,
> + .driver_data = CHIP_IP_DISCOVERY },

Probing for every single device of a given class for a single vendor to
a driver is just fundamentally wrong. Please list the actual IDs that
the driver can handle.
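To make the difference between a class wildcard and an explicit ID list
concrete, here is a small userspace model of ID table matching. It
follows the rule as I understand the PCI core to apply it (IDs equal or
wildcarded, and all class bits selected by the mask equal); the struct
names, the device IDs 0x1234 and 0x5678, and the 0xffff00 mask are
assumptions made for the example, not values from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define ANY_ID 0xffffffffu	/* stand-in for PCI_ANY_ID */

/* Hypothetical, trimmed-down models of a table entry and a device. */
struct id_entry {
	uint32_t vendor, device;
	uint32_t class, class_mask;
};

struct dev_info {
	uint32_t vendor, device;
	uint32_t class;		/* 24 bits: base class, subclass, prog-if */
};

/* An entry matches when the IDs agree and all masked class bits are equal. */
bool id_matches(const struct id_entry *id, const struct dev_info *dev)
{
	return (id->vendor == ANY_ID || id->vendor == dev->vendor) &&
	       (id->device == ANY_ID || id->device == dev->device) &&
	       !((id->class ^ dev->class) & id->class_mask);
}
```

A wildcard entry claims every present and future device of that vendor
and class, while an entry with an explicit device ID and a zero class
mask claims only the hardware the driver is known to handle.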
[PATCH] drm/radeon: stop including swiotlb.h
radeon does not need swiotlb.h, so stop including it. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/radeon/radeon_ttm.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 2220cdf6a3f680..04df08356d553f 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -36,7 +36,6 @@ #include #include #include -#include #include #include -- 2.39.2
[PATCH] drm/amdgpu: stop including swiotlb.h
amdgpu does not need swiotlb.h, so stop including it. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 2cd081cbf70621..385e04612e4e9b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -38,7 +38,6 @@ #include #include #include -#include #include #include #include -- 2.39.2
Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH] powerpc: export cpu_smallcore_map for modules
On Mon, Aug 22, 2022 at 01:40:23PM +1000, Michael Ellerman wrote: > Randy Dunlap writes: > > drivers/gpu/drm/amd/amdkfd/kfd_device.c calls cpu_smt_mask(). > > This is an inline function on powerpc which references > > cpu_smallcore_map. > > > > Fixes: 425752c63b6f ("powerpc: Detect the presence of big-cores via "ibm, > > thread-groups"") > > Fixes: 7bc913085765 ("drm/amdkfd: Try to schedule bottom half on same core") > > That 2nd commit is not in mainline, only linux-next. > > I don't mind merging this fix preemptively, but is that SHA stable? I really do not think this has any business being exported at all. kfd_queue_work is not something that should be done in a driver. Something like this belongs into the workqueue core, not in an underdocumented helper in a random driver. Drm guys: once again, please please work with the maintainers instead of just making up random stuff in the drivers.
Re: [linux-next:master] BUILD REGRESSION 736ee37e2e8eed7fe48d0a37ee5a709514d478b3
How about just turning the MMIO/PIO accessors on m68k into inline functions as they are on most other architectures?
Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
On Thu, Mar 31, 2022 at 10:55:13AM +0200, David Hildenbrand wrote:
> > Why can't this deal with ZONE_DEVICE pages? It certainly has
> > nothing to do with a LRU I think. In fact being able to have
> > stats that count say the number of device pages here would
> > probably be useful at some point.
> >
> > In general I find the vm_normal_lru_page vs vm_normal_page
> > API highly confusing. An explicit check for zone device pages
> > in the dozen or so spots that care has a much better documentation
> > value, especially if accompanied by comments where it isn't entirely
> > obvious.
>
> What's your thought on FOLL_LRU?

Also a bit confusing, but in between all these FOLL_ flags it doesn't
really matter any more.
Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
> - page = vm_normal_page(vma, addr, pte);
> + page = vm_normal_lru_page(vma, addr, pte);

Why can't this deal with ZONE_DEVICE pages? It certainly has nothing to
do with a LRU I think. In fact being able to have stats that count say
the number of device pages here would probably be useful at some point.

In general I find the vm_normal_lru_page vs vm_normal_page API highly
confusing. An explicit check for zone device pages in the dozen or so
spots that care has a much better documentation value, especially if
accompanied by comments where it isn't entirely obvious.

> page = follow_page(vma, addr,
> - FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> + FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);

Overly long line here.

> +/*
> + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
> + * is never incremented for device pages that are mmap through DAX mechanism
> + * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
> + * zap_pte_range is called and vm_normal_page return a valid page with
> + * page_mapcount() = 0, before page_remove_rmap is called.
> + */

Please properly indent comments.

> + * zone, as long as the pte's are present and vm_normal_lru_page() succeeds. These
> * pages also get pinned.

Another overly long line here.
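A sketch of the "explicit check at the call site" style suggested above,
which also shows how device pages could be counted separately. This is a
userspace model: struct page is reduced to a single hypothetical flag,
is_zone_device_page() is reimplemented on top of it, and count_pages() is
an invented caller, not a function from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal page model; the kernel's struct page is far richer. */
struct page {
	bool zone_device;
};

static bool is_zone_device_page(const struct page *page)
{
	return page->zone_device;
}

/* Count LRU-managed and device pages separately, skipping empty slots. */
void count_pages(struct page **pages, size_t n,
		 size_t *nr_lru, size_t *nr_device)
{
	*nr_lru = 0;
	*nr_device = 0;
	for (size_t i = 0; i < n; i++) {
		struct page *page = pages[i];

		if (!page)
			continue;
		/* Device pages are not on the LRU; account for them on their own. */
		if (is_zone_device_page(page)) {
			(*nr_device)++;
			continue;
		}
		(*nr_lru)++;
	}
}
```

The check and its comment sit exactly where the special case matters, so
a reader does not need to know what a second lookup helper filters out.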
Re: [PATCH v6 01/10] mm: add zone device coherent type memory support
On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote: > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My > > assumption was that they would be part of a special mapping. > > We need to stop using the special PTEs and VMAs for things that have a > struct page. This is a mistake DAX created that must be undone. Yes, we'll get to it. Maybe we can do it for the non-DAX devmap ptes first given that DAX is more complicated.
Re: [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file
On Thu, Feb 10, 2022 at 09:35:10PM +1100, Alistair Popple wrote:
> I got the following build error:
>
> /data/source/linux/mm/migrate_device.c: In function ‘migrate_vma_collect_pmd’:
> /data/source/linux/mm/migrate_device.c:242:3: error: implicit declaration of function ‘flush_tlb_range’; did you mean ‘flush_pmd_tlb_range’? [-Werror=implicit-function-declaration]
> 242 | flush_tlb_range(walk->vma, start, end);
> | ^~~
> | flush_pmd_tlb_range
>
> Including asm/tlbflush.h in migrate_device.c fixed it for me.

Yes, the buildbot also complained about this, but somehow in my test
configs it got pulled in implicitly.
[PATCH 27/27] tools: add hmm gup test for long term pinned device pages
From: Alex Sierra The intention is to test device coherent type pages that have been called through get user pages with PIN_LONGTERM flag set. These pages should get migrated back to normal system memory. Signed-off-by: Alex Sierra Signed-off-by: Alistair Popple Reviewed-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- tools/testing/selftests/vm/Makefile| 2 +- tools/testing/selftests/vm/hmm-tests.c | 81 ++ 2 files changed, 82 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 1607322a112c91..58c8427114f0c2 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -142,7 +142,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap $(OUTPUT)/gup_test: ../../../../mm/gup_test.h -$(OUTPUT)/hmm-tests: local_config.h +$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 84ec8c4a1dc7b6..11b83a8084fee2 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -36,6 +36,7 @@ * in the usual include/uapi/... directory. */ #include "../../../../lib/test_hmm_uapi.h" +#include "../../../../mm/gup_test.h" struct hmm_buffer { void*ptr; @@ -60,6 +61,8 @@ enum { #define NTIMES 10 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) +/* Just the flags we need, copied from mm.h: */ +#define FOLL_WRITE 0x01/* check pte is writable */ FIXTURE(hmm) { @@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow) hmm_buffer_free(buffer); } +/* + * Test get user device pages through gup_test. Setting PIN_LONGTERM flag. + * This should trigger a migration back to system memory for both, private + * and coherent type pages. + * This test makes use of gup_test module. 
Make sure GUP_TEST_CONFIG is added + * to your configuration before you run it. + */ +TEST_F(hmm, hmm_gup_test) +{ + struct hmm_buffer *buffer; + struct gup_test gup; + int gup_fd; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + unsigned char *m; + + gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR); + if (gup_fd == -1) + SKIP(return, "Skipping test, could not find gup_test driver"); + + npages = 4; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Migrate memory to device. */ + ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + gup.nr_pages_per_call = npages; + gup.addr = (unsigned long)buffer->ptr; + gup.gup_flags = FOLL_WRITE; + gup.size = size; + /* +* Calling gup_test ioctl. It will try to PIN_LONGTERM these device pages +* causing a migration back to system memory for both, private and coherent +* type pages. 
+*/ + if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) { + perror("ioctl on PIN_LONGTERM_BENCHMARK\n"); + goto out_test; + } + + /* Take snapshot to make sure pages have been migrated to sys memory */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + m = buffer->mirror; + for (i = 0; i < npages; i++) + ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE); +out_test: + close(gup_fd); + hmm_buffer_free(buffer); +} TEST_HARNESS_MAIN -- 2.30.2
[PATCH 26/27] mm/gup: migrate device coherent pages when pinning instead of failing
From: Alistair Popple Currently any attempts to pin a device coherent page will fail. This is because device coherent pages need to be managed by a device driver, and pinning them would prevent a driver from migrating them off the device. However this is no reason to fail pinning of these pages. These are coherent and accessible from the CPU so can be migrated just like pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin them first try migrating them out of ZONE_DEVICE. Signed-off-by: Alistair Popple Acked-by: Felix Kuehling [hch: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c] Signed-off-by: Christoph Hellwig --- mm/gup.c| 37 ++- mm/internal.h | 1 + mm/migrate_device.c | 53 + 3 files changed, 85 insertions(+), 6 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 39b23ad39a7bde..41349b685eafb4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1889,9 +1889,31 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, ret = -EFAULT; goto unpin_pages; } + + /* +* Device coherent pages are managed by a driver and should not +* be pinned indefinitely as it prevents the driver moving the +* page. So when trying to pin with FOLL_LONGTERM instead try +* to migrate the page out of device memory. +*/ if (is_device_coherent_page(head)) { - ret = -EFAULT; - goto unpin_pages; + WARN_ON_ONCE(PageCompound(head)); + + /* +* Migration will fail if the page is pinned, so convert +* the pin on the source page to a normal reference. 
+*/ + if (gup_flags & FOLL_PIN) { + get_page(head); + unpin_user_page(head); + } + + pages[i] = migrate_device_page(head, gup_flags); + if (!pages[i]) { + ret = -EBUSY; + goto unpin_pages; + } + continue; } if (is_pinnable_page(head)) @@ -1931,10 +1953,13 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, return nr_pages; unpin_pages: - if (gup_flags & FOLL_PIN) { - unpin_user_pages(pages, nr_pages); - } else { - for (i = 0; i < nr_pages; i++) + for (i = 0; i < nr_pages; i++) { + if (!pages[i]) + continue; + + if (gup_flags & FOLL_PIN) + unpin_user_page(pages[i]); + else put_page(pages[i]); } diff --git a/mm/internal.h b/mm/internal.h index a67222d17e5987..1bded5d7f41a9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -719,5 +719,6 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, unsigned long addr, int page_nid, int *flags); void free_zone_device_page(struct page *page); +struct page *migrate_device_page(struct page *page, unsigned int gup_flags); #endif /* __MM_INTERNAL_H */ diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 03e182f9fc7865..3373b535d5c9d9 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -767,3 +767,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate) } } EXPORT_SYMBOL(migrate_vma_finalize); + +/* + * Migrate a device coherent page back to normal memory. The caller should have + * a reference on page which will be copied to the new page if migration is + * successful or dropped on failure. 
+ */ +struct page *migrate_device_page(struct page *page, unsigned int gup_flags) +{ + unsigned long src_pfn, dst_pfn = 0; + struct migrate_vma args; + struct page *dpage; + + lock_page(page); + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; + args.src = &src_pfn; + args.dst = &dst_pfn; + args.cpages = 1; + args.npages = 1; + args.vma = NULL; + migrate_vma_setup(&args); + if (!(src_pfn & MIGRATE_PFN_MIGRATE)) + return NULL; + + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); + + /* +* get/pin the new page now so we don't have to retry gup after +* migrating. We already have a reference so this should never fail. +*/ + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { + __free_pages(dpage, 0); + dpage = NULL; + } + + if (dpage) { + lock_page(dpage); +
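The reworked error path in the gup.c hunk above has to cope with slots
that became NULL because migration failed. A simplified userspace sketch
of that cleanup loop, assuming a hypothetical page model with separate
reference and pin counters (the kernel tracks pins differently) and an
invented function name:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: separate counters instead of the kernel's scheme. */
struct page {
	int refcount;
	int pincount;
};

/* Drop whatever was taken on each remaining page; NULL slots are skipped. */
void release_pages_on_error(struct page **pages, size_t nr, bool pinned)
{
	for (size_t i = 0; i < nr; i++) {
		if (!pages[i])
			continue;	/* migration failed, nothing to drop */

		if (pinned)
			pages[i]->pincount--;	/* models unpin_user_page() */
		else
			pages[i]->refcount--;	/* models put_page() */
	}
}
```

The per-page loop replaces the bulk unpin of the whole array, since a
bulk call cannot be assumed to tolerate NULL entries.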
[PATCH 25/27] mm: remove the vma check in migrate_vma_setup()
From: Alistair Popple migrate_vma_setup() checks that a valid vma is passed so that the page tables can be walked to find the pfns associated with a given address range. However in some cases the pfns are already known, such as when migrating device coherent pages during pin_user_pages() meaning a valid vma isn't required. Signed-off-by: Alistair Popple Acked-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- mm/migrate_device.c | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 0b295594e7626d..03e182f9fc7865 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -462,24 +462,24 @@ int migrate_vma_setup(struct migrate_vma *args) args->start &= PAGE_MASK; args->end &= PAGE_MASK; - if (!args->vma || is_vm_hugetlb_page(args->vma) || - (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma)) - return -EINVAL; - if (nr_pages <= 0) - return -EINVAL; - if (args->start < args->vma->vm_start || - args->start >= args->vma->vm_end) - return -EINVAL; - if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end) - return -EINVAL; if (!args->src || !args->dst) return -EINVAL; - - memset(args->src, 0, sizeof(*args->src) * nr_pages); - args->cpages = 0; - args->npages = 0; - - migrate_vma_collect(args); + if (args->vma) { + if (is_vm_hugetlb_page(args->vma) || + (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma)) + return -EINVAL; + if (args->start < args->vma->vm_start || + args->start >= args->vma->vm_end) + return -EINVAL; + if (args->end <= args->vma->vm_start || + args->end > args->vma->vm_end) + return -EINVAL; + memset(args->src, 0, sizeof(*args->src) * nr_pages); + args->cpages = 0; + args->npages = 0; + + migrate_vma_collect(args); + } if (args->cpages) migrate_vma_unmap(args); @@ -661,7 +661,7 @@ void migrate_vma_pages(struct migrate_vma *migrate) continue; } - if (!page) { + if (!page && migrate->vma) { if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue; if 
(!notified) { -- 2.30.2
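The reordered checks in this patch boil down to "src and dst are always
required, the range checks only apply when a vma is given". A userspace
sketch of that validation, with a hypothetical two-field struct vma and
an invented validate_args() standing in for the start of
migrate_vma_setup():

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced models of the kernel structures. */
struct vma {
	unsigned long vm_start, vm_end;
};

struct migrate_args {
	const struct vma *vma;	/* may be NULL when the pfns are already known */
	unsigned long start, end;
	unsigned long *src, *dst;
};

int validate_args(const struct migrate_args *args)
{
	/* The pfn arrays are needed with or without a vma. */
	if (!args->src || !args->dst)
		return -EINVAL;

	/* Range checks only make sense when there is a vma to walk. */
	if (args->vma) {
		if (args->start < args->vma->vm_start ||
		    args->start >= args->vma->vm_end)
			return -EINVAL;
		if (args->end <= args->vma->vm_start ||
		    args->end > args->vma->vm_end)
			return -EINVAL;
	}
	return 0;
}
```

A caller that already holds the source pfn, such as the pinning path in
the next patch of the series, passes a NULL vma and skips the page table
walk entirely.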
[PATCH 23/27] tools: update hmm-test to support device coherent type
From: Alex Sierra Test cases such as migrate_fault and migrate_multiple were modified to explicitly migrate from device to sys memory without the need for page faults when using the device coherent type. The snapshot test case was updated to read the memory device type first and, based on that, get the proper returned results. A migrate_ping_pong test case was added to test explicit migration from device to sys memory for both private and coherent zone types. Helpers to migrate from device to sys memory and vice versa were also added. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple Signed-off-by: Christoph Hellwig --- tools/testing/selftests/vm/hmm-tests.c | 123 - 1 file changed, 102 insertions(+), 21 deletions(-) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 203323967b507a..84ec8c4a1dc7b6 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -44,6 +44,14 @@ struct hmm_buffer { int fd; uint64_t cpages; uint64_t faults; + int zone_device_type; +}; + +enum { + HMM_PRIVATE_DEVICE_ONE, + HMM_PRIVATE_DEVICE_TWO, + HMM_COHERENCE_DEVICE_ONE, + HMM_COHERENCE_DEVICE_TWO, }; #define TWOMEG (1 << 21) @@ -60,6 +68,21 @@ FIXTURE(hmm) unsigned int page_shift; }; +FIXTURE_VARIANT(hmm) +{ + int device_number; +}; + +FIXTURE_VARIANT_ADD(hmm, hmm_device_private) +{ + .device_number = HMM_PRIVATE_DEVICE_ONE, +}; + +FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent) +{ + .device_number = HMM_COHERENCE_DEVICE_ONE, +}; + FIXTURE(hmm2) { int fd0; @@ -68,6 +91,24 @@ FIXTURE(hmm2) unsigned int page_shift; }; +FIXTURE_VARIANT(hmm2) +{ + int device_number0; + int device_number1; +}; + +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private) +{ + .device_number0 = HMM_PRIVATE_DEVICE_ONE, + .device_number1 = HMM_PRIVATE_DEVICE_TWO, +}; + +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent) +{ + .device_number0 = HMM_COHERENCE_DEVICE_ONE, + .device_number1 = HMM_COHERENCE_DEVICE_TWO, +}; + static int
hmm_open(int unit) { char pathname[HMM_PATH_MAX]; @@ -81,12 +122,19 @@ static int hmm_open(int unit) return fd; } +static bool hmm_is_coherent_type(int dev_num) +{ + return (dev_num >= HMM_COHERENCE_DEVICE_ONE); +} + FIXTURE_SETUP(hmm) { self->page_size = sysconf(_SC_PAGE_SIZE); self->page_shift = ffs(self->page_size) - 1; - self->fd = hmm_open(0); + self->fd = hmm_open(variant->device_number); + if (self->fd < 0 && hmm_is_coherent_type(variant->device_number)) + SKIP(exit(0), "DEVICE_COHERENT not available"); ASSERT_GE(self->fd, 0); } @@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2) self->page_size = sysconf(_SC_PAGE_SIZE); self->page_shift = ffs(self->page_size) - 1; - self->fd0 = hmm_open(0); + self->fd0 = hmm_open(variant->device_number0); + if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0)) + SKIP(exit(0), "DEVICE_COHERENT not available"); ASSERT_GE(self->fd0, 0); - self->fd1 = hmm_open(1); + self->fd1 = hmm_open(variant->device_number1); ASSERT_GE(self->fd1, 0); } @@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd, } buffer->cpages = cmd.cpages; buffer->faults = cmd.faults; + buffer->zone_device_type = cmd.zone_device_type; return 0; } @@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n) nanosleep(&t, NULL); } +static int hmm_migrate_sys_to_dev(int fd, + struct hmm_buffer *buffer, + unsigned long npages) +{ + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages); +} + +static int hmm_migrate_dev_to_sys(int fd, + struct hmm_buffer *buffer, + unsigned long npages) +{ + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages); +} + /* * Simple NULL test of device open/close. */ @@ -875,7 +940,7 @@ TEST_F(hmm, migrate) ptr[i] = i; /* Migrate memory to device. 
*/ - ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages); ASSERT_EQ(ret, 0); ASSERT_EQ(buffer->cpages, npages); @@ -923,7 +988,7 @@ TEST_F(hmm, migrate_fault) ptr[i] = i; /* Migrate memory to device. */ - ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ret = hmm_migrate_sys_to_dev(sel
[PATCH 24/27] tools: update test_hmm script to support SP config
From: Alex Sierra Add two more parameters to set spm_addr_dev0 & spm_addr_dev1 addresses. These two parameters configure the start SP addresses for each device in test_hmm driver. Consequently, this configures zone device type as coherent. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple Signed-off-by: Christoph Hellwig --- tools/testing/selftests/vm/test_hmm.sh | 24 +--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh index 0647b525a62564..539c9371e592a1 100755 --- a/tools/testing/selftests/vm/test_hmm.sh +++ b/tools/testing/selftests/vm/test_hmm.sh @@ -40,11 +40,26 @@ check_test_requirements() load_driver() { - modprobe $DRIVER > /dev/null 2>&1 + if [ $# -eq 0 ]; then + modprobe $DRIVER > /dev/null 2>&1 + else + if [ $# -eq 2 ]; then + modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 + > /dev/null 2>&1 + else + echo "Missing module parameters. Make sure pass"\ + "spm_addr_dev0 and spm_addr_dev1" + usage + fi + fi if [ $? == 0 ]; then major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices) mknod /dev/hmm_dmirror0 c $major 0 mknod /dev/hmm_dmirror1 c $major 1 + if [ $# -eq 2 ]; then + mknod /dev/hmm_dmirror2 c $major 2 + mknod /dev/hmm_dmirror3 c $major 3 + fi fi } @@ -58,7 +73,7 @@ run_smoke() { echo "Running smoke test. Note, this test provides basic coverage." - load_driver + load_driver $1 $2 $(dirname "${BASH_SOURCE[0]}")/hmm-tests unload_driver } @@ -75,6 +90,9 @@ usage() echo "# Smoke testing" echo "./${TEST_NAME}.sh smoke" echo + echo "# Smoke testing with SPM enabled" + echo "./${TEST_NAME}.sh smoke " + echo exit 0 } @@ -84,7 +102,7 @@ function run_test() usage else if [ "$1" = "smoke" ]; then - run_smoke + run_smoke $2 $3 else usage fi -- 2.30.2
[PATCH 22/27] lib: add support for device coherent type in test_hmm
From: Alex Sierra The device coherent type uses device memory that is coherently accessible by the CPU. It can show up as an SP (special purpose) memory range in the BIOS-e820 memory enumeration. If the system has no SP memory, it can be faked by setting CONFIG_EFI_FAKE_MEMMAP. Currently, test_hmm only supports two different SP ranges of at least 256MB each. These can be specified with the efi_fake_mem kernel parameter. Ex. Two SP ranges of 1GB starting at 0x1 & 0x14000 physical address: efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4 Private and coherent device mirror instances can be created in the same probe. This is done by passing the module parameters spm_addr_dev0 & spm_addr_dev1. In this case, four instances of device_mirror are created: the first two are of the private device type, the last two of the coherent type. They can then be accessed from user space through /dev/hmm_dmirrorN, where N is the num_device. Usually num_device 0 and 1 are for private, and 2 and 3 for coherent types. If no module parameters are passed, only two instances of private type device_mirror are created. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple --- lib/test_hmm.c | 253 +--- lib/test_hmm_uapi.h | 15 ++- 2 files changed, 202 insertions(+), 66 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 15747f70c5bc9a..361a026c5d2126 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -32,11 +32,22 @@ #include "test_hmm_uapi.h" -#define DMIRROR_NDEVICES 2 +#define DMIRROR_NDEVICES 4 #define DMIRROR_RANGE_FAULT_TIMEOUT1000 #define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U) #define DEVMEM_CHUNKS_RESERVE 16 +/* + * For device_private pages, dpage is just a dummy struct page + * representing a piece of device memory. dmirror_devmem_alloc_page + * allocates a real system memory page as backing storage to fake a + * real device. zone_device_data points to that backing page. 
But + * for device_coherent memory, the struct page represents real + * physical CPU-accessible memory that we can use directly. + */ +#define BACKING_PAGE(page) (is_device_private_page((page)) ? \ + (page)->zone_device_data : (page)) + static unsigned long spm_addr_dev0; module_param(spm_addr_dev0, long, 0644); MODULE_PARM_DESC(spm_addr_dev0, @@ -125,6 +136,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce, return 0; } +static bool dmirror_is_private_zone(struct dmirror_device *mdevice) +{ + return (mdevice->zone_device_type == + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false; +} + +static enum migrate_vma_direction +dmirror_select_device(struct dmirror *dmirror) +{ + return (dmirror->mdevice->zone_device_type == + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? + MIGRATE_VMA_SELECT_DEVICE_PRIVATE : + MIGRATE_VMA_SELECT_DEVICE_COHERENT; +} + static void dmirror_bounce_fini(struct dmirror_bounce *bounce) { vfree(bounce->ptr); @@ -575,16 +601,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice, static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) { struct page *dpage = NULL; - struct page *rpage; + struct page *rpage = NULL; /* -* This is a fake device so we alloc real system memory to store -* our device memory. +* For ZONE_DEVICE private type, this is a fake device so we allocate +* real system memory to store our device memory. +* For ZONE_DEVICE coherent type we use the actual dpage to store the +* data and ignore rpage. 
*/ - rpage = alloc_page(GFP_HIGHUSER); - if (!rpage) - return NULL; - + if (dmirror_is_private_zone(mdevice)) { + rpage = alloc_page(GFP_HIGHUSER); + if (!rpage) + return NULL; + } spin_lock(&mdevice->lock); if (mdevice->free_pages) { @@ -603,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) return dpage; error: - __free_page(rpage); + if (rpage) + __free_page(rpage); return NULL; } @@ -629,12 +659,16 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, * unallocated pte_none() or read-only zero page. */ spage = migrate_pfn_to_page(*src); + if (WARN(spage && is_zone_device_page(spage), +"page already in device spage pfn: 0x%lx\n", +page_to_pfn(spage))) + continue; dpage = dmirror_devmem_alloc_page(mdevice); if (!dpage) continue; - rpage = dpage->zone_device_data; + rpage = BA
[PATCH 21/27] lib: test_hmm add module param for zone device type
From: Alex Sierra To configure the device coherent type in test_hmm, two module parameters must be passed, spm_addr_dev0 & spm_addr_dev1, which correspond to the SP start address of each of the two devices. If no parameters are passed, the private device type is configured. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple Signed-off-by: Christoph Hellwig --- lib/test_hmm.c | 73 - lib/test_hmm_uapi.h | 1 + 2 files changed, 53 insertions(+), 21 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 7a27584484ce0f..15747f70c5bc9a 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -37,6 +37,16 @@ #define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U) #define DEVMEM_CHUNKS_RESERVE 16 +static unsigned long spm_addr_dev0; +module_param(spm_addr_dev0, long, 0644); +MODULE_PARM_DESC(spm_addr_dev0, + "Specify start address for SPM (special purpose memory) used for device 0. By setting this Coherent device type will be used. Make sure spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE."); + +static unsigned long spm_addr_dev1; +module_param(spm_addr_dev1, long, 0644); +MODULE_PARM_DESC(spm_addr_dev1, + "Specify start address for SPM (special purpose memory) used for device 1. By setting this Coherent device type will be used. Make sure spm_addr_dev0 is set too. 
Minimum SPM size should be DEVMEM_CHUNK_SIZE."); + static const struct dev_pagemap_ops dmirror_devmem_ops; static const struct mmu_interval_notifier_ops dmirror_min_ops; static dev_t dmirror_dev; @@ -455,28 +465,44 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd) return ret; } -static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, +static int dmirror_allocate_chunk(struct dmirror_device *mdevice, struct page **ppage) { struct dmirror_chunk *devmem; - struct resource *res; + struct resource *res = NULL; unsigned long pfn; unsigned long pfn_first; unsigned long pfn_last; void *ptr; + int ret = -ENOMEM; devmem = kzalloc(sizeof(*devmem), GFP_KERNEL); if (!devmem) - return false; + return ret; - res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE, - "hmm_dmirror"); - if (IS_ERR(res)) + switch (mdevice->zone_device_type) { + case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE: + res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE, + "hmm_dmirror"); + if (IS_ERR_OR_NULL(res)) + goto err_devmem; + devmem->pagemap.range.start = res->start; + devmem->pagemap.range.end = res->end; + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; + break; + case HMM_DMIRROR_MEMORY_DEVICE_COHERENT: + devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) ? 
+ spm_addr_dev0 : + spm_addr_dev1; + devmem->pagemap.range.end = devmem->pagemap.range.start + + DEVMEM_CHUNK_SIZE - 1; + devmem->pagemap.type = MEMORY_DEVICE_COHERENT; + break; + default: + ret = -EINVAL; goto err_devmem; + } - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; - devmem->pagemap.range.start = res->start; - devmem->pagemap.range.end = res->end; devmem->pagemap.nr_range = 1; devmem->pagemap.ops = &dmirror_devmem_ops; devmem->pagemap.owner = mdevice; @@ -497,10 +523,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, mdevice->devmem_capacity = new_capacity; mdevice->devmem_chunks = new_chunks; } - ptr = memremap_pages(&devmem->pagemap, numa_node_id()); - if (IS_ERR(ptr)) + if (IS_ERR_OR_NULL(ptr)) { + if (ptr) + ret = PTR_ERR(ptr); + else + ret = -EFAULT; goto err_release; + } devmem->mdevice = mdevice; pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT; @@ -529,15 +559,17 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, } spin_unlock(&mdevice->lock); - return true; + return 0; err_release: mutex_unlock(&mdevice->devmem_lock); - release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range)); + if (
[PATCH 20/27] lib: test_hmm add ioctl to get zone device type
From: Alex Sierra Add a new ioctl command to query the zone device type. This will be used once test_hmm adds the zone device coherent type. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple Signed-off-by: Christoph Hellwig --- lib/test_hmm.c | 23 +-- lib/test_hmm_uapi.h | 8 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index cfe63204783918..7a27584484ce0f 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -87,6 +87,7 @@ struct dmirror_chunk { struct dmirror_device { struct cdev cdevice; struct hmm_devmem *devmem; + unsigned intzone_device_type; unsigned intdevmem_capacity; unsigned intdevmem_count; @@ -1026,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror, return ret; } +static int dmirror_get_device_type(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + mutex_lock(&dmirror->mutex); + cmd->zone_device_type = dmirror->mdevice->zone_device_type; + mutex_unlock(&dmirror->mutex); + + return 0; +} static long dmirror_fops_unlocked_ioctl(struct file *filp, unsigned int command, unsigned long arg) @@ -1076,6 +1086,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp, ret = dmirror_snapshot(dmirror, &cmd); break; + case HMM_DMIRROR_GET_MEM_DEV_TYPE: + ret = dmirror_get_device_type(dmirror, &cmd); + break; default: return -EINVAL; } @@ -1260,14 +1273,20 @@ static void dmirror_device_remove(struct dmirror_device *mdevice) static int __init hmm_dmirror_init(void) { int ret; - int id; + int id = 0; + int ndevices = 0; ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES, "HMM_DMIRROR"); if (ret) goto err_unreg; - for (id = 0; id < DMIRROR_NDEVICES; id++) { + memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0])); + dmirror_devices[ndevices++].zone_device_type = + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE; + dmirror_devices[ndevices++].zone_device_type = + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE; + for (id = 0; id < ndevices; id++) { ret = 
dmirror_device_init(dmirror_devices + id, id); if (ret) goto err_chrdev; diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h index f14dea5dcd062b..17f842f1aa02c7 100644 --- a/lib/test_hmm_uapi.h +++ b/lib/test_hmm_uapi.h @@ -19,6 +19,7 @@ * @npages: (in) number of pages to read/write * @cpages: (out) number of pages copied * @faults: (out) number of device page faults seen + * @zone_device_type: (out) zone device memory type */ struct hmm_dmirror_cmd { __u64 addr; @@ -26,6 +27,7 @@ struct hmm_dmirror_cmd { __u64 npages; __u64 cpages; __u64 faults; + __u64 zone_device_type; }; /* Expose the address space of the calling process through hmm device file */ @@ -35,6 +37,7 @@ struct hmm_dmirror_cmd { #define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd) #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd) #define HMM_DMIRROR_CHECK_EXCLUSIVE_IOWR('H', 0x05, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd) /* * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT. @@ -62,4 +65,9 @@ enum { HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30, }; +enum { + /* 0 is reserved to catch uninitialized type fields */ + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1, +}; + #endif /* _LIB_TEST_HMM_UAPI_H */ -- 2.30.2
[PATCH 19/27] drm/amdkfd: coherent type as sys mem on migration to ram
From: Alex Sierra

When migrating from VRAM back to system RAM, device coherent memory is
accessed by the CPU in much the same way as system RAM. The migrate
flags select the source pages of the migration, so in the coherent case
they must be set to MIGRATE_VMA_SELECT_DEVICE_COHERENT.

Signed-off-by: Alex Sierra
Reviewed-by: Felix Kuehling
Signed-off-by: Christoph Hellwig
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 2c51f2ac3b46ac..6646291d75d574 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -659,9 +659,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 	migrate.vma = vma;
 	migrate.start = start;
 	migrate.end = end;
-	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
+	if (adev->gmc.xgmi.connected_to_cpu)
+		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+	else
+		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 
 	size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
 	size *= npages;
 	buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
--
2.30.2
[PATCH 18/27] drm/amdkfd: add SPM support for SVM
From: Alex Sierra When the CPU is connected through XGMI, it has coherent access to the VRAM resource. In this case that resource is taken from a table in the device gmc aperture base. This resource is used along with the device type, which can be DEVICE_PRIVATE or DEVICE_COHERENT, to create the device page map region. Signed-off-by: Alex Sierra Reviewed-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 28 ++-- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index e27ca375876230..2c51f2ac3b46ac 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -933,7 +933,7 @@ int svm_migrate_init(struct amdgpu_device *adev) { struct kfd_dev *kfddev = adev->kfd.dev; struct dev_pagemap *pgmap; - struct resource *res; + struct resource *res = NULL; unsigned long size; void *r; @@ -948,28 +948,34 @@ int svm_migrate_init(struct amdgpu_device *adev) * should remove reserved size */ size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20); - res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); - if (IS_ERR(res)) - return -ENOMEM; + if (adev->gmc.xgmi.connected_to_cpu) { + pgmap->range.start = adev->gmc.aper_base; + pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1; + pgmap->type = MEMORY_DEVICE_COHERENT; + } else { + res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); + if (IS_ERR(res)) + return -ENOMEM; + pgmap->range.start = res->start; + pgmap->range.end = res->end; + pgmap->type = MEMORY_DEVICE_PRIVATE; + } - pgmap->type = MEMORY_DEVICE_PRIVATE; pgmap->nr_range = 1; - pgmap->range.start = res->start; - pgmap->range.end = res->end; pgmap->ops = &svm_migrate_pgmap_ops; pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev); - pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; - + pgmap->flags = 0; /* Device manager releases device-specific resources, memory region 
and * pgmap when driver disconnects from device. */ r = devm_memremap_pages(adev->dev, pgmap); if (IS_ERR(r)) { pr_err("failed to register HMM device memory\n"); - /* Disable SVM support capability */ pgmap->type = 0; - devm_release_mem_region(adev->dev, res->start, resource_size(res)); + if (pgmap->type == MEMORY_DEVICE_PRIVATE) + devm_release_mem_region(adev->dev, res->start, + res->end - res->start + 1); return PTR_ERR(r); } -- 2.30.2
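The range arithmetic in the patch above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not code from the series: `struct model_range`, `model_coherent_range()` and `model_range_len()` are hypothetical names. It shows that the pagemap range uses an inclusive end (`aper_base + aper_size - 1`) and that the `res->end - res->start + 1` expression in the error path is the length of such an inclusive range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a pagemap range; both bounds are inclusive. */
struct model_range {
	uint64_t start;
	uint64_t end;
};

/* Coherent case: the range spans the whole aperture, as in the patch. */
static struct model_range model_coherent_range(uint64_t aper_base,
					       uint64_t aper_size)
{
	struct model_range range;

	assert(aper_size > 0);
	range.start = aper_base;
	range.end = aper_base + aper_size - 1;
	return range;
}

/* Length of an inclusive range, i.e. end - start + 1. */
static uint64_t model_range_len(struct model_range range)
{
	return range.end - range.start + 1;
}
```

With these helpers, the length of the range built from a given aperture is the aperture size again, which is the property the inclusive end is meant to preserve.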
[PATCH 17/27] mm/gup: fail get_user_pages for LONGTERM dev coherent type
From: Alex Sierra

Avoid long-term pinning of device coherent pages, as a long-term pin
could interfere with the device's own memory manager. For now, just
return an error for FOLL_LONGTERM pins of device coherent pages.
Eventually these pages will be migrated to system memory instead, once
support for migrating device pages is added.

Signed-off-by: Alex Sierra
Acked-by: Felix Kuehling
Reviewed-by: Alistair Popple
[hch: rebased on previous cleanups, split the two checks]
Signed-off-by: Christoph Hellwig
---
 mm/gup.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 37d6c24ca71225..39b23ad39a7bde 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1881,6 +1881,19 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 			continue;
 		prev_head = head;
 
+		/*
+		 * Device private pages will get faulted in during gup so it
+		 * shouldn't be possible to see one here.
+		 */
+		if (WARN_ON_ONCE(is_device_private_page(head))) {
+			ret = -EFAULT;
+			goto unpin_pages;
+		}
+		if (is_device_coherent_page(head)) {
+			ret = -EFAULT;
+			goto unpin_pages;
+		}
+
 		if (is_pinnable_page(head))
 			continue;
 
@@ -1925,7 +1938,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 			put_page(pages[i]);
 	}
 
-	if (!ret && 0) {
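The per-page decision added by this patch can be modelled in user space roughly as follows. This is a simplified sketch, not the kernel code: `enum pin_page_kind` and `pin_check_pages()` are hypothetical names, and the model ignores compound pages, isolation errors and the actual unpinning. It only shows that device private and device coherent pages abort the whole request with -EFAULT before any page is queued for migration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical page classification; the kernel inspects struct page. */
enum pin_page_kind {
	PIN_PAGE_PINNABLE,
	PIN_PAGE_MOVABLE,
	PIN_PAGE_DEVICE_PRIVATE,
	PIN_PAGE_DEVICE_COHERENT,
};

/*
 * Returns -EFAULT as soon as a device private or device coherent page is
 * seen, otherwise the number of pages that would still need migration.
 */
static long pin_check_pages(const enum pin_page_kind *pages, size_t nr_pages)
{
	long to_migrate = 0;
	size_t i;

	assert(pages != NULL || nr_pages == 0);
	for (i = 0; i < nr_pages; i++) {
		switch (pages[i]) {
		case PIN_PAGE_DEVICE_PRIVATE:
		case PIN_PAGE_DEVICE_COHERENT:
			/* Device pages fail the request outright. */
			return -EFAULT;
		case PIN_PAGE_PINNABLE:
			/* Already in a pinnable zone, nothing to do. */
			break;
		case PIN_PAGE_MOVABLE:
			/* Would be isolated and migrated before pinning. */
			to_migrate++;
			break;
		}
	}
	return to_migrate;
}
```

In the patch itself the error path additionally skips migrate_pages() because `ret` is already set, which the early return stands in for here.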
[PATCH 16/27] mm: add device coherent vma selection for memory migration
From: Alex Sierra

Add MIGRATE_VMA_SELECT_DEVICE_COHERENT, which is used to migrate pages
from device coherent memory back to system memory. Device coherent
memory is cache coherent from both the device and the CPU point of
view.

Signed-off-by: Alex Sierra
Acked-by: Felix Kuehling
Reviewed-by: Alistair Popple
Signed-off-by: Christoph Hellwig
---
 include/linux/migrate.h |  1 +
 mm/migrate_device.c     | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index db96e10eb8da22..66a34eae8cb635 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,6 +130,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index bfd66e7d830b02..0b295594e7626d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -147,15 +147,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_writable_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
-			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
-				goto next;
 			pfn = pte_pfn(pte);
-			if (is_zero_pfn(pfn)) {
+			if (is_zero_pfn(pfn) &&
+			    (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
 				migrate->cpages++;
 				goto next;
 			}
 			page = vm_normal_page(migrate->vma, addr, pte);
+			if (page && !is_zone_device_page(page) &&
+			    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+				goto next;
+			else if (page && is_device_coherent_page(page) &&
+			    (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+			     page->pgmap->owner != migrate->pgmap_owner))
+				goto next;
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
--
2.30.2
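The new filtering rules for present PTEs can be summarised with a small user-space model. This is a sketch under stated assumptions rather than kernel code: the flag values are copied from the patch, but `struct model_page`, `enum model_page_kind` and `model_collects_present_page()` are hypothetical, and the zero-pfn and NULL-page cases are left out. A system page is collected only when MIGRATE_VMA_SELECT_SYSTEM is set; a device coherent page is collected only when MIGRATE_VMA_SELECT_DEVICE_COHERENT is set and the pgmap owner matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined by the patch in enum migrate_vma_direction. */
enum migrate_vma_direction {
	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
};

/* Hypothetical stand-in for struct page; not a kernel type. */
enum model_page_kind {
	MODEL_PAGE_SYSTEM,
	MODEL_PAGE_DEVICE_COHERENT,
};

struct model_page {
	enum model_page_kind kind;
	const void *pgmap_owner;	/* only meaningful for device pages */
};

/* Returns true when a present PTE's page would be collected for migration. */
static bool model_collects_present_page(const struct model_page *page,
					unsigned int flags,
					const void *pgmap_owner)
{
	assert(page != NULL);

	if (page->kind == MODEL_PAGE_SYSTEM)
		return (flags & MIGRATE_VMA_SELECT_SYSTEM) != 0;

	/* Device coherent: needs the matching flag and the matching owner. */
	return (flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) != 0 &&
	       page->pgmap_owner == pgmap_owner;
}
```

The owner comparison is what keeps one driver from collecting another driver's coherent pages, mirroring the existing check for device private entries.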
[PATCH 13/27] mm: move the migrate_vma_* device migration code into its own file
Split the code used to migrate to and from ZONE_DEVICE memory from migrate.c into a new file. Signed-off-by: Christoph Hellwig --- mm/Kconfig | 3 + mm/Makefile | 1 + mm/migrate.c| 753 --- mm/migrate_device.c | 765 4 files changed, 769 insertions(+), 753 deletions(-) create mode 100644 mm/migrate_device.c diff --git a/mm/Kconfig b/mm/Kconfig index a1901ae6d06293..6391d8d3a616f3 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -249,6 +249,9 @@ config MIGRATION pages as migration can relocate pages to satisfy a huge page allocation instead of reclaiming. +config DEVICE_MIGRATION + def_bool MIGRATION && DEVICE_PRIVATE + config ARCH_ENABLE_HUGEPAGE_MIGRATION bool diff --git a/mm/Makefile b/mm/Makefile index 70d4309c9ce338..4cc13f3179a518 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o +obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o diff --git a/mm/migrate.c b/mm/migrate.c index 746e1230886ddb..c31d04b46a5e17 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -38,12 +38,10 @@ #include #include #include -#include #include #include #include #include -#include #include #include #include @@ -2125,757 +2123,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_NUMA */ -#ifdef CONFIG_DEVICE_PRIVATE -static int migrate_vma_collect_skip(unsigned long start, - unsigned long end, - struct mm_walk *walk) -{ - struct migrate_vma *migrate = walk->private; - unsigned long addr; - - for (addr = start; addr < end; addr += PAGE_SIZE) { - migrate->dst[migrate->npages] = 0; - migrate->src[migrate->npages++] = 0; - } - - return 0; -} - -static int migrate_vma_collect_hole(unsigned long start, - unsigned long 
end, - __always_unused int depth, - struct mm_walk *walk) -{ - struct migrate_vma *migrate = walk->private; - unsigned long addr; - - /* Only allow populating anonymous memory. */ - if (!vma_is_anonymous(walk->vma)) - return migrate_vma_collect_skip(start, end, walk); - - for (addr = start; addr < end; addr += PAGE_SIZE) { - migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; - migrate->dst[migrate->npages] = 0; - migrate->npages++; - migrate->cpages++; - } - - return 0; -} - -static int migrate_vma_collect_pmd(pmd_t *pmdp, - unsigned long start, - unsigned long end, - struct mm_walk *walk) -{ - struct migrate_vma *migrate = walk->private; - struct vm_area_struct *vma = walk->vma; - struct mm_struct *mm = vma->vm_mm; - unsigned long addr = start, unmapped = 0; - spinlock_t *ptl; - pte_t *ptep; - -again: - if (pmd_none(*pmdp)) - return migrate_vma_collect_hole(start, end, -1, walk); - - if (pmd_trans_huge(*pmdp)) { - struct page *page; - - ptl = pmd_lock(mm, pmdp); - if (unlikely(!pmd_trans_huge(*pmdp))) { - spin_unlock(ptl); - goto again; - } - - page = pmd_page(*pmdp); - if (is_huge_zero_page(page)) { - spin_unlock(ptl); - split_huge_pmd(vma, pmdp, addr); - if (pmd_trans_unstable(pmdp)) - return migrate_vma_collect_skip(start, end, - walk); - } else { - int ret; - - get_page(page); - spin_unlock(ptl); - if (unlikely(!trylock_page(page))) - return migrate_vma_collect_skip(start, end, - walk); - ret = split_huge_page(page); - unlock_page(page); - put_page(page); - if (ret) - return migrate_vma_collect_skip(start, end, - walk); -
[PATCH 15/27] mm: add zone device coherent type memory support
From: Alex Sierra Add support for device memory that is cache coherent from both the device and the CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple [hch: rebased on top of the refcount changes, removed is_dev_private_or_coherent_page] Signed-off-by: Christoph Hellwig --- include/linux/memremap.h | 14 ++ mm/memcontrol.c | 7 --- mm/memory-failure.c | 8 ++-- mm/memremap.c| 10 ++ mm/migrate_device.c | 16 +++- mm/rmap.c| 5 +++-- 6 files changed, 44 insertions(+), 16 deletions(-) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index d6a114dd5ea8b7..eb73630a49da39 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -41,6 +41,13 @@ struct vmem_altmap { * A more complete discussion of unaddressable memory may be found in * include/linux/hmm.h and Documentation/vm/hmm.rst. * + * MEMORY_DEVICE_COHERENT: + * Device memory that is cache coherent from device and CPU point of view. This + * is used on platforms that have an advanced system bus (like CAPI or CXL). A + * driver can hotplug the device memory using ZONE_DEVICE and with that memory + * type. Any page of a process can be migrated to such memory. However no one + * should be allowed to pin such memory so that it can always be evicted. + * * MEMORY_DEVICE_FS_DAX: * Host memory that has similar access semantics as System RAM i.e. DMA * coherent and supports page pinning. 
In support of coordinating page @@ -61,6 +68,7 @@ struct vmem_altmap { enum memory_type { /* 0 is reserved to catch uninitialized type fields */ MEMORY_DEVICE_PRIVATE = 1, + MEMORY_DEVICE_COHERENT, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, MEMORY_DEVICE_PCI_P2PDMA, @@ -138,6 +146,12 @@ static inline bool is_device_private_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PRIVATE; } +static inline bool is_device_coherent_page(const struct page *page) +{ + return is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_COHERENT; +} + static inline bool is_pci_p2pdma_page(const struct page *page) { return IS_ENABLED(CONFIG_PCI_P2PDMA) && diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 510cbfb82bb62a..10259c35fde20d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5687,8 +5687,8 @@ static int mem_cgroup_move_account(struct page *page, * 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a * target for charge migration. if @target is not NULL, the entry is stored * in target->ent. - * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE - * (so ZONE_DEVICE page and thus not on the lru). + * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is device memory and + * thus not on the lru. * For now we such page is charge like a regular page would be as for all * intent and purposes it is just special memory taking the place of a * regular page. 
@@ -5722,7 +5722,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, */ if (page_memcg(page) == mc.from) { ret = MC_TARGET_PAGE; - if (is_device_private_page(page)) + if (is_device_private_page(page) || + is_device_coherent_page(page)) ret = MC_TARGET_DEVICE; if (target) target->page = page; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 97a9ed8f87a96a..f498ed3ece79ae 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1617,12 +1617,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, goto unlock; } - if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + switch (pgmap->type) { + case MEMORY_DEVICE_PRIVATE: + case MEMORY_DEVICE_COHERENT: /* -* TODO: Handle HMM pages which may need coordination +* TODO: Handle device pages which may need coordination * with device-side memory. */ goto unlock; + default: + break; } /* diff --git a/mm/memremap.c b/mm/memremap.c index e00ffcdba7b632..d00bb21a0630cd 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -313,6 +313,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) return ERR_PTR(-EINVAL); } break; + case MEMORY_DEVICE_COHERENT: +
[PATCH 14/27] mm: build migrate_vma_* for all configs with ZONE_DEVICE support
This code will also be used for device coherent memory shortly, so
relax the dependency from DEVICE_PRIVATE to ZONE_DEVICE.

Signed-off-by: Christoph Hellwig
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6391d8d3a616f3..95d4aa3acaefe0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -250,7 +250,7 @@ config MIGRATION
 	  allocation instead of reclaiming.
 
 config DEVICE_MIGRATION
-	def_bool MIGRATION && DEVICE_PRIVATE
+	def_bool MIGRATION && ZONE_DEVICE
 
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
--
2.30.2
[PATCH 12/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_pages
Make the flow a little more clear and prepare for adding a new ZONE_DEVICE memory type. Signed-off-by: Christoph Hellwig --- mm/migrate.c | 27 --- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 30ecd7223656c1..746e1230886ddb 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2788,24 +2788,21 @@ void migrate_vma_pages(struct migrate_vma *migrate) mapping = page_mapping(page); - if (is_zone_device_page(newpage)) { - if (is_device_private_page(newpage)) { - /* -* For now only support private anonymous when -* migrating to un-addressable device memory. -*/ - if (mapping) { - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; - continue; - } - } else { - /* -* Other types of ZONE_DEVICE page are not -* supported. -*/ + if (is_device_private_page(newpage)) { + /* +* For now only support private anonymous when migrating +* to un-addressable device memory. +*/ + if (mapping) { migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; continue; } + } else if (is_zone_device_page(newpage)) { + /* +* Other types of ZONE_DEVICE page are not supported. +*/ + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; + continue; } r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY); -- 2.30.2
[PATCH 11/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page
Make the flow a little more clear and prepare for adding a new ZONE_DEVICE memory type. Signed-off-by: Christoph Hellwig --- mm/migrate.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 8e0370a73f8a43..30ecd7223656c1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2670,26 +2670,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, */ __SetPageUptodate(page); - if (is_zone_device_page(page)) { - if (is_device_private_page(page)) { - swp_entry_t swp_entry; + if (is_device_private_page(page)) { + swp_entry_t swp_entry; - if (vma->vm_flags & VM_WRITE) - swp_entry = make_writable_device_private_entry( - page_to_pfn(page)); - else - swp_entry = make_readable_device_private_entry( - page_to_pfn(page)); - entry = swp_entry_to_pte(swp_entry); - } else { - /* -* For now we only support migrating to un-addressable -* device memory. -*/ + if (vma->vm_flags & VM_WRITE) + swp_entry = make_writable_device_private_entry( + page_to_pfn(page)); + else + swp_entry = make_readable_device_private_entry( + page_to_pfn(page)); + entry = swp_entry_to_pte(swp_entry); + } else { + /* +* For now we only support migrating to un-addressable device +* memory. +*/ + if (is_zone_device_page(page)) { pr_warn_once("Unsupported ZONE_DEVICE page type.\n"); goto abort; } - } else { entry = mk_pte(page, vma->vm_page_prot); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); -- 2.30.2
[PATCH 10/27] mm: refactor check_and_migrate_movable_pages
Remove up to two levels of indentation by using continue statements and move variables to local scope where possible. Signed-off-by: Christoph Hellwig --- mm/gup.c | 81 ++-- 1 file changed, 44 insertions(+), 37 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index a9d4d724aef749..37d6c24ca71225 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1868,72 +1868,79 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages, unsigned int gup_flags) { - unsigned long i; - unsigned long isolation_error_count = 0; - bool drain_allow = true; - LIST_HEAD(movable_page_list); - long ret = 0; + unsigned long isolation_error_count = 0, i; struct page *prev_head = NULL; - struct page *head; - struct migration_target_control mtc = { - .nid = NUMA_NO_NODE, - .gfp_mask = GFP_USER | __GFP_NOWARN, - }; + LIST_HEAD(movable_page_list); + bool drain_allow = true; + int ret = 0; for (i = 0; i < nr_pages; i++) { - head = compound_head(pages[i]); + struct page *head = compound_head(pages[i]); + if (head == prev_head) continue; prev_head = head; + + if (is_pinnable_page(head)) + continue; + /* -* If we get a movable page, since we are going to be pinning -* these entries, try to move them out if possible. +* Try to move out any movable page before pinning the range. 
*/ - if (!is_pinnable_page(head)) { - if (PageHuge(head)) { - if (!isolate_huge_page(head, &movable_page_list)) - isolation_error_count++; - } else { - if (!PageLRU(head) && drain_allow) { - lru_add_drain_all(); - drain_allow = false; - } + if (PageHuge(head)) { + if (!isolate_huge_page(head, &movable_page_list)) + isolation_error_count++; + continue; + } - if (isolate_lru_page(head)) { - isolation_error_count++; - continue; - } - list_add_tail(&head->lru, &movable_page_list); - mod_node_page_state(page_pgdat(head), - NR_ISOLATED_ANON + - page_is_file_lru(head), - thp_nr_pages(head)); - } + if (!PageLRU(head) && drain_allow) { + lru_add_drain_all(); + drain_allow = false; + } + + if (isolate_lru_page(head)) { + isolation_error_count++; + continue; } + list_add_tail(&head->lru, &movable_page_list); + mod_node_page_state(page_pgdat(head), + NR_ISOLATED_ANON + page_is_file_lru(head), + thp_nr_pages(head)); } + if (!list_empty(&movable_page_list) || isolation_error_count) + goto unpin_pages; + /* * If list is empty, and no isolation errors, means that all pages are * in the correct zone. */ - if (list_empty(&movable_page_list) && !isolation_error_count) - return nr_pages; + return nr_pages; +unpin_pages: if (gup_flags & FOLL_PIN) { unpin_user_pages(pages, nr_pages); } else { for (i = 0; i < nr_pages; i++) put_page(pages[i]); } + if (!list_empty(&movable_page_list)) { + struct migration_target_control mtc = { + .nid = NUMA_NO_NODE, + .gfp_mask = GFP_USER | __GFP_NOWARN, + }; + ret = migrate_pages(&movable_page_list, alloc_migration_target, NULL, (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN, NULL); - if (ret && !list_empty(&movable_page_list)) - putback_movable_pages(&movable_page_list); + if (ret > 0) /* number of pages not migrated */ +
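A rough userspace sketch of the early-continue loop shape this patch introduces, under the assumption that only the control flow matters here. The `model_page` struct and `model_count_isolation_candidates()` are hypothetical; the real loop isolates pages rather than counting them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page and the two properties the loop tests. */
struct model_page {
	int head;      /* id of the compound head this page belongs to */
	bool pinnable; /* models is_pinnable_page(head) */
};

/*
 * Counts the heads the loop would try to isolate: consecutive pages that
 * share a head are visited once, and pinnable heads are skipped early
 * instead of nesting the remaining work in an if block.
 */
static size_t model_count_isolation_candidates(const struct model_page *pages,
					       size_t nr_pages)
{
	int prev_head = -1; /* models prev_head = NULL */
	size_t candidates = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		int head = pages[i].head;

		assert(head >= 0);
		if (head == prev_head)
			continue;
		prev_head = head;

		if (pages[i].pinnable)
			continue;

		candidates++;
	}
	return candidates;
}
```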
[PATCH 09/27] mm: generalize the pgmap based page_free infrastructure
Key off on the existence of ->page_free to prepare for adding support for more pgmap types that are device managed and thus need the free callback. Signed-off-by: Christoph Hellwig --- mm/memremap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index fef5734d5e4933..e00ffcdba7b632 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap); void free_zone_device_page(struct page *page) { - if (WARN_ON_ONCE(!is_device_private_page(page))) + if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free)) return; __ClearPageWaiters(page); @@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page) mem_cgroup_uncharge(page_folio(page)); /* -* When a device_private page is freed, the page->mapping field +* When a device managed page is freed, the page->mapping field * may still contain a (stale) mapping value. For example, the * lower bits of page->mapping may still identify the page as an * anonymous page. Ultimately, this entire field is just stale -- 2.30.2
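A small userspace sketch of the guard this patch switches to, assuming cut-down stand-ins for struct dev_pagemap and struct dev_pagemap_ops. All `model_*` names are hypothetical; the point is only that the check looks at whether a callback exists, not at the memory type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page;

/* Hypothetical, cut-down stand-ins for dev_pagemap_ops / dev_pagemap. */
struct model_pgmap_ops {
	void (*page_free)(struct model_page *page);
};

struct model_pgmap {
	int type;                          /* memory type, unused by the guard */
	const struct model_pgmap_ops *ops; /* may be NULL */
};

struct model_page {
	struct model_pgmap *pgmap;
	bool freed;
};

static void model_driver_page_free(struct model_page *page)
{
	page->freed = true;
}

/*
 * Models the new guard in free_zone_device_page(): bail out unless the
 * pgmap supplies a ->page_free callback, whatever its memory type is.
 * Returns true if the callback ran.
 */
static bool model_free_zone_device_page(struct model_page *page)
{
	assert(page->pgmap != NULL);

	if (!page->pgmap->ops || !page->pgmap->ops->page_free)
		return false;

	page->pgmap->ops->page_free(page);
	return true;
}
```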
[PATCH 08/27] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX support, as one of the two is required at runtime for fsdax code to actually work. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe --- fs/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/Kconfig b/fs/Kconfig index e9433bbc48010a..7f2455e8e18ae2 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,6 +48,7 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) + depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX help -- 2.30.2
[PATCH 07/27] mm: remove the extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Ralph Campbell Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Acked-by: Felix Kuehling --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - fs/Kconfig | 1 - include/linux/memremap.h | 12 +++-- include/linux/mm.h | 6 +-- lib/test_hmm.c | 1 - mm/Kconfig | 4 -- mm/internal.h| 2 + mm/memcontrol.c | 11 ++--- mm/memremap.c| 57 mm/migrate.c | 6 --- mm/swap.c| 16 ++- 13 files changed, 36 insertions(+), 83 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index e414ca44839fd1..8b6438fa18fc2b 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) dpage = pfn_to_page(uvmem_pfn); dpage->zone_device_data = pvt; - get_page(dpage); lock_page(dpage); return dpage; out_clear: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index cb835f95a76e66..e27ca375876230 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn) page = pfn_to_page(pfn); svm_range_bo_ref(prange->svm_bo); page->zone_device_data = prange->svm_bo; - get_page(page); lock_page(page); } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index a5cdfbe32b5e54..7ba66ad68a8a1e 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) return NULL; } - get_page(page); lock_page(page); return page; } diff --git a/fs/Kconfig b/fs/Kconfig index 6c7dc1387beb0f..e9433bbc48010a 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,7 +48,6 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) - select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED) select FS_IOMAP select DAX help diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 514ab46f597e5c..d6a114dd5ea8b7 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -68,9 +68,9 @@ enum memory_type { struct dev_pagemap_ops { /* -* Called once the page refcount reaches 1. (ZONE_DEVICE pages never -* reach 0 refcount unless there is a refcount bug. This allows the -* device driver to implement its own memory management.) +* Called once the page refcount reaches 0. The reference count will be +* reset to one by the core code after the method is called to prepare +* for handing out the page again. 
*/ void (*page_free)(struct page *page); @@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) static inline bool is_device_private_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + return IS_ENABLED(CONFIG_DEVICE_PRIVATE) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PRIVATE; } static inline bool is_pci_p2pdma_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_PCI_P2PDMA) && + return IS_ENABLED(CONFIG_PCI_P2PDMA) && is_zone_device_page(page
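A userspace sketch of the refcount lifecycle described by the updated ->page_free comment in this patch: the callback runs when the count reaches 0 and the core then resets it to 1 so the driver can hand the page out again without an extra get_page(). The `model_*` names are hypothetical, and the actual mm/memremap.c implementation is cut off in this excerpt, so treat this as an illustration of the comment only.

```c
#include <assert.h>

/* Hypothetical stand-in for a device managed ZONE_DEVICE page. */
struct model_page {
	int refcount;
	int free_calls; /* how often the driver's page_free ran */
};

/* Models the driver's ->page_free callback. */
static void model_page_free(struct model_page *page)
{
	page->free_calls++;
}

static void model_get_page(struct model_page *page)
{
	assert(page->refcount > 0);
	page->refcount++;
}

/*
 * Models the 0-based scheme: the last put drops the count to 0, the
 * driver callback runs, and the count is reset to 1 so that the next
 * allocation needs no extra reference.
 */
static void model_put_page(struct model_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount != 0)
		return;

	model_page_free(page);
	page->refcount = 1;
}
```

With this scheme an idle page and a freshly allocated page both sit at refcount 1, which matches the removal of the get_page() calls from the driver allocation paths in the diff.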
[PATCH 06/27] mm: don't include <linux/memremap.h> in <linux/mm.h>
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Acked-by: Felix Kuehling --- arch/arm64/mm/mmu.c| 1 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/drm_cache.c| 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + drivers/gpu/drm/nouveau/nouveau_svm.c | 1 + drivers/infiniband/core/rw.c | 1 + drivers/nvdimm/pmem.h | 1 + drivers/nvme/host/pci.c| 1 + drivers/nvme/target/io-cmd-bdev.c | 1 + fs/fuse/virtio_fs.c| 1 + include/linux/memremap.h | 18 ++ include/linux/mm.h | 20 lib/test_hmm.c | 1 + mm/memcontrol.c| 1 + mm/memremap.c | 6 +- 15 files changed, 35 insertions(+), 22 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index acfae9b41cc8c9..580abae6c0b93f 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index ea68f3b3a4e9cb..6d643b4b791d87 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -25,6 +25,7 @@ #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c index f19d9acbe95936..50b8a088f763a6 100644 --- a/drivers/gpu/drm/drm_cache.c +++ b/drivers/gpu/drm/drm_cache.c @@ -27,11 +27,11 @@ /* * Authors: Thomas Hellström */ - #include #include #include #include +#include #include #include diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index e886a3b9e08c7d..a5cdfbe32b5e54 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include #include /* diff --git 
a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 266809e511e2c1..090b9b47708cca 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include struct nouveau_svm { diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index 5a3bd41b331c93..4d98f931a13ddd 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -2,6 +2,7 @@ /* * Copyright (c) 2016 HGST, a Western Digital Company. */ +#include #include #include #include diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 59cfe13ea8a85c..1f51a23614299b 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -3,6 +3,7 @@ #define __NVDIMM_PMEM_H__ #include #include +#include #include #include #include diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 6a99ed68091589..ab15bc72710dbe 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 70ca9dfc1771a9..a141446db1bea3 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -6,6 +6,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include #include +#include #include #include "nvmet.h" diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 9d737904d07c0b..86b7dbb6a0d43e 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 1fafcc38acbad6..514ab46f597e5c 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -1,6 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_MEMREMAP_H_ #define _LINUX_MEMREMAP_H_ + +#include #include #include #include @@ -129,6 +131,22 @@ 
static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) return 1 << pgmap->vmemmap_shift; } +static inline bool is_device_private_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PRIVATE; +} + +static inline bool is_pci_p2pdma_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_P
[PATCH 05/27] mm: simplify freeing of devmap managed pages
Make put_devmap_managed_page return if it took charge of the page or not and remove the separate page_is_devmap_managed helper. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Dan Williams --- include/linux/mm.h | 34 ++ mm/memremap.c | 20 +--- mm/swap.c | 10 +- 3 files changed, 20 insertions(+), 44 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 91dd0bc786a9ec..26baadcef4556b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct page *page) #ifdef CONFIG_DEV_PAGEMAP_OPS DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool page_is_devmap_managed(struct page *page) +bool __put_devmap_managed_page(struct page *page); +static inline bool put_devmap_managed_page(struct page *page) { if (!static_branch_unlikely(&devmap_managed_key)) return false; if (!is_zone_device_page(page)) return false; - switch (page->pgmap->type) { - case MEMORY_DEVICE_PRIVATE: - case MEMORY_DEVICE_FS_DAX: - return true; - default: - break; - } - return false; + if (page->pgmap->type != MEMORY_DEVICE_PRIVATE && + page->pgmap->type != MEMORY_DEVICE_FS_DAX) + return false; + return __put_devmap_managed_page(page); } -void put_devmap_managed_page(struct page *page); - #else /* CONFIG_DEV_PAGEMAP_OPS */ -static inline bool page_is_devmap_managed(struct page *page) +static inline bool put_devmap_managed_page(struct page *page) { return false; } - -static inline void put_devmap_managed_page(struct page *page) -{ -} #endif /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool is_device_private_page(const struct page *page) @@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page) struct folio *folio = page_folio(page); /* -* For devmap managed pages we need to catch refcount transition from -* 2 to 1, when refcount reach one it means the page is free and we -* need to inform the device driver 
through callback. See -* include/linux/memremap.h and HMM for details. +* For some devmap managed pages we need to catch refcount transition +* from 2 to 1: */ - if (page_is_devmap_managed(&folio->page)) { - put_devmap_managed_page(&folio->page); + if (put_devmap_managed_page(&folio->page)) return; - } - folio_put(folio); } diff --git a/mm/memremap.c b/mm/memremap.c index 55d23e9f5c04ec..f41233a67edb12 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page) page->pgmap->ops->page_free(page); } -void put_devmap_managed_page(struct page *page) +bool __put_devmap_managed_page(struct page *page) { - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - /* * devmap page refcounts are 1-based, rather than 0-based: if * refcount is 1, then the page is free and the refcount is * stable because nobody holds a reference on the page. */ - if (count == 1) + switch (page_ref_dec_return(page)) { + case 1: free_devmap_managed_page(page); - else if (!count) + break; + case 0: __put_page(page); + break; + } + return true; } -EXPORT_SYMBOL(put_devmap_managed_page); +EXPORT_SYMBOL(__put_devmap_managed_page); #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index 08058f74cae23e..25b55c56614311 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr) unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - /* -* ZONE_DEVICE pages that return 'false' from -* page_is_devmap_managed() do not require special -* processing, and instead, expect a call to -* put_page_testzero(). -*/ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); + if (put_devmap_managed_page(page)) continue; - } if (put_page_testzero(page)) put_dev_pagemap(page->pgmap); continue; -- 2.30.2
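A userspace sketch of the calling convention this patch introduces, where the put helper reports whether it took charge of the page so callers need a single branch. It models the 1-based refcount that still applies at this point in the series. The `model_*` names and the flag fields are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page that may be devmap managed. */
struct model_page {
	bool devmap_managed;   /* models the pgmap type check */
	int refcount;
	bool handed_to_driver; /* set when the free-to-driver path ran */
	bool freed_normally;   /* set when the regular free path ran */
};

/*
 * Models put_devmap_managed_page() after the patch: drop a reference and
 * return true if the page was devmap managed, in which case the caller
 * must not touch the refcount again.
 */
static bool model_put_devmap_managed_page(struct model_page *page)
{
	if (!page->devmap_managed)
		return false;

	assert(page->refcount > 0);
	switch (--page->refcount) {
	case 1: /* 1-based: refcount 1 means the page is free */
		page->handed_to_driver = true;
		break;
	case 0:
		page->freed_normally = true;
		break;
	}
	return true;
}

/* Models put_page(): one call, one branch. */
static void model_put_page(struct model_page *page)
{
	if (model_put_devmap_managed_page(page))
		return;

	assert(page->refcount > 0);
	if (--page->refcount == 0)
		page->freed_normally = true;
}
```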
[PATCH 03/27] mm: remove pointless includes from <linux/hmm.h>
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 + drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + include/linux/hmm.h | 9 ++--- lib/test_hmm.c | 2 ++ 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index ed5385137f4831..cb835f95a76e66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_object.h" #include "amdgpu_vm.h" diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 3828aafd3ac46f..e886a3b9e08c7d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include /* * FIXME: this is ugly right now we are using TTM to allocate vram and we pin diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 2fd2e91d5107c0..d5a6f101f843e6 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -9,14 +9,9 @@ #ifndef LINUX_HMM_H #define LINUX_HMM_H -#include -#include +#include -#include -#include -#include -#include -#include +struct mmu_interval_notifier; /* * On output: diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 767538089a62e4..396beee6b061d4 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -26,6 +26,8 @@ #include #include #include +#include +#include #include "test_hmm_uapi.h" -- 2.30.2
[PATCH 04/27] mm: move free_devmap_managed_page to memremap.c
free_devmap_managed_page has nothing to do with the code in swap.c, move it to live with the rest of the code for devmap handling. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Muchun Song Reviewed-by: Dan Williams --- include/linux/mm.h | 1 - mm/memremap.c | 21 + mm/swap.c | 23 --- 3 files changed, 21 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7b46174989b086..91dd0bc786a9ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page *page) } #ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); static inline bool page_is_devmap_managed(struct page *page) diff --git a/mm/memremap.c b/mm/memremap.c index 5f04a0709e436e..55d23e9f5c04ec 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page) page->mapping = NULL; page->pgmap->ops->page_free(page); } + +void put_devmap_managed_page(struct page *page) +{ + int count; + + if (WARN_ON_ONCE(!page_is_devmap_managed(page))) + return; + + count = page_ref_dec_return(page); + + /* +* devmap page refcounts are 1-based, rather than 0-based: if +* refcount is 1, then the page is free and the refcount is +* stable because nobody holds a reference on the page. 
+*/ + if (count == 1) + free_devmap_managed_page(page); + else if (!count) + __put_page(page); +} +EXPORT_SYMBOL(put_devmap_managed_page); #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index bcf3ac288b56d5..08058f74cae23e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1153,26 +1153,3 @@ void __init swap_setup(void) * _really_ don't want to cluster much more */ } - -#ifdef CONFIG_DEV_PAGEMAP_OPS -void put_devmap_managed_page(struct page *page) -{ - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - - /* -* devmap page refcounts are 1-based, rather than 0-based: if -* refcount is 1, then the page is free and the refcount is -* stable because nobody holds a reference on the page. -*/ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); -} -EXPORT_SYMBOL(put_devmap_managed_page); -#endif -- 2.30.2
[PATCH 02/27] mm: remove the __KERNEL__ guard from <linux/mm.h>
__KERNEL__ ifdefs don't make sense outside of include/uapi/. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Muchun Song Reviewed-by: Dan Williams --- include/linux/mm.h | 4 1 file changed, 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b19223..7b46174989b086 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3,9 +3,6 @@ #define _LINUX_MM_H #include - -#ifdef __KERNEL__ - #include #include #include @@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start, } #endif -#endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ -- 2.30.2
start sorting out the ZONE_DEVICE refcount mess v2
Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages that are freed back to the driver owning them, which is just device private ones for now, but also the planned device coherent pages and the enhanced p2p ones pending. It does not address the fsdax pages yet, which will be attacked in a follow on series.

Note that if we want to get the p2p series rebased on top of this we'll need a git branch for this series. I could offer to host one.

A git tree is available here:

git://git.infradead.org/users/hch/misc.git pgmap-refcount

Gitweb:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

Changes since v1:
- add a missing memremap.h include in memcontrol.c
- include rebased versions of the device coherent support and device coherent migration support series as well as additional cleanup patches

Diffstat:
arch/arm64/mm/mmu.c | 1
arch/powerpc/kvm/book3s_hv_uvmem.c | 1
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 35 -
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1
drivers/gpu/drm/drm_cache.c | 2
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3
drivers/gpu/drm/nouveau/nouveau_svm.c | 1
drivers/infiniband/core/rw.c | 1
drivers/nvdimm/pmem.h | 1
drivers/nvme/host/pci.c | 1
drivers/nvme/target/io-cmd-bdev.c | 1
fs/Kconfig | 2
fs/fuse/virtio_fs.c | 1
include/linux/hmm.h | 9
include/linux/memremap.h | 36 +
include/linux/migrate.h | 1
include/linux/mm.h | 59 --
lib/test_hmm.c | 353 ++---
lib/test_hmm_uapi.h | 22
mm/Kconfig | 7
mm/Makefile | 1
mm/gup.c | 127 +++-
mm/internal.h | 3
mm/memcontrol.c | 19
mm/memory-failure.c | 8
mm/memremap.c | 75 +-
mm/migrate.c | 763
mm/migrate_device.c | 822 +++
mm/rmap.c | 5
mm/swap.c | 49 -
tools/testing/selftests/vm/Makefile | 2
tools/testing/selftests/vm/hmm-tests.c | 204 ++-
tools/testing/selftests/vm/test_hmm.sh | 24
33 files changed, 1552 insertions(+), 1088 deletions(-)
[PATCH 01/27] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove the superfluous extra check. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Muchun Song Reviewed-by: Dan Williams --- mm/memremap.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index 6aa5f0c2d11fda..5f04a0709e436e 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } break; case MEMORY_DEVICE_FS_DAX: - if (!IS_ENABLED(CONFIG_ZONE_DEVICE) || - IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { WARN(1, "File system DAX not supported\n"); return ERR_PTR(-EINVAL); } -- 2.30.2
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On Thu, Feb 10, 2022 at 01:10:47PM +1100, Alistair Popple wrote: > diff --git a/mm/gup.c b/mm/gup.c > index cbb49abb7992..8e85c9fb8df4 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2007,7 +2007,6 @@ static long check_and_migrate_movable_pages(unsigned > long nr_pages, > if (!ret && list_empty(&movable_page_list) && !isolation_error_count) > return nr_pages; > > - ret = 0; > unpin_pages: This isn't quite correct as ret is initially set to -EFAULT now. I'll fix it by removing the early ret initialization and always using the goto. I've also added another refactoring patch for this messy function. I've folded the inversion of the is_device_coherent_page check in migrate.c in as well, thanks!
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote: > > Am 2022-02-07 um 01:32 schrieb Christoph Hellwig: >> Move the check for the actual pgmap types that need the free at refcount >> one behavior into the out of line helper, and thus avoid the need to >> pull memremap.h into mm.h. >> >> Signed-off-by: Christoph Hellwig > > The amdkfd part looks good to me. > > It looks like this patch is not based on Alex Sierra's coherent memory > series. He added two new helpers is_device_coherent_page and > is_dev_private_or_coherent_page that would need to be moved along with > is_device_private_page and is_pci_p2pdma_page. FYI, here is a branch that contains a rebase of the coherent memory related patches on top of this series: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount I don't have a good way to test this, but I'll at least let the build bot finish before sending it out (probably tomorrow).
Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
On Wed, Feb 09, 2022 at 08:29:56AM -0400, Jason Gunthorpe wrote: > It is nice, but the other series are still impacted by the fsdax mess > - they still stuff pages into ptes without proper refcounts and have > to carry nonsense to dance around this problem. > > I certainly would be unhappy if the amd driver, for instance, gained > the fsdax problem as well and started pushing 4k pages into PMDs. As said before: I think this all needs to be fixed. But I'd rather fix it gradually and I think this series is a nice step forward. After that we can look at the pte mappings.
Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
On Tue, Feb 08, 2022 at 07:30:11PM -0800, Dan Williams wrote: > Interesting. I had expected that to really fix the refcount problem > that fs/dax.c would need to start taking real page references as pages > were added to a mapping, just like page cache. I think we should do that eventually. But I think this series that just attacks the device private type and extends to the device coherent and p2p enhancements is a good first step to stop the proliferation of the one off refcount and to allow dealing with the fsdax pages in another, more focused series.
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On Tue, Feb 08, 2022 at 03:53:14PM -0800, Dan Williams wrote: > Yeah, same as Logan: > > mm/memcontrol.c: In function ‘get_mctgt_type’: > mm/memcontrol.c:5724:29: error: implicit declaration of function > ‘is_device_private_page’; did you mean > ‘is_device_private_entry’? [-Werror=implicit-function-declaration] > 5724 | if (is_device_private_page(page)) > | ^~ > | is_device_private_entry > > ...needs: Yeah, the buildbot also complained. I've fixed this up locally now.
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote: > > Am 2022-02-07 um 01:32 schrieb Christoph Hellwig: >> Move the check for the actual pgmap types that need the free at refcount >> one behavior into the out of line helper, and thus avoid the need to >> pull memremap.h into mm.h. >> >> Signed-off-by: Christoph Hellwig > > The amdkfd part looks good to me. > > It looks like this patch is not based on Alex Sierra's coherent memory > series. He added two new helpers is_device_coherent_page and > is_dev_private_or_coherent_page that would need to be moved along with > is_device_private_page and is_pci_p2pdma_page. Yes. I NAKed that series because it spreads the mess with the refcount further in this latest version. My intent is that it gets rebased on top of this to avoid that spread. Same for the p2p series from Logan.
[PATCH 8/8] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX support, as one of the two is required at runtime for fsdax code to actually work. Signed-off-by: Christoph Hellwig --- fs/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/Kconfig b/fs/Kconfig index 05efea674bffa0..6e8818a5e53c45 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,6 +48,7 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) + depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX help -- 2.30.2
[PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell. Signed-off-by: Christoph Hellwig --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - fs/Kconfig | 1 - include/linux/memremap.h | 12 +++-- include/linux/mm.h | 6 +-- lib/test_hmm.c | 1 - mm/Kconfig | 4 -- mm/internal.h| 2 + mm/memcontrol.c | 11 ++--- mm/memremap.c| 57 mm/migrate.c | 6 --- mm/swap.c| 16 ++- 13 files changed, 36 insertions(+), 83 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index e414ca44839fd1..8b6438fa18fc2b 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) dpage = pfn_to_page(uvmem_pfn); dpage->zone_device_data = pvt; - get_page(dpage); lock_page(dpage); return dpage; out_clear: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index cb835f95a76e66..e27ca375876230 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn) page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo); page->zone_device_data = prange->svm_bo; - get_page(page); lock_page(page); } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index a5cdfbe32b5e54..7ba66ad68a8a1e 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) return NULL; } - get_page(page); lock_page(page); return page; } diff --git a/fs/Kconfig b/fs/Kconfig index 7a2b11c0b8036d..05efea674bffa0 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,7 +48,6 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) - select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED) select FS_IOMAP select DAX help diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 514ab46f597e5c..d6a114dd5ea8b7 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -68,9 +68,9 @@ enum memory_type { struct dev_pagemap_ops { /* -* Called once the page refcount reaches 1. (ZONE_DEVICE pages never -* reach 0 refcount unless there is a refcount bug. This allows the -* device driver to implement its own memory management.) +* Called once the page refcount reaches 0. The reference count will be +* reset to one by the core code after the method is called to prepare +* for handing out the page again. 
*/ void (*page_free)(struct page *page); @@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) static inline bool is_device_private_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + return IS_ENABLED(CONFIG_DEVICE_PRIVATE) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PRIVATE; } static inline bool is_pci_p2pdma_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_PCI_P2PDMA) && + return IS_ENABLED(CONFIG_PCI_P2PDMA) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } diff --git a/include/linux/mm.h b/include/linux/mm.h i
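The refcount rule described in the updated page_free comment (the callback runs when the count reaches 0, after which the core code resets it to one) can be pictured with a small userspace model. This is only a sketch under stated assumptions: every model_* name is a hypothetical stand-in, it is not the kernel's struct page or dev_pagemap code, and it ignores atomics and locking entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the kernel's struct page or pgmap. */
struct model_page;

struct model_pagemap_ops {
	/* Called once the page refcount reaches 0. */
	void (*page_free)(struct model_page *page);
};

struct model_page {
	int refcount;
	const struct model_pagemap_ops *ops;
	int times_freed;	/* bookkeeping for the model only */
};

/* Stand-in for a driver's page_free callback: just count the calls. */
static void model_driver_page_free(struct model_page *page)
{
	page->times_freed++;
}

static const struct model_pagemap_ops model_ops = {
	.page_free = model_driver_page_free,
};

/* The driver hands out a free page; no extra reference is taken on top. */
static void model_page_alloc(struct model_page *page)
{
	page->ops = &model_ops;
	page->refcount = 1;
	page->times_freed = 0;
}

static void model_get_page(struct model_page *page)
{
	page->refcount++;
}

/* Drop one reference; returns true if this was the final put. */
static bool model_put_page(struct model_page *page)
{
	if (--page->refcount != 0)
		return false;
	page->ops->page_free(page);
	/* Reset to one so the page can be handed out again. */
	page->refcount = 1;
	return true;
}
```

In this model a page that was handed out and then referenced once more is freed on the second put, not the first, which is the zero-based behaviour the patch description asks for.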
[PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Signed-off-by: Christoph Hellwig --- arch/arm64/mm/mmu.c| 1 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/drm_cache.c| 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + drivers/gpu/drm/nouveau/nouveau_svm.c | 1 + drivers/infiniband/core/rw.c | 1 + drivers/nvdimm/pmem.h | 1 + drivers/nvme/host/pci.c| 1 + drivers/nvme/target/io-cmd-bdev.c | 1 + fs/fuse/virtio_fs.c| 1 + include/linux/memremap.h | 18 ++ include/linux/mm.h | 20 lib/test_hmm.c | 1 + mm/memremap.c | 6 +- 14 files changed, 34 insertions(+), 22 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index acfae9b41cc8c9..580abae6c0b93f 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index ea68f3b3a4e9cb..6d643b4b791d87 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -25,6 +25,7 @@ #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c index f19d9acbe95936..50b8a088f763a6 100644 --- a/drivers/gpu/drm/drm_cache.c +++ b/drivers/gpu/drm/drm_cache.c @@ -27,11 +27,11 @@ /* * Authors: Thomas Hellström */ - #include #include #include #include +#include #include #include diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index e886a3b9e08c7d..a5cdfbe32b5e54 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include #include /* diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 266809e511e2c1..090b9b47708cca 100644 --- 
a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include struct nouveau_svm { diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index 5a3bd41b331c93..4d98f931a13ddd 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -2,6 +2,7 @@ /* * Copyright (c) 2016 HGST, a Western Digital Company. */ +#include #include #include #include diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 59cfe13ea8a85c..1f51a23614299b 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -3,6 +3,7 @@ #define __NVDIMM_PMEM_H__ #include #include +#include #include #include #include diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 6a99ed68091589..ab15bc72710dbe 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 70ca9dfc1771a9..a141446db1bea3 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -6,6 +6,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include #include +#include #include #include "nvmet.h" diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 9d737904d07c0b..86b7dbb6a0d43e 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 1fafcc38acbad6..514ab46f597e5c 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -1,6 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_MEMREMAP_H_ #define _LINUX_MEMREMAP_H_ + +#include #include #include #include @@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) return 1 << pgmap->vmemmap_shift; } +static inline bool 
is_device_private_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PRIVATE; +} + +static inline bool is_pci_p2pdma_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_PCI_P2PDMA) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PC
[PATCH 5/8] mm: simplify freeing of devmap managed pages
Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.

Signed-off-by: Christoph Hellwig
---
 include/linux/mm.h | 34 ++
 mm/memremap.c      | 20 +---
 mm/swap.c          | 10 +-
 3 files changed, 20 insertions(+), 44 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91dd0bc786a9ec..26baadcef4556b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct page *page)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool page_is_devmap_managed(struct page *page)
+bool __put_devmap_managed_page(struct page *page);
+static inline bool put_devmap_managed_page(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
 	if (!is_zone_device_page(page))
 		return false;
-	switch (page->pgmap->type) {
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_FS_DAX:
-		return true;
-	default:
-		break;
-	}
-	return false;
+	if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+	    page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+		return false;
+	return __put_devmap_managed_page(page);
 }
-void put_devmap_managed_page(struct page *page);
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page)
 {
 	return false;
 }
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 static inline bool is_device_private_page(const struct page *page)
@@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page)
 	struct folio *folio = page_folio(page);
 	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the page is free and we
-	 * need to inform the device driver through callback. See
-	 * include/linux/memremap.h and HMM for details.
+	 * For some devmap managed pages we need to catch refcount transition
+	 * from 2 to 1:
 	 */
-	if (page_is_devmap_managed(&folio->page)) {
-		put_devmap_managed_page(&folio->page);
+	if (put_devmap_managed_page(&folio->page))
 		return;
-	}
-
 	folio_put(folio);
 }
diff --git a/mm/memremap.c b/mm/memremap.c
index 55d23e9f5c04ec..f41233a67edb12 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page)
 	page->pgmap->ops->page_free(page);
 }
-void put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page)
 {
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
 	 * refcount is 1, then the page is free and the refcount is
 	 * stable because nobody holds a reference on the page.
 	 */
-	if (count == 1)
+	switch (page_ref_dec_return(page)) {
+	case 1:
 		free_devmap_managed_page(page);
-	else if (!count)
+		break;
+	case 0:
 		__put_page(page);
+		break;
+	}
+	return true;
 }
-EXPORT_SYMBOL(put_devmap_managed_page);
+EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 08058f74cae23e..25b55c56614311 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			/*
-			 * ZONE_DEVICE pages that return 'false' from
-			 * page_is_devmap_managed() do not require special
-			 * processing, and instead, expect a call to
-			 * put_page_testzero().
-			 */
-			if (page_is_devmap_managed(page)) {
-				put_devmap_managed_page(page);
+			if (put_devmap_managed_page(page))
 				continue;
-			}
 			if (put_page_testzero(page))
 				put_dev_pagemap(page->pgmap);
 			continue;
-- 
2.30.2
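The calling convention this patch introduces, where a single helper both decides whether it owns the page and performs the put, can be sketched in plain userspace C. The model_* types and counters below are hypothetical and only mirror the control flow of the diff; the one-based refcount for managed pages is the behaviour at this point in the series, before the later patch removes it.

```c
#include <stdbool.h>

/* Hypothetical page types for the model; not the kernel's enum memory_type. */
enum model_memory_type {
	MODEL_NORMAL,
	MODEL_DEVICE_PRIVATE,
	MODEL_DEVICE_FS_DAX,
	MODEL_DEVICE_GENERIC,
};

struct model_page {
	enum model_memory_type type;
	int refcount;
	int driver_frees;	/* times the page went back to the driver */
	int core_frees;		/* times the generic free path ran */
};

/*
 * Returns true if the helper took charge of the put, false if the
 * caller has to fall through to the generic path.
 */
static bool model_put_devmap_managed_page(struct model_page *page)
{
	if (page->type != MODEL_DEVICE_PRIVATE &&
	    page->type != MODEL_DEVICE_FS_DAX)
		return false;

	/* Managed pages are one-based here: reaching 1 means "free". */
	switch (--page->refcount) {
	case 1:
		page->driver_frees++;
		break;
	case 0:
		page->core_frees++;
		break;
	}
	return true;
}

/* The caller no longer needs a separate "is it managed?" predicate. */
static void model_put_page(struct model_page *page)
{
	if (model_put_devmap_managed_page(page))
		return;
	if (--page->refcount == 0)
		page->core_frees++;
}
```

Folding the predicate into the helper means a caller cannot perform the check and then forget the matching put, which is the duplication the removed page_is_devmap_managed() pattern allowed.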
[PATCH 4/8] mm: move free_devmap_managed_page to memremap.c
free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Signed-off-by: Christoph Hellwig
---
 include/linux/mm.h |  1 -
 mm/memremap.c      | 21 +
 mm/swap.c          | 23 ---
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b46174989b086..91dd0bc786a9ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page *page)
 }
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 static inline bool page_is_devmap_managed(struct page *page)
diff --git a/mm/memremap.c b/mm/memremap.c
index 5f04a0709e436e..55d23e9f5c04ec 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
+
+void put_devmap_managed_page(struct page *page)
+{
+	int count;
+
+	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+		return;
+
+	count = page_ref_dec_return(page);
+
+	/*
+	 * devmap page refcounts are 1-based, rather than 0-based: if
+	 * refcount is 1, then the page is free and the refcount is
+	 * stable because nobody holds a reference on the page.
+	 */
+	if (count == 1)
+		free_devmap_managed_page(page);
+	else if (!count)
+		__put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56d5..08058f74cae23e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1153,26 +1153,3 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.30.2
[PATCH 3/8] mm: remove pointless includes from <linux/hmm.h>
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 + drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + include/linux/hmm.h | 9 ++--- lib/test_hmm.c | 2 ++ 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index ed5385137f4831..cb835f95a76e66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_object.h" #include "amdgpu_vm.h" diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 3828aafd3ac46f..e886a3b9e08c7d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include /* * FIXME: this is ugly right now we are using TTM to allocate vram and we pin diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 2fd2e91d5107c0..d5a6f101f843e6 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -9,14 +9,9 @@ #ifndef LINUX_HMM_H #define LINUX_HMM_H -#include -#include +#include -#include -#include -#include -#include -#include +struct mmu_interval_notifier; /* * On output: diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 767538089a62e4..396beee6b061d4 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -26,6 +26,8 @@ #include #include #include +#include +#include #include "test_hmm_uapi.h" -- 2.30.2
[PATCH 2/8] mm: remove the __KERNEL__ guard from <linux/mm.h>
__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Signed-off-by: Christoph Hellwig
---
 include/linux/mm.h | 4
 1 file changed, 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b19223..7b46174989b086 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3,9 +3,6 @@
 #define _LINUX_MM_H
 #include
-
-#ifdef __KERNEL__
-
 #include
 #include
 #include
@@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
-#endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
-- 
2.30.2
start sorting out the ZONE_DEVICE refcount mess
Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just device
private ones for now, but also the planned device coherent pages
and the enhanced p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

Diffstat:
 arch/arm64/mm/mmu.c                      |    1
 arch/powerpc/kvm/book3s_hv_uvmem.c       |    1
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |    2
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |    1
 drivers/gpu/drm/drm_cache.c              |    2
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |    3 -
 drivers/gpu/drm/nouveau/nouveau_svm.c    |    1
 drivers/infiniband/core/rw.c             |    1
 drivers/nvdimm/pmem.h                    |    1
 drivers/nvme/host/pci.c                  |    1
 drivers/nvme/target/io-cmd-bdev.c        |    1
 fs/Kconfig                               |    2
 fs/fuse/virtio_fs.c                      |    1
 include/linux/hmm.h                      |    9
 include/linux/memremap.h                 |   22 +-
 include/linux/mm.h                       |   59 -
 lib/test_hmm.c                           |    4 +
 mm/Kconfig                               |    4 -
 mm/internal.h                            |    2
 mm/memcontrol.c                          |   11 +
 mm/memremap.c                            |   63 ---
 mm/migrate.c                             |    6 --
 mm/swap.c                                |   49 ++--
 23 files changed, 90 insertions(+), 157 deletions(-)
[PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superfluous extra check.

Signed-off-by: Christoph Hellwig
---
 mm/memremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11fda..5f04a0709e436e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 		break;
 	case MEMORY_DEVICE_FS_DAX:
-		if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-		    IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+		if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
 			WARN(1, "File system DAX not supported\n");
 			return ERR_PTR(-EINVAL);
 		}
-- 
2.30.2
Re: [PATCH v4 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping
On Thu, Jan 27, 2022 at 02:32:58PM -0800, Andrew Morton wrote: > On Wed, 26 Jan 2022 21:09:39 -0600 Alex Sierra wrote: > > > This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory > > owned by a device that can be mapped into CPU page tables like > > MEMORY_DEVICE_GENERIC and can also be migrated like > > MEMORY_DEVICE_PRIVATE. > > Some more reviewer input appears to be desirable here. > > I was going to tentatively add it to -mm and -next, but problems. > 5.17-rc1's mm/migrate.c:migrate_vma_check_page() is rather different > from the tree you patched. Please redo, refresh and resend? I really hate adding more types with the weird one off page refcount. We need to clean that mess up first.
Re: [PATCH 1/7] drm/i915: Replace kmap() with kmap_local_page()
On Fri, Dec 10, 2021 at 03:23:58PM -0800, ira.we...@intel.com wrote: > - vaddr = kmap(page); > + vaddr = kmap_local_page(page); > memcpy(vaddr, data, len); > - kunmap(page); > + kunmap_local(vaddr); memcpy_to_page?
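For context on the suggestion: memcpy_to_page() is a helper in <linux/highmem.h> that, as far as I understand it, wraps exactly the map, copy, unmap sequence shown in the quoted hunk. The following is a userspace model of that shape only. The model_* functions and the fixed-size page are hypothetical stand-ins, not the kernel implementation.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: the "page" is just a buffer. */
struct model_page {
	char data[MODEL_PAGE_SIZE];
};

/* Model-only bookkeeping: number of mappings currently live. */
static int model_live_mappings;

static void *model_kmap_local_page(struct model_page *page)
{
	model_live_mappings++;
	return page->data;
}

static void model_kunmap_local(const void *vaddr)
{
	(void)vaddr;
	model_live_mappings--;
}

/*
 * One call replaces the open-coded map/memcpy/unmap triple, so the
 * mapping can never be left dangling by a forgotten unmap.
 */
static void model_memcpy_to_page(struct model_page *page, size_t offset,
				 const char *from, size_t len)
{
	char *to = model_kmap_local_page(page);

	memcpy(to + offset, from, len);
	model_kunmap_local(to);
}
```

With such a helper the three lines in the quoted hunk would collapse into a single call taking the page, an offset of 0, the source buffer and the length.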
Re: [PATCH v1 1/9] mm: add zone device coherent type memory support
On Mon, Nov 15, 2021 at 01:30:18PM -0600, Alex Sierra wrote:
> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>   * 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   * target for charge migration. if @target is not NULL, the entry is stored
>   * in target->ent.
> - * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE
> - * (so ZONE_DEVICE page and thus not on the lru).
> + * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_COHERENT
> + * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).

Please avoid the overly long line.  But I don't think we need to mention
the exact enum, but rather do something like:

 * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is device memory and
 * thus not on the lru.

> +		switch (pgmap->type) {
> +		case MEMORY_DEVICE_PRIVATE:
> +		case MEMORY_DEVICE_COHERENT:
>  			/*
>  			 * TODO: Handle HMM pages which may need coordination
>  			 * with device-side memory.

This might be a good opportunity for doing a s/HMM/device/ here.
Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
> Could you please provide more explicit explanation why inlining such an
> helper is considered as bad practice and messy ?

Because now we get architectures to all subtly differ.  Look at the mess
for ioremap and the ioremap* variants.

The only good reason to allow for inlines is if they are used in a hot
path.  Which cc_platform_has is not, especially not on powerpc.
Re: [PATCH] Enable '-Werror' by default for all kernel builds
On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> It'd be good to avoid. It has helped uncover build issues with KASAN in
> the past. Or at least make it dependent on the problematic architecture.
> For example if arm is a problem, something like this:

I'm also seeing quite a few stack size warnings with KASAN on x86_64
without COMPILE_TEST using gcc 10.2.1 from Debian.  In fact there are a
few warnings without KASAN, but with KASAN there are a lot more.
I'll try to find some time to dig into them.

While we're at it, with -Werror something like this is really futile:

drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function 'amdgpu_bo_support_uswc':
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to write-combining [-Wcpp]
  493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
      |  ^~~
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then.  Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables.  What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
>
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.

So doing some googling finds a patch from Dan that claims to hand EFI
special purpose memory to the device dax driver.  But when I try to
follow the version that got merged it looks like it is treated simply as
an MMIO region to be claimed by drivers, which would not get a struct page.

Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part.  file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator.  Anonymous memory also always comes from
> > the page allocator.
>
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.

Ok, so you want to migrate to and from them.  Not use DEVICE_PUBLIC
for the actual page cache pages.  That makes a lot more sense.

> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
> the CPU to map the device memory coherently to minimize the need for
> migrations when CPU and GPU access the same memory concurrently or
> alternatingly. But we're not going as far as putting that memory
> entirely under the management of the Linux memory manager and VM
> subsystem. Our (and HPE's) system architects decided that this memory is
> not suitable to be used like regular NUMA system memory by the Linux
> memory manager.

So yes.  It is a Memory Mapped I/O region, which unlike the PCIe BARs
that people typically deal with is fully cache coherent.  I think this
does make more sense as a description.

But to go back to what started this discussion: If these are memory
mapped I/O, pfn_valid should generally not return true for them.

And as you already pointed out in reply to Alex we need to tighten the
selection criteria one way or another.
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >> driver code is not really involved in updating the CPU mappings. Maybe
> >> it's something we need to do in the migration helpers.
> > It looks like I'm totally misunderstanding what you are adding here
> > then.  Why do we need any special treatment at all for memory that
> > has normal struct pages and is part of the direct kernel map?
>
> The pages are like normal memory for purposes of mapping them in CPU
> page tables and for coherent access from the CPU.

That's the user page tables.  What about the kernel direct map?
If there is a normal kernel struct page backing there really should
be no need for the pgmap.

> From an application
> perspective, we want file-backed and anonymous mappings to be able to
> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> optimize performance for GPU heavy workloads while minimizing the need
> to migrate data back-and-forth between system memory and device memory.

I don't really understand that part.  File backed pages are always
allocated by the file system using the pagecache helpers, that is
using the page allocator.  Anonymous memory also always comes from
the page allocator.

> The pages are special in two ways:
>
> 1. The memory is managed not by the Linux buddy allocator, but by the
>    GPU driver's TTM memory manager

Why?

> 2. We want to migrate data in response to GPU page faults and
>    application hints using the migrate_vma helpers

Why?
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Thu, Aug 26, 2021 at 06:27:31PM -0400, Felix Kuehling wrote:
> I think we're missing something here. As far as I can tell, all the work
> we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
> normal pages. Are we missing something in our driver code that would
> make these PTEs special? I don't understand how that can be, because
> driver code is not really involved in updating the CPU mappings. Maybe
> it's something we need to do in the migration helpers.

It looks like I'm totally misunderstanding what you are adding here
then.  Why do we need any special treatment at all for memory that
has normal struct pages and is part of the direct kernel map?
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> +			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>  				goto next;
>  			pfn = pte_pfn(pte);
>  			if (is_zero_pfn(pfn)) {

.. also how is this going to work for the device public memory?  That
should be pte_special() and thus fail vm_normal_page.
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
> In this case, this is used to migrate pages from device memory, back to
> system memory. This particular device memory type should be accessible
> by the CPU, through IOMEM access. Typically, zone device public type
> memory falls into this category.
>
> Signed-off-by: Alex Sierra
> ---
>  include/linux/migrate.h | 1 +
>  mm/migrate.c            | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..6b16f417384f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  enum migrate_vma_direction {
>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> +	MIGRATE_VMA_SELECT_IOMEM = 1 << 2,
>  };
>
>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e3a10e2a1bb3..d4ae2da99607 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (is_write_device_private_entry(entry))
>  				mpfn |= MIGRATE_PFN_WRITE;
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> +			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))

This makes MIGRATE_VMA_SELECT_SYSTEM and MIGRATE_VMA_SELECT_IOMEM
behave entirely identically, that is, the new flag is redundant.  I think
we need to distinguish between the different cases here.

I think the right check would be pfn_valid(), which should be true for
system memory, and false for iomem.

Also shouldn't this be called DEVICE_PUBLIC instead of IOMEM?
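The redundancy being pointed out can be seen in a small userspace model of the flag check. Everything named model_* or MODEL_* below is a hypothetical stand-in, and the second function is only one possible reading of the pfn_valid() suggestion, not code from the series.

```c
#include <stdbool.h>

/* Hypothetical mirror of the selection flags from the quoted hunk. */
enum model_migrate_flags {
	MODEL_SELECT_SYSTEM         = 1 << 0,
	MODEL_SELECT_DEVICE_PRIVATE = 1 << 1,
	MODEL_SELECT_IOMEM          = 1 << 2,
};

/*
 * The check as posted: a page is skipped only if neither flag is set,
 * so either flag selects every page, whatever kind of memory backs it.
 */
static bool model_collect_as_posted(unsigned int flags, bool pfn_is_valid)
{
	(void)pfn_is_valid;	/* not consulted at all */
	if (!(flags & MODEL_SELECT_SYSTEM) &&
	    !(flags & MODEL_SELECT_IOMEM))
		return false;
	return true;
}

/*
 * A variant that tells the two cases apart: pfn validity decides which
 * flag is required for the page to be collected.
 */
static bool model_collect_distinguishing(unsigned int flags, bool pfn_is_valid)
{
	if (pfn_is_valid)
		return (flags & MODEL_SELECT_SYSTEM) != 0;
	return (flags & MODEL_SELECT_IOMEM) != 0;
}
```

In the first function a caller asking only for IOMEM still collects ordinary system pages, which is the "entirely identical" behaviour; in the second each flag selects only its own kind of page.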
Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH v1 08/14] mm: add public type support to migrate_vma helpers
This should probably be folded into patch 4.
Re: [PATCH v1 01/14] ext4/xfs: add page refcount helper
On Tue, Aug 24, 2021 at 10:48:15PM -0500, Alex Sierra wrote: > Signed-off-by: Ralph Campbell > Signed-off-by: Alex Sierra > Reviewed-by: Christoph Hellwig > --- > v3: > [AS]: rename dax_layout_is_idle_page func to dax_page_unused > > v4: > [AS]: This ref count functionality was missing on fuse/dax.c. > --- Not sure all tooling can cope with the two --- separators. Personally I find these per-patch changelogs pretty annoying anyway, but others have different opinions.
Re: [PATCH v1 09/14] mm: call pgmap->ops->page_free for DEVICE_PUBLIC pages
On Tue, Aug 24, 2021 at 10:48:23PM -0500, Alex Sierra wrote: > Add MEMORY_DEVICE_PUBLIC case to free_zone_device_page callback. > Device public type memory case is now able to free its pages properly. This really should go into patch 4. And it might make sense to introduce free_device_private_page directly with the free_device_page name instead of renaming it a little later.
Re: [PATCH v2 03/12] x86/sev: Add an x86 version of prot_guest_has()
On Thu, Aug 19, 2021 at 01:33:09PM -0500, Tom Lendacky wrote:
> I did it as inline originally because the presence of the function will be
> decided based on the ARCH_HAS_PROTECTED_GUEST config. For now, that is
> only selected by the AMD memory encryption support, so if I went out of
> line I could put in mem_encrypt.c. But with TDX wanting to also use it, it
> would have to be in an always built file with some #ifdefs or in its own
> file that is conditionally built based on the ARCH_HAS_PROTECTED_GUEST
> setting (they've already tried building with ARCH_HAS_PROTECTED_GUEST=y
> and AMD_MEM_ENCRYPT not set).
>
> To take it out of line, I'm leaning towards the latter, creating a new
> file that is built based on the ARCH_HAS_PROTECTED_GUEST setting.

Yes.  In general every time architectures have to provide the prototype
and not just the implementation of something we end up with a giant mess
sooner or later.  In a few cases that is still warranted due to
performance concerns, but I don't think that is the case here.

> >
> >> +/* 0x800 - 0x8ff reserved for AMD */
> >> +#define PATTR_SME			0x800
> >> +#define PATTR_SEV			0x801
> >> +#define PATTR_SEV_ES			0x802
> >
> > Why do we need reservations for a purely in-kernel namespace?
> >
> > And why are you overoading a brand new generic API with weird details
> > of a specific implementation like this?
>
> There was some talk about this on the mailing list where TDX and SEV may
> need to be differentiated, so we wanted to reserve a range of values per
> technology. I guess I can remove them until they are actually needed.

In that case add a flag for the differing behavior.  And only add them
when actually needed.  And either way there is absolutely no need to
reserve ranges.
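The layout argued for here (one prototype in common code, an out-of-line implementation per platform, and behaviour-named attributes without reserved number ranges) might look roughly like the single-file model below. All names are hypothetical, including the attribute names, and the section comments only mark where each part would live in a real tree.

```c
#include <stdbool.h>

/*
 * Part 1: would sit in one common header shared by all architectures.
 * Attributes describe behaviour, not a vendor technology, and are
 * plain consecutive enum values with no per-vendor ranges.
 */
enum model_guest_attr {
	MODEL_ATTR_MEM_ENCRYPT,		/* any memory encryption active */
	MODEL_ATTR_HOST_MEM_ENCRYPT,	/* host-side encryption active */
	MODEL_ATTR_GUEST_MEM_ENCRYPT,	/* guest-side encryption active */
};

/* The single prototype every platform must implement out of line. */
bool model_prot_guest_has(enum model_guest_attr attr);

/*
 * Part 2: would sit in a platform .c file that is only built when the
 * platform selects the feature.  The state variables are model-only.
 */
bool model_host_encrypt_active;
bool model_guest_encrypt_active = true;

bool model_prot_guest_has(enum model_guest_attr attr)
{
	switch (attr) {
	case MODEL_ATTR_MEM_ENCRYPT:
		return model_host_encrypt_active || model_guest_encrypt_active;
	case MODEL_ATTR_HOST_MEM_ENCRYPT:
		return model_host_encrypt_active;
	case MODEL_ATTR_GUEST_MEM_ENCRYPT:
		return model_guest_encrypt_active;
	}
	return false;
}
```

Because callers only ever see the common prototype, a second platform can supply its own implementation file without touching any header, and a new attribute is added only when some caller actually needs to tell two behaviours apart.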
Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
On Tue, Aug 17, 2021 at 11:44:54AM -0400, Felix Kuehling wrote:
> >> That's a good catch. Existing drivers shouldn't need a page_free
> >> callback if they didn't have one before. That means we need to add a
> >> NULL-pointer check in free_device_page.
> > Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> > ->mapping = NULL).
> >
> > In many ways this seems like you want to bring back the DEVICE_PUBLIC
> > pgmap type that was removed a while ago due to the lack of users
> > instead of overloading the generic type.
>
> I think so. I'm not clear about how DEVICE_PUBLIC differed from what
> DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
> because it was unused and also known to be broken in some ways.
> DEVICE_GENERIC seemed close enough to what we need, other than not being
> supported in the migration helpers.
>
> Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
> memory type from DEVICE_GENERIC? What would be the benefits of making
> that distinction?

The old DEVICE_PUBLIC mostly differed in that it allowed the page
to be returned from vm_normal_page, which I think was horribly buggy.

But the point is not to bring back these old semantics.  The idea
is to be able to differentiate between your new coherent on-device
memory and the existing DEVICE_GENERIC.  That is, call the
code in free_devmap_managed_page that is currently only used
for device private pages also for your new public device pages without
affecting the devdax and xen use cases.
Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Aug 19, 2021 at 03:59:56PM -0400, Felix Kuehling wrote: > I got lost trying to understand how DAX counts page references and how > the PTE_SPECIAL option affects that. Theodore, can you help with this? > Is there an easy way to test without CONFIG_ARCH_HAS_PTE_SPECIAL on x86, > or do we need to test on a CPU architecture that doesn't support this > feature? I think the right answer is to simply disallow ZONE_DEVICE pages if ARCH_HAS_PTE_SPECIAL is not supported. ARCH_HAS_PTE_SPECIAL is supported by all modern architecture ports that can make use of ZONE_DEVICE / dev_pagemap, so we can avoid this pocket of barely testable code entirely: diff --git a/mm/Kconfig b/mm/Kconfig index 40a9bfcd5062e1..2823bbfd1c8c70 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -775,6 +775,7 @@ config ZONE_DMA32 config ZONE_DEVICE bool "Device memory (pmem, HMM, etc...) hotplug support" + depends on ARCH_HAS_PTE_SPECIAL depends on MEMORY_HOTPLUG depends on MEMORY_HOTREMOVE depends on SPARSEMEM_VMEMMAP
Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
On Wed, Aug 18, 2021 at 12:28:30PM -0700, Ralph Campbell wrote: > Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined? > In that case, mmap() of a DAX device will call insert_page() which calls > get_page() which would trigger VM_BUG_ON_PAGE(). __vm_insert_mixed still ends up calling insert_pfn in the !CONFIG_ARCH_HAS_PTE_SPECIAL case if pfn_t_devmap() is true, which it should be for DAX. (And as said in my other mail, I suspect we should disallow that case anyway, as no one can test it in practice.)
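The decision described above can be modeled as a small pure function. This is a simplified sketch, not the kernel's __vm_insert_mixed: the sketch_* names are hypothetical, and the pfn_is_valid input is an assumption about what else gates the insert_page path. The one property taken from the message is that a devmap pfn takes the insert_pfn path even without ARCH_HAS_PTE_SPECIAL.

```c
#include <stdbool.h>

/* Which of the two insertion paths the (modeled) mixed-insert helper takes. */
enum sketch_insert_path {
	SKETCH_INSERT_PAGE,	/* takes a page reference via get_page() */
	SKETCH_INSERT_PFN,	/* maps the pfn without touching the refcount */
};

/*
 * Simplified model: the refcounting path is only chosen when the architecture
 * lacks special PTEs AND the pfn is not a devmap pfn. The pfn_is_valid
 * condition is an assumption added for this sketch.
 */
enum sketch_insert_path sketch_choose_insert(bool arch_has_pte_special,
					      bool pfn_is_devmap,
					      bool pfn_is_valid)
{
	if (!arch_has_pte_special && !pfn_is_devmap && pfn_is_valid)
		return SKETCH_INSERT_PAGE;
	return SKETCH_INSERT_PFN;
}
```

Under this model a DAX mapping, whose pfn is a devmap pfn, never reaches the get_page() call that the quoted question is worried about.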
Re: [PATCH v2 04/12] powerpc/pseries/svm: Add a powerpc version of prot_guest_has()
On Fri, Aug 13, 2021 at 11:59:23AM -0500, Tom Lendacky wrote: > +static inline bool prot_guest_has(unsigned int attr) No real need to have this inline. In fact I'd suggest we have the prototype in a common header so that everyone must implement it out of line.
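A rough single-file sketch of that layout follows. The sketch_* names, the mask variable, and the attribute value are hypothetical, and the comments mark where each piece would live in a real tree. It shows one shared prototype with the implementation supplied out of line by whichever architecture file is built.

```c
#include <stdbool.h>

/* Would live in a common header: the single prototype every arch shares. */
bool sketch_prot_has(unsigned int attr);

/* Everything below would live in one arch-specific .c file. */
#define SKETCH_ATTR_MEM_ENCRYPT 0u	/* hypothetical attribute value */

/* Hypothetical arch state: nonzero once memory encryption is active. */
static unsigned long sketch_enc_mask;

void sketch_arch_set_enc_mask(unsigned long mask)
{
	sketch_enc_mask = mask;
}

/* Out-of-line implementation: callers never see the arch details. */
bool sketch_prot_has(unsigned int attr)
{
	if (!sketch_enc_mask)
		return false;
	return attr == SKETCH_ATTR_MEM_ENCRYPT;
}
```

Because the header carries only the prototype, an architecture that forgets to provide the function fails at link time instead of silently picking up a divergent inline copy.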
Re: [PATCH v2 02/12] mm: Introduce a function to check for virtualization protection features
On Fri, Aug 13, 2021 at 11:59:21AM -0500, Tom Lendacky wrote: > +#define PATTR_MEM_ENCRYPT 0 /* Encrypted memory */ > +#define PATTR_HOST_MEM_ENCRYPT 1 /* Host encrypted > memory */ > +#define PATTR_GUEST_MEM_ENCRYPT 2 /* Guest encrypted > memory */ > +#define PATTR_GUEST_PROT_STATE 3 /* Guest encrypted > state */ Please write an actual detailed explanation of what these mean, that is, what implications each one has for the kernel.
Re: [PATCH v2 03/12] x86/sev: Add an x86 version of prot_guest_has()
On Fri, Aug 13, 2021 at 11:59:22AM -0500, Tom Lendacky wrote: > While the name suggests this is intended mainly for guests, it will > also be used for host memory encryption checks in place of sme_active(). Which suggests that the name is not good to start with. Maybe protected hardware, system or platform might be a better choice? > +static inline bool prot_guest_has(unsigned int attr) > +{ > +#ifdef CONFIG_AMD_MEM_ENCRYPT > + if (sme_me_mask) > + return amd_prot_guest_has(attr); > +#endif > + > + return false; > +} Shouldn't this be entirely out of line? > +/* 0x800 - 0x8ff reserved for AMD */ > +#define PATTR_SME 0x800 > +#define PATTR_SEV 0x801 > +#define PATTR_SEV_ES 0x802 Why do we need reservations for a purely in-kernel namespace? And why are you overloading a brand new generic API with weird details of a specific implementation like this?
Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
On Mon, Aug 16, 2021 at 03:00:49PM -0400, Felix Kuehling wrote: > > On 2021-08-15 at 11:40 a.m., Christoph Hellwig wrote: > > On Fri, Aug 13, 2021 at 01:31:45AM -0500, Alex Sierra wrote: > >> Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback. > >> Device generic type memory case is now able to free its pages properly. > > How is this going to work for the two existing MEMORY_DEVICE_GENERIC > > that now change behavior? And which don't have a ->page_free callback > > at all? > > That's a good catch. Existing drivers shouldn't need a page_free > callback if they didn't have one before. That means we need to add a > NULL-pointer check in free_device_page. Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/ ->mapping = NULL). In many ways this seems like you want to bring back the DEVICE_PUBLIC pgmap type that was removed a while ago due to the lack of users instead of overloading the generic type.
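The NULL-pointer check mentioned in the quoted reply can be sketched in isolation. This is a userspace model with hypothetical sketch_* structures and a counting callback added purely for demonstration. It covers only the optional-callback guard, not the state-clearing concern raised in the reply.

```c
#include <stddef.h>

struct sketch_page;

/* Hypothetical ops table: page_free is optional and may be NULL. */
struct sketch_pagemap_ops {
	void (*page_free)(struct sketch_page *page);
};

struct sketch_pagemap {
	const struct sketch_pagemap_ops *ops;	/* may itself be NULL */
};

struct sketch_page {
	struct sketch_pagemap *pgmap;
};

/* Demonstration callback: counts how often a driver was asked to free. */
int sketch_free_calls;

void sketch_count_free(struct sketch_page *page)
{
	(void)page;
	sketch_free_calls++;
}

/* Invoke the driver callback only when the driver actually provides one. */
void sketch_free_device_page(struct sketch_page *page)
{
	const struct sketch_pagemap_ops *ops = page->pgmap->ops;

	if (ops && ops->page_free)
		ops->page_free(page);
}
```

A driver that never registered a callback is simply skipped, so existing users without ->page_free do not dereference a NULL pointer.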