Re: [PATCH v2 07/14] LoongArch: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2024-01-08 Thread Christoph Hellwig
On Sun, Jan 07, 2024 at 10:39:07AM +0800, Huacai Chen wrote:
> > Do you mean that LoongArch32 does not support double-precision FP in 
> > hardware?
> > At least both of the consumers in this series use double-precision, so my 
> > first
> > thought is that LoongArch32 could not select ARCH_HAS_KERNEL_FPU_SUPPORT.
> Then is it possible to introduce CC_FLAGS_SP_FPU and CC_FLAGS_DP_FPU?
> I think there may be some place where SP FP is enough.

Let's defer that until it is actually neeed.


Re: [PATCH v2 13/14] selftests/fpu: Move FP code to a separate translation unit

2023-12-28 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 



Re: [PATCH v2 12/14] drm/amd/display: Use ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-28 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v2 05/14] arm64: crypto: Use CC_FLAGS_FPU for NEON CFLAGS

2023-12-28 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 



Re: [PATCH v2 10/14] riscv: Add support for kernel-mode FPU

2023-12-28 Thread Christoph Hellwig
On Wed, Dec 27, 2023 at 05:42:00PM -0800, Samuel Holland wrote:
> This is motivated by the amdgpu DRM driver, which needs floating-point
> code to support recent hardware. That code is not performance-critical,
> so only provide a minimal non-preemptible implementation for now.
> 
> Signed-off-by: Samuel Holland 

Looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v2 01/14] arch: Add ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-28 Thread Christoph Hellwig
Thanks for all the great documentation!

Looks good:

Reviewed-by: Christoph Hellwig 



Re: [PATCH 3/3] drm/amd/display: Support DRM_AMD_DC_FP on RISC-V

2023-12-12 Thread Christoph Hellwig
On Thu, Dec 07, 2023 at 10:49:53PM -0600, Samuel Holland wrote:
> Actually tracking all possibly-FPU-tainted functions and their call sites is
> probably possible, but a much larger task.

I think objtool should be able to do that reasonably easily, it already
does it for checking section where userspace address access is enabled
or not, which is very similar.


Re: [RFC PATCH 05/12] lib/raid6: Use CC_FLAGS_FPU for NEON CFLAGS

2023-12-12 Thread Christoph Hellwig
On Mon, Dec 11, 2023 at 10:12:27AM -0600, Samuel Holland wrote:
> On 2023-12-11 10:07 AM, Christoph Hellwig wrote:
> 
> Unfortunately, not all of the relevant options can be no-prefixed:

Ok.  That is another good argument for having the obj-fpu += syntax
I proposed.  You might need help from the kbuild maintainers from that
as trying to understand the kbuild magic isn't something I'd expect
from a normal contributor (including myself..).



Re: [RFC PATCH 11/12] selftests/fpu: Move FP code to a separate translation unit

2023-12-11 Thread Christoph Hellwig
>  obj-$(CONFIG_TEST_FPU) += test_fpu.o
> -CFLAGS_test_fpu.o += $(FPU_CFLAGS)
> +test_fpu-y := test_fpu_glue.o test_fpu_impl.o
> +CFLAGS_test_fpu_impl.o += $(FPU_CFLAGS)

Btw, I really wonder if having a

modname-fpu += foo.o

syntax in kbuild wouldn't be preferable to this.  Of coure that requires
someone who understands kbuild inside out.

> +int test_fpu(void);

This needs to go into a header.

And I think I underatand your way to enforce the use of a separate
compilation unit in the riscv patch now.

Can we just make that generic, e.g. have a  that wraps
 that does the guard based on a
-D_LINUX_FPU_COMPILATION_UNIT=1 on the command line so that all the
code becomes fully portable?  Any legacy arch specific fpu users not
using  would not be affected by it, although it would be
great to eventually migrate them to the common scheme.



Re: [RFC PATCH 12/12] selftests/fpu: Allow building on other architectures

2023-12-11 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 07/12] powerpc: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-11 Thread Christoph Hellwig
On Thu, Dec 07, 2023 at 09:54:37PM -0800, Samuel Holland wrote:
> PowerPC provides an equivalent to the common kernel-mode FPU API, but in
> a different header and using different function names. The PowerPC API
> also requires a non-preemptible context. Add a wrapper header, and
> export the CFLAGS adjustments.

Looks good:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 05/12] lib/raid6: Use CC_FLAGS_FPU for NEON CFLAGS

2023-12-11 Thread Christoph Hellwig
> +CFLAGS_REMOVE_neon1.o += $(CC_FLAGS_NO_FPU)
> +CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
> +CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
> +CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)

Btw, do we even really need the extra variables for compiler flags
to remove?  Don't gcc/clang options work so that if you add a
no-prefixed version of the option later it transparently gets removed?

Except for that:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 09/12] riscv: Add support for kernel-mode FPU

2023-12-11 Thread Christoph Hellwig
> +#ifdef __riscv_f
> +
> +#define kernel_fpu_begin() \
> + static_assert(false, "floating-point code must use a separate 
> translation unit")
> +#define kernel_fpu_end() kernel_fpu_begin()
> +
> +#else
> +
> +void kernel_fpu_begin(void);
> +void kernel_fpu_end(void);
> +
> +#endif

I'll assume this is related to trick that places code in a separate
translation unit, but I fail to understand it.  Can you add a comment
explaining it?



Re: [RFC PATCH 04/12] arm64: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-11 Thread Christoph Hellwig
> + * linux/arch/arm64/include/asm/fpu.h

Same comment as for arm here.  Except for that:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 02/12] ARM: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-11 Thread Christoph Hellwig
> --- /dev/null
> +++ b/arch/arm/include/asm/fpu.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * linux/arch/arm/include/asm/fpu.h

Please don't add the file name to top of the file comments.  It serves
no purpose and easily gets out of date.

Except for that:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 01/12] arch: Add ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-11 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 03/12] ARM: crypto: Use CC_FLAGS_FPU for NEON CFLAGS

2023-12-11 Thread Christoph Hellwig
On Thu, Dec 07, 2023 at 09:54:33PM -0800, Samuel Holland wrote:
> Now that CC_FLAGS_FPU is exported and can be used anywhere in the source
> tree, use it instead of duplicating the flags here.

Looks good:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 06/12] LoongArch: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-11 Thread Christoph Hellwig
On Thu, Dec 07, 2023 at 09:54:36PM -0800, Samuel Holland wrote:
> LoongArch already provides kernel_fpu_begin() and kernel_fpu_end() in
> asm/fpu.h, so it only needs to add kernel_fpu_available() and export
> the CFLAGS adjustments.

Looks good:

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH 08/12] x86: Implement ARCH_HAS_KERNEL_FPU_SUPPORT

2023-12-11 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 3/3] drm/amd/display: Support DRM_AMD_DC_FP on RISC-V

2023-11-22 Thread Christoph Hellwig
> - select DRM_AMD_DC_FP if (X86 || LOONGARCH || (PPC64 && ALTIVEC) || 
> (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG))
> + select DRM_AMD_DC_FP if ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG
> + select DRM_AMD_DC_FP if PPC64 && ALTIVEC
> + select DRM_AMD_DC_FP if RISCV && FPU
> + select DRM_AMD_DC_FP if LOONGARCH || X86

This really is a mess.  Can you add a ARCH_HAS_KERNEL_FPU_SUPPORT
symbol that all architetures that have it select instead, and them
make DRM_AMD_DC_FP depend on it?

> -#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
> +#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH) || defined(CONFIG_RISCV)
>   kernel_fpu_begin();
>  #elif defined(CONFIG_PPC64)
>   if (cpu_has_feature(CPU_FTR_VSX_COMP))
> @@ -122,7 +124,7 @@ void dc_fpu_end(const char *function_name, const int line)
>  
>   depth = __this_cpu_dec_return(fpu_recursion_depth);
>   if (depth == 0) {
> -#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
> +#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH) || defined(CONFIG_RISCV)
>   kernel_fpu_end();
>  #elif defined(CONFIG_PPC64)
>   if (cpu_has_feature(CPU_FTR_VSX_COMP))

And then this mess can go away.  We'll need to decide if we want to
cover all the in-kernel vector support as part of it, which would
seem reasonable to me, or have a separate generic kernel_vector_begin
with it's own option.

> diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile 
> b/drivers/gpu/drm/amd/display/dc/dml/Makefile
> index ea7d60f9a9b4..5c8f840ef323 100644
> --- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
> +++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
> @@ -43,6 +43,12 @@ dml_ccflags := -mfpu=64
>  dml_rcflags := -msoft-float
>  endif
>  
> +ifdef CONFIG_RISCV
> +include $(srctree)/arch/riscv/Makefile.isa
> +# Remove V from the ISA string, like in arch/riscv/Makefile, but keep F and 
> D.
> +dml_ccflags := -march=$(shell echo $(riscv-march-y) | sed -E 
> 's/(rv32ima|rv64ima)([^v_]*)v?/\1\2/')
> +endif
> +
>  ifdef CONFIG_CC_IS_GCC
>  ifneq ($(call gcc-min-version, 70100),y)
>  IS_OLD_GCC = 1

And this is again not really something we should be doing.
Instead we need a generic way in Kconfig to enable FPU support
for an object file or set of, that the arch support can hook
into.

Btw, I'm also really worried about folks using the FPU instructions
outside the kernel_fpu_begin/end windows in general (not directly
related to the RISC-V support).  Can we have objecttool checks
for that similar to only allowing the unsafe uaccess in the
uaccess begin/end pairs?


Re: [PATCH 1/3] riscv: Add support for kernel-mode FPU

2023-11-22 Thread Christoph Hellwig
On Tue, Nov 21, 2023 at 07:05:13PM -0800, Samuel Holland wrote:
> +static inline void kernel_fpu_begin(void)
> +{
> + preempt_disable();
> + fstate_save(current, task_pt_regs(current));
> + csr_set(CSR_SSTATUS, SR_FS);
> +}
> +
> +static inline void kernel_fpu_end(void)
> +{
> + csr_clear(CSR_SSTATUS, SR_FS);
> + fstate_restore(current, task_pt_regs(current));
> + preempt_enable();
> +}

Is there any critical reason to inline these two?  I'd much rather see
them out of line and exported instead of the low-level helpers.



Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()

2023-07-12 Thread Christoph Hellwig
On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote:
> Use the helpers to simplify code.

Nothing against your addition of a helper, but a GPU driver really
should have no business even looking at this information..



Re: [PATCH] drm/amdgpu: add the accelerator pcie class

2023-05-26 Thread Christoph Hellwig
On Thu, May 25, 2023 at 08:52:06PM +, Deucher, Alexander wrote:
> We already handle this today for CLASS_DISPLAY via a data table provided on 
> our hardware that details the components on the board.  The driver can then 
> determine whether or not that combination of components is supported.  If the 
> data table doesn't exist or isn’t parse-able, or the components enumerated 
> are not supported, the driver doesn't load.

But things like module loading and initramfs generation still work
off the ID table and not your internal tables.



Re: [PATCH v2] drm/amd/display: enable more strict compile checks

2023-05-25 Thread Christoph Hellwig
> +subdir-ccflags-y += -Werror -Wunused -Wmisleading-indentation

We have a config option for -Werror.  Blindly adding this will create
problems with too new (or sometimes too old, or just too weird)
compilers all the time.  Don't do this.


Re: [PATCH] drm/amdgpu: add the accelerator pcie class

2023-05-25 Thread Christoph Hellwig
On Tue, May 23, 2023 at 10:02:32AM -0400, Alex Deucher wrote:
> On Tue, May 23, 2023 at 5:25 AM Christoph Hellwig  wrote:
> >
> > On Tue, May 23, 2023 at 12:02:32PM +0800, Shiwu Zhang wrote:
> > > + { PCI_DEVICE(0x1002, PCI_ANY_ID),
> > > +   .class = PCI_CLASS_ACCELERATOR_PROCESSING << 8,
> > > +   .class_mask = 0xff,
> > > +   .driver_data = CHIP_IP_DISCOVERY },
> >
> > Probing for every single device of a given class for a single vendor
> > to a driver is just fundamentaly wrong.  Please list the actual IDs
> > that the driver can handle.
> 
> How so?  The driver handles all devices of that class.  We already do
> that for PCI_CLASS_DISPLAY_VGA and PCI_CLASS_DISPLAY_OTHER.  Other
> drivers do similar things.

How is that going to work in the long run?  The chances of totally
incompatbile devices from the same vendor appearing is absolutely given.

> The hda audio driver does the same thing
> for PCI_CLASS_MULTIMEDIA_HD_AUDIO for example.
>

That, just like PCI_CLASS_STORAGE_EXPRESS is a different case, as
the class is associated with an actual documented programming interface.


Re: [PATCH] drm/amdgpu: add the accelerator pcie class

2023-05-23 Thread Christoph Hellwig
On Tue, May 23, 2023 at 12:02:32PM +0800, Shiwu Zhang wrote:
> + { PCI_DEVICE(0x1002, PCI_ANY_ID),
> +   .class = PCI_CLASS_ACCELERATOR_PROCESSING << 8,
> +   .class_mask = 0xff,
> +   .driver_data = CHIP_IP_DISCOVERY },

Probing for every single device of a given class for a single vendor
to a driver is just fundamentaly wrong.  Please list the actual IDs
that the driver can handle.



[PATCH] drm/radeon: stop including swiotlb.h

2023-05-18 Thread Christoph Hellwig
radeon does not need swiotlb.h, so stop including it.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 2220cdf6a3f680..04df08356d553f 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -36,7 +36,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
-- 
2.39.2



[PATCH] drm/amdgpu: stop including swiotlb.h

2023-05-18 Thread Christoph Hellwig
amdgpu does not need swiotlb.h, so stop including it.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 2cd081cbf70621..385e04612e4e9b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -38,7 +38,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.39.2



Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()

2023-05-15 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH] powerpc: export cpu_smallcore_map for modules

2022-08-22 Thread Christoph Hellwig
On Mon, Aug 22, 2022 at 01:40:23PM +1000, Michael Ellerman wrote:
> Randy Dunlap  writes:
> > drivers/gpu/drm/amd/amdkfd/kfd_device.c calls cpu_smt_mask().
> > This is an inline function on powerpc which references
> > cpu_smallcore_map.
> >
> > Fixes: 425752c63b6f ("powerpc: Detect the presence of big-cores via "ibm, 
> > thread-groups"")
> > Fixes: 7bc913085765 ("drm/amdkfd: Try to schedule bottom half on same core")
> 
> That 2nd commit is not in mainline, only linux-next.
> 
> I don't mind merging this fix preemptively, but is that SHA stable?

I really do not think this has any business being exported at all.

kfd_queue_work is not something that should be done in a driver.
Something like this belongs into the workqueue core, not in an
underdocumented helper in a random driver.

Drm guys:  once again, please please work with the maintainers instead
of just making up random stuff in the drivers.


Re: [linux-next:master] BUILD REGRESSION 736ee37e2e8eed7fe48d0a37ee5a709514d478b3

2022-05-22 Thread Christoph Hellwig
How about just turning the MMIO/PIO accessors on m68k into inline
functions as they are on most other architectures?


Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only

2022-03-31 Thread Christoph Hellwig
On Thu, Mar 31, 2022 at 10:55:13AM +0200, David Hildenbrand wrote:
> > Why can't this deal with ZONE_DEVICE pages?  It certainly has
> > nothing do with a LRU I think.  In fact being able to have
> > stats that count say the number of device pages here would
> > probably be useful at some point.
> > 
> > In general I find the vm_normal_lru_page vs vm_normal_page
> > API highly confusing.  An explicit check for zone device pages
> > in the dozen or so spots that care has a much better documentation
> > value, especially if accompanied by comments where it isn't entirely
> > obvious.
> 
> What's your thought on FOLL_LRU?

Also a bit confusing, but inbetween all these FOLL_ flags it doesn't
really matter any more.


Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only

2022-03-31 Thread Christoph Hellwig
> - page = vm_normal_page(vma, addr, pte);
> + page = vm_normal_lru_page(vma, addr, pte);

Why can't this deal with ZONE_DEVICE pages?  It certainly has
nothing do with a LRU I think.  In fact being able to have
stats that count say the number of device pages here would
probably be useful at some point.

In general I find the vm_normal_lru_page vs vm_normal_page
API highly confusing.  An explicit check for zone device pages
in the dozen or so spots that care has a much better documentation
value, especially if accompanied by comments where it isn't entirely
obvious.

>   page = follow_page(vma, addr,
> - FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> + FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | 
> FOLL_LRU);

Overly long line here.

> +/*
> + * NOTE: Technically this should goto check_pfn label. However, 
> page->_mapcount
> + * is never incremented for device pages that are mmap through DAX mechanism
> + * using pmem driver mounted into ext4 filesystem. When these pages are 
> unmap,
> + * zap_pte_range is called and vm_normal_page return a valid page with
> + * page_mapcount() = 0, before page_remove_rmap is called.
> + */

Please properly indent comments.

> + * zone, as long as the pte's are present and vm_normal_lru_page() succeeds. 
> These
>   * pages also get pinned.

Another overly long line here.


Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

2022-02-15 Thread Christoph Hellwig
On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > assumption was that they would be part of a special mapping.
> 
> We need to stop using the special PTEs and VMAs for things that have a
> struct page. This is a mistake DAX created that must be undone.

Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
ptes first given that DAX is more complicated.


Re: [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file

2022-02-10 Thread Christoph Hellwig
On Thu, Feb 10, 2022 at 09:35:10PM +1100, Alistair Popple wrote:
> I got the following build error:
> 
> /data/source/linux/mm/migrate_device.c: In function ‘migrate_vma_collect_pmd’:
> /data/source/linux/mm/migrate_device.c:242:3: error: implicit declaration of 
> function ‘flush_tlb_range’; did you mean ‘flush_pmd_tlb_range’? 
> [-Werror=implicit-function-declaration]
>   242 |   flush_tlb_range(walk->vma, start, end);
>   |   ^~~
>   |   flush_pmd_tlb_range
> 
> Including asm/tlbflush.h in migrate_device.c fixed it for me.

Yes, the buildbot also complained about this, but somehow in my test
configfs it got pulled in implicitly.



[PATCH 27/27] tools: add hmm gup test for long term pinned device pages

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

The intention is to test device coherent type pages that have been
called through get user pages with PIN_LONGTERM flag set. These pages
should get migrated back to normal system memory.

Signed-off-by: Alex Sierra 
Signed-off-by: Alistair Popple 
Reviewed-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 tools/testing/selftests/vm/Makefile|  2 +-
 tools/testing/selftests/vm/hmm-tests.c | 81 ++
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index 1607322a112c91..58c8427114f0c2 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -142,7 +142,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS 
+= -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
-$(OUTPUT)/hmm-tests: local_config.h
+$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h
 
 # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
 $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 84ec8c4a1dc7b6..11b83a8084fee2 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
  * in the usual include/uapi/... directory.
  */
 #include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"
 
 struct hmm_buffer {
void*ptr;
@@ -60,6 +61,8 @@ enum {
 #define NTIMES 10
 
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01/* check pte is writable */
 
 FIXTURE(hmm)
 {
@@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
 }
 
+/*
+ * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
+ * This should trigger a migration back to system memory for both, private
+ * and coherent type pages.
+ * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
+ * to your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+   struct hmm_buffer *buffer;
+   struct gup_test gup;
+   int gup_fd;
+   unsigned long npages;
+   unsigned long size;
+   unsigned long i;
+   int *ptr;
+   int ret;
+   unsigned char *m;
+
+   gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+   if (gup_fd == -1)
+   SKIP(return, "Skipping test, could not find gup_test driver");
+
+   npages = 4;
+   ASSERT_NE(npages, 0);
+   size = npages << self->page_shift;
+
+   buffer = malloc(sizeof(*buffer));
+   ASSERT_NE(buffer, NULL);
+
+   buffer->fd = -1;
+   buffer->size = size;
+   buffer->mirror = malloc(size);
+   ASSERT_NE(buffer->mirror, NULL);
+
+   buffer->ptr = mmap(NULL, size,
+  PROT_READ | PROT_WRITE,
+  MAP_PRIVATE | MAP_ANONYMOUS,
+  buffer->fd, 0);
+   ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+   /* Initialize buffer in system memory. */
+   for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+   ptr[i] = i;
+
+   /* Migrate memory to device. */
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   /* Check what the device read. */
+   for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+   ASSERT_EQ(ptr[i], i);
+
+   gup.nr_pages_per_call = npages;
+   gup.addr = (unsigned long)buffer->ptr;
+   gup.gup_flags = FOLL_WRITE;
+   gup.size = size;
+   /*
+* Calling gup_test ioctl. It will try to PIN_LONGTERM these device 
pages
+* causing a migration back to system memory for both, private and 
coherent
+* type pages.
+*/
+   if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
+   perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
+   goto out_test;
+   }
+
+   /* Take snapshot to make sure pages have been migrated to sys memory */
+   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   m = buffer->mirror;
+   for (i = 0; i < npages; i++)
+   ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
+out_test:
+   close(gup_fd);
+   hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
2.30.2




[PATCH 26/27] mm/gup: migrate device coherent pages when pinning instead of failing

2022-02-09 Thread Christoph Hellwig
From: Alistair Popple 

Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like
pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
them first try migrating them out of ZONE_DEVICE.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
[hch: rebased to the split device memory checks,
  moved migrate_device_page to migrate_device.c]
Signed-off-by: Christoph Hellwig 
---
 mm/gup.c| 37 ++-
 mm/internal.h   |  1 +
 mm/migrate_device.c | 53 +
 3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 39b23ad39a7bde..41349b685eafb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1889,9 +1889,31 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
ret = -EFAULT;
goto unpin_pages;
}
+
+   /*
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* to migrate the page out of device memory.
+*/
if (is_device_coherent_page(head)) {
-   ret = -EFAULT;
-   goto unpin_pages;
+   WARN_ON_ONCE(PageCompound(head));
+
+   /*
+* Migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(head);
+   unpin_user_page(head);
+   }
+
+   pages[i] = migrate_device_page(head, gup_flags);
+   if (!pages[i]) {
+   ret = -EBUSY;
+   goto unpin_pages;
+   }
+   continue;
}
 
if (is_pinnable_page(head))
@@ -1931,10 +1953,13 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
return nr_pages;
 
 unpin_pages:
-   if (gup_flags & FOLL_PIN) {
-   unpin_user_pages(pages, nr_pages);
-   } else {
-   for (i = 0; i < nr_pages; i++)
+   for (i = 0; i < nr_pages; i++) {
+   if (!pages[i])
+   continue;
+
+   if (gup_flags & FOLL_PIN)
+   unpin_user_page(pages[i]);
+   else
put_page(pages[i]);
}
 
diff --git a/mm/internal.h b/mm/internal.h
index a67222d17e5987..1bded5d7f41a9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -719,5 +719,6 @@ int numa_migrate_prep(struct page *page, struct 
vm_area_struct *vma,
  unsigned long addr, int page_nid, int *flags);
 
 void free_zone_device_page(struct page *page);
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags);
 
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 03e182f9fc7865..3373b535d5c9d9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -767,3 +767,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
}
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
+
+/*
+ * Migrate a device coherent page back to normal memory.  The caller should 
have
+ * a reference on page which will be copied to the new page if migration is
+ * successful or dropped on failure.
+ */
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma args;
+   struct page *dpage;
+
+   lock_page(page);
+   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
+   args.src = &src_pfn;
+   args.dst = &dst_pfn;
+   args.cpages = 1;
+   args.npages = 1;
+   args.vma = NULL;
+   migrate_vma_setup(&args);
+   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
+   return NULL;
+
+   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
+
+   /*
+* get/pin the new page now so we don't have to retry gup after
+* migrating. We already have a reference so this should never fail.
+*/
+   if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
+   __free_pages(dpage, 0);
+   dpage = NULL;
+   }
+
+   if (dpage) {
+   lock_page(dpage);
+   

[PATCH 25/27] mm: remove the vma check in migrate_vma_setup()

2022-02-09 Thread Christoph Hellwig
From: Alistair Popple 

migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages() meaning a valid
vma isn't required.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 mm/migrate_device.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0b295594e7626d..03e182f9fc7865 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -462,24 +462,24 @@ int migrate_vma_setup(struct migrate_vma *args)
 
args->start &= PAGE_MASK;
args->end &= PAGE_MASK;
-   if (!args->vma || is_vm_hugetlb_page(args->vma) ||
-   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
-   return -EINVAL;
-   if (nr_pages <= 0)
-   return -EINVAL;
-   if (args->start < args->vma->vm_start ||
-   args->start >= args->vma->vm_end)
-   return -EINVAL;
-   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
-   return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
-   memset(args->src, 0, sizeof(*args->src) * nr_pages);
-   args->cpages = 0;
-   args->npages = 0;
-
-   migrate_vma_collect(args);
+   if (args->vma) {
+   if (is_vm_hugetlb_page(args->vma) ||
+   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
+   return -EINVAL;
+   if (args->start < args->vma->vm_start ||
+   args->start >= args->vma->vm_end)
+   return -EINVAL;
+   if (args->end <= args->vma->vm_start ||
+   args->end > args->vma->vm_end)
+   return -EINVAL;
+   memset(args->src, 0, sizeof(*args->src) * nr_pages);
+   args->cpages = 0;
+   args->npages = 0;
+
+   migrate_vma_collect(args);
+   }
 
if (args->cpages)
migrate_vma_unmap(args);
@@ -661,7 +661,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}
 
-   if (!page) {
+   if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
-- 
2.30.2




[PATCH 23/27] tools: update hmm-test to support device coherent type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Test cases such as migrate_fault and migrate_multiple, were modified to
explicit migrate from device to sys memory without the need of page
faults, when using device coherent type.

Snapshot test case updated to read memory device type first and based
on that, get the proper returned results migrate_ping_pong test case
added to test explicit migration from device to sys memory for both
private and coherent zone types.

Helpers to migrate from device to sys memory and vicerversa
were also added.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 tools/testing/selftests/vm/hmm-tests.c | 123 -
 1 file changed, 102 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 203323967b507a..84ec8c4a1dc7b6 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
int fd;
uint64_tcpages;
uint64_tfaults;
+   int zone_device_type;
+};
+
+enum {
+   HMM_PRIVATE_DEVICE_ONE,
+   HMM_PRIVATE_DEVICE_TWO,
+   HMM_COHERENCE_DEVICE_ONE,
+   HMM_COHERENCE_DEVICE_TWO,
 };
 
 #define TWOMEG (1 << 21)
@@ -60,6 +68,21 @@ FIXTURE(hmm)
unsigned intpage_shift;
 };
 
+FIXTURE_VARIANT(hmm)
+{
+   int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+   .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+   .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
 FIXTURE(hmm2)
 {
int fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
unsigned intpage_shift;
 };
 
+FIXTURE_VARIANT(hmm2)
+{
+   int device_number0;
+   int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+   .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+   .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+   .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+   .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
 static int hmm_open(int unit)
 {
char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
return fd;
 }
 
+static bool hmm_is_coherent_type(int dev_num)
+{
+   return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
 FIXTURE_SETUP(hmm)
 {
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
 
-   self->fd = hmm_open(0);
+   self->fd = hmm_open(variant->device_number);
+   if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
 }
 
@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
 
-   self->fd0 = hmm_open(0);
+   self->fd0 = hmm_open(variant->device_number0);
+   if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
-   self->fd1 = hmm_open(1);
+   self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
 }
 
@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+   buffer->zone_device_type = cmd.zone_device_type;
 
return 0;
 }
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
 }
 
+static int hmm_migrate_sys_to_dev(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
 /*
  * Simple NULL test of device open/close.
  */
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);
 
@@ -923,7 +988,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(sel

[PATCH 24/27] tools: update test_hmm script to support SP config

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Add two more parameters to set spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the start SP
addresses for each device in test_hmm driver.
Consequently, this configures zone device type as coherent.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 tools/testing/selftests/vm/test_hmm.sh | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh 
b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a62564..539c9371e592a1 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,11 +40,26 @@ check_test_requirements()
 
 load_driver()
 {
-   modprobe $DRIVER > /dev/null 2>&1
+   if [ $# -eq 0 ]; then
+   modprobe $DRIVER > /dev/null 2>&1
+   else
+   if [ $# -eq 2 ]; then
+   modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
+   > /dev/null 2>&1
+   else
+   echo "Missing module parameters. Make sure pass"\
+   "spm_addr_dev0 and spm_addr_dev1"
+   usage
+   fi
+   fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
mknod /dev/hmm_dmirror1 c $major 1
+   if [ $# -eq 2 ]; then
+   mknod /dev/hmm_dmirror2 c $major 2
+   mknod /dev/hmm_dmirror3 c $major 3
+   fi
fi
 }
 
@@ -58,7 +73,7 @@ run_smoke()
 {
echo "Running smoke test. Note, this test provides basic coverage."
 
-   load_driver
+   load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
 }
@@ -75,6 +90,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+   echo "# Smoke testing with SPM enabled"
+   echo "./${TEST_NAME}.sh smoke  "
+   echo
exit 0
 }
 
@@ -84,7 +102,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
-   run_smoke
+   run_smoke $2 $3
else
usage
fi
-- 
2.30.2




[PATCH 22/27] lib: add support for device coherent type in test_hmm

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Device Coherent type uses device memory that is coherently accesible by
the CPU. This could be shown as SP (special purpose) memory range
at the BIOS-e820 memory enumeration. If no SP memory is supported in
system, this could be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB size. This could be specified in the kernel parameter variable
efi_fake_mem. Ex. Two SP ranges of 1GB starting at 0x1 &
0x14000 physical address. Ex.
efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4

Private and coherent device mirror instances can be created in the same
probed. This is done by passing the module parameters spm_addr_dev0 &
spm_addr_dev1. In this case, it will create four instances of
device_mirror. The first two correspond to private device type, the
last two to coherent type. Then, they can be easily accessed from user
space through /dev/hmm_mirror. Usually num_device 0 and 1
are for private, and 2 and 3 for coherent types. If no module
parameters are passed, two instances of private type device_mirror will
be created only.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Poppple 
---
 lib/test_hmm.c  | 253 +---
 lib/test_hmm_uapi.h |  15 ++-
 2 files changed, 202 insertions(+), 66 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 15747f70c5bc9a..361a026c5d2126 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -32,11 +32,22 @@
 
 #include "test_hmm_uapi.h"
 
-#define DMIRROR_NDEVICES   2
+#define DMIRROR_NDEVICES   4
 #define DMIRROR_RANGE_FAULT_TIMEOUT1000
 #define DEVMEM_CHUNK_SIZE  (256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE  16
 
+/*
+ * For device_private pages, dpage is just a dummy struct page
+ * representing a piece of device memory. dmirror_devmem_alloc_page
+ * allocates a real system memory page as backing storage to fake a
+ * real device. zone_device_data points to that backing page. But
+ * for device_coherent memory, the struct page represents real
+ * physical CPU-accessible memory that we can use directly.
+ */
+#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
+  (page)->zone_device_data : (page))
+
 static unsigned long spm_addr_dev0;
 module_param(spm_addr_dev0, long, 0644);
 MODULE_PARM_DESC(spm_addr_dev0,
@@ -125,6 +136,21 @@ static int dmirror_bounce_init(struct dmirror_bounce 
*bounce,
return 0;
 }
 
+static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
+{
+   return (mdevice->zone_device_type ==
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
+}
+
+static enum migrate_vma_direction
+dmirror_select_device(struct dmirror *dmirror)
+{
+   return (dmirror->mdevice->zone_device_type ==
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+   MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+   MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+}
+
 static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
 {
vfree(bounce->ptr);
@@ -575,16 +601,19 @@ static int dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 {
struct page *dpage = NULL;
-   struct page *rpage;
+   struct page *rpage = NULL;
 
/*
-* This is a fake device so we alloc real system memory to store
-* our device memory.
+* For ZONE_DEVICE private type, this is a fake device so we allocate
+* real system memory to store our device memory.
+* For ZONE_DEVICE coherent type we use the actual dpage to store the
+* data and ignore rpage.
 */
-   rpage = alloc_page(GFP_HIGHUSER);
-   if (!rpage)
-   return NULL;
-
+   if (dmirror_is_private_zone(mdevice)) {
+   rpage = alloc_page(GFP_HIGHUSER);
+   if (!rpage)
+   return NULL;
+   }
spin_lock(&mdevice->lock);
 
if (mdevice->free_pages) {
@@ -603,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
return dpage;
 
 error:
-   __free_page(rpage);
+   if (rpage)
+   __free_page(rpage);
return NULL;
 }
 
@@ -629,12 +659,16 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
 * unallocated pte_none() or read-only zero page.
 */
spage = migrate_pfn_to_page(*src);
+   if (WARN(spage && is_zone_device_page(spage),
+"page already in device spage pfn: 0x%lx\n",
+page_to_pfn(spage)))
+   continue;
 
dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;
 
-   rpage = dpage->zone_device_data;
+   rpage = BA

[PATCH 21/27] lib: test_hmm add module param for zone device type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

In order to configure device coherent in test_hmm, two module parameters
should be passed, which correspond to the SP start address of each
device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
private device type is configured.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Poppple 
Signed-off-by: Christoph Hellwig 
---
 lib/test_hmm.c  | 73 -
 lib/test_hmm_uapi.h |  1 +
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 7a27584484ce0f..15747f70c5bc9a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -37,6 +37,16 @@
 #define DEVMEM_CHUNK_SIZE  (256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE  16
 
+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+   "Specify start address for SPM (special purpose memory) used 
for device 0. By setting this Coherent device type will be used. Make sure 
spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+   "Specify start address for SPM (special purpose memory) used 
for device 1. By setting this Coherent device type will be used. Make sure 
spm_addr_dev0 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
@@ -455,28 +465,44 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
return ret;
 }
 
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
   struct page **ppage)
 {
struct dmirror_chunk *devmem;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+   int ret = -ENOMEM;
 
devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
-   return false;
+   return ret;
 
-   res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
-   if (IS_ERR(res))
+   switch (mdevice->zone_device_type) {
+   case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
+   res = request_free_mem_region(&iomem_resource, 
DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+   if (IS_ERR_OR_NULL(res))
+   goto err_devmem;
+   devmem->pagemap.range.start = res->start;
+   devmem->pagemap.range.end = res->end;
+   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+   break;
+   case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
+   devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) 
?
+   spm_addr_dev0 :
+   spm_addr_dev1;
+   devmem->pagemap.range.end = devmem->pagemap.range.start +
+   DEVMEM_CHUNK_SIZE - 1;
+   devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+   break;
+   default:
+   ret = -EINVAL;
goto err_devmem;
+   }
 
-   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-   devmem->pagemap.range.start = res->start;
-   devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = &dmirror_devmem_ops;
devmem->pagemap.owner = mdevice;
@@ -497,10 +523,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
-
ptr = memremap_pages(&devmem->pagemap, numa_node_id());
-   if (IS_ERR(ptr))
+   if (IS_ERR_OR_NULL(ptr)) {
+   if (ptr)
+   ret = PTR_ERR(ptr);
+   else
+   ret = -EFAULT;
goto err_release;
+   }
 
devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -529,15 +559,17 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
}
spin_unlock(&mdevice->lock);
 
-   return true;
+   return 0;
 
 err_release:
mutex_unlock(&mdevice->devmem_lock);
-   release_mem_region(devmem->pagemap.range.start, 
range_len(&devmem->pagemap.range));
+   if (

[PATCH 20/27] lib: test_hmm add ioctl to get zone device type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

new ioctl cmd added to query zone device type. This will be
used once the test_hmm adds zone device coherent type.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Poppple 
Signed-off-by: Christoph Hellwig 
---
 lib/test_hmm.c  | 23 +--
 lib/test_hmm_uapi.h |  8 
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index cfe63204783918..7a27584484ce0f 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -87,6 +87,7 @@ struct dmirror_chunk {
 struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem   *devmem;
+   unsigned intzone_device_type;
 
unsigned intdevmem_capacity;
unsigned intdevmem_count;
@@ -1026,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static int dmirror_get_device_type(struct dmirror *dmirror,
+   struct hmm_dmirror_cmd *cmd)
+{
+   mutex_lock(&dmirror->mutex);
+   cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+   mutex_unlock(&dmirror->mutex);
+
+   return 0;
+}
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1076,6 +1086,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_snapshot(dmirror, &cmd);
break;
 
+   case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+   ret = dmirror_get_device_type(dmirror, &cmd);
+   break;
default:
return -EINVAL;
}
@@ -1260,14 +1273,20 @@ static void dmirror_device_remove(struct dmirror_device 
*mdevice)
 static int __init hmm_dmirror_init(void)
 {
int ret;
-   int id;
+   int id = 0;
+   int ndevices = 0;
 
ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
  "HMM_DMIRROR");
if (ret)
goto err_unreg;
 
-   for (id = 0; id < DMIRROR_NDEVICES; id++) {
+   memset(dmirror_devices, 0, DMIRROR_NDEVICES * 
sizeof(dmirror_devices[0]));
+   dmirror_devices[ndevices++].zone_device_type =
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+   dmirror_devices[ndevices++].zone_device_type =
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+   for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
goto err_chrdev;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f14dea5dcd062b..17f842f1aa02c7 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -19,6 +19,7 @@
  * @npages: (in) number of pages to read/write
  * @cpages: (out) number of pages copied
  * @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
  */
 struct hmm_dmirror_cmd {
__u64   addr;
@@ -26,6 +27,7 @@ struct hmm_dmirror_cmd {
__u64   npages;
__u64   cpages;
__u64   faults;
+   __u64   zone_device_type;
 };
 
 /* Expose the address space of the calling process through hmm device file */
@@ -35,6 +37,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_SNAPSHOT   _IOWR('H', 0x03, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_EXCLUSIVE  _IOWR('H', 0x04, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE_IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE   _IOWR('H', 0x06, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -62,4 +65,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
 };
 
+enum {
+   /* 0 is reserved to catch uninitialized type fields */
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.30.2




[PATCH 19/27] drm/amdkfd: coherent type as sys mem on migration to ram

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Coherent device type memory on VRAM to RAM migration, has similar access
as System RAM from the CPU. This flag sets the source from the sender.
Which in Coherent type case, should be set as
MIGRATE_VMA_SELECT_DEVICE_COHERENT.

Signed-off-by: Alex Sierra 
Reviewed-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 2c51f2ac3b46ac..6646291d75d574 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -659,9 +659,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
-   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
 
+   if (adev->gmc.xgmi.connected_to_cpu)
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+   else
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
-- 
2.30.2




[PATCH 18/27] drm/amdkfd: add SPM support for SVM

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

When CPU is connected throug XGMI, it has coherent
access to VRAM resource. In this case that resource
is taken from a table in the device gmc aperture base.
This resource is used along with the device type, which could
be DEVICE_PRIVATE or DEVICE_COHERENT to create the device
page map region.

Signed-off-by: Alex Sierra 
Reviewed-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 28 ++--
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index e27ca375876230..2c51f2ac3b46ac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -933,7 +933,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 {
struct kfd_dev *kfddev = adev->kfd.dev;
struct dev_pagemap *pgmap;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long size;
void *r;
 
@@ -948,28 +948,34 @@ int svm_migrate_init(struct amdgpu_device *adev)
 * should remove reserved size
 */
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-   res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
-   if (IS_ERR(res))
-   return -ENOMEM;
+   if (adev->gmc.xgmi.connected_to_cpu) {
+   pgmap->range.start = adev->gmc.aper_base;
+   pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 
1;
+   pgmap->type = MEMORY_DEVICE_COHERENT;
+   } else {
+   res = devm_request_free_mem_region(adev->dev, &iomem_resource, 
size);
+   if (IS_ERR(res))
+   return -ENOMEM;
+   pgmap->range.start = res->start;
+   pgmap->range.end = res->end;
+   pgmap->type = MEMORY_DEVICE_PRIVATE;
+   }
 
-   pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
-   pgmap->range.start = res->start;
-   pgmap->range.end = res->end;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-   pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
-
+   pgmap->flags = 0;
/* Device manager releases device-specific resources, memory region and
 * pgmap when driver disconnects from device.
 */
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
-
/* Disable SVM support capability */
pgmap->type = 0;
-   devm_release_mem_region(adev->dev, res->start, 
resource_size(res));
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+   devm_release_mem_region(adev->dev, res->start,
+   res->end - res->start + 1);
return PTR_ERR(r);
}
 
-- 
2.30.2




[PATCH 17/27] mm/gup: fail get_user_pages for LONGTERM dev coherent type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Avoid long term pinning for Coherent device type pages. This could
interfere with their own device memory manager. For now, we are just
returning error for PIN_LONGTERM Coherent device type pages. Eventually,
these type of pages will get migrated to system memory, once the device
migration pages support is added.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Poppple 
[hch: rebased on previous cleanups, split the two checks]
Signed-off-by: Christoph Hellwig 
---
 mm/gup.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 37d6c24ca71225..39b23ad39a7bde 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1881,6 +1881,19 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
continue;
prev_head = head;
 
+   /*
+* Device private pages will get faulted in during gup so it
+* shouldn't be possible to see one here.
+*/
+   if (WARN_ON_ONCE(is_device_private_page(head))) {
+   ret = -EFAULT;
+   goto unpin_pages;
+   }
+   if (is_device_coherent_page(head)) {
+   ret = -EFAULT;
+   goto unpin_pages;
+   }
+
if (is_pinnable_page(head))
continue;
 
@@ -1925,7 +1938,7 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
put_page(pages[i]);
}
 
-   if (!list_empty(&movable_page_list)) {
+   if (!ret && !list_empty(&movable_page_list)) {
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
-- 
2.30.2




[PATCH 16/27] mm: add device coherent vma selection for memory migration

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

This case is used to migrate pages from device memory, back to system
memory. Device coherent type memory is cache coherent from device and CPU
point of view.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Poppple 
Signed-off-by: Christoph Hellwig 
---
 include/linux/migrate.h |  1 +
 mm/migrate_device.c | 12 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index db96e10eb8da22..66a34eae8cb635 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,6 +130,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+   MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index bfd66e7d830b02..0b295594e7626d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -147,15 +147,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
-   if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
-   goto next;
pfn = pte_pfn(pte);
-   if (is_zero_pfn(pfn)) {
+   if (is_zero_pfn(pfn) &&
+   (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+   if (page && !is_zone_device_page(page) &&
+   !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+   goto next;
+   else if (page && is_device_coherent_page(page) &&
+   (!(migrate->flags & 
MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+page->pgmap->owner != migrate->pgmap_owner))
+   goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
-- 
2.30.2




[PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file

2022-02-09 Thread Christoph Hellwig
Split the code used to migrate to and from ZONE_DEVICE memory from
migrate.c into a new file.

Signed-off-by: Christoph Hellwig 
---
 mm/Kconfig  |   3 +
 mm/Makefile |   1 +
 mm/migrate.c| 753 ---
 mm/migrate_device.c | 765 
 4 files changed, 769 insertions(+), 753 deletions(-)
 create mode 100644 mm/migrate_device.c

diff --git a/mm/Kconfig b/mm/Kconfig
index a1901ae6d06293..6391d8d3a616f3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -249,6 +249,9 @@ config MIGRATION
  pages as migration can relocate pages to satisfy a huge page
  allocation instead of reclaiming.
 
+config DEVICE_MIGRATION
+   def_bool MIGRATION && DEVICE_PRIVATE
+
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9ce338..4cc13f3179a518 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)  += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/migrate.c b/mm/migrate.c
index 746e1230886ddb..c31d04b46a5e17 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,12 +38,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -2125,757 +2123,6 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
 
-#ifdef CONFIG_DEVICE_PRIVATE
-static int migrate_vma_collect_skip(unsigned long start,
-   unsigned long end,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
-   migrate->dst[migrate->npages] = 0;
-   migrate->src[migrate->npages++] = 0;
-   }
-
-   return 0;
-}
-
-static int migrate_vma_collect_hole(unsigned long start,
-   unsigned long end,
-   __always_unused int depth,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   /* Only allow populating anonymous memory. */
-   if (!vma_is_anonymous(walk->vma))
-   return migrate_vma_collect_skip(start, end, walk);
-
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
-   migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
-   migrate->dst[migrate->npages] = 0;
-   migrate->npages++;
-   migrate->cpages++;
-   }
-
-   return 0;
-}
-
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-  unsigned long start,
-  unsigned long end,
-  struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   struct vm_area_struct *vma = walk->vma;
-   struct mm_struct *mm = vma->vm_mm;
-   unsigned long addr = start, unmapped = 0;
-   spinlock_t *ptl;
-   pte_t *ptep;
-
-again:
-   if (pmd_none(*pmdp))
-   return migrate_vma_collect_hole(start, end, -1, walk);
-
-   if (pmd_trans_huge(*pmdp)) {
-   struct page *page;
-
-   ptl = pmd_lock(mm, pmdp);
-   if (unlikely(!pmd_trans_huge(*pmdp))) {
-   spin_unlock(ptl);
-   goto again;
-   }
-
-   page = pmd_page(*pmdp);
-   if (is_huge_zero_page(page)) {
-   spin_unlock(ptl);
-   split_huge_pmd(vma, pmdp, addr);
-   if (pmd_trans_unstable(pmdp))
-   return migrate_vma_collect_skip(start, end,
-   walk);
-   } else {
-   int ret;
-
-   get_page(page);
-   spin_unlock(ptl);
-   if (unlikely(!trylock_page(page)))
-   return migrate_vma_collect_skip(start, end,
-   walk);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (ret)
-   return migrate_vma_collect_skip(start, end,
-   walk);
-

[PATCH 15/27] mm: add zone device coherent type memory support

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
[hch: rebased ontop of the refcount changes,
  removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig 
---
 include/linux/memremap.h | 14 ++
 mm/memcontrol.c  |  7 ---
 mm/memory-failure.c  |  8 ++--
 mm/memremap.c| 10 ++
 mm/migrate_device.c  | 16 +++-
 mm/rmap.c|  5 +++--
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index d6a114dd5ea8b7..eb73630a49da39 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@ struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@ struct vmem_altmap {
 enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+   MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
@@ -138,6 +146,12 @@ static inline bool is_device_private_page(const struct 
page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_coherent_page(const struct page *page)
+{
+   return is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_COHERENT;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 510cbfb82bb62a..10259c35fde20d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5687,8 +5687,8 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  * target for charge migration. if @target is not NULL, the entry is stored
  * in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
MEMORY_DEVICE_PRIVATE
- * (so ZONE_DEVICE page and thus not on the lru).
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
+ *   thus not on the lru.
  * For now we such page is charge like a regular page would be as for all
  * intent and purposes it is just special memory taking the place of a
  * regular page.
@@ -5722,7 +5722,8 @@ static enum mc_target_type get_mctgt_type(struct 
vm_area_struct *vma,
 */
if (page_memcg(page) == mc.from) {
ret = MC_TARGET_PAGE;
-   if (is_device_private_page(page))
+   if (is_device_private_page(page) ||
+   is_device_coherent_page(page))
ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a96a..f498ed3ece79ae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1617,12 +1617,16 @@ static int memory_failure_dev_pagemap(unsigned long 
pfn, int flags,
goto unlock;
}
 
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   switch (pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_COHERENT:
/*
-* TODO: Handle HMM pages which may need coordination
+* TODO: Handle device pages which may need coordination
 * with device-side memory.
 */
goto unlock;
+   default:
+   break;
}
 
/*
diff --git a/mm/memremap.c b/mm/memremap.c
index e00ffcdba7b632..d00bb21a0630cd 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -313,6 +313,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
+   case MEMORY_DEVICE_COHERENT:
+ 

[PATCH 14/27] mm: build migrate_vma_* for all configs with ZONE_DEVICE support

2022-02-09 Thread Christoph Hellwig
This code will be used for device coherent memory as well in a bit,
so relax the ifdef a bit.

Signed-off-by: Christoph Hellwig 
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6391d8d3a616f3..95d4aa3acaefe0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -250,7 +250,7 @@ config MIGRATION
  allocation instead of reclaiming.
 
 config DEVICE_MIGRATION
-   def_bool MIGRATION && DEVICE_PRIVATE
+   def_bool MIGRATION && ZONE_DEVICE
 
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
-- 
2.30.2




[PATCH 12/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_pages

2022-02-09 Thread Christoph Hellwig
Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Signed-off-by: Christoph Hellwig 
---
 mm/migrate.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 30ecd7223656c1..746e1230886ddb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2788,24 +2788,21 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 
mapping = page_mapping(page);
 
-   if (is_zone_device_page(newpage)) {
-   if (is_device_private_page(newpage)) {
-   /*
-* For now only support private anonymous when
-* migrating to un-addressable device memory.
-*/
-   if (mapping) {
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   continue;
-   }
-   } else {
-   /*
-* Other types of ZONE_DEVICE page are not
-* supported.
-*/
+   if (is_device_private_page(newpage)) {
+   /*
+* For now only support private anonymous when migrating
+* to un-addressable device memory.
+*/
+   if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
continue;
}
+   } else if (is_zone_device_page(newpage)) {
+   /*
+* Other types of ZONE_DEVICE page are not supported.
+*/
+   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+   continue;
}
 
r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
-- 
2.30.2




[PATCH 11/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page

2022-02-09 Thread Christoph Hellwig
Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Signed-off-by: Christoph Hellwig 
---
 mm/migrate.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8e0370a73f8a43..30ecd7223656c1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2670,26 +2670,25 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
 */
__SetPageUptodate(page);
 
-   if (is_zone_device_page(page)) {
-   if (is_device_private_page(page)) {
-   swp_entry_t swp_entry;
+   if (is_device_private_page(page)) {
+   swp_entry_t swp_entry;
 
-   if (vma->vm_flags & VM_WRITE)
-   swp_entry = make_writable_device_private_entry(
-   page_to_pfn(page));
-   else
-   swp_entry = make_readable_device_private_entry(
-   page_to_pfn(page));
-   entry = swp_entry_to_pte(swp_entry);
-   } else {
-   /*
-* For now we only support migrating to un-addressable
-* device memory.
-*/
+   if (vma->vm_flags & VM_WRITE)
+   swp_entry = make_writable_device_private_entry(
+   page_to_pfn(page));
+   else
+   swp_entry = make_readable_device_private_entry(
+   page_to_pfn(page));
+   entry = swp_entry_to_pte(swp_entry);
+   } else {
+   /*
+* For now we only support migrating to un-addressable device
+* memory.
+*/
+   if (is_zone_device_page(page)) {
pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
goto abort;
}
-   } else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
-- 
2.30.2




[PATCH 10/27] mm: refactor check_and_migrate_movable_pages

2022-02-09 Thread Christoph Hellwig
Remove up to two levels of indentation by using continue statements
and move variables to local scope where possible.

Signed-off-by: Christoph Hellwig 
---
 mm/gup.c | 81 ++--
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef749..37d6c24ca71225 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1868,72 +1868,79 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
struct page **pages,
unsigned int gup_flags)
 {
-   unsigned long i;
-   unsigned long isolation_error_count = 0;
-   bool drain_allow = true;
-   LIST_HEAD(movable_page_list);
-   long ret = 0;
+   unsigned long isolation_error_count = 0, i;
struct page *prev_head = NULL;
-   struct page *head;
-   struct migration_target_control mtc = {
-   .nid = NUMA_NO_NODE,
-   .gfp_mask = GFP_USER | __GFP_NOWARN,
-   };
+   LIST_HEAD(movable_page_list);
+   bool drain_allow = true;
+   int ret = 0;
 
for (i = 0; i < nr_pages; i++) {
-   head = compound_head(pages[i]);
+   struct page *head = compound_head(pages[i]);
+
if (head == prev_head)
continue;
prev_head = head;
+
+   if (is_pinnable_page(head))
+   continue;
+
/*
-* If we get a movable page, since we are going to be pinning
-* these entries, try to move them out if possible.
+* Try to move out any movable page before pinning the range.
 */
-   if (!is_pinnable_page(head)) {
-   if (PageHuge(head)) {
-   if (!isolate_huge_page(head, 
&movable_page_list))
-   isolation_error_count++;
-   } else {
-   if (!PageLRU(head) && drain_allow) {
-   lru_add_drain_all();
-   drain_allow = false;
-   }
+   if (PageHuge(head)) {
+   if (!isolate_huge_page(head, &movable_page_list))
+   isolation_error_count++;
+   continue;
+   }
 
-   if (isolate_lru_page(head)) {
-   isolation_error_count++;
-   continue;
-   }
-   list_add_tail(&head->lru, &movable_page_list);
-   mod_node_page_state(page_pgdat(head),
-   NR_ISOLATED_ANON +
-   page_is_file_lru(head),
-   thp_nr_pages(head));
-   }
+   if (!PageLRU(head) && drain_allow) {
+   lru_add_drain_all();
+   drain_allow = false;
+   }
+
+   if (isolate_lru_page(head)) {
+   isolation_error_count++;
+   continue;
}
+   list_add_tail(&head->lru, &movable_page_list);
+   mod_node_page_state(page_pgdat(head),
+   NR_ISOLATED_ANON + page_is_file_lru(head),
+   thp_nr_pages(head));
}
 
+   if (!list_empty(&movable_page_list) || isolation_error_count)
+   goto unpin_pages;
+
/*
 * If list is empty, and no isolation errors, means that all pages are
 * in the correct zone.
 */
-   if (list_empty(&movable_page_list) && !isolation_error_count)
-   return nr_pages;
+   return nr_pages;
 
+unpin_pages:
if (gup_flags & FOLL_PIN) {
unpin_user_pages(pages, nr_pages);
} else {
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
}
+
if (!list_empty(&movable_page_list)) {
+   struct migration_target_control mtc = {
+   .nid = NUMA_NO_NODE,
+   .gfp_mask = GFP_USER | __GFP_NOWARN,
+   };
+
ret = migrate_pages(&movable_page_list, alloc_migration_target,
NULL, (unsigned long)&mtc, MIGRATE_SYNC,
MR_LONGTERM_PIN, NULL);
-   if (ret && !list_empty(&movable_page_list))
-   putback_movable_pages(&movable_page_list);
+   if (ret > 0) /* number of pages not migrated */
+  

[PATCH 09/27] mm: generalize the pgmap based page_free infrastructure

2022-02-09 Thread Christoph Hellwig
Key off on the existence of ->page_free to prepare for adding support for
more pgmap types that are device managed and thus need the free callback.

Signed-off-by: Christoph Hellwig 
---
 mm/memremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index fef5734d5e4933..e00ffcdba7b632 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 void free_zone_device_page(struct page *page)
 {
-   if (WARN_ON_ONCE(!is_device_private_page(page)))
+   if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
return;
 
__ClearPageWaiters(page);
@@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
mem_cgroup_uncharge(page_folio(page));
 
/*
-* When a device_private page is freed, the page->mapping field
+* When a device managed page is freed, the page->mapping field
 * may still contain a (stale) mapping value. For example, the
 * lower bits of page->mapping may still identify the page as an
 * anonymous page. Ultimately, this entire field is just stale
-- 
2.30.2




[PATCH 08/27] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

2022-02-09 Thread Christoph Hellwig
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 fs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index e9433bbc48010a..7f2455e8e18ae2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,6 +48,7 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
help
-- 
2.30.2




[PATCH 07/27] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-09 Thread Christoph Hellwig
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1.  This is a separate issue and will
be sorted out later.  Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell .

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Ralph Campbell 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: Felix Kuehling 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  1 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  1 -
 fs/Kconfig   |  1 -
 include/linux/memremap.h | 12 +++--
 include/linux/mm.h   |  6 +--
 lib/test_hmm.c   |  1 -
 mm/Kconfig   |  4 --
 mm/internal.h|  2 +
 mm/memcontrol.c  | 11 ++---
 mm/memremap.c| 57 
 mm/migrate.c |  6 ---
 mm/swap.c| 16 ++-
 13 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839fd1..8b6438fa18fc2b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index cb835f95a76e66..e27ca375876230 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   get_page(page);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a5cdfbe32b5e54..7ba66ad68a8a1e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
lock_page(page);
return page;
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 6c7dc1387beb0f..e9433bbc48010a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,7 +48,6 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
-   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 514ab46f597e5c..d6a114dd5ea8b7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -68,9 +68,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0.  The reference count will be
+* reset to one by the core code after the method is called to prepare
+* for handing out the page again.
 */
void (*page_free)(struct page *page);
 
@@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
 
 static inline bool is_device_private_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page

[PATCH 06/27] mm: don't include in

2022-02-09 Thread Christoph Hellwig
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: Felix Kuehling 
---
 arch/arm64/mm/mmu.c|  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  1 +
 drivers/gpu/drm/drm_cache.c|  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  1 +
 drivers/infiniband/core/rw.c   |  1 +
 drivers/nvdimm/pmem.h  |  1 +
 drivers/nvme/host/pci.c|  1 +
 drivers/nvme/target/io-cmd-bdev.c  |  1 +
 fs/fuse/virtio_fs.c|  1 +
 include/linux/memremap.h   | 18 ++
 include/linux/mm.h | 20 
 lib/test_hmm.c |  1 +
 mm/memcontrol.c|  1 +
 mm/memremap.c  |  6 +-
 15 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index acfae9b41cc8c9..580abae6c0b93f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ea68f3b3a4e9cb..6d643b4b791d87 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index f19d9acbe95936..50b8a088f763a6 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -27,11 +27,11 @@
 /*
  * Authors: Thomas Hellström 
  */
-
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e886a3b9e08c7d..a5cdfbe32b5e54 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 266809e511e2c1..090b9b47708cca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct nouveau_svm {
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5a3bd41b331c93..4d98f931a13ddd 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (c) 2016 HGST, a Western Digital Company.
  */
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a85c..1f51a23614299b 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,6 +3,7 @@
 #define __NVDIMM_PMEM_H__
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed68091589..ab15bc72710dbe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 70ca9dfc1771a9..a141446db1bea3 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -6,6 +6,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include 
 #include 
 #include "nvmet.h"
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c0b..86b7dbb6a0d43e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acbad6..514ab46f597e5c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
+
+#include 
 #include 
 #include 
 #include 
@@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
return 1 << pgmap->vmemmap_shift;
 }
 
+static inline bool is_device_private_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
+
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_P

[PATCH 05/27] mm: simplify freeing of devmap managed pages

2022-02-09 Thread Christoph Hellwig
Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Dan Williams 
---
 include/linux/mm.h | 34 ++
 mm/memremap.c  | 20 +---
 mm/swap.c  | 10 +-
 3 files changed, 20 insertions(+), 44 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91dd0bc786a9ec..26baadcef4556b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct 
page *page)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
-static inline bool page_is_devmap_managed(struct page *page)
+bool __put_devmap_managed_page(struct page *page);
+static inline bool put_devmap_managed_page(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
if (!is_zone_device_page(page))
return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
+   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+   return false;
+   return __put_devmap_managed_page(page);
 }
 
-void put_devmap_managed_page(struct page *page);
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
 }
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page)
struct folio *folio = page_folio(page);
 
/*
-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
+* For some devmap managed pages we need to catch refcount transition
+* from 2 to 1:
 */
-   if (page_is_devmap_managed(&folio->page)) {
-   put_devmap_managed_page(&folio->page);
+   if (put_devmap_managed_page(&folio->page))
return;
-   }
-
folio_put(folio);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 55d23e9f5c04ec..f41233a67edb12 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page)
page->pgmap->ops->page_free(page);
 }
 
-void put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page)
 {
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
/*
 * devmap page refcounts are 1-based, rather than 0-based: if
 * refcount is 1, then the page is free and the refcount is
 * stable because nobody holds a reference on the page.
 */
-   if (count == 1)
+   switch (page_ref_dec_return(page)) {
+   case 1:
free_devmap_managed_page(page);
-   else if (!count)
+   break;
+   case 0:
__put_page(page);
+   break;
+   }
+   return true;
 }
-EXPORT_SYMBOL(put_devmap_managed_page);
+EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 08058f74cae23e..25b55c56614311 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
-   /*
-* ZONE_DEVICE pages that return 'false' from
-* page_is_devmap_managed() do not require special
-* processing, and instead, expect a call to
-* put_page_testzero().
-*/
-   if (page_is_devmap_managed(page)) {
-   put_devmap_managed_page(page);
+   if (put_devmap_managed_page(page))
continue;
-   }
if (put_page_testzero(page))
put_dev_pagemap(page->pgmap);
continue;
-- 
2.30.2




[PATCH 03/27] mm: remove pointless includes from

2022-02-09 Thread Christoph Hellwig
hmm.h pulls in the world for no good reason at all.  Remove the
includes and push a few ones into the users instead.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
 include/linux/hmm.h  | 9 ++---
 lib/test_hmm.c   | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index ed5385137f4831..cb835f95a76e66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_object.h"
 #include "amdgpu_vm.h"
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 3828aafd3ac46f..e886a3b9e08c7d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2fd2e91d5107c0..d5a6f101f843e6 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -9,14 +9,9 @@
 #ifndef LINUX_HMM_H
 #define LINUX_HMM_H
 
-#include 
-#include 
+#include 
 
-#include 
-#include 
-#include 
-#include 
-#include 
+struct mmu_interval_notifier;
 
 /*
  * On output:
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 767538089a62e4..396beee6b061d4 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "test_hmm_uapi.h"
 
-- 
2.30.2




[PATCH 04/27] mm: move free_devmap_managed_page to memremap.c

2022-02-09 Thread Christoph Hellwig
free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
---
 include/linux/mm.h |  1 -
 mm/memremap.c  | 21 +
 mm/swap.c  | 23 ---
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b46174989b086..91dd0bc786a9ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page 
*page)
 }
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
 static inline bool page_is_devmap_managed(struct page *page)
diff --git a/mm/memremap.c b/mm/memremap.c
index 5f04a0709e436e..55d23e9f5c04ec 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
+
+void put_devmap_managed_page(struct page *page)
+{
+   int count;
+
+   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+   return;
+
+   count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56d5..08058f74cae23e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1153,26 +1153,3 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.30.2




[PATCH 02/27] mm: remove the __KERNEL__ guard from

2022-02-09 Thread Christoph Hellwig
__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
---
 include/linux/mm.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b19223..7b46174989b086 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3,9 +3,6 @@
 #define _LINUX_MM_H
 
 #include 
-
-#ifdef __KERNEL__
-
 #include 
 #include 
 #include 
@@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long 
start,
 }
 #endif
 
-#endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
-- 
2.30.2




start sorting out the ZONE_DEVICE refcount mess v2

2022-02-09 Thread Christoph Hellwig
Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just device
private ones for now, but also the planned device coherent pages
and the ehanced p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

Note that if we want to get the p2p series rebased on top of this
we'll need a git branch for this series.  I could offer to host one.

A git tree is available here:

git://git.infradead.org/users/hch/misc.git pgmap-refcount

Gitweb:


http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

Changes since v1:
 - add a missing memremap.h include in memcontrol.c
 - include rebased versions of the device coherent support and
   device coherent migration support series as well as additional
   cleanup patches

Diffstt:
 arch/arm64/mm/mmu.c  |1 
 arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   35 -
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
 drivers/gpu/drm/drm_cache.c  |2 
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 
 drivers/gpu/drm/nouveau/nouveau_svm.c|1 
 drivers/infiniband/core/rw.c |1 
 drivers/nvdimm/pmem.h|1 
 drivers/nvme/host/pci.c  |1 
 drivers/nvme/target/io-cmd-bdev.c|1 
 fs/Kconfig   |2 
 fs/fuse/virtio_fs.c  |1 
 include/linux/hmm.h  |9 
 include/linux/memremap.h |   36 +
 include/linux/migrate.h  |1 
 include/linux/mm.h   |   59 --
 lib/test_hmm.c   |  353 ++---
 lib/test_hmm_uapi.h  |   22 
 mm/Kconfig   |7 
 mm/Makefile  |1 
 mm/gup.c |  127 +++-
 mm/internal.h|3 
 mm/memcontrol.c  |   19 
 mm/memory-failure.c  |8 
 mm/memremap.c|   75 +-
 mm/migrate.c |  763 
 mm/migrate_device.c  |  822 +++
 mm/rmap.c|5 
 mm/swap.c|   49 -
 tools/testing/selftests/vm/Makefile  |2 
 tools/testing/selftests/vm/hmm-tests.c   |  204 ++-
 tools/testing/selftests/vm/test_hmm.sh   |   24 
 33 files changed, 1552 insertions(+), 1088 deletions(-)



[PATCH 01/27] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

2022-02-09 Thread Christoph Hellwig
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superflous extra check.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
---
 mm/memremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11fda..5f04a0709e436e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_FS_DAX:
-   if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-   IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
WARN(1, "File system DAX not supported\n");
return ERR_PTR(-EINVAL);
}
-- 
2.30.2




Re: [PATCH 6/8] mm: don't include in

2022-02-09 Thread Christoph Hellwig
On Thu, Feb 10, 2022 at 01:10:47PM +1100, Alistair Popple wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index cbb49abb7992..8e85c9fb8df4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2007,7 +2007,6 @@ static long check_and_migrate_movable_pages(unsigned 
> long nr_pages,
>   if (!ret && list_empty(&movable_page_list) && !isolation_error_count)
>   return nr_pages;
>  
> - ret = 0;
>  unpin_pages:

This isn't quite correct as ret is initially set to -EFAULT now.  I'll
fix it by removing the early ret initialization and always using the
goto. I've also added another refactoring patch for this messy function.

I've folded the inversion of the is_device_coherent_page check in
migrate.c in as well, thanks!



Re: [PATCH 6/8] mm: don't include in

2022-02-09 Thread Christoph Hellwig
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote:
>
> Am 2022-02-07 um 01:32 schrieb Christoph Hellwig:
>> Move the check for the actual pgmap types that need the free at refcount
>> one behavior into the out of line helper, and thus avoid the need to
>> pull memremap.h into mm.h.
>>
>> Signed-off-by: Christoph Hellwig 
>
> The amdkfd part looks good to me.
>
> It looks like this patch is not based on Alex Sierra's coherent memory 
> series. He added two new helpers is_device_coherent_page and 
> is_dev_private_or_coherent_page that would need to be moved along with 
> is_device_private_page and is_pci_p2pdma_page.

FYI, here is a branch that contains a rebase of the coherent memory
related patches on top of this series:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

I don't have a good way to test this, but I'll at least let the build bot
finish before sending it out (probably tomorrow).



Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-09 Thread Christoph Hellwig
On Wed, Feb 09, 2022 at 08:29:56AM -0400, Jason Gunthorpe wrote:
> It is nice, but the other series are still impacted by the fsdax mess
> - they still stuff pages into ptes without proper refcounts and have
> to carry nonsense to dance around this problem.
> 
> I certainly would be unhappy if the amd driver, for instance, gained
> the fsdax problem as well and started pushing 4k pages into PMDs.

As said before: I think this all needs to be fixed.  But I'd rather
fix it gradually and I think this series is a nice step forward.
After that we can look at the pte mappings.



Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-08 Thread Christoph Hellwig
On Tue, Feb 08, 2022 at 07:30:11PM -0800, Dan Williams wrote:
> Interesting. I had expected that to really fix the refcount problem
> that fs/dax.c would need to start taking real page references as pages
> were added to a mapping, just like page cache.

I think we should do that eventually.  But I think this series that
just attacks the device private type and extends to the device coherent
and p2p enhacements is a good first step to stop the proliferation of
the one off refcount and to allow to deal with the fsdax pages in another
more focuessed series.



Re: [PATCH 6/8] mm: don't include in

2022-02-08 Thread Christoph Hellwig
On Tue, Feb 08, 2022 at 03:53:14PM -0800, Dan Williams wrote:
> Yeah, same as Logan:
> 
> mm/memcontrol.c: In function ‘get_mctgt_type’:
> mm/memcontrol.c:5724:29: error: implicit declaration of function
> ‘is_device_private_page’; did you mean
> ‘is_device_private_entry’? [-Werror=implicit-function-declaration]
>  5724 | if (is_device_private_page(page))
>   | ^~
>   | is_device_private_entry
> 
> ...needs:

Yeah, the buildbot also complained.  I've fixed this up locally now.



Re: [PATCH 6/8] mm: don't include in

2022-02-07 Thread Christoph Hellwig
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote:
>
> Am 2022-02-07 um 01:32 schrieb Christoph Hellwig:
>> Move the check for the actual pgmap types that need the free at refcount
>> one behavior into the out of line helper, and thus avoid the need to
>> pull memremap.h into mm.h.
>>
>> Signed-off-by: Christoph Hellwig 
>
> The amdkfd part looks good to me.
>
> It looks like this patch is not based on Alex Sierra's coherent memory 
> series. He added two new helpers is_device_coherent_page and 
> is_dev_private_or_coherent_page that would need to be moved along with 
> is_device_private_page and is_pci_p2pdma_page.

Yes.  I Naked that series because it spreads te mess with the refcount
further in this latest version.  My intent is that it gets rebased
on top of this to avoid that spread.  Same for the p2p series form Logan.



[PATCH 8/8] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

2022-02-06 Thread Christoph Hellwig
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.

Signed-off-by: Christoph Hellwig 
---
 fs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 05efea674bffa0..6e8818a5e53c45 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,6 +48,7 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
help
-- 
2.30.2




[PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-06 Thread Christoph Hellwig
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1.  This is a separate issue and will
be sorted out later.  Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell .

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  1 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  1 -
 fs/Kconfig   |  1 -
 include/linux/memremap.h | 12 +++--
 include/linux/mm.h   |  6 +--
 lib/test_hmm.c   |  1 -
 mm/Kconfig   |  4 --
 mm/internal.h|  2 +
 mm/memcontrol.c  | 11 ++---
 mm/memremap.c| 57 
 mm/migrate.c |  6 ---
 mm/swap.c| 16 ++-
 13 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839fd1..8b6438fa18fc2b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index cb835f95a76e66..e27ca375876230 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   get_page(page);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a5cdfbe32b5e54..7ba66ad68a8a1e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
lock_page(page);
return page;
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 7a2b11c0b8036d..05efea674bffa0 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,7 +48,6 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
-   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 514ab46f597e5c..d6a114dd5ea8b7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -68,9 +68,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0.  The reference count will be
+* reset to one by the core code after the method is called to prepare
+* for handing out the page again.
 */
void (*page_free)(struct page *page);
 
@@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
 
 static inline bool is_device_private_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
i

[PATCH 6/8] mm: don't include in

2022-02-06 Thread Christoph Hellwig
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Signed-off-by: Christoph Hellwig 
---
 arch/arm64/mm/mmu.c|  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  1 +
 drivers/gpu/drm/drm_cache.c|  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  1 +
 drivers/infiniband/core/rw.c   |  1 +
 drivers/nvdimm/pmem.h  |  1 +
 drivers/nvme/host/pci.c|  1 +
 drivers/nvme/target/io-cmd-bdev.c  |  1 +
 fs/fuse/virtio_fs.c|  1 +
 include/linux/memremap.h   | 18 ++
 include/linux/mm.h | 20 
 lib/test_hmm.c |  1 +
 mm/memremap.c  |  6 +-
 14 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index acfae9b41cc8c9..580abae6c0b93f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ea68f3b3a4e9cb..6d643b4b791d87 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index f19d9acbe95936..50b8a088f763a6 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -27,11 +27,11 @@
 /*
  * Authors: Thomas Hellström 
  */
-
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e886a3b9e08c7d..a5cdfbe32b5e54 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 266809e511e2c1..090b9b47708cca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct nouveau_svm {
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5a3bd41b331c93..4d98f931a13ddd 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (c) 2016 HGST, a Western Digital Company.
  */
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a85c..1f51a23614299b 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,6 +3,7 @@
 #define __NVDIMM_PMEM_H__
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed68091589..ab15bc72710dbe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 70ca9dfc1771a9..a141446db1bea3 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -6,6 +6,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include 
 #include 
 #include "nvmet.h"
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c0b..86b7dbb6a0d43e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acbad6..514ab46f597e5c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
+
+#include 
 #include 
 #include 
 #include 
@@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
return 1 << pgmap->vmemmap_shift;
 }
 
+static inline bool is_device_private_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
+
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PC

[PATCH 5/8] mm: simplify freeing of devmap managed pages

2022-02-06 Thread Christoph Hellwig
Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.

Signed-off-by: Christoph Hellwig 
---
 include/linux/mm.h | 34 ++
 mm/memremap.c  | 20 +---
 mm/swap.c  | 10 +-
 3 files changed, 20 insertions(+), 44 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91dd0bc786a9ec..26baadcef4556b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct 
page *page)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
-static inline bool page_is_devmap_managed(struct page *page)
+bool __put_devmap_managed_page(struct page *page);
+static inline bool put_devmap_managed_page(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
if (!is_zone_device_page(page))
return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
+   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+   return false;
+   return __put_devmap_managed_page(page);
 }
 
-void put_devmap_managed_page(struct page *page);
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
 }
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page)
struct folio *folio = page_folio(page);
 
/*
-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
+* For some devmap managed pages we need to catch refcount transition
+* from 2 to 1:
 */
-   if (page_is_devmap_managed(&folio->page)) {
-   put_devmap_managed_page(&folio->page);
+   if (put_devmap_managed_page(&folio->page))
return;
-   }
-
folio_put(folio);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 55d23e9f5c04ec..f41233a67edb12 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page)
page->pgmap->ops->page_free(page);
 }
 
-void put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page)
 {
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
/*
 * devmap page refcounts are 1-based, rather than 0-based: if
 * refcount is 1, then the page is free and the refcount is
 * stable because nobody holds a reference on the page.
 */
-   if (count == 1)
+   switch (page_ref_dec_return(page)) {
+   case 1:
free_devmap_managed_page(page);
-   else if (!count)
+   break;
+   case 0:
__put_page(page);
+   break;
+   }
+   return true;
 }
-EXPORT_SYMBOL(put_devmap_managed_page);
+EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 08058f74cae23e..25b55c56614311 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
-   /*
-* ZONE_DEVICE pages that return 'false' from
-* page_is_devmap_managed() do not require special
-* processing, and instead, expect a call to
-* put_page_testzero().
-*/
-   if (page_is_devmap_managed(page)) {
-   put_devmap_managed_page(page);
+   if (put_devmap_managed_page(page))
continue;
-   }
if (put_page_testzero(page))
put_dev_pagemap(page->pgmap);
continue;
-- 
2.30.2




[PATCH 4/8] mm: move free_devmap_managed_page to memremap.c

2022-02-06 Thread Christoph Hellwig
free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Signed-off-by: Christoph Hellwig 
---
 include/linux/mm.h |  1 -
 mm/memremap.c  | 21 +
 mm/swap.c  | 23 ---
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b46174989b086..91dd0bc786a9ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page 
*page)
 }
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
 static inline bool page_is_devmap_managed(struct page *page)
diff --git a/mm/memremap.c b/mm/memremap.c
index 5f04a0709e436e..55d23e9f5c04ec 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
+
+void put_devmap_managed_page(struct page *page)
+{
+   int count;
+
+   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+   return;
+
+   count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56d5..08058f74cae23e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1153,26 +1153,3 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.30.2




[PATCH 3/8] mm: remove pointless includes from

2022-02-06 Thread Christoph Hellwig
hmm.h pulls in the world for no good reason at all.  Remove the
includes and push a few ones into the users instead.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
 include/linux/hmm.h  | 9 ++---
 lib/test_hmm.c   | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index ed5385137f4831..cb835f95a76e66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_object.h"
 #include "amdgpu_vm.h"
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 3828aafd3ac46f..e886a3b9e08c7d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2fd2e91d5107c0..d5a6f101f843e6 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -9,14 +9,9 @@
 #ifndef LINUX_HMM_H
 #define LINUX_HMM_H
 
-#include 
-#include 
+#include 
 
-#include 
-#include 
-#include 
-#include 
-#include 
+struct mmu_interval_notifier;
 
 /*
  * On output:
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 767538089a62e4..396beee6b061d4 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "test_hmm_uapi.h"
 
-- 
2.30.2




[PATCH 2/8] mm: remove the __KERNEL__ guard from

2022-02-06 Thread Christoph Hellwig
__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Signed-off-by: Christoph Hellwig 
---
 include/linux/mm.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b19223..7b46174989b086 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3,9 +3,6 @@
 #define _LINUX_MM_H
 
 #include 
-
-#ifdef __KERNEL__
-
 #include 
 #include 
 #include 
@@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long 
start,
 }
 #endif
 
-#endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
-- 
2.30.2




start sorting out the ZONE_DEVICE refcount mess

2022-02-06 Thread Christoph Hellwig
Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just device
private ones for now, but also the planned device coherent pages
and the ehanced p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

Diffstat:
 arch/arm64/mm/mmu.c  |1 
 arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |2 
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
 drivers/gpu/drm/drm_cache.c  |2 
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 -
 drivers/gpu/drm/nouveau/nouveau_svm.c|1 
 drivers/infiniband/core/rw.c |1 
 drivers/nvdimm/pmem.h|1 
 drivers/nvme/host/pci.c  |1 
 drivers/nvme/target/io-cmd-bdev.c|1 
 fs/Kconfig   |2 
 fs/fuse/virtio_fs.c  |1 
 include/linux/hmm.h  |9 
 include/linux/memremap.h |   22 +-
 include/linux/mm.h   |   59 -
 lib/test_hmm.c   |4 +
 mm/Kconfig   |4 -
 mm/internal.h|2 
 mm/memcontrol.c  |   11 +
 mm/memremap.c|   63 ---
 mm/migrate.c |6 --
 mm/swap.c|   49 ++--
 23 files changed, 90 insertions(+), 157 deletions(-)



[PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

2022-02-06 Thread Christoph Hellwig
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superflous extra check.

Signed-off-by: Christoph Hellwig 
---
 mm/memremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11fda..5f04a0709e436e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_FS_DAX:
-   if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-   IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
WARN(1, "File system DAX not supported\n");
return ERR_PTR(-EINVAL);
}
-- 
2.30.2




Re: [PATCH v4 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-02-02 Thread Christoph Hellwig
On Thu, Jan 27, 2022 at 02:32:58PM -0800, Andrew Morton wrote:
> On Wed, 26 Jan 2022 21:09:39 -0600 Alex Sierra  wrote:
> 
> > This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> > owned by a device that can be mapped into CPU page tables like
> > MEMORY_DEVICE_GENERIC and can also be migrated like
> > MEMORY_DEVICE_PRIVATE.
> 
> Some more reviewer input appears to be desirable here.
> 
> I was going to tentatively add it to -mm and -next, but problems. 
> 5.17-rc1's mm/migrate.c:migrate_vma_check_page() is rather different
> from the tree you patched.  Please redo, refresh and resend?

I really hate adding more types with the weird one off page refcount.
We need to clean that mess up first.


Re: [PATCH 1/7] drm/i915: Replace kmap() with kmap_local_page()

2021-12-13 Thread Christoph Hellwig
On Fri, Dec 10, 2021 at 03:23:58PM -0800, ira.we...@intel.com wrote:
> - vaddr = kmap(page);
> + vaddr = kmap_local_page(page);
>   memcpy(vaddr, data, len);
> - kunmap(page);
> + kunmap_local(vaddr);

memcpy_to_page?


Re: [PATCH v1 1/9] mm: add zone device coherent type memory support

2021-11-17 Thread Christoph Hellwig
On Mon, Nov 15, 2021 at 01:30:18PM -0600, Alex Sierra wrote:
> @@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   * target for charge migration. if @target is not NULL, the entry is 
> stored
>   * in target->ent.
> - *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
> MEMORY_DEVICE_PRIVATE
> - * (so ZONE_DEVICE page and thus not on the lru).
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
> MEMORY_DEVICE_COHERENT
> + * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the 
> lru).

Please avoid the overly long line.  But I don't think we we need to mention
the exact enum, but rather do something like:

 *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
 * thus not on the lru.

> + switch (pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
>   /*
>* TODO: Handle HMM pages which may need coordination
>* with device-side memory.

This might be a good opportunity for doing a s/HMM/device/ here.


Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()

2021-09-16 Thread Christoph Hellwig
On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
> Could you please provide more explicit explanation why inlining such an
> helper is considered as bad practice and messy ?

Because now we get architectures to all subly differ.  Look at the mess
for ioremap and the ioremap* variant.

The only good reason to allow for inlines if if they are used in a hot
path.  Which cc_platform_has is not, especially not on powerpc.


Re: [PATCH] Enable '-Werror' by default for all kernel builds

2021-09-09 Thread Christoph Hellwig
On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> It'd be good to avoid. It has helped uncover build issues with KASAN in
> the past. Or at least make it dependent on the problematic architecture.
> For example if arm is a problem, something like this:

I'm also seeing quite a few stack size warnings with KASAN on x86_64
without COMPILT_TEST using gcc 10.2.1 from Debian.  In fact there are a
few warnings without KASAN, but with KASAN there are a lot more.
I'll try to find some time to dig into them.

While we're at it, with -Werror something like this is really futile:

drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function 
???amdgpu_bo_support_uswc???:
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning
Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to 
write-combining [-Wcpp
  493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better 
performance \
  |  ^~~


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-02 Thread Christoph Hellwig
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then.  Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables.  What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
> 
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.

So doing some googling finds a patch from Dan that claims to hand EFI
special purpose memory to the device dax driver.  But when I try to
follow the version that got merged it looks it is treated simply as an
MMIO region to be claimed by drivers, which would not get a struct page.

Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part.  file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator.  Anonymouns memory also always comes from
> > the page allocator.
> 
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.

Ok, so you want to migrate to and from them.  Not use DEVICE_PUBLIC
for the actual page cache pages.  That maks a lot more sense.

> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
> the CPU to map the device memory coherently to minimize the need for
> migrations when CPU and GPU access the same memory concurrently or
> alternatingly. But we're not going as far as putting that memory
> entirely under the management of the Linux memory manager and VM
> subsystem. Our (and HPE's) system architects decided that this memory is
> not suitable to be used like regular NUMA system memory by the Linux
> memory manager.

So yes.  It is a Memory Mapped I/O region, which unlike the PCIe BARs
that people typically deal with is fully cache coherent.  I think this
does make more sense as a description.

But to go back to what start this discussion:  If these are memory
mapped I/O pfn_valid should generally not return true for them.

And as you already pointed out in reply to Alex we need to tighten the
selection criteria one way or another.


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Christoph Hellwig
On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >> driver code is not really involved in updating the CPU mappings. Maybe
> >> it's something we need to do in the migration helpers.
> > It looks like I'm totally misunderstanding what you are adding here
> > then.  Why do we need any special treatment at all for memory that
> > has normal struct pages and is part of the direct kernel map?
> 
> The pages are like normal memory for purposes of mapping them in CPU
> page tables and for coherent access from the CPU.

That's the user page tables.  What about the kernel direct map?
If there is a normal kernel struct page backing there really should
be no need for the pgmap.

> From an application
> perspective, we want file-backed and anonymous mappings to be able to
> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> optimize performance for GPU heavy workloads while minimizing the need
> to migrate data back-and-forth between system memory and device memory.

I don't really understand that part.  file backed pages are always
allocated by the file system using the pagecache helpers, that is
using the page allocator.  Anonymouns memory also always comes from
the page allocator.

> The pages are special in two ways:
> 
>  1. The memory is managed not by the Linux buddy allocator, but by the
> GPU driver's TTM memory manager

Why?

>  2. We want to migrate data in response to GPU page faults and
> application hints using the migrate_vma helpers

Why? 


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-08-30 Thread Christoph Hellwig
On Thu, Aug 26, 2021 at 06:27:31PM -0400, Felix Kuehling wrote:
> I think we're missing something here. As far as I can tell, all the work
> we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
> normal pages. Are we missing something in our driver code that would
> make these PTEs special? I don't understand how that can be, because
> driver code is not really involved in updating the CPU mappings. Maybe
> it's something we need to do in the migration helpers.

It looks like I'm totally misunderstanding what you are adding here
then.  Why do we need any special treatment at all for memory that
has normal struct pages and is part of the direct kernel map?


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-08-25 Thread Christoph Hellwig
On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>   } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>   goto next;
>   pfn = pte_pfn(pte);
>   if (is_zero_pfn(pfn)) {

.. also how is this going to work for the device public memory?  That
should be pte_special() an thus fail vm_normal_page.


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-08-25 Thread Christoph Hellwig
On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
> In this case, this is used to migrate pages from device memory, back to
> system memory. This particular device memory type should be accessible
> by the CPU, through IOMEM access. Typically, zone device public type
> memory falls into this category.
> 
> Signed-off-by: Alex Sierra 
> ---
>  include/linux/migrate.h | 1 +
>  mm/migrate.c| 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..6b16f417384f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  enum migrate_vma_direction {
>   MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>   MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> + MIGRATE_VMA_SELECT_IOMEM = 1 << 2,
>  };
>  
>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e3a10e2a1bb3..d4ae2da99607 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (is_write_device_private_entry(entry))
>   mpfn |= MIGRATE_PFN_WRITE;
>   } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))

This makes the MIGRATE_VMA_SELECT_SYSTEM and MIGRATE_VMA_SELECT_IOMEM
behave entirely identifical, that is redundant.  I think we need to
distinguish between the dfferent cases here.  I think the right check
would be pfn_valid(), which should be true for system memory, and
false for iomem.

Also shouldn't this be called DEVICE_PUBLIC instead of IOMEM?


Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-25 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v1 08/14] mm: add public type support to migrate_vma helpers

2021-08-25 Thread Christoph Hellwig
This should probably be folded into patch 4.


Re: [PATCH v1 01/14] ext4/xfs: add page refcount helper

2021-08-25 Thread Christoph Hellwig
On Tue, Aug 24, 2021 at 10:48:15PM -0500, Alex Sierra wrote:
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> Reviewed-by: Christoph Hellwig 
> ---
> v3:
> [AS]: rename dax_layout_is_idle_page func to dax_page_unused
> 
> v4:
> [AS]: This ref count functionality was missing on fuse/dax.c.
> ---

Not sure all tooling can cope with the two --- separators.  Personally
I find these per-patch changelogs pretty annoying anyway, but others
have different opinions.


Re: [PATCH v1 09/14] mm: call pgmap->ops->page_free for DEVICE_PUBLIC pages

2021-08-25 Thread Christoph Hellwig
On Tue, Aug 24, 2021 at 10:48:23PM -0500, Alex Sierra wrote:
> Add MEMORY_DEVICE_PUBLIC case to free_zone_device_page callback.
> Device public type memory case is now able to free its pages properly.

This really should go into patch 4.  And it might make sense to introduce
free_device_private_page directly with the free_device_page name instead
of renaming it a little later.


Re: [PATCH v2 03/12] x86/sev: Add an x86 version of prot_guest_has()

2021-08-24 Thread Christoph Hellwig
On Thu, Aug 19, 2021 at 01:33:09PM -0500, Tom Lendacky wrote:
> I did it as inline originally because the presence of the function will be
> decided based on the ARCH_HAS_PROTECTED_GUEST config. For now, that is
> only selected by the AMD memory encryption support, so if I went out of
> line I could put in mem_encrypt.c. But with TDX wanting to also use it, it
> would have to be in an always built file with some #ifdefs or in its own
> file that is conditionally built based on the ARCH_HAS_PROTECTED_GUEST
> setting (they've already tried building with ARCH_HAS_PROTECTED_GUEST=y
> and AMD_MEM_ENCRYPT not set).
> 
> To take it out of line, I'm leaning towards the latter, creating a new
> file that is built based on the ARCH_HAS_PROTECTED_GUEST setting.

Yes.  In general everytime architectures have to provide the prototype
and not just the implementation of something we end up with a giant mess
sooner or later.  In a few cases that is still warranted due to
performance concerns, but i don't think that is the case here.

> 
> > 
> >> +/* 0x800 - 0x8ff reserved for AMD */
> >> +#define PATTR_SME 0x800
> >> +#define PATTR_SEV 0x801
> >> +#define PATTR_SEV_ES  0x802
> > 
> > Why do we need reservations for a purely in-kernel namespace?
> > 
> > And why are you overoading a brand new generic API with weird details
> > of a specific implementation like this?
> 
> There was some talk about this on the mailing list where TDX and SEV may
> need to be differentiated, so we wanted to reserve a range of values per
> technology. I guess I can remove them until they are actually needed.

In that case add a flag for the differing behavior.  And only add them
when actually needed.  And either way there is absolutely no need to
reserve ranges.



Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

2021-08-20 Thread Christoph Hellwig
On Tue, Aug 17, 2021 at 11:44:54AM -0400, Felix Kuehling wrote:
> >> That's a good catch. Existing drivers shouldn't need a page_free
> >> callback if they didn't have one before. That means we need to add a
> >> NULL-pointer check in free_device_page.
> > Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> > ->mapping = NULL).
> >
> > In many ways this seems like you want to bring back the DEVICE_PUBLIC
> > pgmap type that was removed a while ago due to the lack of users
> > instead of overloading the generic type.
> 
> I think so. I'm not clear about how DEVICE_PUBLIC differed from what
> DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
> because it was unused and also known to be broken in some ways.
> DEVICE_GENERIC seemed close enough to what we need, other than not being
> supported in the migration helpers.
> 
> Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
> memory type from DEVICE_GENERIC? What would be the benefits of making
> that distinction?

The old DEVICE_PUBLIC mostly different in that it allowed the page
to be returned from vm_normal_page, which I think was horribly buggy.

But the point is not to bring back these old semantics.  The idea
is to be able to differeniate between your new coherent on-device
memory and the existing DEVICE_GENERIC.  That is call the
code in free_devmap_managed_page that is currently only used
for device private pages also for your new public device pages without
affecting the devdax and xen use cases.


Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-20 Thread Christoph Hellwig
On Thu, Aug 19, 2021 at 03:59:56PM -0400, Felix Kuehling wrote:
> I got lost trying to understand how DAX counts page references and how
> the PTE_SPECIAL option affects that. Theodore, can you help with this?
> Is there an easy way to test without CONFIG_ARCH_HAS_PTE_SPECIAL on x86,
> or do we need to test on a CPU architecture that doesn't support this
> feature?

I think the right answer is to simplify disallow ZONE_DEVICE pages
if ARCH_HAS_PTE_SPECIAL is not supported.  ARCH_HAS_PTE_SPECIAL is
supported by all modern architecture ports than can make use of
ZONE_DEVICE / dev_pagemap, so we can avoid this pocket of barely
testable code entirely:


diff --git a/mm/Kconfig b/mm/Kconfig
index 40a9bfcd5062e1..2823bbfd1c8c70 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -775,6 +775,7 @@ config ZONE_DMA32
 
 config ZONE_DEVICE
bool "Device memory (pmem, HMM, etc...) hotplug support"
+   depends on ARCH_HAS_PTE_SPECIAL
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP


Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-20 Thread Christoph Hellwig
On Wed, Aug 18, 2021 at 12:28:30PM -0700, Ralph Campbell wrote:
> Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> In that case, mmap() of a DAX device will call insert_page() which calls
> get_page() which would trigger VM_BUG_ON_PAGE().

__vm_insert_mixed still ends up calling insert_pfn for the
!CASE_ARCH_HAS_PTE_SPECIAL if pfn_t_devmap() is true, which it should
be for DAX.  (and as said in my other mail, I suspect we should disallow
that case anyway, as no one can test it in practice).


Re: [PATCH v2 04/12] powerpc/pseries/svm: Add a powerpc version of prot_guest_has()

2021-08-19 Thread Christoph Hellwig
On Fri, Aug 13, 2021 at 11:59:23AM -0500, Tom Lendacky wrote:
> +static inline bool prot_guest_has(unsigned int attr)

No reall need to have this inline.  In fact I'd suggest we havea the
prototype in a common header so that everyone must implement it out
of line.


Re: [PATCH v2 02/12] mm: Introduce a function to check for virtualization protection features

2021-08-19 Thread Christoph Hellwig
On Fri, Aug 13, 2021 at 11:59:21AM -0500, Tom Lendacky wrote:
> +#define PATTR_MEM_ENCRYPT0   /* Encrypted memory */
> +#define PATTR_HOST_MEM_ENCRYPT   1   /* Host encrypted 
> memory */
> +#define PATTR_GUEST_MEM_ENCRYPT  2   /* Guest encrypted 
> memory */
> +#define PATTR_GUEST_PROT_STATE   3   /* Guest encrypted 
> state */

Please write an actual detailed explanaton of what these mean, that
is what implications it has on the kernel.



Re: [PATCH v2 03/12] x86/sev: Add an x86 version of prot_guest_has()

2021-08-19 Thread Christoph Hellwig
On Fri, Aug 13, 2021 at 11:59:22AM -0500, Tom Lendacky wrote:
> While the name suggests this is intended mainly for guests, it will
> also be used for host memory encryption checks in place of sme_active().

Which suggest that the name is not good to start with.  Maybe protected
hardware, system or platform might be a better choice?

> +static inline bool prot_guest_has(unsigned int attr)
> +{
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> + if (sme_me_mask)
> + return amd_prot_guest_has(attr);
> +#endif
> +
> + return false;
> +}

Shouldn't this be entirely out of line?

> +/* 0x800 - 0x8ff reserved for AMD */
> +#define PATTR_SME0x800
> +#define PATTR_SEV0x801
> +#define PATTR_SEV_ES 0x802

Why do we need reservations for a purely in-kernel namespace?

And why are you overoading a brand new generic API with weird details
of a specific implementation like this?


Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

2021-08-17 Thread Christoph Hellwig
On Mon, Aug 16, 2021 at 03:00:49PM -0400, Felix Kuehling wrote:
> 
> Am 2021-08-15 um 11:40 a.m. schrieb Christoph Hellwig:
> > On Fri, Aug 13, 2021 at 01:31:45AM -0500, Alex Sierra wrote:
> >> Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback.
> >> Device generic type memory case is now able to free its pages properly.
> > How is this going to work for the two existing MEMORY_DEVICE_GENERIC
> > that now change behavior?  And which don't have a ->page_free callback
> > at all?
> 
> That's a good catch. Existing drivers shouldn't need a page_free
> callback if they didn't have one before. That means we need to add a
> NULL-pointer check in free_device_page.

Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
->mapping = NULL).

In many ways this seems like you want to bring back the DEVICE_PUBLIC
pgmap type that was removed a while ago due to the lack of users
instead of overloading the generic type.


  1   2   3   4   >