from:"Matthew Wilcox"

Re: [PATCH v3 06/11] mm/util: Deduplicate code in {kstrdup,kstrndup,kmemdup_nul}

2024-06-22 Thread Matthew Wilcox

On Sun, Jun 23, 2024 at 10:29:30AM +0800, Yafang Shao wrote:
> On Fri, Jun 21, 2024 at 9:57 PM Matthew Wilcox  wrote:
> >
> > On Fri, Jun 21, 2024 at 10:29:54AM +0800, Yafang Shao wrote:
> > > +++ b/mm/internal.h
> >
> > Why are you putting __kstrndup in a header file when it's only used
> > in util.c?
> 
> I want to make it always inlined. However, it is not recommended to
> define an inline function in a .c file, right ?

I'm not aware of any such recommendation.  Better than putting it in
a .h file that everybody has to look at but nobody uses.

Re: [PATCH v3 06/11] mm/util: Deduplicate code in {kstrdup,kstrndup,kmemdup_nul}

2024-06-21 Thread Matthew Wilcox

On Fri, Jun 21, 2024 at 10:29:54AM +0800, Yafang Shao wrote:
> +++ b/mm/internal.h

Why are you putting __kstrndup in a header file when it's only used
in util.c?

Also, I think this function is actually __kmemdup_nul(), not
__kstrndup().

Re: [PATCH v2] Fix duplicate C declaration warnings

2024-03-23 Thread Matthew Wilcox

On Sat, Mar 23, 2024 at 10:01:47PM +0530, Amogh Cheluvaraj wrote:
> Fix duplicate C declaration warnings at
> Documentation/gpu/drm-kms.rst that was found by
> compiling htmldocs

I'm sure this removes the warning, but it removes all kernel-doc
which exists in drivers/gpu/drm/drm_fourcc.c.  Isn't there a more
granular fix than this?

> /home/amogh/Linux_Kernel_Workspace/linux-next/Documentation/gpu/drm-
> kms:360: ./drivers/gpu/drm/drm_fourcc.c:344: WARNING: Duplicate C
> declaration, also defined at gpu/drm-kms:39.
> Declaration is '.. c:function:: const struct drm_format_info *
> drm_format_info (u32 format)'.
> /home/amogh/Linux_Kernel_Workspace/linux-next/Documentation/gpu/drm-
> kms:461: ./drivers/gpu/drm/drm_modeset_lock.c:392: WARNING: Duplicate C
> declaration, also defined at gpu/drm-kms:49.
> Declaration is '.. c:function:: int drm_modeset_lock (struct
> drm_modeset_lock *lock, struct drm_modeset_acquire_ctx *ctx)'.
> 
> Signed-off-by: Amogh Cheluvaraj 
> ---
> 
> changes in v2
> - add warnings found after compilation
> - fix grammar in commit description
> 
> ---
>  Documentation/gpu/drm-kms.rst | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-kms.rst b/Documentation/gpu/drm-kms.rst
> index 13d3627d8bc0..a4145f391e43 100644
> --- a/Documentation/gpu/drm-kms.rst
> +++ b/Documentation/gpu/drm-kms.rst
> @@ -357,9 +357,6 @@ Format Functions Reference
>  .. kernel-doc:: include/drm/drm_fourcc.h
> :internal:
>  
> -.. kernel-doc:: drivers/gpu/drm/drm_fourcc.c
> -   :export:
> -
>  .. _kms_dumb_buffer_objects:
>  
>  Dumb Buffer Objects
> @@ -458,9 +455,6 @@ KMS Locking
>  .. kernel-doc:: include/drm/drm_modeset_lock.h
> :internal:
>  
> -.. kernel-doc:: drivers/gpu/drm/drm_modeset_lock.c
> -   :export:
> -
>  KMS Properties
>  ==
>  
> -- 
> 2.44.0
> 
>

fb_defio and page->mapping

2024-01-23 Thread Matthew Wilcox

We're currently trying to remove page->mapping from the entire kernel.
This has me interested in fb_defio and since I made such a mess of it
with commits ccf953d8f3d6 / 0b78f8bcf495, I'd like to discuss what to
do before diving in.

Folios continue to have a mapping.  So we can effectively do
page_folio(page)->mapping (today that's calling compound_head() to get
to the head page; eventually it's a separate allocation).

But now I look at commit 56c134f7f1b5, I'm a little scared.
Apparently pages are being allocated from shmem and being mapped by
fb_deferred_io_fault()?  This line:

page->mapping = vmf->vma->vm_file->f_mapping;

initially appears harmless for shmem files (because that's presumably
a noop), but it's only a noop for head pages.  If shmem allocates a
compound page (ie a 2MB THP today), you'll overlay some information
stored in the second and third pages; looks like _entire_mapcount
and _deferred_list.prev (but we do shift those fields around without
regard to what the fb_defio driver is doing).  Even if you've disabled
THP today, setting page->mapping to NULL in fb_deferred_io_lastclose()
for a shmem page is a really bad idea.

I'd like to avoid fb_defio playing with page->mapping at all.
As I understand it, the only reason to set page->mapping is so that
page_mkclean() works.  But there are all kinds of assumptions in
page_mkclean() (now essentially folio_mkclean()) that we're dealing with
file-backed or anonymous memory.  I wonder if we might be better off
calling pfn_mkclean_range() for each VMA which maps these allocations?
You'd have to keep track of each VMA yourself (I think?)  but it would
avoid messing with page->mapping.

Anyway, I don't know enough about DRM.  There might be something
unutterably obvious we could do to fix this.

Re: [PATCH 2/2] xfs: disable large folio support in xfile_create

2024-01-11 Thread Matthew Wilcox

On Thu, Jan 11, 2024 at 02:00:53PM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2024 12:04:51 -0800 "Darrick J. Wong"  
> wrote:
> 
> > > > Fixing this will require a bit of an API change, and prefeably sorting 
> > > > out
> > > > the hwpoison story for pages vs folio and where it is placed in the 
> > > > shmem
> > > > API.  For now use this one liner to disable large folios.
> > > > 
> > > > Reported-by: Darrick J. Wong 
> > > > Signed-off-by: Christoph Hellwig 
> > > 
> > > Can someone who knows more about shmem.c than I do please review
> > > https://lore.kernel.org/linux-xfs/20240103084126.513354-4-...@lst.de/
> > > so that I can feel slightly more confident as hch and I sort through the
> > > xfile.c issues?
> > > 
> > > For this patch,
> > > Reviewed-by: Darrick J. Wong 
> > 
> > ...except that I'm still getting 2M THPs even with this enabled, so I
> > guess either we get to fix it now, or create our own private tmpfs mount
> > so that we can pass in huge=never, similar to what i915 does. :(
> 
> What is "this"?  Are you saying that $Subject doesn't work, or that the
> above-linked please-review patch doesn't work?

shmem pays no attention to the mapping_large_folio_support() flag,
so the proposed fix doesn't work.  It ought to, but it has its own way
of doing it that predates mapping_large_folio_support existing.

Re: disable large folios for shmem file used by xfs xfile

2024-01-10 Thread Matthew Wilcox

On Wed, Jan 10, 2024 at 05:28:22PM +0200, Joonas Lahtinen wrote:
> Quoting Joonas Lahtinen (2024-01-10 17:20:24)
> > However we specifically pass "huge=within_size" to vfs_kern_mount when
> > creating a private mount of tmpfs for the purpose of i915 created
> > allocations.
> > 
> > Older hardware also had some address hashing bugs where 2M aligned
> > memory caused a lot of collisions in TLB so we don't enable it always.
> > 
> > You can see drivers/gpu/drm/i915/gem/i915_gemfs.c function
> > i915_gemfs_init for details and references.
> > 
> > So in short, functionality wise we should be fine either default
> > for using 2M pages or not. If they become the default, we'd probably
> > want an option that would still be able to prevent them for performance
> > regression reasons on older hardware.
> 
> To maybe write out my concern better:
> 
> Is there plan to enable huge pages by default in shmem?

Not in the next kernel release, but eventually the plan is to allow
arbitrary order folios to be used in shmem.  So you could ask it to create
a 256kB folio for you, if that's the right size to manage memory in.

How shmem and its various users go about choosing the right size is not
quite clear to me yet.  Perhaps somebody else will do it before I get
to it; I have a lot of different sub-projects to work on at the moment,
and shmem isn't blocking any of them.  And I have a sneaking suspicion
that more work is needed in the swap code to deal with arbitrary order
folios, so that's another reason for me to delay looking at this ;-)

Re: disable large folios for shmem file used by xfs xfile

2024-01-10 Thread Matthew Wilcox

On Wed, Jan 10, 2024 at 10:21:07AM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> Darrick reported that the fairly new XFS xfile code blows up when force
> enabling large folio for shmem.  This series fixes this quickly by disabling
> large folios for this particular shmem file for now until it can be fixed
> properly, which will be a lot more invasive.
> 
> I've added most of you to the CC list as I suspect most other users of
> shmem_file_setup and friends will have similar issues.

The graphics users _want_ to use large folios.  I'm pretty sure they've
been tested with this.  It's just XFS that didn't know about this
feature of shmem.

Re: [PATCH v8 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v8)

2023-12-15 Thread Matthew Wilcox

On Fri, Dec 15, 2023 at 10:05:33PM -0800, Vivek Kasireddy wrote:
> +++ b/include/linux/memfd.h
> @@ -6,11 +6,16 @@
>  
>  #ifdef CONFIG_MEMFD_CREATE
>  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int 
> arg);
> +extern struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx);

You don't need the 'extern' for functions.

>  #else
>  static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int 
> a)
>  {
>   return -EINVAL;
>  }
> +static inline struct page *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> +{
> + return ERR_PTR(-EINVAL);
> +}
>  #endif

Different return types depending on the CONFIG selected ...

Re: [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5)

2023-12-13 Thread Matthew Wilcox

On Mon, Dec 11, 2023 at 11:38:02PM -0800, Vivek Kasireddy wrote:
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
>   if (pgoff >= ubuf->pagecount)
>   return VM_FAULT_SIGBUS;
>  
> - pfn = page_to_pfn(&ubuf->folios[pgoff]->page);
> + pfn = page_to_pfn(folio_page(ubuf->folios[pgoff], 0));

Why are you changing from &folio->page to folio_page(folio, 0) in
this patch?  Either that should have been done the other way in the
earlier patch, or it shouldn't be done at all.

My vote is that it shuoldn't be done at all.  These all seem like "I
have to convert back from folio to page because the APIs I want aren't
available", not "I want the first page from this folio".

Re: [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios

2023-12-13 Thread Matthew Wilcox

On Mon, Dec 11, 2023 at 11:38:01PM -0800, Vivek Kasireddy wrote:
> @@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
>   if (pgoff >= ubuf->pagecount)
>   return VM_FAULT_SIGBUS;
>  
> - pfn = page_to_pfn(ubuf->pages[pgoff]);
> + pfn = page_to_pfn(&ubuf->folios[pgoff]->page);

We have folio_pfn().

>  static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
>  {
>   struct udmabuf *ubuf = buf->priv;
> + struct page **pages;
>   void *vaddr;
> + pgoff_t pg;
>  
>   dma_resv_assert_held(buf->resv);
>  
> - vaddr = vm_map_ram(ubuf->pages, ubuf->pagecount, -1);
> + pages = kmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL);
> + if (!pages)
> + return -ENOMEM;
> +
> + for (pg = 0; pg < ubuf->pagecount; pg++)
> + pages[pg] = &ubuf->folios[pg]->page;
> +
> + vaddr = vm_map_ram(pages, ubuf->pagecount, -1);
> + kfree(pages);

We don't yet have a vm_map_ram() variant that takes an array of
folios.  We probably should; there was _something_ I was looking at
recently that would have liked it ...

> @@ -254,31 +262,70 @@ static int handle_shmem_pages(struct udmabuf *ubuf, 
> struct file *memfd,
> pgoff_t *pgbuf)
>  {
>   pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT;
> - struct page *page;
> + struct folio *folio = NULL;
>  
>   for (pgidx = 0; pgidx < pgcnt; pgidx++) {
> - page = shmem_read_mapping_page(memfd->f_mapping,
> -pgoff + pgidx);
> - if (IS_ERR(page))
> - return PTR_ERR(page);
> + folio = shmem_read_folio(memfd->f_mapping,
> +  pgoff + pgidx);

You could join these two lines.

[PATCH 1/2] fbdev: Remove support for Carillo Ranch driver

2023-12-08 Thread Matthew Wilcox (Oracle)

As far as anybody can tell, this product never shipped.  If it did,
it shipped in 2007 and nobody has access to one any more.  Remove the
fbdev driver and the backlight driver.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 drivers/video/backlight/Kconfig   |7 -
 drivers/video/backlight/Makefile  |1 -
 drivers/video/backlight/cr_bllcd.c|  264 -
 drivers/video/fbdev/Kconfig   |   15 -
 drivers/video/fbdev/Makefile  |1 -
 drivers/video/fbdev/vermilion/Makefile|6 -
 drivers/video/fbdev/vermilion/cr_pll.c|  195 
 drivers/video/fbdev/vermilion/vermilion.c | 1175 -
 drivers/video/fbdev/vermilion/vermilion.h |  245 -
 9 files changed, 1909 deletions(-)
 delete mode 100644 drivers/video/backlight/cr_bllcd.c
 delete mode 100644 drivers/video/fbdev/vermilion/Makefile
 delete mode 100644 drivers/video/fbdev/vermilion/cr_pll.c
 delete mode 100644 drivers/video/fbdev/vermilion/vermilion.c
 delete mode 100644 drivers/video/fbdev/vermilion/vermilion.h

diff --git a/drivers/video/backlight/Kconfig b/drivers/video/backlight/Kconfig
index 1144a54a35c0..ea2d0d69bd8c 100644
--- a/drivers/video/backlight/Kconfig
+++ b/drivers/video/backlight/Kconfig
@@ -235,13 +235,6 @@ config BACKLIGHT_HP700
  If you have an HP Jornada 700 series,
  say Y to include backlight control driver.
 
-config BACKLIGHT_CARILLO_RANCH
-   tristate "Intel Carillo Ranch Backlight Driver"
-   depends on LCD_CLASS_DEVICE && PCI && X86 && FB_LE80578
-   help
- If you have a Intel LE80578 (Carillo Ranch) say Y to enable the
- backlight driver.
-
 config BACKLIGHT_PWM
tristate "Generic PWM based Backlight Driver"
depends on PWM
diff --git a/drivers/video/backlight/Makefile b/drivers/video/backlight/Makefile
index 1af583de665b..06966cb20459 100644
--- a/drivers/video/backlight/Makefile
+++ b/drivers/video/backlight/Makefile
@@ -25,7 +25,6 @@ obj-$(CONFIG_BACKLIGHT_ADP8870)   += adp8870_bl.o
 obj-$(CONFIG_BACKLIGHT_APPLE)  += apple_bl.o
 obj-$(CONFIG_BACKLIGHT_AS3711) += as3711_bl.o
 obj-$(CONFIG_BACKLIGHT_BD6107) += bd6107.o
-obj-$(CONFIG_BACKLIGHT_CARILLO_RANCH)  += cr_bllcd.o
 obj-$(CONFIG_BACKLIGHT_CLASS_DEVICE)   += backlight.o
 obj-$(CONFIG_BACKLIGHT_DA903X) += da903x_bl.o
 obj-$(CONFIG_BACKLIGHT_DA9052) += da9052_bl.o
diff --git a/drivers/video/backlight/cr_bllcd.c 
b/drivers/video/backlight/cr_bllcd.c
deleted file mode 100644
index 781aeecc451d..
--- a/drivers/video/backlight/cr_bllcd.c
+++ /dev/null
@@ -1,264 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- * Copyright (c) Intel Corp. 2007.
- * All Rights Reserved.
- *
- * Intel funded Tungsten Graphics (http://www.tungstengraphics.com) to
- * develop this driver.
- *
- * This file is part of the Carillo Ranch video subsystem driver.
- *
- * Authors:
- *   Thomas Hellstrom 
- *   Alan Hourihane 
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/* The LVDS- and panel power controls sits on the
- * GPIO port of the ISA bridge.
- */
-
-#define CRVML_DEVICE_LPC0x27B8
-#define CRVML_REG_GPIOBAR   0x48
-#define CRVML_REG_GPIOEN0x4C
-#define CRVML_GPIOEN_BIT(1 << 4)
-#define CRVML_PANEL_PORT0x38
-#define CRVML_LVDS_ON   0x0001
-#define CRVML_PANEL_ON  0x0002
-#define CRVML_BACKLIGHT_OFF 0x0004
-
-/* The PLL Clock register sits on Host bridge */
-#define CRVML_DEVICE_MCH   0x5001
-#define CRVML_REG_MCHBAR   0x44
-#define CRVML_REG_MCHEN0x54
-#define CRVML_MCHEN_BIT(1 << 28)
-#define CRVML_MCHMAP_SIZE  4096
-#define CRVML_REG_CLOCK0xc3c
-#define CRVML_CLOCK_SHIFT  8
-#define CRVML_CLOCK_MASK   0x0f00
-
-static struct pci_dev *lpc_dev;
-static u32 gpio_bar;
-
-struct cr_panel {
-   struct backlight_device *cr_backlight_device;
-   struct lcd_device *cr_lcd_device;
-};
-
-static int cr_backlight_set_intensity(struct backlight_device *bd)
-{
-   u32 addr = gpio_bar + CRVML_PANEL_PORT;
-   u32 cur = inl(addr);
-
-   if (backlight_get_brightness(bd) == 0) {
-   /* OFF */
-   cur |= CRVML_BACKLIGHT_OFF;
-   outl(cur, addr);
-   } else {
-   /* FULL ON */
-   cur &= ~CRVML_BACKLIGHT_OFF;
-   outl(cur, addr);
-   }
-
-   return 0;
-}
-
-static int cr_backlight_get_intensity(struct backlight_device *bd)
-{
-   u32 addr = gpio_bar + CRVML_PANEL_PORT;
-   u32 cur = inl(addr);
-   u8 intensity;
-
-   if (cur & CRVML_BACKLIGHT_OFF)
-   intensity = 0;
-   else
-   intensity = 1;
-
-   return intensity;
-}
-
-static const struct backlight_ops cr_backlight_ops = {
-   .get_brightness = cr_ba

[PATCH 2/2] mtd: Remove support for Carillo Ranch driver

2023-12-08 Thread Matthew Wilcox (Oracle)

As far as anybody can tell, this product never shipped.  If it did,
it shipped in 2007 and nobody has access to one any more.  Remove the
mtd NOR driver.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 drivers/mtd/maps/Kconfig|   7 -
 drivers/mtd/maps/Makefile   |   1 -
 drivers/mtd/maps/intel_vr_nor.c | 265 
 3 files changed, 273 deletions(-)
 delete mode 100644 drivers/mtd/maps/intel_vr_nor.c

diff --git a/drivers/mtd/maps/Kconfig b/drivers/mtd/maps/Kconfig
index e098ae937ce8..8a8b19874e23 100644
--- a/drivers/mtd/maps/Kconfig
+++ b/drivers/mtd/maps/Kconfig
@@ -341,13 +341,6 @@ config MTD_UCLINUX
help
  Map driver to support image based filesystems for uClinux.
 
-config MTD_INTEL_VR_NOR
-   tristate "NOR flash on Intel Vermilion Range Expansion Bus CS0"
-   depends on PCI
-   help
- Map driver for a NOR flash bank located on the Expansion Bus of the
- Intel Vermilion Range chipset.
-
 config MTD_PLATRAM
tristate "Map driver for platform device RAM (mtd-ram)"
select MTD_RAM
diff --git a/drivers/mtd/maps/Makefile b/drivers/mtd/maps/Makefile
index 094cfb244086..a9083c888e3b 100644
--- a/drivers/mtd/maps/Makefile
+++ b/drivers/mtd/maps/Makefile
@@ -40,6 +40,5 @@ obj-$(CONFIG_MTD_UCLINUX) += uclinux.o
 obj-$(CONFIG_MTD_NETtel)   += nettel.o
 obj-$(CONFIG_MTD_SCB2_FLASH)   += scb2_flash.o
 obj-$(CONFIG_MTD_PLATRAM)  += plat-ram.o
-obj-$(CONFIG_MTD_INTEL_VR_NOR) += intel_vr_nor.o
 obj-$(CONFIG_MTD_VMU)  += vmu-flash.o
 obj-$(CONFIG_MTD_LANTIQ)   += lantiq-flash.o
diff --git a/drivers/mtd/maps/intel_vr_nor.c b/drivers/mtd/maps/intel_vr_nor.c
deleted file mode 100644
index d67b845b0e89..
--- a/drivers/mtd/maps/intel_vr_nor.c
+++ /dev/null
@@ -1,265 +0,0 @@
-/*
- * drivers/mtd/maps/intel_vr_nor.c
- *
- * An MTD map driver for a NOR flash bank on the Expansion Bus of the Intel
- * Vermilion Range chipset.
- *
- * The Vermilion Range Expansion Bus supports four chip selects, each of which
- * has 64MiB of address space.  The 2nd BAR of the Expansion Bus PCI Device
- * is a 256MiB memory region containing the address spaces for all four of the
- * chip selects, with start addresses hardcoded on 64MiB boundaries.
- *
- * This map driver only supports NOR flash on chip select 0.  The buswidth
- * (either 8 bits or 16 bits) is determined by reading the Expansion Bus Timing
- * and Control Register for Chip Select 0 (EXP_TIMING_CS0).  This driver does
- * not modify the value in the EXP_TIMING_CS0 register except to enable writing
- * and disable boot acceleration.  The timing parameters in the register are
- * assumed to have been properly initialized by the BIOS.  The reset default
- * timing parameters are maximally conservative (slow), so access to the flash
- * will be slower than it should be if the BIOS has not initialized the timing
- * parameters.
- *
- * Author: Andy Lowe 
- *
- * 2006 (c) MontaVista Software, Inc. This file is licensed under
- * the terms of the GNU General Public License version 2. This program
- * is licensed "as is" without any warranty of any kind, whether express
- * or implied.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#define DRV_NAME "vr_nor"
-
-struct vr_nor_mtd {
-   void __iomem *csr_base;
-   struct map_info map;
-   struct mtd_info *info;
-   struct pci_dev *dev;
-};
-
-/* Expansion Bus Configuration and Status Registers are in BAR 0 */
-#define EXP_CSR_MBAR 0
-/* Expansion Bus Memory Window is BAR 1 */
-#define EXP_WIN_MBAR 1
-/* Maximum address space for Chip Select 0 is 64MiB */
-#define CS0_SIZE 0x0400
-/* Chip Select 0 is at offset 0 in the Memory Window */
-#define CS0_START 0x0
-/* Chip Select 0 Timing Register is at offset 0 in CSR */
-#define EXP_TIMING_CS0 0x00
-#define TIMING_CS_EN   (1 << 31)   /* Chip Select Enable */
-#define TIMING_BOOT_ACCEL_DIS  (1 <<  8)   /* Boot Acceleration Disable */
-#define TIMING_WR_EN   (1 <<  1)   /* Write Enable */
-#define TIMING_BYTE_EN (1 <<  0)   /* 8-bit vs 16-bit bus */
-#define TIMING_MASK0x3FFF
-
-static void vr_nor_destroy_partitions(struct vr_nor_mtd *p)
-{
-   mtd_device_unregister(p->info);
-}
-
-static int vr_nor_init_partitions(struct vr_nor_mtd *p)
-{
-   /* register the flash bank */
-   /* partition the flash bank */
-   return mtd_device_register(p->info, NULL, 0);
-}
-
-static void vr_nor_destroy_mtd_setup(struct vr_nor_mtd *p)
-{
-   map_destroy(p->info);
-}
-
-static int vr_nor_mtd_setup(struct vr_nor_mtd *p)
-{
-   static const char * const probe_types[] =
-   { "cfi_probe", "jedec_probe", NULL };
-   const char * const *type;
-
-   for (type = probe_types; !p->info && *type; type++)
-

[PATCH] drm: Do not overrun array in drm_gem_get_pages()

2023-10-05 Thread Matthew Wilcox (Oracle)

If the shared memory object is larger than the DRM object that it backs,
we can overrun the page array.  Limit the number of pages we install
from each folio to prevent this.

Signed-off-by: Matthew Wilcox (Oracle) 
Reported-by: Oleksandr Natalenko 
Tested-by: Oleksandr Natalenko 
Link: https://lore.kernel.org/lkml/13360591.ulzwgnk...@natalenko.name/
Fixes: 3291e09a4638 ("drm: convert drm_gem_put_pages() to use a folio_batch")
Cc: sta...@vger.kernel.org # 6.5.x
---
 drivers/gpu/drm/drm_gem.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 6129b89bb366..44a948b80ee1 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -540,7 +540,7 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj)
struct page **pages;
struct folio *folio;
struct folio_batch fbatch;
-   int i, j, npages;
+   long i, j, npages;
 
if (WARN_ON(!obj->filp))
return ERR_PTR(-EINVAL);
@@ -564,11 +564,13 @@ struct page **drm_gem_get_pages(struct drm_gem_object 
*obj)
 
i = 0;
while (i < npages) {
+   long nr;
folio = shmem_read_folio_gfp(mapping, i,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
goto fail;
-   for (j = 0; j < folio_nr_pages(folio); j++, i++)
+   nr = min(npages - i, folio_nr_pages(folio));
+   for (j = 0; j < nr; j++, i++)
pages[i] = folio_file_page(folio, i);
 
/* Make sure shmem keeps __GFP_DMA32 allocated pages in the
-- 
2.40.1

Re: [REGRESSION] BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250

2023-10-05 Thread Matthew Wilcox

On Thu, Oct 05, 2023 at 02:30:55PM +0200, Oleksandr Natalenko wrote:
> No-no, sorry for possible confusion. Let me explain again:
> 
> 1. we had an issue with i915, which was introduced by 0b62af28f249, and later 
> was fixed by 863a8eb3f270
> 2. now I've discovered another issue, which looks very similar to 1., but in 
> a VM with Cirrus VGA, and it happens even while having 863a8eb3f270 applied
> 3. I've tried reverting 3291e09a4638, after which I cannot reproduce the 
> issue with Cirrus VGA, but clearly there was no fix for it discussed
> 
> IOW, 863a8eb3f270 is the fix for 0b62af28f249, but not for 3291e09a4638. It 
> looks like 3291e09a4638 requires a separate fix.

Thank you!  Sorry about the misunderstanding.  Try this:

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 6129b89bb366..44a948b80ee1 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -540,7 +540,7 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj)
struct page **pages;
struct folio *folio;
struct folio_batch fbatch;
-   int i, j, npages;
+   long i, j, npages;
 
if (WARN_ON(!obj->filp))
return ERR_PTR(-EINVAL);
@@ -564,11 +564,13 @@ struct page **drm_gem_get_pages(struct drm_gem_object 
*obj)
 
i = 0;
while (i < npages) {
+   long nr;
folio = shmem_read_folio_gfp(mapping, i,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
goto fail;
-   for (j = 0; j < folio_nr_pages(folio); j++, i++)
+   nr = min(npages - i, folio_nr_pages(folio));
+   for (j = 0; j < nr; j++, i++)
pages[i] = folio_file_page(folio, i);
 
/* Make sure shmem keeps __GFP_DMA32 allocated pages in the

Re: [REGRESSION] BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250

2023-10-05 Thread Matthew Wilcox

On Thu, Oct 05, 2023 at 09:56:03AM +0200, Oleksandr Natalenko wrote:
> Hello.
> 
> On čtvrtek 5. října 2023 9:44:42 CEST Thomas Zimmermann wrote:
> > Hi
> > 
> > Am 02.10.23 um 17:38 schrieb Oleksandr Natalenko:
> > > On pondělí 2. října 2023 16:32:45 CEST Matthew Wilcox wrote:
> > >> On Mon, Oct 02, 2023 at 01:02:52PM +0200, Oleksandr Natalenko wrote:
> > >>>>>>> BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250
> > >>>>>>>
> > >>>>>>> Corrupted memory at 0xe173a294 [ ! ! ! ! ! ! ! ! ! ! ! ! ! 
> > >>>>>>> ! ! ! ] (in kfence-#108):
> > >>>>>>>   drm_gem_put_pages+0x186/0x250
> > >>>>>>>   drm_gem_shmem_put_pages_locked+0x43/0xc0
> > >>>>>>>   drm_gem_shmem_object_vunmap+0x83/0xe0
> > >>>>>>>   drm_gem_vunmap_unlocked+0x46/0xb0
> > >>>>>>>   drm_fbdev_generic_helper_fb_dirty+0x1dc/0x310
> > >>>>>>>   drm_fb_helper_damage_work+0x96/0x170
> > >>>
> > >>> Matthew, before I start dancing around, do you think ^^ could have the 
> > >>> same cause as 0b62af28f249b9c4036a05acfb053058dc02e2e2 which got fixed 
> > >>> by 863a8eb3f27098b42772f668e3977ff4cae10b04?
> > >>
> > >> Yes, entirely plausible.  I think you have two useful points to look at
> > >> before delving into a full bisect -- 863a8e and the parent of 0b62af.
> > >> If either of them work, I think you have no more work to do.
> > > 
> > > OK, I've did this against v6.5.5:
> > > 
> > > ```
> > > git log --oneline HEAD~3..
> > > 7c1e7695ca9b8 (HEAD -> test) Revert "mm: remove struct pagevec"
> > > 8f2ad53b6eac6 Revert "mm: remove check_move_unevictable_pages()"
> > > fa1e3c0b5453c Revert "drm: convert drm_gem_put_pages() to use a 
> > > folio_batch"
> > > ```
> > > 
> > > then rebooted the host multiple times, and the issue is not seen any more.
> > > 
> > > So I guess 3291e09a463870610b8227f32b16b19a587edf33 is the culprit.
> > 
> > Ignore my other email. It's apparently been fixed already. Thanks!
> 
> Has it? I think I was able to identify offending commit, but I'm not aware of 
> any fix to that.

I don't understand; you said reverting those DRM commits fixed the
problem, so 863a8eb3f270 is the solution.  No?

Re: [REGRESSION] BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250

2023-10-02 Thread Matthew Wilcox

On Mon, Oct 02, 2023 at 01:02:52PM +0200, Oleksandr Natalenko wrote:
> > > > > BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250
> > > > > 
> > > > > Corrupted memory at 0xe173a294 [ ! ! ! ! ! ! ! ! ! ! ! ! ! ! 
> > > > > ! ! ] (in kfence-#108):
> > > > >  drm_gem_put_pages+0x186/0x250
> > > > >  drm_gem_shmem_put_pages_locked+0x43/0xc0
> > > > >  drm_gem_shmem_object_vunmap+0x83/0xe0
> > > > >  drm_gem_vunmap_unlocked+0x46/0xb0
> > > > >  drm_fbdev_generic_helper_fb_dirty+0x1dc/0x310
> > > > >  drm_fb_helper_damage_work+0x96/0x170
> 
> Matthew, before I start dancing around, do you think ^^ could have the same 
> cause as 0b62af28f249b9c4036a05acfb053058dc02e2e2 which got fixed by 
> 863a8eb3f27098b42772f668e3977ff4cae10b04?

Yes, entirely plausible.  I think you have two useful points to look at
before delving into a full bisect -- 863a8e and the parent of 0b62af.
If either of them work, I think you have no more work to do.

[PATCH] i915: Limit the length of an sg list to the requested length

2023-09-19 Thread Matthew Wilcox (Oracle)

The folio conversion changed the behaviour of shmem_sg_alloc_table() to
put the entire length of the last folio into the sg list, even if the sg
list should have been shorter.  gen8_ggtt_insert_entries() relied on the
list being the right langth and would overrun the end of the page tables.
Other functions may also have been affected.

Clamp the length of the last entry in the sg list to be the expected
length.

Signed-off-by: Matthew Wilcox (Oracle) 
Fixes: 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch")
Cc: sta...@vger.kernel.org # 6.5.x
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9256
Link: https://lore.kernel.org/lkml/6287208.lov4wx5...@natalenko.name/
Reported-by: Oleksandr Natalenko 
Tested-by: Oleksandr Natalenko 
---
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 8f1633c3fb93..73a4a4eb29e0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
st->nents = 0;
for (i = 0; i < page_count; i++) {
struct folio *folio;
+   unsigned long nr_pages;
const unsigned int shrink[] = {
I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
0,
@@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
}
} while (1);
 
+   nr_pages = min_t(unsigned long,
+   folio_nr_pages(folio), page_count - i);
if (!i ||
sg->length >= max_segment ||
folio_pfn(folio) != next_pfn) {
@@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
sg = sg_next(sg);
 
st->nents++;
-   sg_set_folio(sg, folio, folio_size(folio), 0);
+   sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0);
} else {
/* XXX: could overflow? */
-   sg->length += folio_size(folio);
+   sg->length += nr_pages * PAGE_SIZE;
}
-   next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
-   i += folio_nr_pages(folio) - 1;
+   next_pfn = folio_pfn(folio) + nr_pages;
+   i += nr_pages - 1;
 
/* Check that the i965g/gm workaround works. */
GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x0010UL);
-- 
2.40.1

Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

2023-09-19 Thread Matthew Wilcox

On Tue, Sep 19, 2023 at 08:11:47PM +0200, Oleksandr Natalenko wrote:
> I can confirm this one fixes the issue for me on T460s laptop. Thank you!

Yay!

> Should you submit it, please add:
> 
> Fixes: 0b62af28f2 ("i915: convert shmem_sg_free_table() to use a folio_batch")

Thanks for collecting all these; you're making my life really easy.
One minor point is that the standard format for Fixes: is 12 characters,
not 10.  eg,

Documentation/process/5.Posting.rst:Fixes: 1f2e3d4c5b6a ("The first line of 
the commit specified by the first 12 characters of its SHA-1 ID")

I have this in my .gitconfig:

[pretty]
fixes = Fixes: %h (\"%s\")

and in .git/config,

[core]
abbrev = 12

I'm working on the commit message now.

Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

2023-09-19 Thread Matthew Wilcox

On Tue, Sep 19, 2023 at 04:43:41PM +0100, Matthew Wilcox wrote:
> Could I ask you to try this patch?  I'll follow up with another patch
> later because I think I made another assumption that may not be valid.

Ah, no, never mind.  I thought we could start in the middle of a folio,
but we always start constructing the sg list from index 0 of the file,
so we always start at the first page of a folio.  If this patch solves
your problem, then I think we're done.

Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

2023-09-19 Thread Matthew Wilcox

On Tue, Sep 19, 2023 at 10:26:42AM +0200, Oleksandr Natalenko wrote:
> Andrzej asked me to try to revert commits 0b62af28f249, e0b72c14d8dc and 
> 1e0877d58b1e, and reverting those fixed the i915 crash for me. The 
> e0b72c14d8dc and 1e0877d58b1e commits look like just prerequisites, so I 
> assume 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a 
> folio_batch") is the culprit here.
> 
> Could you please check this?
> 
> Our conversation with Andrzej is available at drm-intel GitLab [1].
> 
> Thanks.
> 
> [1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256

Wow, that is some great debugging.  Thanks for all the time & effort
you and others have invested.  Sorry for breaking your system.

You're almost right about the "prerequisites", but it's in the other
direction; 0b62af28f249 is a prerequisite for the later two cleanups,
so reverting all three is necessary to test 0b62af28f249.

It seems to me that you've isolated the problem to constructing overly
long sg lists.  I didn't realise that was going to be a problem, so
that's my fault.

Could I ask you to try this patch?  I'll follow up with another patch
later because I think I made another assumption that may not be valid.

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 8f1633c3fb93..73a4a4eb29e0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
st->nents = 0;
for (i = 0; i < page_count; i++) {
struct folio *folio;
+   unsigned long nr_pages;
const unsigned int shrink[] = {
I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
0,
@@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
}
} while (1);
 
+   nr_pages = min_t(unsigned long,
+   folio_nr_pages(folio), page_count - i);
if (!i ||
sg->length >= max_segment ||
folio_pfn(folio) != next_pfn) {
@@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
sg = sg_next(sg);
 
st->nents++;
-   sg_set_folio(sg, folio, folio_size(folio), 0);
+   sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0);
} else {
/* XXX: could overflow? */
-   sg->length += folio_size(folio);
+   sg->length += nr_pages * PAGE_SIZE;
}
-   next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
-   i += folio_nr_pages(folio) - 1;
+   next_pfn = folio_pfn(folio) + nr_pages;
+   i += nr_pages - 1;
 
/* Check that the i965g/gm workaround works. */
GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x0010UL);

Re: [PATCH v6 1/4] drm: Use XArray instead of IDR for minors

2023-08-29 Thread Matthew Wilcox

On Tue, Aug 29, 2023 at 02:35:34PM -0400, James Zhu wrote:
> 
> On 2023-08-29 14:33, Matthew Wilcox wrote:
> > On Tue, Aug 29, 2023 at 01:34:22PM -0400, James Zhu wrote:
> > > > > > @@ -1067,7 +1055,7 @@ static void drm_core_exit(void)
> > > > > > unregister_chrdev(DRM_MAJOR, "drm");
> > > > > > debugfs_remove(drm_debugfs_root);
> > > > > > drm_sysfs_destroy();
> > > > > > -   idr_destroy(&drm_minors_idr);
> > > > > [JZ] Should we call xa_destroy instead here?
> > > > We could, if we expect the xarray to potentially not be empty.
> > > >   From what I understand - all minors should be released at this point.
> > > [JZ] In practice,  adding destroy here will be better.
> > Why do you say that?
> [JZ] In case, the future, INIT adds something, need DESTROY to free
> completely.

That isn't going to happen.

Re: [PATCH v6 1/4] drm: Use XArray instead of IDR for minors

2023-08-29 Thread Matthew Wilcox

On Tue, Aug 29, 2023 at 01:34:22PM -0400, James Zhu wrote:
> > > > @@ -1067,7 +1055,7 @@ static void drm_core_exit(void)
> > > > unregister_chrdev(DRM_MAJOR, "drm");
> > > > debugfs_remove(drm_debugfs_root);
> > > > drm_sysfs_destroy();
> > > > -   idr_destroy(&drm_minors_idr);
> > > [JZ] Should we call xa_destroy instead here?
> > We could, if we expect the xarray to potentially not be empty.
> >  From what I understand - all minors should be released at this point.
> [JZ] In practice,  adding destroy here will be better.

Why do you say that?

Re: [PATCH v2] fs: clean up usage of noop_dirty_folio

2023-08-28 Thread Matthew Wilcox

On Mon, Aug 28, 2023 at 03:54:49PM +0800, Xueshi Hu wrote:
> In folio_mark_dirty(), it can automatically fallback to
> noop_dirty_folio() if a_ops->dirty_folio is not registered.
> 
> As anon_aops, dev_dax_aops and fb_deferred_io_aops becames empty, remove
> them too.
> 
> Signed-off-by: Xueshi Hu 

Reviewed-by: Matthew Wilcox (Oracle)

Re: [PATCH] fs: clean up usage of noop_dirty_folio

2023-08-21 Thread Matthew Wilcox

On Mon, Aug 21, 2023 at 01:16:43PM +0200, Jan Kara wrote:
> On Sat 19-08-23 20:42:25, Xueshi Hu wrote:
> > In folio_mark_dirty(), it will automatically fallback to
> > noop_dirty_folio() if a_ops->dirty_folio is not registered.
> > 
> > As anon_aops, dev_dax_aops and fb_deferred_io_aops becames empty, remove
> > them too.
> > 
> > Signed-off-by: Xueshi Hu 
> 
> Yeah, looks sensible to me but for some callbacks we are oscilating between
> all users having to provide some callback and providing some default
> behavior for NULL callback. I don't have a strong opinion either way so
> feel free to add:
> 
> Reviewed-by: Jan Kara 
> 
> But I guess let's see what Matthew thinks about this and what plans he has
> so that we don't switch back again in the near future. Matthew?

I was hoping Christoph would weigh in ;-)  I don't have a strong
feeling here, but it seems to me that a NULL ->dirty_folio() should mean
"do the noop thing" rather than "do the buffer_head thing" or "do the
filemap thing".  In 0af573780b0b, the buffer_head default was removed.
I think enough time has passed that we're OK to change what a NULL
->dirty_folio means (plus we also changed the name of ->set_page_dirty()
to ->dirty_folio())

So Ack to the concept.  One minor change I'd request:

-bool noop_dirty_folio(struct address_space *mapping, struct folio *folio)
+static bool noop_dirty_folio(struct address_space *mapping, struct folio 
*folio)
 {
if (!folio_test_dirty(folio))
return !folio_test_set_dirty(folio);
return false;
 }
-EXPORT_SYMBOL(noop_dirty_folio);

Please inline this into folio_mark_dirty() instead of calling it.

Re: [RESEND PATCH v10 25/25] dept: Track the potential waits of PG_{locked,writeback}

2023-08-20 Thread Matthew Wilcox

On Mon, Aug 21, 2023 at 12:46:37PM +0900, Byungchul Park wrote:
> @@ -377,44 +421,88 @@ static __always_inline int Page##uname(struct page 
> *page)   \
>  #define SETPAGEFLAG(uname, lname, policy)\
>  static __always_inline   
> \
>  void folio_set_##lname(struct folio *folio)  \
> -{ set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \
> +{\
> + set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy));\
> + dept_page_set_bit(&folio->page, PG_##lname);\

The PG_locked and PG_writeback bits only actually exist in the folio;
the ones in struct page are just legacy and never actually used.
Perhaps we could make the APIs more folio-based and less page-based?

>  static __always_inline void SetPage##uname(struct page *page)
> \
> -{ set_bit(PG_##lname, &policy(page, 1)->flags); }
> +{\
> + set_bit(PG_##lname, &policy(page, 1)->flags);   \
> + dept_page_set_bit(page, PG_##lname);\
> +}

I don't think we ever call this for PG_writeback or PG_locked.  If
I'm wrong, we can probably fix that ;-)

>  static __always_inline void __SetPage##uname(struct page *page)  
> \
> -{ __set_bit(PG_##lname, &policy(page, 1)->flags); }
> +{\
> + __set_bit(PG_##lname, &policy(page, 1)->flags); \
> + dept_page_set_bit(page, PG_##lname);\
> +}

Umm.  We do call __SetPageLocked() though ... I'll fix those up to
be __set_folio_locked().

Re: [RESEND PATCH v10 08/25] dept: Apply sdt_might_sleep_{start,end}() to PG_{locked,writeback} wait

2023-08-20 Thread Matthew Wilcox

On Mon, Aug 21, 2023 at 12:46:20PM +0900, Byungchul Park wrote:
> @@ -1219,6 +1220,9 @@ static inline bool folio_trylock_flag(struct folio 
> *folio, int bit_nr,
>  /* How many times do we accept lock stealing from under a waiter? */
>  int sysctl_page_lock_unfairness = 5;
>  
> +static struct dept_map __maybe_unused PG_locked_map = 
> DEPT_MAP_INITIALIZER(PG_locked_map, NULL);
> +static struct dept_map __maybe_unused PG_writeback_map = 
> DEPT_MAP_INITIALIZER(PG_writeback_map, NULL);

Hmm, why are these "maybe unused"?  *digs*.  Ah.  Because
sdt_might_sleep_start() becomes a no-op macro if DEPT is disabled.

OK, the right way to handle this is

#ifdef CONFIG_DEPT
#define DEPT_MAP(name)  static struct dept_map name = \
DEPT_MAP_INITIALIZER(name, NULL)
#else
#define DEPT_MAP(name)  /* */
#endif

And now DEPT takes up no space if disabled.

/* */; is a somewhat unusual thing to see, but since this must work at
top level, we can't use "do { } while (0)" like we usually do.  Given
where else this is likely to be used, i don't think it's going to be
a problem ...

Re: [PATCH 03/13] scatterlist: Add sg_set_folio()

2023-07-30 Thread Matthew Wilcox

On Sun, Jul 30, 2023 at 09:57:06PM +0800, Zhu Yanjun wrote:
> 
> 在 2023/7/30 19:18, Matthew Wilcox 写道:
> > On Sun, Jul 30, 2023 at 07:01:26PM +0800, Zhu Yanjun wrote:
> > > Does the following function have folio version?
> > > 
> > > "
> > > int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
> > >   struct page **pages, unsigned int n_pages, unsigned int offset,
> > >   unsigned long size, unsigned int max_segment,
> > >   unsigned int left_pages, gfp_t gfp_mask)
> > > "
> > No -- I haven't needed to convert anything that uses
> > sg_alloc_append_table_from_pages() yet.  It doesn't look like it should
> > be _too_ hard to add a folio version.
> 
> In many places, this function is used. So this function needs the folio
> version.

It's not used in very many places.  But the first one that I see it used
(drivers/infiniband/core/umem.c), you can't do a straightforward folio
conversion:

pinned = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
  gup_flags, page_list);
...
ret = sg_alloc_append_table_from_pages(
&umem->sgt_append, page_list, pinned, 0,
pinned << PAGE_SHIFT, ib_dma_max_seg_size(device),
npages, GFP_KERNEL);

That can't be converted to folios.  The GUP might start in the middle of
the folio, and we have no way to communicate that.

This particular usage really needs the phyr work that Jason is doing so
we can efficiently communicate physically contiguous ranges from GUP
to sg.

> Another problem, after folio is used, I want to know the performance after
> folio is implemented.
> 
> How to make tests to get the performance?

You know what you're working on ... I wouldn't know how best to test
your code.

Re: [PATCH 03/13] scatterlist: Add sg_set_folio()

2023-07-30 Thread Matthew Wilcox

On Sun, Jul 30, 2023 at 07:01:26PM +0800, Zhu Yanjun wrote:
> Does the following function have folio version?
> 
> "
> int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
>   struct page **pages, unsigned int n_pages, unsigned int offset,
>   unsigned long size, unsigned int max_segment,
>   unsigned int left_pages, gfp_t gfp_mask)
> "

No -- I haven't needed to convert anything that uses
sg_alloc_append_table_from_pages() yet.  It doesn't look like it should
be _too_ hard to add a folio version.

[PATCH 09/13] net: Convert sunrpc from pagevec to folio_batch

2023-06-21 Thread Matthew Wilcox (Oracle)

Remove the last usage of pagevecs.  There is a slight change here; we
now free the folio_batch as soon as it fills up instead of freeing the
folio_batch when we try to add a page to a full batch.  This should have
no effect in practice.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/sunrpc/svc.h |  2 +-
 net/sunrpc/svc.c   | 10 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index c2807e301790..f8751118c122 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -222,7 +222,7 @@ struct svc_rqst {
struct page *   *rq_next_page; /* next reply page to use */
struct page *   *rq_page_end;  /* one past the last page */
 
-   struct pagevec  rq_pvec;
+   struct folio_batch  rq_fbatch;
struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. 
*/
struct bio_vec  rq_bvec[RPCSVC_MAXPAGES];
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index e7c101290425..587811a002c9 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -640,7 +640,7 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool 
*pool, int node)
if (!rqstp)
return rqstp;
 
-   pagevec_init(&rqstp->rq_pvec);
+   folio_batch_init(&rqstp->rq_fbatch);
 
__set_bit(RQ_BUSY, &rqstp->rq_flags);
rqstp->rq_server = serv;
@@ -851,9 +851,9 @@ bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct 
page *page)
}
 
if (*rqstp->rq_next_page) {
-   if (!pagevec_space(&rqstp->rq_pvec))
-   __pagevec_release(&rqstp->rq_pvec);
-   pagevec_add(&rqstp->rq_pvec, *rqstp->rq_next_page);
+   if (!folio_batch_add(&rqstp->rq_fbatch,
+   page_folio(*rqstp->rq_next_page)))
+   __folio_batch_release(&rqstp->rq_fbatch);
}
 
get_page(page);
@@ -887,7 +887,7 @@ void svc_rqst_release_pages(struct svc_rqst *rqstp)
 void
 svc_rqst_free(struct svc_rqst *rqstp)
 {
-   pagevec_release(&rqstp->rq_pvec);
+   folio_batch_release(&rqstp->rq_fbatch);
svc_release_buffer(rqstp);
if (rqstp->rq_scratch_page)
put_page(rqstp->rq_scratch_page);
-- 
2.39.2

[PATCH 12/13] mm: Remove references to pagevec

2023-06-21 Thread Matthew Wilcox (Oracle)

Most of these should just refer to the LRU cache rather than the
data structure used to implement the LRU cache.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 mm/huge_memory.c| 2 +-
 mm/khugepaged.c | 6 +++---
 mm/ksm.c| 6 +++---
 mm/memory.c | 6 +++---
 mm/migrate_device.c | 2 +-
 mm/swap.c   | 2 +-
 mm/truncate.c   | 2 +-
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e94fe292f30a..eb3678360b97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1344,7 +1344,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
/*
 * See do_wp_page(): we can only reuse the folio exclusively if
 * there are no additional references. Note that we always drain
-* the LRU pagevecs immediately after adding a THP.
+* the LRU cache immediately after adding a THP.
 */
if (folio_ref_count(folio) >
1 + folio_test_swapcache(folio) * folio_nr_pages(folio))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5ef1e08b2a06..3beb4ad2ee5e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1051,7 +1051,7 @@ static int __collapse_huge_page_swapin(struct mm_struct 
*mm,
if (pte)
pte_unmap(pte);
 
-   /* Drain LRU add pagevec to remove extra pin on the swapped in pages */
+   /* Drain LRU cache to remove extra pin on the swapped in pages */
if (swapped_in)
lru_add_drain();
 
@@ -1972,7 +1972,7 @@ static int collapse_file(struct mm_struct *mm, unsigned 
long addr,
result = SCAN_FAIL;
goto xa_unlocked;
}
-   /* drain pagevecs to help isolate_lru_page() */
+   /* drain lru cache to help isolate_lru_page() */
lru_add_drain();
page = folio_file_page(folio, index);
} else if (trylock_page(page)) {
@@ -1988,7 +1988,7 @@ static int collapse_file(struct mm_struct *mm, unsigned 
long addr,
page_cache_sync_readahead(mapping, &file->f_ra,
  file, index,
  end - index);
-   /* drain pagevecs to help isolate_lru_page() */
+   /* drain lru cache to help isolate_lru_page() */
lru_add_drain();
page = find_lock_page(mapping, index);
if (unlikely(page == NULL)) {
diff --git a/mm/ksm.c b/mm/ksm.c
index d995779dc1fe..ba266359da55 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -932,7 +932,7 @@ static int remove_stable_node(struct ksm_stable_node 
*stable_node)
 * The stable node did not yet appear stale to get_ksm_page(),
 * since that allows for an unmapped ksm page to be recognized
 * right up until it is freed; but the node is safe to remove.
-* This page might be in a pagevec waiting to be freed,
+* This page might be in an LRU cache waiting to be freed,
 * or it might be PageSwapCache (perhaps under writeback),
 * or it might have been removed from swapcache a moment ago.
 */
@@ -2303,8 +2303,8 @@ static struct ksm_rmap_item 
*scan_get_next_rmap_item(struct page **page)
trace_ksm_start_scan(ksm_scan.seqnr, ksm_rmap_items);
 
/*
-* A number of pages can hang around indefinitely on per-cpu
-* pagevecs, raised page count preventing write_protect_page
+* A number of pages can hang around indefinitely in per-cpu
+* LRU cache, raised page count preventing write_protect_page
 * from merging them.  Though it doesn't really matter much,
 * it is puzzling to see some stuck in pages_volatile until
 * other activity jostles them out, and they also prevented
diff --git a/mm/memory.c b/mm/memory.c
index 9f2723749f55..d034c52071f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3404,8 +3404,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
goto copy;
if (!folio_test_lru(folio))
/*
-* Note: We cannot easily detect+handle references from
-* remote LRU pagevecs or references to LRU folios.
+* We cannot easily detect+handle references from
+* remote LRU caches or references to LRU folios.
 */
lru_add_drain();
if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
@@ -3883,7 +38

[PATCH 03/13] scatterlist: Add sg_set_folio()

2023-06-21 Thread Matthew Wilcox (Oracle)

This wrapper for sg_set_page() lets drivers add folios to a scatterlist
more easily.  We could, perhaps, do better by using a different page
in the folio if offset is larger than UINT_MAX, but let's hope we get
a better data structure than this before we need to care about such
large folios.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/scatterlist.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ec46d8e8e49d..77df3d7b18a6 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -141,6 +141,30 @@ static inline void sg_set_page(struct scatterlist *sg, 
struct page *page,
sg->length = len;
 }
 
+/**
+ * sg_set_folio - Set sg entry to point at given folio
+ * @sg: SG entry
+ * @folio:  The folio
+ * @len:Length of data
+ * @offset: Offset into folio
+ *
+ * Description:
+ *   Use this function to set an sg entry pointing at a folio, never assign
+ *   the folio directly. We encode sg table information in the lower bits
+ *   of the folio pointer. See sg_page() for looking up the page belonging
+ *   to an sg entry.
+ *
+ **/
+static inline void sg_set_folio(struct scatterlist *sg, struct folio *folio,
+  size_t len, size_t offset)
+{
+   WARN_ON_ONCE(len > UINT_MAX);
+   WARN_ON_ONCE(offset > UINT_MAX);
+   sg_assign_page(sg, &folio->page);
+   sg->offset = offset;
+   sg->length = len;
+}
+
 static inline struct page *sg_page(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
-- 
2.39.2

[PATCH 04/13] i915: Convert shmem_sg_free_table() to use a folio_batch

2023-06-21 Thread Matthew Wilcox (Oracle)

Remove a few hidden compound_head() calls by converting the returned
page to a folio once and using the folio APIs.  We also only increment
the refcount on the folio once instead of once for each page.  Ideally,
we would have a for_each_sgt_folio macro, but until then this will do.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 55 +--
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 33d5d5178103..8f1633c3fb93 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -19,13 +19,13 @@
 #include "i915_trace.h"
 
 /*
- * Move pages to appropriate lru and release the pagevec, decrementing the
- * ref count of those pages.
+ * Move folios to appropriate lru and release the batch, decrementing the
+ * ref count of those folios.
  */
-static void check_release_pagevec(struct pagevec *pvec)
+static void check_release_folio_batch(struct folio_batch *fbatch)
 {
-   check_move_unevictable_pages(pvec);
-   __pagevec_release(pvec);
+   check_move_unevictable_folios(fbatch);
+   __folio_batch_release(fbatch);
cond_resched();
 }
 
@@ -33,24 +33,29 @@ void shmem_sg_free_table(struct sg_table *st, struct 
address_space *mapping,
 bool dirty, bool backup)
 {
struct sgt_iter sgt_iter;
-   struct pagevec pvec;
+   struct folio_batch fbatch;
+   struct folio *last = NULL;
struct page *page;
 
mapping_clear_unevictable(mapping);
 
-   pagevec_init(&pvec);
+   folio_batch_init(&fbatch);
for_each_sgt_page(page, sgt_iter, st) {
-   if (dirty)
-   set_page_dirty(page);
+   struct folio *folio = page_folio(page);
 
+   if (folio == last)
+   continue;
+   last = folio;
+   if (dirty)
+   folio_mark_dirty(folio);
if (backup)
-   mark_page_accessed(page);
+   folio_mark_accessed(folio);
 
-   if (!pagevec_add(&pvec, page))
-   check_release_pagevec(&pvec);
+   if (!folio_batch_add(&fbatch, folio))
+   check_release_folio_batch(&fbatch);
}
-   if (pagevec_count(&pvec))
-   check_release_pagevec(&pvec);
+   if (fbatch.nr)
+   check_release_folio_batch(&fbatch);
 
sg_free_table(st);
 }
@@ -63,8 +68,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
unsigned int page_count; /* restricted by sg_alloc_table */
unsigned long i;
struct scatterlist *sg;
-   struct page *page;
-   unsigned long last_pfn = 0; /* suppress gcc warning */
+   unsigned long next_pfn = 0; /* suppress gcc warning */
gfp_t noreclaim;
int ret;
 
@@ -95,6 +99,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
sg = st->sgl;
st->nents = 0;
for (i = 0; i < page_count; i++) {
+   struct folio *folio;
const unsigned int shrink[] = {
I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
0,
@@ -103,12 +108,12 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
 
do {
cond_resched();
-   page = shmem_read_mapping_page_gfp(mapping, i, gfp);
-   if (!IS_ERR(page))
+   folio = shmem_read_folio_gfp(mapping, i, gfp);
+   if (!IS_ERR(folio))
break;
 
if (!*s) {
-   ret = PTR_ERR(page);
+   ret = PTR_ERR(folio);
goto err_sg;
}
 
@@ -147,19 +152,21 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, 
struct sg_table *st,
 
if (!i ||
sg->length >= max_segment ||
-   page_to_pfn(page) != last_pfn + 1) {
+   folio_pfn(folio) != next_pfn) {
if (i)
sg = sg_next(sg);
 
st->nents++;
-   sg_set_page(sg, page, PAGE_SIZE, 0);
+   sg_set_folio(sg, folio, folio_size(folio), 0);
} else {
-   sg->length += PAGE_SIZE;
+   /* XXX: could overflow? */
+   sg->length += folio_size(folio);
}
-   last_pfn = page_to_pfn(page);
+   next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
+   i += folio_nr_pag

[PATCH 11/13] mm: Rename invalidate_mapping_pagevec to mapping_try_invalidate

2023-06-21 Thread Matthew Wilcox (Oracle)

We don't use pagevecs for the LRU cache any more, and we don't know
that the failed invalidations were due to the folio being in an
LRU cache.  So rename it to be more accurate.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 mm/fadvise.c  | 16 +++-
 mm/internal.h |  4 ++--
 mm/truncate.c | 25 -
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index fb7c5f43fd2a..f684ffd7f9c9 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -143,7 +143,7 @@ int generic_fadvise(struct file *file, loff_t offset, 
loff_t len, int advice)
}
 
if (end_index >= start_index) {
-   unsigned long nr_pagevec = 0;
+   unsigned long nr_failed = 0;
 
/*
 * It's common to FADV_DONTNEED right after
@@ -156,17 +156,15 @@ int generic_fadvise(struct file *file, loff_t offset, 
loff_t len, int advice)
 */
lru_add_drain();
 
-   invalidate_mapping_pagevec(mapping,
-   start_index, end_index,
-   &nr_pagevec);
+   mapping_try_invalidate(mapping, start_index, end_index,
+   &nr_failed);
 
/*
-* If fewer pages were invalidated than expected then
-* it is possible that some of the pages were on
-* a per-cpu pagevec for a remote CPU. Drain all
-* pagevecs and try again.
+* The failures may be due to the folio being
+* in the LRU cache of a remote CPU. Drain all
+* caches and try again.
 */
-   if (nr_pagevec) {
+   if (nr_failed) {
lru_add_drain_all();
invalidate_mapping_pages(mapping, start_index,
end_index);
diff --git a/mm/internal.h b/mm/internal.h
index 119a8241f9d9..2ff7587b4045 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -133,8 +133,8 @@ int truncate_inode_folio(struct address_space *mapping, 
struct folio *folio);
 bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
loff_t end);
 long invalidate_inode_page(struct page *page);
-unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
-   pgoff_t start, pgoff_t end, unsigned long *nr_pagevec);
+unsigned long mapping_try_invalidate(struct address_space *mapping,
+   pgoff_t start, pgoff_t end, unsigned long *nr_failed);
 
 /**
  * folio_evictable - Test whether a folio is evictable.
diff --git a/mm/truncate.c b/mm/truncate.c
index 86de31ed4d32..4a917570887f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -486,18 +486,17 @@ void truncate_inode_pages_final(struct address_space 
*mapping)
 EXPORT_SYMBOL(truncate_inode_pages_final);
 
 /**
- * invalidate_mapping_pagevec - Invalidate all the unlocked pages of one inode
- * @mapping: the address_space which holds the pages to invalidate
+ * mapping_try_invalidate - Invalidate all the evictable folios of one inode
+ * @mapping: the address_space which holds the folios to invalidate
  * @start: the offset 'from' which to invalidate
  * @end: the offset 'to' which to invalidate (inclusive)
- * @nr_pagevec: invalidate failed page number for caller
+ * @nr_failed: How many folio invalidations failed
  *
- * This helper is similar to invalidate_mapping_pages(), except that it 
accounts
- * for pages that are likely on a pagevec and counts them in @nr_pagevec, which
- * will be used by the caller.
+ * This function is similar to invalidate_mapping_pages(), except that it
+ * returns the number of folios which could not be evicted in @nr_failed.
  */
-unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
-   pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)
+unsigned long mapping_try_invalidate(struct address_space *mapping,
+   pgoff_t start, pgoff_t end, unsigned long *nr_failed)
 {
pgoff_t indices[PAGEVEC_SIZE];
struct folio_batch fbatch;
@@ -527,9 +526,9 @@ unsigned long invalidate_mapping_pagevec(struct 
address_space *mapping,
 */
if (!ret) {
deactivate_file_folio(folio);
-   /* It is likely on the pagevec of a remote CPU 
*/
-   if (nr_pagevec)
-   (*nr_pagevec)++;
+   /* Likely in the lru cache of a remote CPU */
+   if (nr_failed)
+

[PATCH 02/13] mm: Add __folio_batch_release()

2023-06-21 Thread Matthew Wilcox (Oracle)

This performs the same role as __pagevec_release(), ie skipping the
check for batch length of 0.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/pagevec.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index f582f7213ea5..42aad53e382e 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -127,9 +127,15 @@ static inline unsigned folio_batch_add(struct folio_batch 
*fbatch,
return fbatch_space(fbatch);
 }
 
+static inline void __folio_batch_release(struct folio_batch *fbatch)
+{
+   __pagevec_release((struct pagevec *)fbatch);
+}
+
 static inline void folio_batch_release(struct folio_batch *fbatch)
 {
-   pagevec_release((struct pagevec *)fbatch);
+   if (folio_batch_count(fbatch))
+   __folio_batch_release(fbatch);
 }
 
 void folio_batch_remove_exceptionals(struct folio_batch *fbatch);
-- 
2.39.2

[PATCH 06/13] mm: Remove check_move_unevictable_pages()

2023-06-21 Thread Matthew Wilcox (Oracle)

All callers have now been converted to call check_move_unevictable_folios().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/swap.h |  1 -
 mm/vmscan.c  | 17 -
 2 files changed, 18 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ce7e82cf787f..456546443f1f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -439,7 +439,6 @@ static inline bool node_reclaim_enabled(void)
 }
 
 void check_move_unevictable_folios(struct folio_batch *fbatch);
-void check_move_unevictable_pages(struct pagevec *pvec);
 
 extern void __meminit kswapd_run(int nid);
 extern void __meminit kswapd_stop(int nid);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45d17c7cc555..f8dd1db15897 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -8074,23 +8074,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t 
gfp_mask, unsigned int order)
 }
 #endif
 
-void check_move_unevictable_pages(struct pagevec *pvec)
-{
-   struct folio_batch fbatch;
-   unsigned i;
-
-   folio_batch_init(&fbatch);
-   for (i = 0; i < pvec->nr; i++) {
-   struct page *page = pvec->pages[i];
-
-   if (PageTransTail(page))
-   continue;
-   folio_batch_add(&fbatch, page_folio(page));
-   }
-   check_move_unevictable_folios(&fbatch);
-}
-EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-
 /**
  * check_move_unevictable_folios - Move evictable folios to appropriate zone
  * lru list
-- 
2.39.2

[PATCH 00/13] Remove pagevecs

2023-06-21 Thread Matthew Wilcox (Oracle)

We're almost done with the pagevec -> folio_batch conversion.  Finish
the job.

Matthew Wilcox (Oracle) (13):
  afs: Convert pagevec to folio_batch in afs_extend_writeback()
  mm: Add __folio_batch_release()
  scatterlist: Add sg_set_folio()
  i915: Convert shmem_sg_free_table() to use a folio_batch
  drm: Convert drm_gem_put_pages() to use a folio_batch
  mm: Remove check_move_unevictable_pages()
  pagevec: Rename fbatch_count()
  i915: Convert i915_gpu_error to use a folio_batch
  net: Convert sunrpc from pagevec to folio_batch
  mm: Remove struct pagevec
  mm: Rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: Remove references to pagevec
  mm: Remove unnecessary pagevec includes

 drivers/gpu/drm/drm_gem.c | 68 +--
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 55 ++
 drivers/gpu/drm/i915/i915_gpu_error.c | 50 -
 fs/afs/write.c| 16 +++---
 include/linux/pagevec.h   | 67 +++---
 include/linux/scatterlist.h   | 24 
 include/linux/sunrpc/svc.h|  2 +-
 include/linux/swap.h  |  1 -
 mm/fadvise.c  | 17 +++---
 mm/huge_memory.c  |  2 +-
 mm/internal.h |  4 +-
 mm/khugepaged.c   |  6 +-
 mm/ksm.c  |  6 +-
 mm/memory.c   |  6 +-
 mm/memory_hotplug.c   |  1 -
 mm/migrate.c  |  1 -
 mm/migrate_device.c   |  2 +-
 mm/readahead.c|  1 -
 mm/swap.c | 20 +++
 mm/swap_state.c   |  1 -
 mm/truncate.c | 27 +
 mm/vmscan.c   | 17 --
 net/sunrpc/svc.c  | 10 ++--
 23 files changed, 185 insertions(+), 219 deletions(-)

-- 
2.39.2

[PATCH 05/13] drm: Convert drm_gem_put_pages() to use a folio_batch

2023-06-21 Thread Matthew Wilcox (Oracle)

Remove a few hidden compound_head() calls by converting the returned
page to a folio once and using the folio APIs.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 drivers/gpu/drm/drm_gem.c | 68 ++-
 1 file changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 1a5a2cd0d4ec..78dcae201cc6 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -496,13 +496,13 @@ int drm_gem_create_mmap_offset(struct drm_gem_object *obj)
 EXPORT_SYMBOL(drm_gem_create_mmap_offset);
 
 /*
- * Move pages to appropriate lru and release the pagevec, decrementing the
- * ref count of those pages.
+ * Move folios to appropriate lru and release the folios, decrementing the
+ * ref count of those folios.
  */
-static void drm_gem_check_release_pagevec(struct pagevec *pvec)
+static void drm_gem_check_release_batch(struct folio_batch *fbatch)
 {
-   check_move_unevictable_pages(pvec);
-   __pagevec_release(pvec);
+   check_move_unevictable_folios(fbatch);
+   __folio_batch_release(fbatch);
cond_resched();
 }
 
@@ -534,10 +534,10 @@ static void drm_gem_check_release_pagevec(struct pagevec 
*pvec)
 struct page **drm_gem_get_pages(struct drm_gem_object *obj)
 {
struct address_space *mapping;
-   struct page *p, **pages;
-   struct pagevec pvec;
-   int i, npages;
-
+   struct page **pages;
+   struct folio *folio;
+   struct folio_batch fbatch;
+   int i, j, npages;
 
if (WARN_ON(!obj->filp))
return ERR_PTR(-EINVAL);
@@ -559,11 +559,14 @@ struct page **drm_gem_get_pages(struct drm_gem_object 
*obj)
 
mapping_set_unevictable(mapping);
 
-   for (i = 0; i < npages; i++) {
-   p = shmem_read_mapping_page(mapping, i);
-   if (IS_ERR(p))
+   i = 0;
+   while (i < npages) {
+   folio = shmem_read_folio_gfp(mapping, i,
+   mapping_gfp_mask(mapping));
+   if (IS_ERR(folio))
goto fail;
-   pages[i] = p;
+   for (j = 0; j < folio_nr_pages(folio); j++, i++)
+   pages[i] = folio_file_page(folio, i);
 
/* Make sure shmem keeps __GFP_DMA32 allocated pages in the
 * correct region during swapin. Note that this requires
@@ -571,23 +574,26 @@ struct page **drm_gem_get_pages(struct drm_gem_object 
*obj)
 * so shmem can relocate pages during swapin if required.
 */
BUG_ON(mapping_gfp_constraint(mapping, __GFP_DMA32) &&
-   (page_to_pfn(p) >= 0x0010UL));
+   (folio_pfn(folio) >= 0x0010UL));
}
 
return pages;
 
 fail:
mapping_clear_unevictable(mapping);
-   pagevec_init(&pvec);
-   while (i--) {
-   if (!pagevec_add(&pvec, pages[i]))
-   drm_gem_check_release_pagevec(&pvec);
+   folio_batch_init(&fbatch);
+   j = 0;
+   while (j < i) {
+   struct folio *f = page_folio(pages[j]);
+   if (!folio_batch_add(&fbatch, f))
+   drm_gem_check_release_batch(&fbatch);
+   j += folio_nr_pages(f);
}
-   if (pagevec_count(&pvec))
-   drm_gem_check_release_pagevec(&pvec);
+   if (fbatch.nr)
+   drm_gem_check_release_batch(&fbatch);
 
kvfree(pages);
-   return ERR_CAST(p);
+   return ERR_CAST(folio);
 }
 EXPORT_SYMBOL(drm_gem_get_pages);
 
@@ -603,7 +609,7 @@ void drm_gem_put_pages(struct drm_gem_object *obj, struct 
page **pages,
 {
int i, npages;
struct address_space *mapping;
-   struct pagevec pvec;
+   struct folio_batch fbatch;
 
mapping = file_inode(obj->filp)->i_mapping;
mapping_clear_unevictable(mapping);
@@ -616,23 +622,27 @@ void drm_gem_put_pages(struct drm_gem_object *obj, struct 
page **pages,
 
npages = obj->size >> PAGE_SHIFT;
 
-   pagevec_init(&pvec);
+   folio_batch_init(&fbatch);
for (i = 0; i < npages; i++) {
+   struct folio *folio;
+
if (!pages[i])
continue;
+   folio = page_folio(pages[i]);
 
if (dirty)
-   set_page_dirty(pages[i]);
+   folio_mark_dirty(folio);
 
if (accessed)
-   mark_page_accessed(pages[i]);
+   folio_mark_accessed(folio);
 
/* Undo the reference we took when populating the table */
-   if (!pagevec_add(&pvec, pages[i]))
-   drm_gem_check_release_pagevec(&pvec);
+   if (!folio_batch_add(&fbatch, folio))
+   drm_gem_check_release_ba

[PATCH 08/13] i915: Convert i915_gpu_error to use a folio_batch

2023-06-21 Thread Matthew Wilcox (Oracle)

Remove one of the last remaining users of pagevec.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 50 +--
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
b/drivers/gpu/drm/i915/i915_gpu_error.c
index ec368e700235..0c38bfb60c9a 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -187,64 +187,64 @@ i915_error_printer(struct drm_i915_error_state_buf *e)
 }
 
 /* single threaded page allocator with a reserved stash for emergencies */
-static void pool_fini(struct pagevec *pv)
+static void pool_fini(struct folio_batch *fbatch)
 {
-   pagevec_release(pv);
+   folio_batch_release(fbatch);
 }
 
-static int pool_refill(struct pagevec *pv, gfp_t gfp)
+static int pool_refill(struct folio_batch *fbatch, gfp_t gfp)
 {
-   while (pagevec_space(pv)) {
-   struct page *p;
+   while (folio_batch_space(fbatch)) {
+   struct folio *folio;
 
-   p = alloc_page(gfp);
-   if (!p)
+   folio = folio_alloc(gfp, 0);
+   if (!folio)
return -ENOMEM;
 
-   pagevec_add(pv, p);
+   folio_batch_add(fbatch, folio);
}
 
return 0;
 }
 
-static int pool_init(struct pagevec *pv, gfp_t gfp)
+static int pool_init(struct folio_batch *fbatch, gfp_t gfp)
 {
int err;
 
-   pagevec_init(pv);
+   folio_batch_init(fbatch);
 
-   err = pool_refill(pv, gfp);
+   err = pool_refill(fbatch, gfp);
if (err)
-   pool_fini(pv);
+   pool_fini(fbatch);
 
return err;
 }
 
-static void *pool_alloc(struct pagevec *pv, gfp_t gfp)
+static void *pool_alloc(struct folio_batch *fbatch, gfp_t gfp)
 {
-   struct page *p;
+   struct folio *folio;
 
-   p = alloc_page(gfp);
-   if (!p && pagevec_count(pv))
-   p = pv->pages[--pv->nr];
+   folio = folio_alloc(gfp, 0);
+   if (!folio && folio_batch_count(fbatch))
+   folio = fbatch->folios[--fbatch->nr];
 
-   return p ? page_address(p) : NULL;
+   return folio ? folio_address(folio) : NULL;
 }
 
-static void pool_free(struct pagevec *pv, void *addr)
+static void pool_free(struct folio_batch *fbatch, void *addr)
 {
-   struct page *p = virt_to_page(addr);
+   struct folio *folio = virt_to_folio(addr);
 
-   if (pagevec_space(pv))
-   pagevec_add(pv, p);
+   if (folio_batch_space(fbatch))
+   folio_batch_add(fbatch, folio);
else
-   __free_page(p);
+   folio_put(folio);
 }
 
 #ifdef CONFIG_DRM_I915_COMPRESS_ERROR
 
 struct i915_vma_compress {
-   struct pagevec pool;
+   struct folio_batch pool;
struct z_stream_s zstream;
void *tmp;
 };
@@ -381,7 +381,7 @@ static void err_compression_marker(struct 
drm_i915_error_state_buf *m)
 #else
 
 struct i915_vma_compress {
-   struct pagevec pool;
+   struct folio_batch pool;
 };
 
 static bool compress_init(struct i915_vma_compress *c)
-- 
2.39.2

[PATCH 01/13] afs: Convert pagevec to folio_batch in afs_extend_writeback()

2023-06-21 Thread Matthew Wilcox (Oracle)

Removes a folio->page->folio conversion for each folio that's involved.
More importantly, removes one of the last few uses of a pagevec.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/afs/write.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 18ccb613dff8..6e68c70d0b22 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -465,7 +465,7 @@ static void afs_extend_writeback(struct address_space 
*mapping,
 bool caching,
 unsigned int *_len)
 {
-   struct pagevec pvec;
+   struct folio_batch fbatch;
struct folio *folio;
unsigned long priv;
unsigned int psize, filler = 0;
@@ -476,7 +476,7 @@ static void afs_extend_writeback(struct address_space 
*mapping,
unsigned int i;
 
XA_STATE(xas, &mapping->i_pages, index);
-   pagevec_init(&pvec);
+   folio_batch_init(&fbatch);
 
do {
/* Firstly, we gather up a batch of contiguous dirty pages
@@ -535,7 +535,7 @@ static void afs_extend_writeback(struct address_space 
*mapping,
stop = false;
 
index += folio_nr_pages(folio);
-   if (!pagevec_add(&pvec, &folio->page))
+   if (!folio_batch_add(&fbatch, folio))
break;
if (stop)
break;
@@ -545,14 +545,14 @@ static void afs_extend_writeback(struct address_space 
*mapping,
xas_pause(&xas);
rcu_read_unlock();
 
-   /* Now, if we obtained any pages, we can shift them to being
+   /* Now, if we obtained any folios, we can shift them to being
 * writable and mark them for caching.
 */
-   if (!pagevec_count(&pvec))
+   if (!folio_batch_count(&fbatch))
break;
 
-   for (i = 0; i < pagevec_count(&pvec); i++) {
-   folio = page_folio(pvec.pages[i]);
+   for (i = 0; i < folio_batch_count(&fbatch); i++) {
+   folio = fbatch.folios[i];
trace_afs_folio_dirty(vnode, 
tracepoint_string("store+"), folio);
 
if (!folio_clear_dirty_for_io(folio))
@@ -565,7 +565,7 @@ static void afs_extend_writeback(struct address_space 
*mapping,
folio_unlock(folio);
}
 
-   pagevec_release(&pvec);
+   folio_batch_release(&fbatch);
cond_resched();
} while (!stop);
 
-- 
2.39.2

[PATCH 07/13] pagevec: Rename fbatch_count()

2023-06-21 Thread Matthew Wilcox (Oracle)

This should always have been called folio_batch_count().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/pagevec.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 42aad53e382e..3a9d29dd28a3 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -105,7 +105,7 @@ static inline unsigned int folio_batch_count(struct 
folio_batch *fbatch)
return fbatch->nr;
 }
 
-static inline unsigned int fbatch_space(struct folio_batch *fbatch)
+static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
 {
return PAGEVEC_SIZE - fbatch->nr;
 }
@@ -124,7 +124,7 @@ static inline unsigned folio_batch_add(struct folio_batch 
*fbatch,
struct folio *folio)
 {
fbatch->folios[fbatch->nr++] = folio;
-   return fbatch_space(fbatch);
+   return folio_batch_space(fbatch);
 }
 
 static inline void __folio_batch_release(struct folio_batch *fbatch)
-- 
2.39.2

[PATCH 13/13] mm: Remove unnecessary pagevec includes

2023-06-21 Thread Matthew Wilcox (Oracle)

These files no longer need pagevec.h, mostly due to function declarations
being moved out of it.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 mm/fadvise.c| 1 -
 mm/memory_hotplug.c | 1 -
 mm/migrate.c| 1 -
 mm/readahead.c  | 1 -
 mm/swap_state.c | 1 -
 5 files changed, 5 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index f684ffd7f9c9..6c39d42f16dc 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 35db4108bb15..3f231cf1b410 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -13,7 +13,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/mm/migrate.c b/mm/migrate.c
index ce35afdbc1e3..ee26f4a962ef 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -21,7 +21,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..a9c999aa19af 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -120,7 +120,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4a5c7b748051..f8ea7015bad4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.39.2

[PATCH 10/13] mm: Remove struct pagevec

2023-06-21 Thread Matthew Wilcox (Oracle)

All users are now converted to use the folio_batch so we can get rid of
this data structure.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/pagevec.h | 63 +++--
 mm/swap.c   | 18 ++--
 2 files changed, 13 insertions(+), 68 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 3a9d29dd28a3..87cc678adc85 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -3,65 +3,18 @@
  * include/linux/pagevec.h
  *
  * In many places it is efficient to batch an operation up against multiple
- * pages.  A pagevec is a multipage container which is used for that.
+ * folios.  A folio_batch is a container which is used for that.
  */
 
 #ifndef _LINUX_PAGEVEC_H
 #define _LINUX_PAGEVEC_H
 
-#include 
+#include 
 
-/* 15 pointers + header align the pagevec structure to a power of two */
+/* 15 pointers + header align the folio_batch structure to a power of two */
 #define PAGEVEC_SIZE   15
 
-struct page;
 struct folio;
-struct address_space;
-
-/* Layout must match folio_batch */
-struct pagevec {
-   unsigned char nr;
-   bool percpu_pvec_drained;
-   struct page *pages[PAGEVEC_SIZE];
-};
-
-void __pagevec_release(struct pagevec *pvec);
-
-static inline void pagevec_init(struct pagevec *pvec)
-{
-   pvec->nr = 0;
-   pvec->percpu_pvec_drained = false;
-}
-
-static inline void pagevec_reinit(struct pagevec *pvec)
-{
-   pvec->nr = 0;
-}
-
-static inline unsigned pagevec_count(struct pagevec *pvec)
-{
-   return pvec->nr;
-}
-
-static inline unsigned pagevec_space(struct pagevec *pvec)
-{
-   return PAGEVEC_SIZE - pvec->nr;
-}
-
-/*
- * Add a page to a pagevec.  Returns the number of slots still available.
- */
-static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
-{
-   pvec->pages[pvec->nr++] = page;
-   return pagevec_space(pvec);
-}
-
-static inline void pagevec_release(struct pagevec *pvec)
-{
-   if (pagevec_count(pvec))
-   __pagevec_release(pvec);
-}
 
 /**
  * struct folio_batch - A collection of folios.
@@ -78,11 +31,6 @@ struct folio_batch {
struct folio *folios[PAGEVEC_SIZE];
 };
 
-/* Layout must match pagevec */
-static_assert(sizeof(struct pagevec) == sizeof(struct folio_batch));
-static_assert(offsetof(struct pagevec, pages) ==
-   offsetof(struct folio_batch, folios));
-
 /**
  * folio_batch_init() - Initialise a batch of folios
  * @fbatch: The folio batch.
@@ -127,10 +75,7 @@ static inline unsigned folio_batch_add(struct folio_batch 
*fbatch,
return folio_batch_space(fbatch);
 }
 
-static inline void __folio_batch_release(struct folio_batch *fbatch)
-{
-   __pagevec_release((struct pagevec *)fbatch);
-}
+void __folio_batch_release(struct folio_batch *pvec);
 
 static inline void folio_batch_release(struct folio_batch *fbatch)
 {
diff --git a/mm/swap.c b/mm/swap.c
index 423199ee8478..10348c1cf9c5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1044,25 +1044,25 @@ void release_pages(release_pages_arg arg, int nr)
 EXPORT_SYMBOL(release_pages);
 
 /*
- * The pages which we're about to release may be in the deferred lru-addition
+ * The folios which we're about to release may be in the deferred lru-addition
  * queues.  That would prevent them from really being freed right now.  That's
- * OK from a correctness point of view but is inefficient - those pages may be
+ * OK from a correctness point of view but is inefficient - those folios may be
  * cache-warm and we want to give them back to the page allocator ASAP.
  *
- * So __pagevec_release() will drain those queues here.
+ * So __folio_batch_release() will drain those queues here.
  * folio_batch_move_lru() calls folios_put() directly to avoid
  * mutual recursion.
  */
-void __pagevec_release(struct pagevec *pvec)
+void __folio_batch_release(struct folio_batch *fbatch)
 {
-   if (!pvec->percpu_pvec_drained) {
+   if (!fbatch->percpu_pvec_drained) {
lru_add_drain();
-   pvec->percpu_pvec_drained = true;
+   fbatch->percpu_pvec_drained = true;
}
-   release_pages(pvec->pages, pagevec_count(pvec));
-   pagevec_reinit(pvec);
+   release_pages(fbatch->folios, folio_batch_count(fbatch));
+   folio_batch_reinit(fbatch);
 }
-EXPORT_SYMBOL(__pagevec_release);
+EXPORT_SYMBOL(__folio_batch_release);
 
 /**
  * folio_batch_remove_exceptionals() - Prune non-folios from a batch.
-- 
2.39.2

Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h

2023-06-15 Thread Matthew Wilcox

On Thu, Jun 15, 2023 at 03:33:12PM -0400, Peter Xu wrote:
> My question is whether page_zonenum() is ready for taking all kinds of tail
> pages?
> 
> Zone device tail pages all look fine, per memmap_init_zone_device().  The
> question was other kinds of usual compound pages, like either thp or
> hugetlb.  IIUC page->flags can be uninitialized for those tail pages.

I don't think that's true.  It's my understanding that page->flags is
initialised for all pages in memmap at boot / hotplug / delayed-init
time.  So you can check things like zone, node, etc on literally any
page.  Contrariwise, those flags are not available in tail pages for
use by the entity that has allocated a compound page / large folio.

Also, I don't believe zone device pages support compound allocation.
I think they're always allocated as order-0.

Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE

2023-02-27 Thread Matthew Wilcox

On Mon, Feb 27, 2023 at 06:39:33PM +0100, Danilo Krummrich wrote:
> On 2/21/23 19:31, Matthew Wilcox wrote:
> > Lockdep will shout at you if you get it wrong ;-)  But you can safely
> > take the spinlock before calling mas_store_gfp(GFP_KERNEL) because
> > mas_nomem() knows to drop the lock before doing a sleeping allocation.
> > Essentially you're open-coding mtree_store_range() but doing your own
> > thing in addition to the store.
> 
> As already mentioned, I went with your advice to just take the maple tree's
> internal spinlock within the GPUVA manager and leave all the other locking
> to the drivers as intended.
> 
> However, I run into the case that lockdep shouts at me for not taking the
> spinlock before calling mas_find() in the iterator macros.
> 
> Now, I definitely don't want to let the drivers take the maple tree's
> spinlock before they use the iterator macro. Of course, drivers shouldn't
> even know about the underlying maple tree of the GPUVA manager.
> 
> One way to make lockdep happy in this case seems to be taking the spinlock
> right before mas_find() and drop it right after for each iteration.

While we don't have any lockdep checking of this, you really shouldn't be
using an iterator if you're going to drop the lock between invocations.
The iterator points into the tree, so you need to invalidate the iterator
any time you drop the lock.

You don't have to use a spinlock to do a read iteration.  You can just
take the rcu_read_lock() around your iteration, as long as you can
tolerate the mild inconsistencies that RCU permits.

Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE

2023-02-22 Thread Matthew Wilcox

On Wed, Feb 22, 2023 at 05:11:34PM +0100, Danilo Krummrich wrote:
> On 2/21/23 19:31, Matthew Wilcox wrote:
> > on tue, feb 21, 2023 at 03:37:49pm +0100, danilo krummrich wrote:
> > > It feels a bit weird that I, as a user of the API, would need to lock 
> > > certain
> > > (or all?) mas_*() functions with the internal spinlock in order to protect
> > > (future) internal features of the tree, such as the slab cache 
> > > defragmentation
> > > you mentioned. Because from my perspective, as the generic component that 
> > > tells
> > > it's users (the drivers) to take care of locking VA space operations (and 
> > > hence
> > > tree operations) I don't have an own purpose of this internal spinlock, 
> > > right?
> > 
> > You don't ... but we can't know that.
> 
> Thanks for the clarification. I think I should now know what to for the
> GPUVA manager in terms of locking the maple tree in general.
> 
> Though I still have very limited insights on the maple tree I want to share
> some further thoughts.
> 
> From what I got so far it really seems to me that it would be better to just
> take the internal spinlock for both APIs (normal and advanced) whenever you
> need to internally.

No.  Really, no.  The point of the advanced API is that it's a toolbox
for doing the operation you want, but isn't a generic enough operation
to be part of the normal API.  To take an example from the radix
tree days, in the page cache, we need to walk a range of the tree,
looking for any entries that are marked as DIRTY, clear the DIRTY
mark and set the TOWRITE mark.  There was a horrendous function called
radix_tree_range_tag_if_tagged() which did exactly this.  Now look at
the implementation of tag_pages_for_writeback(); it's a simple loop over
a range with an occasional pause to check whether we need to reschedule.

But that means you need to know how to use the toolbox.  Some of the
tools are dangerous and you can cut yourself on them.

> Another plus would probably be maintainability. Once you got quite a few
> maple tree users using external locks (either in the sense of calling

I don't want maple tree users using external locks.  That exists
because it was the only reasonable way of converting the VMA tree
from the rbtree to the maple tree.  I intend to get rid of
mt_set_external_lock().  The VMAs are eventually going to be protected
by the internal spinlock.

Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE

2023-02-21 Thread Matthew Wilcox

On Tue, Feb 21, 2023 at 03:37:49PM +0100, Danilo Krummrich wrote:
> On Mon, Feb 20, 2023 at 08:33:35PM +0000, Matthew Wilcox wrote:
> > On Mon, Feb 20, 2023 at 06:06:03PM +0100, Danilo Krummrich wrote:
> > > On 2/20/23 16:10, Matthew Wilcox wrote:
> > > > This is why we like people to use the spinlock embedded in the tree.
> > > > There's nothing for the user to care about.  If the access really is
> > > > serialised, acquiring/releasing the uncontended spinlock is a minimal
> > > > cost compared to all the other things that will happen while modifying
> > > > the tree.
> > > 
> > > I think as for the users of the GPUVA manager we'd have two cases:
> > > 
> > > 1) Accesses to the manager (and hence the tree) are serialized, no lock
> > > needed.
> > > 
> > > 2) Multiple operations on the tree must be locked in order to make them
> > > appear atomic.
> > 
> > Could you give an example here of what you'd like to do?  Ideally
> > something complicated so I don't say "Oh, you can just do this" when
> > there's a more complex example for which "this" won't work.  I'm sure
> > that's embedded somewhere in the next 20-odd patches, but it's probably
> > quicker for you to describe in terms of tree operations that have to
> > appear atomic than for me to try to figure it out.
> > 
> 
> Absolutely, not gonna ask you to read all of that. :-)
> 
> One thing the GPUVA manager does is to provide drivers the (sub-)operations
> that need to be processed in order to fulfill a map or unmap request from
> userspace. For instance, when userspace asks the driver to map some memory
> the GPUVA manager calculates which existing mappings must be removed, split up
> or can be merged with the newly requested mapping.
> 
> A driver has two ways to fetch those operations from the GPUVA manager. It can
> either obtain a list of operations or receive a callback for each operation
> generated by the GPUVA manager.
> 
> In both cases the GPUVA manager walks the maple tree, which keeps track of
> existing mappings, for the given range in __drm_gpuva_sm_map() (only 
> considering
> the map case, since the unmap case is a subset basically). For each mapping
> found in the given range the driver, as mentioned, either receives a callback 
> or
> a list entry is added to the list of operations.
> 
> Typically, for each operation / callback one entry within the maple tree is
> removed and, optionally at the beginning and end of a new mapping's range, a
> new entry is inserted. An of course, as the last operation, there is the new
> mapping itself to insert.
> 
> The GPUVA manager delegates locking responsibility to the drivers. Typically,
> a driver either serializes access to the VA space managed by the GPUVA manager
> (no lock needed) or need to lock the processing of a full set of operations
> generated by the GPUVA manager.

OK, that all makes sense.  It does make sense to have the driver use its
own mutex and then take the spinlock inside the maple tree code.  It
shouldn't ever be contended.

> > > In either case the embedded spinlock wouldn't be useful, we'd either need 
> > > an
> > > external lock or no lock at all.
> > > 
> > > If there are any internal reasons why specific tree operations must be
> > > mutually excluded (such as those you explain below), wouldn't it make more
> > > sense to always have the internal lock and, optionally, allow users to
> > > specify an external lock additionally?
> > 
> > So the way this works for the XArray, which is a little older than the
> > Maple tree, is that we always use the internal spinlock for
> > modifications (possibly BH or IRQ safe), and if someone wants to
> > use an external mutex to make some callers atomic with respect to each
> > other, they're free to do so.  In that case, the XArray doesn't check
> > the user's external locking at all, because it really can't know.
> > 
> > I'd advise taking that approach; if there's really no way to use the
> > internal spinlock to make your complicated updates appear atomic
> > then just let the maple tree use its internal spinlock, and you can
> > also use your external mutex however you like.
> > 
> 
> That sounds like the right thing to do.
> 
> However, I'm using the advanced API of the maple tree (and that's the reason
> why the above example appears a little more detailed than needed) because I
> think with the normal API I can't insert / remove tree entries

Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE

2023-02-20 Thread Matthew Wilcox

On Mon, Feb 20, 2023 at 06:06:03PM +0100, Danilo Krummrich wrote:
> On 2/20/23 16:10, Matthew Wilcox wrote:
> > This is why we like people to use the spinlock embedded in the tree.
> > There's nothing for the user to care about.  If the access really is
> > serialised, acquiring/releasing the uncontended spinlock is a minimal
> > cost compared to all the other things that will happen while modifying
> > the tree.
> 
> I think as for the users of the GPUVA manager we'd have two cases:
> 
> 1) Accesses to the manager (and hence the tree) are serialized, no lock
> needed.
> 
> 2) Multiple operations on the tree must be locked in order to make them
> appear atomic.

Could you give an example here of what you'd like to do?  Ideally
something complicated so I don't say "Oh, you can just do this" when
there's a more complex example for which "this" won't work.  I'm sure
that's embedded somewhere in the next 20-odd patches, but it's probably
quicker for you to describe in terms of tree operations that have to
appear atomic than for me to try to figure it out.

> In either case the embedded spinlock wouldn't be useful, we'd either need an
> external lock or no lock at all.
> 
> If there are any internal reasons why specific tree operations must be
> mutually excluded (such as those you explain below), wouldn't it make more
> sense to always have the internal lock and, optionally, allow users to
> specify an external lock additionally?

So the way this works for the XArray, which is a little older than the
Maple tree, is that we always use the internal spinlock for
modifications (possibly BH or IRQ safe), and if someone wants to
use an external mutex to make some callers atomic with respect to each
other, they're free to do so.  In that case, the XArray doesn't check
the user's external locking at all, because it really can't know.

I'd advise taking that approach; if there's really no way to use the
internal spinlock to make your complicated updates appear atomic
then just let the maple tree use its internal spinlock, and you can
also use your external mutex however you like.

Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE

2023-02-20 Thread Matthew Wilcox

On Mon, Feb 20, 2023 at 03:00:59PM +0100, Danilo Krummrich wrote:
> On 2/17/23 20:38, Matthew Wilcox wrote:
> > On Fri, Feb 17, 2023 at 02:44:10PM +0100, Danilo Krummrich wrote:
> > > Generic components making use of the maple tree (such as the
> > > DRM GPUVA Manager) delegate the responsibility of ensuring mutual
> > > exclusion to their users.
> > > 
> > > While such components could inherit the concept of an external lock,
> > > some users might just serialize the access to the component and hence to
> > > the internal maple tree.
> > > 
> > > In order to allow such use cases, add a new flag MT_FLAGS_LOCK_NONE to
> > > indicate not to do any internal lockdep checks.
> > 
> > I'm really against this change.
> > 
> > First, we really should check that users have their locking right.
> > It's bitten us so many times when they get it wrong.
> 
> In case of the DRM GPUVA manager, some users might serialize the access to
> the GPUVA manager and hence to it's maple tree instances, e.g. through the
> drm_gpu_scheduler. In such a case ensuring to hold a lock would be a bit
> pointless and I wouldn't really know how to "sell" this to potential users
> of the GPUVA manager.

This is why we like people to use the spinlock embedded in the tree.
There's nothing for the user to care about.  If the access really is
serialised, acquiring/releasing the uncontended spinlock is a minimal
cost compared to all the other things that will happen while modifying
the tree.

> > Second, having a lock allows us to defragment the slab cache.  The
> > patches to do that haven't gone anywhere recently, but if we drop the
> > requirement now, we'll never be able to compact ranges of memory that
> > have slabs allocated to them.
> > 
> 
> Not sure if I get that, do you mind explaining a bit how this would affect
> other users of the maple tree, such as my use case, the GPUVA manager?

When we want to free a slab in order to defragment memory, we need
to relocate all the objects allocated within that slab.  To do that
for the maple tree node cache, for each node in this particular slab,
we'll need to walk up to the top of the tree and lock it.  We can then
allocate a new node from a different slab, change the parent to point
to the new node and drop the lock.  After an RCU delay, we can free the
slab and create a larger contiguous block of memory.

As I said, this is somewhat hypothetical in that there's no current
code in the tree to reclaim slabs when we're trying to defragment
memory.  And that's because it's hard to do.  The XArray and maple
tree were designed to make it possible for their slabs.

Re: [PATCH drm-next v2 03/16] maple_tree: split up MA_STATE() macro

2023-02-17 Thread Matthew Wilcox

On Fri, Feb 17, 2023 at 02:44:09PM +0100, Danilo Krummrich wrote:
> \#define SAMPLE_ITER(name, __mgr) \
>   struct sample_iter name = { \
>   .mas = __MA_STATE(&(__mgr)->mt, 0, 0),

This is usually called MA_STATE_INIT()

> #define sample_iter_for_each_range(it__, start__, end__) \
>   for ((it__).mas.index = start__, (it__).entry = mas_find(&(it__).mas, 
> end__ - 1); \
>(it__).entry; (it__).entry = mas_find(&(it__).mas, end__ - 1))

This is a bad iterator design.  It's usually best to do this:

struct sample *sample;
SAMPLE_ITERATOR(si, min);

sample_iter_for_each(&si, sample, max) {
frob(mgr, sample);
}

I don't mind splitting apart MA_STATE_INIT from MA_STATE, and if you
do that, we can also use it in VMA_ITERATOR.

Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE

2023-02-17 Thread Matthew Wilcox

On Fri, Feb 17, 2023 at 02:44:10PM +0100, Danilo Krummrich wrote:
> Generic components making use of the maple tree (such as the
> DRM GPUVA Manager) delegate the responsibility of ensuring mutual
> exclusion to their users.
> 
> While such components could inherit the concept of an external lock,
> some users might just serialize the access to the component and hence to
> the internal maple tree.
> 
> In order to allow such use cases, add a new flag MT_FLAGS_LOCK_NONE to
> indicate not to do any internal lockdep checks.

I'm really against this change.

First, we really should check that users have their locking right.
It's bitten us so many times when they get it wrong.

Second, having a lock allows us to defragment the slab cache.  The
patches to do that haven't gone anywhere recently, but if we drop the
requirement now, we'll never be able to compact ranges of memory that
have slabs allocated to them.

Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-26 Thread Matthew Wilcox

On Thu, Jan 26, 2023 at 04:50:59PM +0200, Mike Rapoport wrote:
> On Thu, Jan 26, 2023 at 11:17:09AM +0200, Mike Rapoport wrote:
> > On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> > > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > > +static inline void init_vm_flags(struct vm_area_struct *vma,
> > > +  unsigned long flags)
> > 
> > I'd suggest to make it vm_flags_init() etc.
> 
> Thinking more about it, it will be even clearer to name these vma_flags_xyz()

Perhaps vma_VERB_flags()?

vma_init_flags()
vma_reset_flags()
vma_set_flags()
vma_clear_flags()
vma_mod_flags()

Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Matthew Wilcox

On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  unsigned long flags)
> +{
> + vma->vm_flags = flags;

vm_flags are supposed to have type vm_flags_t.  That's not been
fully realised yet, but perhaps we could avoid making it worse?

>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;

Including changing this line to vm_flags_t

Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Matthew Wilcox

On Wed, Jan 25, 2023 at 08:49:50AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 25, 2023 at 1:10 AM Peter Zijlstra  wrote:
> > > + /*
> > > +  * Flags, see mm.h.
> > > +  * WARNING! Do not modify directly.
> > > +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> > > +  */
> > > + unsigned long vm_flags;
> >
> > We have __private and ACCESS_PRIVATE() to help with enforcing this.
> 
> Thanks for pointing this out, Peter! I guess for that I'll need to
> convert all read accesses and provide get_vm_flags() too? That will
> cause some additional churt (a quick search shows 801 hits over 248
> files) but maybe it's worth it? I think Michal suggested that too in
> another patch. Should I do that while we are at it?

Here's a trick I saw somewhere in the VFS:

union {
const vm_flags_t vm_flags;
vm_flags_t __private __vm_flags;
};

Now it can be read by anybody but written only by those using
ACCESS_PRIVATE.

Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion

2023-01-23 Thread Matthew Wilcox

On Mon, Jan 23, 2023 at 12:50:52PM -0800, Dan Williams wrote:
> Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> > > Jason Gunthorpe via Lsf-pc wrote:
> > > > I would like to have a session at LSF to talk about Matthew's
> > > > physr discussion starter:
> > > > 
> > > >  https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/
> > > > 
> > > > I have become interested in this with some immediacy because of
> > > > IOMMUFD and this other discussion with Christoph:
> > > > 
> > > >  
> > > > https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_...@nvidia.com/
> > > 
> > > I think this is a worthwhile discussion. My main hangup with 'struct
> > > page' elimination in general is that if anything needs to be allocated
> > 
> > You're the first one to bring up struct page elimination.  Neither Jason
> > nor I have that as our motivation.
> 
> Oh, ok, then maybe I misread the concern in the vfio discussion. I
> thought the summary there is debating the ongoing requirement for
> 'struct page' for P2PDMA?

My reading of that thread is that while it started out that way, it
became more about "So what would a good interface be for doing this".  And
Jason's right, he and I are approaching this from different directions.
My concern is from the GUP side where we start out by getting a folio
(which we know is physically contiguous) and decomposing it into pages.
Then we aggregate all those pages together which are physically contiguous
and stuff them into a bio_vec.  After that, I lose interest; I was
planning on having DMA mapping interfaces which took in an array of
phyr and spat out scatterlists.  Then we could shrink the scatterlist
by removing page_link and offset, leaving us with only dma_address,
length and maybe flags.

Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion

2023-01-23 Thread Matthew Wilcox

On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> Jason Gunthorpe via Lsf-pc wrote:
> > I would like to have a session at LSF to talk about Matthew's
> > physr discussion starter:
> > 
> >  https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/
> > 
> > I have become interested in this with some immediacy because of
> > IOMMUFD and this other discussion with Christoph:
> > 
> >  
> > https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_...@nvidia.com/
> 
> I think this is a worthwhile discussion. My main hangup with 'struct
> page' elimination in general is that if anything needs to be allocated

You're the first one to bring up struct page elimination.  Neither Jason
nor I have that as our motivation.  But there are reasons why struct page
is a bad data structure, and Xen proves that you don't need to have such
a data structure in order to do I/O.

> When I read "general interest across all the driver subsystems" it is
> hard not to ask "have all possible avenues to enable 'struct page' been
> exhausted?"

Yes, we should definitely expend yet more resources chasing a poor
implementation.

Re: [LSF/MM/BPF proposal]: Physr discussion

2023-01-22 Thread Matthew Wilcox

On Sat, Jan 21, 2023 at 11:03:05AM -0400, Jason Gunthorpe wrote:
> I would like to have a session at LSF to talk about Matthew's
> physr discussion starter:
> 
>  https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/

I'm definitely interested in discussing phyrs (even if you'd rather
pronounce it "fizzers" than "fires" ;-)

> I've been working on an implementation and hope to have something
> draft to show on the lists in a few weeks. It is pretty clear there
> are several interesting decisions to make that I think will benefit
> from a live discussion.

Cool!  Here's my latest noodlings:
https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/phyr

Just the top two commits; the other stuff is unrelated.  Shakeel has
also been interested in this.

Re: [PATCH RFC v7 00/23] DEPT(Dependency Tracker)

2023-01-19 Thread Matthew Wilcox

On Thu, Jan 19, 2023 at 03:23:08PM +0900, Byungchul Park wrote:
> Boqun wrote:
> > *   Looks like the DEPT dependency graph doesn't handle the
> > fair/unfair readers as lockdep current does. Which bring the
> > next question.
> 
> No. DEPT works better for unfair read. It works based on wait/event. So
> read_lock() is considered a potential wait waiting on write_unlock()
> while write_lock() is considered a potential wait waiting on either
> write_unlock() or read_unlock(). DEPT is working perfect for it.
> 
> For fair read (maybe you meant queued read lock), I think the case
> should be handled in the same way as normal lock. I might get it wrong.
> Please let me know if I miss something.

>From the lockdep/DEPT point of view, the question is whether:

read_lock(A)
read_lock(A)

can deadlock if a writer comes in between the two acquisitions and
sleeps waiting on A to be released.  A fair lock will block new
readers when a writer is waiting, while an unfair lock will allow
new readers even while a writer is waiting.

Re: [RFC PATCH v3 0/3] new subsystem for compute accelerator devices

2022-11-07 Thread Matthew Wilcox

On Mon, Nov 07, 2022 at 09:07:28AM -0700, Jeffrey Hugo wrote:
> > Another important change is that I have reverted back to use IDR for minor
> > handling instead of xarray. This is because I have found that xarray doesn't
> > handle well the scenario where you allocate a NULL entry and then exchange 
> > it
> > with a real pointer. It appears xarray still considers that entry a "zero"
> > entry. This is unfortunate because DRM works that way (first allocates a 
> > NULL
> > entry and then replaces the entry with a real pointer).
> > 
> > I decided to revert to IDR because I don't want to hold up these patches,
> > as many people are blocked until the support for accel is merged. The xarray
> > issue should be fixed as a separate patch by either fixing the xarray code 
> > or
> > changing how DRM + ACCEL do minor id handling.
> 
> This sounds sane to me.  However, this appears to be something that Matthew
> Wilcox should be aware of (added for visibility).  Perhaps he has a very
> quick solution.  If not, at-least he might have ideas on how to best address
> in the future.

Thanks for cc'ing me.  I wasn't aware of this problem because I hadn't
seen Oded's email yet.  The "problem" is simply a mis-use of the API.

Re: [PATCH v5 1/3] drm: Use XArray instead of IDR for minors

2022-11-07 Thread Matthew Wilcox

On Sun, Nov 06, 2022 at 04:51:39PM +0200, Oded Gabbay wrote:
> I tried executing the following code in a dummy driver I wrote:

You don't need to write a dummy driver; you can write test-cases
entirely in userspace.  Just add them to lib/test_xarray.c and then
make -C tools/testing/radix-tree/

> static DEFINE_XARRAY_ALLOC(xa_dummy);
> void check_xa(void *pdev)
> {
>   void *entry;
>   int ret, index;
> 
>   ret = xa_alloc(&xa_dummy, &index, NULL, XA_LIMIT(0, 63), GFP_NOWAIT);
>   if (ret < 0)
>   return ret;
> 
>   entry = xa_cmpxchg(&xa_dummy, index, NULL, pdev, GFP_KERNEL);
>   if (xa_is_err(entry))
>return;
> 
>   xa_lock(&xa_dummy);
>   xa_dev = xa_load(&xa_dummy, index);
>   xa_unlock(&xa_dummy);
> }
> 
> And to my surprise xa_dev is always NULL, when it should be pdev.
> I believe that because we first allocate the entry with NULL, it is
> considered a "zero" entry in the XA.
> And when we replace it, this attribute doesn't change so when we do
> xa_load, the xa code thinks the entry is a "zero" entry and returns
> NULL.

There's no "attribute" to mark a zero entry.  It's just a zero entry.
The problem is that you're cmpxchg'ing against NULL, and it's not NULL,
it's the ZERO entry.  This is even documented in the test code:

/* cmpxchg sees a reserved entry as ZERO */
XA_BUG_ON(xa, xa_reserve(xa, 12345678, GFP_KERNEL) != 0);
XA_BUG_ON(xa, xa_cmpxchg(xa, 12345678, XA_ZERO_ENTRY,
xa_mk_value(12345678), GFP_NOWAIT) != NULL);

I'm not quite sure why you're using xa_cmpxchg() here anyway; if you
allocated it, it's yours and you can just xa_store() to it.

Re: [PATCH v4 01/10] gna: add PCI driver module

2022-10-20 Thread Matthew Wilcox

On Thu, Oct 20, 2022 at 03:35:16PM +0200, Kwapulinski, Maciej wrote:
> +++ b/drivers/gpu/drm/gna/Kconfig
> @@ -0,0 +1,12 @@
> +#
> +# Intel(R) Gaussian & Neural Accelerator (Intel(R) GNA)
> +#
> +
> +config DRM_GNA
> + tristate "Intel(R) Gaussian & Neural Accelerator (Intel(R) GNA)"
> + depends on X86 && PCI
> +  help

This is mangled; 'help' should be indented by a tab, not two spaces.

Re: [PATCH v4 1/3] drm: Use XArray instead of IDR for minors

2022-09-06 Thread Matthew Wilcox

On Tue, Sep 06, 2022 at 10:16:27PM +0200, Michał Winiarski wrote:
> IDR is deprecated, and since XArray manages its own state with internal
> locking, it simplifies the locking on DRM side.
> Additionally, don't use the IRQ-safe variant, since operating on drm
> minor is not done in IRQ context.
> 
> Signed-off-by: Michał Winiarski 
> Suggested-by: Matthew Wilcox 

I have a few questions, but I like where you're going.

> @@ -98,21 +98,18 @@ static struct drm_minor **drm_minor_get_slot(struct 
> drm_device *dev,
>  static void drm_minor_alloc_release(struct drm_device *dev, void *data)
>  {
>   struct drm_minor *minor = data;
> - unsigned long flags;
>  
>   WARN_ON(dev != minor->dev);
>  
>   put_device(minor->kdev);
>  
> - spin_lock_irqsave(&drm_minor_lock, flags);
> - idr_remove(&drm_minors_idr, minor->index);
> - spin_unlock_irqrestore(&drm_minor_lock, flags);
> + xa_release(&drm_minors_xa, minor->index);

Has it definitely been unused at this point?  I would think that
xa_erase() (an unconditional store) would be the correct function to
call.

> @@ -122,20 +119,12 @@ static int drm_minor_alloc(struct drm_device *dev, 
> unsigned int type)
>   minor->type = type;
>   minor->dev = dev;
>  
> - idr_preload(GFP_KERNEL);
> - spin_lock_irqsave(&drm_minor_lock, flags);
> - r = idr_alloc(&drm_minors_idr,
> -   NULL,
> -   64 * type,
> -   64 * (type + 1),
> -   GFP_NOWAIT);
> - spin_unlock_irqrestore(&drm_minor_lock, flags);
> - idr_preload_end();
> -
> + r = xa_alloc(&drm_minors_xa, &id, NULL,
> +  XA_LIMIT(64 * type, 64 * (type + 1) - 1), GFP_KERNEL);
>   if (r < 0)
>   return r;
>  
> - minor->index = r;
> + minor->index = id;

Wouldn't it be better to call:

r = xa_alloc(&drm_minors_xa, &minor->index, NULL,
XA_LIMIT(64 * type, 64 * (type + 1) - 1), GFP_KERNEL);

I might also prefer a little syntactic sugar like:

#define DRM_MINOR_LIMIT(type)   XA_LIMIT(64 * (type), 64 * (type) + 63)

but that's definitely a matter of taste.

> @@ -172,9 +161,12 @@ static int drm_minor_register(struct drm_device *dev, 
> unsigned int type)
>   goto err_debugfs;
>  
>   /* replace NULL with @minor so lookups will succeed from now on */
> - spin_lock_irqsave(&drm_minor_lock, flags);
> - idr_replace(&drm_minors_idr, minor, minor->index);
> - spin_unlock_irqrestore(&drm_minor_lock, flags);
> + entry = xa_store(&drm_minors_xa, minor->index, &minor, GFP_KERNEL);
> + if (xa_is_err(entry)) {
> + ret = xa_err(entry);
> + goto err_debugfs;
> + }
> + WARN_ON(entry);

Might be better as an xa_cmpxchg()?

> @@ -187,16 +179,13 @@ static int drm_minor_register(struct drm_device *dev, 
> unsigned int type)
>  static void drm_minor_unregister(struct drm_device *dev, unsigned int type)
>  {
>   struct drm_minor *minor;
> - unsigned long flags;
>  
>   minor = *drm_minor_get_slot(dev, type);
>   if (!minor || !device_is_registered(minor->kdev))
>   return;
>  
>   /* replace @minor with NULL so lookups will fail from now on */
> - spin_lock_irqsave(&drm_minor_lock, flags);
> - idr_replace(&drm_minors_idr, NULL, minor->index);
> - spin_unlock_irqrestore(&drm_minor_lock, flags);
> + xa_erase(&drm_minors_xa, minor->index);

This isn't an exact replacement, but I'm not sure whether that makes a
difference.  xa_erase() allows allocation of this ID again while
idr_replace() means that lookups return NULL, but the ID remains in
use.  The equivalent of idr_replace() is:
xa_store(&drm_minors_xa, minor->index, NULL, GFP_KERNEL);

> @@ -215,13 +204,10 @@ static void drm_minor_unregister(struct drm_device 
> *dev, unsigned int type)
>  struct drm_minor *drm_minor_acquire(unsigned int minor_id)
>  {
>   struct drm_minor *minor;
> - unsigned long flags;
>  
> - spin_lock_irqsave(&drm_minor_lock, flags);
> - minor = idr_find(&drm_minors_idr, minor_id);
> + minor = xa_load(&drm_minors_xa, minor_id);
>   if (minor)
>   drm_dev_get(minor->dev);
> - spin_unlock_irqrestore(&drm_minor_lock, flags);

This is also not an exact equivalent as the dev_drm_get() is now outside
the lock.  Does that matter in this case?  I don't know the code well
enough to say.  If you want it to be equivalent, then:

xa_lock(&drm_minors_xa);
minor = xa_load(&d

Re: [PATCH v3 0/4] drm: Use full allocated minor range for DRM

2022-09-06 Thread Matthew Wilcox

On Tue, Sep 06, 2022 at 04:01:13PM +0200, Michał Winiarski wrote:
> 64 DRM device nodes is not enough for everyone.
> Upgrade it to ~512K (which definitely is more than enough).
> 
> To allow testing userspace support for >64 devices, add additional DRM
> modparam (skip_legacy_minors) which causes DRM to skip allocating minors
> in 0-192 range.
> Additionally - one minor tweak around minor DRM IDR locking and IDR lockdep
> annotations.

The IDR is deprecated; rather than making all these changes around
the IDR, could you convert it to use the XArray instead?  I did it
once before, but those patches bounced off the submissions process.

Re: [PATCH] mm: Fix a null ptr deref with CONFIG_DEBUG_VM enabled in wp_page_reuse

2022-07-27 Thread Matthew Wilcox

On Wed, Jul 27, 2022 at 03:14:07PM -0400, Zack Rusin wrote:
> From: Zack Rusin 
> 
> Write page faults on last references might not have a valid page anymore.
> wp_page_reuse has always dealt with that scenario by making
> sure the page isn't null (or the reference was shared) before doing
> anything with it. Recently added checks in VM_BUG_ON (enabled by the
> CONFIG_DEBUG_VM option) use PageAnon helpers which assume the passed
> page is never null, before making sure there is a valid page to work
> with.
> 
> Move the VM_BUG_ON, which unconditionally uses the page, after the
> code that checks that we have a valid one.

Message-ID:

Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing

2022-07-11 Thread Matthew Wilcox

On Mon, Jul 11, 2022 at 03:35:42PM +0200, David Hildenbrand wrote:
> > +   /*
> > +* Device coherent pages are managed by a driver and should not
> > +* be pinned indefinitely as it prevents the driver moving the
> > +* page. So when trying to pin with FOLL_LONGTERM instead try
> > +* to migrate the page out of device memory.
> > +*/
> > +   if (folio_is_device_coherent(folio)) {
> > +   WARN_ON_ONCE(PageCompound(&folio->page));
> 
> Maybe that belongs into migrate_device_page()?

... and should be simply WARN_ON_ONCE(folio_test_large(folio)).
No need to go converting it back into a page in order to test things
that can't possibly be true.

Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d

2022-05-26 Thread Matthew Wilcox

On Thu, May 26, 2022 at 11:48:32AM +0300, Dan Carpenter wrote:
> On Thu, May 26, 2022 at 02:16:34AM +0100, Matthew Wilcox wrote:
> > Bizarre this started showing up now.  The recent patch was:
> > 
> > -   info->alloced += compound_nr(page);
> > -   inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
> > +   info->alloced += folio_nr_pages(folio);
> > +   inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);
> > 
> > so it could tell that compound_order() was small, but folio_order()
> > might be large?
> 
> The old code also generates a warning on my test system.  Smatch thinks
> both compound_order() and folio_order() are 0-255.  I guess because of
> the "unsigned char compound_order;" in the struct page.

It'd be nice if we could annotate that as "contains a value between
1 and BITS_PER_LONG - PAGE_SHIFT".  Then be able to optionally enable
a checker that ensures that's true on loads/stores.  Maybe we need a
language that isn't C :-P  Ada can do this ... I don't think Rust can.

Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d

2022-05-25 Thread Matthew Wilcox

On Wed, May 25, 2022 at 03:20:06PM -0700, Andrew Morton wrote:
> On Wed, 25 May 2022 23:07:35 +0100 Jessica Clarke  wrote:
> 
> > This is i386, so an unsigned long is 32-bit, but i_blocks is a blkcnt_t
> > i.e. a u64, which makes the shift without a cast of the LHS fishy.
> 
> Ah, of course, thanks.  I remember 32 bits ;)
> 
> --- a/mm/shmem.c~mm-shmemc-suppress-shift-warning
> +++ a/mm/shmem.c
> @@ -1945,7 +1945,7 @@ alloc_nohuge:
>  
>   spin_lock_irq(&info->lock);
>   info->alloced += folio_nr_pages(folio);
> - inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);
> + inode->i_blocks += (blkcnt_t)BLOCKS_PER_PAGE << folio_order(folio);

Bizarre this started showing up now.  The recent patch was:

-   info->alloced += compound_nr(page);
-   inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
+   info->alloced += folio_nr_pages(folio);
+   inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);

so it could tell that compound_order() was small, but folio_order()
might be large?

Silencing the warning is a good thing, but folio_order() can (at the
moment) be at most 9 on i386, so it isn't actually going to be
larger than 4096.

Re: [Bug 215807] Bad page state in process systemd-udevd

2022-04-12 Thread Matthew Wilcox

On Tue, Apr 12, 2022 at 02:08:04PM -0700, Andrew Morton wrote:
> hm, that's my third `bad page' report in three emails.  But I'm not
> seeing a pattern - this one might be a DRM thing.

I noticed it was an order-9 page and was set to take responsibility
for it, but it's clearly not a filesystem page, but a DRM page.
Let me help decode this for the benefit of the DRM folks.

> > [8.453363] amdgpu: Topology: Add CPU node
> > [8.467169] BUG: Bad page state in process systemd-udevd  pfn:11b403
> > [8.467180] fbcon: Taking over console
> > [8.467188] page:2cc06944 refcount:0 mapcount:0
> > mapping: index:0x3 pfn:0x11b403
> > [8.467195] head:aa25e58e order:9 compound_mapcount:0
> > compound_pincount:0
> > [8.467198] flags: 
> > 0x17c001(head|node=0|zone=2|lastcpupid=0x1f)
> > [8.467205] raw: 0017c000 f2da846d0001 f2da846d00c8
> > 
> > [8.467208] raw:   
> > 
> > [8.467210] head: 0017c001  dead0122
> > 
> > [8.467212] head:   
> > 
> > [8.467214] page dumped because: corrupted mapping in tail page

Tail pages are all supposed to have page->mapping == TAIL_MAPPING
(0x400 + POISON_POINTER_DELTA).  This one has page->mapping == NULL.
I say "all", but tail pages 1 and 2 aren't checked because those
get other useful things stored in them instead.  This is tail page
number 3, so it's the first one checked.

So DRM has probably been overly enthusiastic about cleaning up
something.  Hope that's helpful!

Re: [PATCH] drm/amdkfd: Add SVM API support capability bits

2022-03-30 Thread Matthew Wilcox

On Wed, Mar 30, 2022 at 04:24:20PM -0500, Alex Sierra wrote:
> From: Philip Yang 
> 
> SVMAPISupported property added to HSA_CAPABILITY, the value match
> HSA_CAPABILITY defined in Thunk spec:
> 
> SVMAPISupported: it will not be supported on older kernels that don't
> have HMM or on systems with GFXv8 or older GPUs without support for
> 48-bit virtual addresses.
> 
> CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value match
> HSA_MEMORYPROPERTY defined in Thunk spec:
> 
> CoherentHostAccess: whether or not device memory can be coherently
> accessed by the host CPU.

Could you translate this commit message into English?  Reviewing
Documentation/process/5.Posting.rst might be helpful.

Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling

2022-03-10 Thread Matthew Wilcox

On Thu, Mar 10, 2022 at 11:26:31AM -0600, Alex Sierra wrote:
> @@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>   * PFNMAP mappings in order to support COWable mappings.
>   *
>   */
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long 
> addr,
>   pte_t pte)
>  {
>   unsigned long pfn = pte_pfn(pte);
> @@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   return NULL;
>   if (is_zero_pfn(pfn))
>   return NULL;
> - if (pte_devmap(pte))
> - return NULL;
>  
>   print_bad_pte(vma, addr, pte, NULL);
>   return NULL;

... what?

Haven't you just made it so that a devmap page always prints a bad PTE
message, and then returns NULL anyway?

Surely this should be:

if (pte_devmap(pte))
-   return NULL;
+   return pfn_to_page(pfn);

or maybe

+   goto check_pfn;

But I don't know about that highest_memmap_pfn check.

> @@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   return pfn_to_page(pfn);
>  }
>  
> +/*
> + * vm_normal_lru_page -- This function gets the "struct page" associated
> + * with a pte only for page cache and anon page. These pages are LRU handled.
> + */
> +struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long 
> addr,
> + pte_t pte)

It seems a shame to add a new function without proper kernel-doc.

Re: [PATCH RFC v2] mm: Add f_ops->populate()

2022-03-07 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 03:41:54PM -0800, Dave Hansen wrote:
> In short: page faults stink.  The core kernel has lots of ways of
> avoiding page faults like madvise(MADV_WILLNEED) or mmap(MAP_POPULATE).
>  But, those only work on normal RAM that the core mm manages.
> 
> SGX is weird.  SGX memory is managed outside the core mm.  It doesn't
> have a 'struct page' and get_user_pages() doesn't work on it.  Its VMAs
> are marked with VM_IO.  So, none of the existing methods for avoiding
> page faults work on SGX memory.
> 
> This essentially helps extend existing "normal RAM" kernel ABIs to work
> for avoiding faults for SGX too.  SGX users want to enjoy all of the
> benefits of a delayed allocation policy (better resource use,
> overcommit, NUMA affinity) but without the cost of millions of faults.

We have a mechanism for dynamically reducing the number of page faults
already; it's just buried in the page cache code.  You have vma->vm_file,
which contains a file_ra_state.  You can use this to track where
recent faults have been and grow the size of the region you fault in
per page fault.  You don't have to (indeed probably don't want to) use
the same algorithm as the page cache, but the _principle_ is the same --
were recent speculative faults actually used; should we grow the number
of pages actually faulted in, or is this a random sparse workload where
we want to allocate individual pages.

Don't rely on the user to ask.  They don't know.

Re: [PATCH RFC 1/3] mm: Add f_ops->populate()

2022-03-06 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 07:02:57PM +0200, Jarkko Sakkinen wrote:
> So can I conclude from this that in general having populate available for
> device memory is something horrid, or just the implementation path?

You haven't even attempted to explain what the problem is you're trying
to solve.  You've shown up with some terrible code and said "Hey, is
this a good idea".  No, no, it's not.

Re: [PATCH RFC 0/3] MAP_POPULATE for device memory

2022-03-06 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote:
> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> to use that for initializing the device memory by providing a new callback
> f_ops->populate() for the purpose.

As I said, NAK.

Re: [PATCH RFC] mm: Add f_ops->populate()

2022-03-05 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 06:25:52AM +0200, Jarkko Sakkinen wrote:
> > Are you deliberately avoiding the question?  I'm not asking about
> > implementation.  I'm asking about the semantics of MAP_POPULATE with
> > your driver.
> 
> No. I just noticed a bug in the guard from your comment that I wanted
> point out.
> 
> With the next version I post the corresponding change to the driver,
> in order to see this in context.

Oh, good grief.  Don't bother.  NAK.

Re: [PATCH RFC] mm: Add f_ops->populate()

2022-03-05 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 06:11:21AM +0200, Jarkko Sakkinen wrote:
> On Sun, Mar 06, 2022 at 03:52:12AM +0000, Matthew Wilcox wrote:
> > On Sun, Mar 06, 2022 at 05:21:11AM +0200, Jarkko Sakkinen wrote:
> > > On Sun, Mar 06, 2022 at 02:57:55AM +0000, Matthew Wilcox wrote:
> > > > On Sun, Mar 06, 2022 at 04:15:33AM +0200, Jarkko Sakkinen wrote:
> > > > > Sometimes you might want to use MAP_POPULATE to ask a device driver to
> > > > > initialize the device memory in some specific manner. SGX driver can 
> > > > > use
> > > > > this to request more memory by issuing ENCLS[EAUG] x86 opcode for each
> > > > > page in the address range.
> > > > > 
> > > > > Add f_ops->populate() with the same parameters as f_ops->mmap() and 
> > > > > make
> > > > > it conditionally called inside call_mmap(). Update call sites
> > > > > accodingly.
> > > > 
> > > > Your device driver has a ->mmap operation.  Why does it need another
> > > > one?  More explanation required here.
> > > 
> > > f_ops->mmap() would require an additional parameter, which results
> > > heavy refactoring.
> > > 
> > > struct file_operations has 1125 references in the kernel tree, so I
> > > decided to check this way around first. 
> > 
> > Are you saying that your device driver behaves differently if
> > MAP_POPULATE is set versus if it isn't?  That seems hideously broken.
> 
> MAP_POPULATE does not do anything (according to __mm_populate in mm/gup.c)
> with VMA's that have some sort of device/IO memory, i.e. vm_flags
> intersecting with VM_PFNMAP | VM_IO.
> 
> I can extend the guard obviously to:
> 
> if (!ret && do_populate && file->f_op->populate &&
> !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> file->f_op->populate(file, vma);

Are you deliberately avoiding the question?  I'm not asking about
implementation.  I'm asking about the semantics of MAP_POPULATE with
your driver.

Re: [PATCH RFC] mm: Add f_ops->populate()

2022-03-05 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 05:21:11AM +0200, Jarkko Sakkinen wrote:
> On Sun, Mar 06, 2022 at 02:57:55AM +0000, Matthew Wilcox wrote:
> > On Sun, Mar 06, 2022 at 04:15:33AM +0200, Jarkko Sakkinen wrote:
> > > Sometimes you might want to use MAP_POPULATE to ask a device driver to
> > > initialize the device memory in some specific manner. SGX driver can use
> > > this to request more memory by issuing ENCLS[EAUG] x86 opcode for each
> > > page in the address range.
> > > 
> > > Add f_ops->populate() with the same parameters as f_ops->mmap() and make
> > > it conditionally called inside call_mmap(). Update call sites
> > > accodingly.
> > 
> > Your device driver has a ->mmap operation.  Why does it need another
> > one?  More explanation required here.
> 
> f_ops->mmap() would require an additional parameter, which results
> heavy refactoring.
> 
> struct file_operations has 1125 references in the kernel tree, so I
> decided to check this way around first. 

Are you saying that your device driver behaves differently if
MAP_POPULATE is set versus if it isn't?  That seems hideously broken.

Re: [PATCH RFC] mm: Add f_ops->populate()

2022-03-05 Thread Matthew Wilcox

On Sun, Mar 06, 2022 at 04:15:33AM +0200, Jarkko Sakkinen wrote:
> Sometimes you might want to use MAP_POPULATE to ask a device driver to
> initialize the device memory in some specific manner. SGX driver can use
> this to request more memory by issuing ENCLS[EAUG] x86 opcode for each
> page in the address range.
> 
> Add f_ops->populate() with the same parameters as f_ops->mmap() and make
> it conditionally called inside call_mmap(). Update call sites
> accodingly.

Your device driver has a ->mmap operation.  Why does it need another
one?  More explanation required here.

Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-03-01 Thread Matthew Wilcox

On Tue, Mar 01, 2022 at 10:14:07AM -0800, Kees Cook wrote:
> On Mon, Feb 28, 2022 at 04:45:11PM -0800, Linus Torvalds wrote:
> > Really. The "-Wshadow doesn't work on the kernel" is not some new
> > issue, because you have to do completely insane things to the source
> > code to enable it.
> 
> The first big glitch with -Wshadow was with shadowed global variables.
> GCC 4.8 fixed that, but it still yells about shadowed functions. What
> _almost_ works is -Wshadow=local. At first glace, all the warnings
> look solvable, but then one will eventually discover __wait_event()
> and associated macros that mix when and how deeply it intentionally
> shadows variables. :)

Well, that's just disgusting.  Macros fundamentally shouldn't be
referring to things that aren't in their arguments.  The first step to
cleaning this up is ...

I'll take a look at the rest of cleaning this up soon.

>From 28ffe35d56223d4242b915832299e5acc926737e Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" 
Date: Tue, 1 Mar 2022 13:47:07 -0500
Subject: [PATCH] wait: Parameterize the return variable to ___wait_event()

Macros should not refer to variables which aren't in their arguments.
Pass the name from its callers.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/swait.h| 12 ++--
 include/linux/wait.h | 32 
 include/linux/wait_bit.h |  4 ++--
 3 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/swait.h b/include/linux/swait.h
index 6a8c22b8c2a5..5e8e9b13be2d 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -191,14 +191,14 @@ do {  
\
 } while (0)
 
 #define __swait_event_timeout(wq, condition, timeout)  \
-   ___swait_event(wq, ___wait_cond_timeout(condition), \
+   ___swait_event(wq, ___wait_cond_timeout(condition, __ret),  \
  TASK_UNINTERRUPTIBLE, timeout,\
  __ret = schedule_timeout(__ret))
 
 #define swait_event_timeout_exclusive(wq, condition, timeout)  \
 ({ \
long __ret = timeout;   \
-   if (!___wait_cond_timeout(condition))   \
+   if (!___wait_cond_timeout(condition, __ret))\
__ret = __swait_event_timeout(wq, condition, timeout);  \
__ret;  \
 })
@@ -216,14 +216,14 @@ do {  
\
 })
 
 #define __swait_event_interruptible_timeout(wq, condition, timeout)\
-   ___swait_event(wq, ___wait_cond_timeout(condition), \
+   ___swait_event(wq, ___wait_cond_timeout(condition, __ret),  \
  TASK_INTERRUPTIBLE, timeout,  \
  __ret = schedule_timeout(__ret))
 
 #define swait_event_interruptible_timeout_exclusive(wq, condition, timeout)\
 ({ \
long __ret = timeout;   \
-   if (!___wait_cond_timeout(condition))   \
+   if (!___wait_cond_timeout(condition, __ret))\
__ret = __swait_event_interruptible_timeout(wq, \
condition, timeout);\
__ret;  \
@@ -252,7 +252,7 @@ do {
\
 } while (0)
 
 #define __swait_event_idle_timeout(wq, condition, timeout) \
-   ___swait_event(wq, ___wait_cond_timeout(condition), \
+   ___swait_event(wq, ___wait_cond_timeout(condition, __ret),  \
   TASK_IDLE, timeout,  \
   __ret = schedule_timeout(__ret))
 
@@ -278,7 +278,7 @@ do {
\
 #define swait_event_idle_timeout_exclusive(wq, condition, timeout) \
 ({ \
long __ret = timeout;   \
-   if (!___wait_cond_timeout(condition))   \
+   if (!___wait_cond_timeout(condition, __ret))\
__ret = __swait_event_idle_timeout(wq,  \
   condition, timeout); \
__ret;  \
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 851e07da2583..890cce3c0f2e 1006

Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Matthew Wilcox

On Mon, Feb 28, 2022 at 12:37:15PM -0800, Linus Torvalds wrote:
> On Mon, Feb 28, 2022 at 12:16 PM Matthew Wilcox  wrote:
> >
> > Then we can never use -Wshadow ;-(  I'd love to be able to turn it on;
> > it catches real bugs.
> 
> Oh, we already can never use -Wshadow regardless of things like this.
> That bridge hasn't just been burned, it never existed in the first
> place.
> 
> The whole '-Wshadow' thing simply cannot work with local variables in
> macros - something that we've used since day 1.
> 
> Try this (as a "p.c" file):
> 
> #define min(a,b) ({ \
> typeof(a) __a = (a);\
> typeof(b) __b = (b);\
> __a < __b ? __a : __b; })
> 
> int min3(int a, int b, int c)
> {
> return min(a,min(b,c));
> }
> 
> and now do "gcc -O2 -S t.c".
> 
> Then try it with -Wshadow.

#define ___PASTE(a, b)  a##b
#define __PASTE(a, b) ___PASTE(a, b)
#define _min(a, b, u) ({ \
typeof(a) __PASTE(__a,u) = (a);\
typeof(b) __PASTE(__b,u) = (b);\
__PASTE(__a,u) < __PASTE(__b,u) ? __PASTE(__a,u) : __PASTE(__b,u); })

#define min(a, b) _min(a, b, __COUNTER__)

int min3(int a, int b, int c)
{
return min(a,min(b,c));
}

(probably there's a more elegant way to do this)

Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread Matthew Wilcox

On Mon, Feb 28, 2022 at 12:10:24PM -0800, Linus Torvalds wrote:
> We can do
> 
> typeof(pos) pos
> 
> in the 'for ()' loop, and never use __iter at all.
> 
> That means that inside the for-loop, we use a _different_ 'pos' than outside.

Then we can never use -Wshadow ;-(  I'd love to be able to turn it on;
it catches real bugs.

> +#define list_for_each_entry(pos, head, member)   
> \
> + for (typeof(pos) pos = list_first_entry(head, typeof(*pos), member);
> \
> +  !list_entry_is_head(pos, head, member);\
>pos = list_next_entry(pos, member))

Re: [PATCH v12 1/6] drm: Add arch arm64 for drm_clflush_virt_range

2022-02-25 Thread Matthew Wilcox

On Fri, Feb 25, 2022 at 06:42:37PM +, Tvrtko Ursulin wrote:
> Matthew, what do you think fix for this build warning on h8300 and s390 
> should be? Or perhaps a build environment issue with kernel test robot?

I'd suggest this should do the job:

+++ b/include/linux/cacheflush.h
@@ -4,6 +4,8 @@

 #include 

+struct folio;
+
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
 #ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
 void flush_dcache_folio(struct folio *folio);

Re: [PATCH 00/16] DEPT(Dependency Tracker)

2022-02-17 Thread Matthew Wilcox

On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
> On Thu, 17 Feb 2022 10:51:09 -0500
> "Theodore Ts'o"  wrote:
> 
> > I know that you're trying to help us, but this tool needs to be far
> > better than Lockdep before we should think about merging it.  Even if
> > it finds 5% more potential deadlocks, if it creates 95% more false
> > positive reports --- and the ones it finds are crazy things that
> > rarely actually happen in practice, are the costs worth the benefits?
> > And who is bearing the costs, and who is receiving the benefits?
> 
> I personally believe that there's potential that this can be helpful and we
> will want to merge it.
> 
> But, what I believe Ted is trying to say is, if you do not know if the
> report is a bug or not, please do not ask the maintainers to determine it
> for you. This is a good opportunity for you to look to see why your tool
> reported an issue, and learn that subsystem. Look at if this is really a
> bug or not, and investigate why.

I agree with Steven here, to the point where I'm willing to invest some
time being a beta-tester for this, so if you focus your efforts on
filesystem/mm kinds of problems, I can continue looking at them and
tell you what's helpful and what's unhelpful in the reports.

Re: Report 1 in ext4 and journal based on v5.17-rc1

2022-02-17 Thread Matthew Wilcox

On Thu, Feb 17, 2022 at 08:10:03PM +0900, Byungchul Park wrote:
> [7.009608] ===
> [7.009613] DEPT: Circular dependency has been detected.
> [7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: GW
> [7.009616] ---
> [7.009617] summary
> [7.009618] ---
> [7.009618] *** DEADLOCK ***
> [7.009618]
> [7.009619] context A
> [7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0)

Why is the context unknown here?  I don't see a way to debug this
without knowing where we acquired the bit wait lock.

> [7.009621] [W] down_write(&ei->i_data_sem:0)
> [7.009623] [E] event(&(bit_wait_table + i)->dmap:0)
> [7.009624]
> [7.009625] context B
> [7.009625] [S] down_read(&ei->i_data_sem:0)
> [7.009626] [W] wait(&(bit_wait_table + i)->dmap:0)
> [7.009627] [E] up_read(&ei->i_data_sem:0)
> [7.009628]
> [7.009629] [S]: start of the event context
> [7.009629] [W]: the wait blocked
> [7.009630] [E]: the event not reachable
> [7.009631] ---
> [7.009631] context A's detail
> [7.009632] ---
> [7.009632] context A
> [7.009633] [S] (unknown)(&(bit_wait_table + i)->dmap:0)
> [7.009634] [W] down_write(&ei->i_data_sem:0)
> [7.009635] [E] event(&(bit_wait_table + i)->dmap:0)
> [7.009636]
> [7.009636] [S] (unknown)(&(bit_wait_table + i)->dmap:0):
> [7.009638] (N/A)
> [7.009638]
> [7.009639] [W] down_write(&ei->i_data_sem:0):
> [7.009639] ext4_truncate (fs/ext4/inode.c:4187) 
> [7.009645] stacktrace:
> [7.009646] down_write (kernel/locking/rwsem.c:1514) 
> [7.009648] ext4_truncate (fs/ext4/inode.c:4187) 
> [7.009650] ext4_da_write_begin (./include/linux/fs.h:827 
> fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) 
> [7.009652] generic_perform_write (mm/filemap.c:3784) 
> [7.009654] ext4_buffered_write_iter (fs/ext4/file.c:269) 
> [7.009657] ext4_file_write_iter (fs/ext4/file.c:677) 
> [7.009659] new_sync_write (fs/read_write.c:504 (discriminator 1)) 
> [7.009662] vfs_write (fs/read_write.c:590) 
> [7.009663] ksys_write (fs/read_write.c:644) 
> [7.009664] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80) 
> [7.009667] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) 
> [7.009669]
> [7.009670] [E] event(&(bit_wait_table + i)->dmap:0):
> [7.009671] __wake_up_common (kernel/sched/wait.c:108) 
> [7.009673] stacktrace:
> [7.009674] dept_event (kernel/dependency/dept.c:2337) 
> [7.009677] __wake_up_common (kernel/sched/wait.c:109) 
> [7.009678] __wake_up_common_lock (./include/linux/spinlock.h:428 
> (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) 
> [7.009679] __wake_up_bit (kernel/sched/wait_bit.c:127) 
> [7.009681] ext4_orphan_del (fs/ext4/orphan.c:282) 
> [7.009683] ext4_truncate (fs/ext4/inode.c:4212) 
> [7.009685] ext4_da_write_begin (./include/linux/fs.h:827 
> fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) 
> [7.009687] generic_perform_write (mm/filemap.c:3784) 
> [7.009688] ext4_buffered_write_iter (fs/ext4/file.c:269) 
> [7.009690] ext4_file_write_iter (fs/ext4/file.c:677) 
> [7.009692] new_sync_write (fs/read_write.c:504 (discriminator 1)) 
> [7.009694] vfs_write (fs/read_write.c:590) 
> [7.009695] ksys_write (fs/read_write.c:644) 
> [7.009696] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80) 
> [7.009698] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) 
> [7.009700] ---
> [7.009700] context B's detail
> [7.009701] ---
> [7.009702] context B
> [7.009702] [S] down_read(&ei->i_data_sem:0)
> [7.009703] [W] wait(&(bit_wait_table + i)->dmap:0)
> [7.009704] [E] up_read(&ei->i_data_sem:0)
> [7.009705]
> [7.009706] [S] down_read(&ei->i_data_sem:0):
> [7.009707] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 
> ./include/asm-generic/bitops/instrumented-non-atomic.h:135 
> fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) 
> [7.009709] stacktrace:
> [7.009709] down_read (kernel/locking/rwsem.c:1461) 
> [7.009711] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 
> ./include/asm-generic/bitops/instrumented-non-atomic.h:135 
> fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) 
> [7.009712] ext4_getblk (fs/ext4/inode.c:851) 
> [7.009714] ext4_bread (fs/ext4/inode.c:903) 
> [7.009715] __ext4_read_dirblock (fs/ext4/namei.c:117) 
> [7.009718] dx_probe (fs/ext4/namei.c:789) 
> [7.009720] ext4_dx_find_entry (fs/ext4/namei.c:1721) 
> [7.009722] __ext4_find_entry

Re: Report in unix_stream_read_generic()

2022-02-15 Thread Matthew Wilcox

On Wed, Feb 16, 2022 at 01:17:03PM +0900, Byungchul Park wrote:
> [7.013330] ===
> [7.013331] DEPT: Circular dependency has been detected.
> [7.013332] 5.17.0-rc1-00014-gcf3441bb2012 #2 Tainted: GW
> [7.01] ---
> [7.013334] summary
> [7.013334] ---
> [7.013335] *** DEADLOCK ***
> [7.013335] 
> [7.013335] context A
> [7.013336] [S] (unknown)(&(&ei->socket.wq.wait)->dmap:0)
> [7.013337] [W] __mutex_lock_common(&u->iolock:0)
> [7.013338] [E] event(&(&ei->socket.wq.wait)->dmap:0)
> [7.013340] 
> [7.013340] context B
> [7.013341] [S] __raw_spin_lock(&u->lock:0)
> [7.013342] [W] wait(&(&ei->socket.wq.wait)->dmap:0)
> [7.013343] [E] spin_unlock(&u->lock:0)

This seems unlikely to be real.  We're surely not actually waiting
while holding a spinlock; existing debug checks would catch it.

> [7.013407] ---
> [7.013407] context B's detail
> [7.013408] ---
> [7.013408] context B
> [7.013409] [S] __raw_spin_lock(&u->lock:0)
> [7.013410] [W] wait(&(&ei->socket.wq.wait)->dmap:0)
> [7.013411] [E] spin_unlock(&u->lock:0)
> [7.013412] 
> [7.013412] [S] __raw_spin_lock(&u->lock:0):
> [7.013413] [] unix_stream_read_generic+0x6bf/0xb60
> [7.013416] stacktrace:
> [7.013416]   _raw_spin_lock+0x6e/0x90
> [7.013418]   unix_stream_read_generic+0x6bf/0xb60

It would be helpful if you'd run this through scripts/decode_stacktrace.sh
so we could see line numbers instead of hex offsets (which arene't much
use without the binary kernel).

> [7.013420]   unix_stream_recvmsg+0x40/0x50
> [7.013422]   sock_read_iter+0x85/0xd0
> [7.013424]   new_sync_read+0x162/0x180
> [7.013426]   vfs_read+0xf3/0x190
> [7.013428]   ksys_read+0xa6/0xc0
> [7.013429]   do_syscall_64+0x3a/0x90
> [7.013431]   entry_SYSCALL_64_after_hwframe+0x44/0xae
> [7.013433] 
> [7.013434] [W] wait(&(&ei->socket.wq.wait)->dmap:0):
> [7.013434] [] prepare_to_wait+0x47/0xd0

... this may be the source of confusion.  Just because we prepare to
wait doesn't mean we end up actually waiting.  For example, look at
unix_wait_for_peer():

prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);

sched = !sock_flag(other, SOCK_DEAD) &&
!(other->sk_shutdown & RCV_SHUTDOWN) &&
unix_recvq_full(other);

unix_state_unlock(other);

if (sched)
timeo = schedule_timeout(timeo);

finish_wait(&u->peer_wait, &wait);

We *prepare* to wait, *then* drop the lock, then actually schedule.

Re: Phyr Starter

2022-01-12 Thread Matthew Wilcox

On Tue, Jan 11, 2022 at 06:53:06PM -0400, Jason Gunthorpe wrote:
> IOMMU is not common in those cases, it is slow.
> 
> So you end up with 16 bytes per entry then another 24 bytes in the
> entirely redundant scatter list. That is now 40 bytes/page for typical
> HPC case, and I can't see that being OK.

Ah, I didn't realise what case you wanted to optimise for.

So, how about this ...

Since you want to get to the same destination as I do (a
16-byte-per-entry dma_addr+dma_len struct), but need to get there sooner
than "make all sg users stop using it wrongly", let's introduce a
(hopefully temporary) "struct dma_range".

But let's go further than that (which only brings us to 32 bytes per
range).  For the systems you care about which use an identity mapping,
and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply
point the dma_range pointer to the same memory as the phyr.  We just
have to not free it too early.  That gets us down to 16 bytes per range,
a saving of 33%.

Re: Phyr Starter

2022-01-11 Thread Matthew Wilcox

On Tue, Jan 11, 2022 at 04:21:59PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 06:33:57PM +0000, Matthew Wilcox wrote:
> 
> > > Then we are we using get_user_phyr() at all if we are just storing it
> > > in a sg?
> > 
> > I did consider just implementing get_user_sg() (actually 4 years ago),
> > but that cements the use of sg as both an input and output data structure
> > for DMA mapping, which I am under the impression we're trying to get
> > away from.
> 
> I know every time I talked about a get_user_sg() Christoph is against
> it and we need to stop using scatter list...
> 
> > > Also 16 entries is way to small, it should be at least a whole PMD
> > > worth so we don't have to relock the PMD level each iteration.
> > > 
> > > I would like to see a flow more like:
> > > 
> > >   cpu_phyr_list = get_user_phyr(uptr, 1G);
> > >   dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
> > >   [..]
> > >   dma_unmap_phyr(device, dma_phyr_list);
> > >   unpin_drity_free(cpu_phy_list);
> > > 
> > > Where dma_map_phyr() can build a temporary SGL for old iommu drivers
> > > compatability. iommu drivers would want to implement natively, of
> > > course.
> > > 
> > > ie no loops in drivers.
> > 
> > Let me just rewrite that for you ...
> > 
> > umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
> > umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
> > DMA_BIDIRECTIONAL, dma_attr);
> > ...
> > dma_unmap_phyr(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl,
> > umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr);
> > sg_free_table(umem->sgt);
> > free_user_phyrs(umem->phyrs, umem->phyr_len);
> 
> Why? As above we want to get rid of the sgl, so you are telling me to
> adopt phyrs I need to increase the memory consumption by a hefty
> amount to store the phyrs and still keep the sgt now? Why?
> 
> I don't need the sgt at all. I just need another list of physical
> addresses for DMA. I see no issue with a phsr_list storing either CPU
> Physical Address or DMA Physical Addresses, same data structure.

There's a difference between a phys_addr_t and a dma_addr_t.  They
can even be different sizes; some architectures use a 32-bit dma_addr_t
and a 64-bit phys_addr_t or vice-versa.  phyr cannot store DMA addresses.

> In the fairly important passthrough DMA case the CPU list and DMA list
> are identical, so we don't even need to do anything.
> 
> In the typical iommu case my dma map's phyrs is only one entry.

That becomes a very simple sg table then.

> As an example coding - Use the first 8 bytes to encode this:
> 
>  51:0 - Physical address / 4k (ie pfn)
>  56:52 - Order (simple, your order encoding can do better)
>  61:57 - Unused
>  63:62 - Mode, one of:
>  00 = natural order pfn (8 bytes)
>  01 = order aligned with length (12 bytes)
>  10 = arbitary (12 bytes)
> 
> Then the optional 4 bytes are used as:
> 
> Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
>   31:0 - # of order pages
> 
> Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
>   11:0 - starting byte offset in the 4k
>   31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
>   length in bytes

Honestly, this looks awful to operate on.  Mandatory 8-bytes per entry
with an optional 4 byte extension?

> > > The last case is, perhaps, a possible route to completely replace
> > > scatterlist. Few places need true byte granularity for interior pages,
> > > so we can invent some coding to say 'this is 8 byte aligned, and n
> > > bytes long' that only fits < 4k or something. Exceptional cases can
> > > then still work. I'm not sure what block needs here - is it just 512?
> > 
> > Replacing scatterlist is not my goal.  That seems like a lot more work
> > for little gain.  
> 
> Well, I'm not comfortable with the idea above where RDMA would have to
> take a memory penalty to use the new interface. To avoid that memory
> penalty we need to get rid of scatterlist entirely.
> 
> If we do the 16 byte struct from the first email then a umem for MRs
> will increase in memory consumption by 160% compared today's 24
> bytes/page. I think the HPC workloads will veto this.

Huh?  We do 16 bytes per physically contiguous range.  Then, if your HPC
workloads use an IOMMU that can map a virtually contiguous range
into a single sg entry, it uses 24 bytes for the entire mapping.
It s

Re: Phyr Starter

2022-01-11 Thread Matthew Wilcox

On Tue, Jan 11, 2022 at 11:01:42AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> > > 
> > > > Finally, it may be possible to stop using scatterlist to describe the
> > > > input to the DMA-mapping operation.  We may be able to get struct
> > > > scatterlist down to just dma_address and dma_length, with chaining
> > > > handled through an enclosing struct.
> > > 
> > > Can you talk about this some more? IMHO one of the key properties of
> > > the scatterlist is that it can hold huge amounts of pages without
> > > having to do any kind of special allocation due to the chaining.
> > > 
> > > The same will be true of the phyr idea right?
> > 
> > My thinking is that we'd pass a relatively small array of phyr (maybe 16
> > entries) to get_user_phyr().  If that turned out not to be big enough,
> > then we have two options; one is to map those 16 ranges with sg and use
> > the sg chaining functionality before throwing away the phyr and calling
> > get_user_phyr() again. 
> 
> Then we are we using get_user_phyr() at all if we are just storing it
> in a sg?

I did consider just implementing get_user_sg() (actually 4 years ago),
but that cements the use of sg as both an input and output data structure
for DMA mapping, which I am under the impression we're trying to get
away from.

> Also 16 entries is way to small, it should be at least a whole PMD
> worth so we don't have to relock the PMD level each iteration.
> 
> I would like to see a flow more like:
> 
>   cpu_phyr_list = get_user_phyr(uptr, 1G);
>   dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
>   [..]
>   dma_unmap_phyr(device, dma_phyr_list);
>   unpin_drity_free(cpu_phy_list);
> 
> Where dma_map_phyr() can build a temporary SGL for old iommu drivers
> compatability. iommu drivers would want to implement natively, of
> course.
> 
> ie no loops in drivers.

Let me just rewrite that for you ...

umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
DMA_BIDIRECTIONAL, dma_attr);
...
dma_unmap_phyr(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl,
umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr);
sg_free_table(umem->sgt);
free_user_phyrs(umem->phyrs, umem->phyr_len);

> > The question is whether this is the right kind of optimisation to be
> > doing.  I hear you that we want a dense format, but it's questionable
> > whether the kind of thing you're suggesting is actually denser than this
> > scheme.  For example, if we have 1GB pages and userspace happens to have
> > allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> > as a single phyr.  A power-of-two scheme would have us use four entries
> > (3, 4-7, 8-9, 10).
> 
> That is not quite what I had in mind..
> 
> struct phyr_list {
>unsigned int first_page_offset_bytes;
>size_t total_length_bytes;
>phys_addr_t min_alignment;
>struct packed_phyr *list_of_pages;
> };
> 
> Where each 'packed_phyr' is an aligned page of some kind. The packing
> has to be able to represent any number of pfns, so we have four major
> cases:
>  - 4k pfns (use 8 bytes)
>  - Natural order pfn (use 8 bytes)
>  - 4k aligned pfns, arbitary number (use 12 bytes)
>  - <4k aligned, arbitary length (use 16 bytes?)
> 
> In all cases the interior pages are fully used, only the first and
> last page is sliced based on the two parameters in the phyr_list.

This kind of representation works for a virtually contiguous range.
Unfortunately, that's not sufficient for some bio users (as I discovered
after getting a representation like this enshrined in the NVMe spec as
the PRP List).

> The last case is, perhaps, a possible route to completely replace
> scatterlist. Few places need true byte granularity for interior pages,
> so we can invent some coding to say 'this is 8 byte aligned, and n
> bytes long' that only fits < 4k or something. Exceptional cases can
> then still work. I'm not sure what block needs here - is it just 512?

Replacing scatterlist is not my goal.  That seems like a lot more work
for little gain.  I just want to delete page_link, offset and length
from struct scatterlist.  Given the above sequence of calls, we're going
to get sg lists that aren't chained.  They may have to be vmalloced,
but th

Re: Phyr Starter

2022-01-11 Thread Matthew Wilcox

On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote:
> Zooming in on the pinning aspect for a moment: last time I attempted to
> convert O_DIRECT callers from gup to pup, I recall wanting very much to
> record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
> or some non-FOLL_PIN method. Because at the end of the IO, it is not
> easy to disentangle which pages require put_page() and which require
> unpin_user_page*().
> 
> And changing the bio_vec for *that* purpose was not really acceptable.
> 
> But now that you're looking to change it in a big way (and with some
> spare bits avaiable...oohh!), maybe I can go that direction after all.
> 
> Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
> if it exists at all?

That.  I think there's still good reasons to keep a single-page (or
maybe dual-page) GUP around, but no reason to mix it with ranges.

> Or any other thoughts in this area are very welcome.

That's there's no support for unpinning part of a range.  You pin it,
do the IO, unpin it.  That simplifies the accounting.

Re: Phyr Starter

2022-01-11 Thread Matthew Wilcox

On Tue, Jan 11, 2022 at 12:40:10PM +0100, Thomas Zimmermann wrote:
> Hi
> 
> Am 10.01.22 um 20:34 schrieb Matthew Wilcox:
> > TLDR: I want to introduce a new data type:
> > 
> > struct phyr {
> >  phys_addr_t addr;
> >  size_t len;
> > };
> 
> Did you look at struct dma_buf_map? [1]

Thanks.  I wasn't aware of that.  It doesn't seem to actually solve the
problem, in that it doesn't carry any length information.  Did you mean
to point me at a different structure?

Re: Phyr Starter

2022-01-10 Thread Matthew Wilcox

On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> 
> > Finally, it may be possible to stop using scatterlist to describe the
> > input to the DMA-mapping operation.  We may be able to get struct
> > scatterlist down to just dma_address and dma_length, with chaining
> > handled through an enclosing struct.
> 
> Can you talk about this some more? IMHO one of the key properties of
> the scatterlist is that it can hold huge amounts of pages without
> having to do any kind of special allocation due to the chaining.
> 
> The same will be true of the phyr idea right?

My thinking is that we'd pass a relatively small array of phyr (maybe 16
entries) to get_user_phyr().  If that turned out not to be big enough,
then we have two options; one is to map those 16 ranges with sg and use
the sg chaining functionality before throwing away the phyr and calling
get_user_phyr() again.  The other is to stash those 16 ranges somewhere
(eg a resizing array of some kind) and keep calling get_user_phyr()
to get the next batch of 16; once we've got the entire range, call
sg_map_phyr() passing all of the phyrs.

> > I would like to see phyr replace bio_vec everywhere it's currently used.
> > I don't have time to do that work now because I'm busy with folios.
> > If someone else wants to take that on, I shall cheer from the sidelines.
> > What I do intend to do is:
> 
> I wonder if we mixed things though..
> 
> IMHO there is a lot of optimization to be had by having a
> datastructure that is expressly 'the physical pages underlying a
> contiguous chunk of va'
> 
> If you limit to that scenario then we can be more optimal because
> things like byte granular offsets and size in the interior pages don't
> need to exist. Every interior chunk is always aligned to its order and
> we only need to record the order.
> 
> An overall starting offset and total length allow computing the slice
> of the first/last entry.
> 
> If the physical address is always aligned then we get 12 free bits
> from the min 4k alignment and also only need to store order, not an
> arbitary byte granular length.
> 
> The win is I think we can meaningfully cover most common cases using
> only 8 bytes per physical chunk. The 12 bits can be used to encode the
> common orders (4k, 2M, 1G, etc) and some smart mechanism to get
> another 16 bits to cover 'everything'.
> 
> IMHO storage density here is quite important, we end up having to keep
> this stuff around for a long time.
> 
> I say this here, because I've always though bio_vec/etc are more
> general than we actually need, being byte granular at every chunk.

Oh, I can do you one better on the bit-packing scheme.  There's a
representation of every power-of-two that is naturally aligned, using
just one extra bit.  Let's say we have 3 bits of address space and
4 bits to represent any power of two allocation within that address
space:

 index-0, order-0
0010 index-1, order-0
...
1110 index-7, order-0
0001 index-0, order-1
0101 index-2, order-1
1001 index-4, order-1
1101 index-6, order-1
0011 index-0, order-2
1011 index-4, order-2
0111 index-0, order-3

 has no meaning and can be used to represent an invalid range, if
that's useful.  The lowest clear bit decodes to the order, and
(x & (x+1))/2 gets you the index.

That leaves you with another 11 bits to represent something smart about
partial pages.

The question is whether this is the right kind of optimisation to be
doing.  I hear you that we want a dense format, but it's questionable
whether the kind of thing you're suggesting is actually denser than this
scheme.  For example, if we have 1GB pages and userspace happens to have
allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
as a single phyr.  A power-of-two scheme would have us use four entries
(3, 4-7, 8-9, 10).

Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very
cheap.  If I have to walk PTEs looking for pages which can be combined
together, I end up with interesting behaviour where the length of the
list shrinks and expands.  Using the example above, as I walk successive
PUDs, the data struct looks like this:

(3)
(3, 4)
(3, 4-5)
(3, 4-5, 6)
(3, 4-7)
(3, 4-7, 8)
(3, 4-7, 8-9)
(3, 4-7, 8-9, 10)

We could end up with a situation where we stop because the array is
full, even though if we kept going, it'd shrink back down below the
length of the array (in this example, an array of length 2 would stop
when it saw page 6, even though page 7 shrinks it back down again).

> What is needed is a full scatterlist replacement, including the IOMMU
> part.
> 
> For the IOMMU I would expect the datastructure to be re-used, we start
>

Phyr Starter

2022-01-10 Thread Matthew Wilcox

TLDR: I want to introduce a new data type:

struct phyr {
phys_addr_t addr;
size_t len;
};

and use it to replace bio_vec as well as using it to replace the array
of struct pages used by get_user_pages() and friends.

---

There are two distinct problems I want to address: doing I/O to memory
which does not have a struct page and efficiently doing I/O to large
blobs of physically contiguous memory, regardless of whether it has a
struct page.  There are some other improvements which I regard as minor.

There are many types of memory that one might want to do I/O to that do
not have a struct page, some examples:
 - Memory on a graphics card (or other PCI card, but gfx seems to be
   the primary provider of DRAM on the PCI bus today)
 - DAX, or other pmem (there are some fake pages today, but this is
   mostly a workaround for the IO problem today)
 - Guest memory being accessed from the hypervisor (KVM needs to
   create structpages to make this happen.  Xen doesn't ...)
All of these kinds of memories can be addressed by the CPU and so also
by a bus master.  That is, there is a physical address that the CPU
can use which will address this memory, and there is a way to convert
that to a DMA address which can be programmed into another device.
There's no intent here to support memory which can be accessed by a
complex scheme like writing an address to a control register and then
accessing the memory through a FIFO; this is for memory which can be
accessed by DMA and CPU loads and stores.

For get_user_pages() and friends, we currently fill an array of struct
pages, each one representing PAGE_SIZE bytes.  For an application that
is using 1GB hugepages, writing 2^18 entries is a significant overhead.
It also makes drivers hard to write as they have to recoalesce the
struct pages, even though the VM can tell it whether those 2^18 pages
are contiguous.

On the minor side, struct phyr can represent any mappable chunk of memory.
A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr
can represent larger than 4GB.  A phyr is the same size as a bio_vec
on 64 bit (16 bytes), and the same size for 32-bit with PAE (12 bytes).
It is smaller for 32-bit machines without PAE (8 bytes instead of 12).

Finally, it may be possible to stop using scatterlist to describe the
input to the DMA-mapping operation.  We may be able to get struct
scatterlist down to just dma_address and dma_length, with chaining
handled through an enclosing struct.

I would like to see phyr replace bio_vec everywhere it's currently used.
I don't have time to do that work now because I'm busy with folios.
If someone else wants to take that on, I shall cheer from the sidelines.
What I do intend to do is:

 - Add an interface to gup.c to pin/unpin N phyrs
 - Add a sg_map_phyrs()
   This will take an array of phyrs and allocate an sg for them
 - Whatever else I need to do to make one RDMA driver happy with
   this scheme

At that point, I intend to stop and let others more familiar with this
area of the kernel continue the conversion of drivers.

P.S. If you've had the Prodigy song running through your head the whole
time you've been reading this email ... I'm sorry / You're welcome.
If people insist, we can rename this to phys_range or something boring,
but I quite like the spelling of phyr with the pronunciation of "fire".

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-16 Thread Matthew Wilcox

On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote:
> Assuming changing FSDAX is hard.. How would DAX people feel about just
> deleting the PUD/PMD support until it can be done with compound pages?

I think there are customers who would find that an unacceptable answer :-)

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-14 Thread Matthew Wilcox



It would probably help if you cc'd Dan on this.
As far as I know he's the only person left who cares about GUP on DAX.

On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> > From: Ralph Campbell 
> > 
> > ZONE_DEVICE struct pages have an extra reference count that complicates the
> > code for put_page() and several places in the kernel that need to check the
> > reference count to see that a page is not being used (gup, compaction,
> > migration, etc.). Clean up the code so the reference count doesn't need to
> > be treated specially for ZONE_DEVICE.
> > 
> > Signed-off-by: Ralph Campbell 
> > Signed-off-by: Alex Sierra 
> > Reviewed-by: Christoph Hellwig 
> > ---
> > v2:
> > AS: merged this patch in linux 5.11 version
> > 
> > v5:
> > AS: add condition at try_grab_page to check for the zone device type, while
> > page ref counter is checked less/equal to zero. In case of device zone, 
> > pages
> > ref counter are initialized to zero.
> > 
> > v7:
> > AS: fix condition at try_grab_page added at v5, is invalid. It supposed
> > to fix xfstests/generic/413 test, however, there's a known issue on
> > this test where DAX mapped area DIO to non-DAX expect to fail.
> > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com
> > This condition was removed after rebase over patch series
> > https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com
> > ---
> >  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
> >  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> >  fs/dax.c   |  4 +-
> >  include/linux/dax.h|  2 +-
> >  include/linux/memremap.h   |  7 +--
> >  include/linux/mm.h | 11 
> >  lib/test_hmm.c |  2 +-
> >  mm/internal.h  |  8 +++
> >  mm/memcontrol.c|  6 +--
> >  mm/memremap.c  | 69 +++---
> >  mm/migrate.c   |  5 --
> >  mm/page_alloc.c|  3 ++
> >  mm/swap.c  | 45 ++---
> >  13 files changed, 46 insertions(+), 120 deletions(-)
> 
> Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
> backed memory still work?
> 
> What refcount value does the struct pages have when they are installed
> in the PTEs? Remember a 0 refcount will make all the get_user_pages()
> fail.
> 
> I'm looking at the call path starting in ext4_punch_hole() and I would
> expect to see something manipulating the page ref count before
> the ext4_break_layouts() call path gets to the dax_page_unused() test.
> 
> All I see is we go into unmap_mapping_pages() - that would normally
> put back the page references held by PTEs but insert_pfn() has this:
> 
>   if (pfn_t_devmap(pfn))
>   entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> 
> And:
> 
> static inline pte_t pte_mkdevmap(pte_t pte)
> {
>   return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> }
> 
> Which interacts with vm_normal_page():
> 
>   if (pte_devmap(pte))
>   return NULL;
> 
> To disable that refcounting?
> 
> So... I have a feeling this will have PTEs pointing to 0 refcount
> pages? Unless FSDAX is !pte_devmap which is not the case, right?
> 
> This seems further confirmed by this comment:
> 
>   /*
>* If we race get_user_pages_fast() here either we'll see the
>* elevated page count in the iteration and wait, or
>* get_user_pages_fast() will see that the page it took a reference
>* against is no longer mapped in the page tables and bail to the
>* get_user_pages() slow path.  The slow path is protected by
>* pte_lock() and pmd_lock(). New references are not taken without
>* holding those locks, and unmap_mapping_pages() will not zero the
>* pte or pmd without holding the respective lock, so we are
>* guaranteed to either see new references or prevent new
>* references from being established.
>*/
> 
> Which seems to explain this scheme relies on unmap_mapping_pages() to
> fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
> stop.
> 
> This seems like it would be properly fixed by using normal page
> refcounting for PTEs - ie stop using special for these pages?
> 
> Does anyone know why devmap is pte_special anyhow?
> 
> > +void free_zone_device_page(struct page *page)
> > +{
> > +   switch (page->pgmap->type) {
> > +   case MEMORY_DEVICE_PRIVATE:
> > +   free_device_page(page);
> > +   return;
> > +   case MEMORY_DEVICE_FS_DAX:
> > +   /* notify page idle */
> > +   wake_up_var(&page->_refcount);
> > +   return;
> 
> It is not for this series, but I wonder if we should just always call
> ops->page_free and have free_device_page() logic in that cal

Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-10-14 Thread Matthew Wilcox

On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> From: Ralph Campbell 
> 
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> Reviewed-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle)

Re: [PATCH v1 1/2] ext4/xfs: add page refcount helper

2021-10-14 Thread Matthew Wilcox

On Thu, Oct 14, 2021 at 10:39:27AM -0500, Alex Sierra wrote:
> From: Ralph Campbell 
> 
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> Reviewed-by: Christoph Hellwig 
> Acked-by: Theodore Ts'o 
> Acked-by: Darrick J. Wong 

Reviewed-by: Matthew Wilcox (Oracle)

Re: [PATCH v1 00/12] MEMORY_DEVICE_COHERENT for CPU-accessible coherent device memory

2021-10-12 Thread Matthew Wilcox

On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
> Because I must ask: if this feature is for one single computer which
> presumably has a custom kernel, why add it to mainline Linux?

I think in particular patch 2 deserves to be merged because it removes
a ton of cruft from every call to put_page() (at least if you're using
a distro config).  It makes me nervous, but I think it's the right
thing to do.  It may well need more fixups after it has been merged,
but that's life.

Re: BoF at LPC: Documenting the Heterogeneous Memory Model Architecture

2021-09-23 Thread Matthew Wilcox

On Thu, Sep 23, 2021 at 04:25:08PM -0400, Felix Kuehling wrote:
> Change of plan: Instead of a BoF, this is now a session in the "GPU/media/AI
> buffer management and interop MC" micro conference. Thank you Daniel Stone
> for making that happen.
> https://linuxplumbersconf.org/event/11/contributions/1112/
> 
> It is scheduled for tomorrow (Friday) 08:40-10:00 Pacific, 11:40-13:00
> Eastern, 15:40-17:00 UTC.

That's up against:

 Direct map management
Vlastimil Babka, Mike Rapoport, Rick Edgecombe  11:30-12:15.

Seems like a lot of the same people would want to be in both sessions.
Maybe one could be moved?

Re: [PATCH v5 3/9] dyndbg: add DEFINE_DYNAMIC_DEBUG_CATEGORIES and callbacks

2021-08-13 Thread Matthew Wilcox

On Fri, Aug 13, 2021 at 06:51:05PM +0300, Andy Shevchenko wrote:
> On Fri, Aug 13, 2021 at 09:17:11AM -0600, Jim Cromie wrote:
> > +int param_set_dyndbg(const char *instr, const struct kernel_param *kp)
> > +{
> > +   unsigned long inbits;
> > +   int rc, i, chgct = 0, totct = 0;
> > +   char query[OUR_QUERY_SIZE];
> > +   struct dyndbg_bitdesc *bitmap = (struct dyndbg_bitdesc *) kp->data;
> 
> So you need space after ')' ?

More importantly, if ->data is of type 'void *', it is bad style to
cast the pointer at all.  I can't tell what type 'data' has; if it
is added to kernel_param as part of this series, I wasn't cc'd on
the patch that did that.

Re: [RFC PATCH v2 1/8] ext4/xfs: add page refcount helper

2021-06-09 Thread Matthew Wilcox

On Mon, Jun 07, 2021 at 03:42:19PM -0500, Alex Sierra wrote:
> +++ b/include/linux/dax.h
> @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space 
> *mapping)
>   return mapping->host && IS_DAX(mapping->host);
>  }
>  
> +static inline bool dax_layout_is_idle_page(struct page *page)
> +{
> + return page_ref_count(page) == 1;
> +}

We already have something called an idle page, and that's quite a
different thing from this.  How about dax_page_unused() (it's a use
count, so once it's got down to its minimum value, it's unused)?

Re: [RFC PATCH v2 1/8] ext4/xfs: add page refcount helper

2021-06-08 Thread Matthew Wilcox

On Tue, Jun 08, 2021 at 12:29:04AM +, Liam Howlett wrote:
> * Alex Sierra  [210607 16:43]:
> > From: Ralph Campbell 
> > 
> > There are several places where ZONE_DEVICE struct pages assume a reference
> > count == 1 means the page is idle and free. Instead of open coding this,
> > add a helper function to hide this detail.
> > 
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b52f084aa643..8909a91cd381 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space 
> > *mapping)
> > return mapping->host && IS_DAX(mapping->host);
> >  }
> >  
> > +static inline bool dax_layout_is_idle_page(struct page *page)
> > +{
> > +   return page_ref_count(page) == 1;
> > +}
> 
> If this races with page_ref_count(page) == 0, then it will return false
> that a page is idle when the page is being freed.  I don't know the code
> well enough to say if this is an issue or not so please let me know.
> 
> For example:
> !dax_layout_is_idle_page() will return true in dax_busy_page() above
> when the count is 0 and return the page.
> 
> Maybe you are sure to have at least one reference when calling this?  It
> might be worth adding a comment.

You're getting confused by the problem that the next patch fixes, which
is that devmap pages were stupidly given an elevated refcount.  devmap
pages are considered "free" when their refcount is 1.  See
put_page(), put_devmap_managed_page() and so on.

Re: [RFC PATCH v2 0/8] Support DEVICE_GENERIC memory in migrate_vma_*

2021-06-08 Thread Matthew Wilcox

On Mon, Jun 07, 2021 at 03:42:18PM -0500, Alex Sierra wrote:
> v1:
> https://lore.kernel.org/linux-mm/20210529064022.gb15...@lst.de/T/

Please copy and paste the rationale into followup patch series instead
of sending a link:

AMD is building a system architecture for the Frontier supercomputer with
a coherent interconnect between CPUs and GPUs. This hardware architecture
allows the CPUs to coherently access GPU device memory. We have hardware
in our labs and we are working with our partner HPE on the BIOS, firmware
and software for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver
looks it up with lookup_resource and registers it with devmap as
MEMORY_DEVICE_GENERIC using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the
migrate_vma_* helpers so we can support page-based migration in our
unified memory allocations, while also supporting CPU access to those
pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
behave correctly in the migrate_vma_* helpers. We are looking for feedback
about this approach. If we're close, what's needed to make our patches
acceptable upstream? If we're not close, any suggestions how else to
achieve what we are trying to do (i.e. page migration and coherent CPU
access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently
upstreamed to Dave Airlie's drm-next branch
[https://cgit.freedesktop.org/drm/drm/log/?h=drm-next]. On top of that we
did some rework of our VRAM management for migrations to remove some
incorrect assumptions, allow partially successful migrations and GPU
memory mappings that mix pages in VRAM and system memory.
[https://patchwork.kernel.org/project/dri-devel/list/?series=489811]

> v2:
> This patch series version has merged "[RFC PATCH v3 0/2]
> mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
> Ralph Campbell. It also applies at the top of these series, our changes
> to support device generic type in migration_vma helpers.
> This has been tested in systems with device memory that has coherent
> access by CPU.
> 
> Also addresses the following feedback made in v1:
> - Isolate in one patch kernel/resource.c modification, based
> on Christoph's feedback.
> - Add helpers check for generic and private type to avoid
> duplicated long lines.
> 
> I like to provide an overview of what each of the patches does in a series:
> 
> Patches 1-2: Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches
> Patch 3: Export lookup_resource
> Patches 4-5: AMDGPU driver changes to register and use DEVICE_GENERIC memory
> Patches 6-8: Handle DEVICE_GENERIC memory in migration helpers
> 
> Alex Sierra (6):
>   kernel: resource: lookup_resource as exported symbol
>   drm/amdkfd: add SPM support for SVM
>   drm/amdkfd: generic type as sys mem on migration to ram
>   include/linux/mm.h: helpers to check zone device generic type
>   mm: add generic type support to migrate_vma helpers
>   mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
> 
> Ralph Campbell (2):
>   ext4/xfs: add page refcount helper
>   mm: remove extra ZONE_DEVICE struct page refcount
> 
>  arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 --
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
>  fs/dax.c |  8 +--
>  fs/ext4/inode.c  |  5 +-
>  fs/xfs/xfs_file.c|  4 +-
>  include/linux/dax.h  | 10 
>  include/linux/memremap.h |  7 +--
>  include/linux/mm.h   | 52 +++---
>  kernel/resource.c|  2 +-
>  lib/test_hmm.c   |  2 +-
>  mm/internal.h|  8 +++
>  mm/memremap.c| 69 +++-
>  mm/migrate.c | 13 ++---
>  mm/page_alloc.c  |  3 ++
>  mm/swap.c| 45 ++--
>  16 files changed, 83 insertions(+), 164 deletions(-)
> 
> -- 
> 2.17.1
> 
>

1 2 3 >

1 - 100 of 261 matches

Mail list logo