Re: [PATCH v3 06/11] mm/util: Deduplicate code in {kstrdup,kstrndup,kmemdup_nul}
On Sun, Jun 23, 2024 at 10:29:30AM +0800, Yafang Shao wrote: > On Fri, Jun 21, 2024 at 9:57 PM Matthew Wilcox wrote: > > > > On Fri, Jun 21, 2024 at 10:29:54AM +0800, Yafang Shao wrote: > > > +++ b/mm/internal.h > > > > Why are you putting __kstrndup in a header file when it's only used > > in util.c? > > I want to make it always inlined. However, it is not recommended to > define an inline function in a .c file, right ? I'm not aware of any such recommendation. Better than putting it in a .h file that everybody has to look at but nobody uses.
Re: [PATCH v3 06/11] mm/util: Deduplicate code in {kstrdup,kstrndup,kmemdup_nul}
On Fri, Jun 21, 2024 at 10:29:54AM +0800, Yafang Shao wrote: > +++ b/mm/internal.h Why are you putting __kstrndup in a header file when it's only used in util.c? Also, I think this function is actually __kmemdup_nul(), not __kstrndup().
Re: [PATCH v2] Fix duplicate C declaration warnings
On Sat, Mar 23, 2024 at 10:01:47PM +0530, Amogh Cheluvaraj wrote: > Fix duplicate C declaration warnings at > Documentation/gpu/drm-kms.rst that was found by > compiling htmldocs I'm sure this removes the warning, but it removes all kernel-doc which exists in drivers/gpu/drm/drm_fourcc.c. Isn't there a more granular fix than this? > /home/amogh/Linux_Kernel_Workspace/linux-next/Documentation/gpu/drm- > kms:360: ./drivers/gpu/drm/drm_fourcc.c:344: WARNING: Duplicate C > declaration, also defined at gpu/drm-kms:39. > Declaration is '.. c:function:: const struct drm_format_info * > drm_format_info (u32 format)'. > /home/amogh/Linux_Kernel_Workspace/linux-next/Documentation/gpu/drm- > kms:461: ./drivers/gpu/drm/drm_modeset_lock.c:392: WARNING: Duplicate C > declaration, also defined at gpu/drm-kms:49. > Declaration is '.. c:function:: int drm_modeset_lock (struct > drm_modeset_lock *lock, struct drm_modeset_acquire_ctx *ctx)'. > > Signed-off-by: Amogh Cheluvaraj > --- > > changes in v2 > - add warnings found after compilation > - fix grammar in commit description > > --- > Documentation/gpu/drm-kms.rst | 6 -- > 1 file changed, 6 deletions(-) > > diff --git a/Documentation/gpu/drm-kms.rst b/Documentation/gpu/drm-kms.rst > index 13d3627d8bc0..a4145f391e43 100644 > --- a/Documentation/gpu/drm-kms.rst > +++ b/Documentation/gpu/drm-kms.rst > @@ -357,9 +357,6 @@ Format Functions Reference > .. kernel-doc:: include/drm/drm_fourcc.h > :internal: > > -.. kernel-doc:: drivers/gpu/drm/drm_fourcc.c > - :export: > - > .. _kms_dumb_buffer_objects: > > Dumb Buffer Objects > @@ -458,9 +455,6 @@ KMS Locking > .. kernel-doc:: include/drm/drm_modeset_lock.h > :internal: > > -.. kernel-doc:: drivers/gpu/drm/drm_modeset_lock.c > - :export: > - > KMS Properties > == > > -- > 2.44.0 > >
fb_defio and page->mapping
We're currently trying to remove page->mapping from the entire kernel. This has me interested in fb_defio and since I made such a mess of it with commits ccf953d8f3d6 / 0b78f8bcf495, I'd like to discuss what to do before diving in. Folios continue to have a mapping. So we can effectively do page_folio(page)->mapping (today that's calling compound_head() to get to the head page; eventually it's a separate allocation). But now I look at commit 56c134f7f1b5, I'm a little scared. Apparently pages are being allocated from shmem and being mapped by fb_deferred_io_fault()? This line: page->mapping = vmf->vma->vm_file->f_mapping; initially appears harmless for shmem files (because that's presumably a noop), but it's only a noop for head pages. If shmem allocates a compound page (ie a 2MB THP today), you'll overlay some information stored in the second and third pages; looks like _entire_mapcount and _deferred_list.prev (but we do shift those fields around without regard to what the fb_defio driver is doing). Even if you've disabled THP today, setting page->mapping to NULL in fb_deferred_io_lastclose() for a shmem page is a really bad idea. I'd like to avoid fb_defio playing with page->mapping at all. As I understand it, the only reason to set page->mapping is so that page_mkclean() works. But there are all kinds of assumptions in page_mkclean() (now essentially folio_mkclean()) that we're dealing with file-backed or anonymous memory. I wonder if we might be better off calling pfn_mkclean_range() for each VMA which maps these allocations? You'd have to keep track of each VMA yourself (I think?) but it would avoid messing with page->mapping. Anyway, I don't know enough about DRM. There might be something unutterably obvious we could do to fix this.
Re: [PATCH 2/2] xfs: disable large folio support in xfile_create
On Thu, Jan 11, 2024 at 02:00:53PM -0800, Andrew Morton wrote: > On Wed, 10 Jan 2024 12:04:51 -0800 "Darrick J. Wong" > wrote: > > > > > Fixing this will require a bit of an API change, and prefeably sorting > > > > out > > > > the hwpoison story for pages vs folio and where it is placed in the > > > > shmem > > > > API. For now use this one liner to disable large folios. > > > > > > > > Reported-by: Darrick J. Wong > > > > Signed-off-by: Christoph Hellwig > > > > > > Can someone who knows more about shmem.c than I do please review > > > https://lore.kernel.org/linux-xfs/20240103084126.513354-4-...@lst.de/ > > > so that I can feel slightly more confident as hch and I sort through the > > > xfile.c issues? > > > > > > For this patch, > > > Reviewed-by: Darrick J. Wong > > > > ...except that I'm still getting 2M THPs even with this enabled, so I > > guess either we get to fix it now, or create our own private tmpfs mount > > so that we can pass in huge=never, similar to what i915 does. :( > > What is "this"? Are you saying that $Subject doesn't work, or that the > above-linked please-review patch doesn't work? shmem pays no attention to the mapping_large_folio_support() flag, so the proposed fix doesn't work. It ought to, but it has its own way of doing it that predates mapping_large_folio_support existing.
Re: disable large folios for shmem file used by xfs xfile
On Wed, Jan 10, 2024 at 05:28:22PM +0200, Joonas Lahtinen wrote: > Quoting Joonas Lahtinen (2024-01-10 17:20:24) > > However we specifically pass "huge=within_size" to vfs_kern_mount when > > creating a private mount of tmpfs for the purpose of i915 created > > allocations. > > > > Older hardware also had some address hashing bugs where 2M aligned > > memory caused a lot of collisions in TLB so we don't enable it always. > > > > You can see drivers/gpu/drm/i915/gem/i915_gemfs.c function > > i915_gemfs_init for details and references. > > > > So in short, functionality wise we should be fine either default > > for using 2M pages or not. If they become the default, we'd probably > > want an option that would still be able to prevent them for performance > > regression reasons on older hardware. > > To maybe write out my concern better: > > Is there plan to enable huge pages by default in shmem? Not in the next kernel release, but eventually the plan is to allow arbitrary order folios to be used in shmem. So you could ask it to create a 256kB folio for you, if that's the right size to manage memory in. How shmem and its various users go about choosing the right size is not quite clear to me yet. Perhaps somebody else will do it before I get to it; I have a lot of different sub-projects to work on at the moment, and shmem isn't blocking any of them. And I have a sneaking suspicion that more work is needed in the swap code to deal with arbitrary order folios, so that's another reason for me to delay looking at this ;-)
Re: disable large folios for shmem file used by xfs xfile
On Wed, Jan 10, 2024 at 10:21:07AM +0100, Christoph Hellwig wrote: > Hi all, > > Darrick reported that the fairly new XFS xfile code blows up when force > enabling large folio for shmem. This series fixes this quickly by disabling > large folios for this particular shmem file for now until it can be fixed > properly, which will be a lot more invasive. > > I've added most of you to the CC list as I suspect most other users of > shmem_file_setup and friends will have similar issues. The graphics users _want_ to use large folios. I'm pretty sure they've been tested with this. It's just XFS that didn't know about this feature of shmem.
Re: [PATCH v8 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v8)
On Fri, Dec 15, 2023 at 10:05:33PM -0800, Vivek Kasireddy wrote: > +++ b/include/linux/memfd.h > @@ -6,11 +6,16 @@ > > #ifdef CONFIG_MEMFD_CREATE > extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int > arg); > +extern struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx); You don't need the 'extern' for functions. > #else > static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int > a) > { > return -EINVAL; > } > +static inline struct page *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > +{ > + return ERR_PTR(-EINVAL); > +} > #endif Different return types depending on the CONFIG selected ...
Re: [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5)
On Mon, Dec 11, 2023 at 11:38:02PM -0800, Vivek Kasireddy wrote: > +++ b/drivers/dma-buf/udmabuf.c > @@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf) > if (pgoff >= ubuf->pagecount) > return VM_FAULT_SIGBUS; > > - pfn = page_to_pfn(&ubuf->folios[pgoff]->page); > + pfn = page_to_pfn(folio_page(ubuf->folios[pgoff], 0)); Why are you changing from &folio->page to folio_page(folio, 0) in this patch? Either that should have been done the other way in the earlier patch, or it shouldn't be done at all. My vote is that it shuoldn't be done at all. These all seem like "I have to convert back from folio to page because the APIs I want aren't available", not "I want the first page from this folio".
Re: [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios
On Mon, Dec 11, 2023 at 11:38:01PM -0800, Vivek Kasireddy wrote: > @@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf) > if (pgoff >= ubuf->pagecount) > return VM_FAULT_SIGBUS; > > - pfn = page_to_pfn(ubuf->pages[pgoff]); > + pfn = page_to_pfn(&ubuf->folios[pgoff]->page); We have folio_pfn(). > static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map) > { > struct udmabuf *ubuf = buf->priv; > + struct page **pages; > void *vaddr; > + pgoff_t pg; > > dma_resv_assert_held(buf->resv); > > - vaddr = vm_map_ram(ubuf->pages, ubuf->pagecount, -1); > + pages = kmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL); > + if (!pages) > + return -ENOMEM; > + > + for (pg = 0; pg < ubuf->pagecount; pg++) > + pages[pg] = &ubuf->folios[pg]->page; > + > + vaddr = vm_map_ram(pages, ubuf->pagecount, -1); > + kfree(pages); We don't yet have a vm_map_ram() variant that takes an array of folios. We probably should; there was _something_ I was looking at recently that would have liked it ... > @@ -254,31 +262,70 @@ static int handle_shmem_pages(struct udmabuf *ubuf, > struct file *memfd, > pgoff_t *pgbuf) > { > pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT; > - struct page *page; > + struct folio *folio = NULL; > > for (pgidx = 0; pgidx < pgcnt; pgidx++) { > - page = shmem_read_mapping_page(memfd->f_mapping, > -pgoff + pgidx); > - if (IS_ERR(page)) > - return PTR_ERR(page); > + folio = shmem_read_folio(memfd->f_mapping, > + pgoff + pgidx); You could join these two lines.
[PATCH 1/2] fbdev: Remove support for Carillo Ranch driver
As far as anybody can tell, this product never shipped. If it did, it shipped in 2007 and nobody has access to one any more. Remove the fbdev driver and the backlight driver. Signed-off-by: Matthew Wilcox (Oracle) --- drivers/video/backlight/Kconfig |7 - drivers/video/backlight/Makefile |1 - drivers/video/backlight/cr_bllcd.c| 264 - drivers/video/fbdev/Kconfig | 15 - drivers/video/fbdev/Makefile |1 - drivers/video/fbdev/vermilion/Makefile|6 - drivers/video/fbdev/vermilion/cr_pll.c| 195 drivers/video/fbdev/vermilion/vermilion.c | 1175 - drivers/video/fbdev/vermilion/vermilion.h | 245 - 9 files changed, 1909 deletions(-) delete mode 100644 drivers/video/backlight/cr_bllcd.c delete mode 100644 drivers/video/fbdev/vermilion/Makefile delete mode 100644 drivers/video/fbdev/vermilion/cr_pll.c delete mode 100644 drivers/video/fbdev/vermilion/vermilion.c delete mode 100644 drivers/video/fbdev/vermilion/vermilion.h diff --git a/drivers/video/backlight/Kconfig b/drivers/video/backlight/Kconfig index 1144a54a35c0..ea2d0d69bd8c 100644 --- a/drivers/video/backlight/Kconfig +++ b/drivers/video/backlight/Kconfig @@ -235,13 +235,6 @@ config BACKLIGHT_HP700 If you have an HP Jornada 700 series, say Y to include backlight control driver. -config BACKLIGHT_CARILLO_RANCH - tristate "Intel Carillo Ranch Backlight Driver" - depends on LCD_CLASS_DEVICE && PCI && X86 && FB_LE80578 - help - If you have a Intel LE80578 (Carillo Ranch) say Y to enable the - backlight driver. - config BACKLIGHT_PWM tristate "Generic PWM based Backlight Driver" depends on PWM diff --git a/drivers/video/backlight/Makefile b/drivers/video/backlight/Makefile index 1af583de665b..06966cb20459 100644 --- a/drivers/video/backlight/Makefile +++ b/drivers/video/backlight/Makefile @@ -25,7 +25,6 @@ obj-$(CONFIG_BACKLIGHT_ADP8870) += adp8870_bl.o obj-$(CONFIG_BACKLIGHT_APPLE) += apple_bl.o obj-$(CONFIG_BACKLIGHT_AS3711) += as3711_bl.o obj-$(CONFIG_BACKLIGHT_BD6107) += bd6107.o -obj-$(CONFIG_BACKLIGHT_CARILLO_RANCH) += cr_bllcd.o obj-$(CONFIG_BACKLIGHT_CLASS_DEVICE) += backlight.o obj-$(CONFIG_BACKLIGHT_DA903X) += da903x_bl.o obj-$(CONFIG_BACKLIGHT_DA9052) += da9052_bl.o diff --git a/drivers/video/backlight/cr_bllcd.c b/drivers/video/backlight/cr_bllcd.c deleted file mode 100644 index 781aeecc451d.. --- a/drivers/video/backlight/cr_bllcd.c +++ /dev/null @@ -1,264 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-or-later -/* - * Copyright (c) Intel Corp. 2007. - * All Rights Reserved. - * - * Intel funded Tungsten Graphics (http://www.tungstengraphics.com) to - * develop this driver. - * - * This file is part of the Carillo Ranch video subsystem driver. - * - * Authors: - * Thomas Hellstrom - * Alan Hourihane - */ - -#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* The LVDS- and panel power controls sits on the - * GPIO port of the ISA bridge. - */ - -#define CRVML_DEVICE_LPC0x27B8 -#define CRVML_REG_GPIOBAR 0x48 -#define CRVML_REG_GPIOEN0x4C -#define CRVML_GPIOEN_BIT(1 << 4) -#define CRVML_PANEL_PORT0x38 -#define CRVML_LVDS_ON 0x0001 -#define CRVML_PANEL_ON 0x0002 -#define CRVML_BACKLIGHT_OFF 0x0004 - -/* The PLL Clock register sits on Host bridge */ -#define CRVML_DEVICE_MCH 0x5001 -#define CRVML_REG_MCHBAR 0x44 -#define CRVML_REG_MCHEN0x54 -#define CRVML_MCHEN_BIT(1 << 28) -#define CRVML_MCHMAP_SIZE 4096 -#define CRVML_REG_CLOCK0xc3c -#define CRVML_CLOCK_SHIFT 8 -#define CRVML_CLOCK_MASK 0x0f00 - -static struct pci_dev *lpc_dev; -static u32 gpio_bar; - -struct cr_panel { - struct backlight_device *cr_backlight_device; - struct lcd_device *cr_lcd_device; -}; - -static int cr_backlight_set_intensity(struct backlight_device *bd) -{ - u32 addr = gpio_bar + CRVML_PANEL_PORT; - u32 cur = inl(addr); - - if (backlight_get_brightness(bd) == 0) { - /* OFF */ - cur |= CRVML_BACKLIGHT_OFF; - outl(cur, addr); - } else { - /* FULL ON */ - cur &= ~CRVML_BACKLIGHT_OFF; - outl(cur, addr); - } - - return 0; -} - -static int cr_backlight_get_intensity(struct backlight_device *bd) -{ - u32 addr = gpio_bar + CRVML_PANEL_PORT; - u32 cur = inl(addr); - u8 intensity; - - if (cur & CRVML_BACKLIGHT_OFF) - intensity = 0; - else - intensity = 1; - - return intensity; -} - -static const struct backlight_ops cr_backlight_ops = { - .get_brightness = cr_ba
[PATCH 2/2] mtd: Remove support for Carillo Ranch driver
As far as anybody can tell, this product never shipped. If it did, it shipped in 2007 and nobody has access to one any more. Remove the mtd NOR driver. Signed-off-by: Matthew Wilcox (Oracle) --- drivers/mtd/maps/Kconfig| 7 - drivers/mtd/maps/Makefile | 1 - drivers/mtd/maps/intel_vr_nor.c | 265 3 files changed, 273 deletions(-) delete mode 100644 drivers/mtd/maps/intel_vr_nor.c diff --git a/drivers/mtd/maps/Kconfig b/drivers/mtd/maps/Kconfig index e098ae937ce8..8a8b19874e23 100644 --- a/drivers/mtd/maps/Kconfig +++ b/drivers/mtd/maps/Kconfig @@ -341,13 +341,6 @@ config MTD_UCLINUX help Map driver to support image based filesystems for uClinux. -config MTD_INTEL_VR_NOR - tristate "NOR flash on Intel Vermilion Range Expansion Bus CS0" - depends on PCI - help - Map driver for a NOR flash bank located on the Expansion Bus of the - Intel Vermilion Range chipset. - config MTD_PLATRAM tristate "Map driver for platform device RAM (mtd-ram)" select MTD_RAM diff --git a/drivers/mtd/maps/Makefile b/drivers/mtd/maps/Makefile index 094cfb244086..a9083c888e3b 100644 --- a/drivers/mtd/maps/Makefile +++ b/drivers/mtd/maps/Makefile @@ -40,6 +40,5 @@ obj-$(CONFIG_MTD_UCLINUX) += uclinux.o obj-$(CONFIG_MTD_NETtel) += nettel.o obj-$(CONFIG_MTD_SCB2_FLASH) += scb2_flash.o obj-$(CONFIG_MTD_PLATRAM) += plat-ram.o -obj-$(CONFIG_MTD_INTEL_VR_NOR) += intel_vr_nor.o obj-$(CONFIG_MTD_VMU) += vmu-flash.o obj-$(CONFIG_MTD_LANTIQ) += lantiq-flash.o diff --git a/drivers/mtd/maps/intel_vr_nor.c b/drivers/mtd/maps/intel_vr_nor.c deleted file mode 100644 index d67b845b0e89.. --- a/drivers/mtd/maps/intel_vr_nor.c +++ /dev/null @@ -1,265 +0,0 @@ -/* - * drivers/mtd/maps/intel_vr_nor.c - * - * An MTD map driver for a NOR flash bank on the Expansion Bus of the Intel - * Vermilion Range chipset. - * - * The Vermilion Range Expansion Bus supports four chip selects, each of which - * has 64MiB of address space. The 2nd BAR of the Expansion Bus PCI Device - * is a 256MiB memory region containing the address spaces for all four of the - * chip selects, with start addresses hardcoded on 64MiB boundaries. - * - * This map driver only supports NOR flash on chip select 0. The buswidth - * (either 8 bits or 16 bits) is determined by reading the Expansion Bus Timing - * and Control Register for Chip Select 0 (EXP_TIMING_CS0). This driver does - * not modify the value in the EXP_TIMING_CS0 register except to enable writing - * and disable boot acceleration. The timing parameters in the register are - * assumed to have been properly initialized by the BIOS. The reset default - * timing parameters are maximally conservative (slow), so access to the flash - * will be slower than it should be if the BIOS has not initialized the timing - * parameters. - * - * Author: Andy Lowe - * - * 2006 (c) MontaVista Software, Inc. This file is licensed under - * the terms of the GNU General Public License version 2. This program - * is licensed "as is" without any warranty of any kind, whether express - * or implied. - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define DRV_NAME "vr_nor" - -struct vr_nor_mtd { - void __iomem *csr_base; - struct map_info map; - struct mtd_info *info; - struct pci_dev *dev; -}; - -/* Expansion Bus Configuration and Status Registers are in BAR 0 */ -#define EXP_CSR_MBAR 0 -/* Expansion Bus Memory Window is BAR 1 */ -#define EXP_WIN_MBAR 1 -/* Maximum address space for Chip Select 0 is 64MiB */ -#define CS0_SIZE 0x0400 -/* Chip Select 0 is at offset 0 in the Memory Window */ -#define CS0_START 0x0 -/* Chip Select 0 Timing Register is at offset 0 in CSR */ -#define EXP_TIMING_CS0 0x00 -#define TIMING_CS_EN (1 << 31) /* Chip Select Enable */ -#define TIMING_BOOT_ACCEL_DIS (1 << 8) /* Boot Acceleration Disable */ -#define TIMING_WR_EN (1 << 1) /* Write Enable */ -#define TIMING_BYTE_EN (1 << 0) /* 8-bit vs 16-bit bus */ -#define TIMING_MASK0x3FFF - -static void vr_nor_destroy_partitions(struct vr_nor_mtd *p) -{ - mtd_device_unregister(p->info); -} - -static int vr_nor_init_partitions(struct vr_nor_mtd *p) -{ - /* register the flash bank */ - /* partition the flash bank */ - return mtd_device_register(p->info, NULL, 0); -} - -static void vr_nor_destroy_mtd_setup(struct vr_nor_mtd *p) -{ - map_destroy(p->info); -} - -static int vr_nor_mtd_setup(struct vr_nor_mtd *p) -{ - static const char * const probe_types[] = - { "cfi_probe", "jedec_probe", NULL }; - const char * const *type; - - for (type = probe_types; !p->info && *type; type++) -
[PATCH] drm: Do not overrun array in drm_gem_get_pages()
If the shared memory object is larger than the DRM object that it backs, we can overrun the page array. Limit the number of pages we install from each folio to prevent this. Signed-off-by: Matthew Wilcox (Oracle) Reported-by: Oleksandr Natalenko Tested-by: Oleksandr Natalenko Link: https://lore.kernel.org/lkml/13360591.ulzwgnk...@natalenko.name/ Fixes: 3291e09a4638 ("drm: convert drm_gem_put_pages() to use a folio_batch") Cc: sta...@vger.kernel.org # 6.5.x --- drivers/gpu/drm/drm_gem.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 6129b89bb366..44a948b80ee1 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -540,7 +540,7 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj) struct page **pages; struct folio *folio; struct folio_batch fbatch; - int i, j, npages; + long i, j, npages; if (WARN_ON(!obj->filp)) return ERR_PTR(-EINVAL); @@ -564,11 +564,13 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj) i = 0; while (i < npages) { + long nr; folio = shmem_read_folio_gfp(mapping, i, mapping_gfp_mask(mapping)); if (IS_ERR(folio)) goto fail; - for (j = 0; j < folio_nr_pages(folio); j++, i++) + nr = min(npages - i, folio_nr_pages(folio)); + for (j = 0; j < nr; j++, i++) pages[i] = folio_file_page(folio, i); /* Make sure shmem keeps __GFP_DMA32 allocated pages in the -- 2.40.1
Re: [REGRESSION] BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250
On Thu, Oct 05, 2023 at 02:30:55PM +0200, Oleksandr Natalenko wrote: > No-no, sorry for possible confusion. Let me explain again: > > 1. we had an issue with i915, which was introduced by 0b62af28f249, and later > was fixed by 863a8eb3f270 > 2. now I've discovered another issue, which looks very similar to 1., but in > a VM with Cirrus VGA, and it happens even while having 863a8eb3f270 applied > 3. I've tried reverting 3291e09a4638, after which I cannot reproduce the > issue with Cirrus VGA, but clearly there was no fix for it discussed > > IOW, 863a8eb3f270 is the fix for 0b62af28f249, but not for 3291e09a4638. It > looks like 3291e09a4638 requires a separate fix. Thank you! Sorry about the misunderstanding. Try this: diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 6129b89bb366..44a948b80ee1 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -540,7 +540,7 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj) struct page **pages; struct folio *folio; struct folio_batch fbatch; - int i, j, npages; + long i, j, npages; if (WARN_ON(!obj->filp)) return ERR_PTR(-EINVAL); @@ -564,11 +564,13 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj) i = 0; while (i < npages) { + long nr; folio = shmem_read_folio_gfp(mapping, i, mapping_gfp_mask(mapping)); if (IS_ERR(folio)) goto fail; - for (j = 0; j < folio_nr_pages(folio); j++, i++) + nr = min(npages - i, folio_nr_pages(folio)); + for (j = 0; j < nr; j++, i++) pages[i] = folio_file_page(folio, i); /* Make sure shmem keeps __GFP_DMA32 allocated pages in the
Re: [REGRESSION] BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250
On Thu, Oct 05, 2023 at 09:56:03AM +0200, Oleksandr Natalenko wrote: > Hello. > > On čtvrtek 5. října 2023 9:44:42 CEST Thomas Zimmermann wrote: > > Hi > > > > Am 02.10.23 um 17:38 schrieb Oleksandr Natalenko: > > > On pondělí 2. října 2023 16:32:45 CEST Matthew Wilcox wrote: > > >> On Mon, Oct 02, 2023 at 01:02:52PM +0200, Oleksandr Natalenko wrote: > > >>>>>>> BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250 > > >>>>>>> > > >>>>>>> Corrupted memory at 0xe173a294 [ ! ! ! ! ! ! ! ! ! ! ! ! ! > > >>>>>>> ! ! ! ] (in kfence-#108): > > >>>>>>> drm_gem_put_pages+0x186/0x250 > > >>>>>>> drm_gem_shmem_put_pages_locked+0x43/0xc0 > > >>>>>>> drm_gem_shmem_object_vunmap+0x83/0xe0 > > >>>>>>> drm_gem_vunmap_unlocked+0x46/0xb0 > > >>>>>>> drm_fbdev_generic_helper_fb_dirty+0x1dc/0x310 > > >>>>>>> drm_fb_helper_damage_work+0x96/0x170 > > >>> > > >>> Matthew, before I start dancing around, do you think ^^ could have the > > >>> same cause as 0b62af28f249b9c4036a05acfb053058dc02e2e2 which got fixed > > >>> by 863a8eb3f27098b42772f668e3977ff4cae10b04? > > >> > > >> Yes, entirely plausible. I think you have two useful points to look at > > >> before delving into a full bisect -- 863a8e and the parent of 0b62af. > > >> If either of them work, I think you have no more work to do. > > > > > > OK, I've did this against v6.5.5: > > > > > > ``` > > > git log --oneline HEAD~3.. > > > 7c1e7695ca9b8 (HEAD -> test) Revert "mm: remove struct pagevec" > > > 8f2ad53b6eac6 Revert "mm: remove check_move_unevictable_pages()" > > > fa1e3c0b5453c Revert "drm: convert drm_gem_put_pages() to use a > > > folio_batch" > > > ``` > > > > > > then rebooted the host multiple times, and the issue is not seen any more. > > > > > > So I guess 3291e09a463870610b8227f32b16b19a587edf33 is the culprit. > > > > Ignore my other email. It's apparently been fixed already. Thanks! > > Has it? I think I was able to identify offending commit, but I'm not aware of > any fix to that. I don't understand; you said reverting those DRM commits fixed the problem, so 863a8eb3f270 is the solution. No?
Re: [REGRESSION] BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250
On Mon, Oct 02, 2023 at 01:02:52PM +0200, Oleksandr Natalenko wrote: > > > > > BUG: KFENCE: memory corruption in drm_gem_put_pages+0x186/0x250 > > > > > > > > > > Corrupted memory at 0xe173a294 [ ! ! ! ! ! ! ! ! ! ! ! ! ! ! > > > > > ! ! ] (in kfence-#108): > > > > > drm_gem_put_pages+0x186/0x250 > > > > > drm_gem_shmem_put_pages_locked+0x43/0xc0 > > > > > drm_gem_shmem_object_vunmap+0x83/0xe0 > > > > > drm_gem_vunmap_unlocked+0x46/0xb0 > > > > > drm_fbdev_generic_helper_fb_dirty+0x1dc/0x310 > > > > > drm_fb_helper_damage_work+0x96/0x170 > > Matthew, before I start dancing around, do you think ^^ could have the same > cause as 0b62af28f249b9c4036a05acfb053058dc02e2e2 which got fixed by > 863a8eb3f27098b42772f668e3977ff4cae10b04? Yes, entirely plausible. I think you have two useful points to look at before delving into a full bisect -- 863a8e and the parent of 0b62af. If either of them work, I think you have no more work to do.
[PATCH] i915: Limit the length of an sg list to the requested length
The folio conversion changed the behaviour of shmem_sg_alloc_table() to put the entire length of the last folio into the sg list, even if the sg list should have been shorter. gen8_ggtt_insert_entries() relied on the list being the right langth and would overrun the end of the page tables. Other functions may also have been affected. Clamp the length of the last entry in the sg list to be the expected length. Signed-off-by: Matthew Wilcox (Oracle) Fixes: 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch") Cc: sta...@vger.kernel.org # 6.5.x Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9256 Link: https://lore.kernel.org/lkml/6287208.lov4wx5...@natalenko.name/ Reported-by: Oleksandr Natalenko Tested-by: Oleksandr Natalenko --- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 8f1633c3fb93..73a4a4eb29e0 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, st->nents = 0; for (i = 0; i < page_count; i++) { struct folio *folio; + unsigned long nr_pages; const unsigned int shrink[] = { I915_SHRINK_BOUND | I915_SHRINK_UNBOUND, 0, @@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, } } while (1); + nr_pages = min_t(unsigned long, + folio_nr_pages(folio), page_count - i); if (!i || sg->length >= max_segment || folio_pfn(folio) != next_pfn) { @@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, sg = sg_next(sg); st->nents++; - sg_set_folio(sg, folio, folio_size(folio), 0); + sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0); } else { /* XXX: could overflow? */ - sg->length += folio_size(folio); + sg->length += nr_pages * PAGE_SIZE; } - next_pfn = folio_pfn(folio) + folio_nr_pages(folio); - i += folio_nr_pages(folio) - 1; + next_pfn = folio_pfn(folio) + nr_pages; + i += nr_pages - 1; /* Check that the i965g/gm workaround works. */ GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x0010UL); -- 2.40.1
Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5
On Tue, Sep 19, 2023 at 08:11:47PM +0200, Oleksandr Natalenko wrote: > I can confirm this one fixes the issue for me on T460s laptop. Thank you! Yay! > Should you submit it, please add: > > Fixes: 0b62af28f2 ("i915: convert shmem_sg_free_table() to use a folio_batch") Thanks for collecting all these; you're making my life really easy. One minor point is that the standard format for Fixes: is 12 characters, not 10. eg, Documentation/process/5.Posting.rst:Fixes: 1f2e3d4c5b6a ("The first line of the commit specified by the first 12 characters of its SHA-1 ID") I have this in my .gitconfig: [pretty] fixes = Fixes: %h (\"%s\") and in .git/config, [core] abbrev = 12 I'm working on the commit message now.
Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5
On Tue, Sep 19, 2023 at 04:43:41PM +0100, Matthew Wilcox wrote: > Could I ask you to try this patch? I'll follow up with another patch > later because I think I made another assumption that may not be valid. Ah, no, never mind. I thought we could start in the middle of a folio, but we always start constructing the sg list from index 0 of the file, so we always start at the first page of a folio. If this patch solves your problem, then I think we're done.
Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5
On Tue, Sep 19, 2023 at 10:26:42AM +0200, Oleksandr Natalenko wrote: > Andrzej asked me to try to revert commits 0b62af28f249, e0b72c14d8dc and > 1e0877d58b1e, and reverting those fixed the i915 crash for me. The > e0b72c14d8dc and 1e0877d58b1e commits look like just prerequisites, so I > assume 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a > folio_batch") is the culprit here. > > Could you please check this? > > Our conversation with Andrzej is available at drm-intel GitLab [1]. > > Thanks. > > [1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256 Wow, that is some great debugging. Thanks for all the time & effort you and others have invested. Sorry for breaking your system. You're almost right about the "prerequisites", but it's in the other direction; 0b62af28f249 is a prerequisite for the later two cleanups, so reverting all three is necessary to test 0b62af28f249. It seems to me that you've isolated the problem to constructing overly long sg lists. I didn't realise that was going to be a problem, so that's my fault. Could I ask you to try this patch? I'll follow up with another patch later because I think I made another assumption that may not be valid. diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 8f1633c3fb93..73a4a4eb29e0 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, st->nents = 0; for (i = 0; i < page_count; i++) { struct folio *folio; + unsigned long nr_pages; const unsigned int shrink[] = { I915_SHRINK_BOUND | I915_SHRINK_UNBOUND, 0, @@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, } } while (1); + nr_pages = min_t(unsigned long, + folio_nr_pages(folio), page_count - i); if (!i || sg->length >= max_segment || folio_pfn(folio) != next_pfn) { @@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, sg = sg_next(sg); st->nents++; - sg_set_folio(sg, folio, folio_size(folio), 0); + sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0); } else { /* XXX: could overflow? */ - sg->length += folio_size(folio); + sg->length += nr_pages * PAGE_SIZE; } - next_pfn = folio_pfn(folio) + folio_nr_pages(folio); - i += folio_nr_pages(folio) - 1; + next_pfn = folio_pfn(folio) + nr_pages; + i += nr_pages - 1; /* Check that the i965g/gm workaround works. */ GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x0010UL);
Re: [PATCH v6 1/4] drm: Use XArray instead of IDR for minors
On Tue, Aug 29, 2023 at 02:35:34PM -0400, James Zhu wrote: > > On 2023-08-29 14:33, Matthew Wilcox wrote: > > On Tue, Aug 29, 2023 at 01:34:22PM -0400, James Zhu wrote: > > > > > > @@ -1067,7 +1055,7 @@ static void drm_core_exit(void) > > > > > > unregister_chrdev(DRM_MAJOR, "drm"); > > > > > > debugfs_remove(drm_debugfs_root); > > > > > > drm_sysfs_destroy(); > > > > > > - idr_destroy(&drm_minors_idr); > > > > > [JZ] Should we call xa_destroy instead here? > > > > We could, if we expect the xarray to potentially not be empty. > > > > From what I understand - all minors should be released at this point. > > > [JZ] In practice, adding destroy here will be better. > > Why do you say that? > [JZ] In case, the future, INIT adds something, need DESTROY to free > completely. That isn't going to happen.
Re: [PATCH v6 1/4] drm: Use XArray instead of IDR for minors
On Tue, Aug 29, 2023 at 01:34:22PM -0400, James Zhu wrote: > > > > @@ -1067,7 +1055,7 @@ static void drm_core_exit(void) > > > > unregister_chrdev(DRM_MAJOR, "drm"); > > > > debugfs_remove(drm_debugfs_root); > > > > drm_sysfs_destroy(); > > > > - idr_destroy(&drm_minors_idr); > > > [JZ] Should we call xa_destroy instead here? > > We could, if we expect the xarray to potentially not be empty. > > From what I understand - all minors should be released at this point. > [JZ] In practice, adding destroy here will be better. Why do you say that?
Re: [PATCH v2] fs: clean up usage of noop_dirty_folio
On Mon, Aug 28, 2023 at 03:54:49PM +0800, Xueshi Hu wrote: > In folio_mark_dirty(), it can automatically fallback to > noop_dirty_folio() if a_ops->dirty_folio is not registered. > > As anon_aops, dev_dax_aops and fb_deferred_io_aops becames empty, remove > them too. > > Signed-off-by: Xueshi Hu Reviewed-by: Matthew Wilcox (Oracle)
Re: [PATCH] fs: clean up usage of noop_dirty_folio
On Mon, Aug 21, 2023 at 01:16:43PM +0200, Jan Kara wrote: > On Sat 19-08-23 20:42:25, Xueshi Hu wrote: > > In folio_mark_dirty(), it will automatically fallback to > > noop_dirty_folio() if a_ops->dirty_folio is not registered. > > > > As anon_aops, dev_dax_aops and fb_deferred_io_aops becames empty, remove > > them too. > > > > Signed-off-by: Xueshi Hu > > Yeah, looks sensible to me but for some callbacks we are oscilating between > all users having to provide some callback and providing some default > behavior for NULL callback. I don't have a strong opinion either way so > feel free to add: > > Reviewed-by: Jan Kara > > But I guess let's see what Matthew thinks about this and what plans he has > so that we don't switch back again in the near future. Matthew? I was hoping Christoph would weigh in ;-) I don't have a strong feeling here, but it seems to me that a NULL ->dirty_folio() should mean "do the noop thing" rather than "do the buffer_head thing" or "do the filemap thing". In 0af573780b0b, the buffer_head default was removed. I think enough time has passed that we're OK to change what a NULL ->dirty_folio means (plus we also changed the name of ->set_page_dirty() to ->dirty_folio()) So Ack to the concept. One minor change I'd request: -bool noop_dirty_folio(struct address_space *mapping, struct folio *folio) +static bool noop_dirty_folio(struct address_space *mapping, struct folio *folio) { if (!folio_test_dirty(folio)) return !folio_test_set_dirty(folio); return false; } -EXPORT_SYMBOL(noop_dirty_folio); Please inline this into folio_mark_dirty() instead of calling it.
Re: [RESEND PATCH v10 25/25] dept: Track the potential waits of PG_{locked,writeback}
On Mon, Aug 21, 2023 at 12:46:37PM +0900, Byungchul Park wrote: > @@ -377,44 +421,88 @@ static __always_inline int Page##uname(struct page > *page) \ > #define SETPAGEFLAG(uname, lname, policy)\ > static __always_inline > \ > void folio_set_##lname(struct folio *folio) \ > -{ set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ > +{\ > + set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy));\ > + dept_page_set_bit(&folio->page, PG_##lname);\ The PG_locked and PG_writeback bits only actually exist in the folio; the ones in struct page are just legacy and never actually used. Perhaps we could make the APIs more folio-based and less page-based? > static __always_inline void SetPage##uname(struct page *page) > \ > -{ set_bit(PG_##lname, &policy(page, 1)->flags); } > +{\ > + set_bit(PG_##lname, &policy(page, 1)->flags); \ > + dept_page_set_bit(page, PG_##lname);\ > +} I don't think we ever call this for PG_writeback or PG_locked. If I'm wrong, we can probably fix that ;-) > static __always_inline void __SetPage##uname(struct page *page) > \ > -{ __set_bit(PG_##lname, &policy(page, 1)->flags); } > +{\ > + __set_bit(PG_##lname, &policy(page, 1)->flags); \ > + dept_page_set_bit(page, PG_##lname);\ > +} Umm. We do call __SetPageLocked() though ... I'll fix those up to be __set_folio_locked().
Re: [RESEND PATCH v10 08/25] dept: Apply sdt_might_sleep_{start,end}() to PG_{locked,writeback} wait
On Mon, Aug 21, 2023 at 12:46:20PM +0900, Byungchul Park wrote: > @@ -1219,6 +1220,9 @@ static inline bool folio_trylock_flag(struct folio > *folio, int bit_nr, > /* How many times do we accept lock stealing from under a waiter? */ > int sysctl_page_lock_unfairness = 5; > > +static struct dept_map __maybe_unused PG_locked_map = > DEPT_MAP_INITIALIZER(PG_locked_map, NULL); > +static struct dept_map __maybe_unused PG_writeback_map = > DEPT_MAP_INITIALIZER(PG_writeback_map, NULL); Hmm, why are these "maybe unused"? *digs*. Ah. Because sdt_might_sleep_start() becomes a no-op macro if DEPT is disabled. OK, the right way to handle this is #ifdef CONFIG_DEPT #define DEPT_MAP(name) static struct dept_map name = \ DEPT_MAP_INITIALIZER(name, NULL) #else #define DEPT_MAP(name) /* */ #endif And now DEPT takes up no space if disabled. /* */; is a somewhat unusual thing to see, but since this must work at top level, we can't use "do { } while (0)" like we usually do. Given where else this is likely to be used, i don't think it's going to be a problem ...
Re: [PATCH 03/13] scatterlist: Add sg_set_folio()
On Sun, Jul 30, 2023 at 09:57:06PM +0800, Zhu Yanjun wrote: > > 在 2023/7/30 19:18, Matthew Wilcox 写道: > > On Sun, Jul 30, 2023 at 07:01:26PM +0800, Zhu Yanjun wrote: > > > Does the following function have folio version? > > > > > > " > > > int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append, > > > struct page **pages, unsigned int n_pages, unsigned int offset, > > > unsigned long size, unsigned int max_segment, > > > unsigned int left_pages, gfp_t gfp_mask) > > > " > > No -- I haven't needed to convert anything that uses > > sg_alloc_append_table_from_pages() yet. It doesn't look like it should > > be _too_ hard to add a folio version. > > In many places, this function is used. So this function needs the folio > version. It's not used in very many places. But the first one that I see it used (drivers/infiniband/core/umem.c), you can't do a straightforward folio conversion: pinned = pin_user_pages_fast(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), gup_flags, page_list); ... ret = sg_alloc_append_table_from_pages( &umem->sgt_append, page_list, pinned, 0, pinned << PAGE_SHIFT, ib_dma_max_seg_size(device), npages, GFP_KERNEL); That can't be converted to folios. The GUP might start in the middle of the folio, and we have no way to communicate that. This particular usage really needs the phyr work that Jason is doing so we can efficiently communicate physically contiguous ranges from GUP to sg. > Another problem, after folio is used, I want to know the performance after > folio is implemented. > > How to make tests to get the performance? You know what you're working on ... I wouldn't know how best to test your code.
Re: [PATCH 03/13] scatterlist: Add sg_set_folio()
On Sun, Jul 30, 2023 at 07:01:26PM +0800, Zhu Yanjun wrote: > Does the following function have folio version? > > " > int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append, > struct page **pages, unsigned int n_pages, unsigned int offset, > unsigned long size, unsigned int max_segment, > unsigned int left_pages, gfp_t gfp_mask) > " No -- I haven't needed to convert anything that uses sg_alloc_append_table_from_pages() yet. It doesn't look like it should be _too_ hard to add a folio version.
[PATCH 09/13] net: Convert sunrpc from pagevec to folio_batch
Remove the last usage of pagevecs. There is a slight change here; we now free the folio_batch as soon as it fills up instead of freeing the folio_batch when we try to add a page to a full batch. This should have no effect in practice. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/sunrpc/svc.h | 2 +- net/sunrpc/svc.c | 10 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h index c2807e301790..f8751118c122 100644 --- a/include/linux/sunrpc/svc.h +++ b/include/linux/sunrpc/svc.h @@ -222,7 +222,7 @@ struct svc_rqst { struct page * *rq_next_page; /* next reply page to use */ struct page * *rq_page_end; /* one past the last page */ - struct pagevec rq_pvec; + struct folio_batch rq_fbatch; struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */ struct bio_vec rq_bvec[RPCSVC_MAXPAGES]; diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index e7c101290425..587811a002c9 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -640,7 +640,7 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node) if (!rqstp) return rqstp; - pagevec_init(&rqstp->rq_pvec); + folio_batch_init(&rqstp->rq_fbatch); __set_bit(RQ_BUSY, &rqstp->rq_flags); rqstp->rq_server = serv; @@ -851,9 +851,9 @@ bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page) } if (*rqstp->rq_next_page) { - if (!pagevec_space(&rqstp->rq_pvec)) - __pagevec_release(&rqstp->rq_pvec); - pagevec_add(&rqstp->rq_pvec, *rqstp->rq_next_page); + if (!folio_batch_add(&rqstp->rq_fbatch, + page_folio(*rqstp->rq_next_page))) + __folio_batch_release(&rqstp->rq_fbatch); } get_page(page); @@ -887,7 +887,7 @@ void svc_rqst_release_pages(struct svc_rqst *rqstp) void svc_rqst_free(struct svc_rqst *rqstp) { - pagevec_release(&rqstp->rq_pvec); + folio_batch_release(&rqstp->rq_fbatch); svc_release_buffer(rqstp); if (rqstp->rq_scratch_page) put_page(rqstp->rq_scratch_page); -- 2.39.2
[PATCH 12/13] mm: Remove references to pagevec
Most of these should just refer to the LRU cache rather than the data structure used to implement the LRU cache. Signed-off-by: Matthew Wilcox (Oracle) --- mm/huge_memory.c| 2 +- mm/khugepaged.c | 6 +++--- mm/ksm.c| 6 +++--- mm/memory.c | 6 +++--- mm/migrate_device.c | 2 +- mm/swap.c | 2 +- mm/truncate.c | 2 +- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e94fe292f30a..eb3678360b97 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1344,7 +1344,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) /* * See do_wp_page(): we can only reuse the folio exclusively if * there are no additional references. Note that we always drain -* the LRU pagevecs immediately after adding a THP. +* the LRU cache immediately after adding a THP. */ if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio) * folio_nr_pages(folio)) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 5ef1e08b2a06..3beb4ad2ee5e 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1051,7 +1051,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, if (pte) pte_unmap(pte); - /* Drain LRU add pagevec to remove extra pin on the swapped in pages */ + /* Drain LRU cache to remove extra pin on the swapped in pages */ if (swapped_in) lru_add_drain(); @@ -1972,7 +1972,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, result = SCAN_FAIL; goto xa_unlocked; } - /* drain pagevecs to help isolate_lru_page() */ + /* drain lru cache to help isolate_lru_page() */ lru_add_drain(); page = folio_file_page(folio, index); } else if (trylock_page(page)) { @@ -1988,7 +1988,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, page_cache_sync_readahead(mapping, &file->f_ra, file, index, end - index); - /* drain pagevecs to help isolate_lru_page() */ + /* drain lru cache to help isolate_lru_page() */ lru_add_drain(); page = find_lock_page(mapping, index); if (unlikely(page == NULL)) { diff --git a/mm/ksm.c b/mm/ksm.c index d995779dc1fe..ba266359da55 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -932,7 +932,7 @@ static int remove_stable_node(struct ksm_stable_node *stable_node) * The stable node did not yet appear stale to get_ksm_page(), * since that allows for an unmapped ksm page to be recognized * right up until it is freed; but the node is safe to remove. -* This page might be in a pagevec waiting to be freed, +* This page might be in an LRU cache waiting to be freed, * or it might be PageSwapCache (perhaps under writeback), * or it might have been removed from swapcache a moment ago. */ @@ -2303,8 +2303,8 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page) trace_ksm_start_scan(ksm_scan.seqnr, ksm_rmap_items); /* -* A number of pages can hang around indefinitely on per-cpu -* pagevecs, raised page count preventing write_protect_page +* A number of pages can hang around indefinitely in per-cpu +* LRU cache, raised page count preventing write_protect_page * from merging them. Though it doesn't really matter much, * it is puzzling to see some stuck in pages_volatile until * other activity jostles them out, and they also prevented diff --git a/mm/memory.c b/mm/memory.c index 9f2723749f55..d034c52071f4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3404,8 +3404,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) goto copy; if (!folio_test_lru(folio)) /* -* Note: We cannot easily detect+handle references from -* remote LRU pagevecs or references to LRU folios. +* We cannot easily detect+handle references from +* remote LRU caches or references to LRU folios. */ lru_add_drain(); if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio)) @@ -3883,7 +38
[PATCH 03/13] scatterlist: Add sg_set_folio()
This wrapper for sg_set_page() lets drivers add folios to a scatterlist more easily. We could, perhaps, do better by using a different page in the folio if offset is larger than UINT_MAX, but let's hope we get a better data structure than this before we need to care about such large folios. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/scatterlist.h | 24 1 file changed, 24 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index ec46d8e8e49d..77df3d7b18a6 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -141,6 +141,30 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page, sg->length = len; } +/** + * sg_set_folio - Set sg entry to point at given folio + * @sg: SG entry + * @folio: The folio + * @len:Length of data + * @offset: Offset into folio + * + * Description: + * Use this function to set an sg entry pointing at a folio, never assign + * the folio directly. We encode sg table information in the lower bits + * of the folio pointer. See sg_page() for looking up the page belonging + * to an sg entry. + * + **/ +static inline void sg_set_folio(struct scatterlist *sg, struct folio *folio, + size_t len, size_t offset) +{ + WARN_ON_ONCE(len > UINT_MAX); + WARN_ON_ONCE(offset > UINT_MAX); + sg_assign_page(sg, &folio->page); + sg->offset = offset; + sg->length = len; +} + static inline struct page *sg_page(struct scatterlist *sg) { #ifdef CONFIG_DEBUG_SG -- 2.39.2
[PATCH 04/13] i915: Convert shmem_sg_free_table() to use a folio_batch
Remove a few hidden compound_head() calls by converting the returned page to a folio once and using the folio APIs. We also only increment the refcount on the folio once instead of once for each page. Ideally, we would have a for_each_sgt_folio macro, but until then this will do. Signed-off-by: Matthew Wilcox (Oracle) --- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 55 +-- 1 file changed, 31 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 33d5d5178103..8f1633c3fb93 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -19,13 +19,13 @@ #include "i915_trace.h" /* - * Move pages to appropriate lru and release the pagevec, decrementing the - * ref count of those pages. + * Move folios to appropriate lru and release the batch, decrementing the + * ref count of those folios. */ -static void check_release_pagevec(struct pagevec *pvec) +static void check_release_folio_batch(struct folio_batch *fbatch) { - check_move_unevictable_pages(pvec); - __pagevec_release(pvec); + check_move_unevictable_folios(fbatch); + __folio_batch_release(fbatch); cond_resched(); } @@ -33,24 +33,29 @@ void shmem_sg_free_table(struct sg_table *st, struct address_space *mapping, bool dirty, bool backup) { struct sgt_iter sgt_iter; - struct pagevec pvec; + struct folio_batch fbatch; + struct folio *last = NULL; struct page *page; mapping_clear_unevictable(mapping); - pagevec_init(&pvec); + folio_batch_init(&fbatch); for_each_sgt_page(page, sgt_iter, st) { - if (dirty) - set_page_dirty(page); + struct folio *folio = page_folio(page); + if (folio == last) + continue; + last = folio; + if (dirty) + folio_mark_dirty(folio); if (backup) - mark_page_accessed(page); + folio_mark_accessed(folio); - if (!pagevec_add(&pvec, page)) - check_release_pagevec(&pvec); + if (!folio_batch_add(&fbatch, folio)) + check_release_folio_batch(&fbatch); } - if (pagevec_count(&pvec)) - check_release_pagevec(&pvec); + if (fbatch.nr) + check_release_folio_batch(&fbatch); sg_free_table(st); } @@ -63,8 +68,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, unsigned int page_count; /* restricted by sg_alloc_table */ unsigned long i; struct scatterlist *sg; - struct page *page; - unsigned long last_pfn = 0; /* suppress gcc warning */ + unsigned long next_pfn = 0; /* suppress gcc warning */ gfp_t noreclaim; int ret; @@ -95,6 +99,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, sg = st->sgl; st->nents = 0; for (i = 0; i < page_count; i++) { + struct folio *folio; const unsigned int shrink[] = { I915_SHRINK_BOUND | I915_SHRINK_UNBOUND, 0, @@ -103,12 +108,12 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, do { cond_resched(); - page = shmem_read_mapping_page_gfp(mapping, i, gfp); - if (!IS_ERR(page)) + folio = shmem_read_folio_gfp(mapping, i, gfp); + if (!IS_ERR(folio)) break; if (!*s) { - ret = PTR_ERR(page); + ret = PTR_ERR(folio); goto err_sg; } @@ -147,19 +152,21 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, if (!i || sg->length >= max_segment || - page_to_pfn(page) != last_pfn + 1) { + folio_pfn(folio) != next_pfn) { if (i) sg = sg_next(sg); st->nents++; - sg_set_page(sg, page, PAGE_SIZE, 0); + sg_set_folio(sg, folio, folio_size(folio), 0); } else { - sg->length += PAGE_SIZE; + /* XXX: could overflow? */ + sg->length += folio_size(folio); } - last_pfn = page_to_pfn(page); + next_pfn = folio_pfn(folio) + folio_nr_pages(folio); + i += folio_nr_pag
[PATCH 11/13] mm: Rename invalidate_mapping_pagevec to mapping_try_invalidate
We don't use pagevecs for the LRU cache any more, and we don't know that the failed invalidations were due to the folio being in an LRU cache. So rename it to be more accurate. Signed-off-by: Matthew Wilcox (Oracle) --- mm/fadvise.c | 16 +++- mm/internal.h | 4 ++-- mm/truncate.c | 25 - 3 files changed, 21 insertions(+), 24 deletions(-) diff --git a/mm/fadvise.c b/mm/fadvise.c index fb7c5f43fd2a..f684ffd7f9c9 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -143,7 +143,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) } if (end_index >= start_index) { - unsigned long nr_pagevec = 0; + unsigned long nr_failed = 0; /* * It's common to FADV_DONTNEED right after @@ -156,17 +156,15 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) */ lru_add_drain(); - invalidate_mapping_pagevec(mapping, - start_index, end_index, - &nr_pagevec); + mapping_try_invalidate(mapping, start_index, end_index, + &nr_failed); /* -* If fewer pages were invalidated than expected then -* it is possible that some of the pages were on -* a per-cpu pagevec for a remote CPU. Drain all -* pagevecs and try again. +* The failures may be due to the folio being +* in the LRU cache of a remote CPU. Drain all +* caches and try again. */ - if (nr_pagevec) { + if (nr_failed) { lru_add_drain_all(); invalidate_mapping_pages(mapping, start_index, end_index); diff --git a/mm/internal.h b/mm/internal.h index 119a8241f9d9..2ff7587b4045 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -133,8 +133,8 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio); bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end); long invalidate_inode_page(struct page *page); -unsigned long invalidate_mapping_pagevec(struct address_space *mapping, - pgoff_t start, pgoff_t end, unsigned long *nr_pagevec); +unsigned long mapping_try_invalidate(struct address_space *mapping, + pgoff_t start, pgoff_t end, unsigned long *nr_failed); /** * folio_evictable - Test whether a folio is evictable. diff --git a/mm/truncate.c b/mm/truncate.c index 86de31ed4d32..4a917570887f 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -486,18 +486,17 @@ void truncate_inode_pages_final(struct address_space *mapping) EXPORT_SYMBOL(truncate_inode_pages_final); /** - * invalidate_mapping_pagevec - Invalidate all the unlocked pages of one inode - * @mapping: the address_space which holds the pages to invalidate + * mapping_try_invalidate - Invalidate all the evictable folios of one inode + * @mapping: the address_space which holds the folios to invalidate * @start: the offset 'from' which to invalidate * @end: the offset 'to' which to invalidate (inclusive) - * @nr_pagevec: invalidate failed page number for caller + * @nr_failed: How many folio invalidations failed * - * This helper is similar to invalidate_mapping_pages(), except that it accounts - * for pages that are likely on a pagevec and counts them in @nr_pagevec, which - * will be used by the caller. + * This function is similar to invalidate_mapping_pages(), except that it + * returns the number of folios which could not be evicted in @nr_failed. */ -unsigned long invalidate_mapping_pagevec(struct address_space *mapping, - pgoff_t start, pgoff_t end, unsigned long *nr_pagevec) +unsigned long mapping_try_invalidate(struct address_space *mapping, + pgoff_t start, pgoff_t end, unsigned long *nr_failed) { pgoff_t indices[PAGEVEC_SIZE]; struct folio_batch fbatch; @@ -527,9 +526,9 @@ unsigned long invalidate_mapping_pagevec(struct address_space *mapping, */ if (!ret) { deactivate_file_folio(folio); - /* It is likely on the pagevec of a remote CPU */ - if (nr_pagevec) - (*nr_pagevec)++; + /* Likely in the lru cache of a remote CPU */ + if (nr_failed) +
[PATCH 02/13] mm: Add __folio_batch_release()
This performs the same role as __pagevec_release(), ie skipping the check for batch length of 0. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/pagevec.h | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h index f582f7213ea5..42aad53e382e 100644 --- a/include/linux/pagevec.h +++ b/include/linux/pagevec.h @@ -127,9 +127,15 @@ static inline unsigned folio_batch_add(struct folio_batch *fbatch, return fbatch_space(fbatch); } +static inline void __folio_batch_release(struct folio_batch *fbatch) +{ + __pagevec_release((struct pagevec *)fbatch); +} + static inline void folio_batch_release(struct folio_batch *fbatch) { - pagevec_release((struct pagevec *)fbatch); + if (folio_batch_count(fbatch)) + __folio_batch_release(fbatch); } void folio_batch_remove_exceptionals(struct folio_batch *fbatch); -- 2.39.2
[PATCH 06/13] mm: Remove check_move_unevictable_pages()
All callers have now been converted to call check_move_unevictable_folios(). Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/swap.h | 1 - mm/vmscan.c | 17 - 2 files changed, 18 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ce7e82cf787f..456546443f1f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -439,7 +439,6 @@ static inline bool node_reclaim_enabled(void) } void check_move_unevictable_folios(struct folio_batch *fbatch); -void check_move_unevictable_pages(struct pagevec *pvec); extern void __meminit kswapd_run(int nid); extern void __meminit kswapd_stop(int nid); diff --git a/mm/vmscan.c b/mm/vmscan.c index 45d17c7cc555..f8dd1db15897 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -8074,23 +8074,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) } #endif -void check_move_unevictable_pages(struct pagevec *pvec) -{ - struct folio_batch fbatch; - unsigned i; - - folio_batch_init(&fbatch); - for (i = 0; i < pvec->nr; i++) { - struct page *page = pvec->pages[i]; - - if (PageTransTail(page)) - continue; - folio_batch_add(&fbatch, page_folio(page)); - } - check_move_unevictable_folios(&fbatch); -} -EXPORT_SYMBOL_GPL(check_move_unevictable_pages); - /** * check_move_unevictable_folios - Move evictable folios to appropriate zone * lru list -- 2.39.2
[PATCH 00/13] Remove pagevecs
We're almost done with the pagevec -> folio_batch conversion. Finish the job. Matthew Wilcox (Oracle) (13): afs: Convert pagevec to folio_batch in afs_extend_writeback() mm: Add __folio_batch_release() scatterlist: Add sg_set_folio() i915: Convert shmem_sg_free_table() to use a folio_batch drm: Convert drm_gem_put_pages() to use a folio_batch mm: Remove check_move_unevictable_pages() pagevec: Rename fbatch_count() i915: Convert i915_gpu_error to use a folio_batch net: Convert sunrpc from pagevec to folio_batch mm: Remove struct pagevec mm: Rename invalidate_mapping_pagevec to mapping_try_invalidate mm: Remove references to pagevec mm: Remove unnecessary pagevec includes drivers/gpu/drm/drm_gem.c | 68 +-- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 55 ++ drivers/gpu/drm/i915/i915_gpu_error.c | 50 - fs/afs/write.c| 16 +++--- include/linux/pagevec.h | 67 +++--- include/linux/scatterlist.h | 24 include/linux/sunrpc/svc.h| 2 +- include/linux/swap.h | 1 - mm/fadvise.c | 17 +++--- mm/huge_memory.c | 2 +- mm/internal.h | 4 +- mm/khugepaged.c | 6 +- mm/ksm.c | 6 +- mm/memory.c | 6 +- mm/memory_hotplug.c | 1 - mm/migrate.c | 1 - mm/migrate_device.c | 2 +- mm/readahead.c| 1 - mm/swap.c | 20 +++ mm/swap_state.c | 1 - mm/truncate.c | 27 + mm/vmscan.c | 17 -- net/sunrpc/svc.c | 10 ++-- 23 files changed, 185 insertions(+), 219 deletions(-) -- 2.39.2
[PATCH 05/13] drm: Convert drm_gem_put_pages() to use a folio_batch
Remove a few hidden compound_head() calls by converting the returned page to a folio once and using the folio APIs. Signed-off-by: Matthew Wilcox (Oracle) --- drivers/gpu/drm/drm_gem.c | 68 ++- 1 file changed, 39 insertions(+), 29 deletions(-) diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 1a5a2cd0d4ec..78dcae201cc6 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -496,13 +496,13 @@ int drm_gem_create_mmap_offset(struct drm_gem_object *obj) EXPORT_SYMBOL(drm_gem_create_mmap_offset); /* - * Move pages to appropriate lru and release the pagevec, decrementing the - * ref count of those pages. + * Move folios to appropriate lru and release the folios, decrementing the + * ref count of those folios. */ -static void drm_gem_check_release_pagevec(struct pagevec *pvec) +static void drm_gem_check_release_batch(struct folio_batch *fbatch) { - check_move_unevictable_pages(pvec); - __pagevec_release(pvec); + check_move_unevictable_folios(fbatch); + __folio_batch_release(fbatch); cond_resched(); } @@ -534,10 +534,10 @@ static void drm_gem_check_release_pagevec(struct pagevec *pvec) struct page **drm_gem_get_pages(struct drm_gem_object *obj) { struct address_space *mapping; - struct page *p, **pages; - struct pagevec pvec; - int i, npages; - + struct page **pages; + struct folio *folio; + struct folio_batch fbatch; + int i, j, npages; if (WARN_ON(!obj->filp)) return ERR_PTR(-EINVAL); @@ -559,11 +559,14 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj) mapping_set_unevictable(mapping); - for (i = 0; i < npages; i++) { - p = shmem_read_mapping_page(mapping, i); - if (IS_ERR(p)) + i = 0; + while (i < npages) { + folio = shmem_read_folio_gfp(mapping, i, + mapping_gfp_mask(mapping)); + if (IS_ERR(folio)) goto fail; - pages[i] = p; + for (j = 0; j < folio_nr_pages(folio); j++, i++) + pages[i] = folio_file_page(folio, i); /* Make sure shmem keeps __GFP_DMA32 allocated pages in the * correct region during swapin. Note that this requires @@ -571,23 +574,26 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj) * so shmem can relocate pages during swapin if required. */ BUG_ON(mapping_gfp_constraint(mapping, __GFP_DMA32) && - (page_to_pfn(p) >= 0x0010UL)); + (folio_pfn(folio) >= 0x0010UL)); } return pages; fail: mapping_clear_unevictable(mapping); - pagevec_init(&pvec); - while (i--) { - if (!pagevec_add(&pvec, pages[i])) - drm_gem_check_release_pagevec(&pvec); + folio_batch_init(&fbatch); + j = 0; + while (j < i) { + struct folio *f = page_folio(pages[j]); + if (!folio_batch_add(&fbatch, f)) + drm_gem_check_release_batch(&fbatch); + j += folio_nr_pages(f); } - if (pagevec_count(&pvec)) - drm_gem_check_release_pagevec(&pvec); + if (fbatch.nr) + drm_gem_check_release_batch(&fbatch); kvfree(pages); - return ERR_CAST(p); + return ERR_CAST(folio); } EXPORT_SYMBOL(drm_gem_get_pages); @@ -603,7 +609,7 @@ void drm_gem_put_pages(struct drm_gem_object *obj, struct page **pages, { int i, npages; struct address_space *mapping; - struct pagevec pvec; + struct folio_batch fbatch; mapping = file_inode(obj->filp)->i_mapping; mapping_clear_unevictable(mapping); @@ -616,23 +622,27 @@ void drm_gem_put_pages(struct drm_gem_object *obj, struct page **pages, npages = obj->size >> PAGE_SHIFT; - pagevec_init(&pvec); + folio_batch_init(&fbatch); for (i = 0; i < npages; i++) { + struct folio *folio; + if (!pages[i]) continue; + folio = page_folio(pages[i]); if (dirty) - set_page_dirty(pages[i]); + folio_mark_dirty(folio); if (accessed) - mark_page_accessed(pages[i]); + folio_mark_accessed(folio); /* Undo the reference we took when populating the table */ - if (!pagevec_add(&pvec, pages[i])) - drm_gem_check_release_pagevec(&pvec); + if (!folio_batch_add(&fbatch, folio)) + drm_gem_check_release_ba
[PATCH 08/13] i915: Convert i915_gpu_error to use a folio_batch
Remove one of the last remaining users of pagevec. Signed-off-by: Matthew Wilcox (Oracle) --- drivers/gpu/drm/i915/i915_gpu_error.c | 50 +-- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index ec368e700235..0c38bfb60c9a 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -187,64 +187,64 @@ i915_error_printer(struct drm_i915_error_state_buf *e) } /* single threaded page allocator with a reserved stash for emergencies */ -static void pool_fini(struct pagevec *pv) +static void pool_fini(struct folio_batch *fbatch) { - pagevec_release(pv); + folio_batch_release(fbatch); } -static int pool_refill(struct pagevec *pv, gfp_t gfp) +static int pool_refill(struct folio_batch *fbatch, gfp_t gfp) { - while (pagevec_space(pv)) { - struct page *p; + while (folio_batch_space(fbatch)) { + struct folio *folio; - p = alloc_page(gfp); - if (!p) + folio = folio_alloc(gfp, 0); + if (!folio) return -ENOMEM; - pagevec_add(pv, p); + folio_batch_add(fbatch, folio); } return 0; } -static int pool_init(struct pagevec *pv, gfp_t gfp) +static int pool_init(struct folio_batch *fbatch, gfp_t gfp) { int err; - pagevec_init(pv); + folio_batch_init(fbatch); - err = pool_refill(pv, gfp); + err = pool_refill(fbatch, gfp); if (err) - pool_fini(pv); + pool_fini(fbatch); return err; } -static void *pool_alloc(struct pagevec *pv, gfp_t gfp) +static void *pool_alloc(struct folio_batch *fbatch, gfp_t gfp) { - struct page *p; + struct folio *folio; - p = alloc_page(gfp); - if (!p && pagevec_count(pv)) - p = pv->pages[--pv->nr]; + folio = folio_alloc(gfp, 0); + if (!folio && folio_batch_count(fbatch)) + folio = fbatch->folios[--fbatch->nr]; - return p ? page_address(p) : NULL; + return folio ? folio_address(folio) : NULL; } -static void pool_free(struct pagevec *pv, void *addr) +static void pool_free(struct folio_batch *fbatch, void *addr) { - struct page *p = virt_to_page(addr); + struct folio *folio = virt_to_folio(addr); - if (pagevec_space(pv)) - pagevec_add(pv, p); + if (folio_batch_space(fbatch)) + folio_batch_add(fbatch, folio); else - __free_page(p); + folio_put(folio); } #ifdef CONFIG_DRM_I915_COMPRESS_ERROR struct i915_vma_compress { - struct pagevec pool; + struct folio_batch pool; struct z_stream_s zstream; void *tmp; }; @@ -381,7 +381,7 @@ static void err_compression_marker(struct drm_i915_error_state_buf *m) #else struct i915_vma_compress { - struct pagevec pool; + struct folio_batch pool; }; static bool compress_init(struct i915_vma_compress *c) -- 2.39.2
[PATCH 01/13] afs: Convert pagevec to folio_batch in afs_extend_writeback()
Removes a folio->page->folio conversion for each folio that's involved. More importantly, removes one of the last few uses of a pagevec. Signed-off-by: Matthew Wilcox (Oracle) --- fs/afs/write.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/afs/write.c b/fs/afs/write.c index 18ccb613dff8..6e68c70d0b22 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -465,7 +465,7 @@ static void afs_extend_writeback(struct address_space *mapping, bool caching, unsigned int *_len) { - struct pagevec pvec; + struct folio_batch fbatch; struct folio *folio; unsigned long priv; unsigned int psize, filler = 0; @@ -476,7 +476,7 @@ static void afs_extend_writeback(struct address_space *mapping, unsigned int i; XA_STATE(xas, &mapping->i_pages, index); - pagevec_init(&pvec); + folio_batch_init(&fbatch); do { /* Firstly, we gather up a batch of contiguous dirty pages @@ -535,7 +535,7 @@ static void afs_extend_writeback(struct address_space *mapping, stop = false; index += folio_nr_pages(folio); - if (!pagevec_add(&pvec, &folio->page)) + if (!folio_batch_add(&fbatch, folio)) break; if (stop) break; @@ -545,14 +545,14 @@ static void afs_extend_writeback(struct address_space *mapping, xas_pause(&xas); rcu_read_unlock(); - /* Now, if we obtained any pages, we can shift them to being + /* Now, if we obtained any folios, we can shift them to being * writable and mark them for caching. */ - if (!pagevec_count(&pvec)) + if (!folio_batch_count(&fbatch)) break; - for (i = 0; i < pagevec_count(&pvec); i++) { - folio = page_folio(pvec.pages[i]); + for (i = 0; i < folio_batch_count(&fbatch); i++) { + folio = fbatch.folios[i]; trace_afs_folio_dirty(vnode, tracepoint_string("store+"), folio); if (!folio_clear_dirty_for_io(folio)) @@ -565,7 +565,7 @@ static void afs_extend_writeback(struct address_space *mapping, folio_unlock(folio); } - pagevec_release(&pvec); + folio_batch_release(&fbatch); cond_resched(); } while (!stop); -- 2.39.2
[PATCH 07/13] pagevec: Rename fbatch_count()
This should always have been called folio_batch_count(). Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/pagevec.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h index 42aad53e382e..3a9d29dd28a3 100644 --- a/include/linux/pagevec.h +++ b/include/linux/pagevec.h @@ -105,7 +105,7 @@ static inline unsigned int folio_batch_count(struct folio_batch *fbatch) return fbatch->nr; } -static inline unsigned int fbatch_space(struct folio_batch *fbatch) +static inline unsigned int folio_batch_space(struct folio_batch *fbatch) { return PAGEVEC_SIZE - fbatch->nr; } @@ -124,7 +124,7 @@ static inline unsigned folio_batch_add(struct folio_batch *fbatch, struct folio *folio) { fbatch->folios[fbatch->nr++] = folio; - return fbatch_space(fbatch); + return folio_batch_space(fbatch); } static inline void __folio_batch_release(struct folio_batch *fbatch) -- 2.39.2
[PATCH 13/13] mm: Remove unnecessary pagevec includes
These files no longer need pagevec.h, mostly due to function declarations being moved out of it. Signed-off-by: Matthew Wilcox (Oracle) --- mm/fadvise.c| 1 - mm/memory_hotplug.c | 1 - mm/migrate.c| 1 - mm/readahead.c | 1 - mm/swap_state.c | 1 - 5 files changed, 5 deletions(-) diff --git a/mm/fadvise.c b/mm/fadvise.c index f684ffd7f9c9..6c39d42f16dc 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -14,7 +14,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 35db4108bb15..3f231cf1b410 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -13,7 +13,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/migrate.c b/mm/migrate.c index ce35afdbc1e3..ee26f4a962ef 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -21,7 +21,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/readahead.c b/mm/readahead.c index 47afbca1d122..a9c999aa19af 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -120,7 +120,6 @@ #include #include #include -#include #include #include #include diff --git a/mm/swap_state.c b/mm/swap_state.c index 4a5c7b748051..f8ea7015bad4 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -16,7 +16,6 @@ #include #include #include -#include #include #include #include -- 2.39.2
[PATCH 10/13] mm: Remove struct pagevec
All users are now converted to use the folio_batch so we can get rid of this data structure. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/pagevec.h | 63 +++-- mm/swap.c | 18 ++-- 2 files changed, 13 insertions(+), 68 deletions(-) diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h index 3a9d29dd28a3..87cc678adc85 100644 --- a/include/linux/pagevec.h +++ b/include/linux/pagevec.h @@ -3,65 +3,18 @@ * include/linux/pagevec.h * * In many places it is efficient to batch an operation up against multiple - * pages. A pagevec is a multipage container which is used for that. + * folios. A folio_batch is a container which is used for that. */ #ifndef _LINUX_PAGEVEC_H #define _LINUX_PAGEVEC_H -#include +#include -/* 15 pointers + header align the pagevec structure to a power of two */ +/* 15 pointers + header align the folio_batch structure to a power of two */ #define PAGEVEC_SIZE 15 -struct page; struct folio; -struct address_space; - -/* Layout must match folio_batch */ -struct pagevec { - unsigned char nr; - bool percpu_pvec_drained; - struct page *pages[PAGEVEC_SIZE]; -}; - -void __pagevec_release(struct pagevec *pvec); - -static inline void pagevec_init(struct pagevec *pvec) -{ - pvec->nr = 0; - pvec->percpu_pvec_drained = false; -} - -static inline void pagevec_reinit(struct pagevec *pvec) -{ - pvec->nr = 0; -} - -static inline unsigned pagevec_count(struct pagevec *pvec) -{ - return pvec->nr; -} - -static inline unsigned pagevec_space(struct pagevec *pvec) -{ - return PAGEVEC_SIZE - pvec->nr; -} - -/* - * Add a page to a pagevec. Returns the number of slots still available. - */ -static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page) -{ - pvec->pages[pvec->nr++] = page; - return pagevec_space(pvec); -} - -static inline void pagevec_release(struct pagevec *pvec) -{ - if (pagevec_count(pvec)) - __pagevec_release(pvec); -} /** * struct folio_batch - A collection of folios. @@ -78,11 +31,6 @@ struct folio_batch { struct folio *folios[PAGEVEC_SIZE]; }; -/* Layout must match pagevec */ -static_assert(sizeof(struct pagevec) == sizeof(struct folio_batch)); -static_assert(offsetof(struct pagevec, pages) == - offsetof(struct folio_batch, folios)); - /** * folio_batch_init() - Initialise a batch of folios * @fbatch: The folio batch. @@ -127,10 +75,7 @@ static inline unsigned folio_batch_add(struct folio_batch *fbatch, return folio_batch_space(fbatch); } -static inline void __folio_batch_release(struct folio_batch *fbatch) -{ - __pagevec_release((struct pagevec *)fbatch); -} +void __folio_batch_release(struct folio_batch *pvec); static inline void folio_batch_release(struct folio_batch *fbatch) { diff --git a/mm/swap.c b/mm/swap.c index 423199ee8478..10348c1cf9c5 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1044,25 +1044,25 @@ void release_pages(release_pages_arg arg, int nr) EXPORT_SYMBOL(release_pages); /* - * The pages which we're about to release may be in the deferred lru-addition + * The folios which we're about to release may be in the deferred lru-addition * queues. That would prevent them from really being freed right now. That's - * OK from a correctness point of view but is inefficient - those pages may be + * OK from a correctness point of view but is inefficient - those folios may be * cache-warm and we want to give them back to the page allocator ASAP. * - * So __pagevec_release() will drain those queues here. + * So __folio_batch_release() will drain those queues here. * folio_batch_move_lru() calls folios_put() directly to avoid * mutual recursion. */ -void __pagevec_release(struct pagevec *pvec) +void __folio_batch_release(struct folio_batch *fbatch) { - if (!pvec->percpu_pvec_drained) { + if (!fbatch->percpu_pvec_drained) { lru_add_drain(); - pvec->percpu_pvec_drained = true; + fbatch->percpu_pvec_drained = true; } - release_pages(pvec->pages, pagevec_count(pvec)); - pagevec_reinit(pvec); + release_pages(fbatch->folios, folio_batch_count(fbatch)); + folio_batch_reinit(fbatch); } -EXPORT_SYMBOL(__pagevec_release); +EXPORT_SYMBOL(__folio_batch_release); /** * folio_batch_remove_exceptionals() - Prune non-folios from a batch. -- 2.39.2
Re: [PATCH v9 02/14] mm: move page zone helpers from mm.h to mmzone.h
On Thu, Jun 15, 2023 at 03:33:12PM -0400, Peter Xu wrote: > My question is whether page_zonenum() is ready for taking all kinds of tail > pages? > > Zone device tail pages all look fine, per memmap_init_zone_device(). The > question was other kinds of usual compound pages, like either thp or > hugetlb. IIUC page->flags can be uninitialized for those tail pages. I don't think that's true. It's my understanding that page->flags is initialised for all pages in memmap at boot / hotplug / delayed-init time. So you can check things like zone, node, etc on literally any page. Contrariwise, those flags are not available in tail pages for use by the entity that has allocated a compound page / large folio. Also, I don't believe zone device pages support compound allocation. I think they're always allocated as order-0.
Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE
On Mon, Feb 27, 2023 at 06:39:33PM +0100, Danilo Krummrich wrote: > On 2/21/23 19:31, Matthew Wilcox wrote: > > Lockdep will shout at you if you get it wrong ;-) But you can safely > > take the spinlock before calling mas_store_gfp(GFP_KERNEL) because > > mas_nomem() knows to drop the lock before doing a sleeping allocation. > > Essentially you're open-coding mtree_store_range() but doing your own > > thing in addition to the store. > > As already mentioned, I went with your advice to just take the maple tree's > internal spinlock within the GPUVA manager and leave all the other locking > to the drivers as intended. > > However, I run into the case that lockdep shouts at me for not taking the > spinlock before calling mas_find() in the iterator macros. > > Now, I definitely don't want to let the drivers take the maple tree's > spinlock before they use the iterator macro. Of course, drivers shouldn't > even know about the underlying maple tree of the GPUVA manager. > > One way to make lockdep happy in this case seems to be taking the spinlock > right before mas_find() and drop it right after for each iteration. While we don't have any lockdep checking of this, you really shouldn't be using an iterator if you're going to drop the lock between invocations. The iterator points into the tree, so you need to invalidate the iterator any time you drop the lock. You don't have to use a spinlock to do a read iteration. You can just take the rcu_read_lock() around your iteration, as long as you can tolerate the mild inconsistencies that RCU permits.
Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE
On Wed, Feb 22, 2023 at 05:11:34PM +0100, Danilo Krummrich wrote: > On 2/21/23 19:31, Matthew Wilcox wrote: > > on tue, feb 21, 2023 at 03:37:49pm +0100, danilo krummrich wrote: > > > It feels a bit weird that I, as a user of the API, would need to lock > > > certain > > > (or all?) mas_*() functions with the internal spinlock in order to protect > > > (future) internal features of the tree, such as the slab cache > > > defragmentation > > > you mentioned. Because from my perspective, as the generic component that > > > tells > > > it's users (the drivers) to take care of locking VA space operations (and > > > hence > > > tree operations) I don't have an own purpose of this internal spinlock, > > > right? > > > > You don't ... but we can't know that. > > Thanks for the clarification. I think I should now know what to for the > GPUVA manager in terms of locking the maple tree in general. > > Though I still have very limited insights on the maple tree I want to share > some further thoughts. > > From what I got so far it really seems to me that it would be better to just > take the internal spinlock for both APIs (normal and advanced) whenever you > need to internally. No. Really, no. The point of the advanced API is that it's a toolbox for doing the operation you want, but isn't a generic enough operation to be part of the normal API. To take an example from the radix tree days, in the page cache, we need to walk a range of the tree, looking for any entries that are marked as DIRTY, clear the DIRTY mark and set the TOWRITE mark. There was a horrendous function called radix_tree_range_tag_if_tagged() which did exactly this. Now look at the implementation of tag_pages_for_writeback(); it's a simple loop over a range with an occasional pause to check whether we need to reschedule. But that means you need to know how to use the toolbox. Some of the tools are dangerous and you can cut yourself on them. > Another plus would probably be maintainability. Once you got quite a few > maple tree users using external locks (either in the sense of calling I don't want maple tree users using external locks. That exists because it was the only reasonable way of converting the VMA tree from the rbtree to the maple tree. I intend to get rid of mt_set_external_lock(). The VMAs are eventually going to be protected by the internal spinlock.
Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE
On Tue, Feb 21, 2023 at 03:37:49PM +0100, Danilo Krummrich wrote: > On Mon, Feb 20, 2023 at 08:33:35PM +0000, Matthew Wilcox wrote: > > On Mon, Feb 20, 2023 at 06:06:03PM +0100, Danilo Krummrich wrote: > > > On 2/20/23 16:10, Matthew Wilcox wrote: > > > > This is why we like people to use the spinlock embedded in the tree. > > > > There's nothing for the user to care about. If the access really is > > > > serialised, acquiring/releasing the uncontended spinlock is a minimal > > > > cost compared to all the other things that will happen while modifying > > > > the tree. > > > > > > I think as for the users of the GPUVA manager we'd have two cases: > > > > > > 1) Accesses to the manager (and hence the tree) are serialized, no lock > > > needed. > > > > > > 2) Multiple operations on the tree must be locked in order to make them > > > appear atomic. > > > > Could you give an example here of what you'd like to do? Ideally > > something complicated so I don't say "Oh, you can just do this" when > > there's a more complex example for which "this" won't work. I'm sure > > that's embedded somewhere in the next 20-odd patches, but it's probably > > quicker for you to describe in terms of tree operations that have to > > appear atomic than for me to try to figure it out. > > > > Absolutely, not gonna ask you to read all of that. :-) > > One thing the GPUVA manager does is to provide drivers the (sub-)operations > that need to be processed in order to fulfill a map or unmap request from > userspace. For instance, when userspace asks the driver to map some memory > the GPUVA manager calculates which existing mappings must be removed, split up > or can be merged with the newly requested mapping. > > A driver has two ways to fetch those operations from the GPUVA manager. It can > either obtain a list of operations or receive a callback for each operation > generated by the GPUVA manager. > > In both cases the GPUVA manager walks the maple tree, which keeps track of > existing mappings, for the given range in __drm_gpuva_sm_map() (only > considering > the map case, since the unmap case is a subset basically). For each mapping > found in the given range the driver, as mentioned, either receives a callback > or > a list entry is added to the list of operations. > > Typically, for each operation / callback one entry within the maple tree is > removed and, optionally at the beginning and end of a new mapping's range, a > new entry is inserted. An of course, as the last operation, there is the new > mapping itself to insert. > > The GPUVA manager delegates locking responsibility to the drivers. Typically, > a driver either serializes access to the VA space managed by the GPUVA manager > (no lock needed) or need to lock the processing of a full set of operations > generated by the GPUVA manager. OK, that all makes sense. It does make sense to have the driver use its own mutex and then take the spinlock inside the maple tree code. It shouldn't ever be contended. > > > In either case the embedded spinlock wouldn't be useful, we'd either need > > > an > > > external lock or no lock at all. > > > > > > If there are any internal reasons why specific tree operations must be > > > mutually excluded (such as those you explain below), wouldn't it make more > > > sense to always have the internal lock and, optionally, allow users to > > > specify an external lock additionally? > > > > So the way this works for the XArray, which is a little older than the > > Maple tree, is that we always use the internal spinlock for > > modifications (possibly BH or IRQ safe), and if someone wants to > > use an external mutex to make some callers atomic with respect to each > > other, they're free to do so. In that case, the XArray doesn't check > > the user's external locking at all, because it really can't know. > > > > I'd advise taking that approach; if there's really no way to use the > > internal spinlock to make your complicated updates appear atomic > > then just let the maple tree use its internal spinlock, and you can > > also use your external mutex however you like. > > > > That sounds like the right thing to do. > > However, I'm using the advanced API of the maple tree (and that's the reason > why the above example appears a little more detailed than needed) because I > think with the normal API I can't insert / remove tree entries
Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE
On Mon, Feb 20, 2023 at 06:06:03PM +0100, Danilo Krummrich wrote: > On 2/20/23 16:10, Matthew Wilcox wrote: > > This is why we like people to use the spinlock embedded in the tree. > > There's nothing for the user to care about. If the access really is > > serialised, acquiring/releasing the uncontended spinlock is a minimal > > cost compared to all the other things that will happen while modifying > > the tree. > > I think as for the users of the GPUVA manager we'd have two cases: > > 1) Accesses to the manager (and hence the tree) are serialized, no lock > needed. > > 2) Multiple operations on the tree must be locked in order to make them > appear atomic. Could you give an example here of what you'd like to do? Ideally something complicated so I don't say "Oh, you can just do this" when there's a more complex example for which "this" won't work. I'm sure that's embedded somewhere in the next 20-odd patches, but it's probably quicker for you to describe in terms of tree operations that have to appear atomic than for me to try to figure it out. > In either case the embedded spinlock wouldn't be useful, we'd either need an > external lock or no lock at all. > > If there are any internal reasons why specific tree operations must be > mutually excluded (such as those you explain below), wouldn't it make more > sense to always have the internal lock and, optionally, allow users to > specify an external lock additionally? So the way this works for the XArray, which is a little older than the Maple tree, is that we always use the internal spinlock for modifications (possibly BH or IRQ safe), and if someone wants to use an external mutex to make some callers atomic with respect to each other, they're free to do so. In that case, the XArray doesn't check the user's external locking at all, because it really can't know. I'd advise taking that approach; if there's really no way to use the internal spinlock to make your complicated updates appear atomic then just let the maple tree use its internal spinlock, and you can also use your external mutex however you like.
Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE
On Mon, Feb 20, 2023 at 03:00:59PM +0100, Danilo Krummrich wrote: > On 2/17/23 20:38, Matthew Wilcox wrote: > > On Fri, Feb 17, 2023 at 02:44:10PM +0100, Danilo Krummrich wrote: > > > Generic components making use of the maple tree (such as the > > > DRM GPUVA Manager) delegate the responsibility of ensuring mutual > > > exclusion to their users. > > > > > > While such components could inherit the concept of an external lock, > > > some users might just serialize the access to the component and hence to > > > the internal maple tree. > > > > > > In order to allow such use cases, add a new flag MT_FLAGS_LOCK_NONE to > > > indicate not to do any internal lockdep checks. > > > > I'm really against this change. > > > > First, we really should check that users have their locking right. > > It's bitten us so many times when they get it wrong. > > In case of the DRM GPUVA manager, some users might serialize the access to > the GPUVA manager and hence to it's maple tree instances, e.g. through the > drm_gpu_scheduler. In such a case ensuring to hold a lock would be a bit > pointless and I wouldn't really know how to "sell" this to potential users > of the GPUVA manager. This is why we like people to use the spinlock embedded in the tree. There's nothing for the user to care about. If the access really is serialised, acquiring/releasing the uncontended spinlock is a minimal cost compared to all the other things that will happen while modifying the tree. > > Second, having a lock allows us to defragment the slab cache. The > > patches to do that haven't gone anywhere recently, but if we drop the > > requirement now, we'll never be able to compact ranges of memory that > > have slabs allocated to them. > > > > Not sure if I get that, do you mind explaining a bit how this would affect > other users of the maple tree, such as my use case, the GPUVA manager? When we want to free a slab in order to defragment memory, we need to relocate all the objects allocated within that slab. To do that for the maple tree node cache, for each node in this particular slab, we'll need to walk up to the top of the tree and lock it. We can then allocate a new node from a different slab, change the parent to point to the new node and drop the lock. After an RCU delay, we can free the slab and create a larger contiguous block of memory. As I said, this is somewhat hypothetical in that there's no current code in the tree to reclaim slabs when we're trying to defragment memory. And that's because it's hard to do. The XArray and maple tree were designed to make it possible for their slabs.
Re: [PATCH drm-next v2 03/16] maple_tree: split up MA_STATE() macro
On Fri, Feb 17, 2023 at 02:44:09PM +0100, Danilo Krummrich wrote: > \#define SAMPLE_ITER(name, __mgr) \ > struct sample_iter name = { \ > .mas = __MA_STATE(&(__mgr)->mt, 0, 0), This is usually called MA_STATE_INIT() > #define sample_iter_for_each_range(it__, start__, end__) \ > for ((it__).mas.index = start__, (it__).entry = mas_find(&(it__).mas, > end__ - 1); \ >(it__).entry; (it__).entry = mas_find(&(it__).mas, end__ - 1)) This is a bad iterator design. It's usually best to do this: struct sample *sample; SAMPLE_ITERATOR(si, min); sample_iter_for_each(&si, sample, max) { frob(mgr, sample); } I don't mind splitting apart MA_STATE_INIT from MA_STATE, and if you do that, we can also use it in VMA_ITERATOR.
Re: [PATCH drm-next v2 04/16] maple_tree: add flag MT_FLAGS_LOCK_NONE
On Fri, Feb 17, 2023 at 02:44:10PM +0100, Danilo Krummrich wrote: > Generic components making use of the maple tree (such as the > DRM GPUVA Manager) delegate the responsibility of ensuring mutual > exclusion to their users. > > While such components could inherit the concept of an external lock, > some users might just serialize the access to the component and hence to > the internal maple tree. > > In order to allow such use cases, add a new flag MT_FLAGS_LOCK_NONE to > indicate not to do any internal lockdep checks. I'm really against this change. First, we really should check that users have their locking right. It's bitten us so many times when they get it wrong. Second, having a lock allows us to defragment the slab cache. The patches to do that haven't gone anywhere recently, but if we drop the requirement now, we'll never be able to compact ranges of memory that have slabs allocated to them.
Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions
On Thu, Jan 26, 2023 at 04:50:59PM +0200, Mike Rapoport wrote: > On Thu, Jan 26, 2023 at 11:17:09AM +0200, Mike Rapoport wrote: > > On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote: > > > +/* Use when VMA is not part of the VMA tree and needs no locking */ > > > +static inline void init_vm_flags(struct vm_area_struct *vma, > > > + unsigned long flags) > > > > I'd suggest to make it vm_flags_init() etc. > > Thinking more about it, it will be even clearer to name these vma_flags_xyz() Perhaps vma_VERB_flags()? vma_init_flags() vma_reset_flags() vma_set_flags() vma_clear_flags() vma_mod_flags()
Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions
On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote: > +/* Use when VMA is not part of the VMA tree and needs no locking */ > +static inline void init_vm_flags(struct vm_area_struct *vma, > + unsigned long flags) > +{ > + vma->vm_flags = flags; vm_flags are supposed to have type vm_flags_t. That's not been fully realised yet, but perhaps we could avoid making it worse? > pgprot_t vm_page_prot; > - unsigned long vm_flags; /* Flags, see mm.h. */ > + > + /* > + * Flags, see mm.h. > + * WARNING! Do not modify directly. > + * Use {init|reset|set|clear|mod}_vm_flags() functions instead. > + */ > + unsigned long vm_flags; Including changing this line to vm_flags_t
Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions
On Wed, Jan 25, 2023 at 08:49:50AM -0800, Suren Baghdasaryan wrote: > On Wed, Jan 25, 2023 at 1:10 AM Peter Zijlstra wrote: > > > + /* > > > + * Flags, see mm.h. > > > + * WARNING! Do not modify directly. > > > + * Use {init|reset|set|clear|mod}_vm_flags() functions instead. > > > + */ > > > + unsigned long vm_flags; > > > > We have __private and ACCESS_PRIVATE() to help with enforcing this. > > Thanks for pointing this out, Peter! I guess for that I'll need to > convert all read accesses and provide get_vm_flags() too? That will > cause some additional churt (a quick search shows 801 hits over 248 > files) but maybe it's worth it? I think Michal suggested that too in > another patch. Should I do that while we are at it? Here's a trick I saw somewhere in the VFS: union { const vm_flags_t vm_flags; vm_flags_t __private __vm_flags; }; Now it can be read by anybody but written only by those using ACCESS_PRIVATE.
Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
On Mon, Jan 23, 2023 at 12:50:52PM -0800, Dan Williams wrote: > Matthew Wilcox wrote: > > On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote: > > > Jason Gunthorpe via Lsf-pc wrote: > > > > I would like to have a session at LSF to talk about Matthew's > > > > physr discussion starter: > > > > > > > > https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/ > > > > > > > > I have become interested in this with some immediacy because of > > > > IOMMUFD and this other discussion with Christoph: > > > > > > > > > > > > https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_...@nvidia.com/ > > > > > > I think this is a worthwhile discussion. My main hangup with 'struct > > > page' elimination in general is that if anything needs to be allocated > > > > You're the first one to bring up struct page elimination. Neither Jason > > nor I have that as our motivation. > > Oh, ok, then maybe I misread the concern in the vfio discussion. I > thought the summary there is debating the ongoing requirement for > 'struct page' for P2PDMA? My reading of that thread is that while it started out that way, it became more about "So what would a good interface be for doing this". And Jason's right, he and I are approaching this from different directions. My concern is from the GUP side where we start out by getting a folio (which we know is physically contiguous) and decomposing it into pages. Then we aggregate all those pages together which are physically contiguous and stuff them into a bio_vec. After that, I lose interest; I was planning on having DMA mapping interfaces which took in an array of phyr and spat out scatterlists. Then we could shrink the scatterlist by removing page_link and offset, leaving us with only dma_address, length and maybe flags.
Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote: > Jason Gunthorpe via Lsf-pc wrote: > > I would like to have a session at LSF to talk about Matthew's > > physr discussion starter: > > > > https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/ > > > > I have become interested in this with some immediacy because of > > IOMMUFD and this other discussion with Christoph: > > > > > > https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_...@nvidia.com/ > > I think this is a worthwhile discussion. My main hangup with 'struct > page' elimination in general is that if anything needs to be allocated You're the first one to bring up struct page elimination. Neither Jason nor I have that as our motivation. But there are reasons why struct page is a bad data structure, and Xen proves that you don't need to have such a data structure in order to do I/O. > When I read "general interest across all the driver subsystems" it is > hard not to ask "have all possible avenues to enable 'struct page' been > exhausted?" Yes, we should definitely expend yet more resources chasing a poor implementation.
Re: [LSF/MM/BPF proposal]: Physr discussion
On Sat, Jan 21, 2023 at 11:03:05AM -0400, Jason Gunthorpe wrote: > I would like to have a session at LSF to talk about Matthew's > physr discussion starter: > > https://lore.kernel.org/linux-mm/ydykweu0htv8m...@casper.infradead.org/ I'm definitely interested in discussing phyrs (even if you'd rather pronounce it "fizzers" than "fires" ;-) > I've been working on an implementation and hope to have something > draft to show on the lists in a few weeks. It is pretty clear there > are several interesting decisions to make that I think will benefit > from a live discussion. Cool! Here's my latest noodlings: https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/phyr Just the top two commits; the other stuff is unrelated. Shakeel has also been interested in this.
Re: [PATCH RFC v7 00/23] DEPT(Dependency Tracker)
On Thu, Jan 19, 2023 at 03:23:08PM +0900, Byungchul Park wrote: > Boqun wrote: > > * Looks like the DEPT dependency graph doesn't handle the > > fair/unfair readers as lockdep current does. Which bring the > > next question. > > No. DEPT works better for unfair read. It works based on wait/event. So > read_lock() is considered a potential wait waiting on write_unlock() > while write_lock() is considered a potential wait waiting on either > write_unlock() or read_unlock(). DEPT is working perfect for it. > > For fair read (maybe you meant queued read lock), I think the case > should be handled in the same way as normal lock. I might get it wrong. > Please let me know if I miss something. >From the lockdep/DEPT point of view, the question is whether: read_lock(A) read_lock(A) can deadlock if a writer comes in between the two acquisitions and sleeps waiting on A to be released. A fair lock will block new readers when a writer is waiting, while an unfair lock will allow new readers even while a writer is waiting.
Re: [RFC PATCH v3 0/3] new subsystem for compute accelerator devices
On Mon, Nov 07, 2022 at 09:07:28AM -0700, Jeffrey Hugo wrote: > > Another important change is that I have reverted back to use IDR for minor > > handling instead of xarray. This is because I have found that xarray doesn't > > handle well the scenario where you allocate a NULL entry and then exchange > > it > > with a real pointer. It appears xarray still considers that entry a "zero" > > entry. This is unfortunate because DRM works that way (first allocates a > > NULL > > entry and then replaces the entry with a real pointer). > > > > I decided to revert to IDR because I don't want to hold up these patches, > > as many people are blocked until the support for accel is merged. The xarray > > issue should be fixed as a separate patch by either fixing the xarray code > > or > > changing how DRM + ACCEL do minor id handling. > > This sounds sane to me. However, this appears to be something that Matthew > Wilcox should be aware of (added for visibility). Perhaps he has a very > quick solution. If not, at-least he might have ideas on how to best address > in the future. Thanks for cc'ing me. I wasn't aware of this problem because I hadn't seen Oded's email yet. The "problem" is simply a mis-use of the API.
Re: [PATCH v5 1/3] drm: Use XArray instead of IDR for minors
On Sun, Nov 06, 2022 at 04:51:39PM +0200, Oded Gabbay wrote: > I tried executing the following code in a dummy driver I wrote: You don't need to write a dummy driver; you can write test-cases entirely in userspace. Just add them to lib/test_xarray.c and then make -C tools/testing/radix-tree/ > static DEFINE_XARRAY_ALLOC(xa_dummy); > void check_xa(void *pdev) > { > void *entry; > int ret, index; > > ret = xa_alloc(&xa_dummy, &index, NULL, XA_LIMIT(0, 63), GFP_NOWAIT); > if (ret < 0) > return ret; > > entry = xa_cmpxchg(&xa_dummy, index, NULL, pdev, GFP_KERNEL); > if (xa_is_err(entry)) >return; > > xa_lock(&xa_dummy); > xa_dev = xa_load(&xa_dummy, index); > xa_unlock(&xa_dummy); > } > > And to my surprise xa_dev is always NULL, when it should be pdev. > I believe that because we first allocate the entry with NULL, it is > considered a "zero" entry in the XA. > And when we replace it, this attribute doesn't change so when we do > xa_load, the xa code thinks the entry is a "zero" entry and returns > NULL. There's no "attribute" to mark a zero entry. It's just a zero entry. The problem is that you're cmpxchg'ing against NULL, and it's not NULL, it's the ZERO entry. This is even documented in the test code: /* cmpxchg sees a reserved entry as ZERO */ XA_BUG_ON(xa, xa_reserve(xa, 12345678, GFP_KERNEL) != 0); XA_BUG_ON(xa, xa_cmpxchg(xa, 12345678, XA_ZERO_ENTRY, xa_mk_value(12345678), GFP_NOWAIT) != NULL); I'm not quite sure why you're using xa_cmpxchg() here anyway; if you allocated it, it's yours and you can just xa_store() to it.
Re: [PATCH v4 01/10] gna: add PCI driver module
On Thu, Oct 20, 2022 at 03:35:16PM +0200, Kwapulinski, Maciej wrote: > +++ b/drivers/gpu/drm/gna/Kconfig > @@ -0,0 +1,12 @@ > +# > +# Intel(R) Gaussian & Neural Accelerator (Intel(R) GNA) > +# > + > +config DRM_GNA > + tristate "Intel(R) Gaussian & Neural Accelerator (Intel(R) GNA)" > + depends on X86 && PCI > + help This is mangled; 'help' should be indented by a tab, not two spaces.
Re: [PATCH v4 1/3] drm: Use XArray instead of IDR for minors
On Tue, Sep 06, 2022 at 10:16:27PM +0200, Michał Winiarski wrote: > IDR is deprecated, and since XArray manages its own state with internal > locking, it simplifies the locking on DRM side. > Additionally, don't use the IRQ-safe variant, since operating on drm > minor is not done in IRQ context. > > Signed-off-by: Michał Winiarski > Suggested-by: Matthew Wilcox I have a few questions, but I like where you're going. > @@ -98,21 +98,18 @@ static struct drm_minor **drm_minor_get_slot(struct > drm_device *dev, > static void drm_minor_alloc_release(struct drm_device *dev, void *data) > { > struct drm_minor *minor = data; > - unsigned long flags; > > WARN_ON(dev != minor->dev); > > put_device(minor->kdev); > > - spin_lock_irqsave(&drm_minor_lock, flags); > - idr_remove(&drm_minors_idr, minor->index); > - spin_unlock_irqrestore(&drm_minor_lock, flags); > + xa_release(&drm_minors_xa, minor->index); Has it definitely been unused at this point? I would think that xa_erase() (an unconditional store) would be the correct function to call. > @@ -122,20 +119,12 @@ static int drm_minor_alloc(struct drm_device *dev, > unsigned int type) > minor->type = type; > minor->dev = dev; > > - idr_preload(GFP_KERNEL); > - spin_lock_irqsave(&drm_minor_lock, flags); > - r = idr_alloc(&drm_minors_idr, > - NULL, > - 64 * type, > - 64 * (type + 1), > - GFP_NOWAIT); > - spin_unlock_irqrestore(&drm_minor_lock, flags); > - idr_preload_end(); > - > + r = xa_alloc(&drm_minors_xa, &id, NULL, > + XA_LIMIT(64 * type, 64 * (type + 1) - 1), GFP_KERNEL); > if (r < 0) > return r; > > - minor->index = r; > + minor->index = id; Wouldn't it be better to call: r = xa_alloc(&drm_minors_xa, &minor->index, NULL, XA_LIMIT(64 * type, 64 * (type + 1) - 1), GFP_KERNEL); I might also prefer a little syntactic sugar like: #define DRM_MINOR_LIMIT(type) XA_LIMIT(64 * (type), 64 * (type) + 63) but that's definitely a matter of taste. > @@ -172,9 +161,12 @@ static int drm_minor_register(struct drm_device *dev, > unsigned int type) > goto err_debugfs; > > /* replace NULL with @minor so lookups will succeed from now on */ > - spin_lock_irqsave(&drm_minor_lock, flags); > - idr_replace(&drm_minors_idr, minor, minor->index); > - spin_unlock_irqrestore(&drm_minor_lock, flags); > + entry = xa_store(&drm_minors_xa, minor->index, &minor, GFP_KERNEL); > + if (xa_is_err(entry)) { > + ret = xa_err(entry); > + goto err_debugfs; > + } > + WARN_ON(entry); Might be better as an xa_cmpxchg()? > @@ -187,16 +179,13 @@ static int drm_minor_register(struct drm_device *dev, > unsigned int type) > static void drm_minor_unregister(struct drm_device *dev, unsigned int type) > { > struct drm_minor *minor; > - unsigned long flags; > > minor = *drm_minor_get_slot(dev, type); > if (!minor || !device_is_registered(minor->kdev)) > return; > > /* replace @minor with NULL so lookups will fail from now on */ > - spin_lock_irqsave(&drm_minor_lock, flags); > - idr_replace(&drm_minors_idr, NULL, minor->index); > - spin_unlock_irqrestore(&drm_minor_lock, flags); > + xa_erase(&drm_minors_xa, minor->index); This isn't an exact replacement, but I'm not sure whether that makes a difference. xa_erase() allows allocation of this ID again while idr_replace() means that lookups return NULL, but the ID remains in use. The equivalent of idr_replace() is: xa_store(&drm_minors_xa, minor->index, NULL, GFP_KERNEL); > @@ -215,13 +204,10 @@ static void drm_minor_unregister(struct drm_device > *dev, unsigned int type) > struct drm_minor *drm_minor_acquire(unsigned int minor_id) > { > struct drm_minor *minor; > - unsigned long flags; > > - spin_lock_irqsave(&drm_minor_lock, flags); > - minor = idr_find(&drm_minors_idr, minor_id); > + minor = xa_load(&drm_minors_xa, minor_id); > if (minor) > drm_dev_get(minor->dev); > - spin_unlock_irqrestore(&drm_minor_lock, flags); This is also not an exact equivalent as the dev_drm_get() is now outside the lock. Does that matter in this case? I don't know the code well enough to say. If you want it to be equivalent, then: xa_lock(&drm_minors_xa); minor = xa_load(&d
Re: [PATCH v3 0/4] drm: Use full allocated minor range for DRM
On Tue, Sep 06, 2022 at 04:01:13PM +0200, Michał Winiarski wrote: > 64 DRM device nodes is not enough for everyone. > Upgrade it to ~512K (which definitely is more than enough). > > To allow testing userspace support for >64 devices, add additional DRM > modparam (skip_legacy_minors) which causes DRM to skip allocating minors > in 0-192 range. > Additionally - one minor tweak around minor DRM IDR locking and IDR lockdep > annotations. The IDR is deprecated; rather than making all these changes around the IDR, could you convert it to use the XArray instead? I did it once before, but those patches bounced off the submissions process.
Re: [PATCH] mm: Fix a null ptr deref with CONFIG_DEBUG_VM enabled in wp_page_reuse
On Wed, Jul 27, 2022 at 03:14:07PM -0400, Zack Rusin wrote: > From: Zack Rusin > > Write page faults on last references might not have a valid page anymore. > wp_page_reuse has always dealt with that scenario by making > sure the page isn't null (or the reference was shared) before doing > anything with it. Recently added checks in VM_BUG_ON (enabled by the > CONFIG_DEBUG_VM option) use PageAnon helpers which assume the passed > page is never null, before making sure there is a valid page to work > with. > > Move the VM_BUG_ON, which unconditionally uses the page, after the > code that checks that we have a valid one. Message-ID:
Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when pinning instead of failing
On Mon, Jul 11, 2022 at 03:35:42PM +0200, David Hildenbrand wrote: > > + /* > > +* Device coherent pages are managed by a driver and should not > > +* be pinned indefinitely as it prevents the driver moving the > > +* page. So when trying to pin with FOLL_LONGTERM instead try > > +* to migrate the page out of device memory. > > +*/ > > + if (folio_is_device_coherent(folio)) { > > + WARN_ON_ONCE(PageCompound(&folio->page)); > > Maybe that belongs into migrate_device_page()? ... and should be simply WARN_ON_ONCE(folio_test_large(folio)). No need to go converting it back into a page in order to test things that can't possibly be true.
Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d
On Thu, May 26, 2022 at 11:48:32AM +0300, Dan Carpenter wrote: > On Thu, May 26, 2022 at 02:16:34AM +0100, Matthew Wilcox wrote: > > Bizarre this started showing up now. The recent patch was: > > > > - info->alloced += compound_nr(page); > > - inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page); > > + info->alloced += folio_nr_pages(folio); > > + inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio); > > > > so it could tell that compound_order() was small, but folio_order() > > might be large? > > The old code also generates a warning on my test system. Smatch thinks > both compound_order() and folio_order() are 0-255. I guess because of > the "unsigned char compound_order;" in the struct page. It'd be nice if we could annotate that as "contains a value between 1 and BITS_PER_LONG - PAGE_SHIFT". Then be able to optionally enable a checker that ensures that's true on loads/stores. Maybe we need a language that isn't C :-P Ada can do this ... I don't think Rust can.
Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d
On Wed, May 25, 2022 at 03:20:06PM -0700, Andrew Morton wrote: > On Wed, 25 May 2022 23:07:35 +0100 Jessica Clarke wrote: > > > This is i386, so an unsigned long is 32-bit, but i_blocks is a blkcnt_t > > i.e. a u64, which makes the shift without a cast of the LHS fishy. > > Ah, of course, thanks. I remember 32 bits ;) > > --- a/mm/shmem.c~mm-shmemc-suppress-shift-warning > +++ a/mm/shmem.c > @@ -1945,7 +1945,7 @@ alloc_nohuge: > > spin_lock_irq(&info->lock); > info->alloced += folio_nr_pages(folio); > - inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio); > + inode->i_blocks += (blkcnt_t)BLOCKS_PER_PAGE << folio_order(folio); Bizarre this started showing up now. The recent patch was: - info->alloced += compound_nr(page); - inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page); + info->alloced += folio_nr_pages(folio); + inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio); so it could tell that compound_order() was small, but folio_order() might be large? Silencing the warning is a good thing, but folio_order() can (at the moment) be at most 9 on i386, so it isn't actually going to be larger than 4096.
Re: [Bug 215807] Bad page state in process systemd-udevd
On Tue, Apr 12, 2022 at 02:08:04PM -0700, Andrew Morton wrote: > hm, that's my third `bad page' report in three emails. But I'm not > seeing a pattern - this one might be a DRM thing. I noticed it was an order-9 page and was set to take responsibility for it, but it's clearly not a filesystem page, but a DRM page. Let me help decode this for the benefit of the DRM folks. > > [8.453363] amdgpu: Topology: Add CPU node > > [8.467169] BUG: Bad page state in process systemd-udevd pfn:11b403 > > [8.467180] fbcon: Taking over console > > [8.467188] page:2cc06944 refcount:0 mapcount:0 > > mapping: index:0x3 pfn:0x11b403 > > [8.467195] head:aa25e58e order:9 compound_mapcount:0 > > compound_pincount:0 > > [8.467198] flags: > > 0x17c001(head|node=0|zone=2|lastcpupid=0x1f) > > [8.467205] raw: 0017c000 f2da846d0001 f2da846d00c8 > > > > [8.467208] raw: > > > > [8.467210] head: 0017c001 dead0122 > > > > [8.467212] head: > > > > [8.467214] page dumped because: corrupted mapping in tail page Tail pages are all supposed to have page->mapping == TAIL_MAPPING (0x400 + POISON_POINTER_DELTA). This one has page->mapping == NULL. I say "all", but tail pages 1 and 2 aren't checked because those get other useful things stored in them instead. This is tail page number 3, so it's the first one checked. So DRM has probably been overly enthusiastic about cleaning up something. Hope that's helpful!
Re: [PATCH] drm/amdkfd: Add SVM API support capability bits
On Wed, Mar 30, 2022 at 04:24:20PM -0500, Alex Sierra wrote: > From: Philip Yang > > SVMAPISupported property added to HSA_CAPABILITY, the value match > HSA_CAPABILITY defined in Thunk spec: > > SVMAPISupported: it will not be supported on older kernels that don't > have HMM or on systems with GFXv8 or older GPUs without support for > 48-bit virtual addresses. > > CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value match > HSA_MEMORYPROPERTY defined in Thunk spec: > > CoherentHostAccess: whether or not device memory can be coherently > accessed by the host CPU. Could you translate this commit message into English? Reviewing Documentation/process/5.Posting.rst might be helpful.
Re: [PATCH v1 1/3] mm: split vm_normal_pages for LRU and non-LRU handling
On Thu, Mar 10, 2022 at 11:26:31AM -0600, Alex Sierra wrote: > @@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, > unsigned long addr, > * PFNMAP mappings in order to support COWable mappings. > * > */ > -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, > +struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long > addr, > pte_t pte) > { > unsigned long pfn = pte_pfn(pte); > @@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, > unsigned long addr, > return NULL; > if (is_zero_pfn(pfn)) > return NULL; > - if (pte_devmap(pte)) > - return NULL; > > print_bad_pte(vma, addr, pte, NULL); > return NULL; ... what? Haven't you just made it so that a devmap page always prints a bad PTE message, and then returns NULL anyway? Surely this should be: if (pte_devmap(pte)) - return NULL; + return pfn_to_page(pfn); or maybe + goto check_pfn; But I don't know about that highest_memmap_pfn check. > @@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, > unsigned long addr, > return pfn_to_page(pfn); > } > > +/* > + * vm_normal_lru_page -- This function gets the "struct page" associated > + * with a pte only for page cache and anon page. These pages are LRU handled. > + */ > +struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long > addr, > + pte_t pte) It seems a shame to add a new function without proper kernel-doc.
Re: [PATCH RFC v2] mm: Add f_ops->populate()
On Sun, Mar 06, 2022 at 03:41:54PM -0800, Dave Hansen wrote: > In short: page faults stink. The core kernel has lots of ways of > avoiding page faults like madvise(MADV_WILLNEED) or mmap(MAP_POPULATE). > But, those only work on normal RAM that the core mm manages. > > SGX is weird. SGX memory is managed outside the core mm. It doesn't > have a 'struct page' and get_user_pages() doesn't work on it. Its VMAs > are marked with VM_IO. So, none of the existing methods for avoiding > page faults work on SGX memory. > > This essentially helps extend existing "normal RAM" kernel ABIs to work > for avoiding faults for SGX too. SGX users want to enjoy all of the > benefits of a delayed allocation policy (better resource use, > overcommit, NUMA affinity) but without the cost of millions of faults. We have a mechanism for dynamically reducing the number of page faults already; it's just buried in the page cache code. You have vma->vm_file, which contains a file_ra_state. You can use this to track where recent faults have been and grow the size of the region you fault in per page fault. You don't have to (indeed probably don't want to) use the same algorithm as the page cache, but the _principle_ is the same -- were recent speculative faults actually used; should we grow the number of pages actually faulted in, or is this a random sparse workload where we want to allocate individual pages. Don't rely on the user to ask. They don't know.
Re: [PATCH RFC 1/3] mm: Add f_ops->populate()
On Sun, Mar 06, 2022 at 07:02:57PM +0200, Jarkko Sakkinen wrote: > So can I conclude from this that in general having populate available for > device memory is something horrid, or just the implementation path? You haven't even attempted to explain what the problem is you're trying to solve. You've shown up with some terrible code and said "Hey, is this a good idea". No, no, it's not.
Re: [PATCH RFC 0/3] MAP_POPULATE for device memory
On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote: > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > to use that for initializing the device memory by providing a new callback > f_ops->populate() for the purpose. As I said, NAK.
Re: [PATCH RFC] mm: Add f_ops->populate()
On Sun, Mar 06, 2022 at 06:25:52AM +0200, Jarkko Sakkinen wrote: > > Are you deliberately avoiding the question? I'm not asking about > > implementation. I'm asking about the semantics of MAP_POPULATE with > > your driver. > > No. I just noticed a bug in the guard from your comment that I wanted > point out. > > With the next version I post the corresponding change to the driver, > in order to see this in context. Oh, good grief. Don't bother. NAK.
Re: [PATCH RFC] mm: Add f_ops->populate()
On Sun, Mar 06, 2022 at 06:11:21AM +0200, Jarkko Sakkinen wrote: > On Sun, Mar 06, 2022 at 03:52:12AM +0000, Matthew Wilcox wrote: > > On Sun, Mar 06, 2022 at 05:21:11AM +0200, Jarkko Sakkinen wrote: > > > On Sun, Mar 06, 2022 at 02:57:55AM +0000, Matthew Wilcox wrote: > > > > On Sun, Mar 06, 2022 at 04:15:33AM +0200, Jarkko Sakkinen wrote: > > > > > Sometimes you might want to use MAP_POPULATE to ask a device driver to > > > > > initialize the device memory in some specific manner. SGX driver can > > > > > use > > > > > this to request more memory by issuing ENCLS[EAUG] x86 opcode for each > > > > > page in the address range. > > > > > > > > > > Add f_ops->populate() with the same parameters as f_ops->mmap() and > > > > > make > > > > > it conditionally called inside call_mmap(). Update call sites > > > > > accodingly. > > > > > > > > Your device driver has a ->mmap operation. Why does it need another > > > > one? More explanation required here. > > > > > > f_ops->mmap() would require an additional parameter, which results > > > heavy refactoring. > > > > > > struct file_operations has 1125 references in the kernel tree, so I > > > decided to check this way around first. > > > > Are you saying that your device driver behaves differently if > > MAP_POPULATE is set versus if it isn't? That seems hideously broken. > > MAP_POPULATE does not do anything (according to __mm_populate in mm/gup.c) > with VMA's that have some sort of device/IO memory, i.e. vm_flags > intersecting with VM_PFNMAP | VM_IO. > > I can extend the guard obviously to: > > if (!ret && do_populate && file->f_op->populate && > !!(vma->vm_flags & (VM_IO | VM_PFNMAP))) > file->f_op->populate(file, vma); Are you deliberately avoiding the question? I'm not asking about implementation. I'm asking about the semantics of MAP_POPULATE with your driver.
Re: [PATCH RFC] mm: Add f_ops->populate()
On Sun, Mar 06, 2022 at 05:21:11AM +0200, Jarkko Sakkinen wrote: > On Sun, Mar 06, 2022 at 02:57:55AM +0000, Matthew Wilcox wrote: > > On Sun, Mar 06, 2022 at 04:15:33AM +0200, Jarkko Sakkinen wrote: > > > Sometimes you might want to use MAP_POPULATE to ask a device driver to > > > initialize the device memory in some specific manner. SGX driver can use > > > this to request more memory by issuing ENCLS[EAUG] x86 opcode for each > > > page in the address range. > > > > > > Add f_ops->populate() with the same parameters as f_ops->mmap() and make > > > it conditionally called inside call_mmap(). Update call sites > > > accodingly. > > > > Your device driver has a ->mmap operation. Why does it need another > > one? More explanation required here. > > f_ops->mmap() would require an additional parameter, which results > heavy refactoring. > > struct file_operations has 1125 references in the kernel tree, so I > decided to check this way around first. Are you saying that your device driver behaves differently if MAP_POPULATE is set versus if it isn't? That seems hideously broken.
Re: [PATCH RFC] mm: Add f_ops->populate()
On Sun, Mar 06, 2022 at 04:15:33AM +0200, Jarkko Sakkinen wrote: > Sometimes you might want to use MAP_POPULATE to ask a device driver to > initialize the device memory in some specific manner. SGX driver can use > this to request more memory by issuing ENCLS[EAUG] x86 opcode for each > page in the address range. > > Add f_ops->populate() with the same parameters as f_ops->mmap() and make > it conditionally called inside call_mmap(). Update call sites > accodingly. Your device driver has a ->mmap operation. Why does it need another one? More explanation required here.
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Tue, Mar 01, 2022 at 10:14:07AM -0800, Kees Cook wrote: > On Mon, Feb 28, 2022 at 04:45:11PM -0800, Linus Torvalds wrote: > > Really. The "-Wshadow doesn't work on the kernel" is not some new > > issue, because you have to do completely insane things to the source > > code to enable it. > > The first big glitch with -Wshadow was with shadowed global variables. > GCC 4.8 fixed that, but it still yells about shadowed functions. What > _almost_ works is -Wshadow=local. At first glace, all the warnings > look solvable, but then one will eventually discover __wait_event() > and associated macros that mix when and how deeply it intentionally > shadows variables. :) Well, that's just disgusting. Macros fundamentally shouldn't be referring to things that aren't in their arguments. The first step to cleaning this up is ... I'll take a look at the rest of cleaning this up soon. >From 28ffe35d56223d4242b915832299e5acc926737e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 1 Mar 2022 13:47:07 -0500 Subject: [PATCH] wait: Parameterize the return variable to ___wait_event() Macros should not refer to variables which aren't in their arguments. Pass the name from its callers. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/swait.h| 12 ++-- include/linux/wait.h | 32 include/linux/wait_bit.h | 4 ++-- 3 files changed, 24 insertions(+), 24 deletions(-) diff --git a/include/linux/swait.h b/include/linux/swait.h index 6a8c22b8c2a5..5e8e9b13be2d 100644 --- a/include/linux/swait.h +++ b/include/linux/swait.h @@ -191,14 +191,14 @@ do { \ } while (0) #define __swait_event_timeout(wq, condition, timeout) \ - ___swait_event(wq, ___wait_cond_timeout(condition), \ + ___swait_event(wq, ___wait_cond_timeout(condition, __ret), \ TASK_UNINTERRUPTIBLE, timeout,\ __ret = schedule_timeout(__ret)) #define swait_event_timeout_exclusive(wq, condition, timeout) \ ({ \ long __ret = timeout; \ - if (!___wait_cond_timeout(condition)) \ + if (!___wait_cond_timeout(condition, __ret))\ __ret = __swait_event_timeout(wq, condition, timeout); \ __ret; \ }) @@ -216,14 +216,14 @@ do { \ }) #define __swait_event_interruptible_timeout(wq, condition, timeout)\ - ___swait_event(wq, ___wait_cond_timeout(condition), \ + ___swait_event(wq, ___wait_cond_timeout(condition, __ret), \ TASK_INTERRUPTIBLE, timeout, \ __ret = schedule_timeout(__ret)) #define swait_event_interruptible_timeout_exclusive(wq, condition, timeout)\ ({ \ long __ret = timeout; \ - if (!___wait_cond_timeout(condition)) \ + if (!___wait_cond_timeout(condition, __ret))\ __ret = __swait_event_interruptible_timeout(wq, \ condition, timeout);\ __ret; \ @@ -252,7 +252,7 @@ do { \ } while (0) #define __swait_event_idle_timeout(wq, condition, timeout) \ - ___swait_event(wq, ___wait_cond_timeout(condition), \ + ___swait_event(wq, ___wait_cond_timeout(condition, __ret), \ TASK_IDLE, timeout, \ __ret = schedule_timeout(__ret)) @@ -278,7 +278,7 @@ do { \ #define swait_event_idle_timeout_exclusive(wq, condition, timeout) \ ({ \ long __ret = timeout; \ - if (!___wait_cond_timeout(condition)) \ + if (!___wait_cond_timeout(condition, __ret))\ __ret = __swait_event_idle_timeout(wq, \ condition, timeout); \ __ret; \ diff --git a/include/linux/wait.h b/include/linux/wait.h index 851e07da2583..890cce3c0f2e 1006
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 12:37:15PM -0800, Linus Torvalds wrote: > On Mon, Feb 28, 2022 at 12:16 PM Matthew Wilcox wrote: > > > > Then we can never use -Wshadow ;-( I'd love to be able to turn it on; > > it catches real bugs. > > Oh, we already can never use -Wshadow regardless of things like this. > That bridge hasn't just been burned, it never existed in the first > place. > > The whole '-Wshadow' thing simply cannot work with local variables in > macros - something that we've used since day 1. > > Try this (as a "p.c" file): > > #define min(a,b) ({ \ > typeof(a) __a = (a);\ > typeof(b) __b = (b);\ > __a < __b ? __a : __b; }) > > int min3(int a, int b, int c) > { > return min(a,min(b,c)); > } > > and now do "gcc -O2 -S t.c". > > Then try it with -Wshadow. #define ___PASTE(a, b) a##b #define __PASTE(a, b) ___PASTE(a, b) #define _min(a, b, u) ({ \ typeof(a) __PASTE(__a,u) = (a);\ typeof(b) __PASTE(__b,u) = (b);\ __PASTE(__a,u) < __PASTE(__b,u) ? __PASTE(__a,u) : __PASTE(__b,u); }) #define min(a, b) _min(a, b, __COUNTER__) int min3(int a, int b, int c) { return min(a,min(b,c)); } (probably there's a more elegant way to do this)
Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr
On Mon, Feb 28, 2022 at 12:10:24PM -0800, Linus Torvalds wrote: > We can do > > typeof(pos) pos > > in the 'for ()' loop, and never use __iter at all. > > That means that inside the for-loop, we use a _different_ 'pos' than outside. Then we can never use -Wshadow ;-( I'd love to be able to turn it on; it catches real bugs. > +#define list_for_each_entry(pos, head, member) > \ > + for (typeof(pos) pos = list_first_entry(head, typeof(*pos), member); > \ > + !list_entry_is_head(pos, head, member);\ >pos = list_next_entry(pos, member))
Re: [PATCH v12 1/6] drm: Add arch arm64 for drm_clflush_virt_range
On Fri, Feb 25, 2022 at 06:42:37PM +, Tvrtko Ursulin wrote: > Matthew, what do you think fix for this build warning on h8300 and s390 > should be? Or perhaps a build environment issue with kernel test robot? I'd suggest this should do the job: +++ b/include/linux/cacheflush.h @@ -4,6 +4,8 @@ #include +struct folio; + #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE #ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO void flush_dcache_folio(struct folio *folio);
Re: [PATCH 00/16] DEPT(Dependency Tracker)
On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote: > On Thu, 17 Feb 2022 10:51:09 -0500 > "Theodore Ts'o" wrote: > > > I know that you're trying to help us, but this tool needs to be far > > better than Lockdep before we should think about merging it. Even if > > it finds 5% more potential deadlocks, if it creates 95% more false > > positive reports --- and the ones it finds are crazy things that > > rarely actually happen in practice, are the costs worth the benefits? > > And who is bearing the costs, and who is receiving the benefits? > > I personally believe that there's potential that this can be helpful and we > will want to merge it. > > But, what I believe Ted is trying to say is, if you do not know if the > report is a bug or not, please do not ask the maintainers to determine it > for you. This is a good opportunity for you to look to see why your tool > reported an issue, and learn that subsystem. Look at if this is really a > bug or not, and investigate why. I agree with Steven here, to the point where I'm willing to invest some time being a beta-tester for this, so if you focus your efforts on filesystem/mm kinds of problems, I can continue looking at them and tell you what's helpful and what's unhelpful in the reports.
Re: Report 1 in ext4 and journal based on v5.17-rc1
On Thu, Feb 17, 2022 at 08:10:03PM +0900, Byungchul Park wrote: > [7.009608] === > [7.009613] DEPT: Circular dependency has been detected. > [7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: GW > [7.009616] --- > [7.009617] summary > [7.009618] --- > [7.009618] *** DEADLOCK *** > [7.009618] > [7.009619] context A > [7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) Why is the context unknown here? I don't see a way to debug this without knowing where we acquired the bit wait lock. > [7.009621] [W] down_write(&ei->i_data_sem:0) > [7.009623] [E] event(&(bit_wait_table + i)->dmap:0) > [7.009624] > [7.009625] context B > [7.009625] [S] down_read(&ei->i_data_sem:0) > [7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) > [7.009627] [E] up_read(&ei->i_data_sem:0) > [7.009628] > [7.009629] [S]: start of the event context > [7.009629] [W]: the wait blocked > [7.009630] [E]: the event not reachable > [7.009631] --- > [7.009631] context A's detail > [7.009632] --- > [7.009632] context A > [7.009633] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > [7.009634] [W] down_write(&ei->i_data_sem:0) > [7.009635] [E] event(&(bit_wait_table + i)->dmap:0) > [7.009636] > [7.009636] [S] (unknown)(&(bit_wait_table + i)->dmap:0): > [7.009638] (N/A) > [7.009638] > [7.009639] [W] down_write(&ei->i_data_sem:0): > [7.009639] ext4_truncate (fs/ext4/inode.c:4187) > [7.009645] stacktrace: > [7.009646] down_write (kernel/locking/rwsem.c:1514) > [7.009648] ext4_truncate (fs/ext4/inode.c:4187) > [7.009650] ext4_da_write_begin (./include/linux/fs.h:827 > fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [7.009652] generic_perform_write (mm/filemap.c:3784) > [7.009654] ext4_buffered_write_iter (fs/ext4/file.c:269) > [7.009657] ext4_file_write_iter (fs/ext4/file.c:677) > [7.009659] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [7.009662] vfs_write (fs/read_write.c:590) > [7.009663] ksys_write (fs/read_write.c:644) > [7.009664] do_syscall_64 (arch/x86/entry/common.c:50 > arch/x86/entry/common.c:80) > [7.009667] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [7.009669] > [7.009670] [E] event(&(bit_wait_table + i)->dmap:0): > [7.009671] __wake_up_common (kernel/sched/wait.c:108) > [7.009673] stacktrace: > [7.009674] dept_event (kernel/dependency/dept.c:2337) > [7.009677] __wake_up_common (kernel/sched/wait.c:109) > [7.009678] __wake_up_common_lock (./include/linux/spinlock.h:428 > (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [7.009679] __wake_up_bit (kernel/sched/wait_bit.c:127) > [7.009681] ext4_orphan_del (fs/ext4/orphan.c:282) > [7.009683] ext4_truncate (fs/ext4/inode.c:4212) > [7.009685] ext4_da_write_begin (./include/linux/fs.h:827 > fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [7.009687] generic_perform_write (mm/filemap.c:3784) > [7.009688] ext4_buffered_write_iter (fs/ext4/file.c:269) > [7.009690] ext4_file_write_iter (fs/ext4/file.c:677) > [7.009692] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [7.009694] vfs_write (fs/read_write.c:590) > [7.009695] ksys_write (fs/read_write.c:644) > [7.009696] do_syscall_64 (arch/x86/entry/common.c:50 > arch/x86/entry/common.c:80) > [7.009698] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [7.009700] --- > [7.009700] context B's detail > [7.009701] --- > [7.009702] context B > [7.009702] [S] down_read(&ei->i_data_sem:0) > [7.009703] [W] wait(&(bit_wait_table + i)->dmap:0) > [7.009704] [E] up_read(&ei->i_data_sem:0) > [7.009705] > [7.009706] [S] down_read(&ei->i_data_sem:0): > [7.009707] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 > ./include/asm-generic/bitops/instrumented-non-atomic.h:135 > fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) > [7.009709] stacktrace: > [7.009709] down_read (kernel/locking/rwsem.c:1461) > [7.009711] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 > ./include/asm-generic/bitops/instrumented-non-atomic.h:135 > fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) > [7.009712] ext4_getblk (fs/ext4/inode.c:851) > [7.009714] ext4_bread (fs/ext4/inode.c:903) > [7.009715] __ext4_read_dirblock (fs/ext4/namei.c:117) > [7.009718] dx_probe (fs/ext4/namei.c:789) > [7.009720] ext4_dx_find_entry (fs/ext4/namei.c:1721) > [7.009722] __ext4_find_entry
Re: Report in unix_stream_read_generic()
On Wed, Feb 16, 2022 at 01:17:03PM +0900, Byungchul Park wrote: > [7.013330] === > [7.013331] DEPT: Circular dependency has been detected. > [7.013332] 5.17.0-rc1-00014-gcf3441bb2012 #2 Tainted: GW > [7.01] --- > [7.013334] summary > [7.013334] --- > [7.013335] *** DEADLOCK *** > [7.013335] > [7.013335] context A > [7.013336] [S] (unknown)(&(&ei->socket.wq.wait)->dmap:0) > [7.013337] [W] __mutex_lock_common(&u->iolock:0) > [7.013338] [E] event(&(&ei->socket.wq.wait)->dmap:0) > [7.013340] > [7.013340] context B > [7.013341] [S] __raw_spin_lock(&u->lock:0) > [7.013342] [W] wait(&(&ei->socket.wq.wait)->dmap:0) > [7.013343] [E] spin_unlock(&u->lock:0) This seems unlikely to be real. We're surely not actually waiting while holding a spinlock; existing debug checks would catch it. > [7.013407] --- > [7.013407] context B's detail > [7.013408] --- > [7.013408] context B > [7.013409] [S] __raw_spin_lock(&u->lock:0) > [7.013410] [W] wait(&(&ei->socket.wq.wait)->dmap:0) > [7.013411] [E] spin_unlock(&u->lock:0) > [7.013412] > [7.013412] [S] __raw_spin_lock(&u->lock:0): > [7.013413] [] unix_stream_read_generic+0x6bf/0xb60 > [7.013416] stacktrace: > [7.013416] _raw_spin_lock+0x6e/0x90 > [7.013418] unix_stream_read_generic+0x6bf/0xb60 It would be helpful if you'd run this through scripts/decode_stacktrace.sh so we could see line numbers instead of hex offsets (which arene't much use without the binary kernel). > [7.013420] unix_stream_recvmsg+0x40/0x50 > [7.013422] sock_read_iter+0x85/0xd0 > [7.013424] new_sync_read+0x162/0x180 > [7.013426] vfs_read+0xf3/0x190 > [7.013428] ksys_read+0xa6/0xc0 > [7.013429] do_syscall_64+0x3a/0x90 > [7.013431] entry_SYSCALL_64_after_hwframe+0x44/0xae > [7.013433] > [7.013434] [W] wait(&(&ei->socket.wq.wait)->dmap:0): > [7.013434] [] prepare_to_wait+0x47/0xd0 ... this may be the source of confusion. Just because we prepare to wait doesn't mean we end up actually waiting. For example, look at unix_wait_for_peer(): prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE); sched = !sock_flag(other, SOCK_DEAD) && !(other->sk_shutdown & RCV_SHUTDOWN) && unix_recvq_full(other); unix_state_unlock(other); if (sched) timeo = schedule_timeout(timeo); finish_wait(&u->peer_wait, &wait); We *prepare* to wait, *then* drop the lock, then actually schedule.
Re: Phyr Starter
On Tue, Jan 11, 2022 at 06:53:06PM -0400, Jason Gunthorpe wrote: > IOMMU is not common in those cases, it is slow. > > So you end up with 16 bytes per entry then another 24 bytes in the > entirely redundant scatter list. That is now 40 bytes/page for typical > HPC case, and I can't see that being OK. Ah, I didn't realise what case you wanted to optimise for. So, how about this ... Since you want to get to the same destination as I do (a 16-byte-per-entry dma_addr+dma_len struct), but need to get there sooner than "make all sg users stop using it wrongly", let's introduce a (hopefully temporary) "struct dma_range". But let's go further than that (which only brings us to 32 bytes per range). For the systems you care about which use an identity mapping, and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply point the dma_range pointer to the same memory as the phyr. We just have to not free it too early. That gets us down to 16 bytes per range, a saving of 33%.
Re: Phyr Starter
On Tue, Jan 11, 2022 at 04:21:59PM -0400, Jason Gunthorpe wrote: > On Tue, Jan 11, 2022 at 06:33:57PM +0000, Matthew Wilcox wrote: > > > > Then we are we using get_user_phyr() at all if we are just storing it > > > in a sg? > > > > I did consider just implementing get_user_sg() (actually 4 years ago), > > but that cements the use of sg as both an input and output data structure > > for DMA mapping, which I am under the impression we're trying to get > > away from. > > I know every time I talked about a get_user_sg() Christoph is against > it and we need to stop using scatter list... > > > > Also 16 entries is way to small, it should be at least a whole PMD > > > worth so we don't have to relock the PMD level each iteration. > > > > > > I would like to see a flow more like: > > > > > > cpu_phyr_list = get_user_phyr(uptr, 1G); > > > dma_phyr_list = dma_map_phyr(device, cpu_phyr_list); > > > [..] > > > dma_unmap_phyr(device, dma_phyr_list); > > > unpin_drity_free(cpu_phy_list); > > > > > > Where dma_map_phyr() can build a temporary SGL for old iommu drivers > > > compatability. iommu drivers would want to implement natively, of > > > course. > > > > > > ie no loops in drivers. > > > > Let me just rewrite that for you ... > > > > umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len); > > umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len, > > DMA_BIDIRECTIONAL, dma_attr); > > ... > > dma_unmap_phyr(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl, > > umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr); > > sg_free_table(umem->sgt); > > free_user_phyrs(umem->phyrs, umem->phyr_len); > > Why? As above we want to get rid of the sgl, so you are telling me to > adopt phyrs I need to increase the memory consumption by a hefty > amount to store the phyrs and still keep the sgt now? Why? > > I don't need the sgt at all. I just need another list of physical > addresses for DMA. I see no issue with a phsr_list storing either CPU > Physical Address or DMA Physical Addresses, same data structure. There's a difference between a phys_addr_t and a dma_addr_t. They can even be different sizes; some architectures use a 32-bit dma_addr_t and a 64-bit phys_addr_t or vice-versa. phyr cannot store DMA addresses. > In the fairly important passthrough DMA case the CPU list and DMA list > are identical, so we don't even need to do anything. > > In the typical iommu case my dma map's phyrs is only one entry. That becomes a very simple sg table then. > As an example coding - Use the first 8 bytes to encode this: > > 51:0 - Physical address / 4k (ie pfn) > 56:52 - Order (simple, your order encoding can do better) > 61:57 - Unused > 63:62 - Mode, one of: > 00 = natural order pfn (8 bytes) > 01 = order aligned with length (12 bytes) > 10 = arbitary (12 bytes) > > Then the optional 4 bytes are used as: > > Mode 01 (Up to 2^48 bytes of memory on a 4k alignment) > 31:0 - # of order pages > > Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment) > 11:0 - starting byte offset in the 4k > 31:12 - 20 bits, plus the 5 bit order from the first 8 bytes: > length in bytes Honestly, this looks awful to operate on. Mandatory 8-bytes per entry with an optional 4 byte extension? > > > The last case is, perhaps, a possible route to completely replace > > > scatterlist. Few places need true byte granularity for interior pages, > > > so we can invent some coding to say 'this is 8 byte aligned, and n > > > bytes long' that only fits < 4k or something. Exceptional cases can > > > then still work. I'm not sure what block needs here - is it just 512? > > > > Replacing scatterlist is not my goal. That seems like a lot more work > > for little gain. > > Well, I'm not comfortable with the idea above where RDMA would have to > take a memory penalty to use the new interface. To avoid that memory > penalty we need to get rid of scatterlist entirely. > > If we do the 16 byte struct from the first email then a umem for MRs > will increase in memory consumption by 160% compared today's 24 > bytes/page. I think the HPC workloads will veto this. Huh? We do 16 bytes per physically contiguous range. Then, if your HPC workloads use an IOMMU that can map a virtually contiguous range into a single sg entry, it uses 24 bytes for the entire mapping. It s
Re: Phyr Starter
On Tue, Jan 11, 2022 at 11:01:42AM -0400, Jason Gunthorpe wrote: > On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote: > > On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote: > > > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote: > > > > > > > Finally, it may be possible to stop using scatterlist to describe the > > > > input to the DMA-mapping operation. We may be able to get struct > > > > scatterlist down to just dma_address and dma_length, with chaining > > > > handled through an enclosing struct. > > > > > > Can you talk about this some more? IMHO one of the key properties of > > > the scatterlist is that it can hold huge amounts of pages without > > > having to do any kind of special allocation due to the chaining. > > > > > > The same will be true of the phyr idea right? > > > > My thinking is that we'd pass a relatively small array of phyr (maybe 16 > > entries) to get_user_phyr(). If that turned out not to be big enough, > > then we have two options; one is to map those 16 ranges with sg and use > > the sg chaining functionality before throwing away the phyr and calling > > get_user_phyr() again. > > Then we are we using get_user_phyr() at all if we are just storing it > in a sg? I did consider just implementing get_user_sg() (actually 4 years ago), but that cements the use of sg as both an input and output data structure for DMA mapping, which I am under the impression we're trying to get away from. > Also 16 entries is way to small, it should be at least a whole PMD > worth so we don't have to relock the PMD level each iteration. > > I would like to see a flow more like: > > cpu_phyr_list = get_user_phyr(uptr, 1G); > dma_phyr_list = dma_map_phyr(device, cpu_phyr_list); > [..] > dma_unmap_phyr(device, dma_phyr_list); > unpin_drity_free(cpu_phy_list); > > Where dma_map_phyr() can build a temporary SGL for old iommu drivers > compatability. iommu drivers would want to implement natively, of > course. > > ie no loops in drivers. Let me just rewrite that for you ... umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len); umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len, DMA_BIDIRECTIONAL, dma_attr); ... dma_unmap_phyr(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl, umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr); sg_free_table(umem->sgt); free_user_phyrs(umem->phyrs, umem->phyr_len); > > The question is whether this is the right kind of optimisation to be > > doing. I hear you that we want a dense format, but it's questionable > > whether the kind of thing you're suggesting is actually denser than this > > scheme. For example, if we have 1GB pages and userspace happens to have > > allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented > > as a single phyr. A power-of-two scheme would have us use four entries > > (3, 4-7, 8-9, 10). > > That is not quite what I had in mind.. > > struct phyr_list { >unsigned int first_page_offset_bytes; >size_t total_length_bytes; >phys_addr_t min_alignment; >struct packed_phyr *list_of_pages; > }; > > Where each 'packed_phyr' is an aligned page of some kind. The packing > has to be able to represent any number of pfns, so we have four major > cases: > - 4k pfns (use 8 bytes) > - Natural order pfn (use 8 bytes) > - 4k aligned pfns, arbitary number (use 12 bytes) > - <4k aligned, arbitary length (use 16 bytes?) > > In all cases the interior pages are fully used, only the first and > last page is sliced based on the two parameters in the phyr_list. This kind of representation works for a virtually contiguous range. Unfortunately, that's not sufficient for some bio users (as I discovered after getting a representation like this enshrined in the NVMe spec as the PRP List). > The last case is, perhaps, a possible route to completely replace > scatterlist. Few places need true byte granularity for interior pages, > so we can invent some coding to say 'this is 8 byte aligned, and n > bytes long' that only fits < 4k or something. Exceptional cases can > then still work. I'm not sure what block needs here - is it just 512? Replacing scatterlist is not my goal. That seems like a lot more work for little gain. I just want to delete page_link, offset and length from struct scatterlist. Given the above sequence of calls, we're going to get sg lists that aren't chained. They may have to be vmalloced, but th
Re: Phyr Starter
On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote: > Zooming in on the pinning aspect for a moment: last time I attempted to > convert O_DIRECT callers from gup to pup, I recall wanting very much to > record, in each bio_vec, whether these pages were acquired via FOLL_PIN, > or some non-FOLL_PIN method. Because at the end of the IO, it is not > easy to disentangle which pages require put_page() and which require > unpin_user_page*(). > > And changing the bio_vec for *that* purpose was not really acceptable. > > But now that you're looking to change it in a big way (and with some > spare bits avaiable...oohh!), maybe I can go that direction after all. > > Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd > if it exists at all? That. I think there's still good reasons to keep a single-page (or maybe dual-page) GUP around, but no reason to mix it with ranges. > Or any other thoughts in this area are very welcome. That's there's no support for unpinning part of a range. You pin it, do the IO, unpin it. That simplifies the accounting.
Re: Phyr Starter
On Tue, Jan 11, 2022 at 12:40:10PM +0100, Thomas Zimmermann wrote: > Hi > > Am 10.01.22 um 20:34 schrieb Matthew Wilcox: > > TLDR: I want to introduce a new data type: > > > > struct phyr { > > phys_addr_t addr; > > size_t len; > > }; > > Did you look at struct dma_buf_map? [1] Thanks. I wasn't aware of that. It doesn't seem to actually solve the problem, in that it doesn't carry any length information. Did you mean to point me at a different structure?
Re: Phyr Starter
On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote: > > > Finally, it may be possible to stop using scatterlist to describe the > > input to the DMA-mapping operation. We may be able to get struct > > scatterlist down to just dma_address and dma_length, with chaining > > handled through an enclosing struct. > > Can you talk about this some more? IMHO one of the key properties of > the scatterlist is that it can hold huge amounts of pages without > having to do any kind of special allocation due to the chaining. > > The same will be true of the phyr idea right? My thinking is that we'd pass a relatively small array of phyr (maybe 16 entries) to get_user_phyr(). If that turned out not to be big enough, then we have two options; one is to map those 16 ranges with sg and use the sg chaining functionality before throwing away the phyr and calling get_user_phyr() again. The other is to stash those 16 ranges somewhere (eg a resizing array of some kind) and keep calling get_user_phyr() to get the next batch of 16; once we've got the entire range, call sg_map_phyr() passing all of the phyrs. > > I would like to see phyr replace bio_vec everywhere it's currently used. > > I don't have time to do that work now because I'm busy with folios. > > If someone else wants to take that on, I shall cheer from the sidelines. > > What I do intend to do is: > > I wonder if we mixed things though.. > > IMHO there is a lot of optimization to be had by having a > datastructure that is expressly 'the physical pages underlying a > contiguous chunk of va' > > If you limit to that scenario then we can be more optimal because > things like byte granular offsets and size in the interior pages don't > need to exist. Every interior chunk is always aligned to its order and > we only need to record the order. > > An overall starting offset and total length allow computing the slice > of the first/last entry. > > If the physical address is always aligned then we get 12 free bits > from the min 4k alignment and also only need to store order, not an > arbitary byte granular length. > > The win is I think we can meaningfully cover most common cases using > only 8 bytes per physical chunk. The 12 bits can be used to encode the > common orders (4k, 2M, 1G, etc) and some smart mechanism to get > another 16 bits to cover 'everything'. > > IMHO storage density here is quite important, we end up having to keep > this stuff around for a long time. > > I say this here, because I've always though bio_vec/etc are more > general than we actually need, being byte granular at every chunk. Oh, I can do you one better on the bit-packing scheme. There's a representation of every power-of-two that is naturally aligned, using just one extra bit. Let's say we have 3 bits of address space and 4 bits to represent any power of two allocation within that address space: index-0, order-0 0010 index-1, order-0 ... 1110 index-7, order-0 0001 index-0, order-1 0101 index-2, order-1 1001 index-4, order-1 1101 index-6, order-1 0011 index-0, order-2 1011 index-4, order-2 0111 index-0, order-3 has no meaning and can be used to represent an invalid range, if that's useful. The lowest clear bit decodes to the order, and (x & (x+1))/2 gets you the index. That leaves you with another 11 bits to represent something smart about partial pages. The question is whether this is the right kind of optimisation to be doing. I hear you that we want a dense format, but it's questionable whether the kind of thing you're suggesting is actually denser than this scheme. For example, if we have 1GB pages and userspace happens to have allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented as a single phyr. A power-of-two scheme would have us use four entries (3, 4-7, 8-9, 10). Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very cheap. If I have to walk PTEs looking for pages which can be combined together, I end up with interesting behaviour where the length of the list shrinks and expands. Using the example above, as I walk successive PUDs, the data struct looks like this: (3) (3, 4) (3, 4-5) (3, 4-5, 6) (3, 4-7) (3, 4-7, 8) (3, 4-7, 8-9) (3, 4-7, 8-9, 10) We could end up with a situation where we stop because the array is full, even though if we kept going, it'd shrink back down below the length of the array (in this example, an array of length 2 would stop when it saw page 6, even though page 7 shrinks it back down again). > What is needed is a full scatterlist replacement, including the IOMMU > part. > > For the IOMMU I would expect the datastructure to be re-used, we start >
Phyr Starter
TLDR: I want to introduce a new data type: struct phyr { phys_addr_t addr; size_t len; }; and use it to replace bio_vec as well as using it to replace the array of struct pages used by get_user_pages() and friends. --- There are two distinct problems I want to address: doing I/O to memory which does not have a struct page and efficiently doing I/O to large blobs of physically contiguous memory, regardless of whether it has a struct page. There are some other improvements which I regard as minor. There are many types of memory that one might want to do I/O to that do not have a struct page, some examples: - Memory on a graphics card (or other PCI card, but gfx seems to be the primary provider of DRAM on the PCI bus today) - DAX, or other pmem (there are some fake pages today, but this is mostly a workaround for the IO problem today) - Guest memory being accessed from the hypervisor (KVM needs to create structpages to make this happen. Xen doesn't ...) All of these kinds of memories can be addressed by the CPU and so also by a bus master. That is, there is a physical address that the CPU can use which will address this memory, and there is a way to convert that to a DMA address which can be programmed into another device. There's no intent here to support memory which can be accessed by a complex scheme like writing an address to a control register and then accessing the memory through a FIFO; this is for memory which can be accessed by DMA and CPU loads and stores. For get_user_pages() and friends, we currently fill an array of struct pages, each one representing PAGE_SIZE bytes. For an application that is using 1GB hugepages, writing 2^18 entries is a significant overhead. It also makes drivers hard to write as they have to recoalesce the struct pages, even though the VM can tell it whether those 2^18 pages are contiguous. On the minor side, struct phyr can represent any mappable chunk of memory. A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr can represent larger than 4GB. A phyr is the same size as a bio_vec on 64 bit (16 bytes), and the same size for 32-bit with PAE (12 bytes). It is smaller for 32-bit machines without PAE (8 bytes instead of 12). Finally, it may be possible to stop using scatterlist to describe the input to the DMA-mapping operation. We may be able to get struct scatterlist down to just dma_address and dma_length, with chaining handled through an enclosing struct. I would like to see phyr replace bio_vec everywhere it's currently used. I don't have time to do that work now because I'm busy with folios. If someone else wants to take that on, I shall cheer from the sidelines. What I do intend to do is: - Add an interface to gup.c to pin/unpin N phyrs - Add a sg_map_phyrs() This will take an array of phyrs and allocate an sg for them - Whatever else I need to do to make one RDMA driver happy with this scheme At that point, I intend to stop and let others more familiar with this area of the kernel continue the conversion of drivers. P.S. If you've had the Prodigy song running through your head the whole time you've been reading this email ... I'm sorry / You're welcome. If people insist, we can rename this to phys_range or something boring, but I quite like the spelling of phyr with the pronunciation of "fire".
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote: > Assuming changing FSDAX is hard.. How would DAX people feel about just > deleting the PUD/PMD support until it can be done with compound pages? I think there are customers who would find that an unacceptable answer :-)
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
It would probably help if you cc'd Dan on this. As far as I know he's the only person left who cares about GUP on DAX. On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote: > > From: Ralph Campbell > > > > ZONE_DEVICE struct pages have an extra reference count that complicates the > > code for put_page() and several places in the kernel that need to check the > > reference count to see that a page is not being used (gup, compaction, > > migration, etc.). Clean up the code so the reference count doesn't need to > > be treated specially for ZONE_DEVICE. > > > > Signed-off-by: Ralph Campbell > > Signed-off-by: Alex Sierra > > Reviewed-by: Christoph Hellwig > > --- > > v2: > > AS: merged this patch in linux 5.11 version > > > > v5: > > AS: add condition at try_grab_page to check for the zone device type, while > > page ref counter is checked less/equal to zero. In case of device zone, > > pages > > ref counter are initialized to zero. > > > > v7: > > AS: fix condition at try_grab_page added at v5, is invalid. It supposed > > to fix xfstests/generic/413 test, however, there's a known issue on > > this test where DAX mapped area DIO to non-DAX expect to fail. > > https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com > > This condition was removed after rebase over patch series > > https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com > > --- > > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- > > fs/dax.c | 4 +- > > include/linux/dax.h| 2 +- > > include/linux/memremap.h | 7 +-- > > include/linux/mm.h | 11 > > lib/test_hmm.c | 2 +- > > mm/internal.h | 8 +++ > > mm/memcontrol.c| 6 +-- > > mm/memremap.c | 69 +++--- > > mm/migrate.c | 5 -- > > mm/page_alloc.c| 3 ++ > > mm/swap.c | 45 ++--- > > 13 files changed, 46 insertions(+), 120 deletions(-) > > Has anyone tested this with FSDAX? Does get_user_pages() on fsdax > backed memory still work? > > What refcount value does the struct pages have when they are installed > in the PTEs? Remember a 0 refcount will make all the get_user_pages() > fail. > > I'm looking at the call path starting in ext4_punch_hole() and I would > expect to see something manipulating the page ref count before > the ext4_break_layouts() call path gets to the dax_page_unused() test. > > All I see is we go into unmap_mapping_pages() - that would normally > put back the page references held by PTEs but insert_pfn() has this: > > if (pfn_t_devmap(pfn)) > entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); > > And: > > static inline pte_t pte_mkdevmap(pte_t pte) > { > return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); > } > > Which interacts with vm_normal_page(): > > if (pte_devmap(pte)) > return NULL; > > To disable that refcounting? > > So... I have a feeling this will have PTEs pointing to 0 refcount > pages? Unless FSDAX is !pte_devmap which is not the case, right? > > This seems further confirmed by this comment: > > /* >* If we race get_user_pages_fast() here either we'll see the >* elevated page count in the iteration and wait, or >* get_user_pages_fast() will see that the page it took a reference >* against is no longer mapped in the page tables and bail to the >* get_user_pages() slow path. The slow path is protected by >* pte_lock() and pmd_lock(). New references are not taken without >* holding those locks, and unmap_mapping_pages() will not zero the >* pte or pmd without holding the respective lock, so we are >* guaranteed to either see new references or prevent new >* references from being established. >*/ > > Which seems to explain this scheme relies on unmap_mapping_pages() to > fence GUP_fast, not on GUP_fast observing 0 refcounts when it should > stop. > > This seems like it would be properly fixed by using normal page > refcounting for PTEs - ie stop using special for these pages? > > Does anyone know why devmap is pte_special anyhow? > > > +void free_zone_device_page(struct page *page) > > +{ > > + switch (page->pgmap->type) { > > + case MEMORY_DEVICE_PRIVATE: > > + free_device_page(page); > > + return; > > + case MEMORY_DEVICE_FS_DAX: > > + /* notify page idle */ > > + wake_up_var(&page->_refcount); > > + return; > > It is not for this series, but I wonder if we should just always call > ops->page_free and have free_device_page() logic in that cal
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote: > From: Ralph Campbell > > ZONE_DEVICE struct pages have an extra reference count that complicates the > code for put_page() and several places in the kernel that need to check the > reference count to see that a page is not being used (gup, compaction, > migration, etc.). Clean up the code so the reference count doesn't need to > be treated specially for ZONE_DEVICE. > > Signed-off-by: Ralph Campbell > Signed-off-by: Alex Sierra > Reviewed-by: Christoph Hellwig Reviewed-by: Matthew Wilcox (Oracle)
Re: [PATCH v1 1/2] ext4/xfs: add page refcount helper
On Thu, Oct 14, 2021 at 10:39:27AM -0500, Alex Sierra wrote: > From: Ralph Campbell > > There are several places where ZONE_DEVICE struct pages assume a reference > count == 1 means the page is idle and free. Instead of open coding this, > add a helper function to hide this detail. > > Signed-off-by: Ralph Campbell > Signed-off-by: Alex Sierra > Reviewed-by: Christoph Hellwig > Acked-by: Theodore Ts'o > Acked-by: Darrick J. Wong Reviewed-by: Matthew Wilcox (Oracle)
Re: [PATCH v1 00/12] MEMORY_DEVICE_COHERENT for CPU-accessible coherent device memory
On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote: > Because I must ask: if this feature is for one single computer which > presumably has a custom kernel, why add it to mainline Linux? I think in particular patch 2 deserves to be merged because it removes a ton of cruft from every call to put_page() (at least if you're using a distro config). It makes me nervous, but I think it's the right thing to do. It may well need more fixups after it has been merged, but that's life.
Re: BoF at LPC: Documenting the Heterogeneous Memory Model Architecture
On Thu, Sep 23, 2021 at 04:25:08PM -0400, Felix Kuehling wrote: > Change of plan: Instead of a BoF, this is now a session in the "GPU/media/AI > buffer management and interop MC" micro conference. Thank you Daniel Stone > for making that happen. > https://linuxplumbersconf.org/event/11/contributions/1112/ > > It is scheduled for tomorrow (Friday) 08:40-10:00 Pacific, 11:40-13:00 > Eastern, 15:40-17:00 UTC. That's up against: Direct map management Vlastimil Babka, Mike Rapoport, Rick Edgecombe 11:30-12:15. Seems like a lot of the same people would want to be in both sessions. Maybe one could be moved?
Re: [PATCH v5 3/9] dyndbg: add DEFINE_DYNAMIC_DEBUG_CATEGORIES and callbacks
On Fri, Aug 13, 2021 at 06:51:05PM +0300, Andy Shevchenko wrote: > On Fri, Aug 13, 2021 at 09:17:11AM -0600, Jim Cromie wrote: > > +int param_set_dyndbg(const char *instr, const struct kernel_param *kp) > > +{ > > + unsigned long inbits; > > + int rc, i, chgct = 0, totct = 0; > > + char query[OUR_QUERY_SIZE]; > > + struct dyndbg_bitdesc *bitmap = (struct dyndbg_bitdesc *) kp->data; > > So you need space after ')' ? More importantly, if ->data is of type 'void *', it is bad style to cast the pointer at all. I can't tell what type 'data' has; if it is added to kernel_param as part of this series, I wasn't cc'd on the patch that did that.
Re: [RFC PATCH v2 1/8] ext4/xfs: add page refcount helper
On Mon, Jun 07, 2021 at 03:42:19PM -0500, Alex Sierra wrote: > +++ b/include/linux/dax.h > @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space > *mapping) > return mapping->host && IS_DAX(mapping->host); > } > > +static inline bool dax_layout_is_idle_page(struct page *page) > +{ > + return page_ref_count(page) == 1; > +} We already have something called an idle page, and that's quite a different thing from this. How about dax_page_unused() (it's a use count, so once it's got down to its minimum value, it's unused)?
Re: [RFC PATCH v2 1/8] ext4/xfs: add page refcount helper
On Tue, Jun 08, 2021 at 12:29:04AM +, Liam Howlett wrote: > * Alex Sierra [210607 16:43]: > > From: Ralph Campbell > > > > There are several places where ZONE_DEVICE struct pages assume a reference > > count == 1 means the page is idle and free. Instead of open coding this, > > add a helper function to hide this detail. > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index b52f084aa643..8909a91cd381 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space > > *mapping) > > return mapping->host && IS_DAX(mapping->host); > > } > > > > +static inline bool dax_layout_is_idle_page(struct page *page) > > +{ > > + return page_ref_count(page) == 1; > > +} > > If this races with page_ref_count(page) == 0, then it will return false > that a page is idle when the page is being freed. I don't know the code > well enough to say if this is an issue or not so please let me know. > > For example: > !dax_layout_is_idle_page() will return true in dax_busy_page() above > when the count is 0 and return the page. > > Maybe you are sure to have at least one reference when calling this? It > might be worth adding a comment. You're getting confused by the problem that the next patch fixes, which is that devmap pages were stupidly given an elevated refcount. devmap pages are considered "free" when their refcount is 1. See put_page(), put_devmap_managed_page() and so on.
Re: [RFC PATCH v2 0/8] Support DEVICE_GENERIC memory in migrate_vma_*
On Mon, Jun 07, 2021 at 03:42:18PM -0500, Alex Sierra wrote: > v1: > https://lore.kernel.org/linux-mm/20210529064022.gb15...@lst.de/T/ Please copy and paste the rationale into followup patch series instead of sending a link: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE. The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages. Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages. This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)? This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch [https://cgit.freedesktop.org/drm/drm/log/?h=drm-next]. On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations and GPU memory mappings that mix pages in VRAM and system memory. [https://patchwork.kernel.org/project/dri-devel/list/?series=489811] > v2: > This patch series version has merged "[RFC PATCH v3 0/2] > mm: remove extra ZONE_DEVICE struct page refcount" patch series made by > Ralph Campbell. It also applies at the top of these series, our changes > to support device generic type in migration_vma helpers. > This has been tested in systems with device memory that has coherent > access by CPU. > > Also addresses the following feedback made in v1: > - Isolate in one patch kernel/resource.c modification, based > on Christoph's feedback. > - Add helpers check for generic and private type to avoid > duplicated long lines. > > I like to provide an overview of what each of the patches does in a series: > > Patches 1-2: Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches > Patch 3: Export lookup_resource > Patches 4-5: AMDGPU driver changes to register and use DEVICE_GENERIC memory > Patches 6-8: Handle DEVICE_GENERIC memory in migration helpers > > Alex Sierra (6): > kernel: resource: lookup_resource as exported symbol > drm/amdkfd: add SPM support for SVM > drm/amdkfd: generic type as sys mem on migration to ram > include/linux/mm.h: helpers to check zone device generic type > mm: add generic type support to migrate_vma helpers > mm: call pgmap->ops->page_free for DEVICE_GENERIC pages > > Ralph Campbell (2): > ext4/xfs: add page refcount helper > mm: remove extra ZONE_DEVICE struct page refcount > > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- > drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 -- > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- > fs/dax.c | 8 +-- > fs/ext4/inode.c | 5 +- > fs/xfs/xfs_file.c| 4 +- > include/linux/dax.h | 10 > include/linux/memremap.h | 7 +-- > include/linux/mm.h | 52 +++--- > kernel/resource.c| 2 +- > lib/test_hmm.c | 2 +- > mm/internal.h| 8 +++ > mm/memremap.c| 69 +++- > mm/migrate.c | 13 ++--- > mm/page_alloc.c | 3 ++ > mm/swap.c| 45 ++-- > 16 files changed, 83 insertions(+), 164 deletions(-) > > -- > 2.17.1 > >