Re: WARNING in shmem_release_dquot

2024-02-19 Thread Hugh Dickins
On Mon, 29 Jan 2024, Ubisectech Sirius wrote:

> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered an issue in Linux kernel
> 6.8.0-rc1-gecb1b8288dc7. Attached to the email is a POC file of the issue.
> 
> Stack dump:
> [  246.195553][ T4096] [ cut here ]
> [  246.196540][ T4096] quota id 16384 from dquot 888051bd3000, not in rb 
> tree!
> [ 246.198829][ T4096] WARNING: CPU: 1 PID: 4096 at mm/shmem_quota.c:290 
> shmem_release_dquot (mm/shmem_quota.c:290 (discriminator 3))
> [  246.199955][ T4096] Modules linked in:
> [  246.200435][ T4096] CPU: 1 PID: 4096 Comm: kworker/u6:6 Not tainted 
> 6.8.0-rc1-gecb1b8288dc7 #21
> [  246.201566][ T4096] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [  246.202667][ T4096] Workqueue: events_unbound quota_release_workfn
> [ 246.203516][ T4096] RIP: 0010:shmem_release_dquot (mm/shmem_quota.c:290 
> (discriminator 3))
> [ 246.204276][ T4096] Code: e8 28 d9 18 00 e9 b3 f8 ff ff e8 6e e1 c2 ff c6 
> 05 bf e8 1b 0d 01 90 48 c7 c7 80 f0 b8 8a 4c 89 ea 44 89 e6 e8 14 6d 89 ff 90 
> <0f> 0b 90 90 e9 18 fb ff ff e8 f5 d8 18 00 e9 a2 fa ff ff e8 0b d9
> All code
> 
>0:   e8 28 d9 18 00  call   0x18d92d
>5:   e9 b3 f8 ff ff  jmp    0xf8bd
>a:   e8 6e e1 c2 ff  call   0xffc2e17d
>f:   c6 05 bf e8 1b 0d 01    movb   $0x1,0xd1be8bf(%rip)        # 0xd1be8d5
>   16:   90  nop
>   17:   48 c7 c7 80 f0 b8 8a    mov    $0x8ab8f080,%rdi
>   1e:   4c 89 ea    mov    %r13,%rdx
>   21:   44 89 e6    mov    %r12d,%esi
>   24:   e8 14 6d 89 ff  call   0xff896d3d
>   29:   90  nop
>   2a:*  0f 0b   ud2 <-- trapping instruction
>   2c:   90  nop
>   2d:   90  nop
>   2e:   e9 18 fb ff ff  jmp    0xfb4b
>   33:   e8 f5 d8 18 00  call   0x18d92d
>   38:   e9 a2 fa ff ff  jmp    0xfadf
>   3d:   e8  .byte 0xe8
>   3e:   0b d9   or %ecx,%ebx
> 
> Code starting with the faulting instruction
> ===
>0:   0f 0b   ud2
>2:   90  nop
>3:   90  nop
>4:   e9 18 fb ff ff  jmp    0xfb21
>9:   e8 f5 d8 18 00  call   0x18d903
>e:   e9 a2 fa ff ff  jmp    0xfab5
>   13:   e8  .byte 0xe8
>   14:   0b d9   or %ecx,%ebx
> [  246.206640][ T4096] RSP: 0018:c9000604fbc0 EFLAGS: 00010286
> [  246.207403][ T4096] RAX:  RBX:  RCX: 
> 814c77da
> [  246.208514][ T4096] RDX: 888049a58000 RSI: 814c77e7 RDI: 
> 0001
> [  246.209429][ T4096] RBP:  R08: 0001 R09: 
> 
> [  246.210362][ T4096] R10: 0001 R11: 0001 R12: 
> 4000
> [  246.211367][ T4096] R13: 888051bd3000 R14: dc00 R15: 
> 888051bd3040
> [  246.212327][ T4096] FS:  () GS:88807ec0() 
> knlGS:
> [  246.213387][ T4096] CS:  0010 DS:  ES:  CR0: 80050033
> [  246.214232][ T4096] CR2: 7ffee748ec80 CR3: 0cb78000 CR4: 
> 00750ef0
> [  246.215216][ T4096] DR0:  DR1:  DR2: 
> 
> [  246.216187][ T4096] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  246.217148][ T4096] PKRU: 5554
> [  246.217615][ T4096] Call Trace:
> [  246.218090][ T4096]  <TASK>
> [ 246.218467][ T4096] ? show_regs (arch/x86/kernel/dumpstack.c:479)
> [ 246.218979][ T4096] ? __warn (kernel/panic.c:677)
> [ 246.219505][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 
> (discriminator 3))
> [ 246.220197][ T4096] ? report_bug (lib/bug.c:201 lib/bug.c:219)
> [ 246.220775][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 
> (discriminator 3))
> [ 246.221500][ T4096] ? handle_bug (arch/x86/kernel/traps.c:238)
> [ 246.222081][ T4096] ? exc_invalid_op (arch/x86/kernel/traps.c:259 
> (discriminator 1))
> [ 246.222687][ T4096] ? asm_exc_invalid_op 
> (./arch/x86/include/asm/idtentry.h:568)
> [ 246.223296][ T4096] ? __warn_printk (./include/linux/context_tracking.h:155 
> kernel/panic.c:726)
> [ 246.223878][ T4096] ? __warn_printk (kernel/panic.c:717)
> [ 246.224460][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 
> (discriminator 3))
> [ 246.225125][ T4096] quota_release_workfn (fs/quota/dquot.c:839)
> [ 246.225792][ T4096] ? dquot_release (fs/quota/dquot.c:810)
> [ 246.226401][ T4096] process_one_work (kernel/workqueue.c:2638)
> [ 246.227001][ T4096] ? lock_sync (kernel/locking/lockdep.c:5722)
> [ 246.227509][ T4096] ? workqueue_congested 

Re: [PATCH v2] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs

2021-04-16 Thread Hugh Dickins
On Mon, 5 Apr 2021, Collin Fijalkovich wrote:

> Transparent huge pages are supported for read-only non-shmem files,
> but are only used for vmas with VM_DENYWRITE. This condition ensures that
> file THPs are protected from writes while an application is running
> (ETXTBSY).  Any existing file THPs are then dropped from the page cache
> when a file is opened for write in do_dentry_open(). Since sys_mmap
> ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
> produced by execve().
> 
> Systems that make heavy use of shared libraries (e.g. Android) are unable
> to apply VM_DENYWRITE through the dynamic linker, preventing them from
> benefiting from the resultant reduced contention on the TLB.
> 
> This patch reduces the constraint on file THPs allowing use with any
> executable mapping from a file not opened for write (see
> inode_is_open_for_write()). It also introduces additional conditions to
> ensure that files opened for write will never be backed by file THPs.
> 
> Restricting the use of THPs to executable mappings eliminates the risk that
> a read-only file later opened for write would encounter significant
> latencies due to page cache truncation.
> 
> The ld linker flag '-z max-page-size=(hugepage size)' can be used to
> produce executables with the necessary layout. The dynamic linker must
> map these file's segments at a hugepage size aligned vma for the mapping to
> be backed with THPs.
> 
> Comparison of the performance characteristics of 4KB and 2MB-backed
> libraries follows; the Android dex2oat tool was used to AOT compile an
> example application on a single ARM core.
> 
> 4KB Pages:
> ==
> 
> count              event_name          # count / runtime
> 598,995,035,942    cpu-cycles          # 1.800861 GHz
>  81,195,620,851    raw-stall-frontend  # 244.112 M/sec
> 347,754,466,597    iTLB-loads          # 1.046 G/sec
>   2,970,248,900    iTLB-load-misses    # 0.854122% miss rate
> 
> Total test time: 332.854998 seconds.
> 
> 2MB Pages:
> ==
> 
> count              event_name          # count / runtime
> 592,872,663,047    cpu-cycles          # 1.800358 GHz
>  76,485,624,143    raw-stall-frontend  # 232.261 M/sec
> 350,478,413,710    iTLB-loads          # 1.064 G/sec
>     803,233,322    iTLB-load-misses    # 0.229182% miss rate
> 
> Total test time: 329.826087 seconds
> 
> A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
> 
> /apex/com.android.art/lib64/libart.so
> FilePmdMapped:  4096 kB
> 
> /apex/com.android.art/lib64/libart-compiler.so
> FilePmdMapped:  2048 kB
> 
> Signed-off-by: Collin Fijalkovich 

Acked-by: Hugh Dickins 

and you also won

Reviewed-by: William Kucharski 

in the v1 thread.

I had hoped to see a more dramatic difference in the numbers above,
but I'm a performance naif, and presume other loads and other
libraries may show further benefit.

> ---
> Changes v1 -> v2:
> * commit message 'non-shmem filesystems' -> 'non-shmem files'
> * Add performance testing data to commit message
> 
>  fs/open.c   | 13 +++--
>  mm/khugepaged.c | 16 +++-
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index e53af13b5835..f76e960d10ea 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f,
>* XXX: Huge page cache doesn't support writing yet. Drop all page
>* cache for this file before processing writes.
>*/
> - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping))
> - truncate_pagecache(inode, 0);
> + if (f->f_mode & FMODE_WRITE) {
> + /*
> +  * Paired with smp_mb() in collapse_file() to ensure nr_thps
> +  * is up to date and the update to i_writecount by
> +  * get_write_access() is visible. Ensures subsequent insertion
> +  * of THPs into the page cache will fail.
> +  */
> + smp_mb();
> + if (filemap_nr_thps(inode->i_mapping))
> + truncate_pagecache(inode, 0);
> + }
>  
>   return 0;
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a7d6cb912b05..4c7cc877d5e3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -459,7 +459,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>  
>   /* Read-only file mappings need to be aligned for THP to work. */
>   if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
> - (vm_flags & VM_DENYWRITE)) {
> + !inode_is_open_for_write(vma->vm_file->f_i
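
For readers not steeped in barrier pairings, what the new comment describes is the classic store / full barrier / load pattern: each side publishes its own update before checking the other's, so at least one side must observe the conflict. A minimal userspace analogue (illustrative only; names are invented and this is not the kernel code; build with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int writers;   /* stands in for i_writecount      */
    static atomic_int nr_thps;   /* stands in for filemap_nr_thps() */

    static void *open_for_write(void *unused)
    {
        atomic_fetch_add(&writers, 1);
        atomic_thread_fence(memory_order_seq_cst);  /* the smp_mb() in do_dentry_open() */
        if (atomic_load_explicit(&nr_thps, memory_order_relaxed))
            puts("writer: THPs present, would truncate_pagecache()");
        return NULL;
    }

    static void *collapse(void *unused)
    {
        atomic_fetch_add(&nr_thps, 1);
        atomic_thread_fence(memory_order_seq_cst);  /* the smp_mb() in collapse_file() */
        if (atomic_load_explicit(&writers, memory_order_relaxed))
            puts("collapse: file open for write, would back out");
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, open_for_write, NULL);
        pthread_create(&b, NULL, collapse, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;   /* at least one of the two messages always prints */
    }

With both fences in place the two relaxed loads can never both read zero, which is the property relied on here: an open-for-write and a THP collapse may race, but they cannot both miss each other.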

Re: [PATCH v2 9/9] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes

2021-04-16 Thread Hugh Dickins
On Mon, 12 Apr 2021, Axel Rasmussen wrote:

> In a previous commit, we added the mcopy_atomic_install_ptes() helper.
> This helper does the job of setting up PTEs for an existing page, to map
> it into a given VMA. It deals with both the anon and shmem cases, as
> well as the shared and private cases.
> 
> In other words, shmem_mcopy_atomic_pte() duplicates a case it already
> handles. So, expose it, and let shmem_mcopy_atomic_pte() use it
> directly, to reduce code duplication.
> 
> This requires that we refactor shmem_mcopy_atomic_pte() a bit:
> 
> Instead of doing accounting (shmem_recalc_inode() et al) part-way
> through the PTE setup, do it beforehand. This frees up
> mcopy_atomic_install_ptes() from having to care about this accounting,
> but it does mean we need to clean it up if we get a failure afterwards
> (shmem_uncharge()).
> 
> We can *almost* use shmem_charge() to do this, reducing code
> duplication. But, it does `inode->i_mapping->nrpages++`, which would
> double-count since shmem_add_to_page_cache() also does this.
> 
> Signed-off-by: Axel Rasmussen 
> ---
>  include/linux/userfaultfd_k.h |  5 
>  mm/shmem.c| 52 +++
>  mm/userfaultfd.c  | 25 -
>  3 files changed, 27 insertions(+), 55 deletions(-)

Very nice, and it gets better.

> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 794d1538b8ba..3e20bfa9ef80 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
>   MCOPY_ATOMIC_CONTINUE,
>  };
>  
> +extern int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t 
> *dst_pmd,

mcopy_atomic_install_pte throughout as before.

> +  struct vm_area_struct *dst_vma,
> +  unsigned long dst_addr, struct page *page,
> +  bool newly_allocated, bool wp_copy);
> +
>  extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long 
> dst_start,
>   unsigned long src_start, unsigned long len,
>   bool *mmap_changing, __u64 mode);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 3f48cb5e8404..9b12298405a4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2376,10 +2376,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   struct address_space *mapping = inode->i_mapping;
>   gfp_t gfp = mapping_gfp_mask(mapping);
>   pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> - spinlock_t *ptl;
>   void *page_kaddr;
>   struct page *page;
> - pte_t _dst_pte, *dst_pte;
>   int ret;
>   pgoff_t max_off;
>  
> @@ -2389,8 +2387,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  
>   if (!*pagep) {
>   page = shmem_alloc_page(gfp, info, pgoff);
> - if (!page)
> - goto out_unacct_blocks;
> + if (!page) {
> + shmem_inode_unacct_blocks(inode, 1);
> + goto out;
> + }
>  
>   if (!zeropage) {/* COPY */
>   page_kaddr = kmap_atomic(page);
> @@ -2430,59 +2430,27 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   if (ret)
>   goto out_release;
>  
> - _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> - if (dst_vma->vm_flags & VM_WRITE)
> - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> - else {
> - /*
> -  * We don't set the pte dirty if the vma has no
> -  * VM_WRITE permission, so mark the page dirty or it
> -  * could be freed from under us. We could do it
> -  * unconditionally before unlock_page(), but doing it
> -  * only if VM_WRITE is not set is faster.
> -  */
> - set_page_dirty(page);
> - }
> -
> - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> -
> - ret = -EFAULT;
> - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> - if (unlikely(pgoff >= max_off))
> - goto out_release_unlock;
> -
> - ret = -EEXIST;
> - if (!pte_none(*dst_pte))
> - goto out_release_unlock;
> -
> - lru_cache_add(page);
> -
>   spin_lock_irq(&info->lock);
>   info->alloced++;
>   inode->i_blocks += BLOCKS_PER_PAGE;
>   shmem_recalc_inode(inode);
>   spin_unlock_irq(&info->lock);
>  
> - inc_mm_counter(dst_mm, mm_counter_file(page));
> - page_add_file_rmap(page, false);
> - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> + ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
> + page, true, false);
> + if (ret)
> + goto out_release_uncharge;
>  
> - /* No need to invalidate - it was non-present before */
> - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> - 

Re: [PATCH v2 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

2021-04-16 Thread Hugh Dickins
On Mon, 12 Apr 2021, Axel Rasmussen wrote:

> With this change, userspace can resolve a minor fault within a
> shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
> match those for hugetlbfs - we look up the existing page in the page
> cache, and install PTEs for it.

s/PTEs/a PTE/

> 
> This commit introduces a new helper: mcopy_atomic_install_ptes.

The plural is misleading: it only installs a single pte, so I'm going
to ask you to change it throughout to mcopy_atomic_install_pte()
(I'm not thrilled with the "mcopy" nor the "atomic", but there you are
being consistent with userfaultfd's peculiar naming, so let them be).

> 
> Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
> shmem.c? The existing userfault implementation only relies on shmem.c
> for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
> fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
> shmem in one place, regardless of shared/private (to reduce code
> duplication).
> 
> Why add a new mcopy_atomic_install_ptes helper? A problem we have with
> continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
> *close* to what we want, but not exactly. We do want to setup the PTEs
> in a CONTINUE operation, but we don't want to e.g. allocate a new page,
> charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
> we have the problem stated above: shmem_mcopy_atomic_pte() and
> mcopy_atomic_pte() both handle one-half of the problem (shared /
> private) continue cares about. So, introduce mcontinue_atomic_pte(), to
> handle all of the shmem continue cases. Introduce the helper so it
> doesn't duplicate code with mcopy_atomic_pte().
> 
> In a future commit, shmem_mcopy_atomic_pte() will also be modified to
> use this new helper. However, since this is a bigger refactor, it seems
> most clear to do it as a separate change.

(Actually that turns out to be a nice deletion of lines,
but you're absolutely right to do it as a separate patch.)

> 
> Signed-off-by: Axel Rasmussen 
> ---
>  mm/userfaultfd.c | 176 +++
>  1 file changed, 131 insertions(+), 45 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 23fa2583bbd1..8df0438f5d6a 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -48,6 +48,87 @@ struct vm_area_struct *find_dst_vma(struct mm_struct 
> *dst_mm,
>   return dst_vma;
>  }
>  
> +/*
> + * Install PTEs, to map dst_addr (within dst_vma) to page.
> + *
> + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> + * whether or not dst_vma is VM_SHARED. It also handles the more general
> + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> + * backed, or not).
> + *
> + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> + * shmem_mcopy_atomic_pte instead.

Right, I'm thinking in terms of five cases below (I'm not for a moment
saying that you need to list these out in the comment, just saying that
I could not get my head around the issues in this function without
listing them out for myself):

1. anon private mcopy (using anon page newly allocated)
2. shmem private mcopy (using anon page newly allocated)
3. shmem private mcontinue (using page in cache from shmem_getpage)
4. shmem shared mcontinue (using page in cache from shmem_getpage)
5. shmem shared mcopy (using page in cache newly allocated)

Of which each has a VM_WRITE and a !VM_WRITE case; and the third and
fourth cases are new in this patch (it really would have been better
to introduce mcopy_atomic_install_pte() in a separate earlier patch,
but don't change that now we've got this far); and the fifth case does
*not* use mcopy_atomic_install_pte() in this patch, but will in future.

And while making these notes, let's highlight again what is commented
elsewhere, the odd nature of the second case: where userfaultfd short
circuits to an anonymous CoW page without instantiating the shmem page.
(Please double-check me on that: quite a lot of my comments below are
about this case 2, so if I've got it wrong, then I've got a lot wrong.)
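
The same five cases, rendered as a throwaway userspace table purely so the dispatch is easy to eyeball (the strings paraphrase the commit message and the notes above; everything else here is invented for the illustration):

    #include <stdbool.h>
    #include <stdio.h>

    enum mode { NORMAL, CONTINUE };

    /* Who ends up installing the pte, and where the page comes from. */
    static const char *handled_by(enum mode mode, bool shmem, bool vm_shared)
    {
        if (mode == CONTINUE)       /* cases 3 and 4 */
            return "mcopy_atomic_install_pte, page found in cache by shmem_getpage";
        if (shmem && vm_shared)     /* case 5 */
            return "shmem_mcopy_atomic_pte, newly allocated cache page (converted in 9/9)";
        return "mcopy_atomic_install_pte, newly allocated anon page";   /* cases 1 and 2 */
    }

    int main(void)
    {
        printf("1 anon  private copy:     %s\n", handled_by(NORMAL,   false, false));
        printf("2 shmem private copy:     %s\n", handled_by(NORMAL,   true,  false));
        printf("3 shmem private continue: %s\n", handled_by(CONTINUE, true,  false));
        printf("4 shmem shared  continue: %s\n", handled_by(CONTINUE, true,  true));
        printf("5 shmem shared  copy:     %s\n", handled_by(NORMAL,   true,  true));
        return 0;
    }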

> + */
> +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t 
> *dst_pmd,

mcopy_atomic_install_pte() throughout please.

> +  struct vm_area_struct *dst_vma,
> +  unsigned long dst_addr, struct page *page,
> +  bool newly_allocated, bool wp_copy)
> +{
> + int ret;
> + pte_t _dst_pte, *dst_pte;
> + int writable;

Sorry, it's silly of me, but I keep getting irritated by "int writable"
in company with the various bools; and the way vm_shared is initialized
below, but writable initialized later.  Please humour me by making it
bool writable = dst_vma->vm_flags & VM_WRITE;

> + bool vm_shared = dst_vma->vm_flags & VM_SHARED;

And I've found that we also need
bool 

Re: [PATCH v3 00/10] userfaultfd: add minor fault handling for shmem

2021-04-15 Thread Hugh Dickins
On Thu, 15 Apr 2021, Axel Rasmussen wrote:

> Base
> 
> 
> This series is based on (and therefore should apply cleanly to) the tag
> "v5.12-rc7-mmots-2021-04-11-20-49", additionally with Peter's selftest cleanup
> series applied first:
> 
> https://lore.kernel.org/patchwork/cover/1412450/
> 
> Changelog
> =
> 
> v2->v3:
> - Picked up {Reviewed,Acked}-by's.
> - Reorder commits: introduce CONTINUE before MINOR registration. [Hugh, Peter]
> - Don't try to {unlock,put}_page an xarray value in shmem_getpage_gfp. [Hugh]
> - Move enum mcopy_atomic_mode forward declare out of CONFIG_HUGETLB_PAGE. 
> [Hugh]
> - Keep mistakenly removed UFFD_USER_MODE_ONLY in selftest. [Peter]
> - Cleanup context management in self test (make clear implicit, remove 
> unneeded
>   return values now that we have err()). [Peter]
> - Correct dst_pte argument to dst_pmd in shmem_mcopy_atomic_pte macro. [Hugh]
> - Mention the new shmem support feature in documentation. [Hugh]

I shall ignore this v3 completely: "git send-email" is a wonderful
tool for mailing out patchsets in quick succession, but I have not
yet mastered "git send-review" to do the thinking for me as quickly.

Still deliberating on 4/9 and 9/9 of v2: they're very close,
but raise userfaultfd questions I still have to answer myself.

Hugh


Re: [PATCH v2 3/9] userfaultfd/shmem: support minor fault registration for shmem

2021-04-14 Thread Hugh Dickins
On Mon, 12 Apr 2021, Axel Rasmussen wrote:

> This patch allows shmem-backed VMAs to be registered for minor faults.
> Minor faults are appropriately relayed to userspace in the fault path,
> for VMAs with the relevant flag.
> 
> This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
> minor faults, though, so userspace doesn't yet have a way to resolve
> such faults.

This is a very odd way to divide up the series: an "Intermission"
half way through the implementation of MINOR/CONTINUE: this 3/9
makes little sense without the 4/9 to mm/userfaultfd.c which follows.

But, having said that, I won't object and Peter did not object, and
I don't know of anyone else looking here: it will only give each of
us more trouble to insist on repartitioning the series, and it's the
end state that's far more important to me and to all of us.

And I'll even seize on it, to give myself an intermission after
this one, until tomorrow (when I'll look at 4/9 and 9/9 - but
shall not look at the selftests ones at all).

Most of this is okay, except the mm/shmem.c part; and I've just now
realized that somewhere (whether in this patch or separately) there
needs to be an update to Documentation/admin-guide/mm/userfaultfd.rst
(admin-guide? how weird, but not this series' business to correct).

> 
> Signed-off-by: Axel Rasmussen 
> ---
>  fs/userfaultfd.c |  6 +++---
>  include/uapi/linux/userfaultfd.h |  7 ++-
>  mm/memory.c  |  8 +---
>  mm/shmem.c   | 10 +-
>  4 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 14f92285d04f..9f3b8684cf3c 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(struct 
> vm_area_struct *vma,
>   }
>  
>   if (vm_flags & VM_UFFD_MINOR) {
> - /* FIXME: Add minor fault interception for shmem. */
> - if (!is_vm_hugetlb_page(vma))
> + if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
>   return false;
>   }
>  
> @@ -1941,7 +1940,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>   /* report all available features and ioctls to userland */
>   uffdio_api.features = UFFD_API_FEATURES;
>  #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> - uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
> + uffdio_api.features &=
> + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
>  #endif
>   uffdio_api.ioctls = UFFD_API_IOCTLS;
>   ret = -EFAULT;
> diff --git a/include/uapi/linux/userfaultfd.h 
> b/include/uapi/linux/userfaultfd.h
> index bafbeb1a2624..159a74e9564f 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -31,7 +31,8 @@
>  UFFD_FEATURE_MISSING_SHMEM | \
>  UFFD_FEATURE_SIGBUS |\
>  UFFD_FEATURE_THREAD_ID | \
> -UFFD_FEATURE_MINOR_HUGETLBFS)
> +UFFD_FEATURE_MINOR_HUGETLBFS |   \
> +UFFD_FEATURE_MINOR_SHMEM)
>  #define UFFD_API_IOCTLS  \
>   ((__u64)1 << _UFFDIO_REGISTER | \
>(__u64)1 << _UFFDIO_UNREGISTER |   \
> @@ -185,6 +186,9 @@ struct uffdio_api {
>* UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults
>* can be intercepted (via REGISTER_MODE_MINOR) for
>* hugetlbfs-backed pages.
> +  *
> +  * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
> +  * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
>*/
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP   (1<<0)
>  #define UFFD_FEATURE_EVENT_FORK  (1<<1)
> @@ -196,6 +200,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_SIGBUS  (1<<7)
>  #define UFFD_FEATURE_THREAD_ID   (1<<8)
>  #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
> +#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
>   __u64 features;
>  
>   __u64 ioctls;
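
For context, userspace would probe for the new bit with the usual UFFDIO_API handshake before registering a shmem range in MINOR mode; roughly like the sketch below (illustrative, error handling trimmed, not part of the patch):

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API, .features = 0 };

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
            return 1;

        if (api.features & UFFD_FEATURE_MINOR_SHMEM)
            puts("minor faults on shmem supported");
        /* then register a shmem range with UFFDIO_REGISTER_MODE_MINOR */
        return 0;
    }
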
> diff --git a/mm/memory.c b/mm/memory.c
> index 4e358601c5d6..cc71a445c76c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3972,9 +3972,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
>* something).
>*/
>   if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
> - ret = do_fault_around(vmf);
> - if (ret)
> - return ret;
> + if (likely(!userfaultfd_minor(vmf->vma))) {
> + ret = do_fault_around(vmf);
> + if (ret)
> + return ret;
> + }
>   }
>  
>   ret = __do_fault(vmf);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b72c55aa07fc..3f48cb5e8404 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1785,7 

Re: [PATCH v2 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte

2021-04-14 Thread Hugh Dickins
On Mon, 12 Apr 2021, Axel Rasmussen wrote:

> Previously, we did a dance where we had one calling path in
> userfaultfd.c (mfill_atomic_pte), but then we split it into two in
> shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined
> into a single shared function in shmem.c (shmem_mfill_atomic_pte).
> 
> This is all a bit overly complex. Just call the single combined shmem
> function directly, allowing us to clean up various branches,
> boilerplate, etc.
> 
> While we're touching this function, two other small cleanup changes:
> - offset is equivalent to pgoff, so we can get rid of offset entirely.
> - Split two VM_BUG_ON cases into two statements. This means the line
>   number reported when the BUG is hit specifies exactly which condition
>   was true.
> 
> Reviewed-by: Peter Xu 
> Signed-off-by: Axel Rasmussen 

Acked-by: Hugh Dickins 
though you've dropped one minor fix I did like, see below...

> ---
>  include/linux/shmem_fs.h | 15 +---
>  mm/shmem.c   | 52 +---
>  mm/userfaultfd.c | 10 +++-
>  3 files changed, 25 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index d82b6f396588..919e36671fe6 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -122,21 +122,18 @@ static inline bool shmem_file(struct file *file)
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>  
> +#ifdef CONFIG_USERFAULTFD
>  #ifdef CONFIG_SHMEM
>  extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> struct vm_area_struct *dst_vma,
> unsigned long dst_addr,
> unsigned long src_addr,
> +   bool zeropage,
> struct page **pagep);
> -extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
> - pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr);
> -#else
> +#else /* !CONFIG_SHMEM */
>  #define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \

In a previous version, you quietly corrected that "dst_pte" to "dst_pmd":
of course it makes no difference to the code generated, but it was a good
correction, helping to prevent confusion.

> -src_addr, pagep)({ BUG(); 0; })
> -#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \
> -  dst_addr)  ({ BUG(); 0; })
> -#endif
> +src_addr, zeropage, pagep)   ({ BUG(); 0; })
> +#endif /* CONFIG_SHMEM */
> +#endif /* CONFIG_USERFAULTFD */
>  
>  #endif
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 26c76b13ad23..b72c55aa07fc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2354,13 +2354,14 @@ static struct inode *shmem_get_inode(struct 
> super_block *sb, const struct inode
>   return inode;
>  }
>  
> -static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
> -   pmd_t *dst_pmd,
> -   struct vm_area_struct *dst_vma,
> -   unsigned long dst_addr,
> -   unsigned long src_addr,
> -   bool zeropage,
> -   struct page **pagep)
> +#ifdef CONFIG_USERFAULTFD
> +int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
> +pmd_t *dst_pmd,
> +struct vm_area_struct *dst_vma,
> +unsigned long dst_addr,
> +unsigned long src_addr,
> +bool zeropage,
> +struct page **pagep)
>  {
>   struct inode *inode = file_inode(dst_vma->vm_file);
>   struct shmem_inode_info *info = SHMEM_I(inode);
> @@ -2372,7 +2373,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct 
> *dst_mm,
>   struct page *page;
>   pte_t _dst_pte, *dst_pte;
>   int ret;
> - pgoff_t offset, max_off;
> + pgoff_t max_off;
>  
>   ret = -ENOMEM;
>   if (!shmem_inode_acct_block(inode, 1))
> @@ -2383,7 +2384,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct 
> *dst_mm,
>   if (!page)
>   goto out_unacct_blocks;
>  
> - if (!zeropage) {/* mcopy_atomic */
> + if (!zeropage) {/* COPY */
>   page_kaddr = kmap_atomic(page);
&

Re: [PATCH v2 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h

2021-04-14 Thread Hugh Dickins
On Mon, 12 Apr 2021, Axel Rasmussen wrote:

> Minimizing header file inclusion is desirable. In this case, we can do
> so just by forward declaring the enumeration our signature relies upon.
> 
> Reviewed-by: Peter Xu 
> Signed-off-by: Axel Rasmussen 
> ---
>  include/linux/hugetlb.h | 4 +++-
>  mm/hugetlb.c| 1 +
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 09f1fd12a6fa..3f47650ab79b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -11,7 +11,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  struct ctl_table;
>  struct user_struct;
> @@ -135,6 +134,8 @@ void hugetlb_show_meminfo(void);
>  unsigned long hugetlb_total_pages(void);
>  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   unsigned long address, unsigned int flags);
> +
> +enum mcopy_atomic_mode;

Wrongly placed: the CONFIG_USERFAULTFD=y CONFIG_HUGETLB_PAGE=n build
fails. Better place it up above with struct ctl_table etc.
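
i.e. a placement along these lines near the top of include/linux/hugetlb.h, outside any #ifdef (a sketch of the requested move, not the final hunk):

    struct ctl_table;
    struct user_struct;
    enum mcopy_atomic_mode;     /* needed by the hugetlb_mcopy_atomic_pte() prototype */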

>  #ifdef CONFIG_USERFAULTFD
>  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
>   struct vm_area_struct *dst_vma,
> @@ -143,6 +144,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, 
> pte_t *dst_pte,
>   enum mcopy_atomic_mode mode,
>   struct page **pagep);
>  #endif /* CONFIG_USERFAULTFD */
> +
>  bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
>   struct vm_area_struct *vma,
>   vm_flags_t vm_flags);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 54d81d5947ed..b1652e747318 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  int hugetlb_max_hstate __read_mostly;
> -- 
> 2.31.1.295.g9ea45b61b8-goog
> 
> 


Re: linux-next: Tree for Apr 9 (x86 boot problem)

2021-04-13 Thread Hugh Dickins
On Tue, 13 Apr 2021, Mike Rapoport wrote:
> 
> I think I've found the reason. trim_snb_memory() reserved the entire first
> megabyte very early leaving no room for real mode trampoline allocation.
> Since this reservation is needed only to make sure integrated gfx does not
> access some memory, it can be safely done after memblock allocations are
> possible.
> 
> I don't know if it can be fixed on the graphics device driver side, but
> from the setup_arch() perspective I think this would be the proper fix:
> 
> From c05f6046137abbcbb700571ce1ac54e7abb56a7d Mon Sep 17 00:00:00 2001
> From: Mike Rapoport 
> Date: Tue, 13 Apr 2021 21:08:39 +0300
> Subject: [PATCH] x86/setup: move trim_snb_memory() later in setup_arch to fix
>  boot hangs
> 
> Commit a799c2bd29d1 ("x86/setup: Consolidate early memory reservations")
> moved reservation of the memory inaccessible by Sandy Bridge integrated
> graphics very early and as the result on systems with such devices the
> first 1M was reserved by trim_snb_memory() which prevented the allocation
> of the real mode trampoline and made the boot hang very early.
> 
> Since the purpose of trim_snb_memory() is to prevent problematic pages ever
> reaching the graphics device, it is safe to reserve these pages after
> memblock allocations are possible.
> 
> Move trim_snb_memory later in boot so that it will be called after
> reserve_real_mode() and make comments describing trim_snb_memory()
> operation more elaborate.
> 
> Fixes: a799c2bd29d1 ("x86/setup: Consolidate early memory reservations")
> Reported-by: Randy Dunlap 
> Signed-off-by: Mike Rapoport 

Tested-by: Hugh Dickins 

Thanks Mike and Randy. ThinkPad T420s here. I didn't notice this thread
until this morning, but had been investigating bootup panic on mmotm
yesterday. I was more fortunate than Randy, in getting some console
output which soon led to a799c2bd29d1 without bisection. Expected
to go through it line by line today, but you've saved me - thanks.

> ---
>  arch/x86/kernel/setup.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 59e5e0903b0c..ccdcfb19df1e 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -633,11 +633,16 @@ static void __init trim_snb_memory(void)
>   printk(KERN_DEBUG "reserving inaccessible SNB gfx pages\n");
>  
>   /*
> -  * Reserve all memory below the 1 MB mark that has not
> -  * already been reserved.
> +  * SandyBridge integrated graphic devices have a bug that prevents
> +  * them from accessing certain memory ranges, namely anything below
> +  * 1M and in the pages listed in the bad_pages.
> +  *
> +  * To avoid these pages being ever accessed by SNB gfx device
> +  * reserve all memory below the 1 MB mark and bad_pages that have
> +  * not already been reserved at boot time.
>*/
>   memblock_reserve(0, 1<<20);
> - 
> +
>   for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
>   if (memblock_reserve(bad_pages[i], PAGE_SIZE))
>   printk(KERN_WARNING "failed to reserve 0x%08lx\n",
> @@ -746,8 +751,6 @@ static void __init early_reserve_memory(void)
>  
>   reserve_ibft_region();
>   reserve_bios_regions();
> -
> - trim_snb_memory();
>  }
>  
>  /*
> @@ -1083,6 +1086,13 @@ void __init setup_arch(char **cmdline_p)
>  
>   reserve_real_mode();
>  
> + /*
> +  * Reserving memory causing GPU hangs on Sandy Bridge integrated
> +  * graphic devices should be done after we allocated memory under
> +  * 1M for the real mode trampoline
> +  */
> + trim_snb_memory();
> +
>   init_mem_mapping();
>  
>   idt_setup_early_pf();
> -- 
> 2.28.0


Re: [PATCH v4] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior

2021-04-12 Thread Hugh Dickins
On Mon, 12 Apr 2021, Peter Xu wrote:
> On Tue, Apr 06, 2021 at 11:14:30PM -0700, Hugh Dickins wrote:
> > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t 
> > > *dst_pmd,
> > > +  struct vm_area_struct *dst_vma,
> > > +  unsigned long dst_addr, struct page *page,
> > > +  enum mcopy_atomic_mode mode, bool wp_copy)
> > > +{
> 
> [...]
> 
> > > + if (writable) {
> > > + _dst_pte = pte_mkdirty(_dst_pte);
> > > + if (wp_copy)
> > > + _dst_pte = pte_mkuffd_wp(_dst_pte);
> > > + else
> > > + _dst_pte = pte_mkwrite(_dst_pte);
> > > + } else if (vm_shared) {
> > > + /*
> > > +  * Since we didn't pte_mkdirty(), mark the page dirty or it
> > > +  * could be freed from under us. We could do this
> > > +  * unconditionally, but doing it only if !writable is faster.
> > > +  */
> > > + set_page_dirty(page);
> > 
> > I do not remember why Andrea or I preferred set_page_dirty() here to
> > pte_mkdirty(); but I suppose there might somewhere be a BUG_ON(pte_dirty)
> > which this would avoid.  Risky to change it, though it does look odd.
> 
> Is any of the possible BUG_ON(pte_dirty) going to trigger because the pte has
> write bit cleared?  That's one question I was not very sure, e.g., whether one
> pte is allowed to be "dirty" if it's not writable.
> 
> To me it's okay, it's actually very suitable for UFFDIO_COPY case, where it is
> definitely dirty data (so we must never drop it) even if it's installed as RO,
> however to achieve that we can still set the dirty on the page rather than the
> pte as what we do here.  It's just a bit awkward as you said.
> 
> Meanwhile today I just noticed this in arm64 code:
> 
> static inline pte_t pte_wrprotect(pte_t pte)
> {
>   /*
>* If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY
>* clear), set the PTE_DIRTY bit.
>*/
>   if (pte_hw_dirty(pte))
>   pte = pte_mkdirty(pte);
> 
>   pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
>   pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
>   return pte;
> }
> 
> So arm64 will explicitly set the dirty bit (from the HW dirty bit) when
> wr-protect.  It seems to prove that at least for arm64 it's very valid to have
> !write && dirty pte.

I did not mean to imply that it's wrong to have pte_dirty without
pte_write: no, I agree with you, I believe that there are accepted
and generic ways in which we can have pte_dirty without pte_write
(and we could each probably add a warning somewhere which would
very quickly prove that - but those would not prove that there
are not BUG_ONs on some other path, which had been my fear).

I wanted now to demonstrate that by pointing to change_pte_range() in
mm/mprotect.c, showing that it does not clear pte_dirty when it clears
pte_write. But alarmingly found rather the reverse: that it appears to
set pte_write when it finds pte_dirty - if dirty_accountable.

That looks very wrong, but if I spent long enough following up
dirty_accountable in detail, I think I would be reassured to find that
it is only adding the pte_write there when it had removed it from the
prot passed down, for dirty accounting reasons (which apply !VM_SHARED
protections in the VM_SHARED case, so that page_mkwrite() is called
and dirty accounting done when necessary).

What I did mean to imply is that changing set_page_dirty to pte_mkdirty,
to make that userfaultfd code block look nicer, is not a change to be
done lightly: by all means try it out, test it, and send a patch after
Axel's series is in, but please do not ask Axel to make that change as
a part of his series - it would be taking a risk, just for a cleanup.

Now, I have also looked up the mail exchange with Andrea which led to
his dcf7fe9d8976 ("userfaultfd: shmem: UFFDIO_COPY: set the page dirty
if VM_WRITE is not set") - it had to be off-list at the time.  And he
was rather led to that set_page_dirty by following old patterns left
over in shmem_getpage_gfp(); but when I said "or it could be done with
pte_mkdirty without pte_mkwrite", he answered "I explicitly avoided
that because pte_dirty then has side effects on mprotect to decide
pte_write. It looks safer to do set_page_dirty and not set dirty bits
in not writable ptes unnecessarily".

Haha: I think Andrea is referring to exactly the dirty_accountable code
in change_pte_range() which worried me above. Now, I think that
will turn out okay (shmem does not have a page_mkwrite(), and does not
participate in dirty accounting), but you will have to do some work to
assure us all of that, before sending in a cleanup patch.
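
For concreteness, the deferred cleanup being discussed would make mcopy_atomic_install_pte() look roughly like this (a sketch of the idea only, not a patch; as said above, it first needs the mprotect dirty-accounting audit):

    _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
    _dst_pte = pte_mkdirty(_dst_pte);   /* UFFDIO_COPY data is dirty even when mapped read-only */
    if (writable) {
        if (wp_copy)
            _dst_pte = pte_mkuffd_wp(_dst_pte);
        else
            _dst_pte = pte_mkwrite(_dst_pte);
    }
    /* and no set_page_dirty() in the !writable, shared case: the dirt travels in the pte */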

Hugh


Re: [PATCH v5] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior

2021-04-08 Thread Hugh Dickins
On Thu, 8 Apr 2021, Axel Rasmussen wrote:
> On Tue, Apr 6, 2021 at 4:49 PM Peter Xu  wrote:
> > On Mon, Apr 05, 2021 at 10:19:17AM -0700, Axel Rasmussen wrote:
...
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
...
> > > +
> > > + if (is_file_backed) {
> > > + /* The shmem MAP_PRIVATE case requires checking the i_size 
> > > */
> > > + inode = dst_vma->vm_file->f_inode;
> > > + offset = linear_page_index(dst_vma, dst_addr);
> > > + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > + ret = -EFAULT;
> > > + if (unlikely(offset >= max_off))
> > > + goto out_unlock;
> >
> > Frankly I don't really know why this must be put into pgtable lock..  Since 
> > if
> > not required then it can be moved into UFFDIO_COPY path, as CONTINUE doesn't
> > need it iiuc.  Just raise it up as a pure question.
> 
> It's not clear to me either. shmem_getpage_gfp() does check this twice
> kinda like we're doing, but it doesn't ever touch the PTL. What it
> seems to be worried about is, what happens if a concurrent
> FALLOC_FL_PUNCH_HOLE happens somewhere in the middle of whatever
> manipulation we're doing? From looking at shmem_fallocate(), I think
> the basic point is that truncation happens while "inode_lock(inode)"
> is held, but neither shmem_mcopy_atomic_pte() or the new
> mcopy_atomic_install_ptes() take that lock.
> 
> I'm a bit hesitant to just remove it, run some tests, and then declare
> victory, because it seems plausible it's there to catch some
> semi-hard-to-induce race. I'm not sure how to prove that *isn't*
> needed, so my inclination is to just keep it?
> 
> I'll send a series addressing the feedback so far this afternoon, and
> I'll leave this alone for now - at least, it doesn't seem to hurt
> anything. Maybe Hugh or someone else has some more advice about it. If
> so, I'm happy to remove it in a follow-up.

It takes some thinking about, but the i_size check is required to be
under the pagetable lock, for the MAP_PRIVATE UFFDIO_COPY path, where
it is inserting an anonymous page into the file-backed vma (skipping
actually inserting a page into page cache, as an ordinary fault would).

Not because of FALLOC_FL_PUNCH_HOLE (which makes no change to i_size;
and it's okay if a race fills in the hole immediately afterwards),
but because of truncation (which must remove all beyond i_size).

In the MAP_SHARED case, with a locked page inserted into page cache,
the page lock is enough to exclude concurrent truncation.  But even
in that case the second i_size check (I'm looking at 5.12-rc's
shmem_mfill_atomic_pte(), rather than recent patches which might differ)
is required: because the first i_size check was done before the page
became visible in page cache, so a concurrent truncation could miss it.

Maybe that first check is redundant, though I'm definitely for doing it;
or maybe shmem_add_to_page_cache() would be better if it made that check
itself, under xas_lock (I think the reason it does not is historical).
The second check, in the MAP_SHARED case, does not need to be under
pagetable lock - the page lock on the page cache page is enough -
but probably Andrea placed it there to resemble the anonymous case.

You might then question, how come there is no i_size check in all of
mm/memory.c, where ordinary faulting is handled.  I'll answer that
the pte_same() check, under pagetable lock in wp_page_copy(), is
where the equivalent to userfaultfd's MAP_PRIVATE UFFDIO_COPY check
is made: if the page cache page has already been truncated, that pte
will have been cleared.

Or, if the page cache page is truncated an instant after wp_page_copy()
drops page table lock, then the unmap_mapping_range(,,, even_cows = 1)
which follows truncation has to clean it up.  Er, does that mean that
the i_size check I started off insisting is required, actually is not
required?  Um, maybe, but let's just keep it and say goodnight!
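
Spelling out the MAP_PRIVATE UFFDIO_COPY ordering as a sketch (this paraphrases the shmem_mcopy_atomic_pte() code quoted elsewhere in these threads and the reasoning above; it is not new code):

    dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);

    /*
     * Check i_size only now, under the pagetable lock: either a racing
     * truncation has already swept this page table with
     * unmap_mapping_range(), in which case the reduced i_size is visible
     * here and we back out, or it will take this lock after us and clear
     * the pte we are about to install.  Checking earlier, before taking
     * the lock, could miss a truncation that runs in between (though see
     * above for whether the trailing unmap would clean that up anyway).
     */
    ret = -EFAULT;
    max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
    if (unlikely(pgoff >= max_off))
        goto out_release_unlock;

    ret = -EEXIST;
    if (!pte_none(*dst_pte))
        goto out_release_unlock;

    set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);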

Hugh


Re: [PATCH v4] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior

2021-04-07 Thread Hugh Dickins
On Wed, 7 Apr 2021, Axel Rasmussen wrote:
> Agreed about taking one direction or the other further.
> 
> I get the sense that Peter prefers the mcopy_atomic_install_ptes()
> version, and would thus prefer to just expose that and let
> shmem_mcopy_atomic_pte() use it.
> 
> But, I get the sense that you (Hugh) slightly prefer the other way -
> just letting shmem_mcopy_atomic_pte() deal with both the VM_SHARED and
> !VM_SHARED cases.

No, either direction seems plausible to me: start from whichever
end you prefer.

> 
> I was planning to write "I prefer option X because (reasons), and
> objections?" but I'm realizing that it isn't really clear to me which
> route would end up being cleaner. I think I have to just pick one,
> write it out, and see where I end up. If it ends up gross, I don't
> mind backtracking and taking the other route. :) To that end, I'll
> proceed by having shmem_mcopy_atomic_pte() call the new
> mcopy_atomic_install_ptes() helper, and see how it looks (unless there
> are objections).

I am pleased to read that: it's exactly how I would approach it -
so it must be right :-)

Hugh


Re: [PATCH v4] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior

2021-04-07 Thread Hugh Dickins
[PATCH v4] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior
was a significant rework, so here I'm reviewing a synthetic patch
merged from 5.12-rc5's 2021-03-31 mmotm patches:
  userfaultfd-support-minor-fault-handling-for-shmem.patch
  userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
  userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
Plus the PATCH v4 which akpm added the next day as fix-3:
  userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch

[PATCH v5] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior
was the same as v4, except for adding a change in selftests, which
would not apply at this stage of the series: so I've ignored it.

>  fs/userfaultfd.c |6 
>  include/linux/shmem_fs.h |   26 +--
>  include/uapi/linux/userfaultfd.h |4 
>  mm/memory.c  |8 -
>  mm/shmem.c   |   65 +++--
>  mm/userfaultfd.c |  192 -
>  6 files changed, 186 insertions(+), 115 deletions(-)
> 
> diff -purN 5125m243/fs/userfaultfd.c 5125m247/fs/userfaultfd.c
> --- 5125m243/fs/userfaultfd.c 2021-04-04 22:32:32.018244547 -0700
> +++ 5125m247/fs/userfaultfd.c 2021-04-04 22:34:14.946860343 -0700
> @@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(str
>   }
>  
>   if (vm_flags & VM_UFFD_MINOR) {
> - /* FIXME: Add minor fault interception for shmem. */
> - if (!is_vm_hugetlb_page(vma))
> + if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
>   return false;
>   }
>  
> @@ -1941,7 +1940,8 @@ static int userfaultfd_api(struct userfa
>   /* report all available features and ioctls to userland */
>   uffdio_api.features = UFFD_API_FEATURES;
>  #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> - uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
> + uffdio_api.features &=
> + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
>  #endif
>   uffdio_api.ioctls = UFFD_API_IOCTLS;
>   ret = -EFAULT;
> diff -purN 5125m243/include/linux/shmem_fs.h 5125m247/include/linux/shmem_fs.h
> --- 5125m243/include/linux/shmem_fs.h 2021-02-14 14:32:24.0 -0800
> +++ 5125m247/include/linux/shmem_fs.h 2021-04-04 22:34:14.958860415 -0700
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

I'd much rather not include userfaultfd_k.h in shmem_fs.h, and go back
to including it in mm/shmem.c: it's better to minimize everyone's header
file inclusion, where reasonably possible.  A small change below for that.

I advise the same for include/linux/hugetlb.h and mm/hugetlb.c,
but those are outside the scope of this userfaultfd/shmem patch.

>  
>  /* inode in-kernel data */
>  
> @@ -122,21 +123,16 @@ static inline bool shmem_file(struct fil
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>  
> +#ifdef CONFIG_USERFAULTFD
>  #ifdef CONFIG_SHMEM
> -extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> -   struct vm_area_struct *dst_vma,
> -   unsigned long dst_addr,
> -   unsigned long src_addr,
> -   struct page **pagep);
> -extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
> - pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr);
> -#else
> -#define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
> -src_addr, pagep)({ BUG(); 0; })
> -#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \
> -  dst_addr)  ({ BUG(); 0; })
> -#endif

Please add
enum mcopy_atomic_mode;
here, so the compiler can understand it without needing userfaultfd_k.h.

> +int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> +struct vm_area_struct *dst_vma,
> +unsigned long dst_addr, unsigned long src_addr,
> +enum mcopy_atomic_mode mode, struct page **pagep);
> +#else /* !CONFIG_SHMEM */
> +#define shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
> +src_addr, mode, pagep)({ BUG(); 0; })
> +#endif /* CONFIG_SHMEM */
> +#endif /* CONFIG_USERFAULTFD */
>  
>  #endif
> diff -purN 5125m243/include/uapi/linux/userfaultfd.h 
> 5125m247/include/uapi/linux/userfaultfd.h
> --- 5125m243/include/uapi/linux/userfaultfd.h 2021-04-04 22:32:32.042244690 
> -0700
> +++ 5125m247/include/uapi/linux/userfaultfd.h 2021-04-04 22:34:14.962860439 
> -0700
> @@ -31,7 +31,8 @@
>  UFFD_FEATURE_MISSING_SHMEM | \
>  UFFD_FEATURE_SIGBUS |\
>  

Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-04-02 Thread Hugh Dickins
On Fri, 2 Apr 2021, Hugh Dickins wrote:
> 
> There is a "Put holes back where they were" xas_store(&xas, NULL) on
> the failure path, which I think we would expect to delete empty nodes.
> But it only goes as far as nr_none.  Is it ok to xas_store(&xas, NULL)
> where there was no non-NULL entry before?  I should try that, maybe
> adjusting the !nr_none break will give a very simple fix.

No, XArray did not like that:
xas_update() XA_NODE_BUG_ON(node, !list_empty(&node->private_list)).

But also it's the wrong thing for collapse_file() to do, from a file
integrity point of view. So far as there is a non-NULL page in the list,
or nr_none is non-zero, those subpages are frozen at the src end, and
THP head locked and not Uptodate at the dst end. But go beyond nr_none,
and a racing task could be adding new pages, which THP collapse failure
has no right to delete behind its back.

Not an issue for READ_ONLY_THP_FOR_FS, but important for shmem and future.

> 
> Or, if you remove the "static " from xas_trim(), maybe that provides
> the xas_prune_range() you proposed, or the cleanup pass I proposed.
> To be called on collapse_file() failure, or when eviction finds
> !mapping_empty().

Something like this I think.

Hugh


Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-04-02 Thread Hugh Dickins
On Fri, 2 Apr 2021, Matthew Wilcox wrote:

> OK, more competent testing, and that previous bug now detected and fixed.
> I have a reasonable amount of confidence this will solve your problem.
> If you do apply this patch, don't enable CONFIG_TEST_XARRAY as the new
> tests assume that attempting to allocate with a GFP flags of 0 will
> definitely fail, which is true for my userspace allocator, but not true
> inside the kernel.  I'll add some ifdeffery to skip these tests inside
> the kernel, as without a way to deterministically fail allocation,
> there's no way to test this code properly.

Thanks a lot for all your efforts on this, but the news from the front
is disappointing.  The lib/xarray.c you sent here is yesterday's plus
the little __xas_trim() fixup you sent this morning: I set that going
then on three machines, two of them are still good, but one is not (and
yes, I've checked several times that it is the intended kernel running).
xa_dump()s appended below, but I don't expect them to have more to tell.

I think you've been focusing on the old radix-tree -ENOMEM case, which
you'd wanted to clean up anyway, but overlooking the THP collapse_file()
case, which is the one actually hitting me.  collapse_file() does that
xas_create_range(), which Doc tells me will create all the nodes which
might be needed; and if collapse_file() has to give up and revert for
any of many plausible reasons, those nodes may be left over at the end.

There is a "Put holes back where they were" xas_store(&xas, NULL) on
the failure path, which I think we would expect to delete empty nodes.
But it only goes as far as nr_none.  Is it ok to xas_store(&xas, NULL)
where there was no non-NULL entry before?  I should try that, maybe
adjusting the !nr_none break will give a very simple fix.

Or, if you remove the "static " from xas_trim(), maybe that provides
the xas_prune_range() you proposed, or the cleanup pass I proposed.
To be called on collapse_file() failure, or when eviction finds
!mapping_empty().
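
As a reference point for the xas_store(&xas, NULL) question: with the ordinary xa_* API, storing NULL is specified to behave like an erase, and an XArray that only ever held NULLs is expected to report empty; the open question is whether the advanced-API path prunes the nodes that xas_create_range() pre-allocated. A kernel-context sketch of that expectation (illustrative only, not a test from this thread):

    #include <linux/bug.h>
    #include <linux/gfp.h>
    #include <linux/xarray.h>

    static DEFINE_XARRAY(demo_xa);

    static void xa_null_store_demo(void)
    {
        /* populate one slot in the same 1344-1407 range seen in the dumps below */
        xa_store(&demo_xa, 1344, xa_mk_value(1), GFP_KERNEL);

        /* storing NULL is specified to act like xa_erase() ... */
        xa_store(&demo_xa, 1344, NULL, GFP_KERNEL);

        /* ... and the now-empty node is expected to be freed, not leaked */
        WARN_ON(xa_load(&demo_xa, 1344) != NULL);
        WARN_ON(!xa_empty(&demo_xa));
    }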

[ 2927.151739] xarray: 888017914c80 head 888003a10db2 flags 21 marks 0 
0 0
[ 2927.171484] 0-4095: node 888003a10db0 max 0 parent  
shift 6 count 3 values 0 array 888017914c80 list 888003a10dc8 
888003a10dc8 marks 0 0 0
[ 2927.213313] 1344-1407: node 8880055c8490 offset 21 parent 
888003a10db0 shift 0 count 0 values 0 array 888017914c80 list 
8880055c84a8 8880055c84a8 marks 0 0 0
[ 2927.257924] 1408-1471: node 8880055c8248 offset 22 parent 
888003a10db0 shift 0 count 0 values 0 array 888017914c80 list 
8880055c8260 8880055c8260 marks 0 0 0
[ 2927.305332] 1472-1535: node 8880055c8000 offset 23 parent 
888003a10db0 shift 0 count 0 values 0 array 888017914c80 list 
8880055c8018 8880055c8018 marks 0 0 0
[ 2927.355811] s_dev 8:8 i_ino 274355 i_size 10092280

[ 3813.689018] xarray: 888005511408 head 888017624db2 flags 21 marks 0 
0 0
[ 3813.716012] 0-4095: node 888017624db0 max 2 parent  
shift 6 count 3 values 0 array 888005511408 list 888017624dc8 
888017624dc8 marks 0 0 0
[ 3813.771966] 1344-1407: node 888000595b60 offset 21 parent 
888017624db0 shift 0 count 0 values 0 array 888005511408 list 
888000595b78 888000595b78 marks 0 0 0
[ 3813.828102] 1408-1471: node 888000594b68 offset 22 parent 
888017624db0 shift 0 count 0 values 0 array 888005511408 list 
888000594b80 888000594b80 marks 0 0 0
[ 3813.883603] 1472-1535: node 888000594248 offset 23 parent 
888017624db0 shift 0 count 0 values 0 array 888005511408 list 
888000594260 888000594260 marks 0 0 0
[ 3813.939146] s_dev 8:8 i_ino 274355 i_size 10092280

[14157.780505] xarray: 888007c8d988 head 88800bccfd9a flags 21 marks 0 
0 0
[14157.801557] 0-4095: node 88800bccfd98 max 7 parent  
shift 6 count 2 values 0 array 888007c8d988 list 88800bccfdb0 
88800bccfdb0 marks 0 0 0
[14157.845337] 896-959: node 8880279fdda8 offset 14 parent 88800bccfd98 
shift 0 count 0 values 0 array 888007c8d988 list 8880279fddc0 
8880279fddc0 marks 0 0 0
[14157.893594] 960-1023: node 8880279fe238 offset 15 parent 
88800bccfd98 shift 0 count 0 values 0 array 888007c8d988 list 
8880279fe250 8880279fe250 marks 0 0 0
[14157.943810] s_dev 8:8 i_ino 274355 i_size 10092280

Hugh


Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-03-31 Thread Hugh Dickins
On Wed, 31 Mar 2021, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 06:30:22PM -0700, Hugh Dickins wrote:
> > Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
> > mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
> > I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
> > machines, within an hour or few, repeatably though not to order.
> > 
> > The stack backtrace has always been clear_inode < ext4_clear_inode <
> > ext4_evict_inode < evict < dispose_list < prune_icache_sb <
> > super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
> > shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.
> > 
> > ext4 is the disk filesystem I read the source to build from, and also
> > the filesystem I use on a loop device on a tmpfs file: I have not tried
> > with other filesystems, nor checked whether perhaps it happens always on
> > the loop one or always on the disk one.  I have not seen it happen with
> > tmpfs - probably because its inodes cannot be evicted by the shrinker
> > anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
> > inodes (but suspect that may be down to timing, or less pressure).
> > I doubt it's a matter of filesystem: think it's an XArray thing.
> > 
> > Whenever I've looked at the XArray nodes involved, the root node
> > (shift 6) contained one or three (adjacent) pointers to empty shift
> > 0 nodes, which each had offset and parent and array correctly set.
> > Is there some way in which empty nodes can get left behind, and so
> > fail eviction's mapping_empty() check?
> 
> There isn't _supposed_ to be.  The XArray is supposed to delete nodes
> whenever the ->count reaches zero.  It might give me a clue if you could
> share a dump of the tree, if you still have that handy.

Very useful suggestion: the xa_dump() may not give you more of a clue,
but just running again last night to gather that info has revealed more.

> 
> > I did wonder whether some might get left behind if xas_alloc() fails
> > (though probably the tree here is too shallow to show that).  Printks
> > showed that occasionally xas_alloc() did fail while testing (maybe at
> > memcg limit), but there was no correlation with the BUG_ONs.
> 
> This is a problem inherited from the radix tree, and I really want to
> justify fixing it ... I think I may have enough infrastructure in place
> to do it now (as part of the xas_split() commit we can now allocate
> multiple xa_nodes in xas->xa_alloc).  But you're right; if we allocated
> all the way down to an order-0 node, then this isn't the bug.
> 
> Were you using the ALLOW_ERROR_INJECTION feature on
> __add_to_page_cache_locked()?  I haven't looked into how that works,
> and maybe that could leave us in an inconsistent state.

No, no error injection: not something I've ever looked at either.

> 
> > I did wonder whether this is a long-standing issue, which your new
> > BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
> > a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
> > nrexceptional BUG_ONs.  The result there surprised me: I expected
> > it to behave the same way, but it hits that BUG_ON in a minute or
> > so, instead of an hour or so.  Was there a fix you made somewhere,
> > to avoid the BUG_ON(!mapping_empty) most of the time? but needs
> > more work. I looked around a little, but didn't find any.
> 
> I didn't make a fix for this issue; indeed I haven't observed it myself.

That was interesting to me last night, but not so interesting now
we have more info (below).

> It seems like cgroups are a good way to induce allocation failures, so
> I should play around with that a bit.  The userspace test-suite has a
> relatively malicious allocator that will fail every allocation not marked
> as GFP_KERNEL, so it always exercises the fallback path for GFP_NOWAIT,
> but then it will always succeed eventually.
> 
> > I had hoped to work this out myself, and save us both some writing:
> > but better hand over to you, in the hope that you'll quickly guess
> > what's up, then I can try patches. I do like the no-nrexceptionals
> > series, but there's something still to be fixed.
> 
> Agreed.  It seems like it's unmasking a bug that already existed, so
> it's not an argument for dropping the series, but we should fix the bug
> so we don't crash people's machines.
> 
> Arguably, the condition being checked for is not serious enough for a
> BUG_ON.  A WARN_ON, yes, and dump the tree for later perusal, but it's
> just a memory leak, and not (I think?) likely to lead to later memory

Re: [PATCH mmotm] mm: vmscan: fix shrinker_rwsem in free_shrinker_info()

2021-03-31 Thread Hugh Dickins
On Wed, 31 Mar 2021, Yang Shi wrote:
> On Wed, Mar 31, 2021 at 6:54 AM Shakeel Butt  wrote:
> > On Tue, Mar 30, 2021 at 4:44 PM Hugh Dickins  wrote:
> > >
> > > Lockdep warns mm/vmscan.c: suspicious rcu_dereference_protected() usage!
> > > when free_shrinker_info() is called from mem_cgroup_css_free(): there it
> > > is called with no locking, whereas alloc_shrinker_info() calls it with
> > > down_write of shrinker_rwsem - which seems appropriate.  Rearrange that
> > > so free_shrinker_info() can manage the shrinker_rwsem for itself.
> > >
> > > Link: https://lkml.kernel.org/r/20210317140615.GB28839@xsang-OptiPlex-9020
> > > Reported-by: kernel test robot 
> > > Signed-off-by: Hugh Dickins 
> > > Cc: Yang Shi 
> > > ---
> > > Sorry, I've made no attempt to work out precisely where in the series
> > > the locking went missing, nor tried to fit this in as a fix on top of
> > > mm-vmscan-add-shrinker_info_protected-helper.patch
> > > which Oliver reported (and which you notated in mmotm's "series" file).
> > > This patch just adds the fix to the end of the series, after
> > > mm-vmscan-shrink-deferred-objects-proportional-to-priority.patch
> >
> > The patch "mm: vmscan: add shrinker_info_protected() helper" replaces
> > rcu_dereference_protected(shrinker_info, true) with
> > rcu_dereference_protected(shrinker_info,
> > lockdep_is_held(&shrinker_rwsem)).
> >
> > I think we don't really need shrinker_rwsem in free_shrinker_info()
> > which is called from css_free(). The bits of the map have already been
> > 'reparented' in css_offline. I think we can remove
> > lockdep_is_held(&shrinker_rwsem) for free_shrinker_info().
> 
> Thanks, Hugh and Shakeel. I missed the report.
> 
> I think Shakeel is correct, shrinker_rwsem is not required in css_free
> path so Shakeel's proposal should be able to fix it.

Yes, looking at it again, I am sure that Shakeel is right, and
that my patch was overkill - no need for shrinker_rwsem there.

Whether it's RCU-safe to free the info there, I have not reviewed at
all: but shrinker_rwsem would not help even if there were an issue.

> I prepared a patch:

Unsigned, white-space damaged, so does not apply.

> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64bf07cc20f2..7348c26d4cac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -251,7 +251,12 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> for_each_node(nid) {
> pn = memcg->nodeinfo[nid];
> -   info = shrinker_info_protected(memcg, nid);
> +   /*
> +* Don't use shrinker_info_protected() helper since
> +* free_shrinker_info() could be called by css_free()
> +* without holding shrinker_rwsem.
> +*/

Just because I mis-inferred from the use of shrinker_info_protected()
that shrinker_rwsem was needed here, is no reason to add that comment:
imagine how unhelpfully bigger the kernel source would be if we added
a comment everywhere I had misunderstood something!

> +   info = rcu_dereference_protected(pn->shrinker_info, true);
> kvfree(info);
> rcu_assign_pointer(pn->shrinker_info, NULL);
> }

That does it, but I bikeshedded with myself in the encyclopaedic
rcupdate.h, and decided rcu_replace_pointer(pn->shrinker_info, NULL, true)
would be best.  But now see that patch won't fit so well into your series,
and I can't spend more time writing up a justification for it.
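
For concreteness, the shape I had in mind was no more than this
(an untested sketch, just to show what rcu_replace_pointer() buys:
it returns the old pointer while publishing the new one):

        for_each_node(nid) {
                pn = memcg->nodeinfo[nid];
                /* fetch the old info and publish NULL in one step */
                info = rcu_replace_pointer(pn->shrinker_info, NULL, true);
                kvfree(info);
        }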

I think Andrew should simply delete my fix patch from his queue,
and edit out the
@@ -232,7 +239,7 @@ void free_shrinker_info(struct mem_cgrou
 
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
-   info = rcu_dereference_protected(pn->shrinker_info, true);
+   info = shrinker_info_protected(memcg, nid);
kvfree(info);
rcu_assign_pointer(pn->shrinker_info, NULL);
}
hunk from your mm-vmscan-add-shrinker_info_protected-helper.patch
which will then restore free_shrinker_info() to what you propose above.

Thanks,
Hugh


BUG_ON(!mapping_empty(&inode->i_data))

2021-03-30 Thread Hugh Dickins
Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
machines, within an hour or few, repeatably though not to order.

The stack backtrace has always been clear_inode < ext4_clear_inode <
ext4_evict_inode < evict < dispose_list < prune_icache_sb <
super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.

ext4 is the disk filesystem I read the source to build from, and also
the filesystem I use on a loop device on a tmpfs file: I have not tried
with other filesystems, nor checked whether perhaps it happens always on
the loop one or always on the disk one.  I have not seen it happen with
tmpfs - probably because its inodes cannot be evicted by the shrinker
anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
inodes (but suspect that may be down to timing, or less pressure).
I doubt it's a matter of filesystem: I think it's an XArray thing.

Whenever I've looked at the XArray nodes involved, the root node
(shift 6) contained one or three (adjacent) pointers to empty shift
0 nodes, which each had offset and parent and array correctly set.
Is there some way in which empty nodes can get left behind, and so
fail eviction's mapping_empty() check?
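
For reference, the check that fails amounts to very little - my
reading of the pagemap.h helper in this mmotm, quoted from memory,
so treat it as a sketch:

        static inline bool mapping_empty(struct address_space *mapping)
        {
                return xa_empty(&mapping->i_pages);
        }

so any empty node left hanging off i_pages is enough to trip it.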

I did wonder whether some might get left behind if xas_alloc() fails
(though probably the tree here is too shallow to show that).  Printks
showed that occasionally xas_alloc() did fail while testing (maybe at
memcg limit), but there was no correlation with the BUG_ONs.

I did wonder whether this is a long-standing issue, which your new
BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
nrexceptional BUG_ONs.  The result there surprised me: I expected
it to behave the same way, but it hits that BUG_ON in a minute or
so, instead of an hour or so.  Was there a fix you made somewhere
that avoids the BUG_ON(!mapping_empty) most of the time, but still
needs more work?  I looked around a little, but didn't find any.
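
To be explicit, that 5.12-rc5 experiment was nothing more than this
in clear_inode(), paraphrased from memory rather than a proper patch:

        BUG_ON(inode->i_data.nrpages);
        BUG_ON(inode->i_data.nrexceptional);
        BUG_ON(!xa_empty(&inode->i_data.i_pages));      /* the added check */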

I had hoped to work this out myself, and save us both some writing:
but better hand over to you, in the hope that you'll quickly guess
what's up, then I can try patches. I do like the no-nrexceptionals
series, but there's something still to be fixed.

Hugh


[PATCH mmotm] mm: vmscan: fix shrinker_rwsem in free_shrinker_info()

2021-03-30 Thread Hugh Dickins
Lockdep warns mm/vmscan.c: suspicious rcu_dereference_protected() usage!
when free_shrinker_info() is called from mem_cgroup_css_free(): there it
is called with no locking, whereas alloc_shrinker_info() calls it with
down_write of shrinker_rwsem - which seems appropriate.  Rearrange that
so free_shrinker_info() can manage the shrinker_rwsem for itself.

Link: https://lkml.kernel.org/r/20210317140615.GB28839@xsang-OptiPlex-9020
Reported-by: kernel test robot 
Signed-off-by: Hugh Dickins 
Cc: Yang Shi 
---
Sorry, I've made no attempt to work out precisely where in the series
the locking went missing, nor tried to fit this in as a fix on top of
mm-vmscan-add-shrinker_info_protected-helper.patch
which Oliver reported (and which you notated in mmotm's "series" file).
This patch just adds the fix to the end of the series, after
mm-vmscan-shrink-deferred-objects-proportional-to-priority.patch

 mm/vmscan.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

--- mmotm/mm/vmscan.c   2021-03-28 17:26:54.935553064 -0700
+++ linux/mm/vmscan.c   2021-03-30 15:55:13.374459559 -0700
@@ -249,18 +249,20 @@ void free_shrinker_info(struct mem_cgrou
struct shrinker_info *info;
int nid;
 
+   down_write(&shrinker_rwsem);
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
info = shrinker_info_protected(memcg, nid);
kvfree(info);
rcu_assign_pointer(pn->shrinker_info, NULL);
}
+   up_write(&shrinker_rwsem);
 }
 
 int alloc_shrinker_info(struct mem_cgroup *memcg)
 {
struct shrinker_info *info;
-   int nid, size, ret = 0;
+   int nid, size;
int map_size, defer_size = 0;
 
down_write(&shrinker_rwsem);
@@ -270,9 +272,9 @@ int alloc_shrinker_info(struct mem_cgrou
for_each_node(nid) {
info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
if (!info) {
+   up_write(&shrinker_rwsem);
free_shrinker_info(memcg);
-   ret = -ENOMEM;
-   break;
+   return -ENOMEM;
}
info->nr_deferred = (atomic_long_t *)(info + 1);
info->map = (void *)info->nr_deferred + defer_size;
@@ -280,7 +282,7 @@ int alloc_shrinker_info(struct mem_cgrou
}
up_write(&shrinker_rwsem);
 
-   return ret;
+   return 0;
 }
 
 static inline bool need_expand(int nr_max)


Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

2021-03-25 Thread Hugh Dickins
On Fri, 26 Mar 2021, Matthew Wilcox wrote:
> On Thu, Mar 25, 2021 at 06:55:42PM -0700, Hugh Dickins wrote:
> > The first reason occurred to me this morning.  I thought I had been
> > clever to spot the PageHead race which you fix here.  But now I just feel
> > very stupid not to have spotted the very similar memcg_data race.  The
> > speculative racer may call mem_cgroup_uncharge() from __put_single_page(),
> > and the new call to split_page_memcg() do nothing because page_memcg(head)
> > is already NULL.
> > 
> > And is it even safe there, to sprinkle memcg_data through all of those
> > order-0 subpages, when free_the_page() is about to be applied to a
> > series of descending orders?  I could easily be wrong, but I think
> > free_pages_prepare()'s check_free_page() will find that is not
> > page_expected_state().
> 
> So back to something more like my original patch then?
> 
> +++ b/mm/page_alloc.c
> @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int 
> order)
>  {
> if (put_page_testzero(page))
> free_the_page(page, order);
> - else if (!PageHead(page))
> -   while (order-- > 0)
> -   free_the_page(page + (1 << order), order);
> +   else if (!PageHead(page)) {
> +   while (order-- > 0) {
> +   struct page *tail = page + (1 << order);
> +#ifdef CONFIG_MEMCG
> +   tail->memcg_data = page->memcg_data;
> +#endif
> +   free_the_page(tail, order);
> +   }
> +   }
>  }
>  EXPORT_SYMBOL(__free_pages);
> 
> We can cache page->memcg_data before calling put_page_testzero(),
> just like we cache the Head flag in Johannes' patch.

If I still believed in e320d3012d25, yes, that would look right
(but I don't have much faith in my judgement after all this).

I'd fallen in love with split_page_memcg() when you posted that
one, and was put off by your #ifdef, so got my priorities wrong
and went for the split_page_memcg().

> 
> > But, after all that, I'm now thinking that Matthew's original
> > e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
> > is safer reverted.  The put_page_testzero() in __free_pages() was
> > not introduced for speculative pagecache: it was there in 2.4.0,
> > and atomic_dec_and_test() in 2.2, I don't have older trees to hand.
> 
> I think you're confused in that last assertion.  According to
> linux-fullhistory, the first introduction of __free_pages was 2.3.29pre3
> (September 1999), where it did indeed use put_page_testzero:

Not confused, just pontificating from a misleading subset of the data.
I knew there's an even-more-history-than-tglx git tree somewhere, but
what I usually look back to is 2.4 trees, plus a 2.2.26 tree - but of
course that's a late 2.2, from 2004, around the same time as 2.6.3.
That tree shows a __free_pages() using atomic_dec_and_test().

But we digress...

> 
> +extern inline void __free_pages(struct page *page, unsigned long order)
> +{
> +   if (!put_page_testzero(page))
> +   return;
> +   __free_pages_ok(page, order);
> +}
> 
> Before that, we had only free_pages() and __free_page().
> 
> > So, it has "always" been accepted that multiple references to a
> > high-order non-compound page can be given out and released: maybe
> > they were all released with __free_pages() of the right order, or
> > maybe only the last had to get that right; but as __free_pages()
> > stands today, all but the last caller frees all but the first
> > subpage.  A very rare leak seems much safer.
> > 
> > I don't have the answer (find somewhere in struct page to squirrel
> > away the order, even when it's a non-compound page?), and I think
> > each of us would much rather be thinking about other things at the
> > moment.  But for now it looks to me like NAK to this patch, and
> > revert of e320d3012d25.
> 
> We did discuss that possibility prior to the introduction of
> e320d3012d25.  Here's one such:
> https://lore.kernel.org/linux-mm/20200922031215.gz32...@casper.infradead.org/T/#m0b08c0c3430e09e20fa6648877dc42b04b18e6f3

Thanks for the link. And I'll willingly grant that your experience is
vast compared to mine. But "Drivers don't do that, in my experience"
is not a convincing reason to invalidate a way of working that the
code has gone out of its way to allow for, for over twenty years.

But you make a good point on the "Bad page" reports that would now
be generated: maybe that will change my mind later on.

Hugh


Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

2021-03-25 Thread Hugh Dickins
On Tue, 23 Mar 2021, Hugh Dickins wrote:
> On Tue, 23 Mar 2021, Johannes Weiner wrote:
> > From f6f062a3ec46f4fb083dcf6792fde9723f18cfc5 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner 
> > Date: Fri, 19 Mar 2021 02:17:00 -0400
> > Subject: [PATCH] mm: page_alloc: fix allocation imbalances from speculative
> >  cache lookup
> > 
> > When the freeing of a higher-order page block (non-compound) races
> > with a speculative page cache lookup, __free_pages() needs to leave
> > the first order-0 page in the chunk to the lookup but free the buddy
> > pages that the lookup doesn't know about separately.
> > 
> > There are currently two problems with it:
> > 
> > 1. It checks PageHead() to see whether we're dealing with a compound
> >page after put_page_testzero(). But the speculative lookup could
> >have freed the page after our put and cleared PageHead, in which
> >case we would double free the tail pages.
> > 
> >To fix this, test PageHead before the put and cache the result for
> >afterwards.
> > 
> > 2. If such a higher-order page is charged to a memcg (e.g. !vmap
> >kernel stack), only the first page of the block has page->memcg
> >set. That means we'll uncharge only one order-0 page from the
> >entire block, and leak the remainder.
> > 
> >To fix this, add a split_page_memcg() before it starts freeing tail
> >pages, to ensure they all have page->memcg set up.
> > 
> > While at it, also update the comments a bit to clarify what exactly is
> > happening to the page during that race.
> > 
> > Fixes: e320d3012d25 mm/page_alloc.c: fix freeing non-compound pages

Whoops, misses ("...") around the title.

> > Reported-by: Hugh Dickins 
> > Reported-by: Matthew Wilcox 
> > Signed-off-by: Johannes Weiner 
> > Cc:  # 5.10+
> 
> This is great, thanks Hannes.
> Acked-by: Hugh Dickins 

Sorry, I am ashamed to do this, but now I renege and say NAK:
better now than before Andrew picks it up.

The first reason occurred to me this morning.  I thought I had been
clever to spot the PageHead race which you fix here.  But now I just feel
very stupid not to have spotted the very similar memcg_data race.  The
speculative racer may call mem_cgroup_uncharge() from __put_single_page(),
and the new call to split_page_memcg() do nothing because page_memcg(head)
is already NULL.

And is it even safe there, to sprinkle memcg_data through all of those
order-0 subpages, when free_the_page() is about to be applied to a
series of descending orders?  I could easily be wrong, but I think
free_pages_prepare()'s check_free_page() will find that is not
page_expected_state().

And what gets to do the uncharging when memcg_data is properly set
on the appropriate order-N subpages?  I believe it's the (second)
__memcg_kmem_uncharge_page() in free_pages_prepare(), but that's
only called if PageMemcgKmem().  Ah, good, Roman's changes have put
that flag into memcg_data, so it will automatically be set: but this
patch will not port back to 5.10 without some addition.
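
The check I mean, as best I recall free_pages_prepare() - a
paraphrase from memory, not to be trusted verbatim:

        if (memcg_kmem_enabled() && PageMemcgKmem(page))
                __memcg_kmem_uncharge_page(page, order);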

But, after all that, I'm now thinking that Matthew's original
e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
is safer reverted.  The put_page_testzero() in __free_pages() was
not introduced for speculative pagecache: it was there in 2.4.0,
and atomic_dec_and_test() in 2.2, I don't have older trees to hand.

So, it has "always" been accepted that multiple references to a
high-order non-compound page can be given out and released: maybe
they were all released with __free_pages() of the right order, or
maybe only the last had to get that right; but as __free_pages()
stands today, all but the last caller frees all but the first
subpage.  A very rare leak seems much safer.

I don't have the answer (find somewhere in struct page to squirrel
away the order, even when it's a non-compound page?), and I think
each of us would much rather be thinking about other things at the
moment.  But for now it looks to me like NAK to this patch, and
revert of e320d3012d25.

> 
> I know that 5.10-stable rejected the two split_page_memcg() patches:
> we shall need those in, I'll send GregKH the fixups, but not today.

Done, and Sasha has picked them up.  But in writing that "Ah, good,
Roman's changes ..." paragraph, I've begun to wonder if what I sent
was complete - does a 5.10 split_page_memcg(), when called from
split_page(), also need to copy head's PageKmemcg? I rather think
yes, but by now I'm unsure of everything...
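
The fixup I'm wondering about for 5.10 would be something along
these lines in split_page_memcg() - entirely hypothetical, not even
compile-tested, and the names are from memory:

        for (i = 1; i < nr; i++) {
                head[i].mem_cgroup = head->mem_cgroup;
                if (PageKmemcg(head))
                        __SetPageKmemcg(head + i);
        }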

Hugh

> 
> > ---
> >  mm/page_alloc.c | 41 +++--
> >  1 file changed, 35 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >

Re: [PATCH] x86/tlb: Flush global mappings when KAISER is disabled

2021-03-25 Thread Hugh Dickins
On Thu, 25 Mar 2021, Borislav Petkov wrote:

> Ok,
> 
> I tried to be as specific as possible in the commit message so that we
> don't forget. Please lemme know if I've missed something.
> 
> Babu, Jim, I'd appreciate it if you ran this to confirm.
> 
> Thx.
> 
> ---
> From: Borislav Petkov 
> Date: Thu, 25 Mar 2021 11:02:31 +0100
> 
> Jim Mattson reported that Debian 9 guests using a 4.9-stable kernel
> are exploding during alternatives patching:
> 
>   kernel BUG at 
> /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
>   invalid opcode:  [#1] SMP
>   Modules linked in:
>   CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 Debian 4.9.228-1
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
>   Call Trace:
>swap_entry_free
>swap_entry_free
>text_poke_bp
>swap_entry_free
>arch_jump_label_transform
>set_debug_rodata
>__jump_label_update
>static_key_slow_inc
>frontswap_register_ops
>init_zswap
>init_frontswap
>do_one_initcall
>set_debug_rodata
>kernel_init_freeable
>rest_init
>kernel_init
>ret_from_fork
> 
> triggering the BUG_ON in text_poke() which verifies whether patched
> instruction bytes have actually landed at the destination.
> 
> Further debugging showed that the TLB flush before that check is
> insufficient because there could be global mappings left in the TLB,
> leading to a stale mapping getting used.
> 
> I say "global mappings" because the hardware configuration is a new one:
> machine is an AMD, which means, KAISER/PTI doesn't need to be enabled
> there, which also means there's no user/kernel pagetables split and
> therefore the TLB can have global mappings.
> 
> And the configuration is a new one for a second reason: because that AMD
> machine supports PCID and INVPCID, which leads the CPU detection code to
> set the synthetic X86_FEATURE_INVPCID_SINGLE flag.
> 
> Now, __native_flush_tlb_single() does invalidate global mappings when
> X86_FEATURE_INVPCID_SINGLE is *not* set and returns.
> 
> When X86_FEATURE_INVPCID_SINGLE is set, however, it invalidates the
> requested address from both PCIDs in the KAISER-enabled case. But if
> KAISER is not enabled and the machine has global mappings in the TLB,
> then those global mappings do not get invalidated, which would lead to
> the above mismatch from using a stale TLB entry.
> 
> So make sure to flush those global mappings in the KAISER disabled case.
> 
> Co-debugged by Babu Moger .
> 
> Reported-by: Jim Mattson 
> Signed-off-by: Borislav Petkov 
> Link: 
> https://lkml.kernel.org/r/CALMp9eRDSW66%2BXvbHVF4ohL7XhThoPoT0BrB0TcS0cgk=dk...@mail.gmail.com

Acked-by: Hugh Dickins 

Great write-up too: many thanks.

> ---
>  arch/x86/include/asm/tlbflush.h | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..2bfa4deb8cae 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -245,12 +245,15 @@ static inline void __native_flush_tlb_single(unsigned 
> long addr)
>* ASID.  But, userspace flushes are probably much more
>* important performance-wise.
>*
> -  * Make sure to do only a single invpcid when KAISER is
> -  * disabled and we have only a single ASID.
> +  * In the KAISER disabled case, do an INVLPG to make sure
> +  * the mapping is flushed in case it is a global one.
>*/
> - if (kaiser_enabled)
> + if (kaiser_enabled) {
>   invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> - invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> + invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> + } else {
> + asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> + }
>  }
>  
>  static inline void __flush_tlb_all(void)
> -- 
> 2.29.2
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support

2021-03-24 Thread Hugh Dickins
On Wed, 24 Mar 2021, Hugh Dickins wrote:
> On Wed, 24 Mar 2021, Borislav Petkov wrote:
> 
> > Ok,
> > 
> > some more experimenting Babu and I did led us to:
> > 
> > ---
> > diff --git a/arch/x86/include/asm/tlbflush.h 
> > b/arch/x86/include/asm/tlbflush.h
> > index f5ca15622dc9..259aa4889cad 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -250,6 +250,9 @@ static inline void __native_flush_tlb_single(unsigned 
> > long addr)
> >  */
> > if (kaiser_enabled)
> > invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> > +   else
> > +   asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> > +
> > invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> >  }
> > 
> > applied on the guest kernel which fixes the issue. And let me add Hugh
> > who did that PCID stuff at the time. So lemme summarize for Hugh and to
> > ask him nicely to sanity-check me. :-)
> 
> Just a brief interim note to assure you that I'm paying attention,
> but wow, it's a long time since I gave any thought down here!
> Trying to page it all back in...
> 
> I see no harm in your workaround if it works, but it's not as if
> this is a previously untried path: so I'm suspicious how an issue
> here with Globals could have gone unnoticed for so long, and need
> to understand it better.

Right, after looking into it more, I completely agree with you:
the Kaiser series (in both 4.4-stable and 4.9-stable) was simply
wrong to lose that invlpg - fine in the kaiser case when we don't
enable Globals at all, but plain wrong in the !kaiser_enabled case.
One way or another, we have somehow got away with it for three years.

I do agree with Paolo that the PCID_ASID_KERN flush would be better
moved under the "if (kaiser_enabled)" now. (And if this were ongoing
development, I'd want to rewrite the function altogether: but no,
these old stable trees are not the place for that.)

Boris, may I leave both -stable fixes to you?
Let me know if you'd prefer me to clean up my mess.

Thanks a lot for tracking this down,
Hugh

> > 
> > Basically, you have an AMD host which supports PCID and INVPCID and you
> > boot on it a 4.9 guest. It explodes like the panic below.
> > 
> > What fixes it is this:
> > 
> > diff --git a/arch/x86/include/asm/tlbflush.h 
> > b/arch/x86/include/asm/tlbflush.h
> > index f5ca15622dc9..259aa4889cad 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -250,6 +250,9 @@ static inline void __native_flush_tlb_single(unsigned 
> > long addr)
> >  */
> > if (kaiser_enabled)
> > invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> > +   else
> > +   asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> > +
> > invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> >  }
> > 
> > ---
> > 
> > and the reason why it does, IMHO, is because on AMD, kaiser_enabled is
> > false because AMD is not affected by Meltdown, which means, there's no
> > user/kernel pagetables split.
> > 
> > And that also means, you have global TLB entries which means that if you
> > look at that __native_flush_tlb_single() function, it needs to flush
> > global TLB entries on CPUs with X86_FEATURE_INVPCID_SINGLE by doing an
> > INVLPG in the kaiser_enabled=0 case. Ergo, the above hunk.
> > 
> > But I might be completely off here thus this note...
> > 
> > Thoughts?
> > 
> > Thx.
> > 
> > 
> > [1.235726] [ cut here ]
> > [1.237515] kernel BUG at 
> > /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
> > [1.240926] invalid opcode:  [#1] SMP
> > [1.243301] Modules linked in:
> > [1.244585] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 
> > Debian 4.9.228-1
> > [1.247657] Hardware name: Google Google Compute Engine/Google Compute 
> > Engine, BIOS Google 01/01/2011
> > [1.251249] task: 909363e94040 task.stack: a41bc0194000
> > [1.253519] RIP: 0010:[]  [] 
> > text_poke+0x18c/0x240
> > [1.256593] RSP: 0018:a41bc0197d90  EFLAGS: 00010096
> > [1.258657] RAX: 000f RBX: 01020800 RCX: 
> > feda3203
> > [1.261388] RDX: 178bfbff RSI:  RDI: 
> > ff57a000
> > [1.264168] RBP: 8fbd3eca R08:  R09: 
> > 0003
> > [1.266983] R10

Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support

2021-03-24 Thread Hugh Dickins
On Wed, 24 Mar 2021, Borislav Petkov wrote:

> Ok,
> 
> some more experimenting Babu and I did led us to:
> 
> ---
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..259aa4889cad 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -250,6 +250,9 @@ static inline void __native_flush_tlb_single(unsigned 
> long addr)
>*/
>   if (kaiser_enabled)
>   invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> + else
> + asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +
>   invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
>  }
> 
> applied on the guest kernel which fixes the issue. And let me add Hugh
> who did that PCID stuff at the time. So lemme summarize for Hugh and to
> ask him nicely to sanity-check me. :-)

Just a brief interim note to assure you that I'm paying attention,
but wow, it's a long time since I gave any thought down here!
Trying to page it all back in...

I see no harm in your workaround if it works, but it's not as if
this is a previously untried path: so I'm suspicious how an issue
here with Globals could have gone unnoticed for so long, and need
to understand it better.

Hugh

> 
> Basically, you have an AMD host which supports PCID and INVPCID and you
> boot on it a 4.9 guest. It explodes like the panic below.
> 
> What fixes it is this:
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..259aa4889cad 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -250,6 +250,9 @@ static inline void __native_flush_tlb_single(unsigned 
> long addr)
>*/
>   if (kaiser_enabled)
>   invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> + else
> + asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +
>   invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
>  }
> 
> ---
> 
> and the reason why it does, IMHO, is because on AMD, kaiser_enabled is
> false because AMD is not affected by Meltdown, which means, there's no
> user/kernel pagetables split.
> 
> And that also means, you have global TLB entries which means that if you
> look at that __native_flush_tlb_single() function, it needs to flush
> global TLB entries on CPUs with X86_FEATURE_INVPCID_SINGLE by doing an
> INVLPG in the kaiser_enabled=0 case. Ergo, the above hunk.
> 
> But I might be completely off here thus this note...
> 
> Thoughts?
> 
> Thx.
> 
> 
> [1.235726] [ cut here ]
> [1.237515] kernel BUG at 
> /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
> [1.240926] invalid opcode:  [#1] SMP
> [1.243301] Modules linked in:
> [1.244585] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 
> Debian 4.9.228-1
> [1.247657] Hardware name: Google Google Compute Engine/Google Compute 
> Engine, BIOS Google 01/01/2011
> [1.251249] task: 909363e94040 task.stack: a41bc0194000
> [1.253519] RIP: 0010:[]  [] 
> text_poke+0x18c/0x240
> [1.256593] RSP: 0018:a41bc0197d90  EFLAGS: 00010096
> [1.258657] RAX: 000f RBX: 01020800 RCX: 
> feda3203
> [1.261388] RDX: 178bfbff RSI:  RDI: 
> ff57a000
> [1.264168] RBP: 8fbd3eca R08:  R09: 
> 0003
> [1.266983] R10: 0003 R11: 0112 R12: 
> 0001
> [1.269702] R13: a41bc0197dcf R14: 0286 R15: 
> ed1c40407500
> [1.272572] FS:  () GS:90936630() 
> knlGS:
> [1.275791] CS:  0010 DS:  ES:  CR0: 80050033
> [1.278032] CR2:  CR3: 10c08000 CR4: 
> 003606f0
> [1.280815] Stack:
> [1.281630]  8fbd3eca 0005 a41bc0197e03 
> 8fbd3ecb
> [1.284660]    8fa2e835 
> ccff8fad4326
> [1.287729]  1ccd0231874d55d3 8fbd3eca a41bc0197e03 
> 90203844
> [1.290852] Call Trace:
> [1.291782]  [] ? swap_entry_free+0x12a/0x300
> [1.294900]  [] ? swap_entry_free+0x12b/0x300
> [1.297267]  [] ? text_poke_bp+0x55/0xe0
> [1.299473]  [] ? swap_entry_free+0x12a/0x300
> [1.301896]  [] ? arch_jump_label_transform+0x9c/0x120
> [1.304557]  [] ? set_debug_rodata+0xc/0xc
> [1.306790]  [] ? __jump_label_update+0x72/0x80
> [1.309255]  [] ? static_key_slow_inc+0x8f/0xa0
> [1.311680]  [] ? frontswap_register_ops+0x107/0x1d0
> [1.314281]  [] ? init_zswap+0x282/0x3f6
> [1.316547]  [] ? init_frontswap+0x8c/0x8c
> [1.318784]  [] ? do_one_initcall+0x4e/0x180
> [1.321067]  [] ? set_debug_rodata+0xc/0xc
> [1.323366]  [] ? kernel_init_freeable+0x16b/0x1ec
> [1.325873]  [] ? rest_init+0x80/0x80
> [1.327989]  [] ? kernel_init+0xa/0x100
> [1.330092]  [] ? 

Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

2021-03-23 Thread Hugh Dickins
On Tue, 23 Mar 2021, Johannes Weiner wrote:
> From f6f062a3ec46f4fb083dcf6792fde9723f18cfc5 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Fri, 19 Mar 2021 02:17:00 -0400
> Subject: [PATCH] mm: page_alloc: fix allocation imbalances from speculative
>  cache lookup
> 
> When the freeing of a higher-order page block (non-compound) races
> with a speculative page cache lookup, __free_pages() needs to leave
> the first order-0 page in the chunk to the lookup but free the buddy
> pages that the lookup doesn't know about separately.
> 
> There are currently two problems with it:
> 
> 1. It checks PageHead() to see whether we're dealing with a compound
>page after put_page_testzero(). But the speculative lookup could
>have freed the page after our put and cleared PageHead, in which
>case we would double free the tail pages.
> 
>To fix this, test PageHead before the put and cache the result for
>afterwards.
> 
> 2. If such a higher-order page is charged to a memcg (e.g. !vmap
>kernel stack), only the first page of the block has page->memcg
>set. That means we'll uncharge only one order-0 page from the
>entire block, and leak the remainder.
> 
>To fix this, add a split_page_memcg() before it starts freeing tail
>pages, to ensure they all have page->memcg set up.
> 
> While at it, also update the comments a bit to clarify what exactly is
> happening to the page during that race.
> 
> Fixes: e320d3012d25 mm/page_alloc.c: fix freeing non-compound pages
> Reported-by: Hugh Dickins 
> Reported-by: Matthew Wilcox 
> Signed-off-by: Johannes Weiner 
> Cc:  # 5.10+

This is great, thanks Hannes.
Acked-by: Hugh Dickins 

I know that 5.10-stable rejected the two split_page_memcg() patches:
we shall need those in, I'll send GregKH the fixups, but not today.

> ---
>  mm/page_alloc.c | 41 +++--
>  1 file changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c53fe4fa10bf..8aab1e87fa3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5112,10 +5112,9 @@ static inline void free_the_page(struct page *page, 
> unsigned int order)
>   * the allocation, so it is easy to leak memory.  Freeing more memory
>   * than was allocated will probably emit a warning.
>   *
> - * If the last reference to this page is speculative, it will be released
> - * by put_page() which only frees the first page of a non-compound
> - * allocation.  To prevent the remaining pages from being leaked, we free
> - * the subsequent pages here.  If you want to use the page's reference
> + * This function isn't a put_page(). Don't let the put_page_testzero()
> + * fool you, it's only to deal with speculative cache references. It
> + * WILL free pages directly. If you want to use the page's reference
>   * count to decide when to free the allocation, you should allocate a
>   * compound page, and use put_page() instead of __free_pages().
>   *
> @@ -5124,11 +5123,41 @@ static inline void free_the_page(struct page *page, 
> unsigned int order)
>   */
>  void __free_pages(struct page *page, unsigned int order)
>  {
> - if (put_page_testzero(page))
> + bool compound = PageHead(page);
> +
> + /*
> +  * Drop the base reference from __alloc_pages and free. In
> +  * case there is an outstanding speculative reference, from
> +  * e.g. the page cache, it will put and free the page later.
> +  */
> + if (likely(put_page_testzero(page))) {
>   free_the_page(page, order);
> - else if (!PageHead(page))
> + return;
> + }
> +
> + /*
> +  * Ok, the speculative reference will put and free the page.
> +  *
> +  * - If this was an order-0 page, we're done.
> +  *
> +  * - If the page was compound, the other side will free the
> +  *   entire page and we're done here as well. Just note that
> +  *   freeing clears PG_head, so it can only be read reliably
> +  *   before the put_page_testzero().
> +  *
> +  * - If the page was of higher order but NOT marked compound,
> +  *   the other side will know nothing about our buddy pages
> +  *   and only free the order-0 page at the start of our block.
> +  *   We must split off and free the buddy pages here.
> +  *
> +  *   The buddy pages aren't individually refcounted, so they
> +  *   can't have any pending speculative references themselves.
> +  */
> + if (order > 0 && !compound) {
> + split_page_memcg(page, 1 << order);
>   while (order-- > 0)
>   free_the_page(page + (1 << order), order);
> + }
>  }
>  EXPORT_SYMBOL(__free_pages);
>  
> -- 
> 2.31.0


Re: [PATCH v4 2/3] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

2021-03-23 Thread Hugh Dickins
On Tue, 23 Mar 2021, Brian Geffon wrote:

> This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.
> 
> As discussed in [1] this commit was a no-op because the mapping type was
> checked in vma_to_resize before move_vma is ever called. This meant that
> vm_ops->mremap() would never be called on such mappings. Furthermore,
> we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
> mappings, and these special mappings are still protected by the existing
> check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EFAULT.

No, those two lines still describe an earlier version, they should say:
"mappings, and these special mappings are now protected by a check of
 !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL."

> 
> 1. https://lkml.org/lkml/2020/12/28/2340
> 
> Signed-off-by: Brian Geffon 
> Acked-by: Hugh Dickins 
> ---
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
>  fs/aio.c  | 5 +
>  include/linux/mm.h| 2 +-
>  mm/mmap.c | 6 +-
>  mm/mremap.c   | 2 +-
>  5 files changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c 
> b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index e916646adc69..0daf2f1cf7a8 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, 
> struct file *filp)
>   return 0;
>  }
>  
> -static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long 
> flags)
> +static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
>  {
>   /* Not supported */
>   return -EINVAL;
> diff --git a/fs/aio.c b/fs/aio.c
> index 1f32da13d39e..76ce0cc3ee4e 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
>   }
>  }
>  
> -static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
> +static int aio_ring_mremap(struct vm_area_struct *vma)
>  {
>   struct file *file = vma->vm_file;
>   struct mm_struct *mm = vma->vm_mm;
>   struct kioctx_table *table;
>   int i, res = -EINVAL;
>  
> - if (flags & MREMAP_DONTUNMAP)
> - return -EINVAL;
> -
>   spin_lock(&mm->ioctx_lock);
>   rcu_read_lock();
>   table = rcu_dereference(mm->ioctx_table);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 64a71bf20536..ecdc6e8dc5af 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -570,7 +570,7 @@ struct vm_operations_struct {
>   void (*close)(struct vm_area_struct * area);
>   /* Called any time before splitting to check if it's allowed */
>   int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> - int (*mremap)(struct vm_area_struct *area, unsigned long flags);
> + int (*mremap)(struct vm_area_struct *area);
>   /*
>* Called by mprotect() to make driver-specific permission
>* checks before mprotect() is finalised.   The VMA must not
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3f287599a7a3..9d7651e4e1fe 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct 
> vm_area_struct *vma)
>   return ((struct vm_special_mapping *)vma->vm_private_data)->name;
>  }
>  
> -static int special_mapping_mremap(struct vm_area_struct *new_vma,
> -   unsigned long flags)
> +static int special_mapping_mremap(struct vm_area_struct *new_vma)
>  {
>   struct vm_special_mapping *sm = new_vma->vm_private_data;
>  
> - if (flags & MREMAP_DONTUNMAP)
> - return -EINVAL;
> -
>   if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
>   return -EFAULT;
>  
> diff --git a/mm/mremap.c b/mm/mremap.c
> index db5b8b28c2dd..d22629ff8f3c 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>   if (moved_len < old_len) {
>   err = -ENOMEM;
>   } else if (vma->vm_ops && vma->vm_ops->mremap) {
> - err = vma->vm_ops->mremap(new_vma, flags);
> + err = vma->vm_ops->mremap(new_vma);
>   }
>  
>   if (unlikely(err)) {
> -- 
> 2.31.0.rc2.261.g7f71774620-goog
> 
> 


Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

2021-03-19 Thread Hugh Dickins
On Fri, 19 Mar 2021, Johannes Weiner wrote:

> When the freeing of a higher-order page block (non-compound) races
> with a speculative page cache lookup, __free_pages() needs to leave
> the first order-0 page in the chunk to the lookup but free the buddy
> pages that the lookup doesn't know about separately.
> 
> However, if such a higher-order page is charged to a memcg (e.g. !vmap
> kernel stack), only the first page of the block has page->memcg
> set. That means we'll uncharge only one order-0 page from the entire
> block, and leak the remainder.
> 
> Add a split_page_memcg() to __free_pages() right before it starts
> taking the higher-order page apart and freeing its individual
> constituent pages. This ensures all of them will have the memcg
> linkage set up for correct uncharging. Also update the comments a bit
> to clarify what exactly is happening to the page during that race.
> 
> This bug is old and has its roots in the speculative page cache patch
> and adding cgroup accounting of kernel pages. There are no known user
> reports. A backport to stable is therefore not warranted.
> 
> Reported-by: Matthew Wilcox 
> Signed-off-by: Johannes Weiner 

Acked-by: Hugh Dickins 

to the split_page_memcg() addition etc, but a doubt just hit me on the
original e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages"):
see comment below.

> ---
>  mm/page_alloc.c | 33 +++--
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c53fe4fa10bf..f4bd56656402 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5112,10 +5112,9 @@ static inline void free_the_page(struct page *page, 
> unsigned int order)
>   * the allocation, so it is easy to leak memory.  Freeing more memory
>   * than was allocated will probably emit a warning.
>   *
> - * If the last reference to this page is speculative, it will be released
> - * by put_page() which only frees the first page of a non-compound
> - * allocation.  To prevent the remaining pages from being leaked, we free
> - * the subsequent pages here.  If you want to use the page's reference
> + * This function isn't a put_page(). Don't let the put_page_testzero()
> + * fool you, it's only to deal with speculative cache references. It
> + * WILL free pages directly. If you want to use the page's reference
>   * count to decide when to free the allocation, you should allocate a
>   * compound page, and use put_page() instead of __free_pages().
>   *
> @@ -5124,11 +5123,33 @@ static inline void free_the_page(struct page *page, 
> unsigned int order)
>   */
>  void __free_pages(struct page *page, unsigned int order)
>  {
> - if (put_page_testzero(page))
> + /*
> +  * Drop the base reference from __alloc_pages and free. In
> +  * case there is an outstanding speculative reference, from
> +  * e.g. the page cache, it will put and free the page later.
> +  */
> + if (likely(put_page_testzero(page))) {
>   free_the_page(page, order);
> - else if (!PageHead(page))
> + return;
> + }
> +
> + /*
> +  * The speculative reference will put and free the page.
> +  *
> +  * However, if the speculation was into a higher-order page
> +  * chunk that isn't marked compound, the other side will know
> +  * nothing about our buddy pages and only free the order-0
> +  * page at the start of our chunk! We must split off and free
> +  * the buddy pages here.
> +  *
> +  * The buddy pages aren't individually refcounted, so they
> +  * can't have any pending speculative references themselves.
> +  */
> + if (!PageHead(page) && order > 0) {

The put_page_testzero() has released our reference to the first
subpage of page: it's now under the control of the racing speculative
lookup.  So it seems to me unsafe to be checking PageHead(page) here:
if it was actually a compound page, PageHead might already be cleared
by now, and we doubly free its tail pages below?  I think we need to
use a "bool compound = PageHead(page)" on entry to __free_pages().

Or alternatively, it's wrong to call __free_pages() on a compound
page anyway, so we should not check PageHead at all, except in a
WARN_ON_ONCE(PageCompound(page)) at the start?

And would it be wrong to fix that too in this patch?
Though it ought then to be backported to 5.10 stable.
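
If it helps, the first option spelled out would be something like
this (an untested sketch only, not a patch):

        void __free_pages(struct page *page, unsigned int order)
        {
                /* snapshot before the put: the racer may free the page
                 * and clear PG_head at any moment afterwards */
                bool compound = PageHead(page);

                if (put_page_testzero(page))
                        free_the_page(page, order);
                else if (!compound && order > 0) {
                        split_page_memcg(page, 1 << order);
                        while (order-- > 0)
                                free_the_page(page + (1 << order), order);
                }
        }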

> + split_page_memcg(page, 1 << order);
>   while (order-- > 0)
>   free_the_page(page + (1 << order), order);
> + }
>  }
>  EXPORT_SYMBOL(__free_pages);
>  
> -- 
> 2.30.1


Re: [PATCH 2/2] mm: memcontrol: deprecate swapaccounting=0 mode

2021-03-19 Thread Hugh Dickins
On Fri, 19 Mar 2021, Johannes Weiner wrote:

> The swapaccounting= commandline option already does very little
> today. To close a trivial containment failure case, the swap ownership
> tracking part of the swap controller has recently become mandatory
> (see commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> integral part of memory control") for details), which makes up the
> majority of the work during swapout, swapin, and the swap slot map.
> 
> The only thing left under this flag is the page_counter operations and
> the visibility of the swap control files in the first place, which are
> rather meager savings. There also aren't many scenarios, if any, where
> controlling the memory of a cgroup while allowing it unlimited access
> to a global swap space is a workable resource isolation strategy.
> 
> On the other hand, there have been several bugs and confusion around
> the many possible swap controller states (cgroup1 vs cgroup2 behavior,
> memory accounting without swap accounting, memcg runtime disabled).
> 
> This puts the maintenance overhead of retaining the toggle above its
> practical benefits. Deprecate it.
> 
> Suggested-by: Shakeel Butt 
> Signed-off-by: Johannes Weiner 

This crashes, and needs a fix: see below (plus some nits).

But it's a very welcome cleanup: just getting rid of all those
!cgroup_memory_noswap double negatives is a relief in itself.

It does suggest eliminating CONFIG_MEMCG_SWAP altogether (just
using #ifdef CONFIG_SWAP instead, in those parts of CONFIG_MEMCG code);
but you're right that's a separate cleanup, and not nearly so worthwhile
as this one (I notice CONFIG_MEMCG_SWAP in some of the arch defconfigs,
and don't know whether whoever removes CONFIG_MEMCG_SWAP would be
obligated to remove those too).

> ---
>  .../admin-guide/kernel-parameters.txt |  5 --
>  include/linux/memcontrol.h|  4 --
>  mm/memcontrol.c   | 48 ++-
>  3 files changed, 15 insertions(+), 42 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 942bbef8f128..986d45dd8c37 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5322,11 +5322,6 @@
>   This parameter controls use of the Protected
>   Execution Facility on pSeries.
>  
> - swapaccount=[0|1]
> - [KNL] Enable accounting of swap in memory resource
> - controller if no parameter or 1 is given or disable
> - it if 0 is given (See 
> Documentation/admin-guide/cgroup-v1/memory.rst)
> -
>   swiotlb=[ARM,IA-64,PPC,MIPS,X86]
>   Format: {  | force | noforce }
>-- Number of I/O TLB slabs
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4064c9dda534..ef9613538d36 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -874,10 +874,6 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct 
> task_struct *victim,
>   struct mem_cgroup *oom_domain);
>  void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>  
> -#ifdef CONFIG_MEMCG_SWAP
> -extern bool cgroup_memory_noswap;
> -#endif
> -
>  void lock_page_memcg(struct page *page);
>  void unlock_page_memcg(struct page *page);
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 49bdcf603af1..b036c4fb0fa7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,13 +85,6 @@ static bool cgroup_memory_nosocket;
>  /* Kernel memory accounting disabled? */
>  static bool cgroup_memory_nokmem;
>  
> -/* Whether the swap controller is active */
> -#ifdef CONFIG_MEMCG_SWAP
> -bool cgroup_memory_noswap __read_mostly;
> -#else
> -#define cgroup_memory_noswap 1
> -#endif
> -
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
>  #endif
> @@ -99,7 +92,11 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
>  /* Whether legacy memory+swap accounting is active */
>  static bool do_memsw_account(void)
>  {
> - return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && 
> !cgroup_memory_noswap;
> + /* cgroup2 doesn't do mem+swap accounting */
> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + return true;

Nit: I'm not fond of the "if (boolean()) return true; else return false;"
codestyle, and would prefer the straightforward

return !cgroup_subsys_on_dfl(memory_cgrp_subsys);

but you've chosen otherwise, so, okay.

>  }
>  
>  #define THRESHOLDS_EVENTS_TARGET 128
> @@ -7019,7 +7016,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
> entry)
>   if (!mem_cgroup_is_root(memcg))
>   page_counter_uncharge(&memcg->memory, nr_entries);
>  
> - if (!cgroup_memory_noswap && memcg != swap_memcg) {

Re: [PATCH 1/2] mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled

2021-03-19 Thread Hugh Dickins
On Fri, 19 Mar 2021, Johannes Weiner wrote:

> Since commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> integral part of memory control"), the cgroup swap arrays are used to
> track memory ownership at the time of swap readahead and swapoff, even
> if swap space *accounting* has been turned off by the user via
> swapaccount=0 (which sets cgroup_memory_noswap).
> 
> However, the patch was overzealous: by simply dropping the
> cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
> path, it caused the cgroup arrays being allocated even when the memory
> controller as a whole is disabled. This is a waste of that memory.
> 
> Restore mem_cgroup_disabled() checks, implied previously by
> cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
> callbacks.
> 
> Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of 
> memory control")
> Reported-by: Hugh Dickins 
> Signed-off-by: Johannes Weiner 

Acked-by: Hugh Dickins 

Thanks for the memory!

> ---
>  mm/memcontrol.c  | 3 +++
>  mm/swap_cgroup.c | 6 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 668d1d7c2645..49bdcf603af1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7101,6 +7101,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, 
> unsigned int nr_pages)
>   struct mem_cgroup *memcg;
>   unsigned short id;
>  
> + if (mem_cgroup_disabled())
> + return;
> +
>   id = swap_cgroup_record(entry, 0, nr_pages);
>   rcu_read_lock();
>   memcg = mem_cgroup_from_id(id);
> diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> index 7f34343c075a..08c3246f9269 100644
> --- a/mm/swap_cgroup.c
> +++ b/mm/swap_cgroup.c
> @@ -171,6 +171,9 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
>   unsigned long length;
>   struct swap_cgroup_ctrl *ctrl;
>  
> + if (mem_cgroup_disabled())
> + return 0;
> +
>   length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
>   array_size = length * sizeof(void *);
>  
> @@ -206,6 +209,9 @@ void swap_cgroup_swapoff(int type)
>   unsigned long i, length;
>   struct swap_cgroup_ctrl *ctrl;
>  
> + if (mem_cgroup_disabled())
> + return;
> +
>   mutex_lock(&swap_cgroup_mutex);
>   ctrl = &swap_cgroup_ctrl[type];
>   map = ctrl->map;
> -- 
> 2.30.1


Re: [PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP

2021-03-18 Thread Hugh Dickins
On Tue, 16 Mar 2021, Peter Xu wrote:
> 
> I'm curious whether it's okay to expand MREMAP_DONTUNMAP to PFNMAP too..
> E.g. vfio maps device MMIO regions with both VM_DONTEXPAND|VM_PFNMAP, to me it
> makes sense to allow the userspace to get such MMIO region remapped/duplicated
> somewhere else as long as the size won't change.  With the strict check as
> above we kill all those possibilities.
> 
> Though in that case we'll still need commits like cd544fd1dc92 to protect any
> customized ->mremap() when they're not supported.

It would take me many hours to arrive at a conclusion on that:
I'm going to spend the time differently, and let whoever ends up
wanting MREMAP_DONTUNMAP on a VM_PFNMAP area research the safety
of that for existing users.

I did look to see what added VM_PFNMAP to the original VM_DONTEXPAND:

v2.6.15
commit 4d7672b46244abffea1953e55688c0ea143dd617
Author: Linus Torvalds 
Date:   Fri Dec 16 10:21:23 2005 -0800

Make sure we copy pages inserted with "vm_insert_page()" on fork

The logic that decides that a fork() might be able to avoid copying a VM
area when it can be re-created by page faults didn't know about the new
vm_insert_page() case.

Also make some things a bit more anal wrt VM_PFNMAP.

Pointed out by Hugh Dickins 

Signed-off-by: Linus Torvalds 

So apparently I do bear some anal responsibility.  My concern seems
to have been that in those days an unexpected page fault in a special
driver area would end up allocating an anonymous page, which would
never get freed later.  Nowadays it looks like there's a SIGBUS for
the equivalent situation.

So probably VM_DONTEXPAND is less important than it was, and the
additional VM_PFNMAP safety net no longer necessary, and you could
strip it out of the old size check and Brian's new dontunmap check.
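
That is, whoever does the research might relax Brian's check in
vma_to_resize() to something like the below - purely hypothetical
on my part, I am not proposing it:

        if (flags & MREMAP_DONTUNMAP && (vma->vm_flags & VM_DONTEXPAND))
                return ERR_PTR(-EINVAL);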

But I give no guarantee: I don't know VM_PFNMAP users at all well.

Hugh


Re: [PATCH v3 2/2] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

2021-03-18 Thread Hugh Dickins
On Wed, 17 Mar 2021, Brian Geffon wrote:

> This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.
> 
> As discussed in [1] this commit was a no-op because the mapping type was
> checked in vma_to_resize before move_vma is ever called. This meant that
> vm_ops->mremap() would never be called on such mappings. Furthermore,
> we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
> mappings, and these special mappings are still protected by the existing
> check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EFAULT.

One small fixup needed: -EFAULT was what the incorrect v2 gave, but
v3 issues -EINVAL like before, and I'm content with that difference.

> 
> 1. https://lkml.org/lkml/2020/12/28/2340
> 
> Signed-off-by: Brian Geffon 

Acked-by: Hugh Dickins 

Thanks Brian, I'm happy with this result.

> ---
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
>  fs/aio.c  | 5 +
>  include/linux/mm.h| 2 +-
>  mm/mmap.c | 6 +-
>  mm/mremap.c   | 2 +-
>  5 files changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c 
> b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index e916646adc69..0daf2f1cf7a8 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, 
> struct file *filp)
>   return 0;
>  }
>  
> -static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long 
> flags)
> +static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
>  {
>   /* Not supported */
>   return -EINVAL;
> diff --git a/fs/aio.c b/fs/aio.c
> index 1f32da13d39e..76ce0cc3ee4e 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
>   }
>  }
>  
> -static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
> +static int aio_ring_mremap(struct vm_area_struct *vma)
>  {
>   struct file *file = vma->vm_file;
>   struct mm_struct *mm = vma->vm_mm;
>   struct kioctx_table *table;
>   int i, res = -EINVAL;
>  
> - if (flags & MREMAP_DONTUNMAP)
> - return -EINVAL;
> -
>   spin_lock(&mm->ioctx_lock);
>   rcu_read_lock();
>   table = rcu_dereference(mm->ioctx_table);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..8c3729eb3e38 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -570,7 +570,7 @@ struct vm_operations_struct {
>   void (*close)(struct vm_area_struct * area);
>   /* Called any time before splitting to check if it's allowed */
>   int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> - int (*mremap)(struct vm_area_struct *area, unsigned long flags);
> + int (*mremap)(struct vm_area_struct *area);
>   /*
>* Called by mprotect() to make driver-specific permission
>* checks before mprotect() is finalised.   The VMA must not
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3f287599a7a3..9d7651e4e1fe 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct 
> vm_area_struct *vma)
>   return ((struct vm_special_mapping *)vma->vm_private_data)->name;
>  }
>  
> -static int special_mapping_mremap(struct vm_area_struct *new_vma,
> -   unsigned long flags)
> +static int special_mapping_mremap(struct vm_area_struct *new_vma)
>  {
>   struct vm_special_mapping *sm = new_vma->vm_private_data;
>  
> - if (flags & MREMAP_DONTUNMAP)
> - return -EINVAL;
> -
>   if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
>   return -EFAULT;
>  
> diff --git a/mm/mremap.c b/mm/mremap.c
> index db5b8b28c2dd..d22629ff8f3c 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>   if (moved_len < old_len) {
>   err = -ENOMEM;
>   } else if (vma->vm_ops && vma->vm_ops->mremap) {
> - err = vma->vm_ops->mremap(new_vma, flags);
> + err = vma->vm_ops->mremap(new_vma);
>   }
>  
>   if (unlikely(err)) {
> -- 
> 2.31.0.rc2.261.g7f71774620-goog


Re: [PATCH v3 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP

2021-03-18 Thread Hugh Dickins
If Andrew is happy with such a long patch name, okay;
but personally I'd prefer brevity to all that detail:

mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings

On Wed, 17 Mar 2021, Brian Geffon wrote:

> Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
> change will widen the support to include any mappings which are not
> VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
> MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
> 
> This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
> if VM_DONTEXPAND or VM_PFNMAP mappings are specified.
> 
> Lokesh Gidra who works on the Android JVM, provided an explanation of how
> such a feature will improve Android JVM garbage collection:
> "Android is developing a new garbage collector (GC), based on userfaultfd.
> The garbage collector will use userfaultfd (uffd) on the java heap during
> compaction. On accessing any uncompacted page, the application threads will
> find it missing, at which point the thread will create the compacted page
> and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
> Before starting this compaction, in a stop-the-world pause the heap will be
> mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
> UFFD_EVENT_PAGEFAULT events after resuming execution.
> 
> To speedup mremap operations, pagetable movement was optimized by moving
> PUD entries instead of PTE entries [1]. It was necessary as mremap of even
> modest sized memory ranges also took several milliseconds, and stopping the
> application for that long isn't acceptable in response-time sensitive
> cases.
> 
> With UFFDIO_CONTINUE feature [2], it will be even more efficient to
> implement this GC, particularly the 'non-moveable' portions of the heap.
> It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
> However, for this to work, the java heap has to be on a 'shared' vma.
> Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
> patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
> compaction."
> 
> [1] 
> https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
> [2] 
> https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
> 
> Signed-off-by: Brian Geffon 

Acked-by: Hugh Dickins 

Thanks Brian, just what I wanted :)

You wondered in another mail about this returning -EINVAL whereas
the VM_DONTEXPAND size error returns -EFAULT: I've pondered, and I've
read the manpage, and I'm sure it would be wrong to change the old
-EFAULT to -EINVAL now; and I don't see good reason to change your
-EINVAL to -EFAULT either.  Let them differ, that's okay (and it's
only in special corner cases that either of these fail anyway).

> ---
>  mm/mremap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index ec8f840399ed..db5b8b28c2dd 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long 
> addr,
>   return ERR_PTR(-EINVAL);
>   }
>  
> - if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> - vma->vm_flags & VM_SHARED))
> + if ((flags & MREMAP_DONTUNMAP) &&
> + (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
>   return ERR_PTR(-EINVAL);
>  
>   if (is_vm_hugetlb_page(vma))
> -- 
> 2.31.0.rc2.261.g7f71774620-goog


Re: [PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP

2021-03-13 Thread Hugh Dickins
On Wed, 3 Mar 2021, Brian Geffon wrote:

> Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This 
> change
> will widen the support to include shmem mappings. The primary use case
> is to support MREMAP_DONTUNMAP on mappings which may have been created from
> a memfd.
> 
> Lokesh Gidra who works on the Android JVM, provided an explanation of how such
> a feature will improve Android JVM garbage collection:
> "Android is developing a new garbage collector (GC), based on userfaultfd. The
> garbage collector will use userfaultfd (uffd) on the java heap during 
> compaction.
> On accessing any uncompacted page, the application threads will find it 
> missing,
> at which point the thread will create the compacted page and then use 
> UFFDIO_COPY
> ioctl to get it mapped and then resume execution. Before starting this 
> compaction,
> in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that 
> the
> java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming 
> execution.
> 
> To speedup mremap operations, pagetable movement was optimized by moving PUD 
> entries
> instead of PTE entries [1]. It was necessary as mremap of even modest sized 
> memory
> ranges also took several milliseconds, and stopping the application for that 
> long
> isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE 
> feature [2],
> it will be even more efficient to implement this GC, particularly the 
> 'non-moveable'
> portions of the heap. It will also help in reducing the need to copy 
> (UFFDIO_COPY)
> the pages. However, for this to work, the java heap has to be on a 'shared' 
> vma.
> Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this 
> patch will
> enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction."
> 
> [1] 
> https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
> [2] 
> https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
> ---
>  mm/mremap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index ec8f840399ed..6934d199da54 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -653,8 +653,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long 
> addr,
>   return ERR_PTR(-EINVAL);
>   }
>  
> - if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> - vma->vm_flags & VM_SHARED))
> + if (flags & MREMAP_DONTUNMAP && !(vma_is_anonymous(vma) || 
> vma_is_shmem(vma)))
>   return ERR_PTR(-EINVAL);
>  
>   if (is_vm_hugetlb_page(vma))
> -- 

Yet something to improve...

Thanks for extending MREMAP_DONTUNMAP to shmem, but I think this patch
goes in the wrong direction, complicating when it should be generalizing:
the mremap syscall is about rearranging the user's virtual address space,
and is not specific to the underlying anonymous or shmem or file object
(though so far you have only been interested in anonymous, and now shmem).

A better patch would say:
 
-   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-   vma->vm_flags & VM_SHARED))
+   if ((flags & MREMAP_DONTUNMAP) &&
+   (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
return ERR_PTR(-EINVAL);

VM_DONTEXPAND is what has long been used on special mappings, to prevent
surprises from mremap changing the size of the mapping: MREMAP_DONTUNMAP
introduced a different way of expanding the mapping, so VM_DONTEXPAND
still seems a reasonable name (I've thrown in VM_PFNMAP there because
it's in the VM_DONTEXPAND test lower down: for safety I guess, and best
if both behave the same - though one says -EINVAL and the other -EFAULT).

With that VM_DONTEXPAND check in, Dmitry's commit cd544fd1dc92
("mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio")
can still be reverted (as you agreed on 28th December), even though
vma_is_anonymous() will no longer protect it.

Was there an mremap(2) man page update for MREMAP_DONTUNMAP?
Whether or not there was before, it ought to get one now.

Thanks,
Hugh


Re: [PATCH v4 00/25] Page folios

2021-03-13 Thread Hugh Dickins
On Sat, 13 Mar 2021, Andrew Morton wrote:
> On Fri,  5 Mar 2021 04:18:36 + "Matthew Wilcox (Oracle)" 
>  wrote:
> 
> > Our type system does not currently distinguish between tail pages and
> > head or single pages.  This is a problem because we call compound_head()
> > multiple times (and the compiler cannot optimise it out), bloating the
> > kernel.  It also makes programming hard as it is often unclear whether
> > a function operates on an individual page, or an entire compound page.
> > 
> > This patch series introduces the struct folio, which is a type that
> > represents an entire compound page.  This initial set reduces the kernel
> > size by approximately 6kB, although its real purpose is adding
> > infrastructure to enable further use of the folio.
> 
> Geeze it's a lot of noise.  More things to remember and we'll forever
> have a mismash of `page' and `folio' and code everywhere converting
> from one to the other.  Ongoing addition of folio
> accessors/manipulators to overlay the existing page
> accessors/manipulators, etc.
> 
> It's unclear to me that it's all really worth it.  What feedback have
> you seen from others?

My own feeling and feedback have been much like yours.

I don't get very excited by type safety at this level; and although
I protested back when all those compound_head()s got tucked into the
*PageFlag() functions, the text size increase was not very much, and
I never noticed any adverse performance reports.

To me, it's distraction, churn and friction, ongoing for years; but
that's just me, and I'm resigned to the possibility that it will go in.
Matthew is not alone in wanting to pursue it: let others speak.

Hugh


Re: [PATCH v1] mm, hwpoison: enable error handling on shmem thp

2021-03-11 Thread Hugh Dickins
On Thu, 11 Mar 2021, Jue Wang wrote:
> On Thu, Mar 11, 2021 at 7:14 AM HORIGUCHI NAOYA(堀口 直也)
>  wrote:
> > On Wed, Mar 10, 2021 at 11:22:18PM -0800, Hugh Dickins wrote:
> > >
> > > I'm not much into memory-failure myself, but Jue discovered that the
> > > SIGBUS never arrives: because split_huge_page() on a shmem or file
> > > THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
> > > unmapped - in normal circumstances, to be faulted back on demand.
> > > So the page_mapped() check in hwpoison_user_mappings() fails,
> > > and the intended SIGBUS is not delivered.
> >
> > Thanks for the information.  The split behaves quite differently between
> > for anon thp and for shmem thp.  I saw some unexpected behavior in my
> > testing, maybe that's due to the difference.
> >
> > >
> > > (Or, is it acceptable that the SIGBUS is not delivered to those who
> > > have the huge page mapped: should it get delivered later, to anyone
> > > who faults back in the bad 4k?)
> >
> > Later access should report error in page fault, so the worst scenario
> > of consuming corrupted data does not happen, but precautionary signal
> > does not work so it's not acceptable.

On the other hand, if split_huge_page() does succeed, then there is an
argument that it would be better not to SIGBUS all mappers of parts of
the THP, but wait to select only those re-accessing the one bad 4k.

> In our experiment with SHMEM THPs, later accesses resulted in a zero
> page allocated instead of a SIGBUS with BUS_MCEERR_AR reported by the
> page fault handler. That part might be an opportunity to prevent some
> silent data corruption just in case.

Thanks for filling in more detail, Jue: I understand better now.

Maybe mm/shmem.c is wrong to be using generic_error_remove_page(),
the function which punches a hole on memory-failure.

That works well for filesystems backed by storage (at least when the
page had not been modified), because it does not (I think) actually
punch a hole in the stored object; and the next touch at that offset of
the file will allocate a new cache page to be filled from good storage.

But in the case of shmem (if we ignore the less likely swap cache case)
there is no storage to read back good data from, so the next touch just
fills a new cache page with zeroes (as you report above).

I don't know enough of the philosophy of memory-failure to say, but
I can see there's an argument for leaving the bad page in cache, to
give SIGBUS or EFAULT or EIO (whether by observation of PageHWPoison,
or by another MCE) to whoever accesses it - until the file or that
part of it is deleted (then that page never returned to use again).
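
For illustration, a minimal sketch of that alternative, assuming shmem
supplied its own ->error_remove_page instead of generic_error_remove_page()
(hypothetical code, not a posted patch):

	/*
	 * Hypothetical: decline to punch a hole and keep the poisoned
	 * page in the shmem page cache; the fault path would then also
	 * need to check PageHWPoison and fail with SIGBUS/EIO instead
	 * of handing back a fresh zeroed page.
	 */
	static int shmem_error_remove_page(struct address_space *mapping,
					   struct page *page)
	{
		return 0;	/* handled: page deliberately left in cache */
	}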

Hugh

Re: [PATCH v2 2/2] mm/memcg: set memcg when split page

2021-03-11 Thread Hugh Dickins
On Thu, 11 Mar 2021, Michal Hocko wrote:
> On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > Johannes, Hugh,
> > > 
> > > what do you think about this approach? If we want to stick with
> > > split_page approach then we need to update the missing place Matthew has
> > > pointed out.
> > 
> > I find the __free_pages() code quite tricky as well. But for that
> > reason I would actually prefer to initiate the splitting in there,
> > since that's the place where we actually split the page, rather than
> > spread the handling of this situation further out.
> > 
> > The race condition shouldn't be hot, so I don't think we need to be as
> > efficient about setting page->memcg_data only on the higher-order
> > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > which IMO should actually help document what's happening to the page.
> > 
> > I think that function could also benefit a bit more from step-by-step
> > documentation about what's going on. The kerneldoc is helpful, but I
> > don't think it does justice to how tricky this race condition is.
> > 
> > Something like this?
> > 
> > void __free_pages(struct page *page, unsigned int order)
> > {
> > /*
> >  * Drop the base reference from __alloc_pages and free. In
> >  * case there is an outstanding speculative reference, from
> >  * e.g. the page cache, it will put and free the page later.
> >  */
> > if (likely(put_page_testzero(page))) {
> > free_the_page(page, order);
> > return;
> > }
> > 
> > /*
> >  * The speculative reference will put and free the page.
> >  *
> >  * However, if the speculation was into a higher-order page
> >  * that isn't marked compound, the other side will know
> >  * nothing about our buddy pages and only free the order-0
> >  * page at the start of our chunk! We must split off and free
> >  * the buddy pages here.
> >  *
> >  * The buddy pages aren't individually refcounted, so they
> >  * can't have any pending speculative references themselves.
> >  */
> > if (!PageHead(page) && order > 0) {
> > split_page_memcg(page, 1 << order);
> > while (order-- > 0)
> > free_the_page(page + (1 << order), order);
> > }
> > }
> 
> Fine with me. Mathew was concerned about more places that do something
> similar but I would say that if we find out more places we might
> reconsider and currently stay with a reasonably clear model that it is
> only head patch that carries the memcg information and split_page_memcg
> is necessary to break such page into smaller pieces.

I agree: I do like Johannes' suggestion best, now that we already
have split_page_memcg().  Not too worried about contrived use of
free_unref_page() here; and whether non-compound high-order pages
should be perpetuated is a different discussion.

Hugh


Re: [PATCH v1] mm, hwpoison: enable error handling on shmem thp

2021-03-10 Thread Hugh Dickins
On Tue, 9 Feb 2021, Naoya Horiguchi wrote:

> From: Naoya Horiguchi 
> 
> Currently hwpoison code checks PageAnon() for thp and refuses to handle
> errors on non-anonymous thps (just for historical reason).  We now
> support non-anonymous thp such as shmem thp, so this patch enables
> handling of shmem thps. Fortunately, we already have can_split_huge_page()
> to check if a given thp is splittable, so this patch relies on it.

Fortunately? I don't understand. Why call can_split_huge_page()
at all, instead of simply trying split_huge_page() directly?
And could it do better than -EBUSY when split_huge_page() fails?
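
To make the question concrete, a minimal sketch of the direct approach
(my illustration, not a patch; it assumes, as the existing callers do,
that PageTransHuge() has already been checked), propagating whatever
error split_huge_page() itself reports:

	static int try_to_split_thp_page(struct page *page, const char *msg)
	{
		int ret;

		lock_page(page);
		ret = split_huge_page(page);
		unlock_page(page);
		if (ret) {
			pr_info("%s: %#lx: thp split failed\n",
				msg, page_to_pfn(page));
			put_page(page);
		}
		return ret;
	}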

> 
> Signed-off-by: Naoya Horiguchi 

Thanks for trying to add shmem+file THP support, but I think this
does not work as intended - Andrew, if Naoya agrees, please drop from
mmotm for now, the fixup needed will be more than a line or two.

I'm not much into memory-failure myself, but Jue discovered that the
SIGBUS never arrives: because split_huge_page() on a shmem or file
THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
unmapped - in normal circumstances, to be faulted back on demand.
So the page_mapped() check in hwpoison_user_mappings() fails,
and the intended SIGBUS is not delivered.

(Or, is it acceptable that the SIGBUS is not delivered to those who
have the huge page mapped: should it get delivered later, to anyone
who faults back in the bad 4k?)

We believe the tokill list has to be set up earlier, before
split_huge_page() is called, then passed in to hwpoison_user_mappings().

Sorry, we don't have a proper patch for that right now, but I expect
you can see what needs to be done.  But something we found on the way,
we do have a patch for: add_to_kill() uses page_address_in_vma(), but
that has not been used on file THP tails before - fix appended at the
end below, so as not to waste your time on that bit.
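
A rough outline of that ordering, purely illustrative (collect_procs()
and MF_ACTION_REQUIRED are the existing names in mm/memory-failure.c;
the extra parameter to hwpoison_user_mappings() is an assumption):

	/* in memory_failure(), while the pmds/ptes still exist: */
	LIST_HEAD(tokill);

	collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);

	if (try_to_split_thp_page(p, "Memory failure"))
		return -EBUSY;

	/*
	 * ...and hwpoison_user_mappings() would take this pre-built
	 * tokill list as a new argument, instead of calling
	 * collect_procs() itself after the split has emptied the
	 * page tables.
	 */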

> ---
>  mm/memory-failure.c | 34 +-
>  1 file changed, 9 insertions(+), 25 deletions(-)
> 
> diff --git v5.11-rc6/mm/memory-failure.c v5.11-rc6_patched/mm/memory-failure.c
> index e9481632fcd1..8c1575de0507 100644
> --- v5.11-rc6/mm/memory-failure.c
> +++ v5.11-rc6_patched/mm/memory-failure.c
> @@ -956,20 +956,6 @@ static int __get_hwpoison_page(struct page *page)
>  {
>   struct page *head = compound_head(page);
>  
> - if (!PageHuge(head) && PageTransHuge(head)) {
> - /*
> -  * Non anonymous thp exists only in allocation/free time. We
> -  * can't handle such a case correctly, so let's give it up.
> -  * This should be better than triggering BUG_ON when kernel
> -  * tries to touch the "partially handled" page.
> -  */
> - if (!PageAnon(head)) {
> - pr_err("Memory failure: %#lx: non anonymous thp\n",
> - page_to_pfn(page));
> - return 0;
> - }
> - }
> -
>   if (get_page_unless_zero(head)) {
>   if (head == compound_head(page))
>   return 1;
> @@ -1197,21 +1183,19 @@ static int identify_page_state(unsigned long pfn, 
> struct page *p,
>  
>  static int try_to_split_thp_page(struct page *page, const char *msg)
>  {
> - lock_page(page);
> - if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> - unsigned long pfn = page_to_pfn(page);
> + struct page *head;
>  
> + lock_page(page);
> + head = compound_head(page);
> + if (PageTransHuge(head) && can_split_huge_page(head, NULL) &&
> + !split_huge_page(page)) {
>   unlock_page(page);
> - if (!PageAnon(page))
> - pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> - else
> - pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> - put_page(page);
> - return -EBUSY;
> + return 0;
>   }
> + pr_info("%s: %#lx: thp split failed\n", msg, page_to_pfn(page));
>   unlock_page(page);
> -
> - return 0;
> + put_page(page);
> + return -EBUSY;
>  }
>  
>  static int memory_failure_hugetlb(unsigned long pfn, int flags)
> -- 
> 2.25.1

[PATCH] mm: fix page_address_in_vma() on file THP tails
From: Jue Wang 

Anon THP tails were already supported, but memory-failure now needs to use
page_address_in_vma() on file THP tails, which its page->mapping check did
not permit: fix it.

Signed-off-by: Jue Wang 
Signed-off-by: Hugh Dickins 
---

 mm/rmap.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

--- 5.12-rc2/mm/rmap.c  2021-02-28 16:58:57.950450151 -0800
+++ linux/mm/rmap.c 2021-03-10 20:29:21.59147517
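
The hunk itself is cut off in this archive.  As a hedged sketch only (my
reconstruction of the shape of the fix described above, not the missing
diff): the check in page_address_in_vma() would compare against the head
page's mapping, since a file THP tail does not itself carry the file's
mapping:

	/* sketch: tails don't hold the file mapping, the head page does */
	} else if (!vma->vm_file) {
		return -EFAULT;
	} else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
		return -EFAULT;
	}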

Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg

2021-03-10 Thread Hugh Dickins
On Thu, 11 Mar 2021, Singh, Balbir wrote:
> On 9/3/21 7:28 pm, Michal Hocko wrote:
> > On Tue 09-03-21 09:37:29, Balbir Singh wrote:
> >> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> > [...]
> >>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>  /*
> >>> - * Because page_memcg(head) is not set on compound tails, set it now.
> >>> + * Because page_memcg(head) is not set on tails, set it now.
> >>>   */
> >>> -void mem_cgroup_split_huge_fixup(struct page *head)
> >>> +void split_page_memcg(struct page *head, unsigned int nr)
> >>>  {
> >>
> >> Do we need input validation on nr? Can nr be aribtrary or can we enforce
> >>
> >> VM_BUG_ON(!is_power_of_2(nr));
> > 
> > In practice this will be power of 2 but why should we bother to sanitize
> > that? 
> > 
> 
> Just when DEBUG_VM is enabled to ensure the contract is valid, given that
> nr is now variable, we could end up with subtle bugs unless we can audit
> all callers. Even the power of 2 check does not catch the fact that nr
> is indeed what we expect, but it still checks a large range of invalid
> inputs.

I think you imagine this is something it's not.

"all callers" are __split_huge_page() and split_page() (maybe Matthew
will have a third caller, maybe not).  It is not something drivers will
be calling directly themselves, and it won't ever get EXPORTed to them.

Hugh


Re: [PATCH 1/2] iwlwifi: fix DVM build regression in 5.12-rc

2021-03-06 Thread Hugh Dickins
On Sat, 6 Mar 2021, Sedat Dilek wrote:
> On Sat, Mar 6, 2021 at 8:48 PM Hugh Dickins  wrote:
> >
> > There is no iwl_so_trans_cfg if CONFIG_IWLDVM but not CONFIG_IWLMVM:
> > move the CONFIG_IWLMVM guard up before the problematic SnJ workaround
> > to fix the build breakage.
> >
> > Fixes: 930be4e76f26 ("iwlwifi: add support for SnJ with Jf devices")
> > Signed-off-by: Hugh Dickins 
> 
> See "iwlwifi: pcie: fix iwl_so_trans_cfg link error when CONFIG_IWLMVM
> is disabled" in [1].
> 
> - Sedat -
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git/commit/?id=62541e266703549550e77fd46138422dbdc881f1

Thanks for looking out that and the other one, Sedat: I swear I checked
linux-next before sending, but my check seems to have been... defective.

Hugh


[PATCH 2/2] iwlwifi: fix DVM boot regression in 5.12-rc

2021-03-06 Thread Hugh Dickins
No time_point op has been provided for DVM: check for NULL before
calling, to fix the oops (blank screen booting non-modular kernel).

Fixes: d01293154c0a ("iwlwifi: dbg: add op_mode callback for collecting debug 
data.")
Signed-off-by: Hugh Dickins 
---

 drivers/net/wireless/intel/iwlwifi/iwl-op-mode.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 5.12-rc2/drivers/net/wireless/intel/iwlwifi/iwl-op-mode.h   2021-02-28 
16:58:55.058425551 -0800
+++ linux/drivers/net/wireless/intel/iwlwifi/iwl-op-mode.h  2021-03-05 
20:59:14.156217412 -0800
@@ -205,7 +205,8 @@ static inline void iwl_op_mode_time_poin
  enum iwl_fw_ini_time_point tp_id,
  union iwl_dbg_tlv_tp_data *tp_data)
 {
-   op_mode->ops->time_point(op_mode, tp_id, tp_data);
+   if (op_mode->ops->time_point)
+   op_mode->ops->time_point(op_mode, tp_id, tp_data);
 }
 
 #endif /* __iwl_op_mode_h__ */


[PATCH 1/2] iwlwifi: fix DVM build regression in 5.12-rc

2021-03-06 Thread Hugh Dickins
There is no iwl_so_trans_cfg if CONFIG_IWLDVM but not CONFIG_IWLMVM:
move the CONFIG_IWLMVM guard up before the problematic SnJ workaround
to fix the build breakage.

Fixes: 930be4e76f26 ("iwlwifi: add support for SnJ with Jf devices")
Signed-off-by: Hugh Dickins 
---

 drivers/net/wireless/intel/iwlwifi/pcie/drv.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 5.12-rc2/drivers/net/wireless/intel/iwlwifi/pcie/drv.c  2021-02-28 
16:58:55.082425755 -0800
+++ linux/drivers/net/wireless/intel/iwlwifi/pcie/drv.c 2021-03-05 
18:42:53.650809293 -0800
@@ -1106,6 +1106,7 @@ static int iwl_pci_probe(struct pci_dev
}
}
 
+#if IS_ENABLED(CONFIG_IWLMVM)
/*
 * Workaround for problematic SnJ device: sometimes when
 * certain RF modules are connected to SnJ, the device ID
@@ -1116,7 +1117,6 @@ static int iwl_pci_probe(struct pci_dev
if (CSR_HW_REV_TYPE(iwl_trans->hw_rev) == IWL_CFG_MAC_TYPE_SNJ)
iwl_trans->trans_cfg = &iwl_so_trans_cfg;
 
-#if IS_ENABLED(CONFIG_IWLMVM)
/*
 * special-case 7265D, it has the same PCI IDs.
 *


Re: [PATCH v4] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Hugh Dickins
On Fri, 5 Mar 2021, Shakeel Butt wrote:
> On Fri, Mar 5, 2021 at 1:26 PM Shakeel Butt  wrote:
> >
> > Currently the kernel adds the page, allocated for swapin, to the
> > swapcache before charging the page. This is fine but now we want a
> > per-memcg swapcache stat which is essential for folks who wants to
> > transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> > swap counters. In addition charging a page before exposing it to other
> > parts of the kernel is a step in the right direction.
> >
> > To correctly maintain the per-memcg swapcache stat, this patch has
> > adopted to charge the page before adding it to swapcache. One
> > challenge in this option is the failure case of add_to_swap_cache() on
> > which we need to undo the mem_cgroup_charge(). Specifically undoing
> > mem_cgroup_uncharge_swap() is not simple.
> >
> > To resolve the issue, this patch introduces transaction like interface
> > to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> > initiates the charging of the page and mem_cgroup_finish_swapin_page()
> > completes the charging process. So, the kernel starts the charging
> > process of the page for swapin with mem_cgroup_charge_swapin_page(),
> > adds the page to the swapcache and on success completes the charging
> > process with mem_cgroup_finish_swapin_page().
> 
> And of course I forgot to update the commit message.
> 
> Andrew, please replace the third paragraph with the following para:
> 
> To resolve the issue, this patch decouples the charging for swapin pages from
> mem_cgroup_charge(). Two new functions are introduced,
> mem_cgroup_swapin_charge_page() for just charging the swapin page and
> mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot once the
> page has been successfully added to the swapcache.

Lgtm
Hugh


Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Hugh Dickins
On Wed, 3 Mar 2021, Shakeel Butt wrote:

> Currently the kernel adds the page, allocated for swapin, to the
> swapcache before charging the page. This is fine but now we want a
> per-memcg swapcache stat which is essential for folks who wants to
> transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> swap counters. In addition charging a page before exposing it to other
> parts of the kernel is a step in the right direction.
> 
> To correctly maintain the per-memcg swapcache stat, this patch has
> adopted to charge the page before adding it to swapcache. One
> challenge in this option is the failure case of add_to_swap_cache() on
> which we need to undo the mem_cgroup_charge(). Specifically undoing
> mem_cgroup_uncharge_swap() is not simple.
> 
> To resolve the issue, this patch introduces transaction like interface
> to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> initiates the charging of the page and mem_cgroup_finish_swapin_page()
> completes the charging process. So, the kernel starts the charging
> process of the page for swapin with mem_cgroup_charge_swapin_page(),
> adds the page to the swapcache and on success completes the charging
> process with mem_cgroup_finish_swapin_page().
> 
> Signed-off-by: Shakeel Butt 

Quite apart from helping with the stat you want, what you've ended
up with here is a nice cleanup in several different ways (and I'm
glad Johannes talked you out of __GFP_NOFAIL: much better like this).
I'll say

Acked-by: Hugh Dickins 

but I am quite unhappy with the name mem_cgroup_finish_swapin_page():
it doesn't finish the swapin, it doesn't finish the page, and I'm
not persuaded by your paragraph above that there's any "transaction"
here (if there were, I'd suggest "commit" instead of "finish"; and
I'd get worried by the css_put before it's called - but no, that's
fine, it's independent).

How about complementing mem_cgroup_charge_swapin_page() with
mem_cgroup_uncharge_swapin_swap()?  I think that describes well
what it does, at least in the do_memsw_account() case, and I hope
we can overlook that it does nothing at all in the other case.

And it really doesn't need a page argument: both places it's called
have just allocated an order-0 page, there's no chance of a THP here;
but you might have some idea of future expansion, or matching
put_swap_page() - I won't object if you prefer to pass in the page.

But more interesting, though off-topic, comments on it below...

> +/*
> + * mem_cgroup_finish_swapin_page - complete the swapin page charge 
> transaction
> + * @page: page charged for swapin
> + * @entry: swap entry for which the page is charged
> + *
> + * This function completes the transaction of charging the page allocated for
> + * swapin.
> + */
> +void mem_cgroup_finish_swapin_page(struct page *page, swp_entry_t entry)
> +{
>   /*
>* Cgroup1's unified memory+swap counter has been charged with the
>* new swapcache page, finish the transfer by uncharging the swap
> @@ -6760,20 +6796,14 @@ int mem_cgroup_charge(struct page *page, struct 
> mm_struct *mm, gfp_t gfp_mask)
>* correspond 1:1 to page and swap slot lifetimes: we charge the
>* page to memory here, and uncharge swap when the slot is freed.
>*/
> - if (do_memsw_account() && PageSwapCache(page)) {
> - swp_entry_t entry = { .val = page_private(page) };
> + if (!mem_cgroup_disabled() && do_memsw_account()) {

I understand why you put that !mem_cgroup_disabled() check in there,
but I have a series of observations on that.

First I was going to say that it would be better left to
mem_cgroup_uncharge_swap() itself.

Then I was going to say that I think it's already covered here
by the cgroup_memory_noswap check inside do_memsw_account().

Then, going back to mem_cgroup_uncharge_swap(), I realized that 5.8's
2d1c498072de ("mm: memcontrol: make swap tracking an integral part of
memory control") removed the do_swap_account or cgroup_memory_noswap
checks from mem_cgroup_uncharge_swap() and swap_cgroup_swapon() and
swap_cgroup_swapoff() - so since then we have been allocating totally
unnecessary swap_cgroup arrays when mem_cgroup_disabled() (and
mem_cgroup_uncharge_swap() has worked by reading the zalloced array).

I think, or am I confused? If I'm right on that, one of us ought to
send another patch putting back, either cgroup_memory_noswap checks
or mem_cgroup_disabled() checks in those three places - I suspect the
static key mem_cgroup_disabled() is preferable, but I'm getting dozy.

Whatever we do with that - and it's really not any business for this
patch - I think you can drop the mem_cgroup_disabled() check from
mem_cgroup_uncharge_swapin_swap().
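
Putting those suggestions together, a minimal sketch of the helper being
proposed here (the name, the dropped page argument and the dropped
mem_cgroup_disabled() check are all suggestions from this mail, not the
merged code):

	/*
	 * Sketch: uncharge only the swap slot; no page argument, since
	 * both callers have just allocated an order-0 page, and
	 * do_memsw_account() is assumed to cover the disabled/noswap
	 * cases per the reasoning above.
	 */
	void mem_cgroup_uncharge_swapin_swap(swp_entry_t entry)
	{
		if (do_memsw_account())
			mem_cgroup_uncharge_swap(entry, 1);
	}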

>   /*
>   

Re: [PATCH] mm/memcg: set memcg when split pages

2021-03-03 Thread Hugh Dickins
On Tue, 2 Mar 2021, Johannes Weiner wrote:
> On Tue, Mar 02, 2021 at 12:24:41PM -0800, Hugh Dickins wrote:
> > On Tue, 2 Mar 2021, Michal Hocko wrote:
> > > [Cc Johannes for awareness and fixup Nick's email]
> > > 
> > > On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > > > When split page, the memory cgroup info recorded in first page is
> > > > not copied to tail pages. In this case, when the tail pages are
> > > > freed, the uncharge operation is not performed. As a result, the
> > > > usage of this memcg keeps increasing, and the OOM may occur.
> > > > 
> > > > So, the copying of first page's memory cgroup info to tail pages
> > > > is needed when split page.
> > > 
> > > I was not aware that alloc_pages_exact is used for accounted allocations
> > > but git grep told me otherwise so this is not a theoretical one. Both
> > > users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> > > used in dma allocator but I got lost in indirection so I have no idea
> > > whether there are any users there.
> > 
> > Yes, it's a bit worrying that such a low-level thing as split_page()
> > can now get caught up in memcg accounting, but I suppose that's okay.
> > 
> > I feel rather strongly that whichever way it is done, THP splitting
> > and split_page() should use the same interface to memcg.
> > 
> > And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
> > there need to be css_get()s too - or better, a css_get_many().
> > 
> > Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
> > it mem_cgroup_split_page_fixup(), and take order from caller.
> 
> +1
> 
> There is already a split_page_owner() in both these places as well
> which does a similar thing. Mabye we can match that by calling it
> split_page_memcg() and having it take a nr of pages?

Agreed on both counts :) "fixup" was not an inspiring name.

> 
> > Though I've never much liked that separate pass: would it be
> > better page by page, like this copy_page_memcg() does?  Though
> > mem_cgroup_disabled() and css_getting make that less appealing.
> 
> Agreed on both counts. mem_cgroup_disabled() is a jump label and would
> be okay, IMO, but the refcounting - though it is (usually) per-cpu -
> adds at least two branches and rcu read locking.


Re: [PATCH] mm/memcg: set memcg when split pages

2021-03-02 Thread Hugh Dickins
On Tue, 2 Mar 2021, Michal Hocko wrote:
> [Cc Johannes for awareness and fixup Nick's email]
> 
> On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > When split page, the memory cgroup info recorded in first page is
> > not copied to tail pages. In this case, when the tail pages are
> > freed, the uncharge operation is not performed. As a result, the
> > usage of this memcg keeps increasing, and the OOM may occur.
> > 
> > So, the copying of first page's memory cgroup info to tail pages
> > is needed when split page.
> 
> I was not aware that alloc_pages_exact is used for accounted allocations
> but git grep told me otherwise so this is not a theoretical one. Both
> users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> used in dma allocator but I got lost in indirection so I have no idea
> whether there are any users there.

Yes, it's a bit worrying that such a low-level thing as split_page()
can now get caught up in memcg accounting, but I suppose that's okay.

I feel rather strongly that whichever way it is done, THP splitting
and split_page() should use the same interface to memcg.

And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
there need to be css_get()s too - or better, a css_get_many().

Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
it mem_cgroup_split_page_fixup(), and take order from caller.

Though I've never much liked that separate pass: would it be
better page by page, like this copy_page_memcg() does?  Though
mem_cgroup_disabled() and css_getting make that less appealing.
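
A minimal sketch of that shared helper, with the css references taken in
bulk (illustrative only; the final form is whatever Zhou and Johannes
settle on):

	/*
	 * Sketch: one helper for both THP splitting and split_page(),
	 * copying the head's memcg_data to the tails and taking the
	 * matching css references with css_get_many().
	 */
	void split_page_memcg(struct page *head, unsigned int nr)
	{
		struct mem_cgroup *memcg = page_memcg(head);
		int i;

		if (mem_cgroup_disabled() || !memcg)
			return;

		for (i = 1; i < nr; i++)
			head[i].memcg_data = head->memcg_data;
		css_get_many(&memcg->css, nr - 1);
	}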

Hugh

> 
> The patch itself looks reasonable to me.
> 
> > Signed-off-by: Zhou Guanghui 
> 
> Acked-by: Michal Hocko 
> 
> Minor nit
> 
> > ---
> >  include/linux/memcontrol.h | 10 ++
> >  mm/page_alloc.c|  4 +++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e6dc793d587d..c7e2b4421dc1 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup 
> > *memcg);
> >  extern bool cgroup_memory_noswap;
> >  #endif
> >  
> > +static inline void copy_page_memcg(struct page *dst, struct page *src)
> > +{
> > +   if (src->memcg_data)
> > +   dst->memcg_data = src->memcg_data;
> 
> I would just drop the test. The struct page is a single cache line which
> is dirty by the reference count so another store will unlikely be
> noticeable even when NULL is stored here and you safe a conditional.
> 
> > +}
> > +
> >  struct mem_cgroup *lock_page_memcg(struct page *page);
> >  void __unlock_page_memcg(struct mem_cgroup *memcg);
> >  void unlock_page_memcg(struct page *page);
> > @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup 
> > *memcg)
> >  {
> >  }
> >  
> > +static inline void copy_page_memcg(struct page *dst, struct page *src)
> > +{
> > +}
> > +
> >  static inline struct mem_cgroup *lock_page_memcg(struct page *page)
> >  {
> > return NULL;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3e4b29ee2b1e..ee0a63dc1c9b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int 
> > order)
> > VM_BUG_ON_PAGE(PageCompound(page), page);
> > VM_BUG_ON_PAGE(!page_count(page), page);
> >  
> > -   for (i = 1; i < (1 << order); i++)
> > +   for (i = 1; i < (1 << order); i++) {
> > set_page_refcounted(page + i);
> > +   copy_page_memcg(page + i, page);
> > +   }
> > split_page_owner(page, 1 << order);
> >  }
> >  EXPORT_SYMBOL_GPL(split_page);
> > -- 
> > 2.25.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


[PATCH v2 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats

2021-03-02 Thread Hugh Dickins
vmstat_refresh() can occasionally catch nr_zone_write_pending and
nr_writeback when they are transiently negative.  The reason is partly
that the interrupt which decrements them in test_clear_page_writeback()
can come in before __test_set_page_writeback() got to increment them;
but transient negatives are still seen even when that is prevented, and
I am not yet certain why (but see Roman's note below).  Those stats are
not buggy, they have never been seen to drift away from 0 permanently:
so just avoid the annoyance of showing a warning on them.

Similarly avoid showing a warning on nr_free_cma: CMA users have seen
that one reported negative from /proc/sys/vm/stat_refresh too, but it
does drift away permanently: I believe that's because its incrementation
and decrementation are decided by page migratetype, but the migratetype
of a pageblock is not guaranteed to be constant.

Roman Gushchin points out:
For performance reasons, vmstat counters are incremented and decremented
using per-cpu batches.  vmstat_refresh() flushes the per-cpu batches on
all CPUs, to get values as accurate as possible; but this method is not
atomic, so the resulting value is not always precise.  As a consequence,
for those counters whose actual value is close to 0, a small negative
value may occasionally be reported.  If the value is small and the state
is transient, it is not an indication of an error.

Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-g...@fb.com/
Reported-by: Roman Gushchin 
Signed-off-by: Hugh Dickins 
---

 mm/vmstat.c |   15 +++
 1 file changed, 15 insertions(+)

--- vmstat2/mm/vmstat.c 2021-02-25 11:56:18.0 -0800
+++ vmstat3/mm/vmstat.c 2021-02-25 12:42:15.0 -0800
@@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
if (err)
return err;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+   /*
+* Skip checking stats known to go negative occasionally.
+*/
+   switch (i) {
+   case NR_ZONE_WRITE_PENDING:
+   case NR_FREE_CMA_PAGES:
+   continue;
+   }
val = atomic_long_read(&vm_zone_stat[i]);
if (val < 0) {
pr_warn("%s: %s %ld\n",
@@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
}
 #endif
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+   /*
+* Skip checking stats known to go negative occasionally.
+*/
+   switch (i) {
+   case NR_WRITEBACK:
+   continue;
+   }
val = atomic_long_read(&vm_node_stat[i]);
if (val < 0) {
pr_warn("%s: %s %ld\n",


Re: [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats

2021-03-01 Thread Hugh Dickins
On Sun, 28 Feb 2021, Roman Gushchin wrote:
> On Thu, Feb 25, 2021 at 03:14:03PM -0800, Hugh Dickins wrote:
> > vmstat_refresh() can occasionally catch nr_zone_write_pending and
> > nr_writeback when they are transiently negative.  The reason is partly
> > that the interrupt which decrements them in test_clear_page_writeback()
> > can come in before __test_set_page_writeback() got to increment them;
> > but transient negatives are still seen even when that is prevented, and
> > we have not yet resolved why (Roman believes that it is an unavoidable
> > consequence of the refresh scheduled on each cpu).  But those stats are
> > not buggy, they have never been seen to drift away from 0 permanently:
> > so just avoid the annoyance of showing a warning on them.
> > 
> > Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> > that one reported negative from /proc/sys/vm/stat_refresh too, but it
> > does drift away permanently: I believe that's because its incrementation
> > and decrementation are decided by page migratetype, but the migratetype
> > of a pageblock is not guaranteed to be constant.
> > 
> > Use switch statements so we can most easily add or remove cases later.
> 
> I'm OK with the code, but I can't fully agree with the commit log. I don't 
> think
> there is any mystery around negative values. Let me copy-paste the explanation
> from my original patch:
> 
> These warnings* are generated by the vmstat_refresh() function, which
> assumes that atomic zone and numa counters can't go below zero.  However,
> on a SMP machine it's not quite right: due to per-cpu caching it can in
> theory be as low as -(zone threshold) * NR_CPUs.
> 
> For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> reached 0.  Then we've reclaimed a small number of cma pages on each CPU
> except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
> positive (the atomic counter is still 0).  Then somebody on CPU0 consumes
> all these pages.  The number of pages can easily exceed the threshold and
> a negative value will be committed to the atomic counter.
> 
> * warnings about negative NR_FREE_CMA_PAGES

Hi Roman, thanks for your Acks on the others - and indeed this
is the one on which disagreement was more to be expected.

I certainly wanted (and included below) a Link to your original patch;
and even wondered whether to paste your description into mine.
But I read it again and still have issues with it.

Mainly, it does not convey at all, that touching stat_refresh adds the
per-cpu counts into the global atomics, resetting per-cpu counts to 0.
Which does not invalidate your explanation: races might still manage
to underflow; but it does take the "easily" out of "can easily exceed".

Since I don't use CMA on any machine, I cannot be sure, but it looked
like a bad example to rely upon, because of its migratetype-based
accounting.  If you use /proc/sys/vm/stat_refresh frequently enough,
without suppressing the warning, I guess that uncertainty could be
resolved by checking whether nr_free_cma is seen with negative value
in consecutive refreshes - which would tend to support my migratetype
theory - or only singly - which would support your raciness theory.

> 
> Actually, the same is almost true for ANY other counter. What differs CMA, 
> dirty
> and write pending counters is that they can reach 0 value under normal 
> conditions.
> Other counters are usually not reaching values small enough to see negative 
> values
> on a reasonable sized machine.

Looking through /proc/vmstat now, yes, I can see that there are fewer
counters which hover near 0 than I had imagined: more have a positive
bias, or are monotonically increasing.  And I'd be lying if I said I'd
never seen any others than nr_writeback or nr_zone_write_pending caught
negative.  But what are you asking for?  Should the patch be changed, to
retry the refresh_vm_stats() before warning, if it sees any negative?
Depends on how terrible one line in dmesg is considered!

> 
> Does it makes sense?

I'm not sure: you were not asking for the patch to be changed, but
its commit log: and I better not say "Roman believes that it is an
unavoidable consequence of the refresh scheduled on each cpu" if
that's untrue (or unclear: now it reads to me as if we're accusing
the refresh of messing things up, whereas it's the non-atomic nature
of the refresh which leaves it vulnerable to races).

Hugh

> 
> > 
> > Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-g...@fb.com/
> > Reported-by: Roman Gushchin 
> > Signed-off-by: Hugh Dickins 
> > ---
> > 
> >  mm/vmstat.c |   15 +++
>  1 file changed, 15 insertions(+)

Re: [PATCH v2 3/3] mm: use PF_ONLY_HEAD for PG_active and PG_unevictable

2021-03-01 Thread Hugh Dickins
On Mon, 1 Mar 2021, Yu Zhao wrote:
> On Mon, Mar 01, 2021 at 02:50:07PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Feb 26, 2021 at 12:13:14PM +, Matthew Wilcox wrote:
> > > On Fri, Feb 26, 2021 at 02:17:18AM -0700, Yu Zhao wrote:
> > > > All places but one test, set or clear PG_active and PG_unevictable on
> > > > small or head pages. Use compound_head() explicitly for that singleton
> > > > so the rest can rid of redundant compound_head().
> > > 
> > > How do you know it's only one place?  I really wish you'd work with me
> > > on folios.  They make the compiler prove that it's not a tail page.
> > 
> > +1 to this.
> > 
> > The problem with compound_head() is systemic and ad-hoc solution to few
> > page flags will only complicate the picture.
> 
> Well, I call it an incremental improvement, and how exactly does it
> complicate the picture?
> 
> I see your point: you prefer a complete replacement. But my point is
> not about the preference; it's about presenting an option: I'm not
> saying we have to go with this series; I'm saying if you don't want
> to wait, here is something quick but not perfect.

+1 to this.

Hugh


Re: [PATCH 1/2] mm: Guard a use of node_reclaim_distance with CONFIFG_NUMA

2021-02-26 Thread Hugh Dickins
On Fri, 26 Feb 2021, Palmer Dabbelt wrote:
> On Fri, 26 Feb 2021 17:31:40 PST (-0800), hu...@google.com wrote:
> > On Fri, 26 Feb 2021, Andrew Morton wrote:
> > > On Fri, 26 Feb 2021 12:17:20 -0800 Palmer Dabbelt 
> > > wrote:
> > > > From: Palmer Dabbelt 
> > > >
> > > > This is only useful under CONFIG_NUMA.  IIUC skipping the check is the
> > > > right thing to do here, as without CONFIG_NUMA there will never be any
> > > > large node distances on non-NUMA systems.
> > > >
> > > > I expected this to manifest as a link failure under (!CONFIG_NUMA &&
> > > > CONFIG_TRANSPARENT_HUGE_PAGES), but I'm not actually seeing that.  I
> > > > think the reference is just getting pruned before it's checked, but I
> > > > didn't get that from reading the code so I'm worried I'm missing
> > > > something.
> > > >
> > > > Either way, this is necessary to guard the definition of
> > > > node_reclaim_distance with CONFIG_NUMA.
> > > >
> > > > Signed-off-by: Palmer Dabbelt 
> > > > ---
> > > >  mm/khugepaged.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index a7d6cb912b05..b1bf191c3a54 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -819,8 +819,10 @@ static bool khugepaged_scan_abort(int nid)
> > > > for (i = 0; i < MAX_NUMNODES; i++) {
> > > > if (!khugepaged_node_load[i])
> > > > continue;
> > > > +#ifdef CONFIG_NUMA
> > > > if (node_distance(nid, i) > node_reclaim_distance)
> > > > return true;
> > > > +#endif
> > > > }
> > > > return false;
> > > >  }
> > > 
> > > This makes the entire loop a no-op.  Perhaps Kirill can help take a
> > > look at removing unnecessary code in khugepaged.c when CONFIG_NUMA=n?
> > 
> > First lines of khugepaged_scan_abort() say
> > if (!node_reclaim_mode)
> > return false;
> > 
> > And include/linux/swap.h says
> > #ifdef CONFIG_NUMA
> > extern int node_reclaim_mode;
> > extern int sysctl_min_unmapped_ratio;
> > extern int sysctl_min_slab_ratio;
> > #else
> > #define node_reclaim_mode 0
> > #endif
> > 
> > So, no need for an #ifdef CONFIG_NUMA inside khugepaged_scan_abort().
> 
> Ah, thanks, I hadn't seen that.  That certainly explains the lack of an
> undefined reference.
> 
> That said: do we generally rely on DCE to prune references to undefined
> symbols?  This particular one seems like it'd get reliably deleted, but it
> seems like a fragile thing to do in general.  This kind of stuff would
> certainly make some code easier to write, though.

Yes, the kernel build very much depends on the optimizer eliminating
dead code, in many many places.  We do prefer to keep the #ifdefs to
the header files as much as possible.
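
A toy illustration of the pattern (plain C, not kernel code): with the
constant-0 definition the compiler folds the condition, the dead branch
is eliminated at -O2, and the never-defined symbol is never referenced
at link time:

	#define node_reclaim_mode 0		/* the !CONFIG_NUMA stub */
	extern int node_reclaim_distance;	/* not defined in this build */

	static int scan_abort(int dist)
	{
		if (!node_reclaim_mode)
			return 0;
		/* dead code: optimized away, so no undefined reference */
		return dist > node_reclaim_distance;
	}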

> 
> I don't really care all that much, though, as I was just sending this along
> due to some build failure report from a user that I couldn't reproduce.  It
> looked like they had some out-of-tree stuff, so in this case I'm fine on
> fixing this being their problem.

I didn't see your 2/2 at the time; but wouldn't be surprised if that
needs 1/2, to avoid an error on undeclared node_reclaim_distance before
the optimizer comes into play.  If so, best just to drop 2/2 too.

Hugh


Re: [PATCH 1/2] mm: Guard a use of node_reclaim_distance with CONFIFG_NUMA

2021-02-26 Thread Hugh Dickins
On Fri, 26 Feb 2021, Andrew Morton wrote:
> On Fri, 26 Feb 2021 12:17:20 -0800 Palmer Dabbelt  wrote:
> > From: Palmer Dabbelt 
> > 
> > This is only useful under CONFIG_NUMA.  IIUC skipping the check is the
> > right thing to do here, as without CONFIG_NUMA there will never be any
> > large node distances on non-NUMA systems.
> > 
> > I expected this to manifest as a link failure under (!CONFIG_NUMA &&
> > CONFIG_TRANSPARENT_HUGE_PAGES), but I'm not actually seeing that.  I
> > think the reference is just getting pruned before it's checked, but I
> > didn't get that from reading the code so I'm worried I'm missing
> > something.
> > 
> > Either way, this is necessary to guard the definition of
> > node_reclaim_distance with CONFIG_NUMA.
> > 
> > Signed-off-by: Palmer Dabbelt 
> > ---
> >  mm/khugepaged.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index a7d6cb912b05..b1bf191c3a54 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -819,8 +819,10 @@ static bool khugepaged_scan_abort(int nid)
> > for (i = 0; i < MAX_NUMNODES; i++) {
> > if (!khugepaged_node_load[i])
> > continue;
> > +#ifdef CONFIG_NUMA
> > if (node_distance(nid, i) > node_reclaim_distance)
> > return true;
> > +#endif
> > }
> > return false;
> >  }
> 
> This makes the entire loop a no-op.  Perhaps Kirill can help take a
> look at removing unnecessary code in khugepaged.c when CONFIG_NUMA=n?

First lines of khugepaged_scan_abort() say
if (!node_reclaim_mode)
return false;

And include/linux/swap.h says
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
#else
#define node_reclaim_mode 0
#endif

So, no need for an #ifdef CONFIG_NUMA inside khugepaged_scan_abort().

Hugh


[PATCH 4/4] mm: /proc/sys/vm/stat_refresh stop checking monotonic numa stats

2021-02-25 Thread Hugh Dickins
All of the VM NUMA stats are event counts, incremented never decremented:
it is not very useful for vmstat_refresh() to check them throughout their
first aeon, then warn on them throughout their next.

Signed-off-by: Hugh Dickins 
---

 mm/vmstat.c |9 -
 1 file changed, 9 deletions(-)

--- vmstat3/mm/vmstat.c 2021-02-25 12:42:15.0 -0800
+++ vmstat4/mm/vmstat.c 2021-02-25 12:44:20.0 -0800
@@ -1854,15 +1854,6 @@ int vmstat_refresh(struct ctl_table *tab
__func__, zone_stat_name(i), val);
}
}
-#ifdef CONFIG_NUMA
-   for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-   val = atomic_long_read(&vm_numa_stat[i]);
-   if (val < 0) {
-   pr_warn("%s: %s %ld\n",
-   __func__, numa_stat_name(i), val);
-   }
-   }
-#endif
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
/*
 * Skip checking stats known to go negative occasionally.


[PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats

2021-02-25 Thread Hugh Dickins
vmstat_refresh() can occasionally catch nr_zone_write_pending and
nr_writeback when they are transiently negative.  The reason is partly
that the interrupt which decrements them in test_clear_page_writeback()
can come in before __test_set_page_writeback() got to increment them;
but transient negatives are still seen even when that is prevented, and
we have not yet resolved why (Roman believes that it is an unavoidable
consequence of the refresh scheduled on each cpu).  But those stats are
not buggy, they have never been seen to drift away from 0 permanently:
so just avoid the annoyance of showing a warning on them.

Similarly avoid showing a warning on nr_free_cma: CMA users have seen
that one reported negative from /proc/sys/vm/stat_refresh too, but it
does drift away permanently: I believe that's because its incrementation
and decrementation are decided by page migratetype, but the migratetype
of a pageblock is not guaranteed to be constant.

Use switch statements so we can most easily add or remove cases later.

Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-g...@fb.com/
Reported-by: Roman Gushchin 
Signed-off-by: Hugh Dickins 
---

 mm/vmstat.c |   15 +++
 1 file changed, 15 insertions(+)

--- vmstat2/mm/vmstat.c 2021-02-25 11:56:18.0 -0800
+++ vmstat3/mm/vmstat.c 2021-02-25 12:42:15.0 -0800
@@ -1840,6 +1840,14 @@ int vmstat_refresh(struct ctl_table *tab
if (err)
return err;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+   /*
+* Skip checking stats known to go negative occasionally.
+*/
+   switch (i) {
+   case NR_ZONE_WRITE_PENDING:
+   case NR_FREE_CMA_PAGES:
+   continue;
+   }
val = atomic_long_read(&vm_zone_stat[i]);
if (val < 0) {
pr_warn("%s: %s %ld\n",
@@ -1856,6 +1864,13 @@ int vmstat_refresh(struct ctl_table *tab
}
 #endif
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+   /*
+* Skip checking stats known to go negative occasionally.
+*/
+   switch (i) {
+   case NR_WRITEBACK:
+   continue;
+   }
val = atomic_long_read(&vm_node_stat[i]);
if (val < 0) {
pr_warn("%s: %s %ld\n",


[PATCH 2/4] mm: no more EINVAL from /proc/sys/vm/stat_refresh

2021-02-25 Thread Hugh Dickins
EINVAL was good for drawing the refresher's attention to a warning in
dmesg, but became very tiresome when running test suites scripted with
"set -e": an underflow from a bug in one feature would cause unrelated
tests much later to fail, just because their /proc/sys/vm/stat_refresh
touch failed with that error. Stop doing that.

Signed-off-by: Hugh Dickins 
---

 mm/vmstat.c |5 -
 1 file changed, 5 deletions(-)

--- vmstat1/mm/vmstat.c 2021-02-25 11:50:36.0 -0800
+++ vmstat2/mm/vmstat.c 2021-02-25 11:56:18.0 -0800
@@ -1844,7 +1844,6 @@ int vmstat_refresh(struct ctl_table *tab
if (val < 0) {
pr_warn("%s: %s %ld\n",
__func__, zone_stat_name(i), val);
-   err = -EINVAL;
}
}
 #ifdef CONFIG_NUMA
@@ -1853,7 +1852,6 @@ int vmstat_refresh(struct ctl_table *tab
if (val < 0) {
pr_warn("%s: %s %ld\n",
__func__, numa_stat_name(i), val);
-   err = -EINVAL;
}
}
 #endif
@@ -1862,11 +1860,8 @@ int vmstat_refresh(struct ctl_table *tab
if (val < 0) {
pr_warn("%s: %s %ld\n",
__func__, node_stat_name(i), val);
-   err = -EINVAL;
}
}
-   if (err)
-   return err;
if (write)
*ppos += *lenp;
else


[PATCH 1/4] mm: restore node stat checking in /proc/sys/vm/stat_refresh

2021-02-25 Thread Hugh Dickins
v4.7 52b6f46bc163 ("mm: /proc/sys/vm/stat_refresh to force vmstat update")
introduced vmstat_refresh(), with its vmstat underflow checking; then
v4.8 75ef71840539 ("mm, vmstat: add infrastructure for per-node vmstats")
split NR_VM_NODE_STAT_ITEMS out of NR_VM_ZONE_STAT_ITEMS without updating
vmstat_refresh(): so it has been missing out much of the vmstat underflow
checking ever since. Reinstate it. Thanks to Roman Gushchin 
for tangentially pointing this out.

Signed-off-by: Hugh Dickins 
---

 mm/vmstat.c |8 
 1 file changed, 8 insertions(+)

--- 5.12-rc/mm/vmstat.c 2021-02-24 12:03:55.0 -0800
+++ vmstat1/mm/vmstat.c 2021-02-25 11:50:36.0 -0800
@@ -1857,6 +1857,14 @@ int vmstat_refresh(struct ctl_table *tab
}
}
 #endif
+   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+   val = atomic_long_read(&vm_node_stat[i]);
+   if (val < 0) {
+   pr_warn("%s: %s %ld\n",
+   __func__, node_stat_name(i), val);
+   err = -EINVAL;
+   }
+   }
if (err)
return err;
if (write)


Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings

2021-02-25 Thread Hugh Dickins
On Wed, 24 Feb 2021, Roman Gushchin wrote:
> On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> > On Thu, 6 Aug 2020, Andrew Morton wrote:
> > > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin  wrote:
> > 
> > August, yikes, I thought it was much more recent.
> > 
> > > 
> > > > it seems that Hugh and me haven't reached a consensus here.
> > > > Can, you, please, not merge this patch into 5.9, so we would have
> > > > more time to find a solution, acceptable for all?
> > > 
> > > No probs.  I already had a big red asterisk on it ;)
> > 
> > I've a suspicion that Andrew might be tiring of his big red asterisk,
> > and wanting to unload
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> > into 5.12.
> > 
> > I would prefer not, and reiterate my Nack: but no great harm will
> > befall the cosmos if he overrules that, and it does go through to
> > 5.12 - I'll just want to revert it again later.  And I do think a
> > more straightforward way of suppressing those warnings would be just
> > to delete the code that issues them, rather than brushing them under
> > a carpet of overtuning.
> 
> I'm actually fine with either option. My only concern is that if somebody
> will try to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh
> together, they will get a false warning and report them to mm@ or will
> waste their time trying to debug a non-existing problem. It's not the end
> of the world.
> We can also make the warning conditional on CONFIG_DEBUG_VM, for example.
> 
> Please, let me know what's your preferred way to go forward.

My preferred way forward (for now: since we're all too busy to fix
the misbehaving stats) is for Andrew to drop your patch, and I'll post
three patches against current 5.12 in a few hours: one to restore the
check on the missing NR_VM_NODE_STAT_ITEMS, one to remove the -EINVAL
(which upsets test scripts at our end), one to suppress the warning on
nr_zone_write_pending, nr_writeback and nr_free_cma.

Hugh


Re: [PATCH v6 0/3] mm,thp,shm: limit shmem THP alloc gfp_mask

2021-02-24 Thread Hugh Dickins
On Wed, 24 Feb 2021, Rik van Riel wrote:
> On Wed, 2021-02-24 at 00:41 -0800, Hugh Dickins wrote:
> > On Mon, 14 Dec 2020, Vlastimil Babka wrote:
> > 
> > > > (There's also a specific issue with the gfp_mask limiting: I have
> > > > not yet reviewed the allowing and denying in detail, but it looks
> > > > like it does not respect the caller's GFP_ZONEMASK - the gfp in
> > > > shmem_getpage_gfp() and shmem_read_mapping_page_gfp() is there to
> > > > satisfy the gma500, which wanted to use shmem but could only
> > manage
> > > > DMA32.  I doubt it wants THPS, but shmem_enabled=force forces
> > them.)
> > 
> > Oh, I'd forgotten all about that gma500 aspect:
> > well, I can send a fixup later on.
> 
> I already have code to fix that, which somebody earlier
> in this discussion convinced me to throw away. Want me
> to send it as a patch 4/3 ?

If Andrew wants it all, yes, please do add that - thanks Rik.

Hugh


Re: [PATCH v6 0/3] mm,thp,shm: limit shmem THP alloc gfp_mask

2021-02-24 Thread Hugh Dickins
On Mon, 14 Dec 2020, Vlastimil Babka wrote:
> On 12/14/20 10:16 PM, Hugh Dickins wrote:
> > On Tue, 24 Nov 2020, Rik van Riel wrote:
> > 
> >> The allocation flags of anonymous transparent huge pages can be controlled
> >> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> >> help the system from getting bogged down in the page reclaim and compaction
> >> code when many THPs are getting allocated simultaneously.
> >> 
> >> However, the gfp_mask for shmem THP allocations were not limited by those
> >> configuration settings, and some workloads ended up with all CPUs stuck
> >> on the LRU lock in the page reclaim code, trying to allocate dozens of
> >> THPs simultaneously.
> >> 
> >> This patch applies the same configurated limitation of THPs to shmem
> >> hugepage allocations, to prevent that from happening.
> >> 
> >> This way a THP defrag setting of "never" or "defer+madvise" will result
> >> in quick allocation failures without direct reclaim when no 2MB free
> >> pages are available.
> >> 
> >> With this patch applied, THP allocations for tmpfs will be a little
> >> more aggressive than today for files mmapped with MADV_HUGEPAGE,
> >> and a little less aggressive for files that are not mmapped or
> >> mapped without that flag.
> >> 
> >> v6: make khugepaged actually obey tmpfs mount flags
> >> v5: reduce gfp mask further if needed, to accomodate i915 (Matthew Wilcox)
> >> v4: rename alloc_hugepage_direct_gfpmask to vma_thp_gfp_mask (Matthew 
> >> Wilcox)
> >> v3: fix NULL vma issue spotted by Hugh Dickins & tested
> >> v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu
> > 
> > Andrew, please don't rush
> > 
> > mmthpshmem-limit-shmem-thp-alloc-gfp_mask.patch
> > mmthpshm-limit-gfp-mask-to-no-more-than-specified.patch
> > mmthpshmem-make-khugepaged-obey-tmpfs-mount-flags.patch
> > 
> > to Linus in your first wave of mmotm->5.11 sendings.
> > Or, alternatively, go ahead and send them to Linus, but
> > be aware that I'm fairly likely to want adjustments later.

And I have a suspicion that Andrew might want to send these in
for 5.12.

I spent a lot of time trying to find a compromise that would
satisfy us all, but failed to do so.  I kept hoping to find one
next day, so never reached the point of responding.

My fundamental objection to Rik's gfp_mask patches (the third
is different, looks good, though I never studied it properly) is
that (as I said right at the start) anyone who uses a huge=always
mount of tmpfs is already advising for huge pages. The situation is
different from anon, where everything on a machine with THP enabled
is liable to get huge pages, and an madvise necessary to distinguish
who wants.  (madvise is also quite the wrong tool for a filesystem.)

But when I tried modifying the patches to reflect that huge=always
already advises for huge, that did not give a satisfactory result
either: precisely what was wrong with every combination I tried,
I do not have to hand - again, I was hoping for a success which
did not arrive.

But if Andrew wants to put these in, I'll no longer object to their
inclusion: it seems wrong to me, to replace one unsatisfactory array
of choices by another unsatisfactory array of choices, but in the end
it's to be decided by what users prefer - if we hear of regressions
(people not getting the huge pages that they have come to expect),
then the patches will have to be reverted.

> > 
> > Sorry for limping along so far behind, but I still have more
> > re-reading of the threads to do, and I'm still investigating
> > why tmpfs huge=always becomes so ineffective in my testing with
> > these changes, even if I ramp up from default defrag=madvise to
> > defrag=always:
> > 5.10   mmotm
> > thp_file_alloc   4641788  216027
> > thp_file_fallback 275339 8895647

I never devised a suitable test to prove it, but I did come to
believe that the worrying scale of that regression comes from the
kind of unrealistic testing I'm doing, and would not be nearly so
bad in "real life".

Since I'm interested in exercising the assembly and splitting of
huge pages for testing, I'm happy to run kernel builds of many small
source files in a limited huge=always tmpfs in limited memory.  But
for that to work, for the files allocated hugely to fit in, it does
depend on direct reclaim and kswapd to split and shrink the ends of the
older files, so compaction can make huge pages available to the newer.

Whereas most people should be using huge tmpfs more appropriately,
for huge files.

> 
> So AFAICS before the

Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings

2021-02-23 Thread Hugh Dickins
On Thu, 6 Aug 2020, Andrew Morton wrote:
> On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin  wrote:

August, yikes, I thought it was much more recent.

> 
> > it seems that Hugh and me haven't reached a consensus here.
> > Can, you, please, not merge this patch into 5.9, so we would have
> > more time to find a solution, acceptable for all?
> 
> No probs.  I already had a big red asterisk on it ;)

I've a suspicion that Andrew might be tiring of his big red asterisk,
and wanting to unload
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
into 5.12.

I would prefer not, and reiterate my Nack: but no great harm will
befall the cosmos if he overrules that, and it does go through to
5.12 - I'll just want to revert it again later.  And I do think a
more straightforward way of suppressing those warnings would be just
to delete the code that issues them, rather than brushing them under
a carpet of overtuning.

I've been running mmotm with the patch below (shown as a sign of good
faith, and for you to try, but not ready to go yet) for a few months
now - overriding your max_drift, restoring nr_writeback and friends to
the same checking, fixing the obvious reason why nr_zone_write_pending
and nr_writeback are seen negative occasionally (an interrupt coming in
to decrement those stats before they have even been incremented).

Two big BUTs (if not asterisks): since adding that patch, I have
usually forgotten all about it, so forgotten to run the script that
echoes /proc/sys/vm/stat_refresh at odd intervals while under load:
so have less data than I'd intended by now.  And secondly (and I've
just checked again this evening) I do still see nr_zone_write_pending
and nr_writeback occasionally caught negative while under load.  So,
there's something more at play, perhaps the predicted Gushchin Effect
(but wouldn't they go together if so? I've only seen them separately),
or maybe something else, I don't know.

Those are the only stats I've seen caught negative, but I don't have
CMA configured at all.  You mention nr_free_cma as the only(?) other
stat you've seen negative; that of course I won't see, but looking
at the source I now notice that NR_FREE_CMA_PAGES is incremented
and decremented according to page migratetype...

... internally we have another stat that's incremented and decremented
according to page migratetype, and that one has been seen negative too:
isn't page migratetype something that usually stays the same, but
sometimes the migratetype of the page's block can change, even while
some pages of it are allocated?  Not a stable basis for maintaining
stats, though won't matter much if they are only for display.

vmstat_refresh could just exempt nr_zone_write_pending, nr_writeback
and nr_free_cma from warnings, if we cannot find a fix to them: but
I see no reason to suppress warnings on all the other vmstats.

The patch I've been testing with:

--- mmotm/mm/page-writeback.c   2021-02-14 14:32:24.0 -0800
+++ hughd/mm/page-writeback.c   2021-02-20 18:01:11.264162616 -0800
@@ -2769,6 +2769,13 @@ int __test_set_page_writeback(struct pag
int ret, access_ret;
 
lock_page_memcg(page);
+   /*
+* Increment counts in advance, so that they will not go negative
+* if test_clear_page_writeback() comes in to decrement them.
+*/
+   inc_lruvec_page_state(page, NR_WRITEBACK);
+   inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
if (mapping && mapping_use_writeback_tags(mapping)) {
XA_STATE(&xas, &mapping->i_pages, page_index(page));
struct inode *inode = mapping->host;
@@ -2804,9 +2811,14 @@ int __test_set_page_writeback(struct pag
} else {
ret = TestSetPageWriteback(page);
}
-   if (!ret) {
-   inc_lruvec_page_state(page, NR_WRITEBACK);
-   inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
+   if (WARN_ON_ONCE(ret)) {
+   /*
+* Correct counts in retrospect, if PageWriteback was already
+* set; but does any filesystem ever allow this to happen?
+*/
+   dec_lruvec_page_state(page, NR_WRITEBACK);
+   dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
}
unlock_page_memcg(page);
access_ret = arch_make_page_accessible(page);
--- mmotm/mm/vmstat.c   2021-02-20 17:59:44.838171232 -0800
+++ hughd/mm/vmstat.c   2021-02-20 18:01:11.272162661 -0800
@@ -1865,7 +1865,7 @@ int vmstat_refresh(struct ctl_table *tab
 
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
val = atomic_long_read(&vm_zone_stat[i]);
-   if (val < -max_drift) {
+   if (val < 0) {
pr_warn("%s: %s %ld\n",
__func__, zone_stat_name(i), val);

Re: [PATCH] percpu_counter: increase batch count

2021-02-22 Thread Hugh Dickins
On Mon, 22 Feb 2021, Jens Axboe wrote:
> On 2/22/21 2:31 PM, Hugh Dickins wrote:
> > On Thu, 18 Feb 2021, Jens Axboe wrote:
> >> On 2/18/21 4:16 PM, Andrew Morton wrote:
> >>> On Thu, 18 Feb 2021 14:36:31 -0700 Jens Axboe  wrote:
> >>>
> >>>> Currently we cap the batch count at max(32, 2*nr_online_cpus), which 
> >>>> these
> >>>> days is kind of silly as systems have gotten much bigger than in 2009 
> >>>> when
> >>>> this heuristic was introduced.
> >>>>
> >>>> Bump it to capping it at 256 instead. This has a noticeable improvement
> >>>> for certain io_uring workloads, as io_uring tracks per-task inflight 
> >>>> count
> >>>> using percpu counters.
> > 
> > I want to quibble with the word "capping" here, it's misleading -
> > but I'm sorry I cannot think of the right word.
> 
> Agree, it's not the best wording. And if you can't think of a better
> one, then I'm at a loss too :-)
> 
> > The macro is max() not min(): you're making an improvement for
> > certain io_uring workloads on machines with 1 to 15 cpus, right?
> > Does "bigger than in 2009" apply to those?
> 
> Right, that actually had me confused. The box in question has 64 threads,
> so my effective count was 128, or 256 with the patch.

Ah, yes, so there I *was* confused in saying "1 to 15":
the improvement was for "1 to 127" of course - thanks.

> 
> > Though, io_uring could as well use percpu_counter_add_batch() instead?
> 
> That might be a simpler/better choice!
> 
> > (Yeah, this has nothing to do with me really, but I was looking at
> > percpu_counter_compare() just now, for tmpfs reasons, so took more
> > interest.  Not objecting to a change, but the wording leaves me
> > wondering if the patch does what you think - or, not for the
> > first time, I'm confused.)
> 
> I don't think you're confused, and honestly I think using the batch
> version instead would likely improve our situation without potentially
> changing behavior for everyone else. So it's likely the right way to go.

You're too polite! But yes, if percpu_counter_add_batch() suits, great.

> 
> Thanks Hugh!
> 
> -- 
> Jens Axboe


Re: [PATCH] percpu_counter: increase batch count

2021-02-22 Thread Hugh Dickins
On Thu, 18 Feb 2021, Jens Axboe wrote:
> On 2/18/21 4:16 PM, Andrew Morton wrote:
> > On Thu, 18 Feb 2021 14:36:31 -0700 Jens Axboe  wrote:
> > 
> >> Currently we cap the batch count at max(32, 2*nr_online_cpus), which these
> >> days is kind of silly as systems have gotten much bigger than in 2009 when
> >> this heuristic was introduced.
> >>
> >> Bump it to capping it at 256 instead. This has a noticeable improvement
> >> for certain io_uring workloads, as io_uring tracks per-task inflight count
> >> using percpu counters.

I want to quibble with the word "capping" here, it's misleading -
but I'm sorry I cannot think of the right word.

The macro is max() not min(): you're making an improvement for
certain io_uring workloads on machines with 1 to 15 cpus, right?
Does "bigger than in 2009" apply to those?

Though, io_uring could as well use percpu_counter_add_batch() instead?
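
For concreteness, a rough sketch of what I mean - written from memory,
so treat the exact lines as approximate, and io_inflight_add() below is
a made-up name, not io_uring's.  The global default batch is a floor
which already grows with cpu count:

	/* lib/percpu_counter.c, compute_batch_value(), roughly: */
	int nr = num_online_cpus();
	percpu_counter_batch = max(32, nr*2);

so raising the 32 only changes machines where 2*cpus is below the new
value.  A caller wanting a bigger batch for one hot counter could pass
it explicitly, leaving the global default alone:

#include <linux/percpu_counter.h>

static void io_inflight_add(struct percpu_counter *inflight, s64 nr)
{
	/* 256 here is the caller's own batch, mirroring the proposed value */
	percpu_counter_add_batch(inflight, nr, 256);
}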

(Yeah, this has nothing to do with me really, but I was looking at
percpu_counter_compare() just now, for tmpfs reasons, so took more
interest.  Not objecting to a change, but the wording leaves me
wondering if the patch does what you think - or, not for the
first time, I'm confused.)

Hugh

> >>
> > 
> > It will also make percpu_counter_read() and
> > percpu_counter_read_positive() more inaccurate than at present.  Any
> > effects from this will take a while to discover.
> 
> It will, but the value of 32 is very low, especially when you are potentially
> doing millions of these per second. So I do think it should track the times
> a bit.
> 
> > But yes, worth trying - I'll add it to the post-rc1 pile.
> 
> Thanks!
> 
> -- 
> Jens Axboe


Re: Very slow unlockall()

2021-02-10 Thread Hugh Dickins
On Wed, 10 Feb 2021, Michal Hocko wrote:
> On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> > On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> [...]
> > > And the munlock (munlock_vma_pages_range()) is slow, because it uses
> > > follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so 
> > > that's
> > > always traversing all levels of page tables from scratch. Funnily enough,
> > > speeding this up was my first linux-mm series years ago. But the speedup 
> > > only
> > > works if pte's are present, which is not the case for unpopulated 
> > > PROT_NONE
> > > areas. That use case was unexpected back then. We should probably convert 
> > > this
> > > code to a proper page table walk. If there are large areas with 
> > > unpopulated pmd
> > > entries (or even higher levels) we would traverse them very quickly.
> > 
> > Yes, this is a good idea. I suspect it will be little bit tricky without
> > duplicating a large part of gup page table walker.
> 
> Thinking about it some more, unmap_page_range would be a better model
> for this operation.

Could do, I suppose; but I thought it was just a matter of going back to
using follow_page_mask() in munlock_vma_pages_range() (whose fear of THP
split looks overwrought, since an extra reference now prevents splitting);
and enhancing follow_page_mask() to let the no_page_table() FOLL_DUMP
case set ctx->page_mask appropriately (or perhaps it can be preset
at a higher level, without having to pass ctx so far down, dunno).

Nice little job, but I couldn't quite spare the time to do it: needs a
bit more care than I could afford (I suspect the page_increm business at
the end of munlock_vma_pages_range() is good enough while THP tails are
skipped one by one, but will need to be fixed to apply page_mask correctly
to the start - __get_user_pages()'s page_increm-entation looks superior).

Hugh


Re: [PATCH v2] mm: page-writeback: simplify memcg handling in test_clear_page_writeback()

2021-02-10 Thread Hugh Dickins
On Wed, 10 Feb 2021, Johannes Weiner wrote:
> On Wed, Feb 10, 2021 at 08:22:00AM -0800, Hugh Dickins wrote:
> > On Tue, 9 Feb 2021, Hugh Dickins wrote:
> > > On Tue, 9 Feb 2021, Johannes Weiner wrote:
> > > 
> > > > Page writeback doesn't hold a page reference, which allows truncate to
> > > > free a page the second PageWriteback is cleared. This used to require
> > > > special attention in test_clear_page_writeback(), where we had to be
> > > > careful not to rely on the unstable page->memcg binding and look up
> > > > all the necessary information before clearing the writeback flag.
> > > > 
> > > > Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
> > > > BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
> > > > explicit reference on the page, and this dance is no longer needed.
> > > > 
> > > > Use unlock_page_memcg() and dec_lruvec_page_stat() directly.
> > > 
> > > s/stat()/state()/
> > > 
> > > This is a nice cleanup: I hadn't seen that connection at all.
> > > 
> > > But I think you should take it further:
> > > __unlock_page_memcg() can then be static in mm/memcontrol.c,
> > > and its declarations deleted from include/linux/memcontrol.h?
> > 
> > And further: void lock_page_memcg(page), not returning memcg.
> 
> You're right on all counts!
> 
> > > And further: delete __dec_lruvec_state() and dec_lruvec_state()
> > > from include/linux/vmstat.h - unless you feel that every "inc"
> > > ought to be matched by a "dec", even when unused.
> 
> Hey look, there isn't a user for the __inc, either :) There is one for
> inc, but I don't insist on having symmetry there.
> 
> > > > Signed-off-by: Johannes Weiner 
> > > 
> > > Acked-by: Hugh Dickins 
> 
> Thanks for the review and good feedback.
> 
> How about this v2?

Yes, even nicer, thank you: SetPatchDoubleAcked.

> 
> ---
> 
> From 5bcc0f468460aa2670c40318bb657e8b08ef96d5 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Tue, 9 Feb 2021 16:22:42 -0500
> Subject: [PATCH] mm: page-writeback: simplify memcg handling in
>  test_clear_page_writeback()
> 
> Page writeback doesn't hold a page reference, which allows truncate to
> free a page the second PageWriteback is cleared. This used to require
> special attention in test_clear_page_writeback(), where we had to be
> careful not to rely on the unstable page->memcg binding and look up
> all the necessary information before clearing the writeback flag.
> 
> Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
> BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
> explicit reference on the page, and this dance is no longer needed.
> 
> Use unlock_page_memcg() and dec_lruvec_page_state() directly.
> 
> This removes the last user of the lock_page_memcg() return value,
> change it to void. Touch up the comments in there as well. This also
> removes the last extern user of __unlock_page_memcg(), make it
> static. Further, it removes the last user of dec_lruvec_state(),
> delete it, along with a few other unused helpers.
> 
> Signed-off-by: Johannes Weiner 
> Acked-by: Hugh Dickins 
> Reviewed-by: Shakeel Butt 
> ---
>  include/linux/memcontrol.h | 10 ++
>  include/linux/vmstat.h | 24 +++-
>  mm/memcontrol.c| 36 +++-
>  mm/page-writeback.c|  9 +++--
>  4 files changed, 19 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a44b2d51aecc..b17053af3287 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -874,8 +874,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>  extern bool cgroup_memory_noswap;
>  #endif
>  
> -struct mem_cgroup *lock_page_memcg(struct page *page);
> -void __unlock_page_memcg(struct mem_cgroup *memcg);
> +void lock_page_memcg(struct page *page);
>  void unlock_page_memcg(struct page *page);
>  
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
> @@ -1269,12 +1268,7 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>  {
>  }
>  
> -static inline struct mem_cgroup *lock_page_memcg(struct page *page)
> -{
> - return NULL;
> -}
> -
> -static inline void __unlock_page_memcg(struct mem_cgroup *memcg)
> +static inline void lock_page_memcg(struct page *page)
>  {
>  }
>  
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 506d

Re: [PATCH] mm: page-writeback: simplify memcg handling in test_clear_page_writeback()

2021-02-10 Thread Hugh Dickins
On Tue, 9 Feb 2021, Hugh Dickins wrote:
> On Tue, 9 Feb 2021, Johannes Weiner wrote:
> 
> > Page writeback doesn't hold a page reference, which allows truncate to
> > free a page the second PageWriteback is cleared. This used to require
> > special attention in test_clear_page_writeback(), where we had to be
> > careful not to rely on the unstable page->memcg binding and look up
> > all the necessary information before clearing the writeback flag.
> > 
> > Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
> > BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
> > explicit reference on the page, and this dance is no longer needed.
> > 
> > Use unlock_page_memcg() and dec_lruvec_page_stat() directly.
> 
> s/stat()/state()/
> 
> This is a nice cleanup: I hadn't seen that connection at all.
> 
> But I think you should take it further:
> __unlock_page_memcg() can then be static in mm/memcontrol.c,
> and its declarations deleted from include/linux/memcontrol.h?

And further: void lock_page_memcg(page), not returning memcg.

> 
> And further: delete __dec_lruvec_state() and dec_lruvec_state()
> from include/linux/vmstat.h - unless you feel that every "inc"
> ought to be matched by a "dec", even when unused.
> 
> > 
> > Signed-off-by: Johannes Weiner 
> 
> Acked-by: Hugh Dickins 
> 
> > ---
> >  mm/page-writeback.c | 9 +++--
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index eb34d204d4ee..f6c2c3165d4d 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2722,12 +2722,9 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
> >  int test_clear_page_writeback(struct page *page)
> >  {
> > struct address_space *mapping = page_mapping(page);
> > -   struct mem_cgroup *memcg;
> > -   struct lruvec *lruvec;
> > int ret;
> >  
> > -   memcg = lock_page_memcg(page);
> > -   lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> > +   lock_page_memcg(page);
> > if (mapping && mapping_use_writeback_tags(mapping)) {
> > struct inode *inode = mapping->host;
> > struct backing_dev_info *bdi = inode_to_bdi(inode);
> > @@ -2755,11 +2752,11 @@ int test_clear_page_writeback(struct page *page)
> > ret = TestClearPageWriteback(page);
> > }
> > if (ret) {
> > -   dec_lruvec_state(lruvec, NR_WRITEBACK);
> > +   dec_lruvec_page_state(page, NR_WRITEBACK);
> > dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> > inc_node_page_state(page, NR_WRITTEN);
> > }
> > -   __unlock_page_memcg(memcg);
> > +   unlock_page_memcg(page);
> > return ret;
> >  }
> >  
> > -- 
> > 2.30.0
> 


Re: [PATCH] mm: page-writeback: simplify memcg handling in test_clear_page_writeback()

2021-02-09 Thread Hugh Dickins
On Tue, 9 Feb 2021, Johannes Weiner wrote:

> Page writeback doesn't hold a page reference, which allows truncate to
> free a page the second PageWriteback is cleared. This used to require
> special attention in test_clear_page_writeback(), where we had to be
> careful not to rely on the unstable page->memcg binding and look up
> all the necessary information before clearing the writeback flag.
> 
> Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
> BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
> explicit reference on the page, and this dance is no longer needed.
> 
> Use unlock_page_memcg() and dec_lruvec_page_stat() directly.

s/stat()/state()/

This is a nice cleanup: I hadn't seen that connection at all.

But I think you should take it further:
__unlock_page_memcg() can then be static in mm/memcontrol.c,
and its declarations deleted from include/linux/memcontrol.h?

And further: delete __dec_lruvec_state() and dec_lruvec_state()
from include/linux/vmstat.h - unless you feel that every "inc"
ought to be matched by a "dec", even when unused.

> 
> Signed-off-by: Johannes Weiner 

Acked-by: Hugh Dickins 

> ---
>  mm/page-writeback.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index eb34d204d4ee..f6c2c3165d4d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2722,12 +2722,9 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
>  int test_clear_page_writeback(struct page *page)
>  {
>   struct address_space *mapping = page_mapping(page);
> - struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
>   int ret;
>  
> - memcg = lock_page_memcg(page);
> - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> + lock_page_memcg(page);
>   if (mapping && mapping_use_writeback_tags(mapping)) {
>   struct inode *inode = mapping->host;
>   struct backing_dev_info *bdi = inode_to_bdi(inode);
> @@ -2755,11 +2752,11 @@ int test_clear_page_writeback(struct page *page)
>   ret = TestClearPageWriteback(page);
>   }
>   if (ret) {
> - dec_lruvec_state(lruvec, NR_WRITEBACK);
> + dec_lruvec_page_state(page, NR_WRITEBACK);
>   dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>   inc_node_page_state(page, NR_WRITTEN);
>   }
> - __unlock_page_memcg(memcg);
> + unlock_page_memcg(page);
>   return ret;
>  }
>  
> -- 
> 2.30.0


Re: [PATCH] tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha

2021-02-08 Thread Hugh Dickins
On Mon, 8 Feb 2021, Seth Forshee wrote:

> As with s390, alpha is a 64-bit architecture with a 32-bit ino_t.
> With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode
> numbers and display "inode64" in the mount options, whereas
> passing "inode64" in the mount options will fail. This leads to
> erroneous behaviours such as this:
> 
>  # mkdir mnt
>  # mount -t tmpfs nodev mnt
>  # mount -o remount,rw mnt
>  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> 
> Prevent CONFIG_TMPFS_INODE64 from being selected on alpha.
> 
> Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
> Cc: sta...@vger.kernel.org # v5.9+
> Signed-off-by: Seth Forshee 

Thanks,
Acked-by: Hugh Dickins 

> ---
>  fs/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 3347ec7bd837..da524c4d7b7e 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -203,7 +203,7 @@ config TMPFS_XATTR
>  
>  config TMPFS_INODE64
>   bool "Use 64-bit ino_t by default in tmpfs"
> - depends on TMPFS && 64BIT && !S390
> + depends on TMPFS && 64BIT && !(S390 || ALPHA)
>   default n
>   help
> tmpfs has historically used only inode numbers as wide as an unsigned
> -- 
> 2.29.2


Re: [PATCH] tmpfs: Disallow CONFIG_TMPFS_INODE64 on s390

2021-02-08 Thread Hugh Dickins
On Fri, 5 Feb 2021, Andrew Morton wrote:
> On Fri,  5 Feb 2021 17:06:20 -0600 Seth Forshee  
> wrote:
> 
> > This feature requires ino_t be 64-bits, which is true for every
> > 64-bit architecture but s390, so prevent this option from being
> > selected there.
> > 
> 
> The previous patch nicely described the end-user impact of the bug. 
> This is especially important when requesting a -stable backport.
> 
> Here's what I ended up with:
> 
> 
> From: Seth Forshee 
> Subject: tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
> 
> Currently there is an assumption in tmpfs that 64-bit architectures also
> have a 64-bit ino_t.  This is not true on s390 which has a 32-bit ino_t. 
> With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and
> display "inode64" in the mount options, but passing the "inode64" mount
> option will fail.  This leads to the following behavior:
> 
>  # mkdir mnt
>  # mount -t tmpfs nodev mnt
>  # mount -o remount,rw mnt
>  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> 
> As mount sees "inode64" in the mount options and thus passes it in the
> options for the remount.
> 
> 
> So prevent CONFIG_TMPFS_INODE64 from being selected on s390.
> 
> Link: 
> https://lkml.kernel.org/r/20210205230620.518245-1-seth.fors...@canonical.com
> Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
> Signed-off-by: Seth Forshee 
> Cc: Chris Down 
> Cc: Hugh Dickins 

Thank you Seth: now that you've fixed Kirill's alpha observation too,
Acked-by: Hugh Dickins 

> Cc: Amir Goldstein 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc:   [5.9+]
> Signed-off-by: Andrew Morton 
> ---
> 
>  fs/Kconfig |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/fs/Kconfig~tmpfs-disallow-config_tmpfs_inode64-on-s390
> +++ a/fs/Kconfig
> @@ -203,7 +203,7 @@ config TMPFS_XATTR
>  
>  config TMPFS_INODE64
>   bool "Use 64-bit ino_t by default in tmpfs"
> - depends on TMPFS && 64BIT
> + depends on TMPFS && 64BIT && !S390
>   default n
>   help
> tmpfs has historically used only inode numbers as wide as an unsigned
> _
> 
> 


Re: [PATCH RFC 00/30] userfaultfd-wp: Support shmem and hugetlbfs

2021-02-05 Thread Hugh Dickins
On Fri, 29 Jan 2021, Peter Xu wrote:
> 
> Huge & Mike,
> 
> Would any of you have comment/concerns on the high-level design of this 
> series?
> 
> It would be great to know it, especially major objection, before move on to an
> non-rfc version.

Seeing Mike's update prompts me to speak up: I have been looking, and
will continue to look through it - will report when done; but find I've
been making very little forward progress from one day to the next.

It is very confusing, inevitably; but you have done an *outstanding*
job on acknowledging the confusion, and commenting it in great detail.

Hugh


Re: INFO: task can't die in shrink_inactive_list (2)

2021-02-05 Thread Hugh Dickins
On Fri, 5 Feb 2021, Matthew Wilcox wrote:
> 
> Hugh, did you get a chance to test this?

'fraid not: since I was unable to reproduce the problem,
I did not try running with your suggested fix at all:
hoped someone who could reproduce the problem might.

Hugh

> 
> On Mon, Dec 21, 2020 at 08:33:44PM +, Matthew Wilcox wrote:
> > On Mon, Dec 21, 2020 at 11:56:36AM -0800, Hugh Dickins wrote:
> > > On Mon, 23 Nov 2020, Andrew Morton wrote:
> > > > On Fri, 20 Nov 2020 17:55:22 -0800 syzbot 
> > > >  wrote:
> > > > 
> > > > > Hello,
> > > > > 
> > > > > syzbot found the following issue on:
> > > > > 
> > > > > HEAD commit:03430750 Add linux-next specific files for 20201116
> > > > > git tree:   linux-next
> > > > > console output: 
> > > > > https://syzkaller.appspot.com/x/log.txt?x=13f80e5e50
> > > > > kernel config:  
> > > > > https://syzkaller.appspot.com/x/.config?x=a1c4c3f27041fdb8
> > > > > dashboard link: 
> > > > > https://syzkaller.appspot.com/bug?extid=e5a33e700b1dd0da20a2
> > > > > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > > > > syz repro:  
> > > > > https://syzkaller.appspot.com/x/repro.syz?x=12f7bc5a50
> > > > > C reproducer:   
> > > > > https://syzkaller.appspot.com/x/repro.c?x=10934cf250
> > > > 
> > > > Alex, your series "per memcg lru lock" changed the vmscan code rather a
> > > > lot.  Could you please take a look at that reproducer?
> > > 
> > > Andrew, I promised I'd take a look at this syzreport too (though I think
> > > we're agreed by now that it has nothing to do with per-memcg lru_lock).
> > > 
> > > I did try, but (unlike Alex) did not manage to get the reproducer to
> > > reproduce it.  No doubt I did not try hard enough: I did rather lose
> > > interest after seeing that it appears to involve someone with
> > > CAP_SYS_ADMIN doing an absurdly large ioctl(BLKFRASET) on /dev/nullb0
> > > ("Null test block driver" enabled via CONFIG_BLK_DEV_NULL_BLK=y: that I
> > > did enable) and faulting from it: presumably triggering an absurd amount
> > > of readahead.
> > > 
> > > Cc'ing Matthew since he has a particular interest in readahead, and
> > > might be inspired to make some small safe change that would fix this,
> > > and benefit realistic cases too; but on the whole it didn't look worth
> > > worrying about - or at least not by me.
> > 
> > Oh, interesting.  Thanks for looping me in, I hadn't looked at this one
> > at all.  Building on the debugging you did, this is the interesting
> > part of the backtrace to me:
> > 
> > > > >  try_to_free_pages+0x29f/0x720 mm/vmscan.c:3264
> > > > >  __perform_reclaim mm/page_alloc.c:4360 [inline]
> > > > >  __alloc_pages_direct_reclaim mm/page_alloc.c:4381 [inline]
> > > > >  __alloc_pages_slowpath.constprop.0+0x917/0x2510 mm/page_alloc.c:4785
> > > > >  __alloc_pages_nodemask+0x5f0/0x730 mm/page_alloc.c:4995
> > > > >  alloc_pages_current+0x191/0x2a0 mm/mempolicy.c:2271
> > > > >  alloc_pages include/linux/gfp.h:547 [inline]
> > > > >  __page_cache_alloc mm/filemap.c:977 [inline]
> > > > >  __page_cache_alloc+0x2ce/0x360 mm/filemap.c:962
> > > > >  page_cache_ra_unbounded+0x3a1/0x920 mm/readahead.c:216
> > > > >  do_page_cache_ra+0xf9/0x140 mm/readahead.c:267
> > > > >  do_sync_mmap_readahead mm/filemap.c:2721 [inline]
> > > > >  filemap_fault+0x19d0/0x2940 mm/filemap.c:2809
> > 
> > So ra_pages has been set to something ridiculously large, and as
> > a result, we call do_page_cache_ra() asking to read more memory than
> > is available in the machine.  Funny thing, we actually have a function
> > to prevent this kind of situation, and it's force_page_cache_ra().
> > 
> > So this might fix the problem.  I only tested that it compiles.  I'll
> > be happy to write up a proper changelog and sign-off for it if it works ...
> > it'd be good to get it some soak testing on a variety of different
> > workloads; changing this stuff is enormously subtle.
> > 
> > As a testament to that, I think Fengguang got it wrong in commit
> > 2cbea1d3ab11 -- async_size should have been 3 * ra_pages / 4, not ra_pages
> > / 4 (because we read-behind by half the range, so we're looking for a
> > page fault to happen a

Re: Possible deny of service with memfd_create()

2021-02-04 Thread Hugh Dickins
On Thu, 4 Feb 2021, Michal Hocko wrote:
> On Thu 04-02-21 17:32:20, Christian Koenig wrote:
> > Hi Michal,
> > 
> > as requested in the other mail thread the following sample code gets my test
> > system down within seconds.
> > 
> > The issue is that the memory allocated for the file descriptor is not
> > accounted to the process allocating it, so the OOM killer pics whatever
> > process it things is good but never my small test program.
> > 
> > Since memfd_create() doesn't need any special permission this is a rather
> > nice deny of service and as far as I can see also works with a standard
> > Ubuntu 5.4.0-65-generic kernel.
> 
> Thanks for following up. This is really nasty but now that I am looking
> at it more closely, this is not really different from tmpfs in general.
> You are free to create files and eat the memory without being accounted
> for that memory because that is not seen as your memory from the sysstem
> POV. You would have to map that memory to be part of your rss.
> 
> The only existing protection right now is to use memoery cgroup
> controller because the tmpfs memory is accounted to the process which
> faults the memory in (or write to the file).
> 
> I am not sure there is a good way to handle this in general
> unfortunatelly. Shmem is is just tricky (e.g. how to you deal with left
> overs after the fd is closed?). Maybe memfd_create can be more clever
> and account memory to all owners of the fd but even that sounds far from
> trivial from the accounting POV. It is true that tmpfs can at least
> control who can write to it which is not the case for memfd but then we
> hit the backward compatibility wall.

Yes, no solution satisfactory, and memcg best, but don't forget
echo 2 >/proc/sys/vm/overcommit_memory

Hugh


Re: [PATCH] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled

2021-02-02 Thread Hugh Dickins
On Tue, 2 Feb 2021, Axel Rasmussen wrote:
> On Tue, Feb 2, 2021 at 1:03 PM Hugh Dickins  wrote:
> > On Tue, 2 Feb 2021, Axel Rasmussen wrote:
> >
> > > For background, mm/userfaultfd.c provides a general mcopy_atomic
> > > implementation. But some types of memory (e.g., hugetlb and shmem) need
> > > a slightly different implementation, so they provide their own helpers
> > > for this. In other words, userfaultfd is the only caller of this
> > > function.
> > >
> > > This patch achieves two things:
> > >
> > > 1. Don't spend time compiling code which will end up never being
> > > referenced anyway (a small build time optimization).
> > >
> > > 2. In future patches (e.g. [1]), we plan to extend the signature of
> > > these helpers with UFFD-specific state (e.g., enums or structs defined
> > > conditionally in userfaultfd_k.h). Once this happens, this patch will be
> > > needed to avoid build errors (or, we'd need to define more UFFD-only
> > > stuff unconditionally, which seems messier to me).
> > >
> > > Peter Xu suggested this be sent as a standalone patch, in the mailing
> > > list discussion for [1].
> > >
> > > [1] https://patchwork.kernel.org/project/linux-mm/list/?series=424091
> > >
> > > Signed-off-by: Axel Rasmussen 
> > > ---
> > >  include/linux/hugetlb.h | 4 
> > >  mm/hugetlb.c| 2 ++
> > >  2 files changed, 6 insertions(+)
> >
> > Hi Axel, please also do the same to mm/shmem.c (perhaps you missed
> > it because I did that long ago to our internal copy of mm/shmem.c).
> 
> I had been largely ignoring shmem up to this point because my minor
> fault handling series doesn't (yet) deal with it. But, I'll need to do
> this later when I support shmem anyway, so happy to add it here.

Oh, if this patch is going into a hugetlbfs series, skip mm/shmem.c for
now (or keep it in, whichever's easiest for you): I caught sight of the
"(e.g., hugetlb and shmem)" in the commit message above, and thought
you had inadvertently missed out the shmem part - but now see that
the patch title does say "userfaultfd: hugetlbfs:".

Hugh


Re: [PATCH] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled

2021-02-02 Thread Hugh Dickins
On Tue, 2 Feb 2021, Axel Rasmussen wrote:

> For background, mm/userfaultfd.c provides a general mcopy_atomic
> implementation. But some types of memory (e.g., hugetlb and shmem) need
> a slightly different implementation, so they provide their own helpers
> for this. In other words, userfaultfd is the only caller of this
> function.
> 
> This patch achieves two things:
> 
> 1. Don't spend time compiling code which will end up never being
> referenced anyway (a small build time optimization).
> 
> 2. In future patches (e.g. [1]), we plan to extend the signature of
> these helpers with UFFD-specific state (e.g., enums or structs defined
> conditionally in userfaultfd_k.h). Once this happens, this patch will be
> needed to avoid build errors (or, we'd need to define more UFFD-only
> stuff unconditionally, which seems messier to me).
> 
> Peter Xu suggested this be sent as a standalone patch, in the mailing
> list discussion for [1].
> 
> [1] https://patchwork.kernel.org/project/linux-mm/list/?series=424091
> 
> Signed-off-by: Axel Rasmussen 
> ---
>  include/linux/hugetlb.h | 4 
>  mm/hugetlb.c| 2 ++
>  2 files changed, 6 insertions(+)

Hi Axel, please also do the same to mm/shmem.c (perhaps you missed
it because I did that long ago to our internal copy of mm/shmem.c).
But please also comment the endifs
#endif /* CONFIG_USERFAULTFD */
to help find one's way around them.

I see you've done include/linux/hugetlb.h: okay, that's not necessary,
but a matter of taste; up to you whether to do include/linux/shmem_fs.h.
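
If you do go ahead with shmem_fs.h, roughly what I have in mind there
(from memory, so please check the actual prototype in the header before
copying) is just:

/* include/linux/shmem_fs.h */
#ifdef CONFIG_USERFAULTFD
extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
				  struct vm_area_struct *dst_vma,
				  unsigned long dst_addr,
				  unsigned long src_addr,
				  struct page **pagep);
#endif /* CONFIG_USERFAULTFD */

with the commented endif making the conditional easy to spot.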

Thanks,
Hugh

> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..749701b5c153 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -134,11 +134,13 @@ void hugetlb_show_meminfo(void);
>  unsigned long hugetlb_total_pages(void);
>  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   unsigned long address, unsigned int flags);
> +#ifdef CONFIG_USERFAULTFD
>  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
>   struct vm_area_struct *dst_vma,
>   unsigned long dst_addr,
>   unsigned long src_addr,
>   struct page **pagep);
> +#endif
>  int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>   struct vm_area_struct *vma,
>   vm_flags_t vm_flags);
> @@ -308,6 +310,7 @@ static inline void hugetlb_free_pgd_range(struct 
> mmu_gather *tlb,
>   BUG();
>  }
>  
> +#ifdef CONFIG_USERFAULTFD
>  static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   pte_t *dst_pte,
>   struct vm_area_struct *dst_vma,
> @@ -318,6 +321,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct 
> mm_struct *dst_mm,
>   BUG();
>   return 0;
>  }
> +#endif
>  
>  static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
> addr,
>   unsigned long sz)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 18f6ee317900..821bfa9c0c80 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4615,6 +4615,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   return ret;
>  }
>  
> +#ifdef CONFIG_USERFAULTFD
>  /*
>   * Used by userfaultfd UFFDIO_COPY.  Based on mcopy_atomic_pte with
>   * modifications for huge pages.
> @@ -4745,6 +4746,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   put_page(page);
>   goto out;
>  }
> +#endif
>  
>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>struct page **pages, struct vm_area_struct **vmas,
> -- 
> 2.30.0.365.g02bc693789-goog


Re: [PATCH v4 0/8] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

2021-01-26 Thread Hugh Dickins
On Tue, 26 Jan 2021, Will Deacon wrote:
> On Wed, Jan 20, 2021 at 05:36:04PM +, Will Deacon wrote:
> > Hi all,
> > 
> > This is version four of the patches I previously posted here:
> > 
> >   v1: https://lore.kernel.org/r/20201209163950.8494-1-w...@kernel.org
> >   v2: https://lore.kernel.org/r/20210108171517.5290-1-w...@kernel.org
> >   v3: https://lore.kernel.org/r/20210114175934.13070-1-w...@kernel.org
> > 
> > The patches allow architectures to opt-in at runtime for faultaround
> > mappings to be created as 'old' instead of 'young'. Although there have
> > been previous attempts at this, they failed either because the decision
> > was deferred to userspace [1] or because it was done unconditionally and
> > shown to regress benchmarks for particular architectures [2].
> > 
> > The big change since v3 is that the immutable fields of 'struct vm_fault'
> > now live in a 'const' anonymous struct. Although Clang will silently
> > accept modifications to these fields [3], GCC emits an error. The
> > resulting diffstat is _considerably_ more manageable with this approach.
> 
> The only changes I have pending against this series are cosmetic (commit
> logs). Can I go ahead and queue this in the arm64 tree so that it can sit
> in linux-next for a bit? (positive or negative feedback appreciated!).

That would be fine by me: I ran v3 on rc3, then the nicer smaller v4
on rc4, and saw no problems when running either of them (x86_64 only).

Hugh


Re: Infinite recursion in device_reorder_to_tail() due to circular device links

2021-01-24 Thread Hugh Dickins
On Sun, 24 Jan 2021, Greg Kroah-Hartman wrote:
> On Sat, Jan 23, 2021 at 03:37:30PM -0800, Hugh Dickins wrote:
> > On Tue, 12 Jan 2021, Greg Kroah-Hartman wrote:
> > > On Tue, Jan 12, 2021 at 03:32:04PM +0100, Rafael J. Wysocki wrote:
> > > > On Mon, Jan 11, 2021 at 7:46 PM Stephan Gerhold  
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > since 5.11-rc1 I get kernel crashes with infinite recursion in
> > > > > device_reorder_to_tail() in some situations... It's a bit complicated 
> > > > > to
> > > > > explain so I want to apologize in advance for the long mail. :)
> > > > >
> > > > >   Kernel panic - not syncing: kernel stack overflow
> > > > >   CPU: 1 PID: 33 Comm: kworker/1:1 Not tainted 5.11.0-rc3 #1
> > > > >   Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> > > > >   Call trace:
> > > > >...
> > > > >device_reorder_to_tail+0x4c/0xf0
> > > > >device_reorder_to_tail+0x98/0xf0
> > > > >device_reorder_to_tail+0x60/0xf0
> > > > >device_reorder_to_tail+0x60/0xf0
> > > > >device_reorder_to_tail+0x60/0xf0
> > > > >...
> > > > >
> > > > > The crash happens only in 5.11 with commit 5b6164d3465f ("driver core:
> > > > > Reorder devices on successful probe"). It stops happening when I 
> > > > > revert
> > > > > this commit.
> > > > 
> > > > Thanks for the report!
> > > > 
> > > > Greg, please revert commit 5b6164d3465f, it clearly is not an
> > > > improvement, at least at this point.
> > > 
> > > Now reverted, thanks.
> > > 
> > > greg k-h
> > 
> > I think that there has been a misunderstanding here: although
> > 5b6164d3465f ("driver core: Reorder devices on successful probe")
> > has been reverted from linux-next (thank you), it has not yet been
> > reverted from 5.11-rc, and still causing problems there (in my case,
> > not the infinite recursion Stephan reported in this thread, but the
> > ThinkPad rmi4 suspend failure that I reported in another thread).
> 
> It will be sent to Linus in a few hours, thanks, so should show up in
> 5.11-rc5.  I had other patches to go along with this to send him at the
> same time :)

And indeed it's now in, thanks Greg: I'm sorry for being importunate,
the misunderstanding was mine.

Hugh


Re: Infinite recursion in device_reorder_to_tail() due to circular device links

2021-01-23 Thread Hugh Dickins
On Tue, 12 Jan 2021, Greg Kroah-Hartman wrote:
> On Tue, Jan 12, 2021 at 03:32:04PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 11, 2021 at 7:46 PM Stephan Gerhold  wrote:
> > >
> > > Hi,
> > >
> > > since 5.11-rc1 I get kernel crashes with infinite recursion in
> > > device_reorder_to_tail() in some situations... It's a bit complicated to
> > > explain so I want to apologize in advance for the long mail. :)
> > >
> > >   Kernel panic - not syncing: kernel stack overflow
> > >   CPU: 1 PID: 33 Comm: kworker/1:1 Not tainted 5.11.0-rc3 #1
> > >   Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> > >   Call trace:
> > >...
> > >device_reorder_to_tail+0x4c/0xf0
> > >device_reorder_to_tail+0x98/0xf0
> > >device_reorder_to_tail+0x60/0xf0
> > >device_reorder_to_tail+0x60/0xf0
> > >device_reorder_to_tail+0x60/0xf0
> > >...
> > >
> > > The crash happens only in 5.11 with commit 5b6164d3465f ("driver core:
> > > Reorder devices on successful probe"). It stops happening when I revert
> > > this commit.
> > 
> > Thanks for the report!
> > 
> > Greg, please revert commit 5b6164d3465f, it clearly is not an
> > improvement, at least at this point.
> 
> Now reverted, thanks.
> 
> greg k-h

I think that there has been a misunderstanding here: although
5b6164d3465f ("driver core: Reorder devices on successful probe")
has been reverted from linux-next (thank you), it has not yet been
reverted from 5.11-rc, and still causing problems there (in my case,
not the infinite recursion Stephan reported in this thread, but the
ThinkPad rmi4 suspend failure that I reported in another thread).

Thanks,
Hugh


[PATCH] mm: thp: fix MADV_REMOVE deadlock on shmem THP

2021-01-16 Thread Hugh Dickins
Sergey reported deadlock between kswapd correctly doing its usual
lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem),
and madvise(MADV_REMOVE) on an madvise(MADV_HUGEPAGE) area doing
down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).

This happened when shmem_fallocate(punch hole)'s unmap_mapping_range()
reaches zap_pmd_range()'s call to __split_huge_pmd().  The same deadlock
could occur when partially truncating a mapped huge tmpfs file, or using
fallocate(FALLOC_FL_PUNCH_HOLE) on it.

__split_huge_pmd()'s page lock was added in 5.8, to make sure that any
concurrent use of reuse_swap_page() (holding page lock) could not catch
the anon THP's mapcounts and swapcounts while they were being split.

Fortunately, reuse_swap_page() is never applied to a shmem or file THP
(not even by khugepaged, which checks PageSwapCache before calling),
and anonymous THPs are never created in shmem or file areas: so that
__split_huge_pmd()'s page lock can only be necessary for anonymous THPs,
on which there is no risk of deadlock with i_mmap_rwsem.

Reported-by: Sergey Senozhatsky 
Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against 
__split_huge_pmd_locked()")
Signed-off-by: Hugh Dickins 
Reviewed-by: Andrea Arcangeli 
Cc: sta...@vger.kernel.org
---

The status of reuse_swap_page(), and its use on THPs, is currently under
discussion, and may need to be changed: but this patch is a simple fix
to the reported deadlock, which can go in now, and be easily backported
to whichever stable and longterm releases took in 5.8's c444eb564fb1.

 mm/huge_memory.c |   37 +++--
 1 file changed, 23 insertions(+), 14 deletions(-)

--- 5.11-rc3/mm/huge_memory.c   2020-12-27 20:39:37.667932292 -0800
+++ linux/mm/huge_memory.c  2021-01-16 08:02:08.265551393 -0800
@@ -2202,7 +2202,7 @@ void __split_huge_pmd(struct vm_area_str
 {
spinlock_t *ptl;
struct mmu_notifier_range range;
-   bool was_locked = false;
+   bool do_unlock_page = false;
pmd_t _pmd;
 
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
@@ -2218,7 +2218,6 @@ void __split_huge_pmd(struct vm_area_str
VM_BUG_ON(freeze && !page);
if (page) {
VM_WARN_ON_ONCE(!PageLocked(page));
-   was_locked = true;
if (page != pmd_page(*pmd))
goto out;
}
@@ -2227,19 +2226,29 @@ repeat:
if (pmd_trans_huge(*pmd)) {
if (!page) {
page = pmd_page(*pmd);
-   if (unlikely(!trylock_page(page))) {
-   get_page(page);
-   _pmd = *pmd;
-   spin_unlock(ptl);
-   lock_page(page);
-   spin_lock(ptl);
-   if (unlikely(!pmd_same(*pmd, _pmd))) {
-   unlock_page(page);
+   /*
+* An anonymous page must be locked, to ensure that a
+* concurrent reuse_swap_page() sees stable mapcount;
+* but reuse_swap_page() is not used on shmem or file,
+* and page lock must not be taken when zap_pmd_range()
+* calls __split_huge_pmd() while i_mmap_lock is held.
+*/
+   if (PageAnon(page)) {
+   if (unlikely(!trylock_page(page))) {
+   get_page(page);
+   _pmd = *pmd;
+   spin_unlock(ptl);
+   lock_page(page);
+   spin_lock(ptl);
+   if (unlikely(!pmd_same(*pmd, _pmd))) {
+   unlock_page(page);
+   put_page(page);
+   page = NULL;
+   goto repeat;
+   }
put_page(page);
-   page = NULL;
-   goto repeat;
}
-   put_page(page);
+   do_unlock_page = true;
}
}
if (PageMlocked(page))
@@ -2249,7 +2258,7 @@ repeat:
__split_huge_pmd_locked(vma, pmd, range.start, freeze);
 out:
spin_unlock(ptl);
-   if (!was_locked && page)
+   if (do_unlock_page)
unlock_page(page);
/*
 * No need to double call mmu_notifier->invalidate_range() callback.


Re: madvise(MADV_REMOVE) deadlocks on shmem THP

2021-01-13 Thread Hugh Dickins
On Thu, 14 Jan 2021, Sergey Senozhatsky wrote:

> Hi,
> 
> We are running into lockups during the memory pressure tests on our
> boards, which essentially NMI panic them. In short the test case is
> 
> - THP shmem
> echo advise > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> 
> - And a user-space process doing madvise(MADV_HUGEPAGE) on new mappings,
>   and madvise(MADV_REMOVE) when it wants to remove the page range
> 
> The problem boils down to the reverse locking chain:
>   kswapd does
> 
>   lock_page(page) -> down_read(page->mapping->i_mmap_rwsem)
> 
>   madvise() process does
> 
>   down_write(page->mapping->i_mmap_rwsem) -> lock_page(page)
> 
> 
> 
> CPU0   CPU1
> 
> kswapd vfs_fallocate()
>  shrink_node()  shmem_fallocate()
>   shrink_active_list()   
> unmap_mapping_range()
>page_referenced() << lock page:PG_locked >>
> unmap_mapping_pages()  << down_write(mapping->i_mmap_rwsem) >>
> rmap_walk_file()   
> zap_page_range_single()
>  down_read(mapping->i_mmap_rwsem) << W-locked on CPU1>> 
> unmap_page_range()
>   rwsem_down_read_failed()   
> __split_huge_pmd()
>__rwsem_down_read_failed_common()  
> __lock_page()  << PG_locked on CPU0 >>
> schedule() 
> wait_on_page_bit_common()
> 
> io_schedule()

Very interesting, Sergey: many thanks for this report.

There is no doubt that kswapd is right in its lock ordering:
__split_huge_pmd() is in the wrong to be attempting lock_page().

Which used not to be done, but was added in 5.8's c444eb564fb1 ("mm:
thp: make the THP mapcount atomic against __split_huge_pmd_locked()").

Which explains why this deadlock was not seen years ago: that
surprised me at first, since the case you show to reproduce it is good,
but I'd expect more common ways in which that deadlock could show up.

And your report is remarkably timely too: I have two other reasons
for looking at that change at the moment (I'm currently catching up
with recent discussion of page_count versus mapcount when deciding
COW page reuse).

I won't say more tonight, but should have more to add tomorrow.

Hugh


Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Saravana Kannan wrote:
> On Mon, Jan 11, 2021 at 4:44 PM Hugh Dickins  wrote:
> > On Mon, 11 Jan 2021, Saravana Kannan wrote:
> > >
> > > Did you see this patch change the organization of devices under 
> > > /sys/devices/?
> > > The rmi* devices need to be under one of the i2c devices after this
> > > patch. Is that not the case? Or is that the case, but you are still
> > > seeing suspend/resume issues?
> >
> > Now that I look, yes, that patch has moved the directory
> > /sys/devices/rmi4-00
> > to
> > /sys/devices/pci:00/:00:1f.4/i2c-6/6-002c/rmi4-00
> 
> What about child devices of rmi4-00? Does it still have the
> rmi4-00.fn* devices as children? I'd think so, but just double
> checking.

Yes, the patch moved the rmi4-00 directory and its contents.

> 
> >
> > But I still see the same suspend issues despite that.
> 
> Can you please get new logs to see if the failure reasons are still
> the same? I'd think this parent/child relationship would at least
> avoid the "Failed to read irqs" errors that seem to be due to I2C
> dependency.

No, it did not avoid the "Failed to read irqs" error (though my
recollection from earlier failures before I mailed out is that
that particular error is intermittent: sometimes it showed up,
other times not; but always the "Failed to write sleep mode").

I configured CONFIG_PM_DEBUG=y and booted with pm_debug_messages
this time, dmesgsys.tar attached, contents:

dmesg.rc3   # dmesg of boot and attempt to suspend on 5.11-rc3
sys.rc3 # find /sys | sort | grep -v /sys/fs/cgroup afterwards
dmesg.saravana  # dmesg of boot and attempt to suspend with your patch
sys.saravana# find /sys | sort | grep -v /sys/fs/cgroup afterwards
dmesg.revert# dmesg of boot+suspend+resume, rc3 without 5b6164d3465f
sys.revert  # find /sys | sort | grep -v /sys/fs/cgroup afterwards

Not as many debug messages as I was expecting: perhaps you can point
me to something else to tune to get more info out? And perhaps it was
a mistake to snapshot the /sys hierarchy after rather than before:
I see now that it does make some difference.  I filtered out
/sys/fs/cgroup because that enlarged the diffs with no relevance.

Hugh

dmesgsys.tar
Description: Unix tar archive


Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Saravana Kannan wrote:
> On Mon, Jan 11, 2021 at 3:42 PM Hugh Dickins  wrote:
> > On Mon, 11 Jan 2021, Saravana Kannan wrote:
> > >
> > > I happen to have an X1 Carbon (different gen though) lying around and
> > > I poked at its /sys folders. None of the devices in the rmi4_smbus are
> > > considered the grandchildren of the i2c device. I think the real
> > > problem is rmi_register_transport_device() [1] not setting up the
> > > parent for any of the new devices it's adding.
> > >
> > > Hugh, can you try this patch?
> >
> > Just tried, but no, this patch does not help; but I bet
> > you're along the right lines, and something as simple will do it.
> 
> Did you see this patch change the organization of devices under /sys/devices/?
> The rmi* devices need to be under one of the i2c devices after this
> patch. Is that not the case? Or is that the case, but you are still
> seeing suspend/resume issues?

Now that I look, yes, that patch has moved the directory
/sys/devices/rmi4-00
to
/sys/devices/pci0000:00/0000:00:1f.4/i2c-6/6-002c/rmi4-00

But I still see the same suspend issues despite that.

Hugh


Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Saravana Kannan wrote:
> 
> I happen to have an X1 Carbon (different gen though) lying around and
> I poked at its /sys folders. None of the devices in the rmi4_smbus are
> considered the grandchildren of the i2c device. I think the real
> problem is rmi_register_transport_device() [1] not setting up the
> parent for any of the new devices it's adding.
> 
> Hugh, can you try this patch?

Just tried, but no, this patch does not help; but I bet
you're along the right lines, and something as simple will do it.

> 
> diff --git a/drivers/input/rmi4/rmi_bus.c b/drivers/input/rmi4/rmi_bus.c
> index 24f31a5c0e04..50a0134b6901 100644
> --- a/drivers/input/rmi4/rmi_bus.c
> +++ b/drivers/input/rmi4/rmi_bus.c
> @@ -90,6 +90,7 @@ int rmi_register_transport_device(struct rmi_transport_dev *xport)
> 
> rmi_dev->dev.bus = &rmi_bus_type;
> rmi_dev->dev.type = &rmi_device_type;
> +   rmi_dev->dev.parent = xport->dev;
> 
> xport->rmi_dev = rmi_dev;
> 
> -Saravana
> 
> [1] - 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/input/rmi4/rmi_bus.c#n74


Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Will Deacon wrote:
> On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> > On Fri, Jan 8, 2021 at 9:15 AM Will Deacon  wrote:
> > >
> > > The big difference in this version is that I have reworked it based on
> > > Kirill's patch which he posted as a follow-up to the original. However,
> > > I can't tell where we've landed on that -- Linus seemed to like it, but
> > > Hugh was less enthusiastic.
> > 
> > Yeah, I like it, but I have to admit that it had a disturbingly high
> > number of small details wrong for several versions. I hope you picked
> > up the final version of the code.
> 
> I picked the version from here:
> 
>   https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
> 
> and actually, I just noticed that willy spotted a typo in a comment, so
> I'll fix that locally as well as adding the above to a 'Link:' tag for
> reference.
> 
> > At the same time, I do think that the "disturbingly high number of
> > issues" was primarily exactly _because_ the old code was so
> > incomprehensible, and I think the end result is much cleaner, so I
> > still like it.

Just to report that I gave this v2 set a spin on a few (x86_64 and i386)
machines, and found nothing objectionable this time around.

And the things that I'm unenthusiastic about are exactly those details
that you and Kirill and Linus find unsatisfactory, but awkward to
eliminate: expect no new insights from me!

Hugh


Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Greg Kroah-Hartman wrote:

> From: Linus Torvalds 
> 
> commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.
> 
> Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
> logic") we've had some very occasional reports of BUG_ON(PageWriteback)
> in write_cache_pages(), which we thought we already fixed in commit
> 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
> 
> But syzbot just reported another one, even with that commit in place.
> 
> And it turns out that there's a simpler way to trigger the BUG_ON() than
> the one Hugh found with page re-use.  It all boils down to the fact that
> the page writeback is ostensibly serialized by the page lock, but that
> isn't actually really true.
> 
> Yes, the people _setting_ writeback all do so under the page lock, but
> the actual clearing of the bit - and waking up any waiters - happens
> without any page lock.
> 
> This gives us this fairly simple race condition:
> 
>   CPU1 = end previous writeback
>   CPU2 = start new writeback under page lock
>   CPU3 = write_cache_pages()
> 
>   CPU1  CPU2CPU3
>     
> 
>   end_page_writeback()
> test_clear_page_writeback(page)
> ... delayed...
> 
> lock_page();
> set_page_writeback()
> unlock_page()
> 
> lock_page()
> wait_on_page_writeback();
> 
> wake_up_page(page, PG_writeback);
> .. wakes up CPU3 ..
> 
> BUG_ON(PageWriteback(page));
> 
> where the BUG_ON() happens because we woke up the PG_writeback bit
> because of the _previous_ writeback, but a new one had already been
> started because the clearing of the bit wasn't actually atomic wrt the
> actual wakeup or serialized by the page lock.
> 
> The reason this didn't use to happen was that the old logic in waiting
> on a page bit would just loop if it ever saw the bit set again.
> 
> The nice proper fix would probably be to get rid of the whole "wait for
> writeback to clear, and then set it" logic in the writeback path, and
> replace it with an atomic "wait-to-set" (ie the same as we have for page
> locking: we set the page lock bit with a single "lock_page()", not with
> "wait for lock bit to clear and then set it").
> 
> However, our current model for writeback is that the waiting for the
> writeback bit is done by the generic VFS code (ie write_cache_pages()),
> but the actual setting of the writeback bit is done much later by the
> filesystem ".writepages()" function.
> 
> IOW, to make the writeback bit have that same kind of "wait-to-set"
> behavior as we have for page locking, we'd have to change our roughly
> ~50 different writeback functions.  Painful.
> 
> Instead, just make "wait_on_page_writeback()" loop on the very unlikely
> situation that the PG_writeback bit is still set, basically re-instating
> the old behavior.  This is very non-optimal in case of contention, but
> since we only ever set the bit under the page lock, that situation is
> controlled.
> 
> Reported-by: syzbot+2fc0712f8f8b8b8fa...@syzkaller.appspotmail.com
> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> Acked-by: Hugh Dickins 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: sta...@kernel.org
> Signed-off-by: Linus Torvalds 
> Signed-off-by: Greg Kroah-Hartman 

I think it's too early to push this one through to stable:
Linus mentioned on Friday that Michael Larabel of Phoronix
has observed a performance regression from this commit.

Correctness outweighs performance of course, but I think
stable users might see the performance issue much sooner
than they would ever see the BUG fixed.  Wait a bit,
while we think some more about what to try next?

Hugh

> 
> ---
>  mm/page-writeback.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2826,7 +2826,7 @@ EXPORT_SYMBOL(__test_set_page_writeback)
>   */
>  void wait_on_page_writeback(struct page *page)
>  {
> - if (PageWriteback(page)) {
> + while (PageWriteback(page)) {
>   trace_wait_on_page_writeback(page, page_mapping(page));
>   wait_on_page_bit(page, PG_writeback);
>   }
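
As an aside, the "wait-to-set" shape which the commit message above calls the nicer proper fix might look roughly like the sketch below. This is hypothetical, not anything in the kernel, and it ignores all the accounting that the real writeback-setting path does; it only shows the lock_page()-like "atomically wait until we own the bit" idea.

/* Hypothetical sketch only: loop until we are the ones who set PG_writeback. */
static void wait_to_set_page_writeback(struct page *page)
{
	while (test_and_set_bit(PG_writeback, &page->flags))
		wait_on_page_bit(page, PG_writeback);
}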


Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Thierry Reding wrote:
> On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > 
> > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > fails to suspend when running 5.11-rc kernels: bisected to 
> > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > and reverting that fixes it.  dmesg.xz attached, but go ahead and ask
> > me to switch on a debug option to extract further info if that may help.
...
> 
> I think what might be happening here is that the offending patch causes
> some devices to be reordered in a way different to how they were ordered
> originally and the rmi4 driver currently depends on that implicit order.

Yes, all that you explained makes good sense, thanks.

> I'm not familiar with how exactly rmi4 works, so I'll have to do
> some digging to hopefully pinpoint exactly what's going wrong here.
> 
> In the meantime, it would be useful to know what exactly the I2C
> hierarchy looks like. For example, what's the I2C controller that the
> RMI4 device is hooked up to. According to the above, that's I2C bus 6,
> so you should be able to find out some details about it by inspecting
> the corresponding sysfs nodes:
> 
>   $ ls -l /sys/class/i2c-adapter/i2c-6/
>   $ cat /sys/class/i2c-adapter/i2c-6/name
>   $ ls -l /sys/class/i2c-adapter/i2c-6/device/

That's curious: I don't even have a /sys/class/i2c-adapter directory.

(And I did wonder if you meant to say "smbus" rather than "i2c",
though I don't have any /sys/class/smbus* either: I have no notion
of the relationship between i2c and smbus, but I thought the failing
write_block calls were the ones in rmi_smbus.c rather than rmi_i2c.c.)

I've attached compressed output of "find /sys/bus /sys/class | sort":
/sys/bus looked more relevant than /sys/class, maybe it will help
point in the right direction?

And in case it's relevant, maybe I should mention that this is a
non-modular, all-built-in kernel.

But as I said to Rafael, my touchpad can wait: the wider ordering
discussion is much more important.

Hugh

sysbusclass.xz
Description: application/xz


Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

2021-01-11 Thread Hugh Dickins
On Mon, 11 Jan 2021, Rafael J. Wysocki wrote:
> On Mon, Jan 11, 2021 at 5:44 AM Hugh Dickins  wrote:
> >
> > Hi Rafael,
> >
> > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > fails to suspend when running 5.11-rc kernels: bisected to
> > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > and reverting that fixes it.  dmesg.xz attached, but go ahead and ask
> > me to switch on a debug option to extract further info if that may help.
> 
> Does the driver abort the suspend transition by returning an error or
> does something else happen?

Both.  Thierry has pointed to the lines showing failed suspend transition;
and I forgot to mention that the touchpad is unresponsive from then on
(I might not have noticed the failed suspend without that).  But I don't
suppose that unresponsiveness is worth worrying about: things went wrong
in suspend, so it's not surprising if the driver does not recover well.

Thank you both for getting on to this so quickly - but don't worry about
getting my touchpad working: I'm glad to see you discussing the wider
issues of ordering that this has brought up.

Hugh


5.11-rc device reordering breaks ThinkPad rmi4 suspend

2021-01-10 Thread Hugh Dickins
Hi Rafael,

Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
fails to suspend when running 5.11-rc kernels: bisected to 
5b6164d3465f ("driver core: Reorder devices on successful probe"),
and reverting that fixes it.  dmesg.xz attached, but go ahead and ask
me to switch on a debug option to extract further info if that may help.

Thanks,
Hugh

dmesg.xz
Description: application/xz


Re: [PATCH] mm/memcontrol: fix warning in mem_cgroup_page_lruvec()

2021-01-08 Thread Hugh Dickins
On Thu, 7 Jan 2021, Vlastimil Babka wrote:
> On 1/4/21 6:03 AM, Hugh Dickins wrote:
> > Boot a CONFIG_MEMCG=y kernel with "cgroup_disable=memory" and you are
> > met by a series of warnings from the VM_WARN_ON_ONCE_PAGE(!memcg, page)
> > recently added to the inline mem_cgroup_page_lruvec().
> > 
> > An earlier attempt to place that warning, in mem_cgroup_lruvec(), had
> > been careful to do so after weeding out the mem_cgroup_disabled() case;
> > but was itself invalid because of the mem_cgroup_lruvec(NULL, pgdat) in
> > clear_pgdat_congested() and age_active_anon().
> > 
> > Warning in mem_cgroup_page_lruvec() was once useful in detecting a KSM
> > charge bug, so may be worth keeping: but skip if mem_cgroup_disabled().
> > 
> > Fixes: 9a1ac2288cf1 ("mm/memcontrol:rewrite mem_cgroup_page_lruvec()")
> > Signed-off-by: Hugh Dickins 
> 
> Acked-by: Vlastimil Babka 

Thanks.

> 
> > ---
> > 
> >  include/linux/memcontrol.h |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- 5.11-rc2/include/linux/memcontrol.h 2020-12-27 20:39:36.751923135 -0800
> > +++ linux/include/linux/memcontrol.h 2021-01-03 19:38:24.822978559 -0800
> > @@ -665,7 +665,7 @@ static inline struct lruvec *mem_cgroup_
> >  {
> > struct mem_cgroup *memcg = page_memcg(page);
> >  
> > -   VM_WARN_ON_ONCE_PAGE(!memcg, page);
> > +   VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
> 
> Nit: I would reverse the order of conditions as mem_cgroup_disabled() is
> either "return true" or a static key. Not that it matters too much on
> DEBUG_VM configs...

tl;dr I'm going to leave the patch as is.

You are certainly right that I was forgetting the static-key-ness of
mem_cgroup_disabled() when I put the tests that way round: I was thinking
of the already-in-a-register-ness of "memcg"; but had also not realized
that page_memcg() just did an "&", so condition bits nicely set already.

And I think you are right in principle, that the tests should be better
the way you suggest, when static key is in use - in the (unusual)
mem_cgroup_disabled() case, though not in the usual enabled case.
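
Side by side, the two orderings under discussion (the first is what the patch does; the second is the suggested reversal, letting the mem_cgroup_disabled() static key short-circuit the test when memcg is disabled):

	/* as posted: test the already-loaded memcg first */
	VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);

	/* as suggested: let the static key short-circuit when disabled */
	VM_WARN_ON_ONCE_PAGE(!mem_cgroup_disabled() && !memcg, page);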

I refuse to confess how many hours I've spent poring over "objdump -ld"s
of lock_page_lruvec_irqsave(), and comparing with how it is patched when
the kernel is booted with "cgroup_disable=memory".

But I have seen builds where my way round worked out better than yours,
for both the enabled and disabled cases (SUSE gcc 9.3.1 was good, in
the config I was trying on it); and builds where disabled was treated
rather poorly my way (with external call to mem_cgroup_disabled() from
lock_page_lruvec() and lock_page_lruvec_irqsave(), but inlined into
lock_page_lruvec_irq() - go figure! - with SUSE gcc 10.2.1).

I suspect a lot depends on what inlining is done, and on that prior
page_memcg() doing its "&", and the second mem_cgroup_disabled() which
follows immediately in mem_cgroup_lruvec(): different compilers will
make different choices, favouring one or the other ordering.

I've grown rather tired of it all (and discovered on the way that
static keys depend on CONFIG_JUMP_LABEL=y, which I didn't have in
a config I've carried forward through "make oldconfig"s for years -
thanks); but not found a decisive reason to change the patch.

Hugh

> 
> > return mem_cgroup_lruvec(memcg, pgdat);
> >  }
> >  
> > 


Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-06 Thread Hugh Dickins
On Wed, 6 Jan 2021, Andrea Arcangeli wrote:
> 
> I'd be surprised if the kernel can boot with BUG_ON() defined as "do
> {}while(0)" so I guess it doesn't make any difference.

I had been afraid of that too, when CONFIG_BUG is not set:
but I think it's actually "if (cond) do {} while (0)".
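
To make that concrete, a small userspace illustration (not the kernel's actual bug.h definitions): even when the macro expands to an empty if-body, the condition, and any side effects buried inside it, is still evaluated.

#include <assert.h>
#include <stdio.h>

/* Stand-in for a BUG_ON() whose failure path compiles away: the
 * condition expression must still be evaluated, or statements with
 * side effects hidden inside it would be lost. */
#define FAKE_BUG_ON(cond) do { if (cond) { } } while (0)

static int calls;

static int effective_statement(void)
{
	calls++;	/* the functionally effective part */
	return 0;	/* 0 means success */
}

int main(void)
{
	FAKE_BUG_ON(effective_statement());
	printf("condition evaluated %d time(s)\n", calls);	/* prints 1 */
	assert(calls == 1);
	return 0;
}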


Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-06 Thread Hugh Dickins
On Wed, 6 Jan 2021, Andrew Morton wrote:
> On Tue, 5 Jan 2021 20:28:27 -0800 (PST) Hugh Dickins  wrote:
> 
> > Alex, please consider why the authors of these lines (whom you
> > did not Cc) chose to write them without BUG_ON(): it has always
> > been preferred practice to use BUG_ON() on predicates, but not on
> > functionally effective statements (sorry, I've forgotten the proper
> > term: I'd say statements with side-effects, but here they are not
> > just side-effects: they are their main purpose).
> > 
> > We prefer not to hide those away inside BUG macros
> 
> Should we change that?  I find BUG_ON(something_which_shouldnt_fail())
> to be quite natural and readable.

Fair enough.  Whereas my mind tends to filter out the BUG lines when
skimming code, knowing they can be skipped, not needing that effort
to pull out what's inside them.

Perhaps I'm a relic and everyone else is with you: I can only offer
my own preference, which until now was supported by kernel practice.

> 
> As are things like the existing
> 
> BUG_ON(mmap_read_trylock(mm));
> BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));
> 
> etc.

People say "the exception proves the rule".  Perhaps we should invite a
shower of patches to change those?  (I'd prefer not, I'm no fan of churn.)

> 
> No strong opinion here, but is current mostly-practice really
> useful?

You've seen my vote.  Now let the games begin!

Hugh


Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-05 Thread Hugh Dickins
On Sat, 12 Dec 2020, Alex Shi wrote:
> 
> I'm very sorry, a typo here. the patch should be updated:
> 
> From ed4fa1c6d5bed5766c5f0c35af0c597855d7be06 Mon Sep 17 00:00:00 2001
> From: Alex Shi 
> Date: Fri, 11 Dec 2020 21:26:46 +0800
> Subject: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()
> 
> coccinelle reports some warnings:
> WARNING: Use BUG_ON instead of if condition followed by BUG.
> 
> It could be fixed by BUG_ON().
> 
> Reported-by: ab...@linux.alibaba.com
> Signed-off-by: Alex Shi 

When diffing mmotm just now, I was sorry to find this: NAK.

Alex, please consider why the authors of these lines (whom you
did not Cc) chose to write them without BUG_ON(): it has always
been preferred practice to use BUG_ON() on predicates, but not on
functionally effective statements (sorry, I've forgotten the proper
term: I'd say statements with side-effects, but here they are not
just side-effects: they are their main purpose).

We prefer not to hide those away inside BUG macros: please fix your
"abaci" to respect kernel style here - unless it turns out that the
kernel has moved away from that, and it's me who's behind the times.

Andrew, if you agree, please drop
mm-mmap-replace-if-cond-bug-with-bug_on.patch
from your stack.

(And did Minchan really Ack it? I see an Ack from Minchan to a
similar mm/zsmalloc patch: which surprises me, but is Minchan's
business not mine; but that patch is not in mmotm.)

On the whole, I think there are far too many patches submitted,
where Developer B chooses to rewrite a line to their own preference,
without respecting that Author A chose to write it in another way.
That's great when it really does improve readability, but often not.

Thanks,
Hugh

> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/mmap.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8144fc3c5a78..107fa91bb59f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -712,9 +712,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
>   struct vm_area_struct *prev;
>   struct rb_node **rb_link, *rb_parent;
>  
> - if (find_vma_links(mm, vma->vm_start, vma->vm_end,
> -    &prev, &rb_link, &rb_parent))
> - BUG();
> + BUG_ON(find_vma_links(mm, vma->vm_start, vma->vm_end,
> +    &prev, &rb_link, &rb_parent));
>   __vma_link(mm, vma, prev, rb_link, rb_parent);
>   mm->map_count++;
>  }
> @@ -3585,9 +3584,8 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
>* can't change from under us thanks to the
>* anon_vma->root->rwsem.
>*/
> - if (__test_and_set_bit(0, (unsigned long *)
> -    &anon_vma->root->rb_root.rb_root.rb_node))
> - BUG();
> + BUG_ON(__test_and_set_bit(0, (unsigned long *)
> +    &anon_vma->root->rb_root.rb_root.rb_node));
>   }
>  }
>  
> @@ -3603,8 +3601,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
>* mm_all_locks_mutex, there may be other cpus
>* changing other bitflags in parallel to us.
>*/
> - if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
> - BUG();
> + BUG_ON(test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags));
>   down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_lock);
>   }
>  }
> @@ -3701,9 +3698,8 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma)
>* can't change from under us until we release the
>* anon_vma->root->rwsem.
>*/
> - if (!__test_and_clear_bit(0, (unsigned long *)
> -    &anon_vma->root->rb_root.rb_root.rb_node))
> - BUG();
> + BUG_ON(!__test_and_clear_bit(0, (unsigned long *)
> +    &anon_vma->root->rb_root.rb_root.rb_node));
>   anon_vma_unlock_write(anon_vma);
>   }
>  }
> @@ -3716,9 +3712,7 @@ static void vm_unlock_mapping(struct address_space *mapping)
>* because we hold the mm_all_locks_mutex.
>*/
>   i_mmap_unlock_write(mapping);
> - if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
> -     &mapping->flags))
> - BUG();
> + BUG_ON(!test_and_clear_bit(AS_MM_ALL_LOCKS, &mapping->flags));
>   }
>  }
>  
> -- 
> 2.29.GIT


Re: [PATCH v21 00/19] per memcg lru lock

2021-01-05 Thread Hugh Dickins
On Tue, 5 Jan 2021, Qian Cai wrote:
> On Tue, 2021-01-05 at 13:35 -0800, Hugh Dickins wrote:
> > This patchset went into mmotm 2020-11-16-16-23, so probably linux-next
> > on 2020-11-17: you'll have had three trouble-free weeks testing with it
> > in, so it's not a likely suspect.  I haven't looked yet at your report,
> > to think of a more likely suspect: will do.
> 
> Probably my memory was bad then. Unfortunately, I had 2 weeks holidays before
> the Thanksgiving as well. I have tried a few times so far and only been able
> to reproduce once. Looks nasty...

I have not found a likely suspect.

What it smells like is a defect in cloning anon_vma during fork,
such that mappings of the THP can get added even after all that
could be found were unmapped (tree lookup ordering should prevent
that).  But I've not seen any recent change there.

It would be very easily fixed by deleting the whole BUG() block,
which is only there as a sanity check for developers: but we would
not want to delete it without understanding why it has gone wrong
(and would also have to reconsider two related VM_BUG_ON_PAGEs).

It is possible that b6769834aac1 ("mm/thp: narrow lru locking") of this
patchset has changed the timing and made a pre-existing bug more likely
in some situations: it used to hold an lru_lock before that BUG() on
total_mapcount(), and now does not; but that's not a lock which should
be relevant to the check.

When you get more info (or not), please repost the bugstack in a
new email thread: this thread is not really useful for pursuing it.

Hugh


Re: [PATCH v21 00/19] per memcg lru lock

2021-01-05 Thread Hugh Dickins
On Tue, 5 Jan 2021, Qian Cai wrote:
> On Tue, 2021-01-05 at 11:42 -0800, Shakeel Butt wrote:
> > On Tue, Jan 5, 2021 at 11:30 AM Qian Cai  wrote:
> > > On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> > > > This version rebase on next/master 20201104, with much of Johannes's
> > > > Acks and some changes according to Johannes comments. And add a new patch
> > > > v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> > > > v21-0007.
> > > > 
> > > > This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> > > > added to -mm tree yesterday.
> > > > 
> > > > Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> > > > Johannes Weiner.
> > > 
> > > Given the troublesome history of this patchset, and had been put into
> > > linux-next recently, as well as it touched both THP and mlock. Is it a
> > > good idea to suspect this patchset introducing some races and a
> > > spontaneous crash with some mlock memory presume?
> > 
> > This has already been merged into the linus tree. Were you able to get
> > a similar crash on the latest upstream kernel as well?
> 
> No, I seldom test the mainline those days. Before the vacations, I have tested
> linux-next up to something like 12/10 which did not include this patchset IIRC
> and never saw any crash like this. I am still trying to figure out how to
> reproduce it fast, so I can try a revert to confirm.

This patchset went into mmotm 2020-11-16-16-23, so probably linux-next
on 2020-11-17: you'll have had three trouble-free weeks testing with it
in, so it's not a likely suspect.  I haven't looked yet at your report,
to think of a more likely suspect: will do.

Hugh


Re: kernel BUG at mm/page-writeback.c:LINE!

2021-01-05 Thread Hugh Dickins
On Tue, 5 Jan 2021, Linus Torvalds wrote:
> On Tue, Jan 5, 2021 at 11:31 AM Linus Torvalds
>  wrote:
> > On Mon, Jan 4, 2021 at 7:29 PM Hugh Dickins  wrote:
> > >
> > > > So the one-liner of changing the "if" to "while" in
> > > > wait_on_page_writeback() should get us back to what we used to do.
> > >
> > > I think that is the realistic way to go.
> >
> > Yeah, that's what I'll do.
> 
> I took your "way to go" statement as an ack, and made it all be commit
> c2407cf7d22d ("mm: make wait_on_page_writeback() wait for multiple
> pending writebacks").

Great, thanks, I see it now.

I was going to raise a question, whether you should now revert
073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)"):
which would not have gone in like that if c2407cf7d22d were already in.

But if it were reverted, we'd need some other fix for the PageTail part
of it; and I still cannot think of anywhere else where we knowingly
operated on a struct page without holding a reference; and there have
been no adverse reports on its extra get_page+put_page.

So I think it's safest to leave it in.

Hugh


Re: kernel BUG at mm/page-writeback.c:LINE!

2021-01-04 Thread Hugh Dickins
On Mon, 4 Jan 2021, Linus Torvalds wrote:
> On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton  
> wrote:
> 
> > Linus, how confident are you in those wait_on_page_bit_common()
> > changes?
> 
> Pretty confident. The atomicity of the bitops themselves is fairly simple.
> 
> But in the writeback bit? No. The old code would basically _loop_ if
> it was woken up and the writeback bit was set again, and would hide
> any problems with it.
> 
> The new code basically goes "ok, the writeback bit was clear at one
> point, so I've waited enough".
> 
> We could easily turn the "if ()" in wait_on_page_writeback() into a "while()".
> 
> But honestly, it does smell to me like the bug is always in the caller
> not having serialized with whatever actually starts writeback. Hugh
> figured out one such case.
> 
> This code holds the page lock, but I don't see where
> set_page_writeback() would always be done with the page lock held. So
> what really protects against PG_writeback simply being set again?
> 
> The whole BUG_ON() seems entirely buggy to me.
> 
> In fact, even if you hold the page lock while doing
> set_page_writeback(), since the actual IO does *NOT* hold the page
> lock, the unlock happens without it. So even if every single case of
> setting the page writeback were to hold the page lock,

I did an audit when this came up before, and though not 100% confident
in my diligence, it certainly looked that way; and others looked too
(IIRC Matthew had a patch to add a WARN_ON_ONCE or whatever, but that
didn't go upstream).

> what keeps this from happening:
> 
> CPU1 = end previous writeback
> CPU2 = start new writeback under page lock
> CPU3 = write_cache_pages()
> 
>   CPU1  CPU2CPU3
>     
> 
>   end_page_writeback()
> test_clear_page_writeback(page)
> ... delayed...
> 
> 
> lock_page();
> set_page_writeback()
> unlock_page()
> 
> 
> lock_page()
> wait_on_page_writeback();
> 
> wake_up_page(page, PG_writeback);
> .. wakes up CPU3 ..
> 
> BUG_ON(PageWriteback(page));
> 
> IOW, that BUG_ON() really feels entirely bogus to me. Notice how it
> wasn't actually serialized with the waking up of the _previous_ bit.

Well.  That looks so obvious now you suggest it, that I feel very
stupid for not seeing it before, so have tried hard to disprove you.
But I think you're right.

> 
> Could we make the wait_on_page_writeback() just loop if it sees the
> page under writeback again? Sure.
> 
> Could we make the wait_on_page_bit_common() code say "if this is
> PG_writeback, I won't wake it up after all, because the bit is set
> again?" Sure.
> 
> But I feel it's really that end_page_writeback() itself is
> fundamentally buggy, because the "wakeup" is not atomic with the bit
> clearing _and_ it doesn't actually hold the page lock that is
> allegedly serializing this all.

And we won't be adding a lock_page() into end_page_writeback()!

> 
> That raciness was what caused the "stale wakeup from previous owner"
> thing too. And I think that Hugh fixed the page re-use case, but the
> fundamental problem of end_page_writeback() kind of remained.
> 
> And yes, I think this was all hidden by wait_on_page_writeback()
> effectively looping over the "PageWriteback(page)" test because of how
> wait_on_page_bit() worked.
> 
> So the one-liner of changing the "if" to "while" in
> wait_on_page_writeback() should get us back to what we used to do.

I think that is the realistic way to go.

> 
> Except I still get the feeling that the bug really is not in
> wait_on_page_writeback(), but in the end_page_writeback() side.
> 
> Comments? I'm perfectly happy doing the one-liner. I would just be
> _happier_ with end_page_writeback() having the serialization..
> 
> The real problem is that "wake_up_page(page, bit)" is not the thing
> that actually clears the bit. So there's a fundamental race between
> clearing the bit and waking something up.
> 
> Which makes me think that the best option would actually be to move
> the bit clearing _into_ wake_up_page(). But that looks like a very big
> change.

I'll be surprised if that direction is even possible, without unpleasant
extra locking.  If there were only one wakeup to be done, perhaps, but
potentially there are many.  When I looked before, it seemed that the
clear bit needs to come before the wakeup, and the wakeup needs to come
before the clear bit.  And the BOOKMARK case drops q->lock.

It should be possible to rely on the XArray's i_pages lock rather than
the page lock for serialization, much as I did in one variant of the
patch I sent originally.  Updated version appended below for show
(most of it rearrangement+cleanup rather than the functional change);
but I think it's slightly incomplete (__test_set_page_writeback()
should take i_pages lock even in the !mapping_use_writeback_tags case);
and 

[PATCH] mm/memcontrol: fix warning in mem_cgroup_page_lruvec()

2021-01-03 Thread Hugh Dickins
Boot a CONFIG_MEMCG=y kernel with "cgroup_disable=memory" and you are
met by a series of warnings from the VM_WARN_ON_ONCE_PAGE(!memcg, page)
recently added to the inline mem_cgroup_page_lruvec().

An earlier attempt to place that warning, in mem_cgroup_lruvec(), had
been careful to do so after weeding out the mem_cgroup_disabled() case;
but was itself invalid because of the mem_cgroup_lruvec(NULL, pgdat) in
clear_pgdat_congested() and age_active_anon().

Warning in mem_cgroup_page_lruvec() was once useful in detecting a KSM
charge bug, so may be worth keeping: but skip if mem_cgroup_disabled().

Fixes: 9a1ac2288cf1 ("mm/memcontrol:rewrite mem_cgroup_page_lruvec()")
Signed-off-by: Hugh Dickins 
---

 include/linux/memcontrol.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 5.11-rc2/include/linux/memcontrol.h 2020-12-27 20:39:36.751923135 -0800
+++ linux/include/linux/memcontrol.h 2021-01-03 19:38:24.822978559 -0800
@@ -665,7 +665,7 @@ static inline struct lruvec *mem_cgroup_
 {
struct mem_cgroup *memcg = page_memcg(page);
 
-   VM_WARN_ON_ONCE_PAGE(!memcg, page);
+   VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
return mem_cgroup_lruvec(memcg, pgdat);
 }
 


Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-28 Thread Hugh Dickins
Got it at last, sorry it's taken so long.

On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov  
> > > wrote:
> > > >
> > > > So far I only found one more pin leak and always-true check. I don't see
> > > > how can it lead to crash or corruption. Keep looking.

Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.

> > > 
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > > 
> > > No, I think the problem is the
> > > 
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > 
> > > at the end of filemap_map_pages().
> > > 
> > > Why?
> > > 
> > > Because we've been updating vmf->pte as we go along:
> > > 
> > > vmf->pte += xas.xa_index - last_pgoff;
> > > 
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> > 
> > Well, if it's true we have a bigger problem: we set up a pte entry without
> > the relevant PTL.
> > 
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.

Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.

> > 
> > It made me look at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> 
> Ugh.. Wrong place. Need to sleep.
> 
> I'll look into your idea tomorrow.
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
>   } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>   rcu_read_unlock();
> + vmf->address = address;
> + vmf->pte = NULL;
>   WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>  
>   return ret;
> -- 

And that made no (noticeable) difference either.  But at last
I realized, it's absolutely on the right track, but missing the
couple of early returns at the head of filemap_map_pages(): add

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
 
rcu_read_lock();
head = first_map_page(vmf, &xas, end_pgoff);
-   if (!head) {
-   rcu_read_unlock();
-   return 0;
-   }
+   if (!head)
+   goto out;
 
if (filemap_map_pmd(vmf, head)) {
-   rcu_read_unlock();
-   return VM_FAULT_NOPAGE;
+   ret = VM_FAULT_NOPAGE;
+   goto out;
}
 
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
put_page(head);
} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
rcu_read_unlock();
vmf->address = address;
-   vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
return ret;
--

and then the corruption is fixed.  It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.

You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).

So, the mystery is solved; but I don't think any of these patches
should be applied.  Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and has given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.

I've grown rather jaded, and questioning the value of the rework:
I don't think I want to look at or test another for a week or so.

Hugh


Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-27 Thread Hugh Dickins
On Sun, 27 Dec 2020, Linus Torvalds wrote:
> On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov  
> wrote:
> >
> > I did what Hugh proposed and it got clear to my eyes. It gets somewhat
> > large, but take a look.
> 
> Ok, it's not that much bigger, and the end result is certainly much
> clearer wrt locking.
> 
> So that last version of yours with the fix for the uninitialized 'ret'
> variable looks good to me.
> 
> Of course, I've said that before, and have been wrong. So ...

And guess what... it's broken.

I folded it into testing rc1: segfault on cc1, systemd
"Journal file corrupted, rotating", seen on more than one machine.

I've backed it out, rc1 itself seems fine, I'll leave rc1 under
load overnight, then come back to the faultaround patch tomorrow;
won't glance at it tonight, but maybe Kirill will guess what's wrong.

Hugh


Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-27 Thread Hugh Dickins
On Sun, 27 Dec 2020, Damian Tometzki wrote:
> On Sun, 27. Dec 11:38, Linus Torvalds wrote:
> > On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins  wrote:
> > >
> > > This patch (like its antecedents) moves the pte_unmap_unlock() from
> > > after do_fault_around()'s "check if the page fault is solved" into
> > > filemap_map_pages() itself (which apparently does not NULLify vmf->pte
> > > after unmapping it, which is poor, but good for revealing this issue).
> > > That looks cleaner, but of course there was a very good reason for its
> > > original positioning.
> > 
> > Good catch.
> > 
> > > Maybe you want to change the ->map_pages prototype, to pass down the
> > > requested address too, so that it can report whether the requested
> > > address was resolved or not.  Or it could be left to __do_fault(),
> > > or even to a repeated fault; but those would be less efficient.
> > 
> > Let's keep the old really odd "let's unlock in the caller" for now,
> > and minimize the changes.
> > 
> > Adding a big big comment at the end of filemap_map_pages() to note the
> > odd delayed page table unlocking.
> > 
> > Here's an updated patch that combines Kirill's original patch, his
> > additional incremental patch, and the fix for the pte lock oddity into
> > one thing.
> > 
> > Does this finally pass your testing?

Yes, this one passes my testing on x86_64 and on i386.  But...

> > 
> >Linus
> Hello together,
> 
> when i try to build this patch, i got the following error:
> 
>  CC  arch/x86/kernel/cpu/mce/threshold.o
> mm/memory.c:3716:19: error: static declaration of ‘do_set_pmd’ follows 
> non-static declaration
>  static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>^~
> In file included from mm/memory.c:43:
> ./include/linux/mm.h:984:12: note: previous declaration of ‘do_set_pmd’ was 
> here
>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> ^~
> make[3]: *** [scripts/Makefile.build:279: mm/memory.o] Error 1
> make[2]: *** [Makefile:1805: mm] Error 2
> make[2]: *** Waiting for unfinished jobs
>   CC  arch/x86/kernel/cpu/mce/therm_throt.o

... Damian very helpfully reports that it does not build when
CONFIG_TRANSPARENT_HUGEPAGE is not set, since the "static " has
not been removed from the alternative definition of do_set_pmd().

And its BUILD_BUG() becomes invalid once it's globally available.
You don't like unnecessary BUG()s, and I don't like returning
success there: VM_FAULT_FALLBACK seems best.

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3713,10 +3713,9 @@ out:
return ret;
 }
 #else
-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 {
-   BUILD_BUG();
-   return 0;
+   return VM_FAULT_FALLBACK;
 }
 #endif
 

(I'm also a wee bit worried by filemap.c's +#include <asm/pgalloc.h>:
that's the kind of thing that might turn out not to work on some arch.)

Hugh

Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-26 Thread Hugh Dickins
On Sat, 26 Dec 2020, Hugh Dickins wrote:
> On Sun, 27 Dec 2020, Kirill A. Shutemov wrote:
> > 
> > Here's the fixup I have so far. It doesn't blow up immediately, but please
> > take a closer look. Who knows what stupid mistake I did this time. :/
> 
> It's been running fine on x86_64 for a couple of hours (but of course
> my testing is deficient, in not detecting the case Linus spotted).
> 
> But I just thought I'd try it on i386 (hadn't tried previous versions)
> and this has a new disappointment: crashes when booting, in the "check
> if the page fault is solved" in do_fault_around().  I imagine a highmem
> issue with kmap of the pte address, but I'm reporting now before looking
> into it further (but verified that current linux.git i386 boots up fine).

This patch (like its antecedents) moves the pte_unmap_unlock() from
after do_fault_around()'s "check if the page fault is solved" into
filemap_map_pages() itself (which apparently does not NULLify vmf->pte
after unmapping it, which is poor, but good for revealing this issue).
That looks cleaner, but of course there was a very good reason for its
original positioning.

Maybe you want to change the ->map_pages prototype, to pass down the
requested address too, so that it can report whether the requested
address was resolved or not.  Or it could be left to __do_fault(),
or even to a repeated fault; but those would be less efficient.
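
A minimal sketch of that prototype suggestion, purely hypothetical (neither the signature nor the return convention here is anything that has been merged): pass the originally requested address down and let ->map_pages() report back whether that address was mapped, so do_fault_around() can tell whether the fault is solved without retaking the pte lock.

	/* hypothetical: pass the faulting address down and report whether it was mapped */
	vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address,
				pgoff_t start_pgoff, pgoff_t end_pgoff);

	/* in do_fault_around(), something like: */
	ret = vma->vm_ops->map_pages(vmf, address, start_pgoff, end_pgoff);
	/* ret == VM_FAULT_NOPAGE would mean the requested address was resolved */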

> 
> Maybe easily fixed: but does suggest this needs exposure in linux-next.
> 
> Hugh


Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-26 Thread Hugh Dickins
On Sun, 27 Dec 2020, Kirill A. Shutemov wrote:
> 
> Here's the fixup I have so far. It doesn't blow up immediately, but please
> take a closer look. Who knows what stupid mistake I did this time. :/

It's been running fine on x86_64 for a couple of hours (but of course
my testing is deficient, in not detecting the case Linus spotted).

But I just thought I'd try it on i386 (hadn't tried previous versions)
and this has a new disappointment: crashes when booting, in the "check
if the page fault is solved" in do_fault_around().  I imagine a highmem
issue with kmap of the pte address, but I'm reporting now before looking
into it further (but verified that current linux.git i386 boots up fine).

Maybe easily fixed: but does suggest this needs exposure in linux-next.

Hugh


Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-26 Thread Hugh Dickins
On Sat, 26 Dec 2020, Kirill A. Shutemov wrote:
> On Sat, Dec 26, 2020 at 09:57:13AM -0800, Linus Torvalds wrote:
> > Because not only does that get rid of the "if (page)" test, I think it
> > would make things a bit clearer. When I read the patch first, the
> > initial "next_page()" call confused me.
> 
> Agreed. Here we go:
> 
> From d12dea4abe94dbc24b7945329b191ad7d29e213a Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" 
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
> 
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it called from an atomic context and PTE page table
> has to be preallocated. finish_fault() can sleep and allocate page table
> as needed.
> 
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
> 
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
> 
> The price is some code duplication to handle huge pages in faultaround
> path, but it should be fine, having overall improvement in readability.
> 
> Signed-off-by: Kirill A. Shutemov 

Hold on. I guess this one will suffer from the same bug as the previous.
I was about to report back, after satisfactory overnight testing of that
version - provided that one big little bug is fixed:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa
 
if (pmd_none(*vmf->pmd) &&
PageTransHuge(page) &&
-   do_set_pmd(vmf, page)) {
+   do_set_pmd(vmf, page) == 0) {
unlock_page(page);
return true;
}

(Yes, you can write that as !do_set_pmd(vmf, page), and maybe I'm odd,
but even though it's very common, I have a personal aversion to using
"!' on a positive-sounding function that returns 0 for success.)

I'll give the new patch a try now, but with that fix added in. Without it,
I got "Bad page" on compound_mapcount on file THP pages - but I run with
a BUG() inside of bad_page() so I cannot miss them: I did not look to see
what the eventual crash or page leak would look like without that.

Hugh


Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

2020-12-23 Thread Hugh Dickins
On Tue, 22 Dec 2020, Kirill A. Shutemov wrote:
> 
> Updated patch is below.
> 
> From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" 
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
> 
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it called from an atomic context and PTE page table
> has to be preallocated. finish_fault() can sleep and allocate page table
> as needed.
> 
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
> 
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
> 
> The price is some code duplication to handle huge pages in faultaround
> path, but it should be fine, having overall improvement in readability.
> 
> Signed-off-by: Kirill A. Shutemov 

It's not ready yet.

I won't pretend to have reviewed, but I did try applying and running
with it: mostly it seems to work fine, but turned out to be leaking
huge pages (with vmstat's thp_split_page_failed growing bigger and
bigger as page reclaim cannot get rid of them).

Aside from the actual bug, filemap_map_pmd() seems suboptimal at
present: comments below (plus one comment in do_anonymous_page()).

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0b2067b3c328..f8fdbe079375 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  }
>  EXPORT_SYMBOL(filemap_fault);
>  
> +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
> +   struct xa_state *xas)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct address_space *mapping = vma->vm_file->f_mapping;
> +
> + /* Huge page is mapped? No need to proceed. */
> + if (pmd_trans_huge(*vmf->pmd))
> + return true;
> +
> + if (xa_is_value(page))
> + goto nohuge;

I think it would be easier to follow if filemap_map_pages() never
passed this an xa_is_value(page): probably just skip them in its
initial xas_next_entry() loop.
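
For what it's worth, a sketch of that suggestion (assuming the same xas/xa_state iteration the patch already uses): skip value entries while picking up the next page, so filemap_map_pmd() never sees them.

	/* sketch only: skip shadow/swap value entries in the lookup loop */
	page = xas_find(&xas, end_pgoff);
	while (page && xa_is_value(page))
		page = xas_next_entry(&xas, end_pgoff);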

> +
> + if (!pmd_none(*vmf->pmd))
> + goto nohuge;

Then at nohuge it unconditionally takes pmd_lock(), finds !pmd_none,
and unlocks again: unnecessary overhead I believe we did not have before.

> +
> + if (!PageTransHuge(page) || PageLocked(page))
> + goto nohuge;

So if PageTransHuge, but someone else temporarily holds PageLocked,
we insert a page table at nohuge, sadly preventing it from being
mapped here later by huge pmd.

> +
> + if (!page_cache_get_speculative(page))
> + goto nohuge;
> +
> + if (page != xas_reload(xas))
> + goto unref;
> +
> + if (!PageTransHuge(page))
> + goto unref;
> +
> + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
> + goto unref;
> +
> + if (!trylock_page(page))
> + goto unref;
> +
> + if (page->mapping != mapping || !PageUptodate(page))
> + goto unlock;
> +
> + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), 
> PAGE_SIZE))
> + goto unlock;
> +
> + do_set_pmd(vmf, page);

Here is the source of the huge page leak: do_set_pmd() can fail
(and we would do better to have skipped most of its failure cases long
before getting this far).  It worked without leaking once I patched it:

-   do_set_pmd(vmf, page);
-   unlock_page(page);
-   return true;
+   if (do_set_pmd(vmf, page) == 0) {
+   unlock_page(page);
+   return true;
+   }

> + unlock_page(page);
> + return true;
> +unlock:
> + unlock_page(page);
> +unref:
> + put_page(page);
> +nohuge:
> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (likely(pmd_none(*vmf->pmd))) {
> + mm_inc_nr_ptes(vma->vm_mm);
> + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> + vmf->prealloc_pte = NULL;
> + }
> + spin_unlock(vmf->ptl);

I think it's a bit weird to hide this page table insertion inside
filemap_map_pmd() (I guess you're thinking that this function deals
with pmd level, but I'd find it easier to have a filemap_map_huge()
dealing with the huge mapping).  Better to do it on return into
filemap_map_pages(); maybe filemap_map_pmd() or filemap_map_huge()
would then need to return vm_fault_t rather than bool, I didn't try.

> +
> + /* See comment in handle_pte_fault() */
> + if (pmd_devmap_trans_unstable(vmf->pmd))
> + return true;
> +
> + return false;
> +}
...
> diff --git a/mm/memory.c b/mm/memory.c
> index c48f8df6e502..96d62774096a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault 
> *vmf)
>   if (pte_alloc(vma->vm_mm, vmf->pmd))
>   return VM_FAULT_OOM;
>  
> - /* See the comment in pte_alloc_one_map() */
> + 
