Re: WARNING at fs/btrfs/disk-io.c:275 during xfstests run(4k block)

2023-05-29 Thread Christoph Hellwig
This should be fixed by

"btrfs: fix the uptodate assert in btree_csum_one_bio"

on the list.


Re: [PATCH] powerpc/crypto: Fix aes-gcm-p10 link errors

2023-05-29 Thread Vishal Chourasia

On 5/25/23 20:35, Michael Ellerman wrote:

The recently added P10 AES/GCM code added some files containing
CRYPTOGAMS perl-asm code which are near duplicates of the p8 files
found in drivers/crypto/vmx.

In particular the newly added files produce functions with identical
names to the existing code.

When the kernel is built with CONFIG_CRYPTO_AES_GCM_P10=y and
CONFIG_CRYPTO_DEV_VMX_ENCRYPT=y that leads to link errors, eg:

   ld: drivers/crypto/vmx/aesp8-ppc.o: in function `aes_p8_set_encrypt_key':
   (.text+0xa0): multiple definition of `aes_p8_set_encrypt_key'; 
arch/powerpc/crypto/aesp8-ppc.o:(.text+0xa0): first defined here
   ...
   ld: drivers/crypto/vmx/ghashp8-ppc.o: in function `gcm_ghash_p8':
   (.text+0x140): multiple definition of `gcm_ghash_p8'; 
arch/powerpc/crypto/ghashp8-ppc.o:(.text+0x2e4): first defined here

Fix it for now by renaming the newly added files and functions to use
"p10" instead of "p8" in the names.

Fixes: 45a4672b9a6e ("crypto: p10-aes-gcm - Update Kconfig and Makefile")
Signed-off-by: Michael Ellerman 
---
  arch/powerpc/crypto/Makefile   | 10 +-
  arch/powerpc/crypto/aes-gcm-p10-glue.c | 18 +-
  .../crypto/{aesp8-ppc.pl => aesp10-ppc.pl} |  2 +-
  .../crypto/{ghashp8-ppc.pl => ghashp10-ppc.pl} | 12 ++--
  4 files changed, 21 insertions(+), 21 deletions(-)
  rename arch/powerpc/crypto/{aesp8-ppc.pl => aesp10-ppc.pl} (99%)
  rename arch/powerpc/crypto/{ghashp8-ppc.pl => ghashp10-ppc.pl} (97%)


I missed adding a Tested-by tag in my previous reply.

After applying the patch, I was able to successfully build the Linux kernel v6.4-rc4.

I encountered no errors during the build process. The issue pertaining to multiple
definitions of certain functions appears to be resolved.

λ grep -i CRYPTO_AES_GCM_P10 .config
CONFIG_CRYPTO_AES_GCM_P10=y

λ grep -i CRYPTO_DEV_VMX_ENCRYPT .config
CONFIG_CRYPTO_DEV_VMX_ENCRYPT=y

 + make O=${BUILD_DIR} CC=clang ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- -j16 -s vmlinux modules

Thank you for the well-detailed fix.

Tested-by: Vishal Chourasia 



Kernel crash with ftrace tests

2023-05-29 Thread Sachin Sant
While running ftrace-specific kernel selftests on an IBM Power9 server,
the following crash is observed. I have observed this crash only on Power9.
A similar test run on a Power10 server works.

[14350.791484] Kernel attempted to read user page (8) - exploit attempt? (uid: 
0)
[14350.791507] BUG: Kernel NULL pointer dereference on read at 0x0008
[14350.791514] Faulting instruction address: 0xc04bf0e0
[14350.791521] Oops: Kernel access of bad area, sig: 11 [#1]
[14350.791526] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[14350.791532] Modules linked in: nvram rpadlpar_io rpaphp uinput torture dummy 
veth tun nfsv3 nfs_acl nfs lockd grace fscache netfs brd overlay exfat vfat fat 
xfs loop sctp ip6_udp_tunnel udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject 
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set 
bonding tls rfkill nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto 
ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi 
scsi_transport_srp ibmveth fuse [last unloaded: test_cpuidle_latency(O)]
[14350.791616] CPU: 8 PID: 1169868 Comm: perl Tainted: G O 
6.4.0-rc4-g8b817fded42d #1
[14350.791623] Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 
of:IBM,FW950.50 (VL950_105) hv:phyp pSeries
[14350.791629] NIP: c04bf0e0 LR: c0498924 CTR: c02e8f60
[14350.791635] REGS: c002c8313830 TRAP: 0300 Tainted: G O 
(6.4.0-rc4-g8b817fded42d)
[14350.791641] MSR: 80009033  CR: 483139c4 XER: 
2004
[14350.791655] CFAR: c000dbbc DAR: 0008 DSISR: 4000 
IRQMASK: 0 
[14350.791655] GPR00: c0498924 c002c8313ad0 c1411300 
 
[14350.791655] GPR04: c0001d939d68  0001 
0008 
[14350.791655] GPR08:  8000 0800 
0001 
[14350.791655] GPR12: 3000 c0001ec56700  
 
[14350.791655] GPR16:    
 
[14350.791655] GPR20:  c2c67400 c001be8b4000 
c2c67378 
[14350.791655] GPR24: c0001a41d200 c0001a41d200 7fffb8eb 
 
[14350.791655] GPR28: 8603721f02c0 c0001d939d68  
c002c8313c18 
[14350.791724] NIP [c04bf0e0] page_remove_rmap+0x40/0x320
[14350.791732] LR [c0498924] wp_page_copy+0x384/0xde0
[14350.791738] Call Trace:
[14350.791741] [c002c8313ad0] [c0001a41d200] 0xc0001a41d200 
(unreliable)
[14350.791749] [c002c8313b20] [c0498924] wp_page_copy+0x384/0xde0
[14350.791756] [c002c8313bf0] [c04a1a34] 
__handle_mm_fault+0x9a4/0xf90
[14350.791764] [c002c8313cf0] [c04a2110] handle_mm_fault+0xf0/0x350
[14350.791771] [c002c8313d40] [c0094b8c] 
___do_page_fault+0x47c/0xc20
[14350.791780] [c002c8313df0] [c0095540] 
hash__do_page_fault+0x30/0x70
[14350.791788] [c002c8313e20] [c009e378] do_hash_fault+0x278/0x470
[14350.791794] [c002c8313e50] [c0008be0] 
data_access_common_virt+0x210/0x220
[14350.791802] --- interrupt: 300 at 0x7fffb8e91968
[14350.791807] NIP: 7fffb8e91968 LR: 7fffb8e91958 CTR: 
[14350.791812] REGS: c002c8313e80 TRAP: 0300 Tainted: G O 
(6.4.0-rc4-g8b817fded42d)
[14350.791818] MSR: 8280f033  CR: 
24000422 XER: 
[14350.791835] CFAR: 7fffb8e9185c DAR: 7fffb8eb DSISR: 0a00 
IRQMASK: 0 
[14350.791835] GPR00: 7fffb9684d14 7fffc55f14f0 7fffb8eb7f00 
7fffb8eb 
[14350.791835] GPR04: 0001 7fffb9122840 0001 
 
[14350.791835] GPR08: 7fffb9122890 0001  
7fffc55f1440 
[14350.791835] GPR12:  7fffb96dce80  
 
[14350.791835] GPR16:    
 
[14350.791835] GPR20: 7fffb8eafb38 7fffc55f15a8 7fffc55f1560 
7fffc55f1550 
[14350.791835] GPR24: 010022406210  7fffb96d0988 
7fffb96d 
[14350.791835] GPR28:  7fffb96d 7fffb8eafb38 
7fffc55f1550 
[14350.791902] NIP [7fffb8e91968] 0x7fffb8e91968
[14350.791907] LR [7fffb8e91958] 0x7fffb8e91958
[14350.791911] --- interrupt: 300
[14350.791914] Code: 7c0802a6 7d908026 fb61ffd8 fba1ffe8 fbc1fff0 fbe1fff8 
7c7e1b78 7c9d2378 7cbb2b78 91810008 f8010010 f821ffb1  712a0001 
3929 7fe3489e 
[14350.791937] ---[ end trace  ]---
[14350.793185] pstore: backend (nvram) writing error (-1)
[14350.793190] 
[14351.793195] Kernel panic - not syncing: Fatal exception

- Sachin

Re: [PATCH] powerpc/crypto: Fix aes-gcm-p10 link errors

2023-05-29 Thread Vishal Chourasia
After applying the patch, I was able to successfully build the Linux kernel v6.4-rc4.
I encountered no errors during the build process. The issue pertaining to multiple
definitions of certain functions appears to be resolved.

λ grep -i CRYPTO_AES_GCM_P10 .config
CONFIG_CRYPTO_AES_GCM_P10=y

λ grep -i CRYPTO_DEV_VMX_ENCRYPT .config
CONFIG_CRYPTO_DEV_VMX_ENCRYPT=y

 + make O=${BUILD_DIR} CC=clang ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- -j16 -s vmlinux modules

Thank you for the well-detailed fix.

 -- vishal.c


On 5/25/23 20:35, Michael Ellerman wrote:

The recently added P10 AES/GCM code added some files containing
CRYPTOGAMS perl-asm code which are near duplicates of the p8 files
found in drivers/crypto/vmx.

In particular the newly added files produce functions with identical
names to the existing code.

When the kernel is built with CONFIG_CRYPTO_AES_GCM_P10=y and
CONFIG_CRYPTO_DEV_VMX_ENCRYPT=y that leads to link errors, eg:

   ld: drivers/crypto/vmx/aesp8-ppc.o: in function `aes_p8_set_encrypt_key':
   (.text+0xa0): multiple definition of `aes_p8_set_encrypt_key'; 
arch/powerpc/crypto/aesp8-ppc.o:(.text+0xa0): first defined here
   ...
   ld: drivers/crypto/vmx/ghashp8-ppc.o: in function `gcm_ghash_p8':
   (.text+0x140): multiple definition of `gcm_ghash_p8'; 
arch/powerpc/crypto/ghashp8-ppc.o:(.text+0x2e4): first defined here

Fix it for now by renaming the newly added files and functions to use
"p10" instead of "p8" in the names.

Fixes: 45a4672b9a6e ("crypto: p10-aes-gcm - Update Kconfig and Makefile")
Signed-off-by: Michael Ellerman 
---
  arch/powerpc/crypto/Makefile   | 10 +-
  arch/powerpc/crypto/aes-gcm-p10-glue.c | 18 +-
  .../crypto/{aesp8-ppc.pl => aesp10-ppc.pl} |  2 +-
  .../crypto/{ghashp8-ppc.pl => ghashp10-ppc.pl} | 12 ++--
  4 files changed, 21 insertions(+), 21 deletions(-)
  rename arch/powerpc/crypto/{aesp8-ppc.pl => aesp10-ppc.pl} (99%)
  rename arch/powerpc/crypto/{ghashp8-ppc.pl => ghashp10-ppc.pl} (97%)

diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile
index 05c7486f42c5..7b4f516abec1 100644
--- a/arch/powerpc/crypto/Makefile
+++ b/arch/powerpc/crypto/Makefile
@@ -22,15 +22,15 @@ sha1-ppc-spe-y := sha1-spe-asm.o sha1-spe-glue.o
  sha256-ppc-spe-y := sha256-spe-asm.o sha256-spe-glue.o
  crc32c-vpmsum-y := crc32c-vpmsum_asm.o crc32c-vpmsum_glue.o
  crct10dif-vpmsum-y := crct10dif-vpmsum_asm.o crct10dif-vpmsum_glue.o
-aes-gcm-p10-crypto-y := aes-gcm-p10-glue.o aes-gcm-p10.o ghashp8-ppc.o aesp8-ppc.o
+aes-gcm-p10-crypto-y := aes-gcm-p10-glue.o aes-gcm-p10.o ghashp10-ppc.o aesp10-ppc.o
  
   quiet_cmd_perl = PERL$@
 cmd_perl = $(PERL) $< $(if $(CONFIG_CPU_LITTLE_ENDIAN), linux-ppc64le, linux-ppc64) > $@
  
-targets += aesp8-ppc.S ghashp8-ppc.S
+targets += aesp10-ppc.S ghashp10-ppc.S
  
-$(obj)/aesp8-ppc.S $(obj)/ghashp8-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE
+$(obj)/aesp10-ppc.S $(obj)/ghashp10-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE
 	$(call if_changed,perl)
  
-OBJECT_FILES_NON_STANDARD_aesp8-ppc.o := y
-OBJECT_FILES_NON_STANDARD_ghashp8-ppc.o := y
+OBJECT_FILES_NON_STANDARD_aesp10-ppc.o := y
+OBJECT_FILES_NON_STANDARD_ghashp10-ppc.o := y
diff --git a/arch/powerpc/crypto/aes-gcm-p10-glue.c 
b/arch/powerpc/crypto/aes-gcm-p10-glue.c
index bd3475f5348d..4b6e899895e7 100644
--- a/arch/powerpc/crypto/aes-gcm-p10-glue.c
+++ b/arch/powerpc/crypto/aes-gcm-p10-glue.c
@@ -30,15 +30,15 @@ MODULE_AUTHOR("Danny Tsen   
-asmlinkage int aes_p8_set_encrypt_key(const u8 *userKey, const int bits,
+asmlinkage int aes_p10_set_encrypt_key(const u8 *userKey, const int bits,
  void *key);
-asmlinkage void aes_p8_encrypt(const u8 *in, u8 *out, const void *key);
+asmlinkage void aes_p10_encrypt(const u8 *in, u8 *out, const void *key);
  asmlinkage void aes_p10_gcm_encrypt(u8 *in, u8 *out, size_t len,
void *rkey, u8 *iv, void *Xi);
  asmlinkage void aes_p10_gcm_decrypt(u8 *in, u8 *out, size_t len,
void *rkey, u8 *iv, void *Xi);
  asmlinkage void gcm_init_htable(unsigned char htable[256], unsigned char 
Xi[16]);
-asmlinkage void gcm_ghash_p8(unsigned char *Xi, unsigned char *Htable,
+asmlinkage void gcm_ghash_p10(unsigned char *Xi, unsigned char *Htable,
unsigned char *aad, unsigned int alen);
  
  struct aes_key {

@@ -93,7 +93,7 @@ static void set_aad(struct gcm_ctx *gctx, struct Hash_ctx 
*hash,
gctx->aadLen = alen;
i = alen & ~0xf;
if (i) {
-   gcm_ghash_p8(nXi, hash->Htable+32, aad, i);
+   gcm_ghash_p10(nXi, hash->Htable+32, aad, i);
aad += i;
alen -= i;
}
@@ -102,7 +102,7 @@ static void set_aad(struct gcm_ctx *gctx, struct Hash_ctx 
*hash,
nXi[i] ^= aad[i];
  
  		memset(gctx->aad_hash, 0, 16);

-   gcm_ghash_p8(gctx->aad_hash, 

WARNING at fs/btrfs/disk-io.c:275 during xfstests run(4k block)

2023-05-29 Thread Sachin Sant
While running xfstests against btrfs (with 4k block size) on an IBM Power server
booted with 6.4.0-rc3-next-20230525, the following warning is seen. This warning
is seen several times during the test run.

This problem was first seen with 6.4.0-rc2-next-20230516

[ 365.885258] run fstests btrfs/001 at 2023-05-29 14:40:10
[ 366.070754] BTRFS: device fsid 0732-a1c0-452c-8409-4fa8f9b69e51 devid 1 
transid 6 /dev/sdb2 scanned by mkfs.btrfs (27139)
[ 366.076267] BTRFS info (device sdb2): using crc32c (crc32c-generic) checksum 
algorithm
[ 366.076286] BTRFS info (device sdb2): using free space tree
[ 366.076292] BTRFS warning (device sdb2): read-write for sector size 4096 with 
page size 65536 is experimental
[ 366.078153] BTRFS info (device sdb2): enabling ssd optimizations
[ 366.078319] BTRFS info (device sdb2): checking UUID tree
[ 366.084911] [ cut here ]
[ 366.084923] WARNING: CPU: 16 PID: 27174 at fs/btrfs/disk-io.c:275 
btree_csum_one_bio+0xdc/0x310 [btrfs]
[ 366.084962] Modules linked in: dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 
nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding rfkill 
tls ip_set nf_tables nfnetlink sunrpc btrfs blake2b_generic xor raid6_pq 
zstd_compress nd_pmem libcrc32c nd_btt dax_pmem pseries_rng papr_scm libnvdimm 
vmx_crypto aes_gcm_p10_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft 
crc64 sg ibmvscsi ibmveth scsi_transport_srp fuse
[ 366.085003] CPU: 16 PID: 27174 Comm: btrfs Tainted: G W 
6.4.0-rc3-next-20230525 #1
[ 366.085009] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1030.00 (NH1030_026) hv:phyp pSeries
[ 366.085014] NIP: c00801a2eed4 LR: c00801b1b5a4 CTR: 
[ 366.085018] REGS: c0001a3bb030 TRAP: 0700 Tainted: G W 
(6.4.0-rc3-next-20230525)
[ 366.085023] MSR: 8282b033  CR: 
24002882 XER: 2004
[ 366.085033] CFAR: c00801a2ee78 IRQMASK: 0 
[ 366.085033] GPR00: c00801b1b5a4 c0001a3bb2d0 c0080165a700 
c000d18cca80 
[ 366.085033] GPR04: c001029da290  4000 
01d44000 
[ 366.085033] GPR08: c00c0021fd40   
ca19d890 
[ 366.085033] GPR12: 84002482 c00aa7cf6f00  
0001 
[ 366.085033] GPR16: c0001a3bb790  c0001a3bb770 
 
[ 366.085033] GPR20: c0001a3bb558 01d4  
c0001a3bb558 
[ 366.085033] GPR24: c000879ed000 c000d18cca80 c000fd947a80 
c000879ed000 
[ 366.085033] GPR28: c000d18ccb30 4000 c2c67490 
c000d1a0c558 
[ 366.085078] NIP [c00801a2eed4] btree_csum_one_bio+0xdc/0x310 [btrfs]
[ 366.085110] LR [c00801b1b5a4] btrfs_bio_csum+0x4c/0x70 [btrfs]
[ 366.085139] Call Trace:
[ 366.085141] [c0001a3bb2d0] [0006] 0x6 (unreliable)
[ 366.085146] [c0001a3bb380] [01d44000] 0x1d44000
[ 366.085150] [c0001a3bb3a0] [c00801b1c554] 
btrfs_submit_chunk+0x51c/0x560 [btrfs]
[ 366.085179] [c0001a3bb440] [c00801b1c67c] btrfs_submit_bio+0x44/0x70 
[btrfs]
[ 366.085207] [c0001a3bb470] [c00801a74ec0] write_one_eb+0x2f8/0x3c0 
[btrfs]
[ 366.085242] [c0001a3bb4d0] [c00801a75460] 
btree_write_cache_pages+0x4d8/0x7b0 [btrfs]
[ 366.085276] [c0001a3bb660] [c00801a2d04c] btree_writepages+0x94/0xe0 
[btrfs]
[ 366.085308] [c0001a3bb690] [c044df24] do_writepages+0x114/0x290
[ 366.085316] [c0001a3bb710] [c0436f58] 
filemap_fdatawrite_wbc+0xb8/0x140
[ 366.085322] [c0001a3bb750] [c043df60] 
__filemap_fdatawrite_range+0x80/0xc0
[ 366.085327] [c0001a3bb800] [c00801a37dc0] 
btrfs_write_marked_extents+0x78/0x1b0 [btrfs]
[ 366.085360] [c0001a3bb870] [c00801a37f60] 
btrfs_write_and_wait_transaction.isra.22+0x68/0x140 [btrfs]
[ 366.085393] [c0001a3bb8e0] [c00801a3a2d8] 
btrfs_commit_transaction+0x610/0x11f0 [btrfs]
[ 366.085425] [c0001a3bb9d0] [c00801a8ccf0] 
btrfs_mksubvol.isra.36+0x4e8/0x5b0 [btrfs]
[ 366.085460] [c0001a3bbab0] [c00801a8ce8c] btrfs_mksnapshot+0xd4/0x140 
[btrfs]
[ 366.085493] [c0001a3bbb20] [c00801a8d110] 
__btrfs_ioctl_snap_create+0x218/0x260 [btrfs]
[ 366.085526] [c0001a3bbbd0] [c00801a8d3fc] 
btrfs_ioctl_snap_create_v2+0x1a4/0x1f0 [btrfs]
[ 366.085559] [c0001a3bbc30] [c00801a910a4] btrfs_ioctl+0x1dac/0x3180 
[btrfs]
[ 366.085592] [c0001a3bbdc0] [c05a7098] sys_ioctl+0xf8/0x190
[ 366.085598] [c0001a3bbe10] [c00374b0] 
system_call_exception+0x140/0x350
[ 366.085604] [c0001a3bbe50] [c000d6a0] 
system_call_common+0x160/0x2e4
[ 366.085610] --- interrupt: c00 at 0x7fffa5f29b70
[ 366.085614] NIP: 7fffa5f29b70 LR: 1007a968 CTR: 
[ 366.085618] REGS: c0001a3bbe80 TRAP: 0c00 Tainted: G W 
(6.4.0-rc3-next-20230525)
[ 366.085625] MSR: 

Re: [PATCH] powerpc: Fail build if using recordmcount with binutils v2.37

2023-05-29 Thread Michael Ellerman
Naveen N Rao  writes:
> binutils v2.37 drops unused section symbols, which prevents recordmcount
> from capturing mcount locations in sections that have no non-weak
> symbols. This results in a build failure with a message such as:
>   Cannot find symbol for section 12: .text.perf_callchain_kernel.
>   kernel/events/callchain.o: failed
>
> The change to binutils was reverted for v2.38, so this behavior is
> specific to binutils v2.37:
> https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=c09c8b42021180eee9495bd50d8b35e683d3901b
>
> Objtool is able to cope with such sections, so this issue is specific to
> recordmcount.
>
> Fail the build and print a warning if binutils v2.37 is detected and if
> we are using recordmcount.
>
> Cc: sta...@vger.kernel.org
> Suggested-by: Joel Stanley 
> Signed-off-by: Naveen N Rao 
> ---
>  arch/powerpc/Makefile | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index dca73f673d7046..f0540c1f1377c8 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -409,3 +409,11 @@ checkbin:
>   echo -n '*** Please use a different binutils version.' ; \
>   false ; \
>   fi
> + @if test "x${CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT}" = "xy" -a \
> + "x${CONFIG_LD_IS_BFD}" = "xy" -a \
> + "x$(call ld-ifversion, -eq, 23700, y)" = "xy" ; then \
> + echo -n '*** binutils 2.37 drops unused section symbols, which 
> recordmcount ' ; \
> + echo 'is unable to handle.' ; \
> + echo '*** Please use a different binutils version.' ; \
> + false ; \
> + fi

Thanks for doing this.

Masahiro wanted to remove ld-ifversion; he suggested just checking
CONFIG_LD_VERSION directly instead. Mind doing a v2 with that change?

https://lore.kernel.org/all/CAK7LNAQWtDHOs=k+qznt5u1widv86tchkj4zoer4wtvrb97...@mail.gmail.com/
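
For illustration, a rough sketch (untested) of what that v2 check might look like,
assuming CONFIG_LD_VERSION carries the same 23700 encoding that the ld-ifversion
call above compares against:

	@if test "x${CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT}" = "xy" -a \
		"x${CONFIG_LD_IS_BFD}" = "xy" -a \
		"x${CONFIG_LD_VERSION}" = "x23700" ; then \
		echo -n '*** binutils 2.37 drops unused section symbols, which recordmcount ' ; \
		echo 'is unable to handle.' ; \
		echo '*** Please use a different binutils version.' ; \
		false ; \
	fi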

cheers


Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-29 Thread Peter Xu
On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote:
> Simplify shmem and file THP collapse's retract_page_tables(), and relax
> its locking: to improve its success rate and to lessen impact on others.
> 
> Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
> target_mm, leave that part of the work to madvise_collapse() calling
> collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
> result code to arrange for that.  That spares retract_page_tables() four
> arguments; and since it will be successful in retracting all of the page
> tables expected of it, no need to track and return a result code itself.
> 
> It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
> but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
> allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
> THPs.  retract_page_tables() just needs to use those same spinlocks to
> exclude it briefly, while transitioning pmd from page table to none: so
> restore its use of pmd_lock() inside of which pte lock is nested.
> 
> Users of pte_offset_map_lock() etc all now allow for them to fail:
> so retract_page_tables() now has no use for mmap_write_trylock() or
> vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
> it does not even need the mmap_read_lock().
> 
> But those users do expect the page table to remain a good page table,
> until they unlock and rcu_read_unlock(): so the page table cannot be
> freed immediately, but rather by the recently added pte_free_defer().
> 
> retract_page_tables() can be enhanced to replace_page_tables(), which
> inserts the final huge pmd without mmap lock: going through an invalid
> state instead of pmd_none() followed by fault.  But that does raise some
> questions, and requires a more complicated pte_free_defer() for powerpc
> (when its arch_needs_pgtable_deposit() for shmem and file THPs).  Leave
> that enhancement to a later release.
> 
> Signed-off-by: Hugh Dickins 
> ---
>  mm/khugepaged.c | 169 +---
>  1 file changed, 60 insertions(+), 109 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1083f0e38a07..4fd408154692 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr,
>   break;
>   case SCAN_PMD_NONE:
>   /*
> -  * In MADV_COLLAPSE path, possible race with khugepaged where
> -  * all pte entries have been removed and pmd cleared.  If so,
> -  * skip all the pte checks and just update the pmd mapping.
> +  * All pte entries have been removed and pmd cleared.
> +  * Skip all the pte checks and just update the pmd mapping.
>*/
>   goto maybe_install_pmd;
>   default:
> @@ -1748,123 +1747,73 @@ static void 
> khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
>   mmap_write_unlock(mm);
>  }
>  
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -struct mm_struct *target_mm,
> -unsigned long target_addr, struct page *hpage,
> -struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
>   struct vm_area_struct *vma;
> - int target_result = SCAN_FAIL;
>  
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
>   vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
> - int result = SCAN_FAIL;
> - struct mm_struct *mm = NULL;
> - unsigned long addr = 0;
> - pmd_t *pmd;
> - bool is_target = false;
> + struct mm_struct *mm;
> + unsigned long addr;
> + pmd_t *pmd, pgt_pmd;
> + spinlock_t *pml;
> + spinlock_t *ptl;
>  
>   /*
>* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -  * got written to. These VMAs are likely not worth investing
> -  * mmap_write_lock(mm) as PMD-mapping is likely to be split
> -  * later.
> +  * got written to. These VMAs are likely not worth removing
> +  * page tables from, as PMD-mapping is likely to be split later.
>*
> -  * Note that vma->anon_vma check is racy: it can be set up after
> -  * the check but before we took mmap_lock by the fault path.
> -  * But page lock would prevent establishing any new ptes of the
> -  * page, so we are safe.
> -  *
> -  * An alternative would be drop the check, but check that page
> -  * table is clear before calling pmdp_collapse_flush() under
> -  * ptl. It has higher chance to recover THP for the VMA, 

[PATCH] powerpc/kcsan: Properly instrument arch_spin_unlock()

2023-05-29 Thread Christophe Leroy
The following boottime error is encountered with an SMP kernel:

  kcsan: improperly instrumented type=(0): arch_spin_unlock(_spinlock)
  kcsan: improperly instrumented type=(0): spin_unlock(_spinlock)
  kcsan: improperly instrumented type=(KCSAN_ACCESS_WRITE): 
arch_spin_unlock(_spinlock)
  kcsan: improperly instrumented type=(KCSAN_ACCESS_WRITE): 
spin_unlock(_spinlock)
  kcsan: improperly instrumented type=(KCSAN_ACCESS_WRITE | 
KCSAN_ACCESS_COMPOUND): arch_spin_unlock(_spinlock)
  kcsan: improperly instrumented type=(KCSAN_ACCESS_WRITE | 
KCSAN_ACCESS_COMPOUND): spin_unlock(_spinlock)
  kcsan: selftest: test_barrier failed
  kcsan: selftest: 2/3 tests passed
  Kernel panic - not syncing: selftests failed

Properly instrument arch_spin_unlock() with kcsan_mb().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/simple_spinlock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/simple_spinlock.h 
b/arch/powerpc/include/asm/simple_spinlock.h
index 9dcc7e9993b9..4dd12dcb9ef8 100644
--- a/arch/powerpc/include/asm/simple_spinlock.h
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -15,6 +15,7 @@
  * (the type definitions are in asm/simple_spinlock_types.h)
  */
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -126,6 +127,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
+   kcsan_mb();
__asm__ __volatile__("# arch_spin_unlock\n\t"
PPC_RELEASE_BARRIER: : :"memory");
lock->slock = 0;
-- 
2.40.1



Re: [PATCH 1/1] arch:hexagon/powerpc: use KSYM_NAME_LEN in array size

2023-05-29 Thread Miguel Ojeda
On Mon, May 29, 2023 at 1:08 PM Maninder Singh  wrote:
>
> I Will add co-developed-by` tag.
> because this change was identified while we were working on kallsyms some 
> time back.
> https://lore.kernel.org/lkml/yontol4zc4cyt...@infradead.org/t/
>
> this patch set is pending and we will start working on that again, so i 
> thought better
> to send bugfix first.

Sounds good to me!

(Fixed Wedson's email address)

> Yes, I think second buffer was not related to kallsyms, so I have not touched 
> that.

Kees: what is the current stance on `[static N]` parameters? Something like:

const char *kallsyms_lookup(unsigned long addr,
unsigned long *symbolsize,
unsigned long *offset,
-   char **modname, char *namebuf);
+   char **modname, char namebuf[static KSYM_NAME_LEN]);

makes the compiler complain about cases like these (even if trivial):

arch/powerpc/xmon/xmon.c:1711:10: error: array argument is too small;
contains 128 elements, callee requires at least 512
[-Werror,-Warray-bounds]
name = kallsyms_lookup(pc, , , NULL, tmpstr);
 ^   ~~
./include/linux/kallsyms.h:86:29: note: callee declares array
parameter as static here
char **modname, char namebuf[static KSYM_NAME_LEN]);
 ^  ~~

But I only see 2 files in the kernel using `[static N]` (from 2020 and
2021). Should something else be used instead (e.g. `__counted_by`),
even if constexpr-sized?
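
For context, a minimal standalone sketch of how a `[static N]` parameter
produces that diagnostic -- toy names only, not the kernel prototype; clang's
-Warray-bounds flags the second call:

#include <stdio.h>
#include <string.h>

#define NAME_LEN 64

/* [static NAME_LEN] promises the callee at least NAME_LEN accessible
 * elements, so it may legitimately touch namebuf[0..NAME_LEN-1]. */
static size_t name_length(const char namebuf[static NAME_LEN])
{
	return strnlen(namebuf, NAME_LEN);
}

int main(void)
{
	char big[NAME_LEN] = "long enough";
	char small[16] = "too small";

	printf("%zu\n", name_length(big));	/* fine: 64 elements provided */
	printf("%zu\n", name_length(small));	/* warning: array argument is too small */
	return 0;
}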

Also, I went through the other callers to `kallsyms_lookup` to see
other issues -- one I am not sure about is `fetch_store_symstring` in
`kernel/trace/trace_probe_tmpl.h`. Steven/Masami: is that "with max
length" in the function docs enough? Is it 0x?

Thanks!

Cheers,
Miguel


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-05-29 Thread Hugh Dickins
On Mon, 29 May 2023, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +   struct page *page;
> > +
> > +   page = virt_to_page(pgtable);
> > +   call_rcu(>rcu_head, pte_free_now);
> > +}
> 
> This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> tables sharing one 64kB page.  So if you have two page tables from the
> same page being defer-freed simultaneously, you'll reuse the rcu_head
> and I cannot imagine things go well from that point.

Oh yes, of course, thanks for catching that so quickly.
So my s390 and sparc implementations will be equally broken.

> 
> I have no idea how to solve this problem.

I do: I'll have to go back to the more complicated implementation we
actually ran with on powerpc - I was thinking those complications just
related to deposit/withdraw matters, forgetting the one-rcu_head issue.

It uses large (0x1) increments of the page refcount, avoiding
call_rcu() when already active.

It's not a complication I had wanted to explain or test for now,
but we shall have to.  Should apply equally well to sparc, but s390
more of a problem, since s390 already has its own refcount cleverness.

Thanks, I must dash, out much of the day.

Hugh


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-05-29 Thread Matthew Wilcox
On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + call_rcu(>rcu_head, pte_free_now);
> +}

This can't be safe (on ppc).  IIRC you might have up to 16x4k page
tables sharing one 64kB page.  So if you have two page tables from the
same page being defer-freed simultaneously, you'll reuse the rcu_head
and I cannot imagine things go well from that point.

I have no idea how to solve this problem.


Re: [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map()

2023-05-29 Thread Matthew Wilcox
On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> +#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
> + (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
> +/*
> + * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
> + * the barriers in pmdp_get_lockless() cannot guarantee that the value in
> + * pmd_high actually belongs with the value in pmd_low; but holding 
> interrupts
> + * off blocks the TLB flush between present updates, which guarantees that a
> + * successful __pte_offset_map() points to a page from matched halves.
> + */
> +#define config_might_irq_save(flags) local_irq_save(flags)
> +#define config_might_irq_restore(flags)  local_irq_restore(flags)
> +#else
> +#define config_might_irq_save(flags)
> +#define config_might_irq_restore(flags)

I don't like the name.  It should indicate that it's PMD-related, so
pmd_read_start(flags) / pmd_read_end(flags)?



[PATCH] powerpc: Fail build if using recordmcount with binutils v2.37

2023-05-29 Thread Naveen N Rao
binutils v2.37 drops unused section symbols, which prevents recordmcount
from capturing mcount locations in sections that have no non-weak
symbols. This results in a build failure with a message such as:
Cannot find symbol for section 12: .text.perf_callchain_kernel.
kernel/events/callchain.o: failed

The change to binutils was reverted for v2.38, so this behavior is
specific to binutils v2.37:
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=c09c8b42021180eee9495bd50d8b35e683d3901b

Objtool is able to cope with such sections, so this issue is specific to
recordmcount.

Fail the build and print a warning if binutils v2.37 is detected and if
we are using recordmcount.

Cc: sta...@vger.kernel.org
Suggested-by: Joel Stanley 
Signed-off-by: Naveen N Rao 
---
 arch/powerpc/Makefile | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index dca73f673d7046..f0540c1f1377c8 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -409,3 +409,11 @@ checkbin:
echo -n '*** Please use a different binutils version.' ; \
false ; \
fi
+   @if test "x${CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT}" = "xy" -a \
+   "x${CONFIG_LD_IS_BFD}" = "xy" -a \
+   "x$(call ld-ifversion, -eq, 23700, y)" = "xy" ; then \
+   echo -n '*** binutils 2.37 drops unused section symbols, which 
recordmcount ' ; \
+   echo 'is unable to handle.' ; \
+   echo '*** Please use a different binutils version.' ; \
+   false ; \
+   fi

base-commit: 7b2f56d76feff3b494d6e77950ab97987323d3c5
-- 
2.40.1



[PATCH 2/2] powerpc/xmon: use KSYM_NAME_LEN in array size

2023-05-29 Thread Maninder Singh
kallsyms_lookup(), which in turn calls kallsyms_lookup_buildid(),
writes at index "KSYM_NAME_LEN - 1".

Thus the array size should be KSYM_NAME_LEN.

For powerpc the size was hard-coded as "128", and commit '61968dbc2d5d'
changed the define value to 512, so this buffer was missed when updating
to the new size.

Fixes: 61968dbc2d5d ("kallsyms: increase maximum kernel symbol length to 512")

Co-developed-by: Onkarnath 
Signed-off-by: Onkarnath 
Signed-off-by: Maninder Singh 
---
 arch/powerpc/xmon/xmon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 728d3c257e4a..70c4c59a1a8f 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -88,7 +88,7 @@ static unsigned long ndump = 64;
 static unsigned long nidump = 16;
 static unsigned long ncsum = 4096;
 static int termch;
-static char tmpstr[128];
+static char tmpstr[KSYM_NAME_LEN];
 static int tracing_enabled;
 
 static long bus_error_jmp[JMP_BUF_LEN];
-- 
2.17.1



[PATCH 1/2] hexagon/traps.c: use KSYM_NAME_LEN in array size

2023-05-29 Thread Maninder Singh
kallsyms_lookup(), which in turn calls kallsyms_lookup_buildid(),
writes at index "KSYM_NAME_LEN - 1".

Thus the array size should be KSYM_NAME_LEN.

For hexagon the size was hard-coded as "128", and commit '61968dbc2d5d'
changed the define value to 512, so this buffer was missed when updating
to the new size.

Fixes: 61968dbc2d5d ("kallsyms: increase maximum kernel symbol length to 512")

Co-developed-by: Onkarnath 
Signed-off-by: Onkarnath 
Signed-off-by: Maninder Singh 
---
 arch/hexagon/kernel/traps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index 6447763ce5a9..65b30b6ea226 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -82,7 +82,7 @@ static void do_show_stack(struct task_struct *task, unsigned 
long *fp,
const char *name = NULL;
unsigned long *newfp;
unsigned long low, high;
-   char tmpstr[128];
+   char tmpstr[KSYM_NAME_LEN];
char *modname;
int i;
 
-- 
2.17.1



RE: [PATCH 1/1] arch:hexagon/powerpc: use KSYM_NAME_LEN in array size

2023-05-29 Thread Maninder Singh
Hi,

>>
>> kallsyms_lookup which in turn calls for kallsyms_lookup_buildid()
>> writes on index "KSYM_NAME_LEN - 1".
>>
>> Thus array size should be KSYM_NAME_LEN.
>>
>> for powerpc and hexagon it was defined as "128" directly.
>> and commit '61968dbc2d5d' changed define value to 512,
>> So both were missed to update with new size.
>>
>> Fixes: 61968dbc2d5d ("kallsyms: increase maximum kernel symbol length to 
>> 512")
>> Signed-off-by: Onkarnath 
>> Signed-off-by: Maninder Singh 

> Thanks for this!
> 
> There is no `From:` at the top. Since I cannot locate the patch in
> Lore, did you mean to put both of you as authors perhaps? In that
> case, please use a `Co-developed-by` as needed.
> 

I will add the `Co-developed-by` tag, because this change was identified
while we were working on kallsyms some time back:
https://lore.kernel.org/lkml/yontol4zc4cyt...@infradead.org/t/

That patch set is pending and we will start working on it again, so I
thought it better to send the bugfix first.

> Perhaps it is a good idea to submit each arch independently, too.
> 

ok, I will share 2 separate patches.

> The changes themselves look fine on a quick inspection, though the
> `xmon.c` one is a global buffer (and there is another equally-sized
> buffer in `xmon.c` with a hard-coded `128` constant that would be nice
> to clarify).

Yes, I think the second buffer is not related to kallsyms, so I have not
touched it.

Thanks,
Maninder Singh


Re: [PATCH 1/1] arch:hexagon/powerpc: use KSYM_NAME_LEN in array size

2023-05-29 Thread Miguel Ojeda
On Mon, May 29, 2023 at 7:44 AM Maninder Singh  wrote:
>
> kallsyms_lookup which in turn calls for kallsyms_lookup_buildid()
> writes on index "KSYM_NAME_LEN - 1".
>
> Thus array size should be KSYM_NAME_LEN.
>
> for powerpc and hexagon it was defined as "128" directly.
> and commit '61968dbc2d5d' changed define value to 512,
> So both were missed to update with new size.
>
> Fixes: 61968dbc2d5d ("kallsyms: increase maximum kernel symbol length to 512")
> Signed-off-by: Onkarnath 
> Signed-off-by: Maninder Singh 

Thanks for this!

There is no `From:` at the top. Since I cannot locate the patch in
Lore, did you mean to put both of you as authors perhaps? In that
case, please use a `Co-developed-by` as needed.

Perhaps it is a good idea to submit each arch independently, too.

The changes themselves look fine on a quick inspection, though the
`xmon.c` one is a global buffer (and there is another equally-sized
buffer in `xmon.c` with a hard-coded `128` constant that would be nice
to clarify).

Cheers,
Miguel


Re: Fwd: ./include/linux/mmzone.h:1735:2: error: #error Allocator MAX_ORDER exceeds SECTION_SIZE (v6.4-rc3 build regression)

2023-05-29 Thread Bagas Sanjaya
On Wed, May 24, 2023 at 06:13:52PM +0300, Doru Iorgulescu wrote:
> Awesome, thanks!

tl;dr: see [1].

(Hey, looks like complimenting noise here.)

[1]: 
https://lore.kernel.org/regressions/5df92692-296e-3956-24fa-2bd439337...@gmail.com/

-- 
An old man doll... just what I always wanted! - Clara




[PATCH 12/12] mm: delete mmap_write_trylock() and vma_try_start_write()

2023-05-29 Thread Hugh Dickins
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins 
---
 include/linux/mm.h| 17 -
 include/linux/mmap_lock.h | 10 --
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c2e56980853..9b24f8fbf899 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,21 +690,6 @@ static inline void vma_start_write(struct vm_area_struct 
*vma)
up_write(>vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-   int mm_lock_seq;
-
-   if (__is_vma_write_locked(vma, _lock_seq))
-   return true;
-
-   if (!down_write_trylock(>vm_lock->lock))
-   return false;
-
-   vma->vm_lock_seq = mm_lock_seq;
-   up_write(>vm_lock->lock);
-   return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
int mm_lock_seq;
@@ -730,8 +715,6 @@ static inline bool vma_start_read(struct vm_area_struct 
*vma)
{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-   { return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct 
mm_struct *mm)
return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-   bool ret;
-
-   __mmap_lock_trace_start_locking(mm, true);
-   ret = down_write_trylock(>mmap_lock) != 0;
-   __mmap_lock_trace_acquire_returned(mm, true, ret);
-   return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
__mmap_lock_trace_released(mm, true);
-- 
2.35.3



[PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()

2023-05-29 Thread Hugh Dickins
Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed.  Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before?  I assume an issue
with needing mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 125 +++-
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2999500abdd5..301c0e54a2ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, 
MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being 
scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
struct mm_slot slot;
-
-   /* pte-mapped THP in this mm */
-   int nr_pte_mapped_thp;
-   unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1441,50 +1433,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot 
*mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- * emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- * retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- * (at virtual address X) and adds an entry (for X) into mm_struct A's
- * ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- * (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the 
same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
-   struct khugepaged_mm_slot *mm_slot;
-   struct mm_slot *slot;
-   bool ret = false;
-
-   VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-   spin_lock(_mm_lock);
-   slot = mm_slot_lookup(mm_slots_hash, mm);
-   mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-   if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) 
{
-   mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-   ret = true;
-   }
-   spin_unlock(_mm_lock);
-   return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
@@ -1675,29 +1623,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
goto drop_hpage;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot 
*mm_slot)
-{
-   struct mm_slot *slot = _slot->slot;
-   struct mm_struct *mm = slot->mm;
-   int i;
-
-   if (likely(mm_slot->nr_pte_mapped_thp == 0))
-   return;
-
-   if (!mmap_write_trylock(mm))
-   return;
-
-   if (unlikely(hpage_collapse_test_exit(mm)))
-   goto out;
-
-   for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-   collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-   mm_slot->nr_pte_mapped_thp = 0;
-   mmap_write_unlock(mm);
-}
-
 static void retract_page_tables(struct 

[PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-05-29 Thread Hugh Dickins
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes the need for tlb_remove_table_sync_one() here.

Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Some cleanup while rearranging: rename "count" to "nr_ptes";
and "step 2" does not need to duplicate the checks in "step 1".

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 131 +++-
 1 file changed, 41 insertions(+), 90 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4fd408154692..2999500abdd5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1485,7 +1485,7 @@ static bool khugepaged_add_pte_mapped_thp(struct 
mm_struct *mm,
return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
 {
@@ -1497,7 +1497,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, 
unsigned long addr,
};
 
VM_BUG_ON(!PageTransHuge(hpage));
-   mmap_assert_write_locked(vma->vm_mm);
+   mmap_assert_locked(vma->vm_mm);
 
if (do_set_pmd(, hpage))
return SCAN_FAIL;
@@ -1506,48 +1506,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, 
unsigned long addr,
return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to 
another
- *page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct 
*vma,
- unsigned long addr, pmd_t *pmdp)
-{
-   pmd_t pmd;
-   struct mmu_notifier_range range;
-
-   mmap_assert_write_locked(mm);
-   if (vma->vm_file)
-   
lockdep_assert_held_write(>vm_file->f_mapping->i_mmap_rwsem);
-   /*
-* All anon_vmas attached to the VMA have the same root and are
-* therefore locked by the same lock.
-*/
-   if (vma->anon_vma)
-   lockdep_assert_held_write(>anon_vma->root->rwsem);
-
-   mmu_notifier_range_init(, MMU_NOTIFY_CLEAR, 0, mm, addr,
-   addr + HPAGE_PMD_SIZE);
-   mmu_notifier_invalidate_range_start();
-   pmd = pmdp_collapse_flush(vma, addr, pmdp);
-   tlb_remove_table_sync_one();
-   mmu_notifier_invalidate_range_end();
-   mm_dec_nr_ptes(mm);
-   page_table_check_pte_clear_range(mm, addr, pmd);
-   pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1563,16 +1521,17 @@ static void collapse_and_free_pmd(struct mm_struct *mm, 
struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
 {
+   struct mmu_notifier_range range;
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page *hpage;
pte_t *start_pte, *pte;
-   pmd_t *pmd;
-   spinlock_t *ptl;
-   int count = 0, result = SCAN_FAIL;
+   pmd_t *pmd, pgt_pmd;
+   spinlock_t *pml, *ptl;
+   int nr_ptes = 0, result = SCAN_FAIL;
int i;
 
-   mmap_assert_write_locked(mm);
+   mmap_assert_locked(mm);
 
/* Fast check before locking page if already PMD-mapped */
result = find_pmd_or_thp_or_none(mm, haddr, );
@@ -1612,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
goto drop_hpage;
}
 
+   result = find_pmd_or_thp_or_none(mm, haddr, );
switch 

[PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-29 Thread Hugh Dickins
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that does raise some
questions, and requires a more complicated pte_free_defer() for powerpc
(when its arch_needs_pgtable_deposit() for shmem and file THPs).  Leave
that enhancement to a later release.

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 169 +---
 1 file changed, 60 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1083f0e38a07..4fd408154692 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
break;
case SCAN_PMD_NONE:
/*
-* In MADV_COLLAPSE path, possible race with khugepaged where
-* all pte entries have been removed and pmd cleared.  If so,
-* skip all the pte checks and just update the pmd mapping.
+* All pte entries have been removed and pmd cleared.
+* Skip all the pte checks and just update the pmd mapping.
 */
goto maybe_install_pmd;
default:
@@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct 
khugepaged_mm_slot *mm_sl
mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-  struct mm_struct *target_mm,
-  unsigned long target_addr, struct page *hpage,
-  struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
struct vm_area_struct *vma;
-   int target_result = SCAN_FAIL;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
-   int result = SCAN_FAIL;
-   struct mm_struct *mm = NULL;
-   unsigned long addr = 0;
-   pmd_t *pmd;
-   bool is_target = false;
+   struct mm_struct *mm;
+   unsigned long addr;
+   pmd_t *pmd, pgt_pmd;
+   spinlock_t *pml;
+   spinlock_t *ptl;
 
/*
 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-* got written to. These VMAs are likely not worth investing
-* mmap_write_lock(mm) as PMD-mapping is likely to be split
-* later.
+* got written to. These VMAs are likely not worth removing
+* page tables from, as PMD-mapping is likely to be split later.
 *
-* Note that vma->anon_vma check is racy: it can be set up after
-* the check but before we took mmap_lock by the fault path.
-* But page lock would prevent establishing any new ptes of the
-* page, so we are safe.
-*
-* An alternative would be drop the check, but check that page
-* table is clear before calling pmdp_collapse_flush() under
-* ptl. It has higher chance to recover THP for the VMA, but
-* has higher cost too. It would also probably require locking
-* the anon_vma.
+* Note that vma->anon_vma check is 

[PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page

2023-05-29 Thread Hugh Dickins
Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).

Signed-off-by: Hugh Dickins 
---
 include/linux/pgtable.h |  2 ++
 mm/pgtable-generic.c| 20 
 2 files changed, 22 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8b0fc7fdc46f..62a8732d92f0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d28b63386cef..471697dcb244 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long address,
return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = pgtable;
+   call_rcu(>rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3



[PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-05-29 Thread Hugh Dickins
Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because page_table_free()
needs to know which fragment is being freed, and which mm to link it to.

page_table_free()'s fragment handling is clever, but I could too easily
break it: what's done here in pte_free_defer() and pte_free_now() might
be better integrated with page_table_free()'s cleverness, but not by me!

By the time that page_table_free() gets called via RCU, it's conceivable
that mm would already have been freed: so mmgrab() in pte_free_defer()
and mmdrop() in pte_free_now().  No, that is not a good context to call
mmdrop() from, so make mmdrop_async() public and use that.

Signed-off-by: Hugh Dickins 
---
 arch/s390/include/asm/pgalloc.h |  4 
 arch/s390/mm/pgalloc.c  | 34 +
 include/linux/mm_types.h|  2 +-
 include/linux/sched/mm.h|  1 +
 kernel/fork.c   |  2 +-
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..0129de9addfd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -346,6 +346,40 @@ void page_table_free(struct mm_struct *mm, unsigned long 
*table)
__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+   unsigned long mm_bit;
+   struct mm_struct *mm;
+   unsigned long *table;
+
+   page = container_of(head, struct page, rcu_head);
+   table = (unsigned long *)page_to_virt(page);
+   mm_bit = (unsigned long)page->pt_mm;
+   /* 4K page has only two 2K fragments, but alignment allows eight */
+   mm = (struct mm_struct *)(mm_bit & ~7);
+   table += PTRS_PER_PTE * (mm_bit & 7);
+   page_table_free(mm, table);
+   mmdrop_async(mm);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+   unsigned long mm_bit;
+
+   mmgrab(mm);
+   page = virt_to_page(pgtable);
+   /* Which 2K page table fragment of a 4K page? */
+   mm_bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+   (PTRS_PER_PTE * sizeof(pte_t));
+   mm_bit += (unsigned long)mm;
+   page->pt_mm = (struct mm_struct *)mm_bit;
+   call_rcu(>rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 unsigned long vmaddr)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@ struct page {
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2;/* mapping */
union {
-   struct mm_struct *pt_mm; /* x86 pgds only */
+   struct mm_struct *pt_mm; /* x86 pgd, s390 */
atomic_t pt_frag_refcount; /* powerpc */
};
 #if ALLOC_SPLIT_PTLOCKS
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8d89c8c4fac1..a9043d1a0d55 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -41,6 +41,7 @@ static inline void smp_mb__after_mmgrab(void)
smp_mb__after_atomic();
 }
 
+extern void mmdrop_async(struct mm_struct *mm);
 extern void __mmdrop(struct mm_struct *mm);
 
 static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fa4486b65c56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -942,7 +942,7 @@ static void mmdrop_async_fn(struct work_struct *work)
__mmdrop(mm);
 }
 
-static void mmdrop_async(struct mm_struct *mm)
+void mmdrop_async(struct mm_struct *mm)
 {
if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
-- 
2.35.3
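
A standalone user-space sketch of the tagging scheme described above may
help: an mm_struct pointer is at least 8-byte aligned, so its low three
bits are free to record which 2K fragment of the 4K page is being freed.
Names below are hypothetical; this is an illustration, not the kernel code.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define FRAG_MASK 7UL	/* alignment allows eight fragments; s390 only needs two */

static uintptr_t pack_mm_and_fragment(void *mm, unsigned int frag)
{
	assert(((uintptr_t)mm & FRAG_MASK) == 0);	/* low bits are spare */
	assert(frag <= FRAG_MASK);
	return (uintptr_t)mm | frag;
}

static void unpack_mm_and_fragment(uintptr_t packed, void **mm, unsigned int *frag)
{
	*mm = (void *)(packed & ~FRAG_MASK);
	*frag = (unsigned int)(packed & FRAG_MASK);
}

int main(void)
{
	_Alignas(8) long fake_mm;	/* stands in for a struct mm_struct */
	void *mm;
	unsigned int frag;

	uintptr_t packed = pack_mm_and_fragment(&fake_mm, 1);
	unpack_mm_and_fragment(packed, &mm, &frag);
	printf("mm recovered: %d, fragment %u\n", mm == (void *)&fake_mm, frag);
	return 0;
}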



[PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page

2023-05-29 Thread Hugh Dickins
Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/include/asm/pgalloc_64.h |  4 
 arch/sparc/mm/init_64.c | 16 
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)  pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..b7c6aa085ef6 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   __pte_free((pgtable_t)page_to_virt(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = virt_to_page(pgtable);
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
  pmd_t *pmd)
 {
-- 
2.35.3
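
The design point shared by these per-arch patches, that the deferral must
not require allocating anything because it runs where memory cannot be
relied upon, follows from embedding the callback head in the page being
freed, just as rcu_head lives inside struct page here.  A toy user-space
sketch of that pattern (no RCU, hypothetical names, illustration only):

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct defer_head {
	struct defer_head *next;
	void (*fn)(struct defer_head *);
};

struct obj {
	int payload;
	struct defer_head dh;	/* embedded: queueing the free costs no allocation */
};

static struct defer_head *pending;

static void defer_free(struct defer_head *dh, void (*fn)(struct defer_head *))
{
	dh->fn = fn;
	dh->next = pending;
	pending = dh;		/* queued using the object's own memory */
}

static void run_deferred(void)	/* stands in for the RCU grace period */
{
	while (pending) {
		struct defer_head *dh = pending;

		pending = dh->next;
		dh->fn(dh);
	}
}

static void obj_free_now(struct defer_head *dh)
{
	struct obj *o = (struct obj *)((char *)dh - offsetof(struct obj, dh));

	printf("freeing obj %d\n", o->payload);
	free(o);
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return 1;
	o->payload = 42;
	defer_free(&o->dh, obj_free_now);
	run_deferred();
	return 0;
}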



[PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-05-29 Thread Hugh Dickins
Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 
 arch/powerpc/mm/pgtable-frag.c | 18 ++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..3a3dac77faf2 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,21 @@ void pte_fragment_free(unsigned long *table, int kernel)
__free_page(page);
}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   pte_fragment_free((unsigned long *)page_to_virt(page), 0);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = virt_to_page(pgtable);
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3



[PATCH 04/12] powerpc: assert_pte_locked() use pte_offset_map_nolock()

2023-05-29 Thread Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/pgtable.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
+   pte_t *pte;
+   spinlock_t *ptl;
 
if (mm == &init_mm)
return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
-   /*
-* khugepaged to collapse normal pages to hugepage, first set
-* pmd to none to force page fault/gup to take mmap_lock. After
-* pmd is set to none, we do a pte_clear which does this assertion
-* so if we find pmd none, return.
-*/
-   if (pmd_none(*pmd))
-   return;
-   BUG_ON(!pmd_present(*pmd));
-   assert_spin_locked(pte_lockptr(mm, pmd));
+   pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+   BUG_ON(!pte);
+   assert_spin_locked(ptl);
+   pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
 
-- 
2.35.3



[PATCH 03/12] arm: adjust_pte() use pte_offset_map_nolock()

2023-05-29 Thread Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins 
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 * must use the nested version.  This also means we need to
 * open-code the spin-locking.
 */
-   pte = pte_offset_map(pmd, address);
+   pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
if (!pte)
return 0;
 
-   ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);
 
ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3



[PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map()

2023-05-29 Thread Hugh Dickins
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).

Signed-off-by: Hugh Dickins 
---
 mm/pgtable-generic.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 674671835631..d28b63386cef 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,32 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+   (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+#define config_might_irq_save(flags)   local_irq_save(flags)
+#define config_might_irq_restore(flags)local_irq_restore(flags)
+#else
+#define config_might_irq_save(flags)
+#define config_might_irq_restore(flags)
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+   unsigned long __maybe_unused flags;
pmd_t pmdval;
 
rcu_read_lock();
+   config_might_irq_save(flags);
pmdval = pmdp_get_lockless(pmd);
+   config_might_irq_restore(flags);
+
if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3
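
The hazard described above is easy to reproduce in miniature: on a 32-bit
build a 64-bit pmd is fetched as two 32-bit loads, so a writer slipping in
between them leaves the reader with halves of two different values.  A
user-space sketch of the tearing (illustration only; the kernel fix is the
interrupt blackout around the lockless read on the affected configs):

#include <stdint.h>
#include <stdio.h>

struct pmd64 {				/* stand-in for a 64-bit pmd_t on a 32-bit arch */
	volatile uint32_t low;
	volatile uint32_t high;
};

static uint64_t pmd_read_lockless(struct pmd64 *pmd)
{
	uint32_t lo = pmd->low;		/* two separate loads: nothing makes */
	uint32_t hi = pmd->high;	/* them atomic as a pair */

	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	struct pmd64 pmd = { .low = 0x1000, .high = 0x1 };
	uint64_t before = pmd_read_lockless(&pmd);

	/* imagine an update landing between the two loads above ... */
	pmd.low = 0x2000;
	pmd.high = 0x2;

	uint64_t after = pmd_read_lockless(&pmd);

	printf("before %#llx after %#llx: a read racing the update could see 0x100002000\n",
	       (unsigned long long)before, (unsigned long long)after);
	return 0;
}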



[PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

2023-05-29 Thread Hugh Dickins
Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.

Signed-off-by: Hugh Dickins 
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a1326e61d7ee..8b0fc7fdc46f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte) do {\
kunmap_local((pte));\
-   /* rcu_read_unlock() to be added later */   \
+   rcu_read_unlock();  \
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-   /* rcu_read_unlock() to be added later */
+   rcu_read_unlock();
 }
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c7ab18a5fb77..674671835631 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
pmd_t pmdval;
 
-   /* rcu_read_lock() to be added later */
+   rcu_read_lock();
pmdval = pmdp_get_lockless(pmd);
if (pmdvalp)
*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
}
return __pte_map(&pmdval, addr);
 nomap:
-   /* rcu_read_unlock() to be added later */
+   rcu_read_unlock();
return NULL;
 }
 
-- 
2.35.3



[PATCH 00/12] mm: free retracted page table by RCU

2023-05-29 Thread Hugh Dickins
Here is the third series of patches to mm (and a few architectures), based
on v6.4-rc3 with the preceding two series applied: in which khugepaged
takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.

This follows on from the "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53b...@google.com/
series of 23 posted on 2023-05-09,
and the "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b...@google.com/
series of 31 posted on 2023-05-21.

Those two series were "independent": neither depending for build or
correctness on the other, but both series needed before this third one
can safely make the effective changes.  I'll send v2 of those two series
in a couple of days, incorporating Acks and Revieweds and the minor fixes.

What is it all about?  Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is being limited by that
mmap_write_lock it currently requires.

Likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on the preceding two series over v6.4-rc3, but good over
v6.4-rc[1-4], current mm-everything or current linux-next.

01/12 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
02/12 mm/pgtable: add PAE safety to __pte_offset_map()
03/12 arm: adjust_pte() use pte_offset_map_nolock()
04/12 powerpc: assert_pte_locked() use pte_offset_map_nolock()
05/12 powerpc: add pte_free_defer() for pgtables sharing page
06/12 sparc: add pte_free_defer() for pgtables sharing page
07/12 s390: add pte_free_defer(), with use of mmdrop_async()
08/12 mm/pgtable: add pte_free_defer() for pgtable as page
09/12 mm/khugepaged: retract_page_tables() without mmap or vma lock
10/12 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
11/12 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
12/12 mm: delete mmap_write_trylock() and vma_try_start_write()

 arch/arm/mm/fault-armv.c|   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c  |  18 ++
 arch/powerpc/mm/pgtable.c   |  16 +-
 arch/s390/include/asm/pgalloc.h |   4 +
 arch/s390/mm/pgalloc.c  |  34 +++
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c |  16 ++
 include/linux/mm.h  |  17 --
 include/linux/mm_types.h|   2 +-
 include/linux/mmap_lock.h   |  10 -
 include/linux/pgtable.h |   6 +-
 include/linux/sched/mm.h|   1 +
 kernel/fork.c   |   2 +-
 mm/khugepaged.c | 425 --
 mm/pgtable-generic.c|  44 +++-
 16 files changed, 253 insertions(+), 353 deletions(-)

Hugh
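
For context, the user-visible entry point whose latency this series chiefly
targets is MADV_COLLAPSE on shmem.  A minimal sketch of that call path from
user space (illustration only; MADV_COLLAPSE has been in mainline since 6.1,
and the madvise() may still fail with EINVAL if THP is disabled for shmem on
the running kernel):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from <asm-generic/mman-common.h> */
#endif

int main(void)
{
	size_t len = 2UL << 20;				/* one PMD-sized extent */
	int fd = memfd_create("collapse-demo", 0);	/* shmem backing */

	if (fd < 0 || ftruncate(fd, len))
		return 1;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0, len);				/* populate the shmem pages */
	if (madvise(p, len, MADV_COLLAPSE))		/* collapse to a huge page now */
		perror("madvise(MADV_COLLAPSE)");
	return 0;
}

Once retract_page_tables() no longer needs mmap_write_lock, a call like the
one above stops serialising against every other thread's faults in the same
mm, which is the mmap_lock avoidance the cover letter describes.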