[PATCH 1/2] ASoC: dt-bindings: fsl_rpmsg: Add compatible string for i.MX95

2024-06-26 Thread Chancel Liu
Add a compatible string for the i.MX95 platform, which supports the audio
function through an rpmsg channel between the Cortex-A and Cortex-M cores.

Signed-off-by: Chancel Liu 
---
 Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml b/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml
index 188f38baddec..3d5d435c765b 100644
--- a/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml
+++ b/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml
@@ -29,6 +29,7 @@ properties:
   - fsl,imx8mp-rpmsg-audio
   - fsl,imx8ulp-rpmsg-audio
   - fsl,imx93-rpmsg-audio
+  - fsl,imx95-rpmsg-audio
 
   clocks:
 items:
-- 
2.43.0



[PATCH 2/2] ASoC: fsl_rpmsg: Add support for i.MX95 platform

2024-06-26 Thread Chancel Liu
Add a compatible string and SoC-specific data to support the rpmsg sound
card on the i.MX95 platform.

Signed-off-by: Chancel Liu 
---
 sound/soc/fsl/fsl_rpmsg.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/sound/soc/fsl/fsl_rpmsg.c b/sound/soc/fsl/fsl_rpmsg.c
index bc41a0666856..467d6bc9f956 100644
--- a/sound/soc/fsl/fsl_rpmsg.c
+++ b/sound/soc/fsl/fsl_rpmsg.c
@@ -175,6 +175,14 @@ static const struct fsl_rpmsg_soc_data imx93_data = {
   SNDRV_PCM_FMTBIT_S32_LE,
 };
 
+static const struct fsl_rpmsg_soc_data imx95_data = {
+	.rates = SNDRV_PCM_RATE_16000 | SNDRV_PCM_RATE_32000 |
+		 SNDRV_PCM_RATE_44100 | SNDRV_PCM_RATE_48000 |
+		 SNDRV_PCM_RATE_88200 | SNDRV_PCM_RATE_96000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE,
+};
+
 static const struct of_device_id fsl_rpmsg_ids[] = {
{ .compatible = "fsl,imx7ulp-rpmsg-audio", .data = &imx7ulp_data},
{ .compatible = "fsl,imx8mm-rpmsg-audio", .data = &imx8mm_data},
@@ -182,6 +190,7 @@ static const struct of_device_id fsl_rpmsg_ids[] = {
{ .compatible = "fsl,imx8mp-rpmsg-audio", .data = &imx8mp_data},
{ .compatible = "fsl,imx8ulp-rpmsg-audio", .data = &imx7ulp_data},
{ .compatible = "fsl,imx93-rpmsg-audio", .data = &imx93_data},
+   { .compatible = "fsl,imx95-rpmsg-audio", .data = &imx95_data},
{ /* sentinel */ }
 };
 MODULE_DEVICE_TABLE(of, fsl_rpmsg_ids);
-- 
2.43.0
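
For context, a sketch of how the matched per-SoC data is typically consumed
in the driver's probe path (illustrative only, not the literal fsl_rpmsg.c
code):

	static int fsl_rpmsg_probe(struct platform_device *pdev)
	{
		struct fsl_rpmsg *rpmsg;

		rpmsg = devm_kzalloc(&pdev->dev, sizeof(*rpmsg), GFP_KERNEL);
		if (!rpmsg)
			return -ENOMEM;

		/* resolves to &imx95_data for a "fsl,imx95-rpmsg-audio" node */
		rpmsg->soc_data = of_device_get_match_data(&pdev->dev);

		return 0;
	}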



Re: [PATCH] printk: Add a short description string to kmsg_dump()

2024-06-26 Thread Petr Mladek
On Tue 2024-06-25 14:39:29, Jocelyn Falempe wrote:
> kmsg_dump doesn't forward the panic reason string to the kmsg_dumper
> callback.
> This patch adds a new parameter "const char *desc" to the kmsg_dumper
> dump() callback, and update all drivers that are using it.
> 
> To avoid updating all kmsg_dump() call, it adds a kmsg_dump_desc()
> function and a macro for backward compatibility.
> 
> I've written this for drm_panic, but it can be useful for other
> kmsg_dumper.
> It allows to see the panic reason, like "sysrq triggered crash"
> or "VFS: Unable to mount root fs on " on the drm panic screen.
>
> Signed-off-by: Jocelyn Falempe 
> ---
>  arch/powerpc/kernel/nvram_64.c |  3 ++-
>  arch/powerpc/platforms/powernv/opal-kmsg.c |  3 ++-
>  drivers/gpu/drm/drm_panic.c|  3 ++-
>  drivers/hv/hv_common.c |  3 ++-
>  drivers/mtd/mtdoops.c  |  3 ++-
>  fs/pstore/platform.c   |  3 ++-
>  include/linux/kmsg_dump.h  | 13 ++---
>  kernel/panic.c |  2 +-
>  kernel/printk/printk.c |  8 +---
>  9 files changed, 28 insertions(+), 13 deletions(-)

The parameter is added into all dumpers. I guess that it would be
used only by drm_panic() because it is graphics and might be "fancy".
The others simply dump the log buffer, and the reason is in
the dumped log as well.

Anyway, the passed buffer is static. An alternative solution would
be to make it global and export it like, for example, panic_cpu.

Best Regards,
Petr
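
A sketch of that alternative (the names panic_reason_str and
example_kmsg_dump are assumptions chosen for illustration; panic_cpu is only
the analogy):

	/* kernel/panic.c: set once from panic(), before kmsg_dump() runs */
	const char *panic_reason_str;
	EXPORT_SYMBOL_GPL(panic_reason_str);

	/* a dumper that wants the reason could then read it directly */
	static void example_kmsg_dump(struct kmsg_dumper *dumper,
				      enum kmsg_dump_reason reason)
	{
		if (reason == KMSG_DUMP_PANIC && panic_reason_str)
			pr_info("panic reason: %s\n", panic_reason_str);
	}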


Re: [PATCH 2/3] powerpc/pseries: Export hardware trace macro dump via debugfs

2024-06-26 Thread Michael Ellerman
Ritesh Harjani (IBM)  writes:
> This is a generic review and I haven't looked into the PAPR spec for
> htmdump hcall and it's interface.
>
> Madhavan Srinivasan  writes:
...
>> +
>> +debugfs_create_u32("nodeindex", 0600,
>> +htmdump_debugfs_dir, &nodeindex);
>> +debugfs_create_u32("nodalchipindex", 0600,
>> +htmdump_debugfs_dir, &nodalchipindex);
>> +debugfs_create_u32("coreindexonchip", 0600,
>> +htmdump_debugfs_dir, &coreindexonchip);
>> +debugfs_create_u32("htmtype", 0600,
>> +htmdump_debugfs_dir, &htmtype);
>
> minor nit: For all of the above. S_IRUSR | S_IWUSR instead of 0600.
>
>> +debugfs_create_file("trace", 0400, htmdump_debugfs_dir, ent, 
>> &htmdump_fops);
>
> maybe S_IRUSR instead of 0400.

Actually we prefer the octal values, see:
  https://git.kernel.org/torvalds/c/57ad583f2086d55ada284c54bfc440123cf73964

cheers
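
To make the preference concrete, the two equivalent spellings (the quoted
hunks already use the preferred one):

	/* preferred: plain octal mode */
	debugfs_create_u32("htmtype", 0600, htmdump_debugfs_dir, &htmtype);

	/* discouraged since commit 57ad583f2086: symbolic S_I* macros */
	debugfs_create_u32("htmtype", S_IRUSR | S_IWUSR,
			   htmdump_debugfs_dir, &htmtype);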


Re: [PATCH] powerpc/pseries: Fix scv instruction crash with kexec

2024-06-26 Thread Michael Ellerman
Nicholas Piggin  writes:
> kexec on pseries disables AIL (reloc_on_exc), required for scv
> instruction support, before other CPUs have been shut down. This means
> they can execute scv instructions after AIL is disabled, which causes an
> interrupt at an unexpected entry location that crashes the kernel.
>
> Change the kexec sequence to disable AIL after other CPUs have been
> brought down.
>
> As a refresher, the real-mode scv interrupt vector is 0x17000, and the
> fixed-location head code probably couldn't easily deal with implementing
> such high addresses so it was just decided not to support that interrupt
> at all.
>
> Reported-by: Sourabh Jain 
 
Was this reported publicly? I don't remember it.

cheers


Re: [PATCH] powerpc/pseries: Fix scv instruction crash with kexec

2024-06-26 Thread Gautam Menghani
Without this patch, if some cpus are disabled in the system and we try
to do a 2 stage kexec as follows:

kexec -l vmlinux 
kexec -e

we would hit the following Oops

[ 2598.923098] kernel BUG at arch/powerpc/kernel/exceptions-64s.S:501!
[ 2598.923103] Oops: Exception in kernel mode, sig: 5 [#1]
[ 2598.923107] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 2598.923111] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables bridge stp llc kvm_hv kvm bonding tls rfkill binfmt_misc tg3 vmx_crypto aes_gcm_p10_crypto ibmveth crct10dif_vpmsum pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse loop dm_multipath nfnetlink zram xfs ibmvscsi scsi_transport_srp crc32c_vpmsum pseries_wdt scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables
[ 2598.923167] CPU: 11 PID: 1548 Comm: systemd-journal Not tainted 6.9.0+ #4
[ 2598.923171] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 (NH1060_022) hv:phyp pSeries
[ 2598.923176] NIP:  c00089e4 LR: 7fffaa1427c4 CTR: c00089b0
[ 2598.923180] REGS: c008dfe7fd60 TRAP: 0700   Not tainted  (6.9.0+)
[ 2598.923184] MSR:  80021031   CR: 28002413  XER: 
[ 2598.923192] CFAR: c00089dc IRQMASK: 0
[ 2598.923192] GPR00: 0003 740fb110  0009
[ 2598.923192] GPR04: 740fbcf0 2000 740fdcc0 
[ 2598.923192] GPR08: 7fffaabc3b80 48002413 740fb3e0 00017000
[ 2598.923192] GPR12: 80009003 c008dfff2b00  
[ 2598.923192] GPR16:    
[ 2598.923192] GPR20:    7fffaabaf448
[ 2598.923192] GPR24: 00011bc72700 740fddf8 000132490ea0 740fddf0
[ 2598.923192] GPR28:  740fbcf0 2000 0009
[ 2598.923238] NIP [c00089e4] data_access_common_virt+0x14/0x220
[ 2598.923245] LR [7fffaa1427c4] 0x7fffaa1427c4
[ 2598.923251] Call Trace:
[ 2598.923253] Code: 2c0a 39400300 408242c0 e94d0020 694a0002 7d400164 6042 718a4000 7c2a0b78 3821fd30 41c20008 e82d0910 <0981fd30> f9210160 f9610130 f9810138
[ 2598.923269] ---[ end trace  ]---
[ 2598.926662] pstore: backend (nvram) writing error (-1)


With this patch, the disabled cpus are woken up and kexec goes through
fine.

Tested-by: Gautam Menghani 


Re: [PATCH] powerpc/pseries: Fix scv instruction crash with kexec

2024-06-26 Thread Sourabh Jain

Hello Michael,

On 26/06/24 14:57, Michael Ellerman wrote:

Nicholas Piggin  writes:

kexec on pseries disables AIL (reloc_on_exc), required for scv
instruction support, before other CPUs have been shut down. This means
they can execute scv instructions after AIL is disabled, which causes an
interrupt at an unexpected entry location that crashes the kernel.

Change the kexec sequence to disable AIL after other CPUs have been
brought down.

As a refresher, the real-mode scv interrupt vector is 0x17000, and the
fixed-location head code probably couldn't easily deal with implementing
such high addresses so it was just decided not to support that interrupt
at all.

Reported-by: Sourabh Jain 
  
Was this reported publicly? I don't remember it.


No, I didn't report this issue publicly.

While debugging a kexec issue, git bisect pointed to the commit
mentioned in the patch description, so I contacted Nick directly.

`kexec -e` with --smt=off in the first kernel hits an exception when
wake_offline_cpus() -> add_cpu() is called to bring up offline CPUs.

Console log:

[   68.824514] restraintd[899]: * Parsing recipe
[   68.825546] restraintd[899]: * Running recipe
[   68.825591] restraintd[899]: ** Continuing task: 20291 [/mnt/tests/distribution/reservesys]

[   68.834095] restraintd[899]: ** Preparing metadata
[   68.872927] restraintd[899]: ** Refreshing peer role hostnames: Retries 0
[   68.911107] restraintd[899]: ** Updating env vars
[   68.911737] restraintd[899]: *** Current Time: Tue May 21 09:09:42 2024  Localwatchdog at:  * Disabled! *
[   68.922803] restraintd[899]: ** Running task: 20291 [/distribution/reservesys]

[   78.027943] Removing IBM Power 842 compression device
[   78.093777] XFS (sda2): Block device removal (0x20) detected at xfs_fs_shutdown+0x34/0x50 [xfs] (fs/xfs/xfs_super.c:1179). Shutting down filesystem.
[   78.093894] XFS (sda2): Please unmount the filesystem and rectify the problem(s)
[   83.450854] dm-0: writeback error on inode 17086756, offset 569344, sector 11026136
[   83.450910] dm-0: writeback error on inode 36421601, offset 0, sector 20772504
[   84.021819] dm-0: writeback error on inode 36382045, offset 0, sector 20772536
[   84.094348] dm-0: writeback error on inode 18703102, offset 0, sector 11021000
[   84.601228] dm-0: writeback error on inode 51268015, offset 0, sector 27663152
[   84.601468] dm-0: writeback error on inode 58225471, offset 0, sector 34636080

[   85.370996] kexec_core: Starting new kernel
[   85.391013] kexec: Waking offline cpu 1.
[   85.391038] [ cut here ]
[   85.391042] kernel BUG at arch/powerpc/kernel/exceptions-64s.S:501!
[   85.391047] Oops: Exception in kernel mode, sig: 5 [#1]
[   85.391051] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[   85.391056] Modules linked in: bonding tls rfkill pseries_rng vmx_crypto drm fuse drm_panel_orientation_quirks xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[   85.391086] CPU: 0 PID: 565 Comm: systemd-journal Kdump: loaded Not tainted 6.9.0+ #1
[   85.391092] Hardware name: IBM,9008-22L POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.A0 (VL950_144) hv:phyp pSeries
[   85.391096] NIP:  c00089a4 LR: 0001703c CTR: c0008980

[   85.391101] REGS: cf76fd60 TRAP: 0700   Not tainted (6.9.0+)
[   85.391106] MSR:  80021031   CR: 240022d4  XER: 

[   85.391116] CFAR: c000899c IRQMASK: 0
[   85.391116] GPR00: 0003 7fffc4f783a0 7fff9f0a7200 010014331bb8
[   85.391116] GPR04: 7fffc4f7b078 c4f6 7fffc4f7b1d0 0100143469a0
[   85.391116] GPR08: 7fff9f489268 440022d4 7fffc4f78670 000ac588
[   85.391116] GPR12: 80009003 c2f5  
[   85.391116] GPR16:    
[   85.391116] GPR20:   000127117b48 0001271185b8
[   85.391116] GPR24: 000127117b90 7fffc4f7b070 010014331540 7fffc4f7b078
[   85.391116] GPR28:  7fffc4f78f80 c4f6 010014331ba0

[   85.391173] NIP [c00089a4] data_access_common_virt+0x14/0x220
[   85.391181] LR [0001703c] 0x1703c
[   85.391186] Call Trace:
[   85.391189] Code: 48024df9 4800 6000 e94d0020 694a0002 7d400164 6000 718a4000 7c2a0b78 3821fd30 41c20008 e82d0910 <0981fd30> f9210160 f9610130 f9810138

[   85.391208] ---[ end trace  ]---
[   85.394302] pstore: backend (nvram) writing error (-1)
[   85.394306]
[   86.394309] Kernel panic - not syncing: Fatal exception
[   86.399970] Rebooting in 10 seconds..


Thanks,
Sourabh Jain


Re: [PATCH 1/4] soc: fsl: qbman: FSL_DPAA depends on COMPILE_TEST

2024-06-26 Thread kernel test robot
Hi Breno,

kernel test robot noticed the following build warnings:

[auto build test WARNING on herbert-cryptodev-2.6/master]
[also build test WARNING on soc/for-next linus/master v6.10-rc5 next-20240625]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/crypto-caam-Depend-on-COMPILE_TEST-also/20240625-223834
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
patch link: https://lore.kernel.org/r/20240624162128.1665620-1-leitao%40debian.org
patch subject: [PATCH 1/4] soc: fsl: qbman: FSL_DPAA depends on COMPILE_TEST
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20240626/202406261920.l5pzm1rj-...@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240626/202406261920.l5pzm1rj-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: https://lore.kernel.org/oe-kbuild-all/202406261920.l5pzm1rj-...@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:3280:12: warning: stack frame size (16664) exceeds limit (2048) in 'dpaa_eth_probe' [-Wframe-larger-than]
    3280 | static int dpaa_eth_probe(struct platform_device *pdev)
         |            ^
   1 warning generated.
--
>> drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c:454:12: warning: stack frame size (8264) exceeds limit (2048) in 'dpaa_set_coalesce' [-Wframe-larger-than]
     454 | static int dpaa_set_coalesce(struct net_device *dev,
         |            ^
   1 warning generated.


vim +/dpaa_eth_probe +3280 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c

9ad1a37493338c Madalin Bucur   2016-11-15  3279  
9ad1a37493338c Madalin Bucur   2016-11-15 @3280  static int dpaa_eth_probe(struct platform_device *pdev)
9ad1a37493338c Madalin Bucur   2016-11-15  3281  {
9ad1a37493338c Madalin Bucur   2016-11-15  3282  	struct net_device *net_dev = NULL;
f07f30042f8e0f Madalin Bucur   2019-10-31  3283  	struct dpaa_bp *dpaa_bp = NULL;
9ad1a37493338c Madalin Bucur   2016-11-15  3284  	struct dpaa_fq *dpaa_fq, *tmp;
9ad1a37493338c Madalin Bucur   2016-11-15  3285  	struct dpaa_priv *priv = NULL;
9ad1a37493338c Madalin Bucur   2016-11-15  3286  	struct fm_port_fqs port_fqs;
9ad1a37493338c Madalin Bucur   2016-11-15  3287  	struct mac_device *mac_dev;
f07f30042f8e0f Madalin Bucur   2019-10-31  3288  	int err = 0, channel;
9ad1a37493338c Madalin Bucur   2016-11-15  3289  	struct device *dev;
9ad1a37493338c Madalin Bucur   2016-11-15  3290  
060ad66f97954f Madalin Bucur   2019-10-23  3291  	dev = &pdev->dev;
060ad66f97954f Madalin Bucur   2019-10-23  3292  
5537b329857676 Laurentiu Tudor 2019-10-23  3293  	err = bman_is_probed();
5537b329857676 Laurentiu Tudor 2019-10-23  3294  	if (!err)
5537b329857676 Laurentiu Tudor 2019-10-23  3295  		return -EPROBE_DEFER;
5537b329857676 Laurentiu Tudor 2019-10-23  3296  	if (err < 0) {
060ad66f97954f Madalin Bucur   2019-10-23  3297  		dev_err(dev, "failing probe due to bman probe error\n");
5537b329857676 Laurentiu Tudor 2019-10-23  3298  		return -ENODEV;
5537b329857676 Laurentiu Tudor 2019-10-23  3299  	}
5537b329857676 Laurentiu Tudor 2019-10-23  3300  	err = qman_is_probed();
5537b329857676 Laurentiu Tudor 2019-10-23  3301  	if (!err)
5537b329857676 Laurentiu Tudor 2019-10-23  3302  		return -EPROBE_DEFER;
5537b329857676 Laurentiu Tudor 2019-10-23  3303  	if (err < 0) {
060ad66f97954f Madalin Bucur   2019-10-23  3304  		dev_err(dev, "failing probe due to qman probe error\n");
5537b329857676 Laurentiu Tudor 2019-10-23  3305  		return -ENODEV;
5537b329857676 Laurentiu Tudor 2019-10-23  3306  	}
5537b329857676 Laurentiu Tudor 2019-10-23  3307  	err = bman_portals_probed();
5537b329857676 Laurentiu Tudor 2019-10-23  3308  	if (!err)
5537b329857676 Laurentiu Tudor 2019-10-23  3309  		return -EPROBE_DEFER;
5537b329857676 Laurentiu Tudor 2019-10-23  3310  	if (err < 0) {
060ad66f97954f Madalin Bucur   2019-10-23  3311  		dev_err(dev,
5537b329857676 Laurentiu Tudor 2019-10-23  3312  			"failing probe due to bman portals probe error\n");
5537b329857676 Laurentiu Tudor 2019-10-23  3313  		return -ENODEV;
5537b329857676 Laurentiu Tudor 2019-10-23  3314  	}

[PATCH] arch/powerpc/kvm: Avoid extra checks when emulating HFSCR bits

2024-06-26 Thread Gautam Menghani
When a KVM guest tries to use a feature disabled by HFSCR, it exits to
the host for emulation support, and the code checks for all bits which
are emulated. Avoid checking all the bits by using a switch case.

Signed-off-by: Gautam Menghani 
---
 arch/powerpc/kvm/book3s_hv.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 99c7ce825..a72fd2543 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1922,14 +1922,22 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 
r = EMULATE_FAIL;
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-   if (cause == FSCR_MSGP_LG)
+   switch (cause) {
+   case FSCR_MSGP_LG:
r = kvmppc_emulate_doorbell_instr(vcpu);
-   if (cause == FSCR_PM_LG)
+   break;
+   case FSCR_PM_LG:
r = kvmppc_pmu_unavailable(vcpu);
-   if (cause == FSCR_EBB_LG)
+   break;
+   case FSCR_EBB_LG:
r = kvmppc_ebb_unavailable(vcpu);
-   if (cause == FSCR_TM_LG)
+   break;
+   case FSCR_TM_LG:
r = kvmppc_tm_unavailable(vcpu);
+   break;
+   default:
+   break;
+   }
}
if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL |
-- 
2.45.2



Re: [PATCH 1/4] soc: fsl: qbman: FSL_DPAA depends on COMPILE_TEST

2024-06-26 Thread Vladimir Oltean
On Wed, Jun 26, 2024 at 08:09:53PM +0800, kernel test robot wrote:
> Hi Breno,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on herbert-cryptodev-2.6/master]
> [also build test WARNING on soc/for-next linus/master v6.10-rc5 next-20240625]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/crypto-caam-Depend-on-COMPILE_TEST-also/20240625-223834
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
> patch link: https://lore.kernel.org/r/20240624162128.1665620-1-leitao%40debian.org
> patch subject: [PATCH 1/4] soc: fsl: qbman: FSL_DPAA depends on COMPILE_TEST
> config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20240626/202406261920.l5pzm1rj-...@intel.com/config)
> compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240626/202406261920.l5pzm1rj-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: https://lore.kernel.org/oe-kbuild-all/202406261920.l5pzm1rj-...@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:3280:12: warning: stack frame size (16664) exceeds limit (2048) in 'dpaa_eth_probe' [-Wframe-larger-than]
>     3280 | static int dpaa_eth_probe(struct platform_device *pdev)
>          |            ^
>    1 warning generated.
> --
> >> drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c:454:12: warning: stack frame size (8264) exceeds limit (2048) in 'dpaa_set_coalesce' [-Wframe-larger-than]
>      454 | static int dpaa_set_coalesce(struct net_device *dev,
>          |            ^
>    1 warning generated.

Arrays of NR_CPUS elements are probably what it doesn't like.
In the attached Kconfig, CONFIG_NR_CPUS is 8192, which is clearly
excessive compared to the SoCs that the driver is written for and
expects to run on (1-24 cores).

static int dpaa_set_coalesce(struct net_device *dev,
 struct ethtool_coalesce *c,
 struct kernel_ethtool_coalesce *kernel_coal,
 struct netlink_ext_ack *extack)
{
const cpumask_t *cpus = qman_affine_cpus();
bool needs_revert[NR_CPUS] = {false};
...
}

static void dpaa_fq_setup(struct dpaa_priv *priv,
  const struct dpaa_fq_cbs *fq_cbs,
  struct fman_port *tx_port)
{
int egress_cnt = 0, conf_cnt = 0, num_portals = 0, portal_cnt = 0, cpu;
const cpumask_t *affine_cpus = qman_affine_cpus();
u16 channels[NR_CPUS];
...
}

While 'needs_revert' can probably easily be converted to a bitmask which
consumes 8 times less space, I don't know what to say about the "channels"
array. It could probably be rewritten to use dynamic allocation for the
array. I don't have any better idea...
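
For illustration, a sketch of both ideas (set_coalesce_on_cpu() and
revert_coalesce_on_cpu() are hypothetical helpers standing in for the
driver's existing per-CPU logic):

	static int dpaa_set_coalesce(struct net_device *dev,
				     struct ethtool_coalesce *c,
				     struct kernel_ethtool_coalesce *kernel_coal,
				     struct netlink_ext_ack *extack)
	{
		const cpumask_t *cpus = qman_affine_cpus();
		DECLARE_BITMAP(needs_revert, NR_CPUS); /* 1 bit, not 1 byte, per CPU */
		int cpu, err = 0;

		bitmap_zero(needs_revert, NR_CPUS);
		for_each_cpu(cpu, cpus) {
			err = set_coalesce_on_cpu(cpu, c); /* hypothetical helper */
			if (err)
				break;
			set_bit(cpu, needs_revert);
		}
		if (err) {
			/* roll back only the CPUs that were actually changed */
			for_each_set_bit(cpu, needs_revert, NR_CPUS)
				revert_coalesce_on_cpu(cpu); /* hypothetical helper */
		}
		return err;
	}

The "channels" array in dpaa_fq_setup() could likewise be allocated
dynamically, sized by the possible CPUs rather than NR_CPUS:

	u16 *channels = kcalloc(num_possible_cpus(), sizeof(*channels),
				GFP_KERNEL);
	if (!channels)
		return -ENOMEM; /* would need dpaa_fq_setup() to return int */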


Re: [PATCH v3 1/2] pci/hotplug/pnv_php: Fix hotplug driver crash on Powernv

2024-06-26 Thread Bjorn Helgaas
I expect this series would go through the powerpc tree since that's
where most of the change is.

On Mon, Jun 24, 2024 at 05:39:27PM +0530, Krishna Kumar wrote:
> Description of the problem: The hotplug driver for powerpc
> (pci/hotplug/pnv_php.c) gives kernel crash when we try to
> hot-unplug/disable the PCIe switch/bridge from the PHB.
> 
> Root Cause of Crash: The crash is due to the reason that, though the msi
> data structure has been released during disable/hot-unplug path and it
> has been assigned with NULL, still during unregistartion the code was
> again trying to explicitly disable the msi which causes the Null pointer
> dereference and kernel crash.

s/unregistartion/unregistration/
s/Null/NULL/ to match previous use
s/msi/MSI/ to match spec usage

> Proposed Fix : The fix is to correct the check during unregistration path
> so that the code should not  try to invoke pci_disable_msi/msix() if its
> data structure is already freed.

s/Proposed Fix : The fix is to// ... Just say what the patch does.

If/when the powerpc folks like this, add my:

Acked-by: Bjorn Helgaas 

> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: Bjorn Helgaas 
> Cc: Gaurav Batra 
> Cc: Nathan Lynch 
> Cc: Brian King 
> 
> Signed-off-by: Krishna Kumar 
> ---
>  drivers/pci/hotplug/pnv_php.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index 694349be9d0a..573a41869c15 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -40,7 +40,6 @@ static void pnv_php_disable_irq(struct pnv_php_slot 
> *php_slot,
>   bool disable_device)
>  {
>   struct pci_dev *pdev = php_slot->pdev;
> - int irq = php_slot->irq;
>   u16 ctrl;
>  
>   if (php_slot->irq > 0) {
> @@ -59,7 +58,7 @@ static void pnv_php_disable_irq(struct pnv_php_slot 
> *php_slot,
>   php_slot->wq = NULL;
>   }
>  
> - if (disable_device || irq > 0) {
> + if (disable_device) {
>   if (pdev->msix_enabled)
>   pci_disable_msix(pdev);
>   else if (pdev->msi_enabled)
> -- 
> 2.45.0
> 


Re: [PATCH] printk: Add a short description string to kmsg_dump()

2024-06-26 Thread Kees Cook
On Tue, Jun 25, 2024 at 02:39:29PM +0200, Jocelyn Falempe wrote:
> kmsg_dump doesn't forward the panic reason string to the kmsg_dumper
> callback.
> This patch adds a new parameter "const char *desc" to the kmsg_dumper
> dump() callback, and update all drivers that are using it.
> 
> To avoid updating all kmsg_dump() call, it adds a kmsg_dump_desc()
> function and a macro for backward compatibility.
> 
> I've written this for drm_panic, but it can be useful for other
> kmsg_dumper.
> It allows to see the panic reason, like "sysrq triggered crash"
> or "VFS: Unable to mount root fs on " on the drm panic screen.

Seems reasonable. Given the prototype before/after:

dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason)

dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
 const char *desc)

Perhaps this should instead be a struct that the panic fills in? Then
it'll be easy to adjust the struct in the future:

struct kmsg_dump_detail {
enum kmsg_dump_reason reason;
const char *description;
};

dump(struct kmsg_dumper *dumper, struct kmsg_dump_detail *detail)
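
A callback converted to the proposed interface might then look like this (a
sketch; using ->description for the drm panic screen is an assumption based
on the motivation above, not code from the patch):

	static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
					struct kmsg_dump_detail *detail)
	{
		/* only the panic reason is interesting here */
		if (detail->reason != KMSG_DUMP_PANIC)
			return;

		/* e.g. "sysrq triggered crash", shown on the panic screen */
		if (detail->description)
			pr_info("panic: %s\n", detail->description);
	}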

This .cocci could do the conversion:


@ dump_func @
identifier DUMPER, CALLBACK;
@@

  struct kmsg_dumper DUMPER = {
.dump = CALLBACK,
  };

@ detail @
identifier dump_func.CALLBACK;
identifier DUMPER, REASON;
@@

CALLBACK(struct kmsg_dumper *DUMPER,
-enum kmsg_dump_reason REASON
+struct kmsg_dump_detail *detail
)
{
<...
-   REASON
+   detail->reason
...>
}


Also, just to double-check: doesn't the panic reason show up in the
kmsg_dump log itself (at the end)? I ask since for pstore "desc" is
likely redundant, as it's capturing the entire console log.

-Kees

Here's the patch from the above cocci:


diff -u -p a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -207,13 +207,13 @@ static int hv_die_panic_notify_crash(str
  * buffer and call into Hyper-V to transfer the data.
  */
 static void hv_kmsg_dump(struct kmsg_dumper *dumper,
-enum kmsg_dump_reason reason)
+struct kmsg_dump_detail *detail)
 {
struct kmsg_dump_iter iter;
size_t bytes_written;
 
/* We are only interested in panics. */
-   if (reason != KMSG_DUMP_PANIC || !sysctl_record_panic_msg)
+   if (detail->reason != KMSG_DUMP_PANIC || !sysctl_record_panic_msg)
return;
 
/*
diff -u -p a/arch/powerpc/platforms/powernv/opal-kmsg.c b/arch/powerpc/platforms/powernv/opal-kmsg.c
--- a/arch/powerpc/platforms/powernv/opal-kmsg.c
+++ b/arch/powerpc/platforms/powernv/opal-kmsg.c
@@ -20,13 +20,13 @@
  * message, it just ensures that OPAL completely flushes the console buffer.
  */
 static void kmsg_dump_opal_console_flush(struct kmsg_dumper *dumper,
-enum kmsg_dump_reason reason)
+struct kmsg_dump_detail *detail)
 {
/*
 * Outside of a panic context the pollers will continue to run,
 * so we don't need to do any special flushing.
 */
-   if (reason != KMSG_DUMP_PANIC)
+   if (detail->reason != KMSG_DUMP_PANIC)
return;
 
opal_flush_console(0);
diff -u -p a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
--- a/arch/powerpc/kernel/nvram_64.c
+++ b/arch/powerpc/kernel/nvram_64.c
@@ -73,7 +73,7 @@ static const char *nvram_os_partitions[]
 };
 
 static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason);
+ struct kmsg_dump_detail *detail);
 
 static struct kmsg_dumper nvram_kmsg_dumper = {
.dump = oops_to_nvram
@@ -643,7 +643,7 @@ void __init nvram_init_oops_partition(in
  * partition.  If that's too much, go back and capture uncompressed text.
  */
 static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason)
+ struct kmsg_dump_detail *detail)
 {
struct oops_log_info *oops_hdr = (struct oops_log_info *)oops_buf;
static unsigned int oops_count = 0;
@@ -655,7 +655,7 @@ static void oops_to_nvram(struct kmsg_du
unsigned int err_type = ERR_TYPE_KERNEL_PANIC_GZ;
int rc = -1;
 
-   switch (reason) {
+   switch (detail->reason) {
case KMSG_DUMP_SHUTDOWN:
/* These are almost always orderly shutdowns. */
return;
@@ -671,7 +671,7 @@ static void oops_to_nvram(struct kmsg_du
break;
default:
pr_err("%s: ignoring unrecognized KMSG_DUMP_* reason %d\n",
-  __func__, (int) reason);
+  __func__, (int) detail->reason);
return;
}
 
warning: detail, node 59: record.reason = ... ;[1,2,21,22,32] in pstore_dump may be inco

[PATCH v2 1/1] dt-bindings: soc: fsl: Convert q(b)man-* to yaml format

2024-06-26 Thread Frank Li
Convert qman, bman, qman-portals, bman-portals to yaml format.

Additional Change for fsl,q(b)man-portal:
- Only keep one example.
- Add fsl,qman-channel-id property.
- Use interrupt type macro.
- Remove top level qman-portals@ff420 at example.

Additional change for fsl,q(b)man:
- Fixed example error.
- Remove redundant part, only keep fsl,qman node.
- Change memory-regions to memory-region.
- fsl,q(b)man-portals is not a required property.

Additional change for fsl,qman-fqd.yaml:
- Fixed example error.
- Only keep one example.
- Ref to reserved-memory.yaml.
- Merge fsl,bman reserved-memory part.

Signed-off-by: Frank Li 
---
Change from v1 to v2
- fix typo chang
- fix typo porta
- Add | for reg description
- wrap to 80 for reg descritption
- memory-region set maxItems: 2
- fix regex parttern
- drop  See clock-bindings.txt
- "see reserved-memory.yaml" change to
"see reserved-memory/reserved-memory.yaml in dtschema project"

- A strange thing in fsl,qman-fqd.yaml: if the example compatible string
is changed to fsl,qman-fqd, dt_binding_check reports the error below.
qman-fqd: False schema does not allow {'compatible': ['fsl,qman-fqd'], 'size': [[4194304]], 'alignment': [[4194304]], 'no-map': True, '$nodename': ['qman-fqd']}

but if I replace "fsl,qman-fqd" with "abc", it passes the check.
---
 .../bindings/soc/fsl/bman-portals.txt |  56 --
 .../devicetree/bindings/soc/fsl/bman.txt  | 137 -
 .../bindings/soc/fsl/fsl,bman-portal.yaml |  52 +
 .../devicetree/bindings/soc/fsl/fsl,bman.yaml |  83 
 .../bindings/soc/fsl/fsl,qman-fqd.yaml|  69 +++
 .../bindings/soc/fsl/fsl,qman-portal.yaml | 110 +++
 .../devicetree/bindings/soc/fsl/fsl,qman.yaml |  93 +
 .../bindings/soc/fsl/qman-portals.txt | 134 -
 .../devicetree/bindings/soc/fsl/qman.txt  | 187 --
 9 files changed, 407 insertions(+), 514 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman.txt
 create mode 100644 
Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman-fqd.yaml
 create mode 100644 
Documentation/devicetree/bindings/soc/fsl/fsl,qman-portal.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman.yaml
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman-portals.txt
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman.txt

diff --git a/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt b/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
deleted file mode 100644
index 2a00e14e11e02..0
--- a/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-QorIQ DPAA Buffer Manager Portals Device Tree Binding
-
-Copyright (C) 2008 - 2014 Freescale Semiconductor Inc.
-
-CONTENTS
-
-   - BMan Portal
-   - Example
-
-BMan Portal Node
-
-Portals are memory mapped interfaces to BMan that allow low-latency, lock-less
-interaction by software running on processor cores, accelerators and network
-interfaces with the BMan
-
-PROPERTIES
-
-- compatible
-   Usage:  Required
-   Value type: 
-   Definition: Must include "fsl,bman-portal-"
-   May include "fsl,-bman-portal" or "fsl,bman-portal"
-
-- reg
-   Usage:  Required
-   Value type: 
-   Definition: Two regions. The first is the cache-enabled region of
-   the portal. The second is the cache-inhibited region of
-   the portal
-
-- interrupts
-   Usage:  Required
-   Value type: 
-   Definition: Standard property
-
-EXAMPLE
-
-The example below shows a (P4080) BMan portals container/bus node with two 
portals
-
-   bman-portals@ff400 {
-   #address-cells = <1>;
-   #size-cells = <1>;
-   compatible = "simple-bus";
-   ranges = <0 0xf 0xf400 0x20>;
-
-   bman-portal@0 {
-   compatible = "fsl,bman-portal-1.0.0", "fsl,bman-portal";
-   reg = <0x0 0x4000>, <0x10 0x1000>;
-   interrupts = <105 2 0 0>;
-   };
-   bman-portal@4000 {
-   compatible = "fsl,bman-portal-1.0.0", "fsl,bman-portal";
-   reg = <0x4000 0x4000>, <0x101000 0x1000>;
-   interrupts = <107 2 0 0>;
-   };
-   };
diff --git a/Documentation/devicetree/bindings/soc/fsl/bman.txt b/Documentation/devicetree/bindings/soc/fsl/bman.txt
deleted file mode 100644
index 48eed140765b0..0
--- a/Documentation/devicetree/bindings/soc/fsl/bman.txt
+++ /dev/null
@@ -1,137 +0,0 @@
-QorIQ DPAA Buffer Manager Dev

Re: [V4 05/16] tools/perf: Add disasm_line__parse to parse raw instruction for powerpc

2024-06-26 Thread Namhyung Kim
Hello,

On Wed, Jun 26, 2024 at 09:38:28AM +0530, Athira Rajeev wrote:
> 
> 
> > On 26 Jun 2024, at 12:15 AM, Namhyung Kim  wrote:
> > 
> > On Tue, Jun 25, 2024 at 06:12:51PM +0530, Athira Rajeev wrote:
> >> 
> >> 
> >>> On 25 Jun 2024, at 11:09 AM, Namhyung Kim  wrote:
> >>> 
> >>> On Fri, Jun 14, 2024 at 10:56:20PM +0530, Athira Rajeev wrote:
>  Currently, the perf tool infrastructure disasm_line__parse function to
>  parse disassembled line.
>  
>  Example snippet from objdump:
>  objdump  --start-address= --stop-address=  -d 
>  --no-show-raw-insn -C 
>  
>  c10224b4: lwz r10,0(r9)
>  
>  This line "lwz r10,0(r9)" is parsed to extract instruction name,
>  registers names and offset. In powerpc, the approach for data type
>  profiling uses raw instruction instead of result from objdump to identify
>  the instruction category and extract the source/target registers.
>  
>  Example: 38 01 81 e8 ld  r4,312(r1)
>  
>  Here "38 01 81 e8" is the raw instruction representation. Add function
>  "disasm_line__parse_powerpc" to handle parsing of raw instruction.
>  Also update "struct disasm_line" to save the binary code/
>  With the change, function captures:
>  
>  line -> "38 01 81 e8 ld  r4,312(r1)"
>  raw instruction "38 01 81 e8"
>  
>  Raw instruction is used later to extract the reg/offset fields. Macros
>  are added to extract opcode and register fields. "struct disasm_line"
>  is updated to carry union of "bytes" and "raw_insn" of 32 bit to carry 
>  raw
>  code (raw). Function "disasm_line__parse_powerpc fills the raw
>  instruction hex value and can use macros to get opcode. There is no
>  changes in existing code paths, which parses the disassembled code.
>  The architecture using the instruction name and present approach is
>  not altered. Since this approach targets powerpc, the macro
>  implementation is added for powerpc as of now.
>  
>  Since the disasm_line__parse is used in other cases (perf annotate) and
>  not only data tye profiling, the powerpc callback includes changes to
>  work with binary code as well as mneumonic representation. Also in case
>  if the DSO read fails and libcapstone is not supported, the approach
>  fallback to use objdump as option. Hence as option, patch has changes to
>  ensure objdump option also works well.
>  
>  Signed-off-by: Athira Rajeev 
>  ---
[SNIP]
>  +/*
>  + * Parses the result captured from symbol__disassemble_*
>  + * Example, line read from DSO file in powerpc:
>  + * line:38 01 81 e8
>  + * opcode: fetched from arch specific get_opcode_insn
>  + * rawp_insn: e8810138
>  + *
>  + * rawp_insn is used later to extract the reg/offset fields
>  + */
>  +#define PPC_OP(op) (((op) >> 26) & 0x3F)
>  +
>  +static int disasm_line__parse_powerpc(struct disasm_line *dl)
>  +{
>  + char *line = dl->al.line;
>  + const char **namep = &dl->ins.name;
>  + char **rawp = &dl->ops.raw;
>  + char tmp, *tmp_raw_insn, *name_raw_insn = skip_spaces(line);
>  + char *name = skip_spaces(name_raw_insn + 11);
>  + int objdump = 0;
>  +
>  + if (strlen(line) > 11)
>  + objdump = 1;
>  +
>  + if (name_raw_insn[0] == '\0')
>  + return -1;
>  +
>  + if (objdump) {
>  + *rawp = name + 1;
>  + while ((*rawp)[0] != '\0' && !isspace((*rawp)[0]))
>  + ++*rawp;
>  + tmp = (*rawp)[0];
>  + (*rawp)[0] = '\0';
>  +
>  + *namep = strdup(name);
>  + if (*namep == NULL)
>  + return -1;
>  +
>  + (*rawp)[0] = tmp;
>  + *rawp = strim(*rawp);
>  + } else
>  + *namep = "";
> > 
> > Then can you handle this logic under if (annotate_opts.show_raw_insn)
> > in disasm_line__parse() instead of adding a new function?
> > 
> > Thanks,
> > Namhyung
> 
> Hi Namhyung,
> 
> We discussed to have a per-arch disasm_line_parse() here:
> https://lore.kernel.org/all/cam9d7ci1lda7mot2qdr2qk+dtnlu6zbkmronbdozajuqlqf...@mail.gmail.com/#t
> 
> So I added it as a new function : disasm_line__parse_powerpc
> Since it is not used by other archs, we can go with having new function ?

Ok, I thought it'd be quite different from disasm_line__parse() but it
seems that it's mostly similar except for the raw insn.  So I think it's
better to add the logic to the generic disasm_line__parse().  Sorry for
the inconvenience.

Thanks,
Namhyung

>  +
>  + tmp_raw_insn = strdup(name_raw_insn);
>  + tmp_raw_insn[11] = '\0';
>  + remove_spaces(tmp_raw_insn);
>  +
>  + dl->raw.raw_insn = strtol(tmp_raw_insn, NULL, 16);
>  + if (objdump)
>  + dl->raw.raw_insn = be32_to_cpu(strtol(tmp_raw_insn, NULL, 16));
> >>> 
> >>> Hmm.. can you use a sscanf() instead?
> >>> 
> >>> sscanf(line, "%x %x %x %x", &dl->raw.bytes[0], &dl->raw.bytes[1], ...)
> >>
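
A sketch of that sscanf() approach for a line like "38 01 81 e8 ld r4,312(r1)"
(variable names are illustrative; the byte order matches the "e8810138"
example in the quoted comment):

	unsigned int b[4];
	u32 raw_insn = 0;

	/* the four raw bytes precede the mnemonic in the DSO/objdump output */
	if (sscanf(line, "%x %x %x %x", &b[0], &b[1], &b[2], &b[3]) == 4)
		raw_insn = b[3] << 24 | b[2] << 16 | b[1] << 8 | b[0];

	/* PPC_OP(0xe8810138) == 58, the primary opcode of "ld" */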

[PATCH 00/13] fs/dax: Fix FS DAX page reference counts

2024-06-26 Thread Alistair Popple
FS DAX pages have always maintained their own page reference counts
without following the normal rules for page reference counting. In
particular pages are considered free when the refcount hits one rather
than zero and refcounts are not added when mapping the page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable SW define PTE bit on architectures
that have devmap PTE bits defined.

It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvement.

This is an update to the original RFC rebased onto v6.10-rc5. Unlike
the original RFC it passes the same number of ndctl test suite
(https://github.com/pmem/ndctl) tests as my current development
environment does without these patches.

I am not intimately familiar with the FS DAX code so would appreciate
some careful review there. In particular I have not given any thought
at all to CONFIG_FS_DAX_LIMITED.

Signed-off-by: Alistair Popple 

Alistair Popple (13):
  mm/gup.c: Remove redundant check for PCI P2PDMA page
  pci/p2pdma: Don't initialise page refcount to one
  fs/dax: Refactor wait for dax idle page
  fs/dax: Add dax_page_free callback
  mm: Allow compound zone device pages
  mm/memory: Add dax_insert_pfn
  huge_memory: Allow mappings of PUD sized pages
  huge_memory: Allow mappings of PMD sized pages
  gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  fs/dax: Properly refcount fs dax pages
  huge_memory: Remove dead vmf_insert_pXd code
  mm: Remove pXX_devmap callers
  mm: Remove devmap related functions and page table bits

 Documentation/mm/arch_pgtable_helpers.rst |   6 +-
 arch/arm64/Kconfig|   1 +-
 arch/arm64/include/asm/pgtable-prot.h |   1 +-
 arch/arm64/include/asm/pgtable.h  |  24 +--
 arch/powerpc/Kconfig  |   1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   7 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  52 +
 arch/powerpc/include/asm/book3s/64/radix.h|  14 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c   |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c|   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |   5 +-
 arch/powerpc/mm/pgtable.c |   2 +-
 arch/x86/Kconfig  |   1 +-
 arch/x86/include/asm/pgtable.h|  50 +
 arch/x86/include/asm/pgtable_types.h  |   5 +-
 drivers/dax/device.c  |  12 +-
 drivers/dax/super.c   |   2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c|   2 +-
 drivers/nvdimm/pmem.c |   9 +-
 drivers/pci/p2pdma.c  |   4 +-
 fs/dax.c  | 204 +++-
 fs/ext4/inode.c   |   5 +-
 fs/fuse/dax.c |   4 +-
 fs/fuse/virtio_fs.c   |   8 +-
 fs/userfaultfd.c  |   2 +-
 fs/xfs/xfs_inode.c|   4 +-
 include/linux/dax.h   |  11 +-
 include/linux/huge_mm.h   |  17 +-
 include/linux/memremap.h  |  23 +-
 include/linux/migrate.h   |   2 +-
 include/linux/mm.h|  40 +---
 include/linux/page-flags.h|   6 +-
 include/linux/pfn_t.h |  20 +--
 include/linux/pgtable.h   |  21 +--
 include/linux/rmap.h  |  14 +-
 lib/test_hmm.c|   2 +-
 mm/Kconfig|   4 +-
 mm/debug_vm_pgtable.c |  59 +-
 mm/gup.c  | 178 +--
 mm/hmm.c  |  12 +-
 mm/huge_memory.c  | 248 +++
 mm/internal.h |   2 +-
 mm/khugepaged.c   |   2 +-
 mm/mapping_dirty_helpers.c|   4 +-
 mm/memory-failure.c   |   6 +-
 mm/memory.c   | 114 ++---
 mm/memremap.c |  38 +---
 mm/migrate_device.c   |   6 +-
 mm/mlock.c|   2 +-
 mm/mm_init.c  |   5 +-
 mm/mprotect.c |   2 +-

[PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page

2024-06-26 Thread Alistair Popple
PCI P2PDMA pages are not mapped with pXX_devmap PTEs, therefore the
check in __gup_device_huge() is redundant. Remove it.

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 
Acked-by: David Hildenbrand 
---
 mm/gup.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ca0f5ce..669583e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3044,11 +3044,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
break;
}
 
-   if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-   gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-   break;
-   }
-
folio = try_grab_folio(page, 1, flags);
if (!folio) {
gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-- 
git-series 0.9.1


[PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one

2024-06-26 Thread Alistair Popple
The reference counts for ZONE_DEVICE private pages should be
initialised by the driver when the page is actually allocated by the
driver allocator, not when they are first created. This is currently
the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.

Signed-off-by: Alistair Popple 
---
 drivers/pci/p2pdma.c | 2 ++
 mm/memremap.c| 8 
 mm/mm_init.c | 4 +++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13..1e9ea32 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
goto out;
}
 
+   set_page_count(virt_to_page(kaddr), 1);
+
/*
 * vm_insert_page() can sleep, so a reference is taken to mapping
 * such that rcu_read_unlock() can be done before inserting the
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547..caccbd8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -488,15 +488,15 @@ void free_zone_device_folio(struct folio *folio)
folio->mapping = NULL;
folio->page.pgmap->ops->page_free(folio_page(folio, 0));
 
-   if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
-   folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+   if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
+   folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
+   put_dev_pagemap(folio->page.pgmap);
+   else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
/*
 * Reset the refcount to 1 to prepare for handing out the page
 * again.
 */
folio_set_count(folio, 1);
-   else
-   put_dev_pagemap(folio->page.pgmap);
 }
 
 void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3ec0493..b7e1599 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -6,6 +6,7 @@
  * Author Mel Gorman 
  *
  */
+#include "linux/memremap.h"
 #include 
 #include 
 #include 
@@ -1014,7 +1015,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 * which will set the page count to 1 when allocating the page.
 */
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-   pgmap->type == MEMORY_DEVICE_COHERENT)
+   pgmap->type == MEMORY_DEVICE_COHERENT ||
+   pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
set_page_count(page, 0);
 }
 
-- 
git-series 0.9.1


[PATCH 03/13] fs/dax: Refactor wait for dax idle page

2024-06-26 Thread Alistair Popple
A FS DAX page is considered idle when its refcount drops to one. This
is currently open-coded in all file systems supporting FS DAX. Move
the idle detection to a common function to make future changes easier.

Signed-off-by: Alistair Popple 
Reviewed-by: Jan Kara 
---
 fs/ext4/inode.c | 5 +
 fs/fuse/dax.c   | 4 +---
 fs/xfs/xfs_inode.c  | 4 +---
 include/linux/dax.h | 8 
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bae9cc..4737450 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3844,10 +3844,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
 
-   error = ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1,
-   TASK_INTERRUPTIBLE, 0, 0,
-   ext4_wait_dax_page(inode));
+   error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
} while (error == 0);
 
return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d..da50595 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, fuse_wait_dax_page(inode));
+   return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f36091e..b5742aa 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -4243,9 +4243,7 @@ xfs_break_dax_layouts(
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, xfs_wait_dax_page(inode));
+   return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e332..773dfc4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -213,6 +213,14 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);
 
+static inline int dax_wait_page_idle(struct page *page,
+   void (cb)(struct inode *),
+   struct inode *inode)
+{
+   return ___wait_var_event(page, page_ref_count(page) == 1,
+   TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);
-- 
git-series 0.9.1


[PATCH 04/13] fs/dax: Add dax_page_free callback

2024-06-26 Thread Alistair Popple
When a fs dax page is freed it has to notify filesystems that the page
has been unpinned/unmapped and is free. Currently this involves
special code in the page free paths to detect a transition of refcount
from 2 to 1 and to call some fs dax specific code.

A future change will require this to happen when the page refcount
drops to zero. In this case we can use the existing
pgmap->ops->page_free() callback so wire that up for all devices that
support FS DAX (nvdimm and virtio).

Signed-off-by: Alistair Popple 
---
 drivers/nvdimm/pmem.c | 1 +
 fs/dax.c  | 6 ++
 fs/fuse/virtio_fs.c   | 5 +
 include/linux/dax.h   | 1 +
 4 files changed, 13 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 598fe2e..cafadd0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -444,6 +444,7 @@ static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
 
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
.memory_failure = pmem_pagemap_memory_failure,
+   .page_free  = dax_page_free,
 };
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/fs/dax.c b/fs/dax.c
index becb4a6..f93afd7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -2065,3 +2065,9 @@ int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
   pos_out, len, remap_flags, ops);
 }
 EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
+
+void dax_page_free(struct page *page)
+{
+   wake_up_var(page);
+}
+EXPORT_SYMBOL_GPL(dax_page_free);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 1a52a51..6e90a4b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -909,6 +909,10 @@ static void virtio_fs_cleanup_dax(void *data)
 
DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))
 
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+   .page_free = dax_page_free,
+};
+
static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 {
struct dax_device *dax_dev __free(cleanup_dax) = NULL;
@@ -948,6 +952,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
return -ENOMEM;
 
pgmap->type = MEMORY_DEVICE_FS_DAX;
+   pgmap->ops = &fsdax_pagemap_ops;
 
/* Ideally we would directly use the PCI BAR resource but
 * devm_memremap_pages() wants its own copy in pgmap.  So
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 773dfc4..adbafc8 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -213,6 +213,7 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);
 
+void dax_page_free(struct page *page);
 static inline int dax_wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
-- 
git-series 0.9.1


[PATCH 05/13] mm: Allow compound zone device pages

2024-06-26 Thread Alistair Popple
Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages page->compound_head is shared with
page->pgmap.

The page->pgmap field is common to all pages within a memory section.
Therefore pgmap is the same for both head and tail pages and we can
use the same scheme to distinguish tail pages. To obtain the pgmap for
a tail page a new accessor is introduced to fetch it from
compound_head.

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 

---

In response to the RFC Matthew Wilcox pointed out that we could move
the pgmap field to the folio. Morally I think that's where pgmap
belongs, so I it's a good idea that I just haven't had a change to
implement yet. I suspect there will be at least a v2 of this series
though so will probably do it then.
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 drivers/pci/p2pdma.c   |  2 +-
 include/linux/memremap.h   | 12 +---
 include/linux/migrate.h|  2 +-
 lib/test_hmm.c |  2 +-
 mm/hmm.c   |  2 +-
 mm/memory.c|  2 +-
 mm/memremap.c  |  8 
 mm/migrate_device.c|  4 ++--
 9 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 6fb65b0..18d74a7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,7 @@ struct nouveau_dmem {
 
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
-   return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+   return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk, pagemap);
 }
 
 static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 1e9ea32..d9b422a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -195,7 +195,7 @@ static const struct attribute_group p2pmem_group = {
 
 static void p2pdma_page_free(struct page *page)
 {
-   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..6505713 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -140,6 +140,12 @@ struct dev_pagemap {
};
 };
 
+static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
+{
+   WARN_ON(!is_zone_device_page(page));
+   return compound_head(page)->pgmap;
+}
+
 static inline bool pgmap_has_memory_failure(struct dev_pagemap *pgmap)
 {
return pgmap->ops && pgmap->ops->memory_failure;
@@ -161,7 +167,7 @@ static inline bool is_device_private_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+   page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +179,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+   page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
 static inline bool is_device_coherent_page(const struct page *page)
 {
return is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_COHERENT;
+   page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
 }
 
 static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2ce13e8..e31acc0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -200,7 +200,7 @@ struct migrate_vma {
   

[PATCH 06/13] mm/memory: Add dax_insert_pfn

2024-06-26 Thread Alistair Popple
Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is
because DAX page refcounts are treated specially, as indicated by the
presence of a devmap entry.

To allow DAX page refcounts to be managed the same as normal page
refcounts introduce dax_insert_pfn. This will take a reference on the
underlying page much the same as vmf_insert_page, except it also
permits upgrading an existing mapping to be writable if
requested/possible.

Signed-off-by: Alistair Popple 
---
 include/linux/mm.h |  4 ++-
 mm/memory.c| 79 ++-
 2 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a5652c..b84368b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1080,6 +1080,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 struct mmu_gather;
 struct inode;
 
+extern void prep_compound_page(struct page *page, unsigned int order);
+
 /*
  * compound_order() can be called without holding a reference, which means
  * that niceties like page_folio() don't work.  These callers should be
@@ -3624,6 +3626,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page 
**pages,
unsigned long num);
 int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
+vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
+   unsigned long addr, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index ce48a05..4f26a1f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1989,14 +1989,42 @@ static int validate_page_before_insert(struct page *page)
 }
 
 static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
-   unsigned long addr, struct page *page, pgprot_t prot)
+   unsigned long addr, struct page *page, pgprot_t prot, 
bool mkwrite)
 {
struct folio *folio = page_folio(page);
+   pte_t entry = ptep_get(pte);
 
-   if (!pte_none(ptep_get(pte)))
+   if (!pte_none(entry)) {
+   if (mkwrite) {
+   /*
+* For read faults on private mappings the PFN passed
+* in may not match the PFN we have mapped if the
+* mapped PFN is a writeable COW page.  In the mkwrite
+* case we are creating a writable PTE for a shared
+* mapping and we expect the PFNs to match. If they
+* don't match, we are likely racing with block
+* allocation and mapping invalidation so just skip the
+* update.
+*/
+   if (pte_pfn(entry) != page_to_pfn(page)) {
+   WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
+   return -EFAULT;
+   }
+   entry = maybe_mkwrite(entry, vma);
+   entry = pte_mkyoung(entry);
+   if (ptep_set_access_flags(vma, addr, pte, entry, 1))
+   update_mmu_cache(vma, addr, pte);
+   return 0;
+   }
return -EBUSY;
+   }
+
/* Ok, finally just insert the thing.. */
folio_get(folio);
+   if (mkwrite)
+   entry = maybe_mkwrite(mk_pte(page, prot), vma);
+   else
+   entry = mk_pte(page, prot);
inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
folio_add_file_rmap_pte(folio, page, vma);
-	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
+	set_pte_at(vma->vm_mm, addr, pte, entry);
@@ -2011,7 +2039,7 @@ static int insert_page_into_pte_locked(struct 
vm_area_struct *vma, pte_t *pte,
  * pages reserved for the old functions anyway.
  */
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
-   struct page *page, pgprot_t prot)
+   struct page *page, pgprot_t prot, bool mkwrite)
 {
int retval;
pte_t *pte;
@@ -2024,7 +2052,7 @@ static int insert_page(struct vm_area_struct *vma, 
unsigned long addr,
pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
-   retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
+   retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, 
mkwrite);
pte_unmap_unlock(pte, ptl);
 out:
return retval;
@@ -2040,7 +2068,7 @@ static int insert_page_in_batch_locked(struct 
vm_area_struct *vma, pte_t *pte,
err = validate_page_before_ins

[PATCH 07/13] huge_memory: Allow mappings of PUD sized pages

2024-06-26 Thread Alistair Popple
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
and take references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pud, which
simply inserts a special devmap PUD entry into the page table without
holding a reference to the page for the mapping.
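
Usage mirrors the PTE case (hypothetical PUD-level fault handler;
my_dax_vmf_to_phys() is illustrative):

static vm_fault_t my_dax_pud_fault(struct vm_fault *vmf)
{
	pfn_t pfn = phys_to_pfn_t(my_dax_vmf_to_phys(vmf), 0);

	/*
	 * Maps the whole PUD-sized folio and takes a reference on it,
	 * unlike vmf_insert_pfn_pud() which installs a special devmap
	 * entry without refcounting the page.
	 */
	return dax_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}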

Signed-off-by: Alistair Popple 
---
 include/linux/huge_mm.h |   4 ++-
 include/linux/rmap.h|  14 +-
 mm/huge_memory.c| 108 ++---
 mm/rmap.c   |  48 ++-
 4 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2aa986a..b98a3cc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
@@ -106,6 +107,9 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
 #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
 
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
...
 		atomic_inc(&folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+   case RMAP_LEVEL_PUD:
atomic_inc(&folio->_entire_mapcount);
atomic_inc(&folio->_large_mapcount);
break;
@@ -434,6 +447,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct 
folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+   case RMAP_LEVEL_PUD:
if (PageAnonExclusive(page)) {
if (unlikely(maybe_pinned))
return -EBUSY;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db7946a..e1f053e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1283,6 +1283,70 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, 
pfn_t pfn, bool write)
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
+
+/**
+ * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
+ * @vmf: Structure describing the fault
+ * @pfn: pfn of the page to insert
+ * @write: whether it's a write fault
+ *
+ * Return: vm_fault_t value.
+ */
+vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address & PUD_MASK;
+   pud_t *pud = vmf->pud;
+   pgprot_t prot = vma->vm_page_prot;
+   struct mm_struct *mm = vma->vm_mm;
+   pud_t entry;
+   spinlock_t *ptl;
+   struct folio *folio;
+   struct page *page;
+
+   if (addr < vma->vm_start || addr >= vma->vm_end)
+   return VM_FAULT_SIGBUS;
+
+   track_pfn_insert(vma, &prot, pfn);
+
+   ptl = pud_lock(mm, pud);
+   if (!pud_none(*pud)) {
+   if (write) {
+   if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
+   WARN_ON_ONCE(!is_huge_zero_pud(*pud));
+   goto out_unlock;
+   }
+   entry = pud_mkyoung(*pud);
+   entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+   if (pudp_set_access_flags(vma, addr, pud, entry, 1))
+   update_mmu_cache_pud(vma, addr, pud);
+   }
+   goto out_unlock;
+   }
+
+   entry = pud_mkhuge(pfn_t_pud(pfn, prot));
+   if (pfn_t_devmap(pfn))
+   entry = pud_mkdevmap(entry);
+   if (write) {
+   entry = pud_mkyoung(pud_mkdirty(entry));
+   entry = maybe_pud_mkwrite(entry, vma);
+   }
+
+   page = pfn_t_to_page(pfn);
+   folio = page_folio(page);
+   folio_get(folio);
+   folio_add_file_rmap_pud(folio, page, vma);
+   add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
+
+   set_pud_at(mm, addr, pud, entry);
+   update_mmu_cache_pud(vma, addr, pud);
+
+out_unlock:
+   spin_unlock(ptl);
+
+   return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1836,7 +1900,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else if (is_huge_zero_pmd(orig_pmd)) {
-   zap_deposited_table(tlb->mm, pmd);
+

[PATCH 08/13] huge_memory: Allow mappings of PMD sized pages

2024-06-26 Thread Alistair Popple
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce dax_insert_pfn_pmd. This will map the entire PMD-sized folio
and take references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.
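
Intended usage (illustrative, mirroring the PTE and PUD cases):

	return dax_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);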

Signed-off-by: Alistair Popple 
---
 include/linux/huge_mm.h |  1 +-
 mm/huge_memory.c| 70 ++-
 2 files changed, 71 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b98a3cc..9207d8e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 
 enum transparent_hugepage_flag {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e1f053e..a9874ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1202,6 +1202,76 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, 
pfn_t pfn, bool write)
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address & PMD_MASK;
+   pmd_t *pmd = vmf->pmd;
+   struct mm_struct *mm = vma->vm_mm;
+   pmd_t entry;
+   spinlock_t *ptl;
+   pgtable_t pgtable = NULL;
+   struct folio *folio;
+   struct page *page;
+
+   if (addr < vma->vm_start || addr >= vma->vm_end)
+   return VM_FAULT_SIGBUS;
+
+   if (arch_needs_pgtable_deposit()) {
+   pgtable = pte_alloc_one(vma->vm_mm);
+   if (!pgtable)
+   return VM_FAULT_OOM;
+   }
+
+   track_pfn_insert(vma, &vma->vm_page_prot, pfn);
+
+   ptl = pmd_lock(mm, pmd);
+   if (!pmd_none(*pmd)) {
+   if (write) {
+   if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
+   WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
+   goto out_unlock;
+   }
+   entry = pmd_mkyoung(*pmd);
+   entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+   if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
+   update_mmu_cache_pmd(vma, addr, pmd);
+   }
+
+   goto out_unlock;
+   }
+
+   entry = pmd_mkhuge(pfn_t_pmd(pfn, vma->vm_page_prot));
+   if (pfn_t_devmap(pfn))
+   entry = pmd_mkdevmap(entry);
+   if (write) {
+   entry = pmd_mkyoung(pmd_mkdirty(entry));
+   entry = maybe_pmd_mkwrite(entry, vma);
+   }
+
+   if (pgtable) {
+   pgtable_trans_huge_deposit(mm, pmd, pgtable);
+   mm_inc_nr_ptes(mm);
+   pgtable = NULL;
+   }
+
+   page = pfn_t_to_page(pfn);
+   folio = page_folio(page);
+   folio_get(folio);
+   folio_add_file_rmap_pmd(folio, page, vma);
+   add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
+   set_pmd_at(mm, addr, pmd, entry);
+   update_mmu_cache_pmd(vma, addr, pmd);
+
+out_unlock:
+   spin_unlock(ptl);
+   if (pgtable)
+   pte_free(mm, pgtable);
+
+   return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
+
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 {
-- 
git-series 0.9.1


[PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages

2024-06-26 Thread Alistair Popple
Longterm pinning of FS DAX pages should already be disallowed by
various pXX_devmap checks. However a future change will cause these
checks to be invalid for FS DAX pages so make
folio_is_longterm_pinnable() return false for FS DAX pages.
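
For illustration (a sketch, not part of the patch; dax_mapped_addr is a
hypothetical userspace address backed by an FS DAX mapping):

	struct page *pages[1];
	long ret;

	ret = pin_user_pages(dax_mapped_addr, 1,
			     FOLL_WRITE | FOLL_LONGTERM, pages);
	/*
	 * folio_is_longterm_pinnable() now returns false for FS DAX
	 * folios, so this fails rather than pinning a page the
	 * filesystem may need to truncate or evict.
	 */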

Signed-off-by: Alistair Popple 
---
 include/linux/memremap.h | 11 +++
 include/linux/mm.h   |  4 
 2 files changed, 15 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6505713..19a448e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -193,6 +193,17 @@ static inline bool folio_is_device_coherent(const struct 
folio *folio)
return is_device_coherent_page(&folio->page);
 }
 
+static inline bool is_device_dax_page(const struct page *page)
+{
+   return is_zone_device_page(page) &&
+   page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
+}
+
+static inline bool folio_is_device_dax(const struct folio *folio)
+{
+   return is_device_dax_page(&folio->page);
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84368b..4d1cdea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2032,6 +2032,10 @@ static inline bool folio_is_longterm_pinnable(struct 
folio *folio)
if (folio_is_device_coherent(folio))
return false;
 
+   /* DAX must also always allow eviction. */
+   if (folio_is_device_dax(folio))
+   return false;
+
/* Otherwise, non-movable zone folios can be pinned. */
return !folio_is_zone_movable(folio);
 
-- 
git-series 0.9.1


[PATCH 10/13] fs/dax: Properly refcount fs dax pages

2024-06-26 Thread Alistair Popple
Currently fs dax pages are considered free when the refcount drops to
one and their refcounts are not increased when mapped via PTEs or
decreased when unmapped. This requires special logic in mm paths to
detect that these pages should not be properly refcounted, and to
detect when the refcount drops to one instead of zero.

On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is
unpinned.

Tracking this special behaviour requires extra PTE bits
(e.g. pte_devmap) and introduces rules that are potentially confusing
and specific to FS DAX pages. To fix this, and to possibly allow
removal of the special PTE bits in future, convert the fs dax page
refcounts to be zero based and instead take a reference on the page
each time it is mapped as is currently the case for normal pages.

This may also allow a future clean-up to remove the pgmap refcounting
that is currently done in mm/gup.c.
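
Schematically, the new lifecycle is (illustrative comment only):

	/*
	 * map:    dax_insert_pfn() and friends take folio_get();
	 *         the refcount goes 0 -> 1..N
	 * unmap:  the zap/munmap paths do folio_put();
	 *         the refcount goes N -> 0
	 * free:   refcount == 0 reaches free_zone_device_folio(),
	 *         which notifies the filesystem via ->page_free()
	 *
	 * Previously "free" meant refcount == 1, and page table
	 * mappings took no references at all.
	 */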

Signed-off-by: Alistair Popple 
---
 drivers/dax/device.c   |  12 +-
 drivers/dax/super.c|   2 +-
 drivers/nvdimm/pmem.c  |   8 +--
 fs/dax.c   | 193 +-
 fs/fuse/virtio_fs.c|   3 +-
 include/linux/dax.h|   4 +-
 include/linux/mm.h |  25 +-
 include/linux/page-flags.h |   6 +-
 mm/gup.c   |   9 +--
 mm/huge_memory.c   |   6 +-
 mm/internal.h  |   2 +-
 mm/memory-failure.c|   6 +-
 mm/memremap.c  |  24 +-
 mm/mlock.c |   2 +-
 mm/mm_init.c   |   3 +-
 mm/swap.c  |   2 +-
 16 files changed, 123 insertions(+), 184 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index eb61598..b7a31ae 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+   pfn = phys_to_pfn_t(phys, 0);
 
dax_set_mapping(vmf, pfn, fault_size);
 
-   return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   return dax_insert_pfn(vmf->vma, vmf->address, pfn, vmf->flags & 
FAULT_FLAG_WRITE);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -169,11 +169,11 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+   pfn = phys_to_pfn_t(phys, 0);
 
dax_set_mapping(vmf, pfn, fault_size);
 
-   return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+   return dax_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -214,11 +214,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+   pfn = phys_to_pfn_t(phys, 0);
 
dax_set_mapping(vmf, pfn, fault_size);
 
-   return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+   return dax_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index aca71d7..d83196e 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -257,7 +257,7 @@ EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-   if (unlikely(!dax_write_cache_enabled(dax_dev)))
+   if (unlikely(dax_dev && !dax_write_cache_enabled(dax_dev)))
return;
 
arch_wb_cache_pmem(addr, size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index cafadd0..da13dc1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -510,7 +510,7 @@ static int pmem_attach_disk(struct device *dev,
 
pmem->disk = disk;
pmem->pgmap.owner = pmem;
-   pmem->pfn_flags = PFN_DEV;
+   pmem->pfn_flags = 0;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
@@ -519,7 +519,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
range_len(&pmem->pgmap.range);
-   pmem->pfn_flags |= PFN_MAP;
+   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
bb_range = pmem->pgmap.range;
bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
@@ -529,7 +529,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
 

[PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code

2024-06-26 Thread Alistair Popple
Now that DAX is managing page reference counts the same as normal
pages there are no callers for vmf_insert_pXd functions so remove
them.

Signed-off-by: Alistair Popple 
---
 include/linux/huge_mm.h |   2 +-
 mm/huge_memory.c| 165 +-
 2 files changed, 167 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9207d8e..0fb6bff 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, pgprot_t newprot,
unsigned long cp_flags);
 
-vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
-vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5191f91..de39af4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -,97 +,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault 
*vmf)
return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
 }
 
-static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-   pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
-   pgtable_t pgtable)
-{
-   struct mm_struct *mm = vma->vm_mm;
-   pmd_t entry;
-   spinlock_t *ptl;
-
-   ptl = pmd_lock(mm, pmd);
-   if (!pmd_none(*pmd)) {
-   if (write) {
-   if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
-   WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
-   goto out_unlock;
-   }
-   entry = pmd_mkyoung(*pmd);
-   entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-   if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
-   update_mmu_cache_pmd(vma, addr, pmd);
-   }
-
-   goto out_unlock;
-   }
-
-   entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
-   if (pfn_t_devmap(pfn))
-   entry = pmd_mkdevmap(entry);
-   if (write) {
-   entry = pmd_mkyoung(pmd_mkdirty(entry));
-   entry = maybe_pmd_mkwrite(entry, vma);
-   }
-
-   if (pgtable) {
-   pgtable_trans_huge_deposit(mm, pmd, pgtable);
-   mm_inc_nr_ptes(mm);
-   pgtable = NULL;
-   }
-
-   set_pmd_at(mm, addr, pmd, entry);
-   update_mmu_cache_pmd(vma, addr, pmd);
-
-out_unlock:
-   spin_unlock(ptl);
-   if (pgtable)
-   pte_free(mm, pgtable);
-}
-
-/**
- * vmf_insert_pfn_pmd - insert a pmd size pfn
- * @vmf: Structure describing the fault
- * @pfn: pfn to insert
- * @write: whether it's a write fault
- *
- * Insert a pmd size pfn. See vmf_insert_pfn() for additional info.
- *
- * Return: vm_fault_t value.
- */
-vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
-{
-   unsigned long addr = vmf->address & PMD_MASK;
-   struct vm_area_struct *vma = vmf->vma;
-   pgprot_t pgprot = vma->vm_page_prot;
-   pgtable_t pgtable = NULL;
-
-   /*
-* If we had pmd_special, we could avoid all these restrictions,
-* but we need to be consistent with PTEs and architectures that
-* can't support a 'special' bit.
-*/
-   BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-   !pfn_t_devmap(pfn));
-   BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
-   (VM_PFNMAP|VM_MIXEDMAP));
-   BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-
-   if (addr < vma->vm_start || addr >= vma->vm_end)
-   return VM_FAULT_SIGBUS;
-
-   if (arch_needs_pgtable_deposit()) {
-   pgtable = pte_alloc_one(vma->vm_mm);
-   if (!pgtable)
-   return VM_FAULT_OOM;
-   }
-
-   track_pfn_insert(vma, &pgprot, pfn);
-
-   insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
-   return VM_FAULT_NOPAGE;
-}
-EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
-
 vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -1280,80 +1189,6 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct 
vm_area_struct *vma)
return pud;
 }
 
-static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
-   pud_t *pud, pfn_t pfn, bool write)
-{
-   struct mm_struct *mm = vma->vm_mm;
-   pgprot_t prot = vma->vm_page_prot;
-   pud_t entry;
-   spinlock_t *ptl;
-
-   ptl = pud_lock(mm, pud);
-   if (!pud_none(*pud)) {
-   if (write) {
-   if (pud_pfn(*p

[PATCH 12/13] mm: Remove pXX_devmap callers

2024-06-26 Thread Alistair Popple
The devmap PTE special bit was used to detect mappings of FS DAX
pages. This tracking was required to ensure the generic mm did not
manipulate the page reference counts as FS DAX implemented it's own
reference counting scheme.

Now that FS DAX pages have their references counted the same way as
normal pages this tracking is no longer needed and can be
removed.

Almost all existing uses of pmd_devmap() are paired with a check of
pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
dropping the check in these cases doesn't change anything.
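
A typical conversion is therefore just (schematic):

-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+	if (pmd_trans_huge(*pmd)) {
 		/* huge PMD handling, now also reached for FS DAX */
 	}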

However care needs to be taken because pmd_trans_huge() also checks that
a page is not an FS DAX page. This is dealt with either by checking
!vma_is_dax() or relying on the fact that the page pointer was obtained
from a page list. This is possible because zone device pages cannot
appear in any page list due to sharing page->lru with page->pgmap.

Signed-off-by: Alistair Popple 
---
 arch/powerpc/mm/book3s64/hash_pgtable.c  |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c   |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |   5 +-
 arch/powerpc/mm/pgtable.c|   2 +-
 fs/dax.c |   5 +-
 fs/userfaultfd.c |   2 +-
 include/linux/huge_mm.h  |  10 +-
 include/linux/pgtable.h  |   2 +-
 mm/gup.c | 164 +---
 mm/hmm.c |   7 +-
 mm/huge_memory.c |  61 +
 mm/khugepaged.c  |   2 +-
 mm/mapping_dirty_helpers.c   |   4 +-
 mm/memory.c  |  37 +
 mm/migrate_device.c  |   2 +-
 mm/mprotect.c|   2 +-
 mm/mremap.c  |   5 +-
 mm/page_vma_mapped.c |   5 +-
 mm/pgtable-generic.c |   7 +-
 mm/userfaultfd.c |   2 +-
 mm/vmscan.c  |   5 +-
 21 files changed, 53 insertions(+), 287 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c 
b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 988948d..82d3117 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -195,7 +195,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct 
*mm, unsigned long addr
unsigned long old;
 
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(!hash__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+   WARN_ON(!hash__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
@@ -227,7 +227,6 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long addres
 
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(pmd_trans_huge(*pmdp));
-   VM_BUG_ON(pmd_devmap(*pmdp));
 
pmd = *pmdp;
pmd_clear(pmdp);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 2975ea0..65dd1fe 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -50,7 +50,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, 
unsigned long address,
 {
int changed;
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+   WARN_ON(!pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(vma->vm_mm, pmdp));
 #endif
changed = !pmd_same(*(pmdp), entry);
@@ -70,7 +70,6 @@ int pudp_set_access_flags(struct vm_area_struct *vma, 
unsigned long address,
 {
int changed;
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(!pud_devmap(*pudp));
assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
 #endif
changed = !pud_same(*(pudp), entry);
@@ -182,7 +181,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct 
*vma,
pmd_t pmd;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-	VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
-		   !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
+	VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp)) ||
+		   !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
/*
 * if it not a fullmm flush, then we can possibly end up converting
@@ -200,8 +199,7 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct 
*vma,
pud_t pud;
 
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-   VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
- !pud_present(*pudp));
+   VM_BUG_ON(!pud_present(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
/*
 * if it not a fullmm flush, then we can possibly end up converting
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 15e88f1..1c195bc 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1348,7 +1348,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struc

[PATCH 13/13] mm: Remove devmap related functions and page table bits

2024-06-26 Thread Alistair Popple
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD
page table bits. So drop all references to these, freeing up a
software defined page table bit on architectures supporting it.
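
Callers then stop passing the devmap pfn_t flags, e.g. (schematic, done
properly in the earlier patches of this series):

-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);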

Signed-off-by: Alistair Popple 
---
 Documentation/mm/arch_pgtable_helpers.rst |  6 +--
 arch/arm64/Kconfig|  1 +-
 arch/arm64/include/asm/pgtable-prot.h |  1 +-
 arch/arm64/include/asm/pgtable.h  | 24 +
 arch/powerpc/Kconfig  |  1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  6 +--
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  7 +--
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 52 +--
 arch/powerpc/include/asm/book3s/64/radix.h| 14 +-
 arch/x86/Kconfig  |  1 +-
 arch/x86/include/asm/pgtable.h| 50 +-
 arch/x86/include/asm/pgtable_types.h  |  5 +--
 include/linux/mm.h|  7 +--
 include/linux/pfn_t.h | 20 +---
 include/linux/pgtable.h   | 19 +--
 mm/Kconfig|  4 +-
 mm/debug_vm_pgtable.c | 59 +
 mm/hmm.c  |  3 +-
 18 files changed, 11 insertions(+), 269 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst 
b/Documentation/mm/arch_pgtable_helpers.rst
index ad50ca6..9230bc7 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -30,8 +30,6 @@ PTE Page Table Helpers
 +---+--+
 | pte_protnone  | Tests a PROT_NONE PTE|
 +---+--+
-| pte_devmap| Tests a ZONE_DEVICE mapped PTE   |
-+---+--+
 | pte_soft_dirty| Tests a soft dirty PTE   |
 +---+--+
 | pte_swp_soft_dirty| Tests a soft dirty swapped PTE   |
@@ -106,8 +104,6 @@ PMD Page Table Helpers
 +---+--+
 | pmd_protnone  | Tests a PROT_NONE PMD|
 +---+--+
-| pmd_devmap| Tests a ZONE_DEVICE mapped PMD   |
-+---+--+
 | pmd_soft_dirty| Tests a soft dirty PMD   |
 +---+--+
 | pmd_swp_soft_dirty| Tests a soft dirty swapped PMD   |
@@ -181,8 +177,6 @@ PUD Page Table Helpers
 +---+--+
 | pud_write | Tests a writable PUD |
 +---+--+
-| pud_devmap| Tests a ZONE_DEVICE mapped PUD   |
-+---+--+
 | pud_mkyoung   | Creates a young PUD  |
 +---+--+
 | pud_mkold | Creates an old PUD   |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5d91259..beb8c3c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -35,7 +35,6 @@ config ARM64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
-   select ARCH_HAS_PTE_DEVMAP
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_SETUP_DMA_OPS
diff --git a/arch/arm64/include/asm/pgtable-prot.h 
b/arch/arm64/include/asm/pgtable-prot.h
index b11cfb9..043b102 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -17,7 +17,6 @@
 #define PTE_SWP_EXCLUSIVE  (_AT(pteval_t, 1) << 2)  /* only for swp ptes */
 #define PTE_DIRTY  (_AT(pteval_t, 1) << 55)
 #define PTE_SPECIAL(_AT(pteval_t, 1) << 56)
-#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
 
 /*
  * PTE_PRESENT_INVALID=1 & PTE_VALID=0 indicates that the pte's fields should 
be
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc1..9193537 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -107,7 +107,6 @

Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-26 Thread Oliver Sang
hi, Christoph Hellwig,

On Tue, Jun 25, 2024 at 08:39:50PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 26, 2024 at 10:10:49AM +0800, Oliver Sang wrote:
> > I'm not sure I understand this test request. as in title, we see a good
> > improvement of aim7 for 1122c0c1cc, and we didn't observe other issues for
> > this commit.
> 
> The improvement suggests we are not sending cache flushes when we should
> send them, or at least just handle them in md.

thanks for the explanation!

> 
> > do you mean this improvement is not expected or exposes some problems 
> > instead?
> > then by below patch, should the performance back to the level of parent of
> > 1122c0c1cc?
> > 
> > sure! it's our great pleasure to test your patches. I noticed there are
> > [1]
> > https://lore.kernel.org/all/20240625110603.50885-2-...@lst.de/
> > which includes "[PATCH 1/7] md: set md-specific flags for all queue limits"
> > [2]
> > https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/
> > which includes "[PATCH 1/8] md: set md-specific flags for all queue limits"
> > 
> > which one you suggest us to test?
> > do we only need to apply the first patch "md: set md-specific flags for all 
> > queue limits"
> > upon 1122c0c1cc?
> > then is the expectation the performance back to parent of 1122c0c1cc?
> 
> Either just the patch in reply or the entire [2] series would be fine.

I failed to apply the patch from your previous reply to 1122c0c1cc or to the
current tip of axboe-block/for-next:
c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next

but it applies cleanly on top of next:
* 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625

I've already started the test with the patch applied this way.
Is the expectation that the patch should not introduce a performance change
compared to 0fc4bfab2cd45?

If applying it this way is not OK, please just give me guidance. Thanks!


> 
> Thanks!
> 


[PATCH] macintosh: Drop explicit initialization of struct i2c_device_id::driver_data to 0

2024-06-26 Thread Uwe Kleine-König
These drivers don't use the driver_data member of struct i2c_device_id,
so don't explicitly initialize this member.

This prepares for putting driver_data in an anonymous union, which requires
either no initialization or named designators. But it's also a nice
cleanup on its own.

Signed-off-by: Uwe Kleine-König 
---
 drivers/macintosh/ams/ams-i2c.c | 2 +-
 drivers/macintosh/windfarm_ad7417_sensor.c  | 2 +-
 drivers/macintosh/windfarm_fcu_controls.c   | 2 +-
 drivers/macintosh/windfarm_lm87_sensor.c| 2 +-
 drivers/macintosh/windfarm_max6690_sensor.c | 2 +-
 drivers/macintosh/windfarm_smu_sat.c| 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/macintosh/ams/ams-i2c.c b/drivers/macintosh/ams/ams-i2c.c
index f9bfe84b1c73..d5cdbba6e7c7 100644
--- a/drivers/macintosh/ams/ams-i2c.c
+++ b/drivers/macintosh/ams/ams-i2c.c
@@ -60,7 +60,7 @@ static int ams_i2c_probe(struct i2c_client *client);
 static void ams_i2c_remove(struct i2c_client *client);
 
 static const struct i2c_device_id ams_id[] = {
-   { "MAC,accelerometer_1", 0 },
+   { "MAC,accelerometer_1" },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, ams_id);
diff --git a/drivers/macintosh/windfarm_ad7417_sensor.c 
b/drivers/macintosh/windfarm_ad7417_sensor.c
index 49ce37fde930..3ff4577ba847 100644
--- a/drivers/macintosh/windfarm_ad7417_sensor.c
+++ b/drivers/macintosh/windfarm_ad7417_sensor.c
@@ -304,7 +304,7 @@ static void wf_ad7417_remove(struct i2c_client *client)
 }
 
 static const struct i2c_device_id wf_ad7417_id[] = {
-   { "MAC,ad7417", 0 },
+   { "MAC,ad7417" },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, wf_ad7417_id);
diff --git a/drivers/macintosh/windfarm_fcu_controls.c 
b/drivers/macintosh/windfarm_fcu_controls.c
index 603ef6c600ba..82365f19adb4 100644
--- a/drivers/macintosh/windfarm_fcu_controls.c
+++ b/drivers/macintosh/windfarm_fcu_controls.c
@@ -573,7 +573,7 @@ static void wf_fcu_remove(struct i2c_client *client)
 }
 
 static const struct i2c_device_id wf_fcu_id[] = {
-   { "MAC,fcu", 0 },
+   { "MAC,fcu" },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, wf_fcu_id);
diff --git a/drivers/macintosh/windfarm_lm87_sensor.c 
b/drivers/macintosh/windfarm_lm87_sensor.c
index 975361c23a93..16635e2b180b 100644
--- a/drivers/macintosh/windfarm_lm87_sensor.c
+++ b/drivers/macintosh/windfarm_lm87_sensor.c
@@ -156,7 +156,7 @@ static void wf_lm87_remove(struct i2c_client *client)
 }
 
 static const struct i2c_device_id wf_lm87_id[] = {
-   { "MAC,lm87cimt", 0 },
+   { "MAC,lm87cimt" },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, wf_lm87_id);
diff --git a/drivers/macintosh/windfarm_max6690_sensor.c 
b/drivers/macintosh/windfarm_max6690_sensor.c
index 02856d1f0313..d734b31b8236 100644
--- a/drivers/macintosh/windfarm_max6690_sensor.c
+++ b/drivers/macintosh/windfarm_max6690_sensor.c
@@ -112,7 +112,7 @@ static void wf_max6690_remove(struct i2c_client *client)
 }
 
 static const struct i2c_device_id wf_max6690_id[] = {
-   { "MAC,max6690", 0 },
+   { "MAC,max6690" },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, wf_max6690_id);
diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index 50baa062c9df..ff8805ecf2e5 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -333,7 +333,7 @@ static void wf_sat_remove(struct i2c_client *client)
 }
 
 static const struct i2c_device_id wf_sat_id[] = {
-   { "MAC,smu-sat", 0 },
+   { "MAC,smu-sat" },
{ }
 };
 MODULE_DEVICE_TABLE(i2c, wf_sat_id);

base-commit: f76698bd9a8ca01d3581236082d786e9a6b72bb7
-- 
2.43.0



Re: [axboe-block:for-next] [block] bd4a633b6f: fsmark.files_per_sec -64.5% regression

2024-06-26 Thread Anthony D'Atri
S3610 I think.  Be sure to use sst or the chassis vendor’s tool to update the 
firmware.

> On Jun 24, 2024, at 9:45 AM, Niklas Cassel  wrote:
> 
> SSDSC2BG012T4



[PATCH] printk: Add a short description string to kmsg_dump()

2024-06-26 Thread Jocelyn Falempe
kmsg_dump() doesn't forward the panic reason string to the kmsg_dumper
callback.
This patch adds a new parameter "const char *desc" to the kmsg_dumper
dump() callback, and updates all drivers that are using it.

To avoid updating all kmsg_dump() calls, it adds a kmsg_dump_desc()
function and a macro for backward compatibility.

I've written this for drm_panic, but it can be useful for other
kmsg_dumpers.
It allows seeing the panic reason, like "sysrq triggered crash"
or "VFS: Unable to mount root fs on " on the drm panic screen.

Signed-off-by: Jocelyn Falempe 
---
 arch/powerpc/kernel/nvram_64.c |  3 ++-
 arch/powerpc/platforms/powernv/opal-kmsg.c |  3 ++-
 drivers/gpu/drm/drm_panic.c|  3 ++-
 drivers/hv/hv_common.c |  3 ++-
 drivers/mtd/mtdoops.c  |  3 ++-
 fs/pstore/platform.c   |  3 ++-
 include/linux/kmsg_dump.h  | 13 ++---
 kernel/panic.c |  2 +-
 kernel/printk/printk.c |  8 +---
 9 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
index e385d3164648c..6b3a80d8cfa64 100644
--- a/arch/powerpc/kernel/nvram_64.c
+++ b/arch/powerpc/kernel/nvram_64.c
@@ -643,7 +643,8 @@ void __init nvram_init_oops_partition(int 
rtas_partition_exists)
  * partition.  If that's too much, go back and capture uncompressed text.
  */
 static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason)
+ enum kmsg_dump_reason reason,
+ const char *desc)
 {
struct oops_log_info *oops_hdr = (struct oops_log_info *)oops_buf;
static unsigned int oops_count = 0;
diff --git a/arch/powerpc/platforms/powernv/opal-kmsg.c 
b/arch/powerpc/platforms/powernv/opal-kmsg.c
index 6c3bc4b4da983..49b60de6feb04 100644
--- a/arch/powerpc/platforms/powernv/opal-kmsg.c
+++ b/arch/powerpc/platforms/powernv/opal-kmsg.c
@@ -20,7 +20,8 @@
  * message, it just ensures that OPAL completely flushes the console buffer.
  */
 static void kmsg_dump_opal_console_flush(struct kmsg_dumper *dumper,
-enum kmsg_dump_reason reason)
+enum kmsg_dump_reason reason,
+const char *desc)
 {
/*
 * Outside of a panic context the pollers will continue to run,
diff --git a/drivers/gpu/drm/drm_panic.c b/drivers/gpu/drm/drm_panic.c
index 293d4dcbc80da..88e9359fe6d78 100644
--- a/drivers/gpu/drm/drm_panic.c
+++ b/drivers/gpu/drm/drm_panic.c
@@ -604,7 +604,8 @@ static struct drm_plane *to_drm_plane(struct kmsg_dumper 
*kd)
return container_of(kd, struct drm_plane, kmsg_panic);
 }
 
-static void drm_panic(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason)
+static void drm_panic(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
+ const char *desc)
 {
struct drm_plane *plane = to_drm_plane(dumper);
 
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 9c452bfbd5719..b0786ee9c94e3 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -207,7 +207,8 @@ static int hv_die_panic_notify_crash(struct notifier_block 
*self,
  * buffer and call into Hyper-V to transfer the data.
  */
 static void hv_kmsg_dump(struct kmsg_dumper *dumper,
-enum kmsg_dump_reason reason)
+enum kmsg_dump_reason reason,
+const char *desc)
 {
struct kmsg_dump_iter iter;
size_t bytes_written;
diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 2f11585b5613e..c618999a96832 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -298,7 +298,8 @@ static void find_next_position(struct mtdoops_context *cxt)
 }
 
 static void mtdoops_do_dump(struct kmsg_dumper *dumper,
-   enum kmsg_dump_reason reason)
+   enum kmsg_dump_reason reason,
+   const char *desc)
 {
struct mtdoops_context *cxt = container_of(dumper,
struct mtdoops_context, dump);
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 3497ede88aa01..a6ed5d56021ef 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -275,7 +275,8 @@ void pstore_record_init(struct pstore_record *record,
  * end of the buffer.
  */
 static void pstore_dump(struct kmsg_dumper *dumper,
-   enum kmsg_dump_reason reason)
+   enum kmsg_dump_reason reason,
+   const char *desc)
 {
struct kmsg_dump_iter iter;
unsigned long   total = 0;
diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
index 906521c2329ca..a8f8a6204542d 100644
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -49,13 +49,15 @@ struct kmsg_dump_iter 

Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:35:38AM +0800, Oliver Sang wrote:
> 
> I failed to apply the patch from your previous reply to 1122c0c1cc or to the
> current tip of axboe-block/for-next:
> c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into 
> for-next

That already includes it.

> 
> but it's ok to apply upon next:
> * 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 
> 20240625
> 
> I've already started the test based on this applyment.
> is the expectation that patch should not introduce performance change 
> comparing
> to 0fc4bfab2cd45?
> 
> or if this applyment is not ok, please just give me guidance. Thanks!

The expectation is that the latest block branch (and thus linux-next)
doesn't see this performance change.



Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:54:21AM +1000, Alistair Popple wrote:
> +extern void prep_compound_page(struct page *page, unsigned int order);

No need for the extern.

>  static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t 
> *pte,
> - unsigned long addr, struct page *page, pgprot_t prot)
> + unsigned long addr, struct page *page, pgprot_t prot, 
> bool mkwrite)

Overly long line.

> + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, 
> mkwrite);

.. same here.

> +vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
> + unsigned long addr, pfn_t pfn_t, bool write)

This could probably use a kerneldoc comment.



Re: [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:54:17AM +1000, Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
> 
> Signed-off-by: Alistair Popple 
> ---
>  drivers/pci/p2pdma.c | 2 ++
>  mm/memremap.c| 8 
>  mm/mm_init.c | 4 +++-
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..1e9ea32 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct 
> kobject *kobj,
>   goto out;
>   }
>  
> + set_page_count(virt_to_page(kaddr), 1);

Can we have a comment here?  Without that it feels a bit too much like
black magic when reading the code.

> + if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
> + folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
> + put_dev_pagemap(folio->page.pgmap);
> + else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
>   /*
>* Reset the refcount to 1 to prepare for handing out the page
>* again.
>*/
>   folio_set_count(folio, 1);

Where the else if evaluates to MEMORY_DEVICE_FS_DAX ||
MEMORY_DEVICE_GENERIC.  Maybe make this a switch statement handling
all cases of the enum to make it clear and have the compiler generate
a warning when a new type is added without being handled here?
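
Something like this (just a sketch):

	switch (folio->page.pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
		put_dev_pagemap(folio->page.pgmap);
		break;
	case MEMORY_DEVICE_FS_DAX:
	case MEMORY_DEVICE_GENERIC:
		/*
		 * Reset the refcount to 1 to prepare for handing out
		 * the page again.
		 */
		folio_set_count(folio, 1);
		break;
	case MEMORY_DEVICE_PCI_P2PDMA:
		/* nothing to do, the allocator resets the count */
		break;
	}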

> @@ -1014,7 +1015,8 @@ static void __ref __init_zone_device_page(struct page 
> *page, unsigned long pfn,
>* which will set the page count to 1 when allocating the page.
>*/
>   if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> + pgmap->type == MEMORY_DEVICE_COHERENT ||
> + pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
>   set_page_count(page, 0);

Similarly here a switch with explanation of what will be handled and
what not would be nice.



Re: [PATCH 03/13] fs/dax: Refactor wait for dax idle page

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:54:18AM +1000, Alistair Popple wrote:
> A FS DAX page is considered idle when its refcount drops to one. This
> is currently open-coded in all file systems supporting FS DAX. Move
> the idle detection to a common function to make future changes easier.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Jan Kara 

I'm pretty sure I already reviewed this ages ago, but:

Reviewed-by: Christoph Hellwig 



Re: [PATCH 04/13] fs/dax: Add dax_page_free callback

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:54:19AM +1000, Alistair Popple wrote:
> When a fs dax page is freed it has to notify filesystems that the page
> has been unpinned/unmapped and is free. Currently this involves
> special code in the page free paths to detect a transition of refcount
> from 2 to 1 and to call some fs dax specific code.
> 
> A future change will require this to happen when the page refcount
> drops to zero. In this case we can use the existing
> pgmap->ops->page_free() callback so wire that up for all devices that
> support FS DAX (nvdimm and virtio).

Given that ->page_free is only called from free_zone_device_folio
and right next to a switch on the type, can't we just do the
wake_up_var there without the somewhat confusing indirect call that
just ends back in common code without any driver logic?
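
i.e. something like (a sketch, assuming the filesystems keep waiting on
the refcount as they do today):

	case MEMORY_DEVICE_FS_DAX:
		/* wake waiters in the common dax idle-wait helper */
		wake_up_var(&folio->_refcount);
		break;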



Re: [PATCH 05/13] mm: Allow compound zone device pages

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:54:20AM +1000, Alistair Popple wrote:
>  static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
>  {
> - return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
> + return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk, 
> pagemap);

Overly long line here (and quite a few more).



Re: [PATCH 10/13] fs/dax: Properly refcount fs dax pages

2024-06-26 Thread Christoph Hellwig
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index eb61598..b7a31ae 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax 
> *dev_dax,
>   return VM_FAULT_SIGBUS;
>   }
>  
> - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> + pfn = phys_to_pfn_t(phys, 0);
>  
>   dax_set_mapping(vmf, pfn, fault_size);
>  
> - return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> + return dax_insert_pfn(vmf->vma, vmf->address, pfn, vmf->flags & 
> FAULT_FLAG_WRITE);

Plenty overly long lines here and later.

Q: should dax_insert_pfn take a vm_fault structure instead of the vma?
Or are there potential use cases that aren't from the fault path?
Similarly, instead of the bool write, passing the fault flags might
actually make things more readable than the bool.

Also at least currently it seems like there are no modular users despite
the export, or am I missing something?

> + blk_queue_flag_set(QUEUE_FLAG_DAX, q);

Just as a heads up, setting of these flags has changed a lot in
linux-next.

>  {
> + /*
> +  * Make sure we flush any cached data to the page now that it's free.
> +  */
> + if (PageDirty(page))
> + dax_flush(NULL, page_address(page), page_size(page));
> +

Adding the magic dax_dev == NULL case to dax_flush and going through it
vs just calling arch_wb_cache_pmem directly here seems odd.

But I also don't quite understand how it is related to the rest
of the patch anyway.

> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -373,6 +373,8 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
>   unsigned long start = addr;
>  
>   ptl = pmd_trans_huge_lock(pmd, vma);
> + if (vma_is_dax(vma))
> + ptl = NULL;
>   if (ptl) {

This feels sufficiently magic to warrant a comment.

>   if (!pmd_present(*pmd))
>   goto out;
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index b7e1599..f11ee0d 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1016,7 +1016,8 @@ static void __ref __init_zone_device_page(struct page 
> *page, unsigned long pfn,
>*/
>   if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
>   pgmap->type == MEMORY_DEVICE_COHERENT ||
> - pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
> + pgmap->type == MEMORY_DEVICE_PCI_P2PDMA ||
> + pgmap->type == MEMORY_DEVICE_FS_DAX)
>   set_page_count(page, 0);
>  }

So we'll skip this for MEMORY_DEVICE_GENERIC only.  Does anyone remember
if that's actively harmful or just not needed?  If the latter it might
be simpler to just set the page count unconditionally here.



Re: [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page

2024-06-26 Thread Dan Williams
Alistair Popple wrote:
> PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
> check in __gup_device_huge() is redundant. Remove it
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Jason Gunthorpe 
> Acked-by: David Hildenbrand 

Acked-by: Dan Williams