Re: [PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'
On Thu, Jan 18, 2024 at 03:26:53PM +0530, Amit Machhiwal wrote: > Currently, rebooting a pseries nested qemu-kvm guest (L2) results in > below error as L1 qemu sends PVR value 'arch_compat' == 0 via > ppc_set_compat ioctl. This triggers a condition failure in > kvmppc_set_arch_compat() resulting in an EINVAL. > > qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid > > This patch updates kvmppc_set_arch_compat() to use the host PVR value if > 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any > specific PVR compat mode. > > Signed-off-by: Amit Machhiwal > --- > arch/powerpc/kvm/book3s_hv.c | 2 +- > arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++-- > 2 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 1ed6ec140701..9573d7f4764a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, > u32 arch_compat) > if (guest_pcr_bit > host_pcr_bit) > return -EINVAL; > > - if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) { > + if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) { > if (!(cap & nested_capabilities)) > return -EINVAL; > } > diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c > b/arch/powerpc/kvm/book3s_hv_nestedv2.c > index fd3c4f2d9480..069a1fcfd782 100644 > --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c > +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c > @@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct > kvmppc_gs_buff *gsb, > vector128 v; > int rc, i; > u16 iden; > + u32 arch_compat = 0; > > vcpu = gsm->data; > > @@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct > kvmppc_gs_buff *gsb, > break; > } > case KVMPPC_GSID_LOGICAL_PVR: > - rc = kvmppc_gse_put_u32(gsb, iden, > - vcpu->arch.vcore->arch_compat); > + if (!vcpu->arch.vcore->arch_compat) { > + if (cpu_has_feature(CPU_FTR_ARCH_31)) > + arch_compat = PVR_ARCH_31; 
> + else if (cpu_has_feature(CPU_FTR_ARCH_300)) > + arch_compat = PVR_ARCH_300; > + } else { > + arch_compat = vcpu->arch.vcore->arch_compat; > + } > + rc = kvmppc_gse_put_u32(gsb, iden, arch_compat); > break; > } > > -- > 2.43.0 > I tested this patch on a pseries Power10 machine with KVM support: without this patch, with the latest mainline as host, the KVM guest on pseries/PowerVM fails to reboot; with this patch, reboot works fine. Tested-by: Gautam Menghani
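For readers following the KVMPPC_GSID_LOGICAL_PVR hunk, the fallback can be sketched in plain C roughly as below. This is a user-space illustration only, not the kernel code: pick_logical_pvr() and its two boolean parameters are hypothetical stand-ins for the cpu_has_feature() checks, and the PVR_ARCH_* values are assumed to mirror the kernel's constants.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; assumed to mirror the kernel's logical PVR constants. */
#define PVR_ARCH_300 0x0f000005u
#define PVR_ARCH_31  0x0f000006u

/*
 * Hypothetical helper: choose the logical PVR to report for a vCPU.
 * A requested value of 0 means "no specific compat mode", so fall back
 * to the newest architecture level the host supports.
 */
static uint32_t pick_logical_pvr(uint32_t requested,
				 bool host_arch_31, bool host_arch_300)
{
	if (requested)
		return requested;
	if (host_arch_31)
		return PVR_ARCH_31;
	if (host_arch_300)
		return PVR_ARCH_300;
	return 0; /* neither level present: stays at the initial value */
}
```

A non-zero request is passed through untouched, so the sketch only changes behaviour for the 'arch_compat' == 0 case described in the commit message.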
Re: [PING PATCH] powerpc/kasan: Fix addr error caused by page alignment
On 23/01/2024 at 02:45, Jiangfeng Xiao wrote: > In kasan_init_region(), when k_start is not page aligned, > at the beginning of the for loop k_cur = k_start & PAGE_MASK > is less than k_start, so va = block + k_cur - k_start > is less than block. The address va is invalid because the > memory from va to block was not allocated by > memblock_alloc() and will not be reserved > by memblock_reserve() later, so it will be used elsewhere. > > As a result, memory is overwritten. > > For example: > int __init __weak kasan_init_region(void *start, size_t size) > { > [...] > /* if say block(dcd97000) k_start(feef7400) k_end(feeff3fe) */ > block = memblock_alloc(k_end - k_start, PAGE_SIZE); > [...] > for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) > { > /* at the begin of for loop > * block(dcd97000) va(dcd96c00) k_cur(feef7000) > k_start(feef7400) > * va(dcd96c00) is less than block(dcd97000), va is invalid > */ > void *va = block + k_cur - k_start; > [...] > } > [...] > } > > Therefore, page alignment is performed on k_start before > memblock_alloc to ensure the validity of the VA address. > > Fixes: 663c0c9496a6 ("powerpc/kasan: Fix shadow area set up for modules.") > > Signed-off-by: Jiangfeng Xiao Be patient, your patch is not lost. 
Now we have it twice, see: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=76392 > --- > arch/powerpc/mm/kasan/init_32.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/powerpc/mm/kasan/init_32.c b/arch/powerpc/mm/kasan/init_32.c > index a70828a..aa9aa11 100644 > --- a/arch/powerpc/mm/kasan/init_32.c > +++ b/arch/powerpc/mm/kasan/init_32.c > @@ -64,6 +64,7 @@ int __init __weak kasan_init_region(void *start, size_t > size) > if (ret) > return ret; > > + k_start = k_start & PAGE_MASK; > block = memblock_alloc(k_end - k_start, PAGE_SIZE); > if (!block) > return -ENOMEM; > -- > 1.8.5.6 >
Re: [PATCH 1/1] arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
On Sun, Jan 21, 2024 at 11:38 PM Suren Baghdasaryan wrote: > > On Sat, Jan 20, 2024 at 1:15 PM Russell King (Oracle) > wrote: > > > > On Sat, Jan 20, 2024 at 09:09:47PM +, > > patchwork-bot+linux-ri...@kernel.org wrote: > > > Hello: > > > > > > This patch was applied to riscv/linux.git (fixes) > > > by Andrew Morton : > > > > > > On Tue, 26 Dec 2023 13:46:10 -0800 you wrote: > > > > A test [1] in Android test suite started failing after [2] was merged. > > > > It turns out that after handling a major fault under per-VMA lock, the > > > > process major fault counter does not register that fault as major. > > > > Before [2] read faults would be done under mmap_lock, in which case > > > > FAULT_FLAG_TRIED flag is set before retrying. That in turn causes > > > > mm_account_fault() to account the fault as major once retry completes. > > > > With per-VMA locks we often retry because a fault can't be handled > > > > without locking the whole mm using mmap_lock. Therefore such retries > > > > do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] > > > > because we can now handle read major faults under per-VMA lock and > > > > upon retry the fact there was a major fault gets lost. Fix this by > > > > setting FAULT_FLAG_TRIED after retrying under per-VMA lock if > > > > VM_FAULT_MAJOR was returned. Ideally we would use an additional > > > > VM_FAULT bit to indicate the reason for the retry (could not handle > > > > under per-VMA lock vs other reason) but this simpler solution seems > > > > to work, so keeping it simple. > > > > > > > > [...] > > > > > > Here is the summary with links: > > > - [1/1] arch/mm/fault: fix major fault accounting when retrying under > > > per-VMA lock > > > https://git.kernel.org/riscv/c/46e714c729c8 > > > > > > You are awesome, thank you! > > > > Now that 32-bit ARM has support for the per-VMA lock, does that also > > need to be patched? > > Yes, I think so. I missed the ARM32 change that added support for > per-VMA locks. 
Will post a similar patch for it tomorrow. Fix for ARM posted at https://lore.kernel.org/all/20240123064305.2829244-1-sur...@google.com/ > Thanks, > Suren. > > > > > -- > > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
[PATCH 1/1] arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
The change [1] missed ARM architecture when fixing major fault accounting for page fault retry under per-VMA lock. Add missing code to fix ARM architecture fault accounting. [1] 46e714c729c8 ("arch/mm/fault: fix major fault accounting when retrying under per-VMA lock") Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Reported-by: Russell King (Oracle) Signed-off-by: Suren Baghdasaryan --- arch/arm/mm/fault.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index e96fb40b9cc3..07565b593ed6 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -298,6 +298,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) goto done; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |= FAULT_FLAG_TRIED; /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { -- 2.43.0.429.g432eaa2c6b-goog
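The effect of the two added lines can be modelled in a few lines of stand-alone C. This is only a sketch under assumptions: the bit values are illustrative, flags_for_retry() and counts_as_major() are hypothetical names, and the accounting rule is a simplified reading of the description in [1] (a fault is counted as major if the final result has VM_FAULT_MAJOR set or the flags carry FAULT_FLAG_TRIED).

```c
#include <stdbool.h>

/* Illustrative bit values; the kernel defines its own. */
#define VM_FAULT_MAJOR   0x0004u
#define VM_FAULT_RETRY   0x0400u
#define FAULT_FLAG_TRIED 0x0020u

/* Step added by the patch: remember a major fault across the retry. */
static unsigned int flags_for_retry(unsigned int flags,
				    unsigned int vma_lock_fault)
{
	if (vma_lock_fault & VM_FAULT_MAJOR)
		flags |= FAULT_FLAG_TRIED;
	return flags;
}

/* Simplified model of the final accounting decision. */
static bool counts_as_major(unsigned int final_fault, unsigned int flags)
{
	return (final_fault & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
}
```

In this model, a first attempt that returns VM_FAULT_MAJOR | VM_FAULT_RETRY followed by a minor retry is only counted as major when the flag has been carried over.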
[PATCH v2] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id
During the kernel booting, the generic cpu_to_node() is called too early in arm64, powerpc and riscv when CONFIG_NUMA is enabled. For arm64/powerpc/riscv, there are at least four places in the common code where the generic cpu_to_node() is called before it is initialized: 1.) early_trace_init() in kernel/trace/trace.c 2.) sched_init() in kernel/sched/core.c 3.) init_sched_fair_class()in kernel/sched/fair.c 4.) workqueue_init_early() in kernel/workqueue.c In order to fix the bug, the patch changes generic cpu_to_node to function pointer, and export it for kernel modules. Introduce smp_prepare_boot_cpu_start() to wrap the original smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node. Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(), and set the cpu_to_node to formal _cpu_to_node(). Signed-off-by: Huang Shijie --- v1 --> v2: In order to fix the x86 compiling error, move the cpu_to_node() from driver/base/arch_numa.c to driver/base/node.c. v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2024-January/896160.html An old different title patch: http://lists.infradead.org/pipermail/linux-arm-kernel/2024-January/895963.html --- drivers/base/node.c | 11 +++ include/linux/topology.h | 6 ++ init/main.c | 29 +++-- 3 files changed, 40 insertions(+), 6 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 1c05640461dd..477d58c12886 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -976,3 +976,14 @@ void __init node_dev_init(void) panic("%s() failed to add node: %d\n", __func__, ret); } } + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node +int _cpu_to_node(int cpu) +{ + return per_cpu(numa_node, cpu); +} +int (*cpu_to_node)(int cpu); +EXPORT_SYMBOL(cpu_to_node); +#endif +#endif diff --git a/include/linux/topology.h b/include/linux/topology.h index 52f5850730b3..e7ce2bae11dd 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -91,10 +91,8 @@ static inline int 
numa_node_id(void) #endif #ifndef cpu_to_node -static inline int cpu_to_node(int cpu) -{ - return per_cpu(numa_node, cpu); -} +extern int (*cpu_to_node)(int cpu); +extern int _cpu_to_node(int cpu); #endif #ifndef set_numa_node diff --git a/init/main.c b/init/main.c index e24b0780fdff..b142e9c51161 100644 --- a/init/main.c +++ b/init/main.c @@ -870,6 +870,18 @@ static void __init print_unknown_bootoptions(void) memblock_free(unknown_options, len); } +static void __init smp_prepare_boot_cpu_start(void) +{ + smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node + /* The early_cpu_to_node should be ready now. */ + cpu_to_node = early_cpu_to_node; +#endif +#endif +} + asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector void start_kernel(void) { @@ -899,7 +911,7 @@ void start_kernel(void) setup_command_line(command_line); setup_nr_cpu_ids(); setup_per_cpu_areas(); - smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ + smp_prepare_boot_cpu_start(); boot_cpu_hotplug_init(); pr_notice("Kernel command line: %s\n", saved_command_line); @@ -1519,6 +1531,19 @@ void __init console_on_rootfs(void) fput(file); } +static void __init smp_prepare_cpus_done(unsigned int setup_max_cpus) +{ + /* Different ARCHs may override smp_prepare_cpus() */ + smp_prepare_cpus(setup_max_cpus); + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node + /* Change to the formal function. */ + cpu_to_node = _cpu_to_node; +#endif +#endif +} + static noinline void __init kernel_init_freeable(void) { /* Now the scheduler is fully set up and can do blocking allocations */ @@ -1531,7 +1556,7 @@ static noinline void __init kernel_init_freeable(void) cad_pid = get_pid(task_pid(current)); - smp_prepare_cpus(setup_max_cpus); + smp_prepare_cpus_done(setup_max_cpus); workqueue_init(); -- 2.40.1
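The function-pointer switch described in the commit message can be illustrated outside the kernel roughly as follows. This is a simplified sketch, not the patch itself: early_map, percpu_numa_node and numa_percpu_ready() are hypothetical names, and the pointer here starts on the early lookup rather than being assigned during boot.

```c
#define NR_CPUS 4

/* Early CPU-to-node map, as an architecture might build it from firmware tables. */
static const int early_map[NR_CPUS] = { 0, 0, 1, 1 };

/* Stand-in for the per-CPU numa_node variable: all zero until set up. */
static int percpu_numa_node[NR_CPUS];

static int early_cpu_to_node(int cpu)
{
	return early_map[cpu];
}

static int percpu_cpu_to_node(int cpu)
{
	return percpu_numa_node[cpu];
}

/* The lookup starts on the early map ... */
static int (*cpu_to_node)(int cpu) = early_cpu_to_node;

/* ... and switches to the per-CPU copy once that copy is populated. */
static void numa_percpu_ready(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		percpu_numa_node[cpu] = early_map[cpu];
	cpu_to_node = percpu_cpu_to_node;
}
```

Before numa_percpu_ready() runs, reading the per-CPU copy directly would return node 0 for every CPU, which is the symptom named in the subject line; callers going through the pointer see the correct node both before and after the switch.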
[PING PATCH] powerpc/kasan: Fix addr error caused by page alignment
In kasan_init_region(), when k_start is not page aligned, at the beginning of the for loop k_cur = k_start & PAGE_MASK is less than k_start, so va = block + k_cur - k_start is less than block. The address va is invalid because the memory from va to block was not allocated by memblock_alloc() and will not be reserved by memblock_reserve() later, so it will be used elsewhere. As a result, memory is overwritten. For example: int __init __weak kasan_init_region(void *start, size_t size) { [...] /* if say block(dcd97000) k_start(feef7400) k_end(feeff3fe) */ block = memblock_alloc(k_end - k_start, PAGE_SIZE); [...] for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { /* at the begin of for loop * block(dcd97000) va(dcd96c00) k_cur(feef7000) k_start(feef7400) * va(dcd96c00) is less than block(dcd97000), va is invalid */ void *va = block + k_cur - k_start; [...] } [...] } Therefore, page alignment is performed on k_start before memblock_alloc to ensure the validity of the VA address. Fixes: 663c0c9496a6 ("powerpc/kasan: Fix shadow area set up for modules.") Signed-off-by: Jiangfeng Xiao --- arch/powerpc/mm/kasan/init_32.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/mm/kasan/init_32.c b/arch/powerpc/mm/kasan/init_32.c index a70828a..aa9aa11 100644 --- a/arch/powerpc/mm/kasan/init_32.c +++ b/arch/powerpc/mm/kasan/init_32.c @@ -64,6 +64,7 @@ int __init __weak kasan_init_region(void *start, size_t size) if (ret) return ret; + k_start = k_start & PAGE_MASK; block = memblock_alloc(k_end - k_start, PAGE_SIZE); if (!block) return -ENOMEM; -- 1.8.5.6
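The arithmetic in the example above can be checked with a small stand-alone program fragment. This is a sketch under assumptions: 4 KiB pages, 32-bit addresses modelled as uint32_t, and a hypothetical helper first_va() that computes only the first loop iteration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u                /* assumed 4 KiB pages */
#define PAGE_MASK (~(PAGE_SIZE - 1))

/*
 * Address used for the first page in the loop:
 *   va = block + k_cur - k_start, with k_cur = k_start & PAGE_MASK.
 * When align_first is true, k_start is page aligned before use,
 * as the patch does ahead of memblock_alloc().
 */
static uint32_t first_va(uint32_t block, uint32_t k_start, bool align_first)
{
	uint32_t k_cur = k_start & PAGE_MASK;

	if (align_first)
		k_start &= PAGE_MASK;

	/* Unsigned arithmetic wraps modulo 2^32, like 32-bit pointer math. */
	return block + k_cur - k_start;
}
```

With block = 0xdcd97000 and k_start = 0xfeef7400, the unaligned case gives 0xdcd96c00, which is 0x400 bytes below block, matching the values in the comment; with alignment the first va equals block.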
[PATCH 60/82] powerpc: Refactor intentional wrap-around test
In an effort to separate intentional arithmetic wrap-around from unexpected wrap-around, we need to refactor places that depend on this kind of math. One of the most common code patterns of this is: VAR + value < VAR Notably, this is considered "undefined behavior" for signed and pointer types, which the kernel works around by using the -fno-strict-overflow option in the build[1] (which used to just be -fwrapv). Regardless, we want to get the kernel source to the position where we can meaningfully instrument arithmetic wrap-around conditions and catch them when they are unexpected, regardless of whether they are signed[2], unsigned[3], or pointer[4] types. Refactor open-coded wrap-around addition test to use add_would_overflow(). This paves the way to enabling the wrap-around sanitizers in the future. Link: https://git.kernel.org/linus/68df3755e383e6fecf2354a67b08f92f18536594 [1] Link: https://github.com/KSPP/linux/issues/26 [2] Link: https://github.com/KSPP/linux/issues/27 [3] Link: https://github.com/KSPP/linux/issues/344 [4] Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: "Naveen N. Rao" Cc: Mahesh Salgaonkar Cc: Vasant Hegde Cc: dingsenjie Cc: linuxppc-dev@lists.ozlabs.org Cc: Aneesh Kumar K.V Cc: Naveen N. 
Rao Signed-off-by: Kees Cook --- arch/powerpc/platforms/powernv/opal-prd.c | 2 +- arch/powerpc/xmon/xmon.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c index b66b06efcef1..eaf95dc82925 100644 --- a/arch/powerpc/platforms/powernv/opal-prd.c +++ b/arch/powerpc/platforms/powernv/opal-prd.c @@ -51,7 +51,7 @@ static bool opal_prd_range_is_valid(uint64_t addr, uint64_t size) struct device_node *parent, *node; bool found; - if (addr + size < addr) + if (add_would_overflow(addr, size)) return false; parent = of_find_node_by_path("/reserved-memory"); diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index b3b94cd37713..b91fdda49434 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -3252,7 +3252,7 @@ memzcan(void) } else if (!ok && ook) printf("%.8lx\n", a - mskip); ook = ok; - if (a + mskip < a) + if (add_would_overflow(a, mskip)) break; } if (ook) -- 2.34.1
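For unsigned operands, the open-coded test and an explicit overflow check agree, which can be sketched as below. The real add_would_overflow() is introduced elsewhere in this series and is not shown here; u64_add_would_overflow() is a hypothetical user-space stand-in built on the GCC/Clang __builtin_add_overflow() builtin.

```c
#include <stdbool.h>
#include <stdint.h>

/* Open-coded form being replaced: relies on unsigned wrap-around. */
static bool u64_add_wraps_open_coded(uint64_t a, uint64_t b)
{
	return a + b < a;
}

/* Hypothetical stand-in: ask the compiler whether a + b overflows. */
static bool u64_add_would_overflow(uint64_t a, uint64_t b)
{
	uint64_t sum;

	return __builtin_add_overflow(a, b, &sum);
}
```

The second form states the intent directly, so a sanitizer does not have to guess whether the wrap-around in the comparison was deliberate.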
[PATCH 6.7 517/641] perf vendor events powerpc: Update datasource event name to fix duplicate events
6.7-stable review patch. If anyone has any objections, please let me know. -- From: Athira Rajeev [ Upstream commit 9eef41014fe01287dae79fe208b9b433b13040bb ] Running "perf list" on powerpc fails with a segfault as below: $ ./perf list Segmentation fault (core dumped) $ This happens because of duplicate events in the JSON list. The powerpc JSON event list contains some events with the same event name but different event codes. They are: - PM_INST_FROM_L3MISS (Present in datasource and frontend) - PM_MRK_DATA_FROM_L2MISS (Present in datasource and marked) - PM_MRK_INST_FROM_L3MISS (Present in datasource and marked) - PM_MRK_DATA_FROM_L3MISS (Present in datasource and marked) pmu_events_table__num_events() uses the value from table_pmu->num_entries which includes duplicate events as well. This causes an issue during "perf list" and results in a segmentation fault. Since both event codes are valid, append _DSRC to the Data Source events (datasource.json), so that they would have a unique name. Also add PM_DATA_FROM_L2MISS_DSRC and PM_DATA_FROM_L3MISS_DSRC events. With the fix, 'perf list' works as expected. 
Fixes: fc143580753348c6 ("perf vendor events power10: Update JSON/events") Signed-off-by: Athira Jajeev Tested-by: Disha Goel Cc: Adrian Hunter Cc: Disha Goel Cc: Ian Rogers Cc: James Clark Cc: Jiri Olsa Cc: Kajol Jain Cc: linuxppc-dev@lists.ozlabs.org Cc: Madhavan Srinivasan Cc: Namhyung Kim Link: https://lore.kernel.org/r/20231123160110.94090-1-atraj...@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: Sasha Levin --- .../arch/powerpc/power10/datasource.json | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/tools/perf/pmu-events/arch/powerpc/power10/datasource.json b/tools/perf/pmu-events/arch/powerpc/power10/datasource.json index 6b0356f2d301..0eeaaf1a95b8 100644 --- a/tools/perf/pmu-events/arch/powerpc/power10/datasource.json +++ b/tools/perf/pmu-events/arch/powerpc/power10/datasource.json @@ -99,6 +99,11 @@ "EventName": "PM_INST_FROM_L2MISS", "BriefDescription": "The processor's instruction cache was reloaded from a source beyond the local core's L2 due to a demand miss." }, + { +"EventCode": "0x0003C000C040", +"EventName": "PM_DATA_FROM_L2MISS_DSRC", +"BriefDescription": "The processor's L1 data cache was reloaded from a source beyond the local core's L2 due to a demand miss." + }, { "EventCode": "0x00038010C040", "EventName": "PM_INST_FROM_L2MISS_ALL", @@ -161,9 +166,14 @@ }, { "EventCode": "0x00078000C040", -"EventName": "PM_INST_FROM_L3MISS", +"EventName": "PM_INST_FROM_L3MISS_DSRC", "BriefDescription": "The processor's instruction cache was reloaded from beyond the local core's L3 due to a demand miss." }, + { +"EventCode": "0x0007C000C040", +"EventName": "PM_DATA_FROM_L3MISS_DSRC", +"BriefDescription": "The processor's L1 data cache was reloaded from beyond the local core's L3 due to a demand miss." 
+ }, { "EventCode": "0x00078010C040", "EventName": "PM_INST_FROM_L3MISS_ALL", @@ -981,7 +991,7 @@ }, { "EventCode": "0x0003C000C142", -"EventName": "PM_MRK_DATA_FROM_L2MISS", +"EventName": "PM_MRK_DATA_FROM_L2MISS_DSRC", "BriefDescription": "The processor's L1 data cache was reloaded from a source beyond the local core's L2 due to a demand miss for a marked instruction." }, { @@ -1046,12 +1056,12 @@ }, { "EventCode": "0x00078000C142", -"EventName": "PM_MRK_INST_FROM_L3MISS", +"EventName": "PM_MRK_INST_FROM_L3MISS_DSRC", "BriefDescription": "The processor's instruction cache was reloaded from beyond the local core's L3 due to a demand miss for a marked instruction." }, { "EventCode": "0x0007C000C142", -"EventName": "PM_MRK_DATA_FROM_L3MISS", +"EventName": "PM_MRK_DATA_FROM_L3MISS_DSRC", "BriefDescription": "The processor's L1 data cache was reloaded from beyond the local core's L3 due to a demand miss for a marked instruction." }, { -- 2.43.0
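A duplicate-name check of the kind that would flag this situation can be sketched as follows. This is not how perf validates its tables; struct event and first_duplicate_name() are hypothetical, and the sketch only shows that two entries sharing an EventName are detectable independently of their EventCode.

```c
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for one JSON event entry. */
struct event {
	const char *name;
	const char *code;
};

/*
 * Return the index of the first entry whose name repeats an earlier
 * entry, or -1 if all names are unique. Quadratic, which is fine for
 * a sketch.
 */
static int first_duplicate_name(const struct event *ev, size_t n)
{
	for (size_t i = 1; i < n; i++)
		for (size_t j = 0; j < i; j++)
			if (strcmp(ev[i].name, ev[j].name) == 0)
				return (int)i;
	return -1;
}
```

After the rename to the _DSRC suffix, the datasource entry no longer collides with the same-named entry from the other JSON file, so a check like this would return -1.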
[PATCH v2] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure
When a PCI device is Dynamically added, LPAR OOPS with NULL pointer exception. Complete stack is as below [ 211.239206] BUG: Kernel NULL pointer dereference on read at 0x0030 [ 211.239210] Faulting instruction address: 0xc06bbe5c [ 211.239214] Oops: Kernel access of bad area, sig: 11 [#1] [ 211.239218] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries [ 211.239223] Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse [ 211.239280] CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66 [ 211.239284] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries [ 211.239289] NIP: c06bbe5c LR: c0a13e68 CTR: c00579f8 [ 211.239293] REGS: c0009924f240 TRAP: 0300 Not tainted (6.7.0-203405+) [ 211.239298] MSR: 80009033 CR: 24002220 XER: 20040006 [ 211.239306] CFAR: c0a13e64 DAR: 0030 DSISR: 4000 IRQMASK: 0 [ 211.239306] GPR00: c0a13e68 c0009924f4e0 c15a2b00 [ 211.239306] GPR04: c13c5590 c6d07970 c000d8f8f180 [ 211.239306] GPR08: 06ec c000d8f8f180 c2c35d58 24002228 [ 211.239306] GPR12: c00579f8 c003ffeb3880 [ 211.239306] GPR16: [ 211.239306] GPR20: [ 211.239306] GPR24: c000919460c0 f000 c10088e8 [ 211.239306] GPR28: c13c5590 c6d07970 c000919460c0 c000919460c0 [ 211.239354] NIP [c06bbe5c] sysfs_add_link_to_group+0x34/0x94 [ 211.239361] LR [c0a13e68] iommu_device_link+0x5c/0x118 [ 211.239367] Call Trace: [ 211.239369] [c0009924f4e0] [c0a109b8] iommu_init_device+0x26c/0x318 (unreliable) [ 211.239376] 
[c0009924f520] [c0a13e68] iommu_device_link+0x5c/0x118 [ 211.239382] [c0009924f560] [c0a107f4] iommu_init_device+0xa8/0x318 [ 211.239387] [c0009924f5c0] [c0a11a08] iommu_probe_device+0xc0/0x134 [ 211.239393] [c0009924f600] [c0a11ac0] iommu_bus_notifier+0x44/0x104 [ 211.239398] [c0009924f640] [c018dcc0] notifier_call_chain+0xb8/0x19c [ 211.239405] [c0009924f6a0] [c018df88] blocking_notifier_call_chain+0x64/0x98 [ 211.239411] [c0009924f6e0] [c0a250fc] bus_notify+0x50/0x7c [ 211.239416] [c0009924f720] [c0a20838] device_add+0x640/0x918 [ 211.239421] [c0009924f7f0] [c08f1a34] pci_device_add+0x23c/0x298 [ 211.239427] [c0009924f840] [c0077460] of_create_pci_dev+0x400/0x884 [ 211.239432] [c0009924f8e0] [c0077a08] of_scan_pci_dev+0x124/0x1b0 [ 211.239437] [c0009924f980] [c0077b0c] __of_scan_bus+0x78/0x18c [ 211.239442] [c0009924fa10] [c0073f90] pcibios_scan_phb+0x2a4/0x3b0 [ 211.239447] [c0009924fad0] [c01007a8] init_phb_dynamic+0xb8/0x110 [ 211.239453] [c0009924fb40] [c00806920620] dlpar_add_slot+0x170/0x3b8 [rpadlpar_io] [ 211.239461] [c0009924fbe0] [c00806920d64] add_slot_store.part.0+0xb4/0x130 [rpadlpar_io] [ 211.239468] [c0009924fc70] [c0fb4144] kobj_attr_store+0x2c/0x48 [ 211.239473] [c0009924fc90] [c06b90e4] sysfs_kf_write+0x64/0x78 [ 211.239479] [c0009924fcb0] [c06b7b78] kernfs_fop_write_iter+0x1b0/0x290 [ 211.239485] [c0009924fd00] [c05b6fdc] vfs_write+0x350/0x4a0 [ 211.239491] [c0009924fdc0] [c05b7450] ksys_write+0x84/0x140 [ 211.239496] [c0009924fe10] [c0030a04] system_call_exception+0x124/0x330 [ 211.239502] [c0009924fe50] [c000cedc] system_call_vectored_common+0x15c/0x2ec Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains") broke DLPAR ADD of pci devices. The above added iommu_device structure to pci_controller. During system boot, pci devices are discovered and this newly added iommu_device structure initialized by a call to iommu_device_register(). 
During DLPAR ADD of a PCI device, a new pci_controller structure is allocated but there are no calls made to iommu_device_register() interface. Fix would be to register iommu device during DLPAR ADD as well. Fixes: a940904443e4 ("powerpc/iommu: Add iommu
Re: [RFC PATCH 2/3] fs: remove duplicate ifdefs
On Thu, Jan 18, 2024 at 01:33:25 PM +0530, Shrikanth Hegde wrote: > When an ifdef is used in the manner below, the second one can be considered a > duplicate. > > ifdef DEFINE_A > ...code block... > ifdef DEFINE_A > ...code block... > endif > ...code block... > endif > > There are a few places in fs code where the above pattern was seen. > No functional change is intended here. It only aims to improve code > readability. > Can you please post the xfs changes as a separate patch along with Darrick's RVB tag? This will make it easy for me to apply the resulting patch to the XFS tree. -- Chandan
Re: [RFC PATCH] mm: z3fold: rename CONFIG_Z3FOLD to CONFIG_Z3FOLD_DEPRECATED
On Sun, Jan 21, 2024 at 11:42 PM Christoph Hellwig wrote: > > On Tue, Jan 16, 2024 at 12:19:39PM -0800, Yosry Ahmed wrote: > > Well, better compression ratios for one :) > > > > I think a long time ago there were complaints that zsmalloc had higher > > latency than zbud/z3fold, but since then a lot of things have changed > > (including nice compaction optimization from Sergey, and compaction > > was one of the main factors AFAICT). Also, recent experiments that > > Chris Li conducted showed that (at least in our setup), the > > decompression is only a small part of the fault latency with zswap > > (i.e. not the main factor) -- so I am not sure if it actually matters > > in practice. > > > > That said, I have not conducted any experiments personally with z3fold > > or zbud, which is why I proposed the conservative approach of marking > > as deprecated first. However, if others believe this is unnecessary I > > am fine with removal as well. Whatever we agree on is fine by me. > > In general deprecated is for code that has active (intentional) users > and/or would break setups. It does sound to me like that is not the > case here, but others might understand this better. I generally agree. So far we have no knowledge of active users, and if there are some, I expect most of them to be able to switch to zsmalloc with no problems. That being said, I was trying to take the conservative approach. If others agree I can send a removal patch instead.
Re: [PATCH v2 0/3] ASoC: Support SAI and MICFIL on i.MX95 platform
On Fri, 12 Jan 2024 14:43:28 +0900, Chancel Liu wrote: > Support SAI and MICFIL on i.MX95 platform > > changes in v2 > - Remove unnecessary "item" in fsl,micfil.yaml > - Don't change alphabetical order in fsl,sai.yaml > > Chancel Liu (3): > ASoC: dt-bindings: fsl,sai: Add compatible string for i.MX95 platform > ASoC: fsl_sai: Add support for i.MX95 platform > ASoC: dt-bindings: fsl,micfil: Add compatible string for i.MX95 > platform > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next Thanks! [1/3] ASoC: dt-bindings: fsl,sai: Add compatible string for i.MX95 platform commit: 52523f70fdf9b2cb0bfd1999eba4aa3a30b04fa6 [2/3] ASoC: fsl_sai: Add support for i.MX95 platform commit: 2f2d78e2c29347a96268f6f34092538b307ed056 [3/3] ASoC: dt-bindings: fsl,micfil: Add compatible string for i.MX95 platform commit: 20d2719937cf439602566a8f041d3208274abc01 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
Re: [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT
On 22.01.24 21:03, Alexandre Ghiti wrote: Hi David, On 22/01/2024 20:41, David Hildenbrand wrote: We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Signed-off-by: David Hildenbrand --- arch/riscv/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 0c94260b5d0c1..add5cd30ab34d 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval) set_pte(ptep, pteval); } +#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval, unsigned int nr) { There is a typo in the commit title: risc -> riscv. Otherwise, this is right so: Whoops :) Reviewed-by: Alexandre Ghiti Thanks! -- Cheers, David / dhildenb
Re: [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT
Hi David, On 22/01/2024 20:41, David Hildenbrand wrote: We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Signed-off-by: David Hildenbrand --- arch/riscv/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 0c94260b5d0c1..add5cd30ab34d 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval) set_pte(ptep, pteval); } +#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval, unsigned int nr) { There is a typo in the commit title: risc -> riscv. Otherwise, this is right so: Reviewed-by: Alexandre Ghiti Thanks, Alex
[PATCH v1 11/11] mm/memory: ignore writable bit in folio_pte_batch()
... and conditionally return to the caller if any pte except the first one is writable. fork() has to make sure to properly write-protect in case any PTE is writable. Other users (e.g., page unmapping) won't care. Signed-off-by: David Hildenbrand --- mm/memory.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 341b2be845b6e..a26fd0669016b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -955,7 +955,7 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, static inline pte_t __pte_batch_clear_ignored(pte_t pte) { - return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte))); + return pte_wrprotect(pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)))); } /* @@ -963,20 +963,29 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte) * pages of the same folio. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. - * the accessed bit, dirty bit and soft-dirty bit. + * the accessed bit, dirty bit, soft-dirty bit and writable bit. + * If "any_writable" is set, it will indicate if any other PTE besides the + * first (given) PTE is writable. 
*/ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr) + pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte)); pte_t *ptep = start_ptep + 1; + bool writable; + + if (any_writable) + *any_writable = false; VM_WARN_ON_FOLIO(!pte_present(pte), folio); while (ptep != end_ptep) { - pte = __pte_batch_clear_ignored(ptep_get(ptep)); + pte = ptep_get(ptep); + if (any_writable) + writable = !!pte_write(pte); + pte = __pte_batch_clear_ignored(pte); if (!pte_same(pte, expected_pte)) break; @@ -989,6 +998,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, if (pte_pfn(pte) == folio_end_pfn) break; + if (any_writable) + *any_writable |= writable; + expected_pte = pte_next_pfn(expected_pte); ptep++; } @@ -1010,6 +1022,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma { struct page *page; struct folio *folio; + bool any_writable; int err, nr; page = vm_normal_page(src_vma, addr, pte); @@ -1024,7 +1037,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma * by keeping the batching logic separate. */ if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr); + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, +&any_writable); if (folio_test_anon(folio)) { folio_ref_add(folio, nr); if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, @@ -1039,6 +1053,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma folio_dup_file_rmap_ptes(folio, page, nr); rss[mm_counter_file(page)] += nr; } + if (any_writable) + pte = pte_mkwrite(pte, src_vma); __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, nr); return nr; -- 2.43.0
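The batching rule, including the "any_writable" reporting, can be modelled on a toy PTE encoding as below. Everything here is an assumption made for illustration: the bit layout, the PFN_SHIFT value and the names clear_ignored(), next_pfn() and pte_batch() are hypothetical, and the folio-boundary check from the real function is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE layout, purely illustrative: flags in the low bits, PFN above. */
#define PTE_WRITE      (1ull << 0)
#define PTE_ACCESSED   (1ull << 1)
#define PTE_DIRTY      (1ull << 2)
#define PTE_SOFT_DIRTY (1ull << 3)
#define PFN_SHIFT      12

/* Drop the bits that are allowed to differ inside a batch. */
static uint64_t clear_ignored(uint64_t pte)
{
	return pte & ~(PTE_WRITE | PTE_ACCESSED | PTE_DIRTY | PTE_SOFT_DIRTY);
}

static uint64_t next_pfn(uint64_t pte)
{
	return pte + (1ull << PFN_SHIFT);
}

/*
 * Count how many leading entries form a batch with ptes[0], and report
 * whether any entry after the first one is writable.
 */
static int pte_batch(const uint64_t *ptes, int max_nr, bool *any_writable)
{
	uint64_t expected = clear_ignored(next_pfn(ptes[0]));
	int nr = 1;

	if (any_writable)
		*any_writable = false;

	while (nr < max_nr) {
		uint64_t pte = ptes[nr];

		if (clear_ignored(pte) != expected)
			break;
		if (any_writable && (pte & PTE_WRITE))
			*any_writable = true;

		expected = next_pfn(expected);
		nr++;
	}
	return nr;
}
```

In this model, entries with PFNs 5, 6, 7 batch together even if only the second one is writable and dirty, and the caller learns through any_writable that the batch as a whole must still be write-protected.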
[PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
Let's ignore these bits: they are irrelevant for fork, and will likely be irrelevant for upcoming users such as page unmapping. Signed-off-by: David Hildenbrand --- mm/memory.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index f563aec85b2a8..341b2be845b6e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); } +static inline pte_t __pte_batch_clear_ignored(pte_t pte) +{ + return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte))); +} + /* * Detect a PTE batch: consecutive (present) PTEs that map consecutive * pages of the same folio. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. + * the accessed bit, dirty bit and soft-dirty bit. */ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, pte_t *start_ptep, pte_t pte, int max_nr) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; - pte_t expected_pte = pte_next_pfn(pte); + pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte)); pte_t *ptep = start_ptep + 1; VM_WARN_ON_FOLIO(!pte_present(pte), folio); while (ptep != end_ptep) { - pte = ptep_get(ptep); + pte = __pte_batch_clear_ignored(ptep_get(ptep)); if (!pte_same(pte, expected_pte)) break; -- 2.43.0
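To illustrate the comparison rule from this patch outside the kernel, here is a minimal user-space sketch. The toy_* names and the bit positions are made up for illustration (real PTE layouts are per-architecture, and the kernel uses pte_mkold(), pte_mkclean() and pte_clear_soft_dirty() rather than a single mask). It only shows that two entries differing solely in the accessed/dirty/soft-dirty bits compare equal once those bits are cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy PTE; not taken from any architecture. */
#define TOY_PTE_PRESENT    (1ULL << 0)
#define TOY_PTE_WRITE      (1ULL << 1)
#define TOY_PTE_ACCESSED   (1ULL << 5)
#define TOY_PTE_DIRTY      (1ULL << 6)
#define TOY_PTE_SOFT_DIRTY (1ULL << 11)
#define TOY_PFN_SHIFT      12

#define TOY_PTE_IGNORED_MASK \
	(TOY_PTE_ACCESSED | TOY_PTE_DIRTY | TOY_PTE_SOFT_DIRTY)

/* The ignored bits must never overlap the PFN field in this toy layout. */
static_assert((TOY_PTE_IGNORED_MASK >> TOY_PFN_SHIFT) == 0,
	      "ignored bits must not overlap the PFN");

typedef uint64_t toy_pte_t;

/* Plays the role of __pte_batch_clear_ignored(): drop the bits we ignore. */
static toy_pte_t toy_pte_batch_clear_ignored(toy_pte_t pte)
{
	return pte & ~TOY_PTE_IGNORED_MASK;
}

/* Two PTEs are batch-compatible if they match after clearing ignored bits. */
static bool toy_pte_same_for_batch(toy_pte_t a, toy_pte_t b)
{
	return toy_pte_batch_clear_ignored(a) == toy_pte_batch_clear_ignored(b);
}
```

Note that the writable bit and the PFN still take part in the comparison here, matching the state of the series after this patch and before 11/11.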
[PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP
Let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio, and all other PTE bits besides the PFNs are equal. We will optimize folio_pte_batch() separately, to ignore some other PTE bits. This patch is based on work by Ryan Roberts. Use __always_inline for __copy_present_ptes() and keep the handling for single PTEs completely separate from the multi-PTE case: we really want the compiler to optimize for the single-PTE case with small folios, to not degrade performance. Note that PTE batching will never exceed a single page table and will always stay within VMA boundaries. Signed-off-by: David Hildenbrand --- include/linux/pgtable.h | 17 +- mm/memory.c | 113 +--- 2 files changed, 109 insertions(+), 21 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index f6d0e3513948a..d32cedf6936ba 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -212,8 +212,6 @@ static inline int pmd_dirty(pmd_t pmd) #define arch_flush_lazy_mmu_mode() do {} while (0) #endif -#ifndef set_ptes - #ifndef pte_next_pfn static inline pte_t pte_next_pfn(pte_t pte) { @@ -221,6 +219,7 @@ static inline pte_t pte_next_pfn(pte_t pte) } #endif +#ifndef set_ptes /** * set_ptes - Map consecutive pages to a contiguous range of addresses. * @mm: Address space to map the pages into. @@ -650,6 +649,20 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres } #endif +#ifndef wrprotect_ptes +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, unsigned int nr) +{ + for (;;) { + ptep_set_wrprotect(mm, addr, ptep); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif + /* * On some architectures hardware does not set page access bit when accessing * memory page, it is responsibility of software setting this bit. 
It brings diff --git a/mm/memory.c b/mm/memory.c index 185b4aff13d62..f563aec85b2a8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma return 0; } -static inline void __copy_present_pte(struct vm_area_struct *dst_vma, +static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, - pte_t pte, unsigned long addr) + pte_t pte, unsigned long addr, int nr) { struct mm_struct *src_mm = src_vma->vm_mm; /* If it's a COW mapping, write protect it both processes. */ if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { - ptep_set_wrprotect(src_mm, addr, src_pte); + wrprotect_ptes(src_mm, addr, src_pte, nr); pte = pte_wrprotect(pte); } @@ -950,26 +950,94 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma, if (!userfaultfd_wp(dst_vma)) pte = pte_clear_uffd_wp(pte); - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); +} + +/* + * Detect a PTE batch: consecutive (present) PTEs that map consecutive + * pages of the same folio. + * + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. + */ +static inline int folio_pte_batch(struct folio *folio, unsigned long addr, + pte_t *start_ptep, pte_t pte, int max_nr) +{ + unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); + const pte_t *end_ptep = start_ptep + max_nr; + pte_t expected_pte = pte_next_pfn(pte); + pte_t *ptep = start_ptep + 1; + + VM_WARN_ON_FOLIO(!pte_present(pte), folio); + + while (ptep != end_ptep) { + pte = ptep_get(ptep); + + if (!pte_same(pte, expected_pte)) + break; + + /* +* Stop immediately once we reached the end of the folio. In +* corner cases the next PFN might fall into a different +* folio. 
+*/ + if (pte_pfn(pte) == folio_end_pfn) + break; + + expected_pte = pte_next_pfn(expected_pte); + ptep++; + } + + return ptep - start_ptep; } /* - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page - * is required to copy this pte. + * Copy one present PTE, trying to batch-process subsequent PTEs that map + * consecutive pages of the same folio by copying them as well. + * + * Returns -EAGAIN if one preallocated page is required to copy the next PTE. + * Otherwise, returns the number of copied PTEs (at least 1). */ static inline int -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_a
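As a rough illustration of the batch detection loop in folio_pte_batch() above, here is a user-space sketch. Everything prefixed with toy_ is hypothetical: PTEs are modelled as plain 64-bit integers with the PFN stored above TOY_PFN_SHIFT (standing in for PFN_PTE_SHIFT), and pte_same() becomes a simple integer comparison. It is not the kernel implementation, only the same control flow.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;

/* Hypothetical shift; plays the role of PFN_PTE_SHIFT. */
#define TOY_PFN_SHIFT 12

static uint64_t toy_pte_pfn(toy_pte_t pte)
{
	return pte >> TOY_PFN_SHIFT;
}

/* Same idea as pte_next_pfn(): advance the PFN by one, keep other bits. */
static toy_pte_t toy_pte_next_pfn(toy_pte_t pte)
{
	return pte + (1ULL << TOY_PFN_SHIFT);
}

/*
 * Return how many entries, starting at ptes[0], map consecutive PFNs with
 * otherwise identical bits, without reaching folio_end_pfn (the first PFN
 * past the folio) and without exceeding max_nr.
 */
static int toy_pte_batch(const toy_pte_t *ptes, int max_nr,
			 uint64_t folio_end_pfn)
{
	toy_pte_t expected;
	int nr = 1;

	assert(max_nr >= 1);
	expected = toy_pte_next_pfn(ptes[0]);

	while (nr < max_nr) {
		if (ptes[nr] != expected)
			break;
		/* Stop once the next PFN would fall outside the folio. */
		if (toy_pte_pfn(ptes[nr]) == folio_end_pfn)
			break;
		expected = toy_pte_next_pfn(expected);
		nr++;
	}
	return nr;
}
```

The return value is always at least 1, which mirrors the "number of copied PTEs (at least 1)" contract described in the patch.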
[PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte()
We already read it, let's just forward it.

This patch is based on work by Ryan Roberts.

Signed-off-by: David Hildenbrand
---
 mm/memory.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2aa2051ee51d3..185b4aff13d62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
  */
 static inline int
 copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
-		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-		 struct folio **prealloc)
+		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
+		 int *rss, struct folio **prealloc)
 {
-	pte_t pte = ptep_get(src_pte);
 	struct page *page;
 	struct folio *folio;
 
@@ -1104,7 +1103,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		}
 		/* copy_present_pte() will clear `*prealloc' if consumed */
 		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-				       addr, rss, &prealloc);
+				       ptent, addr, rss, &prealloc);
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
-- 
2.43.0
[PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte()
Let's prepare for further changes. Signed-off-by: David Hildenbrand --- mm/memory.c | 60 - 1 file changed, 32 insertions(+), 28 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 7e1f4849463aa..2aa2051ee51d3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma return 0; } +static inline void __copy_present_pte(struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, + pte_t pte, unsigned long addr) +{ + struct mm_struct *src_mm = src_vma->vm_mm; + + /* If it's a COW mapping, write protect it both processes. */ + if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { + ptep_set_wrprotect(src_mm, addr, src_pte); + pte = pte_wrprotect(pte); + } + + /* If it's a shared mapping, mark it clean in the child. */ + if (src_vma->vm_flags & VM_SHARED) + pte = pte_mkclean(pte); + pte = pte_mkold(pte); + + if (!userfaultfd_wp(dst_vma)) + pte = pte_clear_uffd_wp(pte); + + set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); +} + /* * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page * is required to copy this pte. 
@@ -939,16 +962,16 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, struct folio **prealloc) { - struct mm_struct *src_mm = src_vma->vm_mm; - unsigned long vm_flags = src_vma->vm_flags; pte_t pte = ptep_get(src_pte); struct page *page; struct folio *folio; page = vm_normal_page(src_vma, addr, pte); - if (page) - folio = page_folio(page); - if (page && folio_test_anon(folio)) { + if (unlikely(!page)) + goto copy_pte; + + folio = page_folio(page); + if (folio_test_anon(folio)) { /* * If this page may have been pinned by the parent process, * copy the page immediately for the child so that we'll always @@ -963,34 +986,15 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, addr, rss, prealloc, page); } rss[MM_ANONPAGES]++; - } else if (page) { + VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); + } else { folio_get(folio); folio_dup_file_rmap_pte(folio, page); rss[mm_counter_file(page)]++; } - /* -* If it's a COW mapping, write protect it both -* in the parent and the child -*/ - if (is_cow_mapping(vm_flags) && pte_write(pte)) { - ptep_set_wrprotect(src_mm, addr, src_pte); - pte = pte_wrprotect(pte); - } - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page)); - - /* -* If it's a shared mapping, mark it clean in -* the child -*/ - if (vm_flags & VM_SHARED) - pte = pte_mkclean(pte); - pte = pte_mkold(pte); - - if (!userfaultfd_wp(dst_vma)) - pte = pte_clear_uffd_wp(pte); - - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); +copy_pte: + __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr); return 0; } -- 2.43.0
[PATCH v1 06/11] sparc/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand
---
 arch/sparc/include/asm/pgtable_64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index a8c871b7d7860..652af9d63fa29 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0
[PATCH v1 05/11] s390/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand
---
 arch/s390/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1299b56e43f6f..4b91e65c85d97 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
 #define pgprot_writethrough	pgprot_writethrough
 pgprot_t pgprot_writethrough(pgprot_t prot);
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 /*
  * Set multiple PTEs to consecutive pages with a single call.  All PTEs
  * are within the same folio, PMD and VMA.
-- 
2.43.0
[PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand
---
 arch/riscv/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c1..add5cd30ab34d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
 	set_pte(ptep, pteval);
 }
 
+#define PFN_PTE_SHIFT		_PAGE_PFN_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pteval, unsigned int nr)
 {
-- 
2.43.0
[PATCH v1 03/11] powerpc/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand
---
 arch/powerpc/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9224f23065fff..7a1ba8889aeae 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,6 +41,8 @@ struct mm_struct;
 
 #ifndef __ASSEMBLY__
 
+#define PFN_PTE_SHIFT		PTE_RPN_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		pte_t pte, unsigned int nr);
 #define set_ptes set_ptes
-- 
2.43.0
[PATCH v1 02/11] nios2/pgtable: define PFN_PTE_SHIFT
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand
---
 arch/nios2/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 5144506dfa693..d052dfcbe8d3a 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
 	*ptep = pteval;
 }
 
+#define PFN_PTE_SHIFT		0
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0
[PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand
---
 arch/arm/include/asm/pgtable.h   | 2 ++
 arch/arm64/include/asm/pgtable.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index d657b84b6bf70..be91e376df79e 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
 extern void __sync_icache_dcache(pte_t pteval);
 #endif
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr,
 		      pte_t *ptep, pte_t pteval, unsigned int nr);
 #define set_ptes set_ptes
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751c..d4b3bd96e3304 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
 		mte_sync_tags(pte, nr_pages);
 }
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm,
 			    unsigned long __always_unused addr,
 			    pte_t *ptep, pte_t pte, unsigned int nr)
-- 
2.43.0
[PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |      New | Change
------------------------------------------
      4KiB | 0.014328 | 0.014265 |     0%
     16KiB | 0.014263 | 0.013293 |   - 7%
     32KiB | 0.014334 | 0.012355 |   -14%
     64KiB | 0.014046 | 0.011837 |   -16%
    128KiB | 0.014011 | 0.011536 |   -18%
    256KiB | 0.013993 | 0.01134  |   -19%
    512KiB | 0.013983 | 0.011311 |   -19%
   1024KiB | 0.013986 | 0.011282 |   -19%
   2048KiB | 0.014305 | 0.011496 |   -20%

Next up is PTE batching when unmapping, that I'll probably send out
based on this series this/next week.

Only tested on x86-64. Compile-tested on most other architectures.
Will do more testing and double-check the arch changes while this is
getting some review.

[1] https://lkml.kernel.org/r/20231220224504.646757-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

Cc: Andrew Morton
Cc: Matthew Wilcox (Oracle)
Cc: Ryan Roberts
Cc: Russell King
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Dinh Nguyen
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: "Aneesh Kumar K.V"
Cc: "Naveen N. Rao"
Cc: Paul Walmsley
Cc: Palmer Dabbelt
Cc: Albert Ou
Cc: Alexander Gordeev
Cc: Gerald Schaefer
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Christian Borntraeger
Cc: Sven Schnelle
Cc: "David S. Miller"
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: sparcli...@vger.kernel.org

David Hildenbrand (11):
  arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  risc: pgtable: define PFN_PTE_SHIFT
  s390/pgtable: define PFN_PTE_SHIFT
  sparc/pgtable: define PFN_PTE_SHIFT
  mm/memory: factor out copying the actual PTE in copy_present_pte()
  mm/memory: pass PTE to copy_present_pte()
  mm/memory: optimize fork() with PTE-mapped THP
  mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  mm/memory: ignore writable bit in folio_pte_batch()

 arch/arm/include/asm/pgtable.h      |   2 +
 arch/arm64/include/asm/pgtable.h    |   2 +
 arch/nios2/include/asm/pgtable.h    |   2 +
 arch/powerpc/include/asm/pgtable.h  |   2 +
 arch/riscv/include/asm/pgtable.h    |   2 +
 arch/s390/include/asm/pgtable.h     |   2 +
 arch/sparc/include/asm/pgtable_64.h |   2 +
 include/linux/pgtable.h             |  17 ++-
 mm/memory.c                         | 188 +---
 9 files changed, 173 insertions(+), 46 deletions(-)

base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
-- 
2.43.0
Re: [PATCH 1/1] PCI/DPC: Fix TLP Prefix register reading offset
On Thu, Jan 18, 2024 at 01:08:15PM +0200, Ilpo Järvinen wrote:
> The TLP Prefix Log Register consists of multiple DWORDs (PCIe r6.1 sec
> 7.9.14.13) but the loop in dpc_process_rp_pio_error() keeps reading
> from the first DWORD. Add the iteration count based offset calculation
> into the config read.
>
> Fixes: f20c4ea49ec4 ("PCI/DPC: Add eDPC support")
> Signed-off-by: Ilpo Järvinen

Applied to pci/dpc for v6.9 with commit log below, thanks!

  PCI/DPC: Print all TLP Prefixes, not just the first

  The TLP Prefix Log Register consists of multiple DWORDs (PCIe r6.1 sec
  7.9.14.13) but the loop in dpc_process_rp_pio_error() keeps reading from
  the first DWORD, so we print only the first PIO TLP Prefix (duplicated
  several times), and we never print the second, third, etc., Prefixes.

  Add the iteration count based offset calculation into the config read.

  Fixes: f20c4ea49ec4 ("PCI/DPC: Add eDPC support")
  Link: https://lore.kernel.org/r/20240118110815.3867-1-ilpo.jarvi...@linux.intel.com
  Signed-off-by: Ilpo Järvinen
  [bhelgaas: add user-visible details to commit log]
  Signed-off-by: Bjorn Helgaas

> ---
>  drivers/pci/pcie/dpc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 94111e438241..e5d7c12854fa 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -234,7 +234,7 @@ static void dpc_process_rp_pio_error(struct pci_dev *pdev)
>
>  	for (i = 0; i < pdev->dpc_rp_log_size - 5; i++) {
>  		pci_read_config_dword(pdev,
> -			cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG, &prefix);
> +			cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG + i * 4, &prefix);
>  		pci_err(pdev, "TLP Prefix Header: dw%d, %#010x\n", i, prefix);
>  	}
>  clear_status:
> --
> 2.39.2
>
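For readers who want to see the effect of the "+ i * 4" offset in isolation, here is a small user-space sketch. Config space is modelled as a byte array and toy_read_config_dword() is a hypothetical stand-in for pci_read_config_dword(); none of these names exist in the kernel. It only demonstrates that each loop iteration has to advance by one DWORD (4 bytes) to read successive log entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pci_read_config_dword(): read one little-endian
 * DWORD at byte offset 'off' from a buffer that models config space.
 */
static uint32_t toy_read_config_dword(const uint8_t *cfg, size_t cfg_len,
				      size_t off)
{
	assert(off + 4 <= cfg_len);
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Read 'n' consecutive DWORDs of a multi-DWORD log register that starts at
 * byte offset 'log_off'. Without the "i * 4" term every iteration would
 * return the first DWORD again.
 */
static void toy_read_prefix_log(const uint8_t *cfg, size_t cfg_len,
				size_t log_off, uint32_t *out, int n)
{
	for (int i = 0; i < n; i++)
		out[i] = toy_read_config_dword(cfg, cfg_len,
					       log_off + (size_t)i * 4);
}
```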
[RFC PATCH v2 4/4] arch/powerpc: remove duplicate ifdefs
when a ifdef is used in the below manner, second one could be considered as duplicate. ifdef DEFINE_A ...code block... ifdef DEFINE_A ...code block... endif ...code block... endif few places in arch/powerpc where this pattern was seen. In addition to that in paca.h, CONFIG_PPC_BOOK3S_64 was defined back to back. merged the two ifdefs. No functional change is intended here. It only aims to improve code readability. Signed-off-by: Shrikanth Hegde --- arch/powerpc/include/asm/paca.h | 4 arch/powerpc/kernel/asm-offsets.c | 2 -- arch/powerpc/platforms/powermac/feature.c | 2 -- arch/powerpc/xmon/xmon.c | 2 -- 4 files changed, 10 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index e667d455ecb4..1d58da946739 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -163,9 +163,7 @@ struct paca_struct { u64 kstack; /* Saved Kernel stack addr */ u64 saved_r1; /* r1 save for RTAS calls or PM or EE=0 */ u64 saved_msr; /* MSR saved here by enter_rtas */ -#ifdef CONFIG_PPC64 u64 exit_save_r1; /* Syscall/interrupt R1 save */ -#endif #ifdef CONFIG_PPC_BOOK3E_64 u16 trap_save; /* Used when bad stack is encountered */ #endif @@ -214,8 +212,6 @@ struct paca_struct { /* Non-maskable exceptions that are not performance critical */ u64 exnmi[EX_SIZE]; /* used for system reset (nmi) */ u64 exmc[EX_SIZE]; /* used for machine checks */ -#endif -#ifdef CONFIG_PPC_BOOK3S_64 /* Exclusive stacks for system reset and machine check exception. 
*/ void *nmi_emergency_sp; void *mc_emergency_sp; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 9f14d95b8b32..f029755f9e69 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -246,9 +246,7 @@ int main(void) OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id); OFFSET(PACAKEXECSTATE, paca_struct, kexec_state); OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default); -#ifdef CONFIG_PPC64 OFFSET(PACA_EXIT_SAVE_R1, paca_struct, exit_save_r1); -#endif #ifdef CONFIG_PPC_BOOK3E_64 OFFSET(PACA_TRAP_SAVE, paca_struct, trap_save); #endif diff --git a/arch/powerpc/platforms/powermac/feature.c b/arch/powerpc/platforms/powermac/feature.c index 81c9fbae88b1..2cc257f75c50 100644 --- a/arch/powerpc/platforms/powermac/feature.c +++ b/arch/powerpc/platforms/powermac/feature.c @@ -2333,7 +2333,6 @@ static struct pmac_mb_def pmac_mb_defs[] = { PMAC_TYPE_POWERMAC_G5, g5_features, 0, }, -#ifdef CONFIG_PPC64 { "PowerMac7,3", "PowerMac G5", PMAC_TYPE_POWERMAC_G5, g5_features, 0, @@ -2359,7 +2358,6 @@ static struct pmac_mb_def pmac_mb_defs[] = { 0, }, #endif /* CONFIG_PPC64 */ -#endif /* CONFIG_PPC64 */ }; /* diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index b3b94cd37713..f413c220165c 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -643,10 +643,8 @@ static int xmon_core(struct pt_regs *regs, volatile int fromipi) touch_nmi_watchdog(); } else { cmd = 1; -#ifdef CONFIG_SMP if (xmon_batch) cmd = batch_cmds(regs); -#endif if (!locked_down && cmd) cmd = cmds(regs); if (locked_down || cmd != 0) { -- 2.39.3
[RFC PATCH v2 3/4] ntfs: remove duplicate ifdefs
When an ifdef is used in the manner below, the second one can be
considered a duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A
...code block...
endif
...code block...
endif

In the ntfs code, one such pattern was seen. Hence remove that duplicate
ifdef. No functional change is intended here. It only aims to improve code
readability.

Signed-off-by: Shrikanth Hegde
---
 fs/ntfs/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index aba1e22db4e9..d2c8622d53d1 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -2859,11 +2859,9 @@ int ntfs_truncate(struct inode *vi)
  *
  * See ntfs_truncate() description above for details.
  */
-#ifdef NTFS_RW
 void ntfs_truncate_vfs(struct inode *vi) {
 	ntfs_truncate(vi);
 }
-#endif
 
 /**
  * ntfs_setattr - called from notify_change() when an attribute is being changed
-- 
2.39.3
[RFC PATCH v2 2/4] xfs: remove duplicate ifdefs
when a ifdef is used in the below manner, second one could be considered as duplicate. ifdef DEFINE_A ...code block... ifdef DEFINE_A ...code block... endif ...code block... endif In the xfs code two such patterns were seen. Hence removing these ifdefs. No functional change is intended here. It only aims to improve code readability. Reviewed-by: Darrick J. Wong Signed-off-by: Shrikanth Hegde --- fs/xfs/xfs_sysfs.c | 4 1 file changed, 4 deletions(-) diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index 17485666b672..d2391eec37fe 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -193,7 +193,6 @@ always_cow_show( } XFS_SYSFS_ATTR_RW(always_cow); -#ifdef DEBUG /* * Override how many threads the parallel work queue is allowed to create. * This has to be a debug-only global (instead of an errortag) because one of @@ -260,7 +259,6 @@ larp_show( return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.larp); } XFS_SYSFS_ATTR_RW(larp); -#endif /* DEBUG */ STATIC ssize_t bload_leaf_slack_store( @@ -319,10 +317,8 @@ static struct attribute *xfs_dbg_attrs[] = { ATTR_LIST(log_recovery_delay), ATTR_LIST(mount_delay), ATTR_LIST(always_cow), -#ifdef DEBUG ATTR_LIST(pwork_threads), ATTR_LIST(larp), -#endif ATTR_LIST(bload_leaf_slack), ATTR_LIST(bload_node_slack), NULL, -- 2.39.3
[RFC PATCH v2 0/4] remove duplicate ifdefs
While going through the code, I observed a case in the scheduler where
#ifdef CONFIG_SMP was used inside an #ifdef CONFIG_SMP block. That doesn't
make sense, since the first one is good enough and the second one is a
duplicate. Removing it could improve code readability. No functional
change is intended.

Since this might be present in other code areas, I wrote a very basic
python script which helps in finding these cases. It doesn't handle any
complicated #defines or space separated "# if". At some places the log
collected had to be manually corrected due to space separated ifdefs.
That's why it's not a treewide change. There might be an opportunity for
other files as well.

Logic is very simple. If there is an #ifdef, #if or #ifndef, add that
variable to a list. Upon every subsequent #ifdef, #if or #ifndef, check if
the same variable is already in the list. If yes, flag an error.
Verification was done manually later, checking for any #undef or any error
due to the script. These were the ones that were flagged and made sense
after going through the code.

More details about how the logs were collected and the script used for
processing the logs are mentioned in the v1 cover letter.

v1->v2:
split the fs change into two patches as suggested by Chandan Babu R.
v1: https://lore.kernel.org/all/20240118080326.13137-1-sshe...@linux.ibm.com/

Shrikanth Hegde (4):
  sched: remove duplicate ifdefs
  xfs: remove duplicate ifdefs
  ntfs: remove duplicate ifdefs
  arch/powerpc: remove duplicate ifdefs

 arch/powerpc/include/asm/paca.h           | 4 ----
 arch/powerpc/kernel/asm-offsets.c         | 2 --
 arch/powerpc/platforms/powermac/feature.c | 2 --
 arch/powerpc/xmon/xmon.c                  | 2 --
 fs/ntfs/inode.c                           | 2 --
 fs/xfs/xfs_sysfs.c                        | 4 ----
 kernel/sched/core.c                       | 4 +---
 kernel/sched/fair.c                       | 2 --
 8 files changed, 1 insertion(+), 21 deletions(-)

-- 
2.39.3
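The detection logic described in this cover letter can be sketched roughly as follows. This is not the author's python script (which is only referenced, not shown); it is a hypothetical C rendition with made-up toy_* names, and it makes two assumptions that the description leaves open: a name is dropped again at its matching #endif, and a name stops counting once its #else or #elif branch is entered. Like the original, it ignores space separated "# if" and anything more complicated than a plain "#ifdef NAME".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_DEPTH 64
#define TOY_MAX_NAME  128

/*
 * Count "#ifdef NAME" lines that are nested inside a still-open
 * "#ifdef NAME" with the same NAME in the given NUL-terminated source text.
 */
static int toy_count_duplicate_ifdefs(const char *text)
{
	char stack[TOY_MAX_DEPTH][TOY_MAX_NAME];
	int depth = 0, dups = 0;

	while (*text) {
		char name[TOY_MAX_NAME] = "";
		const char *p = text;

		while (*p == ' ' || *p == '\t')
			p++;

		if (strncmp(p, "#endif", 6) == 0) {
			if (depth > 0)
				depth--;
		} else if (strncmp(p, "#else", 5) == 0 ||
			   strncmp(p, "#elif", 5) == 0) {
			/* The enclosing condition no longer holds here. */
			if (depth > 0)
				stack[depth - 1][0] = '\0';
		} else if (strncmp(p, "#if", 3) == 0) {
			/* Only a plain "#ifdef NAME" records a name. */
			if (sscanf(p, "#ifdef %127s", name) == 1) {
				for (int i = 0; i < depth; i++) {
					if (strcmp(stack[i], name) == 0) {
						dups++;
						break;
					}
				}
			}
			/* "#if"/"#ifndef" push an empty name to keep depth balanced. */
			assert(depth < TOY_MAX_DEPTH);
			strcpy(stack[depth++], name);
		}

		/* Advance to the start of the next line. */
		text += strcspn(text, "\n");
		if (*text == '\n')
			text++;
	}
	return dups;
}
```

As the cover letter notes for the real script, any hit still needs manual verification, for example for an #undef between the two directives.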
[RFC PATCH v2 1/4] sched: remove duplicate ifdefs
when a ifdef is used in the below manner, second one could be considered as duplicate. ifdef DEFINE_A ...code block... ifdef DEFINE_A ...code block... endif ...code block... endif In the scheduler code, there are two places where above pattern can be observed. Hence second ifdef is a duplicate and not needed. Plus a minor comment update to reflect the else case. No functional change is intended here. It only aims to improve code readability. Signed-off-by: Shrikanth Hegde --- kernel/sched/core.c | 4 +--- kernel/sched/fair.c | 2 -- 2 files changed, 1 insertion(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9116bcc90346..a76c7095f736 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1792,7 +1792,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css); #endif #ifdef CONFIG_SYSCTL -#ifdef CONFIG_UCLAMP_TASK #ifdef CONFIG_UCLAMP_TASK_GROUP static void uclamp_update_root_tg(void) { @@ -1898,7 +1897,6 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write, return result; } #endif -#endif static int uclamp_validate(struct task_struct *p, const struct sched_attr *attr) @@ -2065,7 +2063,7 @@ static void __init init_uclamp(void) } } -#else /* CONFIG_UCLAMP_TASK */ +#else /* !CONFIG_UCLAMP_TASK */ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { } static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { } static inline int uclamp_validate(struct task_struct *p, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 533547e3c90a..8e30e2bb77a0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10182,10 +10182,8 @@ static int idle_cpu_without(int cpu, struct task_struct *p) * be computed and tested before calling idle_cpu_without(). */ -#ifdef CONFIG_SMP if (rq->ttwu_pending) return 0; -#endif return 1; } -- 2.39.3
Re: [RFC PATCH 2/3] fs: remove duplicate ifdefs
On 1/22/24 6:20 PM, Chandan Babu R wrote:
> On Thu, Jan 18, 2024 at 01:33:25 PM +0530, Shrikanth Hegde wrote:
>> when a ifdef is used in the below manner, second one could be considered as
>> duplicate.
>>
>> ifdef DEFINE_A
>> ...code block...
>> ifdef DEFINE_A
>> ...code block...
>> endif
>> ...code block...
>> endif
>>
>> There are few places in fs code where above pattern was seen.
>> No functional change is intended here. It only aims to improve code
>> readability.
>>
>
> Can you please post the xfs changes as a separate patch along with Darrick's
> RVB tag? This will make it easy for me to apply the resulting patch to the XFS
> tree.

Ok. will split the fs patches into two and send v2 soon. Thanks.

>
Re: [PATCH 1/1] PCI/DPC: Fix TLP Prefix register reading offset
On Fri, 19 Jan 2024, Bjorn Helgaas wrote:

> On Thu, Jan 18, 2024 at 01:08:15PM +0200, Ilpo Järvinen wrote:
> > The TLP Prefix Log Register consists of multiple DWORDs (PCIe r6.1 sec
> > 7.9.14.13) but the loop in dpc_process_rp_pio_error() keeps reading
> > from the first DWORD. Add the iteration count based offset calculation
> > into the config read.
>
> So IIUC the user-visible bug is that we print only the first PIO TLP
> Prefix (duplicated several times), and we never print the second,
> third, etc Prefixes, right?

Yes.

> I wish we could print them all in a single pci_err(), as we do for the
> TLP Header Log, instead of dribbling them out one by one.

I've also done some work towards consolidating AER and DPC TLP
Header/Prefix Log handling which is when I found this bug (the reading
side is already done but printing is still pending).

> > Fixes: f20c4ea49ec4 ("PCI/DPC: Add eDPC support")
> > Signed-off-by: Ilpo Järvinen

-- 
 i.
Re: [PATCH] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id
On 2024/1/22 15:41, Mike Rapoport wrote:
> On Fri, Jan 19, 2024 at 04:50:53PM +0800, Shijie Huang wrote:
>> On 2024/1/19 16:42, Mike Rapoport wrote:
>>> Is there a fundamental reason to have early_cpu_to_node() at all?
>> The early_cpu_to_node does not work on some ARCHs (which support the NUMA),
>> such as SPARC, MIPS and S390.
> My question was why we need early_cpu_to_node() at all and why can't we use
> cpu_to_node() early on arches that do have it

As you see, some ARCHs use cpu_to_node() all the time, such as SPARC, MIPS
and S390. They do not use early_cpu_to_node() at all.

In some ARCHs (arm64, powerpc, riscv), the cpu_to_node() is ready at:

  start_kernel --> arch_call_rest_init() --> rest_init() -->
  kernel_init() --> kernel_init_freeable() --> smp_prepare_cpus()

The cpu_to_node() is initialized too late.

I am not sure if we can move "cpu_to_node initialization" to an early place.
Moving "cpu_to_node() initialization" to an early place is more complicated,
I guess.

Thanks
Huang Shijie
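The ordering problem described above can be modelled in a few lines. This is a toy sketch only, under the assumption that the per-CPU node value is copied from an early boot-time table during the late SMP setup step. All toy_* names are hypothetical and do not correspond to the kernel's real data structures.

```c
#include <assert.h>

#define TOY_NR_CPUS 4

/* Early table, filled during NUMA init (what an early lookup would consult). */
int toy_early_map[TOY_NR_CPUS];

/* Per-CPU copy, still zero until the late SMP setup step has run. */
int toy_percpu_node[TOY_NR_CPUS];

/* Early boot: CPUs 0-1 sit on node 0, CPUs 2-3 on node 1 in this model. */
void toy_numa_init(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_early_map[cpu] = cpu / 2;
}

/* Late boot: the step that finally populates the per-CPU copy. */
void toy_smp_prepare_cpus(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_percpu_node[cpu] = toy_early_map[cpu];
}

/* Valid as soon as toy_numa_init() has run. */
int toy_early_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return toy_early_map[cpu];
}

/* Returns 0 for every CPU until toy_smp_prepare_cpus() has run. */
int toy_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return toy_percpu_node[cpu];
}
```

In this model, a caller that runs between toy_numa_init() and toy_smp_prepare_cpus() gets node 0 from toy_cpu_to_node(3) even though CPU 3 belongs to node 1, which is the symptom named in the subject line.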
Re: [PATCH v2 01/13] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
On Mon, Jan 15, 2024 at 01:37:37PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 05:14:11PM +0800, pet...@redhat.com wrote:
> > From: Peter Xu
> >
> > Introduce a config option that will be selected as long as huge leaves are
> > involved in pgtable (thp or hugetlbfs). It would be useful to mark any
> > code with this new config that can process either hugetlb or thp pages in
> > any level that is higher than pte level.
> >
> > Signed-off-by: Peter Xu
> > ---
> > mm/Kconfig | 3 +++
> > 1 file changed, 3 insertions(+)
>
> So you mean anything that supports page table entires > PAGE_SIZE ?

Yes.

>
> Makes sense to me, though maybe add a comment in the kconfig?

Sure I'll add some.

>
> Reviewed-by: Jason Gunthorpe

Thanks for your reviews and also positive comments in previous versions,
Jason. I appreciate that.

I'm just pretty occupied with other tasks recently so I don't yet have time
to revisit this series, along with other comments yet. I'll do so and
reply to the comments / discussions together afterwards.

--
Peter Xu