Re: [PATCH v12 00/17] arm64: MMU enabled kexec relocation
Hi Pavel, after going through this series, I wonder whether this could be done by using an identity map through ttbr0. Then the process might be neater (I hope so): 1. set up an identity map in machine_kexec_post_load(), instead of copying the linear map; 2. also pass this temporary identity map to arm64_relocate_new_kernel(); 3. in arm64_relocate_new_kernel(), just load the identity map and re-enable the MMU, and after copying, just turn the MMU off. Thanks, Pingfan On Thu, Mar 4, 2021 at 3:47 PM Pavel Tatashin wrote: > > Changelog: > v12: > - A major change compared to previous version. Instead of using > contiguous VA range a copy of linear map is now used to perform > copying of segments during relocation as it was agreed in the > discussion of version 11 of this project. > - In addition to using linear map, I also took several ideas from > James Morse to better organize the kexec relocation: > 1. skip relocation function entirely if that is not needed > 2. remove the PoC flushing function since it is not needed > anymore with MMU enabled. > v11: > - Fixed missing KEXEC_CORE dependency for trans_pgd.c > - Removed useless "if(rc) return rc" statement (thank you Tyler Hicks) > - Another 12 patches were accepted into maintainer's tree. > Re-based patches against: > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git > Branch: for-next/kexec > v10: > - Addressed a lot of comments from James Morse and from Marc Zyngier > - Added review-by's > - Synchronized with mainline > > v9: - 9 patches from previous series landed in upstream, so now series > is smaller > - Added two patches from James Morse to address idmap issues for > machines > with high physical addresses. > - Addressed comments from Selin Dag about compiling issues. He also > tested > my series and got similar performance results: ~60 ms instead of > ~580 ms > with an initramfs size of ~120MB. 
> v8: > - Synced with mainline to keep series up-to-date > v7: > -- Addressed comments from James Morse > - arm64: hibernate: pass the allocated pgdp to ttbr0 > Removed "Fixes" tag, and added Added Reviewed-by: James Morse > - arm64: hibernate: check pgd table allocation > Sent out as a standalone patch so it can be sent to stable > Series applies on mainline + this patch > - arm64: hibernate: add trans_pgd public functions > Remove second allocation of tmp_pg_dir in swsusp_arch_resume > Added Reviewed-by: James Morse > - arm64: kexec: move relocation function setup and clean up > Fixed typo in commit log > Changed kern_reloc to phys_addr_t types. > Added explanation why kern_reloc is needed. > Split into four patches: > arm64: kexec: make dtb_mem always enabled > arm64: kexec: remove unnecessary debug prints > arm64: kexec: call kexec_image_info only once > arm64: kexec: move relocation function setup > - arm64: kexec: add expandable argument to relocation function > Changed types of new arguments from unsigned long to phys_addr_t. > Changed offset prefix to KEXEC_* > Split into four patches: > arm64: kexec: cpu_soft_restart change argument types > arm64: kexec: arm64_relocate_new_kernel clean-ups > arm64: kexec: arm64_relocate_new_kernel don't use x0 as temp > arm64: kexec: add expandable argument to relocation function > - arm64: kexec: configure trans_pgd page table for kexec > Added invalid entries into EL2 vector table > Removed KEXEC_EL2_VECTOR_TABLE_SIZE and > KEXEC_EL2_VECTOR_TABLE_OFFSET > Copy relocation functions and table into separate pages > Changed types in kern_reloc_arg. 
> Split into three patches: > arm64: kexec: offset for relocation function > arm64: kexec: kexec EL2 vectors > arm64: kexec: configure trans_pgd page table for kexec > - arm64: kexec: enable MMU during kexec relocation > Split into two patches: > arm64: kexec: enable MMU during kexec relocation > arm64: kexec: remove head from relocation argument > v6: > - Sync with mainline tip > - Added Acked's from Dave Young > v5: > - Addressed comments from Matthias Brugger: added review-by's, > improved > comments, and made cleanups to swsusp_arch_resume() in addition to > create_safe_exec_page(). > - Synced with mainline tip. > v4: > - Addressed comments from James Morse. > - Split "check pgd table allocation" into two patches, and moved to > the beginning of series for simpler backport of the fixes.
Re: [PATCH] drivers/arch_numa: remove rebudant setup_per_cpu_areas()
On Tue, Mar 9, 2021 at 10:02 PM Will Deacon wrote: > > [typo in subject "rebudant"] > > On Tue, Mar 09, 2021 at 06:21:38PM +0800, Pingfan Liu wrote: > > There are two identical implementations of setup_per_cpu_areas() in > > mm/percpu.c and drivers/base/arch_numa.c. > > > > Hence removing the one in arch_numa.c. And let arm64 drop > > HAVE_SETUP_PER_CPU_AREA. > > > > Signed-off-by: Pingfan Liu > > Cc: Catalin Marinas > > Cc: Will Deacon > > Cc: Greg Kroah-Hartman > > Cc: "Rafael J. Wysocki" > > Cc: Atish Patra > > Cc: linux-kernel@vger.kernel.org > > To: linux-arm-ker...@lists.infradead.org > > --- > > arch/arm64/Kconfig | 4 > > drivers/base/arch_numa.c | 22 -- > > 2 files changed, 26 deletions(-) > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > index 1f212b47a48a..d4bf8be0c3d5 100644 > > --- a/arch/arm64/Kconfig > > +++ b/arch/arm64/Kconfig > > @@ -1022,10 +1022,6 @@ config USE_PERCPU_NUMA_NODE_ID > > def_bool y > > depends on NUMA > > > > -config HAVE_SETUP_PER_CPU_AREA > > - def_bool y > > - depends on NUMA > > - > > config NEED_PER_CPU_EMBED_FIRST_CHUNK > > def_bool y > > depends on NUMA > > diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c > > index 4cc4e117727d..23e1e419a83d 100644 > > --- a/drivers/base/arch_numa.c > > +++ b/drivers/base/arch_numa.c > > @@ -167,28 +167,6 @@ static void __init pcpu_fc_free(void *ptr, size_t size) > > { > > memblock_free_early(__pa(ptr), size); > > } > > - > > -void __init setup_per_cpu_areas(void) > > -{ > > - unsigned long delta; > > - unsigned int cpu; > > - int rc; > > - > > - /* > > - * Always reserve area for module percpu variables. That's > > - * what the legacy allocator did. 
> > - */ > > - rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE, > > - PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, > > - pcpu_cpu_distance, > > - pcpu_fc_alloc, pcpu_fc_free); > > This doesn't look identical to the version in mm/percpu.c -- that one passes > NULL instead of 'pcpu_cpu_distance' and tries to allocate the pcpu memory on > the relevant NUMA nodes. In fact, if you could remove this function, you > could probably remove the whole HAVE_SETUP_PER_CPU_AREA block here as the > other functions are just used as helpers. So I'm not sure this is valid. > You are right. I need to rethink this to see whether the two functions can be unified into one. Thanks, Pingfan
[PATCH] drivers/arch_numa: remove rebudant setup_per_cpu_areas()
There are two identical implementations of setup_per_cpu_areas() in mm/percpu.c and drivers/base/arch_numa.c. Hence removing the one in arch_numa.c. And let arm64 drop HAVE_SETUP_PER_CPU_AREA. Signed-off-by: Pingfan Liu Cc: Catalin Marinas Cc: Will Deacon Cc: Greg Kroah-Hartman Cc: "Rafael J. Wysocki" Cc: Atish Patra Cc: linux-kernel@vger.kernel.org To: linux-arm-ker...@lists.infradead.org --- arch/arm64/Kconfig | 4 drivers/base/arch_numa.c | 22 -- 2 files changed, 26 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 1f212b47a48a..d4bf8be0c3d5 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1022,10 +1022,6 @@ config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA -config HAVE_SETUP_PER_CPU_AREA - def_bool y - depends on NUMA - config NEED_PER_CPU_EMBED_FIRST_CHUNK def_bool y depends on NUMA diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 4cc4e117727d..23e1e419a83d 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -167,28 +167,6 @@ static void __init pcpu_fc_free(void *ptr, size_t size) { memblock_free_early(__pa(ptr), size); } - -void __init setup_per_cpu_areas(void) -{ - unsigned long delta; - unsigned int cpu; - int rc; - - /* -* Always reserve area for module percpu variables. That's -* what the legacy allocator did. -*/ - rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE, - PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, - pcpu_cpu_distance, - pcpu_fc_alloc, pcpu_fc_free); - if (rc < 0) - panic("Failed to initialize percpu areas."); - - delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start; - for_each_possible_cpu(cpu) - __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu]; -} #endif /** -- 2.29.2
Re: [PATCH] arm64/irq: report bug if NR_IPI greater than max SGI during compile time
On Tue, Dec 8, 2020 at 5:51 PM Marc Zyngier wrote: > > On 2020-12-08 09:43, Pingfan Liu wrote: > > On Tue, Dec 8, 2020 at 5:31 PM Marc Zyngier wrote: > >> > >> On 2020-12-08 09:21, Pingfan Liu wrote: > >> > Although there is a runtime WARN_ON() when NR_IPR > max SGI, it had > >> > better > >> > do the check during built time, and associate these related code > >> > together. > >> > > >> > Signed-off-by: Pingfan Liu > >> > Cc: Catalin Marinas > >> > Cc: Will Deacon > >> > Cc: Thomas Gleixner > >> > Cc: Jason Cooper > >> > Cc: Marc Zyngier > >> > Cc: Mark Rutland > >> > To: linux-arm-ker...@lists.infradead.org > >> > Cc: linux-kernel@vger.kernel.org > >> > --- > >> > arch/arm64/kernel/smp.c| 2 ++ > >> > drivers/irqchip/irq-gic-v3.c | 2 +- > >> > drivers/irqchip/irq-gic.c | 2 +- > >> > include/linux/irqchip/arm-gic-common.h | 2 ++ > >> > 4 files changed, 6 insertions(+), 2 deletions(-) > >> > > >> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > >> > index 18e9727..9fc383c 100644 > >> > --- a/arch/arm64/kernel/smp.c > >> > +++ b/arch/arm64/kernel/smp.c > >> > @@ -33,6 +33,7 @@ > >> > #include > >> > #include > >> > #include > >> > +#include > >> > > >> > #include > >> > #include > >> > @@ -76,6 +77,7 @@ enum ipi_msg_type { > >> > IPI_WAKEUP, > >> > NR_IPI > >> > }; > >> > +static_assert(NR_IPI <= MAX_SGI_NUM); > >> > >> I am trying *very hard* to remove dependencies between the > >> architecture > >> code and random drivers, so this kind of check really is > >> counter-productive. > >> > >> Driver code should not have to know the number of IPIs, because there > >> is > >> no requirement that all IPIs should map 1:1 to SGIs. Conflating the > >> two > > > > Just curious about this. Is there an IPI which is not implemented by > > SGI? Or mapping several IPIs to a single SGI, and scatter out due to a > > global variable value? 
> > We currently have a single NS SGI left, and I'd like to move some of the > non-critical IPIs over to a dispatching mechanism (the two "CPU stop" IPIs > definitely are candidates for merging). That's not implemented yet, but > I don't see a need to add checks that would otherwise violate this > IPI/SGI distinction. Got it. Thanks for your detailed explanation. Regards, Pingfan
Re: [PATCH] arm64/irq: report bug if NR_IPI greater than max SGI during compile time
On Tue, Dec 8, 2020 at 5:31 PM Marc Zyngier wrote: > > On 2020-12-08 09:21, Pingfan Liu wrote: > > Although there is a runtime WARN_ON() when NR_IPR > max SGI, it had > > better > > do the check during built time, and associate these related code > > together. > > > > Signed-off-by: Pingfan Liu > > Cc: Catalin Marinas > > Cc: Will Deacon > > Cc: Thomas Gleixner > > Cc: Jason Cooper > > Cc: Marc Zyngier > > Cc: Mark Rutland > > To: linux-arm-ker...@lists.infradead.org > > Cc: linux-kernel@vger.kernel.org > > --- > > arch/arm64/kernel/smp.c| 2 ++ > > drivers/irqchip/irq-gic-v3.c | 2 +- > > drivers/irqchip/irq-gic.c | 2 +- > > include/linux/irqchip/arm-gic-common.h | 2 ++ > > 4 files changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > index 18e9727..9fc383c 100644 > > --- a/arch/arm64/kernel/smp.c > > +++ b/arch/arm64/kernel/smp.c > > @@ -33,6 +33,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -76,6 +77,7 @@ enum ipi_msg_type { > > IPI_WAKEUP, > > NR_IPI > > }; > > +static_assert(NR_IPI <= MAX_SGI_NUM); > > I am trying *very hard* to remove dependencies between the architecture > code and random drivers, so this kind of check really is > counter-productive. > > Driver code should not have to know the number of IPIs, because there is > no requirement that all IPIs should map 1:1 to SGIs. Conflating the two Just curious about this. Is there an IPI which is not implemented by SGI? Or mapping several IPIs to a single SGI, and scatter out due to a global variable value? Thanks, Pingfan > is already wrong, and I really don't want to add more of that. > > Thanks, > > M. > -- > Jazz is not dead. It just smells funny...
[PATCH] arm64/irq: report bug if NR_IPI greater than max SGI during compile time
Although there is a runtime WARN_ON() when NR_IPR > max SGI, it had better do the check during built time, and associate these related code together. Signed-off-by: Pingfan Liu Cc: Catalin Marinas Cc: Will Deacon Cc: Thomas Gleixner Cc: Jason Cooper Cc: Marc Zyngier Cc: Mark Rutland To: linux-arm-ker...@lists.infradead.org Cc: linux-kernel@vger.kernel.org --- arch/arm64/kernel/smp.c| 2 ++ drivers/irqchip/irq-gic-v3.c | 2 +- drivers/irqchip/irq-gic.c | 2 +- include/linux/irqchip/arm-gic-common.h | 2 ++ 4 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 18e9727..9fc383c 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include @@ -76,6 +77,7 @@ enum ipi_msg_type { IPI_WAKEUP, NR_IPI }; +static_assert(NR_IPI <= MAX_SGI_NUM); static int ipi_irq_base __read_mostly; static int nr_ipi __read_mostly = NR_IPI; diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 16fecc0..ee13f85 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -1162,7 +1162,7 @@ static void __init gic_smp_init(void) gic_starting_cpu, NULL); /* Register all 8 non-secure SGIs */ - base_sgi = __irq_domain_alloc_irqs(gic_data.domain, -1, 8, + base_sgi = __irq_domain_alloc_irqs(gic_data.domain, -1, MAX_SGI_NUM, NUMA_NO_NODE, _fwspec, false, NULL); if (WARN_ON(base_sgi <= 0)) diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index 6053245..07d36de 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -845,7 +845,7 @@ static __init void gic_smp_init(void) "irqchip/arm/gic:starting", gic_starting_cpu, NULL); - base_sgi = __irq_domain_alloc_irqs(gic_data[0].domain, -1, 8, + base_sgi = __irq_domain_alloc_irqs(gic_data[0].domain, -1, MAX_SGI_NUM, NUMA_NO_NODE, _fwspec, false, NULL); if (WARN_ON(base_sgi <= 0)) diff --git a/include/linux/irqchip/arm-gic-common.h 
b/include/linux/irqchip/arm-gic-common.h index fa8c045..7e45a9f 100644 --- a/include/linux/irqchip/arm-gic-common.h +++ b/include/linux/irqchip/arm-gic-common.h @@ -16,6 +16,8 @@ (GICD_INT_DEF_PRI << 8) |\ GICD_INT_DEF_PRI) +#define MAX_SGI_NUM8 + enum gic_type { GIC_V2, GIC_V3, -- 2.7.5
Re: [PATCH 2/2] x86/machine_kexec: disable PMU before jumping to new kernel
Sorry, I had misunderstood the code; NAKing it. On Mon, Nov 23, 2020 at 1:37 PM Pingfan Liu wrote: > > During jumping to the new kernel, on the crashed cpu, the memory mapping > switches from an old one to an identity one. It had better disable PMU to > suppress NMI, which can be delivered using the old mapping. > > Also on x86_64, idt_invalidate() to clear idt as on 32 bits. > > Signed-off-by: Pingfan Liu > Cc: Peter Zijlstra > Cc: Ingo Molnar > Cc: Arnaldo Carvalho de Melo > Cc: Mark Rutland > Cc: Alexander Shishkin > Cc: Jiri Olsa > Cc: Namhyung Kim > Cc: Thomas Gleixner > Cc: Borislav Petkov > Cc: "H. Peter Anvin" > Cc: Omar Sandoval > Cc: Andrew Morton > Cc: Mike Rapoport > To: x...@kernel.org > Cc: linux-kernel@vger.kernel.org > --- > arch/x86/kernel/machine_kexec_32.c | 1 + > arch/x86/kernel/machine_kexec_64.c | 3 +++ > 2 files changed, 4 insertions(+) > > diff --git a/arch/x86/kernel/machine_kexec_32.c > b/arch/x86/kernel/machine_kexec_32.c > index 64b00b0..72c6100 100644 > --- a/arch/x86/kernel/machine_kexec_32.c > +++ b/arch/x86/kernel/machine_kexec_32.c > @@ -191,6 +191,7 @@ void machine_kexec(struct kimage *image) > /* Interrupts aren't acceptable while we reboot */ > local_irq_disable(); > hw_breakpoint_disable(); > + perf_pmu_disable_all(); > > if (image->preserve_context) { > #ifdef CONFIG_X86_IO_APIC > diff --git a/arch/x86/kernel/machine_kexec_64.c > b/arch/x86/kernel/machine_kexec_64.c > index a29a44a..238893e 100644 > --- a/arch/x86/kernel/machine_kexec_64.c > +++ b/arch/x86/kernel/machine_kexec_64.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -338,6 +339,8 @@ void machine_kexec(struct kimage *image) > /* Interrupts aren't acceptable while we reboot */ > local_irq_disable(); > hw_breakpoint_disable(); > + perf_pmu_disable_all(); > + idt_invalidate(phys_to_virt(0)); > > if (image->preserve_context) { > #ifdef CONFIG_X86_IO_APIC > -- > 2.7.5 >
Re: [PATCH 1/2] events/core: introduce perf_pmu_disable_all() to turn off all PMU
On Mon, Nov 23, 2020 at 10:05 PM Peter Zijlstra wrote: > > On Mon, Nov 23, 2020 at 01:37:25PM +0800, Pingfan Liu wrote: > > > +/* When crashed, other cpus hang in idle loop, so here do an emergency job > > under no lock */ > > -ENOPARSE, -ETOOLONG > > > +void perf_pmu_disable_all(void) > > +{ > > + struct pmu *pmu; > > + > > + list_for_each_entry(pmu, &pmus, entry) > > + if (pmu->pmu_disable) > > + pmu->pmu_disable(pmu); > > +} > > This violates both locking rules and coding style. Oops. I will rework it to see how it can run safely in crash context. Thanks, Pingfan
[PATCH 1/2] events/core: introduce perf_pmu_disable_all() to turn off all PMU
In crash context, NMI should be suppressed before jumping to a new kernel. Naturally, as the source of NMI on some arches, PMU should be turned off at that time. Introduce perf_pmu_disable_all() to achieve the goal. Signed-off-by: Pingfan Liu Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Thomas Gleixner Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Omar Sandoval Cc: Andrew Morton Cc: Mike Rapoport Cc: x...@kernel.org To: linux-kernel@vger.kernel.org --- include/linux/perf_event.h | 1 + kernel/events/core.c | 10 ++ 2 files changed, 11 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 96450f6..f4baa87 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -965,6 +965,7 @@ extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event); extern void perf_event_print_debug(void); extern void perf_pmu_disable(struct pmu *pmu); extern void perf_pmu_enable(struct pmu *pmu); +extern void perf_pmu_disable_all(void); extern void perf_sched_cb_dec(struct pmu *pmu); extern void perf_sched_cb_inc(struct pmu *pmu); extern int perf_event_task_disable(void); diff --git a/kernel/events/core.c b/kernel/events/core.c index dc568ca..c8e04a5 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1205,6 +1205,16 @@ void perf_pmu_enable(struct pmu *pmu) pmu->pmu_enable(pmu); } +/* When crashed, other cpus hang in idle loop, so here do an emergency job under no lock */ +void perf_pmu_disable_all(void) +{ + struct pmu *pmu; + + list_for_each_entry(pmu, &pmus, entry) + if (pmu->pmu_disable) + pmu->pmu_disable(pmu); +} + static DEFINE_PER_CPU(struct list_head, active_ctx_list); /* -- 2.7.5
[PATCH 2/2] x86/machine_kexec: disable PMU before jumping to new kernel
During jumping to the new kernel, on the crashed cpu, the memory mapping switches from an old one to an identity one. It had better disable PMU to suppress NMI, which can be delivered using the old mapping. Also on x86_64, idt_invalidate() to clear idt as on 32 bits. Signed-off-by: Pingfan Liu Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Thomas Gleixner Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Omar Sandoval Cc: Andrew Morton Cc: Mike Rapoport To: x...@kernel.org Cc: linux-kernel@vger.kernel.org --- arch/x86/kernel/machine_kexec_32.c | 1 + arch/x86/kernel/machine_kexec_64.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c index 64b00b0..72c6100 100644 --- a/arch/x86/kernel/machine_kexec_32.c +++ b/arch/x86/kernel/machine_kexec_32.c @@ -191,6 +191,7 @@ void machine_kexec(struct kimage *image) /* Interrupts aren't acceptable while we reboot */ local_irq_disable(); hw_breakpoint_disable(); + perf_pmu_disable_all(); if (image->preserve_context) { #ifdef CONFIG_X86_IO_APIC diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index a29a44a..238893e 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -338,6 +339,8 @@ void machine_kexec(struct kimage *image) /* Interrupts aren't acceptable while we reboot */ local_irq_disable(); hw_breakpoint_disable(); + perf_pmu_disable_all(); + idt_invalidate(phys_to_virt(0)); if (image->preserve_context) { #ifdef CONFIG_X86_IO_APIC -- 2.7.5
[PATCH 3/3] kernel/watchdog: use soft lockup to detect irq flood
When irq flood happens, interrupt handler occupies all of the cpu time. This results in a situation where soft lockup can be observed, although it is different from the design purpose of soft lockup. In order to distinguish this situation, it is helpful to print out the statistics of irq frequency when warning soft lockup to evaluate the potential irq flood. Thomas and Guilherme suggested patches to suppress the odd irq in different situation. [1].[2]. But it seems to be an open question in a near future. For now, it had better print some hints for users than nothing. [1]: https://lore.kernel.org/lkml/87tuueftou@nanos.tec.linutronix.de/ [2]: https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/ Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Jisheng Zhang Cc: "Peter Zijlstra (Intel)" Cc: Vlastimil Babka Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: ke...@lists.infradead.org To: linux-kernel@vger.kernel.org --- kernel/watchdog.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 1cc619a..a0ab2a8 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -175,6 +176,9 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); static DEFINE_PER_CPU(bool, soft_watchdog_warn); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved); +static DEFINE_PER_CPU(unsigned long, last_irq_sum); +static DEFINE_PER_CPU(unsigned long, last_unused_irq_sum); + static unsigned long soft_lockup_nmi_warn; static int __init nowatchdog_setup(char *str) @@ -353,6 +357,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) /* kick the softlockup detector */ if (completion_done(this_cpu_ptr(_completion))) { + __this_cpu_write(last_irq_sum, kstat_this_cpu->irqs_sum); + __this_cpu_write(last_unused_irq_sum, 
kstat_this_cpu->unused_irqs_sum); reinit_completion(this_cpu_ptr(_completion)); stop_one_cpu_nowait(smp_processor_id(), softlockup_fn, NULL, @@ -386,6 +392,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + unsigned long irq_sum, unused_irq_sum; + unsigned int seconds; + /* * If a virtual machine is stopped by the host it can look to * the watchdog like a soft lockup, check to see if the host @@ -409,9 +418,15 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) } } + irq_sum = kstat_this_cpu->irqs_sum - __this_cpu_read(last_irq_sum); + unused_irq_sum = kstat_this_cpu->unused_irqs_sum - + __this_cpu_read(last_unused_irq_sum); + seconds = (unsigned int)convert_seconds(duration); pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n", - smp_processor_id(), (unsigned int)convert_seconds(duration), + smp_processor_id(), seconds, current->comm, task_pid_nr(current)); + pr_emerg("%lu irqs at rate: %lu / s, %lu unused irq at rate: %lu / s\n", + irq_sum, irq_sum/seconds, unused_irq_sum, unused_irq_sum/seconds); print_modules(); print_irqtrace_events(current); if (regs) -- 2.7.5
[PATCH 2/3] kernel/watchdog: make watchdog_touch_ts more accurate by using nanosecond
The incoming statistics of irq average number will base on the delta of watchdog_touch_ts. Improving the accuracy of watchdog_touch_ts from second to nanosecond can help improve the accuracy of the statistics. Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Jisheng Zhang Cc: "Peter Zijlstra (Intel)" Cc: Vlastimil Babka Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: ke...@lists.infradead.org To: linux-kernel@vger.kernel.org --- kernel/watchdog.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 5abb5b2..1cc619a 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -207,7 +207,7 @@ static void __lockup_detector_cleanup(void); * the thresholds with a factor: we make the soft threshold twice the amount of * time the hard threshold is. */ -static int get_softlockup_thresh(void) +static unsigned int get_softlockup_thresh(void) { return watchdog_thresh * 2; } @@ -217,9 +217,9 @@ static int get_softlockup_thresh(void) * resolution, and we don't need to waste time with a big divide when * 2^30ns == 1.074s. 
*/ -static unsigned long get_timestamp(void) +static unsigned long convert_seconds(unsigned long ns) { - return running_clock() >> 30LL; /* 2^30 ~= 10^9 */ + return ns >> 30LL; /* 2^30 ~= 10^9 */ } static void set_sample_period(void) @@ -238,7 +238,7 @@ static void set_sample_period(void) /* Commands for resetting the watchdog */ static void __touch_watchdog(void) { - __this_cpu_write(watchdog_touch_ts, get_timestamp()); + __this_cpu_write(watchdog_touch_ts, running_clock()); } /** @@ -289,14 +289,15 @@ void touch_softlockup_watchdog_sync(void) __this_cpu_write(watchdog_touch_ts, SOFTLOCKUP_RESET); } -static int is_softlockup(unsigned long touch_ts) +static unsigned long is_softlockup(unsigned long touch_ts) { - unsigned long now = get_timestamp(); + unsigned long span, now = running_clock(); + span = now - touch_ts; if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){ /* Warn about unreasonable delays. */ - if (time_after(now, touch_ts + get_softlockup_thresh())) - return now - touch_ts; + if (time_after(convert_seconds(span), (unsigned long)get_softlockup_thresh())) + return span; } return 0; } @@ -340,9 +341,8 @@ static int softlockup_fn(void *data) /* watchdog kicker functions */ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) { - unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts); + unsigned long duration, touch_ts = __this_cpu_read(watchdog_touch_ts); struct pt_regs *regs = get_irq_regs(); - int duration; int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace; if (!watchdog_enabled) @@ -410,7 +410,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) } pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n", - smp_processor_id(), duration, + smp_processor_id(), (unsigned int)convert_seconds(duration), current->comm, task_pid_nr(current)); print_modules(); print_irqtrace_events(current); -- 2.7.5
[PATCH 0/3] use soft lockup to detect irq flood
A chipset's pins are often multiplexed, since the demand for functions outstrips the finite pin count. This requires careful configuration in firmware or the OS; if that is missing, it can contribute to most irq flood cases, which appear as soft lockup issues on Linux. Strictly speaking, soft lockup and irq flood are different things to overcome, and it is helpful to make users aware of that distinction. This series shows irq statistics on soft lockup. The statistics can be used to evaluate the possibility of an irq flood, and as a rough input to the proposed kernel parameter "irqstorm_limit" in [1]. It is not easy to find a common way to work around an irq flood, which may be raised by different root causes; for now, it is still an open question. Thomas and Guilherme have suggested patches to suppress the odd irq in different situations [1][2]. [1]: https://lore.kernel.org/lkml/87tuueftou@nanos.tec.linutronix.de/ [2]: https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/ Cc: Thomas Gleixner Cc: Jisheng Zhang Cc: "Peter Zijlstra (Intel)" Cc: Vlastimil Babka Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: ke...@lists.infradead.org To: linux-kernel@vger.kernel.org Pingfan Liu (3): x86/irq: account the unused irq kernel/watchdog: make watchdog_touch_ts more accurate by using nanosecond kernel/watchdog: use soft lockup to detect irq flood arch/x86/kernel/irq.c | 1 + include/linux/kernel_stat.h | 1 + kernel/watchdog.c | 37 ++--- 3 files changed, 28 insertions(+), 11 deletions(-) -- 2.7.5
[PATCH 1/3] x86/irq: account the unused irq
Accounting the unused irq in order to count it if irq flood. Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Jisheng Zhang Cc: "Peter Zijlstra (Intel)" Cc: Vlastimil Babka Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: ke...@lists.infradead.org To: linux-kernel@vger.kernel.org --- arch/x86/kernel/irq.c | 1 + include/linux/kernel_stat.h | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index c5dd503..6f583a7 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -254,6 +254,7 @@ DEFINE_IDTENTRY_IRQ(common_interrupt) pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n", __func__, smp_processor_id(), vector); + __this_cpu_inc(kstat.unused_irqs_sum); } else { __this_cpu_write(vector_irq[vector], VECTOR_UNUSED); } diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 89f0745..c8d5cb8 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -37,6 +37,7 @@ struct kernel_cpustat { struct kernel_stat { unsigned long irqs_sum; + unsigned long unused_irqs_sum; unsigned int softirqs[NR_SOFTIRQS]; }; -- 2.7.5
Re: [PATCH 0/3] warn and suppress irqflood
On Wed, Oct 28, 2020 at 7:58 PM Thomas Gleixner wrote: > [...] > --- > include/linux/irqdesc.h |4 ++ > kernel/irq/manage.c |3 + > kernel/irq/spurious.c | 74 > +++- > 3 files changed, 61 insertions(+), 20 deletions(-) > > --- a/include/linux/irqdesc.h > +++ b/include/linux/irqdesc.h > @@ -30,6 +30,8 @@ struct pt_regs; > * @tot_count: stats field for non-percpu irqs > * @irq_count: stats field to detect stalled irqs > * @last_unhandled:aging timer for unhandled count > + * @storm_count: Counter for irq storm detection > + * @storm_checked: Timestamp for irq storm detection > * @irqs_unhandled:stats field for spurious unhandled interrupts > * @threads_handled: stats field for deferred spurious detection of > threaded handlers > * @threads_handled_last: comparator field for deferred spurious detection > of theraded handlers > @@ -65,6 +67,8 @@ struct irq_desc { > unsigned inttot_count; > unsigned intirq_count; /* For detecting broken IRQs > */ > unsigned long last_unhandled; /* Aging timer for unhandled > count */ > + unsigned long storm_count; > + unsigned long storm_checked; > unsigned intirqs_unhandled; > atomic_tthreads_handled; > int threads_handled_last; > --- a/kernel/irq/manage.c > +++ b/kernel/irq/manage.c > @@ -1581,6 +1581,9 @@ static int > if (!shared) { > init_waitqueue_head(>wait_for_threads); > > + /* Take a timestamp for interrupt storm detection */ > + desc->storm_checked = jiffies; > + > /* Setup the type (level, edge polarity) if configured: */ > if (new->flags & IRQF_TRIGGER_MASK) { > ret = __irq_set_trigger(desc, > --- a/kernel/irq/spurious.c > +++ b/kernel/irq/spurious.c > @@ -21,6 +21,7 @@ static void poll_spurious_irqs(struct ti > static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs); > static int irq_poll_cpu; > static atomic_t irq_poll_active; > +static unsigned long irqstorm_limit __ro_after_init; > > /* > * We wait here for a poller to finish. 
> @@ -189,18 +190,21 @@ static inline int bad_action_ret(irqretu > * (The other 100-of-100,000 interrupts may have been a correctly > * functioning device sharing an IRQ with the failing one) > */ > -static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret) > +static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret, > +bool storm) > { > unsigned int irq = irq_desc_get_irq(desc); > struct irqaction *action; > unsigned long flags; > > - if (bad_action_ret(action_ret)) { > - printk(KERN_ERR "irq event %d: bogus return value %x\n", > - irq, action_ret); > - } else { > - printk(KERN_ERR "irq %d: nobody cared (try booting with " > + if (!storm) { > + if (bad_action_ret(action_ret)) { > + pr_err("irq event %d: bogus return value %x\n", > + irq, action_ret); > + } else { > + pr_err("irq %d: nobody cared (try booting with " > "the \"irqpoll\" option)\n", irq); > + } > } > dump_stack(); > printk(KERN_ERR "handlers:\n"); > @@ -228,7 +232,7 @@ static void report_bad_irq(struct irq_de > > if (count > 0) { > count--; > - __report_bad_irq(desc, action_ret); > + __report_bad_irq(desc, action_ret, false); > } > } > > @@ -267,6 +271,33 @@ try_misrouted_irq(unsigned int irq, stru > return action && (action->flags & IRQF_IRQPOLL); > } > > +static void disable_stuck_irq(struct irq_desc *desc, irqreturn_t action_ret, > + const char *reason, bool storm) > +{ > + __report_bad_irq(desc, action_ret, storm); > + pr_emerg("Disabling %s IRQ #%d\n", reason, irq_desc_get_irq(desc)); > + desc->istate |= IRQS_SPURIOUS_DISABLED; > + desc->depth++; > + irq_disable(desc); > +} > + > +/* Interrupt storm detector for runaway interrupts (handled or not). 
*/ > +static bool irqstorm_detected(struct irq_desc *desc) > +{ > + unsigned long now = jiffies; > + > + if (++desc->storm_count < irqstorm_limit) { > + if (time_after(now, desc->storm_checked + HZ)) { > + desc->storm_count = 0; > + desc->storm_checked = now; > + } > + return false; > + } > + > + disable_stuck_irq(desc, IRQ_NONE, "runaway", true); > + return true; > +} > + > #define SPURIOUS_DEFERRED 0x8000 > > void note_interrupt(struct
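For reference, the rate check in `irqstorm_detected()` quoted above can be modeled in plain userspace C. The struct, the tick values, and `MODEL_HZ` below are hypothetical stand-ins for `irq_desc`, `jiffies`, and the kernel's `HZ`/`irqstorm_limit` — this is a sketch of the logic, not the kernel code itself:

```c
#include <stdbool.h>

/* Userspace model of the irqstorm_detected() logic quoted above:
 * the per-IRQ counter is reset once per "second" (MODEL_HZ ticks) as
 * long as it stays below the limit; reaching the limit within one
 * second counts as a storm.  All names are hypothetical stand-ins. */
#define MODEL_HZ 100UL

struct irq_model {
	unsigned long storm_count;
	unsigned long storm_checked;	/* tick of the last counter reset */
};

static bool storm_detected(struct irq_model *d, unsigned long now,
			   unsigned long limit)
{
	if (++d->storm_count < limit) {
		if (now > d->storm_checked + MODEL_HZ) {
			d->storm_count = 0;
			d->storm_checked = now;
		}
		return false;
	}
	return true;	/* the kernel patch would disable the IRQ here */
}
```

A slow interrupt source never trips the check because its counter is reset every second; only a source that fires `limit` times within one second is reported.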
Re: [PATCH 0/3] warn and suppress irqflood
On Wed, Oct 28, 2020 at 12:58:41PM +0100, Thomas Gleixner wrote: > On Wed, Oct 28 2020 at 14:02, Pingfan Liu wrote: > > On Tue, Oct 27, 2020 at 3:59 AM Thomas Gleixner wrote: > >> Also Liu's patch only works if: > >> > >> 1) CONFIG_IRQ_TIME_ACCOUNTING is enabled > > > > I wonder whether it can be made a default option by the following > > method: > > DEFINE_STATIC_KEY_FALSE(irqtime_account), and enable it according to > > a boot param. > > How so? > > config IRQ_TIME_ACCOUNTING > depends on HAVE_IRQ_TIME_ACCOUNTING && > !VIRT_CPU_ACCOUNTING_NATIVE > Look closely at the two config values: -1. HAVE_IRQ_TIME_ACCOUNTING: it is selected by most of the popular arches, and can be further relaxed. It implies sched_clock() is fast enough for sampling. With the current code, the variable sched_clock_irqtime=0 can be used to turn off irqtime accounting on some arches with a slow sched_clock(). And it can be even better by using DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime) So the prerequisite can be relaxed to "depends on !VIRT_CPU_ACCOUNTING_NATIVE" In case I cannot express this clearly, could you have a look at the demo patch? That patch _assumes_ that irqtime accounting is costly and is not turned on by default. If turned on, it will cost one extra jmp compared with the current implementation. And I think it is critical to my [1/3] whether this assumption is reasonable. -2. VIRT_CPU_ACCOUNTING_NATIVE can only be selected by powerpc and ia64. In fact, I have a separate patch for powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE to utilize my [1/3]. --- diff --git a/init/Kconfig b/init/Kconfig index c944691..16d168b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -490,7 +490,7 @@ endchoice config IRQ_TIME_ACCOUNTING bool "Fine granularity task level IRQ time accounting" - depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE + depends on !VIRT_CPU_ACCOUNTING_NATIVE help Select this option to enable fine granularity task irq time accounting. 
This is done by reading a timestamp on each diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 5a55d23..3ab7e1d 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -19,7 +19,7 @@ */ DEFINE_PER_CPU(struct irqtime, cpu_irqtime); -static int sched_clock_irqtime; +DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime); void enable_sched_clock_irqtime(void) { @@ -49,13 +49,14 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta, */ void irqtime_account_irq(struct task_struct *curr) { - struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); + struct irqtime *irqtime; s64 delta; int cpu; - if (!sched_clock_irqtime) + if (!static_branch_unlikely(&sched_clock_irqtime)) return; + irqtime = this_cpu_ptr(&cpu_irqtime); cpu = smp_processor_id(); delta = sched_clock_cpu(cpu) - irqtime->irq_start_time; irqtime->irq_start_time += delta; @@ -84,16 +85,7 @@ static u64 irqtime_tick_accounted(u64 maxtime) return delta; } -#else /* CONFIG_IRQ_TIME_ACCOUNTING */ - -#define sched_clock_irqtime(0) - -static u64 irqtime_tick_accounted(u64 dummy) -{ - return 0; -} - -#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */ +#endif static inline void task_group_account_field(struct task_struct *p, int index, u64 tmp) @@ -475,7 +467,7 @@ void account_process_tick(struct task_struct *p, int user_tick) if (vtime_accounting_enabled_this_cpu()) return; - if (sched_clock_irqtime) { + if (static_branch_unlikely(&sched_clock_irqtime)) { irqtime_account_process_tick(p, user_tick, 1); return; } @@ -504,7 +496,7 @@ void account_idle_ticks(unsigned long ticks) { u64 cputime, steal; - if (sched_clock_irqtime) { + if (static_branch_unlikely(&sched_clock_irqtime)) { irqtime_account_idle_ticks(ticks); return; } -- 2.7.5 [...] > + > +static int __init irqstorm_setup(char *arg) > +{ > + int res = kstrtoul(arg, 0, &irqstorm_limit); > + > + if (!res) { > + pr_info("Interrupt storm detector enabled. 
Limit=%lu / s\n", > + irqstorm_limit); > + } > + return !!res; > +} > +__setup("irqstorm_limit", irqstorm_setup); This configuration-independent method looks appealing, and I am glad to give it a try. But irqstorm_limit may be a hard choice. Maybe by the formula: instructions-per-cpu-per-second / instruction count of the failed irq path? It is hard to estimate "instructions-per-cpu-per-second". Thanks, Pingfan
Re: [PATCH 0/3] warn and suppress irqflood
On Tue, Oct 27, 2020 at 3:59 AM Thomas Gleixner wrote: > [...] > > And contrary to Liu's patches which try to disable a requested interrupt > if too many of them arrive, the kernel cannot do anything because there > is nothing to disable in your case. That's why you needed to do the MSI > disable magic in the early PCI quirks which run before interrupts get > enabled. > > Also Liu's patch only works if: > > 1) CONFIG_IRQ_TIME_ACCOUNTING is enabled I wonder whether it can be made a default option by the following method: DEFINE_STATIC_KEY_FALSE(irqtime_account), and enable it according to a boot param. This will have no impact on performance with the disabled branch. Meanwhile, users can easily turn on the option to detect an irq flood without recompiling the kernel. If it is doable, I will rework only [1/2]. > > 2) the runaway interrupt has been requested by the relevant driver in > the dump kernel. Yes, it raises a big challenge to my method. The kdump kernel misses the whole picture of the first kernel's irq routing. Thanks, Pingfan
Re: [Skiboot] [PATCH 0/3] warn and suppress irqflood
On Sun, Oct 25, 2020 at 8:21 PM Oliver O'Halloran wrote: > > On Sun, Oct 25, 2020 at 10:22 PM Pingfan Liu wrote: > > > > On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner wrote: > > > > > > On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote: > > > > I hit an irqflood bug on a powerpc platform, and two years ago, on an x86 > > > > platform. > > > > When the bug happens, the kernel is totally occupied by irqs. > > > > Currently, > > > > there > > > > may be nothing or just a soft lockup warning shown in the console. It is > > > > better > > > > to warn users with irq flood info. > > > > > > > > In the kdump case, the kernel can move on by suppressing the irq flood. > > > > > > You're curing the symptom not the cause and the cure is just magic and > > > can't work reliably. > > Yeah, it is magic. But at least, it is better to printk something and > > alert users about what happened. With the current code, it may show nothing > > when the system hangs. > > > > > > Where is that irq flood originated from and why is none of the > > > mechanisms we have in place to shut it up working? > > The bug originates from a driver, tpm_i2c_nuvoton, which calls the i2c-bus > > driver (i2c-opal.c). After i2c_opal_send_request(), the bug is > > triggered. > > > > But things are complicated by introducing a firmware layer: Skiboot. > > This software layer hides the detail of manipulating the hardware from > > Linux. > > > > I guess the software logic can not enter a sane state when the kernel crashes. > > > > Cc Skiboot and ppc64 community to see whether anyone has an idea about it. > > What system are you using? Here is the info; if it is not enough, I will get more. 
Product Name : OpenPOWER Firmware Product Version : open-power-SUPERMICRO-P9DSU-V1.16-20180531-imp Product Extra : op-build-e4b3eb5 Product Extra : skiboot-v6.0-p1da203b Product Extra : hostboot-f911e5c-pda8239f Product Extra : occ-77bb5e6-p623d1cd Product Extra : linux-4.16.7-openpower2-pbc45895 Product Extra : petitboot-v1.7.1-pf773c0d Product Extra : machine-xml-218a77a > > There's an external interrupt pin which is supposed to be wired to the > TPM. I think we bounce that interrupt to FW by default since the > external interrupt is sometimes used for other system-specific > purposes. Odds are FW doesn't know what to do with it so you > effectively have an always-on LSI. I fixed a similar bug a while ago > by having skiboot mask any interrupts it doesn't have a handler for, This sounds like the root cause. But here Skiboot should have a handler; otherwise the first kernel could not run smoothly. Do you have any idea how an unexpected re-initialization could introduce an insane state? Thanks, Pingfan
Re: [PATCH 0/3] warn and suppress irqflood
On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner wrote: > > On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote: > > I hit an irqflood bug on a powerpc platform, and two years ago, on an x86 > > platform. > > When the bug happens, the kernel is totally occupied by irqs. Currently, > > there > > may be nothing or just a soft lockup warning shown in the console. It is better > > to warn users with irq flood info. > > > > In the kdump case, the kernel can move on by suppressing the irq flood. > > You're curing the symptom not the cause and the cure is just magic and > can't work reliably. Yeah, it is magic. But at least, it is better to printk something and alert users about what happened. With the current code, it may show nothing when the system hangs. > > Where is that irq flood originated from and why is none of the > mechanisms we have in place to shut it up working? The bug originates from a driver, tpm_i2c_nuvoton, which calls the i2c-bus driver (i2c-opal.c). After i2c_opal_send_request(), the bug is triggered. But things are complicated by introducing a firmware layer: Skiboot. This software layer hides the detail of manipulating the hardware from Linux. I guess the software logic can not enter a sane state when the kernel crashes. Cc Skiboot and ppc64 community to see whether anyone has an idea about it. Thanks, Pingfan
[PATCH 3/3] Documentation: introduce a param "irqflood_suppress"
The param "irqflood_suppress" helps the capture kernel survive an irq flood. Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Jisheng Zhang Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: Marc Zyngier Cc: Linus Walleij Cc: afzal mohammed Cc: Lina Iyer Cc: "Gustavo A. R. Silva" Cc: Maulik Shah Cc: Al Viro Cc: Jonathan Corbet Cc: Pawan Gupta Cc: Mike Kravetz Cc: Oliver Neukum To: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: ke...@lists.infradead.org --- Documentation/admin-guide/kernel-parameters.txt | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a106874..0a25a05 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2009,6 +2009,9 @@ for it. Also check all handlers each timer interrupt. Intended to get systems with badly broken firmware running. + irqflood_suppress [HW] + When an irq fully occupies a cpu for a long time, suppress + it so that the kernel can move on. This is useful in the capture kernel. isapnp= [ISAPNP] Format: ,,, -- 2.7.5
[PATCH 1/3] kernel/watchdog: show irq percentage if irq floods
In the kdump case, the capture kernel has no chance to shut down devices, which leaves the devices in an unstable state. An irq flood is one of the consequences. When irqs flood, the kernel is totally occupied by irq handling. Currently, there may be nothing or just a soft lockup warning shown in the console. It is better to warn users with irq flood info. The soft lockup watchdog is a good foundation on which to implement the detector. An irq flood is reported if the following two conditions are met: -1. irqs occupy too much (98%) of the past interval. -2. no task has been scheduled in the past interval. This is implemented by check_hint. A note: is_softlockup() implies the 2nd condition, but unfortunately, an irq flood can come from anywhere. If it comes via irq_enter_rcu()->tick_irq_enter(), then touch_softlockup_watchdog_sched() will reset the watchdog, and the soft lockup is never detected. Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Jisheng Zhang Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: Marc Zyngier Cc: Linus Walleij Cc: afzal mohammed Cc: Lina Iyer Cc: "Gustavo A. R. 
Silva" Cc: Maulik Shah Cc: Al Viro Cc: Jonathan Corbet Cc: Pawan Gupta Cc: Mike Kravetz Cc: Oliver Neukum To: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: ke...@lists.infradead.org --- kernel/watchdog.c | 41 + 1 file changed, 41 insertions(+) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 5abb5b2..230ac38 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -23,6 +23,7 @@ #include #include #include +#include <linux/kernel_stat.h> #include #include @@ -175,6 +176,13 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); static DEFINE_PER_CPU(bool, soft_watchdog_warn); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved); + +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static DEFINE_PER_CPU(long, check_hint) = {-1}; +static DEFINE_PER_CPU(unsigned long, last_irq_ts); +static DEFINE_PER_CPU(unsigned long, last_total_ts); +#endif + static unsigned long soft_lockup_nmi_warn; static int __init nowatchdog_setup(char *str) @@ -331,12 +339,43 @@ static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work); */ static int softlockup_fn(void *data) { +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + __this_cpu_write(check_hint, -1); +#endif __touch_watchdog(); complete(this_cpu_ptr(&softlockup_completion)); return 0; } +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static void check_irq_flood(void) +{ + unsigned long irqts, totalts, percent, cnt; + u64 *cpustat = kcpustat_this_cpu->cpustat; + + totalts = running_clock(); + irqts = cpustat[CPUTIME_IRQ] + cpustat[CPUTIME_SOFTIRQ]; + cnt = __this_cpu_inc_return(check_hint); + + if (cnt >= 5) { + totalts = totalts - __this_cpu_read(last_total_ts); + irqts = irqts - __this_cpu_read(last_irq_ts); + percent = irqts * 100 / totalts; + percent = percent < 100 ? 
percent : 100; + __this_cpu_write(check_hint, -1); + if (percent >= 98) + pr_info("Irq flood occupies more than %lu%% of the past %lu seconds\n", + percent, totalts >> 30); + } else if (cnt == 0) { + __this_cpu_write(last_total_ts, totalts); + __this_cpu_write(last_irq_ts, irqts); + } +} +#else +static void check_irq_flood(void){} +#endif + /* watchdog kicker functions */ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) { @@ -348,6 +387,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) if (!watchdog_enabled) return HRTIMER_NORESTART; + /* When irqs flood, the watchdog may still be touched. Hence this can not be done inside the lockup check */ + check_irq_flood(); /* kick the hardlockup detector */ watchdog_interrupt_count(); -- 2.7.5
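The sampling scheme in this patch — record a baseline on the first watchdog tick of a window, then on the 5th tick compare the irq-time delta against the wall-clock delta — can be sketched as a userspace model. The struct and the nanosecond inputs below are hypothetical stand-ins for the per-cpu variables, `running_clock()`, and `kcpustat`; this only illustrates the windowing logic:

```c
#include <stdbool.h>

/* Userspace model of check_irq_flood() above: a window starts when
 * hint goes from -1 to 0 (baseline recorded) and is evaluated on the
 * 5th sample; >= 98% irq time counts as a flood.  Hypothetical names. */
struct flood_model {
	long hint;			/* -1: no window in progress */
	unsigned long long last_irq;
	unsigned long long last_total;
};

static bool flood_check(struct flood_model *m,
			unsigned long long total_ns,
			unsigned long long irq_ns)
{
	long cnt = ++m->hint;

	if (cnt >= 5) {
		unsigned long long dt = total_ns - m->last_total;
		unsigned long long di = irq_ns - m->last_irq;
		unsigned long percent = dt ? (unsigned long)(di * 100 / dt) : 0;

		if (percent > 100)
			percent = 100;
		m->hint = -1;		/* start a fresh window next time */
		return percent >= 98;
	}
	if (cnt == 0) {			/* baseline for this window */
		m->last_total = total_ns;
		m->last_irq = irq_ns;
	}
	return false;
}
```

Resetting `hint` to -1 both here and in `softlockup_fn()` mirrors the patch: any successful scheduling of the stop-work restarts the window, so only an unbroken irq-dominated interval can trip the 98% threshold.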
[PATCH 2/3] kernel/watchdog: suppress max irq when irq floods
The capture kernel should try its best to save the crash info. Normally, an irq flood is caused by some trivial device, which has no impact on saving the vmcore. Introduce a parameter "irqflood_suppress" to enable suppressing the irq flood. Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Jisheng Zhang Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: Marc Zyngier Cc: Linus Walleij Cc: afzal mohammed Cc: Lina Iyer Cc: "Gustavo A. R. Silva" Cc: Maulik Shah Cc: Al Viro Cc: Jonathan Corbet Cc: Pawan Gupta Cc: Mike Kravetz Cc: Oliver Neukum To: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: ke...@lists.infradead.org --- include/linux/irq.h | 2 ++ kernel/irq/spurious.c | 32 kernel/watchdog.c | 9 - 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/include/linux/irq.h b/include/linux/irq.h index 1b7f4df..140cb61 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -684,6 +684,8 @@ extern void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret); /* Enable/disable irq debugging output: */ extern int noirqdebug_setup(char *str); +void suppress_max_irq(void); + /* Checks whether the interrupt can be requested by request_irq(): */ extern int can_request_irq(unsigned int irq, unsigned long irqflags); diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c index f865e5f..d3d94d6 100644 --- a/kernel/irq/spurious.c +++ b/kernel/irq/spurious.c @@ -464,3 +464,35 @@ static int __init irqpoll_setup(char *str) } __setup("irqpoll", irqpoll_setup); + +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + +static bool irqflood_suppress; + +static int __init irqflood_suppress_setup(char *str) +{ + irqflood_suppress = true; + pr_info("enable auto suppress irqflood\n"); + return 1; +} +__setup("irqflood_suppress", irqflood_suppress_setup); + +void suppress_max_irq(void) +{ + unsigned int tmp, maxirq = 0, max = 0; + int irq; + + if (!irqflood_suppress) + return; + for_each_active_irq(irq) { + tmp = kstat_irqs_cpu(irq, 
smp_processor_id()); + if (max < tmp) { + maxirq = irq; + max = tmp; + } + } + pr_warn("Suppress irq:%u, which is triggered %u times\n", + maxirq, max); + disable_irq_nosync(maxirq); +} +#endif diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 230ac38..28a74e5 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -24,6 +24,7 @@ #include #include #include +#include <linux/irq.h> #include #include @@ -364,9 +365,15 @@ static void check_irq_flood(void) percent = irqts * 100 / totalts; percent = percent < 100 ? percent : 100; __this_cpu_write(check_hint, -1); - if (percent >= 98) + if (percent >= 98) { pr_info("Irq flood occupies more than %lu%% of the past %lu seconds\n", percent, totalts >> 30); + /* +* Suppress the top irq when the scheduler has not run for a long time +* and irqs occupy too much time. +*/ + suppress_max_irq(); + } } else if (cnt == 0) { __this_cpu_write(last_total_ts, totalts); __this_cpu_write(last_irq_ts, irqts); -- 2.7.5
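The suppression step itself is a simple arg-max over the per-CPU interrupt counters. As a userspace sketch (a plain array stands in for `for_each_active_irq()`/`kstat_irqs_cpu()`, and the names are hypothetical):

```c
/* Model of suppress_max_irq() above: return the IRQ number with the
 * highest count; the kernel patch would then disable_irq_nosync() it. */
static int busiest_irq(const unsigned int *counts, int nr)
{
	unsigned int max = 0;
	int irq, maxirq = 0;

	for (irq = 0; irq < nr; irq++) {
		if (counts[irq] > max) {
			max = counts[irq];
			maxirq = irq;
		}
	}
	return maxirq;
}
```

Note this heuristic disables whichever interrupt fired most, on the assumption that in the capture kernel the flooding source dominates the counters by orders of magnitude.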
[PATCH 0/3] warn and suppress irqflood
I hit an irqflood bug on a powerpc platform, and two years ago, on an x86 platform. When the bug happens, the kernel is totally occupied by irqs. Currently, there may be nothing or just a soft lockup warning shown in the console. It is better to warn users with irq flood info. In the kdump case, the kernel can move on by suppressing the irq flood. Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Jisheng Zhang Cc: Andrew Morton Cc: "Guilherme G. Piccoli" Cc: Petr Mladek Cc: Marc Zyngier Cc: Linus Walleij Cc: afzal mohammed Cc: Lina Iyer Cc: "Gustavo A. R. Silva" Cc: Maulik Shah Cc: Al Viro Cc: Jonathan Corbet Cc: Pawan Gupta Cc: Mike Kravetz Cc: Oliver Neukum To: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: ke...@lists.infradead.org Pingfan Liu (3): kernel/watchdog: show irq percentage if irq floods kernel/watchdog: suppress max irq when irq floods Documentation: introduce a param "irqflood_suppress" Documentation/admin-guide/kernel-parameters.txt | 3 ++ include/linux/irq.h | 2 ++ kernel/irq/spurious.c | 32 + kernel/watchdog.c | 48 + 4 files changed, 85 insertions(+) -- 2.7.5
Re: [PATCH] sched/cputime: correct account of irqtime
On Wed, Oct 14, 2020 at 9:02 PM Peter Zijlstra wrote: > > On Mon, Oct 12, 2020 at 09:50:44PM +0800, Pingfan Liu wrote: > > __do_softirq() may be interrupted by hardware interrupts. In this case, > > irqtime_account_irq() will account the time slice as CPUTIME_SOFTIRQ by > > mistake. > > > > By passing irqtime_account_irq() an extra param about either hardirq or > > softirq, irqtime_account_irq() can handle the above case. > > I'm not sure I see the scenario in which it goes wrong. > > irqtime_account_irq() is designed such that we're called with the old > preempt_count on enter and the new preempt_count on exit. This way we'll > accumulate the delta to the previous context. > Oops! You are right, the time delta between a softirq and an interrupting hardirq should be accounted to the softirq. Thanks for your clear explanation. Regards, Pingfan
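Peter's point — each delta is charged to the context that was running *before* the transition, because `irqtime_account_irq()` sees the old `preempt_count` on enter and the new one on exit — can be illustrated with a small model. The enum, struct, and timestamps below are hypothetical; this is only a sketch of the accounting design, not the kernel's implementation:

```c
/* Hypothetical model of the design described above: on every context
 * transition, the elapsed time is charged to the context being left,
 * so a hardirq that interrupts a softirq does not inflate the
 * softirq's accounted time. */
enum ctx { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ };

struct acct {
	unsigned long long t[3];	/* accumulated time per context */
	unsigned long long last;	/* timestamp of the last transition */
	enum ctx cur;
};

static void transition(struct acct *a, unsigned long long now, enum ctx to)
{
	a->t[a->cur] += now - a->last;	/* delta goes to the old context */
	a->last = now;
	a->cur = to;
}
```

In this model a hardirq arriving mid-softirq splits the softirq's time correctly on its own, which is why the extra `hardirq` parameter in the patch is unnecessary.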
Re: [PATCH] sched/cputime: correct account of irqtime
On Tue, Oct 13, 2020 at 11:10 AM jun qian wrote: > > Pingfan Liu 于2020年10月12日周一 下午9:54写道: > > > > __do_softirq() may be interrupted by hardware interrupts. In this case, > > irqtime_account_irq() will account the time slice as CPUTIME_SOFTIRQ by > > mistake. > > > > By passing irqtime_account_irq() an extra param about either hardirq or > > softirq, irqtime_account_irq() can handle the above case. > > > > Signed-off-by: Pingfan Liu > > Cc: Ingo Molnar > > Cc: Peter Zijlstra > > Cc: Juri Lelli > > Cc: Vincent Guittot > > Cc: Dietmar Eggemann > > Cc: Steven Rostedt > > Cc: Ben Segall > > Cc: Mel Gorman > > Cc: Thomas Gleixner > > Cc: Andy Lutomirski > > Cc: Will Deacon > > Cc: "Paul E. McKenney" > > Cc: Frederic Weisbecker > > Cc: Allen Pais > > Cc: Romain Perier > > To: linux-kernel@vger.kernel.org > > --- > > include/linux/hardirq.h | 4 ++-- > > include/linux/vtime.h | 12 ++-- > > kernel/sched/cputime.c | 4 ++-- > > kernel/softirq.c| 6 +++--- > > 4 files changed, 13 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h > > index 754f67a..56e7bb5 100644 > > --- a/include/linux/hardirq.h > > +++ b/include/linux/hardirq.h > > @@ -32,7 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void) > > */ > > #define __irq_enter() \ > > do {\ > > - account_irq_enter_time(current);\ > > + account_irq_enter_time(current, true); \ > > preempt_count_add(HARDIRQ_OFFSET); \ > > lockdep_hardirq_enter();\ > > } while (0) > > @@ -63,7 +63,7 @@ void irq_enter_rcu(void); > > #define __irq_exit() \ > > do {\ > > lockdep_hardirq_exit(); \ > > - account_irq_exit_time(current); \ > > + account_irq_exit_time(current, true); \ > > preempt_count_sub(HARDIRQ_OFFSET); \ > > } while (0) > > > > diff --git a/include/linux/vtime.h b/include/linux/vtime.h > > index 2cdeca0..294188ae1 100644 > > --- a/include/linux/vtime.h > > +++ b/include/linux/vtime.h > > @@ -98,21 +98,21 @@ static inline void vtime_flush(struct task_struct *tsk) > > { } > 
> > > > > #ifdef CONFIG_IRQ_TIME_ACCOUNTING > > -extern void irqtime_account_irq(struct task_struct *tsk); > > +extern void irqtime_account_irq(struct task_struct *tsk, bool hardirq); > > #else > > -static inline void irqtime_account_irq(struct task_struct *tsk) { } > > +static inline void irqtime_account_irq(struct task_struct *tsk, bool > > hardirq) { } > > #endif > > > > -static inline void account_irq_enter_time(struct task_struct *tsk) > > +static inline void account_irq_enter_time(struct task_struct *tsk, bool > > hardirq) > > { > > vtime_account_irq_enter(tsk); > > - irqtime_account_irq(tsk); > > + irqtime_account_irq(tsk, hardirq); > > } > > > > -static inline void account_irq_exit_time(struct task_struct *tsk) > > +static inline void account_irq_exit_time(struct task_struct *tsk, bool > > hardirq) > > { > > vtime_account_irq_exit(tsk); > > - irqtime_account_irq(tsk); > > + irqtime_account_irq(tsk, hardirq); > > } > > > > #endif /* _LINUX_KERNEL_VTIME_H */ > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > > index 5a55d23..166f1d7 100644 > > --- a/kernel/sched/cputime.c > > +++ b/kernel/sched/cputime.c > > @@ -47,7 +47,7 @@ static void irqtime_account_delta(struct irqtime > > *irqtime, u64 delta, > > * Called before incrementing preempt_count on {soft,}irq_enter > > * and before decrementing preempt_count on {soft,}irq_exit. > > */ > > -void irqtime_account_irq(struct task_struct *curr) > > +void irqtime_account_irq(struct task_struct *curr, bool hardirq) > > { > > struct irqtime *irqtime = this_cpu_ptr(_irqtime); > > s64 delta; > > @@ -68,7 +68,7 @@ void irqtime_account_irq(struct task_struct *curr) > > */ > > if (hardirq_count()) > > irqtime_account_delta(irqtime, delta, CPUTIME_IRQ); > > - else if (in_serving_softirq() && curr
[PATCH] sched/cputime: correct account of irqtime
__do_softirq() may be interrupted by hardware interrupts. In this case, irqtime_account_irq() will account the time slice as CPUTIME_SOFTIRQ by mistake. By passing irqtime_account_irq() an extra param about either hardirq or softirq, irqtime_account_irq() can handle the above case. Signed-off-by: Pingfan Liu Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Will Deacon Cc: "Paul E. McKenney" Cc: Frederic Weisbecker Cc: Allen Pais Cc: Romain Perier To: linux-kernel@vger.kernel.org --- include/linux/hardirq.h | 4 ++-- include/linux/vtime.h | 12 ++-- kernel/sched/cputime.c | 4 ++-- kernel/softirq.c| 6 +++--- 4 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 754f67a..56e7bb5 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -32,7 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void) */ #define __irq_enter() \ do {\ - account_irq_enter_time(current);\ + account_irq_enter_time(current, true); \ preempt_count_add(HARDIRQ_OFFSET); \ lockdep_hardirq_enter();\ } while (0) @@ -63,7 +63,7 @@ void irq_enter_rcu(void); #define __irq_exit() \ do {\ lockdep_hardirq_exit(); \ - account_irq_exit_time(current); \ + account_irq_exit_time(current, true); \ preempt_count_sub(HARDIRQ_OFFSET); \ } while (0) diff --git a/include/linux/vtime.h b/include/linux/vtime.h index 2cdeca0..294188ae1 100644 --- a/include/linux/vtime.h +++ b/include/linux/vtime.h @@ -98,21 +98,21 @@ static inline void vtime_flush(struct task_struct *tsk) { } #ifdef CONFIG_IRQ_TIME_ACCOUNTING -extern void irqtime_account_irq(struct task_struct *tsk); +extern void irqtime_account_irq(struct task_struct *tsk, bool hardirq); #else -static inline void irqtime_account_irq(struct task_struct *tsk) { } +static inline void irqtime_account_irq(struct task_struct *tsk, bool hardirq) { } #endif 
-static inline void account_irq_enter_time(struct task_struct *tsk) +static inline void account_irq_enter_time(struct task_struct *tsk, bool hardirq) { vtime_account_irq_enter(tsk); - irqtime_account_irq(tsk); + irqtime_account_irq(tsk, hardirq); } -static inline void account_irq_exit_time(struct task_struct *tsk) +static inline void account_irq_exit_time(struct task_struct *tsk, bool hardirq) { vtime_account_irq_exit(tsk); - irqtime_account_irq(tsk); + irqtime_account_irq(tsk, hardirq); } #endif /* _LINUX_KERNEL_VTIME_H */ diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 5a55d23..166f1d7 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -47,7 +47,7 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta, * Called before incrementing preempt_count on {soft,}irq_enter * and before decrementing preempt_count on {soft,}irq_exit. */ -void irqtime_account_irq(struct task_struct *curr) +void irqtime_account_irq(struct task_struct *curr, bool hardirq) { struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); s64 delta; @@ -68,7 +68,7 @@ void irqtime_account_irq(struct task_struct *curr) */ if (hardirq_count()) irqtime_account_delta(irqtime, delta, CPUTIME_IRQ); - else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) + else if (in_serving_softirq() && curr != this_cpu_ksoftirqd() && !hardirq) irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ); } EXPORT_SYMBOL_GPL(irqtime_account_irq); diff --git a/kernel/softirq.c b/kernel/softirq.c index bf88d7f6..da59ea39 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -270,7 +270,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) current->flags &= ~PF_MEMALLOC; pending = local_softirq_pending(); - account_irq_enter_time(current); + account_irq_enter_time(current, false); __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET); in_hardirq = lockdep_softirq_start(); @@ -321,7 +321,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) } 
lockdep_softirq_end(in_hardirq); - account_irq_exit_time(current); + account_irq_exit_time(current, false); __local_bh_enable(SOFTIRQ_OFFSET); WARN_ON_ONCE(in_interrupt()); current_restore_flags(old_flags, PF_MEMALLOC); @@ -417,7 +417,7 @@ static inline void __irq_exit_rcu(void) #else
[tip: x86/urgent] x86/purgatory: Don't generate debug info for purgatory.ro
The following commit has been merged into the x86/urgent branch of tip: Commit-ID: 52416ffcf823ee11aa19792715664ab94757f111 Gitweb: https://git.kernel.org/tip/52416ffcf823ee11aa19792715664ab94757f111 Author:Pingfan Liu AuthorDate:Mon, 03 Aug 2020 13:49:48 +08:00 Committer: Ingo Molnar CommitterDate: Fri, 07 Aug 2020 01:32:00 +02:00 x86/purgatory: Don't generate debug info for purgatory.ro Purgatory.ro is a standalone binary that is not linked against the rest of the kernel. Its image is copied into an array that is linked to the kernel, and from there kexec relocates it wherever it desires. Unlike the debug info for vmlinux, which can be used for analyzing a crash, such info is useless in purgatory.ro, and discarding it saves about 200K of space. Original: 259080 kexec-purgatory.o Stripped debug info: 29152 kexec-purgatory.o Signed-off-by: Pingfan Liu Signed-off-by: Ingo Molnar Reviewed-by: Nick Desaulniers Reviewed-by: Steve Wahl Acked-by: Dave Young Link: https://lore.kernel.org/r/1596433788-3784-1-git-send-email-kernelf...@gmail.com --- arch/x86/purgatory/Makefile | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile index 088bd76..d24b43a 100644 --- a/arch/x86/purgatory/Makefile +++ b/arch/x86/purgatory/Makefile @@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n # make up the standalone purgatory.ro PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel -PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss +PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector) @@ -64,6 +64,9 @@ CFLAGS_sha256.o += $(PURGATORY_CFLAGS) CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE) CFLAGS_string.o+= $(PURGATORY_CFLAGS) +AFLAGS_REMOVE_setup-x86_$(BITS).o += -Wa,-gdwarf-2 +AFLAGS_REMOVE_entry64.o+= -Wa,-gdwarf-2 + $(obj)/purgatory.ro: 
$(PURGATORY_OBJS) FORCE $(call if_changed,ld)
[tip: x86/urgent] x86/purgatory: Don't generate debug info for purgatory.ro
The following commit has been merged into the x86/urgent branch of tip: Commit-ID: b031cf7752a82fefa6818a788d906d14f533afa9 Gitweb: https://git.kernel.org/tip/b031cf7752a82fefa6818a788d906d14f533afa9 Author:Pingfan Liu AuthorDate:Mon, 03 Aug 2020 13:49:48 +08:00 Committer: Ingo Molnar CommitterDate: Thu, 06 Aug 2020 15:29:25 +02:00 x86/purgatory: Don't generate debug info for purgatory.ro Purgatory.ro is a standalone binary that is not linked against the rest of the kernel. Its image is copied into an array that is linked to the kernel, and from there kexec relocates it wherever it desires. Unlike the debug info for vmlinux, which can be used for analyzing a crash, such info is useless in purgatory.ro, and discarding it saves about 200K of space. Original: 259080 kexec-purgatory.o Stripped debug info: 29152 kexec-purgatory.o Signed-off-by: Pingfan Liu Signed-off-by: Ingo Molnar Reviewed-by: Nick Desaulniers Reviewed-by: Steve Wahl Acked-by: Dave Young Link: https://lore.kernel.org/r/1596433788-3784-1-git-send-email-kernelf...@gmail.com --- arch/x86/purgatory/Makefile | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile index 088bd76..d24b43a 100644 --- a/arch/x86/purgatory/Makefile +++ b/arch/x86/purgatory/Makefile @@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n # make up the standalone purgatory.ro PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel -PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss +PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector) @@ -64,6 +64,9 @@ CFLAGS_sha256.o += $(PURGATORY_CFLAGS) CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE) CFLAGS_string.o+= $(PURGATORY_CFLAGS) +AFLAGS_REMOVE_setup-x86_$(BITS).o += -Wa,-gdwarf-2 +AFLAGS_REMOVE_entry64.o+= -Wa,-gdwarf-2 + $(obj)/purgatory.ro: 
$(PURGATORY_OBJS) FORCE $(call if_changed,ld)
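The effect of `-g0` can be sketched outside the kernel build with a throwaway object file. Everything below — `cc`, the temp directory, and the `purgatory_stub` function — is a stand-in for illustration, not part of the actual purgatory build:

```shell
# Build the same trivial translation unit twice: once with full debug
# info (-g) and once with debug info generation disabled (-g0).
workdir=$(mktemp -d)
printf 'int purgatory_stub(void) { return 0; }\n' > "$workdir/t.c"

cc -c -g  "$workdir/t.c" -o "$workdir/t_dbg.o"
cc -c -g0 "$workdir/t.c" -o "$workdir/t_g0.o"

# Count .debug_* sections in each object: nonzero with -g, zero with -g0.
readelf -S "$workdir/t_dbg.o" | grep -c 'debug_'
readelf -S "$workdir/t_g0.o"  | grep -c 'debug_' || true
```

With `-g0` the compiler never generates the debug info in the first place, so there is nothing for a later strip step to throw away — which is exactly the rationale given in the review thread below.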
[PATCHv2] x86/purgatory: don't generate debug info for purgatory.ro
Purgatory.ro is a standalone binary that is not linked against the rest of
the kernel. Its image is copied into an array that is linked to the kernel,
and from there kexec relocates it wherever it desires.

Unlike the debug info for vmlinux, which can be used for analyzing a crash,
such info is useless in purgatory.ro, and discarding it saves about 200K of
space:

  Original:             259080  kexec-purgatory.o
  Stripped debug info:   29152  kexec-purgatory.o

Signed-off-by: Pingfan Liu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Hans de Goede
Cc: Nick Desaulniers
Cc: Arvind Sankar
Cc: Steve Wahl
Cc: linux-kernel@vger.kernel.org
Cc: ke...@lists.infradead.org
To: x...@kernel.org
---
 arch/x86/purgatory/Makefile | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..d24b43a 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
 # make up the standalone purgatory.ro

 PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
-PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
+PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0
 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
 PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)
@@ -64,6 +64,9 @@ CFLAGS_sha256.o += $(PURGATORY_CFLAGS)
 CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE)
 CFLAGS_string.o += $(PURGATORY_CFLAGS)

+AFLAGS_REMOVE_setup-x86_$(BITS).o += -Wa,-gdwarf-2
+AFLAGS_REMOVE_entry64.o += -Wa,-gdwarf-2
+
 $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
	$(call if_changed,ld)
--
2.7.5
Re: [PATCH] x86/purgatory: strip debug info
On Sat, Aug 1, 2020 at 2:18 AM Nick Desaulniers wrote: > > On Fri, Jul 31, 2020 at 2:36 AM Pingfan Liu wrote: > > > > On Fri, Jul 31, 2020 at 7:11 AM Nick Desaulniers > > wrote: > > > > > > On Thu, Jul 30, 2020 at 1:27 AM Pingfan Liu wrote: > > > > > > > > It is useless to keep debug info in purgatory. And discarding them saves > > > > about 200K space. > > > > > > > > Original: > > > > 259080 kexec-purgatory.o > > > > Stripped: > > > >29152 kexec-purgatory.o > > > > > > > > Signed-off-by: Pingfan Liu > > > > Cc: Thomas Gleixner > > > > Cc: Ingo Molnar > > > > Cc: Borislav Petkov > > > > Cc: "H. Peter Anvin" > > > > Cc: Hans de Goede > > > > Cc: Nick Desaulniers > > > > Cc: Arvind Sankar > > > > Cc: Steve Wahl > > > > Cc: linux-kernel@vger.kernel.org > > > > To: x...@kernel.org > > > > > > I don't see any code in > > > arch/x86/purgatory/ > > > arch/x86/include/asm/purgatory.h > > > include/linux/purgatory.h > > > include/uapi/linux/kexec.h > > > kernel/kexec* > > > include/linux/kexec.h > > > include/linux/crash_dump.h > > > kernel/crash_dump.c > > > arch/x86/kernel/crash* > > > https://github.com/horms/kexec-tools/tree/master/kexec/arch/x86_64 > > > that mentions any kind of debug info section. I'm not sure what you'd > > > do with the debug info anyway for this binary. So I suspect this > > > information should ok to discard. > > > > > > This works, but it might be faster to build to not generate the > > > compile info in the first place via compile flag `-g0`, which could be > > > added `ifdef CONFIG_DEBUG_INFO` or even just unconditionally. That > > > way we're not doing additional work to generate debug info, then > > > additional work to throw it away. 
> > What about:
> >
> > diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> > index 088bd76..7e1ad9e 100644
> > --- a/arch/x86/purgatory/Makefile
> > +++ b/arch/x86/purgatory/Makefile
> > @@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
> >  # make up the standalone purgatory.ro
> >
> >  PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
> > -PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
> > +PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0
> >  PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
> >  PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)
>
> I tested your patch but still see .debug_* sections in the .ro from a few .o.
>
> At least on
> * setup-x86_64.o
> * entry64.o
>
> If you add the following hunk to your diff:
> ```
> @@ -64,6 +64,9 @@ CFLAGS_sha256.o += $(PURGATORY_CFLAGS)
>  CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE)
>  CFLAGS_string.o += $(PURGATORY_CFLAGS)
>
> +AFLAGS_REMOVE_setup-x86_$(BITS).o += -Wa,-gdwarf-2
> +AFLAGS_REMOVE_entry64.o += -Wa,-gdwarf-2
> +

I went through the man pages for as and gcc and could not find a simpler
method than your suggestion.

> $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
> 	$(call if_changed,ld)
> ```
> then that should do it. Then you can verify the .ro file via:
> $ llvm-readelf -S arch/x86/purgatory/purgatory.ro | not grep debug_
> (no output, should return zero)

Thank you for the good suggestion; I will send a v2.

Regards,
Pingfan
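Nick's verification pipes through `not`, a kbuild helper that inverts grep's exit status. An equivalent plain-shell sketch of the same "fail if any .debug_* section remains" check, run here against a stand-in object rather than the real purgatory.ro:

```shell
# Produce a debug-free stand-in object (the real target would be
# arch/x86/purgatory/purgatory.ro from an actual kernel build).
workdir=$(mktemp -d)
printf 'int purgatory_stub(void) { return 0; }\n' > "$workdir/t.c"
cc -c -g0 "$workdir/t.c" -o "$workdir/stand-in.ro"

# Succeed only when no .debug_* section survives in the object.
if readelf -S "$workdir/stand-in.ro" | grep -q 'debug_'; then
    echo "debug info still present"
    exit 1
else
    echo "no debug info"
fi
```

Plain `readelf` stands in for `llvm-readelf`; both list section headers with `-S`, so either works for this check.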
Re: [PATCH] x86/purgatory: strip debug info
On Fri, Jul 31, 2020 at 7:11 AM Nick Desaulniers wrote: > > On Thu, Jul 30, 2020 at 1:27 AM Pingfan Liu wrote: > > > > It is useless to keep debug info in purgatory. And discarding them saves > > about 200K space. > > > > Original: > > 259080 kexec-purgatory.o > > Stripped: > >29152 kexec-purgatory.o > > > > Signed-off-by: Pingfan Liu > > Cc: Thomas Gleixner > > Cc: Ingo Molnar > > Cc: Borislav Petkov > > Cc: "H. Peter Anvin" > > Cc: Hans de Goede > > Cc: Nick Desaulniers > > Cc: Arvind Sankar > > Cc: Steve Wahl > > Cc: linux-kernel@vger.kernel.org > > To: x...@kernel.org > > I don't see any code in > arch/x86/purgatory/ > arch/x86/include/asm/purgatory.h > include/linux/purgatory.h > include/uapi/linux/kexec.h > kernel/kexec* > include/linux/kexec.h > include/linux/crash_dump.h > kernel/crash_dump.c > arch/x86/kernel/crash* > https://github.com/horms/kexec-tools/tree/master/kexec/arch/x86_64 > that mentions any kind of debug info section. I'm not sure what you'd > do with the debug info anyway for this binary. So I suspect this > information should ok to discard. > > This works, but it might be faster to build to not generate the > compile info in the first place via compile flag `-g0`, which could be > added `ifdef CONFIG_DEBUG_INFO` or even just unconditionally. That > way we're not doing additional work to generate debug info, then > additional work to throw it away. 
What about:

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..7e1ad9e 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
 # make up the standalone purgatory.ro

 PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
-PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
+PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0
 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
 PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)

Thanks,
Pingfan
[PATCH] x86/purgatory: strip debug info
It is useless to keep debug info in purgatory, and discarding it saves
about 200K of space.

Original:
  259080  kexec-purgatory.o
Stripped:
   29152  kexec-purgatory.o

Signed-off-by: Pingfan Liu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Hans de Goede
Cc: Nick Desaulniers
Cc: Arvind Sankar
Cc: Steve Wahl
Cc: linux-kernel@vger.kernel.org
To: x...@kernel.org
---
 arch/x86/purgatory/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..4340ae6 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -16,7 +16,7 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS
 # When linking purgatory.ro with -r unresolved symbols are not checked,
 # also link a purgatory.chk binary without -r to check for unresolved symbols.
-PURGATORY_LDFLAGS := -e purgatory_start -nostdlib -z nodefaultlib
+PURGATORY_LDFLAGS := -e purgatory_start -nostdlib -z nodefaultlib -S
 LDFLAGS_purgatory.ro := -r $(PURGATORY_LDFLAGS)
 LDFLAGS_purgatory.chk := $(PURGATORY_LDFLAGS)
 targets += purgatory.ro purgatory.chk
--
2.7.5
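This first version strips the debug info at link time via ld's `-S` (`--strip-debug`), rather than suppressing its generation. That variant can be sketched outside the kernel build with a throwaway object; the file names and stub function are illustrative only:

```shell
# Compile with debug info, then link with -r (relocatable, as the
# purgatory.ro rule does) plus -S to strip .debug_* at link time.
workdir=$(mktemp -d)
printf 'int f(void) { return 1; }\n' > "$workdir/t.c"
cc -c -g "$workdir/t.c" -o "$workdir/t.o"

ld -r -S "$workdir/t.o" -o "$workdir/t.ro"

# The input object carries .debug_* sections; the -S output carries none.
readelf -S "$workdir/t.o"  | grep -c 'debug_'
readelf -S "$workdir/t.ro" | grep -c 'debug_' || true
```

As the review below points out, the later `-g0` approach is preferable: it avoids generating the debug info only to throw it away.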
Re: [PATCH 2/3] mm/migrate: see hole as invalid source page
On Fri, Aug 16, 2019 at 11:02:22PM +0800, Pingfan Liu wrote: > On Thu, Aug 15, 2019 at 01:22:22PM -0400, Jerome Glisse wrote: > > On Tue, Aug 06, 2019 at 04:00:10PM +0800, Pingfan Liu wrote: > > > MIGRATE_PFN_MIGRATE marks a valid pfn, further more, suitable to migrate. > > > As for hole, there is no valid pfn, not to mention migration. > > > > > > Before this patch, hole has already relied on the following code to be > > > filtered out. Hence it is more reasonable to see hole as invalid source > > > page. > > > migrate_vma_prepare() > > > { > > > struct page *page = migrate_pfn_to_page(migrate->src[i]); > > > > > > if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE)) > > >\_ this condition > > > } > > > > NAK you break the API, MIGRATE_PFN_MIGRATE is use for 2 things, > > first it allow the collection code to mark entry that can be > > migrated, then it use by driver to allow driver to skip migration > > for some entry (for whatever reason the driver might have), we > > still need to keep the entry and not clear it so that we can > > cleanup thing (ie remove migration pte entry). > Thanks for your kindly review. > > I read the code again. Maybe I miss something. But as my understanding, > for hole, there is no pte. > As the current code migrate_vma_collect_pmd() > { > if (pmd_none(*pmdp)) > return migrate_vma_collect_hole(start, end, walk); > ... > make_migration_entry() > } > > We do not install migration entry for hole, then no need to remove > migration pte entry. > > And on the driver side, there is way to migrate a hole. The driver just > skip it by > drivers/gpu/drm/nouveau/nouveau_dmem.c: if (!spage || !(src_pfns[i] & > MIGRATE_PFN_MIGRATE)) > > Finally, in migrate_vma_finalize(), for a hole, > if (!page) { > if (newpage) { > unlock_page(newpage); > put_page(newpage); > } > continue; > } > And we do not rely on remove_migration_ptes(page, newpage, false); to > restore the orignal pte (and it is impossible). > Hello, do you have any comment? 
Most importantly, I think, a hole does not use the 'MIGRATE_PFN_MIGRATE' API at all. A hole has no pte, so there is no way to 'remove the migration pte entry'. Furthermore, we can detect the failure on the source side; there is no need to defer it to the driver side. This way, [3/3] can unify the code.

Thanks,
Pingfan
Re: [PATCH 2/3] mm/migrate: see hole as invalid source page
On Thu, Aug 15, 2019 at 01:22:22PM -0400, Jerome Glisse wrote: > On Tue, Aug 06, 2019 at 04:00:10PM +0800, Pingfan Liu wrote: > > MIGRATE_PFN_MIGRATE marks a valid pfn, further more, suitable to migrate. > > As for hole, there is no valid pfn, not to mention migration. > > > > Before this patch, hole has already relied on the following code to be > > filtered out. Hence it is more reasonable to see hole as invalid source > > page. > > migrate_vma_prepare() > > { > > struct page *page = migrate_pfn_to_page(migrate->src[i]); > > > > if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE)) > > \_ this condition > > } > > NAK you break the API, MIGRATE_PFN_MIGRATE is use for 2 things, > first it allow the collection code to mark entry that can be > migrated, then it use by driver to allow driver to skip migration > for some entry (for whatever reason the driver might have), we > still need to keep the entry and not clear it so that we can > cleanup thing (ie remove migration pte entry). Thanks for your kindly review. I read the code again. Maybe I miss something. But as my understanding, for hole, there is no pte. As the current code migrate_vma_collect_pmd() { if (pmd_none(*pmdp)) return migrate_vma_collect_hole(start, end, walk); ... make_migration_entry() } We do not install migration entry for hole, then no need to remove migration pte entry. And on the driver side, there is way to migrate a hole. The driver just skip it by drivers/gpu/drm/nouveau/nouveau_dmem.c: if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE)) Finally, in migrate_vma_finalize(), for a hole, if (!page) { if (newpage) { unlock_page(newpage); put_page(newpage); } continue; } And we do not rely on remove_migration_ptes(page, newpage, false); to restore the orignal pte (and it is impossible). Thanks, Pingfan
Re: [PATCHv2] mm/migrate: clean up useless code in migrate_vma_collect_pmd()
On Thu, Aug 15, 2019 at 03:40:21PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:23:44PM -0700, Ralph Campbell wrote:
> [...]
> >
> > I don't understand. The only use of "pfn" I see is in the "else"
> > clause above where it is set just before using it.
>
> Ok i managed to confuse myself with mpfn and probably with old
> version of the code. Sorry for reading too quickly. Can we move
> unsigned long pfn; into the else { branch so that there is no
> more confusion to its scope.

That looks better; I will update it in v3.

Thanks,
Pingfan
[PATCHv2] mm/migrate: clean up useless code in migrate_vma_collect_pmd()
Clean up the useless 'pfn' variable: after the removed assignments, the loop either restarts with a fresh value or assigns again without using the old one, so 'pfn' can be confined to the one branch that reads it.

Signed-off-by: Pingfan Liu
Cc: "Jérôme Glisse"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Jan Kara
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Andrea Arcangeli
Cc: Matthew Wilcox
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8992741..d483a55 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2225,17 +2225,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pte_t pte;

 		pte = *ptep;
-		pfn = pte_pfn(pte);

 		if (pte_none(pte)) {
 			mpfn = MIGRATE_PFN_MIGRATE;
 			migrate->cpages++;
-			pfn = 0;
 			goto next;
 		}

 		if (!pte_present(pte)) {
-			mpfn = pfn = 0;
+			mpfn = 0;

 			/*
 			 * Only care about unaddressable device page special
@@ -2252,10 +2250,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
+			pfn = pte_pfn(pte);
 			if (is_zero_pfn(pfn)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
 				migrate->cpages++;
-				pfn = 0;
 				goto next;
 			}
 			page = vm_normal_page(migrate->vma, addr, pte);
@@ -2265,10 +2263,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

 		/* FIXME support THP */
 		if (!page || !page->mapping || PageTransCompound(page)) {
-			mpfn = pfn = 0;
+			mpfn = 0;
 			goto next;
 		}
-		pfn = page_to_pfn(page);

 		/*
 		 * By getting a reference on the page we pin it and that blocks
--
2.7.5
Re: [PATCH 0/4] x86/mce: protect nr_cpus from rebooting by broadcast mce
On Wed, Aug 07, 2019 at 11:00:41AM +0800, Dave Young wrote: > Add Tony and Xunlei in cc. > On 08/05/19 at 04:58pm, Pingfan Liu wrote: > > This series include two related groups: > > [1-3/4]: protect nr_cpus from rebooting by broadcast mce > > [4/4]: improve "kexec -l" robustness against broadcast mce > > > > When I tried to fix [1], Thomas raised concern about the nr_cpus' > > vulnerability > > to unexpected rebooting by broadcast mce. After analysis, I think only the > > following first case suffers from the rebooting by broadcast mce. [1-3/4] > > aims > > to fix that issue. > > I did not understand and read the MCE details, but we previously had a > MCE problem, Xunlei fixed in below commit: > commit 5bc329503e8191c91c4c40836f062ef771d8ba83 > Author: Xunlei Pang > Date: Mon Mar 13 10:50:19 2017 +0100 > > x86/mce: Handle broadcasted MCE gracefully with kexec > > I wonder if this is same issue or not. Also the old discussion is in > below thread: > https://lore.kernel.org/patchwork/patch/753530/ > > Tony raised similar questions, but I'm not sure if it is still a problem > or it has been fixed. > Xunlei's patch is the precondition of the stability for the case 2: boot up by "kexec -p nr_cpus=" For case1/3, extra effort is needed. Thanks, Pingfan > > > > *** Back ground *** > > > > On x86 it's required to have all logical CPUs set CR4.MCE=1. Otherwise, a > > broadcast MCE observing CR4.MCE=0b on any core will shutdown the machine. > > > > The option 'nosmt' has already complied with the above rule by Thomas's > > patch. > > For detail, refer to 506a66f3748 (Revert "x86/apic: Ignore secondary > > threads if > > nosmt=force") > > > > But for nr_cpus option, the exposure to broadcast MCE is a little > > complicated, > > and can be categorized into three cases. > > > > -1. boot up by BIOS. Since no one set CR4.MCE=1, nr_cpus risks rebooting by > > broadcast MCE. > > > > -2. boot up by "kexec -p nr_cpus=". 
Since the 1st kernel has all cpus' > > CR4.MCE=1 set before kexec -p, nr_cpus is free of rebooting by broadcast > > MCE. > > Furthermore, the crashed kernel's wreckage, including page table and text, > > is > > not touched by capture kernel. Hence if MCE event happens on capped cpu, > > do_machine_check->__mc_check_crashing_cpu() runs smoothly and returns > > immediately, the capped cpu is still pinned on "halt". > > > > -3. boot up by "kexec -l nr_cpus=". As "kexec -p", it is free of rebooting > > by > > broadcast MCE. But the 1st kernel's wreckage is discarded and changed. when > > capped cpus execute do_machine_check(), they may crack the new kernel. But > > this is not related with broadcast MCE, and need an extra fix. > > > > *** Solution *** > > "nr_cpus" can not follow the same way as "nosmt". Because nr_cpus limits > > the > > allocation of percpu area and some other kthread memory, which is critical > > to > > cpu hotplug framework. Instead, developing a dedicated SIPI callback > > make_capped_cpu_stable() for capped cpu, which does not lean on percpu area > > to > > work. > > > > [1]: https://lkml.org/lkml/2019/7/5/3 > > > > To: Gleixner > > To: Andy Lutomirski > > Cc: Ingo Molnar > > Cc: Borislav Petkov > > Cc: "H. 
Peter Anvin" > > Cc: Dave Hansen > > Cc: Peter Zijlstra > > To: x...@kernel.org > > Cc: Masami Hiramatsu > > Cc: Qian Cai > > Cc: Vlastimil Babka > > Cc: Daniel Drake > > Cc: Jacob Pan > > Cc: Michal Hocko > > Cc: Eric Biederman > > Cc: linux-kernel@vger.kernel.org > > Cc: Dave Young > > Cc: Baoquan He > > Cc: ke...@lists.infradead.org > > > > --- > > Pingfan Liu (4): > > x86/apic: correct the ENO in generic_processor_info() > > x86/apic: record capped cpu in generic_processor_info() > > x86/smp: send capped cpus to a stable state when smp_init() > > x86/smp: disallow MCE handler on rebooting AP > > > > arch/x86/include/asm/apic.h | 1 + > > arch/x86/include/asm/smp.h | 3 ++ > > arch/x86/kernel/apic/apic.c | 23 > > arch/x86/kernel/cpu/common.c | 7 > > arch/x86/kernel/smp.c| 8 + > > arch/x86/kernel/smpboot.c| 83 > > > > kernel/smp.c | 6 > > 7 files changed, 124 insertions(+), 7 deletions(-) > > > > -- > > 2.7.5 > > > > Thanks > Dave
Re: [PATCH 1/3] mm/migrate: clean up useless code in migrate_vma_collect_pmd()
On Tue, Aug 06, 2019 at 06:35:03AM -0700, Matthew Wilcox wrote:
>
> This needs something beyond the subject line. Maybe ...
>
> After these assignments, we either restart the loop with a fresh variable,
> or we assign to the variable again without using the value we've assigned.
>
> Reviewed-by: Matthew Wilcox (Oracle)
>
> > 		goto next;
> > 	}
> > -	pfn = page_to_pfn(page);
>
> After you've done all this, as far as I can tell, the 'pfn' variable is
> only used in one arm of the conditions, so it can be moved there.
>
> ie something like:
>
> -	unsigned long mpfn, pfn;
> +	unsigned long mpfn;
> ...
> -	pfn = pte_pfn(pte);
> ...
> +	unsigned long pfn = pte_pfn(pte);
> +

This makes the code better. Thank you for the suggestion; I will send a v2 of this patch.

Regards,
Pingfan
[PATCH 2/3] mm/migrate: see hole as invalid source page
MIGRATE_PFN_MIGRATE marks a valid pfn which is, furthermore, suitable to
migrate. A hole has no valid pfn, let alone one that can be migrated.

Before this patch, a hole already relied on the following code to be
filtered out; hence it is more reasonable to treat a hole as an invalid
source page.

migrate_vma_prepare()
{
	struct page *page = migrate_pfn_to_page(migrate->src[i]);

	if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
	     \_ this condition
}

Signed-off-by: Pingfan Liu
Cc: "Jérôme Glisse"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Jan Kara
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Andrea Arcangeli
Cc: Matthew Wilcox
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c2ec614..832483f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2136,10 +2136,9 @@ static int migrate_vma_collect_hole(unsigned long start,
 	unsigned long addr;

 	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
-		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
+		migrate->src[migrate->npages] = 0;
 		migrate->dst[migrate->npages] = 0;
 		migrate->npages++;
-		migrate->cpages++;
 	}

 	return 0;
@@ -2228,8 +2227,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pfn = pte_pfn(pte);

 		if (pte_none(pte)) {
-			mpfn = MIGRATE_PFN_MIGRATE;
-			migrate->cpages++;
+			mpfn = 0;
 			goto next;
 		}
--
2.7.5
[PATCH 1/3] mm/migrate: clean up useless code in migrate_vma_collect_pmd()
Signed-off-by: Pingfan Liu
Cc: "Jérôme Glisse"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Jan Kara
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Andrea Arcangeli
Cc: Matthew Wilcox
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8992741..c2ec614 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2230,7 +2230,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		if (pte_none(pte)) {
 			mpfn = MIGRATE_PFN_MIGRATE;
 			migrate->cpages++;
-			pfn = 0;
 			goto next;
 		}
@@ -2255,7 +2254,6 @@
 			if (is_zero_pfn(pfn)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
 				migrate->cpages++;
-				pfn = 0;
 				goto next;
 			}
 			page = vm_normal_page(migrate->vma, addr, pte);
@@ -2265,10 +2263,9 @@
 		/* FIXME support THP */
 		if (!page || !page->mapping || PageTransCompound(page)) {
-			mpfn = pfn = 0;
+			mpfn = 0;
 			goto next;
 		}
-		pfn = page_to_pfn(page);

 		/*
 		 * By getting a reference on the page we pin it and that blocks
--
2.7.5
[PATCH 3/3] mm/migrate: remove the duplicated code migrate_vma_collect_hole()
After the previous patch, which treats a hole as an invalid source,
migrate_vma_collect_hole() has the same code as migrate_vma_collect_skip().
Remove the duplicated code.

Signed-off-by: Pingfan Liu
Cc: "Jérôme Glisse"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Jan Kara
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Andrea Arcangeli
Cc: Matthew Wilcox
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 832483f..95e038d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2128,22 +2128,6 @@ struct migrate_vma {
 	unsigned long		end;
 };

-static int migrate_vma_collect_hole(unsigned long start,
-				    unsigned long end,
-				    struct mm_walk *walk)
-{
-	struct migrate_vma *migrate = walk->private;
-	unsigned long addr;
-
-	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
-		migrate->src[migrate->npages] = 0;
-		migrate->dst[migrate->npages] = 0;
-		migrate->npages++;
-	}
-
-	return 0;
-}
-
 static int migrate_vma_collect_skip(unsigned long start,
 				    unsigned long end,
 				    struct mm_walk *walk)
@@ -2173,7 +2157,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

 again:
 	if (pmd_none(*pmdp))
-		return migrate_vma_collect_hole(start, end, walk);
+		return migrate_vma_collect_skip(start, end, walk);

 	if (pmd_trans_huge(*pmdp)) {
 		struct page *page;
@@ -2206,7 +2190,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			return migrate_vma_collect_skip(start, end, walk);
 		if (pmd_none(*pmdp))
-			return migrate_vma_collect_hole(start, end,
+			return migrate_vma_collect_skip(start, end,
 							walk);
 	}
 }
@@ -2337,7 +2321,7 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	mm_walk.pmd_entry = migrate_vma_collect_pmd;
 	mm_walk.pte_entry = NULL;
-	mm_walk.pte_hole = migrate_vma_collect_skip;
+	mm_walk.pte_hole = migrate_vma_collect_skip;
 	mm_walk.hugetlb_entry = NULL;
 	mm_walk.test_walk = NULL;
 	mm_walk.vma = migrate->vma;
--
2.7.5
Re: [PATCH] smp: force all cpu to boot once under maxcpus option
On Mon, Jul 22, 2019 at 11:41:29AM +0200, Thomas Gleixner wrote: > On Wed, 10 Jul 2019, Pingfan Liu wrote: > > > > +static inline bool maxcpus_allowed(unsigned int cpu) > > +{ > > + /* maxcpus only takes effect during system bootup */ > > + if (smp_boot_done) > > + return true; > > + if (num_online_cpus() < setup_max_cpus) > > + return true; > > + /* > > +* maxcpus should allow cpu to set CR4.MCE asap, otherwise the set may > > +* be deferred indefinitely. > > +*/ > > + if (!per_cpu(cpuhp_state, cpu).booted_once) > > + return true; > > As this is a x86 only issue, you cannot inflict this magic on every > architecture. OK. In my developing patch which fixes nr_cpus issue, I takes the following way: (I am diverted to other things, have not finished test yet, hope to turn back soon.) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..c009169 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -956,6 +956,87 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle) return 0; } +void __init bring_capped_cpu_steady(void) +{ [...] +} + /* * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad * (ie clustered apic addressing mode), this is a LOGICAL apic ID. diff --git a/kernel/smp.c b/kernel/smp.c index d155374..b04961c 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -560,6 +560,10 @@ void __init setup_nr_cpu_ids(void) nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1; } +void __weak bring_capped_cpu_steady(void) +{ +} + /* Called by boot processor to activate the rest. */ void __init smp_init(void) { @@ -579,6 +583,8 @@ void __init smp_init(void) cpu_up(cpu); } + /* force cpus capped by nr_cpus option into steady state */ + bring_capped_cpu_steady(); num_nodes = num_online_nodes(); The initial motivation is to step around percpu area required by cpu hotplug framework. But it also provide the abstraction of archs. 
What do you think about resolving maxcpus by the similar abstraction? > > Aside of that this does not solve the problem at all because smp_init() > still does: > > for_each_present_cpu(cpu) { > if (num_online_cpus() >= setup_max_cpus) > break; Yes, this logic should be removed, then maxcpus_allowed() can take effect. But now, it may take a quite different way to resolve it. Thanks for your kindly review. Regards, Pingfan > if (!cpu_online(cpu)) > cpu_up(cpu); > } > > So the remaining CPUs are not onlined at all. > > Thanks, > > tglx
[tip:x86/cleanups] x86/realmode: Remove trampoline_status
Commit-ID:  69732102426b1c55a257386841fb80ec1f425d32
Gitweb:     https://git.kernel.org/tip/69732102426b1c55a257386841fb80ec1f425d32
Author:     Pingfan Liu
AuthorDate: Tue, 16 Jul 2019 16:40:24 +0800
Committer:  Thomas Gleixner
CommitDate: Mon, 22 Jul 2019 11:30:18 +0200

x86/realmode: Remove trampoline_status

There is no reader of trampoline_status; it is only written. It turns out
that after commit ce4b1b16502b ("x86/smpboot: Initialize secondary CPU only
if master CPU will wait for it"), trampoline_status is not needed any more.

Signed-off-by: Pingfan Liu
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/1563266424-3472-1-git-send-email-kernelf...@gmail.com
---
 arch/x86/include/asm/realmode.h          | 1 -
 arch/x86/kernel/smpboot.c                | 5 -----
 arch/x86/realmode/rm/header.S            | 1 -
 arch/x86/realmode/rm/trampoline_32.S     | 3 ---
 arch/x86/realmode/rm/trampoline_64.S     | 3 ---
 arch/x86/realmode/rm/trampoline_common.S | 4 ----
 6 files changed, 17 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index c53682303c9c..09ecc32f6524 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -20,7 +20,6 @@ struct real_mode_header {
 	u32	ro_end;
 	/* SMP trampoline */
 	u32	trampoline_start;
-	u32	trampoline_status;
 	u32	trampoline_header;
 #ifdef CONFIG_X86_64
 	u32	trampoline_pgd;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fdbd47ceb84d..497e9b7077c1 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1023,8 +1023,6 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 		       int *cpu0_nmi_registered)
 {
-	volatile u32 *trampoline_status =
-		(volatile u32 *) __va(real_mode_header->trampoline_status);
 	/* start_ip had better be page-aligned! */
 	unsigned long start_ip = real_mode_header->trampoline_start;
@@ -1116,9 +1114,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 		}
 	}

-	/* mark "stuck" area as not stuck */
-	*trampoline_status = 0;
-
 	if (x86_platform.legacy.warm_reset) {
 		/*
 		 * Cleanup possible dangling ends...
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 30b0d30d861a..6363761cc74c 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -19,7 +19,6 @@ GLOBAL(real_mode_header)
 	.long	pa_ro_end
 	/* SMP trampoline */
 	.long	pa_trampoline_start
-	.long	pa_trampoline_status
 	.long	pa_trampoline_header
 #ifdef CONFIG_X86_64
 	.long	pa_trampoline_pgd;
diff --git a/arch/x86/realmode/rm/trampoline_32.S b/arch/x86/realmode/rm/trampoline_32.S
index 2dd866c9e21e..1868b158480d 100644
--- a/arch/x86/realmode/rm/trampoline_32.S
+++ b/arch/x86/realmode/rm/trampoline_32.S
@@ -41,9 +41,6 @@ ENTRY(trampoline_start)

 	movl	tr_start, %eax	# where we need to go

-	movl	$0xA5A5A5A5, trampoline_status
-	# write marker for master knows we're running
-
 	/*
 	 * GDT tables in non default location kernel can be beyond 16MB and
 	 * lgdt will not be able to load the address as in real mode default
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 24bb7598774e..aee2b45d83b8 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -49,9 +49,6 @@ ENTRY(trampoline_start)
 	mov	%ax, %es
 	mov	%ax, %ss

-	movl	$0xA5A5A5A5, trampoline_status
-	# write marker for master knows we're running
-
 	# Setup stack
 	movl	$rm_stack_end, %esp
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 7c706772ab59..8d8208dcca24 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -2,7 +2,3 @@
 	.section	".rodata","a"
 	.balign	16
 tr_idt: .fill 1, 6, 0
-
-	.bss
-	.balign	4
-GLOBAL(trampoline_status)	.space	4
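The "no reader" claim behind this removal can be spot-checked mechanically. A hedged sketch against a stand-in source tree — the real check would be something like `git grep trampoline_status` over the kernel sources:

```shell
# Build a tiny stand-in "tree" containing a write-only symbol.
workdir=$(mktemp -d)
cat > "$workdir/a.c" <<'EOF'
volatile unsigned int trampoline_status;
void mark_running(void) { trampoline_status = 0xA5A5A5A5; }
EOF

# List every occurrence of the symbol. In this tree the only hits are
# the definition and a store; no right-hand-side use means no reader,
# so the variable can be removed.
grep -rn 'trampoline_status' "$workdir"
```

When every hit is a definition or an assignment, as here, the symbol is dead state and deleting it cannot change behavior.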
[PATCH] x86/realmode: remove trampoline_status flag
In current code, there is no reader of trampoline_status. It turns out that after commit ce4b1b16502b ("x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"), trampoline_status is not needed any more. Signed-off-by: Pingfan Liu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Mukesh Ojha Cc: Matteo Croce Cc: Len Brown Cc: Pu Wen Cc: Nicolai Stange Cc: Hui Wang Cc: linux-kernel@vger.kernel.org --- arch/x86/include/asm/realmode.h | 1 - arch/x86/kernel/smpboot.c| 5 - arch/x86/realmode/rm/header.S| 1 - arch/x86/realmode/rm/trampoline_32.S | 3 --- arch/x86/realmode/rm/trampoline_64.S | 3 --- arch/x86/realmode/rm/trampoline_common.S | 4 6 files changed, 17 deletions(-) diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h index c536823..09ecc32 100644 --- a/arch/x86/include/asm/realmode.h +++ b/arch/x86/include/asm/realmode.h @@ -20,7 +20,6 @@ struct real_mode_header { u32 ro_end; /* SMP trampoline */ u32 trampoline_start; - u32 trampoline_status; u32 trampoline_header; #ifdef CONFIG_X86_64 u32 trampoline_pgd; diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..2ef10d9 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -965,8 +965,6 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle) static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle, int *cpu0_nmi_registered) { - volatile u32 *trampoline_status = - (volatile u32 *) __va(real_mode_header->trampoline_status); /* start_ip had better be page-aligned! */ unsigned long start_ip = real_mode_header->trampoline_start; @@ -1058,9 +1056,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle, } } - /* mark "stuck" area as not stuck */ - *trampoline_status = 0; - if (x86_platform.legacy.warm_reset) { /* * Cleanup possible dangling ends... 
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 30b0d30..6363761 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -19,7 +19,6 @@ GLOBAL(real_mode_header)
 	.long	pa_ro_end
 	/* SMP trampoline */
 	.long	pa_trampoline_start
-	.long	pa_trampoline_status
 	.long	pa_trampoline_header
 #ifdef CONFIG_X86_64
 	.long	pa_trampoline_pgd;
diff --git a/arch/x86/realmode/rm/trampoline_32.S b/arch/x86/realmode/rm/trampoline_32.S
index 2dd866c..1868b15 100644
--- a/arch/x86/realmode/rm/trampoline_32.S
+++ b/arch/x86/realmode/rm/trampoline_32.S
@@ -41,9 +41,6 @@ ENTRY(trampoline_start)
 	movl	tr_start, %eax	# where we need to go
 
-	movl	$0xA5A5A5A5, trampoline_status
-	# write marker for master knows we're running
-
 	/*
 	 * GDT tables in non default location kernel can be beyond 16MB and
 	 * lgdt will not be able to load the address as in real mode default
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 24bb759..aee2b45 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -49,9 +49,6 @@ ENTRY(trampoline_start)
 	mov	%ax, %es
 	mov	%ax, %ss
 
-	movl	$0xA5A5A5A5, trampoline_status
-	# write marker for master knows we're running
-
 	# Setup stack
 	movl	$rm_stack_end, %esp
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 7c70677..8d8208d 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -2,7 +2,3 @@
 	.section ".rodata","a"
 	.balign	16
 tr_idt: .fill 1, 6, 0
-
-	.bss
-	.balign	4
-GLOBAL(trampoline_status)	.space	4
-- 
2.7.5
Re: [PATCH 2/2] x86/numa: instance all parsed numa node
On Tue, Jul 9, 2019 at 9:34 PM Andy Lutomirski wrote: > > > > > On Jul 9, 2019, at 1:24 AM, Pingfan Liu wrote: > > > >> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner wrote: > >> > >>> On Tue, 9 Jul 2019, Pingfan Liu wrote: > >>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner > >>>> wrote: > >>>> It can and it does. > >>>> > >>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and > >>>> shut the siblings down again after setting CR4.MCE. Actually that's in > >>>> fact > >>>> a 'let's hope no MCE hits before that happened' approach, but that's all > >>>> we > >>>> can do. > >>>> > >>>> If we don't do that then the MCE broadcast can hit a CPU which has some > >>>> firmware initialized state. The result can be a full system lockup, > >>>> triple > >>>> fault etc. > >>>> > >>>> So when the MCE hits a CPU which is still in the crashed kernel lala > >>>> state, > >>>> then all hell breaks lose. > >>> Thank you for the comprehensive explain. With your guide, now, I have > >>> a full understanding of the issue. > >>> > >>> But when I tried to add something to enable CR4.MCE in > >>> crash_nmi_callback(), I realized that it is undo-able in some case (if > >>> crashed, we will not ask an offline smt cpu to online), also it is > >>> needless. "kexec -l/-p" takes the advantage of the cpu state in the > >>> first kernel, where all logical cpu has CR4.MCE=1. > >>> > >>> So kexec is exempt from this bug if the first kernel already do it. > >> > >> No. If the MCE broadcast is handled by a CPU which is stuck in the old > >> kernel stop loop, then it will execute on the old kernel and eventually run > >> into the memory corruption which crashed the old one. > >> > > Yes, you are right. Stuck cpu may execute the old do_machine_check() > > code. But I just found out that we have > > do_machine_check()->__mc_check_crashing_cpu() to against this case. 
> > > > And I think the MCE issue with nr_cpus is not closely related with > > this series, can > > be a separated issue. I had question whether Andy will take it, if > > not, I am glad to do it. > > > > > > Go for it. I’m not familiar enough with the SMP boot stuff that I would be > able to do it any faster than you. I’ll gladly help review it. I had sent out a patch to fix maxcpus "[PATCH] smp: force all cpu to boot once under maxcpus option" But for the case of nrcpus, I think things will not be so easy due to percpu area, and I think it may take a quite different way. Thanks, Pingfan
[PATCH] smp: force all cpu to boot once under maxcpus option
On x86 it's required to boot all logical CPUs at least once so that the init code can get a chance to set CR4.MCE on each CPU. Otherwise, a broadcast MCE observing CR4.MCE=0b on any core will shut down the machine. The option 'nosmt' already complies with this rule. In the case of maxcpus, the initialization of the capped-out cpus may be deferred indefinitely until a user brings them up. This exposes the machine to the risk of a sudden shutdown for an indefinite time. Minimize the risk window by initializing all cpus at boot time.

Signed-off-by: Pingfan Liu
Cc: Thomas Gleixner
Cc: "Peter Zijlstra (Intel)"
Cc: Rik van Riel
Cc: Josh Poimboeuf
Cc: Ingo Molnar
Cc: Jiri Kosina
Cc: Mukesh Ojha
Cc: Greg Kroah-Hartman
Cc: Andy Lutomirski
Cc: linux-kernel@vger.kernel.org
---
 include/linux/smp.h |  1 +
 kernel/cpu.c        | 20 ++++++++++++++++++--
 kernel/smp.c        |  4 ++++
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index a56f08f..9d2c692 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -130,6 +130,7 @@ extern void __init setup_nr_cpu_ids(void);
 extern void __init smp_init(void);
 
 extern int __boot_cpu_id;
+extern bool smp_boot_done;
 
 static inline int get_boot_cpu_id(void)
 {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index ef1c565..ab19dc8 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -439,6 +439,21 @@ static inline bool cpu_smt_allowed(unsigned int cpu)
 static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
 #endif
 
+static inline bool maxcpus_allowed(unsigned int cpu)
+{
+	/* maxcpus only takes effect during system bootup */
+	if (smp_boot_done)
+		return true;
+	if (num_online_cpus() < setup_max_cpus)
+		return true;
+	/*
+	 * maxcpus should allow cpu to set CR4.MCE asap, otherwise the set may
+	 * be deferred indefinitely.
+	 */
+	if (!per_cpu(cpuhp_state, cpu).booted_once)
+		return true;
+	return false;
+}
+
 static inline enum cpuhp_state
 cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)
 {
@@ -525,8 +540,9 @@ static int bringup_wait_for_ap(unsigned int cpu)
 	 * CPU marked itself as booted_once in cpu_notify_starting() so the
 	 * cpu_smt_allowed() check will now return false if this is not the
 	 * primary sibling.
+	 * In case of maxcpus, the capped-out cpus comply with the same rule.
 	 */
-	if (!cpu_smt_allowed(cpu))
+	if (!cpu_smt_allowed(cpu) || !maxcpus_allowed(cpu))
 		return -ECANCELED;
 
 	if (st->target <= CPUHP_AP_ONLINE_IDLE)
@@ -1177,7 +1193,7 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
 		err = -EBUSY;
 		goto out;
 	}
-	if (!cpu_smt_allowed(cpu)) {
+	if (!cpu_smt_allowed(cpu) || !maxcpus_allowed(cpu)) {
 		err = -EPERM;
 		goto out;
 	}
diff --git a/kernel/smp.c b/kernel/smp.c
index d155374..a5f82d53 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -560,6 +560,9 @@ void __init setup_nr_cpu_ids(void)
 		nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
 }
 
+bool smp_boot_done __read_mostly;
+EXPORT_SYMBOL(smp_boot_done);
+
/* Called by boot processor to activate the rest. */
 void __init smp_init(void)
 {
@@ -587,6 +590,7 @@ void __init smp_init(void)
 
 	/* Any cleanup work */
 	smp_cpus_done(setup_max_cpus);
+	smp_boot_done = true;
 }
 
 /*
-- 
2.7.5
Re: [PATCH 2/2] x86/numa: instance all parsed numa node
On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner wrote: > > On Tue, 9 Jul 2019, Pingfan Liu wrote: > > On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner wrote: > > > It can and it does. > > > > > > That's the whole point why we bring up all CPUs in the 'nosmt' case and > > > shut the siblings down again after setting CR4.MCE. Actually that's in > > > fact > > > a 'let's hope no MCE hits before that happened' approach, but that's all > > > we > > > can do. > > > > > > If we don't do that then the MCE broadcast can hit a CPU which has some > > > firmware initialized state. The result can be a full system lockup, triple > > > fault etc. > > > > > > So when the MCE hits a CPU which is still in the crashed kernel lala > > > state, > > > then all hell breaks lose. > > Thank you for the comprehensive explain. With your guide, now, I have > > a full understanding of the issue. > > > > But when I tried to add something to enable CR4.MCE in > > crash_nmi_callback(), I realized that it is undo-able in some case (if > > crashed, we will not ask an offline smt cpu to online), also it is > > needless. "kexec -l/-p" takes the advantage of the cpu state in the > > first kernel, where all logical cpu has CR4.MCE=1. > > > > So kexec is exempt from this bug if the first kernel already do it. > > No. If the MCE broadcast is handled by a CPU which is stuck in the old > kernel stop loop, then it will execute on the old kernel and eventually run > into the memory corruption which crashed the old one. > Yes, you are right. Stuck cpu may execute the old do_machine_check() code. But I just found out that we have do_machine_check()->__mc_check_crashing_cpu() to against this case. And I think the MCE issue with nr_cpus is not closely related with this series, can be a separated issue. I had question whether Andy will take it, if not, I am glad to do it. Thanks and regards, Pingfan
Re: [PATCH 2/2] x86/numa: instance all parsed numa node
On Tue, Jul 9, 2019 at 1:53 AM Andy Lutomirski wrote: > > > > > On Jul 8, 2019, at 3:35 AM, Thomas Gleixner wrote: > > > >> On Mon, 8 Jul 2019, Pingfan Liu wrote: > >>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner wrote: > >>> > >>>> On Fri, 5 Jul 2019, Pingfan Liu wrote: > >>>> > >>>> I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus > >>>> option > >>>> is used to speed up kdump process, so it is not a rare case. > >>> > >>> But fundamentally wrong, really. > >>> > >>> The rest of the CPUs are in a half baken state and any broadcast event, > >>> e.g. MCE or a stray IPI, will result in a undiagnosable crash. > >> Very appreciate if you can pay more word on it? I tried to figure out > >> your point, but fail. > >> > >> For "a half baked state", I think you concern about LAPIC state, and I > >> expand this point like the following: > > > > It's not only the APIC state. It's the state of the CPUs in general. > > > >> For IPI: when capture kernel BSP is up, the rest cpus are still loop > >> inside crash_nmi_callback(), so there is no way to eject new IPI from > >> these cpu. Also we disable_local_APIC(), which effectively prevent the > >> LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not > >> occur in crash case. > > > > Fair enough for the IPI case. > > > >> For MCE, I am not sure whether it can broadcast or not between cpus, > >> but as my understanding, it can not. Then is it a problem? > > > > It can and it does. > > > > That's the whole point why we bring up all CPUs in the 'nosmt' case and > > shut the siblings down again after setting CR4.MCE. Actually that's in fact > > a 'let's hope no MCE hits before that happened' approach, but that's all we > > can do. > > > > If we don't do that then the MCE broadcast can hit a CPU which has some > > firmware initialized state. The result can be a full system lockup, triple > > fault etc. 
> > > > So when the MCE hits a CPU which is still in the crashed kernel lala state, > > then all hell breaks lose. > > > >> From another view point, is there any difference between nr_cpus=1 and > >> nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1, > >> it does for nr_cpus=1. > > > > Anything less than the actual number of present CPUs is problematic except > > you use the 'let's hope nothing happens' approach. We could add an option > > to stop the bringup at the early online state similar to what we do for > > 'nosmt'. > > > > > > How about we change nr_cpus to do that instead so we never have to have this > conversation again? Are you interest in implementing this? Thanks, Pingfan
Re: [PATCH 2/2] x86/numa: instance all parsed numa node
On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner wrote: > > On Mon, 8 Jul 2019, Pingfan Liu wrote: > > On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner wrote: > > > > > > On Fri, 5 Jul 2019, Pingfan Liu wrote: > > > > > > > I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus > > > > option > > > > is used to speed up kdump process, so it is not a rare case. > > > > > > But fundamentally wrong, really. > > > > > > The rest of the CPUs are in a half baken state and any broadcast event, > > > e.g. MCE or a stray IPI, will result in a undiagnosable crash. > > Very appreciate if you can pay more word on it? I tried to figure out > > your point, but fail. > > > > For "a half baked state", I think you concern about LAPIC state, and I > > expand this point like the following: > > It's not only the APIC state. It's the state of the CPUs in general. For other states, "kexec -l " is a kind of boot loader and the boot cpu complies with the kernel boot up provision. As for the rest AP, they are pinged at loop before receiving #INIT IPI. Then the left things is the same as SMP boot up. > > > For IPI: when capture kernel BSP is up, the rest cpus are still loop > > inside crash_nmi_callback(), so there is no way to eject new IPI from > > these cpu. Also we disable_local_APIC(), which effectively prevent the > > LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not > > occur in crash case. > > Fair enough for the IPI case. > > > For MCE, I am not sure whether it can broadcast or not between cpus, > > but as my understanding, it can not. Then is it a problem? > > It can and it does. > > That's the whole point why we bring up all CPUs in the 'nosmt' case and > shut the siblings down again after setting CR4.MCE. Actually that's in fact > a 'let's hope no MCE hits before that happened' approach, but that's all we > can do. > > If we don't do that then the MCE broadcast can hit a CPU which has some > firmware initialized state. 
The result can be a full system lockup, triple > fault etc. > > So when the MCE hits a CPU which is still in the crashed kernel lala state, > then all hell breaks lose. Thank you for the comprehensive explain. With your guide, now, I have a full understanding of the issue. But when I tried to add something to enable CR4.MCE in crash_nmi_callback(), I realized that it is undo-able in some case (if crashed, we will not ask an offline smt cpu to online), also it is needless. "kexec -l/-p" takes the advantage of the cpu state in the first kernel, where all logical cpu has CR4.MCE=1. So kexec is exempt from this bug if the first kernel already do it. > > > From another view point, is there any difference between nr_cpus=1 and > > nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1, > > it does for nr_cpus=1. > > Anything less than the actual number of present CPUs is problematic except > you use the 'let's hope nothing happens' approach. We could add an option > to stop the bringup at the early online state similar to what we do for > 'nosmt'. Yes, we should do something about nr_cpus param for the first kernel. Thanks, Pingfan
Re: [PATCH 2/2] x86/numa: instance all parsed numa node
On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner wrote:
>
> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>
> > I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
> > is used to speed up kdump process, so it is not a rare case.
>
> But fundamentally wrong, really.
>
> The rest of the CPUs are in a half baken state and any broadcast event,
> e.g. MCE or a stray IPI, will result in a undiagnosable crash.

I would very much appreciate it if you could say more about this. I tried to figure out your point, but failed.

For "a half baked state", I think you are concerned about the LAPIC state, and I expand on this point as follows:

For IPI: when the capture kernel BSP is up, the rest of the cpus are still looping inside crash_nmi_callback(), so there is no way for a new IPI to be issued from these cpus. Also we call disable_local_APIC(), which effectively prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which will not occur in the crash case.

For MCE, I am not sure whether it can broadcast between cpus or not, but to my understanding, it cannot. Then is it a problem?

From another viewpoint, is there any difference between nr_cpus=1 and nr_cpus>1 in the crashing case? If a stray IPI raises an issue for nr_cpus>1, it does so for nr_cpus=1 as well.

Thanks,
Pingfan
[PATCH 2/2] x86/numa: instance all parsed numa node
by "git grep for_each_online_node | grep sched" or the discussion at the tail of [3].

Since Michal has no time to continue on this issue, I pick it up again. This patch drops the change to the "node online" definition in [2], i.e. it still considers a node as online if it has either cpu or memory. It keeps the rest of the main idea in [2] of initializing all parsed nodes on x86. Other archs need extra dedicated effort.

[1]: https://patchwork.kernel.org/patch/10738733/
[2]: https://lkml.org/lkml/2019/2/13/253
[3]: https://lore.kernel.org/lkml/20190528182011.gg1...@dhcp22.suse.cz/T/

Signed-off-by: Pingfan Liu
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Mike Rapoport
Cc: Tony Luck
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Mel Gorman
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Stephen Rothwell
Cc: Qian Cai
Cc: Barret Rhoden
Cc: Bjorn Helgaas
Cc: David Rientjes
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 17 ++++++++++++-----
 mm/page_alloc.c    | 11 ++++++++---
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b48d507..5f5b558 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -732,6 +732,15 @@ static void __init init_memory_less_node(int nid)
 	 */
 }
 
+static void __init init_parsed_rest_node(void)
+{
+	int node;
+
+	for_each_node_mask(node, node_possible_map)
+		if (!node_online(node))
+			init_memory_less_node(node);
+}
+
 /*
  * Setup early cpu_to_node.
 *
@@ -752,6 +761,7 @@ void __init init_cpu_to_node(void)
 	u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
 
 	BUG_ON(cpu_to_apicid == NULL);
+	init_parsed_rest_node();
 
 	for_each_possible_cpu(cpu) {
 		int node = numa_cpu_node(cpu);
@@ -759,11 +769,8 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node)) {
-			init_memory_less_node(node);
-			node_set_online(nid);
-		}
-
+		if (!node_online(node))
+			node_set_online(node);
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8a..5d8db00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,10 +5662,15 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/* In rare case, node_zonelist() hits offline node */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
-
-			build_zonelists(pgdat);
+			/*
+			 * This condition can be removed on archs, with all
+			 * possible node instanced.
+			 */
+			if (pgdat)
+				build_zonelists(pgdat);
 		}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-- 
2.7.5
[PATCH 1/2] x86/numa: carve node online semantics out of alloc_node_data()
Node online means either memory or cpu is online. But there is a requirement to instantiate a pglist_data which has neither cpu nor memory online (refer to [2/2]). So carve the online semantics out of alloc_node_data(), and call node_set_online() where either memory or cpu comes online.

Signed-off-by: Pingfan Liu
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Mike Rapoport
Cc: Tony Luck
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Oscar Salvador
Cc: Pavel Tatashin
Cc: Mel Gorman
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Stephen Rothwell
Cc: Qian Cai
Cc: Barret Rhoden
Cc: Bjorn Helgaas
Cc: David Rientjes
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..b48d507 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,8 +213,6 @@ static void __init alloc_node_data(int nid)
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }
 
 /**
@@ -589,6 +587,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -760,8 +759,10 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
+		if (!node_online(node)) {
 			init_memory_less_node(node);
+			node_set_online(node);
+		}
 
 		numa_set_node(cpu, node);
 	}
-- 
2.7.5
[PATCH] mm/page_isolate: change the prototype of undo_isolate_page_range()
undo_isolate_page_range() never fails, so no need to return value. Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Michal Hocko Cc: Oscar Salvador Cc: Qian Cai Cc: Anshuman Khandual Cc: linux-kernel@vger.kernel.org --- include/linux/page-isolation.h | 2 +- mm/page_isolation.c| 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 280ae96..1099c2f 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. * target range is [start_pfn, end_pfn) */ -int +void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, unsigned migratetype); diff --git a/mm/page_isolation.c b/mm/page_isolation.c index e3638a5..89c19c0 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -230,7 +230,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, /* * Make isolated pages available again. */ -int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, +void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, unsigned migratetype) { unsigned long pfn; @@ -247,7 +247,6 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, continue; unset_migratetype_isolate(page, migratetype); } - return 0; } /* * Test all pages in the range is free(means isolated) or not. -- 2.7.5
Re: [PATCHv5] mm/gup: speed up check_and_migrate_cma_pages() on huge page
On Fri, Jun 28, 2019 at 7:25 AM Andrew Morton wrote: > > On Thu, 27 Jun 2019 13:15:45 +0800 Pingfan Liu wrote: > > > Both hugetlb and thp locate on the same migration type of pageblock, since > > they are allocated from a free_list[]. Based on this fact, it is enough to > > check on a single subpage to decide the migration type of the whole huge > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, > > similar on other archs. > > > > Furthermore, when executing isolate_huge_page(), it avoid taking global > > hugetlb_lock many times, and meanless remove/add to the local link list > > cma_page_list. > > > > Thanks, looks good to me. Have any timing measurements been taken? Not yet. It is a little hard to force huge page to be allocated CMA area. Should I provide the measurements? > > > ... > > > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1336,25 +1336,30 @@ static long check_and_migrate_cma_pages(struct > > task_struct *tsk, > > struct vm_area_struct **vmas, > > unsigned int gup_flags) > > { > > - long i; > > + long i, step; > > I'll make these variables unsigned long - to match nr_pages and because > we have no need for them to be negative. OK, will fix it. Thanks, Pingfan > > > ...
[PATCHv5] mm/gup: speed up check_and_migrate_cma_pages() on huge page
Both hugetlb and thp are located on the same migration type of pageblock, since they are allocated from a free_list[]. Based on this fact, it is enough to check a single subpage to decide the migration type of the whole huge page. This saves (2M/4K - 1) loop iterations for pmd_huge on x86, and similarly on other archs.

Furthermore, when executing isolate_huge_page(), it avoids taking the global hugetlb_lock many times, and needlessly removing pages from and re-adding them to the local list cma_page_list.

Signed-off-by: Pingfan Liu
Cc: Andrew Morton
Cc: Ira Weiny
Cc: Mike Rapoport
Cc: "Kirill A. Shutemov"
Cc: Thomas Gleixner
Cc: John Hubbard
Cc: "Aneesh Kumar K.V"
Cc: Christoph Hellwig
Cc: Keith Busch
Cc: Mike Kravetz
Cc: Linux-kernel@vger.kernel.org
---
v3 -> v4: fix C language precedence issue
v4 -> v5: drop the check PageCompound() and improve notes

 mm/gup.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddde097..1deaad2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1336,25 +1336,30 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 					struct vm_area_struct **vmas,
 					unsigned int gup_flags)
 {
-	long i;
+	long i, step;
 	bool drain_allow = true;
 	bool migrate_allow = true;
 	LIST_HEAD(cma_page_list);
 
 check_again:
-	for (i = 0; i < nr_pages; i++) {
+	for (i = 0; i < nr_pages;) {
+
+		struct page *head = compound_head(pages[i]);
+
+		/*
+		 * gup may start from a tail page. Advance step by the left
+		 * part.
+		 */
+		step = (1 << compound_order(head)) - (pages[i] - head);
 		/*
 		 * If we get a page from the CMA zone, since we are going to
 		 * be pinning these entries, we might as well move them out
 		 * of the CMA zone if possible.
		 */
-		if (is_migrate_cma_page(pages[i])) {
-
-			struct page *head = compound_head(pages[i]);
-
-			if (PageHuge(head)) {
+		if (is_migrate_cma_page(head)) {
+			if (PageHuge(head))
 				isolate_huge_page(head, &cma_page_list);
-			} else {
+			else {
 				if (!PageLRU(head) && drain_allow) {
 					lru_add_drain_all();
 					drain_allow = false;
@@ -1369,6 +1374,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 				}
 			}
 		}
+
+		i += step;
 	}
 
 	if (!list_empty(&cma_page_list)) {
-- 
2.7.5
Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
On Wed, Jun 26, 2019 at 9:57 PM Michal Hocko wrote:
>
> On Mon 24-06-19 16:42:20, Pingfan Liu wrote:
> > Hi Michal,
> >
> > What about dropping the change of the online definition of your patch,
> > just do the following?
>
> I am sorry but I am unlikely to find some more time to look into this. I
> am willing to help reviewing but I will not find enough time to focus on
> this to fix up the patch. Are you willing to work on this and finish the
> patch? It is a very tricky area with side effects really hard to see in
> advance but going with a robust fix is definitely worth the effort.

Yeah, the bug looks trivial but is hard to fix, and will take a lot of time. Given the current situation, it is hard to meet your original design target. I will have a try, limited to this bug.

Thanks,
Pingfan

> Thanks!
> --
> Michal Hocko
> SUSE Labs
Re: [PATCHv4] mm/gup: speed up check_and_migrate_cma_pages() on huge page
On Thu, Jun 27, 2019 at 7:15 AM Andrew Morton wrote: > > On Wed, 26 Jun 2019 21:10:00 +0800 Pingfan Liu wrote: > > > Both hugetlb and thp locate on the same migration type of pageblock, since > > they are allocated from a free_list[]. Based on this fact, it is enough to > > check on a single subpage to decide the migration type of the whole huge > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, > > similar on other archs. > > > > Furthermore, when executing isolate_huge_page(), it avoid taking global > > hugetlb_lock many times, and meanless remove/add to the local link list > > cma_page_list. > > > > ... > > > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct > > task_struct *tsk, > > LIST_HEAD(cma_page_list); > > > > check_again: > > - for (i = 0; i < nr_pages; i++) { > > + for (i = 0; i < nr_pages;) { > > + > > + struct page *head = compound_head(pages[i]); > > + long step = 1; > > + > > + if (PageCompound(head)) > > I suspect this would work correctly if the PageCompound test was simply > removed. Not that I'm really suggesting that it be removed - dunno. Yes, you are right. compound_order() can safely run on normal page, which means we can drop the check PageCompound(). > > > + step = (1 << compound_order(head)) - (pages[i] - > > head); > > I don't understand this statement. Why does the position of this page > in the pages[] array affect anything? There's an assumption about the > contents of the skipped pages, I assume. Because gup may start from a tail page. > > Could we please get a comment in here whcih fully explains the logic > and any assumptions? Sure, I will. Thanks, Pingfan >
[PATCHv4] mm/gup: speed up check_and_migrate_cma_pages() on huge page
Both hugetlb and thp locate on the same migration type of pageblock, since they are allocated from a free_list[]. Based on this fact, it is enough to check on a single subpage to decide the migration type of the whole huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, similar on other archs. Furthermore, when executing isolate_huge_page(), it avoid taking global hugetlb_lock many times, and meanless remove/add to the local link list cma_page_list. Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Ira Weiny Cc: Mike Rapoport Cc: "Kirill A. Shutemov" Cc: Thomas Gleixner Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Christoph Hellwig Cc: Keith Busch Cc: Mike Kravetz Cc: Linux-kernel@vger.kernel.org --- v3 -> v4: fix C language precedence issue mm/gup.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index ddde097..ffca55b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, LIST_HEAD(cma_page_list); check_again: - for (i = 0; i < nr_pages; i++) { + for (i = 0; i < nr_pages;) { + + struct page *head = compound_head(pages[i]); + long step = 1; + + if (PageCompound(head)) + step = (1 << compound_order(head)) - (pages[i] - head); /* * If we get a page from the CMA zone, since we are going to * be pinning these entries, we might as well move them out * of the CMA zone if possible. */ - if (is_migrate_cma_page(pages[i])) { - - struct page *head = compound_head(pages[i]); - - if (PageHuge(head)) { + if (is_migrate_cma_page(head)) { + if (PageHuge(head)) isolate_huge_page(head, _page_list); - } else { + else { if (!PageLRU(head) && drain_allow) { lru_add_drain_all(); drain_allow = false; @@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, } } } + + i += step; } if (!list_empty(_page_list)) { -- 2.7.5
Re: [PATCHv3] mm/gup: speed up check_and_migrate_cma_pages() on huge page
On Wed, Jun 26, 2019 at 1:51 AM Ira Weiny wrote: > > On Tue, Jun 25, 2019 at 10:13:19PM +0800, Pingfan Liu wrote: > > Both hugetlb and thp locate on the same migration type of pageblock, since > > they are allocated from a free_list[]. Based on this fact, it is enough to > > check on a single subpage to decide the migration type of the whole huge > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, > > similar on other archs. > > > > Furthermore, when executing isolate_huge_page(), it avoid taking global > > hugetlb_lock many times, and meanless remove/add to the local link list > > cma_page_list. > > > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Ira Weiny > > Cc: Mike Rapoport > > Cc: "Kirill A. Shutemov" > > Cc: Thomas Gleixner > > Cc: John Hubbard > > Cc: "Aneesh Kumar K.V" > > Cc: Christoph Hellwig > > Cc: Keith Busch > > Cc: Mike Kravetz > > Cc: Linux-kernel@vger.kernel.org > > --- > > v2 -> v3: fix page order to size convertion > > > > mm/gup.c | 19 --- > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index ddde097..03cc1f4 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct > > task_struct *tsk, > > LIST_HEAD(cma_page_list); > > > > check_again: > > - for (i = 0; i < nr_pages; i++) { > > + for (i = 0; i < nr_pages;) { > > + > > + struct page *head = compound_head(pages[i]); > > + long step = 1; > > + > > + if (PageCompound(head)) > > + step = 1 << compound_order(head) - (pages[i] - head); > > Check your precedence here. > > step = (1 << compound_order(head)) - (pages[i] - head); OK. Thanks, Pingfan
[PATCHv3] mm/gup: speed up check_and_migrate_cma_pages() on huge page
Both hugetlb and thp locate on the same migration type of pageblock, since they are allocated from a free_list[]. Based on this fact, it is enough to check on a single subpage to decide the migration type of the whole huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, similar on other archs. Furthermore, when executing isolate_huge_page(), it avoid taking global hugetlb_lock many times, and meanless remove/add to the local link list cma_page_list. Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Ira Weiny Cc: Mike Rapoport Cc: "Kirill A. Shutemov" Cc: Thomas Gleixner Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Christoph Hellwig Cc: Keith Busch Cc: Mike Kravetz Cc: Linux-kernel@vger.kernel.org --- v2 -> v3: fix page order to size convertion mm/gup.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index ddde097..03cc1f4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, LIST_HEAD(cma_page_list); check_again: - for (i = 0; i < nr_pages; i++) { + for (i = 0; i < nr_pages;) { + + struct page *head = compound_head(pages[i]); + long step = 1; + + if (PageCompound(head)) + step = 1 << compound_order(head) - (pages[i] - head); /* * If we get a page from the CMA zone, since we are going to * be pinning these entries, we might as well move them out * of the CMA zone if possible. */ - if (is_migrate_cma_page(pages[i])) { - - struct page *head = compound_head(pages[i]); - - if (PageHuge(head)) { + if (is_migrate_cma_page(head)) { + if (PageHuge(head)) isolate_huge_page(head, _page_list); - } else { + else { if (!PageLRU(head) && drain_allow) { lru_add_drain_all(); drain_allow = false; @@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, } } } + + i += step; } if (!list_empty(_page_list)) { -- 2.7.5
Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
Hi Michal,

What about dropping the change of the online definition in your patch, and just doing the following?

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..9c087c3 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -749,13 +749,12 @@ static void __init init_memory_less_node(int nid)
  */
 void __init init_cpu_to_node(void)
 {
-	int cpu;
+	int cpu, node;
 	u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
 
 	BUG_ON(cpu_to_apicid == NULL);
 
-	for_each_possible_cpu(cpu) {
-		int node = numa_cpu_node(cpu);
+	for_each_node_mask(node, numa_nodes_parsed) {
 		if (node == NUMA_NO_NODE)
 			continue;
@@ -765,6 +764,10 @@ void __init init_cpu_to_node(void)
 		numa_set_node(cpu, node);
 	}
+	for_each_possible_cpu(cpu) {
+		int node = numa_cpu_node(cpu);
+		numa_set_node(cpu, node);
+	}
 }

Thanks,
Pingfan

On Fri, Jun 21, 2019 at 9:55 PM Michal Hocko wrote:
>
> On Fri 21-06-19 09:17:58, Qian Cai wrote:
> > Sigh...
> >
> > I don't see any benefit to keep the broken commit,
> >
> > "x86, numa: always initialize all possible nodes"
> >
> > for so long in linux-next that just prevent x86 NUMA machines with any
> > memory-less node from booting.
> >
> > Andrew, maybe it is time to drop this patch until Michal found some time
> > to fix it properly.
>
> Yes, please drop the patch for now, Andrew. I thought I could get to
> this but time is just scarce.
> --
> Michal Hocko
> SUSE Labs
Re: [PATCH] mm/hugetlb: allow gigantic page allocation to migrate away smaller huge page
On Mon, Jun 24, 2019 at 1:16 PM Anshuman Khandual wrote:
>
>
> On 06/24/2019 09:51 AM, Pingfan Liu wrote:
> > The current pfn_range_valid_gigantic() rejects a pud huge page allocation
> > if there is a pmd huge page inside the candidate range.
> >
> > But the pud huge resource is rarer, since it must align on 1GB on x86. It
> > is worth allowing pmd huge pages to be migrated away to make room for a
> > pud huge page.
> >
> > The same logic applies to pgd and pud huge pages.
>
> The huge page in the range can either be a THP or HugeTLB and migrating them
> has different costs and chances of success. THP migration will involve
> splitting if THP migration is not enabled and all related TLB related costs.
> Are you sure that a PUD HugeTLB allocation really should go through these ?

PUD hugetlb already drives out PMD thp in the current code. This patch just
wants to make a PUD hugetlb allocation survive existing PMD hugetlb pages.

> Is there any
> guarantee that after migration of multiple PMD sized THP/HugeTLB pages on the
> given range, the allocation request for PUD will succeed ?

Migration is complicated, but as I understand it, if there is no gup pin in
the range and there is enough memory (including swap), then it will succeed.
> > > > > Signed-off-by: Pingfan Liu > > Cc: Mike Kravetz > > Cc: Oscar Salvador > > Cc: David Hildenbrand > > Cc: Andrew Morton > > Cc: linux-kernel@vger.kernel.org > > --- > > mm/hugetlb.c | 8 +--- > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index ac843d3..02d1978 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1081,7 +1081,11 @@ static bool pfn_range_valid_gigantic(struct zone *z, > > unsigned long start_pfn, unsigned long nr_pages) > > { > > unsigned long i, end_pfn = start_pfn + nr_pages; > > - struct page *page; > > + struct page *page = pfn_to_page(start_pfn); > > + > > + if (PageHuge(page)) > > + if (compound_order(compound_head(page)) >= nr_pages) > > + return false; > > > > for (i = start_pfn; i < end_pfn; i++) { > > if (!pfn_valid(i)) > > @@ -1098,8 +1102,6 @@ static bool pfn_range_valid_gigantic(struct zone *z, > > if (page_count(page) > 0) > > return false; > > > > - if (PageHuge(page)) > > - return false; > > } > > > > return true; > > > > So except in the case where there is a bigger huge page in the range this will > attempt migrating everything on the way. As mentioned before if it all this is > a good idea, it needs to differentiate between HugeTLB and THP and also take > into account costs of migrations and chance of subsequence allocation attempt > into account. Sorry, but I think this logic is only for hugetlb. The caller alloc_gigantic_page() is only used inside mm/hugetlb.c, not by huge_memory.c. Thanks, Pingfan
Re: [PATCH] mm/hugetlb: allow gigantic page allocation to migrate away smaller huge page
On Mon, Jun 24, 2019 at 1:03 PM Ira Weiny wrote: > > On Mon, Jun 24, 2019 at 12:21:08PM +0800, Pingfan Liu wrote: > > The current pfn_range_valid_gigantic() rejects the pud huge page allocation > > if there is a pmd huge page inside the candidate range. > > > > But pud huge resource is more rare, which should align on 1GB on x86. It is > > worth to allow migrating away pmd huge page to make room for a pud huge > > page. > > > > The same logic is applied to pgd and pud huge pages. > > I'm sorry but I don't quite understand why we should do this. Is this a bug > or > an optimization? It sounds like an optimization. Yes, an optimization. It can help us to success to allocate a 1GB hugetlb if there is some 2MB hugetlb sit in the candidate range. Allocation 1GB hugetlb requires more tough condition, not only a continuous 1GB range, but also aligned on GB. While allocating a 2MB range is easier. > > > > > Signed-off-by: Pingfan Liu > > Cc: Mike Kravetz > > Cc: Oscar Salvador > > Cc: David Hildenbrand > > Cc: Andrew Morton > > Cc: linux-kernel@vger.kernel.org > > --- > > mm/hugetlb.c | 8 +--- > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index ac843d3..02d1978 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1081,7 +1081,11 @@ static bool pfn_range_valid_gigantic(struct zone *z, > > unsigned long start_pfn, unsigned long nr_pages) > > { > > unsigned long i, end_pfn = start_pfn + nr_pages; > > - struct page *page; > > + struct page *page = pfn_to_page(start_pfn); > > + > > + if (PageHuge(page)) > > + if (compound_order(compound_head(page)) >= nr_pages) > > I don't think you want compound_order() here. Yes, your are right. 
Thanks, Pingfan > > Ira > > > + return false; > > > > for (i = start_pfn; i < end_pfn; i++) { > > if (!pfn_valid(i)) > > @@ -1098,8 +1102,6 @@ static bool pfn_range_valid_gigantic(struct zone *z, > > if (page_count(page) > 0) > > return false; > > > > - if (PageHuge(page)) > > - return false; > > } > > > > return true; > > -- > > 2.7.5 > >
Re: [PATCHv2] mm/gup: speed up check_and_migrate_cma_pages() on huge page
On Mon, Jun 24, 2019 at 1:32 PM Pingfan Liu wrote: > > On Mon, Jun 24, 2019 at 12:43 PM Ira Weiny wrote: > > > > On Mon, Jun 24, 2019 at 12:12:41PM +0800, Pingfan Liu wrote: > > > Both hugetlb and thp locate on the same migration type of pageblock, since > > > they are allocated from a free_list[]. Based on this fact, it is enough to > > > check on a single subpage to decide the migration type of the whole huge > > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, > > > similar on other archs. > > > > > > Furthermore, when executing isolate_huge_page(), it avoid taking global > > > hugetlb_lock many times, and meanless remove/add to the local link list > > > cma_page_list. > > > > > > Signed-off-by: Pingfan Liu > > > Cc: Andrew Morton > > > Cc: Ira Weiny > > > Cc: Mike Rapoport > > > Cc: "Kirill A. Shutemov" > > > Cc: Thomas Gleixner > > > Cc: John Hubbard > > > Cc: "Aneesh Kumar K.V" > > > Cc: Christoph Hellwig > > > Cc: Keith Busch > > > Cc: Mike Kravetz > > > Cc: Linux-kernel@vger.kernel.org > > > --- > > > mm/gup.c | 19 --- > > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > index ddde097..544f5de 100644 > > > --- a/mm/gup.c > > > +++ b/mm/gup.c > > > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct > > > task_struct *tsk, > > > LIST_HEAD(cma_page_list); > > > > > > check_again: > > > - for (i = 0; i < nr_pages; i++) { > > > + for (i = 0; i < nr_pages;) { > > > + > > > + struct page *head = compound_head(pages[i]); > > > + long step = 1; > > > + > > > + if (PageCompound(head)) > > > + step = compound_order(head) - (pages[i] - head); > > > > Sorry if I missed this last time. compound_order() is not correct here. > For thp, prep_transhuge_page()->prep_compound_page()->set_compound_order(). > For smaller hugetlb, > prep_new_huge_page()->prep_compound_page()->set_compound_order(). > For gigantic page, prep_compound_gigantic_page()->set_compound_order(). 
> > Do I miss anything?
> Oh, got it. It should be 1 << compound_order(head).
> Thanks,
> Pingfan
> [...]
Re: [PATCHv2] mm/gup: speed up check_and_migrate_cma_pages() on huge page
On Mon, Jun 24, 2019 at 12:43 PM Ira Weiny wrote: > > On Mon, Jun 24, 2019 at 12:12:41PM +0800, Pingfan Liu wrote: > > Both hugetlb and thp locate on the same migration type of pageblock, since > > they are allocated from a free_list[]. Based on this fact, it is enough to > > check on a single subpage to decide the migration type of the whole huge > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, > > similar on other archs. > > > > Furthermore, when executing isolate_huge_page(), it avoid taking global > > hugetlb_lock many times, and meanless remove/add to the local link list > > cma_page_list. > > > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Ira Weiny > > Cc: Mike Rapoport > > Cc: "Kirill A. Shutemov" > > Cc: Thomas Gleixner > > Cc: John Hubbard > > Cc: "Aneesh Kumar K.V" > > Cc: Christoph Hellwig > > Cc: Keith Busch > > Cc: Mike Kravetz > > Cc: Linux-kernel@vger.kernel.org > > --- > > mm/gup.c | 19 --- > > 1 file changed, 12 insertions(+), 7 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index ddde097..544f5de 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct > > task_struct *tsk, > > LIST_HEAD(cma_page_list); > > > > check_again: > > - for (i = 0; i < nr_pages; i++) { > > + for (i = 0; i < nr_pages;) { > > + > > + struct page *head = compound_head(pages[i]); > > + long step = 1; > > + > > + if (PageCompound(head)) > > + step = compound_order(head) - (pages[i] - head); > > Sorry if I missed this last time. compound_order() is not correct here. For thp, prep_transhuge_page()->prep_compound_page()->set_compound_order(). For smaller hugetlb, prep_new_huge_page()->prep_compound_page()->set_compound_order(). For gigantic page, prep_compound_gigantic_page()->set_compound_order(). Do I miss anything? Thanks, Pingfan [...]
[PATCH] mm/hugetlb: allow gigantic page allocation to migrate away smaller huge page
The current pfn_range_valid_gigantic() rejects a pud huge page allocation if there is a pmd huge page inside the candidate range.

But the pud huge resource is rarer, since it must align on 1GB on x86. It is worth allowing pmd huge pages to be migrated away to make room for a pud huge page.

The same logic applies to pgd and pud huge pages.

Signed-off-by: Pingfan Liu
Cc: Mike Kravetz
Cc: Oscar Salvador
Cc: David Hildenbrand
Cc: Andrew Morton
Cc: linux-kernel@vger.kernel.org
---
 mm/hugetlb.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac843d3..02d1978 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1081,7 +1081,11 @@ static bool pfn_range_valid_gigantic(struct zone *z,
 		unsigned long start_pfn, unsigned long nr_pages)
 {
 	unsigned long i, end_pfn = start_pfn + nr_pages;
-	struct page *page;
+	struct page *page = pfn_to_page(start_pfn);
+
+	if (PageHuge(page))
+		if (compound_order(compound_head(page)) >= nr_pages)
+			return false;
 
 	for (i = start_pfn; i < end_pfn; i++) {
 		if (!pfn_valid(i))
@@ -1098,8 +1102,6 @@ static bool pfn_range_valid_gigantic(struct zone *z,
 		if (page_count(page) > 0)
 			return false;
 
-		if (PageHuge(page))
-			return false;
 	}
 
 	return true;
--
2.7.5
[PATCHv2] mm/gup: speed up check_and_migrate_cma_pages() on huge page
Both hugetlb and thp locate on the same migration type of pageblock, since they are allocated from a free_list[]. Based on this fact, it is enough to check on a single subpage to decide the migration type of the whole huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, similar on other archs. Furthermore, when executing isolate_huge_page(), it avoid taking global hugetlb_lock many times, and meanless remove/add to the local link list cma_page_list. Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Ira Weiny Cc: Mike Rapoport Cc: "Kirill A. Shutemov" Cc: Thomas Gleixner Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Christoph Hellwig Cc: Keith Busch Cc: Mike Kravetz Cc: Linux-kernel@vger.kernel.org --- mm/gup.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index ddde097..544f5de 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, LIST_HEAD(cma_page_list); check_again: - for (i = 0; i < nr_pages; i++) { + for (i = 0; i < nr_pages;) { + + struct page *head = compound_head(pages[i]); + long step = 1; + + if (PageCompound(head)) + step = compound_order(head) - (pages[i] - head); /* * If we get a page from the CMA zone, since we are going to * be pinning these entries, we might as well move them out * of the CMA zone if possible. */ - if (is_migrate_cma_page(pages[i])) { - - struct page *head = compound_head(pages[i]); - - if (PageHuge(head)) { + if (is_migrate_cma_page(head)) { + if (PageHuge(head)) isolate_huge_page(head, _page_list); - } else { + else { if (!PageLRU(head) && drain_allow) { lru_add_drain_all(); drain_allow = false; @@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, } } } + + i += step; } if (!list_empty(_page_list)) { -- 2.7.5
Re: [PATCH] mm/gup: speed up check_and_migrate_cma_pages() on huge page
On Sat, Jun 22, 2019 at 2:13 AM Ira Weiny wrote: > > On Fri, Jun 21, 2019 at 06:15:16PM +0800, Pingfan Liu wrote: > > Both hugetlb and thp locate on the same migration type of pageblock, since > > they are allocated from a free_list[]. Based on this fact, it is enough to > > check on a single subpage to decide the migration type of the whole huge > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, > > similar on other archs. > > > > Furthermore, when executing isolate_huge_page(), it avoid taking global > > hugetlb_lock many times, and meanless remove/add to the local link list > > cma_page_list. > > > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Ira Weiny > > Cc: Mike Rapoport > > Cc: "Kirill A. Shutemov" > > Cc: Thomas Gleixner > > Cc: John Hubbard > > Cc: "Aneesh Kumar K.V" > > Cc: Christoph Hellwig > > Cc: Keith Busch > > Cc: Mike Kravetz > > Cc: Linux-kernel@vger.kernel.org > > --- > > mm/gup.c | 13 + > > 1 file changed, 9 insertions(+), 4 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index ddde097..2eecb16 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1342,16 +1342,19 @@ static long check_and_migrate_cma_pages(struct > > task_struct *tsk, > > LIST_HEAD(cma_page_list); > > > > check_again: > > - for (i = 0; i < nr_pages; i++) { > > + for (i = 0; i < nr_pages;) { > > + > > + struct page *head = compound_head(pages[i]); > > + long step = 1; > > + > > + if (PageCompound(head)) > > + step = compound_order(head) - (pages[i] - head); > > /* > >* If we get a page from the CMA zone, since we are going to > >* be pinning these entries, we might as well move them out > >* of the CMA zone if possible. > >*/ > > if (is_migrate_cma_page(pages[i])) { > > I like this but I think for consistency I would change this pages[i] to be > head. Even though it is not required. Yes, agree. Thank you for your good suggestion. 
Regards, Pingfan > > Ira > > > - > > - struct page *head = compound_head(pages[i]); > > - > > if (PageHuge(head)) { > > isolate_huge_page(head, _page_list); > > } else { > > @@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct > > task_struct *tsk, > > } > > } > > } > > + > > + i += step; > > } > > > > if (!list_empty(_page_list)) { > > -- > > 2.7.5 > >
[PATCH] mm/gup: speed up check_and_migrate_cma_pages() on huge page
Both hugetlb and thp locate on the same migration type of pageblock, since they are allocated from a free_list[]. Based on this fact, it is enough to check on a single subpage to decide the migration type of the whole huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, similar on other archs. Furthermore, when executing isolate_huge_page(), it avoid taking global hugetlb_lock many times, and meanless remove/add to the local link list cma_page_list. Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Ira Weiny Cc: Mike Rapoport Cc: "Kirill A. Shutemov" Cc: Thomas Gleixner Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Christoph Hellwig Cc: Keith Busch Cc: Mike Kravetz Cc: Linux-kernel@vger.kernel.org --- mm/gup.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index ddde097..2eecb16 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1342,16 +1342,19 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, LIST_HEAD(cma_page_list); check_again: - for (i = 0; i < nr_pages; i++) { + for (i = 0; i < nr_pages;) { + + struct page *head = compound_head(pages[i]); + long step = 1; + + if (PageCompound(head)) + step = compound_order(head) - (pages[i] - head); /* * If we get a page from the CMA zone, since we are going to * be pinning these entries, we might as well move them out * of the CMA zone if possible. */ if (is_migrate_cma_page(pages[i])) { - - struct page *head = compound_head(pages[i]); - if (PageHuge(head)) { isolate_huge_page(head, _page_list); } else { @@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, } } } + + i += step; } if (!list_empty(_page_list)) { -- 2.7.5
Re: [PATCHv4 2/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path
Cc Mike, David, who is an expert of hugetlb and thp On Fri, Jun 14, 2019 at 5:37 AM Ira Weiny wrote: > > On Thu, Jun 13, 2019 at 06:45:01PM +0800, Pingfan Liu wrote: > > FOLL_LONGTERM suggests a pin which is going to be given to hardware and > > can't move. It would truncate CMA permanently and should be excluded. > > > > FOLL_LONGTERM has already been checked in the slow path, but not checked in > > the fast path, which means a possible leak of CMA page to longterm pinned > > requirement through this crack. > > > > Place a check in gup_pte_range() in the fast path. > > > > Signed-off-by: Pingfan Liu > > Cc: Ira Weiny > > Cc: Andrew Morton > > Cc: Mike Rapoport > > Cc: Dan Williams > > Cc: Matthew Wilcox > > Cc: John Hubbard > > Cc: "Aneesh Kumar K.V" > > Cc: Keith Busch > > Cc: Christoph Hellwig > > Cc: Shuah Khan > > Cc: linux-kernel@vger.kernel.org > > --- > > mm/gup.c | 26 ++ > > 1 file changed, 26 insertions(+) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index 766ae54..de1b03f 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -1757,6 +1757,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long > > addr, unsigned long end, > > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > > page = pte_page(pte); > > > > + /* > > + * FOLL_LONGTERM suggests a pin given to hardware. Prevent it > > + * from truncating CMA area > > + */ > > + if (unlikely(flags & FOLL_LONGTERM) && > > + is_migrate_cma_page(page)) > > + goto pte_unmap; > > + > > head = try_get_compound_head(page, 1); > > if (!head) > > goto pte_unmap; > > @@ -1900,6 +1908,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, > > unsigned long addr, > > refs++; > > } while (addr += PAGE_SIZE, addr != end); > > > > + if (unlikely(flags & FOLL_LONGTERM) && > > + is_migrate_cma_page(page)) { > > + *nr -= refs; > > + return 0; > > + } > > + > > Why can't we place this check before the while loop and skip subtracting the > page count? Yes, that will be better. 
> > Can is_migrate_cma_page() operate on any "subpage" of a compound page? For gigantic page, __alloc_gigantic_page() allocate from MIGRATE_MOVABLE pageblock. For page order < MAX_ORDER, pages are allocated from either free_list[MIGRATE_MOVABLE] or free_list[MIGRATE_CMA]. So all subpage have the same migrate type. Thanks, Pingfan > > Here this calls is_magrate_cma_page() on the tail page of the compound page. > > I'm not an expert on compound pages nor cma handling so is this ok? > > It seems like you need to call is_migrate_cma_page() on each page within the > while loop? > > > head = try_get_compound_head(pmd_page(orig), refs); > > if (!head) { > > *nr -= refs; > > @@ -1941,6 +1955,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, > > unsigned long addr, > > refs++; > > } while (addr += PAGE_SIZE, addr != end); > > > > + if (unlikely(flags & FOLL_LONGTERM) && > > + is_migrate_cma_page(page)) { > > + *nr -= refs; > > + return 0; > > + } > > + > > Same comment here. > > > head = try_get_compound_head(pud_page(orig), refs); > > if (!head) { > > *nr -= refs; > > @@ -1978,6 +1998,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, > > unsigned long addr, > > refs++; > > } while (addr += PAGE_SIZE, addr != end); > > > > + if (unlikely(flags & FOLL_LONGTERM) && > > + is_migrate_cma_page(page)) { > > + *nr -= refs; > > + return 0; > > + } > > + > > And here. > > Ira > > > head = try_get_compound_head(pgd_page(orig), refs); > > if (!head) { > > *nr -= refs; > > -- > > 2.7.5 > >
[PATCHv4 3/3] mm/gup_benchmark: add LONGTERM_BENCHMARK test in gup fast path
Introduce a GUP_FAST_LONGTERM_BENCHMARK ioctl to test longterm pin in the gup fast path.

Signed-off-by: Pingfan Liu
Cc: Ira Weiny
Cc: Andrew Morton
Cc: Mike Rapoport
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: John Hubbard
Cc: "Aneesh Kumar K.V"
Cc: Keith Busch
Cc: Christoph Hellwig
Cc: Shuah Khan
Cc: linux-kernel@vger.kernel.org
---
 mm/gup_benchmark.c                         | 11 +++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 10 +++++++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d..83f3378 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,8 +6,9 @@
 #include
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
 						 pages + i);
 			break;
+		case GUP_FAST_LONGTERM_BENCHMARK:
+			nr = get_user_pages_fast(addr, nr,
+						 (gup->flags & 1) | FOLL_LONGTERM,
+						 pages + i);
+			break;
 		case GUP_LONGTERM_BENCHMARK:
 			nr = get_user_pages(addr, nr,
 					    (gup->flags & 1) | FOLL_LONGTERM,
@@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
+	case GUP_FAST_LONGTERM_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 		break;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c0534e2..ade8acb 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,8 +15,9 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -37,7 +38,7 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
 		switch (opt) {
 		case 'm':
 			size = atoi(optarg) * MB;
@@ -54,6 +55,9 @@ int main(int argc, char **argv)
 		case 'T':
 			thp = 0;
 			break;
+		case 'l':
+			cmd = GUP_FAST_LONGTERM_BENCHMARK;
+			break;
 		case 'L':
 			cmd = GUP_LONGTERM_BENCHMARK;
 			break;
--
2.7.5
[PATCHv4 1/3] mm/gup: rename nr as nr_pinned in get_user_pages_fast()
To better reflect the held state of pages and make the code self-explanatory, rename nr to nr_pinned.

Signed-off-by: Pingfan Liu
Cc: Ira Weiny
Cc: Andrew Morton
Cc: Mike Rapoport
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: John Hubbard
Cc: "Aneesh Kumar K.V"
Cc: Keith Busch
Cc: Christoph Hellwig
Cc: Shuah Khan
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..766ae54 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2216,7 +2216,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr_pinned = 0, ret = 0;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -2231,25 +2231,25 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	if (gup_fast_permitted(start, nr_pages)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, gup_flags, pages, &nr);
+		gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
 		local_irq_enable();
-		ret = nr;
+		ret = nr_pinned;
 	}
 
-	if (nr < nr_pages) {
+	if (nr_pinned < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
-		start += nr << PAGE_SHIFT;
-		pages += nr;
+		start += nr_pinned << PAGE_SHIFT;
+		pages += nr_pinned;
 
-		ret = __gup_longterm_unlocked(start, nr_pages - nr,
+		ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
 					      gup_flags, pages);
 
 		/* Have to be a bit careful with return values */
-		if (nr > 0) {
+		if (nr_pinned > 0) {
 			if (ret < 0)
-				ret = nr;
+				ret = nr_pinned;
 			else
-				ret += nr;
+				ret += nr_pinned;
 		}
 	}
--
2.7.5
[PATCHv4 2/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path
FOLL_LONGTERM suggests a pin which is going to be given to hardware and can't move. It would truncate CMA permanently and should be excluded.

FOLL_LONGTERM has already been checked in the slow path, but not in the fast path, which means CMA pages could leak to a longterm-pinned user through this crack.

Place a check in gup_pte_range() in the fast path.

Signed-off-by: Pingfan Liu
Cc: Ira Weiny
Cc: Andrew Morton
Cc: Mike Rapoport
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: John Hubbard
Cc: "Aneesh Kumar K.V"
Cc: Keith Busch
Cc: Christoph Hellwig
Cc: Shuah Khan
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 766ae54..de1b03f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1757,6 +1757,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		/*
+		 * FOLL_LONGTERM suggests a pin given to hardware. Prevent it
+		 * from truncating CMA area
+		 */
+		if (unlikely(flags & FOLL_LONGTERM) &&
+			is_migrate_cma_page(page))
+			goto pte_unmap;
+
 		head = try_get_compound_head(page, 1);
 		if (!head)
 			goto pte_unmap;
@@ -1900,6 +1908,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pmd_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
@@ -1941,6 +1955,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pud_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
@@ -1978,6 +1998,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pgd_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
--
2.7.5
[PATCHv4 0/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path
These three patches have no dependency on each other, but they relate to the same purpose: improving get_user_pages_fast() (patch [2/3]). Put them together.

v3 -> v4:
  Place the check on FOLL_LONGTERM in gup_pte_range() instead of
  get_user_pages_fast()

Cc: Ira Weiny
Cc: Andrew Morton
Cc: Mike Rapoport
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: John Hubbard
Cc: "Aneesh Kumar K.V"
Cc: Keith Busch
Cc: Christoph Hellwig
Cc: Shuah Khan
Cc: linux-kernel@vger.kernel.org

Pingfan Liu (3):
  mm/gup: rename nr as nr_pinned in get_user_pages_fast()
  mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path
  mm/gup_benchmark: add LONGTERM_BENCHMARK test in gup fast path

 mm/gup.c                                   | 46 ++++++++++++++++++------
 mm/gup_benchmark.c                         | 11 ++++--
 tools/testing/selftests/vm/gup_benchmark.c | 10 +++--
 3 files changed, 52 insertions(+), 15 deletions(-)

--
2.7.5
Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Thu, Jun 13, 2019 at 7:49 AM Ira Weiny wrote: > > On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote: > > On Tue, Jun 11, 2019 at 04:29:11PM +, Weiny, Ira wrote: > > > > Pingfan Liu writes: > > > > > > > > > As for FOLL_LONGTERM, it is checked in the slow path > > > > > __gup_longterm_unlocked(). But it is not checked in the fast path, > > > > > which means a possible leak of CMA page to longterm pinned requirement > > > > > through this crack. > > > > > > > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t > > > > dax check we need vma to ensure whether a long term pin is allowed or > > > > not. > > > > If FOLL_LONGTERM is specified we should fallback to slow path. > > > > > > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX. But > > > it does this while walking the page tables. I missed the CMA case and > > > Pingfan's patch fixes this. We could check for CMA pages while walking > > > the page tables but most agreed that it was not worth it. For DAX we > > > already had checks for *_devmap() so it was easier to put the > > > FOLL_LONGTERM checks there. > > > > > Then for CMA pages, are you suggesting something like: > > I'm not suggesting this. OK, then I send out v4. > > Sorry I wrote this prior to seeing the numbers in your other email. Given > the numbers it looks like performing the check whilst walking the tables is > worth the extra complexity. I was just trying to summarize the thread. I > don't think we should disallow FOLL_LONGTERM because it only affects CMA and > DAX. Other pages will be fine with FOLL_LONGTERM. Why penalize every call if > we don't have to. Also in the case of DAX the use of vma will be going > away...[1] Eventually... ;-) A good feature. Trying to catch up. Thanks, Pingfan
Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Wed, Jun 12, 2019 at 12:46 AM Ira Weiny wrote: > > On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote: > > On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote: > > > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard wrote: > > > > > > > > On 6/5/19 7:19 PM, Pingfan Liu wrote: > > > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton > > > > > wrote: > > > > ... > > > > >>> --- a/mm/gup.c > > > > >>> +++ b/mm/gup.c > > > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned > > > > >>> long start, int nr_pages, > > > > >>> return ret; > > > > >>> } > > > > >>> > > > > >>> +#ifdef CONFIG_CMA > > > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page > > > > >>> **pages) > > > > >>> +{ > > > > >>> + int i; > > > > >>> + > > > > >>> + for (i = 0; i < nr_pinned; i++) > > > > >>> + if (is_migrate_cma_page(pages[i])) { > > > > >>> + put_user_pages(pages + i, nr_pinned - i); > > > > >>> + return i; > > > > >>> + } > > > > >>> + > > > > >>> + return nr_pinned; > > > > >>> +} > > > > >> > > > > >> There's no point in inlining this. > > > > > OK, will drop it in V4. > > > > > > > > > >> > > > > >> The code seems inefficient. If it encounters a single CMA page it > > > > >> can > > > > >> end up discarding a possibly significant number of non-CMA pages. I > > > > > The trick is the page is not be discarded, in fact, they are still be > > > > > referrenced by pte. We just leave the slow path to pick up the non-CMA > > > > > pages again. > > > > > > > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is > > > > >> rare. But could we avoid this (and the second pass across pages[]) > > > > >> by > > > > >> checking for a CMA page within gup_pte_range()? > > > > > It will spread the same logic to hugetlb pte and normal pte. And no > > > > > improvement in performance due to slow path. So I think maybe it is > > > > > not worth. 
> > > > > > > > > >> > > > > > > > > I think the concern is: for the successful gup_fast case with no CMA > > > > pages, this patch is adding another complete loop through all the > > > > pages. In the fast case. > > > > > > > > If the check were instead done as part of the gup_pte_range(), then > > > > it would be a little more efficient for that case. > > > > > > > > As for whether it's worth it, *probably* this is too small an effect to > > > > measure. > > > > But in order to attempt a measurement: running fio > > > > (https://github.com/axboe/fio) > > > > with O_DIRECT on an NVMe drive, might shed some light. Here's an > > > > fio.conf file > > > > that Jan Kara and Tom Talpey helped me come up with, for related > > > > testing: > > > > > > > > [reader] > > > > direct=1 > > > > ioengine=libaio > > > > blocksize=4096 > > > > size=1g > > > > numjobs=1 > > > > rw=read > > > > iodepth=64 > > > > > > Unable to get a NVME device to have a test. And when testing fio on the > > tranditional disk, I got the error "fio: engine libaio not loadable > > fio: failed to load engine > > fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared > > object file: No such file or directory" > > > > But I found a test case which can be slightly adjusted to met the aim. > > It is tools/testing/selftests/vm/gup_benchmark.c > > > > Test enviroment: > > MemTotal: 264079324 kB > > MemFree:262306788 kB > > CmaTotal: 0 kB > > CmaFree: 0 kB > > on AMD EPYC 7601 > > > > Test command: > > gup_benchmark -r 100 -n 64 > > gup_benchmark -r 100 -n 64 -l > > where -r stands for repeat times, -n is nr_pa
Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Tue, Jun 11, 2019 at 04:29:11PM +, Weiny, Ira wrote: > > Pingfan Liu writes: > > > > > As for FOLL_LONGTERM, it is checked in the slow path > > > __gup_longterm_unlocked(). But it is not checked in the fast path, > > > which means a possible leak of CMA page to longterm pinned requirement > > > through this crack. > > > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t > > dax check we need vma to ensure whether a long term pin is allowed or not. > > If FOLL_LONGTERM is specified we should fallback to slow path. > > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX. But it > does this while walking the page tables. I missed the CMA case and Pingfan's > patch fixes this. We could check for CMA pages while walking the page tables > but most agreed that it was not worth it. For DAX we already had checks for > *_devmap() so it was easier to put the FOLL_LONGTERM checks there. > Then for CMA pages, are you suggesting something like: diff --git a/mm/gup.c b/mm/gup.c index 42a47c0..8bf3cc3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, if (unlikely(!access_ok((void __user *)start, len))) return -EFAULT; + if (unlikely(gup_flags & FOLL_LONGTERM)) + goto slow; if (gup_fast_permitted(start, nr_pages)) { local_irq_disable(); gup_pgd_range(addr, end, gup_flags, pages, &nr); @@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, ret = nr; } +slow: if (nr < nr_pages) { /* Try to get the remaining pages with get_user_pages */ start += nr << PAGE_SHIFT; Thanks, Pingfan
Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote: > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard wrote: > > > > On 6/5/19 7:19 PM, Pingfan Liu wrote: > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton > > > wrote: > > ... > > >>> --- a/mm/gup.c > > >>> +++ b/mm/gup.c > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long > > >>> start, int nr_pages, > > >>> return ret; > > >>> } > > >>> > > >>> +#ifdef CONFIG_CMA > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages) > > >>> +{ > > >>> + int i; > > >>> + > > >>> + for (i = 0; i < nr_pinned; i++) > > >>> + if (is_migrate_cma_page(pages[i])) { > > >>> + put_user_pages(pages + i, nr_pinned - i); > > >>> + return i; > > >>> + } > > >>> + > > >>> + return nr_pinned; > > >>> +} > > >> > > >> There's no point in inlining this. > > > OK, will drop it in V4. > > > > > >> > > >> The code seems inefficient. If it encounters a single CMA page it can > > >> end up discarding a possibly significant number of non-CMA pages. I > > > The trick is the page is not be discarded, in fact, they are still be > > > referrenced by pte. We just leave the slow path to pick up the non-CMA > > > pages again. > > > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is > > >> rare. But could we avoid this (and the second pass across pages[]) by > > >> checking for a CMA page within gup_pte_range()? > > > It will spread the same logic to hugetlb pte and normal pte. And no > > > improvement in performance due to slow path. So I think maybe it is > > > not worth. > > > > > >> > > > > I think the concern is: for the successful gup_fast case with no CMA > > pages, this patch is adding another complete loop through all the > > pages. In the fast case. > > > > If the check were instead done as part of the gup_pte_range(), then > > it would be a little more efficient for that case. > > > > As for whether it's worth it, *probably* this is too small an effect to > > measure. 
> > But in order to attempt a measurement: running fio > > (https://github.com/axboe/fio) > > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file > > that Jan Kara and Tom Talpey helped me come up with, for related testing: > > > > [reader] > > direct=1 > > ioengine=libaio > > blocksize=4096 > > size=1g > > numjobs=1 > > rw=read > > iodepth=64 > > Unable to get an NVMe device to test with. And when testing fio on a traditional disk, I got the error "fio: engine libaio not loadable fio: failed to load engine fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory" But I found a test case which can be slightly adjusted to meet the aim. It is tools/testing/selftests/vm/gup_benchmark.c Test environment: MemTotal: 264079324 kB MemFree: 262306788 kB CmaTotal: 0 kB CmaFree: 0 kB on AMD EPYC 7601 Test command: gup_benchmark -r 100 -n 64 gup_benchmark -r 100 -n 64 -l where -r stands for repeat times, -n is nr_pages param for get_user_pages_fast(), -l is a new option to test FOLL_LONGTERM in fast path, see a patch at the tail. Test result: w/o 477.80; w/o-l 481.07; a 481.80; a-l 640.41; b 466.24 (question a: b outperforms w/o?); b-l 529.74. Where w/o is the baseline without any patch, using v5.2-rc2; a is this series; b does the check in gup_pte_range(). '-l' means FOLL_LONGTERM. I am surprised that b-l shows about a 17% improvement over a-l: (640.41 - 529.74)/640.41. As for "question a: b outperforms w/o?", I cannot figure out why; maybe it can be considered as variance. Based on the above result, I think it is better to do the check inside gup_pte_range(). Any comment? Thanks, > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I > will try to bring out the result.
> > Thanks, > Pingfan > --- Patch to do check inside gup_pte_range() diff --git a/mm/gup.c b/mm/gup.c index 2ce3091..ba213a0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Fri, Jun 7, 2019 at 5:17 AM John Hubbard wrote: > > On 6/5/19 7:19 PM, Pingfan Liu wrote: > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton > > wrote: > ... > >>> --- a/mm/gup.c > >>> +++ b/mm/gup.c > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long > >>> start, int nr_pages, > >>> return ret; > >>> } > >>> > >>> +#ifdef CONFIG_CMA > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages) > >>> +{ > >>> + int i; > >>> + > >>> + for (i = 0; i < nr_pinned; i++) > >>> + if (is_migrate_cma_page(pages[i])) { > >>> + put_user_pages(pages + i, nr_pinned - i); > >>> + return i; > >>> + } > >>> + > >>> + return nr_pinned; > >>> +} > >> > >> There's no point in inlining this. > > OK, will drop it in V4. > > > >> > >> The code seems inefficient. If it encounters a single CMA page it can > >> end up discarding a possibly significant number of non-CMA pages. I > > The trick is the page is not be discarded, in fact, they are still be > > referrenced by pte. We just leave the slow path to pick up the non-CMA > > pages again. > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is > >> rare. But could we avoid this (and the second pass across pages[]) by > >> checking for a CMA page within gup_pte_range()? > > It will spread the same logic to hugetlb pte and normal pte. And no > > improvement in performance due to slow path. So I think maybe it is > > not worth. > > > >> > > I think the concern is: for the successful gup_fast case with no CMA > pages, this patch is adding another complete loop through all the > pages. In the fast case. > > If the check were instead done as part of the gup_pte_range(), then > it would be a little more efficient for that case. > > As for whether it's worth it, *probably* this is too small an effect to > measure. > But in order to attempt a measurement: running fio > (https://github.com/axboe/fio) > with O_DIRECT on an NVMe drive, might shed some light. 
Here's an fio.conf file > that Jan Kara and Tom Talpey helped me come up with, for related testing: > > [reader] > direct=1 > ioengine=libaio > blocksize=4096 > size=1g > numjobs=1 > rw=read > iodepth=64 > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I will try to bring out the result. Thanks, Pingfan
Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton wrote: > > On Wed, 5 Jun 2019 17:10:19 +0800 Pingfan Liu wrote: > > > As for FOLL_LONGTERM, it is checked in the slow path > > __gup_longterm_unlocked(). But it is not checked in the fast path, which > > means a possible leak of CMA page to longterm pinned requirement through > > this crack. > > > > Place a check in the fast path. > > I'm not actually seeing a description (in either the existing code or > this changelog or patch) an explanation of *why* we wish to exclude CMA > pages from longterm pinning. > What about a short description like this: FOLL_LONGTERM suggests a pin which is going to be given to hardware and can't move. It would truncate CMA permanently and should be excluded. > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long > > start, int nr_pages, > > return ret; > > } > > > > +#ifdef CONFIG_CMA > > +static inline int reject_cma_pages(int nr_pinned, struct page **pages) > > +{ > > + int i; > > + > > + for (i = 0; i < nr_pinned; i++) > > + if (is_migrate_cma_page(pages[i])) { > > + put_user_pages(pages + i, nr_pinned - i); > > + return i; > > + } > > + > > + return nr_pinned; > > +} > > There's no point in inlining this. OK, will drop it in V4. > > The code seems inefficient. If it encounters a single CMA page it can > end up discarding a possibly significant number of non-CMA pages. I The trick is that the pages are not discarded; in fact, they are still referenced by the PTEs. We just leave the slow path to pick up the non-CMA pages again. > guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is > rare. But could we avoid this (and the second pass across pages[]) by > checking for a CMA page within gup_pte_range()? It will spread the same logic to hugetlb pte and normal pte. And no improvement in performance due to slow path. So I think maybe it is not worth it. 
> > > +#else > > +static inline int reject_cma_pages(int nr_pinned, struct page **pages) > > +{ > > + return nr_pinned; > > +} > > +#endif > > + > > /** > > * get_user_pages_fast() - pin user pages in memory > > * @start: starting user address > > @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int > > nr_pages, > > ret = nr; > > } > > > > + if (unlikely(gup_flags & FOLL_LONGTERM) && nr) > > + nr = reject_cma_pages(nr, pages); > > + > > This would be a suitable place to add a comment explaining why we're > doing this... Would add one comment "FOLL_LONGTERM suggests a pin given to hardware and rarely returned." Thanks for your kind review. Regards, Pingfan
[PATCHv3 2/2] mm/gup: rename nr as nr_pinned in get_user_pages_fast()
To better reflect the held state of pages and make code self-explaining, rename nr as nr_pinned. Signed-off-by: Pingfan Liu Cc: Ira Weiny Cc: Andrew Morton Cc: Mike Rapoport Cc: Dan Williams Cc: Matthew Wilcox Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Keith Busch Cc: Christoph Hellwig Cc: linux-kernel@vger.kernel.org --- mm/gup.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 0e59af9..9b3c8a6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2236,7 +2236,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { unsigned long addr, len, end; - int nr = 0, ret = 0; + int nr_pinned = 0, ret = 0; start &= PAGE_MASK; addr = start; @@ -2251,28 +2251,28 @@ int get_user_pages_fast(unsigned long start, int nr_pages, if (gup_fast_permitted(start, nr_pages)) { local_irq_disable(); - gup_pgd_range(addr, end, gup_flags, pages, &nr); + gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned); local_irq_enable(); - ret = nr; + ret = nr_pinned; } - if (unlikely(gup_flags & FOLL_LONGTERM) && nr) - nr = reject_cma_pages(nr, pages); + if (unlikely(gup_flags & FOLL_LONGTERM) && nr_pinned) + nr_pinned = reject_cma_pages(nr_pinned, pages); - if (nr < nr_pages) { + if (nr_pinned < nr_pages) { /* Try to get the remaining pages with get_user_pages */ - start += nr << PAGE_SHIFT; - pages += nr; + start += nr_pinned << PAGE_SHIFT; + pages += nr_pinned; - ret = __gup_longterm_unlocked(start, nr_pages - nr, + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags, pages); /* Have to be a bit careful with return values */ - if (nr > 0) { + if (nr_pinned > 0) { if (ret < 0) - ret = nr; + ret = nr_pinned; else - ret += nr; + ret += nr_pinned; } } -- 2.7.5
[PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
As for FOLL_LONGTERM, it is checked in the slow path __gup_longterm_unlocked(). But it is not checked in the fast path, which means a possible leak of CMA page to longterm pinned requirement through this crack. Place a check in the fast path. Signed-off-by: Pingfan Liu Cc: Ira Weiny Cc: Andrew Morton Cc: Mike Rapoport Cc: Dan Williams Cc: Matthew Wilcox Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Keith Busch Cc: Christoph Hellwig Cc: linux-kernel@vger.kernel.org --- mm/gup.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index f173fcb..0e59af9 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, return ret; } +#ifdef CONFIG_CMA +static inline int reject_cma_pages(int nr_pinned, struct page **pages) +{ + int i; + + for (i = 0; i < nr_pinned; i++) + if (is_migrate_cma_page(pages[i])) { + put_user_pages(pages + i, nr_pinned - i); + return i; + } + + return nr_pinned; +} +#else +static inline int reject_cma_pages(int nr_pinned, struct page **pages) +{ + return nr_pinned; +} +#endif + /** * get_user_pages_fast() - pin user pages in memory * @start: starting user address @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, ret = nr; } + if (unlikely(gup_flags & FOLL_LONGTERM) && nr) + nr = reject_cma_pages(nr, pages); + if (nr < nr_pages) { /* Try to get the remaining pages with get_user_pages */ start += nr << PAGE_SHIFT; -- 2.7.5
Re: [PATCH] mm/gup: remove unnecessary check against CMA in __gup_longterm_locked()
On Tue, Jun 4, 2019 at 4:30 PM Aneesh Kumar K.V wrote: > > On 6/4/19 12:56 PM, Pingfan Liu wrote: > > The PF_MEMALLOC_NOCMA is set by memalloc_nocma_save(), which is finally > > cast to ~__GFP_MOVABLE. So __get_user_pages_locked() will get pages from > > the non-CMA area and pin them. There is no need to > > check_and_migrate_cma_pages(). > > > That is not completely correct. We can fault in those pages outside > get_user_pages_longterm at which point those pages can get allocated > from the CMA region. memalloc_nocma_save() was added as an optimization to > avoid unnecessary page migration. Yes, you are right. Thanks, Pingfan
[PATCH] mm/gup: remove unnecessary check against CMA in __gup_longterm_locked()
The PF_MEMALLOC_NOCMA is set by memalloc_nocma_save(), which is finally cast to ~__GFP_MOVABLE. So __get_user_pages_locked() will get pages from the non-CMA area and pin them. There is no need to check_and_migrate_cma_pages(). Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Ira Weiny Cc: Dan Williams Cc: Thomas Gleixner Cc: Mike Rapoport Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Keith Busch Cc: linux-kernel@vger.kernel.org --- mm/gup.c | 146 --- 1 file changed, 146 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index f173fcb..9d931d1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1275,149 +1275,6 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) return false; } -#ifdef CONFIG_CMA -static struct page *new_non_cma_page(struct page *page, unsigned long private) -{ - /* -* We want to make sure we allocate the new page from the same node -* as the source page. -*/ - int nid = page_to_nid(page); - /* -* Trying to allocate a page for migration. Ignore allocation -* failure warnings. We don't force __GFP_THISNODE here because -* this node here is the node where we have CMA reservation and -* in some case these nodes will have really less non movable -* allocation memory. -*/ - gfp_t gfp_mask = GFP_USER | __GFP_NOWARN; - - if (PageHighMem(page)) - gfp_mask |= __GFP_HIGHMEM; - -#ifdef CONFIG_HUGETLB_PAGE - if (PageHuge(page)) { - struct hstate *h = page_hstate(page); - /* -* We don't want to dequeue from the pool because pool pages will -* mostly be from the CMA region. -*/ - return alloc_migrate_huge_page(h, gfp_mask, nid, NULL); - } -#endif - if (PageTransHuge(page)) { - struct page *thp; - /* -* ignore allocation failure warnings -*/ - gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN; - - /* -* Remove the movable mask so that we don't allocate from -* CMA area again. 
-*/ - thp_gfpmask &= ~__GFP_MOVABLE; - thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER); - if (!thp) - return NULL; - prep_transhuge_page(thp); - return thp; - } - - return __alloc_pages_node(nid, gfp_mask, 0); -} - -static long check_and_migrate_cma_pages(struct task_struct *tsk, - struct mm_struct *mm, - unsigned long start, - unsigned long nr_pages, - struct page **pages, - struct vm_area_struct **vmas, - unsigned int gup_flags) -{ - long i; - bool drain_allow = true; - bool migrate_allow = true; - LIST_HEAD(cma_page_list); - -check_again: - for (i = 0; i < nr_pages; i++) { - /* -* If we get a page from the CMA zone, since we are going to -* be pinning these entries, we might as well move them out -* of the CMA zone if possible. -*/ - if (is_migrate_cma_page(pages[i])) { - - struct page *head = compound_head(pages[i]); - - if (PageHuge(head)) { - isolate_huge_page(head, &cma_page_list); - } else { - if (!PageLRU(head) && drain_allow) { - lru_add_drain_all(); - drain_allow = false; - } - - if (!isolate_lru_page(head)) { - list_add_tail(&head->lru, &cma_page_list); - mod_node_page_state(page_pgdat(head), - NR_ISOLATED_ANON + - page_is_file_cache(head), - hpage_nr_pages(head)); - } - } - } - } - - if (!list_empty(&cma_page_list)) { - /* -* drop the above get_user_pages reference. -*/ - for (i = 0; i < nr_pages; i++) - put_page(pages[i]); - - if (migrate_pages(&cma_page_list, new_non_cma_page, - NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) { - /* -* some of the pages fail
Re: [PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Tue, Jun 4, 2019 at 3:08 PM Christoph Hellwig wrote: > > On Mon, Jun 03, 2019 at 04:56:10PM -0700, Ira Weiny wrote: > > On Mon, Jun 03, 2019 at 09:42:06AM -0700, Christoph Hellwig wrote: > > > > +#if defined(CONFIG_CMA) > > > > > > You can just use #ifdef here. > > > > > > > +static inline int reject_cma_pages(int nr_pinned, unsigned int > > > > gup_flags, > > > > + struct page **pages) > > > > > > Please use two instead of one tab to indent the continuing line of > > > a function declaration. > > > > > > > +{ > > > > + if (unlikely(gup_flags & FOLL_LONGTERM)) { > > > > > > IMHO it would be a little nicer if we could move this into the caller. > > > > FWIW we already had this discussion and thought it better to put this here. > > > > https://lkml.org/lkml/2019/5/30/1565 > > I don't see any discussion like this. FYI, this is what I mean, > code might be easier than words: > > > diff --git a/mm/gup.c b/mm/gup.c > index ddde097cf9e4..62d770b18e2c 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2197,6 +2197,27 @@ static int __gup_longterm_unlocked(unsigned long > start, int nr_pages, > return ret; > } > > +#ifdef CONFIG_CMA > +static int reject_cma_pages(struct page **pages, int nr_pinned) > +{ > + int i = 0; > + > + for (i = 0; i < nr_pinned; i++) > + if (is_migrate_cma_page(pages[i])) { > + put_user_pages(pages + i, nr_pinned - i); > + return i; > + } > + } > + > + return nr_pinned; > +} > +#else > +static inline int reject_cma_pages(struct page **pages, int nr_pinned) > +{ > + return nr_pinned; > +} > +#endif /* CONFIG_CMA */ > + > /** > * get_user_pages_fast() - pin user pages in memory > * @start: starting user address > @@ -2237,6 +2258,9 @@ int get_user_pages_fast(unsigned long start, int > nr_pages, > ret = nr; > } > > + if (nr && unlikely(gup_flags & FOLL_LONGTERM)) > + nr = reject_cma_pages(pages, nr); > + Yeah. Looks better to keep reject_cma_pages() away from gup flags. 
> if (nr < nr_pages) { > /* Try to get the remaining pages with get_user_pages */ > start += nr << PAGE_SHIFT;
Re: [PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Tue, Jun 4, 2019 at 3:17 PM Christoph Hellwig wrote: > > On Tue, Jun 04, 2019 at 03:13:21PM +0800, Pingfan Liu wrote: > > Is it a convention? scripts/checkpatch.pl can not detect it. Could you > > show me some light so later I can avoid it? > > If you look at most kernel code you can see two conventions: > > - double tabe indent > - indent to the start of the first agument line > > Everything else is rather unusual. OK. Thanks
Re: [PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Tue, Jun 4, 2019 at 12:42 AM Christoph Hellwig wrote: > > > +#if defined(CONFIG_CMA) > > You can just use #ifdef here. > OK. > > +static inline int reject_cma_pages(int nr_pinned, unsigned int gup_flags, > > + struct page **pages) > > Please use two instead of one tab to indent the continuing line of > a function declaration. > Is it a convention? scripts/checkpatch.pl cannot detect it. Could you shed some light on it so I can avoid it later? Thanks for your review. Regards, Pingfan > > +{ > > + if (unlikely(gup_flags & FOLL_LONGTERM)) { > > IMHO it would be a little nicer if we could move this into the caller.
[PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
As for FOLL_LONGTERM, it is checked in the slow path __gup_longterm_unlocked(). But it is not checked in the fast path, which means a possible leak of CMA page to longterm pinned requirement through this crack. Place a check in the fast path. Signed-off-by: Pingfan Liu Cc: Ira Weiny Cc: Andrew Morton Cc: Mike Rapoport Cc: Dan Williams Cc: Matthew Wilcox Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Keith Busch Cc: linux-kernel@vger.kernel.org --- mm/gup.c | 24 1 file changed, 24 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index f173fcb..6fe2feb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2196,6 +2196,29 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, return ret; } +#if defined(CONFIG_CMA) +static inline int reject_cma_pages(int nr_pinned, unsigned int gup_flags, + struct page **pages) +{ + if (unlikely(gup_flags & FOLL_LONGTERM)) { + int i = 0; + + for (i = 0; i < nr_pinned; i++) + if (is_migrate_cma_page(pages[i])) { + put_user_pages(pages + i, nr_pinned - i); + return i; + } + } + return nr_pinned; +} +#else +static inline int reject_cma_pages(int nr_pinned, unsigned int gup_flags, + struct page **pages) +{ + return nr_pinned; +} +#endif + /** * get_user_pages_fast() - pin user pages in memory * @start: starting user address @@ -2236,6 +2259,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, ret = nr; } + nr = reject_cma_pages(nr, gup_flags, pages); if (nr < nr_pages) { /* Try to get the remaining pages with get_user_pages */ start += nr << PAGE_SHIFT; -- 2.7.5
[PATCHv2 2/2] mm/gup: rename nr as nr_pinned in get_user_pages_fast()
To better reflect the held state of pages and make code self-explaining, rename nr as nr_pinned. Signed-off-by: Pingfan Liu Cc: Ira Weiny Cc: Andrew Morton Cc: Mike Rapoport Cc: Dan Williams Cc: Matthew Wilcox Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Keith Busch Cc: linux-kernel@vger.kernel.org --- mm/gup.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 6fe2feb..106ab22 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2239,7 +2239,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { unsigned long addr, len, end; - int nr = 0, ret = 0; + int nr_pinned = 0, ret = 0; start &= PAGE_MASK; addr = start; @@ -2254,26 +2254,26 @@ int get_user_pages_fast(unsigned long start, int nr_pages, if (gup_fast_permitted(start, nr_pages)) { local_irq_disable(); - gup_pgd_range(addr, end, gup_flags, pages, &nr); + gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned); local_irq_enable(); - ret = nr; + ret = nr_pinned; } - nr = reject_cma_pages(nr, gup_flags, pages); - if (nr < nr_pages) { + nr_pinned = reject_cma_pages(nr_pinned, gup_flags, pages); + if (nr_pinned < nr_pages) { /* Try to get the remaining pages with get_user_pages */ - start += nr << PAGE_SHIFT; - pages += nr; + start += nr_pinned << PAGE_SHIFT; + pages += nr_pinned; - ret = __gup_longterm_unlocked(start, nr_pages - nr, + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags, pages); /* Have to be a bit careful with return values */ - if (nr > 0) { + if (nr_pinned > 0) { if (ret < 0) - ret = nr; + ret = nr_pinned; else - ret += nr; + ret += nr_pinned; } } -- 2.7.5
Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
On Fri, May 31, 2019 at 5:03 PM Michal Hocko wrote: > > On Thu 30-05-19 20:55:32, Pingfan Liu wrote: > > On Wed, May 29, 2019 at 2:20 AM Michal Hocko wrote: > > > > > > [Sorry for a late reply] > > > > > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote: > > > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko wrote: > > > > > > > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > > > [...] > > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > > > > cover this purpose. > > > > > > > > > > I do not really think we want to spread N_MEMORY outside of the core > > > > > MM. > > > > > It is quite confusing IMHO. > > > > > . > > > > But it has already like this. Just git grep N_MEMORY. > > > > > > I might be wrong but I suspect a closer review would reveal that the use > > > will be inconsistent or dubious so following the existing users is not > > > the best approach. > > > > > > > > > Furthermore, changing the definition of online may > > > > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > > > > it calls for_each_online_node. > > > > > > > > > > Could you be more specific please? Why should numa balancing consider > > > > > nodes without any memory? > > > > > > > > > As my understanding, the destination cpu can be on a memory less node. > > > > BTW, there are several functions in the scheduler facing the same > > > > scenario, task_numa_migrate() is an example. > > > > > > Even if the destination node is memoryless then any migration would fail > > > because there is no memory. Anyway I still do not see how using online > > > node would break anything. > > > > > Suppose we have nodes A, B,C, where C is memory less but has little > > distance to B, comparing with the one from A to B. Then if a task is > > running on A, but prefer to run on B due to memory footprint. > > task_numa_migrate() allows us to migrate the task to node C. Changing > > for_each_online_node will break this. 
> That would require the task to have preferred node to be C no? Or do I > misunderstand the task migration logic? I think in task_numa_migrate(), the migration logic should look like: env.dst_nid = p->numa_preferred_nid; //Here dst nid is B But later in if (env.best_cpu == -1 || (p->numa_group && p->numa_group->active_nodes > 1)) { for_each_online_node(nid) { [...] task_numa_find_cpu(&env, taskimp, groupimp); // Here is a chance to change p->numa_preferred_nid There are several other places broken by changing for_each_online_node(): -1. show_numa_stats() -2. init_numa_topology_type(), where sched_numa_topology_type may be mistakenly evaluated. -3. ... one can check the calls to for_each_online_node() one by one in the scheduler. That is my understanding of the code. Thanks, Pingfan
Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Sat, Jun 1, 2019 at 1:06 AM John Hubbard wrote: > > On 5/31/19 4:05 AM, Pingfan Liu wrote: > > On Fri, May 31, 2019 at 7:21 AM John Hubbard wrote: > >> On 5/30/19 2:47 PM, Ira Weiny wrote: > >>> On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote: > >> [...] > >> Rather lightly tested...I've compile-tested with CONFIG_CMA and > >> !CONFIG_CMA, > >> and boot tested with CONFIG_CMA, but could use a second set of eyes on > >> whether > >> I've added any off-by-one errors, or worse. :) > >> > > Do you mind I send V2 based on your above patch? Anyway, it is a simple bug > > fix. > > > > Sure, that's why I sent it. :) Note that Ira also recommended splitting the > "nr --> nr_pinned" renaming into a separate patch. > Thanks for your kind help. I will split out nr_pinned to a separate patch. Regards, Pingfan
Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Sat, Jun 1, 2019 at 1:12 AM Ira Weiny wrote: > > On Fri, May 31, 2019 at 07:05:27PM +0800, Pingfan Liu wrote: > > On Fri, May 31, 2019 at 7:21 AM John Hubbard wrote: > > > > > > > > > Rather lightly tested...I've compile-tested with CONFIG_CMA and > > > !CONFIG_CMA, > > > and boot tested with CONFIG_CMA, but could use a second set of eyes on > > > whether > > > I've added any off-by-one errors, or worse. :) > > > > > Do you mind I send V2 based on your above patch? Anyway, it is a simple bug > > fix. > > FWIW please split out the nr_pinned change to a separate patch. > OK. Thanks, Pingfan
Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Fri, May 31, 2019 at 7:21 AM John Hubbard wrote: > > On 5/30/19 2:47 PM, Ira Weiny wrote: > > On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote: > [...] > >> +for (j = i; j < nr; j++) > >> +put_page(pages[j]); > > > > Should be put_user_page() now. For now that just calls put_page() but it is > > slated to change soon. > > > > I also wonder if this would be more efficient as a check as we are walking > > the > > page tables and bail early. > > > > Perhaps the code complexity is not worth it? > > Good point, it might be worth it. Because now we've got two loops that > we run, after the interrupts-off page walk, and it's starting to look like > a potential performance concern. > > > > >> +nr = i; > > > > Why not just break from the loop here? > > > > Or better yet just use 'i' in the inner loop... > > > > ...but if you do end up putting in the after-the-fact check, then we can > go one or two steps further in cleaning it up, by: > > * hiding the visible #ifdef that was slicing up gup_fast, > > * using put_user_pages() instead of either put_page or put_user_page, > thus getting rid of j entirely, and > > * renaming an ancient minor confusion: nr --> nr_pinned), > > we could have this, which is looks cleaner and still does the same thing: > > diff --git a/mm/gup.c b/mm/gup.c > index f173fcbaf1b2..0c1f36be1863 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1486,6 +1486,33 @@ static __always_inline long > __gup_longterm_locked(struct task_struct *tsk, > } > #endif /* CONFIG_FS_DAX || CONFIG_CMA */ > > +#ifdef CONFIG_CMA > +/* > + * Returns the number of pages that were *not* rejected. This makes it > + * exactly compatible with its callers. 
> + */ > +static int reject_cma_pages(int nr_pinned, unsigned gup_flags, > + struct page **pages) > +{ > + int i = 0; > + if (unlikely(gup_flags & FOLL_LONGTERM)) { > + > + for (i = 0; i < nr_pinned; i++) > + if (is_migrate_cma_page(pages[i])) { > + put_user_pages(&pages[i], nr_pinned - i); > + break; > + } > + } > + return i; > +} > +#else > +static int reject_cma_pages(int nr_pinned, unsigned gup_flags, > + struct page **pages) > +{ > + return nr_pinned; > +} > +#endif > + > /* > * This is the same as get_user_pages_remote(), just with a > * less-flexible calling convention where we assume that the task > @@ -2216,7 +2243,7 @@ int get_user_pages_fast(unsigned long start, int > nr_pages, > unsigned int gup_flags, struct page **pages) > { > unsigned long addr, len, end; > - int nr = 0, ret = 0; > + int nr_pinned = 0, ret = 0; > > start &= PAGE_MASK; > addr = start; > @@ -2231,25 +2258,27 @@ int get_user_pages_fast(unsigned long start, int > nr_pages, > > if (gup_fast_permitted(start, nr_pages)) { > local_irq_disable(); > - gup_pgd_range(addr, end, gup_flags, pages, &nr); > + gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned); > local_irq_enable(); > - ret = nr; > + ret = nr_pinned; > } > > - if (nr < nr_pages) { > + nr_pinned = reject_cma_pages(nr_pinned, gup_flags, pages); > + > + if (nr_pinned < nr_pages) { > /* Try to get the remaining pages with get_user_pages */ > - start += nr << PAGE_SHIFT; > - pages += nr; > + start += nr_pinned << PAGE_SHIFT; > + pages += nr_pinned; > > - ret = __gup_longterm_unlocked(start, nr_pages - nr, > + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, > gup_flags, pages); > > /* Have to be a bit careful with return values */ > - if (nr > 0) { > + if (nr_pinned > 0) { > if (ret < 0) > - ret = nr; > + ret = nr_pinned; > else > - ret += nr; > + ret += nr_pinned; > } > } > > > Rather lightly tested...I've compile-tested with CONFIG_CMA and !CONFIG_CMA, > and boot tested with CONFIG_CMA, but could use a second set of eyes on whether > 
I've added any off-by-one errors, or worse. :) > Do you mind if I send a V2 based on your patch above? Anyway, it is a simple bug fix. Thanks, Pingfan
Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Fri, May 31, 2019 at 7:52 AM Ira Weiny wrote: > > On Thu, May 30, 2019 at 04:21:19PM -0700, John Hubbard wrote: > > On 5/30/19 2:47 PM, Ira Weiny wrote: > > > On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote: > > [...] > > >> + for (j = i; j < nr; j++) > > >> + put_page(pages[j]); > > > > > > Should be put_user_page() now. For now that just calls put_page() but it > > > is > > > slated to change soon. > > > > > > I also wonder if this would be more efficient as a check as we are > > > walking the > > > page tables and bail early. > > > > > > Perhaps the code complexity is not worth it? > > > > Good point, it might be worth it. Because now we've got two loops that > > we run, after the interrupts-off page walk, and it's starting to look like > > a potential performance concern. > > FWIW I don't see this being a huge issue at the moment. Perhaps those more > familiar with CMA can weigh in here. How was this issue found? If it was > found by running some test perhaps that indicates a performance preference? > I found the bug by reading the code, and I do not see any performance concern. Bailing out early contributes little to performance, as we fall back to the slow path immediately. Regards, Pingfan
Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
On Fri, May 31, 2019 at 5:46 AM Ira Weiny wrote: > > On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote: > > As for FOLL_LONGTERM, it is checked in the slow path > > __gup_longterm_unlocked(). But it is not checked in the fast path, which > > means a possible leak of CMA page to longterm pinned requirement through > > this crack. > > > > Place a check in the fast path. > > > > Signed-off-by: Pingfan Liu > > Cc: Ira Weiny > > Cc: Andrew Morton > > Cc: Mike Rapoport > > Cc: Dan Williams > > Cc: Matthew Wilcox > > Cc: John Hubbard > > Cc: "Aneesh Kumar K.V" > > Cc: Keith Busch > > Cc: linux-kernel@vger.kernel.org > > --- > > mm/gup.c | 12 > > 1 file changed, 12 insertions(+) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index f173fcb..00feab3 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2235,6 +2235,18 @@ int get_user_pages_fast(unsigned long start, int > > nr_pages, > > local_irq_enable(); > > ret = nr; > > } > > +#if defined(CONFIG_CMA) > > + if (unlikely(gup_flags & FOLL_LONGTERM)) { > > + int i, j; > > + > > + for (i = 0; i < nr; i++) > > + if (is_migrate_cma_page(pages[i])) { > > + for (j = i; j < nr; j++) > > + put_page(pages[j]); > > Should be put_user_page() now. For now that just calls put_page() but it is > slated to change soon. > Not aware of these changes. And get your point now. > I also wonder if this would be more efficient as a check as we are walking the > page tables and bail early. > > Perhaps the code complexity is not worth it? > Yes. That will spread such logic in huge page and normal page. > > + nr = i; > > Why not just break from the loop here? > A mistake. Thanks, Pingfan > Or better yet just use 'i' in the inner loop... > > Ira > > > + } > > + } > > +#endif > > > > if (nr < nr_pages) { > > /* Try to get the remaining pages with get_user_pages */ > > -- > > 2.7.5 > >
Re: [PATCHv5 0/2] x86/boot/KASLR: skip the specified crashkernel region
Maintainers, ping? Hi Borislav, during the review of V4, you suggested re-designing the return value of parse_crashkernel(); the latest attempt is at https://lore.kernel.org/patchwork/patch/1065514/. It seems hard to move on in that thread. On the other hand, my series "[PATCHv5 0/2] x86/boot/KASLR: skip the specified crashkernel region" does not depend on the "re-design the return value of parse_crashkernel()" work. Thanks, Pingfan On Tue, May 7, 2019 at 12:32 PM Pingfan Liu wrote: > > The crashkernel=x@y or =range1:size1[,range2:size2,...]@offset option may > fail to reserve the required memory region if KASLR puts the kernel into the > region. To avoid this uncertainty, ask KASLR to skip the required > region. > And the parsing routine can be re-used at this early boot stage. > > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: "H. Peter Anvin" > Cc: Baoquan He > Cc: Will Deacon > Cc: Nicolas Pitre > Cc: Vivek Goyal > Cc: Chao Fan > Cc: "Kirill A. Shutemov" > Cc: Ard Biesheuvel > CC: Hari Bathini > Cc: linux-kernel@vger.kernel.org > --- > v3 -> v4: > reuse the parse_crashkernel_xx routines > v4 -> v5: > drop unnecessary initialization of crash_base in [2/2] > > Pingfan Liu (2): > kernel/crash_core: separate the parsing routines to > lib/parse_crashkernel.c > x86/boot/KASLR: skip the specified crashkernel region > > arch/x86/boot/compressed/kaslr.c | 40 ++ > kernel/crash_core.c | 273 > lib/Makefile | 2 + > lib/parse_crashkernel.c | 289 > +++ > 4 files changed, 331 insertions(+), 273 deletions(-) > create mode 100644 lib/parse_crashkernel.c > > -- > 2.7.4 >
Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
On Wed, May 29, 2019 at 2:21 AM Michal Hocko wrote: > > On Thu 23-05-19 12:00:46, Pingfan Liu wrote: > [...] > > > Yes, but maybe it will pay great effort on it. > > > > > And as a first step, we can find a way to fix the bug reported by me > > and the one reported by Barret > > Can we try http://lkml.kernel.org/r/20190513140448.gj24...@dhcp22.suse.cz > for starter? If it turns out that changing for_each_online_node() will not break scheduler functions like task_numa_migrate(), I think it will be a good starting point. On the other hand, if it does, then I think it only requires a slight adjustment of your patch to meet the aim. Thanks, Pingfan > -- > Michal Hocko > SUSE Labs
Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot
On Wed, May 29, 2019 at 2:20 AM Michal Hocko wrote: > > [Sorry for a late reply] > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote: > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko wrote: > > > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote: > [...] > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to > > > > cover this purpose. > > > > > > I do not really think we want to spread N_MEMORY outside of the core MM. > > > It is quite confusing IMHO. > > > . > > But it has already like this. Just git grep N_MEMORY. > > I might be wrong but I suspect a closer review would reveal that the use > will be inconsistent or dubious so following the existing users is not > the best approach. > > > > > Furthermore, changing the definition of online may > > > > break something in the scheduler, e.g. in task_numa_migrate(), where > > > > it calls for_each_online_node. > > > > > > Could you be more specific please? Why should numa balancing consider > > > nodes without any memory? > > > > > As my understanding, the destination cpu can be on a memory less node. > > BTW, there are several functions in the scheduler facing the same > > scenario, task_numa_migrate() is an example. > > Even if the destination node is memoryless then any migration would fail > because there is no memory. Anyway I still do not see how using online > node would break anything. > Suppose we have nodes A, B, and C, where C is memoryless but closer to B than A is. Then if a task is running on A but prefers to run on B due to its memory footprint, task_numa_migrate() allows us to migrate the task to node C. Changing for_each_online_node will break this. Regards, Pingfan > > > > By keeping the node owning cpu as online, Michal's patch can avoid > > > > such corner case and keep things easy. Furthermore, if needed, the > > > > other patch can use for_each_node_state(nid, N_MEMORY) to replace > > > > for_each_online_node is some space. 
> > > > > > Ideally no code outside of the core MM should care about what kind of > > > memory does the node really own. The external code should only care > > > whether the node is online and thus usable or offline and of no > > > interest. > > > > Yes, but maybe it will pay great effort on it. > > Even if that is the case it would be preferable because the current > situation is just not sustainable wrt maintenance cost. It is just too > simple to break the existing logic as this particular report outlines. > -- > Michal Hocko > SUSE Labs
[PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
As for FOLL_LONGTERM, it is checked in the slow path __gup_longterm_unlocked(). But it is not checked in the fast path, which means a CMA page may leak into a long-term pinned mapping through this crack. Place a check in the fast path. Signed-off-by: Pingfan Liu Cc: Ira Weiny Cc: Andrew Morton Cc: Mike Rapoport Cc: Dan Williams Cc: Matthew Wilcox Cc: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Keith Busch Cc: linux-kernel@vger.kernel.org --- mm/gup.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index f173fcb..00feab3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2235,6 +2235,18 @@ int get_user_pages_fast(unsigned long start, int nr_pages, local_irq_enable(); ret = nr; } +#if defined(CONFIG_CMA) + if (unlikely(gup_flags & FOLL_LONGTERM)) { + int i, j; + + for (i = 0; i < nr; i++) + if (is_migrate_cma_page(pages[i])) { + for (j = i; j < nr; j++) + put_page(pages[j]); + nr = i; + } + } +#endif if (nr < nr_pages) { /* Try to get the remaining pages with get_user_pages */ -- 2.7.5
Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant
Matthias, ping? Any suggestions? Thanks, Pingfan On Thu, May 2, 2019 at 2:22 PM Pingfan Liu wrote: > > On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu wrote: > > > > On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger wrote: > > > > > > > > [...] > > > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char > > > > *cmdline, > > > > pr_warn("crashkernel: unrecognized char: %c\n", *cur); > > > > return -EINVAL; > > > > } > > > > + if (*crash_size == 0) > > > > + return -EINVAL; > > > > > > This covers the case where I pass an argument like "crashkernel=0M" ? > > > Can't we fix that by using kstrtoull() in memparse and check if the > > > return value > > > is < 0? In that case we could return without updating the retptr and we > > > will be > > > fine. > After a series of work, I suddenly realized that it cannot be done > this way. "0M" causes kstrtoull() to return -EINVAL, but this is > caused by "M", not "0". If "0" is passed to kstrtoull(), it will return > 0 on success. > > > > > It seems that kstrtoull() treats 0M as invalid parameter, while > > simple_strtoull() does not. > > > That was my careless reading of the code. I also tested with the valid value > "256M" using kstrtoull(), and it returned -EINVAL as well. > > So I think there is no way to distinguish 0 from a positive value > inside this basic parsing function. > Did I miss anything? > > Thanks and regards, > Pingfan