Re: [PATCH v12 00/17] arm64: MMU enabled kexec relocation

2021-03-21 Thread Pingfan Liu
Hi Pavel,

After going through this series, I wonder whether this can be done by using
an identity map through ttbr0.

Then the process might be neater (I hope so; a rough sketch follows the list):
-1. set up an identity map in machine_kexec_post_load(), instead of
copying the linear map.
-2. Also pass this temporary identity map to arm64_relocate_new_kernel().
-3. in arm64_relocate_new_kernel(), just load the identity map and
re-enable the MMU. After copying, just turn off the MMU.
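
To make the idea concrete, here is a rough sketch of what I have in mind
(illustration only: setup_kexec_idmap() and the ttbr0/idmap fields in struct
kimage_arch are made-up names, not something from this series):

int machine_kexec_post_load(struct kimage *kimage)
{
	int rc;

	/* 1. build a temporary identity map covering the relocation code and
	 *    the source/destination pages, instead of copying the linear map
	 *    (setup_kexec_idmap() is a hypothetical helper) */
	rc = setup_kexec_idmap(kimage);
	if (rc)
		return rc;

	/* 2. pass the identity map (a ttbr0 value) on to the relocation code */
	kimage->arch.ttbr0 = __pa(kimage->arch.idmap_pgd);

	return 0;
}

/*
 * 3. arm64_relocate_new_kernel() (assembly) would then install that ttbr0,
 *    re-enable the MMU, copy the kexec segments, turn the MMU off again and
 *    branch into the new kernel.
 */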

Thanks,
Pingfan

On Thu, Mar 4, 2021 at 3:47 PM Pavel Tatashin  wrote:
>
> Changelog:
> v12:
> - A major change compared to previous version. Instead of using
>   contiguous VA range a copy of linear map is now used to perform
>   copying of segments during relocation as it was agreed in the
>   discussion of version 11 of this project.
> - In addition to using linear map, I also took several ideas from
>   James Morse to better organize the kexec relocation:
> 1. skip relocation function entirely if that is not needed
> 2. remove the PoC flushing function since it is not needed
>anymore with MMU enabled.
> v11:
> - Fixed missing KEXEC_CORE dependency for trans_pgd.c
> - Removed useless "if(rc) return rc" statement (thank you Tyler Hicks)
> - Another 12 patches were accepted into the maintainer's tree.
>   Re-based patches against:
>   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
>   Branch: for-next/kexec
> v10:
> - Addressed a lot of comments from James Morse and from Marc Zyngier
> - Added review-by's
> - Synchronized with mainline
>
> v9: - 9 patches from previous series landed in upstream, so now series
>   is smaller
> - Added two patches from James Morse to address idmap issues for machines
>   with high physical addresses.
> - Addressed comments from Selin Dag about compiling issues. He also tested
>   my series and got similar performance results: ~60 ms instead of ~580 ms
>   with an initramfs size of ~120MB.
> v8:
> - Synced with mainline to keep series up-to-date
> v7:
> -- Addressed comments from James Morse
> - arm64: hibernate: pass the allocated pgdp to ttbr0
>   Removed "Fixes" tag, and added Added Reviewed-by: James Morse
> - arm64: hibernate: check pgd table allocation
>   Sent out as a standalone patch so it can be sent to stable
>   Series applies on mainline + this patch
> - arm64: hibernate: add trans_pgd public functions
>   Remove second allocation of tmp_pg_dir in swsusp_arch_resume
>   Added Reviewed-by: James Morse 
> - arm64: kexec: move relocation function setup and clean up
>   Fixed typo in commit log
>   Changed kern_reloc to phys_addr_t types.
>   Added explanation why kern_reloc is needed.
>   Split into four patches:
>   arm64: kexec: make dtb_mem always enabled
>   arm64: kexec: remove unnecessary debug prints
>   arm64: kexec: call kexec_image_info only once
>   arm64: kexec: move relocation function setup
> - arm64: kexec: add expandable argument to relocation function
>   Changed types of new arguments from unsigned long to phys_addr_t.
>   Changed offset prefix to KEXEC_*
>   Split into four patches:
>   arm64: kexec: cpu_soft_restart change argument types
>   arm64: kexec: arm64_relocate_new_kernel clean-ups
>   arm64: kexec: arm64_relocate_new_kernel don't use x0 as temp
>   arm64: kexec: add expandable argument to relocation function
> - arm64: kexec: configure trans_pgd page table for kexec
>   Added invalid entries into EL2 vector table
>   Removed KEXEC_EL2_VECTOR_TABLE_SIZE and KEXEC_EL2_VECTOR_TABLE_OFFSET
>   Copy relocation functions and table into separate pages
>   Changed types in kern_reloc_arg.
>   Split into three patches:
>   arm64: kexec: offset for relocation function
>   arm64: kexec: kexec EL2 vectors
>   arm64: kexec: configure trans_pgd page table for kexec
> - arm64: kexec: enable MMU during kexec relocation
>   Split into two patches:
>   arm64: kexec: enable MMU during kexec relocation
>   arm64: kexec: remove head from relocation argument
> v6:
> - Sync with mainline tip
> - Added Acked's from Dave Young
> v5:
> - Addressed comments from Matthias Brugger: added review-by's, improved
>   comments, and made cleanups to swsusp_arch_resume() in addition to
>   create_safe_exec_page().
> - Synced with mainline tip.
> v4:
> - Addressed comments from James Morse.
> - Split "check pgd table allocation" into two patches, and moved to
>   the beginning of series  for simpler backport of the fixes.

Re: [PATCH] drivers/arch_numa: remove rebudant setup_per_cpu_areas()

2021-03-09 Thread Pingfan Liu
On Tue, Mar 9, 2021 at 10:02 PM Will Deacon  wrote:
>
> [typo in subject "rebudant"]
>
> On Tue, Mar 09, 2021 at 06:21:38PM +0800, Pingfan Liu wrote:
> > There are two identical implementations of setup_per_cpu_areas() in
> > mm/percpu.c and drivers/base/arch_numa.c.
> >
> > Hence removing the one in arch_numa.c. And let arm64 drop
> > HAVE_SETUP_PER_CPU_AREA.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: Greg Kroah-Hartman 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Atish Patra 
> > Cc: linux-kernel@vger.kernel.org
> > To: linux-arm-ker...@lists.infradead.org
> > ---
> >  arch/arm64/Kconfig   |  4 
> >  drivers/base/arch_numa.c | 22 --
> >  2 files changed, 26 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 1f212b47a48a..d4bf8be0c3d5 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -1022,10 +1022,6 @@ config USE_PERCPU_NUMA_NODE_ID
> >   def_bool y
> >   depends on NUMA
> >
> > -config HAVE_SETUP_PER_CPU_AREA
> > - def_bool y
> > - depends on NUMA
> > -
> >  config NEED_PER_CPU_EMBED_FIRST_CHUNK
> >   def_bool y
> >   depends on NUMA
> > diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> > index 4cc4e117727d..23e1e419a83d 100644
> > --- a/drivers/base/arch_numa.c
> > +++ b/drivers/base/arch_numa.c
> > @@ -167,28 +167,6 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
> >  {
> >   memblock_free_early(__pa(ptr), size);
> >  }
> > -
> > -void __init setup_per_cpu_areas(void)
> > -{
> > - unsigned long delta;
> > - unsigned int cpu;
> > - int rc;
> > -
> > - /*
> > -  * Always reserve area for module percpu variables.  That's
> > -  * what the legacy allocator did.
> > -  */
> > - rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
> > - PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
> > - pcpu_cpu_distance,
> > - pcpu_fc_alloc, pcpu_fc_free);
>
> This doesn't look identical to the version in mm/percpu.c -- that one passes
> NULL instead of 'pcpu_cpu_distance' and tries to allocate the pcpu memory on
> the relevant NUMA nodes. In fact, if you could remove this function, you
> could probably remove the whole HAVE_SETUP_PER_CPU_AREA block here as the
> other functions are just used as helpers. So I'm not sure this is valid.
>
You are right. I need to rethink this and see whether the two
functions can be unified into one.
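
For reference, the difference as I read it (the second call is from memory of
mm/percpu.c, so please double-check there):

/* drivers/base/arch_numa.c -- NUMA-aware placement of the first chunk */
rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
			    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
			    pcpu_cpu_distance,		/* NUMA distance callback */
			    pcpu_fc_alloc, pcpu_fc_free);	/* node-local allocation */

/* generic mm/percpu.c fallback -- no distance info, default allocators */
rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
			    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
			    NULL,
			    pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);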

Thanks,
Pingfan


[PATCH] drivers/arch_numa: remove rebudant setup_per_cpu_areas()

2021-03-09 Thread Pingfan Liu
There are two identical implementations of setup_per_cpu_areas() in
mm/percpu.c and drivers/base/arch_numa.c.

Hence removing the one in arch_numa.c. And let arm64 drop
HAVE_SETUP_PER_CPU_AREA.

Signed-off-by: Pingfan Liu 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Atish Patra 
Cc: linux-kernel@vger.kernel.org
To: linux-arm-ker...@lists.infradead.org
---
 arch/arm64/Kconfig   |  4 
 drivers/base/arch_numa.c | 22 --
 2 files changed, 26 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b47a48a..d4bf8be0c3d5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1022,10 +1022,6 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
 
-config HAVE_SETUP_PER_CPU_AREA
-   def_bool y
-   depends on NUMA
-
 config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 4cc4e117727d..23e1e419a83d 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -167,28 +167,6 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 {
memblock_free_early(__pa(ptr), size);
 }
-
-void __init setup_per_cpu_areas(void)
-{
-   unsigned long delta;
-   unsigned int cpu;
-   int rc;
-
-   /*
-* Always reserve area for module percpu variables.  That's
-* what the legacy allocator did.
-*/
-   rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
-   PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
-   pcpu_cpu_distance,
-   pcpu_fc_alloc, pcpu_fc_free);
-   if (rc < 0)
-   panic("Failed to initialize percpu areas.");
-
-   delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
-   for_each_possible_cpu(cpu)
-   __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
-}
 #endif
 
 /**
-- 
2.29.2



Re: [PATCH] arm64/irq: report bug if NR_IPI greater than max SGI during compile time

2020-12-08 Thread Pingfan Liu
On Tue, Dec 8, 2020 at 5:51 PM Marc Zyngier  wrote:
>
> On 2020-12-08 09:43, Pingfan Liu wrote:
> > On Tue, Dec 8, 2020 at 5:31 PM Marc Zyngier  wrote:
> >>
> >> On 2020-12-08 09:21, Pingfan Liu wrote:
> >> > Although there is a runtime WARN_ON() when NR_IPI > max SGI, it is
> >> > better to do the check at build time and keep the related code
> >> > together.
> >> >
> >> > Signed-off-by: Pingfan Liu 
> >> > Cc: Catalin Marinas 
> >> > Cc: Will Deacon 
> >> > Cc: Thomas Gleixner 
> >> > Cc: Jason Cooper 
> >> > Cc: Marc Zyngier 
> >> > Cc: Mark Rutland 
> >> > To: linux-arm-ker...@lists.infradead.org
> >> > Cc: linux-kernel@vger.kernel.org
> >> > ---
> >> >  arch/arm64/kernel/smp.c| 2 ++
> >> >  drivers/irqchip/irq-gic-v3.c   | 2 +-
> >> >  drivers/irqchip/irq-gic.c  | 2 +-
> >> >  include/linux/irqchip/arm-gic-common.h | 2 ++
> >> >  4 files changed, 6 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >> > index 18e9727..9fc383c 100644
> >> > --- a/arch/arm64/kernel/smp.c
> >> > +++ b/arch/arm64/kernel/smp.c
> >> > @@ -33,6 +33,7 @@
> >> >  #include 
> >> >  #include 
> >> >  #include 
> >> > +#include 
> >> >
> >> >  #include 
> >> >  #include 
> >> > @@ -76,6 +77,7 @@ enum ipi_msg_type {
> >> >   IPI_WAKEUP,
> >> >   NR_IPI
> >> >  };
> >> > +static_assert(NR_IPI <= MAX_SGI_NUM);
> >>
> >> I am trying *very hard* to remove dependencies between the
> >> architecture
> >> code and random drivers, so this kind of check really is
> >> counter-productive.
> >>
> >> Driver code should not have to know the number of IPIs, because there
> >> is
> >> no requirement that all IPIs should map 1:1 to SGIs. Conflating the
> >> two
> >
> > Just curious about this: is there an IPI which is not implemented by
> > an SGI? Or are several IPIs mapped onto a single SGI and then dispatched
> > according to a global variable?
>
> We currently have a single NS SGI left, and I'd like to move some of the
> non-critical IPIs over to a dispatching mechanism (the two "CPU stop" IPIs
> are definitely candidates for merging). That's not implemented yet, but
> I don't see a need to add checks that would otherwise violate this
> IPI/SGI distinction.

Got it. Thanks for your detailed explanation.

Regards,
Pingfan


Re: [PATCH] arm64/irq: report bug if NR_IPI greater than max SGI during compile time

2020-12-08 Thread Pingfan Liu
On Tue, Dec 8, 2020 at 5:31 PM Marc Zyngier  wrote:
>
> On 2020-12-08 09:21, Pingfan Liu wrote:
> > Although there is a runtime WARN_ON() when NR_IPI > max SGI, it is
> > better to do the check at build time and keep the related code together.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: Thomas Gleixner 
> > Cc: Jason Cooper 
> > Cc: Marc Zyngier 
> > Cc: Mark Rutland 
> > To: linux-arm-ker...@lists.infradead.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  arch/arm64/kernel/smp.c| 2 ++
> >  drivers/irqchip/irq-gic-v3.c   | 2 +-
> >  drivers/irqchip/irq-gic.c  | 2 +-
> >  include/linux/irqchip/arm-gic-common.h | 2 ++
> >  4 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 18e9727..9fc383c 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -76,6 +77,7 @@ enum ipi_msg_type {
> >   IPI_WAKEUP,
> >   NR_IPI
> >  };
> > +static_assert(NR_IPI <= MAX_SGI_NUM);
>
> I am trying *very hard* to remove dependencies between the architecture
> code and random drivers, so this kind of check really is
> counter-productive.
>
> Driver code should not have to know the number of IPIs, because there is
> no requirement that all IPIs should map 1:1 to SGIs. Conflating the two

Just curious about this: is there an IPI which is not implemented by
an SGI? Or are several IPIs mapped onto a single SGI and then dispatched
according to a global variable?

Thanks,
Pingfan

> is already wrong, and I really don't want to add more of that.
>
> Thanks,
>
>  M.
> --
> Jazz is not dead. It just smells funny...


[PATCH] arm64/irq: report bug if NR_IPI greater than max SGI during compile time

2020-12-08 Thread Pingfan Liu
Although there is a runtime WARN_ON() when NR_IPI > max SGI, it is better
to do the check at build time and keep the related code together.

Signed-off-by: Pingfan Liu 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Thomas Gleixner 
Cc: Jason Cooper 
Cc: Marc Zyngier 
Cc: Mark Rutland 
To: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 arch/arm64/kernel/smp.c| 2 ++
 drivers/irqchip/irq-gic-v3.c   | 2 +-
 drivers/irqchip/irq-gic.c  | 2 +-
 include/linux/irqchip/arm-gic-common.h | 2 ++
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 18e9727..9fc383c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -76,6 +77,7 @@ enum ipi_msg_type {
IPI_WAKEUP,
NR_IPI
 };
+static_assert(NR_IPI <= MAX_SGI_NUM);
 
 static int ipi_irq_base __read_mostly;
 static int nr_ipi __read_mostly = NR_IPI;
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 16fecc0..ee13f85 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1162,7 +1162,7 @@ static void __init gic_smp_init(void)
  gic_starting_cpu, NULL);
 
/* Register all 8 non-secure SGIs */
-   base_sgi = __irq_domain_alloc_irqs(gic_data.domain, -1, 8,
+   base_sgi = __irq_domain_alloc_irqs(gic_data.domain, -1, MAX_SGI_NUM,
   NUMA_NO_NODE, &sgi_fwspec,
   false, NULL);
if (WARN_ON(base_sgi <= 0))
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 6053245..07d36de 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -845,7 +845,7 @@ static __init void gic_smp_init(void)
  "irqchip/arm/gic:starting",
  gic_starting_cpu, NULL);
 
-   base_sgi = __irq_domain_alloc_irqs(gic_data[0].domain, -1, 8,
+   base_sgi = __irq_domain_alloc_irqs(gic_data[0].domain, -1, MAX_SGI_NUM,
   NUMA_NO_NODE, &sgi_fwspec,
   false, NULL);
if (WARN_ON(base_sgi <= 0))
diff --git a/include/linux/irqchip/arm-gic-common.h 
b/include/linux/irqchip/arm-gic-common.h
index fa8c045..7e45a9f 100644
--- a/include/linux/irqchip/arm-gic-common.h
+++ b/include/linux/irqchip/arm-gic-common.h
@@ -16,6 +16,8 @@
(GICD_INT_DEF_PRI << 8) |\
GICD_INT_DEF_PRI)
 
+#define MAX_SGI_NUM8
+
 enum gic_type {
GIC_V2,
GIC_V3,
-- 
2.7.5



Re: [PATCH 2/2] x86/machine_kexec: disable PMU before jumping to new kernel

2020-11-26 Thread Pingfan Liu
Sorry, I had misunderstood the code.

NACKing it.

On Mon, Nov 23, 2020 at 1:37 PM Pingfan Liu  wrote:
>
> During jumping to the new kernel, on the crashed cpu, the memory mapping
> switches from an old one to an identity one. It had better disable PMU to
> suppress NMI, which can be delivered using the old mapping.
>
> Also on x86_64, idt_invalidate() to clear idt as on 32 bits.
>
> Signed-off-by: Pingfan Liu 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Mark Rutland 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Thomas Gleixner 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: Omar Sandoval 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> To: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/machine_kexec_32.c | 1 +
>  arch/x86/kernel/machine_kexec_64.c | 3 +++
>  2 files changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/machine_kexec_32.c 
> b/arch/x86/kernel/machine_kexec_32.c
> index 64b00b0..72c6100 100644
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -191,6 +191,7 @@ void machine_kexec(struct kimage *image)
> /* Interrupts aren't acceptable while we reboot */
> local_irq_disable();
> hw_breakpoint_disable();
> +   perf_pmu_disable_all();
>
> if (image->preserve_context) {
>  #ifdef CONFIG_X86_IO_APIC
> diff --git a/arch/x86/kernel/machine_kexec_64.c 
> b/arch/x86/kernel/machine_kexec_64.c
> index a29a44a..238893e 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -338,6 +339,8 @@ void machine_kexec(struct kimage *image)
> /* Interrupts aren't acceptable while we reboot */
> local_irq_disable();
> hw_breakpoint_disable();
> +   perf_pmu_disable_all();
> +   idt_invalidate(phys_to_virt(0));
>
> if (image->preserve_context) {
>  #ifdef CONFIG_X86_IO_APIC
> --
> 2.7.5
>


Re: [PATCH 1/2] events/core: introduce perf_pmu_disable_all() to turn off all PMU

2020-11-24 Thread Pingfan Liu
On Mon, Nov 23, 2020 at 10:05 PM Peter Zijlstra  wrote:
>
> On Mon, Nov 23, 2020 at 01:37:25PM +0800, Pingfan Liu wrote:
>
> > +/* When crashed, other cpus hang in idle loop, so here do an emergency job under no lock */
>
> -ENOPARSE, -ETOOLONG
>
> > +void perf_pmu_disable_all(void)
> > +{
> > + struct pmu *pmu;
> > +
> > + list_for_each_entry(pmu, , entry)
> > + if (pmu->pmu_disable)
> > + pmu->pmu_disable(pmu);
> > +}
>
> This violates both locking rules and coding style.
Oops. I will rework it and see how to run it safely in crash context.

Thanks,
Pingfan


[PATCH 1/2] events/core: introduce perf_pmu_disable_all() to turn off all PMU

2020-11-22 Thread Pingfan Liu
In crash context, NMIs should be suppressed before jumping to a new kernel.
As a natural source of NMIs on some arches, the PMU should be turned off at
that time.

Introduce perf_pmu_disable_all() to achieve the goal.

Signed-off-by: Pingfan Liu 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Omar Sandoval 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: x...@kernel.org
To: linux-kernel@vger.kernel.org
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c   | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 96450f6..f4baa87 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -965,6 +965,7 @@ extern const struct perf_event_attr 
*perf_event_attrs(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_pmu_disable_all(void);
 extern void perf_sched_cb_dec(struct pmu *pmu);
 extern void perf_sched_cb_inc(struct pmu *pmu);
 extern int perf_event_task_disable(void);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dc568ca..c8e04a5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1205,6 +1205,16 @@ void perf_pmu_enable(struct pmu *pmu)
pmu->pmu_enable(pmu);
 }
 
+/* When crashed, other cpus hang in idle loop, so here do an emergency job under no lock */
+void perf_pmu_disable_all(void)
+{
+   struct pmu *pmu;
+
+   list_for_each_entry(pmu, , entry)
+   if (pmu->pmu_disable)
+   pmu->pmu_disable(pmu);
+}
+
 static DEFINE_PER_CPU(struct list_head, active_ctx_list);
 
 /*
-- 
2.7.5



[PATCH 2/2] x86/machine_kexec: disable PMU before jumping to new kernel

2020-11-22 Thread Pingfan Liu
While jumping to the new kernel, the crashed cpu switches its memory mapping
from the old one to an identity one. It is better to disable the PMU to
suppress NMIs, which could still be delivered through the old mapping.

Also, on x86_64, call idt_invalidate() to clear the IDT as is done on 32 bits.

Signed-off-by: Pingfan Liu 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Omar Sandoval 
Cc: Andrew Morton 
Cc: Mike Rapoport 
To: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/machine_kexec_32.c | 1 +
 arch/x86/kernel/machine_kexec_64.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_32.c 
b/arch/x86/kernel/machine_kexec_32.c
index 64b00b0..72c6100 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -191,6 +191,7 @@ void machine_kexec(struct kimage *image)
/* Interrupts aren't acceptable while we reboot */
local_irq_disable();
hw_breakpoint_disable();
+   perf_pmu_disable_all();
 
if (image->preserve_context) {
 #ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index a29a44a..238893e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -338,6 +339,8 @@ void machine_kexec(struct kimage *image)
/* Interrupts aren't acceptable while we reboot */
local_irq_disable();
hw_breakpoint_disable();
+   perf_pmu_disable_all();
+   idt_invalidate(phys_to_virt(0));
 
if (image->preserve_context) {
 #ifdef CONFIG_X86_IO_APIC
-- 
2.7.5



[PATCH 3/3] kernel/watchdog: use soft lockup to detect irq flood

2020-11-17 Thread Pingfan Liu
When an irq flood happens, interrupt handlers occupy all of the cpu time.
This results in a situation where a soft lockup can be observed, although it
differs from what the soft lockup detector was designed for.

In order to distinguish this situation, it is helpful to print statistics of
the irq frequency along with the soft lockup warning, so that a potential irq
flood can be evaluated.

Thomas and Guilherme suggested patches to suppress the offending irq in
different situations [1][2], but that still seems to be an open question for
the near future. For now, it is better to print some hints for users than
nothing.

[1]: https://lore.kernel.org/lkml/87tuueftou@nanos.tec.linutronix.de/
[2]: https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Jisheng Zhang 
Cc: "Peter Zijlstra (Intel)" 
Cc: Vlastimil Babka 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: ke...@lists.infradead.org
To: linux-kernel@vger.kernel.org
---
 kernel/watchdog.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 1cc619a..a0ab2a8 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -175,6 +176,9 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
+static DEFINE_PER_CPU(unsigned long, last_irq_sum);
+static DEFINE_PER_CPU(unsigned long, last_unused_irq_sum);
+
 static unsigned long soft_lockup_nmi_warn;
 
 static int __init nowatchdog_setup(char *str)
@@ -353,6 +357,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
 
/* kick the softlockup detector */
if (completion_done(this_cpu_ptr(&softlockup_completion))) {
+   __this_cpu_write(last_irq_sum, kstat_this_cpu->irqs_sum);
+   __this_cpu_write(last_unused_irq_sum, kstat_this_cpu->unused_irqs_sum);
reinit_completion(this_cpu_ptr(&softlockup_completion));
stop_one_cpu_nowait(smp_processor_id(),
softlockup_fn, NULL,
@@ -386,6 +392,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
 */
duration = is_softlockup(touch_ts);
if (unlikely(duration)) {
+   unsigned long irq_sum, unused_irq_sum;
+   unsigned int seconds;
+
/*
 * If a virtual machine is stopped by the host it can look to
 * the watchdog like a soft lockup, check to see if the host
@@ -409,9 +418,15 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
}
}
 
+   irq_sum = kstat_this_cpu->irqs_sum - __this_cpu_read(last_irq_sum);
+   unused_irq_sum = kstat_this_cpu->unused_irqs_sum -
+   __this_cpu_read(last_unused_irq_sum);
+   seconds = (unsigned int)convert_seconds(duration);
pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
-   smp_processor_id(), (unsigned int)convert_seconds(duration),
+   smp_processor_id(), seconds,
current->comm, task_pid_nr(current));
+   pr_emerg("%lu irqs at rate: %lu / s, %lu unused irq at rate: %lu / s\n",
+   irq_sum, irq_sum/seconds, unused_irq_sum, unused_irq_sum/seconds);
print_modules();
print_irqtrace_events(current);
if (regs)
-- 
2.7.5



[PATCH 2/3] kernel/watchdog: make watchdog_touch_ts more accurate by using nanosecond

2020-11-17 Thread Pingfan Liu
The upcoming statistics of the average irq number will be based on the delta
of watchdog_touch_ts. Improving the resolution of watchdog_touch_ts from
seconds to nanoseconds helps improve the accuracy of the statistics.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Jisheng Zhang 
Cc: "Peter Zijlstra (Intel)" 
Cc: Vlastimil Babka 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: ke...@lists.infradead.org
To: linux-kernel@vger.kernel.org
---
 kernel/watchdog.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 5abb5b2..1cc619a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -207,7 +207,7 @@ static void __lockup_detector_cleanup(void);
  * the thresholds with a factor: we make the soft threshold twice the amount of
  * time the hard threshold is.
  */
-static int get_softlockup_thresh(void)
+static unsigned int get_softlockup_thresh(void)
 {
return watchdog_thresh * 2;
 }
@@ -217,9 +217,9 @@ static int get_softlockup_thresh(void)
  * resolution, and we don't need to waste time with a big divide when
  * 2^30ns == 1.074s.
  */
-static unsigned long get_timestamp(void)
+static unsigned long convert_seconds(unsigned long ns)
 {
-   return running_clock() >> 30LL;  /* 2^30 ~= 10^9 */
+   return ns >> 30LL;  /* 2^30 ~= 10^9 */
 }
 
 static void set_sample_period(void)
@@ -238,7 +238,7 @@ static void set_sample_period(void)
 /* Commands for resetting the watchdog */
 static void __touch_watchdog(void)
 {
-   __this_cpu_write(watchdog_touch_ts, get_timestamp());
+   __this_cpu_write(watchdog_touch_ts, running_clock());
 }
 
 /**
@@ -289,14 +289,15 @@ void touch_softlockup_watchdog_sync(void)
__this_cpu_write(watchdog_touch_ts, SOFTLOCKUP_RESET);
 }
 
-static int is_softlockup(unsigned long touch_ts)
+static unsigned long is_softlockup(unsigned long touch_ts)
 {
-   unsigned long now = get_timestamp();
+   unsigned long span, now = running_clock();
 
+   span = now - touch_ts;
if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
/* Warn about unreasonable delays. */
-   if (time_after(now, touch_ts + get_softlockup_thresh()))
-   return now - touch_ts;
+   if (time_after(convert_seconds(span), (unsigned long)get_softlockup_thresh()))
+   return span;
}
return 0;
 }
@@ -340,9 +341,8 @@ static int softlockup_fn(void *data)
 /* watchdog kicker functions */
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
-   unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
+   unsigned long duration, touch_ts = __this_cpu_read(watchdog_touch_ts);
struct pt_regs *regs = get_irq_regs();
-   int duration;
int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;
 
if (!watchdog_enabled)
@@ -410,7 +410,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
}
 
pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
-   smp_processor_id(), duration,
+   smp_processor_id(), (unsigned int)convert_seconds(duration),
current->comm, task_pid_nr(current));
print_modules();
print_irqtrace_events(current);
-- 
2.7.5



[PATCH 0/3] use soft lockup to detect irq flood

2020-11-17 Thread Pingfan Liu
A chipset's pins are often multiplexed, because demand for them exceeds the
finite number of pins. This requires careful configuration in firmware or the
OS; when that goes wrong, it accounts for many of the irq flood cases, which
show up as soft lockup issues on Linux.

Strictly speaking, soft lockup and irq flood are different problems to
overcome, and it is helpful to make users aware of which situation they are in.

This series prints irq statistics along with the soft lockup warning. The
statistics can be used to evaluate the likelihood of an irq flood and as a
rough input to the kernel parameter "irqstorm_limit" in [1].
 
It is not easy to find a common way to work around an irq flood, which may be
raised by different root causes. For now, it is still an open question.
Thomas and Guilherme suggested patches to suppress the offending irq in
different situations [1][2].

[1]: https://lore.kernel.org/lkml/87tuueftou@nanos.tec.linutronix.de/
[2]: https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/


Cc: Thomas Gleixner 
Cc: Jisheng Zhang 
Cc: "Peter Zijlstra (Intel)" 
Cc: Vlastimil Babka 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: ke...@lists.infradead.org
To: linux-kernel@vger.kernel.org


Pingfan Liu (3):
  x86/irq: account the unused irq
  kernel/watchdog: make watchdog_touch_ts more accurate by using
nanosecond
  kernel/watchdog: use soft lockup to detect irq flood

 arch/x86/kernel/irq.c   |  1 +
 include/linux/kernel_stat.h |  1 +
 kernel/watchdog.c   | 37 ++---
 3 files changed, 28 insertions(+), 11 deletions(-)

-- 
2.7.5



[PATCH 1/3] x86/irq: account the unused irq

2020-11-17 Thread Pingfan Liu
Account the unused irqs so that they can be counted when an irq flood happens.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Jisheng Zhang 
Cc: "Peter Zijlstra (Intel)" 
Cc: Vlastimil Babka 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: ke...@lists.infradead.org
To: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/irq.c   | 1 +
 include/linux/kernel_stat.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c5dd503..6f583a7 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -254,6 +254,7 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
 __func__, smp_processor_id(),
 vector);
+   __this_cpu_inc(kstat.unused_irqs_sum);
} else {
__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
}
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 89f0745..c8d5cb8 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -37,6 +37,7 @@ struct kernel_cpustat {
 
 struct kernel_stat {
unsigned long irqs_sum;
+   unsigned long unused_irqs_sum;
unsigned int softirqs[NR_SOFTIRQS];
 };
 
-- 
2.7.5



Re: [PATCH 0/3] warn and suppress irqflood

2020-11-05 Thread Pingfan Liu
On Wed, Oct 28, 2020 at 7:58 PM Thomas Gleixner  wrote:
>
[...]
> ---
>  include/linux/irqdesc.h |4 ++
>  kernel/irq/manage.c |3 +
>  kernel/irq/spurious.c   |   74 
> +++-
>  3 files changed, 61 insertions(+), 20 deletions(-)
>
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -30,6 +30,8 @@ struct pt_regs;
>   * @tot_count: stats field for non-percpu irqs
>   * @irq_count: stats field to detect stalled irqs
>   * @last_unhandled:aging timer for unhandled count
> + * @storm_count:   Counter for irq storm detection
> + * @storm_checked: Timestamp for irq storm detection
>   * @irqs_unhandled:stats field for spurious unhandled interrupts
>   * @threads_handled:   stats field for deferred spurious detection of 
> threaded handlers
>   * @threads_handled_last: comparator field for deferred spurious detection 
> of theraded handlers
> @@ -65,6 +67,8 @@ struct irq_desc {
> unsigned inttot_count;
> unsigned intirq_count;  /* For detecting broken IRQs 
> */
> unsigned long   last_unhandled; /* Aging timer for unhandled 
> count */
> +   unsigned long   storm_count;
> +   unsigned long   storm_checked;
> unsigned intirqs_unhandled;
> atomic_tthreads_handled;
> int threads_handled_last;
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1581,6 +1581,9 @@ static int
> if (!shared) {
> init_waitqueue_head(>wait_for_threads);
>
> +   /* Take a timestamp for interrupt storm detection */
> +   desc->storm_checked = jiffies;
> +
> /* Setup the type (level, edge polarity) if configured: */
> if (new->flags & IRQF_TRIGGER_MASK) {
> ret = __irq_set_trigger(desc,
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -21,6 +21,7 @@ static void poll_spurious_irqs(struct ti
>  static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs);
>  static int irq_poll_cpu;
>  static atomic_t irq_poll_active;
> +static unsigned long irqstorm_limit __ro_after_init;
>
>  /*
>   * We wait here for a poller to finish.
> @@ -189,18 +190,21 @@ static inline int bad_action_ret(irqretu
>   * (The other 100-of-100,000 interrupts may have been a correctly
>   *  functioning device sharing an IRQ with the failing one)
>   */
> -static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
> +static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret,
> +bool storm)
>  {
> unsigned int irq = irq_desc_get_irq(desc);
> struct irqaction *action;
> unsigned long flags;
>
> -   if (bad_action_ret(action_ret)) {
> -   printk(KERN_ERR "irq event %d: bogus return value %x\n",
> -   irq, action_ret);
> -   } else {
> -   printk(KERN_ERR "irq %d: nobody cared (try booting with "
> +   if (!storm) {
> +   if (bad_action_ret(action_ret)) {
> +   pr_err("irq event %d: bogus return value %x\n",
> +  irq, action_ret);
> +   } else {
> +   pr_err("irq %d: nobody cared (try booting with "
> "the \"irqpoll\" option)\n", irq);
> +   }
> }
> dump_stack();
> printk(KERN_ERR "handlers:\n");
> @@ -228,7 +232,7 @@ static void report_bad_irq(struct irq_de
>
> if (count > 0) {
> count--;
> -   __report_bad_irq(desc, action_ret);
> +   __report_bad_irq(desc, action_ret, false);
> }
>  }
>
> @@ -267,6 +271,33 @@ try_misrouted_irq(unsigned int irq, stru
> return action && (action->flags & IRQF_IRQPOLL);
>  }
>
> +static void disable_stuck_irq(struct irq_desc *desc, irqreturn_t action_ret,
> + const char *reason, bool storm)
> +{
> +   __report_bad_irq(desc, action_ret, storm);
> +   pr_emerg("Disabling %s IRQ #%d\n", reason, irq_desc_get_irq(desc));
> +   desc->istate |= IRQS_SPURIOUS_DISABLED;
> +   desc->depth++;
> +   irq_disable(desc);
> +}
> +
> +/* Interrupt storm detector for runaway interrupts (handled or not). */
> +static bool irqstorm_detected(struct irq_desc *desc)
> +{
> +   unsigned long now = jiffies;
> +
> +   if (++desc->storm_count < irqstorm_limit) {
> +   if (time_after(now, desc->storm_checked + HZ)) {
> +   desc->storm_count = 0;
> +   desc->storm_checked = now;
> +   }
> +   return false;
> +   }
> +
> +   disable_stuck_irq(desc, IRQ_NONE, "runaway", true);
> +   return true;
> +}
> +
>  #define SPURIOUS_DEFERRED  0x8000
>
>  void note_interrupt(struct 

Re: [PATCH 0/3] warn and suppress irqflood

2020-10-29 Thread Pingfan Liu
On Wed, Oct 28, 2020 at 12:58:41PM +0100, Thomas Gleixner wrote:
> On Wed, Oct 28 2020 at 14:02, Pingfan Liu wrote:
> > On Tue, Oct 27, 2020 at 3:59 AM Thomas Gleixner  wrote:
> >> Also Liu's patch only works if:
> >>
> >>   1) CONFIG_IRQ_TIME_ACCOUNTING is enabled
> >
> > I wonder whether it can not be a default option or not by the following 
> > method:
> >   DEFINE_STATIC_KEY_FALSE(irqtime_account), and enable it according to
> > a boot param.
> 
> How so?
> 
>   config IRQ_TIME_ACCOUNTING
>   depends on HAVE_IRQ_TIME_ACCOUNTING && 
> !VIRT_CPU_ACCOUNTING_NATIVE
> 
Looking closely at the two config values:

-1. HAVE_IRQ_TIME_ACCOUNTING is selected by most of the popular arches, and
the requirement can be further relaxed.
   It implies that sched_clock() is fast enough for sampling. With the current
code, the variable sched_clock_irqtime=0 can be used to turn off irqtime
accounting on arches with a slow sched_clock(), and this can be done even
better with DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime).
   So the prerequisite can be relaxed to "depends on !VIRT_CPU_ACCOUNTING_NATIVE".
In case I cannot express this clearly, could you have a look at the demo patch?

   That patch _assumes_ that irqtime accounting is costly and is not turned on
by default. If turned on, it costs one extra jmp compared with the current
implementation. Whether this assumption is reasonable is critical to my [1/3].

-2. VIRT_CPU_ACCOUNTING_NATIVE can only be selected by powerpc and ia64.

In fact, I have a separate patch for powerpc with
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE to utilize my [1/3].

---
diff --git a/init/Kconfig b/init/Kconfig
index c944691..16d168b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -490,7 +490,7 @@ endchoice
 
 config IRQ_TIME_ACCOUNTING
bool "Fine granularity task level IRQ time accounting"
-   depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
+   depends on !VIRT_CPU_ACCOUNTING_NATIVE
help
  Select this option to enable fine granularity task irq time
  accounting. This is done by reading a timestamp on each
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5a55d23..3ab7e1d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -19,7 +19,7 @@
  */
 DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
 
-static int sched_clock_irqtime;
+DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
 
 void enable_sched_clock_irqtime(void)
 {
@@ -49,13 +49,14 @@ static void irqtime_account_delta(struct irqtime *irqtime, 
u64 delta,
  */
 void irqtime_account_irq(struct task_struct *curr)
 {
-   struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
+   struct irqtime *irqtime;
s64 delta;
int cpu;
 
-   if (!sched_clock_irqtime)
+   if (!static_branch_unlikely(&sched_clock_irqtime))
return;
 
+   irqtime = this_cpu_ptr(&cpu_irqtime);
cpu = smp_processor_id();
delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
irqtime->irq_start_time += delta;
@@ -84,16 +85,7 @@ static u64 irqtime_tick_accounted(u64 maxtime)
return delta;
 }
 
-#else /* CONFIG_IRQ_TIME_ACCOUNTING */
-
-#define sched_clock_irqtime(0)
-
-static u64 irqtime_tick_accounted(u64 dummy)
-{
-   return 0;
-}
-
-#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
 static inline void task_group_account_field(struct task_struct *p, int index,
u64 tmp)
@@ -475,7 +467,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
if (vtime_accounting_enabled_this_cpu())
return;
 
-   if (sched_clock_irqtime) {
+   if (static_branch_unlikely(&sched_clock_irqtime)) {
irqtime_account_process_tick(p, user_tick, 1);
return;
}
@@ -504,7 +496,7 @@ void account_idle_ticks(unsigned long ticks)
 {
u64 cputime, steal;
 
-   if (sched_clock_irqtime) {
+   if (static_branch_unlikely(&sched_clock_irqtime)) {
irqtime_account_idle_ticks(ticks);
return;
}
-- 
2.7.5

[...]
> +
> +static int __init irqstorm_setup(char *arg)
> +{
> + int res = kstrtoul(arg, 0, &irqstorm_limit);
> +
> + if (!res) {
> + pr_info("Interrupt storm detector enabled. Limit=%lu / s\n",
> + irqstorm_limit);
> + }
> + return !!res;
> +}
> +__setup("irqstorm_limit", irqstorm_setup);

This configuration-independent method looks appealing, and I am glad to give
it a try.

But irqstorm_limit may be a hard value to choose. Maybe by the formula:
instructions-per-cpu-per-second / number of instructions in the failed-irq
path? It is hard to estimate "instructions-per-cpu-per-second".

Thanks,
Pingfan


Re: [PATCH 0/3] warn and suppress irqflood

2020-10-28 Thread Pingfan Liu
On Tue, Oct 27, 2020 at 3:59 AM Thomas Gleixner  wrote:
>
[...]
>
> And contrary to Liu's patches which try to disable a requested interrupt
> if too many of them arrive, the kernel cannot do anything because there
> is nothing to disable in your case. That's why you needed to do the MSI
> disable magic in the early PCI quirks which run before interrupts get
> enabled.
>
> Also Liu's patch only works if:
>
>   1) CONFIG_IRQ_TIME_ACCOUNTING is enabled

I wonder whether it could be compiled in by default using the following method:
  DEFINE_STATIC_KEY_FALSE(irqtime_account), and enable it according to
a boot param.

This will have no impact on performance with the disabled branch.
Meanwhile users can easily turn on the option to detect an irq flood
without recompiling the kernel.

If it is doable, I will rework only on [1/2].
>
>   2) the runaway interrupt has been requested by the relevant driver in
>  the dump kernel.

Yes, that raises a big challenge for my method. The kdump kernel misses the
whole picture of the first kernel's irq routing.

Thanks,
Pingfan


Re: [Skiboot] [PATCH 0/3] warn and suppress irqflood

2020-10-25 Thread Pingfan Liu
On Sun, Oct 25, 2020 at 8:21 PM Oliver O'Halloran  wrote:
>
> On Sun, Oct 25, 2020 at 10:22 PM Pingfan Liu  wrote:
> >
> > On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner  wrote:
> > >
> > > On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > > > I hit a irqflood bug on powerpc platform, and two years ago, on a x86 
> > > > platform.
> > > > When the bug happens, the kernel is totally occupies by irq.  
> > > > Currently, there
> > > > may be nothing or just soft lockup warning showed in console. It is 
> > > > better
> > > > to warn users with irq flood info.
> > > >
> > > > In the kdump case, the kernel can move on by suppressing the irq flood.
> > >
> > > You're curing the symptom not the cause and the cure is just magic and
> > > can't work reliably.
> > Yeah, it is magic. But at least, it is better to printk something and
> > alarm users about what happens. With current code, it may show nothing
> > when system hangs.
> > >
> > > Where is that irq flood originated from and why is none of the
> > > mechanisms we have in place to shut it up working?
> > The bug originates from a driver tpm_i2c_nuvoton, which calls i2c-bus
> > driver (i2c-opal.c). After i2c_opal_send_request(), the bug is
> > triggered.
> >
> > But things are complicated by introducing a firmware layer: Skiboot.
> > This software layer hides the detail of manipulating the hardware from
> > Linux.
> >
> > I guess the software logic can not enter a sane state when kernel crashes.
> >
> > Cc Skiboot and ppc64 community to see whether anyone has idea about it.
>
> What system are you using?

Here is the info; if it is not enough, I will get more.
 Product Name  : OpenPOWER Firmware
 Product Version   : open-power-SUPERMICRO-P9DSU-V1.16-20180531-imp
 Product Extra : op-build-e4b3eb5
 Product Extra : skiboot-v6.0-p1da203b
 Product Extra : hostboot-f911e5c-pda8239f
 Product Extra : occ-77bb5e6-p623d1cd
 Product Extra : linux-4.16.7-openpower2-pbc45895
 Product Extra : petitboot-v1.7.1-pf773c0d
 Product Extra : machine-xml-218a77a

>
> There's an external interrupt pin which is supposed to be wired to the
> TPM. I think we bounce that interrupt to FW by default since the
> external interrupt is sometimes used for other system-specific
> purposes. Odds are FW doesn't know what to do with it so you
> effectively have an always-on LSI. I fixed a similar bug a while ago
> by having skiboot mask any interrupts it doesn't have a handler for,
This sounds like the root cause. But here Skiboot should have a handler,
otherwise the first kernel could not run smoothly.

Do you have any idea how an unexpected re-initialization could introduce
such an unsound state?

Thanks,
Pingfan


Re: [PATCH 0/3] warn and suppress irqflood

2020-10-25 Thread Pingfan Liu
On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner  wrote:
>
> On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > I hit a irqflood bug on powerpc platform, and two years ago, on a x86 
> > platform.
> > When the bug happens, the kernel is totally occupies by irq.  Currently, 
> > there
> > may be nothing or just soft lockup warning showed in console. It is better
> > to warn users with irq flood info.
> >
> > In the kdump case, the kernel can move on by suppressing the irq flood.
>
> You're curing the symptom not the cause and the cure is just magic and
> can't work reliably.
Yeah, it is magic. But at least it is better to printk something and alert
users about what is happening. With the current code, nothing may be shown
when the system hangs.
>
> Where is that irq flood originated from and why is none of the
> mechanisms we have in place to shut it up working?
The bug originates from the tpm_i2c_nuvoton driver, which calls the i2c-bus
driver (i2c-opal.c). After i2c_opal_send_request(), the bug is
triggered.

But things are complicated by a firmware layer: Skiboot. This software
layer hides the details of manipulating the hardware from Linux.

I guess the software logic cannot reach a sane state when the kernel crashes.

Cc Skiboot and ppc64 community to see whether anyone has idea about it.

Thanks,
Pingfan


[PATCH 3/3] Documentation: introduce a param "irqflood_suppress"

2020-10-21 Thread Pingfan Liu
The param "irqflood_suppress" is helpful for capture kernel to survive irq
flood.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Jisheng Zhang 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: Marc Zyngier 
Cc: Linus Walleij 
Cc: afzal mohammed 
Cc: Lina Iyer 
Cc: "Gustavo A. R. Silva" 
Cc: Maulik Shah 
Cc: Al Viro 
Cc: Jonathan Corbet 
Cc: Pawan Gupta 
Cc: Mike Kravetz 
Cc: Oliver Neukum 
To: linux-kernel@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: ke...@lists.infradead.org
---
 Documentation/admin-guide/kernel-parameters.txt | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index a106874..0a25a05 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2009,6 +2009,9 @@
for it. Also check all handlers each timer
interrupt. Intended to get systems with badly broken
firmware running.
+   irqflood_suppress   [HW]
+   When an irq fully occupies a cpu for a long time, suppress
+   it so the kernel can move on. It is useful in the capture kernel.
 
isapnp= [ISAPNP]
Format: ,,,
-- 
2.7.5



[PATCH 1/3] kernel/watchdog: show irq percentage if irq floods

2020-10-21 Thread Pingfan Liu
In the kdump case, the capture kernel boots without devices having been shut
down, which leaves the devices in an unstable state. An irq flood is one of
the consequences.

When irqs flood, the kernel is totally occupied by irq handling. Currently,
either nothing or just a soft lockup warning is shown on the console. It is
better to warn users with irq flood info.

The soft lockup watchdog is a good foundation on which to implement the
detector. An irq flood is reported if the following two conditions are met:
  -1. irqs occupy too much (98%) of the past interval.
  -2. no task has been scheduled in the past interval. This is implemented
  by check_hint.
  A note: is_softlockup() implies the 2nd condition, but unfortunately an irq
  flood can come from anywhere. If it comes via irq_enter_rcu()->tick_irq_enter(),
  then touch_softlockup_watchdog_sched() will reset the watchdog, and the soft
  lockup is never detected.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Jisheng Zhang 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: Marc Zyngier 
Cc: Linus Walleij 
Cc: afzal mohammed 
Cc: Lina Iyer 
Cc: "Gustavo A. R. Silva" 
Cc: Maulik Shah 
Cc: Al Viro 
Cc: Jonathan Corbet 
Cc: Pawan Gupta 
Cc: Mike Kravetz 
Cc: Oliver Neukum 
To: linux-kernel@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: ke...@lists.infradead.org
---
 kernel/watchdog.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 5abb5b2..230ac38 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -175,6 +176,13 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
+
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static DEFINE_PER_CPU(long, check_hint) = {-1};
+static DEFINE_PER_CPU(unsigned long, last_irq_ts);
+static DEFINE_PER_CPU(unsigned long, last_total_ts);
+#endif
+
 static unsigned long soft_lockup_nmi_warn;
 
 static int __init nowatchdog_setup(char *str)
@@ -331,12 +339,43 @@ static DEFINE_PER_CPU(struct cpu_stop_work, 
softlockup_stop_work);
  */
 static int softlockup_fn(void *data)
 {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+   __this_cpu_write(check_hint, -1);
+#endif
__touch_watchdog();
complete(this_cpu_ptr(&softlockup_completion));
 
return 0;
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static void check_irq_flood(void)
+{
+   unsigned long irqts, totalts, percent, cnt;
+   u64 *cpustat = kcpustat_this_cpu->cpustat;
+
+   totalts = running_clock();
+   irqts = cpustat[CPUTIME_IRQ] + cpustat[CPUTIME_SOFTIRQ];
+   cnt = __this_cpu_inc_return(check_hint);
+
+   if (cnt >= 5) {
+   totalts = totalts - __this_cpu_read(last_total_ts);
+   irqts = irqts - __this_cpu_read(last_irq_ts);
+   percent = irqts * 100 / totalts;
+   percent =  percent < 100 ? percent : 100;
+   __this_cpu_write(check_hint, -1);
+   if (percent >= 98)
+   pr_info("Irq flood occupies more than %lu%% of the past 
%lu seconds\n",
+   percent, totalts >> 30);
+   } else if (cnt == 0) {
+   __this_cpu_write(last_total_ts, totalts);
+   __this_cpu_write(last_irq_ts, irqts);
+   }
+}
+#else
+static void check_irq_flood(void){}
+#endif
+
 /* watchdog kicker functions */
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
@@ -348,6 +387,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
if (!watchdog_enabled)
return HRTIMER_NORESTART;
 
+   /* When irq floods, watchdog may be still touched. Hence it can not be done inside lockup */
+   check_irq_flood();
/* kick the hardlockup detector */
watchdog_interrupt_count();
 
-- 
2.7.5



[PATCH 2/3] kernel/watchdog: suppress max irq when irq floods

2020-10-21 Thread Pingfan Liu
The capture kernel should try its best to save the crash info. Normally, an
irq flood is caused by some trivial device, which has no impact on saving
the vmcore.

Introduce a parameter "irqflood_suppress" to enable suppressing an irq flood.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Jisheng Zhang 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: Marc Zyngier 
Cc: Linus Walleij 
Cc: afzal mohammed 
Cc: Lina Iyer 
Cc: "Gustavo A. R. Silva" 
Cc: Maulik Shah 
Cc: Al Viro 
Cc: Jonathan Corbet 
Cc: Pawan Gupta 
Cc: Mike Kravetz 
Cc: Oliver Neukum 
To: linux-kernel@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: ke...@lists.infradead.org
---
 include/linux/irq.h   |  2 ++
 kernel/irq/spurious.c | 32 
 kernel/watchdog.c |  9 -
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 1b7f4df..140cb61 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -684,6 +684,8 @@ extern void note_interrupt(struct irq_desc *desc, 
irqreturn_t action_ret);
 /* Enable/disable irq debugging output: */
 extern int noirqdebug_setup(char *str);
 
+void suppress_max_irq(void);
+
 /* Checks whether the interrupt can be requested by request_irq(): */
 extern int can_request_irq(unsigned int irq, unsigned long irqflags);
 
diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index f865e5f..d3d94d6 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -464,3 +464,35 @@ static int __init irqpoll_setup(char *str)
 }
 
 __setup("irqpoll", irqpoll_setup);
+
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+static bool irqflood_suppress;
+
+static int __init irqflood_suppress_setup(char *str)
+{
+   irqflood_suppress = true;
+   pr_info("enable auto suppress irqflood\n");
+   return 1;
+}
+__setup("irqflood_suppress", irqflood_suppress_setup);
+
+void suppress_max_irq(void)
+{
+   unsigned int tmp, maxirq = 0, max = 0;
+   int irq;
+
+   if (!irqflood_suppress)
+   return;
+   for_each_active_irq(irq) {
+   tmp = kstat_irqs_cpu(irq, smp_processor_id());
+   if (max < tmp) {
+   maxirq = irq;
+   max = tmp;
+   }
+   }
+   pr_warn("Suppress irq:%u, which is triggered %u times\n",
+   maxirq, max);
+   disable_irq_nosync(maxirq);
+}
+#endif
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 230ac38..28a74e5 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -364,9 +365,15 @@ static void check_irq_flood(void)
percent = irqts * 100 / totalts;
percent =  percent < 100 ? percent : 100;
__this_cpu_write(check_hint, -1);
-   if (percent >= 98)
+   if (percent >= 98) {
pr_info("Irq flood occupies more than %lu%% of the past 
%lu seconds\n",
percent, totalts >> 30);
+   /*
+* Suppress top irq when scheduler does not work for 
long time and irq
+* occupies too much time.
+*/
+   suppress_max_irq();
+   }
} else if (cnt == 0) {
__this_cpu_write(last_total_ts, totalts);
__this_cpu_write(last_irq_ts, irqts);
-- 
2.7.5



[PATCH 0/3] warn and suppress irqflood

2020-10-21 Thread Pingfan Liu
I hit an irq flood bug on a powerpc platform, and two years ago on an x86
platform. When the bug happens, the kernel is totally occupied by irq
handling. Currently, either nothing or just a soft lockup warning is shown
on the console. It is better to warn users with irq flood info.

In the kdump case, the kernel can move on by suppressing the irq flood.

Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Jisheng Zhang 
Cc: Andrew Morton 
Cc: "Guilherme G. Piccoli" 
Cc: Petr Mladek 
Cc: Marc Zyngier 
Cc: Linus Walleij 
Cc: afzal mohammed 
Cc: Lina Iyer 
Cc: "Gustavo A. R. Silva" 
Cc: Maulik Shah 
Cc: Al Viro 
Cc: Jonathan Corbet 
Cc: Pawan Gupta 
Cc: Mike Kravetz 
Cc: Oliver Neukum 
To: linux-kernel@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: ke...@lists.infradead.org

Pingfan Liu (3):
  kernel/watchdog: show irq percentage if irq floods
  kernel/watchdog: suppress max irq when irq floods
  Documentation: introduce a param "irqflood_suppress"

 Documentation/admin-guide/kernel-parameters.txt |  3 ++
 include/linux/irq.h |  2 ++
 kernel/irq/spurious.c   | 32 +
 kernel/watchdog.c   | 48 +
 4 files changed, 85 insertions(+)

-- 
2.7.5



Re: [PATCH] sched/cputime: correct account of irqtime

2020-10-14 Thread Pingfan Liu
On Wed, Oct 14, 2020 at 9:02 PM Peter Zijlstra  wrote:
>
> On Mon, Oct 12, 2020 at 09:50:44PM +0800, Pingfan Liu wrote:
> > __do_softirq() may be interrupted by hardware interrupts. In this case,
> > irqtime_account_irq() will account the time slice as CPUTIME_SOFTIRQ by
> > mistake.
> >
> > By passing irqtime_account_irq() an extra param about either hardirq or
> > softirq, irqtime_account_irq() can handle the above case.
>
> I'm not sure I see the scenario in which it goes wrong.
>
> irqtime_account_irq() is designed such that we're called with the old
> preempt_count on enter and the new preempt_count on exit. This way we'll
> accumuate the delta to the previous context.
>
Oops! You are right, the time delta between a softirq and an interrupting
hardirq should be accounted to the softirq.
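
To restate the mechanism as I understand it now (rough timeline, based on how
account_irq_enter_time()/account_irq_exit_time() are ordered around the
HARDIRQ_OFFSET updates in __irq_enter()/__irq_exit()):

  ...softirq running...X====== hardirq ======X...softirq resumes...
                       |                     |
          __irq_enter():              __irq_exit():
          account_irq_enter_time()    account_irq_exit_time()
          runs before HARDIRQ_OFFSET  runs before HARDIRQ_OFFSET
          is added, so the delta up   is removed, so the hardirq
          to this point is charged    slice is charged to
          to CPUTIME_SOFTIRQ          CPUTIME_IRQ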

Thanks for your clear explanation.

Regards,
Pingfan


Re: [PATCH] sched/cputime: correct account of irqtime

2020-10-12 Thread Pingfan Liu
On Tue, Oct 13, 2020 at 11:10 AM jun qian  wrote:
>
> Pingfan Liu  于2020年10月12日周一 下午9:54写道:
> >
> > __do_softirq() may be interrupted by hardware interrupts. In this case,
> > irqtime_account_irq() will account the time slice as CPUTIME_SOFTIRQ by
> > mistake.
> >
> > By passing irqtime_account_irq() an extra param about either hardirq or
> > softirq, irqtime_account_irq() can handle the above case.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Ingo Molnar 
> > Cc: Peter Zijlstra 
> > Cc: Juri Lelli 
> > Cc: Vincent Guittot 
> > Cc: Dietmar Eggemann 
> > Cc: Steven Rostedt 
> > Cc: Ben Segall 
> > Cc: Mel Gorman 
> > Cc: Thomas Gleixner 
> > Cc: Andy Lutomirski 
> > Cc: Will Deacon 
> > Cc: "Paul E. McKenney" 
> > Cc: Frederic Weisbecker 
> > Cc: Allen Pais 
> > Cc: Romain Perier 
> > To: linux-kernel@vger.kernel.org
> > ---
> >  include/linux/hardirq.h |  4 ++--
> >  include/linux/vtime.h   | 12 ++--
> >  kernel/sched/cputime.c  |  4 ++--
> >  kernel/softirq.c|  6 +++---
> >  4 files changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index 754f67a..56e7bb5 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -32,7 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
> >   */
> >  #define __irq_enter()  \
> > do {\
> > -   account_irq_enter_time(current);\
> > +   account_irq_enter_time(current, true);  \
> > preempt_count_add(HARDIRQ_OFFSET);  \
> > lockdep_hardirq_enter();\
> > } while (0)
> > @@ -63,7 +63,7 @@ void irq_enter_rcu(void);
> >  #define __irq_exit()   \
> > do {\
> > lockdep_hardirq_exit(); \
> > -   account_irq_exit_time(current); \
> > +   account_irq_exit_time(current, true);   \
> > preempt_count_sub(HARDIRQ_OFFSET);  \
> > } while (0)
> >
> > diff --git a/include/linux/vtime.h b/include/linux/vtime.h
> > index 2cdeca0..294188ae1 100644
> > --- a/include/linux/vtime.h
> > +++ b/include/linux/vtime.h
> > @@ -98,21 +98,21 @@ static inline void vtime_flush(struct task_struct *tsk) 
> > { }
> >
> >
> >  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> > -extern void irqtime_account_irq(struct task_struct *tsk);
> > +extern void irqtime_account_irq(struct task_struct *tsk, bool hardirq);
> >  #else
> > -static inline void irqtime_account_irq(struct task_struct *tsk) { }
> > +static inline void irqtime_account_irq(struct task_struct *tsk, bool 
> > hardirq) { }
> >  #endif
> >
> > -static inline void account_irq_enter_time(struct task_struct *tsk)
> > +static inline void account_irq_enter_time(struct task_struct *tsk, bool 
> > hardirq)
> >  {
> > vtime_account_irq_enter(tsk);
> > -   irqtime_account_irq(tsk);
> > +   irqtime_account_irq(tsk, hardirq);
> >  }
> >
> > -static inline void account_irq_exit_time(struct task_struct *tsk)
> > +static inline void account_irq_exit_time(struct task_struct *tsk, bool 
> > hardirq)
> >  {
> > vtime_account_irq_exit(tsk);
> > -   irqtime_account_irq(tsk);
> > +   irqtime_account_irq(tsk, hardirq);
> >  }
> >
> >  #endif /* _LINUX_KERNEL_VTIME_H */
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 5a55d23..166f1d7 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -47,7 +47,7 @@ static void irqtime_account_delta(struct irqtime 
> > *irqtime, u64 delta,
> >   * Called before incrementing preempt_count on {soft,}irq_enter
> >   * and before decrementing preempt_count on {soft,}irq_exit.
> >   */
> > -void irqtime_account_irq(struct task_struct *curr)
> > +void irqtime_account_irq(struct task_struct *curr, bool hardirq)
> >  {
> > struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
> > s64 delta;
> > @@ -68,7 +68,7 @@ void irqtime_account_irq(struct task_struct *curr)
> >  */
> > if (hardirq_count())
> > irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
> > -   else if (in_serving_softirq() && curr

[PATCH] sched/cputime: correct account of irqtime

2020-10-12 Thread Pingfan Liu
__do_softirq() may be interrupted by hardware interrupts. In this case,
irqtime_account_irq() will account the time slice as CPUTIME_SOFTIRQ by
mistake.

By passing irqtime_account_irq() an extra param about either hardirq or
softirq, irqtime_account_irq() can handle the above case.

Signed-off-by: Pingfan Liu 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Will Deacon 
Cc: "Paul E. McKenney" 
Cc: Frederic Weisbecker 
Cc: Allen Pais 
Cc: Romain Perier 
To: linux-kernel@vger.kernel.org
---
 include/linux/hardirq.h |  4 ++--
 include/linux/vtime.h   | 12 ++--
 kernel/sched/cputime.c  |  4 ++--
 kernel/softirq.c|  6 +++---
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 754f67a..56e7bb5 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,7 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter()  \
do {\
-   account_irq_enter_time(current);\
+   account_irq_enter_time(current, true);  \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
} while (0)
@@ -63,7 +63,7 @@ void irq_enter_rcu(void);
 #define __irq_exit()   \
do {\
lockdep_hardirq_exit(); \
-   account_irq_exit_time(current); \
+   account_irq_exit_time(current, true);   \
preempt_count_sub(HARDIRQ_OFFSET);  \
} while (0)
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 2cdeca0..294188ae1 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -98,21 +98,21 @@ static inline void vtime_flush(struct task_struct *tsk) { }
 
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, bool hardirq);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, bool hardirq) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_irq_enter_time(struct task_struct *tsk, bool hardirq)
 {
vtime_account_irq_enter(tsk);
-   irqtime_account_irq(tsk);
+   irqtime_account_irq(tsk, hardirq);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_irq_exit_time(struct task_struct *tsk, bool hardirq)
 {
vtime_account_irq_exit(tsk);
-   irqtime_account_irq(tsk);
+   irqtime_account_irq(tsk, hardirq);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5a55d23..166f1d7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -47,7 +47,7 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
  * Called before incrementing preempt_count on {soft,}irq_enter
  * and before decrementing preempt_count on {soft,}irq_exit.
  */
-void irqtime_account_irq(struct task_struct *curr)
+void irqtime_account_irq(struct task_struct *curr, bool hardirq)
 {
struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
s64 delta;
@@ -68,7 +68,7 @@ void irqtime_account_irq(struct task_struct *curr)
 */
if (hardirq_count())
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
-   else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+   else if (in_serving_softirq() && curr != this_cpu_ksoftirqd() && !hardirq)
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bf88d7f6..da59ea39 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -270,7 +270,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
current->flags &= ~PF_MEMALLOC;
 
pending = local_softirq_pending();
-   account_irq_enter_time(current);
+   account_irq_enter_time(current, false);
 
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
in_hardirq = lockdep_softirq_start();
@@ -321,7 +321,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
}
 
lockdep_softirq_end(in_hardirq);
-   account_irq_exit_time(current);
+   account_irq_exit_time(current, false);
__local_bh_enable(SOFTIRQ_OFFSET);
WARN_ON_ONCE(in_interrupt());
current_restore_flags(old_flags, PF_MEMALLOC);
@@ -417,7 +417,7 @@ static inline void __irq_exit_rcu(void)
 #else
  

[tip: x86/urgent] x86/purgatory: Don't generate debug info for purgatory.ro

2020-08-06 Thread tip-bot2 for Pingfan Liu
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 52416ffcf823ee11aa19792715664ab94757f111
Gitweb: https://git.kernel.org/tip/52416ffcf823ee11aa19792715664ab94757f111
Author:Pingfan Liu 
AuthorDate:Mon, 03 Aug 2020 13:49:48 +08:00
Committer: Ingo Molnar 
CommitterDate: Fri, 07 Aug 2020 01:32:00 +02:00

x86/purgatory: Don't generate debug info for purgatory.ro

Purgatory.ro is a standalone binary that is not linked against the rest of
the kernel.  Its image is copied into an array that is linked to the
kernel, and from there kexec relocates it wherever it desires.

Unlike the debug info for vmlinux, which can be used for analyzing a crash,
such info is useless in purgatory.ro, and discarding it can save about
200K of space.

 Original:
   259080  kexec-purgatory.o
 Stripped debug info:
29152  kexec-purgatory.o

Signed-off-by: Pingfan Liu 
Signed-off-by: Ingo Molnar 
Reviewed-by: Nick Desaulniers 
Reviewed-by: Steve Wahl 
Acked-by: Dave Young 
Link: https://lore.kernel.org/r/1596433788-3784-1-git-send-email-kernelf...@gmail.com
---
 arch/x86/purgatory/Makefile | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..d24b43a 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
 # make up the standalone purgatory.ro
 
 PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
-PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
+PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0
 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
 PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)
 
@@ -64,6 +64,9 @@ CFLAGS_sha256.o   += $(PURGATORY_CFLAGS)
 CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE)
 CFLAGS_string.o+= $(PURGATORY_CFLAGS)
 
+AFLAGS_REMOVE_setup-x86_$(BITS).o  += -Wa,-gdwarf-2
+AFLAGS_REMOVE_entry64.o+= -Wa,-gdwarf-2
+
 $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
$(call if_changed,ld)
 


[tip: x86/urgent] x86/purgatory: Don't generate debug info for purgatory.ro

2020-08-06 Thread tip-bot2 for Pingfan Liu
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: b031cf7752a82fefa6818a788d906d14f533afa9
Gitweb: https://git.kernel.org/tip/b031cf7752a82fefa6818a788d906d14f533afa9
Author:Pingfan Liu 
AuthorDate:Mon, 03 Aug 2020 13:49:48 +08:00
Committer: Ingo Molnar 
CommitterDate: Thu, 06 Aug 2020 15:29:25 +02:00

x86/purgatory: Don't generate debug info for purgatory.ro

Purgatory.ro is a standalone binary that is not linked against the rest of
the kernel.  Its image is copied into an array that is linked to the
kernel, and from there kexec relocates it wherever it desires.

Unlike the debug info for vmlinux, which can be used for analyzing a crash,
such info is useless in purgatory.ro, and discarding it can save about
200K of space.

 Original:
   259080  kexec-purgatory.o
 Stripped debug info:
29152  kexec-purgatory.o

Signed-off-by: Pingfan Liu 
Signed-off-by: Ingo Molnar 
Reviewed-by: Nick Desaulniers 
Reviewed-by: Steve Wahl 
Acked-by: Dave Young 
Link: https://lore.kernel.org/r/1596433788-3784-1-git-send-email-kernelf...@gmail.com
---
 arch/x86/purgatory/Makefile | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..d24b43a 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
 # make up the standalone purgatory.ro
 
 PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
-PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
+PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0
 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
 PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)
 
@@ -64,6 +64,9 @@ CFLAGS_sha256.o   += $(PURGATORY_CFLAGS)
 CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE)
 CFLAGS_string.o+= $(PURGATORY_CFLAGS)
 
+AFLAGS_REMOVE_setup-x86_$(BITS).o  += -Wa,-gdwarf-2
+AFLAGS_REMOVE_entry64.o+= -Wa,-gdwarf-2
+
 $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
$(call if_changed,ld)
 


[PATCHv2] x86/purgatory: don't generate debug info for purgatory.ro

2020-08-02 Thread Pingfan Liu
Purgatory.ro is a standalone binary that is not linked against the rest of
the kernel.  Its image is copied into an array that is linked to the
kernel, and from there kexec relocates it wherever it desires.

Unlike the debug info for vmlinux, which can be used for analyzing a crash,
such info is useless in purgatory.ro, and discarding it can save about
200K of space.

Original:
  259080  kexec-purgatory.o
Stripped debug info:
   29152  kexec-purgatory.o

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Hans de Goede 
Cc: Nick Desaulniers 
Cc: Arvind Sankar 
Cc: Steve Wahl 
Cc: linux-kernel@vger.kernel.org
Cc: ke...@lists.infradead.org
To: x...@kernel.org
---
 arch/x86/purgatory/Makefile | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..d24b43a 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
 # make up the standalone purgatory.ro
 
 PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
-PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
+PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss -g0
 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
 PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)
 
@@ -64,6 +64,9 @@ CFLAGS_sha256.o   += $(PURGATORY_CFLAGS)
 CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE)
 CFLAGS_string.o+= $(PURGATORY_CFLAGS)
 
+AFLAGS_REMOVE_setup-x86_$(BITS).o  += -Wa,-gdwarf-2
+AFLAGS_REMOVE_entry64.o+= -Wa,-gdwarf-2
+
 $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
$(call if_changed,ld)
 
-- 
2.7.5



Re: [PATCH] x86/purgatory: strip debug info

2020-08-02 Thread Pingfan Liu
On Sat, Aug 1, 2020 at 2:18 AM Nick Desaulniers  wrote:
>
> On Fri, Jul 31, 2020 at 2:36 AM Pingfan Liu  wrote:
> >
> > On Fri, Jul 31, 2020 at 7:11 AM Nick Desaulniers
> >  wrote:
> > >
> > > On Thu, Jul 30, 2020 at 1:27 AM Pingfan Liu  wrote:
> > > >
> > > > It is useless to keep debug info in purgatory. And discarding them saves
> > > > about 200K space.
> > > >
> > > > Original:
> > > >   259080  kexec-purgatory.o
> > > > Stripped:
> > > >29152  kexec-purgatory.o
> > > >
> > > > Signed-off-by: Pingfan Liu 
> > > > Cc: Thomas Gleixner 
> > > > Cc: Ingo Molnar 
> > > > Cc: Borislav Petkov 
> > > > Cc: "H. Peter Anvin" 
> > > > Cc: Hans de Goede 
> > > > Cc: Nick Desaulniers 
> > > > Cc: Arvind Sankar 
> > > > Cc: Steve Wahl 
> > > > Cc: linux-kernel@vger.kernel.org
> > > > To: x...@kernel.org
> > >
> > > I don't see any code in
> > > arch/x86/purgatory/
> > > arch/x86/include/asm/purgatory.h
> > > include/linux/purgatory.h
> > > include/uapi/linux/kexec.h
> > > kernel/kexec*
> > > include/linux/kexec.h
> > > include/linux/crash_dump.h
> > > kernel/crash_dump.c
> > > arch/x86/kernel/crash*
> > > https://github.com/horms/kexec-tools/tree/master/kexec/arch/x86_64
> > > that mentions any kind of debug info section.  I'm not sure what you'd
> > > do with the debug info anyway for this binary.  So I suspect this
> > > information should ok to discard.
> > >
> > > This works, but it might be faster to build to not generate the
> > > compile info in the first place via compile flag `-g0`, which could be
> > > added `ifdef CONFIG_DEBUG_INFO` or even just unconditionally.  That
> > > way we're not doing additional work to generate debug info, then
> > > additional work to throw it away.
> > What about:
> > diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> > index 088bd76..7e1ad9e 100644
> > --- a/arch/x86/purgatory/Makefile
> > +++ b/arch/x86/purgatory/Makefile
> > @@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
> >  # make up the standalone purgatory.ro
> >
> >  PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
> > -PURGATORY_CFLAGS := -mcmodel=large -ffreestanding 
> > -fno-zero-initialized-in-bss
> > +PURGATORY_CFLAGS := -mcmodel=large -ffreestanding
> > -fno-zero-initialized-in-bss -g0
> >  PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
> >  PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)
>
> I tested your patch but still see .debug_* sections in the .ro from a few .o.
>
> At least on
> * setup-x86_64.o
> * entry64.o
>
> If you add the following hunk to your diff:
> ```
> @@ -64,6 +64,9 @@ CFLAGS_sha256.o   += $(PURGATORY_CFLAGS)
>  CFLAGS_REMOVE_string.o += $(PURGATORY_CFLAGS_REMOVE)
>  CFLAGS_string.o+= $(PURGATORY_CFLAGS)
>
> +AFLAGS_REMOVE_setup-x86_$(BITS).o  += -Wa,-gdwarf-2
> +AFLAGS_REMOVE_entry64.o+= -Wa,-gdwarf-2
> +
I went through the man pages for as and gcc, and could not find a simpler
method than your suggestion.
>  $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
> $(call if_changed,ld)
> ```
> then that should do it.  Then you can verify the .ro file via:
> $ llvm-readelf -S arch/x86/purgatory/purgatory.ro | not grep debug_
> (no output, should return zero)
Thank you for your good suggestion; I will update v2.

Regards,
Pingfan


Re: [PATCH] x86/purgatory: strip debug info

2020-07-31 Thread Pingfan Liu
On Fri, Jul 31, 2020 at 7:11 AM Nick Desaulniers
 wrote:
>
> On Thu, Jul 30, 2020 at 1:27 AM Pingfan Liu  wrote:
> >
> > It is useless to keep debug info in purgatory. And discarding them saves
> > about 200K space.
> >
> > Original:
> >   259080  kexec-purgatory.o
> > Stripped:
> >29152  kexec-purgatory.o
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Hans de Goede 
> > Cc: Nick Desaulniers 
> > Cc: Arvind Sankar 
> > Cc: Steve Wahl 
> > Cc: linux-kernel@vger.kernel.org
> > To: x...@kernel.org
>
> I don't see any code in
> arch/x86/purgatory/
> arch/x86/include/asm/purgatory.h
> include/linux/purgatory.h
> include/uapi/linux/kexec.h
> kernel/kexec*
> include/linux/kexec.h
> include/linux/crash_dump.h
> kernel/crash_dump.c
> arch/x86/kernel/crash*
> https://github.com/horms/kexec-tools/tree/master/kexec/arch/x86_64
> that mentions any kind of debug info section.  I'm not sure what you'd
> do with the debug info anyway for this binary.  So I suspect this
> information should ok to discard.
>
> This works, but it might be faster to build to not generate the
> compile info in the first place via compile flag `-g0`, which could be
> added `ifdef CONFIG_DEBUG_INFO` or even just unconditionally.  That
> way we're not doing additional work to generate debug info, then
> additional work to throw it away.
What about:
diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..7e1ad9e 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -32,7 +32,7 @@ KCOV_INSTRUMENT := n
 # make up the standalone purgatory.ro

 PURGATORY_CFLAGS_REMOVE := -mcmodel=kernel
-PURGATORY_CFLAGS := -mcmodel=large -ffreestanding -fno-zero-initialized-in-bss
+PURGATORY_CFLAGS := -mcmodel=large -ffreestanding
-fno-zero-initialized-in-bss -g0
 PURGATORY_CFLAGS += $(DISABLE_STACKLEAK_PLUGIN) -DDISABLE_BRANCH_PROFILING
 PURGATORY_CFLAGS += $(call cc-option,-fno-stack-protector)

Thanks,
Pingfan


[PATCH] x86/purgatory: strip debug info

2020-07-30 Thread Pingfan Liu
It is useless to keep debug info in purgatory, and discarding it saves
about 200K of space.

Original:
  259080  kexec-purgatory.o
Stripped:
   29152  kexec-purgatory.o

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Hans de Goede 
Cc: Nick Desaulniers 
Cc: Arvind Sankar 
Cc: Steve Wahl 
Cc: linux-kernel@vger.kernel.org
To: x...@kernel.org
---
 arch/x86/purgatory/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 088bd76..4340ae6 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -16,7 +16,7 @@ CFLAGS_sha256.o := -D__DISABLE_EXPORTS
 
 # When linking purgatory.ro with -r unresolved symbols are not checked,
 # also link a purgatory.chk binary without -r to check for unresolved symbols.
-PURGATORY_LDFLAGS := -e purgatory_start -nostdlib -z nodefaultlib
+PURGATORY_LDFLAGS := -e purgatory_start -nostdlib -z nodefaultlib -S
 LDFLAGS_purgatory.ro := -r $(PURGATORY_LDFLAGS)
 LDFLAGS_purgatory.chk := $(PURGATORY_LDFLAGS)
 targets += purgatory.ro purgatory.chk
-- 
2.7.5



Re: [PATCH 2/3] mm/migrate: see hole as invalid source page

2019-08-26 Thread Pingfan Liu
On Fri, Aug 16, 2019 at 11:02:22PM +0800, Pingfan Liu wrote:
> On Thu, Aug 15, 2019 at 01:22:22PM -0400, Jerome Glisse wrote:
> > On Tue, Aug 06, 2019 at 04:00:10PM +0800, Pingfan Liu wrote:
> > > MIGRATE_PFN_MIGRATE marks a valid pfn, further more, suitable to migrate.
> > > As for hole, there is no valid pfn, not to mention migration.
> > > 
> > > Before this patch, hole has already relied on the following code to be
> > > filtered out. Hence it is more reasonable to see hole as invalid source
> > > page.
> > > migrate_vma_prepare()
> > > {
> > >   struct page *page = migrate_pfn_to_page(migrate->src[i]);
> > > 
> > >   if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
> > >\_ this condition
> > > }
> > 
> > NAK you break the API, MIGRATE_PFN_MIGRATE is use for 2 things,
> > first it allow the collection code to mark entry that can be
> > migrated, then it use by driver to allow driver to skip migration
> > for some entry (for whatever reason the driver might have), we
> > still need to keep the entry and not clear it so that we can
> > cleanup thing (ie remove migration pte entry).
> Thanks for your kindly review.
> 
> I read the code again. Maybe I miss something. But as my understanding,
> for hole, there is no pte.
> As the current code migrate_vma_collect_pmd()
> {
>   if (pmd_none(*pmdp))
>   return migrate_vma_collect_hole(start, end, walk);
> ...
>   make_migration_entry()
> }
> 
> We do not install migration entry for hole, then no need to remove
> migration pte entry.
> 
> And on the driver side, there is way to migrate a hole. The driver just
> skip it by
> drivers/gpu/drm/nouveau/nouveau_dmem.c: if (!spage || !(src_pfns[i] & 
> MIGRATE_PFN_MIGRATE))
>  
> Finally, in migrate_vma_finalize(), for a hole,
>   if (!page) {
>   if (newpage) {
>   unlock_page(newpage);
>   put_page(newpage);
>   }
>   continue;
>   }
> And we do not rely on remove_migration_ptes(page, newpage, false); to
> restore the orignal pte (and it is impossible).
> 
Hello, do you have any comments?

Most importantly, I think a hole does not use the 'MIGRATE_PFN_MIGRATE'
API. A hole has no pte, so there is no way to 'remove a migration pte
entry'. Furthermore, we can detect the failure on the source side; there
is no need to defer it to the driver side.

In this way, [3/3] can unify the code.
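
As a toy user-space model of this argument (the flag value, the pfn encoding
and all names below are invented, not the kernel's): if the collect stage
records 0 for a hole instead of MIGRATE_PFN_MIGRATE, every later stage can
reject it with the same "invalid source" check, which is what allows [3/3] to
drop the dedicated hole path.

#include <stdio.h>

#define MIGRATE_PFN_MIGRATE     (1UL << 1)      /* toy flag value */

/* collect stage: a hole has no backing page, so record 0 for it */
static unsigned long collect_entry(int is_hole)
{
        return is_hole ? 0 : (0x1234UL << 2) | MIGRATE_PFN_MIGRATE;
}

int main(void)
{
        unsigned long src[2] = { collect_entry(0), collect_entry(1) };
        int i;

        for (i = 0; i < 2; i++) {
                if (!(src[i] & MIGRATE_PFN_MIGRATE)) {
                        printf("entry %d: invalid source, skipped by every stage\n", i);
                        continue;
                }
                printf("entry %d: candidate for migration\n", i);
        }
        return 0;
}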

Thanks,
Pingfan


Re: [PATCH 2/3] mm/migrate: see hole as invalid source page

2019-08-16 Thread Pingfan Liu
On Thu, Aug 15, 2019 at 01:22:22PM -0400, Jerome Glisse wrote:
> On Tue, Aug 06, 2019 at 04:00:10PM +0800, Pingfan Liu wrote:
> > MIGRATE_PFN_MIGRATE marks a valid pfn, further more, suitable to migrate.
> > As for hole, there is no valid pfn, not to mention migration.
> > 
> > Before this patch, hole has already relied on the following code to be
> > filtered out. Hence it is more reasonable to see hole as invalid source
> > page.
> > migrate_vma_prepare()
> > {
> > struct page *page = migrate_pfn_to_page(migrate->src[i]);
> > 
> > if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
> >  \_ this condition
> > }
> 
> NAK you break the API, MIGRATE_PFN_MIGRATE is use for 2 things,
> first it allow the collection code to mark entry that can be
> migrated, then it use by driver to allow driver to skip migration
> for some entry (for whatever reason the driver might have), we
> still need to keep the entry and not clear it so that we can
> cleanup thing (ie remove migration pte entry).
Thanks for your kind review.

I read the code again. Maybe I am missing something, but as I understand
it, for a hole there is no pte.
As the current code migrate_vma_collect_pmd()
{
if (pmd_none(*pmdp))
return migrate_vma_collect_hole(start, end, walk);
...
make_migration_entry()
}

We do not install a migration entry for a hole, so there is no need to
remove a migration pte entry.

And on the driver side, there is no way to migrate a hole; the driver just
skips it by
drivers/gpu/drm/nouveau/nouveau_dmem.c: if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
 
Finally, in migrate_vma_finalize(), for a hole,
if (!page) {
if (newpage) {
unlock_page(newpage);
put_page(newpage);
}
continue;
}
And we do not rely on remove_migration_ptes(page, newpage, false); to
restore the original pte (and it is impossible).

Thanks,
Pingfan


Re: [PATCHv2] mm/migrate: clean up useless code in migrate_vma_collect_pmd()

2019-08-16 Thread Pingfan Liu
On Thu, Aug 15, 2019 at 03:40:21PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:23:44PM -0700, Ralph Campbell wrote:
> [...]
> > 
> > I don't understand. The only use of "pfn" I see is in the "else"
> > clause above where it is set just before using it.
> 
> Ok i managed to confuse myself with mpfn and probably with old
> version of the code. Sorry for reading too quickly. Can we move
> unsigned long pfn; into the else { branch so that there is no
> more confusion to its scope.
Looks better, I will update v3.

Thanks,
Pingfan


[PATCHv2] mm/migrate: clean up useless code in migrate_vma_collect_pmd()

2019-08-07 Thread Pingfan Liu
Clean up useless 'pfn' variable.

Signed-off-by: Pingfan Liu 
Cc: "Jérôme Glisse" 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: "Kirill A. Shutemov" 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: Matthew Wilcox 
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8992741..d483a55 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2225,17 +2225,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pte_t pte;
 
pte = *ptep;
-   pfn = pte_pfn(pte);
 
if (pte_none(pte)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
-   pfn = 0;
goto next;
}
 
if (!pte_present(pte)) {
-   mpfn = pfn = 0;
+   mpfn = 0;
 
/*
 * Only care about unaddressable device page special
@@ -2252,10 +2250,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_write_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
+   pfn = pte_pfn(pte);
if (is_zero_pfn(pfn)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
-   pfn = 0;
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
@@ -2265,10 +2263,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
/* FIXME support THP */
if (!page || !page->mapping || PageTransCompound(page)) {
-   mpfn = pfn = 0;
+   mpfn = 0;
goto next;
}
-   pfn = page_to_pfn(page);
 
/*
 * By getting a reference on the page we pin it and that blocks
-- 
2.7.5



Re: [PATCH 0/4] x86/mce: protect nr_cpus from rebooting by broadcast mce

2019-08-07 Thread Pingfan Liu
On Wed, Aug 07, 2019 at 11:00:41AM +0800, Dave Young wrote:
> Add Tony and Xunlei in cc.
> On 08/05/19 at 04:58pm, Pingfan Liu wrote:
> > This series include two related groups:
> > [1-3/4]: protect nr_cpus from rebooting by broadcast mce
> > [4/4]: improve "kexec -l" robustness against broadcast mce
> > 
> > When I tried to fix [1], Thomas raised concern about the nr_cpus' 
> > vulnerability
> > to unexpected rebooting by broadcast mce. After analysis, I think only the
> > following first case suffers from the rebooting by broadcast mce. [1-3/4] 
> > aims
> > to fix that issue.
> 
> I did not understand and read the MCE details, but we previously had a
> MCE problem, Xunlei fixed in below commit:
> commit 5bc329503e8191c91c4c40836f062ef771d8ba83
> Author: Xunlei Pang 
> Date:   Mon Mar 13 10:50:19 2017 +0100
> 
> x86/mce: Handle broadcasted MCE gracefully with kexec
> 
> I wonder if this is same issue or not. Also the old discussion is in
> below thread:
> https://lore.kernel.org/patchwork/patch/753530/
> 
> Tony raised similar questions, but I'm not sure if it is still a problem
> or it has been fixed.
> 
Xunlei's patch is the precondition for stability in case 2: booting by
"kexec -p nr_cpus=".

For cases 1 and 3, extra effort is needed.

Thanks,
Pingfan
> > 
> > *** Back ground ***
> > 
> > On x86 it's required to have all logical CPUs set CR4.MCE=1. Otherwise, a
> > broadcast MCE observing CR4.MCE=0b on any core will shutdown the machine.
> > 
> > The option 'nosmt' has already complied with the above rule by Thomas's 
> > patch.
> > For detail, refer to 506a66f3748 (Revert "x86/apic: Ignore secondary 
> > threads if
> > nosmt=force")
> > 
> > But for nr_cpus option, the exposure to broadcast MCE is a little 
> > complicated,
> > and can be categorized into three cases.
> > 
> > -1. boot up by BIOS. Since no one set CR4.MCE=1, nr_cpus risks rebooting by
> > broadcast MCE.
> > 
> > -2. boot up by "kexec -p nr_cpus=".  Since the 1st kernel has all cpus'
> > CR4.MCE=1 set before kexec -p, nr_cpus is free of rebooting by broadcast 
> > MCE.
> > Furthermore, the crashed kernel's wreckage, including page table and text, 
> > is
> > not touched by capture kernel. Hence if MCE event happens on capped cpu,
> > do_machine_check->__mc_check_crashing_cpu() runs smoothly and returns
> > immediately, the capped cpu is still pinned on "halt".
> > 
> > -3. boot up by "kexec -l nr_cpus=". As "kexec -p", it is free of rebooting 
> > by
> > broadcast MCE. But the 1st kernel's wreckage is discarded and changed.  when
> > capped cpus execute do_machine_check(), they may crack the new kernel.  But
> > this is not related with broadcast MCE, and need an extra fix.
> > 
> > *** Solution ***
> > "nr_cpus" can not follow the same way as "nosmt".  Because nr_cpus limits 
> > the
> > allocation of percpu area and some other kthread memory, which is critical 
> > to
> > cpu hotplug framework.  Instead, developing a dedicated SIPI callback
> > make_capped_cpu_stable() for capped cpu, which does not lean on percpu area 
> > to
> > work.
> > 
> > [1]: https://lkml.org/lkml/2019/7/5/3
> > 
> > To: Gleixner 
> > To: Andy Lutomirski 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Dave Hansen 
> > Cc: Peter Zijlstra 
> > To: x...@kernel.org
> > Cc: Masami Hiramatsu 
> > Cc: Qian Cai 
> > Cc: Vlastimil Babka 
> > Cc: Daniel Drake 
> > Cc: Jacob Pan 
> > Cc: Michal Hocko 
> > Cc: Eric Biederman 
> > Cc: linux-kernel@vger.kernel.org
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: ke...@lists.infradead.org
> > 
> > ---
> > Pingfan Liu (4):
> >   x86/apic: correct the ENO in generic_processor_info()
> >   x86/apic: record capped cpu in generic_processor_info()
> >   x86/smp: send capped cpus to a stable state when smp_init()
> >   x86/smp: disallow MCE handler on rebooting AP
> > 
> >  arch/x86/include/asm/apic.h  |  1 +
> >  arch/x86/include/asm/smp.h   |  3 ++
> >  arch/x86/kernel/apic/apic.c  | 23 
> >  arch/x86/kernel/cpu/common.c |  7 
> >  arch/x86/kernel/smp.c|  8 +
> >  arch/x86/kernel/smpboot.c| 83 
> > 
> >  kernel/smp.c |  6 
> >  7 files changed, 124 insertions(+), 7 deletions(-)
> > 
> > -- 
> > 2.7.5
> > 
> 
> Thanks
> Dave


Re: [PATCH 1/3] mm/migrate: clean up useless code in migrate_vma_collect_pmd()

2019-08-06 Thread Pingfan Liu
On Tue, Aug 06, 2019 at 06:35:03AM -0700, Matthew Wilcox wrote:
> 
> This needs something beyond the subject line.  Maybe ...
> 
> After these assignments, we either restart the loop with a fresh variable,
> or we assign to the variable again without using the value we've assigned.
> 
> Reviewed-by: Matthew Wilcox (Oracle) 
> 
> > goto next;
> > }
> > -   pfn = page_to_pfn(page);
> 
> After you've done all this, as far as I can tell, the 'pfn' variable is
> only used in one arm of the conditions, so it can be moved there.
> 
> ie something like:
> 
> -   unsigned long mpfn, pfn;
> +   unsigned long mpfn;
> ...
> -   pfn = pte_pfn(pte);
> ...
> +   unsigned long pfn = pte_pfn(pte);
> +
> 
This makes the code better. Thank you for the suggestion. Will send v2 for
this patch.

Regards,
Pingfan


[PATCH 2/3] mm/migrate: see hole as invalid source page

2019-08-06 Thread Pingfan Liu
MIGRATE_PFN_MIGRATE marks a valid pfn, further more, suitable to migrate.
As for hole, there is no valid pfn, not to mention migration.

Before this patch, hole has already relied on the following code to be
filtered out. Hence it is more reasonable to see hole as invalid source
page.
migrate_vma_prepare()
{
struct page *page = migrate_pfn_to_page(migrate->src[i]);

if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
 \_ this condition
}

Signed-off-by: Pingfan Liu 
Cc: "Jérôme Glisse" 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: "Kirill A. Shutemov" 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: Matthew Wilcox 
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c2ec614..832483f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2136,10 +2136,9 @@ static int migrate_vma_collect_hole(unsigned long start,
unsigned long addr;
 
for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
-   migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
+   migrate->src[migrate->npages] = 0;
migrate->dst[migrate->npages] = 0;
migrate->npages++;
-   migrate->cpages++;
}
 
return 0;
@@ -2228,8 +2227,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pfn = pte_pfn(pte);
 
if (pte_none(pte)) {
-   mpfn = MIGRATE_PFN_MIGRATE;
-   migrate->cpages++;
+   mpfn = 0;
goto next;
}
 
-- 
2.7.5



[PATCH 1/3] mm/migrate: clean up useless code in migrate_vma_collect_pmd()

2019-08-06 Thread Pingfan Liu
Signed-off-by: Pingfan Liu 
Cc: "Jérôme Glisse" 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: "Kirill A. Shutemov" 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: Matthew Wilcox 
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8992741..c2ec614 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2230,7 +2230,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (pte_none(pte)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
-   pfn = 0;
goto next;
}
 
@@ -2255,7 +2254,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_zero_pfn(pfn)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
-   pfn = 0;
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
@@ -2265,10 +2263,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
/* FIXME support THP */
if (!page || !page->mapping || PageTransCompound(page)) {
-   mpfn = pfn = 0;
+   mpfn = 0;
goto next;
}
-   pfn = page_to_pfn(page);
 
/*
 * By getting a reference on the page we pin it and that blocks
-- 
2.7.5



[PATCH 3/3] mm/migrate: remove the duplicated code migrate_vma_collect_hole()

2019-08-06 Thread Pingfan Liu
After the previous patch which sees hole as invalid source,
migrate_vma_collect_hole() has the same code as migrate_vma_collect_skip().
Removing the duplicated code.

Signed-off-by: Pingfan Liu 
Cc: "Jérôme Glisse" 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Jan Kara 
Cc: "Kirill A. Shutemov" 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: Matthew Wilcox 
To: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/migrate.c | 22 +++---
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 832483f..95e038d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2128,22 +2128,6 @@ struct migrate_vma {
unsigned long   end;
 };
 
-static int migrate_vma_collect_hole(unsigned long start,
-   unsigned long end,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
-   migrate->src[migrate->npages] = 0;
-   migrate->dst[migrate->npages] = 0;
-   migrate->npages++;
-   }
-
-   return 0;
-}
-
 static int migrate_vma_collect_skip(unsigned long start,
unsigned long end,
struct mm_walk *walk)
@@ -2173,7 +2157,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 again:
if (pmd_none(*pmdp))
-   return migrate_vma_collect_hole(start, end, walk);
+   return migrate_vma_collect_skip(start, end, walk);
 
if (pmd_trans_huge(*pmdp)) {
struct page *page;
@@ -2206,7 +2190,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
return migrate_vma_collect_skip(start, end,
walk);
if (pmd_none(*pmdp))
-   return migrate_vma_collect_hole(start, end,
+   return migrate_vma_collect_skip(start, end,
walk);
}
}
@@ -2337,7 +2321,7 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 
mm_walk.pmd_entry = migrate_vma_collect_pmd;
mm_walk.pte_entry = NULL;
-   mm_walk.pte_hole = migrate_vma_collect_hole;
+   mm_walk.pte_hole = migrate_vma_collect_skip;
mm_walk.hugetlb_entry = NULL;
mm_walk.test_walk = NULL;
mm_walk.vma = migrate->vma;
-- 
2.7.5



Re: [PATCH] smp: force all cpu to boot once under maxcpus option

2019-07-24 Thread Pingfan Liu
On Mon, Jul 22, 2019 at 11:41:29AM +0200, Thomas Gleixner wrote:
> On Wed, 10 Jul 2019, Pingfan Liu wrote:
> >  
> > +static inline bool maxcpus_allowed(unsigned int cpu)
> > +{
> > +   /* maxcpus only takes effect during system bootup */
> > +   if (smp_boot_done)
> > +   return true;
> > +   if (num_online_cpus() < setup_max_cpus)
> > +   return true;
> > +   /*
> > +* maxcpus should allow cpu to set CR4.MCE asap, otherwise the set may
> > +* be deferred indefinitely.
> > +*/
> > +   if (!per_cpu(cpuhp_state, cpu).booted_once)
> > +   return true;
> 
> As this is a x86 only issue, you cannot inflict this magic on every
> architecture.
OK.
In my in-development patch which fixes the nr_cpus issue, I take the following
approach (I have been diverted to other things and have not finished testing
yet; I hope to turn back to it soon):
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..c009169 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -956,6 +956,87 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
return 0;
 }

+void __init bring_capped_cpu_steady(void)
+{
[...]
+}
+
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
  * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
diff --git a/kernel/smp.c b/kernel/smp.c
index d155374..b04961c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -560,6 +560,10 @@ void __init setup_nr_cpu_ids(void)
nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
 }

+void __weak bring_capped_cpu_steady(void)
+{
+}
+
 /* Called by boot processor to activate the rest. */
 void __init smp_init(void)
 {
@@ -579,6 +583,8 @@ void __init smp_init(void)
cpu_up(cpu);
}

+   /* force cpus capped by nr_cpus option into steady state */
+   bring_capped_cpu_steady();
num_nodes = num_online_nodes();


The initial motivation is to step around the percpu area required by the cpu
hotplug framework, but it also provides an abstraction across architectures.

What do you think about resolving maxcpus with a similar abstraction?
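
The __weak default plus per-arch override used above can be shown stand-alone.
The sketch below is plain user space with invented names; it only illustrates
how the generic code gets a no-op hook that an architecture may replace:

#include <stdio.h>

/*
 * "generic.c": provide a weak no-op hook.  An architecture overrides it by
 * defining a non-weak function with the same name in another file; the
 * strong definition wins at link time, so no #ifdef is needed.
 */
void __attribute__((weak)) bring_capped_cpu_steady(void)
{
        printf("generic: nothing to do for capped cpus\n");
}

int main(void)  /* stands in for smp_init() */
{
        /* ... online the first setup_max_cpus cpus ... */
        bring_capped_cpu_steady();      /* hook for the capped-out cpus */
        return 0;
}

Linking in a second file that defines a non-weak bring_capped_cpu_steady()
makes that version run instead, which is the effect the __weak stub in
kernel/smp.c above relies on.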

> 
> Aside of that this does not solve the problem at all because smp_init()
> still does:
> 
> for_each_present_cpu(cpu) {
> if (num_online_cpus() >= setup_max_cpus)
> break;
Yes, this logic should be removed so that maxcpus_allowed() can take effect.
But now it may take a quite different approach to resolve it.

Thanks for your kind review.

Regards,
  Pingfan
> if (!cpu_online(cpu))
> cpu_up(cpu);
> }
> 
> So the remaining CPUs are not onlined at all.
> 
> Thanks,
> 
>   tglx


[tip:x86/cleanups] x86/realmode: Remove trampoline_status

2019-07-22 Thread tip-bot for Pingfan Liu
Commit-ID:  69732102426b1c55a257386841fb80ec1f425d32
Gitweb: https://git.kernel.org/tip/69732102426b1c55a257386841fb80ec1f425d32
Author: Pingfan Liu 
AuthorDate: Tue, 16 Jul 2019 16:40:24 +0800
Committer:  Thomas Gleixner 
CommitDate: Mon, 22 Jul 2019 11:30:18 +0200

x86/realmode: Remove trampoline_status

There is no reader of trampoline_status, it's only written.

It turns out that after commit ce4b1b16502b ("x86/smpboot: Initialize
secondary CPU only if master CPU will wait for it"), trampoline_status is
not needed any more.

Signed-off-by: Pingfan Liu 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/1563266424-3472-1-git-send-email-kernelf...@gmail.com
---
 arch/x86/include/asm/realmode.h  | 1 -
 arch/x86/kernel/smpboot.c| 5 -
 arch/x86/realmode/rm/header.S| 1 -
 arch/x86/realmode/rm/trampoline_32.S | 3 ---
 arch/x86/realmode/rm/trampoline_64.S | 3 ---
 arch/x86/realmode/rm/trampoline_common.S | 4 
 6 files changed, 17 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index c53682303c9c..09ecc32f6524 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -20,7 +20,6 @@ struct real_mode_header {
u32 ro_end;
/* SMP trampoline */
u32 trampoline_start;
-   u32 trampoline_status;
u32 trampoline_header;
 #ifdef CONFIG_X86_64
u32 trampoline_pgd;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fdbd47ceb84d..497e9b7077c1 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1023,8 +1023,6 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
   int *cpu0_nmi_registered)
 {
-   volatile u32 *trampoline_status =
-   (volatile u32 *) __va(real_mode_header->trampoline_status);
/* start_ip had better be page-aligned! */
unsigned long start_ip = real_mode_header->trampoline_start;
 
@@ -1116,9 +1114,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
}
}
 
-   /* mark "stuck" area as not stuck */
-   *trampoline_status = 0;
-
if (x86_platform.legacy.warm_reset) {
/*
 * Cleanup possible dangling ends...
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 30b0d30d861a..6363761cc74c 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -19,7 +19,6 @@ GLOBAL(real_mode_header)
.long   pa_ro_end
/* SMP trampoline */
.long   pa_trampoline_start
-   .long   pa_trampoline_status
.long   pa_trampoline_header
 #ifdef CONFIG_X86_64
.long   pa_trampoline_pgd;
diff --git a/arch/x86/realmode/rm/trampoline_32.S b/arch/x86/realmode/rm/trampoline_32.S
index 2dd866c9e21e..1868b158480d 100644
--- a/arch/x86/realmode/rm/trampoline_32.S
+++ b/arch/x86/realmode/rm/trampoline_32.S
@@ -41,9 +41,6 @@ ENTRY(trampoline_start)
 
movltr_start, %eax  # where we need to go
 
-   movl$0xA5A5A5A5, trampoline_status
-   # write marker for master knows we're running
-
/*
 * GDT tables in non default location kernel can be beyond 16MB and
 * lgdt will not be able to load the address as in real mode default
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 24bb7598774e..aee2b45d83b8 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -49,9 +49,6 @@ ENTRY(trampoline_start)
mov %ax, %es
mov %ax, %ss
 
-   movl$0xA5A5A5A5, trampoline_status
-   # write marker for master knows we're running
-
# Setup stack
movl$rm_stack_end, %esp
 
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 7c706772ab59..8d8208dcca24 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -2,7 +2,3 @@
.section ".rodata","a"
.balign 16
 tr_idt: .fill 1, 6, 0
-
-   .bss
-   .balign 4
-GLOBAL(trampoline_status)  .space  4


[PATCH] x86/realmode: remove trampoline_status flag

2019-07-16 Thread Pingfan Liu
In current code, there is no reader of trampoline_status. It turns out that
after commit ce4b1b16502b ("x86/smpboot: Initialize secondary CPU only if
master CPU will wait for it"), trampoline_status is not needed any more.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Mukesh Ojha 
Cc: Matteo Croce 
Cc: Len Brown 
Cc: Pu Wen 
Cc: Nicolai Stange 
Cc: Hui Wang 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/realmode.h  | 1 -
 arch/x86/kernel/smpboot.c| 5 -
 arch/x86/realmode/rm/header.S| 1 -
 arch/x86/realmode/rm/trampoline_32.S | 3 ---
 arch/x86/realmode/rm/trampoline_64.S | 3 ---
 arch/x86/realmode/rm/trampoline_common.S | 4 
 6 files changed, 17 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index c536823..09ecc32 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -20,7 +20,6 @@ struct real_mode_header {
u32 ro_end;
/* SMP trampoline */
u32 trampoline_start;
-   u32 trampoline_status;
u32 trampoline_header;
 #ifdef CONFIG_X86_64
u32 trampoline_pgd;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..2ef10d9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -965,8 +965,6 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
   int *cpu0_nmi_registered)
 {
-   volatile u32 *trampoline_status =
-   (volatile u32 *) __va(real_mode_header->trampoline_status);
/* start_ip had better be page-aligned! */
unsigned long start_ip = real_mode_header->trampoline_start;
 
@@ -1058,9 +1056,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
}
}
 
-   /* mark "stuck" area as not stuck */
-   *trampoline_status = 0;
-
if (x86_platform.legacy.warm_reset) {
/*
 * Cleanup possible dangling ends...
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 30b0d30..6363761 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -19,7 +19,6 @@ GLOBAL(real_mode_header)
.long   pa_ro_end
/* SMP trampoline */
.long   pa_trampoline_start
-   .long   pa_trampoline_status
.long   pa_trampoline_header
 #ifdef CONFIG_X86_64
.long   pa_trampoline_pgd;
diff --git a/arch/x86/realmode/rm/trampoline_32.S b/arch/x86/realmode/rm/trampoline_32.S
index 2dd866c..1868b15 100644
--- a/arch/x86/realmode/rm/trampoline_32.S
+++ b/arch/x86/realmode/rm/trampoline_32.S
@@ -41,9 +41,6 @@ ENTRY(trampoline_start)
 
movltr_start, %eax  # where we need to go
 
-   movl$0xA5A5A5A5, trampoline_status
-   # write marker for master knows we're running
-
/*
 * GDT tables in non default location kernel can be beyond 16MB and
 * lgdt will not be able to load the address as in real mode default
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 24bb759..aee2b45 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -49,9 +49,6 @@ ENTRY(trampoline_start)
mov %ax, %es
mov %ax, %ss
 
-   movl$0xA5A5A5A5, trampoline_status
-   # write marker for master knows we're running
-
# Setup stack
movl$rm_stack_end, %esp
 
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 7c70677..8d8208d 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -2,7 +2,3 @@
.section ".rodata","a"
.balign 16
 tr_idt: .fill 1, 6, 0
-
-   .bss
-   .balign 4
-GLOBAL(trampoline_status)  .space  4
-- 
2.7.5



Re: [PATCH 2/2] x86/numa: instance all parsed numa node

2019-07-10 Thread Pingfan Liu
On Tue, Jul 9, 2019 at 9:34 PM Andy Lutomirski  wrote:
>
>
>
> > On Jul 9, 2019, at 1:24 AM, Pingfan Liu  wrote:
> >
> >> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner  wrote:
> >>
> >>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
> >>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner  
> >>>> wrote:
> >>>> It can and it does.
> >>>>
> >>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> >>>> shut the siblings down again after setting CR4.MCE. Actually that's in 
> >>>> fact
> >>>> a 'let's hope no MCE hits before that happened' approach, but that's all 
> >>>> we
> >>>> can do.
> >>>>
> >>>> If we don't do that then the MCE broadcast can hit a CPU which has some
> >>>> firmware initialized state. The result can be a full system lockup, 
> >>>> triple
> >>>> fault etc.
> >>>>
> >>>> So when the MCE hits a CPU which is still in the crashed kernel lala 
> >>>> state,
> >>>> then all hell breaks lose.
> >>> Thank you for the comprehensive explain. With your guide, now, I have
> >>> a full understanding of the issue.
> >>>
> >>> But when I tried to add something to enable CR4.MCE in
> >>> crash_nmi_callback(), I realized that it is undo-able in some case (if
> >>> crashed, we will not ask an offline smt cpu to online), also it is
> >>> needless. "kexec -l/-p" takes the advantage of the cpu state in the
> >>> first kernel, where all logical cpu has CR4.MCE=1.
> >>>
> >>> So kexec is exempt from this bug if the first kernel already do it.
> >>
> >> No. If the MCE broadcast is handled by a CPU which is stuck in the old
> >> kernel stop loop, then it will execute on the old kernel and eventually run
> >> into the memory corruption which crashed the old one.
> >>
> > Yes, you are right. Stuck cpu may execute the old do_machine_check()
> > code. But I just found out that we have
> > do_machine_check()->__mc_check_crashing_cpu() to against this case.
> >
> > And I think the MCE issue with nr_cpus is not closely related with
> > this series, can
> > be a separated issue. I had question whether Andy will take it, if
> > not, I am glad to do it.
> >
> >
>
> Go for it. I’m not familiar enough with the SMP boot stuff that I would be 
> able to do it any faster than you. I’ll gladly help review it.
I have sent out a patch to fix maxcpus: "[PATCH] smp: force all cpu to
boot once under maxcpus option".
But for the nr_cpus case, I think things will not be so easy due to the
percpu area, and it may take a quite different approach.

Thanks,
  Pingfan


[PATCH] smp: force all cpu to boot once under maxcpus option

2019-07-10 Thread Pingfan Liu
On x86 it's required to boot all logical CPUs at least once so that the
init code can get a chance to set CR4.MCE on each CPU. Otherwise, a
broadcast MCE observing CR4.MCE=0b on any core will shut down the
machine.

The option 'nosmt' has already complied with the above rule. In the case of
maxcpus, the initialization of the capped-out cpus may be deferred
indefinitely until a user brings them up. This exposes the machine to the
risk of a sudden shutdown indefinitely.

Minimize the risk window by initializing all cpus at boot time.
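
For background, the failure mode can be modelled in user space as below (the
bit position matches the architectural CR4.MCE bit; everything else, including
the function names, is invented). As far as I know the real bit is set per cpu
from the machine check init code run during cpu bringup, which is why a cpu
that is never brought up never gets CR4.MCE set:

#include <stdio.h>

#define X86_CR4_MCE     (1UL << 6)      /* CR4.MCE is bit 6 */

static unsigned long fake_cr4;          /* stands in for one CPU's CR4 */

static void cpu_bringup_init(void)
{
        fake_cr4 |= X86_CR4_MCE;        /* done once when the cpu is booted */
}

static void broadcast_mce(void)
{
        if (!(fake_cr4 & X86_CR4_MCE))
                printf("CR4.MCE=0 -> machine shutdown\n");
        else
                printf("CR4.MCE=1 -> machine check handler runs\n");
}

int main(void)
{
        broadcast_mce();        /* before bringup: shutdown */
        cpu_bringup_init();
        broadcast_mce();        /* after bringup: handled */
        return 0;
}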

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: "Peter Zijlstra (Intel)" 
Cc: Rik van Riel 
Cc: Josh Poimboeuf 
Cc: Ingo Molnar 
Cc: Jiri Kosina 
Cc: Mukesh Ojha 
Cc: Greg Kroah-Hartman 
Cc: Andy Lutomirski 
Cc: linux-kernel@vger.kernel.org
---
 include/linux/smp.h |  1 +
 kernel/cpu.c| 20 ++--
 kernel/smp.c|  4 
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index a56f08f..9d2c692 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -130,6 +130,7 @@ extern void __init setup_nr_cpu_ids(void);
 extern void __init smp_init(void);
 
 extern int __boot_cpu_id;
+extern bool smp_boot_done;
 
 static inline int get_boot_cpu_id(void)
 {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index ef1c565..ab19dc8 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -439,6 +439,21 @@ static inline bool cpu_smt_allowed(unsigned int cpu)
 static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
 #endif
 
+static inline bool maxcpus_allowed(unsigned int cpu)
+{
+   /* maxcpus only takes effect during system bootup */
+   if (smp_boot_done)
+   return true;
+   if (num_online_cpus() < setup_max_cpus)
+   return true;
+   /*
+* maxcpus should allow cpu to set CR4.MCE asap, otherwise the set may
+* be deferred indefinitely.
+*/
+   if (!per_cpu(cpuhp_state, cpu).booted_once)
+   return true;
+   return false;
+}
+
 static inline enum cpuhp_state
 cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target)
 {
@@ -525,8 +540,9 @@ static int bringup_wait_for_ap(unsigned int cpu)
 * CPU marked itself as booted_once in cpu_notify_starting() so the
 * cpu_smt_allowed() check will now return false if this is not the
 * primary sibling.
+* In case of maxcpus, the capped out cpus comply with the same rule.
 */
-   if (!cpu_smt_allowed(cpu))
+   if (!cpu_smt_allowed(cpu) || !maxcpus_allowed(cpu))
return -ECANCELED;
 
if (st->target <= CPUHP_AP_ONLINE_IDLE)
@@ -1177,7 +1193,7 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
err = -EBUSY;
goto out;
}
-   if (!cpu_smt_allowed(cpu)) {
+   if (!cpu_smt_allowed(cpu) || !maxcpus_allowed(cpu)) {
err = -EPERM;
goto out;
}
diff --git a/kernel/smp.c b/kernel/smp.c
index d155374..a5f82d53 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -560,6 +560,9 @@ void __init setup_nr_cpu_ids(void)
nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
 }
 
+bool smp_boot_done __read_mostly;
+EXPORT_SYMBOL(smp_boot_done);
+
 /* Called by boot processor to activate the rest. */
 void __init smp_init(void)
 {
@@ -587,6 +590,7 @@ void __init smp_init(void)
 
/* Any cleanup work */
smp_cpus_done(setup_max_cpus);
+   smp_boot_done = true;
 }
 
 /*
-- 
2.7.5



Re: [PATCH 2/2] x86/numa: instance all parsed numa node

2019-07-09 Thread Pingfan Liu
On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner  wrote:
>
> On Tue, 9 Jul 2019, Pingfan Liu wrote:
> > On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner  wrote:
> > > It can and it does.
> > >
> > > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > > shut the siblings down again after setting CR4.MCE. Actually that's in 
> > > fact
> > > a 'let's hope no MCE hits before that happened' approach, but that's all 
> > > we
> > > can do.
> > >
> > > If we don't do that then the MCE broadcast can hit a CPU which has some
> > > firmware initialized state. The result can be a full system lockup, triple
> > > fault etc.
> > >
> > > So when the MCE hits a CPU which is still in the crashed kernel lala 
> > > state,
> > > then all hell breaks lose.
> > Thank you for the comprehensive explain. With your guide, now, I have
> > a full understanding of the issue.
> >
> > But when I tried to add something to enable CR4.MCE in
> > crash_nmi_callback(), I realized that it is undo-able in some case (if
> > crashed, we will not ask an offline smt cpu to online), also it is
> > needless. "kexec -l/-p" takes the advantage of the cpu state in the
> > first kernel, where all logical cpu has CR4.MCE=1.
> >
> > So kexec is exempt from this bug if the first kernel already do it.
>
> No. If the MCE broadcast is handled by a CPU which is stuck in the old
> kernel stop loop, then it will execute on the old kernel and eventually run
> into the memory corruption which crashed the old one.
>
Yes, you are right. A stuck cpu may execute the old do_machine_check()
code. But I just found out that we have
do_machine_check()->__mc_check_crashing_cpu() to guard against this case.
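
The guard can be pictured as an early-exit check at the top of the machine
check handler. The sketch below is only a toy of that control flow in user
space; the names, conditions and messages are invented and do not reflect the
real x86 code:

#include <stdio.h>

static int kernel_has_crashed = 1;      /* set once the crash/kexec path starts */

/* toy stand-in for __mc_check_crashing_cpu(): parked cpus bail out early */
static int mc_check_crashing_cpu(int cpu_is_parked)
{
        return kernel_has_crashed && cpu_is_parked;
}

static void do_machine_check(int cpu_is_parked)
{
        if (mc_check_crashing_cpu(cpu_is_parked)) {
                /* acknowledge the event without touching old kernel state */
                printf("parked cpu: MCE acknowledged, full handler skipped\n");
                return;
        }
        printf("normal cpu: full machine check handling\n");
}

int main(void)
{
        do_machine_check(1);    /* a cpu still stuck in the old kernel */
        do_machine_check(0);    /* the crashing/boot cpu */
        return 0;
}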

And I think the MCE issue with nr_cpus is not closely related to this
series; it can be a separate issue. I asked whether Andy will take it; if
not, I am glad to do it.

Thanks and regards,
  Pingfan


Re: [PATCH 2/2] x86/numa: instance all parsed numa node

2019-07-08 Thread Pingfan Liu
On Tue, Jul 9, 2019 at 1:53 AM Andy Lutomirski  wrote:
>
>
>
> > On Jul 8, 2019, at 3:35 AM, Thomas Gleixner  wrote:
> >
> >> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> >>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner  wrote:
> >>>
> >>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
> >>>>
> >>>> I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus 
> >>>> option
> >>>> is used to speed up kdump process, so it is not a rare case.
> >>>
> >>> But fundamentally wrong, really.
> >>>
> >>> The rest of the CPUs are in a half baken state and any broadcast event,
> >>> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
> >> Very appreciate if you can pay more word on it? I tried to figure out
> >> your point, but fail.
> >>
> >> For "a half baked state", I think you concern about LAPIC state, and I
> >> expand this point like the following:
> >
> > It's not only the APIC state. It's the state of the CPUs in general.
> >
> >> For IPI: when capture kernel BSP is up, the rest cpus are still loop
> >> inside crash_nmi_callback(), so there is no way to eject new IPI from
> >> these cpu. Also we disable_local_APIC(), which effectively prevent the
> >> LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
> >> occur in crash case.
> >
> > Fair enough for the IPI case.
> >
> >> For MCE, I am not sure whether it can broadcast or not between cpus,
> >> but as my understanding, it can not. Then is it a problem?
> >
> > It can and it does.
> >
> > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > shut the siblings down again after setting CR4.MCE. Actually that's in fact
> > a 'let's hope no MCE hits before that happened' approach, but that's all we
> > can do.
> >
> > If we don't do that then the MCE broadcast can hit a CPU which has some
> > firmware initialized state. The result can be a full system lockup, triple
> > fault etc.
> >
> > So when the MCE hits a CPU which is still in the crashed kernel lala state,
> > then all hell breaks lose.
> >
> >> From another view point, is there any difference between nr_cpus=1 and
> >> nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
> >> it does for nr_cpus=1.
> >
> > Anything less than the actual number of present CPUs is problematic except
> > you use the 'let's hope nothing happens' approach. We could add an option
> > to stop the bringup at the early online state similar to what we do for
> > 'nosmt'.
> >
> >
>
> How about we change nr_cpus to do that instead so we never have to have this 
> conversation again?
Are you interested in implementing this?

Thanks,
  Pingfan


Re: [PATCH 2/2] x86/numa: instance all parsed numa node

2019-07-08 Thread Pingfan Liu
On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner  wrote:
>
> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> > On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner  wrote:
> > >
> > > On Fri, 5 Jul 2019, Pingfan Liu wrote:
> > >
> > > > I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus 
> > > > option
> > > > is used to speed up kdump process, so it is not a rare case.
> > >
> > > But fundamentally wrong, really.
> > >
> > > The rest of the CPUs are in a half baken state and any broadcast event,
> > > e.g. MCE or a stray IPI, will result in a undiagnosable crash.
> > Very appreciate if you can pay more word on it? I tried to figure out
> > your point, but fail.
> >
> > For "a half baked state", I think you concern about LAPIC state, and I
> > expand this point like the following:
>
> It's not only the APIC state. It's the state of the CPUs in general.
As for the other state: "kexec -l" acts as a kind of boot loader, and the
boot CPU complies with the normal kernel boot-up provisions. The remaining
APs are parked in a loop until they receive the #INIT IPI, and from then on
everything is the same as an ordinary SMP boot-up.

>
> > For IPI: when capture kernel BSP is up, the rest cpus are still loop
> > inside crash_nmi_callback(), so there is no way to eject new IPI from
> > these cpu. Also we disable_local_APIC(), which effectively prevent the
> > LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
> > occur in crash case.
>
> Fair enough for the IPI case.
>
> > For MCE, I am not sure whether it can broadcast or not between cpus,
> > but as my understanding, it can not. Then is it a problem?
>
> It can and it does.
>
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
>
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
>
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks lose.
Thank you for the comprehensive explanation. With your guidance, I now have
a full understanding of the issue.

But when I tried to add something to enable CR4.MCE in
crash_nmi_callback(), I realized that it cannot be done in some cases
(after a crash, we will not bring an offline SMT cpu back online), and it
is also needless: "kexec -l/-p" takes advantage of the CPU state left by
the first kernel, where every logical CPU already has CR4.MCE=1.
So kexec is exempt from this bug if the first kernel has already done this.
>
> > From another view point, is there any difference between nr_cpus=1 and
> > nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
> > it does for nr_cpus=1.
>
> Anything less than the actual number of present CPUs is problematic except
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.
Yes, we should do something about the nr_cpus parameter for the first kernel.

Thanks,
  Pingfan


Re: [PATCH 2/2] x86/numa: instance all parsed numa node

2019-07-08 Thread Pingfan Liu
On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner  wrote:
>
> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>
> > I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus 
> > option
> > is used to speed up kdump process, so it is not a rare case.
>
> But fundamentally wrong, really.
>
> The rest of the CPUs are in a half baken state and any broadcast event,
> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
I would very much appreciate it if you could elaborate on this. I tried to
figure out your point, but failed.

For "a half baked state", I think you concern about LAPIC state, and I
expand this point like the following:

For IPI: when the capture kernel's BSP is up, the rest of the CPUs are
still looping inside crash_nmi_callback(), so there is no way for a new IPI
to be issued from those CPUs. Also, we call disable_local_APIC(), which
effectively prevents the LAPIC from responding to IPIs, except for
NMI/INIT/SIPI, which will not occur in the crash case.
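
To make the "parked inside crash_nmi_callback()" point concrete, here is a
rough sketch of that NMI handler (paraphrased from arch/x86/kernel/reboot.c;
the shootdown callback details are simplified and may not match the source
exactly):

	static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
	{
		int cpu = raw_smp_processor_id();

		if (cpu == crashing_cpu)
			return NMI_HANDLED;	/* nothing to do on the crashing CPU */
		local_irq_disable();

		/* for kdump this saves the registers and disables the local APIC */
		shootdown_callback(cpu, regs);

		atomic_dec(&waiting_for_crash_ipi);
		halt();
		for (;;)
			cpu_relax();		/* parked until the capture kernel takes over */

		return NMI_HANDLED;
	}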

For MCE, I am not sure whether it can be broadcast between CPUs or not, but
as far as I understand, it cannot. So is it a problem?

From another viewpoint, is there any difference between nr_cpus=1 and
nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
nr_cpus>1, it does so for nr_cpus=1 as well.

Thanks,
  Pingfan


[PATCH 2/2] x86/numa: instance all parsed numa node

2019-07-04 Thread Pingfan Liu
 by "git grep for_each_online_node | grep sched" or the
discussion in tail of [3].

Since Michal has no time to continue on this issue, I pick it up again. This
patch drops the change of the "node online" definition in [2], i.e. a node is
still considered online if it has either cpu or memory, and keeps the rest of
the main idea in [2] of initializing all parsed nodes on x86. Other archs will
need extra dedicated effort.

[1]: https://patchwork.kernel.org/patch/10738733/
[2]: https://lkml.org/lkml/2019/2/13/253
[3]: https://lore.kernel.org/lkml/20190528182011.gg1...@dhcp22.suse.cz/T/

Signed-off-by: Pingfan Liu 
Cc: Michal Hocko 
Cc: Dave Hansen 
Cc: Mike Rapoport 
Cc: Tony Luck 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Mel Gorman 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Stephen Rothwell 
Cc: Qian Cai 
Cc: Barret Rhoden 
Cc: Bjorn Helgaas 
Cc: David Rientjes 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 17 -
 mm/page_alloc.c| 11 ---
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b48d507..5f5b558 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -732,6 +732,15 @@ static void __init init_memory_less_node(int nid)
 */
 }
 
+static void __init init_parsed_rest_node(void)
+{
+   int node;
+
+   for_each_node_mask(node, node_possible_map)
+   if (!node_online(node))
+   init_memory_less_node(node);
+}
+
 /*
  * Setup early cpu_to_node.
  *
@@ -752,6 +761,7 @@ void __init init_cpu_to_node(void)
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
 
BUG_ON(cpu_to_apicid == NULL);
+   init_parsed_rest_node();
 
for_each_possible_cpu(cpu) {
int node = numa_cpu_node(cpu);
@@ -759,11 +769,8 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;
 
-   if (!node_online(node)) {
-   init_memory_less_node(node);
-   node_set_online(nid);
-   }
-
+   if (!node_online(node))
+   node_set_online(node);
numa_set_node(cpu, node);
}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8a..5d8db00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,10 +5662,15 @@ static void __build_all_zonelists(void *data)
if (self && !node_online(self->node_id)) {
build_zonelists(self);
} else {
-   for_each_online_node(nid) {
+   /* In rare case, node_zonelist() hits offline node */
+   for_each_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
-
-   build_zonelists(pgdat);
+   /*
+* This condition can be removed on archs, with all
+* possible node instanced.
+*/
+   if (pgdat)
+   build_zonelists(pgdat);
}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-- 
2.7.5



[PATCH 1/2] x86/numa: carve node online semantics out of alloc_node_data()

2019-07-04 Thread Pingfan Liu
Node online means either memory online or cpu online. But there is a
requirement to instantiate a pglist_data which has neither cpu nor memory
online (refer to [2/2]).

So carve the online semantics out of alloc_node_data(), and call
node_set_online() where either memory or cpu comes online.

Signed-off-by: Pingfan Liu 
Cc: Michal Hocko 
Cc: Dave Hansen 
Cc: Mike Rapoport 
Cc: Tony Luck 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Mel Gorman 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Stephen Rothwell 
Cc: Qian Cai 
Cc: Barret Rhoden 
Cc: Bjorn Helgaas 
Cc: David Rientjes 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..b48d507 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,8 +213,6 @@ static void __init alloc_node_data(int nid)
 
node_data[nid] = nd;
memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-   node_set_online(nid);
 }
 
 /**
@@ -589,6 +587,7 @@ static int __init numa_register_memblks(struct numa_meminfo 
*mi)
continue;
 
alloc_node_data(nid);
+   node_set_online(nid);
}
 
/* Dump memblock with node info and return. */
@@ -760,8 +759,10 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;
 
-   if (!node_online(node))
+   if (!node_online(node)) {
init_memory_less_node(node);
+   node_set_online(nid);
+   }
 
numa_set_node(cpu, node);
}
-- 
2.7.5



[PATCH] mm/page_isolate: change the prototype of undo_isolate_page_range()

2019-07-02 Thread Pingfan Liu
undo_isolate_page_range() never fails, so there is no need for a return value.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Qian Cai 
Cc: Anshuman Khandual 
Cc: linux-kernel@vger.kernel.org
---
 include/linux/page-isolation.h | 2 +-
 mm/page_isolation.c| 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 280ae96..1099c2f 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -50,7 +50,7 @@ start_isolate_page_range(unsigned long start_pfn, unsigned 
long end_pfn,
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
  * target range is [start_pfn, end_pfn)
  */
-int
+void
 undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned migratetype);
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index e3638a5..89c19c0 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -230,7 +230,7 @@ int start_isolate_page_range(unsigned long start_pfn, 
unsigned long end_pfn,
 /*
  * Make isolated pages available again.
  */
-int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
unsigned migratetype)
 {
unsigned long pfn;
@@ -247,7 +247,6 @@ int undo_isolate_page_range(unsigned long start_pfn, 
unsigned long end_pfn,
continue;
unset_migratetype_isolate(page, migratetype);
}
-   return 0;
 }
 /*
  * Test all pages in the range is free(means isolated) or not.
-- 
2.7.5



Re: [PATCHv5] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-27 Thread Pingfan Liu
On Fri, Jun 28, 2019 at 7:25 AM Andrew Morton  wrote:
>
> On Thu, 27 Jun 2019 13:15:45 +0800 Pingfan Liu  wrote:
>
> > Both hugetlb and thp locate on the same migration type of pageblock, since
> > they are allocated from a free_list[]. Based on this fact, it is enough to
> > check on a single subpage to decide the migration type of the whole huge
> > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86,
> > similar on other archs.
> >
> > Furthermore, when executing isolate_huge_page(), it avoid taking global
> > hugetlb_lock many times, and meanless remove/add to the local link list
> > cma_page_list.
> >
>
> Thanks, looks good to me.  Have any timing measurements been taken?
Not yet. It is a little hard to force huge pages to be allocated from the
CMA area. Should I provide the measurements?
>
> > ...
> >
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1336,25 +1336,30 @@ static long check_and_migrate_cma_pages(struct 
> > task_struct *tsk,
> >   struct vm_area_struct **vmas,
> >   unsigned int gup_flags)
> >  {
> > - long i;
> > + long i, step;
>
> I'll make these variables unsigned long - to match nr_pages and because
> we have no need for them to be negative.
OK, will fix it.

Thanks,
  Pingfan
>
> > ...


[PATCHv5] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-26 Thread Pingfan Liu
Both hugetlb and thp pages lie within pageblocks of a single migration type,
since they are allocated from a free_list[]. Based on this fact, it is
enough to check a single subpage to decide the migration type of the whole
huge page. This saves (2M/4K - 1) loop iterations for a pmd-sized huge page
on x86, and similarly on other archs.

Furthermore, when executing isolate_huge_page(), it avoids taking the
global hugetlb_lock many times, and avoids meaningless remove/add
operations on the local cma_page_list.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Ira Weiny 
Cc: Mike Rapoport 
Cc: "Kirill A. Shutemov" 
Cc: Thomas Gleixner 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Mike Kravetz 
Cc: Linux-kernel@vger.kernel.org
---
v3 -> v4: fix C language precedence issue
v4 -> v5: drop the check PageCompound() and improve notes
 mm/gup.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddde097..1deaad2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1336,25 +1336,30 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
-   long i;
+   long i, step;
bool drain_allow = true;
bool migrate_allow = true;
LIST_HEAD(cma_page_list);
 
 check_again:
-   for (i = 0; i < nr_pages; i++) {
+   for (i = 0; i < nr_pages;) {
+
+   struct page *head = compound_head(pages[i]);
+
+   /*
+* gup may start from a tail page. Advance step by the left
+* part.
+*/
+   step = (1 << compound_order(head)) - (pages[i] - head);
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
 * of the CMA zone if possible.
 */
-   if (is_migrate_cma_page(pages[i])) {
-
-   struct page *head = compound_head(pages[i]);
-
-   if (PageHuge(head)) {
+   if (is_migrate_cma_page(head)) {
+   if (PageHuge(head))
isolate_huge_page(head, &cma_page_list);
-   } else {
+   else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
drain_allow = false;
@@ -1369,6 +1374,8 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
}
}
}
+
+   i += step;
}
 
if (!list_empty(&cma_page_list)) {
-- 
2.7.5



Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-06-26 Thread Pingfan Liu
On Wed, Jun 26, 2019 at 9:57 PM Michal Hocko  wrote:
>
> On Mon 24-06-19 16:42:20, Pingfan Liu wrote:
> > Hi Michal,
> >
> > What about dropping the change of the online definition of your patch,
> > just do the following?
>
> I am sorry but I am unlikely to find some more time to look into this. I
> am willing to help reviewing but I will not find enough time to focus on
> this to fix up the patch. Are you willing to work on this and finish the
> patch? It is a very tricky area with side effects really hard to see in
> advance but going with a robust fix is definitely worth the effort.
Yeah, the bug looks trivial but is hard to fix, and takes a lot of time. It
is hard to meet your original design target in the current situation. I
will have a try limited to this bug.

Thanks,
  Pingfan
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs


Re: [PATCHv4] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-26 Thread Pingfan Liu
On Thu, Jun 27, 2019 at 7:15 AM Andrew Morton  wrote:
>
> On Wed, 26 Jun 2019 21:10:00 +0800 Pingfan Liu  wrote:
>
> > Both hugetlb and thp locate on the same migration type of pageblock, since
> > they are allocated from a free_list[]. Based on this fact, it is enough to
> > check on a single subpage to decide the migration type of the whole huge
> > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86,
> > similar on other archs.
> >
> > Furthermore, when executing isolate_huge_page(), it avoid taking global
> > hugetlb_lock many times, and meanless remove/add to the local link list
> > cma_page_list.
> >
> > ...
> >
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
> > task_struct *tsk,
> >   LIST_HEAD(cma_page_list);
> >
> >  check_again:
> > - for (i = 0; i < nr_pages; i++) {
> > + for (i = 0; i < nr_pages;) {
> > +
> > + struct page *head = compound_head(pages[i]);
> > + long step = 1;
> > +
> > + if (PageCompound(head))
>
> I suspect this would work correctly if the PageCompound test was simply
> removed.  Not that I'm really suggesting that it be removed - dunno.
Yes, you are right. compound_order() can safely run on a normal page,
which means we can drop the PageCompound() check.

>
> > + step = (1 << compound_order(head)) - (pages[i] - 
> > head);
>
> I don't understand this statement.  Why does the position of this page
> in the pages[] array affect anything?  There's an assumption about the
> contents of the skipped pages, I assume.
Because gup may start from a tail page.
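
To make the arithmetic concrete, here is a small worked example (added for
illustration; the names are the ones used in the patch):

	/* Suppose pages[i] points at the 3rd subpage of a 2MB THP on x86:
	 * compound_order(head) == 9, so the huge page spans 1 << 9 == 512
	 * subpages, and pages[i] - head == 3, hence
	 *
	 *	step = (1 << compound_order(head)) - (pages[i] - head)
	 *	     = 512 - 3
	 *	     = 509
	 *
	 * so one check covers the remaining 509 subpages; starting from the
	 * head page it skips 511 per-subpage iterations.
	 */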
>
> Could we please get a comment in here whcih fully explains the logic
> and any assumptions?
Sure, I will.

Thanks,
  Pingfan
>


[PATCHv4] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-26 Thread Pingfan Liu
Both hugetlb and thp pages lie within pageblocks of a single migration type,
since they are allocated from a free_list[]. Based on this fact, it is
enough to check a single subpage to decide the migration type of the whole
huge page. This saves (2M/4K - 1) loop iterations for a pmd-sized huge page
on x86, and similarly on other archs.

Furthermore, when executing isolate_huge_page(), it avoids taking the
global hugetlb_lock many times, and avoids meaningless remove/add
operations on the local cma_page_list.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Ira Weiny 
Cc: Mike Rapoport 
Cc: "Kirill A. Shutemov" 
Cc: Thomas Gleixner 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Mike Kravetz 
Cc: Linux-kernel@vger.kernel.org
---
v3 -> v4: fix C language precedence issue

 mm/gup.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddde097..ffca55b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
LIST_HEAD(cma_page_list);
 
 check_again:
-   for (i = 0; i < nr_pages; i++) {
+   for (i = 0; i < nr_pages;) {
+
+   struct page *head = compound_head(pages[i]);
+   long step = 1;
+
+   if (PageCompound(head))
+   step = (1 << compound_order(head)) - (pages[i] - head);
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
 * of the CMA zone if possible.
 */
-   if (is_migrate_cma_page(pages[i])) {
-
-   struct page *head = compound_head(pages[i]);
-
-   if (PageHuge(head)) {
+   if (is_migrate_cma_page(head)) {
+   if (PageHuge(head))
isolate_huge_page(head, &cma_page_list);
-   } else {
+   else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
drain_allow = false;
@@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
}
}
}
+
+   i += step;
}
 
if (!list_empty(&cma_page_list)) {
-- 
2.7.5



Re: [PATCHv3] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-26 Thread Pingfan Liu
On Wed, Jun 26, 2019 at 1:51 AM Ira Weiny  wrote:
>
> On Tue, Jun 25, 2019 at 10:13:19PM +0800, Pingfan Liu wrote:
> > Both hugetlb and thp locate on the same migration type of pageblock, since
> > they are allocated from a free_list[]. Based on this fact, it is enough to
> > check on a single subpage to decide the migration type of the whole huge
> > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86,
> > similar on other archs.
> >
> > Furthermore, when executing isolate_huge_page(), it avoid taking global
> > hugetlb_lock many times, and meanless remove/add to the local link list
> > cma_page_list.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Andrew Morton 
> > Cc: Ira Weiny 
> > Cc: Mike Rapoport 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Thomas Gleixner 
> > Cc: John Hubbard 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: Christoph Hellwig 
> > Cc: Keith Busch 
> > Cc: Mike Kravetz 
> > Cc: Linux-kernel@vger.kernel.org
> > ---
> > v2 -> v3: fix page order to size convertion
> >
> >  mm/gup.c | 19 ---
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index ddde097..03cc1f4 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
> > task_struct *tsk,
> >   LIST_HEAD(cma_page_list);
> >
> >  check_again:
> > - for (i = 0; i < nr_pages; i++) {
> > + for (i = 0; i < nr_pages;) {
> > +
> > + struct page *head = compound_head(pages[i]);
> > + long step = 1;
> > +
> > + if (PageCompound(head))
> > + step = 1 << compound_order(head) - (pages[i] - head);
>
> Check your precedence here.
>
> step = (1 << compound_order(head)) - (pages[i] - head);
OK.

Thanks,
Pingfan


[PATCHv3] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-25 Thread Pingfan Liu
Both hugetlb and thp pages lie within pageblocks of a single migration type,
since they are allocated from a free_list[]. Based on this fact, it is
enough to check a single subpage to decide the migration type of the whole
huge page. This saves (2M/4K - 1) loop iterations for a pmd-sized huge page
on x86, and similarly on other archs.

Furthermore, when executing isolate_huge_page(), it avoids taking the
global hugetlb_lock many times, and avoids meaningless remove/add
operations on the local cma_page_list.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Ira Weiny 
Cc: Mike Rapoport 
Cc: "Kirill A. Shutemov" 
Cc: Thomas Gleixner 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Mike Kravetz 
Cc: Linux-kernel@vger.kernel.org
---
v2 -> v3: fix page order to size conversion

 mm/gup.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddde097..03cc1f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
LIST_HEAD(cma_page_list);
 
 check_again:
-   for (i = 0; i < nr_pages; i++) {
+   for (i = 0; i < nr_pages;) {
+
+   struct page *head = compound_head(pages[i]);
+   long step = 1;
+
+   if (PageCompound(head))
+   step = 1 << compound_order(head) - (pages[i] - head);
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
 * of the CMA zone if possible.
 */
-   if (is_migrate_cma_page(pages[i])) {
-
-   struct page *head = compound_head(pages[i]);
-
-   if (PageHuge(head)) {
+   if (is_migrate_cma_page(head)) {
+   if (PageHuge(head))
isolate_huge_page(head, &cma_page_list);
-   } else {
+   else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
drain_allow = false;
@@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
}
}
}
+
+   i += step;
}
 
if (!list_empty(&cma_page_list)) {
-- 
2.7.5



Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-06-24 Thread Pingfan Liu
Hi Michal,

What about dropping the change of the online definition in your patch and
just doing the following?
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..9c087c3 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -749,13 +749,12 @@ static void __init init_memory_less_node(int nid)
  */
 void __init init_cpu_to_node(void)
 {
-   int cpu;
+   int cpu, node;
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);

BUG_ON(cpu_to_apicid == NULL);

-   for_each_possible_cpu(cpu) {
-   int node = numa_cpu_node(cpu);
+   for_each_node_mask(node, numa_nodes_parsed) {

if (node == NUMA_NO_NODE)
continue;
@@ -765,6 +764,10 @@ void __init init_cpu_to_node(void)

numa_set_node(cpu, node);
}
+   for_each_possible_cpu(cpu) {
+   int node = numa_cpu_node(cpu);
+   numa_set_node(cpu, node);
+   }
 }

Thanks,
  Pingfan

On Fri, Jun 21, 2019 at 9:55 PM Michal Hocko  wrote:
>
> On Fri 21-06-19 09:17:58, Qian Cai wrote:
> > Sigh...
> >
> > I don't see any benefit to keep the broken commit,
> >
> > "x86, numa: always initialize all possible nodes"
> >
> > for so long in linux-next that just prevent x86 NUMA machines with any 
> > memory-
> > less node from booting.
> >
> > Andrew, maybe it is time to drop this patch until Michal found some time to 
> > fix
> > it properly.
>
> Yes, please drop the patch for now, Andrew. I thought I could get to
> this but time is just scarce.
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] mm/hugetlb: allow gigantic page allocation to migrate away smaller huge page

2019-06-24 Thread Pingfan Liu
On Mon, Jun 24, 2019 at 1:16 PM Anshuman Khandual
 wrote:
>
>
>
> On 06/24/2019 09:51 AM, Pingfan Liu wrote:
> > The current pfn_range_valid_gigantic() rejects the pud huge page allocation
> > if there is a pmd huge page inside the candidate range.
> >
> > But pud huge resource is more rare, which should align on 1GB on x86. It is
> > worth to allow migrating away pmd huge page to make room for a pud huge
> > page.
> >
> > The same logic is applied to pgd and pud huge pages.
>
> The huge page in the range can either be a THP or HugeTLB and migrating them 
> has
> different costs and chances of success. THP migration will involve splitting 
> if
> THP migration is not enabled and all related TLB related costs. Are you sure
> that a PUD HugeTLB allocation really should go through these ? Is there any
In the current code, a PUD hugetlb allocation already drives PMD thp out.
This patch just wants to let a PUD hugetlb allocation also survive PMD
hugetlb pages.

> guarantee that after migration of multiple PMD sized THP/HugeTLB pages on the
> given range, the allocation request for PUD will succeed ?
The migration is complicated, but as I understand it, if there is no gup
pin in the range and there is enough memory (including swap), then it will
succeed.
>
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Mike Kravetz 
> > Cc: Oscar Salvador 
> > Cc: David Hildenbrand 
> > Cc: Andrew Morton 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  mm/hugetlb.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ac843d3..02d1978 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1081,7 +1081,11 @@ static bool pfn_range_valid_gigantic(struct zone *z,
> >   unsigned long start_pfn, unsigned long nr_pages)
> >  {
> >   unsigned long i, end_pfn = start_pfn + nr_pages;
> > - struct page *page;
> > + struct page *page = pfn_to_page(start_pfn);
> > +
> > + if (PageHuge(page))
> > + if (compound_order(compound_head(page)) >= nr_pages)
> > + return false;
> >
> >   for (i = start_pfn; i < end_pfn; i++) {
> >   if (!pfn_valid(i))
> > @@ -1098,8 +1102,6 @@ static bool pfn_range_valid_gigantic(struct zone *z,
> >   if (page_count(page) > 0)
> >   return false;
> >
> > - if (PageHuge(page))
> > - return false;
> >   }
> >
> >   return true;
> >
>
> So except in the case where there is a bigger huge page in the range this will
> attempt migrating everything on the way. As mentioned before if it all this is
> a good idea, it needs to differentiate between HugeTLB and THP and also take
> into account costs of migrations and chance of subsequence allocation attempt
> into account.
Sorry, but I think this logic only applies to hugetlb. The caller,
alloc_gigantic_page(), is only used inside mm/hugetlb.c, not by
huge_memory.c.

Thanks,
  Pingfan


Re: [PATCH] mm/hugetlb: allow gigantic page allocation to migrate away smaller huge page

2019-06-23 Thread Pingfan Liu
On Mon, Jun 24, 2019 at 1:03 PM Ira Weiny  wrote:
>
> On Mon, Jun 24, 2019 at 12:21:08PM +0800, Pingfan Liu wrote:
> > The current pfn_range_valid_gigantic() rejects the pud huge page allocation
> > if there is a pmd huge page inside the candidate range.
> >
> > But pud huge resource is more rare, which should align on 1GB on x86. It is
> > worth to allow migrating away pmd huge page to make room for a pud huge
> > page.
> >
> > The same logic is applied to pgd and pud huge pages.
>
> I'm sorry but I don't quite understand why we should do this.  Is this a bug 
> or
> an optimization?  It sounds like an optimization.
Yes, an optimization. It can help us succeed in allocating a 1GB hugetlb
page when some 2MB hugetlb pages sit in the candidate range. Allocating a
1GB hugetlb page has tougher requirements: not only a contiguous 1GB range,
but also GB alignment, while allocating a 2MB range is much easier.
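
As a rough illustration of why the 1GB case is so much tougher (numbers for
x86 with 4KB base pages; the arithmetic is added here for illustration
only):

	/* A pud-sized (1GB) huge page needs
	 *	nr_pages = 1GB / 4KB = 262144
	 * contiguous base pages, and the starting pfn must itself be
	 * 262144-aligned, so candidate ranges are scarce.  A pmd-sized (2MB)
	 * huge page only needs 512 contiguous pages with 512-pfn alignment.
	 * Before this patch, a single stray 2MB hugetlb page inside a
	 * candidate 1GB range makes pfn_range_valid_gigantic() reject the
	 * whole range.
	 */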
>
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Mike Kravetz 
> > Cc: Oscar Salvador 
> > Cc: David Hildenbrand 
> > Cc: Andrew Morton 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  mm/hugetlb.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ac843d3..02d1978 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1081,7 +1081,11 @@ static bool pfn_range_valid_gigantic(struct zone *z,
> >   unsigned long start_pfn, unsigned long nr_pages)
> >  {
> >   unsigned long i, end_pfn = start_pfn + nr_pages;
> > - struct page *page;
> > + struct page *page = pfn_to_page(start_pfn);
> > +
> > + if (PageHuge(page))
> > + if (compound_order(compound_head(page)) >= nr_pages)
>
> I don't think you want compound_order() here.
Yes, you are right.

Thanks,
  Pingfan
>
> Ira
>
> > + return false;
> >
> >   for (i = start_pfn; i < end_pfn; i++) {
> >   if (!pfn_valid(i))
> > @@ -1098,8 +1102,6 @@ static bool pfn_range_valid_gigantic(struct zone *z,
> >   if (page_count(page) > 0)
> >   return false;
> >
> > - if (PageHuge(page))
> > - return false;
> >   }
> >
> >   return true;
> > --
> > 2.7.5
> >


Re: [PATCHv2] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-23 Thread Pingfan Liu
On Mon, Jun 24, 2019 at 1:32 PM Pingfan Liu  wrote:
>
> On Mon, Jun 24, 2019 at 12:43 PM Ira Weiny  wrote:
> >
> > On Mon, Jun 24, 2019 at 12:12:41PM +0800, Pingfan Liu wrote:
> > > Both hugetlb and thp locate on the same migration type of pageblock, since
> > > they are allocated from a free_list[]. Based on this fact, it is enough to
> > > check on a single subpage to decide the migration type of the whole huge
> > > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86,
> > > similar on other archs.
> > >
> > > Furthermore, when executing isolate_huge_page(), it avoid taking global
> > > hugetlb_lock many times, and meanless remove/add to the local link list
> > > cma_page_list.
> > >
> > > Signed-off-by: Pingfan Liu 
> > > Cc: Andrew Morton 
> > > Cc: Ira Weiny 
> > > Cc: Mike Rapoport 
> > > Cc: "Kirill A. Shutemov" 
> > > Cc: Thomas Gleixner 
> > > Cc: John Hubbard 
> > > Cc: "Aneesh Kumar K.V" 
> > > Cc: Christoph Hellwig 
> > > Cc: Keith Busch 
> > > Cc: Mike Kravetz 
> > > Cc: Linux-kernel@vger.kernel.org
> > > ---
> > >  mm/gup.c | 19 ---
> > >  1 file changed, 12 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index ddde097..544f5de 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
> > > task_struct *tsk,
> > >   LIST_HEAD(cma_page_list);
> > >
> > >  check_again:
> > > - for (i = 0; i < nr_pages; i++) {
> > > + for (i = 0; i < nr_pages;) {
> > > +
> > > + struct page *head = compound_head(pages[i]);
> > > + long step = 1;
> > > +
> > > + if (PageCompound(head))
> > > + step = compound_order(head) - (pages[i] - head);
> >
> > Sorry if I missed this last time.  compound_order() is not correct here.
> For thp, prep_transhuge_page()->prep_compound_page()->set_compound_order().
> For smaller hugetlb,
> prep_new_huge_page()->prep_compound_page()->set_compound_order().
> For gigantic page, prep_compound_gigantic_page()->set_compound_order().
>
> Do I miss anything?
>
Oh, got it. It should be 1 << compound_order(head) - (pages[i] - head).
> Thanks,
>   Pingfan
> [...]


Re: [PATCHv2] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-23 Thread Pingfan Liu
On Mon, Jun 24, 2019 at 12:43 PM Ira Weiny  wrote:
>
> On Mon, Jun 24, 2019 at 12:12:41PM +0800, Pingfan Liu wrote:
> > Both hugetlb and thp locate on the same migration type of pageblock, since
> > they are allocated from a free_list[]. Based on this fact, it is enough to
> > check on a single subpage to decide the migration type of the whole huge
> > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86,
> > similar on other archs.
> >
> > Furthermore, when executing isolate_huge_page(), it avoid taking global
> > hugetlb_lock many times, and meanless remove/add to the local link list
> > cma_page_list.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Andrew Morton 
> > Cc: Ira Weiny 
> > Cc: Mike Rapoport 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Thomas Gleixner 
> > Cc: John Hubbard 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: Christoph Hellwig 
> > Cc: Keith Busch 
> > Cc: Mike Kravetz 
> > Cc: Linux-kernel@vger.kernel.org
> > ---
> >  mm/gup.c | 19 ---
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index ddde097..544f5de 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
> > task_struct *tsk,
> >   LIST_HEAD(cma_page_list);
> >
> >  check_again:
> > - for (i = 0; i < nr_pages; i++) {
> > + for (i = 0; i < nr_pages;) {
> > +
> > + struct page *head = compound_head(pages[i]);
> > + long step = 1;
> > +
> > + if (PageCompound(head))
> > + step = compound_order(head) - (pages[i] - head);
>
> Sorry if I missed this last time.  compound_order() is not correct here.
For thp, prep_transhuge_page()->prep_compound_page()->set_compound_order().
For smaller hugetlb,
prep_new_huge_page()->prep_compound_page()->set_compound_order().
For gigantic page, prep_compound_gigantic_page()->set_compound_order().

Do I miss anything?

Thanks,
  Pingfan
[...]


[PATCH] mm/hugetlb: allow gigantic page allocation to migrate away smaller huge page

2019-06-23 Thread Pingfan Liu
The current pfn_range_valid_gigantic() rejects a pud huge page allocation
if there is a pmd huge page inside the candidate range.

But pud huge pages are a scarcer resource, since they must be aligned on
1GB on x86. It is worth allowing pmd huge pages to be migrated away to make
room for a pud huge page.

The same logic applies to pgd vs. pud huge pages.

Signed-off-by: Pingfan Liu 
Cc: Mike Kravetz 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Andrew Morton 
Cc: linux-kernel@vger.kernel.org
---
 mm/hugetlb.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac843d3..02d1978 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1081,7 +1081,11 @@ static bool pfn_range_valid_gigantic(struct zone *z,
unsigned long start_pfn, unsigned long nr_pages)
 {
unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
+   struct page *page = pfn_to_page(start_pfn);
+
+   if (PageHuge(page))
+   if (compound_order(compound_head(page)) >= nr_pages)
+   return false;
 
for (i = start_pfn; i < end_pfn; i++) {
if (!pfn_valid(i))
@@ -1098,8 +1102,6 @@ static bool pfn_range_valid_gigantic(struct zone *z,
if (page_count(page) > 0)
return false;
 
-   if (PageHuge(page))
-   return false;
}
 
return true;
-- 
2.7.5



[PATCHv2] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-23 Thread Pingfan Liu
Both hugetlb and thp pages lie within pageblocks of a single migration type,
since they are allocated from a free_list[]. Based on this fact, it is
enough to check a single subpage to decide the migration type of the whole
huge page. This saves (2M/4K - 1) loop iterations for a pmd-sized huge page
on x86, and similarly on other archs.

Furthermore, when executing isolate_huge_page(), it avoids taking the
global hugetlb_lock many times, and avoids meaningless remove/add
operations on the local cma_page_list.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Ira Weiny 
Cc: Mike Rapoport 
Cc: "Kirill A. Shutemov" 
Cc: Thomas Gleixner 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Mike Kravetz 
Cc: Linux-kernel@vger.kernel.org
---
 mm/gup.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddde097..544f5de 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1342,19 +1342,22 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
LIST_HEAD(cma_page_list);
 
 check_again:
-   for (i = 0; i < nr_pages; i++) {
+   for (i = 0; i < nr_pages;) {
+
+   struct page *head = compound_head(pages[i]);
+   long step = 1;
+
+   if (PageCompound(head))
+   step = compound_order(head) - (pages[i] - head);
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
 * of the CMA zone if possible.
 */
-   if (is_migrate_cma_page(pages[i])) {
-
-   struct page *head = compound_head(pages[i]);
-
-   if (PageHuge(head)) {
+   if (is_migrate_cma_page(head)) {
+   if (PageHuge(head))
isolate_huge_page(head, &cma_page_list);
-   } else {
+   else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
drain_allow = false;
@@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
}
}
}
+
+   i += step;
}
 
if (!list_empty(&cma_page_list)) {
-- 
2.7.5



Re: [PATCH] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-23 Thread Pingfan Liu
On Sat, Jun 22, 2019 at 2:13 AM Ira Weiny  wrote:
>
> On Fri, Jun 21, 2019 at 06:15:16PM +0800, Pingfan Liu wrote:
> > Both hugetlb and thp locate on the same migration type of pageblock, since
> > they are allocated from a free_list[]. Based on this fact, it is enough to
> > check on a single subpage to decide the migration type of the whole huge
> > page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86,
> > similar on other archs.
> >
> > Furthermore, when executing isolate_huge_page(), it avoid taking global
> > hugetlb_lock many times, and meanless remove/add to the local link list
> > cma_page_list.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Andrew Morton 
> > Cc: Ira Weiny 
> > Cc: Mike Rapoport 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Thomas Gleixner 
> > Cc: John Hubbard 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: Christoph Hellwig 
> > Cc: Keith Busch 
> > Cc: Mike Kravetz 
> > Cc: Linux-kernel@vger.kernel.org
> > ---
> >  mm/gup.c | 13 +
> >  1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index ddde097..2eecb16 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1342,16 +1342,19 @@ static long check_and_migrate_cma_pages(struct 
> > task_struct *tsk,
> >   LIST_HEAD(cma_page_list);
> >
> >  check_again:
> > - for (i = 0; i < nr_pages; i++) {
> > + for (i = 0; i < nr_pages;) {
> > +
> > + struct page *head = compound_head(pages[i]);
> > + long step = 1;
> > +
> > + if (PageCompound(head))
> > + step = compound_order(head) - (pages[i] - head);
> >   /*
> >* If we get a page from the CMA zone, since we are going to
> >* be pinning these entries, we might as well move them out
> >* of the CMA zone if possible.
> >*/
> >   if (is_migrate_cma_page(pages[i])) {
>
> I like this but I think for consistency I would change this pages[i] to be
> head.  Even though it is not required.
Yes, agree. Thank you for your good suggestion.

Regards,
  Pingfan
>
> Ira
>
> > -
> > - struct page *head = compound_head(pages[i]);
> > -
> >   if (PageHuge(head)) {
> >   isolate_huge_page(head, &cma_page_list);
> >   } else {
> > @@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct 
> > task_struct *tsk,
> >   }
> >   }
> >   }
> > +
> > + i += step;
> >   }
> >
> > if (!list_empty(&cma_page_list)) {
> > --
> > 2.7.5
> >


[PATCH] mm/gup: speed up check_and_migrate_cma_pages() on huge page

2019-06-21 Thread Pingfan Liu
Both hugetlb and thp pages lie within pageblocks of a single migration type,
since they are allocated from a free_list[]. Based on this fact, it is
enough to check a single subpage to decide the migration type of the whole
huge page. This saves (2M/4K - 1) loop iterations for a pmd-sized huge page
on x86, and similarly on other archs.

Furthermore, when executing isolate_huge_page(), it avoids taking the
global hugetlb_lock many times, and avoids meaningless remove/add
operations on the local cma_page_list.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Ira Weiny 
Cc: Mike Rapoport 
Cc: "Kirill A. Shutemov" 
Cc: Thomas Gleixner 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Mike Kravetz 
Cc: Linux-kernel@vger.kernel.org
---
 mm/gup.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddde097..2eecb16 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1342,16 +1342,19 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
LIST_HEAD(cma_page_list);
 
 check_again:
-   for (i = 0; i < nr_pages; i++) {
+   for (i = 0; i < nr_pages;) {
+
+   struct page *head = compound_head(pages[i]);
+   long step = 1;
+
+   if (PageCompound(head))
+   step = compound_order(head) - (pages[i] - head);
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
 * of the CMA zone if possible.
 */
if (is_migrate_cma_page(pages[i])) {
-
-   struct page *head = compound_head(pages[i]);
-
if (PageHuge(head)) {
isolate_huge_page(head, &cma_page_list);
} else {
@@ -1369,6 +1372,8 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
}
}
}
+
+   i += step;
}
 
if (!list_empty(&cma_page_list)) {
-- 
2.7.5



Re: [PATCHv4 2/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path

2019-06-14 Thread Pingfan Liu
Cc Mike and David, who are experts on hugetlb and thp

On Fri, Jun 14, 2019 at 5:37 AM Ira Weiny  wrote:
>
> On Thu, Jun 13, 2019 at 06:45:01PM +0800, Pingfan Liu wrote:
> > FOLL_LONGTERM suggests a pin which is going to be given to hardware and
> > can't move. It would truncate CMA permanently and should be excluded.
> >
> > FOLL_LONGTERM has already been checked in the slow path, but not checked in
> > the fast path, which means a possible leak of CMA page to longterm pinned
> > requirement through this crack.
> >
> > Place a check in gup_pte_range() in the fast path.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Ira Weiny 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: Dan Williams 
> > Cc: Matthew Wilcox 
> > Cc: John Hubbard 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: Keith Busch 
> > Cc: Christoph Hellwig 
> > Cc: Shuah Khan 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  mm/gup.c | 26 ++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 766ae54..de1b03f 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1757,6 +1757,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
> > addr, unsigned long end,
> >   VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> >   page = pte_page(pte);
> >
> > + /*
> > +  * FOLL_LONGTERM suggests a pin given to hardware. Prevent it
> > +  * from truncating CMA area
> > +  */
> > + if (unlikely(flags & FOLL_LONGTERM) &&
> > + is_migrate_cma_page(page))
> > + goto pte_unmap;
> > +
> >   head = try_get_compound_head(page, 1);
> >   if (!head)
> >   goto pte_unmap;
> > @@ -1900,6 +1908,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> > unsigned long addr,
> >   refs++;
> >   } while (addr += PAGE_SIZE, addr != end);
> >
> > + if (unlikely(flags & FOLL_LONGTERM) &&
> > + is_migrate_cma_page(page)) {
> > + *nr -= refs;
> > + return 0;
> > + }
> > +
>
> Why can't we place this check before the while loop and skip subtracting the
> page count?
Yes, that will be better.

>
> Can is_migrate_cma_page() operate on any "subpage" of a compound page?
For a gigantic page, __alloc_gigantic_page() allocates from MIGRATE_MOVABLE
pageblocks. For page order < MAX_ORDER, pages are allocated from either
free_list[MIGRATE_MOVABLE] or free_list[MIGRATE_CMA]. So all subpages have
the same migrate type.
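
For reference, the check itself only looks at the pageblock migrate type,
roughly like this (sketch of the CONFIG_CMA=y definition, quoted from
memory and possibly not verbatim):

	/* include/linux/mmzone.h, approximately: */
	#define is_migrate_cma_page(_page) \
		(get_pageblock_migratetype(_page) == MIGRATE_CMA)

	/* Since the subpages of one huge page come from a single free_list[]
	 * block (or, for gigantic pages, from MIGRATE_MOVABLE pageblocks as
	 * noted above), they share the same migrate type, so checking one
	 * subpage covers the whole huge page.
	 */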

Thanks,
  Pingfan
>
> Here this calls is_magrate_cma_page() on the tail page of the compound page.
>
> I'm not an expert on compound pages nor cma handling so is this ok?
>
> It seems like you need to call is_migrate_cma_page() on each page within the
> while loop?
>
> >   head = try_get_compound_head(pmd_page(orig), refs);
> >   if (!head) {
> >   *nr -= refs;
> > @@ -1941,6 +1955,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> > unsigned long addr,
> >   refs++;
> >   } while (addr += PAGE_SIZE, addr != end);
> >
> > + if (unlikely(flags & FOLL_LONGTERM) &&
> > + is_migrate_cma_page(page)) {
> > + *nr -= refs;
> > + return 0;
> > + }
> > +
>
> Same comment here.
>
> >   head = try_get_compound_head(pud_page(orig), refs);
> >   if (!head) {
> >   *nr -= refs;
> > @@ -1978,6 +1998,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, 
> > unsigned long addr,
> >   refs++;
> >   } while (addr += PAGE_SIZE, addr != end);
> >
> > + if (unlikely(flags & FOLL_LONGTERM) &&
> > + is_migrate_cma_page(page)) {
> > + *nr -= refs;
> > + return 0;
> > + }
> > +
>
> And here.
>
> Ira
>
> >   head = try_get_compound_head(pgd_page(orig), refs);
> >   if (!head) {
> >   *nr -= refs;
> > --
> > 2.7.5
> >


[PATCHv4 3/3] mm/gup_benchemark: add LONGTERM_BENCHMARK test in gup fast path

2019-06-13 Thread Pingfan Liu
Introduce a GUP_FAST_LONGTERM_BENCHMARK ioctl to test long-term pinning in
the gup fast path.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: Shuah Khan 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup_benchmark.c | 11 +--
 tools/testing/selftests/vm/gup_benchmark.c | 10 +++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d..83f3378 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,8 +6,9 @@
 #include 
 
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK _IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK  _IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages_fast(addr, nr, gup->flags & 1,
 pages + i);
break;
+   case GUP_FAST_LONGTERM_BENCHMARK:
+   nr = get_user_pages_fast(addr, nr,
+   (gup->flags & 1) | FOLL_LONGTERM,
+pages + i);
+   break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
(gup->flags & 1) | FOLL_LONGTERM,
@@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned 
int cmd,
 
switch (cmd) {
case GUP_FAST_BENCHMARK:
+   case GUP_FAST_LONGTERM_BENCHMARK:
case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
break;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
b/tools/testing/selftests/vm/gup_benchmark.c
index c0534e2..ade8acb 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,8 +15,9 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK _IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK  _IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -37,7 +38,7 @@ int main(int argc, char **argv)
char *file = "/dev/zero";
char *p;
 
-   while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
+   while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
switch (opt) {
case 'm':
size = atoi(optarg) * MB;
@@ -54,6 +55,9 @@ int main(int argc, char **argv)
case 'T':
thp = 0;
break;
+   case 'l':
+   cmd = GUP_FAST_LONGTERM_BENCHMARK;
+   break;
case 'L':
cmd = GUP_LONGTERM_BENCHMARK;
break;
-- 
2.7.5



[PATCHv4 1/3] mm/gup: rename nr as nr_pinned in get_user_pages_fast()

2019-06-13 Thread Pingfan Liu
To better reflect the pinned state of the pages and make the code
self-explanatory, rename nr to nr_pinned.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: Shuah Khan 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..766ae54 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2216,7 +2216,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
 {
unsigned long addr, len, end;
-   int nr = 0, ret = 0;
+   int nr_pinned = 0, ret = 0;
 
start &= PAGE_MASK;
addr = start;
@@ -2231,25 +2231,25 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages,
 
if (gup_fast_permitted(start, nr_pages)) {
local_irq_disable();
-   gup_pgd_range(addr, end, gup_flags, pages, &nr);
+   gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
local_irq_enable();
-   ret = nr;
+   ret = nr_pinned;
}
 
-   if (nr < nr_pages) {
+   if (nr_pinned < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
-   start += nr << PAGE_SHIFT;
-   pages += nr;
+   start += nr_pinned << PAGE_SHIFT;
+   pages += nr_pinned;
 
-   ret = __gup_longterm_unlocked(start, nr_pages - nr,
+   ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
  gup_flags, pages);
 
/* Have to be a bit careful with return values */
-   if (nr > 0) {
+   if (nr_pinned > 0) {
if (ret < 0)
-   ret = nr;
+   ret = nr_pinned;
else
-   ret += nr;
+   ret += nr_pinned;
}
}
 
-- 
2.7.5



[PATCHv4 2/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path

2019-06-13 Thread Pingfan Liu
FOLL_LONGTERM indicates a pin which is going to be handed to hardware and
cannot be moved. Such a pin would reduce the usable CMA area permanently
and should therefore be excluded from CMA pages.

FOLL_LONGTERM is already checked in the slow path, but not in the fast
path, which means CMA pages can leak into long-term pins through this
crack.

Place a check in gup_pte_range() and in the huge page walkers of the fast
path.
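
For context, a hypothetical long-term pinning caller that this check now
covers on the fast path (illustration only; the flags and the
get_user_pages_fast() signature are the ones used elsewhere in this series):

	/* e.g. a driver pinning user memory for long-lived DMA */
	nr_pinned = get_user_pages_fast(user_addr, nr_pages,
					FOLL_WRITE | FOLL_LONGTERM, pages);

	/* With this patch, hitting a CMA page in the fast path makes
	 * gup_pte_range() (or the huge page walkers) bail out, so the
	 * request falls back to the slow path, which already migrates CMA
	 * pages away via check_and_migrate_cma_pages().
	 */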

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: Shuah Khan 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 766ae54..de1b03f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1757,6 +1757,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
 
+   /*
+* FOLL_LONGTERM suggests a pin given to hardware. Prevent it
+* from truncating CMA area
+*/
+   if (unlikely(flags & FOLL_LONGTERM) &&
+   is_migrate_cma_page(page))
+   goto pte_unmap;
+
head = try_get_compound_head(page, 1);
if (!head)
goto pte_unmap;
@@ -1900,6 +1908,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
 
+   if (unlikely(flags & FOLL_LONGTERM) &&
+   is_migrate_cma_page(page)) {
+   *nr -= refs;
+   return 0;
+   }
+
head = try_get_compound_head(pmd_page(orig), refs);
if (!head) {
*nr -= refs;
@@ -1941,6 +1955,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
 
+   if (unlikely(flags & FOLL_LONGTERM) &&
+   is_migrate_cma_page(page)) {
+   *nr -= refs;
+   return 0;
+   }
+
head = try_get_compound_head(pud_page(orig), refs);
if (!head) {
*nr -= refs;
@@ -1978,6 +1998,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, 
unsigned long addr,
refs++;
} while (addr += PAGE_SIZE, addr != end);
 
+   if (unlikely(flags & FOLL_LONGTERM) &&
+   is_migrate_cma_page(page)) {
+   *nr -= refs;
+   return 0;
+   }
+
head = try_get_compound_head(pgd_page(orig), refs);
if (!head) {
*nr -= refs;
-- 
2.7.5



[PATCHv4 0/3] mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path

2019-06-13 Thread Pingfan Liu
These three patches have no dependency on each other, but they are related
to the same purpose of improving get_user_pages_fast() (patch [2/3]), so
they are put together.

v3->v4:
  Place the check on FOLL_LONGTERM in gup_pte_range() instead of
  get_user_pages_fast()

Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: Shuah Khan 
Cc: linux-kernel@vger.kernel.org

Pingfan Liu (3):
  mm/gup: rename nr as nr_pinned in get_user_pages_fast()
  mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path
  mm/gup_benchemark: add LONGTERM_BENCHMARK test in gup fast path

 mm/gup.c   | 46 +++---
 mm/gup_benchmark.c | 11 +--
 tools/testing/selftests/vm/gup_benchmark.c | 10 +--
 3 files changed, 52 insertions(+), 15 deletions(-)

-- 
2.7.5



Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-13 Thread Pingfan Liu
On Thu, Jun 13, 2019 at 7:49 AM Ira Weiny  wrote:
>
> On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote:
> > On Tue, Jun 11, 2019 at 04:29:11PM +, Weiny, Ira wrote:
> > > > Pingfan Liu  writes:
> > > >
> > > > > As for FOLL_LONGTERM, it is checked in the slow path
> > > > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > > > which means a possible leak of CMA page to longterm pinned requirement
> > > > > through this crack.
> > > >
> > > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > > > dax check we need vma to ensure whether a long term pin is allowed or 
> > > > not.
> > > > If FOLL_LONGTERM is specified we should fallback to slow path.
> > >
> > > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But 
> > > it does this while walking the page tables.  I missed the CMA case and 
> > > Pingfan's patch fixes this.  We could check for CMA pages while walking 
> > > the page tables but most agreed that it was not worth it.  For DAX we 
> > > already had checks for *_devmap() so it was easier to put the 
> > > FOLL_LONGTERM checks there.
> > >
> > Then for CMA pages, are you suggesting something like:
>
> I'm not suggesting this.
OK, then I will send out v4.
>
> Sorry I wrote this prior to seeing the numbers in your other email.  Given
> the numbers it looks like performing the check whilst walking the tables is
> worth the extra complexity.  I was just trying to summarize the thread.  I
> don't think we should disallow FOLL_LONGTERM because it only affects CMA and
> DAX.  Other pages will be fine with FOLL_LONGTERM.  Why penalize every call if
> we don't have to.  Also in the case of DAX the use of vma will be going
> away...[1]  Eventually...  ;-)
A good feature. Trying to catch up.

Thanks,
Pingfan


Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-12 Thread Pingfan Liu
On Wed, Jun 12, 2019 at 12:46 AM Ira Weiny  wrote:
>
> On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> > On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> > > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard  wrote:
> > > >
> > > > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton 
> > > > >  wrote:
> > > > ...
> > > > >>> --- a/mm/gup.c
> > > > >>> +++ b/mm/gup.c
> > > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned 
> > > > >>> long start, int nr_pages,
> > > > >>>   return ret;
> > > > >>>  }
> > > > >>>
> > > > >>> +#ifdef CONFIG_CMA
> > > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page 
> > > > >>> **pages)
> > > > >>> +{
> > > > >>> + int i;
> > > > >>> +
> > > > >>> + for (i = 0; i < nr_pinned; i++)
> > > > >>> + if (is_migrate_cma_page(pages[i])) {
> > > > >>> + put_user_pages(pages + i, nr_pinned - i);
> > > > >>> + return i;
> > > > >>> + }
> > > > >>> +
> > > > >>> + return nr_pinned;
> > > > >>> +}
> > > > >>
> > > > >> There's no point in inlining this.
> > > > > OK, will drop it in V4.
> > > > >
> > > > >>
> > > > >> The code seems inefficient.  If it encounters a single CMA page it 
> > > > >> can
> > > > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > > > The trick is the page is not be discarded, in fact, they are still be
> > > > > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > > > > pages again.
> > > > >
> > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > > > >> rare.  But could we avoid this (and the second pass across pages[]) 
> > > > >> by
> > > > >> checking for a CMA page within gup_pte_range()?
> > > > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > > > improvement in performance due to slow path. So I think maybe it is
> > > > > not worth.
> > > > >
> > > > >>
> > > >
> > > > I think the concern is: for the successful gup_fast case with no CMA
> > > > pages, this patch is adding another complete loop through all the
> > > > pages. In the fast case.
> > > >
> > > > If the check were instead done as part of the gup_pte_range(), then
> > > > it would be a little more efficient for that case.
> > > >
> > > > As for whether it's worth it, *probably* this is too small an effect to 
> > > > measure.
> > > > But in order to attempt a measurement: running fio 
> > > > (https://github.com/axboe/fio)
> > > > with O_DIRECT on an NVMe drive, might shed some light. Here's an 
> > > > fio.conf file
> > > > that Jan Kara and Tom Talpey helped me come up with, for related 
> > > > testing:
> > > >
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > >
> > Unable to get a NVME device to have a test. And when testing fio on the
> > tranditional disk, I got the error "fio: engine libaio not loadable
> > fio: failed to load engine
> > fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared 
> > object file: No such file or directory"
> >
> > But I found a test case which can be slightly adjusted to met the aim.
> > It is tools/testing/selftests/vm/gup_benchmark.c
> >
> > Test enviroment:
> >   MemTotal:   264079324 kB
> >   MemFree:262306788 kB
> >   CmaTotal:  0 kB
> >   CmaFree:   0 kB
> >   on AMD EPYC 7601
> >
> > Test command:
> >   gup_benchmark -r 100 -n 64
> >   gup_benchmark -r 100 -n 64 -l
> > where -r stands for repeat times, -n is nr_pa

Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-12 Thread Pingfan Liu
On Tue, Jun 11, 2019 at 04:29:11PM +, Weiny, Ira wrote:
> > Pingfan Liu  writes:
> > 
> > > As for FOLL_LONGTERM, it is checked in the slow path
> > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > which means a possible leak of CMA page to longterm pinned requirement
> > > through this crack.
> > 
> > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > dax check we need vma to ensure whether a long term pin is allowed or not.
> > If FOLL_LONGTERM is specified we should fallback to slow path.
> 
> Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it 
> does this while walking the page tables.  I missed the CMA case and Pingfan's 
> patch fixes this.  We could check for CMA pages while walking the page tables 
> but most agreed that it was not worth it.  For DAX we already had checks for 
> *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> 
Then for CMA pages, are you suggesting something like:
diff --git a/mm/gup.c b/mm/gup.c
index 42a47c0..8bf3cc3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
if (unlikely(!access_ok((void __user *)start, len)))
return -EFAULT;

+   if (unlikely(gup_flags & FOLL_LONGTERM))
+   goto slow;
if (gup_fast_permitted(start, nr_pages)) {
local_irq_disable();
gup_pgd_range(addr, end, gup_flags, pages, &nr);
@@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
ret = nr;
}

+slow:
if (nr < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;

Thanks,
  Pingfan


Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-11 Thread Pingfan Liu
On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> On Fri, Jun 7, 2019 at 5:17 AM John Hubbard  wrote:
> >
> > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton  
> > > wrote:
> > ...
> > >>> --- a/mm/gup.c
> > >>> +++ b/mm/gup.c
> > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long 
> > >>> start, int nr_pages,
> > >>>   return ret;
> > >>>  }
> > >>>
> > >>> +#ifdef CONFIG_CMA
> > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > >>> +{
> > >>> + int i;
> > >>> +
> > >>> + for (i = 0; i < nr_pinned; i++)
> > >>> + if (is_migrate_cma_page(pages[i])) {
> > >>> + put_user_pages(pages + i, nr_pinned - i);
> > >>> + return i;
> > >>> + }
> > >>> +
> > >>> + return nr_pinned;
> > >>> +}
> > >>
> > >> There's no point in inlining this.
> > > OK, will drop it in V4.
> > >
> > >>
> > >> The code seems inefficient.  If it encounters a single CMA page it can
> > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > The trick is the page is not be discarded, in fact, they are still be
> > > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > > pages again.
> > >
> > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > >> checking for a CMA page within gup_pte_range()?
> > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > improvement in performance due to slow path. So I think maybe it is
> > > not worth.
> > >
> > >>
> >
> > I think the concern is: for the successful gup_fast case with no CMA
> > pages, this patch is adding another complete loop through all the
> > pages. In the fast case.
> >
> > If the check were instead done as part of the gup_pte_range(), then
> > it would be a little more efficient for that case.
> >
> > As for whether it's worth it, *probably* this is too small an effect to 
> > measure.
> > But in order to attempt a measurement: running fio 
> > (https://github.com/axboe/fio)
> > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf 
> > file
> > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> >
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> >
Unable to get an NVMe device for testing. And when testing fio on a
traditional disk, I got the error "fio: engine libaio not loadable
fio: failed to load engine
fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object
file: No such file or directory"

But I found a test case which can be slightly adjusted to meet the aim.
It is tools/testing/selftests/vm/gup_benchmark.c

Test environment:
  MemTotal:   264079324 kB
  MemFree:262306788 kB
  CmaTotal:  0 kB
  CmaFree:   0 kB
  on AMD EPYC 7601

Test command:
  gup_benchmark -r 100 -n 64
  gup_benchmark -r 100 -n 64 -l
where -r stands for the repeat count, -n is the nr_pages param for
get_user_pages_fast(), and -l is a new option to test FOLL_LONGTERM in the
fast path; see the patch at the tail.

Test result:
w/o 477.80
w/o-l   481.07
a   481.80
a-l 640.41
b   466.24  (question a: b outperforms w/o ?)
b-l 529.74

Where w/o is baseline without any patch using v5.2-rc2, a is this series, b
does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.

I am surprised that b-l shows about a 17% improvement over a-l:
(640.41 - 529.74)/640.41.
As for "question a: b outperforms w/o ?", I cannot figure out why; maybe it
can be considered variance.

Based on the above result, I think it is better to do the check inside
gup_pte_range().

Any comment?

Thanks,


> Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> will try to bring out the result.
> 
> Thanks,
>   Pingfan
> 

---
Patch to do check inside gup_pte_range()

diff --git a/mm/gup.c b/mm/gup.c
index 2ce3091..ba213a0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
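
(The patch body is cut off in the archive here. For illustration only -- this
is not the author's actual hunk -- a check done while walking the page tables,
as discussed above, might look roughly like the sketch below, assuming the
v5.2-era gup_pte_range() with its local `page', its `flags' argument and its
pte_unmap label:)

 		page = pte_page(pte);
+
+		/*
+		 * Hypothetical sketch: under FOLL_LONGTERM, stop at the first
+		 * CMA page and leave the rest of the range to the slow path,
+		 * which can migrate such pages out of the CMA area.
+		 */
+		if (unlikely(flags & FOLL_LONGTERM) && is_migrate_cma_page(page))
+			goto pte_unmap;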

Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-07 Thread Pingfan Liu
On Fri, Jun 7, 2019 at 5:17 AM John Hubbard  wrote:
>
> On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton  
> > wrote:
> ...
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long 
> >>> start, int nr_pages,
> >>>   return ret;
> >>>  }
> >>>
> >>> +#ifdef CONFIG_CMA
> >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> >>> +{
> >>> + int i;
> >>> +
> >>> + for (i = 0; i < nr_pinned; i++)
> >>> + if (is_migrate_cma_page(pages[i])) {
> >>> + put_user_pages(pages + i, nr_pinned - i);
> >>> + return i;
> >>> + }
> >>> +
> >>> + return nr_pinned;
> >>> +}
> >>
> >> There's no point in inlining this.
> > OK, will drop it in V4.
> >
> >>
> >> The code seems inefficient.  If it encounters a single CMA page it can
> >> end up discarding a possibly significant number of non-CMA pages.  I
> > The trick is the page is not be discarded, in fact, they are still be
> > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > pages again.
> >
> >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> >> rare.  But could we avoid this (and the second pass across pages[]) by
> >> checking for a CMA page within gup_pte_range()?
> > It will spread the same logic to hugetlb pte and normal pte. And no
> > improvement in performance due to slow path. So I think maybe it is
> > not worth.
> >
> >>
>
> I think the concern is: for the successful gup_fast case with no CMA
> pages, this patch is adding another complete loop through all the
> pages. In the fast case.
>
> If the check were instead done as part of the gup_pte_range(), then
> it would be a little more efficient for that case.
>
> As for whether it's worth it, *probably* this is too small an effect to 
> measure.
> But in order to attempt a measurement: running fio 
> (https://github.com/axboe/fio)
> with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> that Jan Kara and Tom Talpey helped me come up with, for related testing:
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
>
Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
will try to bring out the result.

Thanks,
  Pingfan


Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-05 Thread Pingfan Liu
On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton  wrote:
>
> On Wed,  5 Jun 2019 17:10:19 +0800 Pingfan Liu  wrote:
>
> > As for FOLL_LONGTERM, it is checked in the slow path
> > __gup_longterm_unlocked(). But it is not checked in the fast path, which
> > means a possible leak of CMA page to longterm pinned requirement through
> > this crack.
> >
> > Place a check in the fast path.
>
> I'm not actually seeing a description (in either the existing code or
> this changelog or patch) an explanation of *why* we wish to exclude CMA
> pages from longterm pinning.
>
What about a short description like this:
FOLL_LONGTERM suggests a pin which is going to be given to hardware and
cannot be moved. It would effectively shrink CMA permanently, so such pages
should be excluded.

> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long 
> > start, int nr_pages,
> >   return ret;
> >  }
> >
> > +#ifdef CONFIG_CMA
> > +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pinned; i++)
> > + if (is_migrate_cma_page(pages[i])) {
> > + put_user_pages(pages + i, nr_pinned - i);
> > + return i;
> > + }
> > +
> > + return nr_pinned;
> > +}
>
> There's no point in inlining this.
OK, will drop it in V4.

>
> The code seems inefficient.  If it encounters a single CMA page it can
> end up discarding a possibly significant number of non-CMA pages.  I
The trick is that the pages are not discarded; in fact, they are still
referenced by the PTEs. We just leave it to the slow path to pick up the
non-CMA pages again.

> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> rare.  But could we avoid this (and the second pass across pages[]) by
> checking for a CMA page within gup_pte_range()?
It would spread the same logic across the hugetlb pte and the normal pte
paths, with no improvement in performance since we fall back to the slow
path anyway. So I think maybe it is not worth it.

>
> > +#else
> > +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > +{
> > + return nr_pinned;
> > +}
> > +#endif
> > +
> >  /**
> >   * get_user_pages_fast() - pin user pages in memory
> >   * @start:   starting user address
> > @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int 
> > nr_pages,
> >   ret = nr;
> >   }
> >
> > + if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
> > + nr = reject_cma_pages(nr, pages);
> > +
>
> This would be a suitable place to add a comment explaining why we're
> doing this...
I would add a comment: "FOLL_LONGTERM suggests a pin given to hardware
and rarely returned."

Thanks for your kind review.

Regards,
  Pingfan


[PATCHv3 2/2] mm/gup: rename nr as nr_pinned in get_user_pages_fast()

2019-06-05 Thread Pingfan Liu
To better reflect the held state of pages and make code self-explaining,
rename nr as nr_pinned.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 0e59af9..9b3c8a6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2236,7 +2236,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
 {
unsigned long addr, len, end;
-   int nr = 0, ret = 0;
+   int nr_pinned = 0, ret = 0;
 
start &= PAGE_MASK;
addr = start;
@@ -2251,28 +2251,28 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages,
 
if (gup_fast_permitted(start, nr_pages)) {
local_irq_disable();
-   gup_pgd_range(addr, end, gup_flags, pages, &nr);
+   gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
local_irq_enable();
-   ret = nr;
+   ret = nr_pinned;
}
 
-   if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
-   nr = reject_cma_pages(nr, pages);
+   if (unlikely(gup_flags & FOLL_LONGTERM) && nr_pinned)
+   nr_pinned = reject_cma_pages(nr_pinned, pages);
 
-   if (nr < nr_pages) {
+   if (nr_pinned < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
-   start += nr << PAGE_SHIFT;
-   pages += nr;
+   start += nr_pinned << PAGE_SHIFT;
+   pages += nr_pinned;
 
-   ret = __gup_longterm_unlocked(start, nr_pages - nr,
+   ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
  gup_flags, pages);
 
/* Have to be a bit careful with return values */
-   if (nr > 0) {
+   if (nr_pinned > 0) {
if (ret < 0)
-   ret = nr;
+   ret = nr_pinned;
else
-   ret += nr;
+   ret += nr_pinned;
}
}
 
-- 
2.7.5



[PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-05 Thread Pingfan Liu
As for FOLL_LONGTERM, it is checked in the slow path
__gup_longterm_unlocked(), but it is not checked in the fast path, which
means a CMA page can leak through this crack and end up long-term pinned.

Place a check in the fast path.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..0e59af9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, 
int nr_pages,
return ret;
 }
 
+#ifdef CONFIG_CMA
+static inline int reject_cma_pages(int nr_pinned, struct page **pages)
+{
+   int i;
+
+   for (i = 0; i < nr_pinned; i++)
+   if (is_migrate_cma_page(pages[i])) {
+   put_user_pages(pages + i, nr_pinned - i);
+   return i;
+   }
+
+   return nr_pinned;
+}
+#else
+static inline int reject_cma_pages(int nr_pinned, struct page **pages)
+{
+   return nr_pinned;
+}
+#endif
+
 /**
  * get_user_pages_fast() - pin user pages in memory
  * @start: starting user address
@@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
ret = nr;
}
 
+   if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
+   nr = reject_cma_pages(nr, pages);
+
if (nr < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
-- 
2.7.5



Re: [PATCH] mm/gup: remove unnecessary check against CMA in __gup_longterm_locked()

2019-06-04 Thread Pingfan Liu
On Tue, Jun 4, 2019 at 4:30 PM Aneesh Kumar K.V
 wrote:
>
> On 6/4/19 12:56 PM, Pingfan Liu wrote:
> > The PF_MEMALLOC_NOCMA is set by memalloc_nocma_save(), which is finally
> > cast to ~_GFP_MOVABLE.  So __get_user_pages_locked() will get pages from
> > non CMA area and pin them.  There is no need to
> > check_and_migrate_cma_pages().
>
>
> That is not completely correct. We can fault in that pages outside
> get_user_pages_longterm at which point those pages can get allocated
> from CMA region. memalloc_nocma_save() as added as an optimization to
> avoid unnecessary page migration.
Yes, you are right.

Thanks,
  Pingfan
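
(For reference, the optimization Aneesh refers to is the
memalloc_nocma_save()/memalloc_nocma_restore() pair wrapped around the
slow-path pin. A simplified sketch of that pattern -- from memory, not the
exact upstream code -- looks like:

	unsigned int noncma_flags = 0;

	if (gup_flags & FOLL_LONGTERM)
		/* allocations made in this section avoid the CMA area */
		noncma_flags = memalloc_nocma_save();

	rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
				     vmas, NULL, gup_flags);

	if (gup_flags & FOLL_LONGTERM) {
		memalloc_nocma_restore(noncma_flags);
		/*
		 * Pages faulted in before this call may still sit in CMA,
		 * so the migration pass is still needed.
		 */
		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
						 vmas, gup_flags);
	}

The save/restore pair only affects allocations made while the pin is in
progress; it cannot help with pages that were already resident, which is why
the check removed by the patch below is in fact still required.)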


[PATCH] mm/gup: remove unnecessary check against CMA in __gup_longterm_locked()

2019-06-04 Thread Pingfan Liu
PF_MEMALLOC_NOCMA is set by memalloc_nocma_save(), which ultimately causes
__GFP_MOVABLE to be dropped from the allocation mask.  So
__get_user_pages_locked() will get pages from outside the CMA area and pin
them.  There is no need for check_and_migrate_cma_pages().

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Ira Weiny 
Cc: Dan Williams 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 146 ---
 1 file changed, 146 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..9d931d1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1275,149 +1275,6 @@ static bool check_dax_vmas(struct vm_area_struct 
**vmas, long nr_pages)
return false;
 }
 
-#ifdef CONFIG_CMA
-static struct page *new_non_cma_page(struct page *page, unsigned long private)
-{
-   /*
-* We want to make sure we allocate the new page from the same node
-* as the source page.
-*/
-   int nid = page_to_nid(page);
-   /*
-* Trying to allocate a page for migration. Ignore allocation
-* failure warnings. We don't force __GFP_THISNODE here because
-* this node here is the node where we have CMA reservation and
-* in some case these nodes will have really less non movable
-* allocation memory.
-*/
-   gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
-
-   if (PageHighMem(page))
-   gfp_mask |= __GFP_HIGHMEM;
-
-#ifdef CONFIG_HUGETLB_PAGE
-   if (PageHuge(page)) {
-   struct hstate *h = page_hstate(page);
-   /*
-* We don't want to dequeue from the pool because pool pages will
-* mostly be from the CMA region.
-*/
-   return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
-   }
-#endif
-   if (PageTransHuge(page)) {
-   struct page *thp;
-   /*
-* ignore allocation failure warnings
-*/
-   gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
-
-   /*
-* Remove the movable mask so that we don't allocate from
-* CMA area again.
-*/
-   thp_gfpmask &= ~__GFP_MOVABLE;
-   thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
-   if (!thp)
-   return NULL;
-   prep_transhuge_page(thp);
-   return thp;
-   }
-
-   return __alloc_pages_node(nid, gfp_mask, 0);
-}
-
-static long check_and_migrate_cma_pages(struct task_struct *tsk,
-   struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
-   struct page **pages,
-   struct vm_area_struct **vmas,
-   unsigned int gup_flags)
-{
-   long i;
-   bool drain_allow = true;
-   bool migrate_allow = true;
-   LIST_HEAD(cma_page_list);
-
-check_again:
-   for (i = 0; i < nr_pages; i++) {
-   /*
-* If we get a page from the CMA zone, since we are going to
-* be pinning these entries, we might as well move them out
-* of the CMA zone if possible.
-*/
-   if (is_migrate_cma_page(pages[i])) {
-
-   struct page *head = compound_head(pages[i]);
-
-   if (PageHuge(head)) {
-   isolate_huge_page(head, &cma_page_list);
-   } else {
-   if (!PageLRU(head) && drain_allow) {
-   lru_add_drain_all();
-   drain_allow = false;
-   }
-
-   if (!isolate_lru_page(head)) {
-   list_add_tail(&head->lru, &cma_page_list);
-   mod_node_page_state(page_pgdat(head),
-   NR_ISOLATED_ANON +
-   page_is_file_cache(head),
-   hpage_nr_pages(head));
-   }
-   }
-   }
-   }
-
-   if (!list_empty(&cma_page_list)) {
-   /*
-* drop the above get_user_pages reference.
-*/
-   for (i = 0; i < nr_pages; i++)
-   put_page(pages[i]);
-
-   if (migrate_pages(&cma_page_list, new_non_cma_page,
- NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
-   /*
-* some of the pages fail

Re: [PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-04 Thread Pingfan Liu
On Tue, Jun 4, 2019 at 3:08 PM Christoph Hellwig  wrote:
>
> On Mon, Jun 03, 2019 at 04:56:10PM -0700, Ira Weiny wrote:
> > On Mon, Jun 03, 2019 at 09:42:06AM -0700, Christoph Hellwig wrote:
> > > > +#if defined(CONFIG_CMA)
> > >
> > > You can just use #ifdef here.
> > >
> > > > +static inline int reject_cma_pages(int nr_pinned, unsigned int 
> > > > gup_flags,
> > > > + struct page **pages)
> > >
> > > Please use two instead of one tab to indent the continuing line of
> > > a function declaration.
> > >
> > > > +{
> > > > + if (unlikely(gup_flags & FOLL_LONGTERM)) {
> > >
> > > IMHO it would be a little nicer if we could move this into the caller.
> >
> > FWIW we already had this discussion and thought it better to put this here.
> >
> > https://lkml.org/lkml/2019/5/30/1565
>
> I don't see any discussion like this.  FYI, this is what I mean,
> code might be easier than words:
>
>
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e4..62d770b18e2c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2197,6 +2197,27 @@ static int __gup_longterm_unlocked(unsigned long 
> start, int nr_pages,
> return ret;
>  }
>
> +#ifdef CONFIG_CMA
> +static int reject_cma_pages(struct page **pages, int nr_pinned)
> +{
> +   int i = 0;
> +
> +   for (i = 0; i < nr_pinned; i++)
> +   if (is_migrate_cma_page(pages[i])) {
> +   put_user_pages(pages + i, nr_pinned - i);
> +   return i;
> +   }
> +
> +   return nr_pinned;
> +}
> +#else
> +static inline int reject_cma_pages(struct page **pages, int nr_pinned)
> +{
> +   return nr_pinned;
> +}
> +#endif /* CONFIG_CMA */
> +
>  /**
>   * get_user_pages_fast() - pin user pages in memory
>   * @start: starting user address
> @@ -2237,6 +2258,9 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
> ret = nr;
> }
>
> +   if (nr && unlikely(gup_flags & FOLL_LONGTERM))
> +   nr = reject_cma_pages(pages, nr);
> +
Yeah. Looks better to keep reject_cma_pages() away from gup flags.

> if (nr < nr_pages) {
> /* Try to get the remaining pages with get_user_pages */
> start += nr << PAGE_SHIFT;


Re: [PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-04 Thread Pingfan Liu
On Tue, Jun 4, 2019 at 3:17 PM Christoph Hellwig  wrote:
>
> On Tue, Jun 04, 2019 at 03:13:21PM +0800, Pingfan Liu wrote:
> > Is it a convention? scripts/checkpatch.pl can not detect it. Could you
> > show me some light so later I can avoid it?
>
> If you look at most kernel code you can see two conventions:
>
>  - double tabe indent
>  - indent to the start of the first agument line
>
> Everything else is rather unusual.
OK.
Thanks


Re: [PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-04 Thread Pingfan Liu
On Tue, Jun 4, 2019 at 12:42 AM Christoph Hellwig  wrote:
>
> > +#if defined(CONFIG_CMA)
>
> You can just use #ifdef here.
>
OK.
> > +static inline int reject_cma_pages(int nr_pinned, unsigned int gup_flags,
> > + struct page **pages)
>
> Please use two instead of one tab to indent the continuing line of
> a function declaration.
>
Is it a convention? scripts/checkpatch.pl cannot detect it. Could you
shed some light on it so that I can avoid it later?

Thanks for your review.

Regards,
  Pingfan
> > +{
> > + if (unlikely(gup_flags & FOLL_LONGTERM)) {
>
> IMHO it would be a little nicer if we could move this into the caller.


[PATCHv2 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-03 Thread Pingfan Liu
As for FOLL_LONGTERM, it is checked in the slow path
__gup_longterm_unlocked(), but it is not checked in the fast path, which
means a CMA page can leak through this crack and end up long-term pinned.

Place a check in the fast path.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..6fe2feb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2196,6 +2196,29 @@ static int __gup_longterm_unlocked(unsigned long start, 
int nr_pages,
return ret;
 }
 
+#if defined(CONFIG_CMA)
+static inline int reject_cma_pages(int nr_pinned, unsigned int gup_flags,
+   struct page **pages)
+{
+   if (unlikely(gup_flags & FOLL_LONGTERM)) {
+   int i = 0;
+
+   for (i = 0; i < nr_pinned; i++)
+   if (is_migrate_cma_page(pages[i])) {
+   put_user_pages(pages + i, nr_pinned - i);
+   return i;
+   }
+   }
+   return nr_pinned;
+}
+#else
+static inline int reject_cma_pages(int nr_pinned, unsigned int gup_flags,
+   struct page **pages)
+{
+   return nr_pinned;
+}
+#endif
+
 /**
  * get_user_pages_fast() - pin user pages in memory
  * @start: starting user address
@@ -2236,6 +2259,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
ret = nr;
}
 
+   nr = reject_cma_pages(nr, gup_flags, pages);
if (nr < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
-- 
2.7.5



[PATCHv2 2/2] mm/gup: rename nr as nr_pinned in get_user_pages_fast()

2019-06-03 Thread Pingfan Liu
To better reflect the held state of pages and make code self-explaining,
rename nr as nr_pinned.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6fe2feb..106ab22 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2239,7 +2239,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
 {
unsigned long addr, len, end;
-   int nr = 0, ret = 0;
+   int nr_pinned = 0, ret = 0;
 
start &= PAGE_MASK;
addr = start;
@@ -2254,26 +2254,26 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages,
 
if (gup_fast_permitted(start, nr_pages)) {
local_irq_disable();
-   gup_pgd_range(addr, end, gup_flags, pages, &nr);
+   gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
local_irq_enable();
-   ret = nr;
+   ret = nr_pinned;
}
 
-   nr = reject_cma_pages(nr, gup_flags, pages);
-   if (nr < nr_pages) {
+   nr_pinned = reject_cma_pages(nr_pinned, gup_flags, pages);
+   if (nr_pinned < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
-   start += nr << PAGE_SHIFT;
-   pages += nr;
+   start += nr_pinned << PAGE_SHIFT;
+   pages += nr_pinned;
 
-   ret = __gup_longterm_unlocked(start, nr_pages - nr,
+   ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
  gup_flags, pages);
 
/* Have to be a bit careful with return values */
-   if (nr > 0) {
+   if (nr_pinned > 0) {
if (ret < 0)
-   ret = nr;
+   ret = nr_pinned;
else
-   ret += nr;
+   ret += nr_pinned;
}
}
 
-- 
2.7.5



Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-06-02 Thread Pingfan Liu
On Fri, May 31, 2019 at 5:03 PM Michal Hocko  wrote:
>
> On Thu 30-05-19 20:55:32, Pingfan Liu wrote:
> > On Wed, May 29, 2019 at 2:20 AM Michal Hocko  wrote:
> > >
> > > [Sorry for a late reply]
> > >
> > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> > > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > > [...]
> > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > > > > cover this purpose.
> > > > >
> > > > > I do not really think we want to spread N_MEMORY outside of the core 
> > > > > MM.
> > > > > It is quite confusing IMHO.
> > > > > .
> > > > But it has already like this. Just git grep N_MEMORY.
> > >
> > > I might be wrong but I suspect a closer review would reveal that the use
> > > will be inconsistent or dubious so following the existing users is not
> > > the best approach.
> > >
> > > > > > Furthermore, changing the definition of online may
> > > > > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > > > > it calls for_each_online_node.
> > > > >
> > > > > Could you be more specific please? Why should numa balancing consider
> > > > > nodes without any memory?
> > > > >
> > > > As my understanding, the destination cpu can be on a memory less node.
> > > > BTW, there are several functions in the scheduler facing the same
> > > > scenario, task_numa_migrate() is an example.
> > >
> > > Even if the destination node is memoryless then any migration would fail
> > > because there is no memory. Anyway I still do not see how using online
> > > node would break anything.
> > >
> > Suppose we have nodes A, B,C, where C is memory less but has little
> > distance to B, comparing with the one from A to B. Then if a task is
> > running on A, but prefer to run on B due to memory footprint.
> > task_numa_migrate() allows us to migrate the task to node C. Changing
> > for_each_online_node will break this.
>
> That would require the task to have preferred node to be C no? Or do I
> missunderstand the task migration logic?
I think in task_numa_migrate(), the migration logic looks like:
  env.dst_nid = p->numa_preferred_nid; // Here dst_nid is B
But later, in
  if (env.best_cpu == -1 || (p->numa_group &&
      p->numa_group->active_nodes > 1)) {
          for_each_online_node(nid) {
                  [...]
                  task_numa_find_cpu(&env, taskimp, groupimp); // Here is a
                                // chance to change p->numa_preferred_nid

There are several other places broken by changing for_each_online_node():
-1. show_numa_stats()
-2. init_numa_topology_type(), where sched_numa_topology_type may be
mistakenly evaluated.
-3. ... one can check the calls to for_each_online_node() in the scheduler
one by one.

That is my understanding of the code.

Thanks,
  Pingfan


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-02 Thread Pingfan Liu
On Sat, Jun 1, 2019 at 1:06 AM John Hubbard  wrote:
>
> On 5/31/19 4:05 AM, Pingfan Liu wrote:
> > On Fri, May 31, 2019 at 7:21 AM John Hubbard  wrote:
> >> On 5/30/19 2:47 PM, Ira Weiny wrote:
> >>> On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> >> [...]
> >> Rather lightly tested...I've compile-tested with CONFIG_CMA and 
> >> !CONFIG_CMA,
> >> and boot tested with CONFIG_CMA, but could use a second set of eyes on 
> >> whether
> >> I've added any off-by-one errors, or worse. :)
> >>
> > Do you mind I send V2 based on your above patch? Anyway, it is a simple bug 
> > fix.
> >
>
> Sure, that's why I sent it. :)  Note that Ira also recommended splitting the
> "nr --> nr_pinned" renaming into a separate patch.
>
Thanks for your kind help. I will split the nr_pinned renaming out into a
separate patch.

Regards,
  Pingfan


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-02 Thread Pingfan Liu
On Sat, Jun 1, 2019 at 1:12 AM Ira Weiny  wrote:
>
> On Fri, May 31, 2019 at 07:05:27PM +0800, Pingfan Liu wrote:
> > On Fri, May 31, 2019 at 7:21 AM John Hubbard  wrote:
> > >
> > >
> > > Rather lightly tested...I've compile-tested with CONFIG_CMA and 
> > > !CONFIG_CMA,
> > > and boot tested with CONFIG_CMA, but could use a second set of eyes on 
> > > whether
> > > I've added any off-by-one errors, or worse. :)
> > >
> > Do you mind I send V2 based on your above patch? Anyway, it is a simple bug 
> > fix.
>
> FWIW please split out the nr_pinned change to a separate patch.
>
OK.

Thanks,
  Pingfan


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-31 Thread Pingfan Liu
On Fri, May 31, 2019 at 7:21 AM John Hubbard  wrote:
>
> On 5/30/19 2:47 PM, Ira Weiny wrote:
> > On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> [...]
> >> +for (j = i; j < nr; j++)
> >> +put_page(pages[j]);
> >
> > Should be put_user_page() now.  For now that just calls put_page() but it is
> > slated to change soon.
> >
> > I also wonder if this would be more efficient as a check as we are walking 
> > the
> > page tables and bail early.
> >
> > Perhaps the code complexity is not worth it?
>
> Good point, it might be worth it. Because now we've got two loops that
> we run, after the interrupts-off page walk, and it's starting to look like
> a potential performance concern.
>
> >
> >> +nr = i;
> >
> > Why not just break from the loop here?
> >
> > Or better yet just use 'i' in the inner loop...
> >
>
> ...but if you do end up putting in the after-the-fact check, then we can
> go one or two steps further in cleaning it up, by:
>
> * hiding the visible #ifdef that was slicing up gup_fast,
>
> * using put_user_pages() instead of either put_page or put_user_page,
>   thus getting rid of j entirely, and
>
> * renaming an ancient minor confusion: nr --> nr_pinned),
>
> we could have this, which is looks cleaner and still does the same thing:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index f173fcbaf1b2..0c1f36be1863 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1486,6 +1486,33 @@ static __always_inline long 
> __gup_longterm_locked(struct task_struct *tsk,
>  }
>  #endif /* CONFIG_FS_DAX || CONFIG_CMA */
>
> +#ifdef CONFIG_CMA
> +/*
> + * Returns the number of pages that were *not* rejected. This makes it
> + * exactly compatible with its callers.
> + */
> +static int reject_cma_pages(int nr_pinned, unsigned gup_flags,
> +   struct page **pages)
> +{
> +   int i = 0;
> +   if (unlikely(gup_flags & FOLL_LONGTERM)) {
> +
> +   for (i = 0; i < nr_pinned; i++)
> +   if (is_migrate_cma_page(pages[i])) {
> +   put_user_pages(&pages[i], nr_pinned - i);
> +   break;
> +   }
> +   }
> +   return i;
> +}
> +#else
> +static int reject_cma_pages(int nr_pinned, unsigned gup_flags,
> +   struct page **pages)
> +{
> +   return nr_pinned;
> +}
> +#endif
> +
>  /*
>   * This is the same as get_user_pages_remote(), just with a
>   * less-flexible calling convention where we assume that the task
> @@ -2216,7 +2243,7 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
> unsigned int gup_flags, struct page **pages)
>  {
> unsigned long addr, len, end;
> -   int nr = 0, ret = 0;
> +   int nr_pinned = 0, ret = 0;
>
> start &= PAGE_MASK;
> addr = start;
> @@ -2231,25 +2258,27 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>
> if (gup_fast_permitted(start, nr_pages)) {
> local_irq_disable();
> -   gup_pgd_range(addr, end, gup_flags, pages, &nr);
> +   gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
> local_irq_enable();
> -   ret = nr;
> +   ret = nr_pinned;
> }
>
> -   if (nr < nr_pages) {
> +   nr_pinned = reject_cma_pages(nr_pinned, gup_flags, pages);
> +
> +   if (nr_pinned < nr_pages) {
> /* Try to get the remaining pages with get_user_pages */
> -   start += nr << PAGE_SHIFT;
> -   pages += nr;
> +   start += nr_pinned << PAGE_SHIFT;
> +   pages += nr_pinned;
>
> -   ret = __gup_longterm_unlocked(start, nr_pages - nr,
> +   ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
>   gup_flags, pages);
>
> /* Have to be a bit careful with return values */
> -   if (nr > 0) {
> +   if (nr_pinned > 0) {
> if (ret < 0)
> -   ret = nr;
> +   ret = nr_pinned;
> else
> -   ret += nr;
> +   ret += nr_pinned;
> }
> }
>
>
> Rather lightly tested...I've compile-tested with CONFIG_CMA and !CONFIG_CMA,
> and boot tested with CONFIG_CMA, but could use a second set of eyes on whether
> I've added any off-by-one errors, or worse. :)
>
Do you mind if I send V2 based on your patch above? Anyway, it is a simple bug
fix.

Thanks,
  Pingfan


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-31 Thread Pingfan Liu
On Fri, May 31, 2019 at 7:52 AM Ira Weiny  wrote:
>
> On Thu, May 30, 2019 at 04:21:19PM -0700, John Hubbard wrote:
> > On 5/30/19 2:47 PM, Ira Weiny wrote:
> > > On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> > [...]
> > >> +  for (j = i; j < nr; j++)
> > >> +  put_page(pages[j]);
> > >
> > > Should be put_user_page() now.  For now that just calls put_page() but it 
> > > is
> > > slated to change soon.
> > >
> > > I also wonder if this would be more efficient as a check as we are 
> > > walking the
> > > page tables and bail early.
> > >
> > > Perhaps the code complexity is not worth it?
> >
> > Good point, it might be worth it. Because now we've got two loops that
> > we run, after the interrupts-off page walk, and it's starting to look like
> > a potential performance concern.
>
> FWIW I don't see this being a huge issue at the moment.  Perhaps those more
> familiar with CMA can weigh in here.  How was this issue found?  If it was
> found by running some test perhaps that indicates a performance preference?
>
I found the bug by reading the code. And I do not see any performance
concern. Bailing out early contributes little to performance, as we fall
back to the slow path immediately.

Regards,
  Pingfan


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-31 Thread Pingfan Liu
On Fri, May 31, 2019 at 5:46 AM Ira Weiny  wrote:
>
> On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> > As for FOLL_LONGTERM, it is checked in the slow path
> > __gup_longterm_unlocked(). But it is not checked in the fast path, which
> > means a possible leak of CMA page to longterm pinned requirement through
> > this crack.
> >
> > Place a check in the fast path.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Ira Weiny 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: Dan Williams 
> > Cc: Matthew Wilcox 
> > Cc: John Hubbard 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: Keith Busch 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  mm/gup.c | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f173fcb..00feab3 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2235,6 +2235,18 @@ int get_user_pages_fast(unsigned long start, int 
> > nr_pages,
> >   local_irq_enable();
> >   ret = nr;
> >   }
> > +#if defined(CONFIG_CMA)
> > + if (unlikely(gup_flags & FOLL_LONGTERM)) {
> > + int i, j;
> > +
> > + for (i = 0; i < nr; i++)
> > + if (is_migrate_cma_page(pages[i])) {
> > + for (j = i; j < nr; j++)
> > + put_page(pages[j]);
>
> Should be put_user_page() now.  For now that just calls put_page() but it is
> slated to change soon.
>
I was not aware of these changes. And I get your point now.

> I also wonder if this would be more efficient as a check as we are walking the
> page tables and bail early.
>
> Perhaps the code complexity is not worth it?
>
Yes. That would spread such logic into both the huge page and the normal
page paths.
> > + nr = i;
>
> Why not just break from the loop here?
>
A mistake.

Thanks,
  Pingfan
> Or better yet just use 'i' in the inner loop...
>
> Ira
>
> > + }
> > + }
> > +#endif
> >
> >   if (nr < nr_pages) {
> >   /* Try to get the remaining pages with get_user_pages */
> > --
> > 2.7.5
> >


Re: [PATCHv5 0/2] x86/boot/KASLR: skip the specified crashkernel region

2019-05-30 Thread Pingfan Liu
Maintainers, ping?
Hi Borislav, during the review of V4, you suggested re-designing the
return value of parse_crashkernel(); the latest attempt is at
https://lore.kernel.org/patchwork/patch/1065514/. It seems hard to
make progress in that thread. On the other hand, my series "[PATCHv5 0/2]
x86/boot/KASLR: skip the specified crashkernel region" does not depend
on the re-design of the return value of parse_crashkernel().

Thanks,
  Pingfan
On Tue, May 7, 2019 at 12:32 PM Pingfan Liu  wrote:
>
> crashkernel=x@y or or =range1:size1[,range2:size2,...]@offset option may
> fail to reserve the required memory region if KASLR puts kernel into the
> region. To avoid this uncertainty, asking KASLR to skip the required
> region.
> And the parsing routine can be re-used at this early boot stage.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: Baoquan He 
> Cc: Will Deacon 
> Cc: Nicolas Pitre 
> Cc: Vivek Goyal 
> Cc: Chao Fan 
> Cc: "Kirill A. Shutemov" 
> Cc: Ard Biesheuvel 
> CC: Hari Bathini 
> Cc: linux-kernel@vger.kernel.org
> ---
> v3 -> v4:
>   reuse the parse_crashkernel_xx routines
> v4 -> v5:
>   drop unnecessary initialization of crash_base in [2/2]
>
> Pingfan Liu (2):
>   kernel/crash_core: separate the parsing routines to
> lib/parse_crashkernel.c
>   x86/boot/KASLR: skip the specified crashkernel region
>
>  arch/x86/boot/compressed/kaslr.c |  40 ++
>  kernel/crash_core.c  | 273 
>  lib/Makefile |   2 +
>  lib/parse_crashkernel.c  | 289 
> +++
>  4 files changed, 331 insertions(+), 273 deletions(-)
>  create mode 100644 lib/parse_crashkernel.c
>
> --
> 2.7.4
>
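
(For readers without the earlier versions at hand: the options this series
cares about are documented in kernel-parameters.txt. The simple form reserves
a fixed window, e.g.

	crashkernel=256M@16M

i.e. 256M at physical offset 16M, while the range form picks the size based
on the amount of installed RAM, e.g.

	crashkernel=512M-2G:64M,2G-:128M@16M

i.e. 64M if RAM is between 512M and 2G, and 128M above 2G, again at offset
16M. Because both forms can pin the reservation to a fixed offset, a
KASLR-chosen kernel base may land inside that window, which is exactly the
collision this series teaches the KASLR code to avoid.)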


Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-05-30 Thread Pingfan Liu
On Wed, May 29, 2019 at 2:21 AM Michal Hocko  wrote:
>
> On Thu 23-05-19 12:00:46, Pingfan Liu wrote:
> [...]
> > > Yes, but maybe it will pay great effort on it.
> > >
> > And as a first step, we can find a way to fix the bug reported by me
> > and the one reported by Barret
>
> Can we try http://lkml.kernel.org/r/20190513140448.gj24...@dhcp22.suse.cz
> for starter?
If it turns out that changing for_each_online_node() will not break
functions in the scheduler like task_numa_migrate(), I think it will be a
good starting point. On the other hand, if it does, then I think it only
requires a slight adjustment of your patch to meet the aim.

Thanks,
  Pingfan
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-05-30 Thread Pingfan Liu
On Wed, May 29, 2019 at 2:20 AM Michal Hocko  wrote:
>
> [Sorry for a late reply]
>
> On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> > On Wed, May 22, 2019 at 7:16 PM Michal Hocko  wrote:
> > >
> > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> [...]
> > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > > cover this purpose.
> > >
> > > I do not really think we want to spread N_MEMORY outside of the core MM.
> > > It is quite confusing IMHO.
> > > .
> > But it has already like this. Just git grep N_MEMORY.
>
> I might be wrong but I suspect a closer review would reveal that the use
> will be inconsistent or dubious so following the existing users is not
> the best approach.
>
> > > > Furthermore, changing the definition of online may
> > > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > > it calls for_each_online_node.
> > >
> > > Could you be more specific please? Why should numa balancing consider
> > > nodes without any memory?
> > >
> > As my understanding, the destination cpu can be on a memory less node.
> > BTW, there are several functions in the scheduler facing the same
> > scenario, task_numa_migrate() is an example.
>
> Even if the destination node is memoryless then any migration would fail
> because there is no memory. Anyway I still do not see how using online
> node would break anything.
>
Suppose we have nodes A, B, C, where C is memoryless but is closer to B
than A is. Then if a task is running on A but prefers to run on B due to
its memory footprint, task_numa_migrate() allows us to migrate the task to
node C. Changing for_each_online_node() would break this.

Regards,
  Pingfan

> > > > By keeping the node owning cpu as online, Michal's patch can avoid
> > > > such corner case and keep things easy. Furthermore, if needed, the
> > > > other patch can use for_each_node_state(nid, N_MEMORY) to replace
> > > > for_each_online_node is some space.
> > >
> > > Ideally no code outside of the core MM should care about what kind of
> > > memory does the node really own. The external code should only care
> > > whether the node is online and thus usable or offline and of no
> > > interest.
> >
> > Yes, but maybe it will pay great effort on it.
>
> Even if that is the case it would be preferable because the current
> situation is just not sustainable wrt maintenance cost. It is just too
> simple to break the existing logic as this particular report outlines.
> --
> Michal Hocko
> SUSE Labs


[PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-05-29 Thread Pingfan Liu
As for FOLL_LONGTERM, it is checked in the slow path
__gup_longterm_unlocked(), but it is not checked in the fast path, which
means a CMA page can leak through this crack and end up long-term pinned.

Place a check in the fast path.

Signed-off-by: Pingfan Liu 
Cc: Ira Weiny 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Matthew Wilcox 
Cc: John Hubbard 
Cc: "Aneesh Kumar K.V" 
Cc: Keith Busch 
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..00feab3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2235,6 +2235,18 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages,
local_irq_enable();
ret = nr;
}
+#if defined(CONFIG_CMA)
+   if (unlikely(gup_flags & FOLL_LONGTERM)) {
+   int i, j;
+
+   for (i = 0; i < nr; i++)
+   if (is_migrate_cma_page(pages[i])) {
+   for (j = i; j < nr; j++)
+   put_page(pages[j]);
+   nr = i;
+   }
+   }
+#endif
 
if (nr < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
-- 
2.7.5



Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant

2019-05-23 Thread Pingfan Liu
Matthias, ping? Any suggestions?

Thanks,
Pingfan


On Thu, May 2, 2019 at 2:22 PM Pingfan Liu  wrote:
>
> On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu  wrote:
> >
> > On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger  wrote:
> > >
> > >
> > [...]
> > > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char 
> > > > *cmdline,
> > > >   pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> > > >   return -EINVAL;
> > > >   }
> > > > + if (*crash_size == 0)
> > > > + return -EINVAL;
> > >
> > > This covers the case where I pass an argument like "crashkernel=0M" ?
> > > Can't we fix that by using kstrtoull() in memparse and check if the 
> > > return value
> > > is < 0? In that case we could return without updating the retptr and we 
> > > will be
> > > fine.
> After a series of work, I suddenly realized that it can not be done
> like this way. "0M" causes kstrtoull() to return -EINVAL, but this is
> caused by "M", not "0". If passing "0" to kstrtoull(), it will return
> 0 on success.
>
> > >
> > It seems that kstrtoull() treats 0M as invalid parameter, while
> > simple_strtoull() does not.
> >
> My careless going through the code. And I tested with a valid value
> "256M" using kstrtoull(), it also returned -EINVAL.
>
> So I think there is no way to distinguish 0 from a positive value
> inside this basic math function.
> Do I miss anything?
>
> Thanks and regards,
> Pingfan
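
(The behaviour described above follows from the helpers' contracts:
kstrtoull() rejects any trailing non-digit characters, while memparse(),
built on simple_strtoull(), stops at the first non-digit and interprets the
K/M/G suffix itself. A small sketch, assuming the usual lib/ helpers:

	unsigned long long val;
	char *end;
	int rc;

	/* kstrtoull() wants the whole string to be a number: "256M" -> -EINVAL */
	rc = kstrtoull("256M", 0, &val);

	/*
	 * memparse() stops at the suffix and scales the value:
	 * "256M" -> 268435456 and "0M" -> 0, so a zero size still has to be
	 * rejected separately, as the parse_crashkernel_simple() hunk quoted
	 * above does.
	 */
	val = memparse("256M", &end);

So neither helper alone distinguishes "the user asked for 0" from "nothing
could be parsed", which is why the explicit *crash_size == 0 check was
added.)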

