Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped

2015-06-09 Thread Mel Gorman
On Mon, Jun 08, 2015 at 03:38:13PM -0700, Andrew Morton wrote:
> On Mon,  8 Jun 2015 13:50:53 +0100 Mel Gorman  wrote:
> 
> > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > recently accessed by other CPUs. There are many circumstances where this
> > happens but the obvious one is kswapd reclaiming pages belonging to a
> > running process as kswapd and the task are likely running on separate CPUs.
> > 
> > On small machines, this is not a significant problem but as machines
> > get larger with more cores and more memory, the cost of these IPIs can
> > be high. This patch uses a structure similar in principle to a pagevec
> > to collect a list of PFNs and CPUs that require flushing. It then sends
> > one IPI per CPU that was mapping any of those pages to flush the list of
> > PFNs. A new TLB flush helper is required for this and one is added for
> > x86. Other architectures will need to decide if batching like this is both
> > safe and worth the memory overhead. Specifically, the requirement is:
> > 
> > If a clean page is unmapped and not immediately flushed, the
> > architecture must guarantee that a write to that page from a CPU
> > with a cached TLB entry will trap a page fault.
> > 
> > This is essentially what the kernel already depends on but the window is
> > much larger with this patch applied and is worth highlighting.
> > 
> > The impact of this patch depends on the workload as measuring any benefit
> > requires both mapped pages co-located on the LRU and memory pressure. The
> > case with the biggest impact is multiple processes reading mapped pages
> > taken from the vm-scalability test suite. The test case uses NR_CPU readers
> > of mapped files that consume 10*RAM.
> > 
> > vmscale on a 4-node machine with 64G RAM and 48 CPUs
> >              4.1.0-rc6      4.1.0-rc6
> >                vanilla  batchunmap-v5
> > User             577.35         618.60
> > System          5927.06        4195.03
> > Elapsed          162.21         121.31
> > 
> > The workload completed 25% faster with 29% less CPU time.
> > 
> > From vmstats, it is known that the vanilla kernel was interrupted roughly
> > 900K times per second during the steady phase of the test while the patched
> > kernel was interrupted roughly 180K times per second.
> > 
> > The impact is much lower on a small machine
> > 
> > vmscale on a 1-node machine with 8G RAM and 1 CPU
> >              4.1.0-rc6      4.1.0-rc6
> >                vanilla  batchunmap-v5
> > User              59.14          58.86
> > System           109.15          83.78
> > Elapsed           27.32          23.14
> > 
> > It's still a noticeable improvement with vmstat showing interrupts went
> > from roughly 500K per second to 45K per second.
> 
> Looks nice.
> 
> > The patch will have no impact on workloads with no memory pressure or
> > with relatively few mapped pages.
> 
> What benefit can we expect to see to any real-world userspace?
> 

Only a small subset of workloads will see a benefit -- ones that mmap a
lot of data with working sets larger than the size of a node. Some
streaming media servers allegedly do this.

Some numerical processing applications may hit this. Those that use glibc
for large buffers get them via mmap, and if the application is larger than
a NUMA node, memory will need to be unmapped and flushed. Python/NumPy uses
large maps for large buffers (based on the paper "Doubling the Performance
of Python/Numpy with less than 100 SLOC"). Whether users of NumPy hit this
issue or not depends on whether kswapd is active.

Anecdotally, I'm aware from IRC of one user who is experimenting with a
large HTTP cache and a load generator; the time spent sending IPIs was
obvious from the profile. I asked weeks ago that they post the results
here, which they promised to do but didn't. Unfortunately, I don't know
the person's real name to cc them. Rik might.

Anecdotally, I also believe that Intel hit this with some internal
workload, but I'm basing that on idle comments at LSF/MM. However, they
were unwilling or unable to describe exactly what the test does and
against what software.

I know this is more vague than you'd like. By and large, I'm relying on
the assumption that if we are reclaiming mapped pages from kswapd context
then sending one IPI per page is stupid.
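
To put rough numbers on that, here is a hedged userspace-only sketch of the
IPI arithmetic. BATCH_TLBFLUSH_SIZE matches the constant in the patch and the
48 CPUs matches the test machine above; the page count and function names are
mine, and the model is deliberately crude -- it counts nothing but IPIs.

#include <stdio.h>

#define BATCH_TLBFLUSH_SIZE 32UL        /* same constant the patch defines */

/* Current behaviour: each unmapped page IPIs every CPU that mapped it. */
static unsigned long ipis_per_page(unsigned long pages, unsigned long cpus)
{
        return pages * cpus;
}

/* Patched behaviour: PFNs accumulate and are flushed once per full batch. */
static unsigned long ipis_batched(unsigned long pages, unsigned long cpus)
{
        unsigned long batches =
                (pages + BATCH_TLBFLUSH_SIZE - 1) / BATCH_TLBFLUSH_SIZE;

        return batches * cpus;
}

int main(void)
{
        unsigned long pages = 1000000;  /* pages reclaimed by kswapd (made up) */
        unsigned long cpus = 48;        /* CPUs that recently mapped them */

        printf("one IPI per page: %lu\n", ipis_per_page(pages, cpus));
        printf("batched:          %lu\n", ipis_batched(pages, cpus));
        return 0;
}

Even under this crude model the interrupt count drops by roughly the batch
size; the vmstat numbers above show a smaller but still large reduction in
practice.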

> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -175,6 +175,13 @@ extern struct task_group root_task_group;
> >  # define INIT_NUMA_BALANCING(tsk)
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)			\
> > +   .tlb_ubc = NULL,
> > +#else
> > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
> > +#endif
> > +
> >  #ifdef CONFIG_KASAN
> >  # define INIT_KASAN(tsk)   \
> > .kasan_depth = 1,
> > @@ -257,6 +264,7 @@ extern struct task_group root_task_group;

Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped

2015-06-08 Thread Andrew Morton
On Mon,  8 Jun 2015 13:50:53 +0100 Mel Gorman  wrote:

> An IPI is sent to flush remote TLBs when a page is unmapped that was
> recently accessed by other CPUs. There are many circumstances where this
> happens but the obvious one is kswapd reclaiming pages belonging to a
> running process as kswapd and the task are likely running on separate CPUs.
> 
> On small machines, this is not a significant problem but as machines
> get larger with more cores and more memory, the cost of these IPIs can
> be high. This patch uses a structure similar in principle to a pagevec
> to collect a list of PFNs and CPUs that require flushing. It then sends
> one IPI per CPU that was mapping any of those pages to flush the list of
> PFNs. A new TLB flush helper is required for this and one is added for
> x86. Other architectures will need to decide if batching like this is both
> safe and worth the memory overhead. Specifically, the requirement is:
> 
>   If a clean page is unmapped and not immediately flushed, the
>   architecture must guarantee that a write to that page from a CPU
>   with a cached TLB entry will trap a page fault.
> 
> This is essentially what the kernel already depends on but the window is
> much larger with this patch applied and is worth highlighting.
> 
> The impact of this patch depends on the workload as measuring any benefit
> requires both mapped pages co-located on the LRU and memory pressure. The
> case with the biggest impact is multiple processes reading mapped pages
> taken from the vm-scalability test suite. The test case uses NR_CPU readers
> of mapped files that consume 10*RAM.
> 
> vmscale on a 4-node machine with 64G RAM and 48 CPUs
>              4.1.0-rc6      4.1.0-rc6
>                vanilla  batchunmap-v5
> User             577.35         618.60
> System          5927.06        4195.03
> Elapsed          162.21         121.31
> 
> The workload completed 25% faster with 29% less CPU time.
> 
> From vmstats, it is known that the vanilla kernel was interrupted roughly
> 900K times per second during the steady phase of the test while the patched
> kernel was interrupted roughly 180K times per second.
> 
> The impact is much lower on a small machine
> 
> vmscale on a 1-node machine with 8G RAM and 1 CPU
>              4.1.0-rc6      4.1.0-rc6
>                vanilla  batchunmap-v5
> User              59.14          58.86
> System           109.15          83.78
> Elapsed           27.32          23.14
> 
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.

Looks nice.

> The patch will have no impact on workloads with no memory pressure or
> with relatively few mapped pages.

What benefit can we expect to see to any real-world userspace?

> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -175,6 +175,13 @@ extern struct task_group root_task_group;
>  # define INIT_NUMA_BALANCING(tsk)
>  #endif
>  
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)			\
> + .tlb_ubc = NULL,
> +#else
> +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
> +#endif
> +
>  #ifdef CONFIG_KASAN
>  # define INIT_KASAN(tsk) \
>   .kasan_depth = 1,
> @@ -257,6 +264,7 @@ extern struct task_group root_task_group;
>   INIT_RT_MUTEXES(tsk)\
>   INIT_VTIME(tsk) \
>   INIT_NUMA_BALANCING(tsk)\
> + INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)  \
>   INIT_KASAN(tsk) \
>  }

We don't really need any of this - init_task starts up all-zero anyway.
Maybe it's useful for documentation reasons (dubious), but I bet we've
already missed fields.
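
For what it's worth, this is just the C static-initialisation guarantee. A
standalone sketch (not kernel code; task_model and init_task_model are
made-up names) showing that a file-scope object already has every pointer
member NULL, so the explicit .tlb_ubc = NULL initialiser adds nothing:

#include <assert.h>
#include <stddef.h>

struct task_model {
        void *tlb_ubc;          /* stands in for the new task_struct pointer */
};

/* Static storage duration: all members are zero/NULL without an initialiser. */
static struct task_model init_task_model;

int main(void)
{
        assert(init_task_model.tlb_ubc == NULL);
        return 0;
}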

>
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1289,6 +1289,16 @@ enum perf_event_task_context {
>   perf_nr_task_contexts,
>  };
>  
> +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
> +#define BATCH_TLBFLUSH_SIZE 32UL
> +
> +/* Track pages that require TLB flushes */
> +struct tlbflush_unmap_batch {
> + struct cpumask cpumask;
> + unsigned long nr_pages;
> + unsigned long pfns[BATCH_TLBFLUSH_SIZE];

Why are we storing pfn's rather than page*'s?

I'm trying to get my head around what's actually in this structure.

Each thread has one of these, lazily allocated [...].  The cpumask field
contains a mask of all the CPUs which have done [...].  The careful reader
will find mm_struct.cpu_vm_mask_var and will wonder why it didn't need
documenting, sigh.

Wanna fill in the blanks?  As usual, understanding the data structure
is the key to understanding the design, so it's worth a couple of
paragraphs.  With this knowledge, the reader may understand why
try_to_unmap_fl
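
Filling in those blanks as far as the quoted hunks and the changelog allow:
each task points at one of these while it is unmapping pages, the cpumask
accumulates the CPUs that recently had any of the batched pages mapped, and
pfns[] holds the pages whose remote TLB entries still need flushing. Below is
a hedged userspace-only model of that lifecycle; the plain bitmask, the helper
names and the printf stand in for struct cpumask, the real rmap hooks and the
actual IPI, and none of it is the patch's code.

#include <stdio.h>

#define BATCH_TLBFLUSH_SIZE 32UL

struct tlbflush_unmap_batch_model {
        unsigned long cpumask;          /* CPUs that mapped any batched page */
        unsigned long nr_pages;         /* PFNs collected so far */
        unsigned long pfns[BATCH_TLBFLUSH_SIZE]; /* pages awaiting a flush */
};

/* Drain the batch: notionally one IPI per CPU in the mask, then reset. */
static void flush_batch(struct tlbflush_unmap_batch_model *ubc)
{
        for (unsigned long cpu = 0; cpu < 8 * sizeof(ubc->cpumask); cpu++)
                if (ubc->cpumask & (1UL << cpu))
                        printf("IPI cpu %lu: flush %lu pfns\n",
                               cpu, ubc->nr_pages);

        ubc->cpumask = 0;
        ubc->nr_pages = 0;
}

/* Record one unmapped page and the set of CPUs that had it mapped. */
static void batch_unmap(struct tlbflush_unmap_batch_model *ubc,
                        unsigned long pfn, unsigned long cpus_mapping)
{
        ubc->pfns[ubc->nr_pages++] = pfn;
        ubc->cpumask |= cpus_mapping;

        if (ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
                flush_batch(ubc);
}

int main(void)
{
        struct tlbflush_unmap_batch_model ubc = { 0 };

        for (unsigned long pfn = 0x1000; pfn < 0x1000 + 100; pfn++)
                batch_unmap(&ubc, pfn, 0x3UL);  /* mapped on CPUs 0 and 1 */

        flush_batch(&ubc);                      /* drain the partial batch */
        return 0;
}

As for PFNs versus page pointers: my guess is that the receiving CPU only
needs an address to hand to flush_local_tlb_addr() (pfn << PAGE_SHIFT), not
the struct page, but that is for Mel to confirm.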

[PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped

2015-06-08 Thread Mel Gorman
An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines
get larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI per CPU that was mapping any of those pages to flush the list of
PFNs. A new TLB flush helper is required for this and one is added for
x86. Other architectures will need to decide if batching like this is both
safe and worth the memory overhead. Specifically, the requirement is:

If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that page from a CPU
with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
              4.1.0-rc6      4.1.0-rc6
                vanilla  batchunmap-v5
User             577.35         618.60
System          5927.06        4195.03
Elapsed          162.21         121.31

The workload completed 25% faster with 29% less CPU time.

From vmstats, it is known that the vanilla kernel was interrupted roughly
900K times per second during the steady phase of the test while the patched
kernel was interrupted roughly 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
              4.1.0-rc6      4.1.0-rc6
                vanilla  batchunmap-v5
User              59.14          58.86
System           109.15          83.78
Elapsed           27.32          23.14

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or
with relatively few mapped pages.

Signed-off-by: Mel Gorman 
---
 arch/x86/Kconfig|   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 include/linux/init_task.h   |   8 +++
 include/linux/rmap.h|   3 ++
 include/linux/sched.h   |  14 ++
 init/Kconfig|   8 +++
 kernel/fork.c   |   5 ++
 kernel/sched/core.c |   3 ++
 mm/internal.h   |  11 +
 mm/rmap.c   | 105 +++-
 mm/vmscan.c |  23 -
 11 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..4e8bd86735af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -31,6 +31,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+   select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)			\
+   .tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)   \
.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
INIT_RT_MUTEXES(tsk)\
INIT_VTIME(tsk) \
INIT_NUMA_BALANCING(tsk)\
+

Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped

2015-04-26 Thread Rik van Riel
On 04/25/2015 01:45 PM, Mel Gorman wrote:
> An IPI is sent to flush remote TLBs when a page is unmapped that was
> recently accessed by other CPUs. There are many circumstances where this
> happens but the obvious one is kswapd reclaiming pages belonging to a
> running process as kswapd and the task are likely running on separate CPUs.

> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.
> 
> The patch will have no impact on workloads with no memory pressure or
> with relatively few mapped pages.
> 
> Signed-off-by: Mel Gorman 

Reviewed-by: Rik van Riel 


-- 
All rights reversed


[PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped

2015-04-25 Thread Mel Gorman
An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines
get larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI per CPU that was mapping any of those pages to flush the list of
PFNs. A new TLB flush helper is required for this and one is added for
x86. Other architectures will need to decide if batching like this is both
safe and worth the memory overhead. Specifically, the requirement is:

If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that page from a CPU
with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
                                          4.0.0              4.0.0
                                        vanilla      batchunmap-v4
lru-file-mmap-read-elapsed     166.90 (  0.00%)   119.80 ( 28.22%)

                 4.0.0          4.0.0
               vanilla  batchunmap-v3
User            564.25         623.74
System         6168.10        4196.53
Elapsed         168.29         121.14

This shows that the readers completed 25% faster with 30% less CPU time. From
vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test while the patched kernel
was interrupted roughly 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
                                              4.0.0             4.0.0
                                            vanilla     batchunmap-v4
Ops lru-file-mmap-read-elapsed      22.19 (  0.00%)   19.90 ( 10.32%)

                 4.0.0          4.0.0
               vanilla  batchunmap-v4
User             33.49          32.41
System           35.29          33.23
Elapsed          23.07          21.46

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or
with relatively few mapped pages.

Signed-off-by: Mel Gorman 
---
 arch/x86/Kconfig|   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 include/linux/init_task.h   |   8 +++
 include/linux/rmap.h|   3 ++
 include/linux/sched.h   |  14 ++
 init/Kconfig|   8 +++
 kernel/fork.c   |   5 ++
 kernel/sched/core.c |   3 ++
 mm/internal.h   |  11 +
 mm/rmap.c   | 105 +++-
 mm/vmscan.c |  23 -
 11 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..290844263218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+   select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)			\
+   .tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#