Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-02-07 Thread Hugh Dickins
On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> > From: Petr Holasek 
> > 
> > Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which controls merging of pages across different NUMA nodes.
> > When it is set to zero, only pages from the same node are merged;
> > otherwise pages from all nodes can be merged together (default behavior).
> > 
> > A typical use-case could be a lot of KVM guests on a NUMA machine,
> > where cpus on more distant nodes would see a significant increase
> > in access latency to the merged ksm page. A sysfs knob was chosen
> > for greater flexibility, since some users may still prefer a higher
> > amount of saved physical memory regardless of access latency.
> > 
> 
> This is understandable but it's going to be a fairly obscure option.
> I do not think it can be known in advance if the option should be set.
> The user must either run benchmarks before and after or use perf to
> record the "node-load-misses" event and see if setting the parameter
> reduces the number of remote misses.

Andrew made a similar point on the description of merge_across_nodes
in ksm.txt.  Petr's quiet at the moment, so I'll add a few more lines
to that description (in an incremental patch): but be assured what I say
will remain inadequate and unspecific - I don't have much idea of how to
decide the setting, but assume that the people who are interested in
using the knob will have a firmer idea of how to test for it.

> 
> I don't know the internals of ksm.c at all and this is my first time reading
> this series. Everything in this review is subject to being completely
> wrong or due to a major misunderstanding on my part. Delete all feedback
> if desired.

Thank you for spending your time on it.

[...snippings, but let's leave this paragraph in]

> > Hugh notes that this patch brings two problems, whose solution needs
> > further support in mm/ksm.c, which follows in subsequent patches:
> > 1) switching merge_across_nodes after running KSM is liable to oops
> >on stale nodes still left over from the previous stable tree;
> > 2) memory hotremove may migrate KSM pages, but there is no provision
> >here for !merge_across_nodes to migrate nodes to the proper tree.
...
> > --- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/mm/ksm.c  2013-01-25 14:36:38.608205618 -0800
...
> 
> With multiple stable node trees does the comment that begins with
> 
>  * A few notes about the KSM scanning process,
>  * to make it easier to understand the data structures below:
> 
> need an update?

Okay: I won't go through it pluralizing everything, but a couple of lines
on the !merge_across_nodes multiplicity of trees would be helpful.
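
Something along these lines, say (only a sketch of the sort of wording,
not the final text):

 * When merge_across_nodes is set to 0, rather than a single stable tree
 * and a single unstable tree for the whole system, KSM keeps one stable
 * tree and one unstable tree per NUMA node (indexed by the page's node
 * id), so that merged pages stay local to their node.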

> 
> It's uninitialised so kernel data size in vmlinux should be unaffected but
> it's an additional runtime cost of around 4K for a standardish enterprise
> distro kernel config.  Small beans on a NUMA machine and maybe not worth
> the hassle of kmalloc for nr_online_nodes and dealing with node memory
> hotplug but it's a pity.
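
(The arithmetic behind that figure, assuming CONFIG_NODES_SHIFT=8 for such
a config: two arrays of MAX_NUMNODES struct rb_root, each a single pointer,
so 2 * 256 * 8 bytes = 4096 bytes of .bss on 64-bit.)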

It's a pity, I agree; as is the addition of int nid into rmap_item
on 32-bit (on 64-bit it just occupies a hole) - there can be a lot of
those.  We were kind of hoping that the #ifdef CONFIG_NUMA would cover
it, but some distros now enable NUMA by default even on 32-bit.  And
it's a pity because 99% of users will leave merge_across_nodes at its
default of 1 and only ever need a single tree of each kind.
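
To make the layout point concrete, here is a userspace sketch with
stand-in types (not the kernel definitions) comparing the struct size
with and without the new field:

#include <stdio.h>

/* Simplified stand-in for struct rb_node, just to get the alignment */
struct rb_node_sketch {
	unsigned long parent_color;
	struct rb_node_sketch *left, *right;
};

struct item_without_nid {
	void *mm;			/* stands in for struct mm_struct * */
	unsigned long address;
	unsigned int oldchecksum;
	union { struct rb_node_sketch node; } u;
};

struct item_with_nid {
	void *mm;
	unsigned long address;
	unsigned int oldchecksum;
	unsigned int nid;		/* the field the patch adds */
	union { struct rb_node_sketch node; } u;
};

int main(void)
{
	/* On 64-bit both print 48: nid just fills the padding before the
	 * 8-byte-aligned union.  On 32-bit the second is 4 bytes larger. */
	printf("without nid: %zu\n", sizeof(struct item_without_nid));
	printf("with nid:    %zu\n", sizeof(struct item_with_nid));
	return 0;
}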

I'll look into starting off with just root_stable_tree[1] and
root_unstable_tree[1], then kmalloc'ing nr_node_ids of them when and if
merge_across_nodes is switched off.  Then I don't think we need bother
about hotplug.  If it ends up looking clean enough, I'll add that patch.
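
Roughly the shape of it (a sketch of the idea only, with an invented
function name; kcalloc'd memory is already zeroed, which is what an
empty rb_root looks like):

static struct rb_root one_stable_tree[1] = { RB_ROOT };
static struct rb_root one_unstable_tree[1] = { RB_ROOT };
static struct rb_root *root_stable_tree = one_stable_tree;
static struct rb_root *root_unstable_tree = one_unstable_tree;

/* Called under ksm_thread_mutex when merge_across_nodes is switched off */
static int ksm_use_per_node_trees(void)
{
	struct rb_root *stable, *unstable;

	stable = kcalloc(nr_node_ids, sizeof(*stable), GFP_KERNEL);
	unstable = kcalloc(nr_node_ids, sizeof(*unstable), GFP_KERNEL);
	if (!stable || !unstable) {
		kfree(stable);
		kfree(unstable);
		return -ENOMEM;
	}
	root_stable_tree = stable;
	root_unstable_tree = unstable;
	return 0;
}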

> 
> >  #define MM_SLOTS_HASH_BITS 10
> >  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
> >  /* Milliseconds ksmd should sleep between batches */
> >  static unsigned int ksm_thread_sleep_millisecs = 20;
> >  
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
> > +
> 
> Nit but initialised data does increase the size of vmlinux so maybe this
> should be the "opposite". i.e. rename it to ksm_merge_within_nodes and
> default it to 0?

I don't find that particular increase in size very compelling!  Though
I would have preferred the tunable to be the opposite way around: it
annoys me that the new code comes into play when !ksm_merge_across_nodes.

However, I do find "merge across nodes" (thanks to Andrew for "across")
a much more vivid description than the opposite "merge within nodes",
and can't think of a better alternative for that; and wouldn't want to
change it anyway at this late (v7) stage, not without Petr's consent.

> 
> __read_mostly?

I feel the same way as I did when Andrew suggested it:
> 
> I spose this should be __read_mostly.  If __read_mostly is not really a
> synonym for __make_write_often_storage_slower.  I continue to harbor
> fear, uncertainty and doubt about this...

Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-02-05 Thread Mel Gorman
On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> From: Petr Holasek 
> 
> Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
> 
> A typical use-case could be a lot of KVM guests on a NUMA machine,
> where cpus on more distant nodes would see a significant increase
> in access latency to the merged ksm page. A sysfs knob was chosen
> for greater flexibility, since some users may still prefer a higher
> amount of saved physical memory regardless of access latency.
> 

This is understandable but it's going to be a fairly obscure option.
I do not think it can be known in advance if the option should be set.
The user must either run benchmarks before and after or use perf to
record the "node-load-misses" event and see if setting the parameter
reduces the number of remote misses.

I don't know the internals of ksm.c at all and this is my first time reading
this series. Everything in this review is subject to being completely
wrong or due to a major misunderstanding on my part. Delete all feedback
if desired.

> Every NUMA node has its own stable & unstable trees, for faster
> searching and inserting. Changing the merge_across_nodes value is possible
> only when there are no ksm shared pages in the system.
> 
> I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
> measured the speed of memory access inside KVM guests with memory pinned
> to one of the nodes, using this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
> 
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes   1.7%
> 
> merge_across_nodes=0
> 2 nodes   1%
> 4 nodes   0.32%
> 8 nodes   0.018%
> 
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
> 
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
>on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
>here for !merge_across_nodes to migrate nodes to the proper tree.
> 
> Signed-off-by: Petr Holasek 
> Signed-off-by: Hugh Dickins 
> Acked-by: Rik van Riel 
> ---
>  Documentation/vm/ksm.txt |7 +
>  mm/ksm.c |  151 -
>  2 files changed, 139 insertions(+), 19 deletions(-)
> 
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
> e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> Default: 20 (chosen for demonstration purposes)
>  
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +   When set to 0, ksm merges only pages which physically
> +   reside in the memory area of same NUMA node. It brings
> +   lower latency to access to shared page. Value can be
> +   changed only when there is no ksm shared pages in system.
> +   Default: 1
> +
>  run  - set 0 to stop ksmd from running but keep merged pages,
> set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
> set 2 to stop ksmd and unmerge all pages currently merged,
> --- mmotm.orig/mm/ksm.c   2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
> @@ -36,6 +36,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
>   struct mm_struct *mm;
>   unsigned long address;  /* + low bits used for flags below */
>   unsigned int oldchecksum;   /* when unstable */
> +#ifdef CONFIG_NUMA
> + unsigned int nid;
> +#endif
>   union {
>   struct rb_node node;/* when node of unstable tree */
>   struct {/* when listed from stable tree */

> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG  0x200   /* is listed from the stable tree */
>  
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];

Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-28 Thread Hugh Dickins
On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
> Hugh Dickins  wrote:
> 
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
> 
> I spose this should be __read_mostly.  If __read_mostly is not really a
> synonym for __make_write_often_storage_slower.  I continue to harbor
> fear, uncertainty and doubt about this...

Could do.  No strong feeling, but I think I'd rather it share its
cacheline with other KSM-related stuff, than be off mixed up with
unrelateds.  I think there's a much stronger case for __read_mostly
when it's a library thing accessed by different subsystems.
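
Spelled out, the two placements under discussion are just (sketch):

/* (a) Andrew's suggestion: let it migrate to the read-mostly section */
static unsigned int ksm_merge_across_nodes __read_mostly = 1;

/* (b) as posted: left beside the other KSM tunables, e.g. next to
 * ksm_thread_sleep_millisecs, so that they share a cacheline */
static unsigned int ksm_merge_across_nodes = 1;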

You're right that this variable is accessed significantly more often
than the other KSM tunables, so deserves a __read_mostly more than
they do.  But where to stop?  Similar reluctance led me to avoid
using "unlikely" throughout ksm.c, unlikely as some conditions are
(I'm aghast to see that Andrea sneaked in a "likely" :).

Hugh


Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-28 Thread Hugh Dickins
On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
> Hugh Dickins  wrote:
> 
> > --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> > @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
> > e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> > Default: 20 (chosen for demonstration purposes)
> >  
> > +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> > +   When set to 0, ksm merges only pages which physically
> > +   reside in the memory area of same NUMA node. It brings
> > +   lower latency to access to shared page. Value can be
> > +   changed only when there is no ksm shared pages in system.
> > +   Default: 1
> > +
> 
> The explanation doesn't really tell the operator whether or not to set
> merge_across_nodes for a particular machine/workload.
> 
> I guess most people will just shrug, turn the thing on and see if it
> improved things, but that's rather random.

Right.  I don't think we can tell them which is going to be better,
but surely we could do a better job of hinting at the tradeoffs.

I think we expect large NUMA machines with lots of memory to want the
better NUMA behavior of !merge_across_nodes, but machines with more
limited memory across short-distance NUMA nodes, to prefer the greater
deduplication of merge_across_nodes.

Petr, do you have a more informative text for this?

Hugh


Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-28 Thread Andrew Morton
On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
Hugh Dickins  wrote:

> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;

I spose this should be __read_mostly.  If __read_mostly is not really a
synonym for __make_write_often_storage_slower.  I continue to harbor
fear, uncertainty and doubt about this...




Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-28 Thread Andrew Morton
On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
Hugh Dickins  wrote:

> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
> e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> Default: 20 (chosen for demonstration purposes)
>  
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +   When set to 0, ksm merges only pages which physically
> +   reside in the memory area of same NUMA node. It brings
> +   lower latency to access to shared page. Value can be
> +   changed only when there is no ksm shared pages in system.
> +   Default: 1
> +

The explanation doesn't really tell the operator whether or not to set
merge_across_nodes for a particular machine/workload.

I guess most people will just shrug, turn the thing on and see if it
improved things, but that's rather random.



Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-27 Thread Hugh Dickins
On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote:
> > 
> > So you'd like us to add code for moving a node from one tree to another
> > in ksm_migrate_page() (and what would it do when it collides with an
> 
> Without numa awareness, I still can't understand your explanation of why
> we can't insert the node into the tree just after page migration instead
> of inserting it at the next scan.

The node is already there in the right (only) tree in that case.

> 
> > existing node?), code which will then be removed a few patches later
> > when ksm page migration is fully enabled?


Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-26 Thread Simon Jeons
On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote:
> On Sat, 26 Jan 2013, Simon Jeons wrote:
> > On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> > > From: Petr Holasek 
> > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> > >   return NULL;
> > >   }
> > >  
> > > + /*
> > > +  * If tree_page has been migrated to another NUMA node, it
> > > +  * will be flushed out and put into the right unstable tree
> > 
> > Then why not insert the new page to unstable tree during page migration
> > against current upstream? Because default behavior is merge across
> > nodes.
> 
> I don't understand the words "against current upstream" in your question.

I mean the current upstream code without numa awareness. :)

> 
> We cannot move a page (strictly, a node) from one tree to another during
> page migration itself, because the necessary ksm_thread_mutex is not held.
> Nor would we even want to while "merge across nodes".
> 
> Ah, perhaps you are pointing out that in current upstream, the only user
> of ksm page migration is memory hotremove, which in current upstream does
> hold ksm_thread_mutex.
> 
> So you'd like us to add code for moving a node from one tree to another
> in ksm_migrate_page() (and what would it do when it collides with an

Without numa awareness, I still can't understand your explanation of why
we can't insert the node into the tree just after page migration instead
of inserting it at the next scan.

> existing node?), code which will then be removed a few patches later
> when ksm page migration is fully enabled?
> 
> No, I'm not going to put any more thought into that.  When Andrea pointed
> out the problem with Petr's original change to ksm_migrate_page(), I did
> indeed think that we could do something cleverer at that point; but once
> I got down to trying it, found that a dead end.  I wasn't going to be
> able to test the hotremove case properly anyway, so no good pursuing
> solutions that couldn't be generalized.
> 
> Hugh




Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-26 Thread Hugh Dickins
On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> > From: Petr Holasek 
> > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> > return NULL;
> > }
> >  
> > +   /*
> > +* If tree_page has been migrated to another NUMA node, it
> > +* will be flushed out and put into the right unstable tree
> 
> Then why not insert the new page to unstable tree during page migration
> against current upstream? Because default behavior is merge across
> nodes.

I don't understand the words "against current upstream" in your question.

We cannot move a page (strictly, a node) from one tree to another during
page migration itself, because the necessary ksm_thread_mutex is not held.
Nor would we even want to while "merge across nodes".

Ah, perhaps you are pointing out that in current upstream, the only user
of ksm page migration is memory hotremove, which in current upstream does
hold ksm_thread_mutex.

So you'd like us to add code for moving a node from one tree to another
in ksm_migrate_page() (and what would it do when it collides with an
existing node?), code which will then be removed a few patches later
when ksm page migration is fully enabled?

No, I'm not going to put any more thought into that.  When Andrea pointed
out the problem with Petr's original change to ksm_migrate_page(), I did
indeed think that we could do something cleverer at that point; but once
I got down to trying it, found that a dead end.  I wasn't going to be
able to test the hotremove case properly anyway, so no good pursuing
solutions that couldn't be generalized.

Hugh


Re: [PATCH 1/11] ksm: allow trees per NUMA node

2013-01-26 Thread Simon Jeons
Hi Hugh,
On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> From: Petr Holasek 
> 
> Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which controls merging of pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
> 
> A typical use-case could be a lot of KVM guests on a NUMA machine,
> where cpus on more distant nodes would see a significant increase
> in access latency to the merged ksm page. A sysfs knob was chosen
> for greater flexibility, since some users may still prefer a higher
> amount of saved physical memory regardless of access latency.
> 
> Every NUMA node has its own stable & unstable trees, for faster
> searching and inserting. Changing the merge_across_nodes value is possible
> only when there are no ksm shared pages in the system.
> 
> I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
> measured the speed of memory access inside KVM guests with memory pinned
> to one of the nodes, using this benchmark:
> 
> http://pholasek.fedorapeople.org/alloc_pg.c
> 
> Population standard deviations of access times, as a percentage of the
> average, were as follows:
> 
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes   1.7%
> 
> merge_across_nodes=0
> 2 nodes   1%
> 4 nodes   0.32%
> 8 nodes   0.018%
> 
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
> 
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
>on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
>here for !merge_across_nodes to migrate nodes to the proper tree.
> 
> Signed-off-by: Petr Holasek 
> Signed-off-by: Hugh Dickins 
> Acked-by: Rik van Riel 
> ---
>  Documentation/vm/ksm.txt |7 +
>  mm/ksm.c |  151 -
>  2 files changed, 139 insertions(+), 19 deletions(-)
> 
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
> e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> Default: 20 (chosen for demonstration purposes)
>  
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +   When set to 0, ksm merges only pages which physically
> +   reside in the memory area of same NUMA node. It brings
> +   lower latency to access to shared page. Value can be
> +   changed only when there is no ksm shared pages in system.
> +   Default: 1
> +
>  run  - set 0 to stop ksmd from running but keep merged pages,
> set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
> set 2 to stop ksmd and unmerge all pages currently merged,
> --- mmotm.orig/mm/ksm.c   2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
> @@ -36,6 +36,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
>   struct mm_struct *mm;
>   unsigned long address;  /* + low bits used for flags below */
>   unsigned int oldchecksum;   /* when unstable */
> +#ifdef CONFIG_NUMA
> + unsigned int nid;
> +#endif
>   union {
>   struct rb_node node;/* when node of unstable tree */
>   struct {/* when listed from stable tree */
> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG  0x200   /* is listed from the stable tree */
>  
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];
>  
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define KSM_RUN_UNMERGE	2


[PATCH 1/11] ksm: allow trees per NUMA node

2013-01-25 Thread Hugh Dickins
From: Petr Holasek 

Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).

A typical use-case could be a lot of KVM guests on a NUMA machine,
where cpus on more distant nodes would see a significant increase
in access latency to the merged ksm page. A sysfs knob was chosen
for greater flexibility, since some users may still prefer a higher
amount of saved physical memory regardless of access latency.

Every NUMA node has its own stable & unstable trees, for faster
searching and inserting. Changing the merge_across_nodes value is possible
only when there are no ksm shared pages in the system.
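
In outline, the sysfs store handler enforces that restriction along these
lines (a sketch in terms of the existing ksm_thread_mutex and
ksm_pages_shared; the name and exact shape are illustrative, not a quote
of the hunk):

static ssize_t merge_across_nodes_store(struct kobject *kobj,
					struct kobj_attribute *attr,
					const char *buf, size_t count)
{
	int err;
	unsigned long knob;

	err = kstrtoul(buf, 10, &knob);
	if (err)
		return err;
	if (knob > 1)
		return -EINVAL;

	mutex_lock(&ksm_thread_mutex);
	if (ksm_merge_across_nodes != knob) {
		if (ksm_pages_shared)	/* refuse while any pages are merged */
			err = -EBUSY;
		else
			ksm_merge_across_nodes = knob;
	}
	mutex_unlock(&ksm_thread_mutex);

	return err ? err : count;
}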

I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
measured the speed of memory access inside KVM guests with memory pinned
to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times, as a percentage of the
average, were as follows:

merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%

merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops
   on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
   here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek 
Signed-off-by: Hugh Dickins 
Acked-by: Rik van Riel 
---
 Documentation/vm/ksm.txt |7 +
 mm/ksm.c |  151 -
 2 files changed, 139 insertions(+), 19 deletions(-)

--- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800
+++ mmotm/Documentation/vm/ksm.txt  2013-01-25 14:36:38.608205618 -0800
@@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
 
+merge_across_nodes - specifies if pages from different numa nodes can be merged.
+   When set to 0, ksm merges only pages which physically
+   reside in the memory area of same NUMA node. It brings
+   lower latency to access to shared page. Value can be
+   changed only when there is no ksm shared pages in system.
+   Default: 1
+
 run  - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:36:38.608205618 -0800
@@ -36,6 +36,7 @@
 #include <linux/hashtable.h>
 #include <linux/freezer.h>
 #include <linux/oom.h>
+#include <linux/numa.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -139,6 +140,9 @@ struct rmap_item {
struct mm_struct *mm;
unsigned long address;  /* + low bits used for flags below */
unsigned int oldchecksum;   /* when unstable */
+#ifdef CONFIG_NUMA
+   unsigned int nid;
+#endif
union {
struct rb_node node;/* when node of unstable tree */
struct {/* when listed from stable tree */
@@ -153,8 +157,8 @@ struct rmap_item {
 #define STABLE_FLAG	0x200   /* is listed from the stable tree */
 
 /* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES];
+static struct rb_root root_stable_tree[MAX_NUMNODES];
 
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
@@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;
+
 #define KSM_RUN_STOP   0
 #define KSM_RUN_MERGE  1
 #define KSM_RUN_UNMERGE	2
@@ -441,10 +448,25 @@ out:  page = NULL;
return page;
 }
 
+/*
+ * This helper is used for getting right index into array of tree roots.
+ * When merge_across_nodes knob is set to 1, there are only two rb-trees for
+ * stable and unstable pages from all nodes with roots in index 0. 
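
From that description, the helper amounts to something like the sketch
below (its name and the use of pfn_to_nid are assumptions here, not the
verbatim continuation of the hunk):

static inline int get_kpfn_nid(unsigned long kpfn)
{
	return ksm_merge_across_nodes ? 0 : pfn_to_nid(kpfn);
}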
