Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Tue, 5 Feb 2013, Mel Gorman wrote:
> On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> > From: Petr Holasek
> >
> > Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which controls merging pages across different NUMA nodes.
> > When it is set to zero, only pages from the same node are merged;
> > otherwise pages from all nodes can be merged together (default behavior).
> >
> > A typical use-case could be a lot of KVM guests on a NUMA machine,
> > where cpus from more distant nodes would see a significant increase
> > in access latency to the merged ksm page. A sysfs knob was chosen
> > for flexibility, since some users still prefer a greater amount of
> > saved physical memory regardless of access latency.
>
> This is understandable but it's going to be a fairly obscure option.
> I do not think it can be known in advance if the option should be set.
> The user must either run benchmarks before and after or use perf to
> record the "node-load-misses" event and see if setting the parameter
> reduces the number of remote misses.

Andrew made a similar point on the description of merge_across_nodes in
ksm.txt. Petr's quiet at the moment, so I'll add a few more lines to that
description (in an incremental patch): but be assured that what I say will
remain inadequate and unspecific - I don't have much idea of how to decide
the setting, but assume that the people who are interested in using the
knob will have a firmer idea of how to test for it.

> I don't know the internals of ksm.c at all and this is my first time
> reading this series. Everything in this review is subject to being
> completely wrong or due to a major misunderstanding on my part.
> Delete all feedback if desired.

Thank you for spending your time on it.
[...snippings, but let's leave this paragraph in]

> > Hugh notes that this patch brings two problems, whose solution needs
> > further support in mm/ksm.c, which follows in subsequent patches:
> > 1) switching merge_across_nodes after running KSM is liable to oops
> >    on stale nodes still left over from the previous stable tree;
> > 2) memory hotremove may migrate KSM pages, but there is no provision
> >    here for !merge_across_nodes to migrate nodes to the proper tree.
...
> > --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
...
> With multiple stable node trees does the comment that begins with
>
>  * A few notes about the KSM scanning process,
>  * to make it easier to understand the data structures below:
>
> need an update?

Okay: I won't go through it pluralizing everything, but a couple of lines
on the !merge_across_nodes multiplicity of trees would be helpful.

> It's uninitialised so kernel data size in vmlinux should be unaffected but
> it's an additional runtime cost of around 4K for a standardish enterprise
> distro kernel config. Small beans on a NUMA machine and maybe not worth
> the hassle of kmalloc for nr_online_nodes and dealing with node memory
> hotplug but it's a pity.

It's a pity, I agree; as is the addition of int nid into rmap_item on
32-bit (on 64-bit it just occupies a hole) - there can be a lot of those.
We were kind of hoping that the #ifdef CONFIG_NUMA would cover it, but
some distros now enable NUMA by default even on 32-bit. And it's a pity
because 99% of users will leave merge_across_nodes at its default of 1,
and only ever need a single tree of each kind.

I'll look into starting off with just root_stable_tree[1] and
root_unstable_tree[1], then kmalloc'ing nr_node_ids of them when and if
merge_across_nodes is switched off. Then I don't think we need bother
about hotplug. If it ends up looking clean enough, I'll add that patch.
> > #define MM_SLOTS_HASH_BITS 10
> > static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> > @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
> >  /* Milliseconds ksmd should sleep between batches */
> >  static unsigned int ksm_thread_sleep_millisecs = 20;
> >
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
> > +
>
> Nit but initialised data does increase the size of vmlinux so maybe this
> should be the "opposite". i.e. rename it to ksm_merge_within_nodes and
> default it to 0?

I don't find that particular increase in size very compelling! Though I
would have preferred the tunable to be the opposite way around: it annoys
me that the new code comes into play when !ksm_merge_across_nodes.

However, I do find "merge across nodes" (thanks to Andrew for "across")
a much more vivid description than the opposite "merge within nodes",
and can't think of a better alternative for that; and wouldn't want to
change it anyway at this late (v7) stage, not without Petr's consent.

> __read_mostly?

I feel the same way as I did when Andrew suggested it:

> I spose this should be __read_mostly.  If __read_mostly is not really a
> synonym for __make_write_often_storage_slower.  I continue to harbor
> fear, uncertainty and doubt about this...
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Fri, Jan 25, 2013 at 05:54:53PM -0800, Hugh Dickins wrote:
> From: Petr Holasek
>
> Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which controls merging pages across different NUMA nodes.
> When it is set to zero, only pages from the same node are merged;
> otherwise pages from all nodes can be merged together (default behavior).
>
> A typical use-case could be a lot of KVM guests on a NUMA machine,
> where cpus from more distant nodes would see a significant increase
> in access latency to the merged ksm page. A sysfs knob was chosen
> for flexibility, since some users still prefer a greater amount of
> saved physical memory regardless of access latency.

This is understandable but it's going to be a fairly obscure option.
I do not think it can be known in advance if the option should be set.
The user must either run benchmarks before and after or use perf to
record the "node-load-misses" event and see if setting the parameter
reduces the number of remote misses.

I don't know the internals of ksm.c at all and this is my first time
reading this series. Everything in this review is subject to being
completely wrong or due to a major misunderstanding on my part.
Delete all feedback if desired.

> Every numa node has its own stable & unstable trees because of faster
> searching and inserting. Changing of merge_across_nodes value is
> possible only when there are no ksm shared pages in the system.
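As the quoted text notes, merge_across_nodes can only be changed while there are no KSM shared pages, so an operator has to stop ksmd and unmerge first. A hedged sketch of that sequence follows; the sysfs directory is passed as a parameter only so the demo can run against a scratch directory rather than the real /sys/kernel/mm/ksm (which would need root).

```shell
set -eu

# Unmerge everything, wait until pages_shared reaches 0, then flip the
# knob and restart ksmd. $1 is the ksm sysfs directory, $2 the new value.
switch_merge_across_nodes() {
    ksm=$1 value=$2
    echo 2 > "$ksm/run"                        # stop ksmd, unmerge all pages
    while [ "$(cat "$ksm/pages_shared")" -ne 0 ]; do
        sleep 1                                # wait for unmerging to finish
    done
    echo "$value" > "$ksm/merge_across_nodes"
    echo 1 > "$ksm/run"                        # restart ksmd
}

# Demonstration against a scratch copy of the sysfs layout
# (on a real system: switch_merge_across_nodes /sys/kernel/mm/ksm 0).
d=$(mktemp -d)
echo 1 > "$d/run"
echo 0 > "$d/pages_shared"
echo 1 > "$d/merge_across_nodes"
switch_merge_across_nodes "$d" 0
cat "$d/merge_across_nodes"    # → 0
```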
> I've tested this patch on numa machines with 2, 4 and 8 nodes and
> measured speed of memory access inside of KVM guests with memory pinned
> to one of nodes with this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times in percentage of average
> were following:
>
> merge_across_nodes=1
> 2 nodes	1.4%
> 4 nodes	1.6%
> 8 nodes	1.7%
>
> merge_across_nodes=0
> 2 nodes	1%
> 4 nodes	0.32%
> 8 nodes	0.018%
>
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
>
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
>    on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
>    here for !merge_across_nodes to migrate nodes to the proper tree.
>
> Signed-off-by: Petr Holasek
> Signed-off-by: Hugh Dickins
> Acked-by: Rik van Riel
> ---
>  Documentation/vm/ksm.txt |    7 +
>  mm/ksm.c                 |  151 -
>  2 files changed, 139 insertions(+), 19 deletions(-)
>
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds
>  		e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>  		Default: 20 (chosen for demonstration purposes)
>
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +		When set to 0, ksm merges only pages which physically
> +		reside in the memory area of same NUMA node. It brings
> +		lower latency to access to shared page. Value can be
> +		changed only when there is no ksm shared pages in system.
> +		Default: 1
> +
>  run - set 0 to stop ksmd from running but keep merged pages,
>  	set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
>  	set 2 to stop ksmd and unmerge all pages currently merged,
>
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
> @@ -36,6 +36,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
>  	struct mm_struct *mm;
>  	unsigned long address;		/* + low bits used for flags below */
>  	unsigned int oldchecksum;	/* when unstable */
> +#ifdef CONFIG_NUMA
> +	unsigned int nid;
> +#endif
>  	union {
>  		struct rb_node node;	/* when node of unstable tree */
>  		struct {		/* when listed from stable tree */
> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG	0x200	/* is listed from the stable tree */
>
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
> Hugh Dickins wrote:
>
> > +/* Zeroed when merging across nodes is not allowed */
> > +static unsigned int ksm_merge_across_nodes = 1;
>
> I spose this should be __read_mostly.  If __read_mostly is not really a
> synonym for __make_write_often_storage_slower.  I continue to harbor
> fear, uncertainty and doubt about this...

Could do. No strong feeling, but I think I'd rather it share its
cacheline with other KSM-related stuff than be off mixed up with
unrelateds. I think there's a much stronger case for __read_mostly
when it's a library thing accessed by different subsystems.

You're right that this variable is accessed significantly more often
than the other KSM tunables, so deserves a __read_mostly more than they
do. But where to stop? Similar reluctance led me to avoid using
"unlikely" throughout ksm.c, unlikely as some conditions are (I'm
aghast to see that Andrea sneaked in a "likely" :).

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Mon, 28 Jan 2013, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:54:53 -0800 (PST)
> Hugh Dickins wrote:
>
> > --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> > +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> > @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds
> >  		e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
> >  		Default: 20 (chosen for demonstration purposes)
> >
> > +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> > +		When set to 0, ksm merges only pages which physically
> > +		reside in the memory area of same NUMA node. It brings
> > +		lower latency to access to shared page. Value can be
> > +		changed only when there is no ksm shared pages in system.
> > +		Default: 1
> > +
>
> The explanation doesn't really tell the operator whether or not to set
> merge_across_nodes for a particular machine/workload.
>
> I guess most people will just shrug, turn the thing on and see if it
> improved things, but that's rather random.

Right. I don't think we can tell them which is going to be better, but
surely we could do a better job of hinting at the tradeoffs. I think we
expect large NUMA machines with lots of memory to want the better NUMA
behavior of !merge_across_nodes, but machines with more limited memory
across short-distance NUMA nodes to prefer the greater deduplication of
merge_across_nodes.

Petr, do you have a more informative text for this?

Hugh
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Fri, 25 Jan 2013 17:54:53 -0800 (PST) Hugh Dickins wrote:

> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;

I spose this should be __read_mostly.  If __read_mostly is not really a
synonym for __make_write_often_storage_slower.  I continue to harbor
fear, uncertainty and doubt about this...
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Fri, 25 Jan 2013 17:54:53 -0800 (PST) Hugh Dickins wrote:

> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds
>  		e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>  		Default: 20 (chosen for demonstration purposes)
>
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +		When set to 0, ksm merges only pages which physically
> +		reside in the memory area of same NUMA node. It brings
> +		lower latency to access to shared page. Value can be
> +		changed only when there is no ksm shared pages in system.
> +		Default: 1
> +

The explanation doesn't really tell the operator whether or not to set
merge_across_nodes for a particular machine/workload.

I guess most people will just shrug, turn the thing on and see if it
improved things, but that's rather random.
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote:
> >
> > So you'd like us to add code for moving a node from one tree to another
> > in ksm_migrate_page() (and what would it do when it collides with an
>
> Without numa awareness, I still can't understand your explanation why
> we can't insert the node into the tree just after page migration,
> instead of inserting it at the next scan.

The node is already there in the right (only) tree in that case.

> > existing node?), code which will then be removed a few patches later
> > when ksm page migration is fully enabled?
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Sat, 2013-01-26 at 18:54 -0800, Hugh Dickins wrote: > On Sat, 26 Jan 2013, Simon Jeons wrote: > > On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote: > > > From: Petr Holasek > > > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i > > > return NULL; > > > } > > > > > > + /* > > > + * If tree_page has been migrated to another NUMA node, it > > > + * will be flushed out and put into the right unstable tree > > > > Then why not insert the new page to unstable tree during page migration > > against current upstream? Because default behavior is merge across > > nodes. > > I don't understand the words "against current upstream" in your question. I mean current upstream codes without numa awareness. :) > > We cannot move a page (strictly, a node) from one tree to another during > page migration itself, because the necessary ksm_thread_mutex is not held. > Not would we even want to while "merge across nodes". > > Ah, perhaps you are pointing out that in current upstream, the only user > of ksm page migration is memory hotremove, which in current upstream does > hold ksm_thread_mutex. > > So you'd like us to add code for moving a node from one tree to another > in ksm_migrate_page() (and what would it do when it collides with an Without numa awareness, I still can't understand your explanation why can't insert the node to the tree just after page migration instead of inserting it at the next scan. > existing node?), code which will then be removed a few patches later > when ksm page migration is fully enabled? > > No, I'm not going to put any more thought into that. When Andrea pointed > out the problem with Petr's original change to ksm_migrate_page(), I did > indeed think that we could do something cleverer at that point; but once > I got down to trying it, found that a dead end. I wasn't going to be > able to test the hotremove case properly anyway, so no good pursuing > solutions that couldn't be generalized. 
> Hugh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/11] ksm: allow trees per NUMA node
On Sat, 26 Jan 2013, Simon Jeons wrote:
> On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> > From: Petr Holasek
> >
> > @@ -1122,6 +1166,18 @@ struct rmap_item *unstable_tree_search_i
> > 		return NULL;
> > 	}
> >
> > +	/*
> > +	 * If tree_page has been migrated to another NUMA node, it
> > +	 * will be flushed out and put into the right unstable tree
>
> Then why not insert the new page to unstable tree during page migration
> against current upstream? Because default behavior is merge across
> nodes.

I don't understand the words "against current upstream" in your question.

We cannot move a page (strictly, a node) from one tree to another during
page migration itself, because the necessary ksm_thread_mutex is not held.
Nor would we even want to while "merge across nodes".

Ah, perhaps you are pointing out that in current upstream, the only user
of ksm page migration is memory hotremove, which in current upstream does
hold ksm_thread_mutex.

So you'd like us to add code for moving a node from one tree to another
in ksm_migrate_page() (and what would it do when it collides with an
existing node?), code which will then be removed a few patches later
when ksm page migration is fully enabled?

No, I'm not going to put any more thought into that. When Andrea pointed
out the problem with Petr's original change to ksm_migrate_page(), I did
indeed think that we could do something cleverer at that point; but once
I got down to trying it, found that a dead end. I wasn't going to be
able to test the hotremove case properly anyway, so no good pursuing
solutions that couldn't be generalized.

Hugh
Re: [PATCH 1/11] ksm: allow trees per NUMA node
Hi Hugh,

On Fri, 2013-01-25 at 17:54 -0800, Hugh Dickins wrote:
> From: Petr Holasek
>
> Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> which controls merging pages across different NUMA nodes.
> When it is set to zero only pages from the same node are merged,
> otherwise pages from all nodes can be merged together (default behavior).
>
> A typical use-case could be a lot of KVM guests on a NUMA machine,
> where cpus from more distant nodes would see a significant increase
> of access latency to the merged ksm page. A sysfs knob was chosen
> for higher variability, since some users still prefer a higher amount
> of saved physical memory regardless of access latency.
>
> Every NUMA node has its own stable & unstable trees because of faster
> searching and inserting. Changing the merge_across_nodes value is possible
> only when there are no ksm shared pages in the system.
>
> I've tested this patch on NUMA machines with 2, 4 and 8 nodes and
> measured the speed of memory access inside of KVM guests with memory
> pinned to one of the nodes with this benchmark:
>
> http://pholasek.fedorapeople.org/alloc_pg.c
>
> Population standard deviations of access times, as a percentage of the
> average, were the following:
>
> merge_across_nodes=1
> 2 nodes 1.4%
> 4 nodes 1.6%
> 8 nodes 1.7%
>
> merge_across_nodes=0
> 2 nodes 1%
> 4 nodes 0.32%
> 8 nodes 0.018%
>
> RFC: https://lkml.org/lkml/2011/11/30/91
> v1: https://lkml.org/lkml/2012/1/23/46
> v2: https://lkml.org/lkml/2012/6/29/105
> v3: https://lkml.org/lkml/2012/9/14/550
> v4: https://lkml.org/lkml/2012/9/23/137
> v5: https://lkml.org/lkml/2012/12/10/540
> v6: https://lkml.org/lkml/2012/12/23/154
> v7: https://lkml.org/lkml/2012/12/27/225
>
> Hugh notes that this patch brings two problems, whose solution needs
> further support in mm/ksm.c, which follows in subsequent patches:
> 1) switching merge_across_nodes after running KSM is liable to oops
>    on stale nodes still left over from the previous stable tree;
> 2) memory hotremove may migrate KSM pages, but there is no provision
>    here for !merge_across_nodes to migrate nodes to the proper tree.
>
> Signed-off-by: Petr Holasek
> Signed-off-by: Hugh Dickins
> Acked-by: Rik van Riel
> ---
>  Documentation/vm/ksm.txt |    7 +
>  mm/ksm.c                 |  151 -
>  2 files changed, 139 insertions(+), 19 deletions(-)
>
> --- mmotm.orig/Documentation/vm/ksm.txt	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/Documentation/vm/ksm.txt	2013-01-25 14:36:38.608205618 -0800
> @@ -58,6 +58,13 @@ sleep_millisecs - how many milliseconds
>                     e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
>                     Default: 20 (chosen for demonstration purposes)
>
> +merge_across_nodes - specifies if pages from different numa nodes can be merged.
> +                   When set to 0, ksm merges only pages which physically
> +                   reside in the memory area of same NUMA node. It brings
> +                   lower latency to access to shared page. Value can be
> +                   changed only when there is no ksm shared pages in system.
> +                   Default: 1
> +
>  run              - set 0 to stop ksmd from running but keep merged pages,
>                     set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
>                     set 2 to stop ksmd and unmerge all pages currently merged,
> --- mmotm.orig/mm/ksm.c	2013-01-25 14:36:31.724205455 -0800
> +++ mmotm/mm/ksm.c	2013-01-25 14:36:38.608205618 -0800
> @@ -36,6 +36,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
> +#include <linux/numa.h>
>
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -139,6 +140,9 @@ struct rmap_item {
>  	struct mm_struct *mm;
>  	unsigned long address;		/* + low bits used for flags below */
>  	unsigned int oldchecksum;	/* when unstable */
> +#ifdef CONFIG_NUMA
> +	unsigned int nid;
> +#endif
>  	union {
>  		struct rb_node node;	/* when node of unstable tree */
>  		struct {		/* when listed from stable tree */
> @@ -153,8 +157,8 @@ struct rmap_item {
>  #define STABLE_FLAG	0x200	/* is listed from the stable tree */
>
>  /* The stable and unstable tree heads */
> -static struct rb_root root_stable_tree = RB_ROOT;
> -static struct rb_root root_unstable_tree = RB_ROOT;
> +static struct rb_root root_unstable_tree[MAX_NUMNODES];
> +static struct rb_root root_stable_tree[MAX_NUMNODES];
>
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>
> +/* Zeroed when merging across nodes is not allowed */
> +static unsigned int ksm_merge_across_nodes = 1;
> +
>  #define KSM_RUN_STOP	0
>  #define KSM_RUN_MERGE	1
>  #define