Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

2021-04-15 Thread Mel Gorman
On Wed, Apr 14, 2021 at 05:56:53PM +0200, Vlastimil Babka wrote:
> On 4/14/21 5:18 PM, Mel Gorman wrote:
> > On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
> >> So it seems that this intermediate assignment to zone counters (using
> >> atomic_long_set() even) is unnecessary and this could mimic
> >> sum_vm_events() that just does the summation on a local array?
> >> 
> > 
> > The atomic is unnecessary for sure but using a local array is
> > problematic because of your next point.
> 
> IIUC vm_events seems to do fine without a centralized array and handling CPU
> hot remove at the same time ...
> 

The vm_events are more global in nature. They are not reported
to userspace on a per-zone (/proc/zoneinfo) basis or per-node
(/sys/devices/system/node/node*/numastat) basis so they are not equivalent.

> >> And probably a bit more serious is that vm_events have
> >> vm_events_fold_cpu() to deal with a cpu going away, but after your patch
> >> the stats counted on a cpu just disappear from the sums as it goes offline
> >> as there's no such thing for the numa counters.
> >> 
> > 
> > That is a problem I missed. Even if zonestats was preserved on
> > hot-remove, fold_vm_zone_numa_events would not be reading the CPU so
> > hotplug events jump all over the place.
> > 
> > So some periodic folding is necessary. I would still prefer not to do it
> > by time but it could be done only on overflow or when a file like
> > /proc/vmstat is read. I'll think about it a bit more and see what I come
> > up with.
> 
> ... because vm_events_fold_cpu() seems to simply move the stats from the CPU
> being offlined to the current one. So the same approach should be enough for
> NUMA stats?
> 

Yes, or at least very similar.
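
For illustration, the fold-on-offline approach discussed above can be modelled
in userspace C. This is only a minimal sketch, not the kernel code: the array
layout, NR_CPUS_SKETCH, the SK_* enum and the names count_event(), fold_cpu()
and sum_event() are all hypothetical stand-ins. It only shows that moving a
dying CPU's counters to the current CPU keeps the sums stable.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count */

/* Hypothetical stand-ins for NUMA_HIT, NUMA_MISS, ... */
enum numa_event_sketch { SK_NUMA_HIT, SK_NUMA_MISS, NR_EVENTS_SKETCH };

/* Plain arrays stand in for per-CPU data and the online mask. */
static unsigned long cpu_events[NR_CPUS_SKETCH][NR_EVENTS_SKETCH];
static bool cpu_online[NR_CPUS_SKETCH] = { true, true, true, true };

/* Count one event on behalf of a CPU. */
void count_event(int cpu, enum numa_event_sketch item)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	cpu_events[cpu][item]++;
}

/* Move everything the dead CPU counted to the CPU handling the offline. */
void fold_cpu(int dead, int cur)
{
	assert(dead >= 0 && dead < NR_CPUS_SKETCH);
	assert(cur >= 0 && cur < NR_CPUS_SKETCH && cur != dead);

	for (int i = 0; i < NR_EVENTS_SKETCH; i++) {
		cpu_events[cur][i] += cpu_events[dead][i];
		cpu_events[dead][i] = 0;
	}
	cpu_online[dead] = false;
}

/* Sum one event over online CPUs only, as a reader would. */
unsigned long sum_event(enum numa_event_sketch item)
{
	unsigned long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (cpu_online[cpu])
			sum += cpu_events[cpu][item];
	return sum;
}
```

In this model a reader that only walks online CPUs sees the same total before
and after fold_cpu(), which is the property the NUMA counters would need.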

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

2021-04-14 Thread Vlastimil Babka
On 4/14/21 5:18 PM, Mel Gorman wrote:
> On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
>> So it seems that this intermediate assignment to zone counters (using
>> atomic_long_set() even) is unnecessary and this could mimic sum_vm_events()
>> that just does the summation on a local array?
>> 
> 
> The atomic is unnecessary for sure but using a local array is
> problematic because of your next point.

IIUC vm_events seems to do fine without a centralized array and handling CPU hot
remove at the same time ...

>> And probably a bit more serious is that vm_events have vm_events_fold_cpu()
>> to deal with a cpu going away, but after your patch the stats counted on a
>> cpu just disappear from the sums as it goes offline as there's no such thing
>> for the numa counters.
>> 
> 
> That is a problem I missed. Even if zonestats was preserved on
> hot-remove, fold_vm_zone_numa_events would not be reading the CPU so
> hotplug events jump all over the place.
> 
> So some periodic folding is necessary. I would still prefer not to do it
> by time but it could be done only on overflow or when a file like
> /proc/vmstat is read. I'll think about it a bit more and see what I come
> up with.

... because vm_events_fold_cpu() seems to simply move the stats from the CPU
being offlined to the current one. So the same approach should be enough for
NUMA stats?

> Thanks!
> 



Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

2021-04-14 Thread Mel Gorman
On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
> On 4/7/21 10:24 PM, Mel Gorman wrote:
> > NUMA statistics are maintained on the zone level for hits, misses, foreign
> > etc but nothing relies on them being perfectly accurate for functional
> > correctness. The counters are used by userspace to get a general overview
> > of a workload's NUMA behaviour but the page allocator incurs a high cost to
> > maintain perfect accuracy similar to what is required for a vmstat like
> > NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
> > turn off the collection of NUMA statistics like NUMA_HIT.
> > 
> > This patch converts NUMA_HIT and friends to be NUMA events with similar
> > accuracy to VM events. There is a possibility that slight errors will be
> > introduced but the overall trend as seen by userspace will be similar.
> > Note that while these counters could be maintained at the node level,
> > doing so would have a user-visible impact.
> 
> I guess this kind of inaccuracy is fine. I just don't much like
> fold_vm_zone_numa_events(), which seems to calculate sums of percpu counters
> and then assign the result to zone counters for immediate consumption. That
> differs from other kinds of folds in vmstat, which reset the percpu counters
> to 0 as they are treated as diffs to the global counters.
> 

The counters that are diffs fit inside an s8 and they are kept limited
because their "true" value is sometimes critical -- e.g. NR_FREE_PAGES
for watermark checking. So the level of drift has to be controlled and
the drift should not exist potentially forever so it gets updated
periodically.
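
As a rough userspace model of the diff-style counters described above (a
sketch only, not the kernel implementation): each CPU keeps a small signed
differential and pushes it into the global counter once it crosses a
threshold, and a periodic refresh flushes the remainder. NR_CPUS_SKETCH,
STAT_THRESHOLD_SKETCH and the function names are hypothetical, and the fixed
threshold value is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH        4		/* hypothetical CPU count */
#define STAT_THRESHOLD_SKETCH 32	/* hypothetical; must stay below INT8_MAX */

static long global_count;			/* stands in for the zone-wide counter */
static int8_t cpu_diff[NR_CPUS_SKETCH];		/* per-CPU differential, s8-sized */

/* Apply a delta of +1 or -1 on behalf of one CPU. */
void mod_state(int cpu, int delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	assert(delta == 1 || delta == -1);

	int x = cpu_diff[cpu] + delta;

	/* Past the threshold: push the whole diff into the global counter. */
	if (x > STAT_THRESHOLD_SKETCH || x < -STAT_THRESHOLD_SKETCH) {
		global_count += x;
		x = 0;
	}
	cpu_diff[cpu] = (int8_t)x;
}

/* Periodic refresh: flush whatever one CPU still holds. */
void refresh_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	global_count += cpu_diff[cpu];
	cpu_diff[cpu] = 0;
}

/* Worst-case distance between global_count and the true value. */
long max_drift(void)
{
	return (long)NR_CPUS_SKETCH * STAT_THRESHOLD_SKETCH;
}
```

The threshold bounds how far the global value can drift from the true value,
and the refresh keeps that drift from lingering indefinitely.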

The inaccurate counters are only exported to userspace. There is no need
to update them every few seconds so fold_vm_zone_numa_events() is only
called when a user cares but you raise a valid point below.

> So it seems that this intermediate assignment to zone counters (using
> atomic_long_set() even) is unnecessary and this could mimic sum_vm_events()
> that just does the summation on a local array?
> 

The atomic is unnecessary for sure but using a local array is
problematic because of your next point.

> And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
> deal with a cpu going away, but after your patch the stats counted on a cpu
> just disappear from the sums as it goes offline as there's no such thing for
> the numa counters.
> 

That is a problem I missed. Even if zonestats was preserved on
hot-remove, fold_vm_zone_numa_events would not be reading the CPU so
hotplug events jump all over the place.

So some periodic folding is necessary. I would still prefer not to do it
by time but it could be done only on overflow or when a file like
/proc/vmstat is read. I'll think about it a bit more and see what I come
up with.

Thanks!

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

2021-04-14 Thread Vlastimil Babka
On 4/7/21 10:24 PM, Mel Gorman wrote:
> NUMA statistics are maintained on the zone level for hits, misses, foreign
> etc but nothing relies on them being perfectly accurate for functional
> correctness. The counters are used by userspace to get a general overview
> of a workload's NUMA behaviour but the page allocator incurs a high cost to
> maintain perfect accuracy similar to what is required for a vmstat like
> NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
> turn off the collection of NUMA statistics like NUMA_HIT.
> 
> This patch converts NUMA_HIT and friends to be NUMA events with similar
> accuracy to VM events. There is a possibility that slight errors will be
> introduced but the overall trend as seen by userspace will be similar.
> Note that while these counters could be maintained at the node level,
> doing so would have a user-visible impact.

I guess this kind of inaccuracy is fine. I just don't much like
fold_vm_zone_numa_events(), which seems to calculate sums of percpu counters and
then assign the result to zone counters for immediate consumption. That differs
from other kinds of folds in vmstat, which reset the percpu counters to 0 as
they are treated as diffs to the global counters.

So it seems that this intermediate assignment to zone counters (using
atomic_long_set() even) is unnecessary and this could mimic sum_vm_events() that
just does the summation on a local array?

And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
deal with a cpu going away, but after your patch the stats counted on a cpu just
disappear from the sums as it goes offline as there's no such thing for the numa
counters.
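
Both points can be illustrated with a small userspace C model. It is a sketch
under stated assumptions, not the patch's code: the per-CPU arrays, the online
mask, NR_CPUS_SKETCH, the SK_* enum and the names count_event() and
sum_events() are hypothetical. It sums per-CPU counters straight into a
caller-provided local array, and shows that counts held by a CPU that goes
offline drop out of the sums when nothing folds them first.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count */

/* Hypothetical stand-ins for NUMA_HIT, NUMA_MISS, ... */
enum numa_event_sketch { SK_NUMA_HIT, SK_NUMA_MISS, NR_EVENTS_SKETCH };

/* Plain arrays stand in for per-CPU data and the online mask. */
static unsigned long cpu_events[NR_CPUS_SKETCH][NR_EVENTS_SKETCH];
static bool cpu_online[NR_CPUS_SKETCH] = { true, true, true, true };

/* Count one event on behalf of a CPU. */
void count_event(int cpu, enum numa_event_sketch item)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	cpu_events[cpu][item]++;
}

/* Sum into a local array owned by the caller; no shared intermediate copy. */
void sum_events(unsigned long ret[NR_EVENTS_SKETCH])
{
	memset(ret, 0, NR_EVENTS_SKETCH * sizeof(ret[0]));

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (!cpu_online[cpu])
			continue;	/* offline CPUs are skipped entirely */
		for (int i = 0; i < NR_EVENTS_SKETCH; i++)
			ret[i] += cpu_events[cpu][i];
	}
}
```

Clearing cpu_online[1] without first moving CPU 1's counts elsewhere makes the
next sum_events() result smaller, which is the disappearing-stats effect.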

Thanks,
Vlastimil


[PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

2021-04-07 Thread Mel Gorman
NUMA statistics are maintained on the zone level for hits, misses, foreign
etc but nothing relies on them being perfectly accurate for functional
correctness. The counters are used by userspace to get a general overview
of a workload's NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.

This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events. There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
Note that while these counters could be maintained at the node level,
doing so would have a user-visible impact.

Signed-off-by: Mel Gorman 
---
 drivers/base/node.c    |  18 +++--
 include/linux/mmzone.h |  11 ++-
 include/linux/vmstat.h |  42 +-
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        |  12 +--
 mm/vmstat.c            | 175 -
 6 files changed, 93 insertions(+), 167 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index f449dbb2c746..443a609db428 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -484,6 +484,7 @@ static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);
 static ssize_t node_read_numastat(struct device *dev,
  struct device_attribute *attr, char *buf)
 {
+   fold_vm_numa_events();
return sysfs_emit(buf,
  "numa_hit %lu\n"
  "numa_miss %lu\n"
@@ -491,12 +492,12 @@ static ssize_t node_read_numastat(struct device *dev,
  "interleave_hit %lu\n"
  "local_node %lu\n"
  "other_node %lu\n",
- sum_zone_numa_state(dev->id, NUMA_HIT),
- sum_zone_numa_state(dev->id, NUMA_MISS),
- sum_zone_numa_state(dev->id, NUMA_FOREIGN),
- sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
- sum_zone_numa_state(dev->id, NUMA_LOCAL),
- sum_zone_numa_state(dev->id, NUMA_OTHER));
+ sum_zone_numa_event_state(dev->id, NUMA_HIT),
+ sum_zone_numa_event_state(dev->id, NUMA_MISS),
+ sum_zone_numa_event_state(dev->id, NUMA_FOREIGN),
+ sum_zone_numa_event_state(dev->id, NUMA_INTERLEAVE_HIT),
+ sum_zone_numa_event_state(dev->id, NUMA_LOCAL),
+ sum_zone_numa_event_state(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, 0444, node_read_numastat, NULL);
 
@@ -514,10 +515,11 @@ static ssize_t node_read_vmstat(struct device *dev,
 sum_zone_node_page_state(nid, i));
 
 #ifdef CONFIG_NUMA
-   for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
+   fold_vm_numa_events();
+   for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
len += sysfs_emit_at(buf, len, "%s %lu\n",
 numa_stat_name(i),
-sum_zone_numa_state(nid, i));
+sum_zone_numa_event_state(nid, i));
 
 #endif
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 106da8fbc72a..693cd5f24f7d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -135,10 +135,10 @@ enum numa_stat_item {
NUMA_INTERLEAVE_HIT,/* interleaver preferred this zone */
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
-   NR_VM_NUMA_STAT_ITEMS
+   NR_VM_NUMA_EVENT_ITEMS
 };
 #else
-#define NR_VM_NUMA_STAT_ITEMS 0
+#define NR_VM_NUMA_EVENT_ITEMS 0
 #endif
 
 enum zone_stat_item {
@@ -357,7 +357,10 @@ struct per_cpu_zonestat {
s8 stat_threshold;
 #endif
 #ifdef CONFIG_NUMA
-   u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
+   u16 vm_numa_stat_diff[NR_VM_NUMA_EVENT_ITEMS];
+#endif
+#ifdef CONFIG_NUMA
+   unsigned long vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
 #endif
 };
 
@@ -609,7 +612,7 @@ struct zone {
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
-   atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
+   atomic_long_t   vm_numa_events[NR_VM_NUMA_EVENT_ITEMS];
 } cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1736ea9d24a7..fc14415223c5 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -138,35 +138,27 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.