Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread kemi


On 2017年12月20日 23:58, Christopher Lameter wrote:
> On Wed, 20 Dec 2017, kemi wrote:
> 
>>> You are making numastats special and I yet haven't heard any sounds
>>> arguments for that. But that should be discussed in the respective
>>> patch.
>>>
>>
>> That is because we have much larger threshold size for NUMA counters, that 
>> means larger
>> deviation. So, the number in local cpus may not be simply ignored.
> 
> Some numbers showing the effect of these changes would be helpful. You can
> probably create some in kernel synthetic tests to start with which would
> allow you to see any significant effects of those changes.
> 
> Then run the larger testsuites (f.e. those that Mel has published) and
> benchmarks to figure out how behavior of real apps *may* change?
> 

OK.
I will do that when available.
Let's just drop this patch in this series and consider this issue
in another patch. 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread kemi


On 2017年12月20日 23:58, Christopher Lameter wrote:
> On Wed, 20 Dec 2017, kemi wrote:
> 
>>> You are making numastats special and I yet haven't heard any sounds
>>> arguments for that. But that should be discussed in the respective
>>> patch.
>>>
>>
>> That is because we have much larger threshold size for NUMA counters, that 
>> means larger
>> deviation. So, the number in local cpus may not be simply ignored.
> 
> Some numbers showing the effect of these changes would be helpful. You can
> probably create some in kernel synthetic tests to start with which would
> allow you to see any significant effects of those changes.
> 
> Then run the larger testsuites (f.e. those that Mel has published) and
> benchmarks to figure out how behavior of real apps *may* change?
> 

OK.
I will do that when available.
Let's just drop this patch in this series and consider this issue
in another patch. 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread Christopher Lameter
On Wed, 20 Dec 2017, kemi wrote:

> > You are making numastats special and I yet haven't heard any sounds
> > arguments for that. But that should be discussed in the respective
> > patch.
> >
>
> That is because we have much larger threshold size for NUMA counters, that 
> means larger
> deviation. So, the number in local cpus may not be simply ignored.

Some numbers showing the effect of these changes would be helpful. You can
probably create some in kernel synthetic tests to start with which would
allow you to see any significant effects of those changes.

Then run the larger testsuites (f.e. those that Mel has published) and
benchmarks to figure out how behavior of real apps *may* change?



Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread Christopher Lameter
On Wed, 20 Dec 2017, kemi wrote:

> > You are making numastats special and I yet haven't heard any sounds
> > arguments for that. But that should be discussed in the respective
> > patch.
> >
>
> That is because we have much larger threshold size for NUMA counters, that 
> means larger
> deviation. So, the number in local cpus may not be simply ignored.

Some numbers showing the effect of these changes would be helpful. You can
probably create some in kernel synthetic tests to start with which would
allow you to see any significant effects of those changes.

Then run the larger testsuites (f.e. those that Mel has published) and
benchmarks to figure out how behavior of real apps *may* change?



Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread kemi


On 2017年12月20日 18:06, Michal Hocko wrote:
> On Wed 20-12-17 14:07:35, kemi wrote:
>>
>>
>> On 2017年12月19日 20:43, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
 To avoid deviation, this patch uses node_page_state_snapshot instead of
 node_page_state for node page stats query.
 e.g. cat /proc/zoneinfo
  cat /sys/devices/system/node/node*/vmstat
  cat /sys/devices/system/node/node*/numastat

 As it is a slow path and would not be read frequently, I would worry about
 it.
>>>
>>> The changelog doesn't explain why these counters needs any special
>>> treatment. _snapshot variants where used only for internal handling
>>> where the precision really mattered. We do not have any in-tree user and
>>> Jack has removed this by 
>>> http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
>>> which is already sitting in the mmotm tree. We can re-add it but that
>>> would really require a _very good_ reason.
>>>
>>
>> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
>> deviation is nr*t.
>> Currently, Skylake platform has hundreds of CPUs numbers and the number is 
>> still 
>> increasing. Also, even the threshold size is kept to 125 at maximum (32765 
>> for NUMA counters now), the deviation is just a little too big as I have 
>> mentioned in 
>> the log. I tend to sum the number in local cpus up when query the global 
>> stats.
> 
> This is a general problem of pcp accounting. So if it needs to be
> addressed then do it the same way for all stats.
> 
>> Also, node_page_state_snapshot is only called in slow path and I don't think 
>> that
>> would be a big problem. 
>>
>> Anyway, it is a matter of taste. I just think it's better to have.
> 
> You are making numastats special and I yet haven't heard any sounds
> arguments for that. But that should be discussed in the respective
> patch.
> 

That is because we have much larger threshold size for NUMA counters, that 
means larger 
deviation. So, the number in local cpus may not be simply ignored.

 Signed-off-by: Kemi Wang 
 ---
  drivers/base/node.c | 17 ++---
  mm/vmstat.c |  2 +-
  2 files changed, 11 insertions(+), 8 deletions(-)

 diff --git a/drivers/base/node.c b/drivers/base/node.c
 index a045ea1..cf303f8 100644
 --- a/drivers/base/node.c
 +++ b/drivers/base/node.c
 @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
 - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
 - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
 - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
 - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
 - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
 - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
 + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
 + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
 + node_page_state_snapshot(NODE_DATA(dev->id),
 + NUMA_FOREIGN),
 + node_page_state_snapshot(NODE_DATA(dev->id),
 + NUMA_INTERLEAVE_HIT),
 + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
 + node_page_state_snapshot(NODE_DATA(dev->id),
 + NUMA_OTHER));
  }
  
  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
 vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 -   node_page_state(pgdat, i));
 +   node_page_state_snapshot(pgdat, i));
  
return n;
  }
 diff --git a/mm/vmstat.c b/mm/vmstat.c
 index 64e08ae..d65f28d 100644
 --- a/mm/vmstat.c
 +++ b/mm/vmstat.c
 @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
 pg_data_t *pgdat,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n  %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 -  node_page_state(pgdat, i));
 +  node_page_state_snapshot(pgdat, i));
}
}
seq_printf(m,
 -- 
 2.7.4

>>>
> 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread kemi


On 2017年12月20日 18:06, Michal Hocko wrote:
> On Wed 20-12-17 14:07:35, kemi wrote:
>>
>>
>> On 2017年12月19日 20:43, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
 To avoid deviation, this patch uses node_page_state_snapshot instead of
 node_page_state for node page stats query.
 e.g. cat /proc/zoneinfo
  cat /sys/devices/system/node/node*/vmstat
  cat /sys/devices/system/node/node*/numastat

 As it is a slow path and would not be read frequently, I would worry about
 it.
>>>
>>> The changelog doesn't explain why these counters needs any special
>>> treatment. _snapshot variants where used only for internal handling
>>> where the precision really mattered. We do not have any in-tree user and
>>> Jack has removed this by 
>>> http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
>>> which is already sitting in the mmotm tree. We can re-add it but that
>>> would really require a _very good_ reason.
>>>
>>
>> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
>> deviation is nr*t.
>> Currently, Skylake platform has hundreds of CPUs numbers and the number is 
>> still 
>> increasing. Also, even the threshold size is kept to 125 at maximum (32765 
>> for NUMA counters now), the deviation is just a little too big as I have 
>> mentioned in 
>> the log. I tend to sum the number in local cpus up when query the global 
>> stats.
> 
> This is a general problem of pcp accounting. So if it needs to be
> addressed then do it the same way for all stats.
> 
>> Also, node_page_state_snapshot is only called in slow path and I don't think 
>> that
>> would be a big problem. 
>>
>> Anyway, it is a matter of taste. I just think it's better to have.
> 
> You are making numastats special and I yet haven't heard any sounds
> arguments for that. But that should be discussed in the respective
> patch.
> 

That is because we have much larger threshold size for NUMA counters, that 
means larger 
deviation. So, the number in local cpus may not be simply ignored.

 Signed-off-by: Kemi Wang 
 ---
  drivers/base/node.c | 17 ++---
  mm/vmstat.c |  2 +-
  2 files changed, 11 insertions(+), 8 deletions(-)

 diff --git a/drivers/base/node.c b/drivers/base/node.c
 index a045ea1..cf303f8 100644
 --- a/drivers/base/node.c
 +++ b/drivers/base/node.c
 @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
 - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
 - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
 - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
 - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
 - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
 - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
 + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
 + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
 + node_page_state_snapshot(NODE_DATA(dev->id),
 + NUMA_FOREIGN),
 + node_page_state_snapshot(NODE_DATA(dev->id),
 + NUMA_INTERLEAVE_HIT),
 + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
 + node_page_state_snapshot(NODE_DATA(dev->id),
 + NUMA_OTHER));
  }
  
  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
 vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 -   node_page_state(pgdat, i));
 +   node_page_state_snapshot(pgdat, i));
  
return n;
  }
 diff --git a/mm/vmstat.c b/mm/vmstat.c
 index 64e08ae..d65f28d 100644
 --- a/mm/vmstat.c
 +++ b/mm/vmstat.c
 @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
 pg_data_t *pgdat,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n  %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 -  node_page_state(pgdat, i));
 +  node_page_state_snapshot(pgdat, i));
}
}
seq_printf(m,
 -- 
 2.7.4

>>>
> 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread Michal Hocko
On Wed 20-12-17 14:07:35, kemi wrote:
> 
> 
> On 2017年12月19日 20:43, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> >> To avoid deviation, this patch uses node_page_state_snapshot instead of
> >> node_page_state for node page stats query.
> >> e.g. cat /proc/zoneinfo
> >>  cat /sys/devices/system/node/node*/vmstat
> >>  cat /sys/devices/system/node/node*/numastat
> >>
> >> As it is a slow path and would not be read frequently, I would worry about
> >> it.
> > 
> > The changelog doesn't explain why these counters needs any special
> > treatment. _snapshot variants where used only for internal handling
> > where the precision really mattered. We do not have any in-tree user and
> > Jack has removed this by 
> > http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
> > which is already sitting in the mmotm tree. We can re-add it but that
> > would really require a _very good_ reason.
> > 
> 
> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
> deviation is nr*t.
> Currently, Skylake platform has hundreds of CPUs numbers and the number is 
> still 
> increasing. Also, even the threshold size is kept to 125 at maximum (32765 
> for NUMA counters now), the deviation is just a little too big as I have 
> mentioned in 
> the log. I tend to sum the number in local cpus up when query the global 
> stats.

This is a general problem of pcp accounting. So if it needs to be
addressed then do it the same way for all stats.

> Also, node_page_state_snapshot is only called in slow path and I don't think 
> that
> would be a big problem. 
> 
> Anyway, it is a matter of taste. I just think it's better to have.

You are making numastats special and I yet haven't heard any sounds
arguments for that. But that should be discussed in the respective
patch.

> >> Signed-off-by: Kemi Wang 
> >> ---
> >>  drivers/base/node.c | 17 ++---
> >>  mm/vmstat.c |  2 +-
> >>  2 files changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index a045ea1..cf303f8 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
> >>   "interleave_hit %lu\n"
> >>   "local_node %lu\n"
> >>   "other_node %lu\n",
> >> - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_FOREIGN),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_INTERLEAVE_HIT),
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_OTHER));
> >>  }
> >>  
> >>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> >> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
> >>for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> >>n += sprintf(buf+n, "%s %lu\n",
> >> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> -   node_page_state(pgdat, i));
> >> +   node_page_state_snapshot(pgdat, i));
> >>  
> >>return n;
> >>  }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 64e08ae..d65f28d 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
> >> pg_data_t *pgdat,
> >>for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> >>seq_printf(m, "\n  %-12s %lu",
> >>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> -  node_page_state(pgdat, i));
> >> +  node_page_state_snapshot(pgdat, i));
> >>}
> >>}
> >>seq_printf(m,
> >> -- 
> >> 2.7.4
> >>
> > 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-20 Thread Michal Hocko
On Wed 20-12-17 14:07:35, kemi wrote:
> 
> 
> On 2017年12月19日 20:43, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> >> To avoid deviation, this patch uses node_page_state_snapshot instead of
> >> node_page_state for node page stats query.
> >> e.g. cat /proc/zoneinfo
> >>  cat /sys/devices/system/node/node*/vmstat
> >>  cat /sys/devices/system/node/node*/numastat
> >>
> >> As it is a slow path and would not be read frequently, I would worry about
> >> it.
> > 
> > The changelog doesn't explain why these counters needs any special
> > treatment. _snapshot variants where used only for internal handling
> > where the precision really mattered. We do not have any in-tree user and
> > Jack has removed this by 
> > http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
> > which is already sitting in the mmotm tree. We can re-add it but that
> > would really require a _very good_ reason.
> > 
> 
> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
> deviation is nr*t.
> Currently, Skylake platform has hundreds of CPUs numbers and the number is 
> still 
> increasing. Also, even the threshold size is kept to 125 at maximum (32765 
> for NUMA counters now), the deviation is just a little too big as I have 
> mentioned in 
> the log. I tend to sum the number in local cpus up when query the global 
> stats.

This is a general problem of pcp accounting. So if it needs to be
addressed then do it the same way for all stats.

> Also, node_page_state_snapshot is only called in slow path and I don't think 
> that
> would be a big problem. 
> 
> Anyway, it is a matter of taste. I just think it's better to have.

You are making numastats special and I yet haven't heard any sounds
arguments for that. But that should be discussed in the respective
patch.

> >> Signed-off-by: Kemi Wang 
> >> ---
> >>  drivers/base/node.c | 17 ++---
> >>  mm/vmstat.c |  2 +-
> >>  2 files changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index a045ea1..cf303f8 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
> >>   "interleave_hit %lu\n"
> >>   "local_node %lu\n"
> >>   "other_node %lu\n",
> >> - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_FOREIGN),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_INTERLEAVE_HIT),
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_OTHER));
> >>  }
> >>  
> >>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> >> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
> >>for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> >>n += sprintf(buf+n, "%s %lu\n",
> >> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> -   node_page_state(pgdat, i));
> >> +   node_page_state_snapshot(pgdat, i));
> >>  
> >>return n;
> >>  }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 64e08ae..d65f28d 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
> >> pg_data_t *pgdat,
> >>for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> >>seq_printf(m, "\n  %-12s %lu",
> >>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> -  node_page_state(pgdat, i));
> >> +  node_page_state_snapshot(pgdat, i));
> >>}
> >>}
> >>seq_printf(m,
> >> -- 
> >> 2.7.4
> >>
> > 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-19 Thread kemi


On 2017年12月19日 20:43, Michal Hocko wrote:
> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>> node_page_state for node page stats query.
>> e.g. cat /proc/zoneinfo
>>  cat /sys/devices/system/node/node*/vmstat
>>  cat /sys/devices/system/node/node*/numastat
>>
>> As it is a slow path and would not be read frequently, I would worry about
>> it.
> 
> The changelog doesn't explain why these counters needs any special
> treatment. _snapshot variants where used only for internal handling
> where the precision really mattered. We do not have any in-tree user and
> Jack has removed this by 
> http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
> which is already sitting in the mmotm tree. We can re-add it but that
> would really require a _very good_ reason.
> 

Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
deviation is nr*t.
Currently, Skylake platform has hundreds of CPUs numbers and the number is 
still 
increasing. Also, even the threshold size is kept to 125 at maximum (32765 
for NUMA counters now), the deviation is just a little too big as I have 
mentioned in 
the log. I tend to sum the number in local cpus up when query the global stats.

Also, node_page_state_snapshot is only called in slow path and I don't think 
that
would be a big problem. 

Anyway, it is a matter of taste. I just think it's better to have.

>> Signed-off-by: Kemi Wang 
>> ---
>>  drivers/base/node.c | 17 ++---
>>  mm/vmstat.c |  2 +-
>>  2 files changed, 11 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index a045ea1..cf303f8 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> -   node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_FOREIGN),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_INTERLEAVE_HIT),
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_OTHER));
>>  }
>>  
>>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>  n += sprintf(buf+n, "%s %lu\n",
>>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - node_page_state(pgdat, i));
>> + node_page_state_snapshot(pgdat, i));
>>  
>>  return n;
>>  }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 64e08ae..d65f28d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
>> pg_data_t *pgdat,
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>  seq_printf(m, "\n  %-12s %lu",
>>  vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> -node_page_state(pgdat, i));
>> +node_page_state_snapshot(pgdat, i));
>>  }
>>  }
>>  seq_printf(m,
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-19 Thread kemi


On 2017年12月19日 20:43, Michal Hocko wrote:
> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>> node_page_state for node page stats query.
>> e.g. cat /proc/zoneinfo
>>  cat /sys/devices/system/node/node*/vmstat
>>  cat /sys/devices/system/node/node*/numastat
>>
>> As it is a slow path and would not be read frequently, I would worry about
>> it.
> 
> The changelog doesn't explain why these counters needs any special
> treatment. _snapshot variants where used only for internal handling
> where the precision really mattered. We do not have any in-tree user and
> Jack has removed this by 
> http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
> which is already sitting in the mmotm tree. We can re-add it but that
> would really require a _very good_ reason.
> 

Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum 
deviation is nr*t.
Currently, Skylake platform has hundreds of CPUs numbers and the number is 
still 
increasing. Also, even the threshold size is kept to 125 at maximum (32765 
for NUMA counters now), the deviation is just a little too big as I have 
mentioned in 
the log. I tend to sum the number in local cpus up when query the global stats.

Also, node_page_state_snapshot is only called in slow path and I don't think 
that
would be a big problem. 

Anyway, it is a matter of taste. I just think it's better to have.

>> Signed-off-by: Kemi Wang 
>> ---
>>  drivers/base/node.c | 17 ++---
>>  mm/vmstat.c |  2 +-
>>  2 files changed, 11 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index a045ea1..cf303f8 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> -   node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> -   node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_FOREIGN),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_INTERLEAVE_HIT),
>> +   node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>> +   node_page_state_snapshot(NODE_DATA(dev->id),
>> +   NUMA_OTHER));
>>  }
>>  
>>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>  n += sprintf(buf+n, "%s %lu\n",
>>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - node_page_state(pgdat, i));
>> + node_page_state_snapshot(pgdat, i));
>>  
>>  return n;
>>  }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 64e08ae..d65f28d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
>> pg_data_t *pgdat,
>>  for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>  seq_printf(m, "\n  %-12s %lu",
>>  vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> -node_page_state(pgdat, i));
>> +node_page_state_snapshot(pgdat, i));
>>  }
>>  }
>>  seq_printf(m,
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-19 Thread Michal Hocko
On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> To avoid deviation, this patch uses node_page_state_snapshot instead of
> node_page_state for node page stats query.
> e.g. cat /proc/zoneinfo
>  cat /sys/devices/system/node/node*/vmstat
>  cat /sys/devices/system/node/node*/numastat
> 
> As it is a slow path and would not be read frequently, I would worry about
> it.

The changelog doesn't explain why these counters needs any special
treatment. _snapshot variants where used only for internal handling
where the precision really mattered. We do not have any in-tree user and
Jack has removed this by 
http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
which is already sitting in the mmotm tree. We can re-add it but that
would really require a _very good_ reason.

> Signed-off-by: Kemi Wang 
> ---
>  drivers/base/node.c | 17 ++---
>  mm/vmstat.c |  2 +-
>  2 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index a045ea1..cf303f8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>  "interleave_hit %lu\n"
>  "local_node %lu\n"
>  "other_node %lu\n",
> -node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> -node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> -node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> -node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> -node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> -node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> +node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> +node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> +node_page_state_snapshot(NODE_DATA(dev->id),
> +NUMA_FOREIGN),
> +node_page_state_snapshot(NODE_DATA(dev->id),
> +NUMA_INTERLEAVE_HIT),
> +node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> +node_page_state_snapshot(NODE_DATA(dev->id),
> +NUMA_OTHER));
>  }
>  
>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>   n += sprintf(buf+n, "%s %lu\n",
>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -  node_page_state(pgdat, i));
> +  node_page_state_snapshot(pgdat, i));
>  
>   return n;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 64e08ae..d65f28d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
> pg_data_t *pgdat,
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>   seq_printf(m, "\n  %-12s %lu",
>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> - node_page_state(pgdat, i));
> + node_page_state_snapshot(pgdat, i));
>   }
>   }
>   seq_printf(m,
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-19 Thread Michal Hocko
On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> To avoid deviation, this patch uses node_page_state_snapshot instead of
> node_page_state for node page stats query.
> e.g. cat /proc/zoneinfo
>  cat /sys/devices/system/node/node*/vmstat
>  cat /sys/devices/system/node/node*/numastat
> 
> As it is a slow path and would not be read frequently, I would worry about
> it.

The changelog doesn't explain why these counters needs any special
treatment. _snapshot variants where used only for internal handling
where the precision really mattered. We do not have any in-tree user and
Jack has removed this by 
http://lkml.kernel.org/r/20171122094416.26019-1-j...@suse.cz
which is already sitting in the mmotm tree. We can re-add it but that
would really require a _very good_ reason.

> Signed-off-by: Kemi Wang 
> ---
>  drivers/base/node.c | 17 ++---
>  mm/vmstat.c |  2 +-
>  2 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index a045ea1..cf303f8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>  "interleave_hit %lu\n"
>  "local_node %lu\n"
>  "other_node %lu\n",
> -node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> -node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> -node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> -node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> -node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> -node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> +node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> +node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> +node_page_state_snapshot(NODE_DATA(dev->id),
> +NUMA_FOREIGN),
> +node_page_state_snapshot(NODE_DATA(dev->id),
> +NUMA_INTERLEAVE_HIT),
> +node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> +node_page_state_snapshot(NODE_DATA(dev->id),
> +NUMA_OTHER));
>  }
>  
>  static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>   n += sprintf(buf+n, "%s %lu\n",
>vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -  node_page_state(pgdat, i));
> +  node_page_state_snapshot(pgdat, i));
>  
>   return n;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 64e08ae..d65f28d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
> pg_data_t *pgdat,
>   for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>   seq_printf(m, "\n  %-12s %lu",
>   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> - node_page_state(pgdat, i));
> + node_page_state_snapshot(pgdat, i));
>   }
>   }
>   seq_printf(m,
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


[PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-18 Thread Kemi Wang
To avoid deviation, this patch uses node_page_state_snapshot instead of
node_page_state for node page stats query.
e.g. cat /proc/zoneinfo
 cat /sys/devices/system/node/node*/vmstat
 cat /sys/devices/system/node/node*/numastat

As it is a slow path and would not be read frequently, I would worry about
it.

Signed-off-by: Kemi Wang 
---
 drivers/base/node.c | 17 ++---
 mm/vmstat.c |  2 +-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a045ea1..cf303f8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
-  node_page_state(NODE_DATA(dev->id), NUMA_HIT),
-  node_page_state(NODE_DATA(dev->id), NUMA_MISS),
-  node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
-  node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
-  node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
-  node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_FOREIGN),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_INTERLEAVE_HIT),
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_OTHER));
 }
 
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
@@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
 vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-node_page_state(pgdat, i));
+node_page_state_snapshot(pgdat, i));
 
return n;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 64e08ae..d65f28d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
pg_data_t *pgdat,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n  %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-   node_page_state(pgdat, i));
+   node_page_state_snapshot(pgdat, i));
}
}
seq_printf(m,
-- 
2.7.4



[PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

2017-12-18 Thread Kemi Wang
To avoid deviation, this patch uses node_page_state_snapshot instead of
node_page_state for node page stats query.
e.g. cat /proc/zoneinfo
 cat /sys/devices/system/node/node*/vmstat
 cat /sys/devices/system/node/node*/numastat

As it is a slow path and would not be read frequently, I would worry about
it.

Signed-off-by: Kemi Wang 
---
 drivers/base/node.c | 17 ++---
 mm/vmstat.c |  2 +-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a045ea1..cf303f8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
   "interleave_hit %lu\n"
   "local_node %lu\n"
   "other_node %lu\n",
-  node_page_state(NODE_DATA(dev->id), NUMA_HIT),
-  node_page_state(NODE_DATA(dev->id), NUMA_MISS),
-  node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
-  node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
-  node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
-  node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_FOREIGN),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_INTERLEAVE_HIT),
+  node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
+  node_page_state_snapshot(NODE_DATA(dev->id),
+  NUMA_OTHER));
 }
 
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
@@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
 vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-node_page_state(pgdat, i));
+node_page_state_snapshot(pgdat, i));
 
return n;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 64e08ae..d65f28d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, 
pg_data_t *pgdat,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n  %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-   node_page_state(pgdat, i));
+   node_page_state_snapshot(pgdat, i));
}
}
seq_printf(m,
-- 
2.7.4