Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

2016-09-30 Thread Rik van Riel
On Fri, 2016-09-30 at 11:49 +0200, Michal Hocko wrote:
> [CC Mike and Mel as they have seen some accounting oddities
>  when doing performance testing. They can share details but
>  essentially the system time just gets too high]
> 
> For your reference the email thread started
> http://lkml.kernel.org/r/20160823143330.gl23...@dhcp22.suse.cz
> 
> I suspect this is mainly for short lived processes - like kernel
> compile
> $ /usr/bin/time -v make mm/mmap.o
> [...]
> User time (seconds): 0.45
> System time (seconds): 0.82
> Percent of CPU this job got: 111%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
> $ rm mm/mmap.o
> $ /usr/bin/time -v make mm/mmap.o
> [...]
> User time (seconds): 0.47
> System time (seconds): 1.55
> Percent of CPU this job got: 107%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

I was not able to get the "expected" results from
your last reproducer, but this one does happen on
my system, too.

The bad news is, I still have no clue what is causing
it...

-- 
All Rights Reversed.


Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

2016-09-30 Thread Michal Hocko
[CC Mike and Mel as they have seen some accounting oddities
 when doing performance testing. They can share details but
 essentially the system time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.gl23...@dhcp22.suse.cz

I suspect this is mainly for short lived processes - like kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.45
System time (seconds): 0.82
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.47
System time (seconds): 1.55
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.

$ perf report | grep kernel.vmlinux
 2.01%  as        [kernel.vmlinux]  [k] page_fault
 0.59%  cc1       [kernel.vmlinux]  [k] page_fault
 0.15%  git       [kernel.vmlinux]  [k] page_fault
 0.12%  bash      [kernel.vmlinux]  [k] page_fault
 0.11%  sh        [kernel.vmlinux]  [k] page_fault
 0.08%  gcc       [kernel.vmlinux]  [k] page_fault
 0.06%  make      [kernel.vmlinux]  [k] page_fault
 0.04%  rm        [kernel.vmlinux]  [k] page_fault
 0.03%  ld        [kernel.vmlinux]  [k] page_fault
 0.02%  bash      [kernel.vmlinux]  [k] entry_SYSCALL_64
 0.01%  git       [kernel.vmlinux]  [k] entry_SYSCALL_64
 0.01%  cat       [kernel.vmlinux]  [k] page_fault
 0.01%  collect2  [kernel.vmlinux]  [k] page_fault
 0.00%  sh        [kernel.vmlinux]  [k] entry_SYSCALL_64
 0.00%  rm        [kernel.vmlinux]  [k] entry_SYSCALL_64
 0.00%  grep      [kernel.vmlinux]  [k] page_fault

doesn't show anything unexpected.
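
A minimal self-contained cross-check, sketched here for illustration only
(not taken from the thread): burn roughly half a second of user-mode CPU,
then roughly half a second dominated by kernel-mode read() calls, and print
the utime/stime split the kernel accounted via getrusage(). The /dev/zero
read loop is just one convenient, assumed way to accumulate system time; on
an affected kernel the reported split may deviate noticeably from the
expected rough 50/50.

#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	volatile unsigned long x = 0;
	double t = now();
	while (now() - t < 0.5)		/* ~0.5 s of pure user-mode work */
		x++;

	char buf[4096];
	int fd = open("/dev/zero", O_RDONLY);
	if (fd < 0)
		return 1;
	t = now();
	while (now() - t < 0.5)		/* ~0.5 s mostly inside sys_read() */
		read(fd, buf, sizeof(buf));
	close(fd);

	struct rusage ru;		/* what did the kernel account? */
	getrusage(RUSAGE_SELF, &ru);
	printf("utime %ld.%06lds  stime %ld.%06lds\n",
	       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
	       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
	return 0;
}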

Rik's original reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Rik's ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> > 
> > And just to give Rik more context. While debugging overhead of the
> /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> > 
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map 
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> > User time (seconds): 0.00
> > System time (seconds): 0.45
> > Percent of CPU this job got: 98%
> > 
> 
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> > User time (seconds): 0.18
> > System time (seconds): 0.29
> > Percent of CPU this job got: 97%
> 
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
> 
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
> 
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
> 
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
> 
> Does the workload consistently get zero user
> time?
> 
> If so, we need to dig further to see under
> what precise circumstances that happens.
> 
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
> 
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
>   Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
>   User time (seconds): 0.64
>   System time (seconds): 0.19
>   Percent of CPU this 


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-30 Thread Michal Hocko
On Wed 24-08-16 12:14:06, Marcin Jabrzyk wrote:
[...]
> Sorry to hijack the thread, but I've found it recently
> and I guess it's the best place to present our point.
> We are working on our custom OS based on Linux and we have also suffered a
> lot from the /proc/<pid>/smaps file. As in Chrome, we tried to improve our
> internal application memory management policies (Low Memory Killer) using
> data provided by smaps, but we failed due to the very long time needed to
> read and properly parse the file.

I was already questioning Pss and also Private_* for any memory killer
purpose earlier in the thread because cumulative numbers for all
mappings can be really meaningless, especially when you do not know
which resource is shared and by whom. Maybe you can describe how
you are using those cumulative numbers for your decisions and prove me
wrong, but I simply haven't heard any sound arguments so far. Everything
was just "we know what we are doing in our environment so we know those
resources and therefore those numbers make sense to us". But with all due
respect this is not a reason to add a user visible API into the kernel.
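
For reference, smaps computes Pss per mapping as the private pages plus each
shared page divided by the number of processes mapping it, roughly

	Pss = Private + \sum_i Shared_i / n_i

with n_i the number of mappers of shared page i (this is the "proportional
set size" definition from Documentation/filesystems/proc.txt). As a purely
illustrative piece of arithmetic: a 40 MB region shared equally by four
renderer processes adds 10 MB to each of their Pss totals, yet killing any
single one of them releases none of those 40 MB, and if the region is shmem
backed then even killing all four leaves it allocated until the file goes
away.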

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-30 Thread Michal Hocko
On Mon 29-08-16 16:37:04, Michal Hocko wrote:
> [Sorry for a late reply, I was busy with other stuff]
> 
> On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> > On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko  wrote:
> [...]
> > But what about the private_clean and private_dirty?  Surely
> > those are more generally useful for calculating a lower bound on
> > process memory usage without additional knowledge?
> 
> I guess private_clean can be used as a reasonable estimate.

I was thinking about this more and I think I am wrong here. Even a truly
MAP_PRIVATE|MAP_ANON mapping will be accounted in private_dirty, so
private_clean becomes not all that interesting, and it is similarly
misleading as the _dirty variant: an mmaped file should become _clean
after [m]sync, yet that doesn't mean the memory will get freed once the
process which maps it terminates. Take shmem as an example again.

> private_dirty less so because it may refer to e.g. tmpfs which is not
> mapped by any other process and so no memory would be freed after unmap
> without removing the file.
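
A small illustration of both points (a sketch only, not from the thread; the
/dev/shm file name is made up for the demo): a written MAP_PRIVATE|MAP_ANON
mapping and a written tmpfs-backed MAP_SHARED mapping with a single mapper
both end up under Private_Dirty in this process's smaps, but only the former
is given back to the system when the process dies.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void dump_private_dirty(const char *tag)
{
	/* Sum all Private_Dirty lines of our own smaps. */
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	long kb = 0, v;
	while (f && fgets(line, sizeof(line), f))
		if (sscanf(line, "Private_Dirty: %ld kB", &v) == 1)
			kb += v;
	if (f)
		fclose(f);
	printf("%s: total Private_Dirty %ld kB\n", tag, kb);
}

int main(void)
{
	const size_t len = 64UL << 20;	/* 64 MB each */

	char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (anon == MAP_FAILED) { perror("mmap anon"); return 1; }
	memset(anon, 1, len);		/* freed when this process exits */
	dump_private_dirty("after dirtying anon MAP_PRIVATE");

	int fd = open("/dev/shm/smaps-demo", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, len) < 0) { perror("tmpfs file"); return 1; }
	char *shm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (shm == MAP_FAILED) { perror("mmap shm"); return 1; }
	memset(shm, 1, len);		/* stays in tmpfs after we exit */
	dump_private_dirty("after dirtying tmpfs MAP_SHARED");

	unlink("/dev/shm/smaps-demo");	/* remove the demo file again */
	return 0;
}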

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-29 Thread Michal Hocko
[Sorry for a late reply, I was busy with other stuff]

On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko  wrote:
[...]
> But what about the private_clean and private_dirty?  Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?

I guess private_clean can be used as a reasonable estimate.
private_dirty less so because it may refer to e.g. tmpfs which is not
mapped by any other process and so no memory would be freed after unmap
without removing the file.

> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.

I agree on this part definitely. I also understand that what we provide
currently is quite confusing and not really helpful. But I am afraid
that the accounting is far from trivial to make right for all the
possible cases.

> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.

Maybe we can be more clever and do some form of caching. I haven't
thought that through to see how hard that could be. I mean we could
cache some data per mm_struct and invalidate them only after the current
value would get too much out of sync.

> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg.  Swap seems to be available in /proc/<pid>/status.
> 
> I looked at the "shared" value in /proc//statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?

task_statm() only approximates it as get_mm_counter(mm, MM_FILEPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES), i.e. all file and shmem pages accounted
to the mm, whether or not anybody else maps them. If they are not shared
by anybody else they would still be considered private by smaps.
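
For context, a rough sketch of what 4.x task_statm() does (paraphrased from
fs/proc/task_mmu.c from memory; treat the exact field names as approximate):

unsigned long task_statm(struct mm_struct *mm,
			 unsigned long *shared, unsigned long *text,
			 unsigned long *data, unsigned long *resident)
{
	/* "shared" = every file-backed and shmem page charged to the mm,
	 * regardless of whether anybody else actually maps it ... */
	*shared = get_mm_counter(mm, MM_FILEPAGES) +
		  get_mm_counter(mm, MM_SHMEMPAGES);
	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
			>> PAGE_SHIFT;
	*data = mm->data_vm + mm->stack_vm;
	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
	return mm->total_vm;
}

/* ... while smaps decides shared vs. private per page from its map count,
 * which is why the two "shared" numbers do not line up. */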

> It might be useful to show the magnitude of difference of using RSS vs
> PSS/Private in the case of the Chrome renderer processes.  On the
> system I was looking at there were about 40 of these processes, but I
> picked a few to give an idea:
> 
> localhost ~ # cat /proc/21550/totmaps
> Rss:   98972 kB
> Pss:   54717 kB
> Shared_Clean:  19020 kB
> Shared_Dirty:  26352 kB
> Private_Clean: 0 kB
> Private_Dirty: 53600 kB
> Referenced:92184 kB
> Anonymous: 46524 kB
> AnonHugePages: 24576 kB
> Swap:  13148 kB
> 
> 
> RSS is 80% higher than PSS and 84% higher than private data
> 
> localhost ~ # cat /proc/21470/totmaps
> Rss:  118420 kB
> Pss:   70938 kB
> Shared_Clean:  22212 kB
> Shared_Dirty:  26520 kB
> Private_Clean: 0 kB
> Private_Dirty: 69688 kB
> Referenced:   111500 kB
> Anonymous: 79928 kB
> AnonHugePages: 24576 kB
> Swap:  12964 kB
> 
> RSS is 66% higher than PSS and 69% higher than private data
> 
> localhost ~ # cat /proc/21435/totmaps
> Rss:   97156 kB
> Pss:   50044 kB
> Shared_Clean:  21920 kB
> Shared_Dirty:  26400 kB
> Private_Clean: 0 kB
> Private_Dirty: 48836 kB
> Referenced:90012 kB
> Anonymous: 75228 kB
> AnonHugePages: 24576 kB
> Swap:  13064 kB
> 
> RSS is 94% higher than PSS and 98% higher than private data.
> 
> It looks like there's a set of about 40MB of shared pages which cause
> the difference in this case.
> Swap was roughly even on these but I don't think it's always going to be true.

OK, I see that those processes differ in how they are using memory, but
I am not really sure what kind of conclusion you can draw from that.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-24 Thread Marcin Jabrzyk



On 23/08/16 00:44, Sonny Rao wrote:

On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko  wrote:

On Fri 19-08-16 10:57:48, Sonny Rao wrote:

On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko  wrote:

On Thu 18-08-16 23:43:39, Sonny Rao wrote:

On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:

On Thu 18-08-16 10:47:57, Sonny Rao wrote:

On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:

On Wed 17-08-16 11:57:56, Sonny Rao wrote:

[...]

2) User space OOM handling -- we'd rather do a more graceful shutdown
than let the kernel's OOM killer activate and need to gather this
information and we'd like to be able to get this information to make
the decision much faster than 400ms


Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL and in fact this is
already possible with memory cgroup controller (btw. memcg will give
you a cheap access to rss, amount of shared, swapped out memory as
well). Anyway if you are getting close to the OOM your system will most
probably be really busy and chances are that also reading your new file
will take much more time. I am also not quite sure how is pss useful for
oom decisions.


I mentioned it before, but based on experience RSS just isn't good
enough -- there's too much sharing going on in our use case to make
the correct decision based on RSS.  If RSS were good enough, simply
put, this patch wouldn't exist.


But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?


We use PSS to calculate the memory used by a process among all the
processes in the system, in the case of Chrome this tells us how much
each renderer process (which is roughly tied to a particular "tab" in
Chrome) is using and how much it has swapped out, so we know what the
worst offenders are -- I'm not sure what's unclear about that?


So let me ask more specifically. How can you make any decision based on
the pss when you do not know _what_ is the shared resource. In other
words if you select a task to terminate based on the pss then you have to
kill others who share the same resource otherwise you do not release
that shared resource. Not to mention that such a shared resource might
be on tmpfs/shmem and it won't get released even after all processes
which map it are gone.


Ok I see why you're confused now, sorry.

In our case we do know what is being shared, in general, because
the sharing is mostly between those processes that we're looking at
and not other random processes or tmpfs, so PSS gives us useful data
in the context of these processes which are sharing the data
especially for monitoring between the set of these renderer processes.


OK, I see and agree that pss might be useful when you _know_ what is
shared. But this sounds quite specific to a particular workload. How
many users are in a similar situation? In other words, if we present
a single number without the context, how much useful it will be in
general? Is it possible that presenting such a number could be even
misleading for somebody who doesn't have an idea which resources are
shared? These are all questions which should be answered before we
actually add this number (be it a new/existing proc file or a syscall).
I still believe that the number without wider context is just not all
that useful.



I see the specific point about  PSS -- because you need to know what
is being shared or otherwise use it in a whole system context, but I
still think the whole system context is a valid and generally useful
thing.  But what about the private_clean and private_dirty?  Surely
those are more generally useful for calculating a lower bound on
process memory usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade offs we are willing to make.
RSS is the cheapest but the most coarse.

PSS (with the correct context) and Private data plus swap are much
better but also more expensive due to the PT walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg.  Swap seems to be available in /proc/<pid>/status.

I looked at the "shared" value in /proc//statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes.  On the
system I was looking at there were about 40 of these processes, but I
picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss:   98972 kB
Pss:   54717 kB
Shared_Clean:  19020 kB
Shared_Dirty:  26352 kB
Private_Clean: 0 kB
Private_Dirty: 53600 kB
Referenced:92184 kB
Anonymous: 46524 kB
AnonHugePages: 24576 kB
Swap:  13148 kB


RSS is 80% higher than PSS and 


Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

2016-08-23 Thread Rik van Riel
On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
> On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> > On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > > [...]
> > > > > I have no idea why those numbers are so different on my
> > > > > laptop
> > > > > yet. It surely looks suspicious. I will try to debug this
> > > > > further
> > > > > tomorrow.
> > > > 
> > > > Hmm, so I've tried to use my version of awk on other machine
> > > > and vice
> > > > versa and it didn't make any difference. So this is independent
> > > > on the
> > > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0,
> > > > {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > > > 
> > > > so the kernel indeed reports 0 user time for some reason. Note
> > > > I
> > > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no
> > > > local
> > > > modifications). The other machine which reports non-0 utime is
> > > > 3.12
> > > > SLES kernel. Maybe I am hitting some accounting bug. At first I
> > > > was
> > > > suspecting CONFIG_NO_HZ_FULL because that is the main
> > > > difference between
> > > > my and the other machine but then I've noticed that the tests I
> > > > was
> > > > doing in kvm have this disabled too.. so it must be something
> > > > else.
> > > 
> > > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is
> > > the same
> > > in both kernels.
> > 
> > and one more thing. It is not like utime accounting would be
> > completely
> > broken and always report 0. Other commands report non-0 values even
> > on
> > 4.6 kernels. I will try to bisect this down later today.
> 
> OK, so it seems I found it. I was quite lucky because
> account_user_time
> is not all that popular function and there were basically no changes
> > besides Rik's ff9a9b4c4334 ("sched, time: Switch
> VIRT_CPU_ACCOUNTING_GEN
> to jiffy granularity") and that seems to cause the regression.
> Reverting
> the commit on top of the current mmotm seems to fix the issue for me.
> 
> And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> -v
> (source for ./max_mmap is [1])
> 
> root@test1:~# uname -r
> 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> root@test1:~# ./max_map 
> pid:2990 maps:65515
> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> rss:263368 pss:262203
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> User time (seconds): 0.00
> System time (seconds): 0.45
> Percent of CPU this job got: 98%
> 

> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> rss:263316 pss:262199
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> User time (seconds): 0.18
> System time (seconds): 0.29
> Percent of CPU this job got: 97%

The patch in question makes user and system
time accounting essentially tick-based. If
jiffies changes while the task is in user
mode, time gets accounted as user time, if
jiffies changes while the task is in system
mode, time gets accounted as system time.

If you get "unlucky", with a job like the
above, it is possible all time gets accounted
to system time.

This would be true both with the system running
with a periodic timer tick (before and after my
patch is applied), and in nohz_idle mode (after
my patch).
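
Illustrative simulation (not from the thread) of the sampling effect
described above: a task alternates 3 ms of user-mode work and 1 ms of
system-mode work, and the tick (4 ms here, i.e. an assumed HZ=250)
happens to fire while the task is always in system mode, so tick-based
accounting charges everything to stime even though 75% of the time was
spent in user mode.

#include <stdio.h>

int main(void)
{
	const double tick_ms = 4.0;	/* assumed HZ=250 -> 4 ms per tick */
	const double user_ms = 3.0;	/* per period: 3 ms in user mode */
	const double sys_ms  = 1.0;	/* per period: 1 ms in system mode */
	double utime = 0, stime = 0;

	/* 500 periods of (user 3 ms, sys 1 ms) = 2 s total.  Each period ends
	 * exactly on a tick boundary, so the sampler always catches the task
	 * in system mode and charges the whole tick to stime. */
	for (int period = 0; period < 500; period++) {
		double tick_time = (period + 1) * tick_ms;
		double sys_start = period * (user_ms + sys_ms) + user_ms;
		double sys_end   = sys_start + sys_ms;

		if (tick_time >= sys_start && tick_time <= sys_end)
			stime += tick_ms;	/* tick landed in system mode */
		else
			utime += tick_ms;	/* tick landed in user mode */
	}
	printf("sampled: utime=%.2fs stime=%.2fs (actual: 1.50s user, 0.50s system)\n",
	       utime / 1000, stime / 1000);
	return 0;
}

Shift the phase so the tick lands in the user part instead and the bias
flips the other way; the point is only that a tick sampler can be fooled
systematically by work that is phase-locked to the tick.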

However, it does seem quite unlikely that you
get zero user time, since you have 125 timer
ticks in half a second. Furthermore, you do not
even have NO_HZ_FULL enabled...

Does the workload consistently get zero user
time?

If so, we need to dig further to see under
what precise circumstances that happens.

On my laptop, with kernel 4.6.3-300.fc24.x86_64
I get this:

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
"rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
rss:263368 pss:262145
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
{printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
User time (seconds): 0.64
System time (seconds): 0.19
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83

The main difference between your and my
NO_HZ config seems to be that NO_HZ_FULL
is set here. However, it is not enabled
at run time, so both of our systems
should only really get NO_HZ_IDLE
effectively.

Running tasks should get sampled with the
regular timer tick, while they are running.

In other words, vtime accounting should be
disabled in both of our tests, for everything
except the idle task.

Do I need to do 


utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

2016-08-23 Thread Michal Hocko
On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > [...]
> > > > I have no idea why those numbers are so different on my laptop
> > > > yet. It surely looks suspicious. I will try to debug this further
> > > > tomorrow.
> > > 
> > > Hmm, so I've tried to use my version of awk on other machine and vice
> > > versa and it didn't make any difference. So this is independent on the
> > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
> > > ru_stime={0, 688438}, ...}) = 9128
> > > 
> > > so the kernel indeed reports 0 user time for some reason. Note I
> > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > > modifications). The other machine which reports non-0 utime is 3.12
> > > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > > my and the other machine but then I've noticed that the tests I was
> > > doing in kvm have this disabled too.. so it must be something else.
> > 
> > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> > in both kernels.
> 
> and one more thing. It is not like utime accounting would be completely
> broken and always report 0. Other commands report non-0 values even on
> 4.6 kernels. I will try to bisect this down later today.

OK, so it seems I found it. I was quite lucky because account_user_time
is not all that popular function and there were basically no changes
besides Rik's ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
to jiffy granularity") and that seems to cause the regression. Reverting
the commit on top of the current mmotm seems to fix the issue for me.

And just to give Rik more context. While debugging overhead of the
/proc/<pid>/smaps I am getting a misleading output from /usr/bin/time -v
(source for ./max_mmap is [1])

root@test1:~# uname -r
4.5.0-rc6-bisect1-00025-gff9a9b4c4334
root@test1:~# ./max_map 
pid:2990 maps:65515
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
rss:263368 pss:262203
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
User time (seconds): 0.00
System time (seconds): 0.45
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1796
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 83
Voluntary context switches: 6
Involuntary context switches: 6
Swaps: 0
File system inputs: 248
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

See the User time being 0 (as you can see above in the quoted text it
is not a rounding error in userspace or something similar because wait4
really returns 0). Now with the revert
root@test1:~# uname -r
4.5.0-rc6-revert-00026-g7fc86f968bf5
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
rss:263316 pss:262199
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
User time (seconds): 0.18
System time (seconds): 0.29
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1760
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 79
Voluntary context switches: 5
Involuntary context switches: 7
Swaps: 0
File system inputs: 248
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

So it looks like the whole user time is accounted as the system time.
My config is attached and yes I do have CONFIG_VIRT_CPU_ACCOUNTING_GEN
enabled. Could you have a look please?

[1] http://lkml.kernel.org/r/20160817082200.ga10...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


.config.gz
Description: application/gzip


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-23 Thread Michal Hocko
On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > [...]
> > > I have no idea why those numbers are so different on my laptop
> > > yet. It surely looks suspicious. I will try to debug this further
> > > tomorrow.
> > 
> > Hmm, so I've tried to use my version of awk on other machine and vice
> > versa and it didn't make any difference. So this is independent on the
> > awk version it seems. So I've tried to strace /usr/bin/time and
> > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
> > ru_stime={0, 688438}, ...}) = 9128
> > 
> > so the kernel indeed reports 0 user time for some reason. Note I
> > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > modifications). The other machine which reports non-0 utime is 3.12
> > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > my and the other machine but then I've noticed that the tests I was
> > doing in kvm have this disabled too.. so it must be something else.
> 
> 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> in both kernels.

and one more thing. It is not like utime accounting would be completely
broken and always report 0. Other commands report non-0 values even on
4.6 kernels. I will try to bisect this down later today.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-22 Thread Sonny Rao
On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko  wrote:
> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko  wrote:
>> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:
>> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  
>> >> >> wrote:
>> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> >> > [...]
>> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful 
>> >> >> >> shutdown
>> >> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> >> information and we'd like to be able to get this information to make
>> >> >> >> the decision much faster than 400ms
>> >> >> >
>> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> >> > understand you want something better than SIGKILL and in fact this is
>> >> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> >> > well). Anyway if you are getting close to the OOM your system will 
>> >> >> > most
>> >> >> > probably be really busy and chances are that also reading your new 
>> >> >> > file
>> >> >> > will take much more time. I am also not quite sure how is pss useful 
>> >> >> > for
>> >> >> > oom decisions.
>> >> >>
>> >> >> I mentioned it before, but based on experience RSS just isn't good
>> >> >> enough -- there's too much sharing going on in our use case to make
>> >> >> the correct decision based on RSS.  If RSS were good enough, simply
>> >> >> put, this patch wouldn't exist.
>> >> >
>> >> > But that doesn't answer my question, I am afraid. So how exactly do you
>> >> > use pss for oom decisions?
>> >>
>> >> We use PSS to calculate the memory used by a process among all the
>> >> processes in the system, in the case of Chrome this tells us how much
>> >> each renderer process (which is roughly tied to a particular "tab" in
>> >> Chrome) is using and how much it has swapped out, so we know what the
>> >> worst offenders are -- I'm not sure what's unclear about that?
>> >
>> > So let me ask more specifically. How can you make any decision based on
>> > the pss when you do not know _what_ is the shared resource. In other
>> > words if you select a task to terminate based on the pss then you have to
>> > kill others who share the same resource otherwise you do not release
>> > that shared resource. Not to mention that such a shared resource might
>> > be on tmpfs/shmem and it won't get released even after all processes
>> > which map it are gone.
>>
>> Ok I see why you're confused now, sorry.
>>
>> In our case that we do know what is being shared in general because
>> the sharing is mostly between those processes that we're looking at
>> and not other random processes or tmpfs, so PSS gives us useful data
>> in the context of these processes which are sharing the data
>> especially for monitoring between the set of these renderer processes.
>
> OK, I see and agree that pss might be useful when you _know_ what is
> shared. But this sounds quite specific to a particular workload. How
> many users are in a similar situation? In other words, if we present
> a single number without the context, how much useful it will be in
> general? Is it possible that presenting such a number could be even
> misleading for somebody who doesn't have an idea which resources are
> shared? These are all questions which should be answered before we
> actually add this number (be it a new/existing proc file or a syscall).
> I still believe that the number without wider context is just not all
> that useful.


I see the specific point about  PSS -- because you need to know what
is being shared or otherwise use it in a whole system context, but I
still think the whole system context is a valid and generally useful
thing.  But what about the private_clean and private_dirty?  Surely
those are more generally useful for calculating a lower bound on
process memory usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade offs we are willing to make.
RSS is the cheapest but the most coarse.

PSS (with the correct context) and Private data plus swap are much
better but also more expensive due to the PT walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg.  Swap seems to be available in /proc//status.

I looked at the "shared" value in /proc//statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes.  On the
system I was looking at there were about 40 of these processes, but I

Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Sonny Rao
On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko  wrote:
> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko  wrote:
>> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:
>> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  
>> >> >> wrote:
>> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> >> > [...]
>> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful 
>> >> >> >> shutdown
>> >> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> >> information and we'd like to be able to get this information to make
>> >> >> >> the decision much faster than 400ms
>> >> >> >
>> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> >> > understand you want something better than SIGKILL and in fact this is
>> >> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> >> > well). Anyway if you are getting close to the OOM your system will 
>> >> >> > most
>> >> >> > probably be really busy and chances are that also reading your new 
>> >> >> > file
>> >> >> > will take much more time. I am also not quite sure how is pss useful 
>> >> >> > for
>> >> >> > oom decisions.
>> >> >>
>> >> >> I mentioned it before, but based on experience RSS just isn't good
>> >> >> enough -- there's too much sharing going on in our use case to make
>> >> >> the correct decision based on RSS.  If RSS were good enough, simply
>> >> >> put, this patch wouldn't exist.
>> >> >
>> >> > But that doesn't answer my question, I am afraid. So how exactly do you
>> >> > use pss for oom decisions?
>> >>
>> >> We use PSS to calculate the memory used by a process among all the
>> >> processes in the system, in the case of Chrome this tells us how much
>> >> each renderer process (which is roughly tied to a particular "tab" in
>> >> Chrome) is using and how much it has swapped out, so we know what the
>> >> worst offenders are -- I'm not sure what's unclear about that?
>> >
>> > So let me ask more specifically. How can you make any decision based on
>> > the pss when you do not know _what_ is the shared resource. In other
>> > words if you select a task to terminate based on the pss then you have to
>> > kill others who share the same resource otherwise you do not release
>> > that shared resource. Not to mention that such a shared resource might
>> > be on tmpfs/shmem and it won't get released even after all processes
>> > which map it are gone.
>>
>> Ok I see why you're confused now, sorry.
>>
>> In our case that we do know what is being shared in general because
>> the sharing is mostly between those processes that we're looking at
>> and not other random processes or tmpfs, so PSS gives us useful data
>> in the context of these processes which are sharing the data
>> especially for monitoring between the set of these renderer processes.
>
> OK, I see and agree that pss might be useful when you _know_ what is
> shared. But this sounds quite specific to a particular workload. How
> many users are in a similar situation? In other words, if we present
> a single number without the context, how useful will it be in
> general? Is it possible that presenting such a number could be even
> misleading for somebody who doesn't have an idea which resources are
> shared? These are all questions which should be answered before we
> actually add this number (be it a new/existing proc file or a syscall).
> I still believe that the number without wider context is just not all
> that useful.


I see the specific point about PSS -- because you need to know what
is being shared, or otherwise use it in a whole-system context -- but I
still think the whole-system context is a valid and generally useful
thing.  But what about the private_clean and private_dirty?  Surely
those are more generally useful for calculating a lower bound on
process memory usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade offs we are willing to make.
RSS is the cheapest but the most coarse.

PSS (with the correct context) and Private data plus swap are much
better but also more expensive due to the PT walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg.  Swap seems to be available in /proc/<pid>/status.
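
As a rough illustration of that kind of summing, here is a minimal
user-space sketch over /proc/<pid>/smaps (the field names are the standard
smaps keys; everything else -- the program, its error handling, the choice
of fields -- is illustrative only, not something Chrome actually ships):

/* Illustrative only: sum a few /proc/<pid>/smaps fields the way a
 * monitoring tool might. */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long long kb, pss = 0, priv = 0, swap = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Pss: %llu kB", &kb) == 1)
			pss += kb;
		else if (sscanf(line, "Private_Clean: %llu kB", &kb) == 1 ||
			 sscanf(line, "Private_Dirty: %llu kB", &kb) == 1)
			priv += kb;
		else if (sscanf(line, "Swap: %llu kB", &kb) == 1)
			swap += kb;
	}
	fclose(f);
	printf("Pss: %llu kB  Private: %llu kB  Swap: %llu kB\n", pss, priv, swap);
	return 0;
}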

I looked at the "shared" value in /proc/<pid>/statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes.  On the
system I was looking at there were about 40 of these processes, but I
picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss:   

Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> [...]
> > I have no idea why those numbers are so different on my laptop
> > yet. It surely looks suspicious. I will try to debug this further
> > tomorrow.
> 
> Hmm, so I've tried to use my version of awk on other machine and vice
> versa and it didn't make any difference. So this is independent on the
> awk version it seems. So I've tried to strace /usr/bin/time and
> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
> ru_stime={0, 688438}, ...}) = 9128
> 
> so the kernel indeed reports 0 user time for some reason. Note I
> was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> modifications). The other machine which reports non-0 utime is 3.12
> SLES kernel. Maybe I am hitting some accounting bug. At first I was
> suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> my and the other machine but then I've noticed that the tests I was
> doing in kvm have this disabled too.. so it must be something else.

4.5 reports non-0 utime while 4.6 reports zero. The NO_HZ configuration is
the same in both kernels.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 18:45:54, Michal Hocko wrote:
[...]
> I have no idea why those numbers are so different on my laptop
> yet. It surely looks suspicious. I will try to debug this further
> tomorrow.

Hmm, so I've tried to use my version of awk on other machine and vice
versa and it didn't make any difference. So this is independent of the
awk version it seems. So I've tried to strace /usr/bin/time and
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
ru_stime={0, 688438}, ...}) = 9128

so the kernel indeed reports 0 user time for some reason. Note I
was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
modifications). The other machine which reports non-0 utime is 3.12
SLES kernel. Maybe I am hitting some accounting bug. At first I was
suspecting CONFIG_NO_HZ_FULL because that is the main difference between
my machine and the other one, but then I noticed that the tests I was
doing in kvm have it disabled too... so it must be something else.

Weird...
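
For reference, the ru_utime/ru_stime split can be read straight from
wait4() without strace or /usr/bin/time; a minimal sketch (the busy loop
is arbitrary, just enough to accumulate some user time, and whether utime
really shows up as 0 will depend on the kernel being tested):

/* Sketch: fork a child that burns pure user CPU time, then print what
 * wait4() reports for it. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct rusage ru;
	pid_t pid = fork();

	if (pid == 0) {
		volatile unsigned long x = 0;
		unsigned long i;

		for (i = 0; i < 200000000UL; i++)	/* user-space busy loop */
			x += i;
		_exit(0);
	}
	if (wait4(pid, NULL, 0, &ru) < 0) {
		perror("wait4");
		return 1;
	}
	printf("utime %ld.%06lds stime %ld.%06lds\n",
	       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
	       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
	return 0;
}
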
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 23:12:41, Minchan Kim wrote:
> On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote:
> > On Mon 22-08-16 09:07:45, Minchan Kim wrote:
> > [...]
> > > #!/bin/sh
> > > ./smap_test &
> > > pid=$!
> > > 
> > > for i in $(seq 25)
> > > do
> > > awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
> > >  /proc/$pid/smaps
> > > done
> > > kill $pid
> > > 
> > > root@bbox:/home/barrios/test/smap# time ./s.sh 
> > > pid:21973
> > > 
> > > real    0m17.812s
> > > user    0m12.612s
> > > sys     0m5.187s
> > 
> > retested on the bare metal (x86_64 - 2CPUs)
> > Command being timed: "sh s.sh"
> > User time (seconds): 0.00
> > System time (seconds): 18.08
> > Percent of CPU this job got: 98%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29
> > 
> > multiple runs are quite consistent in those numbers. I am running with
> > $ awk --version
> > GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
> > 
> > > > like a problem we are not able to address. And I would even argue that
> > > > we want to address it in a generic way as much as possible.
> > > 
> > > Sure. What solution do you think as generic way?
> > 
> > either optimize seq_printf or replace it with something faster.
> 
> If it's real culprit, I agree. However, I tested your test program on
> my 2 x86 machines and my friend's machine.
> 
> Ubuntu, Fedora, Arch
> 
> They have awk 4.0.1 and 4.1.3.
> 
> Results are the same. Userspace spends more time, as I mentioned.
> 
> [root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END 
> {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
> rss:263484 pss:262188
> 
> real    0m0.770s
> user    0m0.574s
> sys     0m0.197s
> 
> I will attach my test program source.
> I hope you guys can test and repost the results, because that is key for the
> direction of the patchset.

Hmm, this is really interesting. I have checked a different machine and
it shows different results. Same code, slightly different version of awk
(4.1.0), and the results are different:
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/48925/smaps"
User time (seconds): 0.43
System time (seconds): 0.27

I have no idea why those numbers are so different on my laptop
yet. It surely looks suspicious. I will try to debug this further
tomorrow. Anyway, the performance is just one side of the problem. I
have tried to express my concerns about a single exported pss value in
another email. Please try to step back and think about how useful this
information is without knowing which resource we are talking about.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Robert Foss



On 2016-08-22 10:12 AM, Minchan Kim wrote:

On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote:

On Mon 22-08-16 09:07:45, Minchan Kim wrote:
[...]

#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
 /proc/$pid/smaps
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s.sh
pid:21973

real    0m17.812s
user    0m12.612s
sys     0m5.187s


retested on the bare metal (x86_64 - 2CPUs)
Command being timed: "sh s.sh"
User time (seconds): 0.00
System time (seconds): 18.08
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29

multiple runs are quite consistent in those numbers. I am running with
$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)



$ ./smap_test &
pid:19658 nr_vma:65514

$ time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss}' /proc/19658/smaps

rss:263452 pss:262151

real    0m0.625s
user    0m0.404s
sys     0m0.216s

$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)


like a problem we are not able to address. And I would even argue that
we want to address it in a generic way as much as possible.


Sure. What solution do you think as generic way?


either optimize seq_printf or replace it with something faster.


If it's real culprit, I agree. However, I tested your test program on
my 2 x86 machines and my friend's machine.

Ubuntu, Fedora, Arch

They have awk 4.0.1 and 4.1.3.

Results are the same. Userspace spends more time, as I mentioned.

[root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
rss:263484 pss:262188

real    0m0.770s
user    0m0.574s
sys     0m0.197s

I will attach my test program source.
I hope you guys can test and repost the results, because that is key for the
direction of the patchset.

Thanks.



Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Minchan Kim
On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote:
> On Mon 22-08-16 09:07:45, Minchan Kim wrote:
> [...]
> > #!/bin/sh
> > ./smap_test &
> > pid=$!
> > 
> > for i in $(seq 25)
> > do
> > awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
> >  /proc/$pid/smaps
> > done
> > kill $pid
> > 
> > root@bbox:/home/barrios/test/smap# time ./s.sh 
> > pid:21973
> > 
> > real    0m17.812s
> > user    0m12.612s
> > sys     0m5.187s
> 
> retested on the bare metal (x86_64 - 2CPUs)
> Command being timed: "sh s.sh"
> User time (seconds): 0.00
> System time (seconds): 18.08
> Percent of CPU this job got: 98%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29
> 
> multiple runs are quite consistent in those numbers. I am running with
> $ awk --version
> GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
> 
> > > like a problem we are not able to address. And I would even argue that
> > > we want to address it in a generic way as much as possible.
> > 
> > Sure. What solution do you think as generic way?
> 
> either optimize seq_printf or replace it with something faster.

If it's the real culprit, I agree. However, I tested your test program on
my 2 x86 machines and my friend's machine.

Ubuntu, Fedora, Arch

They have awk 4.0.1 and 4.1.3.

Results are the same. Userspace spends more time, as I mentioned.

[root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
rss:263484 pss:262188

> real    0m0.770s
> user    0m0.574s
> sys     0m0.197s

I will attach my test program source.
I hope you guys can test and repost the results, because that is key for the
direction of the patchset.

Thanks.
/* Test program from the thread: create as many 1-page MAP_SHARED anonymous
 * mappings as the kernel allows (each one stays a separate VMA, so the count
 * is bounded by vm.max_map_count), print the count, then sleep so that
 * /proc/<pid>/smaps can be read while the mappings are still alive. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	unsigned long nr_vma = 0;

	while (1) {
		if (mmap(0, 4096, PROT_READ|PROT_WRITE,
			 MAP_ANONYMOUS|MAP_SHARED|MAP_POPULATE, -1, 0) == MAP_FAILED)
			break;
		nr_vma++;
	}

	printf("pid:%d nr_vma:%lu\n", getpid(), nr_vma);
	pause();
	return 0;
}



Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Fri 19-08-16 10:57:48, Sonny Rao wrote:
> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko  wrote:
> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:
> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  
> >> >> wrote:
> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> >> > [...]
> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> >> >> than let the kernel's OOM killer activate and need to gather this
> >> >> >> information and we'd like to be able to get this information to make
> >> >> >> the decision much faster than 400ms
> >> >> >
> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
> >> >> > understand you want something better than SIGKILL and in fact this is
> >> >> > already possible with memory cgroup controller (btw. memcg will give
> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
> >> >> > well). Anyway if you are getting close to the OOM your system will 
> >> >> > most
> >> >> > probably be really busy and chances are that also reading your new 
> >> >> > file
> >> >> > will take much more time. I am also not quite sure how is pss useful 
> >> >> > for
> >> >> > oom decisions.
> >> >>
> >> >> I mentioned it before, but based on experience RSS just isn't good
> >> >> enough -- there's too much sharing going on in our use case to make
> >> >> the correct decision based on RSS.  If RSS were good enough, simply
> >> >> put, this patch wouldn't exist.
> >> >
> >> > But that doesn't answer my question, I am afraid. So how exactly do you
> >> > use pss for oom decisions?
> >>
> >> We use PSS to calculate the memory used by a process among all the
> >> processes in the system, in the case of Chrome this tells us how much
> >> each renderer process (which is roughly tied to a particular "tab" in
> >> Chrome) is using and how much it has swapped out, so we know what the
> >> worst offenders are -- I'm not sure what's unclear about that?
> >
> > So let me ask more specifically. How can you make any decision based on
> > the pss when you do not know _what_ is the shared resource. In other
> > words if you select a task to terminate based on the pss then you have to
> > kill others who share the same resource otherwise you do not release
> > that shared resource. Not to mention that such a shared resource might
> > be on tmpfs/shmem and it won't get released even after all processes
> > which map it are gone.
> 
> Ok I see why you're confused now, sorry.
> 
> In our case that we do know what is being shared in general because
> the sharing is mostly between those processes that we're looking at
> and not other random processes or tmpfs, so PSS gives us useful data
> in the context of these processes which are sharing the data
> especially for monitoring between the set of these renderer processes.

OK, I see and agree that pss might be useful when you _know_ what is
shared. But this sounds quite specific to a particular workload. How
many users are in a similar situation? In other words, if we present
a single number without the context, how useful will it be in
general? Is it possible that presenting such a number could be even
misleading for somebody who doesn't have an idea which resources are
shared? These are all questions which should be answered before we
actually add this number (be it a new/existing proc file or a syscall).
I still believe that the number without wider context is just not all
that useful.

> We also use the private clean and private dirty and swap fields to
> make a few metrics for the processes and charge each process for its
> private, shared, and swap data. Private clean and dirty are used for
> estimating a lower bound on how much memory would be freed.

I can imagine that this kind of information might be useful and
presented in /proc/<pid>/statm. The question is whether some of the
existing consumers would see the performance impact due to the page table
walk. Anyway even these counters might get quite tricky because even
shareable resources are considered private if the process is the only
one to map them (so again this might be a file on tmpfs...).
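
For comparison, consuming the existing statm counters is about as cheap as
it gets on the reader side; a small illustrative reader (the seven fields
and their meaning are as documented in proc(5), the rest is made up for the
example):

/* Illustrative reader for the existing /proc/<pid>/statm counters.
 * Fields are in pages: size resident shared text lib data dt (proc(5));
 * "shared" counts resident shared pages, i.e. those backed by a file. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64];
	unsigned long size, resident, shared, text, lib, data, dt;
	long pagesz = sysconf(_SC_PAGESIZE);
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/statm", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%lu %lu %lu %lu %lu %lu %lu",
		   &size, &resident, &shared, &text, &lib, &data, &dt) != 7) {
		fprintf(stderr, "unexpected statm format\n");
		return 1;
	}
	printf("resident: %lu kB  shared: %lu kB\n",
	       resident * pagesz / 1024, shared * pagesz / 1024);
	return 0;
}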

> Swap and
> PSS also give us some indication of additional memory which might get
> freed up.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 09:07:45, Minchan Kim wrote:
[...]
> #!/bin/sh
> ./smap_test &
> pid=$!
> 
> for i in $(seq 25)
> do
> awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
>  /proc/$pid/smaps
> done
> kill $pid
> 
> root@bbox:/home/barrios/test/smap# time ./s.sh 
> pid:21973
> 
> real0m17.812s
> user0m12.612s
> sys 0m5.187s

retested on the bare metal (x86_64 - 2CPUs)
Command being timed: "sh s.sh"
User time (seconds): 0.00
System time (seconds): 18.08
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29

multiple runs are quite consistent in those numbers. I am running with
$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

> > like a problem we are not able to address. And I would even argue that
> > we want to address it in a generic way as much as possible.
> 
> Sure. What solution do you think as generic way?

either optimize seq_printf or replace it with something faster.
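
As a rough, purely user-space illustration of why the formatting itself is
the expensive part (this is plain libc, not the kernel's seq_file path, and
the iteration count and field layout are arbitrary), compare formatting an
smaps-style line with snprintf against copying a pre-built one:

/* Toy comparison: format a fake smaps-style line ten million times vs.
 * copying a pre-built string the same number of times. */
#include <stdio.h>
#include <string.h>
#include <time.h>

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	static const char fixed[] = "Rss:                 484 kB\n";
	char buf[64];
	volatile size_t sink = 0;
	double t0, t1, t2;
	long i;

	t0 = now();
	for (i = 0; i < 10000000; i++)
		sink += snprintf(buf, sizeof(buf), "%s:%*lu kB\n",
				 "Rss", 21, 484UL + (i & 7));
	t1 = now();
	for (i = 0; i < 10000000; i++) {
		memcpy(buf, fixed, sizeof(fixed));
		sink += sizeof(fixed) - 1;
	}
	t2 = now();
	printf("snprintf: %.2fs  memcpy: %.2fs\n", t1 - t0, t2 - t1);
	return 0;
}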

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-21 Thread Minchan Kim
On Fri, Aug 19, 2016 at 10:05:32AM +0200, Michal Hocko wrote:
> On Fri 19-08-16 11:26:34, Minchan Kim wrote:
> > Hi Michal,
> > 
> > On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> > > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  
> > > > wrote:
> > > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > > [...]
> > > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > > > >> than let the kernel's OOM killer activate and need to gather this
> > > > >> information and we'd like to be able to get this information to make
> > > > >> the decision much faster than 400ms
> > > > >
> > > > > Global OOM handling in userspace is really dubious if you ask me. I
> > > > > understand you want something better than SIGKILL and in fact this is
> > > > > already possible with memory cgroup controller (btw. memcg will give
> > > > > you a cheap access to rss, amount of shared, swapped out memory as
> > > > > well). Anyway if you are getting close to the OOM your system will 
> > > > > most
> > > > > probably be really busy and chances are that also reading your new 
> > > > > file
> > > > > will take much more time. I am also not quite sure how is pss useful 
> > > > > for
> > > > > oom decisions.
> > > > 
> > > > I mentioned it before, but based on experience RSS just isn't good
> > > > enough -- there's too much sharing going on in our use case to make
> > > > the correct decision based on RSS.  If RSS were good enough, simply
> > > > put, this patch wouldn't exist.
> > > 
> > > But that doesn't answer my question, I am afraid. So how exactly do you
> > > use pss for oom decisions?
> > 
> > My case is not for OOM decision but I agree it would be great if we can get
> > *fast* smap summary information.
> > 
> > PSS is really great tool to figure out how processes consume memory
> > more exactly rather than RSS. We have been using it for monitoring
> > of memory for per-process. Although it is not used for OOM decision,
> > it would be great if it is speed up because we don't want to spend
> > many CPU time for just monitoring.
> > 
> > For our usecase, we don't need AnonHugePages, ShmemPmdMapped, 
> > Shared_Hugetlb,
> > Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> > hugetlb. Additionally, Locked can be known via vma flags so we don't need 
> > it,
> > either. Even, we don't need address range for just monitoring when we don't
> > investigate in detail.
> > 
> > Although they are not severe overhead, why does it emit the useless
> > information? Even bloat day by day. :( With that, userspace tools should
> > spend more time to parse which is pointless.
> 
> So far it doesn't really seem that the parsing is the biggest problem.
> The major cycles killer is the output formatting and that doesn't sound

I cannot understand how kernel space is more expensive.
Hmm. I tested your test program on my machine.


#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
cat /proc/$pid/smaps > /dev/null
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s_v.sh
pid:21925
real    0m3.365s
user    0m0.031s
sys     0m3.046s


vs.

#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
 /proc/$pid/smaps
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s.sh 
pid:21973

real    0m17.812s
user    0m12.612s
sys     0m5.187s

perf report says

39.56%  awk  gawk               [.] dfaexec
 7.61%  awk  [kernel.kallsyms]  [k] format_decode
 6.37%  awk  gawk               [.] avoid_dfa
 5.85%  awk  gawk               [.] interpret
 5.69%  awk  [kernel.kallsyms]  [k] __memcpy
 4.37%  awk  [kernel.kallsyms]  [k] vsnprintf
 2.69%  awk  [kernel.kallsyms]  [k] number.isra.13
 2.10%  awk  gawk               [.] research
 1.91%  awk  gawk               [.] 0x000351d0
 1.49%  awk  gawk               [.] free_wstr
 1.27%  awk  gawk               [.] unref
 1.19%  awk  gawk               [.] reset_record
 0.95%  awk  gawk               [.] set_record
 0.95%  awk  gawk               [.] get_field
 0.94%  awk  [kernel.kallsyms]  [k] show_smap

Parsing is much more expensive than the kernel side.
Could you retest your test program?

> like a problem we are not able to address. And I would even argue that
> we want to address it in a generic way as much as possible.

Sure. What solution do you think of as the generic way?

Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Sonny Rao
On Fri, Aug 19, 2016 at 1:05 AM, Michal Hocko  wrote:
> On Fri 19-08-16 11:26:34, Minchan Kim wrote:
>> Hi Michal,
>>
>> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
>> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
>> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> > [...]
>> > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> > > >> than let the kernel's OOM killer activate and need to gather this
>> > > >> information and we'd like to be able to get this information to make
>> > > >> the decision much faster than 400ms
>> > > >
>> > > > Global OOM handling in userspace is really dubious if you ask me. I
>> > > > understand you want something better than SIGKILL and in fact this is
>> > > > already possible with memory cgroup controller (btw. memcg will give
>> > > > you a cheap access to rss, amount of shared, swapped out memory as
>> > > > well). Anyway if you are getting close to the OOM your system will most
>> > > > probably be really busy and chances are that also reading your new file
>> > > > will take much more time. I am also not quite sure how is pss useful 
>> > > > for
>> > > > oom decisions.
>> > >
>> > > I mentioned it before, but based on experience RSS just isn't good
>> > > enough -- there's too much sharing going on in our use case to make
>> > > the correct decision based on RSS.  If RSS were good enough, simply
>> > > put, this patch wouldn't exist.
>> >
>> > But that doesn't answer my question, I am afraid. So how exactly do you
>> > use pss for oom decisions?
>>
>> My case is not for OOM decision but I agree it would be great if we can get
>> *fast* smap summary information.
>>
>> PSS is really great tool to figure out how processes consume memory
>> more exactly rather than RSS. We have been using it for monitoring
>> of memory for per-process. Although it is not used for OOM decision,
>> it would be great if it is speed up because we don't want to spend
>> many CPU time for just monitoring.
>>
>> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
>> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
>> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
>> either. Even, we don't need address range for just monitoring when we don't
>> investigate in detail.
>>
>> Although they are not severe overhead, why does it emit the useless
>> information? Even bloat day by day. :( With that, userspace tools should
>> spend more time to parse which is pointless.
>
> So far it doesn't really seem that the parsing is the biggest problem.
> The major cycles killer is the output formatting and that doesn't sound
> like a problem we are not able to address. And I would even argue that
> we want to address it in a generic way as much as possible.
>
>> Having said that, I'm not fan of creating new stat knob for that, either.
>> How about appending summary information in the end of smap?
>> So, monitoring users can just open the file and lseek to the (end - 1) and
>> read the summary only.
>
> That might confuse existing parsers. Besides that we already have
> /proc/<pid>/statm which gives cumulative numbers already. I am not sure
> how often it is used and whether the pte walk is too expensive for
> existing users but that should be explored and evaluated before a new
> file is created.
>
> The /proc became a dump of everything people found interesting just
> because we were too easy to allow those additions. Do not repeat those
> mistakes, please!

Another thing I noticed was that we lock down smaps on Chromium OS.  I
think this is to avoid exposing more information than necessary via
proc.  The totmaps file gives us just the information we need and
nothing else.   I certainly don't think we need a proc file for this
use case -- do you think a new system call is better or something
else?

> --
> Michal Hocko
> SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Sonny Rao
On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko  wrote:
> On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:
>> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
>> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> > [...]
>> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> information and we'd like to be able to get this information to make
>> >> >> the decision much faster than 400ms
>> >> >
>> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> > understand you want something better than SIGKILL and in fact this is
>> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> > well). Anyway if you are getting close to the OOM your system will most
>> >> > probably be really busy and chances are that also reading your new file
>> >> > will take much more time. I am also not quite sure how is pss useful for
>> >> > oom decisions.
>> >>
>> >> I mentioned it before, but based on experience RSS just isn't good
>> >> enough -- there's too much sharing going on in our use case to make
>> >> the correct decision based on RSS.  If RSS were good enough, simply
>> >> put, this patch wouldn't exist.
>> >
>> > But that doesn't answer my question, I am afraid. So how exactly do you
>> > use pss for oom decisions?
>>
>> We use PSS to calculate the memory used by a process among all the
>> processes in the system, in the case of Chrome this tells us how much
>> each renderer process (which is roughly tied to a particular "tab" in
>> Chrome) is using and how much it has swapped out, so we know what the
>> worst offenders are -- I'm not sure what's unclear about that?
>
> So let me ask more specifically. How can you make any decision based on
> the pss when you do not know _what_ is the shared resource. In other
> words if you select a task to terminate based on the pss then you have to
> kill others who share the same resource otherwise you do not release
> that shared resource. Not to mention that such a shared resource might
> be on tmpfs/shmem and it won't get released even after all processes
> which map it are gone.

Ok I see why you're confused now, sorry.

In our case we do know, in general, what is being shared, because the
sharing is mostly between those processes that we're looking at and not
other random processes or tmpfs, so PSS gives us useful data in the
context of these processes which are sharing the data, especially for
monitoring across the set of these renderer processes.

We also use the private clean, private dirty, and swap fields to build
a few metrics for the processes and charge each process for its
private, shared, and swap data. Private clean and dirty are used for
estimating a lower bound on how much memory would be freed. Swap and
PSS also give us some indication of additional memory which might get
freed up.
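
To make that concrete, here is a minimal sketch of that kind of per-process
summing over /proc/<pid>/smaps, keyed on the standard Pss, Private_Clean,
Private_Dirty and Swap fields; the pid and the output format are placeholders,
not Chromium's actual tooling:

#include <stdio.h>
#include <string.h>

struct smaps_totals {
	unsigned long pss_kb, priv_clean_kb, priv_dirty_kb, swap_kb;
};

static int sum_smaps(int pid, struct smaps_totals *t)
{
	char path[64], line[256];
	unsigned long kb;
	FILE *f;

	memset(t, 0, sizeof(*t));
	snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* each mapping contributes one line per field; just sum them */
		if (sscanf(line, "Pss: %lu kB", &kb) == 1)
			t->pss_kb += kb;
		else if (sscanf(line, "Private_Clean: %lu kB", &kb) == 1)
			t->priv_clean_kb += kb;
		else if (sscanf(line, "Private_Dirty: %lu kB", &kb) == 1)
			t->priv_dirty_kb += kb;
		else if (sscanf(line, "Swap: %lu kB", &kb) == 1)
			t->swap_kb += kb;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	struct smaps_totals t;

	if (sum_smaps(1, &t) == 0)	/* pid 1 is only a placeholder */
		printf("Pss=%lu kB Private=%lu kB Swap=%lu kB\n",
		       t.pss_kb, t.priv_clean_kb + t.priv_dirty_kb, t.swap_kb);
	return 0;
}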

>
> I am sorry for being dense but it is still not clear to me how the
> single pss number can be used for oom or, in general, any serious
> decisions. The counter might be useful of course for debugging purposes
> or to have a general overview but then arguing about 40 vs 20ms sounds a
> bit strange to me.

Yeah, so it's more than just the single PSS number: PSS, Private_Clean,
Private_Dirty, and Swap are all interesting numbers for making these
decisions.

>
>> Chrome tends to use a lot of shared memory so we found PSS to be
>> better than RSS, and I can give you examples of the  RSS and PSS on
>> real systems to illustrate the magnitude of the difference between
>> those two numbers if that would be useful.
>>
>> >
>> >> So even with memcg I think we'd have the same problem?
>> >
>> > memcg will give you instant anon, shared counters for all processes in
>> > the memcg.
>> >
>>
>> We want to be able to get per-process granularity quickly.  I'm not
>> sure if memcg provides that exactly?
>
> I will give you that information if you do process-per-memcg but that
> doesn't sound ideal. I thought those 20-something processes you were
> talking about are treated together but it seems I misunderstood.
> --
> Michal Hocko
> SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Michal Hocko
On Fri 19-08-16 11:26:34, Minchan Kim wrote:
> Hi Michal,
> 
> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > > >> than let the kernel's OOM killer activate and need to gather this
> > > >> information and we'd like to be able to get this information to make
> > > >> the decision much faster than 400ms
> > > >
> > > > Global OOM handling in userspace is really dubious if you ask me. I
> > > > understand you want something better than SIGKILL and in fact this is
> > > > already possible with memory cgroup controller (btw. memcg will give
> > > > you a cheap access to rss, amount of shared, swapped out memory as
> > > > well). Anyway if you are getting close to the OOM your system will most
> > > > probably be really busy and chances are that also reading your new file
> > > > will take much more time. I am also not quite sure how is pss useful for
> > > > oom decisions.
> > > 
> > > I mentioned it before, but based on experience RSS just isn't good
> > > enough -- there's too much sharing going on in our use case to make
> > > the correct decision based on RSS.  If RSS were good enough, simply
> > > put, this patch wouldn't exist.
> > 
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
> 
> My case is not for OOM decision but I agree it would be great if we can get
> *fast* smap summary information.
> 
> PSS is really great tool to figure out how processes consume memory
> more exactly rather than RSS. We have been used it for monitoring
> of memory for per-process. Although it is not used for OOM decision,
> it would be great if it is speed up because we don't want to spend
> many CPU time for just monitoring.
> 
> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
> either. Even, we don't need address range for just monitoring when we don't
> investigate in detail.
> 
> Although they are not severe overhead, why does it emit the useless
> information? Even bloat day by day. :( With that, userspace tools should
> spend more time to parse which is pointless.

So far it doesn't really seem that the parsing is the biggest problem.
The major cycles killer is the output formatting and that doesn't sound
like a problem we are not able to address. And I would even argue that
we want to address it in a generic way as much as possible.

> Having said that, I'm not fan of creating new stat knob for that, either.
> How about appending summary information in the end of smap?
> So, monitoring users can just open the file and lseek to the (end - 1) and
> read the summary only.

That might confuse existing parsers. Besides that, we already have
/proc//statm which gives cumulative numbers. I am not sure how often
it is used and whether the pte walk is too expensive for existing
users, but that should be explored and evaluated before a new file is
created.
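
For reference, /proc/<pid>/statm is a single line of seven counts in pages
(size, resident, shared, text, lib, data, dt). A minimal sketch of reading it,
with the pid as a placeholder:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned long size, resident, shared, text, lib, data, dt;
	long page_kb = sysconf(_SC_PAGESIZE) / 1024;
	FILE *f = fopen("/proc/1/statm", "r");	/* pid 1 is only a placeholder */

	if (!f)
		return 1;
	/* seven space-separated page counts on a single line */
	if (fscanf(f, "%lu %lu %lu %lu %lu %lu %lu",
		   &size, &resident, &shared, &text, &lib, &data, &dt) == 7)
		printf("size=%lu kB resident=%lu kB shared=%lu kB\n",
		       size * page_kb, resident * page_kb, shared * page_kb);
	fclose(f);
	return 0;
}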

The /proc became a dump of everything people found interesting just
because we were too easy about allowing those additions. Do not repeat
those mistakes, please!
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Michal Hocko
On Thu 18-08-16 23:43:39, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> >> than let the kernel's OOM killer activate and need to gather this
> >> >> information and we'd like to be able to get this information to make
> >> >> the decision much faster than 400ms
> >> >
> >> > Global OOM handling in userspace is really dubious if you ask me. I
> >> > understand you want something better than SIGKILL and in fact this is
> >> > already possible with memory cgroup controller (btw. memcg will give
> >> > you a cheap access to rss, amount of shared, swapped out memory as
> >> > well). Anyway if you are getting close to the OOM your system will most
> >> > probably be really busy and chances are that also reading your new file
> >> > will take much more time. I am also not quite sure how is pss useful for
> >> > oom decisions.
> >>
> >> I mentioned it before, but based on experience RSS just isn't good
> >> enough -- there's too much sharing going on in our use case to make
> >> the correct decision based on RSS.  If RSS were good enough, simply
> >> put, this patch wouldn't exist.
> >
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
> 
> We use PSS to calculate the memory used by a process among all the
> processes in the system, in the case of Chrome this tells us how much
> each renderer process (which is roughly tied to a particular "tab" in
> Chrome) is using and how much it has swapped out, so we know what the
> worst offenders are -- I'm not sure what's unclear about that?

So let me ask more specifically. How can you make any decision based on
the pss when you do not know _what_ the shared resource is? In other
words, if you select a task to terminate based on the pss, then you have
to kill the others who share the same resource, otherwise you do not
release that shared resource. Not to mention that such a shared resource
might be on tmpfs/shmem and won't get released even after all processes
which map it are gone.

I am sorry for being dense but it is still not clear to me how the
single pss number can be used for oom or, in general, any serious
decisions. The counter might be useful of course for debugging purposes
or to have a general overview but then arguing about 40 vs 20ms sounds a
bit strange to me.

> Chrome tends to use a lot of shared memory so we found PSS to be
> better than RSS, and I can give you examples of the  RSS and PSS on
> real systems to illustrate the magnitude of the difference between
> those two numbers if that would be useful.
> 
> >
> >> So even with memcg I think we'd have the same problem?
> >
> > memcg will give you instant anon, shared counters for all processes in
> > the memcg.
> >
> 
> We want to be able to get per-process granularity quickly.  I'm not
> sure if memcg provides that exactly?

It will give you that information if you do process-per-memcg, but that
doesn't sound ideal. I thought those 20-something processes you were
talking about were treated together, but it seems I misunderstood.
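
For illustration, the cheap memcg counters referred to above live in
memory.stat; a minimal sketch, assuming a v1 memory controller mounted at
/sys/fs/cgroup/memory and a hypothetical group called "chrome":

#include <stdio.h>

int main(void)
{
	char line[256];
	unsigned long long val;
	FILE *f = fopen("/sys/fs/cgroup/memory/chrome/memory.stat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* rss, cache and (with swap accounting enabled) swap are byte
		 * counters maintained by memcg, so no pte or rmap walk is needed */
		if (sscanf(line, "rss %llu", &val) == 1)
			printf("anon: %llu kB\n", val / 1024);
		else if (sscanf(line, "cache %llu", &val) == 1)
			printf("file: %llu kB\n", val / 1024);
		else if (sscanf(line, "swap %llu", &val) == 1)
			printf("swap: %llu kB\n", val / 1024);
	}
	fclose(f);
	return 0;
}
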
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Sonny Rao
On Thu, Aug 18, 2016 at 7:26 PM, Minchan Kim  wrote:
> Hi Michal,
>
> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
>> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
>> > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> [...]
>> > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> > >> than let the kernel's OOM killer activate and need to gather this
>> > >> information and we'd like to be able to get this information to make
>> > >> the decision much faster than 400ms
>> > >
>> > > Global OOM handling in userspace is really dubious if you ask me. I
>> > > understand you want something better than SIGKILL and in fact this is
>> > > already possible with memory cgroup controller (btw. memcg will give
>> > > you a cheap access to rss, amount of shared, swapped out memory as
>> > > well). Anyway if you are getting close to the OOM your system will most
>> > > probably be really busy and chances are that also reading your new file
>> > > will take much more time. I am also not quite sure how is pss useful for
>> > > oom decisions.
>> >
>> > I mentioned it before, but based on experience RSS just isn't good
>> > enough -- there's too much sharing going on in our use case to make
>> > the correct decision based on RSS.  If RSS were good enough, simply
>> > put, this patch wouldn't exist.
>>
>> But that doesn't answer my question, I am afraid. So how exactly do you
>> use pss for oom decisions?
>
> My case is not for OOM decision but I agree it would be great if we can get
> *fast* smap summary information.
>
> PSS is really great tool to figure out how processes consume memory
> more exactly rather than RSS. We have been used it for monitoring
> of memory for per-process. Although it is not used for OOM decision,
> it would be great if it is speed up because we don't want to spend
> many CPU time for just monitoring.
>
> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
> either. Even, we don't need address range for just monitoring when we don't
> investigate in detail.
>
> Although they are not severe overhead, why does it emit the useless
> information? Even bloat day by day. :( With that, userspace tools should
> spend more time to parse which is pointless.
>
> Having said that, I'm not fan of creating new stat knob for that, either.
> How about appending summary information in the end of smap?
> So, monitoring users can just open the file and lseek to the (end - 1) and
> read the summary only.
>

That would work fine for us as long as it's fast -- i.e. we don't
still have to do all the expensive per-VMA format conversion in the
kernel.

> Thanks.


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Sonny Rao
On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko  wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
>> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
>> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> than let the kernel's OOM killer activate and need to gather this
>> >> information and we'd like to be able to get this information to make
>> >> the decision much faster than 400ms
>> >
>> > Global OOM handling in userspace is really dubious if you ask me. I
>> > understand you want something better than SIGKILL and in fact this is
>> > already possible with memory cgroup controller (btw. memcg will give
>> > you a cheap access to rss, amount of shared, swapped out memory as
>> > well). Anyway if you are getting close to the OOM your system will most
>> > probably be really busy and chances are that also reading your new file
>> > will take much more time. I am also not quite sure how is pss useful for
>> > oom decisions.
>>
>> I mentioned it before, but based on experience RSS just isn't good
>> enough -- there's too much sharing going on in our use case to make
>> the correct decision based on RSS.  If RSS were good enough, simply
>> put, this patch wouldn't exist.
>
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?

We use PSS to calculate the memory used by a process among all the
processes in the system. In the case of Chrome this tells us how much
each renderer process (which is roughly tied to a particular "tab" in
Chrome) is using and how much it has swapped out, so we know what the
worst offenders are -- I'm not sure what's unclear about that?

Chrome tends to use a lot of shared memory so we found PSS to be
better than RSS, and I can give you examples of the RSS and PSS on
real systems to illustrate the magnitude of the difference between
those two numbers if that would be useful.
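
As a rough illustration with invented numbers (not figures from the
thread): if two renderer processes each hold 50 MB of private pages and
share another 200 MB of mapped libraries and shmem, RSS reports 250 MB
for each of them (500 MB in total, double-counting the shared pages),
while PSS charges each process half of the shared memory, 50 + 200/2 =
150 MB, so the per-process PSS values add back up to the real 300 MB
footprint.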

>
>> So even with memcg I think we'd have the same problem?
>
> memcg will give you instant anon, shared counters for all processes in
> the memcg.
>

We want to be able to get per-process granularity quickly.  I'm not
sure if memcg provides that exactly?

>> > Don't take me wrong, /proc//totmaps might be suitable for your
>> > specific usecase but so far I haven't heard any sound argument for it to
>> > be generally usable. It is true that smaps is unnecessarily costly but
>> > at least I can see some room for improvements. A simple patch I've
>> > posted cut the formatting overhead by 7%. Maybe we can do more.
>>
>> It seems like a general problem that if you want these values the
>> existing kernel interface can be very expensive, so it would be
>> generally usable by any application which wants a per process PSS,
>> private data, dirty data or swap value.
>
> yes this is really unfortunate. And if at all possible we should address
> that. Precise values require the expensive rmap walk. We can introduce
> some caching to help that. But so far it seems the biggest overhead is
> to simply format the output and that should be addressed before any new
> proc file is added.
>
>> I mentioned two use cases, but I guess I don't understand the comment
>> about why it's not usable by other use cases.
>
> I might be wrong here but a use of pss is quite limited and I do not
> remember anybody asking for large optimizations in that area. I still do
> not understand your use cases properly so I am quite skeptical about a
> general usefulness of a new file.

How do you know that usage of PSS is quite limited?  I can only say
that we've been using it on Chromium OS for at least four years and
have found it very valuable, and I think I've explained the use cases
in this thread. If you have more specific questions then I can try to
clarify.

>
> --
> Michal Hocko
> SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Sonny Rao
On Thu, Aug 18, 2016 at 2:05 PM, Robert Foss  wrote:
>
>
> On 2016-08-18 02:01 PM, Michal Hocko wrote:
>>
>> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>>>
>>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:

 On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>>
>> [...]
>
> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> than let the kernel's OOM killer activate and need to gather this
> information and we'd like to be able to get this information to make
> the decision much faster than 400ms


 Global OOM handling in userspace is really dubious if you ask me. I
 understand you want something better than SIGKILL and in fact this is
 already possible with memory cgroup controller (btw. memcg will give
 you a cheap access to rss, amount of shared, swapped out memory as
 well). Anyway if you are getting close to the OOM your system will most
 probably be really busy and chances are that also reading your new file
 will take much more time. I am also not quite sure how is pss useful for
 oom decisions.
>>>
>>>
>>> I mentioned it before, but based on experience RSS just isn't good
>>> enough -- there's too much sharing going on in our use case to make
>>> the correct decision based on RSS.  If RSS were good enough, simply
>>> put, this patch wouldn't exist.
>>
>>
>> But that doesn't answer my question, I am afraid. So how exactly do you
>> use pss for oom decisions?
>>
>>> So even with memcg I think we'd have the same problem?
>>
>>
>> memcg will give you instant anon, shared counters for all processes in
>> the memcg.
>
>
> Is it technically feasible to add instant pss support to memcg?
>
> @Sonny Rao: Would using cgroups be acceptable for chromiumos?

It's possible, though I think we'd end up putting each renderer in
its own cgroup to get the PSS stat, so it seems a bit like overkill.
I think memcg also has some overhead that we'd need to quantify, but I
could be mistaken about this.
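
For completeness, a minimal sketch of the process-per-memcg arrangement
under a v1 memory controller; the mount point, group naming and pid are
assumptions rather than anything ChromeOS actually does:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int put_in_own_memcg(int pid)
{
	char dir[128], procs[160];
	FILE *f;

	snprintf(dir, sizeof(dir), "/sys/fs/cgroup/memory/renderer_%d", pid);
	if (mkdir(dir, 0755) && errno != EEXIST)
		return -1;
	snprintf(procs, sizeof(procs), "%s/cgroup.procs", dir);
	f = fopen(procs, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", pid);	/* move the renderer into its own group */
	return fclose(f);
}

int main(void)
{
	return put_in_own_memcg(12345) ? 1 : 0;	/* 12345 is a placeholder pid */
}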

>
>
>>
 Don't take me wrong, /proc//totmaps might be suitable for your
 specific usecase but so far I haven't heard any sound argument for it to
 be generally usable. It is true that smaps is unnecessarily costly but
 at least I can see some room for improvements. A simple patch I've
 posted cut the formatting overhead by 7%. Maybe we can do more.
>>>
>>>
>>> It seems like a general problem that if you want these values the
>>> existing kernel interface can be very expensive, so it would be
>>> generally usable by any application which wants a per process PSS,
>>> private data, dirty data or swap value.
>>
>>
>> yes this is really unfortunate. And if at all possible we should address
>> that. Precise values require the expensive rmap walk. We can introduce
>> some caching to help that. But so far it seems the biggest overhead is
>> to simply format the output and that should be addressed before any new
>> proc file is added.
>>
>>> I mentioned two use cases, but I guess I don't understand the comment
>>> about why it's not usable by other use cases.
>>
>>
>> I might be wrong here but a use of pss is quite limited and I do not
>> remember anybody asking for large optimizations in that area. I still do
>> not understand your use cases properly so I am quite skeptical about a
>> general usefulness of a new file.
>>
>


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-18 Thread Sonny Rao
On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
> On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko  wrote:
>> > On Wed 17-08-16 11:31:25, Jann Horn wrote:
> [...]
>> >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> >> time spent on evaluating format strings. The new interface
>> >> wouldn't have to spend that much time on format strings because there
>> >> isn't so much text to format.
>> >
>> > well, this is true of course but I would much rather try to reduce the
>> > overhead of smaps file than add a new file. The following should help
>> > already. I've measured ~7% systime cut down. I guess there is still some
>> > room for improvements but I have to say I'm far from being convinced about
>> > a new proc file just because we suck at dumping information to the
>> > userspace.
>> > If this was something like /proc//stat which is
>> > essentially read all the time then it would be a different question but
>> > is the rss, pss going to be all that often? If yes why?
>>
>> If the question is why do we need to read RSS, PSS, Private_*, Swap
>> and the other fields so often?
>>
>> I have two use cases so far involving monitoring per-process memory
>> usage, and we usually need to read stats for about 25 processes.
>>
>> Here's a timing example on an fairly recent ARM system 4 core RK3288
>> running at 1.8Ghz
>>
>> localhost ~ # time cat /proc/25946/smaps > /dev/null
>>
>> real0m0.036s
>> user0m0.020s
>> sys 0m0.020s
>>
>> localhost ~ # time cat /proc/25946/totmaps > /dev/null
>>
>> real0m0.027s
>> user0m0.010s
>> sys 0m0.010s
>> localhost ~ #
>>
>> I'll ignore the user time for now, and we see about 20 ms of system
>> time with smaps and 10 ms with totmaps, with 20 similar processes it
>> would be 400 milliseconds of cpu time for the kernel to get this
>> information from smaps vs 200 milliseconds with totmaps.  Even totmaps
>> is still pretty slow, but much better than smaps.
>>
>> Use cases:
>> 1) Basic task monitoring -- like "top" that shows memory consumption
>> including PSS, Private, Swap
>> 1 second update means about 40% of one CPU is spent in the kernel
>> gathering the data with smaps
>
> I would argue that even 20% is way too much for such a monitoring. What
> is the value to do it so often tha 20 vs 40ms really matters?

Yeah it is too much (I believe I said that) but it's significantly better.

>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> than let the kernel's OOM killer activate and need to gather this
>> information and we'd like to be able to get this information to make
>> the decision much faster than 400ms
>
> Global OOM handling in userspace is really dubious if you ask me. I
> understand you want something better than SIGKILL and in fact this is
> already possible with memory cgroup controller (btw. memcg will give
> you a cheap access to rss, amount of shared, swapped out memory as
> well). Anyway if you are getting close to the OOM your system will most
> probably be really busy and chances are that also reading your new file
> will take much more time. I am also not quite sure how is pss useful for
> oom decisions.

I mentioned it before, but based on experience RSS just isn't good
enough -- there's too much sharing going on in our use case to make
the correct decision based on RSS.  If RSS were good enough, simply
put, this patch wouldn't exist.  So even with memcg I think we'd have
the same problem?

>
> Don't take me wrong, /proc//totmaps might be suitable for your
> specific usecase but so far I haven't heard any sound argument for it to
> be generally usable. It is true that smaps is unnecessarily costly but
> at least I can see some room for improvements. A simple patch I've
> posted cut the formatting overhead by 7%. Maybe we can do more.

It seems like a general problem that if you want these values the
existing kernel interface can be very expensive, so it would be
generally usable by any application which wants a per-process PSS,
private data, dirty data, or swap value. I mentioned two use cases,
but I guess I don't understand the comment about why it's not usable
by other use cases.
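
A small harness along the following lines reproduces the timing arithmetic
discussed earlier in this message (roughly 20 ms per smaps read, multiplied
by the number of monitored processes); the pid list is a placeholder:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void drain(const char *path)
{
	char buf[4096];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	while (read(fd, buf, sizeof(buf)) > 0)
		;	/* discard the text, only the cost matters here */
	close(fd);
}

int main(void)
{
	int pids[] = { 1, 2, 3 };	/* placeholder pids of monitored processes */
	char path[64];
	struct timespec t0, t1;
	unsigned int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < sizeof(pids) / sizeof(pids[0]); i++) {
		snprintf(path, sizeof(path), "/proc/%d/smaps", pids[i]);
		drain(path);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("total: %.1f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1000.0 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}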

> --
> Michal Hocko
> SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-18 Thread Minchan Kim
Hi Michal,

On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
> > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
> > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > >> than let the kernel's OOM killer activate and need to gather this
> > >> information and we'd like to be able to get this information to make
> > >> the decision much faster than 400ms
> > >
> > > Global OOM handling in userspace is really dubious if you ask me. I
> > > understand you want something better than SIGKILL and in fact this is
> > > already possible with memory cgroup controller (btw. memcg will give
> > > you a cheap access to rss, amount of shared, swapped out memory as
> > > well). Anyway if you are getting close to the OOM your system will most
> > > probably be really busy and chances are that also reading your new file
> > > will take much more time. I am also not quite sure how is pss useful for
> > > oom decisions.
> > 
> > I mentioned it before, but based on experience RSS just isn't good
> > enough -- there's too much sharing going on in our use case to make
> > the correct decision based on RSS.  If RSS were good enough, simply
> > put, this patch wouldn't exist.
> 
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?

My case is not about OOM decisions, but I agree it would be great if we could
get *fast* smaps summary information.

PSS is a really great tool to figure out how processes consume memory,
more exactly than RSS does. We have been using it for per-process memory
monitoring. Although it is not used for OOM decisions, it would be great
if it were sped up, because we don't want to spend much CPU time on
monitoring alone.

For our use case, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
Private_Hugetlb, KernelPageSize, or MMUPageSize because we never enable THP or
hugetlb. Additionally, Locked can be derived from the vma flags, so we don't
need it either. We don't even need the address ranges for plain monitoring,
when we aren't investigating in detail.

Although they are not severe overhead, why emit the useless information?
It even bloats day by day. :( With that, userspace tools have to spend more
time parsing, which is pointless.

Having said that, I'm not a fan of creating a new stat knob for that, either.
How about appending summary information at the end of smaps?
Then monitoring users could just open the file, lseek to the (end - 1), and
read only the summary.
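
Purely as a sketch of that access pattern: it assumes a hypothetical kernel
change that both appends a fixed-size summary and lets smaps report a size,
since today's seq_file-based smaps does neither:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SUMMARY_BYTES 512	/* assumed upper bound on the appended summary */

int main(void)
{
	char buf[SUMMARY_BYTES + 1];
	ssize_t n;
	int fd = open("/proc/1/smaps", O_RDONLY);	/* pid 1 is a placeholder */

	if (fd < 0)
		return 1;
	/* hypothetical: requires smaps to report a size and carry a trailing
	 * summary block, neither of which the current interface provides */
	if (lseek(fd, -SUMMARY_BYTES, SEEK_END) == (off_t)-1) {
		close(fd);
		return 1;
	}
	n = read(fd, buf, SUMMARY_BYTES);
	if (n > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);	/* just the summary, not the per-VMA dump */
	}
	close(fd);
	return 0;
}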

Thanks.


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-18 Thread Robert Foss



On 2016-08-18 02:01 PM, Michal Hocko wrote:

On Thu 18-08-16 10:47:57, Sonny Rao wrote:

On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:

On Wed 17-08-16 11:57:56, Sonny Rao wrote:

[...]

2) User space OOM handling -- we'd rather do a more graceful shutdown
than let the kernel's OOM killer activate and need to gather this
information and we'd like to be able to get this information to make
the decision much faster than 400ms


Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL and in fact this is
already possible with memory cgroup controller (btw. memcg will give
you a cheap access to rss, amount of shared, swapped out memory as
well). Anyway if you are getting close to the OOM your system will most
probably be really busy and chances are that also reading your new file
will take much more time. I am also not quite sure how is pss useful for
oom decisions.


I mentioned it before, but based on experience RSS just isn't good
enough -- there's too much sharing going on in our use case to make
the correct decision based on RSS.  If RSS were good enough, simply
put, this patch wouldn't exist.


But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?


So even with memcg I think we'd have the same problem?


memcg will give you instant anon, shared counters for all processes in
the memcg.


Is it technically feasible to add instant pss support to memcg?

@Sonny Rao: Would using cgroups be acceptable for chromiumos?




Don't take me wrong, /proc//totmaps might be suitable for your
specific usecase but so far I haven't heard any sound argument for it to
be generally usable. It is true that smaps is unnecessarily costly but
at least I can see some room for improvements. A simple patch I've
posted cut the formatting overhead by 7%. Maybe we can do more.


It seems like a general problem that if you want these values the
existing kernel interface can be very expensive, so it would be
generally usable by any application which wants a per process PSS,
private data, dirty data or swap value.


yes this is really unfortunate. And if at all possible we should address
that. Precise values require the expensive rmap walk. We can introduce
some caching to help that. But so far it seems the biggest overhead is
to simply format the output and that should be addressed before any new
proc file is added.


I mentioned two use cases, but I guess I don't understand the comment
about why it's not usable by other use cases.


I might be wrong here but a use of pss is quite limited and I do not
remember anybody asking for large optimizations in that area. I still do
not understand your use cases properly so I am quite skeptical about a
general usefulness of a new file.



Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-18 Thread Michal Hocko
On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko  wrote:
> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
[...]
> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> than let the kernel's OOM killer activate and need to gather this
> >> information and we'd like to be able to get this information to make
> >> the decision much faster than 400ms
> >
> > Global OOM handling in userspace is really dubious if you ask me. I
> > understand you want something better than SIGKILL and in fact this is
> > already possible with memory cgroup controller (btw. memcg will give
> > you a cheap access to rss, amount of shared, swapped out memory as
> > well). Anyway if you are getting close to the OOM your system will most
> > probably be really busy and chances are that also reading your new file
> > will take much more time. I am also not quite sure how is pss useful for
> > oom decisions.
> 
> I mentioned it before, but based on experience RSS just isn't good
> enough -- there's too much sharing going on in our use case to make
> the correct decision based on RSS.  If RSS were good enough, simply
> put, this patch wouldn't exist.

But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?

> So even with memcg I think we'd have the same problem?

memcg will give you instant anon, shared counters for all processes in
the memcg.
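
Something like the following sketch is all it takes on the userspace
side (assuming a v1 memory controller mounted at /sys/fs/cgroup/memory
and a group called "foo" -- both just placeholders for the example):

$ cat read_memcg.c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* path and group name are placeholders */
    const char *path = "/sys/fs/cgroup/memory/foo/memory.stat";
    char key[64];
    unsigned long long val;
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return 1;
    }
    /* memory.stat is "key value" per line, maintained by the kernel,
     * so no per-VMA or rmap walk is needed at read time */
    while (fscanf(f, "%63s %llu", key, &val) == 2) {
        if (!strcmp(key, "total_rss") || !strcmp(key, "total_cache"))
            printf("%s %llu\n", key, val);
    }
    fclose(f);
    return 0;
}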

> > Don't take me wrong, /proc//totmaps might be suitable for your
> > specific usecase but so far I haven't heard any sound argument for it to
> > be generally usable. It is true that smaps is unnecessarily costly but
> > at least I can see some room for improvements. A simple patch I've
> > posted cut the formatting overhead by 7%. Maybe we can do more.
> 
> It seems like a general problem that if you want these values the
> existing kernel interface can be very expensive, so it would be
> generally usable by any application which wants a per process PSS,
> private data, dirty data or swap value.

yes this is really unfortunate. And if at all possible we should address
that. Precise values require the expensive rmap walk. We can introduce
some caching to help that. But so far it seems the biggest overhead is
to simply format the output and that should be addressed before any new
proc file is added.

> I mentioned two use cases, but I guess I don't understand the comment
> about why it's not usable by other use cases.

I might be wrong here but a use of pss is quite limited and I do not
remember anybody asking for large optimizations in that area. I still do
not understand your use cases properly so I am quite skeptical about a
general usefulness of a new file.

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-18 Thread Michal Hocko
On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko  wrote:
> > On Wed 17-08-16 11:31:25, Jann Horn wrote:
[...]
> >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> >> time spent on evaluating format strings. The new interface
> >> wouldn't have to spend that much time on format strings because there
> >> isn't so much text to format.
> >
> > well, this is true of course but I would much rather try to reduce the
> > overhead of smaps file than add a new file. The following should help
> > already. I've measured ~7% systime cut down. I guess there is still some
> > room for improvements but I have to say I'm far from being convinced about
> > a new proc file just because we suck at dumping information to the
> > userspace.
> > If this was something like /proc//stat which is
> > essentially read all the time then it would be a different question but
> > is the rss, pss going to be all that often? If yes why?
> 
> If the question is why do we need to read RSS, PSS, Private_*, Swap
> and the other fields so often?
> 
> I have two use cases so far involving monitoring per-process memory
> usage, and we usually need to read stats for about 25 processes.
> 
> Here's a timing example on an fairly recent ARM system 4 core RK3288
> running at 1.8Ghz
> 
> localhost ~ # time cat /proc/25946/smaps > /dev/null
> 
> real0m0.036s
> user0m0.020s
> sys 0m0.020s
> 
> localhost ~ # time cat /proc/25946/totmaps > /dev/null
> 
> real0m0.027s
> user0m0.010s
> sys 0m0.010s
> localhost ~ #
> 
> I'll ignore the user time for now, and we see about 20 ms of system
> time with smaps and 10 ms with totmaps, with 20 similar processes it
> would be 400 milliseconds of cpu time for the kernel to get this
> information from smaps vs 200 milliseconds with totmaps.  Even totmaps
> is still pretty slow, but much better than smaps.
> 
> Use cases:
> 1) Basic task monitoring -- like "top" that shows memory consumption
> including PSS, Private, Swap
> 1 second update means about 40% of one CPU is spent in the kernel
> gathering the data with smaps

I would argue that even 20% is way too much for such monitoring. What
is the value of doing it so often that 20 vs 40ms really matters?

> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> than let the kernel's OOM killer activate and need to gather this
> information and we'd like to be able to get this information to make
> the decision much faster than 400ms

Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL and in fact this is
already possible with memory cgroup controller (btw. memcg will give
you a cheap access to rss, amount of shared, swapped out memory as
well). Anyway if you are getting close to the OOM your system will most
probably be really busy and chances are that also reading your new file
will take much more time. I am also not quite sure how is pss useful for
oom decisions.

Don't take me wrong, /proc//totmaps might be suitable for your
specific usecase but so far I haven't heard any sound argument for it to
be generally usable. It is true that smaps is unnecessarily costly but
at least I can see some room for improvements. A simple patch I've
posted cut the formatting overhead by 7%. Maybe we can do more.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-17 Thread Sonny Rao
On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko  wrote:
> On Wed 17-08-16 11:31:25, Jann Horn wrote:
>> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
>> > On Tue 16-08-16 12:46:51, Robert Foss wrote:
>> > [...]
>> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
>> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
>> > > /proc/5025/smaps }"
>> > > [...]
>> > >   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
>> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' 
>> > > /proc/5025/smaps
>> > > }"
>> > >   User time (seconds): 0.37
>> > >   System time (seconds): 0.45
>> > >   Percent of CPU this job got: 92%
>> > >   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
>> >
>> > This is really unexpected. Where is the user time spent? Anyway, rather
>> > than measuring some random processes I've tried to measure something
>> > resembling the worst case. So I've created a simple program to mmap as
>> > much as possible:
>> >
>> > #include <sys/mman.h>
>> > #include <stdio.h>
>> > #include <stdlib.h>
>> > #include <unistd.h>
>> > int main()
>> > {
>> > while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
>> > MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>> > ;
>> >
>> > printf("pid:%d\n", getpid());
>> > pause();
>> > return 0;
>> > }
>>
>> Ah, nice, that's a reasonable test program. :)
>>
>>
>> > So with a reasonable user space the parsing is really not all that time
>> > consuming wrt. smaps handling. That being said I am still very skeptical
>> > about a dedicated proc file which accomplishes what userspace can done
>> > in a trivial way.
>>
>> Now, since your numbers showed that all the time is spent in the kernel,
>> also create this test program to just read that file over and over again:
>>
>> $ cat justreadloop.c
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <fcntl.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <err.h>
>>
>> char buf[100];
>>
>> int main(int argc, char **argv) {
>>   printf("pid:%d\n", getpid());
>>   while (1) {
>> int fd = open(argv[1], O_RDONLY);
>> if (fd < 0) continue;
>> if (read(fd, buf, sizeof(buf)) < 0)
>>   err(1, "read");
>> close(fd);
>>   }
>> }
>> $ gcc -Wall -o justreadloop justreadloop.c
>> $
>>
>> Now launch your test:
>>
>> $ ./mapstuff
>> pid:29397
>>
>> point justreadloop at it:
>>
>> $ ./justreadloop /proc/29397/smaps
>> pid:32567
>>
>> ... and then check the performance stats of justreadloop:
>>
>> # perf top -p 32567
>>
>> This is what I see:
>>
>> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
>> Overhead  Shared Object Symbol
>>   30,43%  [kernel]  [k] format_decode
>>9,12%  [kernel]  [k] number
>>7,66%  [kernel]  [k] vsnprintf
>>7,06%  [kernel]  [k] __lock_acquire
>>3,23%  [kernel]  [k] lock_release
>>2,85%  [kernel]  [k] debug_lockdep_rcu_enabled
>>2,25%  [kernel]  [k] skip_atoi
>>2,13%  [kernel]  [k] lock_acquire
>>2,05%  [kernel]  [k] show_smap
>
> This is a lot! I would expect the rmap walk to consume more but it even
> doesn't show up in the top consumers.
>
>> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> time spent on evaluating format strings. The new interface
>> wouldn't have to spend that much time on format strings because there
>> isn't so much text to format.
>
> well, this is true of course but I would much rather try to reduce the
> overhead of smaps file than add a new file. The following should help
> already. I've measured ~7% systime cut down. I guess there is still some
> room for improvements but I have to say I'm far from being convinced about
> a new proc file just because we suck at dumping information to the
> userspace.
> If this was something like /proc//stat which is
> essentially read all the time then it would be a different question but
> is the rss, pss going to be all that often? If yes why?

If the question is why do we need to read RSS, PSS, Private_*, Swap
and the other fields so often?

I have two use cases so far involving monitoring per-process memory
usage, and we usually need to read stats for about 25 processes.

Here's a timing example on a fairly recent ARM system, a 4-core RK3288
running at 1.8GHz:

localhost ~ # time cat /proc/25946/smaps > /dev/null

real    0m0.036s
user    0m0.020s
sys     0m0.020s

localhost ~ # time cat /proc/25946/totmaps > /dev/null

real    0m0.027s
user    0m0.010s
sys     0m0.010s
localhost ~ #

I'll ignore the user time for now, and we see about 20 ms of system
time with smaps and 10 ms with totmaps, with 20 similar processes it
would be 400 milliseconds of cpu time for the kernel to get this
information from smaps vs 200 milliseconds with totmaps.  Even totmaps
is still pretty slow, but much better than smaps.
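
For 25 processes that per-sample work is basically a loop like the
following sketch (not the actual monitoring code; PID handling and
field names are just illustrative):

$ cat sum_smaps.c
#include <stdio.h>

int main(int argc, char **argv)
{
    int i;

    /* one pass per monitored process, e.g. ./sum_smaps 123 456 ... */
    for (i = 1; i < argc; i++) {
        char path[64], line[256];
        unsigned long long rss = 0, pss = 0, v;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps", argv[i]);
        f = fopen(path, "r");
        if (!f)
            continue;
        /* every VMA contributes one Rss: and one Pss: line */
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "Rss: %llu", &v) == 1)
                rss += v;
            else if (sscanf(line, "Pss: %llu", &v) == 1)
                pss += v;
        }
        fclose(f);
        printf("pid %s rss:%llu kB pss:%llu kB\n", argv[i], rss, pss);
    }
    return 0;
}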

Use cases:
1) Basic task monitoring -- like "top" that shows memory consumption
including PSS, Private, Swap
1 second update means about 40% of one CPU is spent in the kernel
gathering the data with smaps

Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-17 Thread Robert Foss



On 2016-08-17 09:03 AM, Michal Hocko wrote:

On Wed 17-08-16 11:31:25, Jann Horn wrote:

On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:

On Tue 16-08-16 12:46:51, Robert Foss wrote:
[...]

$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
/^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
/proc/5025/smaps }"
[...]
Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
/^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
}"
User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89


This is really unexpected. Where is the user time spent? Anyway, rather
than measuring some random processes I've tried to measure something
resembling the worst case. So I've created a simple program to mmap as
much as possible:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}


Ah, nice, that's a reasonable test program. :)



So with a reasonable user space the parsing is really not all that time
consuming wrt. smaps handling. That being said I am still very skeptical
about a dedicated proc file which accomplishes what userspace can done
in a trivial way.


Now, since your numbers showed that all the time is spent in the kernel,
also create this test program to just read that file over and over again:

$ cat justreadloop.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>

char buf[100];

int main(int argc, char **argv) {
  printf("pid:%d\n", getpid());
  while (1) {
int fd = open(argv[1], O_RDONLY);
if (fd < 0) continue;
if (read(fd, buf, sizeof(buf)) < 0)
  err(1, "read");
close(fd);
  }
}
$ gcc -Wall -o justreadloop justreadloop.c
$

Now launch your test:

$ ./mapstuff
pid:29397

point justreadloop at it:

$ ./justreadloop /proc/29397/smaps
pid:32567

... and then check the performance stats of justreadloop:

# perf top -p 32567

This is what I see:

Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
Overhead  Shared Object Symbol
  30,43%  [kernel]  [k] format_decode
   9,12%  [kernel]  [k] number
   7,66%  [kernel]  [k] vsnprintf
   7,06%  [kernel]  [k] __lock_acquire
   3,23%  [kernel]  [k] lock_release
   2,85%  [kernel]  [k] debug_lockdep_rcu_enabled
   2,25%  [kernel]  [k] skip_atoi
   2,13%  [kernel]  [k] lock_acquire
   2,05%  [kernel]  [k] show_smap


This is a lot! I would expect the rmap walk to consume more but it even
doesn't show up in the top consumers.


That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
time spent on evaluating format strings. The new interface
wouldn't have to spend that much time on format strings because there
isn't so much text to format.


well, this is true of course but I would much rather try to reduce the
overhead of smaps file than add a new file. The following should help
already. I've measured ~7% systime cut down. I guess there is still some
room for improvements but I have to say I'm far from being convinced about
a new proc file just because we suck at dumping information to the
userspace. If this was something like /proc//stat which is
essentially read all the time then it would be a different question but
is the rss, pss going to be all that often? If yes why? These are the
questions which should be answered before we even start considering the
implementation.


@Sonny Rao: Maybe you can comment on how often, for how many processes 
this information is needed and for which reasons this information is useful.



---
From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 17 Aug 2016 14:00:13 +0200
Subject: [PATCH] proc, smaps: reduce printing overhead

seq_printf (used by show_smap) can be pretty expensive when dumping a
lot of numbers.  Say we would like to get Rss and Pss from a particular
process.  In order to measure a pathological case let's generate as many
mappings as possible:

$ cat max_mmap.c
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}

$ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, 
pss}' /proc/$pid/smaps

would do a trick. The whole runtime is in the kernel space which is not
that that unexpected because smaps is not the cheapest one (we have to
do rmap walk etc.).

Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss} /proc/3050/smaps"
User time (seconds): 0.01
System time (seconds): 0.44

Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-17 Thread Michal Hocko
On Wed 17-08-16 11:31:25, Jann Horn wrote:
> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
> > On Tue 16-08-16 12:46:51, Robert Foss wrote:
> > [...]
> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> > > /proc/5025/smaps }"
> > > [...]
> > >   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' 
> > > /proc/5025/smaps
> > > }"
> > >   User time (seconds): 0.37
> > >   System time (seconds): 0.45
> > >   Percent of CPU this job got: 92%
> > >   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
> > 
> > This is really unexpected. Where is the user time spent? Anyway, rather
> > than measuring some random processes I've tried to measure something
> > resembling the worst case. So I've created a simple program to mmap as
> > much as possible:
> > 
> > #include <sys/mman.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > int main()
> > {
> > while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
> > MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
> > ;
> > 
> > printf("pid:%d\n", getpid());
> > pause();
> > return 0;
> > }
> 
> Ah, nice, that's a reasonable test program. :)
> 
> 
> > So with a reasonable user space the parsing is really not all that time
> > consuming wrt. smaps handling. That being said I am still very skeptical
> > about a dedicated proc file which accomplishes what userspace can done
> > in a trivial way.
> 
> Now, since your numbers showed that all the time is spent in the kernel,
> also create this test program to just read that file over and over again:
> 
> $ cat justreadloop.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <err.h>
> 
> char buf[100];
> 
> int main(int argc, char **argv) {
>   printf("pid:%d\n", getpid());
>   while (1) {
> int fd = open(argv[1], O_RDONLY);
> if (fd < 0) continue;
> if (read(fd, buf, sizeof(buf)) < 0)
>   err(1, "read");
> close(fd);
>   }
> }
> $ gcc -Wall -o justreadloop justreadloop.c
> $ 
> 
> Now launch your test:
> 
> $ ./mapstuff 
> pid:29397
> 
> point justreadloop at it:
> 
> $ ./justreadloop /proc/29397/smaps
> pid:32567
> 
> ... and then check the performance stats of justreadloop:
> 
> # perf top -p 32567
> 
> This is what I see:
> 
> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
> Overhead  Shared Object Symbol
>   30,43%  [kernel]  [k] format_decode
>9,12%  [kernel]  [k] number
>7,66%  [kernel]  [k] vsnprintf
>7,06%  [kernel]  [k] __lock_acquire
>3,23%  [kernel]  [k] lock_release
>2,85%  [kernel]  [k] debug_lockdep_rcu_enabled
>2,25%  [kernel]  [k] skip_atoi
>2,13%  [kernel]  [k] lock_acquire
>2,05%  [kernel]  [k] show_smap

This is a lot! I would expect the rmap walk to consume more but it even
doesn't show up in the top consumers.
 
> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> time spent on evaluating format strings. The new interface
> wouldn't have to spend that much time on format strings because there
> isn't so much text to format.

well, this is true of course but I would much rather try to reduce the
overhead of smaps file than add a new file. The following should help
already. I've measured ~7% systime cut down. I guess there is still some
room for improvements but I have to say I'm far from being convinced about
a new proc file just because we suck at dumping information to the
userspace. If this was something like /proc//stat which is
essentially read all the time then it would be a different question but
is the rss, pss going to be read all that often? If yes, why? These are the
questions which should be answered before we even start considering the
implementation.
---
>From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 17 Aug 2016 14:00:13 +0200
Subject: [PATCH] proc, smaps: reduce printing overhead

seq_printf (used by show_smap) can be pretty expensive when dumping a
lot of numbers.  Say we would like to get Rss and Pss from a particular
process.  In order to measure a pathological case let's generate as many
mappings as possible:

$ cat max_mmap.c
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}

$ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, 
pss}' /proc/$pid/smaps

would do the trick. The whole runtime is in kernel space, which is not
all that unexpected because smaps is not the cheapest one (we have to
do an rmap walk etc.).

Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
User time (seconds): 0.01

Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-17 Thread Jann Horn
On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
> On Tue 16-08-16 12:46:51, Robert Foss wrote:
> [...]
> > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> > /proc/5025/smaps }"
> > [...]
> > Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> > }"
> > User time (seconds): 0.37
> > System time (seconds): 0.45
> > Percent of CPU this job got: 92%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
> 
> This is really unexpected. Where is the user time spent? Anyway, rather
> than measuring some random processes I've tried to measure something
> resembling the worst case. So I've created a simple program to mmap as
> much as possible:
> 
> #include <sys/mman.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> int main()
> {
>   while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
> MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>   ;
> 
>   printf("pid:%d\n", getpid());
>   pause();
>   return 0;
> }

Ah, nice, that's a reasonable test program. :)


> So with a reasonable user space the parsing is really not all that time
> consuming wrt. smaps handling. That being said I am still very skeptical
> about a dedicated proc file which accomplishes what userspace can done
> in a trivial way.

Now, since your numbers showed that all the time is spent in the kernel,
also create this test program to just read that file over and over again:

$ cat justreadloop.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>

char buf[100];

int main(int argc, char **argv) {
  printf("pid:%d\n", getpid());
  while (1) {
int fd = open(argv[1], O_RDONLY);
if (fd < 0) continue;
if (read(fd, buf, sizeof(buf)) < 0)
  err(1, "read");
close(fd);
  }
}
$ gcc -Wall -o justreadloop justreadloop.c
$ 

Now launch your test:

$ ./mapstuff 
pid:29397

point justreadloop at it:

$ ./justreadloop /proc/29397/smaps
pid:32567

... and then check the performance stats of justreadloop:

# perf top -p 32567

This is what I see:

Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
Overhead  Shared Object Symbol
  30,43%  [kernel]  [k] format_decode
   9,12%  [kernel]  [k] number
   7,66%  [kernel]  [k] vsnprintf
   7,06%  [kernel]  [k] __lock_acquire
   3,23%  [kernel]  [k] lock_release
   2,85%  [kernel]  [k] debug_lockdep_rcu_enabled
   2,25%  [kernel]  [k] skip_atoi
   2,13%  [kernel]  [k] lock_acquire
   2,05%  [kernel]  [k] show_smap

That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
time spent on evaluating format strings. The new interface
wouldn't have to spend that much time on format strings because there
isn't so much text to format. (My kernel is built with a
bunch of debug options - the results might look very different on
distro kernels or so, so please try this yourself.)

I guess it could be argued that this is not just a problem with
smaps, but also a problem with format strings (or text-based interfaces
in general) just being slow in general.

(Here is a totally random and crazy thought: Can we put something into
the kernel build process that replaces printf calls that use simple
format strings with equivalent non-printf calls? Move the cost of
evaluating the format string to compile time?)
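
A quick way to see how much of that is the generic format-string
machinery rather than the number conversion itself is a userspace
micro-benchmark along these lines (purely illustrative, nothing
kernel-specific):

$ cat fmtbench.c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* hand-rolled equivalent of snprintf(buf, n, "%llu kB\n", v) */
static size_t put_decimal(char *buf, unsigned long long v)
{
    char tmp[24];
    size_t n = 0, out = 0;

    do {
        tmp[n++] = '0' + v % 10;
        v /= 10;
    } while (v);
    while (n)
        buf[out++] = tmp[--n];
    memcpy(buf + out, " kB\n", 4);
    return out + 4;
}

static double since(struct timespec *a)
{
    struct timespec b;

    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a->tv_sec) + (b.tv_nsec - a->tv_nsec) / 1e9;
}

int main(void)
{
    char buf[64];
    volatile unsigned long long sink = 0;
    unsigned long long i;
    struct timespec t;

    clock_gettime(CLOCK_MONOTONIC, &t);
    for (i = 0; i < 10000000ULL; i++)
        sink += snprintf(buf, sizeof(buf), "%llu kB\n", i);
    printf("snprintf:    %.3fs\n", since(&t));

    clock_gettime(CLOCK_MONOTONIC, &t);
    for (i = 0; i < 10000000ULL; i++)
        sink += put_decimal(buf, i);
    printf("put_decimal: %.3fs\n", since(&t));

    return (int)(sink & 1);
}

The gap between the two loops is roughly the price of interpreting the
format string at run time, which is what the perf profile above is
showing on the kernel side.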


signature.asc
Description: Digital signature


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-17 Thread Michal Hocko
On Tue 16-08-16 12:46:51, Robert Foss wrote:
[...]
> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> /proc/5025/smaps }"
> [...]
>   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> }"
>   User time (seconds): 0.37
>   System time (seconds): 0.45
>   Percent of CPU this job got: 92%
>   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

This is really unexpected. Where is the user time spent? Anyway, rather
than measuring some random processes I've tried to measure something
resembling the worst case. So I've created a simple program to mmap as
much as possible:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}

so depending on /proc/sys/vm/max_map_count you will get the maximum
possible number of mmaps. I am using the default, so 65k mappings. Then
I have retried your 25x file parsing:
$ cat s.sh
#!/bin/sh

pid=$1
for i in $(seq 25)
do
awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", 
rss, pss}' /proc/$pid/smaps
done

But I am getting different results from you:
$ awk '/^[0-9a-f]/{print}' /proc/14808/smaps | wc -l
65532
[...]
Command being timed: "sh s.sh 14808"
User time (seconds): 0.00
System time (seconds): 20.10
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.20

The results are stable when I try multiple times, in fact there
shouldn't be any reason for them not to be. Then I went on to increase
max_map_count to 250k and that behaves consistently:
$ awk '/^[0-9a-f]/{print}' /proc/16093/smaps | wc -l 
250002
[...]
Command being timed: "sh s.sh 16093"
User time (seconds): 0.00
System time (seconds): 77.93
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.09

So with a reasonable user space the parsing is really not all that time
consuming wrt. smaps handling. That being said I am still very skeptical
about a dedicated proc file which accomplishes what userspace can do
in a trivial way.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-16 Thread Robert Foss



On 2016-08-16 03:12 AM, Michal Hocko wrote:

On Mon 15-08-16 12:25:10, Robert Foss wrote:



On 2016-08-15 09:42 AM, Michal Hocko wrote:

[...]

The use case is to speed up monitoring of
memory consumption in environments where RSS isn't precise.

For example Chrome tends to have many processes which have hundreds of VMAs
with a substantial amount of shared memory, and the error of using
RSS rather than PSS tends to be very large when looking at overall
memory consumption.  PSS isn't kept as a single number that's exported
like RSS, so to calculate PSS means having to parse a very large smaps
file.

This process is slow and has to be repeated for many processes, and we
found that just the act of doing the parsing was taking up a
significant amount of CPU time, so this patch is an attempt to make
that process cheaper.


Well, this is slow because it requires the pte walk; otherwise you cannot
know how many ptes map the particular shared page. Your patch
(totmaps_proc_show) does the very same page table walk because in fact
it is unavoidable. So what exactly is the difference except for the
userspace parsing, which is quite trivial? E.g. my currently running Firefox
has
$ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l
984

quite some VMAs, yet parsing it spends basically all the time in the kernel...

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss}' /proc/4950/smaps
rss:1112288 pss:1096435
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss} /proc/4950/smaps"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 91%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02

So I am not really sure I see the performance benefit.



I did some performance measurements of my own, and it would seem like 
there is about a 2x performance gain to be had. To me that is 
substantial, and a larger gain than commonly seen.


There is naturally also the benefit that this is a lot easier to interact 
with programmatically.
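
To make that concrete, the kind of per-process summing that totmaps would
replace currently looks roughly like the sketch below (the script name and
the pgrep-by-name selection are just an illustration, not the actual
monitoring code):

$ cat pss_total.sh
#!/bin/sh
# Illustration only: sum Pss (in kB) across all processes matching a name.
total=0
for pid in $(pgrep "$1")
do
	pss=$(awk '/^Pss/{pss+=$2} END {print pss}' /proc/$pid/smaps)
	total=$((total + ${pss:-0}))
done
echo "total pss: ${total} kB"

$ sh pss_total.sh chrome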


$ ps aux | grep firefox
robertfoss   5025 24.3 13.7 3562820 2219616 ? Rl   Aug15 277:44 
/usr/lib/firefox/firefox https://allg.one/xpb

$ awk '/^[0-9a-f]/{print}' /proc/5025/smaps | wc -l
1503


$ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})"
[...]
Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})"
User time (seconds): 0.00
System time (seconds): 0.40
Percent of CPU this job got: 90%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45


$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} 
/^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' 
/proc/5025/smaps }"

[...]
	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} 
/^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' 
/proc/5025/smaps }"

User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-16 Thread Michal Hocko
On Mon 15-08-16 12:25:10, Robert Foss wrote:
> 
> 
> On 2016-08-15 09:42 AM, Michal Hocko wrote:
[...]
> > The use case is to speed up monitoring of
> > memory consumption in environments where RSS isn't precise.
> >
> > For example Chrome tends to have many processes which have hundreds of VMAs
> > with a substantial amount of shared memory, and the error of using
> > RSS rather than PSS tends to be very large when looking at overall
> > memory consumption.  PSS isn't kept as a single number that's exported
> > like RSS, so to calculate PSS means having to parse a very large smaps
> > file.
> >
> > This process is slow and has to be repeated for many processes, and we
> > found that just the act of doing the parsing was taking up a
> > significant amount of CPU time, so this patch is an attempt to make
> > that process cheaper.

Well, this is slow because it requires the pte walk; otherwise you cannot
know how many ptes map the particular shared page. Your patch
(totmaps_proc_show) does the very same page table walk because in fact
it is unavoidable. So what exactly is the difference except for the
userspace parsing, which is quite trivial? E.g. my currently running Firefox
has
$ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l
984

quite some VMAs, yet parsing it spends basically all the time in the kernel...

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss}' /proc/4950/smaps 
rss:1112288 pss:1096435
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/4950/smaps"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 91%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02

So I am not really sure I see the performance benefit.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-15 Thread Robert Foss



On 2016-08-15 09:42 AM, Michal Hocko wrote:

On Mon 15-08-16 09:00:04, Robert Foss wrote:



On 2016-08-14 05:04 AM, Michal Hocko wrote:

On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:

From: Robert Foss 

This series implements /proc/PID/totmaps, a tool for retrieving summarized
information about the mappings of a process.


The changelog is absolutely missing the usecase. Why do we need this?
Why existing interfaces are not sufficient?


You are absolutely right, more information is in 1/3.


Patch 1 is silent about the use case as well. It is usually recommended
to describe the motivation for the change in the cover letter.


I'll change it for v3.




But the gist of it is that it provides a faster and more convenient way of
accessing the information in /proc/PID/smaps.


I am sorry to insist but this is far from a description I was hoping
for. Why do we need a more convenient API? Please note that this is a
userspace API which we will have to maintain for ever. We have made many
mistakes in the past where exporting some information made sense at the
time while it turned out being a mistake only later on. So let's make
sure we will not fall into the same trap again.

So please make sure you describe the use case, why the current API is
insufficient and why it cannot be tweaked to provide the information you
are looking for.



I'll add a more elaborate description to the v3 cover letter.
In v1, there was a discussion which I think presented the practical 
applications rather well:


https://lkml.org/lkml/2016/8/9/628

or the quote from Sonny Rao pasted below:

> The use case is to speed up monitoring of
> memory consumption in environments where RSS isn't precise.
>
> For example Chrome tends to have many processes which have hundreds of VMAs
> with a substantial amount of shared memory, and the error of using
> RSS rather than PSS tends to be very large when looking at overall
> memory consumption.  PSS isn't kept as a single number that's exported
> like RSS, so to calculate PSS means having to parse a very large smaps
> file.
>
> This process is slow and has to be repeated for many processes, and we
> found that just the act of doing the parsing was taking up a
> significant amount of CPU time, so this patch is an attempt to make
> that process cheaper.
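
(To put numbers on the RSS/PSS difference: a 4 KiB page shared by four
processes adds the full 4 KiB to each of their RSS figures, but only
4 KiB / 4 = 1 KiB to each PSS, so summing RSS over those processes counts
the page four times while summing PSS counts it exactly once.)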

If a reformatted version of this still isn't adequate or desirable for 
the cover-letter, please give me another heads up.


Thanks!


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-15 Thread Michal Hocko
On Mon 15-08-16 09:00:04, Robert Foss wrote:
> 
> 
> On 2016-08-14 05:04 AM, Michal Hocko wrote:
> > On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> > > From: Robert Foss 
> > > 
> > > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > > information about the mappings of a process.
> > 
> > The changelog is absolutely missing the usecase. Why do we need this?
> > Why existing interfaces are not sufficient?
> 
> You are absolutely right, more information is in 1/3.

Patch 1 is silent about the use case as well. It is usually recommended
to describe the motivation for the change in the cover letter.

> But the gist of it is that it provides a faster and more convenient way of
> accessing the information in /proc/PID/smaps.

I am sorry to insist but this is far from a description I was hoping
for. Why do we need a more convenient API? Please note that this is a
userspace API which we will have to maintain for ever. We have made many
mistakes in the past where exporting some information made sense at the
time while it turned out being a mistake only later on. So let's make
sure we will not fall into the same trap again.

So please make sure you describe the use case, why the current API is
insufficient and why it cannot be tweaked to provide the information you
are looking for.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-15 Thread Robert Foss



On 2016-08-14 05:04 AM, Michal Hocko wrote:

On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:

From: Robert Foss 

This series implements /proc/PID/totmaps, a tool for retrieving summarized
information about the mappings of a process.


The changelog is absolutely missing the usecase. Why do we need this?
Why existing interfaces are not sufficient?


You are absolutely right, more information is in 1/3.
But the gist of it is that it provides a faster and more convenient way 
of accessing the information in /proc/PID/smaps.





Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
  which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
  totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
  already is available
- Added support to proc_map_release() for priv==NULL, to allow
  function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in a separate patch
- Removed totmaps_release() since it was just a wrapper for
  proc_map_release()


Robert Foss (3):
  mm, proc: Implement /proc//totmaps
  Documentation/filesystems: Fixed typo
  Documentation/filesystems: Added /proc/PID/totmaps documentation

 Documentation/filesystems/proc.txt |  23 ++-
 fs/proc/base.c                      |   1 +
 fs/proc/internal.h                  |   3 +
 fs/proc/task_mmu.c                  | 134 +
 4 files changed, 160 insertions(+), 1 deletion(-)

--
2.7.4





Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-14 Thread Michal Hocko
On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> From: Robert Foss 
> 
> This series implements /proc/PID/totmaps, a tool for retrieving summarized
> information about the mappings of a process.

The changelog is absolutely missing the usecase. Why do we need this?
Why existing interfaces are not sufficient?

> Changes since v1:
> - Removed IS_ERR check from get_task_mm() function
> - Changed comment format
> - Moved proc_totmaps_operations declaration inside internal.h
> - Switched to using do_maps_open() in totmaps_open() function,
>   which provides privilege checking
> - Error handling reworked for totmaps_open() function
> - Switched to stack allocated struct mem_size_stats mss_sum in
>   totmaps_proc_show() function
> - Removed get_task_mm() in totmaps_proc_show() since priv->mm
>   already is available
> - Added support to proc_map_release() for priv==NULL, to allow
>   function to be used for all failure cases
> - Added proc_totmaps_op and helper functions for it
> - Added documentation in a separate patch
> - Removed totmaps_release() since it was just a wrapper for
>   proc_map_release()
> 
> 
> Robert Foss (3):
>   mm, proc: Implement /proc//totmaps
>   Documentation/filesystems: Fixed typo
>   Documentation/filesystems: Added /proc/PID/totmaps documentation
> 
>  Documentation/filesystems/proc.txt |  23 ++-
>  fs/proc/base.c                      |   1 +
>  fs/proc/internal.h                  |   3 +
>  fs/proc/task_mmu.c                  | 134 +
>  4 files changed, 160 insertions(+), 1 deletion(-)
> 
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


[PACTH v2 0/3] Implement /proc//totmaps

2016-08-12 Thread robert . foss
From: Robert Foss 

This series implements /proc/PID/totmaps, a tool for retrieving summarized
information about the mappings of a process.

Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
  which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
  totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
  already is available
- Added support to proc_map_release() for priv==NULL, to allow
  function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in a separate patch
- Removed totmaps_release() since it was just a wrapper for
  proc_map_release()


Robert Foss (3):
  mm, proc: Implement /proc//totmaps
  Documentation/filesystems: Fixed typo
  Documentation/filesystems: Added /proc/PID/totmaps documentation

 Documentation/filesystems/proc.txt |  23 ++-
 fs/proc/base.c                      |   1 +
 fs/proc/internal.h                  |   3 +
 fs/proc/task_mmu.c                  | 134 +
 4 files changed, 160 insertions(+), 1 deletion(-)

-- 
2.7.4


