Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
On Fri, 2016-09-30 at 11:49 +0200, Michal Hocko wrote:
> [CC Mike and Mel as they have seen some accounting oddities when doing
> performance testing. They can share details but essentially the system
> time just gets too high]
>
> For your reference the email thread started
> http://lkml.kernel.org/r/20160823143330.gl23...@dhcp22.suse.cz
>
> I suspect this is mainly for short lived processes - like kernel compile
> $ /usr/bin/time -v make mm/mmap.o
> [...]
> User time (seconds): 0.45
> System time (seconds): 0.82
> Percent of CPU this job got: 111%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
> $ rm mm/mmap.o
> $ /usr/bin/time -v make mm/mmap.o
> [...]
> User time (seconds): 0.47
> System time (seconds): 1.55
> Percent of CPU this job got: 107%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

I was not able to get the "expected" results from your last reproducer,
but this one does happen on my system, too. The bad news is, I still
have no clue what is causing it...

-- 
All Rights Reversed.
Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
[CC Mike and Mel as they have seen some accounting oddities when doing
performance testing. They can share details but essentially the system
time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.gl23...@dhcp22.suse.cz

I suspect this is mainly for short lived processes - like kernel compile

$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.45
System time (seconds): 0.82
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.47
System time (seconds): 1.55
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.

$ perf report | grep kernel.vmlinux
  2.01%  as        [kernel.vmlinux]  [k] page_fault
  0.59%  cc1       [kernel.vmlinux]  [k] page_fault
  0.15%  git       [kernel.vmlinux]  [k] page_fault
  0.12%  bash      [kernel.vmlinux]  [k] page_fault
  0.11%  sh        [kernel.vmlinux]  [k] page_fault
  0.08%  gcc       [kernel.vmlinux]  [k] page_fault
  0.06%  make      [kernel.vmlinux]  [k] page_fault
  0.04%  rm        [kernel.vmlinux]  [k] page_fault
  0.03%  ld        [kernel.vmlinux]  [k] page_fault
  0.02%  bash      [kernel.vmlinux]  [k] entry_SYSCALL_64
  0.01%  git       [kernel.vmlinux]  [k] entry_SYSCALL_64
  0.01%  cat       [kernel.vmlinux]  [k] page_fault
  0.01%  collect2  [kernel.vmlinux]  [k] page_fault
  0.00%  sh        [kernel.vmlinux]  [k] entry_SYSCALL_64
  0.00%  rm        [kernel.vmlinux]  [k] entry_SYSCALL_64
  0.00%  grep      [kernel.vmlinux]  [k] page_fault

doesn't show anything unexpected. Rik's original reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
> [...]
> > OK, so it seems I found it. I was quite lucky because account_user_time
> > is not all that popular function and there were basically no changes
> > besides Rik's ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression. Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> >
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time -v
> > (source for ./max_mmap is [1])
> >
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> > User time (seconds): 0.00
> > System time (seconds): 0.45
> > Percent of CPU this job got: 98%
> >
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> > User time (seconds): 0.18
> > System time (seconds): 0.29
> > Percent of CPU this job got: 97%
>
> The patch in question makes user and system time accounting essentially
> tick-based. If jiffies changes while the task is in user mode, time gets
> accounted as user time, if jiffies changes while the task is in system
> mode, time gets accounted as system time.
>
> If you get "unlucky", with a job like the above, it is possible all
> time gets accounted to system time.
>
> This would be true both with the system running with a periodic timer
> tick (before and after my patch is applied), and in nohz_idle mode
> (after my patch).
>
> However, it does seem quite unlikely that you get zero user time, since
> you have 125 timer ticks in half a second. Furthermore, you do not even
> have NO_HZ_FULL enabled...
>
> Does the workload consistently get zero user time?
>
> If so, we need to dig further to see under what precise circumstances
> that happens.
>
> On my laptop, with kernel 4.6.3-300.fc24.x86_64 I get this:
>
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> User time (seconds): 0.64
> System time (seconds): 0.19
> Percent of CPU this
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Wed 24-08-16 12:14:06, Marcin Jabrzyk wrote:
[...]
> Sorry to hijack the thread, but I've found it recently and I guess it's
> the best place to present our point. We are working on our custom OS
> based on Linux and we also suffered much from the /proc/<pid>/smaps
> file. As in Chrome we tried to improve our internal application memory
> management policies (Low Memory Killer) using data provided by smaps,
> but we failed due to the very long time needed for reading and parsing
> the file properly.

I was already questioning Pss and also Private_* for any memory killer
purpose earlier in the thread because cumulative numbers for all mappings
can be really meaningless. Especially when you do not know which resource
is shared and by whom. Maybe you can describe how you are using those
cumulative numbers for your decisions and prove me wrong, but I simply
haven't heard any sound arguments so far. Everything was just "we know
what we are doing in our environment so we know those resources and
therefore those numbers make sense to us". But with all due respect this
is not a reason to add a user visible API into the kernel.

-- 
Michal Hocko
SUSE Labs
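[To make the parsing cost Marcin describes concrete, here is a minimal
userspace sketch of the per-process aggregation a smaps-based low-memory
killer has to do today. It is purely illustrative: the file name, the
choice of fields and the code are not from the patch set or from anyone
in the thread.]

/* smaps_sum.c - sum a few fields of /proc/<pid>/smaps (illustrative only) */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	long rss = 0, pss = 0, priv_clean = 0, priv_dirty = 0, swap = 0, val;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	/* smaps emits one "Key: value kB" block per VMA; for a process with
	 * tens of thousands of VMAs this loop is where the time goes */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Rss: %ld", &val) == 1)
			rss += val;
		else if (sscanf(line, "Pss: %ld", &val) == 1)
			pss += val;
		else if (sscanf(line, "Private_Clean: %ld", &val) == 1)
			priv_clean += val;
		else if (sscanf(line, "Private_Dirty: %ld", &val) == 1)
			priv_dirty += val;
		else if (sscanf(line, "Swap: %ld", &val) == 1)
			swap += val;
	}
	fclose(f);
	printf("rss:%ld pss:%ld private:%ld swap:%ld (kB)\n",
	       rss, pss, priv_clean + priv_dirty, swap);
	return 0;
}

[Against a process with tens of thousands of mappings, such as the
./max_map reproducer that comes up later in the thread, this does
essentially the same work as the quoted awk one-liners: the kernel has to
walk the page tables and format one text block per VMA, and userspace has
to parse it all back.]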
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Mon 29-08-16 16:37:04, Michal Hocko wrote: > [Sorry for a late reply, I was busy with other stuff] > > On Mon 22-08-16 15:44:53, Sonny Rao wrote: > > On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko wrote: > [...] > > But what about the private_clean and private_dirty? Surely > > those are more generally useful for calculating a lower bound on > > process memory usage without additional knowledge? > > I guess private_clean can be used as a reasonable estimate. I was thinking about this more and I think I am wrong here. Even truly MAP_PRIVATE|MAP_ANON will be in private_dirty. So private_clean will become not all that interesting and similarly misleading as its _dirty variant (mmaped file after [m]sync should become _clean) and that doesn't mean the memory will get freed after the process which maps it terminates. Take shmem as an example again. > private_dirty less so because it may refer to e.g. tmpfs which is not > mapped by other process and so no memory would be freed after unmap > without removing the file. -- Michal Hocko SUSE Labs
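[The point about MAP_PRIVATE|MAP_ANON is easy to see directly: map and
touch some anonymous memory, then look at your own smaps totals. A
minimal sketch, illustrative only and not from the thread:]

#define _DEFAULT_SOURCE
/* anon_dirty.c - touched MAP_PRIVATE|MAP_ANONYMOUS memory is counted as
 * Private_Dirty (and Anonymous) in smaps, not as Private_Clean */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16 * 1024 * 1024;	/* 16 MB */
	long clean = 0, dirty = 0, anon = 0, val;
	char line[256];
	FILE *f;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 1, len);	/* fault the pages in and dirty them */

	f = fopen("/proc/self/smaps", "r");
	if (!f) {
		perror("/proc/self/smaps");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Private_Clean: %ld", &val) == 1)
			clean += val;
		else if (sscanf(line, "Private_Dirty: %ld", &val) == 1)
			dirty += val;
		else if (sscanf(line, "Anonymous: %ld", &val) == 1)
			anon += val;
	}
	fclose(f);
	printf("Private_Clean: %ld kB, Private_Dirty: %ld kB, Anonymous: %ld kB\n",
	       clean, dirty, anon);
	return 0;
}

[On a typical system the 16 MB shows up under Private_Dirty and Anonymous
while the mapping contributes nothing to Private_Clean, which is why a
private_clean-based estimate misses ordinary anonymous memory entirely.]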
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
[Sorry for a late reply, I was busy with other stuff]

On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko wrote:
[...]
> But what about the private_clean and private_dirty? Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?

I guess private_clean can be used as a reasonable estimate. private_dirty
less so because it may refer to e.g. tmpfs which is not mapped by any
other process and so no memory would be freed after unmap without
removing the file.

> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.

I agree on this part definitely. I also understand that what we provide
currently is quite confusing and not really helpful. But I am afraid that
the accounting is far from trivial to make right for all the possible
cases.

> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.

Maybe we can be more clever and do some form of caching. I haven't
thought that through to see how hard that could be. I mean we could cache
some data per mm_struct and invalidate them only after the current value
would get too much out of sync.

> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg. Swap seems to be available in /proc/<pid>/status.
>
> I looked at the "shared" value in /proc/<pid>/statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?

task_statm() only approximates it as get_mm_counter(mm, MM_FILEPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES), so all the pages accounted to the mm.
If they are not shared by anybody else they would be considered private
by smaps.

> It might be useful to show the magnitude of difference of using RSS vs
> PSS/Private in the case of the Chrome renderer processes. On the
> system I was looking at there were about 40 of these processes, but I
> picked a few to give an idea:
>
> localhost ~ # cat /proc/21550/totmaps
> Rss:            98972 kB
> Pss:            54717 kB
> Shared_Clean:   19020 kB
> Shared_Dirty:   26352 kB
> Private_Clean:      0 kB
> Private_Dirty:  53600 kB
> Referenced:     92184 kB
> Anonymous:      46524 kB
> AnonHugePages:  24576 kB
> Swap:           13148 kB
>
> RSS is 80% higher than PSS and 84% higher than private data
>
> localhost ~ # cat /proc/21470/totmaps
> Rss:           118420 kB
> Pss:            70938 kB
> Shared_Clean:   22212 kB
> Shared_Dirty:   26520 kB
> Private_Clean:      0 kB
> Private_Dirty:  69688 kB
> Referenced:    111500 kB
> Anonymous:      79928 kB
> AnonHugePages:  24576 kB
> Swap:           12964 kB
>
> RSS is 66% higher than PSS and 69% higher than private data
>
> localhost ~ # cat /proc/21435/totmaps
> Rss:            97156 kB
> Pss:            50044 kB
> Shared_Clean:   21920 kB
> Shared_Dirty:   26400 kB
> Private_Clean:      0 kB
> Private_Dirty:  48836 kB
> Referenced:     90012 kB
> Anonymous:      75228 kB
> AnonHugePages:  24576 kB
> Swap:           13064 kB
>
> RSS is 94% higher than PSS and 98% higher than private data.
>
> It looks like there's a set of about 40MB of shared pages which cause
> the difference in this case.
> Swap was roughly even on these but I don't think it's always going to
> be true.

OK, I see that those processes differ in the way they are using memory
but I am not really sure what kind of conclusion you can draw from that.

-- 
Michal Hocko
SUSE Labs
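[The statm/smaps mismatch Sonny mentions can be reproduced with a few
lines. A sketch, illustrative only, with the field layout as documented
in proc(5), that prints both numbers side by side:]

/* statm_vs_smaps.c - compare statm's "shared" with smaps Shared_* sums */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	long size, resident, shared, sh_clean = 0, sh_dirty = 0, val;
	long page_kb = sysconf(_SC_PAGESIZE) / 1024;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/statm", argv[1]);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%ld %ld %ld", &size, &resident, &shared) != 3) {
		perror(path);
		return 1;
	}
	fclose(f);

	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Shared_Clean: %ld", &val) == 1)
			sh_clean += val;
		else if (sscanf(line, "Shared_Dirty: %ld", &val) == 1)
			sh_dirty += val;
	}
	fclose(f);

	/* statm's third field counts all resident file-backed and shmem
	 * pages of the process; smaps only calls a page Shared_* when it is
	 * currently mapped by more than one process -- hence the mismatch */
	printf("statm shared: %ld kB\n", shared * page_kb);
	printf("smaps shared: %ld kB (Shared_Clean + Shared_Dirty)\n",
	       sh_clean + sh_dirty);
	return 0;
}

[The two values answer different questions, which is why they are not
expected to agree: a file page mapped by a single process is "shared" for
statm but private for smaps.]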
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On 23/08/16 00:44, Sonny Rao wrote: On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko wrote: On Fri 19-08-16 10:57:48, Sonny Rao wrote: On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko wrote: On Thu 18-08-16 23:43:39, Sonny Rao wrote: On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko wrote: On Thu 18-08-16 10:47:57, Sonny Rao wrote: On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: On Wed 17-08-16 11:57:56, Sonny Rao wrote: [...] 2) User space OOM handling -- we'd rather do a more graceful shutdown than let the kernel's OOM killer activate and need to gather this information and we'd like to be able to get this information to make the decision much faster than 400ms Global OOM handling in userspace is really dubious if you ask me. I understand you want something better than SIGKILL and in fact this is already possible with memory cgroup controller (btw. memcg will give you a cheap access to rss, amount of shared, swapped out memory as well). Anyway if you are getting close to the OOM your system will most probably be really busy and chances are that also reading your new file will take much more time. I am also not quite sure how is pss useful for oom decisions. I mentioned it before, but based on experience RSS just isn't good enough -- there's too much sharing going on in our use case to make the correct decision based on RSS. If RSS were good enough, simply put, this patch wouldn't exist. But that doesn't answer my question, I am afraid. So how exactly do you use pss for oom decisions? We use PSS to calculate the memory used by a process among all the processes in the system, in the case of Chrome this tells us how much each renderer process (which is roughly tied to a particular "tab" in Chrome) is using and how much it has swapped out, so we know what the worst offenders are -- I'm not sure what's unclear about that? So let me ask more specifically. How can you make any decision based on the pss when you do not know _what_ is the shared resource. In other words if you select a task to terminate based on the pss then you have to kill others who share the same resource otherwise you do not release that shared resource. Not to mention that such a shared resource might be on tmpfs/shmem and it won't get released even after all processes which map it are gone. Ok I see why you're confused now, sorry. In our case that we do know what is being shared in general because the sharing is mostly between those processes that we're looking at and not other random processes or tmpfs, so PSS gives us useful data in the context of these processes which are sharing the data especially for monitoring between the set of these renderer processes. OK, I see and agree that pss might be useful when you _know_ what is shared. But this sounds quite specific to a particular workload. How many users are in a similar situation? In other words, if we present a single number without the context, how much useful it will be in general? Is it possible that presenting such a number could be even misleading for somebody who doesn't have an idea which resources are shared? These are all questions which should be answered before we actually add this number (be it a new/existing proc file or a syscall). I still believe that the number without wider context is just not all that useful. I see the specific point about PSS -- because you need to know what is being shared or otherwise use it in a whole system context, but I still think the whole system context is a valid and generally useful thing. 
But what about the private_clean and private_dirty? Surely those are more
generally useful for calculating a lower bound on process memory usage
without additional knowledge? At the end of the day all of these metrics
are approximations, and it comes down to how far off the various
approximations are and what trade offs we are willing to make. RSS is the
cheapest but the most coarse. PSS (with the correct context) and Private
data plus swap are much better but also more expensive due to the PT
walk. As far as I know, to get anything but RSS we have to go through
smaps or use memcg. Swap seems to be available in /proc/<pid>/status. I
looked at the "shared" value in /proc/<pid>/statm but it doesn't seem to
correlate well with the shared value in smaps -- not sure why? It might
be useful to show the magnitude of difference of using RSS vs PSS/Private
in the case of the Chrome renderer processes. On the system I was looking
at there were about 40 of these processes, but I picked a few to give an
idea:

localhost ~ # cat /proc/21550/totmaps
Rss:            98972 kB
Pss:            54717 kB
Shared_Clean:   19020 kB
Shared_Dirty:   26352 kB
Private_Clean:      0 kB
Private_Dirty:  53600 kB
Referenced:     92184 kB
Anonymous:      46524 kB
AnonHugePages:  24576 kB
Swap:           13148 kB

RSS is 80% higher than PSS and 84% higher than private data

localhost ~ # cat /proc/21470/totmaps
Rss:
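[As a quick check on those figures, using the totmaps numbers quoted in
full earlier in the thread: 98972/54717 ~ 1.81 and 98972/53600 ~ 1.85 for
the first process, 118420/70938 ~ 1.67 and 118420/69688 ~ 1.70 for the
second, 97156/50044 ~ 1.94 and 97156/48836 ~ 1.99 for the third. Each
"RSS is N% higher" line is simply Rss divided by Pss, respectively by
Private_Clean + Private_Dirty, minus one.]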
Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
> On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> > On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > > [...]
> > > > > I have no idea why those numbers are so different on my laptop
> > > > > yet. It surely looks suspicious. I will try to debug this
> > > > > further tomorrow.
> > > >
> > > > Hmm, so I've tried to use my version of awk on other machine and
> > > > vice versa and it didn't make any difference. So this is
> > > > independent on the awk version it seems. So I've tried to strace
> > > > /usr/bin/time and
> > > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0,
> > > > {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > > >
> > > > so the kernel indeed reports 0 user time for some reason. Note I
> > > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no
> > > > local modifications). The other machine which reports non-0
> > > > utime is 3.12 SLES kernel. Maybe I am hitting some accounting
> > > > bug. At first I was suspecting CONFIG_NO_HZ_FULL because that is
> > > > the main difference between my and the other machine but then
> > > > I've noticed that the tests I was doing in kvm have this
> > > > disabled too.. so it must be something else.
> > >
> > > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the
> > > same in both kernels.
> >
> > and one more thing. It is not like utime accounting would be
> > completely broken and always report 0. Other commands report non-0
> > values even on 4.6 kernels. I will try to bisect this down later
> > today.
>
> OK, so it seems I found it. I was quite lucky because account_user_time
> is not all that popular function and there were basically no changes
> besides Rik's ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
> to jiffy granularity") and that seems to cause the regression. Reverting
> the commit on top of the current mmotm seems to fix the issue for me.
>
> And just to give Rik more context. While debugging overhead of the
> /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time -v
> (source for ./max_mmap is [1])
>
> root@test1:~# uname -r
> 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> root@test1:~# ./max_map
> pid:2990 maps:65515
> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> rss:263368 pss:262203
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> User time (seconds): 0.00
> System time (seconds): 0.45
> Percent of CPU this job got: 98%
>
> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> rss:263316 pss:262199
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> User time (seconds): 0.18
> System time (seconds): 0.29
> Percent of CPU this job got: 97%

The patch in question makes user and system time accounting essentially
tick-based. If jiffies changes while the task is in user mode, time gets
accounted as user time, if jiffies changes while the task is in system
mode, time gets accounted as system time.

If you get "unlucky", with a job like the above, it is possible all time
gets accounted to system time.
This would be true both with the system running with a periodic timer tick (before and after my patch is applied), and in nohz_idle mode (after my patch). However, it does seem quite unlikely that you get zero user time, since you have 125 timer ticks in half a second. Furthermore, you do not even have NO_HZ_FULL enabled... Does the workload consistently get zero user time? If so, we need to dig further to see under what precise circumstances that happens. On my laptop, with kernel 4.6.3-300.fc24.x86_64 I get this: $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps rss:263368 pss:262145 Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps" User time (seconds): 0.64 System time (seconds): 0.19 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83 The main difference between your and my NO_HZ config seems to be that NO_HZ_FULL is set here. However, it is not enabled at run time, so both of our systems should only really get NO_HZ_IDLE effectively. Running tasks should get sampled with the regular timer tick, while they are running. In other words, vtime accounting should be disabled in both of our tests, for everything except the idle task. Do I need to do
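[For anyone who wants to poke at this outside of the smaps case, here is
a small self-contained sketch -- not from the thread -- that alternates
very short user-space bursts with system calls and then prints the
kernel's user/system split for itself. With jiffy-granularity accounting
the split is decided by which mode the timer tick happens to land in, so
workloads like this, or the awk-over-smaps runs above, can end up with
grossly skewed or even zero user time:]

/* utime_split.c - alternate short user-space bursts with system calls,
 * then report how the kernel attributed the CPU time */
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	volatile unsigned long x = 0;
	char buf[1024];
	struct rusage ru;
	int fd = open("/proc/self/stat", O_RDONLY);
	long i, j;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < 200000; i++) {
		/* a short burst of pure user-space work ... */
		for (j = 0; j < 2000; j++)
			x += j;
		/* ... followed by a short stretch of kernel work */
		lseek(fd, 0, SEEK_SET);
		read(fd, buf, sizeof(buf));
	}
	close(fd);

	getrusage(RUSAGE_SELF, &ru);
	printf("user:   %ld.%06ld s\n",
	       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
	printf("system: %ld.%06ld s\n",
	       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
	return 0;
}

[The numbers it prints come from the same accounting that wait4() and
/usr/bin/time report, so comparing a run on an affected kernel with one
on a kernel with the commit reverted should show the same skew as the awk
case.]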
utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > [...]
> > > > I have no idea why those numbers are so different on my laptop
> > > > yet. It surely looks suspicious. I will try to debug this further
> > > > tomorrow.
> > >
> > > Hmm, so I've tried to use my version of awk on other machine and vice
> > > versa and it didn't make any difference. So this is independent on the
> > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0},
> > > ru_stime={0, 688438}, ...}) = 9128
> > >
> > > so the kernel indeed reports 0 user time for some reason. Note I
> > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > > modifications). The other machine which reports non-0 utime is 3.12
> > > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > > my and the other machine but then I've noticed that the tests I was
> > > doing in kvm have this disabled too.. so it must be something else.
> >
> > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> > in both kernels.
>
> and one more thing. It is not like utime accounting would be completely
> broken and always report 0. Other commands report non-0 values even on
> 4.6 kernels. I will try to bisect this down later today.

OK, so it seems I found it. I was quite lucky because account_user_time
is not all that popular function and there were basically no changes
besides Rik's ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
to jiffy granularity") and that seems to cause the regression. Reverting
the commit on top of the current mmotm seems to fix the issue for me.

And just to give Rik more context. While debugging overhead of the
/proc/<pid>/smaps I am getting a misleading output from /usr/bin/time -v
(source for ./max_mmap is [1])

root@test1:~# uname -r
4.5.0-rc6-bisect1-00025-gff9a9b4c4334
root@test1:~# ./max_map
pid:2990 maps:65515
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
rss:263368 pss:262203
	Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
	User time (seconds): 0.00
	System time (seconds): 0.45
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1796
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 83
	Voluntary context switches: 6
	Involuntary context switches: 6
	Swaps: 0
	File system inputs: 248
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

See the User time being 0 (as you can see above in the quoted text it is
not a rounding error in userspace or something similar because wait4
really returns 0).
Now with the revert

root@test1:~# uname -r
4.5.0-rc6-revert-00026-g7fc86f968bf5
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
rss:263316 pss:262199
	Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
	User time (seconds): 0.18
	System time (seconds): 0.29
	Percent of CPU this job got: 97%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1760
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 79
	Voluntary context switches: 5
	Involuntary context switches: 7
	Swaps: 0
	File system inputs: 248
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

So it looks like the whole user time is accounted as the system time. My
config is attached and yes I do have CONFIG_VIRT_CPU_ACCOUNTING_GEN
enabled. Could you have a look please?

[1] http://lkml.kernel.org/r/20160817082200.ga10...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

Attachment: .config.gz (application/gzip)
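[The actual ./max_map source is behind the link at [1]. For readers
without it, a hypothetical stand-in that behaves the same way -- create
as many separate VMAs as vm.max_map_count allows, report pid and map
count, then sit there so /proc/<pid>/smaps can be read from another
shell -- could look like the sketch below. This is an illustration, not
the reproducer used in the thread.]

#define _DEFAULT_SOURCE
/* max_map.c - blow up the number of VMAs of the current process */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	long maps = 0;
	char line[256];
	FILE *f;

	for (;;) {
		/* map two pages and punch out the second one so the next
		 * mapping can never be merged with this one */
		char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;		/* max_map_count (or ENOMEM) hit */
		munmap(p + page, page);
		*p = 1;			/* make the page resident */
	}

	/* one line per VMA in /proc/self/maps */
	f = fopen("/proc/self/maps", "r");
	while (f && fgets(line, sizeof(line), f))
		maps++;
	if (f)
		fclose(f);

	printf("pid:%d maps:%ld\n", (int)getpid(), maps);
	pause();			/* keep the address space around */
	return 0;
}

[With the default vm.max_map_count of 65530 this ends up in the same
ballpark as the "maps:65515" and ~260 MB Rss figures quoted above, which
is what makes the smaps walk slow enough to show the accounting skew.]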
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Mon 22-08-16 19:47:09, Michal Hocko wrote: > On Mon 22-08-16 19:29:36, Michal Hocko wrote: > > On Mon 22-08-16 18:45:54, Michal Hocko wrote: > > [...] > > > I have no idea why those numbers are so different on my laptop > > > yet. It surely looks suspicious. I will try to debug this further > > > tomorrow. > > > > Hmm, so I've tried to use my version of awk on other machine and vice > > versa and it didn't make any difference. So this is independent on the > > awk version it seems. So I've tried to strace /usr/bin/time and > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, > > ru_stime={0, 688438}, ...}) = 9128 > > > > so the kernel indeed reports 0 user time for some reason. Note I > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local > > modifications). The other machine which reports non-0 utime is 3.12 > > SLES kernel. Maybe I am hitting some accounting bug. At first I was > > suspecting CONFIG_NO_HZ_FULL because that is the main difference between > > my and the other machine but then I've noticed that the tests I was > > doing in kvm have this disabled too.. so it must be something else. > > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same > in both kernels. and one more thing. It is not like utime accounting would be completely broken and always report 0. Other commands report non-0 values even on 4.6 kernels. I will try to bisect this down later today. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko wrote: > On Fri 19-08-16 10:57:48, Sonny Rao wrote: >> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko wrote: >> > On Thu 18-08-16 23:43:39, Sonny Rao wrote: >> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko wrote: >> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote: >> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko >> >> >> wrote: >> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote: >> >> > [...] >> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful >> >> >> >> shutdown >> >> >> >> than let the kernel's OOM killer activate and need to gather this >> >> >> >> information and we'd like to be able to get this information to make >> >> >> >> the decision much faster than 400ms >> >> >> > >> >> >> > Global OOM handling in userspace is really dubious if you ask me. I >> >> >> > understand you want something better than SIGKILL and in fact this is >> >> >> > already possible with memory cgroup controller (btw. memcg will give >> >> >> > you a cheap access to rss, amount of shared, swapped out memory as >> >> >> > well). Anyway if you are getting close to the OOM your system will >> >> >> > most >> >> >> > probably be really busy and chances are that also reading your new >> >> >> > file >> >> >> > will take much more time. I am also not quite sure how is pss useful >> >> >> > for >> >> >> > oom decisions. >> >> >> >> >> >> I mentioned it before, but based on experience RSS just isn't good >> >> >> enough -- there's too much sharing going on in our use case to make >> >> >> the correct decision based on RSS. If RSS were good enough, simply >> >> >> put, this patch wouldn't exist. >> >> > >> >> > But that doesn't answer my question, I am afraid. So how exactly do you >> >> > use pss for oom decisions? >> >> >> >> We use PSS to calculate the memory used by a process among all the >> >> processes in the system, in the case of Chrome this tells us how much >> >> each renderer process (which is roughly tied to a particular "tab" in >> >> Chrome) is using and how much it has swapped out, so we know what the >> >> worst offenders are -- I'm not sure what's unclear about that? >> > >> > So let me ask more specifically. How can you make any decision based on >> > the pss when you do not know _what_ is the shared resource. In other >> > words if you select a task to terminate based on the pss then you have to >> > kill others who share the same resource otherwise you do not release >> > that shared resource. Not to mention that such a shared resource might >> > be on tmpfs/shmem and it won't get released even after all processes >> > which map it are gone. >> >> Ok I see why you're confused now, sorry. >> >> In our case that we do know what is being shared in general because >> the sharing is mostly between those processes that we're looking at >> and not other random processes or tmpfs, so PSS gives us useful data >> in the context of these processes which are sharing the data >> especially for monitoring between the set of these renderer processes. > > OK, I see and agree that pss might be useful when you _know_ what is > shared. But this sounds quite specific to a particular workload. How > many users are in a similar situation? In other words, if we present > a single number without the context, how much useful it will be in > general? Is it possible that presenting such a number could be even > misleading for somebody who doesn't have an idea which resources are > shared? 
These are all questions which should be answered before we > actually add this number (be it a new/existing proc file or a syscall). > I still believe that the number without wider context is just not all > that useful. I see the specific point about PSS -- because you need to know what is being shared or otherwise use it in a whole system context, but I still think the whole system context is a valid and generally useful thing. But what about the private_clean and private_dirty? Surely those are more generally useful for calculating a lower bound on process memory usage without additional knowledge? At the end of the day all of these metrics are approximations, and it comes down to how far off the various approximations are and what trade offs we are willing to make. RSS is the cheapest but the most coarse. PSS (with the correct context) and Private data plus swap are much better but also more expensive due to the PT walk. As far as I know, to get anything but RSS we have to go through smaps or use memcg. Swap seems to be available in /proc//status. I looked at the "shared" value in /proc//statm but it doesn't seem to correlate well with the shared value in smaps -- not sure why? It might be useful to show the magnitude of difference of using RSS vs PSS/Private in the case of the Chrome renderer processes. On the system I was looking at there were about 40 of these processes, but I
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko wrote: > On Fri 19-08-16 10:57:48, Sonny Rao wrote: >> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko wrote: >> > On Thu 18-08-16 23:43:39, Sonny Rao wrote: >> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko wrote: >> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote: >> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko >> >> >> wrote: >> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote: >> >> > [...] >> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful >> >> >> >> shutdown >> >> >> >> than let the kernel's OOM killer activate and need to gather this >> >> >> >> information and we'd like to be able to get this information to make >> >> >> >> the decision much faster than 400ms >> >> >> > >> >> >> > Global OOM handling in userspace is really dubious if you ask me. I >> >> >> > understand you want something better than SIGKILL and in fact this is >> >> >> > already possible with memory cgroup controller (btw. memcg will give >> >> >> > you a cheap access to rss, amount of shared, swapped out memory as >> >> >> > well). Anyway if you are getting close to the OOM your system will >> >> >> > most >> >> >> > probably be really busy and chances are that also reading your new >> >> >> > file >> >> >> > will take much more time. I am also not quite sure how is pss useful >> >> >> > for >> >> >> > oom decisions. >> >> >> >> >> >> I mentioned it before, but based on experience RSS just isn't good >> >> >> enough -- there's too much sharing going on in our use case to make >> >> >> the correct decision based on RSS. If RSS were good enough, simply >> >> >> put, this patch wouldn't exist. >> >> > >> >> > But that doesn't answer my question, I am afraid. So how exactly do you >> >> > use pss for oom decisions? >> >> >> >> We use PSS to calculate the memory used by a process among all the >> >> processes in the system, in the case of Chrome this tells us how much >> >> each renderer process (which is roughly tied to a particular "tab" in >> >> Chrome) is using and how much it has swapped out, so we know what the >> >> worst offenders are -- I'm not sure what's unclear about that? >> > >> > So let me ask more specifically. How can you make any decision based on >> > the pss when you do not know _what_ is the shared resource. In other >> > words if you select a task to terminate based on the pss then you have to >> > kill others who share the same resource otherwise you do not release >> > that shared resource. Not to mention that such a shared resource might >> > be on tmpfs/shmem and it won't get released even after all processes >> > which map it are gone. >> >> Ok I see why you're confused now, sorry. >> >> In our case that we do know what is being shared in general because >> the sharing is mostly between those processes that we're looking at >> and not other random processes or tmpfs, so PSS gives us useful data >> in the context of these processes which are sharing the data >> especially for monitoring between the set of these renderer processes. > > OK, I see and agree that pss might be useful when you _know_ what is > shared. But this sounds quite specific to a particular workload. How > many users are in a similar situation? In other words, if we present > a single number without the context, how much useful it will be in > general? Is it possible that presenting such a number could be even > misleading for somebody who doesn't have an idea which resources are > shared? 
These are all questions which should be answered before we > actually add this number (be it a new/existing proc file or a syscall). > I still believe that the number without wider context is just not all > that useful. I see the specific point about PSS -- because you need to know what is being shared or otherwise use it in a whole system context, but I still think the whole system context is a valid and generally useful thing. But what about the private_clean and private_dirty? Surely those are more generally useful for calculating a lower bound on process memory usage without additional knowledge? At the end of the day all of these metrics are approximations, and it comes down to how far off the various approximations are and what trade offs we are willing to make. RSS is the cheapest but the most coarse. PSS (with the correct context) and Private data plus swap are much better but also more expensive due to the PT walk. As far as I know, to get anything but RSS we have to go through smaps or use memcg. Swap seems to be available in /proc//status. I looked at the "shared" value in /proc//statm but it doesn't seem to correlate well with the shared value in smaps -- not sure why? It might be useful to show the magnitude of difference of using RSS vs PSS/Private in the case of the Chrome renderer processes. On the system I was looking at there were about 40 of these processes, but I picked a few to give an idea: localhost ~ # cat /proc/21550/totmaps Rss:
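The per-process summing described above (PSS plus the private and swap fields) is what the awk one-liners elsewhere in this thread compute; a rough C sketch of the same walk over /proc/<pid>/smaps is below. The pid comes from the command line and the values are the kB figures smaps already prints; this is only an illustration of the userspace side, not of the proposed totmaps file.

/* Sketch: sum the smaps fields discussed above (Pss, Private_Clean,
 * Private_Dirty, Swap) for one process.  Values are in kB, as printed by
 * /proc/<pid>/smaps.  Pass the pid as argv[1]. */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64], line[256];
    unsigned long pss = 0, pclean = 0, pdirty = 0, swap = 0, v;

    if (argc != 2)
        return 1;
    snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* each sscanf only matches its exact field name, so e.g. SwapPss
         * lines do not get counted as Swap */
        if (sscanf(line, "Pss: %lu", &v) == 1)
            pss += v;
        else if (sscanf(line, "Private_Clean: %lu", &v) == 1)
            pclean += v;
        else if (sscanf(line, "Private_Dirty: %lu", &v) == 1)
            pdirty += v;
        else if (sscanf(line, "Swap: %lu", &v) == 1)
            swap += v;
    }
    fclose(f);

    printf("pss:%lu private_clean:%lu private_dirty:%lu swap:%lu (kB)\n",
           pss, pclean, pdirty, swap);
    return 0;
}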
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon 22-08-16 19:29:36, Michal Hocko wrote: > On Mon 22-08-16 18:45:54, Michal Hocko wrote: > [...] > > I have no idea why those numbers are so different on my laptop > > yet. It surely looks suspicious. I will try to debug this further > > tomorrow. > > Hmm, so I've tried to use my version of awk on other machine and vice > versa and it didn't make any difference. So this is independent on the > awk version it seems. So I've tried to strace /usr/bin/time and > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, > ru_stime={0, 688438}, ...}) = 9128 > > so the kernel indeed reports 0 user time for some reason. Note I > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local > modifications). The other machine which reports non-0 utime is 3.12 > SLES kernel. Maybe I am hitting some accounting bug. At first I was > suspecting CONFIG_NO_HZ_FULL because that is the main difference between > my and the other machine but then I've noticed that the tests I was > doing in kvm have this disabled too.. so it must be something else. 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same in both kernels. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon 22-08-16 18:45:54, Michal Hocko wrote: [...] > I have no idea why those numbers are so different on my laptop > yet. It surely looks suspicious. I will try to debug this further > tomorrow. Hmm, so I've tried to use my version of awk on other machine and vice versa and it didn't make any difference. So this is independent on the awk version it seems. So I've tried to strace /usr/bin/time and wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128 so the kernel indeed reports 0 user time for some reason. Note I was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local modifications). The other machine which reports non-0 utime is 3.12 SLES kernel. Maybe I am hitting some accounting bug. At first I was suspecting CONFIG_NO_HZ_FULL because that is the main difference between my and the other machine but then I've noticed that the tests I was doing in kvm have this disabled too.. so it must be something else. Weird... -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon 22-08-16 23:12:41, Minchan Kim wrote: > On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote: > > On Mon 22-08-16 09:07:45, Minchan Kim wrote: > > [...] > > > #!/bin/sh > > > ./smap_test & > > > pid=$! > > > > > > for i in $(seq 25) > > > do > > > awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \ > > > /proc/$pid/smaps > > > done > > > kill $pid > > > > > > root@bbox:/home/barrios/test/smap# time ./s.sh > > > pid:21973 > > > > > > real0m17.812s > > > user0m12.612s > > > sys 0m5.187s > > > > retested on the bare metal (x86_64 - 2CPUs) > > Command being timed: "sh s.sh" > > User time (seconds): 0.00 > > System time (seconds): 18.08 > > Percent of CPU this job got: 98% > > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29 > > > > multiple runs are quite consistent in those numbers. I am running with > > $ awk --version > > GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0) > > > > > > like a problem we are not able to address. And I would even argue that > > > > we want to address it in a generic way as much as possible. > > > > > > Sure. What solution do you think as generic way? > > > > either optimize seq_printf or replace it with something faster. > > If it's real culprit, I agree. However, I tested your test program on > my 2 x86 machines and my friend's machine. > > Ubuntu, Fedora, Arch > > They have awk 4.0.1 and 4.1.3. > > Result are same. Userspace speand more times I mentioned. > > [root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END > {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps > rss:263484 pss:262188 > > real0m0.770s > user0m0.574s > sys 0m0.197s > > I will attach my test progrma source. > I hope you guys test and repost the result because it's the key for direction > of patchset. Hmm, this is really interesting. I have checked a different machine and it shows different results. Same code, slightly different version of awk (4.1.0) and the results are different Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/48925/smaps" User time (seconds): 0.43 System time (seconds): 0.27 I have no idea why those numbers are so different on my laptop yet. It surely looks suspicious. I will try to debug this further tomorrow. Anyway, the performance is just one side of the problem. I have tried to express my concerns about a single exported pss value in other email. Please try to step back and think about how useful is this information without the knowing which resource we are talking about. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc//totmaps
On 2016-08-22 10:12 AM, Minchan Kim wrote: On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote: On Mon 22-08-16 09:07:45, Minchan Kim wrote: [...] #!/bin/sh ./smap_test & pid=$! for i in $(seq 25) do awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \ /proc/$pid/smaps done kill $pid root@bbox:/home/barrios/test/smap# time ./s.sh pid:21973 real0m17.812s user0m12.612s sys 0m5.187s retested on the bare metal (x86_64 - 2CPUs) Command being timed: "sh s.sh" User time (seconds): 0.00 System time (seconds): 18.08 Percent of CPU this job got: 98% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29 multiple runs are quite consistent in those numbers. I am running with $ awk --version GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0) $ ./smap_test & pid:19658 nr_vma:65514 $ time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/19658/smaps rss:263452 pss:262151 real0m0.625s user0m0.404s sys 0m0.216s $ awk --version GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0) like a problem we are not able to address. And I would even argue that we want to address it in a generic way as much as possible. Sure. What solution do you think as generic way? either optimize seq_printf or replace it with something faster. If it's real culprit, I agree. However, I tested your test program on my 2 x86 machines and my friend's machine. Ubuntu, Fedora, Arch They have awk 4.0.1 and 4.1.3. Result are same. Userspace speand more times I mentioned. [root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps rss:263484 pss:262188 real0m0.770s user0m0.574s sys 0m0.197s I will attach my test progrma source. I hope you guys test and repost the result because it's the key for direction of patchset. Thanks.
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote: > On Mon 22-08-16 09:07:45, Minchan Kim wrote: > [...] > > #!/bin/sh > > ./smap_test & > > pid=$! > > > > for i in $(seq 25) > > do > > awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \ > > /proc/$pid/smaps > > done > > kill $pid > > > > root@bbox:/home/barrios/test/smap# time ./s.sh > > pid:21973 > > > > real 0m17.812s > > user 0m12.612s > > sys 0m5.187s > > retested on the bare metal (x86_64 - 2CPUs) > Command being timed: "sh s.sh" > User time (seconds): 0.00 > System time (seconds): 18.08 > Percent of CPU this job got: 98% > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29 > > multiple runs are quite consistent in those numbers. I am running with > $ awk --version > GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0) > > > > like a problem we are not able to address. And I would even argue that > > > we want to address it in a generic way as much as possible. > > > > Sure. What solution do you think as generic way? > > either optimize seq_printf or replace it with something faster. If it's the real culprit, I agree. However, I tested your test program on my 2 x86 machines and my friend's machine (Ubuntu, Fedora, Arch). They have awk 4.0.1 and 4.1.3. The results are the same: userspace spends more time, as I mentioned.

[root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
rss:263484 pss:262188

real 0m0.770s
user 0m0.574s
sys 0m0.197s

I will attach my test program source. I hope you guys test and repost the result because it's the key for the direction of the patchset. Thanks.

/* Map 4 kB MAP_SHARED|MAP_POPULATE chunks until mmap() fails, creating as
 * many VMAs as the kernel allows, then sleep so the process can be
 * inspected through /proc/<pid>/smaps. */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main()
{
    unsigned long nr_vma = 0;

    while (1) {
        if (mmap(0, 4096, PROT_READ|PROT_WRITE,
                 MAP_ANONYMOUS|MAP_SHARED|MAP_POPULATE, -1, 0) == MAP_FAILED)
            break;
        nr_vma++;
    }
    printf("pid:%d nr_vma:%lu\n", getpid(), nr_vma);
    pause();
    return 0;
}
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Fri 19-08-16 10:57:48, Sonny Rao wrote: > On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko wrote: > > On Thu 18-08-16 23:43:39, Sonny Rao wrote: > >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko wrote: > >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote: > >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko > >> >> wrote: > >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote: > >> > [...] > >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown > >> >> >> than let the kernel's OOM killer activate and need to gather this > >> >> >> information and we'd like to be able to get this information to make > >> >> >> the decision much faster than 400ms > >> >> > > >> >> > Global OOM handling in userspace is really dubious if you ask me. I > >> >> > understand you want something better than SIGKILL and in fact this is > >> >> > already possible with memory cgroup controller (btw. memcg will give > >> >> > you a cheap access to rss, amount of shared, swapped out memory as > >> >> > well). Anyway if you are getting close to the OOM your system will > >> >> > most > >> >> > probably be really busy and chances are that also reading your new > >> >> > file > >> >> > will take much more time. I am also not quite sure how is pss useful > >> >> > for > >> >> > oom decisions. > >> >> > >> >> I mentioned it before, but based on experience RSS just isn't good > >> >> enough -- there's too much sharing going on in our use case to make > >> >> the correct decision based on RSS. If RSS were good enough, simply > >> >> put, this patch wouldn't exist. > >> > > >> > But that doesn't answer my question, I am afraid. So how exactly do you > >> > use pss for oom decisions? > >> > >> We use PSS to calculate the memory used by a process among all the > >> processes in the system, in the case of Chrome this tells us how much > >> each renderer process (which is roughly tied to a particular "tab" in > >> Chrome) is using and how much it has swapped out, so we know what the > >> worst offenders are -- I'm not sure what's unclear about that? > > > > So let me ask more specifically. How can you make any decision based on > > the pss when you do not know _what_ is the shared resource. In other > > words if you select a task to terminate based on the pss then you have to > > kill others who share the same resource otherwise you do not release > > that shared resource. Not to mention that such a shared resource might > > be on tmpfs/shmem and it won't get released even after all processes > > which map it are gone. > > Ok I see why you're confused now, sorry. > > In our case that we do know what is being shared in general because > the sharing is mostly between those processes that we're looking at > and not other random processes or tmpfs, so PSS gives us useful data > in the context of these processes which are sharing the data > especially for monitoring between the set of these renderer processes. OK, I see and agree that pss might be useful when you _know_ what is shared. But this sounds quite specific to a particular workload. How many users are in a similar situation? In other words, if we present a single number without the context, how much useful it will be in general? Is it possible that presenting such a number could be even misleading for somebody who doesn't have an idea which resources are shared? These are all questions which should be answered before we actually add this number (be it a new/existing proc file or a syscall). 
I still believe that the number without wider context is just not all that useful. > We also use the private clean and private dirty and swap fields to > make a few metrics for the processes and charge each process for it's > private, shared, and swap data. Private clean and dirty are used for > estimating a lower bound on how much memory would be freed. I can imagine that this kind of information might be useful and presented in /proc/<pid>/statm. The question is whether some of the existing consumers would see the performance impact due to the page table walk. Anyway even these counters might get quite tricky because even shareable resources are considered private if the process is the only one to map them (so again this might be a file on tmpfs...). > Swap and > PSS also give us some indication of additional memory which might get freed up. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon 22-08-16 09:07:45, Minchan Kim wrote: [...] > #!/bin/sh > ./smap_test & > pid=$! > > for i in $(seq 25) > do > awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \ > /proc/$pid/smaps > done > kill $pid > > root@bbox:/home/barrios/test/smap# time ./s.sh > pid:21973 > > real 0m17.812s > user 0m12.612s > sys 0m5.187s retested on the bare metal (x86_64 - 2CPUs) Command being timed: "sh s.sh" User time (seconds): 0.00 System time (seconds): 18.08 Percent of CPU this job got: 98% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29 multiple runs are quite consistent in those numbers. I am running with $ awk --version GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0) > > like a problem we are not able to address. And I would even argue that > we want to address it in a generic way as much as possible. > > Sure. What solution do you think as generic way? either optimize seq_printf or replace it with something faster. -- Michal Hocko SUSE Labs
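The "optimize seq_printf or replace it with something faster" remark is about the kernel's output formatting. As a purely illustrative userspace analogue (not kernel code and not the actual show_smap() implementation), the sketch below formats a smaps-sized amount of output once with snprintf() and once with a bare decimal converter, showing that printf-style formatting of ~65k entries has a cost of its own. The VMA count mirrors the 65515-mapping test process used in this thread; the field list is only loosely modelled on smaps.

/* Illustrative micro-benchmark (userspace, not kernel code): format the
 * rough equivalent of one smaps entry 65515 times, once with snprintf and
 * once with a bare decimal converter. */
#include <stdio.h>
#include <string.h>
#include <time.h>

static size_t put_dec(char *buf, const char *name, unsigned long v)
{
    char tmp[24];
    int i = 0;
    size_t n = strlen(name);

    memcpy(buf, name, n);
    do {                        /* convert to decimal, least significant first */
        tmp[i++] = '0' + v % 10;
        v /= 10;
    } while (v);
    while (i)                   /* copy digits back in the right order */
        buf[n++] = tmp[--i];
    memcpy(buf + n, " kB\n", 4);
    return n + 4;
}

int main(void)
{
    enum { VMAS = 65515, FIELDS = 14 };
    static const char *names[FIELDS] = {
        "Size: ", "Rss: ", "Pss: ", "Shared_Clean: ", "Shared_Dirty: ",
        "Private_Clean: ", "Private_Dirty: ", "Referenced: ", "Anonymous: ",
        "AnonHugePages: ", "Swap: ", "KernelPageSize: ", "MMUPageSize: ",
        "Locked: ",
    };
    char buf[4096];
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < VMAS; i++) {
        size_t off = 0;
        for (int f = 0; f < FIELDS; f++)
            off += snprintf(buf + off, sizeof(buf) - off,
                            "%s%8lu kB\n", names[f], (unsigned long)(i + f));
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    printf("snprintf:  %.3fs\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < VMAS; i++) {
        size_t off = 0;
        for (int f = 0; f < FIELDS; f++)
            off += put_dec(buf + off, names[f], (unsigned long)(i + f));
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    printf("plain dec: %.3fs\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
    return 0;
}

The absolute numbers are meaningless outside this toy; the point is only that the formatting step itself is measurable at this scale, which is why it shows up in the profiles later in the thread.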
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Fri, Aug 19, 2016 at 10:05:32AM +0200, Michal Hocko wrote: > On Fri 19-08-16 11:26:34, Minchan Kim wrote: > > Hi Michal, > > > > On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote: > > > On Thu 18-08-16 10:47:57, Sonny Rao wrote: > > > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko > > > > wrote: > > > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote: > > > [...] > > > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown > > > > >> than let the kernel's OOM killer activate and need to gather this > > > > >> information and we'd like to be able to get this information to make > > > > >> the decision much faster than 400ms > > > > > > > > > > Global OOM handling in userspace is really dubious if you ask me. I > > > > > understand you want something better than SIGKILL and in fact this is > > > > > already possible with memory cgroup controller (btw. memcg will give > > > > > you a cheap access to rss, amount of shared, swapped out memory as > > > > > well). Anyway if you are getting close to the OOM your system will > > > > > most > > > > > probably be really busy and chances are that also reading your new > > > > > file > > > > > will take much more time. I am also not quite sure how is pss useful > > > > > for > > > > > oom decisions. > > > > > > > > I mentioned it before, but based on experience RSS just isn't good > > > > enough -- there's too much sharing going on in our use case to make > > > > the correct decision based on RSS. If RSS were good enough, simply > > > > put, this patch wouldn't exist. > > > > > > But that doesn't answer my question, I am afraid. So how exactly do you > > > use pss for oom decisions? > > > > My case is not for OOM decision but I agree it would be great if we can get > > *fast* smap summary information. > > > > PSS is really great tool to figure out how processes consume memory > > more exactly rather than RSS. We have been used it for monitoring > > of memory for per-process. Although it is not used for OOM decision, > > it would be great if it is speed up because we don't want to spend > > many CPU time for just monitoring. > > > > For our usecase, we don't need AnonHugePages, ShmemPmdMapped, > > Shared_Hugetlb, > > Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and > > hugetlb. Additionally, Locked can be known via vma flags so we don't need > > it, > > either. Even, we don't need address range for just monitoring when we don't > > investigate in detail. > > > > Although they are not severe overhead, why does it emit the useless > > information? Even bloat day by day. :( With that, userspace tools should > > spend more time to parse which is pointless. > > So far it doesn't really seem that the parsing is the biggest problem. > The major cycles killer is the output formatting and that doesn't sound I cannot understand how kernel space is more expensive. Hmm. I tested your test program on my machine.

#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
    cat /proc/$pid/smaps > /dev/null
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s_v.sh
pid:21925

real 0m3.365s
user 0m0.031s
sys 0m3.046s

vs.

#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
    awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
        /proc/$pid/smaps
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s.sh
pid:21973

real 0m17.812s
user 0m12.612s
sys 0m5.187s

perf report says

39.56%  awk  gawk               [.] dfaexec
 7.61%  awk  [kernel.kallsyms]  [k] format_decode
 6.37%  awk  gawk               [.] avoid_dfa
 5.85%  awk  gawk               [.] interpret
 5.69%  awk  [kernel.kallsyms]  [k] __memcpy
 4.37%  awk  [kernel.kallsyms]  [k] vsnprintf
 2.69%  awk  [kernel.kallsyms]  [k] number.isra.13
 2.10%  awk  gawk               [.] research
 1.91%  awk  gawk               [.] 0x000351d0
 1.49%  awk  gawk               [.] free_wstr
 1.27%  awk  gawk               [.] unref
 1.19%  awk  gawk               [.] reset_record
 0.95%  awk  gawk               [.] set_record
 0.95%  awk  gawk               [.] get_field
 0.94%  awk  [kernel.kallsyms]  [k] show_smap

Parsing is much more expensive than the kernel side. Could you retest your test program? > like a problem we are not able to address. And I would even argue that > we want to address it in a generic way as much as possible. Sure. What solution do you think as generic way?
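The cat vs. awk comparison above is what separates the kernel-side cost of generating smaps from the userspace parsing cost. A small reader that does the equivalent of the cat run, reading /proc/<pid>/smaps to EOF and discarding the data, makes the same split measurable without involving awk at all. A sketch, with pid and iteration count taken from the command line (the default of 25 iterations matches the s.sh loop in the thread):

/* Sketch of the "cat /proc/<pid>/smaps > /dev/null" half of the comparison:
 * read the file to EOF N times and throw the data away, so the measured
 * time is dominated by the kernel generating smaps rather than by
 * userspace parsing.  Usage: ./smaps_read <pid> [iterations] */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], buf[65536];
    int iters = (argc > 2) ? atoi(argv[2]) : 25;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid> [iterations]\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);

    for (int i = 0; i < iters; i++) {
        int fd = open(path, O_RDONLY);
        ssize_t n;

        if (fd < 0) {
            perror(path);
            return 1;
        }
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                   /* discard: only the kernel-side cost matters */
        close(fd);
    }
    return 0;
}

Timing this under /usr/bin/time -v and comparing with the awk loop gives the kernel-only baseline the two scripts above are contrasting.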
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Fri, Aug 19, 2016 at 1:05 AM, Michal Hocko wrote: > On Fri 19-08-16 11:26:34, Minchan Kim wrote: >> Hi Michal, >> >> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote: >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote: >> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: >> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote: >> > [...] >> > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown >> > > >> than let the kernel's OOM killer activate and need to gather this >> > > >> information and we'd like to be able to get this information to make >> > > >> the decision much faster than 400ms >> > > > >> > > > Global OOM handling in userspace is really dubious if you ask me. I >> > > > understand you want something better than SIGKILL and in fact this is >> > > > already possible with memory cgroup controller (btw. memcg will give >> > > > you a cheap access to rss, amount of shared, swapped out memory as >> > > > well). Anyway if you are getting close to the OOM your system will most >> > > > probably be really busy and chances are that also reading your new file >> > > > will take much more time. I am also not quite sure how is pss useful >> > > > for >> > > > oom decisions. >> > > >> > > I mentioned it before, but based on experience RSS just isn't good >> > > enough -- there's too much sharing going on in our use case to make >> > > the correct decision based on RSS. If RSS were good enough, simply >> > > put, this patch wouldn't exist. >> > >> > But that doesn't answer my question, I am afraid. So how exactly do you >> > use pss for oom decisions? >> >> My case is not for OOM decision but I agree it would be great if we can get >> *fast* smap summary information. >> >> PSS is really great tool to figure out how processes consume memory >> more exactly rather than RSS. We have been used it for monitoring >> of memory for per-process. Although it is not used for OOM decision, >> it would be great if it is speed up because we don't want to spend >> many CPU time for just monitoring. >> >> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb, >> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and >> hugetlb. Additionally, Locked can be known via vma flags so we don't need it, >> either. Even, we don't need address range for just monitoring when we don't >> investigate in detail. >> >> Although they are not severe overhead, why does it emit the useless >> information? Even bloat day by day. :( With that, userspace tools should >> spend more time to parse which is pointless. > > So far it doesn't really seem that the parsing is the biggest problem. > The major cycles killer is the output formatting and that doesn't sound > like a problem we are not able to address. And I would even argue that > we want to address it in a generic way as much as possible. > >> Having said that, I'm not fan of creating new stat knob for that, either. >> How about appending summary information in the end of smap? >> So, monitoring users can just open the file and lseek to the (end - 1) and >> read the summary only. > > That might confuse existing parsers. Besides that we already have > /proc//statm which gives cumulative numbers already. I am not sure > how often it is used and whether the pte walk is too expensive for > existing users but that should be explored and evaluated before a new > file is created. 
> > The /proc became a dump of everything people found interesting just > because we were to easy to allow those additions. Do not repeat those > mistakes, please! Another thing I noticed was that we lock down smaps on Chromium OS. I think this is to avoid exposing more information than necessary via proc. The totmaps file gives us just the information we need and nothing else. I certainly don't think we need a proc file for this use case -- do you think a new system call is better or something else? > -- > Michal Hocko > SUSE Labs
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Fri 19-08-16 11:26:34, Minchan Kim wrote: > Hi Michal, > > On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote: > > On Thu 18-08-16 10:47:57, Sonny Rao wrote: > > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: > > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote: > > [...] > > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown > > > >> than let the kernel's OOM killer activate and need to gather this > > > >> information and we'd like to be able to get this information to make > > > >> the decision much faster than 400ms > > > > > > > > Global OOM handling in userspace is really dubious if you ask me. I > > > > understand you want something better than SIGKILL and in fact this is > > > > already possible with memory cgroup controller (btw. memcg will give > > > > you a cheap access to rss, amount of shared, swapped out memory as > > > > well). Anyway if you are getting close to the OOM your system will most > > > > probably be really busy and chances are that also reading your new file > > > > will take much more time. I am also not quite sure how is pss useful for > > > > oom decisions. > > > > > > I mentioned it before, but based on experience RSS just isn't good > > > enough -- there's too much sharing going on in our use case to make > > > the correct decision based on RSS. If RSS were good enough, simply > > > put, this patch wouldn't exist. > > > > But that doesn't answer my question, I am afraid. So how exactly do you > > use pss for oom decisions? > > My case is not for OOM decision but I agree it would be great if we can get > *fast* smap summary information. > > PSS is really great tool to figure out how processes consume memory > more exactly rather than RSS. We have been used it for monitoring > of memory for per-process. Although it is not used for OOM decision, > it would be great if it is speed up because we don't want to spend > many CPU time for just monitoring. > > For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb, > Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and > hugetlb. Additionally, Locked can be known via vma flags so we don't need it, > either. Even, we don't need address range for just monitoring when we don't > investigate in detail. > > Although they are not severe overhead, why does it emit the useless > information? Even bloat day by day. :( With that, userspace tools should > spend more time to parse which is pointless. So far it doesn't really seem that the parsing is the biggest problem. The major cycles killer is the output formatting and that doesn't sound like a problem we are not able to address. And I would even argue that we want to address it in a generic way as much as possible. > Having said that, I'm not fan of creating new stat knob for that, either. > How about appending summary information in the end of smap? > So, monitoring users can just open the file and lseek to the (end - 1) and > read the summary only. That might confuse existing parsers. Besides that we already have /proc//statm which gives cumulative numbers already. I am not sure how often it is used and whether the pte walk is too expensive for existing users but that should be explored and evaluated before a new file is created. The /proc became a dump of everything people found interesting just because we were to easy to allow those additions. Do not repeat those mistakes, please! -- Michal Hocko SUSE Labs
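For reference, the cumulative counters /proc/<pid>/statm already exposes are cheap to read, but they are RSS-style page counts with no PSS, no private/dirty split, and no swap; a sketch only, with PID=1234 as a placeholder:

  # statm is a single line of page counts: size resident shared text lib data dt
  PID=1234
  awk -v ps="$(getconf PAGESIZE)" '{printf "vsize:%d kB rss:%d kB file-shared:%d kB\n", $1*ps/1024, $2*ps/1024, $3*ps/1024}' /proc/$PID/statm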
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Thu 18-08-16 23:43:39, Sonny Rao wrote: > On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko wrote: > > On Thu 18-08-16 10:47:57, Sonny Rao wrote: > >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: > >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote: > > [...] > >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown > >> >> than let the kernel's OOM killer activate and need to gather this > >> >> information and we'd like to be able to get this information to make > >> >> the decision much faster than 400ms > >> > > >> > Global OOM handling in userspace is really dubious if you ask me. I > >> > understand you want something better than SIGKILL and in fact this is > >> > already possible with memory cgroup controller (btw. memcg will give > >> > you a cheap access to rss, amount of shared, swapped out memory as > >> > well). Anyway if you are getting close to the OOM your system will most > >> > probably be really busy and chances are that also reading your new file > >> > will take much more time. I am also not quite sure how is pss useful for > >> > oom decisions. > >> > >> I mentioned it before, but based on experience RSS just isn't good > >> enough -- there's too much sharing going on in our use case to make > >> the correct decision based on RSS. If RSS were good enough, simply > >> put, this patch wouldn't exist. > > > > But that doesn't answer my question, I am afraid. So how exactly do you > > use pss for oom decisions? > > We use PSS to calculate the memory used by a process among all the > processes in the system, in the case of Chrome this tells us how much > each renderer process (which is roughly tied to a particular "tab" in > Chrome) is using and how much it has swapped out, so we know what the > worst offenders are -- I'm not sure what's unclear about that? So let me ask more specifically. How can you make any decision based on the pss when you do not know _what_ is the shared resource. In other words if you select a task to terminate based on the pss then you have to kill others who share the same resource otherwise you do not release that shared resource. Not to mention that such a shared resource might be on tmpfs/shmem and it won't get released even after all processes which map it are gone. I am sorry for being dense but it is still not clear to me how the single pss number can be used for oom or, in general, any serious decisions. The counter might be useful of course for debugging purposes or to have a general overview but then arguing about 40 vs 20ms sounds a bit strange to me. > Chrome tends to use a lot of shared memory so we found PSS to be > better than RSS, and I can give you examples of the RSS and PSS on > real systems to illustrate the magnitude of the difference between > those two numbers if that would be useful. > > > > >> So even with memcg I think we'd have the same problem? > > > > memcg will give you instant anon, shared counters for all processes in > > the memcg. > > > > We want to be able to get per-process granularity quickly. I'm not > sure if memcg provides that exactly? I will give you that information if you do process-per-memcg but that doesn't sound ideal. I thought those 20-something processes you were talking about are treated together but it seems I misunderstood. -- Michal Hocko SUSE Labs
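The memcg counters being referred to look roughly like this on a cgroup v1 memory controller; the group name "foo" is purely illustrative, and the file names differ under cgroup v2:

  # per-group counters maintained at charge time, so no smaps-style walk is needed
  G=/sys/fs/cgroup/memory/foo
  grep -E '^(cache|rss|rss_huge|swap) ' "$G/memory.stat"
  cat "$G/memory.usage_in_bytes"

These are per-group rather than per-process, which is exactly the granularity gap being discussed.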
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Thu, Aug 18, 2016 at 7:26 PM, Minchan Kim wrote: > Hi Michal, > > On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote: >> On Thu 18-08-16 10:47:57, Sonny Rao wrote: >> > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: >> > > On Wed 17-08-16 11:57:56, Sonny Rao wrote: >> [...] >> > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown >> > >> than let the kernel's OOM killer activate and need to gather this >> > >> information and we'd like to be able to get this information to make >> > >> the decision much faster than 400ms >> > > >> > > Global OOM handling in userspace is really dubious if you ask me. I >> > > understand you want something better than SIGKILL and in fact this is >> > > already possible with memory cgroup controller (btw. memcg will give >> > > you a cheap access to rss, amount of shared, swapped out memory as >> > > well). Anyway if you are getting close to the OOM your system will most >> > > probably be really busy and chances are that also reading your new file >> > > will take much more time. I am also not quite sure how is pss useful for >> > > oom decisions. >> > >> > I mentioned it before, but based on experience RSS just isn't good >> > enough -- there's too much sharing going on in our use case to make >> > the correct decision based on RSS. If RSS were good enough, simply >> > put, this patch wouldn't exist. >> >> But that doesn't answer my question, I am afraid. So how exactly do you >> use pss for oom decisions? > > My case is not for OOM decision but I agree it would be great if we can get > *fast* smap summary information. > > PSS is really great tool to figure out how processes consume memory > more exactly rather than RSS. We have been used it for monitoring > of memory for per-process. Although it is not used for OOM decision, > it would be great if it is speed up because we don't want to spend > many CPU time for just monitoring. > > For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb, > Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and > hugetlb. Additionally, Locked can be known via vma flags so we don't need it, > either. Even, we don't need address range for just monitoring when we don't > investigate in detail. > > Although they are not severe overhead, why does it emit the useless > information? Even bloat day by day. :( With that, userspace tools should > spend more time to parse which is pointless. > > Having said that, I'm not fan of creating new stat knob for that, either. > How about appending summary information in the end of smap? > So, monitoring users can just open the file and lseek to the (end - 1) and > read the summary only. > That would work fine for us as long as it's fast -- i.e. we don't still have to do all the expensive per-VMA format conversion in the kernel. > Thanks.
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko wrote: > On Thu 18-08-16 10:47:57, Sonny Rao wrote: >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote: > [...] >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown >> >> than let the kernel's OOM killer activate and need to gather this >> >> information and we'd like to be able to get this information to make >> >> the decision much faster than 400ms >> > >> > Global OOM handling in userspace is really dubious if you ask me. I >> > understand you want something better than SIGKILL and in fact this is >> > already possible with memory cgroup controller (btw. memcg will give >> > you a cheap access to rss, amount of shared, swapped out memory as >> > well). Anyway if you are getting close to the OOM your system will most >> > probably be really busy and chances are that also reading your new file >> > will take much more time. I am also not quite sure how is pss useful for >> > oom decisions. >> >> I mentioned it before, but based on experience RSS just isn't good >> enough -- there's too much sharing going on in our use case to make >> the correct decision based on RSS. If RSS were good enough, simply >> put, this patch wouldn't exist. > > But that doesn't answer my question, I am afraid. So how exactly do you > use pss for oom decisions? We use PSS to calculate the memory used by a process among all the processes in the system, in the case of Chrome this tells us how much each renderer process (which is roughly tied to a particular "tab" in Chrome) is using and how much it has swapped out, so we know what the worst offenders are -- I'm not sure what's unclear about that? Chrome tends to use a lot of shared memory so we found PSS to be better than RSS, and I can give you examples of the RSS and PSS on real systems to illustrate the magnitude of the difference between those two numbers if that would be useful. > >> So even with memcg I think we'd have the same problem? > > memcg will give you instant anon, shared counters for all processes in > the memcg. > We want to be able to get per-process granularity quickly. I'm not sure if memcg provides that exactly? >> > Don't take me wrong, /proc//totmaps might be suitable for your >> > specific usecase but so far I haven't heard any sound argument for it to >> > be generally usable. It is true that smaps is unnecessarily costly but >> > at least I can see some room for improvements. A simple patch I've >> > posted cut the formatting overhead by 7%. Maybe we can do more. >> >> It seems like a general problem that if you want these values the >> existing kernel interface can be very expensive, so it would be >> generally usable by any application which wants a per process PSS, >> private data, dirty data or swap value. > > yes this is really unfortunate. And if at all possible we should address > that. Precise values require the expensive rmap walk. We can introduce > some caching to help that. But so far it seems the biggest overhead is > to simply format the output and that should be addressed before any new > proc file is added. > >> I mentioned two use cases, but I guess I don't understand the comment >> about why it's not usable by other use cases. > > I might be wrong here but a use of pss is quite limited and I do not > remember anybody asking for large optimizations in that area. I still do > not understand your use cases properly so I am quite skeptical about a > general usefulness of a new file. 
How do you know that usage of PSS is quite limited? I can only say that we've been using it on Chromium OS for at least four years and have found it very valuable, and I think I've explained the use cases in this thread. If you have more specific questions then I can try to clarify. > > -- > Michal Hocko > SUSE Labs
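A sketch of the "worst offender" ranking described here, assuming the renderer processes can be matched by name; the 'renderer' pattern is purely illustrative:

  # rank matched processes by summed PSS (kB), highest first
  for p in $(pgrep -f renderer); do
      printf '%s\t%s\n' "$(awk '/^Pss:/ {s += $2} END {print s + 0}' /proc/$p/smaps)" "$p"
  done | sort -rn | head

Each iteration pays the full smaps cost, which is why the per-read overhead argued about in this thread multiplies across the ~20 monitored processes.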
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Thu, Aug 18, 2016 at 2:05 PM, Robert Foss wrote: > > > On 2016-08-18 02:01 PM, Michal Hocko wrote: >> >> On Thu 18-08-16 10:47:57, Sonny Rao wrote: >>> >>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: On Wed 17-08-16 11:57:56, Sonny Rao wrote: >> >> [...] > > 2) User space OOM handling -- we'd rather do a more graceful shutdown > than let the kernel's OOM killer activate and need to gather this > information and we'd like to be able to get this information to make > the decision much faster than 400ms Global OOM handling in userspace is really dubious if you ask me. I understand you want something better than SIGKILL and in fact this is already possible with memory cgroup controller (btw. memcg will give you a cheap access to rss, amount of shared, swapped out memory as well). Anyway if you are getting close to the OOM your system will most probably be really busy and chances are that also reading your new file will take much more time. I am also not quite sure how is pss useful for oom decisions. >>> >>> >>> I mentioned it before, but based on experience RSS just isn't good >>> enough -- there's too much sharing going on in our use case to make >>> the correct decision based on RSS. If RSS were good enough, simply >>> put, this patch wouldn't exist. >> >> >> But that doesn't answer my question, I am afraid. So how exactly do you >> use pss for oom decisions? >> >>> So even with memcg I think we'd have the same problem? >> >> >> memcg will give you instant anon, shared counters for all processes in >> the memcg. > > > Is it technically feasible to add instant pss support to memcg? > > @Sonny Rao: Would using cgroups be acceptable for chromiumos? It's possible, though I think we'd end up putting each renderer in it's own cgroup to get the PSS stat, so it seems a bit like overkill. I think memcg also has some overhead that we'd need to quantify but I could be mistaken about this. > > >> Don't take me wrong, /proc//totmaps might be suitable for your specific usecase but so far I haven't heard any sound argument for it to be generally usable. It is true that smaps is unnecessarily costly but at least I can see some room for improvements. A simple patch I've posted cut the formatting overhead by 7%. Maybe we can do more. >>> >>> >>> It seems like a general problem that if you want these values the >>> existing kernel interface can be very expensive, so it would be >>> generally usable by any application which wants a per process PSS, >>> private data, dirty data or swap value. >> >> >> yes this is really unfortunate. And if at all possible we should address >> that. Precise values require the expensive rmap walk. We can introduce >> some caching to help that. But so far it seems the biggest overhead is >> to simply format the output and that should be addressed before any new >> proc file is added. >> >>> I mentioned two use cases, but I guess I don't understand the comment >>> about why it's not usable by other use cases. >> >> >> I might be wrong here but a use of pss is quite limited and I do not >> remember anybody asking for large optimizations in that area. I still do >> not understand your use cases properly so I am quite skeptical about a >> general usefulness of a new file. >> >
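For completeness, the process-per-memcg bookkeeping discussed here would look roughly like the following on a cgroup v1 hierarchy (paths and group names are illustrative, root privileges assumed); note it yields RSS-style charges for the group, not PSS:

  # create a dedicated memcg for one renderer and read its charge
  PID=1234                                  # placeholder pid
  G=/sys/fs/cgroup/memory/renderer-$PID     # assumes the v1 memory controller is mounted here
  mkdir "$G"
  echo "$PID" > "$G/cgroup.procs"           # move the task into the new group
  cat "$G/memory.usage_in_bytes"            # bytes charged to this group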
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: > On Wed 17-08-16 11:57:56, Sonny Rao wrote: >> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko wrote: >> > On Wed 17-08-16 11:31:25, Jann Horn wrote: > [...] >> >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel >> >> time spent on evaluating format strings. The new interface >> >> wouldn't have to spend that much time on format strings because there >> >> isn't so much text to format. >> > >> > well, this is true of course but I would much rather try to reduce the >> > overhead of smaps file than add a new file. The following should help >> > already. I've measured ~7% systime cut down. I guess there is still some >> > room for improvements but I have to say I'm far from being convinced about >> > a new proc file just because we suck at dumping information to the >> > userspace. >> > If this was something like /proc//stat which is >> > essentially read all the time then it would be a different question but >> > is the rss, pss going to be all that often? If yes why? >> >> If the question is why do we need to read RSS, PSS, Private_*, Swap >> and the other fields so often? >> >> I have two use cases so far involving monitoring per-process memory >> usage, and we usually need to read stats for about 25 processes. >> >> Here's a timing example on an fairly recent ARM system 4 core RK3288 >> running at 1.8Ghz >> >> localhost ~ # time cat /proc/25946/smaps > /dev/null >> >> real0m0.036s >> user0m0.020s >> sys 0m0.020s >> >> localhost ~ # time cat /proc/25946/totmaps > /dev/null >> >> real0m0.027s >> user0m0.010s >> sys 0m0.010s >> localhost ~ # >> >> I'll ignore the user time for now, and we see about 20 ms of system >> time with smaps and 10 ms with totmaps, with 20 similar processes it >> would be 400 milliseconds of cpu time for the kernel to get this >> information from smaps vs 200 milliseconds with totmaps. Even totmaps >> is still pretty slow, but much better than smaps. >> >> Use cases: >> 1) Basic task monitoring -- like "top" that shows memory consumption >> including PSS, Private, Swap >> 1 second update means about 40% of one CPU is spent in the kernel >> gathering the data with smaps > > I would argue that even 20% is way too much for such a monitoring. What > is the value to do it so often tha 20 vs 40ms really matters? Yeah it is too much (I believe I said that) but it's significantly better. >> 2) User space OOM handling -- we'd rather do a more graceful shutdown >> than let the kernel's OOM killer activate and need to gather this >> information and we'd like to be able to get this information to make >> the decision much faster than 400ms > > Global OOM handling in userspace is really dubious if you ask me. I > understand you want something better than SIGKILL and in fact this is > already possible with memory cgroup controller (btw. memcg will give > you a cheap access to rss, amount of shared, swapped out memory as > well). Anyway if you are getting close to the OOM your system will most > probably be really busy and chances are that also reading your new file > will take much more time. I am also not quite sure how is pss useful for > oom decisions. I mentioned it before, but based on experience RSS just isn't good enough -- there's too much sharing going on in our use case to make the correct decision based on RSS. If RSS were good enough, simply put, this patch wouldn't exist. So even with memcg I think we'd have the same problem? 
> > Don't take me wrong, /proc//totmaps might be suitable for your > specific usecase but so far I haven't heard any sound argument for it to > be generally usable. It is true that smaps is unnecessarily costly but > at least I can see some room for improvements. A simple patch I've > posted cut the formatting overhead by 7%. Maybe we can do more. It seems like a general problem that if you want these values the existing kernel interface can be very expensive, so it would be generally usable by any application which wants a per process PSS, private data, dirty data or swap value. I mentioned two use cases, but I guess I don't understand the comment about why it's not usable by other use cases. > -- > Michal Hocko > SUSE Labs
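The per-pass cost being argued about can be measured directly on a given system; a sketch (bash), with 'renderer' standing in for whatever matches the ~20 monitored processes:

  # one monitoring pass: read and discard smaps for every matched process
  time for p in $(pgrep -f renderer); do
      cat /proc/$p/smaps > /dev/null
  done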
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
Hi Michal, On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote: > On Thu 18-08-16 10:47:57, Sonny Rao wrote: > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote: > [...] > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown > > >> than let the kernel's OOM killer activate and need to gather this > > >> information and we'd like to be able to get this information to make > > >> the decision much faster than 400ms > > > > > > Global OOM handling in userspace is really dubious if you ask me. I > > > understand you want something better than SIGKILL and in fact this is > > > already possible with memory cgroup controller (btw. memcg will give > > > you a cheap access to rss, amount of shared, swapped out memory as > > > well). Anyway if you are getting close to the OOM your system will most > > > probably be really busy and chances are that also reading your new file > > > will take much more time. I am also not quite sure how is pss useful for > > > oom decisions. > > > > I mentioned it before, but based on experience RSS just isn't good > > enough -- there's too much sharing going on in our use case to make > > the correct decision based on RSS. If RSS were good enough, simply > > put, this patch wouldn't exist. > > But that doesn't answer my question, I am afraid. So how exactly do you > use pss for oom decisions? My case is not for OOM decision but I agree it would be great if we can get *fast* smap summary information. PSS is really great tool to figure out how processes consume memory more exactly rather than RSS. We have been used it for monitoring of memory for per-process. Although it is not used for OOM decision, it would be great if it is speed up because we don't want to spend many CPU time for just monitoring. For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb, Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and hugetlb. Additionally, Locked can be known via vma flags so we don't need it, either. Even, we don't need address range for just monitoring when we don't investigate in detail. Although they are not severe overhead, why does it emit the useless information? Even bloat day by day. :( With that, userspace tools should spend more time to parse which is pointless. Having said that, I'm not fan of creating new stat knob for that, either. How about appending summary information in the end of smap? So, monitoring users can just open the file and lseek to the (end - 1) and read the summary only. Thanks.
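If such a summary block were ever appended to smaps, the read side Minchan describes would reduce to something like the sketch below; the summary itself is hypothetical, and the kernel would presumably still have to generate the preceding per-VMA output:

  # read only the tail of smaps, where an appended summary would live
  PID=1234   # placeholder pid
  tail -n 20 /proc/$PID/smaps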
Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On 2016-08-18 02:01 PM, Michal Hocko wrote: On Thu 18-08-16 10:47:57, Sonny Rao wrote: On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: On Wed 17-08-16 11:57:56, Sonny Rao wrote: [...] 2) User space OOM handling -- we'd rather do a more graceful shutdown than let the kernel's OOM killer activate and need to gather this information and we'd like to be able to get this information to make the decision much faster than 400ms Global OOM handling in userspace is really dubious if you ask me. I understand you want something better than SIGKILL and in fact this is already possible with memory cgroup controller (btw. memcg will give you a cheap access to rss, amount of shared, swapped out memory as well). Anyway if you are getting close to the OOM your system will most probably be really busy and chances are that also reading your new file will take much more time. I am also not quite sure how is pss useful for oom decisions. I mentioned it before, but based on experience RSS just isn't good enough -- there's too much sharing going on in our use case to make the correct decision based on RSS. If RSS were good enough, simply put, this patch wouldn't exist. But that doesn't answer my question, I am afraid. So how exactly do you use pss for oom decisions? So even with memcg I think we'd have the same problem? memcg will give you instant anon, shared counters for all processes in the memcg. Is it technically feasible to add instant pss support to memcg? @Sonny Rao: Would using cgroups be acceptable for chromiumos? Don't take me wrong, /proc//totmaps might be suitable for your specific usecase but so far I haven't heard any sound argument for it to be generally usable. It is true that smaps is unnecessarily costly but at least I can see some room for improvements. A simple patch I've posted cut the formatting overhead by 7%. Maybe we can do more. It seems like a general problem that if you want these values the existing kernel interface can be very expensive, so it would be generally usable by any application which wants a per process PSS, private data, dirty data or swap value. yes this is really unfortunate. And if at all possible we should address that. Precise values require the expensive rmap walk. We can introduce some caching to help that. But so far it seems the biggest overhead is to simply format the output and that should be addressed before any new proc file is added. I mentioned two use cases, but I guess I don't understand the comment about why it's not usable by other use cases. I might be wrong here but a use of pss is quite limited and I do not remember anybody asking for large optimizations in that area. I still do not understand your use cases properly so I am quite skeptical about a general usefulness of a new file.
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Thu 18-08-16 10:47:57, Sonny Rao wrote: > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko wrote: > > On Wed 17-08-16 11:57:56, Sonny Rao wrote: [...] > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown > >> than let the kernel's OOM killer activate and need to gather this > >> information and we'd like to be able to get this information to make > >> the decision much faster than 400ms > > > > Global OOM handling in userspace is really dubious if you ask me. I > > understand you want something better than SIGKILL and in fact this is > > already possible with memory cgroup controller (btw. memcg will give > > you a cheap access to rss, amount of shared, swapped out memory as > > well). Anyway if you are getting close to the OOM your system will most > > probably be really busy and chances are that also reading your new file > > will take much more time. I am also not quite sure how is pss useful for > > oom decisions. > > I mentioned it before, but based on experience RSS just isn't good > enough -- there's too much sharing going on in our use case to make > the correct decision based on RSS. If RSS were good enough, simply > put, this patch wouldn't exist. But that doesn't answer my question, I am afraid. So how exactly do you use pss for oom decisions? > So even with memcg I think we'd have the same problem? memcg will give you instant anon, shared counters for all processes in the memcg. > > Don't take me wrong, /proc//totmaps might be suitable for your > > specific usecase but so far I haven't heard any sound argument for it to > > be generally usable. It is true that smaps is unnecessarily costly but > > at least I can see some room for improvements. A simple patch I've > > posted cut the formatting overhead by 7%. Maybe we can do more. > > It seems like a general problem that if you want these values the > existing kernel interface can be very expensive, so it would be > generally usable by any application which wants a per process PSS, > private data, dirty data or swap value. yes this is really unfortunate. And if at all possible we should address that. Precise values require the expensive rmap walk. We can introduce some caching to help that. But so far it seems the biggest overhead is to simply format the output and that should be addressed before any new proc file is added. > I mentioned two use cases, but I guess I don't understand the comment > about why it's not usable by other use cases. I might be wrong here but a use of pss is quite limited and I do not remember anybody asking for large optimizations in that area. I still do not understand your use cases properly so I am quite skeptical about a general usefulness of a new file. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Wed 17-08-16 11:57:56, Sonny Rao wrote: > On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko wrote: > > On Wed 17-08-16 11:31:25, Jann Horn wrote: [...] > >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel > >> time spent on evaluating format strings. The new interface > >> wouldn't have to spend that much time on format strings because there > >> isn't so much text to format. > > > > well, this is true of course but I would much rather try to reduce the > > overhead of smaps file than add a new file. The following should help > > already. I've measured ~7% systime cut down. I guess there is still some > > room for improvements but I have to say I'm far from being convinced about > > a new proc file just because we suck at dumping information to the > > userspace. > > If this was something like /proc//stat which is > > essentially read all the time then it would be a different question but > > is the rss, pss going to be all that often? If yes why? > > If the question is why do we need to read RSS, PSS, Private_*, Swap > and the other fields so often? > > I have two use cases so far involving monitoring per-process memory > usage, and we usually need to read stats for about 25 processes. > > Here's a timing example on an fairly recent ARM system 4 core RK3288 > running at 1.8Ghz > > localhost ~ # time cat /proc/25946/smaps > /dev/null > > real0m0.036s > user0m0.020s > sys 0m0.020s > > localhost ~ # time cat /proc/25946/totmaps > /dev/null > > real0m0.027s > user0m0.010s > sys 0m0.010s > localhost ~ # > > I'll ignore the user time for now, and we see about 20 ms of system > time with smaps and 10 ms with totmaps, with 20 similar processes it > would be 400 milliseconds of cpu time for the kernel to get this > information from smaps vs 200 milliseconds with totmaps. Even totmaps > is still pretty slow, but much better than smaps. > > Use cases: > 1) Basic task monitoring -- like "top" that shows memory consumption > including PSS, Private, Swap > 1 second update means about 40% of one CPU is spent in the kernel > gathering the data with smaps I would argue that even 20% is way too much for such a monitoring. What is the value to do it so often tha 20 vs 40ms really matters? > 2) User space OOM handling -- we'd rather do a more graceful shutdown > than let the kernel's OOM killer activate and need to gather this > information and we'd like to be able to get this information to make > the decision much faster than 400ms Global OOM handling in userspace is really dubious if you ask me. I understand you want something better than SIGKILL and in fact this is already possible with memory cgroup controller (btw. memcg will give you a cheap access to rss, amount of shared, swapped out memory as well). Anyway if you are getting close to the OOM your system will most probably be really busy and chances are that also reading your new file will take much more time. I am also not quite sure how is pss useful for oom decisions. Don't take me wrong, /proc//totmaps might be suitable for your specific usecase but so far I haven't heard any sound argument for it to be generally usable. It is true that smaps is unnecessarily costly but at least I can see some room for improvements. A simple patch I've posted cut the formatting overhead by 7%. Maybe we can do more. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko wrote: > On Wed 17-08-16 11:31:25, Jann Horn wrote: >> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote: >> > On Tue 16-08-16 12:46:51, Robert Foss wrote: >> > [...] >> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} >> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' >> > > /proc/5025/smaps }" >> > > [...] >> > > Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} >> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' >> > > /proc/5025/smaps >> > > }" >> > > User time (seconds): 0.37 >> > > System time (seconds): 0.45 >> > > Percent of CPU this job got: 92% >> > > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89 >> > >> > This is really unexpected. Where is the user time spent? Anyway, rather >> > than measuring some random processes I've tried to measure something >> > resembling the worst case. So I've created a simple program to mmap as >> > much as possible: >> > >> > #include >> > #include >> > #include >> > #include >> > int main() >> > { >> > while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, >> > MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) >> > ; >> > >> > printf("pid:%d\n", getpid()); >> > pause(); >> > return 0; >> > } >> >> Ah, nice, that's a reasonable test program. :) >> >> >> > So with a reasonable user space the parsing is really not all that time >> > consuming wrt. smaps handling. That being said I am still very skeptical >> > about a dedicated proc file which accomplishes what userspace can done >> > in a trivial way. >> >> Now, since your numbers showed that all the time is spent in the kernel, >> also create this test program to just read that file over and over again: >> >> $ cat justreadloop.c >> #include >> #include >> #include >> #include >> #include >> #include >> #include >> >> char buf[100]; >> >> int main(int argc, char **argv) { >> printf("pid:%d\n", getpid()); >> while (1) { >> int fd = open(argv[1], O_RDONLY); >> if (fd < 0) continue; >> if (read(fd, buf, sizeof(buf)) < 0) >> err(1, "read"); >> close(fd); >> } >> } >> $ gcc -Wall -o justreadloop justreadloop.c >> $ >> >> Now launch your test: >> >> $ ./mapstuff >> pid:29397 >> >> point justreadloop at it: >> >> $ ./justreadloop /proc/29397/smaps >> pid:32567 >> >> ... and then check the performance stats of justreadloop: >> >> # perf top -p 32567 >> >> This is what I see: >> >> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325 >> Overhead Shared Object Symbol >> 30,43% [kernel] [k] format_decode >>9,12% [kernel] [k] number >>7,66% [kernel] [k] vsnprintf >>7,06% [kernel] [k] __lock_acquire >>3,23% [kernel] [k] lock_release >>2,85% [kernel] [k] debug_lockdep_rcu_enabled >>2,25% [kernel] [k] skip_atoi >>2,13% [kernel] [k] lock_acquire >>2,05% [kernel] [k] show_smap > > This is a lot! I would expect the rmap walk to consume more but it even > doesn't show up in the top consumers. > >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel >> time spent on evaluating format strings. The new interface >> wouldn't have to spend that much time on format strings because there >> isn't so much text to format. > > well, this is true of course but I would much rather try to reduce the > overhead of smaps file than add a new file. The following should help > already. I've measured ~7% systime cut down. 
I guess there is still some > room for improvements but I have to say I'm far from being convinced about > a new proc file just because we suck at dumping information to the > userspace. > If this was something like /proc//stat which is > essentially read all the time then it would be a different question but > is the rss, pss going to be all that often? If yes why? If the question is why do we need to read RSS, PSS, Private_*, Swap and the other fields so often? I have two use cases so far involving monitoring per-process memory usage, and we usually need to read stats for about 25 processes. Here's a timing example on an fairly recent ARM system 4 core RK3288 running at 1.8Ghz localhost ~ # time cat /proc/25946/smaps > /dev/null real0m0.036s user0m0.020s sys 0m0.020s localhost ~ # time cat /proc/25946/totmaps > /dev/null real0m0.027s user0m0.010s sys 0m0.010s localhost ~ # I'll ignore the user time for now, and we see about 20 ms of system time with smaps and 10 ms with totmaps, with 20 similar processes it would be 400 milliseconds of cpu time for the kernel to get this information from smaps vs 200 milliseconds with totmaps. Even totmaps is still pretty slow, but much better than smaps. Use cases: 1) Basic task monitoring -- like "top" that shows memory consumption including PSS, Private, Swap 1 second update means about 40% of
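To make the cost multiplication above concrete, here is a sketch of that kind of monitor in C: it sums the Rss: and Pss: lines of /proc/<pid>/smaps for every pid given on the command line, so each sampling pass pays the full smaps cost (page table walk plus text formatting) once per monitored process. The program name is made up; it only illustrates the access pattern.

$ cat pssmon.c
/* Sketch of the monitoring pattern: one full smaps read per pid per sample. */
#include <stdio.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        char path[64], line[256];
        unsigned long rss = 0, pss = 0, kb;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps", argv[i]);
        f = fopen(path, "r");
        if (!f)
            continue;
        /* every VMA contributes one Rss: and one Pss: line */
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "Rss: %lu kB", &kb) == 1)
                rss += kb;
            else if (sscanf(line, "Pss: %lu kB", &kb) == 1)
                pss += kb;
        }
        fclose(f);
        printf("pid %s rss:%lu pss:%lu kB\n", argv[i], rss, pss);
    }
    return 0;
}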
Re: [PACTH v2 0/3] Implement /proc//totmaps
On 2016-08-17 09:03 AM, Michal Hocko wrote: On Wed 17-08-16 11:31:25, Jann Horn wrote: On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote: On Tue 16-08-16 12:46:51, Robert Foss wrote: [...] $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' /proc/5025/smaps }" [...] Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps }" User time (seconds): 0.37 System time (seconds): 0.45 Percent of CPU this job got: 92% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89 This is really unexpected. Where is the user time spent? Anyway, rather than measuring some random processes I've tried to measure something resembling the worst case. So I've created a simple program to mmap as much as possible: #include #include #include #include int main() { while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) ; printf("pid:%d\n", getpid()); pause(); return 0; } Ah, nice, that's a reasonable test program. :) So with a reasonable user space the parsing is really not all that time consuming wrt. smaps handling. That being said I am still very skeptical about a dedicated proc file which accomplishes what userspace can done in a trivial way. Now, since your numbers showed that all the time is spent in the kernel, also create this test program to just read that file over and over again: $ cat justreadloop.c #include #include #include #include #include #include #include char buf[100]; int main(int argc, char **argv) { printf("pid:%d\n", getpid()); while (1) { int fd = open(argv[1], O_RDONLY); if (fd < 0) continue; if (read(fd, buf, sizeof(buf)) < 0) err(1, "read"); close(fd); } } $ gcc -Wall -o justreadloop justreadloop.c $ Now launch your test: $ ./mapstuff pid:29397 point justreadloop at it: $ ./justreadloop /proc/29397/smaps pid:32567 ... and then check the performance stats of justreadloop: # perf top -p 32567 This is what I see: Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325 Overhead Shared Object Symbol 30,43% [kernel] [k] format_decode 9,12% [kernel] [k] number 7,66% [kernel] [k] vsnprintf 7,06% [kernel] [k] __lock_acquire 3,23% [kernel] [k] lock_release 2,85% [kernel] [k] debug_lockdep_rcu_enabled 2,25% [kernel] [k] skip_atoi 2,13% [kernel] [k] lock_acquire 2,05% [kernel] [k] show_smap This is a lot! I would expect the rmap walk to consume more but it even doesn't show up in the top consumers. That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel time spent on evaluating format strings. The new interface wouldn't have to spend that much time on format strings because there isn't so much text to format. well, this is true of course but I would much rather try to reduce the overhead of smaps file than add a new file. The following should help already. I've measured ~7% systime cut down. I guess there is still some room for improvements but I have to say I'm far from being convinced about a new proc file just because we suck at dumping information to the userspace. If this was something like /proc//stat which is essentially read all the time then it would be a different question but is the rss, pss going to be all that often? If yes why? These are the questions which should be answered before we even start considering the implementation. 
@Sonny Rao: Maybe you can comment on how often, for how many processes this information is needed and for which reasons this information is useful. --- From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 17 Aug 2016 14:00:13 +0200 Subject: [PATCH] proc, smaps: reduce printing overhead seq_printf (used by show_smap) can be pretty expensive when dumping a lot of numbers. Say we would like to get Rss and Pss from a particular process. In order to measure a pathological case let's generate as many mappings as possible: $ cat max_mmap.c int main() { while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) ; printf("pid:%d\n", getpid()); pause(); return 0; } $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps would do a trick. The whole runtime is in the kernel space which is not that that unexpected because smaps is not the cheapest one (we have to do rmap walk etc.). Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps" User time (seconds): 0.01 System time (seconds): 0.44
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Wed 17-08-16 11:31:25, Jann Horn wrote: > On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote: > > On Tue 16-08-16 12:46:51, Robert Foss wrote: > > [...] > > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} > > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' > > > /proc/5025/smaps }" > > > [...] > > > Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} > > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' > > > /proc/5025/smaps > > > }" > > > User time (seconds): 0.37 > > > System time (seconds): 0.45 > > > Percent of CPU this job got: 92% > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89 > > > > This is really unexpected. Where is the user time spent? Anyway, rather > > than measuring some random processes I've tried to measure something > > resembling the worst case. So I've created a simple program to mmap as > > much as possible: > > > > #include > > #include > > #include > > #include > > int main() > > { > > while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, > > MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) > > ; > > > > printf("pid:%d\n", getpid()); > > pause(); > > return 0; > > } > > Ah, nice, that's a reasonable test program. :) > > > > So with a reasonable user space the parsing is really not all that time > > consuming wrt. smaps handling. That being said I am still very skeptical > > about a dedicated proc file which accomplishes what userspace can done > > in a trivial way. > > Now, since your numbers showed that all the time is spent in the kernel, > also create this test program to just read that file over and over again: > > $ cat justreadloop.c > #include > #include > #include > #include > #include > #include > #include > > char buf[100]; > > int main(int argc, char **argv) { > printf("pid:%d\n", getpid()); > while (1) { > int fd = open(argv[1], O_RDONLY); > if (fd < 0) continue; > if (read(fd, buf, sizeof(buf)) < 0) > err(1, "read"); > close(fd); > } > } > $ gcc -Wall -o justreadloop justreadloop.c > $ > > Now launch your test: > > $ ./mapstuff > pid:29397 > > point justreadloop at it: > > $ ./justreadloop /proc/29397/smaps > pid:32567 > > ... and then check the performance stats of justreadloop: > > # perf top -p 32567 > > This is what I see: > > Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325 > Overhead Shared Object Symbol > 30,43% [kernel] [k] format_decode >9,12% [kernel] [k] number >7,66% [kernel] [k] vsnprintf >7,06% [kernel] [k] __lock_acquire >3,23% [kernel] [k] lock_release >2,85% [kernel] [k] debug_lockdep_rcu_enabled >2,25% [kernel] [k] skip_atoi >2,13% [kernel] [k] lock_acquire >2,05% [kernel] [k] show_smap This is a lot! I would expect the rmap walk to consume more but it even doesn't show up in the top consumers. > That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel > time spent on evaluating format strings. The new interface > wouldn't have to spend that much time on format strings because there > isn't so much text to format. well, this is true of course but I would much rather try to reduce the overhead of smaps file than add a new file. The following should help already. I've measured ~7% systime cut down. I guess there is still some room for improvements but I have to say I'm far from being convinced about a new proc file just because we suck at dumping information to the userspace. 
If this was something like /proc//stat which is essentially read all the time then it would be a different question but is the rss, pss going to be all that often? If yes why? These are the questions which should be answered before we even start considering the implementation. --- >From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 17 Aug 2016 14:00:13 +0200 Subject: [PATCH] proc, smaps: reduce printing overhead seq_printf (used by show_smap) can be pretty expensive when dumping a lot of numbers. Say we would like to get Rss and Pss from a particular process. In order to measure a pathological case let's generate as many mappings as possible: $ cat max_mmap.c int main() { while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) ; printf("pid:%d\n", getpid()); pause(); return 0; } $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps would do a trick. The whole runtime is in the kernel space which is not that that unexpected because smaps is not the cheapest one (we have to do rmap walk etc.). Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps" User time (seconds):
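The patch itself is cut off above, but the general point from the perf profile (format_decode, number and vsnprintf at the top) can be reproduced in userspace: for a short line, parsing the format string costs noticeably more than emitting the digits. A rough, self-contained comparison, unrelated to the kernel patch and assuming nothing beyond libc:

$ cat fmtcost.c
/* Rough microbenchmark: snprintf with a format string vs. an open-coded
 * decimal conversion for a short "Rss:<n> kB" style line. Numbers will
 * vary by libc and compiler; this only illustrates where the time goes. */
#include <stdio.h>
#include <string.h>
#include <time.h>

static int put_dec(char *buf, unsigned long v)
{
    char tmp[24];
    int n = 0, len;

    do {
        tmp[n++] = '0' + v % 10;
        v /= 10;
    } while (v);
    len = n;
    while (n--)            /* digits were generated least significant first */
        *buf++ = tmp[n];
    return len;
}

int main(void)
{
    enum { ITERS = 10000000 };
    char buf[64];
    volatile unsigned long sink = 0;
    struct timespec a, b;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (i = 0; i < ITERS; i++)
        sink += snprintf(buf, sizeof(buf), "Rss:%8lu kB\n", (unsigned long)i);
    clock_gettime(CLOCK_MONOTONIC, &b);
    printf("snprintf:   %.2fs\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (i = 0; i < ITERS; i++) {
        int len;
        memcpy(buf, "Rss:", 4);
        len = put_dec(buf + 4, (unsigned long)i);
        memcpy(buf + 4 + len, " kB\n", 4);
        sink += 4 + len + 4;
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    printf("open-coded: %.2fs\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
    return (int)(sink & 1);
}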
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote: > On Tue 16-08-16 12:46:51, Robert Foss wrote: > [...] > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' > > /proc/5025/smaps }" > > [...] > > Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps > > }" > > User time (seconds): 0.37 > > System time (seconds): 0.45 > > Percent of CPU this job got: 92% > > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89 > > This is really unexpected. Where is the user time spent? Anyway, rather > than measuring some random processes I've tried to measure something > resembling the worst case. So I've created a simple program to mmap as > much as possible: > > #include > #include > #include > #include > int main() > { > while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, > MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) > ; > > printf("pid:%d\n", getpid()); > pause(); > return 0; > } Ah, nice, that's a reasonable test program. :) > So with a reasonable user space the parsing is really not all that time > consuming wrt. smaps handling. That being said I am still very skeptical > about a dedicated proc file which accomplishes what userspace can done > in a trivial way. Now, since your numbers showed that all the time is spent in the kernel, also create this test program to just read that file over and over again: $ cat justreadloop.c #include #include #include #include #include #include #include char buf[100]; int main(int argc, char **argv) { printf("pid:%d\n", getpid()); while (1) { int fd = open(argv[1], O_RDONLY); if (fd < 0) continue; if (read(fd, buf, sizeof(buf)) < 0) err(1, "read"); close(fd); } } $ gcc -Wall -o justreadloop justreadloop.c $ Now launch your test: $ ./mapstuff pid:29397 point justreadloop at it: $ ./justreadloop /proc/29397/smaps pid:32567 ... and then check the performance stats of justreadloop: # perf top -p 32567 This is what I see: Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325 Overhead Shared Object Symbol 30,43% [kernel] [k] format_decode 9,12% [kernel] [k] number 7,66% [kernel] [k] vsnprintf 7,06% [kernel] [k] __lock_acquire 3,23% [kernel] [k] lock_release 2,85% [kernel] [k] debug_lockdep_rcu_enabled 2,25% [kernel] [k] skip_atoi 2,13% [kernel] [k] lock_acquire 2,05% [kernel] [k] show_smap That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel time spent on evaluating format strings. The new interface wouldn't have to spend that much time on format strings because there isn't so much text to format. (My kernel is built with a bunch of debug options - the results might look very different on distro kernels or so, so please try this yourself.) I guess it could be argued that this is not just a problem with smaps, but also a problem with format strings (or text-based interfaces in general) just being slow in general. (Here is a totally random and crazy thought: Can we put something into the kernel build process that replaces printf calls that use simple format strings with equivalent non-printf calls? Move the cost of evaluating the format string to compile time?) signature.asc Description: Digital signature
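The header names in justreadloop.c were eaten by the list archive (only bare "#include" tokens survive above). A self-contained reconstruction, assuming only the headers that the calls in the body actually require; the original listed a few more whose names cannot be recovered from the archive:

$ cat justreadloop.c
/* Reconstructed from the message above; header names re-added here. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <err.h>

char buf[100];

int main(int argc, char **argv)
{
    printf("pid:%d\n", getpid());
    while (1) {
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            continue;
        if (read(fd, buf, sizeof(buf)) < 0)
            err(1, "read");
        close(fd);
    }
}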
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Tue 16-08-16 12:46:51, Robert Foss wrote: [...] > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' > /proc/5025/smaps }" > [...] > Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps > }" > User time (seconds): 0.37 > System time (seconds): 0.45 > Percent of CPU this job got: 92% > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89 This is really unexpected. Where is the user time spent? Anyway, rather than measuring some random processes I've tried to measure something resembling the worst case. So I've created a simple program to mmap as much as possible: #include #include #include #include int main() { while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED) ; printf("pid:%d\n", getpid()); pause(); return 0; } so depending on /proc/sys/vm/max_map_count you will get the maximum possible mmaps. I am using a default so 65k mappings. Then I have retried your 25x file parsing: $ cat s.sh #!/bin/sh pid=$1 for i in $(seq 25) do awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps done But I am getting different results from you: $ awk '/^[0-9a-f]/{print}' /proc/14808/smaps | wc -l 65532 [...] Command being timed: "sh s.sh 14808" User time (seconds): 0.00 System time (seconds): 20.10 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.20 The results are stable when I try multiple times, in fact there shouldn't be any reason for them not to be. Then I went on to increase max_map_count to 250k and that behaves consistently: $ awk '/^[0-9a-f]/{print}' /proc/16093/smaps | wc -l 250002 [...] Command being timed: "sh s.sh 16093" User time (seconds): 0.00 System time (seconds): 77.93 Percent of CPU this job got: 98% Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.09 So with a reasonable user space the parsing is really not all that time consuming wrt. smaps handling. That being said I am still very skeptical about a dedicated proc file which accomplishes what userspace can done in a trivial way. -- Michal Hocko SUSE Labs
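The same archive damage hit the header names of the mmap test program quoted above. A self-contained reconstruction, assuming the headers needed for mmap(), printf(), getpid() and pause():

$ cat max_mmap.c
/* Reconstructed from the message above; header names re-added here.
 * Maps 4k anonymous shared chunks until mmap() fails, i.e. until the
 * /proc/sys/vm/max_map_count limit is reached. */
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
    while (mmap(NULL, 4096, PROT_READ|PROT_WRITE,
                MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
        ;

    printf("pid:%d\n", getpid());
    pause();
    return 0;
}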
Re: [PACTH v2 0/3] Implement /proc//totmaps
On 2016-08-16 03:12 AM, Michal Hocko wrote: On Mon 15-08-16 12:25:10, Robert Foss wrote: On 2016-08-15 09:42 AM, Michal Hocko wrote: [...] The use case is to speed up monitoring of memory consumption in environments where RSS isn't precise. For example Chrome tends to many processes which have hundreds of VMAs with a substantial amount of shared memory, and the error of using RSS rather than PSS tends to be very large when looking at overall memory consumption. PSS isn't kept as a single number that's exported like RSS, so to calculate PSS means having to parse a very large smaps file. This process is slow and has to be repeated for many processes, and we found that the just act of doing the parsing was taking up a significant amount of CPU time, so this patch is an attempt to make that process cheaper. Well, this is slow because it requires the pte walk otherwise you cannot know how many ptes map the particular shared page. Your patch (totmaps_proc_show) does the very same page table walk because in fact it is unavoidable. So what exactly is the difference except for the userspace parsing which is quite trivial e.g. my currently running Firefox has $ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l 984 quite some VMAs, yet parsing it spends basically all the time in the kernel... $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/4950/smaps rss:1112288 pss:1096435 Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/4950/smaps" User time (seconds): 0.00 System time (seconds): 0.02 Percent of CPU this job got: 91% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02 So I am not really sure I see the performance benefit. I did some performance measurements of my own, and it would seem like there is about a 2x performance gain to be had. To me that is substantial, and a larger gain than commonly seen. There naturally also the benefit that this is a lot easier to interact with programmatically. $ ps aux | grep firefox robertfoss 5025 24.3 13.7 3562820 2219616 ? Rl Aug15 277:44 /usr/lib/firefox/firefox https://allg.one/xpb $ awk '/^[0-9a-f]/{print}' /proc/5025/smaps | wc -l 1503 $ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})" [...] Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})" User time (seconds): 0.00 System time (seconds): 0.40 Percent of CPU this job got: 90% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45 $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' /proc/5025/smaps }" [...] Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps }" User time (seconds): 0.37 System time (seconds): 0.45 Percent of CPU this job got: 92% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
Re: [PACTH v2 0/3] Implement /proc//totmaps
On Mon 15-08-16 12:25:10, Robert Foss wrote: > > > On 2016-08-15 09:42 AM, Michal Hocko wrote: [...] > > The use case is to speed up monitoring of > > memory consumption in environments where RSS isn't precise. > > > > For example Chrome tends to many processes which have hundreds of VMAs > > with a substantial amount of shared memory, and the error of using > > RSS rather than PSS tends to be very large when looking at overall > > memory consumption. PSS isn't kept as a single number that's exported > > like RSS, so to calculate PSS means having to parse a very large smaps > > file. > > > > This process is slow and has to be repeated for many processes, and we > > found that the just act of doing the parsing was taking up a > > significant amount of CPU time, so this patch is an attempt to make > > that process cheaper. Well, this is slow because it requires the pte walk otherwise you cannot know how many ptes map the particular shared page. Your patch (totmaps_proc_show) does the very same page table walk because in fact it is unavoidable. So what exactly is the difference except for the userspace parsing which is quite trivial e.g. my currently running Firefox has $ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l 984 quite some VMAs, yet parsing it spends basically all the time in the kernel... $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/4950/smaps rss:1112288 pss:1096435 Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/4950/smaps" User time (seconds): 0.00 System time (seconds): 0.02 Percent of CPU this job got: 91% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02 So I am not really sure I see the performance benefit. -- Michal Hocko SUSE Labs
Re: [PACTH v2 0/3] Implement /proc/PID/totmaps
On 2016-08-15 09:42 AM, Michal Hocko wrote:
> On Mon 15-08-16 09:00:04, Robert Foss wrote:
> > On 2016-08-14 05:04 AM, Michal Hocko wrote:
> > > On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> > > > From: Robert Foss
> > > >
> > > > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > > > information about the mappings of a process.
> > >
> > > The changelog is absolutely missing the use case. Why do we need this?
> > > Why are existing interfaces not sufficient?
> >
> > You are absolutely right, more information is in 1/3.
>
> Patch 1 is silent about the use case as well. It is usually recommended to
> describe the motivation for the change in the cover letter.

I'll change it for v3.

> > But the gist of it is that it provides a faster and more convenient way of
> > accessing the information in /proc/PID/smaps.
>
> I am sorry to insist but this is far from a description I was hoping for.
> Why do we need a more convenient API? Please note that this is a userspace
> API which we will have to maintain forever. We have made many mistakes in
> the past where exporting some information made sense at the time while it
> turned out being a mistake only later on. So let's make sure we will not
> fall into the same trap again.
>
> So please make sure you describe the use case, why the current API is
> insufficient and why it cannot be tweaked to provide the information you
> are looking for.

I'll add a more elaborate description to the v3 cover letter.

In v1, there was a discussion which I think presented the practical
applications rather well: https://lkml.org/lkml/2016/8/9/628 or the quote
from Sonny Rao pasted below:

> The use case is to speed up monitoring of
> memory consumption in environments where RSS isn't precise.
>
> For example Chrome tends to have many processes which have hundreds of VMAs
> with a substantial amount of shared memory, and the error of using
> RSS rather than PSS tends to be very large when looking at overall
> memory consumption. PSS isn't kept as a single number that's exported
> like RSS, so to calculate PSS means having to parse a very large smaps
> file.
>
> This process is slow and has to be repeated for many processes, and we
> found that just the act of doing the parsing was taking up a
> significant amount of CPU time, so this patch is an attempt to make
> that process cheaper.

If a reformatted version of this still isn't adequate or desirable for the
cover letter, please give me another heads up.

Thanks!
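To make the RSS-versus-PSS point from the quoted use case concrete: both numbers are already available per process today, RSS as the VmRSS field of /proc/PID/status and PSS only by summing the Pss lines of /proc/PID/smaps. The two commands below are a minimal sketch of that comparison; the PID is just a placeholder, and the size of the gap depends entirely on how much of the process's memory is shared with other processes.

$ pid=5025   # placeholder: any process with substantial shared mappings
$ grep VmRSS /proc/$pid/status
$ awk '/^Pss:/ {pss += $2} END {printf "Pss total: %d kB\n", pss}' /proc/$pid/smaps

For a process with a lot of shared memory the summed Pss is much smaller than VmRSS, which is why adding up RSS over such processes overstates the real total, and why a cheaper PSS summary is attractive for monitoring.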
Re: [PACTH v2 0/3] Implement /proc/PID/totmaps
On Mon 15-08-16 09:00:04, Robert Foss wrote:
> On 2016-08-14 05:04 AM, Michal Hocko wrote:
> > On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> > > From: Robert Foss
> > >
> > > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > > information about the mappings of a process.
> >
> > The changelog is absolutely missing the use case. Why do we need this?
> > Why are existing interfaces not sufficient?
>
> You are absolutely right, more information is in 1/3.

Patch 1 is silent about the use case as well. It is usually recommended to
describe the motivation for the change in the cover letter.

> But the gist of it is that it provides a faster and more convenient way of
> accessing the information in /proc/PID/smaps.

I am sorry to insist but this is far from a description I was hoping for.
Why do we need a more convenient API? Please note that this is a userspace
API which we will have to maintain forever. We have made many mistakes in
the past where exporting some information made sense at the time while it
turned out being a mistake only later on. So let's make sure we will not
fall into the same trap again.

So please make sure you describe the use case, why the current API is
insufficient and why it cannot be tweaked to provide the information you
are looking for.
-- 
Michal Hocko
SUSE Labs
Re: [PACTH v2 0/3] Implement /proc/PID/totmaps
On 2016-08-14 05:04 AM, Michal Hocko wrote:
> On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> > From: Robert Foss
> >
> > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > information about the mappings of a process.
>
> The changelog is absolutely missing the use case. Why do we need this?
> Why are existing interfaces not sufficient?

You are absolutely right, more information is in 1/3.

But the gist of it is that it provides a faster and more convenient way of
accessing the information in /proc/PID/smaps.

Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
  which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
  totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
  already is available
- Added support to proc_map_release() for priv==NULL, to allow the
  function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in a separate patch
- Removed totmaps_release() since it was just a wrapper for
  proc_map_release()

Robert Foss (3):
  mm, proc: Implement /proc/PID/totmaps
  Documentation/filesystems: Fixed typo
  Documentation/filesystems: Added /proc/PID/totmaps documentation

 Documentation/filesystems/proc.txt |  23 ++-
 fs/proc/base.c                     |   1 +
 fs/proc/internal.h                 |   3 +
 fs/proc/task_mmu.c                 | 134 +
 4 files changed, 160 insertions(+), 1 deletion(-)

-- 
2.7.4
Re: [PACTH v2 0/3] Implement /proc/PID/totmaps
On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> From: Robert Foss
>
> This series implements /proc/PID/totmaps, a tool for retrieving summarized
> information about the mappings of a process.

The changelog is absolutely missing the use case. Why do we need this?
Why are existing interfaces not sufficient?

> Changes since v1:
> - Removed IS_ERR check from get_task_mm() function
> - Changed comment format
> - Moved proc_totmaps_operations declaration inside internal.h
> - Switched to using do_maps_open() in totmaps_open() function,
>   which provides privilege checking
> - Error handling reworked for totmaps_open() function
> - Switched to stack allocated struct mem_size_stats mss_sum in
>   totmaps_proc_show() function
> - Removed get_task_mm() in totmaps_proc_show() since priv->mm
>   already is available
> - Added support to proc_map_release() for priv==NULL, to allow the
>   function to be used for all failure cases
> - Added proc_totmaps_op and helper functions for it
> - Added documentation in a separate patch
> - Removed totmaps_release() since it was just a wrapper for
>   proc_map_release()
>
> Robert Foss (3):
>   mm, proc: Implement /proc/PID/totmaps
>   Documentation/filesystems: Fixed typo
>   Documentation/filesystems: Added /proc/PID/totmaps documentation
>
>  Documentation/filesystems/proc.txt |  23 ++-
>  fs/proc/base.c                     |   1 +
>  fs/proc/internal.h                 |   3 +
>  fs/proc/task_mmu.c                 | 134 +
>  4 files changed, 160 insertions(+), 1 deletion(-)
>
> --
> 2.7.4
-- 
Michal Hocko
SUSE Labs
[PACTH v2 0/3] Implement /proc/PID/totmaps
From: Robert Foss

This series implements /proc/PID/totmaps, a tool for retrieving summarized
information about the mappings of a process.

Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
  which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
  totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
  already is available
- Added support to proc_map_release() for priv==NULL, to allow the
  function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in a separate patch
- Removed totmaps_release() since it was just a wrapper for
  proc_map_release()

Robert Foss (3):
  mm, proc: Implement /proc/PID/totmaps
  Documentation/filesystems: Fixed typo
  Documentation/filesystems: Added /proc/PID/totmaps documentation

 Documentation/filesystems/proc.txt |  23 ++-
 fs/proc/base.c                     |   1 +
 fs/proc/internal.h                 |   3 +
 fs/proc/task_mmu.c                 | 134 +
 4 files changed, 160 insertions(+), 1 deletion(-)

-- 
2.7.4