Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
Hi Andrew,

On Tue, Mar 07, 2017 at 02:43:38PM -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2017 11:05:45 +0100 Michal Hocko wrote:
> > On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> > > On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko wrote:
> > > > > It's not that I think you're wrong: it *is* an implementation detail. But we take a bit of incoherency from batching all over the place, so it's a little odd to take a stand over this particular instance of it - whether demanding that it'd be fixed, or be documented, which would only suggest to users that this is special when it really isn't etc.
> > > >
> > > > I am not aware of any other counter printed in smaps that would suffer from the same problem, but I haven't checked too deeply so I might be wrong.
> > > >
> > > > Anyway it seems that I am alone in my position so I will not insist. If we have any bug report then we can still fix it.
> > >
> > > A single lru_add_drain_all() right at the top level (in smaps_show()?) won't kill us
> >
> > I do not think we want to put the lru_add_drain_all cost on a random process reading /proc/<pid>/smaps.
>
> Why not? It's that process which is calling for the work to be done.
>
> > If anything the one which does the madvise should be doing this.
>
> But it would be silly to do extra work in madvise() if nobody will be reading smaps for the next two months.
>
> How much work is it anyway? What would be the relative impact upon a smaps read?

I would agree only if the draining guaranteed that all of the mapped pages in the range get marked lazyfree. That is not the case, however, because madvise_free_pte_range has several conditions under which it skips marking a page. So my conclusion is that draining helps a bit but gives no guarantee. In that case, IMHO, let's not spend the effort on it.

Thanks.
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Tue, 7 Mar 2017 11:05:45 +0100 Michal Hocko wrote:
> On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> > On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko wrote:
> > > > It's not that I think you're wrong: it *is* an implementation detail. But we take a bit of incoherency from batching all over the place, so it's a little odd to take a stand over this particular instance of it - whether demanding that it'd be fixed, or be documented, which would only suggest to users that this is special when it really isn't etc.
> > >
> > > I am not aware of any other counter printed in smaps that would suffer from the same problem, but I haven't checked too deeply so I might be wrong.
> > >
> > > Anyway it seems that I am alone in my position so I will not insist. If we have any bug report then we can still fix it.
> >
> > A single lru_add_drain_all() right at the top level (in smaps_show()?) won't kill us
>
> I do not think we want to put the lru_add_drain_all cost on a random process reading /proc/<pid>/smaps.

Why not? It's that process which is calling for the work to be done.

> If anything the one which does the madvise should be doing this.

But it would be silly to do extra work in madvise() if nobody will be reading smaps for the next two months.

How much work is it anyway? What would be the relative impact upon a smaps read?
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko wrote:
> > > It's not that I think you're wrong: it *is* an implementation detail. But we take a bit of incoherency from batching all over the place, so it's a little odd to take a stand over this particular instance of it - whether demanding that it'd be fixed, or be documented, which would only suggest to users that this is special when it really isn't etc.
> >
> > I am not aware of any other counter printed in smaps that would suffer from the same problem, but I haven't checked too deeply so I might be wrong.
> >
> > Anyway it seems that I am alone in my position so I will not insist. If we have any bug report then we can still fix it.
>
> A single lru_add_drain_all() right at the top level (in smaps_show()?) won't kill us

I do not think we want to put the lru_add_drain_all cost on a random process reading /proc/<pid>/smaps. If anything the one which does the madvise should be doing this.

> and should significantly improve this issue. And it might accidentally make some of the other smaps statistics more accurate as well.
>
> If not, can we please have a nice comment somewhere appropriate which explains why LazyFree is inaccurate and why we chose to leave it that way?

Yeah, I would appreciate the comment more. What about the following?
---
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 45853e116eef..0b58b317dc76 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -444,7 +444,9 @@ a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
 "LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
 The memory isn't freed immediately with madvise(). It's freed in memory
-pressure if the memory is clean.
+pressure if the memory is clean. Please note that the printed value might
+be lower than the real value due to optimizations used in the current
+implementation. If this is not desirable please file a bug report.
 "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
 "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
 huge pages.
--
Michal Hocko
SUSE Labs
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko wrote:
> > It's not that I think you're wrong: it *is* an implementation detail. But we take a bit of incoherency from batching all over the place, so it's a little odd to take a stand over this particular instance of it - whether demanding that it'd be fixed, or be documented, which would only suggest to users that this is special when it really isn't etc.
>
> I am not aware of any other counter printed in smaps that would suffer from the same problem, but I haven't checked too deeply so I might be wrong.
>
> Anyway it seems that I am alone in my position so I will not insist. If we have any bug report then we can still fix it.

A single lru_add_drain_all() right at the top level (in smaps_show()?) won't kill us and should significantly improve this issue. And it might accidentally make some of the other smaps statistics more accurate as well.

If not, can we please have a nice comment somewhere appropriate which explains why LazyFree is inaccurate and why we chose to leave it that way?
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Thu 02-03-17 09:01:01, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> > On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > [...]
> > > The error when reading a specific smaps should be completely ok.
> > >
> > > In numbers: even if your process is madvising from 16 different CPUs, the error in its smaps file will peak at 896K in the worst case. That level of concurrency tends to come with much bigger memory quantities for that amount of error to matter.
> >
> > It is still an unexpected behavior IMHO and an implementation detail which leaks to the userspace.
>
> We have per-cpu fuzz in every single vmstat counter. Look at calculate_normal_threshold() in vmstat.c and the sample thresholds for when per-cpu deltas are flushed. In the vast majority of machines, the per-cpu error in these counters is much higher than what we get with pagevecs holding back a few pages.

Yes, but vmstat counters have a different usecase AFAIK. You mostly look at those when debugging or watching the system. /proc/<pid>/smaps is quite often used to do per-task metrics which are then used for some decision making, so it should be less fuzzy if that is possible.

> It's not that I think you're wrong: it *is* an implementation detail. But we take a bit of incoherency from batching all over the place, so it's a little odd to take a stand over this particular instance of it - whether demanding that it'd be fixed, or be documented, which would only suggest to users that this is special when it really isn't etc.

I am not aware of any other counter printed in smaps that would suffer from the same problem, but I haven't checked too deeply so I might be wrong.

Anyway it seems that I am alone in my position so I will not insist. If we have any bug report then we can still fix it.
--
Michal Hocko
SUSE Labs
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > madvise_free_page_range(&tlb, vma, start, end);
> > > mmu_notifier_invalidate_range_end(mm, start, end);
> > > tlb_finish_mmu(&tlb, start, end);
> > > -
> > > + lru_add_drain_all();
> >
> > A full drain on all CPUs is very expensive and IMO not justified for some per-cpu fuzz factor in the stats. I'd take hampering the stats over hampering the syscall any day; only a subset of MADV_FREE users will look at the stats.
> >
> > And while the aggregate error can be large on machines with many CPUs (notably the machines on which you absolutely don't want to send IPIs to all cores each time a thread madvises some pages!),
>
> I am not sure I understand. Where would we trigger IPIs? lru_add_drain_all relies on workqueues.

Brainfart on my end, s,IPIs,sync work items,. That doesn't change my point, though. These things are expensive, and we had scalability issues with them in the past. See for example 4dd72b4a47a5 ("mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEED").

> > the pages of a single process are not likely to be spread out across more than a few CPUs.
>
> Then we can simply flush only lru_lazyfree_pvecs which should reduce the unrelated noise from other pagevecs.

The problem isn't flushing other pagevecs once we're already scheduled on a CPU, the problem is scheduling work on all cpus and then waiting for completion.

> > The error when reading a specific smaps should be completely ok.
> >
> > In numbers: even if your process is madvising from 16 different CPUs, the error in its smaps file will peak at 896K in the worst case. That level of concurrency tends to come with much bigger memory quantities for that amount of error to matter.
>
> It is still an unexpected behavior IMHO and an implementation detail which leaks to the userspace.

We have per-cpu fuzz in every single vmstat counter. Look at calculate_normal_threshold() in vmstat.c and the sample thresholds for when per-cpu deltas are flushed. In the vast majority of machines, the per-cpu error in these counters is much higher than what we get with pagevecs holding back a few pages.

It's not that I think you're wrong: it *is* an implementation detail. But we take a bit of incoherency from batching all over the place, so it's a little odd to take a stand over this particular instance of it - whether demanding that it'd be fixed, or be documented, which would only suggest to users that this is special when it really isn't etc.
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > > show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.
> > >
> > > I have just got to test this patchset and noticed something that was a bit surprising
> > >
> > > madvise(mmap(len), len, MADV_FREE)
> > > Size:           102400 kB
> > > Rss:            102400 kB
> > > Pss:            102400 kB
> > > Shared_Clean:        0 kB
> > > Shared_Dirty:        0 kB
> > > Private_Clean:  102400 kB
> > > Private_Dirty:       0 kB
> > > Referenced:          0 kB
> > > Anonymous:      102400 kB
> > > LazyFree:       102368 kB
> > >
> > > It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.
> > > ---
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index dc5927c812d3..d2c318db16c9 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > madvise_free_page_range(&tlb, vma, start, end);
> > > mmu_notifier_invalidate_range_end(mm, start, end);
> > > tlb_finish_mmu(&tlb, start, end);
> > > -
> > > + lru_add_drain_all();
> >
> > A full drain on all CPUs is very expensive and IMO not justified for some per-cpu fuzz factor in the stats. I'd take hampering the stats over hampering the syscall any day; only a subset of MADV_FREE users will look at the stats.
> >
> > And while the aggregate error can be large on machines with many CPUs (notably the machines on which you absolutely don't want to send IPIs to all cores each time a thread madvises some pages!),
>
> I am not sure I understand. Where would we trigger IPIs? lru_add_drain_all relies on workqueues.
>
> > the pages of a single process are not likely to be spread out across more than a few CPUs.
>
> Then we can simply flush only lru_lazyfree_pvecs which should reduce the unrelated noise from other pagevecs.
>
> > The error when reading a specific smaps should be completely ok.
> >
> > In numbers: even if your process is madvising from 16 different CPUs, the error in its smaps file will peak at 896K in the worst case. That level of concurrency tends to come with much bigger memory quantities for that amount of error to matter.
>
> It is still an unexpected behavior IMHO and an implementation detail which leaks to the userspace.
>
> > IMO this is a non-issue.
>
> I will not insist if there is a general consensus on this and it is a documented behavior, though.

We cannot guarantee accuracy even with draining, because madvise_free can easily miss some pages under several conditions. First of all, userspace can never know how many of the pages are mapped there at the moment. As well, a page in the range can be swapped out or under migration, trylock_page can fail, and so on.
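[For context on the conditions mentioned above: in this patchset, madvise_free_pte_range() leaves a page alone when, for instance, the PTE is not present (the page is swapped out), the page is mapped by more than one process (page_mapcount() != 1), or a dirty or swap-cache page cannot be locked via trylock_page(). Pages skipped this way never become lazyfree and so never show up in LazyFree, drain or no drain.]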
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed, Mar 01, 2017 at 06:49:56PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 09:37:10, Shaohua Li wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > > show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.
> > >
> > > I have just got to test this patchset and noticed something that was a bit surprising
> > >
> > > madvise(mmap(len), len, MADV_FREE)
> > > Size:           102400 kB
> > > Rss:            102400 kB
> > > Pss:            102400 kB
> > > Shared_Clean:        0 kB
> > > Shared_Dirty:        0 kB
> > > Private_Clean:  102400 kB
> > > Private_Dirty:       0 kB
> > > Referenced:          0 kB
> > > Anonymous:      102400 kB
> > > LazyFree:       102368 kB
> > >
> > > It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.
> >
> > Minchan's original patch includes a drain of the pvec. I discarded it because I think it's not worth the effort. There isn't much memory in the per-cpu vecs.
>
> but multiply that by the number of CPUs.
>
> > Like what you said, I doubt this is noticeable to userspace.
>
> maybe I wasn't clear enough. I've noticed and I expect others would as well. We really shouldn't leak implementation details like that. So I _believe_ this should be fixed. Draining all pagevecs is rather coarse but it is the simplest thing to do. If you do not want to fold this into the original patch I can send a standalone one. Or do you have any concerns about draining?

No, no objection at all. I just doubt it's worth it. It looks like nobody has complained about a similar issue; for example, deactivate_file_page does a similar thing, and then the smaps 'Referenced' value could be inaccurate.

Thanks,
Shaohua
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.
>
> I have just got to test this patchset and noticed something that was a bit surprising
>
> madvise(mmap(len), len, MADV_FREE)
> Size:           102400 kB
> Rss:            102400 kB
> Pss:            102400 kB
> Shared_Clean:        0 kB
> Shared_Dirty:        0 kB
> Private_Clean:  102400 kB
> Private_Dirty:       0 kB
> Referenced:          0 kB
> Anonymous:      102400 kB
> LazyFree:       102368 kB
>
> It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.
> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c812d3..d2c318db16c9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> madvise_free_page_range(&tlb, vma, start, end);
> mmu_notifier_invalidate_range_end(mm, start, end);
> tlb_finish_mmu(&tlb, start, end);
> -
> + lru_add_drain_all();

A full drain on all CPUs is very expensive and IMO not justified for some per-cpu fuzz factor in the stats. I'd take hampering the stats over hampering the syscall any day; only a subset of MADV_FREE users will look at the stats.

And while the aggregate error can be large on machines with many CPUs (notably the machines on which you absolutely don't want to send IPIs to all cores each time a thread madvises some pages!), the pages of a single process are not likely to be spread out across more than a few CPUs. The error when reading a specific smaps should be completely ok.

In numbers: even if your process is madvising from 16 different CPUs, the error in its smaps file will peak at 896K in the worst case. That level of concurrency tends to come with much bigger memory quantities for that amount of error to matter.

IMO this is a non-issue.
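[For reference, the 896K figure follows from the pagevec capacity: assuming the usual PAGEVEC_SIZE of 14 entries and 4 KiB pages, each CPU's lru_lazyfree_pvecs can hold back at most 14 * 4 KiB = 56 KiB, so 16 CPUs give 16 * 56 KiB = 896 KiB.]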
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.
> >
> > I have just got to test this patchset and noticed something that was a bit surprising
> >
> > madvise(mmap(len), len, MADV_FREE)
> > Size:           102400 kB
> > Rss:            102400 kB
> > Pss:            102400 kB
> > Shared_Clean:        0 kB
> > Shared_Dirty:        0 kB
> > Private_Clean:  102400 kB
> > Private_Dirty:       0 kB
> > Referenced:          0 kB
> > Anonymous:      102400 kB
> > LazyFree:       102368 kB
> >
> > It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.
> > ---
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index dc5927c812d3..d2c318db16c9 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > madvise_free_page_range(&tlb, vma, start, end);
> > mmu_notifier_invalidate_range_end(mm, start, end);
> > tlb_finish_mmu(&tlb, start, end);
> > -
> > + lru_add_drain_all();
>
> A full drain on all CPUs is very expensive and IMO not justified for some per-cpu fuzz factor in the stats. I'd take hampering the stats over hampering the syscall any day; only a subset of MADV_FREE users will look at the stats.
>
> And while the aggregate error can be large on machines with many CPUs (notably the machines on which you absolutely don't want to send IPIs to all cores each time a thread madvises some pages!),

I am not sure I understand. Where would we trigger IPIs? lru_add_drain_all relies on workqueues.

> the pages of a single process are not likely to be spread out across more than a few CPUs.

Then we can simply flush only lru_lazyfree_pvecs which should reduce the unrelated noise from other pagevecs.

> The error when reading a specific smaps should be completely ok.
>
> In numbers: even if your process is madvising from 16 different CPUs, the error in its smaps file will peak at 896K in the worst case. That level of concurrency tends to come with much bigger memory quantities for that amount of error to matter.

It is still an unexpected behavior IMHO and an implementation detail which leaks to the userspace.

> IMO this is a non-issue.

I will not insist if there is a general consensus on this and it is a documented behavior, though.
--
Michal Hocko
SUSE Labs
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.
>
> I have just got to test this patchset and noticed something that was a bit surprising
>
> madvise(mmap(len), len, MADV_FREE)
> Size:           102400 kB
> Rss:            102400 kB
> Pss:            102400 kB
> Shared_Clean:        0 kB
> Shared_Dirty:        0 kB
> Private_Clean:  102400 kB
> Private_Dirty:       0 kB
> Referenced:          0 kB
> Anonymous:      102400 kB
> LazyFree:       102368 kB
>
> It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.

Minchan's original patch includes a drain of the pvec. I discarded it because I think it's not worth the effort. There isn't much memory in the per-cpu vecs. Like what you said, I doubt this is noticeable to userspace.

Thanks,
Shaohua

> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c812d3..d2c318db16c9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> madvise_free_page_range(&tlb, vma, start, end);
> mmu_notifier_invalidate_range_end(mm, start, end);
> tlb_finish_mmu(&tlb, start, end);
> -
> + lru_add_drain_all();
> return 0;
> }
>
> --
> Michal Hocko
> SUSE Labs
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Wed 01-03-17 09:37:10, Shaohua Li wrote:
> On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.
> >
> > I have just got to test this patchset and noticed something that was a bit surprising
> >
> > madvise(mmap(len), len, MADV_FREE)
> > Size:           102400 kB
> > Rss:            102400 kB
> > Pss:            102400 kB
> > Shared_Clean:        0 kB
> > Shared_Dirty:        0 kB
> > Private_Clean:  102400 kB
> > Private_Dirty:       0 kB
> > Referenced:          0 kB
> > Anonymous:      102400 kB
> > LazyFree:       102368 kB
> >
> > It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.
>
> Minchan's original patch includes a drain of the pvec. I discarded it because I think it's not worth the effort. There isn't much memory in the per-cpu vecs.

but multiply that by the number of CPUs.

> Like what you said, I doubt this is noticeable to userspace.

maybe I wasn't clear enough. I've noticed and I expect others would as well. We really shouldn't leak implementation details like that. So I _believe_ this should be fixed. Draining all pagevecs is rather coarse but it is the simplest thing to do. If you do not want to fold this into the original patch I can send a standalone one. Or do you have any concerns about draining?
--
Michal Hocko
SUSE Labs
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> show MADV_FREE pages info of each vma in smaps. The interface is for diagnose or monitoring purpose, userspace could use it to understand what happens in the application. Since userspace could dirty MADV_FREE pages without notice from kernel, this interface is the only place we can get accurate accounting info about MADV_FREE pages.

I have just got to test this patchset and noticed something that was a bit surprising

madvise(mmap(len), len, MADV_FREE)
Size:           102400 kB
Rss:            102400 kB
Pss:            102400 kB
Shared_Clean:        0 kB
Shared_Dirty:        0 kB
Private_Clean:  102400 kB
Private_Dirty:       0 kB
Referenced:          0 kB
Anonymous:      102400 kB
LazyFree:       102368 kB

It took me some time to realize that LazyFree is not accurate because there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this is an implementation detail which shouldn't be visible to the userspace. Should we simply drain the pagevec? A crude way would be to simply lru_add_drain_all after we are done with the given range. We can also make this lru_lazyfree_pvecs specific but I am not sure this is worth the additional code.
---
diff --git a/mm/madvise.c b/mm/madvise.c
index dc5927c812d3..d2c318db16c9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 madvise_free_page_range(&tlb, vma, start, end);
 mmu_notifier_invalidate_range_end(mm, start, end);
 tlb_finish_mmu(&tlb, start, end);
-
+ lru_add_drain_all();
 return 0;
 }
--
Michal Hocko
SUSE Labs
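[A minimal userspace sketch of the scenario above, for anyone who wants to reproduce it locally. It assumes a kernel with this patchset applied (MADV_FREE itself needs 4.5+) and reasonably recent libc headers; the 100 MB size simply mirrors the example, and right after the madvise() call the printed LazyFree value for the anonymous mapping will typically be a little below 102400 kB while some pages still sit on the per-cpu pagevecs:]

#define _DEFAULT_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 100 << 20;		/* 100 MB, as in the example above */
	char line[256];
	FILE *f;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);			/* fault the pages in */
	if (madvise(p, len, MADV_FREE))		/* mark them lazily freeable */
		return 1;

	f = fopen("/proc/self/smaps", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "LazyFree:", 9))
			fputs(line, stdout);	/* one line per mapping */
	fclose(f);
	return 0;
}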
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On February 25, 2017 5:32 AM Shaohua Li wrote: > > show MADV_FREE pages info of each vma in smaps. The interface is for > diganose or monitoring purpose, userspace could use it to understand > what happens in the application. Since userspace could dirty MADV_FREE > pages without notice from kernel, this interface is the only place we > can get accurate accounting info about MADV_FREE pages. > > Cc: Michal Hocko> Cc: Hugh Dickins > Cc: Rik van Riel > Cc: Mel Gorman > Cc: Andrew Morton > Acked-by: Johannes Weiner > Acked-by: Minchan Kim > Signed-off-by: Shaohua Li > --- Acked-by: Hillf Danton
Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
On Fri 24-02-17 13:31:49, Shaohua Li wrote: > show MADV_FREE pages info of each vma in smaps. The interface is for > diganose or monitoring purpose, userspace could use it to understand > what happens in the application. Since userspace could dirty MADV_FREE > pages without notice from kernel, this interface is the only place we > can get accurate accounting info about MADV_FREE pages. > > Cc: Michal Hocko> Cc: Hugh Dickins > Cc: Rik van Riel > Cc: Mel Gorman > Cc: Andrew Morton > Acked-by: Johannes Weiner > Acked-by: Minchan Kim > Signed-off-by: Shaohua Li Acked-by: Michal Hocko > --- > Documentation/filesystems/proc.txt | 4 > fs/proc/task_mmu.c | 8 +++- > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/Documentation/filesystems/proc.txt > b/Documentation/filesystems/proc.txt > index c94b467..45853e1 100644 > --- a/Documentation/filesystems/proc.txt > +++ b/Documentation/filesystems/proc.txt > @@ -412,6 +412,7 @@ Private_Clean: 0 kB > Private_Dirty: 0 kB > Referenced: 892 kB > Anonymous: 0 kB > +LazyFree: 0 kB > AnonHugePages: 0 kB > ShmemPmdMapped:0 kB > Shared_Hugetlb:0 kB > @@ -441,6 +442,9 @@ accessed. > "Anonymous" shows the amount of memory that does not belong to any file. > Even > a mapping associated with a file may contain anonymous pages: when > MAP_PRIVATE > and a page is modified, the file page is replaced by a private anonymous > copy. > +"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE). > +The memory isn't freed immediately with madvise(). It's freed in memory > +pressure if the memory is clean. > "AnonHugePages" shows the ammount of memory backed by transparent hugepage. > "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by > huge pages. > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index ee3efb2..8a5ec00 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -440,6 +440,7 @@ struct mem_size_stats { > unsigned long private_dirty; > unsigned long referenced; > unsigned long anonymous; > + unsigned long lazyfree; > unsigned long anonymous_thp; > unsigned long shmem_thp; > unsigned long swap; > @@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, > struct page *page, > int i, nr = compound ? 1 << compound_order(page) : 1; > unsigned long size = nr * PAGE_SIZE; > > - if (PageAnon(page)) > + if (PageAnon(page)) { > mss->anonymous += size; > + if (!PageSwapBacked(page) && !dirty && !PageDirty(page)) > + mss->lazyfree += size; > + } > > mss->resident += size; > /* Accumulate the size in pages that have been accessed. */ > @@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int > is_pid) > "Private_Dirty: %8lu kB\n" > "Referenced: %8lu kB\n" > "Anonymous: %8lu kB\n" > +"LazyFree: %8lu kB\n" > "AnonHugePages: %8lu kB\n" > "ShmemPmdMapped: %8lu kB\n" > "Shared_Hugetlb: %8lu kB\n" > @@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int > is_pid) > mss.private_dirty >> 10, > mss.referenced >> 10, > mss.anonymous >> 10, > +mss.lazyfree >> 10, > mss.anonymous_thp >> 10, > mss.shmem_thp >> 10, > mss.shared_hugetlb >> 10, > -- > 2.9.3 > -- Michal Hocko SUSE Labs
[PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
show MADV_FREE pages info of each vma in smaps. The interface is for
diagnosis or monitoring purposes; userspace could use it to understand
what happens in the application. Since userspace could dirty MADV_FREE
pages without notice from the kernel, this interface is the only place
we can get accurate accounting info about MADV_FREE pages.

Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andrew Morton
Acked-by: Johannes Weiner
Acked-by: Minchan Kim
Signed-off-by: Shaohua Li
---
 Documentation/filesystems/proc.txt | 4
 fs/proc/task_mmu.c                 | 8 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index c94b467..45853e1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -412,6 +412,7 @@ Private_Clean: 0 kB
 Private_Dirty: 0 kB
 Referenced: 892 kB
 Anonymous: 0 kB
+LazyFree: 0 kB
 AnonHugePages: 0 kB
 ShmemPmdMapped: 0 kB
 Shared_Hugetlb: 0 kB
@@ -441,6 +442,9 @@ accessed.
 "Anonymous" shows the amount of memory that does not belong to any file. Even
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
+"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
+The memory isn't freed immediately with madvise(). It's freed in memory
+pressure if the memory is clean.
 "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
 "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
 huge pages.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb2..8a5ec00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -440,6 +440,7 @@ struct mem_size_stats {
 	unsigned long private_dirty;
 	unsigned long referenced;
 	unsigned long anonymous;
+	unsigned long lazyfree;
 	unsigned long anonymous_thp;
 	unsigned long shmem_thp;
 	unsigned long swap;
@@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	int i, nr = compound ? 1 << compound_order(page) : 1;
 	unsigned long size = nr * PAGE_SIZE;
 
-	if (PageAnon(page))
+	if (PageAnon(page)) {
 		mss->anonymous += size;
+		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
+			mss->lazyfree += size;
+	}
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
@@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   "Private_Dirty:  %8lu kB\n"
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
+		   "LazyFree:       %8lu kB\n"
 		   "AnonHugePages:  %8lu kB\n"
 		   "ShmemPmdMapped: %8lu kB\n"
 		   "Shared_Hugetlb: %8lu kB\n"
@@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   mss.private_dirty >> 10,
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
+		   mss.lazyfree >> 10,
 		   mss.anonymous_thp >> 10,
 		   mss.shmem_thp >> 10,
 		   mss.shared_hugetlb >> 10,
--
2.9.3
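The "freed in memory pressure if the memory is clean" wording above has a userspace-visible consequence: writing into an MADV_FREE range again makes those pages dirty, so they drop out of the LazyFree accounting (the !pte_dirty/!PageDirty check in smaps_account). A small sketch of that behaviour under the same assumptions as the earlier one; lazyfree_kb() is a hypothetical convenience helper that just sums the LazyFree lines of /proc/self/smaps.

/*
 * Illustrative only: re-dirtying part of an MADV_FREE range takes those
 * pages out of the lazyfree state, so the LazyFree value shrinks.
 * Assumes Linux 4.5+ (MADV_FREE) and a kernel with this patch applied.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* hypothetical helper: sum all "LazyFree:" lines of /proc/self/smaps */
static unsigned long lazyfree_kb(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	unsigned long kb, total = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "LazyFree: %lu kB", &kb) == 1)
			total += kb;
	fclose(f);
	return total;
}

int main(void)
{
	size_t len = 100UL << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);			/* fault everything in */
	madvise(p, len, MADV_FREE);
	printf("after MADV_FREE: %lu kB lazyfree\n", lazyfree_kb());

	memset(p, 1, len / 2);			/* dirty the first half again */
	printf("after re-dirty:  %lu kB lazyfree\n", lazyfree_kb());
	/* the second figure should be roughly half of the first */
	return 0;
}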