Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-04 Thread Johannes Weiner
On Tue, Jun 04, 2019 at 08:02:05AM +0900, Minchan Kim wrote:
> Hi Johannes,
> 
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > > >
> > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > > >
> > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > > active pages unless there is no access until the time.
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > > >
> > > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > >         deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because page table in one of process shared the page would have access bit
> > > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > > the more fail to reclaim.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > > 
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > > 
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > > 
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > > 
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with a reasoning behind. Consistency is certainly an argument
> > > to use.
> > > 
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them 

Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-04 Thread Michal Hocko
On Tue 04-06-19 13:26:51, Minchan Kim wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
[...]
> > Right. But there is still the page cache reclaim. Is it expected that
> > an explicitly cold memory doesn't get reclaimed because we have a
> > sufficient amount of page cache (a very common case) and we never age
> > anonymous memory because of that?
> 
> If there are lots of used-once pages in file-LRU, I think there is no
> need to reclaim anonymous pages because it needs bigger overhead due to
> IO. It has been true for a long time in current VM policy.

You are making an assumption which is not universally true. If I _know_
that there is a considerable amount of idle anonymous memory then I
would really prefer that it goes to swap rather than putting pressure
on the page cache. The inactive list is not guaranteed to contain only
used-once pages, right?

Anyway, as already mentioned, we can start with a simpler implementation
for now and explicitly note that pagecache-biased reclaim is a known
potential problem. I am pretty sure somebody will come along sooner or
later and we can address the problem then with some good numbers to back
the additional complexity.

-- 
Michal Hocko
SUSE Labs


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-04 Thread Michal Hocko
On Tue 04-06-19 08:02:05, Minchan Kim wrote:
> Hi Johannes,
> 
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > > >
> > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > > >
> > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > > active pages unless there is no access until the time.
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > > >
> > > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > >         deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because page table in one of process shared the page would have access bit
> > > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > > the more fail to reclaim.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > > 
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > > 
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > > 
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > > 
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with a reasoning behind. Consistency is certainly an argument
> > > to use.
> > > 
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being 

Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-04 Thread Michal Hocko
On Mon 03-06-19 17:50:59, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > >
> > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > >
> > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > active pages unless there is no access until the time.
> > > > > > > >
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > >
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > >
> > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > >
> > > > > > > if (page_mapcount(page) <= 1)
> > > > > > >         deactivate_page(page);
> > > > > >
> > > > > > Why do we restrict to pages that are single mapped?
> > > > >
> > > > > Because page table in one of process shared the page would have access bit
> > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > the more fail to reclaim.
> > > > 
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > > 
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > > 
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > > 
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> > 
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with a reasoning behind. Consistency is certainly an argument
> > to use.
> > 
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with a sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
> 
> Right, I don't think it's 

Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Minchan Kim
On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > workingset eviction so it ends up increasing performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > head if the page is private because inactive list could be full of
> > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > active pages unless there is no access until the time.
> > > > > 
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > 
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > > 
> > > > Both file and private pages could be deactived from each active LRU
> > > > to each inactive LRU if the page has one map_count. In other words,
> > > > 
> > > > if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > > 
> > > Why do we restrict to pages that are single mapped?
> > 
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
> 
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Based on the review, I will remove that check and defer the decision to
VM reclaim.
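
To make the effect of dropping the check concrete, here is a minimal
userspace sketch. It is a toy model and not kernel code: struct toy_page,
toy_cold_range() and the gate_on_mapcount flag are hypothetical names made
up for illustration, and only the "mapcount <= 1" condition is taken from
the snippet quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; this is NOT the kernel's struct page. */
struct toy_page {
    int mapcount;   /* how many page tables map this page */
    bool active;    /* true: active LRU, false: inactive LRU */
};

/*
 * Deactivate every active page in the range. With gate_on_mapcount set,
 * pages mapped more than once are skipped, mirroring the quoted
 * "page_mapcount(page) <= 1" check. Without it, every active page is
 * deactivated and the decision is left to later reclaim.
 * Returns the number of pages moved to the inactive list.
 */
size_t toy_cold_range(struct toy_page *pages, size_t n, bool gate_on_mapcount)
{
    size_t moved = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].active)
            continue;               /* already on the inactive list */
        if (gate_on_mapcount && pages[i].mapcount > 1)
            continue;               /* shared page: skipped by the gate */
        pages[i].active = false;
        moved++;
    }
    return moved;
}
```

With the gate, shared pages in the range stay on the active list no matter
what the caller advises; without it, they are aged like any other page and
reclaim can still keep them if another mapping references them.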

>  
> [...]
> 
> > > Please document this, if this is really a desirable semantic because
> > > then you have the same set of problems as we've had with the early
> > > MADV_FREE implementation mentioned above.
> > 
> > IIRC, the problem of MADV_FREE was that we couldn't discard freeable
> > pages because VM never scan anonymous LRU with swapless system.
> > However, it's not the our case because we should reclaim them, not
> > discarding.
> 
> Right. But there is still the page cache reclaim. Is it expected that
> an explicitly cold memory doesn't get reclaimed because we have a
> sufficient amount of page cache (a very common case) and we never age
> anonymous memory because of that?

If there are lots of used-once pages in the file LRU, I think there is no
need to reclaim anonymous pages, because doing so costs more due to IO.
This has long been the case in the current VM policy.

The reclaim preference model based on hints, ordered by cost, is as follows:

MADV_DONTNEED >> MADV_PAGEOUT > used-once pages > MADV_FREE >= MADV_COLD

It is desirable for the new hints to be placed in the reclaiming preference
order such that a) they don't overlap functionally with existing hints and
b) we have a balanced ordering of disruptive and non-disruptive hints.
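
For readers who want to see the caller side, the following is a rough
sketch of how a process might issue the hint. It assumes a kernel and
headers that carry MADV_COLD; the fallback value 20 and the helper name
hint_cold() are assumptions for illustration, not something defined by
this patch series.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20    /* assumed value; check <linux/mman.h> on your system */
#endif

/*
 * Hint that [addr, addr + len) is not expected to be used soon.
 * Returns 0 on success, otherwise the errno set by madvise(2), for
 * example EINVAL on a kernel that does not know MADV_COLD.
 * The hint is non-destructive: the data in the range is preserved.
 */
int hint_cold(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    if (madvise(addr, len, MADV_COLD) == 0)
        return 0;
    return errno;
}
```

A caller would typically treat a non-zero return as "hint not applied" and
carry on, since nothing about the mapping changes in that case.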


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Minchan Kim
Hi Johannes,

On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > >
> > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > >
> > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > active pages unless there is no access until the time.
> > > > > > > >
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > >
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > >
> > > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > >
> > > > > > > if (page_mapcount(page) <= 1)
> > > > > > >         deactivate_page(page);
> > > > > >
> > > > > > Why do we restrict to pages that are single mapped?
> > > > >
> > > > > Because page table in one of process shared the page would have access bit
> > > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > > the more fail to reclaim.
> > > > 
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > > 
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > > 
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > > 
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> > 
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with a reasoning behind. Consistency is certainly an argument
> > to use.
> > 
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with a sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
> 

Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Johannes Weiner
On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > >
> > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > pages to evict early during memory pressure.
> > > > > > > >
> > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > active pages unless there is no access until the time.
> > > > > > >
> > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > >
> > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > >
> > > > > > Both file and private pages could be deactived from each active LRU
> > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > >
> > > > > > if (page_mapcount(page) <= 1)
> > > > > >         deactivate_page(page);
> > > > >
> > > > > Why do we restrict to pages that are single mapped?
> > > >
> > > > Because page table in one of process shared the page would have access bit
> > > > so finally we couldn't reclaim the page. The more process it is shared,
> > > > the more fail to reclaim.
> > > 
> > > So what? In other words why should it be restricted solely based on the
> > > map count. I can see a reason to restrict based on the access
> > > permissions because we do not want to simplify all sorts of side channel
> > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > far I haven't heard any sound argument why madvise should skip those.
> > > Again if there are any reasons, then document them in the changelog.
> > 
> > I think it makes sense. It could be explained, but it also follows
> > established madvise semantics, and I'm not sure it's necessarily
> > Minchan's job to re-iterate those.
> > 
> > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > ksm etc. When you madvise, you can really only speak for your own
> > reference to that memory - "*I* am not using this."
> > 
> > This is in line with other madvise calls: MADV_DONTNEED clears the
> > local page table entries and drops the corresponding references, so
> > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > also has explicit mapcount checks before clearing PG_dirty, so again
> > shared pages don't get freed.
> 
> Right, being consistent with other madvise syscalls is certainly a way
> to go. And I am not pushing one way or another, I just want this to be
> documented with a reasoning behind. Consistency is certainly an argument
> to use.
> 
> On the other hand these non-destructive madvise operations are quite
> different and the shared policy might differ as a result as well. We are
> aging objects rather than destroying them after all. Being able to age
> a pagecache with a sufficient privileges sounds like a useful usecase to
> me. In other words you are able to cause the same effect indirectly
> without the madvise operation so it kinda makes sense to allow it in a
> more sophisticated way.

Right, I don't think it's about permission - as you say, you can do
this indirectly. Page reclaim is all about relative page order, so if
we thwarted you from demoting some pages, you could instead promote
other pages to cause a similar end result.

I think it's about intent. You're advising the 

Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Michal Hocko
On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > happens but data should be preserved for future use.  This could reduce
> > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > >
> > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > pages to evict early during memory pressure.
> > > > > > >
> > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > head if the page is private because inactive list could be full of
> > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > active pages unless there is no access until the time.
> > > > > > 
> > > > > > [I am intentionally not looking at the implementation because below
> > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > 
> > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > Private/shared? If shared, are there any restrictions?
> > > > > 
> > > > > Both file and private pages could be deactivated from each active LRU
> > > > > to each inactive LRU if the page has a map count of one. In other words,
> > > > >
> > > > >     if (page_mapcount(page) <= 1)
> > > > >         deactivate_page(page);
> > > >
> > > > Why do we restrict to pages that are single mapped?
> > >
> > > Because the page table of any process sharing the page could have the
> > > access bit set, so in the end we couldn't reclaim the page. The more
> > > processes share it, the more likely reclaim is to fail.
> > 
> > So what? In other words why should it be restricted solely based on the
> > map count. I can see a reason to restrict based on the access
> > permissions because we do not want to simplify all sorts of side channel
> > attacks but memory reclaim is capable of reclaiming shared pages and so
> > far I haven't heard any sound argument why madvise should skip those.
> > Again if there are any reasons, then document them in the changelog.
> 
> I think it makes sense. It could be explained, but it also follows
> established madvise semantics, and I'm not sure it's necessarily
> Minchan's job to re-iterate those.
> 
> Sharing isn't exactly transparent to userspace. The kernel does COW,
> ksm etc. When you madvise, you can really only speak for your own
> reference to that memory - "*I* am not using this."
> 
> This is in line with other madvise calls: MADV_DONTNEED clears the
> local page table entries and drops the corresponding references, so
> shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> also has explicit mapcount checks before clearing PG_dirty, so again
> shared pages don't get freed.

Right, being consistent with other madvise calls is certainly a way
to go. And I am not pushing one way or another, I just want this to be
documented with the reasoning behind it. Consistency is certainly a valid
argument.

On the other hand these non-destructive madvise operations are quite
different and the policy for shared pages might differ as a result as
well. We are aging objects rather than destroying them after all. Being
able to age page cache with sufficient privileges sounds like a useful
use case to me. In other words you are able to cause the same effect
indirectly without the madvise operation, so it kinda makes sense to
allow it in a more sophisticated way.

That being said, madvise is just a _hint_ and the kernel will always be
free to ignore it, so a future implementation might change. We can start
simple and consistent with existing MADV_$FOO operations now and extend
later on. But let's document the intention in the changelog and make the
decision clear. I am sorry to be so anal about this but I have seen so
many ad-hoc policies that were undocumented, and it was so hard to guess
and make sense of them when revisiting later on.
-- 
Michal Hocko
SUSE Labs
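
Since the advice is only a hint that the kernel may ignore, a caller
would treat it as best-effort. The following is a minimal userspace
sketch under stated assumptions, not code from the patch: hint_cold() is
a hypothetical helper name, and the fallback value 20 for MADV_COLD is
the number mainline Linux later assigned in its generic UAPI headers,
which the RFC under discussion may not match.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: 20 is the value mainline Linux later gave MADV_COLD in the
 * generic UAPI headers; the RFC version of the patch may differ. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hypothetical helper: pass a best-effort "cold" hint for [addr, addr+len).
 * addr must be page-aligned, as for any madvise() call.
 * Returns 0 if the kernel accepted the advice, or the errno value if it
 * refused (for example EINVAL on a kernel that does not know MADV_COLD).
 * The advice is non-destructive, so the data is intact either way.
 */
static int hint_cold(void *addr, size_t len)
{
    assert(addr != NULL);
    if (madvise(addr, len, MADV_COLD) == 0)
        return 0;
    return errno;
}
```

A caller would normally ignore a non-zero return, or log it once, because
nothing about the program's correctness depends on the hint being honored.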


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Johannes Weiner
On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to the kernel that the pages can be reclaimed when memory
> > > > > > pressure happens but the data should be preserved for future use. This
> > > > > > could reduce workingset eviction and so end up increasing performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help the kernel decide which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > for reclaim; that is also the reason why MADV_FREE moves pages to the
> > > > > > head of the inactive LRU list. Therefore, if memory pressure happens,
> > > > > > they will be reclaimed earlier than other active pages, provided they
> > > > > > are not accessed in the meantime.
> > > > > 
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > 
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > > 
> > > > Both file and private pages could be deactivated from each active LRU
> > > > to each inactive LRU if the page has a map count of one. In other words,
> > > >
> > > >     if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of any process sharing the page could have the
> > access bit set, so in the end we couldn't reclaim the page. The more
> > processes share it, the more likely reclaim is to fail.
> 
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

I think it makes sense. It could be explained, but it also follows
established madvise semantics, and I'm not sure it's necessarily
Minchan's job to re-iterate those.

Sharing isn't exactly transparent to userspace. The kernel does COW,
ksm etc. When you madvise, you can really only speak for your own
reference to that memory - "*I* am not using this."

This is in line with other madvise calls: MADV_DONTNEED clears the
local page table entries and drops the corresponding references, so
shared pages won't get freed. MADV_FREE clears the pte dirty bit and
also has explicit mapcount checks before clearing PG_dirty, so again
shared pages don't get freed.
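
For comparison, here is a minimal sketch of the destructive, local-only
behavior of MADV_DONTNEED described above. drop_local_mapping() is a
hypothetical helper name, not something from the thread. It only
demonstrates the effect on the caller's own private anonymous mapping on
Linux; it does not exercise the shared-page case.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: drop this process's page table entries for the
 * range with MADV_DONTNEED. On Linux, a private anonymous mapping reads
 * back as zero-filled pages afterwards, so the old contents are gone
 * from the caller's point of view.
 * Returns 0 on success or the errno value on failure.
 */
static int drop_local_mapping(void *addr, size_t len)
{
    assert(addr != NULL);
    if (madvise(addr, len, MADV_DONTNEED) == 0)
        return 0;
    return errno;
}
```

The contrast with a non-destructive hint is that here the caller gives up
its data, whereas an aging hint only changes how soon reclaim may look at
the pages.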


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Daniel Colascione
On Mon, Jun 3, 2019 at 12:16 AM Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to the kernel that the pages can be reclaimed when memory
> > > > > > pressure happens but the data should be preserved for future use. This
> > > > > > could reduce workingset eviction and so end up increasing performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help the kernel decide which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works by deactivating pages from the active list to the
> > > > > > head of the inactive list if the page is private, because the inactive
> > > > > > list could be full of used-once pages, which are the first candidates
> > > > > > for reclaim; that is also the reason why MADV_FREE moves pages to the
> > > > > > head of the inactive LRU list. Therefore, if memory pressure happens,
> > > > > > they will be reclaimed earlier than other active pages, provided they
> > > > > > are not accessed in the meantime.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file and private pages could be deactivated from each active LRU
> > > > to each inactive LRU if the page has a map count of one. In other words,
> > > >
> > > >     if (page_mapcount(page) <= 1)
> > > >         deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of any process sharing the page could have the
> > access bit set, so in the end we couldn't reclaim the page. The more
> > processes share it, the more likely reclaim is to fail.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Whether to reclaim shared pages is a policy decision best left to
userland, IMHO.


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Michal Hocko
On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > When a process expects no accesses to a certain memory range, it could
> > > > > give a hint to the kernel that the pages can be reclaimed when memory
> > > > > pressure happens but the data should be preserved for future use. This
> > > > > could reduce workingset eviction and so end up increasing performance.
> > > > >
> > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > to be used in the near future. The hint can help the kernel decide which
> > > > > pages to evict early during memory pressure.
> > > > >
> > > > > Internally, it works by deactivating pages from the active list to the
> > > > > head of the inactive list if the page is private, because the inactive
> > > > > list could be full of used-once pages, which are the first candidates
> > > > > for reclaim; that is also the reason why MADV_FREE moves pages to the
> > > > > head of the inactive LRU list. Therefore, if memory pressure happens,
> > > > > they will be reclaimed earlier than other active pages, provided they
> > > > > are not accessed in the meantime.
> > > > 
> > > > [I am intentionally not looking at the implementation because below
> > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > 
> > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > Private/shared? If shared, are there any restrictions?
> > > 
> > > Both file and private pages could be deactivated from each active LRU
> > > to each inactive LRU if the page has a map count of one. In other words,
> > >
> > >     if (page_mapcount(page) <= 1)
> > >         deactivate_page(page);
> >
> > Why do we restrict to pages that are single mapped?
>
> Because the page table of any process sharing the page could have the
> access bit set, so in the end we couldn't reclaim the page. The more
> processes share it, the more likely reclaim is to fail.

So what? In other words, why should it be restricted solely based on the
map count? I can see a reason to restrict based on the access
permissions, because we do not want to make all sorts of side channel
attacks easier, but memory reclaim is capable of reclaiming shared pages
and so far I haven't heard any sound argument why madvise should skip
those. Again, if there are any reasons, then document them in the
changelog.
 
[...]

> > Please document this, if this is really a desirable semantic because
> > then you have the same set of problems as we've had with the early
> > MADV_FREE implementation mentioned above.
> 
> IIRC, the problem with MADV_FREE was that we couldn't discard freeable
> pages because the VM never scans the anonymous LRU on a swapless system.
> However, that's not our case because we should reclaim them, not
> discard them.

Right. But there is still the page cache reclaim. Is it expected that
an explicitly cold memory doesn't get reclaimed because we have a
sufficient amount of page cache (a very common case) and we never age
anonymous memory because of that?

-- 
Michal Hocko
SUSE Labs


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-05-31 Thread Minchan Kim
On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it could
> > > > give a hint to the kernel that the pages can be reclaimed when memory
> > > > pressure happens but the data should be preserved for future use. This
> > > > could reduce workingset eviction and so end up increasing performance.
> > > >
> > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help the kernel decide which
> > > > pages to evict early during memory pressure.
> > > >
> > > > Internally, it works by deactivating pages from the active list to the
> > > > head of the inactive list if the page is private, because the inactive
> > > > list could be full of used-once pages, which are the first candidates
> > > > for reclaim; that is also the reason why MADV_FREE moves pages to the
> > > > head of the inactive LRU list. Therefore, if memory pressure happens,
> > > > they will be reclaimed earlier than other active pages, provided they
> > > > are not accessed in the meantime.
> > > 
> > > [I am intentionally not looking at the implementation because below
> > > points should be clear from the changelog - sorry about nagging ;)]
> > > 
> > > What kind of pages can be deactivated? Anonymous/File backed.
> > > Private/shared? If shared, are there any restrictions?
> > 
> > Both file and private pages could be deactivated from each active LRU
> > to each inactive LRU if the page has a map count of one. In other words,
> >
> >     if (page_mapcount(page) <= 1)
> >         deactivate_page(page);
>
> Why do we restrict to pages that are single mapped?

Because the page table of any process sharing the page could have the
access bit set, so in the end we couldn't reclaim the page. The more
processes share it, the more likely reclaim is to fail.

> 
> > > Are there any restrictions on mappings? E.g. what would be an effect of
> > > this operation on hugetlbfs mapping?
> > 
> > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED
> 
> OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
> is really useful to mention.

Sure.

> 
> > 
> > > 
> > > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > > it the anonymous LRU? If yes, don't we have the same problem as with the
> > 
> > active file page -> inactive file LRU
> > active anon page -> inactive anon LRU
> > 
> > > early MADV_FREE implementation when enough page cache causes that
> > > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > > worse never when there is no swap available?
> > 
> > I think MADV_COLD has slightly different semantics from MADV_FREE.
> > MADV_FREE means it's okay to discard the page under memory pressure
> > because its content is *garbage*. Furthermore, freeing such pages has
> > almost zero overhead since we don't need to swap them out, and a later
> > access only causes a minor fault. Thus, it makes sense to put those
> > freeable pages on the inactive file LRU to compete with other used-once
> > pages.
> >
> > However, MADV_COLD doesn't mean the content is garbage, and freeing the
> > page requires swap-out now and swap-in afterward. So it would be better
> > to move such pages to the inactive anon LRU list, not the file LRU.
> > Furthermore, it would avoid unnecessary scanning of those cold anonymous
> > pages if the system doesn't have a swap device.
> 
> Please document this, if this is really a desirable semantic because
> then you have the same set of problems as we've had with the early
> MADV_FREE implementation mentioned above.

IIRC, the problem with MADV_FREE was that we couldn't discard freeable
pages because the VM never scans the anonymous LRU on a swapless system.
However, that's not our case because we should reclaim them, not
discard them.

I will include it in the description.

Thanks.


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-05-31 Thread Michal Hocko
On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range, it could
> > > give a hint to the kernel that the pages can be reclaimed when memory
> > > pressure happens but the data should be preserved for future use. This
> > > could reduce workingset eviction and so end up increasing performance.
> > >
> > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > to be used in the near future. The hint can help the kernel decide which
> > > pages to evict early during memory pressure.
> > >
> > > Internally, it works by deactivating pages from the active list to the
> > > head of the inactive list if the page is private, because the inactive
> > > list could be full of used-once pages, which are the first candidates
> > > for reclaim; that is also the reason why MADV_FREE moves pages to the
> > > head of the inactive LRU list. Therefore, if memory pressure happens,
> > > they will be reclaimed earlier than other active pages, provided they
> > > are not accessed in the meantime.
> > 
> > [I am intentionally not looking at the implementation because below
> > points should be clear from the changelog - sorry about nagging ;)]
> > 
> > What kind of pages can be deactivated? Anonymous/File backed.
> > Private/shared? If shared, are there any restrictions?
> 
> Both file and private pages could be deactivated from each active LRU
> to each inactive LRU if the page has a map count of one. In other words,
>
>     if (page_mapcount(page) <= 1)
>         deactivate_page(page);

Why do we restrict to pages that are single mapped?

> > Are there any restrictions on mappings? E.g. what would be an effect of
> > this operation on hugetlbfs mapping?
> 
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED

OK, documenting that this is restricted to the same VMAs as
MADV_FREE|DONTNEED would be really useful.

> 
> > 
> > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > it the anonymous LRU? If yes, don't we have the same problem as with the
> 
> active file page -> inactive file LRU
> active anon page -> inactive anon LRU
> 
> > early MADV_FREE implementation when enough page cache causes that
> > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > worse never when there is no swap available?
> 
> I think MADV_COLD has slightly different semantics from MADV_FREE.
> MADV_FREE means it's okay to discard the page under memory pressure
> because its content is *garbage*. Furthermore, freeing such pages has
> almost zero overhead since we don't need to swap them out, and a later
> access only causes a minor fault. Thus, it makes sense to put those
> freeable pages on the inactive file LRU to compete with other used-once
> pages.
>
> However, MADV_COLD doesn't mean the content is garbage, and freeing the
> page requires swap-out now and swap-in afterward. So it would be better
> to move such pages to the inactive anon LRU list, not the file LRU.
> Furthermore, it would avoid unnecessary scanning of those cold anonymous
> pages if the system doesn't have a swap device.

Please document this, if this is really a desirable semantic because
then you have the same set of problems as we've had with the early
MADV_FREE implementation mentioned above.

-- 
Michal Hocko
SUSE Labs


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-05-31 Thread Minchan Kim
On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to the kernel that the pages can be reclaimed when memory
> > pressure happens but the data should be preserved for future use. This
> > could reduce workingset eviction and so end up increasing performance.
> >
> > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help the kernel decide which
> > pages to evict early during memory pressure.
> >
> > Internally, it works by deactivating pages from the active list to the
> > head of the inactive list if the page is private, because the inactive
> > list could be full of used-once pages, which are the first candidates
> > for reclaim; that is also the reason why MADV_FREE moves pages to the
> > head of the inactive LRU list. Therefore, if memory pressure happens,
> > they will be reclaimed earlier than other active pages, provided they
> > are not accessed in the meantime.
> 
> [I am intentionally not looking at the implementation because below
> points should be clear from the changelog - sorry about nagging ;)]
> 
> What kind of pages can be deactivated? Anonymous/File backed.
> Private/shared? If shared, are there any restrictions?

Both file and private pages could be deactivated from each active LRU
to each inactive LRU if the page has a map count of one. In other words,

    if (page_mapcount(page) <= 1)
        deactivate_page(page);

> 
> Are there any restrictions on mappings? E.g. what would be an effect of
> this operation on hugetlbfs mapping?

VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED

> 
> Also you are talking about inactive LRU but what kind of LRU is that? Is
> it the anonymous LRU? If yes, don't we have the same problem as with the

active file page -> inactive file LRU
active anon page -> inactive anon LRU

> early MADV_FREE implementation when enough page cache causes that
> deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> worse never when there is no swap available?

I think MADV_COLD has slightly different semantics from MADV_FREE.
MADV_FREE means it's okay to discard the page under memory pressure
because its content is *garbage*. Furthermore, freeing such pages has
almost zero overhead since we don't need to swap them out, and a later
access only causes a minor fault. Thus, it makes sense to put those
freeable pages on the inactive file LRU to compete with other used-once
pages.

However, MADV_COLD doesn't mean the content is garbage, and freeing the
page requires swap-out now and swap-in afterward. So it would be better
to move such pages to the inactive anon LRU list, not the file LRU.
Furthermore, it would avoid unnecessary scanning of those cold anonymous
pages if the system doesn't have a swap device.
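
The rules in this reply (the mapcount check, the skipped VMA flags, and
the per-type LRU routing) can be sketched as a small userspace toy
model. This is an illustration only, not the kernel code from the patch:
every type, flag, and function name below (toy_page, TOY_VM_*,
toy_madv_cold_page) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
enum toy_lru {
    TOY_INACTIVE_ANON,
    TOY_ACTIVE_ANON,
    TOY_INACTIVE_FILE,
    TOY_ACTIVE_FILE,
};

enum {
    TOY_VM_LOCKED  = 1 << 0,
    TOY_VM_HUGETLB = 1 << 1,
    TOY_VM_PFNMAP  = 1 << 2,
};

struct toy_page {
    int mapcount;       /* number of page tables mapping this page */
    enum toy_lru lru;   /* list the page currently sits on */
};

/* VMAs with these flags are skipped, like MADV_FREE|DONTNEED. */
static bool toy_vma_can_cold(unsigned int vm_flags)
{
    return !(vm_flags & (TOY_VM_LOCKED | TOY_VM_HUGETLB | TOY_VM_PFNMAP));
}

/* Returns true if the page was moved to the inactive list of its type. */
static bool toy_madv_cold_page(struct toy_page *page, unsigned int vm_flags)
{
    assert(page != NULL);
    if (!toy_vma_can_cold(vm_flags))
        return false;
    if (page->mapcount > 1)     /* shared page: leave it alone */
        return false;
    switch (page->lru) {
    case TOY_ACTIVE_ANON:
        page->lru = TOY_INACTIVE_ANON;  /* anon stays on the anon LRU */
        return true;
    case TOY_ACTIVE_FILE:
        page->lru = TOY_INACTIVE_FILE;  /* file stays on the file LRU */
        return true;
    default:
        return false;                   /* already inactive */
    }
}
```

The model deliberately keeps anonymous pages on the anon side, which is
the difference from MADV_FREE that the reply argues for.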



Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-05-31 Thread Michal Hocko
On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to the kernel that the pages can be reclaimed when memory
> pressure happens but the data should be preserved for future use. This
> could reduce workingset eviction and so end up increasing performance.
>
> This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help the kernel decide which
> pages to evict early during memory pressure.
>
> Internally, it works by deactivating pages from the active list to the
> head of the inactive list if the page is private, because the inactive
> list could be full of used-once pages, which are the first candidates
> for reclaim; that is also the reason why MADV_FREE moves pages to the
> head of the inactive LRU list. Therefore, if memory pressure happens,
> they will be reclaimed earlier than other active pages, provided they
> are not accessed in the meantime.

[I am intentionally not looking at the implementation because below
points should be clear from the changelog - sorry about nagging ;)]

What kind of pages can be deactivated? Anonymous/File backed.
Private/shared? If shared, are there any restrictions?

Are there any restrictions on mappings? E.g. what would be an effect of
this operation on hugetlbfs mapping?

Also you are talking about inactive LRU but what kind of LRU is that? Is
it the anonymous LRU? If yes, don't we have the same problem as with the
early MADV_FREE implementation when enough page cache causes that
deactivated anonymous memory doesn't get reclaimed anytime soon. Or
worse never when there is no swap available?
-- 
Michal Hocko
SUSE Labs