Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-30 Thread Kirill A. Shutemov
On Wed, Aug 28, 2019 at 06:02:24PM +0200, Michal Hocko wrote:
> > > 
> > > Any idea about a bad case?
> > 
> > Not really.
> > 
> > How bad do you want it to get? How many processes share the page? Access
> > pattern? Locking situation?
> 
> Let's say: how hard can a regular user make this?

It bumped to ~170 ms if each THP was mapped 5 times.

With added ptl contention (a tight loop of MADV_DONTNEED) in 3 processes that
map the page, the time to split bumped to ~740ms.
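
(Editorial sketch, not Kirill's actual harness: roughly the kind of setup
being described above -- a THP-backed region shared via fork(), a few
children hammering MADV_DONTNEED to keep page-table locks busy, and the
parent partially unmapping each THP so it lands on the deferred split
queue. Sizes and loop counts are assumptions, and actually driving the
split, e.g. via memory pressure, is left out.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_LEN (1UL << 30)	/* 1 GiB of anonymous memory */
#define PMD_SZ  (2UL << 20)	/* 2 MiB PMD/THP size (x86-64 assumption) */

int main(void)
{
	char *buf = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(buf, MAP_LEN, MADV_HUGEPAGE);	/* ask for THPs */
	memset(buf, 1, MAP_LEN);		/* fault them in */

	/* Extra mappers: children inherit the mapping and keep their
	 * page-table locks busy with a tight MADV_DONTNEED loop. */
	for (int i = 0; i < 3; i++) {
		if (fork() == 0) {
			for (long j = 0; j < 100000000L; j++)
				madvise(buf, 4096, MADV_DONTNEED);
			_exit(0);
		}
	}

	/* Parent: drop one 4 KiB piece of every THP so each becomes
	 * partially mapped and is queued for deferred split. */
	for (size_t off = 0; off < MAP_LEN; off += PMD_SZ)
		madvise(buf + off, 4096, MADV_DONTNEED);

	while (wait(NULL) > 0)
		;
	return 0;
}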

Overall, it's reasonable to take ~100ms per GB of huge pages as a rule of
thumb.
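
(Editorial back-of-the-envelope check, not from the thread: applying this
rule of thumb to the ~30G of deferred split THPs reported for the MariaDB
run elsewhere in the thread gives roughly 3 seconds of splitting work.)

/* Pure arithmetic, no kernel interfaces involved. */
#include <stdio.h>

int main(void)
{
	double deferred_gb = 30.0;	/* figure quoted elsewhere in the thread */
	double ms_per_gb = 100.0;	/* rule of thumb above */

	printf("estimated split cost: ~%.1f seconds\n",
	       deferred_gb * ms_per_gb / 1000.0);
	return 0;
}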

-- 
 Kirill A. Shutemov


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-30 Thread Michal Hocko
On Thu 29-08-19 10:03:21, Yang Shi wrote:
> On Wed, Aug 28, 2019 at 9:02 AM Michal Hocko  wrote:
> >
> > On Wed 28-08-19 17:46:59, Kirill A. Shutemov wrote:
> > > On Wed, Aug 28, 2019 at 02:12:53PM +, Michal Hocko wrote:
> > > > On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:
> > > > > On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> > > > > > On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > > > > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov 
> > > > > > > > wrote:
> > > > > > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > > > > > > Unmapped completely pages will be freed with 
> > > > > > > > > > > > > > current code. Deferred split
> > > > > > > > > > > > > > only applies to partly mapped THPs: at least on 4k 
> > > > > > > > > > > > > > of the THP is still
> > > > > > > > > > > > > > mapped somewhere.
> > > > > > > > > > > > > Hmm, I am probably misreading the code but at least 
> > > > > > > > > > > > > current Linus' tree
> > > > > > > > > > > > > reads page_remove_rmap -> 
> > > > > > > > > > > > > [page_remove_anon_compound_rmap ->\ 
> > > > > > > > > > > > > deferred_split_huge_page even
> > > > > > > > > > > > > for fully mapped THP.
> > > > > > > > > > > > Well, you read correctly, but it was not intended. I 
> > > > > > > > > > > > screwed it up at some
> > > > > > > > > > > > point.
> > > > > > > > > > > >
> > > > > > > > > > > > See the patch below. It should make it work as intened.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not bug as such, but inefficientcy. We add page to 
> > > > > > > > > > > > the queue where
> > > > > > > > > > > > it's not needed.
> > > > > > > > > > > But that adding to queue doesn't affect whether the page 
> > > > > > > > > > > will be freed
> > > > > > > > > > > immediately if there are no more partial mappings, right? 
> > > > > > > > > > > I don't see
> > > > > > > > > > > deferred_split_huge_page() pinning the page.
> > > > > > > > > > > So your patch wouldn't make THPs freed immediately in 
> > > > > > > > > > > cases where they
> > > > > > > > > > > haven't been freed before immediately, it just fixes a 
> > > > > > > > > > > minor
> > > > > > > > > > > inefficiency with queue manipulation?
> > > > > > > > > > Ohh, right. I can see that in free_transhuge_page now. So 
> > > > > > > > > > fully mapped
> > > > > > > > > > THPs really do not matter and what I have considered an odd 
> > > > > > > > > > case is
> > > > > > > > > > really happening more often.
> > > > > > > > > >
> > > > > > > > > > That being said this will not help at all for what Yang Shi 
> > > > > > > > > > is seeing
> > > > > > > > > > and we need a more proactive deferred splitting as I've 
> > > > > > > > > > mentioned
> > > > > > > > > > earlier.
> > > > > > > > > It was not intended to fix the issue. It's fix for current 
> > > > > > > > > logic. I'm
> > > > > > > > > playing with the work approach now.
> > > > > > > > Below is what I've come up with. It appears to be functional.
> > > > > > > >
> > > > > > > > Any comments?
> > > > > > >
> > > > > > > Thanks, Kirill and Michal. Doing split more proactive is 
> > > > > > > definitely a choice
> > > > > > > to eliminate huge accumulated deferred split THPs, I did think 
> > > > > > > about this
> > > > > > > approach before I came up with memcg aware approach. But, I 
> > > > > > > thought this
> > > > > > > approach has some problems:
> > > > > > >
> > > > > > > First of all, we can't prove if this is a universal win for the 
> > > > > > > most
> > > > > > > workloads or not. For some workloads (as I mentioned about our 
> > > > > > > usecase), we
> > > > > > > do see a lot THPs accumulated for a while, but they are very 
> > > > > > > short-lived for
> > > > > > > other workloads, i.e. kernel build.
> > > > > > >
> > > > > > > Secondly, it may be not fair for some workloads which don't 
> > > > > > > generate too
> > > > > > > many deferred split THPs or those THPs are short-lived. Actually, 
> > > > > > > the cpu
> > > > > > > time is abused by the excessive deferred split THPs generators, 
> > > > > > > isn't it?
> > > > > >
> > > > > > Yes this is indeed true. Do we have any idea on how much time that
> > > > > > actually is?
> > > > >
> > > > > For uncontented case, splitting 1G worth of pages (2MiB x 512) takes 
> > > > > a bit
> > > > > more than 50 ms in my setup. But it's best-case scenario: pages not 
> > > > > shared
> > > > > across multiple processes, no contention on ptl, page lock, etc.
> > > >
> > > > Any idea about a bad 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-29 Thread Yang Shi
On Wed, Aug 28, 2019 at 9:02 AM Michal Hocko  wrote:
>
> On Wed 28-08-19 17:46:59, Kirill A. Shutemov wrote:
> > On Wed, Aug 28, 2019 at 02:12:53PM +, Michal Hocko wrote:
> > > On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:
> > > > On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> > > > > On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > > > > >
> > > > > >
> > > > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > > > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov 
> > > > > > > wrote:
> > > > > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > > > > > Unmapped completely pages will be freed with current 
> > > > > > > > > > > > > code. Deferred split
> > > > > > > > > > > > > only applies to partly mapped THPs: at least on 4k of 
> > > > > > > > > > > > > the THP is still
> > > > > > > > > > > > > mapped somewhere.
> > > > > > > > > > > > Hmm, I am probably misreading the code but at least 
> > > > > > > > > > > > current Linus' tree
> > > > > > > > > > > > reads page_remove_rmap -> 
> > > > > > > > > > > > [page_remove_anon_compound_rmap ->\ 
> > > > > > > > > > > > deferred_split_huge_page even
> > > > > > > > > > > > for fully mapped THP.
> > > > > > > > > > > Well, you read correctly, but it was not intended. I 
> > > > > > > > > > > screwed it up at some
> > > > > > > > > > > point.
> > > > > > > > > > >
> > > > > > > > > > > See the patch below. It should make it work as intened.
> > > > > > > > > > >
> > > > > > > > > > > It's not bug as such, but inefficientcy. We add page to 
> > > > > > > > > > > the queue where
> > > > > > > > > > > it's not needed.
> > > > > > > > > > But that adding to queue doesn't affect whether the page 
> > > > > > > > > > will be freed
> > > > > > > > > > immediately if there are no more partial mappings, right? I 
> > > > > > > > > > don't see
> > > > > > > > > > deferred_split_huge_page() pinning the page.
> > > > > > > > > > So your patch wouldn't make THPs freed immediately in cases 
> > > > > > > > > > where they
> > > > > > > > > > haven't been freed before immediately, it just fixes a minor
> > > > > > > > > > inefficiency with queue manipulation?
> > > > > > > > > Ohh, right. I can see that in free_transhuge_page now. So 
> > > > > > > > > fully mapped
> > > > > > > > > THPs really do not matter and what I have considered an odd 
> > > > > > > > > case is
> > > > > > > > > really happening more often.
> > > > > > > > >
> > > > > > > > > That being said this will not help at all for what Yang Shi 
> > > > > > > > > is seeing
> > > > > > > > > and we need a more proactive deferred splitting as I've 
> > > > > > > > > mentioned
> > > > > > > > > earlier.
> > > > > > > > It was not intended to fix the issue. It's fix for current 
> > > > > > > > logic. I'm
> > > > > > > > playing with the work approach now.
> > > > > > > Below is what I've come up with. It appears to be functional.
> > > > > > >
> > > > > > > Any comments?
> > > > > >
> > > > > > Thanks, Kirill and Michal. Doing split more proactive is definitely 
> > > > > > a choice
> > > > > > to eliminate huge accumulated deferred split THPs, I did think 
> > > > > > about this
> > > > > > approach before I came up with memcg aware approach. But, I thought 
> > > > > > this
> > > > > > approach has some problems:
> > > > > >
> > > > > > First of all, we can't prove if this is a universal win for the most
> > > > > > workloads or not. For some workloads (as I mentioned about our 
> > > > > > usecase), we
> > > > > > do see a lot THPs accumulated for a while, but they are very 
> > > > > > short-lived for
> > > > > > other workloads, i.e. kernel build.
> > > > > >
> > > > > > Secondly, it may be not fair for some workloads which don't 
> > > > > > generate too
> > > > > > many deferred split THPs or those THPs are short-lived. Actually, 
> > > > > > the cpu
> > > > > > time is abused by the excessive deferred split THPs generators, 
> > > > > > isn't it?
> > > > >
> > > > > Yes this is indeed true. Do we have any idea on how much time that
> > > > > actually is?
> > > >
> > > > For uncontented case, splitting 1G worth of pages (2MiB x 512) takes a 
> > > > bit
> > > > more than 50 ms in my setup. But it's best-case scenario: pages not 
> > > > shared
> > > > across multiple processes, no contention on ptl, page lock, etc.
> > >
> > > Any idea about a bad case?
> >
> > Not really.
> >
> > How bad you want it to get? How many processes share the page? Access
> > pattern? Locking situation?
>
> Let's say how hard a regular user can make this?
>
> > Worst case scenarion: no progress on splitting due to pins or locking
> > 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-28 Thread Michal Hocko
On Wed 28-08-19 17:46:59, Kirill A. Shutemov wrote:
> On Wed, Aug 28, 2019 at 02:12:53PM +, Michal Hocko wrote:
> > On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:
> > > On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> > > > On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > > > > 
> > > > > 
> > > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> > > > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > > > > Unmapped completely pages will be freed with current 
> > > > > > > > > > > > code. Deferred split
> > > > > > > > > > > > only applies to partly mapped THPs: at least on 4k of 
> > > > > > > > > > > > the THP is still
> > > > > > > > > > > > mapped somewhere.
> > > > > > > > > > > Hmm, I am probably misreading the code but at least 
> > > > > > > > > > > current Linus' tree
> > > > > > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap 
> > > > > > > > > > > ->\ deferred_split_huge_page even
> > > > > > > > > > > for fully mapped THP.
> > > > > > > > > > Well, you read correctly, but it was not intended. I 
> > > > > > > > > > screwed it up at some
> > > > > > > > > > point.
> > > > > > > > > > 
> > > > > > > > > > See the patch below. It should make it work as intened.
> > > > > > > > > > 
> > > > > > > > > > It's not bug as such, but inefficientcy. We add page to the 
> > > > > > > > > > queue where
> > > > > > > > > > it's not needed.
> > > > > > > > > But that adding to queue doesn't affect whether the page will 
> > > > > > > > > be freed
> > > > > > > > > immediately if there are no more partial mappings, right? I 
> > > > > > > > > don't see
> > > > > > > > > deferred_split_huge_page() pinning the page.
> > > > > > > > > So your patch wouldn't make THPs freed immediately in cases 
> > > > > > > > > where they
> > > > > > > > > haven't been freed before immediately, it just fixes a minor
> > > > > > > > > inefficiency with queue manipulation?
> > > > > > > > Ohh, right. I can see that in free_transhuge_page now. So fully 
> > > > > > > > mapped
> > > > > > > > THPs really do not matter and what I have considered an odd 
> > > > > > > > case is
> > > > > > > > really happening more often.
> > > > > > > > 
> > > > > > > > That being said this will not help at all for what Yang Shi is 
> > > > > > > > seeing
> > > > > > > > and we need a more proactive deferred splitting as I've 
> > > > > > > > mentioned
> > > > > > > > earlier.
> > > > > > > It was not intended to fix the issue. It's fix for current logic. 
> > > > > > > I'm
> > > > > > > playing with the work approach now.
> > > > > > Below is what I've come up with. It appears to be functional.
> > > > > > 
> > > > > > Any comments?
> > > > > 
> > > > > Thanks, Kirill and Michal. Doing split more proactive is definitely a 
> > > > > choice
> > > > > to eliminate huge accumulated deferred split THPs, I did think about 
> > > > > this
> > > > > approach before I came up with memcg aware approach. But, I thought 
> > > > > this
> > > > > approach has some problems:
> > > > > 
> > > > > First of all, we can't prove if this is a universal win for the most
> > > > > workloads or not. For some workloads (as I mentioned about our 
> > > > > usecase), we
> > > > > do see a lot THPs accumulated for a while, but they are very 
> > > > > short-lived for
> > > > > other workloads, i.e. kernel build.
> > > > > 
> > > > > Secondly, it may be not fair for some workloads which don't generate 
> > > > > too
> > > > > many deferred split THPs or those THPs are short-lived. Actually, the 
> > > > > cpu
> > > > > time is abused by the excessive deferred split THPs generators, isn't 
> > > > > it?
> > > > 
> > > > Yes this is indeed true. Do we have any idea on how much time that
> > > > actually is?
> > > 
> > > For uncontented case, splitting 1G worth of pages (2MiB x 512) takes a bit
> > > more than 50 ms in my setup. But it's best-case scenario: pages not shared
> > > across multiple processes, no contention on ptl, page lock, etc.
> > 
> > Any idea about a bad case?
> 
> Not really.
> 
> How bad you want it to get? How many processes share the page? Access
> pattern? Locking situation?

Let's say: how hard can a regular user make this?
 
> Worst-case scenario: no progress on splitting due to pins or locking
> conflicts (trylock failure).
> 
> > > > > With memcg awareness, the deferred split THPs actually are isolated 
> > > > > and
> > > > > capped by memcg. The long-lived deferred split THPs can't be 
> > > > > accumulated too
> > > > > many due to the limit of memcg. And, cpu time spent in splitting them 
> 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-28 Thread Kirill A. Shutemov
On Wed, Aug 28, 2019 at 02:12:53PM +, Michal Hocko wrote:
> On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:
> > On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> > > On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > > > 
> > > > 
> > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > > > Unmapped completely pages will be freed with current 
> > > > > > > > > > > code. Deferred split
> > > > > > > > > > > only applies to partly mapped THPs: at least on 4k of the 
> > > > > > > > > > > THP is still
> > > > > > > > > > > mapped somewhere.
> > > > > > > > > > Hmm, I am probably misreading the code but at least current 
> > > > > > > > > > Linus' tree
> > > > > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap 
> > > > > > > > > > ->\ deferred_split_huge_page even
> > > > > > > > > > for fully mapped THP.
> > > > > > > > > Well, you read correctly, but it was not intended. I screwed 
> > > > > > > > > it up at some
> > > > > > > > > point.
> > > > > > > > > 
> > > > > > > > > See the patch below. It should make it work as intened.
> > > > > > > > > 
> > > > > > > > > It's not bug as such, but inefficientcy. We add page to the 
> > > > > > > > > queue where
> > > > > > > > > it's not needed.
> > > > > > > > But that adding to queue doesn't affect whether the page will 
> > > > > > > > be freed
> > > > > > > > immediately if there are no more partial mappings, right? I 
> > > > > > > > don't see
> > > > > > > > deferred_split_huge_page() pinning the page.
> > > > > > > > So your patch wouldn't make THPs freed immediately in cases 
> > > > > > > > where they
> > > > > > > > haven't been freed before immediately, it just fixes a minor
> > > > > > > > inefficiency with queue manipulation?
> > > > > > > Ohh, right. I can see that in free_transhuge_page now. So fully 
> > > > > > > mapped
> > > > > > > THPs really do not matter and what I have considered an odd case 
> > > > > > > is
> > > > > > > really happening more often.
> > > > > > > 
> > > > > > > That being said this will not help at all for what Yang Shi is 
> > > > > > > seeing
> > > > > > > and we need a more proactive deferred splitting as I've mentioned
> > > > > > > earlier.
> > > > > > It was not intended to fix the issue. It's fix for current logic. 
> > > > > > I'm
> > > > > > playing with the work approach now.
> > > > > Below is what I've come up with. It appears to be functional.
> > > > > 
> > > > > Any comments?
> > > > 
> > > > Thanks, Kirill and Michal. Doing split more proactive is definitely a 
> > > > choice
> > > > to eliminate huge accumulated deferred split THPs, I did think about 
> > > > this
> > > > approach before I came up with memcg aware approach. But, I thought this
> > > > approach has some problems:
> > > > 
> > > > First of all, we can't prove if this is a universal win for the most
> > > > workloads or not. For some workloads (as I mentioned about our 
> > > > usecase), we
> > > > do see a lot THPs accumulated for a while, but they are very 
> > > > short-lived for
> > > > other workloads, i.e. kernel build.
> > > > 
> > > > Secondly, it may be not fair for some workloads which don't generate too
> > > > many deferred split THPs or those THPs are short-lived. Actually, the 
> > > > cpu
> > > > time is abused by the excessive deferred split THPs generators, isn't 
> > > > it?
> > > 
> > > Yes this is indeed true. Do we have any idea on how much time that
> > > actually is?
> > 
> > For uncontented case, splitting 1G worth of pages (2MiB x 512) takes a bit
> > more than 50 ms in my setup. But it's best-case scenario: pages not shared
> > across multiple processes, no contention on ptl, page lock, etc.
> 
> Any idea about a bad case?

Not really.

How bad do you want it to get? How many processes share the page? Access
pattern? Locking situation?

Worst-case scenario: no progress on splitting due to pins or locking
conflicts (trylock failure).

> > > > With memcg awareness, the deferred split THPs actually are isolated and
> > > > capped by memcg. The long-lived deferred split THPs can't be 
> > > > accumulated too
> > > > many due to the limit of memcg. And, cpu time spent in splitting them 
> > > > would
> > > > just account to the memcgs who generate that many deferred split THPs, 
> > > > who
> > > > generate them who pay for it. This sounds more fair and we could achieve
> > > > much better isolation.
> > > 
> > > On the other hand, deferring the split and free up a non trivial amount
> > > of memory is a problem I consider quite serious because it 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-28 Thread Michal Hocko
On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:
> On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> > On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > > 
> > > 
> > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > > Unmapped completely pages will be freed with current code. 
> > > > > > > > > > Deferred split
> > > > > > > > > > only applies to partly mapped THPs: at least on 4k of the 
> > > > > > > > > > THP is still
> > > > > > > > > > mapped somewhere.
> > > > > > > > > Hmm, I am probably misreading the code but at least current 
> > > > > > > > > Linus' tree
> > > > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ 
> > > > > > > > > deferred_split_huge_page even
> > > > > > > > > for fully mapped THP.
> > > > > > > > Well, you read correctly, but it was not intended. I screwed it 
> > > > > > > > up at some
> > > > > > > > point.
> > > > > > > > 
> > > > > > > > See the patch below. It should make it work as intened.
> > > > > > > > 
> > > > > > > > It's not bug as such, but inefficientcy. We add page to the 
> > > > > > > > queue where
> > > > > > > > it's not needed.
> > > > > > > But that adding to queue doesn't affect whether the page will be 
> > > > > > > freed
> > > > > > > immediately if there are no more partial mappings, right? I don't 
> > > > > > > see
> > > > > > > deferred_split_huge_page() pinning the page.
> > > > > > > So your patch wouldn't make THPs freed immediately in cases where 
> > > > > > > they
> > > > > > > haven't been freed before immediately, it just fixes a minor
> > > > > > > inefficiency with queue manipulation?
> > > > > > Ohh, right. I can see that in free_transhuge_page now. So fully 
> > > > > > mapped
> > > > > > THPs really do not matter and what I have considered an odd case is
> > > > > > really happening more often.
> > > > > > 
> > > > > > That being said this will not help at all for what Yang Shi is 
> > > > > > seeing
> > > > > > and we need a more proactive deferred splitting as I've mentioned
> > > > > > earlier.
> > > > > It was not intended to fix the issue. It's fix for current logic. I'm
> > > > > playing with the work approach now.
> > > > Below is what I've come up with. It appears to be functional.
> > > > 
> > > > Any comments?
> > > 
> > > Thanks, Kirill and Michal. Doing split more proactive is definitely a 
> > > choice
> > > to eliminate huge accumulated deferred split THPs, I did think about this
> > > approach before I came up with memcg aware approach. But, I thought this
> > > approach has some problems:
> > > 
> > > First of all, we can't prove if this is a universal win for the most
> > > workloads or not. For some workloads (as I mentioned about our usecase), 
> > > we
> > > do see a lot THPs accumulated for a while, but they are very short-lived 
> > > for
> > > other workloads, i.e. kernel build.
> > > 
> > > Secondly, it may be not fair for some workloads which don't generate too
> > > many deferred split THPs or those THPs are short-lived. Actually, the cpu
> > > time is abused by the excessive deferred split THPs generators, isn't it?
> > 
> > Yes this is indeed true. Do we have any idea on how much time that
> > actually is?
> 
> For the uncontended case, splitting 1G worth of pages (2MiB x 512) takes a bit
> more than 50 ms in my setup. But that's the best-case scenario: pages not shared
> across multiple processes, no contention on ptl, page lock, etc.

Any idea about a bad case?

> > > With memcg awareness, the deferred split THPs actually are isolated and
> > > capped by memcg. The long-lived deferred split THPs can't be accumulated 
> > > too
> > > many due to the limit of memcg. And, cpu time spent in splitting them 
> > > would
> > > just account to the memcgs who generate that many deferred split THPs, who
> > > generate them who pay for it. This sounds more fair and we could achieve
> > > much better isolation.
> > 
> > On the other hand, deferring the split and free up a non trivial amount
> > of memory is a problem I consider quite serious because it affects not
> > only the memcg workload which has to do the reclaim but also other
> > consumers of memory beucase large memory blocks could be used for higher
> > order allocations.
> 
> Maybe instead of driving the split from the number of pages on the queue, we
> can take a hint from compaction when it struggles to get high-order pages?

This is still unbounded in time.

> We can also try to use schedule_delayed_work() instead of plain
> schedule_work() to give short-lived pages a chance to get freed before a
> splitting attempt.


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-28 Thread Kirill A. Shutemov
On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > 
> > 
> > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > Unmapped completely pages will be freed with current code. 
> > > > > > > > > Deferred split
> > > > > > > > > only applies to partly mapped THPs: at least on 4k of the THP 
> > > > > > > > > is still
> > > > > > > > > mapped somewhere.
> > > > > > > > Hmm, I am probably misreading the code but at least current 
> > > > > > > > Linus' tree
> > > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ 
> > > > > > > > deferred_split_huge_page even
> > > > > > > > for fully mapped THP.
> > > > > > > Well, you read correctly, but it was not intended. I screwed it 
> > > > > > > up at some
> > > > > > > point.
> > > > > > > 
> > > > > > > See the patch below. It should make it work as intened.
> > > > > > > 
> > > > > > > It's not bug as such, but inefficientcy. We add page to the queue 
> > > > > > > where
> > > > > > > it's not needed.
> > > > > > But that adding to queue doesn't affect whether the page will be 
> > > > > > freed
> > > > > > immediately if there are no more partial mappings, right? I don't 
> > > > > > see
> > > > > > deferred_split_huge_page() pinning the page.
> > > > > > So your patch wouldn't make THPs freed immediately in cases where 
> > > > > > they
> > > > > > haven't been freed before immediately, it just fixes a minor
> > > > > > inefficiency with queue manipulation?
> > > > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped
> > > > > THPs really do not matter and what I have considered an odd case is
> > > > > really happening more often.
> > > > > 
> > > > > That being said this will not help at all for what Yang Shi is seeing
> > > > > and we need a more proactive deferred splitting as I've mentioned
> > > > > earlier.
> > > > It was not intended to fix the issue. It's fix for current logic. I'm
> > > > playing with the work approach now.
> > > Below is what I've come up with. It appears to be functional.
> > > 
> > > Any comments?
> > 
> > Thanks, Kirill and Michal. Doing split more proactive is definitely a choice
> > to eliminate huge accumulated deferred split THPs, I did think about this
> > approach before I came up with memcg aware approach. But, I thought this
> > approach has some problems:
> > 
> > First of all, we can't prove if this is a universal win for the most
> > workloads or not. For some workloads (as I mentioned about our usecase), we
> > do see a lot THPs accumulated for a while, but they are very short-lived for
> > other workloads, i.e. kernel build.
> > 
> > Secondly, it may be not fair for some workloads which don't generate too
> > many deferred split THPs or those THPs are short-lived. Actually, the cpu
> > time is abused by the excessive deferred split THPs generators, isn't it?
> 
> Yes this is indeed true. Do we have any idea on how much time that
> actually is?

For the uncontended case, splitting 1G worth of pages (2MiB x 512) takes a bit
more than 50 ms in my setup. But that's the best-case scenario: pages not shared
across multiple processes, no contention on ptl, page lock, etc.

> > With memcg awareness, the deferred split THPs actually are isolated and
> > capped by memcg. The long-lived deferred split THPs can't be accumulated too
> > many due to the limit of memcg. And, cpu time spent in splitting them would
> > just account to the memcgs who generate that many deferred split THPs, who
> > generate them who pay for it. This sounds more fair and we could achieve
> > much better isolation.
> 
> On the other hand, deferring the split and freeing up a non-trivial amount
> of memory is a problem I consider quite serious because it affects not
> only the memcg workload which has to do the reclaim but also other
> consumers of memory, because large memory blocks could be used for
> higher-order allocations.

Maybe instead of driving the split from the number of pages on the queue, we
can take a hint from compaction when it struggles to get high-order pages?

We can also try to use schedule_delayed_work() instead of plain
schedule_work() to give short-lived pages a chance to get freed before a
splitting attempt.
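
(Editorial sketch of that suggestion, not a posted patch: the scheduling
site from the patch earlier in this thread, switched to a delayed work
item. It assumes the pglist_data field deferred_split_work becomes a
struct delayed_work initialized with INIT_DELAYED_WORK(), that
flush_deferred_split_queue() recovers pgdata via to_delayed_work(), and
that the one-second delay is an arbitrary placeholder.)

void deferred_split_huge_page(struct page *page)
{
	struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
	unsigned long flags;

	VM_BUG_ON_PAGE(!PageTransHuge(page), page);

	spin_lock_irqsave(&pgdata->split_queue_lock, flags);
	if (list_empty(page_deferred_list(page))) {
		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
		list_add_tail(page_deferred_list(page), &pgdata->split_queue);
		pgdata->split_queue_len++;
		pgdata->deferred_split_calls++;
	}

	if (pgdata->deferred_split_calls > NR_CALLS_TO_SPLIT &&
	    pgdata->split_queue_len > NR_PAGES_ON_QUEUE_TO_SPLIT) {
		pgdata->deferred_split_calls = 0;
		/* Delay the flush so short-lived THPs can get freed
		 * (and dequeued) before we try to split them. */
		schedule_delayed_work(&pgdata->deferred_split_work, HZ);
	}
	spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
}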

> > And, I think the discussion is diverted and mislead by the number of
> > excessive deferred split THPs. To be clear, I didn't mean the excessive
> > deferred split THPs are problem for us (I agree it may waste memory to have
> > that many deferred split THPs not usable), the problem is the oom since they
> > couldn't be split by memcg limit reclaim 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-28 Thread Michal Hocko
On Tue 27-08-19 10:06:20, Yang Shi wrote:
> 
> 
> On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > Unmapped completely pages will be freed with current code. 
> > > > > > > > Deferred split
> > > > > > > > only applies to partly mapped THPs: at least on 4k of the THP 
> > > > > > > > is still
> > > > > > > > mapped somewhere.
> > > > > > > Hmm, I am probably misreading the code but at least current 
> > > > > > > Linus' tree
> > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ 
> > > > > > > deferred_split_huge_page even
> > > > > > > for fully mapped THP.
> > > > > > Well, you read correctly, but it was not intended. I screwed it up 
> > > > > > at some
> > > > > > point.
> > > > > > 
> > > > > > See the patch below. It should make it work as intened.
> > > > > > 
> > > > > > It's not bug as such, but inefficientcy. We add page to the queue 
> > > > > > where
> > > > > > it's not needed.
> > > > > But that adding to queue doesn't affect whether the page will be freed
> > > > > immediately if there are no more partial mappings, right? I don't see
> > > > > deferred_split_huge_page() pinning the page.
> > > > > So your patch wouldn't make THPs freed immediately in cases where they
> > > > > haven't been freed before immediately, it just fixes a minor
> > > > > inefficiency with queue manipulation?
> > > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped
> > > > THPs really do not matter and what I have considered an odd case is
> > > > really happening more often.
> > > > 
> > > > That being said this will not help at all for what Yang Shi is seeing
> > > > and we need a more proactive deferred splitting as I've mentioned
> > > > earlier.
> > > It was not intended to fix the issue. It's fix for current logic. I'm
> > > playing with the work approach now.
> > Below is what I've come up with. It appears to be functional.
> > 
> > Any comments?
> 
> Thanks, Kirill and Michal. Doing split more proactive is definitely a choice
> to eliminate huge accumulated deferred split THPs, I did think about this
> approach before I came up with memcg aware approach. But, I thought this
> approach has some problems:
> 
> First of all, we can't prove if this is a universal win for the most
> workloads or not. For some workloads (as I mentioned about our usecase), we
> do see a lot THPs accumulated for a while, but they are very short-lived for
> other workloads, i.e. kernel build.
> 
> Secondly, it may be not fair for some workloads which don't generate too
> many deferred split THPs or those THPs are short-lived. Actually, the cpu
> time is abused by the excessive deferred split THPs generators, isn't it?

Yes this is indeed true. Do we have any idea on how much time that
actually is?

> With memcg awareness, the deferred split THPs actually are isolated and
> capped by memcg. The long-lived deferred split THPs can't be accumulated too
> many due to the limit of memcg. And, cpu time spent in splitting them would
> just account to the memcgs who generate that many deferred split THPs, who
> generate them who pay for it. This sounds more fair and we could achieve
> much better isolation.

On the other hand, deferring the split and freeing up a non-trivial amount
of memory is a problem I consider quite serious because it affects not
only the memcg workload which has to do the reclaim but also other
consumers of memory, because large memory blocks could be used for
higher-order allocations.

> And, I think the discussion is diverted and mislead by the number of
> excessive deferred split THPs. To be clear, I didn't mean the excessive
> deferred split THPs are problem for us (I agree it may waste memory to have
> that many deferred split THPs not usable), the problem is the oom since they
> couldn't be split by memcg limit reclaim since the shrinker was not memcg
> aware.

Well, I would like to see how much of a problem the memcg OOM really is
once deferred splitting is more time-constrained. Maybe we will find
that no special memcg-aware solution is really needed.
-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Yang Shi




On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:

On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:

On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:

On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:

On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:

On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:

On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:

Completely unmapped pages will be freed with the current code. Deferred split
only applies to partly mapped THPs: at least one 4k piece of the THP is still
mapped somewhere.

Hmm, I am probably misreading the code but at least current Linus' tree
reads page_remove_rmap -> page_remove_anon_compound_rmap ->
deferred_split_huge_page even for fully mapped THP.

Well, you read correctly, but it was not intended. I screwed it up at some
point.

See the patch below. It should make it work as intended.

It's not a bug as such, but an inefficiency. We add the page to the queue where
it's not needed.

But that adding to queue doesn't affect whether the page will be freed
immediately if there are no more partial mappings, right? I don't see
deferred_split_huge_page() pinning the page.
So your patch wouldn't make THPs freed immediately in cases where they
haven't been freed before immediately, it just fixes a minor
inefficiency with queue manipulation?

Ohh, right. I can see that in free_transhuge_page now. So fully mapped
THPs really do not matter and what I have considered an odd case is
really happening more often.

That being said this will not help at all for what Yang Shi is seeing
and we need a more proactive deferred splitting as I've mentioned
earlier.

It was not intended to fix the issue. It's fix for current logic. I'm
playing with the work approach now.

Below is what I've come up with. It appears to be functional.

Any comments?


Thanks, Kirill and Michal. Doing the split more proactively is definitely
one way to eliminate huge accumulations of deferred split THPs; I did think
about this approach before I came up with the memcg-aware approach. But I
thought this approach has some problems:


First of all, we can't prove whether this is a universal win for most
workloads or not. For some workloads (as I mentioned about our usecase)
we do see a lot of THPs accumulate for a while, but they are very
short-lived for other workloads, e.g. a kernel build.


Secondly, it may not be fair to workloads which don't generate too many
deferred split THPs, or whose THPs are short-lived. Actually, the CPU
time is abused by the generators of excessive deferred split THPs,
isn't it?


With memcg awareness, the deferred split THPs actually are isolated and
capped by the memcg. Long-lived deferred split THPs can't accumulate
beyond the memcg's limit. And the CPU time spent splitting them would
just be accounted to the memcgs that generate that many deferred split
THPs: whoever generates them pays for it. This sounds more fair, and we
could achieve much better isolation.
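
(Editorial sketch of what the memcg-aware direction implies at the
shrinker level, reusing the shrinker already defined in mm/huge_memory.c;
SHRINKER_MEMCG_AWARE is an existing shrinker flag, but the per-memcg
split queue that count/scan would have to walk is assumed here, not
shown, and this is not the posted patch.)

static struct shrinker deferred_split_shrinker = {
	.count_objects = deferred_split_count,
	.scan_objects = deferred_split_scan,
	.seeks = DEFAULT_SEEKS,
	/* Let memcg limit reclaim reach the deferred split THPs of the
	 * offending cgroup instead of only global/node reclaim. */
	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};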


And, I think the discussion is being diverted and misled by the number of
excessive deferred split THPs. To be clear, I didn't mean the excessive
deferred split THPs are the problem for us (I agree it may waste memory to
have that many deferred split THPs unusable); the problem is the OOM,
since they couldn't be split by memcg limit reclaim because the shrinker
was not memcg aware.




diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..c576e9d772b7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,6 +758,8 @@ typedef struct pglist_data {
spinlock_t split_queue_lock;
struct list_head split_queue;
unsigned long split_queue_len;
+   unsigned int deferred_split_calls;
+   struct work_struct deferred_split_work;
  #endif
  
  	/* Fields commonly accessed by the page reclaim scanner */

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..12d109bbe8ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2820,22 +2820,6 @@ void free_transhuge_page(struct page *page)
free_compound_page(page);
  }
  
-void deferred_split_huge_page(struct page *page)
-{
-   struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
-   unsigned long flags;
-
-   VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-
-   spin_lock_irqsave(&pgdata->split_queue_lock, flags);
-   if (list_empty(page_deferred_list(page))) {
-   count_vm_event(THP_DEFERRED_SPLIT_PAGE);
-   list_add_tail(page_deferred_list(page), &pgdata->split_queue);
-   pgdata->split_queue_len++;
-   }
-   spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
-}
-
  static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc)
  {
@@ -2901,6 +2885,44 @@ static struct shrinker deferred_split_shrinker = {
.flags = SHRINKER_NUMA_AWARE,
  };
  
+void flush_deferred_split_queue(struct work_struct *work)
+{
+   struct pglist_data *pgdata;
+   

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Kirill A. Shutemov
On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > > >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > >>>
> > > >>> Unmapped completely pages will be freed with current code. Deferred 
> > > >>> split
> > > >>> only applies to partly mapped THPs: at least on 4k of the THP is still
> > > >>> mapped somewhere.
> > > >>
> > > >> Hmm, I am probably misreading the code but at least current Linus' tree
> > > >> reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ 
> > > >> deferred_split_huge_page even
> > > >> for fully mapped THP.
> > > > 
> > > > Well, you read correctly, but it was not intended. I screwed it up at 
> > > > some
> > > > point.
> > > > 
> > > > See the patch below. It should make it work as intened.
> > > > 
> > > > It's not bug as such, but inefficientcy. We add page to the queue where
> > > > it's not needed.
> > > 
> > > But that adding to queue doesn't affect whether the page will be freed
> > > immediately if there are no more partial mappings, right? I don't see
> > > deferred_split_huge_page() pinning the page.
> > > So your patch wouldn't make THPs freed immediately in cases where they
> > > haven't been freed before immediately, it just fixes a minor
> > > inefficiency with queue manipulation?
> > 
> > Ohh, right. I can see that in free_transhuge_page now. So fully mapped
> > THPs really do not matter and what I have considered an odd case is
> > really happening more often.
> > 
> > That being said this will not help at all for what Yang Shi is seeing
> > and we need a more proactive deferred splitting as I've mentioned
> > earlier.
> 
> It was not intended to fix the issue. It's fix for current logic. I'm
> playing with the work approach now.

Below is what I've come up with. It appears to be functional.

Any comments?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..c576e9d772b7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,6 +758,8 @@ typedef struct pglist_data {
spinlock_t split_queue_lock;
struct list_head split_queue;
unsigned long split_queue_len;
+   unsigned int deferred_split_calls;
+   struct work_struct deferred_split_work;
 #endif
 
/* Fields commonly accessed by the page reclaim scanner */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..12d109bbe8ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2820,22 +2820,6 @@ void free_transhuge_page(struct page *page)
free_compound_page(page);
 }
 
-void deferred_split_huge_page(struct page *page)
-{
-   struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
-   unsigned long flags;
-
-   VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-
-   spin_lock_irqsave(&pgdata->split_queue_lock, flags);
-   if (list_empty(page_deferred_list(page))) {
-   count_vm_event(THP_DEFERRED_SPLIT_PAGE);
-   list_add_tail(page_deferred_list(page), &pgdata->split_queue);
-   pgdata->split_queue_len++;
-   }
-   spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
-}
-
 static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc)
 {
@@ -2901,6 +2885,44 @@ static struct shrinker deferred_split_shrinker = {
.flags = SHRINKER_NUMA_AWARE,
 };
 
+void flush_deferred_split_queue(struct work_struct *work)
+{
+   struct pglist_data *pgdata;
+   struct shrink_control sc;
+
+   pgdata = container_of(work, struct pglist_data, deferred_split_work);
+   sc.nid = pgdata->node_id;
+   sc.nr_to_scan = 0; /* Unlimited */
+
+   deferred_split_scan(NULL, &sc);
+}
+
+#define NR_CALLS_TO_SPLIT 32
+#define NR_PAGES_ON_QUEUE_TO_SPLIT 16
+
+void deferred_split_huge_page(struct page *page)
+{
+   struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
+   unsigned long flags;
+
+   VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+   spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+   if (list_empty(page_deferred_list(page))) {
+   count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+   list_add_tail(page_deferred_list(page), &pgdata->split_queue);
+   pgdata->split_queue_len++;
+   pgdata->deferred_split_calls++;
+   }
+
+   if (pgdata->deferred_split_calls > NR_CALLS_TO_SPLIT &&
+   pgdata->split_queue_len > NR_PAGES_ON_QUEUE_TO_SPLIT) {
+   pgdata->deferred_split_calls = 0;
+   schedule_work(&pgdata->deferred_split_work);
+   }
+   spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int split_huge_pages_set(void *data, u64 val)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c9194959271..86af66d463e9 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Kirill A. Shutemov
On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > >>>
> > >>> Unmapped completely pages will be freed with current code. Deferred 
> > >>> split
> > >>> only applies to partly mapped THPs: at least on 4k of the THP is still
> > >>> mapped somewhere.
> > >>
> > >> Hmm, I am probably misreading the code but at least current Linus' tree
> > >> reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ 
> > >> deferred_split_huge_page even
> > >> for fully mapped THP.
> > > 
> > > Well, you read correctly, but it was not intended. I screwed it up at some
> > > point.
> > > 
> > > See the patch below. It should make it work as intened.
> > > 
> > > It's not bug as such, but inefficientcy. We add page to the queue where
> > > it's not needed.
> > 
> > But that adding to queue doesn't affect whether the page will be freed
> > immediately if there are no more partial mappings, right? I don't see
> > deferred_split_huge_page() pinning the page.
> > So your patch wouldn't make THPs freed immediately in cases where they
> > haven't been freed before immediately, it just fixes a minor
> > inefficiency with queue manipulation?
> 
> Ohh, right. I can see that in free_transhuge_page now. So fully mapped
> THPs really do not matter and what I have considered an odd case is
> really happening more often.
> 
> That being said this will not help at all for what Yang Shi is seeing
> and we need a more proactive deferred splitting as I've mentioned
> earlier.

It was not intended to fix the issue. It's a fix for the current logic. I'm
playing with the work approach now.

-- 
 Kirill A. Shutemov


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Michal Hocko
On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> >>>
> >>> Unmapped completely pages will be freed with current code. Deferred split
> >>> only applies to partly mapped THPs: at least on 4k of the THP is still
> >>> mapped somewhere.
> >>
> >> Hmm, I am probably misreading the code but at least current Linus' tree
> >> reads page_remove_rmap -> page_remove_anon_compound_rmap ->
> >> deferred_split_huge_page even for fully mapped THP.
> > 
> > Well, you read correctly, but it was not intended. I screwed it up at some
> > point.
> > 
> > See the patch below. It should make it work as intened.
> > 
> > It's not bug as such, but inefficientcy. We add page to the queue where
> > it's not needed.
> 
> But that adding to queue doesn't affect whether the page will be freed
> immediately if there are no more partial mappings, right? I don't see
> deferred_split_huge_page() pinning the page.
> So your patch wouldn't make THPs freed immediately in cases where they
> haven't been freed before immediately, it just fixes a minor
> inefficiency with queue manipulation?

Ohh, right. I can see that in free_transhuge_page now. So fully mapped
THPs really do not matter, and what I had considered an odd case is
actually happening more often.
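
(Editorial aside, for readers following along: roughly what the free path
referred to here looked like at the time -- paraphrased from memory, not
copied verbatim from any particular tree. A THP whose last mapping goes
away is pulled off the deferred split queue right before it is freed,
which is why queueing it never delays the actual freeing.)

void free_transhuge_page(struct page *page)
{
	struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
	unsigned long flags;

	spin_lock_irqsave(&pgdata->split_queue_lock, flags);
	if (!list_empty(page_deferred_list(page))) {
		/* Still queued for deferred split: just drop it. */
		pgdata->split_queue_len--;
		list_del(page_deferred_list(page));
	}
	spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
	free_compound_page(page);
}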

That being said this will not help at all for what Yang Shi is seeing
and we need a more proactive deferred splitting as I've mentioned
earlier.

-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Vlastimil Babka
On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
>> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
>>>
>>> Unmapped completely pages will be freed with current code. Deferred split
>>> only applies to partly mapped THPs: at least on 4k of the THP is still
>>> mapped somewhere.
>>
>> Hmm, I am probably misreading the code but at least current Linus' tree
>> reads page_remove_rmap -> page_remove_anon_compound_rmap ->
>> deferred_split_huge_page even for fully mapped THP.
> 
> Well, you read correctly, but it was not intended. I screwed it up at some
> point.
> 
> See the patch below. It should make it work as intended.
> 
> It's not a bug as such, but an inefficiency. We add the page to the queue where
> it's not needed.

But that adding to the queue doesn't affect whether the page will be freed
immediately if there are no more partial mappings, right? I don't see
deferred_split_huge_page() pinning the page.
So your patch wouldn't make THPs be freed immediately in cases where they
weren't freed immediately before; it just fixes a minor inefficiency with
queue manipulation?


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Michal Hocko
On Tue 27-08-19 14:02:10, Kirill A. Shutemov wrote:
> On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote:
> > > > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote:
> > > > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> > > > > > On 8/22/19 10:04 AM, Michal Hocko wrote:
> > > > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > > > > > >> Available memory is one of the most important metrics for memory
> > > > > > >> pressure.
> > > > > > > 
> > > > > > > I would disagree with this statement. It is a rough estimate that 
> > > > > > > tells
> > > > > > > how much memory you can allocate before going into a more 
> > > > > > > expensive
> > > > > > > reclaim (mostly swapping). Allocating that amount still might 
> > > > > > > result in
> > > > > > > direct reclaim induced stalls. I do realize that this is simple 
> > > > > > > metric
> > > > > > > that is attractive to use and works in many cases though.
> > > > > > > 
> > > > > > >> Currently, the deferred split THPs are not accounted into
> > > > > > >> available memory, but they are reclaimable actually, like 
> > > > > > >> reclaimable
> > > > > > >> slabs.
> > > > > > >> 
> > > > > > >> And, they seems very common with the common workloads when THP is
> > > > > > >> enabled.  A simple run with MariaDB test of mmtest with THP 
> > > > > > >> enabled as
> > > > > > >> always shows it could generate over fifteen thousand deferred 
> > > > > > >> split THPs
> > > > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for 
> > > > > > >> my VM).
> > > > > > >> It looks worth accounting in MemAvailable.
> > > > > > > 
> > > > > > > OK, this makes sense. But your above numbers are really worrying.
> > > > > > > Accumulating such a large amount of pages that are likely not 
> > > > > > > going to
> > > > > > > be used is really bad. They are essentially blocking any higher 
> > > > > > > order
> > > > > > > allocations and also push the system towards more memory pressure.
> > > > > > > 
> > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking 
> > > > > > > issues
> > > > > > > during splitting, right? This is not really an optimization to 
> > > > > > > cache
> > > > > > > THPs for reuse or something like that. What is the reason this is 
> > > > > > > not
> > > > > > > done from a worker context? At least THPs which would be freed
> > > > > > > completely sound like a good candidate for kworker tear down, no?
> > > > > > 
> > > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker 
> > > > > > approach we
> > > > > > also wouldn't need the cgroup awareness?
> > > > > 
> > > > > I don't remember a particular locking issue, but I cannot say there's
> > > > > none :P
> > > > > 
> > > > > It's artifact from decoupling PMD split from compound page split: the 
> > > > > same
> > > > > page can be mapped multiple times with combination of PMDs and PTEs. 
> > > > > Split
> > > > > of one PMD doesn't need to trigger split of all PMDs and underlying
> > > > > compound page.
> > > > > 
> > > > > Other consideration is the fact that page split can fail and we need 
> > > > > to
> > > > > have fallback for this case.
> > > > > 
> > > > > Also in most cases THP split would be just waste of time if we would 
> > > > > do
> > > > > them at the spot. If you don't have memory pressure it's better to 
> > > > > wait
> > > > > until process termination: less pages on LRU is still beneficial.
> > > > 
> > > > This might be true but the reality shows that a lot of THPs might be
> > > > waiting for the memory pressure that is essentially freeable on the
> > > > spot. So I am not really convinced that "less pages on LRUs" is really a
> > > > plausible justification. Can we free at least those THPs which are
> > > > unmapped completely without any pte mappings?
> > > 
> > > Unmapped completely pages will be freed with current code. Deferred split
> > > only applies to partly mapped THPs: at least on 4k of the THP is still
> > > mapped somewhere.
> > 
> > Hmm, I am probably misreading the code but at least current Linus' tree
> > reads page_remove_rmap -> page_remove_anon_compound_rmap ->
> > deferred_split_huge_page even for fully mapped THP.
> 
> Well, you read correctly, but it was not intended. I screwed it up at some
> point.
> 
> See the patch below. It should make it work as intended.

OK, this would indeed be much better. I was really under the
impression that the deferred splitting is required due to locking.

Anyway, this should take care of the most common usecase. If we can have
the odd cases of partially mapped THPs handled deferred, then maybe we
do not really need the whole memcg-aware deferred shrinker and other
complications. So let's see.
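
(Editorial illustration of how such a partially mapped THP can arise from
plain userspace, e.g. mprotect() splitting the VMA and then part of the
range being unmapped. The 2 MiB PMD size is an x86-64 assumption, and
whether a THP is actually faulted in depends on the system's THP settings;
this is a sketch, not taken from the thread.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PMD_SZ (2UL << 20)	/* 2 MiB (x86-64 assumption) */

int main(void)
{
	/* Over-allocate so we can carve out one PMD-aligned 2 MiB chunk. */
	char *raw = mmap(NULL, 2 * PMD_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	char *thp = (char *)(((unsigned long)raw + PMD_SZ - 1) & ~(PMD_SZ - 1));

	madvise(thp, PMD_SZ, MADV_HUGEPAGE);
	memset(thp, 1, PMD_SZ);		/* hopefully faults in one THP */

	/* Splits the VMA; the huge PMD is split into PTEs as a side
	 * effect, but the compound page itself stays intact. */
	mprotect(thp, 4096, PROT_READ);

	/* Dropping one 4 KiB piece leaves the THP partially mapped --
	 * this is what ends up on the deferred split queue. */
	munmap(thp, 4096);

	getchar();	/* pause here and look at e.g. /proc/vmstat */
	return 0;
}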
-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Kirill A. Shutemov
On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote:
> > > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote:
> > > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> > > > > On 8/22/19 10:04 AM, Michal Hocko wrote:
> > > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > > > > >> Available memory is one of the most important metrics for memory
> > > > > >> pressure.
> > > > > > 
> > > > > > I would disagree with this statement. It is a rough estimate that 
> > > > > > tells
> > > > > > how much memory you can allocate before going into a more expensive
> > > > > > reclaim (mostly swapping). Allocating that amount still might 
> > > > > > result in
> > > > > > direct reclaim induced stalls. I do realize that this is simple 
> > > > > > metric
> > > > > > that is attractive to use and works in many cases though.
> > > > > > 
> > > > > >> Currently, the deferred split THPs are not accounted into
> > > > > >> available memory, but they are reclaimable actually, like 
> > > > > >> reclaimable
> > > > > >> slabs.
> > > > > >> 
> > > > > >> And, they seems very common with the common workloads when THP is
> > > > > >> enabled.  A simple run with MariaDB test of mmtest with THP 
> > > > > >> enabled as
> > > > > >> always shows it could generate over fifteen thousand deferred 
> > > > > >> split THPs
> > > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my 
> > > > > >> VM).
> > > > > >> It looks worth accounting in MemAvailable.
> > > > > > 
> > > > > > OK, this makes sense. But your above numbers are really worrying.
> > > > > > Accumulating such a large amount of pages that are likely not going 
> > > > > > to
> > > > > > be used is really bad. They are essentially blocking any higher 
> > > > > > order
> > > > > > allocations and also push the system towards more memory pressure.
> > > > > > 
> > > > > > IIUC deferred splitting is mostly a workaround for nasty locking 
> > > > > > issues
> > > > > > during splitting, right? This is not really an optimization to cache
> > > > > > THPs for reuse or something like that. What is the reason this is 
> > > > > > not
> > > > > > done from a worker context? At least THPs which would be freed
> > > > > > completely sound like a good candidate for kworker tear down, no?
> > > > > 
> > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker 
> > > > > approach we
> > > > > also wouldn't need the cgroup awareness?
> > > > 
> > > > I don't remember a particular locking issue, but I cannot say there's
> > > > none :P
> > > > 
> > > > It's an artifact from decoupling PMD split from compound page split:
> > > > the same page can be mapped multiple times with a combination of PMDs
> > > > and PTEs. Split of one PMD doesn't need to trigger split of all PMDs
> > > > and the underlying compound page.
> > > > 
> > > > Another consideration is the fact that page split can fail and we need
> > > > to have a fallback for this case.
> > > > 
> > > > Also, in most cases THP split would be just a waste of time if we did
> > > > it on the spot. If you don't have memory pressure it's better to wait
> > > > until process termination: less pages on LRU is still beneficial.
> > > 
> > > This might be true but the reality shows that a lot of THPs that are
> > > essentially freeable on the spot might be waiting for memory pressure.
> > > So I am not really convinced that "less pages on LRUs" is really a
> > > plausible justification. Can we free at least those THPs which are
> > > unmapped completely without any pte mappings?
> > 
> > Completely unmapped pages will be freed with the current code. Deferred
> > split only applies to partly mapped THPs: at least one 4k page of the
> > THP is still mapped somewhere.
> 
> Hmm, I am probably misreading the code but at least current Linus' tree
> reads page_remove_rmap -> page_remove_anon_compound_rmap ->
> deferred_split_huge_page even for fully mapped THP.

Well, you read correctly, but it was not intended. I screwed it up at some
point.

See the patch below. It should make it work as intended.

It's not a bug as such, but an inefficiency. We add the page to the queue
where it's not needed.
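
(The patch itself is not reproduced here; a minimal sketch of the kind of
change being described -- queue only still partly mapped THPs for deferred
split -- might look like the following, assuming the check sits at the end of
page_remove_anon_compound_rmap(); the actual hunk may differ.)

	/* with nr being the number of small pages that were just unmapped */
	if (nr) {
		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);

		/*
		 * Queue the compound page for deferred split only if it is
		 * still partly mapped; a fully unmapped THP gets freed
		 * directly and does not need to sit on the queue.
		 */
		if (nr < HPAGE_PMD_NR)
			deferred_split_huge_page(page);
	}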

> > > > The main source of partly mapped THPs is the exit path. When the PMD
> > > > mapping of a THP got split across multiple VMAs (for instance due to
> > > > mprotect()), in the exit path we unmap PTEs belonging to one VMA just
> > > > before unmapping the rest of the page. It would be a total waste of
> > > > time to split the page in this scenario.
> > > > 
> > > > The whole deferred split thing still looks like a reasonable
> > > > compromise to me.
> > > 
> > > Even when it leads to all other problems mentioned in this and memcg
> > > deferred reclaim series?
> > 
> > Yes.
> > 
> > You would still need deferred split even if you *try* to split the page on
> > the spot. 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Michal Hocko
On Tue 27-08-19 11:32:15, Kirill A. Shutemov wrote:
> On Tue, Aug 27, 2019 at 07:59:41AM +0200, Michal Hocko wrote:
> > > > > > IIUC deferred splitting is mostly a workaround for nasty locking 
> > > > > > issues
> > > > > > during splitting, right? This is not really an optimization to cache
> > > > > > THPs for reuse or something like that. What is the reason this is 
> > > > > > not
> > > > > > done from a worker context? At least THPs which would be freed
> > > > > > completely sound like a good candidate for kworker tear down, no?
> > > > > Yes, deferred split THP was introduced to avoid locking issues 
> > > > > according to
> > > > > the document. Memcg awareness would help to trigger the shrinker more 
> > > > > often.
> > > > > 
> > > > > I think it could be done in a worker context, but when to trigger the
> > > > > worker is a subtle problem.
> > > > Why? What is the problem to trigger it after unmap of a batch worth of
> > > > THPs?
> > > 
> > > This leads to another question: how many THPs are "a batch worth"?
> > 
> > Some arbitrary reasonable number. A few dozen THPs waiting for split
> > are no big deal. Going into GBs, as you pointed out above, is definitely
> > a problem.
> 
> This will not work if these GBs worth of THPs are pinned (like with
> RDMA).

Yes, but this is the case we cannot do anything about in any deferred
scheme unless we hook into the unpinning call path. We might get there
eventually with the newly forming API.

> We can kick the deferred split every N calls of deferred_split_huge_page()
> if more than M pages are queued or something.

Yes, that sounds reasonable to me. N can be a few dozen THPs. An
explicit flush API after unmap is done would be helpful as well.
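
(A rough sketch of what such a batched kick plus flush API could look like;
the threshold, the work item and the helper names below are illustrative
assumptions, not part of any posted patch.)

#define DEFERRED_SPLIT_BATCH	64	/* "a few dozen" THPs, arbitrary */

static void deferred_split_work_fn(struct work_struct *work)
{
	/* Walk the deferred split queue and try split_huge_page() on each
	 * entry; entries that fail to split (e.g. pinned pages) simply stay
	 * queued for the shrinker to retry under memory pressure. */
}
static DECLARE_WORK(deferred_split_work, deferred_split_work_fn);

/* At the end of deferred_split_huge_page(), once the page is queued: */
	if (ds_queue->split_queue_len >= DEFERRED_SPLIT_BATCH)
		schedule_work(&deferred_split_work);

/* And the explicit flush that callers could invoke after a large unmap: */
void deferred_split_flush(void)
{
	flush_work(&deferred_split_work);
}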

> Do we want to kick it again after some time if split from deferred queue
> has failed?

I wouldn't mind to have reclaim path do the fallback and see how that 

-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Kirill A. Shutemov
On Tue, Aug 27, 2019 at 07:59:41AM +0200, Michal Hocko wrote:
> > > > > IIUC deferred splitting is mostly a workaround for nasty locking 
> > > > > issues
> > > > > during splitting, right? This is not really an optimization to cache
> > > > > THPs for reuse or something like that. What is the reason this is not
> > > > > done from a worker context? At least THPs which would be freed
> > > > > completely sound like a good candidate for kworker tear down, no?
> > > > Yes, deferred split THP was introduced to avoid locking issues 
> > > > according to
> > > > the document. Memcg awareness would help to trigger the shrinker more 
> > > > often.
> > > > 
> > > > I think it could be done in a worker context, but when to trigger the
> > > > worker is a subtle problem.
> > > Why? What is the problem to trigger it after unmap of a batch worth of
> > > THPs?
> > 
> > This leads to another question: how many THPs are "a batch worth"?
> 
> Some arbitrary reasonable number. A few dozen THPs waiting for split
> are no big deal. Going into GBs, as you pointed out above, is definitely
> a problem.

This will not work if these GBs worth of THPs are pinned (like with
RDMA).

We can kick the deferred split every N calls of deferred_split_huge_page()
if more than M pages are queued or something.

Do we want to kick it again after some time if split from deferred queue
has failed?

The check whether the page is splittable is not exactly free, so everything
has trade-offs.

-- 
 Kirill A. Shutemov


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Michal Hocko
On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote:
> > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote:
> > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> > > > On 8/22/19 10:04 AM, Michal Hocko wrote:
> > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > > > >> Available memory is one of the most important metrics for memory
> > > > >> pressure.
> > > > > 
> > > > > I would disagree with this statement. It is a rough estimate that 
> > > > > tells
> > > > > how much memory you can allocate before going into a more expensive
> > > > > reclaim (mostly swapping). Allocating that amount still might result 
> > > > > in
> > > > > direct reclaim induced stalls. I do realize that this is simple metric
> > > > > that is attractive to use and works in many cases though.
> > > > > 
> > > > >> Currently, the deferred split THPs are not accounted into
> > > > >> available memory, but they are reclaimable actually, like reclaimable
> > > > >> slabs.
> > > > >> 
> > > > >> And, they seem very common with the common workloads when THP is
> > > > >> enabled.  A simple run with MariaDB test of mmtest with THP enabled 
> > > > >> as
> > > > >> always shows it could generate over fifteen thousand deferred split 
> > > > >> THPs
> > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my 
> > > > >> VM).
> > > > >> It looks worth accounting in MemAvailable.
> > > > > 
> > > > > OK, this makes sense. But your above numbers are really worrying.
> > > > > Accumulating such a large amount of pages that are likely not going to
> > > > > be used is really bad. They are essentially blocking any higher order
> > > > > allocations and also push the system towards more memory pressure.
> > > > > 
> > > > > IIUC deferred splitting is mostly a workaround for nasty locking 
> > > > > issues
> > > > > during splitting, right? This is not really an optimization to cache
> > > > > THPs for reuse or something like that. What is the reason this is not
> > > > > done from a worker context? At least THPs which would be freed
> > > > > completely sound like a good candidate for kworker tear down, no?
> > > > 
> > > > Agreed that it's a good question. For Kirill :) Maybe with kworker 
> > > > approach we
> > > > also wouldn't need the cgroup awareness?
> > > 
> > > I don't remember a particular locking issue, but I cannot say there's
> > > none :P
> > > 
> > > It's an artifact from decoupling PMD split from compound page split:
> > > the same page can be mapped multiple times with a combination of PMDs
> > > and PTEs. Split of one PMD doesn't need to trigger split of all PMDs
> > > and the underlying compound page.
> > > 
> > > Another consideration is the fact that page split can fail and we need
> > > to have a fallback for this case.
> > > 
> > > Also, in most cases THP split would be just a waste of time if we did
> > > it on the spot. If you don't have memory pressure it's better to wait
> > > until process termination: less pages on LRU is still beneficial.
> > 
> > This might be true but the reality shows that a lot of THPs that are
> > essentially freeable on the spot might be waiting for memory pressure.
> > So I am not really convinced that "less pages on LRUs" is really a
> > plausible justification. Can we free at least those THPs which are
> > unmapped completely without any pte mappings?
> 
> Completely unmapped pages will be freed with the current code. Deferred
> split only applies to partly mapped THPs: at least one 4k page of the
> THP is still mapped somewhere.

Hmm, I am probably misreading the code but at least current Linus' tree
reads page_remove_rmap -> page_remove_anon_compound_rmap ->
deferred_split_huge_page even for fully mapped THP.

> > > The main source of partly mapped THPs is the exit path. When the PMD
> > > mapping of a THP got split across multiple VMAs (for instance due to
> > > mprotect()), in the exit path we unmap PTEs belonging to one VMA just
> > > before unmapping the rest of the page. It would be a total waste of
> > > time to split the page in this scenario.
> > > 
> > > The whole deferred split thing still looks like a reasonable
> > > compromise to me.
> > 
> > Even when it leads to all other problems mentioned in this and memcg
> > deferred reclaim series?
> 
> Yes.
> 
> You would still need deferred split even if you *try* to split the page on
> the spot. split_huge_page() can fail (due to a pin on the page) and you
> will need to have a way to try again later.
> 
> You'll not win anything in complexity by trying split_huge_page()
> immediately. I would argue you'll create much more complexity.

I am not arguing for an in-place split. I am arguing to do it ASAP rather
than to wait for memory pressure, which might come after an unbounded
amount of time. So let me ask again: why cannot we do that in the worker
context? Essentially schedule the work item right away?

> > > We may have some kind of watermark and try to keep the 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-27 Thread Michal Hocko
On Mon 26-08-19 21:27:38, Yang Shi wrote:
> 
> 
> On 8/26/19 12:43 AM, Michal Hocko wrote:
> > On Thu 22-08-19 08:33:40, Yang Shi wrote:
> > > 
> > > On 8/22/19 1:04 AM, Michal Hocko wrote:
> > > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > [...]
> > > > > And, they seem very common with the common workloads when THP is
> > > > > enabled.  A simple run with MariaDB test of mmtest with THP enabled as
> > > > > always shows it could generate over fifteen thousand deferred split 
> > > > > THPs
> > > > > (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
> > > > > It looks worth accounting in MemAvailable.
> > > > OK, this makes sense. But your above numbers are really worrying.
> > > > Accumulating such a large amount of pages that are likely not going to
> > > > be used is really bad. They are essentially blocking any higher order
> > > > allocations and also push the system towards more memory pressure.
> > > That is an accumulated number; during the run of the test, some of them
> > > were already freed by the shrinker. IOW, it should not reach that much
> > > at any given time.
> > Then the above description is highly misleading. What is the actual
> > number of lingering THPs that wait for the memory pressure in the peak?
> 
> By rerunning the sysbench mariadb test of mmtest, I didn't see too many
> THPs at the peak. I saw around 2K THPs sometimes on my VM with 40G memory.
> But they were short-lived (they should be freed when the test exits). And,
> the number of accumulated THPs is variable.
> 
> And, this reminded me to go back and double check our internal bug report
> which led to the "make deferred split shrinker memcg aware" patchset.
> 
> In that case, a mysql instance with a real production load was running in
> a memcg with a ~86G limit, and the number of deferred split THPs could
> reach ~68G (~34K deferred split THPs) in a few hours. The deferred split
> THP shrinker was not invoked since global memory pressure was still fine
> (the host has 256G memory), but memcg limit reclaim was triggered.
> 
> And, I can't tell if all those deferred split THPs came from mysql or not,
> since there were some other processes running in that container too,
> according to the oom log.
> 
> I will update the commit log with more solid data from the production
> environment.

This is a very useful information. Thanks!

> > > > IIUC deferred splitting is mostly a workaround for nasty locking issues
> > > > during splitting, right? This is not really an optimization to cache
> > > > THPs for reuse or something like that. What is the reason this is not
> > > > done from a worker context? At least THPs which would be freed
> > > > completely sound like a good candidate for kworker tear down, no?
> > > Yes, deferred split THP was introduced to avoid locking issues according 
> > > to
> > > the document. Memcg awareness would help to trigger the shrinker more 
> > > often.
> > > 
> > > I think it could be done in a worker context, but when to trigger the
> > > worker is a subtle problem.
> > Why? What is the problem to trigger it after unmap of a batch worth of
> > THPs?
> 
> This leads to another question: how many THPs are "a batch worth"?

Some arbitrary reasonable number. A few dozen THPs waiting for split
are no big deal. Going into GBs, as you pointed out above, is definitely a
problem.

-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-26 Thread Yang Shi




On 8/26/19 12:43 AM, Michal Hocko wrote:

On Thu 22-08-19 08:33:40, Yang Shi wrote:


On 8/22/19 1:04 AM, Michal Hocko wrote:

On Thu 22-08-19 01:55:25, Yang Shi wrote:

[...]

And, they seem very common with the common workloads when THP is
enabled.  A simple run with MariaDB test of mmtest with THP enabled as
always shows it could generate over fifteen thousand deferred split THPs
(accumulated around 30G in one hour run, 75% of 40G memory for my VM).
It looks worth accounting in MemAvailable.

OK, this makes sense. But your above numbers are really worrying.
Accumulating such a large amount of pages that are likely not going to
be used is really bad. They are essentially blocking any higher order
allocations and also push the system towards more memory pressure.

That is an accumulated number; during the run of the test, some of them
were already freed by the shrinker. IOW, it should not reach that much at
any given time.

Then the above description is highly misleading. What is the actual
number of lingering THPs that wait for the memory pressure in the peak?


By rerunning the sysbench mariadb test of mmtest, I didn't see too many
THPs at the peak. I saw around 2K THPs sometimes on my VM with 40G memory.
But they were short-lived (they should be freed when the test exits). And,
the number of accumulated THPs is variable.


And, this reminded me to go back and double check our internal bug report
which led to the "make deferred split shrinker memcg aware" patchset.


In that case, a mysql instance with a real production load was running in
a memcg with a ~86G limit, and the number of deferred split THPs could
reach ~68G (~34K deferred split THPs) in a few hours. The deferred split
THP shrinker was not invoked since global memory pressure was still fine
(the host has 256G memory), but memcg limit reclaim was triggered.


And, I can't tell if all those deferred split THPs came from mysql or
not, since there were some other processes running in that container too,
according to the oom log.


I will update the commit log with more solid data from the production
environment.


  

IIUC deferred splitting is mostly a workaround for nasty locking issues
during splitting, right? This is not really an optimization to cache
THPs for reuse or something like that. What is the reason this is not
done from a worker context? At least THPs which would be freed
completely sound like a good candidate for kworker tear down, no?

Yes, deferred split THP was introduced to avoid locking issues according to
the document. Memcg awareness would help to trigger the shrinker more often.

I think it could be done in a worker context, but when to trigger the
worker is a subtle problem.

Why? What is the problem to trigger it after unmap of a batch worth of
THPs?


This leads to another question: how many THPs are "a batch worth"?
And, they may be short-lived, as shown by Kirill's example; we can't
tell in advance how long the THPs' lifetime is. We may waste CPU cycles
doing something unneeded.





Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-26 Thread Kirill A. Shutemov
On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote:
> On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote:
> > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> > > On 8/22/19 10:04 AM, Michal Hocko wrote:
> > > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > > >> Available memory is one of the most important metrics for memory
> > > >> pressure.
> > > > 
> > > > I would disagree with this statement. It is a rough estimate that tells
> > > > how much memory you can allocate before going into a more expensive
> > > > reclaim (mostly swapping). Allocating that amount still might result in
> > > > direct reclaim induced stalls. I do realize that this is simple metric
> > > > that is attractive to use and works in many cases though.
> > > > 
> > > >> Currently, the deferred split THPs are not accounted into
> > > >> available memory, but they are reclaimable actually, like reclaimable
> > > >> slabs.
> > > >> 
> > > >> And, they seem very common with the common workloads when THP is
> > > >> enabled.  A simple run with MariaDB test of mmtest with THP enabled as
> > > >> always shows it could generate over fifteen thousand deferred split 
> > > >> THPs
> > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
> > > >> It looks worth accounting in MemAvailable.
> > > > 
> > > > OK, this makes sense. But your above numbers are really worrying.
> > > > Accumulating such a large amount of pages that are likely not going to
> > > > be used is really bad. They are essentially blocking any higher order
> > > > allocations and also push the system towards more memory pressure.
> > > > 
> > > > IIUC deferred splitting is mostly a workaround for nasty locking issues
> > > > during splitting, right? This is not really an optimization to cache
> > > > THPs for reuse or something like that. What is the reason this is not
> > > > done from a worker context? At least THPs which would be freed
> > > > completely sound like a good candidate for kworker tear down, no?
> > > 
> > > Agreed that it's a good question. For Kirill :) Maybe with kworker 
> > > approach we
> > > also wouldn't need the cgroup awareness?
> > 
> > I don't remember a particular locking issue, but I cannot say there's
> > none :P
> > 
> > It's an artifact from decoupling PMD split from compound page split:
> > the same page can be mapped multiple times with a combination of PMDs
> > and PTEs. Split of one PMD doesn't need to trigger split of all PMDs
> > and the underlying compound page.
> > 
> > Another consideration is the fact that page split can fail and we need
> > to have a fallback for this case.
> > 
> > Also, in most cases THP split would be just a waste of time if we did
> > it on the spot. If you don't have memory pressure it's better to wait
> > until process termination: less pages on LRU is still beneficial.
> 
> This might be true but the reality shows that a lot of THPs that are
> essentially freeable on the spot might be waiting for memory pressure.
> So I am not really convinced that "less pages on LRUs" is really a
> plausible justification. Can we free at least those THPs which are
> unmapped completely without any pte mappings?

Completely unmapped pages will be freed with the current code. Deferred
split only applies to partly mapped THPs: at least one 4k page of the THP
is still mapped somewhere.

> > The main source of partly mapped THPs is the exit path. When the PMD
> > mapping of a THP got split across multiple VMAs (for instance due to
> > mprotect()), in the exit path we unmap PTEs belonging to one VMA just
> > before unmapping the rest of the page. It would be a total waste of
> > time to split the page in this scenario.
> > 
> > The whole deferred split thing still looks like a reasonable
> > compromise to me.
> 
> Even when it leads to all other problems mentioned in this and memcg
> deferred reclaim series?

Yes.

You would still need deferred split even if you *try* to split the page on
the spot. split_huge_page() can fail (due to a pin on the page) and you
will need to have a way to try again later.

You'll not win anything in complexity by trying split_huge_page()
immediately. I would argue you'll create much more complexity.

> > We may have some kind of watermark and try to keep the number of deferred
> > split THPs under it. But it comes with its own set of problems: what if
> > all these pages are pinned for a really long time and effectively not
> > available for split.
> 
> Again, why cannot we simply push the freeing to where there are no other
> mappings? This should be a pretty common case, right?

Partly mapped THPs are not a common case at all.

To get to this point you will need to create a mapping, fault in a THP and
then unmap part of it. It requires very active memory management on the
application side. This kind of application usually knows whether THP is a
fit for it.

> I am still not sure that waiting for the memory reclaim is a general
> win.

It wins CPU cycles by not doing the work that is likely unneeded.

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-26 Thread Michal Hocko
On Thu 22-08-19 08:33:40, Yang Shi wrote:
> 
> 
> On 8/22/19 1:04 AM, Michal Hocko wrote:
> > On Thu 22-08-19 01:55:25, Yang Shi wrote:
[...]
> > > And, they seem very common with the common workloads when THP is
> > > enabled.  A simple run with MariaDB test of mmtest with THP enabled as
> > > always shows it could generate over fifteen thousand deferred split THPs
> > > (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
> > > It looks worth accounting in MemAvailable.
> > OK, this makes sense. But your above numbers are really worrying.
> > Accumulating such a large amount of pages that are likely not going to
> > be used is really bad. They are essentially blocking any higher order
> > allocations and also push the system towards more memory pressure.
> 
> That is an accumulated number; during the run of the test, some of them
> were already freed by the shrinker. IOW, it should not reach that much at
> any given time.

Then the above description is highly misleading. What is the actual
number of lingering THPs that wait for the memory pressure in the peak?
 
> > IIUC deferred splitting is mostly a workaround for nasty locking issues
> > during splitting, right? This is not really an optimization to cache
> > THPs for reuse or something like that. What is the reason this is not
> > done from a worker context? At least THPs which would be freed
> > completely sound like a good candidate for kworker tear down, no?
> 
> Yes, deferred split THP was introduced to avoid locking issues according to
> the document. Memcg awareness would help to trigger the shrinker more often.
> 
> I think it could be done in a worker context, but when to trigger the
> worker is a subtle problem.

Why? What is the problem to trigger it after unmap of a batch worth of
THPs?
-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-26 Thread Michal Hocko
On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote:
> On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> > On 8/22/19 10:04 AM, Michal Hocko wrote:
> > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > >> Available memory is one of the most important metrics for memory
> > >> pressure.
> > > 
> > > I would disagree with this statement. It is a rough estimate that tells
> > > how much memory you can allocate before going into a more expensive
> > > reclaim (mostly swapping). Allocating that amount still might result in
> > > direct reclaim induced stalls. I do realize that this is simple metric
> > > that is attractive to use and works in many cases though.
> > > 
> > >> Currently, the deferred split THPs are not accounted into
> > >> available memory, but they are reclaimable actually, like reclaimable
> > >> slabs.
> > >> 
> > >> And, they seem very common with the common workloads when THP is
> > >> enabled.  A simple run with MariaDB test of mmtest with THP enabled as
> > >> always shows it could generate over fifteen thousand deferred split THPs
> > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
> > >> It looks worth accounting in MemAvailable.
> > > 
> > > OK, this makes sense. But your above numbers are really worrying.
> > > Accumulating such a large amount of pages that are likely not going to
> > > be used is really bad. They are essentially blocking any higher order
> > > allocations and also push the system towards more memory pressure.
> > > 
> > > IIUC deferred splitting is mostly a workaround for nasty locking issues
> > > during splitting, right? This is not really an optimization to cache
> > > THPs for reuse or something like that. What is the reason this is not
> > > done from a worker context? At least THPs which would be freed
> > > completely sound like a good candidate for kworker tear down, no?
> > 
> > Agreed that it's a good question. For Kirill :) Maybe with kworker approach 
> > we
> > also wouldn't need the cgroup awareness?
> 
> I don't remember a particular locking issue, but I cannot say there's
> none :P
> 
> It's an artifact from decoupling PMD split from compound page split:
> the same page can be mapped multiple times with a combination of PMDs
> and PTEs. Split of one PMD doesn't need to trigger split of all PMDs
> and the underlying compound page.
> 
> Another consideration is the fact that page split can fail and we need
> to have a fallback for this case.
> 
> Also, in most cases THP split would be just a waste of time if we did
> it on the spot. If you don't have memory pressure it's better to wait
> until process termination: less pages on LRU is still beneficial.

This might be true but the reality shows that a lot of THPs that are
essentially freeable on the spot might be waiting for memory pressure.
So I am not really convinced that "less pages on LRUs" is really a
plausible justification. Can we free at least those THPs which are
unmapped completely without any pte mappings?

> The main source of partly mapped THPs is the exit path. When the PMD
> mapping of a THP got split across multiple VMAs (for instance due to
> mprotect()), in the exit path we unmap PTEs belonging to one VMA just
> before unmapping the rest of the page. It would be a total waste of
> time to split the page in this scenario.
> 
> The whole deferred split thing still looks like a reasonable
> compromise to me.

Even when it leads to all other problems mentioned in this and memcg
deferred reclaim series?

> We may have some kind of watermark and try to keep the number of deferred
> split THPs under it. But it comes with its own set of problems: what if
> all these pages are pinned for a really long time and effectively not
> available for split.

Again, why cannot we simply push the freeing to where there are no other
mappings? This should be a pretty common case, right? I am still not sure
that waiting for the memory reclaim is a general win. Do you have any
examples of workloads that measurably benefit from this lazy approach
without any other downsides? In other words, how exactly do we measure the
cost/benefit model of this heuristic?

-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-22 Thread Yang Shi




On 8/22/19 5:56 AM, Vlastimil Babka wrote:

On 8/22/19 10:04 AM, Michal Hocko wrote:

On Thu 22-08-19 01:55:25, Yang Shi wrote:

Available memory is one of the most important metrics for memory
pressure.

I would disagree with this statement. It is a rough estimate that tells
how much memory you can allocate before going into a more expensive
reclaim (mostly swapping). Allocating that amount still might result in
direct reclaim induced stalls. I do realize that this is simple metric
that is attractive to use and works in many cases though.


Currently, the deferred split THPs are not accounted into
available memory, but they are reclaimable actually, like reclaimable
slabs.

And, they seem very common with the common workloads when THP is
enabled.  A simple run with MariaDB test of mmtest with THP enabled as
always shows it could generate over fifteen thousand deferred split THPs
(accumulated around 30G in one hour run, 75% of 40G memory for my VM).
It looks worth accounting in MemAvailable.

OK, this makes sense. But your above numbers are really worrying.
Accumulating such a large amount of pages that are likely not going to
be used is really bad. They are essentially blocking any higher order
allocations and also push the system towards more memory pressure.

IIUC deferred splitting is mostly a workaround for nasty locking issues
during splitting, right? This is not really an optimization to cache
THPs for reuse or something like that. What is the reason this is not
done from a worker context? At least THPs which would be freed
completely sound like a good candidate for kworker tear down, no?

Agreed that it's a good question. For Kirill :) Maybe with kworker approach we
also wouldn't need the cgroup awareness?


Record the number of freeable normal pages of deferred split THPs into
the second tail page, and account it into KReclaimable.  Although THP
allocations are not exactly "kernel allocations", once they are unmapped,
they are in fact kernel-only.  KReclaimable has been accounted into
MemAvailable.

This sounds reasonable to me.
  

When the deferred split THPs get split due to memory pressure or freed,
just decrease by the recorded number.

With this change when running program which populates 1G address space
then madvise(MADV_DONTNEED) 511 pages for every THP, /proc/meminfo would
show the deferred split THPs are accounted properly.

Populated before calling madvise(MADV_DONTNEED):
MemAvailable:   43531960 kB
AnonPages:   1096660 kB
KReclaimable:  26156 kB
AnonHugePages:   1056768 kB

After calling madvise(MADV_DONTNEED):
MemAvailable:   44411164 kB
AnonPages: 50140 kB
KReclaimable:1070640 kB
AnonHugePages: 10240 kB
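
(A sketch of how such a test program could look; it is not the actual
program behind the numbers above, and it assumes 2M THPs and a 2M-aligned
mapping.)

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t sz = 1UL << 30;		/* 1G */
	size_t thp = 2UL << 20;		/* 2M */
	char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	madvise(p, sz, MADV_HUGEPAGE);
	memset(p, 1, sz);		/* populate, ideally as THPs */

	/* Keep the first 4k of every 2M region mapped and drop the other
	 * 511 pages, so each THP ends up partly mapped and on the
	 * deferred split queue. */
	for (size_t off = 0; off < sz; off += thp)
		madvise(p + off + 4096, thp - 4096, MADV_DONTNEED);

	pause();	/* hold the mapping while reading /proc/meminfo */
	return 0;
}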

Suggested-by: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Kirill A. Shutemov 
Cc: Johannes Weiner 
Cc: David Rientjes 
Signed-off-by: Yang Shi 

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...


--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
  
  	INIT_LIST_HEAD(page_deferred_list(page));

set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+   page[2].nr_freeable = 0;
  }
  
  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,

@@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
if (!list_empty(page_deferred_list(head))) {
ds_queue->split_queue_len--;
list_del(page_deferred_list(head));
+   __mod_node_page_state(page_pgdat(page),
+   NR_KERNEL_MISC_RECLAIMABLE,
+   -head[2].nr_freeable);
+   head[2].nr_freeable = 0;
}
if (mapping)
__dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
ds_queue->split_queue_len--;
list_del(page_deferred_list(page));
}
+   __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+ -page[2].nr_freeable);
+   page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page to the
deferred list? So here the code would be in the if (!list_empty()) { } part 
above.


IMHO, having it outside (!list_empty()) sounds safer actually, since we'd
better decrement nr_freeable unconditionally in the free path.





spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
free_compound_page(page);
  }
  
-void deferred_split_huge_page(struct page *page)

+void deferred_split_huge_page(struct page *page, unsigned int nr)
  {
struct deferred_split *ds_queue = get_deferred_split_queue(page);
  #ifdef CONFIG_MEMCG
@@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
return;
  
  	

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-22 Thread Kirill A. Shutemov
On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> >> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
> >>ds_queue->split_queue_len--;
> >>list_del(page_deferred_list(page));
> >>}
> >> +  __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> >> +-page[2].nr_freeable);
> >> +  page[2].nr_freeable = 0;
> 
> Wouldn't it be safer to fully tie the nr_freeable use to adding the page to 
> the
> deferred list? So here the code would be in the if (!list_empty()) { } part 
> above.
> 
> >>spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >>free_compound_page(page);
> >>  }
> >>  
> >> -void deferred_split_huge_page(struct page *page)
> >> +void deferred_split_huge_page(struct page *page, unsigned int nr)
> >>  {
> >>struct deferred_split *ds_queue = get_deferred_split_queue(page);
> >>  #ifdef CONFIG_MEMCG
> >> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
> >>return;
> >>  
> >>spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >> +  page[2].nr_freeable += nr;
> >> +  __mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> >> +nr);
> 
> Same here, only do this when adding to the list, below? Or we might perhaps
> account base pages multiple times?

No, it cannot be under the list_empty() check. Consider the case when a THP
gets unmapped 4k at a time. You need to put it on the list once, but account
it every time.
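
(To spell out the point, a paraphrased sketch of the flow being discussed --
not the exact patch hunks -- with the accounting outside the list_empty()
check and the queueing inside it.)

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);

	/* Called each time another chunk of the THP gets unmapped, so the
	 * freeable count must be updated on every call ... */
	page[2].nr_freeable += nr;
	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE, nr);

	/* ... while the page is added to the deferred split queue only the
	 * first time around, when it is not on the list yet. */
	if (list_empty(page_deferred_list(page))) {
		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
		ds_queue->split_queue_len++;
	}

	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);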

-- 
 Kirill A. Shutemov


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-22 Thread Yang Shi




On 8/22/19 1:04 AM, Michal Hocko wrote:

On Thu 22-08-19 01:55:25, Yang Shi wrote:

Available memory is one of the most important metrics for memory
pressure.

I would disagree with this statement. It is a rough estimate that tells
how much memory you can allocate before going into a more expensive
reclaim (mostly swapping). Allocating that amount still might result in
direct reclaim induced stalls. I do realize that this is simple metric
that is attractive to use and works in many cases though.


OK, I would rephrase this a little bit, say "useful metric".




Currently, the deferred split THPs are not accounted into
available memory, but they are reclaimable actually, like reclaimable
slabs.

And, they seem very common with the common workloads when THP is
enabled.  A simple run with MariaDB test of mmtest with THP enabled as
always shows it could generate over fifteen thousand deferred split THPs
(accumulated around 30G in one hour run, 75% of 40G memory for my VM).
It looks worth accounting in MemAvailable.

OK, this makes sense. But your above numbers are really worrying.
Accumulating such a large amount of pages that are likely not going to
be used is really bad. They are essentially blocking any higher order
allocations and also push the system towards more memory pressure.


That is an accumulated number; during the run of the test, some of them
were already freed by the shrinker. IOW, it should not reach that much at
any given time.




IIUC deferred splitting is mostly a workaround for nasty locking issues
during splitting, right? This is not really an optimization to cache
THPs for reuse or something like that. What is the reason this is not
done from a worker context? At least THPs which would be freed
completely sound like a good candidate for kworker tear down, no?


Yes, deferred split THP was introduced to avoid locking issues according 
to the document. Memcg awareness would help to trigger the shrinker more 
often.


I think it could be done in a worker context, but when to trigger the
worker is a subtle problem.





Record the number of freeable normal pages of deferred split THPs into
the second tail page, and account it into KReclaimable.  Although THP
allocations are not exactly "kernel allocations", once they are unmapped,
they are in fact kernel-only.  KReclaimable has been accounted into
MemAvailable.

This sounds reasonable to me.
  

When the deferred split THPs get split due to memory pressure or freed,
just decrease by the recorded number.

With this change when running program which populates 1G address space
then madvise(MADV_DONTNEED) 511 pages for every THP, /proc/meminfo would
show the deferred split THPs are accounted properly.

Populated before calling madvise(MADV_DONTNEED):
MemAvailable:   43531960 kB
AnonPages:   1096660 kB
KReclaimable:  26156 kB
AnonHugePages:   1056768 kB

After calling madvise(MADV_DONTNEED):
MemAvailable:   44411164 kB
AnonPages: 50140 kB
KReclaimable:1070640 kB
AnonHugePages: 10240 kB

Suggested-by: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Kirill A. Shutemov 
Cc: Johannes Weiner 
Cc: David Rientjes 
Signed-off-by: Yang Shi 

Other than the above concern, which is a little bit orthogonal, the patch
looks reasonable to me. I might be missing subtle THPisms so I am not
going to ack though.


---
  Documentation/filesystems/proc.txt |  4 ++--
  include/linux/huge_mm.h|  7 +--
  include/linux/mm_types.h   |  3 ++-
  mm/huge_memory.c   | 13 -
  mm/rmap.c  |  4 ++--
  5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 99ca040..93fc183 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -968,8 +968,8 @@ ShmemHugePages: Memory used by shared memory (shmem) and 
tmpfs allocated
with huge pages
  ShmemPmdMapped: Shared memory mapped into userspace with huge pages
  KReclaimable: Kernel allocations that the kernel will attempt to reclaim
-  under memory pressure. Includes SReclaimable (below), and other
-  direct allocations with a shrinker.
+  under memory pressure. Includes SReclaimable (below), deferred
+  split THPs, and other direct allocations with a shrinker.
  Slab: in-kernel data structures cache
  SReclaimable: Part of Slab, that might be reclaimed, such as caches
SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 61c9ffd..c194630 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -162,7 +162,7 @@ static inline int split_huge_page(struct page *page)
  {
return split_huge_page_to_list(page, NULL);
  }
-void deferred_split_huge_page(struct page *page);
+void deferred_split_huge_page(struct page *page, unsigned int nr);
  
  void 

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-22 Thread Kirill A. Shutemov
On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> On 8/22/19 10:04 AM, Michal Hocko wrote:
> > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> >> Available memory is one of the most important metrics for memory
> >> pressure.
> > 
> > I would disagree with this statement. It is a rough estimate that tells
> > how much memory you can allocate before going into a more expensive
> > reclaim (mostly swapping). Allocating that amount still might result in
> > direct reclaim induced stalls. I do realize that this is simple metric
> > that is attractive to use and works in many cases though.
> > 
> >> Currently, the deferred split THPs are not accounted into
> >> available memory, but they are reclaimable actually, like reclaimable
> >> slabs.
> >> 
> >> And, they seem very common with the common workloads when THP is
> >> enabled.  A simple run with MariaDB test of mmtest with THP enabled as
> >> always shows it could generate over fifteen thousand deferred split THPs
> >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
> >> It looks worth accounting in MemAvailable.
> > 
> > OK, this makes sense. But your above numbers are really worrying.
> > Accumulating such a large amount of pages that are likely not going to
> > be used is really bad. They are essentially blocking any higher order
> > allocations and also push the system towards more memory pressure.
> > 
> > IIUC deferred splitting is mostly a workaround for nasty locking issues
> > during splitting, right? This is not really an optimization to cache
> > THPs for reuse or something like that. What is the reason this is not
> > done from a worker context? At least THPs which would be freed
> > completely sound like a good candidate for kworker tear down, no?
> 
> Agreed that it's a good question. For Kirill :) Maybe with kworker approach we
> also wouldn't need the cgroup awareness?

I don't remember a particular locking issue, but I cannot say there's
none :P

It's an artifact from decoupling PMD split from compound page split: the
same page can be mapped multiple times with a combination of PMDs and PTEs.
Split of one PMD doesn't need to trigger split of all PMDs and the
underlying compound page.

Another consideration is the fact that page split can fail and we need to
have a fallback for this case.

Also, in most cases THP split would be just a waste of time if we did it
on the spot. If you don't have memory pressure it's better to wait
until process termination: less pages on LRU is still beneficial.

The main source of partly mapped THPs is the exit path. When the PMD
mapping of a THP got split across multiple VMAs (for instance due to
mprotect()), in the exit path we unmap PTEs belonging to one VMA just
before unmapping the rest of the page. It would be a total waste of time
to split the page in this scenario.

The whole deferred split thing still looks like a reasonable compromise
to me.

We may have some kind of watermark and try to keep the number of deferred
split THPs under it. But it comes with its own set of problems: what if
all these pages are pinned for a really long time and effectively not
available for split.

-- 
 Kirill A. Shutemov


Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-22 Thread Vlastimil Babka
On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
> 
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls. I do realize that this is simple metric
> that is attractive to use and works in many cases though.
> 
>> Currently, the deferred split THPs are not accounted into
>> available memory, but they are reclaimable actually, like reclaimable
>> slabs.
>> 
>> And, they seem very common with the common workloads when THP is
>> enabled.  A simple run with MariaDB test of mmtest with THP enabled as
>> always shows it could generate over fifteen thousand deferred split THPs
>> (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
>> It looks worth accounting in MemAvailable.
> 
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large amount of pages that are likely not going to
> be used is really bad. They are essentially blocking any higher order
> allocations and also push the system towards more memory pressure.
> 
> IIUC deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? This is not really an optimization to cache
> THPs for reuse or something like that. What is the reason this is not
> done from a worker context? At least THPs which would be freed
> completely sound like a good candidate for kworker tear down, no?

Agreed that it's a good question. For Kirill :) Maybe with kworker approach we
also wouldn't need the cgroup awareness?

>> Record the number of freeable normal pages of deferred split THPs into
>> the second tail page, and account it into KReclaimable.  Although THP
>> allocations are not exactly "kernel allocations", once they are unmapped,
>> they are in fact kernel-only.  KReclaimable has been accounted into
>> MemAvailable.
> 
> This sounds reasonable to me.
>  
>> When the deferred split THPs get split due to memory pressure or freed,
>> just decrease by the recorded number.
>> 
>> With this change when running program which populates 1G address space
>> then madvise(MADV_DONTNEED) 511 pages for every THP, /proc/meminfo would
>> show the deferred split THPs are accounted properly.
>> 
>> Populated before calling madvise(MADV_DONTNEED):
>> MemAvailable:   43531960 kB
>> AnonPages:   1096660 kB
>> KReclaimable:  26156 kB
>> AnonHugePages:   1056768 kB
>> 
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable:   44411164 kB
>> AnonPages: 50140 kB
>> KReclaimable:1070640 kB
>> AnonHugePages: 10240 kB
>> 
>> Suggested-by: Vlastimil Babka 
>> Cc: Michal Hocko 
>> Cc: Kirill A. Shutemov 
>> Cc: Johannes Weiner 
>> Cc: David Rientjes 
>> Signed-off-by: Yang Shi 

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...

>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>  
>>  INIT_LIST_HEAD(page_deferred_list(page));
>>  set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +page[2].nr_freeable = 0;
>>  }
>>  
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned 
>> long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct 
>> list_head *list)
>>  if (!list_empty(page_deferred_list(head))) {
>>  ds_queue->split_queue_len--;
>>  list_del(page_deferred_list(head));
>> +__mod_node_page_state(page_pgdat(page),
>> +NR_KERNEL_MISC_RECLAIMABLE,
>> +-head[2].nr_freeable);
>> +head[2].nr_freeable = 0;
>>  }
>>  if (mapping)
>>  __dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>  ds_queue->split_queue_len--;
>>  list_del(page_deferred_list(page));
>>  }
>> +__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +  -page[2].nr_freeable);
>> +page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page to the
deferred list? So here the code would be in the if (!list_empty()) { } part 
above.

>>  spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  free_compound_page(page);
>>  }
>>  
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>  struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>  return;
>>  
>>  

Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable

2019-08-22 Thread Michal Hocko
On Thu 22-08-19 01:55:25, Yang Shi wrote:
> Available memory is one of the most important metrics for memory
> pressure.

I would disagree with this statement. It is a rough estimate that tells
how much memory you can allocate before going into a more expensive
reclaim (mostly swapping). Allocating that amount still might result in
direct reclaim induced stalls. I do realize that this is simple metric
that is attractive to use and works in many cases though.

> Currently, the deferred split THPs are not accounted into
> available memory, but they are reclaimable actually, like reclaimable
> slabs.
> 
> And, they seem very common with the common workloads when THP is
> enabled.  A simple run with MariaDB test of mmtest with THP enabled as
> always shows it could generate over fifteen thousand deferred split THPs
> (accumulated around 30G in one hour run, 75% of 40G memory for my VM).
> It looks worth accounting in MemAvailable.

OK, this makes sense. But your above numbers are really worrying.
Accumulating such a large amount of pages that are likely not going to
be used is really bad. They are essentially blocking any higher order
allocations and also push the system towards more memory pressure.

IIUC deferred splitting is mostly a workaround for nasty locking issues
during splitting, right? This is not really an optimization to cache
THPs for reuse or something like that. What is the reason this is not
done from a worker context? At least THPs which would be freed
completely sound like a good candidate for kworker tear down, no?

> Record the number of freeable normal pages of deferred split THPs into
> the second tail page, and account it into KReclaimable.  Although THP
> allocations are not exactly "kernel allocations", once they are unmapped,
> they are in fact kernel-only.  KReclaimable has been accounted into
> MemAvailable.

This sounds reasonable to me.
 
> When the deferred split THPs get split due to memory pressure or freed,
> just decrease by the recorded number.
> 
> With this change when running program which populates 1G address space
> then madvise(MADV_DONTNEED) 511 pages for every THP, /proc/meminfo would
> show the deferred split THPs are accounted properly.
> 
> Populated before calling madvise(MADV_DONTNEED):
> MemAvailable:   43531960 kB
> AnonPages:   1096660 kB
> KReclaimable:  26156 kB
> AnonHugePages:   1056768 kB
> 
> After calling madvise(MADV_DONTNEED):
> MemAvailable:   44411164 kB
> AnonPages: 50140 kB
> KReclaimable:1070640 kB
> AnonHugePages: 10240 kB
> 
> Suggested-by: Vlastimil Babka 
> Cc: Michal Hocko 
> Cc: Kirill A. Shutemov 
> Cc: Johannes Weiner 
> Cc: David Rientjes 
> Signed-off-by: Yang Shi 

Other than the above concern, which is a little bit orthogonal, the patch
looks reasonable to me. I might be missing subtle THPisms so I am not
going to ack though.

> ---
>  Documentation/filesystems/proc.txt |  4 ++--
>  include/linux/huge_mm.h|  7 +--
>  include/linux/mm_types.h   |  3 ++-
>  mm/huge_memory.c   | 13 -
>  mm/rmap.c  |  4 ++--
>  5 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 99ca040..93fc183 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -968,8 +968,8 @@ ShmemHugePages: Memory used by shared memory (shmem) and 
> tmpfs allocated
>with huge pages
>  ShmemPmdMapped: Shared memory mapped into userspace with huge pages
>  KReclaimable: Kernel allocations that the kernel will attempt to reclaim
> -  under memory pressure. Includes SReclaimable (below), and other
> -  direct allocations with a shrinker.
> +  under memory pressure. Includes SReclaimable (below), deferred
> +  split THPs, and other direct allocations with a shrinker.
>  Slab: in-kernel data structures cache
>  SReclaimable: Part of Slab, that might be reclaimed, such as caches
>SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 61c9ffd..c194630 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -162,7 +162,7 @@ static inline int split_huge_page(struct page *page)
>  {
>   return split_huge_page_to_list(page, NULL);
>  }
> -void deferred_split_huge_page(struct page *page);
> +void deferred_split_huge_page(struct page *page, unsigned int nr);
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   unsigned long address, bool freeze, struct page *page);
> @@ -324,7 +324,10 @@ static inline int split_huge_page(struct page *page)
>  {
>   return 0;
>  }
> -static inline void deferred_split_huge_page(struct page *page) {}
> +static inline void deferred_split_huge_page(struct page *page, unsigned int 
> nr)
> +{
> +}
> +
>