Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-06 Thread Michal Hocko
On Tue 06-08-19 10:19:49, Konstantin Khlebnikov wrote:
> On 8/6/19 10:07 AM, Michal Hocko wrote:
> > On Fri 02-08-19 13:44:38, Michal Hocko wrote:
> > [...]
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index ba9138a4a1de..53a35c526e43 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > >   schedule_work(&memcg->high_work);
> > > > >   break;
> > > > >   }
> > > > > - current->memcg_nr_pages_over_high += batch;
> > > > > - set_notify_resume(current);
> > > > > + if (gfpflags_allow_blocking(gfp_mask)) {
> > > > > + reclaim_high(memcg, nr_pages, GFP_KERNEL);
> > > 
> > > ups, this should be s@GFP_KERNEL@gfp_mask@
> > > 
> > > > > + } else {
> > > > > + current->memcg_nr_pages_over_high += batch;
> > > > > + set_notify_resume(current);
> > > > > + }
> > > > >   break;
> > > > >   }
> > > > >   } while ((memcg = parent_mem_cgroup(memcg)));
> > > > > 
> > 
> > Should I send an official patch for this?
> > 
> 
> I prefer to keep it as is while we have no better solution.

Fine with me.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-06 Thread Konstantin Khlebnikov

On 8/6/19 10:07 AM, Michal Hocko wrote:
> On Fri 02-08-19 13:44:38, Michal Hocko wrote:
> [...]
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index ba9138a4a1de..53a35c526e43 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >   schedule_work(&memcg->high_work);
> > > >   break;
> > > >   }
> > > > - current->memcg_nr_pages_over_high += batch;
> > > > - set_notify_resume(current);
> > > > + if (gfpflags_allow_blocking(gfp_mask)) {
> > > > + reclaim_high(memcg, nr_pages, GFP_KERNEL);
> >
> > ups, this should be s@GFP_KERNEL@gfp_mask@
> >
> > > > + } else {
> > > > + current->memcg_nr_pages_over_high += batch;
> > > > + set_notify_resume(current);
> > > > + }
> > > >   break;
> > > >   }
> > > > } while ((memcg = parent_mem_cgroup(memcg)));
>
> Should I send an official patch for this?

I prefer to keep it as is while we have no better solution.


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-06 Thread Michal Hocko
On Fri 02-08-19 13:44:38, Michal Hocko wrote:
[...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ba9138a4a1de..53a35c526e43 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >   schedule_work(&memcg->high_work);
> > >   break;
> > >   }
> > > - current->memcg_nr_pages_over_high += batch;
> > > - set_notify_resume(current);
> > > + if (gfpflags_allow_blocking(gfp_mask)) {
> > > + reclaim_high(memcg, nr_pages, GFP_KERNEL);
> 
> ups, this should be s@GFP_KERNEL@gfp_mask@
> 
> > > + } else {
> > > + current->memcg_nr_pages_over_high += batch;
> > > + set_notify_resume(current);
> > > + }
> > >   break;
> > >   }
> > >   } while ((memcg = parent_mem_cgroup(memcg)));
> > > 

Should I send an official patch for this?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-06 Thread Michal Hocko
On Mon 05-08-19 20:28:40, Yang Shi wrote:
> On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko  wrote:
> >
> > On Fri 02-08-19 11:56:28, Yang Shi wrote:
> > > On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko  wrote:
> > > >
> > > > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > > > [...]
> > > > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > > > to background reclaim; direct reclaim may run into the same problem.
> > > > > >
> > > > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > > > You can have thousands of memcgs and I do not think we really want
> > > > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > > > get into a tricky land where determinism/fairness would be non-trivial
> > > > > > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > > > > > itself.
> > > > >
> > > > > Yes, I agree a thread pool would introduce more latency than a dedicated
> > > > > kswapd thread. But it doesn't look that bad in our tests. When memory
> > > > > allocation is fast, even a dedicated kswapd thread can't catch up.
> > > > > So such background reclaim is best effort, not guaranteed.
> > > > >
> > > > > I don't quite get what you mean about fairness. Do you mean they may
> > > > > spend excessive cpu time and cause starvation of other processes? I
> > > > > think this could be mitigated by properly organizing and configuring
> > > > > the groups. But I agree this is tricky.
> > > >
> > > > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > > > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > > > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > > > reclaim and non-deterministic behavior.
> > >
> > > Yes, the cost depends on the state of pages, but I still don't quite
> > > understand what "unfair" refers to in this context. Do you mean
> > > some cgroups may reclaim much more than others?
> >
> > > Or the work may take too long so it can't serve other cgroups in time?
> >
> > exactly.
> 
> Actually, I'm not very concerned by this. In our design each memcg has
> its dedicated work (memcg->wmark_work), so the reclaim work for
> different memcgs can run in parallel since they are *different* work
> items, even though they run the same function. And we could queue them
> to a dedicated unbound workqueue, which allows at most 512 active works
> or scales with the number of CPUs. Although the system may have
> thousands of online memcgs, I suppose it should be rare for all of them
> to trigger reclaim at the same time.

I do believe that it might work for your particular usecase but I do not
think this is robust enough for the upstream kernel, I am afraid.

As I've said, I am open to discussing an opt-in per-memcg pro-active reclaim
(a kernel thread that belongs to the memcg) but it has to be a dedicated
worker bound by all the cgroup resource restrictions.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-05 Thread Yang Shi
On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko  wrote:
>
> On Fri 02-08-19 11:56:28, Yang Shi wrote:
> > On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko  wrote:
> > >
> > > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
> > > > >
> > > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > > [...]
> > > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > > to background reclaim; direct reclaim may run into the same problem.
> > > > >
> > > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > > You can have thousands of memcgs and I do not think we really want
> > > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > > get into a tricky land where determinism/fairness would be non-trivial
> > > > > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > > > > itself.
> > > >
> > > > Yes, I agree a thread pool would introduce more latency than a dedicated
> > > > kswapd thread. But it doesn't look that bad in our tests. When memory
> > > > allocation is fast, even a dedicated kswapd thread can't catch up.
> > > > So such background reclaim is best effort, not guaranteed.
> > > >
> > > > I don't quite get what you mean about fairness. Do you mean they may
> > > > spend excessive cpu time and cause starvation of other processes? I
> > > > think this could be mitigated by properly organizing and configuring
> > > > the groups. But I agree this is tricky.
> > >
> > > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > > reclaim and non-deterministic behavior.
> >
> > Yes, the cost depends on the state of pages, but I still don't quite
> > understand what "unfair" refers to in this context. Do you mean
> > some cgroups may reclaim much more than others?
>
> > Or the work may take too long so it can't serve other cgroups in time?
>
> exactly.

Actually, I'm not very concerned by this. In our design each memcg has
its dedicated work (memcg->wmark_work), so the reclaim work for
different memcgs can run in parallel since they are *different* work
items, even though they run the same function. And we could queue them
to a dedicated unbound workqueue, which allows at most 512 active works
or scales with the number of CPUs. Although the system may have
thousands of online memcgs, I suppose it should be rare for all of them
to trigger reclaim at the same time.
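
Roughly, an illustrative sketch of that scheme (not a posted patch; the
workqueue name and the memcg_over_wmark() helper are made up here):

	/* One work item per memcg, all queued on a shared unbound
	 * workqueue, so reclaim for different memcgs runs as separate,
	 * parallel work items. */
	static struct workqueue_struct *memcg_kswapd_wq;

	static void memcg_wmark_work_func(struct work_struct *work)
	{
		struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
							wmark_work);

		/* best effort: reclaim until usage drops under the watermark */
		while (memcg_over_wmark(memcg))
			try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX,
						     GFP_KERNEL, true);
	}

	static int __init memcg_kswapd_init(void)
	{
		/* max_active == 0: default limit (capped at 512, as noted above) */
		memcg_kswapd_wq = alloc_workqueue("memcg_kswapd",
						  WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
		return memcg_kswapd_wq ? 0 : -ENOMEM;
	}

	/* on the charge path, once usage crosses the watermark: */
	queue_work(memcg_kswapd_wq, &memcg->wmark_work);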

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-05 Thread Shakeel Butt
On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko  wrote:
>
> On Fri 02-08-19 11:56:28, Yang Shi wrote:
> > On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko  wrote:
> > >
> > > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
> > > > >
> > > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > > [...]
> > > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > > to background reclaim; direct reclaim may run into the same problem.
> > > > >
> > > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > > You can have thousands of memcgs and I do not think we really want
> > > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > > get into a tricky land where determinism/fairness would be non-trivial
> > > > > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > > > > itself.
> > > >
> > > > Yes, I agree a thread pool would introduce more latency than a dedicated
> > > > kswapd thread. But it doesn't look that bad in our tests. When memory
> > > > allocation is fast, even a dedicated kswapd thread can't catch up.
> > > > So such background reclaim is best effort, not guaranteed.
> > > >
> > > > I don't quite get what you mean about fairness. Do you mean they may
> > > > spend excessive cpu time and cause starvation of other processes? I
> > > > think this could be mitigated by properly organizing and configuring
> > > > the groups. But I agree this is tricky.
> > >
> > > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > > reclaim and non-deterministic behavior.
> >
> > Yes, the cost depends on the state of pages, but I still don't quite
> > understand what "unfair" refers to in this context. Do you mean
> > some cgroups may reclaim much more than others?
>
> > Or the work may take too long so it can't serve other cgroups in time?
>
> exactly.
>

How about allowing the users to implement their own user space kswapd?
A memcg interface similar to MADV_PAGEOUT. Users can register for a
MEMCG_HIGH notification (it needs some modification) and on receiving
the notification, the uswapd (user's kswapd) will trigger reclaim
through memory.pageout (or memory.try_to_free_pages). One can argue:
why not just use MADV_PAGEOUT? In a real workload, a job can be a
combination of different sub-jobs and most probably does not know the
importance of the memory layout of the tasks of those sub-jobs. So, a
memcg level interface makes more sense there.
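
A rough sketch of such a uswapd loop (illustrative only: the cgroup path
is made up and the memory.pageout file is hypothetical - as said above it
would need kernel modifications; cgroup v2 does signal POLLPRI when an
.events file changes):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* watch this job's memory.events for "high" breaches */
		int efd = open("/sys/fs/cgroup/job/memory.events", O_RDONLY);
		struct pollfd pfd = { .fd = efd, .events = POLLPRI };

		for (;;) {
			poll(&pfd, 1, -1);	/* woken when the counters change */

			/* hypothetical knob: ask the kernel to reclaim
			 * ~64M from this memcg */
			int rfd = open("/sys/fs/cgroup/job/memory.pageout",
				       O_WRONLY);
			dprintf(rfd, "%lu", 64UL << 20);
			close(rfd);

			lseek(efd, 0, SEEK_SET);	/* rearm the poll */
		}
	}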

Shakeel


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-05 Thread Michal Hocko
On Fri 02-08-19 11:56:28, Yang Shi wrote:
> On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko  wrote:
> >
> > On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > > [...]
> > > > > I don't worry too much about scale since the scale issue is not unique
> > > > > to background reclaim; direct reclaim may run into the same problem.
> > > >
> > > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > > You can have thousands of memcgs and I do not think we really want
> > > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > > get into a tricky land where determinism/fairness would be non-trivial
> > > > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > > > itself.
> > >
> > > Yes, I agree a thread pool would introduce more latency than a dedicated
> > > kswapd thread. But it doesn't look that bad in our tests. When memory
> > > allocation is fast, even a dedicated kswapd thread can't catch up.
> > > So such background reclaim is best effort, not guaranteed.
> > >
> > > I don't quite get what you mean about fairness. Do you mean they may
> > > spend excessive cpu time and cause starvation of other processes? I
> > > think this could be mitigated by properly organizing and configuring
> > > the groups. But I agree this is tricky.
> >
> > No, I meant that the cost of reclaiming a unit of charges (e.g.
> > SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> > on LRUs. Therefore any thread pool mechanism would lead to unfair
> > reclaim and non-deterministic behavior.
> 
> Yes, the cost depends on the state of pages, but I still don't quite
> understand what "unfair" refers to in this context. Do you mean
> some cgroups may reclaim much more than others?

> Or the work may take too long so it can't serve other cgroups in time?

exactly.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-02 Thread Yang Shi
On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko  wrote:
>
> On Thu 01-08-19 14:00:51, Yang Shi wrote:
> > On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
> > >
> > > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > > [...]
> > > > I don't worry too much about scale since the scale issue is not unique
> > > > to background reclaim; direct reclaim may run into the same problem.
> > >
> > > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > > You can have thousands of memcgs and I do not think we really want
> > > to create one kswapd for each. Once we have a kswapd thread pool then we
> > > get into a tricky land where determinism/fairness would be non-trivial
> > > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > > itself.
> >
> > Yes, I agree a thread pool would introduce more latency than a dedicated
> > kswapd thread. But it doesn't look that bad in our tests. When memory
> > allocation is fast, even a dedicated kswapd thread can't catch up.
> > So such background reclaim is best effort, not guaranteed.
> >
> > I don't quite get what you mean about fairness. Do you mean they may
> > spend excessive cpu time and cause starvation of other processes? I
> > think this could be mitigated by properly organizing and configuring
> > the groups. But I agree this is tricky.
>
> No, I meant that the cost of reclaiming a unit of charges (e.g.
> SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
> on LRUs. Therefore any thread pool mechanism would lead to unfair
> reclaim and non-deterministic behavior.

Yes, the cost depends on the state of pages, but I still don't quite
understand what "unfair" refers to in this context. Do you mean
some cgroups may reclaim much more than others? Or the work may take
too long so it can't serve other cgroups in time?

>
> I can imagine a middle ground where the background reclaim would have to
> be an opt-in feature and a dedicated kernel thread would be assigned to
> the particular memcg (hierarchy).

Yes, it is opt-in by defining a proper "watermark". As long as the
watermark is set within (0, 100), the "kswapd" work is queued once the
usage exceeds the watermark, and it exits once the usage drops back
under it. If the watermark is 0, no background reclaim work is ever
queued.

We did use a dedicated kernel thread for each cgroup, but it turned out
to be tricky and error prone to manage the kernel threads; a workqueue
is much simpler and less error prone.
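
An illustrative form of that opt-in check (wmark_ratio and the helper
name are made up for the sketch, not from a posted patch):

	/* wmark_ratio in (0, 100) is a percentage of the memcg limit;
	 * 0 (the default) disables background reclaim entirely. */
	static bool memcg_should_kick_kswapd(struct mem_cgroup *memcg)
	{
		unsigned long max = READ_ONCE(memcg->memory.max);

		if (!memcg->wmark_ratio)	/* feature is opt-in */
			return false;

		return page_counter_read(&memcg->memory) >
		       max / 100 * memcg->wmark_ratio;
	}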

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-02 Thread Michal Hocko
On Fri 02-08-19 13:01:07, Konstantin Khlebnikov wrote:
> 
> 
> On 02.08.2019 12:40, Michal Hocko wrote:
> > On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> > > On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> > > > On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > > > > --- a/mm/gup.c
> > > > > +++ b/mm/gup.c
> > > > > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > > > >   ret = -ERESTARTSYS;
> > > > >   goto out;
> > > > >   }
> > > > > - cond_resched();
> > > > > + /* Reclaim memory over high limit before stocking too much */
> > > > > + mem_cgroup_handle_over_high(true);
> > > > 
> > > > I'd rather this remained part of the try_charge() call. The code
> > > > comment in try_charge says this:
> > > > 
> > > >  * We can perform reclaim here if __GFP_RECLAIM but let's
> > > >  * always punt for simplicity and so that GFP_KERNEL can
> > > >  * consistently be used during reclaim.
> > > > 
> > > > The simplicity argument doesn't hold true anymore once we have to add
> > > > manual calls into allocation sites. We should instead fix try_charge()
> > > > to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> > > > return when actually needed.
> > > 
> > > Agreed. If we want to do direct reclaim on the high limit breach then it
> > > should go into try_charge the same way we do hard limit reclaim there. I am
> > > not yet sure about how/whether to scale the excess. The only reason to
> > > move reclaim to the return-to-userspace path was GFP_NOWAIT charges. As you
> > > say, maybe we should start by always performing the reclaim for
> > > sleepable contexts first and only defer for non-sleeping requests.
> >
> > In other words, something like the patch below (completely untested). Could
> > you give it a try, Konstantin?
> 
> This should work but will also eliminate all the benefits of deferred reclaim:
> bigger batching and running without any locks.

Yes, but we already have to deal with that for the hard limit reclaim. Also I
would like to see some actual data to back any more complex solution.
We should definitely start simple.

> After that, the gap between high and max will work just as a reserve for
> atomic allocations.
> 
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ba9138a4a1de..53a35c526e43 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > schedule_work(&memcg->high_work);
> > break;
> > }
> > -   current->memcg_nr_pages_over_high += batch;
> > -   set_notify_resume(current);
> > +   if (gfpflags_allow_blocking(gfp_mask)) {
> > +   reclaim_high(memcg, nr_pages, GFP_KERNEL);

ups, this should be s@GFP_KERNEL@gfp_mask@

> > +   } else {
> > +   current->memcg_nr_pages_over_high += batch;
> > +   set_notify_resume(current);
> > +   }
> > break;
> > }
> > } while ((memcg = parent_mem_cgroup(memcg)));
> > 
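
For reference, the new branch with the s@GFP_KERNEL@gfp_mask@ fixup
applied reads:

		if (gfpflags_allow_blocking(gfp_mask)) {
			reclaim_high(memcg, nr_pages, gfp_mask);
		} else {
			current->memcg_nr_pages_over_high += batch;
			set_notify_resume(current);
		}
		break;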

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-02 Thread Michal Hocko
On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> > On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > >   ret = -ERESTARTSYS;
> > >   goto out;
> > >   }
> > > - cond_resched();
> > >  
> > > + /* Reclaim memory over high limit before stocking too much */
> > > + mem_cgroup_handle_over_high(true);
> > 
> > I'd rather this remained part of the try_charge() call. The code
> > comment in try_charge says this:
> > 
> >  * We can perform reclaim here if __GFP_RECLAIM but let's
> >  * always punt for simplicity and so that GFP_KERNEL can
> >  * consistently be used during reclaim.
> > 
> > The simplicity argument doesn't hold true anymore once we have to add
> > manual calls into allocation sites. We should instead fix try_charge()
> > to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> > return when actually needed.
> 
> Agreed. If we want to do direct reclaim on the high limit breach then it
> should go into try_charge the same way we do hard limit reclaim there. I am
> not yet sure about how/whether to scale the excess. The only reason to
> move reclaim to the return-to-userspace path was GFP_NOWAIT charges. As you
> say, maybe we should start by always performing the reclaim for
> sleepable contexts first and only defer for non-sleeping requests.

In other words, something like the patch below (completely untested). Could
you give it a try, Konstantin?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a4a1de..53a35c526e43 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
schedule_work(&memcg->high_work);
break;
}
-   current->memcg_nr_pages_over_high += batch;
-   set_notify_resume(current);
+   if (gfpflags_allow_blocking(gfp_mask)) {
+   reclaim_high(memcg, nr_pages, GFP_KERNEL);
+   } else {
+   current->memcg_nr_pages_over_high += batch;
+   set_notify_resume(current);
+   }
break;
}
} while ((memcg = parent_mem_cgroup(memcg)));
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-02 Thread Konstantin Khlebnikov




On 02.08.2019 12:40, Michal Hocko wrote:
> On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> > On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> > > On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > > >   ret = -ERESTARTSYS;
> > > >   goto out;
> > > >   }
> > > > - cond_resched();
> > > >
> > > > + /* Reclaim memory over high limit before stocking too much */
> > > > + mem_cgroup_handle_over_high(true);
> > >
> > > I'd rather this remained part of the try_charge() call. The code
> > > comment in try_charge says this:
> > >
> > >  * We can perform reclaim here if __GFP_RECLAIM but let's
> > >  * always punt for simplicity and so that GFP_KERNEL can
> > >  * consistently be used during reclaim.
> > >
> > > The simplicity argument doesn't hold true anymore once we have to add
> > > manual calls into allocation sites. We should instead fix try_charge()
> > > to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> > > return when actually needed.
> >
> > Agreed. If we want to do direct reclaim on the high limit breach then it
> > should go into try_charge the same way we do hard limit reclaim there. I am
> > not yet sure about how/whether to scale the excess. The only reason to
> > move reclaim to the return-to-userspace path was GFP_NOWAIT charges. As you
> > say, maybe we should start by always performing the reclaim for
> > sleepable contexts first and only defer for non-sleeping requests.
>
> In other words, something like the patch below (completely untested). Could
> you give it a try, Konstantin?

This should work but will also eliminate all the benefits of deferred reclaim:
bigger batching and running without any locks.

After that, the gap between high and max will work just as a reserve for
atomic allocations.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba9138a4a1de..53a35c526e43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> schedule_work(&memcg->high_work);
> break;
> }
> -   current->memcg_nr_pages_over_high += batch;
> -   set_notify_resume(current);
> +   if (gfpflags_allow_blocking(gfp_mask)) {
> +   reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +   } else {
> +   current->memcg_nr_pages_over_high += batch;
> +   set_notify_resume(current);
> +   }
> break;
> }
> } while ((memcg = parent_mem_cgroup(memcg)));



Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-02 Thread Michal Hocko
On Thu 01-08-19 14:00:51, Yang Shi wrote:
> On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
> >
> > On Mon 29-07-19 10:28:43, Yang Shi wrote:
> > [...]
> > > I don't worry too much about scale since the scale issue is not unique
> > > to background reclaim; direct reclaim may run into the same problem.
> >
> > Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> > You can have thousands of memcgs and I do not think we really want
> > to create one kswapd for each. Once we have a kswapd thread pool then we
> > get into a tricky land where determinism/fairness would be non-trivial
> > to achieve. Direct reclaim, on the other hand, is bound by the workload
> > itself.
> 
> Yes, I agree a thread pool would introduce more latency than a dedicated
> kswapd thread. But it doesn't look that bad in our tests. When memory
> allocation is fast, even a dedicated kswapd thread can't catch up.
> So such background reclaim is best effort, not guaranteed.
>
> I don't quite get what you mean about fairness. Do you mean they may
> spend excessive cpu time and cause starvation of other processes? I
> think this could be mitigated by properly organizing and configuring
> the groups. But I agree this is tricky.

No, I meant that the cost of reclaiming a unit of charges (e.g.
SWAP_CLUSTER_MAX) is not constant and depends on the state of the memory
on LRUs. Therefore any thread pool mechanism would lead to unfair
reclaim and non-deterministic behavior.

I can imagine a middle ground where the background reclaim would have to
be an opt-in feature and a dedicated kernel thread would be assigned to
the particular memcg (hierarchy).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-08-01 Thread Yang Shi
On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko  wrote:
>
> On Mon 29-07-19 10:28:43, Yang Shi wrote:
> [...]
> > I don't worry too much about scale since the scale issue is not unique
> > to background reclaim; direct reclaim may run into the same problem.
>
> Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
> You can have thousands of memcgs and I do not think we really want
> to create one kswapd for each. Once we have a kswapd thread pool then we
> get into a tricky land where determinism/fairness would be non-trivial
> to achieve. Direct reclaim, on the other hand, is bound by the workload
> itself.

Yes, I agree a thread pool would introduce more latency than a dedicated
kswapd thread. But it doesn't look that bad in our tests. When memory
allocation is fast, even a dedicated kswapd thread can't catch up.
So such background reclaim is best effort, not guaranteed.

I don't quite get what you mean about fairness. Do you mean they may
spend excessive cpu time and cause starvation of other processes? I
think this could be mitigated by properly organizing and configuring
the groups. But I agree this is tricky.

Typically, the processes are placed into different cgroups according
to their importance and priority. For example, in our cluster, system
processes go to the system group, while latency-sensitive jobs and
batch jobs (usually second-class citizens) go to different groups.
The memcg kswapd would be enabled for the latency-sensitive groups
only. The memcg kswapd threads would have the same priority as the
global kswapd.

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Michal Hocko
On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > ret = -ERESTARTSYS;
> > goto out;
> > }
> > -   cond_resched();
> >  
> > +   /* Reclaim memory over high limit before stocking too much */
> > +   mem_cgroup_handle_over_high(true);
> 
> I'd rather this remained part of the try_charge() call. The code
> comment in try_charge says this:
> 
>* We can perform reclaim here if __GFP_RECLAIM but let's
>* always punt for simplicity and so that GFP_KERNEL can
>* consistently be used during reclaim.
> 
> The simplicity argument doesn't hold true anymore once we have to add
> manual calls into allocation sites. We should instead fix try_charge()
> to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
> return when actually needed.

Agreed. If we want to do direct reclaim on the high limit breach then it
should go into try_charge the same way we do hard limit reclaim there. I am
not yet sure about how/whether to scale the excess. The only reason to
move reclaim to the return-to-userspace path was GFP_NOWAIT charges. As you
say, maybe we should start by always performing the reclaim for
sleepable contexts first and only defer for non-sleeping requests.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Michal Hocko
On Mon 29-07-19 10:28:43, Yang Shi wrote:
[...]
> I don't worry too much about scale since the scale issue is not unique
> to background reclaim; direct reclaim may run into the same problem.

Just to clarify. By scaling problem I mean 1:1 kswapd thread to memcg.
You can have thousands of memcgs and I do not think we really want
to create one kswapd for each. Once we have a kswapd thread pool then we
get into a tricky land where determinism/fairness would be non-trivial
to achieve. Direct reclaim, on the other hand, is bound by the workload
itself.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Yang Shi
On Mon, Jul 29, 2019 at 3:33 AM Michal Hocko  wrote:
>
> On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> > On 29.07.2019 12:17, Michal Hocko wrote:
> > > On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> > > > High memory limit in memory cgroup allows batching memory reclaim and
> > > > deferring it until returning into userland. This moves it out of any locks.
> > > >
> > > > A fixed gap between the high and max limits works pretty well (we are using
> > > > 64 * NR_CPUS pages) except for cases when one syscall allocates tons of
> > > > memory. This affects all other tasks in the cgroup because they might hit
> > > > the max memory limit in unhandy places and/or under hot locks.
> > > >
> > > > For example mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> > > > of pages and push memory cgroup usage far beyond the high memory limit.
> > > >
> > > > This patch uses the point halfway between the high and max limits as a
> > > > threshold: when mem_cgroup_handle_over_high() is called with the argument
> > > > only_severe = true, it starts memory reclaim only past that threshold;
> > > > otherwise reclaim is deferred till returning into userland. If the high
> > > > limit isn't set, nothing changes.
> > > >
> > > > Now long running get_user_pages will periodically reclaim cgroup memory.
> > > > Other possible targets are generic file read/write iter loops.
> > >
> > > I do see how gup can lead to a large high limit excess, but could you be
> > > more specific about why that is a problem? We should be reclaiming a
> > > similar number of pages cumulatively.
> > >
> >
> > Large gup might push the usage close to the limit and keep it there for
> > some time. As a result concurrent allocations will enter direct reclaim
> > right at charging much more frequently.
>
> Yes, this is indeed possible. On the other hand, even reclaim from
> the charge path doesn't really prevent that from happening because the
> context might get preempted or blocked on locks. So I guess we need
> more detailed information about an actual real-world visible problem here.
>
> > Right now deferred reclaim after passing the high limit works like a
> > distributed memcg kswapd which reclaims memory in the "background" and
> > prevents completely synchronous direct reclaim.
> >
> > Maybe somebody has plans for a real kswapd for memcg?
>
> I am not aware of that. The primary problem back then was that we simply
> cannot have a kernel thread per memcg because that doesn't scale.
> Using kthreads and a dynamic pool of threads tends to be quite tricky -
> e.g. proper accounting, and scaling again.

We did discuss this topic at last year's LSF/MM, please see:
https://lwn.net/Articles/753162/. We (Alibaba) have had the memcg
kswapd thing working in our production environment for a while, and it
worked well for our 11.11 shopping festival flood.

I did plan to post the patches upstream, but I was distracted by
something else for a long time. By now I have redesigned it and already
have some preliminary patches working; if you are interested in
this I would like to post the patches soon to gather some comments early.

However, some of the issues mentioned by Michal do still exist, e.g.
accounting. I have not thought too much about accounting yet. I
recall Johannes mentioned they were working on accounting for kswapd
back then. But I'm not sure if they are still working on that or not,
or whether he just meant the throttling solved by commit
2cf855837b89d92996cf264713f3bed2bf9b0b4f ("memcontrol: schedule
throttling if we are congested"). But I vaguely recall that accounting
did not sound very critical.

I don't worry too much about scale since the scale issue is not unique
to background reclaim; direct reclaim may run into the same problem.
If you run into extreme memory pressure, with a lot of memcgs in direct
reclaim and global reclaim running at the same time, your
machine is definitely already unusable. And in our usecase memcg
background reclaim is mainly used by some latency-sensitive memcgs
(running latency-sensitive applications) to try to minimize direct
reclaim; other memcgs had better be throttled by direct reclaim
if they consume too much memory.

Regards,
Yang

>
> > I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
> > later that gave me the idea that this is a good place for running any
> > deferred work, like a bottom half for tasks. Right now this happens
> > only when switching into userspace.
>
> I am not against pushing high memory reclaim into the charge path in
> principle. I just want to hear how big of a problem this really is in
> practice. If this is mostly a theoretical problem that might hit then I
> would rather stick with the existing code though.
>
> --
> Michal Hocko
> SUSE Labs
>


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Johannes Weiner
On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   ret = -ERESTARTSYS;
>   goto out;
>   }
> - cond_resched();
>  
> + /* Reclaim memory over high limit before stocking too much */
> + mem_cgroup_handle_over_high(true);

I'd rather this remained part of the try_charge() call. The code
comment in try_charge says this:

 * We can perform reclaim here if __GFP_RECLAIM but let's
 * always punt for simplicity and so that GFP_KERNEL can
 * consistently be used during reclaim.

The simplicity argument doesn't hold true anymore once we have to add
manual calls into allocation sites. We should instead fix try_charge()
to do synchronous reclaim for __GFP_RECLAIM and only punt to userspace
return when actually needed.


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Konstantin Khlebnikov

On 29.07.2019 13:33, Michal Hocko wrote:
> On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> > On 29.07.2019 12:17, Michal Hocko wrote:
> > > On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> > > > High memory limit in memory cgroup allows batching memory reclaim and
> > > > deferring it until returning into userland. This moves it out of any locks.
> > > >
> > > > A fixed gap between the high and max limits works pretty well (we are using
> > > > 64 * NR_CPUS pages) except for cases when one syscall allocates tons of
> > > > memory. This affects all other tasks in the cgroup because they might hit
> > > > the max memory limit in unhandy places and/or under hot locks.
> > > >
> > > > For example mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> > > > of pages and push memory cgroup usage far beyond the high memory limit.
> > > >
> > > > This patch uses the point halfway between the high and max limits as a
> > > > threshold: when mem_cgroup_handle_over_high() is called with the argument
> > > > only_severe = true, it starts memory reclaim only past that threshold;
> > > > otherwise reclaim is deferred till returning into userland. If the high
> > > > limit isn't set, nothing changes.
> > > >
> > > > Now long running get_user_pages will periodically reclaim cgroup memory.
> > > > Other possible targets are generic file read/write iter loops.
> > >
> > > I do see how gup can lead to a large high limit excess, but could you be
> > > more specific about why that is a problem? We should be reclaiming a
> > > similar number of pages cumulatively.
> >
> > Large gup might push the usage close to the limit and keep it there for
> > some time. As a result concurrent allocations will enter direct reclaim
> > right at charging much more frequently.
>
> Yes, this is indeed possible. On the other hand, even reclaim from
> the charge path doesn't really prevent that from happening because the
> context might get preempted or blocked on locks. So I guess we need
> more detailed information about an actual real-world visible problem here.
>
> > Right now deferred reclaim after passing the high limit works like a
> > distributed memcg kswapd which reclaims memory in the "background" and
> > prevents completely synchronous direct reclaim.
> >
> > Maybe somebody has plans for a real kswapd for memcg?
>
> I am not aware of that. The primary problem back then was that we simply
> cannot have a kernel thread per memcg because that doesn't scale.
> Using kthreads and a dynamic pool of threads tends to be quite tricky -
> e.g. proper accounting, and scaling again.

Yep, for containers proper accounting is important, especially cpu usage.

We're using manual kswapd-style reclaim in userspace via MADV_STOCKPILE
within containers where memory allocation latency is critical.

This patch is about less extreme cases which would be nice to handle
automatically, without custom tuning.

> > I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
> > later that gave me the idea that this is a good place for running any
> > deferred work, like a bottom half for tasks. Right now this happens
> > only when switching into userspace.
>
> I am not against pushing high memory reclaim into the charge path in
> principle. I just want to hear how big of a problem this really is in
> practice. If this is mostly a theoretical problem that might hit then I
> would rather stick with the existing code though.

Besides latency, which might not be so important for everybody, I see these:

The first problem is fairness within a cgroup - a task that generates the
allocation flow isn't throttled after passing the high limit as the
documentation states. It will feel memory pressure only after hitting the
max limit, while other tasks with smaller allocations will go into direct
reclaim right away.

The second is accumulating too much deferred reclaim - after a large gup the
task might call direct reclaim with a target amount much larger than the gap
between the high and max limits, or even larger than the max limit itself.


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Michal Hocko
On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> On 29.07.2019 12:17, Michal Hocko wrote:
> > On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> > > High memory limit in memory cgroup allows batching memory reclaim and
> > > deferring it until returning into userland. This moves it out of any locks.
> > >
> > > A fixed gap between the high and max limits works pretty well (we are using
> > > 64 * NR_CPUS pages) except for cases when one syscall allocates tons of
> > > memory. This affects all other tasks in the cgroup because they might hit
> > > the max memory limit in unhandy places and/or under hot locks.
> > >
> > > For example mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> > > of pages and push memory cgroup usage far beyond the high memory limit.
> > >
> > > This patch uses the point halfway between the high and max limits as a
> > > threshold: when mem_cgroup_handle_over_high() is called with the argument
> > > only_severe = true, it starts memory reclaim only past that threshold;
> > > otherwise reclaim is deferred till returning into userland. If the high
> > > limit isn't set, nothing changes.
> > >
> > > Now long running get_user_pages will periodically reclaim cgroup memory.
> > > Other possible targets are generic file read/write iter loops.
> > 
> > I do see how gup can lead to a large high limit excess, but could you be
> > more specific about why that is a problem? We should be reclaiming a
> > similar number of pages cumulatively.
> > 
> 
> Large gup might push the usage close to the limit and keep it there for
> some time. As a result concurrent allocations will enter direct reclaim
> right at charging much more frequently.

Yes, this is indeed possible. On the other hand, even reclaim from
the charge path doesn't really prevent that from happening because the
context might get preempted or blocked on locks. So I guess we need
more detailed information about an actual real-world visible problem here.
 
> Right now deferred reclaim after passing the high limit works like a
> distributed memcg kswapd which reclaims memory in the "background" and
> prevents completely synchronous direct reclaim.
>
> Maybe somebody has plans for a real kswapd for memcg?

I am not aware of that. The primary problem back then was that we simply
cannot have a kernel thread per memcg because that doesn't scale.
Using kthreads and a dynamic pool of threads tends to be quite tricky -
e.g. proper accounting, and scaling again.
 
> I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
> later that gave me the idea that this is a good place for running any
> deferred work, like a bottom half for tasks. Right now this happens
> only when switching into userspace.

I am not against pushing high memory reclaim into the charge path in
principle. I just want to hear how big of a problem this really is in
practice. If this is mostly a theoretical problem that might hit then I
would rather stick with the existing code though.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Konstantin Khlebnikov

On 29.07.2019 12:17, Michal Hocko wrote:
> On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> > High memory limit in memory cgroup allows batching memory reclaim and
> > deferring it until returning into userland. This moves it out of any locks.
> >
> > A fixed gap between the high and max limits works pretty well (we are using
> > 64 * NR_CPUS pages) except for cases when one syscall allocates tons of
> > memory. This affects all other tasks in the cgroup because they might hit
> > the max memory limit in unhandy places and/or under hot locks.
> >
> > For example mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> > of pages and push memory cgroup usage far beyond the high memory limit.
> >
> > This patch uses the point halfway between the high and max limits as a
> > threshold: when mem_cgroup_handle_over_high() is called with the argument
> > only_severe = true, it starts memory reclaim only past that threshold;
> > otherwise reclaim is deferred till returning into userland. If the high
> > limit isn't set, nothing changes.
> >
> > Now long running get_user_pages will periodically reclaim cgroup memory.
> > Other possible targets are generic file read/write iter loops.
>
> I do see how gup can lead to a large high limit excess, but could you be
> more specific about why that is a problem? We should be reclaiming a
> similar number of pages cumulatively.

Large gup might push the usage close to the limit and keep it there for
some time. As a result concurrent allocations will enter direct reclaim
right at charging much more frequently.

Right now deferred reclaim after passing the high limit works like a
distributed memcg kswapd which reclaims memory in the "background" and
prevents completely synchronous direct reclaim.

Maybe somebody has plans for a real kswapd for memcg?

I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
later that gave me the idea that this is a good place for running any
deferred work, like a bottom half for tasks. Right now this happens
only when switching into userspace.


Re: [PATCH RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop

2019-07-29 Thread Michal Hocko
On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> High memory limit in memory cgroup allows batching memory reclaim and
> deferring it until returning into userland. This moves it out of any locks.
>
> A fixed gap between the high and max limits works pretty well (we are using
> 64 * NR_CPUS pages) except for cases when one syscall allocates tons of
> memory. This affects all other tasks in the cgroup because they might hit
> the max memory limit in unhandy places and/or under hot locks.
>
> For example mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> of pages and push memory cgroup usage far beyond the high memory limit.
>
> This patch uses the point halfway between the high and max limits as a
> threshold: when mem_cgroup_handle_over_high() is called with the argument
> only_severe = true, it starts memory reclaim only past that threshold;
> otherwise reclaim is deferred till returning into userland. If the high
> limit isn't set, nothing changes.
>
> Now long running get_user_pages will periodically reclaim cgroup memory.
> Other possible targets are generic file read/write iter loops.
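
An illustrative sketch of the check described above (the memcontrol.c side
of the RFC is not quoted in this thread, so everything beyond the
mem_cgroup_handle_over_high(only_severe) signature is guessed):

	void mem_cgroup_handle_over_high(bool only_severe)
	{
		unsigned long nr_pages = current->memcg_nr_pages_over_high;
		struct mem_cgroup *memcg;

		if (likely(!nr_pages))
			return;

		memcg = get_mem_cgroup_from_mm(current->mm);
		if (only_severe) {
			unsigned long high = READ_ONCE(memcg->high);
			unsigned long max = READ_ONCE(memcg->memory.max);

			/* not "severe" yet: leave the reclaim for the
			 * return-to-userspace path */
			if (page_counter_read(&memcg->memory) <
			    high + (max - high) / 2)
				goto out;
		}
		reclaim_high(memcg, nr_pages, GFP_KERNEL);
		current->memcg_nr_pages_over_high = 0;
	out:
		css_put(&memcg->css);
	}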

I do see how gup can lead to a large high limit excess, but could you be
more specific about why that is a problem? We should be reclaiming a
similar number of pages cumulatively.
-- 
Michal Hocko
SUSE Labs