Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Joel Fernandes
On Thu, Apr 11, 2019 at 10:11:51PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> > On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> > > On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > > > reclaim rate of a killed process and to make this rate more 
> > > > > > deterministic.
> > > > > > By no means the proposed solution is considered the best and was 
> > > > > > chosen
> > > > > > because it was simple to implement and allowed for test data 
> > > > > > collection.
> > > > > > The downside of this solution is that it requires additional 
> > > > > > “expedite”
> > > > > > hint for something which has to be fast in all cases. Would be 
> > > > > > great to
> > > > > > find a way that does not require additional hints.
> > > > >
> > > > > I have to say I do not like this much. It is abusing an implementation
> > > > > detail of the OOM implementation and makes it an official API. Also
> > > > > there are some non trivial assumptions to be fullfilled to use the
> > > > > current oom_reaper. First of all all the process groups that share the
> > > > > address space have to be killed. How do you want to 
> > > > > guarantee/implement
> > > > > that with a simply kill to a thread/process group?
> > > > 
> > > > Will task_will_free_mem() not bail out in such cases because of
> > > > process_shares_mm() returning true?
> > > 
> > > I am not really sure I understand your question. task_will_free_mem is
> > > just a shortcut to not kill anything if the current process or a victim
> > > is already dying and likely to free memory without killing or spamming
> > > the log. My concern is that this patch allows to invoke the reaper
> > 
> > Got it.
> > 
> > > without guaranteeing the same. So it can only be an optimistic attempt
> > > and then I am wondering how reasonable of an interface this really is.
> > > Userspace send the signal and has no way to find out whether the async
> > > reaping has been scheduled or not.
> > 
> > Could you clarify more what you're asking to guarantee? I cannot picture it.
> > If you mean guaranteeing that "a task is dying anyway and will free its
> > memory on its own", we are calling task_will_free_mem() to check that before
> > invoking the oom reaper.
> 
> No, I am talking about the API aspect. Say you kall kill with the flag
> to make the async address space tear down. Now you cannot really
> guarantee that this is safe to do because the target task might
> clone(CLONE_VM) at any time. So this will be known only once the signal
> is sent, but the calling process has no way to find out. So the caller
> has no way to know what is the actual result of the requested operation.
> That is a poor API in my book.
> 
> > Could you clarify what is the draback if OOM reaper is invoked in parallel 
> > to
> > an exiting task which will free its memory soon? It looks like the OOM 
> > reaper
> > is taking all the locks necessary (mmap_sem) in particular and is unmapping
> > pages. It seemed to me to be safe, but I am missing what are the main draw
> > backs of this - other than the intereference with core dump. One could be
> > presumably scalability since the since OOM reaper could be bottlenecked by
> > freeing memory on behalf of potentially several dying tasks.
> 
> oom_reaper or any other kernel thread doing the same is a mere
> implementation detail I think. The oom killer doesn't really need the
> oom_reaper to act swiftly because it is there to act as a last resort if
> the oom victim cannot terminate on its own. If you want to offer an
> user space API then you can assume users will like to use it and expect
> a certain behavior but what that is? E.g. what if there are thousands of
> tasks killed this way? Do we care that some of them will not get the
> async treatment? If yes why do we need an API to control that at all?
> 
> Am I more clear now?

Yes, your concerns are more clear now. We will think more about this and your
other responses, thanks a lot.

 - Joel



Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Michal Hocko
On Thu 11-04-19 12:56:32, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko  wrote:
> >
> > On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> > [...]
> > > > I would question whether we really need this at all? Relying on the exit
> > > > speed sounds like a fundamental design problem of anything that relies
> > > > on it.
> > >
> > > Relying on it is wrong, I agree. There are protections like allocation
> > > throttling that we can fall back to stop memory depletion. However
> > > having a way to free up resources that are not needed by a dying
> > > process quickly would help to avoid throttling which hurts user
> > > experience.
> >
> > I am not opposing speeding up the exit time in general. That is a good
> > thing. Especially for a very large processes (e.g. a DB). But I do not
> > really think we want to expose an API to control this specific aspect.
> 
> Great! Thanks for confirming that the intent is not worthless.
> There were a number of ideas floating both internally and in the 2/2
> of this patchset. I would like to get some input on which
> implementation would be preferable. From your answer sounds like you
> think it should be a generic feature, should not require any new APIs
> or hints from the userspace and should be conducted for all kills
> unconditionally (irrespective of memory pressure, who is waiting for
> victim's death, etc.). Do I understand correctly that this would be
> the preferred solution?

Yes, I think the general tear down solution is much more preferable than
a questionable API. How that solution should look like is an open
question. I am not sure myself to be honest.
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Michal Hocko
On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> > On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko  wrote:
> > > >
> > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > > reclaim rate of a killed process and to make this rate more 
> > > > > deterministic.
> > > > > By no means the proposed solution is considered the best and was 
> > > > > chosen
> > > > > because it was simple to implement and allowed for test data 
> > > > > collection.
> > > > > The downside of this solution is that it requires additional 
> > > > > “expedite”
> > > > > hint for something which has to be fast in all cases. Would be great 
> > > > > to
> > > > > find a way that does not require additional hints.
> > > >
> > > > I have to say I do not like this much. It is abusing an implementation
> > > > detail of the OOM implementation and makes it an official API. Also
> > > > there are some non trivial assumptions to be fullfilled to use the
> > > > current oom_reaper. First of all all the process groups that share the
> > > > address space have to be killed. How do you want to guarantee/implement
> > > > that with a simply kill to a thread/process group?
> > > 
> > > Will task_will_free_mem() not bail out in such cases because of
> > > process_shares_mm() returning true?
> > 
> > I am not really sure I understand your question. task_will_free_mem is
> > just a shortcut to not kill anything if the current process or a victim
> > is already dying and likely to free memory without killing or spamming
> > the log. My concern is that this patch allows to invoke the reaper
> 
> Got it.
> 
> > without guaranteeing the same. So it can only be an optimistic attempt
> > and then I am wondering how reasonable of an interface this really is.
> > Userspace send the signal and has no way to find out whether the async
> > reaping has been scheduled or not.
> 
> Could you clarify more what you're asking to guarantee? I cannot picture it.
> If you mean guaranteeing that "a task is dying anyway and will free its
> memory on its own", we are calling task_will_free_mem() to check that before
> invoking the oom reaper.

No, I am talking about the API aspect. Say you kall kill with the flag
to make the async address space tear down. Now you cannot really
guarantee that this is safe to do because the target task might
clone(CLONE_VM) at any time. So this will be known only once the signal
is sent, but the calling process has no way to find out. So the caller
has no way to know what is the actual result of the requested operation.
That is a poor API in my book.

> Could you clarify what is the draback if OOM reaper is invoked in parallel to
> an exiting task which will free its memory soon? It looks like the OOM reaper
> is taking all the locks necessary (mmap_sem) in particular and is unmapping
> pages. It seemed to me to be safe, but I am missing what are the main draw
> backs of this - other than the intereference with core dump. One could be
> presumably scalability since the since OOM reaper could be bottlenecked by
> freeing memory on behalf of potentially several dying tasks.

oom_reaper or any other kernel thread doing the same is a mere
implementation detail I think. The oom killer doesn't really need the
oom_reaper to act swiftly because it is there to act as a last resort if
the oom victim cannot terminate on its own. If you want to offer an
user space API then you can assume users will like to use it and expect
a certain behavior but what that is? E.g. what if there are thousands of
tasks killed this way? Do we care that some of them will not get the
async treatment? If yes why do we need an API to control that at all?

Am I more clear now?

-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Suren Baghdasaryan
On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko  wrote:
>
> On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> [...]
> > > I would question whether we really need this at all? Relying on the exit
> > > speed sounds like a fundamental design problem of anything that relies
> > > on it.
> >
> > Relying on it is wrong, I agree. There are protections like allocation
> > throttling that we can fall back to stop memory depletion. However
> > having a way to free up resources that are not needed by a dying
> > process quickly would help to avoid throttling which hurts user
> > experience.
>
> I am not opposing speeding up the exit time in general. That is a good
> thing. Especially for a very large processes (e.g. a DB). But I do not
> really think we want to expose an API to control this specific aspect.

Great! Thanks for confirming that the intent is not worthless.
There were a number of ideas floating both internally and in the 2/2
of this patchset. I would like to get some input on which
implementation would be preferable. From your answer sounds like you
think it should be a generic feature, should not require any new APIs
or hints from the userspace and should be conducted for all kills
unconditionally (irrespective of memory pressure, who is waiting for
victim's death, etc.). Do I understand correctly that this would be
the preferred solution?

> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups 
> "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Joel Fernandes
On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko  wrote:
> > >
> > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > [...]
> > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > reclaim rate of a killed process and to make this rate more 
> > > > deterministic.
> > > > By no means the proposed solution is considered the best and was chosen
> > > > because it was simple to implement and allowed for test data collection.
> > > > The downside of this solution is that it requires additional “expedite”
> > > > hint for something which has to be fast in all cases. Would be great to
> > > > find a way that does not require additional hints.
> > >
> > > I have to say I do not like this much. It is abusing an implementation
> > > detail of the OOM implementation and makes it an official API. Also
> > > there are some non trivial assumptions to be fullfilled to use the
> > > current oom_reaper. First of all all the process groups that share the
> > > address space have to be killed. How do you want to guarantee/implement
> > > that with a simply kill to a thread/process group?
> > 
> > Will task_will_free_mem() not bail out in such cases because of
> > process_shares_mm() returning true?
> 
> I am not really sure I understand your question. task_will_free_mem is
> just a shortcut to not kill anything if the current process or a victim
> is already dying and likely to free memory without killing or spamming
> the log. My concern is that this patch allows to invoke the reaper

Got it.

> without guaranteeing the same. So it can only be an optimistic attempt
> and then I am wondering how reasonable of an interface this really is.
> Userspace send the signal and has no way to find out whether the async
> reaping has been scheduled or not.

Could you clarify more what you're asking to guarantee? I cannot picture it.
If you mean guaranteeing that "a task is dying anyway and will free its
memory on its own", we are calling task_will_free_mem() to check that before
invoking the oom reaper.

Could you clarify what is the draback if OOM reaper is invoked in parallel to
an exiting task which will free its memory soon? It looks like the OOM reaper
is taking all the locks necessary (mmap_sem) in particular and is unmapping
pages. It seemed to me to be safe, but I am missing what are the main draw
backs of this - other than the intereference with core dump. One could be
presumably scalability since the since OOM reaper could be bottlenecked by
freeing memory on behalf of potentially several dying tasks.

IIRC this patch is just Ok with being opportunistic and it need not be hidden
behind an API necessarily or need any guarantees. It is just providing a hint
that the OOM reaper could be woken up to expedite things. If a task is going
to be taking a long time to be scheduled and free its memory, the oom reaper
gives a headstart.  Many of the times, background tasks can be killed but
they may not have necessarily sufficient scheduler priority / cpuset (being
in the background) and may be holding onto a lot of memory that needs to be
reclaimed.

I am not saying this the right way to do it, but I also wanted us to
understand the drawbacks so that we can go back to the drawing board and come
up with something better.

Thanks!

 - Joel






Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Michal Hocko
On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
[...]
> > I would question whether we really need this at all? Relying on the exit
> > speed sounds like a fundamental design problem of anything that relies
> > on it.
> 
> Relying on it is wrong, I agree. There are protections like allocation
> throttling that we can fall back to stop memory depletion. However
> having a way to free up resources that are not needed by a dying
> process quickly would help to avoid throttling which hurts user
> experience.

I am not opposing speeding up the exit time in general. That is a good
thing. Especially for a very large processes (e.g. a DB). But I do not
really think we want to expose an API to control this specific aspect.
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Michal Hocko
On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko  wrote:
> >
> > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > [...]
> > > Proposed solution uses existing oom-reaper thread to increase memory
> > > reclaim rate of a killed process and to make this rate more deterministic.
> > > By no means the proposed solution is considered the best and was chosen
> > > because it was simple to implement and allowed for test data collection.
> > > The downside of this solution is that it requires additional “expedite”
> > > hint for something which has to be fast in all cases. Would be great to
> > > find a way that does not require additional hints.
> >
> > I have to say I do not like this much. It is abusing an implementation
> > detail of the OOM implementation and makes it an official API. Also
> > there are some non trivial assumptions to be fullfilled to use the
> > current oom_reaper. First of all all the process groups that share the
> > address space have to be killed. How do you want to guarantee/implement
> > that with a simply kill to a thread/process group?
> 
> Will task_will_free_mem() not bail out in such cases because of
> process_shares_mm() returning true?

I am not really sure I understand your question. task_will_free_mem is
just a shortcut to not kill anything if the current process or a victim
is already dying and likely to free memory without killing or spamming
the log. My concern is that this patch allows to invoke the reaper
without guaranteeing the same. So it can only be an optimistic attempt
and then I am wondering how reasonable of an interface this really is.
Userspace send the signal and has no way to find out whether the async
reaping has been scheduled or not.
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Johannes Weiner
On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

I don't think it's flawed, it's just optimizing the user experience as
best as it can. You don't want to kill things prematurely, but once
there is pressure you want to rectify it quickly. That's valid.

We have a tool that does this, side effect or not, so I think it's
fair to try to make use of it when oom killing from userspace (which
we explictily support with oom_control in cgroup1 and memory.high in
cgroup2, and it's not just an Android thing).

The question is how explicit a contract we want to make with
userspace, and I would much prefer to not overpromise on a best-effort
thing like this, or even making the oom reaper ABI.

If unconditionally reaping killed tasks is too expensive, I'd much
prefer a simple kill hint over an explicit task reclaim interface.


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Suren Baghdasaryan
Thanks for the feedback!

On Thu, Apr 11, 2019 at 3:51 AM Michal Hocko  wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API.

I agree with you that this particular implementation is abusing oom
internal machinery and I don't think it is acceptable as is (hence
this is sent as an RFC). I would like to discuss the viability of the
idea of reaping kill victim's mm asynchronously. If we agree that this
is worth our time, only then I would love to get into more details on
how to implement it. The implementation in this RFC is a convenient
way to illustrate the idea and to collect test data.

> Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?
>

I'm not sure I understood this correctly but if you are asking how do
we know that the mm we are reaping is not shared with processes that
are not being killed then I think your task_will_free_mem() checks for
that. Or have I misunderstood your question?

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

Relying on it is wrong, I agree. There are protections like allocation
throttling that we can fall back to stop memory depletion. However
having a way to free up resources that are not needed by a dying
process quickly would help to avoid throttling which hurts user
experience.
I agree that this is an optimization which is beneficial in a specific
case - when we kill to free up resources. However this is an important
optimization for systems with low memory resources like embedded
systems, phones, etc. The only way to prevent being cornered into
throttling is to increase the free memory margin that system needs to
maintain (I describe this in my cover letter). And with limited
overall memory resources memory space is at a premium, so we try to
decrease that margin.
I think the other and arguably even more important issue than the
speed of memory reclaim is that this speed depends on what the victim
is doing at the time of a kill. This introduces non-determinism in how
fast we can free up resource and at this point we don't even know how
much safety margin we need.

> Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups 
> "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Sandeep Patil
On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
> 
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?
> 
> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
> 
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

OTOH, we want to keep as many processes around as possible for recency. In which
case, the exit path (particularly the memory reclaim) becomes critical to
maintain interactivity for phones.

Android keeps processes around because cold starting applications is much
slower than simply bringing them up from background. This obviously presents
the problem when a background application _is_ killed, it is almost always to
address sudden spike in memory needs by something else much more important
and user visible. e.g. a foreground application or critical system process.

> Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

With OOM killer, things are already really bad. When lmkd[1] kills processes,
it is doing so to serve the immediate needs of the system while trying to
avoid the OOM killer.


- ssp

1] 
https://android.googlesource.com/platform/system/core/+/refs/heads/master/lmkd/


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Joel Fernandes
On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko  wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?

Will task_will_free_mem() not bail out in such cases because of
process_shares_mm() returning true? AFAIU, Suren's patch calls that.
Also, if I understand correctly, this patch is opportunistic and knows
what it may not be possible to reap in advance this way in all cases.
/*
 * Make sure that all tasks which share the mm with the given tasks
 * are dying as well to make sure that a) nobody pins its mm and
 * b) the task is also reapable by the oom reaper.
 */
rcu_read_lock();
for_each_process(p) {
if (!process_shares_mm(p, mm))

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

Per the data collected by Suren, it does speed things up. It would be
nice if we can reuse this mechanism, or come up with a similar
mechanism.

thanks,

 - Joel


Re: [RFC 0/2] opportunistic memory reclaim of a killed process

2019-04-11 Thread Michal Hocko
On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
[...]
> Proposed solution uses existing oom-reaper thread to increase memory
> reclaim rate of a killed process and to make this rate more deterministic.
> By no means the proposed solution is considered the best and was chosen
> because it was simple to implement and allowed for test data collection.
> The downside of this solution is that it requires additional “expedite”
> hint for something which has to be fast in all cases. Would be great to
> find a way that does not require additional hints.

I have to say I do not like this much. It is abusing an implementation
detail of the OOM implementation and makes it an official API. Also
there are some non trivial assumptions to be fullfilled to use the
current oom_reaper. First of all all the process groups that share the
address space have to be killed. How do you want to guarantee/implement
that with a simply kill to a thread/process group?

> Other possible approaches include:
> - Implementing a dedicated syscall to perform opportunistic reclaim in the
> context of the process waiting for the victim’s death. A natural boost
> bonus occurs if the waiting process has high or RT priority and is not
> limited by cpuset cgroup in its CPU choices.
> - Implement a mechanism that would perform opportunistic reclaim if it’s
> possible unconditionally (similar to checks in task_will_free_mem()).
> - Implement opportunistic reclaim that uses shrinker interface, PSI or
> other memory pressure indications as a hint to engage.

I would question whether we really need this at all? Relying on the exit
speed sounds like a fundamental design problem of anything that relies
on it. Sure task exit might be slow, but async mm tear down is just a
mere optimization this is not guaranteed to really help in speading
things up. OOM killer uses it as a guarantee for a forward progress in a
finite time rather than as soon as possible.

-- 
Michal Hocko
SUSE Labs