Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Yang Shi




On 1/4/19 2:57 PM, Yang Shi wrote:



On 1/4/19 12:03 PM, Greg Thelen wrote:

Yang Shi  wrote:


On 1/3/19 11:23 AM, Michal Hocko wrote:

On Thu 03-01-19 11:10:00, Yang Shi wrote:

On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:

On 1/3/19 10:13 AM, Michal Hocko wrote:

[...]
Is there any reason for your scripts to be strictly sequential here? In
other words, why can't you offload those expensive operations to a
detached context in _userspace_?

I would say it does not have to be strictly sequential. The above script
is just an example to illustrate the pattern. But sometimes it may hit
such a pattern due to the complicated cluster scheduling and container
scheduling in the production environment; for example, the creation
process might be scheduled to the same CPU which is doing force_empty. I
have to say I don't know too much about the internals of the container
scheduling.

In that case I do not see a strong reason to implement the offloading
into the kernel. It is additional code and semantics to maintain.

Yes, it does introduce some additional code and semantics, but IMHO it is
quite simple and very straightforward, isn't it? It just utilizes the
existing css offline worker. And that couple of lines of code does improve
throughput for some real use cases.

I do not really care that it is a few LOC. It is more important that it
conflates force_empty into the offlining logic. There was a good reason to
remove reparenting/emptying the memcg during offline. Considering that you
can offload force_empty from userspace trivially, I do not see any reason
to implement it in the kernel.

Er, I may not articulate in the earlier email, force_empty can not be
offloaded from userspace *trivially*. IOWs the container scheduler may
unexpectedly overcommit something due to the stall of synchronous force
empty, which can't be figured out by userspace before it actually
happens. The scheduler doesn't know how long force_empty would take. If
the force_empty could be offloaded by kernel, it would make scheduler's
life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow.  I'm not sure if
that's important.


One thing I can think of is that this may slow down the recycling of memcg
IDs. This may cause memcg ID exhaustion for some extreme workloads. But
I don't see this as a problem in our workload.


Actually, sync force_empty should have the same side effect.

Yang



Thanks,
Yang



I assume that if we make force_empty an async side effect of rmdir then
the user space scheduler would not be able to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.


I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.

We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.

Sure. Will prepare the patches later.

Thanks,
Yang






Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Yang Shi




On 1/4/19 12:03 PM, Greg Thelen wrote:

Yang Shi  wrote:


On 1/3/19 11:23 AM, Michal Hocko wrote:

On Thu 03-01-19 11:10:00, Yang Shi wrote:

On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:

On 1/3/19 10:13 AM, Michal Hocko wrote:

[...]

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is just
an example to illustrate the pattern. But, sometimes it may hit such pattern
due to the complicated cluster scheduling and container scheduling in the
production environment, for example the creation process might be scheduled
to the same CPU which is doing force_empty. I have to say I don't know too
much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantic, but IMHO, it is
quite simple and very straight forward, isn't it? Just utilize the existing
css offline worker. And, that a couple of lines of code do improve some
throughput issues for some real usecases.

I do not really care it is few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.

Er, I may not articulate in the earlier email, force_empty can not be
offloaded from userspace *trivially*. IOWs the container scheduler may
unexpectedly overcommit something due to the stall of synchronous force
empty, which can't be figured out by userspace before it actually
happens. The scheduler doesn't know how long force_empty would take. If
the force_empty could be offloaded by kernel, it would make scheduler's
life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow.  I'm not sure if
that's important.


One thing I can think of is that this may slow down the recycling of memcg
IDs. This may cause memcg ID exhaustion for some extreme workloads. But
I don't see this as a problem in our workload.


Thanks,
Yang



I assume that if we make force_empty an async side effect of rmdir then
the user space scheduler would not be able to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.


I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.

We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.

Sure. Will prepare the patches later.

Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Yang Shi




On 1/4/19 12:03 PM, Greg Thelen wrote:

Yang Shi  wrote:


On 1/3/19 11:23 AM, Michal Hocko wrote:

On Thu 03-01-19 11:10:00, Yang Shi wrote:

On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:

On 1/3/19 10:13 AM, Michal Hocko wrote:

[...]

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is just
an example to illustrate the pattern. But, sometimes it may hit such pattern
due to the complicated cluster scheduling and container scheduling in the
production environment, for example the creation process might be scheduled
to the same CPU which is doing force_empty. I have to say I don't know too
much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantic, but IMHO, it is
quite simple and very straight forward, isn't it? Just utilize the existing
css offline worker. And, that a couple of lines of code do improve some
throughput issues for some real usecases.

I do not really care it is few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.

Er, I may not articulate in the earlier email, force_empty can not be
offloaded from userspace *trivially*. IOWs the container scheduler may
unexpectedly overcommit something due to the stall of synchronous force
empty, which can't be figured out by userspace before it actually
happens. The scheduler doesn't know how long force_empty would take. If
the force_empty could be offloaded by kernel, it would make scheduler's
life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow.  I'm not sure if
that's important.


Yes, it would grow. I'm not sure, but it seems fine with our workloads.

The reclaim can be placed at the last step of offlining, and it can be
interrupted by signals (a fatal signal in the current code).




I assume that if we make force_empty an async side effect of rmdir then
the user space scheduler would not be able to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.


Yes, that is the expected side effect; the memory reclaim would happen
within a short while. In this series I keep the sync reclaim behavior of
force_empty by checking the written value. Michal suggested a new knob to
do the offline reclaim and keep force_empty intact.


I think which knob to use is at the user's discretion.

Thanks,
Yang




I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.

We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.

Sure. Will prepare the patches later.

Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Greg Thelen
Yang Shi  wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
 On Thu 03-01-19 10:40:54, Yang Shi wrote:
> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>> Is there any reason for your scripts to be strictly sequential here? In
>> other words why cannot you offload those expensive operations to a
>> detached context in _userspace_?
> I would say it has not to be strictly sequential. The above script is just
> an example to illustrate the pattern. But, sometimes it may hit such 
> pattern
> due to the complicated cluster scheduling and container scheduling in the
> production environment, for example the creation process might be 
> scheduled
> to the same CPU which is doing force_empty. I have to say I don't know too
> much about the internals of the container scheduling.
 In that case I do not see a strong reason to implement the offloding
 into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be 
> offloaded from userspace *trivially*. IOWs the container scheduler may 
> unexpectedly overcommit something due to the stall of synchronous force 
> empty, which can't be figured out by userspace before it actually 
> happens. The scheduler doesn't know how long force_empty would take. If 
> the force_empty could be offloaded by kernel, it would make scheduler's 
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow.  I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir then
the user space scheduler would not be able to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.
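
For illustration, the polling variant could be a loop like the sketch
below; the threshold and interval are only placeholders:

# wait until enough memory is available before launching the next container
want_kb=$((2 * 1024 * 1024))   # e.g. require 2G of MemAvailable
while :; do
    avail_kb=$(awk '/MemAvailable:/ {print $2}' /proc/meminfo)
    [ "$avail_kb" -ge "$want_kb" ] && break
    sleep 1
done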

 I think it is more important to discuss whether we want to introduce
 force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang


Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Yang Shi




On 1/4/19 12:55 AM, Michal Hocko wrote:

On Thu 03-01-19 20:15:30, Yang Shi wrote:


On 1/3/19 12:01 PM, Michal Hocko wrote:

On Thu 03-01-19 11:49:32, Yang Shi wrote:

On 1/3/19 11:23 AM, Michal Hocko wrote:

On Thu 03-01-19 11:10:00, Yang Shi wrote:

On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:

On 1/3/19 10:13 AM, Michal Hocko wrote:

[...]

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is just
an example to illustrate the pattern. But, sometimes it may hit such pattern
due to the complicated cluster scheduling and container scheduling in the
production environment, for example the creation process might be scheduled
to the same CPU which is doing force_empty. I have to say I don't know too
much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantic, but IMHO, it is
quite simple and very straight forward, isn't it? Just utilize the existing
css offline worker. And, that a couple of lines of code do improve some
throughput issues for some real usecases.

I do not really care it is few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.

Er, I may not articulate in the earlier email, force_empty can not be
offloaded from userspace *trivially*. IOWs the container scheduler may
unexpectedly overcommit something due to the stall of synchronous force
empty, which can't be figured out by userspace before it actually happens.
The scheduler doesn't know how long force_empty would take. If the
force_empty could be offloaded by kernel, it would make scheduler's life
much easier. This is not something userspace could do.

What exactly prevents
(
echo 1 > $memcg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?

We have "restarting the same name job" logic in our usecase (I'm not quite
sure why they do so). Basically, it means to create memcg with the exact
same name right after the old one is deleted, but may have different limit
or other settings. The creation has to wait for rmdir is done. Even though
rmdir is done in background like the above, the stall still exists since
rmdir simply is waiting for force_empty.

OK, I see. This is an important detail you didn't mention previously (or
at least I didn't understand it). One thing is still not clear to me.


Sorry, I should have articulated that in the first place.


"Restarting the same job" sounds as if the memcg itself could be
recycled as well. You are saying that the setting might change but if
that is about limits then we should handle that just fine. Or what other
kind of setting changes that wouldn't work properly?


We did try resizing the limit, but that may also incur costly direct
reclaim and block something. Other than this, we also want to reset all
the counters/stats to get clearer and cleaner resource isolation, since
the container may run different jobs although they use the same name.
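
As a point of reference, the "resize limit" alternative mentioned above is
roughly the one-liner below (the v1 path and value are illustrative); the
write reclaims synchronously in the caller's context until the usage fits
the new limit, so it can stall in much the same way and may fail with
-EBUSY:

echo 100M > /sys/fs/cgroup/memory/cg1/memory.limit_in_bytes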




If the recycling is not possible then I would suggest to not reuse
force_empty interface but add wipe_on_destruction or similar new knob
which would enforce reclaim on offlining. It seems we have several
people asking for something like that already.


We did have a new knob in our in-house implementation; it just did
force_empty on offlining.


So, you mean to have a new knob that just does force_empty on offlining,
and keep force_empty's current behavior, right?


Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Michal Hocko
On Thu 03-01-19 20:15:30, Yang Shi wrote:
> 
> 
> On 1/3/19 12:01 PM, Michal Hocko wrote:
> > On Thu 03-01-19 11:49:32, Yang Shi wrote:
> > > 
> > > On 1/3/19 11:23 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > > > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > > > [...]
> > > > > > > > Is there any reason for your scripts to be strictly sequential 
> > > > > > > > here? In
> > > > > > > > other words why cannot you offload those expensive operations 
> > > > > > > > to a
> > > > > > > > detached context in _userspace_?
> > > > > > > I would say it has not to be strictly sequential. The above 
> > > > > > > script is just
> > > > > > > an example to illustrate the pattern. But, sometimes it may hit 
> > > > > > > such pattern
> > > > > > > due to the complicated cluster scheduling and container 
> > > > > > > scheduling in the
> > > > > > > production environment, for example the creation process might be 
> > > > > > > scheduled
> > > > > > > to the same CPU which is doing force_empty. I have to say I don't 
> > > > > > > know too
> > > > > > > much about the internals of the container scheduling.
> > > > > > In that case I do not see a strong reason to implement the offloding
> > > > > > into the kernel. It is an additional code and semantic to maintain.
> > > > > Yes, it does introduce some additional code and semantic, but IMHO, 
> > > > > it is
> > > > > quite simple and very straight forward, isn't it? Just utilize the 
> > > > > existing
> > > > > css offline worker. And, that a couple of lines of code do improve 
> > > > > some
> > > > > throughput issues for some real usecases.
> > > > I do not really care it is few LOC. It is more important that it is
> > > > conflating force_empty into offlining logic. There was a good reason to
> > > > remove reparenting/emptying the memcg during the offline. Considering
> > > > that you can offload force_empty from userspace trivially then I do not
> > > > see any reason to implement it in the kernel.
> > > Er, I may not articulate in the earlier email, force_empty can not be
> > > offloaded from userspace *trivially*. IOWs the container scheduler may
> > > unexpectedly overcommit something due to the stall of synchronous force
> > > empty, which can't be figured out by userspace before it actually happens.
> > > The scheduler doesn't know how long force_empty would take. If the
> > > force_empty could be offloaded by kernel, it would make scheduler's life
> > > much easier. This is not something userspace could do.
> > What exactly prevents
> > (
> > echo 1 > $memcg/force_empty
> > rmdir $memcg
> > ) &
> > 
> > so that this sequence doesn't really block anything?
> 
> We have "restarting the same name job" logic in our usecase (I'm not quite
> sure why they do so). Basically, it means to create memcg with the exact
> same name right after the old one is deleted, but may have different limit
> or other settings. The creation has to wait for rmdir is done. Even though
> rmdir is done in background like the above, the stall still exists since
> rmdir simply is waiting for force_empty.

OK, I see. This is an important detail you didn't mention previously (or
at least I didn't understand it). One thing is still not clear to me.
"Restarting the same job" sounds as if the memcg itself could be
recycled as well. You are saying that the setting might change but if
that is about limits then we should handle that just fine. Or what other
kinds of setting changes wouldn't work properly?

If the recycling is not possible then I would suggest to not reuse
force_empty interface but add wipe_on_destruction or similar new knob
which would enforce reclaim on offlining. It seems we have several
people asking for something like that already.
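
For illustration only, usage of such a knob might look like the sketch
below; the name wipe_on_destruction is just the suggestion above and no
such interface exists, so treat it purely as a sketch of the idea:

# hypothetical knob: request reclaim at offlining instead of synchronously
echo 1 > $MEMCG/memory.wipe_on_destruction
rmdir $MEMCG    # returns promptly; reclaim runs from the css offline worker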
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Yang Shi




On 1/3/19 12:01 PM, Michal Hocko wrote:

On Thu 03-01-19 11:49:32, Yang Shi wrote:


On 1/3/19 11:23 AM, Michal Hocko wrote:

On Thu 03-01-19 11:10:00, Yang Shi wrote:

On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:

On 1/3/19 10:13 AM, Michal Hocko wrote:

[...]

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is just
an example to illustrate the pattern. But, sometimes it may hit such pattern
due to the complicated cluster scheduling and container scheduling in the
production environment, for example the creation process might be scheduled
to the same CPU which is doing force_empty. I have to say I don't know too
much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantic, but IMHO, it is
quite simple and very straight forward, isn't it? Just utilize the existing
css offline worker. And, that a couple of lines of code do improve some
throughput issues for some real usecases.

I do not really care it is few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.

Er, I may not articulate in the earlier email, force_empty can not be
offloaded from userspace *trivially*. IOWs the container scheduler may
unexpectedly overcommit something due to the stall of synchronous force
empty, which can't be figured out by userspace before it actually happens.
The scheduler doesn't know how long force_empty would take. If the
force_empty could be offloaded by kernel, it would make scheduler's life
much easier. This is not something userspace could do.

What exactly prevents
(
echo 1 > $memcg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?


We have "restarting the same name job" logic in our usecase (I'm not 
quite sure why they do so). Basically, it means to create memcg with the 
exact same name right after the old one is deleted, but may have 
different limit or other settings. The creation has to wait for rmdir is 
done. Even though rmdir is done in background like the above, the stall 
still exists since rmdir simply is waiting for force_empty.
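
To illustrate the serialization (the job name and path are illustrative):
even with the cleanup detached as suggested, recreating the same name still
has to wait for the old rmdir, which is waiting for force_empty:

(
echo 1 > /sys/fs/cgroup/memory/job_foo/memory.force_empty
rmdir /sys/fs/cgroup/memory/job_foo
) &

wait    # mkdir of the same name fails with EEXIST until rmdir completes,
        # so the restarted job is still stalled behind the synchronous reclaim
mkdir /sys/fs/cgroup/memory/job_foo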


Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Michal Hocko
On Thu 03-01-19 11:49:32, Yang Shi wrote:
> 
> 
> On 1/3/19 11:23 AM, Michal Hocko wrote:
> > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > > 
> > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > [...]
> > > > > > Is there any reason for your scripts to be strictly sequential 
> > > > > > here? In
> > > > > > other words why cannot you offload those expensive operations to a
> > > > > > detached context in _userspace_?
> > > > > I would say it has not to be strictly sequential. The above script is 
> > > > > just
> > > > > an example to illustrate the pattern. But, sometimes it may hit such 
> > > > > pattern
> > > > > due to the complicated cluster scheduling and container scheduling in 
> > > > > the
> > > > > production environment, for example the creation process might be 
> > > > > scheduled
> > > > > to the same CPU which is doing force_empty. I have to say I don't 
> > > > > know too
> > > > > much about the internals of the container scheduling.
> > > > In that case I do not see a strong reason to implement the offloding
> > > > into the kernel. It is an additional code and semantic to maintain.
> > > Yes, it does introduce some additional code and semantic, but IMHO, it is
> > > quite simple and very straight forward, isn't it? Just utilize the 
> > > existing
> > > css offline worker. And, that a couple of lines of code do improve some
> > > throughput issues for some real usecases.
> > I do not really care it is few LOC. It is more important that it is
> > conflating force_empty into offlining logic. There was a good reason to
> > remove reparenting/emptying the memcg during the offline. Considering
> > that you can offload force_empty from userspace trivially then I do not
> > see any reason to implement it in the kernel.
> 
> Er, I may not articulate in the earlier email, force_empty can not be
> offloaded from userspace *trivially*. IOWs the container scheduler may
> unexpectedly overcommit something due to the stall of synchronous force
> empty, which can't be figured out by userspace before it actually happens.
> The scheduler doesn't know how long force_empty would take. If the
> force_empty could be offloaded by kernel, it would make scheduler's life
> much easier. This is not something userspace could do.

What exactly prevents
(
echo 1 > $memcg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Yang Shi




On 1/3/19 11:23 AM, Michal Hocko wrote:

On Thu 03-01-19 11:10:00, Yang Shi wrote:


On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:

On 1/3/19 10:13 AM, Michal Hocko wrote:

[...]

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is just
an example to illustrate the pattern. But, sometimes it may hit such pattern
due to the complicated cluster scheduling and container scheduling in the
production environment, for example the creation process might be scheduled
to the same CPU which is doing force_empty. I have to say I don't know too
much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantic, but IMHO, it is
quite simple and very straight forward, isn't it? Just utilize the existing
css offline worker. And, that a couple of lines of code do improve some
throughput issues for some real usecases.

I do not really care it is few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.


Er, I may not have articulated it well in the earlier email: force_empty
cannot be offloaded from userspace *trivially*. In other words, the
container scheduler may unexpectedly overcommit something due to the stall
of a synchronous force empty, which can't be foreseen by userspace before
it actually happens. The scheduler doesn't know how long force_empty would
take. If force_empty could be offloaded to the kernel, it would make the
scheduler's life much easier. This is not something userspace could do.





I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.

We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.


Sure. Will prepare the patches later.

Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Michal Hocko
On Thu 03-01-19 11:10:00, Yang Shi wrote:
> 
> 
> On 1/3/19 10:53 AM, Michal Hocko wrote:
> > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > 
> > > On 1/3/19 10:13 AM, Michal Hocko wrote:
[...]
> > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > other words why cannot you offload those expensive operations to a
> > > > detached context in _userspace_?
> > > I would say it has not to be strictly sequential. The above script is just
> > > an example to illustrate the pattern. But, sometimes it may hit such 
> > > pattern
> > > due to the complicated cluster scheduling and container scheduling in the
> > > production environment, for example the creation process might be 
> > > scheduled
> > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > much about the internals of the container scheduling.
> > In that case I do not see a strong reason to implement the offloding
> > into the kernel. It is an additional code and semantic to maintain.
> 
> Yes, it does introduce some additional code and semantic, but IMHO, it is
> quite simple and very straight forward, isn't it? Just utilize the existing
> css offline worker. And, that a couple of lines of code do improve some
> throughput issues for some real usecases.

I do not really care that it is a few LOC. It is more important that it
conflates force_empty into the offlining logic. There was a good reason to
remove reparenting/emptying the memcg during offline. Considering that you
can offload force_empty from userspace trivially, I do not see any reason
to implement it in the kernel.

> > I think it is more important to discuss whether we want to introduce
> > force_empty in cgroup v2.
> 
> We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Yang Shi




On 1/3/19 10:53 AM, Michal Hocko wrote:

On Thu 03-01-19 10:40:54, Yang Shi wrote:


On 1/3/19 10:13 AM, Michal Hocko wrote:

On Thu 03-01-19 09:33:14, Yang Shi wrote:

On 1/3/19 2:12 AM, Michal Hocko wrote:

On Thu 03-01-19 04:05:30, Yang Shi wrote:

Currently, force empty reclaims memory synchronously when writing to
memory.force_empty.  It may take some time to return and the afterwards
operations are blocked by it.  Although it can be interrupted by signal,
it still seems suboptimal.

Why it is suboptimal? We are doing that operation on behalf of the
process requesting it. What should anybody else pay for it? In other
words why should we hide the overhead?

Please see the below explanation.


Now css offline is handled by worker, and the typical usecase of force
empty is before memcg offline.  So, handling force empty in css offline
sounds reasonable.

Hmm, so I guess you are talking about
echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do
you care actually?

We have some usecases which create and remove memcgs very frequently, and
the tasks in the memcg may just access the files which are unlikely accessed
by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
reclaim the page cache so that they don't get accumulated to incur
unnecessary memory pressure. Since the memory pressure may incur direct
reclaim to harm some latency sensitive applications.

Yes, this makes sense to me.


And, the create/remove might be run in a script sequentially (there might be
a lot scripts or applications are run in parallel to do this), i.e.
mkdir cg1
do something
echo 0 > cg1/memory.force_empty
rmdir cg1

mkdir cg2
...

The creation of the afterwards memcg might be blocked by the force_empty for
long time if there are a lot page caches, so the overall throughput of the
system may get hurt.

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?

I would say it has not to be strictly sequential. The above script is just
an example to illustrate the pattern. But, sometimes it may hit such pattern
due to the complicated cluster scheduling and container scheduling in the
production environment, for example the creation process might be scheduled
to the same CPU which is doing force_empty. I have to say I don't know too
much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloding
into the kernel. It is an additional code and semantic to maintain.


Yes, it does introduce some additional code and semantics, but IMHO it
is quite simple and very straightforward, isn't it? It just utilizes the
existing css offline worker. And that couple of lines of code does
improve throughput for some real use cases.




I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.


We would prefer to have it in v2 as well.

Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Michal Hocko
On Thu 03-01-19 10:40:54, Yang Shi wrote:
> 
> 
> On 1/3/19 10:13 AM, Michal Hocko wrote:
> > On Thu 03-01-19 09:33:14, Yang Shi wrote:
> > > 
> > > On 1/3/19 2:12 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 04:05:30, Yang Shi wrote:
> > > > > Currently, force empty reclaims memory synchronously when writing to
> > > > > memory.force_empty.  It may take some time to return and the 
> > > > > afterwards
> > > > > operations are blocked by it.  Although it can be interrupted by 
> > > > > signal,
> > > > > it still seems suboptimal.
> > > > Why it is suboptimal? We are doing that operation on behalf of the
> > > > process requesting it. What should anybody else pay for it? In other
> > > > words why should we hide the overhead?
> > > Please see the below explanation.
> > > 
> > > > > Now css offline is handled by worker, and the typical usecase of force
> > > > > empty is before memcg offline.  So, handling force empty in css 
> > > > > offline
> > > > > sounds reasonable.
> > > > Hmm, so I guess you are talking about
> > > > echo 1 > $MEMCG/force_empty
> > > > rmdir $MEMCG
> > > > 
> > > > and you are complaining that the operation takes too long. Right? Why do
> > > > you care actually?
> > > We have some usecases which create and remove memcgs very frequently, and
> > > the tasks in the memcg may just access the files which are unlikely 
> > > accessed
> > > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
> > > reclaim the page cache so that they don't get accumulated to incur
> > > unnecessary memory pressure. Since the memory pressure may incur direct
> > > reclaim to harm some latency sensitive applications.
> > Yes, this makes sense to me.
> > 
> > > And, the create/remove might be run in a script sequentially (there might 
> > > be
> > > a lot scripts or applications are run in parallel to do this), i.e.
> > > mkdir cg1
> > > do something
> > > echo 0 > cg1/memory.force_empty
> > > rmdir cg1
> > > 
> > > mkdir cg2
> > > ...
> > > 
> > > The creation of the afterwards memcg might be blocked by the force_empty 
> > > for
> > > long time if there are a lot page caches, so the overall throughput of the
> > > system may get hurt.
> > Is there any reason for your scripts to be strictly sequential here? In
> > other words why cannot you offload those expensive operations to a
> > detached context in _userspace_?
> 
> I would say it has not to be strictly sequential. The above script is just
> an example to illustrate the pattern. But, sometimes it may hit such pattern
> due to the complicated cluster scheduling and container scheduling in the
> production environment, for example the creation process might be scheduled
> to the same CPU which is doing force_empty. I have to say I don't know too
> much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloading
into the kernel. It is additional code and semantics to maintain.

I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Yang Shi




On 1/3/19 10:13 AM, Michal Hocko wrote:

On Thu 03-01-19 09:33:14, Yang Shi wrote:


On 1/3/19 2:12 AM, Michal Hocko wrote:

On Thu 03-01-19 04:05:30, Yang Shi wrote:

Currently, force empty reclaims memory synchronously when writing to
memory.force_empty.  It may take some time to return and the afterwards
operations are blocked by it.  Although it can be interrupted by signal,
it still seems suboptimal.

Why it is suboptimal? We are doing that operation on behalf of the
process requesting it. What should anybody else pay for it? In other
words why should we hide the overhead?

Please see the below explanation.


Now css offline is handled by worker, and the typical usecase of force
empty is before memcg offline.  So, handling force empty in css offline
sounds reasonable.

Hmm, so I guess you are talking about
echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do
you care actually?

We have some usecases which create and remove memcgs very frequently, and
the tasks in the memcg may just access the files which are unlikely accessed
by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
reclaim the page cache so that they don't get accumulated to incur
unnecessary memory pressure. Since the memory pressure may incur direct
reclaim to harm some latency sensitive applications.

Yes, this makes sense to me.


And, the create/remove might be run in a script sequentially (there might be
a lot scripts or applications are run in parallel to do this), i.e.
mkdir cg1
do something
echo 0 > cg1/memory.force_empty
rmdir cg1

mkdir cg2
...

The creation of the afterwards memcg might be blocked by the force_empty for
long time if there are a lot page caches, so the overall throughput of the
system may get hurt.

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?


I would say it does not have to be strictly sequential. The above script
is just an example to illustrate the pattern. But sometimes it may hit
such a pattern due to the complicated cluster scheduling and container
scheduling in the production environment; for example, the creation
process might be scheduled to the same CPU which is doing force_empty. I
have to say I don't know too much about the internals of the container
scheduling.


Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Michal Hocko
On Thu 03-01-19 09:33:14, Yang Shi wrote:
> 
> 
> On 1/3/19 2:12 AM, Michal Hocko wrote:
> > On Thu 03-01-19 04:05:30, Yang Shi wrote:
> > > Currently, force empty reclaims memory synchronously when writing to
> > > memory.force_empty.  It may take some time to return and the afterwards
> > > operations are blocked by it.  Although it can be interrupted by signal,
> > > it still seems suboptimal.
> > Why it is suboptimal? We are doing that operation on behalf of the
> > process requesting it. What should anybody else pay for it? In other
> > words why should we hide the overhead?
> 
> Please see the below explanation.
> 
> > 
> > > Now css offline is handled by worker, and the typical usecase of force
> > > empty is before memcg offline.  So, handling force empty in css offline
> > > sounds reasonable.
> > Hmm, so I guess you are talking about
> > echo 1 > $MEMCG/force_empty
> > rmdir $MEMCG
> > 
> > and you are complaining that the operation takes too long. Right? Why do
> > you care actually?
> 
> We have some usecases which create and remove memcgs very frequently, and
> the tasks in the memcg may just access the files which are unlikely accessed
> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
> reclaim the page cache so that they don't get accumulated to incur
> unnecessary memory pressure. Since the memory pressure may incur direct
> reclaim to harm some latency sensitive applications.

Yes, this makes sense to me.

> And, the create/remove might be run in a script sequentially (there might be
> a lot scripts or applications are run in parallel to do this), i.e.
> mkdir cg1
> do something
> echo 0 > cg1/memory.force_empty
> rmdir cg1
> 
> mkdir cg2
> ...
> 
> The creation of the afterwards memcg might be blocked by the force_empty for
> long time if there are a lot page caches, so the overall throughput of the
> system may get hurt.

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?
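
For example, the kind of userspace offloading meant here might look like
the sketch below (assuming the v1 hierarchy is mounted at
/sys/fs/cgroup/memory):

mkdir /sys/fs/cgroup/memory/cg1
# ... run the job in cg1 ...
(
echo 0 > /sys/fs/cgroup/memory/cg1/memory.force_empty
rmdir /sys/fs/cgroup/memory/cg1
) &    # reclaim and removal continue in the background

mkdir /sys/fs/cgroup/memory/cg2    # the next cgroup is not blocked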
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Yang Shi




On 1/3/19 2:12 AM, Michal Hocko wrote:

On Thu 03-01-19 04:05:30, Yang Shi wrote:

Currently, force empty reclaims memory synchronously when writing to
memory.force_empty.  It may take some time to return and the afterwards
operations are blocked by it.  Although it can be interrupted by signal,
it still seems suboptimal.

Why it is suboptimal? We are doing that operation on behalf of the
process requesting it. What should anybody else pay for it? In other
words why should we hide the overhead?


Please see the below explanation.




Now css offline is handled by worker, and the typical usecase of force
empty is before memcg offline.  So, handling force empty in css offline
sounds reasonable.

Hmm, so I guess you are talking about
echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do
you care actually?


We have some usecases which create and remove memcgs very frequently,
and the tasks in the memcg may just access files which are unlikely to
be accessed by anyone else. So, we prefer to force_empty the memcg
before rmdir'ing it to reclaim the page cache, so that it doesn't
accumulate and incur unnecessary memory pressure, since that memory
pressure may trigger direct reclaim and harm some latency-sensitive
applications.


And the create/remove might be run in a script sequentially (there
might be a lot of scripts or applications run in parallel doing this), e.g.:

mkdir cg1
do something
echo 0 > cg1/memory.force_empty
rmdir cg1

mkdir cg2
...

The creation of the subsequent memcg might be blocked by the force_empty
for a long time if there is a lot of page cache, so the overall throughput
of the system may get hurt.
And it is not that urgent to reclaim the page cache right away, nor is it
that important who pays the cost; we just need a mechanism to reclaim the
pages within a short while. The overhead could be smoothed out by a
background workqueue.


And, the patch still keeps the old behavior, just in case someone else 
still depends on it.


Thanks,
Yang




Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-03 Thread Michal Hocko
On Thu 03-01-19 04:05:30, Yang Shi wrote:
> 
> Currently, force empty reclaims memory synchronously when writing to
> memory.force_empty.  It may take some time to return and the afterwards
> operations are blocked by it.  Although it can be interrupted by signal,
> it still seems suboptimal.

Why is it suboptimal? We are doing that operation on behalf of the
process requesting it. Why should anybody else pay for it? In other
words, why should we hide the overhead?

> Now css offline is handled by worker, and the typical usecase of force
> empty is before memcg offline.  So, handling force empty in css offline
> sounds reasonable.

Hmm, so I guess you are talking about
echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do
you care actually?
-- 
Michal Hocko
SUSE Labs


[RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-02 Thread Yang Shi


Currently, force empty reclaims memory synchronously when writing to
memory.force_empty.  It may take some time to return, and the subsequent
operations are blocked by it.  Although it can be interrupted by a signal,
it still seems suboptimal.

Now css offline is handled by worker, and the typical usecase of force
empty is before memcg offline.  So, handling force empty in css offline
sounds reasonable.

The user may write any value to memory.force_empty, but I suppose the
most used values are 0 and 1.  To not break existing applications,
writing 0 or 1 still does force empty synchronously; any other value
tells the kernel to do force empty in the css offline worker.
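
For illustration, the intended usage would be something like the sketch
below; the mount point is assumed and "2" just stands for "any value other
than 0 or 1":

MEMCG=/sys/fs/cgroup/memory/cg1

# existing behavior: the write blocks until the reclaim finishes
echo 1 > $MEMCG/memory.force_empty

# proposed behavior: defer the reclaim to the css offline worker
echo 2 > $MEMCG/memory.force_empty
rmdir $MEMCG    # returns without waiting for the reclaim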

Patch #1: Fix some obsolete information about force_empty in the document
Patch #2: A minor improvement to skip swap for force_empty
Patch #3: Implement delayed force_empty

Yang Shi (3):
  doc: memcontrol: fix the obsolete content about force empty
  mm: memcontrol: do not try to do swap when force empty
  mm: memcontrol: delay force empty to css offline

 Documentation/cgroup-v1/memory.txt | 15 ++-
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c | 20 +++-
 3 files changed, 31 insertions(+), 6 deletions(-)