Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/4/19 2:57 PM, Yang Shi wrote: On 1/4/19 12:03 PM, Greg Thelen wrote: Yang Shi wrote: On 1/3/19 11:23 AM, Michal Hocko wrote: On Thu 03-01-19 11:10:00, Yang Shi wrote: On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: [...] Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do. If kernel workqueues are doing more work (i.e. force_empty processing), then it seem like the time to offline could grow. I'm not sure if that's important. One thing I can think of is this may slow down the recycling of memcg id. This may cause memcg id exhausted for some extreme workload. But, I don't see this as a problem in our workload. Actually, sync force_empty should have the same side effect. Yang Thanks, Yang I assume that if we make force_empty an async side effect of rmdir then user space scheduler would not be unable to immediately assume the rmdir'd container memory is available without subjecting a new container to direct reclaim. So it seems like user space would use a mechanism to wait for reclaim: either the existing sync force_empty or polling meminfo/etc waiting for free memory to appear. I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. We would prefer have it in v2 as well. Then bring this up in a separate email thread please. Sure. Will prepare the patches later. Thanks, Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/4/19 12:03 PM, Greg Thelen wrote: Yang Shi wrote: On 1/3/19 11:23 AM, Michal Hocko wrote: On Thu 03-01-19 11:10:00, Yang Shi wrote: On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: [...] Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do. If kernel workqueues are doing more work (i.e. force_empty processing), then it seem like the time to offline could grow. I'm not sure if that's important. One thing I can think of is this may slow down the recycling of memcg id. This may cause memcg id exhausted for some extreme workload. But, I don't see this as a problem in our workload. Thanks, Yang I assume that if we make force_empty an async side effect of rmdir then user space scheduler would not be unable to immediately assume the rmdir'd container memory is available without subjecting a new container to direct reclaim. So it seems like user space would use a mechanism to wait for reclaim: either the existing sync force_empty or polling meminfo/etc waiting for free memory to appear. I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. We would prefer have it in v2 as well. Then bring this up in a separate email thread please. Sure. Will prepare the patches later. Thanks, Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/4/19 12:03 PM, Greg Thelen wrote: Yang Shi wrote: On 1/3/19 11:23 AM, Michal Hocko wrote: On Thu 03-01-19 11:10:00, Yang Shi wrote: On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: [...] Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do. If kernel workqueues are doing more work (i.e. force_empty processing), then it seem like the time to offline could grow. I'm not sure if that's important. Yes, it would grow. I'm not sure, but it seems fine with our workloads. The reclaim can be placed at the last step of offline, and it can be interrupted by some signals, i.e. fatal signal in current code. I assume that if we make force_empty an async side effect of rmdir then user space scheduler would not be unable to immediately assume the rmdir'd container memory is available without subjecting a new container to direct reclaim. So it seems like user space would use a mechanism to wait for reclaim: either the existing sync force_empty or polling meminfo/etc waiting for free memory to appear. Yes, it is expected side effect, the memory reclaim would happen in a short while. In this series I keep sync reclaim behavior of force_empty by checking the written value. Michal suggested a new knob do the offline reclaim, and keep force_empty intact. I think using which knob is user's discretion. Thanks, Yang I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. We would prefer have it in v2 as well. Then bring this up in a separate email thread please. Sure. Will prepare the patches later. Thanks, Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
Yang Shi wrote: > On 1/3/19 11:23 AM, Michal Hocko wrote: >> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>> >>> On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: > On 1/3/19 10:13 AM, Michal Hocko wrote: >> [...] >> Is there any reason for your scripts to be strictly sequential here? In >> other words why cannot you offload those expensive operations to a >> detached context in _userspace_? > I would say it has not to be strictly sequential. The above script is just > an example to illustrate the pattern. But, sometimes it may hit such > pattern > due to the complicated cluster scheduling and container scheduling in the > production environment, for example the creation process might be > scheduled > to the same CPU which is doing force_empty. I have to say I don't know too > much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. >>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>> quite simple and very straight forward, isn't it? Just utilize the existing >>> css offline worker. And, that a couple of lines of code do improve some >>> throughput issues for some real usecases. >> I do not really care it is few LOC. It is more important that it is >> conflating force_empty into offlining logic. There was a good reason to >> remove reparenting/emptying the memcg during the offline. Considering >> that you can offload force_empty from userspace trivially then I do not >> see any reason to implement it in the kernel. > > Er, I may not articulate in the earlier email, force_empty can not be > offloaded from userspace *trivially*. IOWs the container scheduler may > unexpectedly overcommit something due to the stall of synchronous force > empty, which can't be figured out by userspace before it actually > happens. The scheduler doesn't know how long force_empty would take. If > the force_empty could be offloaded by kernel, it would make scheduler's > life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing), then it seems like the time to offline could grow. I'm not sure if that's important.

I assume that if we make force_empty an async side effect of rmdir then the user space scheduler would not be able to immediately assume the rmdir'd container memory is available without subjecting a new container to direct reclaim. So it seems like user space would use a mechanism to wait for reclaim: either the existing sync force_empty or polling meminfo/etc waiting for free memory to appear.

I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. >>> We would prefer have it in v2 as well. >> Then bring this up in a separate email thread please. > > Sure. Will prepare the patches later. > > Thanks, > Yang
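To make the second option Greg mentions (polling meminfo) concrete, a minimal sketch, not from the thread: it assumes the scheduler only needs to see some amount of available memory again before admitting the next container; the MemAvailable field, the NEED_KB threshold and the 1-second interval are illustrative choices, not anything proposed in the patches:

    # after tearing down the old container (by whatever mechanism),
    # admit the next one only once enough memory looks available again
    NEED_KB=1048576    # illustrative threshold (1 GB)
    while [ "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" -lt "$NEED_KB" ]; do
            sleep 1
    done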
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/4/19 12:55 AM, Michal Hocko wrote: On Thu 03-01-19 20:15:30, Yang Shi wrote: On 1/3/19 12:01 PM, Michal Hocko wrote: On Thu 03-01-19 11:49:32, Yang Shi wrote: On 1/3/19 11:23 AM, Michal Hocko wrote: On Thu 03-01-19 11:10:00, Yang Shi wrote: On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: [...] Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do. What exactly prevents ( echo 1 > $memecg/force_empty rmdir $memcg ) & so that this sequence doesn't really block anything? We have "restarting the same name job" logic in our usecase (I'm not quite sure why they do so). Basically, it means to create memcg with the exact same name right after the old one is deleted, but may have different limit or other settings. The creation has to wait for rmdir is done. Even though rmdir is done in background like the above, the stall still exists since rmdir simply is waiting for force_empty. OK, I see. This is an important detail you didn't mention previously (or at least I didn't understand it). One thing is still not clear to me. Sorry, I should articulated at the first place. "Restarting the same job" sounds as if the memcg itself could be recycled as well. You are saying that the setting might change but if that is about limits then we should handle that just fine. Or what other kind of setting changes that wouldn't work properly? We did try resize limit, but it may also incur costly direct reclaim to block something. Other than this we also want to reset all the counters/stats to get clearer and cleaner resource isolation since the container may run different jobs although they use the same name. 
If the recycling is not possible then I would suggest to not reuse force_empty interface but add wipe_on_destruction or similar new knob which would enforce reclaim on offlining. It seems we have several people asking for something like that already.

We did have a new knob in our in-house implementation; it just did force_empty on offlining. So, you mean to have a new knob that just does force_empty at offlining, and to keep force_empty's current behavior, right?

Thanks,
Yang
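To make the alternative interface concrete, a purely hypothetical usage sketch: memory.wipe_on_destruction is only a name floated in this thread, not an existing control file, and the semantics assumed below (opt-in reclaim performed by the css offline worker, memory.force_empty left unchanged) are taken from the description above:

    mkdir /sys/fs/cgroup/memory/cg1
    echo 1 > /sys/fs/cgroup/memory/cg1/memory.wipe_on_destruction   # hypothetical knob
    # ... run the job ...
    rmdir /sys/fs/cgroup/memory/cg1   # returns without waiting; reclaim is done by the offline worker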
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On Thu 03-01-19 20:15:30, Yang Shi wrote: > > > On 1/3/19 12:01 PM, Michal Hocko wrote: > > On Thu 03-01-19 11:49:32, Yang Shi wrote: > > > > > > On 1/3/19 11:23 AM, Michal Hocko wrote: > > > > On Thu 03-01-19 11:10:00, Yang Shi wrote: > > > > > On 1/3/19 10:53 AM, Michal Hocko wrote: > > > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote: > > > > [...] > > > > > > > > Is there any reason for your scripts to be strictly sequential > > > > > > > > here? In > > > > > > > > other words why cannot you offload those expensive operations > > > > > > > > to a > > > > > > > > detached context in _userspace_? > > > > > > > I would say it has not to be strictly sequential. The above > > > > > > > script is just > > > > > > > an example to illustrate the pattern. But, sometimes it may hit > > > > > > > such pattern > > > > > > > due to the complicated cluster scheduling and container > > > > > > > scheduling in the > > > > > > > production environment, for example the creation process might be > > > > > > > scheduled > > > > > > > to the same CPU which is doing force_empty. I have to say I don't > > > > > > > know too > > > > > > > much about the internals of the container scheduling. > > > > > > In that case I do not see a strong reason to implement the offloding > > > > > > into the kernel. It is an additional code and semantic to maintain. > > > > > Yes, it does introduce some additional code and semantic, but IMHO, > > > > > it is > > > > > quite simple and very straight forward, isn't it? Just utilize the > > > > > existing > > > > > css offline worker. And, that a couple of lines of code do improve > > > > > some > > > > > throughput issues for some real usecases. > > > > I do not really care it is few LOC. It is more important that it is > > > > conflating force_empty into offlining logic. There was a good reason to > > > > remove reparenting/emptying the memcg during the offline. Considering > > > > that you can offload force_empty from userspace trivially then I do not > > > > see any reason to implement it in the kernel. > > > Er, I may not articulate in the earlier email, force_empty can not be > > > offloaded from userspace *trivially*. IOWs the container scheduler may > > > unexpectedly overcommit something due to the stall of synchronous force > > > empty, which can't be figured out by userspace before it actually happens. > > > The scheduler doesn't know how long force_empty would take. If the > > > force_empty could be offloaded by kernel, it would make scheduler's life > > > much easier. This is not something userspace could do. > > What exactly prevents > > ( > > echo 1 > $memecg/force_empty > > rmdir $memcg > > ) & > > > > so that this sequence doesn't really block anything? > > We have "restarting the same name job" logic in our usecase (I'm not quite > sure why they do so). Basically, it means to create memcg with the exact > same name right after the old one is deleted, but may have different limit > or other settings. The creation has to wait for rmdir is done. Even though > rmdir is done in background like the above, the stall still exists since > rmdir simply is waiting for force_empty. OK, I see. This is an important detail you didn't mention previously (or at least I didn't understand it). One thing is still not clear to me. "Restarting the same job" sounds as if the memcg itself could be recycled as well. You are saying that the setting might change but if that is about limits then we should handle that just fine. 
Or what other kinds of setting changes wouldn't work properly?

If the recycling is not possible then I would suggest not reusing the force_empty interface but adding wipe_on_destruction or a similar new knob which would enforce reclaim on offlining. It seems we have several people asking for something like that already.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/3/19 12:01 PM, Michal Hocko wrote: On Thu 03-01-19 11:49:32, Yang Shi wrote: On 1/3/19 11:23 AM, Michal Hocko wrote: On Thu 03-01-19 11:10:00, Yang Shi wrote: On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: [...] Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do.

What exactly prevents

(
echo 1 > $memcg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?

We have "restarting the same name job" logic in our usecase (I'm not quite sure why they do so). Basically, it means creating a memcg with the exact same name right after the old one is deleted, but possibly with a different limit or other settings. The creation has to wait until rmdir is done. Even though rmdir is done in the background as above, the stall still exists since rmdir is simply waiting for force_empty.

Thanks,
Yang
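A minimal sketch of the "restarting the same name job" pattern described above (the path, limit and polling interval are illustrative); it shows why backgrounding the teardown does not remove the stall: the new mkdir of the same name cannot succeed until the old directory is gone, i.e. until the synchronous force_empty ahead of the rmdir has finished:

    CG=/sys/fs/cgroup/memory/job-foo

    # offloaded teardown, as suggested above
    ( echo 1 > "$CG/memory.force_empty"; rmdir "$CG" ) &

    # "restart the same name job": mkdir keeps failing with EEXIST
    # until the force_empty + rmdir above have completed
    until mkdir "$CG" 2>/dev/null; do sleep 0.1; done
    echo "$NEW_LIMIT" > "$CG/memory.limit_in_bytes"   # possibly different settings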
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On Thu 03-01-19 11:49:32, Yang Shi wrote: > > > On 1/3/19 11:23 AM, Michal Hocko wrote: > > On Thu 03-01-19 11:10:00, Yang Shi wrote: > > > > > > On 1/3/19 10:53 AM, Michal Hocko wrote: > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote: > > [...] > > > > > > Is there any reason for your scripts to be strictly sequential > > > > > > here? In > > > > > > other words why cannot you offload those expensive operations to a > > > > > > detached context in _userspace_? > > > > > I would say it has not to be strictly sequential. The above script is > > > > > just > > > > > an example to illustrate the pattern. But, sometimes it may hit such > > > > > pattern > > > > > due to the complicated cluster scheduling and container scheduling in > > > > > the > > > > > production environment, for example the creation process might be > > > > > scheduled > > > > > to the same CPU which is doing force_empty. I have to say I don't > > > > > know too > > > > > much about the internals of the container scheduling. > > > > In that case I do not see a strong reason to implement the offloding > > > > into the kernel. It is an additional code and semantic to maintain. > > > Yes, it does introduce some additional code and semantic, but IMHO, it is > > > quite simple and very straight forward, isn't it? Just utilize the > > > existing > > > css offline worker. And, that a couple of lines of code do improve some > > > throughput issues for some real usecases. > > I do not really care it is few LOC. It is more important that it is > > conflating force_empty into offlining logic. There was a good reason to > > remove reparenting/emptying the memcg during the offline. Considering > > that you can offload force_empty from userspace trivially then I do not > > see any reason to implement it in the kernel. > > Er, I may not articulate in the earlier email, force_empty can not be > offloaded from userspace *trivially*. IOWs the container scheduler may > unexpectedly overcommit something due to the stall of synchronous force > empty, which can't be figured out by userspace before it actually happens. > The scheduler doesn't know how long force_empty would take. If the > force_empty could be offloaded by kernel, it would make scheduler's life > much easier. This is not something userspace could do. What exactly prevents ( echo 1 > $memecg/force_empty rmdir $memcg ) & so that this sequence doesn't really block anything? -- Michal Hocko SUSE Labs
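For reference, the detached userspace teardown Michal suggests, written out as a small helper (with the variable name spelled consistently; paths are illustrative). As Yang notes in his reply, this only avoids the stall when the next cgroup uses a different name:

    teardown() {
            echo 1 > "$1/memory.force_empty"   # synchronous reclaim happens here
            rmdir "$1"
    }

    teardown /sys/fs/cgroup/memory/cg1 &   # detached; the caller does not wait
    mkdir /sys/fs/cgroup/memory/cg2        # a cgroup with a different name can be created right away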
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/3/19 11:23 AM, Michal Hocko wrote: On Thu 03-01-19 11:10:00, Yang Shi wrote: On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: [...] Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. Er, I may not articulate in the earlier email, force_empty can not be offloaded from userspace *trivially*. IOWs the container scheduler may unexpectedly overcommit something due to the stall of synchronous force empty, which can't be figured out by userspace before it actually happens. The scheduler doesn't know how long force_empty would take. If the force_empty could be offloaded by kernel, it would make scheduler's life much easier. This is not something userspace could do. I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. We would prefer have it in v2 as well. Then bring this up in a separate email thread please. Sure. Will prepare the patches later. Thanks, Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On Thu 03-01-19 11:10:00, Yang Shi wrote: > > > On 1/3/19 10:53 AM, Michal Hocko wrote: > > On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote: [...] > > > > Is there any reason for your scripts to be strictly sequential here? In > > > > other words why cannot you offload those expensive operations to a > > > > detached context in _userspace_? > > > I would say it has not to be strictly sequential. The above script is just > > > an example to illustrate the pattern. But, sometimes it may hit such > > > pattern > > > due to the complicated cluster scheduling and container scheduling in the > > > production environment, for example the creation process might be > > > scheduled > > > to the same CPU which is doing force_empty. I have to say I don't know too > > > much about the internals of the container scheduling. > > In that case I do not see a strong reason to implement the offloding > > into the kernel. It is an additional code and semantic to maintain. > > Yes, it does introduce some additional code and semantic, but IMHO, it is > quite simple and very straight forward, isn't it? Just utilize the existing > css offline worker. And, that a couple of lines of code do improve some > throughput issues for some real usecases. I do not really care it is few LOC. It is more important that it is conflating force_empty into offlining logic. There was a good reason to remove reparenting/emptying the memcg during the offline. Considering that you can offload force_empty from userspace trivially then I do not see any reason to implement it in the kernel. > > I think it is more important to discuss whether we want to introduce > > force_empty in cgroup v2. > > We would prefer have it in v2 as well. Then bring this up in a separate email thread please. -- Michal Hocko SUSE Labs
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: On 1/3/19 10:13 AM, Michal Hocko wrote: On Thu 03-01-19 09:33:14, Yang Shi wrote: On 1/3/19 2:12 AM, Michal Hocko wrote: On Thu 03-01-19 04:05:30, Yang Shi wrote: Currently, force empty reclaims memory synchronously when writing to memory.force_empty. It may take some time to return and the afterwards operations are blocked by it. Although it can be interrupted by signal, it still seems suboptimal. Why it is suboptimal? We are doing that operation on behalf of the process requesting it. What should anybody else pay for it? In other words why should we hide the overhead? Please see the below explanation. Now css offline is handled by worker, and the typical usecase of force empty is before memcg offline. So, handling force empty in css offline sounds reasonable. Hmm, so I guess you are talking about echo 1 > $MEMCG/force_empty rmdir $MEMCG and you are complaining that the operation takes too long. Right? Why do you care actually? We have some usecases which create and remove memcgs very frequently, and the tasks in the memcg may just access the files which are unlikely accessed by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to reclaim the page cache so that they don't get accumulated to incur unnecessary memory pressure. Since the memory pressure may incur direct reclaim to harm some latency sensitive applications. Yes, this makes sense to me. And, the create/remove might be run in a script sequentially (there might be a lot scripts or applications are run in parallel to do this), i.e. mkdir cg1 do something echo 0 > cg1/memory.force_empty rmdir cg1 mkdir cg2 ... The creation of the afterwards memcg might be blocked by the force_empty for long time if there are a lot page caches, so the overall throughput of the system may get hurt. Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. Yes, it does introduce some additional code and semantic, but IMHO, it is quite simple and very straight forward, isn't it? Just utilize the existing css offline worker. And, that a couple of lines of code do improve some throughput issues for some real usecases. I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. We would prefer have it in v2 as well. Thanks, Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On Thu 03-01-19 10:40:54, Yang Shi wrote: > > > On 1/3/19 10:13 AM, Michal Hocko wrote: > > On Thu 03-01-19 09:33:14, Yang Shi wrote: > > > > > > On 1/3/19 2:12 AM, Michal Hocko wrote: > > > > On Thu 03-01-19 04:05:30, Yang Shi wrote: > > > > > Currently, force empty reclaims memory synchronously when writing to > > > > > memory.force_empty. It may take some time to return and the > > > > > afterwards > > > > > operations are blocked by it. Although it can be interrupted by > > > > > signal, > > > > > it still seems suboptimal. > > > > Why it is suboptimal? We are doing that operation on behalf of the > > > > process requesting it. What should anybody else pay for it? In other > > > > words why should we hide the overhead? > > > Please see the below explanation. > > > > > > > > Now css offline is handled by worker, and the typical usecase of force > > > > > empty is before memcg offline. So, handling force empty in css > > > > > offline > > > > > sounds reasonable. > > > > Hmm, so I guess you are talking about > > > > echo 1 > $MEMCG/force_empty > > > > rmdir $MEMCG > > > > > > > > and you are complaining that the operation takes too long. Right? Why do > > > > you care actually? > > > We have some usecases which create and remove memcgs very frequently, and > > > the tasks in the memcg may just access the files which are unlikely > > > accessed > > > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to > > > reclaim the page cache so that they don't get accumulated to incur > > > unnecessary memory pressure. Since the memory pressure may incur direct > > > reclaim to harm some latency sensitive applications. > > Yes, this makes sense to me. > > > > > And, the create/remove might be run in a script sequentially (there might > > > be > > > a lot scripts or applications are run in parallel to do this), i.e. > > > mkdir cg1 > > > do something > > > echo 0 > cg1/memory.force_empty > > > rmdir cg1 > > > > > > mkdir cg2 > > > ... > > > > > > The creation of the afterwards memcg might be blocked by the force_empty > > > for > > > long time if there are a lot page caches, so the overall throughput of the > > > system may get hurt. > > Is there any reason for your scripts to be strictly sequential here? In > > other words why cannot you offload those expensive operations to a > > detached context in _userspace_? > > I would say it has not to be strictly sequential. The above script is just > an example to illustrate the pattern. But, sometimes it may hit such pattern > due to the complicated cluster scheduling and container scheduling in the > production environment, for example the creation process might be scheduled > to the same CPU which is doing force_empty. I have to say I don't know too > much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. I think it is more important to discuss whether we want to introduce force_empty in cgroup v2. -- Michal Hocko SUSE Labs
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/3/19 10:13 AM, Michal Hocko wrote: On Thu 03-01-19 09:33:14, Yang Shi wrote: On 1/3/19 2:12 AM, Michal Hocko wrote: On Thu 03-01-19 04:05:30, Yang Shi wrote: Currently, force empty reclaims memory synchronously when writing to memory.force_empty. It may take some time to return and the afterwards operations are blocked by it. Although it can be interrupted by signal, it still seems suboptimal. Why it is suboptimal? We are doing that operation on behalf of the process requesting it. What should anybody else pay for it? In other words why should we hide the overhead? Please see the below explanation. Now css offline is handled by worker, and the typical usecase of force empty is before memcg offline. So, handling force empty in css offline sounds reasonable. Hmm, so I guess you are talking about echo 1 > $MEMCG/force_empty rmdir $MEMCG and you are complaining that the operation takes too long. Right? Why do you care actually? We have some usecases which create and remove memcgs very frequently, and the tasks in the memcg may just access the files which are unlikely accessed by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to reclaim the page cache so that they don't get accumulated to incur unnecessary memory pressure. Since the memory pressure may incur direct reclaim to harm some latency sensitive applications. Yes, this makes sense to me. And, the create/remove might be run in a script sequentially (there might be a lot scripts or applications are run in parallel to do this), i.e. mkdir cg1 do something echo 0 > cg1/memory.force_empty rmdir cg1 mkdir cg2 ... The creation of the afterwards memcg might be blocked by the force_empty for long time if there are a lot page caches, so the overall throughput of the system may get hurt. Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? I would say it has not to be strictly sequential. The above script is just an example to illustrate the pattern. But, sometimes it may hit such pattern due to the complicated cluster scheduling and container scheduling in the production environment, for example the creation process might be scheduled to the same CPU which is doing force_empty. I have to say I don't know too much about the internals of the container scheduling. Thanks, Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On Thu 03-01-19 09:33:14, Yang Shi wrote: > > > On 1/3/19 2:12 AM, Michal Hocko wrote: > > On Thu 03-01-19 04:05:30, Yang Shi wrote: > > > Currently, force empty reclaims memory synchronously when writing to > > > memory.force_empty. It may take some time to return and the afterwards > > > operations are blocked by it. Although it can be interrupted by signal, > > > it still seems suboptimal. > > Why it is suboptimal? We are doing that operation on behalf of the > > process requesting it. What should anybody else pay for it? In other > > words why should we hide the overhead? > > Please see the below explanation. > > > > > > Now css offline is handled by worker, and the typical usecase of force > > > empty is before memcg offline. So, handling force empty in css offline > > > sounds reasonable. > > Hmm, so I guess you are talking about > > echo 1 > $MEMCG/force_empty > > rmdir $MEMCG > > > > and you are complaining that the operation takes too long. Right? Why do > > you care actually? > > We have some usecases which create and remove memcgs very frequently, and > the tasks in the memcg may just access the files which are unlikely accessed > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to > reclaim the page cache so that they don't get accumulated to incur > unnecessary memory pressure. Since the memory pressure may incur direct > reclaim to harm some latency sensitive applications. Yes, this makes sense to me. > And, the create/remove might be run in a script sequentially (there might be > a lot scripts or applications are run in parallel to do this), i.e. > mkdir cg1 > do something > echo 0 > cg1/memory.force_empty > rmdir cg1 > > mkdir cg2 > ... > > The creation of the afterwards memcg might be blocked by the force_empty for > long time if there are a lot page caches, so the overall throughput of the > system may get hurt. Is there any reason for your scripts to be strictly sequential here? In other words why cannot you offload those expensive operations to a detached context in _userspace_? -- Michal Hocko SUSE Labs
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On 1/3/19 2:12 AM, Michal Hocko wrote: On Thu 03-01-19 04:05:30, Yang Shi wrote: Currently, force empty reclaims memory synchronously when writing to memory.force_empty. It may take some time to return and the afterwards operations are blocked by it. Although it can be interrupted by signal, it still seems suboptimal. Why it is suboptimal? We are doing that operation on behalf of the process requesting it. What should anybody else pay for it? In other words why should we hide the overhead?

Please see the below explanation.

Now css offline is handled by worker, and the typical usecase of force empty is before memcg offline. So, handling force empty in css offline sounds reasonable. Hmm, so I guess you are talking about

echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do you care actually?

We have some usecases which create and remove memcgs very frequently, and the tasks in the memcg may just access files which are unlikely to be accessed by anyone else. So, we prefer to force_empty the memcg before rmdir'ing it to reclaim the page cache so that it doesn't accumulate and incur unnecessary memory pressure, since the memory pressure may trigger direct reclaim and harm some latency sensitive applications.

And, the create/remove might be run in a script sequentially (there might be a lot of scripts or applications run in parallel to do this), i.e.

mkdir cg1
do something
echo 0 > cg1/memory.force_empty
rmdir cg1

mkdir cg2
...

The creation of the subsequent memcg might be blocked by the force_empty for a long time if there are a lot of page caches, so the overall throughput of the system may get hurt.

And, it is not that urgent to reclaim the page cache right away, and it is not that important who pays the cost; we just need a mechanism to reclaim the pages in a short while. The overhead could be smoothed out by a background workqueue. And, the patch still keeps the old behavior, just in case someone else still depends on it.

Thanks,
Yang
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
On Thu 03-01-19 04:05:30, Yang Shi wrote:
> Currently, force empty reclaims memory synchronously when writing to memory.force_empty. It may take some time to return and the afterwards operations are blocked by it. Although it can be interrupted by signal, it still seems suboptimal.

Why is it suboptimal? We are doing that operation on behalf of the process requesting it. Why should anybody else pay for it? In other words, why should we hide the overhead?

> Now css offline is handled by worker, and the typical usecase of force empty is before memcg offline. So, handling force empty in css offline sounds reasonable.

Hmm, so I guess you are talking about

echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do you care actually?
--
Michal Hocko
SUSE Labs
[RFC PATCH 0/3] mm: memcontrol: delayed force empty
Currently, force empty reclaims memory synchronously when writing to memory.force_empty. It may take some time to return, and the operations that follow are blocked by it. Although it can be interrupted by a signal, it still seems suboptimal.

Now css offline is handled by a worker, and the typical usecase of force empty is right before memcg offline. So, handling force empty in css offline sounds reasonable.

The user may write any value to memory.force_empty, but I suppose the most used values would be 0 and 1. To not break existing applications, writing 0 or 1 still does force empty synchronously; any other value tells the kernel to do force empty in the css offline worker.

Patch #1: Fix some obsolete information about force_empty in the document
Patch #2: A minor improvement to skip swap for force_empty
Patch #3: Implement delayed force_empty

Yang Shi (3):
  doc: memcontrol: fix the obsolete content about force empty
  mm: memcontrol: do not try to do swap when force empty
  mm: memcontrol: delay force empty to css offline

 Documentation/cgroup-v1/memory.txt | 15 ++-
 include/linux/memcontrol.h         |  2 ++
 mm/memcontrol.c                    | 20 +++-
 3 files changed, 31 insertions(+), 6 deletions(-)
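A usage sketch of the interface described above (the cgroup mount point and names are illustrative; per the description, 0 and 1 keep today's synchronous behaviour and any other value defers the reclaim to the css offline worker):

    # unchanged behaviour: the write blocks until the reclaim is done
    echo 1 > /sys/fs/cgroup/memory/cg1/memory.force_empty
    rmdir /sys/fs/cgroup/memory/cg1

    # proposed behaviour: the write only flags the memcg; the reclaim
    # happens in the css offline worker after rmdir
    echo 2 > /sys/fs/cgroup/memory/cg2/memory.force_empty
    rmdir /sys/fs/cgroup/memory/cg2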