Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Kirill Tkhai writes: > On 01.06.2018 18:25, Eric W. Biederman wrote: >> Michal Hocko writes: >> >>> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: Michal Hocko writes: >>> [...] > Group leader exiting early without tearing down the whole thread > group should be quite rare as well. No question that somebody might do > that on purpose though... The group leader exiting early is a completely legitimate and reasonable thing to do, even if it is rare. >>> >>> I am not saying it isn't legitimate. But the most common case is the >>> main thread waiting for its threads or calling exit which would tear the >>> whole group down. Is there any easy way to achieve this other than tkill >>> to group leader? Calling exit(3) from the leader performs group exit >>> IIRC. >> >> pthread_exit from the group leader. >> >>> I am not arguing this is non-issue. And it certainly is a problem once >>> somebody wants to be nasty... I was more interested how often this >>> really happens for sane workloads. >> >> That is a fair question. All I know for certain is that whatever Kirill >> Tkhai's workload was it was triggering this the slow path. > > It was triggered on a server, where many VPS of many people are hosted. > Sorry, I have no an idea what they did. That at least tells us it was naturally occurring. Which makes this a real problem in the real world. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Kirill Tkhai writes: > On 01.06.2018 18:25, Eric W. Biederman wrote: >> Michal Hocko writes: >> >>> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: Michal Hocko writes: >>> [...] > Group leader exiting early without tearing down the whole thread > group should be quite rare as well. No question that somebody might do > that on purpose though... The group leader exiting early is a completely legitimate and reasonable thing to do, even if it is rare. >>> >>> I am not saying it isn't legitimate. But the most common case is the >>> main thread waiting for its threads or calling exit which would tear the >>> whole group down. Is there any easy way to achieve this other than tkill >>> to group leader? Calling exit(3) from the leader performs group exit >>> IIRC. >> >> pthread_exit from the group leader. >> >>> I am not arguing this is non-issue. And it certainly is a problem once >>> somebody wants to be nasty... I was more interested how often this >>> really happens for sane workloads. >> >> That is a fair question. All I know for certain is that whatever Kirill >> Tkhai's workload was it was triggering this the slow path. > > It was triggered on a server, where many VPS of many people are hosted. > Sorry, I have no an idea what they did. That at least tells us it was naturally occurring. Which makes this a real problem in the real world. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 01.06.2018 18:25, Eric W. Biederman wrote: > Michal Hocko writes: > >> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: >>> Michal Hocko writes: >> [...] Group leader exiting early without tearing down the whole thread group should be quite rare as well. No question that somebody might do that on purpose though... >>> >>> The group leader exiting early is a completely legitimate and reasonable >>> thing to do, even if it is rare. >> >> I am not saying it isn't legitimate. But the most common case is the >> main thread waiting for its threads or calling exit which would tear the >> whole group down. Is there any easy way to achieve this other than tkill >> to group leader? Calling exit(3) from the leader performs group exit >> IIRC. > > pthread_exit from the group leader. > >> I am not arguing this is non-issue. And it certainly is a problem once >> somebody wants to be nasty... I was more interested how often this >> really happens for sane workloads. > > That is a fair question. All I know for certain is that whatever Kirill > Tkhai's workload was it was triggering this the slow path. It was triggered on a server, where many VPS of many people are hosted. Sorry, I have no an idea what they did. Kirill
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 01.06.2018 18:25, Eric W. Biederman wrote: > Michal Hocko writes: > >> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: >>> Michal Hocko writes: >> [...] Group leader exiting early without tearing down the whole thread group should be quite rare as well. No question that somebody might do that on purpose though... >>> >>> The group leader exiting early is a completely legitimate and reasonable >>> thing to do, even if it is rare. >> >> I am not saying it isn't legitimate. But the most common case is the >> main thread waiting for its threads or calling exit which would tear the >> whole group down. Is there any easy way to achieve this other than tkill >> to group leader? Calling exit(3) from the leader performs group exit >> IIRC. > > pthread_exit from the group leader. > >> I am not arguing this is non-issue. And it certainly is a problem once >> somebody wants to be nasty... I was more interested how often this >> really happens for sane workloads. > > That is a fair question. All I know for certain is that whatever Kirill > Tkhai's workload was it was triggering this the slow path. It was triggered on a server, where many VPS of many people are hosted. Sorry, I have no an idea what they did. Kirill
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Mon 04-06-18 09:31:46, Eric W. Biederman wrote: [...] > My key point is that it is easy to trigger which makes the current > mm_update_next_owner a fundamentally flawed design, and something that > needs to be fixed. Ohh, absolutely agreed! I was not trying to argue that part of course. -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Mon 04-06-18 09:31:46, Eric W. Biederman wrote: [...] > My key point is that it is easy to trigger which makes the current > mm_update_next_owner a fundamentally flawed design, and something that > needs to be fixed. Ohh, absolutely agreed! I was not trying to argue that part of course. -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Fri 01-06-18 10:25:59, Eric W. Biederman wrote: >> Michal Hocko writes: >> >> > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: >> >> Michal Hocko writes: >> > [...] >> >> > Group leader exiting early without tearing down the whole thread >> >> > group should be quite rare as well. No question that somebody might do >> >> > that on purpose though... >> >> >> >> The group leader exiting early is a completely legitimate and reasonable >> >> thing to do, even if it is rare. >> > >> > I am not saying it isn't legitimate. But the most common case is the >> > main thread waiting for its threads or calling exit which would tear the >> > whole group down. Is there any easy way to achieve this other than tkill >> > to group leader? Calling exit(3) from the leader performs group exit >> > IIRC. >> >> pthread_exit from the group leader. > > Right, forgot to mention this one but this would be quite exotic, > right? Not exotic. It is easy to do and well defined. It would be easy to do if you are running a thread pool and closing unnecessary threads. Or frankly anything where the application is not assigning a special role to the initial thread. It does seem rare enough that no one has noticed how attrocious mm_update_next_owner is until now. My key point is that it is easy to trigger which makes the current mm_update_next_owner a fundamentally flawed design, and something that needs to be fixed. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Fri 01-06-18 10:25:59, Eric W. Biederman wrote: >> Michal Hocko writes: >> >> > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: >> >> Michal Hocko writes: >> > [...] >> >> > Group leader exiting early without tearing down the whole thread >> >> > group should be quite rare as well. No question that somebody might do >> >> > that on purpose though... >> >> >> >> The group leader exiting early is a completely legitimate and reasonable >> >> thing to do, even if it is rare. >> > >> > I am not saying it isn't legitimate. But the most common case is the >> > main thread waiting for its threads or calling exit which would tear the >> > whole group down. Is there any easy way to achieve this other than tkill >> > to group leader? Calling exit(3) from the leader performs group exit >> > IIRC. >> >> pthread_exit from the group leader. > > Right, forgot to mention this one but this would be quite exotic, > right? Not exotic. It is easy to do and well defined. It would be easy to do if you are running a thread pool and closing unnecessary threads. Or frankly anything where the application is not assigning a special role to the initial thread. It does seem rare enough that no one has noticed how attrocious mm_update_next_owner is until now. My key point is that it is easy to trigger which makes the current mm_update_next_owner a fundamentally flawed design, and something that needs to be fixed. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Fri 01-06-18 10:25:59, Eric W. Biederman wrote: > Michal Hocko writes: > > > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: > >> Michal Hocko writes: > > [...] > >> > Group leader exiting early without tearing down the whole thread > >> > group should be quite rare as well. No question that somebody might do > >> > that on purpose though... > >> > >> The group leader exiting early is a completely legitimate and reasonable > >> thing to do, even if it is rare. > > > > I am not saying it isn't legitimate. But the most common case is the > > main thread waiting for its threads or calling exit which would tear the > > whole group down. Is there any easy way to achieve this other than tkill > > to group leader? Calling exit(3) from the leader performs group exit > > IIRC. > > pthread_exit from the group leader. Right, forgot to mention this one but this would be quite exotic, right? > > I am not arguing this is non-issue. And it certainly is a problem once > > somebody wants to be nasty... I was more interested how often this > > really happens for sane workloads. > > That is a fair question. All I know for certain is that whatever Kirill > Tkhai's workload was it was triggering this the slow path. Yeah, that was exactly why I've asked that originally. It must be something pretty special ;) -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Fri 01-06-18 10:25:59, Eric W. Biederman wrote: > Michal Hocko writes: > > > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: > >> Michal Hocko writes: > > [...] > >> > Group leader exiting early without tearing down the whole thread > >> > group should be quite rare as well. No question that somebody might do > >> > that on purpose though... > >> > >> The group leader exiting early is a completely legitimate and reasonable > >> thing to do, even if it is rare. > > > > I am not saying it isn't legitimate. But the most common case is the > > main thread waiting for its threads or calling exit which would tear the > > whole group down. Is there any easy way to achieve this other than tkill > > to group leader? Calling exit(3) from the leader performs group exit > > IIRC. > > pthread_exit from the group leader. Right, forgot to mention this one but this would be quite exotic, right? > > I am not arguing this is non-issue. And it certainly is a problem once > > somebody wants to be nasty... I was more interested how often this > > really happens for sane workloads. > > That is a fair question. All I know for certain is that whatever Kirill > Tkhai's workload was it was triggering this the slow path. Yeah, that was exactly why I've asked that originally. It must be something pretty special ;) -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: >> Michal Hocko writes: > [...] >> > Group leader exiting early without tearing down the whole thread >> > group should be quite rare as well. No question that somebody might do >> > that on purpose though... >> >> The group leader exiting early is a completely legitimate and reasonable >> thing to do, even if it is rare. > > I am not saying it isn't legitimate. But the most common case is the > main thread waiting for its threads or calling exit which would tear the > whole group down. Is there any easy way to achieve this other than tkill > to group leader? Calling exit(3) from the leader performs group exit > IIRC. pthread_exit from the group leader. > I am not arguing this is non-issue. And it certainly is a problem once > somebody wants to be nasty... I was more interested how often this > really happens for sane workloads. That is a fair question. All I know for certain is that whatever Kirill Tkhai's workload was it was triggering this the slow path. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: >> Michal Hocko writes: > [...] >> > Group leader exiting early without tearing down the whole thread >> > group should be quite rare as well. No question that somebody might do >> > that on purpose though... >> >> The group leader exiting early is a completely legitimate and reasonable >> thing to do, even if it is rare. > > I am not saying it isn't legitimate. But the most common case is the > main thread waiting for its threads or calling exit which would tear the > whole group down. Is there any easy way to achieve this other than tkill > to group leader? Calling exit(3) from the leader performs group exit > IIRC. pthread_exit from the group leader. > I am not arguing this is non-issue. And it certainly is a problem once > somebody wants to be nasty... I was more interested how often this > really happens for sane workloads. That is a fair question. All I know for certain is that whatever Kirill Tkhai's workload was it was triggering this the slow path. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: > Michal Hocko writes: [...] > > Group leader exiting early without tearing down the whole thread > > group should be quite rare as well. No question that somebody might do > > that on purpose though... > > The group leader exiting early is a completely legitimate and reasonable > thing to do, even if it is rare. I am not saying it isn't legitimate. But the most common case is the main thread waiting for its threads or calling exit which would tear the whole group down. Is there any easy way to achieve this other than tkill to group leader? Calling exit(3) from the leader performs group exit IIRC. I am not arguing this is non-issue. And it certainly is a problem once somebody wants to be nasty... I was more interested how often this really happens for sane workloads. -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Fri 01-06-18 09:32:42, Eric W. Biederman wrote: > Michal Hocko writes: [...] > > Group leader exiting early without tearing down the whole thread > > group should be quite rare as well. No question that somebody might do > > that on purpose though... > > The group leader exiting early is a completely legitimate and reasonable > thing to do, even if it is rare. I am not saying it isn't legitimate. But the most common case is the main thread waiting for its threads or calling exit which would tear the whole group down. Is there any easy way to achieve this other than tkill to group leader? Calling exit(3) from the leader performs group exit IIRC. I am not arguing this is non-issue. And it certainly is a problem once somebody wants to be nasty... I was more interested how often this really happens for sane workloads. -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Thu 31-05-18 20:07:28, Eric W. Biederman wrote: >> Michal Hocko writes: >> >> > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: >> >> This function searches for a new mm owner in children and siblings, >> >> and then iterates over all processes in the system in unlikely case. >> >> Despite the case is unlikely, its probability growths with the number >> >> of processes in the system. The time, spent on iterations, also growths. >> >> I regulary observe mm_update_next_owner() in crash dumps (not related >> >> to this function) of the nodes with many processes (20K+), so it looks >> >> like it's not so unlikely case. >> > >> > Did you manage to find the pattern that forces mm_update_next_owner to >> > slow paths? This really shouldn't trigger very often. If we can fallback >> > easily then I suspect that we should be better off reconsidering >> > mm->owner and try to come up with something more clever. I've had a >> > patch to remove owner few years back. It needed some work to finish but >> > maybe that would be a better than try to make non-scalable thing suck >> > less. >> >> Reading through the code I just found a trivial pattern that triggers >> this. Create a multi-threaded process. Have the thread group leader >> (the first thread) exit. > > Hmm, I thought that we try to iterate over threads in the same thread > group first. But we are not doing that. Anyway just CLONE_VM without > CLONE_THREAD would achieve the same pathological path but that should be > rare. Yes, if the child exited. The code searches the children and siblings but the parents of the process that exited. > Group leader exiting early without tearing down the whole thread > group should be quite rare as well. No question that somebody might do > that on purpose though... The group leader exiting early is a completely legitimate and reasonable thing to do, even if it is rare. I think all it would take is one program like that in a work-load for the performance to descend into something unpleasant. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Thu 31-05-18 20:07:28, Eric W. Biederman wrote: >> Michal Hocko writes: >> >> > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: >> >> This function searches for a new mm owner in children and siblings, >> >> and then iterates over all processes in the system in unlikely case. >> >> Despite the case is unlikely, its probability growths with the number >> >> of processes in the system. The time, spent on iterations, also growths. >> >> I regulary observe mm_update_next_owner() in crash dumps (not related >> >> to this function) of the nodes with many processes (20K+), so it looks >> >> like it's not so unlikely case. >> > >> > Did you manage to find the pattern that forces mm_update_next_owner to >> > slow paths? This really shouldn't trigger very often. If we can fallback >> > easily then I suspect that we should be better off reconsidering >> > mm->owner and try to come up with something more clever. I've had a >> > patch to remove owner few years back. It needed some work to finish but >> > maybe that would be a better than try to make non-scalable thing suck >> > less. >> >> Reading through the code I just found a trivial pattern that triggers >> this. Create a multi-threaded process. Have the thread group leader >> (the first thread) exit. > > Hmm, I thought that we try to iterate over threads in the same thread > group first. But we are not doing that. Anyway just CLONE_VM without > CLONE_THREAD would achieve the same pathological path but that should be > rare. Yes, if the child exited. The code searches the children and siblings but the parents of the process that exited. > Group leader exiting early without tearing down the whole thread > group should be quite rare as well. No question that somebody might do > that on purpose though... The group leader exiting early is a completely legitimate and reasonable thing to do, even if it is rare. I think all it would take is one program like that in a work-load for the performance to descend into something unpleasant. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 31-05-18 20:07:28, Eric W. Biederman wrote: > Michal Hocko writes: > > > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: > >> This function searches for a new mm owner in children and siblings, > >> and then iterates over all processes in the system in unlikely case. > >> Despite the case is unlikely, its probability growths with the number > >> of processes in the system. The time, spent on iterations, also growths. > >> I regulary observe mm_update_next_owner() in crash dumps (not related > >> to this function) of the nodes with many processes (20K+), so it looks > >> like it's not so unlikely case. > > > > Did you manage to find the pattern that forces mm_update_next_owner to > > slow paths? This really shouldn't trigger very often. If we can fallback > > easily then I suspect that we should be better off reconsidering > > mm->owner and try to come up with something more clever. I've had a > > patch to remove owner few years back. It needed some work to finish but > > maybe that would be a better than try to make non-scalable thing suck > > less. > > Reading through the code I just found a trivial pattern that triggers > this. Create a multi-threaded process. Have the thread group leader > (the first thread) exit. Hmm, I thought that we try to iterate over threads in the same thread group first. But we are not doing that. Anyway just CLONE_VM without CLONE_THREAD would achieve the same pathological path but that should be rare. Group leader exiting early without tearing down the whole thread group should be quite rare as well. No question that somebody might do that on purpose though... -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 31-05-18 20:07:28, Eric W. Biederman wrote: > Michal Hocko writes: > > > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: > >> This function searches for a new mm owner in children and siblings, > >> and then iterates over all processes in the system in unlikely case. > >> Despite the case is unlikely, its probability growths with the number > >> of processes in the system. The time, spent on iterations, also growths. > >> I regulary observe mm_update_next_owner() in crash dumps (not related > >> to this function) of the nodes with many processes (20K+), so it looks > >> like it's not so unlikely case. > > > > Did you manage to find the pattern that forces mm_update_next_owner to > > slow paths? This really shouldn't trigger very often. If we can fallback > > easily then I suspect that we should be better off reconsidering > > mm->owner and try to come up with something more clever. I've had a > > patch to remove owner few years back. It needed some work to finish but > > maybe that would be a better than try to make non-scalable thing suck > > less. > > Reading through the code I just found a trivial pattern that triggers > this. Create a multi-threaded process. Have the thread group leader > (the first thread) exit. Hmm, I thought that we try to iterate over threads in the same thread group first. But we are not doing that. Anyway just CLONE_VM without CLONE_THREAD would achieve the same pathological path but that should be rare. Group leader exiting early without tearing down the whole thread group should be quite rare as well. No question that somebody might do that on purpose though... -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: >> This function searches for a new mm owner in children and siblings, >> and then iterates over all processes in the system in unlikely case. >> Despite the case is unlikely, its probability growths with the number >> of processes in the system. The time, spent on iterations, also growths. >> I regulary observe mm_update_next_owner() in crash dumps (not related >> to this function) of the nodes with many processes (20K+), so it looks >> like it's not so unlikely case. > > Did you manage to find the pattern that forces mm_update_next_owner to > slow paths? This really shouldn't trigger very often. If we can fallback > easily then I suspect that we should be better off reconsidering > mm->owner and try to come up with something more clever. I've had a > patch to remove owner few years back. It needed some work to finish but > maybe that would be a better than try to make non-scalable thing suck > less. Reading through the code I just found a trivial pattern that triggers this. Create a multi-threaded process. Have the thread group leader (the first thread) exit. This has the potential to be a significant DOS attack if anyone cares. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: >> This function searches for a new mm owner in children and siblings, >> and then iterates over all processes in the system in unlikely case. >> Despite the case is unlikely, its probability growths with the number >> of processes in the system. The time, spent on iterations, also growths. >> I regulary observe mm_update_next_owner() in crash dumps (not related >> to this function) of the nodes with many processes (20K+), so it looks >> like it's not so unlikely case. > > Did you manage to find the pattern that forces mm_update_next_owner to > slow paths? This really shouldn't trigger very often. If we can fallback > easily then I suspect that we should be better off reconsidering > mm->owner and try to come up with something more clever. I've had a > patch to remove owner few years back. It needed some work to finish but > maybe that would be a better than try to make non-scalable thing suck > less. Reading through the code I just found a trivial pattern that triggers this. Create a multi-threaded process. Have the thread group leader (the first thread) exit. This has the potential to be a significant DOS attack if anyone cares. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 27.04.2018 21:05, Eric W. Biederman wrote: > Michal Hockowrites: > >> On Thu 26-04-18 21:28:18, Michal Hocko wrote: >>> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: Michal Hocko writes: > I've had a patch to remove owner few years back. It needed some work > to finish but maybe that would be a better than try to make > non-scalable thing suck less. I have a question. Would it be reasonable to just have a mm->memcg? That would appear to be the simplest solution to the problem. >>> >>> I do not remember details. Have to re-read the whole thing again. Hope >>> to get to this soon but with the current jet lag and backlog from the >>> LSFMM I rather not promis anything. Going with mm->memcg would be the >>> most simple of course but I have a very vague recollection that it was >>> not possible. Maybe I misremember... >> >> Just for the record, the last version where I've tried to remove owner >> was posted here: >> http://lkml.kernel.org/r/1436358472-29137-1-git-send-email-mho...@kernel.org >> >> I didn't get to remember details yet, but the primary problem was the >> task migration between cgroups and the nasty case when different thread >> grounds share the mm. At some point I just suggested to not care >> about semantic of these weird threads all that much. We can either >> migrate all tasks sharing the mm struct or just keep the inconsistency. >> >> Anyway, removing this ugliness would be so cool! > > I suspect the only common user of CLONE_VM today is vfork. And I do > think it is crazy to migrate a process that has called vfork before > calling exec. Other useses of CLONE_VM seem even crazier. > > I think the easiest change to make in mem_cgroup_can_attach would > be just to change the test for when charges are migrated. AKA > > from: > > if (mm->owner == p) { > > } > > to > if (mem_cgroup_from_task(p) == mm->memcg) { > ... > } > > That allows using mm->memcg with no new limitations on when migration > can be called. In crazy cases that has the potential to change which > memcgroup the charges are accounted to, but the choice is already > somewhat arbitrary so I don't think that will be a problem. Especially > given that mm_update_next_owner does not migrate charges if the next > owner is in a different memory cgroup. A mm with tasks using it in > two different cgroups is already questionable if not outright > problematic. > > > Kirill Tkhai do you think you would be able adapt Michal Hoko's old > patch at https://marc.info/?l=linux-kernel=143635857131756=2 > that replaces mm->owner with mm->memcg? I was at vacation. Sorry for the late reply. > We probably want to outlaw migrating an mm where we are not migrating > all of the mm->users eventually. Just because that case is crazy. > But it doesn't look like we need to do that to fix the memory control > group data structures. Kirill
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 27.04.2018 21:05, Eric W. Biederman wrote: > Michal Hocko writes: > >> On Thu 26-04-18 21:28:18, Michal Hocko wrote: >>> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: Michal Hocko writes: > I've had a patch to remove owner few years back. It needed some work > to finish but maybe that would be a better than try to make > non-scalable thing suck less. I have a question. Would it be reasonable to just have a mm->memcg? That would appear to be the simplest solution to the problem. >>> >>> I do not remember details. Have to re-read the whole thing again. Hope >>> to get to this soon but with the current jet lag and backlog from the >>> LSFMM I rather not promis anything. Going with mm->memcg would be the >>> most simple of course but I have a very vague recollection that it was >>> not possible. Maybe I misremember... >> >> Just for the record, the last version where I've tried to remove owner >> was posted here: >> http://lkml.kernel.org/r/1436358472-29137-1-git-send-email-mho...@kernel.org >> >> I didn't get to remember details yet, but the primary problem was the >> task migration between cgroups and the nasty case when different thread >> grounds share the mm. At some point I just suggested to not care >> about semantic of these weird threads all that much. We can either >> migrate all tasks sharing the mm struct or just keep the inconsistency. >> >> Anyway, removing this ugliness would be so cool! > > I suspect the only common user of CLONE_VM today is vfork. And I do > think it is crazy to migrate a process that has called vfork before > calling exec. Other useses of CLONE_VM seem even crazier. > > I think the easiest change to make in mem_cgroup_can_attach would > be just to change the test for when charges are migrated. AKA > > from: > > if (mm->owner == p) { > > } > > to > if (mem_cgroup_from_task(p) == mm->memcg) { > ... > } > > That allows using mm->memcg with no new limitations on when migration > can be called. In crazy cases that has the potential to change which > memcgroup the charges are accounted to, but the choice is already > somewhat arbitrary so I don't think that will be a problem. Especially > given that mm_update_next_owner does not migrate charges if the next > owner is in a different memory cgroup. A mm with tasks using it in > two different cgroups is already questionable if not outright > problematic. > > > Kirill Tkhai do you think you would be able adapt Michal Hoko's old > patch at https://marc.info/?l=linux-kernel=143635857131756=2 > that replaces mm->owner with mm->memcg? I was at vacation. Sorry for the late reply. > We probably want to outlaw migrating an mm where we are not migrating > all of the mm->users eventually. Just because that case is crazy. > But it doesn't look like we need to do that to fix the memory control > group data structures. Kirill
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
ebied...@xmission.com (Eric W. Biederman) writes: > Kirill Tkhai do you think you would be able adapt Michal Hoko's old > patch at https://marc.info/?l=linux-kernel=143635857131756=2 > that replaces mm->owner with mm->memcg? Ouch. At least compared to the current kernel your patch is wide of the mark of what is needed to use mm->memcg correctly. I have however written my own and if someone can review it I would really appreciate it. It looks like we can change mm->owner to mm->memcg without too much trouble and reduce the amount of code in the kernel at the same time. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
ebied...@xmission.com (Eric W. Biederman) writes: > Kirill Tkhai do you think you would be able adapt Michal Hoko's old > patch at https://marc.info/?l=linux-kernel=143635857131756=2 > that replaces mm->owner with mm->memcg? Ouch. At least compared to the current kernel your patch is wide of the mark of what is needed to use mm->memcg correctly. I have however written my own and if someone can review it I would really appreciate it. It looks like we can change mm->owner to mm->memcg without too much trouble and reduce the amount of code in the kernel at the same time. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hockowrites: > On Thu 26-04-18 21:28:18, Michal Hocko wrote: >> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: >> > Michal Hocko writes: >> > >> > > I've had a patch to remove owner few years back. It needed some work >> > > to finish but maybe that would be a better than try to make >> > > non-scalable thing suck less. >> > >> > I have a question. Would it be reasonable to just have a mm->memcg? >> > That would appear to be the simplest solution to the problem. >> >> I do not remember details. Have to re-read the whole thing again. Hope >> to get to this soon but with the current jet lag and backlog from the >> LSFMM I rather not promis anything. Going with mm->memcg would be the >> most simple of course but I have a very vague recollection that it was >> not possible. Maybe I misremember... > > Just for the record, the last version where I've tried to remove owner > was posted here: > http://lkml.kernel.org/r/1436358472-29137-1-git-send-email-mho...@kernel.org > > I didn't get to remember details yet, but the primary problem was the > task migration between cgroups and the nasty case when different thread > grounds share the mm. At some point I just suggested to not care > about semantic of these weird threads all that much. We can either > migrate all tasks sharing the mm struct or just keep the inconsistency. > > Anyway, removing this ugliness would be so cool! I suspect the only common user of CLONE_VM today is vfork. And I do think it is crazy to migrate a process that has called vfork before calling exec. Other useses of CLONE_VM seem even crazier. I think the easiest change to make in mem_cgroup_can_attach would be just to change the test for when charges are migrated. AKA from: if (mm->owner == p) { } to if (mem_cgroup_from_task(p) == mm->memcg) { ... } That allows using mm->memcg with no new limitations on when migration can be called. In crazy cases that has the potential to change which memcgroup the charges are accounted to, but the choice is already somewhat arbitrary so I don't think that will be a problem. Especially given that mm_update_next_owner does not migrate charges if the next owner is in a different memory cgroup. A mm with tasks using it in two different cgroups is already questionable if not outright problematic. Kirill Tkhai do you think you would be able adapt Michal Hoko's old patch at https://marc.info/?l=linux-kernel=143635857131756=2 that replaces mm->owner with mm->memcg? We probably want to outlaw migrating an mm where we are not migrating all of the mm->users eventually. Just because that case is crazy. But it doesn't look like we need to do that to fix the memory control group data structures. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > On Thu 26-04-18 21:28:18, Michal Hocko wrote: >> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: >> > Michal Hocko writes: >> > >> > > I've had a patch to remove owner few years back. It needed some work >> > > to finish but maybe that would be a better than try to make >> > > non-scalable thing suck less. >> > >> > I have a question. Would it be reasonable to just have a mm->memcg? >> > That would appear to be the simplest solution to the problem. >> >> I do not remember details. Have to re-read the whole thing again. Hope >> to get to this soon but with the current jet lag and backlog from the >> LSFMM I rather not promis anything. Going with mm->memcg would be the >> most simple of course but I have a very vague recollection that it was >> not possible. Maybe I misremember... > > Just for the record, the last version where I've tried to remove owner > was posted here: > http://lkml.kernel.org/r/1436358472-29137-1-git-send-email-mho...@kernel.org > > I didn't get to remember details yet, but the primary problem was the > task migration between cgroups and the nasty case when different thread > grounds share the mm. At some point I just suggested to not care > about semantic of these weird threads all that much. We can either > migrate all tasks sharing the mm struct or just keep the inconsistency. > > Anyway, removing this ugliness would be so cool! I suspect the only common user of CLONE_VM today is vfork. And I do think it is crazy to migrate a process that has called vfork before calling exec. Other useses of CLONE_VM seem even crazier. I think the easiest change to make in mem_cgroup_can_attach would be just to change the test for when charges are migrated. AKA from: if (mm->owner == p) { } to if (mem_cgroup_from_task(p) == mm->memcg) { ... } That allows using mm->memcg with no new limitations on when migration can be called. In crazy cases that has the potential to change which memcgroup the charges are accounted to, but the choice is already somewhat arbitrary so I don't think that will be a problem. Especially given that mm_update_next_owner does not migrate charges if the next owner is in a different memory cgroup. A mm with tasks using it in two different cgroups is already questionable if not outright problematic. Kirill Tkhai do you think you would be able adapt Michal Hoko's old patch at https://marc.info/?l=linux-kernel=143635857131756=2 that replaces mm->owner with mm->memcg? We probably want to outlaw migrating an mm where we are not migrating all of the mm->users eventually. Just because that case is crazy. But it doesn't look like we need to do that to fix the memory control group data structures. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 26-04-18 21:28:18, Michal Hocko wrote: > On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: > > Michal Hockowrites: > > > > > I've had a patch to remove owner few years back. It needed some work > > > to finish but maybe that would be a better than try to make > > > non-scalable thing suck less. > > > > I have a question. Would it be reasonable to just have a mm->memcg? > > That would appear to be the simplest solution to the problem. > > I do not remember details. Have to re-read the whole thing again. Hope > to get to this soon but with the current jet lag and backlog from the > LSFMM I rather not promis anything. Going with mm->memcg would be the > most simple of course but I have a very vague recollection that it was > not possible. Maybe I misremember... Just for the record, the last version where I've tried to remove owner was posted here: http://lkml.kernel.org/r/1436358472-29137-1-git-send-email-mho...@kernel.org I didn't get to remember details yet, but the primary problem was the task migration between cgroups and the nasty case when different thread grounds share the mm. At some point I just suggested to not care about semantic of these weird threads all that much. We can either migrate all tasks sharing the mm struct or just keep the inconsistency. Anyway, removing this ugliness would be so cool! -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 26-04-18 21:28:18, Michal Hocko wrote: > On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: > > Michal Hocko writes: > > > > > I've had a patch to remove owner few years back. It needed some work > > > to finish but maybe that would be a better than try to make > > > non-scalable thing suck less. > > > > I have a question. Would it be reasonable to just have a mm->memcg? > > That would appear to be the simplest solution to the problem. > > I do not remember details. Have to re-read the whole thing again. Hope > to get to this soon but with the current jet lag and backlog from the > LSFMM I rather not promis anything. Going with mm->memcg would be the > most simple of course but I have a very vague recollection that it was > not possible. Maybe I misremember... Just for the record, the last version where I've tried to remove owner was posted here: http://lkml.kernel.org/r/1436358472-29137-1-git-send-email-mho...@kernel.org I didn't get to remember details yet, but the primary problem was the task migration between cgroups and the nasty case when different thread grounds share the mm. At some point I just suggested to not care about semantic of these weird threads all that much. We can either migrate all tasks sharing the mm struct or just keep the inconsistency. Anyway, removing this ugliness would be so cool! -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: > Michal Hockowrites: > > > I've had a patch to remove owner few years back. It needed some work > > to finish but maybe that would be a better than try to make > > non-scalable thing suck less. > > I have a question. Would it be reasonable to just have a mm->memcg? > That would appear to be the simplest solution to the problem. I do not remember details. Have to re-read the whole thing again. Hope to get to this soon but with the current jet lag and backlog from the LSFMM I rather not promis anything. Going with mm->memcg would be the most simple of course but I have a very vague recollection that it was not possible. Maybe I misremember... -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 26-04-18 11:19:33, Eric W. Biederman wrote: > Michal Hocko writes: > > > I've had a patch to remove owner few years back. It needed some work > > to finish but maybe that would be a better than try to make > > non-scalable thing suck less. > > I have a question. Would it be reasonable to just have a mm->memcg? > That would appear to be the simplest solution to the problem. I do not remember details. Have to re-read the whole thing again. Hope to get to this soon but with the current jet lag and backlog from the LSFMM I rather not promis anything. Going with mm->memcg would be the most simple of course but I have a very vague recollection that it was not possible. Maybe I misremember... -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hockowrites: > I've had a patch to remove owner few years back. It needed some work > to finish but maybe that would be a better than try to make > non-scalable thing suck less. I have a question. Would it be reasonable to just have a mm->memcg? That would appear to be the simplest solution to the problem. That would require failing migration between memory cgroups if you are not moving all of processes/threads that have a given mm_struct. That should not be a huge restriction as typically it is only threads that share a mm. Further the check should be straigh forward: counting the number of threads and verifying the count matches the count on the mm_struct. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
Michal Hocko writes: > I've had a patch to remove owner few years back. It needed some work > to finish but maybe that would be a better than try to make > non-scalable thing suck less. I have a question. Would it be reasonable to just have a mm->memcg? That would appear to be the simplest solution to the problem. That would require failing migration between memory cgroups if you are not moving all of processes/threads that have a given mm_struct. That should not be a huge restriction as typically it is only threads that share a mm. Further the check should be straigh forward: counting the number of threads and verifying the count matches the count on the mm_struct. Eric
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 04/26, Kirill Tkhai wrote: > > We can rework this simply by adding a list of tasks to mm. Perhaps, but then I think this list should not depend on mm->owner. I mean, mm->list_of_group_leaders_which_use_this_mm can be used by coredump and oom-killer at least. But this is not that simple... Oleg.
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 04/26, Kirill Tkhai wrote: > > We can rework this simply by adding a list of tasks to mm. Perhaps, but then I think this list should not depend on mm->owner. I mean, mm->list_of_group_leaders_which_use_this_mm can be used by coredump and oom-killer at least. But this is not that simple... Oleg.
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 26.04.2018 16:07, Michal Hocko wrote: > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: >> This function searches for a new mm owner in children and siblings, >> and then iterates over all processes in the system in unlikely case. >> Despite the case is unlikely, its probability growths with the number >> of processes in the system. The time, spent on iterations, also growths. >> I regulary observe mm_update_next_owner() in crash dumps (not related >> to this function) of the nodes with many processes (20K+), so it looks >> like it's not so unlikely case. > > Did you manage to find the pattern that forces mm_update_next_owner to > slow paths? This really shouldn't trigger very often. If we can fallback > easily then I suspect that we should be better off reconsidering > mm->owner and try to come up with something more clever. I've had a > patch to remove owner few years back. It needed some work to finish but > maybe that would be a better than try to make non-scalable thing suck > less. It's not easy to find a pattern on such the big number of processes, especially because of only the final result is visible in crash. I assume, this may be connected with some unexpected signals received by task set in topology, but I'm not sure. Though, even it becomes the most unlikely situation with some small not zero probability, it still has to be optimized to minimize unexpected situations. We can rework this simply by adding a list of tasks to mm. Also, reuse task_lock() of the mm owner for that. Something like: assign_task_mm(struct mm_struct *mm, struct task_struct *task) { int again; again: again = 0; rcu_read_lock(); task = mm->owner; if (!task) goto unlock; task_lock(task); if (mm->owner = task) llist_add(>mm_list, >task_list); else again = 1; task_unlock(task); unlock: rcu_read_unlock(); if (again) goto again; } Kirill
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 26.04.2018 16:07, Michal Hocko wrote: > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: >> This function searches for a new mm owner in children and siblings, >> and then iterates over all processes in the system in unlikely case. >> Despite the case is unlikely, its probability growths with the number >> of processes in the system. The time, spent on iterations, also growths. >> I regulary observe mm_update_next_owner() in crash dumps (not related >> to this function) of the nodes with many processes (20K+), so it looks >> like it's not so unlikely case. > > Did you manage to find the pattern that forces mm_update_next_owner to > slow paths? This really shouldn't trigger very often. If we can fallback > easily then I suspect that we should be better off reconsidering > mm->owner and try to come up with something more clever. I've had a > patch to remove owner few years back. It needed some work to finish but > maybe that would be a better than try to make non-scalable thing suck > less. It's not easy to find a pattern on such the big number of processes, especially because of only the final result is visible in crash. I assume, this may be connected with some unexpected signals received by task set in topology, but I'm not sure. Though, even it becomes the most unlikely situation with some small not zero probability, it still has to be optimized to minimize unexpected situations. We can rework this simply by adding a list of tasks to mm. Also, reuse task_lock() of the mm owner for that. Something like: assign_task_mm(struct mm_struct *mm, struct task_struct *task) { int again; again: again = 0; rcu_read_lock(); task = mm->owner; if (!task) goto unlock; task_lock(task); if (mm->owner = task) llist_add(>mm_list, >task_list); else again = 1; task_unlock(task); unlock: rcu_read_unlock(); if (again) goto again; } Kirill
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 04/26, Michal Hocko wrote: > > then I suspect that we should be better off reconsidering > mm->owner and try to come up with something more clever. I've had a > patch to remove owner few years back. Yeees, I do remember we discussed this but I forgot everything. I agree, it would nice to turn mm->owner into mm->mem_cgroup and just kill the ugly mm_update_next_owner/etc. Oleg.
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On 04/26, Michal Hocko wrote: > > then I suspect that we should be better off reconsidering > mm->owner and try to come up with something more clever. I've had a > patch to remove owner few years back. Yeees, I do remember we discussed this but I forgot everything. I agree, it would nice to turn mm->owner into mm->mem_cgroup and just kill the ugly mm_update_next_owner/etc. Oleg.
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: > This function searches for a new mm owner in children and siblings, > and then iterates over all processes in the system in unlikely case. > Despite the case is unlikely, its probability growths with the number > of processes in the system. The time, spent on iterations, also growths. > I regulary observe mm_update_next_owner() in crash dumps (not related > to this function) of the nodes with many processes (20K+), so it looks > like it's not so unlikely case. Did you manage to find the pattern that forces mm_update_next_owner to slow paths? This really shouldn't trigger very often. If we can fallback easily then I suspect that we should be better off reconsidering mm->owner and try to come up with something more clever. I've had a patch to remove owner few years back. It needed some work to finish but maybe that would be a better than try to make non-scalable thing suck less. -- Michal Hocko SUSE Labs
Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable
On Thu 26-04-18 14:00:19, Kirill Tkhai wrote: > This function searches for a new mm owner in children and siblings, > and then iterates over all processes in the system in unlikely case. > Despite the case is unlikely, its probability growths with the number > of processes in the system. The time, spent on iterations, also growths. > I regulary observe mm_update_next_owner() in crash dumps (not related > to this function) of the nodes with many processes (20K+), so it looks > like it's not so unlikely case. Did you manage to find the pattern that forces mm_update_next_owner to slow paths? This really shouldn't trigger very often. If we can fallback easily then I suspect that we should be better off reconsidering mm->owner and try to come up with something more clever. I've had a patch to remove owner few years back. It needed some work to finish but maybe that would be a better than try to make non-scalable thing suck less. -- Michal Hocko SUSE Labs