Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 4:18 PM, Tejun Heo wrote: > Hello, > > On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote: >> I am not limited by kernel memory, I am limited by PIDs, and I need to >> be able to manage them. memory.kmem.usage_in_bytes seems to be far >> too noisy to be useful for this purpose. It may work fine for "just >> stop a fork bomb" but not for any sort of finer-grained control. > > So, why are you limited by PIDs other than the arcane / weird > limitation that you have whereever that limitation is? Does anyone anywhere actually set PID_MAX > 64K? As far as I can tell, distros default it to 32K or 64K because there's a lot of stuff out there that assumes this to be true. This is the problem we have - deep down in the bowels of code that is taking literally years to overhaul, we have identified a bad assumption that PIDs are always 5 characters long. I can't fix it any faster. That said, we also identified other software that make similar assumptions, though they are less critical to us. >> > If you think you can tilt it the other way, please feel free to try. >> >> Just because others caved, doesn't make it less of a hack. And I will >> cave, too, because I don't have time to bang my head against a wall, >> especially when I can see the remnants of other people who have tried. >> >> We'll work around it, or we'll hack around it, or we'll carry this >> patch in our own tree and just grumble about ridiculous hacks every >> time we have to forward port it. >> >> I was just hoping that things had worked themselves out in the last year. > > It's kinda weird getting this response, as I don't think it has been > particularly walley. The arguments were pretty sound from what I > recall and Frederic's use case was actually better covered by kmemcg, > so where's the said wall? And I asked you why your use case is > different and the only reason you gave me is some arbitrary PID > limitation on whatever thing you're using, which you gotta agree is a > pretty hard sell. So, if you think you have a valid case, please just > explain it. Why go passive agressive on it? If you don't have a > valid case for pushing it, yes, you'll have to hack around it - carry > the patches in your tree, whatever, or better, fix the weird PID > problem. Sorry Tejun, you're being very reasonable, I was not. The history of this patch is what makes me frustrated. It seems like such an obvious thing to support that it blows my mind that people argue it. You know our environment. Users can use their memory budgets however they like - kernel or userspace. We have good accounting, but we are PID limited. We've even implemented some hacks of our own to make that hurt less because the previously-mentioned assumptions are just NOT going away any time soon. I literally have user bugs every week on this. Hopefully the hacks we have put in place will make the users stop hurting. But we're left with some residual problems, some of which are because the only limits we can apply are per-user rather than per-container. >From our POV building the cluster, cgroups are strictly superior to most other control interfaces because they work at the same granularity that we do. I want more things to support cgroup control. This particular one was double-tasty because the ability to set the limit to 0 would actually solve a different problem we have in teardown. But as I said, we can mostly work around that. So I am frustrated because I don't think my use case will convince you (at the root of it, it is a problem of our own making, but it LONG predates me), despite my belief that it is obviously a good feature. I find myself hoping that someone else comes along and says "me too" rather than using a totally different hack for this. Oh well. Thanks for the update. Off to do our own thing again. > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
Hello, On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote: > I am not limited by kernel memory, I am limited by PIDs, and I need to > be able to manage them. memory.kmem.usage_in_bytes seems to be far > too noisy to be useful for this purpose. It may work fine for "just > stop a fork bomb" but not for any sort of finer-grained control. So, why are you limited by PIDs other than the arcane / weird limitation that you have whereever that limitation is? > > If you think you can tilt it the other way, please feel free to try. > > Just because others caved, doesn't make it less of a hack. And I will > cave, too, because I don't have time to bang my head against a wall, > especially when I can see the remnants of other people who have tried. > > We'll work around it, or we'll hack around it, or we'll carry this > patch in our own tree and just grumble about ridiculous hacks every > time we have to forward port it. > > I was just hoping that things had worked themselves out in the last year. It's kinda weird getting this response, as I don't think it has been particularly walley. The arguments were pretty sound from what I recall and Frederic's use case was actually better covered by kmemcg, so where's the said wall? And I asked you why your use case is different and the only reason you gave me is some arbitrary PID limitation on whatever thing you're using, which you gotta agree is a pretty hard sell. So, if you think you have a valid case, please just explain it. Why go passive agressive on it? If you don't have a valid case for pushing it, yes, you'll have to hack around it - carry the patches in your tree, whatever, or better, fix the weird PID problem. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 3:35 PM, Tejun Heo wrote: > Hey, > > On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote: >> > U so that's why you guys can't use kernel memory limit? :( >> >> Because it is completely non-obvious how to map between the two in a >> way that is safe across kernel versions and not likely to blow up in >> our faces. It's a hack, in other words. > > Now we're repeating the argument Frederic and Johannes had, so I'd > suggest going back the thread and reading the discussion and if you > still think using kmemcg is a bad idea, please explain why that is so. > For the specific point that you just raised, the scale tilted toward > thread/process count is a hacky and unreliable representation of > kernel memory resource than the other way around, at least back then. I am not limited by kernel memory, I am limited by PIDs, and I need to be able to manage them. memory.kmem.usage_in_bytes seems to be far too noisy to be useful for this purpose. It may work fine for "just stop a fork bomb" but not for any sort of finer-grained control. > If you think you can tilt it the other way, please feel free to try. Just because others caved, doesn't make it less of a hack. And I will cave, too, because I don't have time to bang my head against a wall, especially when I can see the remnants of other people who have tried. We'll work around it, or we'll hack around it, or we'll carry this patch in our own tree and just grumble about ridiculous hacks every time we have to forward port it. I was just hoping that things had worked themselves out in the last year. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
Hey, On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote: > > U so that's why you guys can't use kernel memory limit? :( > > Because it is completely non-obvious how to map between the two in a > way that is safe across kernel versions and not likely to blow up in > our faces. It's a hack, in other words. Now we're repeating the argument Frederic and Johannes had, so I'd suggest going back the thread and reading the discussion and if you still think using kmemcg is a bad idea, please explain why that is so. For the specific point that you just raised, the scale tilted toward thread/process count is a hacky and unreliable representation of kernel memory resource than the other way around, at least back then. If you think you can tilt it the other way, please feel free to try. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 3:03 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote: >> We run dozens of jobs from dozens users on a single machine. We >> regularly experience users who leak threads, running into the tens of >> thousands. We are unable to raise the PID_MAX significantly due to >> some bad, but really thoroughly baked-in decisions that were made a >> long time ago. What we experience on a daily basis is users > > U so that's why you guys can't use kernel memory limit? :( Because it is completely non-obvious how to map between the two in a way that is safe across kernel versions and not likely to blow up in our faces. It's a hack, in other words. >> complaining about getting a "pthread_create(): resource unavailable" >> error because someone on the machine has leaked. > ... >> What I really don't understand is why so much push back? We have this >> nicely structured cgroup system. Each cgroup controller's code is >> pretty well partitioned - why would we not want more complete >> functionality built around it? We accept device drivers for the most >> random, useless crap on the assertion that "if you don't need it, >> don't compile it in". I can think of a half dozen more really useful, >> cool things we can do with cgroups, but I know the pushback will be >> tremendous, and I just don't grok why. > > In general, because it adds to maintenance overhead. e.g. We've been > trying to make all cgroups follow consistent nesting rules. We're now > almost there with a couple controllers left. This one would have been > another thing to do, which is fine if it's necessary but if it isn't > we're just adding up work for no good reason. > > More importantly, because cgroup is already plagued with so many bad > design decisions - some from core design decisions - e.g. not being > able to actually identify a resource outside of a context of a task. > Others are added on by each controller going out doing whatever it > wants without thinking about how the whole thing would come together > afterwards - e.g. double accounting between cpu and cpuacct, > completely illogical and unusable hierarchy implementations in > anything other than cpu controllers (they're getting better), and so > on. Right now it's in a state where there's not many things coherent > about it. Sure, every controller and feature supports the ones their > makers intended them to but when collected together it's just a mess, > which is one of the common complaints against cgroup. > > So, no free-for-all, please. > > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
Hello, Tim. On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote: > We run dozens of jobs from dozens users on a single machine. We > regularly experience users who leak threads, running into the tens of > thousands. We are unable to raise the PID_MAX significantly due to > some bad, but really thoroughly baked-in decisions that were made a > long time ago. What we experience on a daily basis is users U so that's why you guys can't use kernel memory limit? :( > complaining about getting a "pthread_create(): resource unavailable" > error because someone on the machine has leaked. ... > What I really don't understand is why so much push back? We have this > nicely structured cgroup system. Each cgroup controller's code is > pretty well partitioned - why would we not want more complete > functionality built around it? We accept device drivers for the most > random, useless crap on the assertion that "if you don't need it, > don't compile it in". I can think of a half dozen more really useful, > cool things we can do with cgroups, but I know the pushback will be > tremendous, and I just don't grok why. In general, because it adds to maintenance overhead. e.g. We've been trying to make all cgroups follow consistent nesting rules. We're now almost there with a couple controllers left. This one would have been another thing to do, which is fine if it's necessary but if it isn't we're just adding up work for no good reason. More importantly, because cgroup is already plagued with so many bad design decisions - some from core design decisions - e.g. not being able to actually identify a resource outside of a context of a task. Others are added on by each controller going out doing whatever it wants without thinking about how the whole thing would come together afterwards - e.g. double accounting between cpu and cpuacct, completely illogical and unusable hierarchy implementations in anything other than cpu controllers (they're getting better), and so on. Right now it's in a state where there's not many things coherent about it. Sure, every controller and feature supports the ones their makers intended them to but when collected together it's just a mess, which is one of the common complaints against cgroup. So, no free-for-all, please. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 1:29 PM, Tejun Heo wrote: > On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote: >> Pardon my ignorance, but... what? Use kernel memory limits as a proxy >> for process/thread counts? That sounds terrible - I hope I am > > Well, the argument was that process / thread counts were a poor and > unnecessary proxy for kernel memory consumption limit. IIRC, Johannes > put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread > worth of component". > >> misunderstanding? This task counter patch had several properties that >> mapped very well to what we want. >> >> Is it dead in the water? > > After some discussion, Frederic agreed that at least his use case can > be served well by kmemcg, maybe even better - IIRC it was container > fork bomb scenario, so you'll have to argue your way in why kmemcg > isn't a suitable solution for your use case if you wanna revive this. We run dozens of jobs from dozens users on a single machine. We regularly experience users who leak threads, running into the tens of thousands. We are unable to raise the PID_MAX significantly due to some bad, but really thoroughly baked-in decisions that were made a long time ago. What we experience on a daily basis is users complaining about getting a "pthread_create(): resource unavailable" error because someone on the machine has leaked. Today we use RLIMIT_NPROC to lock most users down to a smaller max. But this is a per-user setting, not a per-container setting, and users do not control where their jobs land. Scheduling decisions often put multiple thread-heavy but non-leaking jobs from one user onto the same machine, which again causes problems. Further, it does not help for some of our use cases where a logical job can run as multiple UIDs for different processes within. >From the end-user point of view this is an isolation leak which is totally non-deterministic for them. They can not know what to plan for. Getting cgroup-level control of this limit is important for a saner SLA for our users. In addition, the behavior around locking-out new tasks seems like a nice way to simplify and clean up end-life work for the administrative system. Admittedly, we can mostly work around this with freezer instead. What I really don't understand is why so much push back? We have this nicely structured cgroup system. Each cgroup controller's code is pretty well partitioned - why would we not want more complete functionality built around it? We accept device drivers for the most random, useless crap on the assertion that "if you don't need it, don't compile it in". I can think of a half dozen more really useful, cool things we can do with cgroups, but I know the pushback will be tremendous, and I just don't grok why. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote: > Pardon my ignorance, but... what? Use kernel memory limits as a proxy > for process/thread counts? That sounds terrible - I hope I am Well, the argument was that process / thread counts were a poor and unnecessary proxy for kernel memory consumption limit. IIRC, Johannes put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k thread worth of component". > misunderstanding? This task counter patch had several properties that > mapped very well to what we want. > > Is it dead in the water? After some discussion, Frederic agreed that at least his use case can be served well by kmemcg, maybe even better - IIRC it was container fork bomb scenario, so you'll have to argue your way in why kmemcg isn't a suitable solution for your use case if you wanna revive this. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 1, 2013 at 11:46 AM, Tejun Heo wrote: > On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote: >> A year later - what ever happened with this? I want it more than ever >> for Google's use. > > I think the conclusion was "use kmemcg instead". Pardon my ignorance, but... what? Use kernel memory limits as a proxy for process/thread counts? That sounds terrible - I hope I am misunderstanding? This task counter patch had several properties that mapped very well to what we want. Is it dead in the water? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote: > A year later - what ever happened with this? I want it more than ever > for Google's use. I think the conclusion was "use kmemcg instead". Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] cgroups: Task counter subsystem v8
A year later - what ever happened with this? I want it more than ever for Google's use. On Tue, Jan 31, 2012 at 7:37 PM, Frederic Weisbecker wrote: > Hi, > > Changes In this version: > > - Split 32/64 bits version of res_counter_write_u64() [1/10] > Courtesy of Kirill A. Shutemov > > - Added Kirill's ack [8/10] > > - Added selftests [9/10], [10/10] > > Please consider for merging. At least two users want this feature: > https://lkml.org/lkml/2011/12/13/309 > https://lkml.org/lkml/2011/12/13/364 > > More general details provided in the last version posting: > https://lkml.org/lkml/2012/1/13/230 > > Thanks! > > > Frederic Weisbecker (9): > cgroups: add res_counter_write_u64() API > cgroups: new resource counter inheritance API > cgroups: ability to stop res charge propagation on bounded ancestor > res_counter: allow charge failure pointer to be null > cgroups: pull up res counter charge failure interpretation to caller > cgroups: allow subsystems to cancel a fork > cgroups: Add a task counter subsystem > selftests: Enter each directories before executing selftests > selftests: Add a new task counter selftest > > Kirill A. Shutemov (1): > cgroups: add res counter common ancestor searching > > Documentation/cgroups/resource_counter.txt | 20 ++- > Documentation/cgroups/task_counter.txt | 153 +++ > include/linux/cgroup.h | 20 +- > include/linux/cgroup_subsys.h |5 + > include/linux/res_counter.h| 27 ++- > init/Kconfig |9 + > kernel/Makefile|1 + > kernel/cgroup.c| 23 ++- > kernel/cgroup_freezer.c|6 +- > kernel/cgroup_task_counter.c | 272 > > kernel/exit.c |2 +- > kernel/fork.c |7 +- > kernel/res_counter.c | 103 +++- > tools/testing/selftests/Makefile |2 +- > tools/testing/selftests/run_tests |6 +- > tools/testing/selftests/task_counter/Makefile |8 + > tools/testing/selftests/task_counter/fork.c| 40 +++ > tools/testing/selftests/task_counter/forkbomb.c| 40 +++ > tools/testing/selftests/task_counter/multithread.c | 68 + > tools/testing/selftests/task_counter/run_test | 198 ++ > .../selftests/task_counter/spread_thread_group.c | 82 ++ > 21 files changed, 1056 insertions(+), 36 deletions(-) > create mode 100644 Documentation/cgroups/task_counter.txt > create mode 100644 kernel/cgroup_task_counter.c > create mode 100644 tools/testing/selftests/task_counter/Makefile > create mode 100644 tools/testing/selftests/task_counter/fork.c > create mode 100644 tools/testing/selftests/task_counter/forkbomb.c > create mode 100644 tools/testing/selftests/task_counter/multithread.c > create mode 100755 tools/testing/selftests/task_counter/run_test > create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c > > -- > 1.7.5.4 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/