Re: [PATCHSET] cpuset: decouple cpuset locking from cgroup core, take#2
On Mon, Jan 7, 2013 at 5:31 PM, Li Zefan wrote: > > I don't think Paul's still maintaining cpusets. Normally it's Andrew > that picks up cpuset patches. It's fine you route it through cgroup > tree. Yes, I'm sorry - I should have handed on cpusets at the time I had to hand on cgroups. I was only really ever the maintainer for cpusets because Paul Jackson asked me to take it over when he retired, as I understood the cgroups-related parts of it. I never really had a good grasp of how the some of the lower-level parts of it interacted with the rest of the system (e.g. offlining, CPUs, scheduler domains, etc) anyway ... Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller Add Boot Option
On Mon, Feb 25, 2008 at 7:01 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > > > > - foo doesn't show up in /proc/cgroups > > Or we can print out the disable flag, maybe this will be better? > Because we can distinguish from disabled and not compiled in from > > /proc/cgroups. Certainly possible, if people felt it was useful. > > > - foo isn't auto-mounted if you mount all cgroups in a single hierarchy > > - foo isn't visible as an individually mountable subsystem > > You mentioned in a previous mail if we mount a disabled subsystem we > will get an error. Here we just ignore the mount option. Which makes > more sense ? > No, we don't ignore the mount option - we give an error since it doesn't refer to a valid subsystem. (And in the first case there is no mount option). Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files
On Mon, Feb 25, 2008 at 7:23 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > > Should those pathces be rebased againt 2.6.25-rc3 ? > No, because they're against 2.6.25-rc2-mm1, which is already has (I think) any of the new bits in 2.6.25-rc3 that would be affected by these patches. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller Add Boot Option
I'll send out a prototype for comment. Something like the patch below. The effects of cgroup_disable=foo are: - foo doesn't show up in /proc/cgroups - foo isn't auto-mounted if you mount all cgroups in a single hierarchy - foo isn't visible as an individually mountable subsystem As a result there will only ever be one call to foo->create(), at init time; all processes will stay in this group, and the group will never be mounted on a visible hierarchy. Any additional effects (e.g. not allocating metadata) are up to the foo subsystem. This doesn't handle early_init subsystems (their "disabled" bit isn't set be, but it could easily be extended to do so if any of the early_init systems wanted it - I think it would just involve some nastier parameter processing since it would occur before the command-line argument parser had been run. include/linux/cgroup.h |1 + kernel/cgroup.c| 29 +++-- 2 files changed, 28 insertions(+), 2 deletions(-) Index: cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h === --- cgroup_disable-2.6.25-rc2-mm1.orig/include/linux/cgroup.h +++ cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h @@ -256,6 +256,7 @@ struct cgroup_subsys { void (*bind)(struct cgroup_subsys *ss, struct cgroup *root); int subsys_id; int active; + int disabled; int early_init; #define MAX_CGROUP_TYPE_NAMELEN 32 const char *name; Index: cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c === --- cgroup_disable-2.6.25-rc2-mm1.orig/kernel/cgroup.c +++ cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c @@ -790,7 +790,14 @@ static int parse_cgroupfs_options(char * if (!*token) return -EINVAL; if (!strcmp(token, "all")) { - opts->subsys_bits = (1 << CGROUP_SUBSYS_COUNT) - 1; + /* Add all non-disabled subsystems */ + int i; + opts->subsys_bits = 0; + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { + struct cgroup_subsys *ss = subsys[i]; + if (!ss->disabled) + opts->subsys_bits |= 1ul << i; + } } else if (!strcmp(token, "noprefix")) { set_bit(ROOT_NOPREFIX, &opts->flags); } else if (!strncmp(token, "release_agent=", 14)) { @@ -808,7 +815,8 @@ static int parse_cgroupfs_options(char * for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { ss = subsys[i]; if (!strcmp(token, ss->name)) { - set_bit(i, &opts->subsys_bits); + if (!ss->disabled) + set_bit(i, &opts->subsys_bits); break; } } @@ -2596,6 +2606,8 @@ static int proc_cgroupstats_show(struct mutex_lock(&cgroup_mutex); for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { struct cgroup_subsys *ss = subsys[i]; + if (ss->disabled) + continue; seq_printf(m, "%s\t%lu\t%d\n", ss->name, ss->root->subsys_bits, ss->root->number_of_cgroups); @@ -2991,3 +3003,16 @@ static void cgroup_release_agent(struct spin_unlock(&release_list_lock); mutex_unlock(&cgroup_mutex); } + +static int __init cgroup_disable(char *str) +{ + int i; + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { + struct cgroup_subsys *ss = subsys[i]; + if (!strcmp(str, ss->name)) { + ss->disabled = 1; + break; + } + } +} +__setup("cgroup_disable=", cgroup_disable); Sure thing, if css has the flag, then it would nice. Could you wrap it up to say something like css_disabled(&mem_cgroup_subsys) It's the subsys object rather than the css (cgroup_subsys_state). We could have something like: #define cgroup_subsys_disabled(_ss) ((ss_)->disabled) but I don't see that cgroup_subsys_disabled(&mem_cgroup_subsys) is better than just putting mem_cgroup_subsys.disabled Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller Add Boot Option
On Mon, Feb 25, 2008 at 9:18 AM, Balbir Singh <[EMAIL PROTECTED]> wrote: > > I thought about it, but it did not work out all that well. The reason being, > that the memory controller is called in from places besides cgroup. > mem_cgroup_charge_common() for example is called from several places in mm. > Calling into cgroups to check, enabled/disabled did not seem right. You wouldn't need to call into cgroups - if it's a flag in the subsys object (which is defined in memcontrol.c) you'd just say if (mem_cgroup_subsys.disabled) { ... } I'll send out a prototype for comment. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller Add Boot Option
On Mon, Feb 25, 2008 at 3:55 AM, Balbir Singh <[EMAIL PROTECTED]> wrote: > > > A boot option for the memory controller was discussed on lkml. It is a good > idea to add it, since it saves memory for people who want to turn off the > memory controller. > > By default the option is on for the following two reasons > > 1. It provides compatibility with the current scheme where the memory >controller turns on if the config option is enabled > 2. It allows for wider testing of the memory controller, once the config >option is enabled > > We still allow the create, destroy callbacks to succeed, since they are > not aware of boot options. We do not populate the directory will > memory resource controller specific files. Would it make more sense to have a generic cgroups boot option for this? Something like cgroup_disable=xxx, which would be parsed by cgroups and would cause: - a "disabled" flag to be set to true in the subsys object (you could use this in place of the mem_cgroup_on flag) - prevent the disabled cgroup from being bound to any mounted hierarchy (so it would be ignored in a mount with no subsystem options, and a mount with options that specifically pick that subsystem would give an error) Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cgroup: fix default notify_on_release setting
On Sun, Feb 24, 2008 at 9:53 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > The documentation says the default value of notify_on_release of > a child cgroup is inherited from its parent, which is reasonable, > but the implementation just sets the flag disabled. > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> Yes, I guess it makes sense to follow the original cpusets behaviour. I think that got lost when the notify-on-release functionality was temporarily removed during cgroups development. > --- > kernel/cgroup.c |4 +++- > 1 files changed, 3 insertions(+), 1 deletions(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index d8abe99..e9c2fb0 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -2232,7 +2232,6 @@ static long cgroup_create(struct cgroup *parent, > struct dentry *dentry, > > mutex_lock(&cgroup_mutex); > > - cgrp->flags = 0; > INIT_LIST_HEAD(&cgrp->sibling); > INIT_LIST_HEAD(&cgrp->children); > INIT_LIST_HEAD(&cgrp->css_sets); > @@ -2242,6 +2241,9 @@ static long cgroup_create(struct cgroup *parent, > struct dentry *dentry, > cgrp->root = parent->root; > cgrp->top_cgroup = parent->top_cgroup; > > + if (notify_on_release(parent)) > + set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags); > + > for_each_subsys(root, ss) { > struct cgroup_subsys_state *css = ss->create(ss, cgrp); > if (IS_ERR(css)) { > -- > 1.5.4.rc3 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] ResCounter: Use read_uint in memory controller
On Sat, Feb 23, 2008 at 6:47 PM, Balbir Singh <[EMAIL PROTECTED]> wrote: > >> res_counter_read_u64() I'd also want to rename all the other > >> *read_uint functions/fields to *read_u64 too. Can I do that in a > >> separate patch? > >> > > > > Sounds sensible to me. > > > > Sure, fair enough. > Actually, since multiple people were asking for this change I did the search/replace and sent it out already (as a precursor of the other patches in the series that I sent today). Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/8] sched: rt-group: interface
On Sat, Feb 23, 2008 at 12:26 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > In that case I guess I'll have to add signed versions of the > > read_uint/write_uint methods. > > Yes, I looked at that, I found the interface somewhat unfortunate, it > would mean growing the struct with two more function pointers. Is that really a big deal? We're talking about a structure that has a small number (<10 in the current tree) of instances per cgroup subsystem. > Perhaps a > read and write function with abstract data would be better suited. That > would allow for this and more. Sadly it looses type information. If the size of the struct cftype really became a problem, I think the cleanest way to fix it would be to have a union of the potential function pointers, and add a field to specify which one is in use. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/8] sched: rt-group: interface
On Sat, Feb 23, 2008 at 11:57 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > If so, could we avoid that problem by using 0 rather than -1 as the > > "unlimited" value? It looks from what I've read in the Documentation > > changes as though 0 isn't really a meaningful value. > > 0 means no time, quite useful and clearly distinct from inf. time. > So a real-time task in a cgroup with a 0 rt_runtime can be in the R state but never actually get to run? OK, if people need to be able to do that then fair enough. In that case I guess I'll have to add signed versions of the read_uint/write_uint methods. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/8] sched: rt-group: interface
On Mon, Feb 4, 2008 at 1:03 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > +static int cpu_rt_runtime_write(struct cgroup *cgrp, struct cftype *cft, > + struct file *file, > + const char __user *userbuf, > + size_t nbytes, loff_t *unused_ppos) > +{ > + char buffer[64]; > + int retval = 0; > + s64 val; > + char *end; > + > + if (!nbytes) > + return -EINVAL; > + if (nbytes >= sizeof(buffer)) > + return -E2BIG; > + if (copy_from_user(buffer, userbuf, nbytes)) > + return -EFAULT; > + > + buffer[nbytes] = 0; /* nul-terminate */ > + > + /* strip newline if necessary */ > + if (nbytes && (buffer[nbytes-1] == '\n')) > + buffer[nbytes-1] = 0; > + val = simple_strtoll(buffer, &end, 0); > + if (*end) > + return -EINVAL; > + > + /* Pass to subsystem */ > + retval = sched_group_set_rt_runtime(cgroup_tg(cgrp), val); > + if (!retval) > + retval = nbytes; > + return retval; > } > > -static u64 cpu_rt_ratio_read_uint(struct cgroup *cgrp, struct cftype *cft) > -{ > - struct task_group *tg = cgroup_tg(cgrp); > +static ssize_t cpu_rt_runtime_read(struct cgroup *cgrp, struct cftype *cft, > + struct file *file, > + char __user *buf, size_t nbytes, > + loff_t *ppos) > +{ > + char tmp[64]; > + long val = sched_group_rt_runtime(cgroup_tg(cgrp)); > + int len = sprintf(tmp, "%ld\n", val); > > - return (u64) tg->rt_ratio; > + return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); > } What's the reason that you can't use the cgroup read_uint/write_uint methods for this? Is it just because you have -1 as your "unlimited" value. If so, could we avoid that problem by using 0 rather than -1 as the "unlimited" value? It looks from what I've read in the Documentation changes as though 0 isn't really a meaningful value. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] Cpusets API: Update cpusets to use cgroup structured file API
On Sat, Feb 23, 2008 at 12:06 AM, Andrew Morton <[EMAIL PROTECTED]> wrote: > > It is unclear to me what the relationship is between this and your other > cgroup pseudo-fs changes, but as this is fiddling with a userspace > interface we should get a wiggle on - we don't want to let things like this > slip out to 2.6.26. > > So.. please resend everything? > Yes, will do. But (given that no-one seems to like the proposed cgroup.api file) these patches don't actually change the userspace API at all - it's purely internal plumbing improvements. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cgroup: fix sparse warning of shadow symbol in cgroup.c
On Sat, Feb 23, 2008 at 4:33 AM, Paul Jackson <[EMAIL PROTECTED]> wrote: > From: Paul Jackson <[EMAIL PROTECTED]> > > Fix a code warning: symbol 'p' shadows an earlier one > > This is a reincarnation of Harvey Harrison's patch: > cpuset: sparse warnings in cpuset.c > > Independently, Cliff Wickman moved the affected code, > from kernel/cpuset.c to kernel/cgroup.c, in his patch: > cpusets: update_cpumask revision > > Signed-off-by: Paul Jackson <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > Cc: Harvey Harrison <[EMAIL PROTECTED]> > Cc: Cliff Wickman <[EMAIL PROTECTED]> > > --- > kernel/cgroup.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > --- 2.6.25-rc2-mm1.orig/kernel/cgroup.c 2008-02-16 01:04:48.0 -0800 > +++ 2.6.25-rc2-mm1/kernel/cgroup.c 2008-02-23 04:19:44.006614677 -0800 > @@ -1897,14 +1897,14 @@ int cgroup_scan_tasks(struct cgroup_scan > > if (heap->size) { > for (i = 0; i < heap->size; i++) { > - struct task_struct *p = heap->ptrs[i]; > + struct task_struct *q = heap->ptrs[i]; > if (i == 0) { > - latest_time = p->start_time; > - latest_task = p; > + latest_time = q->start_time; > + latest_task = q; > } > /* Process the task per the caller's callback */ > - scan->process_task(p, scan); > - put_task_struct(p); > + scan->process_task(q, scan); > + put_task_struct(q); > } > /* > * If we had to process any tasks at all, scan again > > -- > I won't rest till it's the best ... > Programmer, Linux Scalability > Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] cgroup map files: Add cgroup map data type
On Sat, Feb 23, 2008 at 12:04 AM, Andrew Morton <[EMAIL PROTECTED]> wrote: > > +static int cgroup_map_add(struct cgroup_map_cb *cb, const char *key, u64 > value) > > +{ > > + struct seq_file *sf = cb->state; > > + return seq_printf(sf, "%s %llu\n", key, value); > > +} > > We don't know what type the architecture uses to implement u64. This will > warn on powerpc, sparc64, maybe others. OK, I'll add an (unsigned long long) cast. > > > > +static int cgroup_seqfile_show(struct seq_file *m, void *arg) > > +{ > > + struct cgroup_seqfile_state *state = m->private; > > + struct cftype *cft = state->cft; > > + if (cft->read_map) { > > + struct cgroup_map_cb cb = { > > + .fill = cgroup_map_add, > > + .state = m, > > + }; > > + return cft->read_map(state->cgroup, cft, &cb); > > + } else { > > + BUG(); > > That's not really needed. Just call cft->read_map unconditionally. if > it's zero we'll get a null-pointer deref which will have just the same > effect as a BUG. OK. The long-term plan is to have other kinds of files also handled by this function, so eventually it would look something like: if (cft->read_map) { ... } else if (cft->read_something_else) { ... } ... } else { BUG(); } But I guess I can save that for the future. > > static int cgroup_file_open(struct inode *inode, struct file *file) > > { > > int err; > > @@ -1499,7 +1539,18 @@ static int cgroup_file_open(struct inode > > cft = __d_cft(file->f_dentry); > > if (!cft) > > return -ENODEV; > > - if (cft->open) > > + if (cft->read_map) { > > But above a NULL value is illegal. Why are we testing it here? > > The existence of cft->read_map causes us to open a seq_file. Otherwise we do nothing special and carry on down the normal open path. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tiny cpusets -- cpusets for small systems?
On Sat, Feb 23, 2008 at 4:09 AM, Paul Jackson <[EMAIL PROTECTED]> wrote: > A couple of proposals have been made recently by people working Linux > on smaller systems, for improving realtime isolation and memory > pressure handling: > > (1) cpu isolation for hard(er) realtime > http://lkml.org/lkml/2008/2/21/517 > Max Krasnyanskiy <[EMAIL PROTECTED]> > [PATCH sched-devel 0/7] CPU isolation extensions > > (2) notify user space of tight memory > http://lkml.org/lkml/2008/2/9/144 > KOSAKI Motohiro <[EMAIL PROTECTED]> > [PATCH 0/8][for -mm] mem_notify v6 > > In both cases, some of us have responded "why not use cpusets", and the > original submitters have replied "cpusets are too fat" (well, they > were more diplomatic than that, but I guess I can say that ;) Having read those threads, it looks to me as though: - the parts of Max's problem that would be solved by cpusets can be mostly accomplished just via sched_setaffinity() - Motohiro wants to add a new system-wide API that you would also like to have available on a per-cpuset basis. (Why not just add two access points for the same feature?) I'm don't think that either of these would be enough to justify big changes to cpusets or cgroups, although eliminating bloat is always a good thing. > The primary semantic limit I'd suggest would be supporting exactly > one layer depth of cpusets, not a full hierarchy. So one could still > successfully issue from user space 'mkdir /dev/cpuset/foo', but trying > to do 'mkdir /dev/cpuset/foo/bar' would fail. This reminds me of > very early FAT file systems, which had just a single, fixed size > root directory ;). There might even be a configurable fixed upper > limit on how many /dev/cpuset/* directories were allowed, further > simplifying the locking and dynamic memory behavior of this apparatus. I'm not sure that either of these would make much difference to the overall footprint. A single layer of cpusets would allow you to simplify validate_change() but not much else. I don't see how a fixed upper limit on the number of cpusets makes the locking sufficiently simpler to save much code. > > How this extends to cgroups I don't know; for now I suspect that most > cgroup module development is motivated by the needs of larger systems, > not smaller systems. However, cpusets is now a module client of > cgroups, and it is cgroups that now provides cpusets with its interface > to the vfs infrastructure. It would seem unfortunate if this relation > was not continued with tiny cpusets. Perhaps someone can imagine a tiny > cgroups? This might be the most difficult part of this proposal. If we wanted to go this way, I can imagine a cgroups config option that forces just a single hierarchy, which would allow a bunch of simplifications that would save plenty of text. > > Looking at some IA64 sn2 config builds I have laying about, I see the > following text sizes for a couple of versions, showing the growth of > the cpuset/cgroup apparatus over time: > > 25933 2.6.18-rc3-mm1/kernel/cpuset.o (Aug 2006) > vs. > 37823 2.6.25-rc2-mm1/kernel/cgroup.o (Feb 2008) > 19558 2.6.25-rc2-mm1/kernel/cpuset.o > > So the total has grown from 25933 to 57381 text bytes (note that > this is IA64 arch; most arch's will have proportionately smaller > text sizes.) On x86_64 they're: cgroup.o: 17348 cpuset.o: 8533 Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] ResCounter: Use read_uint in memory controller
On Thu, Feb 21, 2008 at 8:29 PM, Balbir Singh <[EMAIL PROTECTED]> wrote: > > Looks good, except for the name uint(), can we make it u64(). Integers are 32 > bit on both ILP32 and LP64, but we really read/write 64 bit values. Yes, that's true. But read_uint() is more consistent with all the other instances in cgroups and subsystems. So if we were to call it res_counter_read_u64() I'd also want to rename all the other *read_uint functions/fields to *read_u64 too. Can I do that in a separate patch? Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/7] cgroup: fix comments
On Wed, Feb 20, 2008 at 6:14 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > Paul Menage wrote: > > I think that docbook-style function comments need /** at the start of > > the comment block. > > > > Yes, I didn't notice it. I revised the patch to fix it. > > > --- > > > fix: > - comments about need_forkexit_callback > - comments about release agent > - typo and comment style, etc. > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > --- > include/linux/cgroup.h |2 +- > kernel/cgroup.c| 142 > +++- > 2 files changed, 80 insertions(+), 64 deletions(-) > > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index ff9055f..2ebf7af 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -175,7 +175,7 @@ struct css_set { > * > * > * When reading/writing to a file: > - * - the cgroup to use in file->f_dentry->d_parent->d_fsdata > + * - the cgroup to use is file->f_dentry->d_parent->d_fsdata > * - the 'cftype' of the file is file->f_dentry->d_fsdata > */ > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 4766bb6..36066d8 100644 > > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -113,9 +113,9 @@ static int root_count; > #define dummytop (&rootnode.top_cgroup) > > /* This flag indicates whether tasks in the fork and exit paths should > - * take callback_mutex and check for fork/exit handlers to call. This > - * avoids us having to do extra work in the fork/exit path if none of the > - * subsystems need to be called. > + * check for fork/exit handlers to call. This avoids us having to do > + * extra work in the fork/exit path if none of the subsystems need to > + * be called. > */ > static int need_forkexit_callback; > > @@ -307,7 +307,6 @@ static inline void put_css_set_taskexit(struct css_set > *cg) > * template: location in which to build the desired set of subsystem > * state objects for the new cgroup group > */ > - > static struct css_set *find_existing_css_set( > struct css_set *oldcg, > struct cgroup *cgrp, > @@ -354,7 +353,6 @@ static struct css_set *find_existing_css_set( > * and chains them on tmp through their cgrp_link_list fields. Returns 0 on > * success or a negative error > */ > - > static int allocate_cg_links(int count, struct list_head *tmp) > { > struct cg_cgroup_link *link; > @@ -396,7 +394,6 @@ static void free_cg_links(struct list_head *tmp) > * substituted into the appropriate hierarchy. Must be called with > * cgroup_mutex held > */ > - > > static struct css_set *find_css_set( > struct css_set *oldcg, struct cgroup *cgrp) > { > @@ -507,8 +504,8 @@ static struct css_set *find_css_set( > > * critical pieces of code here. The exception occurs on cgroup_exit(), > * when a task in a notify_on_release cgroup exits. Then cgroup_mutex > * is taken, and if the cgroup count is zero, a usermode call made > - * to /sbin/cgroup_release_agent with the name of the cgroup (path > - * relative to the root of cgroup file system) as the argument. > + * to the release agent with the name of the cgroup (path relative to > + * the root of cgroup file system) as the argument. > * > * A cgroup can only be deleted if both its 'count' of using tasks > * is zero, and its list of 'children' cgroups is empty. Since all > @@ -521,7 +518,7 @@ static struct css_set *find_css_set( > > * > * The need for this exception arises from the action of > * cgroup_attach_task(), which overwrites one tasks cgroup pointer with > - * another. It does so using cgroup_mutexe, however there are > + * another. It does so using cgroup_mutex, however there are > * several performance critical places that need to reference > * task->cgroup without the expense of grabbing a system global > * mutex. Therefore except as noted below, when dereferencing or, as > @@ -537,7 +534,6 @@ static struct css_set *find_css_set( > * cgroup_lock - lock out any changes to cgroup structures > * > */ > - > void cgroup_lock(void) > { > mutex_lock(&cgroup_mutex); > @@ -548,7 +544,6 @@ void cgroup_lock(void) > * > * Undo the lock taken in a previous cgroup_lock() call. > */ > - > void cgroup_unlock(void) > { > mutex_unlock(&cgroup_mutex); > @@ -590,7 +585,6 @@ static struct inode *cgroup_new_inode(mode_t mode, > struct super_block *sb) > * Call subsys's pre_d
Re: [PATCH 7/7] cgroup: remove dead code in cgroup_get_rootdir()
2008/2/17 Li Zefan <[EMAIL PROTECTED]>: > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > --- > kernel/cgroup.c |1 - > 1 files changed, 0 insertions(+), 1 deletions(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 71cf961..879a056 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -926,7 +926,6 @@ static int cgroup_get_rootdir(struct super_block *sb) > if (!inode) > return -ENOMEM; > > - inode->i_op = &simple_dir_inode_operations; > inode->i_fop = &simple_dir_operations; > inode->i_op = &cgroup_dir_inode_operations; > /* directories start off with i_nlink == 2 (for "." entry) */ > -- > 1.5.4.rc3 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] cgroup map files: Add a key/value map file type to cgroups
On Feb 19, 2008 10:14 PM, YAMAMOTO Takashi <[EMAIL PROTECTED]> wrote: > > On Feb 19, 2008 9:48 PM, YAMAMOTO Takashi <[EMAIL PROTECTED]> wrote: > > > > > > it changes the format from "%s %lld" to "%s: %llu", right? > > > why? > > > > > > > The colon for consistency with maps in /proc. I think it also makes it > > slightly more readable. > > can you be a little more specific? > > i object against the colon because i want to use the same parser for > /proc/vmstat, which doesn't have colons. Ah. This /proc behaviour of having multiple formats for reporting the same kind of data (compare with /proc/meminfo, which does use colons) is the kind of thing that I want to avoid with cgroups. i.e. if two cgroup subsystems are both reporting the same kind of structured data, then they should both use the same output format. I guess since /proc has both styles, and memory.stat is the first file reporting key/value pairs in cgroups, you get to call the format. OK, I'll zap the colon. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] cgroup map files: Add a key/value map file type to cgroups
On Feb 19, 2008 9:48 PM, YAMAMOTO Takashi <[EMAIL PROTECTED]> wrote: > > it changes the format from "%s %lld" to "%s: %llu", right? > why? > The colon for consistency with maps in /proc. I think it also makes it slightly more readable. For %lld versus %llu - I think that cgroup resource APIs are much more likely to need to report unsigned rather than signed values. In the case of the memory.stat file, that's certainly the case. But I guess there's an argument to be made that nothing's likely to need the final 64th bit of an unsigned value, whereas the ability to report negative numbers could potentially be useful for some cgroups. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 1/7] CGroup API: Add cgroup.api control file
On Feb 19, 2008 9:17 PM, Paul Jackson <[EMAIL PROTECTED]> wrote: > > Perhaps my primary concern with these *.api files was that I did not > understand who or what the critical use or user was; who found this > essential, not just nice to have. > Right now, no-one would find it essential. If/when a binary API is added, I guess I'll ressurrect this part of the patchset. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] Cpusets API: Update cpusets to use cgroup structured file API
Many of the cpusets control files are simple integer values, which don't require the overhead of memory allocations for reads and writes. Move the handlers for these control files into cpuset_read_uint() and cpuset_write_uint(). This also has the advantage that the control files show up as "u64" rather than "string" in the cgroup.api file. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/cpuset.c | 156 +--- 1 file changed, 82 insertions(+), 74 deletions(-) Index: cpusets-2.6.25-rc2-mm1/kernel/cpuset.c === --- cpusets-2.6.25-rc2-mm1.orig/kernel/cpuset.c +++ cpusets-2.6.25-rc2-mm1/kernel/cpuset.c @@ -999,19 +999,6 @@ int current_cpuset_is_being_rebound(void } /* - * Call with cgroup_mutex held. - */ - -static int update_memory_pressure_enabled(struct cpuset *cs, char *buf) -{ - if (simple_strtoul(buf, NULL, 10) != 0) - cpuset_memory_pressure_enabled = 1; - else - cpuset_memory_pressure_enabled = 0; - return 0; -} - -/* * update_flag - read a 0 or a 1 in a file and update associated flag * bit:the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE, * CS_SCHED_LOAD_BALANCE, @@ -1023,15 +1010,13 @@ static int update_memory_pressure_enable * Call with cgroup_mutex held. */ -static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf) +static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, + int turning_on) { - int turning_on; struct cpuset trialcs; int err; int cpus_nonempty, balance_flag_changed; - turning_on = (simple_strtoul(buf, NULL, 10) != 0); - trialcs = *cs; if (turning_on) set_bit(bit, &trialcs.flags); @@ -1247,43 +1232,65 @@ static ssize_t cpuset_common_file_write( case FILE_MEMLIST: retval = update_nodemask(cs, buffer); break; + default: + retval = -EINVAL; + goto out2; + } + + if (retval == 0) + retval = nbytes; +out2: + cgroup_unlock(); +out1: + kfree(buffer); + return retval; +} + +static int cpuset_write_uint(struct cgroup *cgrp, struct cftype *cft, u64 val) +{ + int retval = 0; + struct cpuset *cs = cgroup_cs(cgrp); + cpuset_filetype_t type = cft->private; + + cgroup_lock(); + + if (cgroup_is_removed(cgrp)) { + cgroup_unlock(); + return -ENODEV; + } + + switch (type) { case FILE_CPU_EXCLUSIVE: - retval = update_flag(CS_CPU_EXCLUSIVE, cs, buffer); + retval = update_flag(CS_CPU_EXCLUSIVE, cs, val); break; case FILE_MEM_EXCLUSIVE: - retval = update_flag(CS_MEM_EXCLUSIVE, cs, buffer); + retval = update_flag(CS_MEM_EXCLUSIVE, cs, val); break; case FILE_SCHED_LOAD_BALANCE: - retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, buffer); + retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val); break; case FILE_MEMORY_MIGRATE: - retval = update_flag(CS_MEMORY_MIGRATE, cs, buffer); + retval = update_flag(CS_MEMORY_MIGRATE, cs, val); break; case FILE_MEMORY_PRESSURE_ENABLED: - retval = update_memory_pressure_enabled(cs, buffer); + cpuset_memory_pressure_enabled = !!val; break; case FILE_MEMORY_PRESSURE: retval = -EACCES; break; case FILE_SPREAD_PAGE: - retval = update_flag(CS_SPREAD_PAGE, cs, buffer); + retval = update_flag(CS_SPREAD_PAGE, cs, val); cs->mems_generation = cpuset_mems_generation++; break; case FILE_SPREAD_SLAB: - retval = update_flag(CS_SPREAD_SLAB, cs, buffer); + retval = update_flag(CS_SPREAD_SLAB, cs, val); cs->mems_generation = cpuset_mems_generation++; break; default: retval = -EINVAL; - goto out2; + break; } - - if (retval == 0) - retval = nbytes; -out2: cgroup_unlock(); -out1: - kfree(buffer); return retval; } @@ -1345,30 +1352,6 @@ static ssize_t cpuset_common_file_read(s case FILE_MEMLIST: s += cpuset_sprintf_memlist(s, cs); break; - case FILE_CPU_EXCLUSIVE: - *s++ = is_cpu_exclusive(cs) ? '1' : '0'; - break; - case FILE_MEM_EXCLUSIVE: - *s++ = is_mem_exclusive(cs) ? '1' : '0'; - break; - case FILE_SCHED_LOAD_BALANCE: -
[PATCH 1/2] Cpusets API: From: Paul Jackson <[EMAIL PROTECTED]>
Strip all trailing whitespace in cgroup_write_uint This removes the need for people to remember to pass the -n flag to echo when writing values to cgroup control files. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/cgroup.c |5 + 1 file changed, 1 insertion(+), 4 deletions(-) Index: cpusets-2.6.25-rc2-mm1/kernel/cgroup.c === --- cpusets-2.6.25-rc2-mm1.orig/kernel/cgroup.c +++ cpusets-2.6.25-rc2-mm1/kernel/cgroup.c @@ -1321,10 +1321,7 @@ static ssize_t cgroup_write_uint(struct return -EFAULT; buffer[nbytes] = 0; /* nul-terminate */ - - /* strip newline if necessary */ - if (nbytes && (buffer[nbytes-1] == '\n')) - buffer[nbytes-1] = 0; + strstrip(buffer); val = simple_strtoull(buffer, &end, 0); if (*end) return -EINVAL; -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] Cpusets API: Update Cpusets control files
This pair of patches simplifies the cpusets read/write path for the control files that consist of simple integers. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Improve init/Kconfig help descriptions [PATCH 6/9]
On Feb 19, 2008 6:54 PM, Nick Andrew <[EMAIL PROTECTED]> wrote: > > config CGROUPS > bool "Control Group support" > help > Control Groups enables processes to be tracked and grouped > into "cgroups". This enables you, for example, to associate > cgroups with certain CPU sets using "cpusets". > > When enabled, a new filesystem type "cgroup" is available > and can be mounted to control cpusets and other > resource/behaviour controllers. > > See for more information. > > If unsure, say N. > > > I don't think that description is as clear as it could be. From > the non-kernel-developer point of view, that is. Originally this wasn't a user-selectable config value, it was auto-selected by any subsystem that needed it. I think that was nicer from the user-experience, and it would eliminate the need for this documentation but there were concerns that this triggered unspecified brokenness in the Kbuild system. > > Re "other resource/behaviour controllers", what in particular? > I take it that our current controllers are cpusets, scheduler, > CPU accounting and Resource counters? Resource counters aren't a resource controller, they're a helper library. The others are good examples, as is the memory controller that's just been added to 2.6.25. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/7] cgroup: fix and update documentation
On Feb 18, 2008 12:39 AM, Li Zefan <[EMAIL PROTECTED]> wrote: > Misc fixes and updates, make the doc consistent with current > cgroup implementation. > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> Thanks for these cleanups. Paul > --- > Documentation/cgroups.txt | 66 ++-- > 1 files changed, 33 insertions(+), 33 deletions(-) > > diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt > index 42d7c4c..31d12e2 100644 > > --- a/Documentation/cgroups.txt > +++ b/Documentation/cgroups.txt > @@ -28,7 +28,7 @@ CONTENTS: > 4. Questions > > 1. Control Groups > -== > += > > 1.1 What are cgroups ? > -- > @@ -143,10 +143,10 @@ proliferation of such cgroups. > > Also lets say that the administrator would like to give enhanced network > access temporarily to a student's browser (since it is night and the user > -wants to do online gaming :) OR give one of the students simulation > +wants to do online gaming :)) OR give one of the students simulation > apps enhanced CPU power, > > -With ability to write pids directly to resource classes, its just a > +With ability to write pids directly to resource classes, it's just a > matter of : > > # echo pid > /mnt/network//tasks > @@ -227,10 +227,13 @@ Each cgroup is represented by a directory in the cgroup > file system > containing the following files describing that cgroup: > > - tasks: list of tasks (by pid) attached to that cgroup > - - notify_on_release flag: run /sbin/cgroup_release_agent on exit? > + - releasable flag: cgroup currently removeable? > + - notify_on_release flag: run the release agent on exit? > + - release_agent: the path to use for release notifications (this file > + exists in the top cgroup only) > > Other subsystems such as cpusets may add additional files in each > -cgroup dir > +cgroup dir. > > New cgroups are created using the mkdir system call or shell > command. The properties of a cgroup, such as its flags, are > @@ -257,7 +260,7 @@ performance. > To allow access from a cgroup to the css_sets (and hence tasks) > that comprise it, a set of cg_cgroup_link objects form a lattice; > each cg_cgroup_link is linked into a list of cg_cgroup_links for > -a single cgroup on its cont_link_list field, and a list of > +a single cgroup on its cgrp_link_list field, and a list of > cg_cgroup_links for a single css_set on its cg_link_list. > > Thus the set of tasks in a cgroup can be listed by iterating over > @@ -271,9 +274,6 @@ for cgroups, with a minimum of additional kernel code. > 1.4 What does notify_on_release do ? > > > -*** notify_on_release is disabled in the current patch set. It will be > -*** reactivated in a future patch in a less-intrusive manner > - > If the notify_on_release flag is enabled (1) in a cgroup, then > whenever the last task in the cgroup leaves (exits or attaches to > some other cgroup) and the last child cgroup of that cgroup > @@ -360,8 +360,8 @@ Now you want to do something with this cgroup. > > In this directory you can find several files: > # ls > -notify_on_release release_agent tasks > -(plus whatever files are added by the attached subsystems) > +notify_on_release releasable tasks > +(plus whatever files added by the attached subsystems) > > Now attach your shell to this cgroup: > # /bin/echo $$ > tasks > @@ -404,19 +404,13 @@ with a subsystem id which will be assigned by the > cgroup system. > Other fields in the cgroup_subsys object include: > > - subsys_id: a unique array index for the subsystem, indicating which > - entry in cgroup->subsys[] this subsystem should be > - managing. Initialized by cgroup_register_subsys(); prior to this > - it should be initialized to -1 > + entry in cgroup->subsys[] this subsystem should be managing. > > -- hierarchy: an index indicating which hierarchy, if any, this > - subsystem is currently attached to. If this is -1, then the > - subsystem is not attached to any hierarchy, and all tasks should be > - considered to be members of the subsystem's top_cgroup. It should > - be initialized to -1. > +- name: should be initialized to a unique subsystem name. Should be > + no longer than MAX_CGROUP_TYPE_NAMELEN. > > -- name: should be initialized to a unique subsystem name prior to > - calling cgroup_register_subsystem. Should be no longer than > - MAX_CGROUP_TYPE_NAMELEN > +- early_init: indicate if the subsystem needs early initialization > + at system boot. > > Each cgroup object created by the sy
Re: [PATCH 4/7] cgroup: fix memory leak in cgroup_get_sb()
On Feb 17, 2008 9:49 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > opts.release_agent is not kfree()ed in all necessary places. > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> Good catch, although hopefully something that would be extremely rare in practice. Thanks, Paul > --- > kernel/cgroup.c |5 - > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 0c35022..aa76bbd 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -961,8 +961,11 @@ static int cgroup_get_sb(struct file_system_type > *fs_type, > } > > root = kzalloc(sizeof(*root), GFP_KERNEL); > - if (!root) > + if (!root) { > + if (opts.release_agent) > + kfree(opts.release_agent); > return -ENOMEM; > + } > > init_cgroup_root(root); > root->subsys_bits = opts.subsys_bits; > -- > 1.5.4.rc3 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 6/7] cgroup: remove duplicate code in find_css_set()
On Feb 17, 2008 9:49 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > The list head res->tasks gets initialized twice in find_css_set(). > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > --- > kernel/cgroup.c |1 - > 1 files changed, 0 insertions(+), 1 deletions(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index e8c8e58..71cf961 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -473,7 +473,6 @@ static struct css_set *find_css_set( > /* Link this cgroup group into the list */ > list_add(&res->list, &init_css_set.list); > css_set_count++; > - INIT_LIST_HEAD(&res->tasks); > write_unlock(&css_set_lock); > > return res; > -- > 1.5.4.rc3 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/7] cgroup: fix comments
On Feb 17, 2008 9:49 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > fix: > - comments about need_forkexit_callback > - comments about release agent > - typo and comment style, etc. > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> > --- > include/linux/cgroup.h |2 +- > kernel/cgroup.c| 44 +--- > 2 files changed, 22 insertions(+), 24 deletions(-) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index ff9055f..2ebf7af 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -175,7 +175,7 @@ struct css_set { > * > * > * When reading/writing to a file: > - * - the cgroup to use in file->f_dentry->d_parent->d_fsdata > + * - the cgroup to use is file->f_dentry->d_parent->d_fsdata > * - the 'cftype' of the file is file->f_dentry->d_fsdata > */ > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 4766bb6..0c35022 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -113,9 +113,9 @@ static int root_count; > #define dummytop (&rootnode.top_cgroup) > > /* This flag indicates whether tasks in the fork and exit paths should > - * take callback_mutex and check for fork/exit handlers to call. This > - * avoids us having to do extra work in the fork/exit path if none of the > - * subsystems need to be called. > + * check for fork/exit handlers to call. This avoids us having to do > + * extra work in the fork/exit path if none of the subsystems need to > + * be called. > */ > static int need_forkexit_callback; > > @@ -507,8 +507,8 @@ static struct css_set *find_css_set( > * critical pieces of code here. The exception occurs on cgroup_exit(), > * when a task in a notify_on_release cgroup exits. Then cgroup_mutex > * is taken, and if the cgroup count is zero, a usermode call made > - * to /sbin/cgroup_release_agent with the name of the cgroup (path > - * relative to the root of cgroup file system) as the argument. > + * to the release agent with the name of the cgroup (path relative to > + * the root of cgroup file system) as the argument. > * > * A cgroup can only be deleted if both its 'count' of using tasks > * is zero, and its list of 'children' cgroups is empty. Since all > @@ -521,7 +521,7 @@ static struct css_set *find_css_set( > * > * The need for this exception arises from the action of > * cgroup_attach_task(), which overwrites one tasks cgroup pointer with > - * another. It does so using cgroup_mutexe, however there are > + * another. It does so using cgroup_mutex, however there are > * several performance critical places that need to reference > * task->cgroup without the expense of grabbing a system global > * mutex. Therefore except as noted below, when dereferencing or, as > @@ -1192,7 +1192,7 @@ static void get_first_subsys(const struct cgroup *cgrp, > * Attach task 'tsk' to cgroup 'cgrp' > * > * Call holding cgroup_mutex. May take task_lock of > - * the task 'pid' during call. > + * the task 'tsk' during call. > */ > int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > { > @@ -1584,12 +1584,11 @@ static int cgroup_create_file(struct dentry *dentry, > int mode, > } > > /* I think that docbook-style function comments need /** at the start of the comment block. Thanks, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/7] cgroup: fix subsys bitops
On Feb 17, 2008 9:49 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > Cgroup uses unsigned long for subsys bitops, not unsigned long long. > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > --- > kernel/cgroup.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index aa76bbd..e8c8e58 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -320,7 +320,7 @@ static struct css_set *find_existing_css_set( > /* Built the set of subsystem state objects that we want to > * see in the new css_set */ > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > - if (root->subsys_bits & (1ull << i)) { > + if (root->subsys_bits & (1UL << i)) { > /* Subsystem is in this hierarchy. So we want > * the subsystem state from the new > * cgroup */ > @@ -696,7 +696,7 @@ static int rebind_subsystems(struct cgroupfs_root *root, > added_bits = final_bits & ~root->actual_subsys_bits; > /* Check that any added subsystems are currently free */ > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > - unsigned long long bit = 1ull << i; > + unsigned long bit = 1UL << i; > struct cgroup_subsys *ss = subsys[i]; > if (!(bit & added_bits)) > continue; > -- > 1.5.4.rc3 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/7] cgroup: clean up cgroup.h
On Feb 17, 2008 9:49 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > - replace old name 'cont' with 'cgrp' (Paul Menage did this cleanup for > cgroup.c in commit bd89aabc6761de1c35b154fe6f914a445d301510) > - remove a duplicate declaration of cgroup_path() > > Signed-off-by: Li Zefan <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > --- > include/linux/cgroup.h | 48 > +++- > 1 files changed, 23 insertions(+), 25 deletions(-) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 2ebf7af..028ba3b 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -186,15 +186,15 @@ struct cftype { > char name[MAX_CFTYPE_NAME]; > int private; > int (*open) (struct inode *inode, struct file *file); > - ssize_t (*read) (struct cgroup *cont, struct cftype *cft, > + ssize_t (*read) (struct cgroup *cgrp, struct cftype *cft, > struct file *file, > char __user *buf, size_t nbytes, loff_t *ppos); > /* > * read_uint() is a shortcut for the common case of returning a > * single integer. Use it in place of read() > */ > - u64 (*read_uint) (struct cgroup *cont, struct cftype *cft); > - ssize_t (*write) (struct cgroup *cont, struct cftype *cft, > + u64 (*read_uint) (struct cgroup *cgrp, struct cftype *cft); > + ssize_t (*write) (struct cgroup *cgrp, struct cftype *cft, > struct file *file, > const char __user *buf, size_t nbytes, loff_t > *ppos); > > @@ -203,7 +203,7 @@ struct cftype { > * a single integer (as parsed by simple_strtoull) from > * userspace. Use in place of write(); return 0 or error. > */ > - int (*write_uint) (struct cgroup *cont, struct cftype *cft, u64 val); > + int (*write_uint) (struct cgroup *cgrp, struct cftype *cft, u64 val); > > int (*release) (struct inode *inode, struct file *file); > }; > @@ -218,41 +218,41 @@ struct cgroup_scanner { > > /* Add a new file to the given cgroup directory. Should only be > * called by subsystems from within a populate() method */ > -int cgroup_add_file(struct cgroup *cont, struct cgroup_subsys *subsys, > +int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys, >const struct cftype *cft); > > /* Add a set of new files to the given cgroup directory. Should > * only be called by subsystems from within a populate() method */ > -int cgroup_add_files(struct cgroup *cont, > +int cgroup_add_files(struct cgroup *cgrp, > struct cgroup_subsys *subsys, > const struct cftype cft[], > int count); > > -int cgroup_is_removed(const struct cgroup *cont); > +int cgroup_is_removed(const struct cgroup *cgrp); > > -int cgroup_path(const struct cgroup *cont, char *buf, int buflen); > +int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen); > > -int cgroup_task_count(const struct cgroup *cont); > +int cgroup_task_count(const struct cgroup *cgrp); > > /* Return true if the cgroup is a descendant of the current cgroup */ > -int cgroup_is_descendant(const struct cgroup *cont); > +int cgroup_is_descendant(const struct cgroup *cgrp); > > /* Control Group subsystem type. See Documentation/cgroups.txt for details */ > > struct cgroup_subsys { > struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss, > - struct cgroup *cont); > - void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cont); > - void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cont); > + struct cgroup *cgrp); > + void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); > + void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); > int (*can_attach)(struct cgroup_subsys *ss, > - struct cgroup *cont, struct task_struct *tsk); > - void (*attach)(struct cgroup_subsys *ss, struct cgroup *cont, > - struct cgroup *old_cont, struct task_struct *tsk); > + struct cgroup *cgrp, struct task_struct *tsk); > + void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > + struct cgroup *old_cgrp, struct task_struct *tsk); > void (*fork)(struct cgroup_subsys *ss, struct task_struct *task); > void (*exit)(struct cgroup_subsys *ss, struct task_struct *task); > int (*populate)(struc
Re: [RFC][PATCH 1/7] CGroup API: Add cgroup.api control file
On Feb 18, 2008 1:45 AM, Li Zefan <[EMAIL PROTECTED]> wrote: > > > > But we don't have /proc/proc.api or /sys/sysfs.api ... True. And /proc is a bit of a mess. Having a similar API file for sysfs sounds like a good idea to me. > > And is it better to describe the debug subsystem too? > Yes, probably, but that would be a separate patch to the debug subsystem itself, not the main cgroups code. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 1/7] CGroup API: Add cgroup.api control file
On Feb 19, 2008 1:57 PM, Paul Jackson <[EMAIL PROTECTED]> wrote: > > Finally, it goes against the one thingie per file (at most, one scalar > vector) that has worked well for us when tried. Right, I like the idea of keeping things simple. But if you're going to accept that a vector is useful, then it seems reasonable that some other *simple* structured datatypes can be useful. An N-element key/value map (a la /proc/meminfo) is, I think, nicer than having to read values from N separate files. > > As to the motivations Paul M gives: > 1) Avoid "an arbitrary mess of ad-hoc APIs": > We can still do that, whether or not we "self-document" these > API's in this manner. We can, but this file makes it more clear what control files have a well-defined API and which are just returning some ad-hoc string. I guess it's not essential, I just figured that if we had that information, it made sense to make it available to userspace. I guess I'm happy with dropping the actual exposed cgroup.api file for now as long as we can work towards reducing the number of control files that just return strings, and make use of the structured output such as read_uint() miore. > 2) binary APIs versus ASCII APIs: > Well, I have an ASCII API bias, not surprising. But I'd > suggest not doing things "in anticipation" of some future > fuzzy binary API support. Wait until that day actually arrives. I have a reasonably clear idea of how we can do the binary API. That's mostly for a separate RFC. But for example, reading a map via the binary API would be able to just return a list values since the keys could be parsed once from the ascii map (provided that the subsystem guaranteed that the map keys and their order wouldn't change between reboots). > 3) The memory controller currently has files with the "_in_bytes": > The traditional way to handle this is Documentation and man > pages; good enough for my granddad, good enough for me ;). I've tried submitting patches to remove the in_bytes suffix and just rely on the documentation, and people didn't seem to like it ... Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Improve init/Kconfig help descriptions [PATCH 6/9]
On Feb 19, 2008 7:12 AM, Nick Andrew <[EMAIL PROTECTED]> wrote: > config CGROUPS > bool "Control Group support" > help > - This option will let you use process cgroup subsystems > - such as Cpusets > + Control Groups enables processes to be tracked and grouped > + into "cgroups". This enables you, for example, to associate > + cgroups with certain CPU sets using "cpusets". > > - Say N if unsure. > + When enabled, a new filesystem type "cgroup" is available > + and can be mounted to control cpusets. How about: ... cpusets and other resource/behaviour controllers. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 7/7] CGroup API: Update cpusets to use cgroup structured file API
On Feb 17, 2008 9:28 AM, Paul Jackson <[EMAIL PROTECTED]> wrote: > > I'm figuring it would be easiest if you just threw this > little change into your hopper for the bigger changes > you're making OK, will do. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 7/7] CGroup API: Update cpusets to use cgroup structured file API
On Feb 16, 2008 7:29 PM, Paul Jackson <[EMAIL PROTECTED]> wrote: > > From: Paul Jackson <[EMAIL PROTECTED]> > > Strip all trailing whitespace (such as carriage returns) > when parsing integer writes to cgroup files, not just > one trailing newline if present. Sounds like a good idea to me. Thanks for this. > > Signed-off-by: Paul Jackson <[EMAIL PROTECTED]> > Cc: Paul Menage <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > > --- > kernel/cgroup.c |5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > --- 2.6.24-mm1.orig/kernel/cgroup.c 2008-02-16 04:20:33.0 -0800 > +++ 2.6.24-mm1/kernel/cgroup.c 2008-02-16 19:00:41.207478218 -0800 > @@ -1321,10 +1321,7 @@ static ssize_t cgroup_write_uint(struct > return -EFAULT; > > buffer[nbytes] = 0; /* nul-terminate */ > - > - /* strip newline if necessary */ > - if (nbytes && (buffer[nbytes-1] == '\n')) > - buffer[nbytes-1] = 0; > + strstrip(buffer); /* strip -just- trailing whitespace */ > val = simple_strtoull(buffer, &end, 0); > if (*end) > return -EINVAL; > > > -- > I won't rest till it's the best ... > Programmer, Linux Scalability > Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 1/7] CGroup API: Add cgroup.api control file
On Feb 16, 2008 2:07 AM, Balbir Singh <[EMAIL PROTECTED]> wrote: > Paul Menage wrote: > > Hi, Paul, > > Do we need to use a cgroup.api file? Why not keep up to date documentation and > get users to use that. I fear that, cgroup.api will not be kept up-to-date, > leading to confusion. The cgroup.api file isn't meant to give complete documentation for a control file, simply a brief indication of its usage. The aim is that most bits of the information reported in cgroup.api are auto-generated, so there shouldn't be problems with it getting out-of-date. Is it just the space used by the documentation string that you're objecting to? The other function of the file is to declare a type for each variable. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/7] CGroup API: More structured API for CGroups control files
On Feb 16, 2008 1:31 AM, Li Zefan <[EMAIL PROTECTED]> wrote: > > I don't quite catch what you mean. Cgoup does support write-only/read-only > files. For a write-only file, just set .write and .write_uint to be NULL, > similar for a read-only file. > > Do I miss something? > I suppose we could infer from the lack of any write handlers that we should give the file in the filesystem a mode of 444 rather 644. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 6/7] CGroup API: Use descriptions for memory controller API files
This patch adds descriptions to the memory controller API files to indicate that the usage/limit are in bytes; the names of the control files can then be simplified to usage/limit. Also removes the unnecessary mem_force_empty_read() function Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- mm/memcontrol.c | 21 + 1 file changed, 5 insertions(+), 16 deletions(-) Index: cgroupmap-2.6.24-mm1/mm/memcontrol.c === --- cgroupmap-2.6.24-mm1.orig/mm/memcontrol.c +++ cgroupmap-2.6.24-mm1/mm/memcontrol.c @@ -950,19 +950,6 @@ static ssize_t mem_force_empty_write(str return ret; } -/* - * Note: This should be removed if cgroup supports write-only file. - */ - -static ssize_t mem_force_empty_read(struct cgroup *cont, - struct cftype *cft, - struct file *file, char __user *userbuf, - size_t nbytes, loff_t *ppos) -{ - return -EINVAL; -} - - static const struct mem_cgroup_stat_desc { const char *msg; u64 unit; @@ -1001,15 +988,17 @@ static int mem_control_stat_show(struct static struct cftype mem_cgroup_files[] = { { - .name = "usage_in_bytes", + .name = "usage", .private = RES_USAGE, .read_uint = mem_cgroup_read, + .desc = "Memory usage in bytes", }, { - .name = "limit_in_bytes", + .name = "limit", .private = RES_LIMIT, .write = mem_cgroup_write, .read_uint = mem_cgroup_read, + .desc = "Memory limit in bytes", }, { .name = "failcnt", @@ -1019,7 +1008,7 @@ static struct cftype mem_cgroup_files[] { .name = "force_empty", .write = mem_force_empty_write, - .read = mem_force_empty_read, + .desc = "Write to this file to forget all memory charges" }, { .name = "stat", -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 5/7] CGroup API: Use read_uint in memory controller
Update the memory controller to use read_uint for its limit/usage/failcnt control files, calling the new res_counter_read_uint() function. This allows the files to show up as u64 rather than string in the cgroup.api file. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- mm/memcontrol.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) Index: cgroupmap-2.6.24-mm1/mm/memcontrol.c === --- cgroupmap-2.6.24-mm1.orig/mm/memcontrol.c +++ cgroupmap-2.6.24-mm1/mm/memcontrol.c @@ -922,13 +922,10 @@ int mem_cgroup_write_strategy(char *buf, return 0; } -static ssize_t mem_cgroup_read(struct cgroup *cont, - struct cftype *cft, struct file *file, - char __user *userbuf, size_t nbytes, loff_t *ppos) +static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft) { - return res_counter_read(&mem_cgroup_from_cont(cont)->res, - cft->private, userbuf, nbytes, ppos, - NULL); + return res_counter_read_uint(&mem_cgroup_from_cont(cont)->res, +cft->private); } static ssize_t mem_cgroup_write(struct cgroup *cont, struct cftype *cft, @@ -1006,18 +1003,18 @@ static struct cftype mem_cgroup_files[] { .name = "usage_in_bytes", .private = RES_USAGE, - .read = mem_cgroup_read, + .read_uint = mem_cgroup_read, }, { .name = "limit_in_bytes", .private = RES_LIMIT, .write = mem_cgroup_write, - .read = mem_cgroup_read, + .read_uint = mem_cgroup_read, }, { .name = "failcnt", .private = RES_FAILCNT, - .read = mem_cgroup_read, + .read_uint = mem_cgroup_read, }, { .name = "force_empty", -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 1/7] CGroup API: Add cgroup.api control file
Add a cgroup.api control file in every cgroup directory. This reports for each control file the type of data represented by that control file, and a user-friendly description of the contents. A secondary effect of this patch is to add the "cgroup." prefix in front of all cgroup-provided control files. This will reduce the chance of future control files clashing with user-provided names. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- include/linux/cgroup.h | 21 +++ kernel/cgroup.c| 133 ++--- 2 files changed, 148 insertions(+), 6 deletions(-) Index: cgroupmap-2.6.24-mm1/include/linux/cgroup.h === --- cgroupmap-2.6.24-mm1.orig/include/linux/cgroup.h +++ cgroupmap-2.6.24-mm1/include/linux/cgroup.h @@ -179,12 +179,33 @@ struct css_set { * - the 'cftype' of the file is file->f_dentry->d_fsdata */ +/* + * The various types of control file that are reported in the + * cgroup.api file. "String" is a catch-all default, but should only + * be used for special cases. If you use the appropriate accessors + * (such as "read_uint") in your control file, then you can leave this + * as 0 (CGROUP_FILE_UNKNOWN) and let cgroup figure out the right type. + */ +enum cgroup_file_type { + CGROUP_FILE_UNKNOWN = 0, + CGROUP_FILE_VOID, + CGROUP_FILE_U64, + CGROUP_FILE_STRING, +}; + #define MAX_CFTYPE_NAME 64 struct cftype { /* By convention, the name should begin with the name of the * subsystem, followed by a period */ char name[MAX_CFTYPE_NAME]; int private; + + /* The type of a file - reported in the cgroup.api file */ + enum cgroup_file_type type; + + /* Human-readable description of the file */ + const char *desc; + int (*open) (struct inode *inode, struct file *file); ssize_t (*read) (struct cgroup *cont, struct cftype *cft, struct file *file, Index: cgroupmap-2.6.24-mm1/kernel/cgroup.c === --- cgroupmap-2.6.24-mm1.orig/kernel/cgroup.c +++ cgroupmap-2.6.24-mm1/kernel/cgroup.c @@ -1301,6 +1301,7 @@ enum cgroup_filetype { FILE_NOTIFY_ON_RELEASE, FILE_RELEASABLE, FILE_RELEASE_AGENT, + FILE_API, }; static ssize_t cgroup_write_uint(struct cgroup *cgrp, struct cftype *cft, @@ -1611,17 +1612,21 @@ static int cgroup_create_dir(struct cgro } int cgroup_add_file(struct cgroup *cgrp, - struct cgroup_subsys *subsys, - const struct cftype *cft) + struct cgroup_subsys *subsys, + const struct cftype *cft) { struct dentry *dir = cgrp->dentry; struct dentry *dentry; int error; char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 }; - if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) { - strcpy(name, subsys->name); - strcat(name, "."); + if (!test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) { + if (subsys) { + strcpy(name, subsys->name); + strcat(name, "."); + } else { + strcpy(name, "cgroup."); + } } strcat(name, cft->name); BUG_ON(!mutex_is_locked(&dir->d_inode->i_mutex)); @@ -2126,6 +2131,110 @@ static u64 cgroup_read_releasable(struct return test_bit(CGRP_RELEASABLE, &cgrp->flags); } +static const struct file_operations cgroup_api_file_operations = { + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +/* + * cgroup.api is a file in each cgroup directory that gives the types + * and descriptions of the various control files in that directory. + */ + +static struct dentry *cgroup_api_advance(struct dentry *d, int advance) +{ + struct dentry *parent = d->d_parent; + struct list_head *l = &d->d_u.d_child; + while (true) { + if (advance) + l = l->next; + advance = true; + /* Did we reach the end of the directory? */ + if (l == &parent->d_subdirs) + return NULL; + d = container_of(l, struct dentry, d_u.d_child); + /* Skip cgroup subdirectories */ + if (d->d_inode && S_ISREG(d->d_inode->i_mode)) + return d; + } +} + +static void *cgroup_api_start(struct seq_file *sf, loff_t *pos) +{ + struct dentry *parent = sf->private; + struct dentry *d; + loff_t l = 0; + spin_lock(&dcache_lock); + if (list_empty(&parent->d_subdirs)) +
[RFC][PATCH 0/7] CGroup API: More structured API for CGroups control files
This set of patches makes the Control Groups API more structured and self-describing. 1) Allows control files to be associated with data types such as "u64", "string", "map", etc. These types show up in a new cgroup.api file in each cgroup directory, along with a user-readable string. Files that use cgroup-provided data accessors have these file types inferred automatically. 2) Moves various files in cpusets and the memory controller from using custom-written file handlers to cgroup-defined handlers 3) Adds the "cgroup." prefix for existing cgroup-provided control files (tasks, release_agent, releasable, notify_on_release). Given than we've already had 2.6.24 go out without this prefix, I guess this could be a little contentious - but it seems like a good move to prevent name clashes in the future. (Note that this doesn't affect mounting the legacy cpuset filesystem, since the compatibility layer disables all prefixes when mounted with filesystem type "cpuset"). If people object too strongly, we could just make this the case for *new* cgroup API files, but I think this is a case where consistency would be better than compatibility - I'd be surprised if anyone has written major legacy apps yet that rely on 2.6.24 cgroup control file names. There are various motivations for this: 1) We said at Kernel Summit '07 that the cgroup API wouldn't be allowed to spiral into an arbitrary mess of ad-hoc APIs. Having simple ways to represent common data types makes this easier. (E.g. one standard way to report a map of string,u64 pairs to userspace.) 2) People were divided on the issue of binary APIs versus ASCII APIs for control groups. Compatibility with the existing cpusets system, and ease of experimentation, were two important reasons for going with the current. ASCII API. But by having structured control files, we can open the path towards having more efficient binary APIs for simpler and more efficient programmatic access too, without any additional modifications required from the subsystems themselves. My plans for this potential binary API are a little hazy at this point, but they might go something like opening a cgroup.bin file in a cgroup directory, and writing the names of the control files that you were interested in; then a read on that file handle would return the contents of the given control files in a single read in a simple binary format. (Better suggestions are welcome). Regardless, getting a good typing/structure on the control files is an important first step if we want to go in that direction. 3) The memory controller currently has files with the "_in_bytes" suffix, on the grounds that otherwise it's not obvious to a new user what they represent. By moving the description to a auto-generated API file, we can remove this (IMO) inelegant suffix. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 2/7] CGroup API: Add cgroup map data type
Adds a new type of supported control file representation, a map from strings to u64 values. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- include/linux/cgroup.h | 19 +++ kernel/cgroup.c| 61 - 2 files changed, 79 insertions(+), 1 deletion(-) Index: cgroupmap-2.6.24-mm1/include/linux/cgroup.h === --- cgroupmap-2.6.24-mm1.orig/include/linux/cgroup.h +++ cgroupmap-2.6.24-mm1/include/linux/cgroup.h @@ -191,6 +191,17 @@ enum cgroup_file_type { CGROUP_FILE_VOID, CGROUP_FILE_U64, CGROUP_FILE_STRING, + CGROUP_FILE_MAP, +}; + +/* + * cgroup_map_cb is an abstract callback API for reporting map-valued + * control files + */ + +struct cgroup_map_cb { + int (*fill)(struct cgroup_map_cb *cb, const char *key, u64 value); + void *state; }; #define MAX_CFTYPE_NAME 64 @@ -215,6 +226,14 @@ struct cftype { * single integer. Use it in place of read() */ u64 (*read_uint) (struct cgroup *cont, struct cftype *cft); + /* +* read_map() is used for defining a map of key/value +* pairs. It should call cb->fill(cb, key, value) for each +* entry. +*/ + int (*read_map) (struct cgroup *cont, struct cftype *cft, +struct cgroup_map_cb *cb); + ssize_t (*write) (struct cgroup *cont, struct cftype *cft, struct file *file, const char __user *buf, size_t nbytes, loff_t *ppos); Index: cgroupmap-2.6.24-mm1/kernel/cgroup.c === --- cgroupmap-2.6.24-mm1.orig/kernel/cgroup.c +++ cgroupmap-2.6.24-mm1/kernel/cgroup.c @@ -1488,6 +1488,46 @@ static ssize_t cgroup_file_read(struct f return -EINVAL; } +/* + * seqfile ops/methods for returning structured data. Currently just + * supports string->u64 maps, but can be extended in future. + */ + +struct cgroup_seqfile_state { + struct cftype *cft; + struct cgroup *cgroup; +}; + +static int cgroup_map_add(struct cgroup_map_cb *cb, const char *key, u64 value) +{ + struct seq_file *sf = cb->state; + return seq_printf(sf, "%s: %llu\n", key, value); +} + +static int cgroup_seqfile_show(struct seq_file *m, void *arg) +{ + struct cgroup_seqfile_state *state = m->private; + struct cftype *cft = state->cft; + struct cgroup_map_cb cb = { + .fill = cgroup_map_add, + .state = m, + }; + if (cft->read_map) { + return cft->read_map(state->cgroup, cft, &cb); + } else { + BUG(); + } +} + +int cgroup_seqfile_release(struct inode *inode, struct file *file) +{ + struct seq_file *seq = file->private_data; + kfree(seq->private); + return single_release(inode, file); +} + +static struct file_operations cgroup_seqfile_operations; + static int cgroup_file_open(struct inode *inode, struct file *file) { int err; @@ -1500,7 +1540,18 @@ static int cgroup_file_open(struct inode cft = __d_cft(file->f_dentry); if (!cft) return -ENODEV; - if (cft->open) + if (cft->read_map) { + struct cgroup_seqfile_state *state = + kzalloc(sizeof(*state), GFP_USER); + if (!state) + return -ENOMEM; + state->cft = cft; + state->cgroup = __d_cgrp(file->f_dentry->d_parent); + file->f_op = &cgroup_seqfile_operations; + err = single_open(file, cgroup_seqfile_show, state); + if (err < 0) + kfree(state); + } else if (cft->open) err = cft->open(inode, file); else err = 0; @@ -1539,6 +1590,12 @@ static struct file_operations cgroup_fil .release = cgroup_file_release, }; +static struct file_operations cgroup_seqfile_operations = { + .read = seq_read, + .llseek = seq_lseek, + .release = cgroup_seqfile_release, +}; + static struct inode_operations cgroup_dir_inode_operations = { .lookup = simple_lookup, .mkdir = cgroup_mkdir, @@ -2206,6 +2263,8 @@ static int cgroup_api_show(struct seq_fi if (type == CGROUP_FILE_UNKNOWN) { if (cft->read_uint) type = CGROUP_FILE_U64; + else if (cft->read_map) + type = CGROUP_FILE_MAP; else if (cft->read) type = CGROUP_FILE_STRING; else if (!cft->open) -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 3/7] CGroup API: Use cgroup map for memcontrol stats file
Remove the seq_file boilerplate used to construct the memcontrol stats map, and instead use the new map representation for cgroup control files Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- mm/memcontrol.c | 30 ++ 1 file changed, 6 insertions(+), 24 deletions(-) Index: cgroupmap-2.6.24-mm1/mm/memcontrol.c === --- cgroupmap-2.6.24-mm1.orig/mm/memcontrol.c +++ cgroupmap-2.6.24-mm1/mm/memcontrol.c @@ -974,9 +974,9 @@ static const struct mem_cgroup_stat_desc [MEM_CGROUP_STAT_RSS] = { "rss", PAGE_SIZE, }, }; -static int mem_control_stat_show(struct seq_file *m, void *arg) +static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, +struct cgroup_map_cb *cb) { - struct cgroup *cont = m->private; struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont); struct mem_cgroup_stat *stat = &mem_cont->stat; int i; @@ -986,8 +986,7 @@ static int mem_control_stat_show(struct val = mem_cgroup_read_stat(stat, i); val *= mem_cgroup_stat_desc[i].unit; - seq_printf(m, "%s %lld\n", mem_cgroup_stat_desc[i].msg, - (long long)val); + cb->fill(cb, mem_cgroup_stat_desc[i].msg, val); } /* showing # of active pages */ { @@ -997,29 +996,12 @@ static int mem_control_stat_show(struct MEM_CGROUP_ZSTAT_INACTIVE); active = mem_cgroup_get_all_zonestat(mem_cont, MEM_CGROUP_ZSTAT_ACTIVE); - seq_printf(m, "active %ld\n", (active) * PAGE_SIZE); - seq_printf(m, "inactive %ld\n", (inactive) * PAGE_SIZE); + cb->fill(cb, "active", (active) * PAGE_SIZE); + cb->fill(cb, "inactive", (inactive) * PAGE_SIZE); } return 0; } -static const struct file_operations mem_control_stat_file_operations = { - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -static int mem_control_stat_open(struct inode *unused, struct file *file) -{ - /* XXX __d_cont */ - struct cgroup *cont = file->f_dentry->d_parent->d_fsdata; - - file->f_op = &mem_control_stat_file_operations; - return single_open(file, mem_control_stat_show, cont); -} - - - static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -1044,7 +1026,7 @@ static struct cftype mem_cgroup_files[] }, { .name = "stat", - .open = mem_control_stat_open, + .read_map = mem_control_stat_show, }, }; -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 4/7] CGroup API: Add res_counter_read_uint()
Adds a function for returning the value of a resource counter member, in a form suitable for use in a cgroup read_uint control file method. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- include/linux/res_counter.h |1 + kernel/res_counter.c|5 + 2 files changed, 6 insertions(+) Index: cgroupmap-2.6.24-mm1/include/linux/res_counter.h === --- cgroupmap-2.6.24-mm1.orig/include/linux/res_counter.h +++ cgroupmap-2.6.24-mm1/include/linux/res_counter.h @@ -54,6 +54,7 @@ struct res_counter { ssize_t res_counter_read(struct res_counter *counter, int member, const char __user *buf, size_t nbytes, loff_t *pos, int (*read_strategy)(unsigned long long val, char *s)); +u64 res_counter_read_uint(struct res_counter *counter, int member); ssize_t res_counter_write(struct res_counter *counter, int member, const char __user *buf, size_t nbytes, loff_t *pos, int (*write_strategy)(char *buf, unsigned long long *val)); Index: cgroupmap-2.6.24-mm1/kernel/res_counter.c === --- cgroupmap-2.6.24-mm1.orig/kernel/res_counter.c +++ cgroupmap-2.6.24-mm1/kernel/res_counter.c @@ -92,6 +92,11 @@ ssize_t res_counter_read(struct res_coun pos, buf, s - buf); } +u64 res_counter_read_uint(struct res_counter *counter, int member) +{ + return *res_counter_member(counter, member); +} + ssize_t res_counter_write(struct res_counter *counter, int member, const char __user *userbuf, size_t nbytes, loff_t *pos, int (*write_strategy)(char *st_buf, unsigned long long *val)) -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 7/7] CGroup API: Update cpusets to use cgroup structured file API
Many of the cpusets control files are simple integer values, which don't require the overhead of memory allocations for reads and writes. Move the handlers for these control files into cpuset_read_uint() and cpuset_write_uint(). This also has the advantage that the control files show up as "u64" rather than "string" in the cgroup.api file. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/cpuset.c | 158 +--- 1 file changed, 83 insertions(+), 75 deletions(-) Index: cgroupmap-2.6.24-mm1/kernel/cpuset.c === --- cgroupmap-2.6.24-mm1.orig/kernel/cpuset.c +++ cgroupmap-2.6.24-mm1/kernel/cpuset.c @@ -999,19 +999,6 @@ int current_cpuset_is_being_rebound(void } /* - * Call with cgroup_mutex held. - */ - -static int update_memory_pressure_enabled(struct cpuset *cs, char *buf) -{ - if (simple_strtoul(buf, NULL, 10) != 0) - cpuset_memory_pressure_enabled = 1; - else - cpuset_memory_pressure_enabled = 0; - return 0; -} - -/* * update_flag - read a 0 or a 1 in a file and update associated flag * bit:the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE, * CS_SCHED_LOAD_BALANCE, @@ -1023,15 +1010,13 @@ static int update_memory_pressure_enable * Call with cgroup_mutex held. */ -static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf) +static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, + int turning_on) { - int turning_on; struct cpuset trialcs; int err; int cpus_nonempty, balance_flag_changed; - turning_on = (simple_strtoul(buf, NULL, 10) != 0); - trialcs = *cs; if (turning_on) set_bit(bit, &trialcs.flags); @@ -1247,44 +1232,66 @@ static ssize_t cpuset_common_file_write( case FILE_MEMLIST: retval = update_nodemask(cs, buffer); break; + default: + retval = -EINVAL; + goto out2; + } + + if (retval == 0) + retval = nbytes; +out2: + cgroup_unlock(); +out1: + kfree(buffer); + return retval; +} + +static int cpuset_write_uint(struct cgroup *cgrp, struct cftype *cft, u64 val) +{ + int retval = 0; + struct cpuset *cs = cgroup_cs(cgrp); + cpuset_filetype_t type = cft->private; + + cgroup_lock(); + + if (cgroup_is_removed(cgrp)) { + cgroup_unlock(); + return -ENODEV; + } + + switch (type) { case FILE_CPU_EXCLUSIVE: - retval = update_flag(CS_CPU_EXCLUSIVE, cs, buffer); + retval = update_flag(CS_CPU_EXCLUSIVE, cs, val); break; case FILE_MEM_EXCLUSIVE: - retval = update_flag(CS_MEM_EXCLUSIVE, cs, buffer); + retval = update_flag(CS_MEM_EXCLUSIVE, cs, val); break; case FILE_SCHED_LOAD_BALANCE: - retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, buffer); + retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val); break; case FILE_MEMORY_MIGRATE: - retval = update_flag(CS_MEMORY_MIGRATE, cs, buffer); + retval = update_flag(CS_MEMORY_MIGRATE, cs, val); break; case FILE_MEMORY_PRESSURE_ENABLED: - retval = update_memory_pressure_enabled(cs, buffer); + cpuset_memory_pressure_enabled = val; break; case FILE_MEMORY_PRESSURE: retval = -EACCES; break; case FILE_SPREAD_PAGE: - retval = update_flag(CS_SPREAD_PAGE, cs, buffer); + retval = update_flag(CS_SPREAD_PAGE, cs, val); cs->mems_generation = cpuset_mems_generation++; break; case FILE_SPREAD_SLAB: - retval = update_flag(CS_SPREAD_SLAB, cs, buffer); + retval = update_flag(CS_SPREAD_SLAB, cs, val); cs->mems_generation = cpuset_mems_generation++; break; default: retval = -EINVAL; - goto out2; + break; } - - if (retval == 0) - retval = nbytes; -out2: cgroup_unlock(); -out1: - kfree(buffer); - return retval; + return -EINVAL; } /* @@ -1345,30 +1352,6 @@ static ssize_t cpuset_common_file_read(s case FILE_MEMLIST: s += cpuset_sprintf_memlist(s, cs); break; - case FILE_CPU_EXCLUSIVE: - *s++ = is_cpu_exclusive(cs) ? '1' : '0'; - break; - case FILE_MEM_EXCLUSIVE: - *s++ = is_mem_exclusive(cs) ? '1' : '0'; - break; - case FILE_SCHED_LOAD_BALANCE:
[PATCH] Add linux-fsdevel to VFS entry in MAINTAINERS
Add linux-fsdevel to the VFS entry in MAINTAINERS Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- MAINTAINERS |1 + 1 file changed, 1 insertion(+) Index: 2.6.24-mm1-bindflags/MAINTAINERS === --- 2.6.24-mm1-bindflags.orig/MAINTAINERS +++ 2.6.24-mm1-bindflags/MAINTAINERS @@ -1616,6 +1616,7 @@ S:Maintained FILESYSTEMS (VFS and infrastructure) P: Alexander Viro M: [EMAIL PROTECTED] +L: [EMAIL PROTECTED] S: Maintained FIREWIRE SUBSYSTEM (drivers/firewire, ) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add MS_BIND_FLAGS mount flag
On Thu, Feb 14, 2008 at 9:31 AM, Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > I deliberately not used the MS_* flags, which is currently a messy mix > of things with totally different meanings. > > Does this solve all the issues? We should add a size parameter either in the mount_params or as a final argument, for future extensibility. And we might as well include MNT_READONLY in the API on the assumption that per-mount readonly will be available soon. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add MS_BIND_FLAGS mount flag
On Thu, Feb 14, 2008 at 8:03 AM, Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > The "flags" argument could be the same as for regular mount, and > > contain the mnt_flags - so the extra argument could maybe usefully be > > a "mnt_flags_mask", to indicate which flags we actually care about > > overriding. > > The way I imagined it, is that mnt_flags is a mask, and the operation > (determined by flags) is either: > > - set bits in mask > - clear bits in mask (or not in mask) > - set flags to mask > > It doesn't allow setting some bits, clearing some others, and leaving > alone the rest. But I think such flexibility isn't really needed. I think I'd suggest something like: new_mnt->mnt_flags = (old_mnt->mnt_flags & ~arg_mask) | (arg_flags & mask) > Maybe instead of messing with masks, it's better to introduce a > get_flags() or a more general mount_stat() operation, and let > userspace deal with setting and clearing flags, just as we do for > stat/chmod? > > So we'd have > > mount_stat(path, stat); > mount_bind(from, to, flags); > mount_set_flags(path, flags); > mount_move(from, to); > > and perhaps > > mount_remount(path, opt_string, flags); Sounds reasonable to me. But it wouldn't directly solve the "do a recursive bind mount setting the MS_READONLY flag on all children" problem, so we'd need some of the earlier suggestions too. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add MS_BIND_FLAGS mount flag
[ cc: linux-fsdevel ] On Thu, Feb 14, 2008 at 7:22 AM, Paul Menage <[EMAIL PROTECTED]> wrote: > On Wed, Feb 13, 2008 at 10:02 PM, Christoph Hellwig <[EMAIL PROTECTED]> wrote: > > > > I think this concept is reasonable, but I don't think MS_BIND_FLAGS > > is a descriptive name for this flag. MS_EXPLICIT_FLAGS might be better > > but still isn't optimal. > > > > MS_BIND_FLAGS_OVERRIDE ? > > Paul > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add MS_BIND_FLAGS mount flag
On Wed, Feb 13, 2008 at 10:02 PM, Christoph Hellwig <[EMAIL PROTECTED]> wrote: > > I think this concept is reasonable, but I don't think MS_BIND_FLAGS > is a descriptive name for this flag. MS_EXPLICIT_FLAGS might be better > but still isn't optimal. > MS_BIND_FLAGS_OVERRIDE ? Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add MS_BIND_FLAGS mount flag
On Thu, Feb 14, 2008 at 12:30 AM, Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > For recursive bind mounts, only the root of the tree being bound > > inherits the per-mount flags from the mount() arguments; sub-mounts > > inherit their per-mount flags from the source tree as usual. > > This is rather strange behavior. I think it would be much better, if > setting mount flags would work for recursive operations as well. Also > what we really need is not resetting all the mount flags to some > predetermined values, but to be able to set or clear each flag > individually. This is certainly true, but as you observe below it's a fair bit more fiddly to specify in the API. I wasn't sure how much people recursive bind mounts, so I figured I'd throw out this simpler version first. > > For example, with the per-mount-read-only thing the most useful > application would be to just set the read-only flag and leave the > others alone. > > And this is where we usually conclude, that a new userspace mount API > is long overdue. So for starters, how about a new syscall for bind > mounts: > > int mount_bind(const char *src, const char *dst, unsigned flags, > unsigned mnt_flags); The "flags" argument could be the same as for regular mount, and contain the mnt_flags - so the extra argument could maybe usefully be a "mnt_flags_mask", to indicate which flags we actually care about overriding. What would happen when an existing super-block flag changes to become a per-mount flag (e.g. per-mount read-only)? I think that would just fit in with the "mask" idea, as long as we complained if any bits in mnt_flags_mask weren't actually per-mount settable. Being able to mask/set mount flags might be useful on a remount too, since there's no clean way to get the existing mount flags for a mount other than by scanning /proc/mounts. So an alternative to a separate system call would be a new mnt_flag_mask argument to mount() (whose presence would be indicated by a flag bit being set in the main flags) which would be used to control which bits were set cleared for remount/bind calls. Seems a bit wasteful of bits though. If we turned "flags" into an (optionally) 64-bit argument then we'd have plenty of bits to be able to specify both a "set" bit and a "mask" bit for each, without needing a new syscall. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Add MS_BIND_FLAGS mount flag
From: Paul Menage <[EMAIL PROTECTED]> Add a new mount() flag, MS_BIND_FLAGS. MS_BIND_FLAGS indicates that a bind mount should take its per-mount flags from the arguments passed to mount() rather than from the source mountpoint. This flag allows you to create a bind mount with the desired per-mount flags in a single operation, rather than having to do a bind mount followed by a remount, which is fiddly and can block for non-trivial periods of time (on sb->s_umount?). For recursive bind mounts, only the root of the tree being bound inherits the per-mount flags from the mount() arguments; sub-mounts inherit their per-mount flags from the source tree as usual. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- fs/namespace.c | 36 +--- include/linux/fs.h |2 ++ 2 files changed, 27 insertions(+), 11 deletions(-) Index: 2.6.24-mm1-bindflags/fs/namespace.c === --- 2.6.24-mm1-bindflags.orig/fs/namespace.c +++ 2.6.24-mm1-bindflags/fs/namespace.c @@ -512,13 +512,13 @@ static struct vfsmount *skip_mnt_tree(st } static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, - int flag) + int flag, int mnt_flags) { struct super_block *sb = old->mnt_sb; struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname); if (mnt) { - mnt->mnt_flags = old->mnt_flags; + mnt->mnt_flags = mnt_flags; atomic_inc(&sb->s_active); mnt->mnt_sb = sb; mnt->mnt_root = dget(root); @@ -1095,8 +1095,9 @@ static int lives_below_in_same_fs(struct } } -struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry, - int flag) +static struct vfsmount * +__copy_tree(struct vfsmount *mnt, struct dentry *dentry, + int flag, int mnt_flags) { struct vfsmount *res, *p, *q, *r, *s; struct nameidata nd; @@ -1104,7 +1105,7 @@ struct vfsmount *copy_tree(struct vfsmou if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt)) return NULL; - res = q = clone_mnt(mnt, dentry, flag); + res = q = clone_mnt(mnt, dentry, flag, mnt_flags); if (!q) goto Enomem; q->mnt_mountpoint = mnt->mnt_mountpoint; @@ -1126,7 +1127,7 @@ struct vfsmount *copy_tree(struct vfsmou p = s; nd.path.mnt = q; nd.path.dentry = p->mnt_mountpoint; - q = clone_mnt(p, p->mnt_root, flag); + q = clone_mnt(p, p->mnt_root, flag, p->mnt_flags); if (!q) goto Enomem; spin_lock(&vfsmount_lock); @@ -1146,6 +1147,11 @@ Enomem: } return NULL; } +struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry, + int flag) +{ + return __copy_tree(mnt, dentry, flag, mnt->mnt_flags); +} struct vfsmount *collect_mounts(struct vfsmount *mnt, struct dentry *dentry) { @@ -1320,7 +1326,8 @@ static int do_change_type(struct nameida /* * do loopback mount. */ -static int do_loopback(struct nameidata *nd, char *old_name, int recurse) +static int do_loopback(struct nameidata *nd, char *old_name, int flags, + int mnt_flags) { struct nameidata old_nd; struct vfsmount *mnt = NULL; @@ -1342,10 +1349,15 @@ static int do_loopback(struct nameidata goto out; err = -ENOMEM; - if (recurse) - mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0); + /* Use the source mount flags unless the user passed MS_BIND_FLAGS */ + if (!(flags & MS_BIND_FLAGS)) + mnt_flags = old_nd.path.mnt->mnt_flags; + if (flags & MS_REC) + mnt = __copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0, + mnt_flags); else - mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0); + mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0, + mnt_flags); if (!mnt) goto out; @@ -1874,7 +1886,9 @@ long do_mount(char *dev_name, char *dir_ retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags, data_page); else if (flags & MS_BIND) - retval = do_loopback(&nd, dev_name, flags & MS_REC); + retval = do_loopback(&nd, dev_name, +flags & (MS_REC | MS_BIND_FLAGS), +mnt_flags); else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
Re: [PATCH][DOCUMENTATION] Minimal controller code for a quick start
On Feb 7, 2008 12:28 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > While on the subject, could someone document struct cgroup_subsys. There's documentation for all the methods in Documentation/cgroup.txt > particular, I've wondered why we have: cgroup_subsys::can_attach() and > not use a return value in cgroup_subsys::attach()? We could do in theory do that, but it would make the recovery logic in cgroup.c:attach_task() more complex - it would have to be able to deal with undoing a partial attach. It seems simpler to just split it into two phases, given that most cgroups don't appear to have attachment conditions anyway. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH][DOCUMENTATION] Minimal controller code for a quick start
On Feb 7, 2008 7:37 AM, Pavel Emelyanov <[EMAIL PROTECTED]> wrote: > The Documentation/cgroups.txt file contains the info on how > to write some controller for cgroups subsystem, but even with > this, one need to write quite a lot of code before developing > the core (or copy-n-paste it from some other place). Good idea. > + > +static ssize_t foo_bar_read(struct cgroup *cg, struct cftype *cft, > + struct file *file, char __user *userbuf, > + size_t nbytes, loff_t *ppos) > +{ > + struct foo_cgroup *foo; > + > + foo = foo_from_cgroup(cg); > + > + /* > +* produce some output > +*/ > + > + return nbytes; > +} > + > +static ssize_t foo_bar_write(struct cgroup *cg, struct cftype *cft, > + struct file *file, const char __user *userbuf, > + size_t nbytes, loff_t *ppos) > +{ > + struct foo_cgroup *foo; > + > + foo = foo_from_cgroup(cg); > + > + /* > +* read and tune the foo > +*/ > + > + return nbytes; > +} > + > +static struct cftype foo_files[] = { > + { > + .name = "bar", > + .read = foo_bar_read, > + .write = foo_bar_write, > + }, > +}; Can you structure this example so as to encourage people to use the more formatted read/write routines, such as read_int64 and write_int64? > + > +static struct cgroup_subsys_state *foo_create(struct cgroup_subsys *cs, > + struct cgroup *cg) > +{ > + struct foo_cgroup *foo; > + Maybe add a comment here that mentions that if your cgroup needs very early initialization, you can check for cg->parent being NULL, and return a statically-constructed structure here. (And set foo_subsys.early_init = 1) Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Default child of a cgroup
On Jan 31, 2008 11:58 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > Is there a restriction in CFS that stops a given group from > > simultaneously holding tasks and sub-groups? If so, couldn't we change > > CFS to make it possible rather than enforcing awkward restrictions on > > cgroups? > > I think it is possible, just way more work than the proposed hack. Seems to me like the right thing to do though. > > > If we really can't change CFS in that way, then an alternative would > > be similar to Peter's suggestion - make cpu_cgroup_can_attach() fail > > if the cgroup has children, and make cpu_cgroup_create() fail if the > > cgroup has any tasks - that way you limit the restriction to just the > > hierarchy that has CFS attached to it, rather than generically for all > > cgroups > > Agreed. > Actually, I realised later that this is impossible - since the root cgroup will have tasks initially, there'd be no way to create the first child cgroup in the CFS hierarchy. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Default child of a cgroup
On Jan 30, 2008 6:40 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > Here are some questions that arise in this picture: > > 1. What is the relationship of the task-group in A/tasks with the >task-group in A/a1/tasks? In otherwords do they form siblings >of the same parent A? I'd argue the same as Balbir - tasks in A/tasks are are children of A and are siblings of a1, a2, etc. > > 2. Somewhat related to the above question, how much resource should the >task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent >A's share or 1/(1 + N) of parent A's share (where N = number of tasks >in A/tasks)? Each process in A should have a scheduler weight that's derived from its static_prio field. Similarly each subgroup of A will have a scheduler weight that's determined by its cpu.shares value. So the cpu share of any child (be it a task or a subgroup) would be equal to its own weight divided by the sum of weights of all children. So yes, if a task in A forks lots of children, those children could end up getting a disproportionate amount of the CPU compared to tasks in A/a1 - but that's the same as the situation without cgroups. If you want to control cpu usage between different sets of processes in A, they should be in sibling cgroups, not directly in A. Is there a restriction in CFS that stops a given group from simultaneously holding tasks and sub-groups? If so, couldn't we change CFS to make it possible rather than enforcing awkward restructions on cgroups? If we really can't change CFS in that way, then an alternative would be similar to Peter's suggestion - make cpu_cgroup_can_attach() fail if the cgroup has children, and make cpu_cgroup_create() fail if the cgroup has any tasks - that way you limit the restriction to just the hierarchy that has CFS attached to it, rather than generically for all cgroups BTW, I noticed this code in cpu_cgroup_create(): /* we support only 1-level deep hierarchical scheduler atm */ if (cgrp->parent->parent) return ERR_PTR(-EINVAL); Is anyone working on allowing more levels? Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Update comments in cpuset.c
Update comments in cpuset.c Some of the comments in kernel/cpuset.c were stale following the transition to control groups; this patch updates them to more closely match reality. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> Acked-by: Paul Jackson <[EMAIL PROTECTED]> --- kernel/cpuset.c | 128 ++-- 1 file changed, 43 insertions(+), 85 deletions(-) Index: linux-2.6.24-rc6-mm1/kernel/cpuset.c === --- linux-2.6.24-rc6-mm1.orig/kernel/cpuset.c +++ linux-2.6.24-rc6-mm1/kernel/cpuset.c @@ -64,7 +64,7 @@ */ int number_of_cpusets __read_mostly; -/* Retrieve the cpuset from a cgroup */ +/* Forward declare cgroup structures */ struct cgroup_subsys cpuset_subsys; struct cpuset; @@ -160,17 +160,17 @@ static inline int is_spread_slab(const s * number, and avoid having to lock and reload mems_allowed unless * the cpuset they're using changes generation. * - * A single, global generation is needed because attach_task() could + * A single, global generation is needed because cpuset_attach_task() could * reattach a task to a different cpuset, which must not have its * generation numbers aliased with those of that tasks previous cpuset. * * Generations are needed for mems_allowed because one task cannot - * modify anothers memory placement. So we must enable every task, + * modify another's memory placement. So we must enable every task, * on every visit to __alloc_pages(), to efficiently check whether * its current->cpuset->mems_allowed has changed, requiring an update * of its current->mems_allowed. * - * Since cpuset_mems_generation is guarded by manage_mutex, + * Since writes to cpuset_mems_generation are guarded by the cgroup lock * there is no need to mark it atomic. */ static int cpuset_mems_generation; @@ -182,17 +182,20 @@ static struct cpuset top_cpuset = { }; /* - * We have two global cpuset mutexes below. They can nest. - * It is ok to first take manage_mutex, then nest callback_mutex. We also - * require taking task_lock() when dereferencing a tasks cpuset pointer. - * See "The task_lock() exception", at the end of this comment. + * There are two global mutexes guarding cpuset structures. The first + * is the main control groups cgroup_mutex, accessed via + * cgroup_lock()/cgroup_unlock(). The second is the cpuset-specific + * callback_mutex, below. They can nest. It is ok to first take + * cgroup_mutex, then nest callback_mutex. We also require taking + * task_lock() when dereferencing a task's cpuset pointer. See "The + * task_lock() exception", at the end of this comment. * * A task must hold both mutexes to modify cpusets. If a task - * holds manage_mutex, then it blocks others wanting that mutex, + * holds cgroup_mutex, then it blocks others wanting that mutex, * ensuring that it is the only task able to also acquire callback_mutex * and be able to modify cpusets. It can perform various checks on * the cpuset structure first, knowing nothing will change. It can - * also allocate memory while just holding manage_mutex. While it is + * also allocate memory while just holding cgroup_mutex. While it is * performing these checks, various callback routines can briefly * acquire callback_mutex to query cpusets. Once it is ready to make * the changes, it takes callback_mutex, blocking everyone else. @@ -208,60 +211,16 @@ static struct cpuset top_cpuset = { * The task_struct fields mems_allowed and mems_generation may only * be accessed in the context of that task, so require no locks. * - * Any task can increment and decrement the count field without lock. - * So in general, code holding manage_mutex or callback_mutex can't rely - * on the count field not changing. However, if the count goes to - * zero, then only attach_task(), which holds both mutexes, can - * increment it again. Because a count of zero means that no tasks - * are currently attached, therefore there is no way a task attached - * to that cpuset can fork (the other way to increment the count). - * So code holding manage_mutex or callback_mutex can safely assume that - * if the count is zero, it will stay zero. Similarly, if a task - * holds manage_mutex or callback_mutex on a cpuset with zero count, it - * knows that the cpuset won't be removed, as cpuset_rmdir() needs - * both of those mutexes. - * * The cpuset_common_file_write handler for operations that modify - * the cpuset hierarchy holds manage_mutex across the entire operation, + * the cpuset hierarchy holds cgroup_mutex across the entire operation, * single threading all such cpuset modifications across the system. * * The cpuset_common_file_read() handlers only hold callback_mutex across * small pieces of code, such as when reading out possibly multi-word * cpumasks and nodemasks. * - * The fork and exit callbacks cpuset_fork() and cpuset_exit
Re: [PATCH 5/12] Handle pid namespaces in cgroups code
On Jan 29, 2008 5:52 AM, Pavel Emelyanov <[EMAIL PROTECTED]> wrote: > There's one place that works with task pids - its the "tasks" file > in cgroups. The read/write handlers assume, that the pid values > go to/come from the user space and thus it is a virtual pid, i.e. > the pid as it is seen from inside a namespace. > > Tune the code accordingly. > > Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > > --- > kernel/cgroup.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 2c5cccb..4766bb6 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -1269,7 +1269,7 @@ static int attach_task_by_pid(struct cgroup *cgrp, char > *pidbuf) > > if (pid) { > rcu_read_lock(); > - tsk = find_task_by_pid(pid); > + tsk = find_task_by_vpid(pid); > if (!tsk || tsk->flags & PF_EXITING) { > rcu_read_unlock(); > return -ESRCH; > @@ -1955,7 +1955,7 @@ static int pid_array_load(pid_t *pidarray, int npids, > struct cgroup *cgrp) > while ((tsk = cgroup_iter_next(cgrp, &it))) { > if (unlikely(n == npids)) > break; > - pidarray[n++] = task_pid_nr(tsk); > + pidarray[n++] = task_pid_vnr(tsk); > } > cgroup_iter_end(cgrp, &it); > return n; > -- > 1.5.3.4 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH] cgroup: limit network bandwidth
On Jan 23, 2008 8:48 AM, Andrea Righi <[EMAIL PROTECTED]> wrote: > > > > 1. Implementation of soft limits (limit on contention of resource) > >gets harder > > Why? do you mean implementing a grace time when the soft-limit is > exceeded? this could be done in cgroup_nl_throttle() introducing 3 > additional attributes to struct netlimit (i.e. hard_limit, > last_time_exceeded grace_time) and perform something like: > ... > if ((current_rate > hard_limit) || > time_after(jiffies, last_time_exceeded + grace_time)) > schedule_timeout(sleep); > ... He's talking about cases where we want the behaviour to be work-conserving, whilst still offering guarantees in the event of contention. e.g. cgroups A and B each get a 20% guarantee on the TX path if they need it, but anyone can use any otherwise-idle bandwidth. (This is relatively straightforward to set up from userspace with the standard Linux traffic control tools). > > Yes, the integration with iptables (as Paul said), and traffic shaping > rules would be absolutely the right way(tm) in perspective. I was just > proposing a possible simple API to implement the limiting stuff. But this issue (traffic control for cgroups) is too complex to be described by a simple API. Any simple API you choose to try to describe the limiting directly will be insufficient for a good number of the potential users. Better to just provide a (very simple) API to hook into the existing (complex) traffic control API and leave the tricky stuff to userspace, where anyone can construct arbitrarily complex queueing schemes with a shell script and a few calls to "tc". Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH] cgroup: limit network bandwidth
An approach that we've been experimenting with at Google is much simpler: - add a "network class id" subsystem, that lets you associated an id with each cgroup - propagate this id to sockets created by that cgroup, and from there to packets sent/received on that socket - add a new traffic filter that can select based on a packet's cgroup class id This is a very small amount of kernel code, but it then lets userspace set up whatever queues/filters/classes it wants using the standard Linux traffic API, rather than creating a new traffic control API that's much more limited. So you can easily do things like controlling guarantees and limits, have different behaviour for local and remote packets, have packet/byte accounting for different flow classes, filter on ToS bits in order to let the cgroup prioritize its own traffic, etc. We also have plans (have had for months in fact, but haven't had time for it yet) to let the cgroup network id be selected on in iptables rules, and possibly add a new iptable for events such as listen(), bind(), and connect(), to allow very easy control over what network connections a cgroup can access. This would let you use the full power of the existing packet/connection matching available in the standard iptables rules without having to add a new complex (but still limited) API. Paul On Jan 23, 2008 1:09 AM, Andrea Righi <[EMAIL PROTECTED]> wrote: > Allow to limit the network bandwidth for specific process containers (cgroups) > imposing additional delays in the sockets' sendmsg()/recvmsg() calls made by > those processes that exceed the limits defined in the control group > filesystem. > > Example: > # mkdir /dev/cgroup > # mount -t cgroup -onet net /dev/cgroup > # cd /dev/cgroup > # mkdir foo > --> the cgroup foo has been created > # /bin/echo $$ > foo/tasks > # /bin/echo 1024 > foo/net.tcp > # /bin/echo 2048 > foo/net.tot > # sh > --> the subshell 'sh' is running in cgroup "foo" that has a maximum network > bandwidth for TCP traffic of 1MB/s and 2MB/s for total network > activities. > > The netlimit approach can be easily extended to support additional network > protocols or different socket families or types (PF_UNIX, PF_BLUETOOTH, > SOCK_SEQPACKET, etc.). > > Signed-off-by: Andrea Righi <[EMAIL PROTECTED]> > --- > > diff -urpN linux-2.6.24-rc8/include/linux/cgroup_netlimit.h > linux-2.6.24-rc8-cgroup-netlimit/include/linux/cgroup_netlimit.h > --- linux-2.6.24-rc8/include/linux/cgroup_netlimit.h1970-01-01 > 01:00:00.0 +0100 > +++ linux-2.6.24-rc8-cgroup-netlimit/include/linux/cgroup_netlimit.h > 2008-01-22 21:36:15.0 +0100 > @@ -0,0 +1,29 @@ > +#ifndef CGROUP_NETLIMIT_H > +#define CGROUP_NETLIMIT_H > + > +enum { > + CGROUP_NETLIMIT_TOT, > + CGROUP_NETLIMIT_TCP, > + CGROUP_NETLIMIT_UDP, > + CGROUP_NETLIMIT_RAW, > + /* This sets the size of the different netlimit types */ > + CGROUP_NETLIMIT_END, > +}; > + > +#define CGROUP_NETLIMIT_FILE(_x, _y) \ > + { \ > + .name = _x, \ > + .read = netlimit_read, \ > + .write_uint = netlimit_write_uint, \ > + .private = _y, \ > + } > + > +#ifdef CONFIG_CGROUP_NETLIMIT > +extern void cgroup_nl_acct(int limit_id, size_t bytes); > +extern void cgroup_nl_throttle(int limit_id, int interruptible); > +#else > +static inline void cgroup_nl_acct(int limit_id, size_t bytes) { } > +static inline void cgroup_nl_throttle(int limit_id, int interruptible) { } > +#endif /* CONFIG_CGROUP_NETLIMIT */ > + > +#endif > diff -urpN linux-2.6.24-rc8/include/linux/cgroup_subsys.h > linux-2.6.24-rc8-cgroup-netlimit/include/linux/cgroup_subsys.h > --- linux-2.6.24-rc8/include/linux/cgroup_subsys.h 2008-01-16 > 05:22:48.0 +0100 > +++ linux-2.6.24-rc8-cgroup-netlimit/include/linux/cgroup_subsys.h > 2008-01-22 14:42:17.0 +0100 > @@ -37,3 +37,9 @@ SUBSYS(cpuacct) > > /* */ > > +#ifdef CONFIG_CGROUP_NETLIMIT > +SUBSYS(netlimit) > +#endif > + > +/* */ > + > diff -urpN linux-2.6.24-rc8/init/Kconfig > linux-2.6.24-rc8-cgroup-netlimit/init/Kconfig > --- linux-2.6.24-rc8/init/Kconfig 2008-01-16 05:22:48.0 +0100 > +++ linux-2.6.24-rc8-cgroup-netlimit/init/Kconfig 2008-01-23 > 00:44:04.0 +0100 > @@ -313,6 +313,15 @@ config CGROUP_NS >for instance virtual servers and checkpoint/restart >jobs. > > +config CGROUP_NETLIMIT > +bool "Enable cgroup network bandwidth limitinig (EXPERIMENTAL)" > +depends on EXPERIMENTAL && CGROUPS > +help > + This allows to define network bandwidth limiting/shaping rules for > + specific cgroup(s). > + > + Say N if unsure. > + > config CPUSETS > bool "Cpuset support" > depends on SMP && CGROUPS > diff -urpN linux-2.6.24-rc8/kernel/cgroup_netlimit.c > linux-2.6.24-rc8-cgroup-netlimit/kernel/cgroup_netlimit.c > --- linux-2.6.24-rc8/kernel/cgro
Re: [PATCH] cgroup: limit block I/O bandwidth
On Jan 18, 2008 7:36 AM, Dhaval Giani <[EMAIL PROTECTED]> wrote: > On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote: > > Allow to limit the block I/O bandwidth for specific process containers > > (cgroups) imposing additional delays on I/O requests for those processes > > that exceed the limits defined in the control group filesystem. > > > > Example: > > # mkdir /dev/cgroup > > # mount -t cgroup -oio-throttle io-throttle /dev/cgroup > > Just a minor nit, can't we name it as io, keeping in mind that other > controllers are known as cpu and memory? Or maybe "blockio"? Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: What can we do to get ready for memory controller merge in 2.6.25
On Dec 1, 2007 10:36 AM, Rik van Riel <[EMAIL PROTECTED]> wrote: > > With the /proc/refaults info, we can measure how much extra > memory each process group needs, if any. What's the status of that? It looks as though it would be better than the "accessed in the last N seconds" metric that we've been playing with, although it's possibly more intrusive? Would it be practical to keep a non-resident set for each cgroup? > > As for how much memory a process group needs, at pageout time > we can check the fraction of pages that are accessed. If 60% > of the pages were recently accessed at pageout time and this > process group is spending little or no time waiting for refaults, > 40% of the pages are *not* recently accessed and we can probably > reduce the amount of memory assigned to this group. It would probably be better to reduce its background-reclaim high watermark than to reduce its limit. If you do the latter, you risk triggering an OOM in the cgroup if it turns out that it did need all that memory after all. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: cpu accounting controller (V2)
Hi Vatsa, Thanks, this looks pretty good. On Nov 30, 2007 4:42 AM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > - Removed load average information. I felt it needs more thought (esp > to deal with SMP and virtualized platforms) and can be added for > 2.6.25 after more discussions. The "load" value was never a load average, it was just a count of the % cpu time used in the previous fixed window of time, updated at the end of each window. Maybe we can instead do something based tracking the length of the run queue for the cgroup? Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: What can we do to get ready for memory controller merge in 2.6.25
On Nov 29, 2007 6:11 PM, Nick Piggin <[EMAIL PROTECTED]> wrote: > And also some > results or even anecdotes of where this is going to be used would be > interesting... We want to be able to run multiple isolated jobs on the same machine. So being able to limit how much memory each job can consume, in terms of anonymous memory and page cache, are useful. I've not had much time to look at the patches in great detail, but they seem to provide a sensible way to assign and enforce static limits on a bunch of jobs. Some of our requirements are a bit beyond this, though: In our experience, users are not good at figuring out how much memory they really need. In general they tend to massively over-estimate their requirements. So we want some way to determine how much of its allocated memory a job is actively using, and how much could be thrown away or swapped out without bothering the job too much. Of course, the definition of "actve use" is tricky - one possibility that we're looking at is "has been accessed within the last N seconds", where N can be configured appropriately for different jobs depending on the job's latency requirements. Active use should also be reported for pages that can't be easily freed quickly, e.g. mlocked or dirty pages, or anon pages on a swapless system. Inactive pages should be easily freeable, and be the first ones to go in the event of memory pressure. (From a scheduling point of view we can treat them as free memory, and schedule more jobs on the machine) The existing active/inactive distinction doesn't really capture this, since it's relative rather than absolute. We want to be able to overcommit a machine, so the sums of the cgroup memory limits can add up to more than the total machine memory. So we need control over what happens when there's global memory pressure, and a way to ensure that the low-latency jobs don't get bogged down in reclaim (or OOM) due to the activity of batch jobs. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert for cgroups CPU accounting subsystem patch
On Nov 12, 2007 11:59 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > Thinking of it more, this requirement to "group tasks for only accounting > purpose" may be required for other resources (mem, io, network etc) as well? > Should we have a generic accounting controller which can provide these > various resource usgae stats for a group (cpu, mem etc) by iterating thr' the > task list for the group and summing up the corresponding stats already present > in task structure? In theory it could certainly be useful - but it can only be done if something in the kernel is already keeping track of resources on a per-task basis. This works for CPU, but isn't really possible for memory without doing something lame like just adding up the tasks' RSS values (since the page accounting is the hard part - limiting is easy once you have accounting). Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert for cgroups CPU accounting subsystem patch
On Nov 12, 2007 11:48 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > Regarding your concern about tracking cpu usage in different ways, it > could be mitigated if we have cpuacct controller track usage as per > information present in a task's sched entity structure > (tsk->se.sum_exec_runtime) i.e call cpuacct_charge() from > __update_curr() which would accumulate the execution time of the > group in a SMP friendly manner (i.e dump it in a per-cpu per-group counter > first and then aggregate to a global per-group counter). That seems more reasonable than the current approach in cpu_acct.c Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert for cgroups CPU accounting subsystem patch
On Nov 12, 2007 11:29 PM, Balbir Singh <[EMAIL PROTECTED]> wrote: > > I think it's a good hack, but not sure about the complexity to implement > the code. I worry that if the number of tasks increase (say run into > thousands for one or more groups and a few groups have just a few > tasks), we'll lose out on accuracy. But for the case we're looking at, you've already said that you don't care much about actually controlling CPU, just monitoring it. So what does it matter if the CPU sharing isn't perfectly accurate? I don't think that you're painting a realistic scenario. > > I think we already have the code, we need to make it more useful and > reusable. In that case we should take the existing cpu_acct code out of 2.6.24 until the API has been made more useful and reusable, so that we don't have to support it for ever. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert for cgroups CPU accounting subsystem patch
On Nov 12, 2007 11:00 PM, Balbir Singh <[EMAIL PROTECTED]> wrote: > > Right now, one of the limitations of the CPU controller is that > the moment you create another control group, the bandwidth gets > divided by the default number of shares. We can't create groups > just for monitoring. Could we get around this with, say, a flag that always treats a CFS schedulable entity as having a weight equal to the number of runnable tasks in it? So CPU bandwidth would be shared between groups in proportion to the number of runnable tasks, which would distribute the cycles approximately equivalently to them all being separate schedulable entities. > cpu_acct fills this gap. Agreed, but not in the right way IMO. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert for cgroups CPU accounting subsystem patch
On Nov 12, 2007 10:00 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > On second thoughts, this may be a usefull controller of its own. > Say I just want to "monitor" usage (for accounting purpose) of a group of > tasks, but don't want to control their cpu consumption, then cpuacct > controller would come in handy. > That's plausible, but having two separate ways of tracking and reporting the CPU usage of a cgroup seems wrong. How bad would it be in your suggested case if you just give each cgroup the same weight? So there would be fair scheduling between cgroups, which seems as reasonable as any other choice in the event that the CPU is contended. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Revert for cgroups CPU accounting subsystem patch
Hi Linus, Please can you revert commit 62d0df64065e7c135d0002f069444fbdfc64768f, entitled "Task Control Groups: example CPU accounting subsystem" ? This was originally intended as a simple initial example of how to create a control groups subsystem; it wasn't intended for mainline, but I didn't make this clear enough to Andrew. The CFS cgroup subsystem now has better functionality for the per-cgroup usage accounting (based directly on CFS stats) than the "usage" status file in this patch, and the "load" status file is rather simplistic - although having a per-cgroup load average report would be a useful feature, I don't believe this patch actually provides it. If it gets into the final 2.6.24 we'd probably have to support this interface for ever. Thanks, Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Improve cgroup printks
On Nov 10, 2007 3:11 AM, Diego Calleja <[EMAIL PROTECTED]> wrote: > When I boot with the 'quiet' parameter, I see on the screen: > > [0.00] Initializing cgroup subsys cpuset > [0.00] Initializing cgroup subsys cpu > [ 39.036026] Initializing cgroup subsys cpuacct > [ 39.036080] Initializing cgroup subsys debug > [ 39.036118] Initializing cgroup subsys ns > > This patch lowers the priority of those messages, adds a "cgroup: " prefix > to another couple of printks and kills the useless reference to the source > file. > > > Signed-off-by: Diego Calleja <[EMAIL PROTECTED]> (with the addition of akpm's KERN_INFO for cgroup_init_subsys() ) Acked-by: Paul Menage <[EMAIL PROTECTED]> Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Report usage in CFS cgroup
Report CPU usage in CFS Cgroup directories Adds a cpu.usage file to the CFS cgroup that reports CPU usage in milliseconds for that cgroup's tasks Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/sched.c | 36 +++- 1 file changed, 31 insertions(+), 5 deletions(-) Index: container-2.6.23-mm1/kernel/sched.c === --- container-2.6.23-mm1.orig/kernel/sched.c +++ container-2.6.23-mm1/kernel/sched.c @@ -7005,15 +7005,41 @@ static u64 cpu_shares_read_uint(struct c return (u64) tg->shares; } -static struct cftype cpu_shares = { - .name = "shares", - .read_uint = cpu_shares_read_uint, - .write_uint = cpu_shares_write_uint, +static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft) +{ + struct task_group *tg = cgroup_tg(cgrp); + int i; + u64 res = 0; + for_each_possible_cpu(i) { + unsigned long flags; + /* +* Lock to prevent races with updating 64-bit counters +* on 32-bit arches. +*/ + spin_lock_irqsave(&cpu_rq(i)->lock, flags); + res += tg->se[i]->sum_exec_runtime; + spin_unlock_irqrestore(&cpu_rq(i)->lock, flags); + } + /* Convert from ns to ms */ + do_div(res, 100); + return res; +} + +static struct cftype cpu_files[] = { + { + .name = "shares", + .read_uint = cpu_shares_read_uint, + .write_uint = cpu_shares_write_uint, + }, + { + .name = "usage", + .read_uint = cpu_usage_read, + }, }; static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) { - return cgroup_add_file(cont, ss, &cpu_shares); + return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files)); } struct cgroup_subsys cpu_cgroup_subsys = { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Report usage in CFS cgroup
Report CPU usage in CFS Cgroup directories Adds a cpu.usage file to the CFS cgroup that reports CPU usage in milliseconds for that cgroup's tasks Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/sched.c | 36 +++- 1 file changed, 31 insertions(+), 5 deletions(-) Index: container-2.6.23-mm1/kernel/sched.c === --- container-2.6.23-mm1.orig/kernel/sched.c +++ container-2.6.23-mm1/kernel/sched.c @@ -7005,15 +7005,41 @@ static u64 cpu_shares_read_uint(struct c return (u64) tg->shares; } -static struct cftype cpu_shares = { - .name = "shares", - .read_uint = cpu_shares_read_uint, - .write_uint = cpu_shares_write_uint, +static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft) +{ + struct task_group *tg = cgroup_tg(cgrp); + int i; + u64 res = 0; + for_each_possible_cpu(i) { + unsigned long flags; + /* +* Lock to prevent races with updating 64-bit counters +* on 32-bit arches. +*/ + spin_lock_irqsave(&cpu_rq(i)->lock, flags); + res += tg->se[i]->sum_exec_runtime; + spin_unlock_irqrestore(&cpu_rq(i)->lock, flags); + } + /* Convert from ns to ms */ + do_div(res, 100); + return res; +} + +static struct cftype cpu_files[] = { + { + .name = "shares", + .read_uint = cpu_shares_read_uint, + .write_uint = cpu_shares_write_uint, + }, + { + .name = "usage", + .read_uint = cpu_usage_read, + }, }; static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) { - return cgroup_add_file(cont, ss, &cpu_shares); + return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files)); } struct cgroup_subsys cpu_cgroup_subsys = { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Move cgroups destroy() callbacks to cgroup_diput()
Move cgroups destroy() callbacks to cgroup_diput() Move the calls to the cgroup subsystem destroy() methods from cgroup_rmdir() to cgroup_diput(). This allows control file reads and writes to access their subsystem state without having to be concerned with locking against cgroup destruction - the control file dentry will keep the cgroup and its subsystem state objects alive until the file is closed. The documentation is updated to reflect the changed semantics of destroy(); additionally the locking comments for destroy() and some other methods were clarified and decrustified. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- Documentation/cgroups.txt | 22 +++--- kernel/cgroup.c | 36 2 files changed, 35 insertions(+), 23 deletions(-) Index: container-2.6.23-mm1/kernel/cgroup.c === --- container-2.6.23-mm1.orig/kernel/cgroup.c +++ container-2.6.23-mm1/kernel/cgroup.c @@ -592,6 +592,7 @@ static void cgroup_diput(struct dentry * /* is dentry a directory ? if so, kfree() associated cgroup */ if (S_ISDIR(inode->i_mode)) { struct cgroup *cgrp = dentry->d_fsdata; + struct cgroup_subsys *ss; BUG_ON(!(cgroup_is_removed(cgrp))); /* It's possible for external users to be holding css * reference counts on a cgroup; css_put() needs to @@ -600,6 +601,23 @@ static void cgroup_diput(struct dentry * * queue the cgroup to be handled by the release * agent */ synchronize_rcu(); + + mutex_lock(&cgroup_mutex); + /* +* Release the subsystem state objects. +*/ + for_each_subsys(cgrp->root, ss) { + if (cgrp->subsys[ss->subsys_id]) + ss->destroy(ss, cgrp); + } + + cgrp->root->number_of_cgroups--; + mutex_unlock(&cgroup_mutex); + + /* Drop the active superblock reference that we took when we +* created the cgroup */ + deactivate_super(cgrp->root->sb); + kfree(cgrp); } iput(inode); @@ -1333,6 +1351,10 @@ static ssize_t cgroup_common_file_write( mutex_lock(&cgroup_mutex); + /* +* This was already checked for in cgroup_file_write(), but +* check again now we're holding cgroup_mutex. +*/ if (cgroup_is_removed(cgrp)) { retval = -ENODEV; goto out2; @@ -1388,7 +1410,7 @@ static ssize_t cgroup_file_write(struct struct cftype *cft = __d_cft(file->f_dentry); struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); - if (!cft) + if (!cft || cgroup_is_removed(cgrp)) return -ENODEV; if (cft->write) return cft->write(cgrp, cft, file, buf, nbytes, ppos); @@ -1458,7 +1480,7 @@ static ssize_t cgroup_file_read(struct f struct cftype *cft = __d_cft(file->f_dentry); struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); - if (!cft) + if (!cft || cgroup_is_removed(cgrp)) return -ENODEV; if (cft->read) @@ -2139,7 +2161,6 @@ static int cgroup_rmdir(struct inode *un struct cgroup *cgrp = dentry->d_fsdata; struct dentry *d; struct cgroup *parent; - struct cgroup_subsys *ss; struct super_block *sb; struct cgroupfs_root *root; @@ -2164,11 +2185,6 @@ static int cgroup_rmdir(struct inode *un return -EBUSY; } - for_each_subsys(root, ss) { - if (cgrp->subsys[ss->subsys_id]) - ss->destroy(ss, cgrp); - } - spin_lock(&release_list_lock); set_bit(CGRP_REMOVED, &cgrp->flags); if (!list_empty(&cgrp->release_list)) @@ -2183,15 +2199,11 @@ static int cgroup_rmdir(struct inode *un cgroup_d_remove_dir(d); dput(d); - root->number_of_cgroups--; set_bit(CGRP_RELEASABLE, &parent->flags); check_for_release(parent); mutex_unlock(&cgroup_mutex); - /* Drop the active superblock reference that we took when we -* created the cgroup */ - deactivate_super(sb); return 0; } Index: container-2.6.23-mm1/Documentation/cgroups.txt === --- container-2.6.23-mm1.orig/Documentation/cgroups.txt +++ container-2.6.23-mm1/Documentation/cgroups.txt @@ -456,7 +456,7 @@ methods are create/destroy. Any others t be successful no-ops. struct cgroup_subsys_state *create(struct cgroup *cont) -LL=cgroup_mutex +(cgroup_mutex held by caller) Called to create a subsystem state object
Re: [2.6 patch] kernel/cgroup.c: remove dead code
On 10/25/07, Paul Jackson <[EMAIL PROTECTED]> wrote: > Paul M wrote: > > Sounds reasonable to me. Is there any kind of compile-time assert > > macro in the kernel? > > Check out the assembly code generated by: > > BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX)); > > (Hint: you can't find it ;) > > It -is- compile time! > Sounds like a good solution. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/23/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > agreed, we need to be reporting idle time in (milli)seconds, although > the requirement we had was to report it back in percentage. I guess the > percentage figure can be derived from the raw idle time number. > > How about: > > idle time = t4-t3 (effectively sleep time) > > in the above example? > > > It doesn't seem quite right to me that a cgroup's idle time metric be > > affected by the activity of other cgroups on the machine, > > I don't see how the idle time metric defined above (t4-t3) can be > affected by other cgroup activity, unless the execution pattern of one > cgroup is dependent on the others. If the other cgroups are busier, and t1-t2 is longer, then the cgroup will get to the point where it's ready to sleep later in wallclock time, and t4-t3 will be shorter in absolute terms. If there were no other cgroups running, then presumably the sleep time would actually be the sum of those three periods. Even so, I guess you're right that t4-t3 is the most appropriate thing to report, as long as people realise that it's a bit of a fuzzy value. > I think primarily for systems management tools to report back various > statistics about a OpenVZ/VServer-like container (just like top reports stats > for a OS currently). Let me check if there are other uses envisoned for > it. Sorry, I didn't mean "how will you report it to users?", I meant "what kinds of useful information will the users be able to get from it?" Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] cgroup brace coding style fix
On 10/24/07, Paul Jackson <[EMAIL PROTECTED]> wrote: > From: Paul Jackson <[EMAIL PROTECTED]> > > Coding style fix - one line conditionals don't get braces. > > Signed-off-by: Paul Jackson <[EMAIL PROTECTED]> Not a coding style that I'm in favor of, but I suppose it is the kernel standard. Acked-by: Paul Menage <[EMAIL PROTECTED]> > > --- > > This patch applies --after-- Adrian Bunk's patch: > [2.6 patch] kernel/cgroup.c: remove dead code > > kernel/cgroup.c | 15 +-- > 1 file changed, 5 insertions(+), 10 deletions(-) > > --- 2.6.23-mm1.orig/kernel/cgroup.c 2007-10-24 09:51:20.289137770 -0700 > +++ 2.6.23-mm1/kernel/cgroup.c 2007-10-24 09:51:52.153617193 -0700 > @@ -1182,9 +1182,8 @@ static int attach_task(struct cgroup *cg > for_each_subsys(root, ss) { > if (ss->can_attach) { > retval = ss->can_attach(ss, cgrp, tsk); > - if (retval) { > + if (retval) > return retval; > - } > } > } > > @@ -1193,9 +1192,8 @@ static int attach_task(struct cgroup *cg > * based on its final set of cgroups > */ > newcg = find_css_set(cg, cgrp); > - if (!newcg) { > + if (!newcg) > return -ENOMEM; > - } > > task_lock(tsk); > if (tsk->flags & PF_EXITING) { > @@ -1215,9 +1213,8 @@ static int attach_task(struct cgroup *cg > write_unlock(&css_set_lock); > > for_each_subsys(root, ss) { > - if (ss->attach) { > + if (ss->attach) > ss->attach(ss, cgrp, oldcgrp, tsk); > - } > } > set_bit(CGRP_RELEASABLE, &oldcgrp->flags); > synchronize_rcu(); > @@ -1353,9 +1350,8 @@ static ssize_t cgroup_common_file_write( > { > struct cgroupfs_root *root = cgrp->root; > /* Strip trailing newline */ > - if (nbytes && (buffer[nbytes-1] == '\n')) { > + if (nbytes && (buffer[nbytes-1] == '\n')) > buffer[nbytes-1] = 0; > - } > /* We never write anything other than '\0' > * into the last char of release_agent_path, > * so it always remains a NUL-terminated > @@ -2123,9 +2119,8 @@ static inline int cgroup_has_css_refs(st > * matter, since it can only happen if the cgroup > * has been deleted and hence no longer needs the > * release agent to be called anyway. */ > - if (css && atomic_read(&css->refcnt)) { > + if (css && atomic_read(&css->refcnt)) > return 1; > - } > } > return 0; > } > > -- > I won't rest till it's the best ... > Programmer, Linux Scalability > Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373 > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] cgroup simplify space stripping
On 10/24/07, Paul Jackson <[EMAIL PROTECTED]> wrote: > From: Paul Jackson <[EMAIL PROTECTED]> > > Simplify the space stripping code in cgroup file write. > > Signed-off-by: Paul Jackson <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > > --- > > This patch applies after both: > Adrian Bunk's: [2.6 patch] kernel/cgroup.c: remove dead code > pj's: [PATCH] cgroup brace coding style fix > > kernel/cgroup.c | 15 +++ > 1 file changed, 3 insertions(+), 12 deletions(-) > > --- 2.6.23-mm1.orig/kernel/cgroup.c 2007-10-24 10:03:11.847801501 -0700 > +++ 2.6.23-mm1/kernel/cgroup.c 2007-10-24 10:29:08.355032464 -0700 > @@ -1327,6 +1327,7 @@ static ssize_t cgroup_common_file_write( > goto out1; > } > buffer[nbytes] = 0; /* nul-terminate */ > + strstrip(buffer); /* strip -just- trailing whitespace */ > > mutex_lock(&cgroup_mutex); > > @@ -1347,19 +1348,9 @@ static ssize_t cgroup_common_file_write( > clear_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags); > break; > case FILE_RELEASE_AGENT: > - { > - struct cgroupfs_root *root = cgrp->root; > - /* Strip trailing newline */ > - if (nbytes && (buffer[nbytes-1] == '\n')) > - buffer[nbytes-1] = 0; > - /* We never write anything other than '\0' > -* into the last char of release_agent_path, > -* so it always remains a NUL-terminated > -* string */ > - strncpy(root->release_agent_path, buffer, nbytes); > - root->release_agent_path[nbytes] = 0; > + BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX); > + strcpy(cgrp->root->release_agent_path, buffer); > break; > - } > default: > retval = -EINVAL; > goto out2; > > -- > I won't rest till it's the best ... > Programmer, Linux Scalability > Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373 > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] kernel/cgroup.c: remove dead code
On 10/24/07, Paul Jackson <[EMAIL PROTECTED]> wrote: > Paul M wrote: > > I think I'd rather not make this change - if we later changed the size > > of release_agent_path[] this could silently fail. Can we get around > > the coverity checker somehow? > > Perhaps we can simplify this check then, to: > > BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX)); > > Less runtime code. Sounds reasonable to me. Is there any kind of compile-time assert macro in the kernel? Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] kernel/cgroup.c: remove dead code
On 10/24/07, Adrian Bunk <[EMAIL PROTECTED]> wrote: > > Two questions: > - Is it really intended to perhaps change release_agent_path[] to have > less than PATH_MAX size? I've got no intention to do so currently. > - If yes, do you want to return -E2BIG for (nbytes >= PATH_MAX) or for > (nbytes >= sizeof(root->release_agent_path)) ? I think E2BIG for the former for backwards compatabilty; the latter could be either ENOSPC or E2BIG; i.e. both checks are useful - one to stop us allocating more memory than is sensible, and one to stop us overrunning the buffer; the fact that these two are the same size at the moment is coincidence. I guess ideally the first check would be for the max() of any of the data structures that we expect to be able to write over; PATH_MAX was just picked as a convenience. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] kernel/cgroup.c: make 2 functions static
On 10/24/07, Adrian Bunk <[EMAIL PROTECTED]> wrote: > cgroup_is_releasable() and notify_on_release() should be static, > not global inline. > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > > --- > > kernel/cgroup.c |4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > 626d10ec224de07fc7906b0aa82e035e153709ce > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 5987dcc..fec1726 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -143,7 +143,7 @@ enum { > ROOT_NOPREFIX, /* mounted subsystems have no named prefix */ > }; > > -inline int cgroup_is_releasable(const struct cgroup *cgrp) > +static int cgroup_is_releasable(const struct cgroup *cgrp) > { > const int bits = > (1 << CGRP_RELEASABLE) | > @@ -151,7 +151,7 @@ inline int cgroup_is_releasable(const struct cgroup *cgrp) > return (cgrp->flags & bits) == bits; > } > > -inline int notify_on_release(const struct cgroup *cgrp) > +static int notify_on_release(const struct cgroup *cgrp) > { > return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags); > } > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] kernel/cgroup.c: remove dead code
I think I'd rather not make this change - if we later changed the size of release_agent_path[] this could silently fail. Can we get around the coverity checker somehow? Paul On 10/24/07, Adrian Bunk <[EMAIL PROTECTED]> wrote: > This patch removes dead code spotted by the Coverity checker > (look at the "(nbytes >= PATH_MAX)" check). > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> > > --- > > kernel/cgroup.c | 18 -- > 1 file changed, 8 insertions(+), 10 deletions(-) > > --- linux-2.6/kernel/cgroup.c.old 2007-10-23 18:37:43.0 +0200 > +++ linux-2.6/kernel/cgroup.c 2007-10-23 18:39:15.0 +0200 > @@ -1320,90 +1320,88 @@ static ssize_t cgroup_common_file_write( > > if (nbytes >= PATH_MAX) > return -E2BIG; > > /* +1 for nul-terminator */ > buffer = kmalloc(nbytes + 1, GFP_KERNEL); > if (buffer == NULL) > return -ENOMEM; > > if (copy_from_user(buffer, userbuf, nbytes)) { > retval = -EFAULT; > goto out1; > } > buffer[nbytes] = 0; /* nul-terminate */ > > mutex_lock(&cgroup_mutex); > > if (cgroup_is_removed(cgrp)) { > retval = -ENODEV; > goto out2; > } > > switch (type) { > case FILE_TASKLIST: > retval = attach_task_by_pid(cgrp, buffer); > break; > case FILE_NOTIFY_ON_RELEASE: > clear_bit(CGRP_RELEASABLE, &cgrp->flags); > if (simple_strtoul(buffer, NULL, 10) != 0) > set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags); > else > clear_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags); > break; > case FILE_RELEASE_AGENT: > { > struct cgroupfs_root *root = cgrp->root; > /* Strip trailing newline */ > if (nbytes && (buffer[nbytes-1] == '\n')) { > buffer[nbytes-1] = 0; > } > - if (nbytes < sizeof(root->release_agent_path)) { > - /* We never write anything other than '\0' > -* into the last char of release_agent_path, > -* so it always remains a NUL-terminated > -* string */ > - strncpy(root->release_agent_path, buffer, nbytes); > - root->release_agent_path[nbytes] = 0; > - } else { > - retval = -ENOSPC; > - } > + > + /* We never write anything other than '\0' > +* into the last char of release_agent_path, > +* so it always remains a NUL-terminated > +* string */ > + strncpy(root->release_agent_path, buffer, nbytes); > + root->release_agent_path[nbytes] = 0; > + > break; > } > default: > retval = -EINVAL; > goto out2; > } > > if (retval == 0) > retval = nbytes; > out2: > mutex_unlock(&cgroup_mutex); > out1: > kfree(buffer); > return retval; > } > > static ssize_t cgroup_file_write(struct file *file, const char __user *buf, > size_t nbytes, loff_t *ppos) > { > struct cftype *cft = __d_cft(file->f_dentry); > struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); > > if (!cft) > return -ENODEV; > if (cft->write) > return cft->write(cgrp, cft, file, buf, nbytes, ppos); > if (cft->write_uint) > return cgroup_write_uint(cgrp, cft, file, buf, nbytes, ppos); > return -EINVAL; > } > > static ssize_t cgroup_read_uint(struct cgroup *cgrp, struct cftype *cft, >struct file *file, >char __user *buf, size_t nbytes, >loff_t *ppos) > { > char tmp[64]; > u64 val = cft->read_uint(cgrp, cft); > int len = sprintf(tmp, "%llu\n", (unsigned long long) val); > > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] kernel/cgroup.c: make 2 functions static
On 10/24/07, Adrian Bunk <[EMAIL PROTECTED]> wrote: > cgroup_is_releasable() and notify_on_release() should be static, > not global inline. > They seem like they could be usefully static inline - or will the compiler inline them anyway since they're simple enough? Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] CFS CGroup: Code cleanup
Clean up some CFS CGroup code - replace "cont" with "cgrp" in a few places in the CFS cgroup code, - use write_uint rather than write for cpu.shares write function Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- This is a resend of yesterday's mail with the same subject; the final hunk was missing from yesterday's patch. The second patch in the series was unaffected by this. kernel/sched.c | 53 ++--- 1 file changed, 18 insertions(+), 35 deletions(-) Index: container-2.6.23-mm1/kernel/sched.c === --- container-2.6.23-mm1.orig/kernel/sched.c +++ container-2.6.23-mm1/kernel/sched.c @@ -6936,25 +6936,25 @@ unsigned long sched_group_shares(struct #ifdef CONFIG_FAIR_CGROUP_SCHED /* return corresponding task_group object of a cgroup */ -static inline struct task_group *cgroup_tg(struct cgroup *cont) +static inline struct task_group *cgroup_tg(struct cgroup *cgrp) { - return container_of(cgroup_subsys_state(cont, cpu_cgroup_subsys_id), -struct task_group, css); + return container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id), + struct task_group, css); } static struct cgroup_subsys_state * -cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) +cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { struct task_group *tg; - if (!cont->parent) { + if (!cgrp->parent) { /* This is early initialization for the top cgroup */ - init_task_group.css.cgroup = cont; + init_task_group.css.cgroup = cgrp; return &init_task_group.css; } /* we support only 1-level deep hierarchical scheduler atm */ - if (cont->parent->parent) + if (cgrp->parent->parent) return ERR_PTR(-EINVAL); tg = sched_create_group(); @@ -6962,21 +6962,21 @@ cpu_cgroup_create(struct cgroup_subsys * return ERR_PTR(-ENOMEM); /* Bind the cgroup to task_group object we just created */ - tg->css.cgroup = cont; + tg->css.cgroup = cgrp; return &tg->css; } static void cpu_cgroup_destroy(struct cgroup_subsys *ss, - struct cgroup *cont) + struct cgroup *cgrp) { - struct task_group *tg = cgroup_tg(cont); + struct task_group *tg = cgroup_tg(cgrp); sched_destroy_group(tg); } static int cpu_cgroup_can_attach(struct cgroup_subsys *ss, -struct cgroup *cont, struct task_struct *tsk) +struct cgroup *cgrp, struct task_struct *tsk) { /* We don't support RT-tasks being in separate groups */ if (tsk->sched_class != &fair_sched_class) @@ -6986,38 +6986,21 @@ static int cpu_cgroup_can_attach(struct } static void -cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cont, +cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cont, struct task_struct *tsk) { sched_move_task(tsk); } -static ssize_t cpu_shares_write(struct cgroup *cont, struct cftype *cftype, - struct file *file, const char __user *userbuf, - size_t nbytes, loff_t *ppos) +static int cpu_shares_write_uint(struct cgroup *cgrp, struct cftype *cftype, + u64 shareval) { - unsigned long shareval; - struct task_group *tg = cgroup_tg(cont); - char buffer[2*sizeof(unsigned long) + 1]; - int rc; - - if (nbytes > 2*sizeof(unsigned long))/* safety check */ - return -E2BIG; - - if (copy_from_user(buffer, userbuf, nbytes)) - return -EFAULT; - - buffer[nbytes] = 0; /* nul-terminate */ - shareval = simple_strtoul(buffer, NULL, 10); - - rc = sched_group_set_shares(tg, shareval); - - return (rc < 0 ? rc : nbytes); + return sched_group_set_shares(cgroup_tg(cgrp), shareval); } -static u64 cpu_shares_read_uint(struct cgroup *cont, struct cftype *cft) +static u64 cpu_shares_read_uint(struct cgroup *cgrp, struct cftype *cft) { - struct task_group *tg = cgroup_tg(cont); + struct task_group *tg = cgroup_tg(cgrp); return (u64) tg->shares; } @@ -7025,7 +7008,7 @@ static u64 cpu_shares_read_uint(struct c static struct cftype cpu_shares = { .name = "shares", .read_uint = cpu_shares_read_uint, - .write = cpu_shares_write, + .write_uint = cpu_shares_write_uint, }; static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTE
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/23/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > Suppose you have two cgroups that would each want to use, say, 55% of > > a CPU - technically they should each be regarded as having 45% idle > > time, but if they run on a the same CPU the chances are that they will > > both always have some processes on their runqueue due to contention > > with the other group. So how would you measure the difference between > > this and a cgroup that really is trying to use 100%? > > Good point. I think we need to subtract out the time it was waiting on > runqueue > when calculating idle time. > > |--- . . . . . . -...---| > t0 t1t2 t3 t4 t5 t6 > > > -> Running time > -> Waiting time (to get on the cpu) > -> Sleeping time (when it didnt want to run because of >lack of tasks) > > So, in this case, > > idle time = (t4 - t3) / [ (t6 - t1) - (t2-t1) - (t5-t4) > Do you mean (t6 - t0) where you have (t6 - t1)? > ? > > This idle time will be a per-cpu stat for every cgroup and needs to be > consolidated across cpus into a single idle-stat number, just like how > top does it. This would be an idle fraction, not an idle time. (seconds divided by seconds) It doesn't seem quite right to me that a cgroup's idle time metric be affected by the activity of other cgroups on the machine, but it's hard to come up with a way of measuring it that doesn't have this behaviour - which is why, in the absence of hard CPU partitioning, it's not clear to me how much use this would be. What would people be planning to use it for? Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 9/12] cgroup: kill unused variable
On 10/23/07, Jeff Garzik <[EMAIL PROTECTED]> wrote: > Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> Acked-by: Paul Menage <[EMAIL PROTECTED]> > --- > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 5987dcc..3fe21e1 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -2402,7 +2402,6 @@ struct file_operations proc_cgroup_operations = { > static int proc_cgroupstats_show(struct seq_file *m, void *v) > { > int i; > - struct cgroupfs_root *root; > > seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\n"); > mutex_lock(&cgroup_mutex); > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/23/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > Adds a cpu.usage file to the CFS cgroup that reports CPU usage in > > milliseconds for that cgroup's tasks > > It would be nice to split this into user and sys time at some point. Sounds reasonable - but does CFS track this? > We have also received request to provide idle time for a > container/cgroup. The semantics of "idle time" for a cgroup on a shared system seem a bit fuzzy. How would you define it? Suppose you have two cgroups that would each want to use, say, 55% of a CPU - technically they should each be regarded as having 45% idle time, but if they run on a the same CPU the chances are that they will both always have some processes on their runqueue due to contention with the other group. So how would you measure the difference between this and a cgroup that really is trying to use 100%? Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Move cgroups destroy() callbacks to cgroup_diput()
Move the calls to the cgroup subsystem destroy() methods from cgroup_rmdir() to cgroup_diput(). This allows control file reads and writes to access their subsystem state without having to be concerned with locking against cgroup destruction - the control file dentry will keep the cgroup and its subsystem state objects alive until the file is closed. The documentation is updated to reflect the changed semantics of destroy(); additionally the locking comments for destroy() and some other methods were clarified and decrustified. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- Documentation/cgroups.txt | 22 +++--- kernel/cgroup.c | 36 2 files changed, 35 insertions(+), 23 deletions(-) Index: container-2.6.23-mm1/kernel/cgroup.c === --- container-2.6.23-mm1.orig/kernel/cgroup.c +++ container-2.6.23-mm1/kernel/cgroup.c @@ -592,6 +592,7 @@ static void cgroup_diput(struct dentry * /* is dentry a directory ? if so, kfree() associated cgroup */ if (S_ISDIR(inode->i_mode)) { struct cgroup *cgrp = dentry->d_fsdata; + struct cgroup_subsys *ss; BUG_ON(!(cgroup_is_removed(cgrp))); /* It's possible for external users to be holding css * reference counts on a cgroup; css_put() needs to @@ -600,6 +601,23 @@ static void cgroup_diput(struct dentry * * queue the cgroup to be handled by the release * agent */ synchronize_rcu(); + + mutex_lock(&cgroup_mutex); + /* +* Release the subsystem state objects. +*/ + for_each_subsys(cgrp->root, ss) { + if (cgrp->subsys[ss->subsys_id]) + ss->destroy(ss, cgrp); + } + + cgrp->root->number_of_cgroups--; + mutex_unlock(&cgroup_mutex); + + /* Drop the active superblock reference that we took when we +* created the cgroup */ + deactivate_super(cgrp->root->sb); + kfree(cgrp); } iput(inode); @@ -1333,6 +1351,10 @@ static ssize_t cgroup_common_file_write( mutex_lock(&cgroup_mutex); + /* +* This was already checked for in cgroup_file_write(), but +* check again now we're holding cgroup_mutex. +*/ if (cgroup_is_removed(cgrp)) { retval = -ENODEV; goto out2; @@ -1388,7 +1410,7 @@ static ssize_t cgroup_file_write(struct struct cftype *cft = __d_cft(file->f_dentry); struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); - if (!cft) + if (!cft || cgroup_is_removed(cgrp)) return -ENODEV; if (cft->write) return cft->write(cgrp, cft, file, buf, nbytes, ppos); @@ -1458,7 +1480,7 @@ static ssize_t cgroup_file_read(struct f struct cftype *cft = __d_cft(file->f_dentry); struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); - if (!cft) + if (!cft || cgroup_is_removed(cgrp)) return -ENODEV; if (cft->read) @@ -2139,7 +2161,6 @@ static int cgroup_rmdir(struct inode *un struct cgroup *cgrp = dentry->d_fsdata; struct dentry *d; struct cgroup *parent; - struct cgroup_subsys *ss; struct super_block *sb; struct cgroupfs_root *root; @@ -2164,11 +2185,6 @@ static int cgroup_rmdir(struct inode *un return -EBUSY; } - for_each_subsys(root, ss) { - if (cgrp->subsys[ss->subsys_id]) - ss->destroy(ss, cgrp); - } - spin_lock(&release_list_lock); set_bit(CGRP_REMOVED, &cgrp->flags); if (!list_empty(&cgrp->release_list)) @@ -2183,15 +2199,11 @@ static int cgroup_rmdir(struct inode *un cgroup_d_remove_dir(d); dput(d); - root->number_of_cgroups--; set_bit(CGRP_RELEASABLE, &parent->flags); check_for_release(parent); mutex_unlock(&cgroup_mutex); - /* Drop the active superblock reference that we took when we -* created the cgroup */ - deactivate_super(sb); return 0; } Index: container-2.6.23-mm1/Documentation/cgroups.txt === --- container-2.6.23-mm1.orig/Documentation/cgroups.txt +++ container-2.6.23-mm1/Documentation/cgroups.txt @@ -456,7 +456,7 @@ methods are create/destroy. Any others t be successful no-ops. struct cgroup_subsys_state *create(struct cgroup *cont) -LL=cgroup_mutex +(cgroup_mutex held by caller) Called to create a subsystem state object for a cgroup. The
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/23/07, Balbir Singh <[EMAIL PROTECTED]> wrote: > > Well, without notify_on_release you can never be sure if a new task > got added to the control group or if someone acquired a reference > to it. I can't think of a safe way of removing control groups/cpusets > without using notify_on_release. > Using notify_on_release doesn't make any guarantees that the notification, or the action based on that notification, will occur before the cgroup becomes busy again. An rmdir() on a busy cgroup is perfectly safe, it just fails with -EBUSY. I'd just like to avoid making active control file fds trigger than failure mode. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/23/07, Balbir Singh <[EMAIL PROTECTED]> wrote: > > Well, most people who care about deletion will use the notify_on_release > callback and retry. I'm not convinced this is true. Certainly the userspace approaches we're developing at Google don't (currently) use notify_on_release. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] CFS CGroup: Code cleanup
On 10/22/07, Paul Menage <[EMAIL PROTECTED]> wrote: > On 10/22/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > > > Minor nit: From pov of making this patch series git bisect safe, shouldn't > > we > > be registering a write_uint() handler in this patch as well? > > > > Yes, we should. Sigh. I originally had the cleanup and the new > reporting interface in the same patch, and decided to split them apart > into a cleanup patch and a new feature patch, but clearly goofed. I'll > resend tomorrow with the write_uint registration in the right place. OK, this wasn't a patch goof - the original patch in my tree does have the addition of write_uint in the cftype definition. I guess the last hunk got lost as I transferred it to the email. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/22/07, Paul Menage <[EMAIL PROTECTED]> wrote: > > Using cgroup_mutex is certainly possible for now, although more > heavy-weight than I'd like long term. Using css_get isn't the right > approach, I think - we shouldn't be able to cause an rmdir to fail due > to a concurrent read. > OK, the obvious solution is to use the same approach for subsystem state objects as we do for the struct cgroup itself - move the calls to the subsystem destroy methods to cgroup_diput. A control file dentry will keep alive the parent dir's dentry, which will keep alive the cgroup and (with this change) the subsystem state objects too. The only potential drawback that I can see is that an open fd on a cgroup directory or a control file will keep more memory alive than it would have done previously. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] CFS CGroup: Code cleanup
On 10/22/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > > Minor nit: From pov of making this patch series git bisect safe, shouldn't we > be registering a write_uint() handler in this patch as well? > Yes, we should. Sigh. I originally had the cleanup and the new reporting interface in the same patch, and decided to split them apart into a cleanup patch and a new feature patch, but clearly goofed. I'll resend tomorrow with the write_uint registration in the right place. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
On 10/22/07, Paul Jackson <[EMAIL PROTECTED]> wrote: > Steven wrote: > > +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu) > > +{ > > + cpu_set(cpu, task_cs(tsk)->rt_overload); > > +} > > Question for Steven: > > What locks are held when cpuset_rt_set_overload() is called? > > Questions for Paul Menage: > > Does 'tsk' need to be locked for the above task_cs() call? Cgroups doesn't change the locking rules for accessing a cpuset from a task - you have to have one of: - task_lock(task) - callback_mutex - be in an RCU section from the point when you call task_cs to the point when you stop using its result. (Additionally, in this case there's no guarantee that the task stays in this cpuset for the duration of the RCU section). Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/22/07, Balbir Singh <[EMAIL PROTECTED]> wrote: > > I think we also need the notion of load, like we have in cpu_acct.c Yes, a notion of load would be good - but the "load" calculated by cpu_acct.c isn't all that useful yet - it's just a total of the CPU cycles used in the 10 second time interval prior to the current interval. Something that's more analagous to the real machine load would be nice, and I hope the numbers could be made available via CFS. > Don't we need to do a css_get() on the cgrp to ensure that the cgroup > does not go away if it's empty and someone does an rmdir on it? See my other email. Yes, I think we need more here, but probably something more generic. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] CFS CGroup: Report usage
On 10/22/07, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote: > On Mon, Oct 22, 2007 at 05:49:39PM -0700, Paul Menage wrote: > > +static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft) > > +{ > > + struct task_group *tg = cgroup_tg(cgrp); > > + int i; > > + u64 res = 0; > > + for_each_possible_cpu(i) { > > + unsigned long flags; > > + spin_lock_irqsave(&tg->cfs_rq[i]->rq->lock, flags); > > Is the lock absolutely required here? I'm not sure, I was hoping you or Ingo could comment on this. But some kind of locking seems to required at least on 32-bit platforms, since sum_exec_runtime is a 64-bit number. > > Hmm .. I hope the cgroup code prevents a task group from being destroyed while > we are still reading a task group's cpu usage. Is that so? Good point - cgroups certainly prevents a cgroup itself from being freed while a control file is being read in an RCU section, and prevents a task group from being destroyed when that task group has been read via a task's cgroups pointer and the reader is still in an RCU section, but we need a generic protection for subsystem state objects being accessed via control files too. Using cgroup_mutex is certainly possible for now, although more heavy-weight than I'd like long term. Using css_get isn't the right approach, I think - we shouldn't be able to cause an rmdir to fail due to a concurrent read. Paul - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] CFS CGroup: Code cleanup
Clean up some CFS CGroup code - replace "cont" with "cgrp" in a few places in the CFS cgroup code, - use write_uint rather than write for cpu.shares write function Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/sched.c | 51 +-- 1 file changed, 17 insertions(+), 34 deletions(-) Index: container-2.6.23-mm1/kernel/sched.c === --- container-2.6.23-mm1.orig/kernel/sched.c +++ container-2.6.23-mm1/kernel/sched.c @@ -6936,25 +6936,25 @@ unsigned long sched_group_shares(struct #ifdef CONFIG_FAIR_CGROUP_SCHED /* return corresponding task_group object of a cgroup */ -static inline struct task_group *cgroup_tg(struct cgroup *cont) +static inline struct task_group *cgroup_tg(struct cgroup *cgrp) { - return container_of(cgroup_subsys_state(cont, cpu_cgroup_subsys_id), -struct task_group, css); + return container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id), + struct task_group, css); } static struct cgroup_subsys_state * -cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) +cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { struct task_group *tg; - if (!cont->parent) { + if (!cgrp->parent) { /* This is early initialization for the top cgroup */ - init_task_group.css.cgroup = cont; + init_task_group.css.cgroup = cgrp; return &init_task_group.css; } /* we support only 1-level deep hierarchical scheduler atm */ - if (cont->parent->parent) + if (cgrp->parent->parent) return ERR_PTR(-EINVAL); tg = sched_create_group(); @@ -6962,21 +6962,21 @@ cpu_cgroup_create(struct cgroup_subsys * return ERR_PTR(-ENOMEM); /* Bind the cgroup to task_group object we just created */ - tg->css.cgroup = cont; + tg->css.cgroup = cgrp; return &tg->css; } static void cpu_cgroup_destroy(struct cgroup_subsys *ss, - struct cgroup *cont) + struct cgroup *cgrp) { - struct task_group *tg = cgroup_tg(cont); + struct task_group *tg = cgroup_tg(cgrp); sched_destroy_group(tg); } static int cpu_cgroup_can_attach(struct cgroup_subsys *ss, -struct cgroup *cont, struct task_struct *tsk) +struct cgroup *cgrp, struct task_struct *tsk) { /* We don't support RT-tasks being in separate groups */ if (tsk->sched_class != &fair_sched_class) @@ -6986,38 +6986,21 @@ static int cpu_cgroup_can_attach(struct } static void -cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cont, +cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cont, struct task_struct *tsk) { sched_move_task(tsk); } -static ssize_t cpu_shares_write(struct cgroup *cont, struct cftype *cftype, - struct file *file, const char __user *userbuf, - size_t nbytes, loff_t *ppos) +static int cpu_shares_write_uint(struct cgroup *cgrp, struct cftype *cftype, + u64 shareval) { - unsigned long shareval; - struct task_group *tg = cgroup_tg(cont); - char buffer[2*sizeof(unsigned long) + 1]; - int rc; - - if (nbytes > 2*sizeof(unsigned long))/* safety check */ - return -E2BIG; - - if (copy_from_user(buffer, userbuf, nbytes)) - return -EFAULT; - - buffer[nbytes] = 0; /* nul-terminate */ - shareval = simple_strtoul(buffer, NULL, 10); - - rc = sched_group_set_shares(tg, shareval); - - return (rc < 0 ? rc : nbytes); + return sched_group_set_shares(cgroup_tg(cgrp), shareval); } -static u64 cpu_shares_read_uint(struct cgroup *cont, struct cftype *cft) +static u64 cpu_shares_read_uint(struct cgroup *cgrp, struct cftype *cft) { - struct task_group *tg = cgroup_tg(cont); + struct task_group *tg = cgroup_tg(cgrp); return (u64) tg->shares; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] CFS CGroup: Report usage
Report CPU usage in CFS Cgroup directories Adds a cpu.usage file to the CFS cgroup that reports CPU usage in milliseconds for that cgroup's tasks This replaces the "example CPU Accounting CGroup subsystem" that was merged into mainline last week. Signed-off-by: Paul Menage <[EMAIL PROTECTED]> --- kernel/sched.c | 32 +++- 1 file changed, 27 insertions(+), 5 deletions(-) Index: container-2.6.23-mm1/kernel/sched.c === --- container-2.6.23-mm1.orig/kernel/sched.c +++ container-2.6.23-mm1/kernel/sched.c @@ -7005,15 +7005,37 @@ static u64 cpu_shares_read_uint(struct c return (u64) tg->shares; } -static struct cftype cpu_shares = { - .name = "shares", - .read_uint = cpu_shares_read_uint, - .write_uint = cpu_shares_write_uint, +static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft) +{ + struct task_group *tg = cgroup_tg(cgrp); + int i; + u64 res = 0; + for_each_possible_cpu(i) { + unsigned long flags; + spin_lock_irqsave(&tg->cfs_rq[i]->rq->lock, flags); + res += tg->se[i]->sum_exec_runtime; + spin_unlock_irqrestore(&tg->cfs_rq[i]->rq->lock, flags); + } + /* Convert from ns to ms */ + do_div(res, 100); + return res; +} + +static struct cftype cpu_files[] = { + { + .name = "shares", + .read_uint = cpu_shares_read_uint, + .write_uint = cpu_shares_write_uint, + }, + { + .name = "usage", + .read_uint = cpu_usage_read, + }, }; static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) { - return cgroup_add_file(cont, ss, &cpu_shares); + return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files)); } struct cgroup_subsys cpu_cgroup_subsys = { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] CFS CGroup: cleanup & usage reporting
These two patches consist of a small cleanup to CFS, and adding a control file reporting CPU usage in milliseconds in each CGroup directory. They're just bundled together since the second patch depends slightly on the cleanups in the first patch. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/