[Devel] Re: [PATCH 1/1, v7] cgroup/freezer: add per freezer duty ratio control

2011-02-14 Thread Matt Helsley
On Mon, Feb 14, 2011 at 11:41:42AM -0800, jacob pan wrote:
> On Sat, 12 Feb 2011 15:29:07 -0800
> Matt Helsley  wrote:
> 
> > On Fri, Feb 11, 2011 at 11:10:44AM -0800,
> > jacob.jun@linux.intel.com wrote:
> > > From: Jacob Pan 



> > > cgroup. +To make the tasks frozen at 90% of the time every 5
> > > seconds, do: +
> > > +[root@localhost ]# echo 90 > freezer.duty_ratio_pct
> > > +[root@localhost ]# echo 5000 > freezer.period_ms
> > > +
> > > +After that, the application in this freezer cgroup will only be
> > > +allowed to run at the following pattern.
> > > + ______
> > > +|  |<-- 90% frozen -->|  |  |  |
> > > +|  |__|  |__|  |_
> > > +
> > > +|< 5 seconds >|
> > 
> > So most of the time I've been reviewing this I managed to invert it!
> > I imagined "duty" meant the tasks were "on duty" ie runnable ie
> > thawed. But according this this documentation it's the opposite...
> > 
> My logic is that since this is a freezer, so positive logic should be
> frozen instead of thaw.

Yup, I figured as much. That's the reason I didn't ask you to swap the
meaning of the ratio values.

> > I've reviewed my review and now my comments are consistent with the
> > above. :) However it makes me wonder if there are better names which
> > would avoid this confusion.
> > 
> How about frozen_time_pct?

Much better! nit: I don't know if _pct is obvious to everyone but it only
takes 4 more characters to make it so..

> > > diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c



> > > +static void freezer_work_fn(struct work_struct *work)
> > > +{
> > > + struct freezer *freezer;
> > > + unsigned long delay_jiffies = 0;
> > > + enum freezer_state goal_state;
> > > +
> > 
> > Looking better. There are alot of field accesses here which can race
> > with writes to the cgroup's duty ratio and period files. They should
> > be protected. Perhaps we can reuse the freezer spin lock. That also
> > has the benefit that we can eliminate the toggle.freeze_thaw bit I
> > think:
> > 
> I did think about the race, it does exist. but in practice. My thought
> was that since freezer_change_state() holds the spin_lock of the
> freezer, the race with writes to params are harmless, it just means the
> new period or ratio will take effect in the next period.

I considered this but didn't like the idea of relying on it. More below.

> In terms of using freezer spin lock to eliminate toggle flag, I am not
> sure if i know how to do that. Are you suggesting based on whether the
> spin lock is taken or not, we can decide the toggle? but the freeze
> spin lock is used by other functions as well not just the delay work
> here. I guess I have missed something.

I was thinking that with the lock held you can check the state variable
and just do the "opposite" of what it indicates:

state   TODO
FROZEN  THAWED
FREEZINGTHAWED
THAWED  FROZEN

Then you don't need the separate bit to indicate which state it should
try to change to next.

> > > +
> > > + freezer = container_of(work, struct freezer,
> > > freezer_work.work);
> > > + /* toggle between THAWED and FROZEN state.
> > > +  * thaw if freezer->toggle.freeze_thaw = 0; freeze
> > > otherwise
> > > +  * skip the first round if already in the target states.
> > > +  */
> > 
> > spin_lock(&freezer->lock);
> > 
> > > + if ((freezer->toggle.freeze_thaw && freezer->state ==
> > > CGROUP_FROZEN) ||
> > > + (!freezer->toggle.freeze_thaw &&
> > > + freezer->state == CGROUP_THAWED)) {
> > > + delay_jiffies = 0;
> > 
> > This looks wrong. We should never schedule freezer work delayed by 0
> > jiffies -- even if the delayed work API allows it. With 0-length
> > delays I'd worry that we could peg the CPU in an obscure infinite
> > loop.
> > 
> > I think you can safely eliminate this block and the "exit_toggle"
> > label.
> > 
> Good point. My initial thought was that since the period for targeted
> usage is quite long, e.g. 30 sec., we want to start the duty ratio
> right away. But that shouldn't matter since we already schedule work
> based on the new ratio/period.
> > > + goto exit_toggle;
> > > + } else if (freezer->toggle.freeze_thaw) {
> > 
> > if (freezer->state == CGROUP_THAWED) {
> > 
> > > + goal_state = CGROUP_FROZEN;
> > > + delay_jiffies =
> > > msecs_to_jiffies(freezer->duty.ratio *
> > > +
> > > freezer->duty.period_pct_ms);
> > > + } else  {
> > > + goal_state = CGROUP_THAWED;
> > > + delay_jiffies = msecs_to_jiffies((100 -
> > > freezer->duty.ratio) *
> > > +
> > > freezer->duty.period_pct_ms);
> > > + }
> > > + freezer_change_state(freezer->css.cgroup, goal_state);
> > 
> > __freezer_change_state(freezer->css.cgroup, goal_state);
> > spin_unlock(&freezer->lock);
> > 
> > (where the __freezer_change_state() function expects to already ha

[Devel] Re: [PATCH 1/1, v7] cgroup/freezer: add per freezer duty ratio control

2011-02-14 Thread Andrew Morton
On Sun, 13 Feb 2011 19:23:10 -0800
Arjan van de Ven  wrote:

> On 2/13/2011 4:44 PM, KAMEZAWA Hiroyuki wrote:
> > On Sat, 12 Feb 2011 15:29:07 -0800
> > Matt Helsley  wrote:
> >
> >> On Fri, Feb 11, 2011 at 11:10:44AM -0800, jacob.jun@linux.intel.com 
> >> wrote:
> >>> From: Jacob Pan
> >>>
> >>> Freezer subsystem is used to manage batch jobs which can start
> >>> stop at the same time. However, sometime it is desirable to let
> >>> the kernel manage the freezer state automatically with a given
> >>> duty ratio.
> >>> For example, if we want to reduce the time that backgroup apps
> >>> are allowed to run we can put them into a freezer subsystem and
> >>> set the kernel to turn them THAWED/FROZEN at given duty ratio.
> >>>
> >>> This patch introduces two file nodes under cgroup
> >>> freezer.duty_ratio_pct and freezer.period_sec
> >> Again: I don't think this is the right approach in the long term.
> >> It would be better not to add this interface and instead enable the
> >> cpu cgroup subsystem for non-rt tasks using a similar duty ratio
> >> concept..
> >>
> >> Nevertheless, I've added some feedback on the code for you here :).
> >>
> > AFAIK, there was a work for bandwidth control in CFS.
> >
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-10/msg04335.html
> >
> > I tested this and worked fine. This schduler approach seems better for
> > my purpose to limit bandwidth of apprications rather than freezer.
> 
> for our purpose, it's not about bandwidth.
> it's about making sure the class of apps don't run for a long period 
> (30-second range) of time.
> 

The discussion about this patchset seems to have been upside-down: lots
of talk about a particular implementation, with people walking back
from the implemetnation trying to work out what the requirements were,
then seeing if other implementations might suit those requirements. 
Whatever they were.

I think it would be helpful to start again, ignoring (for now) any
implementation.


What are the requirements here, guys?  What effects are we actually
trying to achieve?  Once that is understood and agreed to, we can
think about implementations.


And maybe you people _are_ clear about the requirements.  But I'm not and
I'm sure many others aren't too.  A clear statement of them would help
things along and would doubtless lead to better code.  This is pretty
basic stuff!

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/3] procfs wrt pid namespace cleanups

2011-02-14 Thread Oleg Nesterov
On 02/14, Daniel Lezcano wrote:
>
> * removed buggy patch #4 from the initial patchset

Hmm. probably it was me who confused you... I am sorry if this is true.

I don't think that patch was buggy.  What I meant, it looks like
the preparation for the next changes
(http://kerneltrap.org/mailarchive/linux-kernel/2010/6/20/4585095)
and those changes are not correct.

In fact, 1/3 looks the same.

Anyway, this series looks correct.

Oleg.

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/3] procfs: kill the global proc_mnt variable

2011-02-14 Thread Daniel Lezcano
From: Oleg Nesterov 

After the previous cleanup in proc_get_sb() the global proc_mnt has
no reasons to exists, kill it.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Daniel Lezcano 
---
 fs/proc/inode.c|2 --
 fs/proc/internal.h |1 -
 fs/proc/root.c |7 ---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 176ce4c..ee0f802 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -42,8 +42,6 @@ static void proc_evict_inode(struct inode *inode)
sysctl_head_put(PROC_I(inode)->sysctl);
 }
 
-struct vfsmount *proc_mnt;
-
 static struct kmem_cache * proc_inode_cachep;
 
 static struct inode *proc_alloc_inode(struct super_block *sb)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9ad561d..c03e8d3 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -107,7 +107,6 @@ static inline struct proc_dir_entry *pde_get(struct 
proc_dir_entry *pde)
 }
 void pde_put(struct proc_dir_entry *pde);
 
-extern struct vfsmount *proc_mnt;
 int proc_fill_super(struct super_block *);
 struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index e5e2bfa..a9000e9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -90,19 +90,20 @@ static struct file_system_type proc_fs_type = {
 
 void __init proc_root_init(void)
 {
+   struct vfsmount *mnt;
int err;
 
proc_init_inodecache();
err = register_filesystem(&proc_fs_type);
if (err)
return;
-   proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
-   if (IS_ERR(proc_mnt)) {
+   mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
+   if (IS_ERR(mnt)) {
unregister_filesystem(&proc_fs_type);
return;
}
 
-   init_pid_ns.proc_mnt = proc_mnt;
+   init_pid_ns.proc_mnt = mnt;
proc_symlink("mounts", NULL, "self/mounts");
 
proc_net_init();
-- 
1.7.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/3] procfs wrt pid namespace cleanups

2011-02-14 Thread Daniel Lezcano
This patchset is a cleanup and a preparation to unshare the pid
namespace. These prerequisite prepare the next Eric's patchset
to give a file descriptor to a namespace and join an existing
namespace.

The initial authors of this patchset are Eric Biederman and Oleg
Nesterov.

Changelog:

02/14/11 - daniel.lezc...@free.fr
* patch 2/3 : fix return ENOMEM and put_pid_ns on error
* removed buggy patch #4 from the initial patchset

01/31/11 - daniel.lezc...@free.fr:
* patch 1/4 : wrapped test in a function
* patch 2/4 : handle proc_pid_ns_prepare_proc error
* patch 2/4 : put parent pid_ns on error
* other patches : refreshed against linux-next


___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/3] pid: Remove the child_reaper special case in init/main.c

2011-02-14 Thread Daniel Lezcano
From: Eric W. Biederman 

It turns out that the existing assignment in copy_process of
the child_reaper can handle the initial assignment of child_reaper
we just need to generalize the test in kernel/fork.c

Signed-off-by: Eric W. Biederman 
Signed-off-by: Daniel Lezcano 
---
 include/linux/pid.h |   11 +++
 init/main.c |9 -
 kernel/fork.c   |2 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..efceda0 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -141,6 +141,17 @@ static inline struct pid_namespace *ns_of_pid(struct pid 
*pid)
 }
 
 /*
+ * is_child_reaper returns true if the pid is the init process
+ * of the current namespace. As this one could be checked before
+ * pid_ns->child_reaper is assigned in copy_process, we check
+ * with the pid number.
+ */
+static inline bool is_child_reaper(struct pid *pid)
+{
+   return pid->numbers[pid->level].nr == 1;
+}
+
+/*
  * the helpers to get the pid's id seen from different namespaces
  *
  * pid_nr(): global id, i.e. the id seen from the init namespace;
diff --git a/init/main.c b/init/main.c
index 33c37c3..793ebfd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -875,15 +875,6 @@ static int __init kernel_init(void * unused)
 * init can run on any cpu.
 */
set_cpus_allowed_ptr(current, cpu_all_mask);
-   /*
-* Tell the world that we're going to be the grim
-* reaper of innocent orphaned children.
-*
-* We don't want people to have to make incorrect
-* assumptions about where in the task array this
-* can be found.
-*/
-   init_pid_ns.child_reaper = current;
 
cad_pid = task_pid(current);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 25e4291..c9f0784 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1289,7 +1289,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
tracehook_finish_clone(p, clone_flags, trace);
 
if (thread_group_leader(p)) {
-   if (clone_flags & CLONE_NEWPID)
+   if (is_child_reaper(pid))
p->nsproxy->pid_ns->child_reaper = p;
 
p->signal->leader_pid = pid;
-- 
1.7.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/3] pidns: Call pid_ns_prepare_proc from create_pid_namespace

2011-02-14 Thread Daniel Lezcano
From: Eric W. Biederman 

Reorganize proc_get_sb so it can be called before the struct pid
of the first process is allocated.

Signed-off-by: Eric W. Biederman 
Signed-off-by: Daniel Lezcano 
---
 fs/proc/root.c |   25 +++--
 kernel/fork.c  |6 --
 kernel/pid_namespace.c |   11 +--
 3 files changed, 16 insertions(+), 26 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index ef9fa8e..e5e2bfa 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -43,17 +43,6 @@ static struct dentry *proc_mount(struct file_system_type 
*fs_type,
struct pid_namespace *ns;
struct proc_inode *ei;
 
-   if (proc_mnt) {
-   /* Seed the root directory with a pid so it doesn't need
-* to be special in base.c.  I would do this earlier but
-* the only task alive when /proc is mounted the first time
-* is the init_task and it doesn't have any pids.
-*/
-   ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
-   if (!ei->pid)
-   ei->pid = find_get_pid(1);
-   }
-
if (flags & MS_KERNMOUNT)
ns = (struct pid_namespace *)data;
else
@@ -71,16 +60,16 @@ static struct dentry *proc_mount(struct file_system_type 
*fs_type,
return ERR_PTR(err);
}
 
-   ei = PROC_I(sb->s_root->d_inode);
-   if (!ei->pid) {
-   rcu_read_lock();
-   ei->pid = get_pid(find_pid_ns(1, ns));
-   rcu_read_unlock();
-   }
-
sb->s_flags |= MS_ACTIVE;
}
 
+   ei = PROC_I(sb->s_root->d_inode);
+   if (!ei->pid) {
+   rcu_read_lock();
+   ei->pid = get_pid(find_pid_ns(1, ns));
+   rcu_read_unlock();
+   }
+
return dget(sb->s_root);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c9f0784..e7a5907 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1180,12 +1180,6 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
pid = alloc_pid(p->nsproxy->pid_ns);
if (!pid)
goto bad_fork_cleanup_io;
-
-   if (clone_flags & CLONE_NEWPID) {
-   retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
-   if (retval < 0)
-   goto bad_fork_free_pid;
-   }
}
 
p->pid = pid_nr(pid);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a5aff94..e9c9adc 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define BITS_PER_PAGE  (PAGE_SIZE*8)
 
@@ -72,7 +73,7 @@ static struct pid_namespace *create_pid_namespace(struct 
pid_namespace *parent_p
 {
struct pid_namespace *ns;
unsigned int level = parent_pid_ns->level + 1;
-   int i;
+   int i, err = -ENOMEM;
 
ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
if (ns == NULL)
@@ -96,14 +97,20 @@ static struct pid_namespace *create_pid_namespace(struct 
pid_namespace *parent_p
for (i = 1; i < PIDMAP_ENTRIES; i++)
atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+   err = pid_ns_prepare_proc(ns);
+   if (err)
+   goto out_put_parent_pid_ns;
+
return ns;
 
+out_put_parent_pid_ns:
+   put_pid_ns(parent_pid_ns);
 out_free_map:
kfree(ns->pidmap[0].page);
 out_free:
kmem_cache_free(pid_ns_cachep, ns);
 out:
-   return ERR_PTR(-ENOMEM);
+   return ERR_PTR(err);
 }
 
 static void destroy_pid_namespace(struct pid_namespace *ns)
-- 
1.7.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/1, v6] cgroup/freezer: add per freezer duty ratio control

2011-02-14 Thread Vaidyanathan Srinivasan
* Arjan van de Ven  [2011-02-09 19:06:15]:

> On 2/9/2011 7:04 PM, Matt Helsley wrote:
> >On Tue, Feb 08, 2011 at 05:05:41PM -0800, jacob.jun@linux.intel.com 
> >wrote:
> >>From: Jacob Pan
> >>
> >>Freezer subsystem is used to manage batch jobs which can start
> >>stop at the same time. However, sometime it is desirable to let
> >>the kernel manage the freezer state automatically with a given
> >>duty ratio.
> >>For example, if we want to reduce the time that backgroup apps
> >>are allowed to run we can put them into a freezer subsystem and
> >>set the kernel to turn them THAWED/FROZEN at given duty ratio.
> >>
> >>This patch introduces two file nodes under cgroup
> >>freezer.duty_ratio_pct and freezer.period_sec
> >>
> >>Usage example: set period to be 5 seconds and frozen duty ratio 90%
> >>[root@localhost aoa]# echo 90>  freezer.duty_ratio_pct
> >>[root@localhost aoa]# echo 5000>  freezer.period_ms
> >I kept wondering how this was useful when we've got the "cpu" subsystem
> >because for some reason "duty cycle" made me think this was a scheduling
> >policy knob. In fact, I'm pretty sure it is -- it just happens to
> >sometimes reduce power consumption.
> >
> >Have you tried using the cpu cgroup subsystem's share to see if it can
> >have a similar effect?
> 
> does the cpu cgroup system work on a 20 to 30 second time window?
> the objective is to have the CPU idle, without wakeups, for that long...
> (to save power)

This is an interesting idea to force idle.  The cpu cgroup will
maintain resource ratio but will not restrict runtime of a cgroup if
there is nothing else to run in the system.

CFS hardlimits (http://lwn.net/Articles/368685/) can do something like
this but will need to be tuned for long intervals.  On multi cpu
system, synchronising the idle times across cpus has been the key
challenge that reduces the power saving benefits.

Does this technique provide good power savings for a specific
usecase/workload or platform?

--Vaidy

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 3/3] cgroups: introduce timer slack controller

2011-02-14 Thread Thomas Gleixner
B1;2401;0cOn Mon, 14 Feb 2011, Kirill A. Shutemov wrote:

> On Mon, Feb 14, 2011 at 03:00:03PM +0100, Thomas Gleixner wrote:
> > On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:
> > > From: Kirill A. Shutemov 
> > > 
> > > Every task_struct has timer_slack_ns value. This value uses to round up
> > > poll() and select() timeout values. This feature can be useful in
> > > mobile environment where combined wakeups are desired.
> > > 
> > > cgroup subsys "timer_slack" implement timer slack controller. It
> > > provides a way to group tasks by timer slack value and manage the
> > > value of group's tasks.
> > 
> > I have no objections against the whole thing in general, but why do we
> > need a module for this? Why can't we add this to the cgroups muck and
> > compile it in?
> 
> It was easier to test and debug with module.
> What is wrong with module? Do you worry about number of exports?

Not only about the number. We don't want exports when they are not
techically necessary, i.e. for driver stuff.

> > > +static int cgroup_timer_slack_check(struct notifier_block *nb,
> > > + unsigned long slack_ns, void *data)
> > > +{
> > > + struct cgroup_subsys_state *css;
> > > + struct timer_slack_cgroup *tslack_cgroup;
> > > +
> > > + /* XXX: lockdep false positive? */
> > 
> >   What? Either this has a reason or not. If it's a false positive then
> >   it needs to be fixed in lockdep. If not, 
> 
> I was not sure about it. There is similar workaround in freezer_fork().

I don't care about workarounds in freezer_work() at all. The above
question remains and this is new code and therefor it either needs to
hold rcu_read_lock() or it does not.

> > > + rcu_read_lock();
> > > + css = task_subsys_state(current, timer_slack_subsys.subsys_id);
> > > + tslack_cgroup = container_of(css, struct timer_slack_cgroup, css);
> > > + rcu_read_unlock();
> > > +
> > > + if (!is_timer_slack_allowed(tslack_cgroup, slack_ns))
> > > + return notifier_from_errno(-EPERM);
> > 
> >   If the above needs rcu read lock, why is the acess safe ?
> > 
> > > + return NOTIFY_OK;
> > 
> > > +/*
> > > + * Adjust ->timer_slack_ns and ->default_max_slack_ns of the task to fit
> > > + * limits of the cgroup.
> > > + */
> > > +static void tslack_adjust_task(struct timer_slack_cgroup *tslack_cgroup,
> > > + struct task_struct *tsk)
> > > +{
> > > + if (tslack_cgroup->min_slack_ns > tsk->timer_slack_ns)
> > > + tsk->timer_slack_ns = tslack_cgroup->min_slack_ns;
> > > + else if (tslack_cgroup->max_slack_ns < tsk->timer_slack_ns)
> > > + tsk->timer_slack_ns = tslack_cgroup->max_slack_ns;
> > > +
> > > + if (tslack_cgroup->min_slack_ns > tsk->default_timer_slack_ns)
> > > + tsk->default_timer_slack_ns = tslack_cgroup->min_slack_ns;
> > > + else if (tslack_cgroup->max_slack_ns < tsk->default_timer_slack_ns)
> > > + tsk->default_timer_slack_ns = tslack_cgroup->max_slack_ns;
> > 
> > 
> >   Why is there not a default slack value for the whole group ?
> 
> I think it breaks prctl() semantic. default slack value is a value on
> fork().

cgroups break a lot of semantics.
 
> > > +static u64 tslack_read_range(struct cgroup *cgroup, struct cftype *cft)
> > > +{
> > > + struct timer_slack_cgroup *tslack_cgroup;
> > > +
> > > + tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
> > > + switch (cft->private) {
> > > + case TIMER_SLACK_MIN:
> > > + return tslack_cgroup->min_slack_ns;
> > > + case TIMER_SLACK_MAX:
> > > + return tslack_cgroup->max_slack_ns;
> > > + default:
> > > + BUG();
> > 
> >   BUG() for soemthing which can be dealt with sensible ?
> 
> tslack_read_range() and tslack_write_range() have written to handle
> defined cftypes. If it used for other cftype it's a bug().

The only caller is initiated from here, right? So we really don't need
another bug just because you might fatfinger your own code.
 
> > > + list_for_each_entry(cur, &cgroup->children, sibling) {
> > > + child = cgroup_to_tslack_cgroup(cur);
> > > + if (type == TIMER_SLACK_MIN && val > child->min_slack_ns)
> > > + return -EBUSY;
> > 
> >   I thought the whole point is to propagate values through the group.
> 
> I think silent change here is wrong. cpuset returns -EBUSY in similar
> case.

And how is cpuset relevant for this ? Not at all. This is about
timer_slack and we better have a well defined scheme for all of this
and not some cobbled together thing with tons of exceptions and corner
cases. Of course undocumented as far the code goes.

Thanks,

tglx

 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [linux-cr PATCH 1/1] update x86-64 eclone and cr syscall numbers

2011-02-14 Thread Serge E. Hallyn
(for ckpt-v23-rc1-pids branch)

Signed-off-by: Serge E. Hallyn 
---
 arch/x86/include/asm/unistd_64.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 706d90a..f5d1b9e 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -671,9 +671,9 @@ __SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
 __SYSCALL(__NR_prlimit64, sys_prlimit64)
 #define __NR_eclone303
 __SYSCALL(__NR_eclone, stub_eclone)
-#define __NR_checkpoint301
+#define __NR_checkpoint304
 __SYSCALL(__NR_checkpoint, stub_checkpoint)
-#define __NR_restart   302
+#define __NR_restart   305
 __SYSCALL(__NR_restart, stub_restart)
 
 #ifndef __NO_STUBS
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [user-cr PATCH 1/1] Fix x86-64 syscall numbers

2011-02-14 Thread Serge E. Hallyn
Signed-off-by: Serge Hallyn 
---
 clone_x86_64.c |2 +-
 include/linux/checkpoint.h |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/clone_x86_64.c b/clone_x86_64.c
index 5a22093..6750786 100644
--- a/clone_x86_64.c
+++ b/clone_x86_64.c
@@ -26,7 +26,7 @@
 #include "eclone.h"
 
 #ifndef __NR_eclone
-#define __NR_eclone 300
+#define __NR_eclone 303
 #endif
 
 int eclone(int (*fn)(void *), void *fn_arg, int clone_flags_low,
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b6ac12d..3688ae1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -54,11 +54,11 @@
 #elif __x86_64__
 
 #  ifndef __NR_checkpoint
-#  define __NR_checkpoint 301
+#  define __NR_checkpoint 304
 #  endif
 
 #  ifndef __NR_restart
-#  define __NR_restart 302
+#  define __NR_restart 305
 #  endif
 
 #else
-- 
1.7.2.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 3/3] cgroups: introduce timer slack controller

2011-02-14 Thread Kirill A. Shutemov
On Mon, Feb 14, 2011 at 03:00:03PM +0100, Thomas Gleixner wrote:
> On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:
> > From: Kirill A. Shutemov 
> > 
> > Every task_struct has timer_slack_ns value. This value uses to round up
> > poll() and select() timeout values. This feature can be useful in
> > mobile environment where combined wakeups are desired.
> > 
> > cgroup subsys "timer_slack" implement timer slack controller. It
> > provides a way to group tasks by timer slack value and manage the
> > value of group's tasks.
> 
> I have no objections against the whole thing in general, but why do we
> need a module for this? Why can't we add this to the cgroups muck and
> compile it in?

It was easier to test and debug with module.
What is wrong with module? Do you worry about number of exports?

> > +struct cgroup_subsys timer_slack_subsys;
> > +struct timer_slack_cgroup {
> > +   struct cgroup_subsys_state css;
> > +   unsigned long min_slack_ns;
> > +   unsigned long max_slack_ns;
> > +};
> > +
> > +enum {
> > +   TIMER_SLACK_MIN,
> > +   TIMER_SLACK_MAX,
> > +};
> > +
> > +static struct timer_slack_cgroup *cgroup_to_tslack_cgroup(struct cgroup 
> > *cgroup)
> > +{
> > +   struct cgroup_subsys_state *css;
> > +
> > +   css = cgroup_subsys_state(cgroup, timer_slack_subsys.subsys_id);
> > +   return container_of(css, struct timer_slack_cgroup, css);
> > +}
> > +
> > +static int is_timer_slack_allowed(struct timer_slack_cgroup *tslack_cgroup,
> 
>   bool perhaps ?

Right.

> > +   unsigned long slack_ns)
> > +{
> > +   if (slack_ns < tslack_cgroup->min_slack_ns ||
> > +   slack_ns > tslack_cgroup->max_slack_ns)
> > +   return false;
> > +   return true;
> > +}
> > +
> > +static int cgroup_timer_slack_check(struct notifier_block *nb,
> > +   unsigned long slack_ns, void *data)
> > +{
> > +   struct cgroup_subsys_state *css;
> > +   struct timer_slack_cgroup *tslack_cgroup;
> > +
> > +   /* XXX: lockdep false positive? */
> 
>   What? Either this has a reason or not. If it's a false positive then
>   it needs to be fixed in lockdep. If not, 

I was not sure about it. There is similar workaround in freezer_fork().

> > +   rcu_read_lock();
> > +   css = task_subsys_state(current, timer_slack_subsys.subsys_id);
> > +   tslack_cgroup = container_of(css, struct timer_slack_cgroup, css);
> > +   rcu_read_unlock();
> > +
> > +   if (!is_timer_slack_allowed(tslack_cgroup, slack_ns))
> > +   return notifier_from_errno(-EPERM);
> 
>   If the above needs rcu read lock, why is the acess safe ?
> 
> > +   return NOTIFY_OK;
> 
> > +/*
> > + * Adjust ->timer_slack_ns and ->default_max_slack_ns of the task to fit
> > + * limits of the cgroup.
> > + */
> > +static void tslack_adjust_task(struct timer_slack_cgroup *tslack_cgroup,
> > +   struct task_struct *tsk)
> > +{
> > +   if (tslack_cgroup->min_slack_ns > tsk->timer_slack_ns)
> > +   tsk->timer_slack_ns = tslack_cgroup->min_slack_ns;
> > +   else if (tslack_cgroup->max_slack_ns < tsk->timer_slack_ns)
> > +   tsk->timer_slack_ns = tslack_cgroup->max_slack_ns;
> > +
> > +   if (tslack_cgroup->min_slack_ns > tsk->default_timer_slack_ns)
> > +   tsk->default_timer_slack_ns = tslack_cgroup->min_slack_ns;
> > +   else if (tslack_cgroup->max_slack_ns < tsk->default_timer_slack_ns)
> > +   tsk->default_timer_slack_ns = tslack_cgroup->max_slack_ns;
> 
> 
>   Why is there not a default slack value for the whole group ?

I think it breaks prctl() semantic. default slack value is a value on
fork().

> > +static u64 tslack_read_range(struct cgroup *cgroup, struct cftype *cft)
> > +{
> > +   struct timer_slack_cgroup *tslack_cgroup;
> > +
> > +   tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
> > +   switch (cft->private) {
> > +   case TIMER_SLACK_MIN:
> > +   return tslack_cgroup->min_slack_ns;
> > +   case TIMER_SLACK_MAX:
> > +   return tslack_cgroup->max_slack_ns;
> > +   default:
> > +   BUG();
> 
>   BUG() for soemthing which can be dealt with sensible ?

tslack_read_range() and tslack_write_range() have written to handle
defined cftypes. If it used for other cftype it's a bug().

> > +   }
> > +}
> > +
> > +static int validate_change(struct cgroup *cgroup, u64 val, int type)
> > +{
> > +   struct timer_slack_cgroup *tslack_cgroup, *child;
> > +   struct cgroup *cur;
> > +
> > +   BUG_ON(type != TIMER_SLACK_MIN && type != TIMER_SLACK_MAX);
> 
>   Ditto. That should be -EINVAL or such.
> 
> > +   if (val > ULONG_MAX)
> > +   return -EINVAL;
> > +
> > +   if (cgroup->parent) {
> > +   struct timer_slack_cgroup *parent;
> > +   parent = cgroup_to_tslack_cgroup(cgroup->parent);
> > +   if (!is_timer_slack_allowed(parent, val))
> > +   return -EPERM;
> > +   }
> > +
> > +   tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
> > +   if (type == TIMER_SLACK_MIN && val > tslack_cgroup->max_slack_ns)
> > +   return -EINV

[Devel] Re: [PATCH, v6 2/3] Implement timer slack notifier chain

2011-02-14 Thread Thomas Gleixner
On Mon, 14 Feb 2011, Kirill A. Shutemov wrote:
> On Mon, Feb 14, 2011 at 02:32:23PM +0100, Thomas Gleixner wrote:
> > On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:
> > 
> > > From: Kirill A. Shutemov 
> > > 
> > > Process can change its timer slack using prctl(). Timer slack notifier
> > > call chain allows to react on such change or forbid it.
> > 
> > So we add a notifier call chain and more exports to allow what ?
> 
> To allow the cgroup contoller validate the value.

So we add 5 exports and a notifier chain to have a module? Errm, I
mean there is not really a high probability that we'll add 5 more of
those validation thingies, right?

So instead of having 
#ifdef CONFIG_CGROUP_MUCK
int cgroup_set_slack();
#else
static inline int cgroup_set_slack(...)
{
return 
}
#endif

We add all that stuff ?

> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> > > @@ -1691,15 +1691,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned 
> > > long, arg2, unsigned long, arg3,
> > >   error = perf_event_task_enable();
> > >   break;
> > >   case PR_GET_TIMERSLACK:
> > > - error = current->timer_slack_ns;
> > > + error = prctl_get_timer_slack();
> > 
> >   What's the point of replacing current->timer_slack_ns with a
> >   function which does exactly the same ?
> 
> To keep it consistent. BTW, prctl_get_seccomp() does the same.

That does not make it less bloat.

> > 
> > > +long prctl_set_timer_slack(long timer_slack_ns)
> > > +{
> > > + int err;
> > > +
> > > + /* Reset timer slack to default value */
> > > + if (timer_slack_ns <= 0) {
> > > + current->timer_slack_ns = current->default_timer_slack_ns;
> > > + return 0;
> > 
> >   That does not make any sense at all. Why is setting
> >   default_timer_slack_ns not subject to validation ?
> 
> Hm.. In case of cgroup_timer_slack it's always valid.
> But, yes, in general, we should validate it.
> 
> >   Why is it treaded seperately ?
> 
> What do you mean?

Should have read:

 Why is it treated seperately from the other settings?

So setting the default is probably correct to be out of the validation
thing, still the question remains, why we do not have a cgroup default
then.

Thanks,

tglx
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 1/3] cgroups: export cgroup_iter_{start, next, end}

2011-02-14 Thread Thomas Gleixner
On Mon, 14 Feb 2011, Kirill A. Shutemov wrote:

> On Mon, Feb 14, 2011 at 02:27:44PM +0100, Thomas Gleixner wrote:
> > 
> > Lacks any sensible explanation why and whatfor this needs to be
> > exported.
> 
> I thought it's obvious since it's part of the patchset.
> Am I wrong?

Yes, because we want a sensible changelog for the patch itself.
 
> > And I really doubt that we want to export that at all.
> 
> Why?
> Do you think cgroup_timer_slack built as module is useless?

Pretty much.
 
Thanks,

tglx
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 2/3] Implement timer slack notifier chain

2011-02-14 Thread Kirill A. Shutemov
On Mon, Feb 14, 2011 at 02:32:23PM +0100, Thomas Gleixner wrote:
> On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:
> 
> > From: Kirill A. Shutemov 
> > 
> > Process can change its timer slack using prctl(). Timer slack notifier
> > call chain allows to react on such change or forbid it.
> 
> So we add a notifier call chain and more exports to allow what ?

To allow the cgroup contoller validate the value.

> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1691,15 +1691,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> > arg2, unsigned long, arg3,
> > error = perf_event_task_enable();
> > break;
> > case PR_GET_TIMERSLACK:
> > -   error = current->timer_slack_ns;
> > +   error = prctl_get_timer_slack();
> 
>   What's the point of replacing current->timer_slack_ns with a
>   function which does exactly the same ?

To keep it consistent. BTW, prctl_get_seccomp() does the same.

> 
> > +long prctl_set_timer_slack(long timer_slack_ns)
> > +{
> > +   int err;
> > +
> > +   /* Reset timer slack to default value */
> > +   if (timer_slack_ns <= 0) {
> > +   current->timer_slack_ns = current->default_timer_slack_ns;
> > +   return 0;
> 
>   That does not make any sense at all. Why is setting
>   default_timer_slack_ns not subject to validation ?

Hm.. In case of cgroup_timer_slack it's always valid.
But, yes, in general, we should validate it.

>   Why is it treaded seperately ?

What do you mean?

> > +   }
> > +
> > +   err = blocking_notifier_call_chain(&timer_slack_notify_list,
> > +   timer_slack_ns, NULL);
> > +   if (err == NOTIFY_DONE)
> > +   current->timer_slack_ns = timer_slack_ns;
> > +
> > +   return notifier_to_errno(err);
> 
> Thanks,
> 
>   tglx

-- 
 Kirill A. Shutemov
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 1/3] cgroups: export cgroup_iter_{start, next, end}

2011-02-14 Thread Kirill A. Shutemov
On Mon, Feb 14, 2011 at 02:27:44PM +0100, Thomas Gleixner wrote:
> 
> Lacks any sensible explanation why and whatfor this needs to be
> exported.

I thought it's obvious since it's part of the patchset.
Am I wrong?

> And I really doubt that we want to export that at all.

Why?
Do you think cgroup_timer_slack built as module is useless?

> > +EXPORT_SYMBOL_GPL(cgroup_iter_start);
> > +EXPORT_SYMBOL_GPL(cgroup_iter_next);
> > +EXPORT_SYMBOL_GPL(cgroup_iter_end);
> 
> Thanks,
> 
>   tglx

-- 
 Kirill A. Shutemov
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 3/3] cgroups: introduce timer slack controller

2011-02-14 Thread Thomas Gleixner
On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:
> From: Kirill A. Shutemov 
> 
> Every task_struct has timer_slack_ns value. This value uses to round up
> poll() and select() timeout values. This feature can be useful in
> mobile environment where combined wakeups are desired.
> 
> cgroup subsys "timer_slack" implement timer slack controller. It
> provides a way to group tasks by timer slack value and manage the
> value of group's tasks.

I have no objections against the whole thing in general, but why do we
need a module for this? Why can't we add this to the cgroups muck and
compile it in?

> +struct cgroup_subsys timer_slack_subsys;
> +struct timer_slack_cgroup {
> + struct cgroup_subsys_state css;
> + unsigned long min_slack_ns;
> + unsigned long max_slack_ns;
> +};
> +
> +enum {
> + TIMER_SLACK_MIN,
> + TIMER_SLACK_MAX,
> +};
> +
> +static struct timer_slack_cgroup *cgroup_to_tslack_cgroup(struct cgroup 
> *cgroup)
> +{
> + struct cgroup_subsys_state *css;
> +
> + css = cgroup_subsys_state(cgroup, timer_slack_subsys.subsys_id);
> + return container_of(css, struct timer_slack_cgroup, css);
> +}
> +
> +static int is_timer_slack_allowed(struct timer_slack_cgroup *tslack_cgroup,

  bool perhaps ?

> + unsigned long slack_ns)
> +{
> + if (slack_ns < tslack_cgroup->min_slack_ns ||
> + slack_ns > tslack_cgroup->max_slack_ns)
> + return false;
> + return true;
> +}
> +
> +static int cgroup_timer_slack_check(struct notifier_block *nb,
> + unsigned long slack_ns, void *data)
> +{
> + struct cgroup_subsys_state *css;
> + struct timer_slack_cgroup *tslack_cgroup;
> +
> + /* XXX: lockdep false positive? */

  What? Either this has a reason or not. If it's a false positive then
  it needs to be fixed in lockdep. If not, 

> + rcu_read_lock();
> + css = task_subsys_state(current, timer_slack_subsys.subsys_id);
> + tslack_cgroup = container_of(css, struct timer_slack_cgroup, css);
> + rcu_read_unlock();
> +
> + if (!is_timer_slack_allowed(tslack_cgroup, slack_ns))
> + return notifier_from_errno(-EPERM);

  If the above needs rcu read lock, why is the acess safe ?

> + return NOTIFY_OK;

> +/*
> + * Adjust ->timer_slack_ns and ->default_max_slack_ns of the task to fit
> + * limits of the cgroup.
> + */
> +static void tslack_adjust_task(struct timer_slack_cgroup *tslack_cgroup,
> + struct task_struct *tsk)
> +{
> + if (tslack_cgroup->min_slack_ns > tsk->timer_slack_ns)
> + tsk->timer_slack_ns = tslack_cgroup->min_slack_ns;
> + else if (tslack_cgroup->max_slack_ns < tsk->timer_slack_ns)
> + tsk->timer_slack_ns = tslack_cgroup->max_slack_ns;
> +
> + if (tslack_cgroup->min_slack_ns > tsk->default_timer_slack_ns)
> + tsk->default_timer_slack_ns = tslack_cgroup->min_slack_ns;
> + else if (tslack_cgroup->max_slack_ns < tsk->default_timer_slack_ns)
> + tsk->default_timer_slack_ns = tslack_cgroup->max_slack_ns;


  Why is there not a default slack value for the whole group ?

> +static u64 tslack_read_range(struct cgroup *cgroup, struct cftype *cft)
> +{
> + struct timer_slack_cgroup *tslack_cgroup;
> +
> + tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
> + switch (cft->private) {
> + case TIMER_SLACK_MIN:
> + return tslack_cgroup->min_slack_ns;
> + case TIMER_SLACK_MAX:
> + return tslack_cgroup->max_slack_ns;
> + default:
> + BUG();

  BUG() for soemthing which can be dealt with sensible ?

> + }
> +}
> +
> +static int validate_change(struct cgroup *cgroup, u64 val, int type)
> +{
> + struct timer_slack_cgroup *tslack_cgroup, *child;
> + struct cgroup *cur;
> +
> + BUG_ON(type != TIMER_SLACK_MIN && type != TIMER_SLACK_MAX);

  Ditto. That should be -EINVAL or such.

> + if (val > ULONG_MAX)
> + return -EINVAL;
> +
> + if (cgroup->parent) {
> + struct timer_slack_cgroup *parent;
> + parent = cgroup_to_tslack_cgroup(cgroup->parent);
> + if (!is_timer_slack_allowed(parent, val))
> + return -EPERM;
> + }
> +
> + tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
> + if (type == TIMER_SLACK_MIN && val > tslack_cgroup->max_slack_ns)
> + return -EINVAL;
> + if (type == TIMER_SLACK_MAX && val < tslack_cgroup->min_slack_ns)
> + return -EINVAL;
> +
> + list_for_each_entry(cur, &cgroup->children, sibling) {
> + child = cgroup_to_tslack_cgroup(cur);
> + if (type == TIMER_SLACK_MIN && val > child->min_slack_ns)
> + return -EBUSY;

  I thought the whole point is to propagate values through the group.

> + if (type == TIMER_SLACK_MAX && val < child->max_slack_ns)
> + return -EBUSY;

  This is completely confusing w/o any line of comment.

Thanks

   

[Devel] Re: [PATCH, v6 3/3] cgroups: introduce timer slack controller

2011-02-14 Thread Matt Helsley
On Mon, Feb 14, 2011 at 03:06:27PM +0200, Kirill A. Shutsemov wrote:
> From: Kirill A. Shutemov 
> 
> Every task_struct has timer_slack_ns value. This value uses to round up
> poll() and select() timeout values. This feature can be useful in
> mobile environment where combined wakeups are desired.
> 
> cgroup subsys "timer_slack" implement timer slack controller. It
> provides a way to group tasks by timer slack value and manage the
> value of group's tasks.
> 
> Idea-by: Jacob Pan 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  Documentation/cgroups/timer_slack.txt |   93 +++
>  include/linux/cgroup_subsys.h |6 +
>  init/Kconfig  |   10 ++
>  kernel/Makefile   |1 +
>  kernel/cgroup_timer_slack.c   |  285 
> +
>  5 files changed, 395 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/cgroups/timer_slack.txt
>  create mode 100644 kernel/cgroup_timer_slack.c
> 
> diff --git a/Documentation/cgroups/timer_slack.txt 
> b/Documentation/cgroups/timer_slack.txt
> new file mode 100644
> index 000..e7ec2f3
> --- /dev/null
> +++ b/Documentation/cgroups/timer_slack.txt
> @@ -0,0 +1,93 @@
> +Timer Slack Controller
> +=
> +
> +Overview
> +
> +
> +Every task_struct has timer_slack_ns value. This value uses to round up
> +poll() and select() timeout values. This feature can be useful in
> +mobile environment where combined wakeups are desired.
> +
> +cgroup subsys "timer_slack" implement timer slack controller. It
> +provides a way to group tasks by timer slack value and manage the
> +value of group's tasks.
> +
> +
> +User interface
> +--
> +
> +To get timer slack controller functionality you need to enable it in
> +kernel configuration:
> +
> +CONFIG_CGROUP_TIMER_SLACK=y
> +
> +or if you want to compile it as module:
> +
> +CONFIG_CGROUP_TIMER_SLACK=m
> +
> +The controller provides three files in cgroup directory:
> +
> +# mount -t cgroup -o timer_slack none /sys/fs/cgroup
> +# ls /sys/fs/cgroup/timer_slack.*
> +/sys/fs/cgroup/timer_slack.max_slack_ns
> +/sys/fs/cgroup/timer_slack.min_slack_ns
> +/sys/fs/cgroup/timer_slack.set_slack_ns
> +
> +timer_slack.min_slack_ns and timer_slack.set_slack_ns specify allowed
> +range of timer slack for tasks in cgroup. By default it unlimited:
> +
> +# cat /sys/fs/cgroup/timer_slack.min_slack_ns
> +0
> +# cat /sys/fs/cgroup/timer_slack.max_slack_ns
> +4294967295
> +
> +You can specify limits you want:
> +
> +# echo 5 > /sys/fs/cgroup/timer_slack.min_slack_ns
> +# echo 100 > /sys/fs/cgroup/timer_slack.max_slack_ns
> +# cat /sys/fs/cgroup/timer_slack.{min,max}_slack_ns
> +5
> +100
> +
> +Timer slack value of all tasks of the cgroup will be adjusted to fit
> +min-max range.
> +
> +If a task will try to call prctl() to change timer slack value out of
> +the range it get -EPERM.
> +
> +You can change timer slack value of all tasks of the cgroup at once:
> +
> +# echo 7 > /sys/fs/cgroup/timer_slack.set_slack_ns
> +
> +Timer slack controller supports hierarchical groups. The only rule:
> +parent's limit range should be wider or equal to child's. Sibling
> +cgroups can have overlapping min-max range.
> +
> + (root: 5 - 100)
> +  /   \
> + (a: 5 - 5)  (b: 50 - 100)
> +  / \
> + (c: 50 - 90) (d: 70 - 80)
> +
> +# mkdir /sys/fs/cgroup/a
> +# echo 5 > /sys/fs/cgroup/a/timer_slack.max_slack_ns
> +# cat /sys/fs/cgroup/a/timer_slack.{min,max}_slack_ns
> +5
> +5
> +# mkdir /sys/fs/cgroup/b
> +# echo 50 > /sys/fs/cgroup/b/timer_slack.min_slack_ns
> +# cat /sys/fs/cgroup/b/timer_slack.{min,max}_slack_ns
> +50
> +100
> +# mkdir /sys/fs/cgroup/b/c
> +# echo 90 > /sys/fs/cgroup/b/c/timer_slack.max_slack_ns
> +# cat /sys/fs/cgroup/b/c/timer_slack.{min,max}_slack_ns
> +50
> +90
> +# mkdir /sys/fs/cgroup/b/d
> +# echo 70 > /sys/fs/cgroup/b/d/timer_slack.min_slack_ns
> +# echo 80 > /sys/fs/cgroup/b/d/timer_slack.max_slack_ns
> +# cat /sys/fs/cgroup/b/d/timer_slack.{min,max}_slack_ns
> +70
> +80
> +
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index ccefff0..e399228 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -66,3 +66,9 @@ SUBSYS(blkio)
>  #endif
> 
>  /* */
> +
> +#ifdef CONFIG_CGROUP_TIMER_SLACK
> +SUBSYS(timer_slack)
> +#endif
> +
> +/* */
> diff --git a/init/Kconfig b/init/Kconfig
> index be788c0..bb54ae9 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -596,6 +596,16 @@ config CGROUP_FREEZER
> Provides a way to freeze and unfreeze all tasks in a
> cgroup.
> 
> +config CGROUP_TIMER_SLACK
> + tristate "Timer slack cgroup controller"
> + help
> +   Provides a way of tasks grouping by timer slack value.
> +   Ev

[Devel] Re: [PATCH, v6 2/3] Implement timer slack notifier chain

2011-02-14 Thread Thomas Gleixner
On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:

> From: Kirill A. Shutemov 
> 
> Process can change its timer slack using prctl(). Timer slack notifier
> call chain allows to react on such change or forbid it.

So we add a notifier call chain and more exports to allow what ?
 
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1691,15 +1691,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> arg2, unsigned long, arg3,
>   error = perf_event_task_enable();
>   break;
>   case PR_GET_TIMERSLACK:
> - error = current->timer_slack_ns;
> + error = prctl_get_timer_slack();

  What's the point of replacing current->timer_slack_ns with a
  function which does exactly the same ?

> +long prctl_set_timer_slack(long timer_slack_ns)
> +{
> + int err;
> +
> + /* Reset timer slack to default value */
> + if (timer_slack_ns <= 0) {
> + current->timer_slack_ns = current->default_timer_slack_ns;
> + return 0;

  That does not make any sense at all. Why is setting
  default_timer_slack_ns not subject to validation ? Why is it
  treaded seperately ?

> + }
> +
> + err = blocking_notifier_call_chain(&timer_slack_notify_list,
> + timer_slack_ns, NULL);
> + if (err == NOTIFY_DONE)
> + current->timer_slack_ns = timer_slack_ns;
> +
> + return notifier_to_errno(err);

Thanks,

tglx
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 1/3] cgroups: export cgroup_iter_{start, next, end}

2011-02-14 Thread Thomas Gleixner

Lacks any sensible explanation why and whatfor this needs to be
exported. And I really doubt that we want to export that at all.

> +EXPORT_SYMBOL_GPL(cgroup_iter_start);
> +EXPORT_SYMBOL_GPL(cgroup_iter_next);
> +EXPORT_SYMBOL_GPL(cgroup_iter_end);

Thanks,

tglx
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH, v6 0/3] Introduce timer slack controller

2011-02-14 Thread Thomas Gleixner
On Mon, 14 Feb 2011, Kirill A. Shutsemov wrote:

This 0/0 explains nothing nada. What is the whole muck about ?

Thanks,

tglx
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH, v6 3/3] cgroups: introduce timer slack controller

2011-02-14 Thread Kirill A. Shutsemov
From: Kirill A. Shutemov 

Every task_struct has timer_slack_ns value. This value uses to round up
poll() and select() timeout values. This feature can be useful in
mobile environment where combined wakeups are desired.

cgroup subsys "timer_slack" implement timer slack controller. It
provides a way to group tasks by timer slack value and manage the
value of group's tasks.

Idea-by: Jacob Pan 
Signed-off-by: Kirill A. Shutemov 
---
 Documentation/cgroups/timer_slack.txt |   93 +++
 include/linux/cgroup_subsys.h |6 +
 init/Kconfig  |   10 ++
 kernel/Makefile   |1 +
 kernel/cgroup_timer_slack.c   |  285 +
 5 files changed, 395 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/timer_slack.txt
 create mode 100644 kernel/cgroup_timer_slack.c

diff --git a/Documentation/cgroups/timer_slack.txt 
b/Documentation/cgroups/timer_slack.txt
new file mode 100644
index 000..e7ec2f3
--- /dev/null
+++ b/Documentation/cgroups/timer_slack.txt
@@ -0,0 +1,93 @@
+Timer Slack Controller
+=
+
+Overview
+
+
+Every task_struct has timer_slack_ns value. This value uses to round up
+poll() and select() timeout values. This feature can be useful in
+mobile environment where combined wakeups are desired.
+
+cgroup subsys "timer_slack" implement timer slack controller. It
+provides a way to group tasks by timer slack value and manage the
+value of group's tasks.
+
+
+User interface
+--
+
+To get timer slack controller functionality you need to enable it in
+kernel configuration:
+
+CONFIG_CGROUP_TIMER_SLACK=y
+
+or if you want to compile it as module:
+
+CONFIG_CGROUP_TIMER_SLACK=m
+
+The controller provides three files in cgroup directory:
+
+# mount -t cgroup -o timer_slack none /sys/fs/cgroup
+# ls /sys/fs/cgroup/timer_slack.*
+/sys/fs/cgroup/timer_slack.max_slack_ns
+/sys/fs/cgroup/timer_slack.min_slack_ns
+/sys/fs/cgroup/timer_slack.set_slack_ns
+
+timer_slack.min_slack_ns and timer_slack.set_slack_ns specify allowed
+range of timer slack for tasks in cgroup. By default it unlimited:
+
+# cat /sys/fs/cgroup/timer_slack.min_slack_ns
+0
+# cat /sys/fs/cgroup/timer_slack.max_slack_ns
+4294967295
+
+You can specify limits you want:
+
+# echo 5 > /sys/fs/cgroup/timer_slack.min_slack_ns
+# echo 100 > /sys/fs/cgroup/timer_slack.max_slack_ns
+# cat /sys/fs/cgroup/timer_slack.{min,max}_slack_ns
+5
+100
+
+Timer slack value of all tasks of the cgroup will be adjusted to fit
+min-max range.
+
+If a task will try to call prctl() to change timer slack value out of
+the range it get -EPERM.
+
+You can change timer slack value of all tasks of the cgroup at once:
+
+# echo 7 > /sys/fs/cgroup/timer_slack.set_slack_ns
+
+Timer slack controller supports hierarchical groups. The only rule:
+parent's limit range should be wider or equal to child's. Sibling
+cgroups can have overlapping min-max range.
+
+   (root: 5 - 100)
+/   \
+   (a: 5 - 5)  (b: 50 - 100)
+/ \
+   (c: 50 - 90) (d: 70 - 80)
+
+# mkdir /sys/fs/cgroup/a
+# echo 5 > /sys/fs/cgroup/a/timer_slack.max_slack_ns
+# cat /sys/fs/cgroup/a/timer_slack.{min,max}_slack_ns
+5
+5
+# mkdir /sys/fs/cgroup/b
+# echo 50 > /sys/fs/cgroup/b/timer_slack.min_slack_ns
+# cat /sys/fs/cgroup/b/timer_slack.{min,max}_slack_ns
+50
+100
+# mkdir /sys/fs/cgroup/b/c
+# echo 90 > /sys/fs/cgroup/b/c/timer_slack.max_slack_ns
+# cat /sys/fs/cgroup/b/c/timer_slack.{min,max}_slack_ns
+50
+90
+# mkdir /sys/fs/cgroup/b/d
+# echo 70 > /sys/fs/cgroup/b/d/timer_slack.min_slack_ns
+# echo 80 > /sys/fs/cgroup/b/d/timer_slack.max_slack_ns
+# cat /sys/fs/cgroup/b/d/timer_slack.{min,max}_slack_ns
+70
+80
+
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ccefff0..e399228 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -66,3 +66,9 @@ SUBSYS(blkio)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_TIMER_SLACK
+SUBSYS(timer_slack)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index be788c0..bb54ae9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -596,6 +596,16 @@ config CGROUP_FREEZER
  Provides a way to freeze and unfreeze all tasks in a
  cgroup.
 
+config CGROUP_TIMER_SLACK
+   tristate "Timer slack cgroup controller"
+   help
+ Provides a way of tasks grouping by timer slack value.
+ Every timer slack cgroup has min and max slack value. Task's
+ timer slack value will be adjusted to fit min-max range once
+ the task attached to the cgroup.
+ It's useful in mobile devices where certain background apps
+ are attached to a cgroup and combined wakeups are desired.
+
 config CGROUP_DEVICE
bo

[Devel] [PATCH, v6 2/3] Implement timer slack notifier chain

2011-02-14 Thread Kirill A. Shutsemov
From: Kirill A. Shutemov 

Process can change its timer slack using prctl(). Timer slack notifier
call chain allows to react on such change or forbid it.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/sched.h |5 
 kernel/Makefile   |2 +-
 kernel/sys.c  |9 +--
 kernel/timer_slack.c  |   57 +
 4 files changed, 65 insertions(+), 8 deletions(-)
 create mode 100644 kernel/timer_slack.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..f1a4ec2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2620,6 +2620,11 @@ static inline unsigned long rlimit_max(unsigned int 
limit)
return task_rlimit_max(current, limit);
 }
 
+extern int register_timer_slack_notifier(struct notifier_block *nb);
+extern void unregister_timer_slack_notifier(struct notifier_block *nb);
+long prctl_get_timer_slack(void);
+extern long prctl_set_timer_slack(long timer_slack_ns);
+
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 353d3fe..aa43e63 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
-   async.o range.o jump_label.o
+   async.o range.o jump_label.o timer_slack.o
 obj-y += groups.o
 
 ifdef CONFIG_FUNCTION_TRACER
diff --git a/kernel/sys.c b/kernel/sys.c
index 18da702..a33e4b8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1691,15 +1691,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
arg2, unsigned long, arg3,
error = perf_event_task_enable();
break;
case PR_GET_TIMERSLACK:
-   error = current->timer_slack_ns;
+   error = prctl_get_timer_slack();
break;
case PR_SET_TIMERSLACK:
-   if (arg2 <= 0)
-   current->timer_slack_ns =
-   current->default_timer_slack_ns;
-   else
-   current->timer_slack_ns = arg2;
-   error = 0;
+   error = prctl_set_timer_slack(arg2);
break;
case PR_MCE_KILL:
if (arg4 | arg5)
diff --git a/kernel/timer_slack.c b/kernel/timer_slack.c
new file mode 100644
index 000..546ed78
--- /dev/null
+++ b/kernel/timer_slack.c
@@ -0,0 +1,57 @@
+/*
+ * kernel/timer_slack.c
+ *
+ * Copyright Nokia Corparation, 2011
+ * Author: Kirill A. Shutemov
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+
+static BLOCKING_NOTIFIER_HEAD(timer_slack_notify_list);
+
+int register_timer_slack_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&timer_slack_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_timer_slack_notifier);
+
+void unregister_timer_slack_notifier(struct notifier_block *nb)
+{
+   blocking_notifier_chain_unregister(&timer_slack_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_timer_slack_notifier);
+
+long prctl_get_timer_slack(void)
+{
+   return current->timer_slack_ns;
+}
+
+long prctl_set_timer_slack(long timer_slack_ns)
+{
+   int err;
+
+   /* Reset timer slack to default value */
+   if (timer_slack_ns <= 0) {
+   current->timer_slack_ns = current->default_timer_slack_ns;
+   return 0;
+   }
+
+   err = blocking_notifier_call_chain(&timer_slack_notify_list,
+   timer_slack_ns, NULL);
+   if (err == NOTIFY_DONE)
+   current->timer_slack_ns = timer_slack_ns;
+
+   return notifier_to_errno(err);
+}
-- 
1.7.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH, v6 0/3] Introduce timer slack controller

2011-02-14 Thread Kirill A. Shutsemov
From: Kirill A. Shutemov 


Changelog:

v6:
 - add documentation
 - use notifier_call_chain() instead of check hook
 - fix validate_change()
 - cleanup
v5:
 - -EBUSY on writing to timer_slack.min_slack_ns/max_slack_ns if a child has
   wider min-max range
v4:
 - hierarchy support
 - drop dummy_timer_slack_check()
 - workaround lockdep false (?) positive
 - allow 0 as timer slack value
v3:
 - rework interface
 - s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/
v2:
 - fixed with CONFIG_CGROUP_TIMER_SLACK=y
v1:
 - initial revision

Kirill A. Shutemov (3):
  cgroups: export cgroup_iter_{start,next,end}
  Implement timer slack notifier chain
  cgroups: introduce timer slack controller

 Documentation/cgroups/timer_slack.txt |   93 +++
 include/linux/cgroup_subsys.h |6 +
 include/linux/sched.h |5 +
 init/Kconfig  |   10 ++
 kernel/Makefile   |3 +-
 kernel/cgroup.c   |3 +
 kernel/cgroup_timer_slack.c   |  285 +
 kernel/sys.c  |9 +-
 kernel/timer_slack.c  |   57 +++
 9 files changed, 463 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/cgroups/timer_slack.txt
 create mode 100644 kernel/cgroup_timer_slack.c
 create mode 100644 kernel/timer_slack.c

-- 
1.7.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH, v6 1/3] cgroups: export cgroup_iter_{start, next, end}

2011-02-14 Thread Kirill A. Shutsemov
From: Kirill A. Shutemov 

Signed-off-by: Kirill A. Shutemov 
Acked-by: Paul Menage 
Acked-by: Balbir Singh 
---
 kernel/cgroup.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b24d702..8234daa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2443,6 +2443,7 @@ void cgroup_iter_start(struct cgroup *cgrp, struct 
cgroup_iter *it)
it->cg_link = &cgrp->css_sets;
cgroup_advance_iter(cgrp, it);
 }
+EXPORT_SYMBOL_GPL(cgroup_iter_start);
 
 struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
struct cgroup_iter *it)
@@ -2467,11 +2468,13 @@ struct task_struct *cgroup_iter_next(struct cgroup 
*cgrp,
}
return res;
 }
+EXPORT_SYMBOL_GPL(cgroup_iter_next);
 
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it)
 {
read_unlock(&css_set_lock);
 }
+EXPORT_SYMBOL_GPL(cgroup_iter_end);
 
 static inline int started_after_time(struct task_struct *t1,
 struct timespec *time,
-- 
1.7.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel