Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
В Чт, 18/06/2015 в 21:26 +0300, Cyrill Gorcunov пишет: On Tue, Jun 16, 2015 at 07:51:52PM +0300, Cyrill Gorcunov wrote: If we have any problems because of this, the solution is good. OK. Gimme sometime (util tomorrow probably) to think of. This issue not critical at the moment because we know that we're moving one task only (from vzctl). So we can investigate. Kirill, you know I think Vladimir's proposal is the best option here. Yes there is a window when task_ve is not yet updated but ve interface is special and supposed to be run in a predefined way (ie moving caller of container's init [read vzctl] should be done in a forkless manner). So I think we can trade this off for a simplier solution, right? Also maybe we should add some check for creds thus arbitrary userspace apps wont be moved here and there. If there some other way -- please share (rcu for get-exec-env still look woth to add). Ok, I have no objections. The only thing is we need to carefully use direct task_ve in the future. All current place, where we use it, are safe. Kirill ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On Fri, Jun 19, 2015 at 01:15:31PM +0300, Kirill Tkhai wrote: Ok, I have no objections. The only thing is we need to carefully use direct task_ve in the future. All current place, where we use it, are safe. Sure. I'll prepare the patch once time permit. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On Tue, Jun 16, 2015 at 07:51:52PM +0300, Cyrill Gorcunov wrote: If we have any problems because of this, the solution is good. OK. Gimme sometime (util tomorrow probably) to think of. This issue not critical at the moment because we know that we're moving one task only (from vzctl). So we can investigate. Kirill, you know I think Vladimir's proposal is the best option here. Yes there is a window when task_ve is not yet updated but ve interface is special and supposed to be run in a predefined way (ie moving caller of container's init [read vzctl] should be done in a forkless manner). So I think we can trade this off for a simplier solution, right? Also maybe we should add some check for creds thus arbitrary userspace apps wont be moved here and there. If there some other way -- please share (rcu for get-exec-env still look woth to add). ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On Tue, Jun 16, 2015 at 06:19:44PM +0300, Kirill Tkhai wrote: This patch brings a couple of problems. The first one is if we're setting a ve cgroup for a forking task, it's possible the parent and the child fall into different cgroups. For example, parent is inside CT, while child is in ve0. The second problem is that it is racy with all places, where we're doing get_exec_env() and hoping on a stable result. There are a lot of places, we use it, so every place need too be checked. The first problem can be solved using cgroup_postfork() method, while the second is worse. The number of get_exec_env() is big, and it tends to increase. We could check all of them, but an universal solutions will be better. But it seems there is no easy universal solution. We can implement a task lock and use get_exec_env() under it. We can use something like rcu_read_{lock,unlock} braces around get_exec_env(), and use synchronize_rcu() in cgroup_postfork(). Alternatively, we could set task_ve in a task_work. See kernel/bc/beancounter.c for an example. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
Hm. How about stop_machine? Bad idea. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
В Вт, 16/06/2015 в 19:07 +0300, Cyrill Gorcunov пишет: On Tue, Jun 16, 2015 at 06:19:44PM +0300, Kirill Tkhai wrote: This patch brings a couple of problems. The first one is if we're setting a ve cgroup for a forking task, it's possible the parent and the child fall into different cgroups. For example, parent is inside CT, while child is in ve0. The second problem is that it is racy with all places, where we're doing get_exec_env() and hoping on a stable result. There are a lot of places, we use it, so every place need too be checked. The first problem can be solved using cgroup_postfork() method, while the second is worse. The number of get_exec_env() is big, and it tends to increase. We could check all of them, but an universal solutions will be better. But it seems there is no easy universal solution. We can implement a task lock and use get_exec_env() under it. We can use something like rcu_read_{lock,unlock} braces around get_exec_env(), and use synchronize_rcu() in cgroup_postfork(). The two solutions above are examples what we can do. What do you think? May not we simply add into ve cgroup? .can_attach ... spin_lock(task-sighand-siglock); .cance_attach ... spin_unlock(task-sighand-siglock); .attach ... spin_unlock(task-sighand-siglock); It seems sighand lock is not good for this, because cgroup_attach_task() iterating over all subsys: for_each_subsys(root, ss) { if (ss-can_attach) { so we may bumped into wrong lock order in one of them. Also this solves problem №1, but does not solve problem №2. Hm. How about stop_machine? It solves all of the problems. Also, it shouldn't worsen performance, because this action is rare. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
В Чт, 14/05/2015 в 19:52 +0300, Cyrill Gorcunov пишет: In vzctl/libvzctl bundle we restore container like - create ve/$ctid cgroup - move self into this cgroup - run criu from inside So that kernel code passes ve_can_attach test. In turn for our P.Haul project (which is managing live migration) the situation is different -- it opens ve/$ctid but moves criu service pid instead (so that the service will start restore procedure). Which leads to situation where ve_can_attach fails with -EINVAL. Reported-by: Nikita Spiridonov nspirido...@odin.com Signed-off-by: Cyrill Gorcunov gorcu...@odin.com CC: Vladimir Davydov vdavy...@odin.com CC: Konstantin Khorenko khore...@odin.com CC: Pavel Emelyanov xe...@odin.com CC: Andrey Vagin ava...@odin.com --- Guys, could you please take a look, especially from security POV, is it safe to remove all these checks? kernel/ve/ve.c | 31 +-- 1 file changed, 13 insertions(+), 18 deletions(-) Index: linux-pcs7.git/kernel/ve/ve.c === --- linux-pcs7.git.orig/kernel/ve/ve.c +++ linux-pcs7.git/kernel/ve/ve.c @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset) { struct ve_struct *ve = cgroup_ve(cg); - struct task_struct *task = current; - - if (cgroup_taskset_size(tset) != 1 || - cgroup_taskset_first(tset) != task || - !thread_group_leader(task) || - !thread_group_empty(task)) - return -EINVAL; This patch brings a couple of problems. The first one is if we're setting a ve cgroup for a forking task, it's possible the parent and the child fall into different cgroups. For example, parent is inside CT, while child is in ve0. The second problem is that it is racy with all places, where we're doing get_exec_env() and hoping on a stable result. There are a lot of places, we use it, so every place need too be checked. The first problem can be solved using cgroup_postfork() method, while the second is worse. The number of get_exec_env() is big, and it tends to increase. We could check all of them, but an universal solutions will be better. But it seems there is no easy universal solution. We can implement a task lock and use get_exec_env() under it. We can use something like rcu_read_{lock,unlock} braces around get_exec_env(), and use synchronize_rcu() in cgroup_postfork(). The two solutions above are examples what we can do. What do you think? if (ve-is_locked) return -EBUSY; @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup * static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset) { struct ve_struct *ve = cgroup_ve(cg); - struct task_struct *tsk = current; - - /* this probihibts ptracing of task entered to VE from host system */ - if (ve-is_running tsk-mm) - tsk-mm-vps_dumpable = VD_VE_ENTER_TASK; + struct task_struct *tsk; - /* Drop OOM protection. */ - tsk-signal-oom_score_adj = 0; - tsk-signal-oom_score_adj_min = 0; + cgroup_taskset_for_each(tsk, cg, tset) { + /* this probihibts ptracing of task entered to VE from host system */ + if (ve-is_running tsk-mm) + tsk-mm-vps_dumpable = VD_VE_ENTER_TASK; + + /* Drop OOM protection. */ + tsk-signal-oom_score_adj = 0; + tsk-signal-oom_score_adj_min = 0; - /* Leave parent exec domain */ - tsk-parent_exec_id--; + /* Leave parent exec domain */ + tsk-parent_exec_id--; - tsk-task_ve = ve; + tsk-task_ve = ve; + } } static int ve_state_read(struct cgroup *cg, struct cftype *cft, ___ Kirill ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On Tue, Jun 16, 2015 at 06:19:44PM +0300, Kirill Tkhai wrote: This patch brings a couple of problems. The first one is if we're setting a ve cgroup for a forking task, it's possible the parent and the child fall into different cgroups. For example, parent is inside CT, while child is in ve0. The second problem is that it is racy with all places, where we're doing get_exec_env() and hoping on a stable result. There are a lot of places, we use it, so every place need too be checked. The first problem can be solved using cgroup_postfork() method, while the second is worse. The number of get_exec_env() is big, and it tends to increase. We could check all of them, but an universal solutions will be better. But it seems there is no easy universal solution. We can implement a task lock and use get_exec_env() under it. We can use something like rcu_read_{lock,unlock} braces around get_exec_env(), and use synchronize_rcu() in cgroup_postfork(). The two solutions above are examples what we can do. What do you think? May not we simply add into ve cgroup? .can_attach ... spin_lock(task-sighand-siglock); .cance_attach ... spin_unlock(task-sighand-siglock); .attach ... spin_unlock(task-sighand-siglock); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On Tue, Jun 16, 2015 at 07:25:48PM +0300, Kirill Tkhai wrote: May not we simply add into ve cgroup? .can_attach ... spin_lock(task-sighand-siglock); .cance_attach ... spin_unlock(task-sighand-siglock); .attach ... spin_unlock(task-sighand-siglock); It seems sighand lock is not good for this, because cgroup_attach_task() iterating over all subsys: for_each_subsys(root, ss) { if (ss-can_attach) { so we may bumped into wrong lock order in one of them. Also this solves problem №1, but does not solve problem №2. We gonna move only one task in container start time so I think this is not critical in timing. As to #2 -- yes, I think rcu-readlock with sync should do the trick. Also I need to check in details what Vladimir suggested, maybe this will work even better? Hm. How about stop_machine? It solves all of the problems. Also, it shouldn't worsen performance, because this action is rare. iirc stop-machine is a big hammer :/ ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
В Вт, 16/06/2015 в 19:36 +0300, Cyrill Gorcunov пишет: On Tue, Jun 16, 2015 at 07:25:48PM +0300, Kirill Tkhai wrote: May not we simply add into ve cgroup? .can_attach ... spin_lock(task-sighand-siglock); .cance_attach ... spin_unlock(task-sighand-siglock); .attach ... spin_unlock(task-sighand-siglock); It seems sighand lock is not good for this, because cgroup_attach_task() iterating over all subsys: for_each_subsys(root, ss) { if (ss-can_attach) { so we may bumped into wrong lock order in one of them. Also this solves problem №1, but does not solve problem №2. We gonna move only one task in container start time so I think this is not critical in timing. As to #2 -- yes, I think rcu-readlock with sync should do the trick. Also I need to check in details what Vladimir suggested, maybe this will work even better? Maybe it better, but it allows a situation, when parent's task_ve is !ve0, and child's is ve0 for a some time. I'm afraid if there are hidden places, which may work wrong because of this. If we have any problems because of this, the solution is good. Hm. How about stop_machine? It solves all of the problems. Also, it shouldn't worsen performance, because this action is rare. iirc stop-machine is a big hammer :/ ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On Tue, Jun 16, 2015 at 07:44:08PM +0300, Kirill Tkhai wrote: We gonna move only one task in container start time so I think this is not critical in timing. As to #2 -- yes, I think rcu-readlock with sync should do the trick. Also I need to check in details what Vladimir suggested, maybe this will work even better? Maybe it better, but it allows a situation, when parent's task_ve is !ve0, and child's is ve0 for a some time. I'm afraid if there are hidden places, which may work wrong because of this. If we have any problems because of this, the solution is good. OK. Gimme sometime (util tomorrow probably) to think of. This issue not critical at the moment because we know that we're moving one task only (from vzctl). So we can investigate. Hm. How about stop_machine? It solves all of the problems. Also, it shouldn't worsen performance, because this action is rare. iirc stop-machine is a big hammer :/ ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
On 05/14/2015 07:52 PM, Cyrill Gorcunov wrote: In vzctl/libvzctl bundle we restore container like - create ve/$ctid cgroup - move self into this cgroup - run criu from inside So that kernel code passes ve_can_attach test. In turn for our P.Haul project (which is managing live migration) the situation is different -- it opens ve/$ctid but moves criu service pid instead (so that the service will start restore procedure). Which leads to situation where ve_can_attach fails with -EINVAL. Reported-by: Nikita Spiridonov nspirido...@odin.com Signed-off-by: Cyrill Gorcunov gorcu...@odin.com CC: Vladimir Davydov vdavy...@odin.com CC: Konstantin Khorenko khore...@odin.com CC: Pavel Emelyanov xe...@odin.com CC: Andrey Vagin ava...@odin.com --- Guys, could you please take a look, especially from security POV, is it safe to remove all these checks? kernel/ve/ve.c | 31 +-- 1 file changed, 13 insertions(+), 18 deletions(-) Index: linux-pcs7.git/kernel/ve/ve.c === --- linux-pcs7.git.orig/kernel/ve/ve.c +++ linux-pcs7.git/kernel/ve/ve.c @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset) { struct ve_struct *ve = cgroup_ve(cg); - struct task_struct *task = current; - - if (cgroup_taskset_size(tset) != 1 || - cgroup_taskset_first(tset) != task || - !thread_group_leader(task) || - !thread_group_empty(task)) - return -EINVAL; Is this true that without these checks a single thread of a multithread process can enter CT? If no - where is the check for this case? If yes - let's prohibit this. if (ve-is_locked) return -EBUSY; @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup * static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset) { struct ve_struct *ve = cgroup_ve(cg); - struct task_struct *tsk = current; - - /* this probihibts ptracing of task entered to VE from host system */ - if (ve-is_running tsk-mm) - tsk-mm-vps_dumpable = VD_VE_ENTER_TASK; + struct task_struct *tsk; - /* Drop OOM protection. */ - tsk-signal-oom_score_adj = 0; - tsk-signal-oom_score_adj_min = 0; + cgroup_taskset_for_each(tsk, cg, tset) { + /* this probihibts ptracing of task entered to VE from host system */ + if (ve-is_running tsk-mm) + tsk-mm-vps_dumpable = VD_VE_ENTER_TASK; + + /* Drop OOM protection. */ + tsk-signal-oom_score_adj = 0; + tsk-signal-oom_score_adj_min = 0; - /* Leave parent exec domain */ - tsk-parent_exec_id--; + /* Leave parent exec domain */ + tsk-parent_exec_id--; - tsk-task_ve = ve; + tsk-task_ve = ve; + } } static int ve_state_read(struct cgroup *cg, struct cftype *cft, ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups
In vzctl/libvzctl bundle we restore container like - create ve/$ctid cgroup - move self into this cgroup - run criu from inside So that kernel code passes ve_can_attach test. In turn for our P.Haul project (which is managing live migration) the situation is different -- it opens ve/$ctid but moves criu service pid instead (so that the service will start restore procedure). Which leads to situation where ve_can_attach fails with -EINVAL. Reported-by: Nikita Spiridonov nspirido...@odin.com Signed-off-by: Cyrill Gorcunov gorcu...@odin.com CC: Vladimir Davydov vdavy...@odin.com CC: Konstantin Khorenko khore...@odin.com CC: Pavel Emelyanov xe...@odin.com CC: Andrey Vagin ava...@odin.com --- Guys, could you please take a look, especially from security POV, is it safe to remove all these checks? kernel/ve/ve.c | 31 +-- 1 file changed, 13 insertions(+), 18 deletions(-) Index: linux-pcs7.git/kernel/ve/ve.c === --- linux-pcs7.git.orig/kernel/ve/ve.c +++ linux-pcs7.git/kernel/ve/ve.c @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset) { struct ve_struct *ve = cgroup_ve(cg); - struct task_struct *task = current; - - if (cgroup_taskset_size(tset) != 1 || - cgroup_taskset_first(tset) != task || - !thread_group_leader(task) || - !thread_group_empty(task)) - return -EINVAL; if (ve-is_locked) return -EBUSY; @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup * static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset) { struct ve_struct *ve = cgroup_ve(cg); - struct task_struct *tsk = current; - - /* this probihibts ptracing of task entered to VE from host system */ - if (ve-is_running tsk-mm) - tsk-mm-vps_dumpable = VD_VE_ENTER_TASK; + struct task_struct *tsk; - /* Drop OOM protection. */ - tsk-signal-oom_score_adj = 0; - tsk-signal-oom_score_adj_min = 0; + cgroup_taskset_for_each(tsk, cg, tset) { + /* this probihibts ptracing of task entered to VE from host system */ + if (ve-is_running tsk-mm) + tsk-mm-vps_dumpable = VD_VE_ENTER_TASK; + + /* Drop OOM protection. */ + tsk-signal-oom_score_adj = 0; + tsk-signal-oom_score_adj_min = 0; - /* Leave parent exec domain */ - tsk-parent_exec_id--; + /* Leave parent exec domain */ + tsk-parent_exec_id--; - tsk-task_ve = ve; + tsk-task_ve = ve; + } } static int ve_state_read(struct cgroup *cg, struct cftype *cft, ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel