Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-19 Thread Kirill Tkhai
В Чт, 18/06/2015 в 21:26 +0300, Cyrill Gorcunov пишет:
 On Tue, Jun 16, 2015 at 07:51:52PM +0300, Cyrill Gorcunov wrote:
   
   If we have any problems because of this, the solution is good.
  
  OK. Gimme sometime (util tomorrow probably) to think of. This issue
  not critical at the moment because we know that we're moving one
  task only (from vzctl). So we can investigate.
 
 Kirill, you know I think Vladimir's proposal is the best option here.
 Yes there is a window when task_ve is not yet updated but ve interface
 is special and supposed to be run in a predefined way (ie moving
 caller of container's init [read vzctl] should be done in a forkless
 manner). So I think we can trade this off for a simplier solution,
 right? Also maybe we should add some check for creds thus arbitrary
 userspace apps wont be moved here and there. If there some other
 way -- please share (rcu for get-exec-env still look woth to add).

Ok, I have no objections. The only thing is we need to carefully
use direct task_ve in the future. All current place, where we use
it, are safe.

Kirill

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-19 Thread Cyrill Gorcunov
On Fri, Jun 19, 2015 at 01:15:31PM +0300, Kirill Tkhai wrote:
 
 Ok, I have no objections. The only thing is we need to carefully
 use direct task_ve in the future. All current place, where we use
 it, are safe.

Sure. I'll prepare the patch once time permit.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-18 Thread Cyrill Gorcunov
On Tue, Jun 16, 2015 at 07:51:52PM +0300, Cyrill Gorcunov wrote:
  
  If we have any problems because of this, the solution is good.
 
 OK. Gimme sometime (util tomorrow probably) to think of. This issue
 not critical at the moment because we know that we're moving one
 task only (from vzctl). So we can investigate.

Kirill, you know I think Vladimir's proposal is the best option here.
Yes there is a window when task_ve is not yet updated but ve interface
is special and supposed to be run in a predefined way (ie moving
caller of container's init [read vzctl] should be done in a forkless
manner). So I think we can trade this off for a simplier solution,
right? Also maybe we should add some check for creds thus arbitrary
userspace apps wont be moved here and there. If there some other
way -- please share (rcu for get-exec-env still look woth to add).
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Vladimir Davydov
On Tue, Jun 16, 2015 at 06:19:44PM +0300, Kirill Tkhai wrote:
 This patch brings a couple of problems. The first one is if we're setting
 a ve cgroup for a forking task, it's possible the parent and the child
 fall into different cgroups. For example, parent is inside CT, while
 child is in ve0.
 
 The second problem is that it is racy with all places, where we're doing
 get_exec_env() and hoping on a stable result. There are a lot of places,
 we use it, so every place need too be checked.
 
 The first problem can be solved using cgroup_postfork() method, while
 the second is worse. The number of get_exec_env() is big, and it tends
 to increase.  We could check all of them, but an universal solutions
 will be better. But it seems there is no easy universal solution.
 
 We can implement a task lock and use get_exec_env() under it. 
 We can use something like rcu_read_{lock,unlock} braces around get_exec_env(),
 and use synchronize_rcu() in cgroup_postfork().

Alternatively, we could set task_ve in a task_work. See
kernel/bc/beancounter.c for an example.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Pavel Emelyanov
 Hm. How about stop_machine? 

Bad idea.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Kirill Tkhai
В Вт, 16/06/2015 в 19:07 +0300, Cyrill Gorcunov пишет:
 On Tue, Jun 16, 2015 at 06:19:44PM +0300, Kirill Tkhai wrote:
  
  This patch brings a couple of problems. The first one is if we're setting
  a ve cgroup for a forking task, it's possible the parent and the child
  fall into different cgroups. For example, parent is inside CT, while
  child is in ve0.
  
  The second problem is that it is racy with all places, where we're doing
  get_exec_env() and hoping on a stable result. There are a lot of places,
  we use it, so every place need too be checked.
  
  The first problem can be solved using cgroup_postfork() method, while
  the second is worse. The number of get_exec_env() is big, and it tends
  to increase.  We could check all of them, but an universal solutions
  will be better. But it seems there is no easy universal solution.
  
  We can implement a task lock and use get_exec_env() under it. 
  We can use something like rcu_read_{lock,unlock} braces around 
  get_exec_env(),
  and use synchronize_rcu() in cgroup_postfork().
  
  The two solutions above are examples what we can do. What do you think?
 
 May not we simply add into ve cgroup?
 
 .can_attach
   ...
   spin_lock(task-sighand-siglock);
 .cance_attach
   ...
   spin_unlock(task-sighand-siglock);
 .attach
   ...
   spin_unlock(task-sighand-siglock);

It seems sighand lock is not good for this, because cgroup_attach_task()
iterating over all subsys:

for_each_subsys(root, ss) {
if (ss-can_attach) {

so we may bumped into wrong lock order in one of them. Also this solves
problem №1, but does not solve problem №2.

Hm. How about stop_machine? It solves all of the problems. Also, it shouldn't
worsen performance, because this action is rare.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Kirill Tkhai
В Чт, 14/05/2015 в 19:52 +0300, Cyrill Gorcunov пишет:
 In vzctl/libvzctl bundle we restore container like
 
  - create ve/$ctid cgroup
  - move self into this cgroup
  - run criu from inside
 
 So that kernel code passes ve_can_attach test. In turn for
 our P.Haul project (which is managing live migration) the
 situation is different -- it opens ve/$ctid but moves
 criu service pid instead (so that the service will
 start restore procedure). Which leads to situation
 where ve_can_attach fails with -EINVAL.
 
 Reported-by: Nikita Spiridonov nspirido...@odin.com
 Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
 CC: Vladimir Davydov vdavy...@odin.com
 CC: Konstantin Khorenko khore...@odin.com
 CC: Pavel Emelyanov xe...@odin.com
 CC: Andrey Vagin ava...@odin.com
 ---
 
 Guys, could you please take a look, especially from
 security POV, is it safe to remove all these checks?
 
  kernel/ve/ve.c |   31 +--
  1 file changed, 13 insertions(+), 18 deletions(-)
 
 Index: linux-pcs7.git/kernel/ve/ve.c
 ===
 --- linux-pcs7.git.orig/kernel/ve/ve.c
 +++ linux-pcs7.git/kernel/ve/ve.c
 @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg
  static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
  {
   struct ve_struct *ve = cgroup_ve(cg);
 - struct task_struct *task = current;
 -
 - if (cgroup_taskset_size(tset) != 1 ||
 - cgroup_taskset_first(tset) != task ||
 - !thread_group_leader(task) ||
 - !thread_group_empty(task))
 - return -EINVAL;

This patch brings a couple of problems. The first one is if we're setting
a ve cgroup for a forking task, it's possible the parent and the child
fall into different cgroups. For example, parent is inside CT, while
child is in ve0.

The second problem is that it is racy with all places, where we're doing
get_exec_env() and hoping on a stable result. There are a lot of places,
we use it, so every place need too be checked.

The first problem can be solved using cgroup_postfork() method, while
the second is worse. The number of get_exec_env() is big, and it tends
to increase.  We could check all of them, but an universal solutions
will be better. But it seems there is no easy universal solution.

We can implement a task lock and use get_exec_env() under it. 
We can use something like rcu_read_{lock,unlock} braces around get_exec_env(),
and use synchronize_rcu() in cgroup_postfork().

The two solutions above are examples what we can do. What do you think?

  
   if (ve-is_locked)
   return -EBUSY;
 @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup *
  static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
  {
   struct ve_struct *ve = cgroup_ve(cg);
 - struct task_struct *tsk = current;
 -
 - /* this probihibts ptracing of task entered to VE from host system */
 - if (ve-is_running  tsk-mm)
 - tsk-mm-vps_dumpable = VD_VE_ENTER_TASK;
 + struct task_struct *tsk;
  
 - /* Drop OOM protection. */
 - tsk-signal-oom_score_adj = 0;
 - tsk-signal-oom_score_adj_min = 0;
 + cgroup_taskset_for_each(tsk, cg, tset) {
 + /* this probihibts ptracing of task entered to VE from host 
 system */
 + if (ve-is_running  tsk-mm)
 + tsk-mm-vps_dumpable = VD_VE_ENTER_TASK;
 +
 + /* Drop OOM protection. */
 + tsk-signal-oom_score_adj = 0;
 + tsk-signal-oom_score_adj_min = 0;
  
 - /* Leave parent exec domain */
 - tsk-parent_exec_id--;
 + /* Leave parent exec domain */
 + tsk-parent_exec_id--;
  
 - tsk-task_ve = ve;
 + tsk-task_ve = ve;
 + }
  }
  
  static int ve_state_read(struct cgroup *cg, struct cftype *cft,
 ___

Kirill

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Cyrill Gorcunov
On Tue, Jun 16, 2015 at 06:19:44PM +0300, Kirill Tkhai wrote:
 
 This patch brings a couple of problems. The first one is if we're setting
 a ve cgroup for a forking task, it's possible the parent and the child
 fall into different cgroups. For example, parent is inside CT, while
 child is in ve0.
 
 The second problem is that it is racy with all places, where we're doing
 get_exec_env() and hoping on a stable result. There are a lot of places,
 we use it, so every place need too be checked.
 
 The first problem can be solved using cgroup_postfork() method, while
 the second is worse. The number of get_exec_env() is big, and it tends
 to increase.  We could check all of them, but an universal solutions
 will be better. But it seems there is no easy universal solution.
 
 We can implement a task lock and use get_exec_env() under it. 
 We can use something like rcu_read_{lock,unlock} braces around get_exec_env(),
 and use synchronize_rcu() in cgroup_postfork().
 
 The two solutions above are examples what we can do. What do you think?

May not we simply add into ve cgroup?

.can_attach
...
spin_lock(task-sighand-siglock);
.cance_attach
...
spin_unlock(task-sighand-siglock);
.attach
...
spin_unlock(task-sighand-siglock);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Cyrill Gorcunov
On Tue, Jun 16, 2015 at 07:25:48PM +0300, Kirill Tkhai wrote:
  
  May not we simply add into ve cgroup?
  
  .can_attach
  ...
  spin_lock(task-sighand-siglock);
  .cance_attach
  ...
  spin_unlock(task-sighand-siglock);
  .attach
  ...
  spin_unlock(task-sighand-siglock);
 
 It seems sighand lock is not good for this, because cgroup_attach_task()
 iterating over all subsys:
 
 for_each_subsys(root, ss) {
 if (ss-can_attach) {
 
 so we may bumped into wrong lock order in one of them. Also this solves
 problem №1, but does not solve problem №2.

We gonna move only one task in container start time so I think
this is not critical in timing.

As to #2 -- yes, I think rcu-readlock with sync should do the trick.
Also I need to check in details what Vladimir suggested, maybe this
will work even better?

 
 Hm. How about stop_machine? It solves all of the problems. Also, it shouldn't
 worsen performance, because this action is rare.

iirc stop-machine is a big hammer :/
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Kirill Tkhai
В Вт, 16/06/2015 в 19:36 +0300, Cyrill Gorcunov пишет:
 On Tue, Jun 16, 2015 at 07:25:48PM +0300, Kirill Tkhai wrote:
   
   May not we simply add into ve cgroup?
   
   .can_attach
 ...
 spin_lock(task-sighand-siglock);
   .cance_attach
 ...
 spin_unlock(task-sighand-siglock);
   .attach
 ...
 spin_unlock(task-sighand-siglock);
  
  It seems sighand lock is not good for this, because cgroup_attach_task()
  iterating over all subsys:
  
  for_each_subsys(root, ss) {
  if (ss-can_attach) {
  
  so we may bumped into wrong lock order in one of them. Also this solves
  problem №1, but does not solve problem №2.
 
 We gonna move only one task in container start time so I think
 this is not critical in timing.
 
 As to #2 -- yes, I think rcu-readlock with sync should do the trick.
 Also I need to check in details what Vladimir suggested, maybe this
 will work even better?

Maybe it better, but it allows a situation, when parent's task_ve is
!ve0, and child's is ve0 for a some time. I'm afraid if there are hidden
places, which may work wrong because of this.

If we have any problems because of this, the solution is good.
 
  Hm. How about stop_machine? It solves all of the problems. Also, it 
  shouldn't
  worsen performance, because this action is rare.
 
 iirc stop-machine is a big hammer :/

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-06-16 Thread Cyrill Gorcunov
On Tue, Jun 16, 2015 at 07:44:08PM +0300, Kirill Tkhai wrote:
  
  We gonna move only one task in container start time so I think
  this is not critical in timing.
  
  As to #2 -- yes, I think rcu-readlock with sync should do the trick.
  Also I need to check in details what Vladimir suggested, maybe this
  will work even better?
 
 Maybe it better, but it allows a situation, when parent's task_ve is
 !ve0, and child's is ve0 for a some time. I'm afraid if there are hidden
 places, which may work wrong because of this.
 
 If we have any problems because of this, the solution is good.

OK. Gimme sometime (util tomorrow probably) to think of. This issue
not critical at the moment because we know that we're moving one
task only (from vzctl). So we can investigate.

  
   Hm. How about stop_machine? It solves all of the problems. Also, it 
   shouldn't
   worsen performance, because this action is rare.
  
  iirc stop-machine is a big hammer :/
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-05-18 Thread Konstantin Khorenko
On 05/14/2015 07:52 PM, Cyrill Gorcunov wrote:
 In vzctl/libvzctl bundle we restore container like
 
  - create ve/$ctid cgroup
  - move self into this cgroup
  - run criu from inside
 
 So that kernel code passes ve_can_attach test. In turn for
 our P.Haul project (which is managing live migration) the
 situation is different -- it opens ve/$ctid but moves
 criu service pid instead (so that the service will
 start restore procedure). Which leads to situation
 where ve_can_attach fails with -EINVAL.
 
 Reported-by: Nikita Spiridonov nspirido...@odin.com
 Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
 CC: Vladimir Davydov vdavy...@odin.com
 CC: Konstantin Khorenko khore...@odin.com
 CC: Pavel Emelyanov xe...@odin.com
 CC: Andrey Vagin ava...@odin.com
 ---
 
 Guys, could you please take a look, especially from
 security POV, is it safe to remove all these checks?
 
  kernel/ve/ve.c |   31 +--
  1 file changed, 13 insertions(+), 18 deletions(-)
 
 Index: linux-pcs7.git/kernel/ve/ve.c
 ===
 --- linux-pcs7.git.orig/kernel/ve/ve.c
 +++ linux-pcs7.git/kernel/ve/ve.c
 @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg
  static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
  {
   struct ve_struct *ve = cgroup_ve(cg);
 - struct task_struct *task = current;
 -
 - if (cgroup_taskset_size(tset) != 1 ||
 - cgroup_taskset_first(tset) != task ||
 - !thread_group_leader(task) ||
 - !thread_group_empty(task))
 - return -EINVAL;

Is this true that without these checks a single thread of a multithread process 
can enter CT?
If no - where is the check for this case?
If yes - let's prohibit this.

   if (ve-is_locked)
   return -EBUSY;
 @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup *
  static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
  {
   struct ve_struct *ve = cgroup_ve(cg);
 - struct task_struct *tsk = current;
 -
 - /* this probihibts ptracing of task entered to VE from host system */
 - if (ve-is_running  tsk-mm)
 - tsk-mm-vps_dumpable = VD_VE_ENTER_TASK;
 + struct task_struct *tsk;
  
 - /* Drop OOM protection. */
 - tsk-signal-oom_score_adj = 0;
 - tsk-signal-oom_score_adj_min = 0;
 + cgroup_taskset_for_each(tsk, cg, tset) {
 + /* this probihibts ptracing of task entered to VE from host 
 system */
 + if (ve-is_running  tsk-mm)
 + tsk-mm-vps_dumpable = VD_VE_ENTER_TASK;
 +
 + /* Drop OOM protection. */
 + tsk-signal-oom_score_adj = 0;
 + tsk-signal-oom_score_adj_min = 0;
  
 - /* Leave parent exec domain */
 - tsk-parent_exec_id--;
 + /* Leave parent exec domain */
 + tsk-parent_exec_id--;
  
 - tsk-task_ve = ve;
 + tsk-task_ve = ve;
 + }
  }
  
  static int ve_state_read(struct cgroup *cg, struct cftype *cft,
 ___
 Devel mailing list
 Devel@openvz.org
 https://lists.openvz.org/mailman/listinfo/devel
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-05-14 Thread Cyrill Gorcunov
In vzctl/libvzctl bundle we restore container like

 - create ve/$ctid cgroup
 - move self into this cgroup
 - run criu from inside

So that kernel code passes ve_can_attach test. In turn for
our P.Haul project (which is managing live migration) the
situation is different -- it opens ve/$ctid but moves
criu service pid instead (so that the service will
start restore procedure). Which leads to situation
where ve_can_attach fails with -EINVAL.

Reported-by: Nikita Spiridonov nspirido...@odin.com
Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
CC: Vladimir Davydov vdavy...@odin.com
CC: Konstantin Khorenko khore...@odin.com
CC: Pavel Emelyanov xe...@odin.com
CC: Andrey Vagin ava...@odin.com
---

Guys, could you please take a look, especially from
security POV, is it safe to remove all these checks?

 kernel/ve/ve.c |   31 +--
 1 file changed, 13 insertions(+), 18 deletions(-)

Index: linux-pcs7.git/kernel/ve/ve.c
===
--- linux-pcs7.git.orig/kernel/ve/ve.c
+++ linux-pcs7.git/kernel/ve/ve.c
@@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg
 static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 {
struct ve_struct *ve = cgroup_ve(cg);
-   struct task_struct *task = current;
-
-   if (cgroup_taskset_size(tset) != 1 ||
-   cgroup_taskset_first(tset) != task ||
-   !thread_group_leader(task) ||
-   !thread_group_empty(task))
-   return -EINVAL;
 
if (ve-is_locked)
return -EBUSY;
@@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup *
 static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 {
struct ve_struct *ve = cgroup_ve(cg);
-   struct task_struct *tsk = current;
-
-   /* this probihibts ptracing of task entered to VE from host system */
-   if (ve-is_running  tsk-mm)
-   tsk-mm-vps_dumpable = VD_VE_ENTER_TASK;
+   struct task_struct *tsk;
 
-   /* Drop OOM protection. */
-   tsk-signal-oom_score_adj = 0;
-   tsk-signal-oom_score_adj_min = 0;
+   cgroup_taskset_for_each(tsk, cg, tset) {
+   /* this probihibts ptracing of task entered to VE from host 
system */
+   if (ve-is_running  tsk-mm)
+   tsk-mm-vps_dumpable = VD_VE_ENTER_TASK;
+
+   /* Drop OOM protection. */
+   tsk-signal-oom_score_adj = 0;
+   tsk-signal-oom_score_adj_min = 0;
 
-   /* Leave parent exec domain */
-   tsk-parent_exec_id--;
+   /* Leave parent exec domain */
+   tsk-parent_exec_id--;
 
-   tsk-task_ve = ve;
+   tsk-task_ve = ve;
+   }
 }
 
 static int ve_state_read(struct cgroup *cg, struct cftype *cft,
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel