[Devel] Re: [RFD] reboot / shutdown of a container

2011-01-14 Thread Daniel Lezcano
On 01/15/2011 12:11 AM, Bruno Prémont wrote:
> On Thu, 13 January 2011 Daniel Lezcano  wrote:
>> On 01/13/2011 10:50 PM, Bruno Prémont wrote:
>>> On Thu, 13 January 2011 Daniel Lezcano   wrote:
 On 01/13/2011 09:09 PM, Bruno Prémont wrote:
> On Thu, 13 January 2011 Daniel Lezcanowrote:
>> in the container implementation, we are facing the problem of a process
>> calling the sys_reboot syscall which of course makes the host to
>> poweroff/reboot.
>>
>> If we drop the cap_sys_reboot capability, sys_reboot fails and the
>> container reach a shutdown state but the init process stay there, hence
>> the container becomes stuck waiting indefinitely the process '1' to exit.
>>
>> The current implementation to make the shutdown / reboot of the
>> container to work is we watch, from a process outside of the container,
>> the/var/run/utmp file and check the runlevel each time the file
>> changes. When the 'reboot' or 'shutdown' level is detected, we wait for
>> a single remaining in the container and then we kill it.
>>
>> That works but this is not efficient in case of a large number of
>> containers as we will have to watch a lot of utmp files. In addition,
>> the /var/run directory must *not* mounted as tmpfs in the distro.
>> Unfortunately, it is the default setup on most of the distros and tends
>> to generalize. That implies, the rootfs init's scripts must be modified
>> for the container when we put in place its rootfs and as /var/run is
>> supposed to be a tmpfs, most of the applications do not cleanup the
>> directory, so we need to add extra services to wipeout the files.
>>
>> More problems arise when we do an upgrade of the distro inside the
>> container, because all the setup we made at creation time will be lost.
>> The upgrade overwrite the scripts, the fstab and so on.
>>
>> We did what was possible to solve the problem from userspace but we
>> reach always a limit because there are different implementations of the
>> 'init' process and the init's scripts differ from a distro to another
>> and the same with the versions.
>>
>> We think this problem can only be solved from the kernel.
>>
>> The idea was to send a signal SIGPWR to the parent of the pid '1' of the
>> pid namespace when the sys_reboot is called. Of course that won't occur
>> for the init pid namespace.
> Wouldn't sending SIGKILL to the pid '1' process of the originating PID
> namespace be sufficient (that would trigger a SIGCHLD for the parent
> process in the outer PID namespace.
 This is already the case. The question is : when do we send this signal ?
 We have to wait for the container system shutdown before killing it.
>>> I meant that sys_reboot() would kill the namespace's init if it's not
>>> called from boot namespace.
>>>
>>> See below
>>>
> (as far as I remember the PID namespace is killed when its 'init' exits,
> if this is not the case all other processes in the given namespace would
> have to be killed as well)
 Yes, absolutely but this is not the point, reaping the container is not
 a problem.

 What we are trying to achieve is to shutdown properly the container from
 inside (from outside will be possible too with the setns syscall).

 Assuming the process '1234' creates a new process in a new namespace set
 and wait for it.

 The new process '1' will exec /sbin/init and the system will boot up.
 But, when the system is shutdown or rebooted, after the down scripts are
 executed the kill -15 -1 will be invoked, killing all the processes
 expect the process '1' and the caller. This one will then call
 'sys_reboot' and exit. Hence we still have the init process idle and its
 parent '1234' waiting for it to die.
>>> This call to sys_reboot() would kill "new process '1'" instead of trying to
>>> operate on the HW box.
>>> This also has the advantage that a container would not require an informed
>>> parent "monitoring" it from outside (though it would not be restarted even 
>>> if
>>> requested without such informed outside parent).
>> Oh, ok. Sorry I misunderstood.
>>
>> Yes, that could be better than crossing the namespace boundaries.
>>
 If we are able to receive the information in the process '1234' : "the
 sys_reboot was called in the child pid namespace", we can take then kill
 our child pid.  If this information is raised via a signal sent by the
 kernel with the proper information in the siginfo_t (eg. si_code
 contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ... ), the
 solution will be generic for all the shutdown/reboot of any kind of
 container and init version.
>>> Could this be passed for a SIGCHLD? (when namespace is reaped, and received
>>> by 1234 from above example assuming sys_reboot() kills the "new process 
>>> '1'")
>> Y

[Devel] Re: [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-14 Thread Serge E. Hallyn
Quoting Bastian Blank (bast...@waldi.eu.org):
> On Tue, Jan 11, 2011 at 01:31:52AM +, Serge E. Hallyn wrote:
> > Quoting Bastian Blank (bast...@waldi.eu.org):
> > > What is this flag used for anyway? I only see it used in the accounting
> > > stuff, and if every user can get it, it is not longer useful.
> > hm, I'm not sure...  maybe noone is using it!
> 
> This flag is from pre-git.
> 
> The only information is:
> | #define ASU0x02/* ... used super-user privileges */
> 
> However with your patches (or at least the goal), everyone is super-user
> in derived namespaces.

No, a task just sitting in a derived ns won't necessarily need/use
super-user privileges...  (and, if we ever get far enough along, it
won't even necessarily have CAP_SYS_ADMIN/etc targeted to the parent
userns, bc it won't need those to do the unshares).

thanks,
-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: linux-cr: v23-rc1 pushed

2011-01-14 Thread Oren Laadan


On 01/14/2011 01:59 AM, Sukadev Bhattiprolu wrote:
> Oren Laadan [or...@cs.columbia.edu] wrote:
> | Folks,
> | 
> | I just pushed out a new v23-rc1 branch of linux-cr. This one is
> | rebased to 2.6.37, and contains nearly all the patches pulled
> | on v22-dev. I only gave it a brief test drive... feel free to 
> | throw all your ammo it.
> 
> Oren,
> 
> We need the file_tty() helper to get the tty object from the file pointer
> (otherwise we will be off by 4 bytes and fail tty_paranoia_check() in
> tty_file_checkpoint()).

Ok, applied. Thanks,

Oren.

> 
> Thanks,
> 
> Sukadev
> 
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index c89f055..6aa458e 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -2781,7 +2784,7 @@ static int tty_file_checkpoint(struct ckpt_ctx *ctx, 
> struc
> int master_objref, slave_objref;
> int ret;
>  
> -   tty = (struct tty_struct *)file->private_data;
> +   tty = file_tty(file);
> inode = file->f_path.dentry->d_inode;
> if (tty_paranoia_check(tty, inode, "tty_file_checkpoint"))
> return -EIO;
> 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 6/7] user namespaces: convert all capable checks in kernel/sys.c

2011-01-14 Thread Bastian Blank
On Tue, Jan 11, 2011 at 05:27:59AM +, Serge E. Hallyn wrote:
> Quoting Bastian Blank (bast...@waldi.eu.org):
> > On Mon, Jan 10, 2011 at 09:14:07PM +, Serge E. Hallyn wrote:
> > > - if (pcred->uid  != cred->euid &&
> > > - pcred->euid != cred->euid && !capable(CAP_SYS_NICE)) {
> > > + if (pcred->user->user_ns != cred->user->user_ns &&
> > > + pcred->uid  != cred->euid &&
> > > + pcred->euid != cred->euid &&
> > > + !ns_capable(pcred->user->user_ns, CAP_SYS_NICE)) {
> > 
> > I don't think this is correct. This would not error out if the both
> > userns are the same. Because the same patern (check uid if same userns,
> > otherwise only capability) shows up in several parts of the code, maybe
> > this should be factored out.
> 
> Yeah, I'd really like to factor this out because it shows up everywhere
> and I have to think about it every time I look at it.  But each time it
> shows up, the uids being compared slightly change.  There must be some
> clever way of doing it, hopefully it'll fall out soon.

Well, then make mostly identical (_inline_) functions in one location
(include/linux/cred.h comes in mind).  You can ask later why they have
to be different.

You are scaling the complexity up. So you need to make it somehow
manageable, and even slightly different versions in one place are much
easier to handle than the same in many different places.

kill_ok_by_cred would be:
cred_check_euid_suid(struct task_struct *p, X capable)

set_one_prio_perm would be:
cred_check_euid_euid(struct task_struct *p, X capable)

Bastian

-- 
"Life and death are seldom logical."
"But attaining a desired goal always is."
-- McCoy and Spock, "The Galileo Seven", stardate 2821.7
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 05/08] Allow ptrace from non-init user namespaces

2011-01-14 Thread Bastian Blank
On Fri, Jan 14, 2011 at 03:46:09PM +0100, Bastian Blank wrote:
> On Tue, Jan 11, 2011 at 06:44:39AM +, Serge E. Hallyn wrote:
> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > index 99bbaa3..ec7605d 100644
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -134,21 +134,24 @@ int __ptrace_may_access(struct task_struct *task, 
> > unsigned int mode)
> > return 0;
> > rcu_read_lock();
> > tcred = __task_cred(task);
> > -   if ((cred->uid != tcred->euid ||
> > -cred->uid != tcred->suid ||
> > -cred->uid != tcred->uid  ||
> > -cred->gid != tcred->egid ||
> > -cred->gid != tcred->sgid ||
> > -cred->gid != tcred->gid) &&
> > -   !capable(CAP_SYS_PTRACE)) {
> > -   rcu_read_unlock();
> > -   return -EPERM;
> > -   }
> > +   if (cred->user->user_ns == tcred->user->user_ns &&
> > +   (cred->uid == tcred->euid &&
> > +cred->uid == tcred->suid &&
> > +cred->uid == tcred->uid  &&
> > +cred->gid == tcred->egid &&
> > +cred->gid == tcred->sgid &&
> > +cred->gid == tcred->gid))
> > +   goto ok;
> > +   if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> > +   goto ok;
> > +   rcu_read_unlock();
> > +   return -EPERM;
> > +ok:
> 
> This is wrong.

Whoops, it _is_ right. However the nested parenthes is unnecessary and
can lead to other conclusions.

Bastian

-- 
Lots of people drink from the wrong bottle sometimes.
-- Edith Keeler, "The City on the Edge of Forever",
   stardate unknown
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/7] allow killing tasks in your own or child userns

2011-01-14 Thread Bastian Blank
On Tue, Jan 11, 2011 at 01:31:52AM +, Serge E. Hallyn wrote:
> Quoting Bastian Blank (bast...@waldi.eu.org):
> > What is this flag used for anyway? I only see it used in the accounting
> > stuff, and if every user can get it, it is not longer useful.
> hm, I'm not sure...  maybe noone is using it!

This flag is from pre-git.

The only information is:
| #define ASU0x02/* ... used super-user privileges */

However with your patches (or at least the goal), everyone is super-user
in derived namespaces.

Bastian

-- 
Where there's no emotion, there's no motive for violence.
-- Spock, "Dagger of the Mind", stardate 2715.1
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 05/08] Allow ptrace from non-init user namespaces

2011-01-14 Thread Bastian Blank
This mail did not show up in my inbox, even if I'm listed as receipient.
I only got a copy via the mailinglist.

On Tue, Jan 11, 2011 at 06:44:39AM +, Serge E. Hallyn wrote:
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 99bbaa3..ec7605d 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -134,21 +134,24 @@ int __ptrace_may_access(struct task_struct *task, 
> unsigned int mode)
>   return 0;
>   rcu_read_lock();
>   tcred = __task_cred(task);
> - if ((cred->uid != tcred->euid ||
> -  cred->uid != tcred->suid ||
> -  cred->uid != tcred->uid  ||
> -  cred->gid != tcred->egid ||
> -  cred->gid != tcred->sgid ||
> -  cred->gid != tcred->gid) &&
> - !capable(CAP_SYS_PTRACE)) {
> - rcu_read_unlock();
> - return -EPERM;
> - }
> + if (cred->user->user_ns == tcred->user->user_ns &&
> + (cred->uid == tcred->euid &&
> +  cred->uid == tcred->suid &&
> +  cred->uid == tcred->uid  &&
> +  cred->gid == tcred->egid &&
> +  cred->gid == tcred->sgid &&
> +  cred->gid == tcred->gid))
> + goto ok;
> + if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> + goto ok;
> + rcu_read_unlock();
> + return -EPERM;
> +ok:

This is wrong. Please move that out into functions if you are unable to
get the conditions right.

Bastian

-- 
If a man had a child who'd gone anti-social, killed perhaps, he'd still
tend to protect that child.
-- McCoy, "The Ultimate Computer", stardate 4731.3
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel