Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Eric W. Biederman

"Serge E. Hallyn"  writes:
>> I was aware of FUSE but hadn't ever looked at it much. Looking at it
>> now, this isn't going to satisfy any of the use cases I know about,
>> which are wanting to use filesystems supported in-kernel (isofs, ext*).
>> I don't see that any of these have a FUSE implementation, and I think we
>> gain more from figuring out how to use in-kernel filesystems in
>> containers than trying to find a way to shoehorn selected filesystems
>> into FUSE.
>
> That's why I was wondering how much work it would be to auto-generate
> fuse fs support from the in-kernel source.

So at a quick look I have found fuseext2, fuseiso and mountlo-0.5 (which
claims to have supported all the in-kernel filesystems with the help of
user mode linux).

Given that the first two are just an apt-get install away, fuse really
looks like the shortest path to being able to mount an iso and do other
interesting things.
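
As a rough illustration of that path, here is a minimal Python sketch. It
assumes the fuseiso and fusermount binaries are installed and on PATH; the
helper names and the example paths are made up for this sketch. By default
it only prints the commands instead of running them.

```python
import subprocess


def fuseiso_mount_cmd(image, mountpoint):
    """Build the argv for mounting an ISO image with fuseiso."""
    return ["fuseiso", image, mountpoint]


def fuse_unmount_cmd(mountpoint):
    """Build the argv for unmounting a FUSE mount with fusermount -u."""
    return ["fusermount", "-u", mountpoint]


def run(cmd, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    run(fuseiso_mount_cmd("install.iso", "/tmp/iso"))
    run(fuse_unmount_cmd("/tmp/iso"))
```

Passing dry_run=False would actually perform the mount as the calling user,
provided the mount point exists and FUSE is usable by that user.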

We probably want something more, but only when performance becomes a
bottleneck.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Serge Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Mon, 2014-05-26 at 00:24 +0200, Serge E. Hallyn wrote:
> > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> > > > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  
> > > > > > >> wrote:
> > > > > > >>> 
> > > > > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > > > > >  On 15.05.2014 at 21:50, Serge Hallyn wrote:
> > > > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > > > > > >>  wrote:
> > > > > > >>> Then don't use a container to build such a thing, or fix 
> > > > > > >>> the build scripts to not do that :)
> > > > > > >> 
> > > > > > >> I second this. To me it looks like some folks try to (ab)use 
> > > > > > >> Linux containers for purposes where KVM
> > > > > > >> would much better fit in. Please don't put more complexity 
> > > > > > > >> into containers. They are already horribly
> > > > > > >> complex and error prone.
> > > > > > > 
> > > > > > > I, naturally, disagree :)  The only use case which is 
> > > > > > > inherently not valid for containers is running a
> > > > > > > kernel.  Practically speaking there are other things which 
> > > > > > > likely will never be possible, but if someone 
> > > > > > > offers a way to do something in containers, "you can't do 
> > > > > > > that in containers" is not an apropos response.
> > > > > > > 
> > > > > > > "That abstraction is wrong" is certainly valid, as when vpids 
> > > > > > > were originally proposed and rejected,
> > > > > > > resulting in the development of pid namespaces.  "We have to 
> > > > > > > work out (x) first" can be valid (and I can
> > > > > > > think of examples here), assuming it's not just trying to 
> > > > > > > hide behind a catch-22/chicken-egg problem.
> > > > > > > 
> > > > > > > Finally, saying "containers are complex and error prone" is 
> > > > > > > conflating several large suites of userspace
> > > > > > > code and many kernel features which support them.  Being more 
> > > > > > > precise would, if the argument is valid, lend
> > > > > > > it a lot more weight.
> > > > > >  
> > > > > >  We (my company) use Linux containers since 2011 in production. 
> > > > > >  First LXC, now libvirt-lxc. To understand the
> > > > > >  internals better I also wrote my own userspace to create/start 
> > > > > >  containers. There are so many things which can
> > > > > >  hurt you badly. With user namespaces we expose a really big 
> > > > > >  attack surface to regular users. I.e. Suddenly a
> > > > > >  user is allowed to mount filesystems.
> > > > > > >>> 
> > > > > > >>> That is currently not the case.  They can mount some virtual 
> > > > > > >>> filesystems and do bind mounts, but cannot mount
> > > > > > >>> most real filesystems.  This keeps us protected (for now) from 
> > > > > > >>> potentially unsafe superblock readers in the 
> > > > > > >>> kernel.
> > > > > > >>> 
> > > > > >  Ask Andy, he found already lots of nasty things...
> > > > > > >> 
> > > > > > >> I don't think I have anything brilliant to add to this 
> > > > > > >> discussion right now, except possibly:
> > > > > > >> 
> > > > > > >> ISTM that Linux distributions are, in general, vulnerable to all 
> > > > > > >> kinds of shenanigans that would happen if an
> > > > > > >> untrusted user can cause a block device to appear.  That user 
> > > > > > >> doesn't need permission to mount it
> > > > > > > 
> > > > > > > Interesting point.  This would further suggest that we absolutely 
> > > > > > > must ensure that a loop device which shows up in
> > > > > > > the container does not also show up in the host.
> > > > > > 
> > > > > > Can I suggest the usage of the devices cgroup to achieve that?
> > > > > 
> > > > > Not really ... cgroups impose resource limits, it's namespaces that
> > > > > impose visibility separations.  In theory this can be done with the
> > > > > device namespace that's been proposed; however, a simpler way is 
> > > > > simply
> > > > > to rm the device node in the host and mknod it in the guest.  I don't
> > > > > really see host visibility as a huge problem: in a shared OS
> > > > > virtualisation it's not really possible securely to separate the guest
> > > > > from the host (only vice versa).
> > > > > 
> > > > > But I really don't think we want to do it this way.  Giving a 
> > > > > container
> > > > > the ability to do a mount is too dangerous.  What we want to do is
> > > > > int

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Serge E. Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
> On Fri, May 23, 2014 at 03:23:50PM -0700, Eric W. Biederman wrote:
> > Serge Hallyn  writes:
> > 
> > > Quoting Eric W. Biederman (ebied...@xmission.com):
> > >> 
> > >> 
> > >> >> Ultimately the technical challenge is how do we create a block device
> > >> >> that is safe for a user who does not have any capabilities to use, and
> > >> >> what can we do with that block device to make it useful.
> > >> >
> > >> > Yes, and I'd like to get started solving those challenges. But I also
> > >> > don't think we can address these two points (support partition blkdevs,
> > >> > help prevent more privileged users from using a namespace's loop
> > >> > devices) sufficiently while having an implementation completely
> > >> > contained within the loop driver as Greg is requesting.
> > >> 
> > >> My key take away from the conversation is that we should reduce the
> > >> scope of what is being done to something that makes sense and the
> > >> problems are immediately visible.
> > >> 
> > >> Part of me would like to suggest that fuse and its ability to imitate
> > >> device nodes might be a more appropriate solution, to something that
> > >
> > > Do you have a link to more info on this?  Some googling got me to an
> > > interesting but old thread on CUSE, but nothing specifically about fuse
> > > doing this.
> > 
> > CUSE is probably what I was thinking of.  It is all part of the fuse
> > code base in the kernel.  And now that I am reminded it is called CUSE
> > I go Duh that is a character device...
> > 
> > Fuse and everything it can do is definitely the filesystem I would like
> > to see most have the audits to be enabled in user namespace.  Fuse
> > was built to be sufficiently paranoid to allow this and so it should not
> > take a lot to take fuse the rest of the way.
> 
> I was aware of FUSE but hadn't ever looked at it much. Looking at it
> now, this isn't going to satisfy any of the use cases I know about,
> which are wanting to use filesystems supported in-kernel (isofs, ext*).
> I don't see that any of these have a FUSE implementation, and I think we
> gain more from figuring out how to use in-kernel filesystems in
> containers than trying to find a way to shoehorn selected filesystems
> into FUSE.

That's why I was wondering how much work it would be to auto-generate
fuse fs support from the in-kernel source.

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Seth Forshee

On Fri, May 23, 2014 at 03:23:50PM -0700, Eric W. Biederman wrote:
> Serge Hallyn  writes:
> 
> > Quoting Eric W. Biederman (ebied...@xmission.com):
> >> 
> >> 
> >> >> Ultimately the technical challenge is how do we create a block device
> >> >> that is safe for a user who does not have any capabilities to use, and
> >> >> what can we do with that block device to make it useful.
> >> >
> >> > Yes, and I'd like to get started solving those challenges. But I also
> >> > don't think we can address these two points (support partition blkdevs,
> >> > help prevent more privileged users from using a namespace's loop
> >> > devices) sufficiently while having an implementation completely
> >> > contained within the loop driver as Greg is requesting.
> >> 
> >> My key take away from the conversation is that we should reduce the
> >> scope of what is being done to something that makes sense and the
> >> problems are immediately visible.
> >> 
> >> Part of me would like to suggest that fuse and its ability to imitate
> >> device nodes might be a more appropriate solution, to something that
> >
> > Do you have a link to more info on this?  Some googling got me to an
> > interesting but old thread on CUSE, but nothing specifically about fuse
> > doing this.
> 
> CUSE is probably what I was thinking of.  It is all part of the fuse
> code base in the kernel.  And now that I am reminded it is called CUSE
> I go Duh that is a character device...
> 
> Fuse and everything it can do is definitely the filesystem I would like
> to see most have the audits to be enabled in user namespace.  Fuse
> was built to be sufficiently paranoid to allow this and so it should not
> take a lot to take fuse the rest of the way.

I was aware of FUSE but hadn't ever looked at it much. Looking at it
now, this isn't going to satisfy any of the use cases I know about,
which are wanting to use filesystems supported in-kernel (isofs, ext*).
I don't see that any of these have a FUSE implementation, and I think we
gain more from figuring out how to use in-kernel filesystems in
containers than trying to find a way to shoehorn selected filesystems
into FUSE.


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread James Bottomley

On Mon, 2014-05-26 at 00:24 +0200, Serge E. Hallyn wrote:
> Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> > > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  
> > > > > >> wrote:
> > > > > >>> 
> > > > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > > > >  On 15.05.2014 at 21:50, Serge Hallyn wrote:
> > > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > > > > >>  wrote:
> > > > > >>> Then don't use a container to build such a thing, or fix the 
> > > > > >>> build scripts to not do that :)
> > > > > >> 
> > > > > >> I second this. To me it looks like some folks try to (ab)use 
> > > > > >> Linux containers for purposes where KVM
> > > > > >> would much better fit in. Please don't put more complexity 
> > > > > > >> into containers. They are already horribly
> > > > > >> complex and error prone.
> > > > > > 
> > > > > > I, naturally, disagree :)  The only use case which is 
> > > > > > inherently not valid for containers is running a
> > > > > > kernel.  Practically speaking there are other things which 
> > > > > > likely will never be possible, but if someone 
> > > > > > offers a way to do something in containers, "you can't do that 
> > > > > > in containers" is not an apropos response.
> > > > > > 
> > > > > > "That abstraction is wrong" is certainly valid, as when vpids 
> > > > > > were originally proposed and rejected,
> > > > > > resulting in the development of pid namespaces.  "We have to 
> > > > > > work out (x) first" can be valid (and I can
> > > > > > think of examples here), assuming it's not just trying to hide 
> > > > > > behind a catch-22/chicken-egg problem.
> > > > > > 
> > > > > > Finally, saying "containers are complex and error prone" is 
> > > > > > conflating several large suites of userspace
> > > > > > code and many kernel features which support them.  Being more 
> > > > > > precise would, if the argument is valid, lend
> > > > > > it a lot more weight.
> > > > >  
> > > > >  We (my company) use Linux containers since 2011 in production. 
> > > > >  First LXC, now libvirt-lxc. To understand the
> > > > >  internals better I also wrote my own userspace to create/start 
> > > > >  containers. There are so many things which can
> > > > >  hurt you badly. With user namespaces we expose a really big 
> > > > >  attack surface to regular users. I.e. Suddenly a
> > > > >  user is allowed to mount filesystems.
> > > > > >>> 
> > > > > >>> That is currently not the case.  They can mount some virtual 
> > > > > >>> filesystems and do bind mounts, but cannot mount
> > > > > >>> most real filesystems.  This keeps us protected (for now) from 
> > > > > >>> potentially unsafe superblock readers in the 
> > > > > >>> kernel.
> > > > > >>> 
> > > > >  Ask Andy, he found already lots of nasty things...
> > > > > >> 
> > > > > >> I don't think I have anything brilliant to add to this discussion 
> > > > > >> right now, except possibly:
> > > > > >> 
> > > > > >> ISTM that Linux distributions are, in general, vulnerable to all 
> > > > > >> kinds of shenanigans that would happen if an
> > > > > >> untrusted user can cause a block device to appear.  That user 
> > > > > >> doesn't need permission to mount it
> > > > > > 
> > > > > > Interesting point.  This would further suggest that we absolutely 
> > > > > > must ensure that a loop device which shows up in
> > > > > > the container does not also show up in the host.
> > > > > 
> > > > > Can I suggest the usage of the devices cgroup to achieve that?
> > > > 
> > > > Not really ... cgroups impose resource limits, it's namespaces that
> > > > impose visibility separations.  In theory this can be done with the
> > > > device namespace that's been proposed; however, a simpler way is simply
> > > > to rm the device node in the host and mknod it in the guest.  I don't
> > > > really see host visibility as a huge problem: in a shared OS
> > > > virtualisation it's not really possible securely to separate the guest
> > > > from the host (only vice versa).
> > > > 
> > > > But I really don't think we want to do it this way.  Giving a container
> > > > the ability to do a mount is too dangerous.  What we want to do is
> > > > intercept the mount in the host and perform it on behalf of the guest as
> > > > host root in the guest's mount namespace.  If you do it that way, it
> > > 
> > > That doesn't help the problem of guests being able to provide bad input
> > > for (basically fuzz

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-25 Thread Serge E. Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> > > > >>> 
> > > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > > >  On 15.05.2014 at 21:50, Serge Hallyn wrote:
> > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > > > >>  wrote:
> > > > >>> Then don't use a container to build such a thing, or fix the 
> > > > >>> build scripts to not do that :)
> > > > >> 
> > > > >> I second this. To me it looks like some folks try to (ab)use 
> > > > >> Linux containers for purposes where KVM
> > > > >> would much better fit in. Please don't put more complexity into 
> > > > > >> containers. They are already horribly
> > > > >> complex and error prone.
> > > > > 
> > > > > I, naturally, disagree :)  The only use case which is inherently 
> > > > > not valid for containers is running a
> > > > > kernel.  Practically speaking there are other things which likely 
> > > > > will never be possible, but if someone 
> > > > > offers a way to do something in containers, "you can't do that in 
> > > > > containers" is not an apropos response.
> > > > > 
> > > > > "That abstraction is wrong" is certainly valid, as when vpids 
> > > > > were originally proposed and rejected,
> > > > > resulting in the development of pid namespaces.  "We have to work 
> > > > > out (x) first" can be valid (and I can
> > > > > think of examples here), assuming it's not just trying to hide 
> > > > > behind a catch-22/chicken-egg problem.
> > > > > 
> > > > > Finally, saying "containers are complex and error prone" is 
> > > > > conflating several large suites of userspace
> > > > > code and many kernel features which support them.  Being more 
> > > > > precise would, if the argument is valid, lend
> > > > > it a lot more weight.
> > > >  
> > > >  We (my company) use Linux containers since 2011 in production. 
> > > >  First LXC, now libvirt-lxc. To understand the
> > > >  internals better I also wrote my own userspace to create/start 
> > > >  containers. There are so many things which can
> > > >  hurt you badly. With user namespaces we expose a really big attack 
> > > >  surface to regular users. I.e. Suddenly a
> > > >  user is allowed to mount filesystems.
> > > > >>> 
> > > > >>> That is currently not the case.  They can mount some virtual 
> > > > >>> filesystems and do bind mounts, but cannot mount
> > > > >>> most real filesystems.  This keeps us protected (for now) from 
> > > > >>> potentially unsafe superblock readers in the 
> > > > >>> kernel.
> > > > >>> 
> > > >  Ask Andy, he found already lots of nasty things...
> > > > >> 
> > > > >> I don't think I have anything brilliant to add to this discussion 
> > > > >> right now, except possibly:
> > > > >> 
> > > > >> ISTM that Linux distributions are, in general, vulnerable to all 
> > > > >> kinds of shenanigans that would happen if an
> > > > >> untrusted user can cause a block device to appear.  That user 
> > > > >> doesn't need permission to mount it
> > > > > 
> > > > > Interesting point.  This would further suggest that we absolutely 
> > > > > must ensure that a loop device which shows up in
> > > > > the container does not also show up in the host.
> > > > 
> > > > Can I suggest the usage of the devices cgroup to achieve that?
> > > 
> > > Not really ... cgroups impose resource limits, it's namespaces that
> > > impose visibility separations.  In theory this can be done with the
> > > device namespace that's been proposed; however, a simpler way is simply
> > > to rm the device node in the host and mknod it in the guest.  I don't
> > > really see host visibility as a huge problem: in a shared OS
> > > virtualisation it's not really possible securely to separate the guest
> > > from the host (only vice versa).
> > > 
> > > But I really don't think we want to do it this way.  Giving a container
> > > the ability to do a mount is too dangerous.  What we want to do is
> > > intercept the mount in the host and perform it on behalf of the guest as
> > > host root in the guest's mount namespace.  If you do it that way, it
> > 
> > That doesn't help the problem of guests being able to provide bad input
> > for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> > suffering a failure of the imagination - what problem exactly does it solve?
> 
> Well, there's two types of fuzzing, one is on sys_mount, which this
> would help with because the host filte

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-25 Thread James Bottomley

On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> > > >>> 
> > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > >  On 15.05.2014 at 21:50, Serge Hallyn wrote:
> > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > > >>  wrote:
> > > >>> Then don't use a container to build such a thing, or fix the 
> > > >>> build scripts to not do that :)
> > > >> 
> > > >> I second this. To me it looks like some folks try to (ab)use Linux 
> > > >> containers for purposes where KVM
> > > >> would much better fit in. Please don't put more complexity into 
> > > > >> containers. They are already horribly
> > > >> complex and error prone.
> > > > 
> > > > I, naturally, disagree :)  The only use case which is inherently 
> > > > not valid for containers is running a
> > > > kernel.  Practically speaking there are other things which likely 
> > > > will never be possible, but if someone 
> > > > offers a way to do something in containers, "you can't do that in 
> > > > containers" is not an apropos response.
> > > > 
> > > > "That abstraction is wrong" is certainly valid, as when vpids were 
> > > > originally proposed and rejected,
> > > > resulting in the development of pid namespaces.  "We have to work 
> > > > out (x) first" can be valid (and I can
> > > > think of examples here), assuming it's not just trying to hide 
> > > > behind a catch-22/chicken-egg problem.
> > > > 
> > > > Finally, saying "containers are complex and error prone" is 
> > > > conflating several large suites of userspace
> > > > code and many kernel features which support them.  Being more 
> > > > precise would, if the argument is valid, lend
> > > > it a lot more weight.
> > >  
> > >  We (my company) use Linux containers since 2011 in production. First 
> > >  LXC, now libvirt-lxc. To understand the
> > >  internals better I also wrote my own userspace to create/start 
> > >  containers. There are so many things which can
> > >  hurt you badly. With user namespaces we expose a really big attack 
> > >  surface to regular users. I.e. Suddenly a
> > >  user is allowed to mount filesystems.
> > > >>> 
> > > >>> That is currently not the case.  They can mount some virtual 
> > > >>> filesystems and do bind mounts, but cannot mount
> > > >>> most real filesystems.  This keeps us protected (for now) from 
> > > >>> potentially unsafe superblock readers in the 
> > > >>> kernel.
> > > >>> 
> > >  Ask Andy, he found already lots of nasty things...
> > > >> 
> > > >> I don't think I have anything brilliant to add to this discussion 
> > > >> right now, except possibly:
> > > >> 
> > > >> ISTM that Linux distributions are, in general, vulnerable to all kinds 
> > > >> of shenanigans that would happen if an
> > > >> untrusted user can cause a block device to appear.  That user doesn't 
> > > >> need permission to mount it
> > > > 
> > > > Interesting point.  This would further suggest that we absolutely must 
> > > > ensure that a loop device which shows up in
> > > > the container does not also show up in the host.
> > > 
> > > Can I suggest the usage of the devices cgroup to achieve that?
> > 
> > Not really ... cgroups impose resource limits, it's namespaces that
> > impose visibility separations.  In theory this can be done with the
> > device namespace that's been proposed; however, a simpler way is simply
> > to rm the device node in the host and mknod it in the guest.  I don't
> > really see host visibility as a huge problem: in a shared OS
> > virtualisation it's not really possible securely to separate the guest
> > from the host (only vice versa).
> > 
> > But I really don't think we want to do it this way.  Giving a container
> > the ability to do a mount is too dangerous.  What we want to do is
> > intercept the mount in the host and perform it on behalf of the guest as
> > host root in the guest's mount namespace.  If you do it that way, it
> 
> That doesn't help the problem of guests being able to provide bad input
> for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> suffering a failure of the imagination - what problem exactly does it solve?

Well, there's two types of fuzzing, one is on sys_mount, which this
would help with because the host filters the mount including all
parameters and may even redo the mount (from direct to bind etc).
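
To make the filtering idea concrete, here is a minimal Python sketch of a
host-side check on a guest's mount request. Everything in it is an
assumption for illustration: the allowed filesystem types, the allowed and
forced options, and the function name are hypothetical policy choices, not
an existing interface.

```python
# Hypothetical policy: what the host is willing to mount for a guest.
ALLOWED_FSTYPES = {"iso9660", "ext4"}
ALLOWED_OPTIONS = {"ro", "nosuid", "nodev", "noexec"}
# Options the host always adds, whatever the guest asked for.
FORCED_OPTIONS = ("nosuid", "nodev")


def filter_mount_request(fstype, source, options):
    """Return a sanitised (fstype, source, options) tuple or raise ValueError."""
    if fstype not in ALLOWED_FSTYPES:
        raise ValueError(f"filesystem type not allowed: {fstype}")
    unknown = [o for o in options if o not in ALLOWED_OPTIONS]
    if unknown:
        raise ValueError(f"mount options not allowed: {unknown}")
    # Append the forced options, keeping order and dropping duplicates.
    merged = list(dict.fromkeys(list(options) + list(FORCED_OPTIONS)))
    return fstype, source, merged
```

Only a request that survives this check would then be performed by the
host in the guest's mount namespace. The check looks at the parameters of
the request only, not at the contents of the filesystem image.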

If you're thinking the system can be compromised by fuzzing within the
filesystem, then yes, I agree, but it's the same vulnerability an
unvirtualised

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-24 Thread Serge Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > Quoting Andy Lutomirski (l...@amacapital.net):
> > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> > >>> 
> > >>> Quoting Richard Weinberger (rich...@nod.at):
> >  On 15.05.2014 at 21:50, Serge Hallyn wrote:
> > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > >>  wrote:
> > >>> Then don't use a container to build such a thing, or fix the build 
> > >>> scripts to not do that :)
> > >> 
> > >> I second this. To me it looks like some folks try to (ab)use Linux 
> > >> containers for purposes where KVM
> > >> would much better fit in. Please don't put more complexity into 
> > > >> containers. They are already horribly
> > >> complex and error prone.
> > > 
> > > I, naturally, disagree :)  The only use case which is inherently not 
> > > valid for containers is running a
> > > kernel.  Practically speaking there are other things which likely 
> > > will never be possible, but if someone 
> > > offers a way to do something in containers, "you can't do that in 
> > > containers" is not an apropos response.
> > > 
> > > "That abstraction is wrong" is certainly valid, as when vpids were 
> > > originally proposed and rejected,
> > > resulting in the development of pid namespaces.  "We have to work out 
> > > (x) first" can be valid (and I can
> > > think of examples here), assuming it's not just trying to hide behind 
> > > a catch-22/chicken-egg problem.
> > > 
> > > Finally, saying "containers are complex and error prone" is 
> > > conflating several large suites of userspace
> > > code and many kernel features which support them.  Being more precise 
> > > would, if the argument is valid, lend
> > > it a lot more weight.
> >  
> >  We (my company) use Linux containers since 2011 in production. First 
> >  LXC, now libvirt-lxc. To understand the
> >  internals better I also wrote my own userspace to create/start 
> >  containers. There are so many things which can
> >  hurt you badly. With user namespaces we expose a really big attack 
> >  surface to regular users. I.e. Suddenly a
> >  user is allowed to mount filesystems.
> > >>> 
> > >>> That is currently not the case.  They can mount some virtual 
> > >>> filesystems and do bind mounts, but cannot mount
> > >>> most real filesystems.  This keeps us protected (for now) from 
> > >>> potentially unsafe superblock readers in the 
> > >>> kernel.
> > >>> 
> >  Ask Andy, he found already lots of nasty things...
> > >> 
> > >> I don't think I have anything brilliant to add to this discussion right 
> > >> now, except possibly:
> > >> 
> > >> ISTM that Linux distributions are, in general, vulnerable to all kinds 
> > >> of shenanigans that would happen if an
> > >> untrusted user can cause a block device to appear.  That user doesn't 
> > >> need permission to mount it
> > > 
> > > Interesting point.  This would further suggest that we absolutely must 
> > > ensure that a loop device which shows up in
> > > the container does not also show up in the host.
> > 
> > Can I suggest the usage of the devices cgroup to achieve that?
> 
> Not really ... cgroups impose resource limits, it's namespaces that
> impose visibility separations.  In theory this can be done with the
> device namespace that's been proposed; however, a simpler way is simply
> to rm the device node in the host and mknod it in the guest.  I don't
> really see host visibility as a huge problem: in a shared OS
> virtualisation it's not really possible securely to separate the guest
> from the host (only vice versa).
> 
> But I really don't think we want to do it this way.  Giving a container
> the ability to do a mount is too dangerous.  What we want to do is
> intercept the mount in the host and perform it on behalf of the guest as
> host root in the guest's mount namespace.  If you do it that way, it

That doesn't help the problem of guests being able to provide bad input
for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
suffering a failure of the imagination - what problem exactly does it solve?

> doesn't really matter what device actually shows up in the guest, as
> long as the host knows what to do when the mount request comes along.
> 
> James
> 
> 

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Eric W. Biederman

Serge Hallyn  writes:

> Quoting Eric W. Biederman (ebied...@xmission.com):
>> 
>> 
>> >> Ultimately the technical challenge is how do we create a block device
>> >> that is safe for a user who does not have any capabilities to use, and
>> >> what can we do with that block device to make it useful.
>> >
>> > Yes, and I'd like to get started solving those challenges. But I also
>> > don't think we can address these two points (support partition blkdevs,
>> > help prevent more privileged users from using a namespace's loop
>> > devices) sufficiently while having an implementation completely
>> > contained within the loop driver as Greg is requesting.
>> 
>> My key take away from the conversation is that we should reduce the
>> scope of what is being done to something that makes sense and the
>> problems are immediately visible.
>>
>> Part of me would like to suggest that fuse and its ability to imitate
>> device nodes might be a more appropriate solution, to something that
>
> Do you have a link to more info on this?  Some googling got me to an
> interesting but old thread on CUSE, but nothing specifically about fuse
> doing this.

CUSE is probably what I was thinking of.  It is all part of the fuse
code base in the kernel.  And now that I am reminded it is called CUSE
I go Duh that is a character device...

Fuse, and everything it can do, is definitely the filesystem I would most
like to see audited so that it can be enabled in user namespaces.  Fuse
was built to be sufficiently paranoid to allow this, so it should not
take a lot to take fuse the rest of the way.

>> just needs block device access and nothing else.
>> 
>> For purposes of discussion let's call it unprivloopfs.  That can reuse
>> code from the loop device or not as appropriate.  Not supporting
>> partitioning I think is a very reasonable first step until it is shown
>> that we can make good use of partitioning support, and there are not
>> better ways of solving the problem.
>> 
>> I expect the most productive thing to talk about is what is your
>> immediate goal?  Mounting a filesystem?  Building an iso?
>
> For me it would be taking an iso and making some changes to it to
> localize it (i.e. take an install iso and add preseed file).
>
> Now of course in the end there is no reason why we can't do all of
> this with a new suite of libraries which simply uses read/write with
> knowledge of the fs layouts to parse and modify the backing files.
> My concern there is that duplicating all of the fs code seems unlikely
> to improve the soundness of either implementation.  Perhaps we can
> autogenerate this from the kernel source?  Does fuse already do
> something like that?

I am not aware of that.  But I have not worked extensively with fuse.

I do agree that finding a way to perform a read-only mount of an ISO by
an unprivileged user is a very interesting use case.  Given its nature
as an interchange medium, isofs should be as hardened as humanly possible,
and that is likely easier with a read-only filesystem.  And at less than
4000 lines of code isofs is auditable.

So as a target for unprivileged mounts of a block device isofs looks
like a good place to start.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Andy Lutomirski

On Fri, May 23, 2014 at 6:16 AM, James Bottomley
 wrote:
> On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
>> On 05/20/2014 05:19 PM, Serge Hallyn wrote:
>> > Quoting Andy Lutomirski (l...@amacapital.net):
>> >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
>> >>>
>> >>> Quoting Richard Weinberger (rich...@nod.at):
>> >>>> On 15.05.2014 21:50, Serge Hallyn wrote:
>> >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>> >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>> >>>>>>  wrote:
>> >>>>>>> Then don't use a container to build such a thing, or fix the build
>> >>>>>>> scripts to not do that :)
>> >>>>>>
>> >>>>>> I second this. To me it looks like some folks try to (ab)use Linux
>> >>>>>> containers for purposes where KVM
>> >>>>>> would much better fit in. Please don't put more complexity into
>> >>>>>> containers. They are already horribly
>> >>>>>> complex and error prone.
>> >>>>>
>> >>>>> I, naturally, disagree :)  The only use case which is inherently not
>> >>>>> valid for containers is running a
>> >>>>> kernel.  Practically speaking there are other things which likely will
>> >>>>> never be possible, but if someone
>> >>>>> offers a way to do something in containers, "you can't do that in
>> >>>>> containers" is not an apropos response.
>> >>>>>
>> >>>>> "That abstraction is wrong" is certainly valid, as when vpids were
>> >>>>> originally proposed and rejected,
>> >>>>> resulting in the development of pid namespaces.  "We have to work out
>> >>>>> (x) first" can be valid (and I can
>> >>>>> think of examples here), assuming it's not just trying to hide behind
>> >>>>> a catch-22/chicken-egg problem.
>> >>>>>
>> >>>>> Finally, saying "containers are complex and error prone" is conflating
>> >>>>> several large suites of userspace
>> >>>>> code and many kernel features which support them.  Being more precise
>> >>>>> would, if the argument is valid, lend
>> >>>>> it a lot more weight.
>> >>>>
>> >>>> We (my company) use Linux containers since 2011 in production. First
>> >>>> LXC, now libvirt-lxc. To understand the
>> >>>> internals better I also wrote my own userspace to create/start
>> >>>> containers. There are so many things which can
>> >>>> hurt you badly. With user namespaces we expose a really big attack
>> >>>> surface to regular users. I.e. Suddenly a
>> >>>> user is allowed to mount filesystems.
>> >>>
>> >>> That is currently not the case.  They can mount some virtual filesystems
>> >>> and do bind mounts, but cannot mount
>> >>> most real filesystems.  This keeps us protected (for now) from
>> >>> potentially unsafe superblock readers in the
>> >>> kernel.
>> >>>
>> >>>> Ask Andy, he found already lots of nasty things...
>> >>
>> >> I don't think I have anything brilliant to add to this discussion right 
>> >> now, except possibly:
>> >>
>> >> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
>> >> shenanigans that would happen if an
>> >> untrusted user can cause a block device to appear.  That user doesn't 
>> >> need permission to mount it
>> >
>> > Interesting point.  This would further suggest that we absolutely must 
>> > ensure that a loop device which shows up in
>> > the container does not also show up in the host.
>>
>> Can I suggest the usage of the devices cgroup to achieve that?
>
> Not really ... cgroups impose resource limits, it's namespaces that
> impose visibility separations.  In theory this can be done with the
> device namespace that's been proposed; however, a simpler way is simply
> to rm the device node in the host and mknod it in the guest.  I don't
> really see host visibility as a huge problem: in a shared OS
> virtualisation it's not really possible securely to separate the guest
> from the host (only vice versa).
>
> But I really don't think we want to do it this way.  Giving a container
> the ability to do a mount is too dangerous.  What we want to do is
> intercept the mount in the host and perform it on behalf of the guest as
> host root in the guest's mount namespace.  If you do it that way, it
> doesn't really matter what device actually shows up in the guest, as
> long as the host knows what to do when the mount request comes along.

This is only useful/safe if the host understands what's going on.  By
the host, I mean the host's udev and other system-level stuff.  This
is probably fine for disks and such, but it might not be so great for
loop devices, FUSE, etc.  I already know of one user of containers
that wants container-local FUSE mounts.  This ought to Just Work (tm),
but there's a fair amount of work needed to get there.

--Andy

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread James Bottomley

On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > Quoting Andy Lutomirski (l...@amacapital.net):
> >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> >>> 
> >>> Quoting Richard Weinberger (rich...@nod.at):
> >>>> On 15.05.2014 21:50, Serge Hallyn wrote:
> >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
> >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> >>>>>>  wrote:
> >>>>>>> Then don't use a container to build such a thing, or fix the build
> >>>>>>> scripts to not do that :)
> >>>>>>
> >>>>>> I second this. To me it looks like some folks try to (ab)use Linux
> >>>>>> containers for purposes where KVM
> >>>>>> would much better fit in. Please don't put more complexity into
> >>>>>> containers. They are already horribly
> >>>>>> complex and error prone.
> >>>>>
> >>>>> I, naturally, disagree :)  The only use case which is inherently not
> >>>>> valid for containers is running a
> >>>>> kernel.  Practically speaking there are other things which likely will
> >>>>> never be possible, but if someone
> >>>>> offers a way to do something in containers, "you can't do that in
> >>>>> containers" is not an apropos response.
> >>>>>
> >>>>> "That abstraction is wrong" is certainly valid, as when vpids were
> >>>>> originally proposed and rejected,
> >>>>> resulting in the development of pid namespaces.  "We have to work out
> >>>>> (x) first" can be valid (and I can
> >>>>> think of examples here), assuming it's not just trying to hide behind a
> >>>>> catch-22/chicken-egg problem.
> >>>>>
> >>>>> Finally, saying "containers are complex and error prone" is conflating
> >>>>> several large suites of userspace
> >>>>> code and many kernel features which support them.  Being more precise
> >>>>> would, if the argument is valid, lend
> >>>>> it a lot more weight.
> >>>>
> >>>> We (my company) use Linux containers since 2011 in production. First
> >>>> LXC, now libvirt-lxc. To understand the
> >>>> internals better I also wrote my own userspace to create/start
> >>>> containers. There are so many things which can
> >>>> hurt you badly. With user namespaces we expose a really big attack
> >>>> surface to regular users. I.e. Suddenly a
> >>>> user is allowed to mount filesystems.
> >>>
> >>> That is currently not the case.  They can mount some virtual filesystems
> >>> and do bind mounts, but cannot mount
> >>> most real filesystems.  This keeps us protected (for now) from
> >>> potentially unsafe superblock readers in the
> >>> kernel.
> >>>
> >>>> Ask Andy, he found already lots of nasty things...
> >> 
> >> I don't think I have anything brilliant to add to this discussion right 
> >> now, except possibly:
> >> 
> >> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
> >> shenanigans that would happen if an
> >> untrusted user can cause a block device to appear.  That user doesn't need 
> >> permission to mount it
> > 
> > Interesting point.  This would further suggest that we absolutely must 
> > ensure that a loop device which shows up in
> > the container does not also show up in the host.
> 
> Can I suggest the usage of the devices cgroup to achieve that?

Not really ... cgroups impose resource limits, it's namespaces that
impose visibility separations.  In theory this can be done with the
device namespace that's been proposed; however, a simpler way is simply
to rm the device node in the host and mknod it in the guest.  I don't
really see host visibility as a huge problem: in a shared OS
virtualisation it's not really possible securely to separate the guest
from the host (only vice versa).
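
For readers following along, the "rm in the host, mknod in the guest"
step can be sketched roughly as below.  This is only an illustration:
the helper names and directory arguments are made up for the example,
it assumes the loop driver's block major is 7 and that the minor equals
the loop index (i.e. loop partitions are not enabled), and the caller
needs CAP_MKNOD for it to succeed.

```python
import os
import stat

LOOP_MAJOR = 7   # block major of the loop driver (assumed, see lead-in)


def make_loop_node(dev_dir, index, mode=0o660):
    """Create <dev_dir>/loop<index> as a block device node (needs CAP_MKNOD)."""
    path = os.path.join(dev_dir, "loop%d" % index)
    # Assumption: minor number == loop index (no partition minors reserved).
    os.mknod(path, stat.S_IFBLK | mode, os.makedev(LOOP_MAJOR, index))
    return path


def move_loop_node(host_dev, guest_dev, index):
    """Remove the node from the host's /dev and recreate it for the guest."""
    host_path = os.path.join(host_dev, "loop%d" % index)
    if os.path.exists(host_path):
        os.unlink(host_path)
    return make_loop_node(guest_dev, index)
```

Note that this only hides the node by name; anyone on the host with
CAP_MKNOD can recreate it, which is the visibility caveat raised above.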

But I really don't think we want to do it this way.  Giving a container
the ability to do a mount is too dangerous.  What we want to do is
intercept the mount in the host and perform it on behalf of the guest as
host root in the guest's mount namespace.  If you do it that way, it
doesn't really matter what device actually shows up in the guest, as
long as the host knows what to do when the mount request comes along.

James



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Marian Marinov


On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> Quoting Andy Lutomirski (l...@amacapital.net):
>> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
>>> 
>>> Quoting Richard Weinberger (rich...@nod.at):
>>>> On 15.05.2014 21:50, Serge Hallyn wrote:
>>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>>>>>>  wrote:
>>>>>>> Then don't use a container to build such a thing, or fix the build
>>>>>>> scripts to not do that :)
>>>>>>
>>>>>> I second this. To me it looks like some folks try to (ab)use Linux
>>>>>> containers for purposes where KVM
>>>>>> would much better fit in. Please don't put more complexity into
>>>>>> containers. They are already horribly
>>>>>> complex and error prone.
>>>>>
>>>>> I, naturally, disagree :)  The only use case which is inherently not
>>>>> valid for containers is running a
>>>>> kernel.  Practically speaking there are other things which likely will
>>>>> never be possible, but if someone
>>>>> offers a way to do something in containers, "you can't do that in
>>>>> containers" is not an apropos response.
>>>>>
>>>>> "That abstraction is wrong" is certainly valid, as when vpids were
>>>>> originally proposed and rejected,
>>>>> resulting in the development of pid namespaces.  "We have to work out (x)
>>>>> first" can be valid (and I can
>>>>> think of examples here), assuming it's not just trying to hide behind a
>>>>> catch-22/chicken-egg problem.
>>>>>
>>>>> Finally, saying "containers are complex and error prone" is conflating
>>>>> several large suites of userspace
>>>>> code and many kernel features which support them.  Being more precise
>>>>> would, if the argument is valid, lend
>>>>> it a lot more weight.
>>>>
>>>> We (my company) use Linux containers since 2011 in production. First LXC,
>>>> now libvirt-lxc. To understand the
>>>> internals better I also wrote my own userspace to create/start containers.
>>>> There are so many things which can
>>>> hurt you badly. With user namespaces we expose a really big attack surface
>>>> to regular users. I.e. Suddenly a
>>>> user is allowed to mount filesystems.
>>>
>>> That is currently not the case.  They can mount some virtual filesystems
>>> and do bind mounts, but cannot mount
>>> most real filesystems.  This keeps us protected (for now) from potentially
>>> unsafe superblock readers in the
>>> kernel.
>>>
>>>> Ask Andy, he found already lots of nasty things...
>> 
>> I don't think I have anything brilliant to add to this discussion right now, 
>> except possibly:
>> 
>> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
>> shenanigans that would happen if an
>> untrusted user can cause a block device to appear.  That user doesn't need 
>> permission to mount it
> 
> Interesting point.  This would further suggest that we absolutely must ensure 
> that a loop device which shows up in
> the container does not also show up in the host.

Can I suggest the usage of the devices cgroup to achieve that?

Marian

> 
>> or even necessarily to change its contents on the fly.
>> 
>> E.g. what happens if you boot a machine that contains a malicious disk image 
>> that has the same partition UUID as
>> /?  Nothing good, I imagine.
>> 
>> So if we're going to go down this road, we really need some way to tell the 
>> host that certain devices are not
>> trusted.
>> 
>> --Andy

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-21 Thread Serge Hallyn

Quoting Eric W. Biederman (ebied...@xmission.com):
> 
> 
> >> Ultimately the technical challenge is how do we create a block device
> >> that is safe for a user who does not have any capabilities to use, and
> >> what can we do with that block device to make it useful.
> >
> > Yes, and I'd like to get started solving those challenges. But I also
> > don't think we can address these two points (support partition blkdevs,
> > help prevent more privileged users from using a namespace's loop
> > devices) sufficiently while having an implementation completely
> > contained within the loop driver as Greg is requesting.
> 
> My key take away from the conversation is that we should reduce the
> scope of what is being done to something that makes sense and the
> problems are immediately visible.
>
> Part of me would like to suggest that fuse and its ability to imitate
> device nodes might be a more appropriate solution, to something that

Do you have a link to more info on this?  Some googling got me to an
interesting but old thread on CUSE, but nothing specifically about fuse
doing this.

> just needs block device access and nothing else.
> 
> For purposes of discussion let's call it unprivloopfs.  That can reuse
> code from the loop device or not as appropriate.  Not supporting
> partitioning I think is a very reasonable first step until it is shown
> that we can make good use of partitioning support, and there are not
> better ways of solving the problem.
> 
> I expect the most productive thing to talk about is what is your
> immediate goal?  Mounting a filesystem?  Building an iso?

For me it would be taking an iso and making some changes to it to
localize it (i.e. take an install iso and add preseed file).

Now of course in the end there is no reason why we can't do all of
this with a new suite of libraries which simply uses read/write with
knowledge of the fs layouts to parse and modify the backing files.
My concern there is that duplicating all of the fs code seems unlikely
to improve the soundness of either implementation.  Perhaps we can
autogenerate this from the kernel source?  Does fuse already do
something like that?
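
To make the "plain read/write with knowledge of the fs layout" idea
concrete, here is a minimal sketch that pulls a few fields out of an
ISO 9660 image without mounting it.  The function name is invented for
the example, and it assumes the primary volume descriptor is the first
descriptor at sector 16, which is the usual layout but is not checked
further here.

```python
import struct

SECTOR_SIZE = 2048   # ISO 9660 logical sector size
PVD_SECTOR = 16      # volume descriptors start at sector 16


def read_primary_volume_descriptor(path):
    """Return (volume_id, volume_blocks, block_size) from an ISO 9660 image."""
    with open(path, "rb") as image:
        image.seek(PVD_SECTOR * SECTOR_SIZE)
        sector = image.read(SECTOR_SIZE)
    if len(sector) != SECTOR_SIZE:
        raise ValueError("image too short to hold a volume descriptor")
    # Byte 0 is the descriptor type (1 = primary), bytes 1-5 the magic "CD001".
    if sector[0] != 1 or sector[1:6] != b"CD001":
        raise ValueError("no primary volume descriptor at sector 16")
    volume_id = sector[40:72].decode("ascii", "replace").rstrip()
    # Both fields are stored in both byte orders; read the little-endian half.
    (volume_blocks,) = struct.unpack_from("<I", sector, 80)
    (block_size,) = struct.unpack_from("<H", sector, 128)
    return volume_id, volume_blocks, block_size
```

Even this much shows the duplication concern: every check the kernel's
isofs does on untrusted input would need an equivalent here.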

> We have a long history with the namespace support of punting on issues
> and not solving them until a long term maintainable solution becomes
> clear.  Let's do what we can to make the problem and the solution clear.

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-21 Thread Eric W. Biederman



>> Ultimately the technical challenge is how do we create a block device
>> that is safe for a user who does not have any capabilities to use, and
>> what can we do with that block device to make it useful.
>
> Yes, and I'd like to get started solving those challenges. But I also
> don't think we can address these two points (support partition blkdevs,
> help prevent more privileged users from using a namespace's loop
> devices) sufficiently while having an implementation completely
> contained within the loop driver as Greg is requesting.

My key take away from the conversation is that we should reduce the
scope of what is being done to something that makes sense and the
problems are immediately visible.

Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
just needs block device access and nothing else.

For purposes of discussion let's call it unprivloopfs.  That can reuse
code from the loop device or not as appropriate.  Not supporting
partitioning I think is a very reasonable first step until it is shown
that we can make good use of partitioning support, and there are not
better ways of solving the problem.

I expect the most productive thing to talk about is what is your
immediate goal?  Mounting a filesystem?  Building an iso?

We have a long history with the namespace support of punting on issues
and not solving them until a long term maintainable solution becomes
clear.  Let's do what we can to make the problem and the solution clear.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Serge Hallyn (serge.hal...@ubuntu.com):
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> > > Quoting Seth Forshee (seth.fors...@canonical.com):
> > > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > > > Greg Kroah-Hartman  writes:
> > > > > 
> > > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > >> > I think having to pick and choose what device nodes you want in a
> > > > > >> > container is a good thing.  Besides, you would have to do the
> > > > > >> > same thing
> > > > > >> > in the kernel anyway, what's wrong with userspace making the 
> > > > > >> > decision
> > > > > >> > here, especially as it knows exactly what it wants to do much 
> > > > > >> > more so
> > > > > >> > than the kernel ever can.
> > > > > >> 
> > > > > >> For 'real' devices that sounds sensible.  The thing about loop 
> > > > > >> devices
> > > > > >> is that we simply want to allow a container to say "give me a loop
> > > > > >> device to use" and have it receive a unique loop device (or 3), 
> > > > > >> without
> > > > > >> having to pre-assign them.  I think that would be cleaner to do 
> > > > > >> using
> > > > > >> a pseudofs and loop-control device, rather than having to have a
> > > > > >> daemon in userspace on the host farming those out in response to
> > > > > >> some, I don't know, dbus request?
> > > > > >
> > > > > > I agree that loop devices would be nice to have in a container, and 
> > > > > > that
> > > > > > the existing loop interface doesn't really lend itself to that.  So
> > > > > > create a new type of thing that acts like a loop device in a 
> > > > > > container.
> > > > > > But don't try to mess with the whole driver core just for a single 
> > > > > > type
> > > > > > of device.
> > > > > 
> > > > > Yes. Something like devpts (without the newinstance option).  Built to
> > > > > allow unprivileged users to create loopback devices.
> > > > 
> > > > That's where I started, and I've got code, so I guess I'll clean it up
> > > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > > gets to do privileged block device ioctls, including reading partitions
> > > 
> > > Sorry, where did that come from?  What Eric was referring to below is
> > > the fs superblock readers not being trusted.  Maybe I glossed over another
> > > email where it was mentioned?
> > 
> > You must have. Take a look at [1].
> > 
> > To repeat the point: the ioctl to reread partitions (along with several
> > other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> > change this to an ns_capable check without at minimum the block layer
> > knowing about the namespace associated with the block device. Ergo we
> 
> Which only means those changes are necessary :)
> 
> So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
> is interesting (and, depending on the implementation, acceptable).  That
> necessarily includes the minimal blockdev changes to support it.
> 
> > can't reread partitions if this is done entirely within the loop driver
> > via a pseudo fs.
> > 
> > [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191

Hm, yeah, I was confuddling two issues.  Nevertheless, for real block devices I
absolutely agree.  For loop devices I don't.  My answer to

> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

would be that the loop device is a convenience built atop the backing image,
and if the user had the rights to loop-attach the backing image, he can
just as well partition using write(2), so why artificially place this limit?
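
As a rough illustration of that point, the sketch below writes a
one-entry MBR partition table straight into a backing image using
nothing but ordinary file writes.  The function name and the choice of
partition type 0x83 are assumptions for the example, and the CHS fields
are filled with placeholder values on the assumption that the LBA
fields are what gets used.

```python
import struct

PART_TABLE_OFFSET = 446   # first of four 16-byte MBR partition entries
LINUX_TYPE = 0x83         # partition type byte chosen for the example


def write_single_partition(path, start_lba, num_sectors):
    """Write a one-entry MBR partition table into an existing image file."""
    entry = struct.pack(
        "<B3sB3sII",
        0x00,               # status: not bootable
        b"\xfe\xff\xff",    # CHS start: placeholder, LBA fields are used
        LINUX_TYPE,
        b"\xfe\xff\xff",    # CHS end: placeholder
        start_lba,
        num_sectors,
    )
    with open(path, "r+b") as image:
        image.seek(PART_TABLE_OFFSET)
        image.write(entry + bytes(48))   # entry 1 plus three empty entries
        image.write(b"\x55\xaa")         # boot signature at offset 510
```

No capability is involved anywhere in this; the only thing the user
cannot do afterwards is ask the kernel to reread the table.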

Nevertheless this is not really a debate worth having until we have a
blockdev fs mountable in a userns.

My main interest currently is with privileged containers.  I think we can
learn plenty from that for now.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Seth Forshee

On Mon, May 19, 2014 at 05:04:55PM -0700, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.
> 
> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.
> 
> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like.  What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

But Greg did say that "unprivileged" or "secure" containers (depending
on whose terminology you're using) should not be able to do partitioning
[1]. I don't really understand this stance though, as I don't see what
possible security problems arise from letting root in a user ns do
BLKRRPART on a block device that it's explicitly been granted privileged
use of.

Assuming we come to an agreement that root in a user ns can do BLKRRPART
on some devices, we've got two issues. First, the block layer enforces
this restriction so it has to be aware of what namespace has privileges
for the device, but Greg wants a solution localized to the loop driver.
Second, if we're using a loop pseudo fs then we'd logically want block
devices for the partitions in the loop fs, so we have to create some
mechanism for the loop driver to get notified about these devices being
created.

> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

The patches I posted have mechanisms to at least mitigate the problem.
First, anyone using loop-control to find a free loop device will never
get a device allocated to a different user ns (the loop pseudo fs code I
have also does this). Second, a given loop block device would only show
up in the devtmpfs of the namespace which owned that device. So a
sufficiently privileged user isn't completely prevented from using the
devices, but since they would have to explicitly mknod the block device
node it should prevent accidental use by a more privileged user.
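
For reference, "using loop-control to find a free loop device" looks
roughly like the sketch below from userspace.  It uses the existing
LOOP_CTL_GET_FREE ioctl from <linux/loop.h>; the wrapper function and
its default path are just for illustration, and it shows the current
interface only, not the per-namespace filtering the patches add.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82   # from <linux/loop.h>


def get_free_loop_device(control_path="/dev/loop-control"):
    """Ask the loop driver for an unused loop device; return its path or None."""
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None   # no loop-control node visible, or no permission
    try:
        # The index of the free device comes back as the ioctl return value.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
    return "/dev/loop%d" % index
```

With the patches, the index returned here would never belong to a
device allocated to a different user namespace.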

But I also brought this up previously, and Greg argued that it isn't a
real issue [1].

> To get to the point where this is really and truly usable I expect to be
> technically daunting.
> 
> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.

Thanks,
Seth

> 
> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.
> 
> Eric

[1] http://www.spinics.net/linux/lists/kernel/msg1744750.html

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> >
> > Quoting Richard Weinberger (rich...@nod.at):
> > > > On 15.05.2014 21:50, Serge Hallyn wrote:
> > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > > >>  wrote:
> > > >>> Then don't use a container to build such a thing, or fix the build
> > > >>> scripts to not do that :)
> > > >>
> > > >> I second this.
> > > >> To me it looks like some folks try to (ab)use Linux containers
> > > >> for purposes where KVM would much better fit in.
> > > >> Please don't put more complexity into containers. They are already
> > > >> horribly complex
> > > >> and error prone.
> > > >
> > > > I, naturally, disagree :)  The only use case which is inherently not
> > > > valid for containers is running a kernel.  Practically speaking there
> > > > are other things which likely will never be possible, but if someone
> > > > offers a way to do something in containers, "you can't do that in
> > > > containers" is not an apropos response.
> > > >
> > > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > > originally proposed and rejected, resulting in the development of
> > > > pid namespaces.  "We have to work out (x) first" can be valid (and
> > > > I can think of examples here), assuming it's not just trying to hide
> > > > behind a catch-22/chicken-egg problem.
> > > >
> > > > Finally, saying "containers are complex and error prone" is conflating
> > > > several large suites of userspace code and many kernel features which
> > > > support them.  Being more precise would, if the argument is valid,
> > > > lend it a lot more weight.
> > >
> > > We (my company) use Linux containers since 2011 in production. First LXC, 
> > > now libvirt-lxc.
> > > To understand the internals better I also wrote my own userspace to 
> > > create/start
> > > containers. There are so many things which can hurt you badly.
> > > With user namespaces we expose a really big attack surface to regular 
> > > users.
> > > I.e. Suddenly a user is allowed to mount filesystems.
> >
> > That is currently not the case.  They can mount some virtual filesystems
> > and do bind mounts, but cannot mount most real filesystems.  This keeps
> > us protected (for now) from potentially unsafe superblock readers in the
> > kernel.
> >
> > > Ask Andy, he found already lots of nasty things...
> 
> I don't think I have anything brilliant to add to this discussion
> right now, except possibly:
> 
> ISTM that Linux distributions are, in general, vulnerable to all kinds
> of shenanigans that would happen if an untrusted user can cause a
> block device to appear.  That user doesn't need permission to mount it

Interesting point.  This would further suggest that we absolutely must
ensure that a loop device which shows up in the container does not also
show up in the host.

> or even necessarily to change its contents on the fly.
> 
> E.g. what happens if you boot a machine that contains a malicious disk
> image that has the same partition UUID as /?  Nothing good, I imagine.
> 
> So if we're going to go down this road, we really need some way to
> tell the host that certain devices are not trusted.
> 
> --Andy

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Michael H. Warfield (m...@wittsend.com):
> On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> > Seth Forshee  writes:
> > 
> > > What I set out for was feature parity between loop devices in a secure
> > > container and loop devices on the host. Since some operations currently
> > > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > > this is to push knowledge of the user namespace farther down into the
> > > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > > namespace associated with the device.
> > >
> > > That said, I suspect our current use cases can get by without these
> > > capabilities. Really though I suspect this is just deferring the
> > > discussion rather than settling it, and what we'll end up with is little
> > > more than a fancy way for userspace to ask the kernel to run mknod on
> > > its behalf.
> 
> > A fancy way to ask the kernel to run mknod on its behalf is what
> > /dev/pts is.
> 
> > When I suggested this I did not mean you should forgo making changes to
> > > allow partitions and the like.  What I intended is that you should find a
> > way to make this safe for users who don't have root capabilities.
> 
> I like to think in terms of the "rootless" configurations where "root"
> per se is not absolute and everything is framed in terms of
> capabilities.
> 
> > Which possibly means that mount needs to learn how to keep a more
> > privileged user from using your new loop devices.
> 
> Not sure I got that one.  A user with "more" privileges may or may not
> have access depending on the congruence of the privileges.  They're not

Yes, so in this case by 'more privileged' he meant a privileged user in a
userns which is an ancestor of the current userns.  It is in fact *more*
privileged than any user in the current userns.

> hierarchical.  If someone has that "priv" then they have access.  If they

They are in fact implicitly hierarchical due to the hierarchical userns
design.

> do not, they do not.
> 
> > I expect getting to the point where this is really and truly usable to be
> > technically daunting.
> 
> Most technically non-trivial problems generally are.
> 
> > Ultimately the technical challenge is how do we create a block device
> > that is safe for a user who does not have any capabilities to use, and
> > what can we do with that block device to make it useful.
> 
> Concur.  It boils down to privilege management and access.  Absolutely
> concur.
> 
> > Only when the question is can this kernel functionality which is
> > otherwise safe confuse a preexisting setuid application do namespace
> > or container bits significantly come into play.
> 
> Ah...  Admittedly it's not as late as our conversation at LinuxPlumbers
> last year in NOLA but...  Maybe late at night but I failed to parse the
> above.
> 
> > Eric
> 
> Regards,
> Mike
> -- 
> Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
>/\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
>NIC whois: MHW9  | An optimist believes we live in the best of all
>  PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!
> 




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
> On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> > Quoting Seth Forshee (seth.fors...@canonical.com):
> > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > > Greg Kroah-Hartman  writes:
> > > > 
> > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > >> > I think having to pick and choose what device nodes you want in a
> > > > >> > container is a good thing.  Besides, you would have to do the same
> > > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > > >> > decision here, especially as it knows exactly what it wants to do
> > > > >> > much more so than the kernel ever can.
> > > > >> 
> > > > >> For 'real' devices that sounds sensible.  The thing about loop
> > > > >> devices is that we simply want to allow a container to say "give me
> > > > >> a loop device to use" and have it receive a unique loop device (or
> > > > >> 3), without having to pre-assign them.  I think that would be
> > > > >> cleaner to do using a pseudofs and loop-control device, rather than
> > > > >> having to have a daemon in userspace on the host farming those out
> > > > >> in response to some, I don't know, dbus request?
> > > > >
> > > > > I agree that loop devices would be nice to have in a container, and
> > > > > that the existing loop interface doesn't really lend itself to that.
> > > > > So create a new type of thing that acts like a loop device in a
> > > > > container.  But don't try to mess with the whole driver core just for
> > > > > a single type of device.
> > > > 
> > > > Yes. Something like devpts (without the newinstance option).  Built to
> > > > allow unprivileged users to create loopback devices.
> > > 
> > > That's where I started, and I've got code, so I guess I'll clean it up
> > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > gets to do privileged block device ioctls, including reading partitions
> > 
> > Sorry, where did that come from?  What Eric was referring to below is
> > the fs superblock readers not being trusted.  Maybe I glossed over another
> > email where it was mentioned?
> 
> You must have. Take a look at [1].
> 
> To repeat the point: the ioctl to reread partitions (along with several
> other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> change this to an ns_capable check without at minimum the block layer
> knowing about the namespace associated with the block device. Ergo we

Which only means those changes are necessary :)

So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
is interesting (and, depending on the implementation, acceptable).  That
necessarily includes the minimal blockdev changes to support it.

> can't reread partitions if this is done entirely within the loop driver
> via a pseudo fs.
> 
> [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
> 

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Michael H. Warfield

On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.

> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.

> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like.  What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

I like to think in terms of the "rootless" configurations where "root"
per se is not absolute and everything is framed in terms of
capabilities.

> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

Not sure I got that one.  A user with "more" privileges may or may not
have access depending on the congruence of the privileges.  They're not
hierarchical.  If someone has that "priv" then they have access.  If they
do not, they do not.

> I expect getting to the point where this is really and truly usable to be
> technically daunting.

Most technically non-trivial problems generally are.

> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Concur.  It boils down to privilege management and access.  Absolutely
concur.

> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.

Ah...  Admittedly it's not as late as our conversation at LinuxPlumbers
last year in NOLA but...  Maybe late at night but I failed to parse the
above.

> Eric

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Eric W. Biederman

Seth Forshee  writes:

> What I set out for was feature parity between loop devices in a secure
> container and loop devices on the host. Since some operations currently
> check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> this is to push knowledge of the user namespace farther down into the
> driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> namespace associated with the device.
>
> That said, I suspect our current use cases can get by without these
> capabilities. Really though I suspect this is just deferring the
> discussion rather than settling it, and what we'll end up with is little
> more than a fancy way for userspace to ask the kernel to run mknod on
> its behalf.

A fancy way to ask the kernel to run mknod on its behalf is what
/dev/pts is.
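
To make the /dev/pts comparison concrete, here is a minimal sketch (not
code from this thread) of how an unprivileged process already gets a
device node created on its behalf: opening a pseudo-terminal pair makes
the kernel create the slave node in devpts, with no mknod call from
userspace.  The helper name allocate_pty_slave_name is made up for this
illustration, and the sketch assumes a system with a working pty
subsystem; it returns None where that is missing.

```python
import os


def allocate_pty_slave_name():
    """Allocate a pty pair and return the slave's device path.

    Returns None if the system cannot hand out a pty (for example when
    devpts is not mounted).  The name is hypothetical, for illustration.
    """
    try:
        master_fd, slave_fd = os.openpty()
    except OSError:
        return None
    try:
        # The kernel created this node for us; on Linux it normally lives
        # under /dev/pts, but we only report whatever ttyname() says.
        return os.ttyname(slave_fd)
    except OSError:
        return None
    finally:
        # Closing both ends releases the pty, and the node goes away again.
        os.close(slave_fd)
        os.close(master_fd)


if __name__ == "__main__":
    print(allocate_pty_slave_name())
```

The loop analogue discussed here would hand out loop block devices the
same way, which is where the extra safety questions below come from.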

When I suggested this I did not mean you should forgo making changes to
allow partitions and the like.  What I intended is that you should find a
way to make this safe for users who don't have root capabilities.

Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.

I expect getting to the point where this is really and truly usable to be
technically daunting.

Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.

Only when the question is can this kernel functionality which is
otherwise safe confuse a preexisting setuid application do namespace
or container bits significantly come into play.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Andy Lutomirski

On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
>
> Quoting Richard Weinberger (rich...@nod.at):
> > Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > >>  wrote:
> > >>> Then don't use a container to build such a thing, or fix the build
> > >>> scripts to not do that :)
> > >>
> > >> I second this.
> > >> To me it looks like some folks try to (ab)use Linux containers
> > >> for purposes where KVM would much better fit in.
> > >> Please don't put more complexity into containers. They are already
> > >> horribly complex and error prone.
> > >
> > > I, naturally, disagree :)  The only use case which is inherently not
> > > valid for containers is running a kernel.  Practically speaking there
> > > are other things which likely will never be possible, but if someone
> > > offers a way to do something in containers, "you can't do that in
> > > containers" is not an apropos response.
> > >
> > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > originally proposed and rejected, resulting in the development of
> > > pid namespaces.  "We have to work out (x) first" can be valid (and
> > > I can think of examples here), assuming it's not just trying to hide
> > > behind a catch-22/chicken-egg problem.
> > >
> > > Finally, saying "containers are complex and error prone" is conflating
> > > several large suites of userspace code and many kernel features which
> > > support them.  Being more precise would, if the argument is valid,
> > > lend it a lot more weight.
> >
> > We (my company) have used Linux containers in production since 2011.
> > First LXC, now libvirt-lxc.  To understand the internals better I also
> > wrote my own userspace to create/start containers. There are so many
> > things which can hurt you badly.
> > With user namespaces we expose a really big attack surface to regular users.
> > I.e. suddenly a user is allowed to mount filesystems.
>
> That is currently not the case.  They can mount some virtual filesystems
> and do bind mounts, but cannot mount most real filesystems.  This keeps
> us protected (for now) from potentially unsafe superblock readers in the
> kernel.
>
> > Ask Andy, he has already found lots of nasty things...

I don't think I have anything brilliant to add to this discussion
right now, except possibly:

ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear.  That user doesn't need permission to mount it
or even necessarily to change its contents on the fly.

E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /?  Nothing good, I imagine.

So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.
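
A toy model of the duplicate-UUID concern, as a sketch only: it is not
how any real init system or udev resolves devices, and the "trusted"
flag is exactly the hypothetical host-side marking asked for above.  The
names BlockDevice, resolve_by_uuid and ROOT_UUID are invented for the
illustration.

```python
from dataclasses import dataclass

ROOT_UUID = "11111111-2222-3333-4444-555555555555"  # made-up example value


@dataclass(frozen=True)
class BlockDevice:
    path: str
    uuid: str
    trusted: bool  # hypothetical marking: False for container-created devices


def resolve_by_uuid(devices, uuid, trusted_only=False):
    """Return the paths of every device that claims the given UUID."""
    return [
        dev.path
        for dev in devices
        if dev.uuid == uuid and (dev.trusted or not trusted_only)
    ]


devices = [
    BlockDevice("/dev/sda1", ROOT_UUID, trusted=True),
    # A loop device backed by an image an untrusted user crafted to
    # carry the same UUID as the real root filesystem.
    BlockDevice("/dev/loop0", ROOT_UUID, trusted=False),
]

# Without a trust marking the lookup is ambiguous: two candidates for "/".
ambiguous = resolve_by_uuid(devices, ROOT_UUID)

# With the marking, the host can ignore devices it did not create itself.
filtered = resolve_by_uuid(devices, ROOT_UUID, trusted_only=True)

if __name__ == "__main__":
    print(ambiguous)
    print(filtered)
```

The point of the model is only that a lookup keyed on on-disk identifiers
needs some out-of-band notion of trust once untrusted users can make
block devices appear.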

--Andy

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Seth Forshee

On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > Greg Kroah-Hartman  writes:
> > > 
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > >> > I think having to pick and choose what device nodes you want in a
> > > > >> > container is a good thing.  Besides, you would have to do the same
> > > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > > >> > decision here, especially as it knows exactly what it wants to do
> > > > >> > much more so than the kernel ever can.
> > > >> 
> > > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > > >> is that we simply want to allow a container to say "give me a loop
> > > >> device to use" and have it receive a unique loop device (or 3), without
> > > >> having to pre-assign them.  I think that would be cleaner to do using
> > > >> a pseudofs and loop-control device, rather than having to have a
> > > >> daemon in userspace on the host farming those out in response to
> > > >> some, I don't know, dbus request?
> > > >
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that.  So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > > 
> > > Yes. Something like devpts (without the newinstance option).  Built to
> > > allow unprivileged users to create loopback devices.
> > 
> > That's where I started, and I've got code, so I guess I'll clean it up
> > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > gets to do privileged block device ioctls, including reading partitions
> 
> Sorry, where did that come from?  What Eric was referring to below is
> the fs superblock readers not being trusted.  Maybe I glossed over another
> email where it was mentioned?

You must have. Take a look at [1].

To repeat the point: the ioctl to reread partitions (along with several
other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
change this to an ns_capable check without at minimum the block layer
knowing about the namespace associated with the block device. Ergo we
can't reread partitions if this is done entirely within the loop driver
via a pseudo fs.

[1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
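
The difference between the two checks can be sketched with a small
userspace model.  This is an illustration only, not kernel code: the
classes UserNS, Task and BlockDev are invented here, the owner_ns field
is the very association the block layer is said above to lack, and the
model ignores the kernel's rule about the owner of a namespace.  It only
encodes that a capability held in a user namespace also applies to that
namespace's descendants.

```python
from dataclasses import dataclass
from typing import Optional

CAP_SYS_ADMIN = "CAP_SYS_ADMIN"


@dataclass(eq=False)
class UserNS:
    """Toy user namespace: a name and a parent pointer."""
    name: str
    parent: Optional["UserNS"] = None


@dataclass(eq=False)
class Task:
    userns: UserNS
    caps: frozenset


@dataclass(eq=False)
class BlockDev:
    name: str
    owner_ns: UserNS  # hypothetical: namespace the device belongs to


INIT_NS = UserNS("init")


def ns_capable(task, target_ns, cap):
    """True if task holds cap in its own userns and that userns is
    target_ns itself or one of target_ns's ancestors."""
    if cap not in task.caps:
        return False
    ns = target_ns
    while ns is not None:
        if ns is task.userns:
            return True
        ns = ns.parent
    return False


def capable(task, cap):
    """System-wide check: the capability must hold in the initial userns."""
    return ns_capable(task, INIT_NS, cap)


def may_reread_partitions_today(task, dev):
    # Models the current capable(CAP_SYS_ADMIN) check; dev is not consulted.
    return capable(task, CAP_SYS_ADMIN)


def may_reread_partitions_ns_aware(task, dev):
    # Models an ns_capable check, which needs to know the device's namespace.
    return ns_capable(task, dev.owner_ns, CAP_SYS_ADMIN)


container_ns = UserNS("container", parent=INIT_NS)
host_root = Task(INIT_NS, frozenset({CAP_SYS_ADMIN}))
container_root = Task(container_ns, frozenset({CAP_SYS_ADMIN}))
container_loop = BlockDev("loop7", owner_ns=container_ns)
host_disk = BlockDev("sda", owner_ns=INIT_NS)
```

In this model container root is refused under today's check, allowed on
its own loop device under the namespace-aware check, and still refused on
a host-owned device, while host root passes every check.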


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Serge E. Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
> On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > Greg Kroah-Hartman  writes:
> > 
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > >> > I think having to pick and choose what device nodes you want in a
> > >> > container is a good thing.  Besides, you would have to do the same
> > >> > thing in the kernel anyway, what's wrong with userspace making the
> > >> > decision here, especially as it knows exactly what it wants to do
> > >> > much more so than the kernel ever can.
> > >> 
> > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > >> is that we simply want to allow a container to say "give me a loop
> > >> device to use" and have it receive a unique loop device (or 3), without
> > >> having to pre-assign them.  I think that would be cleaner to do using
> > >> a pseudofs and loop-control device, rather than having to have a
> > >> daemon in userspace on the host farming those out in response to
> > >> some, I don't know, dbus request?
> > >
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> > 
> > Yes. Something like devpts (without the newinstance option).  Built to
> > allow unprivileged users to create loopback devices.
> 
> That's where I started, and I've got code, so I guess I'll clean it up
> and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> gets to do privileged block device ioctls, including reading partitions

Sorry, where did that come from?  What Eric was referring to below is
the fs superblock readers not being trusted.  Maybe I glossed over another
email where it was mentioned?

> on a block device which has been assigned to a container, then I guess
> that approach works well enough.
> 
> > There is still a huge kettle of fish in with verifying a filesystem is
> > safe from a hostile user that has access to the block device while the
> > filesystem is mounted.
> > 
> > Having a few filesystems that are robust enough to trust with arbitrary
> > filesystem corruption would be very interesting.
> > 
> > I assume unprivileged and hostile users because if you trusted the real
> > root inside of your container this would not be an issue.
> > 
> > Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Serge E. Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > > I think having to pick and choose what device nodes you want in a
> > > > > > container is a good thing.  Besides, you would have to do the same
> > > > > > thing in the kernel anyway, what's wrong with userspace making the
> > > > > > decision here, especially as it knows exactly what it wants to do
> > > > > > much more so than the kernel ever can.
> > > > > 
> > > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > > is that we simply want to allow a container to say "give me a loop
> > > > > device to use" and have it receive a unique loop device (or 3),
> > > > > without having to pre-assign them.  I think that would be cleaner to
> > > > > do using a pseudofs and loop-control device, rather than having to
> > > > > have a daemon in userspace on the host farming those out in response
> > > > > to some, I don't know, dbus request?
> > > > 
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that.  So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > > 
> > > No matter what I don't think we get out of this without driver core
> > > changes, whether this was done in loop or by creating something new.
> > > Not unless the whole thing is punted to userspace, anyway.
> > > 
> > > The first problem is that many block device ioctls check for
> > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > > not really sure. But loop does at minimum support partitions, and to get
> > > that functionality in an unprivileged container at least the block layer
> > > needs to know the namespace which has privileges for that device.
> > 
> > That's fine, you should have those permissions in a container if you
> > want to do something like that on a loop device, right?
> 
> Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
> Any user possessing CAP_SYS_ADMIN can do about as much damage as real
> root can, whether or not you use user namespaces, so it would compromise
> a lot of the security we're just bringing to containers.
> 
> > > The second is that all block devices automatically appear in devtmpfs.
> > > The scenario I'm concerned about is that the host could unknowingly use
> > > a loop device exposed to a container, then the container could see data
> > > from the host.
> > 
> > I don't think that's a real issue, the host should know not to do that.
> > 
> > > So we either need a flag to tell the driver core not to create a node
> > > in devtmpfs, or we need a privileged manager in userspace to remove
> > > them (which kind of defeats the purpose). And it gets more complicated
> > > when partition block devs are mixed in, because they can be created
> > > without involvement from the driver - they would need to inherit the
> > > "no devtmpfs node" property from their parent, and if the driver uses
> > > a pseudo fs to create device nodes for userspace then it needs to be
> > > informed about the partitions too so it can create those nodes.
> > 
> > I don't think that will be needed.  Root in a host can do whatever it
> > wants in the containers, so mixing up block devices is the least of the
> > issues involved :)
> > 
> > > So maybe we could get by without the privileged ioctls, as long as it
> > > was understood that unprivileged containers can't do partitioning. But I
> > > do think the devtmpfs problem would need to be addressed.
> > 
> > I don't think unprivileged containers should be able to do partitioning.
> > An unprivileged user can't do that, so why should a container be any
> > different?
> 
> To make sure we're on the same page with terminology, there's an
> unprivileged container and a secure container.  In the former, there's

Hm, that terminology (which isn't what we've been using) could be
useful, but is still not quite precise enough if we're going down
that road.

> no root user (all the processes run as non-root), so the container isn't

"there is no root user" and "all processes run as non-root" are not the
same thing.  Is it just that no processes are running as root?  Or that
uid 0 in the container is not mapped at all and hence not achievable?

The former really isn't a function of the container itself, and depends
on there really not being any setuid-root or capability-wielding files
available in the container.

If the latter, and you're hoping to claim that the host is saved from
the container exerc

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Seth Forshee

On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> Greg Kroah-Hartman  writes:
> 
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> >> > I think having to pick and choose what device nodes you want in a
> >> > container is a good thing.  Besides, you would have to do the same thing
> >> > in the kernel anyway, what's wrong with userspace making the decision
> >> > here, especially as it knows exactly what it wants to do much more so
> >> > than the kernel ever can.
> >> 
> >> For 'real' devices that sounds sensible.  The thing about loop devices
> >> is that we simply want to allow a container to say "give me a loop
> >> device to use" and have it receive a unique loop device (or 3), without
> >> having to pre-assign them.  I think that would be cleaner to do using
> >> a pseudofs and loop-control device, rather than having to have a
> >> daemon in userspace on the host farming those out in response to
> >> some, I don't know, dbus request?
> >
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
> 
> Yes. Something like devpts (without the newinstance option).  Built to
> allow unprivileged users to create loopback devices.

That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
on a block device which has been assigned to a container, then I guess
that approach works well enough.

> There is still a huge kettle of fish in with verifying a filesystem is
> safe from a hostile user that has access to the block device while the
> filesystem is mounted.
> 
> Having a few filesystems that are robust enough to trust with arbitrary
> filesystem corruption would be very interesting.
> 
> I assume unprivileged and hostile users because if you trusted the real
> root inside of your container this would not be an issue.
> 
> Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Michael H. Warfield

On Thu, 2014-05-15 at 21:35 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > I think having to pick and choose what device nodes you want in a
> > > container is a good thing.  Besides, you would have to do the same thing
> > > in the kernel anyway, what's wrong with userspace making the decision
> > > here, especially as it knows exactly what it wants to do much more so
> > > than the kernel ever can.
> > 
> > For 'real' devices that sounds sensible.  The thing about loop devices
> > is that we simply want to allow a container to say "give me a loop
> > device to use" and have it receive a unique loop device (or 3), without
> > having to pre-assign them.  I think that would be cleaner to do using
> > a pseudofs and loop-control device, rather than having to have a
> > daemon in userspace on the host farming those out in response to
> > some, I don't know, dbus request?

> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yeah, a lot of dynamic devices (like serial devices) can be handled in
user space with the proviso that we could use some way to tickle udev
and hotplug in the container with events.

But the loop device is the real ugly duckling here.  It's a unique case
of an on-demand device with a shared control device that's not really
hot-plug and not really deterministic enough to be handled purely in
user space.  It presents unique challenges unto itself.
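
For anyone who hasn't poked at the shared control device, here is a
minimal sketch of how a process asks /dev/loop-control for an unused
loop device today.  The ioctl number comes from <linux/loop.h>; the
helper names (get_free_loop_index, allocate_loop_device) are made up
for illustration, and the injectable ioctl argument exists only so the
logic can be exercised without touching a real device.

```python
import fcntl
import os

# ioctl request number from <linux/loop.h>
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_index(control_fd, ioctl=fcntl.ioctl):
    """Ask the loop-control device for the index of an unused loop device."""
    return ioctl(control_fd, LOOP_CTL_GET_FREE)


def allocate_loop_device(control_path="/dev/loop-control"):
    """Return the path of a free loop device (hypothetical helper).

    Opening the control device normally requires privileges on the host,
    which is exactly the part a container cannot do on its own.
    """
    fd = os.open(control_path, os.O_RDWR)
    try:
        return "/dev/loop%d" % get_free_loop_index(fd)
    finally:
        os.close(fd)
```

The index handed back is global to the host, so nothing here is
namespace-aware; that is the gap being discussed.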

Makes sense to me.

> greg k-h

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Eric W. Biederman

Greg Kroah-Hartman  writes:

> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
>> > I think having to pick and choose what device nodes you want in a
>> > container is a good thing.  Besides, you would have to do the same thing
>> > in the kernel anyway, what's wrong with userspace making the decision
>> > here, especially as it knows exactly what it wants to do much more so
>> > than the kernel ever can.
>> 
>> For 'real' devices that sounds sensible.  The thing about loop devices
>> is that we simply want to allow a container to say "give me a loop
>> device to use" and have it receive a unique loop device (or 3), without
>> having to pre-assign them.  I think that would be cleaner to do using
>> a pseudofs and loop-control device, rather than having to have a
>> daemon in userspace on the host farming those out in response to
>> some, I don't know, dbus request?
>
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yes. Something like devpts (without the newinstance option).  Built to
allow unprivileged users to create loopback devices.

There is still a huge kettle of fish involved in verifying that a
filesystem is safe from a hostile user who has access to the block
device while the filesystem is mounted.

Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.

I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Fri, May 16, 2014 at 12:28:35PM -0700, James Bottomley wrote:
> On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > > I think having to pick and choose what device nodes you want in a
> > > > > > container is a good thing.  Besides, you would have to do the same
> > > > > > thing in the kernel anyway, what's wrong with userspace making the
> > > > > > decision here, especially as it knows exactly what it wants to do
> > > > > > much more so than the kernel ever can.
> > > > > 
> > > > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > > > is that we simply want to allow a container to say "give me a loop
> > > > > > device to use" and have it receive a unique loop device (or 3),
> > > > > > without having to pre-assign them.  I think that would be cleaner to
> > > > > > do using a pseudofs and loop-control device, rather than having to
> > > > > > have a daemon in userspace on the host farming those out in response
> > > > > > to some, I don't know, dbus request?
> > > > 
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that.  So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > > 
> > > No matter what I don't think we get out of this without driver core
> > > changes, whether this was done in loop or by creating something new.
> > > Not unless the whole thing is punted to userspace, anyway.
> > > 
> > > The first problem is that many block device ioctls check for
> > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > > not really sure. But loop does at minimum support partitions, and to get
> > > that functionality in an unprivileged container at least the block layer
> > > needs to know the namespace which has privileges for that device.
> > 
> > That's fine, you should have those permissions in a container if you
> > want to do something like that on a loop device, right?
> 
> Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
> Any user possessing CAP_SYS_ADMIN can do about as much damage as real
> root can, whether or not you use user namespaces, so it would compromise
> a lot of the security we're just bringing to containers.
> 
> > > The second is that all block devices automatically appear in devtmpfs.
> > > The scenario I'm concerned about is that the host could unknowingly use
> > > a loop device exposed to a container, then the container could see data
> > > from the host.
> > 
> > I don't think that's a real issue, the host should know not to do that.
> > 
> > > So we either need a flag to tell the driver core not to create a node
> > > in devtmpfs, or we need a privileged manager in userspace to remove
> > > them (which kind of defeats the purpose). And it gets more complicated
> > > when partition block devs are mixed in, because they can be created
> > > without involvement from the driver - they would need to inherit the
> > > "no devtmpfs node" property from their parent, and if the driver uses
> > > a pseudo fs to create device nodes for userspace then it needs to be
> > > informed about the partitions too so it can create those nodes.
> > 
> > I don't think that will be needed.  Root in a host can do whatever it
> > wants in the containers, so mixing up block devices is the least of the
> > issues involved :)
> > 
> > > So maybe we could get by without the privileged ioctls, as long as it
> > > was understood that unprivileged containers can't do partitioning. But I
> > > do think the devtmpfs problem would need to be addressed.
> > 
> > I don't think unprivileged containers should be able to do partitioning.
> > An unprivileged user can't do that, so why should a container be any
> > different?
> 
> To make sure we're on the same page with terminology, there's an
> unprivileged container and a secure container.  In the former, there's
> no root user (all the processes run as non-root), so the container isn't
> expected to perform any actions root would ... that's easy.  In a secure
> container, root is mapped to a nobody user in the host, so is
> effectively unprivileged, but root in the container expects to look like
> a real root within the VPS (and thus may expect to partition things,
> depending on how they've been given access to the block device).  The
> big problem is giving back capabilities to the container root such that
> a) it loses them if it escapes the container and b) it doesn't get
> sufficient capabilities to damage the system.

Based on your description what I was talking about is a secure
container.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Michael H. Warfield

On Fri, 2014-05-16 at 12:20 -0700, James Bottomley wrote:
> On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote:
> > On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> > > > PS - Apparently both parallels and Michael independently
> > > > project devices which are hot-plugged on the host into containers.
> > > > That also seems like something worth talking about (best practices,
> > > > shortcomings, use cases not met by it, any ways that the kernel can
> > > > help out) at ksummit/linuxcon.
> > 
> > > I was told that containers would never want devices hotplugged into
> > > them.
> > 
> > Interesting.  You were told they (who they?) would never want them?  Who
> > said that?  I would have never thought that given that other
> > implementations can provide that.  I would certainly want them.  Seems
> > strange to explicitly relegate LXC containers to being second class
> > citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.

> That would probably be me.  Running hotplug inside a container is a
> security problem and, since containers are easily entered by the host,
> it's very easy to listen for the hotplug in the host and inject it into
> the container using nsenter.

In all virtualization, the host, and particularly root on the host,
exists as deus ex machina, the "god outside the machine".  Containers
are at the mercy of whoever is root on the host.  Even hardware
virtualization cannot protect you from the host.  If you want to hear
some frightening talks on virtualization, catch Joanna (miss little blue
pill) Rutkowska some time.  I'm particularly interested in her takes on
the "anti evil-maid attacks" and I sat in on her talks on the "north
bridge" and "south bridge" malware evasion techniques.  She's a good
speaker who makes powerful points that make you sweat but is pleasant in
face to face conversation.  I've played with her Qubes distribution a
couple of times and the way it works with the TPM to ensure a secure
boot is interesting.  But that's a completely different topic on trusted
computing.

OTOH, there are plenty of other things to worry about in all forms of
virtualization.  At Internet Security Systems, where I was a founder,
fellow, and "X-Force Senior Wizard", we were looking at the ability to
leak information through the USB subsystem.  No isolation is perfect,
especially when you have USB enabled.

But that's my turf.

> I don't think the intention is to label anyone's implementation as
> preferred.  What this shows, I think, is that we all have different
> practises when it comes to setting up containers.  Some are necessary
> because our containers are different.  Some could do with serious
> examination to see if there's really a best way to do the action which
> we would then all use.

And I hope to contribute to the discussion of said actions.

> > I might believe you were never told they would need them, but that's a
> > totally different sense.  Are we going to tell RedHat and the Docker
> > people that LXC is an inferior technology that is complex and unreliable
> > (to quote another poster) compared to these others?  They're saying this
> > will be enterprise technology.  If I go to Amazon AWS or other VPS
> > services and compare, are we not going to stand on a level playing
> > field?  Admittedly, I don't expect Amazon AWS to provide me with serial
> > consoles, but I do expect to be able to mount file system images within
> > my VPS.

> Well, that's another nasty, isn't it.  We all have different ways of
> coping with mount in the container.  I think at plumbers we need to sit
> down with some of this plumbing and work out which pipes carry the same
> fluids and whether we could unify them.

Concur

> As an aside (probably requiring a new thread) we were wondering about
> some type of notifier on the mount call that we could vector into the
> host to perform the action.  The main issue for us is mount of procfs,
> which really needs to be a bind mount in a container.  All of this led
> me to speculate that we could use some type of syscall notifier
> mechanism to manage capabilities in the host and even intercept and
> complete the syscall action within the host rather than having to keep
> evolving more and more complex kernel drivers to do this.

Interesting.  That could be very useful.  That might even help with the
loop device case where the mounts have to go through loop devices for
things like file system images and builds.  Very interesting...

> James

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread James Bottomley

On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > I think having to pick and choose what device nodes you want in a
> > > > > container is a good thing.  Besides, you would have to do the same
> > > > > thing in the kernel anyway, what's wrong with userspace making the
> > > > > decision here, especially as it knows exactly what it wants to do
> > > > > much more so than the kernel ever can.
> > > > 
> > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > is that we simply want to allow a container to say "give me a loop
> > > > device to use" and have it receive a unique loop device (or 3), without
> > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > a pseudofs and loop-control device, rather than having to have a
> > > > daemon in userspace on the host farming those out in response to
> > > > some, I don't know, dbus request?
> > > 
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> > 
> > No matter what I don't think we get out of this without driver core
> > changes, whether this was done in loop or by creating something new.
> > Not unless the whole thing is punted to userspace, anyway.
> > 
> > The first problem is that many block device ioctls check for
> > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > not really sure. But loop does at minimum support partitions, and to get
> > that functionality in an unprivileged container at least the block layer
> > needs to know the namespace which has privileges for that device.
> 
> That's fine, you should have those permissions in a container if you
> want to do something like that on a loop device, right?

Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
Any user possessing CAP_SYS_ADMIN can do about as much damage as real
root can, whether or not you use user namespaces, so it would compromise
a lot of the security we're just bringing to containers.
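
To make the capability concrete, here is a small sketch that checks
whether a process holds CAP_SYS_ADMIN by decoding the CapEff line of
/proc/<pid>/status.  The bit number (21) is from <linux/capability.h>;
the function names and the sample status text are illustrative only.

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>


def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap_sys_admin(status_text):
    """True if the CAP_SYS_ADMIN bit is set in the effective set."""
    return bool((effective_caps(status_text) >> CAP_SYS_ADMIN) & 1)


# Made-up sample inputs: a fully privileged task and a task with no caps.
full = "Name:\tinit\nCapEff:\t0000003fffffffff\n"
none = "Name:\tsh\nCapEff:\t0000000000000000\n"
print(has_cap_sys_admin(full), has_cap_sys_admin(none))
```

Note that the mask alone does not say which user namespace the
capability is relative to, and that is the whole question for the
block layer checks.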

> > The second is that all block devices automatically appear in devtmpfs.
> > The scenario I'm concerned about is that the host could unknowingly use
> > a loop device exposed to a container, then the container could see data
> > from the host.
> 
> I don't think that's a real issue, the host should know not to do that.
> 
> > So we either need a flag to tell the driver core not to create a node
> > in devtmpfs, or we need a privileged manager in userspace to remove
> > them (which kind of defeats the purpose). And it gets more complicated
> > when partition block devs are mixed in, because they can be created
> > without involvement from the driver - they would need to inherit the
> > "no devtmpfs node" property from their parent, and if the driver uses
> > a pseudo fs to create device nodes for userspace then it needs to be
> > informed about the partitions too so it can create those nodes.
> 
> I don't think that will be needed.  Root in a host can do whatever it
> wants in the containers, so mixing up block devices is the least of the
> issues involved :)
> 
> > So maybe we could get by without the privileged ioctls, as long as it
> > was understood that unprivileged containers can't do partitioning. But I
> > do think the devtmpfs problem would need to be addressed.
> 
> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

To make sure we're on the same page with terminology, there's an
unprivileged container and a secure container.  In the former, there's
no root user (all the processes run as non-root), so the container isn't
expected to perform any actions root would ... that's easy.  In a secure
container, root is mapped to a nobody user in the host, so is
effectively unprivileged, but root in the container expects to look like
a real root within the VPS (and thus may expect to partition things,
depending on how they've been given access to the block device).  The
big problem is giving back capabilities to the container root such that
a) it loses them if it escapes the container and b) it doesn't get
sufficient capabilities to damage the system.
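
As a rough illustration of the "root is mapped to a nobody user" case,
the sketch below translates a uid inside the container to the uid the
host sees, using the three-column format of /proc/<pid>/uid_map as read
from the host.  The mapping values (0 100000 65536) and the helper
names are assumptions picked for the example, not anything mandated.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, count) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            ranges.append((inside, outside, count))
    return ranges


def host_uid(container_uid, ranges):
    """Return the host uid for a container uid, or None if it is unmapped."""
    for inside, outside, count in ranges:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


# Example mapping: container uids 0..65535 map to host uids 100000..165535.
ranges = parse_uid_map("0 100000 65536\n")
print(host_uid(0, ranges), host_uid(1000, ranges), host_uid(70000, ranges))
```

With a mapping like this, root in the container is uid 100000 on the
host, so anything it keeps after escaping is only what that
unprivileged host uid can do.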

James



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread James Bottomley

On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote:
> On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> > > PS - Apparently both parallels and Michael independently
> > > project devices which are hot-plugged on the host into containers.
> > > That also seems like something worth talking about (best practices,
> > > shortcomings, use cases not met by it, any ways that the kernel can
> > > help out) at ksummit/linuxcon.
> 
> > I was told that containers would never want devices hotplugged into
> > them.
> 
> Interesting.  You were told they (who they?) would never want them?  Who
> said that?  I would have never thought that given that other
> implementations can provide that.  I would certainly want them.  Seems
> strange to explicitly relegate LXC containers to being second class
> citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.

That would probably be me.  Running hotplug inside a container is a
security problem and, since containers are easily entered by the host,
it's very easy to listen for the hotplug in the host and inject it into
the container using nsenter.
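
A minimal sketch of the injection half, assuming the host has already
seen the hotplug event by whatever means it uses: build an nsenter
command that enters the container's mount namespace and creates the
node there.  The helper name and the example values (pid 1234,
/dev/ttyUSB0, major 188, minor 0) are placeholders, and the listening
side is deliberately left out.

```python
def mknod_in_container_cmd(init_pid, dev_path, dev_type, major, minor):
    """Build an nsenter command line that creates a device node in a container.

    init_pid is the host pid of a process inside the container; dev_type is
    "c" for a character device or "b" for a block device.
    """
    if dev_type not in ("c", "b"):
        raise ValueError("dev_type must be 'c' or 'b'")
    return [
        "nsenter", "--target", str(init_pid), "--mount",
        "mknod", dev_path, dev_type, str(major), str(minor),
    ]


# A host-side agent running as root could hand this list to subprocess.run().
print(" ".join(mknod_in_container_cmd(1234, "/dev/ttyUSB0", "c", 188, 0)))
```

The container never has to run a hotplug daemon of its own in this
arrangement; the host decides which events are forwarded.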

I don't think the intention is to label anyone's implementation as
preferred.  What this shows, I think, is that we all have different
practises when it comes to setting up containers.  Some are necessary
because our containers are different.  Some could do with serious
examination to see if there's really a best way to do the action which
we would then all use.

> I might believe you were never told they would need them, but that's a
> totally different sense.  Are we going to tell RedHat and the Docker
> people that LXC is an inferior technology that is complex and unreliable
> (to quote another poster) compared to these others?  They're saying this
> will be enterprise technology.  If I go to Amazon AWS or other VPS
> services and compare, are we not going to stand on a level playing
> field?  Admittedly, I don't expect Amazon AWS to provide me with serial
> consoles, but I do expect to be able to mount file system images within
> my VPS.

Well, that's another nasty, isn't it.  We all have different ways of
coping with mount in the container.  I think at plumbers we need to sit
down with some of this plumbing and work out which pipes carry the same
fluids and whether we could unify them.

As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action.  The main issue for us is mount of procfs,
which really needs to be a bind mount in a container.  All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.

James


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Greg Kroah-Hartman

On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > I think having to pick and choose what device nodes you want in a
> > > > container is a good thing.  Besides, you would have to do the same thing
> > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > here, especially as it knows exactly what it wants to do much more so
> > > > than the kernel ever can.
> > > 
> > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > is that we simply want to allow a container to say "give me a loop
> > > device to use" and have it receive a unique loop device (or 3), without
> > > having to pre-assign them.  I think that would be cleaner to do using
> > > a pseudofs and loop-control device, rather than having to have a
> > > daemon in userspace on the host farming those out in response to
> > > some, I don't know, dbus request?
> > 
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
> 
> No matter what I don't think we get out of this without driver core
> changes, whether this was done in loop or by creating something new.
> Not unless the whole thing is punted to userspace, anyway.
> 
> The first problem is that many block device ioctls check for
> CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> not really sure. But loop does at minimum support partitions, and to get
> that functionality in an unprivileged container at least the block layer
> needs to know the namespace which has privileges for that device.

That's fine, you should have those permissions in a container if you
want to do something like that on a loop device, right?

> The second is that all block devices automatically appear in devtmpfs.
> The scenario I'm concerned about is that the host could unknowingly use
> a loop device exposed to a container, then the container could see data
> from the host.

I don't think that's a real issue, the host should know not to do that.

> So we either need a flag to tell the driver core not to create a node
> in devtmpfs, or we need a privileged manager in userspace to remove
> them (which kind of defeats the purpose). And it gets more complicated
> when partition block devs are mixed in, because they can be created
> without involvement from the driver - they would need to inherit the
> "no devtmpfs node" property from their parent, and if the driver uses
> a pseudo fs to create device nodes for userspace then it needs to be
> informed about the partitions too so it can create those nodes.

I don't think that will be needed.  Root in a host can do whatever it
wants in the containers, so mixing up block devices is the least of the
issues involved :)

> So maybe we could get by without the privileged ioctls, as long as it
> was understood that unprivileged containers can't do partitioning. But I
> do think the devtmpfs problem would need to be addressed.

I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Fri, May 16, 2014 at 11:28:28AM -0400, Michael H. Warfield wrote:
> On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote:
> > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > I think having to pick and choose what device nodes you want in a
> > > > > container is a good thing.  Besides, you would have to do the same
> > > > > thing in the kernel anyway, what's wrong with userspace making the
> > > > > decision here, especially as it knows exactly what it wants to do
> > > > > much more so than the kernel ever can.
> > > > 
> > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > is that we simply want to allow a container to say "give me a loop
> > > > device to use" and have it receive a unique loop device (or 3), without
> > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > a pseudofs and loop-control device, rather than having to have a
> > > > daemon in userspace on the host farming those out in response to
> > > > some, I don't know, dbus request?
> > > 
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> 
> > No matter what I don't think we get out of this without driver core
> > changes, whether this was done in loop or by creating something new.
> > Not unless the whole thing is punted to userspace, anyway.
> 
> > The first problem is that many block device ioctls check for
> > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > not really sure. But loop does at minimum support partitions, and to get
> > that functionality in an unprivileged container at least the block layer
> > needs to know the namespace which has privileges for that device.
> 
> Whoa!  Time out...  Sorry, this will be an off-topic aside.
> 
> Loop devices support partitions?  I'd love to know how that works.  I've
> tried several times in the past to do that but it's failed every time.
> I haven't been able to find any how-to in the past.  This article was
> just a couple of years ago (after the last time I tried this):
> 
> http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/
> 
> This guy didn't use partitions directly but used the offset to the
> mount, which is what I had to use.  Everything I found always referred
> to using mount offsets in order to mount partitions within a loop
> device.

It's controlled by the loop.max_part module parameter. It defaults to 0,
which means no partition support. For any value > 0, max_part becomes the
maximum available partition number after being rounded up to the nearest
power of 2 minus 1 (so max_part=5 gives you partitions 1 through 7,
max_part=8 gives you 1 through 15, etc).
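
The rounding can be sketched as below, assuming the driver uses a
find-last-set style computation (int.bit_length() stands in for the
kernel's fls() here).  effective_max_part is a hypothetical helper
name, not a kernel function.

```python
def effective_max_part(max_part):
    """Return the highest partition number the loop driver would allow."""
    if max_part <= 0:
        return 0  # 0 means partition support is disabled
    # bit_length() gives the position of the highest set bit, like fls().
    part_shift = max_part.bit_length()
    # Partition 0 is the whole device, hence the "minus 1".
    return (1 << part_shift) - 1


for requested in (0, 5, 8):
    print(requested, "->", effective_max_part(requested))
```

Requests of 5 and 8 come out as 7 and 15; counting the whole-device
minor as well, that is 8 and 16 minors per loop device.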

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Michael H. Warfield

On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote:
> On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > I think having to pick and choose what device nodes you want in a
> > > > container is a good thing.  Besides, you would have to do the same thing
> > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > here, especially as it knows exactly what it wants to do much more so
> > > > than the kernel ever can.
> > > 
> > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > is that we simply want to allow a container to say "give me a loop
> > > device to use" and have it receive a unique loop device (or 3), without
> > > having to pre-assign them.  I think that would be cleaner to do using
> > > a pseudofs and loop-control device, rather than having to have a
> > > daemon in userspace on the host farming those out in response to
> > > some, I don't know, dbus request?
> > 
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.

> No matter what I don't think we get out of this without driver core
> changes, whether this was done in loop or by creating something new.
> Not unless the whole thing is punted to userspace, anyway.

> The first problem is that many block device ioctls check for
> CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> not really sure. But loop does at minimum support partitions, and to get
> that functionality in an unprivileged container at least the block layer
> needs to know the namespace which has privileges for that device.

Whoa!  Time out...  Sorry, this will be an off-topic aside.

Loop devices support partitions?  I'd love to know how that works.  I've
tried several times in the past to do that but it's failed every time.
I haven't been able to find any how-to in the past.  This article was
just a couple of years ago (after the last time I tried this):

http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/

This guy didn't use partitions directly but used the offset to the
mount, which is what I had to use.  Everything I found always referred
to using mount offsets in order to mount partitions within a loop
device.
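
For completeness, the offset approach amounts to reading the start
sector of each partition out of the image's MBR and multiplying by the
sector size.  The sketch below does that for a classic MBR table; it
assumes 512-byte sectors, ignores GPT and extended partitions, and the
function name and demo values are made up for illustration.

```python
import struct

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def mbr_partition_offsets(mbr):
    """Return (number, byte_offset, byte_length) for each used primary slot."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("no MBR signature found")
    result = []
    for slot in range(4):
        entry = mbr[446 + 16 * slot:446 + 16 * (slot + 1)]
        part_type = entry[4]
        # Little-endian start LBA and sector count at bytes 8..15 of the entry.
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if part_type != 0 and num_sectors != 0:
            result.append((slot + 1, start_lba * SECTOR_SIZE,
                           num_sectors * SECTOR_SIZE))
    return result


# Synthetic first sector: one Linux (0x83) partition starting at sector 2048.
demo = bytearray(512)
demo[446 + 4] = 0x83
struct.pack_into("<II", demo, 446 + 8, 2048, 204800)
demo[510:512] = b"\x55\xaa"
print(mbr_partition_offsets(bytes(demo)))
```

The byte offset is what goes into something like
mount -o loop,offset=<byte_offset> disk.img /mnt.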

Regards,
Mike

> The second is that all block devices automatically appear in devtmpfs.
> The scenario I'm concerned about is that the host could unknowingly use
> a loop device exposed to a container, then the container could see data
> from the host. So we either need a flag to tell the driver core not to
> create a node in devtmpfs, or we need a privileged manager in userspace
> to remove them (which kind of defeats the purpose). And it gets more
> complicated when partition block devs are mixed in, because they can be
> created without involvement from the driver - they would need to inherit
> the "no devtmpfs node" property from their parent, and if the driver
> uses a pseudo fs to create device nodes for userspace then it needs to
> be informed about the partitions too so it can create those nodes.
> 
> So maybe we could get by without the privileged ioctls, as long as it
> was understood that unprivileged containers can't do partitioning. But I
> do think the devtmpfs problem would need to be addressed.
> 
> Thanks,
> Seth
> ___
> lxc-devel mailing list
> lxc-de...@lists.linuxcontainers.org
> http://lists.linuxcontainers.org/listinfo/lxc-devel
> 

-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > I think having to pick and choose what device nodes you want in a
> > > container is a good thing.  Besides, you would have to do the same thing
> > > in the kernel anyway, what's wrong with userspace making the decision
> > > here, especially as it knows exactly what it wants to do much more so
> > > than the kernel ever can.
> > 
> > For 'real' devices that sounds sensible.  The thing about loop devices
> > is that we simply want to allow a container to say "give me a loop
> > device to use" and have it receive a unique loop device (or 3), without
> > having to pre-assign them.  I think that would be cleaner to do using
> > a pseudofs and loop-control device, rather than having to have a
> > daemon in userspace on the host farming those out in response to
> > some, I don't know, dbus request?
> 
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.

The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.

The second is that all block devices automatically appear in devtmpfs.
The scenario I'm concerned about is that the host could unknowingly use
a loop device exposed to a container, then the container could see data
from the host. So we either need a flag to tell the driver core not to
create a node in devtmpfs, or we need a privileged manager in userspace
to remove them (which kind of defeats the purpose). And it gets more
complicated when partition block devs are mixed in, because they can be
created without involvement from the driver - they would need to inherit
the "no devtmpfs node" property from their parent, and if the driver
uses a pseudo fs to create device nodes for userspace then it needs to
be informed about the partitions too so it can create those nodes.

So maybe we could get by without the privileged ioctls, as long as it
was understood that unprivileged containers can't do partitioning. But I
do think the devtmpfs problem would need to be addressed.

Thanks,
Seth

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Richard Weinberger

On Fri, May 16, 2014 at 3:42 AM, Michael H. Warfield  wrote:
> On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
>> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
>> > What exactly defines '"normal" use case for a container'?
>
>> Well, I'd say "acting like a virtual machine" is a good start :)
>
> Ok...  And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
> USB devices.  I use the USB hotplug with VirtualBox.  I plug a
> configured USB device in and the VirtualBox VM grabs it.  Virtual
> machines have loopback devices.  I've used them and using them in
> containers is significantly more efficient.  VirtualBox has remote audio
> and a host of other device features.
>
> Now we have some agreement.  Normal is "acting like a virtual machine".
> That's a goal I can agree with.  I want to work toward that goal of
> containers "acting like a virtual machine" just running on a common
> kernel with the host.  It's a challenge.  We're getting there.
>
>> > Not too long ago much of what we can now do with network namespaces
>> > was not a normal container use case.  Neither "you can't do it now"
>> > nor "I don't use it like that" should be grounds for a pre-emptive
>> > nack.  "It will horribly break security assumptions" certainly would
>> > be.
>
>> I agree, and maybe we will get there over time, but this patch is not
>> the way to do that.
>
> Ok...  We have a goal.  Now we can haggle over the details (to
> paraphrase a joke that's as old as I am).
>
>> > That's not to say there might not be good reasons why this in particular
>> > is not appropriate, but ISTM if things are going to be nacked without
>> > consideration of the patchset itself, we ought to be having a ksummit
>> > session to come to a consensus [ or receive a decree, presumably by you :)
>> > but after we have a chance to make our case ] on what things are going to
>> > be un/acceptable.
>
>> I already stood up and publicly said this last year at Plumbers, why
>> is anything now different?
>
> Not much really.  The reality is that more and more people are trying to
> use hotplug devices, network interfaces, and loopback devices in
> containers just like they would in full para or hw virt machines.  We're
> trying to make them work, without it looking like a kludge.  I
> personally agree with you that much of this can be done in host user
> space and, coming out of LinuxPlumbers last year, I've implemented some
> ideas that did not require kernel patches that achieve some of my goals.
>
>> And this patchset is proof of why it's not a good idea.  You really
>> didn't do anything with all of the namespace stuff, except change loop.
>> That's the only thing that cares, so, just do it there, like I said to
>> do so, last August.
>
>> And you are ignoring the notifications to userspace and how namespaces
>> here would deal with that.
>
> That's a problem to deal with.  I don't think anyone is ignoring them.
>
>> > > > Serge mentioned something to me about a loopdevfs (?) thing that someone
>> > > > else is working on.  That would seem to be a better solution in this
>> > > > particular case but I don't know much about it or where it's at.
>> > >
>> > > Ok, let's see those patches then.
>> >
>> > I think Seth has a git tree ready, but not sure which branch he'd want
>> > us to look at.
>> >
>> > Splitting a namespaced devtmpfs from loopdevfs discussion might be
>> > sensible.  However, in defense of a namespaced devtmpfs I'd say
>> > that for userspace to, at every container startup, bind-mount in
>> > devices from the global devtmpfs into a private tmpfs (for systemd's
>> > sake it can't just be on the container rootfs), seems like something
>> > worth avoiding.
>
>> I think having to pick and choose what device nodes you want in a
>> container is a good thing.
>
> Both static and dynamic devices.  It's got to support hotplug.  We have
> (I have) use cases.  That's what I'm trying to do with host udev rules
> and some custom configurations.  I can play games with udev rules.
> Maybe we can keep the user space policies in user space and not burden
> the kernel.
>
>> Besides, you would have to do the same thing
>> in the kernel anyway, what's wrong with userspace making the decision
>> here, especially as it knows exactly what it wants to do much more so
>> than the kernel ever can.
>
> IMHO, there's nothing wrong with that as long as we agree on how it's to
> be done.  I'm not convinced that it can all be done in user space and
> I'm not convinced that name spaced devtmpfs is the magic pill to make it
> all go away either.  Making the user space make the decisions and having
> the kernel enforce them is a principle worth considering.
>
>> > PS - Apparently both parallels and Michael independently
>> > project devices which are hot-plugged on the host into containers.
>> > That also seems like something worth talking about (best practices,
>> > shortcomings, use cases not met by it, any ways tha the ker

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > I think having to pick and choose what device nodes you want in a
> > container is a good thing.  Besides, you would have to do the same thing
> > in the kernel anyway, what's wrong with userspace making the decision
> > here, especially as it knows exactly what it wants to do much more so
> > than the kernel ever can.
> 
> For 'real' devices that sounds sensible.  The thing about loop devices
> is that we simply want to allow a container to say "give me a loop
> device to use" and have it receive a unique loop device (or 3), without
> having to pre-assign them.  I think that would be cleaner to do using
> a pseudofs and loop-control device, rather than having to have a
> daemon in userspace on the host farming those out in response to
> some, I don't know, dbus request?

I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that.  So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > What exactly defines '"normal" use case for a container'?
> 
> Well, I'd say "acting like a virtual machine" is a good start :)
> 
> > Not too long ago much of what we can now do with network namespaces
> > was not a normal container use case.  Neither "you can't do it now"
> > nor "I don't use it like that" should be grounds for a pre-emptive
> > nack.  "It will horribly break security assumptions" certainly would
> > be.
> 
> I agree, and maybe we will get there over time, but this patch is not
> the way to do that.

Ok.  [ I/we may be asking for more details later, but think there is enough
below :), particularly the point about event forwarding ]  Thanks.

> > That's not to say there might not be good reasons why this in particular
> > is not appropriate, but ISTM if things are going to be nacked without
> > consideration of the patchset itself, we ought to be having a ksummit
> > session to come to a consensus [ or receive a decree, presumably by you :)
> > but after we have a chance to make our case ] on what things are going to
> > be un/acceptable.
> 
> I already stood up and publicly said this last year at Plumbers, why
> is anything now different?

Well I've simply never had a chance to talk to you since then to find out
exactly what it is that is unacceptable, and why.  And, of course, code
makes it easier to discuss these things.

> And this patchset is proof of why it's not a good idea.  You really
> didn't do anything with all of the namespace stuff, except change loop.
> That's the only thing that cares, so, just do it there, like I said to
> do so, last August.

Sorry, just do it where?

> And you are ignoring the notifications to userspace and how namespaces
> here would deal with that.

Good point.  Addressing that is at the same time necessary, interesting,
and complicated.

> > > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > > else is working on.  That would seem to be a better solution in this
> > > > particular case but I don't know much about it or where it's at.
> > > 
> > > Ok, let's see those patches then.
> > 
> > I think Seth has a git tree ready, but not sure which branch he'd want
> > us to look at.
> > 
> > Splitting a namespaced devtmpfs from loopdevfs discussion might be
> > sensible.  However, in defense of a namespaced devtmpfs I'd say
> > that for userspace to, at every container startup, bind-mount in
> > devices from the global devtmpfs into a private tmpfs (for systemd's
> > sake it can't just be on the container rootfs), seems like something
> > worth avoiding.
> 
> I think having to pick and choose what device nodes you want in a
> container is a good thing.  Besides, you would have to do the same thing
> in the kernel anyway, what's wrong with userspace making the decision
> here, especially as it knows exactly what it wants to do much more so
> than the kernel ever can.

For 'real' devices that sounds sensible.  The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them.  I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?

> > PS - Apparently both parallels and Michael independently
> > project devices which are hot-plugged on the host into containers.
> > That also seems like something worth talking about (best practices,
> > shortcomings, use cases not met by it, any ways that the kernel can
> > help out) at ksummit/linuxcon.
> 
> I was told that containers would never want devices hotplugged into
> them.  What use case has this happening / needed?

I'm pretty sure I didn't say that .  But I guess
we are combining two topics here, the loop pseudofs and the namespaced
devtmpfs.

The use case of loop-control device and loop pseudofs is to have
multiple chrooted/namespaced programs be able to grab a loop device
on demand which they can use for the obvious things (building a livecd,
extracting file contents, etc) without stepping on each other's toes.  The
namespaced devtmpfs is not required for this.
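
For comparison, the host already has a loop-control node that answers the
"give me a free loop device" request.  A rough Python sketch of that
existing interface, with the ioctl numbers as defined in <linux/loop.h>;
the helper name is hypothetical and this is not code from the patchset.
A loop pseudofs would presumably expose something equivalent per mount:

```python
import fcntl
import os

# ioctl request numbers from <linux/loop.h>
LOOP_CTL_ADD = 0x4C80
LOOP_CTL_REMOVE = 0x4C81
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_device(control_path="/dev/loop-control"):
    """Ask the loop driver for an unused loop device; return its path or None."""
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None  # no loop-control node here, or not permitted to open it
    try:
        # The ioctl's return value is the index of a free /dev/loopN.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
    return "/dev/loop%d" % index
```

Today this typically only works for a privileged caller on the host, and
every caller shares the one global pool of loop devices.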

One advantage of a namespaced devtmpfs would be sane-looking devices
in unprivileged containers.  Currently we have to bind-mount the host's
/dev/{full,zero,etc} which, due to uid and gid mappings, then shows up
as:

crw-rw-rw- 1 nobody nogroup   1, 7 May 12 13:35 full
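
That display follows from how id maps are applied: ids with no mapping in
the namespace are reported as the overflow id.  A small Python sketch of
the translation, assuming the three-column format of /proc/<pid>/uid_map,
a single level of nesting, and the default overflow id of 65534; the
helper names and the example map are made up:

```python
OVERFLOW_ID = 65534  # default of /proc/sys/kernel/overflowuid ("nobody")


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, count) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def id_seen_inside(host_id, id_map):
    """Translate a host (outside) id to the id shown inside the namespace."""
    for inside, outside, count in id_map:
        if outside <= host_id < outside + count:
            return inside + (host_id - outside)
    return OVERFLOW_ID  # unmapped ids show up as nobody/nogroup


# Hypothetical map: container ids 0..65535 backed by host ids 100000..165535.
example_map = parse_id_map("0 100000 65536\n")
```

With that map, a bind-mounted host node owned by host uid 0 has no
mapping, so id_seen_inside(0, example_map) gives the overflow id, which is
what ls prints as nobody.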

Also you mentioned uevent forwarding above.  Michael has talked several
times about having userspace on the host 'pass' devices into the
container.  One thing which I believe he and Eric have discussed
before was how to have userspace in the container be notified when
a device is passed in.  It seems to me that at least this is something
that would be simpler

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Michael H. Warfield

On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > What exactly defines '"normal" use case for a container'?

> Well, I'd say "acting like a virtual machine" is a good start :)

Ok...  And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
USB devices.  I use the USB hotplug with VirtualBox.  I plug a
configured USB device in and the VirtualBox VM grabs it.  Virtual
machines have loopback devices.  I've used them and using them in
containers is significantly more efficient.  VirtualBox has remote audio
and a host of other device features.

Now we have some agreement.  Normal is "acting like a virtual machine".
That's a goal I can agree with.  I want to work toward that goal of
containers "acting like a virtual machine" just running on a common
kernel with the host.  It's a challenge.  We're getting there.

> > Not too long ago much of what we can now do with network namespaces
> > was not a normal container use case.  Neither "you can't do it now"
> > nor "I don't use it like that" should be grounds for a pre-emptive
> > nack.  "It will horribly break security assumptions" certainly would
> > be.

> I agree, and maybe we will get there over time, but this patch is not
> the way to do that.

Ok...  We have a goal.  Now we can haggle over the details (to
paraphrase a joke that's as old as I am).

> > That's not to say there might not be good reasons why this in particular
> > is not appropriate, but ISTM if things are going to be nacked without
> > consideration of the patchset itself, we ought to be having a ksummit
> > session to come to a consensus [ or receive a decree, presumably by you :)
> > but after we have a chance to make our case ] on what things are going to
> > be un/acceptable.

> I already stood up and publicly said this last year at Plumbers, why
> is anything now different?

Not much really.  The reality is that more and more people are trying to
use hotplug devices, network interfaces, and loopback devices in
containers just like they would in full para or hw virt machines.  We're
trying to make them work, without it looking like a kludge.  I
personally agree with you that much of this can be done in host user
space and, coming out of LinuxPlumbers last year, I've implemented some
ideas that did not require kernel patches that achieve some of my goals.

> And this patchset is proof of why it's not a good idea.  You really
> didn't do anything with all of the namespace stuff, except change loop.
> That's the only thing that cares, so, just do it there, like I said to
> do so, last August.

> And you are ignoring the notifications to userspace and how namespaces
> here would deal with that.

That's a problem to deal with.  I don't think anyone is ignoring them.

> > > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > > else is working on.  That would seem to be a better solution in this
> > > > particular case but I don't know much about it or where it's at.
> > > 
> > > Ok, let's see those patches then.
> > 
> > I think Seth has a git tree ready, but not sure which branch he'd want
> > us to look at.
> > 
> > Splitting a namespaced devtmpfs from loopdevfs discussion might be
> > sensible.  However, in defense of a namespaced devtmpfs I'd say
> > that for userspace to, at every container startup, bind-mount in
> > devices from the global devtmpfs into a private tmpfs (for systemd's
> > sake it can't just be on the container rootfs), seems like something
> > worth avoiding.

> I think having to pick and choose what device nodes you want in a
> container is a good thing.

Both static and dynamic devices.  It's got to support hotplug.  We have
(I have) use cases.  That's what I'm trying to do with host udev rules
and some custom configurations.  I can play games with udev rules.
Maybe we can keep the user space policies in user space and not burden
the kernel.

> Besides, you would have to do the same thing
> in the kernel anyway, what's wrong with userspace making the decision
> here, especially as it knows exactly what it wants to do much more so
> than the kernel ever can.

IMHO, there's nothing wrong with that as long as we agree on how it's to
be done.  I'm not convinced that it can all be done in user space and
I'm not convinced that name spaced devtmpfs is the magic pill to make it
all go away either.  Making the user space make the decisions and having
the kernel enforce them is a principle worth considering.

> > PS - Apparently both parallels and Michael independently
> > project devices which are hot-plugged on the host into containers.
> > That also seems like something worth talking about (best practices,
> > shortcomings, use cases not met by it, any ways that the kernel can
> > help out) at ksummit/linuxcon.

> I was told that containers would never want devices hotplugged into
> them.

Interesting.  You were told they (who they?) would never want them?  Who
s

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> What exactly defines '"normal" use case for a container'?

Well, I'd say "acting like a virtual machine" is a good start :)

> Not too long ago much of what we can now do with network namespaces
> was not a normal container use case.  Neither "you can't do it now"
> nor "I don't use it like that" should be grounds for a pre-emptive
> nack.  "It will horribly break security assumptions" certainly would
> be.

I agree, and maybe we will get there over time, but this patch is not
the way to do that.

> That's not to say there might not be good reasons why this in particular
> is not appropriate, but ISTM if things are going to be nacked without
> consideration of the patchset itself, we ought to be having a ksummit
> session to come to a consensus [ or receive a decree, presumably by you :)
> but after we have a chance to make our case ] on what things are going to
> be un/acceptable.

I already stood up and publicly said this last year at Plumbers, why
is anything now different?

And this patchset is proof of why it's not a good idea.  You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.

And you are ignoring the notifications to userspace and how namespaces
here would deal with that.

> > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > else is working on.  That would seem to be a better solution in this
> > > particular case but I don't know much about it or where it's at.
> > 
> > Ok, let's see those patches then.
> 
> I think Seth has a git tree ready, but not sure which branch he'd want
> us to look at.
> 
> Splitting a namespaced devtmpfs from loopdevfs discussion might be
> sensible.  However, in defense of a namespaced devtmpfs I'd say
> that for userspace to, at every container startup, bind-mount in
> devices from the global devtmpfs into a private tmpfs (for systemd's
> sake it can't just be on the container rootfs), seems like something
> worth avoiding.

I think having to pick and choose what device nodes you want in a
container is a good thing.  Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.

> PS - Apparently both parallels and Michael independently
> project devices which are hot-plugged on the host into containers.
> That also seems like something worth talking about (best practices,
> shortcomings, use cases not met by it, any ways that the kernel can
> help out) at ksummit/linuxcon.

I was told that containers would never want devices hotplugged into
them.  What use case has this happening / needed?

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

Am 15.05.2014 22:26, schrieb Serge E. Hallyn:
> Quoting Richard Weinberger (rich...@nod.at):
>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>>>>  wrote:
>>>>> Then don't use a container to build such a thing, or fix the build
>>>>> scripts to not do that :)
>>>>
>>>> I second this.
>>>> To me it looks like some folks try to (ab)use Linux containers
>>>> for purposes where KVM would much better fit in.
>>>> Please don't put more complexity into containers. They are already
>>>> horribly complex
>>>> and error prone.
>>>
>>> I, naturally, disagree :)  The only use case which is inherently not
>>> valid for containers is running a kernel.  Practically speaking there
>>> are other things which likely will never be possible, but if someone
>>> offers a way to do something in containers, "you can't do that in
>>> containers" is not an apropos response.
>>>
>>> "That abstraction is wrong" is certainly valid, as when vpids were
>>> originally proposed and rejected, resulting in the development of
>>> pid namespaces.  "We have to work out (x) first" can be valid (and
>>> I can think of examples here), assuming it's not just trying to hide
>>> behind a catch-22/chicken-egg problem.
>>>
>>> Finally, saying "containers are complex and error prone" is conflating
>>> several large suites of userspace code and many kernel features which
>>> support them.  Being more precise would, if the argument is valid,
>>> lend it a lot more weight.
>>
>> We (my company) have used Linux containers in production since 2011. First LXC,
>> now libvirt-lxc.
>> To understand the internals better I also wrote my own userspace to create/start
>> containers. There are so many things which can hurt you badly.
>> With user namespaces we expose a really big attack surface to regular users.
>> E.g. suddenly a user is allowed to mount filesystems.
> 
> That is currently not the case.  They can mount some virtual filesystems
> and do bind mounts, but cannot mount most real filesystems.  This keeps
> us protected (for now) from potentially unsafe superblock readers in the
> kernel.

Yeah, I meant not only "real" filesystems.
I had VFS issues in mind where an attacker could do bad things
using bind mounts for example.

>> Ask Andy, he found already lots of nasty things...
> 
> Yes, of course, and there may be more to come...
> 
>> I agree that user namespaces are the way to go, all the papering with LSM
>> over security issues is much worse.
>> But we have to make sure that we don't add too many features too fast.
> 
> Agreed.  Like I said, 'we have to work (x) out first' could be valid,
> including 'we should wait (a year?) for user ns issues to fall out
> before relaxing any of the current user ns constraints.'
> 
> On the other hand, not exercising the new code may only mean that
> existing flaws stick around longer, undetected (by most).

Fair point.

>> That said, I like containers a lot because they are cheap, but as they are
>> lightweight, their isolation level is lightweight too.
>> IMHO containers are not a cheap replacement for KVM.
> 
> The building blocks for containers can also be used for entirely
> new, simpler use cases - e.g. perhaps a new fakeroot alternative based
> on user namespace mappings.  Which is why "this is not a use case for
> containers" is not the right way to push back, whether or not the
> feature ends up being appropriate.

Agreed.

Maybe I'm too pessimistic.
We'll see. :-)

Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge E. Hallyn

Quoting Richard Weinberger (rich...@nod.at):
> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> >>  wrote:
> >>> Then don't use a container to build such a thing, or fix the build
> >>> scripts to not do that :)
> >>
> >> I second this.
> >> To me it looks like some folks try to (ab)use Linux containers
> >> for purposes where KVM would much better fit in.
> >> Please don't put more complexity into containers. They are already
> >> horribly complex
> >> and error prone.
> > 
> > I, naturally, disagree :)  The only use case which is inherently not
> > valid for containers is running a kernel.  Practically speaking there
> > are other things which likely will never be possible, but if someone
> > offers a way to do something in containers, "you can't do that in
> > containers" is not an apropos response.
> > 
> > "That abstraction is wrong" is certainly valid, as when vpids were
> > originally proposed and rejected, resulting in the development of
> > pid namespaces.  "We have to work out (x) first" can be valid (and
> > I can think of examples here), assuming it's not just trying to hide
> > behind a catch-22/chicken-egg problem.
> > 
> > Finally, saying "containers are complex and error prone" is conflating
> > several large suites of userspace code and many kernel features which
> > support them.  Being more precise would, if the argument is valid,
> > lend it a lot more weight.
> 
> We (my company) have used Linux containers in production since 2011. First LXC,
> now libvirt-lxc.
> To understand the internals better I also wrote my own userspace to create/start
> containers. There are so many things which can hurt you badly.
> With user namespaces we expose a really big attack surface to regular users.
> E.g. suddenly a user is allowed to mount filesystems.

That is currently not the case.  They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems.  This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.

> Ask Andy, he found already lots of nasty things...

Yes, of course, and there may be more to come...

> I agree that user namespaces are the way to go, all the papering with LSM
> over security issues is much worse.
> But we have to make sure that we don't add too many features too fast.

Agreed.  Like I said, 'we have to work (x) out first' could be valid,
including 'we should wait (a year?) for user ns issues to fall out
before relaxing any of the current user ns constraints.'

On the other hand, not exercising the new code may only mean that
existing flaws stick around longer, undetected (by most).

> That said, I like containers a lot because they are cheap, but as they are
> lightweight, their isolation level is lightweight too.
> IMHO containers are not a cheap replacement for KVM.

The building blocks for containers can also be used for entirely
new, simpler use cases - e.g. perhaps a new fakeroot alternative based
on user namespace mappings.  Which is why "this is not a use case for
containers" is not the right way to push back, whether or not the
feature ends up being appropriate.

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

Am 15.05.2014 21:50, schrieb Serge Hallyn:
> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>>  wrote:
>>> Then don't use a container to build such a thing, or fix the build
>>> scripts to not do that :)
>>
>> I second this.
>> To me it looks like some folks try to (ab)use Linux containers
>> for purposes where KVM would much better fit in.
>> Please don't put more complexity into containers. They are already
>> horribly complex
>> and error prone.
> 
> I, naturally, disagree :)  The only use case which is inherently not
> valid for containers is running a kernel.  Practically speaking there
> are other things which likely will never be possible, but if someone
> offers a way to do something in containers, "you can't do that in
> containers" is not an apropos response.
> 
> "That abstraction is wrong" is certainly valid, as when vpids were
> originally proposed and rejected, resulting in the development of
> pid namespaces.  "We have to work out (x) first" can be valid (and
> I can think of examples here), assuming it's not just trying to hide
> behind a catch-22/chicken-egg problem.
> 
> Finally, saying "containers are complex and error prone" is conflating
> several large suites of userspace code and many kernel features which
> support them.  Being more precise would, if the argument is valid,
> lend it a lot more weight.

We (my company) have used Linux containers in production since 2011: first
LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
For example, a user is suddenly allowed to mount filesystems.
Ask Andy, he has already found lots of nasty things...
I agree that user namespaces are the way to go; all the papering over security
issues with LSMs is much worse.
But we have to make sure that we don't add too many features too fast.

That said, I like containers a lot because they are cheap, but as they are
lightweight, their isolation level is lightweight as well.
IMHO containers are not a cheap replacement for KVM.

Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Richard Weinberger (richard.weinber...@gmail.com):
> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman wrote:
> > Then don't use a container to build such a thing, or fix the build
> > scripts to not do that :)
> 
> I second this.
> To me it looks like some folks try to (ab)use Linux containers
> for purposes where KVM would much better fit in.
> Please don't put more complexity into containers. They are already
> horribly complex and error prone.

I, naturally, disagree :)  The only use case which is inherently not
valid for containers is running a kernel.  Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.

"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces.  "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.

Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them.  Being more precise would, if the argument is valid,
lend it a lot more weight.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman wrote:
> Then don't use a container to build such a thing, or fix the build
> scripts to not do that :)

I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horribly complex and error prone.

-- 
Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Seth Forshee

On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > else is working on.  That would seem to be a better solution in this
> > > particular case but I don't know much about it or where it's at.
> > 
> > Ok, let's see those patches then.
> 
> I think Seth has a git tree ready, but not sure which branch he'd want
> us to look at.

I think the most recent code I've got is the devloop branch of
http://kernel.ubuntu.com/git/sforshee/ubuntu-trusty.git, which is still
a bit messy but gets the idea across. I switched from that to the
devtmpfs approach though for several reasons: the pseudo-fs approach
required some (in my opinion) undesirable collateral changes, it would
require changes to userspace tools (though likely small), and it solves
the problem only for loop devices. Plus if you don't push namespace
awareness down to at least the generic block layer you still can't do
partitions or encrypted loop, and then there are still other problems
which need to be solved to get partition blkdevs inside the mount.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote:
> > On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
> > > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> > > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > > > > > Using devtmpfs is one possible
> > > > > > > > solution, and it would have the added benefit of making container
> > > > > > > > setup simpler. But simply letting containers mount devtmpfs isn't
> > > > > > > > sufficient since the container may need to see a different, more
> > > > > > > > limited set of devices, and because different environments making
> > > > > > > > modifications to the filesystem could lead to conflicts.
> > > > > > > > 
> > > > > > > > This series solves these problems by assigning devices to user
> > > > > > > > namespaces. Each device has an "owner" namespace which specifies
> > > > > > > > which devtmpfs mount the device should appear in as well as
> > > > > > > > allowing privileged operations on the device from that namespace.
> > > > > > > > This defaults to init_user_ns. There's also an ns_global flag to
> > > > > > > > indicate a device should appear in all devtmpfs mounts.
> > > > > > 
> > > > > > > I'd strongly argue that this isn't even a "problem" at all.  And,
> > > > > > > as I said at the Plumbers conference last year, adding namespaces
> > > > > > > to devices isn't going to happen, sorry.  Please don't continue
> > > > > > > down this path.
> > > > > > 
> > > > > > I was just mentioning that to Serge just a week or so ago, reminding
> > > > > > him of what you told all of us face to face back then.  We were
> > > > > > having a discussion over loop devices into containers and this topic
> > > > > > came up.
> > > > 
> > > > It was the loop device use case that got me started down this path in
> > > > the first place, so I don't personally have any interest in physical
> > > > devices right now (though I was sure others would).
> > 
> > > Why do you want to give access to a loop device to a container?
> > > Shouldn't you set up the loop devices before creating the container and
> > > then pass those mount points into the container?  I thought that was how
> > > things worked today, or am I missing something?
> > 
> > Ah, you keep feeding me easy ones.  I need raw access to loop devices
> > and loop-control because I'm using containers to build NST (Network
> > Security Toolkit) distribution iso images (one container is x86_64 while
> > the other is i686).  Each requires 2 loop devices.  You can't set up the
> > loop devices in advance since the containers will be creating the images
> > and building them.  NST tinkers with the base build engine
> > configuration, so I really DON'T want it running on a hard iron host. 
> > There may be other cases where I need other specialized containers for
> > building distros.  I'm also looking at custom builds of Kali (another
> > security distribution).
> 
> Then don't use a container to build such a thing, or fix the build
> scripts to not do that :)
> 
> That is not a "normal" use case for a container at all.  Containers are
> not for "everything", use a virtual machine for some tasks (like this
> one).

Hi Greg,

What exactly defines '"normal" use case for a container'?  Not too long
ago much of what we can now do with network namespaces was not a normal
container use case.  Neither "you can't do it now" nor "I don't use it
like that" should be grounds for a pre-emptive nack.  "It will horribly
break security assumptions" certainly would be.

That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.

> > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > else is working on.  That would seem to be a better solution in this
> > particular case but I don't know much about it or where it's at.
> 
> Ok, let's see those patches then.

I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.

Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible.  However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
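
For readers unfamiliar with that setup step, the following Python sketch
shows roughly what "bind-mount devices into a private tmpfs" means. It
is only an illustration under stated assumptions: the device list, the
helper names plan_dev_binds and apply_dev_binds, and the use of ctypes
to call mount(2) are choices made for this example, not what lxc or any
other runtime actually does.

```python
import ctypes
import os

MS_BIND = 4096  # from <sys/mount.h>

# Hypothetical minimal device set; a real runtime would make this configurable.
DEFAULT_DEVICES = ("null", "zero", "full", "random", "urandom", "tty")


def plan_dev_binds(container_dev, names=DEFAULT_DEVICES, host_dev="/dev"):
    """Return (source, target) pairs: one bind mount per device node."""
    return [(os.path.join(host_dev, n), os.path.join(container_dev, n)) for n in names]


def apply_dev_binds(container_dev, binds):
    """Mount a private tmpfs on container_dev and bind each host node into it.

    Requires root on the host; must be repeated at every container startup.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p)

    def do_mount(source, target, fstype, flags):
        if libc.mount(source.encode(), target.encode(), fstype, flags, None) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), target)

    do_mount("tmpfs", container_dev, b"tmpfs", 0)
    for source, target in binds:
        # A bind mount needs an existing target, so create an empty placeholder file.
        open(target, "a").close()
        do_mount(source, target, None, MS_BIND)
```

The planning step is separate from the mounting step so that the list of
mounts can be inspected without privileges, e.g.
plan_dev_binds("/run/c1/dev", names=("null", "zero")).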

-serge

PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings,

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote:
> On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
> > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > > > > Using devtmpfs is one possible
> > > > > > > solution, and it would have the added benefit of making container
> > > > > > > setup simpler. But simply letting containers mount devtmpfs isn't
> > > > > > > sufficient since the container may need to see a different, more
> > > > > > > limited set of devices, and because different environments making
> > > > > > > modifications to the filesystem could lead to conflicts.
> > > > > > > 
> > > > > > > This series solves these problems by assigning devices to user
> > > > > > > namespaces. Each device has an "owner" namespace which specifies
> > > > > > > which devtmpfs mount the device should appear in as well as
> > > > > > > allowing privileged operations on the device from that namespace.
> > > > > > > This defaults to init_user_ns. There's also an ns_global flag to
> > > > > > > indicate a device should appear in all devtmpfs mounts.
> > > > > 
> > > > > > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > > > > > said at the Plumbers conference last year, adding namespaces to
> > > > > > devices isn't going to happen, sorry.  Please don't continue down
> > > > > > this path.
> > > > 
> > > > I was just mentioning that to Serge just a week or so ago reminding him
> > > > of what you told all of us face to face back then.  We were having a
> > > > discussion over loop devices into containers and this topic came up.
> > > 
> > > It was the loop device use case that got me started down this path in
> > > the first place, so I don't personally have any interest in physical
> > > devices right now (though I was sure others would).
> 
> > Why do you want to give access to a loop device to a container?
> > Shouldn't you set up the loop devices before creating the container and
> > then pass those mount points into the container?  I thought that was how
> > things worked today, or am I missing something?
> 
> Ah, you keep feeding me easy ones.  I need raw access to loop devices
> and loop-control because I'm using containers to build NST (Network
> Security Toolkit) distribution iso images (one container is x86_64 while
> the other is i686).  Each requires 2 loop devices.  You can't set up the
> loop devices in advance since the containers will be creating the images
> and building them.  NST tinkers with the base build engine
> configuration, so I really DON'T want it running on a hard iron host. 
> There may be other cases where I need other specialized containers for
> building distros.  I'm also looking at custom builds of Kali (another
> security distribution).

Then don't use a container to build such a thing, or fix the build
scripts to not do that :)

That is not a "normal" use case for a container at all.  Containers are
not for "everything", use a virtual machine for some tasks (like this
one).

> Serge mentioned something to me about a loopdevfs (?) thing that someone
> else is working on.  That would seem to be a better solution in this
> particular case but I don't know much about it or where it's at.

Ok, let's see those patches then.

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Michael H. Warfield

On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
> On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > > > Using devtmpfs is one possible
> > > > > > solution, and it would have the added benefit of making container
> > > > > > setup simpler. But simply letting containers mount devtmpfs isn't
> > > > > > sufficient since the container may need to see a different, more
> > > > > > limited set of devices, and because different environments making
> > > > > > modifications to the filesystem could lead to conflicts.
> > > > > > 
> > > > > > This series solves these problems by assigning devices to user
> > > > > > namespaces. Each device has an "owner" namespace which specifies which
> > > > > > devtmpfs mount the device should appear in as well as allowing
> > > > > > privileged operations on the device from that namespace. This
> > > > > > defaults to init_user_ns. There's also an ns_global flag to indicate
> > > > > > a device should appear in all devtmpfs mounts.
> > > 
> > > > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > > > said at the Plumbers conference last year, adding namespaces to devices
> > > > isn't going to happen, sorry.  Please don't continue down this path.
> > > 
> > > I was just mentioning that to Serge just a week or so ago reminding him
> > > of what you told all of us face to face back then.  We were having a
> > > discussion over loop devices into containers and this topic came up.
> > 
> > It was the loop device use case that got me started down this path in
> > the first place, so I don't personally have any interest in physical
> > devices right now (though I was sure others would).

> Why do you want to give access to a loop device to a container?
> Shouldn't you set up the loop devices before creating the container and
> then pass those mount points into the container?  I thought that was how
> things worked today, or am I missing something?

Ah, you keep feeding me easy ones.  I need raw access to loop devices
and loop-control because I'm using containers to build NST (Network
Security Toolkit) distribution iso images (one container is x86_64 while
the other is i686).  Each requires 2 loop devices.  You can't set up the
loop devices in advance since the containers will be creating the images
and building them.  NST tinkers with the base build engine
configuration, so I really DON'T want it running on a hard iron host. 
There may be other cases where I need other specialized containers for
building distros.  I'm also looking at custom builds of Kali (another
security distribution).

> Giving the ability for a container to create a loop device at all is a
> horrid idea, as you have pointed out, lots of information leakage could
> easily happen.

It does, but only slightly.  I noticed that losetup will list all the
devices regardless of which container it is run in or which container set
them up.  But that seems to be largely cosmetic.  You can't do anything with
the loop device in the other container.  You can't disconnect it, read it,
or mount it (I've tested it).  In the former case, losetup returns with
no error but does nothing.  In the latter case, you get a busy error.
Not clean, not pretty, but no damage.  Since loop-control is working on
the global pool of loop devices, it's impossible to know what device to
move to what container when the container runs losetup.
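
To make the "listing is global" observation concrete, here is a small
Python sketch that reads loop device state from sysfs. It assumes the
usual /sys/block/loopN/loop/backing_file layout; whether losetup gets
its listing from exactly this place is not something this thread states,
and the function name list_loop_backing_files is invented for the
example.

```python
import os
import re


def list_loop_backing_files(sys_block="/sys/block"):
    """Map each configured loop device name to its backing file, as exposed by sysfs."""
    result = {}
    for name in sorted(os.listdir(sys_block)):
        if not re.fullmatch(r"loop\d+", name):
            continue
        attr = os.path.join(sys_block, name, "loop", "backing_file")
        try:
            with open(attr) as f:
                result[name] = f.read().strip()
        except OSError:
            # No loop/backing_file attribute: the device is not currently set up.
            pass
    return result
```

Since the loop devices come from one global pool, a container that can
see the same sysfs view would get the same listing no matter which
container configured a given device.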

For me, this isn't a serious problem, since it only involves 2
specialized containers out of over 4 dozen containers I have running
across 3 sites.  And those two containers are under my explicit and
exclusive control.  None of the others need it.  I can get away with
adding extra loop devices and adding them to the containers and let
losetup deal with allocation and contention.

Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on.  That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.

Mind you, I heard your arguments at LinuxPlumbers regarding pushing user
space policies into the kernel and all and basically I agree with you,
this should be handled in host system user space and it seems
reasonable.  I'm just pointing out real world cases I have in operation
right now and pointing out that I have solutions for them in host user
space, even if some of them may not be aesthetically pretty.

> greg k-h

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-14 Thread Greg Kroah-Hartman

On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > Using devtmpfs is one possible
> > > > solution, and it would have the added benefit of making container setup
> > > > simpler. But simply letting containers mount devtmpfs isn't sufficient
> > > > since the container may need to see a different, more limited set of
> > > > devices, and because different environments making modifications to
> > > > the filesystem could lead to conflicts.
> > > > 
> > > > This series solves these problems by assigning devices to user
> > > > namespaces. Each device has an "owner" namespace which specifies which
> > > > > devtmpfs mount the device should appear in as well as allowing privileged
> > > > operations on the device from that namespace. This defaults to
> > > > init_user_ns. There's also an ns_global flag to indicate a device should
> > > > appear in all devtmpfs mounts.
> > 
> > > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > > said at the Plumbers conference last year, adding namespaces to devices
> > > isn't going to happen, sorry.  Please don't continue down this path.
> > 
> > I was just mentioning that to Serge just a week or so ago reminding him
> > of what you told all of us face to face back then.  We were having a
> > discussion over loop devices into containers and this topic came up.
> 
> It was the loop device use case that got me started down this path in
> the first place, so I don't personally have any interest in physical
> devices right now (though I was sure others would).

Why do you want to give access to a loop device to a container?
Shouldn't you set up the loop devices before creating the container and
then pass those mount points into the container?  I thought that was how
things worked today, or am I missing something?

Giving the ability for a container to create a loop device at all is a
horrid idea, as you have pointed out, lots of information leakage could
easily happen.

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-14 Thread Seth Forshee

On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > Using devtmpfs is one possible
> > > solution, and it would have the added benefit of making container setup
> > > simpler. But simply letting containers mount devtmpfs isn't sufficient
> > > since the container may need to see a different, more limited set of
> > > devices, and because different environments making modifications to
> > > the filesystem could lead to conflicts.
> > > 
> > > This series solves these problems by assigning devices to user
> > > namespaces. Each device has an "owner" namespace which specifies which
> > > > devtmpfs mount the device should appear in as well as allowing privileged
> > > operations on the device from that namespace. This defaults to
> > > init_user_ns. There's also an ns_global flag to indicate a device should
> > > appear in all devtmpfs mounts.
> 
> > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > said at the Plumbers conference last year, adding namespaces to devices
> > isn't going to happen, sorry.  Please don't continue down this path.
> 
> I was just mentioning that to Serge just a week or so ago reminding him
> of what you told all of us face to face back then.  We were having a
> discussion over loop devices into containers and this topic came up.

It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).

As things stand today, to support loop devices lxc would need to do
something like this: grab some unused loop devices, remove them from
/dev, and make device nodes with appropriate ownership/permissions in
the container's /dev. Otherwise there's potential for accidental
duplicate use of the devices, which besides having unexpected results
could result in information leak into the container. At that point you
have some loop devices that the container can use, but privileged
operations such as re-reading partitions and encrypted loop aren't
possible. Even if you can re-read partitions, device nodes will appear in
the main /dev and not in the container.
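
The node-creation part of that procedure can be sketched in Python as
follows. This is an illustration only: the helper names loop_node_spec
and give_loop_to_container are hypothetical, the code assumes the loop
driver's default max_part=0 (so the minor number equals the loop index),
and it leaves out the steps of picking an unused device and removing it
from the host's /dev.

```python
import os
import stat

LOOP_MAJOR = 7  # block major number reserved for loop devices


def loop_node_spec(index, perms=0o660):
    """Return the (mode, device number) pair for the block node of loop<index>.

    Assumes the loop module's default max_part=0, where minor == index.
    """
    return stat.S_IFBLK | perms, os.makedev(LOOP_MAJOR, index)


def give_loop_to_container(index, container_dev, root_uid, root_gid):
    """Create loop<index> in the container's /dev, owned by the container's root.

    Must run as root on the host, since the container itself cannot mknod.
    """
    mode, dev = loop_node_spec(index)
    path = os.path.join(container_dev, f"loop{index}")
    os.mknod(path, mode, dev)
    # mknod() applies the umask and creates the node as the caller, so fix both up.
    os.chmod(path, stat.S_IMODE(mode))
    os.chown(path, root_uid, root_gid)
    return path
```

Here root_uid and root_gid would be the host ids that the container's
user namespace maps to its uid 0 and gid 0.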

With these patches the container could mount devtmpfs, and since
loop-control is global it would appear in the mount. The
LOOP_CTL_GET_FREE ioctl can be used to get an unused loop device which
will be owned by the container's user namespace, so it will only appear in
that container's devtmpfs mount. Privileged operations would be allowed
on the loop device by root in the namespace, and if partition devices
were created they would inherit the namespace from the parent and thus
show up in the container's devtmpfs mount.
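
For reference, asking loop-control for a free device looks roughly like
this from userspace. The sketch is in Python and uses the
LOOP_CTL_GET_FREE request value from <linux/loop.h>; the helper names
loop_path and get_free_loop_device are made up for the example. On a
kernel without these patches the call simply returns an index from the
global pool; the per-namespace ownership described above is what the
series proposes, not something this code implements.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>


def loop_path(index, dev_dir="/dev"):
    """Return the device node path for loop<index> under dev_dir."""
    return os.path.join(dev_dir, f"loop{index}")


def get_free_loop_device(dev_dir="/dev"):
    """Ask loop-control for an unused loop device and return its node path."""
    fd = os.open(os.path.join(dev_dir, "loop-control"), os.O_RDWR)
    try:
        # The ioctl's return value is the index of a free loop device.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return loop_path(index, dev_dir)
```

A caller with sufficient privilege would then attach a backing file to
the returned node, which is the step losetup performs.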

I think this use case demonstrates some real problems with only half-way
solutions atm. I'm certainly open to other suggestions about how to
solve them.

Thanks,
Seth

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-14 Thread Michael H. Warfield

On Wed, 2014-05-14 at 18:32 -0700, Greg Kroah-Hartman wrote:
> On Wed, May 14, 2014 at 04:34:48PM -0500, Seth Forshee wrote:
> > Unprivileged containers cannot run mknod, making it difficult to support
> > devices which appear at runtime.

> Wait.

> Why would you even want a container to see a "new" device?  That's the
> whole point, your container should see a "clean" system, not the "this
> USB device was just plugged in" system.  Otherwise, how are you going to
> even tell that container a new device showed up?  Are you now going to
> add udev support in containers?  Hah, no.

Oooo...  I can answer that...  Tell me if you've heard this one
before...  (You have back in NOLA last summer)...

I use a USB sharing device that controls a multiport USB serial device
controlling serial consoles to 16 servers and shared between 4
controlling servers.  The sharing control port (a USB HID device) should
be shared between designated containers so that any designated container
owner can "request" a console to one of the other servers (yeah, I know
there can be contention but that's the way the cookie crumbles - most of
the time it's on the master host).  Once they get the sharing device's
attention, they "lose" that HID control device (it disappears from /dev
entirely) and they gain only their designated USBtty{n} device for their
console.  Dynamic devices at their finest.

I worked out a way of dealing with it using udev rules in the host and
shifting devices using subdirectories in /dev.  I got the infrastructure
implemented but didn't finish the specific udev rules.

> > Using devtmpfs is one possible
> > solution, and it would have the added benefit of making container setup
> > simpler. But simply letting containers mount devtmpfs isn't sufficient
> > since the container may need to see a different, more limited set of
> > devices, and because different environments making modifications to
> > the filesystem could lead to conflicts.
> > 
> > This series solves these problems by assigning devices to user
> > namespaces. Each device has an "owner" namespace which specifies which
> > devtmpfs mount the device should appear in as well as allowing privileged
> > operations on the device from that namespace. This defaults to
> > init_user_ns. There's also an ns_global flag to indicate a device should
> > appear in all devtmpfs mounts.

> I'd strongly argue that this isn't even a "problem" at all.  And, as I
> said at the Plumbers conference last year, adding namespaces to devices
> isn't going to happen, sorry.  Please don't continue down this path.

I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then.  We were having a
discussion over loop devices into containers and this topic came up.

> greg k-h

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!
