Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Eric W. Biederman

"Serge E. Hallyn" <se...@hallyn.com> writes:
>> I was aware of FUSE but hadn't ever looked at it much. Looking at it
>> now, this isn't going to satisfy any of the use cases I know about,
>> which are wanting to use filesystems supported in-kernel (isofs, ext*).
>> I don't see that any of these have a FUSE implementation, and I think we
>> gain more from figuring out how to use in-kernel filesystems in
>> containers than trying to find a way to shoehorn selected filesystems
>> into FUSE.
>
> That's why I was wondering how much work it would be to auto-generate
> fuse fs support from the in-kernel source.

So at a quick look I have found fuseext2, fuseiso and mountlo-0.5 (which
claims to have supported all the in-kernel filesystems with the help of
user mode linux).

Given that the first two are just an apt-get install away, fuse really
looks like the shortest path to being able to mount an iso and do other
interesting things.

We probably want something more, but only when performance becomes a
bottleneck.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
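
A minimal sketch of the FUSE route suggested above, assuming the fuseiso or
fuseext2 helper and fusermount are installed. The function names and the
example paths are hypothetical; the only behaviour relied on is that both
helpers take the image first and the mount point second, and that
"fusermount -u" detaches a FUSE mount without root.

```python
from __future__ import annotations

import shlex
import subprocess
from pathlib import Path

# Helpers named in the thread; both accept "<image> <mountpoint>".
SUPPORTED_HELPERS = ("fuseiso", "fuseext2")


def fuse_mount_argv(image: str, mountpoint: str, helper: str = "fuseiso") -> list[str]:
    """Build the argv for an unprivileged FUSE mount of a filesystem image."""
    if helper not in SUPPORTED_HELPERS:
        raise ValueError(f"unsupported helper: {helper}")
    return [helper, image, mountpoint]


def fuse_unmount_argv(mountpoint: str) -> list[str]:
    """Build the argv that detaches a FUSE mount as the mounting user."""
    return ["fusermount", "-u", mountpoint]


def mount_image(image: str, mountpoint: str, helper: str = "fuseiso") -> None:
    """Create the mount point if needed, then run the helper (requires FUSE)."""
    Path(mountpoint).mkdir(parents=True, exist_ok=True)
    subprocess.run(fuse_mount_argv(image, mountpoint, helper), check=True)


if __name__ == "__main__":
    # Only print the commands; nothing is mounted here.
    print(shlex.join(fuse_mount_argv("distro.iso", "mnt")))
    print(shlex.join(fuse_unmount_argv("mnt")))
```

Because the filesystem parsing happens in a user-space process, a malformed
image can at worst crash that helper rather than the in-kernel superblock
reader, which is the property the thread is after.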

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Serge E. Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
> On Fri, May 23, 2014 at 03:23:50PM -0700, Eric W. Biederman wrote:
> > Serge Hallyn <serge.hal...@ubuntu.com> writes:
> > 
> > > Quoting Eric W. Biederman (ebied...@xmission.com):
> > >> 
> > >> 
> > >> >> Ultimately the technical challenge is how do we create a block device
> > >> >> that is safe for a user who does not have any capabilities to use, and
> > >> >> what can we do with that block device to make it useful.
> > >> >
> > >> > Yes, and I'd like to get started solving those challenges. But I also
> > >> > don't think we can address these two points (support partition blkdevs,
> > >> > help prevent more privileged users from using a namespace's loop
> > >> > devices) sufficiently while having an implementation completely
> > >> > contained within the loop driver as Greg is requesting.
> > >> 
> > >> My key take away from the conversation is that we should reduce the
> > >> scope of what is being done to something that makes sense and where
> > >> the problems are immediately visible.
> > >> 
> > >> Part of me would like to suggest that fuse and its ability to imitate
> > >> device nodes might be a more appropriate solution, to something that
> > >
> > > Do you have a link to more info on this?  Some googling got me to an
> > > interesting but old thread on CUSE, but nothing specifically about fuse
> > > doing this.
> > 
> > CUSE is probably what I was thinking of.  It is all part of the fuse
> > code base in the kernel.  And now that I am reminded it is called CUSE
> > I go Duh that is a character device...
> > 
> > Fuse and everything it can do is definitely the filesystem I would like
> > to see most have the audits to be enabled in user namespace.  Fuse
> > was built to be sufficiently paranoid to allow this and so it should not
> > take a lot to take fuse the rest of the way.
> 
> I was aware of FUSE but hadn't ever looked at it much. Looking at it
> now, this isn't going to satisfy any of the use cases I know about,
> which are wanting to use filesystems supported in-kernel (isofs, ext*).
> I don't see that any of these have a FUSE implementation, and I think we
> gain more from figuring out how to use in-kernel filesystems in
> containers than trying to find a way to shoehorn selected filesystems
> into FUSE.

That's why I was wondering how much work it would be to auto-generate
fuse fs support from the in-kernel source.

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Seth Forshee

On Fri, May 23, 2014 at 03:23:50PM -0700, Eric W. Biederman wrote:
> Serge Hallyn <serge.hal...@ubuntu.com> writes:
> 
> > Quoting Eric W. Biederman (ebied...@xmission.com):
> >> 
> >> 
> >> >> Ultimately the technical challenge is how do we create a block device
> >> >> that is safe for a user who does not have any capabilities to use, and
> >> >> what can we do with that block device to make it useful.
> >> >
> >> > Yes, and I'd like to get started solving those challenges. But I also
> >> > don't think we can address these two points (support partition blkdevs,
> >> > help prevent more privileged users from using a namespace's loop
> >> > devices) sufficiently while having an implementation completely
> >> > contained within the loop driver as Greg is requesting.
> >> 
> >> My key take away from the conversation is that we should reduce the
> >> scope of what is being done to something that makes sense and where
> >> the problems are immediately visible.
> >> 
> >> Part of me would like to suggest that fuse and its ability to imitate
> >> device nodes might be a more appropriate solution, to something that
> >
> > Do you have a link to more info on this?  Some googling got me to an
> > interesting but old thread on CUSE, but nothing specifically about fuse
> > doing this.
> 
> CUSE is probably what I was thinking of.  It is all part of the fuse
> code base in the kernel.  And now that I am reminded it is called CUSE
> I go Duh that is a character device...
> 
> Fuse and everything it can do is definitely the filesystem I would like
> to see most have the audits to be enabled in user namespace.  Fuse
> was built to be sufficiently paranoid to allow this and so it should not
> take a lot to take fuse the rest of the way.

I was aware of FUSE but hadn't ever looked at it much. Looking at it
now, this isn't going to satisfy any of the use cases I know about,
which are wanting to use filesystems supported in-kernel (isofs, ext*).
I don't see that any of these have a FUSE implementation, and I think we
gain more from figuring out how to use in-kernel filesystems in
containers than trying to find a way to shoehorn selected filesystems
into FUSE.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread James Bottomley

On Mon, 2014-05-26 at 00:24 +0200, Serge E. Hallyn wrote:
> Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> > > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com>
> > > > > >> wrote:
> > > > > >>>
> > > > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > > > > >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > > > >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > > > >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > > > > >>>>>> <gre...@linuxfoundation.org> wrote:
> > > > > >>>>>>> Then don't use a container to build such a thing, or fix the
> > > > > >>>>>>> build scripts to not do that :)
> > > > > >>>>>>
> > > > > >>>>>> I second this. To me it looks like some folks try to (ab)use
> > > > > >>>>>> Linux containers for purposes where KVM
> > > > > >>>>>> would much better fit in. Please don't put more complexity
> > > > > >>>>>> into containers. They are already horrible
> > > > > >>>>>> complex and error prone.
> > > > > >>>>>
> > > > > >>>>> I, naturally, disagree :)  The only use case which is
> > > > > >>>>> inherently not valid for containers is running a
> > > > > >>>>> kernel.  Practically speaking there are other things which
> > > > > >>>>> likely will never be possible, but if someone
> > > > > >>>>> offers a way to do something in containers, "you can't do that
> > > > > >>>>> in containers" is not an apropos response.
> > > > > >>>>>
> > > > > >>>>> "That abstraction is wrong" is certainly valid, as when vpids
> > > > > >>>>> were originally proposed and rejected,
> > > > > >>>>> resulting in the development of pid namespaces.  "We have to
> > > > > >>>>> work out (x) first" can be valid (and I can
> > > > > >>>>> think of examples here), assuming it's not just trying to hide
> > > > > >>>>> behind a catch-22/chicken-egg problem.
> > > > > >>>>>
> > > > > >>>>> Finally, saying "containers are complex and error prone" is
> > > > > >>>>> conflating several large suites of userspace
> > > > > >>>>> code and many kernel features which support them.  Being more
> > > > > >>>>> precise would, if the argument is valid, lend
> > > > > >>>>> it a lot more weight.
> > > > > >>>>
> > > > > >>>> We (my company) use Linux containers since 2011 in production.
> > > > > >>>> First LXC, now libvirt-lxc. To understand the
> > > > > >>>> internals better I also wrote my own userspace to create/start
> > > > > >>>> containers. There are so many things which can
> > > > > >>>> hurt you badly. With user namespaces we expose a really big
> > > > > >>>> attack surface to regular users. I.e. Suddenly a
> > > > > >>>> user is allowed to mount filesystems.
> > > > > >>>
> > > > > >>> That is currently not the case.  They can mount some virtual
> > > > > >>> filesystems and do bind mounts, but cannot mount
> > > > > >>> most real filesystems.  This keeps us protected (for now) from
> > > > > >>> potentially unsafe superblock readers in the
> > > > > >>> kernel.
> > > > > >>>
> > > > > >>>> Ask Andy, he found already lots of nasty things...
> > > > > >>
> > > > > >> I don't think I have anything brilliant to add to this discussion
> > > > > >> right now, except possibly:
> > > > > >>
> > > > > >> ISTM that Linux distributions are, in general, vulnerable to all
> > > > > >> kinds of shenanigans that would happen if an
> > > > > >> untrusted user can cause a block device to appear.  That user
> > > > > >> doesn't need permission to mount it
> > > > > >
> > > > > > Interesting point.  This would further suggest that we absolutely
> > > > > > must ensure that a loop device which shows up in
> > > > > > the container does not also show up in the host.
> > > > >
> > > > > Can I suggest the usage of the devices cgroup to achieve that?
> > > >
> > > > Not really ... cgroups impose resource limits, it's namespaces that
> > > > impose visibility separations.  In theory this can be done with the
> > > > device namespace that's been proposed; however, a simpler way is simply
> > > > to rm the device node in the host and mknod it in the guest.  I don't
> > > > really see host visibility as a huge problem: in a shared OS
> > > > virtualisation it's not really possible securely to separate the guest
> > > > from the host (only vice versa).
> > > >
> > > > But I really don't think we want to do it this way.  Giving a container
> > > > the ability to do a mount is too dangerous.  What we want to do is
> > > > intercept the mount in the host and perform it on behalf of the guest as
> > > > host root in the guest's mount namespace.  If you do it that way, it
> > >
> > > That doesn't help the problem of guests being able to provide bad input
> > > for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> > > suffering a failure of the imagination - what problem exactly does it
> > > solve?
> >
> > Well, there's two types of fuzzing, one is on sys_mount, which this
> > would help with because the host filters the mount including all
> > parameters and may even redo the mount (from direct to bind etc).
>
> Sorry - I'm not *trying* to be dense, but am still not seeing it.
>
> Let's assume that we continue to be strict about what a container may
> mount - let's say they can only mount using loopdev from blockdev images.
> They have to own the file, as well as the mount target.  Whatever they
> do
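
The host-side interception described in the quoted exchange (the host filters
a guest's mount request, including all parameters, before performing it) could
look roughly like the sketch below. Everything here is an assumption made for
illustration: the MountRequest type, the allowed filesystem list, the forced
options, and the rule that source and target must sit under the container's
root are hypothetical policy choices, not anything proposed in the patch set.

```python
from __future__ import annotations

import os
from dataclasses import dataclass

# Hypothetical policy knobs, chosen only for this example.
ALLOWED_FSTYPES = frozenset({"iso9660", "ext4"})
FORCED_OPTIONS = frozenset({"nosuid", "nodev"})


@dataclass(frozen=True)
class MountRequest:
    source: str
    target: str
    fstype: str
    options: frozenset


def _is_within(path: str, root: str) -> bool:
    """Lexical containment check; it does not resolve symlinks."""
    path = os.path.normpath(path)
    root = os.path.normpath(root)
    return os.path.commonpath([path, root]) == root


def filter_mount(req: MountRequest, container_root: str) -> MountRequest:
    """Return a sanitised copy of the request, or raise PermissionError."""
    if req.fstype not in ALLOWED_FSTYPES:
        raise PermissionError(f"filesystem type not allowed: {req.fstype}")
    if not (os.path.isabs(req.source) and os.path.isabs(req.target)):
        raise PermissionError("source and target must be absolute paths")
    for path in (req.source, req.target):
        if not _is_within(path, container_root):
            raise PermissionError(f"path escapes the container: {path}")
    # The host rewrites the parameters rather than trusting the guest's.
    return MountRequest(req.source, req.target, req.fstype,
                        req.options | FORCED_OPTIONS)
```

As the objection quoted above points out, a filter of this kind only covers
the sys_mount parameters. It does nothing about a crafted image being parsed
by the in-kernel filesystem code.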

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-28 Thread Serge Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Mon, 2014-05-26 at 00:24 +0200, Serge E. Hallyn wrote:
> > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> > > > Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com>
> > > > > > >> wrote:
> > > > > > >>>
> > > > > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > > > > > >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > > > > >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > > > > >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > > > > > >>>>>> <gre...@linuxfoundation.org> wrote:
> > > > > > >>>>>>> Then don't use a container to build such a thing, or fix
> > > > > > >>>>>>> the build scripts to not do that :)
> > > > > > >>>>>>
> > > > > > >>>>>> I second this. To me it looks like some folks try to (ab)use
> > > > > > >>>>>> Linux containers for purposes where KVM
> > > > > > >>>>>> would much better fit in. Please don't put more complexity
> > > > > > >>>>>> into containers. They are already horrible
> > > > > > >>>>>> complex and error prone.
> > > > > > >>>>>
> > > > > > >>>>> I, naturally, disagree :)  The only use case which is
> > > > > > >>>>> inherently not valid for containers is running a
> > > > > > >>>>> kernel.  Practically speaking there are other things which
> > > > > > >>>>> likely will never be possible, but if someone
> > > > > > >>>>> offers a way to do something in containers, "you can't do
> > > > > > >>>>> that in containers" is not an apropos response.
> > > > > > >>>>>
> > > > > > >>>>> "That abstraction is wrong" is certainly valid, as when vpids
> > > > > > >>>>> were originally proposed and rejected,
> > > > > > >>>>> resulting in the development of pid namespaces.  "We have to
> > > > > > >>>>> work out (x) first" can be valid (and I can
> > > > > > >>>>> think of examples here), assuming it's not just trying to
> > > > > > >>>>> hide behind a catch-22/chicken-egg problem.
> > > > > > >>>>>
> > > > > > >>>>> Finally, saying "containers are complex and error prone" is
> > > > > > >>>>> conflating several large suites of userspace
> > > > > > >>>>> code and many kernel features which support them.  Being more
> > > > > > >>>>> precise would, if the argument is valid, lend
> > > > > > >>>>> it a lot more weight.
> > > > > > >>>>
> > > > > > >>>> We (my company) use Linux containers since 2011 in production.
> > > > > > >>>> First LXC, now libvirt-lxc. To understand the
> > > > > > >>>> internals better I also wrote my own userspace to create/start
> > > > > > >>>> containers. There are so many things which can
> > > > > > >>>> hurt you badly. With user namespaces we expose a really big
> > > > > > >>>> attack surface to regular users. I.e. Suddenly a
> > > > > > >>>> user is allowed to mount filesystems.
> > > > > > >>>
> > > > > > >>> That is currently not the case.  They can mount some virtual
> > > > > > >>> filesystems and do bind mounts, but cannot mount
> > > > > > >>> most real filesystems.  This keeps us protected (for now) from
> > > > > > >>> potentially unsafe superblock readers in the
> > > > > > >>> kernel.
> > > > > > >>>
> > > > > > >>>> Ask Andy, he found already lots of nasty things...
> > > > > > >>
> > > > > > >> I don't think I have anything brilliant to add to this
> > > > > > >> discussion right now, except possibly:
> > > > > > >>
> > > > > > >> ISTM that Linux distributions are, in general, vulnerable to all
> > > > > > >> kinds of shenanigans that would happen if an
> > > > > > >> untrusted user can cause a block device to appear.  That user
> > > > > > >> doesn't need permission to mount it
> > > > > > >
> > > > > > > Interesting point.  This would further suggest that we absolutely
> > > > > > > must ensure that a loop device which shows up in
> > > > > > > the container does not also show up in the host.
> > > > > >
> > > > > > Can I suggest the usage of the devices cgroup to achieve that?
> > > > >
> > > > > Not really ... cgroups impose resource limits, it's namespaces that
> > > > > impose visibility separations.  In theory this can be done with the
> > > > > device namespace that's been proposed; however, a simpler way is simply
> > > > > to rm the device node in the host and mknod it in the guest.  I don't
> > > > > really see host visibility as a huge problem: in a shared OS
> > > > > virtualisation it's not really possible securely to separate the guest
> > > > > from the host (only vice versa).
> > > > >
> > > > > But I really don't think we want to do it this way.  Giving a container
> > > > > the ability to do a mount is too dangerous.  What we want to do is
> > > > > intercept the mount in the host and perform it on behalf of the guest as
> > > > > host root in the guest's mount namespace.  If you do it that way, it
> > > >
> > > > That doesn't help the problem of guests being able to provide bad input
> > > > for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> > > > suffering a failure of the imagination - what problem exactly does it
> > > > solve?
> > >
> > > Well, there's two types of fuzzing, one is on sys_mount, which this
> > > would help with because the host filters the mount including all
> > > parameters and may even redo the mount (from direct to bind etc).
> >
> > Sorry - I'm not *trying* to be dense, but am still not seeing it.
> >
> > Let's assume that we continue to be

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-25 Thread James Bottomley

On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
> Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > Quoting Andy Lutomirski (l...@amacapital.net):
> > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> > > >>> 
> > > >>> Quoting Richard Weinberger (rich...@nod.at):
> > >  Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > > >>  wrote:
> > > >>> Then don't use a container to build such a thing, or fix the 
> > > >>> build scripts to not do that :)
> > > >> 
> > > >> I second this. To me it looks like some folks try to (ab)use Linux 
> > > >> containers for purposes where KVM
> > > >> would much better fit in. Please don't put more complexity into 
> > > >> containers. They are already horrible
> > > >> complex and error prone.
> > > > 
> > > > I, naturally, disagree :)  The only use case which is inherently 
> > > > not valid for containers is running a
> > > > kernel.  Practically speaking there are other things which likely 
> > > > will never be possible, but if someone 
> > > > offers a way to do something in containers, "you can't do that in 
> > > > containers" is not an apropos response.
> > > > 
> > > > "That abstraction is wrong" is certainly valid, as when vpids were 
> > > > originally proposed and rejected,
> > > > resulting in the development of pid namespaces.  "We have to work 
> > > > out (x) first" can be valid (and I can
> > > > think of examples here), assuming it's not just trying to hide 
> > > > behind a catch-22/chicken-egg problem.
> > > > 
> > > > Finally, saying "containers are complex and error prone" is 
> > > > conflating several large suites of userspace
> > > > code and many kernel features which support them.  Being more 
> > > > precise would, if the argument is valid, lend
> > > > it a lot more weight.
> > >  
> > >  We (my company) use Linux containers since 2011 in production. First 
> > >  LXC, now libvirt-lxc. To understand the
> > >  internals better I also wrote my own userspace to create/start 
> > >  containers. There are so many things which can
> > >  hurt you badly. With user namespaces we expose a really big attack 
> > >  surface to regular users. I.e. Suddenly a
> > >  user is allowed to mount filesystems.
> > > >>> 
> > > >>> That is currently not the case.  They can mount some virtual 
> > > >>> filesystems and do bind mounts, but cannot mount
> > > >>> most real filesystems.  This keeps us protected (for now) from 
> > > >>> potentially unsafe superblock readers in the 
> > > >>> kernel.
> > > >>> 
> > >  Ask Andy, he found already lots of nasty things...
> > > >> 
> > > >> I don't think I have anything brilliant to add to this discussion 
> > > >> right now, except possibly:
> > > >> 
> > > >> ISTM that Linux distributions are, in general, vulnerable to all kinds 
> > > >> of shenanigans that would happen if an
> > > >> untrusted user can cause a block device to appear.  That user doesn't 
> > > >> need permission to mount it
> > > > 
> > > > Interesting point.  This would further suggest that we absolutely must 
> > > > ensure that a loop device which shows up in
> > > > the container does not also show up in the host.
> > > 
> > > Can I suggest the usage of the devices cgroup to achieve that?
> > 
> > Not really ... cgroups impose resource limits, it's namespaces that
> > impose visibility separations.  In theory this can be done with the
> > device namespace that's been proposed; however, a simpler way is simply
> > to rm the device node in the host and mknod it in the guest.  I don't
> > really see host visibility as a huge problem: in a shared OS
> > virtualisation it's not really possible securely to separate the guest
> > from the host (only vice versa).
> > 
> > But I really don't think we want to do it this way.  Giving a container
> > the ability to do a mount is too dangerous.  What we want to do is
> > intercept the mount in the host and perform it on behalf of the guest as
> > host root in the guest's mount namespace.  If you do it that way, it
> 
> That doesn't help the problem of guests being able to provide bad input
> for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> suffering a failure of the imagination - what problem exactly does it solve?

Well, there's two types of fuzzing, one is on sys_mount, which this
would help with because the host filters the mount including all
parameters and may even redo the mount (from direct to bind etc).

If you're thinking the system can be compromised by fuzzing within the
filesystem, then yes, I agree, but it's the same vulnerability an
unvirtualised host would have, so I don't necessarily see it as our
problem.

The problem vectored mount solves is the one of not wanting root in the
container to have unfettered access to sys_mount because it allows the
host to vet all calls and execute the ones it likes in the context of
real root (possibly after modifying the parameters).

James
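
The host-side vetting James describes can be sketched roughly as follows.
This is only an illustration of the idea, not code from the patch set: the
MountRequest type, the vet_mount function, and the allow-lists are all
hypothetical names and policies chosen for the example. It filters the
filesystem type and options, confines the target to the guest's root, and
returns the rewritten request that the host would then execute as real root.
It does not resolve symlinks, which a real implementation would have to do.

```python
import posixpath
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical policy: filesystem types the host will mount for a guest.
ALLOWED_FSTYPES = frozenset({"iso9660", "tmpfs"})
# Hypothetical policy: options a guest may ask for; anything else is dropped.
ALLOWED_OPTIONS = frozenset({"ro", "noexec", "noatime"})
# Hypothetical policy: options the host always adds.
FORCED_OPTIONS = ("nosuid", "nodev")


@dataclass(frozen=True)
class MountRequest:
    source: str
    target: str
    fstype: str
    options: Tuple[str, ...] = ()


def vet_mount(req: MountRequest, guest_root: str) -> Optional[MountRequest]:
    """Return the request the host would execute, or None to refuse it."""
    if req.fstype not in ALLOWED_FSTYPES:
        return None

    # Re-anchor the target under the guest root and refuse ".." escapes.
    root = posixpath.normpath(guest_root)
    target = posixpath.normpath(posixpath.join(root, req.target.lstrip("/")))
    if posixpath.commonpath([root, target]) != root:
        return None

    # Keep only allow-listed options, then append the forced ones.
    kept = tuple(o for o in req.options if o in ALLOWED_OPTIONS)
    forced = tuple(o for o in FORCED_OPTIONS if o not in kept)
    return MountRequest(req.source, target, req.fstype, kept + forced)
```

Note that this only covers the sys_mount parameters. As the thread points
out, it does nothing about malformed data inside the filesystem image itself.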

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-25 Thread Serge E. Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
 On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote:
  Quoting James Bottomley (james.bottom...@hansenpartnership.com):
   On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
On 05/20/2014 05:19 PM, Serge Hallyn wrote:
 Quoting Andy Lutomirski (l...@amacapital.net):
 On May 15, 2014 1:26 PM, Serge E. Hallyn se...@hallyn.com wrote:
 
 Quoting Richard Weinberger (rich...@nod.at):
 Am 15.05.2014 21:50, schrieb Serge Hallyn:
 Quoting Richard Weinberger (richard.weinber...@gmail.com):
 On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
 gre...@linuxfoundation.org wrote:
 Then don't use a container to build such a thing, or fix the 
 build scripts to not do that :)
 
 I second this. To me it looks like some folks try to (ab)use 
 Linux containers for purposes where KVM
 would much better fit in. Please don't put more complexity into 
 containers. They are already horrible
 complex and error prone.
 
 I, naturally, disagree :)  The only use case which is inherently 
 not valid for containers is running a
 kernel.  Practically speaking there are other things which likely 
 will never be possible, but if someone 
 offers a way to do something in containers, you can't do that in 
 containers is not an apropos response.
 
 That abstraction is wrong is certainly valid, as when vpids 
 were originally proposed and rejected,
 resulting in the development of pid namespaces.  We have to work 
 out (x) first can be valid (and I can
 think of examples here), assuming it's not just trying to hide 
 behind a catch-22/chicken-egg problem.
 
 Finally, saying containers are complex and error prone is 
 conflating several large suites of userspace
 code and many kernel features which support them.  Being more 
 precise would, if the argument is valid, lend
 it a lot more weight.
 
 We (my company) use Linux containers since 2011 in production. 
 First LXC, now libvirt-lxc. To understand the
 internals better I also wrote my own userspace to create/start 
 containers. There are so many things which can
 hurt you badly. With user namespaces we expose a really big attack 
 surface to regular users. I.e. Suddenly a
 user is allowed to mount filesystems.
 
 That is currently not the case.  They can mount some virtual 
 filesystems and do bind mounts, but cannot mount
 most real filesystems.  This keeps us protected (for now) from 
 potentially unsafe superblock readers in the 
 kernel.
 
 Ask Andy, he found already lots of nasty things...
 
 I don't think I have anything brilliant to add to this discussion 
 right now, except possibly:
 
 ISTM that Linux distributions are, in general, vulnerable to all 
 kinds of shenanigans that would happen if an
 untrusted user can cause a block device to appear.  That user 
 doesn't need permission to mount it
 
 Interesting point.  This would further suggest that we absolutely 
 must ensure that a loop device which shows up in
 the container does not also show up in the host.

Can I suggest the usage of the devices cgroup to achieve that?
   
   Not really ... cgroups impose resource limits, it's namespaces that
   impose visibility separations.  In theory this can be done with the
   device namespace that's been proposed; however, a simpler way is simply
   to rm the device node in the host and mknod it in the guest.  I don't
   really see host visibility as a huge problem: in a shared OS
   virtualisation it's not really possible securely to separate the guest
   from the host (only vice versa).
   
   But I really don't think we want to do it this way.  Giving a container
   the ability to do a mount is too dangerous.  What we want to do is
   intercept the mount in the host and perform it on behalf of the guest as
   host root in the guest's mount namespace.  If you do it that way, it
  
  That doesn't help the problem of guests being able to provide bad input
  for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
  suffering a failure of the imagination - what problem exactly does it solve?
 
 Well, there's two types of fuzzing, one is on sys_mount, which this
 would help with because the host filters the mount including all
 parameters and may even redo the mount (from direct to bind etc).

Sorry - I'm not *trying* to be dense, but am still not seeing it.

Let's assume that we continue to be strict about what a container may
mount - let's say they can only mount using loopdev from blockdev images.
They have to own the file, as well as the mount target.  Whatever they
do with sys_mount, the only danger I see is the one where the filesystem
data is bad and causes a DOS or privilege escalation in some bad fs
reading code in the kernel.
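
The restriction Serge outlines (the caller must own both the image file and
the mount target) could be checked along these lines. This is a minimal
sketch under assumptions: may_loop_mount is a hypothetical helper name, and
a real check would live in the kernel or a privileged helper and would also
have to guard against races between the check and the mount.

```python
import os
import stat


def may_loop_mount(image_path: str, target_dir: str, uid: int) -> bool:
    """True if uid owns a regular image file and the target directory."""
    try:
        # lstat so that symlinks are not followed to files owned by others.
        img = os.lstat(image_path)
        tgt = os.lstat(target_dir)
    except OSError:
        return False
    return (
        stat.S_ISREG(img.st_mode)
        and stat.S_ISDIR(tgt.st_mode)
        and img.st_uid == uid
        and tgt.st_uid == uid
    )
```

Passing this check says nothing about the contents of the image, which is
exactly the remaining danger described above: bad filesystem data reaching
the in-kernel reader.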

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-24 Thread Serge Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > Quoting Andy Lutomirski (l...@amacapital.net):
> > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> > >>> 
> > >>> Quoting Richard Weinberger (rich...@nod.at):
> >  Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
> > >>  wrote:
> > >>> Then don't use a container to build such a thing, or fix the build 
> > >>> scripts to not do that :)
> > >> 
> > >> I second this. To me it looks like some folks try to (ab)use Linux 
> > >> containers for purposes where KVM
> > >> would much better fit in. Please don't put more complexity into 
> > >> containers. They are already horrible
> > >> complex and error prone.
> > > 
> > > I, naturally, disagree :)  The only use case which is inherently not 
> > > valid for containers is running a
> > > kernel.  Practically speaking there are other things which likely 
> > > will never be possible, but if someone 
> > > offers a way to do something in containers, "you can't do that in 
> > > containers" is not an apropos response.
> > > 
> > > "That abstraction is wrong" is certainly valid, as when vpids were 
> > > originally proposed and rejected,
> > > resulting in the development of pid namespaces.  "We have to work out 
> > > (x) first" can be valid (and I can
> > > think of examples here), assuming it's not just trying to hide behind 
> > > a catch-22/chicken-egg problem.
> > > 
> > > Finally, saying "containers are complex and error prone" is 
> > > conflating several large suites of userspace
> > > code and many kernel features which support them.  Being more precise 
> > > would, if the argument is valid, lend
> > > it a lot more weight.
> >  
> >  We (my company) use Linux containers since 2011 in production. First 
> >  LXC, now libvirt-lxc. To understand the
> >  internals better I also wrote my own userspace to create/start 
> >  containers. There are so many things which can
> >  hurt you badly. With user namespaces we expose a really big attack 
> >  surface to regular users. I.e. Suddenly a
> >  user is allowed to mount filesystems.
> > >>> 
> > >>> That is currently not the case.  They can mount some virtual 
> > >>> filesystems and do bind mounts, but cannot mount
> > >>> most real filesystems.  This keeps us protected (for now) from 
> > >>> potentially unsafe superblock readers in the 
> > >>> kernel.
> > >>> 
> >  Ask Andy, he found already lots of nasty things...
> > >> 
> > >> I don't think I have anything brilliant to add to this discussion right 
> > >> now, except possibly:
> > >> 
> > >> ISTM that Linux distributions are, in general, vulnerable to all kinds 
> > >> of shenanigans that would happen if an
> > >> untrusted user can cause a block device to appear.  That user doesn't 
> > >> need permission to mount it
> > > 
> > > Interesting point.  This would further suggest that we absolutely must 
> > > ensure that a loop device which shows up in
> > > the container does not also show up in the host.
> > 
> > Can I suggest the usage of the devices cgroup to achieve that?
> 
> Not really ... cgroups impose resource limits, it's namespaces that
> impose visibility separations.  In theory this can be done with the
> device namespace that's been proposed; however, a simpler way is simply
> to rm the device node in the host and mknod it in the guest.  I don't
> really see host visibility as a huge problem: in a shared OS
> virtualisation it's not really possible securely to separate the guest
> from the host (only vice versa).
> 
> But I really don't think we want to do it this way.  Giving a container
> the ability to do a mount is too dangerous.  What we want to do is
> intercept the mount in the host and perform it on behalf of the guest as
> host root in the guest's mount namespace.  If you do it that way, it

That doesn't help the problem of guests being able to provide bad input
for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
suffering a failure of the imagination - what problem exactly does it solve?

> doesn't really matter what device actually shows up in the guest, as
> long as the host knows what to do when the mount request comes along.
> 
> James
> 
> 

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Eric W. Biederman

Serge Hallyn  writes:

> Quoting Eric W. Biederman (ebied...@xmission.com):
>> 
>> 
>> >> Ultimately the technical challenge is how do we create a block device
>> >> that is safe for a user who does not have any capabilities to use, and
>> >> what can we do with that block device to make it useful.
>> >
>> > Yes, and I'd like to get started solving those challenges. But I also
>> > don't think we can address these two points (support partition blkdevs,
>> > help prevent more privileged users from using a namespace's loop
>> > devices) sufficiently while having an implementation completely
>> > contained within the loop driver as Greg is requesting.
>> 
>> My key take away from the conversation is that we should reduce the
>> scope of what is being done to something that makes sense and where the
>> problems are immediately visible.
>> 
>> Part of me would like to suggest that fuse and its ability to imitate
>> device nodes might be a more appropriate solution, to something that
>
> Do you have a link to more info on this?  Some googling got me to an
> interesting but old thread on CUSE, but nothing specifically about fuse
> doing this.

CUSE is probably what I was thinking of.  It is all part of the fuse
code base in the kernel.  And now that I am reminded it is called CUSE
I go "Duh, that is a character device..."

Fuse, and everything it can do, is definitely the filesystem I would most
like to see audited so that it can be enabled in user namespaces.  Fuse
was built to be sufficiently paranoid to allow this, so it should not
take a lot to take fuse the rest of the way.

>> just needs block device access and nothing else.
>> 
>> For purposes of discussion let's call it unprivloopfs.  That can reuse
>> code from the loop device or not as appropriate.  Not supporting
>> partitioning I think is a very reasonable first step until it is shown
>> that we can make good use of partitioning support, and there are not
>> better ways of solving the problem.
>> 
>> I expect the most productive thing to talk about is what is your
>> immediate goal?  Mounting a filesystem?  Building an iso?
>
> For me it would be taking an iso and making some changes to it to
> localize it (i.e. take an install iso and add preseed file).
>
> Now of course in the end there is no reason why we can't do all of
> this with a new suite of libraries which simply uses read/write with
> knowledge of the fs layouts to parse and modify the backing files.
> My concern there is that duplicating all of the fs code seems unlikely
> to improve the soundness of either implementation.  Perhaps we can
> autogenerate this from the kernel source?  Does fuse already do
> something like that?

I am not aware of that.  But I have not worked extensively with fuse.

I do agree that finding a way to perform a read-only mount of an ISO by
an unprivileged user is a very interesting use case.  Given its
interchange medium nature isofs should be as hardened as humanly possible,
and that is likely easier with a read-only filesystem.  And at less than
4000 lines of code isofs is auditable.

So as a target for unprivileged mounts of a block device isofs looks
like a good place to start.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Andy Lutomirski

On Fri, May 23, 2014 at 6:16 AM, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
>> On 05/20/2014 05:19 PM, Serge Hallyn wrote:
>> > Quoting Andy Lutomirski (l...@amacapital.net):
>> >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com> wrote:
>> >>>
>> >>> Quoting Richard Weinberger (rich...@nod.at):
>> >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>> >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>> >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>> >>>>>> <gre...@linuxfoundation.org> wrote:
>> >>>>>>> Then don't use a container to build such a thing, or fix the build
>> >>>>>>> scripts to not do that :)
>> >>>>>>
>> >>>>>> I second this. To me it looks like some folks try to (ab)use Linux
>> >>>>>> containers for purposes where KVM
>> >>>>>> would much better fit in. Please don't put more complexity into
>> >>>>>> containers. They are already horrible
>> >>>>>> complex and error prone.
>> >>>>>
>> >>>>> I, naturally, disagree :)  The only use case which is inherently not
>> >>>>> valid for containers is running a
>> >>>>> kernel.  Practically speaking there are other things which likely will
>> >>>>> never be possible, but if someone
>> >>>>> offers a way to do something in containers, "you can't do that in
>> >>>>> containers" is not an apropos response.
>> >>>>>
>> >>>>> "That abstraction is wrong" is certainly valid, as when vpids were
>> >>>>> originally proposed and rejected,
>> >>>>> resulting in the development of pid namespaces.  "We have to work out
>> >>>>> (x) first" can be valid (and I can
>> >>>>> think of examples here), assuming it's not just trying to hide behind
>> >>>>> a catch-22/chicken-egg problem.
>> >>>>>
>> >>>>> Finally, saying "containers are complex and error prone" is conflating
>> >>>>> several large suites of userspace
>> >>>>> code and many kernel features which support them.  Being more precise
>> >>>>> would, if the argument is valid, lend
>> >>>>> it a lot more weight.
>> >>>>
>> >>>> We (my company) use Linux containers since 2011 in production. First
>> >>>> LXC, now libvirt-lxc. To understand the
>> >>>> internals better I also wrote my own userspace to create/start
>> >>>> containers. There are so many things which can
>> >>>> hurt you badly. With user namespaces we expose a really big attack
>> >>>> surface to regular users. I.e. Suddenly a
>> >>>> user is allowed to mount filesystems.
>> >>>
>> >>> That is currently not the case.  They can mount some virtual filesystems
>> >>> and do bind mounts, but cannot mount
>> >>> most real filesystems.  This keeps us protected (for now) from
>> >>> potentially unsafe superblock readers in the
>> >>> kernel.
>> >>>
>> >>>> Ask Andy, he found already lots of nasty things...
>> >>
>> >> I don't think I have anything brilliant to add to this discussion right 
>> >> now, except possibly:
>> >>
>> >> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
>> >> shenanigans that would happen if an
>> >> untrusted user can cause a block device to appear.  That user doesn't 
>> >> need permission to mount it
>> >
>> > Interesting point.  This would further suggest that we absolutely must 
>> > ensure that a loop device which shows up in
>> > the container does not also show up in the host.
>>
>> Can I suggest the usage of the devices cgroup to achieve that?
>
> Not really ... cgroups impose resource limits, it's namespaces that
> impose visibility separations.  In theory this can be done with the
> device namespace that's been proposed; however, a simpler way is simply
> to rm the device node in the host and mknod it in the guest.  I don't
> really see host visibility as a huge problem: in a shared OS
> virtualisation it's not really possible securely to separate the guest
> from the host (only vice versa).
>
> But I really don't think we want to do it this way.  Giving a container
> the ability to do a mount is too dangerous.  What we want to do is
> intercept the mount in the host and perform it on behalf of the guest as
> host root in the guest's mount namespace.  If you do it that way, it
> doesn't really matter what device actually shows up in the guest, as
> long as the host knows what to do when the mount request comes along.

This is only useful/safe if the host understands what's going on.  By
the host, I mean the host's udev and other system-level stuff.  This
is probably fine for disks and such, but it might not be so great for
loop devices, FUSE, etc.  I already know of one user of containers
that wants container-local FUSE mounts.  This ought to Just Work (tm),
but there's a fair amount of work needed to get there.

--Andy

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread James Bottomley

On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > Quoting Andy Lutomirski (l...@amacapital.net):
> >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com> wrote:
> >>>
> >>> Quoting Richard Weinberger (rich...@nod.at):
> >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
> >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> >>>>>> <gre...@linuxfoundation.org> wrote:
> >>>>>>> Then don't use a container to build such a thing, or fix the build
> >>>>>>> scripts to not do that :)
> >>>>>>
> >>>>>> I second this. To me it looks like some folks try to (ab)use Linux
> >>>>>> containers for purposes where KVM
> >>>>>> would much better fit in. Please don't put more complexity into
> >>>>>> containers. They are already horrible
> >>>>>> complex and error prone.
> >>>>>
> >>>>> I, naturally, disagree :)  The only use case which is inherently not
> >>>>> valid for containers is running a
> >>>>> kernel.  Practically speaking there are other things which likely will
> >>>>> never be possible, but if someone
> >>>>> offers a way to do something in containers, "you can't do that in
> >>>>> containers" is not an apropos response.
> >>>>>
> >>>>> "That abstraction is wrong" is certainly valid, as when vpids were
> >>>>> originally proposed and rejected,
> >>>>> resulting in the development of pid namespaces.  "We have to work out
> >>>>> (x) first" can be valid (and I can
> >>>>> think of examples here), assuming it's not just trying to hide behind a
> >>>>> catch-22/chicken-egg problem.
> >>>>>
> >>>>> Finally, saying "containers are complex and error prone" is conflating
> >>>>> several large suites of userspace
> >>>>> code and many kernel features which support them.  Being more precise
> >>>>> would, if the argument is valid, lend
> >>>>> it a lot more weight.
> >>>>
> >>>> We (my company) use Linux containers since 2011 in production. First
> >>>> LXC, now libvirt-lxc. To understand the
> >>>> internals better I also wrote my own userspace to create/start
> >>>> containers. There are so many things which can
> >>>> hurt you badly. With user namespaces we expose a really big attack
> >>>> surface to regular users. I.e. Suddenly a
> >>>> user is allowed to mount filesystems.
> >>>
> >>> That is currently not the case.  They can mount some virtual filesystems
> >>> and do bind mounts, but cannot mount
> >>> most real filesystems.  This keeps us protected (for now) from
> >>> potentially unsafe superblock readers in the
> >>> kernel.
> >>>
> >>>> Ask Andy, he found already lots of nasty things...
> >> 
> >> I don't think I have anything brilliant to add to this discussion right 
> >> now, except possibly:
> >> 
> >> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
> >> shenanigans that would happen if an
> >> untrusted user can cause a block device to appear.  That user doesn't need 
> >> permission to mount it
> > 
> > Interesting point.  This would further suggest that we absolutely must 
> > ensure that a loop device which shows up in
> > the container does not also show up in the host.
> 
> Can I suggest the usage of the devices cgroup to achieve that?

Not really ... cgroups impose resource limits, it's namespaces that
impose visibility separations.  In theory this can be done with the
device namespace that's been proposed; however, a simpler way is simply
to rm the device node in the host and mknod it in the guest.  I don't
really see host visibility as a huge problem: in a shared OS
virtualisation it's not really possible securely to separate the guest
from the host (only vice versa).
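
For illustration only, the rm/mknod approach could be sketched roughly as
below.  This is a minimal sketch, not a tested recipe: it assumes the loop
driver's block major number 7, the caller holds CAP_MKNOD, and both paths
and the helper names (loop_node_spec, move_loop_node) are hypothetical.

```python
import os
import stat

LOOP_MAJOR = 7  # block major number used by the Linux loop driver


def loop_node_spec(index, perms=0o660):
    """Return the (mode, device) pair describing the node for loop<index>."""
    return stat.S_IFBLK | perms, os.makedev(LOOP_MAJOR, index)


def move_loop_node(host_path, guest_path, index):
    """Remove the node on the host and recreate it under the guest's /dev.

    Needs CAP_MKNOD.  Both paths are hypothetical examples, e.g.
    /dev/loop3 on the host and <guest rootfs>/dev/loop3 for the guest.
    """
    mode, device = loop_node_spec(index)
    os.unlink(host_path)                # "rm the device node in the host"
    os.mknod(guest_path, mode, device)  # "mknod it in the guest"
```

Note that this only changes which /dev directory holds the node; the
underlying device is still the same kernel object.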

But I really don't think we want to do it this way.  Giving a container
the ability to do a mount is too dangerous.  What we want to do is
intercept the mount in the host and perform it on behalf of the guest as
host root in the guest's mount namespace.  If you do it that way, it
doesn't really matter what device actually shows up in the guest, as
long as the host knows what to do when the mount request comes along.

James



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Marian Marinov

On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> Quoting Andy Lutomirski (l...@amacapital.net):
>> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com> wrote:
>>>
>>> Quoting Richard Weinberger (rich...@nod.at):
>>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>>>>>> <gre...@linuxfoundation.org> wrote:
>>>>>>> Then don't use a container to build such a thing, or fix the build
>>>>>>> scripts to not do that :)
>>>>>>
>>>>>> I second this. To me it looks like some folks try to (ab)use Linux
>>>>>> containers for purposes where KVM
>>>>>> would much better fit in. Please don't put more complexity into
>>>>>> containers. They are already horrible
>>>>>> complex and error prone.
>>>>>
>>>>> I, naturally, disagree :)  The only use case which is inherently not
>>>>> valid for containers is running a
>>>>> kernel.  Practically speaking there are other things which likely will
>>>>> never be possible, but if someone
>>>>> offers a way to do something in containers, "you can't do that in
>>>>> containers" is not an apropos response.
>>>>>
>>>>> "That abstraction is wrong" is certainly valid, as when vpids were
>>>>> originally proposed and rejected,
>>>>> resulting in the development of pid namespaces.  "We have to work out (x)
>>>>> first" can be valid (and I can
>>>>> think of examples here), assuming it's not just trying to hide behind a
>>>>> catch-22/chicken-egg problem.
>>>>>
>>>>> Finally, saying "containers are complex and error prone" is conflating
>>>>> several large suites of userspace
>>>>> code and many kernel features which support them.  Being more precise
>>>>> would, if the argument is valid, lend
>>>>> it a lot more weight.
>>>>
>>>> We (my company) use Linux containers since 2011 in production. First LXC,
>>>> now libvirt-lxc. To understand the
>>>> internals better I also wrote my own userspace to create/start containers.
>>>> There are so many things which can
>>>> hurt you badly. With user namespaces we expose a really big attack surface
>>>> to regular users. I.e. Suddenly a
>>>> user is allowed to mount filesystems.
>>>
>>> That is currently not the case.  They can mount some virtual filesystems
>>> and do bind mounts, but cannot mount
>>> most real filesystems.  This keeps us protected (for now) from potentially
>>> unsafe superblock readers in the
>>> kernel.
>>>
>>>> Ask Andy, he found already lots of nasty things...
>> 
>> I don't think I have anything brilliant to add to this discussion right now, 
>> except possibly:
>> 
>> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
>> shenanigans that would happen if an
>> untrusted user can cause a block device to appear.  That user doesn't need 
>> permission to mount it
> 
> Interesting point.  This would further suggest that we absolutely must ensure 
> that a loop device which shows up in
> the container does not also show up in the host.

Can I suggest the usage of the devices cgroup to achieve that?
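
To make the suggestion concrete, a rough sketch of what such a rule could
look like with the cgroup v1 devices controller is below.  The controller
takes entries of the form "type major:minor access" written to devices.allow
or devices.deny; the helper names and the cgroup directory are hypothetical,
and the loop major number 7 is assumed.  It restricts opening and mknod of
the device, it does not hide the node.

```python
import os


def device_rule(dev_type, major, minor, access="rwm"):
    """Format an entry for the cgroup v1 devices controller.

    dev_type is 'a' (all), 'b' (block) or 'c' (char); major and minor may be
    integers or '*'; access is any combination of r, w and m (mknod).
    """
    if dev_type not in ("a", "b", "c"):
        raise ValueError("device type must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a combination of r, w and m")
    return f"{dev_type} {major}:{minor} {access}"


def deny_device(cgroup_dir, rule):
    """Write a rule to devices.deny of a (hypothetical) cgroup directory."""
    with open(os.path.join(cgroup_dir, "devices.deny"), "w") as handle:
        handle.write(rule)


# Example: the rule that would deny a host-side group all access to /dev/loop0.
LOOP0_RULE = device_rule("b", 7, 0)
```

Writing LOOP0_RULE to devices.deny of every cgroup except the container's
own would be one way to keep other users away from that loop device.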

Marian

> 
>> or even necessarily to change its contents on the fly.
>> 
>> E.g. what happens if you boot a machine that contains a malicious disk image 
>> that has the same partition UUID as
>> /?  Nothing good, I imagine.
>> 
>> So if we're going to go down this road, we really need some way to tell the 
>> host that certain devices are not
>> trusted.
>> 
>> --Andy


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-21 Thread Serge Hallyn

Quoting Eric W. Biederman (ebied...@xmission.com):
> 
> 
> >> Ultimately the technical challenge is how do we create a block device
> >> that is safe for a user who does not have any capabilities to use, and
> >> what can we do with that block device to make it useful.
> >
> > Yes, and I'd like to get started solving those challenges. But I also
> > don't think we can address these two points (support partition blkdevs,
> > help prevent more privileged users from using a namespace's loop
> > devices) sufficiently while having an implementation completely
> > contained within the loop driver as Greg is requesting.
> 
> My key take away from the conversation is that we should reduce the
> scope of what is being done to something that makes sense and the
> problems are immediately visible.
> 
> Part of me would like to suggest that fuse and its ability to imitate
> device nodes might be a more appropriate solution, to something that

Do you have a link to more info on this?  Some googling got me to an
interesting but old thread on CUSE, but nothing specifically about fuse
doing this.

> just needs block device access and nothing else.
> 
> For purposes of discussion let's call it unprivloopfs.  That can reuse
> code from the loop device or not as appropriate.  Not supporting
> partitioning I think is a very reasonable first step until it is shown
> that we can make good use of partitioning support, and there are not
> better ways of solving the problem.
> 
> I expect the most productive thing to talk about is what is your
> immediate goal?  Mounting a filesystem?  Building an iso?

For me it would be taking an iso and making some changes to it to
localize it (i.e. take an install iso and add a preseed file).

Now of course in the end there is no reason why we can't do all of
this with a new suite of libraries which simply uses read/write with
knowledge of the fs layouts to parse and modify the backing files.
My concern there is that duplicating all of the fs code seems unlikely
to improve the soundness of either implementation.  Perhaps we can
autogenerate this from the kernel source?  Does fuse already do
something like that?
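
As an illustration of the "read/write with knowledge of the fs layouts" approach, here is a minimal Python sketch that parses a few fields of an ISO 9660 primary volume descriptor using only ordinary reads. It assumes the standard layout (2048-byte sectors, descriptor at sector 16); the function name and the returned dictionary are made up for this example and do not come from any existing library.

```python
import struct

SECTOR_SIZE = 2048   # ISO 9660 logical sector size
PVD_SECTOR = 16      # volume descriptors follow the 16-sector system area


def read_primary_volume_descriptor(image):
    """Parse a few fields of the ISO 9660 primary volume descriptor.

    `image` is any seekable binary file object. Hypothetical helper for
    illustration only, not a library API.
    """
    image.seek(PVD_SECTOR * SECTOR_SIZE)
    pvd = image.read(SECTOR_SIZE)
    if len(pvd) < SECTOR_SIZE:
        raise ValueError("image too short to hold a volume descriptor")
    # Byte 0 is the descriptor type (1 = primary), bytes 1-5 the magic.
    if pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("no primary volume descriptor at sector 16")
    volume_id = pvd[40:72].decode("ascii", errors="replace").rstrip()
    # Numeric fields are stored in both byte orders; read the
    # little-endian half of each.
    (volume_blocks,) = struct.unpack_from("<I", pvd, 80)
    (block_size,) = struct.unpack_from("<H", pvd, 128)
    return {
        "volume_id": volume_id,
        "volume_blocks": volume_blocks,
        "block_size": block_size,
    }
```

Nothing here needs a block device or any capability beyond read access to the image file. Walking directory records and rewriting them is where the duplication of fs code mentioned above would start to hurt.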

> We have a long history with the namespace support of punting on issues
> and not solving them until a long term maintainable solution becomes
> clear.  Let's do what we can to make the problem and the solution clear.

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-21 Thread Eric W. Biederman



>> Ultimately the technical challenge is how do we create a block device
>> that is safe for a user who does not have any capabilities to use, and
>> what can we do with that block device to make it useful.
>
> Yes, and I'd like to get started solving those challenges. But I also
> don't think we can address these two points (support partition blkdevs,
> help prevent more privileged users from using a namespace's loop
> devices) sufficiently while having an implementation completely
> contained within the loop driver as Greg is requesting.

My key take away from the conversation is that we should reduce the
scope of what is being done to something that makes sense and where the
problems are immediately visible.

Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
just needs block device access and nothing else.

For purposes of discussion let's call it unprivloopfs.  That can reuse
code from the loop device or not as appropriate.  Not supporting
partitioning I think is a very reasonable first step until it is shown
that we can make good use of partitioning support, and there are not
better ways of solving the problem.

I expect the most productive thing to talk about is what is your
immediate goal?  Mounting a filesystem?  Building an iso?

We have a long history with the namespace support of punting on issues
and not solving them until a long term maintainable solution becomes
clear.  Let's do what we can to make the problem and the solution clear.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Serge Hallyn (serge.hal...@ubuntu.com):
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> > > Quoting Seth Forshee (seth.fors...@canonical.com):
> > > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > > > Greg Kroah-Hartman  writes:
> > > > > 
> > > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > >> > I think having to pick and choose what device nodes you want in a
> > > > > >> > container is a good thing.  Besides, you would have to do the same
> > > > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > > > >> > decision here, especially as it knows exactly what it wants to do
> > > > > >> > much more so than the kernel ever can.
> > > > > >> 
> > > > > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > > > > >> is that we simply want to allow a container to say "give me a loop
> > > > > >> device to use" and have it receive a unique loop device (or 3), without
> > > > > >> having to pre-assign them.  I think that would be cleaner to do using
> > > > > >> a pseudofs and loop-control device, rather than having to have a
> > > > > >> daemon in userspace on the host farming those out in response to
> > > > > >> some, I don't know, dbus request?
> > > > > >
> > > > > > I agree that loop devices would be nice to have in a container, and that
> > > > > > the existing loop interface doesn't really lend itself to that.  So
> > > > > > create a new type of thing that acts like a loop device in a container.
> > > > > > But don't try to mess with the whole driver core just for a single type
> > > > > > of device.
> > > > > 
> > > > > Yes. Something like devpts (without the newinstance option).  Built to
> > > > > allow unprivileged users to create loopback devices.
> > > > 
> > > > That's where I started, and I've got code, so I guess I'll clean it up
> > > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > > gets to do privileged block device ioctls, including reading partitions
> > > 
> > > Sorry, where did that come from?  What Eric was referring to below is
> > > the fs superblock readers not being trusted.  Maybe I glossed over another
> > > email where it was mentioned?
> > 
> > You must have. Take a look at [1].
> > 
> > To repeat the point: the ioctl to reread partitions (along with several
> > other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> > change this to an ns_capable check without at minimum the block layer
> > knowing about the namespace associated with the block device. Ergo we
> 
> Which only means those changes are necessary :)
> 
> So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
> is interesting (and, depending on the implementation, acceptable).  That
> necessarily includes the minimal blockdev changes to support it.
> 
> > can't reread partitions if this is done entirely within the loop driver
> > via a pseudo fs.
> > 
> > [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191

Hm, yeah, I was confuddling two issues.  Nevertheless, for real block devices I
absolutely agree.  For loop devices I don't.  My answer to

> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

would be that the loop device is a convenience built atop the backing image,
and if the user had the rights to loop-attach the backing image, he can
just as well partition using write(2), so why artificially place this limit?

Nevertheless this is not really a debate worth having until we have a
blockdev fs mountable in a userns.

My main interest currently is with privileged containers.  I think we can
learn plenty from that for now.
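
To make the write(2) argument above concrete, here is a minimal Python sketch that puts a one-entry MBR partition table into an ordinary image file with plain file writes. The helper name and its defaults are invented for this example; CHS fields are left zero, which is an assumption that only LBA addressing matters to whoever reads the table.

```python
import struct

PARTITION_TABLE_OFFSET = 446   # first of four 16-byte MBR partition slots
BOOT_SIGNATURE_OFFSET = 510


def write_mbr_partition(path, start_lba, num_sectors, part_type=0x83):
    """Write a single-entry MBR partition table into an image file.

    Only ordinary file writes are used; no block device node and no
    ioctl is involved. Hypothetical helper for illustration.
    """
    # Entry layout: status, CHS start (3 bytes), type, CHS end (3 bytes),
    # LBA of first sector, number of sectors. CHS is left zero here.
    entry = struct.pack(
        "<B3sB3sII", 0x00, b"\0\0\0", part_type, b"\0\0\0", start_lba, num_sectors
    )
    with open(path, "r+b") as img:
        img.seek(PARTITION_TABLE_OFFSET)
        img.write(entry)
        img.seek(BOOT_SIGNATURE_OFFSET)
        img.write(b"\x55\xaa")
```

Anyone with write access to the backing file can do this. What they cannot do without the privileged ioctl is ask the kernel to re-read the table and create the partition block devices.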

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Seth Forshee

On Mon, May 19, 2014 at 05:04:55PM -0700, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.
> 
> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.
> 
> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like.  What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

But Greg did say that "unprivileged" or "secure" containers (depending
on whose terminology you're using) should not be able to do partitioning
[1]. I don't really understand this stance though, as I don't see what
possible security problems arise from letting root in a user ns do
BLKRRPART on a block device that it's explicitly been granted privileged
use of.

Assuming we come to an agreement that root in a user ns can do BLKRRPART
on some devices, we've got two issues. First, the block layer enforces
this restriction so it has to be aware of what namespace has privileges
for the device, but Greg wants a solution localized to the loop driver.
Second, if we're using a loop pseudo fs then we'd logically want block
devices for the partitions in the loop fs, so we have to create some
mechanism for the loop driver to get notified about these devices being
created.
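
The difference between the two kinds of check can be modelled outside the kernel. The following Python sketch is a simplified toy model of a system-wide capable() check versus an ns_capable() check against the namespace that owns a device; the classes and the "namespace recorded per device" idea are assumptions for illustration, not kernel code.

```python
class UserNS:
    """Toy user namespace: a name, a parent, and the uid that created it."""

    def __init__(self, name, parent=None, owner_uid=0):
        self.name = name
        self.parent = parent
        self.owner_uid = owner_uid


class Task:
    """Toy task: the namespace it lives in, its euid, and its capabilities."""

    def __init__(self, userns, euid, caps):
        self.userns = userns
        self.euid = euid
        self.caps = set(caps)


def ns_capable(task, target_ns, cap):
    """Does `task` hold `cap` relative to `target_ns`?

    Walk from target_ns towards the root: privilege in an ancestor
    namespace carries over to descendants, never the other way round.
    """
    ns = target_ns
    while ns is not None:
        if ns is task.userns:
            return cap in task.caps
        # The creator of a namespace is privileged inside it.
        if ns.parent is task.userns and ns.owner_uid == task.euid:
            return True
        ns = ns.parent
    return False


def capable(task, init_ns, cap):
    """System-wide check: only privilege relative to the initial namespace counts."""
    return ns_capable(task, init_ns, cap)
```

In this model root inside the container fails capable() but passes ns_capable() for a device whose owning namespace is the container's. That is why the check cannot be relaxed unless the layer doing the check knows which namespace the device belongs to.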

> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

The patches I posted have mechanisms to at least mitigate the problem.
First, anyone using loop-control to find a free loop device will never
get a device allocated to a different user ns (the loop pseudo fs code I
have also does this). Second, a given loop block device would only show
up in the devtmpfs of the namespace which owned that device. So a
sufficiently privileged user isn't completely prevented from using the
devices, but since they would have to explicitly mknod the block device
node it should prevent accidental use by a more privileged user.

But I also brought this up previously, and Greg argued that it isn't a
real issue [1].
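
The two mitigations can be pictured with a toy allocator. This Python sketch is only a model of the policy described above (free-device lookup never crosses namespaces, and a device is visible only to its owning namespace); the class and method names are invented and do not reflect the posted patches.

```python
class LoopAllocator:
    """Toy model of a loop-control style allocator with per-namespace ownership."""

    def __init__(self):
        self.owner = {}       # device index -> owning namespace id
        self.in_use = set()   # indices currently bound to a backing file

    def get_free(self, ns):
        """Return a free device index owned by `ns`, creating one if needed."""
        for index, owner in sorted(self.owner.items()):
            if owner == ns and index not in self.in_use:
                self.in_use.add(index)
                return index
        # No free device in this namespace: create a new one rather than
        # handing out a free device that belongs to someone else.
        index = len(self.owner)
        self.owner[index] = ns
        self.in_use.add(index)
        return index

    def release(self, index):
        """Mark a device as no longer bound to a backing file."""
        self.in_use.discard(index)

    def visible_to(self, ns):
        """Device indices that would show up in the /dev of namespace `ns`."""
        return sorted(i for i, owner in self.owner.items() if owner == ns)
```

Note that a more privileged user could still mknod a node for any index by hand. The model only prevents accidental use.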

> I expect getting to the point where this is really and truly usable to be
> technically daunting.
> 
> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.

Thanks,
Seth

> 
> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.
> 
> Eric

[1] http://www.spinics.net/linux/lists/kernel/msg1744750.html

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
> >
> > Quoting Richard Weinberger (rich...@nod.at):
> > > Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > > >>  wrote:
> > > >>> Then don't use a container to build such a thing, or fix the build
> > > >>> scripts to not do that :)
> > > >>
> > > >> I second this.
> > > >> To me it looks like some folks try to (ab)use Linux containers
> > > >> for purposes where KVM would much better fit in.
> > > >> Please don't put more complexity into containers. They are already
> > > >> horribly complex and error prone.
> > > >
> > > > I, naturally, disagree :)  The only use case which is inherently not
> > > > valid for containers is running a kernel.  Practically speaking there
> > > > are other things which likely will never be possible, but if someone
> > > > offers a way to do something in containers, "you can't do that in
> > > > containers" is not an apropos response.
> > > >
> > > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > > originally proposed and rejected, resulting in the development of
> > > > pid namespaces.  "We have to work out (x) first" can be valid (and
> > > > I can think of examples here), assuming it's not just trying to hide
> > > > behind a catch-22/chicken-egg problem.
> > > >
> > > > Finally, saying "containers are complex and error prone" is conflating
> > > > several large suites of userspace code and many kernel features which
> > > > support them.  Being more precise would, if the argument is valid,
> > > > lend it a lot more weight.
> > >
> > > We (my company) have used Linux containers in production since 2011.
> > > First LXC, now libvirt-lxc.
> > > To understand the internals better I also wrote my own userspace to
> > > create/start containers. There are so many things which can hurt you badly.
> > > With user namespaces we expose a really big attack surface to regular users.
> > > I.e. suddenly a user is allowed to mount filesystems.
> >
> > That is currently not the case.  They can mount some virtual filesystems
> > and do bind mounts, but cannot mount most real filesystems.  This keeps
> > us protected (for now) from potentially unsafe superblock readers in the
> > kernel.
> >
> > > Ask Andy, he found already lots of nasty things...
> 
> I don't think I have anything brilliant to add to this discussion
> right now, except possibly:
> 
> ISTM that Linux distributions are, in general, vulnerable to all kinds
> of shenanigans that would happen if an untrusted user can cause a
> block device to appear.  That user doesn't need permission to mount it

Interesting point.  This would further suggest that we absolutely must
ensure that a loop device which shows up in the container does not also
show up in the host.

> or even necessarily to change its contents on the fly.
> 
> E.g. what happens if you boot a machine that contains a malicious disk
> image that has the same partition UUID as /?  Nothing good, I imagine.
> 
> So if we're going to go down this road, we really need some way to
> tell the host that certain devices are not trusted.
> 
> --Andy
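
One way to picture the "tell the host that certain devices are not trusted" idea is a lookup that ignores untrusted devices when resolving a filesystem by UUID. The Python sketch below is purely hypothetical: the `trusted` flag stands in for whatever mechanism would mark a device as container-provided, and no existing tool is claimed to work this way.

```python
def resolve_by_uuid(devices, uuid):
    """Pick the device carrying `uuid`, considering only trusted devices.

    `devices` is a list of (path, uuid, trusted) tuples. The `trusted`
    flag is a hypothetical marker for devices the host itself set up.
    """
    matches = [
        path for path, dev_uuid, trusted in devices if dev_uuid == uuid and trusted
    ]
    if len(matches) != 1:
        # Refuse to guess: zero or several trusted candidates is an error.
        raise LookupError(
            f"expected exactly one trusted device for {uuid}, found {len(matches)}"
        )
    return matches[0]
```

With such a filter, a loop device created inside a container that copies the UUID of / would simply never be a candidate on the host.
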

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Michael H. Warfield (m...@wittsend.com):
> On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> > Seth Forshee  writes:
> > 
> > > What I set out for was feature parity between loop devices in a secure
> > > container and loop devices on the host. Since some operations currently
> > > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > > this is to push knowledge of the user namespace farther down into the
> > > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > > namespace associated with the device.
> > >
> > > That said, I suspect our current use cases can get by without these
> > > capabilities. Really though I suspect this is just deferring the
> > > discussion rather than settling it, and what we'll end up with is little
> > > more than a fancy way for userspace to ask the kernel to run mknod on
> > > its behalf.
> 
> > A fancy way to ask the kernel to run mknod on its behalf is what
> > /dev/pts is.
> 
> > When I suggested this I did not mean you should forgo making changes to
> > allow partitions and the like.  What I intended is that you should find a
> > way to make this safe for users who don't have root capabilities.
> 
> I like to think in terms of the "rootless" configurations where "root"
> per se is not absolute and everything is framed in terms of
> capabilities.
> 
> > Which possibly means that mount needs to learn how to keep a more
> > privileged user from using your new loop devices.
> 
> Not sure I got that one.  A user with "more" privileges may or may not
> have access dependent on the congruence of the privileges.  They're not

Yes so in this case by 'more privileged' he meant a privileged user in a
userns which is an ancestor of the current userns.  It is in fact *more*
privileged than any user in the current userns.

> hierarchical.  If someone has that "priv" then they have access.  If they

They are in fact implicitly hierarchical due to the hierarchical userns
design.

> do not, they do not.
> 
> > I expect getting to the point where this is really and truly usable to be
> > technically daunting.
> 
> Most technically non-trivial problems generally are.
> 
> > Ultimately the technical challenge is how do we create a block device
> > that is safe for a user who does not have any capabilities to use, and
> > what can we do with that block device to make it useful.
> 
> Concur.  It boils down to privilege management and access.  Absolutely
> concur.
> 
> > Only when the question is can this kernel functionality which is
> > otherwise safe confuse a preexisting setuid application do namespace
> > or container bits significantly come into play.
> 
> Ah...  Admittedly it's not as late as our conversation at LinuxPlumbers
> last year in NOLA but...  Maybe late at night but I failed to parse the
> above.
> 
> > Eric
> 
> Regards,
> Mike
> -- 
> Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
>/\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
>NIC whois: MHW9  | An optimist believes we live in the best of all
>  PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!
> 




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
> On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> > Quoting Seth Forshee (seth.fors...@canonical.com):
> > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > > Greg Kroah-Hartman  writes:
> > > > 
> > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > >> > I think having to pick and choose what device nodes you want in a
> > > > >> > container is a good thing.  Besides, you would have to do the same
> > > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > > >> > decision here, especially as it knows exactly what it wants to do
> > > > >> > much more so than the kernel ever can.
> > > > >> 
> > > > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > > > >> is that we simply want to allow a container to say "give me a loop
> > > > >> device to use" and have it receive a unique loop device (or 3), without
> > > > >> having to pre-assign them.  I think that would be cleaner to do using
> > > > >> a pseudofs and loop-control device, rather than having to have a
> > > > >> daemon in userspace on the host farming those out in response to
> > > > >> some, I don't know, dbus request?
> > > > >
> > > > > I agree that loop devices would be nice to have in a container, and that
> > > > > the existing loop interface doesn't really lend itself to that.  So
> > > > > create a new type of thing that acts like a loop device in a container.
> > > > > But don't try to mess with the whole driver core just for a single type
> > > > > of device.
> > > > Yes. Something like devpts (without the newinstance option).  Built to
> > > > allow unprivileged users to create loopback devices.
> > > 
> > > That's where I started, and I've got code, so I guess I'll clean it up
> > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > gets to do privileged block device ioctls, including reading partitions
> > 
> > Sorry, where did that come from?  What Eric was referring to below is
> > the fs superblock readers not being trusted.  Maybe I glossed over another
> > email where it was mentioned?
> 
> You must have. Take a look at [1].
> 
> To repeat the point: the ioctl to reread partitions (along with several
> other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> change this to an ns_capable check without at minimum the block layer
> knowing about the namespace associated with the block device. Ergo we

Which only means those changes are necessary :)

So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
is interesting (and, depending on the implementation, acceptable).  That
necessarily includes the minimal blockdev changes to support it.

> can't reread partitions if this is done entirely within the loop driver
> via a pseudo fs.
> 
> [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
> 

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
 On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
  Quoting Seth Forshee (seth.fors...@canonical.com):
   On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
Greg Kroah-Hartman gre...@linuxfoundation.org writes:

 On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
  I think having to pick and choose what device nodes you want in a
  container is a good thing.  Becides, you would have to do the same 
  thing
  in the kernel anyway, what's wrong with userspace making the 
  decision
  here, especially as it knows exactly what it wants to do much more 
  so
  than the kernel ever can.
 
 For 'real' devices that sounds sensible.  The thing about loop 
 devices
 is that we simply want to allow a container to say give me a loop
 device to use and have it receive a unique loop device (or 3), 
 without
 having to pre-assign them.  I think that would be cleaner to do using
 a pseudofs and loop-control device, rather than having to have a
 daemon in userspace on the host farming those out in response to
 some, I don't know, dbus request?

 I agree that loop devices would be nice to have in a container, and 
 that
 the existing loop interface doesn't really lend itself to that.  So
 create a new type of thing that acts like a loop device in a 
 container.
 But don't try to mess with the whole driver core just for a single 
 type
 of device.

Yes. Something like devpts (without the newinstance option).  Built to
allow unprivileged users to create loopback devices.
   
   That's where I started, and I've got code, so I guess I'll clean it up
   and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
   gets to do privileged block device ioctls, including reading partitions
  
  Sorry, where did that come from?  What Eric was referring to below is
  the fs superblock readers not being trusted.  Maybe I glossed over another
  email where it was mentioned?
 
 You must have. Take a look at [1].
 
 To repeat the point: the ioctl to reread partitions (along with several
 other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
 change this to an ns_capable check without at minimum the block layer
 knowing about the namespace associated with the block device. Ergo we

Which only means those changes are necessary :)

So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
is interesting (and, depending on the implementation, acceptable).  That
necessarily includes the minimal blockdev changes to support it.

 can't reread paritions if this is done entirely within the loop driver
 via a psuedo fs.
 
 [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Michael H. Warfield (m...@wittsend.com):
> On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> > Seth Forshee seth.fors...@canonical.com writes:
> >
> > > What I set out for was feature parity between loop devices in a secure
> > > container and loop devices on the host. Since some operations currently
> > > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > > this is to push knowledge of the user namespace farther down into the
> > > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > > namespace associated with the device.
> > >
> > > That said, I suspect our current use cases can get by without these
> > > capabilities. Really though I suspect this is just deferring the
> > > discussion rather than settling it, and what we'll end up with is little
> > > more than a fancy way for userspace to ask the kernel to run mknod on
> > > its behalf.
> >
> > A fancy way to ask the kernel to run mknod on its behalf is what
> > /dev/pts is.
> >
> > When I suggested this I did not mean you should forgo making changes to
> > allow partitions and the like.  What I intended is that you should find a
> > way to make this safe for users who don't have root capabilities.
>
> I like to think in terms of the "rootless" configurations where "root"
> per se is not absolute and everything is framed in terms of
> capabilities.
>
> > Which possibly means that mount needs to learn how to keep a more
> > privileged user from using your new loop devices.
>
> Not sure I got that one.  A user with "more" privileges may or may not
> have access dependent on the congruence of the privileges.  They're not

Yes, so in this case by 'more privileged' he meant a privileged user in a
userns which is an ancestor of the current userns.  It is in fact *more*
privileged than any user in the current userns.

> hierarchical.  If someone has that "priv" then they have access.  If they

They are in fact implicitly hierarchical due to the hierarchical userns
design.

> do not, they do not.
>
> > To get to the point where this is really and truly usable I expect to be
> > technically daunting.
>
> Most technically non-trivial problems generally are.
>
> > Ultimately the technical challenge is how do we create a block device
> > that is safe for a user who does not have any capabilities to use, and
> > what can we do with that block device to make it useful.
>
> Concur.  It boils down to privilege management and access.  Absolutely
> concur.
>
> > Only when the question is can this kernel functionality which is
> > otherwise safe confuse a preexisting setuid application do namespace
> > or container bits significantly come into play.
>
> Ah...  Admittedly it's not as late as our conversation at LinuxPlumbers
> last year in NOLA but...  Maybe late at night but I failed to parse the
> above.
>
> > Eric
>
> Regards,
> Mike
> --
> Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
>    /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
>    NIC whois: MHW9  | An optimist believes we live in the best of all
>  PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!
>
> ___
> lxc-devel mailing list
> lxc-de...@lists.linuxcontainers.org
> http://lists.linuxcontainers.org/listinfo/lxc-devel


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Andy Lutomirski (l...@amacapital.net):
> On May 15, 2014 1:26 PM, "Serge E. Hallyn" se...@hallyn.com wrote:
> >
> > Quoting Richard Weinberger (rich...@nod.at):
> > > Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > > >> gre...@linuxfoundation.org wrote:
> > > >>> Then don't use a container to build such a thing, or fix the build
> > > >>> scripts to not do that :)
> > > >>
> > > >> I second this.
> > > >> To me it looks like some folks try to (ab)use Linux containers
> > > >> for purposes where KVM would much better fit in.
> > > >> Please don't put more complexity into containers. They are already
> > > >> horrible complex
> > > >> and error prone.
> > > >
> > > > I, naturally, disagree :)  The only use case which is inherently not
> > > > valid for containers is running a kernel.  Practically speaking there
> > > > are other things which likely will never be possible, but if someone
> > > > offers a way to do something in containers, "you can't do that in
> > > > containers" is not an apropos response.
> > > >
> > > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > > originally proposed and rejected, resulting in the development of
> > > > pid namespaces.  "We have to work out (x) first" can be valid (and
> > > > I can think of examples here), assuming it's not just trying to hide
> > > > behind a catch-22/chicken-egg problem.
> > > >
> > > > Finally, saying "containers are complex and error prone" is conflating
> > > > several large suites of userspace code and many kernel features which
> > > > support them.  Being more precise would, if the argument is valid,
> > > > lend it a lot more weight.
> > >
> > > We (my company) use Linux containers since 2011 in production. First LXC,
> > > now libvirt-lxc.
> > > To understand the internals better I also wrote my own userspace to
> > > create/start
> > > containers. There are so many things which can hurt you badly.
> > > With user namespaces we expose a really big attack surface to regular users.
> > > I.e. Suddenly a user is allowed to mount filesystems.
> >
> > That is currently not the case.  They can mount some virtual filesystems
> > and do bind mounts, but cannot mount most real filesystems.  This keeps
> > us protected (for now) from potentially unsafe superblock readers in the
> > kernel.
> >
> > > Ask Andy, he found already lots of nasty things...
>
> I don't think I have anything brilliant to add to this discussion
> right now, except possibly:
>
> ISTM that Linux distributions are, in general, vulnerable to all kinds
> of shenanigans that would happen if an untrusted user can cause a
> block device to appear.  That user doesn't need permission to mount it

Interesting point.  This would further suggest that we absolutely must
ensure that a loop device which shows up in the container does not also
show up in the host.

> or even necessarily to change its contents on the fly.
>
> E.g. what happens if you boot a machine that contains a malicious disk
> image that has the same partition UUID as /?  Nothing good, I imagine.
>
> So if we're going to go down this road, we really need some way to
> tell the host that certain devices are not trusted.
>
> --Andy

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Seth Forshee

On Mon, May 19, 2014 at 05:04:55PM -0700, Eric W. Biederman wrote:
> Seth Forshee seth.fors...@canonical.com writes:
>
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.
>
> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.
>
> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like.  What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

But Greg did say that "unprivileged" or "secure" containers (depending
on whose terminology you're using) should not be able to do partitioning
[1]. I don't really understand this stance though, as I don't see what
possible security problems arise from letting root in a user ns do
BLKRRPART on a block device that it's explicitly been granted privileged
use of.

Assuming we come to an agreement that root in a user ns can do BLKRRPART
on some devices, we've got two issues. First, the block layer enforces
this restriction so it has to be aware of what namespace has privileges
for the device, but Greg wants a solution localized to the loop driver.
Second, if we're using a loop pseudo fs then we'd logically want block
devices for the partitions in the loop fs, so we have to create some
mechanism for the loop driver to get notified about these devices being
created.

> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

The patches I posted have mechanisms to at least mitigate the problem.
First, anyone using loop-control to find a free loop device will never
get a device allocated to a different user ns (the loop pseudo fs code I
have also does this). Second, a given loop block device would only show
up in the devtmpfs of the namespace which owned that device. So a
sufficiently privileged user isn't completely prevented from using the
devices, but since they would have to explicitly mknod the block device
node it should prevent accidental use by a more privileged user.

But I also brought this up previously, and Greg argued that it isn't a
real issue [1].

> To get to the point where this is really and truly usable I expect to be
> technically daunting.
>
> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.

Thanks,
Seth

>
> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.
>
> Eric

[1] http://www.spinics.net/linux/lists/kernel/msg1744750.html
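
For readers unfamiliar with the loop-control interface referred to above, the following is a minimal userspace sketch of asking it for a free loop device. It is not taken from the posted patches. The ioctl number is LOOP_CTL_GET_FREE from linux/loop.h; the helper name, the control_path parameter, and the choice to return None on failure are illustrative assumptions.

```python
import fcntl
import os

# LOOP_CTL_GET_FREE as defined in <linux/loop.h>.
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_index(control_path="/dev/loop-control"):
    """Ask the loop-control device for the index of an unused loop device.

    Returns the index N (for /dev/loopN), or None if the control node is
    missing or the caller is not allowed to use it.
    """
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None
    try:
        # With an integer argument, fcntl.ioctl returns the ioctl's result.
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
```

On a stock kernel the index comes from a single system-wide pool. As described in the message above, the posted patches would ensure that the device handed back is never one allocated to a different user namespace.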

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-20 Thread Serge Hallyn

Quoting Serge Hallyn (serge.hal...@ubuntu.com):
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> > > Quoting Seth Forshee (seth.fors...@canonical.com):
> > > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > > > Greg Kroah-Hartman gre...@linuxfoundation.org writes:
> > > > >
> > > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > >> > I think having to pick and choose what device nodes you want in a
> > > > > >> > container is a good thing.  Besides, you would have to do the same
> > > > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > > > >> > decision here, especially as it knows exactly what it wants to do
> > > > > >> > much more so than the kernel ever can.
> > > > > >>
> > > > > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > > > > >> is that we simply want to allow a container to say "give me a loop
> > > > > >> device to use" and have it receive a unique loop device (or 3), without
> > > > > >> having to pre-assign them.  I think that would be cleaner to do using
> > > > > >> a pseudofs and loop-control device, rather than having to have a
> > > > > >> daemon in userspace on the host farming those out in response to
> > > > > >> some, I don't know, dbus request?
> > > > > >
> > > > > > I agree that loop devices would be nice to have in a container, and that
> > > > > > the existing loop interface doesn't really lend itself to that.  So
> > > > > > create a new type of thing that acts like a loop device in a container.
> > > > > > But don't try to mess with the whole driver core just for a single type
> > > > > > of device.
> > > > >
> > > > > Yes. Something like devpts (without the newinstance option).  Built to
> > > > > allow unprivileged users to create loopback devices.
> > > >
> > > > That's where I started, and I've got code, so I guess I'll clean it up
> > > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > > gets to do privileged block device ioctls, including reading partitions
> > >
> > > Sorry, where did that come from?  What Eric was referring to below is
> > > the fs superblock readers not being trusted.  Maybe I glossed over another
> > > email where it was mentioned?
> >
> > You must have. Take a look at [1].
> >
> > To repeat the point: the ioctl to reread partitions (along with several
> > other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> > change this to an ns_capable check without at minimum the block layer
> > knowing about the namespace associated with the block device. Ergo we
>
> Which only means those changes are necessary :)
>
> So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
> is interesting (and, depending on the implementation, acceptable).  That
> necessarily includes the minimal blockdev changes to support it.
>
> > can't reread partitions if this is done entirely within the loop driver
> > via a pseudo fs.
> >
> > [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191

Hm, yeah, I was confuddling two issues.  Nevertheless, for real block devices I
absolutely agree.  For loop devices I don't.  My answer to

> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

would be that the loop device is a convenience built atop the backing image,
and if the user had the rights to loop-attach the backing image, he can
just as well partition using write(2), so why artificially place this limit?

Nevertheless this is not really a debate worth having until we have a
blockdev fs mountable in a userns.

My main interest currently is with privileged containers.  I think we can
learn plenty from that for now.
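
To make the write(2) argument concrete, here is a small sketch of how a user who can write to a backing image can lay down an MBR partition table with nothing but ordinary file writes, no block device ioctl involved. The MBR layout (four 16-byte entries at offset 446, signature bytes 0x55 0xAA at offset 510) is the standard one; the function names and the default start sector and size are illustrative assumptions, not anything proposed in the thread.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446  # first of four 16-byte MBR entries
BOOT_SIGNATURE = b"\x55\xaa"  # stored at bytes 510-511


def build_mbr(start_lba, num_sectors, part_type=0x83):
    """Return a 512-byte MBR holding a single primary partition."""
    entry = struct.pack(
        "<B3sB3sII",
        0x00,             # status: not bootable
        b"\xfe\xff\xff",  # CHS start (placeholder; LBA fields are used)
        part_type,        # 0x83 = Linux filesystem
        b"\xfe\xff\xff",  # CHS end (placeholder)
        start_lba,        # first sector of the partition
        num_sectors,      # partition length in sectors
    )
    mbr = bytearray(SECTOR_SIZE)
    mbr[PARTITION_TABLE_OFFSET:PARTITION_TABLE_OFFSET + len(entry)] = entry
    mbr[510:512] = BOOT_SIGNATURE
    return bytes(mbr)


def partition_image(path, start_lba=2048, num_sectors=4096):
    """Overwrite the first sector of an existing image file with the MBR."""
    with open(path, "r+b") as image:
        image.write(build_mbr(start_lba, num_sectors))
```

What the unprivileged user cannot do after such a write is ask the kernel to rescan the table on an attached loop device, which is the BLKRRPART restriction being debated.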

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Michael H. Warfield

On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.

> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.

> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like.  What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

I like to think in terms of the "rootless" configurations where "root"
per se is not absolute and everything is framed in terms of
capabilities.

> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

Not sure I got that one.  A user with "more" privileges may or may not
have access dependent on the congruence of the privileges.  They're not
hierarchical.  If someone has that "priv" then they have access.  If they
do not, they do not.

> To get to the point where this is really and truly usable I expect to be
> technically daunting.

Most technically non-trivial problems generally are.

> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Concur.  It boils down to privilege management and access.  Absolutely
concur.

> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.

Ah...  Admittedly it's not as late as our conversation at LinuxPlumbers
last year in NOLA but...  Maybe late at night but I failed to parse the
above.

> Eric

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Eric W. Biederman

Seth Forshee  writes:

> What I set out for was feature parity between loop devices in a secure
> container and loop devices on the host. Since some operations currently
> check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> this is to push knowledge of the user namespace farther down into the
> driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> namespace associated with the device.
>
> That said, I suspect our current use cases can get by without these
> capabilities. Really though I suspect this is just deferring the
> discussion rather than settling it, and what we'll end up with is little
> more than a fancy way for userspace to ask the kernel to run mknod on
> its behalf.

A fancy way to ask the kernel to run mknod on its behalf is what
/dev/pts is.

When I suggested this I did not mean you should forgo making changes to
allow partitions and the like.  What I intended is that you should find a
way to make this safe for users who don't have root capabilities.

Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.

To get to the point where this is really and truly usable I expect to be
technically daunting.

Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.

Only when the question is can this kernel functionality which is
otherwise safe confuse a preexisting setuid application do namespace
or container bits significantly come into play.

Eric
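
As an illustration of the /dev/pts comparison above: an unprivileged process never calls mknod for a pseudo-terminal. It opens the pty multiplexer and the kernel makes the matching device node appear in devpts on its behalf. The sketch below uses Python's os.openpty() and os.ttyname(); the helper name and the None-on-failure convention are assumptions made for the example.

```python
import os


def allocate_pty_slave_name():
    """Allocate a pseudo-terminal and return the slave's device path.

    Returns None if no pty can be allocated (for example, if devpts is
    not mounted in the current environment).
    """
    try:
        master_fd, slave_fd = os.openpty()
    except OSError:
        return None
    try:
        # On Linux this is typically a path of the form /dev/pts/N.
        return os.ttyname(slave_fd)
    except OSError:
        return None
    finally:
        os.close(slave_fd)
        os.close(master_fd)
```

A loop pseudo filesystem along the lines discussed in this thread would presumably follow the same pattern, with a control node taking the place of the pty multiplexer.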

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Andy Lutomirski

On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
>
> Quoting Richard Weinberger (rich...@nod.at):
> > Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > >>  wrote:
> > >>> Then don't use a container to build such a thing, or fix the build
> > >>> scripts to not do that :)
> > >>
> > >> I second this.
> > >> To me it looks like some folks try to (ab)use Linux containers
> > >> for purposes where KVM would much better fit in.
> > >> Please don't put more complexity into containers. They are already
> > >> horrible complex
> > >> and error prone.
> > >
> > > I, naturally, disagree :)  The only use case which is inherently not
> > > valid for containers is running a kernel.  Practically speaking there
> > > are other things which likely will never be possible, but if someone
> > > offers a way to do something in containers, "you can't do that in
> > > containers" is not an apropos response.
> > >
> > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > originally proposed and rejected, resulting in the development of
> > > pid namespaces.  "We have to work out (x) first" can be valid (and
> > > I can think of examples here), assuming it's not just trying to hide
> > > behind a catch-22/chicken-egg problem.
> > >
> > > Finally, saying "containers are complex and error prone" is conflating
> > > several large suites of userspace code and many kernel features which
> > > support them.  Being more precise would, if the argument is valid,
> > > lend it a lot more weight.
> >
> > We (my company) use Linux containers since 2011 in production. First LXC, 
> > now libvirt-lxc.
> > To understand the internals better I also wrote my own userspace to 
> > create/start
> > containers. There are so many things which can hurt you badly.
> > With user namespaces we expose a really big attack surface to regular users.
> > I.e. Suddenly a user is allowed to mount filesystems.
>
> That is currently not the case.  They can mount some virtual filesystems
> and do bind mounts, but cannot mount most real filesystems.  This keeps
> us protected (for now) from potentially unsafe superblock readers in the
> kernel.
>
> > Ask Andy, he found already lots of nasty things...

I don't think I have anything brilliant to add to this discussion
right now, except possibly:

ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear.  That user doesn't need permission to mount it
or even necessarily to change its contents on the fly.

E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /?  Nothing good, I imagine.

So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.

--Andy
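
The UUID concern can be shown with a small sketch: if devices are selected by UUID, any UUID claimed by more than one device is ambiguous, and an untrusted device can take part in that ambiguity simply by existing. The function below only detects such collisions from a list of (device, UUID) pairs; the function name and the sample data in the usage note are hypothetical.

```python
from collections import defaultdict


def find_uuid_collisions(devices):
    """Group device names by UUID and return the ambiguous ones.

    `devices` is an iterable of (device_name, uuid) pairs; the result maps
    each UUID claimed by more than one device to the sorted device names.
    """
    by_uuid = defaultdict(list)
    for name, uuid in devices:
        by_uuid[uuid].append(name)
    return {u: sorted(names) for u, names in by_uuid.items() if len(names) > 1}
```

For example, find_uuid_collisions([("/dev/sda1", "aaaa"), ("/dev/loop0p1", "aaaa")]) reports both devices under "aaaa". Detection alone does not say which device to trust, which is why some way of marking devices as untrusted is being asked for.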

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Seth Forshee

On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > Greg Kroah-Hartman  writes:
> > > 
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > >> > I think having to pick and choose what device nodes you want in a
> > > >> > container is a good thing.  Besides, you would have to do the same
> > > >> > thing
> > > >> > in the kernel anyway, what's wrong with userspace making the decision
> > > >> > here, especially as it knows exactly what it wants to do much more so
> > > >> > than the kernel ever can.
> > > >> 
> > > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > > >> is that we simply want to allow a container to say "give me a loop
> > > >> device to use" and have it receive a unique loop device (or 3), without
> > > >> having to pre-assign them.  I think that would be cleaner to do using
> > > >> a pseudofs and loop-control device, rather than having to have a
> > > >> daemon in userspace on the host farming those out in response to
> > > >> some, I don't know, dbus request?
> > > >
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that.  So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > > 
> > > Yes. Something like devpts (without the newinstance option).  Built to
> > > allow unprivileged users to create loopback devices.
> > 
> > That's where I started, and I've got code, so I guess I'll clean it up
> > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > gets to do privileged block device ioctls, including reading partitions
> 
> Sorry, where did that come from?  What Eric was referring to below is
> the fs superblock readers not being trusted.  Maybe I glossed over another
> email where it was mentioned?

You must have. Take a look at [1].

To repeat the point: the ioctl to reread partitions (along with several
other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
change this to an ns_capable check without at minimum the block layer
knowing about the namespace associated with the block device. Ergo we
can't reread partitions if this is done entirely within the loop driver
via a pseudo fs.

[1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
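
The difference between the two checks can be modelled in a few lines. This is a toy model, not kernel code: it treats user namespaces as a tree and says a capability held in one namespace applies to that namespace and its descendants. It deliberately ignores the kernel's additional rule about the owner of a child namespace, and all class and function names here are illustrative.

```python
class UserNS:
    """Toy user namespace: a node in a tree rooted at the initial namespace."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def is_same_or_descendant_of(self, other):
        """Walk up the parent chain looking for `other`."""
        ns = self
        while ns is not None:
            if ns is other:
                return True
            ns = ns.parent
        return False


INIT_USER_NS = UserNS("init")


def ns_capable(task_ns, has_cap, target_ns):
    """Model of ns_capable(): is the capability held in, or above, target_ns?"""
    return has_cap and target_ns.is_same_or_descendant_of(task_ns)


def capable(task_ns, has_cap):
    """Model of capable(): the capability must hold in the initial namespace."""
    return ns_capable(task_ns, has_cap, INIT_USER_NS)
```

In this model, root in a container namespace fails capable() but passes ns_capable() against its own namespace. The catch is the target_ns argument: the block layer can only supply it if it knows which namespace the block device belongs to.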


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Andy Lutomirski

On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com> wrote:

> Quoting Richard Weinberger (rich...@nod.at):
> > On 15.05.2014 21:50, Serge Hallyn wrote:
> > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > > > On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> > > > <gre...@linuxfoundation.org> wrote:
> > > > > Then don't use a container to build such a thing, or fix the build
> > > > > scripts to not do that :)
> > > >
> > > > I second this.
> > > > To me it looks like some folks try to (ab)use Linux containers
> > > > for purposes where KVM would much better fit in.
> > > > Please don't put more complexity into containers. They are already
> > > > horribly complex and error prone.
> > >
> > > I, naturally, disagree :)  The only use case which is inherently not
> > > valid for containers is running a kernel.  Practically speaking there
> > > are other things which likely will never be possible, but if someone
> > > offers a way to do something in containers, "you can't do that in
> > > containers" is not an apropos response.
> > >
> > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > originally proposed and rejected, resulting in the development of
> > > pid namespaces.  "We have to work out (x) first" can be valid (and
> > > I can think of examples here), assuming it's not just trying to hide
> > > behind a catch-22/chicken-egg problem.
> > >
> > > Finally, saying "containers are complex and error prone" is conflating
> > > several large suites of userspace code and many kernel features which
> > > support them.  Being more precise would, if the argument is valid,
> > > lend it a lot more weight.
> >
> > We (my company) have used Linux containers in production since 2011.
> > First LXC, now libvirt-lxc.
> > To understand the internals better I also wrote my own userspace to
> > create/start containers. There are so many things which can hurt you
> > badly.
> > With user namespaces we expose a really big attack surface to regular
> > users. I.e. suddenly a user is allowed to mount filesystems.
>
> That is currently not the case.  They can mount some virtual filesystems
> and do bind mounts, but cannot mount most real filesystems.  This keeps
> us protected (for now) from potentially unsafe superblock readers in the
> kernel.
>
> > Ask Andy, he already found lots of nasty things...

I don't think I have anything brilliant to add to this discussion
right now, except possibly:

ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear.  That user doesn't need permission to mount it
or even necessarily to change its contents on the fly.

E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /?  Nothing good, I imagine.

So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.

--Andy
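
A host-side sanity check for the duplicate-UUID scenario could look like
the following Python sketch. The function name, the sample device names
and the UUIDs are made up for illustration; on a real system the mapping
would have to be collected from a tool such as blkid, which is not shown.

```python
from collections import defaultdict
from typing import Dict, List


def find_uuid_collisions(uuid_by_device: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {uuid: [devices...]} for every UUID claimed by more than one device."""
    devices_by_uuid = defaultdict(list)
    for device, uuid in uuid_by_device.items():
        # Compare case-insensitively, since UUIDs may be printed either way.
        devices_by_uuid[uuid.lower()].append(device)
    return {u: sorted(d) for u, d in devices_by_uuid.items() if len(d) > 1}


# Hypothetical inventory: a loop-backed partition reuses the UUID of sda1.
sample = {
    "/dev/sda1": "0a1b2c3d-0000-4000-8000-000000000001",
    "/dev/sda2": "0a1b2c3d-0000-4000-8000-000000000002",
    "/dev/loop0p1": "0A1B2C3D-0000-4000-8000-000000000001",
}

for uuid, devices in find_uuid_collisions(sample).items():
    print(f"{uuid} is claimed by: {', '.join(devices)}")
```

Detecting a collision is only half of the problem; the host would still
need a policy for which of the colliding devices it trusts.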

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Serge E. Hallyn

Quoting Seth Forshee (seth.fors...@canonical.com):
> On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > Greg Kroah-Hartman  writes:
> > 
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > >> > I think having to pick and choose what device nodes you want in a
> > >> > container is a good thing.  Besides, you would have to do the same
> > >> > thing in the kernel anyway, what's wrong with userspace making the
> > >> > decision here, especially as it knows exactly what it wants to do
> > >> > much more so than the kernel ever can.
> > >> 
> > >> For 'real' devices that sounds sensible.  The thing about loop devices
> > >> is that we simply want to allow a container to say "give me a loop
> > >> device to use" and have it receive a unique loop device (or 3), without
> > >> having to pre-assign them.  I think that would be cleaner to do using
> > >> a pseudofs and loop-control device, rather than having to have a
> > >> daemon in userspace on the host farming those out in response to
> > >> some, I don't know, dbus request?
> > >
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> > 
> > Yes. Something like devpts (without the newinstance option).  Built to
> > allow unprivileged users to create loopback devices.
> 
> That's where I started, and I've got code, so I guess I'll clean it up
> and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> gets to do privileged block device ioctls, including reading partitions

Sorry, where did that come from?  What Eric was referring to below is
the fs superblock readers not being trusted.  Maybe I glossed over another
email where it was mentioned?

> on a block device which has been assigned to a container, then I guess
> that approach works well enough.
> 
> > There is still a huge kettle of fish involved in verifying that a
> > filesystem is safe from a hostile user that has access to the block
> > device while the filesystem is mounted.
> > 
> > Having a few filesystems that are robust enough to trust with arbitrary
> > filesystem corruption would be very interesting.
> > 
> > I assume unprivileged and hostile users because if you trusted the real
> > root inside of your container this would not be an issue.
> > 
> > Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Seth Forshee

On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> Greg Kroah-Hartman  writes:
> 
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> >> > I think having to pick and choose what device nodes you want in a
> >> > container is a good thing.  Besides, you would have to do the same thing
> >> > in the kernel anyway, what's wrong with userspace making the decision
> >> > here, especially as it knows exactly what it wants to do much more so
> >> > than the kernel ever can.
> >> 
> >> For 'real' devices that sounds sensible.  The thing about loop devices
> >> is that we simply want to allow a container to say "give me a loop
> >> device to use" and have it receive a unique loop device (or 3), without
> >> having to pre-assign them.  I think that would be cleaner to do using
> >> a pseudofs and loop-control device, rather than having to have a
> >> daemon in userspace on the host farming those out in response to
> >> some, I don't know, dbus request?
> >
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
> 
> Yes. Something like devpts (without the newinstance option).  Built to
> allow unprivileged users to create loopback devices.

That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
on a block device which has been assigned to a container, then I guess
that approach works well enough.

> There is still a huge kettle of fish involved in verifying that a
> filesystem is safe from a hostile user that has access to the block
> device while the filesystem is mounted.
> 
> Having a few filesystems that are robust enough to trust with arbitrary
> filesystem corruption would be very interesting.
> 
> I assume unprivileged and hostile users because if you trusted the real
> root inside of your container this would not be an issue.
> 
> Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Michael H. Warfield

On Thu, 2014-05-15 at 21:35 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > I think having to pick and choose what device nodes you want in a
> > > container is a good thing.  Besides, you would have to do the same thing
> > > in the kernel anyway, what's wrong with userspace making the decision
> > > here, especially as it knows exactly what it wants to do much more so
> > > than the kernel ever can.
> > 
> > For 'real' devices that sounds sensible.  The thing about loop devices
> > is that we simply want to allow a container to say "give me a loop
> > device to use" and have it receive a unique loop device (or 3), without
> > having to pre-assign them.  I think that would be cleaner to do using
> > a pseudofs and loop-control device, rather than having to have a
> > daemon in userspace on the host farming those out in response to
> > some, I don't know, dbus request?

> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yeah, a lot of dynamic devices (like serial devices) can be handled in
user space with the proviso that we could use some way to tickle udev
and hotplug in the container with events.

But the loop device is the real ugly duckling here.  It's a unique case
of an on-demand device with a shared control device that's not really
hot-plug and not really deterministic enough to be handled purely in
user space.  It presents unique challenges unto itself.

Makes sense to me.

> greg k-h

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!
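
For readers unfamiliar with the interface under discussion: a loop device
is requested on demand through the shared /dev/loop-control node. The
Python sketch below assumes a Linux host and enough privilege to open
that node; the helper names are illustrative, not part of any patch in
this thread.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # ioctl number from <linux/loop.h>


def get_free_loop_index(control_path: str = "/dev/loop-control") -> int:
    """Ask the loop driver for the index of an unused loop device."""
    fd = os.open(control_path, os.O_RDWR)
    try:
        # The index is returned as the ioctl's return value.
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)


def loop_device_path(index: int) -> str:
    """Conventional device node name for a given loop index."""
    return f"/dev/loop{index}"


if __name__ == "__main__":
    try:
        print(loop_device_path(get_free_loop_index()))
    except OSError as exc:
        # Typical for unprivileged callers or hosts without the loop driver.
        print(f"could not allocate a loop device: {exc}")
```

Because every caller goes through the same control node and the same
index space, nothing in this interface ties an allocated device to the
container that asked for it.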




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-17 Thread Serge E. Hallyn

Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > > I think having to pick and choose what device nodes you want in a
> > > > > > container is a good thing.  Besides, you would have to do the same
> > > > > > thing in the kernel anyway, what's wrong with userspace making the
> > > > > > decision here, especially as it knows exactly what it wants to do
> > > > > > much more so than the kernel ever can.
> > > > >
> > > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > > is that we simply want to allow a container to say "give me a loop
> > > > > device to use" and have it receive a unique loop device (or 3),
> > > > > without having to pre-assign them.  I think that would be cleaner to
> > > > > do using a pseudofs and loop-control device, rather than having to
> > > > > have a daemon in userspace on the host farming those out in response
> > > > > to some, I don't know, dbus request?
> > > >
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that.  So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > >
> > > No matter what I don't think we get out of this without driver core
> > > changes, whether this was done in loop or by creating something new.
> > > Not unless the whole thing is punted to userspace, anyway.
> > >
> > > The first problem is that many block device ioctls check for
> > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > > not really sure. But loop does at minimum support partitions, and to get
> > > that functionality in an unprivileged container at least the block layer
> > > needs to know the namespace which has privileges for that device.
> >
> > That's fine, you should have those permissions in a container if you
> > want to do something like that on a loop device, right?
>
> Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
> Any user possessing CAP_SYS_ADMIN can do about as much damage as real
> root can, whether or not you use user namespaces, so it would compromise
> a lot of the security we're just bringing to containers.
>
> > > The second is that all block devices automatically appear in devtmpfs.
> > > The scenario I'm concerned about is that the host could unknowingly use
> > > a loop device exposed to a container, then the container could see data
> > > from the host.
> >
> > I don't think that's a real issue, the host should know not to do that.
> >
> > > So we either need a flag to tell the driver core not to create a node
> > > in devtmpfs, or we need a privileged manager in userspace to remove
> > > them (which kind of defeats the purpose). And it gets more complicated
> > > when partition block devs are mixed in, because they can be created
> > > without involvement from the driver - they would need to inherit the
> > > "no devtmpfs node" property from their parent, and if the driver uses
> > > a pseudo fs to create device nodes for userspace then it needs to be
> > > informed about the partitions too so it can create those nodes.
> >
> > I don't think that will be needed.  Root in a host can do whatever it
> > wants in the containers, so mixing up block devices is the least of the
> > issues involved :)
> >
> > > So maybe we could get by without the privileged ioctls, as long as it
> > > was understood that unprivileged containers can't do partitioning. But I
> > > do think the devtmpfs problem would need to be addressed.
> >
> > I don't think unprivileged containers should be able to do partitioning.
> > An unprivileged user can't do that, so why should a container be any
> > different?
>
> To make sure we're on the same page with terminology, there's an
> unprivileged container and a secure container.  In the former, there's

Hm, that terminology (which isn't what we've been using) could be
useful, but is still not quite precise enough if we're going down
that road.

> no root user (all the processes run as non-root), so the container isn't

"there is no root user" and "all processes run as non-root" are not the
same thing.  Is it just that no processes are running as root?  Or that
uid 0 in the container is not mapped at all and hence not achievable?

The former really isn't a function of the container itself, and depends
on there really not being any setuid-root or capability-wielding files
available in the container.

If the latter, and you're hoping to claim that the host is saved from
the container exercising kernel code which falls under 'if
(ns_capable(X))', then you're still just one unprivileged
clone(CLONE_NEWUSER) and mapping of nested uid 0 to any actually validly
mapped container uid away from hitting that kernel code.  Your container

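The "uid 0 is not mapped at all" case can be made concrete with the
format of /proc/<pid>/uid_map, where each line holds an id inside the
namespace, the corresponding id outside it, and a range length. The
Python sketch below uses illustrative helper names and invented sample
mappings; it only interprets the text format and does not query a real
container.

```python
from typing import List, Optional, Tuple

# One entry per uid_map line: (inside_start, outside_start, length).
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse the contents of a uid_map file into a list of ranges."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return entries


def map_to_outside(uid: int, uid_map: UidMap) -> Optional[int]:
    """Translate a uid inside the namespace; None means it is unmapped."""
    for inside, outside, length in uid_map:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None


# Hypothetical container where uid 0 exists and maps to an unprivileged id.
with_root = parse_uid_map("0 100000 65536\n")
# Hypothetical container where only uid 1000 is mapped, so uid 0 is absent.
without_root = parse_uid_map("1000 101000 1\n")

print(map_to_outside(0, with_root))
print(map_to_outside(0, without_root))
```

As the message above points out, an unmapped uid 0 in one namespace does
not prevent a process from creating a nested user namespace in which it
maps uid 0 onto a uid that is valid in the container.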
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Eric W. Biederman

Greg Kroah-Hartman  writes:

> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
>> > I think having to pick and choose what device nodes you want in a
> >> > container is a good thing.  Besides, you would have to do the same thing
>> > in the kernel anyway, what's wrong with userspace making the decision
>> > here, especially as it knows exactly what it wants to do much more so
>> > than the kernel ever can.
>> 
>> For 'real' devices that sounds sensible.  The thing about loop devices
>> is that we simply want to allow a container to say "give me a loop
>> device to use" and have it receive a unique loop device (or 3), without
>> having to pre-assign them.  I think that would be cleaner to do using
>> a pseudofs and loop-control device, rather than having to have a
>> daemon in userspace on the host farming those out in response to
>> some, I don't know, dbus request?
>
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yes. Something like devpts (without the newinstance option).  Built to
allow unprivileged users to create loopback devices.

There is still a huge kettle of fish involved in verifying that a
filesystem is safe from a hostile user that has access to the block
device while the filesystem is mounted.

Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.

I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Fri, May 16, 2014 at 12:28:35PM -0700, James Bottomley wrote:
> On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > > I think having to pick and choose what device nodes you want in a
> > > > > > container is a good thing.  Besides, you would have to do the same thing
> > > > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > > > here, especially as it knows exactly what it wants to do much more so
> > > > > > than the kernel ever can.
> > > > > 
> > > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > > is that we simply want to allow a container to say "give me a loop
> > > > > > device to use" and have it receive a unique loop device (or 3), without
> > > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > > a pseudofs and loop-control device, rather than having to have a
> > > > > daemon in userspace on the host farming those out in response to
> > > > > some, I don't know, dbus request?
> > > > 
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that.  So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > > 
> > > No matter what I don't think we get out of this without driver core
> > > changes, whether this was done in loop or by creating something new.
> > > Not unless the whole thing is punted to userspace, anyway.
> > > 
> > > The first problem is that many block device ioctls check for
> > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > > not really sure. But loop does at minimum support partitions, and to get
> > > that functionality in an unprivileged container at least the block layer
> > > needs to know the namespace which has privileges for that device.
> > 
> > That's fine, you should have those permissions in a container if you
> > want to do something like that on a loop device, right?
> 
> Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
> Any user possessing CAP_SYS_ADMIN can do about as much damage as real
> root can, whether or not you use user namespaces, so it would compromise
> a lot of the security we're just bringing to containers.
> 
> > > The second is that all block devices automatically appear in devtmpfs.
> > > The scenario I'm concerned about is that the host could unknowingly use
> > > a loop device exposed to a container, then the container could see data
> > > from the host.
> > 
> > I don't think that's a real issue, the host should know not to do that.
> > 
> > > So we either need a flag to tell the driver core not to create a node
> > > in devtmpfs, or we need a privileged manager in userspace to remove
> > > them (which kind of defeats the purpose). And it gets more complicated
> > > when partition block devs are mixed in, because they can be created
> > > without involvement from the driver - they would need to inherit the
> > > "no devtmpfs node" property from their parent, and if the driver uses
> > > > a pseudo fs to create device nodes for userspace then it needs to be
> > > informed about the partitions too so it can create those nodes.
> > 
> > I don't think that will be needed.  Root in a host can do whatever it
> > wants in the containers, so mixing up block devices is the least of the
> > issues involved :)
> > 
> > > So maybe we could get by without the privileged ioctls, as long as it
> > > was understood that unprivileged containers can't do partitioning. But I
> > > do think the devtmpfs problem would need to be addressed.
> > 
> > > I don't think unprivileged containers should be able to do partitioning.
> > > An unprivileged user can't do that, so why should a container be any
> > different?
> 
> To make sure we're on the same page with terminology, there's an
> unprivileged container and a secure container.  In the former, there's
> no root user (all the processes run as non-root), so the container isn't
> expected to perform any actions root would ... that's easy.  In a secure
> container, root is mapped to a nobody user in the host, so is
> effectively unprivileged, but root in the container expects to look like
> a real root within the VPS (and thus may expect to partition things,
> depending on how they've been given access to the block device).  The
> big problem is giving back capabilities to the container root such that
> a) it loses them if it escapes the container and b) it doesn't get
> sufficient capabilities to damage the system.

Based on your description what I was talking about is a secure

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Michael H. Warfield

On Fri, 2014-05-16 at 12:20 -0700, James Bottomley wrote:
> On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote:
> > On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> > > > PS - Apparently both parallels and Michael independently
> > > > project devices which are hot-plugged on the host into containers.
> > > > That also seems like something worth talking about (best practices,
> > > > shortcomings, use cases not met by it, any ways that the kernel can
> > > > help out) at ksummit/linuxcon.
> > 
> > > I was told that containers would never want devices hotplugged into
> > > them.
> > 
> > Interesting.  You were told they (who they?) would never want them?  Who
> > said that?  I would have never thought that given that other
> > implementations can provide that.  I would certainly want them.  Seems
> > strange to explicitly relegate LXC containers to being second class
> > citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.

> That would probably be me.  Running hotplug inside a container is a
> security problem and, since containers are easily entered by the host,
> it's very easy to listen for the hotplug in the host and inject it into
> the container using nsenter.

In all virtualization...  The host, particularly root on the host,
exists as deus ex machina, the "god outside the machine".  They are at
my mercy.  Even hardware virtualization can not protect you from the
host.  You wanna hear some frightening talks on virtualization, catch
Joanna (miss little blue pill) Rutkowska some time.  I'm particularly
interested in her takes on the "anti evil-maid attacks" and I sat in on
her talks on the "north bridge" and "south bridge" malware evasion
techniques.  She's a good speaker who makes powerful points that make
you sweat but is pleasant in face to face conversation.  I've played
with her Qubes distribution a couple of times and the way it works with
the TPM to ensure a secure boot is interesting.  But that's a completely
different topic on trusted computing.

OTOH, there are plenty of other things to worry about in all forms of
virtualization.  At Internet Security Systems, where I was a founder,
fellow, and "X-Force Senior Wizard", we were looking at the ability to
leak information through the USB subsystem.  No isolation is perfect,
especially when you have USB enabled.

But that's my turf.

> I don't think the intention is to label anyone's implementation as
> preferred.  What this shows, I think, is that we all have different
> practises when it comes to setting up containers.  Some are necessary
> because our containers are different.  Some could do with serious
> examination to see if there's really a best way to do the action which
> we would then all use.

And I hope to contribute to the discussion of said actions.

> > I might believe you were never told they would need them, but that's a
> > totally different sense.  Are we going to tell RedHat and the Docker
> > people that LXC is an inferior technology that is complex and unreliable
> > (to quote another poster) compared to these others?  They're saying this
> > will be enterprise technology.  If I go to Amazon AWS or other VPS
> > services and compare, are we not going to stand on a level playing
> > field?  Admittedly, I don't expect Amazon AWS to provide me with serial
> > consoles, but I do expect to be able to mount file system images within
> > my VPS.

> Well, that's another nasty, isn't it.  We all have different ways of
> coping with mount in the container.  I think at plumbers we need to sit
> down with some of this plumbing and work out which pipes carry the same
> fluids and whether we could unify them.

Concur

> As an aside (probably requiring a new thread) we were wondering about
> some type of notifier on the mount call that we could vector into the
> host to perform the action.  The main issue for us is mount of procfs,
> which really needs to be a bind mount in a container.  All of this led
> me to speculate that we could use some type of syscall notifier
> mechanism to manage capabilities in the host and even intercept and
> complete the syscall action within the host rather than having to keep
> evolving more and more complex kernel drivers to do this.

Interesting.  That could be very useful.  That might even help with the
loop device case where the mounts have to go through loop devices for
things like file system images and builds.  Very interesting...

> James

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread James Bottomley

On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > I think having to pick and choose what device nodes you want in a
> > > > > container is a good thing.  Besides, you would have to do the same thing
> > > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > > here, especially as it knows exactly what it wants to do much more so
> > > > > than the kernel ever can.
> > > > 
> > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > is that we simply want to allow a container to say "give me a loop
> > > > device to use" and have it receive a unique loop device (or 3), without
> > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > a pseudofs and loop-control device, rather than having to have a
> > > > daemon in userspace on the host farming those out in response to
> > > > some, I don't know, dbus request?
> > > 
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> > 
> > No matter what I don't think we get out of this without driver core
> > changes, whether this was done in loop or by creating something new.
> > Not unless the whole thing is punted to userspace, anyway.
> > 
> > The first problem is that many block device ioctls check for
> > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > not really sure. But loop does at minimum support partitions, and to get
> > that functionality in an unprivileged container at least the block layer
> > needs to know the namespace which has privileges for that device.
> 
> That's fine, you should have those permissions in a container if you
> want to do something like that on a loop device, right?

Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
Any user possessing CAP_SYS_ADMIN can do about as much damage as real
root can, whether or not you use user namespaces, so it would compromise
a lot of the security we're just bringing to containers.

> > The second is that all block devices automatically appear in devtmpfs.
> > The scenario I'm concerned about is that the host could unknowingly use
> > a loop device exposed to a container, then the container could see data
> > from the host.
> 
> I don't think that's a real issue, the host should know not to do that.
> 
> > So we either need a flag to tell the driver core not to create a node
> > in devtmpfs, or we need a privileged manager in userspace to remove
> > them (which kind of defeats the purpose). And it gets more complicated
> > when partition block devs are mixed in, because they can be created
> > without involvement from the driver - they would need to inherit the
> > "no devtmpfs node" property from their parent, and if the driver uses
> > a pseudo fs to create device nodes for userspace then it needs to be
> > informed about the partitions too so it can create those nodes.
> 
> I don't think that will be needed.  Root in a host can do whatever it
> wants in the containers, so mixing up block devices is the least of the
> issues involved :)
> 
> > So maybe we could get by without the privileged ioctls, as long as it
> > was understood that unprivileged containers can't do partitioning. But I
> > do think the devtmpfs problem would need to be addressed.
> 
> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

To make sure we're on the same page with terminology, there's an
unprivileged container and a secure container.  In the former, there's
no root user (all the processes run as non-root), so the container isn't
expected to perform any actions root would ... that's easy.  In a secure
container, root is mapped to a nobody user in the host, so is
effectively unprivileged, but root in the container expects to look like
a real root within the VPS (and thus may expect to partition things,
depending on how they've been given access to the block device).  The
big problem is giving back capabilities to the container root such that
a) it loses them if it escapes the container and b) it doesn't get
sufficient capabilities to damage the system.
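
To illustrate the "root is mapped to a nobody user" part: a user namespace carries an ID map with the three columns found in /proc/<pid>/uid_map (first ID inside the namespace, first ID outside, length of the range). The Python sketch below translates a container ID to a host ID under such a map. The range "0 100000 65536" is only an assumed example, and the function names are made up for illustration.

```python
def parse_id_map(text):
    """Parse uid_map-style text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def map_to_host(container_id, id_map):
    """Translate an ID inside the namespace to the ID seen by the host.

    Returns None when the ID falls outside every mapped range, i.e. it
    has no meaning on the host.
    """
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Assumed example mapping: container IDs 0..65535 -> host IDs 100000..165535.
example_map = parse_id_map("0 100000 65536\n")
host_uid_of_container_root = map_to_host(0, example_map)
```

With that mapping, root in the container is host uid 100000, an ordinary unprivileged ID, which is why any capability it is handed back has to be scoped to the container.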

James



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread James Bottomley

On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote:
> On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> > > PS - Apparently both parallels and Michael independently
> > > project devices which are hot-plugged on the host into containers.
> > > That also seems like something worth talking about (best practices,
> > > shortcomings, use cases not met by it, any ways that the kernel can
> > > help out) at ksummit/linuxcon.
> 
> > I was told that containers would never want devices hotplugged into
> > them.
> 
> Interesting.  You were told they (who they?) would never want them?  Who
> said that?  I would have never thought that given that other
> implementations can provide that.  I would certainly want them.  Seems
> strange to explicitly relegate LXC containers to being second class
> citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.

That would probably be me.  Running hotplug inside a container is a
security problem and, since containers are easily entered by the host,
it's very easy to listen for the hotplug in the host and inject it into
the container using nsenter.
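
As a rough sketch of the injection step, the host-side listener could build an nsenter command that joins the container's mount namespace and creates the node there. The Python helper below only constructs the argument list. It assumes the caller is root on the host and that the container's /dev is writable; the function name and the default mode are made up for the example.

```python
def nsenter_mknod_argv(container_pid, dev_path, dev_type, major, minor,
                       mode="0660"):
    """Build the argv for creating a device node inside a container.

    container_pid is any process already running in the container;
    nsenter joins that process's mount namespace and runs mknod there.
    dev_type is "b" for a block device or "c" for a character device.
    """
    if dev_type not in ("b", "c"):
        raise ValueError("dev_type must be 'b' (block) or 'c' (character)")
    return [
        "nsenter", "--target", str(container_pid), "--mount", "--",
        "mknod", "-m", mode, dev_path, dev_type, str(major), str(minor),
    ]


# Example: expose loop0 (block major 7, minor 0) to the container whose
# init process has the assumed PID 1234.
example_argv = nsenter_mknod_argv(1234, "/dev/loop0", "b", 7, 0)
```

The resulting list can be handed to subprocess.run(argv, check=True) from whatever is listening for the hotplug events on the host.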

I don't think the intention is to label anyone's implementation as
preferred.  What this shows, I think, is that we all have different
practises when it comes to setting up containers.  Some are necessary
because our containers are different.  Some could do with serious
examination to see if there's really a best way to do the action which
we would then all use.

> I might believe you were never told they would need them, but that's a
> totally different sense.  Are we going to tell RedHat and the Docker
> people that LXC is an inferior technology that is complex and unreliable
> (to quote another poster) compared to these others?  They're saying this
> will be enterprise technology.  If I go to Amazon AWS or other VPS
> services and compare, are we not going to stand on a level playing
> field?  Admittedly, I don't expect Amazon AWS to provide me with serial
> consoles, but I do expect to be able to mount file system images within
> my VPS.

Well, that's another nasty, isn't it.  We all have different ways of
coping with mount in the container.  I think at plumbers we need to sit
down with some of this plumbing and work out which pipes carry the same
fluids and whether we could unify them.

As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action.  The main issue for us is mount of procfs,
which really needs to be a bind mount in a container.  All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.

James


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Greg Kroah-Hartman

On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > I think having to pick and choose what device nodes you want in a
> > > > container is a good thing.  Besides, you would have to do the same thing
> > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > here, especially as it knows exactly what it wants to do much more so
> > > > than the kernel ever can.
> > > 
> > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > is that we simply want to allow a container to say "give me a loop
> > > device to use" and have it receive a unique loop device (or 3), without
> > > having to pre-assign them.  I think that would be cleaner to do using
> > > a pseudofs and loop-control device, rather than having to have a
> > > daemon in userspace on the host farming those out in response to
> > > some, I don't know, dbus request?
> > 
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
> 
> No matter what I don't think we get out of this without driver core
> changes, whether this was done in loop or by creating something new.
> Not unless the whole thing is punted to userspace, anyway.
> 
> The first problem is that many block device ioctls check for
> CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> not really sure. But loop does at minimum support partitions, and to get
> that functionality in an unprivileged container at least the block layer
> needs to know the namespace which has privileges for that device.

That's fine, you should have those permissions in a container if you
want to do something like that on a loop device, right?

> The second is that all block devices automatically appear in devtmpfs.
> The scenario I'm concerned about is that the host could unknowingly use
> a loop device exposed to a container, then the container could see data
> from the host.

I don't think that's a real issue, the host should know not to do that.

> So we either need a flag to tell the driver core not to create a node
> in devtmpfs, or we need a privileged manager in userspace to remove
> them (which kind of defeats the purpose). And it gets more complicated
> when partition block devs are mixed in, because they can be created
> without involvement from the driver - they would need to inherit the
> "no devtmpfs node" property from their parent, and if the driver uses
> a pseudo fs to create device nodes for userspace then it needs to be
> informed about the partitions too so it can create those nodes.

I don't think that will be needed.  Root in a host can do whatever it
wants in the containers, so mixing up block devices is the least of the
issues involved :)

> So maybe we could get by without the privileged ioctls, as long as it
> was understood that unprivileged containers can't do partitioning. But I
> do think the devtmpfs problem would need to be addressed.

I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Fri, May 16, 2014 at 11:28:28AM -0400, Michael H. Warfield wrote:
> On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote:
> > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > I think having to pick and choose what device nodes you want in a
> > > > > container is a good thing.  Besides, you would have to do the same thing
> > > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > > here, especially as it knows exactly what it wants to do much more so
> > > > > than the kernel ever can.
> > > > 
> > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > is that we simply want to allow a container to say "give me a loop
> > > > device to use" and have it receive a unique loop device (or 3), without
> > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > a pseudofs and loop-control device, rather than having to have a
> > > > daemon in userspace on the host farming those out in response to
> > > > some, I don't know, dbus request?
> > > 
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> 
> > No matter what I don't think we get out of this without driver core
> > changes, whether this was done in loop or by creating something new.
> > Not unless the whole thing is punted to userspace, anyway.
> 
> > The first problem is that many block device ioctls check for
> > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > not really sure. But loop does at minimum support partitions, and to get
> > that functionality in an unprivileged container at least the block layer
> > needs to know the namespace which has privileges for that device.
> 
> Whoa!  Time out...  Sorry, this will be an off topic aside.
> 
> Loop devices support partitions?  I'd love to know how that works.  I've
> tried several times in the past to do that but it's failed every time.
> I haven't been able to find any how-to in the past.  This article was
> just a couple of years ago (after the last time I tried this):
> 
> http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/
> 
> This guy didn't use partitions directly but used the offset to the
> mount, which is what I had to use.  Everything I found always referred
> to using mount offsets in order to mount partitions within a loop
> device.

It's controlled by the loop.max_part module parameter. It defaults to 0,
which means no partition support. For any value > 0 max_part will be the
maximum available partition number, after rounding it up to the nearest
power of 2 minus 1 (so max_part=5 gives you partition numbers up to 7,
max_part=8 gives you up to 15, etc).
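
A quick sketch of that rounding in Python, assuming the rule means the smallest 2^k - 1 that is not below the requested value (the function names are made up for illustration). Counting the whole-disk device as well, each loop device then occupies 8 minor numbers for max_part=5 and 16 for max_part=8.

```python
def effective_max_part(max_part):
    """Round a requested max_part up to the next value of the form 2**k - 1.

    0 (or less) means partition support is disabled.
    """
    if max_part <= 0:
        return 0
    # bit_length() is the position of the highest set bit, so
    # 5 (0b101) -> 3 and 8 (0b1000) -> 4.
    part_shift = max_part.bit_length()
    return (1 << part_shift) - 1


def minors_per_device(max_part):
    """Minor numbers used per loop device: the partitions plus the disk."""
    return effective_max_part(max_part) + 1


rounded = [(n, effective_max_part(n)) for n in (0, 1, 5, 7, 8)]
# rounded == [(0, 0), (1, 1), (5, 7), (7, 7), (8, 15)]
```

The parameter is given at module load time, for example as max_part=8 on the modprobe command line for the loop module, or as loop.max_part=8 on the kernel command line when the driver is built in.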

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Michael H. Warfield

On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote:
> On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > I think having to pick and choose what device nodes you want in a
> > > > container is a good thing.  Besides, you would have to do the same thing
> > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > here, especially as it knows exactly what it wants to do much more so
> > > > than the kernel ever can.
> > > 
> > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > is that we simply want to allow a container to say "give me a loop
> > > device to use" and have it receive a unique loop device (or 3), without
> > > having to pre-assign them.  I think that would be cleaner to do using
> > > a pseudofs and loop-control device, rather than having to have a
> > > daemon in userspace on the host farming those out in response to
> > > some, I don't know, dbus request?
> > 
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.

> No matter what I don't think we get out of this without driver core
> changes, whether this was done in loop or by creating something new.
> Not unless the whole thing is punted to userspace, anyway.

> The first problem is that many block device ioctls check for
> CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> not really sure. But loop does at minimum support partitions, and to get
> that functionality in an unprivileged container at least the block layer
> needs to know the namespace which has privileges for that device.

Whoa!  Time out...  Sorry, this will be an off topic aside.

Loop devices support partitions?  I'd love to know how that works.  I've
tried several times in the past to do that but it's failed every time.
I haven't been able to find any how-to in the past.  This article was
just a couple of years ago (after the last time I tried this):

http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/

This guy didn't use partitions directly but used the offset to the
mount, which is what I had to use.  Everything I found always referred
to using mount offsets in order to mount partitions within a loop
device.
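
For reference, the offset approach comes down to multiplying the partition's start sector (as printed by fdisk -l on the image) by the sector size and passing the result to mount. A small Python sketch, assuming 512-byte sectors; the function names, the image path and the start sector 2048 are only examples:

```python
SECTOR_SIZE = 512  # assumed logical sector size of the image


def partition_offset(start_sector, sector_size=SECTOR_SIZE):
    """Byte offset of a partition that starts at start_sector."""
    return start_sector * sector_size


def loop_mount_argv(image, mountpoint, start_sector, sector_size=SECTOR_SIZE):
    """Build a mount command that loop-mounts one partition of an image."""
    offset = partition_offset(start_sector, sector_size)
    return ["mount", "-o", "loop,offset=%d" % offset, image, mountpoint]


# Example: a partition starting at sector 2048 of a hypothetical disk.img.
example_argv = loop_mount_argv("disk.img", "/mnt", 2048)
```

For that example the offset works out to 2048 * 512 = 1048576 bytes.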

Regards,
Mike

> The second is that all block devices automatically appear in devtmpfs.
> The scenario I'm concerned about is that the host could unknowingly use
> a loop device exposed to a container, then the container could see data
> from the host. So we either need a flag to tell the driver core not to
> create a node in devtmpfs, or we need a privileged manager in userspace
> to remove them (which kind of defeats the purpose). And it gets more
> complicated when partition block devs are mixed in, because they can be
> created without involvement from the driver - they would need to inherit
> the "no devtmpfs node" property from their parent, and if the driver
> uses a pseudo fs to create device nodes for userspace then it needs to
> be informed about the partitions too so it can create those nodes.
> 
> So maybe we could get by without the privileged ioctls, as long as it
> was understood that unprivileged containers can't do partitioning. But I
> do think the devtmpfs problem would need to be addressed.
> 
> Thanks,
> Seth

-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > I think having to pick and choose what device nodes you want in a
> > > container is a good thing.  Besides, you would have to do the same thing
> > > in the kernel anyway, what's wrong with userspace making the decision
> > > here, especially as it knows exactly what it wants to do much more so
> > > than the kernel ever can.
> > 
> > For 'real' devices that sounds sensible.  The thing about loop devices
> > is that we simply want to allow a container to say "give me a loop
> > device to use" and have it receive a unique loop device (or 3), without
> > having to pre-assign them.  I think that would be cleaner to do using
> > a pseudofs and loop-control device, rather than having to have a
> > daemon in userspace on the host farming those out in response to
> > some, I don't know, dbus request?
> 
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.

The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.
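
To make the first problem concrete, here is a toy Python model of the difference between a global capability check and one made against the namespace that owns a device. It is only a sketch of the idea, not code from the patch set or the kernel: the class and function names are made up, and it leaves out the rule that the creator of a namespace holds all capabilities in it.

```python
class UserNS:
    """A user namespace with a link to its parent (None for the initial ns)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Task:
    """A process: the namespace it lives in and the capabilities it holds there."""

    def __init__(self, user_ns, caps):
        self.user_ns = user_ns
        self.caps = frozenset(caps)


def ns_capable(task, target_ns, cap):
    """True if task holds cap over target_ns.

    That is the case when the task holds cap in its own namespace and that
    namespace is target_ns or one of its ancestors.
    """
    ns = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.caps
        ns = ns.parent
    return False


init_ns = UserNS("init")
container_ns = UserNS("container", parent=init_ns)

host_root = Task(init_ns, {"CAP_SYS_ADMIN"})
container_root = Task(container_ns, {"CAP_SYS_ADMIN"})

# A check against the initial namespace (what the ioctls do today) fails
# for container root, while a check against the namespace owning the
# device would pass.
global_check = ns_capable(container_root, init_ns, "CAP_SYS_ADMIN")
owner_check = ns_capable(container_root, container_ns, "CAP_SYS_ADMIN")
```

For the owner-based check to be possible at all, the block layer has to record which namespace a given device belongs to.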

The second is that all block devices automatically appear in devtmpfs.
The scenario I'm concerned about is that the host could unknowingly use
a loop device exposed to a container, then the container could see data
from the host. So we either need a flag to tell the driver core not to
create a node in devtmpfs, or we need a privileged manager in userspace
to remove them (which kind of defeats the purpose). And it gets more
complicated when partition block devs are mixed in, because they can be
created without involvement from the driver - they would need to inherit
the "no devtmpfs node" property from their parent, and if the driver
uses a pseudo fs to create device nodes for userspace then it needs to
be informed about the partitions too so it can create those nodes.

So maybe we could get by without the privileged ioctls, as long as it
was understood that unprivileged containers can't do partitioning. But I
do think the devtmpfs problem would need to be addressed.

Thanks,
Seth

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Richard Weinberger

On Fri, May 16, 2014 at 3:42 AM, Michael H. Warfield  wrote:
> On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
>> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
>> > What exactly defines '"normal" use case for a container'?
>
>> Well, I'd say "acting like a virtual machine" is a good start :)
>
> Ok...  And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
> USB devices.  I use the USB hotplug with VirtualBox.  I plug a
> configured USB device in and the VirtualBox VM grabs it.  Virtual
> machines have loopback devices.  I've used them and using them in
> containers is significantly more efficient.  VirtualBox has remote audio
> and a host of other device features.
>
> Now we have some agreement.  Normal is "acting like a virtual machine".
> That's a goal I can agree with.  I want to work toward that goal of
> containers "acting like a virtual machine" just running on a common
> kernel with the host.  It's a challenge.  We're getting there.
>
>> > Not too long ago much of what we can now do with network namespaces
>> > was not a normal container use case.  Neither "you can't do it now"
>> > nor "I don't use it like that" should be grounds for a pre-emptive
>> > nack.  "It will horribly break security assumptions" certainly would
>> > be.
>
>> I agree, and maybe we will get there over time, but this patch is not
>> the way to do that.
>
> Ok...  We have a goal.  Now we can haggle over the details (to
> paraphrase a joke that's as old as I am).
>
>> > That's not to say there might not be good reasons why this in particular
>> > is not appropriate, but ISTM if things are going to be nacked without
>> > consideration of the patchset itself, we ought to be having a ksummit
>> > session to come to a consensus [ or receive a decree, presumably by you :)
>> > but after we have a chance to make our case ] on what things are going to
>> > be un/acceptable.
>
>> I already stood up and publicly said this last year at Plumbers, why
>> is anything now different?
>
> Not much really.  The reality is that more and more people are trying to
> use hotplug devices, network interfaces, and loopback devices in
> containers just like they would in full para or hw virt machines.  We're
> trying to make them work, without it looking like a kludge.  I
> personally agree with you that much of this can be done in host user
> space and, coming out of LinuxPlumbers last year, I've implemented some
> ideas that did not require kernel patches that achieve some of my goals.
>
>> And this patchset is proof of why it's not a good idea.  You really
>> didn't do anything with all of the namespace stuff, except change loop.
>> That's the only thing that cares, so, just do it there, like I said to
>> do so, last August.
>
>> And you are ignoring the notifications to userspace and how namespaces
>> here would deal with that.
>
> That's a problem to deal with.  I don't think anyone is ignoring them.
>
>> > > > Serge mentioned something to me about a loopdevfs (?) thing that
>> > > > someone else is working on.  That would seem to be a better solution
>> > > > in this particular case but I don't know much about it or where it's at.
>> > >
>> > > Ok, let's see those patches then.
>> >
>> > I think Seth has a git tree ready, but not sure which branch he'd want
>> > us to look at.
>> >
>> > Splitting a namespaced devtmpfs from loopdevfs discussion might be
>> > sensible.  However, in defense of a namespaced devtmpfs I'd say
>> > that for userspace to, at every container startup, bind-mount in
>> > devices from the global devtmpfs into a private tmpfs (for systemd's
>> > sake it can't just be on the container rootfs), seems like something
>> > worth avoiding.
>
>> I think having to pick and choose what device nodes you want in a
>> container is a good thing.
>
> Both static and dynamic devices.  It's got to support hotplug.  We have
> (I have) use cases.  That's what I'm trying to do with host udev rules
> and some custom configurations.  I can play games with udev rules.
> Maybe we can keep the user space policies in user space and not burden
> the kernel.
>
>> Besides, you would have to do the same thing
>> in the kernel anyway, what's wrong with userspace making the decision
>> here, especially as it knows exactly what it wants to do much more so
>> than the kernel ever can.
>
> IMHO, there's nothing wrong with that as long as we agree on how it's to
> be done.  I'm not convinced that it can all be done in user space and
> I'm not convinced that name spaced devtmpfs is the magic pill to make it
> all go away either.  Making the user space make the decisions and having
> the kernel enforce them is a principle worth considering.
>
>> > PS - Apparently both parallels and Michael independently
>> > project devices which are hot-plugged on the host into containers.
>> > That also seems like something worth talking about (best practices,
>> > shortcomings, use cases not met by it, any ways that the kernel can
>> > help out) at ksummit/linuxcon.
>
>> I was told that containers would never want devices hotplugged into
>> them.
>
> Interesting.  You were told they (who they?) would never want them?

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Michael H. Warfield

On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote:
> On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > I think having to pick and choose what device nodes you want in a
> > > > container is a good thing.  Besides, you would have to do the same thing
> > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > here, especially as it knows exactly what it wants to do much more so
> > > > than the kernel ever can.
> > >
> > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > is that we simply want to allow a container to say "give me a loop
> > > device to use" and have it receive a unique loop device (or 3), without
> > > having to pre-assign them.  I think that would be cleaner to do using
> > > a pseudofs and loop-control device, rather than having to have a
> > > daemon in userspace on the host farming those out in response to
> > > some, I don't know, dbus request?
> >
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
>
> No matter what I don't think we get out of this without driver core
> changes, whether this was done in loop or by creating something new.
> Not unless the whole thing is punted to userspace, anyway.
>
> The first problem is that many block device ioctls check for
> CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> not really sure. But loop does at minimum support partitions, and to get
> that functionality in an unprivileged container at least the block layer
> needs to know the namespace which has privileges for that device.

Woa!  Time out...  Sorry, this will be an off topic aside.

Loop devices support partitions?  I'd love to know how that works.  I've
tried several times in the past to do that but it's failed every time.
I haven't been able to find any how-to in the past.  This article was
just a couple of years ago (after the last time I tried this):

http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/

This guy didn't use partitions directly but used the offset to the
mount, which is what I had to use.  Everything I found always referred
to using mount offsets in order to mount partitions within a loop
device.
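
The offset approach described above can be sketched roughly as follows.
This is only an illustration, assuming 512-byte sectors and a partition
start sector read from a tool such as fdisk; the helper names
(partition_offset, mount_command) and the example paths are hypothetical.
It computes the byte offset and builds the mount command line without
running it.

```python
def partition_offset(start_sector, sector_size=512):
    """Byte offset of a partition inside a disk image."""
    if start_sector < 0 or sector_size <= 0:
        raise ValueError("start_sector must be >= 0 and sector_size > 0")
    return start_sector * sector_size


def mount_command(image, mountpoint, start_sector, sector_size=512):
    """Build (but do not run) a mount command for one partition of an image."""
    offset = partition_offset(start_sector, sector_size)
    return ["mount", "-o", f"loop,offset={offset}", image, mountpoint]


if __name__ == "__main__":
    # Hypothetical image whose first partition starts at sector 63.
    print(" ".join(mount_command("disk.img", "/mnt/part1", 63)))
```

The start sector has to be looked up per image, which is the main
inconvenience compared with having the kernel expose partition nodes.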

Regards,
Mike

> The second is that all block devices automatically appear in devtmpfs.
> The scenario I'm concerned about is that the host could unknowingly use
> a loop device exposed to a container, then the container could see data
> from the host. So we either need a flag to tell the driver core not to
> create a node in devtmpfs, or we need a privileged manager in userspace
> to remove them (which kind of defeats the purpose). And it gets more
> complicated when partition block devs are mixed in, because they can be
> created without involvement from the driver - they would need to inherit
> the "no devtmpfs node" property from their parent, and if the driver
> uses a pseudo fs to create device nodes for userspace then it needs to
> be informed about the partitions too so it can create those nodes.
>
> So maybe we could get by without the privileged ioctls, as long as it
> was understood that unprivileged containers can't do partitioning. But I
> do think the devtmpfs problem would need to be addressed.
>
> Thanks,
> Seth

-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Fri, May 16, 2014 at 11:28:28AM -0400, Michael H. Warfield wrote:
> On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote:
> > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > I think having to pick and choose what device nodes you want in a
> > > > > container is a good thing.  Besides, you would have to do the same thing
> > > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > > here, especially as it knows exactly what it wants to do much more so
> > > > > than the kernel ever can.
> > > >
> > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > is that we simply want to allow a container to say "give me a loop
> > > > device to use" and have it receive a unique loop device (or 3), without
> > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > a pseudofs and loop-control device, rather than having to have a
> > > > daemon in userspace on the host farming those out in response to
> > > > some, I don't know, dbus request?
> > >
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> >
> > No matter what I don't think we get out of this without driver core
> > changes, whether this was done in loop or by creating something new.
> > Not unless the whole thing is punted to userspace, anyway.
> >
> > The first problem is that many block device ioctls check for
> > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > not really sure. But loop does at minimum support partitions, and to get
> > that functionality in an unprivileged container at least the block layer
> > needs to know the namespace which has privileges for that device.
>
> Woa!  Time out...  Sorry, this will be an off topic aside.
>
> Loop devices support partitions?  I'd love to know how that works.  I've
> tried several times in the past to do that but it's failed every time.
> I haven't been able to find any how-to in the past.  This article was
> just a couple of years ago (after the last time I tried this):
>
> http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/
>
> This guy didn't use partitions directly but used the offset to the
> mount, which is what I had to use.  Everything I found always referred
> to using mount offsets in order to mount partitions within a loop
> device.

It's controlled by the loop.max_part module parameter. It defaults to 0,
which means no partition support. For any value > 0 max_part will be the
maximum available partition number, after rounding it up to the nearest
power of 2 minus 1 (so max_part=5 gives you up to 7 partitions,
max_part=8 gives you up to 15, etc).
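
A minimal sketch of that rounding, assuming the driver uses a
find-last-set style shift (my reading of loop.c at the time, not
something stated in this thread). The function names are hypothetical.
Under this rule max_part=5 yields a highest partition number of 7 and
max_part=8 yields 15, which corresponds to 8 and 16 minor numbers per
loop device once the whole-disk node is counted.

```python
def effective_max_part(requested):
    """Round a requested loop.max_part up to the next (2**n - 1)."""
    if requested < 0:
        raise ValueError("max_part must be non-negative")
    if requested == 0:
        return 0  # 0 means no partition support
    part_shift = requested.bit_length()  # same result as fls() for positive ints
    return (1 << part_shift) - 1


def minors_per_device(requested):
    """Minor numbers used per loop device: partitions plus the whole disk."""
    return effective_max_part(requested) + 1


if __name__ == "__main__":
    for value in (0, 1, 5, 8):
        print(value, effective_max_part(value), minors_per_device(value))
```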

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Greg Kroah-Hartman

On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > I think having to pick and choose what device nodes you want in a
> > > > container is a good thing.  Besides, you would have to do the same thing
> > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > here, especially as it knows exactly what it wants to do much more so
> > > > than the kernel ever can.
> > >
> > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > is that we simply want to allow a container to say "give me a loop
> > > device to use" and have it receive a unique loop device (or 3), without
> > > having to pre-assign them.  I think that would be cleaner to do using
> > > a pseudofs and loop-control device, rather than having to have a
> > > daemon in userspace on the host farming those out in response to
> > > some, I don't know, dbus request?
> >
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that.  So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
>
> No matter what I don't think we get out of this without driver core
> changes, whether this was done in loop or by creating something new.
> Not unless the whole thing is punted to userspace, anyway.
>
> The first problem is that many block device ioctls check for
> CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> not really sure. But loop does at minimum support partitions, and to get
> that functionality in an unprivileged container at least the block layer
> needs to know the namespace which has privileges for that device.

That's fine, you should have those permissions in a container if you
want to do something like that on a loop device, right?

> The second is that all block devices automatically appear in devtmpfs.
> The scenario I'm concerned about is that the host could unknowingly use
> a loop device exposed to a container, then the container could see data
> from the host.

I don't think that's a real issue, the host should know not to do that.

> So we either need a flag to tell the driver core not to create a node
> in devtmpfs, or we need a privileged manager in userspace to remove
> them (which kind of defeats the purpose). And it gets more complicated
> when partition block devs are mixed in, because they can be created
> without involvement from the driver - they would need to inherit the
> "no devtmpfs node" property from their parent, and if the driver uses
> a pseudo fs to create device nodes for userspace then it needs to be
> informed about the partitions too so it can create those nodes.

I don't think that will be needed.  Root in a host can do whatever it
wants in the containers, so mixing up block devices is the least of the
issues involved :)

> So maybe we could get by without the privileged ioctls, as long as it
> was understood that unprivileged containers can't do partitioning. But I
> do think the devtmpfs problem would need to be addressed.

I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread James Bottomley

On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote:
> On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> > > PS - Apparently both parallels and Michael independently
> > > project devices which are hot-plugged on the host into containers.
> > > That also seems like something worth talking about (best practices,
> > > shortcomings, use cases not met by it, any ways that the kernel can
> > > help out) at ksummit/linuxcon.
> >
> > I was told that containers would never want devices hotplugged into
> > them.
>
> Interesting.  You were told they (who they?) would never want them?  Who
> said that?  I would have never thought that given that other
> implementations can provide that.  I would certainly want them.  Seems
> strange to explicitly relegate LXC containers to being second class
> citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.

That would probably be me.  Running hotplug inside a container is a
security problem and, since containers are easily entered by the host,
it's very easy to listen for the hotplug in the host and inject it into
the container using nsenter.
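
As a rough sketch of that host-side injection, assuming a host listener
already knows the container's init PID and the device's major/minor
numbers: the function name and the example values are hypothetical, and
the code only builds the nsenter/mknod command line rather than running
it.

```python
def inject_command(container_pid, devname, kind, major, minor):
    """Build an nsenter + mknod command that creates a device node
    inside the mount namespace of the given container process."""
    if kind not in ("b", "c"):
        raise ValueError("kind must be 'b' (block) or 'c' (character)")
    return [
        "nsenter", "--target", str(container_pid), "--mount", "--",
        "mknod", f"/dev/{devname}", kind, str(major), str(minor),
    ]


if __name__ == "__main__":
    # Hypothetical event: block device 8:17 appears, container init is PID 1234.
    print(" ".join(inject_command(1234, "sdb1", "b", 8, 17)))
```

Which devices get forwarded stays a host policy decision here; the
container never sees the hotplug event itself.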

I don't think the intention is to label anyone's implementation as
preferred.  What this shows, I think, is that we all have different
practises when it comes to setting up containers.  Some are necessary
because our containers are different.  Some could do with serious
examination to see if there's really a best way to do the action which
we would then all use.

> I might believe you were never told they would need them, but that's a
> totally different sense.  Are we going to tell RedHat and the Docker
> people that LXC is an inferior technology that is complex and unreliable
> (to quote another poster) compared to these others?  They're saying this
> will be enterprise technology.  If I go to Amazon AWS or other VPS
> services and compare, are we not going to stand on a level playing
> field?  Admittedly, I don't expect Amazon AWS to provide me with serial
> consoles, but I do expect to be able to mount file system images within
> my VPS.

Well, that's another nasty, isn't it.  We all have different ways of
coping with mount in the container.  I think at plumbers we need to sit
down with some of this plumbing and work out which pipes carry the same
fluids and whether we could unify them.

As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action.  The main issue for us is mount of procfs,
which really needs to be a bind mount in a container.  All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.

James



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread James Bottomley

On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > I think having to pick and choose what device nodes you want in a
> > > > > container is a good thing.  Besides, you would have to do the same thing
> > > > > in the kernel anyway, what's wrong with userspace making the decision
> > > > > here, especially as it knows exactly what it wants to do much more so
> > > > > than the kernel ever can.
> > > >
> > > > For 'real' devices that sounds sensible.  The thing about loop devices
> > > > is that we simply want to allow a container to say "give me a loop
> > > > device to use" and have it receive a unique loop device (or 3), without
> > > > having to pre-assign them.  I think that would be cleaner to do using
> > > > a pseudofs and loop-control device, rather than having to have a
> > > > daemon in userspace on the host farming those out in response to
> > > > some, I don't know, dbus request?
> > >
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that.  So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> >
> > No matter what I don't think we get out of this without driver core
> > changes, whether this was done in loop or by creating something new.
> > Not unless the whole thing is punted to userspace, anyway.
> >
> > The first problem is that many block device ioctls check for
> > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > not really sure. But loop does at minimum support partitions, and to get
> > that functionality in an unprivileged container at least the block layer
> > needs to know the namespace which has privileges for that device.
>
> That's fine, you should have those permissions in a container if you
> want to do something like that on a loop device, right?

Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
Any user possessing CAP_SYS_ADMIN can do about as much damage as real
root can, whether or not you use user namespaces, so it would compromise
a lot of the security we're just bringing to containers.

> > The second is that all block devices automatically appear in devtmpfs.
> > The scenario I'm concerned about is that the host could unknowingly use
> > a loop device exposed to a container, then the container could see data
> > from the host.
>
> I don't think that's a real issue, the host should know not to do that.
>
> > So we either need a flag to tell the driver core not to create a node
> > in devtmpfs, or we need a privileged manager in userspace to remove
> > them (which kind of defeats the purpose). And it gets more complicated
> > when partition block devs are mixed in, because they can be created
> > without involvement from the driver - they would need to inherit the
> > "no devtmpfs node" property from their parent, and if the driver uses
> > a pseudo fs to create device nodes for userspace then it needs to be
> > informed about the partitions too so it can create those nodes.
>
> I don't think that will be needed.  Root in a host can do whatever it
> wants in the containers, so mixing up block devices is the least of the
> issues involved :)
>
> > So maybe we could get by without the privileged ioctls, as long as it
> > was understood that unprivileged containers can't do partitioning. But I
> > do think the devtmpfs problem would need to be addressed.
>
> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

To make sure we're on the same page with terminology, there's an
unprivileged container and a secure container.  In the former, there's
no root user (all the processes run as non-root), so the container isn't
expected to perform any actions root would ... that's easy.  In a secure
container, root is mapped to a nobody user in the host, so is
effectively unprivileged, but root in the container expects to look like
a real root within the VPS (and thus may expect to partition things,
depending on how they've been given access to the block device).  The
big problem is giving back capabilities to the container root such that
a) it loses them if it escapes the container and b) it doesn't get
sufficient capabilities to damage the system.

James



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Michael H. Warfield

On Fri, 2014-05-16 at 12:20 -0700, James Bottomley wrote:
 On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote:
  On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
  
   I was told that containers would never want devices hotplugged into
   them.
  
  Interesting.  You were told they (who they?) would never want them?  Who
  said that?  I would have never thought that given that other
  implementations can provide that.  I would certainly want them.  Seems
  strange to explicitly relegate LXC containers to being second class
  citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.

 That would probably be me.  Running hotplug inside a container is a
 security problem and, since containers are easily entered by the host,
 it's very easy to listen for the hotplug in the host and inject it into
 the container using nsenter.
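
The host-side injection described above could look roughly like the following Python sketch. It only builds the command line; the function name is hypothetical, and it assumes the host listener already knows the device type and major/minor numbers, that it runs with enough privilege on the host, and that a mknod binary is reachable inside the container's mount namespace.

```python
import os


def inject_node_argv(container_pid, name, dev_type, major, minor):
    """Build an nsenter command that creates /dev/<name> inside a container.

    container_pid is the host-side PID of a process in the container
    (typically its init).
    """
    if dev_type not in ("b", "c"):
        raise ValueError("dev_type must be 'b' (block) or 'c' (char)")
    if "/" in name or name in ("", ".", ".."):
        raise ValueError("name must be a plain file name")
    return [
        "nsenter", "--target", str(container_pid), "--mount", "--",
        "mknod", os.path.join("/dev", name), dev_type, str(major), str(minor),
    ]
```

A host udev rule or daemon could pass the resulting list to subprocess.run(); keeping the policy (which devices, which containers) in that host-side caller is the point being made.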

In all virtualization...  The host, particularly root on the host,
exists as deus ex machina, the god outside the machine.  They are at
my mercy.  Even hardware virtualization can not protect you from the
host.  You wanna hear some frightening talks on virtualization, catch
Joanna (miss little blue pill) Rutkowska some time.  I'm particularly
interested in her takes on the anti evil-maid attacks and I sat in on
her talks on the north bridge and south bridge malware evasion
techniques.  She's a good speaker who makes powerful points that make
you sweat but is pleasant in face to face conversation.  I've played
with her Qubes distribution a couple of times and the way it works with
the TPM to ensure a secure boot is interesting.  But that's a completely
different topic on trusted computing.

OTOH, there are plenty of other things to worry about in all forms of
virtualization.  At Internet Security Systems, where I was a founder,
fellow, and X-Force Senior Wizard, we were looking at the ability to
leak information through the USB subsystem.  No isolation is perfect,
especially when you have USB enabled.

But that's my turf.

 I don't think the intention is to label anyone's implementation as
 preferred.  What this shows, I think, is that we all have different
 practises when it comes to setting up containers.  Some are necessary
 because our containers are different.  Some could do with serious
 examination to see if there's really a best way to do the action which
 we would then all use.

And I hope to contribute to the discussion of said actions.

  I might believe you were never told they would need them, but that's a
  totally different sense.  Are we going to tell RedHat and the Docker
  people that LXC is an inferior technology that is complex and unreliable
  (to quote another poster) compared to these others?  They're saying this
  will be enterprise technology.  If I go to Amazon AWS or other VPS
  services and compare, are we not going to stand on a level playing
  field?  Admittedly, I don't expect Amazon AWS to provide me with serial
  consoles, but I do expect to be able to mount file system images within
  my VPS.

 Well, that's another nasty, isn't it.  We all have different ways of
 coping with mount in the container.  I think at plumbers we need to sit
 down with some of this plumbing and work out which pipes carry the same
 fluids and whether we could unify them.

Concur

 As an aside (probably requiring a new thread) we were wondering about
 some type of notifier on the mount call that we could vector into the
 host to perform the action.  The main issue for us is mount of procfs,
 which really needs to be a bind mount in a container.  All of this led
 me to speculate that we could use some type of syscall notifier
 mechanism to manage capabilities in the host and even intercept and
 complete the syscall action within the host rather than having to keep
 evolving more and more complex kernel drivers to do this.

Interesting.  That could be very useful.  That might even help with the
loop device case where the mounts have to go through loop devices for
things like file system images and builds.  Very interesting...

 James

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Seth Forshee

On Fri, May 16, 2014 at 12:28:35PM -0700, James Bottomley wrote:
 On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
  On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
   On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
  I think having to pick and choose what device nodes you want in a
  container is a good thing.  Besides, you would have to do the same thing
  in the kernel anyway, what's wrong with userspace making the decision
  here, especially as it knows exactly what it wants to do much more so
  than the kernel ever can.
 
 For 'real' devices that sounds sensible.  The thing about loop devices
 is that we simply want to allow a container to say give me a loop
 device to use and have it receive a unique loop device (or 3), without
 having to pre-assign them.  I think that would be cleaner to do using
 a pseudofs and loop-control device, rather than having to have a
 daemon in userspace on the host farming those out in response to
 some, I don't know, dbus request?

I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that.  So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
   
   No matter what I don't think we get out of this without driver core
   changes, whether this was done in loop or by creating something new.
   Not unless the whole thing is punted to userspace, anyway.
   
   The first problem is that many block device ioctls check for
   CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
   not really sure. But loop does at minimum support partitions, and to get
   that functionality in an unprivileged container at least the block layer
   needs to know the namespace which has privileges for that device.
  
  That's fine, you should have those permissions in a container if you
  want to do something like that on a loop device, right?
 
 Really, no.  CAP_SYS_ADMIN is effectively a pseudo root security hole.
 Any user possessing CAP_SYS_ADMIN can do about as much damage as real
 root can, whether or not you use user namespaces, so it would compromise
 a lot of the security we're just bringing to containers.
 
   The second is that all block devices automatically appear in devtmpfs.
   The scenario I'm concerned about is that the host could unknowingly use
   a loop device exposed to a container, then the container could see data
   from the host.
  
  I don't think that's a real issue, the host should know not to do that.
  
   So we either need a flag to tell the driver core not to create a node
   in devtmpfs, or we need a privileged manager in userspace to remove
   them (which kind of defeats the purpose). And it gets more complicated
   when partition block devs are mixed in, because they can be created
   without involvement from the driver - they would need to inherit the
   no devtmpfs node property from their parent, and if the driver uses
   a pseudo fs to create device nodes for userspace then it needs to be
   informed about the partitions too so it can create those nodes.
  
  I don't think that will be needed.  Root in a host can do whatever it
  wants in the containers, so mixing up block devices is the least of the
  issues involved :)
  
   So maybe we could get by without the privileged ioctls, as long as it
   was understood that unprivileged containers can't do partitioning. But I
   do think the devtmpfs problem would need to be addressed.
  
  I don't think unprivileged containers should be able to do partitioning.
  An unprivileged user can't do that, so why should a container be any
  different?
 
 To make sure we're on the same page with terminology, there's an
 unprivileged container and a secure container.  In the former, there's
 no root user (all the processes run as non-root), so the container isn't
 expected to perform any actions root would ... that's easy.  In a secure
 container, root is mapped to a nobody user in the host, so is
 effectively unprivileged, but root in the container expects to look like
 a real root within the VPS (and thus may expect to partition things,
 depending on how they've been given access to the block device).  The
 big problem is giving back capabilities to the container root such that
 a) it loses them if it escapes the container and b) it doesn't get
 sufficient capabilities to damage the system.

Based on your description what I was talking about is a secure
container. Thanks for clearing that up, and sorry for misusing the
terminology.

What I set out for was feature parity between loop devices in a secure
container and loop devices on the host. Since some operations currently
check for system-wide CAP_SYS_ADMIN,

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-16 Thread Eric W. Biederman

Greg Kroah-Hartman <gre...@linuxfoundation.org> writes:

> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
>> > I think having to pick and choose what device nodes you want in a
>> > container is a good thing.  Besides, you would have to do the same thing
>> > in the kernel anyway, what's wrong with userspace making the decision
>> > here, especially as it knows exactly what it wants to do much more so
>> > than the kernel ever can.
>>
>> For 'real' devices that sounds sensible.  The thing about loop devices
>> is that we simply want to allow a container to say "give me a loop
>> device to use" and have it receive a unique loop device (or 3), without
>> having to pre-assign them.  I think that would be cleaner to do using
>> a pseudofs and loop-control device, rather than having to have a
>> daemon in userspace on the host farming those out in response to
>> some, I don't know, dbus request?
>
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that.  So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yes. Something like devpts (without the newinstance option).  Built to
allow unprivileged users to create loopback devices.

There is still a huge kettle of fish involved in verifying that a
filesystem is safe from a hostile user who has access to the block device
while the filesystem is mounted.

Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.

I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.

Eric

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > I think having to pick and choose what device nodes you want in a
> > container is a good thing.  Besides, you would have to do the same thing
> > in the kernel anyway, what's wrong with userspace making the decision
> > here, especially as it knows exactly what it wants to do much more so
> > than the kernel ever can.
> 
> For 'real' devices that sounds sensible.  The thing about loop devices
> is that we simply want to allow a container to say "give me a loop
> device to use" and have it receive a unique loop device (or 3), without
> having to pre-assign them.  I think that would be cleaner to do using
> a pseudofs and loop-control device, rather than having to have a
> daemon in userspace on the host farming those out in response to
> some, I don't know, dbus request?

I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that.  So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > What exactly defines '"normal" use case for a container'?
> 
> Well, I'd say "acting like a virtual machine" is a good start :)
> 
> > Not too long ago much of what we can now do with network namespaces
> > was not a normal container use case.  Neither "you can't do it now"
> > nor "I don't use it like that" should be grounds for a pre-emptive
> > nack.  "It will horribly break security assumptions" certainly would
> > be.
> 
> I agree, and maybe we will get there over time, but this patch is not
> the way to do that.

Ok.  [ I/we may be asking for more details later, but think there is enough
below :), particularly the point about event forwarding ]  Thanks.

> > That's not to say there might not be good reasons why this in particular
> > is not appropriate, but ISTM if things are going to be nacked without
> > consideration of the patchset itself, we ought to be having a ksummit
> > session to come to a consensus [ or receive a decree, presumably by you :)
> > but after we have a chance to make our case ] on what things are going to
> > be un/acceptable.
> 
> I already stood up and publicly said this last year at Plumbers, why
> is anything now different?

Well I've simply never had a chance to talk to you since then to find out
exactly what it is that is unacceptable, and why.  And, of course, code
makes it easier to discuss these things.

> And this patchset is proof of why it's not a good idea.  You really
> didn't do anything with all of the namespace stuff, except change loop.
> That's the only thing that cares, so, just do it there, like I said to
> do so, last August.

Sorry, just do it where?

> And you are ignoring the notifications to userspace and how namespaces
> here would deal with that.

Good point.  Addressing that is at the same time necessary, interesting,
and complicated.

> > > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > > else is working on.  That would seem to be a better solution in this
> > > > particular case but I don't know much about it or where it's at.
> > > 
> > > Ok, let's see those patches then.
> > 
> > I think Seth has a git tree ready, but not sure which branch he'd want
> > us to look at.
> > 
> > Splitting a namespaced devtmpfs from loopdevfs discussion might be
> > sensible.  However, in defense of a namespaced devtmpfs I'd say
> > that for userspace to, at every container startup, bind-mount in
> > devices from the global devtmpfs into a private tmpfs (for systemd's
> > sake it can't just be on the container rootfs), seems like something
> > worth avoiding.
> 
> I think having to pick and choose what device nodes you want in a
> container is a good thing.  Besides, you would have to do the same thing
> in the kernel anyway, what's wrong with userspace making the decision
> here, especially as it knows exactly what it wants to do much more so
> than the kernel ever can.

For 'real' devices that sounds sensible.  The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them.  I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?

> > PS - Apparently both parallels and Michael independently
> > project devices which are hot-plugged on the host into containers.
> > That also seems like something worth talking about (best practices,
> > shortcomings, use cases not met by it, any ways that the kernel can
> > help out) at ksummit/linuxcon.
> 
> I was told that containers would never want devices hotplugged into
> them.  What use case has this happening / needed?

I'm pretty sure I didn't say that.  But I guess
we are combining two topics here, the loop pseudofs and the namespaced
devtmpfs.

The use case of loop-control device and loop pseudofs is to have
multiple chrooted/namespaced programs be able to grab a loop device
on demand which they can use for the obvious things (building a livecd,
extracting file contents, etc) without stepping on each other's toes.  The
namespaced devtmpfs is not required for this.
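
For reference, the existing host-wide interface for "give me a loop device" is the LOOP_CTL_GET_FREE ioctl on /dev/loop-control. The Python sketch below shows that call; the function names are illustrative, opening the control device needs privilege on the host, and a per-container variant as discussed here would need something that does not exist yet.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>


def loop_device_path(index):
    """Map a loop index to the conventional device node name."""
    if index < 0:
        raise ValueError("loop index must be non-negative")
    return "/dev/loop%d" % index


def get_free_loop(control_path="/dev/loop-control"):
    """Ask the loop driver for an unused loop device and return its path."""
    fd = os.open(control_path, os.O_RDWR)
    try:
        # With no argument the ioctl's return value is the free index.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return loop_device_path(index)
```

Two programs calling get_free_loop() at the same time can still race for the same index before either attaches a backing file, which is part of the "stepping on each other's toes" problem.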

One advantage of a namespaced devtmpfs would be sane-looking devices
in unprivileged containers.  Currently we have to bind-mount the host's
/dev/{full,zero,etc} which, due to uid and gid mappings, then shows up
as:

crw-rw-rw- 1 nobody nogroup   1, 7 May 12 13:35 full
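
The nobody/nogroup owner falls out of how ID maps are applied: a host ID with no entry in the namespace's uid_map or gid_map is reported as the overflow ID. A minimal Python sketch of that lookup follows, assuming the three-column /proc/<pid>/uid_map format (inside ID, outside ID, length) and the default overflow value of 65534; the function names are illustrative.

```python
OVERFLOW_ID = 65534  # default /proc/sys/kernel/overflowuid, usually "nobody"


def parse_id_map(text):
    """Parse uid_map or gid_map text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def id_seen_inside(outside_id, id_map):
    """Translate a host-side ID to the ID shown inside the namespace."""
    for inside, outside, count in id_map:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return OVERFLOW_ID  # unmapped IDs appear as nobody/nogroup
```

With a map of "0 100000 65536", a node owned by host uid 0 is outside the mapped range, so it is displayed as 65534 inside the container.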

Also you mentioned uevent forwarding above.  Michael has talked several
times about having userspace on the host 'pass' devices into the
container.  One thing which I believe he and Eric have discussed
before was how to have userspace in the container be notified when
a device is passed in.  It seems to me that at least this is something
that would be

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Michael H. Warfield

On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > What exactly defines '"normal" use case for a container'?

> Well, I'd say "acting like a virtual machine" is a good start :)

Ok...  And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
USB devices.  I use the USB hotplug with VirtualBox.  I plug a
configured USB device in and the VirtualBox VM grabs it.  Virtual
machines have loopback devices.  I've used them and using them in
containers is significantly more efficient.  VirtualBox has remote audio
and a host of other device features.

Now we have some agreement.  Normal is "acting like a virtual machine".
That's a goal I can agree with.  I want to work toward that goal of
containers "acting like a virtual machine" just running on a common
kernel with the host.  It's a challenge.  We're getting there.

> > Not too long ago much of what we can now do with network namespaces
> > was not a normal container use case.  Neither "you can't do it now"
> > nor "I don't use it like that" should be grounds for a pre-emptive
> > nack.  "It will horribly break security assumptions" certainly would
> > be.

> I agree, and maybe we will get there over time, but this patch is not
> the way to do that.

Ok...  We have a goal.  Now we can haggle over the details (to
paraphrase a joke that's as old as I am).

> > That's not to say there might not be good reasons why this in particular
> > is not appropriate, but ISTM if things are going to be nacked without
> > consideration of the patchset itself, we ought to be having a ksummit
> > session to come to a consensus [ or receive a decree, presumably by you :)
> > but after we have a chance to make our case ] on what things are going to
> > be un/acceptable.

> I already stood up and publicly said this last year at Plumbers, why
> is anything now different?

Not much really.  The reality is that more and more people are trying to
use hotplug devices, network interfaces, and loopback devices in
containers just like they would in full para or hw virt machines.  We're
trying to make them work, without it looking like a kludge.  I
personally agree with you that much of this can be done in host user
space and, coming out of LinuxPlumbers last year, I've implemented some
ideas that did not require kernel patches that achieve some of my goals.

> And this patchset is proof of why it's not a good idea.  You really
> didn't do anything with all of the namespace stuff, except change loop.
> That's the only thing that cares, so, just do it there, like I said to
> do so, last August.

> And you are ignoring the notifications to userspace and how namespaces
> here would deal with that.

That's a problem to deal with.  I don't think anyone is ignoring them.

> > > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > > else is working on.  That would seem to be a better solution in this
> > > > particular case but I don't know much about it or where it's at.
> > > 
> > > Ok, let's see those patches then.
> > 
> > I think Seth has a git tree ready, but not sure which branch he'd want
> > us to look at.
> > 
> > Splitting a namespaced devtmpfs from loopdevfs discussion might be
> > sensible.  However, in defense of a namespaced devtmpfs I'd say
> > that for userspace to, at every container startup, bind-mount in
> > devices from the global devtmpfs into a private tmpfs (for systemd's
> > sake it can't just be on the container rootfs), seems like something
> > worth avoiding.

> I think having to pick and choose what device nodes you want in a
> container is a good thing.

Both static and dynamic devices.  It's got to support hotplug.  We have
(I have) use cases.  That's what I'm trying to do with host udev rules
and some custom configurations.  I can play games with udev rules.
Maybe we can keep the user spaces policies in user space and not burden
the kernel.

> Besides, you would have to do the same thing
> in the kernel anyway, what's wrong with userspace making the decision
> here, especially as it knows exactly what it wants to do much more so
> than the kernel ever can.

IMHO, there's nothing wrong with that as long as we agree on how it's to
be done.  I'm not convinced that it can all be done in user space and
I'm not convinced that name spaced devtmpfs is the magic pill to make it
all go away either.  Making the user space make the decisions and having
the kernel enforce them is a principle worth considering.

> > PS - Apparently both parallels and Michael independently
> > project devices which are hot-plugged on the host into containers.
> > That also seems like something worth talking about (best practices,
> > shortcomings, use cases not met by it, any ways that the kernel can
> > help out) at ksummit/linuxcon.

> I was told that containers would never want devices hotplugged into
> them.

Interesting.  You were told they (who they?) would never want them?  Who

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> What exactly defines '"normal" use case for a container'?

Well, I'd say "acting like a virtual machine" is a good start :)

> Not too long ago much of what we can now do with network namespaces
> was not a normal container use case.  Neither "you can't do it now"
> nor "I don't use it like that" should be grounds for a pre-emptive
> nack.  "It will horribly break security assumptions" certainly would
> be.

I agree, and maybe we will get there over time, but this patch is not
the way to do that.

> That's not to say there might not be good reasons why this in particular
> is not appropriate, but ISTM if things are going to be nacked without
> consideration of the patchset itself, we ought to be having a ksummit
> session to come to a consensus [ or receive a decree, presumably by you :)
> but after we have a chance to make our case ] on what things are going to
> be un/acceptable.

I already stood up and publicly said this last year at Plumbers, why
is anything now different?

And this patchset is proof of why it's not a good idea.  You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.

And you are ignoring the notifications to userspace and how namespaces
here would deal with that.

> > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > else is working on.  That would seem to be a better solution in this
> > > particular case but I don't know much about it or where it's at.
> > 
> > Ok, let's see those patches then.
> 
> I think Seth has a git tree ready, but not sure which branch he'd want
> us to look at.
> 
> Splitting a namespaced devtmpfs from loopdevfs discussion might be
> sensible.  However, in defense of a namespaced devtmpfs I'd say
> that for userspace to, at every container startup, bind-mount in
> devices from the global devtmpfs into a private tmpfs (for systemd's
> sake it can't just be on the container rootfs), seems like something
> worth avoiding.

I think having to pick and choose what device nodes you want in a
container is a good thing.  Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
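
As an illustration of userspace doing the picking, the Python sketch below builds the list of commands a container manager might run at startup to populate a private /dev from the host's devtmpfs. The function name and the default node list are assumptions made for the example, not what LXC or any other tool actually does.

```python
import os


def dev_bind_plan(private_dev,
                  nodes=("null", "zero", "full", "random", "urandom", "tty")):
    """Return the commands, as argv lists, that populate a private /dev."""
    plan = [["mount", "-t", "tmpfs", "-o", "mode=755", "none", private_dev]]
    for name in nodes:
        target = os.path.join(private_dev, name)
        plan.append(["touch", target])  # a bind mount needs an existing target
        plan.append(["mount", "--bind", os.path.join("/dev", name), target])
    return plan
```

Each entry can be handed to subprocess.run() by a privileged host process; the node list is the policy, and it lives entirely in userspace.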

> PS - Apparently both parallels and Michael independently
> project devices which are hot-plugged on the host into containers.
> That also seems like something worth talking about (best practices,
> shortcomings, use cases not met by it, any ways that the kernel can
> help out) at ksummit/linuxcon.

I was told that containers would never want devices hotplugged into
them.  What use case has this happening / needed?

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

Am 15.05.2014 22:26, schrieb Serge E. Hallyn:
> Quoting Richard Weinberger (rich...@nod.at):
>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>>>>  wrote:
>>>>> Then don't use a container to build such a thing, or fix the build
>>>>> scripts to not do that :)
>>>>
>>>> I second this.
>>>> To me it looks like some folks try to (ab)use Linux containers
>>>> for purposes where KVM would much better fit in.
>>>> Please don't put more complexity into containers. They are already
>>>> horribly complex and error prone.
>>>
>>> I, naturally, disagree :)  The only use case which is inherently not
>>> valid for containers is running a kernel.  Practically speaking there
>>> are other things which likely will never be possible, but if someone
>>> offers a way to do something in containers, "you can't do that in
>>> containers" is not an apropos response.
>>>
>>> "That abstraction is wrong" is certainly valid, as when vpids were
>>> originally proposed and rejected, resulting in the development of
>>> pid namespaces.  "We have to work out (x) first" can be valid (and
>>> I can think of examples here), assuming it's not just trying to hide
>>> behind a catch-22/chicken-egg problem.
>>>
>>> Finally, saying "containers are complex and error prone" is conflating
>>> several large suites of userspace code and many kernel features which
>>> support them.  Being more precise would, if the argument is valid,
>>> lend it a lot more weight.
>>
>> We (my company) have used Linux containers in production since 2011. First
>> LXC, now libvirt-lxc.
>> To understand the internals better I also wrote my own userspace to
>> create/start containers. There are so many things which can hurt you badly.
>> With user namespaces we expose a really big attack surface to regular users.
>> E.g., suddenly a user is allowed to mount filesystems.
> 
> That is currently not the case.  They can mount some virtual filesystems
> and do bind mounts, but cannot mount most real filesystems.  This keeps
> us protected (for now) from potentially unsafe superblock readers in the
> kernel.

Yeah, I meant not only "real" filesystems.
I had VFS issues in mind where an attacker could do bad things
using bind mounts for example.

>> Ask Andy, he has already found lots of nasty things...
> 
> Yes, of course, and there may be more to come...
> 
>> I agree that user namespaces are the way to go, all the papering with LSM
>> over security issues is much worse.
>> But we have to make sure that we don't add too many features too fast.
> 
> Agreed.  Like I said, 'we have to work (x) out first' could be valid,
> including 'we should wait (a year?) for user ns issues to fall out
> before relaxing any of the current user ns constraints.'
> 
> On the other hand, not exercising the new code may only mean that
> existing flaws stick around longer, undetected (by most).

Fair point.

>> That said, I like containers a lot because they are cheap, but as they are
>> lightweight, their isolation level is lightweight too.
>> IMHO containers are not a cheap replacement for KVM.
> 
> The building blocks for containers can also be used for entirely
> new, simpler use cases - i.e. perhaps a new fakeroot alternative based
> on user namespace mappings.  Which is why "this is not a use case for
> containers" is not the right way to push back, whether or not the
> feature ends up being appropriate.

Agreed.

Maybe I'm too pessimistic.
We'll see. :-)

Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge E. Hallyn

Quoting Richard Weinberger (rich...@nod.at):
> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> >>  wrote:
> >>> Then don't use a container to build such a thing, or fix the build
> >>> scripts to not do that :)
> >>
> >> I second this.
> >> To me it looks like some folks try to (ab)use Linux containers
> >> for purposes where KVM would be a much better fit.
> >> Please don't put more complexity into containers. They are already
> >> horribly complex and error prone.
> > 
> > I, naturally, disagree :)  The only use case which is inherently not
> > valid for containers is running a kernel.  Practically speaking there
> > are other things which likely will never be possible, but if someone
> > offers a way to do something in containers, "you can't do that in
> > containers" is not an apropos response.
> > 
> > "That abstraction is wrong" is certainly valid, as when vpids were
> > originally proposed and rejected, resulting in the development of
> > pid namespaces.  "We have to work out (x) first" can be valid (and
> > I can think of examples here), assuming it's not just trying to hide
> > behind a catch-22/chicken-egg problem.
> > 
> > Finally, saying "containers are complex and error prone" is conflating
> > several large suites of userspace code and many kernel features which
> > support them.  Being more precise would, if the argument is valid,
> > lend it a lot more weight.
> 
> We (my company) have used Linux containers in production since 2011.
> First LXC, now libvirt-lxc.
> To understand the internals better I also wrote my own userspace to
> create/start containers. There are so many things which can hurt you badly.
> With user namespaces we expose a really big attack surface to regular users.
> E.g. suddenly a user is allowed to mount filesystems.

That is currently not the case.  They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems.  This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.
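
A rough way to see the split between virtual and "real" filesystems on a
given machine is to read /proc/filesystems, where a leading "nodev" marks
a filesystem type that does not read its superblock from a block device.
The sketch below is only an approximation of the point above: "nodev" does
not by itself mean "mountable from a user namespace" (the kernel decides
that per filesystem type), and the names split_filesystems and SAMPLE are
made up for illustration.

```python
def split_filesystems(proc_text):
    """Split /proc/filesystems-style text into (virtual, block_backed) lists."""
    virtual, block_backed = [], []
    for line in proc_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "nodev" and len(fields) == 2:
            # "nodev <name>": no backing block device is required.
            virtual.append(fields[1])
        else:
            # "<name>" alone: the superblock is parsed from a block device.
            block_backed.append(fields[-1])
    return virtual, block_backed


# Hypothetical sample in the same tab-separated layout the kernel uses.
SAMPLE = "nodev\tproc\nnodev\ttmpfs\n\text4\n\tiso9660\nnodev\tsysfs\n"

if __name__ == "__main__":
    try:
        with open("/proc/filesystems") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without /proc
    virtual, block_backed = split_filesystems(text)
    print("no block device needed:", ", ".join(virtual))
    print("superblock read from a device:", ", ".join(block_backed))
```

The second list (ext4, iso9660 and friends) is the set whose superblock
parsers would be exposed to untrusted images if the restriction were relaxed.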

> Ask Andy, he has already found lots of nasty things...

Yes, of course, and there may be more to come...

> I agree that user namespaces are the way to go; all the papering over
> security issues with LSMs is much worse.
> But we have to make sure that we don't add too many features too fast.

Agreed.  Like I said, 'we have to work (x) out first' could be valid,
including 'we should wait (a year?) for user ns issues to fall out
before relaxing any of the current user ns constraints.'

On the other hand, not exercising the new code may only mean that
existing flaws stick around longer, undetected (by most).

> That said, I like containers a lot because they are cheap, but since they
> are lightweight, their isolation level is lightweight too.
> IMHO containers are not a cheap replacement for KVM.

The building blocks for containers can also be used for entirely
new, simpler use cases, e.g. perhaps a new fakeroot alternative based
on user namespace mappings.  Which is why "this is not a use case for
containers" is not the right way to push back, whether or not the
feature ends up being appropriate.
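
As a minimal sketch of what "user namespace mappings" means for the
fakeroot idea: each line of /proc/<pid>/uid_map has the form
"<id inside> <id outside> <length>", so a single entry "0 1000 1" makes
an unprivileged uid 1000 appear as root inside the namespace.  The helper
names parse_id_map and to_outside below are invented for illustration;
this only models the translation, it does not create a namespace.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, count)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def to_outside(inside_id, id_map):
    """Translate an ID seen inside the namespace to the parent's view."""
    for inside_start, outside_start, count in id_map:
        if inside_start <= inside_id < inside_start + count:
            return outside_start + (inside_id - inside_start)
    return None  # the ID has no mapping in this namespace


if __name__ == "__main__":
    # Hypothetical map: "root" inside is really uid 1000 on the host.
    fakeroot_map = parse_id_map("0 1000 1\n")
    print(to_outside(0, fakeroot_map))   # uid that owns files created as "root"
    print(to_outside(1, fakeroot_map))   # None: uid 1 is unmapped
```

Files created by "root" in such a namespace are owned by uid 1000 on the
host, which is the property a fakeroot replacement would rely on.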

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

Am 15.05.2014 21:50, schrieb Serge Hallyn:
> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>>  wrote:
>>> Then don't use a container to build such a thing, or fix the build
>>> scripts to not do that :)
>>
>> I second this.
>> To me it looks like some folks try to (ab)use Linux containers
>> for purposes where KVM would be a much better fit.
>> Please don't put more complexity into containers. They are already
>> horribly complex and error prone.
> 
> I, naturally, disagree :)  The only use case which is inherently not
> valid for containers is running a kernel.  Practically speaking there
> are other things which likely will never be possible, but if someone
> offers a way to do something in containers, "you can't do that in
> containers" is not an apropos response.
> 
> "That abstraction is wrong" is certainly valid, as when vpids were
> originally proposed and rejected, resulting in the development of
> pid namespaces.  "We have to work out (x) first" can be valid (and
> I can think of examples here), assuming it's not just trying to hide
> behind a catch-22/chicken-egg problem.
> 
> Finally, saying "containers are complex and error prone" is conflating
> several large suites of userspace code and many kernel features which
> support them.  Being more precise would, if the argument is valid,
> lend it a lot more weight.

We (my company) have used Linux containers in production since 2011. First
LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
E.g. suddenly a user is allowed to mount filesystems.
Ask Andy, he has already found lots of nasty things...
I agree that user namespaces are the way to go; all the papering over
security issues with LSMs is much worse.
But we have to make sure that we don't add too many features too fast.

That said, I like containers a lot because they are cheap, but since they are
lightweight, their isolation level is lightweight too.
IMHO containers are not a cheap replacement for KVM.

Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Richard Weinberger (richard.weinber...@gmail.com):
> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
>  wrote:
> > Then don't use a container to build such a thing, or fix the build
> > scripts to not do that :)
> 
> I second this.
> To me it looks like some folks try to (ab)use Linux containers
> for purposes where KVM would be a much better fit.
> Please don't put more complexity into containers. They are already
> horribly complex and error prone.

I, naturally, disagree :)  The only use case which is inherently not
valid for containers is running a kernel.  Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.

"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces.  "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.

Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them.  Being more precise would, if the argument is valid,
lend it a lot more weight.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
 wrote:
> Then don't use a container to build such a thing, or fix the build
> scripts to not do that :)

I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would be a much better fit.
Please don't put more complexity into containers. They are already
horribly complex and error prone.

-- 
Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Seth Forshee

On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > else is working on.  That would seem to be a better solution in this
> > > particular case but I don't know much about it or where it's at.
> > 
> > Ok, let's see those patches then.
> 
> I think Seth has a git tree ready, but not sure which branch he'd want
> us to look at.

I think the most recent code I've got is the devloop branch of
http://kernel.ubuntu.com/git/sforshee/ubuntu-trusty.git, which is still
a bit messy but gets the idea across. I switched from that to the
devtmpfs approach though for several reasons: the pseudo-fs approach
required some (in my opinion) undesirable collateral changes, it would
require changes to userspace tools (though likely small), and it solves
the problem only for loop devices. Plus if you don't push namespace
awareness down to at least the generic block layer you still can't do
partitions or encrypted loop, and then there are still other problems
which need to be solved to get partition blkdevs inside the mount.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
> On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote:
> > On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
> > > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> > > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > > > > Using devtmpfs is one possible
> > > > > > > solution, and it would have the added benefit of making container
> > > > > > > setup simpler. But simply letting containers mount devtmpfs isn't
> > > > > > > sufficient since the container may need to see a different, more
> > > > > > > limited set of devices, and because different environments making
> > > > > > > modifications to the filesystem could lead to conflicts.
> > > > > > >
> > > > > > > This series solves these problems by assigning devices to user
> > > > > > > namespaces. Each device has an "owner" namespace which specifies
> > > > > > > which devtmpfs mount the device should appear in as well as
> > > > > > > allowing privileged operations on the device from that namespace.
> > > > > > > This defaults to init_user_ns. There's also an ns_global flag to
> > > > > > > indicate a device should appear in all devtmpfs mounts.
> > > > > >
> > > > > > I'd strongly argue that this isn't even a "problem" at all.  And,
> > > > > > as I said at the Plumbers conference last year, adding namespaces
> > > > > > to devices isn't going to happen, sorry.  Please don't continue
> > > > > > down this path.
> > > > >
> > > > > I was just mentioning that to Serge just a week or so ago,
> > > > > reminding him of what you told all of us face to face back then.
> > > > > We were having a discussion over loop devices into containers and
> > > > > this topic came up.
> > > > 
> > > > It was the loop device use case that got me started down this path in
> > > > the first place, so I don't personally have any interest in physical
> > > > devices right now (though I was sure others would).
> > 
> > > Why do you want to give access to a loop device to a container?
> > > Shouldn't you set up the loop devices before creating the container and
> > > then pass those mount points into the container?  I thought that was how
> > > things worked today, or am I missing something?
> > 
> > Ah, you keep feeding me easy ones.  I need raw access to loop devices
> > and loop-control because I'm using containers to build NST (Network
> > Security Toolkit) distribution iso images (one container is x86_64 while
> > the other is i686).  Each requires 2 loop devices.  You can't set up the
> > loop devices in advance since the containers will be creating the images
> > and building them.  NST tinkers with the base build engine
> > configuration, so I really DON'T want it running on a hard iron host. 
> > There may be other cases where I need other specialized containers for
> > building distros.  I'm also looking at custom builds of Kali (another
> > security distribution).
> 
> Then don't use a container to build such a thing, or fix the build
> scripts to not do that :)
> 
> That is not a "normal" use case for a container at all.  Containers are
> not for "everything", use a virtual machine for some tasks (like this
> one).

Hi Greg,

What exactly defines '"normal" use case for a container'?  Not too long
ago much of what we can now do with network namespaces was not a normal
container use case.  Neither "you can't do it now" nor "I don't use it
like that" should be grounds for a pre-emptive nack.  "It will horribly
break security assumptions" certainly would be.

That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.

> > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > else is working on.  That would seem to be a better solution in this
> > particular case but I don't know much about it or where it's at.
> 
> Ok, let's see those patches then.

I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.

Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible.  However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
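
For reference, the visibility rule described in the quoted cover letter
(each device has an "owner" namespace defaulting to init_user_ns, plus an
ns_global flag) can be modelled in a few lines.  This is a toy model of
that wording only, not the code from the series: namespaces are plain
strings, and the names Device and visible_devices are hypothetical.  It
also takes no position on whether the initial namespace should see
devices owned by child namespaces.

```python
from dataclasses import dataclass

INIT_USER_NS = "init_user_ns"  # stand-in for the kernel's initial user ns


@dataclass(frozen=True)
class Device:
    name: str
    owner_ns: str = INIT_USER_NS   # default owner, as in the cover letter
    ns_global: bool = False        # True: show in every devtmpfs mount


def visible_devices(devices, mount_ns):
    """Names of devices a devtmpfs mounted from mount_ns would contain."""
    return [d.name for d in devices
            if d.ns_global or d.owner_ns == mount_ns]


if __name__ == "__main__":
    devices = [
        Device("sda"),                              # host-only
        Device("null", ns_global=True),             # everywhere
        Device("loop0", owner_ns="container_a"),    # one container
    ]
    print(visible_devices(devices, "container_a"))
    print(visible_devices(devices, INIT_USER_NS))
```

The userspace alternative is to compute the same per-container list at
every startup and bind-mount each entry into a private tmpfs.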

-serge

PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings,

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote:
> On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
> > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > > > Using devtmpfs is one possible
> > > > > > solution, and it would have the added benefit of making container
> > > > > > setup simpler. But simply letting containers mount devtmpfs isn't
> > > > > > sufficient since the container may need to see a different, more
> > > > > > limited set of devices, and because different environments making
> > > > > > modifications to the filesystem could lead to conflicts.
> > > > > >
> > > > > > This series solves these problems by assigning devices to user
> > > > > > namespaces. Each device has an "owner" namespace which specifies
> > > > > > which devtmpfs mount the device should appear in as well as
> > > > > > allowing privileged operations on the device from that namespace.
> > > > > > This defaults to init_user_ns. There's also an ns_global flag to
> > > > > > indicate a device should appear in all devtmpfs mounts.
> > > > >
> > > > > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > > > > said at the Plumbers conference last year, adding namespaces to
> > > > > devices isn't going to happen, sorry.  Please don't continue down
> > > > > this path.
> > > > 
> > > > I was just mentioning that to Serge just a week or so ago reminding him
> > > > of what you told all of us face to face back then.  We were having a
> > > > discussion over loop devices into containers and this topic came up.
> > > 
> > > It was the loop device use case that got me started down this path in
> > > the first place, so I don't personally have any interest in physical
> > > devices right now (though I was sure others would).
> 
> > Why do you want to give access to a loop device to a container?
> > Shouldn't you set up the loop devices before creating the container and
> > then pass those mount points into the container?  I thought that was how
> > things worked today, or am I missing something?
> 
> Ah, you keep feeding me easy ones.  I need raw access to loop devices
> and loop-control because I'm using containers to build NST (Network
> Security Toolkit) distribution iso images (one container is x86_64 while
> the other is i686).  Each requires 2 loop devices.  You can't set up the
> loop devices in advance since the containers will be creating the images
> and building them.  NST tinkers with the base build engine
> configuration, so I really DON'T want it running on a hard iron host. 
> There may be other cases where I need other specialized containers for
> building distros.  I'm also looking at custom builds of Kali (another
> security distribution).

Then don't use a container to build such a thing, or fix the build
scripts to not do that :)

That is not a "normal" use case for a container at all.  Containers are
not for "everything", use a virtual machine for some tasks (like this
one).

> Serge mentioned something to me about a loopdevfs (?) thing that someone
> else is working on.  That would seem to be a better solution in this
> particular case but I don't know much about it or where it's at.

Ok, let's see those patches then.

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Michael H. Warfield

On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
> On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > > Using devtmpfs is one possible
> > > > > solution, and it would have the added benefit of making container
> > > > > setup simpler. But simply letting containers mount devtmpfs isn't
> > > > > sufficient since the container may need to see a different, more
> > > > > limited set of devices, and because different environments making
> > > > > modifications to the filesystem could lead to conflicts.
> > > > >
> > > > > This series solves these problems by assigning devices to user
> > > > > namespaces. Each device has an "owner" namespace which specifies
> > > > > which devtmpfs mount the device should appear in as well as allowing
> > > > > privileged operations on the device from that namespace. This
> > > > > defaults to init_user_ns. There's also an ns_global flag to indicate
> > > > > a device should appear in all devtmpfs mounts.
> > > 
> > > > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > > > said at the Plumbers conference last year, adding namespaces to devices
> > > > isn't going to happen, sorry.  Please don't continue down this path.
> > > 
> > > I was just mentioning that to Serge just a week or so ago reminding him
> > > of what you told all of us face to face back then.  We were having a
> > > discussion over loop devices into containers and this topic came up.
> > 
> > It was the loop device use case that got me started down this path in
> > the first place, so I don't personally have any interest in physical
> > devices right now (though I was sure others would).

> Why do you want to give access to a loop device to a container?
> Shouldn't you set up the loop devices before creating the container and
> then pass those mount points into the container?  I thought that was how
> things worked today, or am I missing something?

Ah, you keep feeding me easy ones.  I need raw access to loop devices
and loop-control because I'm using containers to build NST (Network
Security Toolkit) distribution iso images (one container is x86_64 while
the other is i686).  Each requires 2 loop devices.  You can't set up the
loop devices in advance since the containers will be creating the images
and building them.  NST tinkers with the base build engine
configuration, so I really DON'T want it running on a hard iron host. 
There may be other cases where I need other specialized containers for
building distros.  I'm also looking at custom builds of Kali (another
security distribution).

> Giving the ability for a container to create a loop device at all is a
> horrid idea, as you have pointed out, lots of information leakage could
> easily happen.

It does but only slightly.  I noticed that losetup will list all the
devices regardless of container where run or the container where set up.
But that seems to be largely cosmetic.  You can't do anything with the
loop device in the other container.  You can't disconnect it, read it,
or mount it (I've tested it).  In the former case, losetup returns with
no error but does nothing.  In the latter case, you get a busy error.
Not clean, not pretty, but no damage.  Since loop-control is working on
the global pool of loop devices, it's impossible to know what device to
move to what container when the container runs losetup.
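
The global pool is visible in the interface itself: the
LOOP_CTL_GET_FREE ioctl on /dev/loop-control takes no argument and simply
returns the index of some unused loop device, with nothing that names a
container or namespace.  A minimal sketch, assuming a Linux host and
enough privilege to open the control node (the function name
get_free_loop_index is made up here; the ioctl number is the one from
<linux/loop.h>):

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>


def get_free_loop_index(control_path="/dev/loop-control"):
    """Return the index of a free loop device, or None if it can't be queried."""
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None  # node missing or not permitted (e.g. inside a container)
    try:
        # No argument is passed: the kernel picks from one system-wide pool
        # and may allocate a new loop device if none is free.
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    index = get_free_loop_index()
    if index is None:
        print("loop-control not available here")
    else:
        print(f"next free device: /dev/loop{index}")
```

Two containers calling this concurrently are competing for the same
indices, which is the allocation and contention that losetup ends up
handling.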

For me, this isn't a serious problem, since it only involves 2
specialized containers out of over 4 dozen containers I have running
across 3 sites.  And those two containers are under my explicit and
exclusive control.  None of the others need it.  I can get away with
adding extra loop devices and adding them to the containers and let
losetup deal with allocation and contention.
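
The "adding extra loop devices" step amounts to creating more block
device nodes with major number 7.  The sketch below is one way to do it
from host userspace; it assumes the default loop configuration in which
/dev/loopN has minor number N (this no longer holds if the loop module is
loaded with max_part set), and the helper names and the /tmp/dev example
directory are hypothetical.  Creating the nodes needs CAP_MKNOD, so only
the list-building part runs unprivileged.

```python
import os
import stat

LOOP_MAJOR = 7  # block major number reserved for loop devices


def loop_node_specs(first, count, dev_dir="/dev"):
    """Build (path, major, minor) tuples for loop devices first..first+count-1."""
    return [(os.path.join(dev_dir, f"loop{n}"), LOOP_MAJOR, n)
            for n in range(first, first + count)]


def create_nodes(specs, mode=0o660):
    """Create the block device nodes; requires CAP_MKNOD (typically host root)."""
    for path, major, minor in specs:
        if not os.path.exists(path):
            os.mknod(path, stat.S_IFBLK | mode, os.makedev(major, minor))


if __name__ == "__main__":
    # Two extra devices for one build container, as in the use case above.
    for path, major, minor in loop_node_specs(8, 2, "/tmp/dev"):
        print(f"{path}: block {major}:{minor}")
```

The container additionally has to be allowed to use those major:minor
pairs by whatever device policy the container manager applies.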

Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on.  That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.

Mind you, I heard your arguments at LinuxPlumbers regarding pushing user
space policies into the kernel and all and basically I agree with you,
this should be handled in host system user space and it seems
reasonable.  I'm just pointing out real world cases I have in operation
right now and pointing out that I have solutions for them in host user
space, even if some of them may not be esthetically pretty.

> greg k-h

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!


Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
 On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote:
  On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote:
   On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
   Using devtmpfs is one possible
   solution, and it would have the added benefit of making container setup
   simpler. But simply letting containers mount devtmpfs isn't sufficient
   since the container may need to see a different, more limited set of
   devices, and because different environments making modifications to
   the filesystem could lead to conflicts.
   
   This series solves these problems by assigning devices to user
   namespaces. Each device has an "owner" namespace which specifies which
   devtmpfs mount the device should appear in as well as allowing privileged
   operations on the device from that namespace. This defaults to
   init_user_ns. There's also an ns_global flag to indicate a device should
   appear in all devtmpfs mounts.
 
  I'd strongly argue that this isn't even a "problem" at all.  And, as I
  said at the Plumbers conference last year, adding namespaces to devices
  isn't going to happen, sorry.  Please don't continue down this path.
 
 I was just mentioning that to Serge just a week or so ago reminding him
 of what you told all of us face to face back then.  We were having a
 discussion over loop devices into containers and this topic came up.

It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).
  
   Why do you want to give access to a loop device to a container?
   Shouldn't you set up the loop devices before creating the container and
   then pass those mount points into the container?  I thought that was how
   things worked today, or am I missing something?
  
  Ah, you keep feeding me easy ones.  I need raw access to loop devices
  and loop-control because I'm using containers to build NST (Network
  Security Toolkit) distribution iso images (one container is x86_64 while
  the other is i686).  Each requires 2 loop devices.  You can't set up the
  loop devices in advance since the containers will be creating the images
  and building them.  NST tinkers with the base build engine
  configuration, so I really DON'T want it running on a hard iron host. 
  There may be other cases where I need other specialized containers for
  building distros.  I'm also looking at custom builds of Kali (another
  security distribution).
 
 Then don't use a container to build such a thing, or fix the build
 scripts to not do that :)
 
 That is not a normal use case for a container at all.  Containers are
 not for everything, use a virtual machine for some tasks (like this
 one).

Hi Greg,

What exactly defines 'normal use case for a container'?  Not too long
ago much of what we can now do with network namespaces was not a normal
container use case.  Neither "you can't do it now" nor "I don't use it
like that" should be grounds for a pre-emptive nack.  "It will horribly
break security assumptions" certainly would be.

That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.

  Serge mentioned something to me about a loopdevfs (?) thing that someone
  else is working on.  That would seem to be a better solution in this
  particular case but I don't know much about it or where it's at.
 
 Ok, let's see those patches then.

I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.

Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible.  However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
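
For illustration only, the per-container setup described above could be
sketched roughly as follows. The sketch just builds the list of mount
commands and does not run them. The device list, the rootfs path and the
function name are assumptions made for this example, not what LXC or any
other container manager actually does.

```python
import os

# Hypothetical default set of device nodes a container might want.
DEFAULT_DEVICES = ["null", "zero", "full", "random", "urandom", "tty"]


def plan_dev_setup(rootfs, devices=DEFAULT_DEVICES, host_dev="/dev"):
    """Return the command lines that would populate a private /dev.

    Nothing is executed here; a caller would run these as root on the host.
    """
    target = os.path.join(rootfs, "dev")
    # A private tmpfs becomes the container's /dev.
    cmds = [["mount", "-t", "tmpfs", "-o", "mode=755", "none", target]]
    for name in devices:
        node = os.path.join(target, name)
        # A bind mount needs an existing target, so create an empty file first.
        cmds.append(["touch", node])
        cmds.append(["mount", "--bind", os.path.join(host_dev, name), node])
    return cmds
```

Every container start repeats this work for every device, which is the
overhead a namespaced devtmpfs would be meant to avoid.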

-serge

PS - Apparently both Parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Seth Forshee

On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
   Serge mentioned something to me about a loopdevfs (?) thing that someone
   else is working on.  That would seem to be a better solution in this
   particular case but I don't know much about it or where it's at.
  
  Ok, let's see those patches then.
 
 I think Seth has a git tree ready, but not sure which branch he'd want
 us to look at.

I think the most recent code I've got is the devloop branch of
http://kernel.ubuntu.com/git/sforshee/ubuntu-trusty.git, which is still
a bit messy but gets the idea across. I switched from that to the
devtmpfs approach though for several reasons: the pseudo-fs approach
required some (in my opinion) undesirable collateral changes, it would
require changes to userspace tools (though likely small), and it solves
the problem only for loop devices. Plus if you don't push namespace
awareness down to at least the generic block layer you still can't do
partitions or encrypted loop, and then there are still other problems
which need to be solved to get partition blkdevs inside the mount.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
gre...@linuxfoundation.org wrote:
 Then don't use a container to build such a thing, or fix the build
 scripts to not do that :)

I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horribly complex and error prone.

-- 
Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Richard Weinberger (richard.weinber...@gmail.com):
 On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
  Then don't use a container to build such a thing, or fix the build
  scripts to not do that :)
 
 I second this.
 To me it looks like some folks try to (ab)use Linux containers
 for purposes where KVM would much better fit in.
 Please don't put more complexity into containers. They are already
 horribly complex and error prone.

I, naturally, disagree :)  The only use case which is inherently not
valid for containers is running a kernel.  Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.

"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces.  "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.

Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them.  Being more precise would, if the argument is valid,
lend it a lot more weight.

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge E. Hallyn

Quoting Richard Weinberger (rich...@nod.at):
 Am 15.05.2014 21:50, schrieb Serge Hallyn:
  Quoting Richard Weinberger (richard.weinber...@gmail.com):
  On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
  gre...@linuxfoundation.org wrote:
  Then don't use a container to build such a thing, or fix the build
  scripts to not do that :)
 
  I second this.
  To me it looks like some folks try to (ab)use Linux containers
  for purposes where KVM would much better fit in.
  Please don't put more complexity into containers. They are already
  horribly complex and error prone.
  
  I, naturally, disagree :)  The only use case which is inherently not
  valid for containers is running a kernel.  Practically speaking there
  are other things which likely will never be possible, but if someone
  offers a way to do something in containers, "you can't do that in
  containers" is not an apropos response.
  
  "That abstraction is wrong" is certainly valid, as when vpids were
  originally proposed and rejected, resulting in the development of
  pid namespaces.  "We have to work out (x) first" can be valid (and
  I can think of examples here), assuming it's not just trying to hide
  behind a catch-22/chicken-egg problem.
  
  Finally, saying "containers are complex and error prone" is conflating
  several large suites of userspace code and many kernel features which
  support them.  Being more precise would, if the argument is valid,
  lend it a lot more weight.
 
 We (my company) have used Linux containers in production since 2011. First LXC,
 now libvirt-lxc.
 To understand the internals better I also wrote my own userspace to create/start
 containers. There are so many things which can hurt you badly.
 With user namespaces we expose a really big attack surface to regular users.
 E.g., suddenly a user is allowed to mount filesystems.

That is currently not the case.  They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems.  This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.

 Ask Andy, he already found lots of nasty things...

Yes, of course, and there may be more to come...

 I agree that user namespaces are the way to go, all the papering with LSM
 over security issues is much worse.
 But we have to make sure that we don't add too many features too fast.

Agreed.  Like I said, 'we have to work (x) out first' could be valid,
including 'we should wait (a year?) for user ns issues to fall out
before relaxing any of the current user ns constraints'.

On the other hand, not exercising the new code may only mean that
existing flaws stick around longer, undetected (by most).

 That said, I like containers a lot because they are cheap, but as they are
 lightweight, their isolation level is lightweight too.
 IMHO containers are not a cheap replacement for KVM.

The building blocks for containers can also be used for entirely
new, simpler use cases - e.g. perhaps a new fakeroot alternative based
on user namespace mappings.  Which is why "this is not a use case for
containers" is not the right way to push back, whether or not the
feature ends up being appropriate.

-serge

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Richard Weinberger

Am 15.05.2014 22:26, schrieb Serge E. Hallyn:
 Quoting Richard Weinberger (rich...@nod.at):
 Am 15.05.2014 21:50, schrieb Serge Hallyn:
 Quoting Richard Weinberger (richard.weinber...@gmail.com):
 On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
 Then don't use a container to build such a thing, or fix the build
 scripts to not do that :)

 I second this.
 To me it looks like some folks try to (ab)use Linux containers
 for purposes where KVM would much better fit in.
 Please don't put more complexity into containers. They are already
 horribly complex and error prone.

 I, naturally, disagree :)  The only use case which is inherently not
 valid for containers is running a kernel.  Practically speaking there
 are other things which likely will never be possible, but if someone
 offers a way to do something in containers, "you can't do that in
 containers" is not an apropos response.

 "That abstraction is wrong" is certainly valid, as when vpids were
 originally proposed and rejected, resulting in the development of
 pid namespaces.  "We have to work out (x) first" can be valid (and
 I can think of examples here), assuming it's not just trying to hide
 behind a catch-22/chicken-egg problem.

 Finally, saying "containers are complex and error prone" is conflating
 several large suites of userspace code and many kernel features which
 support them.  Being more precise would, if the argument is valid,
 lend it a lot more weight.

 We (my company) have used Linux containers in production since 2011. First LXC,
 now libvirt-lxc.
 To understand the internals better I also wrote my own userspace to create/start
 containers. There are so many things which can hurt you badly.
 With user namespaces we expose a really big attack surface to regular users.
 E.g., suddenly a user is allowed to mount filesystems.
 
 That is currently not the case.  They can mount some virtual filesystems
 and do bind mounts, but cannot mount most real filesystems.  This keeps
 us protected (for now) from potentially unsafe superblock readers in the
 kernel.

Yeah, I meant not only real filesystems.
I had VFS issues in mind where an attacker could do bad things
using bind mounts for example.

 Ask Andy, he already found lots of nasty things...
 
 Yes, of course, and there may be more to come...
 
 I agree that user namespaces are the way to go, all the papering with LSM
 over security issues is much worse.
 But we have to make sure that we don't add too many features too fast.
 
 Agreed.  Like I said, 'we have to work (x) out first' could be valid,
 including 'we should wait (a year?) for user ns issues to fall out
 before relaxing any of the current user ns constraints'.
 
 On the other hand, not exercising the new code may only mean that
 existing flaws stick around longer, undetected (by most).

Fair point.

 That said, I like containers a lot because they are cheap, but as they are
 lightweight, their isolation level is lightweight too.
 IMHO containers are not a cheap replacement for KVM.
 
 The building blocks for containers can also be used for entirely
 new, simpler use cases - e.g. perhaps a new fakeroot alternative based
 on user namespace mappings.  Which is why "this is not a use case for
 containers" is not the right way to push back, whether or not the
 feature ends up being appropriate.

Agreed.

Maybe I'm too pessimistic.
We'll see. :-)

Thanks,
//richard

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
 What exactly defines 'normal use case for a container'?

Well, I'd say acting like a virtual machine is a good start :)

 Not too long ago much of what we can now do with network namespaces
 was not a normal container use case.  Neither "you can't do it now"
 nor "I don't use it like that" should be grounds for a pre-emptive
 nack.  "It will horribly break security assumptions" certainly would
 be.

I agree, and maybe we will get there over time, but this patch is not
the way to do that.

 That's not to say there might not be good reasons why this in particular
 is not appropriate, but ISTM if things are going to be nacked without
 consideration of the patchset itself, we ought to be having a ksummit
 session to come to a consensus [ or receive a decree, presumably by you :)
 but after we have a chance to make our case ] on what things are going to
 be un/acceptable.

I already stood up and publicly said this last year at Plumbers, why
is anything now different?

And this patchset is proof of why it's not a good idea.  You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.

And you are ignoring the notifications to userspace and how namespaces
here would deal with that.

   Serge mentioned something to me about a loopdevfs (?) thing that someone
   else is working on.  That would seem to be a better solution in this
   particular case but I don't know much about it or where it's at.
  
  Ok, let's see those patches then.
 
 I think Seth has a git tree ready, but not sure which branch he'd want
 us to look at.
 
 Splitting a namespaced devtmpfs from loopdevfs discussion might be
 sensible.  However, in defense of a namespaced devtmpfs I'd say
 that for userspace to, at every container startup, bind-mount in
 devices from the global devtmpfs into a private tmpfs (for systemd's
 sake it can't just be on the container rootfs), seems like something
 worth avoiding.

I think having to pick and choose what device nodes you want in a
container is a good thing.  Besides, you would have to do the same thing
in the kernel anyway; what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can?

 PS - Apparently both Parallels and Michael independently
 project devices which are hot-plugged on the host into containers.
 That also seems like something worth talking about (best practices,
 shortcomings, use cases not met by it, any ways that the kernel can
 help out) at ksummit/linuxcon.

I was told that containers would never want devices hotplugged into
them.  What use case has this happening / needed?

thanks,

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Michael H. Warfield

On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote:
 On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
  What exactly defines 'normal use case for a container'?

 Well, I'd say acting like a virtual machine is a good start :)

Ok...  And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
USB devices.  I use the USB hotplug with VirtualBox.  I plug a
configured USB device in and the VirtualBox VM grabs it.  Virtual
machines have loopback devices.  I've used them and using them in
containers is significantly more efficient.  VirtualBox has remote audio
and a host of other device features.

Now we have some agreement.  "Normal" is "acting like a virtual machine".
That's a goal I can agree with.  I want to work toward that goal of
containers acting like a virtual machine just running on a common
kernel with the host.  It's a challenge.  We're getting there.

  Not too long ago much of what we can now do with network namespaces
  was not a normal container use case.  Neither "you can't do it now"
  nor "I don't use it like that" should be grounds for a pre-emptive
  nack.  "It will horribly break security assumptions" certainly would
  be.

 I agree, and maybe we will get there over time, but this patch is not
 the way to do that.

Ok...  We have a goal.  Now we can haggle over the details (to
paraphrase a joke that's as old as I am).

  That's not to say there might not be good reasons why this in particular
  is not appropriate, but ISTM if things are going to be nacked without
  consideration of the patchset itself, we ought to be having a ksummit
  session to come to a consensus [ or receive a decree, presumably by you :)
  but after we have a chance to make our case ] on what things are going to
  be un/acceptable.

 I already stood up and publicly said this last year at Plumbers, why
 is anything now different?

Not much really.  The reality is that more and more people are trying to
use hotplug devices, network interfaces, and loopback devices in
containers just like they would in full para or hw virt machines.  We're
trying to make them work, without it looking like a kludge.  I
personally agree with you that much of this can be done in host user
space and, coming out of LinuxPlumbers last year, I've implemented some
ideas that did not require kernel patches that achieve some of my goals.

 And this patchset is proof of why it's not a good idea.  You really
 didn't do anything with all of the namespace stuff, except change loop.
 That's the only thing that cares, so, just do it there, like I said to
 do so, last August.

 And you are ignoring the notifications to userspace and how namespaces
 here would deal with that.

That's a problem to deal with.  I don't think anyone is ignoring them.

Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on.  That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
   
   Ok, let's see those patches then.
  
  I think Seth has a git tree ready, but not sure which branch he'd want
  us to look at.
  
  Splitting a namespaced devtmpfs from loopdevfs discussion might be
  sensible.  However, in defense of a namespaced devtmpfs I'd say
  that for userspace to, at every container startup, bind-mount in
  devices from the global devtmpfs into a private tmpfs (for systemd's
  sake it can't just be on the container rootfs), seems like something
  worth avoiding.

 I think having to pick and choose what device nodes you want in a
 container is a good thing.

Both static and dynamic devices.  It's got to support hotplug.  We have
(I have) use cases.  That's what I'm trying to do with host udev rules
and some custom configurations.  I can play games with udev rules.
Maybe we can keep the user space policies in user space and not burden
the kernel.

 Besides, you would have to do the same thing
 in the kernel anyway; what's wrong with userspace making the decision
 here, especially as it knows exactly what it wants to do much more so
 than the kernel ever can?

IMHO, there's nothing wrong with that as long as we agree on how it's to
be done.  I'm not convinced that it can all be done in user space and
I'm not convinced that name spaced devtmpfs is the magic pill to make it
all go away either.  Making the user space make the decisions and having
the kernel enforce them is a principle worth considering.

  PS - Apparently both Parallels and Michael independently
  project devices which are hot-plugged on the host into containers.
  That also seems like something worth talking about (best practices,
  shortcomings, use cases not met by it, any ways that the kernel can
  help out) at ksummit/linuxcon.

 I was told that containers would never want devices hotplugged into
 them.

Interesting.  You were told they (who they?) would never want them?  Who
said that?  I would have never thought that given that other
implementations can provide that.  I would

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Serge Hallyn

Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org):
 On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
  What exactly defines 'normal use case for a container'?
 
 Well, I'd say acting like a virtual machine is a good start :)
 
  Not too long ago much of what we can now do with network namespaces
  was not a normal container use case.  Neither "you can't do it now"
  nor "I don't use it like that" should be grounds for a pre-emptive
  nack.  "It will horribly break security assumptions" certainly would
  be.
 
 I agree, and maybe we will get there over time, but this patch is not
 the way to do that.

Ok.  [ I/we may be asking for more details later, but think there is enough
below :), particularly the point about event forwarding ]  Thanks.

  That's not to say there might not be good reasons why this in particular
  is not appropriate, but ISTM if things are going to be nacked without
  consideration of the patchset itself, we ought to be having a ksummit
  session to come to a consensus [ or receive a decree, presumably by you :)
  but after we have a chance to make our case ] on what things are going to
  be un/acceptable.
 
 I already stood up and publicly said this last year at Plumbers, why
 is anything now different?

Well I've simply never had a chance to talk to you since then to find out
exactly what it is that is unacceptable, and why.  And, of course, code
makes it easier to discuss these things.

 And this patchset is proof of why it's not a good idea.  You really
 didn't do anything with all of the namespace stuff, except change loop.
 That's the only thing that cares, so, just do it there, like I said to
 do so, last August.

Sorry, just do it where?

 And you are ignoring the notifications to userspace and how namespaces
 here would deal with that.

Good point.  Addressing that is at the same time necessary, interesting,
and complicated.

Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on.  That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
   
   Ok, let's see those patches then.
  
  I think Seth has a git tree ready, but not sure which branch he'd want
  us to look at.
  
  Splitting a namespaced devtmpfs from loopdevfs discussion might be
  sensible.  However, in defense of a namespaced devtmpfs I'd say
  that for userspace to, at every container startup, bind-mount in
  devices from the global devtmpfs into a private tmpfs (for systemd's
  sake it can't just be on the container rootfs), seems like something
  worth avoiding.
 
 I think having to pick and choose what device nodes you want in a
 container is a good thing.  Besides, you would have to do the same thing
 in the kernel anyway; what's wrong with userspace making the decision
 here, especially as it knows exactly what it wants to do much more so
 than the kernel ever can?

For 'real' devices that sounds sensible.  The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them.  I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?

  PS - Apparently both Parallels and Michael independently
  project devices which are hot-plugged on the host into containers.
  That also seems like something worth talking about (best practices,
  shortcomings, use cases not met by it, any ways that the kernel can
  help out) at ksummit/linuxcon.
 
 I was told that containers would never want devices hotplugged into
 them.  What use case has this happening / needed?

I'm pretty sure I didn't say that (looks around nervously).  But I guess
we are combining two topics here, the loop pseudofs and the namespaced
devtmpfs.

The use case of loop-control device and loop pseudofs is to have
multiple chrooted/namespaced programs be able to grab a loop device
on demand which they can use for the obvious things (building a livecd,
extracting file contents, etc) without stepping on each other's toes.  The
namespaced devtmpfs is not required for this.
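
As a point of reference, grabbing a loop device on demand through the
existing host-wide /dev/loop-control node can be sketched as below. This
shows today's interface, not the proposed loop pseudofs; the function name
and the choice to return None on failure are assumptions made for this
example. The ioctl numbers are the ones defined in <linux/loop.h>.

```python
import fcntl
import os

# ioctl request numbers from <linux/loop.h>
LOOP_CTL_ADD = 0x4C80
LOOP_CTL_REMOVE = 0x4C81
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop(control_path="/dev/loop-control"):
    """Ask the loop driver for an unused loop device; return its path or None."""
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None  # no loop-control node visible here, or no permission
    try:
        # The ioctl's return value is the index N of a free /dev/loopN.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
    return "/dev/loop%d" % index
```

Because the index space is shared by the whole host, two containers using
this node directly can see and race for each other's devices, which is the
"stepping on each other's toes" problem a per-mount pseudofs would address.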

One advantage of a namespaced devtmpfs would be sane-looking devices
in unprivileged containers.  Currently we have to bind-mount the host's
/dev/{full,zero,etc} which, due to uid and gid mappings, then shows up
as:

crw-rw-rw- 1 nobody nogroup   1, 7 May 12 13:35 full
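
The nobody/nogroup ownership follows from how ids are translated into a
user namespace. A minimal sketch of that translation, assuming the
"ns_start host_start length" line format of /proc/<pid>/uid_map and the
default overflow id of 65534; the helper names and the example mapping are
made up for illustration:

```python
OVERFLOW_ID = 65534  # default overflowuid/overflowgid, shown as nobody/nogroup


def parse_id_map(text):
    """Parse uid_map/gid_map style lines: ns_start host_start length."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def host_to_ns(host_id, id_map):
    """Translate a host id to the id seen inside the namespace."""
    for ns_start, host_start, length in id_map:
        if host_start <= host_id < host_start + length:
            return ns_start + (host_id - host_start)
    return OVERFLOW_ID  # unmapped ids are reported as the overflow id


# Hypothetical unprivileged container: host ids 100000..165535 map to 0..65535.
example_map = parse_id_map("0 100000 65536\n")
```

With this mapping, a bind-mounted node owned by host root (uid 0) has no
entry in the map, so host_to_ns(0, example_map) yields 65534, while a node
owned by host uid 100000 would appear as uid 0 inside the container.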

Also you mentioned uevent forwarding above.  Michael has talked several
times about having userspace on the host 'pass' devices into the
container.  One thing which I believe he and Eric have discussed
before was how to have userspace in the container be notified when
a device is passed in.  It seems to me that at least this is something
that would be simpler done from devtmpfs.  I could be wrong on this -
Michael do you have any updates or

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-15 Thread Greg Kroah-Hartman

On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
  I think having to pick and choose what device nodes you want in a
  container is a good thing.  Besides, you would have to do the same thing
  in the kernel anyway; what's wrong with userspace making the decision
  here, especially as it knows exactly what it wants to do much more so
  than the kernel ever can?
 
 For 'real' devices that sounds sensible.  The thing about loop devices
 is that we simply want to allow a container to say "give me a loop
 device to use" and have it receive a unique loop device (or 3), without
 having to pre-assign them.  I think that would be cleaner to do using
 a pseudofs and loop-control device, rather than having to have a
 daemon in userspace on the host farming those out in response to
 some, I don't know, dbus request?

I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that.  So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.

greg k-h

Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-14 Thread Greg Kroah-Hartman

On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote:
> On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > > Using devtmpfs is one possible
> > > > solution, and it would have the added benefit of making container setup
> > > > simpler. But simply letting containers mount devtmpfs isn't sufficient
> > > > since the container may need to see a different, more limited set of
> > > > devices, and because different environments making modifications to
> > > > the filesystem could lead to conflicts.
> > > > 
> > > > This series solves these problems by assigning devices to user
> > > > namespaces. Each device has an "owner" namespace which specifies which
> > > > devtmpfs mount the device should appear in as well as allowing privileged
> > > > operations on the device from that namespace. This defaults to
> > > > init_user_ns. There's also an ns_global flag to indicate a device should
> > > > appear in all devtmpfs mounts.
> > 
> > > I'd strongly argue that this isn't even a "problem" at all.  And, as I
> > > said at the Plumbers conference last year, adding namespaces to devices
> > > isn't going to happen, sorry.  Please don't continue down this path.
> > 
> > I was just mentioning that to Serge just a week or so ago reminding him
> > of what you told all of us face to face back then.  We were having a
> > discussion over loop devices into containers and this topic came up.
> 
> It was the loop device use case that got me started down this path in
> the first place, so I don't personally have any interest in physical
> devices right now (though I was sure others would).

Why do you want to give access to a loop device to a container?
Shouldn't you set up the loop devices before creating the container and
then pass those mount points into the container?  I thought that was how
things worked today, or am I missing something?

Giving a container the ability to create a loop device at all is a
horrid idea; as you have pointed out, lots of information leakage could
easily happen.

greg k-h