Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
"Serge E. Hallyn" writes:

>> I was aware of FUSE but hadn't ever looked at it much. Looking at it
>> now, this isn't going to satisfy any of the use cases I know about,
>> which are wanting to use filesystems supported in-kernel (isofs, ext*).
>> I don't see that any of these have a FUSE implementation, and I think we
>> gain more from figuring out how to use in-kernel filesystems in
>> containers than trying to find a way to shoehorn selected filesystems
>> into FUSE.
>
> That's why I was wondering how much work it would be to auto-generate
> fuse fs support from the in-kernel source.

So at a quick look I have found fuseext2, fuseiso and mountlo-0.5 (which claims to have supported all the in-kernel filesystems with the help of user mode linux).

Given that the first two are just an apt-get install away, fuse really looks like the shortest path to being able to mount an iso and do other interesting things.

We probably want something more, but only when performance becomes a bottleneck.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting James Bottomley (james.bottom...@hansenpartnership.com): > On Mon, 2014-05-26 at 00:24 +0200, Serge E. Hallyn wrote: > > Quoting James Bottomley (james.bottom...@hansenpartnership.com): > > > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote: > > > > Quoting James Bottomley (james.bottom...@hansenpartnership.com): > > > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: > > > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote: > > > > > > > Quoting Andy Lutomirski (l...@amacapital.net): > > > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" > > > > > > >> wrote: > > > > > > >>> > > > > > > >>> Quoting Richard Weinberger (rich...@nod.at): > > > > > > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > > > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > > > > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > > > > > > >> wrote: > > > > > > >>> Then don't use a container to build such a thing, or fix > > > > > > >>> the build scripts to not do that :) > > > > > > >> > > > > > > >> I second this. To me it looks like some folks try to (ab)use > > > > > > >> Linux containers for purposes where KVM > > > > > > >> would much better fit in. Please don't put more complexity > > > > > > >> into containers. They are already horrible > > > > > > >> complex and error prone. > > > > > > > > > > > > > > I, naturally, disagree :) The only use case which is > > > > > > > inherently not valid for containers is running a > > > > > > > kernel. Practically speaking there are other things which > > > > > > > likely will never be possible, but if someone > > > > > > > offers a way to do something in containers, "you can't do > > > > > > > that in containers" is not an apropos response. > > > > > > > > > > > > > > "That abstraction is wrong" is certainly valid, as when vpids > > > > > > > were originally proposed and rejected, > > > > > > > resulting in the development of pid namespaces. 
"We have to > > > > > > > work out (x) first" can be valid (and I can > > > > > > > think of examples here), assuming it's not just trying to > > > > > > > hide behind a catch-22/chicken-egg problem. > > > > > > > > > > > > > > Finally, saying "containers are complex and error prone" is > > > > > > > conflating several large suites of userspace > > > > > > > code and many kernel features which support them. Being more > > > > > > > precise would, if the argument is valid, lend > > > > > > > it a lot more weight. > > > > > > > > > > > > We (my company) use Linux containers since 2011 in production. > > > > > > First LXC, now libvirt-lxc. To understand the > > > > > > internals better I also wrote my own userspace to create/start > > > > > > containers. There are so many things which can > > > > > > hurt you badly. With user namespaces we expose a really big > > > > > > attack surface to regular users. I.e. Suddenly a > > > > > > user is allowed to mount filesystems. > > > > > > >>> > > > > > > >>> That is currently not the case. They can mount some virtual > > > > > > >>> filesystems and do bind mounts, but cannot mount > > > > > > >>> most real filesystems. This keeps us protected (for now) from > > > > > > >>> potentially unsafe superblock readers in the > > > > > > >>> kernel. > > > > > > >>> > > > > > > Ask Andy, he found already lots of nasty things... > > > > > > >> > > > > > > >> I don't think I have anything brilliant to add to this > > > > > > >> discussion right now, except possibly: > > > > > > >> > > > > > > >> ISTM that Linux distributions are, in general, vulnerable to all > > > > > > >> kinds of shenanigans that would happen if an > > > > > > >> untrusted user can cause a block device to appear. That user > > > > > > >> doesn't need permission to mount it > > > > > > > > > > > > > > Interesting point. 
This would further suggest that we absolutely > > > > > > > must ensure that a loop device which shows up in > > > > > > > the container does not also show up in the host. > > > > > > > > > > > > Can I suggest the usage of the devices cgroup to achieve that? > > > > > > > > > > Not really ... cgroups impose resource limits, it's namespaces that > > > > > impose visibility separations. In theory this can be done with the > > > > > device namespace that's been proposed; however, a simpler way is > > > > > simply > > > > > to rm the device node in the host and mknod it in the guest. I don't > > > > > really see host visibility as a huge problem: in a shared OS > > > > > virtualisation it's not really possible securely to separate the guest > > > > > from the host (only vice versa). > > > > > > > > > > But I really don't think we want to do it this way. Giving a > > > > > container > > > > > the ability to do a mount is too dangerous. What we want to do is > > > > > int
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Seth Forshee (seth.fors...@canonical.com): > On Fri, May 23, 2014 at 03:23:50PM -0700, Eric W. Biederman wrote: > > Serge Hallyn writes: > > > > > Quoting Eric W. Biederman (ebied...@xmission.com): > > >> > > >> > > >> >> Ultimately the technical challenge is how do we create a block device > > >> >> that is safe for a user who does not have any capabilities to use, and > > >> >> what can we do with that block device to make it useful. > > >> > > > >> > Yes, and I'd like to get started solving those challenges. But I also > > >> > don't think we can address these two points (support partition blkdevs, > > >> > help prevent more priveleged users from using a namespace's loop > > >> > devices) sufficiently while having an implementation completely > > >> > contained within the loop driver as Greg is requesting. > > >> > > >> My key take away from the conversation is that we should reduce the > > >> scope of what is being done to something that makes sense and the > > >> propblems are immediately visible. > > >> > > >> Part of me would like to suggest that fuse and it's ability to imitate > > >> device nodes might be a more appropriate solution, to something that > > > > > > Do you have a link to more info on this? Some googling got me to an > > > interesting but old thread on CUSE, but nothing specifically about fuse > > > doing this. > > > > CUSE is probably what I was thinking of. It is all part of the fuse > > code base in the kernel. And now that I am reminded it is called CUSE > > I go Duh that is a character device... > > > > Fuse and everything it can do is definitely the filesystem I would like > > to see most have the audits to be enabled in user namespace. Fuse > > was built to be sufficiently paranoid to allow this and so it should not > > take a lot to take fuse the rest of the way. > > I was aware of FUSE but hadn't ever looked at it much. 
> Looking at it now, this isn't going to satisfy any of the use cases I know about,
> which are wanting to use filesystems supported in-kernel (isofs, ext*).
> I don't see that any of these have a FUSE implementation, and I think we
> gain more from figuring out how to use in-kernel filesystems in
> containers than trying to find a way to shoehorn selected filesystems
> into FUSE.

That's why I was wondering how much work it would be to auto-generate fuse fs support from the in-kernel source.

-serge
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 23, 2014 at 03:23:50PM -0700, Eric W. Biederman wrote: > Serge Hallyn writes: > > > Quoting Eric W. Biederman (ebied...@xmission.com): > >> > >> > >> >> Ultimately the technical challenge is how do we create a block device > >> >> that is safe for a user who does not have any capabilities to use, and > >> >> what can we do with that block device to make it useful. > >> > > >> > Yes, and I'd like to get started solving those challenges. But I also > >> > don't think we can address these two points (support partition blkdevs, > >> > help prevent more priveleged users from using a namespace's loop > >> > devices) sufficiently while having an implementation completely > >> > contained within the loop driver as Greg is requesting. > >> > >> My key take away from the conversation is that we should reduce the > >> scope of what is being done to something that makes sense and the > >> propblems are immediately visible. > >> > >> Part of me would like to suggest that fuse and it's ability to imitate > >> device nodes might be a more appropriate solution, to something that > > > > Do you have a link to more info on this? Some googling got me to an > > interesting but old thread on CUSE, but nothing specifically about fuse > > doing this. > > CUSE is probably what I was thinking of. It is all part of the fuse > code base in the kernel. And now that I am reminded it is called CUSE > I go Duh that is a character device... > > Fuse and everything it can do is definitely the filesystem I would like > to see most have the audits to be enabled in user namespace. Fuse > was built to be sufficiently paranoid to allow this and so it should not > take a lot to take fuse the rest of the way. I was aware of FUSE but hadn't ever looked at it much. Looking at it now, this isn't going to satisfy any of the use cases I know about, which are wanting to use filesystems supported in-kernel (isofs, ext*). 
I don't see that any of these have a FUSE implementation, and I think we gain more from figuring out how to use in-kernel filesystems in containers than trying to find a way to shoehorn selected filesystems into FUSE.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Mon, 2014-05-26 at 00:24 +0200, Serge E. Hallyn wrote: > Quoting James Bottomley (james.bottom...@hansenpartnership.com): > > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote: > > > Quoting James Bottomley (james.bottom...@hansenpartnership.com): > > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: > > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote: > > > > > > Quoting Andy Lutomirski (l...@amacapital.net): > > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" > > > > > >> wrote: > > > > > >>> > > > > > >>> Quoting Richard Weinberger (rich...@nod.at): > > > > > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > > > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > > > > > >> wrote: > > > > > >>> Then don't use a container to build such a thing, or fix the > > > > > >>> build scripts to not do that :) > > > > > >> > > > > > >> I second this. To me it looks like some folks try to (ab)use > > > > > >> Linux containers for purposes where KVM > > > > > >> would much better fit in. Please don't put more complexity > > > > > >> into containers. They are already horrible > > > > > >> complex and error prone. > > > > > > > > > > > > I, naturally, disagree :) The only use case which is > > > > > > inherently not valid for containers is running a > > > > > > kernel. Practically speaking there are other things which > > > > > > likely will never be possible, but if someone > > > > > > offers a way to do something in containers, "you can't do that > > > > > > in containers" is not an apropos response. > > > > > > > > > > > > "That abstraction is wrong" is certainly valid, as when vpids > > > > > > were originally proposed and rejected, > > > > > > resulting in the development of pid namespaces. 
"We have to > > > > > > work out (x) first" can be valid (and I can > > > > > > think of examples here), assuming it's not just trying to hide > > > > > > behind a catch-22/chicken-egg problem. > > > > > > > > > > > > Finally, saying "containers are complex and error prone" is > > > > > > conflating several large suites of userspace > > > > > > code and many kernel features which support them. Being more > > > > > > precise would, if the argument is valid, lend > > > > > > it a lot more weight. > > > > > > > > > > We (my company) use Linux containers since 2011 in production. > > > > > First LXC, now libvirt-lxc. To understand the > > > > > internals better I also wrote my own userspace to create/start > > > > > containers. There are so many things which can > > > > > hurt you badly. With user namespaces we expose a really big > > > > > attack surface to regular users. I.e. Suddenly a > > > > > user is allowed to mount filesystems. > > > > > >>> > > > > > >>> That is currently not the case. They can mount some virtual > > > > > >>> filesystems and do bind mounts, but cannot mount > > > > > >>> most real filesystems. This keeps us protected (for now) from > > > > > >>> potentially unsafe superblock readers in the > > > > > >>> kernel. > > > > > >>> > > > > > Ask Andy, he found already lots of nasty things... > > > > > >> > > > > > >> I don't think I have anything brilliant to add to this discussion > > > > > >> right now, except possibly: > > > > > >> > > > > > >> ISTM that Linux distributions are, in general, vulnerable to all > > > > > >> kinds of shenanigans that would happen if an > > > > > >> untrusted user can cause a block device to appear. That user > > > > > >> doesn't need permission to mount it > > > > > > > > > > > > Interesting point. This would further suggest that we absolutely > > > > > > must ensure that a loop device which shows up in > > > > > > the container does not also show up in the host. 
> > > > > > > > > > Can I suggest the usage of the devices cgroup to achieve that? > > > > > > > > Not really ... cgroups impose resource limits, it's namespaces that > > > > impose visibility separations. In theory this can be done with the > > > > device namespace that's been proposed; however, a simpler way is simply > > > > to rm the device node in the host and mknod it in the guest. I don't > > > > really see host visibility as a huge problem: in a shared OS > > > > virtualisation it's not really possible securely to separate the guest > > > > from the host (only vice versa). > > > > > > > > But I really don't think we want to do it this way. Giving a container > > > > the ability to do a mount is too dangerous. What we want to do is > > > > intercept the mount in the host and perform it on behalf of the guest as > > > > host root in the guest's mount namespace. If you do it that way, it > > > > > > That doesn't help the problem of guests being able to provide bad input > > > for (basically fuzz
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting James Bottomley (james.bottom...@hansenpartnership.com): > On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote: > > Quoting James Bottomley (james.bottom...@hansenpartnership.com): > > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: > > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote: > > > > > Quoting Andy Lutomirski (l...@amacapital.net): > > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: > > > > >>> > > > > >>> Quoting Richard Weinberger (rich...@nod.at): > > > > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > > > > >> wrote: > > > > >>> Then don't use a container to build such a thing, or fix the > > > > >>> build scripts to not do that :) > > > > >> > > > > >> I second this. To me it looks like some folks try to (ab)use > > > > >> Linux containers for purposes where KVM > > > > >> would much better fit in. Please don't put more complexity into > > > > >> containers. They are already horrible > > > > >> complex and error prone. > > > > > > > > > > I, naturally, disagree :) The only use case which is inherently > > > > > not valid for containers is running a > > > > > kernel. Practically speaking there are other things which likely > > > > > will never be possible, but if someone > > > > > offers a way to do something in containers, "you can't do that in > > > > > containers" is not an apropos response. > > > > > > > > > > "That abstraction is wrong" is certainly valid, as when vpids > > > > > were originally proposed and rejected, > > > > > resulting in the development of pid namespaces. "We have to work > > > > > out (x) first" can be valid (and I can > > > > > think of examples here), assuming it's not just trying to hide > > > > > behind a catch-22/chicken-egg problem. 
> > > > > > > > > > Finally, saying "containers are complex and error prone" is > > > > > conflating several large suites of userspace > > > > > code and many kernel features which support them. Being more > > > > > precise would, if the argument is valid, lend > > > > > it a lot more weight. > > > > > > > > We (my company) use Linux containers since 2011 in production. > > > > First LXC, now libvirt-lxc. To understand the > > > > internals better I also wrote my own userspace to create/start > > > > containers. There are so many things which can > > > > hurt you badly. With user namespaces we expose a really big attack > > > > surface to regular users. I.e. Suddenly a > > > > user is allowed to mount filesystems. > > > > >>> > > > > >>> That is currently not the case. They can mount some virtual > > > > >>> filesystems and do bind mounts, but cannot mount > > > > >>> most real filesystems. This keeps us protected (for now) from > > > > >>> potentially unsafe superblock readers in the > > > > >>> kernel. > > > > >>> > > > > Ask Andy, he found already lots of nasty things... > > > > >> > > > > >> I don't think I have anything brilliant to add to this discussion > > > > >> right now, except possibly: > > > > >> > > > > >> ISTM that Linux distributions are, in general, vulnerable to all > > > > >> kinds of shenanigans that would happen if an > > > > >> untrusted user can cause a block device to appear. That user > > > > >> doesn't need permission to mount it > > > > > > > > > > Interesting point. This would further suggest that we absolutely > > > > > must ensure that a loop device which shows up in > > > > > the container does not also show up in the host. > > > > > > > > Can I suggest the usage of the devices cgroup to achieve that? > > > > > > Not really ... cgroups impose resource limits, it's namespaces that > > > impose visibility separations. 
In theory this can be done with the > > > device namespace that's been proposed; however, a simpler way is simply > > > to rm the device node in the host and mknod it in the guest. I don't > > > really see host visibility as a huge problem: in a shared OS > > > virtualisation it's not really possible securely to separate the guest > > > from the host (only vice versa). > > > > > > But I really don't think we want to do it this way. Giving a container > > > the ability to do a mount is too dangerous. What we want to do is > > > intercept the mount in the host and perform it on behalf of the guest as > > > host root in the guest's mount namespace. If you do it that way, it > > > > That doesn't help the problem of guests being able to provide bad input > > for (basically fuzz) the in-kernel filesystem code. So apparently I'm > > suffering a failure of the imagination - what problem exactly does it solve? > > Well, there's two types of fuzzing, one is on sys_mount, which this > would help with because the host filte
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Sat, 2014-05-24 at 22:25 +, Serge Hallyn wrote: > Quoting James Bottomley (james.bottom...@hansenpartnership.com): > > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: > > > On 05/20/2014 05:19 PM, Serge Hallyn wrote: > > > > Quoting Andy Lutomirski (l...@amacapital.net): > > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: > > > >>> > > > >>> Quoting Richard Weinberger (rich...@nod.at): > > > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > > > >> wrote: > > > >>> Then don't use a container to build such a thing, or fix the > > > >>> build scripts to not do that :) > > > >> > > > >> I second this. To me it looks like some folks try to (ab)use Linux > > > >> containers for purposes where KVM > > > >> would much better fit in. Please don't put more complexity into > > > >> containers. They are already horrible > > > >> complex and error prone. > > > > > > > > I, naturally, disagree :) The only use case which is inherently > > > > not valid for containers is running a > > > > kernel. Practically speaking there are other things which likely > > > > will never be possible, but if someone > > > > offers a way to do something in containers, "you can't do that in > > > > containers" is not an apropos response. > > > > > > > > "That abstraction is wrong" is certainly valid, as when vpids were > > > > originally proposed and rejected, > > > > resulting in the development of pid namespaces. "We have to work > > > > out (x) first" can be valid (and I can > > > > think of examples here), assuming it's not just trying to hide > > > > behind a catch-22/chicken-egg problem. > > > > > > > > Finally, saying "containers are complex and error prone" is > > > > conflating several large suites of userspace > > > > code and many kernel features which support them. 
Being more > > > > precise would, if the argument is valid, lend > > > > it a lot more weight. > > > > > > We (my company) use Linux containers since 2011 in production. First > > > LXC, now libvirt-lxc. To understand the > > > internals better I also wrote my own userspace to create/start > > > containers. There are so many things which can > > > hurt you badly. With user namespaces we expose a really big attack > > > surface to regular users. I.e. Suddenly a > > > user is allowed to mount filesystems. > > > >>> > > > >>> That is currently not the case. They can mount some virtual > > > >>> filesystems and do bind mounts, but cannot mount > > > >>> most real filesystems. This keeps us protected (for now) from > > > >>> potentially unsafe superblock readers in the > > > >>> kernel. > > > >>> > > > Ask Andy, he found already lots of nasty things... > > > >> > > > >> I don't think I have anything brilliant to add to this discussion > > > >> right now, except possibly: > > > >> > > > >> ISTM that Linux distributions are, in general, vulnerable to all kinds > > > >> of shenanigans that would happen if an > > > >> untrusted user can cause a block device to appear. That user doesn't > > > >> need permission to mount it > > > > > > > > Interesting point. This would further suggest that we absolutely must > > > > ensure that a loop device which shows up in > > > > the container does not also show up in the host. > > > > > > Can I suggest the usage of the devices cgroup to achieve that? > > > > Not really ... cgroups impose resource limits, it's namespaces that > > impose visibility separations. In theory this can be done with the > > device namespace that's been proposed; however, a simpler way is simply > > to rm the device node in the host and mknod it in the guest. I don't > > really see host visibility as a huge problem: in a shared OS > > virtualisation it's not really possible securely to separate the guest > > from the host (only vice versa). 
> > > > But I really don't think we want to do it this way. Giving a container > > the ability to do a mount is too dangerous. What we want to do is > > intercept the mount in the host and perform it on behalf of the guest as > > host root in the guest's mount namespace. If you do it that way, it > > That doesn't help the problem of guests being able to provide bad input > for (basically fuzz) the in-kernel filesystem code. So apparently I'm > suffering a failure of the imagination - what problem exactly does it solve? Well, there's two types of fuzzing, one is on sys_mount, which this would help with because the host filters the mount including all parameters and may even redo the mount (from direct to bind etc). If you're thinking the system can be compromised by fuzzing within the filesystem, then yes, I agree, but it's the same vulnerability an unvirtualised
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting James Bottomley (james.bottom...@hansenpartnership.com): > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: > > On 05/20/2014 05:19 PM, Serge Hallyn wrote: > > > Quoting Andy Lutomirski (l...@amacapital.net): > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: > > >>> > > >>> Quoting Richard Weinberger (rich...@nod.at): > > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > > >> wrote: > > >>> Then don't use a container to build such a thing, or fix the build > > >>> scripts to not do that :) > > >> > > >> I second this. To me it looks like some folks try to (ab)use Linux > > >> containers for purposes where KVM > > >> would much better fit in. Please don't put more complexity into > > >> containers. They are already horrible > > >> complex and error prone. > > > > > > I, naturally, disagree :) The only use case which is inherently not > > > valid for containers is running a > > > kernel. Practically speaking there are other things which likely > > > will never be possible, but if someone > > > offers a way to do something in containers, "you can't do that in > > > containers" is not an apropos response. > > > > > > "That abstraction is wrong" is certainly valid, as when vpids were > > > originally proposed and rejected, > > > resulting in the development of pid namespaces. "We have to work out > > > (x) first" can be valid (and I can > > > think of examples here), assuming it's not just trying to hide behind > > > a catch-22/chicken-egg problem. > > > > > > Finally, saying "containers are complex and error prone" is > > > conflating several large suites of userspace > > > code and many kernel features which support them. Being more precise > > > would, if the argument is valid, lend > > > it a lot more weight. > > > > We (my company) use Linux containers since 2011 in production. First > > LXC, now libvirt-lxc. 
To understand the > > internals better I also wrote my own userspace to create/start > > containers. There are so many things which can > > hurt you badly. With user namespaces we expose a really big attack > > surface to regular users. I.e. Suddenly a > > user is allowed to mount filesystems. > > >>> > > >>> That is currently not the case. They can mount some virtual > > >>> filesystems and do bind mounts, but cannot mount > > >>> most real filesystems. This keeps us protected (for now) from > > >>> potentially unsafe superblock readers in the > > >>> kernel. > > >>> > > Ask Andy, he found already lots of nasty things... > > >> > > >> I don't think I have anything brilliant to add to this discussion right > > >> now, except possibly: > > >> > > >> ISTM that Linux distributions are, in general, vulnerable to all kinds > > >> of shenanigans that would happen if an > > >> untrusted user can cause a block device to appear. That user doesn't > > >> need permission to mount it > > > > > > Interesting point. This would further suggest that we absolutely must > > > ensure that a loop device which shows up in > > > the container does not also show up in the host. > > > > Can I suggest the usage of the devices cgroup to achieve that? > > Not really ... cgroups impose resource limits, it's namespaces that > impose visibility separations. In theory this can be done with the > device namespace that's been proposed; however, a simpler way is simply > to rm the device node in the host and mknod it in the guest. I don't > really see host visibility as a huge problem: in a shared OS > virtualisation it's not really possible securely to separate the guest > from the host (only vice versa). > > But I really don't think we want to do it this way. Giving a container > the ability to do a mount is too dangerous. What we want to do is > intercept the mount in the host and perform it on behalf of the guest as > host root in the guest's mount namespace. 
> If you do it that way, it

That doesn't help the problem of guests being able to provide bad input for (basically fuzz) the in-kernel filesystem code. So apparently I'm suffering a failure of the imagination - what problem exactly does it solve?

> doesn't really matter what device actually shows up in the guest, as
> long as the host knows what to do when the mount request comes along.
>
> James
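To make the proposal under discussion concrete, here is a minimal Python sketch of a host-side helper that checks a guest's mount request and then performs the mount inside the guest's mount namespace. Everything named here (`ALLOWED_FSTYPES`, `check_request`, `mount_for_guest`) is hypothetical and not from the thread; only `setns(2)` and `mount(2)` are real interfaces. How the host intercepts the request is not shown. As noted above, filtering the mount parameters does nothing about malformed data in the filesystem image itself.

```python
import ctypes
import os

# Constants from <sched.h> and <sys/mount.h>.
CLONE_NEWNS = 0x00020000
MS_RDONLY = 0x1
MS_NOSUID = 0x2
MS_NODEV = 0x4

# Hypothetical host policy: only these filesystem types may be requested.
ALLOWED_FSTYPES = {"iso9660"}


def check_request(fstype, flags=0):
    """Return the flags the host will actually use, or raise PermissionError."""
    if fstype not in ALLOWED_FSTYPES:
        raise PermissionError("filesystem type not allowed: %s" % fstype)
    # Assumed policy: always force a read-only, nosuid, nodev mount.
    return flags | MS_RDONLY | MS_NOSUID | MS_NODEV


def mount_for_guest(guest_pid, source, target, fstype, flags=0):
    """Perform a filtered mount in the mount namespace of guest_pid.

    Must run as host root in a single-threaded process. After setns(),
    source and target are resolved in the guest's view of the filesystem.
    """
    flags = check_request(fstype, flags)
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p)
    fd = os.open("/proc/%d/ns/mnt" % guest_pid, os.O_RDONLY)
    try:
        if libc.setns(fd, CLONE_NEWNS) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
    if libc.mount(source.encode(), target.encode(), fstype.encode(),
                  flags, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The policy check is kept separate from the privileged part so that it can be exercised without root; a real helper would also need to validate `source` and `target`.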
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Serge Hallyn writes:

> Quoting Eric W. Biederman (ebied...@xmission.com):
>>
>> >> Ultimately the technical challenge is how do we create a block device
>> >> that is safe for a user who does not have any capabilities to use, and
>> >> what can we do with that block device to make it useful.
>> >
>> > Yes, and I'd like to get started solving those challenges. But I also
>> > don't think we can address these two points (support partition blkdevs,
>> > help prevent more priveleged users from using a namespace's loop
>> > devices) sufficiently while having an implementation completely
>> > contained within the loop driver as Greg is requesting.
>>
>> My key take away from the conversation is that we should reduce the
>> scope of what is being done to something that makes sense and the
>> propblems are immediately visible.
>>
>> Part of me would like to suggest that fuse and it's ability to imitate
>> device nodes might be a more appropriate solution, to something that
>
> Do you have a link to more info on this? Some googling got me to an
> interesting but old thread on CUSE, but nothing specifically about fuse
> doing this.

CUSE is probably what I was thinking of. It is all part of the fuse code base in the kernel. And now that I am reminded it is called CUSE, I go "Duh, that is a character device..."

Fuse, with everything it can do, is definitely the filesystem I would most like to see audited so that it can be enabled in user namespaces. Fuse was built to be sufficiently paranoid to allow this, and so it should not take a lot to take fuse the rest of the way.

>> just needs block device access and nothing else.
>>
>> For purposes of discussion let's call it unprivloopfs. That can reuse
>> code from the loop device or not as appropriate. Not supporting
>> paritioning I think is a very reasonable first step until it is shown
>> that we can make good use of partitioning support, and there are not
>> better ways of solving the problem.
>> >> I expect the most productive thing to talk about is what is your >> immediate goal? Mounting a filesystem? Building an iso? > > For me it would be taking an iso and making some changes to it to > localize it (i.e. take an install iso and add preseed file). > > Now of course in the end there is no reason why we can't do all of > this with a new suite of libraries which simply uses read/write with > knowledge of the fs layouts to parse and modify the backing files. > My concern there is that duplicating all of the fs code seems unlikely > to improve the soundness of either implementation. Perhaps we can > autogenerate this from the kernel source? Does fuse already do > something like that? I am not aware of that. But I have not worked extensively with fuse. I do agree that finding a way to perform a read-only mount of an ISO by an unprivielged user is a very interesting use case. Given it's interchange medium nature isofs should be as hardened as human possible, and that is likely easier with a read-only filesystem. And at less than 4000 lines of code isofs is auditable. So as a target for unprivileged mounts of a block device isofs looks like a good place to start. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 23, 2014 at 6:16 AM, James Bottomley wrote: > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: >> On 05/20/2014 05:19 PM, Serge Hallyn wrote: >> > Quoting Andy Lutomirski (l...@amacapital.net): >> >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: >> >>> >> >>> Quoting Richard Weinberger (rich...@nod.at): >> Am 15.05.2014 21:50, schrieb Serge Hallyn: >> > Quoting Richard Weinberger (richard.weinber...@gmail.com): >> >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman >> >> wrote: >> >>> Then don't use a container to build such a thing, or fix the build >> >>> scripts to not do that :) >> >> >> >> I second this. To me it looks like some folks try to (ab)use Linux >> >> containers for purposes where KVM >> >> would much better fit in. Please don't put more complexity into >> >> containers. They are already horrible >> >> complex and error prone. >> > >> > I, naturally, disagree :) The only use case which is inherently not >> > valid for containers is running a >> > kernel. Practically speaking there are other things which likely will >> > never be possible, but if someone >> > offers a way to do something in containers, "you can't do that in >> > containers" is not an apropos response. >> > >> > "That abstraction is wrong" is certainly valid, as when vpids were >> > originally proposed and rejected, >> > resulting in the development of pid namespaces. "We have to work out >> > (x) first" can be valid (and I can >> > think of examples here), assuming it's not just trying to hide behind >> > a catch-22/chicken-egg problem. >> > >> > Finally, saying "containers are complex and error prone" is conflating >> > several large suites of userspace >> > code and many kernel features which support them. Being more precise >> > would, if the argument is valid, lend >> > it a lot more weight. >> >> We (my company) use Linux containers since 2011 in production. First >> LXC, now libvirt-lxc. 
To understand the >> internals better I also wrote my own userspace to create/start >> containers. There are so many things which can >> hurt you badly. With user namespaces we expose a really big attack >> surface to regular users. I.e. Suddenly a >> user is allowed to mount filesystems. >> >>> >> >>> That is currently not the case. They can mount some virtual filesystems >> >>> and do bind mounts, but cannot mount >> >>> most real filesystems. This keeps us protected (for now) from >> >>> potentially unsafe superblock readers in the >> >>> kernel. >> >>> >> Ask Andy, he found already lots of nasty things... >> >> >> >> I don't think I have anything brilliant to add to this discussion right >> >> now, except possibly: >> >> >> >> ISTM that Linux distributions are, in general, vulnerable to all kinds of >> >> shenanigans that would happen if an >> >> untrusted user can cause a block device to appear. That user doesn't >> >> need permission to mount it >> > >> > Interesting point. This would further suggest that we absolutely must >> > ensure that a loop device which shows up in >> > the container does not also show up in the host. >> >> Can I suggest the usage of the devices cgroup to achieve that? > > Not really ... cgroups impose resource limits, it's namespaces that > impose visibility separations. In theory this can be done with the > device namespace that's been proposed; however, a simpler way is simply > to rm the device node in the host and mknod it in the guest. I don't > really see host visibility as a huge problem: in a shared OS > virtualisation it's not really possible securely to separate the guest > from the host (only vice versa). > > But I really don't think we want to do it this way. Giving a container > the ability to do a mount is too dangerous. What we want to do is > intercept the mount in the host and perform it on behalf of the guest as > host root in the guest's mount namespace. 
> If you do it that way, it doesn't really matter what device actually
> shows up in the guest, as long as the host knows what to do when the
> mount request comes along.

This is only useful/safe if the host understands what's going on. By the host, I mean the host's udev and other system-level stuff. This is probably fine for disks and such, but it might not be so great for loop devices, FUSE, etc. I already know of one user of containers that wants container-local FUSE mounts. This ought to Just Work (tm), but there's a fair amount of work needed to get there.

--Andy
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: > On 05/20/2014 05:19 PM, Serge Hallyn wrote: > > Quoting Andy Lutomirski (l...@amacapital.net): > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: > >>> > >>> Quoting Richard Weinberger (rich...@nod.at): > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > >> wrote: > >>> Then don't use a container to build such a thing, or fix the build > >>> scripts to not do that :) > >> > >> I second this. To me it looks like some folks try to (ab)use Linux > >> containers for purposes where KVM > >> would much better fit in. Please don't put more complexity into > >> containers. They are already horrible > >> complex and error prone. > > > > I, naturally, disagree :) The only use case which is inherently not > > valid for containers is running a > > kernel. Practically speaking there are other things which likely will > > never be possible, but if someone > > offers a way to do something in containers, "you can't do that in > > containers" is not an apropos response. > > > > "That abstraction is wrong" is certainly valid, as when vpids were > > originally proposed and rejected, > > resulting in the development of pid namespaces. "We have to work out > > (x) first" can be valid (and I can > > think of examples here), assuming it's not just trying to hide behind a > > catch-22/chicken-egg problem. > > > > Finally, saying "containers are complex and error prone" is conflating > > several large suites of userspace > > code and many kernel features which support them. Being more precise > > would, if the argument is valid, lend > > it a lot more weight. > > We (my company) use Linux containers since 2011 in production. First > LXC, now libvirt-lxc. To understand the > internals better I also wrote my own userspace to create/start > containers. There are so many things which can > hurt you badly. 
With user namespaces we expose a really big attack > surface to regular users. I.e. Suddenly a > user is allowed to mount filesystems. > >>> > >>> That is currently not the case. They can mount some virtual filesystems > >>> and do bind mounts, but cannot mount > >>> most real filesystems. This keeps us protected (for now) from > >>> potentially unsafe superblock readers in the > >>> kernel. > >>> > Ask Andy, he found already lots of nasty things... > >> > >> I don't think I have anything brilliant to add to this discussion right > >> now, except possibly: > >> > >> ISTM that Linux distributions are, in general, vulnerable to all kinds of > >> shenanigans that would happen if an > >> untrusted user can cause a block device to appear. That user doesn't need > >> permission to mount it > > > > Interesting point. This would further suggest that we absolutely must > > ensure that a loop device which shows up in > > the container does not also show up in the host. > > Can I suggest the usage of the devices cgroup to achieve that? Not really ... cgroups impose resource limits, it's namespaces that impose visibility separations. In theory this can be done with the device namespace that's been proposed; however, a simpler way is simply to rm the device node in the host and mknod it in the guest. I don't really see host visibility as a huge problem: in a shared OS virtualisation it's not really possible securely to separate the guest from the host (only vice versa). But I really don't think we want to do it this way. Giving a container the ability to do a mount is too dangerous. What we want to do is intercept the mount in the host and perform it on behalf of the guest as host root in the guest's mount namespace. If you do it that way, it doesn't really matter what device actually shows up in the guest, as long as the host knows what to do when the mount request comes along. 
James
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 05/20/2014 05:19 PM, Serge Hallyn wrote: > Quoting Andy Lutomirski (l...@amacapital.net): >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: >>> >>> Quoting Richard Weinberger (rich...@nod.at): Am 15.05.2014 21:50, schrieb Serge Hallyn: > Quoting Richard Weinberger (richard.weinber...@gmail.com): >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman >> wrote: >>> Then don't use a container to build such a thing, or fix the build >>> scripts to not do that :) >> >> I second this. To me it looks like some folks try to (ab)use Linux >> containers for purposes where KVM >> would much better fit in. Please don't put more complexity into >> containers. They are already horrible >> complex and error prone. > > I, naturally, disagree :) The only use case which is inherently not > valid for containers is running a > kernel. Practically speaking there are other things which likely will > never be possible, but if someone > offers a way to do something in containers, "you can't do that in > containers" is not an apropos response. > > "That abstraction is wrong" is certainly valid, as when vpids were > originally proposed and rejected, > resulting in the development of pid namespaces. "We have to work out (x) > first" can be valid (and I can > think of examples here), assuming it's not just trying to hide behind a > catch-22/chicken-egg problem. > > Finally, saying "containers are complex and error prone" is conflating > several large suites of userspace > code and many kernel features which support them. Being more precise > would, if the argument is valid, lend > it a lot more weight. We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers. There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. 
>>>> Suddenly a user is allowed to mount filesystems.
>>>
>>> That is currently not the case. They can mount some virtual filesystems
>>> and do bind mounts, but cannot mount most real filesystems. This keeps
>>> us protected (for now) from potentially unsafe superblock readers in the
>>> kernel.
>>>
>>>> Ask Andy, he found already lots of nasty things...
>>
>> I don't think I have anything brilliant to add to this discussion right now,
>> except possibly:
>>
>> ISTM that Linux distributions are, in general, vulnerable to all kinds of
>> shenanigans that would happen if an untrusted user can cause a block device
>> to appear. That user doesn't need permission to mount it
>
> Interesting point. This would further suggest that we absolutely must ensure
> that a loop device which shows up in the container does not also show up in
> the host.

Can I suggest the usage of the devices cgroup to achieve that?

Marian

>> or even necessarily to change its contents on the fly.
>>
>> E.g. what happens if you boot a machine that contains a malicious disk image
>> that has the same partition UUID as /? Nothing good, I imagine.
>>
>> So if we're going to go down this road, we really need some way to tell the
>> host that certain devices are not trusted.
>>
>> --Andy

--
Marian Marinov
Founder & CEO of 1H Ltd.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Eric W. Biederman (ebied...@xmission.com):
> >> Ultimately the technical challenge is how do we create a block device
> >> that is safe for a user who does not have any capabilities to use, and
> >> what can we do with that block device to make it useful.
> >
> > Yes, and I'd like to get started solving those challenges. But I also
> > don't think we can address these two points (support partition blkdevs,
> > help prevent more privileged users from using a namespace's loop
> > devices) sufficiently while having an implementation completely
> > contained within the loop driver as Greg is requesting.
>
> My key take away from the conversation is that we should reduce the
> scope of what is being done to something that makes sense and the
> problems are immediately visible.
>
> Part of me would like to suggest that fuse and its ability to imitate
> device nodes might be a more appropriate solution, to something that

Do you have a link to more info on this? Some googling got me to an interesting but old thread on CUSE, but nothing specifically about fuse doing this.

> just needs block device access and nothing else.
>
> For purposes of discussion let's call it unprivloopfs. That can reuse
> code from the loop device or not as appropriate. Not supporting
> partitioning I think is a very reasonable first step until it is shown
> that we can make good use of partitioning support, and there are not
> better ways of solving the problem.
>
> I expect the most productive thing to talk about is what is your
> immediate goal? Mounting a filesystem? Building an iso?

For me it would be taking an iso and making some changes to it to localize it (i.e. take an install iso and add a preseed file).

Now of course in the end there is no reason why we can't do all of this with a new suite of libraries which simply uses read/write with knowledge of the fs layouts to parse and modify the backing files.
My concern there is that duplicating all of the fs code seems unlikely to improve the soundness of either implementation. Perhaps we can autogenerate this from the kernel source? Does fuse already do something like that?

> We have a long history with the namespace support of punting on issues
> and not solving them until a long term maintainable solution becomes
> clear. Let's do what we can to make the problem and the solution clear.

-serge
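To make the "read/write with knowledge of the fs layouts" idea a bit more concrete, here is a minimal Python sketch (not something posted in the thread) that pulls a few fields out of an ISO 9660 image using nothing but seek() and read(), so no mount and no privileges are involved. The function and helper names are made up for illustration, and the sketch assumes the primary volume descriptor is the first descriptor at sector 16, which is the usual layout. It parses only the header; a real tool would also have to walk directory records, which is exactly the duplicated fs code the concern above is about.

```python
import io
import struct

SECTOR = 2048
PVD_OFFSET = 16 * SECTOR  # the first 16 sectors are the reserved system area


def read_primary_volume_descriptor(image):
    """Parse a few fields of an ISO 9660 primary volume descriptor.

    `image` is any seekable binary file object; only seek() and read()
    are used, so this works on a plain backing file.
    """
    image.seek(PVD_OFFSET)
    pvd = image.read(SECTOR)
    # Byte 0 is the descriptor type (1 = primary), bytes 1..5 are "CD001".
    if len(pvd) < SECTOR or pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("not an ISO 9660 primary volume descriptor")
    volume_id = pvd[40:72].decode("ascii", "replace").rstrip()
    # Numeric fields are stored both-endian; read the little-endian half.
    (blocks,) = struct.unpack_from("<I", pvd, 80)
    (block_size,) = struct.unpack_from("<H", pvd, 128)
    return {"volume_id": volume_id, "blocks": blocks, "block_size": block_size}


def build_minimal_pvd_image(volume_id, blocks, block_size=SECTOR):
    """Hypothetical helper: an in-memory image holding only a bare PVD."""
    pvd = bytearray(SECTOR)
    pvd[0] = 1
    pvd[1:6] = b"CD001"
    pvd[6] = 1  # descriptor version
    pvd[40:72] = volume_id.ljust(32)[:32].encode("ascii")
    struct.pack_into("<I", pvd, 80, blocks)
    struct.pack_into(">I", pvd, 84, blocks)
    struct.pack_into("<H", pvd, 128, block_size)
    struct.pack_into(">H", pvd, 130, block_size)
    return io.BytesIO(bytes(PVD_OFFSET) + bytes(pvd))
```

On a real image one would call read_primary_volume_descriptor(open(path, "rb")); the in-memory builder exists only so the parser can be exercised without an ISO at hand.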
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
>> Ultimately the technical challenge is how do we create a block device
>> that is safe for a user who does not have any capabilities to use, and
>> what can we do with that block device to make it useful.
>
> Yes, and I'd like to get started solving those challenges. But I also
> don't think we can address these two points (support partition blkdevs,
> help prevent more privileged users from using a namespace's loop
> devices) sufficiently while having an implementation completely
> contained within the loop driver as Greg is requesting.

My key takeaway from the conversation is that we should reduce the scope of what is being done to something that makes sense and where the problems are immediately visible.

Part of me would like to suggest that fuse and its ability to imitate device nodes might be a more appropriate solution for something that just needs block device access and nothing else.

For purposes of discussion let's call it unprivloopfs. That can reuse code from the loop device or not as appropriate. Not supporting partitioning I think is a very reasonable first step until it is shown that we can make good use of partitioning support, and there are not better ways of solving the problem.

I expect the most productive thing to talk about is what is your immediate goal? Mounting a filesystem? Building an iso?

We have a long history with the namespace support of punting on issues and not solving them until a long term maintainable solution becomes clear. Let's do what we can to make the problem and the solution clear.

Eric
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Serge Hallyn (serge.hal...@ubuntu.com): > Quoting Seth Forshee (seth.fors...@canonical.com): > > On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote: > > > Quoting Seth Forshee (seth.fors...@canonical.com): > > > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote: > > > > > Greg Kroah-Hartman writes: > > > > > > > > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > > > >> > I think having to pick and choose what device nodes you want in a > > > > > >> > container is a good thing. Becides, you would have to do the > > > > > >> > same thing > > > > > >> > in the kernel anyway, what's wrong with userspace making the > > > > > >> > decision > > > > > >> > here, especially as it knows exactly what it wants to do much > > > > > >> > more so > > > > > >> > than the kernel ever can. > > > > > >> > > > > > >> For 'real' devices that sounds sensible. The thing about loop > > > > > >> devices > > > > > >> is that we simply want to allow a container to say "give me a loop > > > > > >> device to use" and have it receive a unique loop device (or 3), > > > > > >> without > > > > > >> having to pre-assign them. I think that would be cleaner to do > > > > > >> using > > > > > >> a pseudofs and loop-control device, rather than having to have a > > > > > >> daemon in userspace on the host farming those out in response to > > > > > >> some, I don't know, dbus request? > > > > > > > > > > > > I agree that loop devices would be nice to have in a container, and > > > > > > that > > > > > > the existing loop interface doesn't really lend itself to that. So > > > > > > create a new type of thing that acts like a loop device in a > > > > > > container. > > > > > > But don't try to mess with the whole driver core just for a single > > > > > > type > > > > > > of device. > > > > > > > > > > Yes. Something like devpts (without the newinstance option). Built to > > > > > allow unprivileged users to create loopback devices. 
> > > > That's where I started, and I've got code, so I guess I'll clean it up
> > > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > > gets to do privileged block device ioctls, including reading partitions
> > >
> > > Sorry, where did that come from? What Eric was referring to below is
> > > the fs superblock readers not being trusted. Maybe I glossed over another
> > > email where it was mentioned?
> >
> > You must have. Take a look at [1].
> >
> > To repeat the point: the ioctl to reread partitions (along with several
> > other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> > change this to an ns_capable check without at minimum the block layer
> > knowing about the namespace associated with the block device. Ergo we
>
> Which only means those changes are necessary :)
>
> So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
> is interesting (and, depending on the implementation, acceptable). That
> necessarily includes the minimal blockdev changes to support it.
>
> > can't reread partitions if this is done entirely within the loop driver
> > via a pseudo fs.
> >
> > [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191

Hm, yeah, I was confuddling two issues.

Nevertheless, for real block devices I absolutely agree. For loop devices I don't. My answer to

> I don't think unprivileged containers should be able to do partitioning.
> An unprivileged user can't do that, so why should a container be any
> different?

would be that the loop device is a convenience built atop the backing image, and if the user had the rights to loop-attach the backing image, he can just as well partition it using write(2), so why artificially place this limit?

Nevertheless this is not really a debate worth having until we have a blockdev fs mountable in a userns. My main interest currently is with privileged containers. I think we can learn plenty from that for now.
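As an illustration of the write(2) argument above, here is a rough Python sketch (not from the thread) that puts a one-entry MBR partition table into an ordinary backing file using plain file writes and then reads it back. No loop device, ioctl, or capability is involved, only write access to the file. The function names, the sizes in demo(), and the choice of type 0x83 are assumptions made for the example.

```python
import os
import struct
import tempfile

SECTOR_SIZE = 512
TABLE_OFFSET = 446  # the partition table follows the boot code area
# status, CHS start, type, CHS end, LBA start, sector count (16 bytes)
ENTRY = struct.Struct("<B3sB3sII")
LBA_ONLY_CHS = b"\xfe\xff\xff"  # conventional filler when only LBA is used


def write_single_partition(path, start_lba, num_sectors, part_type=0x83):
    """Write a one-entry MBR partition table into an existing image file."""
    entry = ENTRY.pack(0x00, LBA_ONLY_CHS, part_type, LBA_ONLY_CHS,
                       start_lba, num_sectors)
    with open(path, "r+b") as img:
        img.seek(TABLE_OFFSET)
        img.write(entry + bytes(3 * ENTRY.size))  # clear the other 3 slots
        img.seek(510)
        img.write(b"\x55\xaa")  # boot signature marking a valid MBR


def read_partitions(path):
    """Return (type, start_lba, num_sectors) for each non-empty entry."""
    with open(path, "rb") as img:
        mbr = img.read(SECTOR_SIZE)
    if len(mbr) < SECTOR_SIZE or mbr[510:512] != b"\x55\xaa":
        return []
    parts = []
    for i in range(4):
        fields = ENTRY.unpack_from(mbr, TABLE_OFFSET + i * ENTRY.size)
        _status, _chs0, ptype, _chs1, start, count = fields
        if ptype != 0:
            parts.append((ptype, start, count))
    return parts


def demo():
    """Partition a 4 MiB scratch image and return what was read back."""
    fd, path = tempfile.mkstemp(suffix=".img")
    try:
        os.ftruncate(fd, 4 * 1024 * 1024)  # 8192 sectors of 512 bytes
        os.close(fd)
        write_single_partition(path, start_lba=2048, num_sectors=6144)
        return read_partitions(path)
    finally:
        os.unlink(path)
```

What the file write cannot do is make the kernel create block devices for the new partitions on an attached loop device; that step is the partition re-read ioctl the thread is arguing about.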
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Mon, May 19, 2014 at 05:04:55PM -0700, Eric W. Biederman wrote:
> Seth Forshee writes:
>
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.
>
> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.
>
> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like. What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

But Greg did say that "unprivileged" or "secure" containers (depending on whose terminology you're using) should not be able to do partitioning [1]. I don't really understand this stance though, as I don't see what possible security problems arise from letting root in a user ns do BLKRRPART on a block device that it's explicitly been granted privileged use of.

Assuming we come to an agreement that root in a user ns can do BLKRRPART on some devices, we've got two issues. First, the block layer enforces this restriction so it has to be aware of what namespace has privileges for the device, but Greg wants a solution localized to the loop driver. Second, if we're using a loop pseudo fs then we'd logically want block devices for the partitions in the loop fs, so we have to create some mechanism for the loop driver to get notified about these devices being created.
> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

The patches I posted have mechanisms to at least mitigate the problem. First, anyone using loop-control to find a free loop device will never get a device allocated to a different user ns (the loop pseudo fs code I have also does this). Second, a given loop block device would only show up in the devtmpfs of the namespace which owned that device. So a sufficiently privileged user isn't completely prevented from using the devices, but since they would have to explicitly mknod the block device node it should prevent accidental use by a more privileged user.

But I also brought this up previously, and Greg argued that it isn't a real issue [1].

> To get to the point where this is really and truly usable I expect to be
> technically daunting.
>
> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Yes, and I'd like to get started solving those challenges. But I also don't think we can address these two points (support partition blkdevs, help prevent more privileged users from using a namespace's loop devices) sufficiently while having an implementation completely contained within the loop driver as Greg is requesting.

Thanks,
Seth

> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.
>
> Eric

[1] http://www.spinics.net/linux/lists/kernel/msg1744750.html
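For readers who have not used it, BLKRRPART is the "re-read partition table" ioctl referred to above. A minimal Python sketch of issuing it from userspace follows; the helper names are hypothetical, and the sketch makes no claim about which errno a given caller gets on a real device, since that depends on the kernel's permission check, which is the very thing under discussion.

```python
import errno
import fcntl
import os

# <linux/fs.h> defines BLKRRPART as _IO(0x12, 95).
BLKRRPART = (0x12 << 8) | 95  # 0x125F


def reread_partitions(path):
    """Ask the kernel to re-read the partition table of the device at `path`.

    Returns 0 on success, otherwise the errno reported by open() or ioctl().
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno
    try:
        fcntl.ioctl(fd, BLKRRPART)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


def describe(result):
    """Turn a result from reread_partitions() into a short label."""
    return "ok" if result == 0 else errno.errorcode.get(result, str(result))
```

A caller would run something like describe(reread_partitions("/dev/loop0")), where the device path is only an example.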
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Andy Lutomirski (l...@amacapital.net): > On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote: > > > > Quoting Richard Weinberger (rich...@nod.at): > > > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > > > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > > > >> wrote: > > > >>> Then don't use a container to build such a thing, or fix the build > > > >>> scripts to not do that :) > > > >> > > > >> I second this. > > > >> To me it looks like some folks try to (ab)use Linux containers > > > >> for purposes where KVM would much better fit in. > > > >> Please don't put more complexity into containers. They are already > > > >> horrible complex > > > >> and error prone. > > > > > > > > I, naturally, disagree :) The only use case which is inherently not > > > > valid for containers is running a kernel. Practically speaking there > > > > are other things which likely will never be possible, but if someone > > > > offers a way to do something in containers, "you can't do that in > > > > containers" is not an apropos response. > > > > > > > > "That abstraction is wrong" is certainly valid, as when vpids were > > > > originally proposed and rejected, resulting in the development of > > > > pid namespaces. "We have to work out (x) first" can be valid (and > > > > I can think of examples here), assuming it's not just trying to hide > > > > behind a catch-22/chicken-egg problem. > > > > > > > > Finally, saying "containers are complex and error prone" is conflating > > > > several large suites of userspace code and many kernel features which > > > > support them. Being more precise would, if the argument is valid, > > > > lend it a lot more weight. > > > > > > We (my company) use Linux containers since 2011 in production. First LXC, > > > now libvirt-lxc. > > > To understand the internals better I also wrote my own userspace to > > > create/start > > > containers. 
> > > There are so many things which can hurt you badly.
> > > With user namespaces we expose a really big attack surface to regular
> > > users.
> > > I.e. Suddenly a user is allowed to mount filesystems.
> >
> > That is currently not the case. They can mount some virtual filesystems
> > and do bind mounts, but cannot mount most real filesystems. This keeps
> > us protected (for now) from potentially unsafe superblock readers in the
> > kernel.
> >
> > > Ask Andy, he found already lots of nasty things...
>
> I don't think I have anything brilliant to add to this discussion
> right now, except possibly:
>
> ISTM that Linux distributions are, in general, vulnerable to all kinds
> of shenanigans that would happen if an untrusted user can cause a
> block device to appear. That user doesn't need permission to mount it

Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in the container does not also show up in the host.

> or even necessarily to change its contents on the fly.
>
> E.g. what happens if you boot a machine that contains a malicious disk
> image that has the same partition UUID as /? Nothing good, I imagine.
>
> So if we're going to go down this road, we really need some way to
> tell the host that certain devices are not trusted.
>
> --Andy
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Michael H. Warfield (m...@wittsend.com):
> On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> > Seth Forshee writes:
> >
> > > What I set out for was feature parity between loop devices in a secure
> > > container and loop devices on the host. Since some operations currently
> > > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > > this is to push knowledge of the user namespace farther down into the
> > > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > > namespace associated with the device.
> > >
> > > That said, I suspect our current use cases can get by without these
> > > capabilities. Really though I suspect this is just deferring the
> > > discussion rather than settling it, and what we'll end up with is little
> > > more than a fancy way for userspace to ask the kernel to run mknod on
> > > its behalf.
> >
> > A fancy way to ask the kernel to run mknod on its behalf is what
> > /dev/pts is.
> >
> > When I suggested this I did not mean you should forgo making changes to
> > allow partitions and the like. What I intended is that you should find a
> > way to make this safe for users who don't have root capabilities.
>
> I like to think in terms of the "rootless" configurations where "root"
> per se is not absolute and everything is framed in terms of
> capabilities.
>
> > Which possibly means that mount needs to learn how to keep a more
> > privileged user from using your new loop devices.
>
> Not sure I got that one. A user with "more" privileges may or may not
> have access dependent on the congruence of the privileges. They're not

Yes, so in this case by 'more privileged' he meant a privileged user in a userns which is an ancestor of the current userns. It is in fact *more* privileged than any user in the current userns.

> hierarchical. If someone has that "priv" then they have access. If they

They are in fact implicitly hierarchical due to the hierarchical userns design.

> do not, they do not.
> > > To get to the point where this is really and truly usable I expect to be > > technically daunting. > > Most technically non-trivial problems generally are. > > > Ultimately the technical challenge is how do we create a block device > > that is safe for a user who does not have any capabilities to use, and > > what can we do with that block device to make it useful. > > Concur. It boils down to privilege management and access. Absolutely > concur. > > > Only when the question is can this kernel functionality which is > > otherwise safe confuse a preexisting setuid application do namespace > > or container bits significantly come into play. > > Ah... Admittedly it's not as late as our conversation at LinuxPlumbers > last year in NOLA but... Maybe late at night but I failed to parse the > above. > > > Eric > > Regards, > Mike > -- > Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com >/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/ >NIC whois: MHW9 | An optimist believes we live in the best of all > PGP Key: 0x674627FF| possible worlds. A pessimist is sure of it! > > ___ > lxc-devel mailing list > lxc-de...@lists.linuxcontainers.org > http://lists.linuxcontainers.org/listinfo/lxc-devel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Seth Forshee (seth.fors...@canonical.com):
> On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> > Quoting Seth Forshee (seth.fors...@canonical.com):
> > > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > > Greg Kroah-Hartman writes:
> > > >
> > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > >> > I think having to pick and choose what device nodes you want in a
> > > > >> > container is a good thing. Besides, you would have to do the same
> > > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > > >> > decision here, especially as it knows exactly what it wants to do
> > > > >> > much more so than the kernel ever can.
> > > > >>
> > > > >> For 'real' devices that sounds sensible. The thing about loop
> > > > >> devices is that we simply want to allow a container to say "give me
> > > > >> a loop device to use" and have it receive a unique loop device (or
> > > > >> 3), without having to pre-assign them. I think that would be cleaner
> > > > >> to do using a pseudofs and loop-control device, rather than having
> > > > >> to have a daemon in userspace on the host farming those out in
> > > > >> response to some, I don't know, dbus request?
> > > > >
> > > > > I agree that loop devices would be nice to have in a container, and
> > > > > that the existing loop interface doesn't really lend itself to that.
> > > > > So create a new type of thing that acts like a loop device in a
> > > > > container. But don't try to mess with the whole driver core just for
> > > > > a single type of device.
> > > >
> > > > Yes. Something like devpts (without the newinstance option). Built to
> > > > allow unprivileged users to create loopback devices.
> > >
> > > That's where I started, and I've got code, so I guess I'll clean it up
> > > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > > gets to do privileged block device ioctls, including reading partitions
> >
> > Sorry, where did that come from? What Eric was referring to below is
> > the fs superblock readers not being trusted. Maybe I glossed over another
> > email where it was mentioned?
>
> You must have. Take a look at [1].
>
> To repeat the point: the ioctl to reread partitions (along with several
> other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
> change this to an ns_capable check without at minimum the block layer
> knowing about the namespace associated with the block device. Ergo we

Which only means those changes are necessary :) So far as I understand, a
namespaced devtmpfs is nacked, but a loopfs is interesting (and, depending
on the implementation, acceptable). That necessarily includes the minimal
blockdev changes to support it.

> can't reread partitions if this is done entirely within the loop driver
> via a pseudo fs.
>
> [1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Mon, 2014-05-19 at 17:04 -0700, Eric W. Biederman wrote:
> Seth Forshee writes:
>
> > What I set out for was feature parity between loop devices in a secure
> > container and loop devices on the host. Since some operations currently
> > check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> > this is to push knowledge of the user namespace farther down into the
> > driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> > namespace associated with the device.
> >
> > That said, I suspect our current use cases can get by without these
> > capabilities. Really though I suspect this is just deferring the
> > discussion rather than settling it, and what we'll end up with is little
> > more than a fancy way for userspace to ask the kernel to run mknod on
> > its behalf.
>
> A fancy way to ask the kernel to run mknod on its behalf is what
> /dev/pts is.
>
> When I suggested this I did not mean you should forgo making changes to
> allow partitions and the like. What I intended is that you should find a
> way to make this safe for users who don't have root capabilities.

I like to think in terms of the "rootless" configurations where "root"
per se is not absolute and everything is framed in terms of
capabilities.

> Which possibly means that mount needs to learn how to keep a more
> privileged user from using your new loop devices.

Not sure I got that one. A user with "more" privileges may or may not
have access dependent on the congruence of the privileges. They're not
hierarchical. If someone has that "priv" then they have access. If they
do not, they do not.

> To get to the point where this is really and truly usable I expect to be
> technically daunting.

Most technically non-trivial problems generally are.

> Ultimately the technical challenge is how do we create a block device
> that is safe for a user who does not have any capabilities to use, and
> what can we do with that block device to make it useful.

Concur. It boils down to privilege management and access. Absolutely
concur.

> Only when the question is can this kernel functionality which is
> otherwise safe confuse a preexisting setuid application do namespace
> or container bits significantly come into play.

Ah... Admittedly it's not as late as our conversation at LinuxPlumbers
last year in NOLA but... Maybe late at night but I failed to parse the
above.

> Eric

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 | http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds. A pessimist is sure of it!
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Seth Forshee writes:

> What I set out for was feature parity between loop devices in a secure
> container and loop devices on the host. Since some operations currently
> check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
> this is to push knowledge of the user namespace farther down into the
> driver stack so the check can instead be for CAP_SYS_ADMIN in the user
> namespace associated with the device.
>
> That said, I suspect our current use cases can get by without these
> capabilities. Really though I suspect this is just deferring the
> discussion rather than settling it, and what we'll end up with is little
> more than a fancy way for userspace to ask the kernel to run mknod on
> its behalf.

A fancy way to ask the kernel to run mknod on its behalf is what
/dev/pts is.

When I suggested this I did not mean you should forgo making changes to
allow partitions and the like. What I intended is that you should find a
way to make this safe for users who don't have root capabilities.

Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.

To get to the point where this is really and truly usable I expect to be
technically daunting.

Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.

Only when the question is can this kernel functionality which is
otherwise safe confuse a preexisting setuid application do namespace
or container bits significantly come into play.

Eric
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote:
>
> Quoting Richard Weinberger (rich...@nod.at):
> > On 15.05.2014 21:50, Serge Hallyn wrote:
> > > Quoting Richard Weinberger (richard.weinber...@gmail.com):
> > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman wrote:
> > >>> Then don't use a container to build such a thing, or fix the build
> > >>> scripts to not do that :)
> > >>
> > >> I second this.
> > >> To me it looks like some folks try to (ab)use Linux containers
> > >> for purposes where KVM would much better fit in.
> > >> Please don't put more complexity into containers. They are already
> > >> horribly complex and error prone.
> > >
> > > I, naturally, disagree :) The only use case which is inherently not
> > > valid for containers is running a kernel. Practically speaking there
> > > are other things which likely will never be possible, but if someone
> > > offers a way to do something in containers, "you can't do that in
> > > containers" is not an apropos response.
> > >
> > > "That abstraction is wrong" is certainly valid, as when vpids were
> > > originally proposed and rejected, resulting in the development of
> > > pid namespaces. "We have to work out (x) first" can be valid (and
> > > I can think of examples here), assuming it's not just trying to hide
> > > behind a catch-22/chicken-egg problem.
> > >
> > > Finally, saying "containers are complex and error prone" is conflating
> > > several large suites of userspace code and many kernel features which
> > > support them. Being more precise would, if the argument is valid,
> > > lend it a lot more weight.
> >
> > We (my company) have used Linux containers in production since 2011.
> > First LXC, now libvirt-lxc.
> > To understand the internals better I also wrote my own userspace to
> > create/start containers. There are so many things which can hurt you
> > badly.
> > With user namespaces we expose a really big attack surface to regular
> > users. I.e. suddenly a user is allowed to mount filesystems.
>
> That is currently not the case. They can mount some virtual filesystems
> and do bind mounts, but cannot mount most real filesystems. This keeps
> us protected (for now) from potentially unsafe superblock readers in the
> kernel.
>
> > Ask Andy, he found already lots of nasty things...

I don't think I have anything brilliant to add to this discussion
right now, except possibly:

ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear. That user doesn't need permission to mount it
or even necessarily to change its contents on the fly.

E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /? Nothing good, I imagine.

So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.

--Andy
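
The duplicate-UUID scenario above can be illustrated with a toy model. This is a plain Python sketch, not kernel or udev code: the BlockDev record and its "trusted" flag are hypothetical, standing in for "some way to tell the host that certain devices are not trusted", and the UUID values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockDev:
    name: str      # e.g. "sda1" or "loop7"
    uuid: str      # UUID advertised by the device contents
    trusted: bool  # hypothetical marker: was this device set up by the host?


def resolve_uuid(devices, uuid, honor_trust=True):
    """Return the names of devices that could satisfy a lookup by UUID.

    With honor_trust=False every device advertising the UUID is a
    candidate, which is the ambiguous situation described above.
    """
    return [d.name for d in devices
            if d.uuid == uuid and (d.trusted or not honor_trust)]


devices = [
    BlockDev("sda1", "1111-aaaa", trusted=True),    # the real root filesystem
    BlockDev("loop7", "1111-aaaa", trusted=False),  # image supplied by a container
]

# Without a trust marker the lookup is ambiguous; with it, only the host's
# own device remains a candidate.
ambiguous = resolve_uuid(devices, "1111-aaaa", honor_trust=False)
filtered = resolve_uuid(devices, "1111-aaaa", honor_trust=True)
```

Where such a marker would live (kernel, udev rules, or elsewhere) is left open here, as it is in the thread.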
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Sun, May 18, 2014 at 04:44:58AM +0200, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > > Greg Kroah-Hartman writes:
> > >
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > >> > I think having to pick and choose what device nodes you want in a
> > > >> > container is a good thing. Besides, you would have to do the same
> > > >> > thing in the kernel anyway, what's wrong with userspace making the
> > > >> > decision here, especially as it knows exactly what it wants to do
> > > >> > much more so than the kernel ever can.
> > > >>
> > > >> For 'real' devices that sounds sensible. The thing about loop devices
> > > >> is that we simply want to allow a container to say "give me a loop
> > > >> device to use" and have it receive a unique loop device (or 3), without
> > > >> having to pre-assign them. I think that would be cleaner to do using
> > > >> a pseudofs and loop-control device, rather than having to have a
> > > >> daemon in userspace on the host farming those out in response to
> > > >> some, I don't know, dbus request?
> > > >
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that. So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > >
> > > Yes. Something like devpts (without the newinstance option). Built to
> > > allow unprivileged users to create loopback devices.
> >
> > That's where I started, and I've got code, so I guess I'll clean it up
> > and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> > gets to do privileged block device ioctls, including reading partitions
>
> Sorry, where did that come from? What Eric was referring to below is
> the fs superblock readers not being trusted. Maybe I glossed over another
> email where it was mentioned?

You must have. Take a look at [1].

To repeat the point: the ioctl to reread partitions (along with several
other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
change this to an ns_capable check without at minimum the block layer
knowing about the namespace associated with the block device. Ergo we
can't reread partitions if this is done entirely within the loop driver
via a pseudo fs.

[1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
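
The difference between the two checks can be sketched with a toy model. This is illustrative Python, not kernel code: the class names and the owner_ns field on the device are hypothetical, and the model is deliberately simplified (it ignores the rule that the owner of a namespace holds all capabilities in it). It relies only on the fact that capable(cap) is the same check as ns_capable() against the initial user namespace.

```python
class UserNS:
    """Toy user namespace; parent is None for the initial namespace."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def is_same_or_ancestor_of(self, other):
        # Walk from `other` up to the root looking for self.
        ns = other
        while ns is not None:
            if ns is self:
                return True
            ns = ns.parent
        return False


class Task:
    def __init__(self, userns, caps):
        self.userns = userns
        self.caps = set(caps)


class LoopDev:
    def __init__(self, owner_ns):
        # Hypothetical: the namespace the block layer associates with the device.
        self.owner_ns = owner_ns


def ns_capable(task, target_ns, cap):
    # Simplified rule: a capability held in a namespace applies to that
    # namespace and to all of its descendants, never to its ancestors.
    return cap in task.caps and task.userns.is_same_or_ancestor_of(target_ns)


def capable(task, cap, init_ns):
    # System-wide check: equivalent to ns_capable against the initial namespace.
    return ns_capable(task, init_ns, cap)


init_ns = UserNS("init")
container_ns = UserNS("container", parent=init_ns)
host_root = Task(init_ns, {"CAP_SYS_ADMIN"})
container_root = Task(container_ns, {"CAP_SYS_ADMIN"})
loop_dev = LoopDev(owner_ns=container_ns)

# Current check: container root fails, since its capability is not system-wide.
current_check = capable(container_root, "CAP_SYS_ADMIN", init_ns)
# Namespace-aware check: only possible once the device records its namespace.
proposed_check = ns_capable(container_root, loop_dev.owner_ns, "CAP_SYS_ADMIN")
# Host root stays privileged over the device, being in an ancestor namespace.
host_check = ns_capable(host_root, loop_dev.owner_ns, "CAP_SYS_ADMIN")
```

The model also shows the hierarchy point made elsewhere in the thread: a privileged user in an ancestor namespace passes the check for devices owned by a descendant namespace, but not the other way around.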
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Seth Forshee (seth.fors...@canonical.com):
> On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> > Greg Kroah-Hartman writes:
> >
> > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > >> > I think having to pick and choose what device nodes you want in a
> > >> > container is a good thing. Besides, you would have to do the same
> > >> > thing in the kernel anyway, what's wrong with userspace making the
> > >> > decision here, especially as it knows exactly what it wants to do
> > >> > much more so than the kernel ever can.
> > >>
> > >> For 'real' devices that sounds sensible. The thing about loop devices
> > >> is that we simply want to allow a container to say "give me a loop
> > >> device to use" and have it receive a unique loop device (or 3), without
> > >> having to pre-assign them. I think that would be cleaner to do using
> > >> a pseudofs and loop-control device, rather than having to have a
> > >> daemon in userspace on the host farming those out in response to
> > >> some, I don't know, dbus request?
> > >
> > > I agree that loop devices would be nice to have in a container, and that
> > > the existing loop interface doesn't really lend itself to that. So
> > > create a new type of thing that acts like a loop device in a container.
> > > But don't try to mess with the whole driver core just for a single type
> > > of device.
> >
> > Yes. Something like devpts (without the newinstance option). Built to
> > allow unprivileged users to create loopback devices.
>
> That's where I started, and I've got code, so I guess I'll clean it up
> and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
> gets to do privileged block device ioctls, including reading partitions

Sorry, where did that come from? What Eric was referring to below is
the fs superblock readers not being trusted. Maybe I glossed over another
email where it was mentioned?

> on a block device which has been assigned to a container, then I guess
> that approach works well enough.
>
> > There is still a huge kettle of fish in verifying a filesystem is
> > safe from a hostile user that has access to the block device while the
> > filesystem is mounted.
> >
> > Having a few filesystems that are robust enough to trust with arbitrary
> > filesystem corruption would be very interesting.
> >
> > I assume unprivileged and hostile users because if you trusted the real
> > root inside of your container this would not be an issue.
> >
> > Eric
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting James Bottomley (james.bottom...@hansenpartnership.com):
> On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote:
> > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote:
> > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > > > > I think having to pick and choose what device nodes you want in a
> > > > > > container is a good thing. Besides, you would have to do the same
> > > > > > thing in the kernel anyway, what's wrong with userspace making the
> > > > > > decision here, especially as it knows exactly what it wants to do
> > > > > > much more so than the kernel ever can.
> > > > >
> > > > > For 'real' devices that sounds sensible. The thing about loop devices
> > > > > is that we simply want to allow a container to say "give me a loop
> > > > > device to use" and have it receive a unique loop device (or 3),
> > > > > without having to pre-assign them. I think that would be cleaner to
> > > > > do using a pseudofs and loop-control device, rather than having to
> > > > > have a daemon in userspace on the host farming those out in response
> > > > > to some, I don't know, dbus request?
> > > >
> > > > I agree that loop devices would be nice to have in a container, and that
> > > > the existing loop interface doesn't really lend itself to that. So
> > > > create a new type of thing that acts like a loop device in a container.
> > > > But don't try to mess with the whole driver core just for a single type
> > > > of device.
> > >
> > > No matter what I don't think we get out of this without driver core
> > > changes, whether this was done in loop or by creating something new.
> > > Not unless the whole thing is punted to userspace, anyway.
> > >
> > > The first problem is that many block device ioctls check for
> > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
> > > not really sure. But loop does at minimum support partitions, and to get
> > > that functionality in an unprivileged container at least the block layer
> > > needs to know the namespace which has privileges for that device.
> >
> > That's fine, you should have those permissions in a container if you
> > want to do something like that on a loop device, right?
>
> Really, no. CAP_SYS_ADMIN is effectively a pseudo root security hole.
> Any user possessing CAP_SYS_ADMIN can do about as much damage as real
> root can, whether or not you use user namespaces, so it would compromise
> a lot of the security we're just bringing to containers.
>
> > > The second is that all block devices automatically appear in devtmpfs.
> > > The scenario I'm concerned about is that the host could unknowingly use
> > > a loop device exposed to a container, then the container could see data
> > > from the host.
> >
> > I don't think that's a real issue, the host should know not to do that.
> >
> > > So we either need a flag to tell the driver core not to create a node
> > > in devtmpfs, or we need a privileged manager in userspace to remove
> > > them (which kind of defeats the purpose). And it gets more complicated
> > > when partition block devs are mixed in, because they can be created
> > > without involvement from the driver - they would need to inherit the
> > > "no devtmpfs node" property from their parent, and if the driver uses
> > > a pseudo fs to create device nodes for userspace then it needs to be
> > > informed about the partitions too so it can create those nodes.
> >
> > I don't think that will be needed. Root in a host can do whatever it
> > wants in the containers, so mixing up block devices is the least of the
> > issues involved :)
> >
> > > So maybe we could get by without the privileged ioctls, as long as it
> > > was understood that unprivileged containers can't do partitioning. But I
> > > do think the devtmpfs problem would need to be addressed.
> >
> > I don't think unprivileged containers should be able to do partitioning.
> > An unprivileged user can't do that, so why should a container be any
> > different?
>
> To make sure we're on the same page with terminology, there's an
> unprivileged container and a secure container. In the former, there's

Hm, that terminology (which isn't what we've been using) could be useful,
but is still not quite precise enough if we're going down that road.

> no root user (all the processes run as non-root), so the container isn't

"there is no root user" and "all processes run as non-root" are not the
same thing. Is it just that no processes are running as root? Or that
uid 0 in the container is not mapped at all and hence not achievable?
The former really isn't a function of the container itself, and depends
on there really not being any setuid-root or capability-wielding files
available in the container. If the latter, and you're hoping to claim
that the host is saved from the container exerc
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 09:31:37PM -0700, Eric W. Biederman wrote:
> Greg Kroah-Hartman writes:
>
> > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> >> > I think having to pick and choose what device nodes you want in a
> >> > container is a good thing. Besides, you would have to do the same thing
> >> > in the kernel anyway, what's wrong with userspace making the decision
> >> > here, especially as it knows exactly what it wants to do much more so
> >> > than the kernel ever can.
> >>
> >> For 'real' devices that sounds sensible. The thing about loop devices
> >> is that we simply want to allow a container to say "give me a loop
> >> device to use" and have it receive a unique loop device (or 3), without
> >> having to pre-assign them. I think that would be cleaner to do using
> >> a pseudofs and loop-control device, rather than having to have a
> >> daemon in userspace on the host farming those out in response to
> >> some, I don't know, dbus request?
> >
> > I agree that loop devices would be nice to have in a container, and that
> > the existing loop interface doesn't really lend itself to that. So
> > create a new type of thing that acts like a loop device in a container.
> > But don't try to mess with the whole driver core just for a single type
> > of device.
>
> Yes. Something like devpts (without the newinstance option). Built to
> allow unprivileged users to create loopback devices.

That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
on a block device which has been assigned to a container, then I guess
that approach works well enough.

> There is still a huge kettle of fish in verifying a filesystem is
> safe from a hostile user that has access to the block device while the
> filesystem is mounted.
>
> Having a few filesystems that are robust enough to trust with arbitrary
> filesystem corruption would be very interesting.
>
> I assume unprivileged and hostile users because if you trusted the real
> root inside of your container this would not be an issue.
>
> Eric
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, 2014-05-15 at 21:35 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > > I think having to pick and choose what device nodes you want in a
> > > container is a good thing. Besides, you would have to do the same thing
> > > in the kernel anyway, what's wrong with userspace making the decision
> > > here, especially as it knows exactly what it wants to do much more so
> > > than the kernel ever can.
> >
> > For 'real' devices that sounds sensible. The thing about loop devices
> > is that we simply want to allow a container to say "give me a loop
> > device to use" and have it receive a unique loop device (or 3), without
> > having to pre-assign them. I think that would be cleaner to do using
> > a pseudofs and loop-control device, rather than having to have a
> > daemon in userspace on the host farming those out in response to
> > some, I don't know, dbus request?
>
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that. So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yeah, a lot of dynamic devices (like serial devices) can be handled in
user space with the proviso that we could use some way to tickle udev
and hotplug in the container with events. But the loop device is the
real ugly duckling here. It's a unique case of an on-demand device with
a shared control device that's not really hot-plug and not really
deterministic enough to be handled purely in user space. It presents
unique challenges unto itself.

Makes sense to me.

> greg k-h

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 | http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds. A pessimist is sure of it!
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Greg Kroah-Hartman writes:

> On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
>> > I think having to pick and choose what device nodes you want in a
>> > container is a good thing. Besides, you would have to do the same thing
>> > in the kernel anyway, what's wrong with userspace making the decision
>> > here, especially as it knows exactly what it wants to do much more so
>> > than the kernel ever can.
>>
>> For 'real' devices that sounds sensible. The thing about loop devices
>> is that we simply want to allow a container to say "give me a loop
>> device to use" and have it receive a unique loop device (or 3), without
>> having to pre-assign them. I think that would be cleaner to do using
>> a pseudofs and loop-control device, rather than having to have a
>> daemon in userspace on the host farming those out in response to
>> some, I don't know, dbus request?
>
> I agree that loop devices would be nice to have in a container, and that
> the existing loop interface doesn't really lend itself to that. So
> create a new type of thing that acts like a loop device in a container.
> But don't try to mess with the whole driver core just for a single type
> of device.

Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.

There is still a huge kettle of fish in verifying a filesystem is
safe from a hostile user that has access to the block device while the
filesystem is mounted.

Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.

I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.

Eric
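
For reference, the "give me a loop device to use" request already exists on the host through the loop-control device. The sketch below shows that existing interface from Python, assuming a Linux host where /dev/loop-control is present and the caller is allowed to open it (normally root, which is the limitation this thread is about). The function name is hypothetical; LOOP_CTL_GET_FREE is the ioctl number from <linux/loop.h>.

```python
import fcntl
import os

# Value of LOOP_CTL_GET_FREE from <linux/loop.h>.
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_device(control_path="/dev/loop-control"):
    """Ask the loop driver for an unused loop device and return its path.

    Raises OSError if the control device is missing or cannot be opened.
    """
    fd = os.open(control_path, os.O_RDWR)
    try:
        # The ioctl's return value is the index of a free loop device.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return f"/dev/loop{index}"
```

A devpts-like pseudo filesystem, as suggested above, would presumably expose its own per-mount control node in place of the global /dev/loop-control; that part is speculation about a design that is not specified in this thread.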
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 12:28:35PM -0700, James Bottomley wrote: > On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote: > > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote: > > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote: > > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > > > > I think having to pick and choose what device nodes you want in a > > > > > > container is a good thing. Becides, you would have to do the same > > > > > > thing > > > > > > in the kernel anyway, what's wrong with userspace making the > > > > > > decision > > > > > > here, especially as it knows exactly what it wants to do much more > > > > > > so > > > > > > than the kernel ever can. > > > > > > > > > > For 'real' devices that sounds sensible. The thing about loop devices > > > > > is that we simply want to allow a container to say "give me a loop > > > > > device to use" and have it receive a unique loop device (or 3), > > > > > without > > > > > having to pre-assign them. I think that would be cleaner to do using > > > > > a pseudofs and loop-control device, rather than having to have a > > > > > daemon in userspace on the host farming those out in response to > > > > > some, I don't know, dbus request? > > > > > > > > I agree that loop devices would be nice to have in a container, and that > > > > the existing loop interface doesn't really lend itself to that. So > > > > create a new type of thing that acts like a loop device in a container. > > > > But don't try to mess with the whole driver core just for a single type > > > > of device. > > > > > > No matter what I don't think we get out of this without driver core > > > changes, whether this was done in loop or by creating something new. > > > Not unless the whole thing is punted to userspace, anyway. > > > > > > The first problem is that many block device ioctls check for > > > CAP_SYS_ADMIN. 
Most of these might not ever be used on loop devices, I'm > > > not really sure. But loop does at minimum support partitions, and to get > > > that functionality in an unprivileged container at least the block layer > > > needs to know the namespace which has privileges for that device. > > > > That's fine, you should have those permissions in a container if you > > want to do something like that on a loop device, right? > > Really, no. CAP_SYS_ADMIN is effectively a pseudo root security hole. > Any user possessing CAP_SYS_ADMIN can do about as much damage as real > root can, whether or not you use user namespaces, so it would compromise > a lot of the security we're just bringing to containers. > > > > The second is that all block devices automatically appear in devtmpfs. > > > The scenario I'm concerned about is that the host could unknowingly use > > > a loop device exposed to a container, then the container could see data > > > from the host. > > > > I don't think that's a real issue, the host should know not to do that. > > > > > So we either need a flag to tell the driver core not to create a node > > > in devtmpfs, or we need a privileged manager in userspace to remove > > > them (which kind of defeats the purpose). And it gets more complicated > > > when partition block devs are mixed in, because they can be created > > > without involvement from the driver - they would need to inherit the > > > "no devtmpfs node" property from their parent, and if the driver uses > > > a psuedo fs to create device nodes for userspace then it needs to be > > > informed about the partitions too so it can create those nodes. > > > > I don't think that will be needed. Root in a host can do whatever it > > wants in the containers, so mixing up block devices is the least of the > > issues involved :) > > > > > So maybe we could get by without the privileged ioctls, as long as it > > > was understood that unprivileged containers can't do partitioning. 
But I > > > do think the devtmpfs problem would need to be addressed. > > > > I don't think unprivileged containers should be able to do partitioning. > > An unprivileged user can't do that, so why should a container be any > > different? > > To make sure we're on the same page with terminology, there's an > unprivileged container and a secure container. In the former, there's > no root user (all the processes run as non-root), so the container isn't > expected to perform any actions root would ... that's easy. In a secure > container, root is mapped to a nobody user in the host, so is > effectively unprivileged, but root in the container expects to look like > a real root within the VPS (and thus may expect to partition things, > depending on how they've been given access to the block device). The > big problem is giving back capabilities to the container root such that > a) it loses them if it escapes the container and b) it doesn't get > sufficient capabilities to damage the system. Based on your description, what I was talking about is a secure container.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, 2014-05-16 at 12:20 -0700, James Bottomley wrote: > On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote: > > On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote: > > > > PS - Apparently both parallels and Michael independently > > > > project devices which are hot-plugged on the host into containers. > > > > That also seems like something worth talking about (best practices, > > > > shortcomings, use cases not met by it, any ways that the kernel can > > > > help out) at ksummit/linuxcon. > > > > > I was told that containers would never want devices hotplugged into > > > them. > > > > Interesting. You were told they (who they?) would never want them? Who > > said that? I would have never thought that given that other > > implementations can provide that. I would certainly want them. Seems > > strange to explicitly relegate LXC containers to being second class > > citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones. > That would probably be me. Running hotplug inside a container is a > security problem and, since containers are easily entered by the host, > it's very easy to listen for the hotplug in the host and inject it into > the container using nsenter. In all virtualization... The host, particularly root on the host, exists as deus ex machina, the "god outside the machine". They are at my mercy. Even hardware virtualization cannot protect you from the host. You wanna hear some frightening talks on virtualization, catch Joanna (miss little blue pill) Rutkowska some time. I'm particularly interested in her takes on the "anti evil-maid attacks" and I sat in on her talks on the "north bridge" and "south bridge" malware evasion techniques. She's a good speaker who makes powerful points that make you sweat but is pleasant in face to face conversation. I've played with her Qubes distribution a couple of times and the way it works with the TPM to ensure a secure boot is interesting.
But that's a completely different topic on trusted computing. OTOH, there are plenty of other things to worry about in all forms of virtualization. At Internet Security Systems, where I was a founder, fellow, and "X-Force Senior Wizard", we were looking at the ability to leak information through the USB subsystem. No isolation is perfect, especially when you have USB enabled. But that's my turf. > I don't think the intention is to label anyone's implementation as > preferred. What this shows, I think, is that we all have different > practises when it comes to setting up containers. Some are necessary > because our containers are different. Some could do with serious > examination to see if there's really a best way to do the action which > we would then all use. And I hope to contribute to the discussion of said actions. > > I might believe you were never told they would need them, but that's a > > totally different sense. Are we going to tell RedHat and the Docker > > people that LXC is an inferior technology that is complex and unreliable > > (to quote another poster) compared to these others? They're saying this > > will be enterprise technology. If I go to Amazon AWS or other VPS > > services and compare, are we not going to stand on a level playing > > field? Admittedly, I don't expect Amazon AWS to provide me with serial > > consoles, but I do expect to be able to mount file system images within > > my VPS. > Well, that's another nasty, isn't it. We all have different ways of > coping with mount in the container. I think at plumbers we need to sit > down with some of this plumbing and work out which pipes carry the same > fluids and whether we could unify them. Concur > As an aside (probably requiring a new thread) we were wondering about > some type of notifier on the mount call that we could vector into the > host to perform the action. The main issue for us is mount of procfs, > which really needs to be a bind mount in a container. 
All of this led > me to speculate that we could use some type of syscall notifier > mechanism to manage capabilities in the host and even intercept and > complete the syscall action within the host rather than having to keep > evolving more and more complex kernel drivers to do this. Interesting. That could be very useful. That might even help with the loop device case where the mounts have to go through loop devices for things like file system images and builds. Very interesting... > James Regards, Mike -- Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/ NIC whois: MHW9 | An optimist believes we live in the best of all PGP Key: 0x674627FF| possible worlds. A pessimist is sure of it!
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, 2014-05-16 at 11:57 -0700, Greg Kroah-Hartman wrote: > On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote: > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote: > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > > > I think having to pick and choose what device nodes you want in a > > > > > container is a good thing. Becides, you would have to do the same > > > > > thing > > > > > in the kernel anyway, what's wrong with userspace making the decision > > > > > here, especially as it knows exactly what it wants to do much more so > > > > > than the kernel ever can. > > > > > > > > For 'real' devices that sounds sensible. The thing about loop devices > > > > is that we simply want to allow a container to say "give me a loop > > > > device to use" and have it receive a unique loop device (or 3), without > > > > having to pre-assign them. I think that would be cleaner to do using > > > > a pseudofs and loop-control device, rather than having to have a > > > > daemon in userspace on the host farming those out in response to > > > > some, I don't know, dbus request? > > > > > > I agree that loop devices would be nice to have in a container, and that > > > the existing loop interface doesn't really lend itself to that. So > > > create a new type of thing that acts like a loop device in a container. > > > But don't try to mess with the whole driver core just for a single type > > > of device. > > > > No matter what I don't think we get out of this without driver core > > changes, whether this was done in loop or by creating something new. > > Not unless the whole thing is punted to userspace, anyway. > > > > The first problem is that many block device ioctls check for > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm > > not really sure. 
But loop does at minimum support partitions, and to get > > that functionality in an unprivileged container at least the block layer > > needs to know the namespace which has privileges for that device. > > That's fine, you should have those permissions in a container if you > want to do something like that on a loop device, right? Really, no. CAP_SYS_ADMIN is effectively a pseudo root security hole. Any user possessing CAP_SYS_ADMIN can do about as much damage as real root can, whether or not you use user namespaces, so it would compromise a lot of the security we're just bringing to containers. > > The second is that all block devices automatically appear in devtmpfs. > > The scenario I'm concerned about is that the host could unknowingly use > > a loop device exposed to a container, then the container could see data > > from the host. > > I don't think that's a real issue, the host should know not to do that. > > > So we either need a flag to tell the driver core not to create a node > > in devtmpfs, or we need a privileged manager in userspace to remove > > them (which kind of defeats the purpose). And it gets more complicated > > when partition block devs are mixed in, because they can be created > > without involvement from the driver - they would need to inherit the > > "no devtmpfs node" property from their parent, and if the driver uses > > a psuedo fs to create device nodes for userspace then it needs to be > > informed about the partitions too so it can create those nodes. > > I don't think that will be needed. Root in a host can do whatever it > wants in the containers, so mixing up block devices is the least of the > issues involved :) > > > So maybe we could get by without the privileged ioctls, as long as it > > was understood that unprivileged containers can't do partitioning. But I > > do think the devtmpfs problem would need to be addressed. > > I don't think unpriviliged containers should be able to do partitioning. 
> An unprivileged user can't do that, so why should a container be any > different? To make sure we're on the same page with terminology, there's an unprivileged container and a secure container. In the former, there's no root user (all the processes run as non-root), so the container isn't expected to perform any actions root would ... that's easy. In a secure container, root is mapped to a nobody user in the host, so is effectively unprivileged, but root in the container expects to look like a real root within the VPS (and thus may expect to partition things, depending on how they've been given access to the block device). The big problem is giving back capabilities to the container root such that a) it loses them if it escapes the container and b) it doesn't get sufficient capabilities to damage the system. James
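To make the "root is mapped to a nobody user in the host" idea concrete, here is a minimal sketch of how a uid inside a user namespace translates to a host uid. It assumes the three-column format of /proc/PID/uid_map (inside start, outside start, length); the helper name and the example mapping "0 100000 65536" are hypothetical and not taken from the thread.

```python
def map_uid_to_host(container_uid, uid_map_text):
    """Translate a uid seen inside a user namespace to the host uid.

    uid_map_text uses the /proc/PID/uid_map layout: one mapping per line,
    "<inside-start> <outside-start> <length>".
    Returns None when the uid has no mapping.
    """
    for line in uid_map_text.strip().splitlines():
        inside, outside, count = (int(field) for field in line.split())
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


# Hypothetical mapping: container uids 0..65535 map to host uids 100000..165535.
example_map = "0 100000 65536"

# Container root ends up as an ordinary, unprivileged uid on the host.
print(map_uid_to_host(0, example_map))
```

With a mapping like this, a process that escapes the container is just host uid 100000, which is point (a) in the description above: the privileges do not follow it out.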
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, 2014-05-15 at 21:42 -0400, Michael H. Warfield wrote: > On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote: > > > PS - Apparently both parallels and Michael independently > > > project devices which are hot-plugged on the host into containers. > > > That also seems like something worth talking about (best practices, > > > shortcomings, use cases not met by it, any ways tha the kernel can > > > help out) at ksummit/linuxcon. > > > I was told that containers would never want devices hotplugged into > > them. > > Interesting. You were told they (who they?) would never want them? Who > said that? I would have never thought that given that other > implementations can provide that. I would certainly want them. Seems > strange to explicitly relegate LXC containers to being second class > citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones. That would probably be me. Running hotplug inside a container is a security problem and, since containers are easily entered by the host, it's very easy to listen for the hotplug in the host and inject it into the container using nsenter. I don't think the intention is to label anyone's implementation as preferred. What this shows, I think, is that we all have different practises when it comes to setting up containers. Some are necessary because our containers are different. Some could do with serious examination to see if there's really a best way to do the action which we would then all use. > I might believe you were never told they would need them, but that's a > totally different sense. Are we going to tell RedHat and the Docker > people that LXC is an inferior technology that is complex and unreliable > (to quote another poster) compared to these others? They're saying this > will be enterprise technology. If I go to Amazon AWS or other VPS > services and compare, are we not going to stand on a level playing > field? 
Admittedly, I don't expect Amazon AWS to provide me with serial > consoles, but I do expect to be able to mount file system images within > my VPS. Well, that's another nasty, isn't it. We all have different ways of coping with mount in the container. I think at plumbers we need to sit down with some of this plumbing and work out which pipes carry the same fluids and whether we could unify them. As an aside (probably requiring a new thread) we were wondering about some type of notifier on the mount call that we could vector into the host to perform the action. The main issue for us is mount of procfs, which really needs to be a bind mount in a container. All of this led me to speculate that we could use some type of syscall notifier mechanism to manage capabilities in the host and even intercept and complete the syscall action within the host rather than having to keep evolving more and more complex kernel drivers to do this. James
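As a rough illustration of "listen for the hotplug in the host and inject it into the container using nsenter", the sketch below only builds the command line a host-side helper might run once it has seen a hotplug event. The helper name, the pid, and the device numbers are hypothetical; the `--target` and `--mount` options of nsenter and the `mknod NAME TYPE MAJOR MINOR` form are the only things assumed from the real tools. Actually running it needs root on the host, and the container's device cgroup would still have to permit the device.

```python
import shlex


def nsenter_mknod_argv(container_pid, dev_path, major, minor, dev_type="b"):
    """Build an argv that creates a device node inside a container.

    container_pid: pid (as seen on the host) of any process in the container;
                   nsenter joins that process's mount namespace.
    dev_type:      "b" for a block device, "c" for a character device.
    """
    return [
        "nsenter", "--target", str(container_pid), "--mount", "--",
        "mknod", dev_path, dev_type, str(major), str(minor),
    ]


# Hypothetical example: expose block device 7:0 (loop0) to the container
# whose init process is host pid 4242.
argv = nsenter_mknod_argv(4242, "/dev/loop0", 7, 0)
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the policy decision (which device, which container) in a host-side helper like this is the userspace approach argued for in the thread; nothing here requires hotplug handling inside the container.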
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 09:06:07AM -0500, Seth Forshee wrote: > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote: > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > > I think having to pick and choose what device nodes you want in a > > > > container is a good thing. Becides, you would have to do the same thing > > > > in the kernel anyway, what's wrong with userspace making the decision > > > > here, especially as it knows exactly what it wants to do much more so > > > > than the kernel ever can. > > > > > > For 'real' devices that sounds sensible. The thing about loop devices > > > is that we simply want to allow a container to say "give me a loop > > > device to use" and have it receive a unique loop device (or 3), without > > > having to pre-assign them. I think that would be cleaner to do using > > > a pseudofs and loop-control device, rather than having to have a > > > daemon in userspace on the host farming those out in response to > > > some, I don't know, dbus request? > > > > I agree that loop devices would be nice to have in a container, and that > > the existing loop interface doesn't really lend itself to that. So > > create a new type of thing that acts like a loop device in a container. > > But don't try to mess with the whole driver core just for a single type > > of device. > > No matter what I don't think we get out of this without driver core > changes, whether this was done in loop or by creating something new. > Not unless the whole thing is punted to userspace, anyway. > > The first problem is that many block device ioctls check for > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm > not really sure. But loop does at minimum support partitions, and to get > that functionality in an unprivileged container at least the block layer > needs to know the namespace which has privileges for that device. 
That's fine, you should have those permissions in a container if you want to do something like that on a loop device, right? > The second is that all block devices automatically appear in devtmpfs. > The scenario I'm concerned about is that the host could unknowingly use > a loop device exposed to a container, then the container could see data > from the host. I don't think that's a real issue, the host should know not to do that. > So we either need a flag to tell the driver core not to create a node > in devtmpfs, or we need a privileged manager in userspace to remove > them (which kind of defeats the purpose). And it gets more complicated > when partition block devs are mixed in, because they can be created > without involvement from the driver - they would need to inherit the > "no devtmpfs node" property from their parent, and if the driver uses > a pseudo fs to create device nodes for userspace then it needs to be > informed about the partitions too so it can create those nodes. I don't think that will be needed. Root in a host can do whatever it wants in the containers, so mixing up block devices is the least of the issues involved :) > So maybe we could get by without the privileged ioctls, as long as it > was understood that unprivileged containers can't do partitioning. But I > do think the devtmpfs problem would need to be addressed. I don't think unprivileged containers should be able to do partitioning. An unprivileged user can't do that, so why should a container be any different? thanks, greg k-h
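Since much of this exchange turns on who holds CAP_SYS_ADMIN, here is a small sketch of how a process could check its own effective capability set. It assumes the `CapEff:` line of /proc/PID/status (a hexadecimal bitmask) and that CAP_SYS_ADMIN is bit 21, as defined in <linux/capability.h>; the function name and the sample status text are illustrative only. Note the result is relative to the process's own user namespace, whereas a plain capable() check in the block layer is made against the initial namespace, which is the gap discussed above.

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>


def has_cap_sys_admin(status_text):
    """Return True if the CapEff mask in a /proc/PID/status dump has CAP_SYS_ADMIN."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool(mask & (1 << CAP_SYS_ADMIN))
    return False  # no CapEff line found


# Illustrative status fragments, not captured from a real system.
unprivileged = "Name:\tbash\nCapEff:\t0000000000000000\n"
only_sys_admin = "Name:\tmount\nCapEff:\t0000000000200000\n"

print(has_cap_sys_admin(unprivileged), has_cap_sys_admin(only_sys_admin))
```

On a live system the same function could be fed the contents of /proc/self/status.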
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 11:28:28AM -0400, Michael H. Warfield wrote: > On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote: > > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote: > > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > > > I think having to pick and choose what device nodes you want in a > > > > > container is a good thing. Becides, you would have to do the same > > > > > thing > > > > > in the kernel anyway, what's wrong with userspace making the decision > > > > > here, especially as it knows exactly what it wants to do much more so > > > > > than the kernel ever can. > > > > > > > > For 'real' devices that sounds sensible. The thing about loop devices > > > > is that we simply want to allow a container to say "give me a loop > > > > device to use" and have it receive a unique loop device (or 3), without > > > > having to pre-assign them. I think that would be cleaner to do using > > > > a pseudofs and loop-control device, rather than having to have a > > > > daemon in userspace on the host farming those out in response to > > > > some, I don't know, dbus request? > > > > > > I agree that loop devices would be nice to have in a container, and that > > > the existing loop interface doesn't really lend itself to that. So > > > create a new type of thing that acts like a loop device in a container. > > > But don't try to mess with the whole driver core just for a single type > > > of device. > > > No matter what I don't think we get out of this without driver core > > changes, whether this was done in loop or by creating something new. > > Not unless the whole thing is punted to userspace, anyway. > > > The first problem is that many block device ioctls check for > > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm > > not really sure. 
But loop does at minimum support partitions, and to get > > that functionality in an unprivileged container at least the block layer > > needs to know the namespace which has privileges for that device. > > Whoa! Time out... Sorry, this will be an off topic aside. > > Loop devices support partitions? I'd love to know how that works. I've > tried several times in the past to do that but it's failed every time. > I haven't been able to find any how-to in the past. This article was > just a couple of years ago (after the last time I tried this): > > http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/ > > This guy didn't use partitions directly but used the offset to the > mount, which is what I had to use. Everything I found always referred > to using mount offsets in order to mount partitions within a loop > device. It's controlled by the loop.max_part module parameter. It defaults to 0, which means no partition support. For any value > 0, max_part will be the maximum available partition number, after rounding it up to the nearest power of 2 minus 1 (so max_part=5 gives you up to 7 partitions, max_part=8 gives you up to 15, etc).
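The rounding rule for loop.max_part can be sketched as follows. This assumes the driver computes a shift with fls() (position of the highest set bit) and then uses (1 << shift) - 1 as the effective maximum, with one extra minor reserved for the whole-disk node; the function names are made up for illustration and this is not a copy of the driver code.

```python
def effective_max_part(requested):
    """Round a requested max_part up to the next (power of two) - 1.

    0 (the default) means no partition support at all.
    """
    if requested <= 0:
        return 0
    part_shift = requested.bit_length()  # same value as the kernel's fls()
    return (1 << part_shift) - 1


def minors_per_device(requested):
    """Minor numbers reserved per loop device: partitions plus the whole disk."""
    return effective_max_part(requested) + 1


for value in (0, 1, 5, 8):
    print(value, effective_max_part(value), minors_per_device(value))
```

So a request of 5 reserves 8 minors per device (the whole disk plus partitions 1 to 7), and a request of 8 reserves 16. Setting it would look like `modprobe loop max_part=8`, or `loop.max_part=8` on the kernel command line when the driver is built in.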
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, 2014-05-16 at 09:06 -0500, Seth Forshee wrote: > On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote: > > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > > I think having to pick and choose what device nodes you want in a > > > > container is a good thing. Becides, you would have to do the same thing > > > > in the kernel anyway, what's wrong with userspace making the decision > > > > here, especially as it knows exactly what it wants to do much more so > > > > than the kernel ever can. > > > > > > For 'real' devices that sounds sensible. The thing about loop devices > > > is that we simply want to allow a container to say "give me a loop > > > device to use" and have it receive a unique loop device (or 3), without > > > having to pre-assign them. I think that would be cleaner to do using > > > a pseudofs and loop-control device, rather than having to have a > > > daemon in userspace on the host farming those out in response to > > > some, I don't know, dbus request? > > > > I agree that loop devices would be nice to have in a container, and that > > the existing loop interface doesn't really lend itself to that. So > > create a new type of thing that acts like a loop device in a container. > > But don't try to mess with the whole driver core just for a single type > > of device. > No matter what I don't think we get out of this without driver core > changes, whether this was done in loop or by creating something new. > Not unless the whole thing is punted to userspace, anyway. > The first problem is that many block device ioctls check for > CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm > not really sure. But loop does at minimum support partitions, and to get > that functionality in an unprivileged container at least the block layer > needs to know the namespace which has privileges for that device. Woa! Time out... Sorry, this will be an off topic aside. Loop devices support partitions? 
I'd love to know how that works. I've tried several times in the past to do that but it's failed every time. I haven't been able to find any how-to in the past. This article was just a couple of years ago (after the last time I tried this): http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/ This guy didn't use partitions directly but used the offset to the mount, which is what I had to use. Everything I found always referred to using mount offsets in order to mount partitions within a loop device. Regards, Mike > The second is that all block devices automatically appear in devtmpfs. > The scenario I'm concerned about is that the host could unknowingly use > a loop device exposed to a container, then the container could see data > from the host. So we either need a flag to tell the driver core not to > create a node in devtmpfs, or we need a privileged manager in userspace > to remove them (which kind of defeats the purpose). And it gets more > complicated when partition block devs are mixed in, because they can be > created without involvement from the driver - they would need to inherit > the "no devtmpfs node" property from their parent, and if the driver > uses a pseudo fs to create device nodes for userspace then it needs to > be informed about the partitions too so it can create those nodes. > > So maybe we could get by without the privileged ioctls, as long as it > was understood that unprivileged containers can't do partitioning. But I > do think the devtmpfs problem would need to be addressed. > > Thanks, > Seth -- Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/ NIC whois: MHW9 | An optimist believes we live in the best of all PGP Key: 0x674627FF| possible worlds. A pessimist is sure of it!
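For the mount-offset approach described in this message, the byte offset of each partition can be read out of the image's partition table. The sketch below assumes a classic MBR layout (four 16-byte entries starting at byte 446, little-endian start LBA at entry offset 8, signature 0x55AA at byte 510) and 512-byte sectors; the function name is hypothetical and GPT or extended partitions are not handled.

```python
import struct

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def mbr_partition_offsets(first_sector):
    """Return [(partition_number, byte_offset), ...] for the primary partitions.

    first_sector: the first 512 bytes of a disk image.
    """
    if len(first_sector) < 512 or bytes(first_sector[510:512]) != b"\x55\xaa":
        raise ValueError("no MBR signature found")
    offsets = []
    for index in range(4):
        entry_start = 446 + 16 * index
        part_type = first_sector[entry_start + 4]
        start_lba, num_sectors = struct.unpack_from("<II", first_sector, entry_start + 8)
        if part_type != 0 and num_sectors > 0:  # skip unused table slots
            offsets.append((index + 1, start_lba * SECTOR_SIZE))
    return offsets


# Build a fake first sector with one Linux (0x83) partition starting at LBA 2048.
sector = bytearray(512)
sector[510:512] = b"\x55\xaa"
sector[446 + 4] = 0x83
struct.pack_into("<II", sector, 446 + 8, 2048, 4096)

print(mbr_partition_offsets(sector))
```

The resulting offset is what would be passed to something like `mount -o loop,offset=1048576 disk.img /mnt` (image name and mount point are placeholders).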
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 09:35:32PM -0700, Greg Kroah-Hartman wrote: > On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote: > > > I think having to pick and choose what device nodes you want in a > > > container is a good thing. Becides, you would have to do the same thing > > > in the kernel anyway, what's wrong with userspace making the decision > > > here, especially as it knows exactly what it wants to do much more so > > > than the kernel ever can. > > > > For 'real' devices that sounds sensible. The thing about loop devices > > is that we simply want to allow a container to say "give me a loop > > device to use" and have it receive a unique loop device (or 3), without > > having to pre-assign them. I think that would be cleaner to do using > > a pseudofs and loop-control device, rather than having to have a > > daemon in userspace on the host farming those out in response to > > some, I don't know, dbus request? > > I agree that loop devices would be nice to have in a container, and that > the existing loop interface doesn't really lend itself to that. So > create a new type of thing that acts like a loop device in a container. > But don't try to mess with the whole driver core just for a single type > of device. No matter what I don't think we get out of this without driver core changes, whether this was done in loop or by creating something new. Not unless the whole thing is punted to userspace, anyway. The first problem is that many block device ioctls check for CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm not really sure. But loop does at minimum support partitions, and to get that functionality in an unprivileged container at least the block layer needs to know the namespace which has privileges for that device. The second is that all block devices automatically appear in devtmpfs. 
The scenario I'm concerned about is that the host could unknowingly use a loop device exposed to a container, then the container could see data from the host. So we either need a flag to tell the driver core not to create a node in devtmpfs, or we need a privileged manager in userspace to remove them (which kind of defeats the purpose). And it gets more complicated when partition block devs are mixed in, because they can be created without involvement from the driver - they would need to inherit the "no devtmpfs node" property from their parent, and if the driver uses a pseudo fs to create device nodes for userspace then it needs to be informed about the partitions too so it can create those nodes. So maybe we could get by without the privileged ioctls, as long as it was understood that unprivileged containers can't do partitioning. But I do think the devtmpfs problem would need to be addressed. Thanks, Seth
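For reference, the existing "give me a loop device to use" path goes through /dev/loop-control. The sketch below assumes the LOOP_CTL_GET_FREE ioctl (0x4C82 in <linux/loop.h>), which returns the index of an unused loop device; the function name and the injectable `ioctl` parameter are illustrative additions so the logic can be exercised without privileges. Note that the node for the returned device appears in the host's devtmpfs, which is the visibility problem described above.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>


def get_free_loop_device(control_path="/dev/loop-control", ioctl=fcntl.ioctl):
    """Ask the loop driver for an unused device and return its /dev path.

    The ioctl argument is injectable for testing; by default the real
    fcntl.ioctl is used, which normally requires access to /dev/loop-control.
    """
    fd = os.open(control_path, os.O_RDWR)
    try:
        index = ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return "/dev/loop%d" % index


if __name__ == "__main__":
    try:
        print(get_free_loop_device())
    except OSError as err:
        # Expected when unprivileged or when the loop driver is not available.
        print("could not query loop-control:", err)
```

Inside an unprivileged container this call would typically fail today, since the control node is either absent or not permitted, which is the gap a loop pseudo filesystem was proposed to fill.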
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 3:42 AM, Michael H. Warfield wrote: > On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote: >> On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote: >> > What exactly defines '"normal" use case for a container'? > >> Well, I'd say "acting like a virtual machine" is a good start :) > > Ok... And virtual machines (VirtualBox, VMware, etc, etc) have hot plug > USB devices. I use the USB hotplug with VirtualBox. I plug a > configured USB device in and the VirtualBox VM grabs it. Virtual > machines have loopback devices. I've used them and using them in > containers is significantly more efficient. VirtualBox has remote audio > and a host of other device features. > > Now we have some agreement. Normal is "acting like a virtual machine". > That's a goal I can agree with. I want to work toward that goal of > containers "acting like a virtual machine" just running on a common > kernel with the host. It's a challenge. We're getting there. > >> > Not too long ago much of what we can now do with network namespaces >> > was not a normal container use case. Neither "you can't do it now" >> > nor "I don't use it like that" should be grounds for a pre-emptive >> > nack. "It will horribly break security assumptions" certainly would >> > be. > >> I agree, and maybe we will get there over time, but this patch is not >> the way to do that. > > Ok... We have a goal. Now we can haggle over the details (to > paraphrase a joke that's as old as I am). > >> > That's not to say there might not be good reasons why this in particular >> > is not appropriate, but ISTM if things are going to be nacked without >> > consideration of the patchset itself, we ought to be having a ksummit >> > session to come to a consensus [ or receive a decree, presumably by you :) >> > but after we have a chance to make our case ] on what things are going to >> > be un/acceptable.
> >> I already stood up and publically said this last year at Plumbers, why >> is anything now different? > > Not much really. The reality is that more and more people are trying to > use hotplug devices, network interfaces, and loopback devices in > containers just like they would in full para or hw virt machines. We're > trying to make them work, without it looking like a kludge. I > personally agree with you that much of this can be done in host user > space and, coming out of LinuxPlumbers last year, I've implemented some > ideas that did not require kernel patches that achieve some of my goals. > >> And this patchset is proof of why it's not a good idea. You really >> didn't do anything with all of the namespace stuff, except change loop. >> That's the only thing that cares, so, just do it there, like I said to >> do so, last August. > >> And you are ignoring the notifications to userspace and how namespaces >> here would deal with that. > > That's a problem to deal with. I don't thing anyone is ignoring them. > >> > > > Serge mentioned something to me about a loopdevfs (?) thing that >> > > > someone >> > > > else is working on. That would seem to be a better solution in this >> > > > particular case but I don't know much about it or where it's at. >> > > >> > > Ok, let's see those patches then. >> > >> > I think Seth has a git tree ready, but not sure which branch he'd want >> > us to look at. >> > >> > Splitting a namespaced devtmpfs from loopdevfs discussion might be >> > sensible. However, in defense of a namespaced devtmpfs I'd say >> > that for userspace to, at every container startup, bind-mount in >> > devices from the global devtmpfs into a private tmpfs (for systemd's >> > sake it can't just be on the container rootfs), seems like something >> > worth avoiding. > >> I think having to pick and choose what device nodes you want in a >> container is a good thing. > > Both static and dynamic devices. It's got to support hotplug. 
We have > (I have) use cases. That's what I'm trying to do with host udev rules > and some custom configurations. I can play games with udev rules. > Maybe we can keep the user spaces policies in user space and not burden > the kernel. > >> Becides, you would have to do the same thing >> in the kernel anyway, what's wrong with userspace making the decision >> here, especially as it knows exactly what it wants to do much more so >> than the kernel ever can. > > IMHO, there's nothing wrong with that as long as we agree on how it's to > be done. I'm not convinced that it can all be done in user space and > I'm not convinced that name spaced devtmpfs is the magic pill to make it > all go away either. Making the user space make the decisions and having > the kernel enforce them is a principle worth considering. > >> > PS - Apparently both parallels and Michael independently >> > project devices which are hot-plugged on the host into containers. >> > That also seems like something worth talking about (best practices, >> > shortcomings, use cases not met by it, any ways tha the ker
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 16, 2014 at 01:49:59AM +, Serge Hallyn wrote:
> > I think having to pick and choose what device nodes you want in a
> > container is a good thing. Besides, you would have to do the same thing
> > in the kernel anyway, what's wrong with userspace making the decision
> > here, especially as it knows exactly what it wants to do much more so
> > than the kernel ever can.
>
> For 'real' devices that sounds sensible. The thing about loop devices
> is that we simply want to allow a container to say "give me a loop
> device to use" and have it receive a unique loop device (or 3), without
> having to pre-assign them. I think that would be cleaner to do using
> a pseudofs and loop-control device, rather than having to have a
> daemon in userspace on the host farming those out in response to
> some, I don't know, dbus request?

I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.

greg k-h
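For readers unfamiliar with the existing interface being discussed: on the host, "give me a loop device to use" is done today through the LOOP_CTL_GET_FREE ioctl on /dev/loop-control. The following Python sketch shows that call. The helper name `get_free_loop` is made up for illustration, and the ioctl number is copied from <linux/loop.h>. This is the current host-global interface, not the per-container pseudofs proposed in the thread, and the returned device is not reserved, so a caller would still have to cope with losing a race for it.

```python
import fcntl
import os

# ioctl request number from <linux/loop.h>
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop(control_path="/dev/loop-control"):
    """Return the path of a free loop device, or None if it cannot be obtained.

    Illustrative helper only: it asks the kernel for the index of an
    unused loop device via the loop-control node.
    """
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        # No loop-control node, or no permission to open it.
        return None
    try:
        # On success the ioctl's return value is the free device index.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
    return f"/dev/loop{index}"
```

Calling `get_free_loop()` as root on a typical host would be expected to return a path such as /dev/loopN; inside an unprivileged container it normally returns None, which is the gap the messages above are arguing about.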
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org): > On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote: > > What exactly defines '"normal" use case for a container'? > > Well, I'd say "acting like a virtual machine" is a good start :) > > > Not too long ago much of what we can now do with network namespaces > > was not a normal container use case. Neither "you can't do it now" > > nor "I don't use it like that" should be grounds for a pre-emptive > > nack. "It will horribly break security assumptions" certainly would > > be. > > I agree, and maybe we will get there over time, but this patch is nto > the way to do that. Ok. [ I/we may be asking for more details later, but think there is enough below :), particularly the point about event forwarding ] Thanks. > > That's not to say there might not be good reasons why this in particular > > is not appropriate, but ISTM if things are going to be nacked without > > consideration of the patchset itself, we ought to be having a ksummit > > session to come to a consensus [ or receive a decree, presumably by you :) > > but after we have a chance to make our case ] on what things are going to > > be un/acceptable. > > I already stood up and publically said this last year at Plumbers, why > is anything now different? Well I've simply never had a chance to talk to you since then to find out exactly what it is that is unacceptable, and why. And, of course, code makes it easier to discuss these things. > And this patchset is proof of why it's not a good idea. You really > didn't do anything with all of the namespace stuff, except change loop. > That's the only thing that cares, so, just do it there, like I said to > do so, last August. Sorry, just do it where? > And you are ignoring the notifications to userspace and how namespaces > here would deal with that. Good point. Addressing that is at the same time necessary, interesting, and complicated. > > > > Serge mentioned something to me about a loopdevfs (?) 
thing that someone > > > > else is working on. That would seem to be a better solution in this > > > > particular case but I don't know much about it or where it's at. > > > > > > Ok, let's see those patches then. > > > > I think Seth has a git tree ready, but not sure which branch he'd want > > us to look at. > > > > Splitting a namespaced devtmpfs from loopdevfs discussion might be > > sensible. However, in defense of a namespaced devtmpfs I'd say > > that for userspace to, at every container startup, bind-mount in > > devices from the global devtmpfs into a private tmpfs (for systemd's > > sake it can't just be on the container rootfs), seems like something > > worth avoiding. > > I think having to pick and choose what device nodes you want in a > container is a good thing. Becides, you would have to do the same thing > in the kernel anyway, what's wrong with userspace making the decision > here, especially as it knows exactly what it wants to do much more so > than the kernel ever can. For 'real' devices that sounds sensible. The thing about loop devices is that we simply want to allow a container to say "give me a loop device to use" and have it receive a unique loop device (or 3), without having to pre-assign them. I think that would be cleaner to do using a pseudofs and loop-control device, rather than having to have a daemon in userspace on the host farming those out in response to some, I don't know, dbus request? > > PS - Apparently both parallels and Michael independently > > project devices which are hot-plugged on the host into containers. > > That also seems like something worth talking about (best practices, > > shortcomings, use cases not met by it, any ways tha the kernel can > > help out) at ksummit/linuxcon. > > I was told that containers would never want devices hotplugged into > them. What use case has this happening / needed? I'm pretty sure I didn't say that . 
But I guess we are combining two topics here, the loop psuedofs and the namespaced devtmpfs. The use case of loop-control device and loop pseudofs is to have multiple chrooted/namespaced programs be able to grab a loop device on demand which they can use for the obvious things (building a livecd, extracting file contents, etc) without stepping on each other's toes. The namespaced devtmpfs is not required for this. One advantage of a namespaced devtmpfs would be sane-looking devices in unprivileged containers. Currently we have to bind-mount the host's /dev/{full,zero,etc} which, due to uid and guid mappings, then shows up as: crw-rw-rw- 1 nobody nogroup 1, 7 May 12 13:35 full Also you mentioned uevent forwarding above. Michael has talked several times about having userspace on the host 'pass' devices into the container. One thing which I believe he and Eric have discussed before was how to have userspace in the container be notified when a device is passed in. It seems to me that at least this is something that would be simpler
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, 2014-05-15 at 15:15 -0700, Greg Kroah-Hartman wrote: > On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote: > > What exactly defines '"normal" use case for a container'? > Well, I'd say "acting like a virtual machine" is a good start :) Ok... And virtual machines (VirtualBox, VMware, etc, etc) have hot plug USB devices. I use the USB hotplug with VirtualBox. I plug a configured USB device in and the VirtualBox VM grabs it. Virtual machines have loopback devices. I've used them and using them in containers is significantly more efficient. VirtualBox has remote audio and a host of other device features. Now we have some agreement. Normal is "acting like a virtual machine". That's a goal I can agree with. I want to work toward that goal of containers "acting like a virtual machine" just running on a common kernel with the host. It's a challenge. We're getting there. > > Not too long ago much of what we can now do with network namespaces > > was not a normal container use case. Neither "you can't do it now" > > nor "I don't use it like that" should be grounds for a pre-emptive > > nack. "It will horribly break security assumptions" certainly would > > be. > I agree, and maybe we will get there over time, but this patch is nto > the way to do that. Ok... We have a goal. Now we can haggle over the details (to paraphrase a joke that's as old as I am). > > That's not to say there might not be good reasons why this in particular > > is not appropriate, but ISTM if things are going to be nacked without > > consideration of the patchset itself, we ought to be having a ksummit > > session to come to a consensus [ or receive a decree, presumably by you :) > > but after we have a chance to make our case ] on what things are going to > > be un/acceptable. > I already stood up and publically said this last year at Plumbers, why > is anything now different? Not much really. 
The reality is that more and more people are trying to use hotplug devices, network interfaces, and loopback devices in containers just like they would in full para or hw virt machines. We're trying to make them work, without it looking like a kludge. I personally agree with you that much of this can be done in host user space and, coming out of LinuxPlumbers last year, I've implemented some ideas that did not require kernel patches that achieve some of my goals. > And this patchset is proof of why it's not a good idea. You really > didn't do anything with all of the namespace stuff, except change loop. > That's the only thing that cares, so, just do it there, like I said to > do so, last August. > And you are ignoring the notifications to userspace and how namespaces > here would deal with that. That's a problem to deal with. I don't thing anyone is ignoring them. > > > > Serge mentioned something to me about a loopdevfs (?) thing that someone > > > > else is working on. That would seem to be a better solution in this > > > > particular case but I don't know much about it or where it's at. > > > > > > Ok, let's see those patches then. > > > > I think Seth has a git tree ready, but not sure which branch he'd want > > us to look at. > > > > Splitting a namespaced devtmpfs from loopdevfs discussion might be > > sensible. However, in defense of a namespaced devtmpfs I'd say > > that for userspace to, at every container startup, bind-mount in > > devices from the global devtmpfs into a private tmpfs (for systemd's > > sake it can't just be on the container rootfs), seems like something > > worth avoiding. > I think having to pick and choose what device nodes you want in a > container is a good thing. Both static and dynamic devices. It's got to support hotplug. We have (I have) use cases. That's what I'm trying to do with host udev rules and some custom configurations. I can play games with udev rules. 
Maybe we can keep the user spaces policies in user space and not burden the kernel. > Becides, you would have to do the same thing > in the kernel anyway, what's wrong with userspace making the decision > here, especially as it knows exactly what it wants to do much more so > than the kernel ever can. IMHO, there's nothing wrong with that as long as we agree on how it's to be done. I'm not convinced that it can all be done in user space and I'm not convinced that name spaced devtmpfs is the magic pill to make it all go away either. Making the user space make the decisions and having the kernel enforce them is a principle worth considering. > > PS - Apparently both parallels and Michael independently > > project devices which are hot-plugged on the host into containers. > > That also seems like something worth talking about (best practices, > > shortcomings, use cases not met by it, any ways tha the kernel can > > help out) at ksummit/linuxcon. > I was told that containers would never want devices hotplugged into > them. Interesting. You were told they (who they?) would never want them? Who s
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> What exactly defines '"normal" use case for a container'?

Well, I'd say "acting like a virtual machine" is a good start :)

> Not too long ago much of what we can now do with network namespaces
> was not a normal container use case. Neither "you can't do it now"
> nor "I don't use it like that" should be grounds for a pre-emptive
> nack. "It will horribly break security assumptions" certainly would
> be.

I agree, and maybe we will get there over time, but this patch is not
the way to do that.

> That's not to say there might not be good reasons why this in particular
> is not appropriate, but ISTM if things are going to be nacked without
> consideration of the patchset itself, we ought to be having a ksummit
> session to come to a consensus [ or receive a decree, presumably by you :)
> but after we have a chance to make our case ] on what things are going to
> be un/acceptable.

I already stood up and publicly said this last year at Plumbers, why
is anything now different?

And this patchset is proof of why it's not a good idea. You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.

And you are ignoring the notifications to userspace and how namespaces
here would deal with that.

> > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > else is working on. That would seem to be a better solution in this
> > > particular case but I don't know much about it or where it's at.
> >
> > Ok, let's see those patches then.
>
> I think Seth has a git tree ready, but not sure which branch he'd want
> us to look at.
>
> Splitting a namespaced devtmpfs from loopdevfs discussion might be
> sensible. However, in defense of a namespaced devtmpfs I'd say
> that for userspace to, at every container startup, bind-mount in
> devices from the global devtmpfs into a private tmpfs (for systemd's
> sake it can't just be on the container rootfs), seems like something
> worth avoiding.

I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.

> PS - Apparently both parallels and Michael independently
> project devices which are hot-plugged on the host into containers.
> That also seems like something worth talking about (best practices,
> shortcomings, use cases not met by it, any ways that the kernel can
> help out) at ksummit/linuxcon.

I was told that containers would never want devices hotplugged into
them. What use case has this happening / needed?

thanks,

greg k-h
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Am 15.05.2014 22:26, schrieb Serge E. Hallyn: > Quoting Richard Weinberger (rich...@nod.at): >> Am 15.05.2014 21:50, schrieb Serge Hallyn: >>> Quoting Richard Weinberger (richard.weinber...@gmail.com): On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman wrote: > Then don't use a container to build such a thing, or fix the build > scripts to not do that :) I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would much better fit in. Please don't put more complexity into containers. They are already horrible complex and error prone. >>> >>> I, naturally, disagree :) The only use case which is inherently not >>> valid for containers is running a kernel. Practically speaking there >>> are other things which likely will never be possible, but if someone >>> offers a way to do something in containers, "you can't do that in >>> containers" is not an apropos response. >>> >>> "That abstraction is wrong" is certainly valid, as when vpids were >>> originally proposed and rejected, resulting in the development of >>> pid namespaces. "We have to work out (x) first" can be valid (and >>> I can think of examples here), assuming it's not just trying to hide >>> behind a catch-22/chicken-egg problem. >>> >>> Finally, saying "containers are complex and error prone" is conflating >>> several large suites of userspace code and many kernel features which >>> support them. Being more precise would, if the argument is valid, >>> lend it a lot more weight. >> >> We (my company) use Linux containers since 2011 in production. First LXC, >> now libvirt-lxc. >> To understand the internals better I also wrote my own userspace to >> create/start >> containers. There are so many things which can hurt you badly. >> With user namespaces we expose a really big attack surface to regular users. >> I.e. Suddenly a user is allowed to mount filesystems. > > That is currently not the case. 
They can mount some virtual filesystems > and do bind mounts, but cannot mount most real filesystems. This keeps > us protected (for now) from potentially unsafe superblock readers in the > kernel. Yeah, I meant not only "real" filesystems. I had VFS issues in mind where an attacker could do bad things using bind mounts for example. >> Ask Andy, he found already lots of nasty things... > > Yes, of course, and there may be more to come... > >> I agree that user namespaces are the way to go, all the papering with LSM >> over security issues is much worse. >> But we have to make sure that we don't add too much features too fast. > > Agreed. Like I said, 'we have to work (x) out first' could be valid, > including 'we should wait (a year?) for user ns issues to fall out > before relaxing any of the current user ns constraints." > > On the other hand, not exercising the new code may only mean that > existing flaws stick around longer, undetected (by most). Fair point. >> That said, I like containers a lot because they are cheap but as they are >> lightweight >> also therefore also isolation level is lightweight. >> IMHO containers are not a cheap replacement for KVM. > > The building blocks for containers can also be used for entirely > new, simpler use cases - i.e. perhaps a new fakeroot alternative based > on user namespace mappings. Which is why "this is not a use case for > containers" is not the right way to push back, whether or not the > feature ends up being appropriate. Agreed. Maybe I'm too pessimistic. We'll see. :-) Thanks, //richard -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Richard Weinberger (rich...@nod.at): > Am 15.05.2014 21:50, schrieb Serge Hallyn: > > Quoting Richard Weinberger (richard.weinber...@gmail.com): > >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman > >> wrote: > >>> Then don't use a container to build such a thing, or fix the build > >>> scripts to not do that :) > >> > >> I second this. > >> To me it looks like some folks try to (ab)use Linux containers > >> for purposes where KVM would much better fit in. > >> Please don't put more complexity into containers. They are already > >> horrible complex > >> and error prone. > > > > I, naturally, disagree :) The only use case which is inherently not > > valid for containers is running a kernel. Practically speaking there > > are other things which likely will never be possible, but if someone > > offers a way to do something in containers, "you can't do that in > > containers" is not an apropos response. > > > > "That abstraction is wrong" is certainly valid, as when vpids were > > originally proposed and rejected, resulting in the development of > > pid namespaces. "We have to work out (x) first" can be valid (and > > I can think of examples here), assuming it's not just trying to hide > > behind a catch-22/chicken-egg problem. > > > > Finally, saying "containers are complex and error prone" is conflating > > several large suites of userspace code and many kernel features which > > support them. Being more precise would, if the argument is valid, > > lend it a lot more weight. > > We (my company) use Linux containers since 2011 in production. First LXC, now > libvirt-lxc. > To understand the internals better I also wrote my own userspace to > create/start > containers. There are so many things which can hurt you badly. > With user namespaces we expose a really big attack surface to regular users. > I.e. Suddenly a user is allowed to mount filesystems. That is currently not the case. 
They can mount some virtual filesystems and do bind mounts, but cannot mount most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the kernel. > Ask Andy, he found already lots of nasty things... Yes, of course, and there may be more to come... > I agree that user namespaces are the way to go, all the papering with LSM > over security issues is much worse. > But we have to make sure that we don't add too much features too fast. Agreed. Like I said, 'we have to work (x) out first' could be valid, including 'we should wait (a year?) for user ns issues to fall out before relaxing any of the current user ns constraints." On the other hand, not exercising the new code may only mean that existing flaws stick around longer, undetected (by most). > That said, I like containers a lot because they are cheap but as they are > lightweight > also therefore also isolation level is lightweight. > IMHO containers are not a cheap replacement for KVM. The building blocks for containers can also be used for entirely new, simpler use cases - i.e. perhaps a new fakeroot alternative based on user namespace mappings. Which is why "this is not a use case for containers" is not the right way to push back, whether or not the feature ends up being appropriate. -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
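As background for the "user namespace mappings" mentioned at the end of the message above, here is a small Python sketch of how an ID map translates an ID from outside a user namespace to the ID seen inside it. It assumes the three-column format of /proc/PID/uid_map (ID inside, ID outside, length) and the default overflow ID of 65534. The function names and the example map are invented for illustration and do not come from the thread.

```python
# Default value of /proc/sys/kernel/overflowuid, usually shown as "nobody".
OVERFLOW_ID = 65534


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def to_namespace_id(outside_id, id_map):
    """Translate an ID from outside the namespace to the ID seen inside it.

    IDs that fall outside every mapped range appear as the overflow ID.
    """
    for inside_start, outside_start, length in id_map:
        if outside_start <= outside_id < outside_start + length:
            return inside_start + (outside_id - outside_start)
    return OVERFLOW_ID


# Example map: container IDs 0..65535 backed by host IDs 100000..165535.
id_map = parse_id_map("         0     100000      65536\n")

print(to_namespace_id(100000, id_map))  # 0     (container root)
print(to_namespace_id(101000, id_map))  # 1000
print(to_namespace_id(0, id_map))       # 65534 (host root is unmapped)
```

The last line shows why files owned by host root, such as bind-mounted device nodes, show up inside an unprivileged container as owned by nobody.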
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Am 15.05.2014 21:50, schrieb Serge Hallyn: > Quoting Richard Weinberger (richard.weinber...@gmail.com): >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman >> wrote: >>> Then don't use a container to build such a thing, or fix the build >>> scripts to not do that :) >> >> I second this. >> To me it looks like some folks try to (ab)use Linux containers >> for purposes where KVM would much better fit in. >> Please don't put more complexity into containers. They are already >> horrible complex >> and error prone. > > I, naturally, disagree :) The only use case which is inherently not > valid for containers is running a kernel. Practically speaking there > are other things which likely will never be possible, but if someone > offers a way to do something in containers, "you can't do that in > containers" is not an apropos response. > > "That abstraction is wrong" is certainly valid, as when vpids were > originally proposed and rejected, resulting in the development of > pid namespaces. "We have to work out (x) first" can be valid (and > I can think of examples here), assuming it's not just trying to hide > behind a catch-22/chicken-egg problem. > > Finally, saying "containers are complex and error prone" is conflating > several large suites of userspace code and many kernel features which > support them. Being more precise would, if the argument is valid, > lend it a lot more weight. We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers. There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a user is allowed to mount filesystems. Ask Andy, he found already lots of nasty things... I agree that user namespaces are the way to go, all the papering with LSM over security issues is much worse. 
But we have to make sure that we don't add too many features too fast.

That said, I like containers a lot because they are cheap, but because they are lightweight, their isolation level is lightweight as well. IMHO containers are not a cheap replacement for KVM.

Thanks,
//richard
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Richard Weinberger (richard.weinber...@gmail.com):
> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> wrote:
> > Then don't use a container to build such a thing, or fix the build
> > scripts to not do that :)
>
> I second this.
> To me it looks like some folks try to (ab)use Linux containers
> for purposes where KVM would much better fit in.
> Please don't put more complexity into containers. They are already
> horribly complex and error prone.

I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.

"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.

Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman wrote:
> Then don't use a container to build such a thing, or fix the build
> scripts to not do that :)

I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horribly complex and error prone.

--
Thanks,
//richard
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 05:42:54PM +, Serge Hallyn wrote:
> > > Serge mentioned something to me about a loopdevfs (?) thing that someone
> > > else is working on. That would seem to be a better solution in this
> > > particular case but I don't know much about it or where it's at.
> >
> > Ok, let's see those patches then.
>
> I think Seth has a git tree ready, but not sure which branch he'd want
> us to look at.

I think the most recent code I've got is the devloop branch of
http://kernel.ubuntu.com/git/sforshee/ubuntu-trusty.git, which is still
a bit messy but gets the idea across.

I switched from that to the devtmpfs approach, though, for several
reasons: the pseudo-fs approach required some (in my opinion)
undesirable collateral changes, it would require changes to userspace
tools (though likely small), and it solves the problem only for loop
devices. Plus, if you don't push namespace awareness down to at least
the generic block layer, you still can't do partitions or encrypted
loop, and then there are still other problems which need to be solved
to get partition blkdevs inside the mount.
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Quoting Greg Kroah-Hartman (gre...@linuxfoundation.org): > On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote: > > On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote: > > > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote: > > > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote: > > > > > > > Using devtmpfs is one possible > > > > > > > solution, and it would have the added benefit of making container > > > > > > > setup > > > > > > > simpler. But simply letting containers mount devtmpfs isn't > > > > > > > sufficient > > > > > > > since the container may need to see a different, more limited set > > > > > > > of > > > > > > > devices, and because different environments making modifications > > > > > > > to > > > > > > > the filesystem could lead to conflicts. > > > > > > > > > > > > > > This series solves these problems by assigning devices to user > > > > > > > namespaces. Each device has an "owner" namespace which specifies > > > > > > > which > > > > > > > devtmpfs mount the device should appear in as well allowing > > > > > > > priveleged > > > > > > > operations on the device from that namespace. This defaults to > > > > > > > init_user_ns. There's also an ns_global flag to indicate a device > > > > > > > should > > > > > > > appear in all devtmpfs mounts. > > > > > > > > > > > I'd strongly argue that this isn't even a "problem" at all. And, > > > > > > as I > > > > > > said at the Plumbers conference last year, adding namespaces to > > > > > > devices > > > > > > isn't going to happen, sorry. Please don't continue down this path. > > > > > > > > > > I was just mentioning that to Serge just a week or so ago reminding > > > > > him > > > > > of what you told all of us face to face back then. We were having a > > > > > discussion over loop devices into containers and this topic came up. 
> > > > > > > > It was the loop device use case that got me started down this path in > > > > the first place, so I don't personally have any interest in physical > > > > devices right now (though I was sure others would). > > > > > Why do you want to give access to a loop device to a container? > > > Shouldn't you set up the loop devices before creating the container and > > > then pass those mount points into the container? I thought that was how > > > things worked today, or am I missing something? > > > > Ah, you keep feeding me easy ones. I need raw access to loop devices > > and loop-control because I'm using containers to build NST (Network > > Security Toolkit) distribution iso images (one container is x86_64 while > > the other is i686). Each requires 2 loop devices. You can't set up the > > loop devices in advance since the containers will be creating the images > > and building them. NST tinkers with the base build engine > > configuration, so I really DON'T want it running on a hard iron host. > > There may be other cases where I need other specialized containers for > > building distros. I'm also looking at custom builds of Kali (another > > security distribution). > > Then don't use a container to build such a thing, or fix the build > scripts to not do that :) > > That is not a "normal" use case for a container at all. Containers are > not for "everything", use a virtual machine for some tasks (like this > one). Hi Greg, What exactly defines '"normal" use case for a container'? Not too long ago much of what we can now do with network namespaces was not a normal container use case. Neither "you can't do it now" nor "I don't use it like that" should be grounds for a pre-emptive nack. "It will horribly break security assumptions" certainly would be. 
That's not to say there might not be good reasons why this in particular is not appropriate, but ISTM if things are going to be nacked without consideration of the patchset itself, we ought to be having a ksummit session to come to a consensus [ or receive a decree, presumably by you :) but after we have a chance to make our case ] on what things are going to be un/acceptable. > > Serge mentioned something to me about a loopdevfs (?) thing that someone > > else is working on. That would seem to be a better solution in this > > particular case but I don't know much about it or where it's at. > > Ok, let's see those patches then. I think Seth has a git tree ready, but not sure which branch he'd want us to look at. Splitting a namespaced devtmpfs from loopdevfs discussion might be sensible. However, in defense of a namespaced devtmpfs I'd say that for userspace to, at every container startup, bind-mount in devices from the global devtmpfs into a private tmpfs (for systemd's sake it can't just be on the container rootfs), seems like something worth avoiding. -serge PS - Apparently both parallels and Michael independently project devices which are hot-plugged on the host into containers. That also seems like something worth talking about (best practices, shortcomings,
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Thu, May 15, 2014 at 09:42:17AM -0400, Michael H. Warfield wrote: > On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote: > > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote: > > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote: > > > > > > Using devtmpfs is one possible > > > > > > solution, and it would have the added benefit of making container > > > > > > setup > > > > > > simpler. But simply letting containers mount devtmpfs isn't > > > > > > sufficient > > > > > > since the container may need to see a different, more limited set of > > > > > > devices, and because different environments making modifications to > > > > > > the filesystem could lead to conflicts. > > > > > > > > > > > > This series solves these problems by assigning devices to user > > > > > > namespaces. Each device has an "owner" namespace which specifies > > > > > > which > > > > > > devtmpfs mount the device should appear in as well allowing > > > > > > priveleged > > > > > > operations on the device from that namespace. This defaults to > > > > > > init_user_ns. There's also an ns_global flag to indicate a device > > > > > > should > > > > > > appear in all devtmpfs mounts. > > > > > > > > > I'd strongly argue that this isn't even a "problem" at all. And, as I > > > > > said at the Plumbers conference last year, adding namespaces to > > > > > devices > > > > > isn't going to happen, sorry. Please don't continue down this path. > > > > > > > > I was just mentioning that to Serge just a week or so ago reminding him > > > > of what you told all of us face to face back then. We were having a > > > > discussion over loop devices into containers and this topic came up. > > > > > > It was the loop device use case that got me started down this path in > > > the first place, so I don't personally have any interest in physical > > > devices right now (though I was sure others would). > > > Why do you want to give access to a loop device to a container? 
> > Shouldn't you set up the loop devices before creating the container and > > then pass those mount points into the container? I thought that was how > > things worked today, or am I missing something? > > Ah, you keep feeding me easy ones. I need raw access to loop devices > and loop-control because I'm using containers to build NST (Network > Security Toolkit) distribution iso images (one container is x86_64 while > the other is i686). Each requires 2 loop devices. You can't set up the > loop devices in advance since the containers will be creating the images > and building them. NST tinkers with the base build engine > configuration, so I really DON'T want it running on a hard iron host. > There may be other cases where I need other specialized containers for > building distros. I'm also looking at custom builds of Kali (another > security distribution). Then don't use a container to build such a thing, or fix the build scripts to not do that :) That is not a "normal" use case for a container at all. Containers are not for "everything", use a virtual machine for some tasks (like this one). > Serge mentioned something to me about a loopdevfs (?) thing that someone > else is working on. That would seem to be a better solution in this > particular case but I don't know much about it or where it's at. Ok, let's see those patches then. thanks, greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Wed, 2014-05-14 at 21:00 -0700, Greg Kroah-Hartman wrote: > On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote: > > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote: > > > > > Using devtmpfs is one possible > > > > > solution, and it would have the added benefit of making container > > > > > setup > > > > > simpler. But simply letting containers mount devtmpfs isn't sufficient > > > > > since the container may need to see a different, more limited set of > > > > > devices, and because different environments making modifications to > > > > > the filesystem could lead to conflicts. > > > > > > > > > > This series solves these problems by assigning devices to user > > > > > namespaces. Each device has an "owner" namespace which specifies which > > > > > devtmpfs mount the device should appear in as well allowing priveleged > > > > > operations on the device from that namespace. This defaults to > > > > > init_user_ns. There's also an ns_global flag to indicate a device > > > > > should > > > > > appear in all devtmpfs mounts. > > > > > > > I'd strongly argue that this isn't even a "problem" at all. And, as I > > > > said at the Plumbers conference last year, adding namespaces to devices > > > > isn't going to happen, sorry. Please don't continue down this path. > > > > > > I was just mentioning that to Serge just a week or so ago reminding him > > > of what you told all of us face to face back then. We were having a > > > discussion over loop devices into containers and this topic came up. > > > > It was the loop device use case that got me started down this path in > > the first place, so I don't personally have any interest in physical > > devices right now (though I was sure others would). > Why do you want to give access to a loop device to a container? > Shouldn't you set up the loop devices before creating the container and > then pass those mount points into the container? 
I thought that was how
> things worked today, or am I missing something?

Ah, you keep feeding me easy ones. I need raw access to loop devices and loop-control because I'm using containers to build NST (Network Security Toolkit) distribution iso images (one container is x86_64 while the other is i686). Each requires 2 loop devices. You can't set up the loop devices in advance since the containers will be creating the images and building them. NST tinkers with the base build engine configuration, so I really DON'T want it running on a hard iron host. There may be other cases where I need other specialized containers for building distros. I'm also looking at custom builds of Kali (another security distribution).

> Giving the ability for a container to create a loop device at all is a
> horrid idea, as you have pointed out, lots of information leakage could
> easily happen.

It does, but only slightly. I noticed that losetup will list all the devices regardless of the container where it is run or the container where they were set up. But that seems to be largely cosmetic. You can't do anything with the loop device in the other container. You can't disconnect it, read it, or mount it (I've tested it). In the former case, losetup returns with no error but does nothing. In the latter case, you get a busy error. Not clean, not pretty, but no damage.

Since loop-control is working on the global pool of loop devices, it's impossible to know what device to move to what container when the container runs losetup.

For me, this isn't a serious problem, since it only involves 2 specialized containers out of over 4 dozen containers I have running across 3 sites. And those two containers are under my explicit and exclusive control. None of the others need it. I can get away with adding extra loop devices and adding them to the containers and let losetup deal with allocation and contention.

Serge mentioned something to me about a loopdevfs (?) thing that someone else is working on.
That would seem to be a better solution in this particular case but I don't know much about it or where it's at.

Mind you, I heard your arguments at LinuxPlumbers regarding pushing user space policies into the kernel and all, and basically I agree with you: this should be handled in host system user space, and it seems reasonable. I'm just pointing out real world cases I have in operation right now and pointing out that I have solutions for them in host user space, even if some of them may not be aesthetically pretty.

> greg k-h

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 | http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds. A pessimist is sure of it!
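The observation above, that losetup shows every loop device no matter which container set it up, can be illustrated with a small host-side sketch. It assumes sysfs is mounted and that each bound loop device exposes a `loop/backing_file` attribute under `/sys/block`. The function name `bound_loop_devices` and its `sysfs_block` parameter are made up for this example. Whether a given container sees the same sysfs view depends on how it was set up.

```python
import os


def bound_loop_devices(sysfs_block="/sys/block"):
    """Map loop device names to their backing files.

    Assumption: a bound loop device has a loop/backing_file attribute in
    sysfs; devices without it are treated as unbound and skipped.
    """
    bound = {}
    for name in sorted(os.listdir(sysfs_block)):
        if not name.startswith("loop"):
            continue  # not a loop device (e.g. sda, dm-0)
        attr = os.path.join(sysfs_block, name, "loop", "backing_file")
        try:
            with open(attr) as f:
                bound[name] = f.read().strip()
        except OSError:
            continue  # attribute missing or unreadable: treat as unbound
    return bound
```

Running `bound_loop_devices()` on the host and comparing the result against what each container was supposed to own is one way to spot the "cosmetic" cross-container visibility described above.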
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Wed, May 14, 2014 at 10:15:27PM -0500, Seth Forshee wrote: > On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote: > > > > Using devtmpfs is one possible > > > > solution, and it would have the added benefit of making container setup > > > > simpler. But simply letting containers mount devtmpfs isn't sufficient > > > > since the container may need to see a different, more limited set of > > > > devices, and because different environments making modifications to > > > > the filesystem could lead to conflicts. > > > > > > > > This series solves these problems by assigning devices to user > > > > namespaces. Each device has an "owner" namespace which specifies which > > > > devtmpfs mount the device should appear in as well allowing priveleged > > > > operations on the device from that namespace. This defaults to > > > > init_user_ns. There's also an ns_global flag to indicate a device should > > > > appear in all devtmpfs mounts. > > > > > I'd strongly argue that this isn't even a "problem" at all. And, as I > > > said at the Plumbers conference last year, adding namespaces to devices > > > isn't going to happen, sorry. Please don't continue down this path. > > > > I was just mentioning that to Serge just a week or so ago reminding him > > of what you told all of us face to face back then. We were having a > > discussion over loop devices into containers and this topic came up. > > It was the loop device use case that got me started down this path in > the first place, so I don't personally have any interest in physical > devices right now (though I was sure others would). Why do you want to give access to a loop device to a container? Shouldn't you set up the loop devices before creating the container and then pass those mount points into the container? I thought that was how things worked today, or am I missing something? 
Giving the ability for a container to create a loop device at all is a horrid idea, as you have pointed out, lots of information leakage could easily happen.

greg k-h
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Wed, May 14, 2014 at 10:17:31PM -0400, Michael H. Warfield wrote:
> > > Using devtmpfs is one possible
> > > solution, and it would have the added benefit of making container setup
> > > simpler. But simply letting containers mount devtmpfs isn't sufficient
> > > since the container may need to see a different, more limited set of
> > > devices, and because different environments making modifications to
> > > the filesystem could lead to conflicts.
> > >
> > > This series solves these problems by assigning devices to user
> > > namespaces. Each device has an "owner" namespace which specifies which
> > > devtmpfs mount the device should appear in as well allowing priveleged
> > > operations on the device from that namespace. This defaults to
> > > init_user_ns. There's also an ns_global flag to indicate a device should
> > > appear in all devtmpfs mounts.
>
> > I'd strongly argue that this isn't even a "problem" at all. And, as I
> > said at the Plumbers conference last year, adding namespaces to devices
> > isn't going to happen, sorry. Please don't continue down this path.
>
> I was just mentioning that to Serge just a week or so ago reminding him
> of what you told all of us face to face back then. We were having a
> discussion over loop devices into containers and this topic came up.

It was the loop device use case that got me started down this path in the first place, so I don't personally have any interest in physical devices right now (though I was sure others would).

As things stand today, to support loop devices lxc would need to do something like this: grab some unused loop devices, remove them from /dev, and make device nodes with appropriate ownership/permissions in the container's /dev. Otherwise there's potential for accidental duplicate use of the devices, which besides having unexpected results could result in an information leak into the container.
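The "as things stand today" procedure can be sketched roughly as follows. This is an illustration only, not lxc's actual code. It assumes the loop block major is 7 and that the minor number equals the loop index (true when loop partitions are not enabled via max_part). The function names, the container path, and the uid/gid parameters are hypothetical. Removing the chosen nodes from the host's /dev is left out.

```python
import os
import stat

LOOP_MAJOR = 7  # block major number used by the loop driver


def loop_device_number(index):
    """Device number for /dev/loop<index>, assuming minor == index."""
    if index < 0:
        raise ValueError("loop index must be non-negative")
    return os.makedev(LOOP_MAJOR, index)


def plan_loop_nodes(container_dev, indices):
    """Return (path, device number) pairs for the nodes to create."""
    return [
        (os.path.join(container_dev, "loop%d" % i), loop_device_number(i))
        for i in indices
    ]


def create_loop_nodes(container_dev, indices, uid, gid, mode=0o660):
    """Create the block device nodes inside the container's /dev.

    Must run on the host with CAP_MKNOD; uid/gid are the host-side ids
    that the container's root maps to (hypothetical values).
    """
    for path, devnum in plan_loop_nodes(container_dev, indices):
        os.mknod(path, mode | stat.S_IFBLK, devnum)
        os.chmod(path, mode)  # mknod's mode is filtered by the umask
        os.chown(path, uid, gid)
```

For example, `create_loop_nodes("/var/lib/lxc/build/rootfs/dev", [4, 5], 100000, 100000)` would hand loop4 and loop5 to a container whose root maps to host uid 100000. Nothing here stops another container or the host from also using those devices, which is the duplicate-use risk mentioned above.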
At that point you have some loop devices that the container can use, but privileged operations such as re-reading partitions and encrypted loop aren't possible. Even if you can re-read partitions, the device nodes will appear in the main /dev and not in the container.

With these patches the container could mount devtmpfs, and since loop-control is global it would appear in the mount. The LOOP_CTL_GET_FREE ioctl can be used to get an unused loop device which will be owned by the container's user namespace, so it will only appear in that container's devtmpfs mount. Privileged operations would be allowed on the loop device by root in the namespace, and if partition devices were created they would inherit the namespace from the parent and thus show up in the container's devtmpfs mount.

I think this use case demonstrates some real problems with only half-way solutions atm. I'm certainly open to other suggestions about how to solve them.

Thanks,
Seth
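For readers unfamiliar with loop-control, here is a minimal sketch of asking the kernel for an unused loop device with LOOP_CTL_GET_FREE. The ioctl number 0x4C82 comes from <linux/loop.h>. The helper names are made up for this example. The per-namespace ownership described above is behaviour proposed by this RFC series, not something this snippet demonstrates on a stock kernel.

```python
import fcntl
import os

LOOP_CONTROL = "/dev/loop-control"
LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>


def loop_device_path(index):
    """Path of the loop device with the given index."""
    if index < 0:
        raise ValueError("loop index must be non-negative")
    return "/dev/loop%d" % index


def get_free_loop_device(control_path=LOOP_CONTROL):
    """Ask loop-control for an unused loop device and return its path.

    The ioctl returns the index of a free device, allocating a new one
    if none is free. Opening loop-control normally requires privilege.
    """
    fd = os.open(control_path, os.O_RDWR)
    try:
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return loop_device_path(index)
```

Calling `get_free_loop_device()` as root returns something like a `/dev/loopN` path. The device is not reserved by this call, so two callers can race for the same index until one of them binds a backing file to it.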
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Wed, 2014-05-14 at 18:32 -0700, Greg Kroah-Hartman wrote: > On Wed, May 14, 2014 at 04:34:48PM -0500, Seth Forshee wrote: > > Unpriveleged containers cannot run mknod, making it difficult to support > > devices which appear at runtime. > Wait. > Why would you even want a container to see a "new" device? That's the > whole point, your container should see a "clean" system, not the "this > USB device was just plugged in" system. Otherwise, how are you going to > even tell that container a new device showed up? Are you now going to > add udev support in containers? Hah, no. Oooo... I can answer that... Tell me if you've heard this one before... (You have back in NOLA last summer)... I use a USB sharing device that controls a multiport USB serial device controlling serial consoles to 16 servers and shared between 4 controlling servers. The sharing control port (a USB HID device) should be shared between designated containers so that any designated container owner can "request" a console to one of the other servers (yeah, I know there can be contention but that's the way the cookie crumbles - most of the time it's on the master host). Once they get the sharing device's attention, they "lose" that HID control device (it disappears from /dev entirely) and they gain only their designated USBtty{n} device for their console. Dynamic devices at their finest. I worked out a way of dealing with it using udev rules in the host and shifting devices using subdirectories in /dev. I got the infrastructure implemented but didn't finish the specific udev rules. > > Using devtmpfs is one possible > > solution, and it would have the added benefit of making container setup > > simpler. But simply letting containers mount devtmpfs isn't sufficient > > since the container may need to see a different, more limited set of > > devices, and because different environments making modifications to > > the filesystem could lead to conflicts. 
> > > > This series solves these problems by assigning devices to user > > namespaces. Each device has an "owner" namespace which specifies which > > devtmpfs mount the device should appear in as well allowing priveleged > > operations on the device from that namespace. This defaults to > > init_user_ns. There's also an ns_global flag to indicate a device should > > appear in all devtmpfs mounts. > I'd strongly argue that this isn't even a "problem" at all. And, as I > said at the Plumbers conference last year, adding namespaces to devices > isn't going to happen, sorry. Please don't continue down this path. I was just mentioning that to Serge just a week or so ago reminding him of what you told all of us face to face back then. We were having a discussion over loop devices into containers and this topic came up. > greg k-h Regards, Mike -- Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/ NIC whois: MHW9 | An optimist believes we live in the best of all PGP Key: 0x674627FF| possible worlds. A pessimist is sure of it! signature.asc Description: This is a digitally signed message part