On Fri, May 23, 2014 at 6:16 AM, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
>> On 05/20/2014 05:19 PM, Serge Hallyn wrote:
>> > Quoting Andy Lutomirski (l...@amacapital.net):
>> >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <se...@hallyn.com> wrote:
>> >>>
>> >>> Quoting Richard Weinberger (rich...@nod.at):
>> >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>> >>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>> >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
>> >>>>>> <gre...@linuxfoundation.org> wrote:
>> >>>>>>> Then don't use a container to build such a thing, or fix the build 
>> >>>>>>> scripts to not do that :)
>> >>>>>>
>> >>>>>> I second this. To me it looks like some folks try to (ab)use Linux 
>> >>>>>> containers for purposes where KVM
>> >>>>>> would be a much better fit. Please don't put more complexity into 
>> >>>>>> containers. They are already horribly
>> >>>>>> complex and error-prone.
>> >>>>>
>> >>>>> I, naturally, disagree :)  The only use case which is inherently not 
>> >>>>> valid for containers is running a
>> >>>>> kernel.  Practically speaking there are other things which likely will 
>> >>>>> never be possible, but if someone
>> >>>>> offers a way to do something in containers, "you can't do that in 
>> >>>>> containers" is not an apropos response.
>> >>>>>
>> >>>>> "That abstraction is wrong" is certainly valid, as when vpids were 
>> >>>>> originally proposed and rejected,
>> >>>>> resulting in the development of pid namespaces.  "We have to work out 
>> >>>>> (x) first" can be valid (and I can
>> >>>>> think of examples here), assuming it's not just trying to hide behind 
>> >>>>> a catch-22/chicken-egg problem.
>> >>>>>
>> >>>>> Finally, saying "containers are complex and error prone" is conflating 
>> >>>>> several large suites of userspace
>> >>>>> code and many kernel features which support them.  Being more precise 
>> >>>>> would, if the argument is valid, lend
>> >>>>> it a lot more weight.
>> >>>>
>> >>>> We (my company) have used Linux containers in production since 2011. First 
>> >>>> LXC, now libvirt-lxc. To understand the
>> >>>> internals better I also wrote my own userspace to create/start 
>> >>>> containers. There are so many things which can
>> >>>> hurt you badly. With user namespaces we expose a really big attack 
>> >>>> surface to regular users, i.e. suddenly a
>> >>>> user is allowed to mount filesystems.
>> >>>
>> >>> That is currently not the case.  They can mount some virtual filesystems 
>> >>> and do bind mounts, but cannot mount
>> >>> most real filesystems.  This keeps us protected (for now) from 
>> >>> potentially unsafe superblock readers in the
>> >>> kernel.
>> >>>
>> >>>> Ask Andy, he has already found lots of nasty things...
>> >>
>> >> I don't think I have anything brilliant to add to this discussion right 
>> >> now, except possibly:
>> >>
>> >> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
>> >> shenanigans that would happen if an
>> >> untrusted user can cause a block device to appear.  That user doesn't 
>> >> need permission to mount it
>> >
>> > Interesting point.  This would further suggest that we absolutely must 
>> > ensure that a loop device which shows up in
>> > the container does not also show up in the host.
>>
>> Can I suggest the usage of the devices cgroup to achieve that?
>
> Not really ... cgroups impose resource limits, it's namespaces that
> impose visibility separations.  In theory this can be done with the
> device namespace that's been proposed; however, a simpler way is simply
> to rm the device node in the host and mknod it in the guest.  I don't
> really see host visibility as a huge problem: in a shared OS
> virtualisation it's not really possible securely to separate the guest
> from the host (only vice versa).
>
> But I really don't think we want to do it this way.  Giving a container
> the ability to do a mount is too dangerous.  What we want to do is
> intercept the mount in the host and perform it on behalf of the guest as
> host root in the guest's mount namespace.  If you do it that way, it
> doesn't really matter what device actually shows up in the guest, as
> long as the host knows what to do when the mount request comes along.

This is only useful/safe if the host understands what's going on.  By
the host, I mean the host's udev and other system-level stuff.  This
is probably fine for disks and such, but it might not be so great for
loop devices, FUSE, etc.  I already know of one user of containers
that wants container-local FUSE mounts.  This ought to Just Work (tm),
but there's a fair amount of work needed to get there.
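
For illustration only (this is not code from the thread): a minimal sketch
of the host-side half of the "intercept the mount and perform it on behalf
of the guest" idea, where the host decides whether it understands a request
before acting on it. The function name, the allowlist contents, and the
example PID and paths are all hypothetical. The sketch assumes the
util-linux nsenter and mount tools and only builds the command line; it
never executes anything.

```python
import shlex

# Hypothetical policy: filesystem types the host is willing to mount on a
# guest's behalf. The entries are placeholders, not a recommendation.
ALLOWED_FSTYPES = {"ext4", "xfs"}


def build_guest_mount_argv(guest_pid, fstype, source, target):
    """Return the argv the host would run as root, or raise PermissionError.

    The command enters the mount namespace of guest_pid via nsenter and
    runs mount there. Nothing is executed by this function.
    """
    if fstype not in ALLOWED_FSTYPES:
        # The host does not understand this request, so it refuses it.
        raise PermissionError(f"host policy refuses fstype {fstype!r}")
    return [
        "nsenter", "--target", str(guest_pid), "--mount", "--",
        "mount", "-t", fstype, source, target,
    ]


if __name__ == "__main__":
    # Hypothetical request from a guest whose init has PID 1234 on the host.
    argv = build_guest_mount_argv(1234, "ext4", "/dev/sdb1", "/mnt/data")
    print(shlex.join(argv))
```

A request for a type outside the allowlist (FUSE, say) is simply refused
here, which is the limitation described above: the scheme only covers
cases the host already knows how to handle.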

--Andy