On Mon, May 09, 2016 at 04:26:30PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tix...@gmail.com):
> > Hi,
[...]
> >
> > After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> > the user namespace mapping, I guess you drop capabilities, do setuid()
> > or whatever and start the PID 1 or the app of the container.
> >
> > Now and to not confuse more Dave, since he doesn't like the idea of
> > a shared backing device, and me neither for obvious reasons! the shared
> > device should not be used for a rootfs, maybe for read-only user shared
> > data, or shared config, that's it... but for real rootfs they should have
> > their own *different* backing device! unless you know what you are doing
> > hehe I don't want to confuse people, and I just lack time, will also
> > respond to Dave email.
>
> Yes. We're saying slightly different things. You're saying that the admin
> should assign different backing stores for containers. I'm saying perhaps
> the kernel should enforce that, because $leaks. Let's say the host admin
> did a perfect setup of a container with shifted uids. Now he wants to
> run a quick ps in the container... he does it in a way that leaks a
> /proc/pid reference into the container so that (evil) container root can
> use /proc/pid/root/ to get a toehold into the host /. Does he now have
> shifted access to that?
No, assuming host / or its other mount points are not mounted with the
vfs_shift_uids and vfs_shift_gids options; in that case no shift is
performed at all.

1) If you mount host / with vfs_shift_uids and vfs_shift_gids, it's like
real root in init_user_ns doing "chmod -R o+rwx /"... It does not make
sense, and since no one can edit/remount mounts to change their options
in the mount namespace of init_user_ns, it's safe, and not enabled by
default.

2) That's also why filesystems must support this explicitly; it is not
done on their behalf.

IMO the kernel is already enforcing this: even if you assign different
backing stores to containers, you can't have shifted access there unless
you explicitly tell the kernel that the mount is meant to be shifted, by
adding the vfs_shift_uids and vfs_shift_gids mount options.

> I think if we say "this blockdev will have shifted uids in
> /proc/$pid/ns/user",
> then immediately that blockdev becomes not-readable (or not-executable)
> in any namespace which does not have /proc/$pid/ns/user as an ancestor.

Hmm,

(1) This won't work, since to do that you have to know /proc/$pid/ns/user
in advance, and since filesystems can't be mounted inside a user
namespace this brings us back to the same blocker...! And in our use case
we only want to shift UIDs/GIDs to access inodes; there is no need to
expose the whole filesystem. Root is responsible and filesystems stay
safe.

(2) Why complicate things? The kernel already supports this, and it's a
generic solution. As said, you can just create new mount namespaces,
mount things there private or slave, and mount your blockdev that will be
shifted by the processes that inherit that mount. You can even have
intermediate mount namespaces that you forget/unref at any moment, which
are only used to perform setup and which no other process/code can
enter... You don't have any leaks, nothing! You control that piece of
code.

If you want that blockdev to become not-readable or noexec in any
namespace which does not have /proc/$pid/ns/user as an ancestor, the
kernel allows a better interface: it allows that blockdev to not even
show up in any ancestor, by making use of mount namespaces and
MS_PRIVATE, MS_SLAVE... No one will even notice that the mount exists.
However, if you want to access that blockdev for whatever reason, then
create a new mount namespace, use MS_PRIVATE, MS_SLAVE and all the noexec
flags, and mount it.

Yes, slightly different things, but I don't want to add complexity where
the interface already exists in the kernel...

> With obvious check as in write-versus-execute exclusion that you cannot
> mark the blockdev shifted if ancestor user_ns already has a file open for
> execute.

Please note here that it's the same ancestor who will mark the blockdev
to be shifted; why would the ancestor keep, at the same time, a file open
in that filesystem that is meant to be shifted, and later execute through
that fd a program that was just crafted by an untrusted container?!

For me the kernel already offers the interfaces; there is no need to
complicate things or enforce it... As said in other responses, the design
of these patches is to just use what the kernel already provides.
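To make the last point concrete, here is a rough sketch (not part of the
patch set) of a throwaway mount namespace where the admin can mount and
inspect a shifted blockdev without it ever showing up, or being
executable, anywhere else. It assumes the vfs_shift_uids/vfs_shift_gids
mount options from this series; the device and target paths are only
examples:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

int main(void)
{
	/* New mount namespace: mounts done here are invisible outside it. */
	if (unshare(CLONE_NEWNS) < 0) {
		perror("unshare(CLONE_NEWNS)");
		return EXIT_FAILURE;
	}

	/* Make everything private so nothing propagates back to the parent. */
	if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
		perror("mount MS_REC|MS_PRIVATE");
		return EXIT_FAILURE;
	}

	/*
	 * Mount the container's backing device with the shift options plus
	 * the usual hardening flags; only processes that inherit this mount
	 * namespace will ever see the mount.
	 */
	if (mount("/dev/vg0/container1", "/var/lib/containers/c1", "ext4",
		  MS_NOEXEC | MS_NOSUID | MS_NODEV,
		  "vfs_shift_uids,vfs_shift_gids") < 0) {
		perror("mount container backing device");
		return EXIT_FAILURE;
	}

	/* ... inspect or set up the filesystem from here ... */
	return EXIT_SUCCESS;
}

When the last task in that mount namespace exits, the namespace and the
mount go away with it, so there is nothing left to leak back.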
> BTW, perhaps I should do this in a separate email, but here is how I would
> expect to use this:
>
> 1. Using zfs: I create a bare (unshifted) rootfs fs1. When I want to
> create a new container, I zfs clone fs1 to fs2, and let the container
> use fs2 shifted. No danger to fs1 since fs2 is cow. Same with btrfs.

Yes, that would work. Since fs1 is unshifted, the only requirement is
that fs2 should not reside on the same backing store as fs1, so that it
does not share quota with fs1 (I'm not a ZFS user...). And you can always
make the parent of the fs2 mount point, or the containers directories,
0700... root should not go there and exec programs, just as it's not safe
to go into /home/$user... and exec...

> 2. Using overlay: I create a bare (unshifted) rootfs fs1. When I want
> to create a new container, I mount fs1 read-only and shifted as base
> layer, then fs2 as the rw layer.

Yes, here you may share quota if all the fs2 rw layers of all containers
reside on the same backing store... but here the requirement is that fs1
should be mounted the first time with shifted uids/gids, where fs1
resides on ext4, btrfs, xfs or any other filesystem that supports
shifting. This means you may have to mount fs1 on a different backing
store, say on /root-fs0/lib/container-image-fs1/, with
vfs_shift_uids/gids, then use it as a shared read-only lower layer. Of
course you may just use your host / as a read-only layer, mounting it the
first time with vfs_shift_uids/gids, but as discussed above that's not
really safe unless it is not a shared user system, or you know what you
are doing...

These patches do not touch overlayfs. overlayfs support is transparent if
the underlying filesystems, i.e. the upper and lower directories, are on
filesystems that support vfs_shift_uids/vfs_shift_gids. If we go through
overlayfs instead, we make it an overlayfs problem, which needs a
different approach related to union mounts, as I noted in the cover
letter of these patches.

> The point here is that the zfs clone plus container start takes (for a
> 600-800M rootfs) about .5 seconds on my laptop, while the act of shifting
> all the uids takes another 2 seconds. So being able do this without
> manually shifting would be a huge improvement for cases (i.e. docker)
> where you do lots and lots of quick deploys.
>

That's one of the use cases of course! You can also verify the
integrity... and be able to really make container filesystems read-only
without the recursive chown...

Thank you for your feedback!
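For completeness, a rough sketch (again not part of the patch set) of the
overlay case above, with fs1 as a shifted read-only ext4 lower layer; the
device and directory names are made up, and the upper/work directories
would live on a mount set up the same way if the container's writes need
to be shifted too:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

int main(void)
{
	/* Shared read-only lower layer, mounted once with the shift options. */
	if (mount("/dev/vg0/fs1", "/root-fs0/lib/container-image-fs1", "ext4",
		  MS_RDONLY | MS_NOSUID | MS_NODEV,
		  "vfs_shift_uids,vfs_shift_gids") < 0) {
		perror("mount shifted lower layer");
		return EXIT_FAILURE;
	}

	/*
	 * Plain overlayfs mount on top; nothing overlay-specific changes,
	 * the shifting comes from the filesystems that the lower (and
	 * upper) directories live on.
	 */
	if (mount("overlay", "/var/lib/containers/c1/rootfs", "overlay", 0,
		  "lowerdir=/root-fs0/lib/container-image-fs1,"
		  "upperdir=/var/lib/containers/c1/fs2,"
		  "workdir=/var/lib/containers/c1/work") < 0) {
		perror("mount overlay");
		return EXIT_FAILURE;
	}

	return EXIT_SUCCESS;
}

--
Djalal Harouni
http://opendz.org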