On Fri, 22.04.11 19:55, Josh Triplett (j...@joshtriplett.org) wrote:

> The systemd-nspawn manpage lists the various mechanisms used to isolate
> the container, and then says "Note that even though these security
> precautions are taken systemd-nspawn is not suitable for secure
> container setups. Many of the security features may be circumvented and
> are hence primarily useful to avoid accidental changes to the host
> system from the container."
>
> How can a process in a systemd-nspawn container circumvent the container
> setup? What additional steps would systemd-nspawn need to take to
> provide a secure container setup?

Well, the question is of course what "secure" actually means... But here's why I put this sentence in the man page:

First of all, we don't virtualize AF_UNIX abstract namespace sockets. That is part of the network virtualization, and I explicitly decided not to virtualize that, to simplify things, since otherwise containers need specific network configuration. They would hence be much harder to use than chroots, and chroot's ease of use is what I was aiming for. Ideally AF_UNIX virtualization would not be part of CLONE_NEWNET but of CLONE_NEWIPC, since it is a local IPC interface and has nothing to do with the network, but I guess that's too late now.

Fortunately not many services use abstract namespace sockets, since they are insecure and mostly unnecessary these days. There are a few exceptions though: some services use randomly named unix sockets. And there's udev. Since we don't want to run a second udev in the container we actually benefit from this here: only the host udev can bind the socket, hence the container udev will immediately fail.

The missing virtualization of the abstract namespace means processes in the container can talk to services outside of it. This has obvious problems, and a couple of non-obvious ones on top: SCM_CREDENTIALS will be weird, since the users on the two sides don't match.

When we enter the container we drop all capabilities, except the following: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_FSETID, CAP_IPC_OWNER, CAP_KILL, CAP_LEASE, CAP_LINUX_IMMUTABLE, CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_RAW, CAP_SETGID, CAP_SETFCAP, CAP_SETPCAP, CAP_SETUID, CAP_SYS_ADMIN, CAP_SYS_CHROOT, CAP_SYS_NICE, CAP_SYS_PTRACE, CAP_SYS_TTY_CONFIG.

Due to the PID, fs and IPC namespacing a couple of these capabilities should not be much of a problem.
Except for a few cases:

- We don't virtualize the network for simplicity reasons. That means CAP_NET_BIND_SERVICE allows processes in the container to bind to any port, including the privileged ones, and thus to keep services outside of the container from using them. Now, it would be easy to remove this capability too, but that would of course still allow the container to DoS services on high ports on the host. (Consider the container blocking all ports > 6000, thus making it impossible to run X on the host.) But this one is actually not a big issue in the end I guess, so let's ignore it here.

- CAP_NET_RAW means that the container can sniff the host's traffic.

- CAP_SYS_ADMIN is a grab bag of things, and is the biggie here. With this the container can remount /sys, /selinux and /proc/sys read-write and thus influence the host massively. It can also disable swap partitions, and lots and lots of other things.

- A couple of the FS related operations might be problematic since the abstract namespace sockets are not virtualized, and thus you could do privileged operations on fds from outside the container.

There's also currently no virtualization of the users. That means RLIMIT_NPROC and similar limits, when applied in the container, will also affect the same user outside of the container. That's pretty bad...

Some of these issues require kernel support to fix properly (for example the RLIMIT_NPROC issue). Others we could probably fix in userspace. For example, we might be able to make CAP_SYS_ADMIN unnecessary if we premount really everything in the container that it might need. systemd is already smart enough to be happy with pre-mounted directories; I'm not entirely sure about sysvinit though.

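To make the capability points above a bit more concrete, here is a minimal Python sketch (not something systemd-nspawn itself does) that reports which of the three capabilities discussed above a process currently holds. It relies on the CapEff line of /proc/self/status, which is a hex bitmask, and on the bit numbers from <linux/capability.h>. The function names and the RISKY_CAPS table are made up for this illustration.

```python
# Bit numbers as defined in <linux/capability.h>. Only the three
# capabilities singled out in the list above are included; the table
# name and the selection are illustrative assumptions.
RISKY_CAPS = {
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def risky_caps_in_mask(mask):
    """Return the names from RISKY_CAPS whose bit is set in `mask`."""
    return [name for name, bit in RISKY_CAPS.items() if mask & (1 << bit)]


def effective_caps_mask(status_path="/proc/self/status"):
    """Parse the CapEff hex bitmask from a /proc/<pid>/status file."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("no CapEff line found in " + status_path)


if __name__ == "__main__":
    # Run inside the container to see what is still available there.
    print(risky_caps_in_mask(effective_caps_mask()))
```

Run as root inside a container set up as described, one would expect all three names to show up, since all three are in the retained set listed earlier.
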
With a bit of work we could probably even add CLONE_NEWNET support, and automatically set up a valid virtualized network interface for the container, one that could not be reconfigured by the container and is always forwarded to the host. That would buy us AF_UNIX abstract namespace virtualization and fix the CAP_NET_BIND_SERVICE issue.

With CLONE_NEWUSER in place and these changes we could probably make things reasonably secure. But especially figuring out a way to virtualize the network in an elegant way so that things will continue to "just work" is not going to be easy.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel