Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Wed, Jun 17, 2020 at 08:55:36AM -0400, Colin Walters wrote: > On Wed, Jun 17, 2020, at 8:50 AM, Stefan Hajnoczi wrote: > > > Something along these lines should work. Hopefully seccomp can be > > retained. It would also be necessary to check how not having the shared > > directory as / in the mount namespace affects functionality. For one, > > I'm pretty sure symlink escapes and similar path traversals outside the > > shared directory will be possible since virtiofsd normally relies on / > > as protection. > > Yes, though two points: > > - As I said, I don't care about that for my use case; the operating system > we're testing is going to e.g. run on bare metal hosting workloads itself, so > if it's malicious we have already lost (reliability against *accidental* > damage is always nice though, like a stray rm -rf in some test script walking > into the host) It's not just for security, it's also for functional correctness. Have you checked what happens when absolute symlinks are accessed? If we're lucky well-behaved clients use FUSE_READLINK before accessing path components. Then the symlink resolution happens in the FUSE client. But if not, absolute symlinks will access the wrong files (oops!). Accidental race conditions are possible too, especially when both guest and host access the shared directory tree. > - Probably the best long term fix would be to use > https://lwn.net/Articles/796868/ anyways Yes, I think this is a requirement for achieving correctness without pivot_root()/chroot() due to the issues mentioned above. signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Wed, Jun 17, 2020, at 8:50 AM, Stefan Hajnoczi wrote: > Something along these lines should work. Hopefully seccomp can be > retained. It would also be necessary to check how not having the shared > directory as / in the mount namespace affects functionality. For one, > I'm pretty sure symlink escapes and similar path traversals outside the > shared directory will be possible since virtiofsd normally relies on / > as protection. Yes, though two points: - As I said, I don't care about that for my use case; the operating system we're testing is going to e.g. run on bare metal hosting workloads itself, so if it's malicious we have already lost (reliability against *accidental* damage is always nice though, like a stray rm -rf in some test script walking into the host) - Probably the best long term fix would be to use https://lwn.net/Articles/796868/ anyways
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Tue, Jun 02, 2020 at 09:53:18PM -0400, Colin Walters wrote: > On Tue, Jun 2, 2020, at 5:55 AM, Stefan Hajnoczi wrote: > > Ping Colin. It would be great if you have time to share your thoughts on > > this discussion and explain how you are using this patch. > > Yeah sorry about not replying in this thread earlier, this was just a quick > Friday side project for me and the thread obviously exploded =) > > Thinking about this more, probably what would be good enough for now is an > option to just disable internal containerization/sandboxing. In fact per the > discussion our production pipeline runs inside OpenShift 4 and because > Kubernetes doesn't support user namespaces yet it also doesn't support > recursive containerization, so we need an option to turn off the internal > containerization. > > Our use case is somewhat specialized - for what we're doing we generally > trust the guest. We use VMs for operating system testing and development of > content we trust, as opposed to e.g. something like kata. > > It's fine for us to run virtiofs as the same user/security context as qemu. > > So...something like this? (Only compile tested) ... > @@ -2775,6 +2775,8 @@ static void setup_capabilities(void) > static void setup_sandbox(struct lo_data *lo, struct fuse_session *se, >bool enable_syslog) > { > +if (se->no_namespaces) > +return; > setup_namespaces(lo, se); > setup_mounts(lo->source); > setup_seccomp(enable_syslog); Something along these lines should work. Hopefully seccomp can be retained. It would also be necessary to check how not having the shared directory as / in the mount namespace affects functionality. For one, I'm pretty sure symlink escapes and similar path traversals outside the shared directory will be possible since virtiofsd normally relies on / as protection. Stefan signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Tue, Jun 2, 2020, at 5:55 AM, Stefan Hajnoczi wrote: > > Ping Colin. It would be great if you have time to share your thoughts on > this discussion and explain how you are using this patch. Yeah sorry about not replying in this thread earlier, this was just a quick Friday side project for me and the thread obviously exploded =) Thinking about this more, probably what would be good enough for now is an option to just disable internal containerization/sandboxing. In fact per the discussion our production pipeline runs inside OpenShift 4 and because Kubernetes doesn't support user namespaces yet it also doesn't support recursive containerization, so we need an option to turn off the internal containerization. Our use case is somewhat specialized - for what we're doing we generally trust the guest. We use VMs for operating system testing and development of content we trust, as opposed to e.g. something like kata. It's fine for us to run virtiofs as the same user/security context as qemu. So...something like this? (Only compile tested) diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h index 1240828208..603773c505 100644 --- a/tools/virtiofsd/fuse_i.h +++ b/tools/virtiofsd/fuse_i.h @@ -51,6 +51,7 @@ struct fuse_session { int fd; int debug; int deny_others; +int no_namespaces; struct fuse_lowlevel_ops op; int got_init; struct cuse_data *cuse_data; diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c index 2dd36ec03b..263134f792 100644 --- a/tools/virtiofsd/fuse_lowlevel.c +++ b/tools/virtiofsd/fuse_lowlevel.c @@ -2522,6 +2522,7 @@ static const struct fuse_opt fuse_ll_opts[] = { LL_OPTION("-d", debug, 1), LL_OPTION("--debug", debug, 1), LL_OPTION("allow_root", deny_others, 1), +LL_OPTION("--no-namespaces", no_namespaces, 1), LL_OPTION("--socket-path=%s", vu_socket_path, 0), LL_OPTION("--fd=%d", vu_listen_fd, 0), LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0), @@ -2542,6 +2543,7 @@ void fuse_lowlevel_help(void) */ printf( "-o allow_root allow access by root\n" +"--no-namespacesDisable internal use of unshare()/clone(UNSHARE)\n" "--socket-path=PATH path for the vhost-user socket\n" "--fd=FDNUM fd number of vhost-user socket\n" "--thread-pool-size=NUM thread pool size limit (default %d)\n", diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 3ba1d90984..7c54a9cde3 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -2551,15 +2551,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se) char *tmpdir; /* - * Create a new pid namespace for *child* processes. We'll have to - * fork in order to enter the new pid namespace. A new mount namespace - * is also needed so that we can remount /proc for the new pid - * namespace. - * - * Our UNIX domain sockets have been created. Now we can move to - * an empty network namespace to prevent TCP/IP and other network - * activity in case this process is compromised. - */ +* Create a new pid namespace for *child* processes. We'll have to +* fork in order to enter the new pid namespace. A new mount namespace +* is also needed so that we can remount /proc for the new pid +* namespace. +* +* Our UNIX domain sockets have been created. Now we can move to +* an empty network namespace to prevent TCP/IP and other network +* activity in case this process is compromised. +*/ if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); exit(1); @@ -2775,6 +2775,8 @@ static void setup_capabilities(void) static void setup_sandbox(struct lo_data *lo, struct fuse_session *se, bool enable_syslog) { +if (se->no_namespaces) +return; setup_namespaces(lo, se); setup_mounts(lo->source); setup_seccomp(enable_syslog);
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote: > I'd like to make use of virtiofs as part of our tooling in > https://github.com/coreos/coreos-assembler > Most of the code runs as non-root today; qemu also runs as non-root. > We use 9p right now. > > virtiofsd's builtin sandboxing effectively assumes it runs as > root. > > First, change the code to use `clone()` and not `unshare()+fork()`. > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. > > This is similar logic to that in https://github.com/containers/bubblewrap > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > and re-exec itself rather than re-implementing the containerization > itself) > > Signed-off-by: Colin Walters > --- > tools/virtiofsd/passthrough_ll.c | 26 +- > 1 file changed, 21 insertions(+), 5 deletions(-) Ping Colin. It would be great if you have time to share your thoughts on this discussion and explain how you are using this patch. To summarize: I'm unclear what behavior a user can expect since I'm not aware of anything that applies /etc/subuid for the user namespace. Does this mean the expected behavior is that virtiofsd will map all uids/gids to -1 when invoked non-root? Could you document the behavior and consider supporting both -1 and /etc/subuid operation? Both seem like useful behaviors for different use cases. Thanks, Stefan signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Thu, May 21, 2020 at 11:43:44AM +0100, Daniel P. Berrangé wrote: > On Thu, May 21, 2020 at 11:19:23AM +0100, Stefan Hajnoczi wrote: > > On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote: > > > If the person in the host launching virtiofsd is non-root, then > > > user namespaces mean they can offer the guest the full range of > > > POSIX APIs wrt access control & file ownership, since they're > > > no longer restricted to their single host UID when inside the > > > container. > > > > What installs the uid_map/gid_map for virtiofsd? > > > > My machine has /etc/subuid and /etc/subgid, but how would this come into > > play with these patches applied? > > AFAICT, nothing is setting up the mapping, hence my question in the first > review of this patch, that this patch seems incomplete. > > The subuid/subgid files are essentially just reserving a range of IDs > for each user. Something then needs to read & honour those IDs. > > The rules on setting up a mapping are complex though, to avoid a user > from creating a new user namespace, and defining a mapping that lets > them elevate privileges to read files in the host they wouldn't otherwise > be permitted to access. > > IIUC, applying the range of IDs from subuid/subgid will require the > process defining the mapping to have CAP_SETUID *outside* the container. > As an unprivileged host user, you won't have that, so I think the only > thing you can do is setup a mapping for your own UID/GID, which would > allow the container to read/write files owned by the host user that > started it. That should be ok for virtiofsd's needs at least for simple > file sharing. If virtiofsd needs to expose its full range of features > though, something privileged in the host needs to setup a full mapping > based on subuid/subgid > > BTW, "man user_namespaces" gives many more details on UID mapping > rules. > > > What happens when an unprivileged user who is not listed in /etc/subuid > > runs virtiofsd? > > The UID map inside the container will be empty, which should result in > all files being remapped to (uid_t)-1 or whatever is shown in the > /proc/sys/kernel/overflow{u,g}id files. > > So virtiofsd would not have any write access, and only have read access > to files which have world-read bit set. Okay. Enabling user namespaces for virtiofsd is valuable. I think the behavior should be documented though so users know what needs to be configured. That is missing from this patch series. Stefan signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Thu, May 21, 2020 at 11:19:23AM +0100, Stefan Hajnoczi wrote: > On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote: > > If the person in the host launching virtiofsd is non-root, then > > user namespaces mean they can offer the guest the full range of > > POSIX APIs wrt access control & file ownership, since they're > > no longer restricted to their single host UID when inside the > > container. > > What installs the uid_map/gid_map for virtiofsd? > > My machine has /etc/subuid and /etc/subgid, but how would this come into > play with these patches applied? AFAICT, nothing is setting up the mapping, hence my question in the first review of this patch, that this patch seems incomplete. The subuid/subgid files are essentially just reserving a range of IDs for each user. Something then needs to read & honour those IDs. The rules on setting up a mapping are complex though, to avoid a user from creating a new user namespace, and defining a mapping that lets them elevate privileges to read files in the host they wouldn't otherwise be permitted to access. IIUC, applying the range of IDs from subuid/subgid will require the process defining the mapping to have CAP_SETUID *outside* the container. As an unprivileged host user, you won't have that, so I think the only thing you can do is setup a mapping for your own UID/GID, which would allow the container to read/write files owned by the host user that started it. That should be ok for virtiofsd's needs at least for simple file sharing. If virtiofsd needs to expose its full range of features though, something privileged in the host needs to setup a full mapping based on subuid/subgid BTW, "man user_namespaces" gives many more details on UID mapping rules. > What happens when an unprivileged user who is not listed in /etc/subuid > runs virtiofsd? The UID map inside the container will be empty, which should result in all files being remapped to (uid_t)-1 or whatever is shown in the /proc/sys/kernel/overflow{u,g}id files. So virtiofsd would not have any write access, and only have read access to files which have world-read bit set. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote: > If the person in the host launching virtiofsd is non-root, then > user namespaces mean they can offer the guest the full range of > POSIX APIs wrt access control & file ownership, since they're > no longer restricted to their single host UID when inside the > container. What installs the uid_map/gid_map for virtiofsd? My machine has /etc/subuid and /etc/subgid, but how would this come into play with these patches applied? What happens when an unprivileged user who is not listed in /etc/subuid runs virtiofsd? Stefan signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Wed, May 06, 2020 at 08:16:14PM +0100, Dr. David Alan Gilbert wrote: > * Colin Walters (walt...@verbum.org) wrote: > > I'd like to make use of virtiofs as part of our tooling in > > https://github.com/coreos/coreos-assembler > > Most of the code runs as non-root today; qemu also runs as non-root. > > We use 9p right now. > > > > virtiofsd's builtin sandboxing effectively assumes it runs as > > root. > > > > First, change the code to use `clone()` and not `unshare()+fork()`. > > > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. > > Is it ever useful for root to run the code in a new user namespace? Yes, user namespace is useful to both root and non-root alike. Roughly speaking, for root, it offers security benefits, for non-root it offers functionality benefits. The longer answer... With a new user namespaces, users inside the container get remapped to different set of users outside the host, through defined UID & GID mappings. For any UID/GID which doesn't have a mapping, access will get performed as (uid_t)-1 / (gid_t)-1. eg consider you have a range of host IDs 100,000->165,536 available. With user namespaces, you can now ssetuop a mapping of container IDs 0 -> 65536. Thus any time UID 0 inside the container does something, from the host POV they are acting as UID 100,000. If UID 30,000 inside the container does something, this is UID 130,000 in the host POV. If UID 80,000 in the container does something, this is uid -1 from the host POV. If the person in the host launching virtiofsd is non-root, then user namespaces mean they can offer the guest the full range of POSIX APIs wrt access control & file ownership, since they're no longer restricted to their single host UID when inside the container. They also get important things like CAP_DAC_OVERRIDE. IOW, for non-root, user namespaces unlock the full functionality of virtiofsd. Without it, we're limited to read-only access to files not owned by the current non-root user. If the person in the host launching virtiofsd is root, then user namespaces mean we can reduce the effective privileges of virtiofsd. Currently when inside the container, uid==0 is still the same as uid==0 outside. So if there are any resources visible inside the container (either accidentally or intentionally), then virtiofsd shouldn't have write access to, we're lacking protection. By adding usernamespace + a mapping, we strictly isolate virtiofsd from any host resources. The main pain point with user namespaces is that all the files in the directory you are exporting need to be shifted to match the UID/GID mapping user for the user namespaces. Traditionally this has needed a recursive chown of the tree to remap the file ownership. There has been talk of a filesystem overlay todo the remapping transparently, but I've lost track of whether that's a thing yet. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
* Colin Walters (walt...@verbum.org) wrote: > I'd like to make use of virtiofs as part of our tooling in > https://github.com/coreos/coreos-assembler > Most of the code runs as non-root today; qemu also runs as non-root. > We use 9p right now. > > virtiofsd's builtin sandboxing effectively assumes it runs as > root. > > First, change the code to use `clone()` and not `unshare()+fork()`. > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. Is it ever useful for root to run the code in a new user namespace? Dave > This is similar logic to that in https://github.com/containers/bubblewrap > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > and re-exec itself rather than re-implementing the containerization > itself) > > Signed-off-by: Colin Walters > --- > tools/virtiofsd/passthrough_ll.c | 26 +- > 1 file changed, 21 insertions(+), 5 deletions(-) > > diff --git a/tools/virtiofsd/passthrough_ll.c > b/tools/virtiofsd/passthrough_ll.c > index 4c35c95b25..468617f6d6 100644 > --- a/tools/virtiofsd/passthrough_ll.c > +++ b/tools/virtiofsd/passthrough_ll.c > @@ -2530,6 +2530,21 @@ static void print_capabilities(void) > printf("}\n"); > } > > +/* Copied from bubblewrap */ > +static int > +raw_clone(unsigned long flags, void *child_stack) > +{ > +#if defined(__s390__) || defined(__CRIS__) > + /* > + * On s390 and cris the order of the first and second arguments > + * of the raw clone() system call is reversed. > + */ > +return (int) syscall(__NR_clone, child_stack, flags); > +#else > +return (int) syscall(__NR_clone, flags, child_stack); > +#endif > +} > + > /* > * Move to a new mount, net, and pid namespaces to isolate this process. > */ > @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, > struct fuse_session *se) > * an empty network namespace to prevent TCP/IP and other network > * activity in case this process is compromised. > */ > -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { > -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); > -exit(1); > +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET; > +/* If we're non root, we need a new user namespace */ > +if (getuid() != 0) { > +clone_flags |= CLONE_NEWUSER; > } > > -child = fork(); > +child = raw_clone(clone_flags, NULL); > if (child < 0) { > -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n"); > +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n"); > exit(1); > } > if (child > 0) { > -- > 2.24.1 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Tue, May 05, 2020 at 04:23:59PM +0100, Stefan Hajnoczi wrote: > On Mon, May 04, 2020 at 04:07:22PM +0200, Marc-André Lureau wrote: > > Hi > > > > On Fri, May 1, 2020 at 8:29 PM Colin Walters wrote: > > > > > > I'd like to make use of virtiofs as part of our tooling in > > > https://github.com/coreos/coreos-assembler > > > Most of the code runs as non-root today; qemu also runs as non-root. > > > We use 9p right now. > > > > > > virtiofsd's builtin sandboxing effectively assumes it runs as > > > root. > > > > > > First, change the code to use `clone()` and not `unshare()+fork()`. > > > > > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. > > > > > > This is similar logic to that in https://github.com/containers/bubblewrap > > > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > > > and re-exec itself rather than re-implementing the containerization > > > itself) > > > > > > > Now that systemd-nspawn works without privileges, isn't that also a > > solution? One that would fit both system and session level > > permissions, and integration with other services? > > Does systemd-nspawn work inside containers? > > I think virtiofsd will need to run inside containers in the future and > remember systemd being difficult to use in containers. It can be made to work, but my gut tells me people won't be happy if system were a mandatory requirement for virtiofsd usage. Also there are current Linux distros which don't even use systemd. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Mon, May 04, 2020 at 04:07:22PM +0200, Marc-André Lureau wrote: > Hi > > On Fri, May 1, 2020 at 8:29 PM Colin Walters wrote: > > > > I'd like to make use of virtiofs as part of our tooling in > > https://github.com/coreos/coreos-assembler > > Most of the code runs as non-root today; qemu also runs as non-root. > > We use 9p right now. > > > > virtiofsd's builtin sandboxing effectively assumes it runs as > > root. > > > > First, change the code to use `clone()` and not `unshare()+fork()`. > > > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. > > > > This is similar logic to that in https://github.com/containers/bubblewrap > > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > > and re-exec itself rather than re-implementing the containerization > > itself) > > > > Now that systemd-nspawn works without privileges, isn't that also a > solution? One that would fit both system and session level > permissions, and integration with other services? Does systemd-nspawn work inside containers? I think virtiofsd will need to run inside containers in the future and remember systemd being difficult to use in containers. Stefan signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
Hi On Mon, May 4, 2020 at 4:27 PM Colin Walters wrote: > > > > On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote: > > > Now that systemd-nspawn works without privileges, isn't that also a > > solution? One that would fit both system and session level > > permissions, and integration with other services? > > This is a complex topic and one I should probably write up in the bubblewrap > README.md. Today for example for CoreOS, our build and CI processes run > inside OpenShift (Kubernetes) - we aren't running systemd inside our > containers. Actually, I mean systemd-run (oops!) > > bubblewrap is a small self-contained C wrapper around the container system > calls basically. In contrast, AFAICS right now, nspawn requires systemd - > which won't work for our use case. > > Really the contention point here is systemd's dependency on cgroups for > process tracking; in a "nested containerization" scenario you often just want > the cgroups from the "outer" container to apply. But having nested > mounts/pid namespaces are still very useful. (That said, cgroups v2 allows > sane nesting, but we aren't there yet) > > Also related is https://github.com/kubernetes/enhancements/issues/127 - > without that one requires privileged containers to do nesting. > > Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` > because we fully trust the code running in these VMs. > > Or to directly respond again to your proposal: systemd-nspawn as an option > may work for some cases but won't for mine (I don't want virtiofsd/qemu > instances to "escape" the build container or run separately). > You can run within your parent slice, and even more conveniently with: https://github.com/systemd/systemd/pull/15362 -- Marc-André Lureau
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote: > Now that systemd-nspawn works without privileges, isn't that also a > solution? One that would fit both system and session level > permissions, and integration with other services? This is a complex topic and one I should probably write up in the bubblewrap README.md. Today for example for CoreOS, our build and CI processes run inside OpenShift (Kubernetes) - we aren't running systemd inside our containers. bubblewrap is a small self-contained C wrapper around the container system calls basically. In contrast, AFAICS right now, nspawn requires systemd - which won't work for our use case. Really the contention point here is systemd's dependency on cgroups for process tracking; in a "nested containerization" scenario you often just want the cgroups from the "outer" container to apply. But having nested mounts/pid namespaces are still very useful. (That said, cgroups v2 allows sane nesting, but we aren't there yet) Also related is https://github.com/kubernetes/enhancements/issues/127 - without that one requires privileged containers to do nesting. Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` because we fully trust the code running in these VMs. Or to directly respond again to your proposal: systemd-nspawn as an option may work for some cases but won't for mine (I don't want virtiofsd/qemu instances to "escape" the build container or run separately).
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
Hi On Fri, May 1, 2020 at 8:29 PM Colin Walters wrote: > > I'd like to make use of virtiofs as part of our tooling in > https://github.com/coreos/coreos-assembler > Most of the code runs as non-root today; qemu also runs as non-root. > We use 9p right now. > > virtiofsd's builtin sandboxing effectively assumes it runs as > root. > > First, change the code to use `clone()` and not `unshare()+fork()`. > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. > > This is similar logic to that in https://github.com/containers/bubblewrap > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > and re-exec itself rather than re-implementing the containerization > itself) > Now that systemd-nspawn works without privileges, isn't that also a solution? One that would fit both system and session level permissions, and integration with other services? > Signed-off-by: Colin Walters > --- > tools/virtiofsd/passthrough_ll.c | 26 +- > 1 file changed, 21 insertions(+), 5 deletions(-) > > diff --git a/tools/virtiofsd/passthrough_ll.c > b/tools/virtiofsd/passthrough_ll.c > index 4c35c95b25..468617f6d6 100644 > --- a/tools/virtiofsd/passthrough_ll.c > +++ b/tools/virtiofsd/passthrough_ll.c > @@ -2530,6 +2530,21 @@ static void print_capabilities(void) > printf("}\n"); > } > > +/* Copied from bubblewrap */ > +static int > +raw_clone(unsigned long flags, void *child_stack) > +{ > +#if defined(__s390__) || defined(__CRIS__) > + /* > + * On s390 and cris the order of the first and second arguments > + * of the raw clone() system call is reversed. > + */ > +return (int) syscall(__NR_clone, child_stack, flags); > +#else > +return (int) syscall(__NR_clone, flags, child_stack); > +#endif > +} > + > /* > * Move to a new mount, net, and pid namespaces to isolate this process. > */ > @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, > struct fuse_session *se) > * an empty network namespace to prevent TCP/IP and other network > * activity in case this process is compromised. > */ > -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { > -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); > -exit(1); > +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET; > +/* If we're non root, we need a new user namespace */ > +if (getuid() != 0) { > +clone_flags |= CLONE_NEWUSER; > } > > -child = fork(); > +child = raw_clone(clone_flags, NULL); > if (child < 0) { > -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n"); > +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n"); > exit(1); > } > if (child > 0) { > -- > 2.24.1 > > -- Marc-André Lureau
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote: > I'd like to make use of virtiofs as part of our tooling in > https://github.com/coreos/coreos-assembler > Most of the code runs as non-root today; qemu also runs as non-root. > We use 9p right now. > > virtiofsd's builtin sandboxing effectively assumes it runs as > root. > > First, change the code to use `clone()` and not `unshare()+fork()`. > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. > > This is similar logic to that in https://github.com/containers/bubblewrap > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > and re-exec itself rather than re-implementing the containerization > itself) Great, thanks for posting this! The topic of how and when to do the sandboxing, as well as whether to require root, has come up several times. Please update the man page to explain the behavior that users should expect when running as non-root: 1. What happens to uid/gid of files from the perspective of a user outside the sandbox? 2. Are there any limitations (certain operations that don't work)? Thanks, Stefan > > Signed-off-by: Colin Walters > --- > tools/virtiofsd/passthrough_ll.c | 26 +- > 1 file changed, 21 insertions(+), 5 deletions(-) > > diff --git a/tools/virtiofsd/passthrough_ll.c > b/tools/virtiofsd/passthrough_ll.c > index 4c35c95b25..468617f6d6 100644 > --- a/tools/virtiofsd/passthrough_ll.c > +++ b/tools/virtiofsd/passthrough_ll.c > @@ -2530,6 +2530,21 @@ static void print_capabilities(void) > printf("}\n"); > } > > +/* Copied from bubblewrap */ > +static int > +raw_clone(unsigned long flags, void *child_stack) > +{ > +#if defined(__s390__) || defined(__CRIS__) > + /* > + * On s390 and cris the order of the first and second arguments > + * of the raw clone() system call is reversed. > + */ > +return (int) syscall(__NR_clone, child_stack, flags); > +#else > +return (int) syscall(__NR_clone, flags, child_stack); > +#endif > +} > + > /* > * Move to a new mount, net, and pid namespaces to isolate this process. > */ > @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, > struct fuse_session *se) > * an empty network namespace to prevent TCP/IP and other network > * activity in case this process is compromised. > */ > -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { > -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); > -exit(1); > +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET; > +/* If we're non root, we need a new user namespace */ > +if (getuid() != 0) { > +clone_flags |= CLONE_NEWUSER; > } > > -child = fork(); > +child = raw_clone(clone_flags, NULL); > if (child < 0) { > -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n"); > +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n"); > exit(1); > } > if (child > 0) { > -- > 2.24.1 > > signature.asc Description: PGP signature
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote: > I'd like to make use of virtiofs as part of our tooling in > https://github.com/coreos/coreos-assembler > Most of the code runs as non-root today; qemu also runs as non-root. > We use 9p right now. > > virtiofsd's builtin sandboxing effectively assumes it runs as > root. > > First, change the code to use `clone()` and not `unshare()+fork()`. > > Next, automatically use `CLONE_NEWUSER` if we're running as non root. I'd suggest splitting these two, so that the re-factoring is separate from introducing new functionality. > > This is similar logic to that in https://github.com/containers/bubblewrap > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap > and re-exec itself rather than re-implementing the containerization > itself) > > Signed-off-by: Colin Walters > --- > tools/virtiofsd/passthrough_ll.c | 26 +- > 1 file changed, 21 insertions(+), 5 deletions(-) > > diff --git a/tools/virtiofsd/passthrough_ll.c > b/tools/virtiofsd/passthrough_ll.c > index 4c35c95b25..468617f6d6 100644 > --- a/tools/virtiofsd/passthrough_ll.c > +++ b/tools/virtiofsd/passthrough_ll.c > @@ -2530,6 +2530,21 @@ static void print_capabilities(void) > printf("}\n"); > } > > +/* Copied from bubblewrap */ > +static int > +raw_clone(unsigned long flags, void *child_stack) > +{ > +#if defined(__s390__) || defined(__CRIS__) > + /* > + * On s390 and cris the order of the first and second arguments > + * of the raw clone() system call is reversed. > + */ > +return (int) syscall(__NR_clone, child_stack, flags); > +#else > +return (int) syscall(__NR_clone, flags, child_stack); > +#endif > +} What's the reason for using the raw syscall ? Was it just to avoid having to allocate a new stack space ? > + > /* > * Move to a new mount, net, and pid namespaces to isolate this process. > */ > @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, > struct fuse_session *se) > * an empty network namespace to prevent TCP/IP and other network > * activity in case this process is compromised. > */ > -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { > -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); > -exit(1); > +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET; > +/* If we're non root, we need a new user namespace */ > +if (getuid() != 0) { > +clone_flags |= CLONE_NEWUSER; > } IIUC, with CLONE_NEWUSER we need to set a UID/GID mapping, otherwise all file accesses will be squashed to the UID -1. Or was it intentional that you're only trying to provide read-only access to files that are world-accessible ? > -child = fork(); > +child = raw_clone(clone_flags, NULL); > if (child < 0) { > -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n"); > +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n"); > exit(1); > } > if (child > 0) { Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
[PATCH] virtiofsd: Use clone() and not unshare(), support non-root
I'd like to make use of virtiofs as part of our tooling in https://github.com/coreos/coreos-assembler Most of the code runs as non-root today; qemu also runs as non-root. We use 9p right now. virtiofsd's builtin sandboxing effectively assumes it runs as root. First, change the code to use `clone()` and not `unshare()+fork()`. Next, automatically use `CLONE_NEWUSER` if we're running as non root. This is similar logic to that in https://github.com/containers/bubblewrap (Which...BTW, it could make sense for virtiofs to depend on bubblewrap and re-exec itself rather than re-implementing the containerization itself) Signed-off-by: Colin Walters --- tools/virtiofsd/passthrough_ll.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 4c35c95b25..468617f6d6 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -2530,6 +2530,21 @@ static void print_capabilities(void) printf("}\n"); } +/* Copied from bubblewrap */ +static int +raw_clone(unsigned long flags, void *child_stack) +{ +#if defined(__s390__) || defined(__CRIS__) + /* + * On s390 and cris the order of the first and second arguments + * of the raw clone() system call is reversed. + */ +return (int) syscall(__NR_clone, child_stack, flags); +#else +return (int) syscall(__NR_clone, flags, child_stack); +#endif +} + /* * Move to a new mount, net, and pid namespaces to isolate this process. */ @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se) * an empty network namespace to prevent TCP/IP and other network * activity in case this process is compromised. */ -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); -exit(1); +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET; +/* If we're non root, we need a new user namespace */ +if (getuid() != 0) { +clone_flags |= CLONE_NEWUSER; } -child = fork(); +child = raw_clone(clone_flags, NULL); if (child < 0) { -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n"); +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n"); exit(1); } if (child > 0) { -- 2.24.1