Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?
On Thu, Sep 29, 2022, at 1:03 PM, Vivek Goyal wrote: > > So rust version of virtiofsd, already supports running unprivileged > (inside a user namespace). I know, but as I already said, the use case here is running inside an OpenShift unprivileged pod where *we are already in a container*. > host$ podman unshare -- virtiofsd --socket-path=/tmp/vfsd.sock > --shared-dir /mnt \ > --announce-submounts --sandbox chroot & Yes, but in current OCP 4.11 our seccomp policy denies CLONE_NEWUSER: ``` $ unshare -m unshare: unshare failed: Function not implemented ``` https://docs.openshift.com/container-platform/4.11/security/seccomp-profiles.html > I think only privileged operation it needs is assigning a range of > subuid/subgid to the uid you are using on host. We also turn on NO_NEW_PRIVILEGES by default in OCP pods. Now, I *could* in general get elevated permissions where I need to today. But it's also really important to me to have a long term goal of having operating system builds and tests work well as "just another workload" in our production container platform (now, one *does* want to bind in /dev/kvm, but that's generally safe, and even that strictly speaking is optional if one can stomach the ~10x perf hit). > Can you give rust virtiofsd (unprivileged) a try. I admit to not actually trying it in a pod, but I think we all agree it can't work, and the only thing that can today is openat2.
Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?
On Thu, Sep 29, 2022, at 10:10 AM, Vivek Goyal wrote: > What's your use case. How do you plan to use virtiofs. At the current time, the Kubernetes that we run does not support user namespaces. We want to do the production builds of our operating system (Fedora CoreOS and RHEL CoreOS) today inside an *unprivileged* Kubernetes pod (actually in OpenShift using anyuid, i.e. random unprivileged uid too), just with /dev/kvm exposed from the host (which is safe). Operating system builds *and* tests in qemu are just another workload that can be shared with other tenants. qemu works fine in this model, as does 9p. It's just the virtiofs isolation requires privileges to be used today.
Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?
On Wed, Sep 28, 2022, at 3:28 PM, Vivek Goyal wrote: > Sounds reasonable. In fact, we could probably do someting similar > for "landlock" as well. Thanks for the discussion all! Can someone (vaguely) commit to look into this in say the next few months? It's not *urgent*, we can live with the 9p flakes and problems short term, just trying to figure out if this needs to be on our medium-term radar or not. Thanks!
Re: virtiofsd: Any reason why there's not an "openat2" sandbox mode?
On Tue, Sep 27, 2022, at 1:27 PM, German Maglione wrote: > >> > Now all the development has moved to rust virtiofsd. Oh, awesome!! The code there looks great. > I could work on this for the next major version and see if anything breaks. > But I prefer to add this as a compilation feature, instead of a command line > option that we will then have to maintain for a while. Hmm, what would be the issue with having the code there by default? I think rather than any new command line option, we automatically use `openat2+RESOLVE_IN_ROOT` if the process is run as a nonzero uid. > Also, I don't see it as a sandbox feature, as Stefan mentioned, a compromised > process can call openat2() without RESOLVE_IN_ROOT. I'm a bit skeptical honestly about how secure the existing namespace code is against a compromised virtiofsd process. The primary worry is guest filesystem traversals, right? openat2+RESOLVE_IN_ROOT addresses that. Plus being in Rust makes this dramatically safer. > I did some test with > Landlock to lock virtiofsd inside the shared directory, but IIRC it requires a > kernel 5.13 But yes, landlock and other things make sense, I just don't see these things as strongly linked. IOW we shouldn't in my opinion block unprivileged virtiofsd on more sandboxing than openat2 already gives us.
virtiofsd: Any reason why there's not an "openat2" sandbox mode?
We previously had a chat here https://lore.kernel.org/all/348d4774-bd5f-4832-bd7e-a21491fda...@www.fastmail.com/T/ around virtiofsd and privileges and the case of trying to run virtiofsd inside an unprivileged (Kubernetes) container. Right now we're still using 9p, and it has bugs (basically it seems like the 9p inode flushing callback tries to allocate memory to send an RPC, and this causes OOM problems) https://github.com/coreos/coreos-assembler/issues/1812 Coming back to this...as of lately in Linux, there's support for strongly isolated filesystem access via openat2(): https://lwn.net/Articles/796868/ Is there any reason we couldn't do an -o sandbox=openat2 ? This operates without any privileges at all, and should be usable (and secure enough) in our use case. I may try a patch if this sounds OK...
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Wed, Jun 17, 2020, at 8:50 AM, Stefan Hajnoczi wrote: > Something along these lines should work. Hopefully seccomp can be > retained. It would also be necessary to check how not having the shared > directory as / in the mount namespace affects functionality. For one, > I'm pretty sure symlink escapes and similar path traversals outside the > shared directory will be possible since virtiofsd normally relies on / > as protection. Yes, though two points: - As I said, I don't care about that for my use case; the operating system we're testing is going to e.g. run on bare metal hosting workloads itself, so if it's malicious we have already lost (reliability against *accidental* damage is always nice though, like a stray rm -rf in some test script walking into the host) - Probably the best long term fix would be to use https://lwn.net/Articles/796868/ anyways
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Tue, Jun 2, 2020, at 5:55 AM, Stefan Hajnoczi wrote: > > Ping Colin. It would be great if you have time to share your thoughts on > this discussion and explain how you are using this patch. Yeah sorry about not replying in this thread earlier, this was just a quick Friday side project for me and the thread obviously exploded =) Thinking about this more, probably what would be good enough for now is an option to just disable internal containerization/sandboxing. In fact per the discussion our production pipeline runs inside OpenShift 4 and because Kubernetes doesn't support user namespaces yet it also doesn't support recursive containerization, so we need an option to turn off the internal containerization. Our use case is somewhat specialized - for what we're doing we generally trust the guest. We use VMs for operating system testing and development of content we trust, as opposed to e.g. something like kata. It's fine for us to run virtiofs as the same user/security context as qemu. So...something like this? (Only compile tested) diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h index 1240828208..603773c505 100644 --- a/tools/virtiofsd/fuse_i.h +++ b/tools/virtiofsd/fuse_i.h @@ -51,6 +51,7 @@ struct fuse_session { int fd; int debug; int deny_others; +int no_namespaces; struct fuse_lowlevel_ops op; int got_init; struct cuse_data *cuse_data; diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c index 2dd36ec03b..263134f792 100644 --- a/tools/virtiofsd/fuse_lowlevel.c +++ b/tools/virtiofsd/fuse_lowlevel.c @@ -2522,6 +2522,7 @@ static const struct fuse_opt fuse_ll_opts[] = { LL_OPTION("-d", debug, 1), LL_OPTION("--debug", debug, 1), LL_OPTION("allow_root", deny_others, 1), +LL_OPTION("--no-namespaces", no_namespaces, 1), LL_OPTION("--socket-path=%s", vu_socket_path, 0), LL_OPTION("--fd=%d", vu_listen_fd, 0), LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0), @@ -2542,6 +2543,7 @@ void fuse_lowlevel_help(void) */ printf( "-o allow_root allow access by root\n" +"--no-namespacesDisable internal use of unshare()/clone(UNSHARE)\n" "--socket-path=PATH path for the vhost-user socket\n" "--fd=FDNUM fd number of vhost-user socket\n" "--thread-pool-size=NUM thread pool size limit (default %d)\n", diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 3ba1d90984..7c54a9cde3 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -2551,15 +2551,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se) char *tmpdir; /* - * Create a new pid namespace for *child* processes. We'll have to - * fork in order to enter the new pid namespace. A new mount namespace - * is also needed so that we can remount /proc for the new pid - * namespace. - * - * Our UNIX domain sockets have been created. Now we can move to - * an empty network namespace to prevent TCP/IP and other network - * activity in case this process is compromised. - */ +* Create a new pid namespace for *child* processes. We'll have to +* fork in order to enter the new pid namespace. A new mount namespace +* is also needed so that we can remount /proc for the new pid +* namespace. +* +* Our UNIX domain sockets have been created. Now we can move to +* an empty network namespace to prevent TCP/IP and other network +* activity in case this process is compromised. +*/ if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); exit(1); @@ -2775,6 +2775,8 @@ static void setup_capabilities(void) static void setup_sandbox(struct lo_data *lo, struct fuse_session *se, bool enable_syslog) { +if (se->no_namespaces) +return; setup_namespaces(lo, se); setup_mounts(lo->source); setup_seccomp(enable_syslog);
Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root
On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote: > Now that systemd-nspawn works without privileges, isn't that also a > solution? One that would fit both system and session level > permissions, and integration with other services? This is a complex topic and one I should probably write up in the bubblewrap README.md. Today for example for CoreOS, our build and CI processes run inside OpenShift (Kubernetes) - we aren't running systemd inside our containers. bubblewrap is a small self-contained C wrapper around the container system calls basically. In contrast, AFAICS right now, nspawn requires systemd - which won't work for our use case. Really the contention point here is systemd's dependency on cgroups for process tracking; in a "nested containerization" scenario you often just want the cgroups from the "outer" container to apply. But having nested mounts/pid namespaces are still very useful. (That said, cgroups v2 allows sane nesting, but we aren't there yet) Also related is https://github.com/kubernetes/enhancements/issues/127 - without that one requires privileged containers to do nesting. Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` because we fully trust the code running in these VMs. Or to directly respond again to your proposal: systemd-nspawn as an option may work for some cases but won't for mine (I don't want virtiofsd/qemu instances to "escape" the build container or run separately).
[PATCH] virtiofsd: Use clone() and not unshare(), support non-root
I'd like to make use of virtiofs as part of our tooling in https://github.com/coreos/coreos-assembler Most of the code runs as non-root today; qemu also runs as non-root. We use 9p right now. virtiofsd's builtin sandboxing effectively assumes it runs as root. First, change the code to use `clone()` and not `unshare()+fork()`. Next, automatically use `CLONE_NEWUSER` if we're running as non root. This is similar logic to that in https://github.com/containers/bubblewrap (Which...BTW, it could make sense for virtiofs to depend on bubblewrap and re-exec itself rather than re-implementing the containerization itself) Signed-off-by: Colin Walters --- tools/virtiofsd/passthrough_ll.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c index 4c35c95b25..468617f6d6 100644 --- a/tools/virtiofsd/passthrough_ll.c +++ b/tools/virtiofsd/passthrough_ll.c @@ -2530,6 +2530,21 @@ static void print_capabilities(void) printf("}\n"); } +/* Copied from bubblewrap */ +static int +raw_clone(unsigned long flags, void *child_stack) +{ +#if defined(__s390__) || defined(__CRIS__) + /* + * On s390 and cris the order of the first and second arguments + * of the raw clone() system call is reversed. + */ +return (int) syscall(__NR_clone, child_stack, flags); +#else +return (int) syscall(__NR_clone, flags, child_stack); +#endif +} + /* * Move to a new mount, net, and pid namespaces to isolate this process. */ @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se) * an empty network namespace to prevent TCP/IP and other network * activity in case this process is compromised. */ -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) { -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n"); -exit(1); +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET; +/* If we're non root, we need a new user namespace */ +if (getuid() != 0) { +clone_flags |= CLONE_NEWUSER; } -child = fork(); +child = raw_clone(clone_flags, NULL); if (child < 0) { -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n"); +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n"); exit(1); } if (child > 0) { -- 2.24.1