Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-23 Thread Stefan Hajnoczi
On Wed, Jun 17, 2020 at 08:55:36AM -0400, Colin Walters wrote:
> On Wed, Jun 17, 2020, at 8:50 AM, Stefan Hajnoczi wrote:
> 
> > Something along these lines should work. Hopefully seccomp can be
> > retained. It would also be necessary to check how not having the shared
> > directory as / in the mount namespace affects functionality. For one,
> > I'm pretty sure symlink escapes and similar path traversals outside the
> > shared directory will be possible since virtiofsd normally relies on /
> > as protection.
> 
> Yes, though two points:
> 
> - As I said, I don't care about that for my use case; the operating system 
> we're testing is going to e.g. run on bare metal hosting workloads itself, so 
> if it's malicious we have already lost (reliability against *accidental* 
> damage is always nice though, like a stray rm -rf in some test script walking 
> into the host)

It's not just for security, it's also for functional correctness.
Have you checked what happens when absolute symlinks are accessed?

If we're lucky well-behaved clients use FUSE_READLINK before accessing
path components. Then the symlink resolution happens in the FUSE client.
But if not, absolute symlinks will access the wrong files (oops!).

Accidental race conditions are possible too, especially when both guest
and host access the shared directory tree.

> - Probably the best long term fix would be to use 
> https://lwn.net/Articles/796868/ anyways

Yes, I think this is a requirement for achieving correctness without
pivot_root()/chroot() due to the issues mentioned above.


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-17 Thread Colin Walters



On Wed, Jun 17, 2020, at 8:50 AM, Stefan Hajnoczi wrote:

> Something along these lines should work. Hopefully seccomp can be
> retained. It would also be necessary to check how not having the shared
> directory as / in the mount namespace affects functionality. For one,
> I'm pretty sure symlink escapes and similar path traversals outside the
> shared directory will be possible since virtiofsd normally relies on /
> as protection.

Yes, though two points:

- As I said, I don't care about that for my use case; the operating system 
we're testing is going to e.g. run on bare metal hosting workloads itself, so 
if it's malicious we have already lost (reliability against *accidental* damage 
is always nice though, like a stray rm -rf in some test script walking into the 
host)
- Probably the best long term fix would be to use 
https://lwn.net/Articles/796868/ anyways



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-17 Thread Stefan Hajnoczi
On Tue, Jun 02, 2020 at 09:53:18PM -0400, Colin Walters wrote:
> On Tue, Jun 2, 2020, at 5:55 AM, Stefan Hajnoczi wrote:
> > Ping Colin. It would be great if you have time to share your thoughts on
> > this discussion and explain how you are using this patch.
> 
> Yeah sorry about not replying in this thread earlier, this was just a quick 
> Friday side project for me and the thread obviously exploded =)
> 
> Thinking about this more, probably what would be good enough for now is an 
> option to just disable internal containerization/sandboxing.  In fact per the 
> discussion our production pipeline runs inside OpenShift 4 and because 
> Kubernetes doesn't support user namespaces yet it also doesn't support 
> recursive containerization, so we need an option to turn off the internal 
> containerization.
> 
> Our use case is somewhat specialized - for what we're doing we generally 
> trust the guest.  We use VMs for operating system testing and development of 
> content we trust, as opposed to e.g. something like kata.
> 
> It's fine for us to run virtiofs as the same user/security context as qemu.
> 
> So...something like this?  (Only compile tested)
...
> @@ -2775,6 +2775,8 @@ static void setup_capabilities(void)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>bool enable_syslog)
>  {
> +if (se->no_namespaces)
> +return;
>  setup_namespaces(lo, se);
>  setup_mounts(lo->source);
>  setup_seccomp(enable_syslog);

Something along these lines should work. Hopefully seccomp can be
retained. It would also be necessary to check how not having the shared
directory as / in the mount namespace affects functionality. For one,
I'm pretty sure symlink escapes and similar path traversals outside the
shared directory will be possible since virtiofsd normally relies on /
as protection.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-02 Thread Colin Walters



On Tue, Jun 2, 2020, at 5:55 AM, Stefan Hajnoczi wrote:
> 
> Ping Colin. It would be great if you have time to share your thoughts on
> this discussion and explain how you are using this patch.

Yeah sorry about not replying in this thread earlier, this was just a quick 
Friday side project for me and the thread obviously exploded =)

Thinking about this more, probably what would be good enough for now is an 
option to just disable internal containerization/sandboxing.  In fact per the 
discussion our production pipeline runs inside OpenShift 4 and because 
Kubernetes doesn't support user namespaces yet it also doesn't support 
recursive containerization, so we need an option to turn off the internal 
containerization.

Our use case is somewhat specialized - for what we're doing we generally trust 
the guest.  We use VMs for operating system testing and development of content 
we trust, as opposed to e.g. something like kata.

It's fine for us to run virtiofs as the same user/security context as qemu.

So...something like this?  (Only compile tested)

diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 1240828208..603773c505 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -51,6 +51,7 @@ struct fuse_session {
 int fd;
 int debug;
 int deny_others;
+int no_namespaces;
 struct fuse_lowlevel_ops op;
 int got_init;
 struct cuse_data *cuse_data;
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 2dd36ec03b..263134f792 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2522,6 +2522,7 @@ static const struct fuse_opt fuse_ll_opts[] = {
 LL_OPTION("-d", debug, 1),
 LL_OPTION("--debug", debug, 1),
 LL_OPTION("allow_root", deny_others, 1),
+LL_OPTION("--no-namespaces", no_namespaces, 1),
 LL_OPTION("--socket-path=%s", vu_socket_path, 0),
 LL_OPTION("--fd=%d", vu_listen_fd, 0),
 LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0),
@@ -2542,6 +2543,7 @@ void fuse_lowlevel_help(void)
  */
 printf(
 "-o allow_root  allow access by root\n"
+"--no-namespacesDisable internal use of 
unshare()/clone(UNSHARE)\n"
 "--socket-path=PATH path for the vhost-user socket\n"
 "--fd=FDNUM fd number of vhost-user socket\n"
 "--thread-pool-size=NUM thread pool size limit (default %d)\n",
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 3ba1d90984..7c54a9cde3 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2551,15 +2551,15 @@ static void setup_namespaces(struct lo_data *lo, struct 
fuse_session *se)
 char *tmpdir;
 
 /*
- * Create a new pid namespace for *child* processes.  We'll have to
- * fork in order to enter the new pid namespace.  A new mount namespace
- * is also needed so that we can remount /proc for the new pid
- * namespace.
- *
- * Our UNIX domain sockets have been created.  Now we can move to
- * an empty network namespace to prevent TCP/IP and other network
- * activity in case this process is compromised.
- */
+* Create a new pid namespace for *child* processes.  We'll have to
+* fork in order to enter the new pid namespace.  A new mount namespace
+* is also needed so that we can remount /proc for the new pid
+* namespace.
+*
+* Our UNIX domain sockets have been created.  Now we can move to
+* an empty network namespace to prevent TCP/IP and other network
+* activity in case this process is compromised.
+*/
 if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
 fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
 exit(1);
@@ -2775,6 +2775,8 @@ static void setup_capabilities(void)
 static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
   bool enable_syslog)
 {
+if (se->no_namespaces)
+return;
 setup_namespaces(lo, se);
 setup_mounts(lo->source);
 setup_seccomp(enable_syslog);



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-02 Thread Stefan Hajnoczi
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> 
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
> 
> Signed-off-by: Colin Walters 
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +-
>  1 file changed, 21 insertions(+), 5 deletions(-)

Ping Colin. It would be great if you have time to share your thoughts on
this discussion and explain how you are using this patch.

To summarize: I'm unclear what behavior a user can expect since I'm not
aware of anything that applies /etc/subuid for the user namespace. Does
this mean the expected behavior is that virtiofsd will map all uids/gids
to -1 when invoked non-root?

Could you document the behavior and consider supporting both -1 and
/etc/subuid operation? Both seem like useful behaviors for different use
cases.

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-27 Thread Stefan Hajnoczi
On Thu, May 21, 2020 at 11:43:44AM +0100, Daniel P. Berrangé wrote:
> On Thu, May 21, 2020 at 11:19:23AM +0100, Stefan Hajnoczi wrote:
> > On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote:
> > > If the person in the host launching virtiofsd is non-root, then
> > > user namespaces mean they can offer the guest the full range of
> > > POSIX APIs wrt access control & file ownership, since they're
> > > no longer restricted to their single host UID when inside the
> > > container.
> > 
> > What installs the uid_map/gid_map for virtiofsd?
> > 
> > My machine has /etc/subuid and /etc/subgid, but how would this come into
> > play with these patches applied?
> 
> AFAICT, nothing is setting up the mapping, hence my question in the first
> review of this patch, that this patch seems incomplete.
> 
> The subuid/subgid files are essentially just reserving a range of IDs
> for each user. Something then needs to read & honour those IDs.
> 
> The rules on setting up a mapping are complex though, to avoid a user
> from creating a new user namespace, and defining a mapping that lets
> them elevate privileges to read files in the host they wouldn't otherwise
> be permitted to access.
> 
> IIUC, applying the range of IDs from subuid/subgid will require the
> process defining the mapping to have CAP_SETUID *outside* the container.
> As an unprivileged host user, you won't have that, so I think the only
> thing you can do is setup a mapping for your own UID/GID, which would
> allow the container to read/write files owned by the host user that
> started it. That should be ok for virtiofsd's needs at least for simple
> file sharing. If virtiofsd needs to expose its full range of features
> though, something privileged in the host needs to setup a full mapping
> based on subuid/subgid
> 
> BTW, "man user_namespaces" gives many more details on UID mapping
> rules.
> 
> > What happens when an unprivileged user who is not listed in /etc/subuid
> > runs virtiofsd?
> 
> The UID map inside the container will be empty, which should result in
> all files being remapped to (uid_t)-1 or whatever is shown in the
> /proc/sys/kernel/overflow{u,g}id files.
> 
> So virtiofsd would not have any write access, and only have read access
> to files which have world-read bit set.  

Okay. Enabling user namespaces for virtiofsd is valuable. I think the
behavior should be documented though so users know what needs to be
configured. That is missing from this patch series.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-21 Thread Daniel P . Berrangé
On Thu, May 21, 2020 at 11:19:23AM +0100, Stefan Hajnoczi wrote:
> On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote:
> > If the person in the host launching virtiofsd is non-root, then
> > user namespaces mean they can offer the guest the full range of
> > POSIX APIs wrt access control & file ownership, since they're
> > no longer restricted to their single host UID when inside the
> > container.
> 
> What installs the uid_map/gid_map for virtiofsd?
> 
> My machine has /etc/subuid and /etc/subgid, but how would this come into
> play with these patches applied?

AFAICT, nothing is setting up the mapping, hence my question in the first
review of this patch, that this patch seems incomplete.

The subuid/subgid files are essentially just reserving a range of IDs
for each user. Something then needs to read & honour those IDs.

The rules on setting up a mapping are complex though, to avoid a user
from creating a new user namespace, and defining a mapping that lets
them elevate privileges to read files in the host they wouldn't otherwise
be permitted to access.

IIUC, applying the range of IDs from subuid/subgid will require the
process defining the mapping to have CAP_SETUID *outside* the container.
As an unprivileged host user, you won't have that, so I think the only
thing you can do is setup a mapping for your own UID/GID, which would
allow the container to read/write files owned by the host user that
started it. That should be ok for virtiofsd's needs at least for simple
file sharing. If virtiofsd needs to expose its full range of features
though, something privileged in the host needs to setup a full mapping
based on subuid/subgid

BTW, "man user_namespaces" gives many more details on UID mapping
rules.

> What happens when an unprivileged user who is not listed in /etc/subuid
> runs virtiofsd?

The UID map inside the container will be empty, which should result in
all files being remapped to (uid_t)-1 or whatever is shown in the
/proc/sys/kernel/overflow{u,g}id files.

So virtiofsd would not have any write access, and only have read access
to files which have world-read bit set.  


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-21 Thread Stefan Hajnoczi
On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote:
> If the person in the host launching virtiofsd is non-root, then
> user namespaces mean they can offer the guest the full range of
> POSIX APIs wrt access control & file ownership, since they're
> no longer restricted to their single host UID when inside the
> container.

What installs the uid_map/gid_map for virtiofsd?

My machine has /etc/subuid and /etc/subgid, but how would this come into
play with these patches applied?

What happens when an unprivileged user who is not listed in /etc/subuid
runs virtiofsd?

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-07 Thread Daniel P . Berrangé
On Wed, May 06, 2020 at 08:16:14PM +0100, Dr. David Alan Gilbert wrote:
> * Colin Walters (walt...@verbum.org) wrote:
> > I'd like to make use of virtiofs as part of our tooling in
> > https://github.com/coreos/coreos-assembler
> > Most of the code runs as non-root today; qemu also runs as non-root.
> > We use 9p right now.
> > 
> > virtiofsd's builtin sandboxing effectively assumes it runs as
> > root.
> > 
> > First, change the code to use `clone()` and not `unshare()+fork()`.
> > 
> > Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> 
> Is it ever useful for root to run the code in a new user namespace?

Yes, user namespace is useful to both root and non-root alike. Roughly
speaking, for root, it offers security benefits, for non-root it offers
functionality benefits.

The longer answer...

With a new user namespaces, users inside the container get remapped
to different set of users outside the host, through defined UID & GID
mappings.  For any UID/GID which doesn't have a mapping, access will
get performed as (uid_t)-1 / (gid_t)-1.

eg consider you have a range of host IDs 100,000->165,536 available.
With user namespaces, you can now ssetuop a mapping of container
IDs 0 -> 65536.

Thus any time  UID 0 inside the container does something, from the
host POV they are acting as UID 100,000.  If UID 30,000 inside the
container does something, this is UID 130,000 in the host POV. If
UID 80,000 in the container does something, this is uid -1 from
the host POV.

If the person in the host launching virtiofsd is non-root, then
user namespaces mean they can offer the guest the full range of
POSIX APIs wrt access control & file ownership, since they're
no longer restricted to their single host UID when inside the
container.  They also get important things like CAP_DAC_OVERRIDE.
IOW, for non-root, user namespaces unlock the full functionality
of virtiofsd. Without it, we're limited to read-only access to
files not owned by the current non-root user.

If the person in the host launching virtiofsd is root, then user
namespaces mean we can reduce the effective privileges of virtiofsd.
Currently when inside the container, uid==0 is still the same as
uid==0 outside. So if there are any resources visible inside the
container (either accidentally or intentionally), then virtiofsd
shouldn't have write access to, we're lacking protection. By
adding usernamespace + a mapping, we strictly isolate virtiofsd
from any host resources.

The main pain point with user namespaces is that all the files
in the directory you are exporting need to be shifted to match
the UID/GID mapping user for the user namespaces. Traditionally
this has needed a recursive chown of the tree to remap the file
ownership. There has been talk of a filesystem overlay todo the
remapping transparently, but I've lost track of whether that's
a thing yet.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-06 Thread Dr. David Alan Gilbert
* Colin Walters (walt...@verbum.org) wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.

Is it ever useful for root to run the code in a new user namespace?

Dave

> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
> 
> Signed-off-by: Colin Walters 
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +-
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>  printf("}\n");
>  }
>  
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}
> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, 
> struct fuse_session *se)
>   * an empty network namespace to prevent TCP/IP and other network
>   * activity in case this process is compromised.
>   */
> -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -exit(1);
> +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +/* If we're non root, we need a new user namespace */
> +if (getuid() != 0) {
> +clone_flags |= CLONE_NEWUSER;
>  }
>  
> -child = fork();
> +child = raw_clone(clone_flags, NULL);
>  if (child < 0) {
> -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>  exit(1);
>  }
>  if (child > 0) {
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-05 Thread Daniel P . Berrangé
On Tue, May 05, 2020 at 04:23:59PM +0100, Stefan Hajnoczi wrote:
> On Mon, May 04, 2020 at 04:07:22PM +0200, Marc-André Lureau wrote:
> > Hi
> > 
> > On Fri, May 1, 2020 at 8:29 PM Colin Walters  wrote:
> > >
> > > I'd like to make use of virtiofs as part of our tooling in
> > > https://github.com/coreos/coreos-assembler
> > > Most of the code runs as non-root today; qemu also runs as non-root.
> > > We use 9p right now.
> > >
> > > virtiofsd's builtin sandboxing effectively assumes it runs as
> > > root.
> > >
> > > First, change the code to use `clone()` and not `unshare()+fork()`.
> > >
> > > Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> > >
> > > This is similar logic to that in https://github.com/containers/bubblewrap
> > > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
> > >  and re-exec itself rather than re-implementing the containerization
> > >  itself)
> > >
> > 
> > Now that systemd-nspawn works without privileges, isn't that also a
> > solution? One that would fit both system and session level
> > permissions, and integration with other services?
> 
> Does systemd-nspawn work inside containers?
> 
> I think virtiofsd will need to run inside containers in the future and
> remember systemd being difficult to use in containers.

It can be made to work, but my gut tells me people won't be happy if
system were a mandatory requirement for virtiofsd usage. Also there
are current Linux distros which don't even use systemd.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-05 Thread Stefan Hajnoczi
On Mon, May 04, 2020 at 04:07:22PM +0200, Marc-André Lureau wrote:
> Hi
> 
> On Fri, May 1, 2020 at 8:29 PM Colin Walters  wrote:
> >
> > I'd like to make use of virtiofs as part of our tooling in
> > https://github.com/coreos/coreos-assembler
> > Most of the code runs as non-root today; qemu also runs as non-root.
> > We use 9p right now.
> >
> > virtiofsd's builtin sandboxing effectively assumes it runs as
> > root.
> >
> > First, change the code to use `clone()` and not `unshare()+fork()`.
> >
> > Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> >
> > This is similar logic to that in https://github.com/containers/bubblewrap
> > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
> >  and re-exec itself rather than re-implementing the containerization
> >  itself)
> >
> 
> Now that systemd-nspawn works without privileges, isn't that also a
> solution? One that would fit both system and session level
> permissions, and integration with other services?

Does systemd-nspawn work inside containers?

I think virtiofsd will need to run inside containers in the future and
remember systemd being difficult to use in containers.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-04 Thread Marc-André Lureau
Hi

On Mon, May 4, 2020 at 4:27 PM Colin Walters  wrote:
>
>
>
> On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote:
>
> > Now that systemd-nspawn works without privileges, isn't that also a
> > solution? One that would fit both system and session level
> > permissions, and integration with other services?
>
> This is a complex topic and one I should probably write up in the bubblewrap 
> README.md.  Today for example for CoreOS, our build and CI processes run 
> inside OpenShift (Kubernetes) - we aren't running systemd inside our 
> containers.

Actually, I mean systemd-run (oops!)

>
> bubblewrap is a small self-contained C wrapper around the container system 
> calls basically.  In contrast, AFAICS right now, nspawn requires systemd - 
> which won't work for our use case.
>
> Really the contention point here is systemd's dependency on cgroups for 
> process tracking; in a "nested containerization" scenario you often just want 
> the cgroups from the "outer" container to apply.  But having nested 
> mounts/pid namespaces are still very useful.  (That said, cgroups v2 allows 
> sane nesting, but we aren't there yet)
>
> Also related is https://github.com/kubernetes/enhancements/issues/127 - 
> without that one requires privileged containers to do nesting.
>
> Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` 
> because we fully trust the code running in these VMs.
>
> Or to directly respond again to your proposal: systemd-nspawn as an option 
> may work for some cases but won't for mine (I don't want virtiofsd/qemu 
> instances to "escape" the build container or run separately).
>

You can run within your parent slice, and even more conveniently with:
https://github.com/systemd/systemd/pull/15362


-- 
Marc-André Lureau



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-04 Thread Colin Walters



On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote:

> Now that systemd-nspawn works without privileges, isn't that also a
> solution? One that would fit both system and session level
> permissions, and integration with other services?

This is a complex topic and one I should probably write up in the bubblewrap 
README.md.  Today for example for CoreOS, our build and CI processes run inside 
OpenShift (Kubernetes) - we aren't running systemd inside our containers.

bubblewrap is a small self-contained C wrapper around the container system 
calls basically.  In contrast, AFAICS right now, nspawn requires systemd - 
which won't work for our use case.

Really the contention point here is systemd's dependency on cgroups for process 
tracking; in a "nested containerization" scenario you often just want the 
cgroups from the "outer" container to apply.  But having nested mounts/pid 
namespaces are still very useful.  (That said, cgroups v2 allows sane nesting, 
but we aren't there yet)

Also related is https://github.com/kubernetes/enhancements/issues/127 - without 
that one requires privileged containers to do nesting.

Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` 
because we fully trust the code running in these VMs.

Or to directly respond again to your proposal: systemd-nspawn as an option may 
work for some cases but won't for mine (I don't want virtiofsd/qemu instances 
to "escape" the build container or run separately).




Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-04 Thread Marc-André Lureau
Hi

On Fri, May 1, 2020 at 8:29 PM Colin Walters  wrote:
>
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
>
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
>
> First, change the code to use `clone()` and not `unshare()+fork()`.
>
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.
>
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
>

Now that systemd-nspawn works without privileges, isn't that also a
solution? One that would fit both system and session level
permissions, and integration with other services?

> Signed-off-by: Colin Walters 
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +-
>  1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>  printf("}\n");
>  }
>
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}
> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, 
> struct fuse_session *se)
>   * an empty network namespace to prevent TCP/IP and other network
>   * activity in case this process is compromised.
>   */
> -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -exit(1);
> +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +/* If we're non root, we need a new user namespace */
> +if (getuid() != 0) {
> +clone_flags |= CLONE_NEWUSER;
>  }
>
> -child = fork();
> +child = raw_clone(clone_flags, NULL);
>  if (child < 0) {
> -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>  exit(1);
>  }
>  if (child > 0) {
> --
> 2.24.1
>
>


-- 
Marc-André Lureau



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-04 Thread Stefan Hajnoczi
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> 
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)

Great, thanks for posting this! The topic of how and when to do the
sandboxing, as well as whether to require root, has come up several
times.

Please update the man page to explain the behavior that users should
expect when running as non-root:
1. What happens to uid/gid of files from the perspective of a user
   outside the sandbox?
2. Are there any limitations (certain operations that don't work)?

Thanks,
Stefan

> 
> Signed-off-by: Colin Walters 
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +-
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>  printf("}\n");
>  }
>  
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}
> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, 
> struct fuse_session *se)
>   * an empty network namespace to prevent TCP/IP and other network
>   * activity in case this process is compromised.
>   */
> -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -exit(1);
> +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +/* If we're non root, we need a new user namespace */
> +if (getuid() != 0) {
> +clone_flags |= CLONE_NEWUSER;
>  }
>  
> -child = fork();
> +child = raw_clone(clone_flags, NULL);
>  if (child < 0) {
> -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>  exit(1);
>  }
>  if (child > 0) {
> -- 
> 2.24.1
> 
> 


signature.asc
Description: PGP signature


Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-04 Thread Daniel P . Berrangé
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.

I'd suggest splitting these two, so that the re-factoring is separate
from introducing new functionality.

> 
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
> 
> Signed-off-by: Colin Walters 
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +-
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>  printf("}\n");
>  }
>  
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}

What's the reason for using the raw syscall ?  Was it just to avoid having
to allocate a new stack space ?


> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, 
> struct fuse_session *se)
>   * an empty network namespace to prevent TCP/IP and other network
>   * activity in case this process is compromised.
>   */
> -if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -exit(1);
> +int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +/* If we're non root, we need a new user namespace */
> +if (getuid() != 0) {
> +clone_flags |= CLONE_NEWUSER;
>  }

IIUC, with CLONE_NEWUSER we need to set a UID/GID mapping, otherwise all
file accesses will be squashed to the UID -1.  Or was it intentional
that you're only trying to provide read-only access to files that are
world-accessible ?



> -child = fork();
> +child = raw_clone(clone_flags, NULL);
>  if (child < 0) {
> -fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>  exit(1);
>  }
>  if (child > 0) {

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




[PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-01 Thread Colin Walters
I'd like to make use of virtiofs as part of our tooling in
https://github.com/coreos/coreos-assembler
Most of the code runs as non-root today; qemu also runs as non-root.
We use 9p right now.

virtiofsd's builtin sandboxing effectively assumes it runs as
root.

First, change the code to use `clone()` and not `unshare()+fork()`.

Next, automatically use `CLONE_NEWUSER` if we're running as non root.

This is similar logic to that in https://github.com/containers/bubblewrap
(Which...BTW, it could make sense for virtiofs to depend on bubblewrap
 and re-exec itself rather than re-implementing the containerization
 itself)

Signed-off-by: Colin Walters 
---
 tools/virtiofsd/passthrough_ll.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4c35c95b25..468617f6d6 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2530,6 +2530,21 @@ static void print_capabilities(void)
 printf("}\n");
 }
 
+/* Copied from bubblewrap */
+static int
+raw_clone(unsigned long flags, void *child_stack)
+{
+#if defined(__s390__) || defined(__CRIS__)
+  /*
+   * On s390 and cris the order of the first and second arguments
+   * of the raw clone() system call is reversed.
+   */
+return (int) syscall(__NR_clone, child_stack, flags);
+#else
+return (int) syscall(__NR_clone, flags, child_stack);
+#endif
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
@@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct 
fuse_session *se)
  * an empty network namespace to prevent TCP/IP and other network
  * activity in case this process is compromised.
  */
-if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
-fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
-exit(1);
+int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
+/* If we're non root, we need a new user namespace */
+if (getuid() != 0) {
+clone_flags |= CLONE_NEWUSER;
 }
 
-child = fork();
+child = raw_clone(clone_flags, NULL);
 if (child < 0) {
-fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
+fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
 exit(1);
 }
 if (child > 0) {
-- 
2.24.1