On Wed, Jul 11, 2018 at 12:10 PM <nert@wheatley> wrote:

> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jba...@akamai.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers
> >>>> is problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN, and devtmpfs is not typically mounted inside a container
> >>>> as it's not namespace aware. It is possible to do a mknod() in the
> >>>> container once the tap devices are created; however, since the tap
> >>>> devices are created dynamically, it's not possible to allow access to
> >>>> certain major/minor numbers a priori, since we don't know what these
> >>>> are going to be. In addition, it's desirable to not allow the mknod
> >>>> capability in containers. This behavior, I think, is somewhat
> >>>> inconsistent with the tuntap driver, where one can create tuntap
> >>>> devices inside a container by first opening /dev/net/tun and then
> >>>> using them by supplying the tuntap device name via
> >>>> ioctl(TUNSETIFF). And since TUNSETIFF validates the network
> >>>> namespace, one is limited to opening network devices that belong to
> >>>> one's current network namespace.
> >>>>
> >>>> Here are some options for this issue that I wanted to get feedback
> >>>> about; I'm also wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap, in the container.
> >>>> Instead, create the tap device outside of the container and then move
> >>>> it into the desired container network namespace. In addition, do a
> >>>> mknod() for the corresponding /dev/tapNN device from outside the
> >>>> container before doing chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones). This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the
> >>> container, I would want to pursue a way where all network preparations
> >>> which require privileges happen from a privileged process *outside* of
> >>> the container, like CNI solutions do it. They run outside, have
> >>> privileges and then create devices in the right network/mount namespace
> >>> or move them there. The final goal for KubeVirt is that our pod with
> >>> the qemu process is completely unprivileged and privileged setup
> >>> happens from outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part outside of containers creates the devices by entering the
> >>> right namespaces, or that libvirt in the container can consume
> >>> pre-created tun/tap devices, like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a
> >> need for some privileges if you want to use a tap device. It's nice
> >> that CNI does that and all the containers can run unprivileged, but
> >> that's because they do not open the tap device and they do not do any
> >> privileged operations on it. But QEMU needs to. So the only way would
> >> be passing an opened fd to the container or opening the tap device
> >> there and making the fd usable for one process in the container. Is
> >> this already supported for some type of containers in some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here, call it #3, is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
>
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got confused by the fact that anyone can open and do a R/W on a tap
> device. But it looks like that's on purpose. No capabilities are needed
> for opening /dev/net/tun and calling ioctl(TUNSETIFF) with an existing
> name and then doing R/W operations on it. It just works.
>
> Correct me if I'm wrong, but to sum it all up, the only things that we
> need to figure out (which might possibly be solved by ideas in the other
> thread) are:
>
> tap:
>  - Existence of /dev/net/tun
>  - Having permissions to open it (0666 by default, shouldn't be a big deal)
>  - Knowing the device name
>
> macvtap:
>  - Existence of /dev/tapXX
>  - Having permissions to open /dev/tapXX
>  - One of the following:
>    - Knowing the device name (and being able to translate it using a
>      netlink socket)
>    - Knowing the device index
>
> The rest should be an implementation detail.
>
> Am I right? Did I miss anything?
At least for the KubeVirt use case, those sound like the things we would
need in order to solve the networking setup in a way similar to how the
Container Network Interface implementations solve it in k8s.

Best Regards,
Roman
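For reference, the tap and macvtap checklists quoted above can be sketched
roughly as follows. This is only an illustration, not what libvirt or QEMU
actually do. The helper names (build_ifreq, open_tap, macvtap_path,
open_macvtap) are made up for this mail. The sketch assumes 64-bit Linux
(a 40-byte struct ifreq), assumes that the XX in /dev/tapXX is the
interface index, and uses socket.if_nametoindex() as a stand-in for the
netlink lookup mentioned in the list.

```python
import fcntl
import os
import socket
import struct

# Constants from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA  # _IOW('T', 202, int)
IFF_TAP = 0x0002        # layer-2 (Ethernet) device
IFF_NO_PI = 0x1000      # no packet-information header on reads/writes
IFNAMSIZ = 16


def build_ifreq(ifname, flags=IFF_TAP | IFF_NO_PI):
    """Pack a struct ifreq holding the device name and the TUNSETIFF flags."""
    name = ifname.encode()
    if len(name) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    # 16-byte name + 2-byte flags, padded to 40 bytes.
    # Assumption: sizeof(struct ifreq) == 40, as on 64-bit Linux.
    return struct.pack("16sH22x", name, flags)


def open_tap(ifname):
    """tap: open /dev/net/tun and attach to the named device."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(ifname))
    except OSError:
        os.close(fd)
        raise
    return fd


def macvtap_path(ifname, resolve=socket.if_nametoindex):
    """macvtap: translate the device name to its /dev/tapXX node.

    Assumption: XX is the interface index. `resolve` is looked up in the
    caller's network namespace; it replaces the netlink query here.
    """
    return f"/dev/tap{resolve(ifname)}"


def open_macvtap(ifname):
    """macvtap: open the character device; the node must already exist."""
    return os.open(macvtap_path(ifname), os.O_RDWR)
```

Something like open_tap("tap0") would only attach without privileges if
tap0 already exists in the current network namespace; if the name does not
exist, the same ioctl tries to create the device, which is the privileged
part. open_macvtap() still depends on the /dev/tapXX node being present and
openable in the container, which is exactly the open question in this
thread.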
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list