On November 5, 2018 8:12:35 AM GMT+03:00, kemi <kemi.w...@intel.com> wrote:
>
>
>On 2018/11/2 8:05 PM, Fajar A. Nugraha wrote:
>> On Fri, Nov 2, 2018 at 8:44 AM, kemi <kemi.w...@intel.com> wrote:
>>
>>>
>>> thx for your question.
>>> In our case, our customers want to run android games within containers
>>> on cloud.
>>>
>>
>> It might be possible for you to adjust https://anbox.io/ to run on lxd
>> instead of lxc. YMMV.
>>
>
>anbox provides a GUI interface to run android in container.
>We don't need that GUI which leads to extra overhead. Also,
>Anbox can't offer thousands of containers running in parallel.
>
>>> There are two problems we have known.
>>> The first one occurs during Android OS boot, the coldboot of Android
>>> requires to write uevent file in /sys, this will trigger an uevent
>>> broadcast to all of listeners (udev daemons) in user space (this uevent
>>> is sent from kernel via netlink), with the increase of container number
>>> (200+), we found the boot latency has reached 1~2 mins. And latency
>>> would be intolerable when the number reaches 500.

That is no longer true from kernels 4.17 onwards. I should really write a
blogpost about my patchset it seems. This keeps popping up every now and
then. So, I'm going to explain this in a little more detail here.

Uevents were previously broadcast into all network namespaces. This was
obviously problematic:

- You could be smarter than you should be and trick the system into
  running a second udev daemon in a non-initial network namespace that is
  owned by the initial user namespace. That has the potential to wreck the
  system. However, this only affects privileged containers that would be
  dumb enough to mount /sys read-write.
- You could see an insane performance hit when you ran large numbers of
  containers that each ran a udev daemon since the kernel would broadcast
  these events to all of them. This is made worse by the fact that in
  non-initial network namespaces that are owned by non-initial user
  namespaces the kernel would not fix up the uid and gid relative to the
  owning user namespace of the network namespace. That meant user space
  would see those events with INVALID_{G,U}ID, which causes udev to ignore
  those events. Effectively, the kernel was screaming uevents into the void
  for absolutely no good reason. Moreover, the id permissions weren't even
  fixed up for namespaced devices such as network devices that can be owned
  by different network namespaces (e.g. moving a physical network device
  into an unprivileged container).
- You could technically spy on the host's device events from an
  unprivileged container. It's probably not an attack vector but it is
  definitely an information leak.
- You had no way of delegating a device to a container since uevents that
  were received for it were unusable (cf. above) but you also had no way of
  injecting/forwarding uevents to a container.

For all those reasons I wrote several patches that namespace uevents and
allow injecting uevents:

- 94e5e3087a67c765be98592b36d8d187566478d5
- 692ec06d7c92af8ca841a6367648b9b3045344fd
- 26045a7b14bc7a5455e411d820110f66557d6589
- a3498436b3a0f8ec289e6847e1de40b4123e1639

So, the first two patches make it possible to forward/inject uevents into
other network namespaces if the caller has CAP_NET_ADMIN in the owning user
namespace of the target network namespace. This effectively allows for
device namespaces. Any forwarded/injected uevent should strip/not add a
sequence number. The kernel will append the correct sequence number to the
buffer itself.

The following two patches are concerned with isolating uevents aka
namespacing them more cleanly. Because #legacybehavior we came up with the
following logic: uevents are restricted to all network namespaces that are
owned by the initial user namespace. This implies that all non-initial
network namespaces that are owned by non-initial user namespaces do not
receive any uevents unless the kobject (the in-kernel device
representation, e.g. of a network device) carries a namespace tag or a
uevent is forwarded/injected. My patches ensure that network namespace
specific uevents and forwarded/injected uevents get their permissions
fixed up according to the owning user namespace of the target network
namespace. This has the nice consequence that delegated network devices
(physical, virtual, SR-IOV) can now be seen by udev inside unprivileged
containers.

So if uevents were a bottleneck for you then it shouldn't be the case
anymore, for unprivileged containers at least. The in-kernel locking is
also improved by my patches and I have plans to further improve it. I just
need to find the time.

If you're running privileged containers and uevents are still a bottleneck
for you we can think about a per-network-namespace sysctl that might allow
you to opt in or out per network namespace. Although I doubt that's a clean
enough option.

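To make the delivery rules above a bit more concrete, here is a rough
Python model of them. This is only a sketch of the logic as described in
this mail, not kernel code: the NetNs class and the receives_uevent() and
strip_seqnum() helpers are names made up for illustration. The only format
assumption is that a uevent message is a sequence of NUL-separated fields
(an "action@devpath" header followed by KEY=VALUE pairs, one of which is
SEQNUM).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetNs:
    """Hypothetical stand-in for a network namespace."""
    name: str
    owned_by_init_userns: bool


def receives_uevent(target: NetNs,
                    tagged_for: Optional[NetNs] = None,
                    injected_into: Optional[NetNs] = None) -> bool:
    """Model of which network namespaces see a given uevent.

    - An injected/forwarded uevent is seen only in the namespace it was
      injected into.
    - A uevent for a namespace-tagged kobject (e.g. a network device) is
      seen only in the namespace the tag points to.
    - Everything else goes only to network namespaces owned by the
      initial user namespace.
    """
    if injected_into is not None:
        return target == injected_into
    if tagged_for is not None:
        return target == tagged_for
    return target.owned_by_init_userns


def strip_seqnum(raw: bytes) -> bytes:
    """Drop the SEQNUM field from a raw uevent before forwarding it.

    The kernel appends the correct sequence number itself, so a
    forwarder should not pass one along.
    """
    fields = [f for f in raw.split(b"\x00") if f]
    kept = [f for f in fields if not f.startswith(b"SEQNUM=")]
    if not kept:
        return b""
    return b"\x00".join(kept) + b"\x00"
```

With this model, an unprivileged container's network namespace
(owned_by_init_userns=False) gets nothing from plain host uevents, but
still sees events for devices tagged for it and events injected into it.
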
>>>
>>>
>> I don't see udev running inside it's lxc container, so perhaps they've
>> managed to solve that issue

Udevd will usually not run in unprivileged containers since /sys is mounted
ro so it won't start. However, in unprivileged containers /sys can safely
be mounted rw and udev will start. This also makes sense on kernels with my
patches added (cf. above).

>>
>
>root@kemi-desktop:/home/kemi/git# lxc list
>+--------+---------+---------------------+------+------------+-----------+
>| NAME   | STATE   | IPV4                | IPV6 | TYPE       | SNAPSHOTS |
>+--------+---------+---------------------+------+------------+-----------+
>| first  | RUNNING | 10.70.45.163 (eth0) |      | PERSISTENT | 0         |
>+--------+---------+---------------------+------+------------+-----------+
>| second | STOPPED |                     |      | PERSISTENT | 0         |
>+--------+---------+---------------------+------+------------+-----------+
>root@kemi-desktop:/home/kemi/git#
>root@kemi-desktop:/home/kemi/git# lxc exec first -- bash
>root@first:~#
>root@first:~# ps -ef|grep udev
>root        61     1  0 Nov01 ?        00:00:00 /lib/systemd/systemd-udevd
>root      2252  2241  0 05:07 ?        00:00:00 grep --color=auto udev
>root@first:~#
>
>Seems udevd (I used ubuntu 18.04 image) is running in lxc container.
>Correct me if I misunderstood something, thx.
>
>>
>>> The second one occurs when an app in container begins to run, it will
>>> read /sys/devices/system/cpu/online file to get available cpu number
>>> before creating threads accordingly. Then the problem is, sysfs now is
>>> shared with host, it will get the CPU number equals to host thread
>>> number even if the cpu number of container is limited.
>>>
>>>
>> If it simply reads the file, you could simply mount a text file on it.
>> Similar to what lxcfs does, but simpler.
>>
>
>Good suggestion. We are considering this workaround.
>But it may not be a common solution, because no one knows which file in
>/sys will be used by app in userspace.
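
For the specific file mentioned, the text-file workaround only needs
content in the same CPU-list format the kernel uses (comma-separated ids
and ranges such as "0-3,8", followed by a newline). A minimal sketch of
generating and checking such content is below; format_cpu_list() and
count_cpus() are hypothetical helper names, and how the resulting file gets
bind-mounted over /sys/devices/system/cpu/online in the container is left
out on purpose.

```python
from typing import Iterable, List, Tuple


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Render CPU ids in the sysfs list format, e.g. [0, 1, 2, 3, 8] -> "0-3,8"."""
    ids = sorted(set(cpus))
    ranges: List[Tuple[int, int]] = []
    for cpu in ids:
        if ranges and cpu == ranges[-1][1] + 1:
            # Extend the current run of consecutive ids.
            ranges[-1] = (ranges[-1][0], cpu)
        else:
            # Start a new run.
            ranges.append((cpu, cpu))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def count_cpus(cpu_list: str) -> int:
    """Count the CPUs described by a sysfs-style list such as "0-3,8\\n"."""
    total = 0
    for part in cpu_list.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        total += int(last or first) - int(first) + 1
    return total


def fake_online_content(cpus: Iterable[int]) -> str:
    """Content for a text file meant to shadow the 'online' file (assumption:
    a trailing newline, as sysfs files usually have)."""
    return format_cpu_list(cpus) + "\n"
```

An app that reads the shadowed file and sizes its thread pool from it
would then see, for example, two CPUs for fake_online_content([0, 1]).
This does nothing for other files in /sys, which is exactly the limitation
raised above.
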
_______________________________________________
lxc-users mailing list
lxc-users@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users