Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
On Thu, Jun 19, 2014 at 9:01 PM, Michael H. Warfield m...@wittsend.com wrote:

All concerned participants: Was there any further update on this problem? I'd like to know if we (I) should be updating the templates for either this aa_profile thing or for the mount sets. IIRC Christian was going to try something?

So far, in all my tests, none of the suggested values of lxc.mount.auto (including cgroup-full:mixed) is enough to get an f20 container running under the default apparmor profile. I have to do one of the following:

- Use the unconfined profile. Works, but is vulnerable to most known lxc exploits.
- Use lxc.hook.mount and lxc.hook.post-stop scripts that create a new, empty systemd cgroup hierarchy and bind-mount it to the container's /sys/fs/cgroup/systemd. Kinda messy, but this way the container is still protected by the apparmor profile.

The second approach would be ideal if it could be turned into something like an lxc.mount.auto=cgroup:systemd-new setting, but that's way beyond what I'm capable of.

For the next lxc release, as a user I suggest just uncommenting the aa_profile line.

--
Fajar

Regards, Mike

On Fri, 2014-05-30 at 01:00 +0200, Christian Seiler wrote:

Hi,

# lxc-attach -n f20 -- mount | grep cgroup
cgroup on /sys/fs/cgroup type tmpfs (rw,relatime,size=12k,mode=755)
none on /sys/fs/cgroup/cgmanager type tmpfs (rw,relatime,size=4k,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)

:-( This appears to be a rather nasty bug...

lxc does read the file /etc/lxc/lxc.conf that I created, verified by the fact that lxc.cgroup.pattern works correctly.
It does not, however, create the directory /sys/fs/cgroup/systemd/lxc-all/f20 (which, if I understand correctly, it should, since I use lxc.cgroup.use = @all)

# ls -d /sys/fs/cgroup/*/lxc-all/f20
/sys/fs/cgroup/blkio/lxc-all/f20
/sys/fs/cgroup/cpuset/lxc-all/f20
/sys/fs/cgroup/hugetlb/lxc-all/f20
/sys/fs/cgroup/cpuacct/lxc-all/f20
/sys/fs/cgroup/devices/lxc-all/f20
/sys/fs/cgroup/memory/lxc-all/f20
/sys/fs/cgroup/cpu/lxc-all/f20
/sys/fs/cgroup/freezer/lxc-all/f20
/sys/fs/cgroup/perf_event/lxc-all/f20

# mount | grep cgroup
none on /sys/fs/cgroup type tmpfs (rw,relatime,size=4k,mode=755)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset,release_agent=/run/cgmanager/agents/cgm-release-agent.cpuset,clone_children)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu,release_agent=/run/cgmanager/agents/cgm-release-agent.cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct,release_agent=/run/cgmanager/agents/cgm-release-agent.cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory,release_agent=/run/cgmanager/agents/cgm-release-agent.memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices,release_agent=/run/cgmanager/agents/cgm-release-agent.devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agent.blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-agent.hugetlb)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/run/cgmanager/agents/cgm-release-agent.systemd,name=systemd)

Hmm, are you running cgmanager at the same time as systemd?
I think this might be a problem with the intersection of cgmanager with the cgroup mounting code, i.e. the cgroup mounting code uses the cgfs stuff (which was originally just cgroup before Serge implemented multiple drivers) while the "put the container into a cgroup" code uses cgmanager, which may have some weird side effect in this case. I have to confess that so far I haven't tried cgmanager myself (it's on my todo list), so I never tested the interaction between Serge's cgmanager code and my cgroup mounting code...

If you are running cgmanager, could you try the same while cgmanager is stopped? Then LXC should fall back to the cgfs code, which *should* work in this case, unless something else broke this logic.

Anyway, I'll have a chance to look at this more closely on Saturday (I'm busy with other things tomorrow).

Regards,
Christian

--
Michael H. Warfield (AI4NB) | (770) 978-7061 | m...@wittsend.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF| possible worlds. A pessimist is sure of it!
___
lxc-users mailing list
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
Hi,

# lxc-attach -n f20 -- mount | grep cgroup
cgroup on /sys/fs/cgroup type tmpfs (rw,relatime,size=12k,mode=755)
none on /sys/fs/cgroup/cgmanager type tmpfs (rw,relatime,size=4k,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)

:-( This appears to be a rather nasty bug...

lxc does read the file /etc/lxc/lxc.conf that I created, verified by the fact that lxc.cgroup.pattern works correctly. It does not, however, create the directory /sys/fs/cgroup/systemd/lxc-all/f20 (which, if I understand correctly, it should, since I use lxc.cgroup.use = @all)

# ls -d /sys/fs/cgroup/*/lxc-all/f20
/sys/fs/cgroup/blkio/lxc-all/f20
/sys/fs/cgroup/cpuset/lxc-all/f20
/sys/fs/cgroup/hugetlb/lxc-all/f20
/sys/fs/cgroup/cpuacct/lxc-all/f20
/sys/fs/cgroup/devices/lxc-all/f20
/sys/fs/cgroup/memory/lxc-all/f20
/sys/fs/cgroup/cpu/lxc-all/f20
/sys/fs/cgroup/freezer/lxc-all/f20
/sys/fs/cgroup/perf_event/lxc-all/f20

# mount | grep cgroup
none on /sys/fs/cgroup type tmpfs (rw,relatime,size=4k,mode=755)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset,release_agent=/run/cgmanager/agents/cgm-release-agent.cpuset,clone_children)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu,release_agent=/run/cgmanager/agents/cgm-release-agent.cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct,release_agent=/run/cgmanager/agents/cgm-release-agent.cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory,release_agent=/run/cgmanager/agents/cgm-release-agent.memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices,release_agent=/run/cgmanager/agents/cgm-release-agent.devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agent.blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-agent.hugetlb)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/run/cgmanager/agents/cgm-release-agent.systemd,name=systemd)

Hmm, are you running cgmanager at the same time as systemd? I think this might be a problem with the intersection of cgmanager with the cgroup mounting code, i.e. the cgroup mounting code uses the cgfs stuff (which was originally just cgroup before Serge implemented multiple drivers) while the "put the container into a cgroup" code uses cgmanager, which may have some weird side effect in this case. I have to confess that so far I haven't tried cgmanager myself (it's on my todo list), so I never tested the interaction between Serge's cgmanager code and my cgroup mounting code...

If you are running cgmanager, could you try the same while cgmanager is stopped? Then LXC should fall back to the cgfs code, which *should* work in this case, unless something else broke this logic.

Anyway, I'll have a chance to look at this more closely on Saturday (I'm busy with other things tomorrow).

Regards,
Christian
___
lxc-users mailing list
lxc-users@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users
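The comparison above (which hierarchies are real cgroup mounts, and whether name=systemd is among them) can be scripted. The following is a minimal sketch, not part of LXC: the function names cgroup_mount_points and has_systemd_hierarchy are hypothetical, and both only parse text in the "SRC on DIR type FSTYPE (OPTS)" format that mount prints.

```shell
#!/bin/sh
# Hypothetical helpers (not part of LXC); they only parse `mount`-style text.

# Print the mount point of every filesystem of type cgroup found in
# `mount` output read from stdin ("SRC on DIR type FSTYPE (OPTS)").
cgroup_mount_points() {
    awk '$2 == "on" && $4 == "type" && $5 == "cgroup" { print $3 }'
}

# Succeed if the input contains a cgroup mount carrying the name=systemd
# option, i.e. the named systemd hierarchy is really mounted.
has_systemd_hierarchy() {
    grep -Eq ' type cgroup \(.*name=systemd[,)]'
}
```

Usage, run on the host against the container from this thread: lxc-attach -n f20 -- mount | has_systemd_hierarchy || echo "no name=systemd hierarchy in container". With the output shown above, the check would fail inside the container, because only tmpfs mounts appear under /sys/fs/cgroup there.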
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
Hi,

as I said before, I'll have a chance to look at the whole thing tomorrow myself, but just two quick things:

First it turns out I also needed to add lxc.mount.auto = sys before lxc.mount.auto = cgroup:mixed (otherwise I'd get a double /sys/fs/cgroup tmpfs mount).

Huh? So lxc.mount.auto = sys has to be there, obviously (otherwise /sys is not mounted), but what exactly do you mean by double?

What happens is:
- the container still ends up at "Freezing execution" while starting the root slice
- /sys/fs/cgroup/cpuset (and friends) are bind-mounted (there's an additional user/0.user/13.session directory, but I assume it's the effect of the ubuntu host's systemd, and is okay)
- the systemd mount in the container happens at /sys/fs/cgroup/systemd/user/0.user/13.session/lxc-all/f20, but the container expects /sys/fs/cgroup/systemd/ to be writable

So lxc.mount.auto = cgroup:mixed and lxc.cgroup.use = @all works, but it's not enough for fedora (and other systemd-based containers) to work properly.

Could you try the following?

lxc.mount.auto = sys cgroup-full:mixed

That will mount the whole cgroup tree, but with the parts outside of the container read-only.

In any case, I'll take a close look myself tomorrow.

Regards,
Christian
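When trying the different lxc.mount.auto values suggested here, a small helper keeps the container config free of duplicate lines. This is a sketch only: set_lxc_option is a hypothetical name, and the config path in the usage note is an assumption based on the /var/lib/lxc/f20 paths that appear elsewhere in the thread.

```shell
#!/bin/sh
# Hypothetical helper: append "KEY = VALUE" to a config file unless that
# exact line is already present. It never edits or removes existing lines.
set_lxc_option() {
    conf=$1 key=$2 value=$3
    line="$key = $value"
    grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}
```

Usage (path assumed): set_lxc_option /var/lib/lxc/f20/config lxc.mount.auto "sys cgroup-full:mixed". Since the helper only appends, an older lxc.mount.auto line with a different value has to be removed by hand.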
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
Quoting Fajar A. Nugraha (l...@fajar.net):
On Thu, May 29, 2014 at 10:58 AM, Serge Hallyn serge.hal...@ubuntu.com wrote:
Quoting Fajar A. Nugraha (l...@fajar.net):
On Thu, May 29, 2014 at 5:08 AM, Serge Hallyn serge.hal...@ubuntu.com wrote:

would systemd be happy with it being mounted by lxc using an lxc.mount.entry? I think that would be preferable to relaxing the apparmor policy. i.e.

lxc.mount.entry = /sys/fs/cgroup/systemd sys/fs/cgroup/systemd none bind,create=dir,optional 0 0

Wouldn't that be shadowed by the container mounting its own /sys?

If lxc mounts /sys then systemd will leave it be.

Apparently that line alone doesn't work for me. I also had to add before that:

lxc.mount.entry = sysfs sys sysfs default 0 0
lxc.mount.entry = none sys/fs/cgroup tmpfs rw 0 0

or

lxc.mount.auto = sys

That's what I meant by 'if lxc mounts /sys' :)

Stephane also pointed out in my (closed) pull request that it would also allow the container to mess with the host's resource allocation.

Yes, that's why lxc.mount.auto = cgroup:mixed is better. But the above mount entry is no worse than letting the container do it through apparmor.

That does not work, apparently.

### in config
lxc.mount.auto = cgroup:mixed
###

### lxc-start output
<30>systemd[1]: Starting Root Slice.
<27>systemd[1]: Caught SEGV, dumped core as pid 12.
<30>systemd[1]: Freezing execution.
###

Hm, that's unfortunate. I thought lxc.mount.auto = cgroup:mixed with cgfs would mount named subsystems? Christian?
###
# lxc-attach -n f20 -- mount
rpool/lxc on / type zfs (rw,noatime,xattr,noacl)
udev on /dev type devtmpfs (rw,relatime,size=2473540k,nr_inodes=618385,mode=755)
cgroup on /sys/fs/cgroup type tmpfs (rw,relatime,size=12k,mode=755)
none on /sys/fs/cgroup/cgmanager type tmpfs (rw,relatime,size=4k,mode=755)
devpts on /dev/lxc/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
devpts on /dev/lxc/tty1 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
devpts on /dev/lxc/tty2 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
devpts on /dev/lxc/tty3 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
devpts on /dev/lxc/tty4 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)

# lxc-attach -n f20 -- ls /sys/fs/cgroup/
blkio cpu,cpuacct cpuset devices freezer hugetlb memory perf_event systemd

# lxc-attach -n f20 -- ls /sys/fs/cgroup/systemd
(no output)
###

It looks like there are two lines for /sys/fs/cgroup? I'm using trusty's lxc-1.0.3.

This works (at least, tested with console and ssh login), and should be secure enough (it bind-mounts the container subdir instead of the whole systemd cgroup), but it is complicated.
### snippet of config
lxc.hook.mount = /var/lib/lxc/f20/bin/create_container_systemd_cgroup
lxc.hook.post-stop = /var/lib/lxc/f20/bin/remove_container_systemd_cgroup
###

### cat create_container_systemd_cgroup
#!/bin/bash
mkdir -p /sys/fs/cgroup/systemd/lxc/$LXC_NAME
mount -t sysfs sysfs $LXC_ROOTFS_MOUNT/sys
mount -t tmpfs none $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount --bind /sys/fs/cgroup/systemd/lxc/$LXC_NAME $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
###

### cat remove_container_systemd_cgroup
#!/bin/bash
[ -n "$LXC_NAME" ] && find /sys/fs/cgroup/systemd/lxc/$LXC_NAME -type d | tac | xargs rmdir
###

Is there a way to simplify this somehow for it to be more suitable in the template?

I suppose we could add a new lxc.mount.auto = cgroup:systemd option which only mounts name=systemd, read-only except for the container's own cgroup which is rw? But when I say "we" I don't really mean "we" :)

Will that work? The systemd cgroup mount is weird in the sense that there are no /lxc/CONTAINER_NAME subdirs under /sys/fs/cgroup/systemd, while there is one under /sys/fs/cgroup/{blkio,cpu,etc}. So for the systemd cgroup I don't see which parts should be mounted ro and which get rw.

The workaround hook I wrote earlier creates the directory /sys/fs/cgroup/systemd/lxc/CONTAINER_NAME on the host, and bind-mounts it as the container's /sys/fs/cgroup/systemd.

--
Fajar
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
Quoting Christian Seiler (christ...@iwakd.de):

Hi,

### lxc-start output
<30>systemd[1]: Starting Root Slice.
<27>systemd[1]: Caught SEGV, dumped core as pid 12.
<30>systemd[1]: Freezing execution.
###

Hm, that's unfortunate. I thought lxc.mount.auto = cgroup:mixed with cgfs would mount named subsystems? Christian?

Yes, but this is actually controlled by lxc.cgroup.use (in lxc.system.conf(5), *not* lxc.container.conf(5)). Basically, we were conservative back then and decided to only touch cgroups (both for putting the container into and also for bind-mounting) that were either kernel cgroups or that the user explicitly specified.

Ah, thanks. Fajar, does that fix it for you?

BUT I think for the auto-mounting hook we should maybe change that to use *all* hierarchies. It's just that auto-mounting came a bit later and I just reused the existing code at that point and didn't properly think through the implications.

I can provide a patch for changing this to all hierarchies for the auto-mounting case, but not today. In the meantime, you can just create a /etc/lxc/lxc.conf (or whatever LXC looks for on your system) with the following setting:

lxc.cgroup.use = @all

That will resort to using *all* named hierarchies. Or, alternatively, you can use something like

lxc.cgroup.use = @kernel systemd

to include all kernel hierarchies and the systemd hierarchy, but not other named ones.

Note btw. that including the systemd hierarchy here actually has some weird side-effects, since the lxc.cgroup.use setting applies both to the auto-mounting feature and to the "let's move the container into a cgroup" logic, thus directly modifying the systemd cgroup tree, something that systemd strongly discourages. I was actually working on an additional cgroup backend for LXC (in addition to cgfs and cgmanager) that interfaces with systemd's dbus interface, but I'm not nearly done yet.

Oh, great. Clearly finding a good place for cgmanager and systemd to intersect is on my todo list, maybe your driver will be inspiration. (My primary goal is to continue to support unprivileged nested containers as well with systemd as we do with upstart+cgmanager)

-serge
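To decide between @all and an explicit list such as "@kernel systemd", it helps to know which named hierarchies exist on the host. The sketch below is an illustration, not LXC code: named_hierarchies is a hypothetical name, and it assumes the cgroup v1 format of /proc/<pid>/cgroup, where each line is "ID:CONTROLLERS:PATH" and a named hierarchy shows up as "name=<something>" in the second field.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/<pid>/cgroup-style lines from stdin and
# print the name of every named hierarchy (second field "name=<something>").
# Kernel controller hierarchies (cpuset, memory, ...) are skipped.
named_hierarchies() {
    awk -F: '$2 ~ /^name=/ { sub(/^name=/, "", $2); print $2 }'
}
```

Usage: named_hierarchies < /proc/self/cgroup. If the only name printed is systemd, then "lxc.cgroup.use = @kernel systemd" and "lxc.cgroup.use = @all" should cover the same set of hierarchies on that host.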
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
Quoting Fajar A. Nugraha (l...@fajar.net):
(changed subject to match content)

On Tue, May 27, 2014 at 11:10 PM, Michael H. Warfield m...@wittsend.com wrote:
On Tue, 2014-05-27 at 15:33 +0700, Fajar A. Nugraha wrote:

On further test, this seems enough

###
# cat lxc-default-with-systemd
profile lxc-container-default-with-systemd flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/lxc/container-base>
  deny mount fstype=devpts,
  mount options=(none,name=systemd) fstype=cgroup -> /sys/fs/cgroup/systemd/,
}
###

This sounds excellent. It sounds like this should be incorporated into the lxc package for any host distros supporting AppArmor, and we could then add that default to all the systemd-based containers such as Fedora, Suse, eventually Oracle, and eventually CentOS. I agree it does seem to make more sense to use a restrictive profile that covers the minimal set of requirements as opposed to unconfined.

That should be submitted as a patch over on the lxc-devel list then, for Serge and Stéphane to review. I see where the file would need to be added in the config/apparmor/profiles directory, but I'm not familiar enough with the packaging for Ubuntu to know what changes would be needed to add them there.

I'll let Serge comment on this one.

As a side note, I've tested opensuse 13.1 (using the squashfs root from the rescue ISO) and it has two additional complaints with the previous apparmor profile:

May 27 17:12:50 trusty kernel: [66563.219898] type=1400 audit(1401185570.578:9249): apparmor=DENIED operation=mount info=failed type match error=-13 profile=lxc-container-default-with-systemd name=/var/run/ pid=30648 comm=mount srcname=/run/ flags=rw, bind

Hm. In Debian/Ubuntu this is done with a /var/run -> /run symlink...

May 27 17:21:20 trusty kernel: [67073.932892] type=1400 audit(1401186080.906:9846): apparmor=DENIED operation=mount info=failed flags match error=-13 profile=lxc-container-opensuse name=/proc/ pid=4158 comm=mount flags=rw, remount

The second one (/proc) is pretty harmless, so I ignored it. The first one (/var/run) produced lots of errors

[FAILED] Failed to mount Runtime Directory.
See 'systemctl status var-run.mount' for details.
[DEPEND] Dependency failed for System Logging Service.
Mounting Runtime Directory...
...

and made syslog (and possibly other services) fail to start, so for opensuse I had to adjust the profile even further

###
profile lxc-container-opensuse flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/lxc/container-base>
  deny mount fstype=devpts,
  mount options=(none,name=systemd) fstype=cgroup -> /sys/fs/cgroup/systemd/,
  mount options=(rw,bind),
}
###

Bind mounts inside a container should be safe, right?

While there are still some problems with the opensuse container (e.g. shutdown takes a long time on systemctl stop network@eth0.service), it is at least usable for testing purposes.

would systemd be happy with it being mounted by lxc using an lxc.mount.entry? I think that would be preferable to relaxing the apparmor policy. i.e.

lxc.mount.entry = /sys/fs/cgroup/systemd sys/fs/cgroup/systemd none bind,create=dir,optional 0 0

Or, of course, you can just do

lxc.mount.auto = cgroup:mixed

which should give you /sys/fs/cgroup/systemd if it exists on the host, and in a safer way.

Now if /sys/fs/cgroup/systemd does not exist on the host, these won't work...

As you say the bind mounts should be ok - although some of the mount options stuff doesn't work right in many apparmor parsers. So we'd want to make sure that 'mount options=(rw,bind)' does in fact only allow that, instead of suddenly allowing all mounts, as I've unfortunately seen happen when I tried to selectively allow some other mount options.

-serge
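When tuning a profile like the ones above, it is convenient to reduce the kernel log to just the profile and path of each denied mount. A minimal sketch, assuming log lines shaped like the two quoted in this thread: denied_mounts is a hypothetical helper, and it assumes profile names and paths contain no spaces. Real kernel logs usually wrap the values in double quotes, so those are stripped first.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines from stdin and print
# "PROFILE<TAB>NAME" for every AppArmor mount denial.
# Double quotes are removed first so that apparmor="DENIED" and the
# unquoted form shown in this thread are handled the same way.
denied_mounts() {
    tr -d '"' | awk '
        /apparmor=DENIED/ && /operation=mount/ {
            profile = name = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^profile=/)   profile = substr($i, 9)  # after "profile="
                else if ($i ~ /^name=/) name = substr($i, 6)     # after "name="
            }
            print profile "\t" name
        }'
}
```

Usage on the host: dmesg | denied_mounts | sort | uniq -c. The srcname= field is deliberately ignored because the match on name= is anchored at the start of the field.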
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
On Thu, May 29, 2014 at 5:08 AM, Serge Hallyn serge.hal...@ubuntu.comwrote: Quoting Fajar A. Nugraha (l...@fajar.net): (changed subject to match content) On Tue, May 27, 2014 at 11:10 PM, Michael H. Warfield m...@wittsend.com wrote: On Tue, 2014-05-27 at 15:33 +0700, Fajar A. Nugraha wrote: On further test, this seems enough ### # cat lxc-default-with-systemd profile lxc-container-default-with-systemd flags=(attach_disconnected,mediate_deleted) { #include abstractions/lxc/container-base deny mount fstype=devpts, mount options=(none,name=systemd) fstype=cgroup - /sys/fs/cgroup/systemd/, } ### This sounds excellent. It sounds like this should be incorporated into the lxc package for any host distros supporting app armour and we could then add that default to all the systemd based containers such as Fedora, Suse, eventually Oracle, and eventually CentOS. I agree it does seem to make more sense to use a restrictive profile that covers the minimal set of requirements as opposed to unconfined. That should be submitted as a patch over on the lxc-devel list then, for Serge and Stéphane to review. I see where the file would need to be added in the config/apparmour/profiles directory but I'm not familiar enough with the packaging for Ubuntu to know what changes would be needed to add them there. I'll let Serge comment on this one. As a side note, I've tested opensuse 13.1 (using the squashfs root from rescue ISO) and it has two additional complains with the previous apparmor profile: May 27 17:12:50 trusty kernel: [66563.219898] type=1400 audit(1401185570.578:9249): apparmor=DENIED operation=mount info=failed type match error=-13 profile=lxc-container-default-with-systemd name=/var/run/ pid=30648 comm=mount srcname=/run/ flags=rw, bind Hm. In Debian/Ubuntu this is done with a /var/run - /run symlink... something like that could probably be added to the opensuse template, modifying the current mount service. 
May 27 17:21:20 trusty kernel: [67073.932892] type=1400 audit(1401186080.906:9846): apparmor=DENIED operation=mount info=failed flags match error=-13 profile=lxc-container-opensuse name=/proc/ pid=4158 comm=mount flags=rw, remount the second one (/proc) is pretty harmless, so I ignored it. The first one (/var/run) produced lots of errors [FAILED] Failed to mount Runtime Directory. See 'systemctl status var-run.mount' for details. [DEPEND] Dependency failed for System Logging Service. Mounting Runtime Directory... ... and made syslog (and possibly other services) failed to start, so for opensuse I had to adjust the profile even further ### profile lxc-container-opensuse flags=(attach_disconnected,mediate_deleted) { #include abstractions/lxc/container-base deny mount fstype=devpts, mount options=(none,name=systemd) fstype=cgroup - /sys/fs/cgroup/systemd/, mount options=(rw,bind), } ### Bind mounts inside a container should be safe, right? While there are still some problems with opensuse container (e.g. shutdown takes a long time on systemctl stop network@eth0.service), it is at least usable for testing purposes. would systemd be happy with it being mounted by lxc using an lxc.mount.entry? I think that would be preferable to relaxing the apparmor policy. i.e. lxc.mount.entry = /sys/fs/cgroup/systemd sys/fs/cgroup/systemd none bind,create=dir,optional 0 0 Wouldn't that be shadowed by the container mounting its own /sys? Stephane also pointed out in my (closed) pull request that it would also allow the container to mess with the hosts's resource allocation. This works (at least, tested with console and ssh login), and should be secure-enough (bind-mount the container subdir, instead of the whole systemd cgroup), but complicated. 
### snippet of config
lxc.hook.mount = /var/lib/lxc/f20/bin/create_container_systemd_cgroup
lxc.hook.post-stop = /var/lib/lxc/f20/bin/remove_container_systemd_cgroup
###

### cat create_container_systemd_cgroup
#!/bin/bash
mkdir -p /sys/fs/cgroup/systemd/lxc/$LXC_NAME
mount -t sysfs sysfs $LXC_ROOTFS_MOUNT/sys
mount -t tmpfs none $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount --bind /sys/fs/cgroup/systemd/lxc/$LXC_NAME $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
###

### cat remove_container_systemd_cgroup
#!/bin/bash
[ -n "$LXC_NAME" ] && find /sys/fs/cgroup/systemd/lxc/$LXC_NAME -type d | tac | xargs rmdir
###

Is there a way to simplify this somehow for it to be more suitable in the template?

--
Fajar
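The cleanup step of the post-stop hook is the part most likely to misbehave if LXC_NAME is ever empty, because the path would then expand to the whole lxc directory. The following is a sketch of a more defensive variant, not the hook from the message above: remove_cgroup_tree is a hypothetical function that takes the base directory and the container name as arguments and, like the original hook, removes directories with rmdir only.

```shell
#!/bin/sh
# Hypothetical helper: remove the directory tree BASE/NAME bottom-up using
# rmdir only, as the original hook does. Refuses an empty NAME so that an
# unset LXC_NAME can never expand to the whole BASE directory.
remove_cgroup_tree() {
    base=$1 name=$2
    [ -n "$name" ] || { echo "remove_cgroup_tree: empty name" >&2; return 1; }
    [ -d "$base/$name" ] || return 0      # nothing to clean up
    # -depth lists children before their parents, so rmdir sees leaf
    # directories first.
    find "$base/$name" -depth -type d -exec rmdir {} +
}
```

Usage inside a post-stop hook, with the path from this thread: remove_cgroup_tree /sys/fs/cgroup/systemd/lxc "$LXC_NAME". Using find -depth replaces the "find | tac" pipeline and avoids problems with unusual characters in directory names.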
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
Quoting Fajar A. Nugraha (l...@fajar.net):
On Thu, May 29, 2014 at 5:08 AM, Serge Hallyn serge.hal...@ubuntu.com wrote:

would systemd be happy with it being mounted by lxc using an lxc.mount.entry? I think that would be preferable to relaxing the apparmor policy. i.e.

lxc.mount.entry = /sys/fs/cgroup/systemd sys/fs/cgroup/systemd none bind,create=dir,optional 0 0

Wouldn't that be shadowed by the container mounting its own /sys?

If lxc mounts /sys then systemd will leave it be.

Stephane also pointed out in my (closed) pull request that it would also allow the container to mess with the host's resource allocation.

Yes, that's why lxc.mount.auto = cgroup:mixed is better. But the above mount entry is no worse than letting the container do it through apparmor.

This works (at least, tested with console and ssh login), and should be secure enough (it bind-mounts the container subdir instead of the whole systemd cgroup), but it is complicated.

### snippet of config
lxc.hook.mount = /var/lib/lxc/f20/bin/create_container_systemd_cgroup
lxc.hook.post-stop = /var/lib/lxc/f20/bin/remove_container_systemd_cgroup
###

### cat create_container_systemd_cgroup
#!/bin/bash
mkdir -p /sys/fs/cgroup/systemd/lxc/$LXC_NAME
mount -t sysfs sysfs $LXC_ROOTFS_MOUNT/sys
mount -t tmpfs none $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount --bind /sys/fs/cgroup/systemd/lxc/$LXC_NAME $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
###

### cat remove_container_systemd_cgroup
#!/bin/bash
[ -n "$LXC_NAME" ] && find /sys/fs/cgroup/systemd/lxc/$LXC_NAME -type d | tac | xargs rmdir
###

Is there a way to simplify this somehow for it to be more suitable in the template?

I suppose we could add a new lxc.mount.auto = cgroup:systemd option which only mounts name=systemd, read-only except for the container's own cgroup which is rw? But when I say "we" I don't really mean "we" :)
Re: [lxc-users] apparmor profile for systemd containers (WAS: Fedora container thinks it is not running)
On Thu, May 29, 2014 at 10:58 AM, Serge Hallyn serge.hal...@ubuntu.com wrote:
Quoting Fajar A. Nugraha (l...@fajar.net):
On Thu, May 29, 2014 at 5:08 AM, Serge Hallyn serge.hal...@ubuntu.com wrote:

would systemd be happy with it being mounted by lxc using an lxc.mount.entry? I think that would be preferable to relaxing the apparmor policy. i.e.

lxc.mount.entry = /sys/fs/cgroup/systemd sys/fs/cgroup/systemd none bind,create=dir,optional 0 0

Wouldn't that be shadowed by the container mounting its own /sys?

If lxc mounts /sys then systemd will leave it be.

Apparently that line alone doesn't work for me. I also had to add before that:

lxc.mount.entry = sysfs sys sysfs default 0 0
lxc.mount.entry = none sys/fs/cgroup tmpfs rw 0 0

Stephane also pointed out in my (closed) pull request that it would also allow the container to mess with the host's resource allocation.

Yes, that's why lxc.mount.auto = cgroup:mixed is better. But the above mount entry is no worse than letting the container do it through apparmor.

That does not work, apparently.

### in config
lxc.mount.auto = cgroup:mixed
###

### lxc-start output
<30>systemd[1]: Starting Root Slice.
<27>systemd[1]: Caught SEGV, dumped core as pid 12.
<30>systemd[1]: Freezing execution.
### ### # lxc-attach -n f20 -- mount rpool/lxc on / type zfs (rw,noatime,xattr,noacl) udev on /dev type devtmpfs (rw,relatime,size=2473540k,nr_inodes=618385,mode=755) cgroup on /sys/fs/cgroup type tmpfs (rw,relatime,size=12k,mode=755) none on /sys/fs/cgroup/cgmanager type tmpfs (rw,relatime,size=4k,mode=755) devpts on /dev/lxc/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000) devpts on /dev/lxc/tty1 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000) devpts on /dev/lxc/tty2 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000) devpts on /dev/lxc/tty3 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000) devpts on /dev/lxc/tty4 type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000) devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666) proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime) tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755) tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755) # lxc-attach -n f20 -- ls /sys/fs/cgroup/ blkio cpu,cpuacct cpuset devices freezer hugetlb memory perf_event systemd # lxc-attach -n f20 -- ls /sys/fs/cgroup/systemd (no output) ### It looks like there's two lines for /sys/fs/cgroup? I'm using trusty's lxc-1.0.3. This works (at least, tested with console and ssh login), and should be secure-enough (bind-mount the container subdir, instead of the whole systemd cgroup), but complicated. 
### snippet of config
lxc.hook.mount = /var/lib/lxc/f20/bin/create_container_systemd_cgroup
lxc.hook.post-stop = /var/lib/lxc/f20/bin/remove_container_systemd_cgroup
###

### cat create_container_systemd_cgroup
#!/bin/bash
mkdir -p /sys/fs/cgroup/systemd/lxc/$LXC_NAME
mount -t sysfs sysfs $LXC_ROOTFS_MOUNT/sys
mount -t tmpfs none $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount --bind /sys/fs/cgroup/systemd/lxc/$LXC_NAME $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
###

### cat remove_container_systemd_cgroup
#!/bin/bash
[ -n "$LXC_NAME" ] && find /sys/fs/cgroup/systemd/lxc/$LXC_NAME -type d | tac | xargs rmdir
###

Is there a way to simplify this somehow for it to be more suitable in the template?

I suppose we could add a new lxc.mount.auto = cgroup:systemd option which only mounts name=systemd, read-only except for the container's own cgroup which is rw? But when I say "we" I don't really mean "we" :)

Will that work? The systemd cgroup mount is weird in the sense that there are no /lxc/CONTAINER_NAME subdirs under /sys/fs/cgroup/systemd, while there is one under /sys/fs/cgroup/{blkio,cpu,etc}. So for the systemd cgroup I don't see which parts should be mounted ro and which get rw.

The workaround hook I wrote earlier creates the directory /sys/fs/cgroup/systemd/lxc/CONTAINER_NAME on the host, and bind-mounts it as the container's /sys/fs/cgroup/systemd.

--
Fajar