Re: [systemd-devel] [PATCH] audit: Fix journal failing on unsupported audit in containers [was: journal: don't complain about audit socket errors in a container.]

2015-05-21 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
 On Wed, 20.05.15 22:40, Martin Pitt (martin.p...@ubuntu.com) wrote:
 
  Hey Lennart,
  
  Lennart Poettering [2015-05-20 17:49 +0200]:
   Nope, ConditionSecurity=audit is only a simple boolean check that
   holds when audit is enabled at all. It doesn't tell you anything about
   the precise audit feature set of the kernel.
  
  Ah, thanks for the clarification.
  
   I have now conditionalized the unit on CAP_AUDIT_READ, which is the
   cap that you need to read the audit multicast stuff. Your container
   manager hence should simply drop that cap from the cap set it passes
   and all should be good.

I want to clarify this point.  Dropping CAP_AUDIT_READ from the bounding
set means dropping it from the capabilities targeted at your own user
namespace.  The only check in the kernel for CAP_AUDIT_READ currently is
against the initial user namespace.  One day of course (maybe soon) this
may change so that you only need CAP_AUDIT_READ against your own
user_ns.  Following the above, container managers could then again keep
CAP_AUDIT_READ in the bounding set.

But I'm claiming that checking for CAP_AUDIT_READ in your bounding set
is the wrong check here.  It simply has nothing to do with what you
actually want to be able to do.  One could argue that the right answer
is a new kernel facility to check for caps against init_user_ns, but no,
that will have the same problem once audit namespaces become possible.  I
think the right check for systemd to perform, to find out whether this is
allowed, is to actually try the bind().  That will return the right
answer both now and when namespaced audit is possible, without taking a
probably-wrong, unrelated cue from the container manager.
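
For concreteness, a minimal sketch of such a probe (not systemd's actual
code; it assumes kernel headers new enough to define AUDIT_NLGRP_READLOG
and treats EPERM/EACCES from the bind() as "not allowed"):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/audit.h>

/* Try to join the audit read-log multicast group.  Returns 1 if allowed,
 * 0 if the kernel or our credentials don't permit it, -errno otherwise. */
static int audit_multicast_readable(void)
{
        struct sockaddr_nl sa = {
                .nl_family = AF_NETLINK,
                .nl_pid    = 0,
                .nl_groups = AUDIT_NLGRP_READLOG,
        };
        int fd, r = 1;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
        if (fd < 0)
                return (errno == EPROTONOSUPPORT || errno == EAFNOSUPPORT) ? 0 : -errno;

        if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0)
                r = (errno == EPERM || errno == EACCES) ? 0 : -errno;

        close(fd);
        return r;
}

int main(void)
{
        int r = audit_multicast_readable();

        printf("audit multicast readable: %s\n",
               r > 0 ? "yes" : r == 0 ? "no" : strerror(-r));
        return 0;
}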

It's not earth-shatteringly important and what you've got is workable,
but I think it may set a better precedent to do it the other way.

-serge

(One might almost think that we should have a new kernel facility to
answer such questions.  CAP_MAC_ADMIN is similar.)


Re: [systemd-devel] dynamic uid allocation (was: [PATCH] loopback setup in unprivileged containers)

2015-02-03 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
 On Tue, 03.02.15 15:03, Daniel P. Berrange (berra...@redhat.com) wrote:
 
   Hmm, so, I thought a lot about this in the past weeks. I think the way
   I'd really like to see this work in the end is that we never have to
   persist the UID mappings. This could work if the kernel would provide
   us with the ability to bind mount a file system into the container
   applying a fixed UID shift. That way, the shifted UIDs would never hit
   the actual disk, and hence we wouldn't have to persist their mappings.
   
   Instead on each container startup we'd look for a new UID range, and
   release it entirely when the container shuts down. The bind mount with
   UID shift would then shift the UIDs up, the userns stuff would shift
   it down from inside the container again.
   
   Of course, this all depends on whether the kernel will get an
   extension to apply uid shifts to bind mounts. I hear they want to
   provide this, but let's see.
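
For context, the "shift it down from inside the container" step is today's
/proc/<pid>/uid_map / gid_map mechanism; a minimal sketch, with a placeholder
pid and the example 100000-range mapping (an unprivileged writer on newer
kernels must additionally handle /proc/<pid>/setgroups before gid_map):

#include <stdio.h>
#include <sys/types.h>

/* Write one of /proc/<pid>/{uid_map,gid_map} for a process that has just
 * entered a new user namespace.  Each line is "<id inside ns> <id outside
 * ns> <count>"; the map may only be written once. */
static int write_idmap(pid_t pid, const char *file, const char *mapping)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/%s", (int) pid, file);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs(mapping, f);
        return fclose(f);
}

int main(void)
{
        pid_t child = 12345;    /* placeholder: pid of the container's init */

        /* container uid/gid 0-65535 -> host 100000-165535 */
        write_idmap(child, "uid_map", "0 100000 65536\n");
        write_idmap(child, "gid_map", "0 100000 65536\n");
        return 0;
}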
  
  I would dearly love to see that happen. Having to recursively change

It'd definitely be useful (though not without issues).

  the UID/GID on entire filesystem sub-trees given to containers with
  userns is a real unpleasant thing to have to deal with. I'd not want

Of course you would *not* want to take a stock rootfs where uid == 0
and shift that into the container, as that would give root in the
container a chance to write root-owned files on the host to leverage
later in a convoluted attack :)  We might want to come up with a
containers consensus that container rootfs's are always shipped with
uid range 0-65535 mapped to 100000-165535.  That still leaves a chance
for container A (mapped to 200000-265535) to write a valid setuid-root
binary for container B (mapped to 300000-365535), which isn't possible
otherwise.  But that's better than doing so for host-root.
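
For reference, the recursive UID/GID shift being lamented above might look
roughly like the following sketch; the offset and rootfs path are
placeholders, and a real tool would also have to restore setuid/setgid mode
bits that chown clears, plus handle ACLs and xattrs:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

static uid_t uid_offset = 100000;       /* example: shift into 100000-165535 */
static gid_t gid_offset = 100000;

static int shift_one(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf)
{
        (void) typeflag;
        (void) ftwbuf;

        /* lchown so that symlinks themselves get shifted, not their targets */
        if (lchown(path, sb->st_uid + uid_offset, sb->st_gid + gid_offset) < 0)
                perror(path);
        return 0;                       /* keep walking even on errors */
}

int main(int argc, char **argv)
{
        const char *rootfs = argc > 1 ? argv[1] : "/var/lib/lxc/c1/rootfs";

        /* FTW_PHYS: don't follow symlinks; FTW_MOUNT: stay on this filesystem */
        return nftw(rootfs, shift_one, 64, FTW_PHYS | FTW_MOUNT) != 0;
}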

  the filesystem UID shift to only apply to bind mounts though. It is
  not uncommon to use a disk image[1] for a container's filesystem, so
  being able to request a UID shift on *any* filesystem mount is pretty
  desirable, rather than having to mount the image and then bind mount
  it onto itself just to apply the UID shift.
 
 Well, you can always change the flags of an existing bind mount with
 MS_BIND|MS_REMOUNT, without creating a new bind mount.
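
A minimal sketch of that flag-change trick; the target path is a placeholder
and MS_RDONLY just serves as an example flag to toggle:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        const char *target = "/path/to/existing/bind-mount";   /* placeholder */

        /* MS_REMOUNT|MS_BIND changes the per-mount flags of an existing
         * (bind) mount in place; no source or fstype is needed. */
        if (mount(NULL, target, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}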
 
  [1] Using a separate disk image per container means a container can't
  DOS other containers by exhausting inodes for example with $millions
  of small files.
 
 Indeed. I'd claim that without such a concept of mount point uid
 shifting the whole userns story is not very useful IRL...

I had always thought this would eventually be done using a stackable
filesystem, but doing it at bind mount time would be neat too, and
less objectionable to the kernel folks.  (Though overlayfs is in now,
so shrug)

I'm actually quite surprised no one has sat down and written a
stackable uid-shifting fs yet.

-serge


Re: [systemd-devel] [PATCH] netns: unix: only allow to find out unix socket in same net namespace

2013-08-26 Thread Serge Hallyn
Quoting Gao feng (gaof...@cn.fujitsu.com):
 On 08/26/2013 11:19 AM, James Bottomley wrote:
  On Mon, 2013-08-26 at 09:06 +0800, Gao feng wrote:
  On 08/26/2013 02:16 AM, James Bottomley wrote:
  On Sun, 2013-08-25 at 19:37 +0200, Kay Sievers wrote:
  On Sun, Aug 25, 2013 at 7:16 PM, James Bottomley
  jbottom...@parallels.com wrote:
  On Wed, 2013-08-21 at 11:51 +0200, Kay Sievers wrote:
  On Wed, Aug 21, 2013 at 9:22 AM, Gao feng gaof...@cn.fujitsu.com 
  wrote:
  On 08/21/2013 03:06 PM, Eric W. Biederman wrote:
 
  I suspect libvirt should simply not share /run or any other normally
  writable directory with the host.  Sharing /run /var/run or even /tmp
  seems extremely dubious if you want some kind of containment, and
  without strange things spilling through.
 
  Right, /run or /var cannot be shared. It's not only about sockets,
  many other things will also go really wrong that way.
 
  This is very narrow thinking about what a container might be and will
  cause trouble as people start to create novel uses for containers in the
  cloud if you try to impose this on our current infrastructure.
 
  One of the cgroup only container uses we see at Parallels (so no
  separate filesystem and no net namespaces) is pure apache load balancer
  type shared hosting.  In this scenario, base apache is effectively
  brought up in the host environment, but then spawned instances are
  resource limited using cgroups according to what the customer has paid.
  Obviously all apache instances are sharing /var and /run from the host
  (mostly for logging and pid storage and static pages).  The reason some
  hosters do this is that it allows much higher density simple web serving
  (either static pages from quota limited chroots or dynamic pages limited
  by database space constraints) because each instance shares so much
  from the host.  The service is obviously much more basic than giving
  each customer a container running apache, but it's much easier for the
  hoster to administer and it serves the customer just as well for a large
  cross section of use cases and for those it doesn't serve, the hoster
  usually has separate container hosting (for a higher price, of course).
 
   The container as we talk about it has its own init, and no, it cannot
   share /var or /run.
 
  This is what we would call an IaaS container: bringing up init and
  effectively a new OS inside a container is the closest containers come
  to being like hypervisors.  It's the most common use case of Parallels
  containers in the field, so I'm certainly not telling you it's a bad
  idea.
 
  The stuff you talk about has nothing to do with that, it's not
  different from all services or a multi-instantiated service on the
  host sharing the same /run and /var.
 
  I gave you one example: a really simplistic one.  A more sophisticated
  example is a PaaS or SaaS container where you bring the OS up in the
  host but spawn a particular application into its own container (this is
  essentially similar to what Docker does).  Often in this case, you do
  add separate mount and network namespaces to make the application
  isolated and migrateable with its own IP address.  The reason you share
  init and most of the OS from the host is for elasticity and density,
  which are fast becoming a holy grail type quest of cloud orchestration
  systems: if you don't have to bring up the OS from init and you can just
  start the application from a C/R image (orders of magnitude smaller than
  a full system image) and slap on the necessary namespaces as you clone
   it, you have something that comes online in milliseconds, which is a feat
  no hypervisor based virtualisation can match.
 
  I'm not saying don't pursue the IaaS case, it's definitely useful ...
  I'm just saying it would be a serious mistake to think that's the only
  use case for containers and we certainly shouldn't adjust Linux to serve
  only that use case.
 
 
   Weighing the feature you describe above against the container-reboot-host
   bug, I prefer to fix the bug.
  
  What bug?
  
   and this feature can be achieved even if the container unshares the /run
   directory from the host by default; for libvirt, the user can set the
   container configuration to make the container share the /run directory
   with the host.
 
   I would like to say, the reboot-from-container bug is more urgent and
   needs to be fixed.
  
  Are you talking about the old bug where trying to reboot an lxc
  container from within it would reboot the entire system? 
 
 Yes, we are discussing this problem in this whole thread.
 
   If so, OpenVZ
   has never suffered from that problem and I thought it was fixed
   upstream.  I've not tested the lxc tools, but the latest vzctl from the
   openvz website will bring up a container on the vanilla 3.9 kernel
   (provided you have USER_NS compiled in) and can also be used to reboot
   the container, so I see no reason it wouldn't work for lxc as well.
  
 
 I'm using libvirt lxc, not lxc-tools.
 Not all users enable user namespaces, I 

Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
 On Thu, 2012-10-25 at 20:30 -0500, Serge Hallyn wrote:
  Quoting Michael H. Warfield (m...@wittsend.com):
   On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
   
 I've got some more problems relating to shutting down containers, some
 of which may be related to mounting tmpfs on /run, to which /var/run is
 symlinked.  We're doing halt / restart detection by monitoring utmp
 in that directory, but it looks like utmp isn't even in that directory
 anymore, and mounting tmpfs on it was always problematic.  We may have
 to have a more generic method to detect when a container has shut down
 or is restarting in that case.
   
I can't parse this. The system call reboot() is virtualized for
containers just fine and the container manager (i.e. LXC) can check for
that easily.
   
    The problem we have had was with differentiating between reboot and halt,
    to either shut the container down cold or restart it.  You say
    easily, and yet we never came up with an easy solution and monitored
    utmp instead for the next runlevel change.  What is your easy solution
    for that problem?
 
  I think you're on older kernels, where we had to resort to that.  Pretty
  recently Daniel Lezcano's patch was finally accepted upstream, which lets
  a container call reboot() and lets the parent of init tell whether it
   called reboot or shutdown by looking at WTERMSIG(status).
 
 Now THAT is wonderful news!  I hadn't realized that had been accepted.
 So we no longer need to rely on the old utmp kludge?

Yup :)  It was very liberating, in terms of what containers can do with
mounting.
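
A rough sketch of the detection from the container manager's side, assuming
the caller is the parent of the container's init and that the kernel signals
that init with SIGHUP for a reboot() restart and SIGINT for halt/poweroff
(my reading of the upstream patch; worth verifying):

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Must be called by the parent of the container's init (pid 1 inside the
 * container, an ordinary pid outside). */
static void handle_container_exit(pid_t init_pid)
{
        int status;

        if (waitpid(init_pid, &status, 0) < 0) {
                perror("waitpid");
                return;
        }

        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGHUP)
                printf("container called reboot(): restart it\n");
        else if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT)
                printf("container called halt/poweroff: tear it down\n");
        else
                printf("container init exited (raw status 0x%x)\n", status);
}

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <container-init-pid>\n", argv[0]);
                return 1;
        }
        handle_container_exit((pid_t) atoi(argv[1]));
        return 0;
}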


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Serge Hallyn
Quoting Lennart Poettering (lenn...@poettering.net):
 On Thu, 25.10.12 14:02, Serge Hallyn (serge.hal...@canonical.com) wrote:
 
Ok...  I've done some cursory search and turned up nothing but some
comments about pre mount hooks.  Where is the documentation about this
feature and how I might use / implement it?  Some examples would
probably suffice.  Is there a required release version of lxc-utils?
   
   I think I found what I needed in the changelog here:
   
   http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg01490.html
   
   I'll play with it and report back.
  
  Also the Lifecycle management hooks section in
  https://help.ubuntu.com/12.10/serverguide/lxc.html
  
  Note that I'm thinking that having lxc-start guess how to fill in /dev
  is wrong, because different distros and even different releases of the
  same distros have different expectations.  For instance ubuntu lucid
  wants /dev/shm to be a directory, while precise+ wants a symlink.  So
  somehow the template should get involved, be it by adding a hook, or
  simply specifying a configuration file which lxc uses internally to
  decide how to create /dev.
 
 /dev/shm can be created/mounted/symlinked by the OS in the
 container. This is nothing LXC should care about.
 
 My recommendation for LXC would be to unconditionally pre-mount /dev as
 tmpfs, and add exactly the device nodes /dev/null, /dev/zero, /dev/full,
 /dev/urandom, /dev/random, /dev/tty, /dev/ptmx to it. That is the
 minimal set you need to boot a machine. All further
 submounts/symlinks/dirs can be created by the OS boot logic in the
 container.

I'm thinking we'll do that, optionally.  Templates (including fedora
and ubuntu) can simply always set the option to mount and fill /dev.
Others (like busybox and mini-sshd) won't.
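
A sketch of that minimal /dev population, roughly as a container manager
might do it before starting the container's init; the rootfs path is a
placeholder, the device numbers are the standard Linux ones, and console/pts
setup is left out:

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

struct node { const char *name; unsigned maj, min; };

static const struct node nodes[] = {
        { "null",    1, 3 },
        { "zero",    1, 5 },
        { "full",    1, 7 },
        { "random",  1, 8 },
        { "urandom", 1, 9 },
        { "tty",     5, 0 },
        { "ptmx",    5, 2 },
};

int main(void)
{
        const char *dev = "/var/lib/lxc/c1/rootfs/dev";    /* placeholder path */
        char path[256];
        size_t i;

        /* small tmpfs over whatever the rootfs ships in /dev */
        if (mount("tmpfs", dev, "tmpfs", MS_NOSUID, "mode=755,size=1M") < 0) {
                perror("mount");
                return 1;
        }

        for (i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
                snprintf(path, sizeof(path), "%s/%s", dev, nodes[i].name);
                if (mknod(path, S_IFCHR | 0666, makedev(nodes[i].maj, nodes[i].min)) < 0)
                        perror(path);
        }
        return 0;
}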

 That's what libvirt-lxc and nspawn do, and is what we defined in:
 
 http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface
 
 It would be good if LXC would do the same in order to minimize the
 manual user configuration necessary.
 
 Lennart

Agreed, it simplifies things for full system containers with modern distros.

thanks,
-serge


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
 Adding in the lxc-devel list.
 
 On Thu, 2012-10-25 at 22:59 -0400, Michael H. Warfield wrote:
  On Thu, 2012-10-25 at 15:42 -0400, Michael H. Warfield wrote:
   On Thu, 2012-10-25 at 14:02 -0500, Serge Hallyn wrote:
Quoting Michael H. Warfield (m...@wittsend.com):
 On Thu, 2012-10-25 at 13:23 -0400, Michael H. Warfield wrote:
  Hey Serge,
  
  On Thu, 2012-10-25 at 11:19 -0500, Serge Hallyn wrote:
 
 ...
 
   Oh, sorry - I take back that suggestion :)
  
   Note that we have mount hooks, so templates could install a mount 
   hook to
   mount a tmpfs onto /dev and populate it.
  
  Ok...  I've done some cursory search and turned up nothing but some
  comments about pre mount hooks.  Where is the documentation about 
  this
  feature and how I might use / implement it?  Some examples would
  probably suffice.  Is there a required release version of lxc-utils?
 
 I think I found what I needed in the changelog here:
 
 http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg01490.html
 
 I'll play with it and report back.
   
Also the Lifecycle management hooks section in
https://help.ubuntu.com/12.10/serverguide/lxc.html
   
   This isn't working...
   
   Based on what was in both of those articles, I added this entry to
   another container (Plover) to test...
   
   lxc.hook.mount = /var/lib/lxc/Plover/mount
   
   When I run lxc-start -n Plover, I see this:
   
   [root@forest ~]# lxc-start -n Plover
   lxc-start: unknow key lxc.hook.mount
   lxc-start: failed to read configuration file
   
   I'm running the latest rc...
   
   [root@forest ~]# rpm -qa | grep lxc
   lxc-0.8.0.rc2-1.fc16.x86_64
   lxc-libs-0.8.0.rc2-1.fc16.x86_64
   lxc-doc-0.8.0.rc2-1.fc16.x86_64
   
   Is it something in git that hasn't made it to a release yet?
 
  nm...  I see it.  It's in git and hasn't made it to a release.  I'm
  working on a git build to test now.  If this is something that solves
  some of this, we need to move things along here and get these things
  moved out.  According to git, 0.8.0rc2 was 7 months ago?  What are the
  showstoppers here?
 
 While the git repo says 7 months ago, the date stamp on the
 lxc-0.8.0-rc2 tarball is from July 10, so about 3-1/2 months ago.
 Sounds like we've accumulated some features (like the hooks) that we
 needed months ago to deal with this systemd debacle.  How
 close are we to either 0.8.0rc3 or 0.8.0?  Any blockers or are we just
 waiting on some more features?

Daniel has simply been too busy.  Stéphane has made a new branch which
cherrypicks 50 bugfixes for 0.8.0, with the remaining patches (about
twice as many) left for 0.9.0.  I'm hoping we get 0.8.0 next week :)


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
 Sorry for taking a few days to get back on this.  I was delivering a
 guest lecture up at Fordham University last Tuesday so I was out of
 pocket a couple of days or I would have responded sooner...
 
 On Mon, 2012-10-22 at 16:59 -0400, Michael H. Warfield wrote:
  On Mon, 2012-10-22 at 22:50 +0200, Lennart Poettering wrote:
   On Mon, 22.10.12 11:48, Michael H. Warfield (m...@wittsend.com) wrote:
   
  To summarize the problem...  The LXC startup binary sets up various
  things for /dev and /dev/pts for the container to run properly, and this
  works perfectly fine for SystemV start-up scripts and/or Upstart.
  Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
  on /dev/pts which then break things horribly.  This is because the
  kernel currently lacks namespaces for devices and won't for some time to
  come (in design).  When devtmpfs gets mounted over top of /dev in the
  container, it then hijacks the host's console tty and several other
  devices which had been set up through bind mounts by LXC and should have
  been LEFT ALONE.

 Please initialize a minimal tmpfs on /dev. systemd will then work 
 fine.

My containers have a reasonable /dev that works with Upstart just fine,
but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
that minimal /dev required?
 
   Well, it can be any kind of mount really. Just needs to be a mount. And
   the idea is to use tmpfs for this.
 
   What /dev are you currently using? It's probably not a good idea to
    reuse the host's /dev, since it contains so many device nodes that
   should not be accessible/visible to the container.
 
  Got it.  And that explains the problems we're seeing but also what I'm
  seeing in some libvirt-lxc related pages, which is a separate and
  distinct project in spite of the similarities in the name...
 
  http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
 
  Unfortunately, in our case, merely getting a mount in there is a
  complication in that it also has to be populated but, at least, we
  understand the problem set now.
 
 Ok...  Serge and I were corresponding on the lxc-users list and he had a
 suggestion that worked, but which I consider to be a bit of a sub-optimal
 workaround.  Ironically, it was to mount devtmpfs on /dev.  We don't

Oh, sorry - I take back that suggestion :)

Note that we have mount hooks, so templates could install a mount hook to
mount a tmpfs onto /dev and populate it.

Or, if everyone is going to need it, we could just add a 'lxc.populatedevs = 1'
option which does that without needing a hook.

devtmpfs should not be used in containers :)

-serge


Re: [systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Serge Hallyn
Quoting Michael H. Warfield (m...@wittsend.com):
 On Thu, 2012-10-25 at 13:23 -0400, Michael H. Warfield wrote:
  Hey Serge,
  
  On Thu, 2012-10-25 at 11:19 -0500, Serge Hallyn wrote:
 
 ...
 
   Oh, sorry - I take back that suggestion :)
  
   Note that we have mount hooks, so templates could install a mount hook to
   mount a tmpfs onto /dev and populate it.
  
  Ok...  I've done some cursory search and turned up nothing but some
  comments about pre mount hooks.  Where is the documentation about this
  feature and how I might use / implement it?  Some examples would
  probably suffice.  Is there a required release version of lxc-utils?
 
 I think I found what I needed in the changelog here:
 
 http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg01490.html
 
 I'll play with it and report back.

Also the Lifecycle management hooks section in
https://help.ubuntu.com/12.10/serverguide/lxc.html

Note that I'm thinking that having lxc-start guess how to fill in /dev
is wrong, because different distros and even different releases of the
same distros have different expectations.  For instance ubuntu lucid
wants /dev/shm to be a directory, while precise+ wants a symlink.  So
somehow the template should get involved, be it by adding a hook, or
simply specifying a configuration file which lxc uses internally to
decide how to create /dev.

Personally I'd prefer if /dev were always populated by the templates,
and containers (i.e. userspace) didn't mount a fresh tmpfs for /dev.
But that does complicate userspace, and we've seen it bite in debian/ubuntu
as well (e.g. certain package upgrades rely on /dev being cleared
after a reboot).

-serge


Re: [systemd-devel] [PATCH] SELINUX: add /sys/fs/selinux mount point to put selinuxfs

2011-05-11 Thread Serge Hallyn
Quoting Eric Paris (epa...@parisplace.org):
 On Wed, May 11, 2011 at 11:13 AM, Stephen Smalley s...@tycho.nsa.gov wrote:
  On Wed, 2011-05-11 at 10:58 -0400, Eric Paris wrote:
  On Wed, May 11, 2011 at 10:54 AM, John Johansen
 
    AppArmor, Tomoyo and IMA all create their own subdirectory under
    securityfs, so this should not be a problem
 
  I guess the question is, should SELinux try to move to /sys/fs/selinux
  or /sys/security/selinux.  The only minor issue I see with the latter
  is that it requires both sysfs and securityfs to be mounted before you
  can mount selinuxfs, whereas the first only requires sysfs.  Stephen,
  Casey, either of you have thoughts on the matter?
 
  Unless we plan to re-implement selinuxfs as securityfs nodes, I don't
  see why we'd move to /sys/security/selinux; we don't presently depend on
  securityfs and it isn't commonly mounted today.  selinuxfs has some
  specialized functionality that may not be trivial to reimplement via
  securityfs, and there was concern about userspace compatibility breakage
  when last we considered using securityfs.
 
 The reason we would move to /sys/security/ instead of /sys/fs/ is
 because other LSMs are already there and it would look consistent.

Actually I think it'd be deceptive precisely because (aiui) /sys/security
is for securityfs, while /sys/fs is for virtual filesystems.

I suppose we could whip this issue by having /sys/security be under
/sys/fs/security :)  Too late for that too.

-serge