Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-25 Thread Serge E. Hallyn
Quoting Stefan Hajnoczi (stefa...@gmail.com):
> > [...]
> > This is rather tedious but could be just about doable, but it gets
> > harder when you throw in things like sysfs and PCI device assignment.
> > eg a guest with PCI device assigned gets given ownership of the files
> > in /sys/bus/pci/devices/:00:XX:XX/ and since there is no UID
> > namespacing, this will be accessible to any other container with the
> > same UID. To hack around this when starting up a container you would
> > probably have to bind mount an empty tmpfs over the top of all the
> > PCI device paths you wanted to block in sysfs.

Which of course is easily undoable by 

Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-25 Thread Stefan Hajnoczi
On Thu, Aug 25, 2011 at 11:03 AM, Daniel P. Berrange
 wrote:
> [...]
> The filesystem UID/GID ownership is the most likely way you can escape
> the confinement. You would have to be very careful to ensure that each
> container's view of the filesystem did not include any directories
> with files that are assigned to another container, since the UID
> separation would not prevent access to another container's resources.
>
> This is rather tedious but could be just about doable, but it gets
> harder when you throw in things like sysfs and PCI device assignment.
> eg a guest with PCI device assigned gets given ownership of the files
> in /sys/bus/pci/devices/:00:XX:XX/ and since there is no UID
> namespacing, this will be accessible to any other container with the
> same UID. To hack around this when starting up a container you would
> probably have to bind mount an empty tmpfs over the top of all the
> PCI device paths you wanted to block in sysfs.

Ah, I hadn't thought of /sys/bus/pci or /sys/bus/usb!

Thanks for the explanation and it does seem like the design would get messy.

Stefan



Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-25 Thread Daniel P. Berrange
On Thu, Aug 25, 2011 at 10:10:27AM +0100, Stefan Hajnoczi wrote:
> [...]
> But is there a way to escape confinement?  If not, then this is secure.

The filesystem UID/GID ownership is the most likely way you can escape
the confinement. You would have to be very careful to ensure that each
container's view of the filesystem did not include any directories
with files that are assigned to another container, since the UID
separation would not prevent access to another container's resources. 

This is rather tedious but could be just about doable, but it gets
harder when you throw in things like sysfs and PCI device assignment.
eg a guest with PCI device assigned gets given ownership of the files
in /sys/bus/pci/devices/:00:XX:XX/ and since there is no UID
namespacing, this will be accessible to any other container with the
same UID. To hack around this when starting up a container you would
probably have to bind mount an empty tmpfs over the top of all the
PCI device paths you wanted to block in sysfs.

Obviously you can get around this by running each guest as a different
user ID, but this is one of the things we wanted to avoid by using
containers & it ought not to be needed if containers were actually
secure.
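
A minimal sketch of the masking hack described above: before starting the
container, mount an empty read-only tmpfs over a sysfs device directory so
the container cannot see the files underneath. The device path is
illustrative only:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Illustrative path; a real tool would walk every sysfs
           directory belonging to the assigned device. */
        const char *path = "/sys/bus/pci/devices/0000:00:1f.2";

        /* Cover the directory with an empty read-only tmpfs. */
        if (mount("none", path, "tmpfs", MS_RDONLY, "mode=555") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }

Note the fragility: a container that keeps CAP_SYS_ADMIN can simply
umount() the cover again, which is the point Serge raises elsewhere in
this thread.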

Daniel

Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-25 Thread Stefan Hajnoczi
On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange  wrote:
> [...]
> A number of reasons really...
>
> If user ID '0' on the host starts a container, and a process inside
> the container does 'setuid(500)', then any user outside the container
> with UID 500 will be able to kill that process. Only user ID '0' should
> have been allowed to do that.
>
> It will also let non-root user IDs on the host OS start containers
> and have root uid=0 inside the container.
>
> Finally, any files created inside the container with, say, uid 500
> will be accessible by any other process with UID 500, in either the
> host or any other container.

These points mean that the host can peek inside containers and has
> access to their processes/files.  But from the point of view of a libvirt
running inside a container there is no security problem.

This is kind of like saying that root on the host can modify KVM guest
disk images.  That is true but I don't see it as a security problem
because the root on the host is the trusted part of the system.

>> I think it matters when giving multiple containers access to the same
>> file system.  Is that what you'd like to do for libvirt?
>
> Each container would have to share a (readonly) view onto the host
> filesystem so it can see the QEMU emulator install / libraries. There
> would also have to be some writable areas per QEMU container.  QEMU
> inside the container would be set to run as some non-root UID (from
> the container's POV). So both problems 1 & 3 above would impact the
> security of this confinement.

But is there a way to escape confinement?  If not, then this is secure.

Stefan



Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-24 Thread Daniel P. Berrange
On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
> [...]
> 
> Thanks, that is interesting.  I still don't understand why that is a
> problem.  Linux containers (lxc) use a different pid namespace (no
> ptrace worries), a file system root restricted to a subdirectory tree,
> forbid most device nodes, etc.  Why does the user namespace matter
> for security in this case?

A number of reasons really...

If user ID '0' on the host starts a container, and a process inside
the container does 'setuid(500)', then any user outside the container
with UID 500 will be able to kill that process. Only user ID '0' should
have been allowed to do that.

It will also let non-root user IDs on the host OS start containers
and have root uid=0 inside the container.

Finally, any files created inside the container with, say, uid 500
will be accessible by any other process with UID 500, in either the
host or any other container.
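
A minimal sketch of the first point, assuming a host shell running as UID
500 and the host-visible pid of a container process that called
setuid(500); the kernel's signal permission check compares only the
numeric UIDs, so the kill goes through despite the pid namespace boundary:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <container-pid>\n", argv[0]);
            return 1;
        }
        pid_t pid = atoi(argv[1]);

        /* Succeeds whenever our euid matches the target's uid, even
           though the target lives in a separate pid namespace. */
        if (kill(pid, SIGKILL) < 0) {
            perror("kill");
            return 1;
        }
        printf("killed %d from outside the container\n", (int)pid);
        return 0;
    }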

> I think it matters when giving multiple containers access to the same
> file system.  Is that what you'd like to do for libvirt?

Each container would have to share a (readonly) view onto the host
filesystem so it can see the QEMU emulator install / libraries. There
would also have to be some writable areas per QEMU container.  QEMU
inside the container would be set to run as some non-root UID (from
the container's POV). So both problems 1 & 3 above would impact the
security of this confinement.

Daniel

Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-24 Thread Stefan Hajnoczi
On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange  wrote:
>> [...]
>> Can you elaborate on why Linux containers are "not nearly as secure"
>> [as Solaris Zones]?
>
> Mostly because the Linux namespace functionality is far from complete,
> notably lacking proper UID/GID/capability separation, and UID/GID
> virtualization wrt filesystems. The longer answer is here:
>
>   https://wiki.ubuntu.com/UserNamespace
>
> So at this time you can't build a secure container on Linux, relying
> just on DAC alone. You have to add in a MAC layer on top of the container
> to get full security benefits, which obviously defeats the point of
> using the container as a backup for failure in the MAC layer.

Thanks, that is interesting.  I still don't understand why that is a
problem.  Linux containers (lxc) use a different pid namespace (no
ptrace worries), a file system root restricted to a subdirectory tree,
forbid most device nodes, etc.  Why does the user namespace matter
for security in this case?

I think it matters when giving multiple containers access to the same
file system.  Is that what you'd like to do for libvirt?
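
For reference, a minimal sketch of the namespace primitives lxc builds on:
a child cloned into fresh pid and mount namespaces. This is only a sketch;
a real container also sets up the network namespace, the chroot, cgroups,
device node filtering, and so on. It needs root (CAP_SYS_ADMIN) on
kernels of this era:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char stack[1024 * 1024];

    static int child(void *arg)
    {
        /* In the new pid namespace this process is pid 1; a real
           container would chroot() and drop capabilities here. */
        printf("inside: pid=%d\n", (int)getpid());
        return 0;
    }

    int main(void)
    {
        /* CLONE_NEWPID: private pid namespace; CLONE_NEWNS: private
           mount table. The child's stack grows down, so pass its top. */
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        printf("outside: child is pid %d\n", (int)pid);
        return waitpid(pid, NULL, 0) < 0;
    }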

Stefan



Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-23 Thread Daniel P. Berrange
On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange wrote:
> > [...]
> >  - Sandbox/container KVM.  The Solaris port of KVM puts QEMU inside
> >   a zone so that an exploit of QEMU can't escape into the full OS.
> >   Containers are Linux's parallel of Zones, and while not nearly as
> >   secure yet, it would still be worth using more containers support
> >   to confine QEMU.
> 
> Can you elaborate on why Linux containers are "not nearly as secure"
> [as Solaris Zones]?

Mostly because the Linux namespace functionality is far from complete,
notably lacking proper UID/GID/capability separation, and UID/GID
virtualization wrt filesystems. The longer answer is here:

   https://wiki.ubuntu.com/UserNamespace

So at this time you can't build a secure container on Linux, relying
> just on DAC alone. You have to add in a MAC layer on top of the container
to get full security benefits, which obviously defeats the point of
using the container as a backup for failure in the MAC layer.
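
In practice the "MAC layer on top" here is sVirt: libvirt runs each QEMU
process under its own SELinux context so a compromised QEMU is confined
even if DAC fails. A hedged sketch of the core mechanism via libselinux
(link with -lselinux; the domain and category pair are illustrative, not
libvirt's actual labelling code):

    #include <stdio.h>
    #include <unistd.h>
    #include <selinux/selinux.h>

    int main(void)
    {
        /* Ask the kernel to run the next exec'd program under a
           confined svirt domain with a per-guest category pair. */
        if (setexeccon("system_u:system_r:svirt_t:s0:c392,c662") < 0) {
            perror("setexeccon");
            return 1;
        }
        execlp("qemu-kvm", "qemu-kvm", "-m", "512", (char *)NULL);
        perror("execlp");
        return 1;
    }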

> >  - Native KVM tool. [...] They did have some fun
> >   demos of booting using the host OS filesystem though. We can
> >   actually do the same with regular KVM/libvirt but there's no nice
> >   demo tool to show it off. I'm hoping to create one
> 
> Yep it's virtfs which QEMU has supported for a while.  The trick is
> setting things up so that the Linux guest boots from virtfs.

It isn't actually that hard from a technical POV, it is just that most
(all?) distros' typical initrd files lack support for specifying 9p over
virtio as a root filesystem.
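
For reference, the recipe looks roughly like this on the QEMU side, with a
matching kernel command line (assuming the guest kernel has 9p and the
virtio transport built in; the exported path is illustrative):

    qemu-kvm -m 1024 -kernel vmlinuz \
        -virtfs local,path=/srv/guestroot,mount_tag=/dev/root,security_model=passthrough \
        -append "root=/dev/root rootfstype=9p rootflags=trans=virtio"

The missing piece Daniel describes is distro initrds knowing how to do the
equivalent mount themselves.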

Daniel

Re: [libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-23 Thread Stefan Hajnoczi
On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
 wrote:
> [...]
>  - Sandbox/container KVM.  The Solaris port of KVM puts QEMU inside
>   a zone so that an exploit of QEMU can't escape into the full OS.
>   Containers are Linux's parallel of Zones, and while not nearly as
>   secure yet, it would still be worth using more containers support
>   to confine QEMU.

Can you elaborate on why Linux containers are "not nearly as secure"
[as Solaris Zones]?

Containers are just another attempt at isolating the QEMU process.
SELinux works differently but can also do many of the same things.  I
like containers more because they are simpler than labelling
everything.

>  - Native KVM tool. [...] They did have some fun
>   demos of booting using the host OS filesystem though. We can
>   actually do the same with regular KVM/libvirt but there's no nice
>   demo tool to show it off. I'm hoping to create one

Yep it's virtfs which QEMU has supported for a while.  The trick is
setting things up so that the Linux guest boots from virtfs.

Stefan



[libvirt] Notes from the KVM Forum relevant to libvirt

2011-08-23 Thread Daniel P. Berrange
I was at the KVM Forum / LinuxCon last week and there were many
interesting things discussed which are relevant to ongoing libvirt
development. Here was the list that caught my attention. If I have
missed any, fill in the gaps

 - Sandbox/container KVM.  The Solaris port of KVM puts QEMU inside
   a zone so that an exploit of QEMU can't escape into the full OS.
   Containers are Linux's parallel of Zones, and while not nearly as
   secure yet, it would still be worth using more containers support
   to confine QEMU.

 - Events for object changes. We already have async events for virDomainPtr.
   We need the same for virInterfacePtr, virStoragePoolPtr, virStorageVolPtr
   and virNodeDevPtr, so that at the very least applications can be notified
   when objects are created or removed. For virNodeDevPtr we also want to
    be notified when properties change (i.e. CD-ROM media change).
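
    For comparison, the existing virDomainPtr lifecycle event API that
    the new object events would presumably mirror (error handling
    elided):

        #include <stdio.h>
        #include <libvirt/libvirt.h>

        static int lifecycle_cb(virConnectPtr conn, virDomainPtr dom,
                                int event, int detail, void *opaque)
        {
            printf("domain %s: lifecycle event %d (detail %d)\n",
                   virDomainGetName(dom), event, detail);
            return 0;
        }

        int main(void)
        {
            virEventRegisterDefaultImpl();
            virConnectPtr conn = virConnectOpen("qemu:///system");
            if (!conn)
                return 1;
            virConnectDomainEventRegisterAny(conn, NULL /* any domain */,
                VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                VIR_DOMAIN_EVENT_CALLBACK(lifecycle_cb), NULL, NULL);
            while (virEventRunDefaultImpl() == 0)
                ;  /* dispatch events */
            return 0;
        }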

 - CGroups passthrough. There is a lot of experimentation with cgroups. We
   don't want to expose cgroups as a direct concept in the libvirt API,
   but we should consider putting a generic cgroups get/set in the
   libvirt-qemu.so library, or create a libvirt-linux.so library.
    Also likely add an XML element to store arbitrary
    tunables in the XML. Same (low) level of support as with qemu:XXX
    of course.

 - CPUSet for changing CPU + Memory NUMA pinning. The CPUset cgroups
   controller is able to actually move a guest's memory between NUMA
   nodes. We can already change VCPU pinning, but we need a new API
   to do node pinning of the whole VM, so we can ensure the I/O threads
   are also moved. We also need an API to move the memory pinning to
   new nodes.
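
    A hedged sketch of what such a node-pinning API would do under the
    hood with the cpuset controller; the cgroup mount point and group
    name are illustrative. With memory_migrate enabled, the kernel
    moves already-allocated pages when cpuset.mems changes:

        #include <stdio.h>

        /* Write one value into a cgroup control file; 0 on success. */
        static int cg_set(const char *path, const char *value)
        {
            FILE *f = fopen(path, "w");
            if (!f)
                return -1;
            fprintf(f, "%s\n", value);
            return fclose(f);
        }

        int main(void)
        {
            const char *g = "/cgroup/cpuset/libvirt/qemu/guest1";
            char path[256];

            snprintf(path, sizeof(path), "%s/cpuset.memory_migrate", g);
            cg_set(path, "1");    /* migrate pages on mems changes */
            snprintf(path, sizeof(path), "%s/cpuset.mems", g);
            cg_set(path, "1");    /* memory from NUMA node 1 */
            snprintf(path, sizeof(path), "%s/cpuset.cpus", g);
            cg_set(path, "8-15"); /* CPUs local to that node */
            return 0;
        }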

 - Guest NUMA topology. If we have guests with RAM size > node size,
   we need to expose a NUMA topology into the guest. The CPU/memory
   pinning APIs will also need to be able to pin individual guest
   NUMA nodes to individual host NUMA nodes.

 - AHCI controller. IDE is going the way of the dodo. We need to add
   support for QEMU's new AHCI controller. This is quite simple, we
   already have a 'sata' disk type we can wire up to QEMU

 - VFIO PCI passthru. The current PCI assignment code may well be
   changed to use something called 'VFIO'. This will need some
   work in libvirt to support new CLI arg syntax, and probably
   some SELinux work

 - QCow3. There will soon be a QCow3 format. We need to add code to
   detect it and extract backing stores, etc. Trivial since the primary
   header format will still be the same as QCow2.
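
    Detection really is that trivial: the header keeps the same magic
    followed by a big-endian version field, so a sketch of the sniffing
    code is an eight-byte read:

        #include <arpa/inet.h>  /* ntohl */
        #include <stdint.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            if (argc != 2)
                return 1;
            FILE *f = fopen(argv[1], "rb");
            uint32_t hdr[2];  /* magic, version (both big-endian) */
            if (!f || fread(hdr, sizeof(hdr), 1, f) != 1)
                return 1;
            if (ntohl(hdr[0]) == 0x514649fbU)  /* "QFI\xfb" */
                printf("qcow image, version %u\n", ntohl(hdr[1]));
            else
                printf("not a qcow image\n");
            fclose(f);
            return 0;
        }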

 - QMP completion. Given Anthony's plan for a complete replacement of
   the current CLI + monitor syntax in QEMU 2.0 (long way out), he has
   dropped objections to adding new commands to QMP in the near future.
   So all existing HMP commands will immediately be made available in
   QMP with no attempt to re-design them now. So the need for the HMP
   passthrough command will soon go away.

 - Migration + VEPA/VNLink failures. As raised previously on this list,
   Cisco really wants libvirt to have the ability to do migration, and
   optionally *not* fail, even if the VEPA/VNLink setup fails. This will
   require an event notification to the app if a failure of a device
   backend occurs, and an API to let the admin app fix the device backend
   (virDomainUpdateDevice) and some way to tell migration what bits are
   allowed to fail.

 - Virtio SCSI. We need to support this new stuff in QEMU when it is
   eventually implemented. It will mean we avoid the PCI slot usage
    problems inherent in virtio-blk, and get other things like multipath
   and decent SCSI passthrough support.

 - USB 2.0. We need to support this in libvirt asap. It is very important
    for desktop experience and to support better integration with SPICE.
    This also gets us proper USB port addressing. Fun footnote, QEMU USB
    has *never* supported migration. The USB tablet only works by sheer
    luck, as OSes see the device disappear on migration & come back with
    a different device ID/port addr, and so do a re-initialize!

 - Native KVM tool. The problem statement was that the QEMU code is too
    big/complex & the command line args are too complex, so let's rewrite
    from scratch to make the code small & CLI simple. They achieve this,
   but of course primarily because they lack so many features compared
   to QEMU. They had libvirt support as a bullet point on their preso,
   but I'm not expecting it to replace the current QEMU KVM support in
    the foreseeable future, given its current level of features and the
   size of its dev team compared to QEMU/KVM. They did have some fun
   demos of booting using the host OS filesystem though. We can
   actually do the same with regular KVM/libvirt but there's no nice
   demo tool to show it off. I'm hoping to create one

 - Shared memory devices. Some people doing high performance work are
    using the QEMU shared memory device. We don't support this (ivshmem
    device) in libvirt yet. Fairly niche use cases but m