Re: [Openstack] Libvirt LXC with volume-attach broken ?

2012-07-06 Thread Serge Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
 Daniel P. Berrange berra...@redhat.com writes:
 
  On Thu, Jul 05, 2012 at 06:49:06PM -0700, Eric W. Biederman wrote:
  Serge Hallyn serge.hal...@canonical.com writes:
  
   Quoting Daniel P. Berrange (berra...@redhat.com):
   On Thu, Jul 05, 2012 at 03:00:26PM +0100, Daniel P. Berrange wrote:
Now, when using 'nova volume-attach':

  # nova volume-attach 05eb16df-03b8-451b-85c1-b838a8757736 a5ad1d37-aed0-4bf6-8c6e-c28543cd38ac /dev/sdf

nova will import an iSCSI LUN from the nova volume service, on the compute
node. The kernel will assign it the next free SCSI drive letter, in my
case '/dev/sdc'.

The libvirt nova driver will then do a mknod, using the volume name
passed to 'nova volume-attach'.
eg it will do

  mknod  /var/lib/nova/instances/instance-000e/rootfs/dev/sdf
   
   Oops, I'm slightly wrong here. What it actually does is
   
     mount --bind /dev/sdc /var/lib/nova/instances/instance-000e/rootfs/dev/sdf
   
   so you get a 'sdf' device, but with the major/minor number of the 'sdc'
   device. I can't say I particularly like this approach. Ultimately I
   think we need the kernel support to make this work correctly. In any
  
   Yes, that's what the 'devices namespace' is meant to address.  I'm hoping
   we can have some serious design discussion on that in the next few months.
  
  This is not the device namespace problem.
  
  This is the setns problem for mount namespaces, and the unprivileged
  mount problem.
  
  There may be a notification issue so user space can perform actions
  in a container when a device shows up.
  
  But it should be very possible on the host to call:
  setns(containers_mount_namespace);
  mknod(/dev/foo);
  chown(/dev/foo, CONTAINER_ROOT_UID, CONTAINER_ROOT_GID);
  
  And then from inside the container especially when I get the rest of
  the user namespace merged it should be very possible to manipulate
  the block device because you have permission, and to mount the
  partitions of the block device, because you are root in your container.
  
  But until the user namespace is merged you really are root so you can
  mount whatever.
  
  Daniel does that sound like the support you are looking for?
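
[Editorial sketch] The setns/mknod/chown sequence quoted above could look
roughly like the following in Python. This is only an illustration under
stated assumptions: the helper names (mnt_ns_path, block_device_numbers,
attach_device) are hypothetical, the caller is assumed to be root on the
host, and the container is identified by the pid of one of its processes.
The setns() call is reached through ctypes, so it assumes a Linux libc that
exports setns.

```python
import ctypes
import os
import stat

CLONE_NEWNS = 0x00020000  # mount-namespace flag, value from <sched.h>


def mnt_ns_path(pid):
    """Path of the mount-namespace handle of a process (hypothetical helper)."""
    return f"/proc/{pid}/ns/mnt"


def block_device_numbers(path):
    """Return (major, minor) of a block device node, or raise ValueError."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError(f"{path} is not a block device")
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def attach_device(container_pid, host_dev, container_dev, uid, gid):
    """Create container_dev inside the container's mount namespace.

    Assumes root on the host. Returns True if the child succeeded.
    """
    major, minor = block_device_numbers(host_dev)
    child = os.fork()
    if child == 0:
        # Do the work in a forked child so the caller stays in the
        # host's mount namespace.
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            fd = os.open(mnt_ns_path(container_pid), os.O_RDONLY)
            if libc.setns(fd, CLONE_NEWNS) != 0:
                raise OSError(ctypes.get_errno(), "setns failed")
            # The node keeps the host's major:minor; only its name differs.
            os.mknod(container_dev, stat.S_IFBLK | 0o660,
                     os.makedev(major, minor))
            os.chown(container_dev, uid, gid)
            os._exit(0)
        except Exception:
            os._exit(1)
    _, status = os.waitpid(child, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

A call such as attach_device(pid, "/dev/sdc", "/dev/sdf", 0, 0) would give
the container a /dev/sdf node that still carries the major:minor numbers of
the host's sdc, which is the remaining half of the problem discussed below.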
 
  Yes, the setns(mnt) approach you describe above is exactly what I'd
  like to be able to do, to solve the first half of the problem.
 
  The other part of the problem is that I have a /dev/sdf, or even a
  /dev/volgroup00/logvol3 in the host (with whatever major:minor
  number that implies), and I want to be able to make it always
  appear as /dev/sda in the container (with the correspondingly
  different major:minor number).  I'm guessing this is what Serge
  was referring to as the 'device' namespace problem.

Right.

 Getting the device to always appear with the name /dev/sda is easy.

It's easy to log in and make it look that way.  It's not easy to
make all distros see it that way across boot.

 Where does the need to have a specific device come from?  I would have
 thought by now that hotplug had been around long enough that in general
 user space would not care.

Yes the *primary* need for the devices namespace is to prevent udev
storm in the host and send uevents to the right place, and macvtap
and loop devices.

 The only case that I know of where keeping the same device number seems
 reasonable is in the case of live migrating an application, in order to
 avoid issues with stat changing for the same file over the transition,
 and I think a synthesized hotplug event could probably handle that case.
 
 Is there another case besides buggy applications that have hard
 coded device numbers that need specific device numbers?

Other cases where specific device maj-min numbers are important
are things like makedev.  There is lots of software, and especially
automatic update software, which insists that things have specific
'correct' maj-minor numbers.

FWIW my (presumably naive) view is that for each non-init devicens
we'd have a list of

type-major:minor::type2-major2:minor2

(:: meaning maps-to).  Then if a uevent comes through not aimed at
any type2-major2:minor2 valid in the namespace, that ns doesn't get
the uevent.

-serge
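
[Editorial sketch] The mapping idea in the message above (a per-namespace
list of host device ids mapped to in-namespace device ids, used to decide
which uevents a namespace sees) can be illustrated with a small Python
model. Nothing here is kernel code: the DeviceNamespace class, the dict
layout of a uevent, and the field names are all assumptions made for the
example.

```python
from typing import Dict, Optional, Tuple

# (type, major, minor), where type is "b" (block) or "c" (char).
DevId = Tuple[str, int, int]


class DeviceNamespace:
    """Hypothetical per-namespace table: host device id -> in-namespace id."""

    def __init__(self, mapping: Dict[DevId, DevId]):
        self._host_to_ns = dict(mapping)

    def translate(self, host_dev: DevId) -> Optional[DevId]:
        """Return the in-namespace id, or None if the device is not mapped."""
        return self._host_to_ns.get(host_dev)

    def filter_uevent(self, uevent: dict) -> Optional[dict]:
        """Rewrite a host uevent for this namespace, or drop it (None)."""
        host_dev = (uevent["type"], uevent["major"], uevent["minor"])
        ns_dev = self.translate(host_dev)
        if ns_dev is None:
            return None  # not aimed at a device valid in this namespace
        rewritten = dict(uevent)
        rewritten["type"], rewritten["major"], rewritten["minor"] = ns_dev
        return rewritten


# Host sdc (block 8:32) appears in the container as sda (block 8:0).
ns = DeviceNamespace({("b", 8, 32): ("b", 8, 0)})
event = {"action": "add", "type": "b", "major": 8, "minor": 32}
print(ns.filter_uevent(event))
```

A uevent for any device missing from the table is simply not delivered to
that namespace, which is the filtering behaviour described above.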

___
Mailing list: https://launchpad.net/~openstack
Post to : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp


Re: [Openstack] Libvirt LXC with volume-attach broken ?

2012-07-05 Thread Serge Hallyn
Quoting Daniel P. Berrange (berra...@redhat.com):
 On Thu, Jul 05, 2012 at 03:00:26PM +0100, Daniel P. Berrange wrote:
  Now, when using 'nova volume-attach':
  
    # nova volume-attach 05eb16df-03b8-451b-85c1-b838a8757736 a5ad1d37-aed0-4bf6-8c6e-c28543cd38ac /dev/sdf
  
  nova will import an iSCSI LUN from the nova volume service, on the compute
  node. The kernel will assign it the next free SCSI drive letter, in my
  case '/dev/sdc'.
  
  The libvirt nova driver will then do a mknod, using the volume name
  passed to 'nova volume-attach'.
  eg it will do
  
mknod  /var/lib/nova/instances/instance-000e/rootfs/dev/sdf
 
 Oops, I'm slightly wrong here. What it actually does is
 
   mount --bind /dev/sdc /var/lib/nova/instances/instance-000e/rootfs/dev/sdf
 
 so you get a 'sdf' device, but with the major/minor number of the 'sdc'
 device. I can't say I particularly like this approach. Ultimately I
 think we need the kernel support to make this work correctly. In any

Yes, that's what the 'devices namespace' is meant to address.  I'm hoping
we can have some serious design discussion on that in the next few months.

-serge



Re: [Openstack] [libvirt] [RFC PATCH] lxc: don't return error on GetInfo when cgroups not yet set up

2011-09-30 Thread Serge Hallyn
Quoting Serge E. Hallyn (serge.hal...@canonical.com):
 Quoting Daniel P. Berrange (berra...@redhat.com):
  On Wed, Sep 28, 2011 at 02:14:52PM -0500, Serge E. Hallyn wrote:
   Nova (openstack) calls libvirt to create a container, then
   periodically checks using GetInfo to see whether the container
   is up.  If it does this too quickly, then libvirt returns an
   error, which in libvirt.py causes an exception to be raised,
   the same type as if the container was bad.
  lxcDomainGetInfo(), holds a mutex on 'dom' for the duration of
  its execution. It checks for virDomainObjIsActive() before
  trying to use the cgroups.
 
 Yes, it does, but
 
  lxcDomainStart(), holds the mutex on 'dom' for the duration of
  its execution, and does not return until the container is running
  and cgroups are present.
 
 No.  It calls the lxc_controller with --background.  The controller
 main task in turn exits before the cgroups have been set up.  There
 is the race.

So what is the right fix here?  Should the controller write out another
file when it is past the part which should be locked, and the driver
waits for that file to exist before it drops the driver mutex?  If we
do that, do we risk having the driver hang when the controller has
hung?

-serge



Re: [Openstack] [libvirt] [RFC PATCH] lxc: don't return error on GetInfo when cgroups not yet set up

2011-09-30 Thread Serge Hallyn
Quoting Daniel P. Berrange (berra...@redhat.com):
 On Thu, Sep 29, 2011 at 10:12:17PM -0500, Serge E. Hallyn wrote:
  Quoting Daniel P. Berrange (berra...@redhat.com):
   On Wed, Sep 28, 2011 at 02:14:52PM -0500, Serge E. Hallyn wrote:
Nova (openstack) calls libvirt to create a container, then
periodically checks using GetInfo to see whether the container
is up.  If it does this too quickly, then libvirt returns an
error, which in libvirt.py causes an exception to be raised,
the same type as if the container was bad.
   lxcDomainGetInfo(), holds a mutex on 'dom' for the duration of
   its execution. It checks for virDomainObjIsActive() before
   trying to use the cgroups.
  
  Yes, it does, but
  
   lxcDomainStart(), holds the mutex on 'dom' for the duration of
   its execution, and does not return until the container is running
   and cgroups are present.
  
  No.  It calls the lxc_controller with --background.  The controller
  main task in turn exits before the cgroups have been set up.  There
  is the race.
 
 The lxcDomainStart() method isn't actually waiting on the child
 pid directly, so the --background flag ought not to matter. We
 have a pipe that we pass into the controller, which we wait on
 for a notification after running the process. The controller
 does not notify the 'handshake' FD until after cgroups have
 been set up, unless I'm misinterpreting our code.

That's the call to lxcContainerWaitForContinue(), right?  If so, that's
done by lxcContainerChild(), which is called by the lxc_controller.
AFAICS there is nothing in the lxc_driver which will wait on that
before dropping the driver-lock mutex.

-serge
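
[Editorial sketch] The handshake discussed above (the starter waits on a
pipe until the child signals that setup, e.g. cgroup creation, is complete)
could be modelled as follows. This is not libvirt code: wait_for_handshake
and start_and_wait are hypothetical names, and the timeout is an added
assumption so that the waiting side cannot block forever if the child
hangs.

```python
import os
import select


def wait_for_handshake(read_fd, timeout):
    """Wait until the peer writes one byte to the pipe, or give up.

    Returns True on notification, False on timeout or if the peer
    closed the pipe without notifying.
    """
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        return False  # timed out: the peer may be hung
    return os.read(read_fd, 1) != b""  # b"" means EOF without a notification


def start_and_wait(setup, timeout=5.0):
    """Fork a child that runs setup() and then signals readiness.

    Returns (ready, child_pid); the caller is expected to reap the child.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        try:
            setup()                   # e.g. create cgroups
            os.write(write_fd, b"1")  # notify only after setup is done
            os._exit(0)
        except Exception:
            os._exit(1)
    os.close(write_fd)
    try:
        return wait_for_handshake(read_fd, timeout), pid
    finally:
        os.close(read_fd)
```

The point of the design is that the lock protecting the domain would only
be dropped after wait_for_handshake() returns, so a concurrent GetInfo call
cannot observe a container whose cgroups do not exist yet.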



Re: [Openstack] detecting errors when determining libvirt vm power state

2011-09-28 Thread Serge Hallyn
Quoting Serge Hallyn (serge.hal...@canonical.com):
 Hi,
 
 I'm looking at what first manifested as a bug when launching multiple
 lxc containers simultaneously, i.e. 'euca-run-instances -n 4', as
 reported at https://bugs.launchpad.net/ubuntu/+source/nova/+bug/842845.
 
 The problem appears to be that nova uses self.driver.get_info().  Libvirt
 can raise exceptions on this for several reasons - the vm could be bad or
 not exist, or it could be in a transient state i.e. cgroups are not set
 up yet.
 
 What is the right way to handle this?  Should the drivers categorize
 their exceptions into either 'broken' or 'transient' ones, so that
 nova can detect the former and bail, and retry on the latter?

Now that I've sent that, I guess it seems pretty clear that the
lxc getinfo helper should understand that -ENOENT from getcgroup
means it's not yet ready, and set the values to 0 as it does if
the domain is not running.

-serge
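
[Editorial sketch] The behaviour proposed above (treat ENOENT from the
cgroup lookup as "not ready yet" and report zeros, as for a domain that is
not running) might look like this in Python. The get_info and
read_cgroup_value helpers and the layout of the returned dict are
assumptions for illustration; the file names are the cgroup v1 cpuacct and
memory controller files.

```python
import os


def read_cgroup_value(cgroup_dir, name):
    """Read one integer value from a cgroup control file."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())


def get_info(cgroup_dir, running):
    """Return usage info; zeros if not running or cgroups not yet set up."""
    info = {"running": running, "cpu_time": 0, "memory": 0}
    if not running:
        return info
    try:
        cpu_time = read_cgroup_value(cgroup_dir, "cpuacct.usage")
        memory = read_cgroup_value(cgroup_dir, "memory.usage_in_bytes")
    except FileNotFoundError:
        # ENOENT: the cgroup does not exist yet, so this is a transient
        # state rather than an error. Keep the zeroed values.
        return info
    info["cpu_time"] = cpu_time
    info["memory"] = memory
    return info
```

Any other failure (for example a permission error) still propagates to the
caller, so only the "not yet set up" case is silently mapped to zeros.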



[Openstack] detecting errors when determining libvirt vm power state

2011-09-28 Thread Serge Hallyn
Hi,

I'm looking at what first manifested as a bug when launching multiple
lxc containers simultaneously, i.e. 'euca-run-instances -n 4', as
reported at https://bugs.launchpad.net/ubuntu/+source/nova/+bug/842845.

The problem appears to be that nova uses self.driver.get_info().  Libvirt
can raise exceptions on this for several reasons - the vm could be bad or
not exist, or it could be in a transient state i.e. cgroups are not set
up yet.

What is the right way to handle this?  Should the drivers categorize
their exceptions into either 'broken' or 'transient' ones, so that
nova can detect the former and bail, and retry on the latter?

Note that while the bug was raised for lxc, I suspect the same should
be possible with kvm ones.  However the qemu GetInfo method doesn't
get its cpu/mem usage info from cgroups, so it would not happen the
exact same way.

-serge
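
[Editorial sketch] The 'broken' versus 'transient' split asked about above
could be expressed with two exception classes and a retry loop on the
caller's side. None of these names exist in nova or libvirt; DriverError,
BrokenInstanceError, TransientStateError and get_info_with_retry are
hypothetical, and the retry count and delay are arbitrary.

```python
import time


class DriverError(Exception):
    """Base class for errors raised by a (hypothetical) virt driver."""


class BrokenInstanceError(DriverError):
    """The instance is bad or does not exist; retrying will not help."""


class TransientStateError(DriverError):
    """The instance is not ready yet (e.g. cgroups not set up); retry."""


def get_info_with_retry(get_info, retries=5, delay=0.5, sleep=time.sleep):
    """Call get_info(), retrying only on TransientStateError."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return get_info()
        except TransientStateError:
            if attempt == retries - 1:
                raise  # still not ready after the last attempt
            sleep(delay)
        # BrokenInstanceError is deliberately not caught: bail immediately.
```

With this split the caller never has to inspect error messages: a
BrokenInstanceError propagates on the first call, while a transient state
is retried a bounded number of times.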
