On Fri, 28 Oct 2016 11:25:55 -0400, Laine Stump <la...@redhat.com> wrote:

> On 10/28/2016 07:28 AM, Henning Schild wrote:
> > Hey,
> >
> > I am running an unusual setup where I assign PCI devices behind the
> > back of libvirt. I have two options to do that:
> > 1. a wrapper script for qemu that takes care of suid-root and
> > appends arguments for pci-assign
> > 2. virsh qemu-monitor-command ... 'device_add pci-assign...'
> 
> With any reasonably modern version of Linux/qemu/libvirt, you should
> not be using pci-assign, but should use vfio-pci instead. pci-assign
> is old, unmaintained, and deprecated (and any other bad words you can
> think of).
> 
> Also, have you done anything to lock the guest's memory in host RAM? 
> This is necessary so that the source/destination of DMA reads/writes
> is always present. It is done automatically by libvirt as required
> *when libvirt knows that a device is being assigned to the guest*,
> but if you're going behind libvirt's back, you need to take care of
> that yourself (or alternately, don't go behind libvirt's back, which
> is the greatly preferred alternative!)

Memory locking is taken care of with "-realtime mlock=on".
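
(For comparison, if libvirt itself is told about the locking, my
understanding is that the domain XML equivalent is roughly the
following; the hard_limit figure is only a placeholder and would have
to cover guest RAM plus device/IO overhead:)

  <memtune>
    <!-- placeholder value: guest RAM plus some overhead, in KiB -->
    <hard_limit unit='KiB'>9437184</hard_limit>
  </memtune>
  <memoryBacking>
    <locked/>
  </memoryBacking>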

> >
> > I know I should probably not be doing this,
> 
> 
> Yes, that is a serious understatement :-) And I suspect that it isn't 
> necessary.

I know, but that was never the question ;).

> >   it is a workaround to
> > introduce fine-grained PCI assignment in an OpenStack setup, where
> > vendor and device ID are not enough to pick the right device for a
> > VM.
> 
> libvirt selects the device according to its PCI address, not vendor
> and device id. Is that not "fine-grained" enough? (And does OpenStack
> not let you select devices based on their PCI address?)

The workaround is indeed for the version of OpenStack we are using.
Recent versions might have support for more fine-grained assignment,
but updating OpenStack is not something I would like to do right now.
That is another item on the TODO list that I would like to keep
separate from the problem at hand.

> >
> > In both cases qemu will crash with the following output:
> >  
> >> qemu: hardware error: pci read failed, ret = 0 errno = 22  
> > followed by the usual machine state dump. With strace i found it to
> > be a failing read on the config space file of my device.
> > /sys/bus/pci/devices/0000:xx:xx.x/config
> > A few reads from that file succeeded, as did accesses to
> > vendor etc.
> >
> > Manually launching qemu with pci-assign works without a
> > problem, so I "blame" libvirt and the cgroup environment the qemu
> > ends up in. So I put a bash shell into the exact same cgroup setup -
> > next to a running qemu - expecting a dd or hexdump on the
> > config-space file to fail. But from that bash I can read the file
> > without a problem.
> >
> > Has anyone seen that problem before?  
> 
> No, because nobody else (that I've ever heard) is doing what you are 
> doing. You're going around behind the back of libvirt  (and
> OpenStack) to do device assignment with a method that was replaced
> with something newer/better/etc about 3 years ago, and in the process
> are likely missing a lot of the details that would otherwise be
> automatically handled by libvirt.

Sure, and my question was aiming at exactly what I could be missing.
That is just to fix a system that used to work and to get a better
understanding of "a lot of the details that would otherwise be
automatically handled by libvirt".

> 
> > Right now I do not know what I
> > am missing; maybe qemu is hitting some limits configured for the
> > cgroups or whatever. I cannot use pci-assign from libvirt, but if I
> > did, would it configure cgroups in a different way or relax some
> > limits?
> >
> > What would be a good next step to debug that? Right now I am
> > looking at kernel event traces, but the machine is pretty big and
> > so is the trace.
> 
> 
> My recommendation would be this:
> 
> 1) look at OpenStack to see if it allows selecting the device to
> assign by PCI address. If so, use that (it will just tell libvirt
> "assign this device", and libvirt will automatically use VFIO for the
> device assignment if it's available (which it will be))

The version currently in use does not allow that.

> 2) if (1) is a deadend (i.e. OpenStack doesn't allow you to select
> based on PCI address), use your "sneaky backdoor method" to do "virsh 
> attach-device somexmlfile.xml", where somexmlfile.xml has a proper 
> <hostdev> element to select and assign the host device you want.
> Again, libvirt will automatically figure out if VFIO can be used, and
> will properly setup everything necessary related to cgroups, locked
> memory, etc.

Thanks! I will try the sneaky .xml method; in that case I will only
have to play tricks on OpenStack and hopefully get all the libvirt
details right.

> 
> >
> > That assignment used to work and I do not know how it broke; I have
> > tried combinations of several kernels, libvirt versions and qemu
> > versions (kernels 3.18 and 4.4, libvirt 1.3.2 and 2.0.0, and qemu
> > 2.2.1 and 2.7). All combinations show the same problem, even the
> > ones that work on other machines. So when it comes to software
> > versions, the problem could well be caused by a software update of
> > another component that I got with the package manager and did not
> > compile myself. It is a Debian 8.6 with all recent updates
> > installed. My guess would be that systemd could have an influence
> > on cgroups or limits, causing such a problem.
> 
> That you would need to think of such things points out that your
> current setup is fragile and ultimately unmaintainable. Please
> consider "coloring inside the lines" :-) (We'd be happy to help if
> there are any hangups along the way, either on the libvirt-users
> mailing list or in the #virt channel on irc.oftc.net).

It is a legacy reference/demo/proof-of-concept setup for
realtime-enabled VMs that somehow broke. PCI assignment was used for
NICs when guests did not support virtio.
https://archive.fosdem.org/2016/schedule/event/virt_iaas_real_time_cloud/

Since it is a hack, unmaintainable, and does not scale, we do not use
it anymore. But I was curious why it suddenly stopped working in that
old demo setup.

regards,
Henning
