On Fri, 28 Oct 2016 11:25:55 -0400, Laine Stump <la...@redhat.com> wrote:
> On 10/28/2016 07:28 AM, Henning Schild wrote:
> > Hey,
> >
> > i am running an unusual setup where i assign pci devices behind the
> > back of libvirt. I have two options to do that:
> > 1. a wrapper script for qemu that takes care of suid-root and
> > appends arguments for pci-assign
> > 2. virsh qemu-monitor-command ... 'device_add pci-assign...'
>
> With any reasonably modern version of Linux/qemu/libvirt, you should
> not be using pci-assign, but should use vfio-pci instead. pci-assign
> is old, unmaintained, and deprecated (and any other bad words you can
> think of).
>
> Also, have you done anything to lock the guest's memory in host RAM?
> This is necessary so that the source/destination of DMA reads/writes
> is always present. It is done automatically by libvirt as required
> *when libvirt knows that a device is being assigned to the guest*,
> but if you're going behind libvirt's back, you need to take care of
> that yourself (or alternately, don't go behind libvirt's back, which
> is the greatly preferred alternative!)

Memory locking is taken care of with "-realtime mlock=on".

> > I know i should probably not be doing this,
>
> Yes, that is a serious understatement :-) And I suspect that it isn't
> necessary.

I know, but that was never the question ;).

> > it is a workaround to
> > introduce fine-grained pci-assignment in an openstack setup, where
> > vendor and device id are not enough to pick the right device for a
> > vm.
>
> libvirt selects the device according to its PCI address, not vendor
> and device id. Is that not "fine-grained" enough? (And does OpenStack
> not let you select devices based on their PCI address?)

The workaround is indeed for the version of OpenStack we are using.
Recent versions might have support for more fine-grained assignment,
but updating OpenStack is not something i would like to do right now.
Another item on the TODO-list that i would like to keep separate from
the problem at hand.
> >
> > In both cases qemu will crash with the following output:
> >
> >> qemu: hardware error: pci read failed, ret = 0 errno = 22
> > followed by the usual machine state dump. With strace i found it to
> > be a failing read on the config space file of my device.
> > /sys/bus/pci/devices/0000:xx:xx.x/config
> > A few reads out of that file succeeded, as well as accesses on
> > vendor etc.
> >
> > Manually launching a qemu with the pci-assign works without a
> > problem, so i "blame" libvirt and the cgroup environment the qemu
> > ends up in. So i put a bash into the exact same cgroup setup - next
> > to a running qemu, expecting a dd or hexdump on the config-space
> > file to fail. But from that bash i can read the file without a
> > problem.
> >
> > Has anyone seen that problem before?
>
> No, because nobody else (that I've ever heard) is doing what you are
> doing. You're going around behind the back of libvirt (and
> OpenStack) to do device assignment with a method that was replaced
> with something newer/better/etc about 3 years ago, and in the process
> are likely missing a lot of the details that would otherwise be
> automatically handled by libvirt.

Sure, and my question was aiming at what exactly i could be missing.
That is just to fix a system that used to work and get a better
understanding of "a lot of the details that would otherwise be
automatically handled by libvirt".

> > Right now i do not know what i
> > am missing, maybe qemu is hitting some limits configured for the
> > cgroups or whatever. I can not use pci-assign from libvirt, but if i
> > did would it configure cgroups in a different way or relax some
> > limits?
> >
> > What would be a good next step to debug that? Right now i am
> > looking at kernel event traces, but the machine is pretty big and
> > so is the trace.
>
> My recommendation would be this:
>
> 1) look at OpenStack to see if it allows selecting the device to
> assign by PCI address.
> If so, use that (it will just tell libvirt
> "assign this device", and libvirt will automatically use VFIO for the
> device assignment if it's available (which it will be))

The version currently in use does not allow that.

> 2) if (1) is a deadend (i.e. OpenStack doesn't allow you to select
> based on PCI address), use your "sneaky backdoor method" to do "virsh
> attach-device somexmlfile.xml", where somexmlfile.xml has a proper
> <hostdev> element to select and assign the host device you want.
> Again, libvirt will automatically figure out if VFIO can be used, and
> will properly setup everything necessary related to cgroups, locked
> memory, etc.

Thanks! I will try the sneaky .xml method, in that case i will only
have to play tricks on OpenStack and hopefully get all the libvirt
details.

> > That assignment used to work and i do not know how it broke, i have
> > tried combinations of several kernels, versions of libvirt and qemu.
> > (kernel 3.18 and 4.4, libvirt 1.3.2 and 2.0.0, and qemu 2.2.1 and
> > 2.7) All combinations show the same problem, even the ones that
> > work on other machines. So when it comes to software versions the
> > problem could well be caused by a software update of another
> > component, that i got with the package manager and did not compile
> > myself. It is a debian 8.6 with all recent updates installed. My
> > guess would be that systemd could have an influence on cgroups or
> > limits causing such a problem.
>
> That you would need to think of such things points out that your
> current setup is fragile and ultimately unmaintainable. Please
> consider "coloring inside the lines" :-) (We'd be happy to help if
> there are any hangups along the way, either on the libvirt-users
> mailing list or in the #virt channel on irc.oftc.net).

It is a legacy reference/demo/proof-of-concept setup for
realtime-enabled VMs, that somehow broke. PCI assignment was used for
NICs when guests did not support virtio.
https://archive.fosdem.org/2016/schedule/event/virt_iaas_real_time_cloud/

Since it is a hack and unmaintainable and does not scale, we do not use
it anymore. But i was curious why it suddenly stopped working in that
old demo setup.

regards,
Henning
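
Note on the "sneaky .xml method" from point 2: the file handed to
"virsh attach-device" needs a <hostdev> element that names the host
device by its PCI address. A rough sketch of generating such a file
might look like the following. The helper name hostdev_xml and the
address 0000:03:00.0 are made up for illustration, and managed='yes'
(letting libvirt handle binding the device to the host driver) is an
assumption that may not suit every setup.

```python
import re
import xml.etree.ElementTree as ET

# Matches a full PCI address such as "0000:03:00.0" (domain:bus:slot.function).
_PCI_ADDR = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def hostdev_xml(pci_addr, managed=True):
    """Build a libvirt <hostdev> element for one host PCI device.

    hostdev_xml is a hypothetical helper for this sketch, not part of libvirt.
    """
    match = _PCI_ADDR.match(pci_addr)
    if match is None:
        raise ValueError("not a full PCI address: %r" % pci_addr)
    domain, bus, slot, function = match.groups()

    # mode='subsystem' type='pci' selects PCI passthrough of a host device.
    hostdev = ET.Element(
        "hostdev",
        {"mode": "subsystem", "type": "pci", "managed": "yes" if managed else "no"},
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        {
            "domain": "0x" + domain.lower(),
            "bus": "0x" + bus.lower(),
            "slot": "0x" + slot.lower(),
            "function": "0x" + function,
        },
    )
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    # Example address only; substitute the device you actually want to assign.
    print(hostdev_xml("0000:03:00.0"))
```

The output would be saved as somexmlfile.xml and passed to
"virsh attach-device" together with the name of the target domain.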