On Thu, Jul 6, 2017 at 2:20 AM, Alex Williamson <[email protected]> wrote:
> On Wed, Jul 5, 2017 at 10:23 PM, Thiago Ramon <[email protected]> wrote:
>>
>> Here, dropped the raw message in pastebin: https://pastebin.com/hfJ6ryJg
>>
>> That particular run was trying to pass the 980 Ti, which is the boot
>> device, and which probably had something else prodding at it (I'll give it
>> a try again and check what else was attaching to it). I've mostly focused
>> on passing the 1060 though, which doesn't get touched by anything but
>> vfio-pci, and also doesn't show any mmap issues, here's the last QEMU run
>> with SeaBIOS:
>>
>> https://pastebin.com/DEPpewCH
>>
>> And the last one from OVMF:
>>
>> https://pastebin.com/L7gkrm36
>>
>> On the kernel log, I only get the vfio_bar_restore messages. One
>> interesting and consistent pattern is that SeaBIOS always generates 2 pairs
>> of warnings (one for GPU, one audio), while OVMF generates quite a bit
>> (dozen+, don't have a log handy). Probably not relevant, as apparently the
>> failure happens before the first message anyway.
>>
>> Another detail that may be relevant: Whenever I try a passthrough (and
>> fail), the kernel fails to soft restart. It gets to the last stage where it
>> would do a soft reset but the console just sits there. Could this just be
>> vfio_pci trying to do something with the unresponsive card, or something
>> else that may be a clue to what's going on?
>>
>
> Yep, here's what I suspected about the D3 warning:
>
> > PCI state after passthrough attempt:
> > 29:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] [10de:17c8] (rev ff) (prog-if ff)
> > !!! Unknown header type 7f
> > Kernel driver in use: vfio-pci
> > Kernel modules: nouveau, nvidia_drm, nvidia
> >
> > 29:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition Audio [10de:0fb0] (rev ff) (prog-if ff)
> > !!! Unknown header type 7f
> > Kernel driver in use: vfio-pci
> > Kernel modules: snd_hda_intel
>
> The card isn't actually stuck in D3, it's basically disappeared from the
> bus and all reads from config space are returning -1, which is
> indistinguishable from D3 power state for the bits that tell us the
> power state. This is probably the result of doing a bus reset, but that's
> also our only way of putting the device back to a known state before
> starting it in the VM. You might try to see if you can reproduce this
> result manually with setpci. We do a bus reset by finding the bridge
> upstream of the device, lspci -t is handy for this with a tree view of the
> PCI topology. As an example:
>
> https://pastebin.com/c3URT6vx
>
> Bus numbers are shown in brackets, so if I want the parent bridge of
> device 01:00.0, look to the left of [01]--00.0 to find 01.0. This is
> attached to the root bus at [0000:00], so the full address of the parent
> bridge is 0000:00:01.0.
>
> We can access the bridge control register using
>
> # setpci -s 0000:00:01.0 BRIDGE_CONTROL
>
> The secondary bus reset bit is 0x40. We want to set this bit:
>
> # setpci -s 0000:00:01.0 BRIDGE_CONTROL=40:40
>
> Then clear it:
>
> # setpci -s 0000:00:01.0 BRIDGE_CONTROL=00:40
>
> Then run lspci on the bus to see if the device is still present. In your
> case it would be bus 29, so you'd run
>
> # lspci -vvv -s 0000:29:
>
> Do you get output like above with the 'Unknown header type 7f' or a
> complete listing of the device? Be sure to reboot the system after running
> this test, regardless of the result the device will be re-initialized, and
> clearly nothing should be using the device while doing this. If the
> graphics card doesn't recover from a bus reset, then something about this
> system setup is not compatible with this use case. Thanks,
>
> Alex

Ok, did some more testing.
First, starting with both cards bound to the NVIDIA driver, I shut down X, ran rmmod nvidia, bound my secondary card to vfio-pci and tried to reset the bus. It did indeed fail to reset properly and got stuck.

Then I switched to my primary-card passthrough setup to see what was grabbing the card's memory. It turned out to be vesafb, even though I had disabled it. After adding a bunch more options to the boot command line, I managed to keep everything else off the card and tested the bus reset, which worked this time. I then tried running the VM without an external BIOS; it failed, but with a complaint about not being able to access the BIOS. I rebooted and tried again with a pre-dumped BIOS, and it still failed in the same way as before.

Returning to my secondary card, I tried the bus reset again, this time from a fresh boot, and it seems to have worked fine. Here are the logs:

https://pastebin.com/94F5wURY

I then reset the bus a few more times to see whether repeated resets were a problem, but at least half a dozen resets didn't seem to cause any issues.

Any other ideas?
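
For reference, the manual test from the quoted message could be wrapped in a small shell script roughly like the one below. This is only a sketch: the function names and the one-second delays are my own assumptions and not something from the thread. The setpci and lspci invocations are the ones quoted above. It needs root, nothing should be using the device while it runs, and the system should be rebooted afterwards.

```shell
#!/bin/sh
# Sketch of the manual secondary bus reset test quoted above.
# Assumptions (not from the thread): the function names and the
# 1-second delays between the register writes.

# Reads lspci output on stdin; succeeds if the device looks like it has
# dropped off the bus (config space reads returning all ones).
device_vanished() {
    grep -q 'Unknown header type 7f'
}

# Pulse the secondary bus reset bit (0x40) in the bridge control register
# of the given upstream bridge, e.g. 0000:00:01.0.
bus_reset() {
    setpci -s "$1" BRIDGE_CONTROL=40:40 &&   # set the reset bit
    sleep 1 &&
    setpci -s "$1" BRIDGE_CONTROL=00:40 &&   # clear it again
    sleep 1
}

# reset_test BRIDGE BUS
#   BRIDGE: full address of the parent bridge (found with lspci -t)
#   BUS:    bus selector for lspci, e.g. 0000:29:
reset_test() {
    bus_reset "$1" || return 2
    if lspci -vvv -s "$2" | device_vanished; then
        echo "device did not come back after the bus reset"
        return 1
    fi
    echo "device still responds after the bus reset"
}
```

After sourcing the file as root, the test would be run as reset_test followed by the parent bridge address and the bus selector (0000:29: for the 980 Ti here). The block only defines functions, so sourcing it does not touch the hardware by itself.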
_______________________________________________
vfio-users mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/vfio-users
