Joe, this seems to have been mostly solved by the qemu upgrade. Since I plan on being on Mitaka before blessing the GPU instances with the 'production' label, I'm OK with that.
Blair, I reflexively blacklist the nouveau drivers about 5 ways in my installer and six in puppet :)

I do have an odd remaining issue where I can run CUDA jobs in the VM, but snapshots fail, and after the pause (for snapshotting) the PCI device can't be reattached (which is where I think it deletes the snapshot it took). I get the same issue with 3.16 and 4.4 kernels. Not very well characterized yet, but I'm hoping it's because the VM I was hacking on had its libvirt.xml written out with the older qemu, maybe? It had been through a couple of reboots of the physical system, though. Currently building a fresh instance and bashing more keys...

Thanks all,
-Jon

On Thu, Jul 07, 2016 at 12:35:33AM +1000, Blair Bethwaite wrote:
:Hi Jon,
:
:Do you have the nouveau driver/module loaded in the host by any
:chance? If so, blacklist, reboot, repeat.
:
:While we're talking about this: has anyone had any luck doing this
:with hosts having a PCI-e switch across multiple GPUs?
:
:Cheers,
:
:On 6 July 2016 at 23:27, Jonathan D. Proulx <j...@csail.mit.edu> wrote:
:> Hi All,
:>
:> Trying to pass through some Nvidia K80 GPUs to some instances, and I have
:> gotten to the place where Nova seems to be doing the right thing: GPU
:> instances are scheduled on the one GPU hypervisor I have, and inside the
:> VM I see:
:>
:> root@gpu-x1:~# lspci | grep -i k80
:> 00:06.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
:>
:> And I can install the nvidia-361 driver and get:
:>
:> # ls /dev/nvidia*
:> /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
:>
:> But once I load up cuda-7.5 and build the examples, none of them run,
:> claiming there's no CUDA device:
:>
:> # ./matrixMul
:> [Matrix Multiply Using CUDA] - Starting...
:> cudaGetDevice returned error no CUDA-capable device is detected (code 38), line(396)
:> cudaGetDeviceProperties returned error no CUDA-capable device is detected (code 38), line(409)
:> MatrixA(160,160), MatrixB(320,160)
:> cudaMalloc d_A returned error no CUDA-capable device is detected (code 38), line(164)
:>
:> I'm not really familiar with CUDA, but I did get some example code
:> running on the physical system for burn-in over the weekend (since
:> reinstalled, so there's no nvidia driver on the hypervisor now).
:>
:> Following various online examples for setting up passthrough, I set
:> the kernel boot line on the hypervisor to:
:>
:> # cat /proc/cmdline
:> BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic root=UUID=d9bc9159-fedf-475b-b379-f65490c71860 ro console=tty0 console=ttyS1,115200 intel_iommu=on iommu=pt rd.modules-load=vfio-pci nosplash nomodeset intel_iommu=on iommu=pt rd.modules-load=vfio-pci nomdmonddf nomdmonisw
:>
:> Puzzled that I apparently have the device but it is apparently
:> nonfunctional. Where do I even look from here?
:>
:> -Jon
:>
:> _______________________________________________
:> OpenStack-operators mailing list
:> OpenStack-operators@lists.openstack.org
:> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
:
:--
:Cheers,
:~Blairo
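The blacklist-reboot-repeat step Blair describes, plus a quick check that the GPU is actually bound to vfio-pci on the host, usually looks roughly like the sketch below. The file name and the PCI address 0000:04:00.0 are placeholders, not values from this thread; paths assume a Debian/Ubuntu host like Jon's.

```shell
# Sketch only: keep nouveau off the hypervisor and verify vfio-pci owns the GPU.
# 0000:04:00.0 is an illustrative PCI address -- substitute your own.

# 1. Blacklist nouveau so it can never grab the card, then rebuild the initramfs.
cat >/etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
update-initramfs -u   # Debian/Ubuntu; reboot afterwards

# 2. After reboot, confirm which kernel driver owns the GPU.
lspci -nnk -s 0000:04:00.0
# You want to see "Kernel driver in use: vfio-pci";
# "nouveau" or "nvidia" means the host has claimed the device.

# 3. List IOMMU groups -- relevant to Blair's PCI-e switch question, since
# GPUs behind one switch often share a group and must be passed through together.
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo "  $(lspci -nns "${d##*/}")"
  done
done
```

This is hardware-dependent, so treat it as a diagnostic checklist rather than a script to run blindly.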