[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2024-07-09 Thread Maria Jonas
Hi,

In pass-through mode it is essential to assign all GPUs connected through 
NVLink to the same VM. If only a subset of these GPUs is assigned to a VM, the 
unrecoverable error XID 74 is triggered during boot, corrupting the NVLink 
state and rendering the NVLink bridge unusable. To avoid this issue, ensure 
that all GPUs in the NVLink topology are passed through to the VM.
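
A quick way to see which GPUs are linked, and therefore have to be assigned 
together, is to query the topology on the host with nvidia-smi. A rough 
sketch, assuming the NVIDIA driver is loaded on the host:

  # show the GPU interconnect matrix; entries such as NV1/NV2 mark NVLink peers
  nvidia-smi topo -m
  # show the per-GPU NVLink status
  nvidia-smi nvlink --status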

Thanks
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/WDYPUNW2RBYT6KQVMRGXKQOUH6B6APIC/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2024-07-09 Thread Zhengyi Lai
I noticed this document 
https://docs.nvidia.com/vgpu/16.0/grid-vgpu-release-notes-generic-linux-kvm/index.html#all-nvlink-gpus-must-be-passed-through-to-same-vm
has this to say:

In pass through mode, all GPUs connected to each other through NVLink must be 
assigned to the same VM. If a subset of GPUs connected to each other through 
NVLink is passed through to a VM, unrecoverable error XID 74 occurs when the VM 
is booted. This error corrupts the NVLink state on the physical GPUs and, as a 
result, the NVLink bridge between the GPUs is unusable.

You may need to pass through all GPUs in the NVLink topology to the VM.
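
Once all of the linked GPUs are attached to the same VM, something along these 
lines inside the guest should confirm that they all probed cleanly (a rough 
sketch, assuming the NVIDIA driver is installed in the guest):

  lspci -nn | grep -i nvidia   # all passed-through GPUs should be listed
  dmesg | grep -i nvrm         # should show no "BAR ... is 0M" or XID 74 errors
  nvidia-smi topo -m           # NVLink (NV#) entries between the GPUs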
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/C7WY4UZZDCSRCHRH5EVQGBYYJF5MYSP7/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-14 Thread Arman Khalatyan
Any progress on this GPU question?
In our setup we have Supermicro boards with Intel Xeon Gold 6146 + 2x T4.
We add an extra line in /etc/default/grub:
"rd.driver.blacklist=nouveau nouveau.modeset=0 pci-stub.ids=xxx:xxx
intel_iommu=on"
It would be interesting to know whether the NVLink was the showstopper.
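
For anyone following along, this is roughly what that host preparation looks 
like; the device IDs are placeholders (check yours with lspci) and the grub.cfg 
path differs between legacy-BIOS and UEFI hosts:

  # find the vendor:device IDs of the GPUs to stub out (NVIDIA's vendor ID is 10de)
  lspci -nn | grep -i nvidia
  # append to GRUB_CMDLINE_LINUX in /etc/default/grub, for example:
  #   rd.driver.blacklist=nouveau nouveau.modeset=0 pci-stub.ids=<vendor>:<device> intel_iommu=on
  # regenerate the grub config (path shown is for a legacy-BIOS EL host) and reboot
  grub2-mkconfig -o /boot/grub2/grub.cfg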



Arman Khalatyan  wrote on Sat., 5 Sept. 2020, 00:38:

> same here ☺️, on Monday will check them.
>
> Michael Jones  schrieb am Fr., 4. Sept. 2020, 22:01:
>
>> Yea pass through, I think vgpu you have to pay for driver upgrade with
>> nvidia, I've not tried that and don't know the price, didn't find getting
>> info on it easy last time I tried.
>>
>> Have used in both legacy and uefi boot machines, don't know the chipsets
>> off the top of my head, will look on Monday.
>>
>>
>> On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
>> wrote:
>>
>>> Thanks Michael and Arman.
>>>
>>> To make things clear, you guys are using Passthrough, right? It’s not
>>> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
>>> What I’m trying to achieve is add the 4x V100 directly to one specific
>>> VM.
>>>
>>> And finally can you guys confirm which BIOS type is being used in your
>>> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
>>> legacy, perhaps I’ll give it a try.
>>>
>>> Thanks again.
>>>
>>> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>>>
>>> Also use multiple t4, also p4, titans, no issues but never used the
>>> nvlink
>>>
>>> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>>>
 hi,
 with the 2xT4 we haven't seen any trouble. we have no nvlink there.

 did u try to disable the nvlink?



 Vinícius Ferrão via Users  schrieb am Fr., 4. Sept.
 2020, 08:39:

> Hello, here we go again.
>
> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
> single VM; but things aren’t that good. Only one GPU shows up on the VM.
> lspci is able to show the GPUs, but three of them are unusable:
>
> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
> 16GB] (rev a1)
> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
> 16GB] (rev a1)
> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
> 16GB] (rev a1)
> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
> 16GB] (rev a1)
>
> There are some errors on dmesg, regarding a misconfigured BIOS:
>
> [   27.295972] nvidia: loading out-of-tree module taints kernel.
> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
> [   27.295981] Disabling lock debugging due to kernel taint
> [   27.304180] nvidia: module verification failed: signature and/or
> required key missing - tainting kernel
> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
> device number 241
> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA
> device is invalid:
>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA
> device is invalid:
>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA
> device is invalid:
>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
> 450.51.06  Sun Jul 19 20:02:54 UTC 2020
> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting
> Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>
> The host is Secure Intel Skylake (x86_64). VM is running with Q35
> Chipset with UEFI (pc-q35-rhel8.2.0)
>
> I’ve tried to change the I/O mapping options on the host, tried with
> 56TB and 12TB without success. Same results. Didn’t tried with 512GB since
> the machine have 768GB of system RAM.
>
> Tried blacklisting the nouveau on the host, nothing.
> Installed NVIDIA drivers on the host, nothing.
>
> In the host I can use the 4x V100, but inside a single VM it’s
> impossible.
>
> Any suggestions?
>
>
>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: 

[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Arman Khalatyan
Same here ☺️, I will check them on Monday.

Michael Jones  wrote on Fri., 4 Sept. 2020, 22:01:

> Yea pass through, I think vgpu you have to pay for driver upgrade with
> nvidia, I've not tried that and don't know the price, didn't find getting
> info on it easy last time I tried.
>
> Have used in both legacy and uefi boot machines, don't know the chipsets
> off the top of my head, will look on Monday.
>
>
> On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
> wrote:
>
>> Thanks Michael and Arman.
>>
>> To make things clear, you guys are using Passthrough, right? It’s not
>> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
>> What I’m trying to achieve is add the 4x V100 directly to one specific VM.
>>
>> And finally can you guys confirm which BIOS type is being used in your
>> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
>> legacy, perhaps I’ll give it a try.
>>
>> Thanks again.
>>
>> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>>
>> Also use multiple t4, also p4, titans, no issues but never used the nvlink
>>
>> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>>
>>> hi,
>>> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>>>
>>> did u try to disable the nvlink?
>>>
>>>
>>>
>>> Vinícius Ferrão via Users  schrieb am Fr., 4. Sept.
>>> 2020, 08:39:
>>>
 Hello, here we go again.

 I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
 single VM; but things aren’t that good. Only one GPU shows up on the VM.
 lspci is able to show the GPUs, but three of them are unusable:

 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)

 There are some errors on dmesg, regarding a misconfigured BIOS:

 [   27.295972] nvidia: loading out-of-tree module taints kernel.
 [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
 [   27.295981] Disabling lock debugging due to kernel taint
 [   27.304180] nvidia: module verification failed: signature and/or
 required key missing - tainting kernel
 [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
 device number 241
 [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
 [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
 [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.579566] nvidia: probe of :09:00.0 failed with error -1
 [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
 [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
 [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
 [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
 [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
 [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
 450.51.06  Sun Jul 19 20:02:54 UTC 2020
 [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting
 Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

 The host is Secure Intel Skylake (x86_64). VM is running with Q35
 Chipset with UEFI (pc-q35-rhel8.2.0)

 I’ve tried to change the I/O mapping options on the host, tried with
 56TB and 12TB without success. Same results. Didn’t tried with 512GB since
 the machine have 768GB of system RAM.

 Tried blacklisting the nouveau on the host, nothing.
 Installed NVIDIA drivers on the host, nothing.

 In the host I can use the 4x V100, but inside a single VM it’s
 impossible.

 Any suggestions?



 ___
 Users mailing list -- users@ovirt.org
 To unsubscribe send an email to users-le...@ovirt.org
 Privacy Statement: https://www.ovirt.org/privacy-policy.html
 oVirt Code of Conduct:
 https://www.ovirt.org/community/about/community-guidelines/
 List Archives:
 https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/

>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>> 

[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Michael Jones
First things I'd check: which driver is loaded on the host, and that it's the
NVIDIA driver all the way through; make sure nouveau is blacklisted throughout.
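
For example, something along these lines on the host (just a sketch of the 
checks, nothing oVirt-specific):

  lsmod | grep -E 'nouveau|nvidia|vfio'   # nouveau should not be loaded at all
  lspci -nnk -d 10de:                     # "Kernel driver in use" for each GPU
  grep -ri nouveau /etc/modprobe.d/       # the blacklist entry, wherever it lives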

On Fri, 4 Sep 2020, 21:01 Michael Jones,  wrote:

> Yea pass through, I think vgpu you have to pay for driver upgrade with
> nvidia, I've not tried that and don't know the price, didn't find getting
> info on it easy last time I tried.
>
> Have used in both legacy and uefi boot machines, don't know the chipsets
> off the top of my head, will look on Monday.
>
>
> On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
> wrote:
>
>> Thanks Michael and Arman.
>>
>> To make things clear, you guys are using Passthrough, right? It’s not
>> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
>> What I’m trying to achieve is add the 4x V100 directly to one specific VM.
>>
>> And finally can you guys confirm which BIOS type is being used in your
>> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
>> legacy, perhaps I’ll give it a try.
>>
>> Thanks again.
>>
>> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>>
>> Also use multiple t4, also p4, titans, no issues but never used the nvlink
>>
>> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>>
>>> hi,
>>> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>>>
>>> did u try to disable the nvlink?
>>>
>>>
>>>
>>> Vinícius Ferrão via Users  schrieb am Fr., 4. Sept.
>>> 2020, 08:39:
>>>
 Hello, here we go again.

 I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
 single VM; but things aren’t that good. Only one GPU shows up on the VM.
 lspci is able to show the GPUs, but three of them are unusable:

 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)
 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2
 16GB] (rev a1)

 There are some errors on dmesg, regarding a misconfigured BIOS:

 [   27.295972] nvidia: loading out-of-tree module taints kernel.
 [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
 [   27.295981] Disabling lock debugging due to kernel taint
 [   27.304180] nvidia: module verification failed: signature and/or
 required key missing - tainting kernel
 [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
 device number 241
 [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
 [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
 [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.579566] nvidia: probe of :09:00.0 failed with error -1
 [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
 [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
 [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
 is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
 [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
 [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
 [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
 [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
 450.51.06  Sun Jul 19 20:02:54 UTC 2020
 [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting
 Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

 The host is Secure Intel Skylake (x86_64). VM is running with Q35
 Chipset with UEFI (pc-q35-rhel8.2.0)

 I’ve tried to change the I/O mapping options on the host, tried with
 56TB and 12TB without success. Same results. Didn’t tried with 512GB since
 the machine have 768GB of system RAM.

 Tried blacklisting the nouveau on the host, nothing.
 Installed NVIDIA drivers on the host, nothing.

 In the host I can use the 4x V100, but inside a single VM it’s
 impossible.

 Any suggestions?



 ___
 Users mailing list -- users@ovirt.org
 To unsubscribe send an email to users-le...@ovirt.org
 Privacy Statement: https://www.ovirt.org/privacy-policy.html
 oVirt Code of Conduct:
 https://www.ovirt.org/community/about/community-guidelines/
 List Archives:
 https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/

>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an 

[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Michael Jones
Yeah, pass-through. For vGPU I think you have to pay NVIDIA for a driver
upgrade; I've not tried that and don't know the price, and I didn't find it
easy to get information on it last time I tried.

I have used pass-through on both legacy and UEFI boot machines; I don't know
the chipsets off the top of my head, but I will look on Monday.


On Fri, 4 Sep 2020, 20:56 Vinícius Ferrão, 
wrote:

> Thanks Michael and Arman.
>
> To make things clear, you guys are using Passthrough, right? It’s not
> vGPU. The 4x GPUs are added on the “Host Devices” tab of the VM.
> What I’m trying to achieve is add the 4x V100 directly to one specific VM.
>
> And finally can you guys confirm which BIOS type is being used in your
> machines? I’m with Q35 Chipset with UEFI BIOS. I haven’t tested it with
> legacy, perhaps I’ll give it a try.
>
> Thanks again.
>
> On 4 Sep 2020, at 14:09, Michael Jones  wrote:
>
> Also use multiple t4, also p4, titans, no issues but never used the nvlink
>
> On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:
>
>> hi,
>> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>>
>> did u try to disable the nvlink?
>>
>>
>>
>> Vinícius Ferrão via Users  schrieb am Fr., 4. Sept.
>> 2020, 08:39:
>>
>>> Hello, here we go again.
>>>
>>> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
>>> single VM; but things aren’t that good. Only one GPU shows up on the VM.
>>> lspci is able to show the GPUs, but three of them are unusable:
>>>
>>> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>>> (rev a1)
>>>
>>> There are some errors on dmesg, regarding a misconfigured BIOS:
>>>
>>> [   27.295972] nvidia: loading out-of-tree module taints kernel.
>>> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
>>> [   27.295981] Disabling lock debugging due to kernel taint
>>> [   27.304180] nvidia: module verification failed: signature and/or
>>> required key missing - tainting kernel
>>> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
>>> device number 241
>>> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
>>> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
>>> is invalid:
>>>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
>>> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
>>> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
>>> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
>>> is invalid:
>>>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
>>> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
>>> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
>>> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
>>> is invalid:
>>>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
>>> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
>>> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
>>> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
>>> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module
>>> 450.51.06  Sun Jul 19 20:02:54 UTC 2020
>>> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
>>> for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>>>
>>> The host is Secure Intel Skylake (x86_64). VM is running with Q35
>>> Chipset with UEFI (pc-q35-rhel8.2.0)
>>>
>>> I’ve tried to change the I/O mapping options on the host, tried with
>>> 56TB and 12TB without success. Same results. Didn’t tried with 512GB since
>>> the machine have 768GB of system RAM.
>>>
>>> Tried blacklisting the nouveau on the host, nothing.
>>> Installed NVIDIA drivers on the host, nothing.
>>>
>>> In the host I can use the 4x V100, but inside a single VM it’s
>>> impossible.
>>>
>>> Any suggestions?
>>>
>>>
>>>
>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
>>>
>> ___
>> Users mailing list -- users@ovirt.org
>> To unsubscribe send an email to users-le...@ovirt.org
>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/
>>
>
>

[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Vinícius Ferrão via Users
Thanks Michael and Arman.

To make things clear, you guys are using passthrough, right? It’s not vGPU. The 
4x GPUs are added on the “Host Devices” tab of the VM.
What I’m trying to achieve is to add the 4x V100 directly to one specific VM.

And finally, can you guys confirm which BIOS type is being used in your 
machines? I’m on the Q35 chipset with UEFI BIOS. I haven’t tested it with legacy; 
perhaps I’ll give it a try.
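
For reference, something like this on the host shows which machine type and 
firmware the VM actually got (vm-name is a placeholder; -r runs virsh 
read-only):

  virsh -r dumpxml <vm-name> | grep -E 'machine=|loader'

In this case that is pc-q35-rhel8.2.0 with the OVMF (UEFI) loader.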

Thanks again.

On 4 Sep 2020, at 14:09, Michael Jones <m...@mikejonesey.co.uk> wrote:

Also use multiple t4, also p4, titans, no issues but never used the nvlink

On Fri, 4 Sep 2020, 16:02 Arman Khalatyan, <arm2...@gmail.com> wrote:
hi,
with the 2xT4 we haven't seen any trouble. we have no nvlink there.

did u try to disable the nvlink?



Vinícius Ferrão via Users <users@ovirt.org> schrieb am Fr., 4. Sept. 2020, 08:39:
Hello, here we go again.

I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a single 
VM; but things aren’t that good. Only one GPU shows up on the VM. lspci is able 
to show the GPUs, but three of them are unusable:

08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)
09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev 
a1)

There are some errors on dmesg, regarding a misconfigured BIOS:

[   27.295972] nvidia: loading out-of-tree module taints kernel.
[   27.295980] nvidia: module license 'NVIDIA' taints kernel.
[   27.295981] Disabling lock debugging due to kernel taint
[   27.304180] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
[   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 241
[   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
[   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is 
invalid:
   NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
[   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
[   27.579566] nvidia: probe of :09:00.0 failed with error -1
[   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is 
invalid:
   NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
[   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
[   27.580734] nvidia: probe of :0a:00.0 failed with error -1
[   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is 
invalid:
   NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
[   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
[   27.581305] nvidia: probe of :0b:00.0 failed with error -1
[   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
[   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06  Sun 
Jul 19 20:02:54 UTC 2020
[   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for 
UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset with 
UEFI (pc-q35-rhel8.2.0)

I’ve tried to change the I/O mapping options on the host, tried with 56TB and 
12TB without success. Same results. Didn’t tried with 512GB since the machine 
have 768GB of system RAM.

Tried blacklisting the nouveau on the host, nothing.
Installed NVIDIA drivers on the host, nothing.

In the host I can use the 4x V100, but inside a single VM it’s impossible.

Any suggestions?



___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to 
users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to 
users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/

___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/FY5J2VGAZXUOE3K5QJIS3ETXP76M3CHO/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Michael Jones
Also using multiple T4s, plus P4s and Titans, with no issues, but never used the NVLink.

On Fri, 4 Sep 2020, 16:02 Arman Khalatyan,  wrote:

> hi,
> with the 2xT4 we haven't seen any trouble. we have no nvlink there.
>
> did u try to disable the nvlink?
>
>
>
> Vinícius Ferrão via Users  schrieb am Fr., 4. Sept.
> 2020, 08:39:
>
>> Hello, here we go again.
>>
>> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
>> single VM; but things aren’t that good. Only one GPU shows up on the VM.
>> lspci is able to show the GPUs, but three of them are unusable:
>>
>> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
>> (rev a1)
>>
>> There are some errors on dmesg, regarding a misconfigured BIOS:
>>
>> [   27.295972] nvidia: loading out-of-tree module taints kernel.
>> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
>> [   27.295981] Disabling lock debugging due to kernel taint
>> [   27.304180] nvidia: module verification failed: signature and/or
>> required key missing - tainting kernel
>> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
>> device number 241
>> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
>> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device
>> is invalid:
>>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
>> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
>> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
>> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device
>> is invalid:
>>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
>> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
>> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
>> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device
>> is invalid:
>>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
>> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
>> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
>> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
>> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06
>> Sun Jul 19 20:02:54 UTC 2020
>> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
>> for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>>
>> The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset
>> with UEFI (pc-q35-rhel8.2.0)
>>
>> I’ve tried to change the I/O mapping options on the host, tried with 56TB
>> and 12TB without success. Same results. Didn’t tried with 512GB since the
>> machine have 768GB of system RAM.
>>
>> Tried blacklisting the nouveau on the host, nothing.
>> Installed NVIDIA drivers on the host, nothing.
>>
>> In the host I can use the 4x V100, but inside a single VM it’s impossible.
>>
>> Any suggestions?
>>
>>
>>
>> ___
>> Users mailing list -- users@ovirt.org
>> To unsubscribe send an email to users-le...@ovirt.org
>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
>>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZOMK6ULEK3IXNC3TQV5TYIY5SH23NNA4/


[ovirt-users] Re: Multiple GPU Passthrough with NVLink (Invalid I/O region)

2020-09-04 Thread Arman Khalatyan
Hi,
with the 2x T4 we haven't seen any trouble; we have no NVLink there.

Did you try to disable the NVLink?



Vinícius Ferrão via Users  wrote on Fri., 4 Sept. 2020,
08:39:

> Hello, here we go again.
>
> I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a
> single VM; but things aren’t that good. Only one GPU shows up on the VM.
> lspci is able to show the GPUs, but three of them are unusable:
>
> 08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
> 09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
> 0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
> 0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]
> (rev a1)
>
> There are some errors on dmesg, regarding a misconfigured BIOS:
>
> [   27.295972] nvidia: loading out-of-tree module taints kernel.
> [   27.295980] nvidia: module license 'NVIDIA' taints kernel.
> [   27.295981] Disabling lock debugging due to kernel taint
> [   27.304180] nvidia: module verification failed: signature and/or
> required key missing - tainting kernel
> [   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major
> device number 241
> [   27.579261] nvidia :09:00.0: enabling device ( -> 0002)
> [   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is
> invalid:
>NVRM: BAR1 is 0M @ 0x0 (PCI::09:00.0)
> [   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.579566] nvidia: probe of :09:00.0 failed with error -1
> [   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is
> invalid:
>NVRM: BAR0 is 0M @ 0x0 (PCI::0a:00.0)
> [   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.580734] nvidia: probe of :0a:00.0 failed with error -1
> [   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is
> invalid:
>NVRM: BAR0 is 0M @ 0x0 (PCI::0b:00.0)
> [   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
> [   27.581305] nvidia: probe of :0b:00.0 failed with error -1
> [   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
> [   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06
> Sun Jul 19 20:02:54 UTC 2020
> [   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
> for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020
>
> The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset
> with UEFI (pc-q35-rhel8.2.0)
>
> I’ve tried to change the I/O mapping options on the host, tried with 56TB
> and 12TB without success. Same results. Didn’t tried with 512GB since the
> machine have 768GB of system RAM.
>
> Tried blacklisting the nouveau on the host, nothing.
> Installed NVIDIA drivers on the host, nothing.
>
> In the host I can use the 4x V100, but inside a single VM it’s impossible.
>
> Any suggestions?
>
>
>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIO4DIVUU4JWG5FXYW3NQSVXCFZWYV26/