[vfio-users] lspci and vfio_pci_release deadlock when destroying a PCI passthrough VM
Hi Alex,

I noticed a patch you pushed in https://lkml.org/lkml/2019/2/18/1315. You said the previous commit you pushed may be prone to deadlock; could you please share the details of how to reproduce that deadlock, if you know them? I hit a similar issue: every lspci command went into D state and libvirtd went into Z state when destroying a VM with a GPU passthrough. The stacks (captured 2019-03-20) look like this:

[2427373.553663] INFO: task ps:112058 blocked for more than 120 seconds.
[2427373.553667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2427373.553669] ps              D    0 112058      1 0x0004
[2427373.553673] Call Trace:
[2427373.553682]  [] schedule_preempt_disabled+0x29/0x70
[2427373.553684]  [] __mutex_lock_slowpath+0xe1/0x170
[2427373.553689]  [] mutex_lock+0x1f/0x2f
[2427373.553695]  [] pci_bus_save_and_disable+0x37/0x70
[2427373.553697]  [] pci_try_reset_bus+0x38/0x80
[2427373.553730]  [] vfio_pci_release+0x3d5/0x430 [vfio_pci]
[2427373.553737]  [] ? vfio_pci_rw+0xc0/0xc0 [vfio_pci]
[2427373.553745]  [] vfio_device_fops_release+0x22/0x40 [vfio]
[2427373.553751]  [] __fput+0xec/0x260
[2427373.553754]  [] fput+0xe/0x10
[2427373.553758]  [] task_work_run+0xaa/0xe0
[2427373.553763]  [] do_notify_resume+0x92/0xb0
[2427373.553767]  [] int_signal+0x12/0x17
[2427373.553771] INFO: task lspci:139540 blocked for more than 120 seconds.
[2427373.553772] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2427373.553773] lspci           D    0 139540 139539 0x
[2427373.553776] Call Trace:
[2427373.553778]  [] schedule+0x29/0x70
[2427373.553782]  [] pci_wait_cfg+0xa0/0x110
[2427373.553787]  [] ? wake_up_state+0x20/0x20
[2427373.553790]  [] pci_user_read_config_dword+0x105/0x110
[2427373.553794]  [] pci_read_config+0x114/0x2c0
[2427373.553799]  [] ? __kmalloc+0x55/0x240
[2427373.553804]  [] read+0xde/0x1f0
[2427373.553807]  [] vfs_read+0x9f/0x170
[2427373.553809]  [] SyS_pread64+0x92/0xc0
[2427373.553812]  [] system_call_fastpath+0x1c/0x21

It seems that lspci and vfio_pci_release are deadlocked.

Thanks,
Zongyong Wu

___
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users
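To gather more data on a hang like the one above, the kernel stacks of every blocked task can be dumped directly from procfs rather than waiting for the 120-second hung-task detector. A minimal sketch (needs root to read other tasks' stacks):

```shell
#!/bin/sh
# Dump the kernel stack of every task currently in D (uninterruptible)
# state, to see which lock or wait each hung task is sitting on.
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) ==="
    cat /proc/$pid/stack 2>/dev/null
done
```

Alternatively, `echo w > /proc/sysrq-trigger` logs the same information for all blocked tasks to dmesg in one shot.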
Re: [vfio-users] PLX switch reports a UR when EP tries to DMA to VM's memory
> > Hi,
> >
> > I noticed a problem with a PCIe endpoint, behind a PLX switch, that is
> > assigned to a VM by VFIO.
> > The problem is that the switch reports a UR (Unsupported Request) error
> > when the EP tries to DMA to a memory zone inside the VM's address space.
> > The DMA destination address lies within the VM's RAM address space, but
> > unfortunately that address, from the host's point of view, hits the PLX
> > switch upstream port's BAR0 memory-mapped IO range.
> > As a result, the DMA fails, because the switch considers a memory
> > request invalid when the destination address hits its upstream port's BAR.
> > Is this a hardware bug, or does qemu/seabios fail to maintain a proper
> > address space for the VM?
>
> Upstream switch ports are generally single function devices and therefore
> governed by 6.12.1.3 (PCIe base spec rev 4.0, v1) which indicates an ACS
> capability must not be implemented. We can therefore read into section
> 6.12.2 on interoperability which indicates the interaction between ACS and
> non-ACS components, including:
>
>  * When ACS P2P Request Redirect, ACS P2P Completion Redirect, or both
>    are being used, certain components in the PCI Express hierarchy must
>    support ACS Upstream Forwarding (of Upstream redirected Requests).
>    Specifically:
>    ...
>    Between each ACS component where P2P TLP redirection is enabled and
>    its associated Root Port, any intermediate Switches must support ACS
>    Upstream Forwarding. Otherwise, how such Switches handle Upstream
>    redirected TLPs is undefined.
>
> It's my interpretation therefore that in a configuration where the switch
> downstream ports support ACS, the switch upstream port must implicitly
> support upstream forwarding, thus I would consider this a hardware issue.
> The alternative is that we need to poke holes in the VM address space to
> account for any possible conflict, and assigned device hot-add becomes
> nearly a non-starter. Thanks,
>
> Alex

Thanks for your explanation.

Do you know whether other vendors' switches avoid this problem?

Thanks,
Zongyong Wu
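Since the discussion hinges on which switch ports implement ACS, it may help to survey the hierarchy on a candidate system before buying hardware. A small sketch (needs root for lspci to read the extended capabilities):

```shell
#!/bin/sh
# Print the ACS capability/control bits for every PCI bridge port.
# Downstream switch ports may advertise ACS; per the spec discussion
# above, upstream ports must not implement the capability.
for dev in $(lspci -D | awk '/PCI bridge/ {print $1}'); do
    echo "=== $dev ==="
    lspci -s "$dev" -vvv 2>/dev/null | grep -E 'ACSCap|ACSCtl' \
        || echo "  (no ACS capability)"
done
```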
[vfio-users] PLX switch reports a UR when EP tries to DMA to VM's memory
Hi,

I noticed a problem with a PCIe endpoint, behind a PLX switch, that is assigned to a VM by VFIO. The problem is that the switch reports a UR (Unsupported Request) error when the EP tries to DMA to a memory zone inside the VM's address space.

The DMA destination address lies within the VM's RAM address space, but unfortunately that address, from the host's point of view, hits the PLX switch upstream port's BAR0 memory-mapped IO range. As a result, the DMA fails, because the switch considers a memory request invalid when the destination address hits its upstream port's BAR.

Is this a hardware bug, or does qemu/seabios fail to maintain a proper address space for the VM?

Thanks,
Zongyong Wu
Re: [vfio-users] host crash when assigning 4 NICs to 4 VMs separately
> > Hi,
> >
> > Recently my colleague ran into a kernel crash when he tried to assign
> > 4 NICs to 4 VMs separately.
> > Unfortunately he didn't collect the related logs, and currently we can
> > only see the dmesg output from the core dump.
> >
> > Here is the info:
> >
> > linux:~ # lspci | grep -i eth
> > 02:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> > 02:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> > 02:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> > 02:00.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> > 81:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> > 81:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> > 82:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> > 82:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> >
> > He used the last four NICs.
> >
> > Dmesg:
> >
> > [ 3449.519354] general protection fault: [#1] SMP
> > [ 3449.682056] CPU: 8 PID: 26794 Comm: qemu-kvm Tainted: G OE --- 3.10.0-514.44.5.10_44.x86_64 #1
>
> Are you able to reproduce this on an upstream kernel? Thanks,
>
> Alex

No, I can't reproduce it in our own environment either.
[vfio-users] How does a vfio container relate to an iommu_domain
Hi,

I noted this comment in vfio_iommu_type1_attach_group:

/*
 * Try to match an existing compatible domain. We don't want to
 * preclude an IOMMU driver supporting multiple bus_types and being
 * able to include different bus_types in the same IOMMU domain, so
 * we test whether the domains use the same iommu_ops rather than
 * testing if they're on the same bus_type.
 */

On current Intel x86 platforms, can I assume that a virtual machine has a one-to-one association with a vfio container, and that a vfio container has a one-to-one association with an iommu_domain? Is there any scenario or system where two domains use two different iommu_ops?

Thanks,
Wu Zongyong
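As a related note on the granularity below the container: the host exposes the IOMMU group topology in sysfs, and every device in one group must be attached to the same container. A minimal sketch for inspecting it:

```shell
#!/bin/sh
# List every IOMMU group on the host and the devices it contains.
# All devices within one group must be assigned to the same VFIO
# container, and hence to the same VM.
for g in /sys/kernel/iommu_groups/*; do
    echo "group ${g##*/}:"
    ls "$g/devices"
done
```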
Re: [vfio-users] win7 reports that the device cannot find enough free resources with a GPU behind a pci bridge
> > Hi,
> >
> > I found that win7/win2012r2 reported "This device cannot find enough
> > free resources that it can use" with an NVIDIA GPU passed through
> > behind a PCI bridge.
> > Here is a part of my xml:
> >
> > [libvirt XML snippet mangled by the list archive; only stray
> > function='0x0' address attributes survive]
> >
> > There is a similar problem in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1273172 but I still don't
> > know what the root cause is.
>
> Neither do we. Current status is doesn't work, don't do it. Thanks,
>
> Alex

So does this problem exist in all Windows versions? It seems that there is no problem with Linux.

Thanks,
Wu Zongyong
[vfio-users] win7 reports that the device cannot find enough free resources with a GPU behind a pci bridge
Hi,

I found that win7/win2012r2 reported "This device cannot find enough free resources that it can use" with an NVIDIA GPU passed through behind a PCI bridge. Here is a part of my xml:

[libvirt XML snippet mangled by the list archive; only stray function='0x0' address attributes survive]

There is a similar problem in https://bugzilla.redhat.com/show_bug.cgi?id=1273172 but I still don't know what the root cause is.

Thanks,
Zongyong Wu
Re: [vfio-users] qemu stuck when hot-adding memory to a virtual machine with a device passthrough
> > > > Hi,
> > > >
> > > > The qemu process gets stuck when hot-adding a large amount of
> > > > memory to a virtual machine with a device passthrough.
> > > > We found it is too slow to pin and map pages in vfio_dma_do_map.
> > > > Is there any method to improve this process?
> > >
> > > At what size do you start to see problems? The time to map a
> > > section of memory should be directly proportional to the size. As
> > > the size is increased, it will take longer, but I don't know why
> > > you'd reach a point of not making forward progress. Is it actually
> > > stuck or is it just taking longer than you want? Using hugepages
> > > can certainly help, we still need to pin each PAGE_SIZE page within
> > > the hugepage, but we'll have larger contiguous regions and therefore
> > > call iommu_map() less frequently. Please share more data. Thanks,
> > >
> > > Alex
> >
> > It just takes longer, instead of being actually stuck.
> > We found the problem when we hot-added 16G of memory, and it takes
> > tens of minutes to hot-add 1T.
>
> Is the stall adding 1TB roughly 64 times the stall adding 16GB or do we
> have some inflection in the size vs time curve? There is a cost to
> pinning and mapping through the IOMMU; perhaps we can improve that, but I
> don't see how we can eliminate it or how it wouldn't be at least linear
> compared to the size of memory added without moving to a page request
> model, which hardly any hardware currently supports. A workaround might
> be to incrementally add memory in smaller chunks which generate a less
> noticeable stall. Thanks,
>
> Alex

I collected part of a perf report recorded while I hot-added 24GB of memory:

+  63.41%  0.00%  qemu-kvm  qemu-kvm-2.8.1-25.127  [.] 0xffc7534a
+  63.41%  0.00%  qemu-kvm  [kernel.vmlinux]       [k] do_vfs_ioctl
+  63.41%  0.00%  qemu-kvm  [kernel.vmlinux]       [k] sys_ioctl
+  63.41%  0.00%  qemu-kvm  libc-2.17.so           [.] __GI___ioctl
+  63.41%  0.00%  qemu-kvm  qemu-kvm-2.8.1-25.127  [.] 0xffc71c59
+  63.10%  0.00%  qemu-kvm  [vfio]                 [k] vfio_fops_unl_ioctl
+  63.10%  0.00%  qemu-kvm  qemu-kvm-2.8.1-25.127  [.] 0xffcbbb6a
+  63.10%  0.02%  qemu-kvm  [vfio_iommu_type1]     [k] vfio_iommu_type1_ioctl
+  60.67%  0.31%  qemu-kvm  [vfio_iommu_type1]     [k] vfio_pin_pages_remote
+  60.06%  0.46%  qemu-kvm  [vfio_iommu_type1]     [k] vaddr_get_pfn
+  59.61%  0.95%  qemu-kvm  [kernel.vmlinux]       [k] get_user_pages_fast
+  54.28%  0.02%  qemu-kvm  [kernel.vmlinux]       [k] get_user_pages_unlocked
+  54.24%  0.04%  qemu-kvm  [kernel.vmlinux]       [k] __get_user_pages
+  54.13%  0.01%  qemu-kvm  [kernel.vmlinux]       [k] handle_mm_fault
+  54.08%  0.03%  qemu-kvm  [kernel.vmlinux]       [k] do_huge_pmd_anonymous_page
+  52.09% 52.09%  qemu-kvm  [kernel.vmlinux]       [k] clear_page
+   9.42%  0.12%  swapper   [kernel.vmlinux]       [k] cpu_startup_entry
+   9.20%  0.00%  swapper   [kernel.vmlinux]       [k] start_secondary
+   8.85%  0.02%  swapper   [kernel.vmlinux]       [k] arch_cpu_idle
+   8.79%  0.07%  swapper   [kernel.vmlinux]       [k] cpuidle_idle_call
+   6.16%  0.29%  swapper   [kernel.vmlinux]       [k] apic_timer_interrupt
+   5.73%  0.07%  swapper   [kernel.vmlinux]       [k] smp_apic_timer_interrupt
+   4.34%  0.99%  qemu-kvm  [kernel.vmlinux]       [k] gup_pud_range
+   3.56%  0.16%  swapper   [kernel.vmlinux]       [k] local_apic_timer_interrupt
+   3.32%  0.41%  swapper   [kernel.vmlinux]       [k] hrtimer_interrupt
+   3.25%  3.21%  qemu-kvm  [kernel.vmlinux]       [k] gup_huge_pmd
+   2.31%  0.01%  qemu-kvm  [kernel.vmlinux]       [k] iommu_map
+   2.30%  0.00%  qemu-kvm  [kernel.vmlinux]       [k] intel_iommu_map

It seems that the bottleneck is pinning pages through get_user_pages, rather than the IOMMU mapping itself.

Thanks,
Wu Zongyong
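The profile above is dominated by handle_mm_fault / do_huge_pmd_anonymous_page / clear_page, i.e. the guest pages are being faulted in and zeroed inside the VFIO DMA-map ioctl itself. One mitigation sketch is to back the hotplug DIMM with preallocated hugepages so that cost is paid up front at object creation rather than during the map; the QEMU options shown are standard, but the sizes, paths, and ids here are illustrative:

```shell
#!/bin/sh
# Reserve 24 x 1 GiB hugepages on the host (assumes 1G hugepages are
# enabled and /dev/hugepages is a mounted hugetlbfs).
echo 24 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# Start the guest with a hugepage-backed, preallocated memory backend for
# the hot-pluggable DIMM; prealloc=on faults and zeroes the pages now.
qemu-system-x86_64 \
    ... \
    -object memory-backend-file,id=mem24g,size=24G,mem-path=/dev/hugepages,prealloc=on,share=on \
    -device pc-dimm,id=dimm0,memdev=mem24g
```

This does not remove the per-page pinning cost Alex describes, but it moves the dominant clear_page work out of the ioctl path.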
Re: [vfio-users] qemu stuck when hot-adding memory to a virtual machine with a device passthrough
> > > > Hi,
> > > >
> > > > The qemu process gets stuck when hot-adding a large amount of
> > > > memory to a virtual machine with a device passthrough.
> > > > We found it is too slow to pin and map pages in vfio_dma_do_map.
> > > > Is there any method to improve this process?
> > >
> > > At what size do you start to see problems? The time to map a
> > > section of memory should be directly proportional to the size. As
> > > the size is increased, it will take longer, but I don't know why
> > > you'd reach a point of not making forward progress. Is it actually
> > > stuck or is it just taking longer than you want? Using hugepages
> > > can certainly help, we still need to pin each PAGE_SIZE page within
> > > the hugepage, but we'll have larger contiguous regions and therefore
> > > call iommu_map() less frequently. Please share more data. Thanks,
> > >
> > > Alex
> >
> > It just takes longer, instead of being actually stuck.
> > We found the problem when we hot-added 16G of memory, and it takes
> > tens of minutes to hot-add 1T.
>
> Is the stall adding 1TB roughly 64 times the stall adding 16GB or do we
> have some inflection in the size vs time curve? There is a cost to
> pinning and mapping through the IOMMU; perhaps we can improve that, but I
> don't see how we can eliminate it or how it wouldn't be at least linear
> compared to the size of memory added without moving to a page request
> model, which hardly any hardware currently supports. A workaround might
> be to incrementally add memory in smaller chunks which generate a less
> noticeable stall. Thanks,
>
> Alex

It took about 1 minute to add 16GB and about 40 minutes to add 1TB.
Re: [vfio-users] qemu stuck when hot-adding memory to a virtual machine with a device passthrough
> > Hi,
> >
> > The qemu process gets stuck when hot-adding a large amount of memory
> > to a virtual machine with a device passthrough.
> > We found it is too slow to pin and map pages in vfio_dma_do_map.
> > Is there any method to improve this process?
>
> At what size do you start to see problems? The time to map a section of
> memory should be directly proportional to the size. As the size is
> increased, it will take longer, but I don't know why you'd reach a point
> of not making forward progress. Is it actually stuck or is it just taking
> longer than you want? Using hugepages can certainly help, we still need
> to pin each PAGE_SIZE page within the hugepage, but we'll have larger
> contiguous regions and therefore call iommu_map() less frequently. Please
> share more data. Thanks,
>
> Alex

It just takes longer, instead of being actually stuck. We found the problem when we hot-added 16G of memory, and it takes tens of minutes to hot-add 1T.

Thanks,
Wu Zongyong
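Alex's workaround of adding memory in smaller chunks can be scripted from the host. A hedged sketch using libvirt's HMP passthrough, where the domain name, chunk size, and id scheme are all illustrative (the guest must have enough free DIMM slots and maxmem headroom configured):

```shell
#!/bin/sh
# Hot-add 1 TiB as 64 x 16 GiB DIMMs, one at a time, so each VFIO
# pin/map pass produces a short stall instead of one ~40 minute stall.
DOM=myvm   # illustrative domain name
for i in $(seq 0 63); do
    virsh qemu-monitor-command "$DOM" --hmp \
        "object_add memory-backend-ram,id=mem$i,size=16G"
    virsh qemu-monitor-command "$DOM" --hmp \
        "device_add pc-dimm,id=dimm$i,memdev=mem$i"
done
```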
[vfio-users] qemu stuck when hot-adding memory to a virtual machine with a device passthrough
Hi,

The qemu process gets stuck when hot-adding a large amount of memory to a virtual machine with a device passthrough. We found it is too slow to pin and map pages in vfio_dma_do_map. Is there any method to improve this process?

Thanks,
Zongyong Wu
[vfio-users] Is there some method to merge 2 iommu groups if I disable ACS?
Hi,

Assume that an endpoint device (call it ep1) belongs to iommu group 1, and another endpoint device (call it ep2) belongs to iommu group 2. Moreover, these two devices sit behind different downstream ports of the same switch. If I disable the ACS of these downstream ports, we know ep1 and ep2 should end up in the same iommu group.

So the question is: can I regenerate the iommu grouping, without rebooting the host, so that ep1 and ep2 land in the same iommu group and can no longer be assigned to two different VMs?

Thanks,
Cordius Wu
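Since IOMMU groups are evaluated when a device is added to the bus, one untested sketch, assuming ACS has already been disabled on the downstream ports, is to remove the endpoints and rescan the bus so grouping is recomputed without a reboot. The device addresses below are illustrative, and the devices must be unbound from any driver/VM first:

```shell
#!/bin/sh
# Hedged sketch: force re-evaluation of IOMMU grouping by removing the
# two endpoints and rescanning the PCI bus (addresses are examples).
echo 1 > /sys/bus/pci/devices/0000:03:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:04:00.0/remove
echo 1 > /sys/bus/pci/rescan

# Inspect the resulting groups to see whether ep1 and ep2 now share one:
for g in /sys/kernel/iommu_groups/*; do
    echo "group ${g##*/}: $(ls "$g/devices")"
done
```

Whether this works depends on the kernel honoring the downstream ports' live ACS state when the devices are re-added; treat it as an experiment, not a guarantee.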