Re: [PATCH v2] iommu/amd: Fix interrupt remapping when disable guest_mode
On 6/28/17 19:44, Joerg Roedel wrote:
> On Mon, Jun 26, 2017 at 04:28:04AM -0500, Suravee Suthikulpanit wrote:
>> Devices passed through to a VM guest can get updated IRQ affinity
>> information via irq_set_affinity() when not running in guest mode.
>> Currently, the AMD IOMMU driver in GA mode ignores the updated
>> information if the pass-through device is set up to use vAPIC,
>> regardless of guest_mode. This could cause invalid interrupt remapping.
>>
>> Also, the guest_mode bit should be set and cleared only when SVM
>> updates posted-interrupt remapping information.
>>
>> Signed-off-by: Suravee Suthikulpanit
>> Cc: Joerg Roedel
>
> Nevermind, I added this Fixes line and applied the patch:
>
> Fixes: d98de49a53e48 ('iommu/amd: Enable vAPIC interrupt remapping mode by default')

Thanks,
Suravee
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/vt-d: Don't free parent pagetable of the PTE we're adding
When adding a large scatterlist entry that covers more than the L3 superpage size (1GB) but has an alignment such that we must use L2 superpages (2MB), we give dma_pte_free_level() a range that causes it to free the L3 pagetable we're about to populate. We fix this by telling dma_pte_free_pagetable() about the pagetable level we're about to populate, to prevent freeing it.

For example, mapping a scatterlist with entry lengths 854MB and 1194MB at IOVA 0x8000 would, when processing the 2MB-aligned second entry, cause pfn_to_dma_pte() to create an L3 directory to hold L2 superpages for the mapping at IOVA 0xc000. We would previously call dma_pte_free_pagetable(domain, 0xc, 0xf), which would free the L3 directory pfn_to_dma_pte() just created for IO PFN 0xc. Telling dma_pte_free_pagetable() to retain the L3 directories while using L2 superpages avoids the erroneous free.

Signed-off-by: David Dillow
---
 drivers/iommu/intel-iommu.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index fc2765ccdb57..3a8036f60367 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1137,8 +1137,9 @@ static void dma_pte_clear_range(struct dmar_domain *domain,
 }
 
 static void dma_pte_free_level(struct dmar_domain *domain, int level,
-			       struct dma_pte *pte, unsigned long pfn,
-			       unsigned long start_pfn, unsigned long last_pfn)
+			       int retain_level, struct dma_pte *pte,
+			       unsigned long pfn, unsigned long start_pfn,
+			       unsigned long last_pfn)
 {
 	pfn = max(start_pfn, pfn);
 	pte = &pte[pfn_level_offset(pfn, level)];
@@ -1153,12 +1154,17 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
 		level_pfn = pfn & level_mask(level);
 		level_pte = phys_to_virt(dma_pte_addr(pte));
 
-		if (level > 2)
-			dma_pte_free_level(domain, level - 1, level_pte,
-					   level_pfn, start_pfn, last_pfn);
+		if (level > 2) {
+			dma_pte_free_level(domain, level - 1, retain_level,
+					   level_pte, level_pfn, start_pfn,
+					   last_pfn);
+		}
 
-		/* If range covers entire pagetable, free it */
-		if (!(start_pfn > level_pfn ||
+		/*
+		 * Free the page table if we're below the level we want to
+		 * retain and the range covers the entire table.
+		 */
+		if (level < retain_level && !(start_pfn > level_pfn ||
 		      last_pfn < level_pfn + level_size(level) - 1)) {
 			dma_clear_pte(pte);
 			domain_flush_cache(domain, pte, sizeof(*pte));
@@ -1169,10 +1175,14 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
 	} while (!first_pte_in_page(++pte) && pfn <= last_pfn);
 }
 
-/* clear last level (leaf) ptes and free page table pages. */
+/*
+ * clear last level (leaf) ptes and free page table pages below the
+ * level we wish to keep intact.
+ */
 static void dma_pte_free_pagetable(struct dmar_domain *domain,
 				   unsigned long start_pfn,
-				   unsigned long last_pfn)
+				   unsigned long last_pfn,
+				   int retain_level)
 {
 	BUG_ON(!domain_pfn_supported(domain, start_pfn));
 	BUG_ON(!domain_pfn_supported(domain, last_pfn));
@@ -1181,7 +1191,7 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 	dma_pte_clear_range(domain, start_pfn, last_pfn);
 
 	/* We don't need lock here; nobody else touches the iova range */
-	dma_pte_free_level(domain, agaw_to_level(domain->agaw),
+	dma_pte_free_level(domain, agaw_to_level(domain->agaw), retain_level,
 			   domain->pgd, 0, start_pfn, last_pfn);
 
 	/* free pgd */
@@ -2277,8 +2287,11 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			/*
 			 * Ensure that old small page tables are
 			 * removed to make room for superpage(s).
+			 * We're adding new large pages, so make sure
+			 * we don't remove their parent tables.
 			 */
-			dma_pte_free_pagetable(domain, iov_pfn, end_pfn);
+			dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+					       largepage_lvl + 1);
Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
On 2017/6/28 17:32, Will Deacon wrote:
> Hi Zhen Lei,
>
> Nate (CC'd), Robin and I have been working on something very similar to
> this series, but this patch is different to what we had planned. More below.
>
> On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:
>> Because all TLBI commands should be followed by a SYNC command to make
>> sure they have completely finished, we can just add the TLBI commands
>> into the queue and put off execution until we meet a SYNC or other
>> command. To prevent a following SYNC command from waiting a long time
>> because too many commands have been delayed, restrict the maximum
>> delayed number.
>>
>> According to my test, I got the same performance data as when I
>> replaced writel with writel_relaxed in queue_inc_prod.
>>
>> Signed-off-by: Zhen Lei
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 37 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 291da5f..4481123 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -337,6 +337,7 @@
>>  /* Command queue */
>>  #define CMDQ_ENT_DWORDS		2
>>  #define CMDQ_MAX_SZ_SHIFT	8
>> +#define CMDQ_MAX_DELAYED	32
>>
>>  #define CMDQ_ERR_SHIFT		24
>>  #define CMDQ_ERR_MASK		0x7f
>> @@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent {
>>  		};
>>  	} cfgi;
>>
>> +#define CMDQ_OP_TLBI_NH_ALL	0x10
>>  #define CMDQ_OP_TLBI_NH_ASID	0x11
>>  #define CMDQ_OP_TLBI_NH_VA	0x12
>>  #define CMDQ_OP_TLBI_EL2_ALL	0x20
>> @@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent {
>>
>>  struct arm_smmu_queue {
>>  	int			irq; /* Wired interrupt */
>> +	u32			nr_delay;
>>
>>  	__le64			*base;
>>  	dma_addr_t		base_dma;
>> @@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q)
>>  	return ret;
>>  }
>>
>> -static void queue_inc_prod(struct arm_smmu_queue *q)
>> +static void queue_inc_swprod(struct arm_smmu_queue *q)
>>  {
>> -	u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
>> +	u32 prod = q->prod + 1;
>>
>>  	q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
>> +}
>> +
>> +static void queue_inc_prod(struct arm_smmu_queue *q)
>> +{
>> +	queue_inc_swprod(q);
>>  	writel(q->prod, q->prod_reg);
>>  }
>>
>> @@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
>>  		*dst++ = cpu_to_le64(*src++);
>>  }
>>
>> -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
>> +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize)
>>  {
>>  	if (queue_full(q))
>>  		return -ENOSPC;
>>
>>  	queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
>> -	queue_inc_prod(q);
>> +
>> +	/*
>> +	 * We don't want too many commands to be delayed, this may lead the
>> +	 * followed sync command to wait for a long time.
>> +	 */
>> +	if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
>> +		queue_inc_swprod(q);
>> +	} else {
>> +		queue_inc_prod(q);
>> +		q->nr_delay = 0;
>> +	}
>> +
>
> So here, you're effectively putting invalidation commands into the command
> queue without updating PROD. Do you actually see a performance advantage
> from doing so? Another side of the argument would be that we should be

Yes, my SAS SSD performance test showed that it can improve throughput by about 100-150K/s (the same as when I directly replaced writel with writel_relaxed). And the average execution time of iommu_unmap (which is called by iommu_dma_unmap_sg) dropped from 10us to 5us.

> moving PROD as soon as we can, so that the SMMU can process invalidation
> commands in the background and reduce the cost of the final SYNC operation
> when the high-level unmap operation is complete.

It may be that __iowmb() is more expensive than waiting for the TLBI to complete. Besides the cost of __iowmb() itself, it is also protected by a spinlock, so lock contention rises rapidly under stress. __iowmb() costs 300-500ns on average (sorry, I forget the exact value).

In addition, I applied this patch set, Robin's v2, and my earlier dma64 IOVA optimization patch set.
With those applied, our network performance test matched the global-bypass numbers, but SAS SSDs still dropped more than 20%. Maybe we should keep focusing on map/unmap, because the average execution time of IOVA alloc/free is only about 400ns.

By the way, patches 2-5 are more effective than this one: they improve throughput by more than 350K/s, and on top of them we got about a 100-150K/s improvement from Robin's v2. Otherwise, I saw no improvement from Robin's v2. Sorry, I have not tested how this patch behaves without
Re: AMD Ryzen KVM/NPT/IOMMU issue
Some more data from 3DMark benchmarks:

Time Spy (DirectX 12):
- Graphics test 1:
  - npt=0: 37.65 FPS
  - npt=1: 24.22 FPS (36% drop)
- Graphics test 2:
  - npt=0: 33.05 FPS
  - npt=1: 29.65 FPS (10% drop)
- CPU test:
  - npt=0: 17.35 FPS
  - npt=1: 12.03 FPS (31% drop)

Fire Strike (DirectX 11):
- Graphics test 1:
  - npt=0: 80.56 FPS
  - npt=1: 41.89 FPS (49% drop)
- Graphics test 2:
  - npt=0: 70.64 FPS
  - npt=1: 60.75 FPS (14% drop)
- Physics test:
  - npt=0: 50.14 FPS
  - npt=1: 5.78 FPS (89% drop)
- Combined test:
  - npt=0: 32.83 FPS
  - npt=1: 17.70 FPS (47% drop)

Sky Diver (DirectX 11):
- Graphics test 1:
  - npt=0: 248.81 FPS
  - npt=1: 173.63 FPS (31% drop)
- Graphics test 2:
  - npt=0: 250.49 FPS
  - npt=1: 124.84 FPS (51% drop)
- Physics test:
  - 8 threads:
    - npt=0: 140.93 FPS
    - npt=1: 119.08 FPS (15% drop)
  - 24 threads:
    - npt=0: 110.22 FPS
    - npt=1: 74.55 FPS (33% drop)
  - 48 threads:
    - npt=0: 71.56 FPS
    - npt=1: 45.93 FPS (36% drop)
  - 96 threads:
    - npt=0: 41.04 FPS
    - npt=1: 24.81 FPS (40% drop)
- Combined test:
  - npt=0: 75.65 FPS
  - npt=1: 50.45 FPS (33% drop)
Re: AMD Ryzen KVM/NPT/IOMMU issue
On Wed, Jun 28, 2017 at 9:50 PM, Thiago Padilha wrote: > Some more data from 3DMark benchmarks: > > Time Spy (DirectX 12): > - Graphics test 1: > - npt=0: 37.65 FPS > - npt=1: 24.22 FPS (36% drop) > - Graphics test 2: > - npt=0: 33.05 FPS > - npt=1: 29.65 FPS (10% drop) > - CPU test: > - npt=0: 17.35 FPS > - npt=1: 12.03 FPS (31% drop) > > Fire Strike (DirectX 11): > - Graphics test 1: > - npt=0: 80.56 FPS > - npt=1: 41.89 FPS (49% drop) > - Graphics test 2: > - npt=0: 70.64 FPS > - npt=1: 60.75 FPS (14% drop) > - Physics test: > - npt=0: 50.14 FPS > - npt=1: 5.78 FPS (89% drop) > - Combined test: > - npt=0: 32.83 FPS > - npt=1: 17.70 FPS (47% drop) > > Sky Diver (DirectX 11): > - Graphics test 1: > - npt=0: 248.81 FPS > - npt=1: 173.63 FPS (31% drop) > - Graphics test 2: > - npt=0: 250.49 FPS > - npt=1: 124.84 FPS (51% drop) > - Physics test: > - 8 threads: > - npt=0: 140.93 FPS > - npt=1: 119.08 FPS (15% drop) > - 24 threads: > - npt=0: 110.22 FPS > - npt=1: 74.55 FPS (33% drop) > - 48 threads: > - npt=0: 71.56 FPS > - npt=1: 45.93 FPS (36% drop) > - 96 threads: > - npt=0: 41.04 FPS > - npt=1: 24.81 FPS (40% drop) > - Combined test: > - npt=0: 75.65 FPS > - npt=1: 50.45 FPS (33% drop)

Hi Thiago,

Thanks for the data, I'm sure it will be useful. Sorry, I should have noted that I've only tested a few games/benchmarks and never any non-intensive loads.

Sarnex
Re: AMD Ryzen KVM/NPT/IOMMU issue
On Wed, Jun 28, 2017 at 7:34 PM, Nick Sarniewrote: > Hi Suravee, > > Thanks a lot for helping. Torcs does not appear graphically demanding > on modern hardware, so this issue may not be easily noticeable. I was > able to easily reproduce the problem using the Unigine Heaven > benchmark, but I'm sure anything moderately graphically demanding > would show a performance loss with NPT enabled. As an example, when I > tested this with Fedora on my RX480, I got around 30-35 FPS with NPT > on and around 55-60 with NPT off. > > Let me know if you need any more information or have any questions. > > (no problem John, thanks a lot for taking interest in this) > > Thanks again, > Sarnex Hi I don't think the FPS drop is proportional to how graphically demanding the workload is. On the contrary, at first sight it would seem like the less demanding a workload is, the bigger the FPS impact suffered, though as some numbers I will show in a moment suggest, this is not always the case. Unfortunately I haven't been able to find a pattern to what causes the most impact in FPS except that the relative drop increases with higher FPS values. Other than that, it seems very specific to the workload/benchmark used. Here's some data I've collected to help with the investigation. The system is Ryzen 1700 (no overclock, 3ghz), GTX 1070, windows 10 guest. I've used Unigine Heaven and Passmark's PerformanceTest 9.0. 
First, the Heaven benchmark with ultra settings at 1920x1080:
- DirectX 11:
  - npt=0: 87.0 fps
  - npt=1: 78.4 fps (10% drop)
- DirectX 9:
  - npt=0: 100.0 fps
  - npt=1: 66.4 fps (33% drop)
- OpenGL:
  - npt=0: 82.5 fps
  - npt=1: 35.2 fps (58% drop)

Heaven benchmark again, this time with low settings at 1280x720:
- DirectX 11:
  - npt=0: 182.5 fps
  - npt=1: 140.1 fps (25% drop)
- DirectX 9:
  - npt=0: 169.2 fps
  - npt=1: 74.1 fps (56% drop)
- OpenGL:
  - npt=0: 202.8 fps
  - npt=1: 45.0 fps (78% drop)

PerformanceTest 9.0 3D benchmark:
- DirectX 9:
  - npt=0: 157 fps
  - npt=1: 13 fps (92% drop)
- DirectX 10:
  - npt=0: 220 fps
  - npt=1: 212 fps (4% drop)
- DirectX 11:
  - npt=0: 234 fps
  - npt=1: 140 fps (40% drop)
- DirectX 12:
  - npt=0: 88 fps (scored 35 because of the penalized FPS of not being able to run at 4k)
  - npt=1: 4.5 fps (scored 1, 95% drop)
- GPU Compute:
  - Mandel:
    - npt=0: ~2000 fps
    - npt=1: ~2000 fps
  - Bitonic Sort:
    - npt=0: ~153583696.0 elements/sec
    - npt=1: ~106233376.0 elements/sec (31% drop)
  - QJulia4D:
    - npt=0: ~1000 fps
    - npt=1: ~1000 fps
  - OpenCL:
    - npt=0: ~750 fps
    - npt=1: ~220 fps

As you can see, in some cases there's only about a 5% drop (which could be within the margin of error), while in others the drop is as high as 95%. Some points of interest:
- Passmark DirectX 9 is not graphically demanding (it runs at 1024x768; a GTX 1070 doesn't break a sweat), yet it suffers a 92% drop in FPS.
- Unigine DirectX 11 on ultra is graphically demanding and suffers less than a 10% drop in FPS.
- Passmark DirectX 12 is graphically demanding and suffers a 95% drop in FPS.
- The bitonic sort is not a graphical benchmark; it shows its results (average number of sorted elements/sec) in a console window, yet it suffers a 31% drop in performance.

I think it would take someone with experience in GPU programming, and with knowledge of what each benchmark does, to find a pattern in these numbers.
Thiago
Re: AMD Ryzen KVM/NPT/IOMMU issue
On Wed, Jun 28, 2017 at 4:33 PM, Paolo Bonziniwrote: > > > On 28/06/2017 21:52, Graham Neville wrote: >> Although not related to graphics card performance, there is definitely >> another issue with regards to running KVM nested L2 guests when npt=1. >> >> Thought I'd mention this in case it helps with identifying performance >> issues with NPT. >> >> I'm unable to start any L2 guests with KVM acceleration (--enable-kvm). >> As soon as it attempts to bring up the L2 guest the L1 host crashes, L0 >> host remains online. Nothing is printed in either L1 or L0's dmesg. >> >> My L0 is running Arch with 4.11.0-rc6, with qemu 2.8.0. I've tried >> different L1 hosts (Ubuntu,Arch) and different kernels right to 4.12-rc5 >> kernel, along with different qemu versions. >> >> This used to work fine with my Intel i7-4770s setup. >> >> With npt=0, L2 guests can start but performance is dier. > > Nested AMD needs some care. It's known, but time has been lacking... > > Paolo > ___ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu Hi Suravee, Thanks a lot for helping. Torcs does not appear graphically demanding on modern hardware, so this issue may not be easily noticeable. I was able to easily reproduce the problem using the Unigine Heaven benchmark, but I'm sure anything moderately graphically demanding would show a performance loss with NPT enabled. As an example, when I tested this with Fedora on my RX480, I got around 30-35 FPS with NPT on and around 55-60 with NPT off. Let me know if you need any more information or have any questions. (no problem John, thanks a lot for taking interest in this) Thanks again, Sarnex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v1 3/3] iommu/amd: Optimize the IOMMU queue flush
> -----Original Message-----
> From: Joerg Roedel [mailto:j...@8bytes.org]
> Sent: Wednesday, June 28, 2017 4:37 AM
> To: Jan Vesely; Deucher, Alexander
> Cc: Lendacky, Thomas; Nath, Arindam; Craig Stein; iommu@lists.linux-foundation.org; Duran, Leo; Suthikulpanit, Suravee
> Subject: Re: [PATCH v1 3/3] iommu/amd: Optimize the IOMMU queue flush
>
> [Adding Alex Deucher]
>
> Hey Alex,
>
> On Tue, Jun 27, 2017 at 12:24:35PM -0400, Jan Vesely wrote:
> > On Mon, 2017-06-26 at 14:14 +0200, Joerg Roedel wrote:
> > > > How does that 'dGPU goes to sleep' work? Do you put it to sleep
> > > > manually via sysfs or something? Or is that something that amdgpu
> > > > does on its own?
> >
> > AMD folks should be able to provide more details. AFAIK, the driver
> > uses ACPI methods to power on/off the device. Driver routines wake the
> > device up before accessing it, and there is a timeout to turn it off
> > after a few seconds of inactivity.
> >
> > > > It looks like the GPU just switches the ATS unit off when it goes to
> > > > sleep and doesn't answer the invalidation anymore, which explains the
> > > > completion-wait timeouts.
> >
> > Both MMIO regs and PCIe config regs are turned off, so it would not
> > surprise me if all PCIe requests were ignored by the device in the off
> > state. It should be possible to request device wake-up before
> > invalidating the relevant IOMMU domain. I'll leave it to more
> > knowledgeable people to judge whether it's a good idea (we can also
> > postpone such invalidations until the device is woken by other means).
>
> Can you maybe shed some light on how the sleep mode of the GPUs works?
> Is it initiated by the GPU driver or from somewhere else? In the case
> discussed here it looks like the ATS unit of the GPU is switched off,
> causing IOTLB invalidation timeouts on the IOMMU side.
>
> If that is the case we might need some sort of dma-api extension so that
> the GPU driver can tell the iommu driver that the device is going to be
> quiet.
I assume you are talking about Hybrid/PowerXpress laptops where the dGPU can be powered down dynamically? That is done via the runtime PM subsystem in the kernel. We register several callbacks with it, and then it takes care of the power-down auto timers and such. The actual mechanism to power down the GPU varies from platform to platform (platform-specific ACPI methods on early systems, D3cold on newer ones).

Alex
Re: AMD Ryzen KVM/NPT/IOMMU issue
On 28/06/2017 21:52, Graham Neville wrote:
> Although not related to graphics card performance, there is definitely
> another issue with regards to running KVM nested L2 guests when npt=1.
>
> Thought I'd mention this in case it helps with identifying performance
> issues with NPT.
>
> I'm unable to start any L2 guests with KVM acceleration (--enable-kvm).
> As soon as it attempts to bring up the L2 guest the L1 host crashes; the
> L0 host remains online. Nothing is printed in either L1 or L0's dmesg.
>
> My L0 is running Arch with 4.11.0-rc6, with qemu 2.8.0. I've tried
> different L1 hosts (Ubuntu, Arch) and different kernels right up to the
> 4.12-rc5 kernel, along with different qemu versions.
>
> This used to work fine with my Intel i7-4770s setup.
>
> With npt=0, L2 guests can start but performance is dire.

Nested AMD needs some care. It's known, but time has been lacking...

Paolo
Re: AMD Ryzen KVM/NPT/IOMMU issue
Although not related to graphics card performance, there is definitely another issue with regards to running KVM nested L2 guests when npt=1.

Thought I'd mention this in case it helps with identifying performance issues with NPT.

I'm unable to start any L2 guests with KVM acceleration (--enable-kvm). As soon as it attempts to bring up the L2 guest the L1 host crashes; the L0 host remains online. Nothing is printed in either L1 or L0's dmesg.

My L0 is running Arch with 4.11.0-rc6, with qemu 2.8.0. I've tried different L1 hosts (Ubuntu, Arch) and different kernels right up to the 4.12-rc5 kernel, along with different qemu versions.

This used to work fine with my Intel i7-4770s setup.

With npt=0, L2 guests can start but performance is dire.

On Wed, Jun 28, 2017 at 7:29 PM, Bridgman, John wrote:
> > From: Alex Williamson
> > Sent: June 28, 2017 3:08 PM
> > To: Suthikulpanit, Suravee
> > Cc: Steven Walter; Nick Sarnie; Paolo Bonzini; iommu@lists.linux-foundation.org; Matthias Ehrenfeuchter; k...@vger.kernel.org; Bridgman, John
> > Subject: Re: AMD Ryzen KVM/NPT/IOMMU issue
> >
> > On Thu, 29 Jun 2017 01:53:57 +0700
> > Suravee Suthikulpanit wrote:
> >
> > > On 6/29/17 00:26, Steven Walter wrote:
> > >> So, I'm trying to reproduce this issue on the Ryzen system w/ the following
> > >> setup:
> > >>
> > >> * Host kernel v4.11 (with this patch https://lkml.org/lkml/2017/6/23/295)
> > >>
> > >> * guest VM RHEL7.3
> > >>
> > >> * guest graphic driver = radeon
> > >>
> > >> * qemu-system-x86_64 --version
> > >> QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty)
> > >>
> > >> * kvm-amd npt=1
> > >>
> > >> * dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > >> [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81) > > >> > > >> * qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host > -bios > > >> /usr/share/qemu/bios.bin -device > > >> ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1, > chassis=1,id=root.1 > > >> -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net > none > > >> -vga none -nodefaults -device > > >> vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on, > bus=root.1,romfile=/sandbox/vm-images/vbios.rom > > >> -usb -device usb-host,hostbus=3,hostport=1 -device > > >> usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 > -device > > >> vfio-pci,host=:09:00.0 > > >> > > >> With this setup, I am able to pass-through the dGPU and run the > following > > >> test: > > >> * Starting up the guest w/ full GNOME GUI on the attached monitor. > > >> * glxgears (running @ 60 FPS) > > >> * Playing 1080p HD video on Youtube > > >> > > >> I am not noticing issues here. What kind of test are you running in > the > > >> guest VM? > > > Try running the open source game "torcs" inside the VM. I think > > > you'll find that there's a very noticeable performance different > > > between npt=1 and npt=0 > > > > Hm.. actually torcs seems to be running fine w/ npt=1 setup. Although I > think my > > driving skill is ~20% worse :( > > >A clarification on the issue, it's not that these games/benchmarks > >don't work or aren't playable, it's that they run slower (as measured by > >framerate) with npt=1 vs npt=0. A virtualized guest on AMD hardware is > >hindered either by lower graphics performance (npt=1) or higher CPU > >virtualization overhead (npt=0) for high intensity games or graphics > >workloads. Intel's equivalent ept feature does not have this issue. I > >would encourage looking at the framerate for one mode vs the other > >before drawing any conclusions on whether it's working well. 
Thanks, > >Alex > > One more data point - Nick did some testing with Xen enabling/disabling > npt and > found that (a) performance was not affected much whether npt was on or > off, and > (b) performance with npt on was pretty close to KVM performance with npt > off. > > This suggests something specific to KVM that doesn't play well with npt, > although > I have no idea what that might be. I was going to talk to our folks to see > if they had > any suggestions re: ways to narrow down where the performance impact is > coming > from, but I ended up going off sick instead. > > Haven't gone through the whole thread here yet so apologies if this has > already been > mentioned (and Nick sorry for the delay). > > ___ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: AMD Ryzen KVM/NPT/IOMMU issue
From: Alex WilliamsonSent: June 28, 2017 3:08 PM To: Suthikulpanit, Suravee Cc: Steven Walter; Nick Sarnie; Paolo Bonzini; iommu@lists.linux-foundation.org; Matthias Ehrenfeuchter; k...@vger.kernel.org; Bridgman, John Subject: Re: AMD Ryzen KVM/NPT/IOMMU issue On Thu, 29 Jun 2017 01:53:57 +0700 Suravee Suthikulpanit wrote: > On 6/29/17 00:26, Steven Walter wrote: > >> So, I'm trying to reproduce this issue on the Ryzen system w/ the following > >> setup: > >> > >> * Host kernel v4.11 (with this patch > >>https://lkml.org/lkml/2017/6/23/295) > >> > >> * guest VM RHEL7.3 > >> > >> * guest graphic driver = radeon > >> > >> * qemu-system-x86_64 --version > >> QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty) > >> > >> * kvm-amd npt=1 > >> > >> * dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. > >> [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81) > >> > >> * qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host -bios > >> /usr/share/qemu/bios.bin -device > >> ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 > >> -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net none > >> -vga none -nodefaults -device > >> vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1,romfile=/sandbox/vm-images/vbios.rom > >> -usb -device usb-host,hostbus=3,hostport=1 -device > >> usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 -device > >> vfio-pci,host=:09:00.0 > >> > >> With this setup, I am able to pass-through the dGPU and run the following > >> test: > >> * Starting up the guest w/ full GNOME GUI on the attached monitor. > >> * glxgears (running @ 60 FPS) > >> * Playing 1080p HD video on Youtube > >> > >> I am not noticing issues here. What kind of test are you running in the > >> guest VM? > > Try running the open source game "torcs" inside the VM. I think > > you'll find that there's a very noticeable performance different > > between npt=1 and npt=0 > > Hm.. 
actually torcs seems to be running fine w/ npt=1 setup. Although I think > my > driving skill is ~20% worse :( >A clarification on the issue, it's not that these games/benchmarks >don't work or aren't playable, it's that they run slower (as measured by >framerate) with npt=1 vs npt=0. A virtualized guest on AMD hardware is >hindered either by lower graphics performance (npt=1) or higher CPU >virtualization overhead (npt=0) for high intensity games or graphics >workloads. Intel's equivalent ept feature does not have this issue. I >would encourage looking at the framerate for one mode vs the other >before drawing any conclusions on whether it's working well. Thanks, >Alex One more data point - Nick did some testing with Xen enabling/disabling npt and found that (a) performance was not affected much whether npt was on or off, and (b) performance with npt on was pretty close to KVM performance with npt off. This suggests something specific to KVM that doesn't play well with npt, although I have no idea what that might be. I was going to talk to our folks to see if they had any suggestions re: ways to narrow down where the performance impact is coming from, but I ended up going off sick instead. Haven't gone through the whole thread here yet so apologies if this has already been mentioned (and Nick sorry for the delay). ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: AMD Ryzen KVM/NPT/IOMMU issue
On Thu, 29 Jun 2017 01:53:57 +0700 Suravee Suthikulpanit wrote:
> On 6/29/17 00:26, Steven Walter wrote:
> >> So, I'm trying to reproduce this issue on the Ryzen system w/ the following setup:
> >>
> >> * Host kernel v4.11 (with this patch https://lkml.org/lkml/2017/6/23/295)
> >> * guest VM RHEL7.3
> >> * guest graphic driver = radeon
> >> * qemu-system-x86_64 --version
> >>   QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty)
> >> * kvm-amd npt=1
> >> * dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81)
> >> * qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host -bios /usr/share/qemu/bios.bin -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net none -vga none -nodefaults -device vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1,romfile=/sandbox/vm-images/vbios.rom -usb -device usb-host,hostbus=3,hostport=1 -device usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 -device vfio-pci,host=:09:00.0
> >>
> >> With this setup, I am able to pass-through the dGPU and run the following test:
> >> * Starting up the guest w/ full GNOME GUI on the attached monitor.
> >> * glxgears (running @ 60 FPS)
> >> * Playing 1080p HD video on Youtube
> >>
> >> I am not noticing issues here. What kind of test are you running in the guest VM?
> > Try running the open source game "torcs" inside the VM. I think you'll find that there's a very noticeable performance difference between npt=1 and npt=0
> Hm.. actually torcs seems to be running fine w/ npt=1 setup. Although I think my driving skill is ~20% worse :(

A clarification on the issue, it's not that these games/benchmarks don't work or aren't playable, it's that they run slower (as measured by framerate) with npt=1 vs npt=0. A virtualized guest on AMD hardware is hindered either by lower graphics performance (npt=1) or higher CPU virtualization overhead (npt=0) for high intensity games or graphics workloads. Intel's equivalent ept feature does not have this issue. I would encourage looking at the framerate for one mode vs the other before drawing any conclusions on whether it's working well.

Thanks,

Alex
Re: AMD Ryzen KVM/NPT/IOMMU issue
On 6/29/17 00:26, Steven Walter wrote:
>> So, I'm trying to reproduce this issue on the Ryzen system w/ the following setup:
>>
>> * Host kernel v4.11 (with this patch https://lkml.org/lkml/2017/6/23/295)
>> * guest VM RHEL7.3
>> * guest graphic driver = radeon
>> * qemu-system-x86_64 --version
>>   QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty)
>> * kvm-amd npt=1
>> * dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81)
>> * qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host -bios /usr/share/qemu/bios.bin -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net none -vga none -nodefaults -device vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1,romfile=/sandbox/vm-images/vbios.rom -usb -device usb-host,hostbus=3,hostport=1 -device usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 -device vfio-pci,host=:09:00.0
>>
>> With this setup, I am able to pass-through the dGPU and run the following test:
>> * Starting up the guest w/ full GNOME GUI on the attached monitor.
>> * glxgears (running @ 60 FPS)
>> * Playing 1080p HD video on Youtube
>>
>> I am not noticing issues here. What kind of test are you running in the guest VM?
> Try running the open source game "torcs" inside the VM. I think you'll find that there's a very noticeable performance difference between npt=1 and npt=0

Hm.. actually torcs seems to be running fine w/ npt=1 setup. Although I think my driving skill is ~20% worse :(

S
Re: [PATCH v3 2/2] acpi/iort: numa: Add numa node mapping for smmuv3 devices
On 15.06.17 14:46:03, Lorenzo Pieralisi wrote:
> On Thu, Jun 08, 2017 at 10:14:19AM +0530, Ganapatrao Kulkarni wrote:
> > Add code to parse proximity domain in SMMUv3 IORT table to
> > set numa node mapping for smmuv3 devices.
> >
> > Signed-off-by: Ganapatrao Kulkarni
> > ---
> >  drivers/acpi/arm64/iort.c | 28 ++--
> >  1 file changed, 26 insertions(+), 2 deletions(-)
>
> I am happy to take this but I want to know what we shall do with
> patch 1 and related ACPICA changes first.

The change is now in acpica:

 https://github.com/acpica/acpica/commit/8cadc4fb500e2aa52241e367c87a0f95d9760c58

So we could guard the code with an #ifdef until that patch is pulled in via the acpica tree:

> > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
> > index bba2b59..e804386 100644
> > --- a/drivers/acpi/arm64/iort.c
> > +++ b/drivers/acpi/arm64/iort.c
> > @@ -882,6 +882,23 @@ static bool __init arm_smmu_v3_is_coherent(struct acpi_iort_node *node)
> >  	return smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE;
> >  }
> >
> > +/*
> > + * set numa proximity domain for smmuv3 device
> > + */
> > +static void __init arm_smmu_v3_set_proximity(struct acpi_iort_node *node,
> > +					     struct device *dev)
> > +{

#ifdef ACPI_IORT_SMMU_V3_PXM_VALID

> > +	struct acpi_iort_smmu_v3 *smmu;
> > +
> > +	smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
> > +	if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
> > +		set_dev_node(dev, acpi_map_pxm_to_node(smmu->pxm));
> > +		pr_info("SMMUV3[%llx] Mapped to Proximity domain %d\n",
> > +			smmu->base_address,
> > +			smmu->pxm);
> > +	}

#endif

> > +}
> > +

Could the patch be applied with this change?

Thanks,

-Robert

> >  static int __init arm_smmu_count_resources(struct acpi_iort_node *node)
> >  {
> >  	struct acpi_iort_smmu *smmu;
> > @@ -951,20 +968,24 @@ struct iort_iommu_config {
> >  	int (*iommu_count_resources)(struct acpi_iort_node *node);
> >  	void (*iommu_init_resources)(struct resource *res,
> >  				     struct acpi_iort_node *node);
> > +	void (*iommu_set_proximity)(struct acpi_iort_node *node,
> > +				    struct device *dev);
> >  };
> >
> >  static const struct iort_iommu_config iort_arm_smmu_v3_cfg __initconst = {
> >  	.name = "arm-smmu-v3",
> >  	.iommu_is_coherent = arm_smmu_v3_is_coherent,
> >  	.iommu_count_resources = arm_smmu_v3_count_resources,
> > -	.iommu_init_resources = arm_smmu_v3_init_resources
> > +	.iommu_init_resources = arm_smmu_v3_init_resources,
> > +	.iommu_set_proximity = arm_smmu_v3_set_proximity
> >  };
> >
> >  static const struct iort_iommu_config iort_arm_smmu_cfg __initconst = {
> >  	.name = "arm-smmu",
> >  	.iommu_is_coherent = arm_smmu_is_coherent,
> >  	.iommu_count_resources = arm_smmu_count_resources,
> > -	.iommu_init_resources = arm_smmu_init_resources
> > +	.iommu_init_resources = arm_smmu_init_resources,
> > +	.iommu_set_proximity = NULL
> >  };
> >
> >  static __init
> > @@ -1002,6 +1023,9 @@ static int __init iort_add_smmu_platform_device(struct acpi_iort_node *node)
> >  	if (!pdev)
> >  		return -ENOMEM;
> >
> > +	if (ops->iommu_set_proximity)
> > +		ops->iommu_set_proximity(node, &pdev->dev);
> > +
> >  	count = ops->iommu_count_resources(node);
> >
> >  	r = kcalloc(count, sizeof(*r), GFP_KERNEL);
> > --
> > 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AMD Ryzen KVM/NPT/IOMMU issue
On Thu, 29 Jun 2017 00:23:20 +0700 Suravee Suthikulpanit wrote:
> So, I'm trying to reproduce this issue on the Ryzen system w/ the following setup:
>
> * Host kernel v4.11 (with this patch https://lkml.org/lkml/2017/6/23/295)
> * guest VM RHEL7.3
> * guest graphic driver = radeon
> * qemu-system-x86_64 --version
>   QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty)
> * kvm-amd npt=1
> * dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81)
> * qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host -bios /usr/share/qemu/bios.bin -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net none -vga none -nodefaults -device vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1,romfile=/sandbox/vm-images/vbios.rom -usb -device usb-host,hostbus=3,hostport=1 -device usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 -device vfio-pci,host=:09:00.0
>
> With this setup, I am able to pass-through the dGPU and run the following test:
> * Starting up the guest w/ full GNOME GUI on the attached monitor.
> * glxgears (running @ 60 FPS)
> * Playing 1080p HD video on Youtube
>
> I am not noticing issues here. What kind of test are you running in the guest VM?

Hi Suravee,

Thanks for your help! I think we'd be in real trouble if glxgears reported something less than 60fps (perhaps that might be easier to debug though). I'd suggest trying one of the Unigine benchmarks, like Heaven. There should be a noticeable and consistent framerate difference for npt=0/1. Thanks,

Alex
Re: AMD Ryzen KVM/NPT/IOMMU issue
On Wed, Jun 28, 2017 at 1:23 PM, Suravee Suthikulpanit wrote:
> On 6/25/17 12:55, Nick Sarnie wrote:
>> On Fri, May 5, 2017 at 1:27 PM, Alex Williamson wrote:
>>> On Wed, 3 May 2017 12:28:35 -0400 Nick Sarnie wrote:
>>>> On Wed, May 3, 2017 at 10:37 AM, Matthias Ehrenfeuchter wrote:
>>>>> Hi,
>>>>>
>>>>> There are a lot of messages/threads out there about bad performance while using AMDs Ryzen with KVM GPU passthrough. It revolves all on enabling/disabling npt; while enabled, overall VM performance is nice but the GPU performance gives me about 20% (and a lot of drops to zero GPU usage, while CPU/Disk/Ram also doing nothing) compared to npt disabled. But while npt is disabled, overall VM performance is like being on a 4x86 with a floppy disk as the only storage. (Ex. it takes 2 seconds just to open the start menu while host and vm are in idle, and neither CPU pinning, changing CPU model, changing storage device nor using hugepages changed anything).
>>>>>
>>>>> So everything I read pointed to a bug in the npt implementation? Anything I could do to get closer to the "thing" issuing this?
>>>>>
>>>>> Best Regards
>>>>> efeu
>>>>
>>>> I heard from Joerg that it might be related to a lower intercept rate being used when NPT is enabled, but we haven't been able to find a way to trace that to confirm.
>>>
>>> Joerg/Paolo, any ideas how we might debug this? Anyone from AMD watching? Thanks,
>>>
>>> Alex
>>
>> Hi all,
>>
>> A somewhat major update.
>>
>> I managed to install Xen with my GPU passthrough config and test the performance with NPT enabled.
>>
>> There is no performance drop with NPT on Xen, it matches the GPU performance of KVM with NPT disabled. The CPU performance is also great.
>>
>> John Bridgman (cc'd) from AMD says he's going to ask around AMD about this next week, but it would be even better if some AMD guys that read this ML shared their ideas or took a look.
>>
>> Let me know if you need any more information.
>>
>> Thanks,
>> Sarnex
>
> So, I'm trying to reproduce this issue on the Ryzen system w/ the following setup:
>
> * Host kernel v4.11 (with this patch https://lkml.org/lkml/2017/6/23/295)
> * guest VM RHEL7.3
> * guest graphic driver = radeon
> * qemu-system-x86_64 --version
>   QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty)
> * kvm-amd npt=1
> * dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81)
> * qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host -bios /usr/share/qemu/bios.bin -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net none -vga none -nodefaults -device vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1,romfile=/sandbox/vm-images/vbios.rom -usb -device usb-host,hostbus=3,hostport=1 -device usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 -device vfio-pci,host=:09:00.0
>
> With this setup, I am able to pass-through the dGPU and run the following test:
> * Starting up the guest w/ full GNOME GUI on the attached monitor.
> * glxgears (running @ 60 FPS)
> * Playing 1080p HD video on Youtube
>
> I am not noticing issues here. What kind of test are you running in the guest VM?

Try running the open source game "torcs" inside the VM. I think you'll find that there's a very noticeable performance difference between npt=1 and npt=0
--
-Steven Walter
Re: AMD Ryzen KVM/NPT/IOMMU issue
On 6/25/17 12:55, Nick Sarnie wrote:
> On Fri, May 5, 2017 at 1:27 PM, Alex Williamson wrote:
>> On Wed, 3 May 2017 12:28:35 -0400 Nick Sarnie wrote:
>>> On Wed, May 3, 2017 at 10:37 AM, Matthias Ehrenfeuchter wrote:
>>>> Hi,
>>>>
>>>> There are a lot of messages/threads out there about bad performance while using AMDs Ryzen with KVM GPU passthrough. It revolves all on enabling/disabling npt; while enabled, overall VM performance is nice but the GPU performance gives me about 20% (and a lot of drops to zero GPU usage, while CPU/Disk/Ram also doing nothing) compared to npt disabled. But while npt is disabled, overall VM performance is like being on a 4x86 with a floppy disk as the only storage. (Ex. it takes 2 seconds just to open the start menu while host and vm are in idle, and neither CPU pinning, changing CPU model, changing storage device nor using hugepages changed anything).
>>>>
>>>> So everything I read pointed to a bug in the npt implementation? Anything I could do to get closer to the "thing" issuing this?
>>>>
>>>> Best Regards
>>>> efeu
>>>
>>> I heard from Joerg that it might be related to a lower intercept rate being used when NPT is enabled, but we haven't been able to find a way to trace that to confirm.
>>
>> Joerg/Paolo, any ideas how we might debug this? Anyone from AMD watching? Thanks,
>>
>> Alex
>
> Hi all,
>
> A somewhat major update.
>
> I managed to install Xen with my GPU passthrough config and test the performance with NPT enabled.
>
> There is no performance drop with NPT on Xen, it matches the GPU performance of KVM with NPT disabled. The CPU performance is also great.
>
> John Bridgman (cc'd) from AMD says he's going to ask around AMD about this next week, but it would be even better if some AMD guys that read this ML shared their ideas or took a look.
>
> Let me know if you need any more information.
>
> Thanks,
> Sarnex

So, I'm trying to reproduce this issue on the Ryzen system w/ the following setup:

* Host kernel v4.11 (with this patch https://lkml.org/lkml/2017/6/23/295)
* guest VM RHEL7.3
* guest graphic driver = radeon
* qemu-system-x86_64 --version
  QEMU emulator version 2.9.50 (v2.9.0-1659-g577caa2-dirty)
* kvm-amd npt=1
* dGPU is 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] (rev 81)
* qemu-system-x86_64 -smp 4 -enable-kvm -M q35 -m 4096 -cpu host -bios /usr/share/qemu/bios.bin -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -drive file=/sandbox/vm-images/rhel7.3.qcow2,if=virtio,id=disk0 -net none -vga none -nodefaults -device vfio-pci,host=08:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1,romfile=/sandbox/vm-images/vbios.rom -usb -device usb-host,hostbus=3,hostport=1 -device usb-host,hostbus=3,hostport=3 -device vfio-pci,host=:08:00.1 -device vfio-pci,host=:09:00.0

With this setup, I am able to pass-through the dGPU and run the following test:
* Starting up the guest w/ full GNOME GUI on the attached monitor.
* glxgears (running @ 60 FPS)
* Playing 1080p HD video on Youtube

I am not noticing issues here. What kind of test are you running in the guest VM?

Thanks,
Suravee
Re: [PATCH 0/8] io-pgtable lock removal
Hi Will/Robin,

On 6/28/17 4:46 AM, Will Deacon wrote:
> Hi Ray,
>
> Robin and I have been bashing our heads against the tlb_sync_pending flag
> this morning, and we reckon it could have something to do with your timeouts
> on MMU-500.
>
> On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote:
> >> Also, in a few occasions, I observed the following message during the
> >> test, when multiple cores are involved:
> >>
> >>   arm-smmu 6400.mmu: TLB sync timed out -- SMMU may be deadlocked
>
> The tlb_sync_pending logic was written under the assumption of a global
> page-table lock, so it assumes that it only has to care about syncing
> flushes from the current CPU/context. That's not true anymore, and the
> current code can accidentally skip syncs and (what I think is happening in
> your case) allow concurrent syncs, which will potentially lead to timeouts
> if a CPU is unlucky enough to keep missing the Ack.
>
> Please can you try the diff below and see if it fixes things for you?
> This applies on top of my for-joerg/arm-smmu/updates branch, but note
> that I've only shown it to the compiler. Not tested at all.
>
> Will

Thanks for looking into this. I'm a bit busy at work but will certainly find time to test the diff below. I hopefully will get to it later this week.

Thanks,

Ray

> --->8
>
> diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
> index 127558d83667..cd8d7aaec161 100644
> --- a/drivers/iommu/io-pgtable.c
> +++ b/drivers/iommu/io-pgtable.c
> @@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
>  	iop->cookie	= cookie;
>  	iop->cfg	= *cfg;
>
> +	atomic_set(&iop->tlb_sync_pending, 0);
>  	return &iop->ops;
>  }
>
> diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
> index 524263a7ae6f..b64580c9d03d 100644
> --- a/drivers/iommu/io-pgtable.h
> +++ b/drivers/iommu/io-pgtable.h
> @@ -1,5 +1,7 @@
>  #ifndef __IO_PGTABLE_H
>  #define __IO_PGTABLE_H
> +
> +#include <linux/atomic.h>
>  #include <linux/bitops.h>
>
>  /*
> @@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
>  struct io_pgtable {
>  	enum io_pgtable_fmt	fmt;
>  	void			*cookie;
> -	bool			tlb_sync_pending;
> +	atomic_t		tlb_sync_pending;
>  	struct io_pgtable_cfg	cfg;
>  	struct io_pgtable_ops	ops;
>  };
> @@ -175,22 +177,20 @@ struct io_pgtable {
>  static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
>  {
>  	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> -	iop->tlb_sync_pending = true;
> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>  }
>
>  static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
>  		unsigned long iova, size_t size, size_t granule, bool leaf)
>  {
>  	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
> -	iop->tlb_sync_pending = true;
> +	atomic_set_release(&iop->tlb_sync_pending, 1);
>  }
>
>  static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
>  {
> -	if (iop->tlb_sync_pending) {
> +	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
>  		iop->cfg.tlb->tlb_sync(iop->cookie);
> -		iop->tlb_sync_pending = false;
> -	}
>  }
>
>  /**
Re: [PATCH 2/3] iommu/omap: Return ERR_PTR in device_group call-back
On 06/28/2017 07:00 AM, Joerg Roedel wrote:
> From: Joerg Roedel
>
> Make sure that the device_group callback returns an ERR_PTR
> instead of NULL.
>
> Signed-off-by: Joerg Roedel

Thanks for the patch,

Acked-by: Suman Anna

regards
Suman

> ---
>  drivers/iommu/omap-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
> index 95dfca3..641e035 100644
> --- a/drivers/iommu/omap-iommu.c
> +++ b/drivers/iommu/omap-iommu.c
> @@ -1309,7 +1309,7 @@ static void omap_iommu_remove_device(struct device *dev)
>  static struct iommu_group *omap_iommu_device_group(struct device *dev)
>  {
>  	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
> -	struct iommu_group *group = NULL;
> +	struct iommu_group *group = ERR_PTR(-EINVAL);
>
>  	if (arch_data->iommu_dev)
>  		group = arch_data->iommu_dev->group;
Re: [PATCH 5/9] iommu: Introduce fault notifier API
On Wed, 28 Jun 2017 12:16:03 +0200 Joerg Roedel wrote:
> On Tue, Jun 27, 2017 at 12:47:59PM -0700, Jacob Pan wrote:
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index d973555..07cfd92 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -48,6 +48,7 @@ struct iommu_group {
> >  	struct list_head devices;
> >  	struct mutex mutex;
> >  	struct blocking_notifier_head notifier;
> > +	struct blocking_notifier_head fault_notifier;
>
> Do you really need a notifier chain here? Will there ever be more than
> one callback registered to it?

Yes, this notifier chain is shared by all devices under a group. The event contains device info which notifier callbacks can filter on.

> > +struct iommu_fault_event {
> > +	struct device *dev;
>
> Putting a 'struct device *' member in a uapi struct looks
> fundamentally wrong.

My mistake, it was originally (in the RFC) not in uapi, but with the consideration of using vfio to expose it to user space I moved it to uapi. You are right though, it should be some other form of device representation used by vfio. The VFIO layer has to do the translation and inject that into the guest. In-kernel driver users can use struct device to identify the faulting device.

> Joerg

[Jacob Pan]
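The design point being debated, one notifier chain shared by the whole group, with each consumer filtering on the device recorded in the event, can be sketched in plain C. All types and names below are illustrative stand-ins for the kernel structures under discussion, not the actual API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types in the patch discussion. */
struct device { int id; };

struct iommu_fault_event {
	struct device *dev;	/* faulting device, filled in by the IOMMU driver */
	unsigned long addr;	/* faulting IOVA */
};

static struct device dev_a = { .id = 1 };
static struct device dev_b = { .id = 2 };
static int faults_seen;

/*
 * A callback registered on the group-wide chain. Because the chain is
 * shared by every device in the group, the callback must filter on
 * evt->dev before handling the event.
 */
static int my_fault_notifier(struct device *mine, struct iommu_fault_event *evt)
{
	if (evt->dev != mine)
		return 0;	/* not our device: ignore */
	faults_seen++;		/* handle/report the fault */
	return 1;
}
```

A callback watching `dev_a` ignores a fault recorded against `dev_b` and handles one recorded against `dev_a`.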
Re: [PATCH 3/9] iommu: Introduce iommu do invalidate API function
On Wed, 28 Jun 2017 12:08:23 +0200 Joerg Roedel wrote:
> On Tue, Jun 27, 2017 at 12:47:57PM -0700, Jacob Pan wrote:
> > From: "Liu, Yi L"
> >
> > When a SVM capable device is assigned to a guest, the first level
> > page tables are owned by the guest and the guest PASID table
> > pointer is linked to the device context entry of the physical IOMMU.
> >
> > Host IOMMU driver has no knowledge of caching structure updates
> > unless the guest invalidation activities are passed down to the
> > host. The primary usage is derived from emulated IOMMU in the
> > guest, where QEMU can trap invalidation activities before passing them
> > down to the host/physical IOMMU. There are IOMMU architectural
> > specific actions that need to be taken, which requires the generic APIs
> > introduced in this patch to have opaque data in the
> > tlb_invalidate_info argument.
>
> Which "IOMMU architectural specific actions" are you thinking of?

Construction of queued invalidation descriptors, which are then submitted to the IOMMU QI interface.

> > +int iommu_invalidate(struct iommu_domain *domain,
> > +		struct device *dev, struct tlb_invalidate_info *inv_info)
> > +{
> > +	int ret = 0;
> > +
> > +	if (unlikely(!domain->ops->invalidate))
> > +		return -ENODEV;
> > +
> > +	ret = domain->ops->invalidate(domain, dev, inv_info);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_invalidate);
>
> [...]
>
> > +struct tlb_invalidate_info {
> > +	__u32	model;
> > +	__u32	length;
> > +	__u8	opaque[];
> > +};
>
> This interface is awful. It requires the user of a generic api to
> know details about the implementation behind to do anything useful.
>
> Please explain in more detail why this is needed. My feeling is that
> we can make this more generic with a small set of invalidation
> functions in the iommu-api.

My thinking was that, via configuration control, mixed IOMMU models between pIOMMU and vIOMMU will be unlikely. We could then pass just model-specific data through the layers of SW (QEMU, VFIO) for performance reasons. We do have an earlier hybrid version that has generic data plus opaque raw data. Would the below work for all IOMMU models?

https://www.spinics.net/lists/kvm/msg148798.html

struct tlb_invalidate_info {
	__u32	model;	/* Vendor number */
	__u8	granularity
#define DEVICE_SELECTIVE_INV	(1 << 0)
#define PAGE_SELECTIVE_INV	(1 << 1)
#define PASID_SELECTIVE_INV	(1 << 2)
	__u32	pasid;
	__u64	addr;
	__u64	size;

	/* Since IOMMU format has already been validated for this table,
	 * the IOMMU driver knows that the following structure is in a
	 * format it knows */
	__u8	opaque[];
};

> Joerg

[Jacob Pan]
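For illustration, a consumer of such a structure would dispatch on the granularity bits. The sketch below mirrors the struct layout proposed above in user-space C; the descriptor counting is a hypothetical stand-in for building real queued-invalidation descriptors, and the flag values assume the distinct-bit encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Granularity flags as in the proposal above (distinct bits assumed). */
#define DEVICE_SELECTIVE_INV	(1 << 0)
#define PAGE_SELECTIVE_INV	(1 << 1)
#define PASID_SELECTIVE_INV	(1 << 2)

struct tlb_invalidate_info {
	uint32_t model;		/* vendor number */
	uint8_t  granularity;
	uint32_t pasid;
	uint64_t addr;
	uint64_t size;
};

/*
 * Count how many invalidation descriptors a request would need, one per
 * selected granularity. Purely illustrative -- not the VT-d (or any
 * other vendor's) descriptor format.
 */
static int count_descriptors(const struct tlb_invalidate_info *info)
{
	int n = 0;

	if (info->granularity & DEVICE_SELECTIVE_INV)
		n++;	/* one device-TLB invalidation descriptor */
	if (info->granularity & PASID_SELECTIVE_INV)
		n++;	/* one PASID-cache invalidation descriptor */
	if (info->granularity & PAGE_SELECTIVE_INV)
		n++;	/* one IOTLB page-range invalidation descriptor */
	return n;
}
```

This also shows why the bits must be distinct: with `PAGE_SELECTIVE_INV` and `DEVICE_SELECTIVE_INV` sharing `(1 << 0)`, as in the draft as originally posted, the two granularities could not be told apart.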
Re: [PATCH 1/3] iommu: Return ERR_PTR() values from device_group call-backs
On Wed, 28 Jun 2017 14:00:56 +0200 Joerg Roedel wrote:
> From: Joerg Roedel
>
> The generic device_group call-backs in iommu.c return NULL
> in case of error. Since they are getting ERR_PTR values from
> iommu_group_alloc(), just pass them up instead.
>
> Reported-by: Gerald Schaefer
> Signed-off-by: Joerg Roedel
> ---

Looks good,

Reviewed-by: Gerald Schaefer

>  drivers/iommu/iommu.c | 14 ++------------
>  1 file changed, 2 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index cf7ca7e..de09e1e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -915,13 +915,7 @@ static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
>   */
>  struct iommu_group *generic_device_group(struct device *dev)
>  {
> -	struct iommu_group *group;
> -
> -	group = iommu_group_alloc();
> -	if (IS_ERR(group))
> -		return NULL;
> -
> -	return group;
> +	return iommu_group_alloc();
>  }
>
>  /*
> @@ -988,11 +982,7 @@ struct iommu_group *pci_device_group(struct device *dev)
>  		return group;
>
>  	/* No shared group found, allocate new */
> -	group = iommu_group_alloc();
> -	if (IS_ERR(group))
> -		return NULL;
> -
> -	return group;
> +	return iommu_group_alloc();
>  }
>
>  /**
Re: [PATCH v2] iommu/amd: Fix interrupt remapping when disable guest_mode
On Mon, Jun 26, 2017 at 04:28:04AM -0500, Suravee Suthikulpanit wrote:
> Pass-through devices to VM guest can get updated IRQ affinity
> information via irq_set_affinity() when not running in guest mode.
> Currently, AMD IOMMU driver in GA mode ignores the updated information
> if the pass-through device is setup to use vAPIC regardless of guest_mode.
> This could cause invalid interrupt remapping.
>
> Also, the guest_mode bit should be set and cleared only when
> SVM updates posted-interrupt interrupt remapping information.
>
> Signed-off-by: Suravee Suthikulpanit
> Cc: Joerg Roedel

Nevermind, I added this fixes line and applied the patch:

Fixes: d98de49a53e48 ('iommu/amd: Enable vAPIC interrupt remapping mode by default')
Re: [PATCH v2] iommu/vt-d: constify intel_dma_ops
On Wed, Jun 28, 2017 at 04:39:32PM +0530, Arvind Yadav wrote:
> Most dma_map_ops structures are never modified. Constify these
> structures such that these can be write-protected.
>
> Signed-off-by: Arvind Yadav
> ---
> Changes in v2:
>   Added description.
>
>  drivers/iommu/intel-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied, thanks.
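The rationale for constifying an ops table is that a `const`-qualified structure of function pointers can be placed in read-only memory (`.rodata`) and write-protected, closing off one avenue for runtime corruption of function pointers. A minimal user-space sketch of the pattern; the struct here is a cut-down, hypothetical analogue of `dma_map_ops`, not the real one:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down, hypothetical ops table in the style of struct dma_map_ops. */
struct demo_dma_ops {
	unsigned long (*map_page)(void *page, size_t offset);
};

static unsigned long demo_map_page(void *page, size_t offset)
{
	return (unsigned long)page + offset;	/* identity "mapping" */
}

/*
 * const: the compiler places the table in read-only storage, so it can
 * be write-protected -- which is the point of the constify patch.
 * Attempting demo_ops.map_page = ... elsewhere is a compile error.
 */
static const struct demo_dma_ops demo_ops = {
	.map_page = demo_map_page,
};

static unsigned long do_map(const struct demo_dma_ops *ops, void *p, size_t off)
{
	return ops->map_page(p, off);
}
```

Callers are unaffected: calling *through* a const table works exactly as before; only writes to the table are rejected.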
[PATCH 3/3] iommu: Warn once when device_group callback returns NULL
From: Joerg Roedel

This callback should never return NULL. Print a warning if
that happens so that we notice and can fix it.

Signed-off-by: Joerg Roedel
---
 drivers/iommu/iommu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index de09e1e..3f6ea16 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1010,6 +1010,9 @@ struct iommu_group *iommu_group_get_for_dev(struct device *dev)
 	if (ops && ops->device_group)
 		group = ops->device_group(dev);
 
+	if (WARN_ON_ONCE(group == NULL))
+		return ERR_PTR(-EINVAL);
+
 	if (IS_ERR(group))
 		return group;
-- 
2.7.4
[PATCH 2/3] iommu/omap: Return ERR_PTR in device_group call-back
From: Joerg Roedel

Make sure that the device_group callback returns an ERR_PTR
instead of NULL.

Signed-off-by: Joerg Roedel
---
 drivers/iommu/omap-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 95dfca3..641e035 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1309,7 +1309,7 @@ static void omap_iommu_remove_device(struct device *dev)
 static struct iommu_group *omap_iommu_device_group(struct device *dev)
 {
 	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
-	struct iommu_group *group = NULL;
+	struct iommu_group *group = ERR_PTR(-EINVAL);
 
 	if (arch_data->iommu_dev)
 		group = arch_data->iommu_dev->group;
-- 
2.7.4
[PATCH 1/3] iommu: Return ERR_PTR() values from device_group call-backs
From: Joerg Roedel

The generic device_group call-backs in iommu.c return NULL
in case of error. Since they are getting ERR_PTR values from
iommu_group_alloc(), just pass them up instead.

Reported-by: Gerald Schaefer
Signed-off-by: Joerg Roedel
---
 drivers/iommu/iommu.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index cf7ca7e..de09e1e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -915,13 +915,7 @@ static int get_pci_alias_or_group(struct pci_dev *pdev, u16 alias, void *opaque)
  */
 struct iommu_group *generic_device_group(struct device *dev)
 {
-	struct iommu_group *group;
-
-	group = iommu_group_alloc();
-	if (IS_ERR(group))
-		return NULL;
-
-	return group;
+	return iommu_group_alloc();
 }
 
 /*
@@ -988,11 +982,7 @@ struct iommu_group *pci_device_group(struct device *dev)
 		return group;
 
 	/* No shared group found, allocate new */
-	group = iommu_group_alloc();
-	if (IS_ERR(group))
-		return NULL;
-
-	return group;
+	return iommu_group_alloc();
 }
 
 /**
-- 
2.7.4
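The point of the patch is that iommu_group_alloc() already encodes the failure reason in the pointer value itself, so converting its ERR_PTR return into NULL throws the error code away. A minimal user-space sketch of the ERR_PTR/IS_ERR convention, simplified from the kernel's include/linux/err.h (the `alloc_group()` helper is hypothetical, standing in for iommu_group_alloc()):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified versions of the kernel's ERR_PTR helpers (include/linux/err.h).
 * The trick: the top 4095 values of the address space never hold valid
 * objects, so negative errno values can be smuggled inside a pointer. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for iommu_group_alloc(): returning NULL here
 * instead of the ERR_PTR value would lose the -ENODEV error code. */
static void *alloc_group(int fail)
{
	static int dummy_group;

	return fail ? ERR_PTR(-ENODEV) : (void *)&dummy_group;
}
```

A caller can thus distinguish *why* allocation failed (`PTR_ERR`) rather than only *that* it failed, which a bare NULL cannot convey.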
Re: [PATCH 0/8] io-pgtable lock removal
Hi Ray, Robin and I have been bashing our heads against the tlb_sync_pending flag this morning, and we reckon it could have something to do with your timeouts on MMU-500. On Tue, Jun 27, 2017 at 09:43:19AM -0700, Ray Jui wrote: > >> Also, in a few occasions, I observed the following message during the > >> test, when multiple cores are involved: > >> > >> arm-smmu 6400.mmu: TLB sync timed out -- SMMU may be deadlocked The tlb_sync_pending logic was written under the assumption of a global page-table lock, so it assumes that it only has to care about syncing flushes from the current CPU/context. That's not true anymore, and the current code can accidentally skip syncs and (what I think is happening in your case) allow concurrent syncs, which will potentially lead to timeouts if a CPU is unlucky enough to keep missing the Ack. Please can you try the diff below and see if it fixes things for you? This applies on top of my for-joerg/arm-smmu/updates branch, but note that I've only shown it to the compiler. Not tested at all. 
Will

--->8

diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 127558d83667..cd8d7aaec161 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -59,6 +59,7 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
 	iop->cookie	= cookie;
 	iop->cfg	= *cfg;
+	atomic_set(&iop->tlb_sync_pending, 0);
 
 	return &iop->ops;
 }
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 524263a7ae6f..b64580c9d03d 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -1,5 +1,7 @@
 #ifndef __IO_PGTABLE_H
 #define __IO_PGTABLE_H
+
+#include <linux/atomic.h>
 #include
 
 /*
@@ -165,7 +167,7 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops);
 struct io_pgtable {
 	enum io_pgtable_fmt	fmt;
 	void			*cookie;
-	bool			tlb_sync_pending;
+	atomic_t		tlb_sync_pending;
 	struct io_pgtable_cfg	cfg;
 	struct io_pgtable_ops	ops;
 };
@@ -175,22 +177,20 @@ struct io_pgtable {
 static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)
 {
 	iop->cfg.tlb->tlb_flush_all(iop->cookie);
-	iop->tlb_sync_pending = true;
+	atomic_set_release(&iop->tlb_sync_pending, 1);
 }
 
 static inline void io_pgtable_tlb_add_flush(struct io_pgtable *iop,
 		unsigned long iova, size_t size, size_t granule, bool leaf)
 {
 	iop->cfg.tlb->tlb_add_flush(iova, size, granule, leaf, iop->cookie);
-	iop->tlb_sync_pending = true;
+	atomic_set_release(&iop->tlb_sync_pending, 1);
 }
 
 static inline void io_pgtable_tlb_sync(struct io_pgtable *iop)
 {
-	if (iop->tlb_sync_pending) {
+	if (atomic_xchg_relaxed(&iop->tlb_sync_pending, 0))
 		iop->cfg.tlb->tlb_sync(iop->cookie);
-		iop->tlb_sync_pending = false;
-	}
 }
 
 /**
Re: [PATCH 2/3] iommu/iova: don't disable preempt around this_cpu_ptr()
On Wed, Jun 28, 2017 at 11:31:55AM +0200, Sebastian Andrzej Siewior wrote:
> It really does. The spin_lock() does disable preemption but this is not
> the problem. The thing is that the preempt_disable() is superfluous and
> it hurts Preempt-RT (and this is how I noticed it). Also the
> get_cpu_ptr() is not required and was only added to keep lockdep quiet
> (according to the history).
> Everything else here can stay as-is, I am just asking for the removal of
> the redundant preempt_disable() where it is not required.

Okay, makes sense, I applied both patches.

Thanks,

	Joerg
Re: [PATCH] iommu/vt-d: constify intel_dma_ops.
On Wed, Jun 28, 2017 at 03:31:16PM +0530, Arvind Yadav wrote:
> Most dma_map_ops structures are never modified. Constify these
> structures such that these can be write-protected. This file size diff
> will show the difference between data and text segment.

I know what the diff shows, but it doesn't matter for this patch because
that's just an implementation detail of the compiler. The real reason for
making it 'const' is to write-protect them, and that should be clear in
the commit-message.

	Joerg
Re: [PATCH 3/9] iommu: Introduce iommu do invalidate API function
On Tue, Jun 27, 2017 at 12:47:57PM -0700, Jacob Pan wrote:
> From: "Liu, Yi L"
>
> When a SVM capable device is assigned to a guest, the first level page
> tables are owned by the guest and the guest PASID table pointer is
> linked to the device context entry of the physical IOMMU.
>
> Host IOMMU driver has no knowledge of caching structure updates unless
> the guest invalidation activities are passed down to the host. The
> primary usage is derived from emulated IOMMU in the guest, where QEMU
> can trap invalidation activities before pass them down the
> host/physical IOMMU. There are IOMMU architectural specific actions
> need to be taken which requires the generic APIs introduced in this
> patch to have opaque data in the tlb_invalidate_info argument.

Which "IOMMU architectural specific actions" are you thinking of?

> +int iommu_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info)
> +{
> +	int ret = 0;
> +
> +	if (unlikely(!domain->ops->invalidate))
> +		return -ENODEV;
> +
> +	ret = domain->ops->invalidate(domain, dev, inv_info);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_invalidate);

[...]

> +struct tlb_invalidate_info {
> +	__u32	model;
> +	__u32	length;
> +	__u8	opaque[];
> +};

This interface is awful. It requires the user of a generic api to know
details about the implementation behind to do anything useful.

Please explain in more detail why this is needed. My feeling is that we
can make this more generic with a small set of invalidation functions in
the iommu-api.

	Joerg
Re: [PATCH] iommu/vt-d: constify intel_dma_ops.
Hi,

Most dma_map_ops structures are never modified. Constify these
structures such that these can be write-protected. This file size diff
will show the difference between data and text segment.

Thanks,
~arvind

On Wednesday 28 June 2017 02:23 PM, Joerg Roedel wrote:
On Tue, Jun 13, 2017 at 03:48:34PM +0530, Arvind Yadav wrote:
File size before:
   text    data     bss     dec     hex filename
  32765     758    1824   35347    8a13 drivers/iommu/intel-iommu.o
File size After adding 'const':
   text    data     bss     dec     hex filename
  32925     598    1824   35347    8a13 drivers/iommu/intel-iommu.o
What is the point of those commit-messages? They just show that the
struct moved from data to the text segment. The overall size doesn't
change much.

	Joerg
Re: [PATCH 2/9] iommu/vt-d: add bind_pasid_table function
On Tue, Jun 27, 2017 at 12:47:56PM -0700, Jacob Pan wrote:
> Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> functions.
>
> The primary use case is for direct assignment of SVM capable
> device. Originated from emulated IOMMU in the guest, the request goes
> through many layers (e.g. VFIO). Upon calling host IOMMU driver, caller
> passes guest PASID table pointer (GPA) and size.
>
> Device context table entry is modified by Intel IOMMU specific
> bind_pasid_table function. This will turn on nesting mode and matching
> translation type.
>
> The unbind operation restores default context mapping.
>
> Signed-off-by: Jacob Pan
> Signed-off-by: Liu, Yi L
> Signed-off-by: Ashok Raj
> ---
>  drivers/iommu/intel-iommu.c   | 117 ++
>  include/linux/dma_remapping.h |   1 +
>  2 files changed, 118 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 8274ce3..ef05b59 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5430,6 +5430,119 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
>
>  	return iommu;
>  }
> +
> +static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_info *pasidt_binfo)
> +{
> +	struct intel_iommu *iommu;
> +	struct context_entry *context;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct pci_dev *pdev;
> +	u8 bus, devfn;
> +	u16 did, *sid;
> +	int ret = 0;
> +	unsigned long flags;
> +	u64 ctx_lo;
> +
> +	if (pasidt_binfo == NULL || pasidt_binfo->model != IOMMU_MODEL_INTEL_VTD) {
> +		pr_warn("%s: Invalid bind request!\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	sid = (u16 *)&pasidt_binfo->opaque;
> +	/*
> +	 * check SID, if it is not correct, return success to allow looping
> +	 * through all devices within a group
> +	 */
> +	if (PCI_DEVID(bus, devfn) != *sid)
> +		return 0;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;
> +
> +	pdev = to_pci_dev(dev);
> +	if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI))
> +		return -EINVAL;
> +
> +	info = dev->archdata.iommu;
> +	if (!info || !info->pasid_supported ||
> +		!pci_enable_pasid(pdev, info->pasid_supported & ~1)) {
> +		pr_err("PCI %04x:%02x:%02x.%d: has no PASID support\n",
> +		       pci_domain_nr(pdev->bus), bus, PCI_SLOT(devfn),
> +		       PCI_FUNC(devfn));

You can either use dev_name() here to decode the device name, or just
use dev_err() instead.
Re: [PATCH 1/9] iommu: Introduce bind_pasid_table API function
On Tue, Jun 27, 2017 at 12:47:55PM -0700, Jacob Pan wrote:
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +		struct pasid_table_info *pasidt_binfo)
> +{
> +	if (unlikely(!domain->ops->bind_pasid_table))
> +		return -EINVAL;

I think it's better to return -ENODEV here, like other iommu-api
functions do when a callback is NULL.

> +enum iommu_model {
> +	IOMMU_MODEL_INTEL_VTD,
> +	IOMMU_MODEL_ARM_SMMU,
> +};

AMD IOMMU also supports shared virtual memory.

Regards,

	Joerg
Re: [PATCH v8 RESEND 27/38] iommu/amd: Allow the AMD IOMMU to work with memory encryption
Hi Tom,

On Tue, Jun 27, 2017 at 10:12:30AM -0500, Tom Lendacky wrote:
> ---
>  drivers/iommu/amd_iommu.c       | 30 --
>  drivers/iommu/amd_iommu_init.c  | 34 --
>  drivers/iommu/amd_iommu_proto.h | 10 ++
>  drivers/iommu/amd_iommu_types.h |  2 +-
>  4 files changed, 55 insertions(+), 21 deletions(-)

Looks like a straightforward change. Just one nit below.

> +static bool amd_iommu_supports_sme(void)
> +{
> +	if (!sme_active() || (boot_cpu_data.x86 != 0x17))
> +		return true;
> +
> +	/* For Fam17h, a specific level of support is required */
> +	if (boot_cpu_data.microcode >= 0x08001205)
> +		return true;
> +
> +	if ((boot_cpu_data.microcode >= 0x08001126) &&
> +	    (boot_cpu_data.microcode <= 0x080011ff))
> +		return true;
> +
> +	pr_notice("AMD-Vi: IOMMU not currently supported when SME is active\n");
> +
> +	return false;
> +}

The name of the function is misleading. It checks whether the IOMMU can
be enabled when SME is active. But the name suggests that it checks
whether the iommu hardware supports SME. How about renaming it to
amd_iommu_sme_check()?

With that change the patch is:

Acked-by: Joerg Roedel
Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
Hi Zhen Lei,

Nate (CC'd), Robin and I have been working on something very similar to
this series, but this patch is different to what we had planned. More below.

On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:
> Because all TLBI commands should be followed by a SYNC command, to make
> sure that it has been completely finished. So we can just add the TLBI
> commands into the queue, and put off the execution until meet SYNC or
> other commands. To prevent the followed SYNC command waiting for a long
> time because of too many commands have been delayed, restrict the max
> delayed number.
>
> According to my test, I got the same performance data as I replaced writel
> with writel_relaxed in queue_inc_prod.
>
> Signed-off-by: Zhen Lei
> ---
>  drivers/iommu/arm-smmu-v3.c | 42 +-
>  1 file changed, 37 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 291da5f..4481123 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -337,6 +337,7 @@
>  /* Command queue */
>  #define CMDQ_ENT_DWORDS			2
>  #define CMDQ_MAX_SZ_SHIFT		8
> +#define CMDQ_MAX_DELAYED		32
>
>  #define CMDQ_ERR_SHIFT			24
>  #define CMDQ_ERR_MASK			0x7f
> @@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent {
>  		} cfgi;
>
> +		#define CMDQ_OP_TLBI_NH_ALL	0x10
>  		#define CMDQ_OP_TLBI_NH_ASID	0x11
>  		#define CMDQ_OP_TLBI_NH_VA	0x12
>  		#define CMDQ_OP_TLBI_EL2_ALL	0x20
> @@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent {
>
>  struct arm_smmu_queue {
>  	int				irq; /* Wired interrupt */
> +	u32				nr_delay;
>
>  	__le64				*base;
>  	dma_addr_t			base_dma;
> @@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q)
>  	return ret;
>  }
>
> -static void queue_inc_prod(struct arm_smmu_queue *q)
> +static void queue_inc_swprod(struct arm_smmu_queue *q)
>  {
> -	u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
> +	u32 prod = q->prod + 1;
>
>  	q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
> +}
> +
> +static void queue_inc_prod(struct arm_smmu_queue *q)
> +{
> +	queue_inc_swprod(q);
>  	writel(q->prod, q->prod_reg);
>  }
>
> @@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
>  		*dst++ = cpu_to_le64(*src++);
>  }
>
> -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
> +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize)
>  {
>  	if (queue_full(q))
>  		return -ENOSPC;
>
>  	queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
> -	queue_inc_prod(q);
> +
> +	/*
> +	 * We don't want too many commands to be delayed, this may lead the
> +	 * followed sync command to wait for a long time.
> +	 */
> +	if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
> +		queue_inc_swprod(q);
> +	} else {
> +		queue_inc_prod(q);
> +		q->nr_delay = 0;
> +	}
> +

So here, you're effectively putting invalidation commands into the command
queue without updating PROD. Do you actually see a performance advantage
from doing so? Another side of the argument would be that we should be
moving PROD as soon as we can, so that the SMMU can process invalidation
commands in the background and reduce the cost of the final SYNC operation
when the high-level unmap operation is complete.

Will
Re: [PATCH 2/3] iommu/iova: don't disable preempt around this_cpu_ptr()
On 2017-06-28 11:22:05 [+0200], Joerg Roedel wrote:
> On Tue, Jun 27, 2017 at 06:16:47PM +0200, Sebastian Andrzej Siewior wrote:
> > Commit 583248e6620a ("iommu/iova: Disable preemption around use of
> > this_cpu_ptr()") disables preemption while accessing a per-CPU variable.
> > This does keep lockdep quiet. However I don't see the point why it is
> > bad if we get migrated after its access to another CPU.
> > __iova_rcache_insert() and __iova_rcache_get() immediately locks the
> > variable after obtaining it - before accessing its members.
> > _If_ we get migrated away after retrieving the address of cpu_rcache
> > before taking the lock then the *other* task on the same CPU will
> > retrieve the same address of cpu_rcache and will spin on the lock.
> >
> > alloc_iova_fast() disables preemption while invoking
> > free_cpu_cached_iovas() on each CPU. The function itself uses
> > per_cpu_ptr() which does not trigger a warning (like this_cpu_ptr()
> > does). It _could_ make sense to use get_online_cpus() instead but then we
> > have a hotplug notifier for CPU down (and none for up) so we are good.
>
> Does that really matter? The spin_lock disables irqs and thus avoids
> preemption too. We also can't get rid of the irqsave lock here because
> these locks are taken in the dma-api path which is used from interrupt
> context.

It really does. The spin_lock() does disable preemption but this is not
the problem. The thing is that the preempt_disable() is superfluous and
it hurts Preempt-RT (and this is how I noticed it). Also the
get_cpu_ptr() is not required and was only added to keep lockdep quiet
(according to the history).

Everything else here can stay as-is, I am just asking for the removal of
the redundant preempt_disable() where it is not required.

> 	Joerg

Sebastian
Re: [PATCH 2/3] iommu/iova: don't disable preempt around this_cpu_ptr()
On Tue, Jun 27, 2017 at 06:16:47PM +0200, Sebastian Andrzej Siewior wrote:
> Commit 583248e6620a ("iommu/iova: Disable preemption around use of
> this_cpu_ptr()") disables preemption while accessing a per-CPU variable.
> This does keep lockdep quiet. However I don't see the point why it is
> bad if we get migrated after its access to another CPU.
> __iova_rcache_insert() and __iova_rcache_get() immediately locks the
> variable after obtaining it - before accessing its members.
> _If_ we get migrated away after retrieving the address of cpu_rcache
> before taking the lock then the *other* task on the same CPU will
> retrieve the same address of cpu_rcache and will spin on the lock.
>
> alloc_iova_fast() disables preemption while invoking
> free_cpu_cached_iovas() on each CPU. The function itself uses
> per_cpu_ptr() which does not trigger a warning (like this_cpu_ptr()
> does). It _could_ make sense to use get_online_cpus() instead but then we
> have a hotplug notifier for CPU down (and none for up) so we are good.

Does that really matter? The spin_lock disables irqs and thus avoids
preemption too. We also can't get rid of the irqsave lock here because
these locks are taken in the dma-api path which is used from interrupt
context.

	Joerg
Re: [PATCH v2] iommu/amd: Fix interrupt remapping when disable guest_mode
Hi Suravee,

On Mon, Jun 26, 2017 at 04:28:04AM -0500, Suravee Suthikulpanit wrote:
> Pass-through devices to VM guest can get updated IRQ affinity
> information via irq_set_affinity() when not running in guest mode.
> Currently, AMD IOMMU driver in GA mode ignores the updated information
> if the pass-through device is setup to use vAPIC regardless of guest_mode.
> This could cause invalid interrupt remapping.
>
> Also, the guest_mode bit should be set and cleared only when
> SVM updates posted-interrupt interrupt remapping information.
>
> Signed-off-by: Suravee Suthikulpanit
> Cc: Joerg Roedel

Can you please provide a 'Fixes:' tag too? No need to resend the whole
patch, just reply with the tag and I'll add it when applying this patch.

Thanks,

	Joerg
Re: [PATCH] iommu/vt-d: constify intel_dma_ops.
On Tue, Jun 13, 2017 at 03:48:34PM +0530, Arvind Yadav wrote:
> File size before:
>    text    data     bss     dec     hex filename
>   32765     758    1824   35347    8a13 drivers/iommu/intel-iommu.o
>
> File size After adding 'const':
>    text    data     bss     dec     hex filename
>   32925     598    1824   35347    8a13 drivers/iommu/intel-iommu.o

What is the point of those commit-messages? They just show that the
struct moved from data to the text segment. The overall size doesn't
change much.

	Joerg
Re: [GIT PULL] iommu/arm-smmu: Updates for 4.13
On Mon, Jun 26, 2017 at 11:42:42AM +0100, Will Deacon wrote:
> The following changes since commit 2ea659a9ef488125eb46da6eb571de5eae5c43f6:
>
>   Linux 4.12-rc1 (2017-05-13 13:19:49 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git for-joerg/arm-smmu/updates
>
> for you to fetch changes up to f935448acf462c26142e8b04f1c8829b28d3b9d8:
>
>   iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126 (2017-06-23 17:58:04 +0100)

Pulled, thanks Will.
Re: [PATCH v1 3/3] iommu/amd: Optimize the IOMMU queue flush
[Adding Alex Deucher]

Hey Alex,

On Tue, Jun 27, 2017 at 12:24:35PM -0400, Jan Vesely wrote:
> On Mon, 2017-06-26 at 14:14 +0200, Joerg Roedel wrote:
> > How does that 'dGPU goes to sleep' work? Do you put it to sleep manually
> > via sysfs or something? Or is that something that amdgpu does on its
> > own?
>
> AMD folks should be able to provide more details. afaik, the driver
> uses ACPI methods to power on/off the device. Driver routines wake the
> device up before accessing it and there is a timeout to turn it off
> after few seconds of inactivity.
>
> > It looks like the GPU just switches the ATS unit off when it goes to
> > sleep and doesn't answer the invalidation anymore, which explains the
> > completion-wait timeouts.
>
> Both MMIO regs and PCIe config regs are turned off so it would not
> surprise me if all PCIe requests were ignored by the device in off
> state. It should be possible to request device wake up before
> invalidating the relevant IOMMU domain. I'll leave it to more
> knowledgeable ppl to judge whether it's a good idea (we can also
> postpone such invalidations until the device is woken by other means)

Can you maybe shed some light on how the sleep-mode of the GPUs works?
Is it initiated by the GPU driver or from somewhere else?

In the case discussed here it looks like the ATS unit of the GPU is
switched off, causing IOTLB invalidation timeouts on the IOMMU side. If
that is the case we might need some sort of dma-api extension so that
the GPU driver can tell the iommu driver that the device is going to be
quiet.

Thanks,

	Joerg