Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 03/12/15 08:50, Tian, Kevin wrote:
>> From: Jan Beulich [mailto:jbeul...@suse.com]
>> Sent: Thursday, December 03, 2015 4:18 PM
>>
>> >>> On 03.12.15 at 03:40, wrote:
>>> Just confirmed internally with HW team. On SNB 4KB cache is always
>>> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
>>> for this 40% drop observation...
>> So when they stated that the 4k TLB gets always used, did they at
>> least provide some thoughts on what else might be causing this
>> severe a performance impact? Without them helping we're left
>> guessing...
>>
> Unfortunately no clear answer...

http://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf

Page 42: "The IOTLB on the previous generation Intel Xeon Processor
E5-2690 does not natively support huge pages (it emulates them using 4K
pages)."

And Figure 51 on Page 43

The "emulates them using 4K pages" probably means that the IOTLB is
flushed and filled with 512 adjacent 4k mappings.

Citrix's measurements back up the findings in that paper, and also show
that performance is better when using plain 4k mappings as opposed to
emulated 2M mappings.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 03/12/15 01:19, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
>> Sent: Thursday, November 26, 2015 9:56 PM
>>
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes substantially
>>>>> worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb
>>>> translations don't get entered into the TLB, it should still be one
>>>> less page table level to walk for the IOMMU, and should hence
>>>> nevertheless be a benefit. Yet you even say _substantially_
>>>> worse performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
>> Also remember that a high level dma access (from the point of view of a
>> driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>> pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, compared
>> to using 4k mappings, from the point of view of a sample network workload.
>>
>> ~Andrew
> One confusion here. The original patch just disables shared_ept, w/o
> changing IOMMU to not use 2MB mapping. Is there something missing
> or other tricks behind?

Disabling shared_ept works because the Xen IOMMU interface doesn't
support superpages.

However, I subsequently changed my mind in this thread about the
approach taken, and quirking the IOMMUs not to report superpage
capabilities is the correct solution to the problem.

~Andrew
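The arithmetic behind the "16 pagetable walks" figure quoted above can be written out as a small sketch. The helper names below are hypothetical and are not part of Xen; the model assumes a DMA that starts on a page boundary, one translation per PCIe packet when nothing is cached, and one walk per 4K page when 4K translations are cached.

```c
#include <assert.h>
#include <stddef.h>

/* Packets needed to move dma_bytes with at most max_payload bytes each. */
static size_t pcie_packets(size_t dma_bytes, size_t max_payload)
{
    assert(max_payload > 0);
    return (dma_bytes + max_payload - 1) / max_payload;
}

/* Worst case when no translation is cached: one walk per packet. */
static size_t walks_uncached(size_t dma_bytes, size_t max_payload)
{
    return pcie_packets(dma_bytes, max_payload);
}

/* With a working 4K IOTLB: one walk per 4K page touched (aligned start). */
static size_t walks_cached_4k(size_t dma_bytes)
{
    return (dma_bytes + 4095) / 4096;
}
```

Under these assumptions a 4096-byte DMA at a 256-byte payload size gives 16 walks uncached against 1 walk with the 4K IOTLB, which matches the figure in the quoted message.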
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 03.12.15 at 03:40, wrote:
> Just confirmed internally with HW team. On SNB 4KB cache is always
> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
> for this 40% drop observation...

So when they stated that the 4k TLB gets always used, did they at
least provide some thoughts on what else might be causing this
severe a performance impact? Without them helping we're left
guessing...

Jan
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com] > Sent: Thursday, December 03, 2015 7:24 PM > > On 03/12/15 01:19, Tian, Kevin wrote: > >> From: Andrew Cooper [mailto:andrew.coop...@citrix.com] > >> Sent: Thursday, November 26, 2015 9:56 PM > >> > >> On 26/11/15 13:48, Malcolm Crossley wrote: > >>> On 26/11/15 13:46, Jan Beulich wrote: > >>> On 25.11.15 at 11:28,wrote: > > The problem is that SandyBridge IOMMUs advertise 2M support and do > > function with it, but cannot cache 2MB translations in the IOTLBs. > > > > As a result, attempting to use 2M translations causes substantially > > worse performance than 4K translations. > Btw - how does this get explained? At a first glance, even if 2Mb > translations don't get entered into the TLB, it should still be one > less page table level to walk for the IOMMU, and should hence > nevertheless be a benefit. Yet you even say _substantially_ > worse performance results. > >>> There is a IOTLB for the 4K translation so if you only use 4K > >>> translations then you get to take advantage of the IOTLB. > >>> > >>> If you use the 2Mb translation then a page table walk has to be > >>> performed every time there's a DMA access to that region of the BFN > >>> address space. > >> Also remember that a high level dma access (from the point of view of a > >> driver) will be fragmented at the PCIe max packet size, which is > >> typically 256 bytes. > >> > >> So by not caching the 2Mb translation, a dma access of 4k may undergo 16 > >> pagetable walks, one for each PCIe packet. > >> > >> We observed that using 2Mb mappings results in a 40% overhead, compared > >> to using 4k mappings, from the point of view of a sample network workload. > >> > >> ~Andrew > > One confusion here. The original patch just disables shared_ept, w/o > > changing IOMMU to not use 2MB mapping. Is there something missing > > or other tricks behind? > > Disabling shared_ept works because the Xen IOMMU interface doesn't > support superpages. 
>
> However, I subsequently changed my mind in this thread about the
> approach taken, and quirking the IOMMUs not to report superpage
> capabilities is the correct solution to the problem.
>

Yes, it makes more sense.

Thanks
Kevin
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com] > Sent: Thursday, December 03, 2015 7:19 PM > > On 03/12/15 08:50, Tian, Kevin wrote: > >> From: Jan Beulich [mailto:jbeul...@suse.com] > >> Sent: Thursday, December 03, 2015 4:18 PM > >> > > On 03.12.15 at 03:40,wrote: > >>> Just confirmed internally with HW team. On SNB 4KB cache is always > >>> used regardless of 4KB/2MB/1GB mapping. There'd be another reason > >>> for this 40% drop observation... > >> So when they stated that the 4k TLB gets always used, did they at > >> least provide some thoughts on what else might be causing this > >> severe a performance impact? Without them helping we're left > >> guessing... > >> > > Unfortunately no clear answer... > > http://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf > > Page 42: "The IOTLB on the previous generation Intel Xeon Processor > E5-2690 does not natively support huge pages (it emulates them using 4K > pages)." > > And Figure 51 on Page 43 > > The "emulates them using 4K pages" probably means that the IOTLB is > flushed and filled with 512 adjacent 4k mappings. > > Citrix's measurements back up the findings in that paper, and also show > that performance is better when using plain 4k mappings as opposed to > emulated 2M mappings. > Thanks for the information. I'll forward it to HW team. If above interpretation is correct (which also matches my thought), then for two options you listed earlier: --- > This leaves two options > 1) 2M mappings are entirely uncached > 2) 2M mappings are shattered to 4K mappings and cached > The fact there is a 40% performance reduction suggests 1 rather than 2. --- looks 2) is suggested rather than 1). 
There are two further options:

2.1) 2M mappings are shattered into 512 adjacent 4k mappings, all of
which are cached;
2.2) Only the 4k mapping for the page being accessed, out of the 2M
mapping, is cached.

For 2.1), as IOTLB entries are limited, it may cause unnecessary IOTLB
entry flushes and thus incur more page walking overhead to refill them.
For 2.2), I can't think of a reason for the performance drop.

Thanks
Kevin
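Options 2.1) and 2.2) can be compared with a toy model. The sketch below is an illustration only and says nothing about real hardware: the 64-entry capacity, the FIFO replacement policy and all names are assumptions. It counts IOTLB misses for a device that alternates between two pages in different 2M regions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOTLB_ENTRIES 64   /* hypothetical capacity, not a hardware figure */
#define PAGES_PER_2M  512

struct iotlb {
    uint64_t pfn[IOTLB_ENTRIES];
    bool valid[IOTLB_ENTRIES];
    size_t next;           /* FIFO replacement cursor */
};

static bool iotlb_hit(const struct iotlb *t, uint64_t pfn)
{
    for (size_t i = 0; i < IOTLB_ENTRIES; i++)
        if (t->valid[i] && t->pfn[i] == pfn)
            return true;
    return false;
}

static void iotlb_insert(struct iotlb *t, uint64_t pfn)
{
    t->pfn[t->next] = pfn;
    t->valid[t->next] = true;
    t->next = (t->next + 1) % IOTLB_ENTRIES;
}

/*
 * Alternate between page 0 and page 512 (two different 2M regions).
 * shatter == true models 2.1): a miss fills all 512 neighbours.
 * shatter == false models 2.2): a miss fills only the accessed page.
 */
static size_t ping_pong_misses(size_t accesses, bool shatter)
{
    struct iotlb t = { 0 };
    size_t misses = 0;

    for (size_t i = 0; i < accesses; i++) {
        uint64_t pfn = (i % 2) ? PAGES_PER_2M : 0;

        if (iotlb_hit(&t, pfn))
            continue;
        misses++;
        if (shatter) {
            uint64_t base = pfn & ~(uint64_t)(PAGES_PER_2M - 1);

            for (uint64_t p = base; p < base + PAGES_PER_2M; p++)
                if (p != pfn)
                    iotlb_insert(&t, p);
        }
        iotlb_insert(&t, pfn);   /* accessed page goes in last */
    }
    return misses;
}
```

In this model the 2.2) policy misses twice and then always hits, while the 2.1) policy misses on every access because each fill evicts the other region's entry. Whether the real IOTLB behaves like either policy is not established in this thread.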
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
> From: Tian, Kevin > Sent: Thursday, December 03, 2015 9:20 AM > > > From: Andrew Cooper [mailto:andrew.coop...@citrix.com] > > Sent: Thursday, November 26, 2015 9:56 PM > > > > On 26/11/15 13:48, Malcolm Crossley wrote: > > > On 26/11/15 13:46, Jan Beulich wrote: > > > On 25.11.15 at 11:28,wrote: > > >>> The problem is that SandyBridge IOMMUs advertise 2M support and do > > >>> function with it, but cannot cache 2MB translations in the IOTLBs. > > >>> > > >>> As a result, attempting to use 2M translations causes substantially > > >>> worse performance than 4K translations. > > >> Btw - how does this get explained? At a first glance, even if 2Mb > > >> translations don't get entered into the TLB, it should still be one > > >> less page table level to walk for the IOMMU, and should hence > > >> nevertheless be a benefit. Yet you even say _substantially_ > > >> worse performance results. > > > There is a IOTLB for the 4K translation so if you only use 4K > > > translations then you get to take advantage of the IOTLB. > > > > > > If you use the 2Mb translation then a page table walk has to be > > > performed every time there's a DMA access to that region of the BFN > > > address space. > > > > Also remember that a high level dma access (from the point of view of a > > driver) will be fragmented at the PCIe max packet size, which is > > typically 256 bytes. > > > > So by not caching the 2Mb translation, a dma access of 4k may undergo 16 > > pagetable walks, one for each PCIe packet. > > > > We observed that using 2Mb mappings results in a 40% overhead, compared > > to using 4k mappings, from the point of view of a sample network workload. > > > > ~Andrew > > One confusion here. The original patch just disables shared_ept, w/o > changing IOMMU to not use 2MB mapping. Is there something missing > or other tricks behind? > > When you say using 4k mapping saves 40% overhead back, is it w/ > ept shared or not? > Just confirmed internally with HW team. 
On SNB 4KB cache is always used regardless of 4KB/2MB/1GB mapping.
There'd be another reason for this 40% drop observation...

Thanks
Kevin
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Thursday, November 26, 2015 9:56 PM
>
> On 26/11/15 13:48, Malcolm Crossley wrote:
> > On 26/11/15 13:46, Jan Beulich wrote:
> >>>>> On 25.11.15 at 11:28, wrote:
> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>
> >>> As a result, attempting to use 2M translations causes substantially
> >>> worse performance than 4K translations.
> >> Btw - how does this get explained? At a first glance, even if 2Mb
> >> translations don't get entered into the TLB, it should still be one
> >> less page table level to walk for the IOMMU, and should hence
> >> nevertheless be a benefit. Yet you even say _substantially_
> >> worse performance results.
> > There is a IOTLB for the 4K translation so if you only use 4K
> > translations then you get to take advantage of the IOTLB.
> >
> > If you use the 2Mb translation then a page table walk has to be
> > performed every time there's a DMA access to that region of the BFN
> > address space.
>
> Also remember that a high level dma access (from the point of view of a
> driver) will be fragmented at the PCIe max packet size, which is
> typically 256 bytes.
>
> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> pagetable walks, one for each PCIe packet.
>
> We observed that using 2Mb mappings results in a 40% overhead, compared
> to using 4k mappings, from the point of view of a sample network workload.
>
> ~Andrew

One confusion here. The original patch just disables shared_ept, w/o
changing IOMMU to not use 2MB mapping. Is there something missing
or other tricks behind?

When you say using 4k mapping saves 40% overhead back, is it w/
ept shared or not?

Thanks
Kevin
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
Based on the discussion below, can I assume there is an agreement on using
the processor model for filtering, or will chipset ID be the preferred candidate?

Thanks
Anshul Makkar

-----Original Message-----
From: Tian, Kevin [mailto:kevin.t...@intel.com]
Sent: 26 November 2015 07:17
To: Malcolm Crossley <malcolm.cross...@citrix.com>; Jan Beulich <jbeul...@suse.com>; Andrew Cooper <andrew.coop...@citrix.com>; Anshul Makkar <anshul.mak...@citrix.com>
Cc: Zhang, Yang Z <yang.z.zh...@intel.com>; xen-devel@lists.xen.org
Subject: RE: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

> From: Malcolm Crossley [mailto:malcolm.cross...@citrix.com]
> Sent: Wednesday, November 25, 2015 11:59 PM
>
> On 25/11/15 15:38, Jan Beulich wrote:
> >>>> On 25.11.15 at 16:13, <andrew.coop...@citrix.com> wrote:
> >> On 25/11/15 10:49, Jan Beulich wrote:
> >>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote:
> >>>> On 24/11/15 17:41, Jan Beulich wrote:
> >>>>>>>> On 24.11.15 at 18:17, wrote:
> >>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
> >>>>>>      /* Tylersburg interrupt remap quirk */
> >>>>>>      if ( iommu_intremap )
> >>>>>>          tylersburg_intremap_quirk();
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> >>>>>> +     * by default.
> >>>>>> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
> >>>>>> +     * on 4k pages and leads to performance degradation.
> >>>>>> +     *
> >>>>>> +     * Shared EPT ("sharept") will be disabled only if user has not
> >>>>>> +     * provided explicit choice on the command line thus iommu_hap_pt_share is
> >>>>>> +     * at its initialized value of -1.
> >>>>>> +     */
> >>>>>> +    if ( (boot_cpu_data.x86 == 0x06 &&
> >>>>>> +         (boot_cpu_data.x86_model <= 0x2F ||
> >>>>>> +          boot_cpu_data.x86_model == 0x36)) &&
> >>>>>> +         (iommu_hap_pt_share == -1) )
> >>>>>> +        iommu_hap_pt_share = 0;
> >>>>> If we really want to do this, then I think we should key this on
> >>>>> EPT but not VT-d having 2M support, instead of on CPU models.
> >>>> This check is already performed by vtd_ept_page_compatible()
> >>> Yeah, I realized there would be such a check on the way home.
> >>>
> >>>> The problem is that SandyBridge IOMMUs advertise 2M support and
> >>>> do function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>
> >>>> As a result, attempting to use 2M translations causes
> >>>> substantially worse performance than 4K translations.
> >>> So commit message and comment should make this more explicit, to
> >>> avoid the impression "IOTLB" isn't just the relatively common
> >>> mis-naming of "IOMMU".
> >>>
> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
> >>>
> >>> Further the model based check is relatively broad, and includes
> >>> Atoms (0x36 actually is one), which can't be considered
> >>> "Sandybridge or older" imo.
> >>>
> >>> And finally I'm not fully convinced using CPU model info to deduce
> >>> chipset behavior is entirely correct (albeit perhaps in practice
> >>> it'll be fine except maybe when running Xen itself virtualized).
> >>
> >> What else would you suggest? I can't think of any better
> >> identifying information.
> >
> > Chipset IDs / revisions?
>
> In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
> Unfortunately there's no register to query the IOTLB configuration of
> the IOMMU and so we're stuck identifying it via the processor model number
> itself.
>
> Malcolm
>

I'm OK to use processor model here, though ideally Jan is right. :-)

Thanks
Kevin
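The family/model condition from the quoted patch can be exercised on its own to see how broad it is. This is a standalone sketch, not Xen code: `patch_model_check` is a hypothetical helper that copies only the family and model test, and the model numbers in the checks (0x2A and 0x2D for Sandy Bridge, 0x3E for Ivy Bridge-EP, 0x1C for an older Atom) are assumptions taken from Intel's published model numbering, not from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Same family/model condition as the quoted patch, nothing else. */
static bool patch_model_check(unsigned int family, unsigned int model)
{
    return family == 0x06 && (model <= 0x2F || model == 0x36);
}

static void example_checks(void)
{
    assert(patch_model_check(0x06, 0x2A));   /* Sandy Bridge client */
    assert(patch_model_check(0x06, 0x2D));   /* Sandy Bridge-EP */
    assert(patch_model_check(0x06, 0x36));   /* Atom, as noted in the review */
    assert(patch_model_check(0x06, 0x1C));   /* older Atom also matches */
    assert(!patch_model_check(0x06, 0x3E));  /* Ivy Bridge-EP is excluded */
}
```

The check matches every family 6 model up to 0x2F, which is why the review above calls it relatively broad.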
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 01.12.15 at 17:45, wrote:
> Based on the discussion below, can I assume there is an agreement for using
> processor model for filtering or chipset ID will be the preferred candidate.

I think the subsequent suggestion by Andrew makes it even more
desirable to remain independent of CPU model here. As said before,
we should simply leverage the PCI IDs we already have quirks for.

Jan
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes substantially
>>>>> worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb
>>>> translations don't get entered into the TLB, it should still be one
>>>> less page table level to walk for the IOMMU, and should hence
>>>> nevertheless be a benefit. Yet you even say _substantially_
>>>> worse performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
>> Also remember that a high level dma access (from the point of view of a
>> driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>> pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, compared
>> to using 4k mappings, from the point of view of a sample network workload.
> How did you observe this? I am mighty curious what kind of performance tools
> you used to find this as I would love to figure out if some of the issues
> we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given
a workload which can normally saturate line rate on the card.

~Andrew
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
Snabbswitch (virtualized switch) also encountered similar problem : https://groups.google.com/forum/#!topic/snabb-devel/xX0yFzeXylI Thanks Anshul Makkar -Original Message- From: Andrew Cooper [mailto:andrew.coop...@citrix.com] Sent: 01 December 2015 10:34 To: Konrad Rzeszutek Wilk <konrad.w...@oracle.com> Cc: Jan Beulich <jbeul...@suse.com>; Kevin Tian <kevin.t...@intel.com>; yang.z.zh...@intel.com; Malcolm Crossley <malcolm.cross...@citrix.com>; Anshul Makkar <anshul.mak...@citrix.com>; xen-devel@lists.xen.org Subject: Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors. On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote: > On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote: >> On 26/11/15 13:48, Malcolm Crossley wrote: >>> On 26/11/15 13:46, Jan Beulich wrote: >>>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote: >>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do >>>>> function with it, but cannot cache 2MB translations in the IOTLBs. >>>>> >>>>> As a result, attempting to use 2M translations causes >>>>> substantially worse performance than 4K translations. >>>> Btw - how does this get explained? At a first glance, even if 2Mb >>>> translations don't get entered into the TLB, it should still be one >>>> less page table level to walk for the IOMMU, and should hence >>>> nevertheless be a benefit. Yet you even say _substantially_ worse >>>> performance results. >>> There is a IOTLB for the 4K translation so if you only use 4K >>> translations then you get to take advantage of the IOTLB. >>> >>> If you use the 2Mb translation then a page table walk has to be >>> performed every time there's a DMA access to that region of the BFN >>> address space. >> Also remember that a high level dma access (from the point of view of >> a >> driver) will be fragmented at the PCIe max packet size, which is >> typically 256 bytes. 
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo
>> 16 pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead,
>> compared to using 4k mappings, from the point of view of a sample network
>> workload.
> How did you observe this? I am mighty curious what kind of performance
> tools you used to find this as I would love to figure out if some of
> the issues we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given
a workload which can normally saturate line rate on the card.

~Andrew
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On Tue, Dec 01, 2015 at 10:34:17AM +, Andrew Cooper wrote: > On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote: > > On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote: > >> On 26/11/15 13:48, Malcolm Crossley wrote: > >>> On 26/11/15 13:46, Jan Beulich wrote: > >>> On 25.11.15 at 11:28,wrote: > > The problem is that SandyBridge IOMMUs advertise 2M support and do > > function with it, but cannot cache 2MB translations in the IOTLBs. > > > > As a result, attempting to use 2M translations causes substantially > > worse performance than 4K translations. > Btw - how does this get explained? At a first glance, even if 2Mb > translations don't get entered into the TLB, it should still be one > less page table level to walk for the IOMMU, and should hence > nevertheless be a benefit. Yet you even say _substantially_ > worse performance results. > >>> There is a IOTLB for the 4K translation so if you only use 4K > >>> translations then you get to take advantage of the IOTLB. > >>> > >>> If you use the 2Mb translation then a page table walk has to be > >>> performed every time there's a DMA access to that region of the BFN > >>> address space. > >> Also remember that a high level dma access (from the point of view of a > >> driver) will be fragmented at the PCIe max packet size, which is > >> typically 256 bytes. > >> > >> So by not caching the 2Mb translation, a dma access of 4k may undergo 16 > >> pagetable walks, one for each PCIe packet. > >> > >> We observed that using 2Mb mappings results in a 40% overhead, compared > >> to using 4k mappings, from the point of view of a sample network workload. > > How did you observe this? I am mighty curious what kind of performance tools > > you used to find this as I would love to figure out if some of the issues > > we have seen are related to this? > > The 40% difference is just in terms of network throughput of a VF, given > a workload which can normally saturate line rate on the card. I understand that. 
But I am curious on how you found out the page walks by the IOMMU were
so excessive? Were there any perf counters on the IOMMU that showed
a crazy amount of pagetable walks?

It is just that if I had looked at this I would have first looked at
interrupts, then kernels, then hypervisor - and eventually (after lots of
head banging) it would have occurred to me to look at the IOMMU pagetables.

>
> ~Andrew
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 01/12/15 15:24, Konrad Rzeszutek Wilk wrote: > On Tue, Dec 01, 2015 at 10:34:17AM +, Andrew Cooper wrote: >> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote: >>> On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote: On 26/11/15 13:48, Malcolm Crossley wrote: > On 26/11/15 13:46, Jan Beulich wrote: > On 25.11.15 at 11:28,wrote: >>> The problem is that SandyBridge IOMMUs advertise 2M support and do >>> function with it, but cannot cache 2MB translations in the IOTLBs. >>> >>> As a result, attempting to use 2M translations causes substantially >>> worse performance than 4K translations. >> Btw - how does this get explained? At a first glance, even if 2Mb >> translations don't get entered into the TLB, it should still be one >> less page table level to walk for the IOMMU, and should hence >> nevertheless be a benefit. Yet you even say _substantially_ >> worse performance results. > There is a IOTLB for the 4K translation so if you only use 4K > translations then you get to take advantage of the IOTLB. > > If you use the 2Mb translation then a page table walk has to be > performed every time there's a DMA access to that region of the BFN > address space. Also remember that a high level dma access (from the point of view of a driver) will be fragmented at the PCIe max packet size, which is typically 256 bytes. So by not caching the 2Mb translation, a dma access of 4k may undergo 16 pagetable walks, one for each PCIe packet. We observed that using 2Mb mappings results in a 40% overhead, compared to using 4k mappings, from the point of view of a sample network workload. >>> How did you observe this? I am mighty curious what kind of performance tools >>> you used to find this as I would love to figure out if some of the issues >>> we have seen are related to this? >> The 40% difference is just in terms of network throughput of a VF, given >> a workload which can normally saturate line rate on the card. > I understand that. 
>
> But I am curious on how you found out the page walks by the IOMMU were
> so excessive?

I didn't. It is all speculation drawn from other information.

The manual states that there is not a superpage IOTLB. This leaves two
options:

1) 2M mappings are entirely uncached
2) 2M mappings are shattered to 4K mappings and cached

The fact there is a 40% performance reduction suggests 1 rather than 2.

~Andrew
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
> On 26/11/15 13:48, Malcolm Crossley wrote:
> > On 26/11/15 13:46, Jan Beulich wrote:
> >>>>> On 25.11.15 at 11:28, wrote:
> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>
> >>> As a result, attempting to use 2M translations causes substantially
> >>> worse performance than 4K translations.
> >> Btw - how does this get explained? At a first glance, even if 2Mb
> >> translations don't get entered into the TLB, it should still be one
> >> less page table level to walk for the IOMMU, and should hence
> >> nevertheless be a benefit. Yet you even say _substantially_
> >> worse performance results.
> > There is a IOTLB for the 4K translation so if you only use 4K
> > translations then you get to take advantage of the IOTLB.
> >
> > If you use the 2Mb translation then a page table walk has to be
> > performed every time there's a DMA access to that region of the BFN
> > address space.
>
> Also remember that a high level dma access (from the point of view of a
> driver) will be fragmented at the PCIe max packet size, which is
> typically 256 bytes.
>
> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> pagetable walks, one for each PCIe packet.
>
> We observed that using 2Mb mappings results in a 40% overhead, compared
> to using 4k mappings, from the point of view of a sample network workload.

How did you observe this? I am mighty curious what kind of performance tools
you used to find this as I would love to figure out if some of the issues
we have seen are related to this?

>
> ~Andrew
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 25.11.15 at 16:58, wrote:
> On 25/11/15 15:38, Jan Beulich wrote:
>>>>> On 25.11.15 at 16:13, wrote:
>>> On 25/11/15 10:49, Jan Beulich wrote:
>>>> And finally I'm not fully convinced using CPU model info to deduce
>>>> chipset behavior is entirely correct (albeit perhaps in practice
>>>> it'll be fine except maybe when running Xen itself virtualized).
>>>
>>> What else would you suggest? I can't think of any better identifying
>>> information.
>>
>> Chipset IDs / revisions?
>
> In this case the IOMMU is integrated into the Sandybridge-EP processor
> itself.

Which doesn't preclude it to be identified via PCI device ID - after all
there are dozens of processor integrated PCI devices. Looking at
one of my systems,

00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)
80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, VTd_Misc, System Management [8086:3c28] (rev 07)

could be a candidate (we already key a quirk on this device in
pci_vtd_quirk()).

Jan
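Keying on a PCI ID, as suggested above, amounts to a table lookup. The following is a minimal sketch with hypothetical names; it is not the existing pci_vtd_quirk() code, and the table holds only the single vendor:device pair shown in the lspci output above, so it is an example rather than a complete list of affected parts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Only the ID from the lspci output above; any real list would be longer. */
static const struct pci_id no_superpage_iotlb_ids[] = {
    { 0x8086, 0x3c28 },   /* Sandy Bridge Address Map, VTd_Misc, System Management */
};

static bool id_is_listed(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(no_superpage_iotlb_ids) / sizeof(no_superpage_iotlb_ids[0]);

    for (size_t i = 0; i < n; i++)
        if (no_superpage_iotlb_ids[i].vendor == vendor &&
            no_superpage_iotlb_ids[i].device == device)
            return true;
    return false;
}

static void example_checks(void)
{
    assert(id_is_listed(0x8086, 0x3c28));
    assert(!id_is_listed(0x8086, 0x0000));
}
```

A table like this keeps the quirk independent of CPU model, at the cost of having to enumerate every affected device.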
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 26/11/15 08:45, Jan Beulich wrote: On 25.11.15 at 16:58,wrote: >> On 25/11/15 15:38, Jan Beulich wrote: >> On 25.11.15 at 16:13, wrote: On 25/11/15 10:49, Jan Beulich wrote: > And finally I'm not fully convinced using CPU model info to deduce > chipset behavior is entirely correct (albeit perhaps in practice it'll > be fine except maybe when running Xen itself virtualized). What else would you suggest? I can't think of any better identifying information. >>> Chipset IDs / revisions? >> In this case the IOMMU is integrated into the Sandybridge-EP processor >> itself. > Which doesn't preclude it to be identified via PCI device ID - after all > there are dozens of processor integrated PCI devices. Looking at > one of my systems, > > 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, > VTd_Misc, System Management [8086:3c28] (rev 07) > 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, > VTd_Misc, System Management [8086:3c28] (rev 07) > > could be a candidate (we already key a quirk on this device in > pci_vtd_quirk()). These are fine for server variants, but not for desktop variants, both of which we have seen in use. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 26.11.15 at 11:27,wrote: > On 26/11/15 08:45, Jan Beulich wrote: > On 25.11.15 at 16:58, wrote: >>> On 25/11/15 15:38, Jan Beulich wrote: >>> On 25.11.15 at 16:13, wrote: > On 25/11/15 10:49, Jan Beulich wrote: >> And finally I'm not fully convinced using CPU model info to deduce >> chipset behavior is entirely correct (albeit perhaps in practice it'll >> be fine except maybe when running Xen itself virtualized). > What else would you suggest? I can't think of any better identifying > information. Chipset IDs / revisions? >>> In this case the IOMMU is integrated into the Sandybridge-EP processor >>> itself. >> Which doesn't preclude it to be identified via PCI device ID - after all >> there are dozens of processor integrated PCI devices. Looking at >> one of my systems, >> >> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address > Map, VTd_Misc, System Management [8086:3c28] (rev 07) >> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address > Map, VTd_Misc, System Management [8086:3c28] (rev 07) >> >> could be a candidate (we already key a quirk on this device in >> pci_vtd_quirk()). > > These are fine for server variants, but not for desktop variants, both > of which we have seen in use. And I gave them only as an example that keying off of PCI IDs would be possible. A complete list would of course need to be compiled (but I think we could simply derive it from the list of IDs we already deal with in quirks.c). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
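A minimal sketch of what keying the quirk off PCI IDs could look like. The table and the function name `pci_id_needs_quirk()` are hypothetical; the only entry is the 8086:3c28 device quoted above, and a real list would still have to be compiled (and, as noted in the thread, would need something else for parts that do not expose such a device).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One vendor/device pair identifying an affected platform. */
struct quirk_pci_id {
    uint16_t vendor;
    uint16_t device;
};

/*
 * Illustrative table only: 8086:3c28 is the "Sandy Bridge Address Map,
 * VTd_Misc, System Management" device from the lspci output above.
 */
static const struct quirk_pci_id quirk_ids[] = {
    { 0x8086, 0x3c28 },
};

/* Hypothetical helper: true if the given PCI ID is in the quirk table. */
static bool pci_id_needs_quirk(uint16_t vendor, uint16_t device)
{
    size_t i;

    for ( i = 0; i < sizeof(quirk_ids) / sizeof(quirk_ids[0]); i++ )
        if ( quirk_ids[i].vendor == vendor && quirk_ids[i].device == device )
            return true;

    return false;
}
```

Extending coverage would then be a matter of adding rows to the table rather than widening a CPU model comparison.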
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 26/11/15 10:39, Jan Beulich wrote: On 26.11.15 at 11:27,wrote: >> On 26/11/15 08:45, Jan Beulich wrote: >> On 25.11.15 at 16:58, wrote: On 25/11/15 15:38, Jan Beulich wrote: On 25.11.15 at 16:13, wrote: >> On 25/11/15 10:49, Jan Beulich wrote: >>> And finally I'm not fully convinced using CPU model info to deduce >>> chipset behavior is entirely correct (albeit perhaps in practice it'll >>> be fine except maybe when running Xen itself virtualized). >> What else would you suggest? I can't think of any better identifying >> information. > Chipset IDs / revisions? In this case the IOMMU is integrated into the Sandybridge-EP processor itself. >>> Which doesn't preclude it to be identified via PCI device ID - after all >>> there are dozens of processor integrated PCI devices. Looking at >>> one of my systems, >>> >>> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address >> Map, VTd_Misc, System Management [8086:3c28] (rev 07) >>> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address >> Map, VTd_Misc, System Management [8086:3c28] (rev 07) >>> could be a candidate (we already key a quirk on this device in >>> pci_vtd_quirk()). >> These are fine for server variants, but not for desktop variants, both >> of which we have seen in use. > And I gave them only as an example that keying off of PCI IDs > would be possible. A complete list would of course need to be > compiled (but I think we could simply derive it from the list of IDs > we already deal with in quirks.c). That is not my point. The Desktop variants do not expose their internals as PCI devices. Keying on the host bridge might be an option. Also, on further consideration, the better fix (however we identify the affected systems) would be to quirk the IOMMUs themselves into not claiming 2M/1G superpage support. Otherwise, when we do eventually get superpage IOMMU mapping support in the API, the performance regression will creep back in. 
~Andrew
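A rough sketch of the "quirk the IOMMU into not claiming superpage support" idea, assuming the VT-d Capability Register layout in which the second-level large page support field has bit 34 for 2M and bit 35 for 1G. The helper names are hypothetical and this is not the eventual patch; how the affected systems get identified is left as a plain boolean input.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed VT-d Capability Register bits: 2M and 1G second-level superpages. */
#define CAP_SPS_2MB (UINT64_C(1) << 34)
#define CAP_SPS_1GB (UINT64_C(1) << 35)

static bool cap_has_2mb(uint64_t cap) { return (cap & CAP_SPS_2MB) != 0; }
static bool cap_has_1gb(uint64_t cap) { return (cap & CAP_SPS_1GB) != 0; }

/*
 * Hypothetical helper: given the capability value read from hardware,
 * return the value the rest of the code should see. On affected systems
 * the superpage bits are cleared; everything else is passed through.
 */
static uint64_t quirk_hide_superpages(uint64_t cap, bool affected)
{
    if ( affected )
        cap &= ~(CAP_SPS_2MB | CAP_SPS_1GB);

    return cap;
}
```

With the capability masked at the source, any later superpage-aware mapping code would simply see an IOMMU limited to 4k mappings, which is the property the message above is after.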
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 25.11.15 at 11:28, wrote:
> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
>
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.

Btw - how does this get explained? At a first glance, even if 2Mb translations don't get entered into the TLB, it should still be one less page table level to walk for the IOMMU, and should hence nevertheless be a benefit. Yet you even say _substantially_ worse performance results.

Jan
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 26/11/15 13:46, Jan Beulich wrote:
> On 25.11.15 at 11:28, wrote:
>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>
>> As a result, attempting to use 2M translations causes substantially
>> worse performance than 4K translations.
>
> Btw - how does this get explained? At a first glance, even if 2Mb
> translations don't get entered into the TLB, it should still be one
> less page table level to walk for the IOMMU, and should hence
> nevertheless be a benefit. Yet you even say _substantially_
> worse performance results.

There is an IOTLB for the 4K translation, so if you only use 4K translations then you get to take advantage of the IOTLB.

If you use the 2Mb translation then a page table walk has to be performed every time there's a DMA access to that region of the BFN address space.

Malcolm
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 26/11/15 13:48, Malcolm Crossley wrote:
> On 26/11/15 13:46, Jan Beulich wrote:
>> On 25.11.15 at 11:28, wrote:
>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>
>>> As a result, attempting to use 2M translations causes substantially
>>> worse performance than 4K translations.
>> Btw - how does this get explained? At a first glance, even if 2Mb
>> translations don't get entered into the TLB, it should still be one
>> less page table level to walk for the IOMMU, and should hence
>> nevertheless be a benefit. Yet you even say _substantially_
>> worse performance results.
> There is a IOTLB for the 4K translation so if you only use 4K
> translations then you get to take advantage of the IOTLB.
>
> If you use the 2Mb translation then a page table walk has to be
> performed every time there's a DMA access to that region of the BFN
> address space.

Also remember that a high-level DMA access (from the point of view of a driver) will be fragmented at the PCIe max packet size, which is typically 256 bytes.

So by not caching the 2Mb translation, a DMA access of 4k may undergo 16 pagetable walks, one for each PCIe packet.

We observed that using 2Mb mappings results in a 40% overhead, compared to using 4k mappings, from the point of view of a sample network workload.

~Andrew
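The fragmentation arithmetic in the message above can be sketched as follows. This is a simplified model, not a description of the hardware: it assumes one translation lookup per PCIe packet, and that a cached translation costs a single walk to fill the IOTLB with every later packet hitting it. The function names are made up for illustration.

```c
#include <stddef.h>

/* Number of PCIe packets a DMA access is split into, rounding up. */
static size_t pcie_packets(size_t dma_bytes, size_t max_packet_bytes)
{
    return (dma_bytes + max_packet_bytes - 1) / max_packet_bytes;
}

/*
 * Page table walks needed for one DMA access under the simplified model:
 * - translation cached in the IOTLB: one walk, then hits;
 * - translation not cacheable: one walk per PCIe packet.
 */
static size_t pagetable_walks(size_t dma_bytes, size_t max_packet_bytes,
                              int translation_cached)
{
    size_t packets = pcie_packets(dma_bytes, max_packet_bytes);

    if ( packets == 0 )
        return 0;

    return translation_cached ? 1 : packets;
}
```

For a 4k access and a 256 byte max packet size this gives 4096 / 256 = 16 packets, hence up to 16 walks when the translation cannot be cached, against a single walk when it can.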
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 25.11.15 at 16:13,wrote: > On 25/11/15 10:49, Jan Beulich wrote: > On 25.11.15 at 11:28, wrote: >>> On 24/11/15 17:41, Jan Beulich wrote: >>> On 24.11.15 at 18:17, wrote: > --- a/xen/drivers/passthrough/vtd/quirks.c > +++ b/xen/drivers/passthrough/vtd/quirks.c > @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) > /* Tylersburg interrupt remap quirk */ > if ( iommu_intremap ) > tylersburg_intremap_quirk(); > + > +/* > + * Disable shared EPT ("sharept") on Sandybridge and older processors > + * by default. > + * SandyBridge has no huge page support for IOTLB which leads to > fallback > + * on 4k pages and leads to performance degradation. > + * > + * Shared EPT ("sharept") will be disabled only if user has not > + * provided explicit choice on the command line thus > iommu_hap_pt_share > is > + * at its initialized value of -1. > + */ > +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F > || > + boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == > -1) ) > +iommu_hap_pt_share = 0; If we really want to do this, then I think we should key this on EPT but not VT-d having 2M support, instead of on CPU models. >>> This check is already performed by vtd_ept_page_compatible() >> Yeah, I realized there would be such a check on the way home. >> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do >>> function with it, but cannot cache 2MB translations in the IOTLBs. >>> >>> As a result, attempting to use 2M translations causes substantially >>> worse performance than 4K translations. >> So commit message and comment should make this more explicit, >> to avoid the impression "IOTLB" isn't just the relatively common >> mis-naming of "IOMMU". >> >> Plus I guess the sharing won't need suppressing if !opt_hap_2mb? >> >> Further the model based check is relatively broad, and includes >> Atoms (0x36 actually is one), which can't be considered "Sandybridge >> or older" imo. 
>> >> And finally I'm not fully convinced using CPU model info to deduce >> chipset behavior is entirely correct (albeit perhaps in practice it'll >> be fine except maybe when running Xen itself virtualized). > > What else would you suggest? I can't think of any better identifying > information. Chipset IDs / revisions? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 25/11/15 15:38, Jan Beulich wrote: On 25.11.15 at 16:13,wrote: >> On 25/11/15 10:49, Jan Beulich wrote: >> On 25.11.15 at 11:28, wrote: On 24/11/15 17:41, Jan Beulich wrote: On 24.11.15 at 18:17, wrote: >> --- a/xen/drivers/passthrough/vtd/quirks.c >> +++ b/xen/drivers/passthrough/vtd/quirks.c >> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) >> /* Tylersburg interrupt remap quirk */ >> if ( iommu_intremap ) >> tylersburg_intremap_quirk(); >> + >> +/* >> + * Disable shared EPT ("sharept") on Sandybridge and older >> processors >> + * by default. >> + * SandyBridge has no huge page support for IOTLB which leads to >> fallback >> + * on 4k pages and leads to performance degradation. >> + * >> + * Shared EPT ("sharept") will be disabled only if user has not >> + * provided explicit choice on the command line thus >> iommu_hap_pt_share >> is >> + * at its initialized value of -1. >> + */ >> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F >> || >> + boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == >> -1) ) >> +iommu_hap_pt_share = 0; > If we really want to do this, then I think we should key this on > EPT but not VT-d having 2M support, instead of on CPU models. This check is already performed by vtd_ept_page_compatible() >>> Yeah, I realized there would be such a check on the way home. >>> The problem is that SandyBridge IOMMUs advertise 2M support and do function with it, but cannot cache 2MB translations in the IOTLBs. As a result, attempting to use 2M translations causes substantially worse performance than 4K translations. >>> So commit message and comment should make this more explicit, >>> to avoid the impression "IOTLB" isn't just the relatively common >>> mis-naming of "IOMMU". >>> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb? >>> >>> Further the model based check is relatively broad, and includes >>> Atoms (0x36 actually is one), which can't be considered "Sandybridge >>> or older" imo. 
>>> >>> And finally I'm not fully convinced using CPU model info to deduce >>> chipset behavior is entirely correct (albeit perhaps in practice it'll >>> be fine except maybe when running Xen itself virtualized). >> >> What else would you suggest? I can't think of any better identifying >> information. > > Chipset IDs / revisions? In this case the IOMMU is integrated into the Sandybridge-EP processor itself. Unfortunately there's no register to query the IOTLB configuration of the IOMMU and so we're stuck identifying the via the processor model number itself. Malcolm > > Jan > > > ___ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel > ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 25/11/15 10:49, Jan Beulich wrote: On 25.11.15 at 11:28,wrote: >> On 24/11/15 17:41, Jan Beulich wrote: >> On 24.11.15 at 18:17, wrote: --- a/xen/drivers/passthrough/vtd/quirks.c +++ b/xen/drivers/passthrough/vtd/quirks.c @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) /* Tylersburg interrupt remap quirk */ if ( iommu_intremap ) tylersburg_intremap_quirk(); + +/* + * Disable shared EPT ("sharept") on Sandybridge and older processors + * by default. + * SandyBridge has no huge page support for IOTLB which leads to fallback + * on 4k pages and leads to performance degradation. + * + * Shared EPT ("sharept") will be disabled only if user has not + * provided explicit choice on the command line thus iommu_hap_pt_share is + * at its initialized value of -1. + */ +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F || + boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) ) +iommu_hap_pt_share = 0; >>> If we really want to do this, then I think we should key this on >>> EPT but not VT-d having 2M support, instead of on CPU models. >> This check is already performed by vtd_ept_page_compatible() > Yeah, I realized there would be such a check on the way home. > >> The problem is that SandyBridge IOMMUs advertise 2M support and do >> function with it, but cannot cache 2MB translations in the IOTLBs. >> >> As a result, attempting to use 2M translations causes substantially >> worse performance than 4K translations. > So commit message and comment should make this more explicit, > to avoid the impression "IOTLB" isn't just the relatively common > mis-naming of "IOMMU". > > Plus I guess the sharing won't need suppressing if !opt_hap_2mb? > > Further the model based check is relatively broad, and includes > Atoms (0x36 actually is one), which can't be considered "Sandybridge > or older" imo. 
> > And finally I'm not fully convinced using CPU model info to deduce > chipset behavior is entirely correct (albeit perhaps in practice it'll > be fine except maybe when running Xen itself virtualized). What else would you suggest? I can't think of any better identifying information. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
> From: Malcolm Crossley [mailto:malcolm.cross...@citrix.com] > Sent: Wednesday, November 25, 2015 11:59 PM > > On 25/11/15 15:38, Jan Beulich wrote: > On 25.11.15 at 16:13,wrote: > >> On 25/11/15 10:49, Jan Beulich wrote: > >> On 25.11.15 at 11:28, wrote: > On 24/11/15 17:41, Jan Beulich wrote: > On 24.11.15 at 18:17, wrote: > >> --- a/xen/drivers/passthrough/vtd/quirks.c > >> +++ b/xen/drivers/passthrough/vtd/quirks.c > >> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) > >> /* Tylersburg interrupt remap quirk */ > >> if ( iommu_intremap ) > >> tylersburg_intremap_quirk(); > >> + > >> +/* > >> + * Disable shared EPT ("sharept") on Sandybridge and older > >> processors > >> + * by default. > >> + * SandyBridge has no huge page support for IOTLB which leads to > >> fallback > >> + * on 4k pages and leads to performance degradation. > >> + * > >> + * Shared EPT ("sharept") will be disabled only if user has not > >> + * provided explicit choice on the command line thus > >> iommu_hap_pt_share > >> is > >> + * at its initialized value of -1. > >> + */ > >> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= > >> 0x2F > || > >> + boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == > -1) ) > >> +iommu_hap_pt_share = 0; > > If we really want to do this, then I think we should key this on > > EPT but not VT-d having 2M support, instead of on CPU models. > This check is already performed by vtd_ept_page_compatible() > >>> Yeah, I realized there would be such a check on the way home. > >>> > The problem is that SandyBridge IOMMUs advertise 2M support and do > function with it, but cannot cache 2MB translations in the IOTLBs. > > As a result, attempting to use 2M translations causes substantially > worse performance than 4K translations. > >>> So commit message and comment should make this more explicit, > >>> to avoid the impression "IOTLB" isn't just the relatively common > >>> mis-naming of "IOMMU". 
> >>> > >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb? > >>> > >>> Further the model based check is relatively broad, and includes > >>> Atoms (0x36 actually is one), which can't be considered "Sandybridge > >>> or older" imo. > >>> > >>> And finally I'm not fully convinced using CPU model info to deduce > >>> chipset behavior is entirely correct (albeit perhaps in practice it'll > >>> be fine except maybe when running Xen itself virtualized). > >> > >> What else would you suggest? I can't think of any better identifying > >> information. > > > > Chipset IDs / revisions? > > In this case the IOMMU is integrated into the Sandybridge-EP processor itself. > Unfortunately there's no register to query the IOTLB configuration of the > IOMMU > and so we're stuck identifying the via the processor model number itself. > > Malcolm > I'm OK to use processor model here, though ideally Jan is right. :-) Thanks Kevin ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 24/11/15 17:41, Jan Beulich wrote: On 24.11.15 at 18:17, wrote: >> --- a/xen/drivers/passthrough/vtd/quirks.c >> +++ b/xen/drivers/passthrough/vtd/quirks.c >> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) >> /* Tylersburg interrupt remap quirk */ >> if ( iommu_intremap ) >> tylersburg_intremap_quirk(); >> + >> +/* >> + * Disable shared EPT ("sharept") on Sandybridge and older processors >> + * by default. >> + * SandyBridge has no huge page support for IOTLB which leads to >> fallback >> + * on 4k pages and leads to performance degradation. >> + * >> + * Shared EPT ("sharept") will be disabled only if user has not >> + * provided explicit choice on the command line thus iommu_hap_pt_share >> is >> + * at its initialized value of -1. >> + */ >> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F || >> + boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) ) >> +iommu_hap_pt_share = 0; > If we really want to do this, then I think we should key this on > EPT but not VT-d having 2M support, instead of on CPU models. This check is already performed by vtd_ept_page_compatible() The problem is that SandyBridge IOMMUs advertise 2M support and do function with it, but cannot cache 2MB translations in the IOTLBs. As a result, attempting to use 2M translations causes substantially worse performance than 4K translations. ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 25.11.15 at 11:28,wrote: > On 24/11/15 17:41, Jan Beulich wrote: > On 24.11.15 at 18:17, wrote: >>> --- a/xen/drivers/passthrough/vtd/quirks.c >>> +++ b/xen/drivers/passthrough/vtd/quirks.c >>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void) >>> /* Tylersburg interrupt remap quirk */ >>> if ( iommu_intremap ) >>> tylersburg_intremap_quirk(); >>> + >>> +/* >>> + * Disable shared EPT ("sharept") on Sandybridge and older processors >>> + * by default. >>> + * SandyBridge has no huge page support for IOTLB which leads to >>> fallback >>> + * on 4k pages and leads to performance degradation. >>> + * >>> + * Shared EPT ("sharept") will be disabled only if user has not >>> + * provided explicit choice on the command line thus >>> iommu_hap_pt_share is >>> + * at its initialized value of -1. >>> + */ >>> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F || >>> + boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) ) >>> +iommu_hap_pt_share = 0; >> If we really want to do this, then I think we should key this on >> EPT but not VT-d having 2M support, instead of on CPU models. > > This check is already performed by vtd_ept_page_compatible() Yeah, I realized there would be such a check on the way home. > The problem is that SandyBridge IOMMUs advertise 2M support and do > function with it, but cannot cache 2MB translations in the IOTLBs. > > As a result, attempting to use 2M translations causes substantially > worse performance than 4K translations. So commit message and comment should make this more explicit, to avoid the impression "IOTLB" isn't just the relatively common mis-naming of "IOMMU". Plus I guess the sharing won't need suppressing if !opt_hap_2mb? Further the model based check is relatively broad, and includes Atoms (0x36 actually is one), which can't be considered "Sandybridge or older" imo. 
And finally I'm not fully convinced using CPU model info to deduce chipset behavior is entirely correct (albeit perhaps in practice it'll be fine except maybe when running Xen itself virtualized). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 24.11.15 at 18:17, wrote:
> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>      /* Tylersburg interrupt remap quirk */
>      if ( iommu_intremap )
>          tylersburg_intremap_quirk();
> +
> +    /*
> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> +     * by default.
> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
> +     * on 4k pages and leads to performance degradation.
> +     *
> +     * Shared EPT ("sharept") will be disabled only if user has not
> +     * provided explicit choice on the command line thus iommu_hap_pt_share is
> +     * at its initialized value of -1.
> +     */
> +    if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
> +        boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
> +        iommu_hap_pt_share = 0;

If we really want to do this, then I think we should key this on EPT but not VT-d having 2M support, instead of on CPU models.

Also - with the above only marginally relevant - the line split and/or indentation is wrong.

Jan
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 01.09.15 at 16:18, wrote:
> On 31/08/15 09:09, Jan Beulich wrote:
>> On 28.08.15 at 17:41, wrote:
>>> --- a/docs/misc/xen-command-line.markdown
>>> +++ b/docs/misc/xen-command-line.markdown
>>> @@ -896,7 +896,7 @@ debug hypervisor only).
>>>
>>>  > `sharept`
>>>
>>> -> Default: `true`
>>> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.
>> This neglects the AMD side.
>
> The AMD side has iommu_hap_pt_share unconditionally disabled, for
> reasons pertaining to grant mapped frames.

Yet that doesn't eliminate the desire to have the correct default spelled out here.

Jan
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 31/08/15 09:09, Jan Beulich wrote:
> On 28.08.15 at 17:41, wrote:
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -896,7 +896,7 @@ debug hypervisor only).
>>
>>  > `sharept`
>>
>> -> Default: `true`
>> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.
> This neglects the AMD side.

The AMD side has iommu_hap_pt_share unconditionally disabled, for reasons pertaining to grant mapped frames.

~Andrew
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
>>> On 28.08.15 at 17:41, wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -896,7 +896,7 @@ debug hypervisor only).
>
>  > `sharept`
>
> -> Default: `true`
> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or earlier.

This neglects the AMD side.

> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -54,6 +54,7 @@ bool_t __read_mostly iommu_intremap = 1;
>  bool_t __read_mostly iommu_hap_pt_share = 1;
>  bool_t __read_mostly iommu_debug;
>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> +bool_t __read_mostly iommu_sharept_set = 0;

_If_ you really want to introduce a new variable, it ought to be __initdata (since only __init functions reference it). My preference though would be to make the existing variable a tristate, starting out with a value of -1 (and having type s8).

> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,19 @@ void __init platform_quirks_init(void)
>      /* Tylersburg interrupt remap quirk */
>      if ( iommu_intremap )
>          tylersburg_intremap_quirk();
> +
> +    /*
> +     * Disable shared EPT ("sharept") on Sandybridge and older processors
> +     * by default.
> +     * SandyBridge has no huge page support for IOTLB which leads to fallback
> +     * on 4k pages and leads to performance degradation.
> +     *
> +     * Shared EPT ("sharept") will be disabled only if user has not
> +     * provided explicit choice on the command line.
> +     */
> +    if ( (boot_cpu_data.x86 == 6) &&
> +         (boot_cpu_data.x86_model <= 0x2a) && !iommu_sharept_set )
> +        iommu_hap_pt_share = 0;

Model 0x2d certainly is also Sandybridge. Models 0x2c, 0x2e, and 0x2f are even older architectures. And then there are various Atoms at higher numbers which I'm not sure can be considered "newer than Sandybridge" architecture wise. I.e. I don't think you can get away with a simple, single relation here.

Jan
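A minimal sketch of the two suggestions in the review above: a tristate setting that starts at -1, and an explicit model table in place of a single `<=` relation. All names here are hypothetical, `s8` is stood in for by `int8_t`, and the table lists only the models named in the review, so it is illustrative rather than complete.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int8_t s8;

/* Tristate: -1 = no choice on the command line, 0 = disabled, 1 = enabled. */
#define SHAREPT_UNSET ((s8)-1)

/* Illustrative only: the family 6 models mentioned in the review. */
static const uint8_t listed_models[] = { 0x2a, 0x2c, 0x2d, 0x2e, 0x2f };

/* Hypothetical helper: exact-match lookup instead of a range comparison. */
static bool model_is_listed(unsigned int family, unsigned int model)
{
    size_t i;

    if ( family != 6 )
        return false;

    for ( i = 0; i < sizeof(listed_models) / sizeof(listed_models[0]); i++ )
        if ( listed_models[i] == model )
            return true;

    return false;
}

/*
 * Hypothetical helper: an explicit user choice (0 or 1) always wins;
 * only an unset value is replaced by the platform default.
 */
static s8 resolve_sharept(s8 setting, bool platform_affected)
{
    if ( setting != SHAREPT_UNSET )
        return setting;

    return platform_affected ? 0 : 1;
}
```

The tristate removes the need for a separate "was it set" variable, and the table makes it obvious which models are covered, so an Atom with a higher model number is not swept in by accident.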
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.
On 28/08/15 16:41, Anshul Makkar anshul.makkar@citrix.com wrote:
> From: anshulma anshul.mak...@citrix.com
>
> Sandybridge or earlier processors don't have huge page support for
> IOTLB which leads to fallback on 4k pages and causes performance issues.
>
> Shared EPT will be disabled only if the user has not provided explicit
> choice on the command line.
>
> Signed-off-by: Anshul Makkar anshul.mak...@citrix.com

As a note concerning the performance issues, for some IO workloads, this nets a 40% throughput improvement. We did not observe any IO workloads which had a worse performance as a result of disabling shared ept.

Reviewed-by: Andrew Cooper andrew.coop...@citrix.com

~Andrew