Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-03 Thread Andrew Cooper
On 03/12/15 08:50, Tian, Kevin wrote:
>> From: Jan Beulich [mailto:jbeul...@suse.com]
>> Sent: Thursday, December 03, 2015 4:18 PM
>>
> On 03.12.15 at 03:40,  wrote:
>>> Just confirmed internally with HW team. On SNB 4KB cache is always
>>> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
>>> for this 40% drop observation...
>> So when they stated that the 4k TLB gets always used, did they at
>> least provide some thoughts on what else might be causing this
>> severe a performance impact? Without them helping we're left
>> guessing...
>>
> Unfortunately no clear answer...

http://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf

Page 42: "The IOTLB on the previous generation Intel Xeon Processor
E5-2690 does not natively support huge pages (it emulates them using 4K
pages)."

And Figure 51 on Page 43

The "emulates them using 4K pages" probably means that the IOTLB is
flushed and filled with 512 adjacent 4k mappings.

Citrix's measurements back up the findings in that paper, and also show
that performance is better when using plain 4k mappings as opposed to
emulated 2M mappings.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-03 Thread Andrew Cooper
On 03/12/15 01:19, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
>> Sent: Thursday, November 26, 2015 9:56 PM
>>
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>> On 25.11.15 at 11:28,  wrote:
> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
>
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.
 Btw - how does this get explained? At a first glance, even if 2Mb
 translations don't get entered into the TLB, it should still be one
 less page table level to walk for the IOMMU, and should hence
 nevertheless be a benefit. Yet you even say _substantially_
 worse performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
>> Also remember that a high level dma access (from the point of view of a
>> driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>> pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, compared
>> to using 4k mappings, from the point of view of a sample network workload.
>>
>> ~Andrew
> One confusion here. The original patch just disables shared_ept, w/o
> changing IOMMU to not use 2MB mapping. Is there something missing
> or other tricks behind?

Disabling shared_ept works because the Xen IOMMU interface doesn't
support superpages.

However, I subsequently changed my mind in this thread about the
approach taken, and quirking the IOMMUs not to report superpage
capabilities is the correct solution to the problem.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-03 Thread Jan Beulich
>>> On 03.12.15 at 03:40,  wrote:
> Just confirmed internally with HW team. On SNB 4KB cache is always
> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
> for this 40% drop observation...

So when they stated that the 4k TLB gets always used, did they at
least provide some thoughts on what else might be causing this
severe a performance impact? Without them helping we're left
guessing...

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-03 Thread Tian, Kevin
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Thursday, December 03, 2015 7:24 PM
> 
> On 03/12/15 01:19, Tian, Kevin wrote:
> >> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> >> Sent: Thursday, November 26, 2015 9:56 PM
> >>
> >> On 26/11/15 13:48, Malcolm Crossley wrote:
> >>> On 26/11/15 13:46, Jan Beulich wrote:
> >>> On 25.11.15 at 11:28,  wrote:
> > The problem is that SandyBridge IOMMUs advertise 2M support and do
> > function with it, but cannot cache 2MB translations in the IOTLBs.
> >
> > As a result, attempting to use 2M translations causes substantially
> > worse performance than 4K translations.
>  Btw - how does this get explained? At a first glance, even if 2Mb
>  translations don't get entered into the TLB, it should still be one
>  less page table level to walk for the IOMMU, and should hence
>  nevertheless be a benefit. Yet you even say _substantially_
>  worse performance results.
> >>> There is a IOTLB for the 4K translation so if you only use 4K
> >>> translations then you get to take advantage of the IOTLB.
> >>>
> >>> If you use the 2Mb translation then a page table walk has to be
> >>> performed every time there's a DMA access to that region of the BFN
> >>> address space.
> >> Also remember that a high level dma access (from the point of view of a
> >> driver) will be fragmented at the PCIe max packet size, which is
> >> typically 256 bytes.
> >>
> >> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> >> pagetable walks, one for each PCIe packet.
> >>
> >> We observed that using 2Mb mappings results in a 40% overhead, compared
> >> to using 4k mappings, from the point of view of a sample network workload.
> >>
> >> ~Andrew
> > One confusion here. The original patch just disables shared_ept, w/o
> > changing IOMMU to not use 2MB mapping. Is there something missing
> > or other tricks behind?
> 
> Disabling shared_ept works because the Xen IOMMU interface doesn't
> support superpages.
> 
> However, I subsequently changed my mind in this thread about the
> approach taken, and quirking the IOMMUs not to report superpage
> capabilities is the correct solution to the problem.
> 

Yes, it makes more sense.

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-03 Thread Tian, Kevin
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Thursday, December 03, 2015 7:19 PM
> 
> On 03/12/15 08:50, Tian, Kevin wrote:
> >> From: Jan Beulich [mailto:jbeul...@suse.com]
> >> Sent: Thursday, December 03, 2015 4:18 PM
> >>
> > On 03.12.15 at 03:40,  wrote:
> >>> Just confirmed internally with HW team. On SNB 4KB cache is always
> >>> used regardless of 4KB/2MB/1GB mapping. There'd be another reason
> >>> for this 40% drop observation...
> >> So when they stated that the 4k TLB gets always used, did they at
> >> least provide some thoughts on what else might be causing this
> >> severe a performance impact? Without them helping we're left
> >> guessing...
> >>
> > Unfortunately no clear answer...
> 
> http://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf
> 
> Page 42: "The IOTLB on the previous generation Intel Xeon Processor
> E5-2690 does not natively support huge pages (it emulates them using 4K
> pages)."
> 
> And Figure 51 on Page 43
> 
> The "emulates them using 4K pages" probably means that the IOTLB is
> flushed and filled with 512 adjacent 4k mappings.
> 
> Citrix's measurements back up the findings in that paper, and also show
> that performance is better when using plain 4k mappings as opposed to
> emulated 2M mappings.
> 

Thanks for the information. I'll forward it to HW team.

If above interpretation is correct (which also matches my thought), then
for two options you listed earlier:

---
> This leaves two options
> 1) 2M mappings are entirely uncached
> 2) 2M mappings are shattered to 4K mappings and cached

> The fact there is a 40% performance reduction suggests 1 rather than 2.
---

looks 2) is suggested rather than 1). There are two further options:

2.1) 2M mappings are shattered to 512 adjacent 4k mappings which are all
cached;
2.2) Only the 4k mapping out of 2M mapping is cached for the page being 
accessed;

for 2.1), as IOTLB entries are limited, it may cause unnecessary IOTLB
entry flushes and thus incurs more page walking overhead to fill-in.

for 2.2), I can't think out a reason to cause performance drop.

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-02 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, December 03, 2015 9:20 AM
> 
> > From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> > Sent: Thursday, November 26, 2015 9:56 PM
> >
> > On 26/11/15 13:48, Malcolm Crossley wrote:
> > > On 26/11/15 13:46, Jan Beulich wrote:
> > > On 25.11.15 at 11:28,  wrote:
> > >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> > >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> > >>>
> > >>> As a result, attempting to use 2M translations causes substantially
> > >>> worse performance than 4K translations.
> > >> Btw - how does this get explained? At a first glance, even if 2Mb
> > >> translations don't get entered into the TLB, it should still be one
> > >> less page table level to walk for the IOMMU, and should hence
> > >> nevertheless be a benefit. Yet you even say _substantially_
> > >> worse performance results.
> > > There is a IOTLB for the 4K translation so if you only use 4K
> > > translations then you get to take advantage of the IOTLB.
> > >
> > > If you use the 2Mb translation then a page table walk has to be
> > > performed every time there's a DMA access to that region of the BFN
> > > address space.
> >
> > Also remember that a high level dma access (from the point of view of a
> > driver) will be fragmented at the PCIe max packet size, which is
> > typically 256 bytes.
> >
> > So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> > pagetable walks, one for each PCIe packet.
> >
> > We observed that using 2Mb mappings results in a 40% overhead, compared
> > to using 4k mappings, from the point of view of a sample network workload.
> >
> > ~Andrew
> 
> One confusion here. The original patch just disables shared_ept, w/o
> changing IOMMU to not use 2MB mapping. Is there something missing
> or other tricks behind?
> 
> When you say using 4k mapping saves 40% overhead back, is it w/
> ept shared or not?
> 

Just confirmed internally with HW team. On SNB 4KB cache is always
used regardless of 4KB/2MB/1GB mapping. There'd be another reason
for this 40% drop observation...

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-02 Thread Tian, Kevin
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Thursday, November 26, 2015 9:56 PM
> 
> On 26/11/15 13:48, Malcolm Crossley wrote:
> > On 26/11/15 13:46, Jan Beulich wrote:
> > On 25.11.15 at 11:28,  wrote:
> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>
> >>> As a result, attempting to use 2M translations causes substantially
> >>> worse performance than 4K translations.
> >> Btw - how does this get explained? At a first glance, even if 2Mb
> >> translations don't get entered into the TLB, it should still be one
> >> less page table level to walk for the IOMMU, and should hence
> >> nevertheless be a benefit. Yet you even say _substantially_
> >> worse performance results.
> > There is a IOTLB for the 4K translation so if you only use 4K
> > translations then you get to take advantage of the IOTLB.
> >
> > If you use the 2Mb translation then a page table walk has to be
> > performed every time there's a DMA access to that region of the BFN
> > address space.
> 
> Also remember that a high level dma access (from the point of view of a
> driver) will be fragmented at the PCIe max packet size, which is
> typically 256 bytes.
> 
> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> pagetable walks, one for each PCIe packet.
> 
> We observed that using 2Mb mappings results in a 40% overhead, compared
> to using 4k mappings, from the point of view of a sample network workload.
> 
> ~Andrew

One confusion here. The original patch just disables shared_ept, w/o
changing IOMMU to not use 2MB mapping. Is there something missing
or other tricks behind?

When you say using 4k mapping saves 40% overhead back, is it w/
ept shared or not?

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Anshul Makkar
Based on the discussion below, can I assume there is an agreement for using 
processor model for filtering or chipset ID will be the preferred candidate.

Thanks
Anshul Makkar

-Original Message-
From: Tian, Kevin [mailto:kevin.t...@intel.com] 
Sent: 26 November 2015 07:17
To: Malcolm Crossley <malcolm.cross...@citrix.com>; Jan Beulich 
<jbeul...@suse.com>; Andrew Cooper <andrew.coop...@citrix.com>; Anshul Makkar 
<anshul.mak...@citrix.com>
Cc: Zhang, Yang Z <yang.z.zh...@intel.com>; xen-devel@lists.xen.org
Subject: RE: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for 
Sandybridge and earlier processors.

> From: Malcolm Crossley [mailto:malcolm.cross...@citrix.com]
> Sent: Wednesday, November 25, 2015 11:59 PM
> 
> On 25/11/15 15:38, Jan Beulich wrote:
> >>>> On 25.11.15 at 16:13, <andrew.coop...@citrix.com> wrote:
> >> On 25/11/15 10:49, Jan Beulich wrote:
> >>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote:
> >>>> On 24/11/15 17:41, Jan Beulich wrote:
> >>>>>>>> On 24.11.15 at 18:17,  wrote:
> >>>>>> --- a/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
> >>>>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
> >>>>>>  /* Tylersburg interrupt remap quirk */
> >>>>>>  if ( iommu_intremap )
> >>>>>>  tylersburg_intremap_quirk();
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * Disable shared EPT ("sharept") on Sandybridge and older 
> >>>>>> processors
> >>>>>> + * by default.
> >>>>>> + * SandyBridge has no huge page support for IOTLB which 
> >>>>>> + leads to
> >> fallback
> >>>>>> + * on 4k pages and leads to performance degradation.
> >>>>>> + *
> >>>>>> + * Shared EPT ("sharept") will be disabled only if user has not
> >>>>>> + * provided explicit choice on the command line thus 
> >>>>>> + iommu_hap_pt_share
> >> is
> >>>>>> + * at its initialized value of -1.
> >>>>>> + */
> >>>>>> +if ( (boot_cpu_data.x86 == 0x06 && 
> >>>>>> + (boot_cpu_data.x86_model <= 0x2F
> ||
> >>>>>> +  boot_cpu_data.x86_model == 0x36)) && 
> >>>>>> + (iommu_hap_pt_share ==
> -1) )
> >>>>>> +iommu_hap_pt_share = 0;
> >>>>> If we really want to do this, then I think we should key this on 
> >>>>> EPT but not VT-d having 2M support, instead of on CPU models.
> >>>> This check is already performed by vtd_ept_page_compatible()
> >>> Yeah, I realized there would be such a check on the way home.
> >>>
> >>>> The problem is that SandyBridge IOMMUs advertise 2M support and 
> >>>> do function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>>
> >>>> As a result, attempting to use 2M translations causes 
> >>>> substantially worse performance than 4K translations.
> >>> So commit message and comment should make this more explicit, to 
> >>> avoid the impression "IOTLB" isn't just the relatively common 
> >>> mis-naming of "IOMMU".
> >>>
> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
> >>>
> >>> Further the model based check is relatively broad, and includes 
> >>> Atoms (0x36 actually is one), which can't be considered 
> >>> "Sandybridge or older" imo.
> >>>
> >>> And finally I'm not fully convinced using CPU model info to deduce 
> >>> chipset behavior is entirely correct (albeit perhaps in practice 
> >>> it'll be fine except maybe when running Xen itself virtualized).
> >>
> >> What else would you suggest? I can't think of any better 
> >> identifying information.
> >
> > Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
> Unfortunately there's no register to query the IOTLB configuration of 
> the IOMMU and so we're stuck identifying the via the processor model number 
> itself.
> 
> Malcolm
> 

I'm OK to use processor model here, though ideally Jan is right. :-)

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Jan Beulich
>>> On 01.12.15 at 17:45,  wrote:
> Based on the discussion below, can I assume there is an agreement for using 
> processor model for filtering or chipset ID will be the preferred candidate.

I think the subsequent suggestion by Andrew makes it even more
desirable to remain independent of CPU model here. As said before,
we should simply leverage the PCI IDs we already have quirks for.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Andrew Cooper
On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>> On 25.11.15 at 11:28,  wrote:
> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
>
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.
 Btw - how does this get explained? At a first glance, even if 2Mb
 translations don't get entered into the TLB, it should still be one
 less page table level to walk for the IOMMU, and should hence
 nevertheless be a benefit. Yet you even say _substantially_
 worse performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be
>>> performed every time there's a DMA access to that region of the BFN
>>> address space.
>> Also remember that a high level dma access (from the point of view of a
>> driver) will be fragmented at the PCIe max packet size, which is
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
>> pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, compared
>> to using 4k mappings, from the point of view of a sample network workload.
> How did you observe this? I am mighty curious what kind of performance tools
> you used to find this  as I would love to figure out if some of the issues
> we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given
a workload which can normally saturate line rate on the card.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Anshul Makkar
Snabbswitch (virtualized switch) also encountered similar problem : 
https://groups.google.com/forum/#!topic/snabb-devel/xX0yFzeXylI

Thanks
Anshul Makkar

-Original Message-
From: Andrew Cooper [mailto:andrew.coop...@citrix.com] 
Sent: 01 December 2015 10:34
To: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Cc: Jan Beulich <jbeul...@suse.com>; Kevin Tian <kevin.t...@intel.com>; 
yang.z.zh...@intel.com; Malcolm Crossley <malcolm.cross...@citrix.com>; Anshul 
Makkar <anshul.mak...@citrix.com>; xen-devel@lists.xen.org
Subject: Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for 
Sandybridge and earlier processors.

On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>> On 25.11.15 at 11:28, <andrew.coop...@citrix.com> wrote:
>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do 
>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>
>>>>> As a result, attempting to use 2M translations causes 
>>>>> substantially worse performance than 4K translations.
>>>> Btw - how does this get explained? At a first glance, even if 2Mb 
>>>> translations don't get entered into the TLB, it should still be one 
>>>> less page table level to walk for the IOMMU, and should hence 
>>>> nevertheless be a benefit. Yet you even say _substantially_ worse 
>>>> performance results.
>>> There is a IOTLB for the 4K translation so if you only use 4K 
>>> translations then you get to take advantage of the IOTLB.
>>>
>>> If you use the 2Mb translation then a page table walk has to be 
>>> performed every time there's a DMA access to that region of the BFN 
>>> address space.
>> Also remember that a high level dma access (from the point of view of 
>> a
>> driver) will be fragmented at the PCIe max packet size, which is 
>> typically 256 bytes.
>>
>> So by not caching the 2Mb translation, a dma access of 4k may undergo 
>> 16 pagetable walks, one for each PCIe packet.
>>
>> We observed that using 2Mb mappings results in a 40% overhead, 
>> compared to using 4k mappings, from the point of view of a sample network 
>> workload.
> How did you observe this? I am mighty curious what kind of performance 
> tools you used to find this  as I would love to figure out if some of 
> the issues we have seen are related to this?

The 40% difference is just in terms of network throughput of a VF, given a 
workload which can normally saturate line rate on the card.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Konrad Rzeszutek Wilk
On Tue, Dec 01, 2015 at 10:34:17AM +, Andrew Cooper wrote:
> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
> >> On 26/11/15 13:48, Malcolm Crossley wrote:
> >>> On 26/11/15 13:46, Jan Beulich wrote:
> >>> On 25.11.15 at 11:28,  wrote:
> > The problem is that SandyBridge IOMMUs advertise 2M support and do
> > function with it, but cannot cache 2MB translations in the IOTLBs.
> >
> > As a result, attempting to use 2M translations causes substantially
> > worse performance than 4K translations.
>  Btw - how does this get explained? At a first glance, even if 2Mb
>  translations don't get entered into the TLB, it should still be one
>  less page table level to walk for the IOMMU, and should hence
>  nevertheless be a benefit. Yet you even say _substantially_
>  worse performance results.
> >>> There is a IOTLB for the 4K translation so if you only use 4K
> >>> translations then you get to take advantage of the IOTLB.
> >>>
> >>> If you use the 2Mb translation then a page table walk has to be
> >>> performed every time there's a DMA access to that region of the BFN
> >>> address space.
> >> Also remember that a high level dma access (from the point of view of a
> >> driver) will be fragmented at the PCIe max packet size, which is
> >> typically 256 bytes.
> >>
> >> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> >> pagetable walks, one for each PCIe packet.
> >>
> >> We observed that using 2Mb mappings results in a 40% overhead, compared
> >> to using 4k mappings, from the point of view of a sample network workload.
> > How did you observe this? I am mighty curious what kind of performance tools
> > you used to find this  as I would love to figure out if some of the issues
> > we have seen are related to this?
> 
> The 40% difference is just in terms of network throughput of a VF, given
> a workload which can normally saturate line rate on the card.

I understand that.

But I am curious on how you found out the page walks by the IOMMU were
so excessive? Were there any perf counters on the IOMMU that showed
a crazy amount of pagetable walks?

It just that if I had looked at this I would have first looked at interrupts, 
then
kernels, then hypervisor - and eventually (after lots of head banging) it would
have occurred to me to look at the IOMMU pagetables.


> 
> ~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-12-01 Thread Andrew Cooper
On 01/12/15 15:24, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 10:34:17AM +, Andrew Cooper wrote:
>> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
 On 26/11/15 13:48, Malcolm Crossley wrote:
> On 26/11/15 13:46, Jan Beulich wrote:
> On 25.11.15 at 11:28,  wrote:
>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>
>>> As a result, attempting to use 2M translations causes substantially
>>> worse performance than 4K translations.
>> Btw - how does this get explained? At a first glance, even if 2Mb
>> translations don't get entered into the TLB, it should still be one
>> less page table level to walk for the IOMMU, and should hence
>> nevertheless be a benefit. Yet you even say _substantially_
>> worse performance results.
> There is a IOTLB for the 4K translation so if you only use 4K
> translations then you get to take advantage of the IOTLB.
>
> If you use the 2Mb translation then a page table walk has to be
> performed every time there's a DMA access to that region of the BFN
> address space.
 Also remember that a high level dma access (from the point of view of a
 driver) will be fragmented at the PCIe max packet size, which is
 typically 256 bytes.

 So by not caching the 2Mb translation, a dma access of 4k may undergo 16
 pagetable walks, one for each PCIe packet.

 We observed that using 2Mb mappings results in a 40% overhead, compared
 to using 4k mappings, from the point of view of a sample network workload.
>>> How did you observe this? I am mighty curious what kind of performance tools
>>> you used to find this  as I would love to figure out if some of the issues
>>> we have seen are related to this?
>> The 40% difference is just in terms of network throughput of a VF, given
>> a workload which can normally saturate line rate on the card.
> I understand that.
>
> But I am curious on how you found out the page walks by the IOMMU were
> so excessive?

I didn't.  It is all speculation drawn from other information.

The manual states that there is not a superpage IOTLB.

This leaves two options
1) 2M mappings are entirely uncached
2) 2M mappings are shattered to 4K mappings and cached

The fact there is a 40% performance reduction suggests 1 rather than 2.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-30 Thread Konrad Rzeszutek Wilk
On Thu, Nov 26, 2015 at 01:55:57PM +, Andrew Cooper wrote:
> On 26/11/15 13:48, Malcolm Crossley wrote:
> > On 26/11/15 13:46, Jan Beulich wrote:
> > On 25.11.15 at 11:28,  wrote:
> >>> The problem is that SandyBridge IOMMUs advertise 2M support and do
> >>> function with it, but cannot cache 2MB translations in the IOTLBs.
> >>>
> >>> As a result, attempting to use 2M translations causes substantially
> >>> worse performance than 4K translations.
> >> Btw - how does this get explained? At a first glance, even if 2Mb
> >> translations don't get entered into the TLB, it should still be one
> >> less page table level to walk for the IOMMU, and should hence
> >> nevertheless be a benefit. Yet you even say _substantially_
> >> worse performance results.
> > There is a IOTLB for the 4K translation so if you only use 4K
> > translations then you get to take advantage of the IOTLB.
> >
> > If you use the 2Mb translation then a page table walk has to be
> > performed every time there's a DMA access to that region of the BFN
> > address space.
> 
> Also remember that a high level dma access (from the point of view of a
> driver) will be fragmented at the PCIe max packet size, which is
> typically 256 bytes.
> 
> So by not caching the 2Mb translation, a dma access of 4k may undergo 16
> pagetable walks, one for each PCIe packet.
> 
> We observed that using 2Mb mappings results in a 40% overhead, compared
> to using 4k mappings, from the point of view of a sample network workload.

How did you observe this? I am mighty curious what kind of performance tools
you used to find this  as I would love to figure out if some of the issues
we have seen are related to this?

> 
> ~Andrew
> 
> ___
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Jan Beulich
>>> On 25.11.15 at 16:58,  wrote:
> On 25/11/15 15:38, Jan Beulich wrote:
> On 25.11.15 at 16:13,  wrote:
>>> On 25/11/15 10:49, Jan Beulich wrote:
 And finally I'm not fully convinced using CPU model info to deduce
 chipset behavior is entirely correct (albeit perhaps in practice it'll
 be fine except maybe when running Xen itself virtualized).
>>>
>>> What else would you suggest? I can't think of any better identifying
>>> information.
>> 
>> Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor 
> itself.

Which doesn't preclude it to be identified via PCI device ID - after all
there are dozens of processor integrated PCI devices. Looking at
one of my systems,

00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, 
VTd_Misc, System Management [8086:3c28] (rev 07)
80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, 
VTd_Misc, System Management [8086:3c28] (rev 07)

could be a candidate (we already key a quirk on this device in
pci_vtd_quirk()).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Andrew Cooper
On 26/11/15 08:45, Jan Beulich wrote:
 On 25.11.15 at 16:58,  wrote:
>> On 25/11/15 15:38, Jan Beulich wrote:
>> On 25.11.15 at 16:13,  wrote:
 On 25/11/15 10:49, Jan Beulich wrote:
> And finally I'm not fully convinced using CPU model info to deduce
> chipset behavior is entirely correct (albeit perhaps in practice it'll
> be fine except maybe when running Xen itself virtualized).
 What else would you suggest? I can't think of any better identifying
 information.
>>> Chipset IDs / revisions?
>> In this case the IOMMU is integrated into the Sandybridge-EP processor 
>> itself.
> Which doesn't preclude it to be identified via PCI device ID - after all
> there are dozens of processor integrated PCI devices. Looking at
> one of my systems,
>
> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, 
> VTd_Misc, System Management [8086:3c28] (rev 07)
> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address Map, 
> VTd_Misc, System Management [8086:3c28] (rev 07)
>
> could be a candidate (we already key a quirk on this device in
> pci_vtd_quirk()).

These are fine for server variants, but not for desktop variants, both
of which we have seen in use.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Jan Beulich
>>> On 26.11.15 at 11:27,  wrote:
> On 26/11/15 08:45, Jan Beulich wrote:
> On 25.11.15 at 16:58,  wrote:
>>> On 25/11/15 15:38, Jan Beulich wrote:
>>> On 25.11.15 at 16:13,  wrote:
> On 25/11/15 10:49, Jan Beulich wrote:
>> And finally I'm not fully convinced using CPU model info to deduce
>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>> be fine except maybe when running Xen itself virtualized).
> What else would you suggest? I can't think of any better identifying
> information.
 Chipset IDs / revisions?
>>> In this case the IOMMU is integrated into the Sandybridge-EP processor 
>>> itself.
>> Which doesn't preclude it to be identified via PCI device ID - after all
>> there are dozens of processor integrated PCI devices. Looking at
>> one of my systems,
>>
>> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>>
>> could be a candidate (we already key a quirk on this device in
>> pci_vtd_quirk()).
> 
> These are fine for server variants, but not for desktop variants, both
> of which we have seen in use.

And I gave them only as an example that keying off of PCI IDs
would be possible. A complete list would of course need to be
compiled (but I think we could simply derive it from the list of IDs
we already deal with in quirks.c).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Andrew Cooper
On 26/11/15 10:39, Jan Beulich wrote:
 On 26.11.15 at 11:27,  wrote:
>> On 26/11/15 08:45, Jan Beulich wrote:
>> On 25.11.15 at 16:58,  wrote:
 On 25/11/15 15:38, Jan Beulich wrote:
 On 25.11.15 at 16:13,  wrote:
>> On 25/11/15 10:49, Jan Beulich wrote:
>>> And finally I'm not fully convinced using CPU model info to deduce
>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>> be fine except maybe when running Xen itself virtualized).
>> What else would you suggest? I can't think of any better identifying
>> information.
> Chipset IDs / revisions?
 In this case the IOMMU is integrated into the Sandybridge-EP processor 
 itself.
>>> Which doesn't preclude it to be identified via PCI device ID - after all
>>> there are dozens of processor integrated PCI devices. Looking at
>>> one of my systems,
>>>
>>> 00:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
>> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>>> 80:05.0 System peripheral [0880]: Intel Corporation Sandy Bridge Address 
>> Map, VTd_Misc, System Management [8086:3c28] (rev 07)
>>> could be a candidate (we already key a quirk on this device in
>>> pci_vtd_quirk()).
>> These are fine for server variants, but not for desktop variants, both
>> of which we have seen in use.
> And I gave them only as an example that keying off of PCI IDs
> would be possible. A complete list would of course need to be
> compiled (but I think we could simply derive it from the list of IDs
> we already deal with in quirks.c).

That is not my point.  The Desktop variants do not expose their
internals as PCI devices.

Keying on the host bridge might be an option.


Also, on further consideration, the better fix (however we identify the
affected systems) would be to quirk the IOMMUs themselves into not
claiming 2M/1G superpage support.

Otherwise, when we do eventually get superpage IOMMU mapping support in
the API, the performance regression will creep back in.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Jan Beulich
>>> On 25.11.15 at 11:28,  wrote:
> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
> 
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.

Btw - how does this get explained? At a first glance, even if 2Mb
translations don't get entered into the TLB, it should still be one
less page table level to walk for the IOMMU, and should hence
nevertheless be a benefit. Yet you even say _substantially_
worse performance results.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Malcolm Crossley
On 26/11/15 13:46, Jan Beulich wrote:
 On 25.11.15 at 11:28,  wrote:
>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>
>> As a result, attempting to use 2M translations causes substantially
>> worse performance than 4K translations.
> 
> Btw - how does this get explained? At a first glance, even if 2Mb
> translations don't get entered into the TLB, it should still be one
> less page table level to walk for the IOMMU, and should hence
> nevertheless be a benefit. Yet you even say _substantially_
> worse performance results.

There is a IOTLB for the 4K translation so if you only use 4K
translations then you get to take advantage of the IOTLB.

If you use the 2Mb translation then a page table walk has to be
performed every time there's a DMA access to that region of the BFN
address space.

Malcolm

> 
> Jan
> 


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-26 Thread Andrew Cooper
On 26/11/15 13:48, Malcolm Crossley wrote:
> On 26/11/15 13:46, Jan Beulich wrote:
> On 25.11.15 at 11:28,  wrote:
>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>
>>> As a result, attempting to use 2M translations causes substantially
>>> worse performance than 4K translations.
>> Btw - how does this get explained? At a first glance, even if 2Mb
>> translations don't get entered into the TLB, it should still be one
>> less page table level to walk for the IOMMU, and should hence
>> nevertheless be a benefit. Yet you even say _substantially_
>> worse performance results.
> There is a IOTLB for the 4K translation so if you only use 4K
> translations then you get to take advantage of the IOTLB.
>
> If you use the 2Mb translation then a page table walk has to be
> performed every time there's a DMA access to that region of the BFN
> address space.

Also remember that a high level dma access (from the point of view of a
driver) will be fragmented at the PCIe max packet size, which is
typically 256 bytes.

So by not caching the 2Mb translation, a dma access of 4k may undergo 16
pagetable walks, one for each PCIe packet.

We observed that using 2Mb mappings results in a 40% overhead, compared
to using 4k mappings, from the point of view of a sample network workload.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-25 Thread Jan Beulich
>>> On 25.11.15 at 16:13,  wrote:
> On 25/11/15 10:49, Jan Beulich wrote:
> On 25.11.15 at 11:28,  wrote:
>>> On 24/11/15 17:41, Jan Beulich wrote:
>>> On 24.11.15 at 18:17,  wrote:
> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>  /* Tylersburg interrupt remap quirk */
>  if ( iommu_intremap )
>  tylersburg_intremap_quirk();
> +
> +/*
> + * Disable shared EPT ("sharept") on Sandybridge and older processors
> + * by default.
> + * SandyBridge has no huge page support for IOTLB which leads to 
> fallback
> + * on 4k pages and leads to performance degradation.
> + *
> + * Shared EPT ("sharept") will be disabled only if user has not
> + * provided explicit choice on the command line thus 
> iommu_hap_pt_share 
> is
> + * at its initialized value of -1.
> + */
> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F 
> ||
> +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == 
> -1) )
> +iommu_hap_pt_share = 0;
 If we really want to do this, then I think we should key this on
 EPT but not VT-d having 2M support, instead of on CPU models.
>>> This check is already performed by vtd_ept_page_compatible()
>> Yeah, I realized there would be such a check on the way home.
>>
>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>
>>> As a result, attempting to use 2M translations causes substantially
>>> worse performance than 4K translations.
>> So commit message and comment should make this more explicit,
>> to avoid the impression "IOTLB" isn't just the relatively common
>> mis-naming of "IOMMU".
>>
>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
>>
>> Further the model based check is relatively broad, and includes
>> Atoms (0x36 actually is one), which can't be considered "Sandybridge
>> or older" imo.
>>
>> And finally I'm not fully convinced using CPU model info to deduce
>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>> be fine except maybe when running Xen itself virtualized).
> 
> What else would you suggest? I can't think of any better identifying
> information.

Chipset IDs / revisions?

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-25 Thread Malcolm Crossley
On 25/11/15 15:38, Jan Beulich wrote:
 On 25.11.15 at 16:13,  wrote:
>> On 25/11/15 10:49, Jan Beulich wrote:
>> On 25.11.15 at 11:28,  wrote:
 On 24/11/15 17:41, Jan Beulich wrote:
 On 24.11.15 at 18:17,  wrote:
>> --- a/xen/drivers/passthrough/vtd/quirks.c
>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>  /* Tylersburg interrupt remap quirk */
>>  if ( iommu_intremap )
>>  tylersburg_intremap_quirk();
>> +
>> +/*
>> + * Disable shared EPT ("sharept") on Sandybridge and older 
>> processors
>> + * by default.
>> + * SandyBridge has no huge page support for IOTLB which leads to 
>> fallback
>> + * on 4k pages and leads to performance degradation.
>> + *
>> + * Shared EPT ("sharept") will be disabled only if user has not
>> + * provided explicit choice on the command line thus 
>> iommu_hap_pt_share 
>> is
>> + * at its initialized value of -1.
>> + */
>> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F 
>> ||
>> +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == 
>> -1) )
>> +iommu_hap_pt_share = 0;
> If we really want to do this, then I think we should key this on
> EPT but not VT-d having 2M support, instead of on CPU models.
 This check is already performed by vtd_ept_page_compatible()
>>> Yeah, I realized there would be such a check on the way home.
>>>
 The problem is that SandyBridge IOMMUs advertise 2M support and do
 function with it, but cannot cache 2MB translations in the IOTLBs.

 As a result, attempting to use 2M translations causes substantially
 worse performance than 4K translations.
>>> So commit message and comment should make this more explicit,
>>> to avoid the impression "IOTLB" isn't just the relatively common
>>> mis-naming of "IOMMU".
>>>
>>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
>>>
>>> Further the model based check is relatively broad, and includes
>>> Atoms (0x36 actually is one), which can't be considered "Sandybridge
>>> or older" imo.
>>>
>>> And finally I'm not fully convinced using CPU model info to deduce
>>> chipset behavior is entirely correct (albeit perhaps in practice it'll
>>> be fine except maybe when running Xen itself virtualized).
>>
>> What else would you suggest? I can't think of any better identifying
>> information.
> 
> Chipset IDs / revisions?

In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
Unfortunately there's no register to query the IOTLB configuration of the IOMMU
and so we're stuck identifying the via the processor model number itself.

Malcolm

> 
> Jan
> 
> 
> ___
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-25 Thread Andrew Cooper
On 25/11/15 10:49, Jan Beulich wrote:
 On 25.11.15 at 11:28,  wrote:
>> On 24/11/15 17:41, Jan Beulich wrote:
>> On 24.11.15 at 18:17,  wrote:
 --- a/xen/drivers/passthrough/vtd/quirks.c
 +++ b/xen/drivers/passthrough/vtd/quirks.c
 @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
  /* Tylersburg interrupt remap quirk */
  if ( iommu_intremap )
  tylersburg_intremap_quirk();
 +
 +/*
 + * Disable shared EPT ("sharept") on Sandybridge and older processors
 + * by default.
 + * SandyBridge has no huge page support for IOTLB which leads to 
 fallback
 + * on 4k pages and leads to performance degradation.
 + *
 + * Shared EPT ("sharept") will be disabled only if user has not
 + * provided explicit choice on the command line thus 
 iommu_hap_pt_share is
 + * at its initialized value of -1.
 + */
 +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
 +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) 
 )
 +iommu_hap_pt_share = 0;
>>> If we really want to do this, then I think we should key this on
>>> EPT but not VT-d having 2M support, instead of on CPU models.
>> This check is already performed by vtd_ept_page_compatible()
> Yeah, I realized there would be such a check on the way home.
>
>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>
>> As a result, attempting to use 2M translations causes substantially
>> worse performance than 4K translations.
> So commit message and comment should make this more explicit,
> to avoid the impression "IOTLB" isn't just the relatively common
> mis-naming of "IOMMU".
>
> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
>
> Further the model based check is relatively broad, and includes
> Atoms (0x36 actually is one), which can't be considered "Sandybridge
> or older" imo.
>
> And finally I'm not fully convinced using CPU model info to deduce
> chipset behavior is entirely correct (albeit perhaps in practice it'll
> be fine except maybe when running Xen itself virtualized).

What else would you suggest? I can't think of any better identifying
information.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-25 Thread Tian, Kevin
> From: Malcolm Crossley [mailto:malcolm.cross...@citrix.com]
> Sent: Wednesday, November 25, 2015 11:59 PM
> 
> On 25/11/15 15:38, Jan Beulich wrote:
>  On 25.11.15 at 16:13,  wrote:
> >> On 25/11/15 10:49, Jan Beulich wrote:
> >> On 25.11.15 at 11:28,  wrote:
>  On 24/11/15 17:41, Jan Beulich wrote:
>  On 24.11.15 at 18:17,  wrote:
> >> --- a/xen/drivers/passthrough/vtd/quirks.c
> >> +++ b/xen/drivers/passthrough/vtd/quirks.c
> >> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
> >>  /* Tylersburg interrupt remap quirk */
> >>  if ( iommu_intremap )
> >>  tylersburg_intremap_quirk();
> >> +
> >> +/*
> >> + * Disable shared EPT ("sharept") on Sandybridge and older 
> >> processors
> >> + * by default.
> >> + * SandyBridge has no huge page support for IOTLB which leads to
> >> fallback
> >> + * on 4k pages and leads to performance degradation.
> >> + *
> >> + * Shared EPT ("sharept") will be disabled only if user has not
> >> + * provided explicit choice on the command line thus 
> >> iommu_hap_pt_share
> >> is
> >> + * at its initialized value of -1.
> >> + */
> >> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 
> >> 0x2F
> ||
> >> +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share ==
> -1) )
> >> +iommu_hap_pt_share = 0;
> > If we really want to do this, then I think we should key this on
> > EPT but not VT-d having 2M support, instead of on CPU models.
>  This check is already performed by vtd_ept_page_compatible()
> >>> Yeah, I realized there would be such a check on the way home.
> >>>
>  The problem is that SandyBridge IOMMUs advertise 2M support and do
>  function with it, but cannot cache 2MB translations in the IOTLBs.
> 
>  As a result, attempting to use 2M translations causes substantially
>  worse performance than 4K translations.
> >>> So commit message and comment should make this more explicit,
> >>> to avoid the impression "IOTLB" isn't just the relatively common
> >>> mis-naming of "IOMMU".
> >>>
> >>> Plus I guess the sharing won't need suppressing if !opt_hap_2mb?
> >>>
> >>> Further the model based check is relatively broad, and includes
> >>> Atoms (0x36 actually is one), which can't be considered "Sandybridge
> >>> or older" imo.
> >>>
> >>> And finally I'm not fully convinced using CPU model info to deduce
> >>> chipset behavior is entirely correct (albeit perhaps in practice it'll
> >>> be fine except maybe when running Xen itself virtualized).
> >>
> >> What else would you suggest? I can't think of any better identifying
> >> information.
> >
> > Chipset IDs / revisions?
> 
> In this case the IOMMU is integrated into the Sandybridge-EP processor itself.
> Unfortunately there's no register to query the IOTLB configuration of the 
> IOMMU
> and so we're stuck identifying the via the processor model number itself.
> 
> Malcolm
> 

I'm OK to use processor model here, though ideally Jan is right. :-)

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-25 Thread Andrew Cooper
On 24/11/15 17:41, Jan Beulich wrote:
 On 24.11.15 at 18:17,  wrote:
>> --- a/xen/drivers/passthrough/vtd/quirks.c
>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>  /* Tylersburg interrupt remap quirk */
>>  if ( iommu_intremap )
>>  tylersburg_intremap_quirk();
>> +
>> +/*
>> + * Disable shared EPT ("sharept") on Sandybridge and older processors
>> + * by default.
>> + * SandyBridge has no huge page support for IOTLB which leads to 
>> fallback
>> + * on 4k pages and leads to performance degradation.
>> + *
>> + * Shared EPT ("sharept") will be disabled only if user has not
>> + * provided explicit choice on the command line thus iommu_hap_pt_share 
>> is
>> + * at its initialized value of -1.
>> + */
>> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>> +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>> +iommu_hap_pt_share = 0;
> If we really want to do this, then I think we should key this on
> EPT but not VT-d having 2M support, instead of on CPU models.

This check is already performed by vtd_ept_page_compatible()

The problem is that SandyBridge IOMMUs advertise 2M support and do
function with it, but cannot cache 2MB translations in the IOTLBs.

As a result, attempting to use 2M translations causes substantially
worse performance than 4K translations.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-25 Thread Jan Beulich
>>> On 25.11.15 at 11:28,  wrote:
> On 24/11/15 17:41, Jan Beulich wrote:
> On 24.11.15 at 18:17,  wrote:
>>> --- a/xen/drivers/passthrough/vtd/quirks.c
>>> +++ b/xen/drivers/passthrough/vtd/quirks.c
>>> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>>>  /* Tylersburg interrupt remap quirk */
>>>  if ( iommu_intremap )
>>>  tylersburg_intremap_quirk();
>>> +
>>> +/*
>>> + * Disable shared EPT ("sharept") on Sandybridge and older processors
>>> + * by default.
>>> + * SandyBridge has no huge page support for IOTLB which leads to 
>>> fallback
>>> + * on 4k pages and leads to performance degradation.
>>> + *
>>> + * Shared EPT ("sharept") will be disabled only if user has not
>>> + * provided explicit choice on the command line thus 
>>> iommu_hap_pt_share is
>>> + * at its initialized value of -1.
>>> + */
>>> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
>>> +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
>>> +iommu_hap_pt_share = 0;
>> If we really want to do this, then I think we should key this on
>> EPT but not VT-d having 2M support, instead of on CPU models.
> 
> This check is already performed by vtd_ept_page_compatible()

Yeah, I realized there would be such a check on the way home.

> The problem is that SandyBridge IOMMUs advertise 2M support and do
> function with it, but cannot cache 2MB translations in the IOTLBs.
> 
> As a result, attempting to use 2M translations causes substantially
> worse performance than 4K translations.

So commit message and comment should make this more explicit,
to avoid the impression "IOTLB" isn't just the relatively common
mis-naming of "IOMMU".

Plus I guess the sharing won't need suppressing if !opt_hap_2mb?

Further the model based check is relatively broad, and includes
Atoms (0x36 actually is one), which can't be considered "Sandybridge
or older" imo.

And finally I'm not fully convinced using CPU model info to deduce
chipset behavior is entirely correct (albeit perhaps in practice it'll
be fine except maybe when running Xen itself virtualized).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-11-24 Thread Jan Beulich
>>> On 24.11.15 at 18:17,  wrote:
> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,20 @@ void __init platform_quirks_init(void)
>  /* Tylersburg interrupt remap quirk */
>  if ( iommu_intremap )
>  tylersburg_intremap_quirk();
> +
> +/*
> + * Disable shared EPT ("sharept") on Sandybridge and older processors
> + * by default.
> + * SandyBridge has no huge page support for IOTLB which leads to fallback
> + * on 4k pages and leads to performance degradation.
> + *
> + * Shared EPT ("sharept") will be disabled only if user has not
> + * provided explicit choice on the command line thus iommu_hap_pt_share 
> is
> + * at its initialized value of -1.
> + */
> +if ( (boot_cpu_data.x86 == 0x06 && (boot_cpu_data.x86_model <= 0x2F ||
> +  boot_cpu_data.x86_model == 0x36)) && (iommu_hap_pt_share == -1) )
> +iommu_hap_pt_share = 0;

If we really want to do this, then I think we should key this on
EPT but not VT-d having 2M support, instead of on CPU models.

Also - with the above only marginally relevant - the line split and/or
indentation is wrong.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-09-01 Thread Jan Beulich
>>> On 01.09.15 at 16:18,  wrote:
> On 31/08/15 09:09, Jan Beulich wrote:
> On 28.08.15 at 17:41,  wrote:
>>> --- a/docs/misc/xen-command-line.markdown
>>> +++ b/docs/misc/xen-command-line.markdown
>>> @@ -896,7 +896,7 @@ debug hypervisor only).
>>>  
>>>  > `sharept`
>>>  
>>> -> Default: `true`
>>> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or 
>>> earlier.
>> This neglects the AMD side.
> 
> The AMD side has iommu_hap_pt_share unconditionally disabled, for
> reasons pertaining to grant mapped frames.

Yet that doesn't eliminate the desire to have the correct default
spelled out here.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-09-01 Thread Andrew Cooper
On 31/08/15 09:09, Jan Beulich wrote:
 On 28.08.15 at 17:41,  wrote:
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -896,7 +896,7 @@ debug hypervisor only).
>>  
>>  > `sharept`
>>  
>> -> Default: `true`
>> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or 
>> earlier.
> This neglects the AMD side.

The AMD side has iommu_hap_pt_share unconditionally disabled, for
reasons pertaining to grant mapped frames.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-08-31 Thread Jan Beulich
>>> On 28.08.15 at 17:41,  wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -896,7 +896,7 @@ debug hypervisor only).
>  
>  > `sharept`
>  
> -> Default: `true`
> +> Default: `true` if newer than SandyBridge or `false` if Sandybridge or 
> earlier.

This neglects the AMD side.

> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -54,6 +54,7 @@ bool_t __read_mostly iommu_intremap = 1;
>  bool_t __read_mostly iommu_hap_pt_share = 1;
>  bool_t __read_mostly iommu_debug;
>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> +bool_t __read_mostly iommu_sharept_set = 0;

_If_ you really want to introduce a new variable, it ought to be
__initdata (since only __init functions reference it). My preference
though would be to make the existing variable a tristate, starting
out with a value of -1 (and having type s8).

> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -320,6 +320,19 @@ void __init platform_quirks_init(void)
>  /* Tylersburg interrupt remap quirk */
>  if ( iommu_intremap )
>  tylersburg_intremap_quirk();
> +
> +/*
> + * Disable shared EPT ("sharept") on Sandybridge and older processors
> + * by default.
> + * SandyBridge has no huge page support for IOTLB which leads to fallback
> + * on 4k pages and leads to performance degradation.
> + *
> + * Shared EPT ("sharept") will be disabled only if user has not
> + * provided explicit choice on the command line.
> + */
> +if ( (boot_cpu_data.x86 == 6) &&
> + (boot_cpu_data.x86_model <= 0x2a) && !iommu_sharept_set )
> +iommu_hap_pt_share = 0;

Model 0x2d certainly is also Sandybridge. Models 0x2c, 0x2e, and
0x2f are even older architectures. And then there are various
Atoms at higher numbers which I'm not sure can be considered
"newer than Sandybridge" architecture wise. I.e. I don't think you
can get away with a simple, single relation here.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.

2015-08-28 Thread Andrew Cooper
On 28/08/15 16:41, Anshul Makkar anshul.makkar@citrix.com wrote:
 From: anshulma anshul.mak...@citrix.com

 Sandybridge or earlier processors don't have huge page support for
 IOTLB which leads to fallback on 4k pages and causes performance issues.

 Shared EPT will be disabled only if the user has not provided explicit
 choice on the command line.

 Signed-off-by: Anshul Makkar anshul.mak...@citrix.com

As a note concerning the performance issues, for some IO workloads, this
nets a 40% throughput improvement.

We did not observe any IO workloads which had a worse performance as a
result of disabling shared ept.

Reviewed-by: Andrew Cooper andrew.coop...@citrix.com

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel