On 12/06/2017 08:47, Juergen Gross wrote:
> On 12/06/17 09:35, Andrew Cooper wrote:
>> On 12/06/2017 06:48, Juergen Gross wrote:
>>> On 08/06/17 23:00, Dario Faggioli wrote:
>>>> Bringing in Konrad because...
>>>>
>>>> On Thu, 2017-06-08 at 11:37 +0200, Juergen Gross wrote:
>>>>> On 07/06/17 20:19, Stefano Stabellini wrote:
>>>>>> On Wed, 7 Jun 2017, Juergen Gross wrote:
>>>>>>> On 06/06/17 21:08, Stefano Stabellini wrote:
>>>>>>>> 2) PV suspend/resume
>>>>>>>> 3) vector callback
>>>>>>>> 4) interrupt remapping
>>>>>>>>
>>>>>>>> 2) is not on the hot path.
>>>>>>>> I did individual measurements of 3) at some points and it was a
>>>>>>>> clear win.
>>>>>>> That might depend on the hardware. Could it be that newer
>>>>>>> processors are faster here?
>>>>>> I don't think so: the alternative is an emulated interrupt. It's
>>>>>> slower from every point of view.
>>>>> What about the APIC virtualization of modern processors? Are you
>>>>> sure that e.g. timer interrupts aren't handled completely by the
>>>>> processor? I guess this might be faster than letting them be handled
>>>>> by the hypervisor and then using the callback into the guest.
>>>>>
>>>> ... I kind of remember an email exchange we had, not here on the list,
>>>> but in private, about some apparently weird scheduling behavior you
>>>> were seeing, there at Oracle, on a particular benchmark/customer's
>>>> workload.
>>>>
>>>> Not that this is directly related, but I seem to also recall that you
>>>> managed to find out that some of the perf difference (between baremetal
>>>> and guest) was due to vAPIC being faster than the PV path we were
>>>> taking? What I don't recall, though, is whether your guest was PV or
>>>> (PV)HVM... Do you remember anything more precise than this?
>>> I now tweaked the kernel to use the LAPIC timer instead of the pv one.
>>>
>>> While it is a tiny bit faster (<1%), this doesn't seem to be the
>>> reason for the performance drop.
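(For context, the pv timer path being compared against here arms one-shot
events through a hypercall instead of programming the LAPIC; a rough
sketch follows, modelled on the Linux Xen clockevent code, with the
function name, includes and error handling simplified - not the exact
implementation.)

/* Sketch only: one-shot pv timer programming via
 * VCPUOP_set_singleshot_timer.  Each event arm is a hypercall (hence a
 * VMEXIT); Xen later injects the timer event back into the guest. */
#include <linux/smp.h>
#include <linux/clockchips.h>
#include <xen/interface/vcpu.h>
#include <asm/xen/hypercall.h>

/* The kernel's Xen clocksource read helper (Xen system time in ns). */
extern u64 xen_clocksource_read(void);

static int xen_pv_timer_set_next_event(unsigned long delta_ns,
                                       struct clock_event_device *evt)
{
        struct vcpu_set_singleshot_timer single = {
                /* absolute system time at which the event should fire */
                .timeout_abs_ns = xen_clocksource_read() + delta_ns,
                .flags = VCPU_SSHOTTMR_future,
        };

        return HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer,
                                  smp_processor_id(), &single);
}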
>>>
>>> Using xentrace I've verified that no additional hypercalls or other
>>> VMEXITs are occurring which would explain what is happening (I'm
>>> seeing the timer being set and the related timer interrupt 250 times
>>> a second, which is expected).
>>>
>>> Using ftrace in the kernel I can see all functions being called on
>>> the munmap path. Nothing worrying and no weird differences between the
>>> pv and the non-pv test.
>>>
>>> What is interesting is that the time for the pv test isn't lost at one
>>> or two specific points, but all over the test. All functions seem to
>>> run just a little bit slower than in the non-pv case.
>>>
>>> So I concluded it might be TLB related. The main difference between
>>> using pv interfaces and not using them is the mapping of the shared
>>> info page into the guest. The guest physical page for the shared info
>>> page is allocated rather early via extend_brk(). Mapping the shared
>>> info page into the guest requires that specific page to be mapped via
>>> a 4kB EPT entry, which breaks up a 2MB entry. So at least most of the
>>> other data allocated via extend_brk() in the kernel will be hit by
>>> this large page break-up. The main other data allocated this way are
>>> the early page tables, which are essential for nearly all virtual
>>> addresses of the kernel address space.
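(For reference, the mapping path in question looks roughly like the
sketch below, modelled on xen_hvm_init_shared_info(); details and error
handling are trimmed, so treat it as an illustration rather than the
exact kernel code.)

/* Sketch: the shared info backing page comes from extend_brk(), i.e. it
 * sits right next to the early page tables, and XENMAPSPACE_shared_info
 * then forces Xen to split the 2MB EPT entry covering that region. */
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>
#include <asm/setup.h>          /* extend_brk() */

static struct shared_info *shared_info_page;

void xen_hvm_init_shared_info(void)
{
        struct xen_add_to_physmap xatp;

        if (!shared_info_page)
                shared_info_page = (struct shared_info *)
                        extend_brk(PAGE_SIZE, PAGE_SIZE);

        xatp.domid = DOMID_SELF;
        xatp.idx = 0;
        xatp.space = XENMAPSPACE_shared_info;
        /* gfn taken from the brk area, in the middle of kernel data */
        xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
        if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
                BUG();

        HYPERVISOR_shared_info = shared_info_page;
}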
>>>
>>> Instead of using extend_brk() I tried allocating the shared info
>>> pfn from the first MB of memory, as this area is already mapped via
>>> 4kB EPT entries. And indeed: this change did speed up the munmap test
>>> even when using pv interfaces in the guest.
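(A hypothetical sketch of that alternative, just to illustrate the idea:
xen_find_free_low_pfn() is a made-up placeholder for "pick a spare pfn
in the first MB", and the real patch may well look different.)

/* Sketch: map the shared info page at a gfn below 1MB (pfn < 0x100),
 * an area already covered by 4kB EPT entries, so no 2MB host mapping
 * has to be shattered. */
#include <linux/pfn.h>
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>

/* Hypothetical helper: returns a spare guest pfn below 1MB. */
extern unsigned long xen_find_free_low_pfn(void);

void xen_hvm_init_shared_info_low(void)
{
        struct xen_add_to_physmap xatp;
        unsigned long low_pfn = xen_find_free_low_pfn();

        xatp.domid = DOMID_SELF;
        xatp.idx = 0;
        xatp.space = XENMAPSPACE_shared_info;
        xatp.gpfn = low_pfn;
        if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
                BUG();

        /* The first MB is guest RAM and already part of the kernel's
         * direct mapping, so the page is reachable via __va(). */
        HYPERVISOR_shared_info = __va(PFN_PHYS(low_pfn));
}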
>>>
>>> I'll send a proper patch for the kernel after doing some more testing.
>> Is it practical to use somewhere other than the first MB of memory?
>>
>> The only reason the first 2M of memory is mapped with 4k EPT entries
>> is the MTRRs.  I'm still hoping we can sensibly disable them for PVH
>> workloads, after which the guest could be mapped using exclusively 1G
>> EPT mappings (if such RAM/alignment were available in the system).
>>
>> Ideally, all mapped-in frames (including grants, foreign frames, etc)
>> would use GFNs above the top of RAM, so as never to shatter any of the
>> host superpage mappings of RAM.
> Right. We can easily move to such a region (e.g. Xen PCI-device memory)
> when we've removed the MTRR settings for the low memory. Right now using
> the low 1MB of memory is working well and requires only very limited
> changes, thus making a backport much easier.

Good point.

>
> BTW: I could imagine that using a special GFN region for all specially
> mapped data might require some hypervisor tweaks, too.

No changes that I'm aware of.  One of the many things Xen should
currently do (and doesn't) is limit the guest's choice of gfns to a
range pre-determined by the toolstack.

~Andrew
