>>> On 01.03.18 at 16:47, <andrew.coop...@citrix.com> wrote:
> On 07/12/17 10:45, Jan Beulich wrote:
>>>>> On 06.12.17 at 20:34, <andrew.coop...@citrix.com> wrote:
>>> On 06/12/17 16:37, Jan Beulich wrote:
>>>> @@ -196,6 +215,19 @@ static inline void wrgsbase(unsigned lon
>>>>          wrmsrl(MSR_GS_BASE, base);
>>>>  }
>>>>  
>>>> +static inline void wrgsshadow(unsigned long base)
>>>> +{
>>>> +    alternative_input("mov %[msr], %%ecx\n\t"
>>>> +                      "shld $32, %[val], %%rdx\n\t"
>>> This is a vector path instruction and specifically called out to be
>>> avoided in the AMD optimisation guide.  On all hardware (according to
>>> Agner's latency mesurements) it alone has a longer latency to execute
>>> that the mov/shl pair you've replaced, and that is before accounting for
>>> mov elimination.
>> For one I doubt the latency of the SHLD will be noticable with
>> the latency of the following WRMSR. And then I think the main
>> goal should be to have optimal performance on modern CPUs.
>> That means size-optimizing original code, to reduce the amount
>> of NOPs needed when the alternative is being installed.
> 
> I'm sorry if this comes across as blunt, but you are too focused on the
> wrong details.
> 
> I agree that `swap;{rd,wr}gsbase;swap` is better than wrmsr, and we
> should be using it when available.
> 
> However, it is simply not the case that squeezing every possible byte
> out of .text makes the best result.  Just as with inc and dec, there is
> a very good reason why compilers don't emit shld, and that is because
> the ORM's recommend against their use.  In this specific case, a mov/shl
> pair is better than shld on all hardware, even in older pipelines which
> don't do mov elimination.

Well, okay I'll switch away from using SHLD. It'll be a single padding
NOP only anyway, just a slightly larger one. And note that this wasn't
about size optimization, but about NOP padding reduction.

> You are going to need a more convincing argument than currently provided
> as to why the helpers aren't implemented like this:
> 
> static inline unsigned long rdgsshadow(void)
> {
>     unsigned long base;
> 
>     if ( cpu_has_fsgsbase )
>         asm volatile ( "swapgs\n\t"
> #ifdef HAVE_GAS_FSGSBASE
>                        "rdgsbase %0\n\t"
> #else
>                     ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8\n\t"
> #endif
>                         "swapgs\n\t"
>                        :
> #ifdef HAVE_GAS_FSGSBASE
>                        "=r" (base)
> #else
>                        "=a" (base)
> #endif
>             );
>     else
>         rdmsrl(MSR_SHADOW_GS_BASE, base);
> 
>     return base;
> }
> 
> static inline void wrgsshadow(unsigned long base)
> {
>     if ( cpu_has_fsgsbase )
>         asm volatile ( "swapgs\n\t"
> #ifdef HAVE_GAS_FSGSBASE
>                        "wrgsbase %0\n\t"
> #else
>                        ".byte 0xf3, 0x48, 0x0f, 0xae, 0xd8\n\t"
> #endif
>                         "swapgs\n\t"
>                        ::
> #ifdef HAVE_GAS_FSGSBASE
>                        "r" (base)
> #else
>                        "a" (base)
> #endif
>             );
>     else
>         wrmsrl(MSR_SHADOW_GS_BASE, base);
> }
> 
> In particular:
> 1) This is consistent with existing {rd,wr}{fs,gs}base helpers.

Which, in due course, I would have meant to convert to alternatives
as well.

> 2) cpu_has_fsgsbase is constant after boot, so the branch will be
> predicted correctly.

If it has an entry in the BTB in the first place.

I can certainly also switch away from using alternatives here, but
considering especially my response to 1) I would do this quite
hesitantly.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Reply via email to