>>> On 07.09.18 at 18:17, <andrew.coop...@citrix.com> wrote:
> On 17/08/18 10:38, Jan Beulich wrote:
>>>>> On 17.08.18 at 10:59, <andrew.coop...@citrix.com> wrote:
>>> On 17/08/2018 08:21, Jan Beulich wrote:
>>>> --- a/xen/include/asm-x86/asm_defns.h
>>>> +++ b/xen/include/asm-x86/asm_defns.h
>>>> @@ -186,6 +186,20 @@ void ret_from_intr(void);
>>>>          UNLIKELY_END_SECTION "\n"          \
>>>>          ".Llikely." #tag ".%=:"
>>>>  
>>>> +#define LINKONCE_PROLOGUE(sym)                    \
>>>> +        ".ifndef " sym() "\n\t"                   \
>>>> +        ".pushsection " sym(.gnu.linkonce.t.) "," \
>>> This definitely warrants a comment and a change of name, seeing as sym
>>> isn't a symbol.  Its a macro which gives you a string back.
>> It's still a symbol name; don't forget that in the end section names
>> are symbols names too. I'd have gone with a better name if I had
>> any better idea. It's also not really clear to me what you want a
>> comment to say here.
> 
> /* sym must be a macro which takes an optional symbol name prefix */ ?

Thanks, I'll use this in slightly modified form.

>>>> @@ -71,16 +80,77 @@ static inline unsigned long __virt_to_ma
>>>>  
>>>>          va += xen_phys_start - XEN_VIRT_START;
>>>>      }
>>>> -    return (va & ma_va_bottom_mask) |
>>>> -           ((va << pfn_pdx_hole_shift) & ma_top_mask);
>>>> +
>>>> +#ifdef CONFIG_INDIRECT_THUNK /* V modifier available? */
>>>> +#define SYMNAME(pfx...) #pfx "do2ma_%V[ma]_%V[off]"
>>>> +    alternative_io("call " SYMNAME() "\n\t"
>>>> +                   LINKONCE_PROLOGUE(SYMNAME) "\n\t"
>>>> +                   "mov %[shift], %%ecx\n\t"
>>>> +                   "mov %[off], %[ma]\n\t"
>>>> +                   "and %[bmask], %[ma]\n\t"
>>>> +                   "shl %%cl, %[off]\n\t"
>>>> +                   "and %[tmask], %[off]\n\t"
>>>> +                   "or %[off], %[ma]\n\t"
>>>> +                   "ret\n\t"
>>>> +                   LINKONCE_EPILOGUE(SYMNAME),
>>>> +                   "pdep %[mask], %[off], %[ma]", X86_FEATURE_BMI2,
>>> The compiler understanding V doesn't imply that the assembler
>>> understands pdep
>> I've been considering this case, and in fact in an early version I had
>> a -DHAVE_AS_BMI2 addition. But I then decided to drop it, as I couldn't
>> really imagine V support to be backported to gcc pre-dating BMI2
>> support in gas (available as of 2.22, i.e. gcc 4.5 / 4.6 timeframe).
>> But yes, if you really think we need to cope with that, I could re-add
>> this. Just let me know if you indeed think this is needed.
> 
> Didn't you say that you've got Retpoline backported to GCC 4.3 ?

Yes, with an underly binutils 2.25.

> Independently of my argument against using V below, I'm not overly
> fussed especially as we are considering upping the requirements to
> something a little more modern, so long as we accept that we may have to
> retrofit a -DHAVE_AS_BMI2 after the fact.

I've added a respective sentence to the description.

>>>> +                   ASM_OUTPUT2([ma] "=&r" (ma), [off] "+r" (va)),
>>>> +                   [mask] "m" (ma_real_mask),
>>>> +                   [shift] "m" (pfn_pdx_hole_shift),
>>>> +                   [bmask] "m" (ma_va_bottom_mask),
>>>> +                   [tmask] "m" (ma_top_mask)
>>>> +                   : "ecx");
>>>> +#undef SYMNAME
>>>> +#else
>>>> +    alternative_io("call do2ma",
>>>> +                   /* pdep ma_real_mask(%rip), %rdi, %rax */
>>>> +                   ".byte 0xc4, 0xe2, 0xc3, 0xf5, 0x05\n\t"
>>>> +                   ".long ma_real_mask - 4 - .",
>>>> +                   X86_FEATURE_BMI2,
>>>> +                   ASM_OUTPUT2("=a" (ma), "+D" (va)), "m" (ma_real_mask)
>>>> +                   : "rcx", "rdx", "rsi", "r8", "r9", "r10", "r11");
>>>> +#endif
>>> This is a massive clobber list in a function you've forced always
>>> inline, and I can't see it doing nice things to the callsites.  TBH,
>>> this still feels over-complicated for what it wants to be.
>>>
>>> Why not implement one single function in assembly that doesn't have
>>> usual C calling conventions and can clobber %ecx and one other, and use
>>> that?
>>>
>>> It avoids the need for potentially 256 almost-identical copies of the
>>> function in the linkonce section, and avoids having the multiple
>>> implementations in C/asm, avoids the need for any logic derived from
>>> CONFIG_INDIRECT_THUNK, and avoids the need for massive clobber lists.
>> Your response mixes things a bit too much for me to sort out what
>> exactly you're concerned about: The massive clobber list exists only
>> in the !CONFIG_INDIRECT_THUNK case. In that case though there
>> aren't going to be up to 225 instances of the function. I'd be fine
>> implementing the single one in assembly to reduce the clobber list,
>> it was just to keep down assembly code size and also to have
>> compiler generated code to compare against. I have to admit though
>> that I'm not overly concerned about the !CONFIG_INDIRECT_THUNK
>> case in the first place, so I also didn't see much reason to try to
>> optimize it.
>>
>> For the CONFIG_INDIRECT_THUNK case, otoh, I'd really like to
>> avoid dictating register allocation to the compiler. Hence the solution
>> with the (possibly many) function instances. Overall code size still
>> decreases with this approach, and on modern hardware the entire
>> region of the image in which they live will remain cold. (Additionally
>> not affecting register allocation here has the benefit of making it
>> far easier to compare pre/post generated code.)
> 
> How many passes through the hypervisor hit two or more of these functions?
> 
> I appreciate that for development, reducing the register perturbance can
> be nice for diffing the resulting disassembly, but when it comes to
> actually running the code, register renames at compile time are free,
> whereas pulling multiple almost-identical copies of the stub into the
> I-cache is not.
> 
> Implementing this as a single function looks to be simpler in terms of
> the change, will compile to a smaller result, and will run faster.  It
> seems to be a win-win-win overall.

Well, I continue to not really agree. First and foremost, as said before,
the common (exclusive?) case is going to be that with "x86: use MOV
for PFN/PDX conversion when possible" no calls will exist at runtime at
all. At that point all function instances could collectively be purged just
like .init.text, if we cared enough. And then, for this particular case,
leaving the compiler the widest possible choice of register allocation
still seems pretty desirable to me. I'd agree with your "register
renames at compile time are free" only if there weren't special uses of
quite a few of the registers.

As perhaps a prime example, consider the case where the
transformation here gets done in the course of setting up another
function's arguments. The compiler would have to avoid certain
registers (or generate extra MOVs) if it had to first pass the input
in a fixed register to the helper here (which then additionally also
needs to be assumed to clobber several other ones).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Reply via email to