On 25/11/2024 2:28 pm, Jan Beulich wrote:
> Move the function to its own assembly file. Having it in C just for the
> entire body to be an asm() isn't really helpful. Then have two flavors:
> A "basic" version using qword steps for the bulk of the operation, and an
> ERMS version for modern hardware, to be substituted in via alternatives
> patching.
>
> Signed-off-by: Jan Beulich <[email protected]>
This is far nicer than previous versions with nested alternatives.
> ---
> We may want to consider branching over the REP STOSQ as well, if the
> number of qwords turns out to be zero.
Until FSR{S,M} (Fast Short Rep {STO,MOV}SB), which is far newer than
ERMS, passing 0 into any REP instruction is expensive.
I wonder how often we memset with a size less than 8.
> We may also want to consider using non-REP STOS{L,W,B} for the tail.
Probably, yes. We use this form in non-ERMS cases, where we're advised
to stay away from STOSB entirely.
Interestingly, Linux doesn't have a STOSQ case at all. Or rather, it
was deleted by Linus in 20f3337d350c last year. It was also identified
as causing a performance regression.
https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kk+sb2zsshiffz7qkpldpatkj8v4wu...@mail.gmail.com/T/#u
although the memset() path was not reverted as part of the fix
(47ee3f1dd93bcb eventually).
Yet ca96b162bfd2 shows that REP MOVSQ is still definitely a win on Rome
CPUs.
I expect we probably do want some non-rep forms in here.
Do you have any benchmarks with this series?
> ---
> v3: Re-base.
>
> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -48,6 +48,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
> obj-$(CONFIG_PV) += ioport_emulate.o
> obj-y += irq.o
> obj-$(CONFIG_KEXEC) += machine_kexec.o
> +obj-y += memset.o
> obj-y += mm.o x86_64/mm.o
> obj-$(CONFIG_HVM) += monitor.o
> obj-y += mpparse.o
> --- /dev/null
> +++ b/xen/arch/x86/memset.S
> @@ -0,0 +1,30 @@
> +#include <asm/asm_defns.h>
> +
> +.macro memset
> + and $7, %edx
> + shr $3, %rcx
> + movzbl %sil, %esi
> + mov $0x0101010101010101, %rax
> + imul %rsi, %rax
> + mov %rdi, %rsi
> + rep stosq
> + or %edx, %ecx
> + jz 0f
> + rep stosb
> +0:
> + mov %rsi, %rax
Could you use %r8/9/etc instead of %rsi please? This is deceptively
close to looking like a bug, and it took me a while to figure out it's
only correct because STOSB only edits %rdi.
Otherwise, I suspect this can go in. It should be an improvement on
plain REP STOSB on non-ERMS systems, even if there are other
improvements to come. I specifically wouldn't suggest blocking it until
patch 1 is resolved.
~Andrew