Hi,

On Tue, 30 Jul 2024, Andi Kleen wrote:

> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
> 
> Also adjust the code to allow inlining when the compiler
> is built for an AVX2 host, following what other architectures
> do.
> 
> I see about a ~0.6% compile time improvement for compiling i386
> insn-recog.i with -O0.

Is that from some kind of rigorous measurement under perf? As you
surely know, 0.6% wall-clock time can be from boost clock variation
or just run-to-run noise on x86.

I have looked at this code before. When AVX2 is available, so is SSSE3,
and then a much more efficient approach is available: instead of comparing
against \r \n \\ ? one-by-one, build a vector

  0  1  2  3  4  5  6  7  8  9    a   b    c     d   e   f
{ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }

where each character C we're seeking is at position (C % 16). Then
you can match against them all at once using PSHUFB:

  t = _mm_shuffle_epi8 (lut, data);
  t = t == data;

As you might recognize this handily beats the fancy SSE4.1 loop as well.
I did not pursue this because I did not measure a substantial improvement
(we're way into the land of diminishing returns here) and it seemed like
maintainers might not like to be distracted with that, but if we are
touching this code, might as well use the more efficient algorithm.
I'll be happy to propose a patch if people think it's worthwhile.

I see one issue with your patch, please see below:

> @@ -448,6 +526,10 @@ init_vectorized_lexer (void)
>  
>    if (minimum == 3)
>      impl = search_line_sse42;
> +  else if (__get_cpuid_max (0, &dummy) >= 7
> +            && __get_cpuid_count (7, 0, &dummy, &ebx, &dummy, &dummy)
> +            && (ebx & bit_AVX2))
> +    impl = search_line_avx2;
>    else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
>      {
>        if (minimum == 3 || (ecx & bit_SSE4_2))

Surely this is not enough? You're not checking OS support via xgetbv.

Alexander

Reply via email to