Branch: refs/heads/staging
Home: https://github.com/qemu/qemu
Commit: 8a917b99d5394d34ffcd851c8b287ced6eb48133
https://github.com/qemu/qemu/commit/8a917b99d5394d34ffcd851c8b287ced6eb48133
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Remove SSE4.1 variant
The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
if an SSE register is all zeroes. The PTEST instruction decodes to two
uops, so it can be handled only by the complex decoder, and since
CMP+JNE are macro-fused, both sequences decode to three uops. The uops
comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch
standpoint.
Hence, the use of PTEST brings no benefit from a throughput standpoint.
Its latency is not important, since it feeds only a conditional jump,
which terminates the dependency chain.
I never observed PTEST variants to be faster on real hardware.
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Commit: d018425c324704949c7f65230def9586e71f07f5
https://github.com/qemu/qemu/commit/d018425c324704949c7f65230def9586e71f07f5
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Remove AVX512 variant
Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
Signed-off-by: Mikhail Romanov <[email protected]>
Signed-off-by: Alexander Monakov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
Commit: cbe3d5264631aa193fd2705820cbde6c5a602abb
https://github.com/qemu/qemu/commit/cbe3d5264631aa193fd2705820cbde6c5a602abb
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M include/qemu/cutils.h
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Reorganize for early test for acceleration
Test for length >= 256 inline, where it is often a constant.
Before calling into the accelerated routine, sample three bytes
from the buffer, which handles most non-zero buffers.
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Message-Id: <[email protected]>
[rth: Use __builtin_constant_p; move the indirect call out of line.]
Signed-off-by: Richard Henderson <[email protected]>
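The reorganized wrapper described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact code from include/qemu/cutils.h; the out-of-line routine here is a scalar stand-in for the real accelerated implementations:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Stand-in for the out-of-line accelerated (SIMD) routine. */
static bool buffer_is_zero_ool(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Inline wrapper: the length test folds away at compile time when len
 * is a constant (__builtin_constant_p in the real code), and for large
 * buffers three sampled bytes reject most non-zero inputs before the
 * indirect call to the accelerated routine. */
static inline bool buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len >= 256) {
        /* Probe the first, middle, and last byte. */
        if (p[0] | p[len / 2] | p[len - 1]) {
            return false;
        }
        return buffer_is_zero_ool(buf, len);
    }
    return len == 0 || buffer_is_zero_ool(buf, len);
}
```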
Commit: 93a6085618f16fb2cd316d1e84f1a638b7e2d8ff
https://github.com/qemu/qemu/commit/93a6085618f16fb2cd316d1e84f1a638b7e2d8ff
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Remove useless prefetches
Use of prefetching in bufferiszero.c is quite questionable:
- prefetches are issued just a few CPU cycles before the corresponding
line would be hit by demand loads;
- they are done for simple access patterns, i.e. where hardware
prefetchers can perform better;
- they compete for load ports in loops that should be limited by load
port throughput rather than ALU throughput.
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Commit: f28e0bbefa41fe643cce2f107e868abff312ced9
https://github.com/qemu/qemu/commit/f28e0bbefa41fe643cce2f107e868abff312ced9
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Optimize SSE2 and AVX2 variants
Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).
Avoid using out-of-bounds pointers in loop boundary conditions.
Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Commit: 7ae6399a85f6a0818a532d9f3c6e200691f6ef68
https://github.com/qemu/qemu/commit/7ae6399a85f6a0818a532d9f3c6e200691f6ef68
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Improve scalar variant
Split the less-than-256 and greater-than-256 cases.
Use unaligned accesses for head and tail.
Avoid using out-of-bounds pointers in loop boundary conditions.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
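The unaligned head/tail technique for the scalar variant can be sketched like this. Illustrative only; the function name and structure are simplified from the actual commit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Scalar check: use unaligned 8-byte loads (via memcpy, which the
 * compiler lowers to plain loads) for the head and tail, then walk the
 * middle in 8-byte steps.  The loop condition "i + 8 <= len" never
 * forms an out-of-bounds pointer. */
static bool buffer_is_zero_scalar(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t t, acc;

    if (len < 8) {                 /* tiny buffers: byte loop */
        acc = 0;
        for (size_t i = 0; i < len; i++) {
            acc |= p[i];
        }
        return acc == 0;
    }

    memcpy(&t, p, 8);              /* unaligned head */
    acc = t;
    memcpy(&t, p + len - 8, 8);    /* unaligned tail, may overlap */
    acc |= t;

    for (size_t i = 8; i + 8 <= len; i += 8) {
        memcpy(&t, p + i, 8);
        acc |= t;
    }
    return acc == 0;
}
```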
Commit: 0100ce2b49725e6ba2fbe8301855978d5d3dc790
https://github.com/qemu/qemu/commit/0100ce2b49725e6ba2fbe8301855978d5d3dc790
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Introduce biz_accel_fn typedef
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
Commit: bf67aa3dd2d8b28d7618d8ec62cd9f6055366751
https://github.com/qemu/qemu/commit/bf67aa3dd2d8b28d7618d8ec62cd9f6055366751
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Simplify test_buffer_is_zero_next_accel
Because the three alternatives are monotonic, we don't need
to keep a couple of bitmasks, just identify the strongest
alternative at startup.
Generalize test_buffer_is_zero_next_accel and init_accel
by always defining an accel_table array.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
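The selection scheme described above can be sketched with a function-pointer table. The typedef name follows the earlier biz_accel_fn commit, but the rest is illustrative, with scalar stand-ins for the SIMD variants:

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

typedef bool (*biz_accel_fn)(const void *, size_t);

static bool buffer_is_zero_int(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Stand-ins; in QEMU these are the real SSE2/AVX2 implementations. */
static bool buffer_is_zero_sse2(const void *buf, size_t len)
{ return buffer_is_zero_int(buf, len); }
static bool buffer_is_zero_avx2(const void *buf, size_t len)
{ return buffer_is_zero_int(buf, len); }

/* The alternatives are monotonic: the table is ordered weakest to
 * strongest, and init picks the strongest one the CPU supports. */
static biz_accel_fn const accel_table[] = {
    buffer_is_zero_int,
    buffer_is_zero_sse2,
    buffer_is_zero_avx2,
};
static unsigned accel_index;

static void init_accel(bool have_sse2, bool have_avx2)
{
    accel_index = have_avx2 ? 2 : have_sse2 ? 1 : 0;
}

/* Used by tests to step down to the next-weaker implementation. */
static bool test_buffer_is_zero_next_accel(void)
{
    if (accel_index == 0) {
        return false;
    }
    accel_index--;
    return true;
}
```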
Commit: 22437b4de94c37e6104d90d46b31d80cf14358d4
https://github.com/qemu/qemu/commit/22437b4de94c37e6104d90d46b31d80cf14358d4
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Add simd acceleration for aarch64
Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
double-check with the compiler flags for __ARM_NEON and don't bother with
a runtime check. Otherwise, model the loop after the x86 SSE2 function.
Use UMAXV for the vector reduction. This is 3 cycles on cortex-a76 and
2 cycles on neoverse-n1.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
Commit: a06d9eddb015a9f5895161b0a3958a2e4be21579
https://github.com/qemu/qemu/commit/a06d9eddb015a9f5895161b0a3958a2e4be21579
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
A tests/bench/bufferiszero-bench.c
M tests/bench/meson.build
Log Message:
-----------
tests/bench: Add bufferiszero-bench
Benchmark each acceleration function vs an aligned buffer of zeros.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
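The benchmark's basic shape can be sketched as below. This is a minimal stand-alone illustration using clock(), not the glib-based harness in tests/bench/bufferiszero-bench.c; the function names are made up:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <assert.h>

/* Stand-in for one acceleration function. */
static bool buffer_is_zero_int(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Time one candidate against an aligned, all-zero buffer and report
 * throughput in MB/s; returns 0.0 if the function gives a wrong answer. */
static double bench_one(bool (*fn)(const void *, size_t))
{
    static uint64_t buf[4096 / 8];       /* 4 KiB of zeros, 8-byte aligned */
    const int iters = 100000;
    clock_t t0 = clock();

    for (int i = 0; i < iters; i++) {
        if (!fn(buf, sizeof(buf))) {
            return 0.0;
        }
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0) {
        secs = 1e-9;                     /* clock granularity guard */
    }
    return (double)iters * sizeof(buf) / secs / 1e6;
}
```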
Commit: 909aff7eaf6335aeeb4962fb0ac2a6c571c96af2
https://github.com/qemu/qemu/commit/909aff7eaf6335aeeb4962fb0ac2a6c571c96af2
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M include/qemu/cutils.h
A tests/bench/bufferiszero-bench.c
M tests/bench/meson.build
M util/bufferiszero.c
Log Message:
-----------
Merge tag 'pull-misc-20240503' of https://gitlab.com/rth7680/qemu into staging
util/bufferiszero:
- Remove sse4.1 and avx512 variants
- Reorganize for early test for acceleration
- Remove useless prefetches
- Optimize sse2, avx2 and integer variants
- Add simd acceleration for aarch64
- Add bufferiszero-bench
# -----BEGIN PGP SIGNATURE-----
#
# iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmY0/qMdHHJpY2hhcmQu
# aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV+ULQf/T2JSdvG6/EjDCf4N
# cnSGiUV2MIeByw8tkrc/fWCNdlulHhk9gbg9l+f2muwK8H/k2BdynbrQnt1Ymmtk
# xzM6+PNOcByaovSAkvNweZVbrQX36Yih9S7f3n+xcxfVuvvYhKSLHXLkeqO96LMd
# rN+WRpxhReaU3n8/FO7o3S26SRpk7X9kRfShaT7U7ytHGjGsXUvMKIRs30hbsJTB
# yjed0a0u54FoSlN6AEqjWdgzaWP8nT65+8Yxe3dzB9hx09UiolZo60eHqYy7Mkno
# N6aMOB6gUUbCiKZ3Qk+1zEX97vl26NH3zt5tIIJTWDoIkC3f9qbg1x5hwWLQ3rra
# rM8h8w==
# =DnZO
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 03 May 2024 08:11:31 AM PDT
# gpg: using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F
# gpg: issuer "[email protected]"
# gpg: Good signature from "Richard Henderson <[email protected]>"
[ultimate]
* tag 'pull-misc-20240503' of https://gitlab.com/rth7680/qemu:
tests/bench: Add bufferiszero-bench
util/bufferiszero: Add simd acceleration for aarch64
util/bufferiszero: Simplify test_buffer_is_zero_next_accel
util/bufferiszero: Introduce biz_accel_fn typedef
util/bufferiszero: Improve scalar variant
util/bufferiszero: Optimize SSE2 and AVX2 variants
util/bufferiszero: Remove useless prefetches
util/bufferiszero: Reorganize for early test for acceleration
util/bufferiszero: Remove AVX512 variant
util/bufferiszero: Remove SSE4.1 variant
Signed-off-by: Richard Henderson <[email protected]>
Compare: https://github.com/qemu/qemu/compare/4977ce198d23...909aff7eaf63