Branch: refs/heads/staging
Home: https://github.com/qemu/qemu
Commit: 8a917b99d5394d34ffcd851c8b287ced6eb48133
https://github.com/qemu/qemu/commit/8a917b99d5394d34ffcd851c8b287ced6eb48133
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Remove SSE4.1 variant
The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
if an SSE register is all zeroes. The PTEST instruction decodes to two
uops, so it can be handled only by the complex decoder, and since
CMP+JNE are macro-fused, both sequences decode to three uops. The uops
comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch
standpoint.
Hence, the use of PTEST brings no benefit from a throughput standpoint.
Its latency is not important, since it feeds only a conditional jump,
which terminates the dependency chain.
I never observed PTEST variants to be faster on real hardware.
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Commit: d018425c324704949c7f65230def9586e71f07f5
https://github.com/qemu/qemu/commit/d018425c324704949c7f65230def9586e71f07f5
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Remove AVX512 variant
Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
Signed-off-by: Mikhail Romanov <[email protected]>
Signed-off-by: Alexander Monakov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
Commit: cbe3d5264631aa193fd2705820cbde6c5a602abb
https://github.com/qemu/qemu/commit/cbe3d5264631aa193fd2705820cbde6c5a602abb
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M include/qemu/cutils.h
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Reorganize for early test for acceleration
Test for length >= 256 inline, where it is often a constant.
Before calling into the accelerated routine, sample three bytes
from the buffer, which handles most non-zero buffers.
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Message-Id: <[email protected]>
[rth: Use __builtin_constant_p; move the indirect call out of line.]
Signed-off-by: Richard Henderson <[email protected]>
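The reorganized wrapper described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact code from include/qemu/cutils.h; the out-of-line routine here is a scalar stand-in for the real accelerated implementations:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Stand-in for the out-of-line accelerated (SIMD) routine. */
static bool buffer_is_zero_ool(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Inline wrapper: the length test folds away at compile time when len
 * is a constant (__builtin_constant_p in the real code), and for large
 * buffers three sampled bytes reject most non-zero inputs before the
 * indirect call to the accelerated routine. */
static inline bool buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len >= 256) {
        /* Probe the first, middle, and last byte. */
        if (p[0] | p[len / 2] | p[len - 1]) {
            return false;
        }
        return buffer_is_zero_ool(buf, len);
    }
    return len == 0 || buffer_is_zero_ool(buf, len);
}
```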
Commit: 93a6085618f16fb2cd316d1e84f1a638b7e2d8ff
https://github.com/qemu/qemu/commit/93a6085618f16fb2cd316d1e84f1a638b7e2d8ff
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Remove useless prefetches
Use of prefetching in bufferiszero.c is quite questionable:
- prefetches are issued just a few CPU cycles before the corresponding
line would be hit by demand loads;
- they are done for simple access patterns, i.e. where hardware
prefetchers can perform better;
- they compete for load ports in loops that should be limited by load
port throughput rather than ALU throughput.
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Commit: f28e0bbefa41fe643cce2f107e868abff312ced9
https://github.com/qemu/qemu/commit/f28e0bbefa41fe643cce2f107e868abff312ced9
Author: Alexander Monakov <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Optimize SSE2 and AVX2 variants
Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).
Avoid using out-of-bounds pointers in loop boundary conditions.
Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).
Signed-off-by: Alexander Monakov <[email protected]>
Signed-off-by: Mikhail Romanov <[email protected]>
Reviewed-by: Richard Henderson <[email protected]>
Message-Id: <[email protected]>
Commit: 7ae6399a85f6a0818a532d9f3c6e200691f6ef68
https://github.com/qemu/qemu/commit/7ae6399a85f6a0818a532d9f3c6e200691f6ef68
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Improve scalar variant
Split the less-than-256 and greater-than-256 cases.
Use unaligned accesses for head and tail.
Avoid using out-of-bounds pointers in loop boundary conditions.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
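The unaligned head/tail technique for the scalar variant can be sketched like this. Illustrative only; the function name and structure are simplified from the actual commit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Scalar check: use unaligned 8-byte loads (via memcpy, which the
 * compiler lowers to plain loads) for the head and tail, then walk the
 * middle in 8-byte steps.  The loop condition "i + 8 <= len" never
 * forms an out-of-bounds pointer. */
static bool buffer_is_zero_scalar(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t t, acc;

    if (len < 8) {                 /* tiny buffers: byte loop */
        acc = 0;
        for (size_t i = 0; i < len; i++) {
            acc |= p[i];
        }
        return acc == 0;
    }

    memcpy(&t, p, 8);              /* unaligned head */
    acc = t;
    memcpy(&t, p + len - 8, 8);    /* unaligned tail, may overlap */
    acc |= t;

    for (size_t i = 8; i + 8 <= len; i += 8) {
        memcpy(&t, p + i, 8);
        acc |= t;
    }
    return acc == 0;
}
```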
Commit: 0100ce2b49725e6ba2fbe8301855978d5d3dc790
https://github.com/qemu/qemu/commit/0100ce2b49725e6ba2fbe8301855978d5d3dc790
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Introduce biz_accel_fn typedef
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
Commit: bf67aa3dd2d8b28d7618d8ec62cd9f6055366751
https://github.com/qemu/qemu/commit/bf67aa3dd2d8b28d7618d8ec62cd9f6055366751
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Simplify test_buffer_is_zero_next_accel
Because the three alternatives are monotonic, we don't need
to keep a couple of bitmasks, just identify the strongest
alternative at startup.
Generalize test_buffer_is_zero_next_accel and init_accel
by always defining an accel_table array.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
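The selection scheme described above can be sketched with a function-pointer table. The typedef name follows the earlier biz_accel_fn commit, but the rest is illustrative, with scalar stand-ins for the SIMD variants:

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

typedef bool (*biz_accel_fn)(const void *, size_t);

static bool buffer_is_zero_int(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Stand-ins; in QEMU these are the real SSE2/AVX2 implementations. */
static bool buffer_is_zero_sse2(const void *buf, size_t len)
{ return buffer_is_zero_int(buf, len); }
static bool buffer_is_zero_avx2(const void *buf, size_t len)
{ return buffer_is_zero_int(buf, len); }

/* The alternatives are monotonic: the table is ordered weakest to
 * strongest, and init picks the strongest one the CPU supports. */
static biz_accel_fn const accel_table[] = {
    buffer_is_zero_int,
    buffer_is_zero_sse2,
    buffer_is_zero_avx2,
};
static unsigned accel_index;

static void init_accel(bool have_sse2, bool have_avx2)
{
    accel_index = have_avx2 ? 2 : have_sse2 ? 1 : 0;
}

/* Used by tests to step down to the next-weaker implementation. */
static bool test_buffer_is_zero_next_accel(void)
{
    if (accel_index == 0) {
        return false;
    }
    accel_index--;
    return true;
}
```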
Commit: 22437b4de94c37e6104d90d46b31d80cf14358d4
https://github.com/qemu/qemu/commit/22437b4de94c37e6104d90d46b31d80cf14358d4
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M util/bufferiszero.c
Log Message:
-----------
util/bufferiszero: Add simd acceleration for aarch64
Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
double-check with the compiler flags for __ARM_NEON and don't bother with
a runtime check. Otherwise, model the loop after the x86 SSE2 function.
Use UMAXV for the vector reduction. This is 3 cycles on cortex-a76 and
2 cycles on neoverse-n1.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
Commit: a06d9eddb015a9f5895161b0a3958a2e4be21579
https://github.com/qemu/qemu/commit/a06d9eddb015a9f5895161b0a3958a2e4be21579
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
A tests/bench/bufferiszero-bench.c
M tests/bench/meson.build
Log Message:
-----------
tests/bench: Add bufferiszero-bench
Benchmark each acceleration function vs an aligned buffer of zeros.
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
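The benchmark's basic shape can be sketched as below. This is a minimal stand-alone illustration using clock(), not the glib-based harness in tests/bench/bufferiszero-bench.c; the function names are made up:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <assert.h>

/* Stand-in for one acceleration function. */
static bool buffer_is_zero_int(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Time one candidate against an aligned, all-zero buffer and report
 * throughput in MB/s; returns 0.0 if the function gives a wrong answer. */
static double bench_one(bool (*fn)(const void *, size_t))
{
    static uint64_t buf[4096 / 8];       /* 4 KiB of zeros, 8-byte aligned */
    const int iters = 100000;
    clock_t t0 = clock();

    for (int i = 0; i < iters; i++) {
        if (!fn(buf, sizeof(buf))) {
            return 0.0;
        }
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0) {
        secs = 1e-9;                     /* clock granularity guard */
    }
    return (double)iters * sizeof(buf) / secs / 1e6;
}
```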
Commit: 909aff7eaf6335aeeb4962fb0ac2a6c571c96af2
https://github.com/qemu/qemu/commit/909aff7eaf6335aeeb4962fb0ac2a6c571c96af2
Author: Richard Henderson <[email protected]>
Date: 2024-05-03 (Fri, 03 May 2024)
Changed paths:
M include/qemu/cutils.h
A tests/bench/bufferiszero-bench.c
M tests/bench/meson.build
M util/bufferiszero.c
Log Message:
-----------
Merge tag 'pull-misc-20240503' of https://gitlab.com/rth7680/qemu into staging
util/bufferiszero:
- Remove sse4.1 and avx512 variants
- Reorganize for early test for acceleration
- Remove useless prefetches
- Optimize sse2, avx2 and integer variants
- Add simd acceleration for aarch64
- Add bufferiszero-bench
# -----BEGIN PGP SIGNATURE-----
#
# iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmY0/qMdHHJpY2hhcmQu
# aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV+ULQf/T2JSdvG6/EjDCf4N
# cnSGiUV2MIeByw8tkrc/fWCNdlulHhk9gbg9l+f2muwK8H/k2BdynbrQnt1Ymmtk
# xzM6+PNOcByaovSAkvNweZVbrQX36Yih9S7f3n+xcxfVuvvYhKSLHXLkeqO96LMd
# rN+WRpxhReaU3n8/FO7o3S26SRpk7X9kRfShaT7U7ytHGjGsXUvMKIRs30hbsJTB
# yjed0a0u54FoSlN6AEqjWdgzaWP8nT65+8Yxe3dzB9hx09UiolZo60eHqYy7Mkno
# N6aMOB6gUUbCiKZ3Qk+1zEX97vl26NH3zt5tIIJTWDoIkC3f9qbg1x5hwWLQ3rra
# rM8h8w==
# =DnZO
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 03 May 2024 08:11:31 AM PDT
# gpg: using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F
# gpg: issuer "[email protected]"
# gpg: Good signature from "Richard Henderson <[email protected]>"
[ultimate]
* tag 'pull-misc-20240503' of https://gitlab.com/rth7680/qemu:
tests/bench: Add bufferiszero-bench
util/bufferiszero: Add simd acceleration for aarch64
util/bufferiszero: Simplify test_buffer_is_zero_next_accel
util/bufferiszero: Introduce biz_accel_fn typedef
util/bufferiszero: Improve scalar variant
util/bufferiszero: Optimize SSE2 and AVX2 variants
util/bufferiszero: Remove useless prefetches
util/bufferiszero: Reorganize for early test for acceleration
util/bufferiszero: Remove AVX512 variant
util/bufferiszero: Remove SSE4.1 variant
Signed-off-by: Richard Henderson <[email protected]>
Compare: https://github.com/qemu/qemu/compare/4977ce198d23...909aff7eaf63