On 2/7/24 06:48, Alexander Monakov wrote:
Make buffer_is_zero a 'static inline' function that tests up to three
bytes from the buffer before handing off to an unrolled loop. This
eliminates call overhead for most non-zero buffers, and allows to
optimize out length checks when it is known at compile time (which is
often the case in Qemu).

Signed-off-by: Alexander Monakov <amona...@ispras.ru>
Signed-off-by: Mikhail Romanov <mmroma...@ispras.ru>
---
  include/qemu/cutils.h | 28 +++++++++++++++-
  util/bufferiszero.c   | 76 ++++++++++++-------------------------------
  2 files changed, 47 insertions(+), 57 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 92c927a6a3..62b153e603 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz);
  /* used to print char* safely */
  #define STR_OR_NULL(str) ((str) ? (str) : "null")
-bool buffer_is_zero(const void *buf, size_t len);
+bool buffer_is_zero_len_4_plus(const void *, size_t);
+extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t);

Why 256, when the avx2 routine can handle size 128, and you're about to remove 
avx512?

You appear to have missed that select_accel_fn() resolves directly to buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an indirect function call.

I think you should not attempt to expose the 4 vs larger implementation detail here in the inline function. Presumably the bulk of the benefit in avoiding the function call is already realized via the three byte spot checks.


r~

Reply via email to