On Wed, 7 Feb 2024, Richard Henderson wrote:

> On 2/7/24 06:48, Alexander Monakov wrote:
> > Make buffer_is_zero a 'static inline' function that tests up to three
> > bytes from the buffer before handing off to an unrolled loop. This
> > eliminates call overhead for most non-zero buffers, and allows to
> > optimize out length checks when it is known at compile time (which is
> > often the case in Qemu).
> > 
> > Signed-off-by: Alexander Monakov <amona...@ispras.ru>
> > Signed-off-by: Mikhail Romanov <mmroma...@ispras.ru>
> > ---
> >   include/qemu/cutils.h | 28 +++++++++++++++-
> >   util/bufferiszero.c   | 76 ++++++++++++-------------------------------
> >   2 files changed, 47 insertions(+), 57 deletions(-)
> > 
> > diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> > index 92c927a6a3..62b153e603 100644
> > --- a/include/qemu/cutils.h
> > +++ b/include/qemu/cutils.h
> > @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz);
> >   /* used to print char* safely */
> >   #define STR_OR_NULL(str) ((str) ? (str) : "null")
> >   
> > -bool buffer_is_zero(const void *buf, size_t len);
> > +bool buffer_is_zero_len_4_plus(const void *, size_t);
> > +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t);
> 
> Why 256, when the avx2 routine can handle size 128, and you're about to remove
> avx512?

(yes, avx2 is bumped to 256-byte chunks in a later patch)

> You appear to have missed that select_accel_fn() resolves directly to
> buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an
> indirect function call.
> 
> I think you should not attempt to expose the 4 vs larger implementation detail
> here in the inline function.  Presumably the bulk of the benefit in avoiding
> the function call is already realized via the three byte spot checks.

Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be honest
I'd really like to keep this optimization because so many places in Qemu invoke
buffer_is_zero with a constant length, allowing the compiler to optimize out
the length test. Would you be open to testing availability of optimized variants
in the inline wrapper like this:

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 62b153e603..7a2145ffef 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, 
size_t len)
         return true;
     }

+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
     if (len >= 256) {
         return buffer_is_zero_len_256_plus(vbuf, len);
-    } else {
-        return buffer_is_zero_len_4_plus(vbuf, len);
     }
+#endif
+    return buffer_is_zero_len_4_plus(vbuf, len);
 }

 /*

Alexander

Reply via email to