On Wed, 7 Feb 2024, Richard Henderson wrote:
> On 2/7/24 06:48, Alexander Monakov wrote: > > Make buffer_is_zero a 'static inline' function that tests up to three > > bytes from the buffer before handing off to an unrolled loop. This > > eliminates call overhead for most non-zero buffers, and allows to > > optimize out length checks when it is known at compile time (which is > > often the case in Qemu). > > > > Signed-off-by: Alexander Monakov <amona...@ispras.ru> > > Signed-off-by: Mikhail Romanov <mmroma...@ispras.ru> > > --- > > include/qemu/cutils.h | 28 +++++++++++++++- > > util/bufferiszero.c | 76 ++++++++++++------------------------------- > > 2 files changed, 47 insertions(+), 57 deletions(-) > > > > diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h > > index 92c927a6a3..62b153e603 100644 > > --- a/include/qemu/cutils.h > > +++ b/include/qemu/cutils.h > > @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); > > /* used to print char* safely */ > > #define STR_OR_NULL(str) ((str) ? (str) : "null") > > > > -bool buffer_is_zero(const void *buf, size_t len); > > +bool buffer_is_zero_len_4_plus(const void *, size_t); > > +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); > > Why 256, when the avx2 routine can handle size 128, and you're about to remove > avx512? (yes, avx2 is bumped to 256-byte chunks in a later patch) > You appear to have missed that select_accel_fn() resolves directly to > buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an > indirect function call. > > I think you should not attempt to expose the 4 vs larger implementation detail > here in the inline function. Presumably the bulk of the benefit in avoiding > the function call is already realized via the three byte spot checks. Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be honest I'd really like to keep this optimization because so many places in Qemu invoke buffer_is_zero with a constant length, allowing the compiler to optimize out the length test. Would you be open to testing availability of optimized variants in the inline wrapper like this: diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 62b153e603..7a2145ffef 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, size_t len) return true; } +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) if (len >= 256) { return buffer_is_zero_len_256_plus(vbuf, len); - } else { - return buffer_is_zero_len_4_plus(vbuf, len); } +#endif + return buffer_is_zero_len_4_plus(vbuf, len); } /* Alexander