On 2/6/24 21:13, Alexander Monakov wrote:
>> Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be
>> honest I'd really like to keep this optimization, because so many places
>> in QEMU invoke buffer_is_zero() with a constant length, allowing the
>> compiler to optimize out the length test.
> 
> Hmm. True, both migration and image copy use large blocks frequently.
> Would you be open to testing the availability of the optimized variants
> in the inline wrapper, like this:
> 
> diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> index 62b153e603..7a2145ffef 100644
> --- a/include/qemu/cutils.h
> +++ b/include/qemu/cutils.h
> @@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, size_t len)
>          return true;
>      }
>  
> +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
>      if (len >= 256) {
>          return buffer_is_zero_len_256_plus(vbuf, len);
> -    } else {
> -        return buffer_is_zero_len_4_plus(vbuf, len);
>      }
> +#endif
> +    return buffer_is_zero_len_4_plus(vbuf, len);

Plausible.
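
To make the quoted constant-length point concrete: because the wrapper
stays inline, a call site with a literal length lets the compiler fold
the "len >= 256" test and emit a direct call to the right variant. A
minimal sketch (the caller below is hypothetical, not from the patch):

#include "qemu/cutils.h"

/* With buffer_is_zero() inline and len a compile-time constant, the
 * compiler resolves "len >= 256" at compile time and calls
 * buffer_is_zero_len_256_plus() directly, with no runtime length test. */
static bool page_is_blank(const void *page)
{
    return buffer_is_zero(page, 4096);
}
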
Also, now that we're down to two variants instead of four, perhaps a statically predicted
direct branch or two might be better than an indirect branch? E.g.
bool buffer_is_zero_len_256_plus(const void *buf, size_t len)
{
#ifdef CONFIG_AVX2_OPT
    if (select_accel & CPUINFO_AVX2) {
        return buffer_zero_avx2(buf, len);
    }
#endif
#ifdef __SSE2__
    if (select_accel & CPUINFO_SSE2) {
        return buffer_zero_sse2(buf, len);
    }
#endif
    return buffer_is_zero_len_4_plus(buf, len);
}
where select_accel would be set by test_buffer_is_zero_next_accel() etc.
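
As a sketch of that setup (assuming a cpuinfo-style feature word; the
init_accel() helper and its wiring are illustrative, not the actual
patch):

/* Choose at most one accelerator bit up front, so each direct branch
 * in buffer_is_zero_len_256_plus() tests a single, statically
 * predictable condition.  The CPUINFO_* bits follow the sketch above;
 * init_accel() is hypothetical. */
static unsigned select_accel;

static void init_accel(unsigned cpuinfo)
{
#ifdef CONFIG_AVX2_OPT
    if (cpuinfo & CPUINFO_AVX2) {
        select_accel = CPUINFO_AVX2;
        return;
    }
#endif
#ifdef __SSE2__
    if (cpuinfo & CPUINFO_SSE2) {
        select_accel = CPUINFO_SSE2;
        return;
    }
#endif
    select_accel = 0;  /* fall back to buffer_is_zero_len_4_plus() */
}

test_buffer_is_zero_next_accel() could then clear the current bit and
re-run the same selection, exercising each variant in turn.
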
r~