I am posting a new revision of buffer_is_zero improvements (v2 can be found at https://patchew.org/QEMU/20231027143704.7060-1-mmroma...@ispras.ru/ ).

In our experiments buffer_is_zero took about 40%-50% of overall qemu-img run time, even though Glib I/O is not very efficient. Hence, it remains an important routine to optimize.

We substantially improve its performance in typical cases, mostly by introducing an inline wrapper that samples three bytes from head/middle/tail, avoiding call overhead when any of those is non-zero. We also provide improvements for SIMD and portable scalar variants.

Changed for v3:

- separate into 6 patches
- fix an oversight which would break the build on non-x86 hosts
- properly avoid out-of-bounds pointers in the scalar variant

Alexander Monakov (6):
  util/bufferiszero: remove SSE4.1 variant
  util/bufferiszero: introduce an inline wrapper
  util/bufferiszero: remove AVX512 variant
  util/bufferiszero: remove useless prefetches
  util/bufferiszero: optimize SSE2 and AVX2 variants
  util/bufferiszero: improve scalar variant

 include/qemu/cutils.h |  28 ++++-
 util/bufferiszero.c   | 280 +++++++++++++++---------------------------
 2 files changed, 128 insertions(+), 180 deletions(-)

--
2.32.0
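
For readers who want to see the three-byte sampling idea from the description in concrete form, here is a minimal sketch. It is not the code from this series: the names sketch_buffer_is_zero, sketch_is_zero_slow and sketch_selfcheck are hypothetical, a plain byte loop stands in for the SIMD and scalar variants, and treating an empty buffer as zero is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical out-of-line slow path. A plain byte loop stands in for
 * the SIMD and scalar variants; it is not the code from the series.
 */
static bool sketch_is_zero_slow(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Inline wrapper: sample the first, middle and last byte before paying
 * for a function call. Any non-zero sample settles the answer at once.
 */
static inline bool sketch_buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len == 0) {
        return true;    /* assumption: an empty buffer counts as zero */
    }
    if (p[0] | p[len / 2] | p[len - 1]) {
        return false;   /* decided inline, no call made */
    }
    return sketch_is_zero_slow(p, len);
}

/* Small usage example exercising both the inline and the slow path. */
static void sketch_selfcheck(void)
{
    unsigned char buf[256] = { 0 };

    assert(sketch_buffer_is_zero(buf, sizeof(buf)));

    buf[255] = 1;       /* caught by the tail sample */
    assert(!sketch_buffer_is_zero(buf, sizeof(buf)));

    buf[255] = 0;
    buf[7] = 1;         /* missed by the samples, caught by the slow path */
    assert(!sketch_buffer_is_zero(buf, sizeof(buf)));
}
```

The point of the design is that buffers with non-zero data at any of the three sampled positions are rejected without a function call; only buffers whose samples are all zero reach the full scan.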