bufferiszero: Add simd acceleration for loongarch64

maobibo Wed, 05 Jun 2024 19:31:22 -0700



On 2024/6/6 7:51 AM, Richard Henderson wrote:

On 6/5/24 02:32, Bibo Mao wrote:

Different gcc versions have different features, so the macros CONFIG_LSX_OPT
and CONFIG_LASX_OPT are added here to detect whether gcc supports
the built-in lsx/lasx macros.

Function buffer_zero_lsx() is added for 128bit simd fpu optimization,
and function buffer_zero_lasx() is for 256bit simd fpu optimization.

The Loongarch gcc built-in lsx/lasx macros can be used only when the compiler
option -mlsx/-mlasx is given, and there is no way to enable them for a
single function only. So the optimization is only in effect when qemu is
compiled with the parameter --extra-cflags="-mlasx"

Signed-off-by: Bibo Mao <maob...@loongson.cn>
---
  meson.build         |  11 +++++
  util/bufferiszero.c | 103 ++++++++++++++++++++++++++++++++++++++++++++
  2 files changed, 114 insertions(+)

diff --git a/meson.build b/meson.build
index 6386607144..29bc362d7a 100644
--- a/meson.build
+++ b/meson.build
@@ -2855,6 +2855,17 @@ config_host_data.set('CONFIG_ARM_AES_BUILTIN',cc.compiles('''
      void foo(uint8x16_t *p) { *p = vaesmcq_u8(*p); }
    '''))
+# For Loongarch64, detect if LSX/LASX are available.
+config_host_data.set('CONFIG_LSX_OPT', cc.compiles('''
+    #include "lsxintrin.h"
+    int foo(__m128i v) { return __lsx_bz_v(v); }
+  '''))
+
+config_host_data.set('CONFIG_LASX_OPT', cc.compiles('''
+    #include "lasxintrin.h"
+    int foo(__m256i v) { return __lasx_xbz_v(v); }
+  '''))

Both of these are introduced by gcc 14 and llvm 18, so I'm not certain
of the utility of separate tests. We might simplify this with


   config_host_data.set('CONFIG_LSX_LASX_INTRIN_H',
     cc.has_header('lsxintrin.h') and cc.has_header('lasxintrin.h'))

As you say, these headers require vector instructions to be enabled at
compile-time rather than detecting them at runtime. This is a point
where the compilers could be improved to support
__attribute__((target("xyz"))) and the builtins with that. The i386
port does this, for instance.

In the meantime, it means that you don't need a runtime test. Similar
to aarch64 and the use of __ARM_NEON as a compile-time test for simd
support. Perhaps


#elif defined(CONFIG_LSX_LASX_INTRIN_H) && \
       (defined(__loongarch_sx) || defined(__loongarch_asx))
# ifdef __loongarch_sx
   ...
# endif
# ifdef __loongarch_asx
   ...
# endif
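
To make the compile-time selection above concrete, here is a minimal sketch, not the code from the patch. It assumes that __lsx_vld/__lsx_bz_v and __lasx_xvld/__lasx_xbz_v behave as commented (load a vector, and return nonzero when the whole vector is zero); sketch_buffer_is_zero() is a hypothetical name, and the sketch picks one path with #elif rather than providing both.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__loongarch_asx)
# include <lasxintrin.h>
#elif defined(__loongarch_sx)
# include <lsxintrin.h>
#endif

/* Hypothetical helper for illustration; this is not QEMU's buffer_is_zero(). */
bool sketch_buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

#if defined(__loongarch_asx)
    /* 256-bit path, only compiled when the build uses -mlasx. */
    for (; i + 32 <= len; i += 32) {
        __m256i v = __lasx_xvld(p + i, 0);
        /* Assumption: __lasx_xbz_v() is nonzero when all bits of v are zero. */
        if (!__lasx_xbz_v(v)) {
            return false;
        }
    }
#elif defined(__loongarch_sx)
    /* 128-bit path, only compiled when the build uses -mlsx. */
    for (; i + 16 <= len; i += 16) {
        __m128i v = __lsx_vld(p + i, 0);
        /* Assumption: __lsx_bz_v() is nonzero when all bits of v are zero. */
        if (!__lsx_bz_v(v)) {
            return false;
        }
    }
#else
    /* Portable fallback: check 8 bytes at a time. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, sizeof(w));
        if (w != 0) {
            return false;
        }
    }
#endif

    /* Remaining tail bytes, whichever path ran above. */
    for (; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}
```

Built without -mlsx/-mlasx, only the fallback branch is compiled, which is why no runtime test is needed in this scheme.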

Sure, I will do it this way.
There is also a runtime check based on hwcap, such as this:

  unsigned info = cpuinfo_init();
  if (info & CPUINFO_LASX)
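
As a rough illustration of how such a runtime check can pick an implementation, here is a sketch under stated assumptions. The SKETCH_CPUINFO_* bit values, the sketch_* function names, and the stand-in implementations are all hypothetical; the real CPUINFO_* values and cpuinfo_init() are not shown in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bits, standing in for the CPUINFO_* flags. */
#define SKETCH_CPUINFO_LSX   (1u << 0)
#define SKETCH_CPUINFO_LASX  (1u << 1)

typedef bool (*sketch_zero_fn)(const void *buf, size_t len);

/* Plain byte loop used as the integer fallback. */
bool sketch_zero_int(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Stand-ins: a real build would use 128-bit and 256-bit vector code here. */
bool sketch_zero_lsx(const void *buf, size_t len)
{
    return sketch_zero_int(buf, len);
}

bool sketch_zero_lasx(const void *buf, size_t len)
{
    return sketch_zero_int(buf, len);
}

/* Prefer the widest vector unit that the feature mask reports. */
sketch_zero_fn sketch_select(unsigned info)
{
    if (info & SKETCH_CPUINFO_LASX) {
        return sketch_zero_lasx;
    }
    if (info & SKETCH_CPUINFO_LSX) {
        return sketch_zero_lsx;
    }
    return sketch_zero_int;
}
```

The selection would normally run once at startup, with the result cached in a function pointer, so the per-call cost is a single indirect call.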

The actual code is perfectly fine, of course, since it follows the
pattern from the others. How much improvement do you see from
bufferiszero-bench?

Yes, it is much easier to follow the others; there is nothing new here.

Here are the benchmark results. There is no obvious improvement with a
1K buffer size. With a 16K page size there is about a 200% improvement
with LASX and about a 100% improvement with LSX.

# /root/src/qemu/b/tests/bench/bufferiszero-bench --tap -k
# Start of cutils tests
# Start of bufferiszero tests
# buffer_is_zero #0:  1KB    13460 MB/sec
# buffer_is_zero #0:  4KB    36857 MB/sec
# buffer_is_zero #0: 16KB    69884 MB/sec
# buffer_is_zero #0: 64KB    80863 MB/sec
#
# buffer_is_zero #1:  1KB    11180 MB/sec
# buffer_is_zero #1:  4KB    27972 MB/sec
# buffer_is_zero #1: 16KB    42951 MB/sec
# buffer_is_zero #1: 64KB    43293 MB/sec
#
# buffer_is_zero #2:  1KB    10026 MB/sec
# buffer_is_zero #2:  4KB    18373 MB/sec
# buffer_is_zero #2: 16KB    23933 MB/sec
# buffer_is_zero #2: 64KB    25180 MB/sec
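
The relative gains can be recomputed from the figures above with a small sketch like the following. It assumes that variant #2 is the baseline and that #0 and #1 are the two vector variants; the output does not label them, so that mapping and all sketch_* names are assumptions.

```c
#include <stdio.h>

#define SKETCH_NSIZES 4

/* Throughput in MB/sec, copied from the benchmark output above. */
const char *const sketch_sizes[SKETCH_NSIZES] = { "1KB", "4KB", "16KB", "64KB" };
const double sketch_v0[SKETCH_NSIZES] = { 13460, 36857, 69884, 80863 };
const double sketch_v1[SKETCH_NSIZES] = { 11180, 27972, 42951, 43293 };
const double sketch_v2[SKETCH_NSIZES] = { 10026, 18373, 23933, 25180 };

/* Improvement of 'fast' over 'base', expressed in percent. */
double sketch_gain_percent(double fast, double base)
{
    return (fast / base - 1.0) * 100.0;
}

/* Print the gain of variants #0 and #1 over variant #2 for each size. */
void sketch_print_gains(void)
{
    for (int i = 0; i < SKETCH_NSIZES; i++) {
        printf("%4s: #0 vs #2 %+.0f%%, #1 vs #2 %+.0f%%\n",
               sketch_sizes[i],
               sketch_gain_percent(sketch_v0[i], sketch_v2[i]),
               sketch_gain_percent(sketch_v1[i], sketch_v2[i]));
    }
}
```

For the 16KB row this gives ratios of roughly 2.9x and 1.8x over variant #2.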

Regards
Bibo Mao

r~

Re: [PATCH 2/2] util/bufferiszero: Add simd acceleration for loongarch64
