On 2024/6/6 上午7:51, Richard Henderson wrote:
On 6/5/24 02:32, Bibo Mao wrote:
Different gcc versions have different features, macro CONFIG_LSX_OPT
and CONFIG_LASX_OPT is added here to detect whether gcc supports
built-in lsx/lasx macro.
Function buffer_zero_lsx() is added for 128bit simd fpu optimization,
and function buffer_zero_lasx() is for 256bit simd fpu optimization.
Loongarch gcc built-in lsx/lasx macro can be used only when compiler
option -mlsx/-mlasx is added, and there is no separate compiler option
for function only. So it is only in effect when qemu is compiled with
parameter --extra-cflags="-mlasx"
Signed-off-by: Bibo Mao <maob...@loongson.cn>
---
meson.build | 11 +++++
util/bufferiszero.c | 103 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 114 insertions(+)
diff --git a/meson.build b/meson.build
index 6386607144..29bc362d7a 100644
--- a/meson.build
+++ b/meson.build
@@ -2855,6 +2855,17 @@ config_host_data.set('CONFIG_ARM_AES_BUILTIN',
cc.compiles('''
void foo(uint8x16_t *p) { *p = vaesmcq_u8(*p); }
'''))
+# For Loongarch64, detect if LSX/LASX are available.
+ config_host_data.set('CONFIG_LSX_OPT', cc.compiles('''
+ #include "lsxintrin.h"
+ int foo(__m128i v) { return __lsx_bz_v(v); }
+ '''))
+
+config_host_data.set('CONFIG_LASX_OPT', cc.compiles('''
+ #include "lasxintrin.h"
+ int foo(__m256i v) { return __lasx_xbz_v(v); }
+ '''))
Both of these are introduced by gcc 14 and llvm 18, so I'm not certain
of the utility of separate tests. We might simplify this with
config_host_data.set('CONFIG_LSX_LASX_INTRIN_H',
cc.has_header('lsxintrin.h') && cc.has_header('lasxintrin.h'))
As you say, these headers require vector instructions to be enabled at
compile-time rather than detecting them at runtime. This is a point
where the compilers could be improved to support
__attribute__((target("xyz"))) and the builtins with that. The i386
port does this, for instance.
In the meantime, it means that you don't need a runtime test. Similar
to aarch64 and the use of __ARM_NEON as a compile-time test for simd
support. Perhaps
#elif defined(CONFIG_LSX_LASX_INTRIN_H) && \
(defined(__loongarch_sx) || defined(__loongarch_asx))
# ifdef __loongarch_sx
...
# endif
# ifdef __loongarch_asx
...
# endif
Sure, will do in this way.
And also there is runtime check coming from hwcap, such this:
unsigned info = cpuinfo_init();
if (info & CPUINFO_LASX)
The actual code is perfectly fine, of course, since it follows the
pattern from the others. How much improvement do you see from
bufferiszero-bench?
yes, it is much easier to follow others, it is not new things.
Here is the benchmark result, no obvious improvement with 1K
buffer size. 200% improvement with LASX, 100% improve with LSX
with 16K page size.
# /root/src/qemu/b/tests/bench/bufferiszero-bench --tap -k
# Start of cutils tests
# Start of bufferiszero tests
# buffer_is_zero #0: 1KB 13460 MB/sec
# buffer_is_zero #0: 4KB 36857 MB/sec
# buffer_is_zero #0: 16KB 69884 MB/sec
# buffer_is_zero #0: 64KB 80863 MB/sec
#
# buffer_is_zero #1: 1KB 11180 MB/sec
# buffer_is_zero #1: 4KB 27972 MB/sec
# buffer_is_zero #1: 16KB 42951 MB/sec
# buffer_is_zero #1: 64KB 43293 MB/sec
#
# buffer_is_zero #2: 1KB 10026 MB/sec
# buffer_is_zero #2: 4KB 18373 MB/sec
# buffer_is_zero #2: 16KB 23933 MB/sec
# buffer_is_zero #2: 64KB 25180 MB/sec
Regards
Bibo Mao
r~