On 21/04/2021 at 11:14, Yibo Cai wrote:
When running benchmarks on Arm64 servers, I find that some of them are extremely slow when built with clang. E.g., "ModeKernelNarrow<BooleanType>/1048576/10000" takes 90s to finish. Almost all of that time is spent generating random bits (preparing test data)[1], not in the test itself. The sample code below shows the issue. I tested on Arm64 with clang-10 and gcc-7.5, both built with -O3. With gcc, the code finishes in 0.1s; with clang, it takes 11s, which is very bad. This issue does not happen on Apple M1 with the Apple clang-12 arm64 compiler. On x86, the clang build of the random engine is also much slower than the gcc build, but the gap is much smaller. As std::default_random_engine is implementation-defined[2], I think its performance (randomness quality, speed) is not guaranteed to be consistent across toolchains. Maybe there are better ways to generate random bits?
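The sample code referred to above is not reproduced in this message. As a rough illustration only, a reproduction of this kind might look like the sketch below. It assumes the test data is produced by drawing one bit at a time from std::bernoulli_distribution on top of std::default_random_engine; the helper names GenerateRandomBits and TimeGenerationSeconds are hypothetical and are not taken from the Arrow code base.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not Arrow's API): build a bitmap of `num_bits` bits,
// each bit set with probability `p`, drawing one Bernoulli sample per bit.
std::vector<std::uint8_t> GenerateRandomBits(std::size_t num_bits, double p,
                                             std::uint32_t seed) {
  // The concrete engine behind this alias is implementation-defined.
  std::default_random_engine engine(seed);
  std::bernoulli_distribution dist(p);
  std::vector<std::uint8_t> bitmap((num_bits + 7) / 8, 0);
  for (std::size_t i = 0; i < num_bits; ++i) {
    if (dist(engine)) {
      bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}

// Hypothetical helper: wall-clock seconds spent generating `repetitions`
// bitmaps of `num_bits` bits each.
double TimeGenerationSeconds(std::size_t num_bits, int repetitions) {
  const auto start = std::chrono::steady_clock::now();
  std::size_t sink = 0;
  for (int r = 0; r < repetitions; ++r) {
    sink += GenerateRandomBits(num_bits, 0.5, static_cast<std::uint32_t>(r)).size();
  }
  const auto stop = std::chrono::steady_clock::now();
  // Store the accumulated size so the loop is not optimized away.
  volatile std::size_t keep = sink;
  (void)keep;
  return std::chrono::duration<double>(stop - start).count();
}
```

Calling TimeGenerationSeconds(1048576, 100) from a small main() built with -O3 under each compiler would give numbers to compare; the bit count and repetition count here are arbitrary choices, not the ones used in the measurements quoted above.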
Can you try out https://github.com/apache/arrow/pull/8879 ?