On 4/22/21 9:38 AM, Yibo Cai wrote:
On 4/21/21 6:07 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:41, Yibo Cai a écrit :
On 4/21/21 5:17 PM, Antoine Pitrou wrote:

Le 21/04/2021 à 11:14, Yibo Cai a écrit :
When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow<BooleanType>/1048576/10000" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?

Can you try out https://github.com/apache/arrow/pull/8879 ?


Tested this PR on Arm64. It improves speed a lot, but still quite slow for 
clang built code.
For benchmark "ModeKernelNarrow<BooleanType>/1048576/10000", run time as below
- clang: master branch - 90s,   apply PR-8879 - 55s
- gcc:   master branch - 2.5s,  apply PR-8879 - 1.5s

Is it using GNU libstdc++ or clang's libc++?
If the latter, perhaps the Bernouilli implementation is very bad?


Both using same libstdc++.
But std::bernoulli_distribution is inlined, so they are indeed different for 
clang and gcc.
https://godbolt.org/z/aT84x5Yec
Looks a pure compiler thing.

For the record. Clang "-ffast-math" option makes the bernoulli generator 100x 
faster on Arm64, 10x faster on x86_64.
As this is only for generating test bits, looks "-ffast-math" is safe for the 
purpose.
https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math

Reply via email to