RE: [C++] Indeterminate poor performance of random number generator

2021-04-22 Thread Yibo Cai
Yes, these soft-float math (in libm.so) makes Arm binary extremely slow.

-Original Message-
From: Antoine Pitrou 
Sent: Thursday, April 22, 2021 17:20
To: dev@arrow.apache.org
Subject: Re: [C++] Indeterminate poor performance of random number generator


Le 22/04/2021 à 03:38, Yibo Cai a écrit :
>
> Both using same libstdc++.
> But std::bernoulli_distribution is inlined, so they are indeed different for 
> clang and gcc.
> https://godbolt.org/z/aT84x5Yec
> Looks a pure compiler thing.

It looks like clang generates calls to logl() and __divtf3() (soft-float long 
double division) inside the loop.  Perhaps that can be avoided by 
reimplementing the Bernoulli distribution.  If we don't care too much about 
accuracy and extreme probability values (very close to 0 or 1), that should be 
relatively easy.

Regards

Antoine.
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


Re: [C++] Indeterminate poor performance of random number generator

2021-04-22 Thread Yibo Cai

On 4/22/21 9:38 AM, Yibo Cai wrote:

On 4/21/21 6:07 PM, Antoine Pitrou wrote:



Le 21/04/2021 à 11:41, Yibo Cai a écrit :

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



Tested this PR on Arm64. It improves speed a lot, but still quite slow for 
clang built code.
For benchmark "ModeKernelNarrow/1048576/1", run time as below
- clang: master branch - 90s,   apply PR-8879 - 55s
- gcc:   master branch - 2.5s,  apply PR-8879 - 1.5s


Is it using GNU libstdc++ or clang's libc++?
If the latter, perhaps the Bernouilli implementation is very bad?



Both using same libstdc++.
But std::bernoulli_distribution is inlined, so they are indeed different for 
clang and gcc.
https://godbolt.org/z/aT84x5Yec
Looks a pure compiler thing.


For the record. Clang "-ffast-math" option makes the bernoulli generator 100x 
faster on Arm64, 10x faster on x86_64.
As this is only for generating test bits, looks "-ffast-math" is safe for the 
purpose.
https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math


Re: [C++] Indeterminate poor performance of random number generator

2021-04-22 Thread Antoine Pitrou



Le 22/04/2021 à 03:38, Yibo Cai a écrit :


Both using same libstdc++.
But std::bernoulli_distribution is inlined, so they are indeed different for 
clang and gcc.
https://godbolt.org/z/aT84x5Yec
Looks a pure compiler thing.


It looks like clang generates calls to logl() and __divtf3() (soft-float 
long double division) inside the loop.  Perhaps that can be avoided by 
reimplementing the Bernoulli distribution.  If we don't care too much 
about accuracy and extreme probability values (very close to 0 or 1), 
that should be relatively easy.


Regards

Antoine.


Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Yibo Cai

On 4/21/21 6:07 PM, Antoine Pitrou wrote:



Le 21/04/2021 à 11:41, Yibo Cai a écrit :

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



Tested this PR on Arm64. It improves speed a lot, but still quite slow for 
clang built code.
For benchmark "ModeKernelNarrow/1048576/1", run time as below
- clang: master branch - 90s,   apply PR-8879 - 55s
- gcc:   master branch - 2.5s,  apply PR-8879 - 1.5s


Is it using GNU libstdc++ or clang's libc++?
If the latter, perhaps the Bernouilli implementation is very bad?



Both using same libstdc++.
But std::bernoulli_distribution is inlined, so they are indeed different for 
clang and gcc.
https://godbolt.org/z/aT84x5Yec
Looks a pure compiler thing.


Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Antoine Pitrou




Le 21/04/2021 à 11:41, Yibo Cai a écrit :

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



Tested this PR on Arm64. It improves speed a lot, but still quite slow for 
clang built code.
For benchmark "ModeKernelNarrow/1048576/1", run time as below
- clang: master branch - 90s,   apply PR-8879 - 55s
- gcc:   master branch - 2.5s,  apply PR-8879 - 1.5s


Is it using GNU libstdc++ or clang's libc++?
If the latter, perhaps the Bernouilli implementation is very bad?



Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Yibo Cai

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



Tested this PR on Arm64. It improves speed a lot, but still quite slow for 
clang built code.
For benchmark "ModeKernelNarrow/1048576/1", run time as below
- clang: master branch - 90s,   apply PR-8879 - 55s
- gcc:   master branch - 2.5s,  apply PR-8879 - 1.5s


Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Antoine Pitrou



Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



[C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Yibo Cai

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish.
I find almost all the time is spent in generating random bits (prepare test 
data)[1], not the test itself.

Below sample code is to show the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, 
very bad.
This issue does not happen on Apple M1, with apple clang-12 arm64 compiler.
On x86, clang random engine is also much slower than gcc built, but the gap is 
much smaller.

As std::default_random_engine is implementation defined[2], I think the 
performance (randomness, speed) is not determinate.
Maybe there are better ways to generate random bits?

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/random.cc#L101-L112
[2] https://en.cppreference.com/w/cpp/numeric/random

#include 
int main() {
  std::default_random_engine rng(42);
  std::bernoulli_distribution d(0.25);

  int s = 0;
  for (int i = 0; i < 8 * 1024 * 1024; ++i) {
s += d(rng);
  }

  return s;
}