RE: [C++] Indeterminate poor performance of random number generator
Yes, these soft-float math (in libm.so) makes Arm binary extremely slow. -Original Message- From: Antoine Pitrou Sent: Thursday, April 22, 2021 17:20 To: dev@arrow.apache.org Subject: Re: [C++] Indeterminate poor performance of random number generator Le 22/04/2021 à 03:38, Yibo Cai a écrit : > > Both using same libstdc++. > But std::bernoulli_distribution is inlined, so they are indeed different for > clang and gcc. > https://godbolt.org/z/aT84x5Yec > Looks a pure compiler thing. It looks like clang generates calls to logl() and __divtf3() (soft-float long double division) inside the loop. Perhaps that can be avoided by reimplementing the Bernoulli distribution. If we don't care too much about accuracy and extreme probability values (very close to 0 or 1), that should be relatively easy. Regards Antoine. IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Re: [C++] Indeterminate poor performance of random number generator
On 4/22/21 9:38 AM, Yibo Cai wrote: On 4/21/21 6:07 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:41, Yibo Cai a écrit : On 4/21/21 5:17 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1], not the test itself. Below sample code is to show the issue. Tested on Arm64 with clang-10 and gcc-7.5, built with -O3. For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, very bad. This issue does not happen on Apple M1, with apple clang-12 arm64 compiler. On x86, clang random engine is also much slower than gcc built, but the gap is much smaller. As std::default_random_engine is implementation defined[2], I think the performance (randomness, speed) is not determinate. Maybe there are better ways to generate random bits? Can you try out https://github.com/apache/arrow/pull/8879 ? Tested this PR on Arm64. It improves speed a lot, but still quite slow for clang built code. For benchmark "ModeKernelNarrow/1048576/1", run time as below - clang: master branch - 90s, apply PR-8879 - 55s - gcc: master branch - 2.5s, apply PR-8879 - 1.5s Is it using GNU libstdc++ or clang's libc++? If the latter, perhaps the Bernouilli implementation is very bad? Both using same libstdc++. But std::bernoulli_distribution is inlined, so they are indeed different for clang and gcc. https://godbolt.org/z/aT84x5Yec Looks a pure compiler thing. For the record. Clang "-ffast-math" option makes the bernoulli generator 100x faster on Arm64, 10x faster on x86_64. As this is only for generating test bits, looks "-ffast-math" is safe for the purpose. https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math
Re: [C++] Indeterminate poor performance of random number generator
Le 22/04/2021 à 03:38, Yibo Cai a écrit : Both using same libstdc++. But std::bernoulli_distribution is inlined, so they are indeed different for clang and gcc. https://godbolt.org/z/aT84x5Yec Looks a pure compiler thing. It looks like clang generates calls to logl() and __divtf3() (soft-float long double division) inside the loop. Perhaps that can be avoided by reimplementing the Bernoulli distribution. If we don't care too much about accuracy and extreme probability values (very close to 0 or 1), that should be relatively easy. Regards Antoine.
Re: [C++] Indeterminate poor performance of random number generator
On 4/21/21 6:07 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:41, Yibo Cai a écrit : On 4/21/21 5:17 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1], not the test itself. Below sample code is to show the issue. Tested on Arm64 with clang-10 and gcc-7.5, built with -O3. For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, very bad. This issue does not happen on Apple M1, with apple clang-12 arm64 compiler. On x86, clang random engine is also much slower than gcc built, but the gap is much smaller. As std::default_random_engine is implementation defined[2], I think the performance (randomness, speed) is not determinate. Maybe there are better ways to generate random bits? Can you try out https://github.com/apache/arrow/pull/8879 ? Tested this PR on Arm64. It improves speed a lot, but still quite slow for clang built code. For benchmark "ModeKernelNarrow/1048576/1", run time as below - clang: master branch - 90s, apply PR-8879 - 55s - gcc: master branch - 2.5s, apply PR-8879 - 1.5s Is it using GNU libstdc++ or clang's libc++? If the latter, perhaps the Bernouilli implementation is very bad? Both using same libstdc++. But std::bernoulli_distribution is inlined, so they are indeed different for clang and gcc. https://godbolt.org/z/aT84x5Yec Looks a pure compiler thing.
Re: [C++] Indeterminate poor performance of random number generator
Le 21/04/2021 à 11:41, Yibo Cai a écrit : On 4/21/21 5:17 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1], not the test itself. Below sample code is to show the issue. Tested on Arm64 with clang-10 and gcc-7.5, built with -O3. For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, very bad. This issue does not happen on Apple M1, with apple clang-12 arm64 compiler. On x86, clang random engine is also much slower than gcc built, but the gap is much smaller. As std::default_random_engine is implementation defined[2], I think the performance (randomness, speed) is not determinate. Maybe there are better ways to generate random bits? Can you try out https://github.com/apache/arrow/pull/8879 ? Tested this PR on Arm64. It improves speed a lot, but still quite slow for clang built code. For benchmark "ModeKernelNarrow/1048576/1", run time as below - clang: master branch - 90s, apply PR-8879 - 55s - gcc: master branch - 2.5s, apply PR-8879 - 1.5s Is it using GNU libstdc++ or clang's libc++? If the latter, perhaps the Bernouilli implementation is very bad?
Re: [C++] Indeterminate poor performance of random number generator
On 4/21/21 5:17 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1], not the test itself. Below sample code is to show the issue. Tested on Arm64 with clang-10 and gcc-7.5, built with -O3. For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, very bad. This issue does not happen on Apple M1, with apple clang-12 arm64 compiler. On x86, clang random engine is also much slower than gcc built, but the gap is much smaller. As std::default_random_engine is implementation defined[2], I think the performance (randomness, speed) is not determinate. Maybe there are better ways to generate random bits? Can you try out https://github.com/apache/arrow/pull/8879 ? Tested this PR on Arm64. It improves speed a lot, but still quite slow for clang built code. For benchmark "ModeKernelNarrow/1048576/1", run time as below - clang: master branch - 90s, apply PR-8879 - 55s - gcc: master branch - 2.5s, apply PR-8879 - 1.5s
Re: [C++] Indeterminate poor performance of random number generator
Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1], not the test itself. Below sample code is to show the issue. Tested on Arm64 with clang-10 and gcc-7.5, built with -O3. For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, very bad. This issue does not happen on Apple M1, with apple clang-12 arm64 compiler. On x86, clang random engine is also much slower than gcc built, but the gap is much smaller. As std::default_random_engine is implementation defined[2], I think the performance (randomness, speed) is not determinate. Maybe there are better ways to generate random bits? Can you try out https://github.com/apache/arrow/pull/8879 ?
[C++] Indeterminate poor performance of random number generator
When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1], not the test itself. Below sample code is to show the issue. Tested on Arm64 with clang-10 and gcc-7.5, built with -O3. For gcc, the code finished in 0.1s. But for clang, the code finishes in 11s, very bad. This issue does not happen on Apple M1, with apple clang-12 arm64 compiler. On x86, clang random engine is also much slower than gcc built, but the gap is much smaller. As std::default_random_engine is implementation defined[2], I think the performance (randomness, speed) is not determinate. Maybe there are better ways to generate random bits? [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/random.cc#L101-L112 [2] https://en.cppreference.com/w/cpp/numeric/random #include int main() { std::default_random_engine rng(42); std::bernoulli_distribution d(0.25); int s = 0; for (int i = 0; i < 8 * 1024 * 1024; ++i) { s += d(rng); } return s; }