https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91940
Bug ID: 91940
Summary: __builtin_bswap16 loop optimization
Product: gcc
Version: 9.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: matwey.kornilov at gmail dot com
Target Milestone: ---
Created attachment 46984
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46984&action=edit
code snippet
Hello,
I am using "gcc (SUSE Linux) 9.2.1 20190903 [gcc-9-branch revision 275330]" and
I see the following performance issue with the __builtin_bswap16() on x86_64
platform.
Attached is a code sample implementing byte swapping for arrays of 2-byte
words.
I see that the following code (when compiled with -O3)

inline void swab_bi(const void* from, void* to, std::size_t size) {
    const auto begin = reinterpret_cast<const std::uint16_t*>(from);
    const auto end = reinterpret_cast<const std::uint16_t*>(
        reinterpret_cast<const std::uint8_t*>(from) + size);
    auto out = reinterpret_cast<std::uint16_t*>(to);
    for (auto it = begin; it != end; ++it) {
        *(out++) = __builtin_bswap16(*it);
    }
}
takes 0.023 sec on average to execute on my hardware (Intel Core i5).
While the following code

inline void swab(const void* from, void* to, std::size_t size) {
    const auto begin = reinterpret_cast<const std::uint16_t*>(from);
    const auto end = reinterpret_cast<const std::uint16_t*>(
        reinterpret_cast<const std::uint8_t*>(from) + size);
    auto out = reinterpret_cast<std::uint16_t*>(to);
    for (auto it = begin; it != end; ++it) {
        *(out++) = ((*it & 0xFF) << 8) | ((*it & 0xFF00) >> 8);
    }
}
is *more* efficient: it takes only 0.011 sec.
When I dump the assembler output for both functions, I see that packed (SSE)
instructions are used for the latter case:
movdqu 0(%rbp,%rax), %xmm0
movdqa %xmm0, %xmm1
psllw $8, %xmm0
psrlw $8, %xmm1
por %xmm1, %xmm0
movups %xmm0, (%r12,%rax)
addq $16, %rax
while a scalar rolw loop is used for the former case:
movzwl 0(%rbp,%rax), %edx
rolw $8, %dx
movw %dx, (%r12,%rax)
addq $2, %rax