I need to be able to do unaligned accesses to memory holding
big-endian or little-endian data.  For portability, I'd like to do it
in pure C, but I'd like the compiler to generate optimal sequences for
the operations. Most CPUs that I know of even have special
instructions designed to speed up part or all of these operations.
So I'm looking for ways of writing these to-be-inlined elemental
functions in C that gcc will recognize as such, while still working
correctly, if more slowly, for other compilers.
I've done my tests with gcc 4.1.
For byteswapping, I only found an idiom for 16 bits:
unsigned short swap_16(unsigned short v)
{
    return (v >> 8) | (v << 8);
}
gives (i386, -O9 -fomit-frame-pointer):
movl 4(%esp), %eax
rolw $8, %ax
movzwl %ax, %eax
ret
which is excellent (the movzwl may be extra, but it probably goes away
inside a larger routine).
For 32 bits, though, the nearest equivalent is only halfway there:
unsigned int swap_32(unsigned int v)
{
    v = ((v & 0x00ff00ffU) << 8) | ((v & 0xff00ff00U) >> 8);
    v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
    return v;
}
movl 4(%esp), %edx
movl %edx, %eax
andl $16711935, %eax
sall $8, %eax
andl $-16711936, %edx
shrl $8, %edx
orl %edx, %eax
roll $16, %eax
ret
The roll is nice, but it's a tad verbose for what should be a simple
bswap.
The obvious mask-and-shift version:
unsigned int swap_32(unsigned int v)
{
    return
        ((v & 0x000000ffU) << 24) |
        ((v & 0x0000ff00U) <<  8) |
        ((v & 0x00ff0000U) >>  8) |
        ((v & 0xff000000U) >> 24);
}
is catastrophic:
movl 4(%esp), %ecx
movl %ecx, %eax
sall $24, %eax
movl %ecx, %edx
andl $65280, %edx
sall $8, %edx
orl %edx, %eax
movl %ecx, %edx
andl $16711680, %edx
shrl $8, %edx
shrl $24, %ecx
orl %ecx, %edx
orl %edx, %eax
ret
Also, I wasn't able to find any way to do host-endian-independent,
fixed-endian unaligned memory accesses.  For instance:
unsigned short read_16_le(const unsigned char *adr)
{
    return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr)
{
    return (adr[0] << 8) | adr[1];
}
gives:
read_16_le:
movl 4(%esp), %edx
movzbw (%edx), %ax
movzbl 1(%edx), %edx
sall $8, %edx
orl %edx, %eax
movzwl %ax, %eax
ret
read_16_be:
movl 4(%esp), %eax
movzbl (%eax), %edx
sall $8, %edx
movzbw 1(%eax), %ax
orl %edx, %eax
movzwl %ax, %eax
ret
which for a processor that does fast unaligned accesses in hardware
is catastrophic and register-expensive.  The Linux kernel uses
conversions to pointers to packed structures, but that's both
host-endian-dependent and gcc-dependent, which is annoying.
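The packed-structure trick looks roughly like this (a sketch with
made-up names, not the actual kernel code; __attribute__((packed)) is
gcc-specific, and the result is in host byte order, so a swap is
still needed on the "wrong" host):
/* the packed struct tells the compiler the access may be unaligned;
   it then emits byte loads or a plain load depending on the target */
struct una_u16 { unsigned short x; } __attribute__((packed));
unsigned short read_16_native(const unsigned char *adr)
{
    return ((const struct una_u16 *)adr)->x;
}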
So: is this idea of specific, correct C constructs that the compiler
recognizes reasonable?  Are there other such idioms that work, in
addition to the swap_16 one I found?  And would work towards adding
some more be considered acceptable?
OG.