I need to be able to do unaligned memory accesses to big-endian or little-endian data. For portability, I'd like to write them in pure C, but I'd also like the compiler to generate optimal sequences for these operations. Most CPUs that I know of even have special instructions designed to speed up part or all of them.
So I'm looking for ways of writing these to-be-inlined elemental functions in C that gcc will recognize as such, while still working correctly, if more slowly, with other compilers. I've done my tests with gcc 4.1.

For byteswapping, I only found an idiom for 16 bits:

    unsigned short swap_16(unsigned short v)
    {
        return (v >> 8) | (v << 8);
    }

gives (i386, -O9 -fomit-frame-pointer):

    movl 4(%esp), %eax
    rolw $8, %ax
    movzwl %ax, %eax
    ret

which is excellent (the movzwl may be extra, but it probably goes away once inlined into a larger routine).

For 32 bits though, the nearest equivalent is only halfway there:

    unsigned int swap_32(unsigned int v)
    {
        v = ((v & 0x00ff00ffU) << 8) | ((v & 0xff00ff00U) >> 8);
        v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
        return v;
    }

    movl 4(%esp), %edx
    movl %edx, %eax
    andl $16711935, %eax
    sall $8, %eax
    andl $-16711936, %edx
    shrl $8, %edx
    orl %edx, %eax
    roll $16, %eax
    ret

The roll is nice, but it's a tad verbose for what should be a single bswap. The obvious mask-and-shift version:

    unsigned int swap_32(unsigned int v)
    {
        return ((v & 0x000000ffU) << 24) |
               ((v & 0x0000ff00U) << 8)  |
               ((v & 0x00ff0000U) >> 8)  |
               ((v & 0xff000000U) >> 24);
    }

is catastrophic:

    movl 4(%esp), %ecx
    movl %ecx, %eax
    sall $24, %eax
    movl %ecx, %edx
    andl $65280, %edx
    sall $8, %edx
    orl %edx, %eax
    movl %ecx, %edx
    andl $16711680, %edx
    shrl $8, %edx
    shrl $24, %ecx
    orl %ecx, %edx
    orl %edx, %eax
    ret

Also, I wasn't able to find any way to do host-endian-independent, fixed-endian unaligned memory accesses.
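As an aside, one way to hedge against compilers that don't recognize the idiom is to dispatch on the compiler: later gcc releases (4.3 and up, if I'm not mistaken) grew a __builtin_bswap32, so a portable wrapper can use it when available and fall back to the shift-and-mask idiom above otherwise. This is only a sketch of the structure, not something I've benchmarked on 4.1:

```c
#include <stdint.h>

/* Sketch: use the compiler's byteswap builtin when it is advertised
   (gcc >= 4.3 is an assumption here), otherwise fall back to the
   portable shift-and-mask idiom from the text above. */
static inline uint32_t my_bswap32(uint32_t v)
{
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
    return __builtin_bswap32(v);
#else
    /* Swap bytes within each 16-bit half, then swap the halves. */
    v = ((v & 0x00ff00ffU) << 8) | ((v & 0xff00ff00U) >> 8);
    return (v << 16) | (v >> 16);
#endif
}
```

Either path yields the same result, so correctness doesn't depend on which branch the preprocessor picks.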
For instance:

    unsigned short read_16_le(const unsigned char *adr)
    {
        return adr[0] | (adr[1] << 8);
    }

    unsigned short read_16_be(const unsigned char *adr)
    {
        return (adr[0] << 8) | adr[1];
    }

gives:

    read_16_le:
        movl 4(%esp), %edx
        movzbw (%edx), %ax
        movzbl 1(%edx), %edx
        sall $8, %edx
        orl %edx, %eax
        movzwl %ax, %eax
        ret
    read_16_be:
        movl 4(%esp), %eax
        movzbl (%eax), %edx
        sall $8, %edx
        movzbw 1(%eax), %ax
        orl %edx, %eax
        movzwl %ax, %eax
        ret

which, for a processor that does fast unaligned accesses in hardware, is catastrophic and register-expensive. The Linux kernel uses casts to pointers to packed structures, but that's both host-endian-dependent and gcc-dependent, which is annoying.

So: is this idea of specific, correct C code structures that the compiler knows about reasonable? Are there other such idioms that already work, in addition to the swap_16 one I found? And would work towards adding some more be considered acceptable?

  OG.
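For completeness, here are the 32-bit analogues of the read_16_* functions above, written purely as byte composition so the result is fixed-endian regardless of the host. Whether a given compiler collapses these into a single unaligned load (plus a bswap for the foreign-endian case) is exactly the kind of recognition I'm asking about; I haven't verified that 4.1 does:

```c
#include <stdint.h>

/* Sketch: fixed-endian 32-bit reads from a possibly unaligned
   address, built by composing bytes so the result does not depend
   on the host's byte order. */
static inline uint32_t read_32_le(const unsigned char *adr)
{
    return (uint32_t)adr[0]
         | ((uint32_t)adr[1] << 8)
         | ((uint32_t)adr[2] << 16)
         | ((uint32_t)adr[3] << 24);
}

static inline uint32_t read_32_be(const unsigned char *adr)
{
    return ((uint32_t)adr[0] << 24)
         | ((uint32_t)adr[1] << 16)
         | ((uint32_t)adr[2] << 8)
         |  (uint32_t)adr[3];
}
```

Note the casts before shifting: without them, adr[3] << 24 is done in (signed) int and can overflow on 32-bit int targets, which is one more trap these elemental functions should get right once, centrally.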