I need to be able to do unaligned accesses to memory stored in
big-endian or little-endian order.  For portability, I'd like to do it
in pure C, but I'd also like the compiler to generate optimal sequences
for the operations.  Most CPUs that I know of even have special
instructions designed to speed up part or all of these operations.

So I'm looking for ways of writing these to-be-inlined elemental
functions in C that gcc will recognize and optimize, while still
working correctly, if more slowly, with other compilers.

I've done my tests with gcc 4.1.

For byteswapping, I only found an idiom for 16 bits:

unsigned short swap_16(unsigned short v)
{
  return (v>>8) | (v<<8);
}

gives (i386, -O9 -fomit-frame-pointer):
        movl    4(%esp), %eax
        rolw    $8, %ax
        movzwl  %ax, %eax
        ret

which is excellent (the movzwl may be extra, but it probably goes away
inside a larger routine).
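
To check that the movzwl really does disappear, one could mark the
function inline and look at a small caller, something like this (just a
test-harness sketch, not part of the proposal):

static inline unsigned short swap_16_inline(unsigned short v)
{
  return (v>>8) | (v<<8);
}

/* Hypothetical caller: when the rolw result feeds another 16-bit
   operation directly, the zero-extension should not be needed. */
unsigned short swap_both(unsigned short a, unsigned short b)
{
  return swap_16_inline(a) ^ swap_16_inline(b);
}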

For 32 bits, though, the nearest equivalent is only halfway there:

unsigned int swap_32(unsigned int v)
{
  v = ((v & 0x00ff00ffU) << 8)  | ((v & 0xff00ff00U) >> 8);
  v = ((v & 0x0000ffffU) << 16) | ((v & 0xffff0000U) >> 16);
  return v;
}

        movl    4(%esp), %edx
        movl    %edx, %eax
        andl    $16711935, %eax
        sall    $8, %eax
        andl    $-16711936, %edx
        shrl    $8, %edx
        orl     %edx, %eax
        roll    $16, %eax
        ret

The roll is nice, but it's a tad verbose for what should be a simple
bswap.
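
For reference, what I mean by "a simple bswap" can be written with gcc
inline asm, but that's i386-only and compiler-specific, i.e. exactly
what I'm trying to avoid (sketch only):

static inline unsigned int swap_32_asm(unsigned int v)
{
  /* i386 (486+) only, gcc extended asm; not a portable solution. */
  __asm__("bswap %0" : "+r" (v));
  return v;
}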

The obvious mask-and-shift version:

unsigned int swap_32(unsigned int v)
{
  return
    ((v & 0x000000ffU) << 24) |
    ((v & 0x0000ff00U) << 8)  |
    ((v & 0x00ff0000U) >> 8)  |
    ((v & 0xff000000U) >> 24);
}

is catastrophic:
        movl    4(%esp), %ecx
        movl    %ecx, %eax
        sall    $24, %eax
        movl    %ecx, %edx
        andl    $65280, %edx
        sall    $8, %edx
        orl     %edx, %eax
        movl    %ecx, %edx
        andl    $16711680, %edx
        shrl    $8, %edx
        shrl    $24, %ecx
        orl     %ecx, %edx
        orl     %edx, %eax
        ret
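
Another spelling that comes to mind is to build the 32-bit swap out of
the swap_16 idiom above; I don't know whether gcc does any better with
it (untested sketch):

unsigned int swap_32_from_16(unsigned int v)
{
  /* Swap the bytes within each halfword, then swap the halfwords. */
  unsigned short lo = swap_16(v);
  unsigned short hi = swap_16(v >> 16);
  return ((unsigned int)lo << 16) | hi;
}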


Also, I wasn't able to find any way to do host-endian-independent,
fixed-endian unaligned memory accesses.  For instance:

unsigned short read_16_le(const unsigned char *adr)
{
  return adr[0] | (adr[1] << 8);
}

unsigned short read_16_be(const unsigned char *adr)
{
  return (adr[0] << 8) | adr[1];
}

gives:
read_16_le:
        movl    4(%esp), %edx
        movzbw  (%edx), %ax
        movzbl  1(%edx), %edx
        sall    $8, %edx
        orl     %edx, %eax
        movzwl  %ax, %eax
        ret

read_16_be:
        movl    4(%esp), %eax
        movzbl  (%eax), %edx
        sall    $8, %edx
        movzbw  1(%eax), %ax
        orl     %edx, %eax
        movzwl  %ax, %eax
        ret

which for a processor that does fast unaligned accesses in hardware is
catastrophic and register-expensive.  The Linux kernel uses conversions
to pointers to packed structures, but that's both host-endian-dependent
and gcc-dependent, which is annoying.
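
For reference, the packed-structure trick I mean is roughly this
(gcc-specific, and it reads in host byte order, so it only addresses
the alignment half of the problem):

struct una_u16 { unsigned short v; } __attribute__((packed));

unsigned short read_16_host(const unsigned char *adr)
{
  return ((const struct una_u16 *)adr)->v;
}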


So, is this idea of specific, correct C code structures that the
compiler knows about a reasonable one?  Are there others that already
work besides the swap_16 one I found?  And would work towards adding
some more be considered acceptable?

  OG.
