> GCC seems to be unable to propagate constants across calls to htonl.
> So it turns out to be worthwhile to replace htonl with a hand-written
> macro in the case of a constant parameter.
I'm wondering why this helps you.  On my system (which has Debian's old glibc 2.3.6, certainly nothing particularly fancy), I see in my <netinet/in.h>:

    /* Get machine dependent optimized versions of byte swapping functions.  */
    #include <bits/byteswap.h>

    #ifdef __OPTIMIZE__
    /* We can optimize calls to the conversion functions.  Either nothing has
       to be done or we are using directly the byte-swapping functions which
       often can be inlined.  */
    # if __BYTE_ORDER == __BIG_ENDIAN
    //...
    # else
    #  if __BYTE_ORDER == __LITTLE_ENDIAN
    #   define ntohl(x)     __bswap_32 (x)

and so on (and gcc defines __OPTIMIZE__ if you pass it any -O flag, including -Os).  And in <bits/byteswap.h> I have

    /* Swap bytes in 32 bit value.  */
    #define __bswap_constant_32(x) \
         ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >>  8) | \
          (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))

and variations of __bswap_32() that look roughly like

    # define __bswap_32(x) \
         (__extension__ \
          ({ register unsigned int __v, __x = (x); \
             if (__builtin_constant_p (__x)) \
               __v = __bswap_constant_32 (__x); \
             else \

and so on.  (The point of all this being that for constants, htonl() should expand to roughly the same thing as your CONSTANT_HTONL() -- the only difference is that you don't have the & for the << 24 and >> 24 parts, which I guess just has the potential to bite us if someone did something like CONSTANT_HTONL(1L) on a 64-bit system.)

As a quick test I compiled the code

    #include <netinet/in.h>

    enum { Y = 5 };

    uint32_t foo(uint32_t x)
    {
            return x | htonl(Y);
    }

with gcc -c -O, and the disassembly of foo() looks like

    0000000000000000 <foo>:
       0:   89 f8                   mov    %edi,%eax
       2:   0d 00 00 00 05          or     $0x5000000,%eax
       7:   c3                      retq

and so everything works exactly the way we would want.  (32-bit i386 also just does an or with a constant.)

In fact, for libmthca I just checked that the preprocessor output of places like the following (which your patch converts)

    ((wr->send_flags & IBV_SEND_SIGNALED) ?
     htonl(MTHCA_NEXT_CQ_UPDATE) : 0) |

is

    ((wr->send_flags & IBV_SEND_SIGNALED) ?
     (__extension__ ({ register unsigned int __v, __x = (MTHCA_NEXT_CQ_UPDATE);
        if (__builtin_constant_p (__x))
          __v = ((((__x) & 0xff000000) >> 24) | (((__x) & 0x00ff0000) >> 8) |
                 (((__x) & 0x0000ff00) << 8) | (((__x) & 0x000000ff) << 24));
        else
          __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
        __v; })) : 0) |

And if I compare the generated assembly for libmthca with and without your patch (on both x86-64 and i386), I don't see any significant difference: the size is exactly the same, and I just see trivial things like the compiler using eax and edx in the opposite order.

So what is different in your setup that causes this patch to make a difference for you?

(BTW, one thing I did notice while looking at the i386 assembly: a micro-optimization that might make sense is to use something like __attribute__((regparm(3))) for internal function calls within libibverbs and libmthca on i386, since otherwise we waste instructions pushing stuff on the stack for no reason other than compliance with the crufty old i386 ABI.  Something like a FASTCALL macro in <infiniband/arch.h>, perhaps... if anyone really cares about 32-bit i386 performance any more.)

 - R.

_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general