Consider this function:

  unsigned long long x(unsigned long long l) { return l >> 4; }
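
For reference, a minimal sketch of what the shift means on a 32-bit target, where the 64-bit value lives in two 32-bit registers. The function names shift_native and shift_by_hand are hypothetical and do not come from the benchmark program. The C below only illustrates the semantics of the "by hand" variant (separate shifts plus an OR instead of one double-precision shift); it does not guarantee which instructions a given compiler will emit for it.

```c
#include <stdint.h>

/* Reference: the 64-bit shift as written in the report. */
uint64_t shift_native(uint64_t l)
{
    return l >> 4;
}

/* Hypothetical "by hand" variant working on two 32-bit halves. */
uint64_t shift_by_hand(uint64_t l)
{
    uint32_t lo = (uint32_t)l;          /* low word  */
    uint32_t hi = (uint32_t)(l >> 32);  /* high word */

    /* The low word receives the 4 bits shifted out of the high word. */
    lo = (lo >> 4) | (hi << 28);
    /* The high word is shifted on its own. */
    hi >>= 4;

    return ((uint64_t)hi << 32) | lo;
}
```

Both functions return the same value for every input; the difference the report is about lies only in the instruction sequence used to get there.
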

gcc will use the shrd instruction here, which is much slower than doing the shift "by hand" on at least Athlon, Pentium 3, and VIA C3. On Core 2, shrd appears to be faster.

On my Athlon 64, I measured 350 cycles vs 441 for a loop of 100. On my Core 2, I measured 672 cycles vs 624.

So my suggestion is: if -march= is set to Pentium 3 or a non-Intel CPU, don't use the shrd and shrl sequence.

My benchmark program is at http://dl.fefe.de/shrd.c

-- 
           Summary: gcc generates suboptimal code for long long shifts
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: felix-gcc at fefe dot de
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33716