http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51078
--- Comment #11 from Grygoriy Fuchedzhy <grygoriy.fuchedzhy at gmail dot com> 2011-11-11 21:53:50 UTC ---
I've tried different optimization options:
1. -march=native -O2
2. -march=native -O2 -funroll-loops
3. -march=native -O2 -funroll-all-loops
4. -march=native -O3
5. -march=native -O3 -funroll-loops
6. -march=native -O3 -funroll-all-loops

and got the following ordering, from fastest to slowest: 5, 6, 1, 4, 2, 3.

You can see that code with automatic loop unrolling sometimes performs worse than code without it. Variants 2 and 3 give results 1.5 times worse than variant 1. Variants 5 and 6 are better than variant 1, but manual loop unrolling still performs better (by more than 25%).

I've also tried changing the number of manually unrolled iterations from 2 to 8 and got the best performance for 4 and 5 on a Core 2 CPU.
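
For readers who don't have the test case at hand, here is a minimal sketch of what "manual loop unrolling" means in this context. It is not the code from this report: the function names sum_simple and sum_unrolled4, the summation workload, and the unroll factor of 4 are all assumptions chosen for illustration only.

```c
#include <stddef.h>

/* Baseline loop: one element per iteration.
 * Hypothetical workload, not the test case from this bug report. */
long sum_simple(const int *a, size_t n)
{
    long s = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* The same loop unrolled by hand, four elements per iteration.
 * The factor 4 is an illustrative assumption. Separate accumulators
 * keep the four additions independent of each other. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long s;
    size_t i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    s = s0 + s1 + s2 + s3;

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; ++i)
        s += a[i];
    return s;
}
```

Building the baseline with each of the six option sets above and comparing it against the hand-unrolled version is roughly the kind of comparison described in this comment; actual timings will depend on the loop body and the CPU.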