https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66741
--- Comment #7 from Bernhard Reutner-Fischer <aldot at gcc dot gnu.org> ---
Folding tolower (and toupper while at it) gives:

for i in 0 1 2;do
  gcc -o tolower_strcpy-$i tolower_strcpy-$i.c -Ofast -W -Wall -Wextra -pedantic -DMAIN -msse4.2
done

/tmp/inp is 200MB of random binary data (flash video).

gcc (Debian 4.9.2-22) 4.9.2:

for j in 0 1 2;do echo "# tolower_strcpy-$j"; time for i in $(seq 1 10);do ./tolower_strcpy-$j < /tmp/inp;done;done

# tolower_strcpy-0
real 0m6.237s
user 0m3.268s
sys 0m2.956s
# tolower_strcpy-1
real 0m7.776s
user 0m4.896s
sys 0m2.856s
# tolower_strcpy-2
real 0m3.578s
user 0m0.760s
sys 0m2.800s

gcc-5 (Debian 5.1.1-12) 5.1.1 20150622:

# tolower_strcpy-0
real 0m6.061s
user 0m3.196s
sys 0m2.856s
# tolower_strcpy-1
real 0m7.737s
user 0m4.872s
sys 0m2.844s
# tolower_strcpy-2
real 0m3.562s
user 0m0.708s
sys 0m2.840s

gcc (GCC) 6.0.0 20150703 (experimental) [sibcall-elim revision 9e8ac6a:b79ca3b:17a501138f5bad51638cd4bbb290dffd9978b706] + fold_builtin_tolower:

# tolower_strcpy-0
real 0m6.019s
user 0m3.148s
sys 0m2.852s
# tolower_strcpy-1
real 0m5.360s
user 0m2.480s
sys 0m2.856s
# tolower_strcpy-2
real 0m3.559s
user 0m0.776s
sys 0m2.764s

But it's a complete mystery how to generate any sensible SSE4.x code for tolower_strcpy-0. I had hoped that the loops would magically end up using some fast SSE4.2 instructions, but is -Ofast -msse4.2 not enough?
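
For reference, here is the kind of loop shape I would expect the vectorizer to have a chance with. This is only a hypothetical sketch and is not taken from any of the tolower_strcpy-*.c test files: the names ascii_tolower and ascii_tolower_copy are made up, it assumes ASCII-only semantics (so it is not a drop-in for the locale-aware tolower()), and it assumes the length is known up front instead of stopping at the terminating NUL. As far as I know, a data-dependent loop exit like the strcpy-style NUL check is one of the things that keeps the loop vectorizer away, independent of -msse4.2. It needs C99 for restrict and the loop declaration.

```c
#include <stddef.h>

/* Hypothetical helper (assumption, not from the test files):
   lowercase one byte, ASCII only, no locale table, no branch. */
static inline unsigned char ascii_tolower(unsigned char c)
{
    /* (unsigned char)(c - 'A') < 26 holds only for 'A'..'Z';
       in that case set bit 0x20, otherwise leave c unchanged. */
    return (unsigned char)(c | (((unsigned char)(c - 'A') < 26) << 5));
}

/* Copy exactly n bytes from src to dst, lowercasing ASCII letters.
   The trip count is known on entry and the buffers do not alias,
   which is the shape a loop vectorizer generally wants. */
void ascii_tolower_copy(char *restrict dst, const char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (char)ascii_tolower((unsigned char)src[i]);
}
```

Whether a given GCC actually vectorizes this can be checked with -fopt-info-vec (or -ftree-vectorizer-verbose on older releases) rather than guessed from the timings.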