https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66741

--- Comment #7 from Bernhard Reutner-Fischer <aldot at gcc dot gnu.org> ---
folding tolower (and toupper while at it) gives:

for i in 0 1 2;do
gcc -o tolower_strcpy-$i tolower_strcpy-$i.c -Ofast -W -Wall -Wextra -pedantic \
  -DMAIN -msse4.2
done

/tmp/inp is 200MB random binary data (flash video)
gcc (Debian 4.9.2-22) 4.9.2:
for j in 0 1 2;do echo "# tolower_strcpy-$j"; time for i in $(seq 1 10);do
./tolower_strcpy-$j < /tmp/inp;done;done
# tolower_strcpy-0

real    0m6.237s
user    0m3.268s
sys     0m2.956s
# tolower_strcpy-1

real    0m7.776s
user    0m4.896s
sys     0m2.856s
# tolower_strcpy-2

real    0m3.578s
user    0m0.760s
sys     0m2.800s


gcc-5 (Debian 5.1.1-12) 5.1.1 20150622
# tolower_strcpy-0

real    0m6.061s
user    0m3.196s
sys     0m2.856s
# tolower_strcpy-1

real    0m7.737s
user    0m4.872s
sys     0m2.844s
# tolower_strcpy-2

real    0m3.562s
user    0m0.708s
sys     0m2.840s

gcc (GCC) 6.0.0 20150703 (experimental) [sibcall-elim revision
9e8ac6a:b79ca3b:17a501138f5bad51638cd4bbb290dffd9978b706] +
fold_builtin_tolower

# tolower_strcpy-0

real    0m6.019s
user    0m3.148s
sys     0m2.852s
# tolower_strcpy-1

real    0m5.360s
user    0m2.480s
sys     0m2.856s
# tolower_strcpy-2

real    0m3.559s
user    0m0.776s
sys     0m2.764s

But it's a complete mystery to me how to get any sensible SSE4.x code generated
for tolower_strcpy-0. I had hoped that the loops would magically end up using
some fast SSE4.2 instructions, but -Ofast -msse4.2 is apparently not enough?
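
For reference, a minimal sketch of the kind of loop that GCC's auto-vectorizer
can usually handle. This is NOT one of the tolower_strcpy-N.c test files (they
are not shown here); the names ascii_tolower and tolower_copy are hypothetical,
and the code assumes plain ASCII / the "C" locale and an explicit length rather
than a NUL-terminated string. The point is that the loop body has no library
call and no data-dependent branch, and the trip count is known on entry.

```c
#include <stddef.h>

/* Hypothetical helper, not from the bug report: lowercase one byte,
   assuming ASCII / the "C" locale, without calling tolower(). */
static inline unsigned char ascii_tolower(unsigned char c)
{
    /* c - 'A' wraps around for c < 'A', so a single unsigned compare
       checks both ends of the range 'A'..'Z'. */
    unsigned char is_upper = (unsigned char)(c - 'A') < 26;

    /* 'a' - 'A' == 32 == 1 << 5; add it only for upper-case letters. */
    return (unsigned char)(c + (is_upper << 5));
}

/* Copy n bytes from src to dst, lowercasing ASCII letters on the way.
   restrict (C99) tells the compiler the buffers do not overlap. */
void tolower_copy(unsigned char *restrict dst,
                  const unsigned char *restrict src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = ascii_tolower(src[i]);
}
```

Building this with -O3 (or -Ofast) and -fopt-info-vec should report whether the
loop was vectorized. A loop that instead stops at the terminating NUL has a
data-dependent exit, which is likely harder for the vectorizer to deal with.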
