https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83068
Bug ID: 83068
Summary: Suboptimal code generated with -m32 using MMX reg
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bradfier at fstab dot me
Target Milestone: ---

Follow-up from the ML post here: [1]

I tried compiling the following simple function with different -march flags in -m32 mode:

> uint64_t sum(uint64_t a, uint64_t b) {
>     return a + b;
> }

Using g++ -m32 -O2 the generated ASM is the following:

> # 64-m32-example.cpp:6: return a + b;
>     mov     eax, DWORD PTR [esp+12]  # b, b
>     add     eax, DWORD PTR [esp+4]   # tmp90, a
>     mov     edx, DWORD PTR [esp+16]  # b, b
>     adc     edx, DWORD PTR [esp+8]   #, a
> # 64-m32-example.cpp:7: }
>     ret

However, when I compile with -m32 -O2 -march=broadwell (or -march=native; my processor is a Skylake part) I get the following code instead:

>     vmovq   xmm1, QWORD PTR [esp+12] # b, b
> # 64-m32-example.cpp:6: return a + b;
>     vmovq   xmm0, QWORD PTR [esp+4]  # tmp92, a
>     vpaddq  xmm0, xmm0, xmm1         # tmp90, tmp92, b
>     vmovd   eax, xmm0                # tmp93, tmp90
>     vpextrd edx, xmm0, 1             # tmp94, tmp90,
> # 64-m32-example.cpp:7: }
>     ret

This seems to be generated for any processor type where MMX is available, although I have not tested exhaustively. I thought this looked suspect, so I ran a benchmark using Hayai.
For the code using regular mov and add instructions, a 'run' is 10,000 iterations:

----------
Run Times: (1 run = 10,000 iterations)
Average time:  0.006 us (~0.095 us)
Fastest time:  0.000 us (-0.006 us / -100.000 %)
Slowest time:  3.958 us (+3.952 us / +68209.689 %)
Median time:   0.000 us (1st quartile: 0.000 us | 3rd quartile: 0.000 us)

Average performance: 172586379.48293 runs/s
Best performance:    inf runs/s (+inf runs/s / +inf %)
Worst performance:   252652.85498 runs/s (-172333726.62795 runs/s / -99.85361 %)
Median performance:  inf runs/s (1st quartile: inf | 3rd quartile: inf)
----------

I do wonder if these numbers are suspect; they seem too fast even for such a simple function, but I don't know enough about Intel's out-of-order engine to be sure what's going on. What is clear is that the code using MMX and vector instructions is much slower:

----------
Run Times: (1 run = 10,000 iterations)
Average time:  24.901 us (~1.144 us)
Fastest time:  23.867 us (-1.034 us / -4.153 %)
Slowest time:  61.867 us (+36.966 us / +148.451 %)
Median time:   24.867 us (1st quartile: 24.867 us | 3rd quartile: 24.867 us)

Average performance: 40158.86848 runs/s
Best performance:    41898.85616 runs/s (+1739.98768 runs/s / +4.33276 %)
Worst performance:   16163.70601 runs/s (-23995.16247 runs/s / -59.75059 %)
Median performance:  40213.93815 runs/s (1st quartile: 40213.93815 | 3rd quartile: 40213.93815)
----------

If this is a genuine regression I can look into where it's coming from; I have my eye on dimode_scalar_chain::compute_convert_gain, but I'll keep digging for now.

Thanks,
Richard

[1]: https://gcc.gnu.org/ml/gcc/2017-11/msg00128.html