https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83068
Bug ID: 83068
Summary: Suboptimal code generated with -m32 using MMX reg
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bradfier at fstab dot me
Target Milestone: ---

Follow-up from the ML post here: [1]

I tried compiling the following simple function with different -march flags in -m32 mode:

> uint64_t sum(uint64_t a, uint64_t b) {
>     return a + b;
> }

Using g++ -m32 -O2 the generated ASM is the following:

> # 64-m32-example.cpp:6: return a + b;
>     mov     eax, DWORD PTR [esp+12]  # b, b
>     add     eax, DWORD PTR [esp+4]   # tmp90, a
>     mov     edx, DWORD PTR [esp+16]  # b, b
>     adc     edx, DWORD PTR [esp+8]   #, a
> # 64-m32-example.cpp:7: }
>     ret

However, when I compile with -m32 -O2 -march=broadwell (or -march=native; my processor is a Skylake part) I get the following code instead:

>     vmovq   xmm1, QWORD PTR [esp+12] # b, b
> # 64-m32-example.cpp:6: return a + b;
>     vmovq   xmm0, QWORD PTR [esp+4]  # tmp92, a
>     vpaddq  xmm0, xmm0, xmm1         # tmp90, tmp92, b
>     vmovd   eax, xmm0                # tmp93, tmp90
>     vpextrd edx, xmm0, 1             # tmp94, tmp90,
> # 64-m32-example.cpp:7: }
>     ret

This seems to be generated for any processor type where MMX is available, although I have not tested exhaustively. I thought this looked suspect, so I ran a benchmark using Hayai.
For the code using regular mov and add instructions, a 'run' is 10,000 iterations:

----------
Run Times: (1 run = 10,000 iterations)
Average time:  0.006 us (~0.095 us)
Fastest time:  0.000 us (-0.006 us / -100.000 %)
Slowest time:  3.958 us (+3.952 us / +68209.689 %)
Median time:   0.000 us (1st quartile: 0.000 us | 3rd quartile: 0.000 us)

Average performance: 172586379.48293 runs/s
Best performance:    inf runs/s (+inf runs/s / +inf %)
Worst performance:   252652.85498 runs/s (-172333726.62795 runs/s / -99.85361 %)
Median performance:  inf runs/s (1st quartile: inf | 3rd quartile: inf)
----------

I do wonder if these numbers are suspect; they seem too fast even for such a simple function, but I don't know enough about Intel's out-of-order engine to be sure what's going on. What is clear is that the code using MMX and vector instructions is much slower:

----------
Run Times: (1 run = 10,000 iterations)
Average time:  24.901 us (~1.144 us)
Fastest time:  23.867 us (-1.034 us / -4.153 %)
Slowest time:  61.867 us (+36.966 us / +148.451 %)
Median time:   24.867 us (1st quartile: 24.867 us | 3rd quartile: 24.867 us)

Average performance: 40158.86848 runs/s
Best performance:    41898.85616 runs/s (+1739.98768 runs/s / +4.33276 %)
Worst performance:   16163.70601 runs/s (-23995.16247 runs/s / -59.75059 %)
Median performance:  40213.93815 runs/s (1st quartile: 40213.93815 | 3rd quartile: 40213.93815)
----------

If this is a genuine regression I can look into where it's coming from; I have my eye on dimode_scalar_chain::compute_convert_gain, but I'll keep digging for now.

Thanks,
Richard

[1]: https://gcc.gnu.org/ml/gcc/2017-11/msg00128.html