https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820

            Bug ID: 80820
           Summary: _mm_set_epi64x shouldn't store/reload for
                    -mtune=haswell, Zen should avoid store/reload, and
                    generic should think about it.
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

gcc with -mtune=generic likes to bounce through memory when moving data from
integer registers to xmm for things like _mm_set_epi32.

There are 3 related tuning issues here:

* -mtune=haswell -mno-sse4 still uses one store/reload for _mm_set_epi64x.

* -mtune=znver1 should definitely favour movd/movq instead of store/reload.
  (Ryzen has 1 m-op movd/movq between vector and integer with 3c latency,
shorter than store-forwarding.  All the reasons to favour store/reload on other
AMD uarches are gone.)

* -mtune=generic should probably favour movd/movq.  I think it's better for a
weighted-average of CPUs we care about for -mtune=generic.  Most of the text
below is an attempt to back up this claim, but I don't have hardware to test
with so all I can do is look at Agner Fog's tables and microarch pdf.

movd is about break-even on Bulldozer, better on SnB-family, much better on
Core2/Nehalem, and significantly worse only on AMD K8/K10.  Another option is a
hybrid strategy that does half with movd and half with store/reload, which can
actually be better than either strategy alone on Bulldozer and SnB-family.

-----------

The tune=haswell issue is maybe separate from the others, since gcc already
knows that bouncing through memory isn't the optimal strategy.

#include <immintrin.h>
__m128i combine64(long long a, long long b) {
  return _mm_set_epi64x(b,a);
}

gcc8 -O3 -mtune=haswell emits:

        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        movhps  -16(%rsp), %xmm0

(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819 for the wasted store
with -msse4 -mno-avx).


I think what clang and ICC do is optimal for the SSE2-only case, for Intel CPUs
and Ryzen:

        movq    %rsi, %xmm1
        movq    %rdi, %xmm0
        punpcklqdq      %xmm1, %xmm0
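
For testing, that sequence can usually be written directly with intrinsics;
whether a given gcc version keeps it in this form or applies the same
store/reload tuning to it is something I haven't checked, so treat this as a
sketch:

#include <emmintrin.h>
__m128i combine64_manual(long long a, long long b) {
  __m128i lo = _mm_cvtsi64_si128(a);   /* movq %rdi, %xmm0 */
  __m128i hi = _mm_cvtsi64_si128(b);   /* movq %rsi, %xmm1 */
  return _mm_unpacklo_epi64(lo, hi);   /* punpcklqdq %xmm1, %xmm0 */
}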

_mm_set_epi32(d,c,b,a) with -mtune=haswell gives us the expected movd/punpck
(without SSE4), no store/reload.
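
For reference, the 32-bit analogue of combine64 (the later asm snippets in this
report come from something along these lines):

__m128i combine32(int a, int b, int c, int d) {
  return _mm_set_epi32(d,c,b,a);
}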


-----


Using movd or movq instead of a store/reload is a code-size win: movd %eax,
%xmm0 is 4 bytes (or 5 with a REX prefix for movq or high registers). 
Store/reload to -0x10(%rsp) is 10, 11, or 12 bytes, depending on operand size
and high register(s).
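
Spelling out the byte counts (assuming a disp8 %rsp-relative slot like gcc
uses; an extra REX prefix for xmm8-15 / r8-r15 is what gives the 12-byte case):

        movd    %eax, %xmm0          # 66 0F 6E C0          = 4 bytes
        movq    %rax, %xmm0          # 66 48 0F 6E C0       = 5 bytes (REX.W)
vs.
        movl    %eax, -16(%rsp)      # 89 44 24 F0          = 4 bytes
        movd    -16(%rsp), %xmm0     # 66 0F 6E 44 24 F0    = 6 bytes  (10 total)
or
        movq    %rax, -16(%rsp)      # 48 89 44 24 F0       = 5 bytes
        movq    -16(%rsp), %xmm0     # F3 0F 7E 44 24 F0    = 6 bytes  (11 total)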

movd int->xmm is lower latency than store/reload on most CPUs, especially Intel
SnB-family where it's 1c latency, and also on AMD Ryzen.  On SnB-family,
store/reload's only advantage is the rare case where port5 is a throughput
bottleneck and latency isn't important.

movd replaces a store uop and a load uop with 1 ALU uop on Intel Core2 and
later, and on Atom/Silvermont/KNL.  It's also 1 uop on VIA Nano.

movd int->xmm is 2 ALU uops on AMD K10/Bulldozer-family and Jaguar, and P4, and
3 on K8/Bobcat.  It never costs any more total uops for the front-end (since a
movd load is 2 uops on K8/Bobcat), but decoding a multi-uop instruction can
sometimes be a bottleneck (especially on K8 where a 3 m-op instruction is a
"vectorpath" (microcode)).


Store/reload has one per clock throughput on every CPU, AFAIK.  On most CPUs
that have much weight in -mtune=generic, movd's throughput is one-per-clock or
better.  (According to Agner Fog's tables, only Bobcat, K8/K10, and P4 have
throughput of one per 2 or 3 clocks for movd/movq int->xmm).  The biggest
problem is K10, with something like one per 2.8c throughput (according to a
couple of reports from http://users.atw.hu/instlatx64/, e.g.
http://users.atw.hu/instlatx64/AuthenticAMD0100FA0_K10_Thuban_InstLatX64.txt). 
Agner Fog says 3, but none of these are measuring with other instructions mixed
in.

Some CPUs have better than one-per-clock throughput for movd/movq: Core2 can do
one per 0.5c, and Nehalem one per 0.33c.  So do we hurt them a lot to help
PhenomII?  I'd guess that Core2+Nehalem has somewhat more weight in
tune=generic than K10.
Some AMD PhenomII CPUs are still around, though.  (But we could exclude them
for code built with -mssse3, since K10 doesn't have SSSE3.)


---------

Probably the deciding factor for tune=generic is whether it hurts AMD
Bulldozer-family significantly or at all.  It looks like there's not much
difference
either way: similar throughput and latency.

However, store/reload may have an advantage when two cores in a cluster are
competing for their shared vector unit.  Probably both of movd's macro-ops need
to run on the shared vector unit, but for store/reload maybe only the load
needs the shared resource.  IDK if this is correct or relevant, though. 
Probably -mtune=bdver* should keep using store/reload, but this might not be
enough of a reason to stop -mtune=generic from using movd.


Agner Fog's microarch pdf (Bulldozer section 18.11) says:

  > Nevertheless, I cannot confirm that it is faster to move data from a
  > general purpose register to a vector register through a memory
  > intermediate, as recommended in AMD's optimization guide.

That AMD optimization guide advice may have been left over from K8/K10, where
movd/movq from integer->vector has bad throughput.

As far as latency goes, scalar store -> vector reload is 10c on Bulldozer
according to Agner Fog's numbers, while movd/movq is 10c on
Bulldozer/Piledriver, and 5c on Steamroller.  (Steamroller also appears to have
reduced the store-forwarding latency to 6c.  Agner's tables are supposed to
have the store+load latencies add up to the store-forwarding latency.)

Store/reload is 2 instructions / 2 m-ops, but movd or movq is 1 instruction / 2
m-ops.  This is mostly ok for the decoders, but bdver1 can't decode in a 2-2
pattern (ver2/ver3 can).

Scheduling instructions to avoid consecutive multi-uop instructions may help
decode throughput on bdver1.  But pairs of 2 m-op instructions are good on
bdver2 and later.


With SSE4, pinsrd/q is probably good, because it's still only 2 m-ops on
Bulldozer-family.  Indeed, -mtune=bdver1 uses 2x store/reload and 2x pinsrd for
_mm_set_epi32(d,c,b,a).

        movl    %edx, -12(%rsp)
        movd    -12(%rsp), %xmm1
        movl    %edi, -12(%rsp)
        movd    -12(%rsp), %xmm0
        pinsrd  $1, %ecx, %xmm1
        pinsrd  $1, %esi, %xmm0
        punpcklqdq      %xmm1, %xmm0


Even better would probably be

        movd    %edx, %xmm1
        movl    %edi, -12(%rsp)
        pinsrd  $1, %ecx, %xmm1    # for bdver2, schedule so it can decode
                                   # in a 2-2 pattern with the other pinsrd
        movd    -12(%rsp), %xmm0
        pinsrd  $1, %esi, %xmm0
        punpcklqdq      %xmm1, %xmm0

The store/reload can happen in parallel with the direct movd int->xmm1.  This
would be pretty reasonable for tune=generic, and should run well on Intel
SnB-family CPUs.


-----

For -msse4 with -mtune=core2 or -mtune=nehalem, this is probably optimal:

        movd    %edi, %xmm0
        pinsrd  $1, %esi, %xmm0
        pinsrd  $2, %edx, %xmm0
        pinsrd  $3, %ecx, %xmm0

movd can run on any port and pinsrd is only 1 uop.  So this has a total latency
of 2 + 3*1 = 5c (2c movd plus three dependent 1c pinsrd) on Core2 Wolfdale.
(First-gen Core2 doesn't have SSE4.1.)
Front-end bottlenecks are more common on Core2/Nehalem since they don't have a
uop-cache, so fewer instructions is probably a good bet even at the expense of
latency.
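
In intrinsics form (a hypothetical helper, just so this variant can be compiled
and timed; requires SSE4.1):

#include <smmintrin.h>
__m128i set4_pinsr(int a, int b, int c, int d) {
  __m128i v = _mm_cvtsi32_si128(a);   /* movd   %edi, %xmm0 */
  v = _mm_insert_epi32(v, b, 1);      /* pinsrd $1, %esi, %xmm0 */
  v = _mm_insert_epi32(v, c, 2);      /* pinsrd $2, %edx, %xmm0 */
  v = _mm_insert_epi32(v, d, 3);      /* pinsrd $3, %ecx, %xmm0 */
  return v;
}

Whether gcc keeps this as written or canonicalizes it back into its own
_mm_set_epi32 expansion is something I haven't verified.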

It might not be worth the effort to get gcc to emit this for Core2/Nehalem,
since they're old and getting less relevant all the time.

It may also be good for -mtune=silvermont or KNL, though, since they also have
1 uop pinsrd/q.  But with 3c latency for pinsrd, the lack of ILP may be a big
problem.  Also, decode on Silvermont without VEX will stall if the pinsrd needs
a REX (too many prefixes).  KNL should always use VEX or EVEX to avoid that.
