https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #27 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #21)
> (In reply to Richard Biener from comment #20)
> > Fixed.
> 
> Unfortunately it's only fixed for integer, not FP.  The OpenMP and vanilla
> float array sum functions from the godbolt link in the initial bug report
> still use 256b shuffles, including a gratuitous vperm2f128 when the upper
> half isn't even used, so vextractf128 would have done the same job in 1 uop
> on Ryzen instead of 8.
> 
> Even on Intel CPUs, they're optimized for code-size, not performance
> (vhaddps instead of shuffle / vaddps).  Remember that Intel CPUs with AVX
> only have one FP shuffle unit (including Sandy/Ivybridge, which has 2
> integer-128 shuffle units).
> 
> float sumfloat_autovec(const float arr[]) {
>     arr = __builtin_assume_aligned(arr, 64);
>     float sum = 0;
>     for (int i = 0; i < 1024; i++)
>         sum = sum + arr[i];
>     return sum;
> }
> 
> # gcc 20180113 -mavx2 -ffast-math -O3
> # (tune=generic, and even with arch=znver1 no-prefer-avx128)
>     ...
>     vhaddps    %ymm0, %ymm0, %ymm0
>     vhaddps    %ymm0, %ymm0, %ymm1
>     vperm2f128 $1, %ymm1, %ymm1, %ymm0    # why not vextractf128?
>     vaddps     %ymm1, %ymm0, %ymm0        # gratuitous 256b
>     vzeroupper
> 
> This bug is still present for FP code: it narrows from 256b to scalar only
> in the last step.
> 
> Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles,
> but it's still 2 uops for port 5 vs. VSHUFPS + VADDPS (costing an extra
> cycle of latency, because with only 1 shuffle port the 2 interleave
> shuffles that feed a vertical-add uop can't run in the same cycle).
> (V)HADDPS with the same input twice is almost never the best choice for
> performance.
> 
> On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on
> Intel), and it's 7c latency (vs. 3c for ADDPS).  256b VHADDPS ymm is 8
> uops, one per 3 cycles throughput, and Agner Fog reports that it's "mixed
> domain", i.e. some kind of penalty for ivec / fp domain crossing.  I guess
> the shuffles it uses internally are ivec domain?
> 
> With multiple threads on the same core, or even with ILP from surrounding
> code, uop throughput matters as well as latency, so more uops is worse
> even aside from the latency cost.
> 
> The sequence I'd recommend (for both Intel and AMD) is (see also
> http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026):
> 
>     vextractf128 $1, %ymm0, %xmm1
>     vaddps    %xmm1, %xmm0, %xmm0    # narrow to 128b
> 
>     vmovshdup %xmm0, %xmm1           # copy high->low in each pair
>     vaddps    %xmm1, %xmm0, %xmm0
> 
>     vmovhlps  %xmm0, %xmm0, %xmm1    # duplicate high 64b
>     vaddps    %xmm1, %xmm0, %xmm0
> 
> The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you
> can do a 128b hsum in 4 instructions, with no MOVAPS.
> 
> Intel: 6 uops total, 3 shuffles, vs. 8 uops total, 5 of them shuffles.
> 
> AMD Ryzen: 6 uops, 3 shuffles, vs. 26 uops total, 20 of them shuffles.
> And much worse latency, too.
> 
> Even just fixing this specific bug without fixing the rest of the sequence
> would help AMD *significantly*, because vextractf128 is very cheap and
> vhaddps xmm is only half the uops of ymm (but the latency still sucks).

Note that this is deliberately left as-is because the target advertises
(cheap) support for horizontal reduction.
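For reference, the suggested epilogue written with intrinsics would be
roughly the following.  This is an untested sketch; hsum256_ps is just a
placeholder name, not something from the report or the tree:

#include <immintrin.h>

/* Horizontal sum of a 256b float vector using the suggested sequence:
   vextractf128 + vaddps to narrow to 128b, then vmovshdup + vaddps and
   vmovhlps + vaddps to finish.  */
static inline float
hsum256_ps (__m256 v)
{
  __m128 lo   = _mm256_castps256_ps128 (v);    /* low 128b, costs no uop */
  __m128 hi   = _mm256_extractf128_ps (v, 1);  /* vextractf128 $1 */
  __m128 sum  = _mm_add_ps (lo, hi);           /* narrow to 128b */
  __m128 shuf = _mm_movehdup_ps (sum);         /* vmovshdup: high->low in pairs */
  sum  = _mm_add_ps (sum, shuf);
  shuf = _mm_movehl_ps (sum, sum);             /* vmovhlps: duplicate high 64b */
  sum  = _mm_add_ss (sum, shuf);               /* only the low element is live */
  return _mm_cvtss_f32 (sum);
}

The intrinsics only illustrate the data flow; the expander can of course
emit the same sequence directly.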
The vectorizer simply generates a single statement for the reduction
epilogue:

  <bb 4> [local count: 10737418]:
  stmp_sum_11.5_20 = REDUC_PLUS (vect_sum_11.4_6); [tail call]
  return stmp_sum_11.5_20;

so either the target shouldn't tell the vectorizer it supports this, or it
needs to expand to better code.  Which means: can you open a separate bug
for this?  The backend currently does:

(define_expand "reduc_plus_scal_v8sf"
  [(match_operand:SF 0 "register_operand")
   (match_operand:V8SF 1 "register_operand")]
  "TARGET_AVX"
{
  rtx tmp = gen_reg_rtx (V8SFmode);
  rtx tmp2 = gen_reg_rtx (V8SFmode);
  rtx vec_res = gen_reg_rtx (V8SFmode);
  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
  DONE;
})

Changing it to your sequence shouldn't be too difficult; see the sketch at
the end of this comment.

> -----
> 
> Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get
> without AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I
> need to report that bug separately, because it probably won't get fixed
> soon: if I understand correctly, there's no mechanism for the back-end to
> tell the auto-vectorizer which shuffles it can do efficiently!

Also, shuffles on GIMPLE do not expose the target's micro-ops, so this is a
pure target issue.  Please report it separately.

> It usually won't make too much difference, but for very small arrays
> (like 8 `int` elements) the hsum is a big part of the cost, although it's
> probably still worth auto-vectorizing *if* you can do it efficiently.

So I'm closing this bug again.
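P.S. To make "shouldn't be too difficult" concrete, the body of the
define_expand above could emit something like the following.  Untested
sketch only: it assumes the existing vec_extract_lo_v8sf /
vec_extract_hi_v8sf, sse3_movshdup, sse_movhlps and addv4sf3 patterns can
be combined this way, and that a vec_extractv4sfsf expander exists
analogous to the v8sf one above; treat the gen_* calls as illustrative,
not verified.

  /* Sketch of a reduc_plus_scal_v8sf body using vextractf128 + vmovshdup
     + vmovhlps instead of two vhaddps.  */
  rtx lo = gen_reg_rtx (V4SFmode);
  rtx hi = gen_reg_rtx (V4SFmode);
  rtx sum1 = gen_reg_rtx (V4SFmode);
  rtx shuf1 = gen_reg_rtx (V4SFmode);
  rtx sum2 = gen_reg_rtx (V4SFmode);
  rtx shuf2 = gen_reg_rtx (V4SFmode);
  rtx sum3 = gen_reg_rtx (V4SFmode);

  /* vextractf128 + vaddps: narrow to 128 bits first.  */
  emit_insn (gen_vec_extract_lo_v8sf (lo, operands[1]));
  emit_insn (gen_vec_extract_hi_v8sf (hi, operands[1]));
  emit_insn (gen_addv4sf3 (sum1, lo, hi));

  /* vmovshdup + vaddps: add the odd element of each pair into the even.  */
  emit_insn (gen_sse3_movshdup (shuf1, sum1));
  emit_insn (gen_addv4sf3 (sum2, sum1, shuf1));

  /* vmovhlps + vaddps: add the high 64 bits into the low 64 bits.  */
  emit_insn (gen_sse_movhlps (shuf2, sum2, sum2));
  emit_insn (gen_addv4sf3 (sum3, sum2, shuf2));

  emit_insn (gen_vec_extractv4sfsf (operands[0], sum3, const0_rtx));
  DONE;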