https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #27 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #21)
> (In reply to Richard Biener from comment #20)
> > Fixed.
> 
> Unfortunately it's only fixed for integer, not FP.  The OpenMP and vanilla
> float array sum functions from the godbolt link in the initial bug report
> still use 256b shuffles, including a gratuitous vperm2f128 when the upper
> half isn't even used, so vextractf128 would have done the same job in 1 uop
> on Ryzen instead of 8.
> 
> Even on Intel CPUs, they're optimized for code-size, not performance
> (vhaddps instead of shuffle / vaddps).  Remember that Intel CPUs with AVX
> only have one FP shuffle unit (including Sandy/Ivybridge, which has 2
> integer-128 shuffle units).
> 
> float sumfloat_autovec(const float arr[]) {
>     arr = __builtin_assume_aligned(arr, 64);
>     float sum = 0;
>     for (int i = 0; i < 1024; i++)
>         sum = sum + arr[i];
>     return sum;
> }
> 
> # gcc 20180113 -mavx2 -ffast-math -O3
> # (tune=generic, and even with arch=znver1 no-prefer-avx128)
>     ...
>     vhaddps    %ymm0, %ymm0, %ymm0
>     vhaddps    %ymm0, %ymm0, %ymm1
>     vperm2f128 $1, %ymm1, %ymm1, %ymm0    # why not vextractf128?
>     vaddps     %ymm1, %ymm0, %ymm0        # gratuitous 256b
>     vzeroupper
> 
> This bug is still present for FP code: it narrows from 256b to scalar only
> in the last step.
> 
> Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles,
> but it's still 2 uops for port 5 vs. VSHUFPS + VADDPS (costing an extra
> cycle of latency, because with only 1 shuffle port the 2 interleave
> shuffles that feed a vertical-add uop can't run in the same cycle).
> (V)HADDPS with the same input twice is almost never the best choice for
> performance.
> 
> On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on
> Intel), and it's 7c latency (vs. 3c for ADDPS).  256b VHADDPS ymm is 8
> uops, one per 3 cycles throughput, and Agner Fog reports that it's "mixed
> domain", i.e. some kind of penalty for ivec / fp domain crossing.  I guess
> the shuffles it uses internally are ivec domain?
> 
> With multiple threads on the same core, or even with ILP from surrounding
> code, uop throughput matters as well as latency, so more uops is worse
> even aside from the latency cost.
> 
> The sequence I'd recommend (for both Intel and AMD) is (see also
> http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026):
> 
>     vextractf128 $1, %ymm0, %xmm1
>     vaddps    %xmm1, %xmm0, %xmm0    # narrow to 128b
> 
>     vmovshdup %xmm0, %xmm1           # copy high->low in each pair
>     vaddps    %xmm1, %xmm0, %xmm0
> 
>     vmovhlps  %xmm0, %xmm0, %xmm1    # duplicate high 64b
>     vaddps    %xmm1, %xmm0, %xmm0
> 
> The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you
> can do a 128b hsum in 4 instructions, with no MOVAPS.
> 
> Intel: 6 uops total, 3 shuffles, vs. 8 uops total, 5 of them shuffles.
> 
> AMD Ryzen: 6 uops, 3 shuffles, vs. 26 uops total, 20 of them shuffles.
> And much worse latency, too.
> 
> Even just fixing this specific bug without fixing the rest of the sequence
> would help AMD *significantly*, because vextractf128 is very cheap and
> vhaddps xmm is only half the uops of ymm (but the latency still sucks).

Note that this is deliberately left as-is because the target advertises
(cheap) support for horizontal reduction.
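For reference, the suggested epilogue written with intrinsics would be
roughly the following.  This is an untested sketch; hsum256_ps is just a
placeholder name, not something from the report or the tree:

#include <immintrin.h>

/* Horizontal sum of a 256b float vector using the suggested sequence:
   vextractf128 + vaddps to narrow to 128b, then vmovshdup + vaddps and
   vmovhlps + vaddps to finish.  */
static inline float
hsum256_ps (__m256 v)
{
  __m128 lo   = _mm256_castps256_ps128 (v);    /* low 128b, costs no uop */
  __m128 hi   = _mm256_extractf128_ps (v, 1);  /* vextractf128 $1 */
  __m128 sum  = _mm_add_ps (lo, hi);           /* narrow to 128b */
  __m128 shuf = _mm_movehdup_ps (sum);         /* vmovshdup: high->low in pairs */
  sum  = _mm_add_ps (sum, shuf);
  shuf = _mm_movehl_ps (sum, sum);             /* vmovhlps: duplicate high 64b */
  sum  = _mm_add_ss (sum, shuf);               /* only the low element is live */
  return _mm_cvtss_f32 (sum);
}

The intrinsics only illustrate the data flow; the expander can of course
emit the same sequence directly.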
The vectorizer simply generates a single statement for the reduction
epilogue:

  <bb 4> [local count: 10737418]:
  stmp_sum_11.5_20 = REDUC_PLUS (vect_sum_11.4_6); [tail call]
  return stmp_sum_11.5_20;

so either the target shouldn't tell the vectorizer it supports this, or it
needs to expand to better code.  Which means: can you open a separate bug
for this?  The backend currently does:

(define_expand "reduc_plus_scal_v8sf"
  [(match_operand:SF 0 "register_operand")
   (match_operand:V8SF 1 "register_operand")]
  "TARGET_AVX"
{
  rtx tmp = gen_reg_rtx (V8SFmode);
  rtx tmp2 = gen_reg_rtx (V8SFmode);
  rtx vec_res = gen_reg_rtx (V8SFmode);
  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
  DONE;
})

Changing it to your sequence shouldn't be too difficult; see the sketch at
the end of this comment.

> -----
> 
> Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get
> without AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I
> need to report that bug separately, because it probably won't get fixed
> soon: if I understand correctly, there's no mechanism for the back-end to
> tell the auto-vectorizer which shuffles it can do efficiently!

Also, shuffles on GIMPLE do not expose the target's micro-ops, so this is a
pure target issue.  Please report it separately.

> It usually won't make too much difference, but for very small arrays
> (like 8 `int` elements) the hsum is a big part of the cost, although it's
> probably still worth auto-vectorizing *if* you can do it efficiently.

So I'm closing this bug again.
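P.S. To make "shouldn't be too difficult" concrete, the body of the
define_expand above could emit something like the following.  Untested
sketch only: it assumes the existing vec_extract_lo_v8sf /
vec_extract_hi_v8sf, sse3_movshdup, sse_movhlps and addv4sf3 patterns can
be combined this way, and that a vec_extractv4sfsf expander exists
analogous to the v8sf one above; treat the gen_* calls as illustrative,
not verified.

  /* Sketch of a reduc_plus_scal_v8sf body using vextractf128 + vmovshdup
     + vmovhlps instead of two vhaddps.  */
  rtx lo = gen_reg_rtx (V4SFmode);
  rtx hi = gen_reg_rtx (V4SFmode);
  rtx sum1 = gen_reg_rtx (V4SFmode);
  rtx shuf1 = gen_reg_rtx (V4SFmode);
  rtx sum2 = gen_reg_rtx (V4SFmode);
  rtx shuf2 = gen_reg_rtx (V4SFmode);
  rtx sum3 = gen_reg_rtx (V4SFmode);

  /* vextractf128 + vaddps: narrow to 128 bits first.  */
  emit_insn (gen_vec_extract_lo_v8sf (lo, operands[1]));
  emit_insn (gen_vec_extract_hi_v8sf (hi, operands[1]));
  emit_insn (gen_addv4sf3 (sum1, lo, hi));

  /* vmovshdup + vaddps: add the odd element of each pair into the even.  */
  emit_insn (gen_sse3_movshdup (shuf1, sum1));
  emit_insn (gen_addv4sf3 (sum2, sum1, shuf1));

  /* vmovhlps + vaddps: add the high 64 bits into the low 64 bits.  */
  emit_insn (gen_sse_movhlps (shuf2, sum2, sum2));
  emit_insn (gen_addv4sf3 (sum3, sum2, shuf2));

  emit_insn (gen_vec_extractv4sfsf (operands[0], sum3, const0_rtx));
  DONE;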