"Kewen.Lin" <li...@linux.ibm.com> writes: > Hi Richard, > > on 2021/7/8 下午8:38, Richard Sandiford via Gcc-patches wrote: >> Quoting from the final patch in the series: >> >> ------------------------------------------------------------------------ >> This patch adds support for reusing a main loop's reduction accumulator >> in an epilogue loop. This in turn lets the loops share a single piece >> of vector->scalar reduction code. >> >> The patch has the following restrictions: >> >> (1) The epilogue reduction can only operate on a single vector >> (e.g. ncopies must be 1 for non-SLP reductions, and the group size >> must be <= the element count for SLP reductions). >> >> (2) Both loops must use the same vector mode for their accumulators. >> This means that the patch is restricted to targets that support >> --param vect-partial-vector-usage=1. >> >> (3) The reduction must be a standard “tree code” reduction. >> >> However, these restrictions could be lifted in future. For example, >> if the main loop operates on 128-bit vectors and the epilogue loop >> operates on 64-bit vectors, we could in future reduce the 128-bit >> vector by one stage and use the 64-bit result as the starting point >> for the epilogue result. >> >> The patch tries to handle chained SLP reductions, unchained SLP >> reductions and non-SLP reductions. It also handles cases in which >> the epilogue loop is entered directly (rather than via the main loop) >> and cases in which the epilogue loop can be skipped. >> ------------------------------------------------------------------------ >> >> However, it ended up being difficult to do that without some preparatory >> clean-ups. Some of them could probably stand on their own, but others >> are a bit “meh” without the final patch to justify them. >> >> The diff below shows the effect of the patch when compiling: >> >> unsigned short __attribute__((noipa)) >> add_loop (unsigned short *x, int n) >> { >> unsigned short res = 0; >> for (int i = 0; i < n; ++i) >> res += x[i]; >> return res; >> } >> >> with -O3 --param vect-partial-vector-usage=1 on an SVE target: >> >> add_loop: add_loop: >> .LFB0: .LFB0: >> .cfi_startproc .cfi_startproc >> mov x4, x0 < >> cmp w1, 0 cmp w1, 0 >> ble .L7 ble .L7 >> cnth x0 | cnth x4 >> sub w2, w1, #1 sub w2, w1, #1 >> sub w3, w0, #1 | sub w3, w4, #1 >> cmp w2, w3 cmp w2, w3 >> bcc .L8 bcc .L8 >> sub w0, w1, w0 | sub w4, w1, w4 >> mov x3, 0 mov x3, 0 >> cnth x5 cnth x5 >> mov z0.b, #0 mov z0.b, #0 >> ptrue p0.b, all ptrue p0.b, all >> .p2align 3,,7 .p2align 3,,7 >> .L4: .L4: >> ld1h z1.h, p0/z, [x4, x3, | ld1h z1.h, p0/z, [x0, x3, >> mov x2, x3 mov x2, x3 >> add x3, x3, x5 add x3, x3, x5 >> add z0.h, z0.h, z1.h add z0.h, z0.h, z1.h >> cmp w0, w3 | cmp w4, w3 >> bcs .L4 bcs .L4 >> uaddv d0, p0, z0.h < >> umov w0, v0.h[0] < >> inch x2 inch x2 >> and w0, w0, 65535 < >> cmp w1, w2 cmp w1, w2 >> beq .L2 | beq .L6 >> .L3: .L3: >> sub w1, w1, w2 sub w1, w1, w2 >> mov z1.b, #0 | add x2, x0, w2, uxtw 1 >> whilelo p0.h, wzr, w1 whilelo p0.h, wzr, w1 >> add x2, x4, w2, uxtw 1 | ld1h z1.h, p0/z, [x2] >> ptrue p1.b, all | add z0.h, p0/m, z0.h, z1. >> ld1h z0.h, p0/z, [x2] | .L6: >> sel z0.h, p0, z0.h, z1.h | ptrue p0.b, all >> uaddv d0, p1, z0.h | uaddv d0, p0, z0.h >> fmov x1, d0 | umov w0, v0.h[0] >> add w0, w0, w1, uxth < >> and w0, w0, 65535 and w0, w0, 65535 >> .L2: < >> ret ret >> .p2align 2,,3 .p2align 2,,3 >> .L7: .L7: >> mov w0, 0 mov w0, 0 >> ret ret >> .L8: .L8: >> mov w2, 0 mov w2, 0 >> mov w0, 0 | mov z0.b, #0 >> b .L3 b .L3 >> .cfi_endproc .cfi_endproc >> >> Kewen, could you give this a spin on Power 10 to see whether it >> works/helps there? I've attached a combined diff. >> > > Thanks for the combined diff file. > > I'm sorry that the current length based partial vector doesn't support > reduction, there are no conditional operations for length, we have to > preprocess the inactive lanes for the intermediate operations or final > reduction operations as operation types since the inactive lane value > is supposed to be undefined, this seems to require an efficient way to > turn length to a mask vector, Power10 doesn't have the corresponding > instruction so we have to do some tricks, it's still on my TODO list.
Ah, yeah, I'd forgotten about that, sorry. > I did a hacking to relax the check in vectorizable_operation for > operations involved for reduction, I can see this patch series takes > effect for length based partial vector, so I believe it will help > length based partial vector once we enable it for reduction later. > Thanks for improving this! > > This patch series was bootstrapped and regress-tested on Power10, also > benchmarked with SPEC2017 based on r12-2179 at Ofast unroll, no > remarkable regression and improvement was observed. Thanks for the testing. Richard