Re: [RFC] Implementing detection of saturation and rounding arithmetic

Bin.Cheng via Gcc-patches Mon, 07 Jun 2021 04:29:42 -0700

On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> This RFC is motivated by the IV sharing RFC in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
> need to have the IVOPTS pass be able to clean up IV's shared between
> multiple loops. When creating a similar problem with C code I noticed
> IVOPTs treated IV's with uses outside the loop differently, this didn't
> even required multiple loops, take for instance the following example
> using SVE intrinsics:
>
> #include <arm_sve.h>
> #include <limits.h>
> extern void use (char *);
> void bar (char  * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned n)
> {
>      svbool_t all_true = svptrue_b8 ();
>    unsigned i = 0;
>    if (n < (UINT_MAX - svcntb() - 1))
>      {
>          for (; i < n; i += svcntb())
>              {
>                  svuint8_t va = svld1 (all_true, (uint8_t*)a);
>                  svuint8_t vb = svld1 (all_true, (uint8_t*)b);
>                  svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
>                  a += svcntb();
>                  b += svcntb();
>                  c += svcntb();
>              }
>      }
>    use (a);
> }
>
> IVOPTs tends to generate a shared IV for SVE memory accesses, as we
> don't have a post-increment for SVE load/stores. If we had not included
> 'use (a);' in this example, IVOPTs would have replaced the IV's for a, b
> and c with a single one, (also used for the loop-control). See:
>
>    <bb 4> [local count: 955630225]:
>    # ivtmp.7_8 = PHI <ivtmp.7_25(7), 0(6)>
>    va_14 = MEM <svuint8_t> [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
>    vb_15 = MEM <svuint8_t> [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
>    _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
>    MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
>    ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
>    i_23 = (unsigned int) ivtmp.7_25;
>    if (n_9(D) > i_23)
>      goto <bb 7>; [89.00%]
>    else
>      goto <bb 5>; [11.00%]
>
>   However, due to the 'use (a);' it will create two IVs one for
> loop-control, b and c and one for a. See:
>
>   <bb 4> [local count: 955630225]:
>    # a_28 = PHI <a_18(7), a_11(D)(6)>
>    # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
>    va_15 = MEM <svuint8_t> [(unsigned char *)a_28];
>    vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>    _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>    MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>    a_18 = a_28 + POLY_INT_CST [16, 16];
>    ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>    i_8 = (unsigned int) ivtmp.7_24;
>    if (n_10(D) > i_8)
>      goto <bb 7>; [89.00%]
>    else
>      goto <bb 10>; [11.00%]
>
> With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs
> to not cost uses outside of the loop. This makes IVOPTs generate a
> single IV, but unfortunately it decides to create the variable for the
> use inside the loop and it also seems to use the pre-increment value of
> the shared-IV and add the [16,16] to it. See:
>
>    <bb 4> [local count: 955630225]:
>    # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
>    va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>    vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>    _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>    MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>    _8 = (unsigned long) a_11(D);
>    _7 = _8 + ivtmp.7_25;
>    _6 = _7 + POLY_INT_CST [16, 16];
>    a_18 = (char * restrict) _6;
>    ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>    i_5 = (unsigned int) ivtmp.7_24;
>    if (n_10(D) > i_5)
>      goto <bb 7>; [89.00%]
>    else
>      goto <bb 10>; [11.00%]
>
> With the patch 'var_after.patch' I make get_computation_aff_1 use
> 'cand->var_after' for outside uses thus using the post-increment var of
> the candidate IV. This means I have to insert it in a different place
> and make sure to delete the old use->stmt. I'm sure there is a better
> way to do this using IVOPTs current framework, but I didn't find one
> yet. See the result:
>
>   <bb 4> [local count: 955630225]:
>    # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
>    va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>    vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>    _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>    MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>    ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>    _8 = (unsigned long) a_11(D);
>    _7 = _8 + ivtmp.7_24;
>    a_18 = (char * restrict) _7;
>    i_6 = (unsigned int) ivtmp.7_24;
>    if (n_10(D) > i_6)
>      goto <bb 7>; [89.00%]
>    else
>      goto <bb 10>; [11.00%]
>
>
> This is still not optimal as we are still doing the update inside the
> loop and there is absolutely no need for that. I found that running sink
> would solve it and it seems someone has added a second sink pass, so
> that saves me a third patch :) see after sink2:
>
>    <bb 4> [local count: 955630225]:
>    # ivtmp.7_25 = PHI <ivtmp.7_24(7), 0(6)>
>    va_15 = MEM <svuint8_t> [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>    vb_16 = MEM <svuint8_t> [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>    _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>    MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>    ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>    i_6 = (unsigned int) ivtmp.7_24;
>    if (i_6 < n_10(D))
>      goto <bb 7>; [89.00%]
>    else
>      goto <bb 10>; [11.00%]
>
>    <bb 10> [local count: 105119324]:
>    _8 = (unsigned long) a_11(D);
>    _7 = _8 + ivtmp.7_24;
>    a_18 = (char * restrict) _7;
>    goto <bb 5>; [100.00%]
>
>
> I haven't tested this at all, but I wanted to get the opinion of someone
> more knowledgeable in IVOPTs before I continued this avenue. I have two
> main questions:
> 1) How should we be costing outside uses, right now I use a nocost, but
> that's not entirely accurate. Should we use a constant multiply factor
> for inside loop uses to make them outweigh outside uses? Should we use
> iteration count if available? Do we want to use a backend hook to let
> targets provide their own costing for these?
Hi Andre,
I didn't look into the details of the IV sharing RFC.  It seems to me
costing outside uses is trying to generate better code for later code
(epilogue loop here).  The only problem is IVOPTs doesn't know that
the outside use is not in the final form - which will be transformed
by IVOPTs again.


I think this example is not good at describing your problem because it
shows exactly that considering outside use results in better code,
compared to the other two approaches.

> 2) Is there a cleaner way to generate the optimal 'post-increment' use
> for the outside-use variable? I first thought the position in the
> candidate might be something I could use or even the var_at_stmt
> functionality, but the outside IV has the actual increment of the
> variable as it's use, rather than the outside uses. This is this RFC's
> main weakness I find.
To answer why IVOPTs behaves like this w/o your two patches.  The main
problem is the point IVOPTs rewrites outside use IV - I don't remember
the exact point - but looks like at the end of loop while before
incrementing instruction of main IV.  It's a known issue that outside
use should be costed/re-written on the exit edge along which its value
flows out of loop.  I had a patch a long time ago but discarded it,
because it didn't bring obvious improvement and is complicated in case
of multi-exit edges.

But in general, I am less convinced that any of the two patches is the
right direction solving IV sharing issue between vectorized loop and
epilogue loop.  I would need to read the previous RFC before giving
further comments though.

Thanks,
bin

Re: [RFC] Implementing detection of saturation and rounding arithmetic

Reply via email to