Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-06-08 Thread Andre Simoes Dias Vieira via Gcc-patches

Hi Bin,

Thank you for the reply, I have some questions, see below.

On 07/06/2021 12:28, Bin.Cheng wrote:

On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
 wrote:

Hi Andre,
I didn't look into the details of the IV sharing RFC.  It seems to me
costing outside uses is trying to generate better code for later code
(epilogue loop here).  The only problem is IVOPTs doesn't know that
the outside use is not in the final form - which will be transformed
by IVOPTs again.

I think this example is not good at describing your problem because it
shows exactly that considering outside use results in better code,
compared to the other two approaches.
I don't quite understand what you are saying here :( What do you mean by 
final form? It seems to me that costing uses inside and outside loop the 
same way is wrong because calculating the IV inside the loop has to be 
done every iteration, whereas if you can resolve it to a single update 
(without an IV) then you can sink it outside the loop. This is why I 
think this example shows why we need to cost these uses differently.

2) Is there a cleaner way to generate the optimal 'post-increment' use
for the outside-use variable? I first thought the position in the
candidate might be something I could use or even the var_at_stmt
functionality, but the outside IV has the actual increment of the
variable as it's use, rather than the outside uses. This is this RFC's
main weakness I find.

To answer why IVOPTs behaves like this w/o your two patches.  The main
problem is the point IVOPTs rewrites outside use IV - I don't remember
the exact point - but looks like at the end of loop while before
incrementing instruction of main IV.  It's a known issue that outside
use should be costed/re-written on the exit edge along which its value
flows out of loop.  I had a patch a long time ago but discarded it,
because it didn't bring obvious improvement and is complicated in case
of multi-exit edges.
Yeah I haven't looked at multi-exit edges and I understand that 
complicates things. But for now we could disable the special casing of 
outside uses when dealing with multi-exit loops and keep the current 
behavior.


But in general, I am less convinced that any of the two patches is the
right direction solving IV sharing issue between vectorized loop and
epilogue loop.  I would need to read the previous RFC before giving
further comments though.


The previous RFC still has a lot of unanswered questions too, but 
regardless of that, take the following (non-vectorizer) example:


#include 
#include 

void bar (char  * __restrict__ a, char * __restrict__ b, char * 
__restrict__ c, unsigned long long n)

{
    svbool_t all_true = svptrue_b8 ();
  unsigned long long i = 0;
    for (; i < (n & ~(svcntb() - 1)); i += svcntb()) {
  svuint8_t va = svld1 (all_true, (uint8_t*)a);
  svuint8_t vb = svld1 (all_true, (uint8_t*)b);
  svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
  a += svcntb();
  b += svcntb();
  c += svcntb();
  }
  svbool_t pred;
  for (; i < (n); i += svcntb()) {
  pred = svwhilelt_b8 (i, n);
  svuint8_t va = svld1 (pred, (uint8_t*)a);
  svuint8_t vb = svld1 (pred, (uint8_t*)b);
  svst1 (pred, (uint8_t *)c, svadd_z (pred, va,vb));
  a += svcntb();
  b += svcntb();
  c += svcntb();
  }


Current IVOPTs will use 4 iterators for the first loop, when it could do 
with just 1. In fact, if you use my patches it will create just a single 
IV and sink the uses and it is then able to merge them with loads & 
stores of the next loop.


I am not saying setting outside costs to 0 is the right thing to do by 
the way. It is absolutely not! It will break cost considerations for 
other cases. Like I said above I've been playing around with using 
'!use->outside' as a multiplier for the cost. Unfortunately it won't 
help with the case above, because this seems to choose 'infinite_cost' 
because the candidate IV has a lower precision than the use IV. I don't 
quite understand yet how candidates are created, but something I'm going 
to try to look at. Just wanted to show this as an example of how IVOPTs 
would not improve code with multiple loops that don't involve the 
vectorizer.


BR,
Andre




Thanks,
bin


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-06-07 Thread Bin.Cheng via Gcc-patches
On Fri, Jun 4, 2021 at 12:35 AM Andre Vieira (lists) via Gcc-patches
 wrote:
>
> Hi,
>
> This RFC is motivated by the IV sharing RFC in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the
> need to have the IVOPTS pass be able to clean up IV's shared between
> multiple loops. When creating a similar problem with C code I noticed
> IVOPTs treated IV's with uses outside the loop differently, this didn't
> even required multiple loops, take for instance the following example
> using SVE intrinsics:
>
> #include 
> #include 
> extern void use (char *);
> void bar (char  * __restrict__ a, char * __restrict__ b, char *
> __restrict__ c, unsigned n)
> {
>  svbool_t all_true = svptrue_b8 ();
>unsigned i = 0;
>if (n < (UINT_MAX - svcntb() - 1))
>  {
>  for (; i < n; i += svcntb())
>  {
>  svuint8_t va = svld1 (all_true, (uint8_t*)a);
>  svuint8_t vb = svld1 (all_true, (uint8_t*)b);
>  svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
>  a += svcntb();
>  b += svcntb();
>  c += svcntb();
>  }
>  }
>use (a);
> }
>
> IVOPTs tends to generate a shared IV for SVE memory accesses, as we
> don't have a post-increment for SVE load/stores. If we had not included
> 'use (a);' in this example, IVOPTs would have replaced the IV's for a, b
> and c with a single one, (also used for the loop-control). See:
>
> [local count: 955630225]:
># ivtmp.7_8 = PHI 
>va_14 = MEM  [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
>vb_15 = MEM  [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
>MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
>ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
>i_23 = (unsigned int) ivtmp.7_25;
>if (n_9(D) > i_23)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
>   However, due to the 'use (a);' it will create two IVs one for
> loop-control, b and c and one for a. See:
>
>[local count: 955630225]:
># a_28 = PHI 
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_28];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>a_18 = a_28 + POLY_INT_CST [16, 16];
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>i_8 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_8)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs
> to not cost uses outside of the loop. This makes IVOPTs generate a
> single IV, but unfortunately it decides to create the variable for the
> use inside the loop and it also seems to use the pre-increment value of
> the shared-IV and add the [16,16] to it. See:
>
> [local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>_8 = (unsigned long) a_11(D);
>_7 = _8 + ivtmp.7_25;
>_6 = _7 + POLY_INT_CST [16, 16];
>a_18 = (char * restrict) _6;
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>i_5 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_5)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> With the patch 'var_after.patch' I make get_computation_aff_1 use
> 'cand->var_after' for outside uses thus using the post-increment var of
> the candidate IV. This means I have to insert it in a different place
> and make sure to delete the old use->stmt. I'm sure there is a better
> way to do this using IVOPTs current framework, but I didn't find one
> yet. See the result:
>
>[local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
>MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
>ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
>_8 = (unsigned long) a_11(D);
>_7 = _8 + ivtmp.7_24;
>a_18 = (char * restrict) _7;
>i_6 = (unsigned int) ivtmp.7_24;
>if (n_10(D) > i_6)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
>
> This is still not optimal as we are still doing the update inside the
> loop and there is absolutely no need for that. I found that running sink
> would solve it and it seems someone has added a second sink pass, so
> that saves me a third patch :) see after sink2:
>
> [local count: 955630225]:
># ivtmp.7_25 = PHI 
>va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
>vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
>_2 = svadd_u8_z ({ -1, ... }, 

[RFC][ivopts] Generate better code for IVs with uses outside the loop (was Re: [RFC] Implementing detection of saturation and rounding arithmetic)

2021-06-03 Thread Andre Vieira (lists) via Gcc-patches

Streams got crossed there and used the wrong subject ...

On 03/06/2021 17:34, Andre Vieira (lists) via Gcc-patches wrote:

Hi,

This RFC is motivated by the IV sharing RFC in 
https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the 
need to have the IVOPTS pass be able to clean up IV's shared between 
multiple loops. When creating a similar problem with C code I noticed 
IVOPTs treated IV's with uses outside the loop differently, this 
didn't even required multiple loops, take for instance the following 
example using SVE intrinsics:


#include 
#include 
extern void use (char *);
void bar (char  * __restrict__ a, char * __restrict__ b, char * 
__restrict__ c, unsigned n)

{
    svbool_t all_true = svptrue_b8 ();
  unsigned i = 0;
  if (n < (UINT_MAX - svcntb() - 1))
    {
    for (; i < n; i += svcntb())
    {
    svuint8_t va = svld1 (all_true, (uint8_t*)a);
    svuint8_t vb = svld1 (all_true, (uint8_t*)b);
    svst1 (all_true, (uint8_t *)c, svadd_z (all_true, 
va,vb));

    a += svcntb();
    b += svcntb();
    c += svcntb();
    }
    }
  use (a);
}

IVOPTs tends to generate a shared IV for SVE memory accesses, as we 
don't have a post-increment for SVE load/stores. If we had not 
included 'use (a);' in this example, IVOPTs would have replaced the 
IV's for a, b and c with a single one, (also used for the 
loop-control). See:


   [local count: 955630225]:
  # ivtmp.7_8 = PHI 
  va_14 = MEM  [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
  vb_15 = MEM  [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
  MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
  ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
  i_23 = (unsigned int) ivtmp.7_25;
  if (n_9(D) > i_23)
    goto ; [89.00%]
  else
    goto ; [11.00%]

 However, due to the 'use (a);' it will create two IVs one for 
loop-control, b and c and one for a. See:


  [local count: 955630225]:
  # a_28 = PHI 
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_28];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  a_18 = a_28 + POLY_INT_CST [16, 16];
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_8 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_8)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the first patch attached in this RFC 'no_cost.patch', I tell 
IVOPTs to not cost uses outside of the loop. This makes IVOPTs 
generate a single IV, but unfortunately it decides to create the 
variable for the use inside the loop and it also seems to use the 
pre-increment value of the shared-IV and add the [16,16] to it. See:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_25;
  _6 = _7 + POLY_INT_CST [16, 16];
  a_18 = (char * restrict) _6;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_5 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_5)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the patch 'var_after.patch' I make get_computation_aff_1 use 
'cand->var_after' for outside uses thus using the post-increment var 
of the candidate IV. This means I have to insert it in a different 
place and make sure to delete the old use->stmt. I'm sure there is a 
better way to do this using IVOPTs current framework, but I didn't 
find one yet. See the result:


  [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_24;
  a_18 = (char * restrict) _7;
  i_6 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_6)
    goto ; [89.00%]
  else
    goto ; [11.00%]


This is still not optimal as we are still doing the update inside the 
loop and there is absolutely no need for that. I found that running 
sink would solve it and it seems someone has added a second sink pass, 
so that saves me a third patch :) see after sink2:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_6 = (unsigned int) ivtmp.7_24;
  if (i_6 < n_10(D))
    goto ; [89.00%]
  else
    goto ; [11.00%]

[RFC] Implementing detection of saturation and rounding arithmetic

2021-06-03 Thread Andre Vieira (lists) via Gcc-patches

Hi,

This RFC is motivated by the IV sharing RFC in 
https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569502.html and the 
need to have the IVOPTS pass be able to clean up IV's shared between 
multiple loops. When creating a similar problem with C code I noticed 
IVOPTs treated IV's with uses outside the loop differently, this didn't 
even required multiple loops, take for instance the following example 
using SVE intrinsics:


#include 
#include 
extern void use (char *);
void bar (char  * __restrict__ a, char * __restrict__ b, char * 
__restrict__ c, unsigned n)

{
    svbool_t all_true = svptrue_b8 ();
  unsigned i = 0;
  if (n < (UINT_MAX - svcntb() - 1))
    {
    for (; i < n; i += svcntb())
    {
    svuint8_t va = svld1 (all_true, (uint8_t*)a);
    svuint8_t vb = svld1 (all_true, (uint8_t*)b);
    svst1 (all_true, (uint8_t *)c, svadd_z (all_true, va,vb));
    a += svcntb();
    b += svcntb();
    c += svcntb();
    }
    }
  use (a);
}

IVOPTs tends to generate a shared IV for SVE memory accesses, as we 
don't have a post-increment for SVE load/stores. If we had not included 
'use (a);' in this example, IVOPTs would have replaced the IV's for a, b 
and c with a single one, (also used for the loop-control). See:


   [local count: 955630225]:
  # ivtmp.7_8 = PHI 
  va_14 = MEM  [(unsigned char *)a_10(D) + ivtmp.7_8 * 1];
  vb_15 = MEM  [(unsigned char *)b_11(D) + ivtmp.7_8 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_14, vb_15);
  MEM <__SVUint8_t> [(unsigned char *)c_12(D) + ivtmp.7_8 * 1] = _2;
  ivtmp.7_25 = ivtmp.7_8 + POLY_INT_CST [16, 16];
  i_23 = (unsigned int) ivtmp.7_25;
  if (n_9(D) > i_23)
    goto ; [89.00%]
  else
    goto ; [11.00%]

 However, due to the 'use (a);' it will create two IVs one for 
loop-control, b and c and one for a. See:


  [local count: 955630225]:
  # a_28 = PHI 
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_28];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  a_18 = a_28 + POLY_INT_CST [16, 16];
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_8 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_8)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the first patch attached in this RFC 'no_cost.patch', I tell IVOPTs 
to not cost uses outside of the loop. This makes IVOPTs generate a 
single IV, but unfortunately it decides to create the variable for the 
use inside the loop and it also seems to use the pre-increment value of 
the shared-IV and add the [16,16] to it. See:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_25;
  _6 = _7 + POLY_INT_CST [16, 16];
  a_18 = (char * restrict) _6;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_5 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_5)
    goto ; [89.00%]
  else
    goto ; [11.00%]

With the patch 'var_after.patch' I make get_computation_aff_1 use 
'cand->var_after' for outside uses thus using the post-increment var of 
the candidate IV. This means I have to insert it in a different place 
and make sure to delete the old use->stmt. I'm sure there is a better 
way to do this using IVOPTs current framework, but I didn't find one 
yet. See the result:


  [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_24;
  a_18 = (char * restrict) _7;
  i_6 = (unsigned int) ivtmp.7_24;
  if (n_10(D) > i_6)
    goto ; [89.00%]
  else
    goto ; [11.00%]


This is still not optimal as we are still doing the update inside the 
loop and there is absolutely no need for that. I found that running sink 
would solve it and it seems someone has added a second sink pass, so 
that saves me a third patch :) see after sink2:


   [local count: 955630225]:
  # ivtmp.7_25 = PHI 
  va_15 = MEM  [(unsigned char *)a_11(D) + ivtmp.7_25 * 1];
  vb_16 = MEM  [(unsigned char *)b_12(D) + ivtmp.7_25 * 1];
  _2 = svadd_u8_z ({ -1, ... }, va_15, vb_16);
  MEM <__SVUint8_t> [(unsigned char *)c_13(D) + ivtmp.7_25 * 1] = _2;
  ivtmp.7_24 = ivtmp.7_25 + POLY_INT_CST [16, 16];
  i_6 = (unsigned int) ivtmp.7_24;
  if (i_6 < n_10(D))
    goto ; [89.00%]
  else
    goto ; [11.00%]

   [local count: 105119324]:
  _8 = (unsigned long) a_11(D);
  _7 = _8 + ivtmp.7_24;
  a_18 = (char * restrict) _7;
  goto ; 

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Liu Hao via Gcc

在 5/12/21 5:13 PM, Tamar Christina via Gcc 写道:


int f (int a, int b)
{
 int res;
 if (__builtin_add_overflow (a, b, ))
   {
   if (res >= 0)
 return INT_MAX;
   else
 return INT_MIN;
   }
 return res;
}

Should be recognized as saturating as well.  But yeah, following the same 
approach we
would rewrite the sequence to something like res = .ADD_SAT (a, b);



Is this a correct saturating addition implementation?

If the addition has overflowed, you get a positive result or zero for the sum of two negative 
numbers (or a negative one for two positive numbers); and it is not straightforward to write it this 
way.



This should be

  int f (int a, int b)
  {
int res;
if (__builtin_add_overflow (a, b, ))
  {
if (a >= 0)  /* changed from `res` to `a`  */
  return INT_MAX;
else
  return INT_MIN;
  }
return res;
  }

which can be optimized further as

  int f (int a, int b)
  {
int res;
if (__builtin_add_overflow (a, b, ))
  res = (a >> sizeof(int) * CHAR_BIT - 1) ^ INT_MAX;
return res;
  }


--
Best regards,
Liu Hao



OpenPGP_signature
Description: OpenPGP digital signature


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Segher Boessenkool
On Wed, May 12, 2021 at 09:13:38AM +, Tamar Christina wrote:
> > From: Segher Boessenkool 
> > On Tue, May 11, 2021 at 05:37:34AM +, Tamar Christina via Gcc wrote:
> > > 2. Saturating abs:
> > >char sat (char a)
> > >{
> > >   int tmp = abs (a);
> > >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >}
> > 
> > That can be done quite a bit better, branchless at least.  Same for all
> > examples here probably.
> 
> Do you mean in the example? Sure, I wanted to keep it simple 

Point taken...  I think you need to look at these issues sooner rather
than later though.

> > > 2. Saturation:
> > >a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >   operations which opens up the expressions to further optimizations.
> > 
> > You'll have to do the operation in a bigger mode for that.  (This is also 
> > true for
> > rounding, in many cases).
> > 
> > This makes new internal functions more attractive / more feasible.
> 
> True, but the internal function doesn't need to model the wider mode right?
> So if you're saturating an int, the internal-fn would work on int,int,int.

The ifn doesn't have to go to a wider mode, it's only if you want to
express it with more basic operations that you have to.  That is the
reason I find ifns more attractive for this, yup.

> > >   We could get the right instructions by using combine if we don't 
> > > rewrite
> > >   the instructions to an internal function, however then during
> > Vectorization
> > >   we would overestimate the cost of performing the saturation.  The
> > constants
> > >   will the also be loaded into registers and so becomes a lot more 
> > > difficult
> > >   to cleanup solely in the backend.
> > 
> > Combine is almost never the right answer if you want to combine more than
> > two or three RTL insns.  It can be done, but combine will always write the
> > combined instruction in simplest terms, which tends to mean that if you
> > combine more insns there can be very many outcomes that you all need to
> > recognise as insns in your machine description.
> 
> Yeah, ideally I would like to handle it before it gets to expand.

That is the right place yeah...  but it will need some RTL work, in
simplify-rtx at least.


Segher


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Richard Biener
On Wed, 12 May 2021, Tamar Christina wrote:

> 
> 
> > -Original Message-
> > From: Richard Biener 
> > Sent: Tuesday, May 11, 2021 12:45 PM
> > To: Tamar Christina 
> > Cc: gcc@gcc.gnu.org; Richard Sandiford ;
> > Jakub Jelinek 
> > Subject: Re: [RFC] Implementing detection of saturation and rounding
> > arithmetic
> > 
> > On Tue, 11 May 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > We are looking to implement saturation support in the compiler.  The
> > > aim is to recognize both Scalar and Vector variant of typical saturating
> > expressions.
> > >
> > > As an example:
> > >
> > > 1. Saturating addition:
> > >char sat (char a, char b)
> > >{
> > >   int tmp = a + b;
> > >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >}
> > >
> > > 2. Saturating abs:
> > >char sat (char a)
> > >{
> > >   int tmp = abs (a);
> > >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >}
> > >
> > > 3. Rounding shifts
> > >char rndshift (char dc)
> > >{
> > >   int round_const = 1 << (shift - 1);
> > >   return (dc + round_const) >> shift;
> > >}
> > >
> > > etc.
> > >
> > > Of course the first issue is that C does not really have a single
> > > idiom for expressing this.
> > >
> > > At the RTL level we have ss_truncate and us_truncate and
> > > float_truncate for truncation.
> > >
> > > At the Tree level we have nothing for truncation (I believe) for
> > > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > > but it looks like nothing actually generates this at the moment. it's 
> > > just an
> > unused tree code.
> > >
> > > For rounding there doesn't seem to be any existing infrastructure.
> > >
> > > The proposal to handle these are as follow, keep in mind that all of
> > > these also exist in their scalar form, as such detecting them in the
> > > vectorizer would be the wrong place.
> > >
> > > 1. Rounding:
> > >a) Use match.pd to rewrite various rounding idioms to shifts.
> > >b) Use backwards or forward prop to rewrite these to internal functions
> > >   where even if the target does not support these rounding 
> > > instructions
> > they
> > >   have a chance to provide a more efficient implementation than what
> > would
> > >   be generated normally.
> > >
> > > 2. Saturation:
> > >a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >   operations which opens up the expressions to further optimizations.
> > >b) Use backwards or forward prop to convert to internal functions if 
> > > the
> > >   resulting min/max expression still meet the criteria for being a
> > >   saturating expression.  This follows the algorithm as outlined in 
> > > "The
> > >   Software Vectorization handbook" by Aart J.C. Bik.
> > >
> > >   We could get the right instructions by using combine if we don't 
> > > rewrite
> > >   the instructions to an internal function, however then during
> > Vectorization
> > >   we would overestimate the cost of performing the saturation.  The
> > constants
> > >   will the also be loaded into registers and so becomes a lot more 
> > > difficult
> > >   to cleanup solely in the backend.
> > >
> > > The one thing I am wondering about is whether we would need an
> > > internal function for all operations supported, or if it should be
> > > modelled as an internal FN which just "marks" the operation as
> > > rounding/saturating. After all, the only difference between a normal
> > > and saturating expression in RTL is the xx_truncate RTL surrounding
> > > the expression.  Doing so would also mean that all targets whom have
> > saturating instructions would automatically benefit from this.
> > >
> > > But it does mean a small adjustment to the costing, which would need
> > > to cost the internal function call and the argument together as a whole.
> > >
> > > Any feedback is appreciated to minimize the number of changes required
> > > to the final patch.  Any objections 

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Richard Biener
On Wed, 12 May 2021, Richard Sandiford wrote:

> Tamar Christina  writes:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The aim is 
> > to
> > recognize both Scalar and Vector variant of typical saturating expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single idiom for
> > expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and float_truncate for
> > truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for scalars. 
> > For
> > Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> > nothing actually generates this at the moment. it's just an unused tree 
> > code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of these 
> > also
> > exist in their scalar form, as such detecting them in the vectorizer would 
> > be
> > the wrong place.
> >
> > 1. Rounding:
> >a) Use match.pd to rewrite various rounding idioms to shifts.
> >b) Use backwards or forward prop to rewrite these to internal functions
> >   where even if the target does not support these rounding instructions 
> > they
> >   have a chance to provide a more efficient implementation than what 
> > would
> >   be generated normally.
> >
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into 
> > min/max
> >   operations which opens up the expressions to further optimizations.
> >b) Use backwards or forward prop to convert to internal functions if the
> >   resulting min/max expression still meet the criteria for being a
> >   saturating expression.  This follows the algorithm as outlined in "The
> >   Software Vectorization handbook" by Aart J.C. Bik.
> >
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during 
> > Vectorization
> >   we would overestimate the cost of performing the saturation.  The 
> > constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an internal 
> > function
> > for all operations supported, or if it should be modelled as an internal FN 
> > which
> > just "marks" the operation as rounding/saturating. After all, the only 
> > difference
> > between a normal and saturating expression in RTL is the xx_truncate RTL 
> > surrounding
> > the expression.  Doing so would also mean that all targets whom have 
> > saturating
> > instructions would automatically benefit from this.
> 
> I might have misunderstood what you meant here, but the *_truncate
> RTL codes are true truncations: the operand has to be wider than the
> result.  Using this representation for general arithmetic is a problem
> if you're operating at the maximum size that the target supports natively.
> E.g. representing a 64-bit saturating addition as:
> 
>   - extend to 128 bits
>   - do a 128-bit addition
>   - truncate to 64 bits
> 
> is going to be hard to cost and code-generate on targets that don't support
> native 128-bit operations (or at least, don't support them cheaply).
> This might not be a problem when recognising C idioms, since the C source
> code has to be able do the wider operation before truncating the result,
> but it could be a problem if we provide built-in functions or if we want
> to introduce compiler-generated saturating operations.
> 
> RTL already has per-operation saturation such as ss_plus/us_plus,
> ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
> ss_ashift/us_ashift and ss_abs.  I think we should do the same
> in gimple, using internal functions like you say.

I think that for followup optimizations using regular arithmetic
ops and just new saturating truncations is better.  Maybe we can
also do both, with first only matching the actual saturation
with a new tree code and then later match the optabs the target
actually supports (in ISEL for example)?

Truly saturating ops might provide an interesting example how
to deal with -ftrapv - one might think we can now simply
use the trapping optabs as internal functions to reflect
-ftrapv onto the IL ...

Richard.

> Thanks,
> Richard
> 
> 

-- 
Richard Biener 
SUSE 

RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc



> -Original Message-
> From: Richard Sandiford 
> Sent: Wednesday, May 12, 2021 9:48 AM
> To: Tamar Christina 
> Cc: gcc@gcc.gnu.org; Richard Biener 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> Tamar Christina  writes:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single
> > idiom for expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and
> > float_truncate for truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for
> > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > but it looks like nothing actually generates this at the moment. it's just 
> > an
> unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of
> > these also exist in their scalar form, as such detecting them in the
> > vectorizer would be the wrong place.
> >
> > 1. Rounding:
> >a) Use match.pd to rewrite various rounding idioms to shifts.
> >b) Use backwards or forward prop to rewrite these to internal functions
> >   where even if the target does not support these rounding instructions
> they
> >   have a chance to provide a more efficient implementation than what
> would
> >   be generated normally.
> >
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >   operations which opens up the expressions to further optimizations.
> >b) Use backwards or forward prop to convert to internal functions if the
> >   resulting min/max expression still meet the criteria for being a
> >   saturating expression.  This follows the algorithm as outlined in "The
> >   Software Vectorization handbook" by Aart J.C. Bik.
> >
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during
> Vectorization
> >   we would overestimate the cost of performing the saturation.  The
> constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> 
> I might have misunderstood what you meant here, but the *_truncate RTL
> codes are true truncations: the operand has to be wider than the result.
> Using this representation for general arithmetic is a problem if you're
> operating at the maximum size that the target supports natively.
> E.g. representing a 64-bit saturating addition as:
> 
>   - extend to 128 bits
>   - do a 128-bit addition
>   - truncate to 64 bits
> 

Ah, that wasn't clear from the documentation.. The one for the normal truncate
mentions that the modes have to be wider but the _truncate and friends don't
mention this constraint.  That would indeed not work..

> is going to be hard to cost and code-generate on targets that don't support
> native 128-bit operations (or at least, don't support them cheaply).
> This might not be a problem when recognising C idioms, since the C source
> code has to be able do the wider operation before tru

RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc



> -Original Message-
> From: Joseph Myers 
> Sent: Tuesday, May 11, 2021 6:01 PM
> To: David Brown 
> Cc: Tamar Christina ; gcc@gcc.gnu.org; Richard
> Sandiford ; Richard Biener
> 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On Tue, 11 May 2021, David Brown wrote:
> 
> > It is also worth noting that gcc already has support for saturating
> > types on some targets:
> >
> > <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
> >
> > My testing of these (quite a long time ago) left me with a feeling
> > that it was not a feature anyone had worked hard to optimise -
> > certainly it
> 
> The implementation isn't well-integrated with any optimizations for
> arithmetic on ordinary integer types / modes, because it has its own
> completely separate machine modes and operations on those.  I still think it
> would be better to have a GIMPLE pass that lowers from fixed-point types to
> saturating etc. operations on ordinary integer types, as I said in
> <https://gcc.gnu.org/legacy-ml/gcc-patches/2011-05/msg00846.html>.
> 

Oh, thanks for the link! That's a very useful read!

Cheers,
Tamar

> Note however that such lowering should be more or less independent of
> what's being discussed in this thread - this thread is about better
> optimization of such operations on ordinary types (with or without built-in
> functions of some kind in addition to recognition of such operations written
> in generic C), which you can do independently of what's done with fixed-
> point types.
> 
> --
> Joseph S. Myers
> jos...@codesourcery.com


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc
Hi,

> -Original Message-
> From: Segher Boessenkool 
> Sent: Tuesday, May 11, 2021 4:43 PM
> To: Tamar Christina 
> Cc: gcc@gcc.gnu.org; Richard Sandiford ;
> Richard Biener 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> Hi!
> 
> On Tue, May 11, 2021 at 05:37:34AM +, Tamar Christina via Gcc wrote:
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> 
> That can be done quite a bit better, branchless at least.  Same for all
> examples here probably.

Do you mean in the example? Sure, I wanted to keep it simple 

> 
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >   operations which opens up the expressions to further optimizations.
> 
> You'll have to do the operation in a bigger mode for that.  (This is also 
> true for
> rounding, in many cases).
> 
> This makes new internal functions more attractive / more feasible.

True, but the internal function doesn't need to model the wider mode right?
So if you're saturating an int, the internal-fn would work on int,int,int.

> 
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during
> Vectorization
> >   we would overestimate the cost of performing the saturation.  The
> constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> 
> Combine is almost never the right answer if you want to combine more than
> two or three RTL insns.  It can be done, but combine will always write the
> combined instruction in simplest terms, which tends to mean that if you
> combine more insns there can be very many outcomes that you all need to
> recognise as insns in your machine description.

Yeah, ideally I would like to handle it before it gets to expand.

> 
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> 
> I think you will have to experiment with both approaches to get a good
> feeling for the tradeoff.

Fair enough,

Thanks for the feedback!

Cheers,
Tamar
> 
> 
> Segher


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc



> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, May 11, 2021 12:45 PM
> To: Tamar Christina 
> Cc: gcc@gcc.gnu.org; Richard Sandiford ;
> Jakub Jelinek 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On Tue, 11 May 2021, Tamar Christina wrote:
> 
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single
> > idiom for expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and
> > float_truncate for truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for
> > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > but it looks like nothing actually generates this at the moment. it's just 
> > an
> unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of
> > these also exist in their scalar form, as such detecting them in the
> > vectorizer would be the wrong place.
> >
> > 1. Rounding:
> >a) Use match.pd to rewrite various rounding idioms to shifts.
> >b) Use backwards or forward prop to rewrite these to internal functions
> >   where even if the target does not support these rounding instructions
> they
> >   have a chance to provide a more efficient implementation than what
> would
> >   be generated normally.
> >
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >   operations which opens up the expressions to further optimizations.
> >b) Use backwards or forward prop to convert to internal functions if the
> >   resulting min/max expression still meet the criteria for being a
> >   saturating expression.  This follows the algorithm as outlined in "The
> >   Software Vectorization handbook" by Aart J.C. Bik.
> >
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during
> Vectorization
> >   we would overestimate the cost of performing the saturation.  The
> constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> >
> > But it does mean a small adjustment to the costing, which would need
> > to cost the internal function call and the argument together as a whole.
> >
> > Any feedback is appreciated to minimize the number of changes required
> > to the final patch.  Any objections to the outlined approach?
> 
> I think it makes sense to pattern-match the operations on GIMPLE and follow
> the approach take by __builtin_add_overflow & friends.
> Maybe quickly check whether clang provides some builtins already which we
> could implement.
> 

Hmm so taking a look at __builtin_add_overflow and friends it looks like they 
use
Internal functions like ADD_OVERFLOW. Biggest difference being that it sets a 
condition
flag.  This does me wonder if something like

int f (int a, int b)
{
int res;
if (__builtin_add_overflow (a, b, ))
  {
  if (res >= 0)
  

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Richard Sandiford via Gcc
Tamar Christina  writes:
> Hi All,
>
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
>
> As an example:
>
> 1. Saturating addition:
>char sat (char a, char b)
>{
>   int tmp = a + b;
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
>
> 2. Saturating abs:
>char sat (char a)
>{
>   int tmp = abs (a);
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
>
> 3. Rounding shifts
>char rndshift (char dc)
>{
>   int round_const = 1 << (shift - 1);
>   return (dc + round_const) >> shift;
>}
>
> etc.
>
> Of course the first issue is that C does not really have a single idiom for
> expressing this.
>
> At the RTL level we have ss_truncate and us_truncate and float_truncate for
> truncation.
>
> At the Tree level we have nothing for truncation (I believe) for scalars. For
> Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> nothing actually generates this at the moment. it's just an unused tree code.
>
> For rounding there doesn't seem to be any existing infrastructure.
>
> The proposal to handle these are as follow, keep in mind that all of these 
> also
> exist in their scalar form, as such detecting them in the vectorizer would be
> the wrong place.
>
> 1. Rounding:
>a) Use match.pd to rewrite various rounding idioms to shifts.
>b) Use backwards or forward prop to rewrite these to internal functions
>   where even if the target does not support these rounding instructions 
> they
>   have a chance to provide a more efficient implementation than what would
>   be generated normally.
>
> 2. Saturation:
>a) Use match.pd to rewrite the various saturation expressions into min/max
>   operations which opens up the expressions to further optimizations.
>b) Use backwards or forward prop to convert to internal functions if the
>   resulting min/max expression still meet the criteria for being a
>   saturating expression.  This follows the algorithm as outlined in "The
>   Software Vectorization handbook" by Aart J.C. Bik.
>
>   We could get the right instructions by using combine if we don't rewrite
>   the instructions to an internal function, however then during 
> Vectorization
>   we would overestimate the cost of performing the saturation.  The 
> constants
>   will the also be loaded into registers and so becomes a lot more 
> difficult
>   to cleanup solely in the backend.
>
> The one thing I am wondering about is whether we would need an internal 
> function
> for all operations supported, or if it should be modelled as an internal FN 
> which
> just "marks" the operation as rounding/saturating. After all, the only 
> difference
> between a normal and saturating expression in RTL is the xx_truncate RTL 
> surrounding
> the expression.  Doing so would also mean that all targets whom have 
> saturating
> instructions would automatically benefit from this.

I might have misunderstood what you meant here, but the *_truncate
RTL codes are true truncations: the operand has to be wider than the
result.  Using this representation for general arithmetic is a problem
if you're operating at the maximum size that the target supports natively.
E.g. representing a 64-bit saturating addition as:

  - extend to 128 bits
  - do a 128-bit addition
  - truncate to 64 bits

is going to be hard to cost and code-generate on targets that don't support
native 128-bit operations (or at least, don't support them cheaply).
This might not be a problem when recognising C idioms, since the C source
code has to be able do the wider operation before truncating the result,
but it could be a problem if we provide built-in functions or if we want
to introduce compiler-generated saturating operations.

RTL already has per-operation saturation such as ss_plus/us_plus,
ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
ss_ashift/us_ashift and ss_abs.  I think we should do the same
in gimple, using internal functions like you say.

Thanks,
Richard



Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread David Brown
On 12/05/2021 10:00, Tamar Christina wrote:
> Hi David, 
> 
>> -Original Message-
>> From: David Brown 
>> Sent: Tuesday, May 11, 2021 11:04 AM
>> To: Tamar Christina ; gcc@gcc.gnu.org
>> Cc: Richard Sandiford ; Richard Biener
>> 
>> Subject: Re: [RFC] Implementing detection of saturation and rounding
>> arithmetic
>>
>> On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
>>> Hi All,
>>>
>>> We are looking to implement saturation support in the compiler.  The
>>> aim is to recognize both Scalar and Vector variant of typical saturating
>> expressions.
>>>
>>> As an example:
>>>
>>> 1. Saturating addition:
>>>char sat (char a, char b)
>>>{
>>>   int tmp = a + b;
>>>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>>>}
>>>
>>> 2. Saturating abs:
>>>char sat (char a)
>>>{
>>>   int tmp = abs (a);
>>>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>>>}
>>>
>>> 3. Rounding shifts
>>>char rndshift (char dc)
>>>{
>>>   int round_const = 1 << (shift - 1);
>>>   return (dc + round_const) >> shift;
>>>}
>>>
>>> etc.
>>>
>>
>> I can't comment on the implementation part - I don't know anything about it.
>>
>> However, in your examples above I see a few points.
>>
>> One is your use of "char".  "char" is a type that varies in signedness from
>> target to target (and also depending on compiler flags), and is slightly
>> different in C and C++ ('a' has char type in C++, int type in C).  If you 
>> must use
>> "char" in arithmetic contexts, I recommend using "signed char" or "unsigned
>> char" explicitly.
>>
>> I would rather recommend you use the size-specific  types - int8_t,
>> etc., - as being more appropriate for this kind of thing.
>> (AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to see 
>> the
>> sizes you need for the "tmp" value as you make functions for bigger sizes -
>> remember that on some gcc targets, "int" is 16-bit.
> 
> Indeed, but unfortunately we're targeting existing code that's quite old, so 
> the
> C99 fixed sized types may not have been used.

I was thinking of your examples (for testing) and in your implementation
- and you have C99 in those cases.  If you make gcc optimisations that
you confirm work correctly for int8_t, int16_t, int32_t, int64_t, (and
possibly __int128), and the unsigned types, then they will automatically
work with fundamental types (signed char, short, etc.) and any other
typedefs users have in their code.


> 
>>
>> It is also worth noting that gcc already has support for saturating types on
>> some targets:
>>
>> <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
>>
>> My testing of these (quite a long time ago) left me with a feeling that it 
>> was
>> not a feature anyone had worked hard to optimise - certainly it did not make
>> use of saturating arithmetic instructions available on some of the targets I
>> tested (ARM Cortex M4, for example).  But it is possible that there are 
>> things
>> here that would be of use to you.  (I am not convinced that it is worth
>> spending time optimising the implementation of these - I don't think the
>> N1169 types are much used by
>> anyone.)
>>
> 
> I did notice these and was wondering if it makes sense to use them, but I'd 
> need
> to check how well supported they are.  For instance would need to check if 
> they
> don't block vectorization and stuff like that.
> 

I get the impression that they are likely to fail with vectorisation or
other more advanced optimisations (though I am no expert here).  I
mentioned them in case there is any inspiration or ideas you can copy,
rather than because I think they are useful.  I have not seen any use of
these types in real code, and I think the whole N1169 / TR 18037 was a
case of too little, too inconvenient to use, too late.  Few compilers
support it, fewer programmers use it.  (The syntax for named address
space is used by many embedded compilers, including gcc, but it is a
simple and obvious syntax extension that was used long before that TR.)

>>
>> While it is always good that the compiler can spot patterns in generic C code
>> and generate optimal instruction sequences, another possibility here would
>> be a set of built-in functions for saturated and rounding arithmetic.  T

RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc
Hi David, 

> -Original Message-
> From: David Brown 
> Sent: Tuesday, May 11, 2021 11:04 AM
> To: Tamar Christina ; gcc@gcc.gnu.org
> Cc: Richard Sandiford ; Richard Biener
> 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> 
> I can't comment on the implementation part - I don't know anything about it.
> 
> However, in your examples above I see a few points.
> 
> One is your use of "char".  "char" is a type that varies in signedness from
> target to target (and also depending on compiler flags), and is slightly
> different in C and C++ ('a' has char type in C++, int type in C).  If you 
> must use
> "char" in arithmetic contexts, I recommend using "signed char" or "unsigned
> char" explicitly.
> 
> I would rather recommend you use the size-specific  types - int8_t,
> etc., - as being more appropriate for this kind of thing.
> (AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to see the
> sizes you need for the "tmp" value as you make functions for bigger sizes -
> remember that on some gcc targets, "int" is 16-bit.

Indeed, but unfortunately we're targeting existing code that's quite old, so the
C99 fixed sized types may not have been used.

> 
> It is also worth noting that gcc already has support for saturating types on
> some targets:
> 
> <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
> 
> My testing of these (quite a long time ago) left me with a feeling that it was
> not a feature anyone had worked hard to optimise - certainly it did not make
> use of saturating arithmetic instructions available on some of the targets I
> tested (ARM Cortex M4, for example).  But it is possible that there are things
> here that would be of use to you.  (I am not convinced that it is worth
> spending time optimising the implementation of these - I don't think the
> N1169 types are much used by
> anyone.)
> 

I did notice these and was wondering if it makes sense to use them, but I'd need
to check how well supported they are.  For instance would need to check if they
don't block vectorization and stuff like that.

> 
> While it is always good that the compiler can spot patterns in generic C code
> and generate optimal instruction sequences, another possibility here would
> be a set of built-in functions for saturated and rounding arithmetic.  That
> would take the guesswork out of it for users - if there code requires 
> efficient
> saturated addition, they can use __builtin_add_sat and get the best their
> target can offer (just like __builtin_add_overflow, and that kind of thing).
> And it might be easier to implement in the compiler.
> 

We already have neon intrinics for this, but these operations happen quite often
in media processing and machine learning frameworks which don't use intrinsics
or builtins.  So the big gain for us here really is the idiom recognition 

Thanks for the feedback!

Cheers,
Tamar

> 
> I hope these comments give you a few ideas or useful thoughts.
> 
> David



Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread David Brown
On 11/05/2021 19:00, Joseph Myers wrote:
> On Tue, 11 May 2021, David Brown wrote:
> 
>> It is also worth noting that gcc already has support for saturating
>> types on some targets:
>>
>> 
>>
>> My testing of these (quite a long time ago) left me with a feeling that
>> it was not a feature anyone had worked hard to optimise - certainly it
> 
> The implementation isn't well-integrated with any optimizations for 
> arithmetic on ordinary integer types / modes, because it has its own 
> completely separate machine modes and operations on those.  I still think 
> it would be better to have a GIMPLE pass that lowers from fixed-point 
> types to saturating etc. operations on ordinary integer types, as I said 
> in .

That would make sense (to me anyway, with my limited knowledge),
especially if this work on ordinary integer types pays off.  That would
surely let you simplify the sat/accum/fract type handling while
simultaneously making it work on a wider variety of targets.

> 
> Note however that such lowering should be more or less independent of 
> what's being discussed in this thread - this thread is about better 
> optimization of such operations on ordinary types (with or without 
> built-in functions of some kind in addition to recognition of such 
> operations written in generic C), which you can do independently of 
> what's done with fixed-point types.
> 

Yes, indeed.  I mentioned them for comparison, and in case there were
ideas which could be copied.  I don't think the N1169/TR 18037 types are
much (if ever) used in real code - with Tamar's planned optimisations,
the use-cases for them will be even fewer.


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-11 Thread Joseph Myers
On Tue, 11 May 2021, David Brown wrote:

> It is also worth noting that gcc already has support for saturating
> types on some targets:
> 
> 
> 
> My testing of these (quite a long time ago) left me with a feeling that
> it was not a feature anyone had worked hard to optimise - certainly it

The implementation isn't well-integrated with any optimizations for 
arithmetic on ordinary integer types / modes, because it has its own 
completely separate machine modes and operations on those.  I still think 
it would be better to have a GIMPLE pass that lowers from fixed-point 
types to saturating etc. operations on ordinary integer types, as I said 
in .

Note however that such lowering should be more or less independent of 
what's being discussed in this thread - this thread is about better 
optimization of such operations on ordinary types (with or without 
built-in functions of some kind in addition to recognition of such 
operations written in generic C), which you can do independently of 
what's done with fixed-point types.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-11 Thread Segher Boessenkool
Hi!

On Tue, May 11, 2021 at 05:37:34AM +, Tamar Christina via Gcc wrote:
> 2. Saturating abs:
>char sat (char a)
>{
>   int tmp = abs (a);
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}

That can be done quite a bit better, branchless at least.  Same for all
examples here probably.

> 2. Saturation:
>a) Use match.pd to rewrite the various saturation expressions into min/max
>   operations which opens up the expressions to further optimizations.

You'll have to do the operation in a bigger mode for that.  (This is
also true for rounding, in many cases).

This makes new internal functions more attractive / more feasible.

>   We could get the right instructions by using combine if we don't rewrite
>   the instructions to an internal function, however then during 
> Vectorization
>   we would overestimate the cost of performing the saturation.  The 
> constants
>   will the also be loaded into registers and so becomes a lot more 
> difficult
>   to cleanup solely in the backend.

Combine is almost never the right answer if you want to combine more
than two or three RTL insns.  It can be done, but combine will always
write the combined instruction in simplest terms, which tends to mean
that if you combine more insns there can be very many outcomes that you
all need to recognise as insns in your machine description.

> The one thing I am wondering about is whether we would need an internal 
> function
> for all operations supported, or if it should be modelled as an internal FN 
> which
> just "marks" the operation as rounding/saturating. After all, the only 
> difference
> between a normal and saturating expression in RTL is the xx_truncate RTL 
> surrounding
> the expression.  Doing so would also mean that all targets whom have 
> saturating
> instructions would automatically benefit from this.

I think you will have to experiment with both approaches to get a good
feeling for the tradeoff.


Segher


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-11 Thread Richard Biener
On Tue, 11 May 2021, Tamar Christina wrote:

> Hi All,
> 
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
> 
> As an example:
> 
> 1. Saturating addition:
>char sat (char a, char b)
>{
>   int tmp = a + b;
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
> 
> 2. Saturating abs:
>char sat (char a)
>{
>   int tmp = abs (a);
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
> 
> 3. Rounding shifts
>char rndshift (char dc)
>{
>   int round_const = 1 << (shift - 1);
>   return (dc + round_const) >> shift;
>}
> 
> etc.
> 
> Of course the first issue is that C does not really have a single idiom for
> expressing this.
> 
> At the RTL level we have ss_truncate and us_truncate and float_truncate for
> truncation.
> 
> At the Tree level we have nothing for truncation (I believe) for scalars. For
> Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> nothing actually generates this at the moment. it's just an unused tree code.
> 
> For rounding there doesn't seem to be any existing infrastructure.
> 
> The proposal to handle these are as follow, keep in mind that all of these 
> also
> exist in their scalar form, as such detecting them in the vectorizer would be
> the wrong place.
> 
> 1. Rounding:
>a) Use match.pd to rewrite various rounding idioms to shifts.
>b) Use backwards or forward prop to rewrite these to internal functions
>   where even if the target does not support these rounding instructions 
> they
>   have a chance to provide a more efficient implementation than what would
>   be generated normally.
> 
> 2. Saturation:
>a) Use match.pd to rewrite the various saturation expressions into min/max
>   operations which opens up the expressions to further optimizations.
>b) Use backwards or forward prop to convert to internal functions if the
>   resulting min/max expression still meet the criteria for being a
>   saturating expression.  This follows the algorithm as outlined in "The
>   Software Vectorization handbook" by Aart J.C. Bik.
> 
>   We could get the right instructions by using combine if we don't rewrite
>   the instructions to an internal function, however then during 
> Vectorization
>   we would overestimate the cost of performing the saturation.  The 
> constants
>   will the also be loaded into registers and so becomes a lot more 
> difficult
>   to cleanup solely in the backend.
> 
> The one thing I am wondering about is whether we would need an internal 
> function
> for all operations supported, or if it should be modelled as an internal FN 
> which
> just "marks" the operation as rounding/saturating. After all, the only 
> difference
> between a normal and saturating expression in RTL is the xx_truncate RTL 
> surrounding
> the expression.  Doing so would also mean that all targets whom have 
> saturating
> instructions would automatically benefit from this.
> 
> But it does mean a small adjustment to the costing, which would need to cost 
> the
> internal function call and the argument together as a whole.
> 
> Any feedback is appreciated to minimize the number of changes required to the
> final patch.  Any objections to the outlined approach?

I think it makes sense to pattern-match the operations on GIMPLE
and follow the approach take by __builtin_add_overflow & friends.
Maybe quickly check whether clang provides some builtins already
which we could implement.

There's some appeal to mimicing what RTL does - thus have
the saturation be represented as saturating truncation.
Maybe that's what users expect of builtins as well.

I'm not sure what the rounding shift would do - 'shift' isn't
an argument to rndshift here.  It feels like it's a
rounding division but only by powers of two.  Does
ROUND_DIV_EXPR already provide the desired semantics?

Richard.


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-11 Thread David Brown
On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
> Hi All,
> 
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
> 
> As an example:
> 
> 1. Saturating addition:
>char sat (char a, char b)
>{
>   int tmp = a + b;
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
> 
> 2. Saturating abs:
>char sat (char a)
>{
>   int tmp = abs (a);
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
> 
> 3. Rounding shifts
>char rndshift (char dc)
>{
>   int round_const = 1 << (shift - 1);
>   return (dc + round_const) >> shift;
>}
> 
> etc.
> 

I can't comment on the implementation part - I don't know anything about it.

However, in your examples above I see a few points.

One is your use of "char".  "char" is a type that varies in signedness
from target to target (and also depending on compiler flags), and is
slightly different in C and C++ ('a' has char type in C++, int type in
C).  If you must use "char" in arithmetic contexts, I recommend using
"signed char" or "unsigned char" explicitly.

I would rather recommend you use the size-specific  types -
int8_t, etc., - as being more appropriate for this kind of thing.
(AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to
see the sizes you need for the "tmp" value as you make functions for
bigger sizes - remember that on some gcc targets, "int" is 16-bit.

It is also worth noting that gcc already has support for saturating
types on some targets:



My testing of these (quite a long time ago) left me with a feeling that
it was not a feature anyone had worked hard to optimise - certainly it
did not make use of saturating arithmetic instructions available on some
of the targets I tested (ARM Cortex M4, for example).  But it is
possible that there are things here that would be of use to you.  (I am
not convinced that it is worth spending time optimising the
implementation of these - I don't think the N1169 types are much used by
anyone.)


While it is always good that the compiler can spot patterns in generic C
code and generate optimal instruction sequences, another possibility
here would be a set of built-in functions for saturated and rounding
arithmetic.  That would take the guesswork out of it for users - if
there code requires efficient saturated addition, they can use
__builtin_add_sat and get the best their target can offer (just like
__builtin_add_overflow, and that kind of thing).  And it might be easier
to implement in the compiler.


I hope these comments give you a few ideas or useful thoughts.

David



[RFC] Implementing detection of saturation and rounding arithmetic

2021-05-10 Thread Tamar Christina via Gcc
Hi All,

We are looking to implement saturation support in the compiler.  The aim is to
recognize both Scalar and Vector variant of typical saturating expressions.

As an example:

1. Saturating addition:
   char sat (char a, char b)
   {
  int tmp = a + b;
  return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
   }

2. Saturating abs:
   char sat (char a)
   {
  int tmp = abs (a);
  return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
   }

3. Rounding shifts
   char rndshift (char dc)
   {
  int round_const = 1 << (shift - 1);
  return (dc + round_const) >> shift;
   }

etc.

Of course the first issue is that C does not really have a single idiom for
expressing this.

At the RTL level we have ss_truncate and us_truncate and float_truncate for
truncation.

At the Tree level we have nothing for truncation (I believe) for scalars. For
Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
nothing actually generates this at the moment. it's just an unused tree code.

For rounding there doesn't seem to be any existing infrastructure.

The proposal to handle these are as follow, keep in mind that all of these also
exist in their scalar form, as such detecting them in the vectorizer would be
the wrong place.

1. Rounding:
   a) Use match.pd to rewrite various rounding idioms to shifts.
   b) Use backwards or forward prop to rewrite these to internal functions
  where even if the target does not support these rounding instructions they
  have a chance to provide a more efficient implementation than what would
  be generated normally.

2. Saturation:
   a) Use match.pd to rewrite the various saturation expressions into min/max
  operations which opens up the expressions to further optimizations.
   b) Use backwards or forward prop to convert to internal functions if the
  resulting min/max expression still meet the criteria for being a
  saturating expression.  This follows the algorithm as outlined in "The
  Software Vectorization handbook" by Aart J.C. Bik.

  We could get the right instructions by using combine if we don't rewrite
  the instructions to an internal function, however then during 
Vectorization
  we would overestimate the cost of performing the saturation.  The 
constants
  will the also be loaded into registers and so becomes a lot more difficult
  to cleanup solely in the backend.

The one thing I am wondering about is whether we would need an internal function
for all operations supported, or if it should be modelled as an internal FN 
which
just "marks" the operation as rounding/saturating. After all, the only 
difference
between a normal and saturating expression in RTL is the xx_truncate RTL 
surrounding
the expression.  Doing so would also mean that all targets whom have saturating
instructions would automatically benefit from this.

But it does mean a small adjustment to the costing, which would need to cost the
internal function call and the argument together as a whole.

Any feedback is appreciated to minimize the number of changes required to the
final patch.  Any objections to the outlined approach?

Thanks,
Tamar