Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread David Brown
On 11/05/2021 19:00, Joseph Myers wrote:
> On Tue, 11 May 2021, David Brown wrote:
> 
>> It is also worth noting that gcc already has support for saturating
>> types on some targets:
>>
>> 
>>
>> My testing of these (quite a long time ago) left me with a feeling that
>> it was not a feature anyone had worked hard to optimise - certainly it
> 
> The implementation isn't well-integrated with any optimizations for 
> arithmetic on ordinary integer types / modes, because it has its own 
> completely separate machine modes and operations on those.  I still think 
> it would be better to have a GIMPLE pass that lowers from fixed-point 
> types to saturating etc. operations on ordinary integer types, as I said 
> in .

That would make sense (to me anyway, with my limited knowledge),
especially if this work on ordinary integer types pays off.  That would
surely let you simplify the sat/accum/fract type handling while
simultaneously making it work on a wider variety of targets.

> 
> Note however that such lowering should be more or less independent of 
> what's being discussed in this thread - this thread is about better 
> optimization of such operations on ordinary types (with or without 
> built-in functions of some kind in addition to recognition of such 
> operations written in generic C), which you can do independently of 
> what's done with fixed-point types.
> 

Yes, indeed.  I mentioned them for comparison, and in case there were
ideas which could be copied.  I don't think the N1169/TR 18037 types are
much (if ever) used in real code - with Tamar's planned optimisations,
the use-cases for them will be even fewer.


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc
Hi David, 

> -Original Message-
> From: David Brown 
> Sent: Tuesday, May 11, 2021 11:04 AM
> To: Tamar Christina ; gcc@gcc.gnu.org
> Cc: Richard Sandiford ; Richard Biener
> 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> 
> I can't comment on the implementation part - I don't know anything about it.
> 
> However, in your examples above I see a few points.
> 
> One is your use of "char".  "char" is a type that varies in signedness from
> target to target (and also depending on compiler flags), and is slightly
> different in C and C++ ('a' has char type in C++, int type in C).  If you 
> must use
> "char" in arithmetic contexts, I recommend using "signed char" or "unsigned
> char" explicitly.
> 
>> I would rather recommend you use the size-specific <stdint.h> types - int8_t,
> etc., - as being more appropriate for this kind of thing.
> (AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to see the
> sizes you need for the "tmp" value as you make functions for bigger sizes -
> remember that on some gcc targets, "int" is 16-bit.

Indeed, but unfortunately we're targeting existing code that's quite old, so the
C99 fixed-size types may not have been used.

> 
> It is also worth noting that gcc already has support for saturating types on
> some targets:
> 
> 
> 
> My testing of these (quite a long time ago) left me with a feeling that it was
> not a feature anyone had worked hard to optimise - certainly it did not make
> use of saturating arithmetic instructions available on some of the targets I
> tested (ARM Cortex M4, for example).  But it is possible that there are things
> here that would be of use to you.  (I am not convinced that it is worth
> spending time optimising the implementation of these - I don't think the
> N1169 types are much used by
> anyone.)
> 

I did notice these and was wondering if it makes sense to use them, but I'd need
to check how well supported they are.  For instance, I'd need to check that they
don't block vectorization and stuff like that.

> 
> While it is always good that the compiler can spot patterns in generic C code
> and generate optimal instruction sequences, another possibility here would
> be a set of built-in functions for saturated and rounding arithmetic.  That
> would take the guesswork out of it for users - if their code requires 
> efficient
> saturated addition, they can use __builtin_add_sat and get the best their
> target can offer (just like __builtin_add_overflow, and that kind of thing).
> And it might be easier to implement in the compiler.
> 

We already have neon intrinsics for this, but these operations happen quite often
in media processing and machine learning frameworks which don't use intrinsics
or builtins.  So the big gain for us here really is the idiom recognition 😊
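
For instance, a typical pattern in such frameworks looks roughly like this
(an illustrative sketch using <stdint.h> types, not code from any particular
project):

   void add_clamp_u8 (uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
   {
      for (int i = 0; i < n; i++)
        {
          int t = a[i] + b[i];
          dst[i] = t > 255 ? 255 : t;   /* unsigned saturating add per element */
        }
   }

Recognizing the clamp here is what lets the vectorizer pick the saturating
vector instructions directly.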

Thanks for the feedback!

Cheers,
Tamar

> 
> I hope these comments give you a few ideas or useful thoughts.
> 
> David



Re: Speed of compiling gimple-match.c

2021-05-12 Thread Richard Biener via Gcc
On Tue, May 11, 2021 at 5:01 PM Segher Boessenkool
 wrote:
>
> On Tue, May 04, 2021 at 10:40:38AM +0200, Richard Biener via Gcc wrote:
> > On Mon, May 3, 2021 at 11:10 PM Andrew Pinski via Gcc  
> > wrote:
> > >   I noticed my (highly, -j24) parallel build of GCC is serialized on
> > > compiling gimple-match.c.  Has anyone looked into splitting this
> > > generated file into multiple files?
> >
> > There were threads about this in the past, yes.  There's the
> > possibility to use LTO for this as well (also mentioned in those
> > threads).  Note it's not easy to split in a meaningful way in
> > genmatch.c
>
> But it will have to be handled at least somewhat soon: on not huge
> parallelism (-j120 for example) building *-match.c takes longer than
> building everything else in gcc/ together (wallclock time), and it is a
> huge part of regstrap time (bigger than running all of the testsuite!)

I would classify -j120 as "huge parallelism" ;)  Testing time still
dominates my builds (with -j24) where bootstrap takes ~20 mins
but testing another 40.

Is it building stage2 gimple-match.c that you are worried about?
(it's built using the -O0 compiled stage1 compiler - but we at
least should end up using -fno-checking for this build)

Maybe you can do some experiments - like add
-fno-inline-functions-called-once and change
genmatch.c:3766 to split out single uses as well
(should decrease function sizes).

There's the option to make all functions external in
gimple-match.c so that splitting the file at arbitrary points
becomes possible (directly from genmatch).  We'd then need
some internal header with all declarations, or alternatively
some clever logic in genmatch to only externalize functions
needed from multiple split files.

That said - ideas to reduce the size of the generated
code are welcome as well.

There's also pattern ordering in match.pd that can
make a difference because we're honoring
first-match and thus have to re-start matching from
outermost on conflicts (most of the time the actual
order in match.pd is just random).  If you add -v
to genmatch then you'll see

/home/rguenther/src/gcc3/gcc/match.pd:6092:10 warning: failed to merge
decision tree node
   (cmp (op@3 @0 INTEGER_CST@1) INTEGER_CST@2)
 ^
/home/rguenther/src/gcc3/gcc/match.pd:4263:11 warning: with the following
(cmp (op @0 REAL_CST@1) REAL_CST@2)
  ^
/home/rguenther/src/gcc3/gcc/match.pd:5164:6 warning: because of the
following which serves as ordering barrier
 (eq @0 integer_onep)
 ^

that means that the simple (eq @0 integer_onep) should match after
4263 but before 6092
(only the latter will actually match the same - the former has
REAL_CST@2 but 5164
uses a predicate integer_onep).  This causes us to emit three switch
(code){ case EQ_EXPR: }
instead of one.

There might be legitimate cases of such order constraints but most of them
are spurious.  "Fixing" them will also make the matching process faster, but
it's quite some legwork where moving a pattern can fix one occurrence but
result in new others.

For me building stage3 gimple-match.o (on a fully loaded system.. :/) is

95.05user 0.42system 1:35.51elapsed 99%CPU (0avgtext+0avgdata
929400maxresident)k
0inputs+0outputs (0major+393349minor)pagefaults 0swaps

and when I use -Wno-error -flto=24 -flinker-output=nolto-rel -r

139.95user 1.79system 0:25.92elapsed 546%CPU (0avgtext+0avgdata
538852maxresident)k
0inputs+0outputs (0major+1139679minor)pagefaults 0swaps

the issue of course is that we can't use this for the stage1 build
(unless we detect working
GCC LTO in the host compiler setup).  I suppose those measures show the lower
bound of what should be possible with splitting up the file (LTO
splits to 128 pieces),
so for me it's a 4x speedup in wallclock time despite the overhead of
LTO which is
quite noticeable.  -fno-checking also makes a dramatic difference for me.

Richard.

>
> Segher


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread David Brown
On 12/05/2021 10:00, Tamar Christina wrote:
> Hi David, 
> 
>> -Original Message-
>> From: David Brown 
>> Sent: Tuesday, May 11, 2021 11:04 AM
>> To: Tamar Christina ; gcc@gcc.gnu.org
>> Cc: Richard Sandiford ; Richard Biener
>> 
>> Subject: Re: [RFC] Implementing detection of saturation and rounding
>> arithmetic
>>
>> On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
>>> Hi All,
>>>
>>> We are looking to implement saturation support in the compiler.  The
>>> aim is to recognize both Scalar and Vector variant of typical saturating
>> expressions.
>>>
>>> As an example:
>>>
>>> 1. Saturating addition:
>>>char sat (char a, char b)
>>>{
>>>   int tmp = a + b;
>>>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>>>}
>>>
>>> 2. Saturating abs:
>>>char sat (char a)
>>>{
>>>   int tmp = abs (a);
>>>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>>>}
>>>
>>> 3. Rounding shifts
>>>char rndshift (char dc)
>>>{
>>>   int round_const = 1 << (shift - 1);
>>>   return (dc + round_const) >> shift;
>>>}
>>>
>>> etc.
>>>
>>
>> I can't comment on the implementation part - I don't know anything about it.
>>
>> However, in your examples above I see a few points.
>>
>> One is your use of "char".  "char" is a type that varies in signedness from
>> target to target (and also depending on compiler flags), and is slightly
>> different in C and C++ ('a' has char type in C++, int type in C).  If you 
>> must use
>> "char" in arithmetic contexts, I recommend using "signed char" or "unsigned
>> char" explicitly.
>>
>> I would rather recommend you use the size-specific  types - int8_t,
>> etc., - as being more appropriate for this kind of thing.
>> (AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to see 
>> the
>> sizes you need for the "tmp" value as you make functions for bigger sizes -
>> remember that on some gcc targets, "int" is 16-bit.
> 
> Indeed, but unfortunately we're targeting existing code that's quite old, so 
> the
> C99 fixed sized types may not have been used.

I was thinking of your examples (for testing) and of your implementation
- and you have C99 in those cases.  If you make gcc optimisations that
you confirm work correctly for int8_t, int16_t, int32_t, int64_t, (and
possibly __int128), and the unsigned types, then they will automatically
work with fundamental types (signed char, short, etc.) and any other
typedefs users have in their code.
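
For instance, a size-explicit sketch of your first example (using the
<stdint.h> types; illustration only) would be:

   int8_t sat (int8_t a, int8_t b)
   {
      int16_t tmp = a + b;   /* wide enough even where "int" is only 16-bit */
      return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
   }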


> 
>>
>> It is also worth noting that gcc already has support for saturating types on
>> some targets:
>>
>> 
>>
>> My testing of these (quite a long time ago) left me with a feeling that it 
>> was
>> not a feature anyone had worked hard to optimise - certainly it did not make
>> use of saturating arithmetic instructions available on some of the targets I
>> tested (ARM Cortex M4, for example).  But it is possible that there are 
>> things
>> here that would be of use to you.  (I am not convinced that it is worth
>> spending time optimising the implementation of these - I don't think the
>> N1169 types are much used by
>> anyone.)
>>
> 
> I did notice these and was wondering if it makes sense to use them, but I'd 
> need
> to check how well supported they are.  For instance would need to check if 
> they
> don't block vectorization and stuff like that.
> 

I get the impression that they are likely to fail with vectorisation or
other more advanced optimisations (though I am no expert here).  I
mentioned them in case there is any inspiration or ideas you can copy,
rather than because I think they are useful.  I have not seen any use of
these types in real code, and I think the whole N1169 / TR 18037 was a
case of too little, too inconvenient to use, too late.  Few compilers
support it, fewer programmers use it.  (The syntax for named address
space is used by many embedded compilers, including gcc, but it is a
simple and obvious syntax extension that was used long before that TR.)

>>
>> While it is always good that the compiler can spot patterns in generic C code
>> and generate optimal instruction sequences, another possibility here would
>> be a set of built-in functions for saturated and rounding arithmetic.  That
>> would take the guesswork out of it for users - if there code requires 
>> efficient
>> saturated addition, they can use __builtin_add_sat and get the best their
>> target can offer (just like __builtin_add_overflow, and that kind of thing).
>> And it might be easier to implement in the compiler.
>>
> 
> We already have neon intrinics for this, but these operations happen quite 
> often
> in media processing and machine learning frameworks which don't use intrinsics
> or builtins.  So the big gain for us here really is the idiom recognition 😊
> 

Target-specific intrinsics are useful, but it is even better when they
are supported as generic builtins in gcc.  That way they work on all
targets, user sourc

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Richard Sandiford via Gcc
Tamar Christina  writes:
> Hi All,
>
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
>
> As an example:
>
> 1. Saturating addition:
>char sat (char a, char b)
>{
>   int tmp = a + b;
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
>
> 2. Saturating abs:
>char sat (char a)
>{
>   int tmp = abs (a);
>   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>}
>
> 3. Rounding shifts
>char rndshift (char dc)
>{
>   int round_const = 1 << (shift - 1);
>   return (dc + round_const) >> shift;
>}
>
> etc.
>
> Of course the first issue is that C does not really have a single idiom for
> expressing this.
>
> At the RTL level we have ss_truncate and us_truncate and float_truncate for
> truncation.
>
> At the Tree level we have nothing for truncation (I believe) for scalars. For
> Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> nothing actually generates this at the moment. it's just an unused tree code.
>
> For rounding there doesn't seem to be any existing infrastructure.
>
> The proposal to handle these is as follows; keep in mind that all of these
> also exist in their scalar form, so detecting them in the vectorizer would be
> the wrong place.
>
> 1. Rounding:
>a) Use match.pd to rewrite various rounding idioms to shifts.
>b) Use backwards or forward prop to rewrite these to internal functions
>   where even if the target does not support these rounding instructions 
> they
>   have a chance to provide a more efficient implementation than what would
>   be generated normally.
>
> 2. Saturation:
>a) Use match.pd to rewrite the various saturation expressions into min/max
>   operations which opens up the expressions to further optimizations.
>b) Use backwards or forward prop to convert to internal functions if the
>   resulting min/max expression still meet the criteria for being a
>   saturating expression.  This follows the algorithm as outlined in "The
>   Software Vectorization handbook" by Aart J.C. Bik.
>
>   We could get the right instructions by using combine if we don't rewrite
>   the instructions to an internal function, however then during 
> Vectorization
>   we would overestimate the cost of performing the saturation.  The 
> constants
> >   will then also be loaded into registers and so it becomes a lot more 
> difficult
> >   to clean up solely in the backend.
>
> The one thing I am wondering about is whether we would need an internal 
> function
> for all operations supported, or if it should be modelled as an internal FN 
> which
> just "marks" the operation as rounding/saturating. After all, the only 
> difference
> between a normal and saturating expression in RTL is the xx_truncate RTL 
> surrounding
> > the expression.  Doing so would also mean that all targets that have 
> saturating
> instructions would automatically benefit from this.

I might have misunderstood what you meant here, but the *_truncate
RTL codes are true truncations: the operand has to be wider than the
result.  Using this representation for general arithmetic is a problem
if you're operating at the maximum size that the target supports natively.
E.g. representing a 64-bit saturating addition as:

  - extend to 128 bits
  - do a 128-bit addition
  - truncate to 64 bits

is going to be hard to cost and code-generate on targets that don't support
native 128-bit operations (or at least, don't support them cheaply).
This might not be a problem when recognising C idioms, since the C source
code has to be able to do the wider operation before truncating the result,
but it could be a problem if we provide built-in functions or if we want
to introduce compiler-generated saturating operations.

RTL already has per-operation saturation such as ss_plus/us_plus,
ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
ss_ashift/us_ashift and ss_abs.  I think we should do the same
in gimple, using internal functions like you say.
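
For example, the semantics of such a per-operation saturating addition at the
maximum native width can be described without any wider type at all; a rough
C sketch (illustration only, not a proposed implementation):

  #include <stdint.h>

  int64_t sat_add64 (int64_t a, int64_t b)
  {
    int64_t sum;
    if (__builtin_add_overflow (a, b, &sum))   /* detects signed overflow without a 128-bit type */
      return a >= 0 ? INT64_MAX : INT64_MIN;   /* saturate in the direction of the operands' sign */
    return sum;
  }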

Thanks,
Richard



Re: gcc 11.1.0 mpfr

2021-05-12 Thread Richard Biener via Gcc
On Tue, May 11, 2021 at 6:34 PM Serge Belyshev via Gcc  wrote:
>
>
> > $ egrep "mpfr\.h" log/cfg/cfg.gcc-11.1.0.log
> > checking for the correct version of mpfr.h... buggy but acceptable
> >
> > It appears "gcc-11.1.0/contrib/download_prerequisites"
> > specifies "mpfr-3.1.4.tar.bz2" whereas top level 'configure'
> > has "MPFR_VERSION_NUM(3,1,6)".
> >
> > Is there a reason mpfr 3.1.6 isn't being used?
>
> No good reason: that script was not updated with new versions. GCC-11 is
> also known to work fine with the most recent mpfr version 4.1.0.

Yes, the update of the minimum version and the buggy check was
done by Janne w/o adjusting the download script.

Richard.


Re: Speed of compiling gimple-match.c

2021-05-12 Thread Andrew Pinski via Gcc
On Wed, May 12, 2021 at 1:19 AM Richard Biener
 wrote:
>
> On Tue, May 11, 2021 at 5:01 PM Segher Boessenkool
>  wrote:
> >
> > On Tue, May 04, 2021 at 10:40:38AM +0200, Richard Biener via Gcc wrote:
> > > On Mon, May 3, 2021 at 11:10 PM Andrew Pinski via Gcc  
> > > wrote:
> > > >   I noticed my (highly, -j24) parallel build of GCC is serialized on
> > > > compiling gimple-match.c.  Has anyone looked into splitting this
> > > > generated file into multiple files?
> > >
> > > There were threads about this in the past, yes.  There's the
> > > possibility to use LTO for this as well (also mentioned in those
> > > threads).  Note it's not easy to split in a meaningful way in
> > > genmatch.c
> >
> > But it will have to be handled at least somewhat soon: on not huge
> > parallelism (-j120 for example) building *-match.c takes longer than
> > building everything else in gcc/ together (wallclock time), and it is a
> > huge part of regstrap time (bigger than running all of the testsuite!)
>
> I would classify -j120 as "huge parallelism" ;)  Testing time still
> dominates my builds (with -j24) where bootstrap takes ~20 mins
> but testing another 40.

For me, it is around 1 hour bootstrapping and 1 hour testing.

> Is it building stage2 gimple-match.c that you are worried about?
> (it's built using the -O0 compiled stage1 compiler - but we at
> least should end up using -fno-checking for this build)

Yes.  On the machine I was using it takes 15 minutes to compile
gimple-match.c, dominating the whole bootstrap time.
Everything else finished in at most 1-3 minutes.
This is on an aarch64 machine with 24 cores (not threads).

Thanks,
Andrew Pinski

>
> Maybe you can do some experiments - like add
> -fno-inline-functions-called-once and change
> genmatch.c:3766 to split out single uses as well
> (should decrease function sizes).
>
> There's the option to make all functions external in
> gimple-match.c so splitting the file at arbitrary points
> will be possible (directly from genmatch), we'll need
> some internal header with all declarations then
> as well or alternatively some clever logic in
> genmatch to only externalize functions needed from
> mutliple split files.
>
> That said - ideas to reduce the size of the generated
> code are welcome as well.
>
> There's also pattern ordering in match.pd that can
> make a difference because we're honoring
> first-match and thus have to re-start matching from
> outermost on conflicts (most of the time the actual
> oder in match.pd is just random).  If you add -v
> to genmatch then you'll see
>
> /home/rguenther/src/gcc3/gcc/match.pd:6092:10 warning: failed to merge
> decision tree node
>(cmp (op@3 @0 INTEGER_CST@1) INTEGER_CST@2)
>  ^
> /home/rguenther/src/gcc3/gcc/match.pd:4263:11 warning: with the following
> (cmp (op @0 REAL_CST@1) REAL_CST@2)
>   ^
> /home/rguenther/src/gcc3/gcc/match.pd:5164:6 warning: because of the
> following which serves as ordering barrier
>  (eq @0 integer_onep)
>  ^
>
> that means that the simple (eq @0 integer_onep) should match after
> 4263 but before 6092
> (only the latter will actually match the same - the former has
> REAL_CST@2 but 5164
> uses a predicate integer_onep).  This causes us to emit three switch
> (code){ case EQ_EXPR: }
> instead of one.
>
> There might be legitimate cases of such order constraints but most of them
> are spurious.  "Fixing" them will also make the matching process faster, but
> it's quite some legwork where moving a pattern can fix one occurance but
> result in new others.
>
> For me building stage3 gimple-match.o (on a fully loaded system.. :/) is
>
> 95.05user 0.42system 1:35.51elapsed 99%CPU (0avgtext+0avgdata
> 929400maxresident)k
> 0inputs+0outputs (0major+393349minor)pagefaults 0swaps
>
> and when I use -Wno-error -flto=24 -flinker-output=nolto-rel -r
>
> 139.95user 1.79system 0:25.92elapsed 546%CPU (0avgtext+0avgdata
> 538852maxresident)k
> 0inputs+0outputs (0major+1139679minor)pagefaults 0swaps
>
> the issue of course is that we can't use this for the stage1 build
> (unless we detect working
> GCC LTO in the host compiler setup).  I suppose those measures show the lower
> bound of what should be possible with splitting up the file (LTO
> splits to 128 pieces),
> so for me it's a 4x speedup in wallclock time despite the overhead of
> LTO which is
> quite noticable.  -fno-checking also makes a dramatic difference for me.
>
> Richard.
>
> >
> > Segher


Re: Speed of compiling gimple-match.c

2021-05-12 Thread Richard Biener via Gcc
On Wed, May 12, 2021 at 11:03 AM Andrew Pinski  wrote:
>
> On Wed, May 12, 2021 at 1:19 AM Richard Biener
>  wrote:
> >
> > On Tue, May 11, 2021 at 5:01 PM Segher Boessenkool
> >  wrote:
> > >
> > > On Tue, May 04, 2021 at 10:40:38AM +0200, Richard Biener via Gcc wrote:
> > > > On Mon, May 3, 2021 at 11:10 PM Andrew Pinski via Gcc  
> > > > wrote:
> > > > >   I noticed my (highly, -j24) parallel build of GCC is serialized on
> > > > > compiling gimple-match.c.  Has anyone looked into splitting this
> > > > > generated file into multiple files?
> > > >
> > > > There were threads about this in the past, yes.  There's the
> > > > possibility to use LTO for this as well (also mentioned in those
> > > > threads).  Note it's not easy to split in a meaningful way in
> > > > genmatch.c
> > >
> > > But it will have to be handled at least somewhat soon: on not huge
> > > parallelism (-j120 for example) building *-match.c takes longer than
> > > building everything else in gcc/ together (wallclock time), and it is a
> > > huge part of regstrap time (bigger than running all of the testsuite!)
> >
> > I would classify -j120 as "huge parallelism" ;)  Testing time still
> > dominates my builds (with -j24) where bootstrap takes ~20 mins
> > but testing another 40.
>
> For me, it is around 1 hour bootstrapping and 1 hour testing.
>
> > Is it building stage2 gimple-match.c that you are worried about?
> > (it's built using the -O0 compiled stage1 compiler - but we at
> > least should end up using -fno-checking for this build)
>
> Yes.  It takes on the machine I was using 15 minutes to compile
> gimple-match.c, dominating the whole time for bootstrapping.
> Everything else was done in 1-3 minutes max even.
> This is on an aarch64 machine with 24 cores (not threads).

I'm usually using STAGE1_CFLAGS="-O2" to speed up the
"useless" part of the bootstrap cycle...

> Thanks,
> Andrew Pinski
>
> >
> > Maybe you can do some experiments - like add
> > -fno-inline-functions-called-once and change
> > genmatch.c:3766 to split out single uses as well
> > (should decrease function sizes).
> >
> > There's the option to make all functions external in
> > gimple-match.c so splitting the file at arbitrary points
> > will be possible (directly from genmatch), we'll need
> > some internal header with all declarations then
> > as well or alternatively some clever logic in
> > genmatch to only externalize functions needed from
> > mutliple split files.
> >
> > That said - ideas to reduce the size of the generated
> > code are welcome as well.
> >
> > There's also pattern ordering in match.pd that can
> > make a difference because we're honoring
> > first-match and thus have to re-start matching from
> > outermost on conflicts (most of the time the actual
> > oder in match.pd is just random).  If you add -v
> > to genmatch then you'll see
> >
> > /home/rguenther/src/gcc3/gcc/match.pd:6092:10 warning: failed to merge
> > decision tree node
> >(cmp (op@3 @0 INTEGER_CST@1) INTEGER_CST@2)
> >  ^
> > /home/rguenther/src/gcc3/gcc/match.pd:4263:11 warning: with the following
> > (cmp (op @0 REAL_CST@1) REAL_CST@2)
> >   ^
> > /home/rguenther/src/gcc3/gcc/match.pd:5164:6 warning: because of the
> > following which serves as ordering barrier
> >  (eq @0 integer_onep)
> >  ^
> >
> > that means that the simple (eq @0 integer_onep) should match after
> > 4263 but before 6092
> > (only the latter will actually match the same - the former has
> > REAL_CST@2 but 5164
> > uses a predicate integer_onep).  This causes us to emit three switch
> > (code){ case EQ_EXPR: }
> > instead of one.
> >
> > There might be legitimate cases of such order constraints but most of them
> > are spurious.  "Fixing" them will also make the matching process faster, but
> > it's quite some legwork where moving a pattern can fix one occurance but
> > result in new others.
> >
> > For me building stage3 gimple-match.o (on a fully loaded system.. :/) is
> >
> > 95.05user 0.42system 1:35.51elapsed 99%CPU (0avgtext+0avgdata
> > 929400maxresident)k
> > 0inputs+0outputs (0major+393349minor)pagefaults 0swaps
> >
> > and when I use -Wno-error -flto=24 -flinker-output=nolto-rel -r
> >
> > 139.95user 1.79system 0:25.92elapsed 546%CPU (0avgtext+0avgdata
> > 538852maxresident)k
> > 0inputs+0outputs (0major+1139679minor)pagefaults 0swaps
> >
> > the issue of course is that we can't use this for the stage1 build
> > (unless we detect working
> > GCC LTO in the host compiler setup).  I suppose those measures show the 
> > lower
> > bound of what should be possible with splitting up the file (LTO
> > splits to 128 pieces),
> > so for me it's a 4x speedup in wallclock time despite the overhead of
> > LTO which is
> > quite noticable.  -fno-checking also makes a dramatic difference for me.
> >
> > Richard.
> >
> > >
> > > Segher


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc



> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, May 11, 2021 12:45 PM
> To: Tamar Christina 
> Cc: gcc@gcc.gnu.org; Richard Sandiford ;
> Jakub Jelinek 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On Tue, 11 May 2021, Tamar Christina wrote:
> 
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single
> > idiom for expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and
> > float_truncate for truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for
> > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > but it looks like nothing actually generates this at the moment. it's just 
> > an
> unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of
> > these also exist in their scalar form, as such detecting them in the
> > vectorizer would be the wrong place.
> >
> > 1. Rounding:
> >a) Use match.pd to rewrite various rounding idioms to shifts.
> >b) Use backwards or forward prop to rewrite these to internal functions
> >   where even if the target does not support these rounding instructions
> they
> >   have a chance to provide a more efficient implementation than what
> would
> >   be generated normally.
> >
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >   operations which opens up the expressions to further optimizations.
> >b) Use backwards or forward prop to convert to internal functions if the
> >   resulting min/max expression still meet the criteria for being a
> >   saturating expression.  This follows the algorithm as outlined in "The
> >   Software Vectorization handbook" by Aart J.C. Bik.
> >
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during
> Vectorization
> >   we would overestimate the cost of performing the saturation.  The
> constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> >
> > But it does mean a small adjustment to the costing, which would need
> > to cost the internal function call and the argument together as a whole.
> >
> > Any feedback is appreciated to minimize the number of changes required
> > to the final patch.  Any objections to the outlined approach?
> 
> I think it makes sense to pattern-match the operations on GIMPLE and follow
> the approach take by __builtin_add_overflow & friends.
> Maybe quickly check whether clang provides some builtins already which we
> could implement.
> 

Hmm, so taking a look at __builtin_add_overflow and friends, it looks like they
use internal functions like ADD_OVERFLOW, the biggest difference being that it
sets a condition flag.  This does make me wonder if something like

int f (int a, int b)
{
  int res;
  if (__builtin_add_overflow (a, b, &res))
    {
      if (res >= 0)
        return INT_MAX;
      else
        return INT_MIN;
    }
  return res;
}

Should be recognized as saturating as well.  But yeah, following the same
approach we would rewrite the sequence to something like res = .ADD_SAT (a, b);

I know that in the past there were concerns about doing such pattern matching
early, which is why I thought about following the same approach we do with the
table-based ctz recognition and doing it in backprop.

For completeness, the OVERFLOW version of ADD on AArch64 maps to ADDS, but the
saturating one maps to SQADD.

> There's some appeal to mimicing what RTL does - thus ha

RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc
Hi,

> -Original Message-
> From: Segher Boessenkool 
> Sent: Tuesday, May 11, 2021 4:43 PM
> To: Tamar Christina 
> Cc: gcc@gcc.gnu.org; Richard Sandiford ;
> Richard Biener 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> Hi!
> 
> On Tue, May 11, 2021 at 05:37:34AM +, Tamar Christina via Gcc wrote:
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> 
> That can be done quite a bit better, branchless at least.  Same for all
> examples here probably.

Do you mean in the example? Sure, I wanted to keep it simple 😊

> 
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >   operations which opens up the expressions to further optimizations.
> 
> You'll have to do the operation in a bigger mode for that.  (This is also 
> true for
> rounding, in many cases).
> 
> This makes new internal functions more attractive / more feasible.

True, but the internal function doesn't need to model the wider mode right?
So if you're saturating an int, the internal-fn would work on int,int,int.

> 
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during
> Vectorization
> >   we would overestimate the cost of performing the saturation.  The
> constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> 
> Combine is almost never the right answer if you want to combine more than
> two or three RTL insns.  It can be done, but combine will always write the
> combined instruction in simplest terms, which tends to mean that if you
> combine more insns there can be very many outcomes that you all need to
> recognise as insns in your machine description.

Yeah, ideally I would like to handle it before it gets to expand.

> 
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> 
> I think you will have to experiment with both approaches to get a good
> feeling for the tradeoff.

Fair enough,

Thanks for the feedback!

Cheers,
Tamar
> 
> 
> Segher


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc



> -Original Message-
> From: Joseph Myers 
> Sent: Tuesday, May 11, 2021 6:01 PM
> To: David Brown 
> Cc: Tamar Christina ; gcc@gcc.gnu.org; Richard
> Sandiford ; Richard Biener
> 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On Tue, 11 May 2021, David Brown wrote:
> 
> > It is also worth noting that gcc already has support for saturating
> > types on some targets:
> >
> > 
> >
> > My testing of these (quite a long time ago) left me with a feeling
> > that it was not a feature anyone had worked hard to optimise -
> > certainly it
> 
> The implementation isn't well-integrated with any optimizations for
> arithmetic on ordinary integer types / modes, because it has its own
> completely separate machine modes and operations on those.  I still think it
> would be better to have a GIMPLE pass that lowers from fixed-point types to
> saturating etc. operations on ordinary integer types, as I said in
> .
> 

Oh, thanks for the link! That's a very useful read!

Cheers,
Tamar

> Note however that such lowering should be more or less independent of
> what's being discussed in this thread - this thread is about better
> optimization of such operations on ordinary types (with or without built-in
> functions of some kind in addition to recognition of such operations written
> in generic C), which you can do independently of what's done with fixed-
> point types.
> 
> --
> Joseph S. Myers
> jos...@codesourcery.com


RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Tamar Christina via Gcc



> -Original Message-
> From: Richard Sandiford 
> Sent: Wednesday, May 12, 2021 9:48 AM
> To: Tamar Christina 
> Cc: gcc@gcc.gnu.org; Richard Biener 
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> Tamar Christina  writes:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single
> > idiom for expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and
> > float_truncate for truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for
> > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > but it looks like nothing actually generates this at the moment. it's just 
> > an
> unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of
> > these also exist in their scalar form, as such detecting them in the
> > vectorizer would be the wrong place.
> >
> > 1. Rounding:
> >a) Use match.pd to rewrite various rounding idioms to shifts.
> >b) Use backwards or forward prop to rewrite these to internal functions
> >   where even if the target does not support these rounding instructions
> they
> >   have a chance to provide a more efficient implementation than what
> would
> >   be generated normally.
> >
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >   operations which opens up the expressions to further optimizations.
> >b) Use backwards or forward prop to convert to internal functions if the
> >   resulting min/max expression still meet the criteria for being a
> >   saturating expression.  This follows the algorithm as outlined in "The
> >   Software Vectorization handbook" by Aart J.C. Bik.
> >
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during
> Vectorization
> >   we would overestimate the cost of performing the saturation.  The
> constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> 
> I might have misunderstood what you meant here, but the *_truncate RTL
> codes are true truncations: the operand has to be wider than the result.
> Using this representation for general arithmetic is a problem if you're
> operating at the maximum size that the target supports natively.
> E.g. representing a 64-bit saturating addition as:
> 
>   - extend to 128 bits
>   - do a 128-bit addition
>   - truncate to 64 bits
> 

Ah, that wasn't clear from the documentation.  The one for the normal truncate
mentions that the modes have to be wider, but the _truncate variants and friends
don't mention this constraint.  That would indeed not work.

> is going to be hard to cost and code-generate on targets that don't support
> native 128-bit operations (or at least, don't support them cheaply).
> This might not be a problem when recognising C idioms, since the C source
> code has to be able do the wider operation before truncating the result, but
> it could be a problem if we provide built-in functions or if we want to
> introduce compiler-generated saturating operations.
> 
> RTL already has per-operation saturation such as ss_plus/us_plus,
> ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
> ss_ashift/us_ashift and ss_abs.  I think we should do the same in gimple,
> using internal functions like you say.

Oh, I didn't know about those.  Indeed those look like a better fit here.
Having all the operations separate in RTL already does seem to imply
that separate internal-fns are the way to go.

Thanks!

Regard

Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Richard Biener
On Wed, 12 May 2021, Richard Sandiford wrote:

> Tamar Christina  writes:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The aim is 
> > to
> > recognize both Scalar and Vector variant of typical saturating expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >char sat (char a, char b)
> >{
> >   int tmp = a + b;
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 2. Saturating abs:
> >char sat (char a)
> >{
> >   int tmp = abs (a);
> >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >}
> >
> > 3. Rounding shifts
> >char rndshift (char dc)
> >{
> >   int round_const = 1 << (shift - 1);
> >   return (dc + round_const) >> shift;
> >}
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single idiom for
> > expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and float_truncate for
> > truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for scalars. 
> > For
> > Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> > nothing actually generates this at the moment. it's just an unused tree 
> > code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of these 
> > also
> > exist in their scalar form, as such detecting them in the vectorizer would 
> > be
> > the wrong place.
> >
> > 1. Rounding:
> >a) Use match.pd to rewrite various rounding idioms to shifts.
> >b) Use backwards or forward prop to rewrite these to internal functions
> >   where even if the target does not support these rounding instructions 
> > they
> >   have a chance to provide a more efficient implementation than what 
> > would
> >   be generated normally.
> >
> > 2. Saturation:
> >a) Use match.pd to rewrite the various saturation expressions into 
> > min/max
> >   operations which opens up the expressions to further optimizations.
> >b) Use backwards or forward prop to convert to internal functions if the
> >   resulting min/max expression still meet the criteria for being a
> >   saturating expression.  This follows the algorithm as outlined in "The
> >   Software Vectorization handbook" by Aart J.C. Bik.
> >
> >   We could get the right instructions by using combine if we don't 
> > rewrite
> >   the instructions to an internal function, however then during 
> > Vectorization
> >   we would overestimate the cost of performing the saturation.  The 
> > constants
> >   will the also be loaded into registers and so becomes a lot more 
> > difficult
> >   to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an internal 
> > function
> > for all operations supported, or if it should be modelled as an internal FN 
> > which
> > just "marks" the operation as rounding/saturating. After all, the only 
> > difference
> > between a normal and saturating expression in RTL is the xx_truncate RTL 
> > surrounding
> > the expression.  Doing so would also mean that all targets whom have 
> > saturating
> > instructions would automatically benefit from this.
> 
> I might have misunderstood what you meant here, but the *_truncate
> RTL codes are true truncations: the operand has to be wider than the
> result.  Using this representation for general arithmetic is a problem
> if you're operating at the maximum size that the target supports natively.
> E.g. representing a 64-bit saturating addition as:
> 
>   - extend to 128 bits
>   - do a 128-bit addition
>   - truncate to 64 bits
> 
> is going to be hard to cost and code-generate on targets that don't support
> native 128-bit operations (or at least, don't support them cheaply).
> This might not be a problem when recognising C idioms, since the C source
> code has to be able do the wider operation before truncating the result,
> but it could be a problem if we provide built-in functions or if we want
> to introduce compiler-generated saturating operations.
> 
> RTL already has per-operation saturation such as ss_plus/us_plus,
> ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
> ss_ashift/us_ashift and ss_abs.  I think we should do the same
> in gimple, using internal functions like you say.

I think that for followup optimizations using regular arithmetic
ops and just new saturating truncations is better.  Maybe we can
also do both, with first only matching the actual saturation
with a new tree code and then later match the optabs the target
actually supports (in ISEL for example)?

Truly saturating ops might provide an interesting example of how
to deal with -ftrapv - one might think we can now simply
use the trapping optabs as internal functions to reflect
-ftrapv onto the IL ...

Richard.

> Thanks,
> Richard
> 
> 

-- 
Richard Biener 
SUSE So

RE: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Richard Biener
On Wed, 12 May 2021, Tamar Christina wrote:

> 
> 
> > -Original Message-
> > From: Richard Biener 
> > Sent: Tuesday, May 11, 2021 12:45 PM
> > To: Tamar Christina 
> > Cc: gcc@gcc.gnu.org; Richard Sandiford ;
> > Jakub Jelinek 
> > Subject: Re: [RFC] Implementing detection of saturation and rounding
> > arithmetic
> > 
> > On Tue, 11 May 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > We are looking to implement saturation support in the compiler.  The
> > > aim is to recognize both Scalar and Vector variant of typical saturating
> > expressions.
> > >
> > > As an example:
> > >
> > > 1. Saturating addition:
> > >char sat (char a, char b)
> > >{
> > >   int tmp = a + b;
> > >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >}
> > >
> > > 2. Saturating abs:
> > >char sat (char a)
> > >{
> > >   int tmp = abs (a);
> > >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >}
> > >
> > > 3. Rounding shifts
> > >char rndshift (char dc)
> > >{
> > >   int round_const = 1 << (shift - 1);
> > >   return (dc + round_const) >> shift;
> > >}
> > >
> > > etc.
> > >
> > > Of course the first issue is that C does not really have a single
> > > idiom for expressing this.
> > >
> > > At the RTL level we have ss_truncate and us_truncate and
> > > float_truncate for truncation.
> > >
> > > At the Tree level we have nothing for truncation (I believe) for
> > > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > > but it looks like nothing actually generates this at the moment. it's 
> > > just an
> > unused tree code.
> > >
> > > For rounding there doesn't seem to be any existing infrastructure.
> > >
> > > The proposal to handle these are as follow, keep in mind that all of
> > > these also exist in their scalar form, as such detecting them in the
> > > vectorizer would be the wrong place.
> > >
> > > 1. Rounding:
> > >a) Use match.pd to rewrite various rounding idioms to shifts.
> > >b) Use backwards or forward prop to rewrite these to internal functions
> > >   where even if the target does not support these rounding 
> > > instructions
> > they
> > >   have a chance to provide a more efficient implementation than what
> > would
> > >   be generated normally.
> > >
> > > 2. Saturation:
> > >a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >   operations which opens up the expressions to further optimizations.
> > >b) Use backwards or forward prop to convert to internal functions if 
> > > the
> > >   resulting min/max expression still meet the criteria for being a
> > >   saturating expression.  This follows the algorithm as outlined in 
> > > "The
> > >   Software Vectorization handbook" by Aart J.C. Bik.
> > >
> > >   We could get the right instructions by using combine if we don't 
> > > rewrite
> > >   the instructions to an internal function, however then during
> > Vectorization
> > >   we would overestimate the cost of performing the saturation.  The
> > constants
> > >   will the also be loaded into registers and so becomes a lot more 
> > > difficult
> > >   to cleanup solely in the backend.
> > >
> > > The one thing I am wondering about is whether we would need an
> > > internal function for all operations supported, or if it should be
> > > modelled as an internal FN which just "marks" the operation as
> > > rounding/saturating. After all, the only difference between a normal
> > > and saturating expression in RTL is the xx_truncate RTL surrounding
> > > the expression.  Doing so would also mean that all targets whom have
> > saturating instructions would automatically benefit from this.
> > >
> > > But it does mean a small adjustment to the costing, which would need
> > > to cost the internal function call and the argument together as a whole.
> > >
> > > Any feedback is appreciated to minimize the number of changes required
> > > to the final patch.  Any objections to the outlined approach?
> > 
> > I think it makes sense to pattern-match the operations on GIMPLE and follow
> > the approach take by __builtin_add_overflow & friends.
> > Maybe quickly check whether clang provides some builtins already which we
> > could implement.
> > 
> 
> Hmm so taking a look at __builtin_add_overflow and friends it looks like they 
> use
> Internal functions like ADD_OVERFLOW. Biggest difference being that it sets a 
> condition
> flag.  This does me wonder if something like
> 
> int f (int a, int b)
> {
> int res;
> if (__builtin_add_overflow (a, b, &res))
>   {
>   if (res >= 0)
> return INT_MAX;
>   else
> return INT_MIN;
>   }
> return res;
> }
> 
> Should be recognized as saturating as well.  But yeah, following the same 
> approach we
> would rewrite the sequence to something like res = .ADD_SAT (a, b);
> 
> I know that in the past there was c

Re: Speed of compiling gimple-match.c

2021-05-12 Thread Segher Boessenkool
On Wed, May 12, 2021 at 10:19:17AM +0200, Richard Biener wrote:
> On Tue, May 11, 2021 at 5:01 PM Segher Boessenkool
>  wrote:
> >
> > On Tue, May 04, 2021 at 10:40:38AM +0200, Richard Biener via Gcc wrote:
> > > On Mon, May 3, 2021 at 11:10 PM Andrew Pinski via Gcc  
> > > wrote:
> > > >   I noticed my (highly, -j24) parallel build of GCC is serialized on
> > > > compiling gimple-match.c.  Has anyone looked into splitting this
> > > > generated file into multiple files?
> > >
> > > There were threads about this in the past, yes.  There's the
> > > possibility to use LTO for this as well (also mentioned in those
> > > threads).  Note it's not easy to split in a meaningful way in
> > > genmatch.c
> >
> > But it will have to be handled at least somewhat soon: on not huge
> > parallelism (-j120 for example) building *-match.c takes longer than
> > building everything else in gcc/ together (wallclock time), and it is a
> > huge part of regstrap time (bigger than running all of the testsuite!)
> 
> I would classify -j120 as "huge parallelism" ;)  Testing time still
> dominates my builds (with -j24) where bootstrap takes ~20 mins
> but testing another 40.

The issue is that the usual regstrap has become something like 2x slower
over five or so years.  Hardware doesn't keep up (per thread), so the
only way to keep acceptable turnaround times is to up the thread count.
And then the few tasks that take way more time than everything else will
dominate.

> Is it building stage2 gimple-match.c that you are worried about?
> (it's built using the -O0 compiled stage1 compiler - but we at
> least should end up using -fno-checking for this build)

The *-match builds in other stages are noticeably slow as well.  But
yes, stage2 is by far slowest.

> Maybe you can do some experiments - like add
> -fno-inline-functions-called-once and change
> genmatch.c:3766 to split out single uses as well
> (should decrease function sizes).
> 
> There's the option to make all functions external in
> gimple-match.c so splitting the file at arbitrary points
> will be possible (directly from genmatch), we'll need
> some internal header with all declarations then
> as well or alternatively some clever logic in
> genmatch to only externalize functions needed from
> mutliple split files.
> 
> That said - ideas to reduce the size of the generated
> code are welcome as well.

I don't currently have good plans for this.  I'm just remarking that it
increasingly is the biggest bootstrap time problem.  It always has been,
but it has become significantly worse the past few releases.

[ snip a lot I cannot reply to right now ]

> -fno-checking also makes a dramatic difference for me.

Checking quite often finds serious problems, so that is not a real
option, neither during development nor while just testing
configurations.  I even like to enable rtl,tree checking even though
that costs a factor of three or so in build time.  Disabling all checking
most of the time is counter-productive imo.


Segher


Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Segher Boessenkool
On Wed, May 12, 2021 at 09:13:38AM +, Tamar Christina wrote:
> > From: Segher Boessenkool 
> > On Tue, May 11, 2021 at 05:37:34AM +, Tamar Christina via Gcc wrote:
> > > 2. Saturating abs:
> > >char sat (char a)
> > >{
> > >   int tmp = abs (a);
> > >   return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >}
> > 
> > That can be done quite a bit better, branchless at least.  Same for all
> > examples here probably.
> 
> Do you mean in the example? Sure, I wanted to keep it simple 😊

Point taken...  I think you need to look at these issues sooner rather
than later though.

> > > 2. Saturation:
> > >a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >   operations which opens up the expressions to further optimizations.
> > 
> > You'll have to do the operation in a bigger mode for that.  (This is also 
> > true for
> > rounding, in many cases).
> > 
> > This makes new internal functions more attractive / more feasible.
> 
> True, but the internal function doesn't need to model the wider mode right?
> So if you're saturating an int, the internal-fn would work on int,int,int.

The ifn doesn't have to go to a wider mode, it's only if you want to
express it with more basic operations that you have to.  That is the
reason I find ifns more attractive for this, yup.

> > >   We could get the right instructions by using combine if we don't 
> > > rewrite
> > >   the instructions to an internal function, however then during
> > Vectorization
> > >   we would overestimate the cost of performing the saturation.  The
> > constants
> > >   will the also be loaded into registers and so becomes a lot more 
> > > difficult
> > >   to cleanup solely in the backend.
> > 
> > Combine is almost never the right answer if you want to combine more than
> > two or three RTL insns.  It can be done, but combine will always write the
> > combined instruction in simplest terms, which tends to mean that if you
> > combine more insns there can be very many outcomes that you all need to
> > recognise as insns in your machine description.
> 
> Yeah, ideally I would like to handle it before it gets to expand.

That is the right place yeah...  but it will need some RTL work, in
simplify-rtx at least.


Segher


DW_TAG_compile_unit not generated only for some source

2021-05-12 Thread a.serena--- via Gcc
Hello,

first of all, congratulations on the amazing compiler that all
of you have implemented!

Second: I'm using GCC version 4.9.4 for the powerpc architecture (powerpc-eabi-gcc)
and I don't know why, but it seems that the DWARF debug info is generated only
for a subset of the sources, i.e. I see a DW_TAG_compile_unit only for some of
my *.c files in the ELF debug info.

Can I force the generation of DWARF info for all the sources of my
application by using a specific gcc command option?

 

Thanks in advance

 

Antonio SERENA



Re: [RFC] Implementing detection of saturation and rounding arithmetic

2021-05-12 Thread Liu Hao via Gcc

On 5/12/21 5:13 PM, Tamar Christina via Gcc wrote:


int f (int a, int b)
{
  int res;
  if (__builtin_add_overflow (a, b, &res))
    {
      if (res >= 0)
        return INT_MAX;
      else
        return INT_MIN;
    }
  return res;
}

Should be recognized as saturating as well.  But yeah, following the same
approach we would rewrite the sequence to something like res = .ADD_SAT (a, b);



Is this a correct saturating addition implementation?

If the addition has overflowed, you get a positive result or zero for the sum of two negative
numbers (or a negative one for two positive numbers), so the check on `res` picks the wrong
saturation bound; it is not straightforward to write it this way.



This should be

  int f (int a, int b)
  {
    int res;
    if (__builtin_add_overflow (a, b, &res))
      {
        if (a >= 0)  /* changed from `res` to `a`  */
          return INT_MAX;
        else
          return INT_MIN;
      }
    return res;
  }

which can be optimized further as

  int f (int a, int b)
  {
    int res;
    if (__builtin_add_overflow (a, b, &res))
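      /* a >= 0 ? INT_MAX : INT_MIN, computed branchlessly from the sign bit of a.  */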
      res = (a >> sizeof(int) * CHAR_BIT - 1) ^ INT_MAX;
    return res;
  }


--
Best regards,
Liu Hao






gcc-8-20210512 is now available

2021-05-12 Thread GCC Administrator via Gcc
Snapshot gcc-8-20210512 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/8-20210512/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 8 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-8 
revision d82fd5e0b0fb68fbfaa7cd917e844fff0d0e3fb5

You'll find:

 gcc-8-20210512.tar.xz   Complete GCC

  SHA256=d082a51995d466e45f47c74e2bb46bc248d55e8cebec54ba17df9b1de7c7e069
  SHA1=b43134a559ac50764609d5da7bd7a059a2556630

Diffs from 8-20210505 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.