Re: [RFC] Run pass_sink_code once more after ivopts/fre

2021-03-26 Thread Xionghu Luo via Gcc-patches
Hi, sorry for the late response.

On 2021/3/23 16:50, Richard Biener wrote:
>>> It definitely should be before uncprop (but context stops there). And yes,
>>> re-running passes isn't the very, very best thing to do without explaining
>>> it cannot be done in other ways. Not for late stage 3 anyway.
>>>
>>> Richard.
>>>
>> Thanks.  Also tried to implement this in a seperate RTL pass, which
>> would be better?  I guess this would also be stage1 issues...
> Yes, that's also for stage1 obviously.  Can you check how the number
> of sink opportunities of your RTL pass changes if you add the
> 2nd GIMPLE sinking pass?

Number of instructions sunk out of loops when running (no sink2 -> sink2):
1. SPEC2017 int: 371  ->  142
2. SPEC2017 float: 949  ->  343
3. bootstrap: 402  ->  229
4. stage1 libraries: 115   ->  68
5. regression tests: 4533 ->  2948

The case I used from the beginning can now be optimized entirely by the
GIMPLE sink2 pass instead, but many instructions are still sunk even with
sink2 added; I guess most of them are produced by the expand pass.  One
example follows (the insn was after #38 in 262r.reginfo; note that a new
block is created between exit->src and exit->dst so that other basic
blocks jumping into exit->dst cannot cause execution errors from r132:DI
being updated unexpectedly).  Sometimes an extra zero extend left in a
loop like this can cause a serious performance issue:

vect-live-4.ltrans0.ltrans.263r.sink:
...
Loop 2: sinking (set (reg/v:DI 132 [  ])
(sign_extend:DI (reg:SI 144))) from bb 7 to bb 11

...

   44: L44:
   35: NOTE_INSN_BASIC_BLOCK 7
   36: r127:DI=r127:DI+0x4
   37: r145:SI=[r127:DI]
   38: r144:SI=r145:SI+0x5
  REG_DEAD r145:SI
   40: r125:DI=r125:DI+0x4
   41: [r125:DI]=r144:SI
  REG_DEAD r144:SI
   42: r146:SI=r126:DI#0-0x1
  REG_DEAD r126:DI
   43: r126:DI=zero_extend(r146:SI)
  REG_DEAD r146:SI
   45: r147:CC=cmp(r126:DI,0)
   46: pc={(r147:CC!=0)?L65:pc}
  REG_DEAD r147:CC
  REG_BR_PROB 941032164
   68: NOTE_INSN_BASIC_BLOCK 11
   69: r132:DI=sign_extend(r144:SI)
  ; pc falls through to BB 8
   65: L65:
   64: NOTE_INSN_BASIC_BLOCK 9
  ; pc falls through to BB 7
   47: L47:
   48: NOTE_INSN_BASIC_BLOCK 8

> 
> I'll note that you are not sinking later stmts first (you only
> walk insns reverse but not BBs).  GIMPLE sinking performs a
> domwalk over post dominators (but it has an easier job because
> of PHIs).  I guess you'd want to walk starting from loop exit
> blocks (from innermost loops as you do) in reverse program order.

Yes, this RTL sink can only sink instructions from the *loop header*
to each loop exit block, walking insns in reverse order.  That is a bit
strict, but this is only a first step to see whether it is reasonable to
add such a pass at all.
For example, if the instruction is in a loop body block, sinking it out
can sometimes cause execution errors, because the pass has no information
about whether the body block will actually be executed: if the loop jumps
from the header directly to the exit, instructions sunk from the body to
the exit would change register values unexpectedly.  It seems that
always_reached and always_executed in loop-invariant.c could be reused
here to detect that situation?
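To illustrate the hazard (a hand-written C sketch, not taken from any
testcase in this thread), consider a conditionally executed body block:

  /* Hypothetical example.  Sinking the assignment from the guarded body
     block to the loop exit would set t unconditionally after the loop,
     even on runs where the guard never held, i.e. it would change the
     register value unexpectedly.  Restricting the pass to the header,
     or reusing always_executed, avoids this.  */
  int f (int *a, int n, int flag)
  {
    int t = 0;
    for (int i = 0; i < n; i++)
      if (flag)
        t = a[i] + 5;   /* lives in a body block that may be skipped */
    return t;
  }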

> 
> I'll also note that you don't have to check whether stmts you
> want to sink have their uses set after it - you can emit copies
> to a new pseudo at the original insn location and use those
> after the loop (that of course comes at some cost).

I am not sure whether I understood your point correctly, but wouldn't
the instruction then still be inside the loop, executed niter times?
What I am trying to do is move r132:DI=sign_extend(r144:SI) out of the
loop: if it was executed 100 times in the loop, but r132 is not used
inside the loop and r144 is not updated after this instruction, then it
can be moved to the loop exit and executed only once.
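To make the intent concrete, here is a minimal C-level before/after
sketch (hand-written for illustration; the names only mirror the pseudos
in the dump above):

  /* Before: the widening conversion runs on every iteration even though
     its result is not used inside the loop.  */
  void before (int *a, int *b, int n, long *out)
  {
    int r144 = 0;
    long r132 = 0;
    for (int i = 0; i < n; i++)
      {
        r144 = b[i] + 5;
        a[i] = r144;
        r132 = (long) r144;   /* executed n times */
      }
    *out = r132;
  }

  /* After sinking: the conversion is executed once at the loop exit,
     which is safe because r144 is not redefined after it in the loop.  */
  void after (int *a, int *b, int n, long *out)
  {
    int r144 = 0;
    for (int i = 0; i < n; i++)
      {
        r144 = b[i] + 5;
        a[i] = r144;
      }
    *out = (long) r144;
  }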

> 
> Also we already have a sinking pass on RTL which even computes
> a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c.
> I'm not sure whether this deals with non-stores but the
> LCM machinery definitely can handle arbitrary expressions.  I wonder
> if it makes more sense to extend this rather than inventing a new
> ad-hoc sinking pass?

From the name, my pass doesn't handle or process store instructions the
way store-motion does.  Thanks, I will check it.

-- 
Thanks,
Xionghu


Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Richard Biener via Gcc-patches
On Thu, Mar 25, 2021 at 7:37 PM H.J. Lu  wrote:
>
> On Thu, Mar 25, 2021 at 7:54 AM Jakub Jelinek via Gcc-patches
>  wrote:
> >
> > On Thu, Mar 25, 2021 at 03:40:51PM +0100, Richard Biener wrote:
> > > I think the "proper" way to do this is to have 'open' above end up
> > > refering to the out-of-line 'open' in the DSO, _not_ to emit the
> > > fortification wrapper out-of-line.  But then, yes, it shouldn't
> > > be always-inline then.  It should be like the former extern inline
> > > extension.
> >
> > It is extern inline __attribute__((gnu_inline, always_inline, artificial))
> > I think.  But the always_inline is completely intentional there,
> > we don't want the decision whether to inline it or not being done based on
> > its size, amount of functions already inlined into the caller before,
> > whether the call is cold or hot, etc.  It is a matter of security.
> > If it is taking address, we want the library routine in that case, sure.
> >
> > > But we have existing issues with [target] options differing and existing 
> > > old
> > > uses of always_inline (like the fortification wrappers).  Adding a new 
> > > attribute
> > > will not fix those issues.  Do you propose to not fix them and instead 
> > > only
> > > fix the new general_regs_only always-inline function glibc wants to add?
> >
> > Yes.
> > Basically solve the problem for the fortification wrappers and rdtsc or
> > whatever other always inlines don't really require any specific
> > target/optimize options.
> >
> > > IMHO we have to fix the existing always_inline and we need a _new_
> > > attribute to get the desired diagnostics on intrinsics.  Something
> > > like __attribute__((need_target("avx"))) for AVX intrinsics?
> >
> > Or, if we go this route in addition to adding
> > at least a new attributes for the "diagnose taking address without
> > direct call", we'd need probably not just that,
> > but also pragma way to specify it for a lot of functions together,
> > otherwise it would be a maintainance nightmare.
> >
>
> How can we move forward with it?  I'd like to resolve it in GCC 11.

So I looked closer, and we handle target attribute mismatches differently
from optimization attribute mismatches (the latter are validated in
can_inline_edge_by_limits_p, the former in can_inline_edge_p).
For optimize attribute differences we ignore all of them (even semantic
differences):

  /* Until GCC 4.9 we did not check the semantics-altering flags
     below and inlined across optimization boundaries.
     Enabling checks below breaks several packages by refusing
     to inline library always_inline functions.  See PR65873.
     Disable the check for early inlining for now until better solution
     is found.  */
  if (always_inline && early)
    ;
  /* There are some options that change IL semantics which means
     we cannot inline in these cases for correctness reason.
     Not even for always_inline declared functions.  */
  else if (check_match (flag_wrapv)
           ...
  /* gcc.dg/pr43564.c.  Apply user-forced inline even at -O0.  */
  else if (always_inline)
    ;
  /* When user added an attribute to the callee honor it.  */
  else if (lookup_attribute ("optimize", DECL_ATTRIBUTES (callee->decl))
           && opts_for_fn (caller->decl) != opts_for_fn (callee->decl))
    {
      e->inline_failed = CIF_OPTIMIZATION_MISMATCH;
      inlinable = false;
    }

So the original intent was to do things "correctly", but, as now seen
with target attribute mismatches, we run into problems.  Thus we now
allow all always-inlines.

I suppose diagnosing

static inline void __attribute__((target("avx"),always_inline))
foo_avx_optimized () {...}
void bar()
{
  if (cpu_supports ("avx"))
   foo_avx_optimized ();
}

for the missed optimization caused by the always-inline
(foo_avx_optimized will inherit the caller's target flags and _not_ be
AVX optimized) might be nice, but at least this kind of inlining will
not generate wrong code.
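
For contrast, the pattern that does get a per-function ISA today keeps
the specialized body out of line and dispatches at run time -- a
hand-written sketch, not from this thread:

  static void __attribute__((target("avx"), noinline))
  foo_avx_optimized (void)
  {
    /* AVX code; compiled with AVX enabled because the target attribute
       applies to this out-of-line definition, not the caller.  */
  }

  static void
  foo_generic (void)
  {
    /* generic fallback */
  }

  void
  bar (void)
  {
    if (__builtin_cpu_supports ("avx"))
      foo_avx_optimized ();
    else
      foo_generic ();
  }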

Thus we IMHO can do sth like

diff --git a/gcc/ipa-inline.c b/gcc/ipa-inline.c
index f15c4828958..d4d4ac366c8 100644
--- a/gcc/ipa-inline.c
+++ b/gcc/ipa-inline.c
@@ -374,9 +374,14 @@ can_inline_edge_p (struct cgraph_edge *e, bool report,
   e->inline_failed = CIF_UNSPECIFIED;
   inlinable = false;
 }
-  /* Check compatibility of target optimization options.  */
-  else if (!targetm.target_option.can_inline_p (caller->decl,
-   callee->decl))
+  /* Check compatibility of target optimization options.  Be consistent with
+ handling of early always-inlines and optimize attribute differences
+ handled in can_inline_edge_by_limits_p.  */
+  else if ((!early
+   || !DECL_DISREGARD_INLINE_LIMITS (callee->decl)
+   || !lookup_attribute ("always_inline",
+ DECL_ATTRIBUTES (callee->decl)))
+  && !targetm.target_option.can_inline_p (caller->decl, callee->d

Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Jakub Jelinek via Gcc-patches
On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
> How can we move forward with it?  I'd like to resolve it in GCC 11.

I think it is too late for GCC 11 for this.
Especially if the solution is to change the behavior of the existing
attribute, we would need enough time to test everything in the wild so
that we don't break it badly, even if we add new attributes that cover
the previous behavior.  Only if we keep the behavior of the existing
attribute and add a new one with the new behavior would it be something
that could be considered for GCC 11 IMNSHO, but then you'd need to change
the glibc headers in time too to buy into the new attribute.
We need analysis of all GCC targets with target attribute support and
handle them consistently.

Jakub



Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Richard Biener via Gcc-patches
On Fri, Mar 26, 2021 at 9:34 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
> > How can we move forward with it?  I'd like to resolve it in GCC 11.
>
> I think it is too late for GCC 11 for this.
> Especially if the solution would be that we change the behavior of existing
> attribute, we would need enough time to test everything in the wild that
> we don't break it badly,

But isn't the suggested change only going to make programs we reject now
with an error accepted or ICEing?  Thus, no program that works right now
should break.

Richard.

> even if we add new attributes that cover the
> previous behavior.  Only if we keep the behavior of existing attribute
> and add a new one with the new behavior it would be something that could
> be considered for GCC 11 IMNSHO but then you'd need to change the glibc
> headers in time too to buy into the new attribute.
> We need analysis of all GCC targets with target attribute support and
> handle them consistently.
>
> Jakub
>


Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Jakub Jelinek via Gcc-patches
On Fri, Mar 26, 2021 at 11:13:21AM +0100, Richard Biener wrote:
> On Fri, Mar 26, 2021 at 9:34 AM Jakub Jelinek via Gcc-patches
>  wrote:
> >
> > On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
> > > How can we move forward with it?  I'd like to resolve it in GCC 11.
> >
> > I think it is too late for GCC 11 for this.
> > Especially if the solution would be that we change the behavior of existing
> > attribute, we would need enough time to test everything in the wild that
> > we don't break it badly,
> 
> But isn't the suggested change only going to make programs we reject now
> with an error accepted or ICEing?  Thus, no program that works right now
> should break.

That is true, but even
accepts-invalid
and
ice-on-invalid-code
would be important regressions.
Changing the always_inline attribute behavior without at least avoiding
the first of those for our intrinsics would be bad, and we need to look at
what people use always_inline for in the wild and what their expectations
are.  And for the intrinsics we need something maintainable; we have > 5000
intrinsics on i386 alone, > 4000 on aarch64, > 7000 on arm, > 600 on rs6000,
> 100 on sparc, and I bet most of them rely on the current behavior.
I think the world doesn't end if we do this for GCC 12 only, do it right
for everything we are aware of, and have many months to figure out what
impact it will have on programs in the wild.
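
As a concrete illustration of that current behavior (example hand-written
here, not from the thread), an intrinsic used without its ISA enabled is
rejected today:

  /* Hypothetical example: compiled without -mavx, GCC rejects this with
     an error along the lines of "inlining failed in call to
     'always_inline' '_mm256_add_pd': target specific option mismatch".  */
  #include <immintrin.h>

  __m256d
  add2 (__m256d x, __m256d y)
  {
    return _mm256_add_pd (x, y);
  }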

Jakub



Re: [Patch, fortran] PR99602 - [11 regression] runtime error: pointer actual argument not associated

2021-03-26 Thread Tobias Burnus

Hi Paul,

I do not understand the !UNLIMITED_POLY(fsym) part of the patch.
In particular, your patch causes foo.f90 to fail by wrongly diagnosing:

  Fortran runtime error: Pointer actual argument 'cptr' is not associated

I have only done some light tests – but it seems that just removing
'&& !UNLIMITED_POLY(fsym)' is enough.  (But I did not run
the testsuite.)

Hence:
- Please include the attached testcases or some variants of them.
- Check that removing !UNLIMITED_POLY does not cause any regressions

If that works: OK for mainline

Thanks for looking into this issue and working on the patches.

Tobias

On 26.03.21 07:58, Paul Richard Thomas via Fortran wrote:

This patch is straightforward but the isolation of the problem was rather
less so. Many thanks to Juergen for testcase reduction.

Regtested on FC33/x86_64 - OK for master?

Paul

Fortran: Fix problem with runtime pointer check [PR99602].

2021-03-26  Paul Thomas  

gcc/fortran/ChangeLog

PR fortran/99602
* trans-expr.c (gfc_conv_procedure_call): Use the _data attrs
for class expressions and detect proc pointer evaluations by
the non-null actual argument list.

gcc/testsuite/ChangeLog

PR fortran/99602
* gfortran.dg/pr99602.f90: New test.
* gfortran.dg/pr99602a.f90: New test.
* gfortran.dg/pr99602b.f90: New test.

! { dg-do compile }
! { dg-options "-fcheck=pointer -fdump-tree-original" }
!
! PR fortran/99602
!

module m
  implicit none
contains
  subroutine wr(y)
class(*) :: y
stop 1
  end
end module m

use m
implicit none
integer, pointer :: iptr
class(*), pointer :: cptr

nullify (cptr, iptr)
call wr(iptr)
call wr(cptr)
end

! { dg-final { scan-tree-dump-times "_gfortran_runtime_error_at" "original" 2 } }
! { dg-final { scan-tree-dump-times "Pointer actual argument 'cptr'" "original" 1 } }
! { dg-final { scan-tree-dump-times "Pointer actual argument 'iptr'" "original" 1 } }
! { dg-do compile }
! { dg-options "-fcheck=pointer -fdump-tree-original" }
!
! PR fortran/99602
!

module m
  implicit none
contains
  subroutine wr(y)
class(*), pointer :: y
if (associated (y)) stop 1
  end
end module m

use m
implicit none
class(*), pointer :: cptr

nullify (cptr)
call wr(cptr)
end

! { dg-final { scan-tree-dump-not "_gfortran_runtime_error_at" "original" } }
! { dg-final { scan-tree-dump-not "Pointer actual argument" "original" } }


Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Richard Biener via Gcc-patches
On Fri, Mar 26, 2021 at 11:26 AM Jakub Jelinek  wrote:
>
> On Fri, Mar 26, 2021 at 11:13:21AM +0100, Richard Biener wrote:
> > On Fri, Mar 26, 2021 at 9:34 AM Jakub Jelinek via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
> > > > How can we move forward with it?  I'd like to resolve it in GCC 11.
> > >
> > > I think it is too late for GCC 11 for this.
> > > Especially if the solution would be that we change the behavior of 
> > > existing
> > > attribute, we would need enough time to test everything in the wild that
> > > we don't break it badly,
> >
> > But isn't the suggested change only going to make programs we reject now
> > with an error accepted or ICEing?  Thus, no program that works right now
> > should break.
>
> That is true, but even
> accepts-invalid
> and
> ice-on-invalid-code
> would be important regressions.
> Changing the always_inline attribute behavior without at least avoiding
> the first of those for our intrinsics would be bad, and we need to look what
> people use always_inline in the wild for and what are their expectations.
> And for the intrinsics we need something maintainable, we have > 5000
> intrinsics on i386 alone, > 4000 on aarch64, > 7000 on arm, > 600 on rs6000,
> > 100 on sparc, I bet most of them rely on the current behavior.
> I think the world doesn't end if we do it for GCC 12 only, do it right for
> everything we are aware of and have many months to figure out what impact it
> will have on programs in the wild.

As said, my opinion is that this fallout doesn't "exist" in the wild,
since it can only exist for code we reject right now, which by my
definition of "out in the wild" makes it not exist.  I consider only code
accepted by the compiler as a valid "out in the wild" example.

See also the behavior of always-inline with regard to the optimize attribute.

So yes, a better solution would be nice, but I can't see any: the
underlying issue has been known for a long time, and thus the pragmatic
solution is the best (IMHO), also from a QOI perspective.  For intrinsics
it also avoids differences between -O0 and -O in what we accept and reject.

Richard.

> Jakub
>


Re: [PATCH, rs6000][PR gdb/27525] displaced stepping across addpcis/lnia.

2021-03-26 Thread Ulrich Weigand via Gcc-patches
On Tue, Mar 16, 2021 at 05:31:03PM -0500, will schmidt wrote:

>   This addresses PR gdb/27525. The lnia and other variations
> of the addpcis instruction write the value of the NIA into a target register.
> If we are single-stepping across a breakpoint, the instruction is executed
> from a displaced location, and thusly the written value of the PC/NIA
> will be incorrect.   The changes here will measure the displacement
> offset, and adjust the target register value to compensate.
> 
> This is written in a way that I believe will make it easier to
> update to handle prefixed (8 byte) instructions in a future patch.


This looks good to me functionally, but I'm not sure it really makes
much sense to extract code into those new routines -- *all* of the
ppc_displaced_step_fixup routine is about handling instructions that
read the PC, like the branches do.

I'd prefer if the new instructions were simply added to the existing
switch alongside the branches.

> +  displaced_offset = from - to ;  /* FIXME - By inspection, it appears 
> the displaced instruction
> + is at a lower address.  Is this 
> always true?  */

No, it could be either way.  But it shouldn't really matter since
you just need to apply the same displaced offset to the target,
whether the offset is positive or negative.  Again, you should
just do it the same way it is already done by existing code
that handles branches.
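
A minimal sketch of that fixup (variable names made up; this is not the
actual patch, and it assumes the addpcis/lnia destination register number
is already in 'rt'):

  /* Compensate the NIA value written by addpcis/lnia for having executed
     the copied instruction at the displaced address 'to' instead of the
     original address 'from'.  */
  ULONGEST val;
  regcache_cooked_read_unsigned (regs, rt, &val);
  val += from - to;   /* same adjustment whichever direction the copy is */
  regcache_cooked_write_unsigned (regs, rt, val);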

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com


Re: [PATCH] gdb-power10-single-step

2021-03-26 Thread Ulrich Weigand via Gcc-patches
On Thu, Mar 25, 2021 at 12:21:42PM -0500, will schmidt wrote:
> On Wed, 2021-03-10 at 18:50 +0100, Ulrich Weigand wrote:
> > Will Schmidt wrote:
> > 
> > >   This is a patch written by Alan Modra.  With his permission
> > > I'm submitting this for review and helping get this upstream.
> > > 
> > > Powerpc / Power10 ISA 3.1 adds prefixed instructions, which
> > > are 8 bytes in length.  This is in contrast to powerpc previously
> > > always having 4 byte instruction length.  This patch implements
> > > changes to allow GDB to better detect prefixed instructions, and
> > > handle single stepping across the 8 byte instructions.
> > 
> > There's a few issues I see here:
> 
> I've dug in a bit more,.. have a few questions related to the patch and
> the comments here.  I've got a refreshed version of this patch in my
> queue, with a nit or two that i'm  still trying to understand and
> squash before I post it.
> 
> > 
> > - The patch now *forces* software single-stepping for all 8-byte
> >   instructions.  I'm not sure why this is necessary; I thought
> >   that hardware single-stepping was supported for 8-byte instructions
> >   as well?  That would certainly be preferable.
> 
> 
> Does software single-stepping go hand-in-hand with executing the
> instructions from a displaced location?

Yes.  Hardware single-step executes the instruction where it is.
Software single-step needs to replace the subsequent instruction
with a breakpoint, and in order to be able to do that without
unduly affecting simultaneous execution of that code in other
threads, this is not done in place, but in a copy in a displaced
location.

> I only see one clear return-if-prefixed change in the patch, so I am
> assuming the above refers to the patch chunk seen as :
> > @@ -1081,6 +1090,10 @@ ppc_deal_with_atomic_sequence (struct regcache 
> > *regcache)
> >const int atomic_sequence_length = 16; /* Instruction sequence length.  
> > */
> >int bc_insn_count = 0; /* Conditional branch instruction count.  */
> > 
> > +  /* Power10 prefix instructions are two words in length.  */
> > +  if ((insn & OP_MASK) == 1 << 26)
> > +return { pc + 8 };

Yes, this is what I was referring to.  By returning a PC value here,
common code is instructed to always perform a software single-step.
This should not be necessary.

> I've got a local change to eliminate that return.   Per the poking I've
> done so far, none of the prefix instructions I've run against so far
> allow us past the is_load_and_reserve instruction check. 
> >   if (!IS_LOAD_AND_RESERVE_INSN (insn))
> > return {};
> 
> statement, so not a significant code flow behavior change.

Yes, if you just remove the three lines above, code will fall
through to here and return an empty sequence, causing the
common code to use hardware single-step.

> > - However, the inner loop of ppc_deal_with_atomic_sequence should
> >   probably be updated to correctly skip 8-byte instructions; e.g.
> >   to avoid mistakenly recognizing the second word of an 8-byte
> >   instructions for a branch or store conditional.  (Also, the
> >   count of up to "16 instructions" is wrong if 8-byte instructions
> >   are not handled specifically.)
> 
> I've got a local change to inspect the instruction and determine if it
> is prefixed, so I think i've got this handled.  I'm generally assuming
> we will never start halfway through an 8-byte prefixed instruction.

Yes, you can assume the incoming PC value is a valid PC at the start
of the current instruction.
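
To make that concrete, a hand-written sketch of how the scan loop could
advance (OP_MASK and the prefix opcode follow the quoted patch hunk; the
rest is made up and is not the actual patch):

  /* Advance over whole instructions so that the second word of an 8-byte
     prefixed instruction is never misread as a branch or a store
     conditional.  */
  while (insn_count < atomic_sequence_length)
    {
      unsigned int insn = read_memory_unsigned_integer (loc, 4, byte_order);
      int len = ((insn & OP_MASK) == 1 << 26) ? 8 : 4;
      /* ... inspect insn here; loc is always an instruction start ...  */
      loc += len;
      insn_count++;
    }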

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com


Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Florian Weimer
* Jakub Jelinek via Gcc-patches:

> On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
>> How can we move forward with it?  I'd like to resolve it in GCC 11.
>
> I think it is too late for GCC 11 for this.
> Especially if the solution would be that we change the behavior of existing
> attribute, we would need enough time to test everything in the wild that
> we don't break it badly, even if we add new attributes that cover the
> previous behavior.  Only if we keep the behavior of existing attribute
> and add a new one with the new behavior it would be something that could
> be considered for GCC 11 IMNSHO but then you'd need to change the glibc
> headers in time too to buy into the new attribute.
> We need analysis of all GCC targets with target attribute support and
> handle them consistently.

I think H.J. needs this for a function that isn't even always_inline,
just extern inline __attribute__ ((gnu_inline)).  Is that aspect
something that could be solved for GCC 11?


Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Richard Biener via Gcc-patches
On Fri, Mar 26, 2021 at 2:49 PM Florian Weimer  wrote:
>
> * Jakub Jelinek via Gcc-patches:
>
> > On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
> >> How can we move forward with it?  I'd like to resolve it in GCC 11.
> >
> > I think it is too late for GCC 11 for this.
> > Especially if the solution would be that we change the behavior of existing
> > attribute, we would need enough time to test everything in the wild that
> > we don't break it badly, even if we add new attributes that cover the
> > previous behavior.  Only if we keep the behavior of existing attribute
> > and add a new one with the new behavior it would be something that could
> > be considered for GCC 11 IMNSHO but then you'd need to change the glibc
> > headers in time too to buy into the new attribute.
> > We need analysis of all GCC targets with target attribute support and
> > handle them consistently.
>
> I think H.J. needs this for a function that isn't even always_inline,
> just extern inline __attribute__ ((gnu_inline)).  Is that aspect
> something that could be solved for GCC 11?

But that should already work, no?  Yes, it won't inline but also not
error.  Unless glibc lacks the out-of-line definition, that is.

Richard.


Re: [PATCH 1/2] openacc: Fix lowering for derived-type mappings through array elements

2021-03-26 Thread Thomas Schwinge
Hi!

On 2021-03-25T12:54:31+0100, I wrote:
> On 2021-02-12T07:46:48-0800, Julian Brown  wrote:
>> --- /dev/null
>> +++ b/libgomp/testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90
>> @@ -0,0 +1,109 @@
>> +[...]
>> +!$acc serial present(var3%t2(5)%t1%arr1)
>> +var3%t2(5)%t1%arr1(:,:) = 6
>> +!$acc end serial
>> +[...]
>
> I've pushed "'libgomp.oacc-fortran/derivedtypes-arrays-1.f90' OpenACC
> 'serial' construct diagnostic for nvptx offloading" to master branch in
> commit 8bafce1be11a301c2421483736c634b8bf330e69, and cherry-picked into
> devel/omp/gcc-10 branch in commit
> c89b23b73edeeb7e3d8cbad278e505c2d6d770c4, see attached.

I'd pushed the wrong thing to devel/omp/gcc-10 branch, so I've now pushed
"Adjust 'libgomp.oacc-fortran/derivedtypes-arrays-1.f90' for og10" in
commit 4777cf66403e311ff3f00bf3d9a60bd5b546f5ed, see attached.


Grüße
 Thomas


>From 4777cf66403e311ff3f00bf3d9a60bd5b546f5ed Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Fri, 26 Mar 2021 15:19:49 +0100
Subject: [PATCH] Adjust 'libgomp.oacc-fortran/derivedtypes-arrays-1.f90' for
 og10

This is a fix-up for og10 commit c89b23b73edeeb7e3d8cbad278e505c2d6d770c4
"'libgomp.oacc-fortran/derivedtypes-arrays-1.f90' OpenACC 'serial' construct
diagnostic for nvptx offloading".

We're missing in og10 a few patches related to diagnostics location
tracking/checking, both compiler-side and testsuite-side.

	libgomp/
	* testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90: Adjust
	for og10.
---
 libgomp/ChangeLog.omp| 5 +
 .../testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90 | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index b0af9c205a38..f131c2c79b7e 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,3 +1,8 @@
+2021-03-26  Thomas Schwinge  
+
+	* testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90: Adjust
+	for og10.
+
 2021-03-25  Kwok Cheung Yeung  
 
 	Backport from mainline
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90
index 7bca2df66285..0208e07ea937 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/derivedtypes-arrays-1.f90
@@ -88,7 +88,7 @@ end do
 !$acc data copyin(var3%t2(5)%t1%arr1)
 
 !$acc serial present(var3%t2(5)%t1%arr1)
-! { dg-warning "using vector_length \\(32\\), ignoring 1" "" { target openacc_nvidia_accel_selected } .-1 }
+! { dg-warning "using vector_length \\(32\\), ignoring 1" "" { target openacc_nvidia_accel_selected } 92 }
 var3%t2(5)%t1%arr1(:,:) = 6
 !$acc end serial
 
-- 
2.30.2



[Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])

2021-03-26 Thread Tobias Burnus

Hi Thomas, hi all,

your commit causes compile failures:

cc1: fatal error: ../lib/on_device_arch.c: No such file or directory

FAIL: libgomp.c/../libgomp.c-c++-common/task-detach-6.c (test for excess errors)
FAIL: libgomp.c/pr99555-1.c (test for excess errors)
FAIL: libgomp.fortran/task-detach-6.f90   -O0  (test for excess errors)

That's with embedded testing, where files are copied into the test directory, 
i.e.
cp .../libgomp/testsuite/libgomp.fortran/../lib/on_device_arch.c 
on_device_arch.c
cp .../libgomp/testsuite/libgomp.fortran/task-detach-6.f90 task-detach-6.f90
and then executed as:
powerpc64le-none-linux-gnu-gcc $TESTDIR/task-detach-6.f90 
../lib/on_device_arch.c
which fails.

How about the following patch? It moves the aux function to 
libgomp.c-c++-common/on_device_arch.c
and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
(Based on the observation that #include with relative paths always works,
while dg-additional-sources may not, depending on how the testsuite is run.)

OK? Or does anyone have a better suggestion?

Tobias

PS: The testcases still FAIL with nvptx offloading – but now at execution time.
I think that's expected, isn't it?  (→ PR99555?)
FAIL: libgomp.c/../libgomp.c-c++-common/pr96390.c execution test
FAIL: libgomp.c/../libgomp.c-c++-common/task-detach-6.c execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O0  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O1  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O2  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -O3 -g  execution test
FAIL: libgomp.fortran/task-detach-6.f90   -Os  execution test

On 25.03.21 13:02, Thomas Schwinge wrote:

Until this gets resolved properly, OK to push something like the attached
(currently testing) "Avoid OpenMP/nvptx execution-time hangs for simple
nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]"?

[...] I've now pushed "Avoid OpenMP/nvptx execution-time hangs for
simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]" to
master branch in commit d99111fd8e12deffdd9a965ce17e8a760d531ec3, see
attached.  "... awaiting proper resolution, of course."

libgomp: Fix on_device_arch.c aux-file handling [PR99555]

libgomp/ChangeLog:

	PR target/99555
* testsuite/libgomp.c-c++-common/task-detach-6.c:
	* testsuite/libgomp.c/pr99555-1.c:
	* testsuite/libgomp.fortran/task-detach-6.f90:
	* testsuite/lib/on_device_arch.c: Removed.
	* testsuite/libgomp.c-c++-common/on_device_arch.c: New test.
	* testsuite/libgomp.c/on_device_arch.c: New test.
	* testsuite/libgomp.fortran/on_device_arch.c: New test.

 libgomp/testsuite/lib/on_device_arch.c | 30 
 .../libgomp.c-c++-common/on_device_arch.c  | 33 ++
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |  2 +-
 libgomp/testsuite/libgomp.c/on_device_arch.c   |  3 ++
 libgomp/testsuite/libgomp.c/pr99555-1.c|  2 +-
 libgomp/testsuite/libgomp.fortran/on_device_arch.c |  3 ++
 .../testsuite/libgomp.fortran/task-detach-6.f90|  2 +-
 7 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/libgomp/testsuite/lib/on_device_arch.c b/libgomp/testsuite/lib/on_device_arch.c
deleted file mode 100644
index 1c0753c..000
--- a/libgomp/testsuite/lib/on_device_arch.c
+++ /dev/null
@@ -1,30 +0,0 @@
-#include <gomp-constants.h>
-
-/* static */ int
-device_arch_nvptx (void)
-{
-  return GOMP_DEVICE_NVIDIA_PTX;
-}
-
-#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
-/* static */ int
-device_arch (void)
-{
-  return GOMP_DEVICE_DEFAULT;
-}
-
-static int
-on_device_arch (int d)
-{
-  int d_cur;
-  #pragma omp target map(from:d_cur)
-  d_cur = device_arch ();
-
-  return d_cur == d;
-}
-
-int
-on_device_arch_nvptx ()
-{
-  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
-}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.c b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.c
new file mode 100644
index 000..00524b5
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.c
@@ -0,0 +1,33 @@
+/* Auxiliar file.  */
+/* { dg-do compile  { target skip-all-targets } } */
+/* Note: this file is also #included in ../libgomp.fortran/on_device_arch.c  */
+#include <gomp-constants.h>
+
+/* static */ int
+device_arch_nvptx (void)
+{
+  return GOMP_DEVICE_NVIDIA_PTX;
+}
+
+#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
+/* static */ int
+device_arch (void)
+{
+  return GOMP_DEVICE_DEFAULT;
+}
+
+static int
+on_device_arch (int d)
+{
+  int d_cur;
+  #pragma omp target map(from:d_cur)
+  d_cur = device_arch ();
+
+  return d_c

Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])

2021-03-26 Thread Jakub Jelinek via Gcc-patches
On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:
> How about the following patch? It moves the aux function to 
> libgomp.c-c++-common/on_device_arch.c
> and #includes it in the new wrapper files 
> libgomp.{c,fortran}/on_device_arch.c.
> (Based on the observation that #include with relative paths always works,
> while dg-additional-sources may not, depending how the testsuite it run.)
> 
> OK? Or does anyone have a better suggestion?

For C/C++, why do we call it on_device_arch.c at all?  Can't be just
on_device_arch.h that is #included in each test instead of additional
sources?  If we don't like inlining, just use noinline attribute, but I
don't see why inlining would hurt.
For Fortran, sure, we can't include it, so let's add
libgomp.fortran/on_device_arch.c that #includes that header.

Jakub



[committed] libphobos: Build all modules with -fversion=Shared when configured with --enable-shared

2021-03-26 Thread Iain Buclaw via Gcc-patches
Hi,

The libgdruntime_convenience library was built with `-fversion=Shared',
but the libphobos part wasn't when creating the static library.

As there are no issues with compiling Shared code into the static library,
to avoid mismatches the flag is now always present when --enable-shared
is turned on.  Libtool's compiler PIC D flag is now the combination of the
compiler PIC and D Shared flags, and AM_DFLAGS passes `-prefer-pic' to
libtool unless --enable-shared is turned off.

Bootstrapped and regression tested on x86_64-linux-gnu/-m32/-mx32, and
committed to mainline.

Regards,
Iain

---
libphobos/ChangeLog:

* Makefile.in: Regenerate.
* configure: Regenerate.
* configure.ac: Substitute enable_shared, enable_static, and
phobos_lt_pic_flag.
* libdruntime/Makefile.am (AM_DFLAGS): Replace
  phobos_compiler_pic_flag with phobos_lt_pic_flags, and
  phobos_compiler_shared_flag.
* libdruntime/Makefile.in: Regenerate.
* src/Makefile.am (AM_DFLAGS): Replace phobos_compiler_pic_flag
  with phobos_lt_pic_flag, and phobos_compiler_shared_flag.
* src/Makefile.in: Regenerate.
* testsuite/Makefile.in: Regenerate.
* testsuite/libphobos.druntime_shared/druntime_shared.exp: Remove
-fversion=Shared and -fno-moduleinfo from default extra test flags.
* testsuite/libphobos.phobos_shared/phobos_shared.exp: Likewise.
* testsuite/testsuite_flags.in: Add phobos_compiler_shared_flag to
--gdcflags.
---
 libphobos/Makefile.in |  3 +++
 libphobos/configure   | 24 ---
 libphobos/configure.ac| 17 +++--
 libphobos/libdruntime/Makefile.am |  2 +-
 libphobos/libdruntime/Makefile.in |  5 +++-
 libphobos/src/Makefile.am |  2 +-
 libphobos/src/Makefile.in |  5 +++-
 libphobos/testsuite/Makefile.in   |  3 +++
 .../druntime_shared.exp   |  4 ++--
 .../libphobos.phobos_shared/phobos_shared.exp |  4 ++--
 libphobos/testsuite/testsuite_flags.in|  3 ++-
 11 files changed, 47 insertions(+), 25 deletions(-)

diff --git a/libphobos/Makefile.in b/libphobos/Makefile.in
index d42248405a2..eab12688867 100644
--- a/libphobos/Makefile.in
+++ b/libphobos/Makefile.in
@@ -298,6 +298,8 @@ datadir = @datadir@
 datarootdir = @datarootdir@
 docdir = @docdir@
 dvidir = @dvidir@
+enable_shared = @enable_shared@
+enable_static = @enable_static@
 exec_prefix = @exec_prefix@
 gcc_version = @gcc_version@
 gdc_include_dir = @gdc_include_dir@
@@ -327,6 +329,7 @@ oldincludedir = @oldincludedir@
 pdfdir = @pdfdir@
 phobos_compiler_pic_flag = @phobos_compiler_pic_flag@
 phobos_compiler_shared_flag = @phobos_compiler_shared_flag@
+phobos_lt_pic_flag = @phobos_lt_pic_flag@
 prefix = @prefix@
 program_transform_name = @program_transform_name@
 psdir = @psdir@
diff --git a/libphobos/configure b/libphobos/configure
index c940a404be4..59ca64aa1e0 100755
--- a/libphobos/configure
+++ b/libphobos/configure
@@ -705,6 +705,9 @@ libphobos_builddir
 get_gcc_base_ver
 phobos_compiler_shared_flag
 phobos_compiler_pic_flag
+phobos_lt_pic_flag
+enable_static
+enable_shared
 OTOOL64
 OTOOL
 LIPO
@@ -11746,7 +11749,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11749 "configure"
+#line 11752 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11852,7 +11855,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11855 "configure"
+#line 11858 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -13997,8 +14000,14 @@ CFLAGS=$lt_save_CFLAGS
   GDCFLAGS=$gdc_save_DFLAGS
 
 
+
+
 # libtool variables for Phobos shared and position-independent compiles.
 #
+# Use phobos_lt_pic_flag to designate the automake variable
+# used to encapsulate the default libtool approach to creating objects
+# with position-independent code. Default: -prefer-pic.
+#
 # Use phobos_compiler_shared_flag to designate the compile-time flags for
 # creating shared objects. Default: -fversion=Shared.
 #
@@ -14010,26 +14019,23 @@ CFLAGS=$lt_save_CFLAGS
 # libtool, and so we make it here.  How it is handled is that in shared
 # compilations the `lt_prog_compiler_pic_D' variable is used to instead
 # ensure that conditional compilation of shared runtime code is compiled in.
-# The original PIC flags are then used in the compilation of every object.
-#
-# Why are objects destined for libgphobos.a compiled with -fPIC?
-# Because -fPIC is not harmful to use for objects destined for static
-# libraries. In addition, using -fPIC will allow the use of static
-# libgphobos.a in the creation of other D shared libraries.
 if test "$enable_shared" = yes; then
+  phobos_lt_pic_flag="-prefer-pic"
   phobos_compiler_pic_flag="$lt_prog_compi

[committed] [freebsd] d: Fix build failures on sparc64-*-freebsd*

2021-03-26 Thread Iain Buclaw via Gcc-patches
Hi,

This patch fixes a build issue on sparc64-freebsd targets: all platforms
that can run on SPARC should include this header in order to avoid
errors from memmodel being used in sparc-protos.h.

Bootstrapped on x86_64-freebsd12 and committed to mainline.

Regards
Iain

---
gcc/ChangeLog:

* config/freebsd-d.c: Include memmodel.h.
---
 gcc/config/freebsd-d.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/config/freebsd-d.c b/gcc/config/freebsd-d.c
index 425ca8365ba..8a8ddd92884 100644
--- a/gcc/config/freebsd-d.c
+++ b/gcc/config/freebsd-d.c
@@ -18,6 +18,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
+#include "memmodel.h"
 #include "tm.h"
 #include "tm_p.h"
 #include "d/d-target.h"
-- 
2.27.0



[committed] d: Define IN_TARGET_CODE in all machine-specific D language files.

2021-03-26 Thread Iain Buclaw via Gcc-patches
Hi,

This patch defines IN_TARGET_CODE in all D language support files in the
back-end, to be consistent with other machine-specific files.

Bootstrapped and regression tested on x86_64-linux-gnu, and committed to
mainline as obvious.

Regards,
Iain.

---
gcc/ChangeLog:

* config/aarch64/aarch64-d.c (IN_TARGET_CODE): Define.
* config/arm/arm-d.c (IN_TARGET_CODE): Likewise.
* config/i386/i386-d.c (IN_TARGET_CODE): Likewise.
* config/mips/mips-d.c (IN_TARGET_CODE): Likewise.
* config/pa/pa-d.c (IN_TARGET_CODE): Likewise.
* config/riscv/riscv-d.c (IN_TARGET_CODE): Likewise.
* config/rs6000/rs6000-d.c (IN_TARGET_CODE): Likewise.
* config/s390/s390-d.c (IN_TARGET_CODE): Likewise.
* config/sparc/sparc-d.c (IN_TARGET_CODE): Likewise.
---
 gcc/config/aarch64/aarch64-d.c | 2 ++
 gcc/config/arm/arm-d.c | 2 ++
 gcc/config/i386/i386-d.c   | 2 ++
 gcc/config/mips/mips-d.c   | 2 ++
 gcc/config/pa/pa-d.c   | 2 ++
 gcc/config/riscv/riscv-d.c | 2 ++
 gcc/config/rs6000/rs6000-d.c   | 2 ++
 gcc/config/s390/s390-d.c   | 2 ++
 gcc/config/sparc/sparc-d.c | 2 ++
 9 files changed, 18 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-d.c b/gcc/config/aarch64/aarch64-d.c
index 5c9b4fa6fb8..4fce593ac27 100644
--- a/gcc/config/aarch64/aarch64-d.c
+++ b/gcc/config/aarch64/aarch64-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/arm/arm-d.c b/gcc/config/arm/arm-d.c
index 76ede3b6d44..2cb9f4bd899 100644
--- a/gcc/config/arm/arm-d.c
+++ b/gcc/config/arm/arm-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/i386/i386-d.c b/gcc/config/i386/i386-d.c
index cbd3ceb187d..b79be85e661 100644
--- a/gcc/config/i386/i386-d.c
+++ b/gcc/config/i386/i386-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/mips/mips-d.c b/gcc/config/mips/mips-d.c
index dad101cf7eb..dc57127791c 100644
--- a/gcc/config/mips/mips-d.c
+++ b/gcc/config/mips/mips-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/pa/pa-d.c b/gcc/config/pa/pa-d.c
index 1de49df12cc..663e749995a 100644
--- a/gcc/config/pa/pa-d.c
+++ b/gcc/config/pa/pa-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/riscv/riscv-d.c b/gcc/config/riscv/riscv-d.c
index 2b690b18cfd..b20b778bd35 100644
--- a/gcc/config/riscv/riscv-d.c
+++ b/gcc/config/riscv/riscv-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/rs6000/rs6000-d.c b/gcc/config/rs6000/rs6000-d.c
index 14c4133f305..6bfe8130dd3 100644
--- a/gcc/config/rs6000/rs6000-d.c
+++ b/gcc/config/rs6000/rs6000-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/s390/s390-d.c b/gcc/config/s390/s390-d.c
index 155144ce7b8..2f945ebfa12 100644
--- a/gcc/config/s390/s390-d.c
+++ b/gcc/config/s390/s390-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
+#define IN_TARGET_CODE 1
+
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/sparc/sparc-d.c b/gcc/config/sparc/sparc-d.c
index 186e965ae84..0eb663bb132 100644
--- a/gcc/config/sparc/sparc-d.c
+++ b/gcc/config/sparc/sparc-d.c
@@ -15,6 +15,8 @@ You should have received a copy of the GNU General Public 
License
 along with GCC; see the file COPYING

Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])

2021-03-26 Thread Tobias Burnus

Hi Jakub,

great suggestion – I have now done as proposed.

On 26.03.21 15:46, Jakub Jelinek via Gcc-patches wrote:

On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:

How about the following patch? It moves the aux function to 
libgomp.c-c++-common/on_device_arch.c
and #includes it in the new wrapper files libgomp.{c,fortran}/on_device_arch.c.
(Based on the observation that #include with relative paths always works,
while dg-additional-sources may not, depending how the testsuite it run.) [...]

For C/C++, why do we call it on_device_arch.c at all?  Can't be just
on_device_arch.h that is #included in each test instead of additional
sources?  If we don't like inlining, just use noinline attribute, but I
don't see why inlining would hurt.
For Fortran, sure, we can't include it, so let's add
libgomp.fortran/on_device_arch.c that #includes that header.


OK?

Tobias

libgomp: Fix on_device_arch.c aux-file handling [PR99555]

libgomp/ChangeLog:

	PR target/99555
* testsuite/lib/on_device_arch.c: Move to ...
* testsuite/libgomp.c-c++-common/on_device_arch.h: ... here.
* testsuite/libgomp.fortran/on_device_arch.c: New file;
	#include on_device_arch.h.
* testsuite/libgomp.c-c++-common/task-detach-6.c: #include
	on_device_arch.h instead of using dg-additional-source.
* testsuite/libgomp.c/pr99555-1.c: Likewise.
* testsuite/libgomp.fortran/task-detach-6.f90: Update to use
	on_device_arch.c without relative paths.

 libgomp/testsuite/lib/on_device_arch.c | 30 --
 .../libgomp.c-c++-common/on_device_arch.h  | 30 ++
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |  4 +--
 libgomp/testsuite/libgomp.c/pr99555-1.c|  3 +--
 libgomp/testsuite/libgomp.fortran/on_device_arch.c |  3 +++
 .../testsuite/libgomp.fortran/task-detach-6.f90|  2 +-
 6 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/libgomp/testsuite/lib/on_device_arch.c b/libgomp/testsuite/lib/on_device_arch.c
deleted file mode 100644
index 1c0753c..000
--- a/libgomp/testsuite/lib/on_device_arch.c
+++ /dev/null
@@ -1,30 +0,0 @@
-#include <gomp-constants.h>
-
-/* static */ int
-device_arch_nvptx (void)
-{
-  return GOMP_DEVICE_NVIDIA_PTX;
-}
-
-#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
-/* static */ int
-device_arch (void)
-{
-  return GOMP_DEVICE_DEFAULT;
-}
-
-static int
-on_device_arch (int d)
-{
-  int d_cur;
-  #pragma omp target map(from:d_cur)
-  d_cur = device_arch ();
-
-  return d_cur == d;
-}
-
-int
-on_device_arch_nvptx ()
-{
-  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
-}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h
new file mode 100644
index 000..1c0753c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h
@@ -0,0 +1,30 @@
+#include <gomp-constants.h>
+
+/* static */ int
+device_arch_nvptx (void)
+{
+  return GOMP_DEVICE_NVIDIA_PTX;
+}
+
+#pragma omp declare variant (device_arch_nvptx) match(construct={target},device={arch(nvptx)})
+/* static */ int
+device_arch (void)
+{
+  return GOMP_DEVICE_DEFAULT;
+}
+
+static int
+on_device_arch (int d)
+{
+  int d_cur;
+  #pragma omp target map(from:d_cur)
+  d_cur = device_arch ();
+
+  return d_cur == d;
+}
+
+int
+on_device_arch_nvptx ()
+{
+  return on_device_arch (GOMP_DEVICE_NVIDIA_PTX);
+}
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index 4a3e4a2..119d7f5 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -1,10 +1,8 @@
 /* { dg-do run } */
 
-/* { dg-additional-sources "../lib/on_device_arch.c" } */
-extern int on_device_arch_nvptx ();
-
 #include 
 #include 
+#include "on_device_arch.h"
 
 /* Test tasks with detach clause on an offload device.  Each device
thread spawns off a chain of tasks, that can then be executed by
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
index 9ba3309..0dc17bf 100644
--- a/libgomp/testsuite/libgomp.c/pr99555-1.c
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -2,8 +2,7 @@
 
 // { dg-additional-options "-O0" }
 
-// { dg-additional-sources "../lib/on_device_arch.c" }
-extern int on_device_arch_nvptx ();
+#include "../libgomp.c-c++-common/on_device_arch.h"
 
 int main (void)
 {
diff --git a/libgomp/testsuite/libgomp.fortran/on_device_arch.c b/libgomp/testsuite/libgomp.fortran/on_device_arch.c
new file mode 100644
index 000..98822c4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/on_device_arch.c
@@ -0,0 +1,3 @@
+/* Auxiliar file.  */
+/* { dg-do compile  { target skip-all-targets } } */
+#include "../libgomp

Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Florian Weimer
* Richard Biener:

>> I think H.J. needs this for a function that isn't even always_inline,
>> just extern inline __attribute__ ((gnu_inline)).  Is that aspect
>> something that could be solved for GCC 11?
>
> But that should already work, no?  Yes, it won't inline but also not
> error.  Unless glibc lacks the out-of-line definition, that is.

It does not work:

extern double strtod (const char *, char **);

extern __inline __attribute__ ((__gnu_inline__)) double
atof (const char *__nptr)
{
  return strtod (__nptr, (char **) ((void *)0));
}

fails with -mno-sse:

t.c: In function ‘atof’:
t.c:5:1: error: SSE register return with SSE disabled

I don't think we need to support calling atof under these
circumstances (in fact, this is impossible to support because there is
no ABI we could use for the call).  But we need to ignore the inline
function definition, like we ignore function declarations.  Otherwise
we'll have to patch a lot of headers to support -mno-sse.

Or has this already been fixed differently in GCC 11?


Re: [Patch] libgomp: Fix on_device_arch.c aux-file handling [PR99555] (was: [WIP] Re: [PATCH] openmp: Fix intermittent hanging of task-detach-6 libgomp tests [PR98738])

2021-03-26 Thread Jakub Jelinek via Gcc-patches
On Fri, Mar 26, 2021 at 04:19:56PM +0100, Tobias Burnus wrote:
> Hi Jakub,
> 
> great suggestion – I did now as proposed.
> 
> On 26.03.21 15:46, Jakub Jelinek via Gcc-patches wrote:
> > On Fri, Mar 26, 2021 at 03:42:22PM +0100, Tobias Burnus wrote:
> > > How about the following patch? It moves the aux function to 
> > > libgomp.c-c++-common/on_device_arch.c
> > > and #includes it in the new wrapper files 
> > > libgomp.{c,fortran}/on_device_arch.c.
> > > (Based on the observation that #include with relative paths always works,
> > > while dg-additional-sources may not, depending how the testsuite it run.) 
> > > [...]
> > For C/C++, why do we call it on_device_arch.c at all?  Can't be just
> > on_device_arch.h that is #included in each test instead of additional
> > sources?  If we don't like inlining, just use noinline attribute, but I
> > don't see why inlining would hurt.
> > For Fortran, sure, we can't include it, so let's add
> > libgomp.fortran/on_device_arch.c that #includes that header.
> 
> OK?

LGTM, but please give Thomas a chance to chime in.

Jakub



testsuite/arm: Improve scan-assembler in pr96770.c

2021-03-26 Thread Christophe Lyon via Gcc-patches
Hi,

I'm seeing random scan-assembler-times failures in pr96770.c when LTO is used.

I suspect this is because the \\+4 string sometimes matches the LTO sections.

I propose this small patch to avoid the issue, by matching arr\\+4 instead. OK?

2021-03-26  Christophe Lyon  

gcc/testsuite/
* gcc.target/arm/pure-code/pr96770.c: Improve scan-assembler-times.

diff --git a/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
b/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
index a43d71f..ae1bd10 100644
--- a/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
+++ b/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
@@ -5,17 +5,17 @@ int arr[1000];
 int *f4 (void) { return &arr[1]; }

 /* For cortex-m0 (thumb-1/v6m), we generate 4 movs with upper/lower:#arr+4.  */
-/* { dg-final { scan-assembler-times "\\+4" 4 { target { { !
arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
+/* { dg-final { scan-assembler-times "arr\\+4" 4 { target { { !
arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */

 /* For cortex-m with movt/movw (thumb-1/v8m.base or thumb-2), we
generate a movt/movw pair with upper/lower:#arr+4.  */
-/* { dg-final { scan-assembler-times "\\+4" 2 { target {
arm_thumb1_movt_ok || arm_thumb2_ok } } } } */
+/* { dg-final { scan-assembler-times "arr\\+4" 2 { target {
arm_thumb1_movt_ok || arm_thumb2_ok } } } } */

 int *f5 (void) { return &arr[80]; }

 /* For cortex-m0 (thumb-1/v6m), we generate 1 ldr from rodata pointer
to arr+320.  */
-/* { dg-final { scan-assembler-times "\\+320" 1 { target { { !
arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
+/* { dg-final { scan-assembler-times "arr\\+320" 1 { target { { !
arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */

 /* For cortex-m with movt/movw (thumb-1/v8m.base or thumb-2), we
generate a movt/movw pair with upper/lower:arr+320.  */
-/* { dg-final { scan-assembler-times "\\+320" 2 { target {
arm_thumb1_movt_ok || arm_thumb2_ok } } } } */
+/* { dg-final { scan-assembler-times "arr\\+320" 2 { target {
arm_thumb1_movt_ok || arm_thumb2_ok } } } } */


Re: [PATCH] d: Add windows support for D compiler (PR91595)

2021-03-26 Thread ibuclaw--- via Gcc-patches
> On 21/03/2021 12:58 Iain Buclaw  wrote:
> 
>  
> Hi,
> 
> This patch adds necessary backend support for MinGW/Cygwin targets so
> that all relevant predefined version conditions are available, a
> prerequesite for building most parts of libphobos.
> 

After some more testing building libphobos on MinGW, it was identified
that the version identifiers CRuntime_Microsoft and CRuntime_Newlib need
to be present, as well as definitions for the MINFO section support code.

Bootstrapped on x86_64-w64-mingw64, and committed to mainline.

Regards,
Iain.

---
gcc/ChangeLog:

PR d/91595
* config.gcc (*-*-cygwin*): Add winnt-d.o
(*-*-mingw*): Likewise.
* config/i386/cygwin.h (EXTRA_TARGET_D_OS_VERSIONS): New macro.
* config/i386/mingw32.h (EXTRA_TARGET_D_OS_VERSIONS): Likewise.
* config/i386/t-cygming: Add winnt-d.o.
* config/i386/winnt-d.c: New file.
---
 gcc/config.gcc|  6 +
 gcc/config/i386/cygwin.h  |  9 +++
 gcc/config/i386/mingw32.h | 12 +
 gcc/config/i386/t-cygming |  4 +++
 gcc/config/i386/winnt-d.c | 56 +++
 5 files changed, 87 insertions(+)
 create mode 100644 gcc/config/i386/winnt-d.c

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 34e732d861b..997a9f61a5c 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -2123,6 +2123,8 @@ i[34567]86-*-cygwin*)
extra_objs="${extra_objs} winnt.o winnt-stubs.o"
c_target_objs="${c_target_objs} msformat-c.o"
cxx_target_objs="${cxx_target_objs} winnt-cxx.o msformat-c.o"
+   d_target_objs="${d_target_objs} winnt-d.o"
+   target_has_targetdm="yes"
if test x$enable_threads = xyes; then
thread_file='posix'
fi
@@ -2139,6 +2141,8 @@ x86_64-*-cygwin*)
extra_objs="${extra_objs} winnt.o winnt-stubs.o"
c_target_objs="${c_target_objs} msformat-c.o"
cxx_target_objs="${cxx_target_objs} winnt-cxx.o msformat-c.o"
+   d_target_objs="${d_target_objs} winnt-d.o"
+   target_has_targetdm="yes"
if test x$enable_threads = xyes; then
thread_file='posix'
fi
@@ -2151,7 +2155,9 @@ i[34567]86-*-mingw* | x86_64-*-mingw*)
xm_file=i386/xm-mingw32.h
c_target_objs="${c_target_objs} winnt-c.o"
cxx_target_objs="${cxx_target_objs} winnt-c.o"
+   d_target_objs="${d_target_objs} winnt-d.o"
target_has_targetcm="yes"
+   target_has_targetdm="yes"
case ${target} in
x86_64-*-* | *-w64-*)
need_64bit_isa=yes
diff --git a/gcc/config/i386/cygwin.h b/gcc/config/i386/cygwin.h
index db0a3cc0b35..71fb6135c2c 100644
--- a/gcc/config/i386/cygwin.h
+++ b/gcc/config/i386/cygwin.h
@@ -29,6 +29,15 @@ along with GCC; see the file COPYING3.  If not see
 }  \
   while (0)
 
+#define EXTRA_TARGET_D_OS_VERSIONS()   \
+  do   \
+{  \
+  builtin_version ("Cygwin");  \
+  builtin_version ("Posix");   \
+  builtin_version ("CRuntime_Newlib"); \
+}  \
+  while (0)
+
 #undef CPP_SPEC
 #define CPP_SPEC "%(cpp_cpu) %{posix:-D_POSIX_SOURCE} \
   %{!ansi:-Dunix} \
diff --git a/gcc/config/i386/mingw32.h b/gcc/config/i386/mingw32.h
index 1a6a3a07ca6..36e7bae5e1b 100644
--- a/gcc/config/i386/mingw32.h
+++ b/gcc/config/i386/mingw32.h
@@ -53,6 +53,18 @@ along with GCC; see the file COPYING3.  If not see
 }  \
   while (0)
 
+#define EXTRA_TARGET_D_OS_VERSIONS()   \
+  do   \
+{  \
+  builtin_version ("MinGW");   \
+  if (TARGET_64BIT && ix86_abi == MS_ABI)  \
+   builtin_version ("Win64");  \
+  else if (!TARGET_64BIT)  \
+   builtin_version ("Win32");  \
+  builtin_version ("CRuntime_Microsoft");  \
+}  \
+  while (0)
+
 #ifndef TARGET_USE_PTHREAD_BY_DEFAULT
 #define SPEC_PTHREAD1 "pthread"
 #define SPEC_PTHREAD2 "!no-pthread"
diff --git a/gcc/config/i386/t-cygming b/gcc/config/i386/t-cygming
index 7ccbb84adad..38e2f0be237 100644
--- a/gcc/config/i386/t-cygming
+++ b/gcc/config/i386/t-cygming
@@ -39,6 +39,10 @@ winnt-stubs.o: $(srcdir)/config/i386/winnt-stubs.c 
$(CONFIG_H) $(SYSTEM_H) coret
$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
$(srcdir)/config/i386/winnt-stubs.c
 
+winnt-d.o: $(srcdir)/config/i386/w

Re: [PATCH] x86: Skip ISA check for always_inline in system headers

2021-03-26 Thread Richard Biener via Gcc-patches
On March 26, 2021 4:20:28 PM GMT+01:00, Florian Weimer  
wrote:
>* Richard Biener:
>
>>> I think H.J. needs this for a function that isn't even
>always_inline,
>>> just extern inline __attribute__ ((gnu_inline)).  Is that aspect
>>> something that could be solved for GCC 11?
>>
>> But that should already work, no?  Yes, it won't inline but also not
>> error.  Unless glibc lacks the out-of-line definition, that is.
>
>It does not work:
>
>extern double strtod (const char *, char **);
>
>extern __inline __attribute__ ((__gnu_inline__)) double
>atof (const char *__nptr)
>{
>  return strtod (__nptr, (char **) ((void *)0));
>}
>
>fails with -mno-sse:
>
>t.c: In function ‘atof’:
>t.c:5:1: error: SSE register return with SSE disabled
>
>I don't think we need to support calling atof under these
>circumstances (in fact, this is impossible to support because there is
>no ABI we could use for the call).  But we need to ignore the inline
>function definition, like we ignore function declarations.  Otherwise
>we'll have to patch a lot of headers to support -mno-sse.
>
>Or has this already been fixed differently in GCC 11?

I think that has been fixed differently already.

Richard. 


aarch64: Opt-in tweaks to the AArch64 vector cost model

2021-03-26 Thread Richard Sandiford via Gcc-patches
SVE uses VECT_COMPARE_COSTS to tell the vectoriser to try as many
variations as it knows and pick the one with the lowest cost.
This serves two purposes:

(1) It means we can compare SVE loops that operate on packed vectors
with SVE loops that operate on unpacked vectors.

(2) It means that we can compare SVE with Advanced SIMD.

Although we used VECT_COMPARE_COSTS for both of these purposes from the
outset, the focus initially was more on (1).  Adding VECT_COMPARE_COSTS
allowed us to use SVE extending loads and truncating stores, in which
loads and stores effectively operate on unpacked rather than packed
vectors.  This part seems to work pretty well in practice.

However, it turns out that the second part (Advanced SIMD vs. SVE)
is less reliable.  There are three main reasons for this:

* At the moment, the AArch64 vector cost structures stick rigidly to the
  vect_cost_for_stmt enumeration provided by target-independent code.
  This particularly affects vec_to_scalar, which is used for at least:

  - reductions
  - extracting an element from a vector to do scalar arithmetic
  - extracting an element to store it out

  The vectoriser gives us the information we need to distinguish
  these cases, but the port wasn't using it.  Other problems include
  undercosting LD[234] and ST[234] instructions and scatter stores.

* Currently, the vectoriser costing works by adding up what are typically
  latency values.  As Richi mentioned recently in an x86 context,
  this effectively means that we treat the scalar and vector code
  as executing serially.  That already causes some problems for
  Advanced SIMD vs. scalar code, but it turns out to be particularly
  a problem when comparing SVE with Advanced SIMD.  Scalar, Advanced
  SIMD and SVE can have significantly different issue characteristics,
  and summing latencies misses some important details, especially in
  loops involving reductions.

* Advanced SIMD code can be completely unrolled at compile time,
  but length-agnostic SVE code can't.  We weren't taking this into
  account when comparing the costs.

This series of patches tries to address these problems by making
some opt-in tweaks to the vector cost model.  It produces much better
results on the SVE workloads that we've tried internally.  We'd therefore
like to put this in for GCC 11.

I'm really sorry that this is landing so late in stage 4.  Clearly it
would have been much better to do this earlier.  However:

- The patches “only” change the vector cost hooks.  There are no changes
  elsewhere.  In other words, the SVE code we generate and the Advanced
  SIMD code we generate is unchanged: the “only” thing we're doing is
  using different heuristics to select between them.

- As mentioned above, almost all the new code is “opt-in”.  Therefore,
  only CPUs that explicitly want it (and will benefit from it) will be
  affected.  Most of the code is not executed otherwise.

Tested on aarch64-linux-gnu (with and without SVE), pushed to trunk.

Richard


[PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs

2021-03-26 Thread Richard Sandiford via Gcc-patches
This patch is part of a series that makes opt-in tweaks to the
AArch64 vector cost model.

At the moment, all reductions are costed as vec_to_scalar, which
also includes things like extracting a single element from a vector.
This is a bit too coarse in practice, since the cost of a reduction
depends very much on the type of value that it's processing.
This patch therefore adds separate costs for each case.  To start with,
all the new costs are copied from the associated vec_to_scalar ones.
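
As an illustration (this example is not part of the patch), the effect of
the element type on reduction depth is easy to see in a byte reduction:

/* With 128-bit vectors, reducing sixteen 8-bit lanes to a scalar needs
   log2(16) = 4 pairwise steps, whereas reducing two 64-bit lanes needs
   only one, so a single shared vec_to_scalar cost cannot describe both
   cases well.  */
int
sum_bytes (const unsigned char *x, int n)
{
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += x[i];
  return s;
}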

Due to the extreme lateness of this patch in the GCC 11 cycle, I've added
a new tuning flag (use_new_vector_costs) that selects the new behaviour.
This should help to ensure that the risk of the new code is only borne
by the CPUs that need it.  Generic tuning is not affected.

gcc/
* config/aarch64/aarch64-tuning-flags.def (use_new_vector_costs):
New tuning flag.
* config/aarch64/aarch64-protos.h (simd_vec_cost): Put comments
above the fields rather than to the right.
(simd_vec_cost::reduc_i8_cost): New member variable.
(simd_vec_cost::reduc_i16_cost): Likewise.
(simd_vec_cost::reduc_i32_cost): Likewise.
(simd_vec_cost::reduc_i64_cost): Likewise.
(simd_vec_cost::reduc_f16_cost): Likewise.
(simd_vec_cost::reduc_f32_cost): Likewise.
(simd_vec_cost::reduc_f64_cost): Likewise.
* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
accordingly, using the vec_to_scalar_cost for the new fields.
(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
(thunderx3t110_advsimd_vector_cost): Likewise.
(aarch64_use_new_vector_costs_p): New function.
(aarch64_simd_vec_costs): New function, split out from...
(aarch64_builtin_vectorization_cost): ...here.
(aarch64_is_reduction): New function.
(aarch64_detect_vector_stmt_subtype): Likewise.
(aarch64_add_stmt_cost): Call aarch64_detect_vector_stmt_subtype if
using the new vector costs.
---
 gcc/config/aarch64/aarch64-protos.h |  56 --
 gcc/config/aarch64/aarch64-tuning-flags.def |   2 +
 gcc/config/aarch64/aarch64.c| 180 +++-
 3 files changed, 216 insertions(+), 22 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index ff87ced2a34..e4eeb2ce142 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -194,22 +194,46 @@ struct cpu_regmove_cost
 
 struct simd_vec_cost
 {
-  const int int_stmt_cost; /* Cost of any int vector operation,
-  excluding load, store, permute,
-  vector-to-scalar and
-  scalar-to-vector operation.  */
-  const int fp_stmt_cost;   /* Cost of any fp vector operation,
-   excluding load, store, permute,
-   vector-to-scalar and
-   scalar-to-vector operation.  */
-  const int permute_cost;   /* Cost of permute operation.  */
-  const int vec_to_scalar_cost; /* Cost of vec-to-scalar 
operation.  */
-  const int scalar_to_vec_cost; /* Cost of scalar-to-vector
-   operation.  */
-  const int align_load_cost;/* Cost of aligned vector load.  */
-  const int unalign_load_cost;  /* Cost of unaligned vector load.  */
-  const int unalign_store_cost; /* Cost of unaligned vector store.  */
-  const int store_cost; /* Cost of vector store.  */
+  /* Cost of any integer vector operation, excluding the ones handled
+ specially below.  */
+  const int int_stmt_cost;
+
+  /* Cost of any fp vector operation, excluding the ones handled
+ specially below.  */
+  const int fp_stmt_cost;
+
+  /* Cost of a permute operation.  */
+  const int permute_cost;
+
+  /* Cost of reductions for various vector types: iN is for N-bit
+ integer elements and fN is for N-bit floating-point elements.
+ We need to single out the element type because it affects the
+ depth of the reduction.  */
+  const int reduc_i8_cost;
+  const int reduc_i16_cost;
+  const int reduc_i32_cost;
+  const int reduc_i64_cost;
+  const int reduc_f16_cost;
+  const int reduc_f32_cost;
+  const int reduc_f64_cost;
+
+  /* Cost of a vector-to-scalar operation.  */
+  const int vec_to_scalar_cost;
+
+  /* Cost of a scalar-to-vector operation.  */
+  const int scalar_to_vec_cost;
+
+  /* Cost of an aligned vector load.  */
+  const int al

[PATCH 02/13] aarch64: Add vector costs for SVE CLAST[AB] and FADDA

2021-03-26 Thread Richard Sandiford via Gcc-patches
Following on from the previous reduction costs patch, this one
adds costs for the SVE CLAST[AB] and FADDA instructions.
These instructions occur within the loop body, whereas the
reductions handled by the previous patch occur outside.
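
As an illustration (this example is not part of the patch), FADDA is what
SVE uses for strictly-ordered floating-point reductions:

/* Without -ffast-math the additions below must happen in source order,
   so the SVE loop accumulates with FADDA inside the loop body; that
   in-loop latency is what fadda_f32_cost is meant to model.  */
float
sum_in_order (const float *x, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += x[i];
  return s;
}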

Like with the previous patch, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64-protos.h (sve_vec_cost): Turn into a
derived class of simd_vec_cost.  Add information about CLAST[AB]
and FADDA instructions.
* config/aarch64/aarch64.c (generic_sve_vector_cost): Update
accordingly, using the vec_to_scalar costs for the new fields.
(a64fx_sve_vector_cost): Likewise.
(aarch64_reduc_type): New function.
(aarch64_sve_in_loop_reduction_latency): Likewise.
(aarch64_detect_vector_stmt_subtype): Take a vinfo parameter.
Use aarch64_sve_in_loop_reduction_latency to handle SVE reductions
that occur in the loop body.
(aarch64_add_stmt_cost): Update call accordingly.
---
 gcc/config/aarch64/aarch64-protos.h |  28 +-
 gcc/config/aarch64/aarch64.c| 150 +---
 2 files changed, 141 insertions(+), 37 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index e4eeb2ce142..bfcab72b122 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -237,7 +237,33 @@ struct simd_vec_cost
 };
 
 typedef struct simd_vec_cost advsimd_vec_cost;
-typedef struct simd_vec_cost sve_vec_cost;
+
+/* SVE-specific extensions to the information provided by simd_vec_cost.  */
+struct sve_vec_cost : simd_vec_cost
+{
+  constexpr sve_vec_cost (const simd_vec_cost &base,
+ unsigned int clast_cost,
+ unsigned int fadda_f16_cost,
+ unsigned int fadda_f32_cost,
+ unsigned int fadda_f64_cost)
+: simd_vec_cost (base),
+  clast_cost (clast_cost),
+  fadda_f16_cost (fadda_f16_cost),
+  fadda_f32_cost (fadda_f32_cost),
+  fadda_f64_cost (fadda_f64_cost)
+  {}
+
+  /* The cost of a vector-to-scalar CLASTA or CLASTB instruction,
+ with the scalar being stored in FP registers.  This cost is
+ assumed to be a cycle latency.  */
+  const int clast_cost;
+
+  /* The costs of FADDA for the three data types that it supports.
+ These costs are assumed to be cycle latencies.  */
+  const int fadda_f16_cost;
+  const int fadda_f32_cost;
+  const int fadda_f64_cost;
+};
 
 /* Cost for vector insn classes.  */
 struct cpu_vector_cost
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b44dcdc6a6e..b62169a267a 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -609,22 +609,28 @@ static const advsimd_vec_cost generic_advsimd_vector_cost 
=
 /* Generic costs for SVE vector operations.  */
 static const sve_vec_cost generic_sve_vector_cost =
 {
-  1, /* int_stmt_cost  */
-  1, /* fp_stmt_cost  */
-  2, /* permute_cost  */
-  2, /* reduc_i8_cost  */
-  2, /* reduc_i16_cost  */
-  2, /* reduc_i32_cost  */
-  2, /* reduc_i64_cost  */
-  2, /* reduc_f16_cost  */
-  2, /* reduc_f32_cost  */
-  2, /* reduc_f64_cost  */
-  2, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
-  1, /* align_load_cost  */
-  1, /* unalign_load_cost  */
-  1, /* unalign_store_cost  */
-  1  /* store_cost  */
+  {
+1, /* int_stmt_cost  */
+1, /* fp_stmt_cost  */
+2, /* permute_cost  */
+2, /* reduc_i8_cost  */
+2, /* reduc_i16_cost  */
+2, /* reduc_i32_cost  */
+2, /* reduc_i64_cost  */
+2, /* reduc_f16_cost  */
+2, /* reduc_f32_cost  */
+2, /* reduc_f64_cost  */
+2, /* vec_to_scalar_cost  */
+1, /* scalar_to_vec_cost  */
+1, /* align_load_cost  */
+1, /* unalign_load_cost  */
+1, /* unalign_store_cost  */
+1  /* store_cost  */
+  },
+  2, /* clast_cost  */
+  2, /* fadda_f16_cost  */
+  2, /* fadda_f32_cost  */
+  2 /* fadda_f64_cost  */
 };
 
 /* Generic costs for vector insn classes.  */
@@ -662,22 +668,28 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
 
 static const sve_vec_cost a64fx_sve_vector_cost =
 {
-  2, /* int_stmt_cost  */
-  5, /* fp_stmt_cost  */
-  3, /* permute_cost  */
-  13, /* reduc_i8_cost  */
-  13, /* reduc_i16_cost  */
-  13, /* reduc_i32_cost  */
-  13, /* reduc_i64_cost  */
-  13, /* reduc_f16_cost  */
-  13, /* reduc_f32_cost  */
-  13, /* reduc_f64_cost  */
-  13, /* vec_to_scalar_cost  */
-  4, /* scalar_to_vec_cost  */
-  6, /* align_load_cost  */
-  6, /* unalign_load_cost  */
-  1, /* unalign_store_cost  */
-  1  /* store_cost  */
+  {
+2, /* int_stmt_cost  */
+5, /* fp_stmt_cost  */
+3, /* permute_cost  */
+13, /* reduc_i8_cost  */
+13, /* reduc_i16_cost  */
+13, /* reduc_i32_cost  */
+13, /* reduc_i64_cost  */
+13, /* reduc_f

[PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes

2021-03-26 Thread Richard Sandiford via Gcc-patches
At the moment, we cost LD[234] and ST[234] as N vector loads
or stores, which effectively treats the implied permute as free.
This patch adds additional costs for the permutes, which apply on
top of the costs for the loads and stores.
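
As a concrete illustration (this example is not part of the patch), an
interleaved access like the one below vectorises with LD3, and the patch
now charges the implied de-interleave on top of the three vector loads:

/* Splitting packed RGB data: the vector loop uses LD3, which loads three
   vectors' worth of bytes and de-interleaves them into separate R, G and
   B registers.  That permute work is what ld3_st3_permute_cost models.  */
void
split_rgb (const unsigned char *rgb, unsigned char *r,
           unsigned char *g, unsigned char *b, int n)
{
  for (int i = 0; i < n; ++i)
    {
      r[i] = rgb[3 * i];
      g[i] = rgb[3 * i + 1];
      b[i] = rgb[3 * i + 2];
    }
}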

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64-protos.h (simd_vec_cost::ld2_st2_permute_cost)
(simd_vec_cost::ld3_st3_permute_cost): New member variables.
(simd_vec_cost::ld4_st4_permute_cost): Likewise.
* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
accordingly, using zero for the new costs.
(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
(thunderx3t110_advsimd_vector_cost): Likewise.
(aarch64_ld234_st234_vectors): New function.
(aarch64_adjust_stmt_cost): Likewise.
(aarch64_add_stmt_cost): Call aarch64_adjust_stmt_cost if using
the new vector costs.
---
 gcc/config/aarch64/aarch64-protos.h |  7 +++
 gcc/config/aarch64/aarch64.c| 94 +
 2 files changed, 101 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index bfcab72b122..3d152754981 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -202,6 +202,13 @@ struct simd_vec_cost
  specially below.  */
   const int fp_stmt_cost;
 
+  /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
+ as well as the per-vector cost of permuting vectors before
+ an ST2, ST3 or ST4.  */
+  const int ld2_st2_permute_cost;
+  const int ld3_st3_permute_cost;
+  const int ld4_st4_permute_cost;
+
   /* Cost of a permute operation.  */
   const int permute_cost;
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b62169a267a..8fb723dabd2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -590,6 +590,9 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
 {
   1, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   2, /* reduc_i8_cost  */
   2, /* reduc_i16_cost  */
@@ -612,6 +615,9 @@ static const sve_vec_cost generic_sve_vector_cost =
   {
 1, /* int_stmt_cost  */
 1, /* fp_stmt_cost  */
+0, /* ld2_st2_permute_cost  */
+0, /* ld3_st3_permute_cost  */
+0, /* ld4_st4_permute_cost  */
 2, /* permute_cost  */
 2, /* reduc_i8_cost  */
 2, /* reduc_i16_cost  */
@@ -650,6 +656,9 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
 {
   2, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   3, /* permute_cost  */
   13, /* reduc_i8_cost  */
   13, /* reduc_i16_cost  */
@@ -671,6 +680,9 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   {
 2, /* int_stmt_cost  */
 5, /* fp_stmt_cost  */
+0, /* ld2_st2_permute_cost  */
+0, /* ld3_st3_permute_cost  */
+0, /* ld4_st4_permute_cost  */
 3, /* permute_cost  */
 13, /* reduc_i8_cost  */
 13, /* reduc_i16_cost  */
@@ -708,6 +720,9 @@ static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
 {
   1, /* int_stmt_cost  */
   3, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   1, /* reduc_i8_cost  */
   1, /* reduc_i16_cost  */
@@ -742,6 +757,9 @@ static const advsimd_vec_cost thunderx_advsimd_vector_cost =
 {
   4, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   4, /* permute_cost  */
   2, /* reduc_i8_cost  */
   2, /* reduc_i16_cost  */
@@ -775,6 +793,9 @@ static const advsimd_vec_cost tsv110_advsimd_vector_cost =
 {
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   3, /* reduc_i8_cost  */
   3, /* reduc_i16_cost  */
@@ -807,6 +828,9 @@ static const advsimd_vec_cost cortexa57_advsimd_vector_cost 
=
 {
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   3, /* permute_cost  */
   8, /* reduc_i8_cost  */
   8, /* reduc_i16_cost  */
@@ -840,6 +864,9 @@ static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
 {
   

[PATCH 04/13] aarch64: Add costs for storing one element of a vector

2021-03-26 Thread Richard Sandiford via Gcc-patches
Storing one element of a vector is costed as a vec_to_scalar
followed by a scalar_store.  However, vec_to_scalar is also
used for reductions and for vector-to-GPR moves, which makes
it difficult to pick one cost for them all.

This patch therefore adds a cost for extracting one element
of a vector in preparation for storing it out.  The store
itself is still costed separately.
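
As an illustration (this example is not part of the patch, and the exact
code generated depends on the target and options), a strided store is one
case where single elements of a vector are written out:

/* The loads can be done a full vector at a time, but because the stores
   are not contiguous they are typically written out lane by lane; the
   per-lane extraction is what store_elt_extra_cost covers, while each
   scalar store keeps its usual cost.  */
void
copy_every_other (const int *restrict in, int *restrict out, int n)
{
  for (int i = 0; i < n; ++i)
    out[2 * i] = in[i];
}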

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64-protos.h
(simd_vec_cost::store_elt_extra_cost): New member variable.
* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
accordingly, using the vec_to_scalar cost for the new field.
(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
(thunderx3t110_advsimd_vector_cost): Likewise.
(aarch64_detect_vector_stmt_subtype): Detect single-element stores.
---
 gcc/config/aarch64/aarch64-protos.h |  4 
 gcc/config/aarch64/aarch64.c| 20 
 2 files changed, 24 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 3d152754981..fabe3df7071 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -224,6 +224,10 @@ struct simd_vec_cost
   const int reduc_f32_cost;
   const int reduc_f64_cost;
 
+  /* Additional cost of storing a single vector element, on top of the
+ normal cost of a scalar store.  */
+  const int store_elt_extra_cost;
+
   /* Cost of a vector-to-scalar operation.  */
   const int vec_to_scalar_cost;
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 8fb723dabd2..20bb75bd56c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -601,6 +601,7 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
   2, /* reduc_f16_cost  */
   2, /* reduc_f32_cost  */
   2, /* reduc_f64_cost  */
+  2, /* store_elt_extra_cost  */
   2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -626,6 +627,7 @@ static const sve_vec_cost generic_sve_vector_cost =
 2, /* reduc_f16_cost  */
 2, /* reduc_f32_cost  */
 2, /* reduc_f64_cost  */
+2, /* store_elt_extra_cost  */
 2, /* vec_to_scalar_cost  */
 1, /* scalar_to_vec_cost  */
 1, /* align_load_cost  */
@@ -667,6 +669,7 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
   13, /* reduc_f16_cost  */
   13, /* reduc_f32_cost  */
   13, /* reduc_f64_cost  */
+  13, /* store_elt_extra_cost  */
   13, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   6, /* align_load_cost  */
@@ -691,6 +694,7 @@ static const sve_vec_cost a64fx_sve_vector_cost =
 13, /* reduc_f16_cost  */
 13, /* reduc_f32_cost  */
 13, /* reduc_f64_cost  */
+13, /* store_elt_extra_cost  */
 13, /* vec_to_scalar_cost  */
 4, /* scalar_to_vec_cost  */
 6, /* align_load_cost  */
@@ -731,6 +735,7 @@ static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
   1, /* reduc_f16_cost  */
   1, /* reduc_f32_cost  */
   1, /* reduc_f64_cost  */
+  1, /* store_elt_extra_cost  */
   1, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -768,6 +773,7 @@ static const advsimd_vec_cost thunderx_advsimd_vector_cost =
   2, /* reduc_f16_cost  */
   2, /* reduc_f32_cost  */
   2, /* reduc_f64_cost  */
+  2, /* store_elt_extra_cost  */
   2, /* vec_to_scalar_cost  */
   2, /* scalar_to_vec_cost  */
   3, /* align_load_cost  */
@@ -804,6 +810,7 @@ static const advsimd_vec_cost tsv110_advsimd_vector_cost =
   3, /* reduc_f16_cost  */
   3, /* reduc_f32_cost  */
   3, /* reduc_f64_cost  */
+  3, /* store_elt_extra_cost  */
   3, /* vec_to_scalar_cost  */
   2, /* scalar_to_vec_cost  */
   5, /* align_load_cost  */
@@ -839,6 +846,7 @@ static const advsimd_vec_cost cortexa57_advsimd_vector_cost 
=
   8, /* reduc_f16_cost  */
   8, /* reduc_f32_cost  */
   8, /* reduc_f64_cost  */
+  8, /* store_elt_extra_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -875,6 +883,7 @@ static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
   3, /* reduc_f16_cost  */
   3, /* reduc_f32_cost  */
   3, /* reduc_f64_cost  */
+  3, /* store_elt_extra_cost  */
   3, /* vec_to_scalar_cost  */
   3, /* scalar_to_vec_cost  */
   5, /* align_load_cost  */
@@ -910,6 +919,7 @@ static const advsimd_vec_cost xgene1_advsimd_vector_cost =
   4, /* reduc_f16_cost  */
   4, /* reduc_f32_cost  */
   4, /* reduc_f64_cost  */
+  4, /* store_elt_extra_co

[PATCH 05/13] aarch64: Add costs for one element of a scatter store

2021-03-26 Thread Richard Sandiford via Gcc-patches
Currently each element in a gather load is costed as a scalar_load
and each element in a scatter store is costed as a scalar_store.
The load side seems to work pretty well in practice, since many
CPU-specific costs give loads quite a high cost relative to
arithmetic operations.  However, stores usually have a cost
of just 1, which means that scatters tend to appear too cheap.

This patch adds a separate cost for one element in a scatter store.
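
As an illustration (this example is not part of the patch), an indirect
store like the one below becomes an SVE scatter:

/* x[idx[i]] = val[i] vectorises as a scatter store with a vector of
   offsets.  Each element used to be costed like a unit-cost contiguous
   scalar store; scatter_store_elt_cost lets a CPU make it more
   expensive.  */
void
scatter (float *restrict x, const int *restrict idx,
         const float *restrict val, int n)
{
  for (int i = 0; i < n; ++i)
    x[idx[i]] = val[i];
}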

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64-protos.h
(sve_vec_cost::scatter_store_elt_cost): New member variable.
* config/aarch64/aarch64.c (generic_sve_vector_cost): Update
accordingly, taking the cost from the cost of a scalar_store.
(a64fx_sve_vector_cost): Likewise.
(aarch64_detect_vector_stmt_subtype): Detect scatter stores.
---
 gcc/config/aarch64/aarch64-protos.h |  9 +++--
 gcc/config/aarch64/aarch64.c| 13 +++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index fabe3df7071..2ffa96ec24b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -256,12 +256,14 @@ struct sve_vec_cost : simd_vec_cost
  unsigned int clast_cost,
  unsigned int fadda_f16_cost,
  unsigned int fadda_f32_cost,
- unsigned int fadda_f64_cost)
+ unsigned int fadda_f64_cost,
+ unsigned int scatter_store_elt_cost)
 : simd_vec_cost (base),
   clast_cost (clast_cost),
   fadda_f16_cost (fadda_f16_cost),
   fadda_f32_cost (fadda_f32_cost),
-  fadda_f64_cost (fadda_f64_cost)
+  fadda_f64_cost (fadda_f64_cost),
+  scatter_store_elt_cost (scatter_store_elt_cost)
   {}
 
   /* The cost of a vector-to-scalar CLASTA or CLASTB instruction,
@@ -274,6 +276,9 @@ struct sve_vec_cost : simd_vec_cost
   const int fadda_f16_cost;
   const int fadda_f32_cost;
   const int fadda_f64_cost;
+
+  /* The per-element cost of a scatter store.  */
+  const int scatter_store_elt_cost;
 };
 
 /* Cost for vector insn classes.  */
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 20bb75bd56c..7f727413d01 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -638,7 +638,8 @@ static const sve_vec_cost generic_sve_vector_cost =
   2, /* clast_cost  */
   2, /* fadda_f16_cost  */
   2, /* fadda_f32_cost  */
-  2 /* fadda_f64_cost  */
+  2, /* fadda_f64_cost  */
+  1 /* scatter_store_elt_cost  */
 };
 
 /* Generic costs for vector insn classes.  */
@@ -705,7 +706,8 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   13, /* clast_cost  */
   13, /* fadda_f16_cost  */
   13, /* fadda_f32_cost  */
-  13 /* fadda_f64_cost  */
+  13, /* fadda_f64_cost  */
+  1 /* scatter_store_elt_cost  */
 };
 
 static const struct cpu_vector_cost a64fx_vector_cost =
@@ -14279,6 +14281,13 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, 
vect_cost_for_stmt kind,
   && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
 return simd_costs->store_elt_extra_cost;
 
+  /* Detect cases in which a scalar_store is really storing one element
+ in a scatter operation.  */
+  if (kind == scalar_store
+  && sve_costs
+  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+return sve_costs->scatter_store_elt_cost;
+
   /* Detect cases in which vec_to_scalar represents an in-loop reduction.  */
   if (kind == vec_to_scalar
   && where == vect_body
-- 
2.17.1



[PATCH 06/13] aarch64: Add a CPU-specific cost table for Neoverse V1

2021-03-26 Thread Richard Sandiford via Gcc-patches
This patch adds dedicated vector costs for Neoverse V1.
Previously we just used the Cortex-A57 costs, which isn't
ideal given that Cortex-A57 doesn't support SVE.

gcc/
* config/aarch64/aarch64.c (neoversev1_advsimd_vector_cost)
(neoversev1_sve_vector_cost): New cost structures.
(neoversev1_vector_cost): Likewise.
(neoversev1_tunings): Use them.  Enable use_new_vector_costs.
---
 gcc/config/aarch64/aarch64.c | 95 +++-
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 7f727413d01..2e9853e4c9b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1619,12 +1619,102 @@ static const struct tune_params neoversen1_tunings =
   &generic_prefetch_tune
 };
 
+static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
+{
+  2, /* int_stmt_cost  */
+  2, /* fp_stmt_cost  */
+  4, /* ld2_st2_permute_cost */
+  4, /* ld3_st3_permute_cost  */
+  5, /* ld4_st4_permute_cost  */
+  3, /* permute_cost  */
+  4, /* reduc_i8_cost  */
+  4, /* reduc_i16_cost  */
+  2, /* reduc_i32_cost  */
+  2, /* reduc_i64_cost  */
+  6, /* reduc_f16_cost  */
+  3, /* reduc_f32_cost  */
+  2, /* reduc_f64_cost  */
+  2, /* store_elt_extra_cost  */
+  /* This value is just inherited from the Cortex-A57 table.  */
+  8, /* vec_to_scalar_cost  */
+  /* This depends very much on what the scalar value is and
+ where it comes from.  E.g. some constants take two dependent
+ instructions or a load, while others might be moved from a GPR.
+ 4 seems to be a reasonable compromise in practice.  */
+  4, /* scalar_to_vec_cost  */
+  4, /* align_load_cost  */
+  4, /* unalign_load_cost  */
+  /* Although stores have a latency of 2 and compete for the
+ vector pipes, in practice it's better not to model that.  */
+  1, /* unalign_store_cost  */
+  1  /* store_cost  */
+};
+
+static const sve_vec_cost neoversev1_sve_vector_cost =
+{
+  {
+2, /* int_stmt_cost  */
+2, /* fp_stmt_cost  */
+4, /* ld2_st2_permute_cost  */
+7, /* ld3_st3_permute_cost  */
+8, /* ld4_st4_permute_cost  */
+3, /* permute_cost  */
+/* Theoretically, a reduction involving 31 scalar ADDs could
+   complete in ~9 cycles and would have a cost of 31.  [SU]ADDV
+   completes in 14 cycles, so give it a cost of 31 + 5.  */
+36, /* reduc_i8_cost  */
+/* Likewise for 15 scalar ADDs (~5 cycles) vs. 12: 15 + 7.  */
+22, /* reduc_i16_cost  */
+/* Likewise for 7 scalar ADDs (~3 cycles) vs. 10: 7 + 7.  */
+14, /* reduc_i32_cost  */
+/* Likewise for 3 scalar ADDs (~2 cycles) vs. 10: 3 + 8.  */
+11, /* reduc_i64_cost  */
+/* Theoretically, a reduction involving 15 scalar FADDs could
+   complete in ~9 cycles and would have a cost of 30.  FADDV
+   completes in 13 cycles, so give it a cost of 30 + 4.  */
+34, /* reduc_f16_cost  */
+/* Likewise for 7 scalar FADDs (~6 cycles) vs. 11: 14 + 5.  */
+19, /* reduc_f32_cost  */
+/* Likewise for 3 scalar FADDs (~4 cycles) vs. 9: 6 + 5.  */
+11, /* reduc_f64_cost  */
+2, /* store_elt_extra_cost  */
+/* This value is just inherited from the Cortex-A57 table.  */
+8, /* vec_to_scalar_cost  */
+/* See the comment above the Advanced SIMD versions.  */
+4, /* scalar_to_vec_cost  */
+4, /* align_load_cost  */
+4, /* unalign_load_cost  */
+/* Although stores have a latency of 2 and compete for the
+   vector pipes, in practice it's better not to model that.  */
+1, /* unalign_store_cost  */
+1  /* store_cost  */
+  },
+  3, /* clast_cost  */
+  19, /* fadda_f16_cost  */
+  11, /* fadda_f32_cost  */
+  8, /* fadda_f64_cost  */
+  3 /* scatter_store_elt_cost  */
+};
+
+/* Neoverse V1 costs for vector insn classes.  */
+static const struct cpu_vector_cost neoversev1_vector_cost =
+{
+  1, /* scalar_int_stmt_cost  */
+  2, /* scalar_fp_stmt_cost  */
+  4, /* scalar_load_cost  */
+  1, /* scalar_store_cost  */
+  1, /* cond_taken_branch_cost  */
+  1, /* cond_not_taken_branch_cost  */
+  &neoversev1_advsimd_vector_cost, /* advsimd  */
+  &neoversev1_sve_vector_cost /* sve  */
+};
+
 static const struct tune_params neoversev1_tunings =
 {
   &cortexa76_extra_costs,
   &generic_addrcost_table,
   &generic_regmove_cost,
-  &cortexa57_vector_cost,
+  &neoversev1_vector_cost,
   &generic_branch_cost,
   &generic_approx_modes,
   SVE_256, /* sve_width  */
@@ -1641,7 +1731,8 @@ static const struct tune_params neoversev1_tunings =
   2,   /* min_div_recip_mul_df.  */
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
+   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS), /* tune_flags.  */
   &generic_prefetch_tune
 };
 
-- 
2.17.1



[PATCH 07/13] aarch64: Use an aarch64-specific structure for vector costing

2021-03-26 Thread Richard Sandiford via Gcc-patches
This patch makes the AArch64 vector code use its own vector
costs structure, rather than just using the default unsigned[3].

Unfortunately, it's not easy to make this change specific to
use_new_vector_costs, so this part is one that affects all CPUs.
The change is relatively mechanical though.

gcc/
* config/aarch64/aarch64.c (aarch64_vector_costs): New structure.
(aarch64_init_cost): New function.
(aarch64_add_stmt_cost): Use aarch64_vector_costs instead of
the default unsigned[3].
(aarch64_finish_cost, aarch64_destroy_cost_data): New functions.
(TARGET_VECTORIZE_INIT_COST): Override.
(TARGET_VECTORIZE_FINISH_COST): Likewise.
(TARGET_VECTORIZE_DESTROY_COST_DATA): Likewise.
---
 gcc/config/aarch64/aarch64.c | 46 ++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2e9853e4c9b..81683b7faa7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14111,6 +14111,21 @@ aarch64_first_cycle_multipass_dfa_lookahead_guard 
(rtx_insn *insn,
 
 /* Vectorizer cost model target hooks.  */
 
+/* Information about vector code that we're in the process of costing.  */
+struct aarch64_vector_costs
+{
+  /* The normal latency-based costs for each region (prologue, body and
+ epilogue), indexed by vect_cost_model_location.  */
+  unsigned int region[3] = {};
+};
+
+/* Implement TARGET_VECTORIZE_INIT_COST.  */
+void *
+aarch64_init_cost (class loop *)
+{
+  return new aarch64_vector_costs;
+}
+
 /* Return true if the current CPU should use the new costs defined
in GCC 11.  This should be removed for GCC 12 and above, with the
costs applying to all CPUs instead.  */
@@ -14535,7 +14550,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void 
*data, int count,
   struct _stmt_vec_info *stmt_info, tree vectype,
   int misalign, enum vect_cost_model_location where)
 {
-  unsigned *cost = (unsigned *) data;
+  auto *costs = static_cast<aarch64_vector_costs *> (data);
   unsigned retval = 0;
 
   if (flag_vect_cost_model)
@@ -14569,12 +14584,30 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void 
*data, int count,
count *= 50; /*  FIXME  */
 
   retval = (unsigned) (count * stmt_cost);
-  cost[where] += retval;
+  costs->region[where] += retval;
 }
 
   return retval;
 }
 
+/* Implement TARGET_VECTORIZE_FINISH_COST.  */
+static void
+aarch64_finish_cost (void *data, unsigned *prologue_cost,
+unsigned *body_cost, unsigned *epilogue_cost)
+{
+  auto *costs = static_cast<aarch64_vector_costs *> (data);
+  *prologue_cost = costs->region[vect_prologue];
+  *body_cost = costs->region[vect_body];
+  *epilogue_cost = costs->region[vect_epilogue];
+}
+
+/* Implement TARGET_VECTORIZE_DESTROY_COST_DATA.  */
+static void
+aarch64_destroy_cost_data (void *data)
+{
+  delete static_cast<aarch64_vector_costs *> (data);
+}
+
 static void initialize_aarch64_code_model (struct gcc_options *);
 
 /* Parse the TO_PARSE string and put the architecture struct that it
@@ -24713,9 +24746,18 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_ARRAY_MODE_SUPPORTED_P
 #define TARGET_ARRAY_MODE_SUPPORTED_P aarch64_array_mode_supported_p
 
+#undef TARGET_VECTORIZE_INIT_COST
+#define TARGET_VECTORIZE_INIT_COST aarch64_init_cost
+
 #undef TARGET_VECTORIZE_ADD_STMT_COST
 #define TARGET_VECTORIZE_ADD_STMT_COST aarch64_add_stmt_cost
 
+#undef TARGET_VECTORIZE_FINISH_COST
+#define TARGET_VECTORIZE_FINISH_COST aarch64_finish_cost
+
+#undef TARGET_VECTORIZE_DESTROY_COST_DATA
+#define TARGET_VECTORIZE_DESTROY_COST_DATA aarch64_destroy_cost_data
+
 #undef TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
 #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST \
   aarch64_builtin_vectorization_cost
-- 
2.17.1



[PATCH 08/13] aarch64: Try to detect when Advanced SIMD code would be completely unrolled

2021-03-26 Thread Richard Sandiford via Gcc-patches
GCC usually costs the SVE and Advanced SIMD versions of a loop
and picks the one with the lowest cost.  By default it will choose
SVE over Advanced SIMD in the event of a tie.

This is normally the correct behaviour, not least because SVE can
handle every scalar iteration count whereas Advanced SIMD can only
handle full vectors.  However, there is one important exception
that GCC failed to consider: we can completely unroll Advanced SIMD
code at compile time, but we can't do the same for SVE.

This patch therefore adds an opt-in heuristic to guess whether
the Advanced SIMD version of a loop is likely to be unrolled.
This will only be suitable for some CPUs, so it is not enabled
by default and is controlled separately from use_new_vector_costs.
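
A toy example of the situation being detected (the example and numbers are
illustrative, not taken from the patch):

/* The trip count is a compile-time constant, so a 128-bit Advanced SIMD
   version needs 16/4 = 4 iterations and can be fully unrolled into
   straight-line code.  Length-agnostic SVE must keep the loop and its
   control overhead, because its iteration count in vectors is not a
   compile-time constant.  */
void
scale16 (float *x)
{
  for (int i = 0; i < 16; ++i)
    x[i] *= 2.0f;
}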

Like with previous patches, this one only becomes active if a
CPU selects both of the new tuning parameters.  It should therefore
have a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64-tuning-flags.def (matched_vector_throughput):
New tuning parameter.
* config/aarch64/aarch64.c (neoversev1_tunings): Use it.
(aarch64_estimated_sve_vq): New function.
(aarch64_vector_costs::analyzed_vinfo): New member variable.
(aarch64_vector_costs::is_loop): Likewise.
(aarch64_vector_costs::unrolled_advsimd_niters): Likewise.
(aarch64_vector_costs::unrolled_advsimd_stmts): Likewise.
(aarch64_record_potential_advsimd_unrolling): New function.
(aarch64_analyze_loop_vinfo, aarch64_analyze_bb_vinfo): Likewise.
(aarch64_add_stmt_cost): Call aarch64_analyze_loop_vinfo or
aarch64_analyze_bb_vinfo on the first use of a costs structure.
Detect whether we're vectorizing a loop for SVE that might be
completely unrolled if it used Advanced SIMD instead.
(aarch64_adjust_body_cost_for_latency): New function.
(aarch64_finish_cost): Call it.
---
 gcc/config/aarch64/aarch64-tuning-flags.def |   2 +
 gcc/config/aarch64/aarch64.c| 215 +++-
 2 files changed, 210 insertions(+), 7 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index a61fcf94916..65b4c37d652 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -50,4 +50,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
CSE_SVE_VL_CONSTANTS)
 
 AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
 
+AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
MATCHED_VECTOR_THROUGHPUT)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 81683b7faa7..63750e38862 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1732,7 +1732,8 @@ static const struct tune_params neoversev1_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS), /* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
+   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags.  */
   &generic_prefetch_tune
 };
 
@@ -2539,6 +2540,14 @@ aarch64_bit_representation (rtx x)
   return x;
 }
 
+/* Return an estimate for the number of quadwords in an SVE vector.  This is
+   equivalent to the number of Advanced SIMD vectors in an SVE vector.  */
+static unsigned int
+aarch64_estimated_sve_vq ()
+{
+  return estimated_poly_value (BITS_PER_SVE_VECTOR) / 128;
+}
+
 /* Return true if MODE is any of the Advanced SIMD structure modes.  */
 static bool
 aarch64_advsimd_struct_mode_p (machine_mode mode)
@@ -14117,6 +14126,39 @@ struct aarch64_vector_costs
   /* The normal latency-based costs for each region (prologue, body and
  epilogue), indexed by vect_cost_model_location.  */
   unsigned int region[3] = {};
+
+  /* True if we have performed one-time initialization based on the vec_info.
+
+ This variable exists because the vec_info is not passed to the
+ init_cost hook.  We therefore have to defer initialization based on
+ it till later.  */
+  bool analyzed_vinfo = false;
+
+  /* True if we're costing a vector loop, false if we're costing block-level
+ vectorization.  */
+  bool is_loop = false;
+
+  /* - If VEC_FLAGS is zero then we're costing the original scalar code.
+ - If VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
+   SIMD code.
+ - If VEC_FLAGS & VEC_ANY_SVE is nonzero then we're costing SVE code.  */
+  unsigned int vec_flags = 0;
+
+  /* On some CPUs, SVE and Advanced SIMD provide the same theoretical vector
+ throughput, such as 4x128 Advanced SIMD vs. 2x256 SVE.  In those
+ situations, we try to predict whether an Advanced SIMD implementation
+ of the loop could be completely unrolled and become straight-line code.
+ If so, it is generally better t

[PATCH 09/13] aarch64: Detect scalar extending loads

2021-03-26 Thread Richard Sandiford via Gcc-patches
If the scalar code does an integer load followed by an integer
extension, we've tended to cost that as two separate operations,
even though the extension is probably going to be free in practice.
This patch treats the extension as having zero cost, like we already
do for extending SVE loads.
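
As an illustration (this example is not part of the patch), the scalar
form of a widening loop is where this shows up:

/* Each scalar iteration loads a 16-bit value and sign-extends it.  On
   AArch64 that is a single LDRSH, so costing the extension as a separate
   scalar statement made the scalar loop look more expensive than it
   really is when compared against the vector version.  */
long
sum_widen (const short *x, int n)
{
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += x[i];
  return s;
}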

Like with previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64.c (aarch64_detect_scalar_stmt_subtype):
New function.
(aarch64_add_stmt_cost): Call it.
---
 gcc/config/aarch64/aarch64.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 63750e38862..e2d92f0c136 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14492,6 +14492,23 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
   return 0;
 }
 
+/* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
+   for STMT_INFO, which has cost kind KIND.  If this is a scalar operation,
+   try to subdivide the target-independent categorization provided by KIND
+   to get a more accurate cost.  */
+static unsigned int
+aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
+   stmt_vec_info stmt_info,
+   unsigned int stmt_cost)
+{
+  /* Detect an extension of a loaded value.  In general, we'll be able to fuse
+ the extension with the load.  */
+  if (kind == scalar_stmt && aarch64_extending_load_p (vinfo, stmt_info))
+return 0;
+
+  return stmt_cost;
+}
+
 /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
for the vectorized form of STMT_INFO, which has cost kind KIND and which
when vectorized would operate on vector type VECTYPE.  Try to subdivide
@@ -14702,10 +14719,16 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void 
*data, int count,
 
   /* Try to get a more accurate cost by looking at STMT_INFO instead
 of just looking at KIND.  */
-  if (stmt_info && vectype && aarch64_use_new_vector_costs_p ())
-   stmt_cost = aarch64_detect_vector_stmt_subtype (vinfo, kind,
-   stmt_info, vectype,
-   where, stmt_cost);
+  if (stmt_info && aarch64_use_new_vector_costs_p ())
+   {
+ stmt_cost = aarch64_detect_scalar_stmt_subtype
+   (vinfo, kind, stmt_info, stmt_cost);
+
+ if (vectype && costs->vec_flags)
+   stmt_cost = aarch64_detect_vector_stmt_subtype (vinfo, kind,
+   stmt_info, vectype,
+   where, stmt_cost);
+   }
 
   /* Do any SVE-specific adjustments to the cost.  */
   if (stmt_info && vectype && aarch64_sve_mode_p (TYPE_MODE (vectype)))
-- 
2.17.1



[PATCH 10/13] aarch64: Cost comparisons embedded in COND_EXPRs

2021-03-26 Thread Richard Sandiford via Gcc-patches
So far the costing of COND_EXPRs hasn't distinguished between
cases in which the condition is calculated separately or is
built into the COND_EXPR itself.  This patch adds the cost
of any embedded comparison.
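
Roughly speaking (this example is not part of the patch), the two shapes
being distinguished are:

/* A conditional select.  In GIMPLE this can appear either as
     _1 = a < b;  x = _1 ? c : d;   (comparison costed as its own statement)
   or as
     x = a < b ? c : d;             (comparison embedded in the COND_EXPR)
   Previously the embedded comparison was effectively free; the patch adds
   its cost on top of the COND_EXPR's own cost.  */
int
select_le (int a, int b, int c, int d)
{
  return a < b ? c : d;
}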

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64.c (aarch64_embedded_comparison_type): New
function.
(aarch64_adjust_stmt_cost): Add the costs of embedded scalar and
vector comparisons.
---
 gcc/config/aarch64/aarch64.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e2d92f0c136..e97e71b6e3d 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14392,6 +14392,21 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info)
   return 0;
 }
 
+/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
+   scalar type of the values being compared.  Return null otherwise.  */
+static tree
+aarch64_embedded_comparison_type (stmt_vec_info stmt_info)
+{
+  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
+if (gimple_assign_rhs_code (assign) == COND_EXPR)
+  {
+   tree cond = gimple_assign_rhs1 (assign);
+   if (COMPARISON_CLASS_P (cond))
+ return TREE_TYPE (TREE_OPERAND (cond, 0));
+  }
+  return NULL_TREE;
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
vectors would produce a series of LDP or STP operations.  KIND is the
kind of statement that STMT_INFO represents.  */
@@ -14685,8 +14700,26 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info,
  stmt_cost += simd_costs->ld4_st4_permute_cost;
  break;
}
+
+  if (kind == vector_stmt || kind == vec_to_scalar)
+   if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
+ {
+   if (FLOAT_TYPE_P (cmp_type))
+ stmt_cost += simd_costs->fp_stmt_cost;
+   else
+ stmt_cost += simd_costs->int_stmt_cost;
+ }
 }
 
+  if (kind == scalar_stmt)
+if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
+  {
+   if (FLOAT_TYPE_P (cmp_type))
+ stmt_cost += aarch64_tune_params.vec_costs->scalar_fp_stmt_cost;
+   else
+ stmt_cost += aarch64_tune_params.vec_costs->scalar_int_stmt_cost;
+  }
+
   return stmt_cost;
 }
 
-- 
2.17.1



[PATCH 11/13] aarch64: Ignore inductions when costing vector code

2021-03-26 Thread Richard Sandiford via Gcc-patches
In practice it seems to be better not to cost a vector induction.
The scalar code generally needs the same induction but doesn't
cost it, making an apples-for-apples comparison harder.  Most
inductions also have a low latency and their cost usually gets
hidden by other operations.
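
As an illustration (this example is not part of the patch):

/* Both the scalar and the vector version of this loop need an induction
   for `i' (the vector version also needs a {0,1,2,3}-style IV for the
   stored values), but the scalar cost never included its induction, so
   charging only the vector one skewed the comparison.  */
void
iota (int *x, int n)
{
  for (int i = 0; i < n; ++i)
    x[i] = i;
}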

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.

gcc/
* config/aarch64/aarch64.c (aarch64_detect_vector_stmt_subtype):
Assume a zero cost for induction phis.
---
 gcc/config/aarch64/aarch64.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e97e71b6e3d..6d18d82079c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14541,6 +14541,12 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, 
vect_cost_for_stmt kind,
   if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
 sve_costs = aarch64_tune_params.vec_costs->sve;
 
+  /* It's generally better to avoid costing inductions, since the induction
+ will usually be hidden by other operations.  This is particularly true
+ for things like COND_REDUCTIONS.  */
+  if (is_a<gphi *> (stmt_info->stmt))
+return 0;
+
   /* Detect cases in which vec_to_scalar is describing the extraction of a
  vector element in preparation for a scalar store.  The store itself is
  costed separately.  */
-- 
2.17.1



[PATCH 12/13] aarch64: Take issue rate into account for vector loop costs

2021-03-26 Thread Richard Sandiford via Gcc-patches
When SVE is enabled, GCC needs to do a three-way comparison
between scalar, Advanced SIMD and SVE code.  The normal costs
tend to be latency-based, which is well-suited to SLP.  However,
comparing sums of latency costs means that we effectively treat
the code as executing sequentially.  This can hide the effect of
pipeline bubbles or resource contention that in practice are quite
important for loop vectorisation.  This is particularly true for
loops that involve reductions.

This patch therefore tries to estimate how quickly each piece
of code could issue, using a very (very) simplistic model.
It then uses this to adjust the loop vector costs up or down as
appropriate.  Part of the Advanced SIMD vs. SVE adjustment is
opt-in and is not enabled by default even for use_new_vector_costs.
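
A minimal sketch of the kind of "simplistic model" meant here (the
structure and function below are illustrative assumptions, not the fields
or hooks added by the patch):

/* Count the operations one loop iteration needs and bound the iteration
   time by the most contended resource rather than by summed latencies.  */
struct op_counts
{
  unsigned int general_ops;            /* non-memory operations */
  unsigned int loads_stores;           /* memory operations */
  unsigned int general_ops_per_cycle;  /* issue width for the former */
  unsigned int loads_stores_per_cycle; /* issue width for the latter */
};

static unsigned int
estimate_cycles_per_iter (const struct op_counts *c)
{
  unsigned int general = (c->general_ops + c->general_ops_per_cycle - 1)
                         / c->general_ops_per_cycle;
  unsigned int memory = (c->loads_stores + c->loads_stores_per_cycle - 1)
                        / c->loads_stores_per_cycle;
  return general > memory ? general : memory;
}

Comparing such estimates for the scalar, Advanced SIMD and SVE versions
(scaled by how many scalar iterations each vector iteration covers) is
what allows the body cost to be nudged up or down.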

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.  The code also mostly ignores
CPUs that have no issue information, even if use_new_vector_costs
is enabled for some reason.

gcc/
* config/aarch64/aarch64.opt
(-param=aarch64-loop-vect-issue-rate-niters=): New parameter.
* doc/invoke.texi: Document it.
* config/aarch64/aarch64-protos.h (aarch64_base_vec_issue_info)
(aarch64_scalar_vec_issue_info, aarch64_simd_vec_issue_info)
(aarch64_advsimd_vec_issue_info, aarch64_sve_vec_issue_info)
(aarch64_vec_issue_info): New structures.
(cpu_vector_cost): Write comments above the variables rather
than to the side.
(cpu_vector_cost::issue_info): New member variable.
* config/aarch64/aarch64.c: Include gimple-pretty-print.h
and tree-ssa-loop-niter.h.
(generic_vector_cost, a64fx_vector_cost, qdf24xx_vector_cost)
(thunderx_vector_cost, tsv110_vector_cost, cortexa57_vector_cost)
(exynosm1_vector_cost, xgene1_vector_cost, thunderx2t99_vector_cost)
(thunderx3t110_vector_cost): Initialize issue_info to null.
(neoversev1_scalar_issue_info, neoversev1_advsimd_issue_info)
(neoversev1_sve_issue_info, neoversev1_vec_issue_info): New structures.
(neoversev1_vector_cost): Use them.
(aarch64_vec_op_count, aarch64_sve_op_count): New structures.
(aarch64_vector_costs::saw_sve_only_op): New member variable.
(aarch64_vector_costs::num_vector_iterations): Likewise.
(aarch64_vector_costs::scalar_ops): Likewise.
(aarch64_vector_costs::advsimd_ops): Likewise.
(aarch64_vector_costs::sve_ops): Likewise.
(aarch64_vector_costs::seen_loads): Likewise.
(aarch64_simd_vec_costs_for_flags): New function.
(aarch64_analyze_loop_vinfo): Initialize num_vector_iterations.
Count the number of predicate operations required by SVE WHILE
instructions.
(aarch64_comparison_type, aarch64_multiply_add_p): New functions.
(aarch64_sve_only_stmt_p, aarch64_in_loop_reduction_latency): Likewise.
(aarch64_count_ops): Likewise.
(aarch64_add_stmt_cost): Record whether we see an SVE operation
that cannot currently be implemented using Advanced SIMD.
Record issue information about the scalar, Advanced SIMD
and (where relevant) SVE versions of a loop.
(aarch64_vec_op_count::dump): New function.
(aarch64_sve_op_count::dump): Likewise.
(aarch64_estimate_min_cycles_per_iter): Likewise.
(aarch64_adjust_body_cost): If issue information is available,
try to compare the issue rates of the various loop implementations
and increase or decrease the vector body cost accordingly.
---
 gcc/config/aarch64/aarch64-protos.h | 178 ++-
 gcc/config/aarch64/aarch64.c| 798 +++-
 gcc/config/aarch64/aarch64.opt  |   3 +
 gcc/doc/invoke.texi |   8 +
 4 files changed, 966 insertions(+), 21 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 2ffa96ec24b..ca1ed9e8758 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -281,19 +281,177 @@ struct sve_vec_cost : simd_vec_cost
   const int scatter_store_elt_cost;
 };
 
+/* Base information about how the CPU issues code, containing
+   information that is relevant to scalar, Advanced SIMD and SVE
+   operations.
+
+   The structure uses the general term "operation" to refer to
+   whichever subdivision of an instruction makes sense for the CPU.
+   These operations would typically be micro operations or macro
+   operations.
+
+   Note that this structure and the ones derived from it are only
+   as general as they need to be for the CPUs that currently use them.
+   They will probably need to be extended or refined as more CPUs are
+   added.  */
+struct aarch64_base_vec_issue_info
+{
+  /* How many loads and stores can be issued per cycle.  */
+  const unsigned int loads_stores_p

[PATCH 13/13] aarch64: Add costs for LD[34] and ST[34] postincrements

2021-03-26 Thread Richard Sandiford via Gcc-patches
Most postincrements are cheap on Neoverse V1, but it's
generally better to avoid them on LD[34] and ST[34] instructions.
This patch adds separate address costs fields for these cases.
Other CPUs continue to use the same costs for all postincrements.

gcc/
* config/aarch64/aarch64-protos.h
(cpu_addrcost_table::post_modify_ld3_st3): New member variable.
(cpu_addrcost_table::post_modify_ld4_st4): Likewise.
* config/aarch64/aarch64.c (generic_addrcost_table): Update
accordingly, using the same costs as for post_modify.
(exynosm1_addrcost_table, xgene1_addrcost_table): Likewise.
(thunderx2t99_addrcost_table, thunderx3t110_addrcost_table):
(tsv110_addrcost_table, qdf24xx_addrcost_table): Likewise.
(a64fx_addrcost_table): Likewise.
(neoversev1_addrcost_table): New.
(neoversev1_tunings): Use neoversev1_addrcost_table.
(aarch64_address_cost): Use the new post_modify costs for CImode
and XImode.
---
 gcc/config/aarch64/aarch64-protos.h |  2 ++
 gcc/config/aarch64/aarch64.c| 45 +++--
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index ca1ed9e8758..d5d5417370e 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -177,6 +177,8 @@ struct cpu_addrcost_table
   const struct scale_addr_mode_cost addr_scale_costs;
   const int pre_modify;
   const int post_modify;
+  const int post_modify_ld3_st3;
+  const int post_modify_ld4_st4;
   const int register_offset;
   const int register_sextend;
   const int register_zextend;
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 6d961bea5dc..a573850b3fd 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -364,6 +364,8 @@ static const struct cpu_addrcost_table 
generic_addrcost_table =
 },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   0, /* register_offset  */
   0, /* register_sextend  */
   0, /* register_zextend  */
@@ -380,6 +382,8 @@ static const struct cpu_addrcost_table 
exynosm1_addrcost_table =
 },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   1, /* register_offset  */
   1, /* register_sextend  */
   2, /* register_zextend  */
@@ -396,6 +400,8 @@ static const struct cpu_addrcost_table 
xgene1_addrcost_table =
 },
   1, /* pre_modify  */
   1, /* post_modify  */
+  1, /* post_modify_ld3_st3  */
+  1, /* post_modify_ld4_st4  */
   0, /* register_offset  */
   1, /* register_sextend  */
   1, /* register_zextend  */
@@ -412,6 +418,8 @@ static const struct cpu_addrcost_table 
thunderx2t99_addrcost_table =
 },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   2, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
@@ -428,6 +436,8 @@ static const struct cpu_addrcost_table thunderx3t110_addrcost_table =
 },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   2, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
@@ -444,6 +454,8 @@ static const struct cpu_addrcost_table tsv110_addrcost_table =
 },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   0, /* register_offset  */
   1, /* register_sextend  */
   1, /* register_zextend  */
@@ -460,6 +472,8 @@ static const struct cpu_addrcost_table qdf24xx_addrcost_table =
 },
   1, /* pre_modify  */
   1, /* post_modify  */
+  1, /* post_modify_ld3_st3  */
+  1, /* post_modify_ld4_st4  */
   3, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
@@ -476,12 +490,32 @@ static const struct cpu_addrcost_table a64fx_addrcost_table =
 },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   2, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
   0, /* imm_offset  */
 };
 
+static const struct cpu_addrcost_table neoversev1_addrcost_table =
+{
+{
+  1, /* hi  */
+  0, /* si  */
+  0, /* di  */
+  1, /* ti  */
+},
+  0, /* pre_modify  */
+  0, /* post_modify  */
+  3, /* post_modify_ld3_st3  */
+  3, /* post_modify_ld4_st4  */
+  0, /* register_offset  */
+  0, /* register_sextend  */
+  0, /* register_zextend  */
+  0 /* imm_offset  */
+};
+
 static const struct cpu_regmove_cost generic_regmove_cost =
 {
   1, /* GP2GP  */
@@ -1777,7 +1811,7 @@ static const struct cpu_vector_cost neoversev1_vector_cost =
 static const struct tune_params neoversev1_tunings =
 {
   &cortexa76_extra_costs,
-  &generic_addrcost_table,
+  &neoversev1_addrcost_ta

Re: testsuite/arm: Improve scan-assembler in pr96770.c

2021-03-26 Thread Richard Earnshaw via Gcc-patches
On 26/03/2021 15:35, Christophe Lyon via Gcc-patches wrote:
> Hi,
> 
> I'm seeing random scan-assembler-times failures in pr96770.c when LTO is used.
> 
> I suspect this is because the \\+4 string matches the LTO sections, sometimes.
> 
> I propose this small patch to avoid the issue, by matching arr\\+4 instead. 
> OK?
> 
> 2021-03-26  Christophe Lyon  
> 
> gcc/testsuite/
> * gcc.target/arm/pure-code/pr96770.c: Improve scan-assembler-times.

OK.

R.

> 
> diff --git a/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
> b/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
> index a43d71f..ae1bd10 100644
> --- a/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
> +++ b/gcc/testsuite/gcc.target/arm/pure-code/pr96770.c
> @@ -5,17 +5,17 @@ int arr[1000];
>  int *f4 (void) { return &arr[1]; }
> 
>  /* For cortex-m0 (thumb-1/v6m), we generate 4 movs with upper/lower:#arr+4.  
> */
> -/* { dg-final { scan-assembler-times "\\+4" 4 { target { { !
> arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
> +/* { dg-final { scan-assembler-times "arr\\+4" 4 { target { { !
> arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
> 
>  /* For cortex-m with movt/movw (thumb-1/v8m.base or thumb-2), we
> generate a movt/movw pair with upper/lower:#arr+4.  */
> -/* { dg-final { scan-assembler-times "\\+4" 2 { target {
> arm_thumb1_movt_ok || arm_thumb2_ok } } } } */
> +/* { dg-final { scan-assembler-times "arr\\+4" 2 { target {
> arm_thumb1_movt_ok || arm_thumb2_ok } } } } */
> 
>  int *f5 (void) { return &arr[80]; }
> 
>  /* For cortex-m0 (thumb-1/v6m), we generate 1 ldr from rodata pointer
> to arr+320.  */
> -/* { dg-final { scan-assembler-times "\\+320" 1 { target { { !
> arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
> +/* { dg-final { scan-assembler-times "arr\\+320" 1 { target { { !
> arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
> 
>  /* For cortex-m with movt/movw (thumb-1/v8m.base or thumb-2), we
> generate a movt/movw pair with upper/lower:arr+320.  */
> -/* { dg-final { scan-assembler-times "\\+320" 2 { target {
> arm_thumb1_movt_ok || arm_thumb2_ok } } } } */
> +/* { dg-final { scan-assembler-times "arr\\+320" 2 { target {
> arm_thumb1_movt_ok || arm_thumb2_ok } } } } */
> 



[Patch, fortran] 99307 - FAIL: gfortran.dg/class_assign_4.f90 execution test

2021-03-26 Thread Paul Richard Thomas via Gcc-patches
This patch comes in two versions: submit.diff with Change.Logs or
submit2.diff with Change2.Logs.

The first fixes the problem by changing array temporaries from class
expressions into class temporaries. This permits the use of
gfc_get_class_from_expr to obtain the vptr for these temporaries and all
the good things that come with that when handling dynamic types. The second
part of the fix is to use the array element length from the class
descriptor, when reallocating on assignment. This is needed because the
vptr is being set too early. I will set about trying to track down why this
is happening and fix it after release.

The second version does the same as the first but puts in place a load of
tidying up that is permitted by the fix to class array temporaries.

I couldn't readily see how to prepare a testcase - ideas?

Both regtest on FC33/x86_64. The first was tested by Dominique (see the
PR). OK for master?

Regards

Paul


Change.Logs
Description: Binary data


Change2.Logs
Description: Binary data
diff --git a/gcc/fortran/trans-array.c b/gcc/fortran/trans-array.c
index c6725659093..8aa56d1ccb9 100644
--- a/gcc/fortran/trans-array.c
+++ b/gcc/fortran/trans-array.c
@@ -1403,9 +1403,6 @@ gfc_trans_create_temp_array (stmtblock_t * pre, stmtblock_t * post, gfc_ss * ss,
   desc = gfc_create_var (type, "atmp");
   GFC_DECL_PACKED_ARRAY (desc) = 1;
 
-  info->descriptor = desc;
-  size = gfc_index_one_node;
-
   /* Emit a DECL_EXPR for the variable sized array type in
  GFC_TYPE_ARRAY_DATAPTR_TYPE so the gimplification of its type
  sizes works correctly.  */
@@ -1416,9 +1413,40 @@ gfc_trans_create_temp_array (stmtblock_t * pre, stmtblock_t * post, gfc_ss * ss,
   gfc_add_expr_to_block (pre, build1 (DECL_EXPR,
   arraytype, TYPE_NAME (arraytype)));
 
-  /* Fill in the array dtype.  */
-  tmp = gfc_conv_descriptor_dtype (desc);
-  gfc_add_modify (pre, tmp, gfc_get_dtype (TREE_TYPE (desc)));
+  if (class_expr != NULL_TREE)
+{
+  tree class_data;
+  tree dtype;
+
+  /* Create a class temporary.  */
+  tmp = gfc_create_var (TREE_TYPE (class_expr), "ctmp");
+  gfc_add_modify (pre, tmp, class_expr);
+
+  /* Assign the new descriptor to the _data field. This allows the
+	 vptr _copy to be used for scalarized assignment since the class
+	 temporary can be found from the descriptor.  */
+  class_data = gfc_class_data_get (tmp);
+  tmp = fold_build1_loc (input_location, VIEW_CONVERT_EXPR,
+			 TREE_TYPE (desc), desc);
+  gfc_add_modify (pre, class_data, tmp);
+
+  /* Take the dtype from the class expression.  */
+  dtype = gfc_conv_descriptor_dtype (gfc_class_data_get (class_expr));
+  tmp = gfc_conv_descriptor_dtype (class_data);
+  gfc_add_modify (pre, tmp, dtype);
+
+  /* Point desc to the class _data field.  */
+  desc = class_data;
+}
+  else
+{
+  /* Fill in the array dtype.  */
+  tmp = gfc_conv_descriptor_dtype (desc);
+  gfc_add_modify (pre, tmp, gfc_get_dtype (TREE_TYPE (desc)));
+}
+
+  info->descriptor = desc;
+  size = gfc_index_one_node;
 
   /*
  Fill in the bounds and stride.  This is a packed array, so:
@@ -3438,6 +3466,12 @@ build_class_array_ref (gfc_se *se, tree base, tree index)
   && GFC_DECL_SAVED_DESCRIPTOR (se->expr)
   && GFC_CLASS_TYPE_P (TREE_TYPE (GFC_DECL_SAVED_DESCRIPTOR (se->expr
 decl = se->expr;
+  else if (!VAR_P (base) && gfc_get_class_from_expr (base))
+{
+  decl = gfc_get_class_from_expr (base);
+  se->class_vptr = gfc_evaluate_now (gfc_class_vptr_get (decl), &se->pre);
+  goto class_found;
+}
   else
 {
   if (expr == NULL
@@ -3530,6 +3564,7 @@ build_class_array_ref (gfc_se *se, tree base, tree index)
   if (POINTER_TYPE_P (TREE_TYPE (decl)))
 decl = build_fold_indirect_ref_loc (input_location, decl);
 
+class_found:
   if (!GFC_CLASS_TYPE_P (TREE_TYPE (decl)))
 return false;
 
@@ -10274,23 +10309,10 @@ gfc_alloc_allocatable_for_assignment (gfc_loopinfo *loop,
 }
   else if (expr1->ts.type == BT_CLASS)
 {
-  tmp = expr1->rank ? gfc_get_class_from_expr (desc) : NULL_TREE;
-  if (tmp == NULL_TREE)
-	tmp = gfc_get_class_from_gfc_expr (expr1);
-
-  if (tmp != NULL_TREE)
-	{
-	  tmp2 = gfc_class_vptr_get (tmp);
-	  cond = fold_build2_loc (input_location, NE_EXPR,
-  logical_type_node, tmp2,
-  build_int_cst (TREE_TYPE (tmp2), 0));
-	  elemsize1 = gfc_class_vtab_size_get (tmp);
-	  elemsize1 = fold_build3_loc (input_location, COND_EXPR,
-  gfc_array_index_type, cond,
-  elemsize1, gfc_index_zero_node);
-	}
-  else
-	elemsize1 = TYPE_SIZE_UNIT (gfc_typenode_for_spec (&CLASS_DATA (expr1)->ts));
+  /* Unfortunately, the lhs vptr is set too early in many cases.
+	 Play it safe by using the descriptor element length.  */
+  tmp = gfc_conv_descriptor_elem_len (desc);
+  elemsize1 = fold_convert (gfc_array_index_type, tmp);
 }
   else
 elemsize1 = NULL_TREE;
@@ -10

Re: [PATCH v2] aarch64: Fix SVE ACLE builtins with LTO [PR99216]

2021-03-26 Thread Richard Sandiford via Gcc-patches
Alex Coplan  writes:
> Hi all,
>
> Here is a v2 patch which provides a more obviously fake answer to
> TARGET_BUILTIN_DECL: this can hopefully go in for GCC 11. For GCC 12, it
> seems that we should consider removing the target hook.
>
> Original patch:
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566405.html
>
> ---
>
> As discussed in the PR, we currently have two different numbering
> schemes for SVE builtins: one for C, and one for C++. This is
> problematic for LTO, where we end up getting confused about which
> intrinsic we're talking about. This patch inserts placeholders into the
> registered_functions vector to ensure that there is a consistent
> numbering scheme for both C and C++.
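> 
> A self-contained illustration of the placeholder idea (sketch only; the
> names below are made up -- the real code stores tree nodes in the
> registered_functions vector):
> 
> #include <cassert>
> #include <cstddef>
> #include <string>
> #include <vector>
> 
> struct fn_table { std::vector<std::string> fns; };
> 
> // The index returned here plays the role of the builtin function code
> // that must agree between the C and C++ front ends for LTO streaming.
> static std::size_t
> register_fn (fn_table &t, const std::string &name, bool placeholder_p)
> {
>   t.fns.push_back (placeholder_p ? std::string ("<placeholder>") : name);
>   return t.fns.size () - 1;
> }
> 
> int main ()
> {
>   fn_table c_fns, cxx_fns;
>   // C registers the direct overload; C++ records a placeholder instead
>   // of skipping the slot, so subsequent indices stay in sync.
>   std::size_t a = register_fn (c_fns, "svadd_n_s32_m", false);
>   std::size_t b = register_fn (cxx_fns, "svadd_n_s32_m", true);
>   std::size_t c = register_fn (c_fns, "svsub_n_s32_m", false);
>   std::size_t d = register_fn (cxx_fns, "svsub_n_s32_m", false);
>   assert (a == b && c == d);
> }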
>
> This patch uses integer_zero_node as a placeholder node instead of
> building a function decl. This is safe because the node is only returned
> by the TARGET_BUILTIN_DECL hook, which (on AArch64) is only used for
> validation when builtin decls are streamed into lto1.
>
> Bootstrapped and regtested on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
>   PR target/99216
>   * config/aarch64/aarch64-sve-builtins.cc
>   (function_builder::add_function): Add placeholder_p argument, use
>   placeholder decls if this is set.
>   (function_builder::add_unique_function): Instead of conditionally adding
>   direct overloads, unconditionally add either a direct overload or a
>   placeholder.
>   (function_builder::add_overloaded_function): Set placeholder_p if we're
>   using C++ overloads. Use the obstack for string storage instead
>   of relying on the tree nodes.
>   (function_builder::add_overloaded_functions): Don't return early for
>   m_direct_overloads: we need to add placeholders.
>   * config/aarch64/aarch64-sve-builtins.h
>   (function_builder::add_function): Add placeholder_p argument.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/99216
>   * g++.target/aarch64/sve/pr99216.C: New test.

OK, thanks, and sorry for the delay in reviewing.

Richard


[committed] [PR99766] Consider relaxed memory associated more with memory instead of special memory

2021-03-26 Thread Vladimir Makarov via Gcc-patches

The following patch fixes

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99766

The patch was successfully bootstrapped and tested on aarch64.


commit 0d37e2d3ead072ba57e03fcb97a041504a22e721
Author: Vladimir Makarov 
Date:   Fri Mar 26 17:09:24 2021 +

[PR99766] Consider relaxed memory associated more with memory instead of special memory.

Relaxed memory should be considered more like memory than special memory.

gcc/ChangeLog:

PR target/99766
* ira-costs.c (record_reg_classes): Put case with
CT_RELAXED_MEMORY adjacent to one with CT_MEMORY.
* ira.c (ira_setup_alts): Ditto.
* lra-constraints.c (process_alt_operands): Ditto.
* recog.c (asm_operand_ok): Ditto.
* reload.c (find_reloads): Ditto.

gcc/testsuite/ChangeLog:

PR target/99766
* g++.target/aarch64/sve/pr99766.C: New.

diff --git a/gcc/ira-costs.c b/gcc/ira-costs.c
index 7547f3e0f53..10727b5ff9e 100644
--- a/gcc/ira-costs.c
+++ b/gcc/ira-costs.c
@@ -773,6 +773,7 @@ record_reg_classes (int n_alts, int n_ops, rtx *ops,
 		  break;
 
 		case CT_MEMORY:
+		case CT_RELAXED_MEMORY:
 		  /* Every MEM can be reloaded to fit.  */
 		  insn_allows_mem[i] = allows_mem[i] = 1;
 		  if (MEM_P (op))
@@ -780,7 +781,6 @@ record_reg_classes (int n_alts, int n_ops, rtx *ops,
 		  break;
 
 		case CT_SPECIAL_MEMORY:
-		case CT_RELAXED_MEMORY:
 		  insn_allows_mem[i] = allows_mem[i] = 1;
 		  if (MEM_P (extract_mem_from_operand (op))
 			  && constraint_satisfied_p (op, cn))
diff --git a/gcc/ira.c b/gcc/ira.c
index 7e903289e79..b93588d8a9f 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -1871,10 +1871,10 @@ ira_setup_alts (rtx_insn *insn)
 			  goto op_success;
 
 			case CT_MEMORY:
+			case CT_RELAXED_MEMORY:
 			  mem = op;
 			  /* Fall through.  */
 			case CT_SPECIAL_MEMORY:
-			case CT_RELAXED_MEMORY:
 			  if (!mem)
 			mem = extract_mem_from_operand (op);
 			  if (MEM_P (mem))
diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 861b5aad40b..9993065f8d6 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -2417,6 +2417,7 @@ process_alt_operands (int only_alternative)
 		  break;
 
 		case CT_MEMORY:
+		case CT_RELAXED_MEMORY:
 		  if (MEM_P (op)
 			  && satisfies_memory_constraint_p (op, cn))
 			win = true;
@@ -2459,7 +2460,6 @@ process_alt_operands (int only_alternative)
 		  break;
 
 		case CT_SPECIAL_MEMORY:
-		case CT_RELAXED_MEMORY:
 		  if (satisfies_memory_constraint_p (op, cn))
 			win = true;
 		  else if (spilled_pseudo_p (op))
diff --git a/gcc/recog.c b/gcc/recog.c
index ee143bc761e..eb617f11163 100644
--- a/gcc/recog.c
+++ b/gcc/recog.c
@@ -2267,10 +2267,10 @@ asm_operand_ok (rtx op, const char *constraint, const char **constraints)
 	  break;
 
 	case CT_MEMORY:
+	case CT_RELAXED_MEMORY:
 	  mem = op;
 	  /* Fall through.  */
 	case CT_SPECIAL_MEMORY:
-	case CT_RELAXED_MEMORY:
 	  /* Every memory operand can be reloaded to fit.  */
 	  if (!mem)
 		mem = extract_mem_from_operand (op);
diff --git a/gcc/reload.c b/gcc/reload.c
index 7340125c441..461fd0272eb 100644
--- a/gcc/reload.c
+++ b/gcc/reload.c
@@ -3471,6 +3471,7 @@ find_reloads (rtx_insn *insn, int replace, int ind_levels, int live_known,
 			break;
 
 		  case CT_MEMORY:
+		  case CT_RELAXED_MEMORY:
 			if (force_reload)
 			  break;
 			if (constraint_satisfied_p (operand, cn))
@@ -3504,7 +3505,6 @@ find_reloads (rtx_insn *insn, int replace, int ind_levels, int live_known,
 			break;
 
 		  case CT_SPECIAL_MEMORY:
-		  case CT_RELAXED_MEMORY:
 			if (force_reload)
 			  break;
 			if (constraint_satisfied_p (operand, cn))
diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr99766.C b/gcc/testsuite/g++.target/aarch64/sve/pr99766.C
new file mode 100644
index 000..0ca8aee5798
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/sve/pr99766.C
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv8.2-a+sve" } */
+typedef float a __attribute__((__mode__(HF)));
+typedef struct {
+  a b;
+  a c;
+} d;
+int e;
+d *f, *g;
+__fp16 h;
+void j() {
+  for (int i;; ++i) {
+auto l = &f[i];
+for (int k; k < e;) {
+  k = 0;
+  for (; k < e; ++k)
+g[k].b = l[k].b * l[k].c;
+}
+for (int k; k < e; ++k) {
+  g[k].b *= h;
+  g[k].c *= h;
+}
+  }
+}


Re: [PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]

2021-03-26 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Wed, 24 Mar 2021, Stam Markianos-Wright wrote:
>
>> Hi all,
>> 
>> This patch resolves bug:
>> 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974
>> 
>> This is achieved by forcing a re-calculation of *stmt_vectype_out if an
>> incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with an
>> extra introduced max_nunits ceiling.
>> 
>> I am not 100% sure if this is the best way to go about fixing this, because
>> this is my first look at the vectorizer and I lack knowledge of the wider
>> context, so do let me know if you see a better way to do this!
>> 
>> I have added the previously ICE-ing reproducer as a new test.
>> 
>> This is compiled as "g++ -Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" for
>> GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for GCC10.
>> 
>> (the non-fdisable-tree-fre4 version has gone latent on GCC11)
>> 
>> Bootstrapped and reg-tested on aarch64-linux-gnu.
>> Also reg-tested on aarch64-none-elf.
>
> I don't think this is going to work well given uses will expect
> a vector type that's consistent here.
>
> I think giving up is for the moment the best choice, thus replacing
> the assert with vectorization failure.
>
> In the end we shouldn't require those nunits vectypes to be
> separately computed - we compute the vector type of the defs
> anyway and in case they're invariant the vectorizable_* function
> either can deal with the type mix or not anyway.

I agree this area needs simplification, but I think the direction of
travel should be to make the assert valid.  I agree this is probably
the pragmatic fix for GCC 11 and earlier though.

Also, IMO it's a bug that we use OImode, CImode or XImode for plain
vectors.  We should only need to use them for LD[234] and ST[234] arrays.

Thanks,
Richard

>
> That said, the goal should be to simplify things here.
>
> Richard.
>
>> 
>> gcc/ChangeLog:
>> 
>> * tree-vect-stmts.c (get_vectype_for_scalar_type): Add new
>> parameter to core function and add new function overload.
>> (vect_get_vector_types_for_stmt): Add re-calculation logic.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>> * g++.target/aarch64/sve/pr96974.C: New test.
>> 


[PATCH] x86: Define __rdtsc and __rdtscp as macros

2021-03-26 Thread H.J. Lu via Gcc-patches
On Fri, Mar 26, 2021 at 5:09 AM Richard Biener
 wrote:
>
> On Fri, Mar 26, 2021 at 11:26 AM Jakub Jelinek  wrote:
> >
> > On Fri, Mar 26, 2021 at 11:13:21AM +0100, Richard Biener wrote:
> > > On Fri, Mar 26, 2021 at 9:34 AM Jakub Jelinek via Gcc-patches
> > >  wrote:
> > > >
> > > > On Thu, Mar 25, 2021 at 11:36:37AM -0700, H.J. Lu via Gcc-patches wrote:
> > > > > How can we move forward with it?  I'd like to resolve it in GCC 11.
> > > >
> > > > I think it is too late for GCC 11 for this.
> > > > Especially if the solution would be that we change the behavior of 
> > > > existing
> > > > attribute, we would need enough time to test everything in the wild that
> > > > we don't break it badly,
> > >
> > > But isn't the suggested change only going to make programs we reject now
> > > with an error accepted or ICEing?  Thus, no program that works right now
> > > should break.
> >
> > That is true, but even
> > accepts-invalid
> > and
> > ice-on-invalid-code
> > would be important regressions.
> > Changing the always_inline attribute behavior without at least avoiding
> > the first of those for our intrinsics would be bad, and we need to look what
> > people use always_inline in the wild for and what are their expectations.
> > And for the intrinsics we need something maintainable, we have > 5000
> > intrinsics on i386 alone, > 4000 on aarch64, > 7000 on arm, > 600 on rs6000,
> > > 100 on sparc, I bet most of them rely on the current behavior.
> > I think the world doesn't end if we do it for GCC 12 only, do it right for
> > everything we are aware of and have many months to figure out what impact it
> > will have on programs in the wild.
>
> As said, my opinion is that this fallout doesn't "exist" in the wild
> since it can
> only exist for code we reject right now which in my definition of
> "out in the wild" makes it not exist.  I consider only code accepted by
> the compiler as valid "out in the wild" example.
>
> See also the behavior of always-inline with regard to the optimize attribute.
>
> So yes, a better solution would be nice but I can't see any since the
> underlying issue is known since a long time and thus the pragmatic
> solution is the best (IMHO), also from a QOI perspective.  For intrinsics
> it also avoids differences with -O0 vs -O with what we accept and reject.

Here is a simple patch for GCC 11 by defining __rdtsc and __rdtscp
as macros.   OK for master?
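
The failure this avoids looks roughly like the following (a sketch, not the
new testcase; the caller name is made up and the exact diagnostic wording may
differ slightly):

/* Before this change __rdtsc was an always_inline function, so a caller
   compiled with target("general-regs-only") could not inline it and the
   call was rejected even at -O0.  With __rdtsc as a macro it expands
   directly to the builtin and compiles.  */
#include <x86intrin.h>

__attribute__ ((target("general-regs-only")))
unsigned long long
read_tsc (void)
{
  /* previously: inlining failed in call to 'always_inline' '__rdtsc':
     target specific option mismatch */
  return __rdtsc ();
}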

Thanks.

-- 
H.J.
From 84c0019ee2a7125daaad161bfbb98c3bf74ca48b Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Tue, 23 Mar 2021 20:04:58 -0700
Subject: [PATCH] x86: Define __rdtsc and __rdtscp as macros

Define __rdtsc and __rdtscp as macros for callers with general-regs-only
target attribute to avoid inline failure with always_inline attribute.

gcc/

	PR target/99744
	* config/i386/ia32intrin.h (__rdtsc): Defined as macro.
	(__rdtscp): Likewise.

gcc/testsuite/

	PR target/99744
	* gcc.target/i386/pr99744-1.c: New test.
---
 gcc/config/i386/ia32intrin.h  | 14 ++---
 gcc/testsuite/gcc.target/i386/pr99744-1.c | 25 +++
 2 files changed, 27 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr99744-1.c

diff --git a/gcc/config/i386/ia32intrin.h b/gcc/config/i386/ia32intrin.h
index d336a51669a..591394076cc 100644
--- a/gcc/config/i386/ia32intrin.h
+++ b/gcc/config/i386/ia32intrin.h
@@ -107,22 +107,12 @@ __rdpmc (int __S)
 #endif /* __iamcu__ */
 
 /* rdtsc */
-extern __inline unsigned long long
-__attribute__((__gnu_inline__, __always_inline__, __artificial__))
-__rdtsc (void)
-{
-  return __builtin_ia32_rdtsc ();
-}
+#define __rdtsc()		__builtin_ia32_rdtsc ()
 
 #ifndef __iamcu__
 
 /* rdtscp */
-extern __inline unsigned long long
-__attribute__((__gnu_inline__, __always_inline__, __artificial__))
-__rdtscp (unsigned int *__A)
-{
-  return __builtin_ia32_rdtscp (__A);
-}
+#define __rdtscp(a)		__builtin_ia32_rdtscp (a)
 
 #endif /* __iamcu__ */
 
diff --git a/gcc/testsuite/gcc.target/i386/pr99744-1.c b/gcc/testsuite/gcc.target/i386/pr99744-1.c
new file mode 100644
index 000..a5a905c732a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr99744-1.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O0" } */
+
+#include 
+
+extern unsigned long long int curr_deadline;
+extern void bar (void);
+
+__attribute__ ((target("general-regs-only")))
+void
+foo1 (void)
+{
+  if (__rdtsc () < curr_deadline)
+return; 
+  bar ();
+}
+
+__attribute__ ((target("general-regs-only")))
+void
+foo2 (unsigned int *p)
+{
+  if (__rdtscp (p) < curr_deadline)
+return; 
+  bar ();
+}
-- 
2.30.2



off-by-one buffer overflow patch

2021-03-26 Thread Steve Kargl via Gcc-patches
This patch fixes an off-by-one buffer overflow issue.
Please commit.
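
For context, the arithmetic behind the extra byte (a standalone sketch, not
part of the patch; the value of GFC_MAX_SYMBOL_LEN is assumed to be 63 here):

#include <cstdio>
#include <cstring>

#define GFC_MAX_SYMBOL_LEN 63   /* assumption for this illustration */

int main ()
{
  char name[GFC_MAX_SYMBOL_LEN + 1];
  std::memset (name, 'a', GFC_MAX_SYMBOL_LEN);
  name[GFC_MAX_SYMBOL_LEN] = '\0';

  /* "TYPE()" needs strlen("TYPE()") + 1 == 7 extra bytes, but "CLASS()"
     and "UNION()" need strlen("CLASS()") + 1 == 8 -- one more than the
     old "+ 7" buffers allowed.  */
  char buffer[GFC_MAX_SYMBOL_LEN + 8];
  int n = std::snprintf (buffer, sizeof buffer, "CLASS(%s)", name);
  std::printf ("%d characters plus NUL need %d bytes\n", n, n + 1);
  return 0;
}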


diff --git a/gcc/fortran/misc.c b/gcc/fortran/misc.c
index 8a96243e80d..3d449ae17fe 100644
--- a/gcc/fortran/misc.c
+++ b/gcc/fortran/misc.c
@@ -124,8 +124,10 @@ gfc_basic_typename (bt type)
 const char *
 gfc_typename (gfc_typespec *ts, bool for_hash)
 {
-  static char buffer1[GFC_MAX_SYMBOL_LEN + 7];  /* 7 for "TYPE()" + '\0'.  */
-  static char buffer2[GFC_MAX_SYMBOL_LEN + 7];
+  /* Need to add sufficient padding for "TYPE()" + '\0', "UNION()" + '\0',
+ or "CLASS()" + '\0'.  */
+  static char buffer1[GFC_MAX_SYMBOL_LEN + 8];
+  static char buffer2[GFC_MAX_SYMBOL_LEN + 8];
   static int flag = 0;
   char *buffer;
   gfc_typespec *ts1;

-- 
Steve


Re: [PATCH] dwarf2cfi: Defer queued register saves some more [PR99334]

2021-03-26 Thread Jakub Jelinek via Gcc-patches
On Thu, Mar 25, 2021 at 05:43:57PM -0400, Jason Merrill via Gcc-patches wrote:
> > Will e.g. GDB be happy about the changes?
> 
> I would expect so, but it would be good to have someone from GDB verify.

Ok, have verified it now on the testcase from the PR (stubbed out what it
calls, called it from main with all NULL pointers and just in gdb stepped
through the prologue and at each insn looked at
p $rsp
p $rbp
up
p $rsp
p $rbp
down
stepi
With unpatched gcc indeed it works everywhere except when stopping
at the start of movq %rsp, %rbp instruction where $rbp in the parent
frame was unrelated garbage.
And with the patch below, I get the expected values in all cases.

> > Thinking some more about this, what can be problematic on the GCC side
> > is inline asm, that can and often will contain multiple instructions.
> > For that is an easy fix, just test asm_noperands and handle
> > clobbers_queued_reg_save before the insn rather than after in that case.
> 
> Sure, but does inline asm go through dwarf2out_frame_debug?

Not through dwarf2out_frame_debug, as inline asm is not RTX_FRAME_RELATED_P,
but it goes through that scan_trace hunk I was changing like any other
insn (except for BARRIERs, instructions not in basic blocks, some NOTEs
and DEBUG_INSNs).  And clobbers_queued_reg_save (insn) can be true for the
inline asm either because it stores some registers or because it clobbers
them.
And then there are other backend insns that expand to multiple hw
instructions, e.g. in the i386 backend typically (but not sure if we
guarantee that) marked as get_attr_type (insn) == TYPE_MULTI.
So, if we wanted to do what my patch did, we would need to come up
with some new (optional) insn attribute, say get_attr_undivisible (insn),
e.g. define that attribute on i386 to "type" attribute equal to "multi"
by default and perhaps with some exceptions if needed, and use that
attribute if HAVE_ATTR_undivisible in dwarf2cfi.c to decide whether it can
be emitted after the insn or needs to be emitted before it.

> > But there is another problem, instruction patterns that emit multiple
> > hw instructions, code can stop in between them.  So, do we want some
> > instruction attribute that would conservatively tell us which instruction
> > patterns are guaranteed to be single machine instructions?
> 
> It seems to me that in that situation you'd want to add the save to *%rsp,
> and still not update to *%rbp until after the combined instruction.

So, today I've tried instead to deal with it through REG_FRAME_RELATED_EXPR
from the backend, but that failed miserably as explained in the PR,
dwarf2cfi.c has some rules (Rule 16 to Rule 19) that are specific to the
dynamic stack realignment using drap register that only the i386 backend
does right now, and by using REG_FRAME_RELATED_EXPR or REG_CFA* notes we
can't emulate those rules.  The following patch instead does the deferring
of the hard frame pointer save rule in dwarf2cfi.c Rule 18 handling and
emits it on the (set hfp sp) assignment that must appear shortly after it
and adds assertion that it is the case.

The difference before/after the patch on the assembly is:
--- pr99334.s~  2021-03-26 15:42:40.881749380 +0100
+++ pr99334.s   2021-03-26 17:38:05.729161910 +0100
@@ -11,8 +11,8 @@ _func_with_dwarf_issue_:
andq$-16, %rsp
pushq   -8(%r10)
pushq   %rbp
-   .cfi_escape 0x10,0x6,0x2,0x76,0
movq%rsp, %rbp
+   .cfi_escape 0x10,0x6,0x2,0x76,0
pushq   %r15
pushq   %r14
pushq   %r13
i.e. does just what we IMHO need, after pushq %rbp %rbp
still contains parent's frame value and so the save rule doesn't
need to be overridden there, ditto at the start of the next insn
before the side-effect took effect, and we override it only after
it when %rbp already has the right value.

If some other target adds dynamic stack realignment in the future and
the offset 0 case wouldn't be true there, the code can be adjusted so that
it works on all the drap architectures, I'm pretty sure the code would
need other adjustments too.

For the rule 18 and for the (set hfp sp) after it we already have asserts
for the drap cases that check whether the code looks the way i?86/x86_64
emit it currently.

Is this ok for trunk if it passes bootstrap/regtest on x86_64-linux and
i686-linux (I've verified no other target has the drap stuff right now,
nvptx does nvptx_get_drap_rtx but no other realignment stack (unclear why?),
but nvptx is a non-RA target so dwarf2cfi doesn't apply to it)?

2021-03-26  Jakub Jelinek  

PR debug/99334
* dwarf2out.h (struct dw_fde_node): Add rule18 member.
* dwarf2cfi.c (dwarf2out_frame_debug_expr): When handling (set hfp sp)
assignment with drap_reg active, queue reg save for hfp with offset 0
and flush queued reg saves.  When handling a push with rule18,
defer queueing reg save for hfp and just assert the offset is 0.
(scan_trace): Assert that fde->rule18 is false.

--

Re: Dimitar Dimitrov as TI PRU maintainer

2021-03-26 Thread Dimitar Dimitrov
On петък, 26 март 2021 г. 18:29:23 EET Jeff Law wrote:
> I am pleased to announce that the GCC Steering Committee has appointed
> Dimitar Dimitrov as maintainer of the TI PRU port in GCC.
> 
> 
> Dimitar, please update your listing in the MAINTAINERS file. Sorry it's
> taken so long to make this happen.  It just kept slipping off my radar.
> 
> 
> Thanks,
> 
> Jeff

Thank you for the honour. I have pushed the following as 
c314741a539244a947b94ac045611746c0f072e0

ChangeLog:

* MAINTAINERS: Add myself as pru port maintainer.
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1722f0aa8fc..0fbbc0519d0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -96,6 +96,7 @@ nvptx portTom de Vries

 or1k port  Stafford Horne  
 pdp11 port Paul Koning 
 powerpcspe portAndrew Jenner   

+pru port   Dimitar Dimitrov
 riscv port Kito Cheng  
 riscv port Palmer Dabbelt  
 riscv port Andrew Waterman 
@@ -372,7 +373,6 @@ Bud Davis   

 Chris Demetriou
 Sameera Deshpande  
 Wilco Dijkstra 
-Dimitar Dimitrov   
 Benoit Dupont de Dinechin  

 Jason Eckhardt 
 Bernd Edlinger 






c++: imported templates and alias-template changes [PR 99283]

2021-03-26 Thread Nathan Sidwell


During development of modules, I had difficulty deciding whether the
module flags of a template should live on the decl_template_result, the
template_decl, or both.  I chose the latter, and require them to be
consistent.  This and a few other defects show how hard that consistency
is.  Hence this patch moves to holding the flags on the
template-decl-result decl.  That's the entity various bits of the parser
have at the appropriate time.  One needs STRIP_TEMPLATE in a bunch of
places, which this patch adds.  Also a check that we never give a
TEMPLATE_DECL to the module flag accessors.


This left a problem with how I was handling template aliases.  These
were in two parts -- separating the TEMPLATE_DECL from the TYPE_DECL.
That seemed somewhat funky, but development showed it necessary.  Of
course, that causes problems if the TEMPLATE_DECL cannot contain 'am
imported' information.  Investigating now shows that we do not need to
treat them separately.  By reverting a bit of template instantiation
machinery that caused the problem, we're back on course.  I think what
has happened is that between then and now, other typedef fixes have
corrected the underlying problem this separation was working around.  It
allows a bunch of cleanup in the decl streamer, as we no longer have to
handle a null TEMPLATE_DECL_RESULT.



PR c++/99283
gcc/cp/
* cp-tree.h (DECL_MODULE_CHECK): Ban TEMPLATE_DECL.
(SET_TYPE_TEMPLATE_INFO): Restore Alias template setting.
* decl.c (duplicate_decls): Remove template_decl module flag
propagation.
* module.cc (merge_kind_name): Add alias tmpl spec as a thing.
(dumper::impl::nested_name): Adjust for template-decl module flag
change.
(trees_in::assert_definition): Likewise.
(trees_in::install_entity): Likewise.
(trees_out::decl_value): Likewise.  Remove alias template
separation of template and type_decl.
(trees_in::decl_value): Likewise.
(trees_out::key_mergeable): Likewise,
(trees_in::key_mergeable): Likewise.
(trees_out::decl_node): Adjust for template-decl module flag
change.
(depset::hash::make_dependency): Likewise.
(get_originating_module, module_may_redeclare): Likewise.
(set_instantiating_module, set_defining_module): Likewise.
* name-lookup.c (name_lookup::search_adl): Likewise.
(do_pushdecl): Likewise.
* pt.c (build_template_decl): Likewise.
(lookup_template_class_1): Remove special alias_template handling
of DECL_TI_TEMPLATE.
(tsubst_template_decl): Likewise.
gcc/testsuite/
* g++.dg/modules/pr99283-2_a.H: New.
* g++.dg/modules/pr99283-2_b.H: New.
* g++.dg/modules/pr99283-2_c.H: New.
* g++.dg/modules/pr99283-3_a.H: New.
* g++.dg/modules/pr99283-3_b.H: New.
* g++.dg/modules/pr99283-4.H: New.
* g++.dg/modules/tpl-alias-1_a.H: Adjust scans.
* g++.dg/modules/tpl-alias-1_b.C: Adjust scans.

--
Nathan Sidwell
diff --git c/gcc/cp/cp-tree.h w/gcc/cp/cp-tree.h
index e68e3905f80..a4d4d69075f 100644
--- c/gcc/cp/cp-tree.h
+++ w/gcc/cp/cp-tree.h
@@ -1661,9 +1661,11 @@ check_constraint_info (tree t)
 #define CONSTRAINED_PARM_PROTOTYPE(NODE) \
   DECL_INITIAL (TYPE_DECL_CHECK (NODE))
 
-/* Module defines.  */
-// Too many _DECLS: FUNCTION,VAR,TYPE,TEMPLATE,CONCEPT or NAMESPACE
-#define DECL_MODULE_CHECK(NODE) (NODE)
+/* Module flags on FUNCTION,VAR,TYPE,CONCEPT or NAMESPACE
+   A TEMPLATE_DECL holds them on the DECL_TEMPLATE_RESULT object --
+   it's just not practical to keep them consistent.  */
+#define DECL_MODULE_CHECK(NODE)		\
+  TREE_NOT_CHECK (NODE, TEMPLATE_DECL)
 
 /* In the purview of a module (including header unit).  */
 #define DECL_MODULE_PURVIEW_P(N) \
@@ -3626,9 +3628,10 @@ struct GTY(()) lang_decl {
 /* Set the template information for a non-alias n ENUMERAL_, RECORD_,
or UNION_TYPE to VAL.  ALIAS's are dealt with separately.  */
 #define SET_TYPE_TEMPLATE_INFO(NODE, VAL)\
-  (gcc_checking_assert (TREE_CODE (NODE) == ENUMERAL_TYPE		\
-			|| (CLASS_TYPE_P (NODE) && !TYPE_ALIAS_P (NODE))), \
-   (TYPE_LANG_SLOT_1 (NODE) = (VAL)))	\
+  (TREE_CODE (NODE) == ENUMERAL_TYPE		\
+   || (CLASS_TYPE_P (NODE) && !TYPE_ALIAS_P (NODE))			\
+   ? (TYPE_LANG_SLOT_1 (NODE) = (VAL))	\
+   : (DECL_TEMPLATE_INFO (TYPE_NAME (NODE)) = (VAL)))			\
 
 #define TI_TEMPLATE(NODE) \
   ((struct tree_template_info*)TEMPLATE_INFO_CHECK (NODE))->tmpl
diff --git c/gcc/cp/decl.c w/gcc/cp/decl.c
index 3483b0c0398..6789aa859cc 100644
--- c/gcc/cp/decl.c
+++ w/gcc/cp/decl.c
@@ -2275,10 +2275,6 @@ duplicate_decls (tree newdecl, tree olddecl, bool hiding, bool was_hidden)
 	}
 	}
 
-  DECL_MODULE_IMPORT_P (olddecl)
-	= DECL_MODULE_IMPORT_P (old_result)
-	= DECL_MODULE_IMPORT_P (newdecl);
-
   return olddecl;
 }
 
@@ -2931,19 +2927,6 @@ duplicate_decls (tree newdecl, tree olddecl, 

Re: require et random_device for cons token test

2021-03-26 Thread Jonathan Wakely via Gcc-patches

On 25/03/21 11:57 +, Jonathan Wakely wrote:

On 25/03/21 07:17 -0300, Alexandre Oliva wrote:

On Mar 24, 2021, Jonathan Wakely  wrote:


This works for me on x86_64-linux and powerpc64le-linux, and also on
x86_64-linux when I kluge the config macros so that the new code path
gets used. Does this work for VxWorks?


Thanks.  I (trivially) backported it to apply on our gcc-10 tree, and
tested that on x86_64-vx7r2, and I confirm it works there too.

However, I suspect there's a series of typos in the patch.  You appear
to be using the 'which' enum variable for bit testing, but with '|'
rather than '&'.


Oops, that's what I get for a last-minute rewrite without proper
testing. I originally had:

 if (which == blah || which == any)

and then borked it in an attempt to use & instead.

I'll fix that locally too.
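
The intended membership test is the usual bitmask form -- a minimal sketch
(the enumerator names here are illustrative, not the ones in random.cc):

#include <cassert>

enum source { rdrand_src = 1, rdseed_src = 2, any_src = rdrand_src | rdseed_src };

int main ()
{
  int which = rdrand_src;
  assert ((which & rdseed_src) == 0);  // correct: tests whether the bit is set
  assert ((which | rdseed_src) != 0);  // the typo: non-zero for any non-empty 'which'
}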


Here's what I've pushed to trunk.

Tested x86_64-linux, powerpc64le-linux, x86_64-w64-mingw.


commit 5f070ba29803c99a5fe94ed7632d7b8c55593df3
Author: Jonathan Wakely 
Date:   Fri Mar 26 18:39:49 2021

libstdc++: Add PRNG fallback to std::random_device

This makes std::random_device usable on VxWorks when running on older
x86 hardware. Since the r10-728 fix for PR libstdc++/85494 the library
will use the new code unconditionally on x86, but the cpuid checks for
RDSEED and RDRAND can fail at runtime, depending on the hardware where
the code is executing. If the OS does not provide /dev/urandom then this
means the std::random_device constructor always fails. In previous
releases if /dev/urandom is unavailable then std::mt19937 was used
unconditionally.

This patch adds a fallback for the case where the runtime cpuid checks
for x86 hardware instructions fail, and no /dev/urandom is available.
When this happens a std::linear_congruential_engine object will be used,
with a seed based on hashing the engine's address and the current time.
Distinct std::random_device objects will use different seeds, unless an
object is created and destroyed and a new object created at the same
memory location within the clock tick. This is not great, but is better
than always throwing from the constructor, and better than always using
std::mt19937 with the same seed (as GCC 9 and earlier do).
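
The seeding scheme amounts to roughly the following sketch (the names and
the mixing constant are illustrative, not the ones in random.cc):

#include <chrono>
#include <cstdint>
#include <random>

static unsigned
fallback_seed (const void* self)
{
  // Mix the object's address with the current time so that distinct
  // objects constructed at different times get different seeds.
  std::uint64_t bits = reinterpret_cast<std::uintptr_t>(self);
  bits ^= static_cast<std::uint64_t>(
      std::chrono::system_clock::now().time_since_epoch().count());
  bits *= 0x9e3779b97f4a7c15ULL;   // cheap integer mix
  return static_cast<unsigned>(bits >> 32);
}

int main ()
{
  int anchor = 0;
  std::minstd_rand lcg (fallback_seed (&anchor));   // LCG, as in the fallback
  return static_cast<int>(lcg () & 0xff);
}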

libstdc++-v3/ChangeLog:

* src/c++11/random.cc (USE_LCG): Define when a pseudo-random
fallback is needed.
[USE_LCG] (bad_seed, construct_lcg_at, destroy_lcg_at, __lcg):
New helper functions and callback.
(random_device::_M_init): Add 'prng' and 'all' enumerators.
Replace switch with fallthrough with a series of 'if' statements.
[USE_LCG]: Construct an lcg_type engine and use __lcg when cpuid
checks fail.
(random_device::_M_init_pretr1) [USE_MT19937]: Accept "prng"
token.
(random_device::_M_getval): Check for callback unconditionally
and always pass _M_file pointer.
* testsuite/26_numerics/random/random_device/85494.cc: Remove
effective-target check. Use new random_device_available helper.
* testsuite/26_numerics/random/random_device/94087.cc: Likewise.
* testsuite/26_numerics/random/random_device/cons/default-cow.cc:
Remove effective-target check.
* testsuite/26_numerics/random/random_device/cons/default.cc:
Likewise.
* testsuite/26_numerics/random/random_device/cons/token.cc: Use
new random_device_available helper. Test "prng" token.
* testsuite/util/testsuite_random.h (random_device_available):
New helper function.

diff --git a/libstdc++-v3/src/c++11/random.cc b/libstdc++-v3/src/c++11/random.cc
index 1092299e56d..44b9f30e4a9 100644
--- a/libstdc++-v3/src/c++11/random.cc
+++ b/libstdc++-v3/src/c++11/random.cc
@@ -66,14 +66,23 @@
 # include 
 #endif
 
-#if defined USE_RDRAND || defined USE_RDSEED \
-  || defined _GLIBCXX_USE_CRT_RAND_S || defined _GLIBCXX_USE_DEV_RANDOM
+#if defined _GLIBCXX_USE_CRT_RAND_S || defined _GLIBCXX_USE_DEV_RANDOM
+// The OS provides a source of randomness we can use.
 # pragma GCC poison _M_mt
+#elif defined USE_RDRAND || defined USE_RDSEED
+// Hardware instructions might be available, but use cpuid checks at runtime.
+# pragma GCC poison _M_mt
+// If the runtime cpuid checks fail we'll use a linear congruential engine.
+# define USE_LCG 1
 #else
 // Use the mt19937 member of the union, as in previous GCC releases.
 # define USE_MT19937 1
 #endif
 
+#ifdef USE_LCG
+# include 
+#endif
+
 namespace std _GLIBCXX_VISIBILITY(default)
 {
   namespace
@@ -136,6 +145,53 @@ namespace std _GLIBCXX_VISIBILITY(default)
   return val;
 }
 #endif
+
+#ifdef USE_LCG
+// TODO: use this to seed std::mt19937 engine too.
+unsigned
+bad_seed(void* p) noexcept
+{
+  // Poor quality seed based on hash of the current time and the address

[PATCH] c++: ICE on invalid with NSDMI in C++98 [PR98352]

2021-03-26 Thread Marek Polacek via Gcc-patches
NSDMIs are a C++11 thing, and here we ICE with them on the non-C++11
path.  Fortunately all we need is a small tweak to my recent r11-7835
patch (and a small tweak to the new test).

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

gcc/cp/ChangeLog:

PR c++/98352
* method.c (implicitly_declare_fn): Pass &raises to
synthesized_method_walk.

gcc/testsuite/ChangeLog:

PR c++/98352
* g++.dg/cpp0x/inh-ctor37.C: Remove dg-error.
* g++.dg/cpp0x/nsdmi17.C: New test.
---
 gcc/cp/method.c | 2 +-
 gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C | 2 +-
 gcc/testsuite/g++.dg/cpp0x/nsdmi17.C| 8 
 3 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/nsdmi17.C

diff --git a/gcc/cp/method.c b/gcc/cp/method.c
index 25c1e681b99..8ae7496f023 100644
--- a/gcc/cp/method.c
+++ b/gcc/cp/method.c
@@ -3002,7 +3002,7 @@ implicitly_declare_fn (special_function_kind kind, tree type,
 member initializer (c++/89914).  Also, in C++98, we might have
 failed to deduce RAISES, so try again but complain this time.  */
  if (cxx_dialect < cxx11)
-   synthesized_method_walk (type, kind, const_p, nullptr, nullptr,
+   synthesized_method_walk (type, kind, const_p, &raises, nullptr,
 nullptr, nullptr, /*diag=*/true,
 &inherited_ctor, inherited_parms);
  /* We should have seen an error at this point.  */
diff --git a/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C b/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C
index 7d12b534d95..a14874f4632 100644
--- a/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C
+++ b/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C
@@ -22,5 +22,5 @@ struct S { S(B *); };
 S
 fn ()
 {
-  return S(new B(10.5)); // { dg-error "no matching function" "" { target c++98_only } }
+  return S(new B(10.5));
 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/nsdmi17.C b/gcc/testsuite/g++.dg/cpp0x/nsdmi17.C
new file mode 100644
index 000..e69d6ced49b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/nsdmi17.C
@@ -0,0 +1,8 @@
+// PR c++/98352
+// { dg-do compile }
+
+struct A {
+  int i = (A(), 42); // { dg-error "default member initializer" }
+// { dg-error "only available" "" { target c++98_only } .-1 }
+};
+A a;

base-commit: c314741a539244a947b94ac045611746c0f072e0
-- 
2.30.2



Re: [PATCH] dwarf2cfi: Defer queued register saves some more [PR99334]

2021-03-26 Thread Jason Merrill via Gcc-patches

On 3/26/21 1:29 PM, Jakub Jelinek wrote:

On Thu, Mar 25, 2021 at 05:43:57PM -0400, Jason Merrill via Gcc-patches wrote:

Will e.g. GDB be happy about the changes?


I would expect so, but it would be good to have someone from GDB verify.


Ok, have verified it now on the testcase from the PR (stubbed out what it
calls, called it from main with all NULL pointers and just in gdb stepped
through the prologue and at each insn looked at
p $rsp
p $rbp
up
p $rsp
p $rbp
down
stepi
With unpatched gcc indeed it works everywhere except when stopping
at the start of movq %rsp, %rbp instruction where $rbp in the parent
frame was unrelated garbage.
And with the patch below, I get the expected values in all cases.


Thinking some more about this, what can be problematic on the GCC side
is inline asm, that can and often will contain multiple instructions.
For that is an easy fix, just test asm_noperands and handle
clobbers_queued_reg_save before the insn rather than after in that case.


Sure, but does inline asm go through dwarf2out_frame_debug?


Not through dwarf2out_frame_debug, as inline asm is not RTX_FRAME_RELATED_P,
but it goes through that scan_trace hunk I was changing like any other
insn (except for BARRIERs, instructions not in basic blocks, some NOTEs
and DEBUG_INSNs).  And clobbers_queued_reg_save (insn) can be true for the
inline asm either because it stores some registers or because it clobbers
them.
And then there are other backend insns that expand to multiple hw
instructions, e.g. in the i386 backend typically (but not sure if we
guarantee that) marked as get_attr_type (insn) == TYPE_MULTI.
So, if we wanted to do what my patch did, we would need to come up
with some new (optional) insn attribute, say get_attr_undivisible (insn),
e.g. define that attribute on i386 to "type" attribute equal to "multi"
by default and perhaps with some exceptions if needed, and use that
attribute if HAVE_ATTR_undivisible in dwarf2cfi.c to decide whether it can
be emitted after the insn or needs to be emitted before it.


But there is another problem, instruction patterns that emit multiple
hw instructions, code can stop in between them.  So, do we want some
instruction attribute that would conservatively tell us which instruction
patterns are guaranteed to be single machine instructions?


It seems to me that in that situation you'd want to add the save to *%rsp,
and still not update to *%rbp until after the combined instruction.


So, today I've tried instead to deal with it through REG_FRAME_RELATED_EXPR
from the backend, but that failed miserably as explained in the PR,
dwarf2cfi.c has some rules (Rule 16 to Rule 19) that are specific to the
dynamic stack realignment using drap register that only the i386 backend
does right now, and by using REG_FRAME_RELATED_EXPR or REG_CFA* notes we
can't emulate those rules.  The following patch instead does the deferring
of the hard frame pointer save rule in dwarf2cfi.c Rule 18 handling and
emits it on the (set hfp sp) assignment that must appear shortly after it
and adds assertion that it is the case.

The difference before/after the patch on the assembly is:
--- pr99334.s~  2021-03-26 15:42:40.881749380 +0100
+++ pr99334.s   2021-03-26 17:38:05.729161910 +0100
@@ -11,8 +11,8 @@ _func_with_dwarf_issue_:
andq$-16, %rsp
pushq   -8(%r10)
pushq   %rbp
-   .cfi_escape 0x10,0x6,0x2,0x76,0
movq%rsp, %rbp
+   .cfi_escape 0x10,0x6,0x2,0x76,0
pushq   %r15
pushq   %r14
pushq   %r13
i.e. does just what we IMHO need, after pushq %rbp %rbp
still contains parent's frame value and so the save rule doesn't
need to be overridden there, ditto at the start of the next insn
before the side-effect took effect, and we override it only after
it when %rbp already has the right value.

If some other target adds dynamic stack realignment in the future and
the offset 0 case wouldn't be true there, the code can be adjusted so that
it works on all the drap architectures, I'm pretty sure the code would
need other adjustments too.

For the rule 18 and for the (set hfp sp) after it we already have asserts
for the drap cases that check whether the code looks the way i?86/x86_64
emit it currently.

Is this ok for trunk if it passes bootstrap/regtest on x86_64-linux and
i686-linux (I've verified no other target has the drap stuff right now,
nvptx does nvptx_get_drap_rtx but no other realignment stack (unclear why?),
but nvptx is a non-RA target so dwarf2cfi doesn't apply to it)?


I looked into the issue a bit.  So the problem is that after push %rbp, 
the queued reg save is trying to say that %rbp is now saved at CFA 
offset 0, which is correct.  Except that then we hit



  /* When stack is aligned, store REG using DW_CFA_expression with FP.  */
  if (fde && fde->stack_realign)
{
  cfi->dw_cfi_opc = DW_CFA_expression;
  cfi->dw_cfi_oprnd1.dw_cfi_reg_num = reg;
  cfi->dw_cfi_oprnd2.dw_cfi_loc
   

Re: [PATCH] c++: ICE on invalid with NSDMI in C++98 [PR98352]

2021-03-26 Thread Jason Merrill via Gcc-patches

On 3/26/21 3:41 PM, Marek Polacek wrote:

NSDMIs are a C++11 thing, and here we ICE with them on the non-C++11
path.  Fortunately all we need is a small tweak to my recent r11-7835
patch (and a small tweak to the new test).

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?


OK.


gcc/cp/ChangeLog:

PR c++/98352
* method.c (implicitly_declare_fn): Pass &raises to
synthesized_method_walk.

gcc/testsuite/ChangeLog:

PR c++/98352
* g++.dg/cpp0x/inh-ctor37.C: Remove dg-error.
* g++.dg/cpp0x/nsdmi17.C: New test.
---
  gcc/cp/method.c | 2 +-
  gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C | 2 +-
  gcc/testsuite/g++.dg/cpp0x/nsdmi17.C| 8 
  3 files changed, 10 insertions(+), 2 deletions(-)
  create mode 100644 gcc/testsuite/g++.dg/cpp0x/nsdmi17.C

diff --git a/gcc/cp/method.c b/gcc/cp/method.c
index 25c1e681b99..8ae7496f023 100644
--- a/gcc/cp/method.c
+++ b/gcc/cp/method.c
@@ -3002,7 +3002,7 @@ implicitly_declare_fn (special_function_kind kind, tree 
type,
 member initializer (c++/89914).  Also, in C++98, we might have
 failed to deduce RAISES, so try again but complain this time.  */
  if (cxx_dialect < cxx11)
-   synthesized_method_walk (type, kind, const_p, nullptr, nullptr,
+   synthesized_method_walk (type, kind, const_p, &raises, nullptr,
 nullptr, nullptr, /*diag=*/true,
 &inherited_ctor, inherited_parms);
  /* We should have seen an error at this point.  */
diff --git a/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C 
b/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C
index 7d12b534d95..a14874f4632 100644
--- a/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C
+++ b/gcc/testsuite/g++.dg/cpp0x/inh-ctor37.C
@@ -22,5 +22,5 @@ struct S { S(B *); };
  S
  fn ()
  {
-  return S(new B(10.5)); // { dg-error "no matching function" "" { target 
c++98_only } }
+  return S(new B(10.5));
  }
diff --git a/gcc/testsuite/g++.dg/cpp0x/nsdmi17.C 
b/gcc/testsuite/g++.dg/cpp0x/nsdmi17.C
new file mode 100644
index 000..e69d6ced49b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/nsdmi17.C
@@ -0,0 +1,8 @@
+// PR c++/98352
+// { dg-do compile }
+
+struct A {
+  int i = (A(), 42); // { dg-error "default member initializer" }
+// { dg-error "only available" "" { target c++98_only } .-1 }
+};
+A a;

base-commit: c314741a539244a947b94ac045611746c0f072e0





[PATCH] Fix _GLIBCXX_DEBUG container allocator aware move constructors

2021-03-26 Thread François Dumont via Gcc-patches

I reviewed the allocator aware move constructors of the _GLIBCXX_DEBUG containers.

I think the recently added __gnu_debug basic_string one is also missing
the rvalue reference, no?


    libstdc++: _GLIBCXX_DEBUG Fix allocator aware move constructor

    Fix several allocator aware move constructors in _GLIBCXX_DEBUG
    containers.

    libstdc++-v3/ChangeLog:
    * include/debug/forward_list
    (forward_list(forward_list&&, const allocator_type&)): Add
    noexcept qualification.
    * include/debug/list (list(list&&, const allocator_type&)):
    Likewise and add call to safe container allocator aware move
    constructor.
    * include/debug/string (basic_string(basic_string&&, const _Allocator&)):
    Check base type allocator aware move constructor.
    * include/debug/vector (vector(vector&&, const allocator_type&)):
    Fix noexcept qualification.
    * testsuite/23_containers/forward_list/cons/noexcept_move_construct.cc:
    Add allocator aware move constructor noexcept qualification check.
    * testsuite/23_containers/list/cons/noexcept_move_construct.cc: Likewise.


Tested under linux x86_64.

Ok to commit ?

François

diff --git a/libstdc++-v3/include/debug/forward_list b/libstdc++-v3/include/debug/forward_list
index db46705cc71..d631d53c62e 100644
--- a/libstdc++-v3/include/debug/forward_list
+++ b/libstdc++-v3/include/debug/forward_list
@@ -239,8 +239,11 @@ namespace __debug
   { }
 
   forward_list(forward_list&& __list, const allocator_type& __al)
-	: _Safe(std::move(__list._M_safe()), __al),
-	  _Base(std::move(__list._M_base()), __al)
+	noexcept(
+	  std::is_nothrow_constructible<_Base,
+	_Base&&, const allocator_type&>::value )
+  : _Safe(std::move(__list._M_safe()), __al),
+	_Base(std::move(__list._M_base()), __al)
   { }
 
   explicit
diff --git a/libstdc++-v3/include/debug/list b/libstdc++-v3/include/debug/list
index 06938899253..7b6e15bee07 100644
--- a/libstdc++-v3/include/debug/list
+++ b/libstdc++-v3/include/debug/list
@@ -119,7 +119,11 @@ namespace __debug
   : _Base(__x, __a) { }
 
   list(list&& __x, const allocator_type& __a)
-  : _Base(std::move(__x), __a) { }
+	noexcept(
+	  std::is_nothrow_constructible<_Base,
+	_Base&&, const allocator_type&>::value )
+  : _Safe(std::move(__x._M_safe()), __a),
+	_Base(std::move(__x._M_base()), __a) { }
 #endif
 
   explicit
diff --git a/libstdc++-v3/include/debug/string b/libstdc++-v3/include/debug/string
index 8744a55be64..038b06a3097 100644
--- a/libstdc++-v3/include/debug/string
+++ b/libstdc++-v3/include/debug/string
@@ -159,7 +159,7 @@ namespace __gnu_debug
 
   basic_string(basic_string&& __s, const _Allocator& __a)
   noexcept(
-	std::is_nothrow_constructible<_Base, _Base, const _Allocator&>::value )
+	std::is_nothrow_constructible<_Base, _Base&&, const _Allocator&>::value )
   : _Safe(std::move(__s._M_safe()), __a),
 	_Base(std::move(__s._M_base()), __a)
   { }
diff --git a/libstdc++-v3/include/debug/vector b/libstdc++-v3/include/debug/vector
index df179cbbfea..6acf3d3dbf4 100644
--- a/libstdc++-v3/include/debug/vector
+++ b/libstdc++-v3/include/debug/vector
@@ -217,8 +217,9 @@ namespace __debug
   : _Base(__x, __a) { }
 
   vector(vector&& __x, const allocator_type& __a)
-  noexcept(noexcept(
-	_Base(std::declval<_Base&&>()), std::declval()))
+  noexcept(
+	std::is_nothrow_constructible<_Base,
+	  _Base&&, const allocator_type&>::value )
   : _Safe(std::move(__x._M_safe()), __a),
 	_Base(std::move(__x._M_base()), __a),
 	_Safe_vector(std::move(__x)) { }
diff --git a/libstdc++-v3/testsuite/23_containers/forward_list/cons/noexcept_move_construct.cc b/libstdc++-v3/testsuite/23_containers/forward_list/cons/noexcept_move_construct.cc
index 96f3876e4f6..6ca9e96ab4a 100644
--- a/libstdc++-v3/testsuite/23_containers/forward_list/cons/noexcept_move_construct.cc
+++ b/libstdc++-v3/testsuite/23_containers/forward_list/cons/noexcept_move_construct.cc
@@ -23,4 +23,8 @@
 
 typedef std::forward_list fltype;
 
-static_assert(std::is_nothrow_move_constructible::value, "Error");
+static_assert( std::is_nothrow_move_constructible::value,
+	   "noexcept move constructor" );
+static_assert( std::is_nothrow_constructible::value,
+	   "noexcept move constructor with allocator" );
diff --git a/libstdc++-v3/testsuite/23_containers/list/cons/noexcept_move_construct.cc b/libstdc++-v3/testsuite/23_containers/list/cons/noexcept_move_construct.cc
index 5a2de10cf09..63a33a4991b 100644
--- a/libstdc++-v3/testsuite/23_containers/list/cons/noexcept_move_construct.cc
+++ b/libstdc++-v3/testsuite/23_containers/list/cons/noexcept_move_construct.cc
@@ -23,4 +23,8 @@
 
 typedef std::list ltype;
 
-static_assert(std::is_nothrow_move_constructible::value, "Error");
+static_assert( std::is_nothrow_move_constructible::value,
+	   "noexcept move constructor" );
+static_assert( std::is_nothrow_c

[committed] MAINTAINERS: Add myself as pru port maintainer

2021-03-26 Thread Dimitar Dimitrov
ChangeLog:

* MAINTAINERS: Add myself as pru port maintainer.
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1722f0aa8fc..0fbbc0519d0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -96,6 +96,7 @@ nvptx portTom de Vries

 or1k port  Stafford Horne  
 pdp11 port Paul Koning 
 powerpcspe portAndrew Jenner   

+pru port   Dimitar Dimitrov
 riscv port Kito Cheng  
 riscv port Palmer Dabbelt  
 riscv port Andrew Waterman 
@@ -372,7 +373,6 @@ Bud Davis   

 Chris Demetriou
 Sameera Deshpande  
 Wilco Dijkstra 
-Dimitar Dimitrov   
 Benoit Dupont de Dinechin  

 Jason Eckhardt 
 Bernd Edlinger 
-- 
2.20.1





[committed] add test for PR 59970

2021-03-26 Thread Martin Sebor via Gcc-patches

The bug has been fixed for a few years now.  r11-7869 adds the test
for it: https://gcc.gnu.org/g:980b12cc81979e52f491bf0dfe961d30c07fe864

Martin


[PATCH v9] Practical improvement to libgcc complex divide

2021-03-26 Thread Patrick McGehearty via Gcc-patches
Changes in Version 9 since Version 8:

Revised code to meet gcc coding standards in all files, especially
with respect to adding spaces around operations and removing
excess () in #define macro definitions.

Major revision to gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkld.c
Prior code was focused on x86 80-bit implementation of long
doubles. The new version provides tests for: IEEE 128-bit format,
80-bit format, ibm extended precision format (11-bit exponent, 113
bit mantissa), and when long double is treated as IEEE 64-bit
doubles (also 11-bit exponent). The limits tested are now based on
LDBL_MIN and LDBL_MANT_DIG which automatically adapts to the
appropriate format. The IEEE 128-bit and 80-bit formats share
input test values due to having matching 15-bit exponent sizes as
do the IBM extended double and normal 64-bit double, which have 11-bit
exponent sizes.  The input values are accurate to the lesser
mantissa with zero fill for the extended length mantissas. When
the test with the smaller mantissa is active, the expected result
will automatically be truncated to match the available mantissa.
The size of the exponent (LDBL_MAX_EXP) is used to determine which
values are tested. The program was tested using x86 (80-bit),
aarch64 (128-bit), IBM extended format and IEEE 64-bit format.
Some adjustments were made due to the IBM extended format having
slightly less range at the bottom end than IEEE 64-bit format in
spite of having the same size exponent field. For all four cases,
the cdivchkld.c program will abort() under the old complex divide
and exit() under the new complex divide.
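
A trivial sketch of that dispatch (illustrative only, not the test itself):

#include <cfloat>
#include <cstdio>

int main ()
{
#if LDBL_MAX_EXP == 16384
  std::puts ("15-bit exponent: x86 80-bit extended or IEEE binary128");
#elif LDBL_MAX_EXP == 1024
  std::puts ("11-bit exponent: IEEE 64-bit double or IBM extended double-double");
#endif
  std::printf ("LDBL_MANT_DIG = %d\n", (int) LDBL_MANT_DIG);
  return 0;
}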

Correctness and performance test programs used during development of
this project may be found in the attachment to:
https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg254210.html

Summary of Purpose

This patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default complex divide routine (half, float, double, extended,
long double precisions) when dealing with very large or very small exponents.

The current code correctly implements Smith's method (1962) [2]
further modified by C99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
substantially different from the answers provided by quad precision
more than 1% of the time. This error rate may be unacceptable for many
applications that cannot a priori restrict their computations to the
safe range. The proposed method reduces the frequency of
"substantially different" answers by more than 99% for double
precision at a modest cost of performance.

Differences between the current GCC method and the new method will be
described. Then accuracy and performance differences will be discussed.

Background

This project started with an investigation related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714.  Study of Beebe[1]
provided an overview of past and recent practice for computing complex
divide. The current glibc implementation is based on Robert Smith's
algorithm [2] from 1962.  A Google search found the paper by Baudin
and Smith [3] (same Robert Smith) published in 2012. Elen Kalda's
proposed patch [4] is based on that paper.

I developed two sets of test data by randomly distributing values over
a restricted range and the full range of input values. The current
complex divide handled the restricted range well enough, but failed on
the full range more than 1% of the time. Baudin and Smith's primary
test for "ratio" equals zero reduced the cases with 16 or more error
bits by a factor of 5, but still left too many flawed answers. Adding
debug print out to cases with substantial errors allowed me to see the
intermediate calculations for test values that failed. I noted that
for many of the failures, "ratio" was a subnormal. Changing the
"ratio" test from check for zero to check for subnormal reduced the 16
bit error rate by another factor of 12. This single modified test
provides the greatest benefit for the least cost, but the percentage
of cases with greater than 16 bit errors (double precision data) is
still greater than 0.027% (2.7 in 10,000).
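
To make the shape of that change concrete, here is a minimal sketch of
Smith's method for double with the "ratio" test widened from "equals
zero" to "is subnormal"; it is written from the description above and
from Baudin and Smith's rearrangement, not copied from the patched
__divdc3:

  #include <float.h>
  #include <math.h>
  #include <complex.h>

  /* Sketch only: Smith's 1962 method, with the ratio test changed from
     "r == 0" to "fabs (r) < DBL_MIN" (i.e. r is subnormal or zero).  */
  double _Complex
  smith_div (double a, double b, double c, double d)
  {
    double e, f;

    if (fabs (d) <= fabs (c))
      {
        double r = d / c;
        if (fabs (r) < DBL_MIN)
          {
            /* The ratio underflowed; rearrange so it is never used.  */
            e = (a + d * (b / c)) / c;
            f = (b - d * (a / c)) / c;
          }
        else
          {
            double t = 1.0 / (c + d * r);
            e = (a + b * r) * t;
            f = (b - a * r) * t;
          }
      }
    else
      {
        double r = c / d;
        if (fabs (r) < DBL_MIN)
          {
            e = (c * (a / d) + b) / d;
            f = (c * (b / d) - a) / d;
          }
        else
          {
            double t = 1.0 / (c * r + d);
            e = (a * r + b) * t;
            f = (b * r - a) * t;
          }
      }
    return e + f * I;
  }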

Continued examination of remaining errors and their intermediate
computations led to the various input value tests and to the scaling
to avoid under/overflow. The current patch does not handle some of the
rare and most extreme combinations of input values, but the random
test data shows only 1 case in 10 million with an error of more than
12 bits. That case has 18 bits of error and is due to
subtraction cancellation. These results are significantly better
than the results reported by Baudin and Smith.

Support for half, float, double, extended, and long double precision
is included as all are handled with suitable preprocessor symb

[r11-7866 Regression] FAIL: g++.dg/modules/xtreme-header_a.H -std=c++2b (test for excess errors) on Linux/x86_64

2021-03-26 Thread sunil.k.pandey via Gcc-patches
On Linux/x86_64,

d82797420c2238e31a7a25fe6db6bd05cd37224d is the first bad commit
commit d82797420c2238e31a7a25fe6db6bd05cd37224d
Author: Nathan Sidwell 
Date:   Fri Mar 26 10:46:31 2021 -0700

c++: imported templates and alias-template changes [PR 99283]

caused

FAIL: g++.dg/modules/xtreme-header-5_a.H module-cmi  
(gcm.cache/$srcdir/g++.dg/modules/xtreme-header-5_a.H.gcm)
FAIL: g++.dg/modules/xtreme-header-5_a.H -std=c++17 (internal compiler error)
FAIL: g++.dg/modules/xtreme-header-5_a.H -std=c++17 (test for excess errors)
FAIL: g++.dg/modules/xtreme-header-5_a.H -std=c++2a (internal compiler error)
FAIL: g++.dg/modules/xtreme-header-5_a.H -std=c++2a (test for excess errors)
FAIL: g++.dg/modules/xtreme-header-5_a.H -std=c++2b (internal compiler error)
FAIL: g++.dg/modules/xtreme-header-5_a.H -std=c++2b (test for excess errors)
FAIL: g++.dg/modules/xtreme-header_a.H module-cmi  
(gcm.cache/$srcdir/g++.dg/modules/xtreme-header_a.H.gcm)
FAIL: g++.dg/modules/xtreme-header_a.H -std=c++17 (internal compiler error)
FAIL: g++.dg/modules/xtreme-header_a.H -std=c++17 (test for excess errors)
FAIL: g++.dg/modules/xtreme-header_a.H -std=c++2a (internal compiler error)
FAIL: g++.dg/modules/xtreme-header_a.H -std=c++2a (test for excess errors)
FAIL: g++.dg/modules/xtreme-header_a.H -std=c++2b (internal compiler error)
FAIL: g++.dg/modules/xtreme-header_a.H -std=c++2b (test for excess errors)

with GCC configured with

../../gcc/configure 
--prefix=/local/skpandey/gccwork/toolwork/gcc-bisect-master/master/r11-7866/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="modules.exp=g++.dg/modules/xtreme-header-5_a.H 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="modules.exp=g++.dg/modules/xtreme-header-5_a.H 
--target_board='unix{-m64\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="modules.exp=g++.dg/modules/xtreme-header_a.H 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="modules.exp=g++.dg/modules/xtreme-header_a.H 
--target_board='unix{-m64\ -march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at skpgkp2 at gmail dot com)


Re: [PATCH 0/3] Uncontroversial improvements to C++20 wait-related implementation

2021-03-26 Thread Thomas Rodgers

On 2021-03-23 09:35, Jonathan Wakely wrote:

On 23/03/21 09:26 -0700, Thiago Macieira via Libstdc++ wrote:

On Tuesday, 23 March 2021 08:39:43 PDT Thomas Rodgers wrote:

I will be submitting a new patch for the
atomic.wait/barrier/latch/semaphore functionality a bit later today that
subsumes the changes to atomic_wait and latch, and includes the changes
to barrier.
Thanks, Thomas

Is that meant to be part of GCC 11's release?


Yes.

If not, what do we do about preventing the future BC break and potential
heisenbugs?

1) do nothing, accept they will happen silently


This is our current policy for experimental features and it isn't
going to change for GCC 11.


2) cause non-silent BC breaks
3) disable the code for now (unless explicitly opted-in)

-- Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering


FWIW, I would like to commit to an ABI for this with GCC 12;
everything currently residing in the __detail namespace would be moved
into the .so as part of that (likely with a third, and ideally final,
rewrite).


[PATCH] rs6000: Enable 32bit variable vec_insert [PR99718]

2021-03-26 Thread Xionghu Luo via Gcc-patches
From: "luo...@cn.ibm.com" 

32-bit targets and P7 VSX could also benefit a lot from the variable
vec_insert implementation that uses the shift/insert/shift-back method.

Tested and passed on P7BE/P8BE/P8LE{-m32,-m64} and P9LE{-m64}.
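
Not from the patch itself, just to illustrate the kind of source the new
expansion covers: a vec_insert whose index is only known at run time (the
function name and test shape below are made up):

  #include <altivec.h>

  /* Variable-index insert; with this patch it can be expanded with the
     shift/insert/shift-back sequence on 32-bit and Power7 VSX as well.  */
  vector int
  insert_var (vector int v, int val, unsigned int idx)
  {
    return vec_insert (val, v, idx);
  }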

gcc/ChangeLog:

PR target/99718
* config/rs6000/altivec.md (altivec_lvsl_reg): Change to ...
(altivec_lvsl_reg_): ... this.
(altivec_lvsr_reg): Change to ...
(altivec_lvsr_reg_): ... this.
* config/rs6000/predicates.md (vec_set_index_operand): New.
* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
Enable 32bit variable vec_insert for all TARGET_VSX.
* config/rs6000/rs6000.c (rs6000_expand_vector_set_var_p9):
Enable 32bit variable vec_insert for p9 and above.
(rs6000_expand_vector_set_var_p8): Rename to ...
(rs6000_expand_vector_set_var_p7): ... this.
(rs6000_expand_vector_set): Use TARGET_VSX and adjust assert
position.
* config/rs6000/vector.md: Use vec_set_index_operand.
* config/rs6000/vsx.md: Use gen_altivec_lvsl_reg_di and
gen_altivec_lvsr_reg_di.

gcc/testsuite/ChangeLog:

PR target/99718
* gcc.target/powerpc/fold-vec-insert-char-p8.c: Update
instruction counts.
* gcc.target/powerpc/fold-vec-insert-char-p9.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-double.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-float-p8.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-float-p9.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-int-p8.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-int-p9.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-longlong.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-short-p8.c: Likewise.
* gcc.target/powerpc/fold-vec-insert-short-p9.c: Likewise.
* gcc.target/powerpc/pr79251.p8.c: Likewise.
* gcc.target/powerpc/pr79251.p9.c: Likewise.
* gcc.target/powerpc/vsx-builtin-7.c: Likewise.
* gcc.target/powerpc/pr79251-run.p7.c: New test.
* gcc.target/powerpc/pr79251.p7.c: New test.
---
 gcc/config/rs6000/altivec.md  |  8 +-
 gcc/config/rs6000/predicates.md   |  6 ++
 gcc/config/rs6000/rs6000-c.c  |  2 +-
 gcc/config/rs6000/rs6000.c| 89 +++
 gcc/config/rs6000/vector.md   |  2 +-
 gcc/config/rs6000/vsx.md  |  4 +-
 .../powerpc/fold-vec-insert-char-p8.c |  8 +-
 .../powerpc/fold-vec-insert-char-p9.c |  4 +-
 .../powerpc/fold-vec-insert-double.c  | 18 ++--
 .../powerpc/fold-vec-insert-float-p8.c|  4 +-
 .../powerpc/fold-vec-insert-float-p9.c|  2 +-
 .../powerpc/fold-vec-insert-int-p8.c  |  6 +-
 .../powerpc/fold-vec-insert-int-p9.c  |  6 +-
 .../powerpc/fold-vec-insert-longlong.c|  6 +-
 .../powerpc/fold-vec-insert-short-p8.c|  6 +-
 .../powerpc/fold-vec-insert-short-p9.c|  6 +-
 .../gcc.target/powerpc/pr79251-run.p7.c   | 15 
 gcc/testsuite/gcc.target/powerpc/pr79251.p7.c | 23 +
 gcc/testsuite/gcc.target/powerpc/pr79251.p8.c | 10 +--
 gcc/testsuite/gcc.target/powerpc/pr79251.p9.c |  6 +-
 .../gcc.target/powerpc/vsx-builtin-7.c| 11 +--
 21 files changed, 170 insertions(+), 72 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251-run.p7.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.p7.c

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 27a269b9e72..e9005d6e42e 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2771,10 +2771,10 @@ (define_expand "altivec_lvsl"
   DONE;
 })
 
-(define_insn "altivec_lvsl_reg"
+(define_insn "altivec_lvsl_reg_"
   [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
(unspec:V16QI
-   [(match_operand:DI 1 "gpc_reg_operand" "b")]
+   [(match_operand:SDI 1 "gpc_reg_operand" "b")]
UNSPEC_LVSL_REG))]
   "TARGET_ALTIVEC"
   "lvsl %0,0,%1"
@@ -2809,10 +2809,10 @@ (define_expand "altivec_lvsr"
   DONE;
 })
 
-(define_insn "altivec_lvsr_reg"
+(define_insn "altivec_lvsr_reg_"
   [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
(unspec:V16QI
-   [(match_operand:DI 1 "gpc_reg_operand" "b")]
+   [(match_operand:SDI 1 "gpc_reg_operand" "b")]
UNSPEC_LVSR_REG))]
   "TARGET_ALTIVEC"
   "lvsr %0,0,%1"
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index 859af75dfbd..4d8f660eea0 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -1940,3 +1940,9 @@ (define_predicate "d_form_memory"
 
   return !indexed_address (addr, mode);
 })
+
+;; Return true if TARGET_VSX for vec_set with variable index.
+(define_predicate "vec_set_index_operand"
+ (if_then_else (match_test "TARGET_VSX")
+  (match_operand 0 "reg_or_cint_operand")
+  (match_operand 0 "const_int_